Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service Guarantees
Boris Grot, The University of Texas at Austin
Technology Trends
[chart: transistor count vs. year of introduction, from the 4004 through the 286, 386, 486, 8086, Pentium, Pentium 4, Pentium D, Core i7, Xeon, and Nehalem-EX]
Technology Applications
Networks-on-Chip (NOCs)
- The backbone of highly integrated chips: transport of memory, operand, and control traffic
- Structured, packet-based, multi-hop networks
- Increasing importance with greater levels of integration
- Major impact on chip performance, energy, and area
  - TRIPS: 28% performance loss on SPEC 2K attributable to the NOC
  - Intel Polaris: 28% of chip power consumed in the NOC
- "Moving data is more expensive [energy-wise] than operating on it" (William Dally, SC '10)
On-chip vs. Off-chip Interconnects
[comparison along: topology, routing, flow control, pins, bandwidth, power, area]
Future NOC Requirements
- 100s to 1000s of network clients: cores, caches, accelerators, I/O ports, ...
- Efficient topologies: high performance, small footprint
- Intelligent routing: performance through better load balance
- Lightweight flow control: high performance, low buffer requirements
- Service guarantees: cloud computing and real-time apps demand QOS support
(Related publications: HPCA '09, HPCA '08, MICRO '09, one under submission)
Outline
- Introduction
- Service Guarantees in Networks-on-Chip
  - Motivation
  - Desiderata, prior work
  - Preemptive Virtual Clock
  - Evaluation highlights
- Efficient Topologies for On-chip Interconnects
- Kilo-NOC: A Network for 1000+ Nodes
- Summary and Future Work
Why On-chip Quality-of-Service?
- Shared on-chip resources (memory controllers, accelerators, the network-on-chip) require QOS support: fairness, service differentiation, performance isolation
- End-point QOS solutions are insufficient: data has to traverse the on-chip network
- Need QOS support at the interconnect level: hard guarantees in NOCs
NOC QOS Desiderata
- Fairness
- Isolation of flows
- Bandwidth efficiency
- Low overhead: delay, area, energy
Conventional QOS Disciplines
- Fixed schedule (example: Round Robin)
  - Pros: algorithmic and implementation simplicity
  - Cons: inefficient BW utilization; per-flow queuing
- Rate-based (example: Weighted Fair Queuing (WFQ) [SIGCOMM '89])
  - Pros: fine-grained scheduling; BW efficient
  - Cons: complex scheduling; per-flow queuing
- Frame-based (example: Rotating Combined Queuing (RCQ) [ISCA '96])
  - Pros: good throughput at modest complexity
  - Cons: throughput-complexity trade-off; per-flow queuing
- Common cost: per-flow queuing brings area, energy, and delay overheads plus scheduling complexity
Preemptive Virtual Clock (PVC) [HPCA '09]
Goal: a high-performance, cost-effective mechanism for fairness and service differentiation in NOCs.
- Full QOS support: fairness, prioritization, performance isolation
- Modest area and energy overhead: minimal buffering in routers and source nodes
- High performance: low latency, good BW efficiency
PVC: Scheduling
- Combines rate-based and frame-based features
- Rate-based side: evolved from Virtual Clock [SIGCOMM '90]
  - Routers track each flow's bandwidth consumption
  - Cheap priority computation: f(provisioned rate, consumed BW)
  - Problem: the history effect, where a flow's past consumption penalizes it indefinitely
- Framing: PVC's solution to the history effect
  - Fixed frame duration
  - Frame rollover clears all BW counters, resetting priorities
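The scheduling slides above can be sketched in software. This is an illustrative model, not the actual PVC hardware: the priority function shown (consumed BW normalized by provisioned rate, negated so that lighter consumers rank higher) and the class names are assumptions consistent with the slide's description of f(provisioned rate, consumed BW) and frame rollover.

```python
# Illustrative sketch of PVC scheduling: per-flow BW counters, a cheap
# rate-normalized priority, and frame rollover bounding the history
# effect. Priority function and names are assumptions, not the talk's.
class PvcScheduler:
    def __init__(self, provisioned_rate, frame_duration):
        self.rate = provisioned_rate          # flow -> provisioned share
        self.frame_duration = frame_duration  # cycles per frame
        self.consumed = {f: 0 for f in provisioned_rate}  # BW counters
        self.cycle = 0

    def record_flit(self, flow):
        self.consumed[flow] += 1              # router tracks consumption

    def priority(self, flow):
        # f(provisioned rate, consumed BW): lower consumed-to-provisioned
        # ratio -> higher priority (larger value wins arbitration).
        return -self.consumed[flow] / self.rate[flow]

    def tick(self):
        self.cycle += 1
        if self.cycle % self.frame_duration == 0:
            # Frame rollover: clear all BW counters, so priorities reset
            # and history is bounded to one frame.
            for f in self.consumed:
                self.consumed[f] = 0

sched = PvcScheduler({"A": 2, "B": 1}, frame_duration=1000)
for _ in range(6):
    sched.record_flit("B")
# B has consumed heavily against a small share, so A now outranks it.
assert sched.priority("A") > sched.priority("B")
```

Note how rollover, rather than any per-flow queue, is what forgets history: a flow that was greedy in frame N starts frame N+1 with a clean counter.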
PVC: Freedom from Priority Inversion
- PVC uses simple routers, without per-flow buffering and without BW reservation
- Problem: high-priority packets may be blocked by lower-priority packets (priority inversion)
- Solution: preemption of lower-priority packets
PVC: Preemption Recovery
- Retransmission of dropped packets: outstanding packets are buffered at the source node
- ACK/NACK protocol via a dedicated network
  - All packets acknowledged
  - Narrow, low-complexity network
  - Lower overhead than timeout-based recovery
- In a 64-node network, a 30-flit backup buffer per node suffices
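A minimal sketch of the source-side recovery described above, assuming the slide's mechanism: every injected packet is held in a small backup buffer until an ACK arrives on the dedicated network, and a NACK (preemption) triggers retransmission from the buffered copy. The class, method names, and stall policy are illustrative, not from the PVC implementation.

```python
# Sketch of source-side preemption recovery (illustrative names):
# packets stay in a bounded backup buffer until acknowledged, so a
# preempted packet can always be retransmitted from the source.
from collections import OrderedDict

class SourceNode:
    def __init__(self, backup_flits=30):
        self.capacity = backup_flits          # 30 flits suffices at 64 nodes
        self.outstanding = OrderedDict()      # packet id -> flit count

    def inject(self, pkt_id, flits):
        used = sum(self.outstanding.values())
        if used + flits > self.capacity:
            return False                      # stall until ACKs free space
        self.outstanding[pkt_id] = flits      # keep a copy for retransmit
        return True

    def on_ack(self, pkt_id):
        self.outstanding.pop(pkt_id, None)    # delivered; free the copy

    def on_nack(self, pkt_id):
        return pkt_id in self.outstanding     # preempted: copy available

node = SourceNode()
assert node.inject("p0", flits=4)
assert node.on_nack("p0")   # preempted in-network: data still buffered
node.on_ack("p0")           # eventually delivered and acknowledged
assert not node.outstanding
```

The bounded buffer is also what makes the scheme self-throttling: a source that keeps getting preempted runs out of backup space and stalls, limiting the energy wasted on retransmissions.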
PVC: Preemption Throttling
- Relaxed definition of priority inversion: reduces preemption frequency at a small fairness penalty
- Per-flow bandwidth reservation: flits within the reserved quota are non-preemptible; the reserved quota is a function of rate and frame size
- Coarsened priority classes: masking out the lower-order bits of each flow's BW counter induces coarser priority classes, enabling a fairness/throughput trade-off
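The coarsening knob above is a one-line operation. In this sketch (the function name and mask width are illustrative, not values from the talk), masking m low-order bits means two flows whose counters differ by less than 2^m fall into the same priority class and therefore cannot preempt each other:

```python
# Coarsened priority classes: drop the m low-order bits of a flow's
# BW counter. Wider masks -> fewer classes -> fewer preemptions, at
# the cost of coarser fairness. Mask width is a tunable assumption.
def priority_class(bw_counter: int, masked_bits: int) -> int:
    return bw_counter >> masked_bits

# With 4 masked bits, counters 17 and 30 share a class (17>>4 == 30>>4):
assert priority_class(17, 4) == priority_class(30, 4) == 1
# With no masking they differ, permitting a preemption:
assert priority_class(17, 0) != priority_class(30, 0)
```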
PVC: Guarantees
- Minimum bandwidth: based on the reserved quota
- Fairness: subject to BW counter resolution
- Worst-case latency: a packet that enters the source buffer in frame N is guaranteed delivery by the end of frame N+1
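The latency guarantee above implies a concrete bound: delivery by the end of frame N+1 means the worst case is at most two frame durations, achieved when the packet enters at the very start of frame N. A small sketch of that arithmetic (the helper and frame length are illustrative):

```python
# Worst-case latency under PVC's guarantee: a packet entering the
# source buffer in frame N is delivered by the end of frame N+1, so
# the bound in cycles never exceeds 2 * frame_duration.
def worst_case_latency(frame_duration: int, entry_cycle: int) -> int:
    n = entry_cycle // frame_duration        # frame containing the entry
    deadline = (n + 2) * frame_duration      # end of frame N+1
    return deadline - entry_cycle

assert worst_case_latency(1000, entry_cycle=0) == 2000    # worst case
assert worst_case_latency(1000, entry_cycle=999) == 1001  # late entry
```

This also shows the tension the frame size controls: shorter frames tighten the latency bound but reset bandwidth accounting more often.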
Performance Isolation
[results figure]
Performance Isolation: evaluated schemes
- Baseline NOC: no QOS support
- Globally Synchronized Frames (GSF) [J. Lee et al., ISCA 2008]
  - Frame-based scheme adapted for on-chip implementation
  - Source nodes enforce bandwidth quotas via self-throttling
  - Multiple frames in flight for performance; the network prioritizes packets based on frame number
- Preemptive Virtual Clock (PVC) at its highest fairness setting (unmasked bandwidth counters)
Performance Isolation
[results figure]
PVC Summary
- Full QOS support: fairness and service differentiation; strong performance isolation
- High performance: simple routers yield low latency; good bandwidth efficiency
- Modest area and energy overhead: 3.4 KB of storage per node (1.8x a no-QOS router); 12-20% extra energy per packet
- Will it scale to 1000 nodes?
Outline
- Introduction
- Service Guarantees in Networks-on-Chip
- Efficient Topologies for On-chip Interconnects
  - Mesh-based networks
  - Toward low-diameter topologies
  - Multidrop Express Channels
- Kilo-NOC: A Network for 1000+ Nodes
- Summary and Future Work
NOC Topologies
- Topology is the principal determinant of network performance, cost, and energy efficiency
- Topology desiderata
  - Rich connectivity: reduces router traversals
  - High bandwidth: reduces latency and contention
  - Low router complexity: reduces area and delay
- On-chip constraints
  - 2D substrates limit implementable topologies
  - Logic area and energy constrain the use of wire resources
  - Power constraints restrict routing choices
2-D Mesh
- Pros: low design and layout complexity; simple, fast routers
- Cons: large diameter, with energy and latency impact
Concentrated Mesh (Balfour & Dally, ICS '06)
- Pros
  - Multiple terminals at each node
  - Fast nearest-neighbor communication via the crossbar
  - Hop count reduction proportional to the concentration degree
- Cons: benefits limited by crossbar complexity
Flattened Butterfly (Kim et al., MICRO '07)
- Objectives: improve connectivity; exploit the wire budget
- Point-to-point links; nodes fully connected in each dimension
- Pros: excellent connectivity; low diameter (2 hops)
- Cons: high channel count (k^2/2 per row/column); low channel utilization; control complexity
Multidrop Express Channels (MECS) [Grot et al., MICRO '09]
- Objectives: rich connectivity; a more scalable channel count; better channel utilization
- Point-to-multipoint channels: a single source, multiple destinations
- At each drop point, a flit either propagates further or exits into a router
- Pros: one-to-many topology; low diameter (2 hops); only k channels per row/column
- Cons: I/O asymmetry; control complexity
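The channel-count claims above can be made concrete with a small comparison, using the figures quoted on the slides (flattened butterfly: k^2/2 channels per row/column; MECS: k channels per row/column) plus the standard k-1 nearest-neighbor links of a mesh row; the function is illustrative, not from the talk:

```python
# Per-row channel counts for a k-ary 2-D network, per the slides'
# figures. MECS grows linearly in k where the flattened butterfly
# grows quadratically, which is its scalability argument.
def channels_per_row(topology: str, k: int) -> int:
    if topology == "mesh":
        return k - 1          # nearest-neighbor links only
    if topology == "flattened_butterfly":
        return k * k // 2     # point-to-point, fully connected per row
    if topology == "mecs":
        return k              # one point-to-multipoint channel per node
    raise ValueError(topology)

for k in (4, 8, 16):
    print(k, channels_per_row("mesh", k),
          channels_per_row("flattened_butterfly", k),
          channels_per_row("mecs", k))
```

At k = 16 (a row of a 256-node network) the flattened butterfly needs 128 channels per row against MECS's 16, while both keep the 2-hop diameter.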
MECS Summary
- MECS: a novel one-to-many topology
  - Excellent connectivity
  - Effective wire utilization
  - A good fit for planar substrates
- Results summary
  - MECS: lowest latency, high energy efficiency
  - Mesh-based topologies: best throughput
  - Flattened butterfly: smallest router area
Outline
- Introduction
- Service Guarantees in Networks-on-Chip
- Efficient Topologies for On-chip Interconnects
- Kilo-NOC: A Network for 1000+ Nodes
  - Requirements and obstacles
  - Topology-centric Kilo-NOC architecture
  - Evaluation highlights
- Summary and Future Work
Scaling to a Kilo-node NOC
Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees.
- MECS scalability obstacles
  - Buffer requirements (more ports, deeper buffers): area, energy, and latency overheads
- PVC scalability obstacles
  - Flow state and other storage: area and energy overheads
  - Preemptions: energy and latency overheads
  - Prioritization and arbitration: latency overheads
- Kilo-NOC addresses both the topology and the QOS scalability bottlenecks; this talk focuses on reducing QOS overheads
NOC QOS: Conventional Approach
- Multiple virtual machines (VMs) share a die: shared resources (e.g., memory controllers) plus VM-private resources (cores, caches)
- NOC contention scenarios
  - Shared-resource accesses (e.g., memory access)
  - Intra-VM traffic (e.g., shared cache access)
  - Inter-VM traffic (e.g., VM page sharing)
- Desired: network-wide guarantees without network-wide QOS support
Kilo-NOC QOS: Topology-centric Approach
- Dedicated, QOS-enabled regions; the rest of the die is QOS-free
- A richly-connected topology (MECS) provides traffic isolation
- Special routing rules ensure interference freedom
Performance Isolation
[results figures: Stream traffic on a PVC-enabled MECS topology; MECS with and without network-wide PVC QOS]
Summary: Scaling NOCs to 1000+ Nodes
Objectives: good performance, high energy and area efficiency, service guarantees.
- MECS topology: a point-to-multipoint interconnect fabric whose rich connectivity improves performance and efficiency
- PVC QOS scheme: a preemptive architecture that reduces buffer requirements while providing strong guarantees and performance isolation
Summary: Scaling NOCs to 1000+ Nodes
- Topology-aware QOS architecture
  - Limits the extent of QOS support to a fraction of the die
  - Reduces network cost and improves performance
  - Enables efficiency-boosting optimizations in the QOS-free regions of the chip
- Kilo-NOC compared to MECS+PVC: 47% NOC area reduction; 26-53% NOC energy reduction
Acknowledgements
- Faculty: Steve Keckler (advisor), Doug Burger, Onur Mutlu, Emmett Witchel
- Collaborators: Paul Gratz, Joel Hestness
- Special thanks: the awesome CART group