Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service Guarantees
Boris Grot
The University of Texas at Austin
Technology Trends
[Figure: transistor count vs. year of introduction for Intel processors (Pentium, Pentium 4, Pentium D, Core i7, Xeon Nehalem-EX)]
Technology Applications
[Figure]
Networks-on-Chip (NOCs)
- The backbone of highly integrated chips
  - Transport of memory, operand, and control traffic
- Structured, packet-based, multi-hop networks
- Increasing importance with greater levels of integration
- Major impact on chip performance, energy, and area
  - TRIPS: 28% performance loss on SPEC CPU2000 due to the NOC
  - Intel Polaris: 28% of chip power consumed by the NOC
- "Moving data is more expensive [energy-wise] than operating on it" (William Dally, SC '10)
On-chip vs. Off-chip Interconnects
[Table: comparison along topology, routing, flow control, pins, bandwidth, power, and area]
Future NOC Requirements
- 100's to 1000's of network clients: cores, caches, accelerators, I/O ports, ...
- Efficient topologies: high performance, small footprint
- Intelligent routing: performance through better load balance
- Light-weight flow control: high performance, low buffer requirements
- Service guarantees: cloud computing and real-time apps demand QOS support
(Related contributions: HPCA '08, HPCA '09, MICRO '09, work under submission)
Outline
- Introduction
- Service Guarantees in Networks-on-Chip
  - Motivation
  - Desiderata, prior work
  - Preemptive Virtual Clock
  - Evaluation highlights
- Efficient Topologies for On-chip Interconnects
- Kilo-NOC: A Network for 1000 Nodes
- Summary and Future Work
Why On-chip Quality-of-Service?
- Shared on-chip resources: memory controllers, accelerators, network-on-chip, ...
- ... require QOS support: fairness, service differentiation, performance isolation
- End-point QOS solutions are insufficient: data still has to traverse the on-chip network
- Need QOS support at the interconnect level: hard guarantees in NOCs
NOC QOS Desiderata
- Fairness
- Isolation of flows
- Bandwidth efficiency
- Low overhead: delay, area, energy
Conventional QOS Disciplines
- Fixed schedule (example: Round Robin)
  - Pros: algorithmic and implementation simplicity
  - Cons: inefficient BW utilization; per-flow queuing
- Rate-based (example: Weighted Fair Queuing, WFQ [SIGCOMM '89])
  - Pros: fine-grained scheduling; BW efficient
  - Cons: complex scheduling; per-flow queuing
- Frame-based (example: Rotating Combined Queuing, RCQ [ISCA '96])
  - Pros: good throughput at modest complexity
  - Cons: throughput-complexity trade-off; per-flow queuing
- Common cost: per-flow queuing, which brings area, energy, and delay overheads plus scheduling complexity
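As a concrete illustration of the rate-based discipline above, the sketch below implements the classic WFQ virtual-finish-time idea. It is a software toy, not anything from the talk; the flow weights, packet sizes, and simplified virtual-time update are illustrative assumptions.

```python
# Minimal WFQ-style sketch: each arriving packet gets a virtual finish time,
# and the scheduler always sends the packet with the smallest finish time.
import heapq

class WFQScheduler:
    def __init__(self, weights):
        self.weights = weights                       # flow_id -> provisioned share
        self.last_finish = {f: 0.0 for f in weights}
        self.queue = []                              # (finish, seq, flow_id, size)
        self.virtual_time = 0.0
        self.seq = 0

    def enqueue(self, flow_id, size):
        start = max(self.virtual_time, self.last_finish[flow_id])
        finish = start + size / self.weights[flow_id]
        self.last_finish[flow_id] = finish
        heapq.heappush(self.queue, (finish, self.seq, flow_id, size))
        self.seq += 1

    def dequeue(self):
        finish, _, flow_id, size = heapq.heappop(self.queue)
        self.virtual_time = finish                   # simplified virtual-time update
        return flow_id, size

sched = WFQScheduler({"A": 2.0, "B": 1.0})           # A is provisioned 2x B's rate
for _ in range(3):
    sched.enqueue("A", 1.0)
    sched.enqueue("B", 1.0)
print([sched.dequeue()[0] for _ in range(6)])        # A drains roughly 2x as fast as B
```

Note that even this toy keeps per-flow state (last finish times and queued packets), which is exactly the overhead the slide calls out.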
Preemptive Virtual Clock (PVC) [HPCA '09]
- Goal: a high-performance, cost-effective mechanism for fairness and service differentiation in NOCs
- Full QOS support: fairness, prioritization, performance isolation
- Modest area and energy overhead: minimal buffering in routers and source nodes
- High performance: low latency, good BW efficiency
PVC: Scheduling
- Combines rate-based and frame-based features
- Rate-based: evolved from Virtual Clock [SIGCOMM '90]
  - Routers track each flow's bandwidth consumption
  - Cheap priority computation: f(provisioned rate, consumed BW)
  - Problem: history effect, where past bandwidth consumption keeps influencing current priorities
- Framing: PVC's solution to the history effect
  - Fixed frame duration
  - Frame rollover clears all BW counters and resets priorities
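A behavioral sketch of the scheduling described above may help: per-flow bandwidth counters, a priority derived from provisioned rate minus consumed bandwidth, and a frame rollover that clears the counters. The frame length, counter handling, and exact priority function below are placeholders, not the thesis's hardware design.

```python
# Behavioral sketch of PVC-style scheduling at one router (not the actual RTL).
FRAME_FLITS = 1000          # placeholder frame length, in flit times

class PVCRouterState:
    def __init__(self, rates):
        self.rates = rates                      # flow_id -> provisioned share (flits/frame)
        self.consumed = {f: 0 for f in rates}   # per-frame bandwidth counters
        self.frame_flits = 0

    def priority(self, flow_id):
        # Higher value = higher priority: flows that are behind their
        # provisioned rate win arbitration.
        return self.rates[flow_id] - self.consumed[flow_id]

    def forward_flit(self, flow_id):
        self.consumed[flow_id] += 1
        self.frame_flits += 1
        if self.frame_flits >= FRAME_FLITS:     # frame rollover
            self.consumed = {f: 0 for f in self.consumed}   # clear history
            self.frame_flits = 0

    def arbitrate(self, competing_flows):
        # Ties broken by lower flow id, purely for determinism in this sketch.
        return max(competing_flows, key=lambda f: (self.priority(f), -f))

router = PVCRouterState({0: 600, 1: 300, 2: 100})
router.consumed.update({0: 500, 1: 100, 2: 0})   # pretend traffic already forwarded this frame
print(router.arbitrate([0, 1, 2]))               # flow 1 is furthest below its share -> wins
```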
PVC: Freedom from Priority Inversion
- PVC: simple routers without per-flow buffering and no BW reservation
- Problem: high-priority packets may be blocked by lower-priority packets (priority inversion)
- Solution: preemption of lower-priority packets
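The preemption rule can be sketched as a single check, roughly as described on this slide; the integer priorities and the function name are illustrative assumptions, not PVC's actual encoding.

```python
# Sketch of the preemption rule: when a packet with strictly higher priority
# is blocked behind a buffered lower-priority packet, the blocker is dropped
# and will be NACKed back to its source for retransmission.
def resolve_blocking(incoming_priority, buffered_priority):
    """Return 'wait' or 'preempt' for an incoming packet that found the
    downstream buffer occupied by another flow's packet."""
    if incoming_priority > buffered_priority:
        return "preempt"     # drop the blocker, NACK its source
    return "wait"            # no priority inversion, keep waiting

assert resolve_blocking(incoming_priority=5, buffered_priority=2) == "preempt"
assert resolve_blocking(incoming_priority=2, buffered_priority=5) == "wait"
```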
PVC: Preemption Recovery
- Retransmission of dropped packets
  - Buffer outstanding packets at the source node
- ACK/NACK protocol via a dedicated network
  - All packets acknowledged
  - Narrow, low-complexity network
  - Lower overhead than timeout-based recovery
- 64-node network: a 30-flit backup buffer per node suffices
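A minimal software sketch of the recovery path described above: a small source-side backup buffer holds outstanding packets until an ACK arrives, and a NACK triggers retransmission. The buffer capacity, packet representation, and `inject` callback are assumptions for illustration; the talk only states that about 30 flits of backup storage per node suffice at 64 nodes.

```python
# Sketch of PVC-style preemption recovery at a source node.
from collections import OrderedDict

class SourceBackupBuffer:
    def __init__(self, capacity_packets=8):
        self.capacity = capacity_packets
        self.outstanding = OrderedDict()      # packet_id -> payload

    def send(self, packet_id, payload, inject):
        if len(self.outstanding) >= self.capacity:
            raise RuntimeError("backup buffer full: stall injection")
        self.outstanding[packet_id] = payload
        inject(packet_id, payload)            # hand the packet to the NOC

    def on_ack(self, packet_id):
        self.outstanding.pop(packet_id, None) # safe to free the backup copy

    def on_nack(self, packet_id, inject):
        inject(packet_id, self.outstanding[packet_id])   # retransmit dropped packet

sent = []
buf = SourceBackupBuffer()
buf.send(1, "read-reply", inject=lambda pid, p: sent.append(pid))
buf.on_nack(1, inject=lambda pid, p: sent.append(pid))   # packet 1 was preempted
buf.on_ack(1)
print(sent)   # [1, 1]: original injection plus one retransmission
```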
PVC: Preemption Throttling
- Relaxed definition of priority inversion: reduces preemption frequency at a small fairness penalty
- Per-flow bandwidth reservation
  - Flits within the reserved quota are non-preemptible
  - Reserved quota is a function of rate and frame size
- Coarsened priority classes
  - Mask out lower-order bits of each flow's BW counter
  - Enables a fairness/throughput trade-off
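The two throttling knobs can be sketched as follows, extending the earlier preemption check: reserved-quota flits are never preempted, and masking low-order counter bits coarsens priority classes. The mask width and quota handling are placeholder assumptions.

```python
# Sketch of the throttling knobs on this slide (illustrative parameters only).
PRIORITY_MASK_BITS = 4      # larger mask -> coarser classes, fewer preemptions

def priority_class(bw_counter, mask_bits=PRIORITY_MASK_BITS):
    # Lower consumed bandwidth -> higher priority; dropping low-order bits
    # puts flows with nearly equal counters into the same class.
    return -(bw_counter >> mask_bits)

def may_preempt(incoming_counter, blocker_counter, blocker_within_quota):
    if blocker_within_quota:
        return False         # reserved-quota flits are never preempted
    return priority_class(incoming_counter) > priority_class(blocker_counter)

# Flows whose counters differ by less than 2**mask_bits land in the same class:
print(may_preempt(incoming_counter=3, blocker_counter=12, blocker_within_quota=False))   # False
print(may_preempt(incoming_counter=3, blocker_counter=200, blocker_within_quota=False))  # True
print(may_preempt(incoming_counter=3, blocker_counter=200, blocker_within_quota=True))   # False
```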
PVC: Guarantees
- Minimum bandwidth: based on the reserved quota
- Fairness: subject to BW counter resolution
- Worst-case latency: a packet entering its source buffer in frame N is guaranteed delivery by the end of frame N+1
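The bounds above can be illustrated with back-of-the-envelope arithmetic; every number below (frame length, link width, clock, reserved quota) is a made-up example rather than a value from the thesis.

```python
# Back-of-the-envelope sketch of the PVC guarantees (placeholder numbers).
frame_flits    = 1000        # frame length in flit times
flit_bytes     = 16          # link width
cycle_ns       = 0.5         # 2 GHz clock
reserved_quota = 50          # flits per frame reserved for one flow

# Minimum bandwidth: the reserved quota is deliverable every frame.
min_bw_bytes_per_s = reserved_quota * flit_bytes / (frame_flits * cycle_ns * 1e-9)

# Worst-case latency: a packet entering its source buffer during frame N is
# delivered by the end of frame N+1, i.e. within two frame lengths.
worst_case_latency_ns = 2 * frame_flits * cycle_ns

print(f"min bandwidth  ~ {min_bw_bytes_per_s / 1e9:.2f} GB/s")
print(f"worst-case lat ~ {worst_case_latency_ns:.0f} ns")
```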
Performance Isolation
- Baseline NOC: no QOS support
- Globally Synchronized Frames (GSF) [J. Lee et al., ISCA 2008]
  - Frame-based scheme adapted for on-chip implementation
  - Source nodes enforce bandwidth quotas via self-throttling
  - Multiple frames in flight for performance
  - Network prioritizes packets based on frame number
- Preemptive Virtual Clock (PVC)
  - Highest fairness setting (unmasked bandwidth counters)
Performance Isolation
[Figure]
PVC Summary
- Full QOS support
  - Fairness and service differentiation
  - Strong performance isolation
- High performance
  - Simple routers, low latency
  - Good bandwidth efficiency
- Modest area and energy overhead
  - 3.4 KB of storage per node (1.8x a no-QOS router)
  - 12-20% extra energy per packet
- Will it scale to 1000 nodes?
Outline
- Introduction
- Service Guarantees in Networks-on-Chip
- Efficient Topologies for On-chip Interconnects
  - Mesh-based networks
  - Toward low-diameter topologies
  - Multidrop Express Channels
- Kilo-NOC: A Network for 1000 Nodes
- Summary and Future Work
NOC Topologies
- Topology is the principal determinant of network performance, cost, and energy efficiency
- Topology desiderata
  - Rich connectivity: reduces router traversals
  - High bandwidth: reduces latency and contention
  - Low router complexity: reduces area and delay
- On-chip constraints
  - 2D substrates limit implementable topologies
  - Logic area/energy constrains the use of wire resources
  - Power constraints restrict routing choices
2-D Mesh
- Pros
  - Low design and layout complexity
  - Simple, fast routers
- Cons
  - Large diameter
  - Energy and latency impact
Concentrated Mesh (Balfour & Dally, ICS '06)
- Pros
  - Multiple terminals at each node
  - Fast nearest-neighbor communication via the crossbar
  - Hop count reduction proportional to the concentration degree
- Cons
  - Benefits limited by crossbar complexity
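The hop-count argument behind the mesh and concentrated-mesh slides can be made concrete with a few lines of arithmetic; the node counts and concentration degree below are illustrative.

```python
# Hop-count arithmetic: a k x k mesh has diameter 2*(k-1), and concentrating
# c terminals per router shrinks the router grid and therefore the hop count.
def mesh_diameter(nodes, concentration=1):
    routers = nodes // concentration          # terminals share a router
    k = int(round(routers ** 0.5))            # k x k router grid
    return 2 * (k - 1)                        # max hops, corner to corner

for n in (64, 256, 1024):
    print(f"{n:5d} nodes: mesh diameter {mesh_diameter(n):2d} hops, "
          f"conc-4 mesh diameter {mesh_diameter(n, concentration=4):2d} hops")
```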
Flattened Butterfly (Kim et al., MICRO '07)
- Objectives: improve connectivity, exploit the wire budget
- Point-to-point links; nodes fully connected in each dimension
- Pros
  - Excellent connectivity
  - Low diameter: 2 hops
- Cons
  - High channel count: k^2/2 per row/column
  - Low channel utilization
  - Control complexity
Multidrop Express Channels (MECS) [Grot et al., MICRO '09]
- Objectives: connectivity, more scalable channel count, better channel utilization
- Point-to-multipoint channels: single source, multiple destinations
  - At each drop point, a flit either propagates further or exits into the router
- Pros
  - One-to-many topology
  - Low diameter: 2 hops
  - k channels per row/column
- Cons
  - I/O asymmetry
  - Control complexity
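The channel-count contrast between the flattened butterfly and MECS follows from simple counting; the sketch below reproduces the trend under the slides' counting convention (all-pairs point-to-point links for the flattened butterfly, one multidrop channel per source for MECS) and is only illustrative.

```python
# Channel scaling per row of k routers: the flattened butterfly needs a link
# for every router pair (~k^2/2), while MECS needs on the order of k multidrop
# channels. Both reach any node in at most 2 hops (one row hop, one column hop).
# Exact constants depend on how directions are counted; this shows the trend.
def fbfly_channels_per_row(k):
    return k * (k - 1) // 2          # all-pairs point-to-point links

def mecs_channels_per_row(k):
    return k                         # one multidrop channel per source (slide's count)

for k in (4, 8, 16):
    print(f"k={k:2d}: flattened butterfly {fbfly_channels_per_row(k):3d}, "
          f"MECS {mecs_channels_per_row(k):3d} channels per row")
```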
MECS Summary
- MECS: a novel one-to-many topology
  - Excellent connectivity
  - Effective wire utilization
  - Good fit for planar substrates
- Results summary
  - MECS: lowest latency, high energy efficiency
  - Mesh-based topologies: best throughput
  - Flattened butterfly: smallest router area
Outline
- Introduction
- Service Guarantees in Networks-on-Chip
- Efficient Topologies for On-chip Interconnects
- Kilo-NOC: A Network for 1000 Nodes
  - Requirements and obstacles
  - Topology-centric Kilo-NOC architecture
  - Evaluation highlights
- Summary and Future Work
Scaling to a Kilo-Node NOC
- Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees
- MECS scalability obstacles
  - Buffer requirements: more ports, deeper buffers (area, energy, latency overheads)
- PVC scalability obstacles
  - Flow state and other storage (area, energy overheads)
  - Preemption overheads (energy, latency)
  - Prioritization and arbitration (latency overheads)
- Kilo-NOC: addresses both topology and QOS scalability bottlenecks
- This talk: reducing QOS overheads
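To see why per-flow state becomes a scalability obstacle, a rough storage estimate helps; the counter width and the one-flow-per-node assumption below are placeholders, not the thesis's parameters.

```python
# Rough arithmetic: if every QOS router keeps a bandwidth counter per flow
# (one flow per node), per-router storage grows linearly with node count and
# chip-wide storage grows quadratically.
def per_router_flow_state_bytes(nodes, counter_bits=16):
    return nodes * counter_bits // 8

for nodes in (64, 256, 1024):
    per_router = per_router_flow_state_bytes(nodes)
    total_kib = per_router * nodes / 1024          # assuming one router per node
    print(f"{nodes:5d} nodes: {per_router:5d} B of flow state per router, "
          f"~{total_kib:8.0f} KiB across the chip")
```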
NOC QOS: Conventional Approach
- Multiple virtual machines (VMs) sharing a die
  - Shared resources (e.g., memory controllers)
  - VM-private resources (cores, caches)
- NOC contention scenarios
  - Shared-resource accesses (e.g., memory access)
  - Intra-VM traffic (e.g., shared cache access)
  - Inter-VM traffic (e.g., VM page sharing)
- Conventional approach: QOS support throughout the network
- Question: can we get network-wide guarantees without network-wide QOS support?
Kilo-NOC QOS: Topology-centric Approach
- Dedicated, QOS-enabled regions
  - Rest of the die: QOS-free
- A richly-connected topology (MECS)
  - Traffic isolation
- Special routing rules
  - Ensure interference freedom
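One way to picture the routing rule is as a check over the routers a packet actually enters: with MECS express channels those are only the source, the turn, and the destination, so placing the turn in the packet's own VM or in a QOS-enabled column confines inter-VM interference to the QOS regions. The layout, VM map, and rule below are simplified assumptions for illustration, not the thesis's exact mechanism.

```python
# Simplified sketch of a topology-aware interference check (assumed layout:
# shared resources and QOS routers live in column 7; left half of the die is
# VM0, right half is VM1).
QOS_COLUMNS = {7}

def routers_entered(src, dst):
    """Row-then-column route over MECS express channels: intermediate drop
    points are bypassed, so only source, turn, and destination routers are
    actually entered."""
    turn = (dst[0], src[1])          # travel the row first, then the column
    return {src, turn, dst}

def route_is_interference_free(src, dst, vm_of):
    src_vm = vm_of(*src)
    return all(x in QOS_COLUMNS or vm_of(x, y) == src_vm
               for x, y in routers_entered(src, dst))

vm_of = lambda x, y: 0 if x < 4 else 1           # placeholder VM map
print(route_is_interference_free((0, 0), (3, 3), vm_of))   # True: stays inside VM0
print(route_is_interference_free((0, 0), (7, 2), vm_of))   # True: turn lands in QOS column 7
```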
Performance Isolation
- PVC-enabled MECS topology (stream workload)
[Figure]
Performance Isolation
- MECS topology, with and without network-wide PVC QOS
[Figure]
Summary: Scaling NOCs to 1000 Nodes
- Objectives: good performance, high energy and area efficiency, service guarantees
- MECS topology
  - Point-to-multipoint interconnect fabric
  - Rich connectivity improves performance and efficiency
- PVC QOS scheme
  - Preemptive architecture reduces buffer requirements
  - Strong guarantees, performance isolation
Summary: Scaling NOCs to 1000 Nodes
- Topology-aware QOS architecture
  - Limits the extent of QOS support to a fraction of the die
  - Reduces network cost, improves performance
  - Enables efficiency-boosting optimizations in QOS-free regions of the chip
- Kilo-NOC compared to MECS+PVC:
  - NOC area reduction of 47%
  - NOC energy reduction of 26-53%
Acknowledgements
- Faculty: Steve Keckler (advisor), Doug Burger, Onur Mutlu, Emmett Witchel
- Collaborators: Paul Gratz, Joel Hestness
- Special thanks: the awesome CART group