
1 Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service Guarantees
Boris Grot, The University of Texas at Austin

2 Technology Trends
[Chart: transistor count vs. year of introduction, from the 4004 and 8086 through the 286, 386, 486, Pentium, Pentium 4, Pentium D, Xeon, Core i7, and Nehalem-EX]

3 Technology Applications

4 Networks-on-Chip (NOCs)
- The backbone of highly integrated chips
  - Transport of memory, operand, and control traffic
  - Structured, packet-based, multi-hop networks
- Increasing importance with greater levels of integration
- Major impact on chip performance, energy, and area
  - TRIPS: 28% performance loss on SPEC 2K in the NOC
  - Intel Polaris: 28% of chip power consumed in the NOC
"Moving data is more expensive [energy-wise] than operating on it." (William Dally, SC '10)

5 On-chip vs. Off-chip Interconnects
- Topology
- Routing
- Flow control
- Pins
- Bandwidth
- Power
- Area

6 Future NOC Requirements
- 100s to 1000s of network clients
  - Cores, caches, accelerators, I/O ports, ...
- Efficient topologies
  - High performance, small footprint
- Intelligent routing
  - Performance through better load balance
- Lightweight flow control
  - High performance, low buffer requirements
- Service guarantees
  - Cloud computing and real-time apps demand QOS support
(Related publications: HPCA '08, HPCA '09, MICRO '09, and work under submission)

7 Outline
- Introduction
- Service Guarantees in Networks-on-Chip
  - Motivation
  - Desiderata, prior work
  - Preemptive Virtual Clock
  - Evaluation highlights
- Efficient Topologies for On-chip Interconnects
- Kilo-NOC: A Network for 1000+ Nodes
- Summary and Future Work

8 Why On-chip Quality-of-Service?
- Shared on-chip resources ...
  - Memory controllers, accelerators, the network-on-chip itself
- ... require QOS support
  - Fairness, service differentiation, performance isolation
- End-point QOS solutions are insufficient
  - Data still has to traverse the on-chip network
- Need QOS support at the interconnect level: hard guarantees in NOCs

9 NOC QOS Desiderata
- Fairness
- Isolation of flows
- Bandwidth efficiency
- Low overhead:
  - delay
  - area
  - energy

10 Conventional QOS Disciplines
- Fixed schedule
  - Pros: algorithmic and implementation simplicity
  - Cons: inefficient BW utilization; per-flow queuing
  - Example: Round Robin
- Rate-based
  - Pros: fine-grained scheduling; BW efficient
  - Cons: complex scheduling; per-flow queuing
  - Example: Weighted Fair Queuing (WFQ) [SIGCOMM '89]
- Frame-based
  - Pros: good throughput at modest complexity
  - Cons: throughput-complexity trade-off; per-flow queuing
  - Example: Rotating Combined Queuing (RCQ) [ISCA '96]
Common cost: per-flow queuing, which brings area, energy, and delay overheads plus scheduling complexity
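To make the rate-based discipline concrete, here is a minimal sketch of the WFQ virtual-finish-time rule (the textbook formulation of [SIGCOMM '89], not code from this talk): each packet is stamped with a finish time derived from its length and its flow's weight, and the scheduler always serves the packet with the smallest stamp.

```python
def wfq_finish_time(prev_finish, virtual_time, length, weight):
    """Textbook WFQ finish-time stamp (illustrative sketch).

    A packet's finish time picks up where the flow's previous packet
    left off (or at the current virtual time, if the flow was idle)
    and advances by length/weight: heavier-weighted flows advance
    more slowly, so they are served more often.
    """
    return max(prev_finish, virtual_time) + length / weight
```

Note that computing this stamp is cheap; the overhead the slide flags comes from keeping a separate queue per flow so the scheduler can always find each flow's head packet.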

11 Preemptive Virtual Clock (PVC) [HPCA '09]
- Goal: a high-performance, cost-effective mechanism for fairness and service differentiation in NOCs
- Full QOS support
  - Fairness, prioritization, performance isolation
- Modest area and energy overhead
  - Minimal buffering in routers and source nodes
- High performance
  - Low latency, good BW efficiency

12 PVC: Scheduling
- Combines rate-based and frame-based features
- Rate-based: evolved from Virtual Clock [SIGCOMM '90]
  - Routers track each flow's bandwidth consumption
  - Cheap priority computation: f(provisioned rate, consumed BW)
  - Problem: the history effect
[Figure: bandwidth consumption of Flow X over time]

13 PVC: Scheduling
- Combines rate-based and frame-based features
- Rate-based: evolved from Virtual Clock [SIGCOMM '90]
  - Routers track each flow's bandwidth consumption
  - Cheap priority computation: f(provisioned rate, consumed BW)
  - Problem: the history effect
- Framing: PVC's solution to the history effect
  - Frame rollover clears all BW counters
  - Fixed frame duration
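The slides above can be sketched in a few lines. The slide only says priority is some function f(provisioned rate, consumed BW); the ratio used below, and all the names, are illustrative assumptions, not the paper's exact formulation.

```python
class PVCFlowState:
    """Per-flow state a PVC router tracks (sketch; names illustrative)."""

    def __init__(self, provisioned_rate):
        self.provisioned_rate = provisioned_rate  # flits allotted per frame
        self.consumed = 0                         # flits sent this frame

    def priority(self):
        # Cheap priority computation from the slide, here assumed to be
        # consumed/rate: flows furthest under their provisioned rate win
        # arbitration. Smaller value = higher priority.
        return self.consumed / self.provisioned_rate

def frame_rollover(flows):
    # Frame rollover clears all BW counters, bounding the history
    # effect: consumption only depresses a flow's priority within
    # the current fixed-duration frame.
    for f in flows:
        f.consumed = 0
```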

14 PVC: Scheduling
- Combines rate-based and frame-based features
- Rate-based: evolved from Virtual Clock [SIGCOMM '90]
  - Routers track each flow's bandwidth consumption
  - Cheap priority computation: f(provisioned rate, consumed BW)
  - Problem: the history effect
[Figure: at the frame rollover, Flow X's BW counters reset and priorities reset]

15 PVC: Freedom from Priority Inversion
- PVC: simple routers without per-flow buffering and no BW reservation
- Problem: high-priority packets may be blocked by lower-priority packets (priority inversion)

16 PVC: Freedom from Priority Inversion
- PVC: simple routers without per-flow buffering and no BW reservation
- Problem: high-priority packets may be blocked by lower-priority packets (priority inversion)
- Solution: preemption (dropping) of lower-priority packets

17 PVC: Preemption Recovery
- Retransmission of dropped packets
  - Buffer outstanding packets at the source node
- ACK/NACK protocol via a dedicated network
  - All packets acknowledged
  - Narrow, low-complexity network
  - Lower overhead than timeout-based recovery
- 64-node network: a 30-flit backup buffer per node suffices
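The recovery protocol above can be sketched as a source-side backup buffer: a packet stays buffered until an ACK retires it, and a NACK (preemption notice) finds the packet still available for retransmission. This is a minimal sketch under the slide's 30-flit sizing; all class and method names are illustrative.

```python
from collections import OrderedDict

class SourceBackupBuffer:
    """Sketch of PVC-style source-side preemption recovery."""

    def __init__(self, capacity_flits=30):  # 30 flits suffices at 64 nodes
        self.capacity = capacity_flits
        self.outstanding = OrderedDict()    # packet_id -> flit count
        self.used = 0

    def send(self, pid, flits):
        if self.used + flits > self.capacity:
            return False                    # backup buffer full: stall injection
        self.outstanding[pid] = flits       # keep a copy until ACKed
        self.used += flits
        return True                         # packet injected into the network

    def ack(self, pid):
        # Delivery confirmed over the dedicated ACK/NACK network:
        # the backup copy can be freed.
        self.used -= self.outstanding.pop(pid)

    def nack(self, pid):
        # Packet was preempted (dropped) in-flight: retransmit the
        # buffered copy; it stays outstanding until eventually ACKed.
        return self.outstanding[pid]
```

Because every packet is acknowledged, the buffer drains promptly and no timeout machinery is needed, which is why the slide calls this cheaper than timeout-based recovery.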

18 PVC: Preemption Throttling
- Relaxed definition of priority inversion
  - Reduces preemption frequency
  - Small fairness penalty
- Per-flow bandwidth reservation
  - Flits within the reserved quota are non-preemptible
  - Reserved quota is a function of rate and frame size
- Coarsened priority classes
  - Mask out the lower-order bits of each flow's BW counter
  - Induces coarser priority classes
  - Enables a fairness/throughput trade-off
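The two throttling knobs above are simple enough to sketch directly. Masking is shown here as a right shift of the bandwidth counter, and the quota check as a plain comparison; both are my reading of the slide, and the function names are illustrative.

```python
def coarsened_priority(bw_counter, mask_bits):
    # Masking out the lower-order bits of a flow's BW counter collapses
    # nearby consumption values into one priority class, so small
    # differences no longer trigger preemptions: coarser fairness,
    # fewer preemptions (the fairness/throughput trade-off).
    return bw_counter >> mask_bits

def is_preemptible(consumed_flits, reserved_quota):
    # Flits within the per-flow reserved quota (a function of the
    # flow's rate and the frame size) are non-preemptible.
    return consumed_flits > reserved_quota
```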

19 PVC: Guarantees
- Minimum bandwidth
  - Based on the reserved quota
- Fairness
  - Subject to BW counter resolution
- Worst-case latency
  - A packet entering its source buffer in frame N is guaranteed delivery by the end of frame N+1

20 Performance Isolation

21 Performance Isolation
- Baseline NOC
  - No QOS support
- Globally Synchronized Frames (GSF) [J. Lee et al., ISCA 2008]
  - Frame-based scheme adapted for on-chip implementation
  - Source nodes enforce bandwidth quotas via self-throttling
  - Multiple frames in flight for performance
  - Network prioritizes packets based on frame number
- Preemptive Virtual Clock (PVC)
  - Highest fairness setting (unmasked bandwidth counters)

22 Performance Isolation

23 PVC Summary
- Full QOS support
  - Fairness and service differentiation
  - Strong performance isolation
- High performance
  - Simple routers, hence low latency
  - Good bandwidth efficiency
- Modest area and energy overhead
  - 3.4 KB of storage per node (1.8x a no-QOS router)
  - 12-20% extra energy per packet

24 PVC Summary (recap): Will it scale to 1000 nodes?

25 Outline
- Introduction
- Service Guarantees in Networks-on-Chip
- Efficient Topologies for On-chip Interconnects
  - Mesh-based networks
  - Toward low-diameter topologies
  - Multidrop Express Channels
- Kilo-NOC: A Network for 1000+ Nodes
- Summary and Future Work

26 NOC Topologies
- Topology is the principal determinant of network performance, cost, and energy efficiency
- Topology desiderata
  - Rich connectivity: reduces router traversals
  - High bandwidth: reduces latency and contention
  - Low router complexity: reduces area and delay
- On-chip constraints
  - 2D substrates limit the implementable topologies
  - Logic area/energy constrains the use of wire resources
  - Power constraints restrict routing choices

27 2-D Mesh
- Pros
  - Low design and layout complexity
  - Simple, fast routers

28 2-D Mesh
- Pros
  - Low design and layout complexity
  - Simple, fast routers
- Cons
  - Large diameter
  - Energy and latency impact

29 Concentrated Mesh (Balfour & Dally, ICS '06)
- Pros
  - Multiple terminals at each node
  - Fast nearest-neighbor communication via the crossbar
  - Hop count reduction proportional to the concentration degree
- Cons
  - Benefits limited by crossbar complexity

30 Flattened Butterfly (Kim et al., MICRO '07)
- Objectives:
  - Improve connectivity
  - Exploit the wire budget

31 Flattened Butterfly
- Point-to-point links
- Nodes fully connected in each dimension

32 Flattened Butterfly
- Pros
  - Excellent connectivity
  - Low diameter: 2 hops
- Cons
  - High channel count: k²/2 per row/column
  - Low channel utilization
  - Control complexity

33 Multidrop Express Channels (MECS) [Grot et al., MICRO '09]
- Objectives:
  - Connectivity
  - More scalable channel count
  - Better channel utilization

34 Multidrop Express Channels (MECS)
- Point-to-multipoint channels
  - Single source
  - Multiple destinations
- Drop points:
  - Propagate further, OR
  - Exit into a router

35 Multidrop Express Channels (MECS)

36 Multidrop Express Channels (MECS)
- Pros
  - One-to-many topology
  - Low diameter: 2 hops
  - k channels per row/column
- Cons
  - I/O asymmetry
  - Control complexity
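The channel counts quoted on the last few slides can be checked with a little arithmetic. This sketch compares a k x k mesh's diameter with the per-row channel counts of the flattened butterfly (~k²/2, i.e. k(k-1)/2 point-to-point links to fully connect a row) and MECS (k channels per row, as given on the slide); function names are mine.

```python
def mesh_diameter(k):
    # Worst case in a k x k mesh: corner to corner,
    # (k-1) hops in each dimension.
    return 2 * (k - 1)

def fbfly_channels_per_row(k):
    # Flattened butterfly fully connects the k nodes of a row with
    # point-to-point links: k*(k-1)/2 channels, ~k^2/2 as on the slide.
    return k * (k - 1) // 2

def mecs_channels_per_row(k):
    # MECS replaces the point-to-point links with point-to-multipoint
    # channels, leaving k channels per row (the slide's count) while
    # keeping the 2-hop diameter.
    return k
```

The contrast is the point of the MECS slides: both low-diameter topologies reach any node in 2 hops, but the flattened butterfly's channel count grows quadratically with k while MECS grows linearly.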

37 MECS Summary
- MECS: a novel one-to-many topology
  - Excellent connectivity
  - Effective wire utilization
  - Good fit for planar substrates
- Results summary
  - MECS: lowest latency, high energy efficiency
  - Mesh-based topologies: best throughput
  - Flattened butterfly: smallest router area

38 Outline
- Introduction
- Service Guarantees in Networks-on-Chip
- Efficient Topologies for On-chip Interconnects
- Kilo-NOC: A Network for 1000+ Nodes
  - Requirements and obstacles
  - Topology-centric Kilo-NOC architecture
  - Evaluation highlights
- Summary and Future Work

39 Scaling to a Kilo-Node NOC
- Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees
- MECS scalability obstacles
  - Buffer requirements: more ports, deeper buffers (area, energy, and latency overheads)
- PVC scalability obstacles
  - Flow state and other storage (area and energy overheads)
  - Preemption overheads (energy and latency)
  - Prioritization and arbitration (latency overheads)

40 Scaling to a Kilo-Node NOC (cont.)
- Kilo-NOC addresses both the topology and QOS scalability bottlenecks
- This talk: reducing QOS overheads

41 NOC QOS: Conventional Approach
- Multiple virtual machines (VMs) sharing a die
- Shared resources (e.g., memory controllers)
- VM-private resources (cores, caches)

42 NOC QOS: Conventional Approach
- NOC contention scenarios:
  - Shared-resource accesses (memory access)
  - Intra-VM traffic (shared cache access)
  - Inter-VM traffic (VM page sharing)

45 NOC QOS: Conventional Approach
- NOC contention scenarios:
  - Shared-resource accesses (memory access)
  - Intra-VM traffic (shared cache access)
  - Inter-VM traffic (VM page sharing)
- Goal: network-wide guarantees without network-wide QOS support

46 Kilo-NOC QOS: Topology-centric Approach
- Dedicated, QOS-enabled regions
  - Rest of the die: QOS-free
- A richly connected topology (MECS)
  - Traffic isolation
- Special routing rules
  - Ensure freedom from interference
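The routing rule behind this approach can be sketched as a single predicate. This is my reading of the slide, not the paper's exact rule, and the function and parameter names are illustrative: traffic that touches a shared resource or crosses VM boundaries is steered through a QOS-enabled region, while traffic confined to one VM's own nodes never needs it, so the QOS-free bulk of the die stays interference-free by construction.

```python
def needs_qos_region(src_vm, dst_vm, dst_is_shared):
    # Sketch of a topology-centric routing rule: only flows that can
    # contend with another VM (shared-resource accesses or inter-VM
    # traffic) are routed through a dedicated QOS-enabled region.
    # Intra-VM traffic stays entirely within its VM's QOS-free nodes.
    return dst_is_shared or src_vm != dst_vm
```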


51 Performance Isolation
[Results chart: Stream traffic on a PVC-enabled MECS topology]

52 Performance Isolation
[Results chart: MECS topology with and without network-wide PVC QOS]

53 Performance Isolation

54 Summary: Scaling NOCs to 1000+ Nodes
- Objectives: good performance, high energy and area efficiency, service guarantees
- MECS topology
  - Point-to-multipoint interconnect fabric
  - Rich connectivity improves performance and efficiency
- PVC QOS scheme
  - Preemptive architecture reduces buffer requirements
  - Strong guarantees, performance isolation

55 Summary: Scaling NOCs to 1000+ Nodes
- Topology-aware QOS architecture
  - Limits the extent of QOS support to a fraction of the die
  - Reduces network cost, improves performance
  - Enables efficiency-boosting optimizations in the QOS-free regions of the chip
- Kilo-NOC compared to MECS+PVC:
  - NOC area reduction of 47%
  - NOC energy reduction of 26-53%

56 Acknowledgements
- Faculty
  - Steve Keckler (advisor)
  - Doug Burger
  - Onur Mutlu
  - Emmett Witchel
- Collaborators
  - Paul Gratz
  - Joel Hestness
- Special thanks
  - The awesome CART group


