Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service Guarantees
Boris Grot
The University of Texas at Austin
Technology Trends
[Figure: transistor count vs. year of introduction for Intel processors (Pentium, Pentium 4, Pentium D, Core i7, Xeon Nehalem-EX)]
Technology Applications
[Figure]
Networks-on-Chip (NOCs)
- The backbone of highly integrated chips
  - Transport of memory, operand, and control traffic
- Structured, packet-based, multi-hop networks
- Increasing importance with greater levels of integration
- Major impact on chip performance, energy, and area
  - TRIPS: 28% performance loss on SPEC CPU2000 due to the NOC
  - Intel Polaris: 28% of chip power consumed by the NOC
- "Moving data is more expensive [energy-wise] than operating on it" (William Dally, SC '10)
On-chip vs. Off-chip Interconnects
[Table: comparison along topology, routing, flow control, pins, bandwidth, power, and area]
Future NOC Requirements
- 100's to 1000's of network clients: cores, caches, accelerators, I/O ports, ...
- Efficient topologies: high performance, small footprint
- Intelligent routing: performance through better load balance
- Light-weight flow control: high performance, low buffer requirements
- Service guarantees: cloud computing and real-time apps demand QOS support
(Related contributions: HPCA '08, HPCA '09, MICRO '09, work under submission)
Outline
- Introduction
- Service Guarantees in Networks-on-Chip
  - Motivation
  - Desiderata, prior work
  - Preemptive Virtual Clock
  - Evaluation highlights
- Efficient Topologies for On-chip Interconnects
- Kilo-NOC: A Network for 1000 Nodes
- Summary and Future Work
Why On-chip Quality-of-Service?
- Shared on-chip resources: memory controllers, accelerators, network-on-chip, ...
- ... require QOS support: fairness, service differentiation, performance isolation
- End-point QOS solutions are insufficient: data still has to traverse the on-chip network
- Need QOS support at the interconnect level: hard guarantees in NOCs
NOC QOS Desiderata
- Fairness
- Isolation of flows
- Bandwidth efficiency
- Low overhead: delay, area, energy
Conventional QOS Disciplines
- Fixed schedule (example: Round Robin)
  - Pros: algorithmic and implementation simplicity
  - Cons: inefficient BW utilization; per-flow queuing
- Rate-based (example: Weighted Fair Queuing, WFQ [SIGCOMM '89])
  - Pros: fine-grained scheduling; BW efficient
  - Cons: complex scheduling; per-flow queuing
- Frame-based (example: Rotating Combined Queuing, RCQ [ISCA '96])
  - Pros: good throughput at modest complexity
  - Cons: throughput-complexity trade-off; per-flow queuing
- Common cost: per-flow queuing, which brings area, energy, and delay overheads plus scheduling complexity
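As a concrete illustration of the rate-based discipline above, the sketch below implements the classic WFQ virtual-finish-time idea. It is a software toy, not anything from the talk; the flow weights, packet sizes, and simplified virtual-time update are illustrative assumptions.

```python
# Minimal WFQ-style sketch: each arriving packet gets a virtual finish time,
# and the scheduler always sends the packet with the smallest finish time.
import heapq

class WFQScheduler:
    def __init__(self, weights):
        self.weights = weights                       # flow_id -> provisioned share
        self.last_finish = {f: 0.0 for f in weights}
        self.queue = []                              # (finish, seq, flow_id, size)
        self.virtual_time = 0.0
        self.seq = 0

    def enqueue(self, flow_id, size):
        start = max(self.virtual_time, self.last_finish[flow_id])
        finish = start + size / self.weights[flow_id]
        self.last_finish[flow_id] = finish
        heapq.heappush(self.queue, (finish, self.seq, flow_id, size))
        self.seq += 1

    def dequeue(self):
        finish, _, flow_id, size = heapq.heappop(self.queue)
        self.virtual_time = finish                   # simplified virtual-time update
        return flow_id, size

sched = WFQScheduler({"A": 2.0, "B": 1.0})           # A is provisioned 2x B's rate
for _ in range(3):
    sched.enqueue("A", 1.0)
    sched.enqueue("B", 1.0)
print([sched.dequeue()[0] for _ in range(6)])        # A drains roughly 2x as fast as B
```

Note that even this toy keeps per-flow state (last finish times and queued packets), which is exactly the overhead the slide calls out.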
Preemptive Virtual Clock (PVC) [HPCA '09]
- Goal: a high-performance, cost-effective mechanism for fairness and service differentiation in NOCs
- Full QOS support: fairness, prioritization, performance isolation
- Modest area and energy overhead: minimal buffering in routers and source nodes
- High performance: low latency, good BW efficiency
PVC: Scheduling
- Combines rate-based and frame-based features
- Rate-based: evolved from Virtual Clock [SIGCOMM '90]
  - Routers track each flow's bandwidth consumption
  - Cheap priority computation: f(provisioned rate, consumed BW)
  - Problem: history effect, where past bandwidth consumption keeps influencing current priorities
- Framing: PVC's solution to the history effect
  - Fixed frame duration
  - Frame rollover clears all BW counters and resets priorities
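A behavioral sketch of the scheduling described above may help: per-flow bandwidth counters, a priority derived from provisioned rate minus consumed bandwidth, and a frame rollover that clears the counters. The frame length, counter handling, and exact priority function below are placeholders, not the thesis's hardware design.

```python
# Behavioral sketch of PVC-style scheduling at one router (not the actual RTL).
FRAME_FLITS = 1000          # placeholder frame length, in flit times

class PVCRouterState:
    def __init__(self, rates):
        self.rates = rates                      # flow_id -> provisioned share (flits/frame)
        self.consumed = {f: 0 for f in rates}   # per-frame bandwidth counters
        self.frame_flits = 0

    def priority(self, flow_id):
        # Higher value = higher priority: flows that are behind their
        # provisioned rate win arbitration.
        return self.rates[flow_id] - self.consumed[flow_id]

    def forward_flit(self, flow_id):
        self.consumed[flow_id] += 1
        self.frame_flits += 1
        if self.frame_flits >= FRAME_FLITS:     # frame rollover
            self.consumed = {f: 0 for f in self.consumed}   # clear history
            self.frame_flits = 0

    def arbitrate(self, competing_flows):
        # Ties broken by lower flow id, purely for determinism in this sketch.
        return max(competing_flows, key=lambda f: (self.priority(f), -f))

router = PVCRouterState({0: 600, 1: 300, 2: 100})
router.consumed.update({0: 500, 1: 100, 2: 0})   # pretend traffic already forwarded this frame
print(router.arbitrate([0, 1, 2]))               # flow 1 is furthest below its share -> wins
```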
PVC: Freedom from Priority Inversion
- PVC: simple routers without per-flow buffering and no BW reservation
- Problem: high-priority packets may be blocked by lower-priority packets (priority inversion)
- Solution: preemption of lower-priority packets
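The preemption rule can be sketched as a single check, roughly as described on this slide; the integer priorities and the function name are illustrative assumptions, not PVC's actual encoding.

```python
# Sketch of the preemption rule: when a packet with strictly higher priority
# is blocked behind a buffered lower-priority packet, the blocker is dropped
# and will be NACKed back to its source for retransmission.
def resolve_blocking(incoming_priority, buffered_priority):
    """Return 'wait' or 'preempt' for an incoming packet that found the
    downstream buffer occupied by another flow's packet."""
    if incoming_priority > buffered_priority:
        return "preempt"     # drop the blocker, NACK its source
    return "wait"            # no priority inversion, keep waiting

assert resolve_blocking(incoming_priority=5, buffered_priority=2) == "preempt"
assert resolve_blocking(incoming_priority=2, buffered_priority=5) == "wait"
```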
PVC: Preemption Recovery
- Retransmission of dropped packets
  - Buffer outstanding packets at the source node
- ACK/NACK protocol via a dedicated network
  - All packets acknowledged
  - Narrow, low-complexity network
  - Lower overhead than timeout-based recovery
- 64-node network: a 30-flit backup buffer per node suffices
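A minimal software sketch of the recovery path described above: a small source-side backup buffer holds outstanding packets until an ACK arrives, and a NACK triggers retransmission. The buffer capacity, packet representation, and `inject` callback are assumptions for illustration; the talk only states that about 30 flits of backup storage per node suffice at 64 nodes.

```python
# Sketch of PVC-style preemption recovery at a source node.
from collections import OrderedDict

class SourceBackupBuffer:
    def __init__(self, capacity_packets=8):
        self.capacity = capacity_packets
        self.outstanding = OrderedDict()      # packet_id -> payload

    def send(self, packet_id, payload, inject):
        if len(self.outstanding) >= self.capacity:
            raise RuntimeError("backup buffer full: stall injection")
        self.outstanding[packet_id] = payload
        inject(packet_id, payload)            # hand the packet to the NOC

    def on_ack(self, packet_id):
        self.outstanding.pop(packet_id, None) # safe to free the backup copy

    def on_nack(self, packet_id, inject):
        inject(packet_id, self.outstanding[packet_id])   # retransmit dropped packet

sent = []
buf = SourceBackupBuffer()
buf.send(1, "read-reply", inject=lambda pid, p: sent.append(pid))
buf.on_nack(1, inject=lambda pid, p: sent.append(pid))   # packet 1 was preempted
buf.on_ack(1)
print(sent)   # [1, 1]: original injection plus one retransmission
```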
PVC: Preemption Throttling
- Relaxed definition of priority inversion: reduces preemption frequency at a small fairness penalty
- Per-flow bandwidth reservation
  - Flits within the reserved quota are non-preemptible
  - Reserved quota is a function of rate and frame size
- Coarsened priority classes
  - Mask out lower-order bits of each flow's BW counter
  - Enables a fairness/throughput trade-off
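The two throttling knobs can be sketched as follows, extending the earlier preemption check: reserved-quota flits are never preempted, and masking low-order counter bits coarsens priority classes. The mask width and quota handling are placeholder assumptions.

```python
# Sketch of the throttling knobs on this slide (illustrative parameters only).
PRIORITY_MASK_BITS = 4      # larger mask -> coarser classes, fewer preemptions

def priority_class(bw_counter, mask_bits=PRIORITY_MASK_BITS):
    # Lower consumed bandwidth -> higher priority; dropping low-order bits
    # puts flows with nearly equal counters into the same class.
    return -(bw_counter >> mask_bits)

def may_preempt(incoming_counter, blocker_counter, blocker_within_quota):
    if blocker_within_quota:
        return False         # reserved-quota flits are never preempted
    return priority_class(incoming_counter) > priority_class(blocker_counter)

# Flows whose counters differ by less than 2**mask_bits land in the same class:
print(may_preempt(incoming_counter=3, blocker_counter=12, blocker_within_quota=False))   # False
print(may_preempt(incoming_counter=3, blocker_counter=200, blocker_within_quota=False))  # True
print(may_preempt(incoming_counter=3, blocker_counter=200, blocker_within_quota=True))   # False
```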
PVC: Guarantees
- Minimum bandwidth: based on the reserved quota
- Fairness: subject to BW counter resolution
- Worst-case latency: a packet entering its source buffer in frame N is guaranteed delivery by the end of frame N+1
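The bounds above can be illustrated with back-of-the-envelope arithmetic; every number below (frame length, link width, clock, reserved quota) is a made-up example rather than a value from the thesis.

```python
# Back-of-the-envelope sketch of the PVC guarantees (placeholder numbers).
frame_flits    = 1000        # frame length in flit times
flit_bytes     = 16          # link width
cycle_ns       = 0.5         # 2 GHz clock
reserved_quota = 50          # flits per frame reserved for one flow

# Minimum bandwidth: the reserved quota is deliverable every frame.
min_bw_bytes_per_s = reserved_quota * flit_bytes / (frame_flits * cycle_ns * 1e-9)

# Worst-case latency: a packet entering its source buffer during frame N is
# delivered by the end of frame N+1, i.e. within two frame lengths.
worst_case_latency_ns = 2 * frame_flits * cycle_ns

print(f"min bandwidth  ~ {min_bw_bytes_per_s / 1e9:.2f} GB/s")
print(f"worst-case lat ~ {worst_case_latency_ns:.0f} ns")
```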
Performance Isolation
- Baseline NOC: no QOS support
- Globally Synchronized Frames (GSF) [J. Lee et al., ISCA 2008]
  - Frame-based scheme adapted for on-chip implementation
  - Source nodes enforce bandwidth quotas via self-throttling
  - Multiple frames in flight for performance
  - Network prioritizes packets based on frame number
- Preemptive Virtual Clock (PVC)
  - Highest fairness setting (unmasked bandwidth counters)
Performance Isolation
[Figure]
PVC Summary
- Full QOS support
  - Fairness and service differentiation
  - Strong performance isolation
- High performance
  - Simple routers, low latency
  - Good bandwidth efficiency
- Modest area and energy overhead
  - 3.4 KB of storage per node (1.8x a no-QOS router)
  - 12-20% extra energy per packet
- Will it scale to 1000 nodes?
Outline
- Introduction
- Service Guarantees in Networks-on-Chip
- Efficient Topologies for On-chip Interconnects
  - Mesh-based networks
  - Toward low-diameter topologies
  - Multidrop Express Channels
- Kilo-NOC: A Network for 1000 Nodes
- Summary and Future Work
NOC Topologies
- Topology is the principal determinant of network performance, cost, and energy efficiency
- Topology desiderata
  - Rich connectivity: reduces router traversals
  - High bandwidth: reduces latency and contention
  - Low router complexity: reduces area and delay
- On-chip constraints
  - 2D substrates limit implementable topologies
  - Logic area/energy constrains the use of wire resources
  - Power constraints restrict routing choices
2-D Mesh
- Pros
  - Low design and layout complexity
  - Simple, fast routers
- Cons
  - Large diameter
  - Energy and latency impact
Concentrated Mesh (Balfour & Dally, ICS '06)
- Pros
  - Multiple terminals at each node
  - Fast nearest-neighbor communication via the crossbar
  - Hop count reduction proportional to the concentration degree
- Cons
  - Benefits limited by crossbar complexity
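The hop-count argument behind the mesh and concentrated-mesh slides can be made concrete with a few lines of arithmetic; the node counts and concentration degree below are illustrative.

```python
# Hop-count arithmetic: a k x k mesh has diameter 2*(k-1), and concentrating
# c terminals per router shrinks the router grid and therefore the hop count.
def mesh_diameter(nodes, concentration=1):
    routers = nodes // concentration          # terminals share a router
    k = int(round(routers ** 0.5))            # k x k router grid
    return 2 * (k - 1)                        # max hops, corner to corner

for n in (64, 256, 1024):
    print(f"{n:5d} nodes: mesh diameter {mesh_diameter(n):2d} hops, "
          f"conc-4 mesh diameter {mesh_diameter(n, concentration=4):2d} hops")
```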
Flattened Butterfly (Kim et al., MICRO '07)
- Objectives: improve connectivity, exploit the wire budget
- Point-to-point links; nodes fully connected in each dimension
- Pros
  - Excellent connectivity
  - Low diameter: 2 hops
- Cons
  - High channel count: k^2/2 per row/column
  - Low channel utilization
  - Control complexity
Multidrop Express Channels (MECS) [Grot et al., MICRO '09]
- Objectives: connectivity, more scalable channel count, better channel utilization
- Point-to-multipoint channels: single source, multiple destinations
  - At each drop point, a flit either propagates further or exits into the router
- Pros
  - One-to-many topology
  - Low diameter: 2 hops
  - k channels per row/column
- Cons
  - I/O asymmetry
  - Control complexity
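The channel-count contrast between the flattened butterfly and MECS follows from simple counting; the sketch below reproduces the trend under the slides' counting convention (all-pairs point-to-point links for the flattened butterfly, one multidrop channel per source for MECS) and is only illustrative.

```python
# Channel scaling per row of k routers: the flattened butterfly needs a link
# for every router pair (~k^2/2), while MECS needs on the order of k multidrop
# channels. Both reach any node in at most 2 hops (one row hop, one column hop).
# Exact constants depend on how directions are counted; this shows the trend.
def fbfly_channels_per_row(k):
    return k * (k - 1) // 2          # all-pairs point-to-point links

def mecs_channels_per_row(k):
    return k                         # one multidrop channel per source (slide's count)

for k in (4, 8, 16):
    print(f"k={k:2d}: flattened butterfly {fbfly_channels_per_row(k):3d}, "
          f"MECS {mecs_channels_per_row(k):3d} channels per row")
```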
MECS Summary
- MECS: a novel one-to-many topology
  - Excellent connectivity
  - Effective wire utilization
  - Good fit for planar substrates
- Results summary
  - MECS: lowest latency, high energy efficiency
  - Mesh-based topologies: best throughput
  - Flattened butterfly: smallest router area
Outline
- Introduction
- Service Guarantees in Networks-on-Chip
- Efficient Topologies for On-chip Interconnects
- Kilo-NOC: A Network for 1000 Nodes
  - Requirements and obstacles
  - Topology-centric Kilo-NOC architecture
  - Evaluation highlights
- Summary and Future Work
Scaling to a Kilo-Node NOC
- Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees
- MECS scalability obstacles
  - Buffer requirements: more ports, deeper buffers (area, energy, latency overheads)
- PVC scalability obstacles
  - Flow state and other storage (area, energy overheads)
  - Preemption overheads (energy, latency)
  - Prioritization and arbitration (latency overheads)
- Kilo-NOC: addresses both topology and QOS scalability bottlenecks
- This talk: reducing QOS overheads
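To see why per-flow state becomes a scalability obstacle, a rough storage estimate helps; the counter width and the one-flow-per-node assumption below are placeholders, not the thesis's parameters.

```python
# Rough arithmetic: if every QOS router keeps a bandwidth counter per flow
# (one flow per node), per-router storage grows linearly with node count and
# chip-wide storage grows quadratically.
def per_router_flow_state_bytes(nodes, counter_bits=16):
    return nodes * counter_bits // 8

for nodes in (64, 256, 1024):
    per_router = per_router_flow_state_bytes(nodes)
    total_kib = per_router * nodes / 1024          # assuming one router per node
    print(f"{nodes:5d} nodes: {per_router:5d} B of flow state per router, "
          f"~{total_kib:8.0f} KiB across the chip")
```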
NOC QOS: Conventional Approach
- Multiple virtual machines (VMs) sharing a die
  - Shared resources (e.g., memory controllers)
  - VM-private resources (cores, caches)
- NOC contention scenarios
  - Shared-resource accesses (e.g., memory access)
  - Intra-VM traffic (e.g., shared cache access)
  - Inter-VM traffic (e.g., VM page sharing)
- Conventional approach: QOS support throughout the network
- Question: can we get network-wide guarantees without network-wide QOS support?
Kilo-NOC QOS: Topology-centric Approach
- Dedicated, QOS-enabled regions
  - Rest of the die: QOS-free
- A richly-connected topology (MECS)
  - Traffic isolation
- Special routing rules
  - Ensure interference freedom
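One way to picture the routing rule is as a check over the routers a packet actually enters: with MECS express channels those are only the source, the turn, and the destination, so placing the turn in the packet's own VM or in a QOS-enabled column confines inter-VM interference to the QOS regions. The layout, VM map, and rule below are simplified assumptions for illustration, not the thesis's exact mechanism.

```python
# Simplified sketch of a topology-aware interference check (assumed layout:
# shared resources and QOS routers live in column 7; left half of the die is
# VM0, right half is VM1).
QOS_COLUMNS = {7}

def routers_entered(src, dst):
    """Row-then-column route over MECS express channels: intermediate drop
    points are bypassed, so only source, turn, and destination routers are
    actually entered."""
    turn = (dst[0], src[1])          # travel the row first, then the column
    return {src, turn, dst}

def route_is_interference_free(src, dst, vm_of):
    src_vm = vm_of(*src)
    return all(x in QOS_COLUMNS or vm_of(x, y) == src_vm
               for x, y in routers_entered(src, dst))

vm_of = lambda x, y: 0 if x < 4 else 1           # placeholder VM map
print(route_is_interference_free((0, 0), (3, 3), vm_of))   # True: stays inside VM0
print(route_is_interference_free((0, 0), (7, 2), vm_of))   # True: turn lands in QOS column 7
```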
Performance Isolation
- PVC-enabled MECS topology (stream workload)
[Figure]
Performance Isolation
- MECS topology, with and without network-wide PVC QOS
[Figure]
Summary: Scaling NOCs to 1000 Nodes
- Objectives: good performance, high energy and area efficiency, service guarantees
- MECS topology
  - Point-to-multipoint interconnect fabric
  - Rich connectivity improves performance and efficiency
- PVC QOS scheme
  - Preemptive architecture reduces buffer requirements
  - Strong guarantees, performance isolation
Summary: Scaling NOCs to 1000 Nodes
- Topology-aware QOS architecture
  - Limits the extent of QOS support to a fraction of the die
  - Reduces network cost, improves performance
  - Enables efficiency-boosting optimizations in QOS-free regions of the chip
- Kilo-NOC compared to MECS+PVC:
  - NOC area reduction of 47%
  - NOC energy reduction of 26-53%
Acknowledgements
- Faculty: Steve Keckler (advisor), Doug Burger, Onur Mutlu, Emmett Witchel
- Collaborators: Paul Gratz, Joel Hestness
- Special thanks: the awesome CART group