George Michelogiannakis, Prof. William J. Dally Concurrent architecture & VLSI group Stanford University Elastic Buffer Flow Control for On-chip Networks.



The PPL Vision
[Figure: layered stack diagram. Recoverable labels:
- Applications: Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering
- Domain Specific Languages: Rendering, Physics (Liszt), Scripting, Probabilistic (RandomT), Machine Learning (OptiML)
- Domain Embedding Language (Scala)
- DSL Infrastructure: Polymorphic Embedding, Staging, Static Domain Specific Opt.
- Parallel Runtime (Delite, Sequoia, GRAMPS): Dynamic Domain Spec. Opt., Task & Data Parallelism, Locality Aware Scheduling
- Hardware Architecture (Heterogeneous Hardware): OOO Cores, SIMD Cores, Threaded Cores, Specialized Cores, Programmable Hierarchies, Scalable Coherence, Isolation & Atomicity, On-chip Networks, Pervasive Monitoring]

In a Nutshell
- Elastic-buffer (EB) flow control uses the channels as distributed FIFOs
  - Input buffers at routers are not needed
- Compared to VC routers:
  - Reduces cycle time by up to 67%
  - Provides 43% more throughput per unit power and 22% more throughput per unit area
  - Makes for a simpler network
- EB uses duplicate subnetworks for traffic isolation
  - For many classes, a hybrid EB-VC router is used instead
  - Uses buffers only to alleviate severe contention and deadlocks
  - Increases power efficiency

Outline
- Building EB channels
  - The basic building blocks of EB networks
- EB router design
- Deadlock avoidance & congestion sensing
- Evaluation results

The Idea
- Use the network channels as distributed FIFOs
- Use that storage instead of input buffers at routers
  - To remove input buffer area and power costs
[Figure: pipelined channel vs. channel as FIFO]

Building an Elastic Buffer
- To build an EB in a pipelined channel with master-slave flip-flops (FFs):
  - Use the latches for storage by driving their enables independently
[Figure: master-slave FF vs. elastic buffer]

How Elastic Buffer Channels Work
- Ready/valid handshake between elastic buffers
  - Ready: at least one free storage slot
  - Valid: non-empty (driving valid data)
[Figure: handshake example over cycles 1 to 6]
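The ready/valid handshake can be illustrated with a small cycle-level model. This is a minimal sketch, not the authors' design: the names `ElasticBuffer` and `step` are hypothetical, each EB is assumed to hold two flits (one per latch), and all handshake signals are assumed to be sampled at the start of the cycle.

```python
from collections import deque


class ElasticBuffer:
    """Two-slot FIFO standing in for the master and slave latches of one EB."""

    CAPACITY = 2  # assumption: one flit per latch

    def __init__(self):
        self.slots = deque()

    @property
    def ready(self):
        # Ready: at least one free storage slot.
        return len(self.slots) < self.CAPACITY

    @property
    def valid(self):
        # Valid: non-empty, i.e. driving valid data.
        return len(self.slots) > 0


def step(channel, incoming=None, sink_ready=True):
    """Advance a chain of EBs by one cycle.

    Returns (delivered_flit_or_None, incoming_accepted).
    """
    # Sample every handshake signal before any state changes.
    valid = [eb.valid for eb in channel]
    ready = [eb.ready for eb in channel] + [sink_ready]

    delivered = None
    # Walk from the last EB back to the first so each EB sends its
    # oldest flit before receiving a new one.
    for i in reversed(range(len(channel))):
        if valid[i] and ready[i + 1]:
            flit = channel[i].slots.popleft()
            if i + 1 < len(channel):
                channel[i + 1].slots.append(flit)
            else:
                delivered = flit

    accepted = incoming is not None and ready[0]
    if accepted:
        channel[0].slots.append(incoming)
    return delivered, accepted
```

With the sink blocked, a channel of two such EBs accepts four flits and then deasserts ready at its input, which is the sense in which the channel itself acts as a distributed FIFO.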

Outline
- Building EB channels
- EB router design
  - The implications in router design
- Deadlock avoidance & congestion sensing
- Evaluation results

Use EB Flow-Control Through the Router
- Input buffer replaced by an input EB
- VC & SW allocators removed; per-output arbiters are used instead
- Three-slot output EB to cover for arbitration being done one cycle in advance
- Look-ahead (LA) routing is also applicable to EB networks
[Figure: VC input-buffered router vs. EB router]
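Why per-output arbiters can stand in for the allocators is easier to see in code. The sketch below assumes round-robin arbitration, which the slides do not specify, and uses hypothetical names (`RoundRobinArbiter`, `arbitrate`). Without VCs, each input presents at most one head flit and so requests at most one output, which means independent arbiters at the outputs cannot grant the same input twice.

```python
class RoundRobinArbiter:
    """One arbiter per output port (round-robin policy is an assumption)."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.last = num_inputs - 1  # input 0 gets priority on the first grant

    def grant(self, requests):
        """requests: set of input indices. Returns the winner or None."""
        for offset in range(1, self.num_inputs + 1):
            candidate = (self.last + offset) % self.num_inputs
            if candidate in requests:
                self.last = candidate
                return candidate
        return None


def arbitrate(arbiters, requested_output):
    """requested_output[i] is the output wanted by input i's head flit, or None.

    Returns a dict mapping each granted output to the winning input.
    """
    grants = {}
    for out, arbiter in enumerate(arbiters):
        requesters = {i for i, o in enumerate(requested_output) if o == out}
        winner = arbiter.grant(requesters)
        if winner is not None:
            grants[out] = winner
    return grants
```

For example, with three ports and inputs 0 and 1 both requesting output 1, input 0 wins the first cycle and input 1 the next, while input 2's request for output 2 is granted in both.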

Two Improved Router Designs
- Enhanced two-stage
  - Fixes the baseline design's main inefficiencies
  - Prioritizes cycle time
- Single-stage
  - Removes pipelining overhead
  - Prioritizes latency

Outline
- Building EB channels
- EB router design
- Deadlock avoidance & congestion sensing
  - How to provide traffic classes
- Evaluation results

Deadlock Avoidance
- No input buffers, hence no virtual channels
- Can provide traffic isolation with duplicate physical channels
  - Duplicating subnetworks is most efficient because crossbar cost grows quadratically
  - That is only true up to a certain number of classes
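The quadratic-cost argument can be made concrete with a back-of-the-envelope model. This is a sketch under stated assumptions, not a result from the talk: crossbar cost is assumed to scale as $C(P) = kP^2$ for a router with $P$ ports, $k$ is a hypothetical technology constant, $n$ is the number of traffic classes, and all non-crossbar costs are ignored.

```latex
\begin{align*}
C_{\text{one router, } n \text{ physical channels per port}} &= k\,(nP)^2 = n^2\,kP^2 \\
C_{n \text{ duplicate subnetworks}} &= n \cdot kP^2 \\
\frac{C_{\text{one router}}}{C_{\text{duplicate subnetworks}}} &= n
\end{align*}
```

Under this model, duplication is cheaper by a factor of $n$ on crossbar cost alone. The limit on the number of classes noted on the slide presumably comes from costs that this model leaves out.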

Hybrid EB-VC Router
- For many classes, have an input buffer to drain flits after a predefined number of blocking cycles
- Thus, the buffer is used only to alleviate heavy contention and resolve deadlocks
  - In the common case, as energy efficient as EB networks
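The drain-after-blocking policy can be sketched as a per-input state machine. Everything here is illustrative: `HybridInputPort`, `block_threshold`, and `buffer_slots` are hypothetical names, and the choice to reset the counter after a drain is an assumption rather than something the slides state.

```python
from collections import deque


class HybridInputPort:
    """Input port that uses its buffer only after prolonged blocking."""

    def __init__(self, block_threshold, buffer_slots):
        self.block_threshold = block_threshold  # predefined blocking cycles
        self.buffer_slots = buffer_slots
        self.blocked_cycles = 0
        self.buffer = deque()  # drained flits wait here

    def cycle(self, head_flit, output_ready):
        """Decide what happens to the head flit of the input EB this cycle."""
        if head_flit is None:
            self.blocked_cycles = 0
            return "idle"
        if output_ready:
            # Common case: the flit bypasses the buffer, as in a pure EB router.
            self.blocked_cycles = 0
            return "forward"
        self.blocked_cycles += 1
        has_space = len(self.buffer) < self.buffer_slots
        if self.blocked_cycles >= self.block_threshold and has_space:
            # Heavy contention or a possible deadlock: drain into the buffer
            # so flits behind this one can make progress.
            self.buffer.append(head_flit)
            self.blocked_cycles = 0
            return "drain"
        return "wait"
```

With a threshold of three cycles, a flit that stays blocked returns "wait", "wait", then "drain"; a flit whose output is ready never touches the buffer.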

Output Channel Occupancy Load Metric
- Flit-buffered networks use the credit count
- EB networks measure output channel occupancy
  - At a certain segment of the output channel (shown in red)
  - Occupancy is decremented when flits leave that segment
  - Occupancy is incremented by a packet's length when the routing decision is made
  - Packets see other routing decisions made in the same cycle
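The bookkeeping behind this metric can be sketched in a few lines. The class and method names below are hypothetical, and picking the least-occupied candidate output is an assumed use of the metric for adaptive routing.

```python
class OutputOccupancy:
    """Per-output occupancy counters used as a congestion estimate."""

    def __init__(self, num_outputs):
        self.count = [0] * num_outputs

    def route(self, candidates, packet_length):
        """Pick the least-occupied candidate output for a new packet."""
        choice = min(candidates, key=lambda out: self.count[out])
        # Incremented by the packet's length as soon as the decision is made,
        # so later decisions in the same cycle already see it.
        self.count[choice] += packet_length
        return choice

    def flit_left_segment(self, output):
        """Called when a flit leaves the measured segment of the channel."""
        self.count[output] -= 1
```

Routing two packets in the same cycle over candidates `[0, 1]` sends the first to output 0 and the second to output 1, because the second decision already sees the first packet's length added to output 0.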

Outline
- Building EB channels
- EB router design
- Deadlock avoidance & congestion sensing
- Evaluation results
  - Let's talk numbers

Throughput-Power: Mesh (Baseline Router)
- EB network improvement:
  - Same power: 10% increased throughput
  - Same throughput: 12% reduced power
- EB: 18% lower cycle time, not taken into account
[Figure: throughput vs. power, with the throughput gain annotated]

Router RTL Implementation
- EB router: no buffers, VCs, allocators, or credits
  - VC router had look-ahead routing
- Buffers: FF arrays. 2 VCs, 8 slots each
- 45nm, LP-CMOS, worst-case
- Mesh 5x5 routers. DOR. 64-bit datapath

Aspect        VC router   EB router   Savings
Area (μm²)    63,515      14,730      77%
Clock (ns)                            %
Power (mW)                            %
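As a quick check, the area savings figure follows directly from the two reported areas:

```latex
\[
\text{Area savings} = 1 - \frac{14{,}730}{63{,}515} \approx 0.768 \approx 77\%
\]
```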

Router Comparison
- Baseline: 9% less energy than single-stage, 35% less than enhanced
- Enhanced: 26% lower cycle time than single-stage, 42% lower than baseline

Hybrid EB-VC Comparison
- Cycle time comparable to VC routers, not EB routers
- Hybrid offers 21% more throughput per unit power than VC, 12% more than EB
- The VC network offers 41% more throughput per unit area, the EB network 49%

Conclusions
- EB flow control uses channels as distributed FIFOs
  - Uses the pipeline flip-flops that are required anyway
  - Removes input buffers from routers
- Provides 43% more throughput per unit power and 22% more throughput per unit area
  - Depends on what fraction of the cost the input buffers account for
- Reduces cycle time by up to 67%
- Hybrid EB-VC router provides a large number of classes
  - Input buffer is used only when it has to be
  - 21% more throughput per unit power than VC
- Remove buffers, keep buffering. Elastic buffers!

Questions?