Allocator Implementations for Network-on-Chip Routers
Daniel U. Becker and William J. Dally
Concurrent VLSI Architecture Group, Stanford University



Overview
Allocators have major impact on router performance
– Zero-load latency, throughput under load, cycle time
On-chip environment imposes stringent constraints
– Cycle time, power, no iterative / multi-cycle allocators
Main Contributions:
– RTL-based performance & cost evaluation of virtual channel and switch allocators for NoC routers
– Sparse VC allocation scheme reduces delay, area & power
– Pessimistic speculation scheme minimizes delay penalty
11/18/09 Allocator Implementations for NoC Routers 2

Separable Allocators
Implement allocation as two phases
– Local arbitration at each input
– Global arbitration at each output
Pros:
– Straightforward implementation
– Delay scales logarithmically
Cons:
– Arbiters within each phase are independent
– Bad choice in first phase can limit matching
[Figure: input-first and output-first variants; axes: Inputs, Outputs]
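The two-phase scheme can be sketched in a few lines of Python. This is only an illustrative behavioral model of the input-first variant, not the RTL evaluated in the talk; the function names, the round-robin arbiter, and the pointer arguments are assumptions made for the example.

```python
def round_robin_arbiter(requests, pointer=0):
    """Return the index of the first asserted request at or after `pointer`
    (wrapping around), or None if nothing is requested."""
    n = len(requests)
    for offset in range(n):
        idx = (pointer + offset) % n
        if requests[idx]:
            return idx
    return None


def separable_input_first(req, in_ptrs=None, out_ptrs=None):
    """req[i][o] is True if input i requests output o.
    Returns a dict mapping each granted input to its output."""
    n_in = len(req)
    n_out = len(req[0]) if req else 0
    in_ptrs = in_ptrs or [0] * n_in
    out_ptrs = out_ptrs or [0] * n_out

    # Phase 1: each input independently picks ONE of its requested outputs.
    chosen = [round_robin_arbiter(req[i], in_ptrs[i]) for i in range(n_in)]

    # Phase 2: each output independently picks one input that chose it.
    grants = {}
    for o in range(n_out):
        forwarded = [chosen[i] == o for i in range(n_in)]
        winner = round_robin_arbiter(forwarded, out_ptrs[o])
        if winner is not None:
            grants[winner] = o
    return grants


# Input 0 can use either output; input 1 can only use output 0.
req = [[True, True],
       [True, False]]
print(separable_input_first(req))                  # input 0 takes output 0, input 1 gets nothing
print(separable_input_first(req, in_ptrs=[1, 0]))  # both inputs are served
```

With the default pointers both inputs pick output 0 in the first phase, so only one grant is produced even though a matching of size two exists. This is the "bad choice in first phase" effect listed under Cons.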

Wavefront Allocator [Tamir’93]
Consider inputs and outputs together
– Grant requests on diagonal, kill conflicts
– Repeat for other diagonals
Pros:
– Tends to generate better matchings
– Tiled design facilitates full-custom implementation
Cons:
– Delay scales linearly
– Original design has (false) combinational loops
[Figure: request matrix; axes: Inputs, Outputs]
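A behavioral sketch of the diagonal sweep, assuming a square n×n request matrix and wrapped diagonals (cells with (i + j) mod n equal to the diagonal index). The function name and the `priority` parameter, which selects the starting diagonal, are illustrative. The loop models the result of the allocation, not the tiled hardware structure.

```python
def wavefront_allocate(req, priority=0):
    """req[i][j] is True if input i requests output j (square matrix).
    Returns a dict mapping each granted input to its output."""
    n = len(req)
    row_free = [True] * n   # input i has not been granted yet
    col_free = [True] * n   # output j has not been granted yet
    grants = {}
    # Visit the n wrapped diagonals in order, starting at `priority`.
    for step in range(n):
        diag = (priority + step) % n
        # Cells on one diagonal never share a row or column,
        # so all of them can be decided at the same time.
        for i in range(n):
            j = (diag - i) % n
            if req[i][j] and row_free[i] and col_free[j]:
                grants[i] = j
                row_free[i] = False   # kill later requests in this row
                col_free[j] = False   # kill later requests in this column
    return grants


req = [[True, True],
       [True, True]]
print(wavefront_allocate(req))              # both inputs are granted
print(wavefront_allocate(req, priority=1))  # same size, other diagonal
```

Because every diagonal is visited, no request is left ungranted while both its input and its output are still free. The n sequential diagonal steps correspond to the linear delay scaling noted under Cons.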

Evaluation Methodology
Analytical models useful for developing intuition
But becoming increasingly inaccurate
– Wire delay impact, synthesized vs. full-custom logic, …
Use two-pronged evaluation approach:
– Delay & cost via detailed RTL-based evaluation
    Synthesized using Synopsys Design Compiler in topo mode
    Commercial 45nm low power worst case
– Network-level performance via simulation
    Cycle-oriented interconnection network simulator
    64-node networks: 2D mesh & 2D flattened butterfly
    Request-reply traffic, synthetic traffic patterns

Virtual Channel Allocation
Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels)
Before packets can proceed through router, need to claim ownership of VC buffer at next router
VC allocator assigns waiting packets at inputs to output VC buffers that are not currently in use
– P×V inputs (input VCs), P×V outputs (output VCs)
– Once assigned, VC is used for entire packet’s duration

Sparse VC Allocation (1)
VCs are used for a variety of purposes:
– Deadlock avoidance
    Break cyclic dependencies
    Routing deadlock (within network)
    Protocol deadlock (at network boundary)
– Flow control
    Decouple buffers and channels to avoid head-of-line blocking
Idea: Partition set of VCs to restrict legal requests
– Significantly reduces VC allocator logic complexity
– Delay/area/power savings of up to 41%/90%/83%

Sparse VC Allocation (2)
[Figure: legal requests from input VCs (IVC) to output VCs (OVC) for three VC partitionings; class labels: REQ, REP, NM, MIN; per-VC labels: P×8 Requests, P×4 Requests, P×2 Requests]
VC partitioning    Requests
8 VCs              64
2×4 VCs            32
2×2×2 VCs          24
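The request totals on this slide (64, 32, 24) can be reproduced with a short calculation. The sketch below rests on a reading of the figure labels that the transcript does not spell out: REQ/REP are taken to be message classes that never mix, and NM/MIN are taken to be ordered resource classes where a VC may request output VCs in its own class or a later one (so one class sees P×4 requests and the other P×2). The function and parameter names are illustrative.

```python
def legal_requests(message_classes, resource_classes, vcs_per_class):
    """Count legal (input VC, output VC) pairs for one input/output port pair.

    Assumptions (not stated explicitly in the slides):
    - a packet stays within its message class;
    - a VC in resource class r may request VCs in class r or any later class.
    """
    total = 0
    for _ in range(message_classes):
        for r in range(resource_classes):
            reachable_classes = resource_classes - r
            # vcs_per_class input VCs, each with this many candidate output VCs
            total += vcs_per_class * (reachable_classes * vcs_per_class)
    return total


print(legal_requests(1, 1, 8))  # 8 VCs, unpartitioned
print(legal_requests(2, 1, 4))  # 2×4 VCs
print(legal_requests(2, 2, 2))  # 2×2×2 VCs
```

Under these assumptions the three calls give 64, 32, and 24, matching the totals on the slide: all three configurations have 8 VCs per port, but partitioning removes most of the request pairs the allocator has to handle.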

VC Allocator Performance
[Figure: FBfly, 2×2×2 VCs]

VC Allocator Delay
[Figure]

VC Allocator Cost
[Figure]

Switch Allocation
Flits require crossbar access to traverse router
VCs at each input port share crossbar input
Switch allocator generates crossbar schedule
– Allocation performed on cycle-by-cycle basis
– P×V inputs (input VCs), P outputs (output ports)
– At most one VC per input can be granted in each cycle
Speculative allocation reduces zero-load latency
– Start switch allocation before VC allocation completes
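The constraints on a single cycle's crossbar schedule can be written down as a small checker. This is a sketch for illustration only; the tuple layout `(input_port, input_vc, output_port)` and the function name are assumptions, and the one-input-per-output condition is the usual crossbar constraint rather than something the slide states verbatim.

```python
def is_valid_crossbar_schedule(grants, num_ports, num_vcs):
    """Check one cycle's switch-allocation result.

    grants: iterable of (input_port, input_vc, output_port) tuples.
    Valid if at most one VC is granted per input port and
    at most one input is granted per output port.
    """
    used_inputs = set()
    used_outputs = set()
    for in_port, in_vc, out_port in grants:
        in_range = (0 <= in_port < num_ports
                    and 0 <= in_vc < num_vcs
                    and 0 <= out_port < num_ports)
        if not in_range:
            return False
        if in_port in used_inputs or out_port in used_outputs:
            return False  # two grants share a crossbar input or output
        used_inputs.add(in_port)
        used_outputs.add(out_port)
    return True


# Two VCs of the same input port cannot both cross the switch in one cycle.
print(is_valid_crossbar_schedule([(0, 0, 1), (0, 1, 2)], num_ports=5, num_vcs=2))
# Different input ports to different output ports is fine.
print(is_valid_crossbar_schedule([(0, 0, 1), (3, 1, 2)], num_ports=5, num_vcs=2))
```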

Pessimistic Speculation (1)
Conventional approach:
– Separate allocators for spec. and non-spec. requests
– Non-spec. grants mask conflicting spec. grants
– Conflict detection is on critical path
At low load, most requests are granted
Idea: Assume all requests will be granted
– Mask spec. grants with non-spec. requests
– Overlap conflict detection and allocation
– Sacrifice speculation accuracy for lower delay
– But preserve zero-load latency improvement
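The difference between the two masking rules can be illustrated with sets of (input port, output port) pairs. This is a behavioral sketch under the assumption that a speculative grant conflicts with a non-speculative one when they share an input port or an output port; the function names are made up for the example.

```python
def mask_speculative(spec_grants, nonspec_pairs):
    """Drop speculative grants that share an input or output port
    with any (input, output) pair in nonspec_pairs."""
    busy_inputs = {i for i, _ in nonspec_pairs}
    busy_outputs = {o for _, o in nonspec_pairs}
    return {(i, o) for (i, o) in spec_grants
            if i not in busy_inputs and o not in busy_outputs}


def conventional(spec_grants, nonspec_grants):
    # Needs the non-speculative GRANTS, so masking can only start
    # after the non-speculative allocator has finished.
    return mask_speculative(spec_grants, nonspec_grants)


def pessimistic(spec_grants, nonspec_requests):
    # Uses the non-speculative REQUESTS, which are known up front,
    # so conflict detection can overlap with allocation.
    return mask_speculative(spec_grants, nonspec_requests)


# Inputs 0 and 1 both make non-speculative requests for output 2; input 0 wins.
nonspec_requests = {(0, 2), (1, 2)}
nonspec_grants = {(0, 2)}
# Input 1 also holds a speculative grant for output 3.
spec_grants = {(1, 3)}

print(conventional(spec_grants, nonspec_grants))   # speculative grant survives
print(pessimistic(spec_grants, nonspec_requests))  # speculative grant is discarded
print(pessimistic(spec_grants, set()))             # zero load: nothing is masked
```

The middle case shows the accuracy that is given up: the speculative grant would have been usable, but the pessimistic rule discards it because input 1 also had a non-speculative request. With no non-speculative requests, as at zero load, both rules keep every speculative grant.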

Pessimistic Speculation (2)
[Figure: block diagram with nonspec. allocator, spec. allocator, conflict detection, and mask; signals: nonspec. requests, spec. requests, nonspec. grants, spec. grants]

Switch Allocator Performance (1)
[Figure: Mesh, 2×1×1 VCs]

Switch Allocator Performance (2)
[Figure: FBfly, 2×2×4 VCs; annotation: >20%]

Switch Allocator Delay
[Figure]

Switch Allocator Cost
[Figure]

Speculation Performance (1)
[Figure: Mesh, 2×1×1 VCs]

Speculation Performance (2)
[Figure: FBfly, 2×2×4 VCs]

Speculation Implementation
[Figure]

Conclusions
Network-level performance is largely insensitive to VC allocator implementation
– Light effective load facilitates near-ideal matchings
Sparse VC allocation can greatly reduce delay & cost
– Partition set of VCs based on functionality
– Restrict possible requests allocator must handle
For switch allocation, wavefront allocator produces better matchings but increases delay & cost
– Difference increases with number of ports, VCs
Pessimistic speculation reduces switch allocator delay
– Trade for some performance degradation near saturation