Destination-Based Adaptive Routing for 2D Mesh Networks ANCS 2010 Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering University of California,

Presentation transcript:

Destination-Based Adaptive Routing for 2D Mesh Networks ANCS 2010 Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering University of California, San Diego

Networks-on-Chip
- Chip multiprocessors (CMPs) are increasingly popular
- 2D-mesh networks are often used as the on-chip fabric
- The routing algorithm is central in determining performance
- Examples: Tilera Tile64; Intel 48-core data center on die (ISSCC 2010)

Classes of Routing Algorithms
Oblivious routing
+ Simple and fast router designs
- Poor load balancing under bursty traffic
Adaptive routing
+ Better performance (throughput, latency)
+ Better fault tolerance
- Higher router complexity

Related Work
- Oblivious routing [Valiant, ROMM, O1TURN, optimal oblivious routing]: optimizes for worst-case and average-case performance
- Adaptive routing is used commercially in multiprocessors from IBM, Cray, and Compaq
- On-chip routing is very different from off-chip routing: lower power, lower area, lower router complexity

Outline: Introduction; Motivation; Destination-Based Adaptive Routing (DAR); Evaluation

Minimal Adaptive Routing Model: adaptive routing along minimal directions
[Figure: mesh with source S and destination D]
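As a rough illustration of "minimal directions", the sketch below lists the output ports that move a packet closer to its destination in a 2D mesh. The function name `minimal_ports` and the coordinate convention (E increases x, N increases y) are assumptions made for this example and are not taken from the slides.

```python
def minimal_ports(src, dst):
    """Return the output ports that lie on some minimal path from src to dst.

    src and dst are (x, y) tuples. Assumed convention: E increases x,
    W decreases x, N increases y, S decreases y.
    """
    (sx, sy), (dx, dy) = src, dst
    ports = []
    if dx > sx:
        ports.append("E")
    elif dx < sx:
        ports.append("W")
    if dy > sy:
        ports.append("N")
    elif dy < sy:
        ports.append("S")
    # At the destination the only candidate is the ejection port.
    return ports or ["Ej"]


if __name__ == "__main__":
    print(minimal_ports((0, 0), (2, 3)))
    print(minimal_ports((3, 1), (0, 1)))
```

A minimal adaptive router has at most two such candidate ports at any hop, so the adaptive decision reduces to choosing between them.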

Granularity of Congestion Estimation (coarse to fine): local congestion

Local Congestion
- Local adaptive: measure a local congestion metric (free VCs, free buffers)
[Figure: mesh from source S to destination D with low, moderate, and high congestion regions; compares the optimal path with the local adaptive path]

Granularity of Congestion Estimation (coarse to fine): local congestion, dimension-based congestion

Dimension-based Congestion
- RCA-1D (Gratz et al., HPCA '08): exponential moving average of congestion to all nodes along a dimension
[Figure: mesh from source S to destination D with low, moderate, and high congestion regions; compares the optimal path with the RCA-1D path]

Granularity of Congestion Estimation (coarse to fine): local congestion, dimension-based congestion, quadrant-based congestion

Quadrant-based Congestion
- RCA-Quadrant (Gratz et al., HPCA '08): exponential moving average of congestion to all nodes in the destination quadrant
[Figure: mesh from source S to destination D with low, moderate, and high congestion regions; compares the optimal path with the RCA-Quadrant path]


Granularity of Congestion Estimation (coarse to fine): local congestion, dimension-based congestion, quadrant-based congestion, destination-based congestion

Ideally, on a per-destination basis:
- Estimate the end-to-end delay along all minimal paths to the destination
- Choose the path with the least delay
[Figure: mesh from source S to destination D with low, moderate, and high congestion regions; shows the optimal path]

Challenges
- Limited bandwidth for congestion updates: congestion notification is not instantaneous
- Limited storage in on-chip routers: exponential number of paths to each destination
- Limited hardware resources for computations
How can we practically emulate ideal adaptive routing?
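To give a feel for the path-count challenge, the sketch below counts the minimal paths between two mesh nodes that are `dx` hops apart in one dimension and `dy` hops apart in the other. This uses the standard lattice-path count C(dx + dy, dx); the function name is an assumption made for this example and does not appear in the slides.

```python
from math import comb


def num_minimal_paths(dx, dy):
    """Number of distinct minimal paths between two mesh nodes.

    A minimal path is a sequence of dx hops in one dimension and dy hops
    in the other, in any order, so the count is C(dx + dy, dx).
    """
    return comb(dx + dy, dx)


if __name__ == "__main__":
    # Corner-to-corner in an 8x8 mesh (7 hops per dimension).
    print(num_minimal_paths(7, 7))
    # Corner-to-corner in a 16x16 mesh (15 hops per dimension).
    print(num_minimal_paths(15, 15))
```

Tracking delay per path is therefore impractical for router storage, which motivates keeping only per-destination, per-output-port estimates.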

Destination-Based Adaptive Routing (DAR)
- Every T cycles, a node estimates the delay to all other nodes through its candidate outputs
Example (S to D): L[N][D] = 20, L[E][D] = 30

DAR: High Level
- Traffic distribution to output ports is controlled using per-destination split ratios W
- Start with an initial set of split ratios
- Estimate the delay to the destination through candidate outputs
- Shift traffic from the more congested port to the less congested port
Example (S to D): W[N][D] = 0.6, W[E][D] = 0.4; L[N][D] = 20, L[E][D] = 30

DAR: High Level
- Traffic distribution to output ports is controlled using per-destination split ratios W
- Start with an initial set of split ratios
- Estimate the delay to the destination through candidate outputs
- Shift traffic from the more congested port to the less congested port
Example (S to D), after shifting traffic: W[N][D] = 0.8, W[E][D] = 0.2; L[N][D] = 20, L[E][D] = 30

Outline: Introduction; Motivation; Destination-Based Adaptive Routing (DAR): distributed delay measurement, split ratio adaptation, scaling; Evaluation

Distributed Delay Measurement
A node maintains:
- W[p][j]: per-destination traffic split ratio through candidate output ports
- l[p]: delay to the next-hop router or ejection interface through each output port (N, S, E, W, Ej)

Distributed Delay Measurement
Every node estimates the average delay to all other nodes in the network. Example for destination node 10:
1. Delay from node 10 to itself: Avg_10[10] = l_10[Ej]
2. Avg_10[10] is propagated to the neighbors
3. Nodes 6, 9, 14, and 11 add their local delay to Avg_10[10] to compute their delay to node 10
4. For example, at node 9: L[E][10] = l[E] + Avg_10[10], and Avg_9[10] = L[E][10]

Distributed Delay Measurement
Every node estimates the delay to all other nodes in the network.
1. Nodes 6, 9, 14, and 11 propagate their estimated delay to node 10 (Avg_6[10], Avg_9[10], Avg_14[10], Avg_11[10]) to their upstream neighbors
2. For example, node 5 receives two delay updates, from nodes 9 and 6: A[E][10] = Avg_6[10], A[N][10] = Avg_9[10]
3. Node 5 adds the local link delay to each received delay update: L[E][10] = A[E][10] + l[E], L[N][10] = A[N][10] + l[N]
4. Finally, the average delay from node 5 to node 10 is computed as: Avg_5[10] = W[E][10] L[E][10] + W[N][10] L[N][10]
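Steps 3 and 4 of the node 5 example can be sketched in Python as follows. The function names and all numeric values are illustrative assumptions; only the two formulas, L[p][j] = A[p][j] + l[p] and Avg[j] = sum over p of W[p][j] L[p][j], come from the slide.

```python
def delay_via_ports(link_delay, downstream_avg):
    """Compute L[p] for one destination: received update plus local delay.

    link_delay[p] is l[p], the delay to the next hop through port p.
    downstream_avg[p] is A[p], the average delay reported by the neighbor
    reached through port p.
    """
    return {p: downstream_avg[p] + link_delay[p] for p in downstream_avg}


def average_delay(split, delay):
    """Compute Avg for one destination as the split-ratio-weighted delay."""
    return sum(split[p] * delay[p] for p in delay)


if __name__ == "__main__":
    # Made-up numbers for node 5 routing to node 10.
    l = {"E": 4, "N": 2}       # local delay to next hop (assumed values)
    A = {"E": 26, "N": 18}     # Avg_6[10] and Avg_9[10] (assumed values)
    W = {"E": 0.4, "N": 0.6}   # split ratios, summing to 1

    L = delay_via_ports(l, A)
    print(L, average_delay(W, L))
```

The resulting average is what node 5 would in turn propagate to its own upstream neighbors, so each node only ever exchanges one value per destination with each neighbor.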


Outline: Introduction; Motivation; Destination-Based Adaptive Routing (DAR): distributed delay measurement, split ratio adaptation, scaling; Evaluation

Adaptation of Split Ratio
Objective: equalize delay on candidate output ports
- If there is only one candidate output, the split ratio is 1
- If there are two candidate outputs:
  - Let p_h be the port with higher delay to destination j
  - Let p_l be the port with lower delay to destination j
  - W[p_h][j] + W[p_l][j] = 1
  - Δ traffic is shifted from p_h to p_l every T cycles
  - Δ is proportional to (L[p_h][j] - L[p_l][j]) / L[p_h][j]
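One adaptation step for a single destination might look like the sketch below. The slide only states that Δ is proportional to the relative delay difference; the proportionality constant `gain`, its default value, and the clamp that keeps ratios non-negative are assumptions made for this example.

```python
def adapt_split(split, delay, gain=0.1):
    """Return updated split ratios for one destination after one period T.

    split[p] is W[p][j] and delay[p] is L[p][j] for the candidate ports p.
    gain is a hypothetical proportionality constant, not from the slides.
    """
    if len(split) == 1:
        # Only one candidate output: all traffic uses it.
        return {p: 1.0 for p in split}

    p_h = max(delay, key=delay.get)  # port with higher delay
    p_l = min(delay, key=delay.get)  # port with lower delay
    if p_h == p_l or delay[p_h] <= 0:
        # Delays already equal (or no measurable delay): nothing to shift.
        return dict(split)

    delta = gain * (delay[p_h] - delay[p_l]) / delay[p_h]
    delta = min(delta, split[p_h])   # assumed clamp: cannot shift more than p_h carries
    return {p_h: split[p_h] - delta, p_l: split[p_l] + delta}


if __name__ == "__main__":
    W = {"N": 0.6, "E": 0.4}
    L = {"N": 20, "E": 30}
    print(adapt_split(W, L))
```

Because Δ shrinks as the two delays approach each other, repeated steps move the split toward the point where both ports see similar delay, which is the stated objective.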

Granularity of Congestion Estimation (coarse to fine): local congestion, dimension-based congestion, quadrant-based congestion, destination-based congestion (does not scale!)

Granularity of Congestion Estimation (coarse to fine): local congestion, dimension-based congestion, quadrant-based congestion, destination-based congestion, scalable destination-based congestion

Outline: Introduction; Motivation; Destination-Based Adaptive Routing (DAR): distributed delay measurement, split ratio adaptation, scaling; Evaluation

Look-ahead Window
- Node S maintains delay estimates for an MxM window centered at S
- Any node outside the window is mapped to the closest node within the window
- A packet's look-ahead window shifts as the packet is routed from source to destination
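The mapping of an out-of-window destination to the closest in-window node can be sketched as a per-coordinate clamp. The function name and the (x, y) coordinate representation are assumptions made for this example, and M is assumed to be odd so that the window can be centered on the current node.

```python
def map_to_window(cur, dst, M):
    """Map dst to the closest node inside the MxM window centered at cur.

    cur and dst are (x, y) tuples; M is assumed odd. Clamping each
    coordinate independently yields the nearest node of the square window.
    """
    r = (M - 1) // 2  # window radius in hops per dimension
    (cx, cy), (dx, dy) = cur, dst
    wx = min(max(dx, cx - r), cx + r)
    wy = min(max(dy, cy - r), cy + r)
    return (wx, wy)


if __name__ == "__main__":
    # 7x7 window: destinations more than 3 hops away in a dimension are clamped.
    print(map_to_window((0, 0), (10, 2), 7))
    print(map_to_window((5, 5), (6, 4), 7))
```

Since the clamped coordinates always lie between the current node and the real destination, the proxy node is on a minimal path, and the window re-centers at every hop as the packet advances.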

Window Size
- Destination D is guaranteed to be within the window when the packet is (M-1)/2 hops away from D
- Intuition: the packet has (M-1)/2 hops to route around congestion hot spots
- A 7x7 look-ahead window in a 16x16 mesh has performance comparable to DAR (which is equivalent to a 31x31 look-ahead window)

Outline: Introduction; Related work; Destination-Based Adaptive Routing (DAR); Evaluation

Experimental Setup
- Compare DAR with RCA-1D, RCA-quadrant, and local adaptive routing
- Workloads: SPLASH-2 benchmarks and synthetic traffic patterns (uniform, transpose, shuffle)
- Cycle-accurate NoC simulator modeling a 3-stage router pipeline
- 8 VCs, each 5 flits deep
- 1 VC used as an escape VC for deadlock prevention

Splash results – 7x7 mesh 41%

Splash results – 7x7 mesh 65%

Uniform traffic – 8x8 mesh

Transpose traffic – 8x8 mesh

Shuffle traffic – 8x8 mesh

SDAR: 16x16 mesh, 7x7 window
[Figures: average latency over 100 permutation traffic patterns at 18% injection load; network saturation statistics at 18% injection load]

Summary
- Destination-Based Adaptive Routing (DAR) for 2D mesh networks
- Scalable DAR (SDAR) uses a look-ahead window and scales easily to large networks
- DAR outperforms existing adaptive and oblivious routing algorithms
- SDAR achieves comparable performance with significantly lower overhead

Thank you!!

Key Implementation Details
- Simple router implementation: low storage, low bandwidth
- Synchronize delay updates to reuse the delay computation and weight adaptation hardware
- Approximate computations to simplify the implementation

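Router architecture (Kim et al., DAC '05)
[Figure: router block diagram with quadrant port pre-selection, per-port virtual channels VC-1 to VC-v for inputs N, S, E, W, a VC allocator, a crossbar (XB) allocator, preferred output registers, congestion value registers for N, S, E, W, Ej, a routing unit with override, and credit signals]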

DAR Router

Distributed Delay Measurement
A node maintains:
- W[p][j]: per-destination traffic split ratio through candidate output ports
- l[p]: delay to the next-hop router or ejection interface through each output port (N, S, E, W, Ej)
Using updates received from downstream nodes, a node computes:
- L[p][j]: average delay from the current node to node j through output port p
- Avg[j]: average delay from the current node to node j

Destination-Based Adaptive Routing (DAR)
- Every router maintains per-destination split ratios, which control traffic distribution to output ports
- Split ratios are adjusted every T cycles based on the measured delay to D through the two ports
[Figure: mesh from source S to destination D with low, moderate, and high congestion regions]