Near-Optimal Oblivious Routing for 3D-Mesh Networks (ICCD 2008). Rohit Sunkam Ramanujam and Bill Lin, Electrical and Computer Engineering Department, University of California, San Diego.

Presentation transcript:

1 Near-Optimal Oblivious Routing for 3D-Mesh Networks
ICCD 2008
Rohit Sunkam Ramanujam, Bill Lin
Electrical and Computer Engineering Department, University of California, San Diego

2 Motivation: Networks-on-Chip
Chip-multiprocessors (CMPs) are increasingly popular.
2D-mesh networks are often used as the on-chip fabric.
[Figure: die photos of Tilera Tile64 and Intel 80-core; labels: I/O area, single tile 1.5mm x 2.0mm, die 21.72mm x 12.64mm]

3 Motivation: 3D Integrated Circuits
3D benefits
– Reduced wire delays
– Enormous bandwidth
– Heterogeneous system integration
Natural progression
– 3D-mesh for 3D CMPs
[Figure: 2D to 3D]

4 Routing Algorithm Objectives
Maximize throughput
– How much load the network can handle
Minimize hop count
– Minimize routing delay between source and destination

5 Challenges
For the 2D case, a near-optimal throughput routing algorithm with minimal hop count, called O1TURN, is known [Seo’05].
Surprisingly, the optimality of O1TURN does not extend to the 3D case: its actual throughput degrades severely.
The only known optimal-throughput routing algorithm is Valiant (VAL) load-balancing, but VAL performs poorly on hop count (latency), at twice that of minimal routing.

6 Main Contribution
Developed a new oblivious routing algorithm called “Randomized Partially Minimal” (RPM) routing.
RPM provably guarantees near-optimal worst-case throughput in the 3D case.
– Optimal for even radix k (e.g. 8 x 8 x 8 mesh).
– Within a factor of 1/k^2 of optimal for odd radix (e.g. 7 x 7 x 7 mesh).
Good latency performance.
– Hop count is only a factor of 1.33 of minimal routing (much better than the 2x cost of VAL, the only known routing algorithm with optimal throughput).
– In practice, 3D meshes are asymmetric because the number of device layers is less than the number of tiles per edge.
– e.g., for a 16 x 16 x 4 mesh (4 layers), RPM’s hop count is just a factor of 1.1 of minimal routing.

7 Outline
– Motivation for our work
– Existing 2D routing algorithms don’t extend well into 3D
– RPM routing algorithm
– Simulation results
– Extensions and future work

8 Existing Routing Algorithms
The 2D case
Dimension-Ordered Routing (DOR)
– Route minimal XY
Valiant load-balancing (VAL)
– Route source → randomly chosen intermediate node → destination
– Route minimal XY in both phases
ROMM
– Same as VAL, but intermediate node restricted to minimal direction
Orthogonal 1-TURN (O1TURN)
– Route minimal XY and YX with equal probability
Extending to the 3D case …
Dimension-Ordered Routing (DOR)
– Route minimal XYZ
Valiant load-balancing (VAL)
– Route source → randomly chosen intermediate node → destination
– Route minimal XYZ in both phases
ROMM
– Same as VAL, but intermediate node restricted to minimal direction
Orthogonal 1-TURN (O1TURN)
– Route along one of 6 minimal orthogonal paths (XYZ, XZY, YXZ, YZX, ZXY, ZYX) with equal probability
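
The four 3D schemes listed on this slide can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function names, the coordinate-tuple representation, and the reading of ROMM's "minimal direction" as "intermediate node inside the bounding box of source and destination" are assumptions made here.

```python
import random
from itertools import permutations

def route_in_order(src, dst, order):
    """Minimal path from src to dst, correcting one dimension at a time in `order`."""
    path = [tuple(src)]
    cur = list(src)
    for dim in order:
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path

def dor_route(src, dst):
    """DOR: always correct X, then Y, then Z."""
    return route_in_order(src, dst, (0, 1, 2))

def o1turn_route(src, dst, rng=random):
    """O1TURN in 3D: pick one of the 6 dimension orders uniformly at random."""
    order = rng.choice(list(permutations((0, 1, 2))))
    return route_in_order(src, dst, order)

def val_route(src, dst, k, rng=random):
    """VAL: XYZ to a uniformly random intermediate node, then XYZ to dst."""
    mid = tuple(rng.randrange(k) for _ in range(3))
    return dor_route(src, mid) + dor_route(mid, dst)[1:]

def romm_route(src, dst, rng=random):
    """ROMM (assumed reading): intermediate node drawn from the minimal bounding box."""
    mid = tuple(rng.randint(min(s, d), max(s, d)) for s, d in zip(src, dst))
    return dor_route(src, mid) + dor_route(mid, dst)[1:]
```

With this representation, DOR, O1TURN, and ROMM always return a path with the minimal number of hops, while VAL paths can be longer because the intermediate node is unconstrained.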

9 Worst-Case Throughput
The best theoretical normalized worst-case throughput is known to be 50% (well-known result).
Worst-case throughput analysis can be reduced to a maximal weighted matching problem [Towles’02].
VAL achieves this optimal throughput, but has poor latency.
As shown next, DOR, ROMM, and O1TURN are all far from optimal in 3D.

10 Poor Worst-Case Throughput
[Figure: worst-case throughput plot; annotations: “Only 6-15%”, “VAL/Optimal”]

11 How do 2D mesh algorithms fare in 3D?
Worst-case throughput of DOR, ROMM, and O1TURN is far from optimal.
Average hop count of VAL is far from minimal.
Need a routing algorithm that can trade latency for worst-case throughput.
[Figure: charts for an 8 x 8 x 8 network comparing VAL, DOR, ROMM, and O1TURN on hop count (normalized to minimal), normalized worst-case throughput, and normalized average-case throughput]

12 Why does O1TURN perform poorly in 3D?
O1TURN: worst-case throughput is optimal for 2D but more than 3 times worse than optimal for 3D.
The difference
– A 2D traffic matrix is “admissible” for a 2D mesh.
– In 3D, the projected traffic on each 2D plane is no longer admissible.
Can we transform the 3D routing problem into routing admissible traffic on each 2D plane?
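
One way to read the admissibility point is to count how much traffic lands on a single (x, y) column of one layer. The sketch below is an illustration under stated assumptions, not the slide's analysis: it takes "admissible" to mean that no column sends or receives more than one unit of normalized traffic, uses the DOR-WC permutation (x,y,z) → (k-z-1, k-y-1, k-x-1) from the simulation slides as the example pattern, and all function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def dor_wc(node, k):
    """Example permutation: (x, y, z) -> (k-z-1, k-y-1, k-x-1)."""
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)

def max_column_load_own_layer(k, layer, pattern=dor_wc):
    """Max traffic per destination column when sources route XY inside their own layer."""
    received = Counter()
    for x, y in product(range(k), repeat=2):
        dx, dy, _ = pattern((x, y, layer), k)
        received[(dx, dy)] += 1
    return max(received.values())

def max_column_load_balanced(k, pattern=dor_wc):
    """Max traffic per destination column in one layer when every source
    spreads its unit of traffic evenly over all k layers."""
    received = Counter()
    for node in product(range(k), repeat=3):
        dx, dy, _ = pattern(node, k)
        received[(dx, dy)] += Fraction(1, k)
    return max(received.values())
```

For k = 8, the first function reports a column load of 8 (far above one unit), while the second reports exactly 1. Under these assumptions, spreading traffic uniformly over the layers is what makes the per-plane traffic admissible again.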

13 Outline
– Motivation for our work
– Existing 2D algorithms don’t extend well into 3D
– RPM routing algorithm
– Simulation results
– Extensions and future work

14 Randomized Partially-Minimal Routing (RPM)
– Phase 1: Z routing from the source to a random intermediate layer
– XY or YX routing on the intermediate layer
– Phase 2: Z routing from the intermediate layer to the destination
[Figure: 3D mesh with X, Y, Z axes showing source, destination, and the random intermediate layer]
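
The three steps on this slide can be sketched as a path generator. This is a minimal sketch under assumptions, not the authors' code: the names `rpm_route` and `_walk` are hypothetical, nodes are (x, y, z) tuples, the intermediate layer is drawn uniformly from all layers, and XY versus YX is chosen with equal probability.

```python
import random

def _walk(path, dim, target):
    """Append unit steps along `dim` until the last node's coordinate equals `target`."""
    cur = list(path[-1])
    step = 1 if target > cur[dim] else -1
    while cur[dim] != target:
        cur[dim] += step
        path.append(tuple(cur))

def rpm_route(src, dst, num_layers, rng=random):
    """Return one randomly chosen RPM path from src to dst as a list of nodes."""
    mid_z = rng.randrange(num_layers)      # random intermediate layer
    path = [tuple(src)]
    _walk(path, 2, mid_z)                  # Phase 1: Z to the intermediate layer
    first, second = rng.choice([(0, 1), (1, 0)])
    _walk(path, first, dst[first])         # minimal XY or YX on that layer
    _walk(path, second, dst[second])
    _walk(path, 2, dst[2])                 # Phase 2: Z to the destination layer
    return path
```

The X and Y segments are always minimal; only the Z segments can be longer than necessary, which is where the "partially minimal" in the name comes from.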

15 Main Idea
Load-balance uniformly across the vertical layers.
Minimal XY/YX routing is used on each layer.
Main result: RPM has near-optimal worst-case throughput.
– Achieves optimal worst-case throughput when the network radix k is even
– Within a factor of 1/k^2 of optimal when k is odd

16 RPM achieves Near-Optimal Worst Case Throughput (optimal for even radix)
[Figure: worst-case throughput plot; curves labeled VAL/Optimal and RPM]

17 Average-Case Throughput
RPM outperforms VAL, DOR, ROMM, and O1TURN in average-case throughput on randomly generated traffic.

18 Average Hop Count
Normalized hop count of RPM
– Symmetric meshes: 1.33 times minimal, compared to 2x for VAL
– Asymmetric 16x16x4 mesh: 1.1 times minimal
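
The 1.33 and 1.1 factors can be reproduced with a short calculation. This is a back-of-the-envelope check rather than the slide's derivation, and it assumes uniformly random source/destination pairs and a uniformly random intermediate layer. Under those assumptions RPM travels the minimal distance in X and Y but, on average, twice the mean Z distance. The function names are hypothetical.

```python
from fractions import Fraction

def mean_distance(k):
    """Expected |a - b| for a, b independent and uniform on 0..k-1."""
    total = sum(abs(a - b) for a in range(k) for b in range(k))
    return Fraction(total, k * k)

def normalized_rpm_hops(kx, ky, kz):
    """Average RPM hop count divided by average minimal hop count."""
    xy = mean_distance(kx) + mean_distance(ky)
    z = mean_distance(kz)
    # source -> random layer -> destination covers two independent Z distances
    return (xy + 2 * z) / (xy + z)

print(float(normalized_rpm_hops(8, 8, 8)))    # symmetric mesh
print(float(normalized_rpm_hops(16, 16, 4)))  # asymmetric 16x16x4 mesh
```

For any symmetric k x k x k mesh the ratio is exactly 4/3, and for the 16 x 16 x 4 mesh it is 21/19 (about 1.105), which agrees with the figures quoted on the slide.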

19 Outline
– Motivation for our work
– Existing 2D routing algorithms don’t extend well into 3D
– RPM routing algorithm
– Simulation results
– Extensions and future work

20 Flit-Level Simulation
Ideal throughput evaluation assumes
– Ideal single-cycle router
– Infinite buffers
– No contention in switches, no flow control
Flit-level simulation
– PopNet network simulator
– 4-stage router pipeline: route computation, VC allocation, switch arbitration, link traversal
– Credit-based flow control
– 8 virtual channels, each 5 flits deep
– Multi-flit packets injected into the network (5 flits/packet)

21 Flit-Level Simulation (cont’d)
Network configurations simulated
– 4 x 4 x 4 mesh
– 8 x 8 x 8 mesh
– 16 x 16 x 4 mesh
Routing algorithms compared: DOR, VAL, ROMM, O1TURN, DUATO, RPM
– DUATO is a minimal adaptive routing algorithm implemented for comparison
Four different traffic traces used
– Transpose traffic: (x,y,z) → (y,z,x)
– Complement traffic: (x,y,z) → (k-x-1, k-y-1, k-z-1)
– Uniform traffic
– Worst-case traffic pattern for DOR (DOR-WC): (x,y,z) → (k-z-1, k-y-1, k-x-1)
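
The three deterministic traffic patterns above translate directly into destination functions. The sketch below assumes a symmetric k x k x k mesh, since the slide's formulas use a single radix k and do not say how they are adapted to the 16 x 16 x 4 case. The function names are hypothetical.

```python
from itertools import product

def transpose(node, k):
    """Transpose traffic: (x, y, z) -> (y, z, x)."""
    x, y, z = node
    return (y, z, x)

def complement(node, k):
    """Complement traffic: (x, y, z) -> (k-x-1, k-y-1, k-z-1)."""
    x, y, z = node
    return (k - x - 1, k - y - 1, k - z - 1)

def dor_worst_case(node, k):
    """DOR-WC traffic: (x, y, z) -> (k-z-1, k-y-1, k-x-1)."""
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)

def is_permutation(pattern, k):
    """True if every node is the destination of exactly one source."""
    nodes = list(product(range(k), repeat=3))
    return sorted(pattern(n, k) for n in nodes) == nodes
```

All three are permutations of the node set, so every node sends to exactly one destination and receives from exactly one source. Uniform traffic, by contrast, picks destinations at random.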

22 Uniform Traffic
[Figure: simulation results for the 8x8x8 mesh and the 16x16x4 mesh]

23 Transpose Traffic
[Figure: simulation results for the 8x8x8 mesh and the 16x16x4 mesh]

24 Complement Traffic
[Figure: simulation results for the 8x8x8 mesh and the 16x16x4 mesh]

25 DOR-WC Traffic
[Figure: simulation results for the 8x8x8 mesh and the 16x16x4 mesh]

26 To sum it up …
3D IC technology is emerging.
Stacking cores in 3 dimensions offers several advantages over 2D placement of cores.
2D minimal mesh routing algorithms have poor worst-case throughput in 3D, and VAL has a high latency penalty.
RPM trades off latency (partially minimal) for better worst-case performance (near-optimal).

27 Thank You
Questions?