Doowon Lee, Ritesh Parikh and Valeria Bertacco University of Michigan


Highly Fault-tolerant NoC Routing with Application-aware Congestion Management

Wide Range of Applications
Everyday applications, physical simulation, scientific applications (computational chemistry, computational biology), semiconductor simulation, cloud computing.
Applications vary in computation characteristics, user requirements, etc.
(Picture sources: 1. N-body simulation: https://www.astro.rug.nl/~weygaert 2. semiconductor: http://spectrum.ieee.org 3. computational biology: http://csbio.cs.umn.edu/ 4. molecular structure: http://nanotechnologyuniverse.com)

Application Running on a Network-on-Chip
Application example: a video encoder, mapped onto a chip multiprocessor with a network-on-chip (NoC).
Figure: communication frequency (number of flits) for each source-destination pair in a 64-thread simulation of SPLASH-2 (ocean).
Analysis: some pairs (e.g., A and B) communicate more frequently than others.
(Picture sources: 1. Video encoder: Gary Sullivan et al., Standardized Extensions of High Efficiency Video Coding (HEVC) 2. Tilera TILE-Gx8072: http://www.tilera.com)
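The communication-frequency analysis amounts to summing flits per source-destination pair and ranking the pairs. The Python sketch below assumes a hypothetical trace format of `(src, dst, flits)` records; `comm_frequency` and `hottest_pairs` are illustrative names, and the toy trace is invented for the example.

```python
from collections import Counter


def comm_frequency(records):
    """Sum flit counts per (source, destination) pair.

    records: iterable of (src, dst, flits) tuples (hypothetical trace format).
    """
    freq = Counter()
    for src, dst, flits in records:
        freq[(src, dst)] += flits
    return freq


def hottest_pairs(freq, k=3):
    """Return the k source-destination pairs that exchange the most flits."""
    return [pair for pair, _ in freq.most_common(k)]


# Toy trace: pair (0, 9) communicates far more often than the others.
trace = [(0, 9, 5), (3, 12, 1), (0, 9, 5), (7, 2, 2), (0, 9, 5)]
freq = comm_frequency(trace)
print(freq[(0, 9)])              # 15
print(hottest_pairs(freq, k=1))  # [(0, 9)]
```

A matrix like this is also the natural input for the weight-scaling step of the load estimation.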

Fragile Networks-on-Chip
Technology scaling: 22 nm (Intel), 14 nm, 7 nm (IBM), approaching the tail of transistor scaling.
Increasing transistor density → lower transistor reliability → permanent faults.
The network-on-chip is a possible single point of failure.
Solution: network-on-chip routing reconfiguration.

How to reduce NoC degradation from faults?
Network-on-chip reconfiguration entails performance degradation.
Motivating experiment (figure): performance degradation as faults increase, comparing state-of-the-art routing reconfiguration [Aisopos 11] and our goal against a minimum throughput requirement.
KEY IDEA: application-aware routing, optimized to the application's communication patterns.

Application-Aware Routing (1/2)
How do we find adaptive routing optimized to communication patterns?
Problem: allowing various route options (no turn restrictions) between source S and destination D gives path diversity = 6, but deadlock is possible.
Solution 1: avoid deadlock by restricting turns. The routing becomes deadlock-free, but path diversity drops to 3.
(Figure: per-node path counts from S to D in each case.)

Application-Aware Routing (2/2)
How do we find adaptive routing optimized to communication patterns?
Problem: with no restriction, path diversity = 6, but deadlock is possible.
Solution 2: again avoid deadlock by restricting turns, but place the restrictions so that path diversity between S and D stays at 6.
Where should turn restrictions best be placed? This is an NP-complete problem.
OUR CONTRIBUTION: a turn-restriction placement heuristic.
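Path diversity under a given set of turn restrictions can be computed by counting minimal paths with memoized recursion. This is a minimal Python sketch, not the authors' code: it assumes a 2D mesh with `(x, y)` node coordinates and represents a turn as a hypothetical `(node, in_dir, out_dir)` tuple, where `in_dir` is the direction the packet was travelling when it reached `node`.

```python
from functools import lru_cache

# Unit moves in a 2D mesh; nodes are (x, y) coordinates.
DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def path_diversity(src, dst, disabled=frozenset()):
    """Count minimal paths from src to dst that use no disabled turn.

    A turn is (node, in_dir, out_dir). Going straight is never restricted.
    """
    (sx, sy), (tx, ty) = src, dst
    # On a minimal path only moves toward the destination are productive.
    productive = [d for d, (dx, dy) in DIRS.items()
                  if (dx and (tx - sx) * dx > 0) or (dy and (ty - sy) * dy > 0)]

    @lru_cache(maxsize=None)
    def count(node, in_dir):
        if node == dst:
            return 1
        total = 0
        for d in productive:
            nxt = (node[0] + DIRS[d][0], node[1] + DIRS[d][1])
            # Never leave the bounding box spanned by src and dst.
            if not (min(sx, tx) <= nxt[0] <= max(sx, tx)
                    and min(sy, ty) <= nxt[1] <= max(sy, ty)):
                continue
            if in_dir is not None and in_dir != d and (node, in_dir, d) in disabled:
                continue  # this turn is restricted
            total += count(nxt, d)
        return total

    return count(src, None)


print(path_diversity((0, 0), (2, 2)))      # 6: every minimal path is allowed
xy = frozenset(((x, y), a, b) for x in range(3) for y in range(3)
               for a in "NS" for b in "EW")  # XY routing forbids Y-to-X turns
print(path_diversity((0, 0), (2, 2), xy))  # 1
```

The two prints contrast the unrestricted case with the most restrictive deterministic choice; a placement heuristic aims for restriction sets that stay deadlock-free while keeping this count high for heavily communicating pairs.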

Presentation Outline
FATE (Fault- and Application-aware Turn-model Extension)
(1) Turn-enabling rules: "How to reduce search?"
(2) Load estimation: "Which is the most valuable turn?"
(3) Overall routing computation algorithm
Experimental evaluation
Conclusions

How to reduce turn-restriction search?
Avoid unfruitful turn-restriction patterns:
pattern 1. network disconnection
pattern 2. non-minimal restriction
pattern 3. possible deadlock
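Patterns 1 and 3 can be detected mechanically. The Python sketch below swaps in the standard channel-dependency-graph test (a routing function is deadlock-free if the dependency graph over the turns it allows is acyclic) and a reachability check; it is an illustration under stated assumptions, not necessarily how FATE performs these checks. Faulty links are given as hypothetical `frozenset({u, v})` node pairs, turns as hypothetical `(node, in_dir, out_dir)` tuples, and pattern 2 is not covered.

```python
from collections import deque
from itertools import product

DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def build_cdg(w, h, disabled=frozenset(), faulty=frozenset()):
    """Channel dependency graph of a w x h mesh with faulty links removed.

    A channel is (u, v, d): the directed link u -> v in direction d.
    (u, v, d1) depends on (v, x, d2) if a packet may turn d1 -> d2 at v.
    U-turns are excluded; going straight is always allowed.
    """
    chans = []
    for x, y in product(range(w), range(h)):
        for d, (dx, dy) in DIRS.items():
            v = (x + dx, y + dy)
            if 0 <= v[0] < w and 0 <= v[1] < h and frozenset({(x, y), v}) not in faulty:
                chans.append(((x, y), v, d))
    out_of = {}
    for c in chans:
        out_of.setdefault(c[0], []).append(c)
    deps = {c: [] for c in chans}
    for u, v, d1 in chans:
        for c2 in out_of.get(v, []):
            if c2[1] == u:
                continue  # U-turn
            if d1 != c2[2] and (v, d1, c2[2]) in disabled:
                continue  # restricted turn
            deps[(u, v, d1)].append(c2)
    return out_of, deps


def deadlock_possible(w, h, disabled=frozenset(), faulty=frozenset()):
    """True if the dependency graph has a cycle (Kahn's topological sort fails)."""
    _, deps = build_cdg(w, h, disabled, faulty)
    indeg = {c: 0 for c in deps}
    for succs in deps.values():
        for c in succs:
            indeg[c] += 1
    queue = deque(c for c, k in indeg.items() if k == 0)
    seen = 0
    while queue:
        c = queue.popleft()
        seen += 1
        for s in deps[c]:
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    return seen < len(deps)


def disconnected(w, h, disabled=frozenset(), faulty=frozenset()):
    """True if some source cannot reach some node using only allowed turns."""
    out_of, deps = build_cdg(w, h, disabled, faulty)
    nodes = set(product(range(w), range(h)))
    for src in nodes:
        start = out_of.get(src, [])
        reached = {src} | {c[1] for c in start}
        seen, queue = set(start), deque(start)
        while queue:
            for nxt in deps[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    reached.add(nxt[1])
                    queue.append(nxt)
        if reached != nodes:
            return True
    return False


xy = frozenset(((x, y), a, b) for x in range(3) for y in range(3)
               for a in "NS" for b in "EW")  # XY routing: no Y-to-X turns
print(deadlock_possible(2, 2))      # True: unrestricted turns form cycles
print(deadlock_possible(3, 3, xy))  # False
cut = frozenset({frozenset({(0, 0), (1, 0)})})
print(disconnected(3, 3, xy, cut))  # True: (0, 0) can only go north
```

The last line shows why a fixed turn model is fragile: a single faulty link disconnects XY routing, which is the situation turn-restriction placement must avoid.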

Turn-Enabling Rules
To avoid unfruitful turn-restriction patterns, each time a turn is disabled, several other turns should be enabled.
Basic rules: enable adjacent turns (cycle, node, link).
Advanced rules: enable remote turns (horizontal, vertical, diagonal).

Traffic-Load Estimation
Which is the most valuable turn? Use traffic-load estimation to decide.
Specific goals: (1) balancing link utilization; (2) prioritizing turns that are critical.
Load calculation steps: path diversity → link load → turn load → cycle load → weight scaling, taking hop-by-hop route decisions into account.

Traffic-Load Estimation Step by Step
Figure: load estimation for one source-destination pair on an example mesh, with links shaded as high, medium, or low traffic.
Steps: path diversity → link load → turn load → cycle load → weight scaling (multiply by communication frequency).

Example: Link, Turn, Cycle Load (1/2)
Traffic-load estimation in 5 steps: diversity, link, turn, cycle, scale.
Link load is derived from path diversity (figure annotation: next hops with path diversity 2 and 4, sum 6).
Turn loads in the example include 0.25, 0.125, 0.17, and 0.33.
Cycle load for the cycle through nodes 9, 10, 14, and 13: 1/2 = 0.5.
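The link- and turn-load steps can be sketched for a single source-destination pair. This is a minimal Python sketch, not FATE's implementation: it assumes minimal routing on a 2D mesh with `(x, y)` node coordinates, represents a turn as a hypothetical `(node, in_dir, out_dir)` tuple, and assumes that at every hop traffic splits across the allowed next hops in proportion to their downstream path diversity, which is one reading of the slide's "2, 4, sum 6" annotation. `estimate_flow_load` is a hypothetical name.

```python
from collections import defaultdict
from functools import lru_cache

DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def estimate_flow_load(src, dst, disabled=frozenset()):
    """Link and turn load for one unit of traffic from src to dst.

    Assumption: each hop splits traffic over the allowed next hops in
    proportion to their downstream path diversity.
    """
    (sx, sy), (tx, ty) = src, dst

    def step(node, d):
        return (node[0] + DIRS[d][0], node[1] + DIRS[d][1])

    def inside(n):
        return min(sx, tx) <= n[0] <= max(sx, tx) and min(sy, ty) <= n[1] <= max(sy, ty)

    def dist(n):
        return abs(tx - n[0]) + abs(ty - n[1])

    def options(node, in_dir):
        """Next directions that get closer to dst and avoid disabled turns."""
        return [d for d in DIRS
                if inside(step(node, d)) and dist(step(node, d)) < dist(node)
                and not (in_dir and in_dir != d and (node, in_dir, d) in disabled)]

    @lru_cache(maxsize=None)
    def diversity(node, in_dir):
        if node == dst:
            return 1
        return sum(diversity(step(node, d), d) for d in options(node, in_dir))

    frac = defaultdict(float, {(src, None): 1.0})  # traffic share per (node, in_dir)
    link, turn = defaultdict(float), defaultdict(float)
    for k in range(dist(src), 0, -1):  # process states k hops from dst
        for (node, in_dir), f in list(frac.items()):
            if dist(node) != k or f == 0.0:
                continue
            opts = options(node, in_dir)
            total = sum(diversity(step(node, d), d) for d in opts)
            for d in opts:
                nxt = step(node, d)
                share = f * diversity(nxt, d) / total if total else 0.0
                if share == 0.0:
                    continue  # dead end under the current restrictions
                link[(node, nxt)] += share
                if in_dir is not None and in_dir != d:
                    turn[(node, in_dir, d)] += share
                frac[(nxt, d)] += share
    return dict(link), dict(turn)


link, turn = estimate_flow_load((0, 0), (2, 1))
print(round(link[((0, 0), (1, 0))], 3))  # 0.667: 2 of the 3 minimal paths start east
```

Because the state includes the incoming direction, the same function also reflects how disabled turns shift load onto the remaining paths.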

Example: Link, Turn, Cycle Load (2/2)
Traffic-load estimation in 5 steps: diversity, link, turn, cycle, scale.
Link load is derived from path diversity; one turn in this example carries no path.
Turn loads in the example include 0.25, 0.125, 0.17, and 0.33; the loads on the cycle through nodes 9, 10, 14, and 13 sum to 0.375.

Example: Weight Scaling
Communication frequency: S1 → D1 = 20, S2 → D2 = 8.
Each flow's estimated loads are scaled by its communication frequency and summed; the scaled loads identify the most congested cycle.
(Figure: scaled loads on the example mesh, with the most congested cycle highlighted.)
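Cycle load and weight scaling can be sketched as follows, assuming (as the 0.375 sum in the previous example suggests) that a cycle's load is the sum of the loads of its four turns, and that each flow's turn loads are multiplied by its communication frequency. The per-flow turn loads below are hypothetical toy values; only the frequencies 20 and 8 come from the slide. Turns use a hypothetical `(node, in_dir, out_dir)` encoding on a mesh with `(x, y)` coordinates.

```python
from collections import defaultdict


def unit_cycles(w, h):
    """Turns of every clockwise and counter-clockwise unit cycle in a w x h mesh."""
    cycles = {}
    for x in range(w - 1):
        for y in range(h - 1):
            ll, lr, ur, ul = (x, y), (x + 1, y), (x + 1, y + 1), (x, y + 1)
            # Counter-clockwise: east along the bottom, north, west, south.
            cycles[("ccw", (x, y))] = [(lr, "E", "N"), (ur, "N", "W"),
                                       (ul, "W", "S"), (ll, "S", "E")]
            # Clockwise: north along the left, east, south, west.
            cycles[("cw", (x, y))] = [(ul, "N", "E"), (ur, "E", "S"),
                                      (lr, "S", "W"), (ll, "W", "N")]
    return cycles


def scaled_turn_load(per_flow_turn_load, frequency):
    """Weight scaling: multiply each flow's turn loads by its frequency and sum."""
    total = defaultdict(float)
    for flow, loads in per_flow_turn_load.items():
        for t, load in loads.items():
            total[t] += frequency[flow] * load
    return total


def most_congested_cycle(w, h, turn_load):
    """Cycle load = sum of its four turn loads; return the heaviest cycle."""
    loads = {cid: sum(turn_load.get(t, 0.0) for t in turns)
             for cid, turns in unit_cycles(w, h).items()}
    best = max(loads, key=loads.get)
    return best, loads[best]


frequency = {"S1D1": 20, "S2D2": 8}
per_flow = {  # hypothetical per-flow turn loads
    "S1D1": {((1, 0), "E", "N"): 0.5},
    "S2D2": {((1, 1), "N", "W"): 0.25, ((2, 1), "S", "W"): 0.5},
}
scaled = scaled_turn_load(per_flow, frequency)
print(most_congested_cycle(3, 3, scaled))  # (('ccw', (0, 0)), 12.0)
```

The heaviest cycle is where a turn restriction hurts most, so a congestion-aware heuristic would prefer to break cycles elsewhere first.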

Putting it all together
1) Evaluate turns one at a time, choosing the one leading to the least congestion.
2) Apply the turn-enabling rules.
Iterate this process until no undecided turn is left.

Backtracking
Deadlock is still possible because turn restrictions are selected greedily, and the turn-enabling rules do not resolve all deadlock-causing patterns.
When a deadlock is detected, backtrack to the last decision.
(Figure: example placement decision tree with decisions node 5 turn NW, node 6 turn NE, and node 3 turn SW; a deadlock is detected and the search backtracks.)

FATE Route-Computation Procedure
Triggers: (1) new application launch; (2) fault occurrence.
Procedure flowchart: start (trigger) → loop { estimate traffic load → choose turn to be disabled → apply turn-enabling rules → deadlock or disconnection? if so, backtrack } until no undecided turn is left → end.
(Network example legend: disabled, enabled, and undecided turns; high, medium, and low traffic.)
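The control flow of the procedure (greedy choice, rule application, validity check, backtracking) can be sketched as a recursive search. This is a hypothetical Python skeleton, not FATE's code: the congestion estimate, the turn-enabling rules, and the deadlock/disconnection check are passed in as callbacks, and the toy callbacks in the usage lines are invented for illustration. In the worst case the backtracking is exhaustive.

```python
def fate_like_search(undecided, cost, enabling_rules, invalid,
                     disabled=frozenset(), enabled=frozenset()):
    """Greedy turn-restriction placement with backtracking (hypothetical sketch).

    undecided:               frozenset of turns not yet decided
    cost(d):                 estimated congestion if the turns in d are disabled
    enabling_rules(t, d, e): turns to enable after disabling turn t
    invalid(d, e):           True if the decision risks deadlock or disconnection
    Returns (disabled, enabled), or None if every branch fails.
    """
    if not undecided:
        return disabled, enabled  # no undecided turn is left
    # Greedy: try candidates from least to most estimated congestion.
    for t in sorted(undecided, key=lambda c: cost(disabled | {c})):
        d = disabled | {t}
        e = enabled | (enabling_rules(t, d, enabled) & undecided)
        if invalid(d, e):
            continue  # backtrack: try the next-best turn at this decision
        result = fate_like_search(undecided - d - e, cost, enabling_rules, invalid, d, e)
        if result is not None:
            return result
    return None  # every candidate failed; the caller backtracks further


# Toy usage with invented callbacks.
weights = {"t1": 0, "t2": 1, "t3": 2}
cost = lambda d: sum(weights[t] for t in d)
rules = lambda t, d, e: set(weights) - {t}  # toy rule: enable all other turns
invalid = lambda d, e: "t1" in d            # pretend disabling t1 disconnects
print(fate_like_search(frozenset(weights), cost, rules, invalid))
```

Here the cheapest candidate `t1` is rejected, so the search falls back to `t2`, whose toy rule enables the remaining turns and ends the loop.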

Presentation Outline
FATE routing
Experimental evaluation: experimental setup; evaluation on faulty topologies; evaluation on fault-free topologies; overheads
Conclusions

Experimental Setup
BookSim simulation with 8×8 mesh networks: 3-stage router pipeline, 2 VCs per protocol class, 5 flits per VC.
Fault injection: faults in bidirectional links; 5 fault rates (1 faulty link, and 3%, 5%, 10%, and 15% faulty links); 10 random fault patterns for each fault rate.
Traffic benchmarks: 5 synthetic patterns (bit complement, bit reversal, shuffle, transpose, uniform random); 11 traces from SPLASH-2 multi-threaded workloads, generated from gem5 simulation with MESI cache coherence and 4 memory controllers at the mesh corners.
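The synthetic patterns and the fault injection can be sketched for the 8×8 mesh. The destination functions below follow the common textbook definitions on 6-bit node addresses (node id = y * 8 + x), which may differ in detail from the simulator configuration used in the talk. The fault-count rounding in `faults_for_rate` is an assumption, and the sketch does not filter out fault patterns that disconnect the mesh; all function names are hypothetical.

```python
import random

K = 8          # 8x8 mesh, node id = y * K + x
N = K * K      # 64 nodes
BITS = 6       # address bits
MASK = N - 1


def bit_complement(s):
    return s ^ MASK


def bit_reversal(s):
    return int(format(s, f"0{BITS}b")[::-1], 2)


def shuffle(s):
    return ((s << 1) | (s >> (BITS - 1))) & MASK  # rotate address left by 1


def transpose(s):
    return ((s << BITS // 2) | (s >> BITS // 2)) & MASK  # (x, y) -> (y, x)


def uniform_random(s, rng):
    return rng.randrange(N)


def mesh_links(k=K):
    """All bidirectional links of a k x k mesh as (smaller id, larger id) pairs."""
    links = []
    for y in range(k):
        for x in range(k):
            n = y * k + x
            if x + 1 < k:
                links.append((n, n + 1))
            if y + 1 < k:
                links.append((n, n + k))
    return links


def faults_for_rate(rate):
    """Assumption: a fault rate is a fraction of all links, rounded."""
    return round(rate * len(mesh_links()))


def fault_pattern(num_faults, seed):
    """One reproducible random set of faulty bidirectional links."""
    return set(random.Random(seed).sample(mesh_links(), num_faults))


counts = [1] + [faults_for_rate(r) for r in (0.03, 0.05, 0.10, 0.15)]
patterns = {n: [fault_pattern(n, seed) for seed in range(10)] for n in counts}
print(counts)  # [1, 3, 6, 11, 17] faulty links out of 112
```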

Prior Routing Solutions
Fault-tolerant routing: Breadth-First Search (BFS) [Schroeder 91, Aisopos 11]; Depth-First Search (DFS) [Sancho 04].
Application-aware routing: Bandwidth-Sensitive Oblivious Routing (BSOR) [Kinsy 09, Kinsy 13]; Application-Specific Routing Algorithms (APSRA) [Palesi 08].
Fully-adaptive routing on 2D mesh (congestion management): Dynamic XY (DyXY) [Li 06]; Neighbor on Path (NoP) [Ascia 08]; Regional Congestion Awareness (RCA) [Gratz 08].

Saturation Throughput for Synthetic Patterns
Figure: saturation throughput (packet/cycle/router) vs. number of faulty links, for fault-tolerant routing, application-aware routing, and our solution. Our solution shows less performance degradation as faults increase: up to 33.3% higher than fault-tolerant routing and 9.3% higher than application-aware routing. (Gain labels on the figure: 9.5%, 5.5%, 10.6%, -0.5%, 17.7%, 0.1%, 23.3%, 2.9%, 33.3%, 9.3%.)
Figure: saturation throughput (packet/cycle/router) per traffic pattern at a 15% fault rate. Gains are maximized with unbalanced load, and our solution still provides a gain with uniform load.

Packet Latency for SPLASH-2 Traces
Figure: average packet latency (cycles) vs. number of faulty links. Latency increases minimally up to 5% faults, with up to 59% (13%) latency reduction over BFS (APSRA). (A 228-cycle point is labeled on the figure.)
Figure: average packet latency (cycles) per benchmark program at a 15% fault rate: significantly lower latency in 5 programs.

Performance on Fault-Free Meshes
Figure: saturation throughput (packet/cycle/router) vs. number of VCs for deterministic, fully-adaptive, fault-tolerant, and application-aware routing, and our solution.
Compared to DOR, fault-tolerant, and application-aware routing, FATE always provides higher saturation throughput (due to better traffic-load estimation).
Compared to fully-adaptive routing, FATE outperforms at small numbers of VCs (due to more VCs available for normal transfer).

Overheads
Software computation: 2-4 sec for 8×8 meshes on an Intel Xeon® processor (two orders of magnitude faster than APSRA); ~110 turn-placement attempts (little dependence on fault rate).
Hardware overheads: 6% area increase (routing table, route-computation logic).
Power consumption: not measured. Better power efficiency than APSRA; can be more power-efficient than application-agnostic solutions when the same routing is reused multiple times.

Conclusions
FATE provides highly fault-tolerant routing with graceful performance degradation by leveraging application traffic patterns.
Performance improvement over existing fault-tolerant routing: 33% improvement in saturation throughput (synthetic traffic patterns); 59% improvement in packet latency (SPLASH-2 traces).
Route computation two orders of magnitude faster than APSRA.

Thank you! Questions?

Backup Slides

Various Turn-Restriction Choices
The number of turn-restriction choices increases exponentially with network size.
Example 1: 4 nodes, 4 possibilities.
Example 2: 6 nodes, 16 possibilities (the other 8 cases are not shown).
A 2-D mesh with M nodes contains $4^{(\sqrt{M}-1)\times(\sqrt{M}-1)}$ possibilities.
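The count can be evaluated directly. The sketch assumes the square-mesh formula generalizes to a rows × cols mesh as 4^((rows−1)(cols−1)), i.e. 4 choices per unit square, which reproduces both examples if the 6-node example is a 2×3 mesh; `turn_restriction_choices` is a hypothetical name.

```python
def turn_restriction_choices(rows, cols):
    """Turn-restriction choices for a rows x cols mesh.

    Assumes the slide's 4^((sqrt(M)-1)*(sqrt(M)-1)) generalizes to
    4^((rows-1)*(cols-1)): 4 choices per unit square of the mesh.
    """
    return 4 ** ((rows - 1) * (cols - 1))


print(turn_restriction_choices(2, 2))             # 4  (example 1: 4 nodes)
print(turn_restriction_choices(2, 3))             # 16 (example 2: 6 nodes)
print(turn_restriction_choices(8, 8) == 2 ** 98)  # True: 4^49 for an 8x8 mesh
```

For the 8×8 meshes used in the evaluation this is about 3.2 × 10^29 combinations, which is why exhaustive search is ruled out in favor of a heuristic.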

Basic Turn-Enabling Rules (Cycle, Node, Link)
Which turns should be enabled upon a turn-restriction decision? Goals: minimize the number of restrictions while guaranteeing deadlock freedom.
Rule 1 (cycle), rule 2 (node), rule 3 (link).
What happens if we break the rules? With a violated turn, deadlock happens.
(Figure legend, turn types: undecided, enabled, disabled.)

Advanced Turn-Enabling Rules (Common Link, Opposite-corner Turn)
Rule 4 (common link): horizontal and vertical enabling. Why rule 4? Applying the basic rules shows that these turns should be enabled for both candidates.
Rule 5 (opposite-corner turn): diagonal enabling.
See the paper for details.
(Figure legend, turn types: undecided, enabled (basic), enabled (advanced), disabled, candidate.)

Applying Basic Turn-Enabling Rules to Faulty Topologies
Rule 1 (cycle): special case, no double count (a turn is counted for only one cycle).
Rules 2 & 3 (node & link): no special change.
(Figure: a mutual turn; deadlock occurs when only the mutual turn is disabled.)

Applying Advanced Turn-Enabling Rules to Faulty Topologies
Rule 4 (common link): apply only towards fault-free directions.
Rule 5 (opposite-corner turn): apply as if fault-free.