Highly Fault-tolerant NoC Routing with Application-aware Congestion Management
  Doowon Lee, Ritesh Parikh and Valeria Bertacco
  University of Michigan
Wide Range of Applications
  Everyday applications, physical simulation, scientific applications, computational chemistry, computational biology, semiconductor simulation, cloud computing
  Applications vary in computation characteristics, user requirements, etc.
  Picture sources:
    1. N-body simulation: https://www.astro.rug.nl/~weygaert
    2. Semiconductor: http://spectrum.ieee.org
    3. Computational biology: http://csbio.cs.umn.edu/
    4. Molecular structure: http://nanotechnologyuniverse.com
Application Running on Network-on-Chip
  Application example: a video encoder mapped onto a chip multiprocessor with a network-on-chip (NoC)
  Analysis: communication frequency (number of flits) from a 64-thread simulation of SPLASH-2 (ocean)
  [Figure: communication frequency by source and destination]
  Some pairs communicate more frequently than others
  Picture sources:
    1. Video encoder: Gary Sullivan et al., Standardized Extensions of High Efficiency Video Coding (HEVC)
    2. Tilera TILE-Gx8072: http://www.tilera.com
Fragile Networks-on-Chip
  Tail of transistor scaling: 22 nm (Intel), 14 nm, 7 nm (IBM)
  Transistor density increases while transistor reliability decreases: permanent faults
  The network-on-chip is a possible single point of failure
  Solution: network-on-chip routing reconfiguration
How to reduce NoC degradation from faults?
  Network-on-chip reconfiguration entails performance degradation
  Motivating experiment: faults vs. performance degradation
  [Figure: state-of-the-art routing reconfiguration [Aisopos 11] compared against the minimum throughput requirement and our goal]
  KEY IDEA: application-aware routing optimized to the application's communication patterns
Application-Aware Routing (1/2)
  How do we find adaptive routing optimized to communication patterns?
  Problem: various route options (no restriction): path diversity = 6, but deadlock is possible
  Solution 1: avoid deadlock by restricting turns: deadlock-free, but path diversity = 3
  [Figure: routes between S and D before and after turn restriction]
Application-Aware Routing (2/2)
  How do we find adaptive routing optimized to communication patterns?
  Problem: various route options (no restriction): path diversity = 6, but deadlock is possible
  Solution 2: avoid deadlock by restricting turns: path diversity = 6
  [Figure: routes between S and D before and after turn restriction]
  Where to best place turn restrictions? NP-complete problem
  OUR CONTRIBUTION: turn-restriction placement heuristic
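The effect of a turn restriction on path diversity can be explored with a small counter. The sketch below is an illustration only, not the FATE implementation. It assumes a fault-free 2D mesh with (x, y) node coordinates, minimal routes only, and a hypothetical encoding of a turn as (node, incoming direction, outgoing direction).

```python
from functools import lru_cache

# Hypothetical direction encoding: unit steps on an (x, y) grid.
MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def path_diversity(src, dst, forbidden_turns=frozenset()):
    """Count minimal paths from src to dst that use no forbidden turn."""
    forbidden_turns = frozenset(forbidden_turns)
    # Only directions that reduce the distance to dst are productive.
    dirs = []
    if dst[0] != src[0]:
        dirs.append("E" if dst[0] > src[0] else "W")
    if dst[1] != src[1]:
        dirs.append("N" if dst[1] > src[1] else "S")

    @lru_cache(maxsize=None)
    def count(node, came_from):
        if node == dst:
            return 1
        total = 0
        for d in dirs:
            # Skip an axis once its coordinate already matches dst.
            if d in ("E", "W") and node[0] == dst[0]:
                continue
            if d in ("N", "S") and node[1] == dst[1]:
                continue
            # A change of direction at this node is a turn; skip forbidden ones.
            if came_from is not None and came_from != d \
                    and (node, came_from, d) in forbidden_turns:
                continue
            dx, dy = MOVES[d]
            total += count((node[0] + dx, node[1] + dy), d)
        return total

    return count(src, None)

unrestricted = path_diversity((0, 0), (2, 2))
restricted = path_diversity((0, 0), (2, 2), {((1, 0), "E", "N")})
```

With the source and destination two hops apart in each dimension, the unrestricted count is 6. Forbidding a single east-to-north turn at node (1, 0) removes the two paths that use it. Which turns are restricted on the slides is not recoverable from the text, so the restriction used here is an arbitrary example.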
Presentation Outline
  FATE (Fault- and Application-aware Turn-model Extension)
    (1) Turn-enabling rules: "How to reduce search?"
    (2) Load estimation: "Which is the most valuable turn?"
    (3) Overall routing computation algorithm
  Experimental evaluation
  Conclusions
How to reduce turn-restriction search?
  To avoid unfruitful turn-restriction patterns:
    pattern 1. network disconnection
    pattern 2. non-minimal restriction
    pattern 3. possible deadlock
  [Figure: one example mesh per pattern]
Turn-Enabling Rules
  To avoid unfruitful turn-restriction patterns, each time a turn is disabled, several others should be enabled
  Basic rules: enable adjacent turns (cycle, node, link)
  Advanced rules: enable remote turns (horizontal, vertical, diagonal)
  [Figure: example meshes for the basic and advanced rules]
Traffic-Load Estimation
  Which is the most valuable turn? Use traffic-load estimation to decide
  Specific goals:
    (1) balancing link utilization
    (2) prioritizing turns that are critical
  Load calculation steps: path diversity, link load, turn load, cycle load, weight scaling
  Take into account hop-by-hop route decisions
Traffic-Load Estimation Step by Step
  Steps: path diversity, link load, turn load, cycle load, weight scaling
  Weight scaling: multiply by communication frequency
  [Figure: example mesh with one source and one destination, annotated at each step; legend: high traffic, medium traffic, low traffic]
Example: Link, Turn, Cycle Load (1/2)
  Traffic-load estimation, 5 steps: diversity, link, turn, cycle, scale
  Link load is derived from path diversity
  [Figure: example mesh with one source and one destination, annotated with path-diversity, link-load, turn-load, and cycle-load values]
Example: Link, Turn, Cycle Load (2/2)
  Traffic-load estimation, 5 steps: diversity, link, turn, cycle, scale
  Link load is derived from path diversity; a turn with no path through it carries no load
  [Figure: example mesh with one source and one destination, annotated with link-load, turn-load, and cycle-load values]
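One way to read "link load from path diversity" together with "hop-by-hop route decisions" is that a packet at each node picks uniformly among its productive output links. The sketch below follows that reading. It is an assumption made for illustration, not the formula from the paper, and it covers only a fault-free mesh with no turn restrictions. The function name and the (x, y) node encoding are hypothetical.

```python
from collections import defaultdict

def hop_by_hop_link_load(src, dst):
    """Estimated load on each directed link for one unit of src-to-dst traffic."""
    step_x = 1 if dst[0] > src[0] else -1
    step_y = 1 if dst[1] > src[1] else -1
    # All nodes inside the bounding box of src and dst, nearest to src first,
    # so every node is processed after all of its predecessors.
    xs = range(src[0], dst[0] + step_x, step_x)
    ys = range(src[1], dst[1] + step_y, step_y)
    nodes = sorted(
        ((x, y) for x in xs for y in ys),
        key=lambda n: abs(n[0] - src[0]) + abs(n[1] - src[1]),
    )
    node_flow = defaultdict(float)
    node_flow[src] = 1.0
    link_load = {}
    for node in nodes:
        if node == dst:
            continue
        # Productive next hops: move along any axis that has not reached dst yet.
        next_hops = []
        if node[0] != dst[0]:
            next_hops.append((node[0] + step_x, node[1]))
        if node[1] != dst[1]:
            next_hops.append((node[0], node[1] + step_y))
        # Assumption: traffic splits equally among the productive next hops.
        share = node_flow[node] / len(next_hops)
        for nxt in next_hops:
            link_load[(node, nxt)] = share
            node_flow[nxt] += share
    return link_load

loads = hop_by_hop_link_load((0, 0), (2, 1))
```

Turn load and cycle load could then be accumulated from these per-link values, for instance by summing the loads of the turns that make up each cycle, but the exact definitions are in the paper rather than on the slides.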
Example: Weight Scaling
  Communication frequency per source-destination pair:
    S1-D1: 20
    S2-D2: 8
  Scaling: the loads of each pair are multiplied by its communication frequency
  [Figure: example mesh with sources S1, S2 and destinations D1, D2, annotated with scaled loads; the most congested cycle is highlighted]
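The scaling step amounts to a weighted sum over source-destination pairs. The sketch below uses the two communication frequencies from the slide (20 and 8); the per-pair load values and the link names are made up for illustration, and the helper name is hypothetical.

```python
from collections import defaultdict

def scale_and_sum(per_pair_load, frequency):
    """Multiply each pair's loads by its communication frequency and add them up."""
    total = defaultdict(float)
    for pair, loads in per_pair_load.items():
        for item, load in loads.items():
            total[item] += frequency[pair] * load
    return dict(total)

# Frequencies from the slide; the unscaled loads below are invented examples.
frequency = {("S1", "D1"): 20, ("S2", "D2"): 8}
per_pair_load = {
    ("S1", "D1"): {"link_a": 0.5, "link_b": 0.5},
    ("S2", "D2"): {"link_a": 0.25},
}
scaled = scale_and_sum(per_pair_load, frequency)
most_loaded = max(scaled, key=scaled.get)
```

The same accumulation works for turn or cycle loads: only the dictionary keys change.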
Putting it all together
  1) Evaluate turns, one at a time (choose the one leading to least congestion)
  2) Apply turn-enabling rules
  Iterate this process until no undecided turn is left
  [Figure: example mesh with S1, S2, D1, D2 before and after one iteration]
Backtracking
  Deadlock is possible due to greedy turn-restriction selections
  Turn-enabling rules do not resolve all deadlock-causing patterns
  When a deadlock is detected, backtrack to the last decision
  [Figure: example placement decision tree with decisions such as node 5 turn NW, node 6 turn NE, and node 3 turn SW; a detected deadlock triggers a backtrack]
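The slides do not say how a deadlock is detected. A standard check, used here as an assumption, is to build the channel dependency graph implied by the allowed turns and test it for cycles: a cycle means deadlock is possible. The sketch below does this for a fault-free mesh with one channel per directed link; the turn encoding (node, incoming direction, outgoing direction) is hypothetical.

```python
DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}

def has_dependency_cycle(width, height, forbidden_turns):
    """Return True if the allowed turns create a cyclic channel dependency."""
    def inside(node):
        return 0 <= node[0] < width and 0 <= node[1] < height

    def neighbor(node, d):
        return (node[0] + DIRS[d][0], node[1] + DIRS[d][1])

    # A channel is (node, direction): the link leaving `node` towards `direction`.
    channels = [((x, y), d) for x in range(width) for y in range(height)
                for d in DIRS if inside(neighbor((x, y), d))]
    deps = {c: [] for c in channels}
    for node, d_in in channels:
        mid = neighbor(node, d_in)
        for d_out in DIRS:
            if d_out == OPPOSITE[d_in]:          # no U-turns
                continue
            if (mid, d_out) not in deps:         # would leave the mesh
                continue
            if (mid, d_in, d_out) in forbidden_turns:
                continue
            deps[(node, d_in)].append((mid, d_out))

    # Kahn's algorithm: if some channel is never released, a cycle exists.
    indegree = {c: 0 for c in channels}
    for successors in deps.values():
        for c in successors:
            indegree[c] += 1
    ready = [c for c in channels if indegree[c] == 0]
    released = 0
    while ready:
        c = ready.pop()
        released += 1
        for nxt in deps[c]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return released < len(channels)

# Example restriction set: XY routing forbids every turn from N/S onto E/W.
xy_forbidden = {((x, y), d_in, d_out) for x in range(3) for y in range(3)
                for d_in in "NS" for d_out in "EW"}
```

An unrestricted mesh yields a cycle, while the XY restriction set yields none. A faulty topology would additionally need the faulty links removed from the channel list.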
FATE Route-Computation Procedure
  Trigger: (1) new application launch (2) fault occurrence
  Procedure flowchart:
    start (trigger)
    loop: estimate traffic load, choose turn to be disabled, apply turn-enabling rules
    deadlock? disconnect? If so, backtrack
    no undecided turn? If so, end
  [Figure: network example; legend: disabled turn, enabled turn, undecided turn; high traffic, medium traffic, low traffic]
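The flowchart can be read as a greedy search with backtracking. The skeleton below is a sketch of that control flow only. The three callbacks (`estimate_load`, `apply_enabling_rules`, `is_safe`) are hypothetical placeholders for the load estimation, the turn-enabling rules, and the deadlock and disconnection checks; none of them is taken from the paper.

```python
def place_turn_restrictions(turns, estimate_load, apply_enabling_rules, is_safe):
    """Decide every turn as 'disabled' or 'enabled'; return None if no placement works."""
    def search(state):
        undecided = [t for t in turns if t not in state]
        if not undecided:
            return state                      # every turn is decided: done
        # Greedy order: try the turn with the lowest estimated load first.
        for turn in sorted(undecided, key=lambda t: estimate_load(state, t)):
            trial = dict(state)
            trial[turn] = "disabled"
            trial = apply_enabling_rules(trial, turn)
            if is_safe(trial):                # no deadlock, no disconnection
                result = search(trial)
                if result is not None:
                    return result
            # Otherwise backtrack: drop this decision and try the next candidate.
        return None
    return search({})

# Toy usage: four turns of one cycle, with made-up loads.
toy_turns = ["NE", "ES", "SW", "WN"]
toy_load = {"NE": 3.0, "ES": 1.0, "SW": 2.0, "WN": 4.0}
placement = place_turn_restrictions(
    toy_turns,
    estimate_load=lambda state, t: toy_load[t],
    # Toy rule: once one turn of the cycle is disabled, enable the rest.
    apply_enabling_rules=lambda state, t: {u: state.get(u, "enabled") for u in toy_turns},
    is_safe=lambda state: True,
)
```

In the toy run the lowest-load turn is disabled and the enabling rule decides the other three, so the search finishes after a single decision. The recursion is kept for brevity; a large mesh would call for an explicit stack.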
Presentation Outline
  FATE routing
  Experimental evaluation
    Experimental setup
    Evaluation on faulty topologies
    Evaluation on fault-free topologies
    Overheads
  Conclusions
Experimental Setup
  BookSim simulation with 8x8 mesh networks
    3-stage router pipeline, 2 VCs/protocol class, 5 flits/VC
  Fault injection
    faults in bidirectional links
    5 fault rates: 1 faulty link, 3%, 5%, 10%, and 15% faulty links
    10 random fault patterns for each fault rate
  Traffic benchmarks
    5 synthetic patterns: bit complement, bit reversal, shuffle, transpose, uniform random
    11 traces from SPLASH-2 multi-threaded workloads, generated from gem5 simulation with MESI cache coherence
    4 memory controllers at mesh corners
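The fault-injection setup can be approximated with a few lines of code. The sketch below is not the authors' tooling. It assumes that the fault rate is a fraction of the bidirectional links, that the link count is rounded to the nearest integer, and that patterns which disconnect the mesh are rejected and redrawn; all three choices are assumptions, as are the function names.

```python
import random
from collections import deque

def mesh_links(n):
    """All bidirectional links of an n x n mesh, each stored once as (low, high)."""
    links = []
    for x in range(n):
        for y in range(n):
            if x + 1 < n:
                links.append(((x, y), (x + 1, y)))
            if y + 1 < n:
                links.append(((x, y), (x, y + 1)))
    return links

def is_connected(n, faulty_links):
    """Breadth-first search over the fault-free links, starting at node (0, 0)."""
    faulty = set(faulty_links)
    seen = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        x, y = queue.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < n and 0 <= nxt[1] < n):
                continue
            link = (min((x, y), nxt), max((x, y), nxt))
            if link in faulty or nxt in seen:
                continue
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == n * n

def random_fault_pattern(n, fault_rate, rng, max_tries=1000):
    """Draw a random set of faulty links that keeps the mesh connected."""
    links = mesh_links(n)
    count = max(1, round(fault_rate * len(links)))
    for _ in range(max_tries):
        faulty = rng.sample(links, count)
        if is_connected(n, faulty):
            return faulty
    raise RuntimeError("no connected fault pattern found")

pattern = random_fault_pattern(8, 0.05, random.Random(0))
```

An 8x8 mesh has 112 bidirectional links, so under the rounding assumption a 5% fault rate corresponds to 6 faulty links. Ten patterns per fault rate would follow from ten different seeds.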
Prior Routing Solutions
  Fault-tolerant routing
    Breadth-First Search (BFS) [Schroeder 91, Aisopos 11]
    Depth-First Search (DFS) [Sancho 04]
  Application-aware routing
    Bandwidth-Sensitive Oblivious Routing (BSOR) [Kinsy 09, Kinsy 13]
    Application-Specific Routing Algorithms (APSRA) [Palesi 08]
  Fully-adaptive routing on 2D mesh (congestion management)
    Dynamic XY (DyXY) [Li 06]
    Neighbor on Path (NoP) [Ascia 08]
    Regional Congestion Awareness (RCA) [Gratz 08]
Saturation Throughput for Synthetic Patterns
  [Figure, first panel: saturation throughput (packet/cycle/router) vs. number of faulty links, for fault-tolerant routing, application-aware routing, and our solution]
    Less performance degradation as faults increase
    Annotated gains (over fault-tolerant routing / over application-aware routing), in slide order: 9.5% / 5.5%, 10.6% / -0.5%, 17.7% / 0.1%, 23.3% / 2.9%, 33.3% / 9.3%
  [Figure, second panel: saturation throughput (packet/cycle/router) per traffic pattern, at 15% fault rate]
    Gains are maximized with unbalanced load
    FATE still provides a gain with uniform load
Packet Latency for SPLASH-2 Traces
  [Figure, first panel: average packet latency (cycles) vs. number of faulty links; annotations: 228 cycles, 59%, 13%]
    Minimal latency increase until 5% faults
    Up to 59% latency reduction over BFS and up to 13% over APSRA
  [Figure, second panel: average packet latency (cycles) per benchmark program, at 15% fault rate]
    Significantly lower latency in 5 programs
Performance on Fault-Free Meshes
  [Figure: saturation throughput (packet/cycle/router) vs. number of VCs, for deterministic, fully-adaptive, fault-tolerant, and application-aware routing and our solution]
  Compared to DOR, fault-tolerant, and application-aware routing, FATE always provides higher saturation throughput (due to better traffic-load estimation)
  Compared to fully-adaptive routing, FATE performs better when the number of VCs is small (it leaves more VCs for normal transfer)
Overheads
  Software computation
    2-4 sec for 8x8 meshes on an Intel Xeon® processor (two orders of magnitude faster than APSRA)
    ~110 turn-placement attempts (little dependence on fault rate)
  Hardware overheads
    Area: 6% increase (routing table, route-computation logic)
    Power consumption: not measured
      Better power efficiency than APSRA
      Can be more power-efficient than application-agnostic solutions when the same routing is reused multiple times
Conclusions
  FATE provides highly fault-tolerant routing with graceful performance degradation by leveraging application traffic patterns
  Performance improvement over existing fault-tolerant routing
    33% improvement in saturation throughput (synthetic traffic patterns)
    59% improvement in packet latency (SPLASH-2 traces)
  Route computation two orders of magnitude faster than APSRA
Thank you! Questions?
Backup Slides
Various Turn-Restriction Choices
  Exponential increase of turn-restriction choices as network size increases
  Example 1: 4 nodes, 4 possibilities
  Example 2: 6 nodes, 16 possibilities (the other 8 cases are not shown)
  A 2-D mesh with $M$ nodes contains $4^{(\sqrt{M}-1)\times(\sqrt{M}-1)}$ possibilities
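The growth of the search space is easy to tabulate. The helper below assumes that the count generalizes to a rectangular mesh as 4 raised to the number of unit cycles, (rows - 1) x (cols - 1), which matches both slide examples (4 nodes and 6 nodes) and reduces to the slide formula for a square mesh. The function name is hypothetical.

```python
def turn_restriction_choices(rows, cols):
    """Number of turn-restriction choices: 4 per unit cycle of the mesh."""
    unit_cycles = (rows - 1) * (cols - 1)
    return 4 ** unit_cycles

choices_4_nodes = turn_restriction_choices(2, 2)    # slide example 1
choices_6_nodes = turn_restriction_choices(2, 3)    # slide example 2
choices_8x8 = turn_restriction_choices(8, 8)        # mesh size used in the evaluation
```

For the 8x8 mesh used in the evaluation this is 4^49, which is why exhaustive search is not an option and a placement heuristic is needed.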
Basic Turn-Enabling Rules (Cycle, Node, Link)
  Which turns should be enabled upon a turn-restriction decision?
    to minimize the number of restrictions
    to guarantee deadlock-freedom
  Rule 1 (cycle), rule 2 (node), rule 3 (link)
  What happens if we break the rules? A violated turn: deadlock happens
  [Figure: one example mesh per rule; turn types: undecided, enabled, disabled]
Advanced Turn-Enabling Rules (Common Link, Opposite-corner Turn)
  Rule 4: common link (horizontal enabling, vertical enabling)
    Why rule 4? Let's apply the basic rules: the turn should be enabled for both candidates
  Rule 5: opposite-corner turn (diagonal enabling)
  [Figure: example meshes for rules 4 and 5; turn types: undecided, enabled (basic), enabled (advanced), disabled, candidate]
  See paper for details
Applying Basic Turn-Enabling Rules to Faulty Topologies
  Rule 1 (cycle): special case, no double count: counted only for one cycle
    Mutual turn: deadlock when disabling only the mutual turn
  Rules 2 & 3 (node & link): no special change
  [Figure: example faulty meshes with a mutual turn]
Applying Advanced Turn-Enabling Rules to Faulty Topologies
  Rule 4 (common link): apply only towards fault-free directions
  Rule 5 (opposite-corner turn): apply as if fault-free
  [Figure: example faulty meshes for rules 4 and 5]