1
Destination-Based Adaptive Routing for 2D Mesh Networks
ANCS 2010
Rohit Sunkam Ramanujam, Bill Lin
Electrical and Computer Engineering, University of California, San Diego
2
Networks-on-Chip
Chip-multiprocessors (CMPs) increasingly popular
2D-mesh networks often used as on-chip fabric
Routing algorithm central in determining performance
[Figures: Tilera Tile64; Intel 48-core data center on die (ISSCC 2010)]
3
Classes of Routing Algorithms
Oblivious routing
  + Simple and fast router designs
  – Poor load balancing under bursty traffic
Adaptive routing
  + Better performance (throughput, latency)
  + Better fault tolerance
  – Higher router complexity
4
Related Work
Oblivious routing [Valiant, ROMM, O1TURN, optimal oblivious routing]
  – Optimize for worst-case and average-case performance
Adaptive routing commercially used in multiprocessors from IBM, Cray, Compaq
On-chip routing very different from off-chip:
  – Lower power
  – Lower area
  – Lower router complexity
5
Outline
Introduction
Motivation
Destination-Based Adaptive Routing (DAR)
Evaluation
6
Minimal Adaptive Routing Model
  – Adaptive routing along minimal directions
[Figure: mesh with source S and destination D]
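To make "minimal directions" concrete, here is a minimal sketch of how the candidate output ports could be derived from the current and destination coordinates. The coordinate convention (x grows to the east, y grows to the north) and the function name are assumptions for illustration, not details taken from the slides.

```python
def candidate_ports(cur, dst):
    """Return the output ports that lie on some minimal path from cur to dst.

    cur and dst are (x, y) tuples. Assumed convention: x grows to the east,
    y grows to the north. "Ej" is the ejection port used on arrival.
    """
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx:
        ports.append("E")
    elif dx < cx:
        ports.append("W")
    if dy > cy:
        ports.append("N")
    elif dy < cy:
        ports.append("S")
    # No remaining offset in either dimension: the packet has arrived.
    return ports or ["Ej"]


print(candidate_ports((1, 1), (2, 2)))  # two minimal directions to choose from
print(candidate_ports((1, 2), (2, 2)))  # only one minimal direction left
```

An adaptive router only has a real choice when two ports are returned; with one port the route is forced.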
7
Granularity of Congestion Estimation
[Axis: coarse to fine]
Local congestion
8
Local Congestion
Local adaptive
  – Measure local congestion metric (free VCs, free buffers)
[Figure: mesh with source S and destination D; legend: low, moderate, high congestion; paths shown: optimal, local adaptive]
9
Granularity of Congestion Estimation
[Axis: coarse to fine]
Local congestion
Dimension-based congestion
10
Dimension-based Congestion
RCA-1D (Gratz et al., HPCA '08)
  – Exponential moving average of congestion to all nodes along a dimension
[Figure: mesh with source S and destination D; legend: low, moderate, high congestion; paths shown: optimal, RCA-1D]
11
Granularity of Congestion Estimation
[Axis: coarse to fine]
Local congestion
Dimension-based congestion
Quadrant-based congestion
12
Quadrant-based Congestion
RCA-Quadrant (Gratz et al., HPCA '08)
  – Exponential moving average of congestion to all nodes in the destination quadrant
[Figure: mesh with source S and destination D; legend: low, moderate, high congestion; path shown: optimal]
14
Quadrant-based Congestion
RCA-Quadrant (Gratz et al., HPCA '08)
  – Exponential moving average of congestion to all nodes in the destination quadrant
[Figure: mesh with source S and destination D; legend: low, moderate, high congestion; paths shown: optimal, RCA-quad]
15
Granularity of Congestion Estimation
[Axis: coarse to fine]
Local congestion
Dimension-based congestion
Quadrant-based congestion
Destination-based congestion
16
Ideally …
On a per-destination basis:
  – Estimate end-to-end delay along all minimal paths to destination
  – Choose path with least delay
[Figure: mesh with source S and destination D; legend: low, moderate, high congestion; path shown: optimal]
17
Challenges
Limited bandwidth for congestion updates
  – Congestion notification not instantaneous
Limited storage in on-chip routers
  – Exponential number of paths to each destination
Limited hardware resources for computations
How can we practically emulate ideal adaptive routing?
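The storage concern can be quantified with a short calculation. In a 2D mesh, a minimal path between two nodes that are dx columns and dy rows apart is any ordering of dx horizontal and dy vertical hops, so there are C(dx + dy, dx) such paths. The sketch below (function name chosen for illustration) evaluates this for corner-to-corner traffic.

```python
from math import comb


def minimal_path_count(src, dst):
    """Number of distinct minimal paths between two (x, y) mesh nodes."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    # Choose which of the dx + dy hops are taken in the x dimension.
    return comb(dx + dy, dx)


for n in (4, 8, 16):
    corner_to_corner = minimal_path_count((0, 0), (n - 1, n - 1))
    print(f"{n}x{n} mesh, corner to corner: {corner_to_corner} minimal paths")
```

Tracking a delay value per path is therefore impractical, which motivates keeping state per output port and per destination instead.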
18
Destination-based adaptive routing (DAR)
A node estimates delay to all other nodes through candidate outputs every T cycles
[Figure: source S and destination D; example delay estimates L[N][D] = 20, L[E][D] = 30]
19
DAR: High Level
Traffic distribution to output ports controlled using per-destination split ratios W
  – Start with initial set of split ratios
  – Estimate delay to destination through candidate outputs
  – Shift traffic from more congested port to less congested port
[Figure: source S and destination D; split ratios W[N][D] = 0.6, W[E][D] = 0.4; delay estimates L[N][D] = 20, L[E][D] = 30]
20
DAR: High Level
Traffic distribution to output ports controlled using per-destination split ratios W
  – Start with initial set of split ratios
  – Estimate delay to destination through candidate outputs
  – Shift traffic from more congested port to less congested port
[Figure: source S and destination D; split ratios W[N][D] = 0.8, W[E][D] = 0.2; delay estimates L[N][D] = 20, L[E][D] = 30]
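One way to picture how split ratios steer traffic is a weighted random choice of output port per packet. The slides do not say how the router actually realizes the ratios in hardware, so the sketch below is only an assumed software model; `pick_output_port` is a hypothetical name.

```python
import random


def pick_output_port(split_ratios, rng):
    """Pick one candidate port with probability equal to its split ratio.

    split_ratios maps port name to ratio for a single destination, for
    example {"N": 0.6, "E": 0.4}. This randomized choice is an assumption
    for illustration, not the mechanism described in the talk.
    """
    ports = list(split_ratios)
    weights = [split_ratios[p] for p in ports]
    return rng.choices(ports, weights=weights, k=1)[0]


rng = random.Random(0)          # fixed seed so the run is repeatable
W_D = {"N": 0.6, "E": 0.4}      # split ratios toward destination D
counts = {"N": 0, "E": 0}
for _ in range(10_000):
    counts[pick_output_port(W_D, rng)] += 1
print(counts)                   # roughly a 60/40 division of packets
```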
21
Outline
Introduction
Motivation
Destination-Based Adaptive Routing (DAR)
  – Distributed delay measurement
  – Split ratio adaptation
  – Scaling
Evaluation
22
Distributed Delay Measurement
A node maintains:
  – Per-destination traffic split ratio through candidate output ports: W[p][j]
  – Delay to next-hop router/ejection interface through each output port (N, S, E, W, Ej): l[p]
23
Distributed Delay Measurement
Every node estimates average delay to all other nodes in the network
[Figure: 4x4 mesh, nodes 0 to 15; node 10 sends Avg_10[10] to its neighbors]
1. Delay from node 10 to itself: Avg_10[10] = l_10[Ej]
2. Avg_10[10] propagated to neighbors
3. Nodes 6, 9, 14, 11 add local delay to Avg_10[10] to compute delay to node 10
4. For example, at node 9:
   L[E][10] = l[E] + Avg_10[10]
   Avg_9[10] = L[E][10]
24
Distributed Delay Measurement
Every node estimates delay to all other nodes in the network
[Figure: 4x4 mesh, nodes 0 to 15; nodes 6, 9, 11, 14 send Avg_6[10], Avg_9[10], Avg_11[10], Avg_14[10] upstream]
1. Nodes 6, 9, 14, 11 propagate estimated delay to node 10 to upstream neighbors
2. For example, node 5 receives two delay updates, from nodes 9 and 6:
   A[E][10] = Avg_6[10]
   A[N][10] = Avg_9[10]
3. Node 5 adds local link delay to received delay update:
   L[E][10] = A[E][10] + l[E]
   L[N][10] = A[N][10] + l[N]
4. Finally, average delay from node 5 to node 10 is computed as:
   Avg_5[10] = W[E][10] L[E][10] + W[N][10] L[N][10]
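The propagation steps can be sketched in software as a single sweep outward from the destination. This is a simplified model under stated assumptions: node ids follow id = y * 4 + x (consistent with node 5 reaching node 6 through E and node 9 through N), the local delays and equal split ratios are made-up example values, and the whole mesh is updated in one pass, whereas the routers in the talk exchange updates every T cycles.

```python
N = 4  # 4x4 mesh; assumed numbering: node id = y * N + x
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def coords(node):
    return node % N, node // N


def neighbor(node, port):
    x, y = coords(node)
    dx, dy = STEP[port]
    return (y + dy) * N + (x + dx)


def candidate_ports(node, dest):
    (x, y), (tx, ty) = coords(node), coords(dest)
    ports = []
    if tx != x:
        ports.append("E" if tx > x else "W")
    if ty != y:
        ports.append("N" if ty > y else "S")
    return ports


def hops(node, dest):
    (x, y), (tx, ty) = coords(node), coords(dest)
    return abs(tx - x) + abs(ty - y)


def estimate_delays(dest, local_delay, split):
    """Return (L, avg): per-port and average delay estimates toward dest."""
    avg = {dest: local_delay[dest]["Ej"]}      # Avg_dest[dest] = l[Ej]
    L = {}
    # Visit nodes closest to dest first so downstream averages already exist.
    for node in sorted(range(N * N), key=lambda n: hops(n, dest)):
        if node == dest:
            continue
        ports = candidate_ports(node, dest)
        # L[p][dest] = A[p][dest] + l[p], where A is the neighbor's average.
        L[node] = {p: local_delay[node][p] + avg[neighbor(node, p)] for p in ports}
        # Avg[dest] = sum over candidate ports of W[p][dest] * L[p][dest].
        avg[node] = sum(split[node][p] * L[node][p] for p in ports)
    return L, avg


DEST = 10
# Hypothetical inputs: unit delay everywhere, except a slow N link at node 5.
local_delay = {n: {p: 1.0 for p in ("N", "S", "E", "W", "Ej")} for n in range(N * N)}
local_delay[5]["N"] = 3.0
split = {}
for n in range(N * N):
    ports = candidate_ports(n, DEST)
    split[n] = {p: 1.0 / len(ports) for p in ports}

L, avg = estimate_delays(DEST, local_delay, split)
print("node 5 per-port delays:", L[5])
print("node 5 average delay:", avg[5])
```

With these inputs node 5 sees a lower delay through E than through N, which is the signal the split ratio adaptation step then acts on.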
26
Outline
Introduction
Motivation
Destination-Based Adaptive Routing (DAR)
  – Distributed delay measurement
  – Split ratio adaptation
  – Scaling
Evaluation
27
Adaptation of Split Ratio
Objective: Equalize delay on candidate output ports
If only one candidate output, split ratio is 1
If two candidate outputs:
  – Let p_h be the port with higher delay to destination j
  – Let p_l be the port with lower delay to destination j
  – W[p_h][j] + W[p_l][j] = 1
  – Δ traffic shifted from p_h to p_l every T cycles
  – Δ proportional to (L[p_h][j] - L[p_l][j]) / L[p_h][j]
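The update rule above can be sketched as a small function. The proportionality constant is not given in the slides; `gain` below is a hypothetical parameter, and its default of 0.6 was picked only so that the example reproduces the shift from 0.6/0.4 to 0.8/0.2 shown on the DAR high-level slides. The clamp that keeps ratios within [0, 1] is likewise an assumption.

```python
def adapt_split(W, L, gain=0.6):
    """Return updated split ratios for one destination.

    W maps each candidate port to its current split ratio, L maps each
    candidate port to its estimated delay. `gain` is a hypothetical
    proportionality constant; the slides only say delta is proportional
    to the relative delay difference.
    """
    if len(W) == 1:
        (port,) = W
        return {port: 1.0}                 # single candidate: all traffic
    p_h = max(L, key=L.get)                # port with higher delay
    p_l = min(L, key=L.get)                # port with lower delay
    if p_h == p_l or L[p_h] <= 0:
        return dict(W)                     # delays already equal: no shift
    delta = gain * (L[p_h] - L[p_l]) / L[p_h]
    delta = min(delta, W[p_h])             # cannot shift more than p_h carries
    return {p_h: W[p_h] - delta, p_l: W[p_l] + delta}


W = {"N": 0.6, "E": 0.4}
L = {"N": 20.0, "E": 30.0}
new_W = adapt_split(W, L)
print({port: round(ratio, 3) for port, ratio in new_W.items()})
```

Calling the function once per destination every T cycles moves traffic toward the less congested port while keeping the two ratios summing to 1.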
28
Granularity of Congestion Estimation
[Axis: coarse to fine]
Local congestion
Dimension-based congestion
Quadrant-based congestion
Destination-based congestion (does not scale!)
29
Granularity of Congestion Estimation
[Axis: coarse to fine]
Local congestion
Dimension-based congestion
Quadrant-based congestion
Destination-based congestion
Scalable destination-based congestion
30
Outline
Introduction
Motivation
Destination-Based Adaptive Routing (DAR)
  – Distributed delay measurement
  – Split ratio adaptation
  – Scaling
Evaluation
31
Look-ahead Window
Node S maintains delay estimates for an MxM window centered at S
Any node outside the window is mapped to the closest node within the window
A packet's look-ahead window shifts as it is routed from source to destination
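A minimal sketch of the window mapping, assuming (x, y) coordinates, an odd window size M, and "closest" meaning fewest hops: clamping each coordinate of the destination to the window bounds yields the nearest in-window node. The function names are illustrative.

```python
def clamp(value, low, high):
    return max(low, min(high, value))


def map_to_window(cur, dst, m):
    """Map dst to the closest node inside the m x m window centered at cur.

    cur and dst are (x, y) tuples; m is assumed to be odd so the window
    extends (m - 1) // 2 hops from cur in every direction.
    """
    r = (m - 1) // 2
    return (
        clamp(dst[0], cur[0] - r, cur[0] + r),
        clamp(dst[1], cur[1] - r, cur[1] + r),
    )


# 7x7 window: destinations farther than 3 hops in a dimension get clamped.
print(map_to_window((0, 0), (15, 15), 7))    # far destination, both dims clamped
print(map_to_window((10, 10), (12, 9), 7))   # destination already in the window
```

Because the clamped node always lies between the current node and the true destination in each dimension, it stays inside the mesh and in the same minimal quadrant.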
32
Window Size
Destination D guaranteed to be within window when packet is (M-1)/2 hops away from D
Intuition: Packet has (M-1)/2 hops to route around congestion hot spots
7x7 look-ahead window in 16x16 mesh has comparable performance to DAR (equivalent to 31x31 look-ahead window)
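The window arithmetic can be checked with a few lines. The 31x31 figure follows from requiring that a window centered on any node, including a corner, covers the whole 16x16 mesh: the farthest node is 15 columns (or rows) away, so M = 2 * 15 + 1. The helper names below are made up for this sketch.

```python
def lookahead_hops(m):
    """Hops from the destination at which it is guaranteed to be in the window."""
    return (m - 1) // 2


def full_cover_window(n):
    """Smallest odd window size that covers an n x n mesh from any node."""
    return 2 * (n - 1) + 1


mesh, window = 16, 7
print("look-ahead hops for 7x7 window:", lookahead_hops(window))
print("window equivalent to full DAR on 16x16:", full_cover_window(mesh))
print("nodes in window vs nodes in mesh:", window * window, "vs", mesh * mesh)
```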
33
Outline
Introduction
Related work
Destination-Based Adaptive Routing (DAR)
Evaluation
34
Experimental Setup
Compare DAR with RCA-1D, RCA-quadrant, local adaptive
SPLASH-2 benchmarks + synthetic traffic patterns (uniform, transpose, shuffle)
Cycle-accurate NoC simulator models 3-stage router pipeline
8 VCs, 5 flits deep
1 VC used as escape VC for deadlock prevention
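For readers unfamiliar with the synthetic patterns, the sketch below uses the common textbook definitions: uniform picks any other node at random, transpose sends (x, y) to (y, x), and shuffle rotates the node id's bits left by one. The slides do not state which exact variants the simulator used, so treat these as assumptions.

```python
import random


def uniform_dest(node, num_nodes, rng):
    """Uniform random traffic: any node other than the source, equally likely."""
    dest = rng.randrange(num_nodes - 1)
    return dest if dest < node else dest + 1


def transpose_dest(x, y):
    """Transpose traffic: node (x, y) sends to node (y, x)."""
    return y, x


def shuffle_dest(node, num_nodes):
    """Shuffle traffic: rotate the node id left by one bit.

    num_nodes must be a power of two (64 for an 8x8 mesh).
    """
    bits = num_nodes.bit_length() - 1
    msb = (node >> (bits - 1)) & 1
    return ((node << 1) & (num_nodes - 1)) | msb


rng = random.Random(1)
print(uniform_dest(5, 64, rng))
print(transpose_dest(1, 2))
print(shuffle_dest(0b100000, 64))
```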
35
Splash results – 7x7 mesh
[Figure: results chart; highlighted value: 41%]
36
Splash results – 7x7 mesh
[Figure: results chart; highlighted value: 65%]
37
Uniform traffic – 8x8 mesh
38
Transpose traffic – 8x8 mesh
39
Shuffle traffic – 8x8 mesh
40
SDAR – 16x16 mesh, 7x7 window
[Figure: average latency over 100 permutation traffic patterns at 18% injection load]
[Figure: network saturation statistics at 18% injection load]
41
Summary
Destination-based Adaptive Routing (DAR) for 2D mesh networks
Scalable DAR (SDAR) uses a look-ahead window and easily scales to large networks
DAR outperforms existing adaptive and oblivious routing
SDAR achieves comparable performance with significantly lower overhead
42
Thank you!!
43
Key implementation details
Simple router implementation: low storage, low bandwidth
Synchronize delay updates to reuse delay computation and weight adaptation hardware
Approximate computations to simplify implementation
44
Router architecture – Kim et al., DAC '05
[Figure: router block diagram; labeled blocks: quadrant port pre-select, routing unit, VC allocator, XB allocator, input VCs (VC-1 to VC-v) per input port, preferred output registers, congestion value registers, credits, override; ports: N, S, E, W, Ej]
45
DAR Router
46
Distributed delay measurement
A node maintains:
  – Per-destination traffic split ratio through candidate output ports: W[p][j]
  – Delay to next-hop router/ejection interface through each output port (N, S, E, W, Ej): l[p]
Using updates received from downstream nodes, a node computes:
  – L[p][j]: Average delay from current node to node j through output port p
  – Avg[j]: Average delay from current node to node j
47
Destination-based Adaptive Routing (DAR)
Every router maintains per-destination split ratios which control traffic distribution to output ports
Split ratios adjusted every T cycles based on measured delay to D through the two ports
[Figure: mesh with source S and destination D; legend: low, moderate, high congestion; split ratios shown: 0.8, 0.2, 0.7, 0.3, 1, 1]