GCA: Global Congestion Awareness for Load Balance in Networks-on- Chip Mukund Ramakrishna, Paul V. Gratz & Alex Sprintson Department of Electrical and.

1 GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip. Mukund Ramakrishna, Paul V. Gratz & Alex Sprintson. Department of Electrical and Computer Engineering, Texas A&M University

2 Networks-on-Chip Moore’s Law is putting more and more transistors on the chip. NoCs scale better than traditional interconnects. High interconnect latencies translate into idle processor core cycles and wasted power. [Figures: Tilera Tile64; Intel Single-Chip Cloud Computer (ISSCC 2010)]

3 Routing in NoCs Real workloads are unbalanced in nature. Oblivious routing (dimension-order routing, DOR): –Tends to exacerbate congestion. Adaptive routing: –Tries to avoid congested spots –Classified on the basis of awareness: local adaptive, regionally aware, globally aware. [Figure: fft benchmark from SPLASH-2 under DOR; darker arrows = higher congestion]

4 Outline: Introduction; Motivation for Global Awareness; Related Work; Global Congestion Awareness (GCA): –Route Computation –Information Propagation; Implementation; Evaluation; Conclusion and Future Work

5 Local Congestion Local adaptive: –Measures a local congestion metric (free VCs, free buffers) –Makes greedy local decisions due to poor visibility. [Figure: mesh with source S and destination D; legend: low, moderate and high congestion; paths shown: optimal, local adaptive]

6 Regional Awareness Regionally aware (Gratz et al. HPCA ’08, Ma et al. ISCA ’11): –Aggregated congestion of all nodes in a dimension –Noisy information degrades performance. [Figure: mesh with source S and destination D; legend: low, moderate and high congestion; paths shown: optimal, RCA-1D]

7 Ideally … On a per-destination basis: –Evaluate the end-to-end delay along all minimal paths to the destination –Pick the path with the least delay. [Figure: mesh with source S and destination D; legend: low, moderate and high congestion; path shown: optimal]
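
The ideal scheme on this slide can be sketched in a few lines of Python. This is only an illustration of the brute-force idea, not the authors' method: the coordinate scheme, the `congestion` dictionary keyed by directed link, and all function names are assumptions made for the example.

```python
def minimal_paths(src, dst):
    """Yield every minimal path from src to dst in a 2D mesh as a list of (x, y) nodes."""
    if src == dst:
        yield [src]
        return
    (sx, sy), (dx, dy) = src, dst
    step_x = (dx > sx) - (dx < sx)  # -1, 0 or +1 toward the destination
    step_y = (dy > sy) - (dy < sy)
    if step_x:
        for rest in minimal_paths((sx + step_x, sy), dst):
            yield [src] + rest
    if step_y:
        for rest in minimal_paths((sx, sy + step_y), dst):
            yield [src] + rest


def path_cost(path, congestion):
    """Sum the congestion of each directed link on the path (missing links count as 0)."""
    return sum(congestion.get((a, b), 0) for a, b in zip(path, path[1:]))


def best_minimal_path(src, dst, congestion):
    """Exhaustively pick the minimal path with the least total congestion."""
    return min(minimal_paths(src, dst), key=lambda p: path_cost(p, congestion))


# Hypothetical congestion values on a 2x2 corner of the mesh.
congestion = {
    ((0, 0), (1, 0)): 5, ((1, 0), (1, 1)): 1,
    ((0, 0), (0, 1)): 2, ((0, 1), (1, 1)): 1,
}
best = best_minimal_path((0, 0), (1, 1), congestion)
```

The number of minimal paths grows combinatorially with distance (20 paths for a 3-by-3 hop offset), which is why exhaustive evaluation is impractical inside a router.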

8 Global Awareness Earlier schemes use a separate congestion monitoring network (Manevich et al. DSD ’11, Ramanujam et al. ANCS ’10): –Increased network complexity –Slow route calculation mechanism in DAR. Challenges: –Low-overhead dissemination technique –Limited resources for storage and computation

9 Outline: Introduction; Motivation for Global Awareness; Global Congestion Awareness (GCA): –Route Computation –Information Propagation; Implementation; Evaluation; Conclusion and Future Work

10 GCA: Bird’s eye view 1. Congestion information is conveyed by piggybacking it onto the header flits. 2. Every node builds a “map” representing the congestion of the network. 3. The optimal path is calculated using a shortest-path graph algorithm in each router.

11 Packet-level “piggybacking” of congestion information in header flits (Zhang et al. PrimeAsia ’10, Chen et al. NOCS ’12). Back-annotation appends the congestion information for the link in the opposite direction. [Figure legend: direction of flit traversal in black; congestion information appended in red]
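
A minimal sketch of back-annotation, under stated assumptions: the slides do not give the header layout, so the `HeaderFlit` and `Router` classes, the unbounded annotation list, and the reading that "opposite direction" means the sender's own output link facing away from the flit's direction of travel are all hypothetical.

```python
from dataclasses import dataclass, field

OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


@dataclass
class HeaderFlit:
    dest: tuple
    # Each entry is (node, output_port, congestion); a real header would be size-limited.
    annotations: list = field(default_factory=list)


class Router:
    def __init__(self, node):
        self.node = node
        self.local_congestion = {p: 0 for p in OPPOSITE}  # per output port
        self.congestion_map = {}  # (node, port) -> last known congestion

    def send(self, flit, out_port):
        """Piggyback the metric of the link opposite to the flit's direction of travel."""
        back_port = OPPOSITE[out_port]
        flit.annotations.append((self.node, back_port, self.local_congestion[back_port]))
        return flit

    def receive(self, flit):
        """Fold the piggybacked values into this router's congestion map."""
        for node, port, value in flit.annotations:
            self.congestion_map[(node, port)] = value


a, b = Router((0, 0)), Router((1, 0))
a.local_congestion["W"] = 7               # hypothetical reading on a's west output link
flit = a.send(HeaderFlit(dest=(2, 0)), "E")
b.receive(flit)                           # b now knows about a's westward link
```

The point of the design is that information flows against the traffic that will later use it: a flit moving east tells its eastern neighbour about westward links, which is what that neighbour needs when it routes packets west.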

12 Router Micro-Architecture [Figure: router block diagram. Labels: input ports N, E, W, S and In, each with VC-1 to VC-n; VC Allocator; XB Allocator; Routing Unit containing Congestion Map, Route Compute Hardware and Optimal Output Port Table; Header Modification; Local Congestion Values; Traffic Vector; Destination Node; crossbar (X); output ports N, E, W, S and Ej]

13 Router Micro-Architecture [Figure: same router block diagram and labels as slide 12]

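14 Router Micro-Architecture [Figure: add-and-compare unit with two adders (+) and a comparator (<). Labels: inputs d1, d2, l1, l2, P1, P2; outputs P_out, D_out; a further label "4"]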

15 Route Computation The node marked 0 is the source. The number on a link denotes its congestion. The number in a node denotes the shortest-path cost to that node. The letter denotes the optimal output port.

16 Route Computation (contd.) At most two feeder nodes for every node. Pick the feeder node with the least-cost path. For nodes a hop away: –Cost = congestion of the connecting link

17 Route Computation (contd.) Simple “add and compare” step. Example: top-left node –From East port: 3+1=4 –From South port: 8+0=8 –Cost assigned = 4

18 Route Computation (contd.) Every iteration flows outward. Every quadrant is computed in parallel. Only the downstream sub-graph is re-evaluated on every update.
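
The outward add-and-compare pass described on these slides can be sketched in software as follows. This is an illustrative model rather than the router hardware: the function name, the `link_cost` dictionary keyed by directed link, and the convention that "E" means +x and "N" means +y are assumptions. The stored letter is the output port to take at the source.

```python
def compute_routes(src, width, height, link_cost):
    """Return (cost, port): least congestion cost to every node over minimal paths,
    and the output port at src that starts the best path."""
    sx, sy = src
    cost, port = {src: 0}, {}
    # Visit nodes by increasing hop distance so that feeder nodes are finished first.
    nodes = sorted(
        ((x, y) for x in range(width) for y in range(height)),
        key=lambda n: abs(n[0] - sx) + abs(n[1] - sy),
    )
    for node in nodes:
        if node == src:
            continue
        x, y = node
        feeders = []  # at most two: the neighbours one hop closer to the source
        if x != sx:
            feeders.append(((x - 1 if x > sx else x + 1, y), "E" if x > sx else "W"))
        if y != sy:
            feeders.append(((x, y - 1 if y > sy else y + 1), "N" if y > sy else "S"))
        best = None
        for feeder, direction in feeders:
            total = cost[feeder] + link_cost[(feeder, node)]           # add
            first_port = direction if feeder == src else port[feeder]
            if best is None or total < best[0]:                         # compare
                best = (total, first_port)
        cost[node], port[node] = best
    return cost, port


# Hypothetical 2x2 mesh reusing the arithmetic from the example slide (3+1 versus 8+0).
links = {
    ((0, 0), (1, 0)): 3, ((1, 0), (1, 1)): 1,
    ((0, 0), (0, 1)): 8, ((0, 1), (1, 1)): 0,
}
cost, port = compute_routes((0, 0), 2, 2, links)
```

Because every node looks only at its feeders, the four quadrants around the source never share intermediate results, which is consistent with the slide's note that they can be computed in parallel.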

19 Caveats

20 Limited GCA (LGCA) Constrains the visibility to a smaller window. Stores information only for nodes k hops or less away. Reduces storage overhead at the cost of a slight performance penalty vis-à-vis GCA.
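
To make the storage trade-off concrete, here is a small sketch of the k-hop window. It assumes hop count equals Manhattan distance in the mesh and reuses a hypothetical congestion map keyed by (node, port); the function names are illustrative only.

```python
def within_window(router, node, k):
    """True if node is at most k hops from router (Manhattan distance in a 2D mesh)."""
    return abs(node[0] - router[0]) + abs(node[1] - router[1]) <= k


def prune_map(router, congestion_map, k):
    """Keep only congestion entries for nodes inside the k-hop window."""
    return {
        (node, port): value
        for (node, port), value in congestion_map.items()
        if within_window(router, node, k)
    }


def nodes_in_window(router, width, height, k):
    """Count the mesh nodes whose state a router would have to store."""
    return sum(
        1
        for x in range(width)
        for y in range(height)
        if within_window(router, (x, y), k)
    )


full_map = {((1, 0), "E"): 3, ((6, 6), "N"): 9}   # hypothetical entries
limited = prune_map((0, 0), full_map, 4)
corner_nodes = nodes_in_window((0, 0), 8, 8, 4)
```

For a corner router in an 8x8 mesh, a window of k = 4 covers 15 of the 64 nodes, which gives a feel for how much state the window removes.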

21 Implementation

22 Simulation carried out in a cycle-accurate C++ simulator [1].
Simulation parameters (characteristics of simulated design: realistic workloads / synthetic traffic):
–Topology: 7x7 2D mesh / 8x8 2D mesh
–Router uArch: two-stage speculative (both)
–Per-hop latency: 3 cycles, 2 cycles in router and 1 cycle to cross channel (both)
–Virtual channels/port: 8 (both)
–Flit buffers/VC: 5 (both)
–Traffic workload: SPLASH-2 traces / Random, Transpose, Bit-complement
–Duration of simulation: 10 million cycles or end of trace / 10000 warm-up cycles followed by 100000 packets
–Scaling factor (w): 0.25 (both)
–Fading (x, n): x = 1 unit; n = 100 cycles (both)
[1] S. Prabhu, B. Grot, P. Gratz, and J. Hu, “Ocin_tsim - DVFS Aware Simulator for NoCs,” in Proc. SAW-1, Jan 2010.

23 SPLASH-2

24 Improvement due to GCA (average): –DOR: 45% –Local: 26% –RCA-1D: 15% –DAR: 8%

25 SPLASH-2 Outliers: DAR is better than GCA on inherently static workloads (fft, radix). Statistical traffic distribution enables better performance. GCA is better than DAR on all other workloads.

26 SPLASH-2 LGCA performance: Close to GCA on most workloads; lu is an exception. Overall average is slightly worse than GCA but still better than the other competing algorithms.

27 Conclusion Proposed a novel adaptive routing mechanism that uses global congestion information to perform per-hop routing in on-chip networks. Uses back-annotated piggybacking to propagate congestion information, which alleviates the issue of overheads. Light-weight implementation of the shortest-path computation. GCA improves average packet latency: –By 26% against local adaptive –By 15% against RCA-1D –By 8% against DAR, on average for the SPLASH-2 suite of benchmarks.

28 Thank You

29 BACKUP

30 Minimal Adaptive Routing Model –Adaptive routing along minimal paths. D S

31 Fading

32 Synthetic Traffic

33 Network Sensitivity Experiments Variation of two parameters: –Vary the VC count: no variation in relative performance –Vary the mesh size: performs better for larger meshes. For both experiments, we simulate Transpose traffic.

34 Network Size 21 %

35 Congestion Information Scaling

36 Number of steps

37 Empirical parameters Scaling factor w: –w = 0.25 –GCA: links beyond 4 hops are assigned a constant scaling factor of 0.25 –LGCA: links beyond 4 hops are not stored, since k = 4. Fading mechanism: –n = 100 cycles –x = 1 unit
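
The slides give these parameter values but not the update rules, so the following sketch is only one plausible reading and should not be taken as the authors' mechanism. It assumes that scaling multiplies the congestion of links more than 4 hops away by w, and that fading subtracts x units from a stored value for every n cycles it has gone without an update. The function names and the `NEAR_HOPS` constant are hypothetical.

```python
W = 0.25        # scaling factor w from the slide
NEAR_HOPS = 4   # links at or within this distance are used unscaled (assumption)
FADE_X = 1      # units removed per fading step (x)
FADE_N = 100    # cycles between fading steps (n)


def scaled_cost(congestion, hops):
    """Assumed scaling rule: distant links contribute only a fraction of their congestion."""
    return congestion if hops <= NEAR_HOPS else congestion * W


def fade(value, last_update, now):
    """Assumed fading rule: stale values decay by FADE_X every FADE_N cycles, floored at 0."""
    steps = (now - last_update) // FADE_N
    return max(0, value - steps * FADE_X)


near = scaled_cost(8, 3)     # within 4 hops: used as-is
far = scaled_cost(8, 5)      # beyond 4 hops: scaled by w
stale = fade(5, 0, 250)      # two full fading periods have elapsed
```

Under this reading, both mechanisms reduce the influence of information that is likely to be out of date by the time a packet reaches the link in question.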

38 Challenges in global awareness Dissemination of congestion information: –Low overhead –Account for staleness. Limited storage in on-chip routers: –Exponential number of paths to each destination. Limited hardware resources for computations.

39 Future Work Congestion prediction: proactive adaptive routing instead of reactive adaptive routing. Stability analysis: does the algorithm thrash between different paths for some traffic patterns? Effect of imperfect congestion state representation.

40 Back-annotation For each outgoing flit, the node appends the congestion metric for the link in the same direction. For each outgoing flit, the node appends the congestion metric for the link in the opposite direction. [Figures: two S-to-D diagrams showing the packet traversal direction and the congestion information direction]

41 Multi-region Network partitioned into four quadrants. Each quadrant runs a benchmark as shown. Isolated traffic regions emulate a virtual-machine-like scenario.

42 Multi-region Local adaptive is unaffected due to its lack of visibility. RCA’s performance suffers due to noise introduced by aggregation. GCA maintains fine-grained information, which helps it avoid noise and perform better than RCA.

43 SPLASH-2 Improvement over local adaptive: –GCA: 26% average, 86% best case –LGCA: 23% average, 82% best case

44 SPLASH-2 Improvement over RCA-1D: –GCA: 15% average, 51% best case –LGCA: 11% average, 38% best case

45 SPLASH-2 Improvement over DAR: –GCA: 8% average, 53% best case –LGCA: 4% average, 41% best case

