
DUKE UNIVERSITY Self-Tuned Congestion Control for Multiprocessor Networks Shubhendu S. Mukherjee VSSAD, Alpha Development Group.


1 DUKE UNIVERSITY Self-Tuned Congestion Control for Multiprocessor Networks
Shubhendu S. Mukherjee (Shubu.Mukherjee@compaq.com), VSSAD, Alpha Development Group, Compaq Computer Corporation, Shrewsbury, Massachusetts
Mithuna Thottethodi and Alvin R. Lebeck ({mithuna,alvy}@cs.duke.edu), Department of Computer Sciences, Duke University, Durham, North Carolina
Appeared in the 7th International Symposium on High-Performance Computer Architecture (HPCA), Monterrey, Mexico, January 2001

2 Slide 2 DUKE UNIVERSITY Network Saturation

3 Slide 3 DUKE UNIVERSITY Why Network Saturation?
• Tree saturation
• Deadlock cycles
• New packets block older packets
• Backpressure takes 1000s of cycles to propagate back
[Figure label: router]

4 Slide 4 DUKE UNIVERSITY Why Do We Care?
• Computation power per router is increasing
  - More aggressive speculation
  - Simultaneous Multithreading
  - Chip Multiprocessors
• “Unstable” behavior makes designers very nervous
[Figure labels: Router, CPUs]

5 Slide 5 DUKE UNIVERSITY So, what’s the solution?
• Throttle
  - stop injecting packets when you hit a “threshold”
  - “threshold” = % full network buffers
• But
  - Local estimate of threshold insufficient
  - Saturation point differs for communication patterns
• Questions
  - How do we collect global estimate of % full network buffers?
  - How do we “tune” the threshold to different patterns?

6 Slide 6 DUKE UNIVERSITY Outline
• Overview
• Multiprocessor Network Basics
  - Deadlocks & virtual channels
  - Adaptive routing & Duato’s theory
• How to collect global estimate of congestion?
• How to “tune” the throttle threshold?
• Methodology & Results
• Summary, Future Work, & Other Projects

7 Slide 7 DUKE UNIVERSITY A Multiprocessor Network router

8 Slide 8 DUKE UNIVERSITY Deadlock Avoidance
[Figure: nodes 1 to 4 with packets 1 → 3, 2 → 4, 3 → 1, 4 → 2. First panel: Deadlocked. Second panel: Virtual Channels (red & yellow).]

9 Slide 9 DUKE UNIVERSITY Virtual Channels (VC)
[Figure: nodes with packets 1 → 3, 2 → 4, 3 → 1, 4 → 2. Label: One Buffer Per VC.]
• Logically, red and yellow networks (deadlock-free)
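The deadlock in the four-packet example can be checked mechanically by looking for a cycle in the channel dependency graph. The sketch below is only an illustration under stated assumptions: the nodes are renumbered 0 to 3, the network is taken to be a unidirectional ring, and the two virtual channels are assigned by a dateline-style rule (a packet moves to VC 1 once it has wrapped past node 0). The slides do not say how the red and yellow channels are assigned, so that rule and all function names here are hypothetical.

```python
def route(src, dst, k=4):
    """Nodes whose outgoing ring channel the packet uses, in order."""
    hops, node = [], src
    while node != dst:
        hops.append(node)
        node = (node + 1) % k
    return hops


def dependencies(packets, use_vcs, k=4):
    """Edges (held channel -> requested channel); a channel is (node, vc)."""
    edges = set()
    for src, dst in packets:
        # Assumed dateline rule: VC 1 after the packet wraps past node 0.
        chans = [(h, 1 if use_vcs and h < src else 0) for h in route(src, dst, k)]
        edges.update(zip(chans, chans[1:]))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the dependency graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(n):
        state[n] = 1
        for m in graph[n]:
            if state.get(m) == 1 or (m not in state and visit(m)):
                return True
        state[n] = 2
        return False

    return any(n not in state and visit(n) for n in graph)


# The slide's packets 1->3, 2->4, 3->1, 4->2, renumbered from 0.
packets = [(0, 2), (1, 3), (2, 0), (3, 1)]
single_vc_deadlocks = has_cycle(dependencies(packets, use_vcs=False))
two_vcs_deadlock = has_cycle(dependencies(packets, use_vcs=True))
```

With one channel per link the four dependencies close into a ring, while the assumed two-VC assignment breaks the ring at the wraparound hop.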

10 Slide 10 DUKE UNIVERSITY Duato’s Theory
• Adaptive network for high performance
  - deadlock-prone
• Deadlock-free network when adaptive network deadlocks
  - drop down to deadlock-free when router is congested
• Implemented with different virtual channels
  - adaptive virtual channels
  - deadlock-free virtual channels (escape channels)

11 Slide 11 DUKE UNIVERSITY Outline
• Overview
• Multiprocessor Network Basics
• How to collect global estimate of congestion?
• How to “tune” the throttle threshold?
• Methodology & Results
• Summary, Future Work, & Other Projects

12 Slide 12 DUKE UNIVERSITY Global Estimate of Congestion
• % of full buffers in entire network
  - more & more buffers occupied when network saturates
  - throttle network when % full buffers cross threshold
• Advantages
  - simple aggregation
  - empirical observation: works well
• Disadvantages
  - doesn’t detect localized congestion
  - threshold differs for communication patterns (we solve this)
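The throttling rule on this slide is a single comparison against the global buffer occupancy. A minimal sketch, assuming the occupancy is reported as a count of full buffers and the threshold as a percentage; the function and parameter names are hypothetical.

```python
def should_throttle(full_buffers, total_buffers, threshold_pct):
    """Stop injecting new packets once the share of full buffers exceeds the threshold."""
    return 100.0 * full_buffers / total_buffers > threshold_pct
```

Every node applies the same test to the same global estimate, so all nodes start and stop throttling together.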

13 Slide 13 DUKE UNIVERSITY Gather Global Information
• Global Information
  - % full network buffers in an “interval”
  - % packets or flits delivered during an “interval”
• Constraint
  - gather time << backpressure buildup time (1000s of cycles)
• Mechanisms
  - piggybacking
  - meta-packets
  - side-band signal

14 Slide 14 DUKE UNIVERSITY Sideband: Dimension-wise Aggregation
• Each hop takes h cycles on the sideband
• After 2 hops, aggregation in one dimension is done
• 2 such phases (Phase 1, Phase 2)
• Total gather time = 2 * 2 * h = 4h cycles
• For k-ary n-cubes, gather time (g) = n * k * h / 2
• For a 16x16 network with h = 2, g = 2 * 16 * 2 / 2 = 32 cycles
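The gather-time formula on this slide is easy to evaluate for other network sizes and sideband latencies. The helper below is a small sketch; only the formula g = n * k * h / 2 comes from the slide, and the function name is hypothetical.

```python
def gather_time(n, k, h):
    """Gather time in cycles for a k-ary n-cube with h cycles per sideband hop."""
    return n * k * h / 2


g_16x16 = gather_time(n=2, k=16, h=2)   # the 16x16 network from the slide
g_4x4 = gather_time(n=2, k=4, h=1)      # the 4h example, with h = 1
```

For the 16x16 torus, raising h from 2 to 3 or 6 cycles scales the gather time linearly to 48 or 96 cycles.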

15 Slide 15 DUKE UNIVERSITY Outline
• Overview
• Multiprocessor Network Basics
• How to collect global estimate of congestion?
• How to “tune” the throttle threshold?
• Methodology & Results
• Summary, Future Work, & Other Projects

16 Slide 16 DUKE UNIVERSITY Dynamic Detection of Threshold (Hill Climbing)
[Figure: Throughput versus % full buffers, with points A, B, and C marked]
Threshold adjustment:
Drop in Bandwidth > 25% | Currently throttling? Yes | Currently throttling? No
No | Increment | No Change
Yes | Decrement | Decrement
… we may still creep into saturation (later)
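One reading of the decision table on this slide is: decrement the threshold whenever bandwidth drops by more than 25%, increment it when throttling is active and there was no such drop, and otherwise leave it alone. The sketch below follows that reading. It assumes the drop is measured against the previous tuning period and borrows the 1% increment and 4% decrement from the Tuning Parameters slide; the function name and the clamping to the 0 to 100 range are hypothetical.

```python
def tune_threshold(threshold, throttling, prev_tput, cur_tput,
                   inc=1.0, dec=4.0, drop=0.25):
    """Return the new threshold (in % full buffers) after one tuning period."""
    # Assumption: "drop in bandwidth" compares against the previous period.
    if prev_tput > 0 and cur_tput < (1.0 - drop) * prev_tput:
        return max(threshold - dec, 0.0)      # back off quickly
    if throttling:
        return min(threshold + inc, 100.0)    # probe upward slowly
    return threshold                          # not throttling: no change
```

The asymmetric step sizes make the climb cautious and the retreat fast, which suits a curve where throughput collapses past the saturation point.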

17 Slide 17 DUKE UNIVERSITY Summary of Approach
• Global Knowledge of a Network
  - Collect % full network buffers and overall throughput
  - Dimension-wise aggregation, g-cycle snapshots
  - Aggregation via sideband signals
• Dynamically detect throttling threshold
  - Threshold = % of full network buffers
  - Self-tuned using hill climbing
  - Reset if hill climbing fails

18 Slide 18 DUKE UNIVERSITY Outline
• Overview
• Multiprocessor Network Basics
• How to collect global estimate of congestion?
• How to “tune” the throttle threshold?
• Methodology & Results
• Summary, Future Work, & Other Projects

19 Slide 19 DUKE UNIVERSITY Methodology
• Flitsim 2.0 Simulator (Pinkston’s group at USC)
  - warmup for 10k cycles, simulate for 50k cycles
• Network architecture
  - 16x16 two-dimensional torus (16-ary, 2-cube)
  - Full-duplex links
  - Packet size = 16 flits
  - Wormhole routing
  - Deadlock avoidance (paper has deadlock recovery results)
• Router architecture
  - 3 virtual channels per physical channel
  - Each virtual channel buffer holds 8 flits
  - 1 cycle central arbitration, 1 cycle switching

20 Slide 20 DUKE UNIVERSITY Input Traffic
• Packet Generation Frequency
  - “attempt” to send one packet per packet regeneration interval
• Traffic Patterns
  - Random destination
  - Perfect Shuffle: $a_{n-1} a_{n-2} \ldots a_1 a_0 \rightarrow a_{n-2} a_{n-3} \ldots a_0 a_{n-1}$
  - Butterfly: $a_{n-1} a_{n-2} \ldots a_1 a_0 \rightarrow a_0 a_{n-2} \ldots a_1 a_{n-1}$
  - Bit Reversal: $a_{n-1} a_{n-2} \ldots a_1 a_0 \rightarrow a_0 a_1 \ldots a_{n-2} a_{n-1}$
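The three permutation patterns map a source node address to a destination by rearranging its bits. The sketch below implements them for n-bit addresses, where n = 8 would cover the 256 nodes of the 16x16 torus; the function names are hypothetical.

```python
def perfect_shuffle(a, n):
    """Rotate the n-bit address left by one position."""
    mask = (1 << n) - 1
    return ((a << 1) | (a >> (n - 1))) & mask


def butterfly(a, n):
    """Swap the most significant and least significant bits."""
    hi, lo = (a >> (n - 1)) & 1, a & 1
    if hi != lo:
        a ^= (1 << (n - 1)) | 1   # flip both end bits
    return a


def bit_reversal(a, n):
    """Reverse the order of the n address bits."""
    return int(format(a, f"0{n}b")[::-1], 2)


dest_shuffle = perfect_shuffle(0b10000001, 8)
dest_butterfly = butterfly(0b10000000, 8)
dest_reversal = bit_reversal(0b00000110, 8)
```

Each function is a permutation of the address space, so every node sends to exactly one destination and receives from exactly one source.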

21 Slide 21 DUKE UNIVERSITY Throttling Algorithms
• Base
  - no throttling
• ALO (At Least One)
  - Lopez, Martinez, and Duato, ICPP, August, 1998
  - Throttling based on local estimation of congestion
  - Inject new packet only if
    – “useful” physical channel has all virtual channels free, or
    – at least one virtual channel on every “useful” channel is free
• Tune (this work)
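The ALO injection condition quoted on this slide can be written as a predicate over the free/busy state of the virtual channels on each "useful" physical channel. This is a sketch of the condition as the slide words it, not the ALO authors' implementation; the data layout and the treatment of an empty channel list are assumptions.

```python
def alo_can_inject(useful_channels):
    """useful_channels: one list per useful physical channel,
    each holding a bool per virtual channel (True = free)."""
    if not useful_channels:
        return False  # assumption: no useful channel means no injection
    some_channel_fully_free = any(all(vcs) for vcs in useful_channels)
    every_channel_has_free_vc = all(any(vcs) for vcs in useful_channels)
    return some_channel_fully_free or every_channel_has_free_vc
```

The decision uses only state visible at the injecting router, which is what the slide means by a local estimate of congestion.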

22 Slide 22 DUKE UNIVERSITY Tuning Parameters
• Total number of network buffers = 256 * 3 * 4 = 3072
• Gather time (g) = n * k * h / 2 = 32 cycles
• Sideband communication latency (h) = 2 cycles
• Sideband communication bandwidth = 25 bits (!)
  - # network buffers = 3072, needs 12 bits
  - max throughput = g * 256 * 1 = 8192, needs 13 bits
• Tuning frequency = once every 96 cycles
• Initial threshold value = 1% ~= 30 buffers
• Threshold increment = 1%, decrement = 4%
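The numbers on this slide follow from the network configuration, and the arithmetic can be replayed directly. The sketch assumes that the factors in 256 * 3 * 4 are nodes, virtual channels per physical channel, and physical channels per router, and that the trailing "* 1" in the throughput bound is one flit per node per cycle; the slide does not label them. Bit widths are counted as the bits needed to encode values 0 to N - 1.

```python
nodes = 16 * 16                 # 16x16 torus
vcs_per_channel = 3             # from the Methodology slide
channels_per_router = 4         # assumption: meaning of the factor 4

total_buffers = nodes * vcs_per_channel * channels_per_router
g = 2 * 16 * 2 // 2             # n * k * h / 2 with n = 2, k = 16, h = 2
max_throughput = g * nodes * 1  # assumption: 1 flit per node per cycle

buffer_bits = (total_buffers - 1).bit_length()
throughput_bits = (max_throughput - 1).bit_length()
sideband_bits = buffer_bits + throughput_bits

initial_threshold_buffers = total_buffers * 1 // 100   # 1% of all buffers
```

The 96-cycle tuning period on the slide is three gather times of 32 cycles each.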

23 Slide 23 DUKE UNIVERSITY Random Pattern
• Beyond saturation point, Tune outperforms ALO and Base

24 Slide 24 DUKE UNIVERSITY Delayed Collection of Global Knowledge (h = 2, 3, 6 cycles)
• Tune fairly insensitive to delayed collection of information

25 Slide 25 DUKE UNIVERSITY Static Threshold Choice
• Optimal thresholds different for random and butterfly
• Tune performs close to the best static threshold

26 Slide 26 DUKE UNIVERSITY With Bursty Load
• Tune outperforms ALO
[Figure panels: random, bit reversal, shuffle, butterfly]

27 Slide 27 DUKE UNIVERSITY Avoiding Local Maxima
• What if steady decrease in bandwidth < 25%?
  - potential to “creep” into saturation
• Solution: remember global maxima
  - max = maximum throughput seen in any tuning period
  - N_max = number of full buffers at max
  - T_max = threshold at max
• Reset threshold to min(T_max, N_max) if throughput < 50% of max
• If “r” consecutive resets don’t fix the problem, then restart
  - hypothesis: communication pattern has changed
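The reset rule on this slide can be sketched as a small state machine that remembers the best tuning period seen so far. The version below is an illustration under assumptions: the threshold and N_max are expressed in the same unit, "restart" is taken to mean forgetting the remembered maximum and returning to an initial threshold, and the value of r is left to the caller because the slide does not give one. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MaxTracker:
    r: int                      # consecutive resets tolerated before a restart
    initial_threshold: float    # threshold to fall back to on restart
    max_tput: float = 0.0       # best throughput seen in any tuning period
    n_max: float = 0.0          # full buffers observed at that maximum
    t_max: float = 0.0          # threshold in force at that maximum
    resets: int = 0

    def update(self, threshold, full_buffers, tput):
        """Return the threshold to use after this tuning period."""
        if tput > self.max_tput:
            self.max_tput, self.n_max, self.t_max = tput, full_buffers, threshold
            self.resets = 0
            return threshold
        if tput < 0.5 * self.max_tput:
            if self.resets >= self.r:
                # Assumed restart: the communication pattern has changed.
                self.max_tput = self.n_max = self.t_max = 0.0
                self.resets = 0
                return self.initial_threshold
            self.resets += 1
            return min(self.t_max, self.n_max)
        self.resets = 0
        return threshold
```

In a full tuner this check would run alongside the hill-climbing step each tuning period, overriding it only when throughput has fallen below half of the remembered maximum.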

28 Slide 28 DUKE UNIVERSITY Threshold Reset Necessary
• Packet Regeneration Interval = 10 cycles
[Figure panels: Hill Climbing; Hill Climbing + Local Maxima]

29 Slide 29 DUKE UNIVERSITY Summary
• Network Saturation is a severe problem
  - advent of powerful processors, SMT, and CMPs
  - “unstable” behavior makes designers nervous
• We propose throttling based on global knowledge
  - aggregate global knowledge (% full buffers, throughput)
  - throttle when % full buffers exceed threshold
  - tune threshold for communication patterns & offered load

