Slide 1: Self-Tuned Congestion Control for Multiprocessor Networks
Shubhendu S. Mukherjee (Shubu.Mukherjee@compaq.com), VSSAD, Alpha Development Group, Compaq Computer Corporation, Shrewsbury, Massachusetts
Mithuna Thottethodi and Alvin R. Lebeck ({mithuna,alvy}@cs.duke.edu), Department of Computer Science, Duke University, Durham, North Carolina
Appeared in the 7th International Symposium on High-Performance Computer Architecture (HPCA), Monterrey, Mexico, January 2001.
Slide 2: Network Saturation
Slide 3: Why Network Saturation?
- Tree saturation
- Deadlock cycles
- New packets block older packets
- Backpressure takes 1000s of cycles to propagate back
Slide 4: Why Do We Care?
- Computation power per router is increasing:
  - More aggressive speculation
  - Simultaneous multithreading
  - Chip multiprocessors
- "Unstable" behavior makes designers very nervous
Slide 5: So, what's the solution?
- Throttle:
  - Stop injecting packets when you hit a "threshold"
  - "Threshold" = % of full network buffers
- But:
  - A local estimate of the threshold is insufficient
  - The saturation point differs across communication patterns
- Questions:
  - How do we collect a global estimate of the % of full network buffers?
  - How do we "tune" the threshold to different patterns?
Slide 6: Outline
- Overview
- Multiprocessor Network Basics
  - Deadlocks & virtual channels
  - Adaptive routing & Duato's theory
- How to collect a global estimate of congestion?
- How to "tune" the throttle threshold?
- Methodology & Results
- Summary, Future Work, & Other Projects
Slide 7: A Multiprocessor Network
[Figure: nodes connected by a network of routers]
Slide 8: Deadlock Avoidance
[Figure: four packets (1-4) blocked in a cycle and deadlocked; splitting the same traffic across red and yellow virtual channels avoids the deadlock]
Slide 9: Virtual Channels (VC)
- One buffer per VC
- Logically, red and yellow networks (deadlock-free)
Slide 10: Duato's Theory
- Adaptive network for high performance
  - Deadlock-prone
- Deadlock-free network for when the adaptive network deadlocks
  - Drop down to the deadlock-free network when a router is congested
- Implemented with different virtual channels:
  - Adaptive virtual channels
  - Deadlock-free virtual channels (escape channels)
Slide 11: Outline
- Overview
- Multiprocessor Network Basics
- How to collect a global estimate of congestion?
- How to "tune" the throttle threshold?
- Methodology & Results
- Summary, Future Work, & Other Projects
Slide 12: Global Estimate of Congestion
- % of full buffers in the entire network
  - More and more buffers are occupied as the network saturates
  - Throttle the network when the % of full buffers crosses a threshold
- Advantages:
  - Simple aggregation
  - Empirical observation: works well
- Disadvantages:
  - Doesn't detect localized congestion
  - The threshold differs across communication patterns (we solve this)
Slide 13: Gather Global Information
- Global information:
  - % of full network buffers in an "interval"
  - Packets or flits delivered during an "interval"
- Constraint:
  - Gather time << backpressure buildup time (1000s of cycles)
- Mechanisms:
  - Piggybacking
  - Meta-packets
  - Sideband signal
Slide 14: Sideband: Dimension-wise Aggregation
- Each hop takes h cycles on the sideband
- In the example shown, aggregation in one dimension is done after 2 hops; there are 2 such phases (Phase 1, Phase 2)
- Total gather time for the example = 2 * 2 * h = 4h cycles
- For k-ary n-cubes, gather time g = n * k * h / 2
- For a 16x16 network, g = 2 * 16 * 2 / 2 = 32 cycles
Slide 15: Outline
- Overview
- Multiprocessor Network Basics
- How to collect a global estimate of congestion?
- How to "tune" the throttle threshold?
- Methodology & Results
- Summary, Future Work, & Other Projects
Slide 16: Dynamic Detection of Threshold (Hill Climbing)
[Figure: throughput vs. % of full buffers, with operating points A, B, and C]
Threshold adjustment each tuning period:
- Drop in bandwidth > 25%: decrement the threshold
- No drop, currently throttling: increment the threshold
- No drop, not throttling: no change
... we may still creep into saturation (later)
Slide 17: Summary of Approach
- Global knowledge of the network:
  - Collect the % of full network buffers and overall throughput
  - Dimension-wise aggregation, g-cycle snapshots
  - Aggregation via sideband signals
- Dynamically detect the throttling threshold:
  - Threshold = % of full network buffers
  - Self-tuned using hill climbing
  - Reset if hill climbing fails
Slide 18: Outline
- Overview
- Multiprocessor Network Basics
- How to collect a global estimate of congestion?
- How to "tune" the throttle threshold?
- Methodology & Results
- Summary, Future Work, & Other Projects
Slide 19: Methodology
- Flitsim 2.0 simulator (Pinkston's group at USC)
  - Warm up for 10k cycles, simulate for 50k cycles
- Network architecture:
  - 16x16 two-dimensional torus (16-ary 2-cube)
  - Full-duplex links
  - Packet size = 16 flits
  - Wormhole routing
  - Deadlock avoidance (the paper also has deadlock-recovery results)
- Router architecture:
  - 3 virtual channels per physical channel
  - Each virtual channel buffer holds 8 flits
  - 1-cycle central arbitration, 1-cycle switching
Slide 20: Input Traffic
- Packet generation frequency:
  - "Attempt" to send one packet per packet regeneration interval
- Traffic patterns (destination as a permutation of the source address bits, sketched in code below):
  - Random destination
  - Perfect shuffle: a_{n-1} a_{n-2} ... a_1 a_0 -> a_{n-2} a_{n-3} ... a_0 a_{n-1}
  - Butterfly: a_{n-1} a_{n-2} ... a_1 a_0 -> a_0 a_{n-2} ... a_1 a_{n-1}
  - Bit reversal: a_{n-1} a_{n-2} ... a_1 a_0 -> a_0 a_1 ... a_{n-2} a_{n-1}
Slide 21: Throttling Algorithms
- Base
  - No throttling
- ALO (At Least One)
  - Lopez, Martinez, and Duato, ICPP, August 1998
  - Throttling based on a local estimate of congestion
  - Inject a new packet only if:
    - a "useful" physical channel has all of its virtual channels free, or
    - at least one virtual channel on every "useful" channel is free
- Tune (this work)
Slide 22: Tuning Parameters
- Total number of network buffers = 256 * 3 * 4 = 3072
- Gather time g = n * k * h / 2 = 32 cycles
- Sideband communication latency h = 2 cycles
- Sideband communication bandwidth = 25 bits (!)
  - # network buffers = 3072 -> 12 bits
  - Max throughput = g * 256 * 1 = 8192 -> 13 bits
- Tuning frequency = once every 96 cycles
- Initial threshold value = 1% (~30 buffers)
- Threshold increment = 1%, decrement = 4%
Slide 23: Random Pattern
- Beyond the saturation point, Tune outperforms ALO and Base
Slide 24: Delayed Collection of Global Knowledge (h = 2, 3, 6 cycles)
- Tune is fairly insensitive to delayed collection of information
Slide 25: Static Threshold Choice
- The optimal thresholds differ for random and butterfly
- Tune performs close to the best static threshold
Slide 26: With Bursty Load
- Tune outperforms ALO
[Figure panels: random, bit reversal, shuffle, butterfly]
Slide 27: Avoiding Local Maxima
- What if a steady decrease in bandwidth stays below 25%?
  - Potential to "creep" into saturation
- Solution: remember the global maximum
  - max = maximum throughput seen in any tuning period
  - N_max = number of full buffers at max
  - T_max = threshold at max
  - Reset the threshold to min(T_max, N_max) if throughput < 50% of max
- If "r" consecutive resets don't fix the problem, then restart
  - Hypothesis: the communication pattern has changed
Slide 28: Threshold Reset Necessary
- Packet regeneration interval = 10 cycles
[Figure: hill climbing vs. hill climbing + local-maxima reset]
Slide 29: Summary
- Network saturation is a severe problem
  - Advent of powerful processors, SMT, and CMPs
  - "Unstable" behavior makes designers nervous
- We propose throttling based on global knowledge
  - Aggregate global knowledge (% of full buffers, throughput)
  - Throttle when the % of full buffers exceeds the threshold
  - Tune the threshold for communication patterns & offered load