DUKE UNIVERSITY Self-Tuned Congestion Control for Multiprocessor Networks Shubhendu S. Mukherjee VSSAD, Alpha Development Group.

Presentation transcript:

Self-Tuned Congestion Control for Multiprocessor Networks
Shubhendu S. Mukherjee, VSSAD, Alpha Development Group, Compaq Computer Corporation, Shrewsbury, Massachusetts
Mithuna Thottethodi and Alvin R. Lebeck, Department of Computer Sciences, Duke University, Durham, North Carolina
Appeared in the 7th International Symposium on High-Performance Computer Architecture (HPCA), Monterrey, Mexico, January 2001

Slide 2: Network Saturation

Slide 3: Why Network Saturation?
[Figure: network of routers]
- Tree saturation
- Deadlock cycles
- New packets block older packets
- Backpressure takes 1000s of cycles to propagate back

Slide 4: Why Do We Care?
[Figure: router with attached CPUs]
- Computation power per router is increasing
  - More aggressive speculation
  - Simultaneous Multithreading
  - Chip Multiprocessors
- “Unstable” behavior makes designers very nervous

Slide 5: So, what’s the solution?
- Throttle
  - stop injecting packets when you hit a “threshold”
  - “threshold” = % full network buffers
- But
  - Local estimate of threshold insufficient
  - Saturation point differs for communication patterns
- Questions
  - How do we collect global estimate of % full network buffers?
  - How do we “tune” the threshold to different patterns?

Slide 6: Outline
- Overview
- Multiprocessor Network Basics
  - Deadlocks & virtual channels
  - Adaptive routing & Duato’s theory
- How to collect global estimate of congestion?
- How to “tune” the throttle threshold?
- Methodology & Results
- Summary, Future Work, & Other Projects

Slide 7: A Multiprocessor Network
[Figure: multiprocessor network of routers]

Slide 8: Deadlock Avoidance
[Figure: a cycle of packets among nodes 1-4, shown deadlocked and then with virtual channels (red & yellow)]

Slide 9: Virtual Channels (VC)
[Figure: the cycle of packets among nodes 1-4, with virtual channels]
- One buffer per VC
- Logically, red and yellow networks (deadlock-free)

Slide 10: Duato’s Theory
- Adaptive network for high performance
  - deadlock-prone
- Deadlock-free network when adaptive network deadlocks
  - drop down to deadlock-free when router is congested
- Implemented with different virtual channels
  - adaptive virtual channels
  - deadlock-free virtual channels (escape channels)

Slide 11: Outline
- Overview
- Multiprocessor Network Basics
- How to collect global estimate of congestion?
- How to “tune” the throttle threshold?
- Methodology & Results
- Summary, Future Work, & Other Projects

Slide 12: Global Estimate of Congestion
- % of full buffers in entire network
  - more & more buffers occupied when network saturates
  - throttle network when % full buffers crosses threshold
- Advantages
  - simple aggregation
  - empirical observation: works well
- Disadvantages
  - doesn’t detect localized congestion
  - threshold differs for communication patterns (we solve this)

Slide 13: Gather Global Information
- Global Information
  - % full network buffers in an “interval”
  - % packets or flits delivered during an “interval”
- Constraint
  - gather time << backpressure buildup time (1000s of cycles)
- Mechanisms
  - piggybacking
  - meta-packets
  - side-band signal

Slide 14: Sideband: Dimension-wise Aggregation
[Figure: aggregation along one dimension (Phase 1), then along the other (Phase 2)]
- Each hop takes h cycles on the sideband
- After 2 hops, aggregation in one dimension is done
- 2 such phases
- Total gather time = 2 * 2 * h = 4h cycles
- For k-ary n-cubes, gather time (g) = n * k * h / 2
- For a 16x16 network with h = 2, g = 2 * 16 * 2 / 2 = 32 cycles
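The gather-time formula on this slide can be checked with a few lines of Python. This is only a sketch: the function name `gather_time` is made up for illustration, and it assumes an even k so that k/2 hops per dimension is a whole number.

```python
def gather_time(n: int, k: int, h: int) -> int:
    """Gather time g = n * k * h / 2 for a k-ary n-cube.

    n: number of dimensions (one aggregation phase per dimension)
    k: nodes per dimension (assumed even here)
    h: sideband cycles per hop
    """
    return n * k * h // 2


# The slide's 16x16 example with h = 2 cycles per hop.
print(gather_time(n=2, k=16, h=2))  # 32

# The "2 hops, 2 phases" example gives 4h, which the formula
# reproduces for n = 2, k = 4 (shown here with h = 1).
print(gather_time(n=2, k=4, h=1))   # 4
```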

Slide 15: Outline
- Overview
- Multiprocessor Network Basics
- How to collect global estimate of congestion?
- How to “tune” the throttle threshold?
- Methodology & Results
- Summary, Future Work, & Other Projects

Slide 16: Dynamic Detection of Threshold (Hill Climbing)
[Figure: throughput vs. % full buffers, with points A, B, and C marked]
Threshold tuning decision:
  Drop in bandwidth > 25%? | Currently throttling? | Threshold
  Yes                      | Yes or No             | Decrement
  No                       | Yes                   | Increment
  No                       | No                    | No change
… we may still creep into saturation (later)
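One reading of the decision table on this slide can be sketched in Python. Several details are assumptions rather than anything the slide states: the name `tune_threshold`, measuring the drop relative to the previous tuning period's throughput, and clamping the threshold to 0-100%. The default step sizes (increment 1%, decrement 4%) are borrowed from the tuning-parameters slide.

```python
def tune_threshold(threshold: float, throttling: bool,
                   prev_tput: float, curr_tput: float,
                   inc: float = 1.0, dec: float = 4.0,
                   drop: float = 0.25) -> float:
    """One hill-climbing step; threshold is in % of full buffers."""
    # Assumption: "drop in bandwidth" compares against the previous period.
    dropped = prev_tput > 0 and (prev_tput - curr_tput) / prev_tput > drop
    if dropped:
        # Bandwidth fell by more than 25%: back off, throttling or not.
        return max(0.0, threshold - dec)
    if throttling:
        # Throttling without hurting bandwidth: try a higher threshold.
        return min(100.0, threshold + inc)
    # Not throttling and no drop: leave the threshold alone.
    return threshold


print(tune_threshold(10.0, True, 100, 70))   # 6.0
print(tune_threshold(10.0, True, 100, 90))   # 11.0
print(tune_threshold(10.0, False, 100, 90))  # 10.0
```

The decrement being larger than the increment makes the sketch back off quickly and climb slowly.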

Slide 17: Summary of Approach
- Global Knowledge of a Network
  - Collect % full network buffers and overall throughput
  - Dimension-wise aggregation, g-cycle snapshots
  - Aggregation via sideband signals
- Dynamically detect throttling threshold
  - Threshold = % of full network buffers
  - Self-tuned using hill climbing
  - Reset if hill climbing fails

Slide 18: Outline
- Overview
- Multiprocessor Network Basics
- How to collect global estimate of congestion?
- How to “tune” the throttle threshold?
- Methodology & Results
- Summary, Future Work, & Other Projects

Slide 19: Methodology
- Flitsim 2.0 Simulator (Pinkston’s group at USC)
  - warmup for 10k cycles, simulate for 50k cycles
- Network architecture
  - 16x16 two-dimensional torus (16-ary, 2-cube)
  - Full-duplex links
  - Packet size = 16 flits
  - Wormhole routing
  - Deadlock avoidance (paper has deadlock recovery results)
- Router architecture
  - 3 virtual channels per physical channel
  - Each virtual channel buffer holds 8 flits
  - 1 cycle central arbitration, 1 cycle switching

Slide 20: Input Traffic
- Packet Generation Frequency
  - “attempt” to send one packet per packet regeneration interval
- Traffic Patterns
  - Random destination
  - Perfect Shuffle: a_{n-1} a_{n-2} ... a_1 a_0 -> a_{n-2} a_{n-3} ... a_0 a_{n-1}
  - Butterfly: a_{n-1} a_{n-2} ... a_1 a_0 -> a_0 a_{n-2} ... a_1 a_{n-1}
  - Bit Reversal: a_{n-1} a_{n-2} ... a_1 a_0 -> a_0 a_1 ... a_{n-2} a_{n-1}
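The three permutation patterns act on the bits of the source node address. As an illustration only, they can be written as small Python functions; the function names and the choice of 8-bit addresses (enough to name 256 nodes) are assumptions, not details from the slides.

```python
def perfect_shuffle(a: int, n: int) -> int:
    """Rotate the n-bit address left by one (a_{n-1} moves to bit 0)."""
    msb = (a >> (n - 1)) & 1
    return ((a << 1) & ((1 << n) - 1)) | msb


def butterfly(a: int, n: int) -> int:
    """Swap the most and least significant bits of the n-bit address."""
    msb = (a >> (n - 1)) & 1
    lsb = a & 1
    if msb == lsb:
        return a
    # The two end bits differ, so flipping both swaps them.
    return a ^ ((1 << (n - 1)) | 1)


def bit_reversal(a: int, n: int) -> int:
    """Reverse the order of the n address bits."""
    out = 0
    for i in range(n):
        out |= ((a >> i) & 1) << (n - 1 - i)
    return out


src = 0b10110000  # example source address, n = 8 bits
print(format(perfect_shuffle(src, 8), "08b"))  # 01100001
print(format(butterfly(src, 8), "08b"))        # 00110001
print(format(bit_reversal(src, 8), "08b"))     # 00001101
```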

Slide 21: Throttling Algorithms
- Base
  - no throttling
- ALO (At Least One)
  - Lopez, Martinez, and Duato, ICPP, August, 1998
  - Throttling based on local estimation of congestion
  - Inject new packet only if
    - a “useful” physical channel has all virtual channels free, or
    - at least one virtual channel on every “useful” channel is free
- Tune (this work)
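The ALO injection test is a purely local check, which a short Python sketch can make concrete. The data layout (a list of per-channel lists of booleans, `True` meaning the virtual channel is free), the function name, and the handling of an empty list are all assumptions made for illustration. The first condition is read here as "at least one useful channel has all its virtual channels free".

```python
from typing import List


def alo_can_inject(useful_channels: List[List[bool]]) -> bool:
    """Local ALO check over the "useful" physical channels of a packet.

    useful_channels[i][j] is True when virtual channel j of useful
    physical channel i is free (hypothetical representation).
    """
    if not useful_channels:
        return False  # assumption: nothing useful means no injection
    # Condition 1: some useful channel has every virtual channel free.
    one_fully_free = any(all(vcs) for vcs in useful_channels)
    # Condition 2: every useful channel has at least one free virtual channel.
    each_has_free = all(any(vcs) for vcs in useful_channels)
    return one_fully_free or each_has_free


print(alo_can_inject([[True, True, True], [False, False, False]]))   # True
print(alo_can_inject([[True, False, False], [False, True, False]]))  # True
print(alo_can_inject([[True, False, False], [False, False, False]])) # False
```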

Slide 22: Tuning Parameters
- Total number of network buffers = 256 * 3 * 4 = 3072
- Gather time (g) = n * k * h / 2 = 32 cycles
- Sideband communication latency (h) = 2 cycles
- Sideband communication bandwidth = 25 bits (!)
  - # network buffers = 3072, needs 12 bits
  - max throughput = g * 256 * 1 = 8192, needs 13 bits
- Tuning frequency = once every 96 cycles
- Initial threshold value = 1% ~= 30 buffers
- Threshold increment = 1%, decrement = 4%
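The arithmetic behind these parameters can be reproduced in a few lines of Python. The interpretation of the factors is an assumption: 256 routers, 3 virtual channels per physical channel, 4 physical channels per router, and at most 1 delivery per router per cycle. The helper `bits_for` is hypothetical and counts the bits needed to distinguish that many values.

```python
ROUTERS = 16 * 16          # 16x16 torus
VCS_PER_CHANNEL = 3        # from the methodology slide
CHANNELS_PER_ROUTER = 4    # assumed meaning of the factor 4
GATHER_TIME = 32           # g, in cycles


def bits_for(count: int) -> int:
    """Bits needed to distinguish `count` values (0 .. count - 1)."""
    return (count - 1).bit_length()


total_buffers = ROUTERS * VCS_PER_CHANNEL * CHANNELS_PER_ROUTER
max_throughput = GATHER_TIME * ROUTERS * 1
sideband_bits = bits_for(total_buffers) + bits_for(max_throughput)
initial_threshold_buffers = total_buffers * 1 // 100  # 1%, rounded down

print(total_buffers)              # 3072
print(max_throughput)             # 8192
print(sideband_bits)              # 25
print(initial_threshold_buffers)  # 30
```

Under this counting convention, 13 bits cover the values 0 through 8191.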

Slide 23: Random Pattern
- Beyond saturation point, Tune outperforms ALO and Base

Slide 24: Delayed Collection of Global Knowledge (h = 2, 3, 6 cycles)
- Tune is fairly insensitive to delayed collection of information

Slide 25: Static Threshold Choice
- Optimal thresholds differ for random and butterfly
- Tune performs close to the best static threshold

Slide 26: With Bursty Load
[Figure panels: random, bit reversal, shuffle, butterfly]
- Tune outperforms ALO

Slide 27: Avoiding Local Maxima
- What if steady decrease in bandwidth < 25%?
  - potential to “creep” into saturation
- Solution: remember global maxima
  - max = maximum throughput seen in any tuning period
  - N_max = number of full buffers at max
  - T_max = threshold at max
- Reset threshold to min(T_max, N_max) if throughput < 50% of max
- If “r” consecutive resets don’t fix the problem, then restart
  - hypothesis: communication pattern has changed
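The reset rule on this slide can be sketched as a small state machine in Python. Much of this is assumed for illustration: the class and field names, the default r = 3, expressing the threshold and the full-buffer count in the same unit so that the min is meaningful, and treating "restart" as forgetting the remembered maximum and returning to an initial threshold.

```python
from dataclasses import dataclass


@dataclass
class GlobalMaxTracker:
    initial_threshold: float      # threshold used after a restart (assumed)
    r: int = 3                    # hypothetical limit on consecutive resets
    max_tput: float = 0.0         # max
    n_max: float = 0.0            # N_max: full buffers at max
    t_max: float = 0.0            # T_max: threshold at max
    consecutive_resets: int = 0

    def step(self, threshold: float, full_buffers: float, tput: float) -> float:
        """Return the threshold to use in the next tuning period."""
        if tput > self.max_tput:
            # New best throughput: remember where it happened.
            self.max_tput, self.n_max, self.t_max = tput, full_buffers, threshold
            self.consecutive_resets = 0
            return threshold
        if tput < 0.5 * self.max_tput:
            if self.consecutive_resets >= self.r:
                # r resets did not help: assume the pattern changed, restart.
                self.max_tput = self.n_max = self.t_max = 0.0
                self.consecutive_resets = 0
                return self.initial_threshold
            self.consecutive_resets += 1
            return min(self.t_max, self.n_max)
        self.consecutive_resets = 0
        return threshold


tracker = GlobalMaxTracker(initial_threshold=30, r=2)
print(tracker.step(300, 250, 1000))  # 300 (records the maximum)
print(tracker.step(340, 900, 400))   # 250 (reset to min(T_max, N_max))
```

In a full tuner this step would run alongside the hill-climbing update each tuning period.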

Slide 28: Threshold Reset Necessary
Packet regeneration interval = 10 cycles
[Figure panels: Hill Climbing; Hill Climbing + Local Maxima]

Slide 29: Summary
- Network saturation is a severe problem
  - advent of powerful processors, SMT, and CMPs
  - “unstable” behavior makes designers nervous
- We propose throttling based on global knowledge
  - aggregate global knowledge (% full buffers, throughput)
  - throttle when % full buffers exceeds threshold
  - tune threshold for communication patterns & offered load