CS252 Graduate Computer Architecture Lecture 21 Multiprocessor Networks (con’t) John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley

Review: On Chip: Embeddings in Two Dimensions
Embed multiple logical dimensions in one physical dimension using long wires
When embedding a higher dimension in a lower one, either some wires are longer than others, or all wires are long
Example: a 6 x 3 x 2 embedding

Review: Store&Forward vs Cut-Through Routing
Time: h(n/b + Δ) vs n/b + hΔ
Or (cycles): h(n/w + Δ) vs n/w + hΔ
What if the message is fragmented?
Wormhole vs virtual cut-through
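To make the comparison concrete, here is a minimal Python sketch of the two formulas; the function names and sample numbers are illustrative, not from the lecture.

```python
# Minimal sketch of the two latency formulas (names and numbers mine).
# n: message size (bits), w: channel width (bits/cycle),
# h: hops on the route, delta: per-hop routing delay (cycles).
def store_and_forward(n, w, h, delta):
    # the whole message is buffered and retransmitted at every hop
    return h * (n / w + delta)

def cut_through(n, w, h, delta):
    # the header pipelines through; the payload streams behind it once
    return n / w + h * delta

n, w, h, delta = 1024, 16, 8, 2
print(store_and_forward(n, w, h, delta))  # 8 * (64 + 2) = 528 cycles
print(cut_through(n, w, h, delta))        # 64 + 8 * 2   =  80 cycles
```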

Contention
Two packets trying to use the same link at the same time
– limited buffering
– drop?
Most parallel machine networks block in place
– link-level flow control
– tree saturation
Closed system: offered load depends on delivered load
– source squelching

Bandwidth
What affects local bandwidth?
– packet density: b x n_data/n
– routing delay: b x n_data/(n + wΔ)
– contention
» endpoints
» within the network
Aggregate bandwidth
– bisection bandwidth
» sum of bandwidth of smallest set of links that partition the network
– total bandwidth of all the channels: Cb
– suppose N hosts issue a packet every M cycles with average distance h
» each message occupies h channels for l = n/w cycles each
» C/N channels available per node
» link utilization for store-and-forward: ρ = (hl/M channel cycles per node)/(C/N) = Nhl/MC < 1!
» link utilization for wormhole routing?
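A quick sketch of the store-and-forward utilization bound above, using the slide's notation; the function name and example numbers are mine.

```python
# Sketch of the store-and-forward utilization bound rho = Nhl/MC,
# with l = n/w; parameter names follow the slide, numbers are mine.
def link_utilization(N, h, n, w, M, C):
    """N hosts, average distance h hops, n-bit messages on w-bit
    channels (l = n/w cycles each), one message per M cycles,
    C total channels; must stay below 1 to avoid saturation."""
    l = n / w
    return N * h * l / (M * C)

# e.g. N=64 nodes with C=256 channels, messages of 256 bits:
print(link_utilization(N=64, h=8, n=256, w=16, M=200, C=256))  # 0.16
```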

Saturation

How Many Dimensions?
n = 2 or n = 3
– short wires, easy to build
– many hops, low bisection bandwidth
– requires traffic locality
n >= 4
– harder to build, more wires, longer average length
– fewer hops, better bisection bandwidth
– can handle non-local traffic
k-ary d-cubes provide a consistent framework for comparison
– N = k^d
– scale dimension (d) or nodes per dimension (k)
– assume cut-through

Traditional Scaling: Latency Scaling with N
Assumes equal channel width
– independent of node count or dimension
– dominated by average distance

Average Distance
ave dist = d(k-1)/2
But equal channel width is not equal cost!
Higher dimension => more channels

In the 3D World
For n nodes, bisection area is O(n^(2/3))
For large n, bisection bandwidth is limited to O(n^(2/3))
– Bill Dally, IEEE TPDS, [Dal90a]
– for fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise higher is better)
– i.e., a few short fat wires are better than many long thin wires
– what about many long fat wires?

Dally Paper (con’t)
Equal bisection: W = 1 for hypercube => W = k/2
Three wire models:
– constant delay, independent of length
– logarithmic delay with length (exponential driver tree)
– linear delay (speed of light/optimal repeaters)
(plots: logarithmic-delay and linear-delay cases)

Equal Cost in k-ary n-cubes
Equal number of nodes? Equal number of pins/wires? Equal bisection bandwidth? Equal area? Equal wire length?
What do we know?
– switch degree: d; diameter = d(k-1)
– total links = Nd
– pins per node = 2wd
– bisection = k^(d-1) = N/k links in each direction; 2Nw/k wires cross the middle
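To see how these metrics trade off, here is a small Python sketch evaluating them at a fixed node count; the function and dictionary keys are illustrative.

```python
# Sketch evaluating the k-ary d-cube metrics above for width-w
# channels (the function and dictionary keys are illustrative).
def kary_dcube_metrics(k, d, w):
    N = k ** d                       # number of nodes
    return {
        "nodes": N,
        "diameter": d * (k - 1),     # worst-case hop count
        "total_links": N * d,        # d links per node
        "pins_per_node": 2 * w * d,  # w wires in and out per dimension
        "bisection_links": N // k,   # k^(d-1) links in each direction
        "bisection_wires": 2 * N * w // k,
    }

print(kary_dcube_metrics(k=32, d=2, w=16))  # 1024-node 2D torus
print(kary_dcube_metrics(k=2, d=10, w=16))  # 1024-node hypercube
```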

Latency for Equal-Width Channels
total links(N) = Nd

Latency with Equal Pin Count
Baseline d=2 has w = 32 (128 wires per node)
Fix 2dw pins => w(d) = 64/d
Distance goes up with d, but channel time goes down

Latency with Equal Bisection Width
N-node hypercube has N bisection links
2-d torus has 2N^(1/2)
Fixed bisection => w(d) = N^(1/d)/2 = k/2
1M nodes, d=2 has w = 512!
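A one-liner sketch of the equal-bisection width rule, assuming k is an exact integer root of N:

```python
# Sketch of w(d) = N^(1/d) / 2 = k/2 under fixed bisection wiring
# (assumes k is an exact integer root of N).
def width_equal_bisection(N, d):
    k = round(N ** (1 / d))   # nodes per dimension
    return k // 2

print(width_equal_bisection(N=2**20, d=2))   # k = 1024 -> w = 512
print(width_equal_bisection(N=2**20, d=20))  # k = 2    -> w = 1
```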

Larger Routing Delay (w/ Equal Pin Count)
Dally’s conclusions were strongly influenced by the assumption of small routing delay
– here, routing delay Δ = 20

Latency under Contention
Optimal packet size? Channel utilization?

Saturation
Fatter links shorten queuing delays

Phits per Cycle
Higher-degree networks have larger available bandwidth
– cost?

Discussion
Rich set of topological alternatives with deep relationships
Design point depends heavily on cost model
– nodes, pins, area, ...
– wire length or wire delay metrics favor small dimension
– long (pipelined) links increase optimal dimension
Need a consistent framework and analysis to separate opinion from design
Optimal point changes with technology

Another Idea: Express Cubes
Problem: low-dimensional networks have high k
– consequence: may have to travel many hops in a single dimension
– routing latency can dominate long-distance traffic patterns
Solution: provide one or more “express” links
– like express trains, express elevators, etc.
» delay linear with distance, lower constant
» closer to “speed of light” in medium
» lower power, since no router cost
– “Express Cubes: Improving performance of k-ary n-cube interconnection networks,” Bill Dally, 1991
Another idea: route with pass transistors through links

The Routing Problem: Local Decisions
Routing at each hop: pick the next output port!

How do you build a crossbar?

Input Buffered Switch
Independent routing logic per input
– FSM
Scheduler logic arbitrates each output
– priority, FIFO, random
Head-of-line blocking problem

Output Buffered Switch
How would you build a shared pool?

Output Scheduling
n independent arbitration problems?
– static priority, random, round-robin
Simplifications due to routing algorithm?
General case is max bipartite matching
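One of the simple per-output policies listed above is round-robin; a minimal arbiter sketch follows (the class and method names are illustrative, not any real router's interface).

```python
# Minimal round-robin arbiter for one output port; the class and
# method names are illustrative, not any real router's interface.
class RoundRobinArbiter:
    def __init__(self, n_inputs):
        self.n = n_inputs
        self.last = self.n - 1   # index of the last input granted

    def grant(self, requests):
        """requests: one bool per input; returns the granted input
        index, or None. The search starts just past the previous
        winner, so every persistent requester is eventually served."""
        for i in range(1, self.n + 1):
            idx = (self.last + i) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None

arb = RoundRobinArbiter(4)
print(arb.grant([True, False, True, False]))  # 0
print(arb.grant([True, False, True, False]))  # 2
print(arb.grant([True, False, True, False]))  # 0 again
```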

Switch Components
Output ports
– transmitter (typically drives clock and data)
Input ports
– synchronizer aligns data signal with local clock domain
– essentially a FIFO buffer
Crossbar
– connects each input to any output
– degree limited by area or pinout
Buffering
Control logic
– complexity depends on routing logic and scheduling algorithm
– determine output port for each incoming packet
– arbitrate among inputs directed at the same output

Properties of Routing Algorithms
Routing algorithm:
– R: N x N -> C, which at each switch maps the destination node n_d to the next channel on the route
– which of the possible paths are used as routes?
– how is the next hop determined?
» arithmetic
» source-based port select
» table driven
» general computation
Deterministic
– route determined by (source, dest), not intermediate state (i.e., traffic)
Adaptive
– route influenced by traffic along the way
Minimal
– only selects shortest paths
Deadlock free
– no traffic pattern can lead to a situation where packets are deadlocked and never move forward

Routing Mechanism
Need to select output port for each input packet
– in a few cycles
Simple arithmetic in regular topologies
– ex: Δx, Δy routing in a grid
» west (-x): Δx < 0
» east (+x): Δx > 0
» south (-y): Δx = 0, Δy < 0
» north (+y): Δx = 0, Δy > 0
» processor: Δx = 0, Δy = 0
Reduce relative address of each dimension in order
– dimension-order routing in k-ary d-cubes
– e-cube routing in n-cubes
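The Δx, Δy decision rule above fits in a few lines; a minimal Python sketch, with port names chosen for illustration:

```python
# Sketch of the delta-x, delta-y rule above for a 2D grid; port
# names are chosen for illustration.
def next_port(src, dst):
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    if dx < 0: return "west"    # reduce x first...
    if dx > 0: return "east"
    if dy < 0: return "south"   # ...then y, once x is resolved
    if dy > 0: return "north"
    return "processor"          # arrived: eject to the local node

assert next_port((3, 2), (1, 5)) == "west"
assert next_port((1, 2), (1, 5)) == "north"
assert next_port((1, 5), (1, 5)) == "processor"
```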

Deadlock Freedom
How can deadlock arise?
– necessary conditions:
» shared resource
» incrementally allocated
» non-preemptible
– think of a channel as a shared resource that is acquired incrementally
» source buffer then destination buffer
» channels along a route
How do you avoid it?
– constrain how channel resources are allocated
– ex: dimension order
How do you prove that a routing algorithm is deadlock free?
– show that the channel dependency graph has no cycles!
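The proof obligation in the last bullet is a plain cycle check; a minimal sketch over an abstract channel dependency graph (the representation is mine):

```python
# Sketch of the cycle check on a channel dependency graph (CDG):
# deps maps each channel id to the channels a packet may wait for
# while holding it. Representation is mine; the check is plain DFS.
def cdg_has_cycle(deps):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(c):
        color[c] = GRAY
        for nxt in deps.get(c, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:              # back edge -> cycle
                return True
            if state == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False
    return any(color.get(c, WHITE) == WHITE and dfs(c) for c in deps)

ring = {0: [1], 1: [2], 2: [3], 3: [0]}   # wrap-around closes a cycle
line = {0: [1], 1: [2], 2: [3], 3: []}    # no wrap: acyclic
print(cdg_has_cycle(ring), cdg_has_cycle(line))  # True False
```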

Consider Trees
Why is the obvious routing on X deadlock free?
– butterfly?
– tree?
– fat tree?
Any assumptions about routing mechanism? Amount of buffering?

Up*-Down* Routing for General Topology
Given any bidirectional network
Construct a spanning tree
Number the nodes, increasing from leaves to root
UP edges increase node numbers
Any source -> dest by UP*-DOWN* route
– up edges, single turn, down edges
– proof of deadlock freedom?
Performance?
– some numberings and routes are much better than others
– interacts with topology in strange ways
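A small sketch of the legality condition for up*/down* routes, assuming node numbers increase toward the root as described above:

```python
# Sketch of the up*/down* legality condition, assuming node numbers
# increase from leaves toward the root as described above.
def is_updown_route(route):
    """route: node numbers along the path. Legal iff every 'up'
    move (to a higher number) precedes every 'down' move: at most
    one turn, from up to down."""
    gone_down = False
    for a, b in zip(route, route[1:]):
        if b > a:              # up edge
            if gone_down:
                return False   # a down->up turn is forbidden
        else:
            gone_down = True   # down edge
    return True

print(is_updown_route([2, 5, 7, 4, 1]))  # True: up, up, down, down
print(is_updown_route([2, 5, 3, 6]))     # False: turns back up
```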

Turn Restrictions in X,Y
XY routing forbids 4 of 8 turns and leaves no room for adaptive routing
Can you allow more turns and still be deadlock free?

Minimal Turn Restrictions in 2D
(figure: north-last and negative-first turn diagrams over the -x, +x, +y, -y directions)

Example Legal West-First Routes
Can route around failures or congestion
Can combine turn restrictions with virtual channels

General Proof Technique
Resources are logically associated with channels
Messages introduce dependences between resources as they move forward
Need to articulate the possible dependences that can arise between channels
Show that there are no cycles in the channel dependence graph
– find a numbering of channel resources such that every legal route follows a monotonic sequence => no traffic pattern can lead to deadlock
Network need not be acyclic, just the channel dependence graph

Example: k-ary 2D Array
Theorem: dimension-ordered (x,y) routing is deadlock free
Numbering
– +x channel (i,y) -> (i+1,y) gets i
– similarly for -x with 0 as the most positive edge
– +y channel (x,j) -> (x,j+1) gets N+j
– similarly for -y channels
Any routing sequence (x direction, turn, y direction) is increasing
Generalization:
– “e-cube routing” on 3-D: X then Y then Z
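A small sketch checking the numbering argument on +x/+y channels; the -x/-y cases are symmetric and omitted for brevity.

```python
# Sketch checking the numbering argument on +x/+y channels of a
# k-ary 2D array (-x/-y are symmetric and omitted for brevity).
def channel_number(src, dst, k):
    N = k * k
    (x0, y0), (x1, y1) = src, dst
    if y0 == y1 and x1 == x0 + 1:
        return x0              # +x channel (i,y) -> (i+1,y) gets i
    if x0 == x1 and y1 == y0 + 1:
        return N + y0          # +y channel (x,j) -> (x,j+1) gets N+j
    raise ValueError("not a +x/+y channel")

def route_xy(src, dst):
    x, y = src
    while x < dst[0]:          # x first...
        yield (x, y), (x + 1, y)
        x += 1
    while y < dst[1]:          # ...then y
        yield (x, y), (x, y + 1)
        y += 1

nums = [channel_number(a, b, k=8) for a, b in route_xy((1, 2), (5, 6))]
print(nums, nums == sorted(nums))  # strictly increasing -> no cycle
```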

Channel Dependence Graph

More Examples
What about wormhole routing on a ring?
Or: a unidirectional torus of higher dimension?

Deadlock-Free Wormhole Networks?
Basic dimension-order routing techniques don’t work for unidirectional k-ary d-cubes
– only for k-ary d-arrays (bidirectional)
– and dimension-ordered routing is not adaptive!
Idea: add channels!
– provide multiple “virtual channels” to break the dependence cycle
– good for BW too!
– do not need to add links or xbar, only buffer resources
This adds nodes to the CDG; does it remove edges?
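One standard way to apply the virtual-channel idea on a ring, though not necessarily the exact scheme discussed next, is a "dateline": packets start on VC 0 and move to VC 1 after crossing a fixed node, so the wrap-around link never closes a dependence cycle. A minimal sketch (the dateline placement is an assumption):

```python
# "Dateline" sketch on a unidirectional k-node ring: packets start
# on VC 0 and switch to VC 1 after crossing a fixed dateline node,
# so the wrap-around link never closes a dependence cycle. The
# dateline placement (node 0) is an assumption of this sketch.
def ring_route_vcs(src, dst, k, dateline=0):
    """Yield (link, vc) pairs along the unidirectional ring."""
    node, vc = src, 0
    while node != dst:
        nxt = (node + 1) % k
        if nxt == dateline:
            vc = 1             # crossed the dateline: bump the VC
        yield (node, nxt), vc
        node = nxt

print(list(ring_route_vcs(2, 1, k=4)))
# [((2, 3), 0), ((3, 0), 1), ((0, 1), 1)] -- wrap handled on VC 1
```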

When Are Virtual Channels Allocated?
Two separate processes:
– virtual channel allocation
– switch/connection allocation
Virtual channel allocation
– choose route and free output virtual channel
Switch allocation
– for each incoming virtual channel, must negotiate switch on outgoing pin
In the ideal case (not highly loaded), would like to optimistically allocate a virtual channel
(figure: hardware-efficient design for crossbar)

Breaking Deadlock with Virtual Channels

Paper Discussion: Linder and Harden
Paper: “An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-ary n-cubes”
– Daniel Linder and Jim Harden
General virtual-channel scheme for k-ary n-cubes
– with wrap-around paths
Properties of result for unidirectional k-ary n-cube:
– 1 virtual interconnection network
– n+1 levels
Properties of result for bidirectional k-ary n-cube:
– 2^(n-1) virtual interconnection networks
– n+1 levels per network

Example: Unidirectional 4-ary 2-cube
Physical network
– wrap-around channels necessary but can cause deadlock
Virtual network
– use VCs to avoid deadlock
– 1 level for each wrap-around

Bi-directional 4-ary 2-cube: 2 Virtual Networks
(figures: Virtual Network 1 and Virtual Network 2)