EECS 570: Fall 2003 -- rev1
Chapter 10: Scalable Interconnection Networks

Slide 2: Goals
Low latency
 – to neighbors
 – to average/furthest node
High bandwidth
 – per-node and aggregate
 – bisection bandwidth
Low cost
 – switch complexity, pin count
 – wiring cost (connectors!)
Scalability

Slide 3: Requirements from Above
Communication-to-computation ratio implies bandwidth needs
 – local or global?
 – regular or irregular (arbitrary)?
 – bursty or uniform?
 – broadcasts? multicasts?
Programming model
 – protocol structure
 – transfer size
 – importance of latency vs. bandwidth

Slide 4: Basic Definitions
A switch is a device capable of transferring data from its input ports to its output ports in an arbitrary pattern.
A network is a graph of V = {switches and nodes} connected by communication channels (aka links), C ⊆ V x V.
 – direct: a node at each switch
 – indirect: nodes only at the edges (like the Internet)
A route is the sequence of links a message follows from node A to node B.

Slide 5: What characterizes a network?
Topology
 – interconnection structure of the graph
Switching strategy
 – circuit vs. packet switching
 – store-and-forward vs. cut-through
 – virtual cut-through vs. wormhole, etc.
Routing algorithm
 – how the route is determined
Control strategy
 – centralized vs. distributed
Flow control mechanism
 – when a packet (or a portion of it) moves along its route
 – can't send two packets down the same link at the same time

Slide 6: Topologies
Topology determines many critical parameters:
 – degree: number of input (output) ports per switch
 – distance: number of links in the route from A to B
 – average distance: average over all (A, B) pairs
 – diameter: maximum distance between two nodes (using shortest paths)
 – bisection: minimum number of links cut to separate the network into two halves

Slide 7: Bus
A fully connected network topology
 – bus plus interface logic is a form of switch
Parameters:
 – diameter = average distance = 1
 – degree = N
 – switch cost O(N)
 – wire cost constant
 – bisection bandwidth constant (or worse)
Broadcast is free

Slide 8: Crossbar
Another fully connected network topology
 – one big switch of degree N connects all nodes
Parameters:
 – diameter = average distance = O(1)
 – degree = N
 – switch cost O(N^2)
 – wire cost 2N
 – bisection bandwidth O(N)
Most switches in other topologies are crossbars inside

Slide 9: How to build a crossbar

Slide 10: Switches
[figure: switch microarchitecture; input ports and receivers feed input buffers, a crossbar connects them to output buffers and transmitters on the output ports, under control logic for routing and scheduling]

Slide 11: Linear Arrays and Rings
Linear array
 – diameter? N - 1
 – average distance? N/3
 – bisection? 1
Torus (ring): links may be unidirectional or bidirectional
Examples: FDDI, SCI, FiberChannel, KSR1

Slide 12: Multidimensional Meshes and Tori
d-dimensional k-ary mesh: N = k^d, so k = N^(1/d)
 – k nodes in each of d dimensions
 – a torus has wraparound links, an array doesn't; "mesh" is used for both (ambiguous)
General d-dimensional mesh
 – N = k_(d-1) x ... x k_0 nodes

Slide 13: Properties (N = k^d, k = N^(1/d))
Diameter?
Average distance?
 – ~d x k/3 for a mesh
Degree? Switch cost? Wire cost?
Bisection?
 – k^(d-1) for a mesh
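As a quick sanity check on these formulas, here is a small illustrative Python helper (not part of the original slides) that evaluates degree, diameter, average distance, and bisection for a k-ary d-dimensional mesh; the degree value assumes an interior node with two links per dimension.

```python
# Illustrative helper (not from the slides): basic parameters of a
# k-ary d-dimensional mesh with N = k^d nodes.

def mesh_parameters(k: int, d: int):
    n = k ** d
    return {
        "nodes": n,
        "degree": 2 * d,                 # two links per dimension at an interior node
        "diameter": d * (k - 1),         # worst case: k-1 hops in every dimension
        "avg_distance": d * k / 3,       # ~k/3 per dimension, as on the slide
        "bisection": k ** (d - 1),       # links cut when one dimension is split in half
    }

if __name__ == "__main__":
    # 1024 nodes as a 32x32 2-D mesh vs. a 4-ary 5-dimensional mesh
    print(mesh_parameters(32, 2))
    print(mesh_parameters(4, 5))
```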

Slide 14: Hypercubes

Slide 15: Trees
Usually indirect, occasionally direct
Diameter and average distance are logarithmic
 – k-ary tree, height d = log_k N
Fixed degree
Route up to the common ancestor and back down
 – benefit for local traffic
Bisection?

Slide 16: Fat-Trees
Fatter links (really more of them) as you go up, so bisection bandwidth scales with N

Slide 17: Butterflies
A tree with lots of roots!
 – N log N switches (actually (N/2) x log N)
Exactly one route from any source to any destination
 – compute R = A xor B; at level i, use the "straight" edge if r_i = 0, otherwise the cross edge
Bisection N/2, vs. N^((d-1)/d) for a k-ary d-cube
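The XOR rule can be written down directly; the following sketch (illustrative, and the MSB-first bit ordering is an assumption) picks the straight or cross edge at each level from R = A xor B.

```python
# Illustrative sketch of butterfly routing: the relative address R = A xor B is
# scanned one bit per level; a 0 bit takes the "straight" edge, a 1 bit the cross edge.

def butterfly_route(src: int, dst: int, levels: int):
    r = src ^ dst
    route = []
    for i in reversed(range(levels)):   # level 0 examines the most significant bit (assumed)
        bit = (r >> i) & 1
        route.append("cross" if bit else "straight")
    return route

if __name__ == "__main__":
    # 16-input butterfly => log2(16) = 4 levels
    print(butterfly_route(0b0101, 0b0011, 4))   # ['straight', 'cross', 'cross', 'straight']
```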

Slide 18: Benes Networks and Fat Trees
A back-to-back butterfly can route all permutations
 – but the routes must be computed off-line
What if you just pick a random midpoint?

Slide 19: Relationship of Butterflies to Hypercubes
The wiring is isomorphic
 – except that the butterfly always takes log N steps

Slide 20: Summary of Topologies

Topology       Degree      Diameter          Ave Dist          Bisection    Diam / Ave Dist (N=1024)
1D array       2           N-1               N/3               1            1023 / 341
1D ring        2           N/2               N/4               2            512 / 256
2D array       4           2(N^(1/2) - 1)    (2/3) N^(1/2)     N^(1/2)      63 / 21
3D array       6           3(N^(1/3) - 1)    N^(1/3)           N^(2/3)      ~30 / ~10
2D torus       4           N^(1/2)           (1/2) N^(1/2)     2 N^(1/2)    32 / 16
k-ary n-cube   2n          nk/2              nk/4              nk/4         15 / 7.5 (n=3)
Hypercube      n = log N   n                 n/2               N/2          10 / 5
2D tree        3           2 log2 N          ~2 log2 N         1            20 / ~20
4D tree        5           2 log4 N          2 log4 N - 2/3    1            10 / ~9
2D fat tree    4           2 log2 N          ~2 log2 N         N            20 / ~20
2D butterfly   4           log2 N            log2 N            N/2          10 / 10
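The N = 1024 column can be reproduced by plugging N into the closed-form entries; the snippet below is an illustrative check of a few rows, not part of the original slide.

```python
# Illustrative check of the N = 1024 column using the closed-form expressions above.
import math

N = 1024
rows = {
    "1D array":  (N - 1,          N / 3),
    "1D ring":   (N / 2,          N / 4),
    "2D torus":  (math.isqrt(N),  math.isqrt(N) / 2),
    "hypercube": (math.log2(N),   math.log2(N) / 2),
}
for name, (diameter, avg_dist) in rows.items():
    print(f"{name:10s} diameter ~ {diameter:7.1f}   average distance ~ {avg_dist:6.1f}")
```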

Slide 21: Choosing a Topology
Cost vs. performance
For fixed cost, which topology provides the best performance?
 – best performance on what workload?
   – message size
   – traffic pattern
 – define cost
 – target machine size
Simplify the tradeoff to dimension vs. radix
 – restrict to k-ary d-cubes
 – what is the best dimension?

Slide 22: How Many Dimensions in a Network?
d = 2 or d = 3
 – short wires, easy to build
 – many hops, low bisection bandwidth
 – benefits from traffic locality
d >= 4
 – harder to build: more wires, longer average wire length
 – higher switch degree
 – fewer hops, better bisection bandwidth
 – handles non-local traffic better
The effect of hop count on latency depends on the switching strategy...

Slide 23: Store&Forward vs. Cut-Through Routing
Messages are typically fragmented into packets and pipelined through the network
Store-and-forward buffers each whole packet at every switch before forwarding; cut-through pipelines at the granularity of flits
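The contrast is easiest to see in a simple unloaded-latency model. The sketch below is illustrative only; the symbols (L, b, h, delta) and the example values are assumptions, not parameters taken from the slides.

```python
# Illustrative latency model (assumed, simplified, no contention):
#   store-and-forward: each of h hops receives the whole packet before forwarding
#   cut-through:       the packet is pipelined, so serialization time is paid once

def store_and_forward_latency(L, b, h, delta):
    """L = packet length (bits), b = link bandwidth (bits/cycle),
    h = hops, delta = per-hop routing delay (cycles)."""
    return h * (L / b + delta)

def cut_through_latency(L, b, h, delta):
    return L / b + h * delta

if __name__ == "__main__":
    L, b, h, delta = 1024, 16, 8, 2
    print("store-and-forward:", store_and_forward_latency(L, b, h, delta), "cycles")
    print("cut-through:      ", cut_through_latency(L, b, h, delta), "cycles")
```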

Slide 24: Handling Output Contention
What if the output is blocked?
Virtual cut-through
 – a switch with a blocked output buffers the entire packet
 – degenerates to store-and-forward under contention
 – requires lots of buffering in switches
Wormhole
 – leave flits strung out over the network (in channel buffers)
 – minimal switch buffering
 – one blocked packet can tie up lots of channels

Slide 25: Traditional Scaling: Latency(P)
Assumes equal channel width, independent of node count or dimension
Latency is dominated by average distance

Slide 26: Average Distance
But equal channel width is not equal cost!
 – higher dimension => more channels
Average distance = d(k-1)/2

Slide 27: In the 3-D World
For large n, bisection bandwidth is limited to O(n^(2/3))
 – Dally [Dal90a], IEEE TPDS: for fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise, higher dimension is better)
 – i.e., a few short, fat wires are better than many long, thin wires
 – what about many long, fat wires?
For n nodes packed in 3-D space, the bisection area is O(n^(2/3))

Slide 28: Equal Cost in k-ary n-cubes
Equal in what sense?
 – equal number of nodes? equal number of pins/wires?
 – equal bisection bandwidth? equal area? equal wire length?
What do we know (for a k-ary d-cube with channel width w)?
 – switch degree: d; diameter = d(k-1)
 – total links = Nd
 – pins per node = 2wd
 – bisection = k^(d-1) = N/k links in each direction; 2Nw/k wires cross the middle
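These identities are easy to tabulate. The sketch below is illustrative: for fixed N it compares the channel width w allowed by an equal-pin budget (the 128-pin figure mirrors the d = 2, w = 32 baseline used two slides later) and by an equal-bisection budget, as the dimension varies.

```python
# Illustrative comparison of k-ary d-cube cost measures for fixed N = 1024.
# Assumed budgets: 128 pins per node (pins = 2*w*d) and w(d) = N^(1/d)/2
# for the equal-bisection case, following the formulas on these slides.

N = 1024
PIN_BUDGET = 128

for d in (2, 3, 5, 10):
    k = round(N ** (1 / d))
    total_links = N * d                  # one channel per node per dimension
    bisection_links = 2 * k ** (d - 1)   # k^(d-1) links cut in each direction
    w_equal_pin = PIN_BUDGET // (2 * d)  # 2wd pins per node => w = budget / (2d)
    w_equal_bisection = k // 2           # w(d) = N^(1/d) / 2 = k / 2
    print(f"d={d:2d} k={k:3d} links={total_links} bisection={bisection_links} "
          f"w(equal pin)={w_equal_pin} w(equal bisection)={w_equal_bisection}")
```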

Slide 29: Latency(d) for P with Equal Width
Total links(N) = Nd

Slide 30: Latency with Equal Pin Count
Baseline: d = 2 has w = 32 (128 wires per node)
Fix 2dw pins => w(d) = 64/d
Distance goes down with increasing d, but channel time goes up

Slide 31: Latency with Equal Bisection Width
An N-node hypercube has N bisection links; a 2-d torus has 2N^(1/2)
Fixed bisection => w(d) = N^(1/d) / 2 = k/2
At 1M nodes, d = 2 has w = 512!

Slide 32: Larger Routing Delay (w/ equal pin)
The ratio of routing delay to channel time is the key parameter

Slide 33: Topology Summary
Rich set of topological alternatives with deep relationships
Design point depends heavily on the cost model
 – nodes, pins, area, ...
Need for downward scalability tends to fix the dimension
 – high-degree switches are wasted in small configurations
 – grow the machine by increasing nodes per dimension
Need a consistent framework and analysis to separate opinion from design
Optimal point changes with technology
 – store-and-forward vs. cut-through
 – non-pipelined vs. pipelined signaling

Slide 34: Real Machines
Wide links, smaller routing delay
Tremendous variation

Slide 35: What characterizes a network?
Topology
 – interconnection structure of the graph
Switching strategy
 – circuit vs. packet switching
 – store-and-forward vs. cut-through
 – virtual cut-through vs. wormhole, etc.
Routing algorithm
 – how the route is determined
Control strategy
 – centralized vs. distributed
Flow control mechanism
 – when a packet (or a portion of it) moves along its route
 – can't send two packets down the same link at the same time

Slide 36: Typical Packet Format
Two basic mechanisms for abstraction:
 – encapsulation
 – fragmentation
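As a purely illustrative picture of encapsulation and fragmentation, the sketch below wraps a payload in a small header/trailer and splits a message into bounded-size packets; the field names, sizes, and checksum are assumptions, not any machine's actual format.

```python
# Illustrative packet structure (assumed fields, not a real machine's format).
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    dest: int          # routing information consumed by switches
    seq: int           # sequence number used to reassemble fragments
    payload: bytes     # encapsulated upper-level data
    crc: int           # error check carried in the trailer

def fragment(message: bytes, max_payload: int, dest: int) -> List[Packet]:
    """Encapsulate a message and fragment it into packets of bounded size."""
    packets = []
    for seq, off in enumerate(range(0, len(message), max_payload)):
        chunk = message[off:off + max_payload]
        packets.append(Packet(dest=dest, seq=seq, payload=chunk,
                              crc=sum(chunk) & 0xFFFF))   # toy checksum
    return packets

if __name__ == "__main__":
    print(fragment(b"scalable interconnection networks", 8, dest=42))
```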

Slide 37: Routing
Recall: the routing algorithm determines
 – which of the possible paths are used as routes
 – how the route is determined
 – R: N x N → C, which at each switch maps the destination node n_d to the next channel on the route
Issues:
 – routing mechanism
   – arithmetic
   – source-based port select
   – table driven
   – general computation
 – properties of the routes
 – deadlock free

Slide 38: Routing Mechanism
Need to select the output port for each input packet in a few cycles
Simple arithmetic in regular topologies
 – example: ∆x, ∆y routing in a grid
   – west (-x)    ∆x < 0
   – east (+x)    ∆x > 0
   – south (-y)   ∆x = 0, ∆y < 0
   – north (+y)   ∆x = 0, ∆y > 0
   – processor    ∆x = 0, ∆y = 0
Reduce the relative address of each dimension in order
 – dimension-order routing in k-ary d-cubes
 – e-cube routing in n-cubes
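The ∆x, ∆y rule translates directly into code; this is an illustrative sketch of the per-hop decision (the port names are assumed labels, not a specific router's interface).

```python
# Illustrative dimension-order (x then y) routing decision in a 2-D grid.

def xy_route(cur, dst):
    """Return the output port for a packet at node cur headed to dst."""
    dx = dst[0] - cur[0]
    dy = dst[1] - cur[1]
    if dx < 0:
        return "west"        # resolve -x first
    if dx > 0:
        return "east"        # resolve +x first
    if dy < 0:
        return "south"       # x resolved, now -y
    if dy > 0:
        return "north"       # x resolved, now +y
    return "processor"       # arrived: deliver to the local node

if __name__ == "__main__":
    hop, dst = (0, 0), (2, 3)
    while hop != dst:
        port = xy_route(hop, dst)
        print(hop, "->", port)
        hop = (hop[0] + (port == "east") - (port == "west"),
               hop[1] + (port == "north") - (port == "south"))
```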

Slide 39: Routing Mechanism (cont.)
Source-based
 – message header carries a series of port selects, used and stripped en route
 – CRC? header length?
 – examples: CS-2, Myrinet, MIT Arctic
Table-driven
 – message header carries an index for the next port at the next switch: o = R[i]
 – the table also gives the index for the following hop: (o, i') = R[i]
 – examples: ATM, HPPI
[figure: header carrying port selects P3, P2, P1, P0]
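The two mechanisms differ mainly in where the per-hop decision lives. The sketch below is an illustrative contrast (the data structures are assumptions): a source route is stripped one port select per hop, while a table lookup returns both the output port and the index to present at the next switch.

```python
# Illustrative contrast between source-based and table-driven routing.

def source_routed_hop(header: list):
    """Source routing: the header carries the port selects; each switch
    pops (strips) the first one and forwards on that port."""
    port = header.pop(0)
    return port, header

def table_driven_hop(table: dict, index: int):
    """Table-driven routing: the header carries an index; the switch's
    table gives the output port and the index to use at the next hop."""
    port, next_index = table[index]
    return port, next_index

if __name__ == "__main__":
    print(source_routed_hop([3, 1, 0, 2]))    # port 3, remaining route [1, 0, 2]
    print(table_driven_hop({7: (2, 12)}, 7))  # port 2, index 12 at the next switch
```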

Slide 40: Properties of Routing Algorithms
Deterministic
 – route determined by (source, dest), not by intermediate state (i.e., traffic)
Adaptive
 – route influenced by traffic along the way
Minimal
 – only selects shortest paths
Deadlock free
 – no traffic pattern can lead to a situation where no packets move forward

Slide 41: Deadlock Freedom
How can it arise? Necessary conditions:
 – shared resources
 – incrementally allocated
 – non-preemptible
Think of a channel as a shared resource that is acquired incrementally
 – source buffer, then destination buffer
 – channels along a route
How do you avoid it?
 – constrain how channel resources are allocated (example: dimension order)
How do you prove that a routing algorithm is deadlock free?

Slide 42: Proof Technique
Resources are logically associated with channels
Messages introduce dependences between resources as they move forward
Need to articulate the possible dependences between channels
Show that there are no cycles in the channel dependence graph
 – find a numbering of channel resources such that every legal route follows a monotonic sequence
 – => no traffic pattern can lead to deadlock
The network need not be acyclic, only the channel dependence graph

Slide 43: Example: k-ary 2D array
Theorem: dimension-order x,y routing is deadlock free
Numbering:
 – the +x channel (i,y) → (i+1,y) gets number i; similarly for -x, with 0 at the most positive edge
 – the +y channel (x,j) → (x,j+1) gets number N+j; similarly for the -y channels
Any route that goes in the x direction, turns once, then goes in the y direction follows increasing channel numbers
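This argument can be checked mechanically: assign every channel its number, enumerate every x-then-y route, and confirm the numbers strictly increase. The sketch below does that for a small array; it is illustrative, and the constants are one valid numbering consistent with the slide's scheme.

```python
# Illustrative check that every dimension-order x,y route on a K x K array follows
# strictly increasing channel numbers, so the channel dependence graph has no cycle.
import itertools

K = 4
N = K * K

def channel_number(src, dst):
    (x0, y0), (x1, y1) = src, dst
    if y0 == y1:                                 # x-dimension channel
        return x0 if x1 > x0 else K - 1 - x0     # -x numbered with 0 at the most positive edge
    return N + (y0 if y1 > y0 else K - 1 - y0)   # y channels numbered from N upward

def xy_route(src, dst):
    """Yield the channels of the dimension-order (x then y) route."""
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        yield (x, y), (nx, y)
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        yield (x, y), (x, ny)
        y = ny

nodes = list(itertools.product(range(K), repeat=2))
for src, dst in itertools.product(nodes, repeat=2):
    nums = [channel_number(a, b) for a, b in xy_route(src, dst)]
    assert nums == sorted(nums) and len(nums) == len(set(nums)), (src, dst)
print("all x,y routes follow strictly increasing channel numbers")
```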

Slide 44: Channel Dependence Graph

Slide 45: More examples
Why is the obvious routing on X deadlock free?
 – butterfly? tree? fat tree?
 – any assumptions about the routing mechanism? amount of buffering?
What about wormhole routing on a ring?

Slide 46: Deadlock free wormhole networks?
Basic dimension-order routing doesn't work for k-ary d-cubes
 – only for k-ary d-arrays (bidirectional, no wraparound)
Idea: add channels!
 – provide multiple "virtual channels" to break the dependence cycle
 – good for bandwidth too!
 – no need to add links or crossbar ports, only buffer resources
This adds nodes to the channel dependence graph; does it remove edges?
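One standard way to use the extra channels on a ring (or a torus dimension) is a "dateline": packets start on virtual channel 0 and move to virtual channel 1 after crossing one designated link, which breaks the cycle in the channel dependence graph. The sketch below illustrates the idea on a small unidirectional ring; the dateline position and channel count are assumptions, not a specific machine's scheme.

```python
# Illustrative dateline virtual-channel assignment on a unidirectional K-node ring.
# Crossing the link from node K-1 back to node 0 (the "dateline") forces a switch
# from VC 0 to VC 1, so channel dependences cannot wrap all the way around.

K = 8

def ring_route_with_vcs(src: int, dst: int):
    """Yield (link, virtual_channel) pairs along the clockwise route."""
    vc, node = 0, src
    while node != dst:
        nxt = (node + 1) % K
        if nxt == 0:          # crossing the dateline
            vc = 1
        yield (node, nxt), vc
        node = nxt

if __name__ == "__main__":
    print(list(ring_route_with_vcs(6, 2)))
    # [((6, 7), 0), ((7, 0), 1), ((0, 1), 1), ((1, 2), 1)]
```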

Slide 47: Breaking deadlock with virtual channels

Slide 48: Turn Restrictions in X, Y
X,Y routing forbids 4 of the 8 possible turns and leaves no room for adaptive routing
Can you allow more turns and still be deadlock free?

Slide 49: Minimal turn restrictions in 2D
[figure: turn diagrams in the -x, +x, -y, +y plane for the north-last and negative-first restrictions; the west-first case appears on the next slide]

Slide 50: Example legal west-first routes
Can route around failures or congestion
Can combine turn restrictions with virtual channels
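A turn restriction such as west-first is just a predicate over consecutive hop directions. The sketch below is an illustrative check (direction names are assumed labels): the two turns into the west direction are forbidden, so any westward hops must come first.

```python
# Illustrative west-first turn check: the two turns *into* the west direction
# are forbidden, so a route may only head west before turning anywhere else.

FORBIDDEN_TURNS = {("north", "west"), ("south", "west")}

def west_first_legal(route):
    """route is a list of hop directions, e.g. ['west', 'north', 'east']."""
    return all((a, b) not in FORBIDDEN_TURNS for a, b in zip(route, route[1:]))

if __name__ == "__main__":
    print(west_first_legal(["west", "west", "north", "east"]))   # True
    print(west_first_legal(["north", "west", "south"]))          # False: turns west after north
```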

Slide 51: Adaptive Routing
R: C x N x ∑ → C
Essential for fault tolerance
 – at least multipath
Can improve utilization of the network
 – simple deterministic algorithms easily run into bad permutations
Fully/partially adaptive, minimal/non-minimal
Can introduce complexity or anomalies
A little adaptation goes a long way!

Slide 52: Contention
Two packets trying to use the same link at the same time
 – limited buffering
 – drop?
Most parallel machine networks block in place
 – link-level flow control
 – tree saturation
Closed system: offered load depends on delivered load

Slide 53: Flow Control
What do you do when push comes to shove?
 – Ethernet: collision detection and retry after a delay
 – TCP/IP WAN: buffer, drop, adjust rate
 – any solution must adjust to the output rate
Link-level flow control
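Link-level flow control is commonly credit-based: the sender holds one credit per free slot in the downstream input buffer, spends a credit per flit, and stalls at zero. The sketch below is an illustrative model of that handshake; the buffer size and names are assumptions, not a specific link protocol.

```python
# Illustrative credit-based link-level flow control between two switches.
# The sender may only send while credits remain, so flits are never dropped;
# back-pressure propagates as a stall instead.
from collections import deque

class CreditedLink:
    def __init__(self, buffer_slots: int = 4):
        self.credits = buffer_slots      # one credit per free downstream buffer slot
        self.rx_buffer = deque()

    def send(self, flit) -> bool:
        if self.credits == 0:
            return False                 # back-pressure: sender must stall
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def drain_one(self):
        """Receiver forwards a flit downstream and returns a credit."""
        flit = self.rx_buffer.popleft()
        self.credits += 1
        return flit

if __name__ == "__main__":
    link = CreditedLink(buffer_slots=2)
    print([link.send(f) for f in ("f0", "f1", "f2")])   # [True, True, False]
    link.drain_one()
    print(link.send("f2"))                              # True again after a credit returns
```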

Slide 54: Example: T3D
3-D bidirectional torus, dimension-order routing (NIC selected), virtual cut-through, packet switched
16 bits x 150 MHz: short, wide, synchronous links
Rotating priority per output
Logically separate request/response networks
 – 3 independent, stacked switches
 – 8 16-bit flits on each of 4 virtual channels in each direction

Slide 55: Routing and Switch Design Summary
Routing algorithms restrict the set of routes within the topology
 – a simple mechanism selects the turn at each hop: arithmetic, selection, lookup
Deadlock-free if the channel dependence graph is acyclic
 – limit turns to eliminate dependences
 – add separate channel resources to break dependences
 – a combination of topology, algorithm, and switch design
Deterministic vs. adaptive routing
Switch design issues
 – input/output/pooled buffering, routing logic, selection logic
Flow control
Real networks are a "package" of design choices

Slide 56: Example: SP
8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock
Packet switched, cut-through, no virtual channels, source-based routing
Variable-length packets <= 255 bytes; 31-byte FIFO per input, 7 bytes per output; 16-phit links
Byte "chunks" in the central queue, LRU per output