©2003 Dror Feitelson Parallel Computing Systems Part II: Networks and Routing Dror Feitelson Hebrew University
The Issues
  Topology
    – The road map: what connects to what
  Routing
    – Selecting a route to the desired destination
  Flow Control
    – Traffic lights: who gets to go
  Switching
    – The mechanics of moving bits
  Dally, VLSI & Parallel Computation chap. 3, 1990
Topologies and Routing
The Model
  Network is separate from processors (in old systems, nodes did the switching)
  Network composed of switches and links
  Each processor (PE) connected to some switch
Considerations
  Diameter
    – Expected to correlate with maximal latency
  Switch degree
    – Harder to implement switches with high degree
  Capacity
    – Potential of serving multiple communications at once
  Number of switches
    – Obviously affects network cost
  Existence of a simple routing function
    – Implemented in hardware
Hypercubes
  Multi-dimensional cubes
  n dimensions, N = 2^n nodes
  Nodes identified by n-bit numbers
  Each node connected to the nodes that differ from it in a single bit
  Degree: log N
  Diameter: log N
  Cost: N switches (one per node)
  Used in Intel iPSC, nCUBE
Recursive Construction
Node Numbering
  Each node has an n-bit number
  Each bit corresponds to a dimension of the hypercube
Routing
  Given source and destination, correct one bit at a time
  Example: go from 001 to 110
  In what order?
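A minimal sketch of bit-correction routing in a hypercube. It assumes the differing bits are corrected from the lowest dimension upward; the slide leaves the order open, and any fixed order would work the same way. The name `hypercube_route` is illustrative and not taken from the source.

```python
def hypercube_route(src: int, dst: int, n: int) -> list[int]:
    """Return the nodes visited when fixing differing bits, lowest bit first."""
    path = [src]
    cur = src
    for bit in range(n):
        mask = 1 << bit
        if (cur ^ dst) & mask:   # this dimension still differs
            cur ^= mask          # move along that dimension
            path.append(cur)
    return path


route = hypercube_route(0b001, 0b110, 3)
print([format(node, "03b") for node in route])  # ['001', '000', '010', '110']
```

The number of hops equals the number of differing bits, which is at most n = log N.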
Mesh
  N nodes arranged in a rectangle or square
  Each node connected to 4 neighbors
  Diameter: 2(√N − 1) for a square mesh, i.e. about 2√N
  Degree: 4
  Cost: N switches (one per node)
  Used in Intel Paragon
Routing
  Each node identified by x,y coordinates
  X-Y routing: first along one dimension, then along the other
  Deadlock prevention
    – Always route along X dimension first
    – Turn model: disallow only one turn
    – Odd-even turns: disallow different turns in odd or even rows/columns
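A minimal sketch of X-Y (dimension order) routing on a mesh, assuming nodes are addressed by integer (x, y) pairs and that X is always traversed first. The function name `xy_route` is hypothetical.

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Return the nodes visited when routing first along X, then along Y."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:                 # finish the X dimension first
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:                 # only then move along Y
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet turns at most once, and only from X to Y, the turns needed to close a cycle never occur.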
Adaptive Routing
  Decide route on-line according to conditions
    – Avoid congested links
    – Circumvent failed links
  Dally, IEEE Trans. Par. Dist. Syst., 1993
Congestion Avoidance
  Desired pattern
Congestion Avoidance
  Dimension order routing: first along X, then along Y
  Congestion at top right (assuming pipelining)
Congestion Avoidance
  Adaptive routing: disjoint paths can be used
  Throughput increased 7-fold
Fault Tolerance
  Example: dimension order routing
Fault Tolerance
  Example: dimension order routing
  Fault causes many nodes to be inaccessible
Adaptive Routing
  Decide route on-line according to conditions
    – Avoid congested links
    – Circumvent failed links
  In a mesh, the source and destination nodes define a rectangle of all minimal routes
  Also possible to use non-minimal routes
Torus
  Mesh with wrap-around
  Reduces diameter to about √N
  In 2D this is topologically a donut
  But…
    – Harder to prevent deadlocks
    – Longer cables than simple mesh
    – Harder to partition
  Used in Cray T3D/T3E
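A small sketch comparing hop distances with and without wrap-around, assuming a square k × k layout with coordinates 0..k−1 in each dimension. The function names are illustrative only.

```python
def mesh_distance(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Hop count in a mesh: sum of the coordinate differences."""
    return sum(abs(a - b) for a, b in zip(src, dst))


def torus_distance(src: tuple[int, int], dst: tuple[int, int], k: int) -> int:
    """Hop count in a k x k torus: each dimension may go either way round."""
    total = 0
    for a, b in zip(src, dst):
        d = abs(a - b)
        total += min(d, k - d)   # take the shorter direction around the ring
    return total


k = 16
print(mesh_distance((0, 0), (15, 15)))      # 30
print(torus_distance((0, 0), (15, 15), k))  # 2
```

For k = 16 the farthest node from a mesh corner is 30 hops away, while in the torus no node is more than 16 hops from any other.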
Hypercubes vs. Meshes
  Hypercubes have a smaller diameter
  But this is not a real advantage
    – We live in 3D, so some cables have to be long
    – They also overlap other cables
    – Given a certain space, each cable must therefore be thinner (fewer wires in parallel)
    – Result: fewer hops, but each takes more time
  Meshes also have a smaller degree
  Dally, IEEE Trans. Comput., 1990
Network Bisection
  Definition: bisection width is the minimal number of wires that, when cut, will divide the network into two halves
  Assume W wires in each link
    – Binary tree: B = 1 · W
    – Mesh: B = √N · W
    – Hypercube: B = N/2 · W
  Assumption: wire bisection is fixed
  Note: count wires, not links!
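The bisection formulas can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the mesh is taken to be square, and its value of √N links across the cut is the standard result, stated here as an assumption. The function name `bisection_wires` is hypothetical.

```python
import math


def bisection_wires(topology: str, n_nodes: int, w: int) -> int:
    """Wires crossing a minimal cut into two halves, with w wires per link."""
    if topology == "binary_tree":
        return 1 * w                      # cutting one link below the root
    if topology == "mesh":
        side = math.isqrt(n_nodes)
        if side * side != n_nodes:
            raise ValueError("sketch assumes a square mesh")
        return side * w                   # one link per row crosses the cut
    if topology == "hypercube":
        return (n_nodes // 2) * w         # one link per node pair across a dimension
    raise ValueError(f"unknown topology: {topology}")


# With the same 128-wire budget, a 256-node mesh affords 8 wires per link
# where the hypercube affords only 1.
print(bisection_wires("hypercube", 256, 1))  # 128
print(bisection_wires("mesh", 256, 8))       # 128
```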
Network Bisection
  Assume machine packs N nodes in 3D
  Expected traffic proportional to N
  Bisection proportional to N^(2/3) (assuming nodes arranged in a 3D cube)
  May be proportional to N^(1/2) (if arranged in a plane)
Communication Parameters
  N: number of nodes in system
  W: wires in each link
  B: bisection width
  D: average distance between nodes
  L: message length
  p: time to forward header
  Time for message arrival: pD + L/W cycles (assumes wormhole routing = pipelining)
Hypercube vs. Mesh
  Assumptions: N = 256 nodes, B = 128 wires
                Binary 8-cube    16 × 16 mesh
    W           1                8
    D           4                11.6
  pD dominates / L/W dominates
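A worked instance of the pD + L/W formula for the two configurations. The values p = 1 cycle and L = 128 bits are made-up illustration values, not taken from the slides; W and D are the ones listed above.

```python
def wormhole_latency(p: float, d: float, length: float, w: float) -> float:
    """Message arrival time in cycles under wormhole routing: p*D + L/W."""
    return p * d + length / w


P_CYCLES = 1      # assumed header forwarding time per hop
L_BITS = 128      # assumed message length

cube = wormhole_latency(P_CYCLES, 4, L_BITS, 1)      # binary 8-cube: W=1, D=4
mesh = wormhole_latency(P_CYCLES, 11.6, L_BITS, 8)   # 16 x 16 mesh: W=8, D=11.6

print(f"8-cube: pD = 4,    L/W = 128, total = {cube:.1f} cycles")
print(f"mesh:   pD = 11.6, L/W = 16,  total = {mesh:.1f} cycles")
```

Under these assumed values the narrow links of the hypercube make the L/W term by far the larger one, while in the mesh the two terms are of similar size.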
A Generalization
  K-ary N-cubes
    – N dimensions as in a hypercube
    – K nodes along each dimension as in a mesh
  Hypercubes are 2-ary N-cubes
  Meshes are K-ary 2-cubes
  Good tradeoffs provided by 2, 3, and maybe 4 dimensions
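A short sketch of how the two parameters determine size and diameter. Lowercase `k` and `n` are used for the radix and the number of dimensions to avoid clashing with N as the node count; the function names and the `wraparound` flag are assumptions of this example.

```python
def node_count(k: int, n: int) -> int:
    """Number of nodes in a k-ary n-cube."""
    return k ** n


def diameter(k: int, n: int, wraparound: bool = False) -> int:
    """Maximal hop count: per-dimension worst case times n dimensions."""
    per_dimension = k // 2 if wraparound else k - 1
    return n * per_dimension


print(node_count(2, 8), diameter(2, 8))                    # 256 8   (binary 8-cube)
print(node_count(16, 2), diameter(16, 2))                  # 256 30  (16 x 16 mesh)
print(node_count(16, 2), diameter(16, 2, wraparound=True)) # 256 16  (16 x 16 torus)
```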
Capacity Problem
  Each message must occupy several links on the way to its destination, depending on the distance
  If all nodes attempt to transmit at the same time, there will not be enough capacity
  Possible solution: diluted networks, where only some nodes have PEs attached
  Alternative solution: multistage networks and crossbars
Multistage Networks
  Organized as several (logarithmic) stages of switches
  Various interconnection patterns among the stages
  Diameter: log N
  Switch degree: constant
  Cost: O(N log N)
    – Constant depends on switch degree
  Used in IBM SP2
Example: Cube Network
  Cube-like pattern among stages
  Routing according to address bits
    – 0 = top exit
    – 1 = bottom exit
  Example: go from 0010 to 1100
  Or from anywhere else!
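A minimal sketch of routing by destination address bits. It assumes 2×2 switches, one stage per address bit, and that the most significant bit is consumed first; the slide does not say which bit each stage uses. The function name `exit_ports` is hypothetical.

```python
def exit_ports(dst: int, n_bits: int) -> list[str]:
    """Exit taken at each stage: 0 selects the top exit, 1 the bottom exit."""
    ports = []
    for stage in range(n_bits):
        bit = (dst >> (n_bits - 1 - stage)) & 1   # assumed order: MSB first
        ports.append("bottom" if bit else "top")
    return ports


print(exit_ports(0b1100, 4))  # ['bottom', 'bottom', 'top', 'top']
```

The source address never appears in the computation, which is why the same exits lead to 1100 from 0010 or from anywhere else.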
Problems
  Only one path from each source to each destination
    – No fault tolerance
    – Susceptible to congestion
  A hot spot can block the whole network (tree saturation)
  Popular patterns also lead to congestion
Solution
  Use extra stages
  Obviously increases cost
Fat-Tree
  Tree: routing through common ancestor
  Problem: root becomes a bottleneck
  Fat-tree: make top of tree fatter (i.e. with more bandwidth)
  Implementation: most commonly a multistage network with multiple “roots”
  Adaptiveness by selection of root
  Used in Connection Machine CM-5
  Leiserson, IEEE Trans. Comput., 1989
Crossbars
  A switch for each possible connection
  Diameter: 1
  Degree: 4
  Cost: N^2 switches
  No congestion if destination is free
  Used in Earth Simulator
Irregular Networks
  It is now possible to buy switches and cables to construct your own network
    – Myrinet
    – Quadrics
    – Switched giga/fast Ethernet
  Multistage network topologies are often recommended
  But you can connect the switches in arbitrary ways too
Myrinet Components
  [Figure labels: processor, memory, PCI bus, NIC (LANai processor, memory), switch, connections to other nodes and switches; Myrinet control program, routing table, communication buffers]
  Boden et al., IEEE Micro, 1995
Source Routing
  Routes decided by the source node
  Routing instructions included in packet header
  Typically several bits for each switch along the way, specifying which exit port to use
  Allows easier control by nodes
  Also simpler switches
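A sketch of how a source-routed header might be packed and consumed. The layout is hypothetical: a fixed number of bits per hop, with the first hop in the lowest bits, and each switch stripping its own field. Real interconnects such as Myrinet define their own header formats, which this does not reproduce.

```python
def build_header(ports: list[int], bits_per_hop: int) -> int:
    """Pack the exit port for each switch on the route into one integer."""
    header = 0
    for port in reversed(ports):          # first hop ends up in the low bits
        if not 0 <= port < (1 << bits_per_hop):
            raise ValueError("port does not fit in the per-hop field")
        header = (header << bits_per_hop) | port
    return header


def next_hop(header: int, bits_per_hop: int) -> tuple[int, int]:
    """What a switch does: read its exit port and strip it from the header."""
    port = header & ((1 << bits_per_hop) - 1)
    return port, header >> bits_per_hop


header = build_header([3, 0, 5], bits_per_hop=3)
for _ in range(3):
    port, header = next_hop(header, bits_per_hop=3)
    print("exit through port", port)
```

The switch logic is a mask and a shift, with no routing table lookup, which is the sense in which source routing allows simpler switches.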
Exercise
  Adaptive routing requires switches that make routing decisions themselves
  In source routing, routing is decided at the source, and switches just operate according to instructions
  Is it possible to create some form of adaptive source routing?
    – What are the goals?
    – Can they be achieved using source routing?
    – If not, can they be approximated?
Routing on Irregular Networks
  Find minimal routes (e.g. using Dijkstra’s algorithm)
  Problem: may lead to deadlock (assuming mutual blocking at switches due to buffer constraints)
Up-Down Routing
  Give a direction to each edge in the network
  Routing first goes with the direction, and then against it
  This prevents cycles, and hence deadlocks
  Need to ensure connectivity
    – Simple option: start with a spanning tree
    – But might lead to congestion at the root
    – Also many non-minimal routes
  Schroeder, IEEE J. Select. Areas Comm., 1991
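A minimal sketch of the up-down rule, assuming edge directions are derived from a breadth-first spanning tree: a hop goes "up" if it leads to a node with a lower BFS level, with ties broken by node id. The helper names and the example graph are hypothetical.

```python
from collections import deque


def bfs_levels(adj: dict[int, list[int]], root: int) -> dict[int, int]:
    """Distance of every node from the root of the spanning tree."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def is_up(u: int, v: int, level: dict[int, int]) -> bool:
    """True if the hop u -> v heads toward the root (ties broken by id)."""
    return (level[v], v) < (level[u], u)


def is_legal(path: list[int], level: dict[int, int]) -> bool:
    """A legal route is zero or more up hops followed by zero or more down hops."""
    gone_down = False
    for u, v in zip(path, path[1:]):
        if is_up(u, v, level):
            if gone_down:
                return False      # an up hop after a down hop is forbidden
        else:
            gone_down = True
    return True


adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
level = bfs_levels(adj, root=0)
print(is_legal([3, 1, 0, 2], level))  # True: up, up, down
print(is_legal([1, 3, 2], level))     # False: down, then up
```

The second route is the shorter way from 1 to 2 via 3 in hop terms of that detour, yet it is rejected, which illustrates why up-down routing can force non-minimal routes.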
Flow Control
Switching Mechanisms
  Circuit switching
    – Create a dedicated circuit and then transmit data on it
    – Like traditional telephone network
  Message switching
    – Store and forward the whole message at each switch
  Packet switching
    – Partition message into fixed-size packets and send them independently
    – Guarantee buffer availability at receiver
    – Pipeline the packets to reduce latency
    – But need to reconstruct the message
Three Levels
  Message
    – Unit of communication at the application level
  Packet
    – Unit of transmission
    – Each packet has a header and is routed independently
    – Large messages partitioned into multiple packets
  Flit
    – Unit of flow control
    – Each packet divided into flits that follow each other
Moving Packets
  Store and forward of complete packets
    – Latency depends on number of hops
  Virtual cut-through: start forwarding as soon as header is decoded
    – Allow for overlap of transmission on consecutive links
    – Still need buffers big enough for full packets in case one gets stuck
  Wormhole routing: block in place
    – Reduce required buffer size at cost of additional blocking
  Dally, VLSI & Parallel Computation chap. 3, 1990
Store and Forward
  [Figure: packet moving from source to destination; packet latency of L/W per hop, over D hops]
Virtual Cut Through
  [Figure: flits pipelined from source to destination; latency L/W + D]
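The two latency figures can be compared numerically. This sketch assumes the standard store-and-forward cost of a full packet time L/W on each of the D hops, and one cycle per hop for the header under cut-through (p = 1). The example values are made up.

```python
def store_and_forward_latency(d: int, length: int, w: int) -> float:
    """Whole packet is received before being sent on: D hops of L/W cycles each."""
    return d * (length / w)


def cut_through_latency(d: int, length: int, w: int) -> float:
    """Header advances one hop per cycle, the body streams behind: L/W + D."""
    return length / w + d


D_HOPS, L_BITS, W_WIRES = 4, 128, 8   # assumed illustration values
print(store_and_forward_latency(D_HOPS, L_BITS, W_WIRES))  # 64.0
print(cut_through_latency(D_HOPS, L_BITS, W_WIRES))        # 20.0
```

With pipelining the distance contributes an additive term instead of a multiplicative one, so latency is much less sensitive to the number of hops.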
Wormhole Routing
  Wormhole routing operates on flits, the units of flow control
    – Typically what can be transmitted in a single cycle
    – Equal to the number of wires in each link
  Packet header is typically one or two flits
  Other flits don’t contain routing information
    – They must follow previous flits without interleaving with flits of another packet
  Each switch only has buffer space for a small number of flits
Collisions
  What happens if a stream of flits arrives at a switch, and the desired output port is busy?
  Store whole packet in a buffer (called virtual cut through)
  Block in place across multiple switches (called wormhole routing)
  Drop the data
    – Resources are lost!!!
  Misroute: keep moving, but in the wrong direction
Throughput and Load
  [Figure: throughput (output) plotted against offered load (input), marking the saturation point, with separate curves for blocking or buffering and for dropping]
Throttling
  Sources are usually permitted to inject traffic as fast as they wish
  Throttling slows them down by placing a limit on the rate of injecting new traffic
  This puts a cap on the maximal throughput possible
  But it also prevents excessive congestion, and may lead to better overall performance
Deadlock
Deadlocks
  In wormhole routing, packets hold switch resources while they move
    – Flit buffers
    – Output ports
  Another packet may arrive that needs the same resources
  Cyclic dependencies may lead to deadlock
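A sketch of checking for cyclic dependencies, assuming the situation is modeled as a directed graph in which each channel points to the channels its current packet is waiting for. The graph encoding and the name `has_cycle` are assumptions of this example; the check itself is an ordinary depth-first search.

```python
def has_cycle(deps: dict[str, list[str]]) -> bool:
    """Return True if the channel dependency graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on the DFS stack, finished
    color: dict[str, int] = {}

    def visit(channel: str) -> bool:
        color[channel] = GREY
        for nxt in deps.get(channel, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:          # reached a channel already on the stack
                return True
            if state == WHITE and visit(nxt):
                return True
        color[channel] = BLACK
        return False

    return any(color.get(c, WHITE) == WHITE and visit(c) for c in deps)


# Four packets, each holding one channel and waiting for the next one around.
ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
print(has_cycle(ring))                    # True: a potential deadlock
print(has_cycle({**ring, "c3": []}))      # False: the chain can drain
```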
Deadlocks
Dependencies
  Deadlocks are the most dramatic problem
  But dependencies can also just lead to inefficiency
    – A blocked packet still holds its channels (because flits need to stay contiguous to maintain routing)
    – Another packet may be able to utilize these channels
Inefficiency
Virtual Channels
  Divide the buffers in each switch into several virtual channels
  Each virtual channel also has its own state and routing information
  Virtual channels share the use of physical resources
  Dally, IEEE Trans. Par. Dist. Syst., 1992
Efficiency!
  Red packet occupies some (not all!!!) buffer space
  Green packet actually uses link
Deadlocks Again
  Virtual channels can also be used to solve the deadlock problem
    – In a network with diameter D, create D virtual channels on each link
    – Newly injected messages can only use virtual channel no. 1
    – Packets coming on virtual channel i can only move to virtual channel i+1
    – Virtual channels used are strictly ordered, so no cycles
    – This version limits flexibility, and is hence inefficient
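A minimal sketch of the ordered virtual channel scheme, assuming minimal routes so that no path is longer than the diameter. The function name `assign_virtual_channels` and the path representation are hypothetical.

```python
def assign_virtual_channels(path: list[int], diameter: int) -> list[tuple[tuple[int, int], int]]:
    """Pair each hop of a route with the virtual channel it uses: 1, 2, 3, ..."""
    hops = list(zip(path, path[1:]))
    if len(hops) > diameter:
        raise ValueError("route is longer than the number of virtual channels")
    # The k-th hop always uses virtual channel k, so a packet holding
    # channel i can only ever wait for channel i + 1.
    return [(link, vc) for vc, link in enumerate(hops, start=1)]


for link, vc in assign_virtual_channels([0, 1, 2, 3], diameter=3):
    print(f"link {link} on virtual channel {vc}")
```

Every dependency points from a lower channel number to a higher one, so a cycle of waiting packets cannot form; the price is that most channels with high numbers sit idle most of the time.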
Dally’s Methodology
  Create a routing function using minimal paths
  Remove arcs to make this acyclic, and hence deadlock free
  If this results in disconnecting the routing function, duplicate links using virtual channels
  Dally, IEEE Trans. Comput., 1987
Duato’s Methodology
  Start with a deadlock-free routing function (e.g. Dally’s)
  Duplicate all channels by virtualization
  The extended routing function allows use of the new or the original channels; but once an original channel is used, you cannot revert to new channels
  Works for any topology
  Duato, IEEE Trans. Par. Dist. Syst., 1993
Performance
Methodology
  Network simulation
    – Network topology
    – Routing algorithm
    – Packet-level simulation of flow control (including virtual channels)
  Workloads
    – Synthetic patterns: uniformly distributed random destinations, hot spots
    – Real applications: requires co-simulation of application and network
  Experiment with different options
Simulation Results
  Metrics:
    – Throughput: packets per second delivered, or fraction of capacity supported
    – Latency: delay in delivering packets
  Results:
    – Adaptive routing and virtual channels improve both metrics under loaded conditions
However…
  This does not necessarily translate into improved application performance
  Realistic applications typically do not create sufficiently high communication loads
    – There is not a lot of congestion
    – So overcoming it is not an issue
  Supporting virtual channels and adaptive routing comes at a cost
    – Switches are more complex and therefore slower
  Vaidya et al., IEEE Trans. Parallel Distrib. Syst., 2001
The Bottom Line
  Virtual channels are useful for deadlock prevention
  Virtual channels and adaptive routing may hurt performance more than they improve it
  It may be more important to make switches fast
Switching
Switch Elements
  Input ports
  Output ports
  Buffers
  A crossbar (Xbar) connecting the inputs to the outputs
Input Buffering
  Buffers are associated with input ports
  If desired output port is busy, no more data enters
  Suffers from head-of-line blocking
Output Buffering
  Buffers are associated with output ports
  Packets block only if their desired output is busy
Central Queue
  Queues are associated with output ports
  Buffer space is shared
    – More for busier inputs
    – More for busier outputs
  Stunkel et al., IBM Syst. J., 1995