©2003 Dror Feitelson Parallel Computing Systems Part II: Networks and Routing Dror Feitelson Hebrew University

©2003 Dror Feitelson The Issues Topology –The road map: what connects to what Routing –Selecting a route to the desired destination Flow Control –Traffic lights: who gets to go Switching –The mechanics of moving bits Dally, VLSI & Parallel Computation chap. 3, 1990

©2003 Dror Feitelson Topologies and Routing

©2003 Dror Feitelson The Model Network is separate from processors (In old systems nodes did the switching) Network composed of switches and links Each processor connected to some switch

©2003 Dror Feitelson Considerations Diameter –Expected to correlate with maximal latency Switch degree –Harder to implement switches with high degree Capacity –Potential of serving multiple communications at once Number of switches –Obviously affects network cost Existence of a simple routing function –Implemented in hardware

©2003 Dror Feitelson Hypercubes Multi-dimensional cubes n dimensions, N = 2^n nodes Nodes identified by n-bit numbers Each node connected to the n nodes whose numbers differ from its own in a single bit Degree: log N Diameter: log N Cost: N switches (one per node) Used in Intel iPSC, nCUBE

©2003 Dror Feitelson Recursive Construction

©2003 Dror Feitelson Node Numbering Each node has an n-bit number Each bit corresponds to a dimension of the hypercube

©2003 Dror Feitelson Routing Given source and destination, correct one bit at a time Example: go from 001 to 110 In what order?
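A minimal Python sketch of this bit-fixing scheme (here fixing differing bits lowest-dimension-first, the deterministic e-cube order; any fixed order yields a valid route):

```python
def hypercube_route(src: int, dst: int, n: int):
    """Route in an n-dimensional hypercube by correcting one bit at a
    time, lowest dimension first (deterministic e-cube order)."""
    route = [src]
    node = src
    for dim in range(n):
        if (node ^ dst) & (1 << dim):   # this address bit still differs
            node ^= 1 << dim            # traverse the link in that dimension
            route.append(node)
    return route

# The slide's example: 001 -> 110 (3 hops, one per differing bit)
print([format(v, "03b") for v in hypercube_route(0b001, 0b110, 3)])
# ['001', '000', '010', '110']
```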

©2003 Dror Feitelson Mesh N nodes arranged in a rectangle or square Each node connected to 4 neighbors Diameter: 2(√N - 1) for a √N × √N square Degree: 4 Cost: N switches (one per node) Used in Intel Paragon

©2003 Dror Feitelson Routing Each node identified by x,y coordinates X-Y routing: first along one dimension, then along the other Deadlock prevention –Always route along X dimension first –Turn model: disallow only one turn –Odd-even turns: disallow different turns in odd or even rows/columns
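A sketch of X-Y routing (coordinates are illustrative, not any particular machine's): finishing the X dimension before turning into Y is exactly what rules out the turns that could close a dependency cycle.

```python
def xy_route(src, dst):
    """Dimension-order (X-Y) routing on a 2D mesh: first fully correct
    the X coordinate, then the Y coordinate. Deterministic and, because
    Y-to-X turns never occur, deadlock-free."""
    (x, y), (dx, dy) = src, dst
    route = [(x, y)]
    while x != dx:                       # phase 1: move along X only
        x += 1 if dx > x else -1
        route.append((x, y))
    while y != dy:                       # phase 2: then along Y only
        y += 1 if dy > y else -1
        route.append((x, y))
    return route

print(xy_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```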

©2003 Dror Feitelson Adaptive Routing Decide route on-line according to conditions –Avoid congested links –Circumvent failed links Dally, IEEE Trans. Par. Dist. Syst., 1993

©2003 Dror Feitelson Congestion Avoidance Desired pattern

©2003 Dror Feitelson Congestion Avoidance Dimension order routing: first along X, then along Y Congestion at top right (assuming pipelining)

©2003 Dror Feitelson Congestion Avoidance Adaptive routing: disjoint paths can be used Throughput increased 7-fold

©2003 Dror Feitelson Fault Tolerance Example: dimension-order routing

©2003 Dror Feitelson Fault Tolerance Example: dimension-order routing Fault causes many nodes to be inaccessible

©2003 Dror Feitelson Adaptive Routing Decide route on-line according to conditions –Avoid congested links –Circumvent failed links In a mesh, the source and destination nodes define a rectangle containing all minimal routes Also possible to use non-minimal routes
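A small sketch of the minimal-adaptive idea: at every node, any hop that makes progress toward the destination stays inside that rectangle, so the router can choose among them on-line (the selection criterion, e.g. congestion, is left abstract here):

```python
def productive_hops(cur, dst):
    """All next hops on a 2D mesh that stay on some minimal route, i.e.
    inside the rectangle spanned by the current node and the destination.
    A real adaptive router would pick among these by congestion or faults."""
    (x, y), (dx, dy) = cur, dst
    hops = []
    if x != dx:
        hops.append((x + (1 if dx > x else -1), y))
    if y != dy:
        hops.append((x, y + (1 if dy > y else -1)))
    return hops

print(productive_hops((1, 1), (3, 4)))   # [(2, 1), (1, 2)] -- two legal choices
```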

©2003 Dror Feitelson Torus Mesh with wrap-around Reduces diameter to about √N (roughly half that of the mesh) In 2D this is topologically a donut But… –Harder to prevent deadlocks –Longer cables than simple mesh –Harder to partition Used in Cray T3D/T3E

©2003 Dror Feitelson Hypercubes vs. Meshes Hypercubes have a smaller diameter But this is not a real advantage –We live in 3D, so some cables have to be long –They also overlap other cables –Given a certain space, each cable must therefore be thinner (fewer wires in parallel) –Result: fewer hops, but each takes more time Meshes also have a smaller degree Dally, IEEE Trans. Comput., 1990

©2003 Dror Feitelson Network Bisection Definition: bisection width is the minimal number of wires that, when cut, will divide the network into two halves Assume W wires in each link –Binary tree: B = 1 · W –Mesh: B = √N · W –Hypercube: B = N/2 · W Assumption: wire bisection is fixed Note: count wires, not links!
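The slide's formulas, spelled out as a sketch (assuming a square mesh and a power-of-two hypercube):

```python
import math

def bisection_width(topology: str, N: int, W: int) -> int:
    """Bisection width in wires for the three example topologies."""
    if topology == "binary tree":
        return 1 * W                  # one link crosses the cut at the root
    if topology == "mesh":
        return math.isqrt(N) * W      # cut the sqrt(N) links of a middle row
    if topology == "hypercube":
        return (N // 2) * W           # N/2 links join the two half-cubes
    raise ValueError(topology)

for t in ("binary tree", "mesh", "hypercube"):
    print(t, bisection_width(t, N=256, W=1))
# binary tree 1, mesh 16, hypercube 128
```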

©2003 Dror Feitelson Network Bisection Assume machine packs N nodes in 3D Expected traffic proportional to N Bisection proportional to N^(2/3) (assuming arranged in 3D cube) May be proportional to N^(1/2) (if arranged in plane)

©2003 Dror Feitelson Communication Parameters N: number of nodes in system W: wires in each link B: bisection width D: average distance between nodes L: message length p: time to forward header Time for message arrival: pD + L/W cycles (assumes wormhole routing = pipelining)

©2003 Dror Feitelson Hypercube vs. Mesh
Assumptions: N = 256 nodes, B = 128 wires
                        Binary 8-cube    16 × 16 mesh
W (wires per link)            1                8
D (average distance)          4               11.6
Dominant term           L/W dominates    pD dominates
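Plugging the table into the previous slide's formula, with illustrative values p = 1 cycle per hop and L = 128 bits (assumptions, not from the slide), shows how the terms split:

```python
def message_latency(p, D, L, W):
    """Cycles for a message to arrive under wormhole routing: p*D for the
    header to cross D hops, plus L/W to stream the body through."""
    return p * D + L / W

print(message_latency(p=1, D=4,    L=128, W=1))   # 8-cube: 132.0 -- L/W = 128 dominates
print(message_latency(p=1, D=11.6, L=128, W=8))   # mesh:    27.6 -- pD and L/W comparable
```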

©2003 Dror Feitelson A Generalization K-ary N-cubes –N dimensions as in a hypercube –K nodes along each dimension as in a mesh Hypercubes are 2-ary N-cubes Meshes are K-ary 2-cubes Good tradeoffs provided by 2, 3, and maybe 4 dimensions
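A sketch of the trade-off space (assuming wraparound links, i.e. tori; for k = 2 the two neighbors in each dimension coincide):

```python
def kary_ncube(k: int, n: int):
    """Node count, diameter, and switch degree of a k-ary n-cube with
    wraparound. For k = 2 both neighbors in a dimension are the same
    node, so the degree is n rather than 2n."""
    nodes = k ** n
    diameter = n * (k // 2)
    degree = n if k == 2 else 2 * n
    return nodes, diameter, degree

print(kary_ncube(2, 8))    # hypercube:    (256, 8, 8)
print(kary_ncube(16, 2))   # 16x16 torus:  (256, 16, 4)
```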

©2003 Dror Feitelson Capacity Problem Each message must occupy several links on the way to its destination, depending on the distance If all nodes attempt to transmit at the same time, there will not be enough link capacity Possible solution: diluted networks – only some nodes have PEs attached Alternative solution: multistage networks and crossbars

©2003 Dror Feitelson Multistage Networks Organized as several stages of switches (logarithmic in the number of nodes) Various interconnection patterns among the stages Diameter: log N Switch degree: constant Cost: O(N log N) –Constant depends on switch degree Used in IBM SP2

©2003 Dror Feitelson Example: Cube Network Cube-like pattern among stages Routing according to address bits  0 = top exit  1 = bottom exit Example: go from 0010 to 1100 Or from anywhere else!
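A sketch of this destination-tag routing: each stage consumes one bit of the destination address (MSB first here, an assumption; networks differ in which bit each stage uses). The source address never enters the rule, which is why the same tag works "from anywhere else".

```python
def destination_tag_route(dst: int, stages: int):
    """Exit choices through a multistage cube network: at each stage,
    bit = 0 means take the top exit, bit = 1 the bottom exit."""
    bits = format(dst, f"0{stages}b")           # MSB consumed at stage 0
    return ["top" if b == "0" else "bottom" for b in bits]

# The slide's example: route to destination 1100, from 0010 or anywhere else
print(destination_tag_route(0b1100, stages=4))
# ['bottom', 'bottom', 'top', 'top']
```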

©2003 Dror Feitelson Problems Only one path from each source to each destination –No fault tolerance –Susceptible to congestion Hot-spot can block the whole network (tree saturation) Popular patterns also lead to congestion

©2003 Dror Feitelson Solution Use extra stages Obviously increases cost

©2003 Dror Feitelson Fat-Tree Implementation Tree: routing through common ancestor Problem: root becomes a bottleneck Fat-tree: make top of tree fatter (i.e. with more bandwidth) Most commonly implemented using a multistage network with multiple “roots” Adaptiveness by selection of root Used in Connection Machine CM-5 Leiserson, IEEE Trans. Comput., 1985

©2003 Dror Feitelson Crossbars A switch for each possible connection Diameter: 1 Degree: 4 Cost: N^2 switches No congestion if destination is free Used in Earth Simulator

©2003 Dror Feitelson Irregular Networks It is now possible to buy switches and cables to construct your own network –Myrinet –Quadrics –Switched giga/fast Ethernet Multistage network topologies are often recommended But you can connect the switches in arbitrary ways too

©2003 Dror Feitelson Myrinet Components [diagram: the host processor and memory sit on the PCI bus together with the NIC; the NIC carries a LANai processor and local memory holding the Myrinet control program and communication buffers; the NIC connects to a switch, which holds a routing table and connections to other nodes and switches] Boden et al., IEEE Micro, 1995

©2003 Dror Feitelson Source Routing Routes decided by the source node Routing instructions included in packet header Typically several bits for each switch along the way, specifying which exit port to use Allows easier control by nodes Also simpler switches
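A toy sketch of the mechanism (port numbers and the header layout are hypothetical): the source prepends one exit-port entry per switch along the path, and each switch just pops the first entry.

```python
def build_packet(exit_ports, payload: bytes):
    """Source routing: the header is simply the list of output ports to
    take, one entry per switch along the precomputed route."""
    return list(exit_ports), payload

def switch_forward(packet):
    """All a switch must do: pop its routing entry and use it."""
    header, payload = packet
    port = header.pop(0)              # this switch's exit port
    return port, (header, payload)

packet = build_packet([3, 0, 2], b"data")
port, packet = switch_forward(packet)
print(port, packet)                   # 3 ([0, 2], b'data')
```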

©2003 Dror Feitelson Exercise Adaptive routing requires switches that make routing decisions themselves In source routing, routing is decided at the source, and switches just operate according to instructions Is it possible to create some form of adaptive source routing? –What are the goals? –Can they be achieved using source routing? –If not, can they be approximated?

©2003 Dror Feitelson Routing on Irregular Networks Find minimal routes (e.g. using Dijkstra’s algorithm) Problem: may lead to deadlock (assuming mutual blocking at switches due to buffer constraints)

©2003 Dror Feitelson Up-Down Routing Give a direction to each edge in the network Routing first goes with the direction, and then against it This prevents cycles, and hence deadlocks Need to ensure connectivity –Simple option: start with a spanning tree –But might lead to congestion at the root –Also many non-minimal routes Schroeder, IEEE J. Select Areas Comm., 1991
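A sketch of the rule, using a BFS spanning tree from an assumed root to orient the edges: "up" points toward the root, and a route is legal only if all of its up hops precede all of its down hops.

```python
from collections import deque

def orient_edges(adj, root=0):
    """Assign BFS levels from the root; the 'up' end of an edge is the
    endpoint closer to the root (ties broken by node id)."""
    level = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                q.append(v)
    return lambda u, v: (level[v], v) < (level[u], u)   # True if u->v goes up

def legal_updown(route, is_up):
    """Up-down rule: once a route has gone down, it may never go up again.
    This breaks every cycle of channel dependencies, hence no deadlock."""
    gone_down = False
    for u, v in zip(route, route[1:]):
        if is_up(u, v) and gone_down:
            return False
        gone_down = gone_down or not is_up(u, v)
    return True

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
is_up = orient_edges(adj)
print(legal_updown([3, 1, 0, 2], is_up))   # True:  up, up, down
print(legal_updown([1, 3, 2], is_up))      # False: down, then up
```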

©2003 Dror Feitelson Flow Control

©2003 Dror Feitelson Switching Mechanisms Circuit switching –Create a dedicated circuit and then transmit data on it –Like traditional telephone network Message switching –Store and forward the whole message at each switch Packet switching –Partition message into fixed-size packets and send them independently –Guarantee buffer availability at receiver –Pipeline the packets to reduce latency –But need to reconstruct the message

©2003 Dror Feitelson Three Levels Message –Unit of communication at the application level Packet –Unit of transmission –Each packet has a header and is routed independently –Large messages partitioned into multiple packets Flit –Unit of flow control –Each packet divided into flits that follow each other

©2003 Dror Feitelson Moving Packets Store and forward of complete packets –Latency depends on number of hops Virtual cut-through: start forwarding as soon as header is decoded –Allow for overlap of transmission on consecutive links –Still need buffers big enough for full packets in case one gets stuck Wormhole routing: block in place –Reduce required buffer size at cost of additional blocking Dally, VLSI & Parallel Computation chap. 3, 1990

©2003 Dror Feitelson Store and Forward [time-space diagram: each switch receives the whole packet before forwarding it] Packet latency from source to destination: (L/W) × D

©2003 Dror Feitelson Virtual Cut Through [time-space diagram: flits pipeline across consecutive links behind the header] Flit latency from source to destination: L/W + D
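The two formulas side by side, with illustrative numbers (a 128-byte packet, byte-wide links, 10 hops; all assumed):

```python
def store_and_forward(L, W, D):
    return (L / W) * D        # the whole packet is retransmitted at each hop

def virtual_cut_through(L, W, D):
    return L / W + D          # only the header pays per hop; the body pipelines

L, W, D = 128 * 8, 8, 10      # bits, wires per link, hops
print(store_and_forward(L, W, D))     # 1280.0 cycles
print(virtual_cut_through(L, W, D))   #  138.0 cycles
```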

©2003 Dror Feitelson Wormhole Routing Wormhole routing operates on flits – the units of flow control –Typically what can be transmitted in a single cycle –Equal to the number of wires in each link Packet header is typically one or two flits Flits don’t contain routing information –They must follow previous flits without interleaving with flits of another packet Each switch only has buffer space for a small number of flits

©2003 Dror Feitelson Collisions What happens if a stream of flits arrives at a switch, and the desired output port is busy? Store the whole packet in a buffer (called virtual cut-through) Block in place across multiple switches (called wormhole routing) Drop the data – resources are lost! Misroute: keep moving, but in the wrong direction

©2003 Dror Feitelson Throughput and Load [graph: throughput (output) vs. offered load (input); throughput rises with load until saturation, then levels off with blocking or buffering but falls off with dropping]

©2003 Dror Feitelson Throttling Sources are usually permitted to inject traffic as fast as they wish Throttling slows them down by placing a limit on the rate of injecting new traffic This puts a cap on the maximal throughput possible But it also prevents excessive congestion, and may lead to better overall performance
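One common way to implement such a rate limit is a token bucket at each source; this is a sketch of that idea, not a mechanism the slide prescribes:

```python
class Throttle:
    """Token-bucket injection throttle: a source may inject at most
    `rate` flits per cycle on average, with bursts up to `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst

    def tick(self):
        """Called once per cycle to replenish the budget."""
        self.tokens = min(self.burst, self.tokens + self.rate)

    def try_inject(self) -> bool:
        """Inject one flit if the budget allows; otherwise the source waits."""
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

t = Throttle(rate=0.5, burst=2)
sent = []
for _ in range(6):
    sent.append(t.try_inject())
    t.tick()
print(sent)   # [True, True, True, False, True, False] -- capped near 0.5/cycle
```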

©2003 Dror Feitelson Deadlock

©2003 Dror Feitelson Deadlocks In wormhole routing, packets hold switch resources while they move –Flit buffers –Output ports Another packet may arrive that needs the same resources Cyclic dependencies may lead to deadlock

©2003 Dror Feitelson Deadlocks

©2003 Dror Feitelson Dependencies Deadlocks are the most dramatic problem But dependencies can also just lead to inefficiency –A blocked packet still holds its channels (because flits need to stay contiguous to maintain routing) –Another packet might have been able to utilize these channels

©2003 Dror Feitelson Inefficiency

©2003 Dror Feitelson Virtual Channels Divide the buffers in each switch into several virtual channels Each virtual channel also has its own state and routing information Virtual channels share the use of physical resources Dally, IEEE Trans. Par. Dist. Syst., 1992

©2003 Dror Feitelson Efficiency! Red packet occupies some (not all!!!) buffer space Green packet actually uses link

©2003 Dror Feitelson Deadlocks Again Virtual channels can also be used to solve the deadlock problem –In a network with diameter D, create D virtual channels on each link –Newly injected messages can only use virtual channel no. 1 –Packets coming on virtual channel i can only move to virtual channel i+1 –Virtual channels used are strictly ordered, so no cycles –This version limits flexibility, and is hence inefficient
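A sketch of the channel assignment this scheme implies: the virtual channel number is simply the hop index, so dependencies always point from lower- to higher-numbered channels and can never form a cycle.

```python
def assign_virtual_channels(route):
    """Annotate each hop of a route with its virtual channel: injected
    packets start on VC 1 and move from VC i to VC i+1 at every switch.
    A route of length <= D therefore needs at most D virtual channels."""
    return [(u, v, vc) for vc, (u, v) in enumerate(zip(route, route[1:]), start=1)]

for hop in assign_virtual_channels(["A", "B", "C", "D"]):
    print(hop)   # ('A', 'B', 1)  ('B', 'C', 2)  ('C', 'D', 3)
```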

©2003 Dror Feitelson Dally’s Methodology Create a routing function using minimal paths Remove arcs to make this acyclic, and hence deadlock free If this results in disconnecting the routing function, duplicate links using virtual channels Dally, IEEE Trans. Comput., 1987

©2003 Dror Feitelson Duato’s Methodology Start with a deadlock-free routing function (e.g. Dally’s) Duplicate all channels by virtualization The extended routing function allows use of the new or the original channels; but once an original channel is used, you cannot revert to new channels Works for any topology Duato, IEEE Trans. Par. Dist. Syst., 1993

©2003 Dror Feitelson Performance

©2003 Dror Feitelson Methodology Network simulation –Network topology –Routing algorithm –Packet-level simulation of flow control (including virtual channels) Workloads –Synthetic patterns Uniformly distributed random destinations Hot spots –Real applications Requires co-simulation of application and network Experiment with different options

©2003 Dror Feitelson Simulation Results Metrics: –Throughput: packets per second delivered, or fraction of capacity supported –Latency: delay in delivering packets Results: –Adaptive routing and virtual channels improve both metrics under loaded conditions

©2003 Dror Feitelson However… This does not necessarily translate into improved application performance Realistic applications typically do not create sufficiently high communication loads –There is not a lot of congestion –So overcoming it is not an issue Supporting virtual channels and adaptive routing comes at a cost –Switches are more complex and therefore slower Vaidya et al., IEEE Trans. Parallel Distrib. Syst., 2001

©2003 Dror Feitelson The Bottom Line Virtual channels are useful for deadlock prevention Virtual channels and adaptive routing may hurt performance more than they improve it It may be more important to make switches fast

©2003 Dror Feitelson Switching

©2003 Dror Feitelson Switch Elements Input ports Output ports Buffers A crossbar connecting the inputs to the outputs

©2003 Dror Feitelson Input Buffering Buffers are associated with input ports If desired output port is busy, no more data enters Suffers from head-of-line blocking
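A toy arbitration cycle showing the effect (queue contents are hypothetical; each packet is represented just by its desired output port):

```python
def arbitrate(input_queues, busy_outputs):
    """One cycle of an input-buffered switch: each input FIFO may offer
    only its head packet. A packet stuck behind a blocked head must
    wait, even if its own output port is idle."""
    granted = set()
    for q in input_queues:
        if q and q[0] not in busy_outputs and q[0] not in granted:
            granted.add(q.pop(0))
    return granted

# Input 0 holds packets for outputs [2, 3]; output 2 is busy, so the
# packet for the idle output 3 is stuck behind it (head-of-line blocking).
queues = [[2, 3], [1]]
print(arbitrate(queues, busy_outputs={2}))   # {1}
print(queues)                                # [[2, 3], []]
```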

©2003 Dror Feitelson Output Buffering Buffers are associated with output ports Packets block only if their desired output is busy

©2003 Dror Feitelson Central Queue Queues are associated with output ports Buffer space is shared –More for busier inputs –More for busier outputs Stunkel et al., IBM Syst. J., 1995