Advanced Computer Architecture 5MD00 / 5Z032 Multi-Processing 2

Slides:



Advertisements
Similar presentations
Comparison Of Network On Chip Topologies Ahmet Salih BÜYÜKKAYHAN Fall.
Advertisements

Shantanu Dutt Univ. of Illinois at Chicago
1 Lecture 23: Interconnection Networks Topics: communication latency, centralized and decentralized switches (Appendix E)
Interconnection Networks 1 Interconnection Networks (Chapter 6) References: [1,Wilkenson and Allyn, Ch. 1] [2, Akl, Chapter 2] [3, Quinn, Chapter 2-3]
1 CSE 591-S04 (lect 14) Interconnection Networks (notes by Ken Ryu of Arizona State) l Measure –How quickly it can deliver how much of what’s needed to.
NUMA Mult. CSE 471 Aut 011 Interconnection Networks for Multiprocessors Buses have limitations for scalability: –Physical (number of devices that can be.
Interconnection Network Topologies
3. Interconnection Networks. Historical Perspective Early machines were: Collection of microprocessors. Communication was performed using bi-directional.
1 Distributed Memory Computers and Programming. 2 Outline Distributed Memory Architectures Topologies Cost models Distributed Memory Programming Send.
Interconnection Network Topology Design Trade-offs
1 Lecture 25: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) Review session,
High Performance Parallel Programming Dirk van der Knijff Advanced Research Computing Information Division.
1 Static Interconnection Networks CEG 4131 Computer Architecture III Miodrag Bolic.
ECE669 L16: Interconnection Topology March 30, 2004 ECE 669 Parallel Computer Architecture Lecture 16 Interconnection Topology.
Storage area network and System area network (SAN)
Interconnect Network Topologies
Interconnect Networks
A brief overview about Distributed Systems Group A4 Chris Sun Bryan Maden Min Fang.
CSE Advanced Computer Architecture Week-11 April 1, 2004 engr.smu.edu/~rewini/8383.
Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 3, 2000 Topics Network design issues Network Topology.
Anshul Kumar, CSE IITD CSL718 : Multiprocessors Interconnection Mechanisms Performance Models 20 th April, 2006.
Anshul Kumar, CSE IITD ECE729 : Advanced Computer Architecture Lecture 27, 28: Interconnection Mechanisms In Multiprocessors 29 th, 31 st March, 2010.
Interconnect Networks Basics. Generic parallel/distributed system architecture On-chip interconnects (manycore processor) Off-chip interconnects (clusters.
Super computers Parallel Processing
Embedded Computer Architecture 5SAI0 Interconnection Networks
Topology How the components are connected. Properties Diameter Nodal degree Bisection bandwidth A good topology: small diameter, small nodal degree, large.
Effective bandwidth with link pipelining Pipeline the flight and transmission of packets over the links Overlap the sending overhead with the transport.
Interconnection Networks Communications Among Processors.
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
COMP8330/7330/7336 Advanced Parallel and Distributed Computing Communication Costs in Parallel Machines Dr. Xiao Qin Auburn University
INTERCONNECTION NETWORK
Network Connected Multiprocessors
Overview Parallel Processing Pipelining
Parallel Architecture
Interconnect Networks
CS 704 Advanced Computer Architecture
Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Interconnection Networks (Part 2) Dr.
Lecture 23: Interconnection Networks
Connection System Serve on mutual connection processors and memory .
Multiprocessor Interconnection Networks Todd C
Interconnection topologies
Course Outline Introduction in algorithms and applications
Azeddien M. Sllame, Amani Hasan Abdelkader
Overview Parallel Processing Pipelining
Static and Dynamic Networks
Interconnection Network Routing, Topology Design Trade-offs
John Kubiatowicz Electrical Engineering and Computer Sciences
Interconnect Basics.
Introduction to Scalable Interconnection Network Design
Switching, routing, and flow control in interconnection networks
Lecture 14: Interconnection Networks
Interconnection Network Design Lecture 14
Mesh-Connected Illiac Networks
Communication operations
Introduction to Scalable Interconnection Networks
Storage area network and System area network (SAN)
Advanced Computer Architecture 5MD00 Project on Network-on-Chip
Static Interconnection Networks
High Performance Computing & Bioinformatics Part 2 Dr. Imad Mahgoub
Interconnection Networks Contd.
Embedded Computer Architecture 5SAI0 Interconnection Networks
CS 6290 Many-core & Interconnect
Birds Eye View of Interconnection Networks
Advanced Computer Architecture 5MD00 / 5Z032 Multi-Processing
Static Interconnection Networks
Switching, routing, and flow control in interconnection networks
Topology What are the consequences of using fewer than n2 connections?
Introduction Johan Lukkien
Multiprocessors and Multi-computers
Chapter 2 from ``Introduction to Parallel Computing'',
Presentation transcript:

Advanced Computer Architecture 5MD00 / 5Z032 Multi-Processing 2 Henk Corporaal www.ics.ele.tue.nl/~heco/courses/aca h.corporaal@tue.nl TUEindhoven 2012

Multi-Processor design decision We have already discussed: Shared memory versus Message passing Coherence, Consistency and Synchronization issues Other extremely important decisions: Processing units: Homogeneous versus Heterogeneous? Generic versus Application specific ? Interconnect: Bus versus Network ? Type (topology) of network Focus on Performance, Power or Cost ? Memory organization ? 1/14/2019 ACA H.Corporaal

Homogeneous or Heterogeneous Global Network-on-Chip Local inter-node connections Homogenous: replication effect easy to desing fault tolerance can be built-in process migration possible solve realization issues once and for all (i.e. highly tuning of nodes and network) less flexible nodes Reasoning: future chips are memory dominated any way, so adding some logic to make all nodes equal does not matter 1/14/2019 ACA H.Corporaal

Homogeneous or Heterogeneous better fit to application domain smaller increments more costly design no replication advantage 1/14/2019 ACA H.Corporaal

Homogeneous or Heterogeneous Middle of the road approach Flexible tiles Fixed tile structure at top level 1/14/2019 ACA H.Corporaal

Bus (shared) or Network (switched) claimed to be more scalable no bus arbitration point-to-point connections but router overhead node R Example: NoC with 2x4 mesh routing network 1/14/2019 ACA H.Corporaal

Historical Perspective Early machines were: Collection of microprocessors Communication was performed using bi-directional queues between nearest neighbors Messages were forwarded by processors on path “Store and forward” networking There was a strong emphasis on topology in algorithms, in order to minimize the number of hops => minimize time 1/14/2019 ACA H.Corporaal

Design Characteristics of a Network Topology (how things are connected): Crossbar, ring, 2-D and 3-D meshes or torus, hypercube, tree, butterfly, perfect shuffle, .... Routing algorithm (path used): Example in 2D torus: all east-west then all north-south (avoids deadlock) Switching strategy: Circuit switching: full path reserved for entire message, like the telephone. Packet switching: message broken into separately-routed packets, like the post office. Flow control and buffering (what if there is congestion): Stall, store data temporarily in buffers re-route data to other nodes tell source node to temporarily halt, discard, etc. QoS guarantees Error handling etc, etc. 1/14/2019 ACA H.Corporaal

Switch / Network Topology Topology determines: Degree: number of links from a node Diameter: max number of links crossed between nodes Average distance: number of links to random destination Bisection: minimum number of links that separate the network into two halves Bisection bandwidth = link bandwidth * bisection 1/14/2019 ACA H.Corporaal

bisection bw = sqrt(n) * link bw Bisection Bandwidth Bisection bandwidth: bandwidth across smallest cut that divides network into two equal halves Bandwidth across “narrowest” part of the network not a bisection cut bisection cut bisection bw= link bw bisection bw = sqrt(n) * link bw Bisection bandwidth is important for algorithms in which all processors need to communicate with all others 1/14/2019 ACA H.Corporaal

Common Topologies Type Degree Diameter Ave Dist Bisection 1D mesh 2 N-1 N/3 1 2D mesh 4 2(N1/2 - 1) 2N1/2 / 3 N1/2 3D mesh 6 3(N1/3 - 1) 3N1/3 / 3 N2/3 nD mesh 2n n(N1/n - 1) nN1/n / 3 N(n-1) / n Ring 2 N/2 N/4 2 2D torus 4 N1/2 N1/2 / 2 2N1/2 Hypercube Log2N n=Log2N n/2 N/2 2D Tree 3 2Log2N ~2Log2 N 1 Crossbar N-1 1 1 N2/2 N = number of nodes, n = dimension 1/14/2019 ACA H.Corporaal

Linear and Ring Topologies Linear array Diameter = n-1; average distance ~n/3 Bisection bandwidth = 1 (in units of link bandwidth) Torus or Ring Diameter = n/2; average distance ~ n/4 Bisection bandwidth = 2 Natural for algorithms that work with 1D arrays 1/14/2019 ACA H.Corporaal

Meshes and Tori Two dimensional mesh Diameter = 2 * (sqrt( n ) – 1) Bisection bandwidth = sqrt(n) Two dimensional torus Diameter = sqrt( n ) Bisection bandwidth = 2* sqrt(n) Generalizes to higher dimensions Natural for algorithms that work with 2D and/or 3D arrays 1/14/2019 ACA H.Corporaal

Hypercubes Number of nodes n = 2d for dimension d 0d 1d 2d 3d 4d Diameter = d Bisection bandwidth = n/2 0d 1d 2d 3d 4d Popular in early machines (Intel iPSC, NCUBE, CM) Lots of clever algorithms Extension: k-ary n-cubes Greycode addressing: Each node connected to others with 1 bit different 001 000 100 010 011 111 101 110 1/14/2019 ACA H.Corporaal

Trees Diameter = log n. Bisection bandwidth = 1 Easy layout as planar graph Many tree algorithms (e.g., summation) Fat trees avoid bisection bandwidth problem: More (or wider) links near top Example: Thinking Machines CM-5 1/14/2019 ACA H.Corporaal

Fat Tree example A multistage fat tree (CM-5) avoids congestion at the root node Randomly assign packets to different paths on way up to spread the load Increase degree near root, decrease congestion 1/14/2019 ACA H.Corporaal

Butterflies with n = (k-1)2k switches Connecting 2k processors, with Bisection bandwidth = 2*2k Cost: lots of wires 2log(k) hop-distance for all connections, however blocking possible Used in BBN Butterfly Natural for FFT Switch PE O 1 Butterfly switch Multistage butterfly network: k=3 1/14/2019 ACA H.Corporaal

Topologies in Real Machines Red Storm (Opteron + Cray network, future) 3D Mesh Blue Gene/L 3D Torus SGI Altix Fat tree Cray X1 4D Hypercube* Myricom (Millennium) Arbitrary Quadrics (in HP Alpha server clusters) IBM SP Fat tree (approx) SGI Origin Hypercube Intel Paragon (old) 2D Mesh BBN Butterfly (really old) Butterfly older newer Many of these are approximations: E.g., the X1 is really a “quad bristled hypercube” and some of the fat trees are not as fat as they should be at the top 1/14/2019 ACA H.Corporaal

More examples Hypercube 2D-Grid/Mesh 2D-Torus Assume 64 nodes: Criteria Bus Ring 2D-Mesh 2D-torus 6-cube Fully connected Performance Bisection bandwidth 1 2 8 16 32 1024 Cost Ports/switch Total #links 3 128 5 176 192 7 256 64 2080 1/14/2019 ACA H.Corporaal

QoS: Quality-of-Service Hard and Soft Real-time applications require QoS guarantees Predicatable delays Guaranteed throughput Issues: Different inter processor traffic service types: GT: guaranteed throughput / latency traffic BE: best effort Resource manager interface between applications and platform resources (processing elements, network, memory, i/o) Do we allow caches software controlled 1/14/2019 ACA H.Corporaal

Generic or Specialized? Computational Efficiency 1 pJ/op 1/14/2019 ACA H.Corporaal