Introduction to Architecture for Parallel Computing


Models of parallel computers (sequential)

Taxonomy criteria: control, address space, interconnection network, granularity.
SISD – single instruction stream, single data stream.
SIMD – single instruction stream, multiple data streams.
- SIMD pros: 1. a single program, so less memory.
- SIMD cons: 1. dedicated hardware (expensive); 2. rigid.

Taxonomy (continued)
MISD – multiple instruction streams, single data stream.
MIMD – multiple instruction streams, multiple data streams.
- MIMD pros: 1. off-the-shelf hardware (inexpensive); 2. flexible.
- MIMD cons: 1. the program and data must be replicated on all PEs.

SIMD and MIMD architectures

Address-space organization. Two paradigms:
- message passing
- shared memory
Message passing: pairs of PEs, each with its own memory module, communicate by exchanging messages.
Shared address space: each PE has access to any location in a global memory.

Message passing paradigm

Shared-memory paradigm (UMA)

Shared-memory paradigm (NUMA)

Interconnection networks. Static (direct) – used for message passing. Dynamic (indirect) – used for shared memory.

Processor granularity.
Coarse-grain – for algorithms that require infrequent communication.
Medium-grain – for algorithms in which the ratio of the time required for basic communication to the time required for basic computation is small.
Fine-grain – for algorithms that require frequent communication.

PRAM model (idealized parallel computer). P is the number of processors - they share a common clock but each works on its own instructions. M is the global memory - uniformly accessible to all PEs. A synchronous shared-memory MIMD computer. Interaction between PEs occurs at no cost.

PRAM model (continued). Depending on how memory is accessed for READ and WRITE operations, there are four subclasses:
- EREW (exclusive read, exclusive write) - weakest, minimum concurrency
- ERCW (exclusive read, concurrent write)
- CREW (concurrent read, exclusive write) - read access is concurrent, write access is serialized
- CRCW (concurrent read, concurrent write) - most powerful; can be simulated on an EREW model

PRAM model (continued). CR (concurrent read) – needs no modification in the program. CW (concurrent write) – access to memory requires arbitration. Protocols for CW:
- Common – the write succeeds iff all values being written are identical
- Arbitrary – an arbitrary PE proceeds, the other PEs fail
- Priority – the PE with the highest priority wins
- Sum – the sum of all the quantities is written

Dynamic interconnection networks (EREW PRAM)
- p processors
- m words in global memory
- switching elements (determine the memory word accessed)
Each PE can access any of the memory words. The number of switching elements required for UMA is Θ(mp). Very expensive (-), very good performance (+).

Dynamic interconnection networks (continued). Solution:
- reduce m by using memory banks (m words organized in b banks)
- each PE switches between the b banks
- total number of switching elements is Θ(bp)
- less expensive
Note - this is a weak approximation of EREW, because no 2 PEs can access the same memory bank at the same time.

Crossbar switching networks

Crossbar switching networks (continued). From the previous slide:
- p processors are connected to b memory banks by a crossbar switch
- each PE accesses one memory bank
- m words are stored in the memory banks
- if m = b, the crossbar simulates an EREW PRAM
The total number of switching elements is Θ(pb). If p is very large and p > b:
- the total number of switching elements grows as Θ(p²)
- and more and more PEs are unable to access any memory bank
- not scalable

Bus-based networks

Bus-based networks (continued). UMA:
- each PE accesses data from M over the same shared bus
- as p becomes very large, each PE spends an increasing amount of time waiting for memory access while the bus is used by other PEs
- not acceptable.
Solution (NUMA):
- provide a local cache at each PE to reduce the total number of accesses to global memory
- this implies replicating data, which raises cache-coherency problems
- the architecture is shown on the following slide

Bus-based networks (NUMA)

Multistage interconnection networks. Multistage networks provide a trade-off between the two required metrics:

Network     Performance   Cost
Crossbar    Excellent     Extremely high
Bus-based   Poor          Most economical

Multistage interconnection networks (graphs). (Figure: cost and performance of crossbar, multistage, and bus networks plotted against the number of processors p; the multistage curves lie between the bus and crossbar curves on both plots.)

Multistage interconnection networks

Multistage interconnection networks (continued). Cost is less than a crossbar network's; performance is better than a bus-based network's. Omega network:
- p is the number of PEs, b is the number of memory banks, p = b
- the number of stages is log2 p
- each stage uses the perfect-shuffle pattern: a link exists between input i and output j iff
  j = 2i           for 0 ≤ i ≤ p/2 − 1
  j = 2i + 1 − p   for p/2 ≤ i ≤ p − 1
(a 1-bit left rotation of the binary representation of i).
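
To make the connection rule concrete, here is a minimal Python sketch of the perfect-shuffle wiring of one omega-network stage; the function name `shuffle` and the printing loop are illustrative, not part of the original slides.

```python
def shuffle(i: int, p: int) -> int:
    """Output j wired to input i in one omega stage (perfect shuffle).
    Equivalent to a 1-bit left rotation of i's log2(p)-bit representation."""
    if i < p // 2:
        return 2 * i
    return 2 * i + 1 - p

p = 8
for i in range(p):
    print(f"input {i:03b} -> output {shuffle(i, p):03b}")
# e.g. input 001 -> output 010, input 100 -> output 001 (left rotation)
```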

Omega network with blocking. An example of blocking in an Omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.

Switch configurations in one stage

Static interconnection networks. Used in message passing. Completely-connected network:
- similar to a crossbar in that it can support multiple simultaneous channels
- every PE is connected to every other PE
- communication takes place in 1 step.
Star-connected network:
- a central PE (a bottleneck)
- similar to bus-based networks
- communication takes place in 2 steps.
Examples are shown on the following slide.

Completely-connected (non-blocking) and star-connected networks (figure).

Static interconnection networks (continued). Linear array and ring. Mesh network (2-D, 3-D):
- a linear array spread over either 2 or 3 dimensions
- each internal PE is connected to 4 (2-D) or 6 (3-D) other PEs
- if the periphery PEs are connected, we have a wraparound mesh, or torus.

Tree networks. Only 1 path exists between any pair of PEs. Linear arrays and star-connected networks are special cases of tree networks.
Static - each node corresponds to a PE.
Dynamic - only the leaf nodes are PEs; the intermediate nodes are switching elements.
Examples of tree networks are shown on the following slides.

Static and dynamic tree

Tree networks (continued). Fat tree:
- the number of communication links is increased for PEs closer to the root
- in this way bottlenecks at the higher levels of the tree are alleviated.

Hypercube networks. A multidimensional mesh with 2 PEs in each dimension. A d-dimensional hypercube has p = 2^d PEs. It can be constructed recursively:
- a (d+1)-dimensional hypercube is constructed by connecting the corresponding PEs of 2 separate d-dimensional hypercubes
- the labels of the PEs of one hypercube are prefixed with 0 and the labels of the second hypercube with 1
- this is shown on the following slide.

Hypercube recursive construction

HC1

HC2: Three distinct partitions of a 3D hypercube into two 2D subcubes. Links connecting processors within a partition are indicated by bold lines.

HC3

Hypercube properties. 2 PEs are connected by a direct link iff the binary representations of their labels differ in exactly one bit position. In a d-dimensional HyC, each PE is directly connected to d other PEs. A d-dimensional HyC can be partitioned into two (d−1)-dimensional subcubes (see figure HC2); since PE labels have d bits, d such partitions exist. Fixing any k bits of the d-bit labels => the PEs that differ in the remaining (d−k) bit positions form a (d−k)-dimensional subcube of 2^(d−k) PEs, and there are 2^k such subcubes (see figure HC3). Example: k = 2, d = 4:
- 4 subcubes by fixing the 2 MSBs
- 4 subcubes by fixing the 2 LSBs
- in general, 4 subcubes are formed by fixing any 2 bits.

Hypercube properties (continued). The total number of bit positions at which 2 labels differ is the Hamming distance between the 2 PEs: for source s and destination t, the Hamming distance is the number of 1s in s ⊕ t. The number of communication links in the shortest path between 2 PEs is the Hamming distance between their labels. The shortest path between any 2 PEs in a HyC cannot have more than d links (since s ⊕ t cannot contain more than d ones).
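
A small Python sketch of these two properties (the helper names are illustrative): two PEs are directly linked iff their labels differ in exactly one bit, and the shortest path length equals the Hamming distance.

```python
def hamming_distance(s: int, t: int) -> int:
    """Number of 1 bits in s XOR t = shortest path length in a hypercube."""
    return bin(s ^ t).count("1")

def neighbors(s: int, d: int):
    """The d PEs directly connected to PE s in a d-dimensional hypercube."""
    return [s ^ (1 << k) for k in range(d)]

print(neighbors(0b010, 3))             # [3, 0, 6] -> labels 011, 000, 110
print(hamming_distance(0b010, 0b111))  # 2: shortest path uses 2 links
```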

K-ary d-cube networks. A d-dimensional HyC is a binary d-cube, or 2-ary d-cube (2 PEs along each dimension). In general, for k-ary d-cubes:
- d is the dimension of the network
- k is the radix, i.e., the number of PEs in each dimension.
The number of PEs is p = k^d
- e.g., a 2-D mesh with p PEs: p = k², i.e., k = p^(1/2).

Evaluating static interconnection networks (in terms of cost and performance). Diameter:
- the maximum distance between any 2 PEs in the network
- the distance between 2 PEs is the length of the shortest path between them
- the distance determines communication time.
Diameters of networks:
- completely-connected network: 1
- star-connected network: 2
- ring: ⌊p/2⌋
- 2-D mesh: 2(√p − 1)
- wraparound 2-D mesh: 2⌊√p/2⌋
- hypercube: log2 p
- complete binary tree: 2 log((p+1)/2)
  . p is the number of processors, which equals the number of nodes (e.g., 15 PEs)
  . h = log((p+1)/2) is the height, and the diameter is 2h (see figure on the following slide)
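
As a quick cross-check, here is a hedged sketch that encodes the same diameter formulas in one function; the topology names are made up for this example, the mesh cases assume p is a perfect square, and the hypercube case assumes p is a power of 2.

```python
import math

def diameter(topology: str, p: int) -> int:
    """Diameter of a p-PE network for the topologies listed above."""
    if topology == "completely_connected":
        return 1
    if topology == "star":
        return 2
    if topology == "ring":
        return p // 2                            # floor(p/2)
    if topology == "mesh_2d":
        s = math.isqrt(p)                        # side of the square mesh
        return 2 * (s - 1)
    if topology == "mesh_2d_wraparound":
        s = math.isqrt(p)
        return 2 * (s // 2)
    if topology == "hypercube":
        return int(math.log2(p))
    if topology == "complete_binary_tree":
        return 2 * int(math.log2((p + 1) // 2))  # 2h with h = log((p+1)/2)
    raise ValueError(topology)

print(diameter("hypercube", 64))           # 6
print(diameter("mesh_2d_wraparound", 64))  # 8
print(diameter("complete_binary_tree", 15))  # 6
```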

Tree network diameter

Evaluating static interconnection networks (continued). Connectivity:
- a measure of the multiplicity of paths between any 2 PEs
- high connectivity means lower contention for communication resources.
Arc connectivity:
- a measure of connectivity
- the minimum number of arcs that must be removed to break the network into 2 disconnected networks.
Arc connectivity of networks:
- linear arrays, stars, trees: 1
- rings, 2-D meshes without wraparound: 2
- 2-D meshes with wraparound: 4
- d-dimensional hypercubes: d

Bisection width. The minimum number of communication links that have to be removed to partition the network into two equal halves. Bisection widths of networks:
- ring: 2
- 2-D mesh without wraparound: √p
- 2-D mesh with wraparound: 2√p
- tree, star: 1 (the star is a special case of a tree, so both have bisection width 1)
- fully-connected network: p²/4
- hypercube: p/2
  . i.e., a d-dimensional HyC (p processors) consists of two (d−1)-dimensional HyCs (p/2 processors each)
  . connected by p/2 links

(Figure: HyC bisection width - a d-dimensional hypercube of p processors split into two (d−1)-dimensional hypercubes, HyC1 and HyC2, of p/2 processors each, joined by p/2 links.)

Bisection bandwidth. Channel width:
- the number of bits that can be communicated simultaneously over a link connecting 2 PEs (the number of wires).
Channel rate:
- the peak rate at which a single wire can deliver bits.
Channel bandwidth (channel rate × channel width):
- the peak rate at which data can be communicated between the ends of a communication link.
Bisection bandwidth (bisection width × channel bandwidth):
- the minimum volume of communication allowed between any 2 halves of a network with an equal number of PEs.

Cost. Measured in terms of (a) the number of communication links and (b) the bisection bandwidth. Number of communication links (number of wires required):
- linear arrays, trees: p − 1 links
- d-dimensional mesh with wraparound: dp
- HyC: (p/2) log2 p.
Bisection bandwidth:
- measures cost by providing a lower bound on the area (2-D) or the volume (3-D) of the packaging
- if the bisection width of a network is w
  . the lower bound on the area in 2-D is Θ(w²)
  . the lower bound on the volume in 3-D is Θ(w^(3/2))

Embedding other networks into a hypercube. Given 2 graphs G(V,E) and G'(V',E'), embedding graph G into graph G' maps each vertex in the set V onto a vertex (or a set of vertices) in set V', and each edge in the set E onto an edge (or a set of edges) in E'
- nodes correspond to PEs and edges correspond to communication links.
Why do we need embedding?
- Answer: it may be necessary to adapt one network to another, e.g., when an application was written for a specific network that is not available at present.
Parameters:
- congestion (the maximum number of edges in E mapped onto one edge in E')
- dilation (the reverse: the maximum number of edges in E' onto which one edge in E is mapped)
- expansion (the ratio of the number of vertices in V' corresponding to one vertex in V)
- contraction (the reverse of expansion)

Embedding a linear array into a HyC. A linear array of 2^d processors (labeled 0 … 2^d − 1) can be embedded into a d-dimensional HyC by mapping processor i of the linear array to the processor with label G(i, d) of the HyC, where
  G(0, 1) = 0
  G(1, 1) = 1
  G(i, x+1) = G(i, x)                      for i < 2^x
  G(i, x+1) = 2^x + G(2^(x+1) − 1 − i, x)  for i ≥ 2^x
The function G is the binary reflected Gray code (RGC); consecutive values differ in exactly one bit.
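
The recursive definition above has the well-known closed form G(i) = i XOR (i >> 1). A minimal sketch (the function name mirrors the slide's G(i, d) notation; the printing loop is illustrative):

```python
def G(i: int, d: int) -> int:
    """Hypercube label hosting linear-array PE i (0 <= i < 2**d).
    d is kept only to mirror the slide's G(i, d) notation."""
    return i ^ (i >> 1)

d = 3
for i in range(2 ** d):
    print(f"array PE {i} -> hypercube PE {G(i, d):0{d}b}")
# Consecutive outputs differ in exactly one bit, so array neighbours
# map to directly connected hypercube PEs.
```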

Embedding a mesh into a hypercube. Processors:
- in the same column have identical LSBs
- in the same row have identical MSBs.
Each row of the mesh is mapped to a unique subcube, and each column of the mesh is mapped to a unique subcube; see the sketch below.
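
A hedged sketch of this mapping, assuming PE (i, j) of a 2^r × 2^s mesh goes to the (r+s)-bit label formed by concatenating the Gray codes of the row and column indices (helper names are illustrative):

```python
def gray(i: int) -> int:
    """Binary reflected Gray code, closed form."""
    return i ^ (i >> 1)

def mesh_to_hypercube(i: int, j: int, r: int, s: int) -> int:
    """Row index i contributes the r MSBs, column index j the s LSBs."""
    return (gray(i) << s) | gray(j)

# A 4x4 mesh into a 4-D hypercube (r = s = 2): rows share MSBs,
# columns share LSBs, so each row/column lands in a 2-D subcube.
for i in range(4):
    print([f"{mesh_to_hypercube(i, j, 2, 2):04b}" for j in range(4)])
```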

Embedding mesh networks into hypercubes (figures): a 4×4 mesh illustrating the mapping of mesh processors to processors in a 4-D hypercube, and a 2×4 processor mesh embedded into a 3-D hypercube.

Reflected Gray code figure: (a) a 3-bit reflected Gray code ring and (b) its embedding into a 3-D hypercube.

Embedding a binary tree into a hypercube. A tree of depth d can be embedded into a 2^d-PE HyC:
- the root of the tree is mapped onto some HyC processor
- for each tree node m at depth j, mapped to the HyC PE with label i, the left child of m is mapped to the same PE i, and the right child of m is mapped to the HyC processor whose label is obtained by inverting bit j of i.
  . expansion = 1 (but several tree nodes, up to d+1, may correspond to 1 HyC node; e.g., 4 when d = 3)
  . congestion = 1
  . dilation = 1 (each tree edge maps onto at most one HyC link)
Note: for the array and mesh embeddings, expansion = 1. A sketch of this mapping follows.
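
A small sketch of this embedding, assuming bit j is counted from the LSB (that bit-numbering choice, and all names here, are assumptions for illustration):

```python
def embed_tree(pe: int, depth: int, d: int, out=None):
    """Map a complete binary tree of depth d onto a 2**d-PE hypercube.
    Left child stays on the same PE; right child flips bit `depth`."""
    if out is None:
        out = []
    out.append((depth, pe))
    if depth < d:
        embed_tree(pe, depth + 1, d, out)                # left child: same PE
        embed_tree(pe ^ (1 << depth), depth + 1, d, out) # right child: flip bit
    return out

# Tree rooted at PE 011 in a 3-D hypercube, printed with indentation by depth:
for depth, pe in embed_tree(0b011, 0, 3):
    print("  " * depth + f"{pe:03b}")
```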

Tree embedded into HyC figure: a tree rooted at processor 011 (3 in decimal) and embedded into a 3-D hypercube.

Routing mechanisms for static networks. A routing mechanism:
- determines the path a message takes through the network to get from the source to the destination processor
- takes as input
  . the source
  . the destination
  . information about the state of the network
- returns one or more paths

Routing mechanisms for static networks (continued). Classification criteria:
- handling of congestion
  . minimal (shortest path between source and destination) - does not take congestion into consideration
  . non-minimal (avoids network congestion) - the path may not be the shortest
- use of network-state information
  . deterministic routing (does not use network information) - determines a unique path using only the source and destination
  . adaptive routing (uses network-state information to avoid congestion)

Routing mechanisms for static networks (continued). Dimension-ordered routing (deterministic, minimal):
- X-Y routing
- E-cube routing.
X-Y routing: the message is sent along the X dimension until it reaches the column of the destination PE, and then travels along the Y dimension until it reaches its destination. The path length is |Sx − Dx| + |Sy − Dy|. A sketch follows.
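
A minimal sketch of X-Y routing on a 2-D mesh without wraparound; the coordinate pairs and the function name are assumptions for illustration.

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst, X dimension first."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                 # travel along X to the destination column
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # then along Y to the destination row
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 3)))
# 5 hops, matching |Sx - Dx| + |Sy - Dy| = 2 + 3
```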

Routing mechanisms for static networks (continued). E-cube routing:
- the minimum distance between 2 PEs, Ps and Pd, is given by the number of ones in Ps ⊕ Pd
- Ps computes Ps ⊕ Pd and sends the message along dimension k, where k is the position of the non-zero LSB in Ps ⊕ Pd
- at each intermediate step, Pi (the receiving PE) computes Pi ⊕ Pd and forwards the message along the dimension corresponding to the non-zero LSB
- the process continues until the destination is reached.

(Figure: E-cube routing on a 3-D hypercube from Ps = 010 to Pd = 111.)

E-cube routing on a hypercube network: Ps = 010 and Pd = 111.
Step 1: Ps ⊕ Pd = 010 ⊕ 111 = 101.
- Ps forwards the message along the dimension corresponding to the non-zero LSB: 010 → 011.
Pi = 011 and Pd = 111.
Step 2: Pi ⊕ Pd = 011 ⊕ 111 = 100.
- Pi forwards the message along the dimension corresponding to the non-zero LSB: 011 → 111.
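
The same procedure as a short Python sketch (the function name is illustrative); it reproduces the two steps worked out above.

```python
def ecube_route(ps: int, pd: int):
    """Route from PE ps to PE pd, resolving the lowest differing bit first."""
    path = [ps]
    cur = ps
    while cur != pd:
        diff = cur ^ pd                      # remaining differing bits
        k = (diff & -diff).bit_length() - 1  # position of the non-zero LSB
        cur ^= 1 << k                        # hop along dimension k
        path.append(cur)
    return path

print([f"{pe:03b}" for pe in ecube_route(0b010, 0b111)])
# ['010', '011', '111'] -- matches steps 1 and 2 above
```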

Communication costs in static interconnection networks. Communication latency: the time taken to communicate a message between two processors in the network. Parameters:
- startup time (ts) – preparing the message (adding header, trailer, and error-correction information), executing the routing algorithm, and establishing the interface between the PE and the router
- per-hop time (th, node latency) – the time taken by the header to travel between two directly connected PEs
- per-word transfer time (tw) – if the channel bandwidth is r words/sec, each word takes tw = 1/r seconds to traverse the link.
Influenced by:
- the network topology
- the switching technique.

Store-and-forward routing. Each intermediate processor on the path forwards the message to the next processor after it has received and stored the entire message. m: message size. l: number of links to traverse. th: cost of the header traversing a link. tw: per-word message transfer time. m·tw: cost of the rest of the message traversing a link. The communication cost is then tcomm = ts + (th + m·tw)·l. Since th is small compared to m·tw, we have tcomm = ts + m·tw·l. With ts and tw constant, tcomm = Θ(ml).

(Figure: store-and-forward routing of a message of size m through processors P0 … P3 over time.)

Cut-through routing. Store-and-forward routing disadvantages:
- communication time is high
- poor utilization of communication resources.
Cut-through routing:
- the message is forwarded from the incoming link to the outgoing link as it arrives
- the message travels in small units called flow-control digits, or flits, pipelined through the network
- an intermediate processor does not wait for the entire message to arrive before forwarding; as soon as flits arrive, they are passed on to the next processor
- all flits are routed along the same path.
Notes:
- no buffer space is needed to store the entire message at intermediate PEs
- uses less memory storage and memory bandwidth
- it is faster.

Cut-through routing (continued). From the figures on the next slide:
tcomm = ts + th·l + m·tw < ts + l·(th + m·tw) = tSF.
With ts constant, tcomm = Θ(m + l). If l = 1, then tcomm(CT) = tcomm(SF) = Θ(m). A sketch comparing the two formulas follows.
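
A hedged numeric sketch of the two cost formulas; the parameter values below are invented purely for illustration.

```python
def t_store_and_forward(ts, th, tw, m, l):
    """tSF = ts + (th + m*tw) * l : every hop re-sends the whole message."""
    return ts + (th + m * tw) * l

def t_cut_through(ts, th, tw, m, l):
    """tCT = ts + th*l + m*tw : flits are pipelined, so m*tw is paid once."""
    return ts + th * l + m * tw

ts, th, tw = 100.0, 1.0, 0.5   # startup, per-hop, per-word times (arbitrary units)
m, l = 1000, 10                # message size in words, number of links
print(t_store_and_forward(ts, th, tw, m, l))  # 5110.0
print(t_cut_through(ts, th, tw, m, l))        # 610.0 -- Theta(m + l) vs Theta(m*l)
```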

(Figure: timelines for store-and-forward versus cut-through routing through P0 … P3; the gap between the two finish times is labeled tsaved.)

Cost-performance trade-offs. Performance analysis of a mesh and a HyC with identical costs, based on various cost metrics. The cost of a network is taken to be proportional to:
- the number of wires, or
- the bisection width.
Assumption: lightly loaded networks and cut-through routing.

Cost-performance trade-offs (continued). 1. If the cost of a network is proportional to the number of wires:
- a 2-D wraparound mesh with (log p)/4 wires per channel costs as much as a p-processor HyC with 1 wire per channel (see page 38, table 2.1)
- average communication latencies (for a mesh and HyC of the same cost):
  . lav, the average distance between 2 PEs, is √p/2 in a 2-D wraparound mesh and (log p)/2 in a HyC
  . the time to send a message of size m between PEs that are lav hops apart is ts + th·lav + tw·m for cut-through routing

Cost-performance trade-offs (continued). In the mesh:
- the channel width is scaled up by (log p)/4
- the per-word transfer time is reduced by (log p)/4.
In the HyC: if the per-word transfer time is tw, the corresponding time on the mesh is 4·tw/log p, that is, tw(mesh) = (4/log p)·tw(HyC).
Average communication latencies:
- HyC: ts + th·(log p)/2 + tw·m
- mesh: ts + th·(√p)/2 + 4·m·tw/log p
A sketch comparing the two follows.
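
A sketch that plugs sample values into the two latency expressions above; all parameter values are assumptions chosen only to show the trend for large m.

```python
import math

def latency_hypercube(ts, th, tw, m, p):
    """ts + th*(log2 p)/2 + tw*m, 1 wire per channel."""
    return ts + th * (math.log2(p) / 2) + tw * m

def latency_mesh_equal_wire_cost(ts, th, tw, m, p):
    """Mesh channels are (log2 p)/4 times wider, so per-word time shrinks."""
    return ts + th * (math.sqrt(p) / 2) + (4 * tw / math.log2(p)) * m

ts, th, tw, m = 100.0, 1.0, 0.5, 10_000
for p in (16, 256, 1024):
    print(p,
          round(latency_hypercube(ts, th, tw, m, p), 1),
          round(latency_mesh_equal_wire_cost(ts, th, tw, m, p), 1))
# For p > 16 the mesh's per-word term 4*tw/log2(p) beats the HyC's tw.
```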

Cost-performance trade-offs (continued). Analysis: with p constant and m large, the communication cost due to tw dominates, and 4·m·tw/log p (mesh) < tw·m (HyC) if p > 16. Point-to-point communication of large messages between random pairs of PEs therefore takes less time on a wraparound mesh with cut-through routing than on a HyC of the same cost.
Notes:
- if store-and-forward routing is used, the mesh is no more cost-efficient than the HyC
- the above analysis assumed light load conditions in the network
- if the number of messages is large, contention increases, and the mesh is affected by contention more than the HyC
- under heavy load conditions the HyC is better than the mesh.

Cost-performance trade-offs (continued). 2. If the cost of a network is proportional to its bisection width:
- a 2-D wraparound mesh (p processors) with √p/4 wires per channel has a cost equal to a p-processor HyC with 1 wire per channel (see page 38, table 2.1).
Average communication latencies (for a mesh and HyC of the same cost):
- the mesh channels are wider by √p/4, which means the per-word transfer time is reduced by the same factor of √p/4
- communication latencies:
  . HyC: ts + th·(log p)/2 + tw·m
  . mesh: ts + th·(√p)/2 + 4·m·tw/√p

Cost-performance trade-offs (continued). Analysis:
- with p constant and m large, tw dominates for p > 16
- the mesh outperforms a HyC of the same cost, provided the network is lightly loaded
- for heavily loaded networks, Perf(mesh) < Perf(HyC).