Presentation on theme: "Introduction to Architecture for Parallel Computing"— Presentation transcript:

1 Introduction to Architecture for Parallel Computing

2 Models of parallel computers (sequential)

3 Taxonomy
Classification criteria: control, address space, interconnection network, granularity.
SISD – single instruction stream, single data stream
Pros: 1. Single program, so less memory
Cons: 1. Dedicated hardware (expensive) 2. Rigid

4 Taxonomy (continued)
SIMD – single instruction stream, multiple data streams
MISD – multiple instruction streams, single data stream
MIMD – multiple instruction streams, multiple data streams
Pros: 1. Off-the-shelf hardware – inexpensive 2. Flexible
Cons: 1. Program and data must be stored on all PEs

5 SIMD and MIMD architectures

6 Address-space organization
Two paradigms: message passing and shared address space (shared memory).
Message passing – PE/memory pairs (nodes) communicate with each other by exchanging messages.
Shared address space – each PE has access to any location in a global memory.

7 Message passing paradigm

8 Shared-memory paradigm (UMA)

9 Shared-memory paradigm (NUMA)

10 Interconnection Networks
Static (direct) – for message passing
Dynamic (indirect) – for shared memory

11 Processor Granularity
Coarse-grain – for algorithms that require infrequent communication (large blocks of computation between communication steps)
Medium-grain – for algorithms in which the ratio of the time required for basic communication to the time required for basic computation is small
Fine-grain – for algorithms that require frequent communication

12 PRAM model (idealized parallel computer)
p – the number of processors; they share a common clock, but each may execute its own instruction
m – the global memory, uniformly accessible to all PEs
A PRAM is therefore a synchronous, shared-memory MIMD computer.
Interaction between PEs occurs at no cost.

13 PRAM model (continued)
Depending on how memory is accessed for READ and WRITE operations, there are four subclasses:
- EREW (exclusive read, exclusive write) – weakest, minimum concurrency
- ERCW (exclusive read, concurrent write)
- CREW (concurrent read, exclusive write) – read access is concurrent, write access is serialized
- CRCW (concurrent read, concurrent write) – most powerful; can be simulated on an EREW model

14 PRAM model (continued)
CR (concurrent read) – does not require any modification in the program.
CW (concurrent write) – access to memory requires arbitration.
Arbitration protocols for CW:
- Common – the write succeeds only if all values being written are identical
- Arbitrary – an arbitrary PE proceeds, the other PEs fail
- Priority – the PE with the highest priority wins
- Sum – the sum of all written quantities is stored
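A minimal Python sketch (not part of the original slides) of how a single CRCW memory cell might arbitrate simultaneous writes under these four protocols; the function name and request format are assumptions.

```python
def resolve_concurrent_write(requests, protocol):
    """requests: list of (priority, value) pairs from PEs writing the same cell."""
    values = [v for _, v in requests]
    if protocol == "common":
        # Write succeeds only if all PEs attempt to write the same value.
        if len(set(values)) != 1:
            raise ValueError("common protocol: conflicting values")
        return values[0]
    if protocol == "arbitrary":
        # Any one PE succeeds; here simply the first request.
        return values[0]
    if protocol == "priority":
        # The PE with the highest priority wins.
        return max(requests, key=lambda r: r[0])[1]
    if protocol == "sum":
        # The sum of all written quantities is stored.
        return sum(values)
    raise ValueError("unknown protocol")

# Example: three PEs (priorities 0..2) write 5, 7, 7 to the same cell.
print(resolve_concurrent_write([(0, 5), (1, 7), (2, 7)], "priority"))  # 7
print(resolve_concurrent_write([(0, 5), (1, 7), (2, 7)], "sum"))       # 19
```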

15 Dynamic Interconnection Networks
EREW PRAM:
- p processors
- m words in global memory
- switching elements (determine which memory word is accessed)
Each PE can access any of the memory words.
Number of switching elements required for UMA: Θ(mp)
- very expensive (-)
- very good performance (+)

16 Dynamic interconnection networks (continued)
Solution:
- reduce the number of switching elements by organizing the m words into b memory banks
- each PE switches between the b banks
- total number of switching elements: Θ(bp)
- less expensive
Note – this is only a weak approximation of EREW, because no two PEs can access the same memory bank at the same time.

17 Crossbar switching networks

18 Crossbar switching networks (continued)
From the previous slide:
- p processors are connected to b memory banks by a crossbar switch
- each PE accesses one memory bank at a time
- m words are stored in the memory banks
- if m = b, the crossbar simulates an EREW PRAM
Total number of switching elements: Θ(pb)
If p is very large and p > b:
- the total number of switching elements grows as Θ(p^2)
- and more and more PEs are unable to access any memory bank
- not scalable

19 Bus-based networks

20 Bus-based networks (continued)
UMA:
- each PE accesses data from the global memory M over the same bus
- as p becomes very large, each PE spends an increasing amount of time waiting for memory access while the bus is used by other PEs
- not acceptable
Solution (NUMA):
- provide a local cache at each PE to reduce the total number of accesses to global memory
- this implies replicating data, which leads to cache-coherency problems
- the architecture is shown on the following slide

21 Bus-based networks (NUMA)

22 Multistage interconnection networks
Multistage networks provide a trade-off between the two metrics:
- Crossbar: excellent performance, extremely high cost
- Bus-based: poor performance, most economical

23 Multistage interconnection networks (graphs)
Two graphs plot cost and performance against the number of processors p for crossbar, multistage, and bus-based networks.

24 Multistage interconnection networks

25 Multistage interconnection networks (continued)
Cost is lower than that of crossbar networks; performance is better than that of bus-based networks.
Omega network:
- p is the number of PEs
- b is the number of memory banks
- p = b
- the number of stages is log2 p
- each stage uses the perfect-shuffle connection: a link exists between input i and output j if
  j = 2i for 0 <= i <= p/2 - 1, and j = 2i + 1 - p for p/2 <= i <= p - 1
  (i.e., j is obtained by a left cyclic rotation of the binary representation of i)
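A small Python sketch (not from the slides; function names are illustrative) of the perfect-shuffle connection, assuming p is a power of two, with the bit-rotation view used as a cross-check.

```python
def perfect_shuffle(i, p):
    """Output j wired to input i: j = 2i (i < p/2) or 2i + 1 - p (i >= p/2)."""
    return 2 * i if i < p // 2 else 2 * i + 1 - p

def shuffle_by_rotation(i, p):
    """Equivalent view: left-rotate the log2(p)-bit label of i by one bit."""
    d = p.bit_length() - 1                      # number of label bits
    return ((i << 1) | (i >> (d - 1))) & (p - 1)

p = 8
for i in range(p):
    assert perfect_shuffle(i, p) == shuffle_by_rotation(i, p)
print([perfect_shuffle(i, p) for i in range(p)])  # [0, 2, 4, 6, 1, 3, 5, 7]
```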

26 Omega network with blocking
An example of blocking in Omega network; one of the messages (010 to 111 or 110 to 100) is blocked at link AB.

27 Switch configurations in one stage

28 Static interconnection networks
Used in message-passing machines.
Completely-connected network:
- similar to a crossbar, but can support multiple simultaneous channels
- every PE is connected to every other PE
- communication takes place in 1 step
Star-connected network:
- central PE (a bottleneck), similar to bus-based networks
- communication takes place in 2 steps
Examples are shown on the following slide.

29 Completely-connected and Star-connected networks
Completely connected (non-blocking) Star connected

30 Static interconnection networks (continued)
Linear array and ring
Mesh network (2-D, 3-D):
- a linear array spread over either 2 or 3 dimensions
- each internal PE is connected to 4 other PEs (2-D) or 6 other PEs (3-D)
- if the periphery PEs are also connected, we have a wraparound mesh (torus)

31 Tree networks
There is only one path between any pair of PEs.
Linear arrays and star-connected networks are special cases of tree networks.
Static tree – each node corresponds to a PE.
Dynamic tree – only the leaf nodes are PEs; the intermediate nodes are switching elements.
Examples of tree networks are shown on the following slides.

32 Static and dynamic tree

33 Tree networks (continued)
Fat tree:
- the number of communication links is increased for PEs closer to the root
- in this way bottlenecks at higher levels in the tree are alleviated

34 Hypercube networks
A multidimensional mesh with 2 PEs in each dimension.
A d-dimensional hypercube has p = 2^d PEs.
Can be constructed recursively:
- a (d+1)-dimensional hypercube is constructed by connecting the corresponding PEs of two separate d-dimensional hypercubes
- the labels of the PEs of one hypercube are prefixed with 0, and the labels of the PEs of the second hypercube with 1
- this is shown on the following slide

35 Hypercube recursive construction

36 HC1

37 HC2
Three distinct partitions of a 3D hypercube into two 2D subcubes. Links connecting processors within a partition are indicated by bold lines.

38 HC3

39 Hypercube properties
Two PEs are connected by a direct link iff the binary representations of their labels differ in exactly one bit position.
In a d-dimensional HyC, each PE is directly connected to d other PEs.
A d-dimensional HyC can be partitioned into two (d-1)-dimensional subcubes (see figure HC2). Since PE labels have d bits, d such partitions exist.
Fixing any k bits of the d-bit labels: the PEs that differ only in the remaining (d-k) bit positions form a (d-k)-dimensional subcube composed of 2^(d-k) PEs, and there are 2^k such subcubes (see figure HC3).
Example: k = 2, d = 4
- 4 subcubes obtained by fixing the 2 MSBs
- 4 subcubes obtained by fixing the 2 LSBs
- always 4 subcubes, formed by fixing any 2 bits
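A short Python sketch (helper names are assumptions, not from the slides) checking these properties: two PEs are neighbours iff their labels differ in exactly one bit, and each PE has exactly d neighbours.

```python
def are_neighbours(a, b):
    """True iff labels a and b differ in exactly one bit position."""
    diff = a ^ b
    return diff != 0 and (diff & (diff - 1)) == 0   # exactly one bit set

def neighbours(label, d):
    """The d PEs directly connected to `label` in a d-dimensional hypercube."""
    return [label ^ (1 << bit) for bit in range(d)]

d = 3
print(neighbours(0b010, d))          # [3, 0, 6] -> labels 011, 000, 110
print(are_neighbours(0b010, 0b110))  # True  (differ only in the MSB)
print(are_neighbours(0b010, 0b111))  # False (differ in two bits)
```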

40 Hypercube properties (continued)
The total number of bit positions at which two labels differ is the Hamming distance between the two PEs.
- s is the source label, t is the destination label
- the number of 1s in s ⊕ t (bitwise XOR) is the Hamming distance
The number of communication links in the shortest path between two PEs is the Hamming distance between their labels.
The shortest path between any two PEs in a HyC cannot have more than d links (since s ⊕ t cannot contain more than d bits).

41 k-ary d-cube networks
A d-dimensional HyC is a binary d-cube, or 2-ary d-cube (2 PEs along each dimension).
In general, for k-ary d-cubes:
- d is the dimension of the network
- k is the radix, i.e., the number of PEs along each dimension
- the number of PEs is p = k^d
- e.g., a 2-D mesh with p PEs: p = k^2, or k = p^(1/2)

42 Evaluating static interconnection networks (in terms of cost and performance)
Diameter:
- the maximum distance between any two PEs in the network
- the distance between two PEs is the length of the shortest path between them
- the distance determines communication time
Diameters of common networks:
- completely-connected network: 1
- star-connected network: 2
- ring: floor(p/2)
- 2-D mesh (no wraparound): 2(sqrt(p) - 1)
- 2-D wraparound mesh: 2 floor(sqrt(p)/2)
- hypercube: log2 p
- complete binary tree: 2 log2((p+1)/2), where p is the number of processors (nodes); e.g., for p = 15 the height is h = log2((p+1)/2) = 3 and the diameter is 2h = 6 (see figure on the following slide)
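The formulas above can be checked with a short Python sketch (not from the slides; the function name and topology keys are illustrative), assuming p is a perfect square or power of two where the formula requires it.

```python
import math

def diameter(topology, p):
    if topology == "completely_connected":
        return 1
    if topology == "star":
        return 2
    if topology == "ring":
        return p // 2
    if topology == "mesh_2d":
        return 2 * (math.isqrt(p) - 1)
    if topology == "mesh_2d_wraparound":
        return 2 * (math.isqrt(p) // 2)
    if topology == "hypercube":
        return int(math.log2(p))
    if topology == "complete_binary_tree":
        return 2 * int(math.log2((p + 1) // 2))
    raise ValueError("unknown topology")

for t in ("ring", "mesh_2d", "mesh_2d_wraparound", "hypercube"):
    print(t, diameter(t, 64))   # 32, 14, 8, 6
```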

43 Tree network diameter

44 Evaluating static interconnection networks (continued)
Connectivity – a measure of the multiplicity of paths between two PEs.
High connectivity – lower contention for communication resources.
Arc connectivity – one measure of connectivity: the minimum number of arcs that must be removed to break the network into two disconnected networks.
Arc connectivity of common networks:
- linear arrays, stars, trees: 1
- rings, 2-D mesh without wraparound: 2
- 2-D mesh with wraparound: 4
- d-dimensional hypercube: d

45 Bisection width
The minimum number of communication links that have to be removed to partition the network into two equal halves.
Bisection widths of common networks:
- ring: 2
- 2-D mesh without wraparound: sqrt(p)
- 2-D mesh with wraparound: 2 sqrt(p)
- tree, star: 1 (the bisection width of a star equals that of a tree, since the star is a special case of a tree)
- fully-connected network: p^2/4
- hypercube: p/2
  (a d-dimensional HyC with p processors consists of two (d-1)-dimensional HyCs of p/2 processors each, joined by p/2 links between corresponding PEs)

46 HyC bisection width figure
Figure: a d-dimensional HyC with p processors split into two (d-1)-dimensional HyCs of p/2 processors each, joined by p/2 links.

47 Bisection bandwidth
Channel width – the number of bits that can be communicated simultaneously over a link connecting two PEs (the number of wires).
Channel rate – the peak rate at which a single physical wire can deliver bits.
Channel bandwidth (channel rate x channel width) – the peak rate at which data can be communicated between the ends of a communication link.
Bisection bandwidth (bisection width x channel bandwidth) – the minimum volume of communication allowed between any two halves of the network containing equal numbers of PEs.
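A worked example of these two products as a Python sketch; the numeric values (processor count, channel width, channel rate) are assumptions chosen for illustration, not from the slides.

```python
p = 64                       # processors in a hypercube
bisection_width = p // 2     # hypercube: p/2 links cross any bisection
channel_width = 16           # bits (wires) per link
channel_rate = 100e6         # peak transfers per second on one wire

channel_bandwidth = channel_rate * channel_width            # bits/s per link
bisection_bandwidth = bisection_width * channel_bandwidth   # bits/s across the cut
print(channel_bandwidth, bisection_bandwidth)               # 1.6e9, 5.12e10
```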

48 Cost
Measured in terms of (a) the number of communication links and (b) the bisection bandwidth.
Number of communication links (number of wires required):
- linear arrays, trees: p - 1 links
- d-dimensional wraparound mesh: dp
- HyC: (p/2) log2 p
Bisection bandwidth:
- measures cost by providing a lower bound on the area (2D) or the volume (3D) of the packaging
- if the bisection width of a network is w:
  . the lower bound on the area in 2D is Θ(w^2)
  . the lower bound on the volume in 3D is Θ(w^(3/2))

49 Embedding other networks into a hypercube
Given two graphs G(V, E) and G'(V', E'), embedding graph G into graph G' maps each vertex in the set V onto a vertex (or a set of vertices) in the set V', and each edge in the set E onto an edge (or a set of edges) in E'.
- Nodes correspond to PEs and edges correspond to communication links.
Why do we need embedding?
- It may be necessary to adapt one network to another, e.g., when an application is written for a specific network that is not available on the target machine.
Parameters:
- congestion – the maximum number of edges in E mapped onto one edge in E'
- dilation – the maximum number of edges in E' onto which one edge in E is mapped (the counterpart of congestion)
- expansion – the ratio of the number of vertices in V' corresponding to one vertex in V
- contraction – the reverse of expansion

50 Embedding a linear array into a HyC
A linear array of 2^d processors (labeled 0 ... 2^d - 1) can be embedded into a d-dimensional HyC by mapping processor i of the linear array to the processor with label G(i, d) of the HyC:
- G(0, 1) = 0
- G(1, 1) = 1
- G(i, x+1) = G(i, x) for i < 2^x, and G(i, x+1) = 2^x + G(2^(x+1) - 1 - i, x) for i >= 2^x
- The function G is the binary reflected Gray code (RGC).
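A Python sketch of G(i, d) (names are illustrative): the recursive form follows the recurrence above, and the one-liner i ^ (i >> 1), the usual closed form, is included only as a cross-check.

```python
def gray(i, d):
    """G(i, d): hypercube label of processor i of a 2**d-node linear array."""
    if d == 1:
        return i                       # G(0,1) = 0, G(1,1) = 1
    half = 1 << (d - 1)
    if i < half:
        return gray(i, d - 1)
    return half + gray(2 * half - 1 - i, d - 1)

d = 3
print([gray(i, d) for i in range(1 << d)])
# [0, 1, 3, 2, 6, 7, 5, 4] -- consecutive labels differ in exactly one bit
assert all(gray(i, d) == i ^ (i >> 1) for i in range(1 << d))
```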

51 Embedding a Mesh into a Hypercube
Processors:
- in the same column have identical least significant bits (LSBs)
- in the same row have identical most significant bits (MSBs)
Each row in the mesh is mapped to a unique subcube.
Each column in the mesh is mapped to a unique subcube.
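A minimal sketch (assumed names, not from the slides) of the Gray-code mesh embedding this slide describes: node (i, j) of a 2^r x 2^s mesh gets the label G(i) concatenated with G(j), so rows share their MSBs and columns their LSBs.

```python
def gray(i):
    return i ^ (i >> 1)                 # binary reflected Gray code

def mesh_to_hypercube(i, j, r, s):
    """Label of mesh node (i, j) in an (r+s)-dimensional hypercube."""
    return (gray(i) << s) | gray(j)

r = s = 2                               # a 4 x 4 mesh into a 4-D hypercube
for i in range(1 << r):
    print([format(mesh_to_hypercube(i, j, r, s), "04b") for j in range(1 << s)])
# Adjacent mesh nodes map to hypercube labels differing in exactly one bit.
```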

52 Embedding mesh networks into hypercubes
(a) A 4x4 mesh illustrating the mapping of mesh processors to processors in a 4D hypercube, and (b) a 2x4 processor mesh embedded into a 3D hypercube.

53 Reflected Gray code figure
(a) A 3-bit reflected Gray code ring and (b) its embedding into a 3D hypercube.

54 Embedding a binary tree into a hypercube
A tree of depth d embedded in a HyC of 2^d PEs:
- The root of the tree is mapped onto some HyC processor.
- For each node m at depth j in the tree, mapped to the HyC PE with label i: the left child of m is mapped to the same HyC PE as m itself, and the right child of m is mapped to the HyC processor whose label is obtained by inverting bit j of i.
. expansion = 1 (but up to 4 tree nodes may correspond to 1 node of the HyC)
. congestion =
. dilation =
Note: for the linear array and mesh embeddings, expansion = 1.

55 Tree embedded into HyC figure
A tree rooted at processor 011 (3 in decimal) and embedded into a 3D hypercube.

56 Routing mechanisms for static networks
A routing mechanism determines the path a message takes through the network to get from the source to the destination processor.
- It takes as input:
  . the source processor
  . the destination processor
  . information about the state of the network
- It returns one or more paths.

57 Routing mechanisms for static networks (continued)
Classification criteria:
- congestion:
  . minimal (shortest path between source and destination) – does not take congestion into consideration
  . non-minimal (avoids network congestion) – the path may not be the shortest
- use of network-state information:
  . deterministic routing (does not use network-state information) – determines a unique path using only source and destination information
  . adaptive routing (uses network-state information to avoid congestion)

58 Routing mechanisms for static networks (continued)
Dimension-ordered routing (deterministic, minimal):
- X-Y routing
- E-cube routing
X-Y routing:
- the message is sent along the X dimension until it reaches the column of the destination PE, and then travels along the Y dimension until it reaches its destination
- the path length is |Sx - Dx| + |Sy - Dy|
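A Python sketch of X-Y routing on a 2-D mesh (coordinate conventions and names are assumptions, not from the slides).

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst, X dimension first."""
    (sx, sy), (dx, dy) = src, dst
    path = [(sx, sy)]
    x, y = sx, sy
    while x != dx:                        # travel along X to the destination column
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                        # then along Y to the destination row
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

path = xy_route((1, 1), (4, 3))
print(path)                               # 5 hops: |4-1| + |3-1|
assert len(path) - 1 == abs(4 - 1) + abs(3 - 1)
```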

59 Routing mechanisms for static networks (continued)
E-cube routing:
- The minimum distance between two PEs, Ps and Pd, is given by the number of ones in Ps ⊕ Pd (bitwise XOR).
- Ps computes Ps ⊕ Pd and sends the message along dimension k, where k is the position of the least significant non-zero bit of Ps ⊕ Pd.
- At each intermediate step, Pi (the receiving PE) computes Pi ⊕ Pd and forwards the message along the dimension corresponding to the least significant non-zero bit.
- The process continues until the destination is reached.
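A Python sketch of these steps (assumed function name): at each node, flip the least significant bit in which the current label still differs from the destination. It reproduces the worked example on the following slides.

```python
def ecube_route(src, dst):
    """Return the sequence of node labels visited from src to dst."""
    path = [src]
    cur = src
    while cur != dst:
        diff = cur ^ dst                       # bits still to be corrected
        k = (diff & -diff).bit_length() - 1    # position of the non-zero LSB
        cur ^= 1 << k                          # move along dimension k
        path.append(cur)
    return path

# Example from the following slides: Ps = 010, Pd = 111.
print([format(n, "03b") for n in ecube_route(0b010, 0b111)])
# ['010', '011', '111']
```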

60 E-Cube routing on a 3D hypercube
Figure: E-cube routing on a 3D hypercube with nodes labeled 000-111, from Ps = 010 to Pd = 111.

61 E-Cube routing on hypercube network
Ps = 010 and Pd = 111.
Step 1: Ps ⊕ Pd = 101.
- Ps forwards the message along the dimension corresponding to the least significant non-zero bit: 010 → 011.
Pi = 011 and Pd = 111.
Step 2: Pi ⊕ Pd = 100.
- Pi forwards the message along the dimension corresponding to the least significant non-zero bit: 011 → 111.

62 Communication costs in static interconnection networks
Communication latency: the time taken to communicate a message between two processors in the network.
Parameters:
- Startup time (ts) – time to prepare the message (add header, trailer, error-correction information), execute the routing algorithm, and establish the interface between the PE and the router.
- Per-hop time (th, node latency) – time taken by the header to travel between two directly connected PEs.
- Per-word transfer time (tw) – if the channel bandwidth is r words/sec, each word takes tw = 1/r seconds to traverse the link.
Influenced by:
- network topology
- switching technique

63 Store-and-forward routing
Each intermediate processor on the path forwards the message to the next processor only after it has received and stored the entire message.
- m: message size (in words)
- l: number of links to traverse
- th: cost of the header traversing one link
- tw: per-word message transfer time
- m*tw: cost of the rest of the message traversing one link
The communication cost is tcomm = ts + (th + m*tw) * l.
Since th is small compared to m*tw, tcomm = ts + m*tw*l.
Since ts and tw are constants, tcomm = Θ(ml).
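A tiny Python sketch of this cost model; the parameter values are illustrative assumptions, not from the slides.

```python
def t_sf(ts, th, tw, m, l):
    """Store-and-forward: tcomm = ts + (th + m*tw) * l."""
    return ts + (th + m * tw) * l

print(t_sf(ts=50e-6, th=1e-6, tw=0.1e-6, m=1000, l=4))  # ~0.000454 s
```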

64 Store-and-forward routing figure
Figure: time-space diagram of a message of size m passing from P0 to P3 with store-and-forward routing.

65 Cut-through routing Store-and-forward routing disadvantages:
- communication time is high
- poor utilization of communication resources
Cut-through routing:
- the message is forwarded from the incoming link to the outgoing link as it arrives
- the message travels in small units called flow control digits (flits), pipelined through the network
- an intermediate processor does not wait for the entire message to arrive before forwarding; as soon as flits arrive they are passed on to the next processor
- all flits are routed along the same path
Notes:
- no need for buffer space to store the entire message at intermediate PEs
- uses less memory storage and memory bandwidth
- it is faster

66 Cut-through routing (continued)
From the figures on the next slide:
tcomm = ts + th*l + m*tw < ts + l*(th + m*tw) = tSF (for l > 1)
If ts, th, and tw are constants, then tcomm = Θ(m + l).
If l = 1, then tcomm(CT) = tcomm(SF) = Θ(m).
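A Python sketch comparing the two cost models above (parameter values are illustrative assumptions): for l > 1, cut-through avoids retransmitting the m*tw body on every hop.

```python
def t_sf(ts, th, tw, m, l):
    return ts + (th + m * tw) * l         # store-and-forward

def t_ct(ts, th, tw, m, l):
    return ts + th * l + m * tw           # cut-through (pipelined flits)

params = dict(ts=50e-6, th=1e-6, tw=0.1e-6, m=1000, l=4)
print(t_sf(**params), t_ct(**params))     # cut-through is cheaper for l > 1
assert t_ct(**params) < t_sf(**params)
# For l = 1 the two expressions coincide (up to floating-point rounding).
```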

67 Cut-through routing (continued)
Figure: time-space diagrams of store-and-forward routing and cut-through routing over P0-P3, showing the time saved by cut-through routing.

68 Cost-performance trade offs
Performance analysis of a mesh and a HyC with identical costs (based on two different cost metrics).
The cost of a network is taken to be proportional to either:
- the number of wires, or
- the bisection width
Assumption: lightly loaded networks and cut-through routing.

69 Cost-performance trade offs (continued)
1. If the cost of the network is proportional to the number of wires:
- a 2D wraparound mesh with (log p)/4 wires per channel costs as much as a p-processor HyC with 1 wire per channel (see page 38, table 2.1)
- average communication latencies (for a mesh and a HyC of the same cost):
  . lav, the average distance between two PEs:
    - sqrt(p)/2 in a 2D wraparound mesh
    - (log p)/2 in a HyC
  . the time to send a message of size m between PEs that are lav hops apart, with cut-through routing, is ts + th*lav + tw*m

70 Cost-performance trade offs (continued)
In the mesh:
- channel width scaled up by a factor of (log p)/4
- per-word transfer time reduced by a factor of (log p)/4
- if the per-word transfer time of the HyC is tw, the per-word transfer time of the mesh is 4tw/log p, that is, tw(mesh) = (4/log p) * tw(HyC)
Average communication latency:
- HyC: ts + th*(log p)/2 + tw*m
- Mesh: ts + th*sqrt(p)/2 + 4*m*tw/log p
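A Python sketch of this equal-wire-cost comparison (parameter values are illustrative assumptions): average latency of a p-node hypercube versus a 2-D wraparound mesh whose channels are (log2 p)/4 times wider.

```python
import math

def lat_hypercube(ts, th, tw, m, p):
    return ts + th * math.log2(p) / 2 + tw * m

def lat_mesh_same_wire_cost(ts, th, tw, m, p):
    return ts + th * math.sqrt(p) / 2 + 4 * m * tw / math.log2(p)

ts, th, tw, m = 50e-6, 1e-6, 0.1e-6, 1000
for p in (16, 256, 1024):
    print(p, lat_hypercube(ts, th, tw, m, p), lat_mesh_same_wire_cost(ts, th, tw, m, p))
# For p > 16 and large m, the 4*tw/log2(p) term makes the mesh latency lower.
```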

71 Cost-performance trade offs (continued)
Analysis:
- for p constant and m large, the communication term due to tw dominates
- 4*m*tw/log p (mesh) < tw*m (HyC) if p > 16
- point-to-point communication of large messages between random pairs of PEs therefore takes less time on a wraparound mesh with cut-through routing than on a HyC of the same cost
Note:
- if store-and-forward routing is used, the mesh is not more cost-efficient than the HyC
- the above analysis assumes light load conditions in the network
- if the number of messages is large, contention increases, and the mesh is affected by contention more than the HyC
- under heavy load conditions the HyC is better than the mesh

72 Cost-performance trade offs (continued)
2. If the cost of the network is proportional to its bisection width:
- a 2D wraparound mesh (p processors) with sqrt(p)/4 wires per channel has a cost equal to a p-processor HyC with 1 wire per channel (see page 38, table 2.1)
Average communication latencies (for a mesh and a HyC of the same cost):
- mesh channels are wider by a factor of sqrt(p)/4, which means the per-word transfer time is reduced by the same factor sqrt(p)/4
- communication latencies for the HyC and the mesh:
  . HyC: ts + th*(log p)/2 + tw*m
  . Mesh: ts + th*sqrt(p)/2 + 4*m*tw/sqrt(p)

73 Cost-performance trade offs (continued)
Analysis:
- for p constant and m large, the tw term dominates; for p > 16 the mesh outperforms a HyC of the same cost, provided the network is lightly loaded
- for heavily loaded networks, Perf(mesh) < Perf(HyC)

