Introduction to Parallel Computing
Models of parallel computers:
[Figure: four models, numbered 1 to 4, starting from the sequential model (one PE with one memory M) and ending with a pipelined model with stages f1, f2, f3, f4.]
TAXONOMY
Parallel computers are classified by:
- Control
- Address space
- Interconnection network
- Granularity

Control:
- SISD (Single Instruction stream, Single Data stream): single program, less memory.
- SIMD (Single Instruction stream, Multiple Data stream): dedicated hardware (expensive).
- MISD (Multiple Instruction stream, Single Data stream): rigid.
- MIMD (Multiple Instruction stream, Multiple Data stream): program and data are stored at all PEs; uses off-the-shelf hardware (inexpensive); flexible.
[Figure: SIMD (global control) versus MIMD (local control).]
Address space:
Two paradigms:
- Message passing: pairs of PE + memory.
- Shared memory: each PE can access any location in memory.

Message passing:
[Figure: PE/memory pairs P1/M, P2/M, ..., P/M connected by an interconnection network (INN).]
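Shared memory:
[Figure: processors P1, P2, P3, ... connected through an interconnection network (INN) to global memory modules M1, ..., Mn.]
- Easier to program, but needs specialized hardware (expensive).
- To minimize the cost of communication: latency-hiding techniques.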
[Figure: NUMA organization: processors P with local memory M, connected through the interconnection network (INN) to global memory.]
Interconnection networks:
- Static (direct): used for message passing.
- Dynamic (indirect): used for shared memory.

Processor granularity:
Granularity can be measured by the ratio (time required for a basic communication) / (time required for a basic computation).
- Coarse grain: the ratio is large; suited to algorithms that do not require frequent communication.
- Medium grain: between the two extremes.
- Fine grain: the ratio is small; suited to algorithms that require frequent communication.

PRAM model (an idealized parallel computer):
- P: processors, which share a common clock; each works on its own instruction.
- M: global memory, uniformly accessible to all PEs.
- Synchronous shared-memory MIMD computer.
- Interaction between PEs occurs at no cost.
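How is memory accessed by reads (R) and writes (W)?
- EREW (exclusive read, exclusive write): weakest model, minimum concurrency.
- CREW (concurrent read, exclusive write): write access is serialized, read access is concurrent.
- ERCW (exclusive read, concurrent write): write access is concurrent, multiple read accesses are serialized.
- CRCW (concurrent read, concurrent write): most powerful; can be simulated on an EREW model.

Concurrent read (CR) does not need any modification in the program. Concurrent write (CW) access to memory requires arbitration.

Protocols for concurrent writes:
- Common: the write is allowed iff all values are identical.
- Arbitrary: an arbitrary PE proceeds; the other PEs fail.
- Priority: the value of the PE with the highest priority is written.
- Sum: the sum of all quantities is written.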
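The four concurrent-write protocols can be sketched as a single arbitration function. This is only an illustration: the function name `resolve_concurrent_write`, the `(pe_id, value)` request format, and the convention that a lower `pe_id` means higher priority are assumptions, not part of the PRAM definition.

```python
def resolve_concurrent_write(requests, protocol):
    """Return the value stored in one memory cell, or None if the write fails.

    requests: list of (pe_id, value) pairs, all targeting the same cell.
    Assumption: a lower pe_id means a higher priority.
    """
    if not requests:
        return None
    values = [value for _, value in requests]
    if protocol == "common":
        # The write succeeds only if every PE writes the same value.
        return values[0] if all(v == values[0] for v in values) else None
    if protocol == "arbitrary":
        # Any single PE may win; taking the first request is one valid choice.
        return values[0]
    if protocol == "priority":
        # The PE with the highest priority (lowest id here) wins.
        return min(requests, key=lambda request: request[0])[1]
    if protocol == "sum":
        # The sum of all written quantities is stored.
        return sum(values)
    raise ValueError(f"unknown protocol: {protocol}")
```

For example, with requests `[(2, 5), (0, 7), (1, 1)]` the priority protocol stores 7 and the sum protocol stores 13, while the common protocol rejects the write because the values differ.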
Dynamic interconnection networks:
Example: an EREW PRAM with p processors and m words in global memory.
- Processors reach memory through switching elements, which determine the memory word accessed.
- Each processor can access any of the memory words => the number of switching elements is $\Theta(mp)$ (UMA) => expensive; this is the price of performance.

Solution: reduce the cost by using memory banks (the m words are organized in b banks).
- Each processor switches between b banks.
- Total number of switching elements: $\Theta(bp)$ => less expensive.

Note: this is only a weak approximation of an EREW PRAM, because no two PEs can access the same memory bank at the same time.
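Crossbar switching networks
[Figure: crossbar connecting processors P0, P1, ..., Pp-1 to memory banks M0, M1, M2, ..., Mb-1.]
Parameters: m (memory words), b (memory banks), p (processors).
- For m >= p, each PE accesses one memory bank.
- If b = m, the crossbar simulates an EREW PRAM.
- Total number of switching elements: $\Theta(pb)$.
- b should be at least p (if p > b, some PEs are unable to access any memory bank), so as p grows the total number of switching elements grows as $\Omega(p^2)$ => not scalable.

Bus-based networks
[Figure: processors P1, P2, P3, ..., Pp-1, Pp connected to a global memory M over a single bus (UMA).]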
- Each PE accesses data from M over the same bus (UMA).
- As p grows, each PE spends an increasing amount of time waiting for memory access while the bus is used by other PEs => not scalable.
- Solution: provide a local cache, which reduces the total number of accesses to global memory.
- This implies replicating data => cache coherency problems.
[Figure: processors P1, P2, ..., Pp, each with a cache, connected to global memory M over a bus.]
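Multistage interconnection networks
A high number of switching elements buys performance but raises cost.

             Performance         Cost
Crossbar     (+)                 (-)
Bus          (-) bottleneck      (+)

Multistage networks provide the best trade-off: cost better than a crossbar, performance better than a bus.
[Figure: two plots against the number of processors p, one for cost and one for performance, each with curves for crossbar, multistage and bus.]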
Multistage interconnection network: Omega network
- p = number of PEs
- b = number of memory banks
- log p = number of stages
- p = b
A link exists between input i and output j if:
$j = 2i$ for $0 \le i \le p/2 - 1$
$j = 2i + 1 - p$ for $p/2 \le i \le p - 1$
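The connection rule above (commonly called the perfect shuffle) is easy to check numerically. The sketch below is illustrative only; the function name `perfect_shuffle` is an assumption, and p is assumed to be a power of two.

```python
def perfect_shuffle(i: int, p: int) -> int:
    """Output j linked to input i in one stage with p inputs (p a power of two)."""
    if not 0 <= i < p:
        raise ValueError("input index out of range")
    # First half of the inputs: j = 2i. Second half: j = 2i + 1 - p.
    return 2 * i if i < p // 2 else 2 * i + 1 - p


# Wiring of one stage for p = 8 inputs.
wiring = [perfect_shuffle(i, 8) for i in range(8)]
```

For p = 8 the wiring is `[0, 2, 4, 6, 1, 3, 5, 7]`, which is the same as rotating the 3-bit label of each input one position to the left.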
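- Switch configuration in one stage: pass-through or cross-over.
- The configuration is given by the bit of the source and destination labels corresponding to that stage, starting with the MSB: if the two bits are equal the switch is set to pass-through, otherwise to cross-over.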
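A small simulation can make the switch settings concrete. This is a sketch under stated assumptions: each stage is taken to be a shuffle connection (j = 2i or 2i + 1 - p) followed by 2x2 switches, stage k compares the k-th most significant bits of the source and destination labels, and the name `omega_route` is hypothetical.

```python
def omega_route(s: int, t: int, p: int):
    """Trace a message from input s to output t through log2(p) stages.

    Returns (final_position, list of switch settings), p a power of two.
    """
    d = p.bit_length() - 1              # number of stages, log2(p)
    pos = s
    settings = []
    for stage in range(d):
        # Shuffle connection between stages.
        pos = 2 * pos if pos < p // 2 else 2 * pos + 1 - p
        bit = d - 1 - stage             # MSB first
        if ((s >> bit) & 1) == ((t >> bit) & 1):
            settings.append("pass-through")
        else:
            settings.append("cross-over")
            pos ^= 1                    # leave on the other switch output
    return pos, settings
```

Routing from 010 to 111 in an 8-input network gives the settings cross-over, pass-through, cross-over, and the message ends at position 7.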
Blocking networks: when two PEs need the same link (the link is used by two communication paths), one of the messages is blocked.

Static interconnection networks
- Used in message passing.

Completely-connected network
- Similar to a crossbar, but can support multiple channels.
- Every PE is connected to every other PE.
- Communication takes 1 step.
Star-connected network
- Central PE (bottleneck).
- Communication between two other PEs takes 2 steps, through the central PE.

Linear array and ring
[Figure: a linear array and a ring.]

Mesh network (2-D, 3-D)
- Extension of the linear array to 2 or 3 dimensions.
- Each interior PE is connected to 4 PEs (2-D) or 6 PEs (3-D).
- If the periphery PEs are connected to each other => wraparound mesh (torus).
Tree network
- Only one path exists between any pair of PEs.
- Linear arrays and star-connected networks are special cases of tree networks.
- Static tree: each node corresponds to a PE.
- Dynamic tree: only the leaf nodes are PEs; intermediate nodes are switching elements.
[Figure: tree network with nodes marked S and D.]
Fat tree: the number of communication links is increased for PEs closer to the root. In this way bottlenecks at higher levels of the tree are alleviated.

Hypercube network
- Multidimensional mesh with 2 PEs in each dimension.
- A d-dimensional hypercube has $p = 2^d$ PEs.
- Can be constructed recursively:
  0-D: $2^0 = 1$ PE, 1-D: $2^1 = 2$ PEs, 2-D: $2^2 = 4$ PEs, 3-D: $2^3 = 8$ PEs
[Figure: hypercubes of dimension 0, 1, 2 and 3.]
- A (d+1)-dimensional hypercube (HyC) is constructed by connecting the corresponding PEs of two d-dimensional HyCs. The labels of the PEs of one HyC are prefixed with 0 and those of the second HyC with 1.
[Figure: the three-dimensional hypercube with PEs 000 to 111, drawn three times.]
Figure HC2: Three distinct partitions of a three-dimensional hypercube into two two-dimensional cubes. Links connecting processors within a partition are indicated by bold lines.
Fig: b) HC3
Properties:
- Two PEs are connected by a direct link iff the binary representations of their labels differ at exactly one bit position.
- In a d-dimensional HyC, each PE is directly connected to d other PEs.
- A d-dimensional HyC can be partitioned into two (d-1)-dimensional subcubes (see Fig. HC2). Since PE labels have d bits, d such partitions exist.
- Fix any k bits of the d-bit labels in a d-dimensional HyC. The PEs that differ only in the remaining (d-k) bit positions form a (d-k)-dimensional subcube composed of $2^{d-k}$ PEs; there are $2^k$ such subcubes (see Fig. HC3).
- E.g. k = 2, d = 4:
  - 4 subcubes by fixing the 2 MSBs
  - 4 subcubes by fixing the 2 LSBs
  - always 4 subcubes formed by fixing any 2 bits
  - etc.
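The neighbor and subcube properties can be checked with a few lines of bit manipulation. This is a minimal sketch; the helper names `neighbors` and `subcube` and the dictionary format for the fixed bits are assumptions made for illustration, with bit 0 taken as the LSB.

```python
def neighbors(label: int, d: int) -> list[int]:
    """Labels of the d PEs directly linked to `label` in a d-dimensional hypercube."""
    # Flipping exactly one of the d bits yields each neighbor.
    return [label ^ (1 << k) for k in range(d)]


def subcube(d: int, fixed: dict[int, int]) -> list[int]:
    """PEs of the subcube obtained by fixing some bit positions.

    fixed maps a bit position (0 = LSB) to its required value (0 or 1).
    """
    return [
        label
        for label in range(1 << d)
        if all(((label >> pos) & 1) == val for pos, val in fixed.items())
    ]
```

With d = 4, fixing the two MSBs to 01 via `subcube(4, {3: 0, 2: 1})` returns the four PEs 0100 to 0111, a 2-dimensional subcube of $2^{4-2} = 4$ PEs.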
Hamming distance: the total number of bit positions at which two labels differ.
- For a source s and a destination t, the Hamming distance is the number of ones in $s \oplus t$, where $\oplus$ is the bitwise exclusive OR (EOR).
- The number of communication links in the shortest path between two PEs is the Hamming distance between their labels.
- The shortest path between any two PEs in a HyC cannot have more than d links (since $s \oplus t$ cannot contain more than d bits).

k-ary d-cube networks
- A d-dimensional HyC (2 PEs along each dimension) is a binary d-cube, or 2-ary d-cube.
- In a k-ary d-cube, d is the dimension of the network and k is the radix, the number of PEs along each dimension.
- Number of PEs: $p = k^d$.
- E.g. a 2-D mesh with p PEs: $p = k^2 \Rightarrow k = \sqrt{p}$.
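Both the Hamming-distance rule and the size formula $p = k^d$ reduce to one-line computations. The sketch below is illustrative; the function names are assumptions.

```python
def hamming_distance(s: int, t: int) -> int:
    """Number of bit positions at which labels s and t differ."""
    # XOR leaves a 1 exactly where the labels differ; count those ones.
    return bin(s ^ t).count("1")


def kary_dcube_size(k: int, d: int) -> int:
    """Number of PEs in a k-ary d-cube: p = k**d."""
    return k ** d
```

For instance, the shortest path between PEs 010 and 111 in a hypercube has `hamming_distance(0b010, 0b111)` = 2 links, and a 2-ary 3-cube (a 3-dimensional hypercube) has 8 PEs.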
Evaluating static interconnection networks (in terms of cost and performance)

Diameter: the maximum distance between any two PEs in the network.
- The distance between two PEs is the length of the shortest path between them.
- The diameter determines the communication time.

Network                      Diameter
completely connected         1
star-connected               2
ring                         $\lfloor p/2 \rfloor$
2-D mesh                     $2(\sqrt{p} - 1)$
wraparound 2-D mesh          $2\lfloor \sqrt{p}/2 \rfloor$
hypercube                    $\log p$
complete binary tree         $2\log((p+1)/2)$

Example: a complete binary tree with p = 15 (number of processors = number of nodes) has height $h = \log((p+1)/2) = 3$ and diameter 2h = 6.
[Figure: complete binary tree; the longest path runs from one leaf up to the root and down to another leaf.]
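The diameter formulas can be collected in a small lookup for quick comparisons. This is a sketch, not a general tool: the function name and topology keys are assumptions, and each entry is only meaningful when p fits the topology (a perfect square for meshes, a power of two for the hypercube, $2^k - 1$ for the complete binary tree).

```python
import math


def diameter(topology: str, p: int) -> int:
    """Diameter of a static network with p PEs, using the formulas above."""
    side = math.isqrt(p)              # mesh side length sqrt(p)
    log_p = p.bit_length() - 1        # log2(p) when p is a power of two
    table = {
        "completely_connected": 1,
        "star": 2,
        "ring": p // 2,
        "mesh_2d": 2 * (side - 1),
        "wraparound_mesh_2d": 2 * (side // 2),
        "hypercube": log_p,
        # 2 * log2((p + 1) / 2) for a complete binary tree with p nodes
        "complete_binary_tree": 2 * (((p + 1) // 2).bit_length() - 1),
    }
    return table[topology]
```

With p = 16 this gives 8 for the ring, 6 for the 2-D mesh, and 4 for both the wraparound mesh and the hypercube; the 15-node tree gives 6, matching the example.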
Connectivity: a measure of the multiplicity of paths between PEs.
- High connectivity lowers contention for communication resources.
- Arc connectivity is one measure of connectivity: the minimum number of arcs that must be removed to break the network into two disconnected networks.

Arc connectivity     Networks
1                    linear arrays, star, tree
2                    rings, 2-D meshes without wraparound
4                    2-D meshes with wraparound
d                    d-dimensional hypercubes

Bisection width: the minimum number of communication links that have to be removed to partition the network into two equal halves.

Network                          Bisection width
ring                             2
2-D mesh without wraparound      $\sqrt{p}$
2-D mesh with wraparound         $2\sqrt{p}$
tree, star                       1
fully connected network          $p^2/4$
hypercube                        $p/2$
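Hypercube: a d-dimensional HyC with p processors consists of two (d-1)-dimensional HyCs with p/2 processors each, joined by p/2 links that connect corresponding PEs. Removing these p/2 links bisects the network.
[Figure: d-dimensional HyC with p PEs split into HyC1 and HyC2, each (d-1)-dimensional with p/2 PEs.]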
Bisection bandwidth: measures cost by providing a lower bound on the area (2-D packaging) or volume (3-D packaging). If the bisection width of a network is w:
- the lower bound on the area in 2-D is $\Theta(w^2)$
- the lower bound on the volume in 3-D is $\Theta(w^{3/2})$

Channel width: the number of bits that can be communicated simultaneously over a link connecting two PEs (= number of wires).
Channel rate: the peak rate at which a single wire can deliver bits.
Channel bandwidth = channel rate × channel width: the peak rate at which data can be communicated between the ends of a communication link.
Bisection bandwidth = bisection width × channel bandwidth: the minimum volume of communication allowed between any two halves of the network with an equal number of PEs.

Cost is expressed in terms of:
a) the number of communication links
b) the bisection bandwidth
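The two products defined above can be worked through numerically. The sketch below is illustrative: the function names, topology keys, and the sample rate and width are assumptions, and the bisection widths are the standard values for these topologies (p is assumed to be a perfect square for meshes).

```python
import math


def bisection_width(topology: str, p: int) -> int:
    """Bisection width of a static network with p PEs."""
    side = math.isqrt(p)
    table = {
        "ring": 2,
        "mesh_2d": side,
        "wraparound_mesh_2d": 2 * side,
        "tree": 1,
        "star": 1,
        "fully_connected": p * p // 4,
        "hypercube": p // 2,
    }
    return table[topology]


def channel_bandwidth(channel_rate: int, channel_width: int) -> int:
    """Peak data rate of one link: rate per wire times number of wires."""
    return channel_rate * channel_width


def bisection_bandwidth(topology: str, p: int, channel_rate: int, channel_width: int) -> int:
    """Bisection width times channel bandwidth."""
    return bisection_width(topology, p) * channel_bandwidth(channel_rate, channel_width)
```

A 16-PE hypercube with 8-bit channels, each wire running at 1,000,000 bits per second, has a bisection bandwidth of 8 × 8,000,000 = 64,000,000 bits per second.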
Number of communication links (number of wires required):
- linear arrays, trees: p - 1 links
- d-dimensional wraparound mesh: dp
- HyC: $(p \log p)/2$

Embedding other networks into a hypercube
Given two graphs G(V, E) and G'(V', E'), embedding graph G into graph G' maps each vertex in the set V onto a vertex (or a set of vertices) in the set V', and each edge in the set E onto an edge (or a set of edges) in E'. Nodes correspond to PEs and edges to communication links.

Why is embedding needed?
Answer: it may be necessary to adapt one network to another (when an application is written for a specific network that is not available at present).

Parameters:
- congestion: the maximum number of edges in E mapped onto one edge in E'
- dilation (the reverse of congestion): the maximum number of edges in E' that one edge in E is mapped onto
- expansion: the ratio of the number of vertices in V' to the number of vertices in V
- contraction: the reverse of expansion
Embedding a linear array into a HyC
A linear array of $2^d$ processors (labeled 0 to $2^d - 1$) can be embedded into a d-dimensional HyC by mapping processor i of the linear array to processor G(i, d) of the HyC:
$G(0, 1) = 0$
$G(1, 1) = 1$
$G(i, x+1) = G(i, x)$ for $i < 2^x$
$G(i, x+1) = 2^x + G(2^{x+1} - 1 - i, x)$ for $i \ge 2^x$
This function G is the binary reflected Gray code (RGC).

Embedding a mesh into a hypercube
Processors in:
- the same column have identical least significant bits
- the same row have identical most significant bits
- each row and each column of the mesh is mapped to a unique subcube.
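The recursive definition of G translates directly into code. The sketch below follows the recurrence with the second case read as $2^x + G(2^{x+1} - 1 - i, x)$; the mesh mapping assumes a $2^r \times 2^s$ mesh whose processor (i, j) goes to the concatenation of G(i, r) and G(j, s). The function names `gray` and `mesh_to_hypercube` are assumptions.

```python
def gray(i: int, x: int) -> int:
    """Binary reflected Gray code G(i, x) for 0 <= i < 2**x, x >= 1."""
    if x == 1:
        return i                        # G(0, 1) = 0, G(1, 1) = 1
    half = 1 << (x - 1)
    if i < half:
        return gray(i, x - 1)           # first half: same code, one bit shorter
    # second half: set the top bit and reflect the index
    return half + gray(2 * half - 1 - i, x - 1)


def mesh_to_hypercube(i: int, j: int, r: int, s: int) -> int:
    """Map processor (row i, column j) of a 2**r x 2**s mesh to a hypercube label."""
    # Row code occupies the r most significant bits, column code the s least.
    return (gray(i, r) << s) | gray(j, s)


# Hypercube labels visited by a linear array of 8 processors.
array_embedding = [gray(i, 3) for i in range(8)]
```

The 8-processor array maps to 000, 001, 011, 010, 110, 111, 101, 100: consecutive array processors land on hypercube PEs whose labels differ in exactly one bit, so they are directly linked.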
Embedding a binary tree into a hypercube
- A tree of depth d is embedded in a HyC with $2^d$ PEs.
- The root of the tree is mapped onto a HyC processor.
- For each node m at depth j in the tree, mapped to HyC processor i: the left child of m is mapped to the same HyC processor i, and the right child of m is mapped to the HyC processor whose label is obtained by inverting bit j of i.
- expansion = 1 (but at most 4 nodes of the tree may correspond to one node of the HyC)
- congestion = ____
- dilation = ____
Note: for the array and the mesh, expansion = congestion = ____

Fig: A tree rooted at processor 011 (= 3) and embedded into a three-dimensional hypercube: (a) the organization of the tree rooted at processor 011, and (b) the tree embedded into a three-dimensional hypercube.
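The left-child/right-child rule can be traced with a short sketch. Assumptions are labeled here and in the code: tree nodes are identified by their path from the root as a string of "L" and "R", bit j is counted from the least significant bit (bit 0), and the name `embed_tree` is hypothetical.

```python
def embed_tree(root_pe: int, depth: int) -> dict[str, int]:
    """Map every node of a complete binary tree of the given depth to a hypercube PE.

    Nodes are keyed by their path from the root ("" is the root itself).
    Assumption: bit j means the j-th bit counted from the LSB.
    """
    mapping = {"": root_pe}
    frontier = [""]
    for j in range(depth):
        next_frontier = []
        for path in frontier:
            pe = mapping[path]
            mapping[path + "L"] = pe              # left child stays on the same PE
            mapping[path + "R"] = pe ^ (1 << j)   # right child: invert bit j
            next_frontier += [path + "L", path + "R"]
        frontier = next_frontier
    return mapping


# Tree of depth 3 rooted at processor 011, as in the figure.
tree_map = embed_tree(0b011, 3)
```

In this mapping the 8 leaves occupy the 8 distinct PEs of the three-dimensional hypercube, and 4 tree nodes (the root and its chain of left descendants) share processor 011.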
Routing mechanism for static networks
A routing mechanism determines the path a message takes through the network to get from the source to the destination processor.
- Considers: source, destination, information about the state of the network.
- Returns: one or more paths.

Classification (criteria):
Path length and congestion:
- minimal: selects a shortest path between source and destination; does not take congestion into consideration.
- non-minimal: avoids network congestion; the path may not be the shortest.
Use of network state information:
- deterministic routing: does not use network state information; determines a unique path using only source and destination information.
- adaptive routing: uses network state information to avoid congestion.
E.g. dimension-ordered routing (deterministic, minimal): X-Y routing and E-cube routing.

X-Y routing:
- The message is sent along the X dimension until it reaches the column of the destination PE, and then along the Y dimension until it reaches its destination.
- Path length: $|S_x - D_x| + |S_y - D_y|$

E-cube routing:
- The minimum distance between two PEs $P_s$ and $P_d$ is given by the number of ones in $P_s \oplus P_d$.
- $P_s$ computes $P_s \oplus P_d$ and sends the message along dimension k, where k is the position of the least significant nonzero bit in $P_s \oplus P_d$.
- At each intermediate step, $P_i$ (the PE that receives the message) computes $P_i \oplus P_d$ and forwards the message along the dimension corresponding to the least significant nonzero bit.
- The process continues until the destination is reached.
[Figure: X-Y routing path on a 2-D mesh with X and Y axes.]
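For the X-Y routing rule, a short sketch shows the path and its length. It assumes a 2-D mesh without wraparound links, coordinates given as (x, y) tuples, and movement along X first and then along Y; the name `xy_route` is hypothetical.

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Sequence of PEs visited by X-Y routing on a 2-D mesh without wraparound."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    # Phase 1: travel along X until the destination column is reached.
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    # Phase 2: travel along Y until the destination itself is reached.
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```

Routing from (0, 0) to (2, 1) visits (0, 0), (1, 0), (2, 0), (2, 1): 3 links, equal to $|S_x - D_x| + |S_y - D_y|$.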
E-cube routing on a hypercube network
Example: $P_s = 010$, $P_d = 111$
- Step 1: $P_s \oplus P_d = 010 \oplus 111 = 101$. $P_s$ forwards the message along the dimension corresponding to the LSB: 010 -> 011.
- Step 2: $P_i = 011$, $P_i \oplus P_d = 011 \oplus 111 = 100$. $P_i$ forwards the message along the dimension corresponding to the least significant bit that is not zero: 011 -> 111.
[Figure: three-dimensional hypercube with PEs 000 to 111, showing the route from Ps = 010 to Pd = 111.]
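The worked example can be reproduced with a few lines of bit arithmetic. This is a minimal sketch; the name `ecube_route` is an assumption, and the expression `diff & -diff` is just one way to isolate the least significant nonzero bit.

```python
def ecube_route(src: int, dst: int) -> list[int]:
    """PE labels visited by E-cube routing from src to dst on a hypercube."""
    path = [src]
    current = src
    while current != dst:
        diff = current ^ dst            # bits that still have to be corrected
        lowest = diff & -diff           # least significant nonzero bit of diff
        current ^= lowest               # move along that dimension
        path.append(current)
    return path
```

`ecube_route(0b010, 0b111)` returns 010, 011, 111, the same two hops as in the example, and the number of hops always equals the Hamming distance between the two labels.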
Communication costs in static interconnection networks
Communication latency: the time taken to communicate a message between two processors in the network.

Parameters:
- Startup time ($t_s$):
  - prepare the message (add header, trailer, error correction information)
  - execute the routing algorithm
  - establish the interface between the PE and the router
- Per-hop time ($t_h$, node latency): the time taken by the header to travel between two directly connected PEs.
- Per-word transfer time ($t_w$): if the channel bandwidth is r words/sec, each word takes $t_w = 1/r$ to traverse the link.

Communication latency is influenced by:
1. network topology
2. switching techniques
Store-and-forward routing
Each intermediate processor on the path forwards the message to the next processor after it has received and stored the entire message.
- $m$: size of the message
- $l$: number of links to be traversed
- $t_h$: cost for the header to traverse a link
- $m t_w$: cost for the rest of the message to traverse a link

$t_{comm} = t_s + (t_h + m t_w) l$
Since $t_h$ is small compared to $m t_w$:
$t_{comm} \approx t_s + m t_w l$
With $t_s$ constant: $\Theta(ml)$ time complexity.
[Figure: space-time diagram of a message passing through processors p0, p1, p2, p3 under store-and-forward routing.]
Cut-through routing
Store-and-forward routing:
- communication time is high
- poor utilization of communication resources

Cut-through routing: the message is advanced from the incoming link to the outgoing link as it arrives.
- The message travels in small units called flow control digits, or flits (pipelined through the network).
- An intermediate processor does not wait for the entire message to arrive before forwarding it. As soon as a flit arrives, it is passed on to the next processor.
- All flits are routed along the same path.

Note:
- There is no need for buffer space to store the entire message at intermediate PEs.
- It uses less memory (storage) and less memory bandwidth.
- It is faster.
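[Figure: space-time diagrams for processors p0, p1, p2, p3: store-and-forward routing (top) and cut-through routing with the message broken into 4 parts (bottom); the difference between the two completion times is marked t_saved.]

For $l > 1$:
$t_{comm}^{CR} = t_s + l t_h + m t_w < t_s + l t_h + m t_w l = t_{comm}^{SF}$
- If $t_s$ is constant, cut-through routing has $\Theta(m + l)$ time complexity.
- If $l = 1$, then $t_{comm}^{CR} = t_{comm}^{SF}$: $\Theta(m)$.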
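The two cost models can be compared numerically. The sketch below evaluates both formulas for the same parameters; the function names and the sample values are assumptions chosen only for illustration.

```python
def t_store_and_forward(t_s: float, t_h: float, t_w: float, m: int, l: int) -> float:
    """t_comm = t_s + (t_h + m * t_w) * l: the whole message is stored at every hop."""
    return t_s + (t_h + m * t_w) * l


def t_cut_through(t_s: float, t_h: float, t_w: float, m: int, l: int) -> float:
    """t_comm = t_s + l * t_h + m * t_w: flits are pipelined along the path."""
    return t_s + l * t_h + m * t_w


# Hypothetical parameters: startup 10, per-hop 1, per-word 2 time units.
sf = t_store_and_forward(10, 1, 2, m=100, l=4)
ct = t_cut_through(10, 1, 2, m=100, l=4)
```

With these values store-and-forward takes 814 time units and cut-through 214; for a single link (l = 1) both formulas give 211.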
Cost-performance tradeoffs
Performance analysis of a mesh and a HyC with identical costs (based on various cost metrics).

Cost of the network is proportional to:
- the number of wires
- the bisection bandwidth

Assumptions: lightly loaded networks and cut-through routing.

1) If the cost of the network is proportional to the number of wires:
- A 2-D wraparound mesh with $(\log p)/4$ wires per channel costs as much as a p-processor HyC with 1 wire per channel (see page 38, Table 2.1).

Average communication latencies (for a mesh and a HyC of the same cost)
$l_{av}$: the average distance between two PEs.
- $\sqrt{p}/2$ in a 2-D wraparound mesh
- $(\log p)/2$ in a HyC

Time to send a message of size m between PEs that are $l_{av}$ hops apart, for cut-through routing:
$t_s + t_h l_{av} + t_w m$
In the mesh: the channel width is scaled up by $(\log p)/4$, so the per-word transfer time is reduced by the same factor $(\log p)/4$.
In the HyC: if the per-word transfer time is $t_w$, the corresponding time for the mesh is $4 t_w / \log p$:
$t_w^{mesh} = \frac{4}{\log p} t_w^{HyC}$

Average communication latency:
- HyC: $t_s + t_h (\log p)/2 + t_w m$
- Mesh: $t_s + t_h \sqrt{p}/2 + 4 m t_w / \log p$

Analysis: for p constant and m large, the communication term due to $t_w$ dominates, and
$4 m t_w / \log p$ (mesh) $< t_w m$ (HyC) if $p > 16$.
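The crossover at p = 16 can be checked by evaluating both latency expressions. This is a sketch under the same assumptions as the analysis (equal wire cost, light load, cut-through routing); the function names and sample parameter values are hypothetical.

```python
import math


def hypercube_latency(p: int, m: int, t_s: float, t_h: float, t_w: float) -> float:
    """Average latency on a p-PE hypercube with 1 wire per channel."""
    return t_s + t_h * math.log2(p) / 2 + t_w * m


def mesh_latency_equal_wires(p: int, m: int, t_s: float, t_h: float, t_w: float) -> float:
    """Average latency on a 2-D wraparound mesh with (log p)/4 wires per channel."""
    return t_s + t_h * math.sqrt(p) / 2 + 4 * m * t_w / math.log2(p)


# Hypothetical parameters: large message, 256 PEs.
hyc = hypercube_latency(256, 10_000, t_s=50, t_h=1, t_w=1)
mesh = mesh_latency_equal_wires(256, 10_000, t_s=50, t_h=1, t_w=1)
```

For p = 256 and m = 10,000 words the mesh needs about 5,058 time units against about 10,054 for the hypercube; at p = 16 the two expressions coincide.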
Point-to-point communication of large messages between random pairs of PEs takes less time on a wraparound mesh with cut-through routing than on a HyC of the same cost.

Note:
(1) If store-and-forward routing is used, the mesh is not more cost-efficient than the HyC.
(2) The above analysis was performed under light load conditions in the network. If the number of messages is large, contention arises. The mesh is affected by contention more than the HyC => for high load conditions the HyC is better than the mesh.

2) If the cost of the network is proportional to its bisection width:
- A 2-D wraparound mesh (p processors) with $\sqrt{p}/4$ wires per channel has a cost equal to a p-processor HyC with 1 wire per channel (see page 38, Table 2.1).
Average communication latencies (for a mesh and a HyC of the same cost)
The mesh channel is wider by $\sqrt{p}/4$ => the per-word transfer time is reduced by the same factor $\sqrt{p}/4$.

Communication latencies for the mesh and the HyC:
- HyC: $t_s + t_h (\log p)/2 + t_w m$
- Mesh: $t_s + t_h \sqrt{p}/2 + 4 m t_w / \sqrt{p}$

Analysis: for p constant and m large, the $t_w$ term dominates, and for $p > 16$ the mesh outperforms a HyC of the same cost, provided the network is lightly loaded.
For heavily loaded networks, the performance of the mesh is approximately equal to the performance of the HyC.