1
Parallel Programming
Sathish S. Vadhiyar
Course Web Page: http://www.serc.iisc.ernet.in/~vss/courses/PPP2009
2
Motivation for Parallel Programming
- Faster execution time due to non-dependencies between regions of code
- Presents a level of modularity
- Resource constraints, e.g., large databases
- Certain classes of algorithms lend themselves naturally to parallelism
- Aggregate bandwidth to memory/disk; increase in data throughput
- Clock rate improvement in the past decade – 40%; memory access time improvement in the past decade – 10%
- Grand challenge problems (more later)
3
Challenges / Problems in Parallel Algorithms
- Building efficient algorithms
- Avoiding communication delay, idling, and synchronization
4
Challenges
[Figure: timeline for processes P0 and P1 showing computation, communication, synchronization, and idle time]
5
How do we evaluate a parallel program?
- Execution time, T(p, n)
- Speedup, S: S(p, n) = T(1, n) / T(p, n). Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup)
- Efficiency, E: E(p, n) = S(p, n) / p. Usually E(p, n) < 1; sometimes greater than 1
- Scalability – limitations in parallel computing, relation to n and p
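As an illustration (not part of the original slides), these metrics can be computed directly from measured timings; the timing values below are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """S(p, n) = T(1, n) / T(p, n)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(p, n) = S(p, n) / p."""
    return speedup(t_serial, t_parallel) / p

# Hypothetical measured times (seconds) for the same problem size n.
t1, t8 = 100.0, 16.0           # T(1, n) and T(8, n)
print(speedup(t1, t8))          # 6.25   -> sublinear, S(p, n) < p
print(efficiency(t1, t8, 8))    # 0.78   -> E(p, n) < 1
```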
6
Speedups and efficiency
[Figure: speedup S vs. number of processors p (ideal vs. practical), and efficiency E vs. p]
7
Limitations on speedup – Amdahl’s law
- Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- The overall speedup is expressed in terms of the fractions of computation time with and without the enhancement and the speedup of the enhanced portion.
- Places a limit on the speedup due to parallelism:
  Speedup = 1 / (f_s + f_p / P)
  where f_s is the serial fraction, f_p the parallel fraction, and P the number of processors.
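A minimal sketch of the formula above, assuming f_s and f_p are the serial and parallel fractions of the single-processor time (f_s + f_p = 1); the function name is illustrative.

```python
def amdahl_speedup(f_p, p):
    """Speedup = 1 / (f_s + f_p / P), with f_s = 1 - f_p."""
    f_s = 1.0 - f_p
    return 1.0 / (f_s + f_p / p)

print(amdahl_speedup(0.99, 32))     # ~24.4: even 1% serial work caps the speedup
print(amdahl_speedup(0.99, 10**6))  # ~100: the limit is 1 / f_s
```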
8
Amdahl’s law – Illustration
S = 1 / (s + (1 - s)/p)
Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html, http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
9
Amdahl’s law analysis

f      P=1   P=4   P=8   P=16   P=32
1.00   1.0   4.00  8.00  16.00  32.00
0.99   1.0   3.88  7.48  13.91  24.43
0.98   1.0   3.77  7.02  12.31  19.75
0.96   1.0   3.57  6.25  10.00  14.29

For the same parallel fraction f, the speedup falls further behind the processor count as P grows. Thus Amdahl’s law is a bit depressing for parallel programming. In practice, the amount of parallel work has to be large enough to match a given number of processors.
10
Gustafson’s Law
- Amdahl’s law – keep the parallel work fixed
- Gustafson’s law – keep the computation time on the parallel processors fixed, and change the problem size (the fraction of parallel vs. sequential work) to match that computation time
- For a particular number of processors, find the problem size for which the parallel time equals the fixed time
- For that problem size, find the sequential time and the corresponding speedup
- The resulting speedup is called scaled speedup
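A small illustrative sketch of scaled speedup, assuming s denotes the serial fraction of the fixed parallel execution time, so that the scaled speedup is s + (1 - s)·P; names and values are hypothetical.

```python
def gustafson_scaled_speedup(s, p):
    """Scaled speedup = s + (1 - s) * P, where s is the serial fraction
    of the fixed parallel execution time."""
    return s + (1.0 - s) * p

# With 1% serial time on the parallel machine, the scaled speedup keeps
# growing with P, because the problem size is grown along with P.
for p in (4, 8, 16, 32):
    print(p, gustafson_scaled_speedup(0.01, p))
# 4 3.97, 8 7.93, 16 15.85, 32 31.69
```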
11
Metrics (contd.)

N     P=1   P=4   P=8   P=16  P=32
64    1.0   0.80  0.57  0.33
192   1.0   0.92  0.80  0.60
512   1.0   0.97  0.91  0.80

Table 5.1: Efficiency as a function of n and p.
12
Scalability
- Efficiency decreases with increasing P; increases with increasing N
- How effectively the parallel algorithm can use an increasing number of processors
- How the amount of computation performed must scale with P to keep E constant
- This function of computation in terms of P is called the isoefficiency function
- An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable
13
Scalability Analysis – Finite Difference algorithm with 1-D decomposition
- For constant efficiency, a function of P, when substituted for N, must keep E = T(1, N) / (P · T(P, N)) constant for increasing P
- This can be satisfied with N = P, except for small P
- Hence the isoefficiency function is O(P^2), since the computation is O(N^2)
14
Scalability Analysis – Finite Difference algorithm with 2-D decomposition
- The constant-efficiency relation can be satisfied with N = sqrt(P)
- Hence the isoefficiency function is O(P), since the computation is O(N^2)
- The 2-D algorithm is therefore more scalable than the 1-D algorithm
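To make the difference concrete, a hedged sketch of how the problem size N (and hence the O(N^2) computation) must grow to keep efficiency constant, using the relations N = P for the 1-D decomposition and N = sqrt(P) for the 2-D decomposition from the slides; constant factors are ignored.

```python
import math

# Problem size needed for constant efficiency (up to constant factors).
def n_required_1d(p):
    return p                    # N = P        -> work grows as O(P^2)

def n_required_2d(p):
    return math.sqrt(p)         # N = sqrt(P)  -> work grows as O(P)

for p in (16, 64, 256, 1024):
    w1 = n_required_1d(p) ** 2  # computation is O(N^2)
    w2 = n_required_2d(p) ** 2
    print(f"P={p:5d}  1-D work ~ {w1:9.0f}   2-D work ~ {w2:6.0f}")
```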
15
Parallel Algorithm Design
16
Steps
- Decomposition – splitting the problem into tasks or modules
- Mapping – assigning tasks to processors
- Mapping has contradictory objectives: to minimize idle times and to reduce communication
17
Mapping
- Static mapping
  - Mapping based on data partitioning – applicable to dense matrix computations
    - Block distribution and block-cyclic distribution (see the sketch below)
  - Graph-partitioning-based mapping – applicable to sparse matrix computations
  - Mapping based on task partitioning
[Figure: example block and block-cyclic distributions of array elements among processes 0, 1, and 2]
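An illustrative sketch (not from the slides) of the block and block-cyclic distributions named above, mapping array indices to owning processes; the array length, process count, and block size are arbitrary.

```python
def block_owner(i, n, p):
    """Block distribution: contiguous chunks of size ceil(n/p)."""
    block = -(-n // p)               # ceiling division
    return i // block

def block_cyclic_owner(i, p, block=2):
    """Block-cyclic distribution: blocks of `block` elements dealt
    round-robin to the p processes."""
    return (i // block) % p

n, p = 12, 3
print([block_owner(i, n, p) for i in range(n)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
print([block_cyclic_owner(i, p) for i in range(n)])
# [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]
```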
18
Based on Task Partitioning
- Based on the task dependency graph
- In general, the mapping problem is NP-complete
[Figure: task dependency graph – a tree with task 0 at the root, tasks 0 and 4 below, then 0, 2, 4, 6, then tasks 0–7 at the leaves]
19
Mapping – Dynamic Mapping
- A process/global memory can hold a set of tasks
- Distribute some tasks to all processes
- Once a process completes its tasks, it asks the coordinator process for more tasks
- Referred to as self-scheduling or work-stealing (a minimal sketch follows)
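A minimal shared-memory sketch of the self-scheduling idea described above, using a thread-safe queue as the task pool; the task body, task count, and worker count are illustrative.

```python
import queue
import threading

def worker(tasks, results):
    # Each worker repeatedly asks the shared pool for another task
    # until the pool is empty (self-scheduling).
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return
        results.put(item * item)     # stand-in for real work
        tasks.task_done()

tasks, results = queue.Queue(), queue.Queue()
for i in range(100):
    tasks.put(i)

threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(results.qsize())               # 100 tasks completed
```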
20
Interaction Overheads
- In spite of the best efforts in mapping, there can be interaction overheads
- Due to frequent communication, exchanging large volumes of data, interaction with the farthest processors, etc.
- Some techniques can be used to minimize interactions
21
Parallel Algorithm Design – Containing Interaction Overheads
- Maximizing data locality
- Minimizing the volume of data exchange
  - Using higher-dimensional mapping
  - Not communicating intermediate results
- Minimizing the frequency of interactions
- Minimizing contention and hot spots
  - Avoid having every process use the same communication pattern with the other processes
22
Parallel Algorithm Design – Containing Interaction Overheads
- Overlapping computations with interactions
  - Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  - Initiate communication for type 1; during communication, perform type 2 (see the sketch after this list)
- Overlapping interactions with interactions
- Replicating data or computations
  - Balancing the extra computation or storage cost against the gain from reduced communication
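A hedged sketch of overlapping computation with communication using non-blocking operations in mpi4py, assuming exactly two ranks; the buffer sizes, tags, and final reduction are illustrative, not the course's code.

```python
# Run with exactly two ranks, e.g.:  mpiexec -n 2 python this_script.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank

local = np.random.rand(1_000_000)   # local data
halo = np.empty(1, dtype='d')       # boundary value from the neighbour

# Initiate the communication needed by the "type 1" (dependent) phase.
reqs = [comm.Isend(local[-1:], dest=other, tag=0),
        comm.Irecv(halo, source=other, tag=0)]

# "Type 2" phase: work that does not need the communicated data.
interior = local[1:-1].sum()

MPI.Request.Waitall(reqs)           # communication has completed

# "Type 1" phase: work that needs the received boundary value.
total = interior + local[0] + local[-1] + halo[0]
print(rank, total)
```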
23
Parallel Algorithm Classification – Types - Models
24
Parallel Algorithm Types
- Divide and conquer
- Data partitioning / decomposition
- Pipelining
25
Divide-and-Conquer
- Recursive in structure
- Divide the problem into sub-problems that are similar to the original but smaller in size
- Conquer the sub-problems by solving them recursively; if small enough, solve them in a straightforward manner
- Combine the solutions to create a solution to the original problem
26
Divide-and-Conquer Example: Merge Sort
- Problem: sort a sequence of n elements
- Divide the sequence into two subsequences of n/2 elements each
- Conquer: sort the two subsequences recursively using merge sort
- Combine: merge the two sorted subsequences to produce the sorted answer (sketch below)
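A compact sequential sketch of the merge sort just described; the two recursive "conquer" calls are independent, which is where a parallel version would split the work.

```python
def merge_sort(seq):
    # Divide: split into two subsequences of roughly n/2 elements.
    if len(seq) <= 1:
        return list(seq)
    mid = len(seq) // 2
    # Conquer: sort each half (these two calls are independent and
    # could be executed by different processes/threads).
    left, right = merge_sort(seq[:mid]), merge_sort(seq[mid:])
    # Combine: merge the two sorted halves.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 7, 3]))   # [1, 2, 3, 5, 7, 9]
```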
27
Partitioning
1. Breaking up the given problem into p independent subproblems of almost equal sizes
2. Solving the p subproblems concurrently
- Mostly splitting the input or output into non-overlapping pieces
- Example: matrix multiplication – either the inputs (A or B) or the output (C) can be partitioned (see the sketch below)
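An illustrative sketch of output partitioning for C = A × B: the rows of C are split into p independent sub-problems that could be solved concurrently; the matrix sizes and helper name are hypothetical.

```python
import numpy as np

def matmul_row_partitioned(A, B, p):
    """Partition the output C into p row blocks; each block is an
    independent sub-problem C[rows, :] = A[rows, :] @ B."""
    n = A.shape[0]
    bounds = [round(k * n / p) for k in range(p + 1)]
    C = np.empty((n, B.shape[1]))
    for k in range(p):                      # each iteration could be a task
        lo, hi = bounds[k], bounds[k + 1]
        C[lo:hi, :] = A[lo:hi, :] @ B
    return C

A, B = np.random.rand(6, 4), np.random.rand(4, 5)
print(np.allclose(matmul_row_partitioned(A, B, 3), A @ B))   # True
```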
28
Pipelining
- Occurs, e.g., in image processing applications where a number of images undergo a sequence of transformations.
29
Parallel Algorithm Models
- Data parallel model – processes perform identical tasks on different data
- Task parallel model – different processes perform different tasks on the same or different data, based on a task dependency graph
- Work pool model – any task can be performed by any process; tasks are added to a work pool dynamically
- Pipeline model – a stream of data passes through a chain of processes (stream parallelism); a minimal sketch follows
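A small sketch of the pipeline (stream parallelism) model, with each stage as a thread connected to the next by a queue; the two stage functions are placeholders.

```python
import threading, queue

SENTINEL = object()

def stage(func, inq, outq):
    # Each stage consumes items from its input queue, transforms them,
    # and streams results to the next stage.
    while True:
        item = inq.get()
        if item is SENTINEL:
            outq.put(SENTINEL)
            return
        outq.put(func(item))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x + 1, q0, q1)).start()
threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2)).start()

for x in range(5):
    q0.put(x)
q0.put(SENTINEL)

while (item := q2.get()) is not SENTINEL:
    print(item)          # (x + 1) * 2 for x = 0..4 -> 2 4 6 8 10
```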
30
Parallel Program Classification - Models - Structure - Paradigms
31
Parallel Program Models Single Program Multiple Data (SPMD) Multiple Program Multiple Data (MPMD) Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
32
Parallel Program Structure Types
- Master-worker / parameter sweep / task farming
- Embarrassingly / pleasingly parallel
- Pipeline / systolic / wavefront
- Tightly coupled
- Workflow
[Figure: example process structures with processes P0–P4]
33
Programming Paradigms Shared memory model – Threads, OpenMP Message passing model – MPI Data parallel model – HPF Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
34
Parallel Architectures
- Classification
- Cache coherence in shared memory platforms
- Interconnection networks
35
Classification of Architectures – Flynn’s classification
- Single Instruction Single Data (SISD): serial computers
- Single Instruction Multiple Data (SIMD): vector processors and processor arrays. Examples: CM-2, Cray C90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
36
Classification of Architectures – Flynn’s classification
- Multiple Instruction Single Data (MISD): not popular
- Multiple Instruction Multiple Data (MIMD): most popular – IBM SP and most other supercomputers, clusters, computational grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
37
Classification of Architectures – Based on Memory
- Shared memory – two types, UMA and NUMA
[Figure: UMA and NUMA shared-memory organizations]
- Examples: HP Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
38
Classification of Architectures – Based on Memory
- Distributed memory
- Recently, multi-cores
- Yet another classification – MPPs, NOW (Berkeley), COW, computational grids
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
39
Cache Coherence – for details, read Section 2.4.6 of the book
Interconnection Networks – for details, read Sections 2.4.2–2.4.5 of the book
40
Cache Coherence in SMPs
[Figure: four CPUs (CPU0–CPU3), each with a private cache (cache0–cache3) holding a copy of cache line ‘a’, connected to a shared main memory]
- All processes read variable ‘x’ residing in cache line ‘a’
- Each process updates ‘x’ at different points of time
- Challenge: to maintain a consistent view of the data
- Protocols: write update, write invalidate
41
Cache Coherence Protocols and Implementations
- Write update – propagate the cache line to the other processors on every write by a processor
- Write invalidate – each processor gets the updated cache line whenever it reads stale data
- Which is better??
42
Caches – False sharing
[Figure: CPU0’s cache holds A0, A2, A4, … and CPU1’s cache holds A1, A3, A5, … from the same region A0–A15 of main memory]
- Different processors update different parts of the same cache line
- Leads to ping-pong of cache lines between processors
- Remedy: modify the algorithm to change the stride
- The situation is better with update protocols than with invalidate protocols. Why?
43
Cache Coherence using Invalidate Protocols
- Three states associated with data items
  - Shared – the variable is shared by 2 caches
  - Invalid – another processor (say P0) has updated the data item
  - Dirty – state of the data item in P0
- Implementations
  - Snoopy, for bus-based architectures: memory operations are propagated over the bus and snooped
  - Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
  - Directory-based: a central directory maintains the states of cache blocks and the associated processors, implemented with presence bits
(A toy state-transition sketch follows.)
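A toy simulation (an illustration only, not a real protocol implementation) of the three states listed above under a write-invalidate policy; write-back details are deliberately simplified.

```python
# Toy model of a write-invalidate protocol for one cache line holding x.
# Per-cache states: 'shared', 'dirty', 'invalid' (as on the slide).

memory = {'x': 0}
caches = [{'state': 'invalid', 'x': None} for _ in range(4)]

def read(p):
    if caches[p]['state'] == 'invalid':
        # Miss: fetch the up-to-date value (write-back from a dirty cache
        # is omitted in this toy model) and mark the line shared.
        caches[p] = {'state': 'shared', 'x': memory['x']}
    return caches[p]['x']

def write(p, value):
    # The coherence protocol invalidates every other copy of the line.
    for q in range(len(caches)):
        if q != p:
            caches[q]['state'] = 'invalid'
    caches[p] = {'state': 'dirty', 'x': value}
    memory['x'] = value          # simplified: memory updated eagerly

print([read(p) for p in range(4)])             # all caches now 'shared'
write(0, 42)                                   # P0 'dirty', others 'invalid'
print([caches[p]['state'] for p in range(4)])
print(read(2))                                 # 42: P2 re-fetches after invalidation
```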
44
Interconnection Networks
- An interconnection network is defined by switches, links and interfaces
  - Switches – provide mapping between input and output ports, buffering, routing, etc.
  - Interfaces – connect nodes to the network
- Network topologies
  - Static – point-to-point communication links among processing nodes
  - Dynamic – communication links are formed dynamically by switches
45
Interconnection Networks
- Static
  - Bus – SGI Challenge
  - Completely connected
  - Star
  - Linear array, ring (1-D torus)
  - Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  - k-d mesh: d dimensions with k nodes in each dimension
  - Hypercubes – a mesh with log p dimensions and 2 nodes per dimension – e.g., many MIMD machines
  - Trees – our campus network
- Dynamic – communication links are formed dynamically by switches
  - Crossbar – Cray X series – non-blocking network
  - Multistage – SP2 – blocking network
For more details, and an evaluation of topologies, refer to the book.
46
Evaluating Interconnection Topologies
- Diameter – maximum distance between any two processing nodes
  - Fully connected: 1
  - Star: 2
  - Ring: p/2
  - Hypercube: log p
- Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
  - Linear array: 1
  - Ring: 2
  - 2-D mesh: 2
  - 2-D mesh with wraparound: 4
  - d-dimensional hypercube: d
47
Evaluating Interconnection Topologies
- Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
  - Ring: 2
  - P-node 2-D mesh: sqrt(P)
  - Tree: 1
  - Star: 1
  - Completely connected: P^2/4
  - Hypercube: P/2
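The closed-form values from the last two slides can be tabulated directly; a small sketch for a p-node network using the formulas given above.

```python
import math

def metrics(p):
    # Diameter and bisection width for a few static topologies with p nodes,
    # using the formulas listed on the slides.
    return {
        'ring':            {'diameter': p // 2, 'bisection_width': 2},
        'star':            {'diameter': 2,      'bisection_width': 1},
        'hypercube':       {'diameter': int(math.log2(p)),
                            'bisection_width': p // 2},
        'fully_connected': {'diameter': 1,
                            'bisection_width': p * p // 4},
        '2d_mesh':         {'bisection_width': int(math.sqrt(p))},
    }

for name, m in metrics(64).items():
    print(name, m)
```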
48
Evaluating Interconnection Topologies
- Channel width – number of bits that can be simultaneously communicated over a link, i.e., the number of physical wires between 2 nodes
- Channel rate – performance of a single physical wire
- Channel bandwidth – channel rate times channel width
- Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e., bisection width times channel bandwidth
49
END