1 Parallel Programming Sathish S. Vadhiyar

2 Motivations of Parallel Computing
Parallel machine: a computer system with more than one processor
Motivations:
Faster execution time, exploiting non-dependencies between regions of code
Presents a level of modularity
Resource constraints, e.g. large databases
Certain classes of algorithms lend themselves to parallelism
Aggregate bandwidth to memory/disk; increase in data throughput
Clock rate improvement in the past decade – 40%
Memory access time improvement in the past decade – 10%

3 Parallel Programming and Challenges
 Recall the advantages and motivation of parallelism
 But parallel programs incur overheads not seen in sequential programs:
Communication delay
Idling
Synchronization

4 Challenges
(Figure: timeline of two processes P0 and P1 showing computation, communication, synchronization, and idle time.)

5 How do we evaluate a parallel program?
 Execution time, T(p, n)
 Speedup, S
S(p, n) = T(1, n) / T(p, n)
Usually, S(p, n) < p
Sometimes S(p, n) > p (superlinear speedup)
 Efficiency, E
E(p, n) = S(p, n) / p
Usually, E(p, n) < 1
Sometimes greater than 1
 Scalability – limitations in parallel computing, relation to n and p
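A quick worked instance of these definitions (the run times below are invented for illustration): suppose a program takes T(1, n) = 100 s on one processor and T(8, n) = 20 s on 8 processors.

```latex
S(8, n) = \frac{T(1,n)}{T(8,n)} = \frac{100}{20} = 5,
\qquad
E(8, n) = \frac{S(8,n)}{8} = \frac{5}{8} = 0.625
```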

6 Speedups and efficiency
(Figures: speedup S versus number of processors p, ideal (linear) versus practical curves; efficiency E versus p.)

7 Limitations on speedup – Amdahl’s law
 Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
 Overall speedup is expressed in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhancement.
 Places a limit on the speedup due to parallelism:
Speedup = 1 / (f_s + f_p/P)
where f_s is the serial fraction, f_p the parallelizable fraction (f_s + f_p = 1), and P the number of processors.
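For instance (illustrative numbers), with a parallel fraction f_p = 0.9 (so f_s = 0.1) on P = 8 processors:

```latex
S = \frac{1}{f_s + f_p/P} = \frac{1}{0.1 + 0.9/8} = \frac{1}{0.2125} \approx 4.7,
\qquad
\lim_{P \to \infty} S = \frac{1}{f_s} = 10
```

Even with 90% of the work parallelizable, 8 processors give less than a 5x speedup, and no number of processors can exceed 10x.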

8 Amdahl’s law Illustration
S = 1 / (s + (1-s)/p)
Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm

9 Amdahl’s law analysis
f       P=1    P=4    P=8    P=16   P=32
1.00    1.0    4.00   8.00   16.00  32.00
0.99    1.0    3.88   7.48   13.91  24.43
0.98    1.0    3.77   7.02   12.31  19.75
0.96    1.0    3.57   6.25   10.00  14.29
For the same parallel fraction f, the speedup falls further and further behind the processor count as P grows. Thus Amdahl's law is a bit depressing for parallel programming. In practice, the parallel portion of the work has to be large enough to match a given number of processors.
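The table can be regenerated directly from Amdahl's formula; a minimal sketch in C (the fractions and processor counts are the ones on the slide):

```c
#include <stdio.h>

/* Amdahl's law: speedup for parallel fraction f on p processors. */
static double amdahl_speedup(double f, int p)
{
    return 1.0 / ((1.0 - f) + f / (double)p);
}

int main(void)
{
    const double fracs[] = { 1.00, 0.99, 0.98, 0.96 };
    const int    procs[] = { 1, 4, 8, 16, 32 };
    const int nf = 4, np = 5;

    printf("%-8s", "f");
    for (int j = 0; j < np; j++)
        printf("P=%-8d", procs[j]);
    printf("\n");

    for (int i = 0; i < nf; i++) {
        printf("%-8.2f", fracs[i]);
        for (int j = 0; j < np; j++)
            printf("%-10.2f", amdahl_speedup(fracs[i], procs[j]));
        printf("\n");
    }
    return 0;
}
```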

10 Gustafson’s Law
 Amdahl’s law – keep the parallel work fixed
 Gustafson’s law – keep the computation time on the parallel processors fixed, and change the problem size (the fraction of parallel/sequential work) to match that computation time
 For a particular number of processors, find the problem size for which the parallel time equals the constant time
 For that problem size, find the sequential time and the corresponding speedup
 The resulting speedup is called scaled speedup; also called weak speedup
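The law is usually written as the following scaled-speedup formula (s here is the sequential fraction of the run time measured on the parallel machine; this closed form is the standard statement of the law and is not spelled out on the slide):

```latex
S_{\text{scaled}}(p) = s + p\,(1 - s) = p - s\,(p - 1)
```

Unlike Amdahl's bound, this grows roughly linearly in p as long as the problem is scaled up with the machine.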

11 Metrics (Contd..)
N       P=1    P=4    P=8    P=16   P=32
64      1.0    0.80   0.57   0.33   0.17
192     1.0    0.92   0.80   0.60   0.38
512     1.0    0.97   0.91   0.80   0.62
Table 5.1: Efficiency as a function of n and p.

12 Scalability
 Efficiency decreases with increasing P; increases with increasing N
 How effectively the parallel algorithm can use an increasing number of processors
 How the amount of computation performed must scale with P to keep E constant
 This function of computation in terms of P is called the isoefficiency function
 An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable
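As an illustration (this is the standard textbook example of adding n numbers on p processors with unit-cost operations; the derivation is mine, not on the slide): the parallel time is about n/p + 2 log p, so

```latex
T_p \approx \frac{n}{p} + 2\log p,
\qquad
E = \frac{T_1}{p\,T_p} \approx \frac{n}{n + 2p\log p}
```

Holding E constant forces n to grow as Theta(p log p), so the isoefficiency function of this algorithm is Theta(p log p). This expression also reproduces the efficiency values in Table 5.1 above.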

13 Parallel Program Models
 Single Program Multiple Data (SPMD)
 Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

14 Programming Paradigms
 Shared memory model – Threads, OpenMP, CUDA
 Message passing model – MPI
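As a concrete illustration of the shared memory model, here is a minimal OpenMP sketch in C (the dot-product example is mine, chosen because dot products come up again under orchestration; compile with -fopenmp). An MPI version of the same computation would exchange partial results explicitly instead of sharing memory.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double dot = 0.0;

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Shared memory model: all threads see a and b; the reduction clause
       handles the only write conflict (the accumulation into dot). */
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < N; i++)
        dot += a[i] * b[i];

    printf("dot = %.1f, max threads = %d\n", dot, omp_get_max_threads());
    return 0;
}
```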

15 PARALLELIZATION

16 Parallelizing a Program
Given a sequential program/algorithm, how do we go about producing a parallel version?
Four steps in program parallelization (a small example combining the four steps follows this slide):
1. Decomposition
Identifying parallel tasks with a large extent of possible concurrent activity; splitting the problem into tasks
2. Assignment
Grouping the tasks into processes with the best load balancing
3. Orchestration
Reducing synchronization and communication costs
4. Mapping
Mapping of processes to processors (if possible)
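A minimal sketch in C with MPI (the example problem – summing an array of numbers – and all names are mine) that walks through the four steps for a trivial computation:

```c
#include <stdio.h>
#include <mpi.h>

/* Toy example: sum N numbers in parallel.
   Decomposition : each element-addition is an independent task.
   Assignment    : tasks are grouped into contiguous blocks, one block per process.
   Orchestration : each process sums its block locally; a single reduction at the
                   end keeps communication and synchronization to one step.
   Mapping       : MPI ranks are mapped to processors by the MPI runtime / mpirun. */

#define N 1000000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Assignment: block of indices [lo, hi) owned by this process. */
    int chunk = (N + nprocs - 1) / nprocs;
    int lo = rank * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;

    /* Local computation over the assigned block. */
    double local = 0.0;
    for (int i = lo; i < hi; i++)
        local += (double)i;

    /* Orchestration: one collective reduction gathers the result on rank 0. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %.0f\n", total);

    MPI_Finalize();
    return 0;
}
```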

17 Steps in Creating a Parallel Program

18 Orchestration
 Goals
Structuring communication
Synchronization
 Challenges
Organizing data structures – packing
Small or large messages?
How to organize communication and synchronization?

19 Orchestration
 Maximizing data locality
Minimizing volume of data exchange
 Not communicating intermediate results – e.g. dot product
Minimizing frequency of interactions – packing
 Minimizing contention and hot spots
Do not use the same communication pattern with the other processes in all the processes
 Overlapping computations with interactions (a sketch follows this slide)
Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
Initiate communication for type 1; during the communication, perform type 2
 Replicating data or computations
Balancing the extra computation or storage cost against the gain due to less communication
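A minimal sketch in C with MPI of the overlap idea described above; the helper functions, buffer names and neighbour ranks are made up for illustration, and the computation is split into a part that needs the incoming data and a part that does not:

```c
#include <mpi.h>

/* Hypothetical stand-ins for the two phases of the computation. */
static void compute_independent_part(double *local, int n)                    /* type 2 */
{ for (int i = 0; i < n; i++) local[i] *= 2.0; }

static void compute_dependent_part(double *local, const double *halo, int n)  /* type 1 */
{ for (int i = 0; i < n; i++) local[i] += halo[i]; }

/* Exchange boundary data with neighbours while doing the work that
   does not depend on it. */
void exchange_and_compute(double *local, double *halo, int n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* 1. Initiate the communication needed by the dependent (type 1) phase. */
    MPI_Irecv(halo,  n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(local, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* 2. While the messages are in flight, do the type 2 work. */
    compute_independent_part(local, n);

    /* 3. Wait for the exchange to finish, then do the dependent work. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_dependent_part(local, halo, n);
}
```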

20 Mapping
 Which process runs on which particular processor?
Can depend on network topology, communication pattern of processes
On processor speeds in case of heterogeneous systems

21 Mapping
 Static mapping
Mapping based on data partitioning
 Applicable to dense matrix computations
 Block distribution
 Block-cyclic distribution
Graph-partitioning-based mapping
 Applicable for sparse matrix computations
Mapping based on task partitioning
(Illustration: example block and block-cyclic distributions of array elements over processes 0, 1 and 2; a sketch of the owner computation follows this slide.)
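A minimal sketch in plain C (the function names are mine) of how the owning process of a global index can be computed under block and block-cyclic distributions; n is the number of elements, p the number of processes, b the block size:

```c
#include <stdio.h>

/* Block distribution: elements are split into p contiguous chunks. */
static int block_owner(int i, int n, int p)
{
    int chunk = (n + p - 1) / p;      /* ceiling(n / p) */
    return i / chunk;
}

/* Block-cyclic distribution: blocks of size b are dealt out round-robin. */
static int block_cyclic_owner(int i, int b, int p)
{
    return (i / b) % p;
}

int main(void)
{
    int n = 9, p = 3, b = 1;          /* b = 1 gives a purely cyclic distribution */

    printf("block:        ");
    for (int i = 0; i < n; i++) printf("%d ", block_owner(i, n, p));
    printf("\nblock-cyclic: ");
    for (int i = 0; i < n; i++) printf("%d ", block_cyclic_owner(i, b, p));
    printf("\n");
    return 0;
}
```

For nine elements on three processes this prints owners 0 0 0 1 1 1 2 2 2 for the block case and 0 1 2 0 1 2 0 1 2 for the cyclic case.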

22 Based on Task Partitioning
 Based on the task dependency graph
 In general the problem is NP-complete
(Figure: a task dependency tree with node 0 at the root, nodes 0 and 4 below it, then 0, 2, 4, 6, and leaves 0 through 7.)

23 Mapping
 Dynamic mapping
A process/global memory can hold a set of tasks
Distribute some tasks to all processes
Once a process completes its tasks, it asks the coordinator process for more tasks
Referred to as self-scheduling, work-stealing (a minimal example follows this slide)
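One simple, commonly available form of self-scheduling is OpenMP's dynamic loop schedule, where an idle thread grabs the next chunk of iterations when it finishes its current one. A minimal sketch in C (the task and its cost function are invented for illustration; compile with -fopenmp):

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical task: cost varies per iteration, so a static assignment
   would load-balance poorly. */
static double do_task(int i)
{
    double x = 0.0;
    for (int k = 0; k < (i % 7 + 1) * 100000; k++)
        x += (double)k * 1e-9;
    return x;
}

int main(void)
{
    const int ntasks = 64;
    double sum = 0.0;

    /* schedule(dynamic, 1): each thread asks for the next task only when it
       finishes its current one - a simple self-scheduling scheme. */
    #pragma omp parallel for schedule(dynamic, 1) reduction(+:sum)
    for (int i = 0; i < ntasks; i++)
        sum += do_task(i);

    printf("sum = %f (max threads = %d)\n", sum, omp_get_max_threads());
    return 0;
}
```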

24 High-level Goals

25 PARALLEL ARCHITECTURE

26 Classification of Architectures – Flynn’s classification
In terms of parallelism in the instruction and data streams
 Single Instruction Single Data (SISD): serial computers
 Single Instruction Multiple Data (SIMD) – vector processors and processor arrays
Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

27 Classification of Architectures – Flynn’s classification
 Multiple Instruction Single Data (MISD): not popular
 Multiple Instruction Multiple Data (MIMD) – most popular
IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

28 Classification of Architectures – Based on Memory
 Shared memory
 2 types – UMA and NUMA
(Figures: UMA and NUMA memory organizations.)
Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

29 Classification 2: Shared Memory vs Message Passing
 Shared memory machine: the n processors share a physical address space
Communication can be done through this shared memory
 The alternative is sometimes referred to as a message passing machine or a distributed memory machine
(Figures: processors connected by an interconnect to a single main memory, versus processors each with their own memory module connected by an interconnect.)

30 Shared Memory Machines
The shared memory could itself be distributed among the processor nodes
Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time
Terms: NUMA vs UMA architecture
 Non-Uniform Memory Access
 Uniform Memory Access

31 Classification of Architectures – Based on Memory
 Distributed memory
Recently, multi-cores
Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

32 Parallel Architecture: Interconnection Networks
 An interconnection network is defined by switches, links and interfaces
Switches – provide mapping between input and output ports, buffering, routing, etc.
Interfaces – connect nodes with the network

33 Parallel Architecture: Interconnections
 Indirect interconnects: nodes are connected to the interconnection medium, not directly to each other
Shared bus, multiple bus, crossbar, MIN (multistage interconnection network)
 Direct interconnects: nodes are connected directly to each other
Topology: linear, ring, star, mesh, torus, hypercube
Routing techniques: how the route taken by the message from source to destination is decided
 Network topologies
Static – point-to-point communication links among processing nodes
Dynamic – communication links are formed dynamically by switches

34 Interconnection Networks
 Static
Bus – SGI Challenge
Completely connected
Star
Linear array, ring (1-D torus)
Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
k-d mesh: d dimensions with k nodes in each dimension
Hypercubes – a log p-dimensional mesh with 2 nodes per dimension – e.g. many MIMD machines
Trees – our campus network
 Dynamic – communication links are formed dynamically by switches
Crossbar – Cray X series – non-blocking network
Multistage – SP2 – blocking network
 For more details, and evaluation of topologies, refer to the book by Grama et al.

35 Indirect Interconnects
(Figures: shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars.)

36 Direct Interconnect Topologies
(Figures: linear array, ring, star, mesh, 2-D torus, and hypercube (binary n-cube) for n = 2 and n = 3.)

37 Evaluating Interconnection topologies
 Diameter – maximum distance between any two processing nodes
Fully connected – 1
Star – 2
Ring – p/2
Hypercube – log p
 Connectivity – multiplicity of paths between 2 nodes; minimum number of arcs to be removed from the network to break it into two disconnected networks
Linear array – 1
Ring – 2
2-D mesh – 2
2-D mesh with wraparound – 4
d-dimension hypercube – d

38 Evaluating Interconnection topologies
 Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
Ring – 2
P-node 2-D mesh – sqrt(P)
Tree – 1
Star – 1
Completely connected – P^2/4
Hypercube – P/2

39 Evaluating Interconnection topologies
 Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between 2 nodes
 Channel rate – peak rate at which data can be communicated over a single physical wire
 Channel bandwidth – channel rate times channel width
 Bisection bandwidth – maximum volume of communication between two halves of the network, i.e. bisection width times channel bandwidth
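A small worked example combining these definitions (the link width and clock rate are invented for illustration; the bisection width of the hypercube is P/2, from the previous slide): consider a 16-node hypercube whose links are 32 bits wide and clocked at 1 GHz.

```latex
\text{channel bandwidth} = 32\ \text{bits} \times 1\ \text{GHz} = 32\ \text{Gbit/s} = 4\ \text{GB/s} \\
\text{bisection width of a 16-node hypercube} = P/2 = 8 \\
\text{bisection bandwidth} = 8 \times 4\ \text{GB/s} = 32\ \text{GB/s}
```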

40 Shared Memory Architecture: Caches
(Figure: P1 and P2 both read X = 0 into their caches; P1 then writes X = 1; P2's next read of X hits in its cache and returns the stale value 0 – "Cache hit: wrong data!")

41 Cache Coherence Problem
 If each processor in a shared memory multiple processor machine has a data cache
Potential data consistency problem: the cache coherence problem
Shared variable modification, private cache
 Objective: processes shouldn’t read `stale’ data
 Solutions
Hardware: cache coherence mechanisms

42 Cache Coherence Protocols
 Write update – on every write, propagate the updated cache line to the other processors holding a copy
 Write invalidate – a write invalidates the copies in other caches; a processor fetches the updated cache line when it next reads the (now stale) data
 Which is better?

43 Invalidation Based Cache Coherence
(Figure: P1 and P2 both read X = 0 into their caches; when P1 writes X = 1, an invalidate is sent over the interconnect so P2's stale copy is discarded.)

44 Cache Coherence using invalidate protocols
 3 states associated with data items (a simplified state-machine sketch follows this slide)
Shared – the variable is held by 2 or more caches
Invalid – another processor (say P0) has updated the data item, so this copy is no longer valid
Dirty – the state of the data item in P0, which holds the only up-to-date copy
 Implementations
Snoopy
 For bus-based architectures
 A shared bus interconnect where all cache controllers monitor all bus activity
 There is only one operation on the bus at a time; cache controllers can be built to take corrective action and enforce coherence in the caches
 Memory operations are propagated over the bus and snooped
Directory-based
 Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
 A central directory maintains the states of cache blocks and the associated processors
 Implemented with presence bits
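A simplified sketch in plain C (my own naming) of the three-state invalidate protocol described above, showing how one cache line's state changes on local reads/writes and on snooped bus events; real protocols (e.g. MESI) have more states and handle the actual data transfer, which is omitted here:

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, DIRTY } line_state;

/* Events seen by one cache controller for a given line. */
typedef enum {
    LOCAL_READ,      /* this processor reads the line                   */
    LOCAL_WRITE,     /* this processor writes the line                  */
    BUS_READ,        /* another processor's read is snooped             */
    BUS_INVALIDATE   /* another processor's write/invalidate is snooped */
} event;

/* Next state of the line after an event (data movement omitted). */
static line_state next_state(line_state s, event e)
{
    switch (e) {
    case LOCAL_READ:
        return (s == INVALID) ? SHARED : s;   /* a miss fetches a shared copy   */
    case LOCAL_WRITE:
        return DIRTY;                         /* invalidate others, own the line */
    case BUS_READ:
        return (s == DIRTY) ? SHARED : s;     /* supply data, drop to shared     */
    case BUS_INVALIDATE:
        return INVALID;                       /* another writer owns the line    */
    }
    return s;
}

int main(void)
{
    const char *names[] = { "INVALID", "SHARED", "DIRTY" };

    /* Replay the scenario from the figure from P2's point of view:
       P2 reads X, then P1's write of X is snooped as an invalidate. */
    line_state s = INVALID;
    s = next_state(s, LOCAL_READ);       /* P2 reads X      -> SHARED  */
    printf("after P2's read:   %s\n", names[s]);
    s = next_state(s, BUS_INVALIDATE);   /* P1 writes X     -> INVALID */
    printf("after P1's write:  %s\n", names[s]);
    return 0;
}
```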

45  END

