Programming with MPI – A Detailed Example By: Llewellyn Yap Wednesday, EE524 LECTURE How to Build a Beowulf, Chapter 9
2 Overview of MPI MPI – a distributed address space model: each process owns its part of the data, so the problem must be decomposed Decomposition can be static or dynamic –Static: fixed once and for all –Dynamic: changing in response to the simulation When partitioning, aim to minimize communication
3 Implementing High Performance, Parallel Applications for MPI Choose algorithm with sufficient parallelism Optimize a sequential version of the algorithm Use simplest possible MPI operations –usually blocking, standard mode procedures Profiling and analysis –find what operations take the most time Attack the most time-consuming components
4 Steps to Good Parallel Implementation Understand strengths and weaknesses of sequential solutions Choose a good sequential algorithm Design a strategy for parallelization Develop a rough semi-analytic model of how the parallel algorithm should perform
5 Steps to Good Parallel Implementation (cont’d) Implement interprocessor communication with MPI Carry out measurements to verify performance Identify bottlenecks, sources of overhead, etc.; minimize their impact Iterate to improve performance if possible
6 Investigate Sorting Sorting is a multi-faceted, irregular, non-grid-based problem Used widely in database servers Issues of sorting (size of elements, form of result, storage, etc.) Two approaches will be discussed 1. A fairly restricted domain 2. More general approach
7 Sorting (Simple Approach) Assumptions Elements are positive integers Secondary storage (disk, tape) not used Auxiliary primary storage is available Input data uniformly distributed over range of integers
8 Sorting (Simple Approach) The Approach Most problems for parallel computation have already been solved for traditional sequential architectures Sequential solutions exist as libraries, system calls or language constructs – these will be used as building blocks for a parallel solution
9 Sorting (Simple Approach) The Approach (cont’d) Advantage: this approach leverages the design, debugging and optimization already performed for the sequential case. Assume that we have at our disposal an optimized, debugged, sequential function ‘isort’ that sorts arrays of integers in the memory of a single processor.
10 Sorting (Simple Approach) The Algorithm Partially pre-sort the elements so that all elements in processor p are less than all those in higher-numbered processors –Recall that on Beowulf systems, the high latency of network communication favors transmission of large messages over small ones –Assign a range of values to each processor: values between p*(INT_MAX/commsize) and (p+1)*(INT_MAX/commsize)-1, inclusive
11 Sorting (Simple Approach) The Algorithm (cont’d) –Each processor scans its own list and labels each element with its destination processor –Elements are placed in a buffer specific to that destination processor Communicate data between processors
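The labeling step can be sketched in plain C as follows. This is a minimal illustration under the slide's range rule, not the book's code: the names `dest_rank` and `count_destinations` are hypothetical, and the clamp on the last rank is an added assumption to cover keys near `INT_MAX` when `INT_MAX` is not an exact multiple of `commsize`.

```c
#include <limits.h>
#include <stddef.h>

/* Destination rank for a non-negative key, using the rule
 * "rank p owns p*(INT_MAX/commsize) .. (p+1)*(INT_MAX/commsize)-1". */
int dest_rank(int value, int commsize)
{
    int width = INT_MAX / commsize;   /* number of key values per rank */
    int dest = value / width;
    /* The few largest keys can fall past the last range; give them to the last rank. */
    return dest < commsize ? dest : commsize - 1;
}

/* Count how many local elements are bound for each rank.
 * counts must have room for commsize entries. */
void count_destinations(const int *elems, size_t n, int commsize, int *counts)
{
    for (int p = 0; p < commsize; p++)
        counts[p] = 0;
    for (size_t i = 0; i < n; i++)
        counts[dest_rank(elems[i], commsize)]++;
}
```

The per-rank counts produced here are what each processor would later announce to its partners before the bulk exchange.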
12 Sorting (Simple Approach) The Algorithm (cont’d) MPI provides communication tools –MPI_Alltoallv –Requires each processor to specify how much data is incoming from every partner, and exactly where it should go First distribute lengths with MPI_Alltoall Allocate contiguous space for all outgoing elements (temporarily) Pacreate - Initialize the returned structure
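MPI_Alltoallv takes, for both the send and the receive side, an array of per-rank counts and an array of per-rank displacements into a contiguous buffer. A minimal sketch of turning counts into displacements is shown below; the helper name `counts_to_displs` is hypothetical, and the MPI calls appear only in a comment to show where the arrays would be used.

```c
/* Exclusive prefix sum over per-rank counts.
 * displs[p] becomes the offset of rank p's block in a contiguous buffer;
 * the return value is the total number of elements, i.e. the buffer length. */
int counts_to_displs(const int *counts, int commsize, int *displs)
{
    int total = 0;
    for (int p = 0; p < commsize; p++) {
        displs[p] = total;
        total += counts[p];
    }
    return total;
}

/* Intended use inside an MPI program (assumed variable names):
 *
 *   MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);
 *   counts_to_displs(sendcounts, commsize, sdispls);
 *   nrecv = counts_to_displs(recvcounts, commsize, rdispls);
 *   MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
 *                 recvbuf, recvcounts, rdispls, MPI_INT, comm);
 */
```

The receive-side total also tells each processor how much space to allocate before the exchange.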
13 Analysis of Integer Sort Quality of a parallel implementation is assessed by measuring speedup $s(P) = T(1)/T(P)$ efficiency $\epsilon(P) = T(1)/(P \times T(P)) = s(P)/P$ overhead $\mathrm{overhead}(P) = (P \times T(P) - T(1))/T(1) = (1-\epsilon)/\epsilon$ –where T(1) = best available implementation on a single processor Overhead is useful because it is additive
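The three measures follow directly from two timings. The sketch below assumes `t1` and `tp` are wall-clock times in the same units and `p` is the processor count; the function names are illustrative only.

```c
/* s(P) = T(1) / T(P) */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* efficiency(P) = T(1) / (P * T(P)) = s(P) / P */
double efficiency(double t1, double tp, int p)
{
    return t1 / (p * tp);
}

/* overhead(P) = (P * T(P) - T(1)) / T(1) = (1 - efficiency) / efficiency */
double overhead(double t1, double tp, int p)
{
    return (p * tp - t1) / t1;
}
```

For example, with T(1) = 120 s and T(12) = 15 s, the speedup is 8, the efficiency is 2/3, and the overhead is 0.5.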
14 Sources of Overhead Communication Redundancy Extra Work Load Imbalance Waiting
15 Sources of Overhead Communication Time spent communicating in the parallel code, which has no counterpart in the sequential implementation Easy to estimate; the largest contribution to overhead in the sort example –examine the MPI_Alltoall and MPI_Alltoallv calls: in MPICH they are implemented as loops over point-to-point calls, hence $T_{\mathrm{comm}} = 2 \times P \times t_{\mathrm{latency}} + \mathrm{sizeof}(\text{local arrays})/\mathrm{bandwidth}$
16 Sources of Overhead Redundancy The same computation is performed on many processors, so P-1 processors are not carrying out useful work during it Negligible for sort1 Some O(1) operations (calling malloc to obtain temporary space) do not impact performance
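The communication estimate is easy to evaluate for a given cluster. The sketch below is a direct transcription of the formula; the parameter names and units (seconds, bytes, bytes per second) are assumptions made for illustration.

```c
/* Estimated communication time for the two all-to-all calls:
 * 2 * P latency-bound messages plus one pass of the local data over the network. */
double comm_time(int nprocs, double latency_s,
                 double local_bytes, double bandwidth_bytes_per_s)
{
    return 2.0 * nprocs * latency_s + local_bytes / bandwidth_bytes_per_s;
}
```

With purely illustrative figures of 12 processors, 100 microseconds of latency, 4 MB of local data and 10 MB/s of bandwidth, the latency term is 2.4 ms and the bandwidth term is 0.4 s, so the transfer of the data itself dominates.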
17 Sources of Overhead Extra Work Parallel computation that does not take place in a sequential implementation –e.g. for sort1 : computing processor destination for every input element
18 Sources of Overhead Load Imbalance Measures extra time spent by the slowest processor, in excess of the mean over all processors With uniformly distributed input, the number of elements per processor is approximately Gaussian with mean $N/P$ and standard deviation $(N/P) \times \sqrt{(P-1)/N}$ $\mathrm{imbal} = (n_{\mathrm{largest}} - n_{\mathrm{mean}})/n_{\mathrm{mean}} > O(1) \times \sqrt{(P-1)/N}$
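The measured imbalance can be computed from the per-processor element counts once they are known, for instance after gathering them on one rank. The helper below is a hypothetical sketch of that definition.

```c
/* imbal = (n_largest - n_mean) / n_mean over the per-processor element counts.
 * Assumes nprocs >= 1 and at least one element in total. */
double load_imbalance(const int *counts, int nprocs)
{
    long total = 0;
    int largest = counts[0];
    for (int p = 0; p < nprocs; p++) {
        total += counts[p];
        if (counts[p] > largest)
            largest = counts[p];
    }
    double mean = (double)total / nprocs;
    return (largest - mean) / mean;
}
```

Comparing this value with $\sqrt{(P-1)/N}$ shows whether the observed imbalance is what random fluctuation alone would explain.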
19 Sources of Overhead Waiting Fine-grained imbalance even though the overall load may be balanced –e.g. frequent synchronization between short computations For sort, synchronization occurs during calls to MPI_Alltoall and MPI_Alltoallv –occurs immediately after initial decomposition; overhead negligible
20 Measurement of Integer Sort upshot (distributed with MPICH) –Graphical tool that renders a visual representation of parallel program behavior –Logs the time spent in different phases –The goal is to improve performance
21 More General Sorting Performance Improvement Faster sequential sort routines –D.E. Knuth, The Art of Computer Programming volume 3: Sorting and Searching, Addison Wesley, 1973 Relax restrictions on input data –sort more general objects May no longer use an integer key –use of compar function –choosing “fenceposts”
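Sorting general objects with a comparison function can be sketched with the standard `qsort`. The `struct record` layout below is a made-up example (an 8-byte key plus 24 bytes of payload), not the object type used in the book.

```c
#include <stdlib.h>

/* Hypothetical object type: a floating-point key with an opaque payload. */
struct record {
    double key;
    char payload[24];
};

/* compar-style function: negative, zero or positive, as qsort expects. */
int compare_records(const void *a, const void *b)
{
    const struct record *ra = a;
    const struct record *rb = b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

/* Local sequential sort of n records in place. */
void sort_records(struct record *recs, size_t n)
{
    qsort(recs, n, sizeof *recs, compare_records);
}
```

Because the ordering lives entirely in the comparison function, the same function can be reused when searching among fenceposts.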
22 More General Sorting Approach Solution to choosing fencepost: oversample Select nfence objects –larger value of nfence, better load balance –but results in more work –nfence value difficult to determine a priori MPI_Allgather and bsearch to divide data into more bins than processors
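Placing an object into a bin means finding which pair of fenceposts it falls between. The slide names `bsearch` for this; the sketch below swaps in a hand-written binary search instead, because the standard `bsearch` only reports exact matches. The names `find_bin` and `compare_ints` are hypothetical, and the fenceposts are assumed to be already sorted.

```c
#include <stddef.h>

/* Example comparison function for int keys, in the style qsort/bsearch expect. */
int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Returns the bin index in [0, nfence] for key: the number of fenceposts
 * that compare less than or equal to key. nfence fenceposts define nfence+1 bins. */
size_t find_bin(const void *key, const void *fence, size_t nfence, size_t size,
                int (*compar)(const void *, const void *))
{
    size_t lo = 0, hi = nfence;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const char *post = (const char *)fence + mid * size;
        if (compar(post, key) <= 0)
            lo = mid + 1;   /* fencepost <= key: key belongs further right */
        else
            hi = mid;       /* fencepost > key: key belongs at or left of mid */
    }
    return lo;
}
```

Each lookup costs O(log nfence) comparisons, which is part of the extra work that grows with the number of fenceposts.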
23 More General Sorting Approach (cont’d) Look at the population of each bin and assign bins to processors, achieving load balance MPI_Allreduce computes the sum over all processors of the count in each bucket Finally, MPI_Alltoallv delivers elements to the correct destinations and a call to qsort completes the local sort in each processor
24 Analysis of General Sort More complicated program, as it invokes more MPI routines Tradeoff between the cost of selecting more fenceposts and the improvement in load balance from using more samples Author chooses an intermediate case: P=12, N=1M, object size=32 bytes
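Once the global bin counts are known on every processor, each one can compute the same bin-to-processor map locally. The sketch below uses one simple heuristic, assigning each bin according to where its first element falls in the global order. It is an assumption for illustration, not necessarily the rule used in the book, and `assign_bins` is a hypothetical name.

```c
/* Assign contiguous runs of bins to ranks so that each rank receives
 * roughly total/nprocs elements. global_counts[b] is the summed population
 * of bin b; owner[b] receives the rank that will hold bin b.
 * Owners are non-decreasing in b, which preserves the global sort order. */
void assign_bins(const long *global_counts, int nbins, int nprocs, int *owner)
{
    long long total = 0;
    for (int b = 0; b < nbins; b++)
        total += global_counts[b];

    long long before = 0;   /* elements in all earlier bins */
    for (int b = 0; b < nbins; b++) {
        int p = total > 0 ? (int)(before * nprocs / total) : 0;
        owner[b] = p < nprocs ? p : nprocs - 1;
        before += global_counts[b];
    }
}
```

With more bins than processors, each bin is small relative to a processor's share, so the per-processor totals stay closer to the mean.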
25 Summary Trust no one A performance model Instrumentation and graphs Graphical tools Superlinear speed up Enough is enough