
Heterogeneous and Grid Computing: Implementation issues


2 Implementation issues
• Heterogeneous parallel algorithms
  – Design and analysis
    » Good progress over last decade
  – Scientific software based on the algorithms
    » Very little done
• Why?
  – Implementation of the algorithms in a portable and self-adaptable form
    » A non-trivial and very tedious task itself
    » Poses additional challenges

3 Implementation issues (ctd)
• Accuracy of the hardware model
  – A lot of extra code needed to provide accurate values of parameters of the heterogeneous hardware
• Portability
  – Automatic tuning of the program to any executing platform
    » Possibly dynamically changing performance characteristics
  – More complex extra code needed

4 Implementation issues (ctd)
• Heterogeneous parallel algorithm
  – Designed in a generic, parameterized form
• Parameters
  – Problem parameters
    » Parameters of the problem to be solved
      • The size of the matrix to be factorized
    » Can only be provided by the user

5 Implementation issues (ctd)
• Parameters (ctd)
  – Algorithmic parameters
    » Represent different variations and configurations of the algorithm
      • The size of matrix block in local computations
      • The total number of processes executing the algorithm
      • The arrangement of the processes
    » Do not change the result of computations
    » Have an impact on the performance
    » (Optimal) values can be provided by the user, or found by the software implementing the algorithm

6 Implementation issues (ctd)
• Parameters (ctd)
  – Platform parameters
    » Parameters of the performance model of the executing heterogeneous platform
      • The speed of the processors
      • The bandwidth and latency of the links
    » Have a major impact on the performance of the program

7 Implementation issues (ctd)
• A good program implementing a heterogeneous parallel algorithm
  – Should provide accurate platform parameters
  – Should provide optimal values of (some) algorithmic parameters

8 Implementation issues (ctd)
• Program code
  – Core code
    » Implements the algorithm for each valid combination of the values of its parameters
  – Extra code
    » Solves the problems of finding accurate platform parameters and optimal algorithmic parameters
    » Non-trivial and significant in amount

9 Implementation issues (ctd)
• How programming systems can help
  – Core code
    » Automation = automatic design of heterogeneous parallel algorithms => unrealistic
  – Extra code
    » Can and should be provided by such systems
      • Application-specific code (generated by the compiler from the specification of the algorithm)
      • Non-application-specific code
        – Run-time system and libraries

10 Implementation issues (ctd)
• Programming systems for heterogeneous parallel computing
  – No miracles
    » Nothing for dummies
  – Help qualified algorithm designers implement their algorithms
    » Automate non-trivial but routine computations and communications

11 Implementation issues (ctd)
• Heterogeneous programming systems
  – Can also help in efficient implementation of traditional homogeneous parallel algorithms
    » The whole computation is partitioned into equal chunks
    » Each chunk is performed by a separate process
    » The number of processes run by each processor is proportional to the relative speed of the processor
  – The code for accurate estimation of platform parameters, optimization of algorithmic parameters and optimal mapping of processes to the processors can be provided by the programming system
  – The programmer just specifies the algorithm

12 Estimation of performance models
• Accurate estimation of performance parameters
  – A key to efficient implementation of the algorithm
    » Wrong estimation => poor performance
• Estimation of constant models of processors
  – p, S = {s_1, …, s_p}
  – s_i: relative speeds
    » Absolute speeds are also used, but only for convenience
• General approach
  – Running the same benchmark code on each processor
    » Use the execution time to calculate its relative speed
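The general approach on this slide can be sketched in plain C. This is a minimal illustration, not code from the presentation: the function name `relative_speeds` and the choice to normalize the speeds so that they sum to 1 are assumptions made here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: turn the execution times of the SAME benchmark,
 * measured on p processors, into relative speeds that sum to 1.
 * A processor that needs half the time gets twice the relative speed. */
void relative_speeds(const double *times, double *speeds, size_t p)
{
    double total = 0.0;

    assert(p > 0);
    for (size_t i = 0; i < p; i++) {
        assert(times[i] > 0.0);      /* a benchmark run always takes some time */
        speeds[i] = 1.0 / times[i];  /* speed is inversely proportional to time */
        total += speeds[i];
    }
    for (size_t i = 0; i < p; i++)
        speeds[i] /= total;          /* normalize: s_1 + ... + s_p = 1 */
}
```

For benchmark times of 1, 2 and 4 seconds this gives relative speeds of 4/7, 2/7 and 1/7.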

13 Estimation of constant performance models of heterogeneous processors
• No single universal benchmark code
  – Should be carefully designed for each application
• Efficiency
  – Not an issue if
    » The application is to be run multiple times on the same cluster with stable and reproducible performance characteristics
    » Benchmark code can be separated from the application and run once
      • Its execution time can be neglected compared to the total time of all subsequent executions of the application
  – An issue otherwise
    » Each execution is in a unique environment
    » Benchmark code should be a part of the application

14 Estimation of constant performance models of heterogeneous processors (ctd)
• Simple case: data parallel applications
  – One-process-per-processor
  – Iterative computations
    » Static data layout
    » The same task of the same size solved at each iteration
      • Processed data may be different but of the same pattern
    » All processors solve a task of the same size at any one iteration of the main loop
      • Load is balanced by different numbers of iterations
  – Benchmark: any one iteration of the main loop
    » Efficient
    » Representative

15 Estimation of constant performance models of heterogeneous processors (ctd)
• Sample application
  – Parallel matrix multiplication, C=A×B, based on one-dimensional horizontal partitioning of A and C
    » One-to-one mapping between slices and processors
    » All processors compute their slices in parallel by executing a loop, each iteration of which computes one row of C

16 Estimation of constant performance models of heterogeneous processors (ctd)
• Benchmark code
  – Multiplication of one n-element row by an n×n matrix
  – Relative speed during the execution of this benchmark and the application is the same
    » For each processor, the execution time = the execution time of one iteration × the number of iterations
  – Can be done even more efficiently
    » Multiplication of one row by a number of adjacent columns
      • Balance between accuracy (fluctuations) and efficiency
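A benchmark kernel of this kind could look as follows in plain C. This is a sketch under assumptions: the names `row_times_matrix` and `predicted_time` are invented here, the matrix is assumed to be stored row-major, and the timing of the call is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical benchmark kernel: result = row x matrix, where row has n
 * elements and matrix is a dense n x n array stored row-major. This is
 * the work of one iteration of the main loop (one row of C). */
void row_times_matrix(const double *row, const double *matrix,
                      double *result, size_t n)
{
    assert(row && matrix && result);
    for (size_t j = 0; j < n; j++)
        result[j] = 0.0;
    for (size_t k = 0; k < n; k++) {
        const double a = row[k];
        for (size_t j = 0; j < n; j++)
            result[j] += a * matrix[k * n + j];
    }
}

/* The slide's estimate: execution time of a processor =
 * execution time of one iteration x the number of iterations (rows). */
double predicted_time(double time_per_row, size_t rows)
{
    return time_per_row * (double)rows;
}
```

Timing one call to `row_times_matrix` on every processor yields the per-iteration times from which relative speeds are derived.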

17 Estimation of constant performance models of heterogeneous processors (ctd)
• Less simple case: data parallel applications
  – One-process-per-processor
  – Iterative computations
    » Static data layout
    » For each processor, the same task of the same size solved at each iteration
    » At each iteration of the main loop, different processors solve tasks of different sizes
      • Load is balanced by different task sizes
        – Given the same number of iterations
  – Benchmark code
    » Extra problem: the most representative task size

18 Estimation of constant performance models of heterogeneous processors (ctd)
• Example
  – Parallel matrix multiplication, C=A×B, based on the two-dimensional q×t Cartesian partitioning of matrices
    » One-to-one mapping between rectangles and processors
    » At each step k of the main loop of the algorithm,
      • The pivot column of r×r blocks of matrix A is broadcast horizontally
      • The pivot row of r×r blocks of matrix B is broadcast vertically
      • Each processor P_ij updates its rectangle c_ij of matrix C with the product of its parts of the pivot column and the pivot row
  – At each iteration, processor P_ij updates an h_i×w_j matrix by the product of h_i×r and r×w_j matrices
    » The same task size for all iterations
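The local update performed by one processor at each step can be sketched as below. The function name `update_rectangle` and the row-major storage of all three arrays are assumptions of this sketch; the broadcasts of the pivot column and pivot row are not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical local step of processor P_ij: C += A_part x B_part, where
 * C is h x w, A_part is the processor's h x r piece of the pivot column
 * and B_part is its r x w piece of the pivot row (all row-major). */
void update_rectangle(double *c, const double *a_part, const double *b_part,
                      size_t h, size_t r, size_t w)
{
    assert(c && a_part && b_part);
    for (size_t i = 0; i < h; i++)
        for (size_t k = 0; k < r; k++) {
            const double aik = a_part[i * r + k];
            for (size_t j = 0; j < w; j++)
                c[i * w + j] += aik * b_part[k * w + j];
        }
}
```

The amount of work per call depends only on h, r and w, which is why the task size of a given processor stays the same across iterations.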

19 Estimation of constant performance models of heterogeneous processors (ctd)
• Benchmark code
  – All processors perform the same number of iterations
    » Load is balanced by using different task sizes
    » => Any task size – not fully representative
      • Does not reproduce in full the real layout
    » For any heterogeneous platform
      • There exists a range of task sizes with approximately constant relative speeds
      • If matrix partitions fall into this range, any task size from this range can be used for the benchmark code
  – For example,

20 Estimation of constant performance models of heterogeneous processors (ctd)
• More difficult case: data parallel applications
  – One-process-per-processor
  – Iterative computations
    » Static data layout
    » For each processor, tasks of different sizes solved at different iterations
    » At each iteration of the main loop, different processors solve tasks of different sizes

21 Estimation of constant performance models of heterogeneous processors (ctd)
• Example
  – Heterogeneous parallel LU factorization
    » At each iteration, the main body of computations of each processor falls into the update
    » Task sizes are asymptotically decreasing to zero
      • => Task sizes vary in a very wide range
      • Unrealistic to assume that the relative speed will remain constant within such a wide range
      • => No task size will accurately estimate the relative speed for all processors

22 Estimation of constant performance models of heterogeneous processors (ctd)
• Benchmark code
  – Different iterations have different computation cost
  – We can focus on the most costly iterations
    » Some number of first iterations
      • Assume that task sizes for these iterations fall into the range where the relative speed is approximately constant

23 Estimation of constant performance models of heterogeneous processors (ctd)
• Summary
  The benchmark code solving the task of some fixed size, which represents one iteration of the main loop of the application, can be efficient and accurate for many data parallel applications performing iterative computations

24 Estimation of constant performance models of heterogeneous processors (ctd)
• Programming systems providing basic support for accurate estimation of relative speeds
  – mpC, HeteroMPI
    » recon statement, HMPI_Recon() function
      • Benchmark code is provided by the programmer
      • Execution of the statement
        – The code is executed by all processors in parallel
        – Execution time is used to obtain their relative speed
      • Programmer fully controls the accuracy of estimation
        – What code, where and when to run

25 Estimation of non-constant performance models of heterogeneous processors
• Non-constant models
  – Functional, band
• Straightforward estimation of functional model
  – Assume a single performance variable, x
  – Interval [a, b] divided into equal subintervals [x_i, x_{i+1}]
    » Execute the application for each task size x_i
    » Build a piecewise linear approximation of the speed function f(x)
  – At each next step
    » Bisect the subintervals and run the application in the midpoints
    » Build the next piecewise approximation
    » Stop if error criterion satisfied. Otherwise, repeat the step.
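The straightforward procedure can be sketched in plain C as follows. Several things here are assumptions rather than part of the presentation: the names `build_speed_model`, `example_speed` and `MAX_POINTS`, the fixed point budget, and the error criterion (the absolute difference between each newly measured midpoint and the value predicted by the previous piecewise linear approximation). `example_speed` is only a deterministic stand-in for a real timed run of the application.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POINTS 257  /* assumed budget: caller's arrays must hold this many */

typedef double (*speed_fn)(double task_size);

/* Stand-in for a real measurement: speed falling as the task grows. */
double example_speed(double x) { return 100.0 - x * x; }

/* Builds a piecewise linear approximation of the speed function on [a, b].
 * xs/ys receive the breakpoints, sorted by x; the return value is their
 * count. Stops when every new midpoint is within abs_tol of the previous
 * approximation, or when the point budget would be exceeded. */
size_t build_speed_model(speed_fn measure, double a, double b,
                         size_t intervals, double abs_tol,
                         double *xs, double *ys)
{
    size_t n = intervals + 1;

    assert(a < b && intervals >= 1 && n <= MAX_POINTS);
    for (size_t i = 0; i < n; i++) {
        xs[i] = a + (b - a) * (double)i / (double)intervals;
        ys[i] = measure(xs[i]);
    }
    for (;;) {
        const size_t m = 2 * n - 1;  /* point count after bisecting */
        int converged = 1;

        if (m > MAX_POINTS)
            break;
        /* Move existing points to even slots, back to front. */
        for (size_t i = n; i-- > 0; ) {
            xs[2 * i] = xs[i];
            ys[2 * i] = ys[i];
        }
        /* Measure each midpoint and compare with the old approximation. */
        for (size_t i = 1; i < m; i += 2) {
            const double predicted = 0.5 * (ys[i - 1] + ys[i + 1]);
            double err;

            xs[i] = 0.5 * (xs[i - 1] + xs[i + 1]);
            ys[i] = measure(xs[i]);
            err = ys[i] - predicted;
            if (err < 0.0)
                err = -err;
            if (err > abs_tol)
                converged = 0;
        }
        n = m;
        if (converged)
            break;
    }
    return n;
}
```

Every refinement doubles the number of subintervals, which illustrates why this procedure can become expensive when each point is a full run of the application.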

26 Estimation of non-constant performance models of heterogeneous processors (ctd)
• The straightforward estimation procedure
  – Can be very expensive
    » A large number of points may be needed for convergence
• Minimization of the cost
  – Open problem
  – One approach so far (based on the use of speed band)

27 Estimation of non-constant performance models of heterogeneous processors (ctd)
• Obtaining a cut

28 Estimation of non-constant performance models of heterogeneous processors (ctd)
• Two types of speed bands

29 Estimation of non-constant performance models of heterogeneous processors (ctd)
• Initial approximation

30 Estimation of non-constant performance models of heterogeneous processors (ctd)
• Approximation of the increasing section

31 Estimation of non-constant performance models of heterogeneous processors (ctd)
• Approximation of the non-increasing section

32 Estimation of non-constant performance models of heterogeneous processors (ctd)
• Possible scenarios when the next experimental point falls in the area of the current trapezoidal approximation

33 Optimization of algorithmic parameters
• Algorithmic parameters
  – Have a significant impact on the performance
  – Two types
    » Not changing the volume of computation and communication
      • Example: size of matrix block in local computations
      • Main optimization approach:
        – Locally run a benchmark code for its different values
        – Can be done once (upon installation) or at runtime
      • Example: ATLAS
        – Optimizes the parameters for some performance-critical operation (matrix multiplication)
        – Design: highly parameterized code generator
        – Effect: tens of times faster than BLAS
    » Changing the volume of computation and/or communication
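The "locally run a benchmark code for its different values" approach can be illustrated with a block-size search. This is a simplified sketch, not how ATLAS itself works: `blocked_matmul`, `tune_block_size`, the candidate list and the use of `clock()` for timing are all assumptions made here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C = A x B for row-major n x n matrices, computed in bs x bs tiles.
 * The block size bs changes the memory access pattern, not the result. */
void blocked_matmul(const double *a, const double *b, double *c,
                    size_t n, size_t bs)
{
    assert(bs > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs)
                for (size_t i = ii; i < min_size(ii + bs, n); i++)
                    for (size_t k = kk; k < min_size(kk + bs, n); k++)
                        for (size_t j = jj; j < min_size(jj + bs, n); j++)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Time each candidate block size on an n x n problem and return the
 * fastest. Falls back to the first candidate if allocation fails. */
size_t tune_block_size(size_t n, const size_t *candidates, size_t count)
{
    size_t best;
    clock_t best_ticks = 0;
    double *a, *b, *c;

    assert(count > 0);
    best = candidates[0];
    a = malloc(n * n * sizeof *a);
    b = malloc(n * n * sizeof *b);
    c = malloc(n * n * sizeof *c);
    if (a && b && c) {
        for (size_t i = 0; i < n * n; i++) {
            a[i] = 1.0;
            b[i] = 2.0;
        }
        for (size_t k = 0; k < count; k++) {
            const clock_t start = clock();
            clock_t ticks;

            blocked_matmul(a, b, c, n, candidates[k]);
            ticks = clock() - start;
            if (k == 0 || ticks < best_ticks) {
                best = candidates[k];
                best_ticks = ticks;
            }
        }
    }
    free(a);
    free(b);
    free(c);
    return best;
}
```

A single timed run per candidate is noisy; a real tuner would repeat each measurement and use a problem size large enough to exceed the cache.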

34 Optimization of algorithmic parameters (ctd)
• Algorithmic parameters with a direct effect on performance
  – Example: logical shape of the processors arrangement
    » If the shape is an input parameter of the algorithm, then its self-adaptable implementation should include finding its (sub)optimal value
  – Typical: the number and the ordering of processors
    » Involving all available processors may not be optimal
      • Due to high communication cost
    » => self-adaptable implementation should find the optimal subset of processors and properly order it
  – Optimization approaches
    » Straightforward: running benchmarks
    » More efficient: based on the use of performance models of implemented algorithms

35 Optimization of algorithmic parameters (ctd)
• Model-based optimization of algorithmic parameters
  – Originally proposed in the mpC programming language
• The idea: allow the programmer to describe main performance-related features of the algorithm
  – The number of processors executing the algorithm
  – The total volume of computations performed by each of the processors during the execution of the algorithm
  – The total volume of data communicated between each pair of the processors during the execution of the algorithm
• The description
  – Parameterized by the problem and algorithmic parameters
  – Defines a performance model of the algorithm

36 Optimization of algorithmic parameters (ctd)
• The description
  – Translated into code used at runtime to estimate the execution time of the algorithm (without real execution)
    » For each combination of performance and algorithmic parameters
  – mpC: provides the timeof operator
    » The only operand is a fully specified algorithm
    » The result is the estimated execution time
    » Can be used to implement self-adaptable applications
      • Not only to different platforms but to different states of the same platform for different runs
  – HeteroMPI: HMPI_Timeof() has the same functionality

37 Optimization of algorithmic parameters (ctd)
• Example. Matrix multiplication, C=A×B, on heterogeneous processors
• Algorithm parameters: n, p, s_1, …, s_p
• Application implementing the algorithm
  – Goal: minimize the execution time, not the computation time

38 Optimization of algorithmic parameters (ctd)
• Self-adaptable application design
  – It should find the optimal subset of the processors minimizing the execution time
    » Assume, for simplicity, a homogeneous communication layer
      • The optimal subset will always include the fastest processors
• Finding p optimal processors out of q available:
  – Use the benchmark multiplying one n-element row and a dense n×n matrix to estimate s_1, …, s_q
  – Re-arrange the processors such that s_1 ≥ … ≥ s_q
  – Given t_0 = ∞, for i = 1 while i ≤ q:
    » Estimate t_i: the execution time given the optimal partitioning of the matrices over processors P_1, …, P_i
    » If t_i < t_{i-1} then i = i+1 and continue, else p = i−1 and stop (adding P_i did not reduce the time)
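The search loop can be sketched in plain C with a toy time model. The model is an assumption made for illustration only: speeds are absolute (rows per second) and already sorted in descending order, computation time under an optimal partitioning is n divided by the sum of the speeds used, and each processor beyond the first adds a fixed communication cost. The names `estimate_time` and `optimal_processor_count` are hypothetical. The function returns the last processor count whose estimate still improved on the previous one.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Toy model (assumption): time on the i fastest processors =
 * n rows / their combined speed + a fixed cost per extra processor. */
double estimate_time(const double *speeds, size_t i, size_t n,
                     double comm_per_proc)
{
    double total = 0.0;

    assert(i >= 1);
    for (size_t k = 0; k < i; k++)
        total += speeds[k];
    assert(total > 0.0);
    return (double)n / total + comm_per_proc * (double)(i - 1);
}

/* Add processors in order of decreasing speed while the estimated
 * execution time keeps dropping; stop at the first non-improvement. */
size_t optimal_processor_count(const double *speeds, size_t q, size_t n,
                               double comm_per_proc)
{
    double best = DBL_MAX;  /* plays the role of t_0 = infinity */
    size_t p = 0;

    for (size_t i = 1; i <= q; i++) {
        const double t = estimate_time(speeds, i, n, comm_per_proc);

        if (t >= best)
            break;          /* adding P_i did not help: keep p = i - 1 */
        best = t;
        p = i;
    }
    return p;
}
```

With speeds 4, 2, 1, 1 and n = 56, a communication cost of 1 per extra processor gives estimates 14, 10.33, 10, 10, so three processors are selected; with zero communication cost all four are used.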

39 Optimization of algorithmic parameters (ctd)
• Code in mpC. Performance model of the algorithm:

    algorithm AxB(int p, int n, int d[p])
    {
      coord I=p;
      node { I>=0: bench*(d[I]); };
      link (J=p) { I!=J: length(double)*(d[I]*n) [J]->[I]; };
    };

40 Optimization of algorithmic parameters (ctd)
• mpC. The rest of relevant code:

    // Run a benchmark code in parallel by all physical
    // processors to update the estimation of their speeds
    {
      repl double *row, *matrix, *result;
      // memory allocation for row, matrix, and result
      // initialization of row, matrix, and result
      ...
      recon RowxMatrix(row, matrix, result, n);
    }

41 Optimization of algorithmic parameters (ctd)
• mpC. The rest of relevant code:

    // Get the total number of physical processors
    q = MPC_Get_number_of_processors();
    // Get the speed of the physical processors
    speeds = calloc(q, sizeof(double));
    MPC_Get_processors_info(NULL, speeds);
    // Sort the speeds in descending order
    qsort(speeds+1, q-1, sizeof(double), compar);
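The slide passes a comparison function named `compar` to `qsort` without showing it. A possible definition that produces descending order is sketched below; the name `compare_speeds_descending` is an assumption, and the slide's own `compar` may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison function for qsort() that sorts doubles in DESCENDING order:
 * returns a negative value when the left element is larger, so that
 * faster processors come first. */
int compare_speeds_descending(const void *lhs, const void *rhs)
{
    const double a = *(const double *)lhs;
    const double b = *(const double *)rhs;

    assert(lhs && rhs);
    return (a < b) - (a > b);
}
```

For example, `qsort(speeds, q, sizeof(double), compare_speeds_descending)` sorts the whole array; the slide's call starts at `speeds+1` and so leaves the first element in place.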

42 Optimization of algorithmic parameters (ctd)
• mpC. The rest of relevant code:

    // Calculate the optimal number of physical processors
    [host]:
    {
      int p, *d;
      struct {int p; double t;} min;
      double t;
      d = calloc(q, sizeof(int));
      min.p = 0;
      min.t = DBL_MAX;
      for(p=1; p<=q; p++) {
        // Partition C over p involved physical processors
        Partition(p, speeds, d, n);
        // Estimate the execution time of matrix multiplication
        // on p physical processors
        t = timeof(algorithm AxB(p, n, d));
        if(t<min.t) {
          min.p = p;
          min.t = t;
        }
      }
      p = min.p;
    }
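The `Partition` routine called in this code is not shown in the presentation. One plausible way to distribute n rows over p processors in proportion to their speeds is sketched below in plain C; the name `partition_rows` and the rounding rule (floor first, then hand out leftover rows one at a time to the processor that would finish them earliest) are assumptions of this sketch.

```c
#include <assert.h>

/* Hypothetical stand-in for Partition(p, speeds, d, n): on return,
 * d[0..p-1] holds the number of rows given to each of the first p
 * processors, roughly proportional to speeds[i], with sum(d) == n. */
void partition_rows(int p, const double *speeds, int *d, int n)
{
    double total = 0.0;
    int assigned = 0;

    assert(p > 0 && n >= 0);
    for (int i = 0; i < p; i++) {
        assert(speeds[i] > 0.0);
        total += speeds[i];
    }
    /* Proportional share, rounded down. */
    for (int i = 0; i < p; i++) {
        d[i] = (int)((double)n * speeds[i] / total);
        assigned += d[i];
    }
    /* Give each leftover row to the processor that would finish
     * its enlarged share soonest. */
    while (assigned < n) {
        int best = 0;

        for (int i = 1; i < p; i++)
            if ((d[i] + 1) / speeds[i] < (d[best] + 1) / speeds[best])
                best = i;
        d[best]++;
        assigned++;
    }
}
```

With speeds 4, 2, 1, 1 the call for n = 16 yields 8, 4, 2, 2 rows, and for n = 10 it yields 6, 2, 1, 1.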

43 Implementation of homogeneous algorithms for heterogeneous platforms
• The HeHo approach
  – Multiple processes per processor
  – Problem: the optimal configuration of the application
    » The optimal subset of heterogeneous processors
    » The optimal distribution of processes over the processors
  – mpC/HeteroMPI automate
    » Accurate estimation of platform parameters
    » Optimization of algorithmic parameters
      • Including the number of parallel processes and their arrangement
    » Optimal mapping of the parallel processes to the heterogeneous processors

