
Heterogeneous and Grid Computing — Programming Systems (presentation transcript)

1

2 Programming systems
• Programming systems
  – For parallel computing
    » Traditional systems (MPI, HPF) do not address the extra challenges of heterogeneous parallel computing
    » mpC, HeteroMPI
  – For high-performance distributed computing
    » NetSolve/GridSolve

3 mpC
• mpC
  – An extension of ANSI C for programming parallel computations on networks of heterogeneous computers
  – Supports efficient, portable and modular heterogeneous parallel programming
  – Addresses the heterogeneity of both the processors and the communication network

4 mpC (ctd)
• A parallel mpC program is a set of parallel processes interacting (that is, synchronizing their work and transferring data) by means of message passing
• The mpC programmer cannot determine how many processes make up the program and which computers execute which processes
  – This is specified by some means external to the mpC language
  – The source mpC code only determines which process of the program performs which computations

5 mpC (ctd)
• The programmer describes the algorithm
  – The number of processes executing the algorithm
  – The total volume of computation to be performed by each process
    » A formula including the parameters of the algorithm
    » The volume is measured in computation units provided by the application programmer
      • The very code that has been used to measure the speed of the processors

6 mpC (ctd)
• The programmer describes the algorithm (ctd)
  – The total volume of data transferred between each pair of processes
  – How the processes perform the computations and communications and interact
    » In terms of traditional algorithmic patterns (for, while, parallel for, etc.)
    » Expressions in the statements specify not the computations and communications themselves but rather their amount
      • Parameters of the algorithm and locally declared variables can be used

7 mpC (ctd)
• The abstract processes of the algorithm are mapped to the real parallel processes of the program
  – The mapping of the abstract processes should minimize the execution time of the program

8 mpC (ctd)
• Example (see handouts for full code):

  algorithm HeteroAlgorithm(int n, double v[n]) {
    coord I=n;
    node { I>=0: v[I]; };
  };
  …
  int [*]main(int [host]argc, char **[host]argv) {
    …
    {
      net HeteroAlgorithm(N, volumes) g;
      …
    }

9 mpC (ctd)
• The program calculates the mass of a metallic construction welded from N heterogeneous rails
  – It defines a group g consisting of N abstract processes, each calculating the mass of one of the rails
  – The calculation is performed by numerical 3D integration of the density function Density with a constant integration step
    » The volume of computation to calculate the mass of each rail is proportional to the volume of this rail
      – The i-th element of array volumes contains the volume of the i-th rail
    » The program specifies that the volume of computation performed by each abstract process of g is proportional to the volume of its rail

10 mpC (ctd)
• The library nodal function MPC_Wtime is used to measure the wall time elapsed to execute the calculations
• Mapping of abstract processes to real processes
  – Based on information about the speed at which the real processes run on the physical processors of the executing network
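
To make the timing pattern concrete, here is a minimal sketch of how MPC_Wtime might be used around a calculation. It is not from the handouts; it assumes MPC_Wtime behaves like MPI_Wtime (no arguments, returns wall-clock seconds as a double) and that the fragment sits inside a nodal function with <stdio.h> available.

  /* Hedged sketch, not handout code: timing a calculation with the nodal
     function MPC_Wtime (assumed signature: double MPC_Wtime(void)). */
  double start, finish, s = 0.0;
  int i;

  start = MPC_Wtime();
  for (i = 0; i < 1000000; i++)      /* stands in for the real calculations */
      s += (double)i * 1.0e-6;
  finish = MPC_Wtime();

  printf("elapsed wall time: %f s (s = %f)\n", finish - start, s);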

11 mpC (ctd)
• By default, the speed estimation obtained on initialization of the mpC system on the network is used
  – The estimation is obtained by running a special test program
• mpC allows the programmer to change the default estimation of processor speed at runtime, tuning it to the computations that will really be executed
  – The recon statement
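
As an illustration only (a real use appears in the N-body code on slide 30), a recon statement looks roughly like this; benchmark_kernel and its arguments are hypothetical names, not from the handouts.

  /* Hedged sketch of the recon statement: re-benchmark the processors at run time
     with the very code the application will actually execute.
     benchmark_kernel and its arguments are hypothetical names. */
  recon benchmark_kernel(test_input, &test_output, test_size);
  /* From here on, the mpC runtime's speed estimates (and hence the mapping of
     abstract processes to real processes) are based on the measured times of
     benchmark_kernel rather than on the default installation-time test. */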

12 mpC (ctd)
• An irregular problem
  – Characterized by an inherent coarse/large-grained structure
  – This structure determines a natural decomposition of the problem into a small number of subtasks
    » Of different sizes
    » Can be solved in parallel

13 mpC (ctd)
• The whole program solving the irregular problem
  – A set of parallel processes
  – Each process solves its subtask
    » As the sizes of the subtasks differ, the processes perform different volumes of computation
  – The processes interact via message passing
• Calculation of the mass of a metallic «hedgehog» is an example of an irregular problem

14 mpC (ctd)
• A regular problem
  – The most natural decomposition is a large number of small identical subtasks that can be solved in parallel
  – As the subtasks are identical, they are of the same size
• Multiplication of two n x n dense matrices is an example of a regular problem
  – Naturally decomposed into n^2 identical subtasks
    » Computation of one element of the resulting matrix
• How to efficiently solve a regular problem on a network of heterogeneous computers?

15 mpC (ctd)
• Main idea
  – Transform the regular problem into an irregular one
    » Whose structure is determined by the structure of the executing network
• The whole problem
  – Decomposed into a set of relatively large subproblems
  – Each subproblem is made of a number of small identical subtasks stuck together
  – The size of each subproblem depends on the speed of the processor solving this subproblem (a partitioning sketch follows below)
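
The following is a hedged illustration (plain C, not the lecture code, and unrelated to the Partition routine used later on slide 38) of the proportional partitioning this slide describes: n identical subtasks are distributed among p processors in proportion to their relative speeds.

  /* Hedged illustration: assign d[i] of n identical subtasks to processor i
     in proportion to its relative speed speed[i]. */
  #include <stdio.h>

  void partition(int n, int p, const double speed[], int d[]) {
      double total = 0.0;
      int i, assigned = 0;
      for (i = 0; i < p; i++)
          total += speed[i];
      for (i = 0; i < p; i++) {
          d[i] = (int)(n * speed[i] / total);   /* proportional share, rounded down */
          assigned += d[i];
      }
      for (i = 0; assigned < n; i = (i + 1) % p, assigned++)
          d[i]++;                               /* hand out the remainder one by one */
  }

  int main(void) {
      double speed[3] = {100.0, 50.0, 25.0};    /* relative speeds of 3 processors */
      int d[3];
      partition(700, 3, speed, d);
      printf("%d %d %d\n", d[0], d[1], d[2]);   /* prints: 400 200 100 */
      return 0;
  }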

16 mpC (ctd)
• The parallel program
  – A set of parallel processes
  – Each process solves one subproblem on a separate physical processor
    » The volume of computation performed by each of these processes should be proportional to its speed
  – The processes interact via message passing

17 mpC (ctd)
• Example: parallel multiplication, on a heterogeneous network, of matrix A and the transpose of matrix B, where A and B are dense square n x n matrices.

18 mpC (ctd)
• One step of the parallel multiplication of matrices A and B^T. The pivot row of blocks of matrix B (shown slashed in the slide's figure) is first broadcast to all processors. Then each processor, in parallel with the others, computes its part of the corresponding column of blocks of the resulting matrix C.
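
A hedged, serial C sketch of what one processor does at such a step. It assumes each processor owns a horizontal slice of my_rows rows of A and C (stored densely with n columns) and that b_pivot holds the r rows of B just broadcast; the function and variable names are illustrative, not from the handout program.

  /* Since C = A * B^T, C[i][j] = sum_k A[i][k] * B[j][k].  At this step the pivot
     row of blocks of B covers global rows pivot .. pivot+r-1, so the processor
     fills that column of blocks of its slice of C. */
  void axbt_step(int n, int r, int my_rows, int pivot,
                 const double *a, const double *b_pivot, double *c) {
      int i, j, k;
      for (i = 0; i < my_rows; i++)
          for (j = 0; j < r; j++) {
              double s = 0.0;
              for (k = 0; k < n; k++)
                  s += a[i * n + k] * b_pivot[j * n + k];
              c[i * n + (pivot + j)] = s;
          }
  }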

19 mpC (ctd)
• See handouts for the mpC program implementing this algorithm
  – The program first updates the estimation of the speeds of the processors with the code
    » that is executed at each step of the main loop
  – The program also detects the number of physical processors

20 mpC: inter-process communication
• The basic subset of mpC is based on a performance model of the parallel algorithm that ignores communication operations
  – It presumes that
    » the contribution of the communications to the total execution time of the algorithm is negligibly small compared to that of the computations
  – This is acceptable for
    » Computing on heterogeneous clusters
    » Message-passing algorithms that do not frequently send short messages
  – It is not acceptable for "normal" algorithms running on common heterogeneous networks of computers

21 mpC: inter-process communication (ctd)
• The compiler can optimally map parallel algorithms in which communication operations contribute substantially to the execution time only if the programmer can specify
  – The absolute volumes of computation performed by the processes
  – The volumes of data transferred between the processes

22 mpC: inter-process communication (ctd)
• Volume of communication
  – Can be naturally measured in bytes
• Volume of computation
  – What is the natural unit of measurement?
    » It must allow the compiler to accurately estimate the execution time
  – In mpC, the unit is the very code that has been most recently used to estimate the speed of the physical processors
    » Normally specified as part of the recon statement

23 mpC: N-body problem
The system of bodies consists of large groups of bodies, with different groups at a good distance from each other. The bodies move under the influence of Newtonian gravitational attraction.

24 mpC: N-body problem (ctd)
• Parallel N-body algorithm
  – There is a one-to-one mapping between groups of bodies and parallel processes of the algorithm
  – Each process
    » Holds in its memory all data characterising the bodies of its group
      • Masses, positions and velocities of the bodies
    » Is responsible for updating them

25 mpC: N-body problem (ctd)
• Parallel N-body algorithm (ctd)
  – The effect of each remote group is approximated by a single equivalent body
    » To update its group, each process requires the total mass and the center of mass of all remote groups
      • The total mass of each group of bodies is constant; it is calculated once. Each process receives from each of the other processes its calculated total mass and stores all the masses.
      • The center of mass of each group is a function of time. At each step of the simulation, each process computes its center and sends it to the other processes (a small sketch of these two computations follows below).
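
A hedged C sketch of the two per-group quantities just described. The Body structure is an assumption (the handout program has its own Body type, only its sizeof is visible on slide 29); the field names are illustrative.

  /* Assumed layout: each body carries a mass m and a position (x, y, z). */
  typedef struct { double m, x, y, z; } Body;

  /* Total mass of a group: computed once, since it never changes. */
  double total_mass(const Body *g, int n) {
      double m = 0.0;
      int i;
      for (i = 0; i < n; i++)
          m += g[i].m;
      return m;
  }

  /* Center of mass of a group: recomputed at every simulation step and sent to
     the other processes, which use it as the position of the single equivalent
     body approximating this group. */
  void center_of_mass(const Body *g, int n, double m_total, double c[3]) {
      int i;
      c[0] = c[1] = c[2] = 0.0;
      for (i = 0; i < n; i++) {
          c[0] += g[i].m * g[i].x;
          c[1] += g[i].m * g[i].y;
          c[2] += g[i].m * g[i].z;
      }
      c[0] /= m_total;  c[1] /= m_total;  c[2] /= m_total;
  }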

26 mpC: N-body problem (ctd)
• Parallel N-body algorithm (ctd)
  – At each step of the simulation the updated system of bodies is visualised
    » To do this, all groups of bodies are gathered to the process responsible for the visualisation, which is the host-process
  – In general, different groups have different sizes
    » Different processes perform different volumes of computation
    » Different volumes of data are transferred between different pairs of processes

27 mpC: N-body problem (ctd)
• Parallel N-body algorithm (ctd)
  – The point of view of each individual process: the system includes all bodies of its own group, with each remote group approximated by a single equivalent body.

28 mpC: N-body problem (ctd)
• Pseudocode of the N-body algorithm:

  Initialise groups of bodies on the host-process
  Visualize the groups of bodies
  Scatter the groups across processes
  Compute masses of the groups in parallel
  Communicate to share the masses among processes
  while(1) {
    Compute centers of mass in parallel
    Communicate the centers among processes
    Update the state of the groups in parallel
    Gather the groups to the host-process
    Visualize the groups of bodies
  }

29 mpC N-body application
• The core is the specification of the performance model of the algorithm:

  algorithm Nbody(int m, int k, int n[m]) {
    coord I=m;
    node { I>=0: bench*((n[I]/k)*(n[I]/k)); };
    link { I>0: length*(n[I]*sizeof(Body)) [I]->[0]; };
    parent [0];
  };

30 mpC N-body application (ctd)
• The most important fragments of the rest of the code:

  void [*] main(int [host]argc, char **[host]argv)
  {
    ...
    // Make the test group consist of the first TGsize
    // bodies of the very first group of the system
    OldTestGroup[] = (*(pTestGroup)Groups[0])[];
    recon Update_group(TGsize, &OldTestGroup, &TestGroup,
                       1, NULL, NULL, 0);
    {
      net Nbody(NofGroups, TGsize, NofBodies) g;
      …
    }

31 mpC: algorithmic patterns
• One more important feature of the parallel algorithm is still not reflected in the performance model
  – The order of execution of the computations and communications
• As the model says nothing about how the parallel processes interact during execution of the algorithm, the compiler assumes that
  – First, all processes execute all their computations in parallel
  – Then the processes execute all the communications in parallel
  – There is a synchronisation barrier between execution of the computations and the communications

32 mpC: algorithmic patterns (ctd)
• These assumptions are unsatisfactory in the case of
  – Data dependencies between computations performed by different processes
    » One process may need data computed by other processes in order to start its computations
    » This serialises some computations performed by different parallel processes ==> the real execution time of the algorithm will be longer
  – Overlapping of computations and communications
    » The real execution time of the algorithm will be shorter

33 mpC: algorithmic patterns (ctd)
• Thus, if the estimation is not based on the actual scenario of interaction of the parallel processes
  – It may be inaccurate, which leads to a non-optimal mapping of the algorithm to the executing network
• Example: an algorithm with fully serialised computations
  – Optimal mapping:
    » All the processes are assigned to the fastest physical processor
  – Mapping based on the above assumptions:
    » Involves all available physical processors

34 mpC: algorithmic patterns (ctd)
• mpC addresses the problem
  – The programmer can specify the scenario of interaction of the parallel processes during execution of the parallel algorithm
  – That specification is part of the network type definition
    » The scheme declaration

35 mpC: algorithmic patterns (ctd)
• Example 1: the N-body algorithm

  algorithm Nbody(int m, int k, int n[m]) {
    coord I=m;
    node { I>=0: bench*((n[I]/k)*(n[I]/k)); };
    link { I>0: length*(n[I]*sizeof(Body)) [I]->[0]; };
    parent [0];
    scheme {
      int i;
      par (i=0; i<m; i++) 100%[i];
      par (i=1; i<m; i++) 100%[i]->[0];
    };
  };

36 mpC: algorithmic patterns (ctd)
• Example 2: matrix multiplication

  algorithm ParallelAxBT(int p, int n, int r, int d[p]) {
    coord I=p;
    node { I>=0: bench*((d[I]*n)/(r*r)); };
    link (J=p) { I!=J: length*(d[I]*n*sizeof(double)) [J]->[I]; };
    parent [0];

37 mpC: algorithmic patterns (ctd)
• Example 2: matrix multiplication (ctd)

    scheme {
      int i, j, PivotProc=0, PivotRow=0;
      for(i=0; i<n/r; i++, PivotRow+=r) {
        if(PivotRow>=d[PivotProc]) {
          PivotProc++;
          PivotRow=0;
        }
        for(j=0; j<p; j++)
          if(j!=PivotProc)
            (100.*r/d[PivotProc])%[PivotProc]->[j];
        par(j=0; j<p; j++)
          (100.*r/n)%[j];
      }
    };

38 mpC: the timeof operator
• A further modification of the matrix multiplication program:

  [host]:
  {
    int m;
    struct {int p; double t;} min;
    double t;
    min.p = 0;
    min.t = DBL_MAX;
    for(m=1; m<=p; m++) {
      Partition(m, speeds, d, n, r);
      t = timeof(net ParallelAxBT(m, n, r, d) w);
      if(t<min.t) {
        min.p = m;
        min.t = t;
      }
    }
    p = min.p;
  }

39 mpC: the timeof operator (ctd)
• The operator timeof estimates the execution time of the parallel algorithm without really executing it
  – Its only operand specifies a fully specified network type
    » The values of all parameters of the network type must be specified
  – The operator does not create an mpC network of this type
  – Instead, it calculates the time of execution of the corresponding parallel algorithm on the executing network
    » Based on
      • the provided performance model of the algorithm
      • the most recent performance characteristics of the physical processors and communication links

40 mpC: mapping
• The dispatcher maps abstract processes of the mpC network to the processes of the parallel program
  – At runtime
  – Trying to minimize the execution time
• The mapping is based on
  » The model of the executing network of computers
  » A map of the processes of the parallel program
    • The total number of processes running on each computer
    • The number of free processes

41 mpC: mapping (ctd)
• The mapping is based on (ctd)
  » The performance model of the parallel algorithm represented by this mpC network
    • The number of parallel processes executing the algorithm
    • The absolute volume of computations performed by each of the processes
    • The absolute volume of data transferred between each pair of processes
    • The scenario of interaction between the parallel processes during the algorithm execution

42 mpC: mapping (ctd)
• Two main features:
  – Estimation of each particular mapping
    » Based on
      • Formulas for
        – Each computation unit in the scheme declaration
        – Each communication unit in the scheme declaration
      • Rules for each sequential and parallel algorithmic pattern
        – for, if, par, etc. (a small illustration follows below)
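
A hedged illustration (plain C, not mpC runtime code) of the kind of rules this slide refers to: a sequential pattern such as for accumulates the estimated times of its units, while a parallel pattern such as par takes the maximum over its branches, since they are assumed to run concurrently.

  #include <stdio.h>

  double estimate_seq(const double t[], int n) {   /* for-like pattern: sum of unit times */
      double total = 0.0;
      int i;
      for (i = 0; i < n; i++)
          total += t[i];
      return total;
  }

  double estimate_par(const double t[], int n) {   /* par-like pattern: max over branches */
      double worst = 0.0;
      int i;
      for (i = 0; i < n; i++)
          if (t[i] > worst)
              worst = t[i];
      return worst;
  }

  int main(void) {
      double unit_times[3] = {0.5, 2.0, 1.0};      /* estimated times of three units */
      printf("sequential: %.1f s, parallel: %.1f s\n",
             estimate_seq(unit_times, 3), estimate_par(unit_times, 3));
      return 0;                                    /* prints: sequential: 3.5 s, parallel: 2.0 s */
  }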

43 HeteroMPI
• HeteroMPI
  – An extension of MPI
  – The programmer can describe the performance model of the implemented algorithm
    » In a small model definition language shared with mpC
  – Given this description
    » HeteroMPI tries to create a group of processes executing the algorithm faster than any other group

44 HeteroMPI (ctd)
• The standard MPI approach to group creation
  – Acceptable in homogeneous environments
    » If there is one process per processor
    » Any group will execute the algorithm with the same speed
  – Not acceptable
    » In heterogeneous environments
    » If there is more than one process per processor
• In HeteroMPI
  – The programmer can describe the algorithm
  – The description is translated into a set of functions
    » Making up an algorithm-specific part of the HeteroMPI run-time system

45 HeteroMPI (ctd)
• A new operation to create a group of processes:

  HMPI_Group_create(
      HMPI_Group* gid,
      const HMPI_Model* perf_model,
      const void* model_parameters)

• A collective operation
  – In the simplest case, called by all processes of HMPI_COMM_WORLD

46 HeteroMPI (ctd)
• Dynamic update of the estimation of the processor speeds can be performed by

  HMPI_Recon(
      HMPI_Benchmark_function func,
      const void* input_p,
      int num_of_parameters,
      const void* output_p)

• A collective operation
  – Called by all processes of HMPI_COMM_WORLD

47 HeteroMPI (ctd)
• Prediction of the execution time of the algorithm:

  HMPI_Timeof(
      HMPI_Model *perf_model,
      const void* model_parameters)

• A local operation
  – Can be called by any process

48 HeteroMPI (ctd)
• Another collective operation to create a group of processes:

  HMPI_Group_auto_create(
      HMPI_Group* gid,
      const HMPI_Model* perf_model,
      const void* model_parameters)

• Used if the programmer wants HeteroMPI to find the optimal number of processes

49 HeteroMPI (ctd)
• Other HMPI operations:

  HMPI_Init()
  HMPI_Finalize()
  HMPI_Group_free()
  HMPI_Group_rank()
  HMPI_Group_size()
  MPI_Comm *HMPI_Get_comm(HMPI_Group *gid)

• HMPI_Get_comm
  – Creates an MPI communicator with the group defined by gid
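
A hedged skeleton of how the calls on slides 45–49 might fit together in one client program. It is not compilable as-is: the performance model handle (Nbody_model), its parameter pointer, the benchmark function and the header name are hypothetical, and the argument lists of HMPI_Init, HMPI_Finalize and HMPI_Group_free are not shown on the slides, so they are assumptions here rather than real prototypes.

  #include <mpi.h>
  #include <hmpi.h>   /* assumed header name for HeteroMPI */

  int main(int argc, char **argv) {
      HMPI_Group gid;
      MPI_Comm *comm;

      HMPI_Init(&argc, &argv);            /* argument list assumed, by analogy with MPI_Init */

      /* Refresh the speed estimates with the code the algorithm will really run;
         benchmark_func, bench_in and bench_out are hypothetical names. */
      HMPI_Recon(benchmark_func, bench_in, 1, bench_out);

      /* Create a group of processes fitted to the performance model of the algorithm;
         Nbody_model and model_params are hypothetical names. */
      HMPI_Group_create(&gid, &Nbody_model, model_params);

      comm = HMPI_Get_comm(&gid);         /* plain MPI communicator for the new group */
      if (comm != NULL) {
          /* ... ordinary MPI code of the algorithm runs over *comm ... */
      }

      HMPI_Group_free(&gid);              /* argument assumed; the slide shows only the name */
      HMPI_Finalize();
      return 0;
  }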

50 Grid Computing vs Distributed Computing
• Definitions of Grid computing are various and vague
  – A new computing model for better use of many separate computers connected by a network
  – => Grid computing targets heterogeneous networks
• What is the difference between Grid-based heterogeneous platforms and traditional distributed heterogeneous platforms?
  – A single login to a group of resources is the core
  – Grid operating environment – services built on top of this
    » Different models of the GOE are supported by different Grid middleware (Globus, Unicore)

51 GridRPC
• High-performance Grid programming systems are based on GridRPC
  – RPC – Remote Procedure Call
    » Task, input data, output data, remote computer
  – GridRPC
    » Task, input data, output data
    » The remote computer is picked by the system

52 NetSolve
• NetSolve
  – A programming system for high-performance distributed computing on global networks
    » Based on the GridRPC mechanism
  – Some components of the application are only available on remote computers
• A NetSolve application
  – The user writes a client program
    » Any program (in C, Fortran, etc.) with calls to the NetSolve client interface
    » Each call specifies
      • The remote task
      • The location of the input data on the user's computer
      • The location of the output data (on the user's computer)

53 NetSolve (ctd)
• Execution of the NetSolve application
  – A NetSolve call results in
    » A task to be executed on a remote computer
    » The NetSolve programming system
      • Selects the remote computer
      • Transfers the input data to the remote computer
      • Delivers the output data to the user's computer
  – The mapping of the remote tasks to computers
    » The core operation having an impact on the performance of the application

54 NetSolve (ctd)
• [Figure: the NetSolve call path. The client calls netsl("task", in, out); via the proxy, the agent assigns the task to a server (1. Assign ("task")), the input data is uploaded to the chosen server (2. Upload (in)), and the output data is downloaded back to the client (3. Download (out)). Components shown: Client, Proxy, Agent, Server A, Server B; the labels netslInfo() and netslX() appear on the agent and server interactions.]

55 NetSolve (ctd)
• Mapping algorithm
  – Each task is scheduled separately and independently of other tasks
    » A NetSolve application is seen as a sequence of independent tasks
  – Based on two performance models (PMs)
    » The PM of the heterogeneous network of computers
    » The PM of a task

56 NetSolve (ctd)
• Client interface
  – User's command-line interface
    » NS_problems, NS_probdesc
  – C program interface
    » Blocking call
      • int netsl(char *problem_name, … …)
    » Non-blocking call
      • request = netslnb(…);
      • info = netslpr(request);
      • info = netslwt(request);
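
A hedged sketch of a client-side fragment using these calls. Only the names netsl, netslnb, netslpr and netslwt come from the slide; the problem name "some_problem()" and its argument list are hypothetical, as real NetSolve problems each define their own calling sequence.

  int info, request;
  int n = 100;                    /* hypothetical problem size                 */
  double in[100], out[100];       /* input/output data on the user's computer  */

  /* Blocking call: returns once the remote task has finished and out has been delivered. */
  info = netsl("some_problem()", n, in, out);

  /* Non-blocking variant: submit, do other client-side work, then probe and wait. */
  request = netslnb("some_problem()", n, in, out);
  /* ... other client-side work ... */
  info = netslpr(request);        /* probe: has the remote task finished yet?    */
  info = netslwt(request);        /* wait: block until out is available locally  */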

57 NetSolve (ctd)
• Network of computers
  – A set of interconnected heterogeneous processors
    » Each processor is characterized by the execution time of the same serial code
      • Matrix multiplication of two 200×200 matrices
      • Obtained once on the installation of NetSolve; it does not change
    » Communication links
      • Characterized the same way as in NWS (latency + bandwidth)
      • Dynamic (periodically updated)
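
A hedged illustration (not NetSolve's installer code) of the kind of measurement implied: time one serial 200×200 matrix multiplication and record the result as the processor's performance characteristic.

  #include <stdio.h>
  #include <time.h>

  #define N 200

  static double a[N][N], b[N][N], c[N][N];

  int main(void) {
      int i, j, k;
      clock_t t0, t1;

      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
              a[i][j] = 1.0;
              b[i][j] = 2.0;
          }

      t0 = clock();
      for (i = 0; i < N; i++)            /* plain triple-loop matrix multiplication */
          for (j = 0; j < N; j++) {
              double s = 0.0;
              for (k = 0; k < N; k++)
                  s += a[i][k] * b[k][j];
              c[i][j] = s;
          }
      t1 = clock();

      printf("200x200 matmul: %.3f s (c[0][0] = %.0f)\n",
             (double)(t1 - t0) / CLOCKS_PER_SEC, c[0][0]);
      return 0;
  }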

58 NetSolve (ctd)
• The performance model of a task
  – Provided by the person installing the task on a remote computer
  – A formula to calculate the execution time of the task on the solver
    » Uses parameters of the task and the execution time of the standard computation unit (matrix multiplication)
  – The sizes of the input and output data
  – The overall PM = a distributed set of such performance models

59 NetSolve (ctd)
• The mapping algorithm (a schematic server-selection sketch follows below)
  – Performed by the agent
  – Minimizes the total execution time, T_total
    » T_total = T_computation + T_communication
    » T_computation
      • Uses the formulas of the PM of the task
    » T_communication = T_input_delivery + T_output_receive
      • Uses the characteristics of the communication link and the sizes of the input and output data
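
A hedged schematic (not the NetSolve agent's code) of choosing the server that minimizes T_total for one task. The task's computation cost is represented here simply as a number of standard units of the server's benchmark time; all names and numbers are illustrative.

  #include <stdio.h>
  #include <float.h>

  typedef struct {
      double bench_time;   /* time of the standard unit (200x200 matmul) on this server */
      double latency;      /* link latency to/from the client, in seconds               */
      double bandwidth;    /* link bandwidth to/from the client, in bytes per second    */
  } Server;

  /* Estimated T_total = T_computation + T_communication for running the task on s. */
  double t_total(const Server *s, double units, double in_bytes, double out_bytes) {
      double t_comp = units * s->bench_time;
      double t_comm = (s->latency + in_bytes  / s->bandwidth)    /* input delivery  */
                    + (s->latency + out_bytes / s->bandwidth);   /* output receive  */
      return t_comp + t_comm;
  }

  int main(void) {
      Server servers[2] = {
          {0.05, 0.01, 1.0e8},   /* fast processor, fast link */
          {0.20, 0.10, 1.0e6}    /* slow processor, slow link */
      };
      double best = DBL_MAX;
      int i, best_server = -1;
      for (i = 0; i < 2; i++) {
          double t = t_total(&servers[i], 50.0, 8.0e6, 8.0e6);
          if (t < best) { best = t; best_server = i; }
      }
      printf("chosen server: %d (estimated %.3f s)\n", best_server, best);
      return 0;
  }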

60 NetSolve (ctd)
• Link to NetSolve software and documentation
  – http://icl.cs.utk.edu/netsolve/

