
1 Research in parallel routines optimization. 16 December 2005, Universidad de Murcia. Domingo Giménez, Dpto. de Informática y Sistemas; Javier Cuenca, Dpto. de Ingeniería y Tecnología de Computadores; Universidad de Murcia. http://dis.um.es/~domingo … and more: J. González (Intel Barcelona), L.P. García (Politécnica Cartagena), A.M. Vidal (Politécnica Valencia), G. Carrillo (?), P. Alberti (U. Magallanes), P. Alonso (Politécnica Valencia), J.P. Martínez (U. Miguel Hernández), J. Dongarra (U. Tennessee), K. Roche (?)

2 Outline ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries’ hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

3 Outline ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries’ hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

4 A little history
Parallel optimization in the past: hand-optimization for each platform
● Time consuming
● Incompatible with hardware evolution
● Incompatible with changes in the system (architecture and basic libraries)
● Unsuitable for systems with variable workloads
● Misused by non-expert users

5 A little history
Initial solutions to this situation:
● Problem-specific solutions
● Polyalgorithms
● Installation tests

6 A little history
Problem-specific solutions:
● Brewer (1994): sorting algorithms, differential equations
● Frigo (1997): FFTW, the Fastest Fourier Transform in the West
● LAWRA (1997): Linear Algebra With Recursive Algorithms

7 A little history
Polyalgorithms: Brewer, FFTW, PHiPAC (1997, linear algebra)

8 A little history
Installation tests:
● ATLAS (2001): dense linear algebra, sequential
● Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm
● I-LIB (2000): some parallel linear algebra routines

9 A little history
Parallel optimization today:
● Optimization based on computational kernels
● Systematic development of routines
● Auto-optimization of routines
● Middleware for auto-optimization

10 A little history
Optimization based on computational kernels:
● Efficient kernels (BLAS) and algorithms based on these kernels
● Auto-optimization of the basic kernels (ATLAS)

11 A little history
Systematic development of routines:
● FLAME project (R. van de Geijn + E. Quintana + …): dense linear algebra, based on object-oriented design
● LAWRA: dense linear algebra, for shared-memory systems

12 A little history
Auto-optimization of routines:
● At installation time: ATLAS (Dongarra + Whaley), I-LIB (Kanada + Katagiri + Kuroda), SOLAR (Cuenca + Giménez + González), LFC (Dongarra + Roche)
● At execution time: solve a reduced problem in each processor (Kalinov + Lastovetsky); use a system evaluation tool (NWS)

13 A little history
Middleware for auto-optimization:
● LFC: middleware for dense linear algebra software in clusters
● Hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher-level libraries
● FIBER: proposal of general middleware, evolution of I-LIB
● mpC: for heterogeneous systems

14 A little history
Parallel optimization in the future?
● Skeletons and languages
● Heterogeneous and variable-load systems
● Distributed systems
● P2P computing

15 A little history
Skeletons and languages: develop skeletons for parallel algorithmic schemes together with execution-time models, and provide the users with these libraries (MALLBA, Málaga–La Laguna–Barcelona) or languages (P3L, Pisa)

16 A little history
Heterogeneous and variable-load systems:
● Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
● Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
● Variable-load systems treated as dynamically heterogeneous
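The unbalanced data distribution of the heterogeneous algorithms above can be sketched as follows. This is a minimal illustration, not the talk's code; the function name and the round-robin rounding policy are assumptions:

```python
# Minimal sketch of an unbalanced (heterogeneous) data distribution:
# each processor receives a number of rows proportional to its relative
# speed, and the rounding remainder is handed out round robin so the
# total matches exactly.
def heterogeneous_distribution(n_rows, speeds):
    total = sum(speeds)
    shares = [int(n_rows * s / total) for s in speeds]
    remainder = n_rows - sum(shares)
    for i in range(remainder):          # one extra row each, round robin
        shares[i % len(shares)] += 1
    return shares

# A processor twice as fast receives twice as many rows.
rows = heterogeneous_distribution(100, [1.0, 2.0, 2.0])   # -> [20, 40, 40]
```

A dynamic variant would simply recompute the shares whenever the observed speeds change.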

17 A little history
Distributed systems:
● Intrinsically heterogeneous and with variable load
● Very high cost of communications
● Special middleware is necessary (Globus, NWS)
● There can be servers to attend queries from clients

18 A little history
P2P computing:
● Users can join and leave dynamically
● All the users are of the same type (initially)
● It is distributed, heterogeneous and variable-load
● But special middleware is necessary

19 Outline ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries’ hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

20 Modelling Linear Algebra Routines
Necessary to predict the execution time accurately and select:
● The number of processes
● The number of processors
● Which processors
● The number of rows and columns of processes (the topology)
● The processes-to-processors assignment
● The computational block size (in linear algebra algorithms)
● The communication block size
● The algorithm (polyalgorithms)
● The routine or library (polylibraries)

21 Modelling Linear Algebra Routines
Cost of a parallel program: T = t_arith + t_comm + t_over − t_overlap, where
● t_arith: arithmetic time
● t_comm: communication time
● t_over: overhead, for synchronization, imbalance, processes creation, …
● t_overlap: overlapping of communication and computation

22 Modelling Linear Algebra Routines
Estimation of the time: computation and communication are considered divided into a number of steps, and for each part of the formula the value taken is that of the process which gives the highest value.
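The per-step estimation above can be sketched as follows; a minimal illustration (names and the toy numbers are assumptions), in which each part of the formula, arithmetic and communication, is bounded by its slowest process and the steps are summed:

```python
# Minimal sketch: for every step, take the maximum over processes of the
# arithmetic part and of the communication part separately, then sum the
# steps to obtain the modelled execution time.
def modelled_time(steps):
    """steps[s][p] = (arithmetic_time, communication_time) of process p."""
    total = 0.0
    for per_process in steps:
        total += max(t_arith for t_arith, _ in per_process)  # slowest arithmetic
        total += max(t_comm for _, t_comm in per_process)    # slowest communication
    return total

# Two steps on two processes: each part is bounded by its slowest process.
t = modelled_time([[(1.0, 0.2), (0.8, 0.5)], [(0.5, 0.1), (0.7, 0.1)]])
```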

23 Modelling Linear Algebra Routines
The time depends on the problem size (n) and the system size (p), but also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors.

24 Modelling Linear Algebra Routines
And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (t_c) and the start-up (t_s) and word-sending (t_w) times.

25 Modelling Linear Algebra Routines
LU factorisation (Golub – Van Loan): A = LU, with A, L (lower triangular) and U (upper triangular) partitioned into blocks A_ij, L_ij, U_ij (i, j = 1, 2, 3):
● Step 1: LU factorisation without blocks of A_11 = L_11 U_11
● Step 2: multiple lower triangular systems, giving U_12, U_13
● Step 3: multiple upper triangular systems, giving L_21, L_31
● Step 4: update of the south-east blocks

26 Modelling Linear Algebra Routines
If the blocks are of size 1 the operations are all with individual elements; if the block size is b, the cost is expressed with k_3 and k_2, the costs of the operations performed with BLAS 3 or BLAS 2.

27 Modelling Linear Algebra Routines
But the cost of different operations of the same level is different, and the theoretical cost could be better modelled with one parameter per basic operation. Thus the number of SYSTEM PARAMETERS increases (one for each basic routine), and …

28 Modelling Linear Algebra Routines
The value of each System Parameter can depend on the problem size (n) and on the values of the Algorithmic Parameters (b). The formula has the form T_exec = f(SP, AP, n), with SP = f(AP, n), and what we want is to obtain the values of AP with which the lowest execution time is obtained.
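The selection of AP can be sketched as an exhaustive search over candidate values; this is a minimal illustration (the toy model and the candidate lists are assumptions, not measured data):

```python
from itertools import product

# Minimal sketch: evaluate T_exec = f(SP, AP, n) for every candidate AP
# combination and keep the one with the lowest predicted time.
def select_ap(model, n, block_sizes, row_counts, col_counts):
    best_time, best_ap = None, None
    for b, r, c in product(block_sizes, row_counts, col_counts):
        t = model(n=n, b=b, r=r, c=c)
        if best_time is None or t < best_time:
            best_time, best_ap = t, (b, r, c)
    return best_time, best_ap

# Toy model favouring b = 32 and a square mesh, purely for illustration.
toy_model = lambda n, b, r, c: n / (r * c) + abs(b - 32) + abs(r - c)
t, ap = select_ap(toy_model, 1024, [16, 32, 64], [1, 2, 4], [1, 2, 4])
# -> ap == (32, 4, 4)
```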

29 Modelling Linear Algebra Routines
The values of the System Parameters could be obtained:
● With installation routines associated to each linear algebra routine
● From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
● At execution time, by testing the system conditions prior to the call to the routine

30 Modelling Linear Algebra Routines
These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters. In the latter case:
● A multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored
● When a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size
● The problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time
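The closest-stored-size lookup can be sketched as follows; the table contents and names are illustrative assumptions:

```python
# Minimal sketch: the System Parameter table is indexed by problem size
# and block size; at run time the entry whose stored problem size is
# closest to the real one is used in the time estimation.
sp_table = {   # assumed data: (problem size, block size) -> k3 (us per op)
    (512, 32): 0.0033, (512, 64): 0.0030,
    (1024, 32): 0.0032, (1024, 64): 0.0029,
}

def lookup_sp(table, n, b):
    stored_sizes = {size for size, _ in table}
    nearest = min(stored_sizes, key=lambda size: abs(size - n))
    return table[(nearest, b)]

k3 = lookup_sp(sp_table, 900, 64)   # nearest stored size is 1024 -> 0.0029
```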

31 Modelling Linear Algebra Routines
Parallel block LU factorisation: distribution of the matrix computations among the processors in the first step.

32 Modelling Linear Algebra Routines
Distribution of computations on successive steps: second step, third step.

33 Modelling Linear Algebra Routines
The cost of parallel block LU factorisation.
● Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r × c, d = max(r, c)
● System Parameters: cost of arithmetic operations k_2,getf2, k_3,trsmm, k_3,gemm; communication parameters t_s, t_w

34 Modelling Linear Algebra Routines
The cost of parallel block QR factorisation.
● Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r × c
● System Parameters: cost of arithmetic operations k_2,geqr2, k_2,larft, k_3,gemm, k_3,trmm; communication parameters t_s, t_w

35 Modelling Linear Algebra Routines
The same basic operations appear repeatedly in different higher-level routines: the information generated for one routine (say LU) could be stored and used for other routines (e.g. QR), and a common format is necessary to store the information.

36 Modelling Linear Algebra Routines
[Chart: parallel QR factorisation on an IBM SP2 with 8 processors; execution time (seconds) against problem size (512–3584) for “mean”, “model” and “optimum”.]
● “mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time which could be obtained by a non-expert user)
● “optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters
● “model” is the execution time with the values selected with the model

37 Outline ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries’ hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

38 Installation Routines
In the formulas (parallel block LU factorisation), the values of the System Parameters (k_2,getf2, k_3,trsmm, k_3,gemm, t_s, t_w) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c).

39 Installation Routines
By running, at installation time, Installation Routines associated to the linear algebra routine, and storing the information generated, to be used at run time. Each linear algebra routine must therefore be designed together with the corresponding installation routines, and the installation process must be detailed.

40 Installation Routines
k_3,gemm is estimated by performing matrix–matrix multiplications and updates of size (n/r × b) × (b × n/c). Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include these estimations, for example by splitting it into four formulas with different problem sizes.
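An installation-time estimator of this kind can be sketched as below. The function name is an assumption, and plain Python lists stand in for the installed BLAS GEMM, which a real installation routine would call instead:

```python
import random
import time

# Minimal sketch: time a matrix-matrix product of the shapes the LU
# routine actually uses, (n/r x b)(b x n/c), and divide by the flop
# count 2*(n/r)*b*(n/c) to obtain a per-operation cost.
def estimate_k3_gemm(n, b, r, c):
    m, k, q = n // r, b, n // c
    A = [[random.random() for _ in range(k)] for _ in range(m)]
    B = [[random.random() for _ in range(q)] for _ in range(k)]
    start = time.perf_counter()
    C = [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(q)]
         for i in range(m)]                      # the timed kernel
    elapsed = time.perf_counter() - start
    return elapsed / (2.0 * m * k * q)           # time per floating-point op

k3 = estimate_k3_gemm(n=128, b=16, r=2, c=2)     # small sizes for the sketch
```

Running the estimator for several problem sizes gives the per-size values the slide mentions.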

41 Installation Routines
For k_3,trsmm, two multiple triangular systems are solved: one upper triangular of size b × n/c, and another lower triangular of size n/r × b. Thus two parameters are estimated, one depending on n, b and c, and the other on n, b and r. As for the previous parameter, values can be obtained for different problem sizes.

42 Installation Routines
k_2,getf2 corresponds to a level-2 sequential LU factorisation of size b × b. At installation time each of the basic routines is executed varying the values of the parameters it depends on, with representative values (selected by the routine designer or the system manager), and the information generated is stored in a file to be used at run time, or in the code of the linear algebra routine before its installation.
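The installation driver described above can be sketched as follows; the file name, helper names and the placeholder timing are assumptions, not the talk's code:

```python
import json

# Minimal sketch: run each System Parameter estimator over representative
# parameter values chosen by the designer, and store the results in a
# file to be read at run time.
def estimate_k2_getf2(b):
    # Placeholder for timing a b x b level-2 LU factorisation.
    return 0.01 + 0.001 * b

def install_routine(representative_block_sizes, path):
    values = {str(b): estimate_k2_getf2(b) for b in representative_block_sizes}
    with open(path, "w") as f:
        json.dump({"k2_getf2": values}, f)

install_routine([16, 32, 64, 128], "sp_values.json")
```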

43 Installation Routines
t_s and t_w appear in communications of three types:
● A block of size b × b is broadcast in a row; this parameter depends on b and c
● A block of size b × b is broadcast in a column; this parameter depends on b and r
● Blocks of sizes b × n/c and n/r × b are broadcast in each of the columns and rows of processors; these parameters depend on n, b, r and c

44 Installation Routines
In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed. The routine designer also designs the installation process, and can use his experience to guide it. The basic installation process can be designed allowing the intervention of the system manager.

45 Installation Routines
Some results in different systems (physical and logical platform). Values of k_3,DTRMM (≈ k_3,DGEMM) on the different platforms (in microseconds), for n = 512, …, 4096:

System  Library   b=16    b=32    b=64    b=128
SUN1    refBLAS   0.0200  0.0200  0.0220  0.0280
SUN1    macBLAS   0.0120  0.0110  0.0110  0.0110
SUN1    ATLAS     0.0070  0.0060  0.0060  0.0060
SUN5    refBLAS   0.0120  0.0130  0.0140  0.0150
SUN5    macBLAS   0.0060  0.0050  0.0050  0.0050
SUN5    ATLAS     0.0040  0.0032  0.0025  0.0025
PIII    ATLAS     0.0038  0.0033  0.0030  0.0027
PPC     macBLAS   0.0023  0.0019  0.0018  —
R10K    macBLAS   0.0070  0.0030  0.0025  —

46 Installation Routines
Values of k_2,DGEQR2 (≈ k_2,DLARFT) on the different platforms (in microseconds), for n = 512, …, 4096 (a single value for all block sizes):

System  refBLAS  macBLAS  ATLAS
SUN1    0.0200   0.0500   0.0700
SUN5    0.0050   0.0300   0.0500
PIII    —        —        0.0150
PPC     —        0.0100   —
R10K    —        0.0250   —

47 Installation Routines
Typically the values of the communication parameters are well estimated with a ping-pong. Values of t_s / t_w (in microseconds), for n = 512, …, 4096:

System     Library   t_s / t_w
Origin 2K  Mac-MPI   20 / 0.1
IBM-SP2    Mac-MPI   75 / 0.3
cPIII      MPICH     60 / 0.7
cSUN1      MPICH     170 / 7.0
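Recovering t_s and t_w from ping-pong measurements reduces to a straight-line fit; a minimal sketch with synthetic timings (the function name and the data are assumptions, not real measurements):

```python
# Minimal sketch: given ping-pong times for several message sizes, fit
# t(size) = t_s + t_w * size by least squares to recover the start-up
# time t_s and the per-word sending time t_w.
def fit_ts_tw(sizes, times):
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
    var = sum((s - mean_s) ** 2 for s in sizes)
    t_w = cov / var
    t_s = mean_t - t_w * mean_s
    return t_s, t_w

# Synthetic one-way times generated with t_s = 60 us and t_w = 0.7 us/word.
sizes = [256, 1024, 4096, 16384]
times = [60 + 0.7 * s for s in sizes]
t_s, t_w = fit_ts_tw(sizes, times)
```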

48 Outline ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries’ hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

49 Autotuning routines
Life cycle: DESIGN (modelling the Linear Algebra Routine, LAR), INSTALLATION (obtaining information from the system), RUN-TIME (selection of parameter values and execution of the LAR).

50 Autotuning routines
DESIGN PROCESS. LAR: Linear Algebra Routine, made by the LAR designer. Example of LAR: parallel block LU factorisation.

51 Autotuning routines
Design: modelling the LAR (LAR → Modelling the LAR → MODEL).

52 Autotuning routines
Modelling the LAR: T_exec = f(SP, AP, n), where SP are the System Parameters, AP the Algorithmic Parameters and n the problem size. Made by the LAR designer, only once per LAR.

53 Autotuning routines
For the parallel block LU factorisation: SP: k_3, k_2, t_s, t_w; AP: p = r × c, b; n: problem size.

54 Autotuning routines
Design: implementation of the SP estimators (LAR → MODEL → SP-Estimators).

55 Autotuning routines
Implementation of the SP estimators:
● Estimators of arithmetic SP: use the computation kernel of the LAR, with a similar storage scheme and a similar quantity of data
● Estimators of communication SP: use the communication kernel of the LAR, with a similar kind of communication and a similar quantity of data

56 Autotuning routines
INSTALLATION PROCESS. Done only once per platform, by the system manager.

57 Autotuning routines
Installation: estimation of the static SP, using the SP estimators, the basic libraries and an installation file; the results are stored in a static-SP file.

58 Autotuning routines
Estimation of the static SP:
● Basic libraries: basic communication library (MPI, PVM); basic linear algebra library (reference BLAS, machine-specific BLAS, ATLAS)
● Installation file: the SP values are obtained using the information (n and AP values) of this file

59 Autotuning routines
Estimation of the static SP. Platform: cluster of Pentium III + Fast Ethernet; basic libraries: ATLAS and MPI.

Estimation of the static SP t_w-static (in µsec):
Message size (KB)   32     256    1024   2048
t_w-static          0.700  0.690  0.680  0.675

Estimation of the static SP k_3-static (in µsec):
Block size   16      32      64      128
k_3-static   0.0038  0.0033  0.0030  0.0027

60 Autotuning routines
RUN-TIME PROCESS (after the design and installation phases).

61 Autotuning routines
Run time: selection of the optimum AP, using the model and the stored static-SP file.

62 Autotuning routines
Run time: selection of the optimum AP, then execution of the LAR with those values.

63 Autotuning routines
Experiments. LAR: block LU factorisation. Platforms: IBM SP2, SGI Origin 2000, NoW. Basic libraries: reference BLAS, machine BLAS, ATLAS.

64 Autotuning routines
LU on IBM SP2: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters).

65 Autotuning routines
LU on Origin 2000: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters).

66 Autotuning routines
LU on NoW: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters).

67 Outline ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries’ hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

68 Modifications to libraries’ hierarchy
In the optimization of routines, individual basic operations appear repeatedly (for example in LU and in QR).

69 Modifications to libraries’ hierarchy
The information generated to install a routine could be used for another, different routine without additional experiments:
● t_s and t_w are obtained when the communication library (MPI, PVM, …) is installed
● k_3,gemm is obtained when the basic computational library (BLAS, ATLAS, …) is installed

70 Modifications to libraries’ hierarchy
To determine:
● The type of experiments necessary for the different routines in the library: are t_s and t_w obtained with ping-pong, broadcast, …? Is k_3,gemm obtained for small block sizes, …?
● The format in which the data will be stored, to facilitate its use when installing other routines

71 Modifications to libraries’ hierarchy
The method could be valid not only for one library (the one I am developing) but also for other libraries that I or somebody else will develop in the future: the type of experiments and the format in which the data will be stored must be decided by the Parallel Linear Algebra Community… and the typical hierarchy of libraries would change.

72 Modifications to libraries’ hierarchy
Typical hierarchy of parallel linear algebra libraries:
ScaLAPACK
LAPACK   PBLAS
BLAS     BLACS
         Communications

73 Modifications to libraries’ hierarchy
To include installation (self-optimisation) information in the lowest levels of the hierarchy.

74 Modifications to libraries’ hierarchy
When installing libraries in a higher level this information can be used, and new information is generated…

75 Modifications to libraries’ hierarchy
And so on in the higher levels.

76 Modifications to libraries’ hierarchy
And new libraries with autotuning capacity could be developed on top: inverse eigenvalue problem, least squares problem, PDE solver, each with its own self-optimisation information.

77 Modifications to libraries’ hierarchy
Movement of information between routines in the different levels of the hierarchy: GETRF from LAPACK (level 1), with its GETRF manager, k_3 information and model, uses GEMM from BLAS (level 0), with its GEMM manager, k_3 information and model.
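This movement of information can be sketched as follows; the class and method names are assumptions for illustration, and the k_3 values are the ones shown in the earlier installation tables:

```python
# Minimal sketch of information moving between levels: the GETRF model at
# level 1 asks the manager of the GEMM routine it calls at level 0 for
# the k3 value measured when the BLAS was installed, instead of
# repeating the experiments.
class GemmManager:                        # level 0, installed with the BLAS
    def __init__(self, k3_information):
        self.k3_information = k3_information   # block size -> us per op
    def k3(self, b):
        return self.k3_information[b]

class GetrfManager:                       # level 1, installed with LAPACK
    def __init__(self, gemm_manager):
        self.gemm_manager = gemm_manager
    def modelled_time(self, n, b):
        flops = (2.0 / 3.0) * n ** 3      # dominant GEMM work in GETRF
        return flops * self.gemm_manager.k3(b)

gemm = GemmManager({32: 0.0033, 64: 0.0030})
getrf = GetrfManager(gemm)
predicted = getrf.modelled_time(n=1024, b=64)   # microseconds
```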


81 Modifications to libraries’ hierarchy
[Diagram: architecture of a Self-Optimized Linear Algebra Routine. A SOLAR manager combines the model (T_exec = f(SP, AP, n), SP = f(AP, n)); one SP manager per System Parameter SP_1 … SP_t, each holding installation SP values (for problem sizes n_1 … n_w and AP values AP_1 … AP_z) and current SP values (for the current problem size n_c); the installation information; the current problem size; and the current system information (CPU and network availability), to select the optimum AP_0 for the call LAR(n, AP).]

82 Outline ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries’ hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

83 Polylibraries
Different basic libraries can be available:
● Reference BLAS, machine-specific BLAS, ATLAS, …
● MPICH, machine-specific MPI, PVM, …
● Reference LAPACK, machine-specific LAPACK, …
● ScaLAPACK, PLAPACK, …
The idea is to use a number of different basic libraries to develop a polylibrary.

84 Polylibraries
Typical parallel linear algebra libraries hierarchy:
ScaLAPACK
LAPACK   PBLAS
BLAS     BLACS
         MPI, PVM, …

85 Polylibraries
A possible parallel linear algebra polylibraries hierarchy: several BLAS implementations (reference BLAS, machine BLAS, ATLAS) under LAPACK and PBLAS.

86 Polylibraries
… and several communication libraries (machine MPI, LAM, MPICH, PVM) under the BLACS.

87 Polylibraries
… and several LAPACK implementations (reference LAPACK, machine LAPACK, ESSL).

88 Polylibraries
… and several ScaLAPACK implementations (reference ScaLAPACK, machine ScaLAPACK, ESSL) on top.

89 Polylibraries
The advantages of polylibraries:
● A library optimised for the system might not be available
● The characteristics of the system can change
● Which library is the best may vary according to the routines and the systems
● Even for different problem sizes or different data-access schemes the preferred library can change
● In parallel systems the file system can be shared by processors of different types

90 Architecture of a polylibrary: Library_1.

91 Architecture of a polylibrary: the installation of Library_1 generates its Library Installation File (LIF_1).

92 Architecture of a polylibrary: for routine DGEMM, LIF_1 stores the Mflops obtained for each combination of m and n in {20, 40, 80}.

93 Architecture of a polylibrary: for routine DROT, LIF_1 stores the Mflops obtained for each n in {100, 200, 400} and each leading dimension in {1, 100, 200}.

94 Architecture of a polylibrary: a second library, Library_2, is added.

95 Architecture of a polylibrary: the installation of Library_2 generates LIF_2.

96 Architecture of a polylibrary: a third library, Library_3, is added.

97 Architecture of a polylibrary: the installation of Library_3 generates LIF_3.

98 Architecture of a polylibrary: the polylibrary offers interface routines (interface routine_1, interface routine_2, …) over the installed libraries and their LIFs.

99 Architecture of a polylibrary: each interface routine decides, from the installation information, which implementation to call:

interface routine_1:
    if n < value:
        call routine_1 from Library_1
    else:
        depending on data storage:
            call routine_1 from Library_1
            or call routine_1 from Library_2
    ...
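The dispatch above can be sketched with the LIFs exposed as predicted-time models; the models, thresholds and names here are illustrative assumptions, not measured data:

```python
# Minimal sketch: the interface routine consults each library's
# installation information (a model predicting the execution time) and
# calls the implementation expected to be fastest.
def choose_library(n, storage, lifs):
    """lifs: library name -> model(n, storage) returning predicted time."""
    return min(lifs, key=lambda name: lifs[name](n, storage))

lifs = {
    "Library_1": lambda n, storage: 1e-9 * n ** 3,         # cubic term dominates
    "Library_2": lambda n, storage: 5e-7 * n ** 2 + 0.01,  # higher constant cost
}
small = choose_library(100, "row-major", lifs)    # -> "Library_1"
large = choose_library(2000, "row-major", lifs)   # -> "Library_2"
```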

100 Polylibraries
Combining polylibraries with other optimisation techniques:
● Polyalgorithms
● Algorithmic Parameters: block size, number of processors, logical topology of processors

101 Experimental Results
Routines of different levels in the hierarchy:
● Lowest level: GEMM (matrix–matrix multiplication)
● Medium level: LU and QR factorisations
● Highest level: a Lift-and-Project algorithm to solve the inverse additive eigenvalue problem, and an algorithm to solve the Toeplitz least squares problem

102 Experimental Results
The platforms:
● SGI Origin 2000
● IBM SP2
● Different networks of processors: SUN workstations + Ethernet, PCs + Fast Ethernet, PCs + Myrinet

103 Experimental Results: GEMM
Routine: GEMM (matrix–matrix multiplication). Platform: five SUN Ultra 1 / one SUN Ultra 5. Libraries: refBLAS, macBLAS, ATLAS1, ATLAS2, ATLAS5. Algorithms and parameters: Strassen (base size), by blocks (block size), direct method.

104 Experimental Results: GEMM
MATRIX-MATRIX MULTIPLICATION INTERFACE:

if processor is SUN Ultra 5:
    if problem size < 600:
        solve using ATLAS5 and Strassen method, with base size half of problem size
    else if problem size < 1000:
        solve using ATLAS5 and block method, with block size 400
    else:
        solve using ATLAS5 and Strassen method, with base size half of problem size
else if processor is SUN Ultra 1:
    if problem size < 600:
        solve using ATLAS5 and direct method
    else if problem size < 1000:
        solve using ATLAS5 and Strassen method, with base size half of problem size
    else:
        solve using ATLAS5 and direct method

105 Experimental Results: GEMM
Times (in seconds) for the lowest experimental execution, for ATLAS5 with the direct method, and for the execution with the modelled selection:

n     Lowest: time, library, method (param.)   ATLAS5 direct   Model: time, library, method (param.)
200   0.04, ATL5, direct                       0.04            0.04, ATL5, Strassen (2)
600   1.06, ATL5, direct                       1.06            1.11, ATL5, blocks (400)
1000  4.68, ATL5, Strassen (2)                 4.83            4.68, ATL5, Strassen (2)
1400  12.53, ATL2, Strassen (2)                13.50           12.58, ATL5, Strassen (2)
1600  20.03, ATL5, blocks (400)                31.02           26.57, ATL5, Strassen (2)

106 16 December 2005Universidad de Murcia106 Experimental Results: LU Routine: LU factorisation Platform: 4 PentiumIII + Myrinet Libraries: ATLAS BLAS for Pentium II BLAS for Pentium III

107 Experimental Results: LU — the cost of parallel block LU factorisation.
Tunable algorithmic parameters: block size b; 2D mesh of p processors, p = r × c, d = max(r, c).
System parameters: cost of arithmetic operations (k2,getf2, k3,trsmm, k3,gemm) and communication parameters (ts, tw).

108 Experimental Results: LU — [Table: block size b selected by the model (mod.), the experimentally best b (low.) and the theoretical b (the.), with execution times, for n = 512, 1024, 1536 with BLAS-III, BLAS-II and ATLAS; the selected block sizes almost always coincide (b = 32, with one case of b = 64).]

109 Experimental Results: L&P
Routine: Lift-and-Project method for the Inverse Additive Eigenvalue Problem
Platform: dual Pentium III
Library combinations:
● La_Re+B_In_Th: reference LAPACK and the installed BLAS, which uses threads
● La_Re+B_II_Th: reference LAPACK and a freely available BLAS for Pentium II, using threads
● La_In_Th+B_In_Th: LAPACK and BLAS installed for the use of threads
● La_Re+B_In: reference LAPACK and the installed BLAS
● La_Re+B_II: reference LAPACK and a freely available BLAS for Pentium II
● La_Re+B_III: reference LAPACK and a freely available BLAS for Pentium III
● La_In+B_In: LAPACK and BLAS installed in the system and supposedly optimised for the machine

110 Experimental Results: L&P — the theoretical model of the sequential algorithm cost uses the system parameters: ksyev (LAPACK); k3,gemm and k3,diaggemm (BLAS-3); k1,dot, k1,scal and k1,axpy (BLAS-1).

111 Experimental Results: L&P (chart)

112 Experimental Results: L&P (chart)

113 Outline: ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries' hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

114 Algorithmic schemes — the aim is to study ALGORITHMIC SCHEMES, not individual routines. The study could be useful to:
● design libraries to solve problems in different fields: Divide and Conquer, Dynamic Programming, Branch and Bound (La Laguna);
● develop SKELETONS which could be used in parallel programming languages: Skil, Skipper, CARAML, P3L, …

115 Dynamic Programming
● There are different parallel Dynamic Programming schemes.
● The simple scheme of the "coins problem" is used: given a quantity C and n coin types of values v = (v1, v2, …, vn), with quantities q = (q1, q2, …, qn) of each type, minimise the number of coins used to give C.
● The granularity of the computation is varied in order to study the scheme, not the problem.

116 Dynamic Programming — sequential scheme:
for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
  complete the table row with the recurrence formula
endfor
(The table has one row per decision i = 1 … n and one column per problem size j = 1 … N.)
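A minimal sequential version of the coins scheme makes the table-filling concrete. This is a hedged sketch, not the authors' code: bounded quantities are handled with an explicit count loop, each outer iteration plays the role of one row of the table, and the chosen coins themselves are not reconstructed.

```c
#include <limits.h>

/* Sketch of the sequential coins-problem scheme: after processing the
   first i coin types, best[j] holds the minimum number of coins giving
   value j (INT_MAX if unreachable).  j runs downwards so that each
   lookup reads the previous row, keeping coin i bounded by q[i]. */
#define MAXC 1024

int min_coins(int C, int n, const int v[], const int q[]) {
    static int best[MAXC + 1];
    best[0] = 0;
    for (int j = 1; j <= C; j++) best[j] = INT_MAX;
    for (int i = 0; i < n; i++) {           /* one row per decision  */
        for (int j = C; j >= 0; j--) {      /* one entry per size    */
            for (int k = 1; k <= q[i] && k * v[i] <= j; k++) {
                int rest = best[j - k * v[i]];
                if (rest != INT_MAX && rest + k < best[j])
                    best[j] = rest + k;
            }
        }
    }
    return best[C];
}
```

With v = (1, 2, 5) and ample quantities, giving C = 11 takes three coins (5 + 5 + 1).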

117 Dynamic Programming — parallel scheme:
for i = 1 to number_of_decisions
  in parallel: for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
endfor
(The columns of each row are computed in parallel by processes P0 … PK.)

118 Dynamic Programming — message-passing scheme:
in each processor Pj:
  for i = 1 to number_of_decisions
    communication step
    obtain the optimum solution with i decisions and the problem sizes assigned to Pj
  endfor
(Each process holds a block of columns of the table.)

119 Dynamic Programming — theoretical model: the sequential cost, the computational parallel cost (for large qi) and the communication cost of one step are modelled. The only algorithmic parameter (AP) is the number of processes p; the system parameters (SPs) are tc, ts and tw.
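The slide's formulas were shown graphically and are not in this transcript; as an illustration of how the single AP p would be chosen from the installed SPs, assume the simple model t(p) = nC·tc/p + n·(p·ts + (C/p)·tw) — computation shared among p processes plus one communication step per decision, with a start-up term growing with p. The model form is an assumption, not the authors' formula.

```c
/* Hedged sketch of autotuning the only AP, p: evaluate an assumed
   time model for each candidate p and keep the cheapest. */
int select_p(int n, int C, double tc, double ts, double tw, int max_p) {
    int best_p = 1;
    double best_t = 1e300;
    for (int p = 1; p <= max_p; p++) {
        double t = (double)n * C * tc / p          /* computation   */
                 + n * (p * ts + ((double)C / p) * tw); /* communication */
        if (t < best_t) { best_t = t; best_p = p; }
    }
    return best_p;
}
```

With cheap start-ups the model chooses all available processors; with expensive start-ups it falls back to fewer, which is exactly the behaviour the selection methods on the following slides try to capture.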

120 Dynamic Programming — how to estimate the SPs:
● arithmetic SPs: by solving a small problem;
● communication SPs:
● using a ping-pong (CP1)
● solving a small problem, varying the number of processors (CP2)
● solving problems of selected sizes on systems of selected sizes (CP3)
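For CP1, ts and tw follow from fitting the linear communication model t(n) = ts + n·tw to ping-pong timings. A sketch with two measurement points; in a real installation the times would come from timing an MPI send/receive ping-pong, whereas here they are plain inputs:

```c
/* Recover ts (start-up) and tw (per-word time) from two measured
   point-to-point times t1, t2 for message sizes n1, n2 words,
   assuming the linear model t(n) = ts + n * tw. */
void fit_ts_tw(double n1, double t1, double n2, double t2,
               double *ts, double *tw) {
    *tw = (t2 - t1) / (n2 - n1);
    *ts = t1 - n1 * (*tw);
}
```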

121 Dynamic Programming — experimental results.
Systems:
● SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
● PenFE: seven Pentium III + Fast Ethernet
Varying:
● the problem size, C = 10000, 50000, 100000, 500000, with large values of qi
● the granularity of the computation (the cost of a computational step)

122 Dynamic Programming — experimental results:
● CP1: ping-pong (point-to-point communication). Does not reflect the characteristics of the system.
● CP2: executions with the smallest problem (C = 10000), varying the number of processors. Reflects the characteristics of the system, but the time also changes with C. Larger installation time (6 and 9 seconds).
● CP3: executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), with linear interpolation for other sizes. Larger installation time (76 and 35 seconds).

123 Dynamic Programming — parameter selection. [Table: number of processors selected by LT (lowest time), CP1, CP2 and CP3, for granularities 10, 50 and 100 and C = 10000 to 500000, on SUNEt and PenFE.]

124 Dynamic Programming — quotient between the execution time with the parameter selected by each selection method and the lowest execution time, in SUNEt (chart).

125 Dynamic Programming — quotient between the execution time with the parameter selected by each selection method and the lowest execution time, in PenFE (chart).

126 Dynamic Programming — three types of users are considered:
● GU (greedy user): uses all the available processors.
● CU (conservative user): uses half of the available processors.
● EU (expert user): uses a number of processors depending on the granularity: 1 for low granularity, half of the available processors for middle granularity, all the processors for high granularity.

127 Dynamic Programming — quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt (chart).

128 Dynamic Programming — quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE (chart).

129 Outline: ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries' hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

130 Heterogeneous algorithms — new algorithms with unbalanced distribution of data are necessary:
● different SPs for different processors;
● the APs include a vector of selected processors and a vector of block sizes.
(Example: Gauss elimination with a cyclic distribution of blocks of sizes b0, b1, b2 over the processors.)

131 Heterogeneous algorithms — parameter selection:
● RI-THE: obtains p and b from the formula (homogeneous distribution)
● RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution)
● RI-HET: obtains p and b through a reduced number of executions, and each …

132 Heterogeneous algorithms — quotient with respect to the lowest experimental execution time (charts for n = 500 to 3000, comparing RI-THEO, RI-HOMO and RI-HETE), on:
● a homogeneous system: five SUN Ultra 1;
● a hybrid system: five SUN Ultra 1 + one SUN Ultra 5;
● a heterogeneous system: two SUN Ultra 1 (one manages the file system) + one SUN Ultra 5.

133 Parameter selection at running time — the scheme: at DESIGN time the LAR (linear algebra routine) is modelled, giving the MODEL; at INSTALLATION time the SP estimators are implemented and the static SPs are estimated from the basic libraries and the installation file, producing the static-SP file; the parameters are then selected at run time.

134 Parameter selection at running time — the scheme is extended with a call to the NWS (Network Weather Service), which supplies dynamic information about the system at run time.

135 Parameter selection at running time — the NWS is called and reports: the fraction of available CPU (fCPU), and the current word-sending time (tw_current) for specific n and AP values (n0, AP0). From these, the fraction of available network is calculated.

136 Parameter selection at running time — load situations of the eight-node platform:
● Situation A: all nodes with 100% CPU available; tw_current = 0.7 µs.
● Situation B: nodes 1-4 at 80% CPU (tw_current = 0.8 µs); nodes 5-8 at 100% (0.7 µs).
● Situation C: nodes 1-4 at 60% CPU (1.8 µs); nodes 5-8 at 100% (0.7 µs).
● Situation D: nodes 1-4 at 60% CPU (1.8 µs); nodes 5-6 at 100% (0.7 µs); nodes 7-8 at 80% (0.8 µs).
● Situation E: nodes 1-4 at 60% CPU (1.8 µs); nodes 5-6 at 100% (0.7 µs); nodes 7-8 at 50% (4.0 µs).

137 Parameter selection at running time (scheme repeated).

138 Parameter selection at running time — the scheme is extended with a dynamic adjustment of the SPs, producing the current-SP values.

139 Parameter selection at running time — the values of the SPs are tuned according to the current situation.
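The tuning formulas themselves were shown graphically and are not in this transcript. A plausible reconstruction, stated here as an assumption and consistent with the NWS quantities of slide 135, scales the static arithmetic cost by the available CPU fraction and replaces the static word-sending time by the measured current one:

```c
/* Hedged sketch of the dynamic adjustment of SPs.  Assumed rules (not
   the slide's actual formulas): the current arithmetic cost grows as
   the available CPU fraction shrinks, and the current word-sending
   time is the static one divided by the fraction of available
   network, itself computed as tw_static / tw_current. */
typedef struct { double tc, ts, tw; } sp_t;

sp_t adjust_sps(sp_t stat, double f_cpu, double tw_current) {
    sp_t cur;
    double f_net = stat.tw / tw_current; /* fraction of available network */
    cur.tc = stat.tc / f_cpu;
    cur.ts = stat.ts;                    /* start-up kept static here */
    cur.tw = stat.tw / f_net;            /* equals the measured tw_current */
    return cur;
}
```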

140 Parameter selection at running time (scheme repeated).

141 Parameter selection at running time — the scheme is extended with the selection of the optimum AP from the current SPs and the model.

142 Parameter selection at running time — parameters selected for each load situation:
Block size:
● n = 1024: A 32, B 32, C 64, D 64, E 64
● n = 2048: A 64, B 64, C 64, D 128, E 128
● n = 3072: A 64, B 64, C 128, D 128, E 128
Number of nodes to use, p = r × c:
● n = 1024: A 4×2, B 4×2, C 2×2, D 2×2, E 2×1
● n = 2048: A 4×2, B 4×2, C 2×2, D 2×2, E 2×1
● n = 3072: A 4×2, B 4×2, C 2×2, D 2×2, E 2×1

143 Parameter selection at running time (scheme repeated).

144 Parameter selection at running time — finally, the LAR is executed with the selected parameters.

145 Parameter selection at running time — comparison of the static and dynamic models for n = 1024, 2048 and 3072 under load situations A to E (charts).

146 Work distribution — there are different possibilities in heterogeneous systems:
● heterogeneous algorithms (Gauss elimination);
● homogeneous algorithms with assignation of: one process to each processor (LU factorisation), or a variable number of processes to each processor, depending on the relative speed.
The general assignation problem is NP-hard, so heuristic approximations are used.

147 Work distribution — Dynamic Programming (the coins problem scheme): a homogeneous algorithm plus a heterogeneous distribution. Several processes pi are assigned to each physical processor Pj according to its speed (in the example, processes p0, p1 go to P0, p2 to P1, p3, p4, p5 to P3, and so on).

148 Work distribution — the model: t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))
● Problem size: n number of types of coins; C value to give; v array of values of the coins; q quantity of coins of each type.
● Algorithmic parameters: p number of processes; b block size (here n/p); d processes-to-processors assignment.
● System parameters: tc cost of basic arithmetic operations; ts start-up time; tw word-sending time.

149 Work distribution — theoretical model: the same as in the homogeneous case, because the same homogeneous algorithm is used (sequential cost, computational parallel cost for large qi, communication cost per step).
● There is a new AP: the assignment d.
● The SPs are now unidimensional (tc) or bidimensional (ts, tw) tables.

150 Work distribution — assignment tree (P types of processors and p processes): the children of a node of type k are the types k, k+1, …, P, so each branch is a non-decreasing sequence of processor types. Some limit on the height of the tree (the number of processes) is necessary.

151 Work distribution — assignment tree (P types of processors and p processes). For P = 2 and p = 3 the tree has 10 nodes; in general the node counts per level follow the binomial coefficients of the Pascal triangle (1; 1 1; 1 2 1; 1 3 3 1; 1 4 6 4 1; 1 5 10 10 5 1; …).
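The size of the tree can be checked by direct counting. The code below walks the non-decreasing branches described on the previous slide (the closed form is C(p+P, P)); for P = 2 types and p = 3 processes it reproduces the 10 nodes of the slide:

```c
/* Count the nodes of the assignment tree: a node whose last chosen
   processor type is `last` has one child per type last..P, down to
   depth p.  The count includes the root. */
long count_nodes(int last, int P, int depth_left) {
    long nodes = 1;                      /* count this node */
    if (depth_left == 0) return nodes;
    for (int t = last; t <= P; t++)
        nodes += count_nodes(t, P, depth_left - 1);
    return nodes;
}

/* Whole tree: the root may start with any type 1..P. */
long tree_size(int P, int p) { return count_nodes(1, P, p); }
```

This is why "some limit in the height of the tree is necessary": the count grows combinatorially with p.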

152 Work distribution — assignment tree for SUNEt, with P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5). With one process per processor the branches assign U5 and U1 nodes; when more processes than available processors are assigned to a type of processor, the costs of operations (the SPs) change.

153 Work distribution — assignment tree for TORC; P = 4 types of processors were used:
● one 1.7 GHz Pentium 4 (only one process can be assigned): type 1
● one 1.2 GHz AMD Athlon: type 2
● one 600 MHz single Pentium III: type 3
● eight 550 MHz dual Pentium III: type 4
Branches assigning a second process to type 1 are not in the tree; when two consecutive processes are assigned to the same node, the values of the SPs change.

154 Work distribution — use branch and bound or backtracking (with node elimination) to search through the tree: the theoretical execution model estimates the cost at each node, using the highest SP values among the types of processors considered, multiplied by the number of processes assigned to the most loaded processor of that type.

155 Work distribution — use branch and bound or backtracking (with node elimination) to search through the tree: the theoretical execution model gives a lower bound for each node. For example, with an array of processor types (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si, and assignment array a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1); the maximum achievable speed is obtained from it, the minimum arithmetic cost follows from this speed, and the lowest communication costs are those between the processors in the assignment array.
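A hedged sketch of the search itself: exhaustive backtracking over how many processes each processor type receives, scored with a deliberately simple stand-in model (work divided by the aggregate speed of the assigned processes, plus a per-process overhead) rather than the authors' model; the bound-based node elimination of the slides is omitted for brevity. Speeds, availabilities and the model are all illustrative assumptions.

```c
#include <float.h>

#define TYPES 3
static const double speed[TYPES] = {2.0, 1.0, 1.0}; /* relative speeds  */
static const int    avail[TYPES] = {1, 2, 2};       /* procs per type   */

/* Stand-in time model: work W over aggregate speed, plus overhead h
   per process started. */
static double model_time(const int a[TYPES], double W, double h) {
    double agg = 0.0; int procs = 0;
    for (int t = 0; t < TYPES; t++) { agg += a[t] * speed[t]; procs += a[t]; }
    if (procs == 0) return DBL_MAX;
    return W / agg + h * procs;
}

/* Backtracking over the assignment counts a[0..TYPES-1]. */
static void search(int t, int a[TYPES], double W, double h,
                   double *best, int best_a[TYPES]) {
    if (t == TYPES) {
        double tt = model_time(a, W, h);
        if (tt < *best) {
            *best = tt;
            for (int i = 0; i < TYPES; i++) best_a[i] = a[i];
        }
        return;
    }
    for (int k = 0; k <= avail[t]; k++) {  /* processes for type t */
        a[t] = k;
        search(t + 1, a, W, h, best, best_a);
    }
    a[t] = 0;
}
```

With a small overhead the modelled optimum uses every processor; as the overhead grows, fewer (and faster) processors are selected, mirroring the behaviour of the modelled users on the next slides.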

156 Work distribution — theoretical model (with assignment): sequential cost, computational parallel cost (for large qi, taking the maximum values over the processes in one step), and communication cost.
● The APs are p and the assignation array d.
● The SPs are the unidimensional array tc and the bidimensional arrays ts and tw.

157 Work distribution — how to estimate the SPs:
● arithmetic SPs: solving a small problem on each type of processor;
● communication SPs:
● using a ping-pong between each pair of processors, and between processes in the same processor (CP1); does not reflect the characteristics of the system;
● solving a small problem varying the number of processors, with linear interpolation (CP2); larger installation time.

158 Work distribution — three types of users are considered:
● GU (greedy user): uses all the available processors, with one process per processor.
● CU (conservative user): uses half of the available processors (the fastest), with one process per processor.
● EU (user expert in the problem, the system and heterogeneous computing): uses a number of processes and processors depending on the granularity: 1 process on the fastest processor for low granularity; as many processes as half of the available processors, on the appropriate processors, for middle granularity; as many processes as processors, on the appropriate processors, for large granularity.

159 Work distribution — quotient between the execution time with the parameters selected by each selection method and modelled user and the lowest execution time, in SUNEt (chart).

160 Work distribution — parameter selection in TORC with CP2 (assignments given as arrays of processor types; LT = lowest time):
● C = 50000: gra 10: (1,2); gra 50: LT (1,2), CP2 (1,2,4,4); gra 100: LT (1,2), CP2 (1,2,4,4)
● C = 100000: gra 10: (1,2); gra 50: LT (1,2), CP2 (1,2,4,4); gra 100: LT (1,2), CP2 (1,2,4,4)
● C = 500000: gra 10: (1,2); gra 50: LT (1,2), CP2 (1,2,3,4); gra 100: LT (1,2), CP2 (1,2,3,4)

161 Work distribution — parameter selection in TORC (without the 1.7 GHz Pentium 4), with CP2. Types: one 1.2 GHz AMD Athlon (type 1); one 600 MHz single Pentium III (type 2); eight 550 MHz dual Pentium III (type 3).
● C = 50000: gra 10: LT (1,1,2), CP2 (1,1,2,3,3,3,3,3,3); gra 50: LT (1,1,2), CP2 (1,1,2,3,3,3,3,3,3,3,3); gra 100: LT (1,1,3,3), CP2 (1,1,2,3,3,3,3,3,3,3,3)
● C = 100000: gra 10: (1,1,2); gra 50: LT (1,1,3), CP2 (1,1,2,3,3,3,3,3,3,3,3); gra 100: LT (1,1,3), CP2 (1,1,2,3,3,3,3,3,3,3,3)
● C = 500000: gra 10: (1,1,2); gra 50: LT (1,1,2), CP2 (1,1,2,3); gra 100: (1,1,2)

162 Work distribution — quotient between the execution time with the parameters selected by each selection method and modelled user and the lowest execution time, in TORC (chart).

163 Work distribution — quotient between the execution time with the parameters selected by each selection method and modelled user and the lowest execution time, in TORC without the 1.7 GHz Pentium 4 (chart).

164 Outline: ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries' hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

165 Hybrid programming — OpenMP vs MPI:
● OpenMP: fine-grain parallelism; efficient in SMP; sequential and parallel codes are similar; tools for development and parallelisation; allows run-time scheduling; memory allocation can reduce performance.
● MPI: coarse-grain parallelism; more portable; parallel code very different from the sequential one; development and debugging more complex; static assignment of processes; local memories, which facilitates efficient use.

166 Hybrid programming — advantages of hybrid programming:
● improves scalability;
● helps when too many tasks produce load imbalance;
● suits applications with both fine- and coarse-grain parallelism;
● reduces code development time;
● useful when the number of MPI processes is fixed;
● suits a mixture of functional and data parallelism.

167 Hybrid programming — hybrid programming in the literature:
● most of the papers are about particular applications;
● some papers present hybrid models;
● no theoretical models of the execution time are available.

168 Hybrid programming — systems:
● networks of dual Pentiums
● HPC160 (four processors per node)
● IBM SP
● Blue Horizon (144 nodes, 8 processors each)
● Earth Simulator (640 × 8 vector processors)
…

169 Hybrid programming (figure).

170 Hybrid programming — models:
● MPI+OpenMP: OpenMP used for loop parallelisation.
● OpenMP+MPI: unsafe threads.
● MPI and OpenMP processes in SPMD model: reduces the cost of communications.

171 Hybrid programming (figure).

172 Hybrid programming — MPI+OpenMP example (computation of π; the two slide columns are rejoined here in program order):

      program main
      include 'mpif.h'
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      call MPI_BCAST(n, 1, MPI_INTEGER, 0,
     &               MPI_COMM_WORLD, ierr)
      h = 1.0d0 / n
      sum = 0.0d0
!$OMP PARALLEL DO REDUCTION (+:sum) PRIVATE (x)
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   enddo
!$OMP END PARALLEL DO
      mypi = h * sum
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end

173 Hybrid programming — it is not clear whether hybrid programming lowers the execution time (Lanucara, Rovida: Conjugate Gradient).

174 Hybrid programming — it is not clear whether hybrid programming lowers the execution time (Djomehri, Jin: CFD solver).

175 Hybrid programming — it is not clear whether hybrid programming lowers the execution time (Viet, Yoshinaga, Abderazek, Sowa: linear system).

176 Hybrid programming — matrix-matrix multiplication: MPI SPMD vs MPI+OpenMP; decide which is preferable. MPI+OpenMP uses less memory and fewer communications, but may use the memory hierarchy worse. (Diagram: three nodes N0, N1, N2, each running processes/threads p0, p1.)

177 Hybrid programming — more algorithmic parameters appear in the theoretical time model:
● 8 processors: process meshes p = r × s: 1×8, 2×4, 4×2, 8×1; or node meshes p = r × s: 1×4, 2×2, 4×1 combined with thread meshes q = u × v: 1×2, 2×1 — 6 configurations in total.
● 16 processors: process meshes p = r × s: 1×16, 2×8, 4×4, 8×2, 16×1; or node meshes 1×4, 2×2, 4×1 combined with thread meshes q = u × v: 1×4, 2×2, 4×1 — 9 configurations in total.
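Assuming "total 6 configurations" counts node-mesh by thread-mesh combinations (3 factorisations of 4 nodes times 2 factorisations of 2 threads, and likewise 3 × 3 = 9 for 16 processors), the counts can be checked directly; the function names are illustrative:

```c
/* Number of ordered factorisations r x s with r * s = m
   (one per divisor r of m). */
int mesh_count(int m) {
    int count = 0;
    for (int r = 1; r <= m; r++)
        if (m % r == 0) count++;
    return count;
}

/* Hybrid configurations = node meshes times thread meshes. */
int hybrid_configs(int nodes, int threads_per_node) {
    return mesh_count(nodes) * mesh_count(threads_per_node);
}
```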

178 Hybrid programming — and more system parameters:
● the cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor);
● the cost of arithmetic operations can vary with the number of threads in the node.
Consequently, the algorithms must be recoded and new models of the execution time must be obtained.

179 Hybrid programming — … and the formulas change: communications between nodes must be distinguished from synchronizations between threads inside a node. For some systems 6×1 nodes with 1×6 threads could be better, and for others 1×6 nodes with 6×1 threads.

180 Hybrid programming — open problem:
● Is it possible to generate MPI+OpenMP programs automatically from MPI programs? Maybe for the SPMD model, or at least for some types of programs, such as matricial problems on meshes of processors.
● And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program and some description of how the time model has been obtained?

181 Outline: ● A little history ● Modelling Linear Algebra Routines ● Installation routines ● Autotuning routines ● Modifications to libraries' hierarchy ● Polylibraries ● Algorithmic schemes ● Heterogeneous systems ● Hybrid programming ● Peer to peer computing

182 Peer to peer computing — distributed systems:
● they are inherently heterogeneous and dynamic;
● but there are other problems: higher communication cost, and special middleware is necessary;
● the typical paradigms are master/slave and client/server, where different types of processors (users) are considered.

183 Peer to peer computing — Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.

184 Peer to peer computing — peer to peer:
● all the processors (users) are at the same level (at least initially);
● the community selects, in a democratic and continuous way, the topology of the global network.
Would it be interesting to have a P2P system for computing? Is any system of this type available?

185 Peer to peer computing:
● Would it be interesting to have a P2P system for computing? I think it would be interesting to develop a system of this type, and to let the community decide, in a democratic and continuous way, whether it is worthwhile.
● Is any system of this type available? I think there is no pure P2P system dedicated to computation.

186 Peer to peer computing — … and other people seem to think the same:
● Lichun Ji (2003): "… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful."
● Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): "… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability."

187 Peer to peer computing — there are a lot of tools for grid computing:
● Globus (of course) — but does Globus provide computational P2P capacity, or is it a tool with which P2P computational systems can be developed?
● NetSolve/GridSolve: uses a client/server structure.
● PlanetLab (at present 387 nodes and 162 sites): each site has one principal researcher and one system administrator.

188 Peer to peer computing — for computation on P2P, the shared resources are:
● Information: books, papers, …, in the usual way.
● Libraries: one peer takes a library from another peer; a description of the library and of the system is necessary to know whether the library fulfils our requests.
● Computation: one peer collaborates to solve a problem proposed by another peer. This is the central idea of computation on P2P.

189 Peer to peer computing — two peers collaborate in the solution of a computational problem using the hierarchy of parallel linear algebra libraries (Peer 1: ScaLAPACK over reference LAPACK, ATLAS, PBLAS, BLACS and the machine MPI; Peer 2: PLAPACK over the machine LAPACK and BLAS and a reference MPI).

190 Peer to peer computing — there are different global hierarchies and different libraries in each peer.

191 Peer to peer computing — and the installation information varies between peers, which makes the efficient use of the theoretical model more difficult than in the heterogeneous case.

192 Peer to peer computing — trust problems appear:
● Does the library solve the problems we require to be solved?
● Is the library optimised for the system it claims to be optimised for?
● Is the installation information correct?
● Is the system stable?
There are trust algorithms for P2P systems; are they (or some modification) applicable to these trust problems?

193 Peer to peer computing — each peer would have the possibility of establishing a policy of use:
● the use of the resources could be payable;
● the percentage of CPU dedicated to computations for the community;
● the types of problems it is interested in.
And the MAIN PROBLEM: is it interesting to develop a P2P system for the management and optimisation of computational codes?

