Advances in the Optimization of Parallel Routines (I). Domingo Giménez, Departamento de Informática y Sistemas. Universidad Politécnica de Valencia, 24 June 2015.


1 Advances in the Optimization of Parallel Routines (I). Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain. dis.um.es/~domingo. Universidad Politécnica de Valencia, 24 June 2015.

2 Outline: A little history; Modelling Linear Algebra Routines; Installation routines; Autotuning routines; Modifications to libraries’ hierarchy; Polylibraries; Algorithmic schemes; Heterogeneous systems; Peer to peer computing.

3 Collaborations and self-references. Modelling Linear Algebra Routines. + J. Cuenca + J. González: Modelling the Behaviour of Linear Algebra Algorithms with Message-passing (2001); Towards the Design of an Automatically Tuned Linear Algebra Library (2002). + J. Cuenca + L. P. García + J. González + A. Vidal: Empirical Modelling of Parallel Linear Algebra Routines (2003).

4 Collaborations and self-references. Installation routines. + G. Carrillo: Installation routines for linear algebra libraries on LANs (2000). + G. Carrillo + J. Cuenca + J. González: Optimización automática de rutinas paralelas de álgebra lineal (2000).

5 Collaborations and self-references. Autotuning routines. + J. Cuenca + J. González: Automatic parameterization of parallel linear algebra routines (2001). + J. Cuenca: Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines (2002).

6 Collaborations and self-references. Modifications to the libraries’ hierarchy. + J. Cuenca + J. González: Architecture of an Automatic Tuned Linear Algebra Library (2002 - 2004).

7 Collaborations and self-references. Polylibraries. + P. Alberti + P. Alonso + J. Cuenca + A. Vidal: Designing Polylibraries to Speed Up Parallel Computations (2003).

8 Collaborations and self-references. Algorithmic schemes. + J. P. Martínez: Automatic Optimization in Parallel Dynamic Programming Schemes (2004).

9 Collaborations and self-references. Heterogeneous systems. + J. Cuenca + J. Dongarra + J. González + K. Roche: Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load (2003). + J. Cuenca + J. P. Martínez: Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems (2004).

10 Outline: A little history; Modelling Linear Algebra Routines; Installation routines; Autotuning routines; Modifications to libraries’ hierarchy; Polylibraries; Algorithmic schemes; Heterogeneous systems; Peer to peer computing.

11 A little history. Parallel optimization in the past: hand-optimization for each platform; time consuming; incompatible with hardware evolution; incompatible with changes in the system (architecture and basic libraries); unsuitable for systems with variable workloads; misuse by non-expert users.

12 A little history. Initial solutions to this situation: problem-specific solutions; polyalgorithms; installation tests.

13 A little history. Problem-specific solutions: Brewer (1994): sorting algorithms, differential equations; Frigo (1997): FFTW, the Fastest Fourier Transform in the West; LAWRA (1997): Linear Algebra With Recursive Algorithms.

14 A little history. Polyalgorithms: Brewer; FFTW; PHiPAC (1997): linear algebra.

15 A little history. Installation tests: ATLAS (2001): dense linear algebra, sequential; Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm; I-LIB (2000): some parallel linear algebra routines.

16 A little history. Parallel optimization today: optimization based on computational kernels; systematic development of routines; auto-optimization of routines; middleware for auto-optimization.

17 A little history. Optimization based on computational kernels: efficient kernels (BLAS) and algorithms based on these kernels; auto-optimization of the basic kernels (ATLAS).

18 A little history. Systematic development of routines: FLAME project (R. van de Geijn + E. Quintana + …): dense linear algebra, based on object-oriented design; LAWRA: dense linear algebra, for shared-memory systems.

19 A little history. Auto-optimization of routines. At installation time: ATLAS (Dongarra + Whaley); I-LIB (Kanada + Katagiri + Kuroda); SOLAR (Cuenca + Giménez + González); LFC (Dongarra + Roche). At execution time: solve a reduced problem in each processor (Kalinov + Lastovetsky); use a system evaluation tool (NWS).

20 A little history. Middleware for auto-optimization. LFC: middleware for dense linear algebra software in clusters. Hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher-level libraries. FIBER: proposal of general middleware; evolution of I-LIB. mpC: for heterogeneous systems.

21 A little history. Parallel optimization in the future?: skeletons and languages; heterogeneous and variable-load systems; distributed systems; P2P computing.

22 A little history. Skeletons and languages: develop skeletons for parallel algorithmic schemes together with execution time models, and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa).

23 A little history. Heterogeneous and variable-load systems. Heterogeneous algorithms: unbalanced distribution of data (static or dynamic). Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic). Variable-load systems can be treated as dynamic heterogeneous systems.

24 A little history. Distributed systems: intrinsically heterogeneous and variable-load; very high cost of communications; special middleware is necessary (Globus, NWS); there can be servers to attend to client queries.

25 A little history. P2P computing: users can join and leave dynamically; all the users are of the same type (initially); the system is distributed, heterogeneous and variable-load; but special middleware is necessary.

26 Outline: A little history; Modelling Linear Algebra Routines; Installation routines; Autotuning routines; Modifications to libraries’ hierarchy; Polylibraries; Algorithmic schemes; Heterogeneous systems; Peer to peer computing.

27 Modelling Linear Algebra Routines. It is necessary to predict the execution time accurately in order to select: the number of processes; the number of processors; which processors; the number of rows and columns of processes (the topology); the assignment of processes to processors; the computational block size (in linear algebra algorithms); the communication block size; the algorithm (polyalgorithms); the routine or library (polylibraries).

28 Modelling Linear Algebra Routines. The cost of a parallel program is expressed in terms of: the arithmetic time; the communication time; the overhead (for synchronization, imbalance, process creation, ...); and the overlapping of communication and computation.
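
One plausible way to write the cost described on this slide is sketched below. The slide's own formula is not preserved in this transcript, so the symbol names (t_arith, t_comm, t_over, t_overlap) and the sign of the overlap term are assumptions made here for illustration.

```latex
% Assumed notation: the four terms named on the slide, as functions of
% the problem size n and the number of processors p.
t(n,p) = t_{arith}(n,p) + t_{comm}(n,p) + t_{over}(n,p) - t_{overlap}(n,p)
```

The overlap term is subtracted because time in which communication and computation proceed simultaneously should be counted only once.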

29 Modelling Linear Algebra Routines. Estimation of the time: computation and communication are considered as divided into a number of steps, and for each part of the formula the value used is that of the process which gives the highest value.

30 Modelling Linear Algebra Routines. The time depends on the problem size (n) and the system size (p), but also on some ALGORITHMIC PARAMETERS such as the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors.

31 Modelling Linear Algebra Routines. The time also depends on some SYSTEM PARAMETERS which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (t_c), the start-up time (t_s) and the word-sending time (t_w).

32 Modelling Linear Algebra Routines. LU factorisation (Golub - Van Loan):
$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ & U_{22} & U_{23} \\ & & U_{33} \end{pmatrix}$$
Step 1: LU factorisation without blocks. Step 2: multiple lower triangular systems. Step 3: multiple upper triangular systems. Step 4: update of the south-east blocks.

33 Modelling Linear Algebra Routines. The execution time is expressed first for blocks of size 1, where all the operations are on individual elements, and then for blocks of size b, with k_3 and k_2 the costs of operations performed with BLAS 3 or BLAS 2.
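
The dependence of the sequential block LU cost on the block size can be illustrated with a small flop-counting sketch. This is not the slide's formula, which is not preserved in the transcript: it sums leading-order flop counts for the four steps of the block algorithm, and the function name and the per-flop cost arguments (k3_gemm, k3_trsm, k2_getf2) are names assumed here.

```python
def block_lu_model(n, b, k3_gemm, k3_trsm, k2_getf2):
    """Predicted time of a block LU factorisation of an n x n matrix.

    Hypothetical sketch: each kernel's leading-order flop count is
    weighted by an assumed cost per flop for that kernel.
    """
    if b <= 0 or n % b != 0:
        raise ValueError("n must be a positive multiple of b")
    total = 0.0
    for j in range(n // b):
        m = n - (j + 1) * b  # order of the trailing (south-east) submatrix
        # Step 1: unblocked LU of the b x b diagonal block (level 2).
        total += k2_getf2 * (2.0 / 3.0) * b ** 3
        # Steps 2 and 3: two sets of triangular systems, b*b*m flops each.
        total += k3_trsm * 2.0 * b * b * m
        # Step 4: rank-b update of the m x m trailing submatrix.
        total += k3_gemm * 2.0 * b * m * m
    return total
```

With all three costs set to 1 the sum reduces to the familiar (2/3) n^3 flop count, which is a quick sanity check on the per-step counts. With different costs per kernel, the predicted time varies with b, which is what makes the block size worth tuning.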

34 Modelling Linear Algebra Routines. But the cost of different operations of the same level is different, and the theoretical cost could be better modelled with a separate cost for each basic routine. Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and...

35 Modelling Linear Algebra Routines. The value of each System Parameter (SP) can depend on the problem size (n) and on the value of the Algorithmic Parameters (AP), such as b. The formula gives the time as a function of n, the SP and the AP, and what we want is to obtain the values of the AP with which the lowest execution time is obtained.
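
The selection of Algorithmic Parameters described here can be sketched as an exhaustive search over block sizes and mesh shapes. This is an illustrative sketch only: the function names are hypothetical, and the execution-time model is passed in as a callable because the slide's formula is not preserved in the transcript.

```python
from typing import Callable, Iterable, List, Tuple


def candidate_meshes(p: int) -> List[Tuple[int, int]]:
    """All (r, c) pairs with r * c == p, i.e. every 2D mesh of p processors."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def select_parameters(
    n: int,
    p: int,
    block_sizes: Iterable[int],
    model: Callable[[int, int, int, int], float],
) -> Tuple[int, int, int]:
    """Return the (b, r, c) for which model(n, b, r, c) is lowest."""
    best = None
    for b in block_sizes:
        for r, c in candidate_meshes(p):
            predicted = model(n, b, r, c)
            if best is None or predicted < best[0]:
                best = (predicted, b, r, c)
    if best is None:
        raise ValueError("no candidate block sizes were given")
    return best[1], best[2], best[3]
```

For example, `select_parameters(1024, 8, [16, 32, 64], my_model)` evaluates `my_model` on every combination of the three block sizes and the four meshes of 8 processors. Exhaustive search is affordable here because the model is a formula, not an actual run of the routine.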

36 Modelling Linear Algebra Routines. The values of the System Parameters could be obtained: with installation routines associated with each linear algebra routine; from information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization; at execution time, by testing the system conditions prior to the call to the routine.

37 Modelling Linear Algebra Routines. These values can be obtained as single values (the traditional method) or as functions of the Algorithmic Parameters. In the latter case a multidimensional table of values, indexed by the problem size and the Algorithmic Parameters, is stored. When a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size, and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.
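
The table lookup described on this slide can be sketched as follows. The table layout, a dictionary keyed by (problem size, block size), the function names, and the numeric values are all hypothetical and chosen only for illustration.

```python
def closest_stored_size(stored_sizes, n):
    """Stored problem size closest to n (the smaller one wins a tie)."""
    return min(stored_sizes, key=lambda size: (abs(size - n), size))


def lookup_parameter(table, n, b):
    """Value of a System Parameter for block size b at the stored size nearest n.

    Raises KeyError if b was not measured at installation time.
    """
    sizes = sorted({size for (size, _block) in table})
    return table[(closest_stored_size(sizes, n), b)]


# Hypothetical installation data: cost per operation in microseconds.
HYPOTHETICAL_K3_GEMM = {
    (512, 16): 0.0070, (512, 32): 0.0060,
    (1024, 16): 0.0068, (1024, 32): 0.0058,
}
```

With this data, a request for n = 1000 and b = 32 uses the entry stored for n = 1024. The values looked up this way would then be substituted into the execution-time formula for each candidate set of Algorithmic Parameters.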

38 Modelling Linear Algebra Routines. [Figure: parallel block LU factorisation. Panels: matrix; processors; distribution of computations in the first step.]

39 Modelling Linear Algebra Routines. [Figure: distribution of computations on successive steps. Panels: second step; third step.]

40 Modelling Linear Algebra Routines. The cost of parallel block LU factorisation. Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r × c, d = max(r,c). System Parameters: cost of arithmetic operations k_{2,getf2}, k_{3,trsmm}, k_{3,gemm}; communication parameters t_s, t_w.

41 Modelling Linear Algebra Routines. The cost of parallel block QR factorisation. Tuning Algorithmic Parameters: block size b; 2D mesh of p processors, p = r × c. System Parameters: cost of arithmetic operations k_{2,geqr2}, k_{2,larft}, k_{3,gemm}, k_{3,trmm}; communication parameters t_s, t_w.

42 Modelling Linear Algebra Routines. The same basic operations appear repeatedly in different higher-level routines: the information generated for one routine (say, LU) could be stored and used for other routines (e.g. QR), so a common format is necessary to store the information.

43 Modelling Linear Algebra Routines

44 Modelling Linear Algebra Routines. [Figure: parallel QR factorisation on IBM-SP2 with 8 processors. x-axis: problem size (512 to 3584); y-axis: time (seconds); series: mean, model, optimum.] “mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time which could be obtained by a non-expert user). “optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters. “model” is the execution time with the values selected with the model.

45 Modelling Linear Algebra Routines. Parameter selection for the QR algorithm, on a network of Pentium III with Fast Ethernet. Column headers as given: p=4 (b r c), p=8 (b r c). Rows, by problem size: 1024: 16 1 4 1 8; 2048: 16 1 4 1 8; 3072: 32 1 4 1 8; 4096: 32 1 4 1 8.

46 Outline: A little history; Modelling Linear Algebra Routines; Installation routines; Autotuning routines; Modifications to libraries’ hierarchy; Polylibraries; Algorithmic schemes; Heterogeneous systems; Peer to peer computing.

47 Installation Routines. In the formulas (parallel block LU factorisation), the values of the System Parameters (k_{2,getf2}, k_{3,trsmm}, k_{3,gemm}, t_s, t_w) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c).

48 Installation Routines. The System Parameters are estimated by running, at installation time, Installation Routines associated with the linear algebra routine, and storing the information generated to be used at running time. ⇒ Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.

49 Installation Routines. k_{3,gemm} is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c). Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.

50 Installation Routines. For k_{3,trsmm}, two multiple triangular systems are solved, one upper triangular of size b × n/c and another lower triangular of size n/r × b. Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r. As for the previous parameter, values can be obtained for different problem sizes.

51 Installation Routines. k_{2,getf2} corresponds to a level 2 sequential LU factorisation of size b × b. At installation time each of the basic routines is executed varying the values of the parameters it depends on, using representative values (selected by the routine designer or the system manager). The information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation.
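
An installation routine of this kind can be sketched as a timing loop over representative problem sizes and block sizes. The sketch below is hypothetical: it times a naive pure-Python multiplication as a stand-in for the BLAS kernel a real installation would call, and the function names and the JSON file format are assumptions made here.

```python
import json
import time


def time_gemm_per_flop(rows, b, cols, repeats=3):
    """Microseconds per flop of a naive (rows x b) * (b x cols) product.

    Stand-in for timing the real kernel; the best of `repeats` runs is kept.
    """
    left = [[1.0] * b for _ in range(rows)]
    right = [[1.0] * cols for _ in range(b)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        [[sum(left[i][k] * right[k][j] for k in range(b))
          for j in range(cols)] for i in range(rows)]
        best = min(best, time.perf_counter() - start)
    return best * 1e6 / (2.0 * rows * b * cols)


def run_installation(sizes, block_sizes, r, c):
    """Measure the per-flop cost for each (n, b) on an r x c mesh.

    The local product has size (n/r x b) * (b x n/c), as on the slides.
    Keys are "n,b" strings so that the table can be stored as JSON.
    """
    table = {}
    for n in sizes:
        for b in block_sizes:
            table[f"{n},{b}"] = time_gemm_per_flop(n // r, b, n // c)
    return table


def save_table(table, path):
    """Store the measurements in a file to be read at running time."""
    with open(path, "w") as handle:
        json.dump(table, handle, indent=2)
```

A call such as `save_table(run_installation([512, 1024], [16, 32], 2, 2), "k3_gemm.json")` would produce the stored information; the file name is only an example.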

52 Installation Routines. t_s and t_w appear in communications of three types. In one of them a block of size b × b is broadcast in a row, and the parameter depends on b and c. In another a block of size b × b is broadcast in a column, and the parameter depends on b and r. In the third, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors; these parameters depend on n, b, r and c.

53 Installation Routines. In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed. The routine designer also designs the installation process, and can draw on his or her experience to guide the installation. The basic installation process can be designed to allow the intervention of the system manager.

54 Installation Routines. Some results in different systems (physical and logical platform). Values of k_{3,DTRMM} (≈ k_{3,DGEMM}) on the different platforms (in microseconds), by block size:
System  BLAS      n             b=16    b=32    b=64    b=128
SUN1    refBLAS   512,..,4096   0.0200  0.0200  0.0220  0.0280
SUN1    macBLAS   512,..,4096   0.0120  0.0110  0.0110  0.0110
SUN1    ATLAS     512,..,4096   0.0070  0.0060  0.0060  0.0060
SUN5    refBLAS   512,..,4096   0.0120  0.0130  0.0140  0.0150
SUN5    macBLAS   512,..,4096   0.0060  0.0050  0.0050  0.0050
SUN5    ATLAS     512,..,4096   0.0040  0.0032  0.0025  0.0025
For the remaining systems three values are given, in this order: PIII, ATLAS, 512,..,4096: 0.0038, 0.0033, 0.0030; PPC, macBLAS, 512,..,4096: 0.0023, 0.0019, 0.0018; R10K, macBLAS, 512,..,4096: 0.0070, 0.0030, 0.0025.

55 Installation Routines. Values of k_{2,DGEQR2} (≈ k_{2,DLARFT}) on the different platforms (in microseconds). Table columns: System, n, block size (16, 32, 64, 128). SUN1 (refBLAS, macBLAS, ATLAS), 512,..,4096: 0.0200, 0.0500, 0.0700. SUN5 (refBLAS, macBLAS, ATLAS), 512,..,4096: 0.0050, 0.0300, 0.0500. PIII, ATLAS, 512,..,4096: 0.0150. PPC, macBLAS, 512,..,4096: 0.0100. R10K, macBLAS, 512,..,4096: 0.0250.

56 Installation Routines. Typically the values of the communication parameters are well estimated with a ping-pong. Table columns: System, n, block size (16, 32, 64, 128).
System      MPI       n             value
cSUN1       MPICH     512,..,4096   170 / 7.0
cPIII       MPICH     512,..,4096   60 / 0.7
IBM-SP2     Mac-MPI   512,..,4096   75 / 0.3
Origin 2K   Mac-MPI   512,..,4096   20 / 0.1
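
Estimating the communication parameters from ping-pong measurements can be sketched as a least-squares line fit. The sketch assumes the usual linear model, in which sending a message of m words takes t_s + m · t_w, and it assumes the one-way times have already been measured (for instance as half of each round-trip time). The function name is hypothetical and the measurement itself is outside the sketch.

```python
def fit_ts_tw(message_sizes, one_way_times):
    """Least-squares fit of time = t_s + m * t_w; returns (t_s, t_w)."""
    count = len(message_sizes)
    if count < 2 or count != len(one_way_times):
        raise ValueError("need at least two (size, time) measurements")
    mean_m = sum(message_sizes) / count
    mean_t = sum(one_way_times) / count
    spread = sum((m - mean_m) ** 2 for m in message_sizes)
    if spread == 0:
        raise ValueError("message sizes must not all be equal")
    covariance = sum((m - mean_m) * (t - mean_t)
                     for m, t in zip(message_sizes, one_way_times))
    t_w = covariance / spread      # slope: time per word
    t_s = mean_t - t_w * mean_m    # intercept: start-up time
    return t_s, t_w
```

Given times that follow the model exactly, the fit recovers the two parameters; with real measurements the intercept and slope are estimates whose quality depends on the range of message sizes tested.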

