1 Automatic Parameterisation of Parallel Linear Algebra Routines
Domingo Giménez, Javier Cuenca, José González
University of Murcia, Spain
Algèbre Linéaire et Arithmétique: Calcul Numérique, Symbolique et Parallèle. Rabat, Morocco, 28-31 May 2001

2 Outline
- Current Situation of Linear Algebra Parallel Routines (LAPRs)
- Objective
- Approach I: Analytical Model of the LAPRs
  - Application: Jacobi Method on Origin 2000
- Approach II: Exhaustive Executions
  - Application: Gauss elimination on networks of processors
- Validation with the LU factorization
- Conclusions
- Future Works

3 Current Situation of Linear Algebra Parallel Routines (LAPRs)
- Linear Algebra: highly optimizable operations
- Optimizations are platform specific
- Traditional method: hand-optimization for each platform

4 Problems of traditional method
- Time-consuming
- Incompatible with hardware evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for dynamic systems
- Misuse by non-expert users

5 Current approaches: ATLAS, FLAME, I-LIB
- Analyse platform characteristics in detail
- Sequential code
- Empirical results of the LAPR + automation
- High installation time

6 Our objective
Develop a methodology for obtaining Automatically Tuned Software
[Diagram labels: Execution Environment, Auto-tuning Software]

7 Methodology
- Routines parameterised: system parameters, algorithmic parameters
- System parameters obtained at installation time:
  - Analytical model of the routine and simple installation routines to obtain the system parameters
  - A reduced number of executions at installation time
- Algorithmic parameters obtained at running time:
  - From the analytical model with the system parameters obtained in the installation process
  - From the file with information generated in the installation process

8 Analytical modelling
- System parameters obtained at installation time:
  - Analytical model of the routine and simple installation routines to obtain the system parameters
- Algorithmic parameters obtained at running time:
  - From the analytical model with the system parameters obtained in the installation process

9 Analytical Model
The behaviour of the algorithm on the platform is defined by
T_exec = f(SPs, n, APs)
- SPs = f(n, APs): System Parameters
- APs: Algorithmic Parameters
- n: Problem Size

10 Analytical Model: LAPRs Performance
System Parameters (SPs):
- Hardware platform: physical characteristics, current conditions
- Basic libraries
How to estimate each SP?
1º Obtain the kernel of performance cost of the LAPR
2º Make an Estimation Routine from this kernel
Two kinds of SPs:
- Communication System Parameters (CSPs)
- Arithmetic System Parameters (ASPs)

11 Analytical Model
Arithmetic System Parameters (ASPs):
- t_c: arithmetic cost
- but using BLAS: k_1, k_2 and k_3 (costs of arithmetic operations of the different BLAS levels)
Computation kernel of the LAPR → Estimation Routine:
- Similar storage scheme
- Similar quantity of data
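
An estimation routine for an arithmetic parameter can be sketched as timing a small computation kernel and dividing by its operation count. The sketch below is only an illustration: a pure-Python matrix product stands in for the BLAS kernel, and the function name `estimate_k3` and the block sizes are hypothetical. It returns one value per block size, since the cost per operation may vary with b.

```python
import time

def estimate_k3(b, repeats=3):
    """Estimate the cost per floating-point operation (in seconds) of a
    b x b matrix product, taking the fastest of `repeats` runs."""
    if b < 1 or repeats < 1:
        raise ValueError("b and repeats must be positive")
    x = [[1.0] * b for _ in range(b)]
    y = [[2.0] * b for _ in range(b)]
    best = float("inf")
    for _ in range(repeats):
        z = [[0.0] * b for _ in range(b)]
        start = time.perf_counter()
        for i in range(b):
            xi, zi = x[i], z[i]
            for k in range(b):
                xik, yk = xi[k], y[k]
                for j in range(b):
                    zi[j] += xik * yk[j]
        best = min(best, time.perf_counter() - start)
    # A b x b matrix product performs 2 * b^3 floating-point operations.
    return best / (2.0 * b ** 3)

# Hypothetical block sizes; a real installation would use the sizes of interest.
k3_by_block = {b: estimate_k3(b) for b in (8, 16, 32)}
```

A real estimation routine would call the same library kernel as the LAPR, with a similar storage scheme and quantity of data, as the slide requires.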

12 Analytical Model
Communication System Parameters (CSPs):
- t_s: start-up time
- t_w: word-sending time
Communication kernel of the LAPR → Estimation Routine:
- Similar kind of communication
- Similar quantity of data
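
One possible way to obtain t_s and t_w from measurements is a least-squares fit, assuming the usual linear cost model in which sending m words takes t_s + m * t_w. The slides do not state how the fit is done, so this is a sketch under that assumption; the function name and the synthetic timings are hypothetical, and real timings would come from communications of the same kind as those in the LAPR.

```python
def fit_comm_params(sizes, times):
    """Least-squares fit of times ~ t_s + t_w * size; returns (t_s, t_w)."""
    n = len(sizes)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (size, time) pairs")
    mean_m = sum(sizes) / n
    mean_t = sum(times) / n
    sxx = sum((m - mean_m) ** 2 for m in sizes)
    if sxx == 0:
        raise ValueError("message sizes must not all be equal")
    sxy = sum((m - mean_m) * (t - mean_t) for m, t in zip(sizes, times))
    t_w = sxy / sxx              # slope: cost per word
    t_s = mean_t - t_w * mean_m  # intercept: start-up cost
    return t_s, t_w

# Synthetic timings in microseconds, generated from t_s = 20 and t_w = 0.1.
sizes = [1000, 2000, 4000, 8000]
times = [20.0 + 0.1 * m for m in sizes]
t_s, t_w = fit_comm_params(sizes, times)
```

With noisy measurements the fit averages out timing jitter, which is one reason to measure several message sizes rather than two.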

13 Analytical Model
Algorithmic Parameters (APs): values chosen in each execution
- b: block size
- p: number of processors
- r × c: logical topology, grid configuration (logical 2D mesh)

14 The Methodology. Step by step:
Pre-installing (manual):
1º Make the Analytical Model: T_exec = f(SPs, n, APs)
2º Write the Estimation Routines for the SPs
Installing on a Platform (automatic):
3º Estimate the SPs using the Estimation Routines of step 2
4º Write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize T_exec
Execution:
The user executes the LAPR for a size n: the LAPR obtains the optimal APs
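
Step 4 can be sketched as a search over candidate APs for the combination that minimises the modelled time. The cost function `toy_model` below is hypothetical and is not the model from the talk; it only stands in for T_exec = f(SPs, n, APs). The SP values and problem sizes are likewise illustrative.

```python
def grid_shapes(p):
    """All logical 2D meshes r x c with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]

def toy_model(n, b, r, c, sp):
    """HYPOTHETICAL cost model (microseconds), for illustration only."""
    p = r * c
    arithmetic = sp["k3"][b] * 2.0 * n ** 3 / p
    communication = (n / b) * (r + c) * (sp["t_s"] + sp["t_w"] * n * b / p)
    return arithmetic + communication

def best_aps(n, procs, sp, model):
    """Return the (b, r, c) triple that minimises model(n, b, r, c, sp)."""
    candidates = [(b, r, c)
                  for p in procs
                  for b in sorted(sp["k3"])
                  for r, c in grid_shapes(p)]
    return min(candidates, key=lambda ap: model(n, ap[0], ap[1], ap[2], sp))

# Illustrative SP values in microseconds, as an installation might produce.
sp = {"k3": {32: 0.005, 64: 0.004, 128: 0.003}, "t_s": 20.0, "t_w": 0.1}

# "Configuration file" content: for each problem size, the chosen APs.
config = {n: best_aps(n, procs=(4, 8), sp=sp, model=toy_model)
          for n in (512, 1024, 2048)}
```

Because the search only evaluates a formula, it is cheap enough to run at installation time for many sizes, or even at run time for the exact size requested.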

15 Application Example
LAPR: One-sided Block Jacobi Method to solve the Symmetric Eigenvalue Problem
- Message-passing with MPI
- Logical Ring & Logical 2D-Mesh
Platform: SGI Origin 2000

16 Application Example. Algorithm Scheme
[Figure: algorithm scheme. Recoverable labels: W, D, B; blocks 00, 01, 10, 11, 20, 21; dimensions b, n/r, n]

17 Application Example: Pre-installing.
1º Make the Analytical Model: T_exec = f(SPs, n, APs)

18 Application Example: Pre-installing.
2º Write the Estimation Routines for the SPs
- k_3: matrix-matrix multiplication with DGEMM
- k_1: Givens Rotation to 2 vectors with DROT
- t_s, t_w: communications along the 2 directions of the 2D-mesh

19 Application Example: Installing
3º Estimate the SPs using the Estimation Routines

Parameter  Value     Block size
k_1        0.01 µs
k_3        0.005 µs  b = 32
k_3        0.004 µs  b = 64
k_3        0.003 µs  b = 128
t_s        20 µs
t_w        0.1 µs

20 Application Example: Executing
Comparison of execution times using different sets of Execution Parameters (4 processors)

21 Application Example: Executing
Comparison of execution times using different sets of Execution Parameters (8 processors)

22 Application Example: Executing
LAPR: One-sided Block Jacobi Method
- Algorithmic Parameters: block size, mesh topology
Platform: SGI Origin 2000 with message-passing
- System Parameters: arithmetic costs, communication costs
Satisfactory reduction of the execution time: from 25% above the optimum to only 2% above it

24 Exhaustive Execution
- System parameters obtained at installation time:
  - Installation routines making a reduced number of executions at installation time
- Algorithmic parameters obtained at running time:
  - From the file with information generated in the installation process

25 Exhaustive Execution
The behaviour of the algorithm on the platform is defined (as in Analytical Modelling) by
T_exec = f(SPs, n, APs)
- SPs = f(n, APs): System Parameters
- APs: Algorithmic Parameters
- n: Problem Size

26 Exhaustive Execution
Identify Algorithmic Parameters (APs) (as in Analytical Modelling): values chosen in each execution
- b: block size
- p: number of processors
- r × c: logical topology, grid configuration (logical 2D mesh)

27 The Methodology. Step by step:
Pre-installing (manual):
1º Determine the APs
2º Decide heuristics to reduce execution time in the installation process
Installing on a Platform (automatic):
3º The manager decides the problem sizes to be analysed
4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize T_exec
Execution:
The user executes the LAPR for a size n: the LAPR obtains the optimal APs
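
Steps 3 and 4 and the execution phase can be sketched as follows. This is a minimal illustration under stated assumptions: `measure` is a hypothetical callback that would run the LAPR and return its execution time (a deterministic stand-in is used here), the configuration file is represented as JSON text, and the run-time policy of using the nearest installed size is an assumption, since the slides do not specify one.

```python
import json

def install(measure, sizes, candidates):
    """For each problem size, try every candidate AP set and keep the fastest."""
    config = {}
    for n in sizes:
        timings = [(measure(n, ap), idx) for idx, ap in enumerate(candidates)]
        best_time, best_idx = min(timings)
        config[n] = {"aps": candidates[best_idx], "time": best_time}
    return config

def save(config):
    """Serialise the configuration (JSON object keys become strings)."""
    return json.dumps(config, sort_keys=True)

def load(text):
    """Parse the configuration, restoring integer problem sizes as keys."""
    return {int(n): entry for n, entry in json.loads(text).items()}

def lookup(config, n):
    """ASSUMED policy: use the APs of the nearest installed size
    (ties are resolved towards the smaller size)."""
    nearest = min(config, key=lambda m: (abs(m - n), m))
    return config[nearest]["aps"]

def fake_measure(n, ap):
    """Deterministic stand-in for a real timed execution of the LAPR."""
    return n ** 2 / ap["p"] + 200000 * ap["p"] + 1000 * abs(ap["b"] - 32)

candidates = [{"p": p, "b": b} for p in (1, 2, 4) for b in (16, 32, 64)]
config = load(save(install(fake_measure, (512, 1536, 2560), candidates)))
```

The heuristics of step 2 would enter by shrinking `candidates` or `sizes`, or by stopping the loop early; none of that is modelled in this sketch.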

28 Application Example
LAPR: Gaussian elimination
- Message-passing with MPI
- Logical Ring, rowwise block-cyclic striped partitioning
Platform: networks of processors (heterogeneous system)

29 Application Example: Pre-installing.
1º Determine the APs
- Logical ring, rowwise block-cyclic striped partitioning
- p: number of processors
- b: block size for the data distribution
- Different block sizes in heterogeneous systems
[Figure: rowwise block-cyclic distribution with per-processor block sizes repeated cyclically: b_0, b_1, b_2, b_0, b_1, b_2, b_0, b_1, b_2, b_0]
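
The rowwise block-cyclic distribution with a different block size per processor can be sketched as a mapping from a row index to its owner. The function name `row_owner` and the example block sizes are hypothetical; the sketch only illustrates the cyclic pattern b_0, b_1, b_2, b_0, ...

```python
def row_owner(row, block_sizes):
    """Processor that owns `row` when processor i receives block_sizes[i]
    consecutive rows in each cycle of the distribution."""
    if row < 0 or not block_sizes or any(b <= 0 for b in block_sizes):
        raise ValueError("row must be >= 0 and block sizes must be positive")
    offset = row % sum(block_sizes)  # position inside the current cycle
    for proc, b in enumerate(block_sizes):
        if offset < b:
            return proc
        offset -= b
    raise AssertionError("unreachable: offset is always inside one cycle")

# Three processors with block sizes b_0 = 2, b_1 = 1, b_2 = 3.
owners = [row_owner(i, [2, 1, 3]) for i in range(9)]
# owners == [0, 0, 1, 2, 2, 2, 0, 0, 1]
```

With equal block sizes this reduces to the ordinary block-cyclic distribution used on homogeneous systems.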

30 Application Example: Pre-installing.
2º Decide heuristics to reduce execution time in the installation process
- Execution time varies in a continuous way with the problem size and the APs
- Consider the system as homogeneous
- Installation can finish:
  - When analytical and experimental predictions coincide
  - When a certain time has been spent on the installation

31 Application Example: Installing
Homogeneous Systems:
3º The manager decides the problem sizes
4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize T_exec
Heterogeneous Systems:
3º The manager decides the problem sizes
4º Execute:
- write a Configuration File: for each n, the APs that minimize T_exec
- write a Speed File, with the relative speeds of the processors in the system

32 Application Example: Installation Routines
- RI-THE: Obtains p and b from the formula.
- RI-HOM: Obtains p and b through a reduced number of executions.
- RI-HET: 1º As RI-HOM. 2º Obtains b_i for each processor.
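
The slides do not say how RI-HET derives each b_i. One plausible rule, shown here purely as an assumption, is to scale a base block size by the relative speed of each processor, so that faster processors receive proportionally more rows per cycle. The function name and the speed values are hypothetical.

```python
def block_sizes_from_speeds(speeds, base_block):
    """ASSUMED rule: b_i proportional to the relative speed of processor i,
    with the slowest processor receiving `base_block` rows per cycle."""
    if base_block < 1 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds and base_block must be positive")
    slowest = min(speeds)
    return [max(1, round(base_block * s / slowest)) for s in speeds]

# Hypothetical relative speeds, as they might be read from a Speed File.
b_i = block_sizes_from_speeds([1.0, 2.5, 1.0], base_block=8)
# b_i == [8, 20, 8]
```

Rounding to whole rows means the load balance is only approximate; a larger base block size reduces the relative rounding error.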

33 Application Example: Systems
Three different configurations:
- PLA_HOM: 5 SUN Ultra-1
- PLA_HYB: 5 SUN Ultra-1, 1 SUN Ultra-5
- PLA_HET: 1 SUN Ultra-1, 1 SUN Ultra-5, 1 SUN Ultra-1 (manages the file system)

34 Application Example: Executing
Experimental results in PLA-HOM: Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

35 Application Example: Executing
Experimental results in PLA-HYB: Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

36 Application Example: Executing
Experimental results in PLA-HET: Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

37 Comparison
Two techniques for automatic tuning of Parallel Linear Algebra Routines:
1. Analytical Modelling: for predictable systems (homogeneous, static, ...) like Origin 2000
2. Exhaustive Execution: for less predictable systems (heterogeneous, dynamic, ...) like networks of workstations
- Transparent to the user
- Execution close to the optimum

39 Validation with the LU factorization
To validate the methodology it is necessary to experiment with:
- More routines: block LU factorization
- More systems:
  - Architectures: IBM SP2 and Origin 2000
  - Libraries: reference BLAS, machine BLAS, ATLAS

40 Sequential LU
Analytical Model: T_exec = f(SPs, n, APs)
- SPs: cost of arithmetic operations of different levels: k_1, k_2, k_3
- APs: block size b
[Figure: block LU scheme with block size b; labels as extracted: "LUES UM"]
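
To show where the block size b enters a sequential block LU, here is a minimal right-looking sketch in pure Python. It is not the routine measured in the talk: it omits pivoting (an assumption that only suits matrices such as the diagonally dominant example below), and the name `blocked_lu` is hypothetical. In a library implementation the panel step would typically use lower-level BLAS operations and the trailing update a matrix-matrix product.

```python
def blocked_lu(a, b):
    """In-place block LU without pivoting. On return, the strict lower part
    of `a` holds L (unit diagonal implied) and the upper part holds U."""
    n = len(a)
    if b < 1:
        raise ValueError("block size must be positive")
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        # 1. Factor the panel made of columns k0..k1-1 (rows k0..n-1).
        for k in range(k0, k1):
            pivot = a[k][k]
            if pivot == 0.0:
                raise ZeroDivisionError("zero pivot; pivoting is not implemented")
            for i in range(k + 1, n):
                a[i][k] /= pivot
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # 2. Block row of U: forward substitution with the unit lower block.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        # 3. Trailing update: A22 -= L21 * U12 (the matrix-matrix product).
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= lik * a[k][j]
    return a

# Diagonally dominant example matrix, so no pivoting is needed.
a = [[10.0, 2.0, 3.0, 1.0, 2.0],
     [2.0, 12.0, 1.0, 4.0, 1.0],
     [3.0, 1.0, 9.0, 2.0, 1.0],
     [1.0, 4.0, 2.0, 11.0, 3.0],
     [2.0, 1.0, 1.0, 3.0, 8.0]]
lu = blocked_lu([row[:] for row in a], b=2)
```

Changing b does not change the result, only how the work is split between the panel step and the trailing update, which is why b is a natural algorithmic parameter to tune.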

41 Sequential LU. Comparison in IBM SP2
Quotient between different execution times and the optimum execution time

42 Sequential LU. Model execution time/optimum execution time
Quotient between the execution time with the parameters provided by the model and the optimum execution time, with different basic libraries. In SUN 1

43 Parallel LU
Analytical Model: T_exec = f(SPs, n, APs)
- SPs: cost of arithmetic operations: k_1, k_2, k_3; cost of communications: t_s, t_w
- APs: block size b, number of processors p, grid configuration r × c
[Figure: 2D block-cyclic distribution of blocks of size b over a 2 × 3 logical mesh; the processor labels 00 01 02 / 10 11 12 repeat cyclically across block rows and block columns]
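
The 2D block-cyclic distribution over an r × c mesh can be sketched as a mapping from block indices, or matrix element indices, to processor coordinates. The function names are hypothetical; the mapping itself is the standard cyclic one, and the example mesh is 2 × 3.

```python
def block_owner(bi, bj, r, c):
    """Mesh coordinates of the processor owning block (bi, bj)."""
    return (bi % r, bj % c)

def element_owner(i, j, b, r, c):
    """Mesh coordinates of the processor owning matrix element (i, j)
    when the matrix is split into b x b blocks."""
    return block_owner(i // b, j // b, r, c)

# Owner labels for a 4 x 6 array of blocks on a 2 x 3 mesh.
layout = [["%d%d" % block_owner(bi, bj, 2, 3) for bj in range(6)]
          for bi in range(4)]
# layout[0] == ["00", "01", "02", "00", "01", "02"]
# layout[1] == ["10", "11", "12", "10", "11", "12"]
```

Different r × c factorisations of the same p change which processors share block rows and block columns, and therefore the communication pattern, which is why the grid configuration is treated as an algorithmic parameter.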

44 Parallel LU. Comparison in IBM SP2
Quotient between the execution time with the parameters provided by the model and the optimum execution time. In the sequential case, and in parallel with 4 and 8 processors.

45 Parallel LU. Comparison in Origin 2000
Quotient between the execution time with the parameters provided by the model and the optimum execution time. In the sequential case, and in parallel with 4 and 8 processors.

46 Parallel LU. Conclusions
- The modelling of the algorithm provides satisfactory results in different systems:
  - Origin 2000, IBM SP2
  - reference BLAS, machine BLAS, ATLAS
- The prediction is worse in some cases:
  - When the number of processors increases
  - In multicomputers where communications are more important (IBM SP2)
  → Exhaustive Executions

47 Parallel LU. Exhaustive Execution
If the manager installs the routine for sizes 512, 1536, 2560, and executions are performed for sizes 1024, 2048, 3072, the execution time is well predicted.
The same policy can be used in the installation of other software.
Quotient between the execution time with the parameters provided by the installation process and the optimum execution time. With ScaLAPACK, in IBM SP2.

48 Conclusions
- Parameterisation of Parallel Linear Algebra Routines enables development of Automatically Tuned Software
- Two techniques can be used: Analytical Modelling, Exhaustive Executions, or a combination of both
- Experiments performed in different systems and with different routines

49 Future Works
- We are trying to develop a methodology valid for a wide range of systems, and to include it in the design of linear algebra libraries: it is necessary to analyse the methodology in more systems and with more routines
- Architecture of an Automatically Tuned Linear Algebra Library
- At the moment we are analysing routines individually, but it could be preferable to analyse algorithmic schemes

50 Architecture of an Automatically Tuned Linear Algebra Library
[Diagram: components Library, Basic routines library, Basic routines declaration, Installation routines, Installation file, SP file, AP file; stages Installation and Compilation; roles designer and manager]

