Published by Joel Cross. Modified over 9 years ago.
1
Automatic Parameterisation of Parallel Linear Algebra Routines
Domingo Giménez, Javier Cuenca, José González
University of Murcia, Spain
Algèbre Linéaire et Arithmétique: Calcul Numérique, Symbolique et Parallèle
Rabat, Morocco, 28-31 May 2001
2
Outline
- Current Situation of Linear Algebra Parallel Routines (LAPRs)
- Objective
- Approach I: Analytical Model of the LAPRs
  - Application: Jacobi Method on Origin 2000
- Approach II: Exhaustive Executions
  - Application: Gauss elimination on networks of processors
- Validation with the LU factorization
- Conclusions
- Future Works
3
Current Situation of Linear Algebra Parallel Routines (LAPRs)
- Linear algebra: highly optimizable operations
- Optimizations are platform specific
- Traditional method: hand-optimization for each platform
4
Problems of the traditional method
- Time-consuming
- Incompatible with hardware evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for dynamic systems
- Misuse by non-expert users
5
Current approaches
- ATLAS, FLAME, I-LIB
- Analyse platform characteristics in detail
- Sequential code
- Empirical results of the LAPR + automation
- High installation time
6
Our objective
- Develop a methodology for obtaining Automatically Tuned Software
[Diagram labels: Execution Environment, Auto-tuning Software]
7
Methodology
- Routines parameterised: system parameters, algorithmic parameters
- System parameters obtained at installation time:
  - Analytical model of the routine and simple installation routines to obtain the system parameters
  - A reduced number of executions at installation time
- Algorithmic parameters obtained at running time:
  - From the analytical model with the system parameters obtained in the installation process
  - From the file with information generated in the installation process
8
Analytical modelling
- System parameters obtained at installation time:
  - Analytical model of the routine and simple installation routines to obtain the system parameters
- Algorithmic parameters obtained at running time:
  - From the analytical model with the system parameters obtained in the installation process
9
Analytical Model
The behaviour of the algorithm on the platform is defined by:
  T_exec = f(SPs, n, APs)
- SPs = f(n, APs): System Parameters
- APs: Algorithmic Parameters
- n: Problem Size
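As a minimal sketch of how such a model can drive parameter selection, the Python below evaluates a cost function for every candidate set of APs and keeps the cheapest. The cost function `t_exec` is a hypothetical placeholder (a compute term shared among processors plus a per-step communication term), not the model from the slides, and the names `SystemParams`, `t_exec` and `best_aps` are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SystemParams:
    """System parameters (SPs), in microseconds. Values are illustrative."""
    k3: float   # cost of one arithmetic operation in the level-3 kernel
    ts: float   # communication start-up time
    tw: float   # word-sending time


def t_exec(sp, n, b, p):
    """Hypothetical model T_exec = f(SPs, n, APs); NOT the model from the slides."""
    compute = (2.0 * n ** 3 / 3.0) * sp.k3 / p
    steps = n / b                      # assume one communication step per block
    comm = 0.0 if p == 1 else steps * (sp.ts + n * b * sp.tw)
    return compute + comm


def best_aps(sp, n, block_sizes, procs):
    """Return the (b, p) pair that minimises the modelled time for size n."""
    return min(product(block_sizes, procs),
               key=lambda ap: t_exec(sp, n, ap[0], ap[1]))
```

With a real model in place of `t_exec`, the same search can be run once per problem size at installation time, or directly at running time if the model is cheap to evaluate.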
10
Analytical Model: LAPRs Performance
System Parameters (SPs):
- Hardware platform: physical characteristics, current conditions
- Basic libraries
How to estimate each SP?
  1º Obtain the kernel of performance cost of the LAPR
  2º Make an Estimation Routine from this kernel
Two kinds of SPs:
- Communication System Parameters (CSPs)
- Arithmetic System Parameters (ASPs)
11
Analytical Model
Arithmetic System Parameters (ASPs):
- t_c: arithmetic cost; when using BLAS: k_1, k_2 and k_3
- Estimation Routine made from the computation kernel of the LAPR:
  - Similar storage scheme
  - Similar quantity of data
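An estimation routine for an arithmetic parameter can be sketched as a timed run of the kernel on a block of representative size. In the Python below a pure-Python triple loop stands in for the BLAS kernel, which is an assumption made to keep the example self-contained; the function name `estimate_k3` is also hypothetical.

```python
import time


def estimate_k3(b, repeats=3):
    """Estimate the per-operation cost (in microseconds) of a b x b matrix product.

    A pure-Python triple loop stands in for DGEMM here; a real estimation
    routine would call the BLAS kernel actually used by the LAPR.
    """
    a = [[1.0] * b for _ in range(b)]
    c = [[0.0] * b for _ in range(b)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(b):
            row_a, row_c = a[i], c[i]
            for k in range(b):
                aik = row_a[k]
                row_b = a[k]
                for j in range(b):
                    row_c[j] += aik * row_b[j]
        # Keep the fastest repetition to reduce the effect of system noise.
        best = min(best, time.perf_counter() - start)
    flops = 2.0 * b ** 3            # one multiply and one add per inner iteration
    return best * 1e6 / flops
```

Calling the routine for several block sizes gives one value of the parameter per block size, so the model can account for the kernel running at different speeds for different b.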
12
Analytical Model
Communication System Parameters (CSPs):
- t_s: start-up time
- t_w: word-sending time
- Estimation Routine made from the communication kernel of the LAPR:
  - Similar kind of communication
  - Similar quantity of data
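The slides do not say how the estimation routine turns measurements into t_s and t_w. One common option, assumed here, is to time messages of several sizes and fit the linear cost model time = t_s + t_w * words by least squares. The function name `fit_ts_tw` and the sample timings are hypothetical.

```python
def fit_ts_tw(sizes, times):
    """Least-squares fit of time = ts + tw * words; returns (ts, tw)."""
    m = len(sizes)
    mean_x = sum(sizes) / m
    mean_y = sum(times) / m
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    tw = sxy / sxx                  # slope: cost per word
    ts = mean_y - tw * mean_x       # intercept: start-up cost
    return ts, tw


# Synthetic timings (microseconds) standing in for measured message times.
sizes = [1000, 2000, 4000, 8000]
times = [20.0 + 0.1 * s for s in sizes]
ts, tw = fit_ts_tw(sizes, times)
```

In a real installation the `times` list would come from timing the same kind of communication the routine performs, with message sizes similar to those it sends.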
13
Analytical Model
Algorithmic Parameters (APs): values chosen in each execution
- b: block size
- p: number of processors
- r × c: logical topology, grid configuration (logical 2D mesh)
14
The Methodology. Step by step:
Pre-installing (manual):
  1º Make the Analytical Model: T_exec = f(SPs, n, APs)
  2º Write the Estimation Routines for the SPs
Installing on a Platform (automatic):
  3º Estimate the SPs using the Estimation Routines of step 2
  4º Write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize T_exec
Execution:
  The user executes the LAPR for a size n: the LAPR obtains the optimal APs
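Step 4 and the execution stage can be sketched as writing and reading a small table that maps each problem size to its chosen APs. The JSON layout and the function names `write_config` and `read_aps` are assumptions for illustration; the slides do not specify the file format.

```python
import json


def write_config(path, table):
    """Store the installation result: problem size n -> chosen APs."""
    with open(path, "w") as fh:
        # JSON object keys must be strings, so sizes are stored as text.
        json.dump({str(n): aps for n, aps in table.items()}, fh, indent=2)


def read_aps(path, n):
    """Return the stored APs for size n (raises KeyError if n was not installed)."""
    with open(path) as fh:
        return json.load(fh)[str(n)]
```

At running time the routine would call something like `read_aps` with the user's problem size before choosing its block size and processor grid.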
15
Application Example
- LAPR: One-sided Block Jacobi Method to solve the Symmetric Eigenvalue Problem
  - Message-passing with MPI
  - Logical Ring & Logical 2D-Mesh
- Platform: SGI Origin 2000
16
Application Example. Algorithm Scheme
[Figure: algorithm scheme. Labels: B, W, D; block indices 00 to 21; dimensions b, n/r, n]
17
Application Example: Pre-installing.
1º Make the Analytical Model: T_exec = f(SPs, n, APs)
18
Application Example: Pre-installing.
2º Write the Estimation Routines for the SPs
- k_3: matrix-matrix multiplication with DGEMM
- k_1: Givens rotation applied to 2 vectors with DROT
- t_s, t_w: communications along the 2 directions of the 2D-mesh
19
Application Example: Installing
3º Estimate the SPs using the Estimation Routines
- k_1: 0.01 µs
- k_3: 0.005 µs (b = 32), 0.004 µs (b = 64), 0.003 µs (b = 128)
- t_s: 20 µs
- t_w: 0.1 µs
20
Application Example: Executing
[Figure: comparison of execution times using different sets of Execution Parameters (4 processors)]
21
Application Example: Executing
[Figure: comparison of execution times using different sets of Execution Parameters (8 processors)]
22
Application Example: Executing
- LAPR: One-sided Block Jacobi Method
  - Algorithmic Parameters: block size, mesh topology
- Platform: SGI Origin 2000 with message-passing
  - System Parameters: arithmetic costs, communication costs
- Satisfactory reduction of the execution time: from 25% above the optimum to only 2% above it
24
Exhaustive Execution
- System parameters obtained at installation time:
  - Installation routines making a reduced number of executions at installation time
- Algorithmic parameters obtained at running time:
  - From the file with information generated in the installation process
25
Exhaustive Execution
The behaviour of the algorithm on the platform is defined (as in Analytical Modelling) by:
  T_exec = f(SPs, n, APs)
- SPs = f(n, APs): System Parameters
- APs: Algorithmic Parameters
- n: Problem Size
26
Exhaustive Execution
Identify the Algorithmic Parameters (APs) (as in Analytical Modelling): values chosen in each execution
- b: block size
- p: number of processors
- r × c: logical topology, grid configuration (logical 2D mesh)
27
The Methodology. Step by step:
Pre-installing (manual):
  1º Determine the APs
  2º Decide heuristics to reduce execution time in the installation process
Installing on a Platform (automatic):
  3º The manager decides the problem sizes to be analysed
  4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize T_exec
Execution:
  The user executes the LAPR for a size n: the LAPR obtains the optimal APs
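Steps 3 and 4 can be sketched as a timed search over the candidate APs for each problem size chosen by the manager. The stopping rule used here, a wall-clock budget for the whole installation, is only one possible heuristic, and the names `install`, `run` and `budget_s` are assumptions for illustration.

```python
import time
from itertools import product


def install(run, sizes, block_sizes, procs, budget_s=60.0):
    """Time run(n, b, p) for every candidate and keep the fastest APs per size.

    `run` is a caller-supplied function that executes the routine and returns
    its execution time. Once `budget_s` seconds of installation time have been
    spent, the remaining candidates are skipped, but every size still gets at
    least one execution.
    """
    deadline = time.monotonic() + budget_s
    config = {}
    for n in sizes:
        best = None
        for b, p in product(block_sizes, procs):
            if best is not None and time.monotonic() > deadline:
                break
            elapsed = run(n, b, p)
            if best is None or elapsed < best[0]:
                best = (elapsed, b, p)
        config[n] = {"b": best[1], "p": best[2]}
    return config
```

The returned dictionary has the shape of the Configuration File: one entry per installed size with the APs that gave the lowest measured time.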
28
Application Example
- LAPR: Gaussian elimination
  - Message-passing with MPI
  - Logical Ring, rowwise block-cyclic striped partitioning
- Platform: networks of processors (heterogeneous system)
29
Application Example: Pre-installing.
1º Determine the APs
- Logical ring, rowwise block-cyclic striped partitioning
- p: number of processors
- b: block size for the data distribution
- Different block sizes in heterogeneous systems
[Figure: rows distributed cyclically in blocks of sizes b_0, b_1, b_2, b_0, b_1, b_2, ...]
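The rowwise block-cyclic distribution with a different block size per processor can be sketched as follows. The function name `row_owners` is hypothetical, and the sketch only computes which processor owns each row; it does not move any data.

```python
def row_owners(n, block_sizes):
    """Assign n rows cyclically: processor i receives block_sizes[i] rows per turn."""
    if sum(block_sizes) <= 0:
        raise ValueError("at least one block size must be positive")
    owners = []
    while len(owners) < n:
        for proc, b in enumerate(block_sizes):
            # The last block of the matrix may be shorter than b.
            take = min(b, n - len(owners))
            owners.extend([proc] * take)
    return owners


# 10 rows over 3 processors with block sizes b_0 = 2, b_1 = 1, b_2 = 3.
print(row_owners(10, [2, 1, 3]))   # [0, 0, 1, 2, 2, 2, 0, 0, 1, 2]
```

With equal block sizes this reduces to the usual homogeneous block-cyclic distribution.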
30
Application Example: Pre-installing.
2º Decide heuristics to reduce execution time in the installation process
- Execution time varies in a continuous way with the problem size and the APs
- Consider the system as homogeneous
- Installation can finish:
  - When analytical and experimental predictions coincide
  - When a certain time has been spent on the installation
31
Application Example: Installing
Homogeneous Systems:
  3º The manager decides the problem sizes
  4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize T_exec
Heterogeneous Systems:
  3º The manager decides the problem sizes
  4º Execute:
    - Write a Configuration File: for each n, the APs that minimize T_exec
    - Write a Speed File, with the relative speeds of the processors in the system
32
Application Example: Installation Routines
- RI-THE: obtains p and b from the formula.
- RI-HOM: obtains p and b through a reduced number of executions.
- RI-HET:
  1º As RI-HOM.
  2º Obtains b_i for each processor.
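The slides do not state how RI-HET derives each b_i. One plausible rule, assumed here, is to scale the homogeneous block size b by each processor's speed relative to the slowest processor, using the relative speeds recorded at installation. The function name `block_sizes_from_speeds` is hypothetical.

```python
def block_sizes_from_speeds(b, speeds):
    """Scale a base block size b by each processor's speed relative to the slowest."""
    slowest = min(speeds)
    # Every processor gets at least one row per turn.
    return [max(1, round(b * s / slowest)) for s in speeds]


# Base block size 8; the second processor is twice as fast as the others.
print(block_sizes_from_speeds(8, [1.0, 2.0, 1.0]))   # [8, 16, 8]
```

Faster processors then receive proportionally more rows per cycle of the block-cyclic distribution.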
33
Application Example: Systems
Three different configurations:
- PLA_HOM: 5 SUN Ultra-1
- PLA_HYB: 5 SUN Ultra-1, 1 SUN Ultra-5
- PLA_HET: 1 SUN Ultra-1, 1 SUN Ultra-5, 1 SUN Ultra-1 (manages the file system)
34
Application Example: Executing
[Figure: experimental results in PLA_HOM. Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time]
35
Application Example: Executing
[Figure: experimental results in PLA_HYB. Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time]
36
Application Example: Executing
[Figure: experimental results in PLA_HET. Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time]
37
Comparison
Two techniques for automatic tuning of Parallel Linear Algebra Routines:
  1. Analytical Modelling: for predictable systems (homogeneous, static, ...) like Origin 2000
  2. Exhaustive Execution: for less predictable systems (heterogeneous, dynamic, ...) like networks of workstations
- Transparent to the user
- Execution close to the optimum
39
Validation with the LU factorization
To validate the methodology it is necessary to experiment with:
- More routines: block LU factorization
- More systems:
  - Architectures: IBM SP2 and Origin 2000
  - Libraries: reference BLAS, machine BLAS, ATLAS
40
Sequential LU
Analytical Model: T_exec = f(SPs, n, APs)
- SPs: cost of arithmetic operations of different levels: k_1, k_2, k_3
- APs: block size b
[Figure: blocked LU scheme with block size b]
41
Sequential LU. Comparison in IBM SP2
[Figure: quotient between different execution times and the optimum execution time]
42
Sequential LU. Model execution time/optimum execution time
[Figure: quotient between the execution time with the parameters provided by the model and the optimum execution time, with different basic libraries. In SUN 1]
43
Parallel LU
Analytical Model: T_exec = f(SPs, n, APs)
- SPs:
  - cost of arithmetic operations: k_1, k_2, k_3
  - cost of communications: t_s, t_w
- APs: block size b, number of processors p, grid configuration r × c
[Figure: block-cyclic distribution of blocks of size b over a 2 × 3 logical grid of processors 00, 01, 02, 10, 11, 12]
44
Parallel LU. Comparison in IBM SP2
[Figure: quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors]
45
Parallel LU. Comparison in Origin 2000
[Figure: quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors]
46
Parallel LU. Conclusions
- The modelling of the algorithm provides satisfactory results in different systems:
  - Origin 2000, IBM SP2
  - reference BLAS, machine BLAS, ATLAS
- The prediction is worse in some cases:
  - When the number of processors increases
  - In multicomputers where communications are more important (IBM SP2)
- For these cases: Exhaustive Executions
47
Parallel LU. Exhaustive Execution
- If the manager installs the routine for sizes 512, 1536, 2560, and executions are performed for sizes 1024, 2048, 3072, the execution time is well predicted
- The same policy can be used in the installation of other software
[Figure: quotient between the execution time with the parameters provided by the installation process and the optimum execution time. With ScaLAPACK, in IBM SP2]
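The slides do not say how APs are chosen for a size that was not installed. A simple policy, assumed here, is to reuse the APs of the closest installed size. The function name `aps_for_size` and the AP values in the example are hypothetical; only the problem sizes come from the slide.

```python
def aps_for_size(config, n):
    """Return the APs stored for the installed size closest to n.

    Ties are broken in favour of the smaller installed size.
    """
    nearest = min(config, key=lambda installed: (abs(installed - n), installed))
    return config[nearest]


# Installed sizes from the slide; the AP values are made up for illustration.
config = {512: {"b": 32}, 1536: {"b": 64}, 2560: {"b": 128}}
print(aps_for_size(config, 3072))   # {'b': 128}
```

Because execution time is assumed to vary continuously with the problem size and the APs, parameters from a nearby installed size can be expected to remain close to optimal.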
48
Conclusions
- Parameterisation of Parallel Linear Algebra Routines enables development of Automatically Tuned Software
- Two techniques can be used:
  - Analytical Modelling
  - Exhaustive Executions
  - or a combination of both
- Experiments performed in different systems and with different routines
49
Future Works
- We try to develop a methodology valid for a wide range of systems, and to include it in the design of linear algebra libraries:
  - It is necessary to analyse the methodology in more systems and with more routines
  - Architecture of an Automatically Tuned Linear Algebra Library
- At the moment we are analysing routines individually, but it could be preferable to analyse algorithmic schemes
50
Architecture of an Automatically Tuned Linear Algebra Library
[Diagram. Components: Installation routines, Installation file, Basic routines library, Basic routines declaration, SP file, AP file, Library. Stages: Installation, Compilation. Roles: designer, manager]