Published by Germain Barrette. Modified over 6 years ago.
1
Advances in the Optimization of Parallel Routines (I)
Domingo Giménez Departamento de Informática y Sistemas Universidad de Murcia, Spain dis.um.es/~domingo 07 November 2018 Universidad Politécnica de Valencia
2
Outline
A little history; Modelling Linear Algebra Routines; Installation routines; Autotuning routines; Modifications to libraries’ hierarchy; Polylibraries; Algorithmic schemes; Heterogeneous systems; Hybrid programming; Peer to peer computing
3
Collaborations and autoreferences
Modelling Linear Algebra Routines
+ J. Cuenca + J. González: Modelling the Behaviour of Linear Algebra Algorithms with Message-passing. 2001; Towards the Design of an Automatically Tuned Linear Algebra Library. 2002
+ J. Cuenca + L. P. García + J. González + A. Vidal: Empirical Modelling of Parallel Linear Algebra Routines. 2003
4
Collaborations and autoreferences
Installation routines
+ G. Carrillo: Installation routines for linear algebra libraries on LANs. 2000
+ G. Carrillo + J. Cuenca + J. González: Optimización automática de rutinas paralelas de álgebra lineal (Automatic optimization of parallel linear algebra routines). 2000
5
Collaborations and autoreferences
Autotuning routines
+ J. Cuenca + J. González: Automatic parameterization of parallel linear algebra routines. 2001
+ J. Cuenca: Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines. 2002
6
Collaborations and autoreferences
Modifications to the libraries’ hierarchy
+ J. Cuenca + J. González: Architecture of an Automatic Tuned Linear Algebra Library
7
Collaborations and autoreferences
Polylibraries
+ P. Alberti + P. Alonso + J. Cuenca + A. Vidal: Designing Polylibraries to Speed Up Parallel Computations. 2003
8
Collaborations and autoreferences
Algorithmic schemes
+ J. P. Martínez: Automatic Optimization in Parallel Dynamic Programming Schemes. 2004
9
Collaborations and autoreferences
Heterogeneous systems
+ J. Cuenca + J. Dongarra + J. González + K. Roche: Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
+ J. Cuenca + J. P. Martínez: Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004
11
A little history
Parallel optimization in the past: hand-optimization for each platform. Time consuming; incompatible with hardware evolution; incompatible with changes in the system (architecture and basic libraries); unsuitable for systems with variable workloads; misuse by non-expert users
12
A little history
Initial solutions to this situation: problem-specific solutions; polyalgorithms; installation tests
13
A little history
Problem-specific solutions: Brewer (1994): sorting algorithms, differential equations; Frigo (1997): FFTW, the Fastest Fourier Transform in the West; LAWRA (1997): Linear Algebra With Recursive Algorithms
14
A little history
Polyalgorithms: Brewer; FFTW; PHiPAC (1997): linear algebra
15
A little history
Installation tests: ATLAS (2001): dense linear algebra, sequential; Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm; I-LIB (2000): some parallel linear algebra routines
16
A little history
Parallel optimization today: optimization based on computational kernels; systematic development of routines; auto-optimization of routines; middleware for auto-optimization
17
A little history
Optimization based on computational kernels: efficient kernels (BLAS) and algorithms based on these kernels; auto-optimization of the basic kernels (ATLAS)
18
A little history
Systematic development of routines: FLAME project (R. van de Geijn + E. Quintana + …): dense linear algebra, based on object-oriented design. LAWRA: for shared-memory systems
19
A little history
Auto-optimization of routines. At installation time: ATLAS (Dongarra + Whaley); I-LIB (Kanada + Katagiri + Kuroda); SOLAR (Cuenca + Giménez + González); LFC (Dongarra + Roche). At execution time: solve a reduced problem in each processor (Kalinov + Lastovetsky); use a system evaluation tool (NWS)
20
A little history
Middleware for auto-optimization: LFC: middleware for dense linear algebra software in clusters. Hierarchy of autotuning libraries: include installation routines in the libraries, to be used in the development of higher-level libraries. FIBER: proposal of general middleware; evolution of I-LIB. mpC: for heterogeneous systems
21
A little history
Parallel optimization in the future?: skeletons and languages; heterogeneous and variable-load systems; distributed systems; P2P computing
22
A little history
Skeletons and languages: develop skeletons for parallel algorithmic schemes together with execution time models, and provide users with them as libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa)
23
A little history
Heterogeneous and variable-load systems: heterogeneous algorithms: unbalanced distribution of data (static or dynamic); homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic); variable-load systems treated as dynamic heterogeneous systems
24
A little history
Distributed systems: intrinsically heterogeneous and variable-load; very high cost of communications; special middleware is necessary (Globus, NWS); there can be servers to attend to client queries
25
A little history
P2P computing: users can join and leave dynamically; all users are of the same type (initially); the system is distributed, heterogeneous and variable-load; but special middleware is necessary
27
Modelling Linear Algebra Routines
It is necessary to predict the execution time accurately and to select: the number of processes; the number of processors; which processors; the number of rows and columns of processes (the topology); the assignment of processes to processors; the computational block size (in linear algebra algorithms); the communication block size; the algorithm (polyalgorithms); the routine or library (polylibraries)
28
Modelling Linear Algebra Routines
Cost of a parallel program: $t = t_{arith} + t_{comm} + t_{over} - t_{overlap}$, where $t_{arith}$ is the arithmetic time; $t_{comm}$ is the communication time; $t_{over}$ is the overhead for synchronization, imbalance, process creation, ...; and $t_{overlap}$ is the overlapping of communication and computation
29
Modelling Linear Algebra Routines
Estimation of the time: considering computation and communication divided into a number of steps: For each part of the formula, the value taken is that of the process which gives the highest value.
30
Modelling Linear Algebra Routines
The time depends on the problem size (n) and the system size (p): But also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors
31
Modelling Linear Algebra Routines
And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (tc), the start-up time (ts) and the word-sending time (tw)
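A minimal sketch of how the three system parameters are commonly combined, assuming the usual linear model in which a message of m words costs ts + m·tw and each arithmetic operation costs tc. The function names and the numeric values are illustrative assumptions, not taken from the slides.

```python
def comm_time(words, ts, tw):
    """Cost of sending one message: start-up time plus per-word time."""
    return ts + words * tw


def arith_time(operations, tc):
    """Cost of a number of arithmetic operations at tc each."""
    return operations * tc


# Hypothetical parameter values, all in the same time unit.
TS, TW, TC = 60.0, 0.7, 0.005

# Cost of sending a 1000-word message and of 2 * 100**3 operations.
message_cost = comm_time(1000, TS, TW)
compute_cost = arith_time(2 * 100**3, TC)
```

With these made-up values the start-up term dominates short messages, which is why ts and tw are kept as separate parameters.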
32
Modelling Linear Algebra Routines
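LU factorisation (Golub - Van Loan), with the matrix partitioned into 3 × 3 blocks:
$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ & U_{22} & U_{23} \\ & & U_{33} \end{pmatrix}$$
Step 1: $A_{11} = L_{11} U_{11}$ (LU factorisation without blocks)
Step 2: $L_{11} U_{1j} = A_{1j}$, $j = 2, 3$ (multiple lower triangular systems)
Step 3: $L_{i1} U_{11} = A_{i1}$, $i = 2, 3$ (multiple upper triangular systems)
Step 4: $A_{ij} \leftarrow A_{ij} - L_{i1} U_{1j}$, $i, j = 2, 3$ (update south-east blocks)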
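The four steps listed above (factor the diagonal block, solve the two families of triangular systems, update the south-east blocks) can be sketched in plain Python as follows. This is an illustrative sequential version without pivoting, working in place on a list of lists; the function names are hypothetical, and a real library would call BLAS/LAPACK kernels for each step.

```python
def lu_unblocked(A, lo, hi):
    """Step 1: in-place LU (no pivoting) of the diagonal block A[lo:hi][lo:hi]."""
    for k in range(lo, hi):
        for i in range(k + 1, hi):
            A[i][k] /= A[k][k]
            for j in range(k + 1, hi):
                A[i][j] -= A[i][k] * A[k][j]


def block_lu(A, b):
    """In-place block LU with block size b; L has a unit diagonal."""
    n = len(A)
    for lo in range(0, n, b):
        hi = min(lo + b, n)
        lu_unblocked(A, lo, hi)
        # Step 2: solve L11 * U12 = A12 (lower triangular systems).
        for j in range(hi, n):
            for i in range(lo, hi):
                for k in range(lo, i):
                    A[i][j] -= A[i][k] * A[k][j]
        # Step 3: solve L21 * U11 = A21 (upper triangular systems).
        for i in range(hi, n):
            for j in range(lo, hi):
                for k in range(lo, j):
                    A[i][j] -= A[i][k] * A[k][j]
                A[i][j] /= A[j][j]
        # Step 4: update the south-east blocks, A22 -= L21 * U12.
        for i in range(hi, n):
            for j in range(hi, n):
                for k in range(lo, hi):
                    A[i][j] -= A[i][k] * A[k][j]
    return A
```

After the call, the strictly lower part of A holds L (its unit diagonal is implicit) and the upper part holds U. Changing b changes only the order of the operations, not the factors, which is what makes b a pure tuning parameter.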
33
Modelling Linear Algebra Routines
The execution time is: If the blocks are of size 1, all the operations are on individual elements, but if the block size is b the cost is: with k3 and k2 the costs of operations performed with BLAS 3 or BLAS 2
34
Modelling Linear Algebra Routines
But the cost of different operations of the same level is different, and the theoretical cost could be better modelled as: Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...
35
Modelling Linear Algebra Routines
The value of each System Parameter can depend on the problem size (n) and on the values of the Algorithmic Parameters (b). The formula has the form: What we want is to obtain the values of the AP with which the lowest execution time is obtained
36
Modelling Linear Algebra Routines
The values of the System Parameters could be obtained: with installation routines associated with each linear algebra routine; from information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization; at execution time, by testing the system conditions prior to the call to the routine
37
Modelling Linear Algebra Routines
These values can be obtained as simple values (traditional method) or as functions of the Algorithmic Parameters. In the latter case, a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored. When a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size, and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time
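The table lookup described above can be sketched as follows. The table contents, the function names and, in particular, the cost model are assumptions made for illustration: `toy_model` is a stand-in and not the formula from the slides.

```python
def nearest_size(stored_sizes, n):
    """Return the stored problem size closest to the real size n."""
    return min(stored_sizes, key=lambda s: abs(s - n))


def toy_model(n, b, sp):
    """Stand-in cost model (NOT the slides' formula): level 3 plus level 2 work."""
    return (2.0 / 3.0) * n**3 * sp["k3"] + b * n**2 * sp["k2"]


def select_block_size(sp_table, n, model=toy_model):
    """Pick the block size whose stored System Parameters predict the lowest time."""
    per_block = sp_table[nearest_size(sp_table.keys(), n)]
    return min(per_block, key=lambda b: model(n, b, per_block[b]))


# Hypothetical installation data: sp_table[n][b] -> System Parameter values.
SP_TABLE = {
    1024: {16: {"k3": 0.0040, "k2": 0.020},
           32: {"k3": 0.0030, "k2": 0.020},
           64: {"k3": 0.0025, "k2": 0.050}},
    2048: {16: {"k3": 0.0040, "k2": 0.020},
           32: {"k3": 0.0030, "k2": 0.020},
           64: {"k3": 0.0022, "k2": 0.020}},
}

best_b = select_block_size(SP_TABLE, 1000)  # uses the values stored for n = 1024
```

Storing the parameters per (n, b) pair is what lets the selected block size change with the problem size, instead of being fixed once per platform.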
38
Modelling Linear Algebra Routines
Parallel block LU factorisation [figure: the matrix, the processors, and the distribution of computations in the first step]
39
Modelling Linear Algebra Routines
Distribution of computations on successive steps [figure: second step, third step]
40
Modelling Linear Algebra Routines
The cost of parallel block LU factorisation: Tuning Algorithmic Parameters: block size: b; 2D mesh of p processors: p = r × c, d = max(r, c). System Parameters: cost of arithmetic operations: k2,getf2, k3,trsmm, k3,gemm; communication parameters: ts, tw
41
Modelling Linear Algebra Routines
The cost of parallel block QR factorisation: Tuning Algorithmic Parameters: block size: b; 2D mesh of p processors: p = r × c. System Parameters: cost of arithmetic operations: k2,geqr2, k2,larft, k3,gemm, k3,trmm; communication parameters: ts, tw
42
Modelling Linear Algebra Routines
The same basic operations appear repeatedly in different higher-level routines: the information generated for one routine (say, LU) could be stored and used for other routines (e.g. QR), and a common format is necessary to store the information
44
Modelling Linear Algebra Routines
Parallel QR factorisation on IBM-SP2, 8 processors [figure: execution time in seconds (0 to 80) against problem size (512 to 3584), with three curves: mean, model, optimum]. “mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time which could be obtained by a non-expert user). “optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters. “model” is the execution time with the values selected with the model
45
Modelling Linear Algebra Routines
Parameter selection for the QR algorithm (columns: n; then b, r, c for p=4; then b, r, c for p=8)
IBM SP2: 1024 16 1 4 8 2048 32 3072 2 4096 -
Origin 2000: 1024 32 4 1 2 2048 64 3072 4096 -
Network of Pentium III with Fast Ethernet:
n      p=4: b r c    p=8: b r c
1024   16 1 4        16 1 8
2048   16 1 4        16 1 8
3072   32 1 4        32 1 8
4096   32 1 4        32 1 8
47
Installation Routines
In the formulas (parallel block LU factorisation), the values of the System Parameters (k2,getf2, k3,trsmm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c)
48
Installation Routines
This is done by running, at installation time, Installation Routines associated with the linear algebra routine, and storing the information generated to be used at running time. Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed
49
Installation Routines
k3,gemm is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c). Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes
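A rough sketch of such an estimation routine, under stated assumptions: a plain Python triple loop stands in for the BLAS matrix-matrix multiplication that a real installation routine would time, and `estimate_k3` with its arguments is a hypothetical name. It times an (m × b) by (b × q) update, with m playing the role of n/r and q the role of n/c, and divides by the 2·m·b·q floating-point operations performed.

```python
import time


def estimate_k3(m, b, q, repeats=3):
    """Estimate the time per floating-point operation of an (m x b) * (b x q) update."""
    A = [[1.0] * b for _ in range(m)]
    B = [[1.0] * q for _ in range(b)]
    best = float("inf")
    for _ in range(repeats):
        C = [[0.0] * q for _ in range(m)]
        start = time.perf_counter()
        for i in range(m):
            row_a, row_c = A[i], C[i]
            for k in range(b):
                a, row_b = row_a[k], B[k]
                for j in range(q):
                    row_c[j] += a * row_b[j]
        # Keep the fastest repetition to reduce the effect of system noise.
        best = min(best, time.perf_counter() - start)
    return best / (2.0 * m * b * q)


# Example call for n = 64 on a hypothetical 2 x 2 mesh with block size 8.
k3_estimate = estimate_k3(64 // 2, 8, 64 // 2)
```

Running the routine for several (n, b, r, c) combinations and storing the results gives the kind of table of parameter values that the slides describe.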
50
Installation Routines
k3,trsmm: two multiple triangular systems are solved, one upper triangular of size b × n/c, and another lower triangular of size n/r × b. Thus, two parameters are estimated, one depending on n, b and c, and the other depending on n, b and r. As for the previous parameter, values can be obtained for different problem sizes
51
Installation Routines
k2,getf2 corresponds to a level 2 sequential LU factorisation of size b × b. At installation time each of the basic routines is executed varying the values of the parameters it depends on, with representative values (selected by the routine designer or the system manager), and the information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation
52
Installation Routines
ts and tw appear in communications of three types. In one of them a block of size b × b is broadcast in a row, and this parameter depends on b and c. In another a block of size b × b is broadcast in a column, and the parameter depends on b and r. In the third, blocks of sizes b × n/c and n/r × b are broadcast in each of the columns and rows of processors; these parameters depend on n, b, r and c
53
Installation Routines
In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed. The routine designer also designs the installation process, and can draw on his or her experience to guide the installation. The basic installation process can be designed to allow the intervention of the system manager.
54
Installation Routines
Some results in different systems (physical and logical platform) Values of k3_DTRMM (≈ k3_DGEMM) on the different platforms (in microseconds) Block size System n 16 32 64 128 SUN1 refBLAS macBLAS ATLAS 512,.., 4096 0.0200 0.0120 0.0070 0.0110 0.0060 0.0220 0.0280 SUN5 0.0040 0.0130 0.0050 0.0032 0.0140 0.0025 0.0150 PIII 0.0038 0.0033 0.0030 PPC 0.0023 0.0019 0.0018 R10K 07 November 2018 Universidad Politécnica de Valencia
55
Installation Routines
Values of k2_DGEQR2 (≈ k2_DLARFT) on the different platforms (in microseconds) Block size System n 16 32 64 128 SUN1 refBLAS macBLAS ATLAS 512,.., 4096 0.0200 0.0500 0.0700 SUN5 0.0050 0.0300 PIII 0.0150 PPC 0.0100 R10K 0.0250 07 November 2018 Universidad Politécnica de Valencia
56
Installation Routines
Typically the values of the communication parameters are well estimated with a ping-pong.
Table columns: System; library; n (512, ..., 4096); block size (16, 32, 64, 128). Values:
cSUN1, MPICH: 170 / 7.0
cPIII: 60 / 0.7
IBM-SP2, Mac-MPI: 75 / 0.3
Origin 2K: 20 / 0.1
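One way to turn ping-pong measurements into the two communication parameters is a least-squares fit of the linear model time = ts + words · tw. The sketch below assumes that one-way times (half of each measured round trip) are already available for several message sizes; the function name and the sample data are synthetic illustrations, not measurements from the slides.

```python
def fit_ts_tw(samples):
    """Least-squares fit of time = ts + words * tw over (words, time) pairs."""
    count = len(samples)
    sum_w = sum(w for w, _ in samples)
    sum_t = sum(t for _, t in samples)
    sum_ww = sum(w * w for w, _ in samples)
    sum_wt = sum(w * t for w, t in samples)
    tw = (count * sum_wt - sum_w * sum_t) / (count * sum_ww - sum_w * sum_w)
    ts = (sum_t - tw * sum_w) / count
    return ts, tw


# Synthetic one-way times generated from ts = 60 and tw = 0.7.
SAMPLES = [(1, 60.7), (10, 67.0), (100, 130.0), (1000, 760.0)]
ts_est, tw_est = fit_ts_tw(SAMPLES)
```

At least two distinct message sizes are needed, otherwise the denominator of the slope is zero.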
58
Autotuning routines
Our approach: routines are parameterised with System Parameters and Algorithmic Parameters. System parameters are obtained at installation time: an analytical model of the routine and simple installation routines are used to obtain the system parameters, with a reduced number of executions at installation time. Algorithmic parameters are obtained from the analytical model, with the system parameters obtained in the installation process
59
Autotuning routines
The scheme [diagram]. DESIGN (LAR-DESIGNER): LAR; MODELLING LAR; LAR-MOD; IMPLEMEN. OF LAR-ERs; LAR-ERs. INSTALLATION (SYSTEM MANAGER): EXECUT. OF LAR-ERs; BL; LAR-SPF; OAP SELECTION; LAR-OAPF; LIBRARY INCLUSION PROCESS; LAR-IF
60
Autotuning routines
Modelling the LAR [diagram]. DESIGN (LAR-DESIGNER): LAR; MODELLING LAR; LAR-MOD
61
Autotuning routines
LAR-MOD: Analytical Model of LAR. The behaviour of the algorithm on the platform is defined: Texec = f(SPs, n, APs), with SPs = f(n, APs). SPs: System Parameters; APs: Algorithmic Parameters; n: problem size
62
Autotuning routines
LAR-MOD: Analytical Model of LAR. System Parameters (SPs): hardware platform (physical characteristics, current conditions); basic libraries. These determine the LARs performance
63
Autotuning routines
LAR-MOD: Analytical Model of LAR. System Parameters (SPs): hardware platform (physical characteristics, current conditions); basic libraries. These determine the LARs performance. Two kinds of SPs: Communication System Parameters (CSPs); Arithmetic System Parameters (ASPs)
64
Autotuning routines
LAR-MOD: Analytical Model of LAR. System Parameters (SPs): hardware platform (physical characteristics, current conditions); basic libraries. These determine the LARs performance. Two kinds of SPs: Communication System Parameters (CSPs): ts, start-up time; tw, word-sending time
65
Autotuning routines
LAR-MOD: Analytical Model of LAR. System Parameters (SPs): hardware platform (physical characteristics, current conditions); basic libraries. These determine the LARs performance. Two kinds of SPs: Communication System Parameters (CSPs); Arithmetic System Parameters (ASPs): tc, arithmetic cost; using BLAS: k1, k2 and k3
66
Autotuning routines
LAR-MOD: Analytical Model of LAR. System Parameters (SPs): hardware platform (physical characteristics, current conditions); basic libraries. These determine the LARs performance. How to estimate each SP? 1. Obtain the kernel of the performance cost of the LAR. 2. Make an Estimation Routine from this kernel
67
Autotuning routines
Design [diagram]. DESIGN (LAR-DESIGNER): LAR; MODELLING LAR; LAR-MOD
68
Autotuning routines
Design: making the LAR-ERs [diagram]. DESIGN (LAR-DESIGNER): LAR; MODELLING LAR; LAR-MOD; IMPLEMEN. OF LAR-ERs; LAR-ERs
69
Autotuning routines
LAR-ERs: Estimation Routines. Arithmetic System Parameters (ASPs): from the computation kernel of the LAR, an Estimation Routine with a similar storage scheme and a similar quantity of data. Communication System Parameters (CSPs): from the communication kernel of the LAR, an Estimation Routine with a similar kind of communication
70
Autotuning routines
Design: the process has finished [diagram]. DESIGN (LAR-DESIGNER; hand-made, only once): LAR; MODELLING LAR; LAR-MOD; IMPLEMEN. OF LAR-ERs; LAR-ERs
71
Autotuning routines
Installation: running the LAR-ERs [diagram]. DESIGN (LAR-DESIGNER): LAR; MODELLING LAR; LAR-MOD; IMPLEMEN. OF LAR-ERs; LAR-ERs. INSTALLATION (SYSTEM MANAGER): EXECUT. OF LAR-ERs; BL; LAR-SPF; LAR-IF
72
Autotuning routines
Installation: obtaining the OAP [diagram]. DESIGN (LAR-DESIGNER): LAR; MODELLING LAR; LAR-MOD; IMPLEMEN. OF LAR-ERs; LAR-ERs. INSTALLATION (SYSTEM MANAGER): EXECUT. OF LAR-ERs; BL; LAR-SPF; OAP SELECTION; LAR-OAPF; LAR-IF
73
Autotuning routines
Installation: obtaining the OAP. Algorithmic Parameters (APs): once the values of the SPs are known, the optimum values of the APs (OAP) are calculated: b: block size; p: number of processors; r × c: logical topology (grid configuration, logical 2D mesh)
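The OAP calculation amounts to a search over block sizes, processor counts and r × c meshes for the combination with the lowest modelled time. The sketch below assumes an exhaustive search over a small candidate set; `select_oap`, `grids` and `toy_model` are hypothetical names, and the toy model is a made-up stand-in for the analytical model, not the slides' formula.

```python
def grids(p):
    """All logical 2D meshes (r, c) with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def select_oap(n, max_procs, block_sizes, model):
    """Return (time, b, r, c) with the lowest modelled execution time."""
    best = None
    for p in range(1, max_procs + 1):
        for r, c in grids(p):
            for b in block_sizes:
                t = model(n, b, r, c)
                if best is None or t < best[0]:
                    best = (t, b, r, c)
    return best


def toy_model(n, b, r, c):
    """Made-up model: parallel work plus penalties for skewed meshes and for b != 32."""
    return n**3 / (r * c) + 1000.0 * abs(r - c) + abs(b - 32)


time_est, b_opt, r_opt, c_opt = select_oap(100, 4, [16, 32, 64], toy_model)
```

Because the model is evaluated rather than the routine executed, the search is cheap enough to enumerate every candidate.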
74
Autotuning routines
Installation: putting it all together [diagram]. DESIGN (LAR-DESIGNER): LAR; MODELLING LAR; LAR-MOD; IMPLEMEN. OF LAR-ERs; LAR-ERs. INSTALLATION (SYSTEM MANAGER): EXECUT. OF LAR-ERs; BL; LAR-SPF; OAP SELECTION; LAR-OAPF; LIBRARY INCLUSION PROCESS; LAR-IF
75
Autotuning routines
Experiments. LAR: block LU factorization. Platforms: IBM SP2, SGI Origin 2000, NoW. Basic libraries: reference BLAS, machine BLAS, ATLAS
76
Autotuning routines
LU on IBM SP2 [figure]: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters)
77
Autotuning routines
LU on Origin 2000 [figure]: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters)
78
Autotuning routines
LU on NoW [figure]: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the values of the parameters)
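The quotient reported for the three platforms can be computed as follows. This is an illustrative sketch: the function name and the timing values are invented, and the keys are (b, r, c) parameter combinations.

```python
def selection_quotient(measured_times, chosen):
    """Time with the model-selected parameters over the lowest measured time."""
    return measured_times[chosen] / min(measured_times.values())


# Hypothetical measured times in seconds for three (b, r, c) combinations.
TIMES = {(16, 1, 4): 12.0, (32, 2, 2): 10.0, (64, 4, 1): 15.0}

quotient = selection_quotient(TIMES, (16, 1, 4))
```

A quotient of 1 means the model picked the best of the tested combinations; larger values measure how much time is lost by trusting the model.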
80
Modifications to libraries’ hierarchy
In the optimization of routines, individual basic operations appear repeatedly: LU: QR:
81
Modifications to libraries’ hierarchy
The information generated to install a routine could be used for a different routine without additional experiments: ts and tw are obtained when the communication library (MPI, PVM, …) is installed; k3,gemm is obtained when the basic computational library (BLAS, ATLAS, …) is installed
82
Modifications to libraries’ hierarchy
To be determined: the type of experiments necessary for the different routines in the library (ts and tw: obtained with ping-pong, broadcast, …? k3,gemm: obtained for small block sizes, …?); the format in which the data will be stored, to facilitate their use when installing other routines
83
Modifications to libraries’ hierarchy
The method could be valid not only for one library (the one I am developing) but also for other libraries that I or somebody else will develop in the future: the type of experiments and the format in which the data will be stored must be decided by the Parallel Linear Algebra Community … and the typical hierarchy of libraries would change
84
Modifications to libraries’ hierarchy
Typical hierarchy of Parallel Linear Algebra libraries [diagram, top to bottom]: ScaLAPACK; PBLAS; LAPACK and BLACS; BLAS and Communications
85
Modifications to libraries’ hierarchy
To include installation information in the lowest levels of the hierarchy [diagram, top to bottom]: ScaLAPACK; PBLAS; LAPACK and BLACS; BLAS and Communications, each with Self-Optimisation Information
86
Modifications to libraries’ hierarchy
When installing libraries at a higher level this information can be used, and new information is generated … [diagram, top to bottom]: ScaLAPACK; PBLAS; LAPACK and BLACS, each with Self-Optimisation Information; BLAS and Communications, each with Self-Optimisation Information
87
Modifications to libraries’ hierarchy
And so on at higher levels [diagram, top to bottom]: ScaLAPACK; PBLAS; LAPACK and BLACS; BLAS and Communications; every library with its own Self-Optimisation Information
88
Modifications to libraries’ hierarchy
And new libraries with autotuning capacity could be developed [diagram, top to bottom]: PDE Solver, Least Square Problem and Inverse Eigenvalue Problem; ScaLAPACK; PBLAS; LAPACK and BLACS; BLAS and Communications; every library with its own Self-Optimisation Information
89
Modifications to libraries’ hierarchy
Movement of information between routines in the different levels of the hierarchy [diagram]. GETRF from LAPACK (level 1): GETRF_manager; Model; GETRF { }; k3_information. GEMM from BLAS (level 0): GEMM_manager; GEMM
90
Modifications to libraries’ hierarchy
Movement of information between routines in the different levels of the hierarchy [figure]
91
Modifications to libraries’ hierarchy
Movement of information between routines in the different levels of the hierarchy [figure]
92
Modifications to libraries’ hierarchy
Movement of information between routines in the different levels of the hierarchy [figure]
93
Modifications to libraries’ hierarchy
Architecture of a Self Optimized Linear Algebra Routine manager [diagram]. SOLAR_manager contains: LAR(n, AP) { ... }; Model: Texec = f(SP, AP, n), SP = f(AP, n); Optimum_AP (AP0). Installation_information: one manager per System Parameter (SP1_manager ... SPt_manager), each holding SP_information with Installation_SP_values (SP values tabulated for problem sizes n1 ... nw and Algorithmic Parameters AP1 ... APz) and Current_SP_values (SP values for the current size nc and AP1 ... APz). Current_system_information: Current_problem_size (nc); Current_CPUs_availability (%CPU1 ... %CPUp); Current_network_availability (%net values between processors, up to %netp-p)
94
Modifications to libraries’ hierarchy
Life cycle of a SOLAR [diagram]. DESIGN: modelling the Linear Algebra Routine (LAR). INSTALLATION: obtaining information from the system. RUN-TIME: selection of parameter values; execution of the LAR.
Slide 94: Our approach. Our approach is to take each LAR and do the following for it. First of all, in a design phase, an analytical model is built: the execution time of an algorithm can be modelled by means of a function depending on the problem size and on the system and algorithmic parameters. We do not consider the SPs as constants of each platform. For a better modelling process, the values of the SPs are considered as functions: they depend on the values of the algorithmic parameters and the problem size. Algorithmic parameters are parameters whose value is chosen at execution time and which influence the execution time. Typical algorithmic parameters may be: the arithmetic block size (in algorithms by blocks, as is usual in dense linear algebra); the communication block size; the number of processors to use; the logical topology in parallel algorithms; and, if the LAR calls routines of more basic libraries, the basic library to use. In the installation phase, the values of the SPs for the current platform are estimated (measured). In the run-time phase, what we intend is a method to obtain automatically the values of the algorithmic parameters (depending on the problem size and the SPs) which provide the lowest execution time.
95
DESIGN PROCESS
DESIGN: LAR
LAR: Linear Algebra Routine. Made by the LAR designer. Example of LAR: parallel block LU factorisation.
Slide 95: Let us now look at each phase of the process. First, in the design phase, we have the LAR written by the LAR designer. This routine does not need to be modified. As an example, take a parallel block version of the LU factorisation, like the PGETRF routine of ScaLAPACK.
96
Modelling the LAR
DESIGN: LAR -> Modelling the LAR -> MODEL
Slide 96: First, the analytical model of the LAR is built.
97
Modelling the LAR
DESIGN: LAR -> Modelling the LAR -> MODEL
Made by the LAR designer. Only once per LAR.
SP: System Parameters. AP: Algorithmic Parameters. n: Problem size.
MODEL: Texec = f(SP, AP, n)
Slide 97: The LAR designer, or another expert with enough knowledge of the LAR, has to build an analytical model, that is, a formula giving the execution time of the routine as a function of the problem size, some system parameters, and some algorithmic parameters whose values can be chosen at execution time.
The system parameters reflect the characteristics of the system: not only the physical characteristics of the hardware but also its current conditions. The basic libraries that the algorithm uses for its basic operations also influence the performance of the linear algebra routine.
The basic arithmetic system parameters are the arithmetic costs for the different levels of basic linear algebra routines, and the basic communication system parameters are the start-up time and the word-sending time.
98
Modelling the LAR
DESIGN: LAR -> Modelling the LAR -> MODEL
LAR: Parallel Block LU factorisation
SP: k3, k2, ts, tw
AP: p = r x c, b
n: Problem size
Slide 98: For the LU factorisation routine, the analytical model consists of one formula for the arithmetic time and another for the communication time.
The system parameters that appear are:
k3: the cost of a level 3 arithmetic operation
k2: the cost of a level 2 arithmetic operation
tw: the word-sending time (the inverse of the bandwidth)
ts: the start-up time (the latency)
The algorithmic parameters are:
p: the number of processors to use
b: the arithmetic block size
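The two formulas themselves did not survive in this transcript, so they are not reproduced here. To show how such a model is used, the following Python sketch encodes a deliberately simplified cost model with the same parameters (k3, k2, ts, tw, r, c, b, n). The functional forms in t_arith and t_comm are assumptions made for illustration only, not the formulas from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemParams:
    k3: float  # cost of one level 3 arithmetic operation (seconds)
    k2: float  # cost of one level 2 arithmetic operation (seconds)
    ts: float  # start-up time of a message (seconds)
    tw: float  # word-sending time (seconds per word)


@dataclass(frozen=True)
class AlgorithmicParams:
    r: int  # rows of the logical processor grid
    c: int  # columns of the logical processor grid
    b: int  # arithmetic block size

    @property
    def p(self) -> int:
        """Number of processors, p = r x c."""
        return self.r * self.c


def t_arith(sp: SystemParams, ap: AlgorithmicParams, n: int) -> float:
    # ASSUMED form: the (2/3) n^3 level 3 operations are shared by p processors;
    # the level 2 panel work grows with b * n^2 and is shared by the r grid rows.
    return (2.0 * n**3 / (3.0 * ap.p)) * sp.k3 + (ap.b * n**2 / ap.r) * sp.k2


def t_comm(sp: SystemParams, ap: AlgorithmicParams, n: int) -> float:
    # ASSUMED form: n / b steps, each with broadcasts along grid rows and columns.
    steps = n / ap.b
    return steps * (ap.r + ap.c) * sp.ts + (n**2 * (ap.r + ap.c) / ap.p) * sp.tw


def t_exec(sp: SystemParams, ap: AlgorithmicParams, n: int) -> float:
    """Texec = f(SP, AP, n) as the sum of arithmetic and communication time."""
    return t_arith(sp, ap, n) + t_comm(sp, ap, n)


if __name__ == "__main__":
    sp = SystemParams(k3=2.5e-9, k2=1.0e-8, ts=1.0e-4, tw=8.0e-7)  # placeholder values
    print(t_exec(sp, AlgorithmicParams(r=2, c=2, b=32), n=1024))
```

Separating t_arith from t_comm mirrors the split described in the notes, and keeping SP and AP in distinct records makes it clear which values are measured and which are chosen.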
99
Implementation of SP-Estimators
DESIGN: LAR -> Modelling the LAR -> MODEL; Implementation of SP-Estimators -> SP-Estimators
Slide 99: The other task of the LAR designer in the design phase is to identify the highest-cost kernels of the linear algebra routine and to develop estimators (estimation routines) that estimate the values of the system parameters under the same conditions in which those operations are carried out in the linear algebra routine.
100
Implementation of SP-Estimators
DESIGN: LAR -> Modelling the LAR -> MODEL; Implementation of SP-Estimators -> SP-Estimators
Estimators of Arithmetic-SP: computation kernel of the LAR; similar storage scheme; similar quantity of data.
Estimators of Communication-SP: communication kernel of the LAR; similar kind of communication.
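As an illustration of an arithmetic SP-estimator, the sketch below times a block matrix product on data of the same size as the blocks the routine would use, then divides by the operation count to estimate k3. The pure-Python kernel and the names matmul_block and estimate_k3 are hypothetical stand-ins. A real estimator would time the actual level 3 kernel of the installed basic library, with the storage scheme used by the LAR.

```python
import time


def matmul_block(a, b_mat, c, size):
    """Stand-in computation kernel: C += A * B on size x size blocks (lists of rows)."""
    for i in range(size):
        a_row, c_row = a[i], c[i]
        for k in range(size):
            a_ik = a_row[k]
            b_row = b_mat[k]
            for j in range(size):
                c_row[j] += a_ik * b_row[j]


def estimate_k3(block_size, repeats=3):
    """Estimate k3 (seconds per floating-point operation) for one block size.

    The kernel is run several times and the fastest run is kept, which
    reduces the influence of other activity on the machine.
    """
    a = [[1.0] * block_size for _ in range(block_size)]
    b_mat = [[2.0] * block_size for _ in range(block_size)]
    best = float("inf")
    for _ in range(repeats):
        c = [[0.0] * block_size for _ in range(block_size)]
        start = time.perf_counter()
        matmul_block(a, b_mat, c, block_size)
        best = min(best, time.perf_counter() - start)
    flops = 2 * block_size**3  # one multiply and one add per innermost iteration
    return best / flops


if __name__ == "__main__":
    for size in (16, 32, 64):
        print(size, estimate_k3(size))
```

Running the estimator for several block sizes gives k3 as a function of b rather than a single constant, which is the behaviour the model expects from its system parameters.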
101
INSTALLATION PROCESS
DESIGN: LAR -> Modelling the LAR -> MODEL; Implementation of SP-Estimators -> SP-Estimators
INSTALLATION: Installation process. Only once per platform. Done by the system manager.
102
Estimation of Static-SP
DESIGN: LAR -> Modelling the LAR -> MODEL; Implementation of SP-Estimators -> SP-Estimators
INSTALLATION: Basic Libraries + Installation-File -> Estimation of Static-SP -> Static-SP-File
103
Estimation of Static-SP
DESIGN: LAR -> Modelling the LAR -> MODEL; Implementation of SP-Estimators -> SP-Estimators
INSTALLATION: Basic Libraries + Installation-File -> Estimation of Static-SP -> Static-SP-File
Basic Libraries:
Basic Communication Library: MPI, PVM
Basic Linear Algebra Library: reference-BLAS, machine-specific-BLAS, ATLAS
Installation File: the SP values are obtained using the information (n and AP values) in this file.
104
Estimation of Static-SP
DESIGN: LAR -> Modelling the LAR -> MODEL; Implementation of SP-Estimators -> SP-Estimators
INSTALLATION: Basic Libraries + Installation-File -> Estimation of Static-SP -> Static-SP-File
Platform: Cluster of Pentium III + Fast Ethernet. Basic Libraries: ATLAS and MPI.
[Plot: Estimation of the Static-SP k3-static (in sec) against block size]
[Plot: Estimation of the Static-SP tw-static (in sec) against message size (Kbytes)]
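The result of this step is a table of SP values indexed by the AP values at which they were measured. A minimal sketch of how such a Static-SP-File might be stored and queried follows. The JSON layout, the function names and every number are hypothetical placeholders, not the measurements plotted on the slide.

```python
import json
from bisect import bisect_left

# HYPOTHETICAL Static-SP-File contents: k3-static per block size and
# tw-static per message size in Kbytes. All values are placeholders.
STATIC_SP_JSON = """
{
  "k3_static": {"16": 4.0e-9, "32": 3.0e-9, "64": 2.5e-9, "128": 2.4e-9},
  "tw_static": {"1": 9.0e-7, "16": 8.0e-7, "256": 7.5e-7}
}
"""


def load_static_sp(text):
    """Parse the file into {name: [(x, value), ...]} with points sorted by x."""
    raw = json.loads(text)
    return {
        name: sorted((int(x), value) for x, value in table.items())
        for name, table in raw.items()
    }


def lookup(table, x):
    """Return the SP value at x by linear interpolation, clamped at both ends."""
    xs = [point[0] for point in table]
    if x <= xs[0]:
        return table[0][1]
    if x >= xs[-1]:
        return table[-1][1]
    i = bisect_left(xs, x)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


if __name__ == "__main__":
    static_sp = load_static_sp(STATIC_SP_JSON)
    print(lookup(static_sp["k3_static"], 48))   # block size between two measured points
    print(lookup(static_sp["tw_static"], 64))   # message size in Kbytes
```

Interpolating between measured points is one possible design choice; taking the nearest measured value would be a simpler alternative.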
105
RUN-TIME PROCESS
DESIGN: LAR -> Modelling the LAR -> MODEL; Implementation of SP-Estimators -> SP-Estimators
INSTALLATION: Basic Libraries + Installation-File -> Estimation of Static-SP -> Static-SP-File
RUN-TIME
106
RUN-TIME PROCESS
DESIGN: LAR -> Modelling the LAR -> MODEL; Implementation of SP-Estimators -> SP-Estimators
INSTALLATION: Basic Libraries + Installation-File -> Estimation of Static-SP -> Static-SP-File
RUN-TIME: Selection of Optimum AP -> Optimum-AP
107
RUN-TIME PROCESS
DESIGN: LAR -> Modelling the LAR -> MODEL; Implementation of SP-Estimators -> SP-Estimators
INSTALLATION: Basic Libraries + Installation-File -> Estimation of Static-SP -> Static-SP-File
RUN-TIME: Selection of Optimum AP -> Optimum-AP -> Execution of LAR
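The selection step can be sketched as an exhaustive search over candidate AP values, evaluating the model with the SP values that correspond to each candidate. Everything in the block is hypothetical: toy_model is a simplified stand-in for Texec = f(SP, AP, n), the static SP values are placeholders, and the slides do not state which search strategy is actually used.

```python
from itertools import product

# HYPOTHETICAL static SP values: k3 depends on the block size b, ts and tw are scalars.
STATIC_SP = {
    "k3": {16: 4.0e-9, 32: 3.0e-9, 64: 2.5e-9},
    "ts": 1.0e-4,
    "tw": 8.0e-7,
}


def toy_model(n, p, b, static_sp):
    """Simplified stand-in for Texec = f(SP, AP, n); not the model from the slides."""
    k3 = static_sp["k3"][b]  # the SP value is chosen according to the AP value
    arithmetic = (2.0 * n**3 / (3.0 * p)) * k3
    if p == 1:
        communication = 0.0  # a single processor exchanges no messages
    else:
        communication = (n / b) * p * static_sp["ts"] + n**2 * static_sp["tw"]
    return arithmetic + communication


def select_optimum_ap(n, processors, block_sizes, static_sp):
    """Return (predicted_time, p, b) for the candidate with the lowest modelled time."""
    best = None
    for p, b in product(processors, block_sizes):
        predicted = toy_model(n, p, b, static_sp)
        if best is None or predicted < best[0]:
            best = (predicted, p, b)
    return best


if __name__ == "__main__":
    for n in (64, 4096):
        predicted, p, b = select_optimum_ap(n, [1, 2, 4, 8], [16, 32, 64], STATIC_SP)
        print(f"n={n}: p={p}, b={b}, predicted time={predicted:.4g} s")
```

The LAR would then be called with the selected p and b. With these placeholder values the search keeps a small problem on one processor and spreads a large one over all eight, which is the kind of size-dependent decision the run-time phase is meant to automate.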