1
24 June 2015, Universidad Politécnica de Valencia
Advances in the Optimization of Parallel Routines (I)
Domingo Giménez
Departamento de Informática y Sistemas, Universidad de Murcia, Spain
dis.um.es/~domingo
2
Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing
3
Collaborations and self-references
Modelling Linear Algebra Routines
+ J. Cuenca + J. González:
- Modelling the Behaviour of Linear Algebra Algorithms with Message-passing. 2001
- Towards the Design of an Automatically Tuned Linear Algebra Library. 2002
+ J. Cuenca + L. P. García + J. González + A. Vidal:
- Empirical Modelling of Parallel Linear Algebra Routines. 2003
4
Collaborations and self-references
Installation routines
+ G. Carrillo:
- Installation routines for linear algebra libraries on LANs. 2000
+ G. Carrillo + J. Cuenca + J. González:
- Optimización automática de rutinas paralelas de álgebra lineal. 2000
5
Collaborations and self-references
Autotuning routines
+ J. Cuenca + J. González:
- Automatic parameterization of parallel linear algebra routines. 2001
+ J. Cuenca:
- Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines. 2002
6
Collaborations and self-references
Modifications to the libraries’ hierarchy
+ J. Cuenca + J. González:
- Architecture of an Automatic Tuned Linear Algebra Library. 2002 - 2004
7
Collaborations and self-references
Polylibraries
+ P. Alberti + P. Alonso + J. Cuenca + A. Vidal:
- Designing Polylibraries to Speed Up Parallel Computations. 2003
8
Collaborations and self-references
Algorithmic schemes
+ J. P. Martínez:
- Automatic Optimization in Parallel Dynamic Programming Schemes. 2004
9
Collaborations and self-references
Heterogeneous systems
+ J. Cuenca + J. Dongarra + J. González + K. Roche:
- Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
+ J. Cuenca + J. P. Martínez:
- Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004
10
Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing
11
A little history
Parallel optimization in the past:
- Hand-optimization for each platform
- Time consuming
- Incompatible with hardware evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for systems with variable workloads
- Misuse by non-expert users
12
A little history
Initial solutions to this situation:
- Problem-specific solutions
- Polyalgorithms
- Installation tests
13
A little history
Problem-specific solutions:
- Brewer (1994): Sorting Algorithms, Differential Equations
- Frigo (1997): FFTW: The Fastest Fourier Transform in the West
- LAWRA (1997): Linear Algebra With Recursive Algorithms
14
A little history
Polyalgorithms:
- Brewer
- FFTW
- PHiPAC (1997): Linear Algebra
15
A little history
Installation tests:
- ATLAS (2001): Dense Linear Algebra, sequential
- Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm
- I-LIB (2000): some parallel linear algebra routines
16
A little history
Parallel optimization today:
- Optimization based on computational kernels
- Systematic development of routines
- Auto-optimization of routines
- Middleware for auto-optimization
17
A little history
Optimization based on computational kernels:
- Efficient kernels (BLAS) and algorithms based on these kernels
- Auto-optimization of the basic kernels (ATLAS)
18
A little history
Systematic development of routines:
- FLAME project: R. van de Geijn + E. Quintana + …; Dense Linear Algebra; Based on Object Oriented Design
- LAWRA: Dense Linear Algebra; For Shared Memory Systems
19
A little history
Auto-optimization of routines:
At installation time:
- ATLAS, Dongarra + Whaley
- I-LIB, Kanada + Katagiri + Kuroda
- SOLAR, Cuenca + Giménez + González
- LFC, Dongarra + Roche
At execution time:
- Solve a reduced problem in each processor (Kalinov + Lastovetsky)
- Use a system evaluation tool (NWS)
20
A little history
Middleware for auto-optimization:
- LFC: Middleware for Dense Linear Algebra Software in Clusters
- Hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher level libraries
- FIBER: Proposal of general middleware; Evolution of I-LIB
- mpC: For heterogeneous systems
21
A little history
Parallel optimization in the future?:
- Skeletons and languages
- Heterogeneous and variable-load systems
- Distributed systems
- P2P computing
22
A little history
Skeletons and languages:
- Develop skeletons for parallel algorithmic schemes together with execution time models, and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa)
23
A little history
Heterogeneous and variable-load systems:
- Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
- Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
- Variable-load systems treated as dynamic heterogeneous systems
24
A little history
Distributed systems:
- Intrinsically heterogeneous and variable-load
- Very high cost of communications
- Special middleware is necessary (Globus, NWS)
- There can be servers that answer queries from clients
25
A little history
P2P computing:
- Users can join and leave dynamically
- All the users are of the same type (initially)
- It is distributed, heterogeneous and variable-load
- But special middleware is necessary
26
Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing
27
Modelling Linear Algebra Routines
The execution time must be predicted accurately in order to select:
- The number of processes
- The number of processors
- Which processors
- The number of rows and columns of processes (the topology)
- The assignment of processes to processors
- The computational block size (in linear algebra algorithms)
- The communication block size
- The algorithm (polyalgorithms)
- The routine or library (polylibraries)
28
Modelling Linear Algebra Routines
Cost of a parallel program:
$t = t_{arith} + t_{comm} + t_{over} - t_{overlap}$
- $t_{arith}$: arithmetic time
- $t_{comm}$: communication time
- $t_{over}$: overhead, for synchronization, imbalance, process creation, ...
- $t_{overlap}$: overlapping of communication and computation
29
Modelling Linear Algebra Routines
Estimation of the time: computation and communication are considered to be divided into a number of steps, and the time is the sum over the steps of the arithmetic and communication times. For each part of the formula, the value used is that of the process which gives the highest value.
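The step-wise estimate on this slide can be sketched as follows. This is a minimal illustration, assuming the arithmetic and communication times of every process in every step are already known; the function name `estimate_time` and the input layout are hypothetical and do not come from the slides.

```python
def estimate_time(arith, comm):
    """Estimate the parallel time as a sum over steps.

    arith[i][j] and comm[i][j] are the (assumed known) arithmetic and
    communication times of process j in step i. For each step and each
    part of the formula, the slowest process determines the cost.
    """
    total = 0.0
    for arith_step, comm_step in zip(arith, comm):
        total += max(arith_step) + max(comm_step)
    return total


# Two steps, three processes (made-up numbers, for illustration only).
arith = [[4.0, 5.0, 3.0], [2.0, 2.0, 2.5]]
comm = [[1.0, 0.5, 0.5], [0.2, 0.4, 0.3]]
print(estimate_time(arith, comm))
```

Taking the maximum separately for computation and communication gives an upper bound on the step time, which matches the "highest value for each part" reading of the slide.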
30
Modelling Linear Algebra Routines
The time depends on the problem size (n) and the system size (p): t(n, p)
But it also depends on some ALGORITHMIC PARAMETERS, such as the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors: t(n, p, b, r, c)
31
Modelling Linear Algebra Routines
And some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation ($t_c$), the start-up time ($t_s$) and the word-sending time ($t_w$).
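As a small illustration of how these three System Parameters usually enter a cost formula, the sketch below assumes the common linear model in which a message of m words costs the start-up time plus m times the word-sending time. The function names and the numeric values are hypothetical.

```python
def arithmetic_time(num_ops, t_c):
    """Time of num_ops arithmetic operations at cost t_c each."""
    return num_ops * t_c


def message_time(num_words, t_s, t_w):
    """Time to send num_words words: start-up plus per-word cost."""
    return t_s + num_words * t_w


# Made-up System Parameter values, in microseconds.
t_c, t_s, t_w = 0.003, 60.0, 0.7
b = 32
# Sending one b x b block versus b separate rows of b words:
one_message = message_time(b * b, t_s, t_w)
many_messages = b * message_time(b, t_s, t_w)
print(one_message, many_messages)
```

With a large start-up time, grouping data into fewer and longer messages is cheaper, which is one reason the communication block size appears among the parameters to select.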
32
Modelling Linear Algebra Routines
LU factorisation (Golub - Van Loan):
$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ & U_{22} & U_{23} \\ & & U_{33} \end{pmatrix}$$
- Step 1: LU factorisation without blocks
- Step 2: multiple lower triangular systems
- Step 3: multiple upper triangular systems
- Step 4: update of the south-east blocks
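The four steps can be sketched in plain Python as a right-looking block LU without pivoting. This is an illustrative sketch only: the mapping of the code to Steps 2 and 3 follows one reading of the slide, the function name `block_lu` is hypothetical, and a real implementation would call BLAS/LAPACK kernels and use pivoting.

```python
def block_lu(a, b):
    """In-place right-looking block LU without pivoting.

    a is a square matrix (list of lists) and b is the block size.
    On return the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    n = len(a)
    for k in range(0, n, b):
        e = min(k + b, n)  # end of the current diagonal block
        # Step 1: unblocked LU of the diagonal block, A11 = L11 U11.
        for j in range(k, e):
            for i in range(j + 1, e):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: blocks below the diagonal, solve L_i1 U11 = A_i1 for L_i1.
        for i in range(e, n):
            for j in range(k, e):
                for c in range(k, j):
                    a[i][j] -= a[i][c] * a[c][j]
                a[i][j] /= a[j][j]
        # Step 3: blocks right of the diagonal, solve L11 U_1j = A_1j for U_1j.
        for j in range(e, n):
            for i in range(k, e):
                for c in range(k, i):
                    a[i][j] -= a[i][c] * a[c][j]
        # Step 4: update the south-east blocks, A_ij -= L_i1 U_1j.
        for i in range(e, n):
            for j in range(e, n):
                for c in range(k, e):
                    a[i][j] -= a[i][c] * a[c][j]
    return a


# Small example matrix that needs no pivoting (made-up values).
a_orig = [[4.0, 3.0, 2.0, 1.0],
          [3.0, 4.0, 3.0, 2.0],
          [2.0, 3.0, 4.0, 3.0],
          [1.0, 2.0, 3.0, 4.0]]
lu = block_lu([row[:] for row in a_orig], 2)
```

Steps 2 to 4 are the parts that a library would perform with level 3 BLAS operations, while Step 1 works on a single b × b block.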
33
Modelling Linear Algebra Routines
If the blocks are of size 1, all the operations are with individual elements. If the block size is b, the cost is expressed in terms of $k_3$ and $k_2$, the costs of operations performed with BLAS 3 or BLAS 2.
34
Modelling Linear Algebra Routines
But the cost of different operations of the same level is different, and the theoretical cost could be better modelled with a separate cost for each basic routine. Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and...
35
Modelling Linear Algebra Routines
The value of each System Parameter (SP) can depend on the problem size (n) and on the value of the Algorithmic Parameters (AP), such as b.
The formula has the form: $t = f(n, SP(n, AP), AP)$
The goal is to obtain the values of AP with which the lowest execution time is obtained.
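Selecting the Algorithmic Parameters then amounts to evaluating the model for each candidate and keeping the minimum. The sketch below assumes an exhaustive search over block sizes and mesh shapes; `toy_model` is a hypothetical stand-in with made-up System Parameters, not the cost formula from the slides.

```python
def mesh_shapes(p):
    """Return every (r, c) with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def toy_model(n, b, r, c):
    """Hypothetical cost model; NOT the formula from the slides."""
    k3, t_s, t_w = 0.003, 60.0, 0.7  # made-up System Parameters
    arithmetic = (2.0 / 3.0) * n ** 3 * k3 / (r * c)
    communication = (n / b) * max(r, c) * (t_s + b * b * t_w)
    return arithmetic + communication


def select_parameters(model, n, p, block_sizes):
    """Return (time, b, r, c) with the lowest modelled time."""
    best = None
    for b in block_sizes:
        for r, c in mesh_shapes(p):
            t = model(n, b, r, c)
            if best is None or t < best[0]:
                best = (t, b, r, c)
    return best


best = select_parameters(toy_model, 2048, 8, [16, 32, 64, 128])
print(best)
```

Because the search space (a few block sizes times the divisors of p) is small, an exhaustive evaluation of the model is cheap compared with running the routine itself.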
36
Modelling Linear Algebra Routines
The values of the System Parameters could be obtained:
- With installation routines associated to each linear algebra routine
- From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
- At execution time, by testing the system conditions prior to the call to the routine
37
Modelling Linear Algebra Routines
These values can be obtained as single values (the traditional method) or as functions of the Algorithmic Parameters.
In the latter case, a multidimensional table of values is stored as a function of the problem size and the Algorithmic Parameters.
When a problem of a particular size is solved, the execution time is estimated with the values of the stored size closest to the real size.
The problem is then solved with the values of the Algorithmic Parameters which predict the lowest execution time.
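The table lookup described here can be sketched as below. The table layout (a dictionary keyed by problem size and block size), the function names and the stored values are all hypothetical; they only illustrate the "closest stored size" rule.

```python
def closest_size(stored_sizes, n):
    """Return the stored problem size closest to the real size n."""
    return min(stored_sizes, key=lambda s: abs(s - n))


def lookup(table, n, b):
    """Return the System Parameter stored for block size b at the
    stored problem size closest to n."""
    sizes = sorted({size for size, _ in table})
    return table[(closest_size(sizes, n), b)]


# Made-up table: (problem size, block size) -> cost per operation.
table = {
    (512, 16): 0.0038, (512, 32): 0.0033,
    (1024, 16): 0.0037, (1024, 32): 0.0032,
}
print(lookup(table, 900, 32))
```

The value returned by `lookup` would then be substituted into the cost formula for each candidate block size, and the candidate with the lowest predicted time would be used.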
38
Modelling Linear Algebra Routines
Parallel block LU factorisation
[Figure: the matrix, the processors, and the distribution of computations in the first step]
39
Modelling Linear Algebra Routines
Distribution of computations on successive steps
[Figure panels: second step, third step]
40
Modelling Linear Algebra Routines
The cost of parallel block LU factorisation:
Tuning Algorithmic Parameters:
- block size: b
- 2D-mesh of p processors: p = r × c, d = max(r, c)
System Parameters:
- cost of arithmetic operations: $k_{2,getf2}$, $k_{3,trsmm}$, $k_{3,gemm}$
- communication parameters: $t_s$, $t_w$
41
Modelling Linear Algebra Routines
The cost of parallel block QR factorisation:
Tuning Algorithmic Parameters:
- block size: b
- 2D-mesh of p processors: p = r × c
System Parameters:
- cost of arithmetic operations: $k_{2,geqr2}$, $k_{2,larft}$, $k_{3,gemm}$, $k_{3,trmm}$
- communication parameters: $t_s$, $t_w$
42
Modelling Linear Algebra Routines
The same basic operations appear repeatedly in different higher level routines: the information generated for one routine (let’s say LU) could be stored and used for other routines (e.g. QR), so a common format is necessary to store the information.
43
Modelling Linear Algebra Routines
44
Modelling Linear Algebra Routines
[Figure: Parallel QR factorisation on IBM-SP2 with 8 processors. Time (seconds) against problem size (512 to 3584) for three series: mean, model and optimum.]
- “mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time which could be obtained by a non-expert user)
- “optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters
- “model” is the execution time with the values selected with the model
45
Modelling Linear Algebra Routines
Parameter selection for the QR algorithm - Network of Pentium III with Fast Ethernet
Column groups: p=4 (b r c), p=8 (b r c)
n = 1024: 16 1 4 1 8
n = 2048: 16 1 4 1 8
n = 3072: 32 1 4 1 8
n = 4096: 32 1 4 1 8
46
Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing
47
Installation Routines
In the formulas (parallel block LU factorisation), the values of the System Parameters ($k_{2,getf2}$, $k_{3,trsmm}$, $k_{3,gemm}$, $t_s$, $t_w$) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c).
48
Installation Routines
The estimation is done by running, at installation time, Installation Routines associated to the linear algebra routine, and storing the information generated to be used at running time.
Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.
49
Installation Routines
$k_{3,gemm}$ is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) by (b × n/c).
Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.
50
Installation Routines
For $k_{3,trsmm}$, two multiple triangular systems are solved: one upper triangular of size b × n/c, and another lower triangular of size n/r × b.
Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r.
As for the previous parameter, values can be obtained for different problem sizes.
51
Installation Routines
$k_{2,getf2}$ corresponds to a level 2 sequential LU factorisation of size b × b.
At installation time each of the basic routines is executed varying the value of the parameters it depends on, with representative values (selected by the routine designer or the system manager).
The information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation.
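An installation routine of this kind can be sketched as a loop that times a basic kernel for representative parameter values and stores the results in a file. The sketch below is only an illustration: a pure-Python matrix product stands in for the BLAS kernel, and the function names, the JSON layout and the file name are assumptions, not the format used by the authors.

```python
import json
import os
import tempfile
import time


def matmul_kernel(m, b, n):
    """Pure-Python (m x b) by (b x n) product, a stand-in for a BLAS kernel."""
    x = [[1.0] * b for _ in range(m)]
    y = [[1.0] * n for _ in range(b)]
    return [[sum(x[i][k] * y[k][j] for k in range(b)) for j in range(n)]
            for i in range(m)]


def install(sizes, block_sizes, path):
    """Time the kernel for each (n, b) and store the cost per operation."""
    results = {}
    for n in sizes:
        for b in block_sizes:
            start = time.perf_counter()
            matmul_kernel(n, b, n)
            elapsed = time.perf_counter() - start
            ops = 2.0 * n * b * n  # multiplications plus additions
            results[f"{n},{b}"] = elapsed / ops
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    return results


# Representative values chosen arbitrarily for the example.
path = os.path.join(tempfile.gettempdir(), "k3_gemm_example.json")
results = install([16, 32], [4, 8], path)
```

At running time the stored file would be read back and the value for the closest stored size and block size substituted into the cost formula.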
52
Installation Routines
$t_s$ and $t_w$ appear in communications of three types:
- In one of them a block of size b × b is broadcast in a row, and this parameter depends on b and c
- In another a block of size b × b is broadcast in a column, and the parameter depends on b and r
- In the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors. These parameters depend on n, b, r and c
53
Installation Routines
In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed.
The routine designer also designs the installation process, and can draw on their experience to guide the installation.
The basic installation process can be designed to allow the intervention of the system manager.
54
Installation Routines
Some results in different systems (physical and logical platform)
Values of $k_{3,DTRMM}$ (≈ $k_{3,DGEMM}$) on the different platforms (in microseconds), by block size:
System | Library | n | 16 | 32 | 64 | 128
SUN1 | refBLAS | 512,.., 4096 | 0.0200 | 0.0200 | 0.0220 | 0.0280
SUN1 | macBLAS | 512,.., 4096 | 0.0120 | 0.0110 | 0.0110 | 0.0110
SUN1 | ATLAS | 512,.., 4096 | 0.0070 | 0.0060 | 0.0060 | 0.0060
SUN5 | refBLAS | 512,.., 4096 | 0.0120 | 0.0130 | 0.0140 | 0.0150
SUN5 | macBLAS | 512,.., 4096 | 0.0060 | 0.0050 | 0.0050 | 0.0050
SUN5 | ATLAS | 512,.., 4096 | 0.0040 | 0.0032 | 0.0025 | 0.0025
PIII | ATLAS | 512,.., 4096 | 0.0038 0.0033 0.0030
PPC | macBLAS | 512,.., 4096 | 0.0023 0.0019 0.0018
R10K | macBLAS | 512,.., 4096 | 0.0070 0.0030 0.0025
55
Installation Routines
Values of $k_{2,DGEQR2}$ (≈ $k_{2,DLARFT}$) on the different platforms (in microseconds), for block sizes 16, 32, 64 and 128:
System | Library | n | value
SUN1 | refBLAS | 512,.., 4096 | 0.0200
SUN1 | macBLAS | 512,.., 4096 | 0.0500
SUN1 | ATLAS | 512,.., 4096 | 0.0700
SUN5 | refBLAS | 512,.., 4096 | 0.0050
SUN5 | macBLAS | 512,.., 4096 | 0.0300
SUN5 | ATLAS | 512,.., 4096 | 0.0500
PIII | ATLAS | 512,.., 4096 | 0.0150
PPC | macBLAS | 512,.., 4096 | 0.0100
R10K | macBLAS | 512,.., 4096 | 0.0250
56
Installation Routines
Typically the values of the communication parameters are well estimated with a ping-pong test. Values for block sizes 16, 32, 64 and 128:
System | MPI library | n | communication parameters
cSUN1 | MPICH | 512,.., 4096 | 170 / 7.0
cPIII | MPICH | 512,.., 4096 | 60 / 0.7
IBM-SP2 | Mac-MPI | 512,.., 4096 | 75 / 0.3
Origin 2K | Mac-MPI | 512,.., 4096 | 20 / 0.1
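One way to turn ping-pong measurements into the two communication parameters is a straight-line fit of the one-way time against the message size. The sketch below assumes the linear model time = t_s + t_w × size and uses synthetic measurements in place of real MPI timings; the function name `fit_ts_tw` is hypothetical.

```python
def fit_ts_tw(sizes, times):
    """Least-squares fit of time = t_s + t_w * size.

    sizes are message lengths in words and times are one-way times
    (half of the measured round trip of a ping-pong).
    """
    m = len(sizes)
    mean_x = sum(sizes) / m
    mean_y = sum(times) / m
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    t_w = sxy / sxx
    t_s = mean_y - t_w * mean_x
    return t_s, t_w


# Synthetic, noise-free "measurements" generated from made-up parameters.
sizes = [0, 1000, 2000, 4000]
times = [60.0 + 0.7 * s for s in sizes]
t_s, t_w = fit_ts_tw(sizes, times)
print(t_s, t_w)
```

With real measurements the fit would be repeated per platform and MPI library, giving one pair of values per system.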