1
Advances in the Optimization of Parallel Routines (II)
Domingo Giménez Departamento de Informática y Sistemas Universidad de Murcia, Spain dis.um.es/~domingo 28 November 2018 Universidad Politécnica de Valencia
2
Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
3
Polylibraries
Different basic libraries can be available:
- Reference BLAS, machine-specific BLAS, ATLAS, …
- MPICH, machine-specific MPI, PVM, …
- Reference LAPACK, machine-specific LAPACK, …
- ScaLAPACK, PLAPACK, …
A polylibrary uses a number of different basic libraries: its routines call routines from different libraries according to the characteristics of the problem and of the system used to solve it. An architecture for this type of library is proposed, with the aim of developing a methodology that can be used in the design of parallel libraries. To evaluate the viability of the method, the typical linear algebra libraries hierarchy has been considered, and experiments have been performed on different systems with linear algebra routines from different levels of the hierarchy. The results confirm the design of polylibraries as a good technique for speeding up computations: the polylibrary helps to decide which libraries to use to reduce the execution time when solving a problem.
4
Polylibraries
Typical parallel linear algebra libraries hierarchy: ScaLAPACK on top of PBLAS and LAPACK; PBLAS on top of BLACS and BLAS; BLACS on top of MPI, PVM, …
When solving linear algebra problems, basic linear algebra routines are called to solve different parts of the problem. Typically some version of BLAS [9] or LAPACK [10] is used in sequential or shared-memory systems, and ScaLAPACK [11] can be used in message-passing systems.
5
Polylibraries
A possible parallel linear algebra polylibraries hierarchy: ScaLAPACK, LAPACK, PBLAS, BLACS, MPI, PVM, …, with several BLAS versions at the lowest level (reference BLAS, machine-specific BLAS, ATLAS).
But there may be different versions of the libraries. It is possible to have libraries optimised for the machine (either provided by the vendor or freely available [12,13]), automatic tuning libraries like ATLAS [1] can be installed, or alternative libraries can be used [14]. The hierarchy of linear algebra libraries is shown in figure 1. At the lowest level of the hierarchy is the basic linear algebra library (BLAS), which may be the reference BLAS, a BLAS optimised for the machine, or an automatic tuning library.
6
Polylibraries
A possible parallel linear algebra polylibraries hierarchy, now also with several message-passing libraries at the bottom level (machine-specific MPI, LAM, MPICH, PVM), besides the BLAS alternatives (reference BLAS, machine-specific BLAS, ATLAS).
To perform the communications in message-passing programs, a library optimised for the machine can be used, as can a free version of MPI [15] or PVM [16].
7
Polylibraries
A possible parallel linear algebra polylibraries hierarchy, with alternatives at the LAPACK level as well (reference LAPACK, machine-specific LAPACK, ESSL), besides the BLAS and message-passing alternatives.
At higher levels of the hierarchy there are also different libraries with the same or similar functionality.
8
Polylibraries
[Diagram: the complete polylibrary hierarchy, with alternatives at every level: reference and machine-specific ScaLAPACK; PBLAS and BLACS; reference LAPACK, machine-specific LAPACK and ESSL; reference BLAS, machine-specific BLAS and ATLAS; machine-specific MPI, LAM, MPICH and PVM.]
9
Polylibraries
The advantage of polylibraries. Traditionally the user decides to use a particular set of libraries in the solution of all the problems, but several considerations make automatic selection advisable:
- A library optimised for the system might not be available. An automatic tuning library can be used meanwhile, but with time a better library may become available. A common method to evaluate libraries, or routines within libraries, is therefore needed, so that when a new library is installed the preferred library can change.
- When the characteristics of the system change, the preferred library can also change. This makes it necessary to reinstall the libraries.
- Which library is best may not be clear, since library preference varies with the routines and the systems in question. This leads to the concept of the polylibrary, built from the information generated in the installation of each library.
- Even for different problem sizes or different data access schemes the preferred library can change. A routine in the polylibrary must therefore call one library or another not only according to the routine, but also according to factors such as the problem size or the data access scheme.
- We may be working in a parallel system with the file system shared by processors of different types, which may or may not share the libraries. If they share a basic library, it may be optimised for one type of processor and not for the others; in any case, the resulting library will behave differently on the different types of processors.
10
Architecture of a Polylibrary
A possible architecture for a polylibrary is shown in figure 2.
11
Architecture of a Polylibrary
LIF_1 Installation. For each of the basic libraries an installation file (LIF) will be generated, containing information about the performance of the different routines in the library for different variable factors. The format of the different LIFs must be the same, and the generation of the LIFs must be carried out in the polylibrary installation or updating process.
12
Architecture of a Polylibrary
LIF_1 Installation. [Sample LIF entry: for routine DGEMM, a table of Mflops indexed by the matrix dimensions m and n, with values such as 20, 40, 80, …]
For some routines more than one parameter can be used; for example, in the matrix-matrix multiplication GEMM it could be interesting to store the cost with square and with rectangular matrices. Different access schemes can also produce different costs: the BLAS routine DROT can have different costs when the data in the vectors are stored in adjacent positions of memory than when they are stored in non-adjacent positions. The generation of a LIF can be done by using some model of the routines to predict the execution time or by performing some executions [17,2]. When the generation is done by performing executions, the process must be guided in order to avoid an overlong installation [18]. For each routine in the polylibrary, an interface routine is developed.
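To make the idea concrete, here is a minimal sketch of what a plain-text LIF entry of this kind could look like; the field names and layout are illustrative assumptions, not the format proposed by the authors:

    # LIF for one library (illustrative layout, not the authors' format)
    ROUTINE: DGEMM
    FACTORS: m, n              # problem dimensions measured
    UNITS:   Mflops
    TABLE:
      m\n    20     40     80
      20     ...    ...    ...
      40     ...    ...    ...
    ROUTINE: DROT
    FACTORS: n, leading dimension, access scheme (adjacent / non-adjacent)
    ...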
13
Architecture of a Polylibrary
LIF_1 Installation. [Sample LIF entry: for routine DROT, a table of Mflops indexed by the vector length n (100, 200, 400, …) and the leading dimension used to access the data.]
14
Architecture of a Polylibrary
[Diagram, built up over slides 14-17: the Installation module processes each basic library (Library_1, Library_2, Library_3) and generates the corresponding installation file (LIF_1, LIF_2, LIF_3).]
18
Architecture of a Polylibrary
interface routine_1, interface routine_2, … The polylibrary uses the information generated in the installation of the lower-level libraries (the LIFs) to generate the code of an interface routine for each routine in the polylibrary. For this, the format of the different LIFs must be the same; what we propose is to develop each library together with an installation engine which generates the LIF in the required format. This format should be decided by the linear algebra community, in order to facilitate the use of the information when designing polylibraries or higher-level libraries in the hierarchy.
19
Architecture of a Polylibrary
interface routine_1, interface routine_2, … Each generated interface routine encodes the selection, e.g.:
if n < value
    call routine_1 from Library_1
else
    depending on the data storage, call routine_1 from … or from Library_2
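A minimal runnable sketch of the idea, with the libraries represented by plain Python callables and the threshold standing in for a value that would really be read from the LIFs (all names and the value 600 are illustrative assumptions):

import numpy as np

def gemm_ref(A, B):       # stands in for the reference BLAS
    return A @ B

def gemm_atlas(A, B):     # stands in for an ATLAS build
    return A @ B

def gemm_interface(A, B):
    """Generated interface routine: pick the implementation that the
    installation information predicted to be fastest for this size."""
    n = A.shape[0]
    impl = gemm_ref if n < 600 else gemm_atlas   # threshold from the LIFs
    return impl(A, B)

C = gemm_interface(np.ones((800, 800)), np.ones((800, 800)))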
20
Polylibraries
Combining polylibraries with other optimisation techniques: polyalgorithms, and algorithmic parameters (block size, number of processors, logical topology of processors).
There are different techniques for obtaining optimised libraries, and better libraries can be obtained by combining them. With polyalgorithms we have different algorithms to solve the same problem; the algorithms can be implemented in different libraries, and the use of different algorithms does not differ from the use of different libraries. If one of the basic libraries is an automatically tuned library, a large amount of time may be necessary to install it in the system [1,2], which means a large amount of time is needed to update the polylibrary when the conditions of the system change. The same is true when installing or reinstalling the polylibrary, and in a parallel system multiple installations will be necessary (possibly in parallel). Thus the installation process must be guided by the system manager [3], who decides when it is convenient to reinstall the polylibrary, include a new library or a new type of processor, and which factors and ranges of values to consider. In some cases the values of some algorithmic parameters are obtained at execution time [19]: the block size in algorithms by blocks, the number of processors, or parameters determining the logical topology in a parallel system. This means different values of these parameters must be considered in the polylibrary installation. To adapt the method to systems where the load varies dynamically, some heuristic can be used [6]: the values obtained at installation time can be corrected with measures of the system load obtained at running time.
21
Experimental Results
Routines of different levels in the hierarchy:
- Lowest level: GEMM (matrix-matrix multiplication)
- Medium level: LU and QR factorisations
- Highest level: a lift-and-project algorithm to solve the inverse additive eigenvalue problem, and an algorithm to solve the Toeplitz least squares problem
In the lowest level of the hierarchy the routines are the typical BLAS ones. These routines are installed by performing selected computations, and the information is stored in the LIF of the library. The matrix-matrix multiplication has been used because it is a representative operation of the lowest-level libraries, it is used in algorithms by blocks, and it is a good example with which to show the methodology. Matrix factorisations by blocks [21] have been analysed (LU and QR) because they are basic operations at a second level of libraries, and they use routines of a lower level, for example the matrix-matrix multiplication. As examples of the highest level, a lift-and-project algorithm to solve the inverse additive eigenvalue problem [22] and an algorithm to solve the Toeplitz least squares problem [23,24] have been considered; both use routines of lower levels, and those routines are used in a black-box way.
22
Experimental Results
The platforms: SGI Origin 2000, IBM SP2, and different networks of processors (SUN workstations + Ethernet, PCs + Fast-Ethernet, PCs + Myrinet).
The experiments have been carried out on different systems because the goal is to obtain a methodology valid for a wide range of systems. The platforms used have been a shared-memory system (Origin 2000 with 32 processors) and several networks of processors. Some of the experimental results are reported in this section; more can be found in [25].
23
Experimental Results: GEMM
Routine: GEMM (matrix-matrix multiplication). Platform: five SUN Ultra 1 / one SUN Ultra 5. Libraries: refBLAS, macBLAS, ATLAS1, ATLAS2, ATLAS5. Algorithms and parameters: Strassen (base size), by blocks (block size), direct method.
The matrix-matrix multiplication is an operation from the lowest level of the hierarchy, but higher-level implementations have also been considered. For routines in the lowest level only the sequential case is analysed, because these routines would be used by parallel routines which call them in each processor. In a network of six SUNs with two types of processors (five SUN Ultra 1 and one SUN Ultra 5), the interface of the routine has one part for each type of processor. The libraries used in the experiments have been: a freely available reference BLAS (BLASref), the scientific library of SUN (BLASmac), and three versions of ATLAS [26]: one installed on the SUN Ultra 1 (ATLAS1), and two prebuilt versions for Ultra 2 (ATLAS2) and Ultra 5 (ATLAS5). The same libraries are used by the SUN Ultra 1 and the SUN Ultra 5 because they share the file system. For polyalgorithms we use a direct multiplication, a multiplication by blocks, and a Strassen multiplication. These algorithms allow us to combine the design of polylibraries with the automatic choice of algorithmic parameters: in the algorithm by blocks the block size is an algorithmic parameter, and in the Strassen multiplication the algorithmic parameter is the matrix size at which the recursion stops. The values in the LIF can be obtained by performing executions for the different algorithms, problem sizes and parameter values provided by the library manager. This may require a large amount of time, and heuristics can be used to reduce it. For example, in the algorithm by blocks, experiments can be performed with different block sizes for the smallest problem size. In the Strassen algorithm, for the smallest problem size, the problem is solved with base size half the problem size, then a quarter of the problem size, and so on until the execution time increases; the optimum base size is then taken as the starting point of a local search for the next problem size (see the sketch below).
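A minimal sketch of this installation heuristic, assuming a strassen_time(n, base) routine that times one execution (here faked with numpy so the sketch runs; the real measurements would come from the actual algorithms):

import time
import numpy as np

def strassen_time(n, base):
    """Stand-in for timing one Strassen multiplication of size n with the
    recursion stopped at size base (the timings here are not meaningful)."""
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    t0 = time.perf_counter()
    A @ B
    return time.perf_counter() - t0

def first_base(n):
    """Smallest problem: halve the base size while the time keeps improving."""
    base = n // 2
    t = strassen_time(n, base)
    while base > 1:
        t2 = strassen_time(n, base // 2)
        if t2 >= t:
            break
        base, t = base // 2, t2
    return base

def best_bases(sizes):
    """Local search around the previous optimum for each following size."""
    prev = first_base(sizes[0])
    bases = {sizes[0]: prev}
    for n in sizes[1:]:
        cands = [b for b in (prev // 2, prev, prev * 2) if 1 <= b <= n]
        prev = min(cands, key=lambda b: strassen_time(n, b))
        bases[n] = prev
    return bases

print(best_bases([400, 800, 1200]))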
24
Experimental Results: GEMM
MATRIX-MATRIX MULTIPLICATION INTERFACE. If the installation is made with matrix sizes 400, 800 and 1200, the interface routine could be:
if processor is SUN Ultra 5
    if problem-size < 600
        solve using ATLAS5 and Strassen method with base size half of problem size
    else if problem-size < 1000
        solve using ATLAS5 and block method with block size 400
    else
        …
    endif
else if processor is SUN Ultra 1
    solve using ATLAS5 and direct method
endif
25
Experimental Results: GEMM
[Table: execution times for n = 200, 600, 1000, 1400 and 1600: the lowest time with its library, method and parameter (recovered values include 0.04, 1.06, 4.68, 12.53 and 20.03 s, with ATL5 or ATL2, direct, Strassen or blocks methods, and parameters 2 and 400), the time with the selection provided by the model (mod.; values include 1.11, 12.58 and 26.57 s), and the time with ATLAS5 and the direct method (values include 4.83, 13.50 and 31.02 s).]
The lowest execution time (low.) is compared with the execution time obtained with the library, method and parameter provided by the model (mod.), and with that obtained using the direct method with ATLAS5. The execution time using the model is close to the lowest, and the same happens when the direct method and ATLAS5 are used. One could conclude that it is enough to determine the best library (ATLAS5) and method (direct). This may be a good option in some cases, but the optimum library is not always apparent: one would expect the preferred library to be the scientific library of SUN, but it is clearly surpassed by all the versions of ATLAS; and on the SUN Ultra 1 one would expect the preferred ATLAS to be the one installed on that system (ATLAS1), but in fact the best is ATLAS5. The use of different algorithms and libraries is not a great improvement with respect to the use of the best library alone, but the proposed method can also be used just to obtain, or update, the best library among those available. The method can likewise be used to decide which algorithm to choose from a set of available algorithms, because each algorithm can be considered as a routine of a different library. The selection of the parameters which produce the lowest execution times can be combined with the selection of the library or algorithm.
26
Experimental Results: LU
Routine: LU factorisation. Platform: four Pentium III + Myrinet. Libraries: ATLAS, BLAS for Pentium II, BLAS for Pentium III.
Experiments have been carried out in a network of Pentium III with Myrinet and Fast-Ethernet. The libraries used have been ATLAS, a BLAS for Pentium II Xeon (BLASII), and two versions of a BLAS for Pentium III (BLASIII).
27
Experimental Results: LU
The cost of parallel block LU factorisation. Tuning algorithmic parameters: block size b; 2D mesh of p processors, p = r × c, d = max(r, c). System parameters: cost of arithmetic operations (k2,getf2, k3,trsmm, k3,gemm) and communication parameters (start-up ts and word-sending time tw).
Automatically tuned medium-level routines can be developed by obtaining a theoretical model of the routine in which the system parameters are identified; the values of the system parameters are taken from those stored in the LIFs of the lower-level libraries. After substituting the values, decisions can be taken, e.g. the best library and the optimum values of some algorithmic parameters. We show how the method works with the LU and QR factorisations. The cost of the parallel block LU factorisation can be modelled by a formula in these parameters. The matrix is distributed in a logical two-dimensional mesh (in ScaLAPACK style [11]) of p = r × c processors, with block size b and d = max(r, c). Typical system parameters are the costs of arithmetic operations using BLAS (or similar) kernels of levels 1, 2 or 3 (k1, k2, k3) and communication parameters (start-up ts and word-sending time tw).
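A minimal sketch of how such a model can be used at installation time to pick the mesh and block size. The cost function below is a generic stand-in with the right ingredients (a k3 arithmetic term plus ts/tw communication terms), not the authors' LU formula, and the SP values in the example are made up:

def lu_model(n, r, c, b, k3, ts, tw):
    """Generic stand-in for a parallel block LU cost model (assumption):
    O(n^3) arithmetic shared among p = r*c processors plus, per block step,
    a start-up term and a volume term for broadcasting panels."""
    p = r * c
    steps = n // b
    arith = (2.0 * n**3 / (3.0 * p)) * k3
    comm = steps * ((r + c) * ts + (n * b / min(r, c)) * tw)
    return arith + comm

def best_parameters(n, p, k3, ts, tw, block_sizes=(16, 32, 64, 128)):
    """Enumerate meshes r*c = p and candidate block sizes; return the
    combination with the lowest modelled time."""
    best = None
    for r in range(1, p + 1):
        if p % r:
            continue
        c = p // r
        for b in block_sizes:
            t = lu_model(n, r, c, b, k3, ts, tw)
            if best is None or t < best[0]:
                best = (t, r, c, b)
    return best

# Example: 4 processors, SP values as measured at installation (made up here).
print(best_parameters(1024, 4, k3=1e-9, ts=5e-5, tw=1e-7))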
28
Experimental Results: LU
[Table 2: LU factorisation, n = 512, 1024, 1536. For each library (ATLAS, BLAS-II, BLAS-III) the theoretical time (the.), the lowest time (low.) and the time with the parameters provided by the model (mod.) are shown, together with the associated block size (b = 32 or 64); recovered times range from about 0.11-0.13 s at n = 512 to about 2.13-2.36 s at n = 1536.]
In table 2 the experimental results with the LU factorisation are summarised. The theoretical (the.) and lowest (low.) execution times, the execution time obtained with the parameters provided by the model (mod.), and the associated block size are shown for different libraries and problem sizes. The model provides good values for the parameters: the best results are obtained with BLASIII and BLASII, as predicted by the model, and the optimum block size is also well predicted. The difference between the libraries is less important in the parallel case than in the sequential case, due to the cost of communications, which is common to all the libraries.
29
Experimental Results: QR
Routine: QR factorisation. Platform: eight Pentium III + Fast-Ethernet. Libraries: ATLAS, BLAS for Pentium II, BLAS for Pentium III.
As for LU, the matrix is distributed in a logical two-dimensional mesh (in ScaLAPACK style [11]) of p = r × c processors with block size b, and the same libraries are used: ATLAS, a BLAS for Pentium II Xeon (BLASII), and two versions of a BLAS for Pentium III (BLASIII).
30
Experimental Results: QR
The cost of parallel block QR factorisation, modelled by a formula in the following parameters. Tuning algorithmic parameters: block size b; 2D mesh of p processors, p = r × c. System parameters: cost of arithmetic operations (k2,geqr2, k2,larft, k3,gemm, k3,trmm) and communication parameters (ts, tw).
31
Experimental Results: QR
[Table 3: QR factorisation, n = 1024, 2048, 3072. The time with the parameters provided by the model (mod.) and the lowest time (low.) are shown with their parameters: mesh r × c (e.g. 1×8), block size b (values including 8, 12, 16, 18, 24, 32) and library (BLASII, BLASIII); recovered times range from about 0.025 s to about 28 s.]
In table 3 the results obtained with the QR factorisation are shown. The execution time with the parameters provided by the model (mod.) and the lowest execution time (low.) are shown together with their parameters. The times with the model are always close to the lowest times, because the model makes a good selection of the parameters: number of processors, dimensions of the mesh, block size and library.
32
Experimental Results: L&P
Routine: lift-and-project method for the inverse additive eigenvalue problem. Platform: dual Pentium III. Library combinations:
- La_In+B_In: LAPACK and BLAS installed in the system, supposedly optimised for the machine
- La_Re+B_III: reference LAPACK and a freely available BLAS for Pentium III
- La_Re+B_II: reference LAPACK and a freely available BLAS for Pentium II
- La_Re+B_In: reference LAPACK and the installed BLAS
- La_In_Th+B_In_Th: installed LAPACK and installed BLAS, using threads
- La_Re+B_II_Th: reference LAPACK and a freely available BLAS for Pentium II, using threads
- La_Re+B_In_Th: reference LAPACK and the installed BLAS, using threads
A lift-and-project method for the solution of the inverse additive eigenvalue problem has been used as an example of a high-level algorithm of high cost; the order of the algorithm is O(n⁴).
33
Experimental Results: L&P
The theoretical model of the sequential algorithm cost. System parameters: ksyev (LAPACK); k3,gemm and k3,diaggemm (BLAS 3); k1,dot, k1,scal and k1,axpy (BLAS 1). In the formula there appear five parameters of routines from the lowest level of the hierarchy (BLAS) and one from the middle level (LAPACK). Three of the BLAS parameters correspond to BLAS 1 operations, and two are from BLAS 3, both of the same operation (gemm) but with two different types of matrices.
34
Experimental Results: L&P
Algorithm and cost. The lift-and-project method used to solve the inverse additive eigenvalue problem has been that described in [1]. The method will not be described here, but we analyse its different parts to identify the lower-level routines used in the solution of the problem, and the cost of each part. The most important parts of the algorithm (the other parts represent less than 0.1% of the total time) are: TRACE, ADK, EIGEN, MATEIG, MATMAT and ZKAOA.
35
Experimental Results: L&P
[Table 6: execution time (seconds) of each part of the algorithm, for one matrix and L = 25.]
                    TRACE    ADK   EIGEN  MATEIG  MATMAT  ZKAOA   TOTAL
La_In+B_In           1.69  12.86  165.81    0.94   98.79  14.22  294.32
La_Re+B_III          1.16  14.87  210.85    0.83   26.70  10.46  264.89
La_Re+B_II                 15.65  255.20    0.86   10.52  10.44  293.85
La_Re+B_In                 16.41  336.49    1.21  123.73  18.03  497.59
Lowest, no threads                                               201.64
La_In_Th+B_In_Th     1.10  13.92  266.63    0.66   14.13  12.34  308.80
La_Re+B_II_Th              15.68  254.34    0.79    6.66   9.99  288.66
La_Re+B_In_Th              13.71  249.59    0.62   13.74  11.90  290.68
Lowest, with threads                                             281.70
Lowest                                                           197.06
The total time and the time in each part of the algorithm appear in the table, with the library combinations of the previous slide. The lowest execution time when not using threads, when using threads, and overall are obtained by taking, for each part of the algorithm, the best of the corresponding combinations. The lowest time represents an important reduction with respect to the execution time obtained with the best single combination (La_Re+B_III), which is not the one installed in the system. The versions that use threads are not better than those that do not.
36
Polylibraries
The architecture of a polylibrary has been described, and the advantage of using polylibraries has been shown with experiments with routines of different levels in the libraries hierarchy, on different systems. The method can be applied to sequential and parallel algorithms, and it can be combined with other methods of speeding up computations. The LIF of each library contains the cost of each routine in the library; these costs may differ for different data sizes or access schemes. The main issue in the design of polylibraries is the decision on the format of the LIFs, which should be taken by the linear algebra community; this would allow the development of a hierarchy of automatic tuning libraries with associated LIFs. The proposed method could also be applied to help in the development of efficient parallel libraries in other fields; at the moment we are researching its application to Divide and Conquer and Dynamic Programming algorithms.
37
Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
38
Algorithmic schemes
The aim is to study algorithmic schemes, not individual routines. The study could be useful to:
- design libraries to solve problems in different fields: Divide and Conquer, Dynamic Programming, Branch and Bound (La Laguna)
- develop skeletons which could be used in parallel programming languages: Skil, Skipper, CARAML, P3L, …
39
Dynamic Programming
There are different parallel dynamic programming schemes. The simple scheme of the “coins problem” is used: given a quantity C and n types of coins of values v = (v1, v2, …, vn), with a quantity q = (q1, q2, …, qn) of each type, minimise the number of coins used to give C. The granularity of the computation has been varied in order to study the scheme, not the problem.
40
Dynamic Programming
Sequential scheme:
for i = 1 to number_of_decisions
    for j = 1 to problem_size
        obtain the optimum solution with i decisions and problem size j
    endfor
    complete row i of the table with the recurrence formula
endfor
[Diagram: an n × N table filled row by row.]
41
Dynamic Programming
Parallel scheme:
for i = 1 to number_of_decisions
    in parallel: for j = 1 to problem_size
        obtain the optimum solution with i decisions and problem size j
    endfor
endfor
[Diagram: each row of the table is computed in parallel by processors P0 … PK.]
42
Dynamic Programming
Message-passing scheme:
in each processor Pj:
    for i = 1 to number_of_decisions
        communication step
        obtain the optimum solution with i decisions and the problem sizes assigned to Pj
    endfor
[Diagram: the columns of the table are distributed among processors P0 … PK.]
43
Dynamic Programming
Theoretical model: the sequential cost, the computational parallel cost (for large qi) and the communication cost of one step are modelled as formulas in the problem and system parameters. The only AP is p; the SPs are tc, ts and tw.
44
Dynamic Programming
How to estimate arithmetic SPs: solving a small problem. How to estimate communication SPs:
- using a ping-pong (CP1)
- solving a small problem varying the number of processors (CP2)
- solving problems of selected sizes on systems of selected sizes (CP3)
45
Dynamic Programming
Experimental results. Systems: SUNEt (five SUN Ultra 1 and one SUN Ultra 5, the latter 2.5 times faster, + Ethernet) and PenFE (seven Pentium III + Fast-Ethernet). Varying: the problem size (C = 10000, 50000, …, with a large value of qi) and the granularity of the computation (the cost of a computational step).
46
Dynamic Programming
Experimental results:
- CP1: ping-pong (point-to-point communication). Does not reflect the characteristics of the system.
- CP2: executions with the smallest problem (C = 10000), varying the number of processors. Reflects the characteristics of the system, but the time also changes with C. Larger installation time (6 and 9 seconds).
- CP3: executions with selected problem sizes (C = 10000, …) and system sizes (p = 2, 4, 6), with linear interpolation for other sizes. Larger installation time (76 and 35 seconds).
47
Dynamic Programming
[Table: the number of processors selected by LT, CP1, CP2 and CP3, for C = 10,000 and 50,000 and granularities 10, 50 and 100, on SUNEt and PenFE; the selections recovered from the transcript include 1, 4, 5, 6 and 7 processors.]
48
Dynamic Programming
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in SUNEt.
49
Dynamic Programming
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in PenFE.
50
Dynamic Programming
Three types of users are considered:
- GU (greedy user): uses all the available processors.
- CU (conservative user): uses half of the available processors.
- EU (expert user): uses a different number of processors depending on the granularity: 1 for low granularity, half of the available processors for middle granularity, all the processors for high granularity.
51
Dynamic Programming
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt.
52
Dynamic Programming
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE.
53
Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
54
Heterogeneous algorithms
New algorithms with an unbalanced distribution of data are necessary: different SPs for different processors, and APs that include a vector of selected processors and a vector of block sizes. [Diagram: Gauss elimination with blocks of sizes b0, b1, b2 assigned to different processors.]
55
Heterogeneous algorithms
Parameter selection:
- RI-THE: obtains p and b from the formula (homogeneous distribution)
- RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution)
- RI-HET: obtains p and the block size of each processor through a reduced number of executions (heterogeneous distribution)
56
Heterogeneous algorithms
Quotient with respect to the lowest experimental execution time. [Charts: RI-THEO, RI-HOMO and RI-HETE, quotients between 0.5 and 2 for n = 500-3000, on three systems.] Heterogeneous system: two SUN Ultra 1 (one manages the file system) and one SUN Ultra 5. Homogeneous system: five SUN Ultra 1. Hybrid system: five SUN Ultra 1 and one SUN Ultra 5.
57
Parameter selection at running time
[Diagram, installation phase: modelling the LAR (linear algebra routine) produces the MODEL; implementation of SP-estimators produces the SP-Estimators; estimation of the static SPs, using the basic libraries' installation files, produces the Static-SP-File.]
58
Parameter selection at running time
[Diagram, as before, adding the run-time step: a call to NWS produces the NWS Information.]
59
Parameter selection at running time
The NWS is called and it reports: the fraction of available CPU (fCPU), and the current word-sending time (tw_current) for specific n and AP values (n0, AP0). Then the fraction of available network is calculated.
60
Parameter selection at running time
[Table: five load situations of an 8-node platform, giving the CPU availability per node and the current word-sending times where recovered.]
- Situation A: 100% CPU available on all eight nodes.
- Situation B: 80% on nodes 1-4, 100% on nodes 5-8.
- Situation C: 60% on nodes 1-4, 100% on nodes 5-8.
- Situation D: 60% on nodes 1-4, 100% on nodes 5-6, 80% on nodes 7-8; tw-current 0.7 s and 0.8 s on the loaded links.
- Situation E: 60% on nodes 1-4, 100% on nodes 5-6, 50% on nodes 7-8; tw-current 0.7 s and 4.0 s on the loaded links.
62
Parameter selection at running time
[Diagram, adding: dynamic adjustment of the SPs, from the Static-SP-File and the NWS Information, produces the Current-SP.]
63
Parameter selection at running time
The values of the SPs are tuned according to the current situation.
65
Parameter selection at running time
[Diagram, adding: selection of the optimum AP, from the MODEL and the Current-SP, produces the Optimum-AP.]
66
Parameter selection at running time
[Diagram as before, plus tables giving, for each load situation A-E and each problem size n, the selected block size and the number of nodes to use with its mesh p = r × c; recoverable mesh entries include 4×2, 2×2 and 2×1, with fewer nodes selected as the load grows.]
68
Parameter selection at running time
[Diagram, adding the final step: execution of the LAR with the Optimum-AP.]
69
Parameter selection at running time
[Charts: quotient to the lowest execution time for the static model and the dynamic model, for n = 1024, 2048 and 3072, under load situations A-E.]
70
Work distribution
There are different possibilities in heterogeneous systems: heterogeneous algorithms (Gauss elimination), or homogeneous algorithms with the assignation of one process to each processor (LU factorisation), or of a variable number of processes to each processor, depending on the relative speeds. The general assignation problem is NP-hard, so heuristic approximations are used.
71
Work distribution
Dynamic Programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution. [Diagrams: the columns of the DP table distributed with one block per process P0 … PK, and with the processes mapped unevenly to processors p0 … pr.]
72
Work distribution
The model: t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))
Problem size: n (number of types of coins), C (value to give), v (array of values of the coins), q (array of quantities of coins of each type).
Algorithmic parameters: p (number of processes), b (block size, here n/p), d (processes-to-processors assignment).
System parameters: tc (cost of basic arithmetic operations), ts (start-up time), tw (word-sending time).
73
Work distribution
Theoretical model: the same as in the homogeneous case, because the same homogeneous algorithm is used; sequential cost, computational parallel cost (for large qi) and communication cost per step are modelled as before. There is a new AP, the assignation d, and the SPs are now a one-dimensional table (tc) or two-dimensional tables (ts, tw).
74
Work distribution
Assignment tree (P types of processors and p processes): at each level one more process is assigned to a type of processor, with types considered in non-decreasing order (the children of a node of type t range over the types t, t+1, …, P). Some limit on the height of the tree (the number of processes) is necessary.
75
Work distribution
Assignment tree with P = 2 types of processors and p = 3 processes: 10 nodes.
76
Work distribution
Assignment tree for SUNEt, P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5): the root branches to U5 and U1, and each level assigns one more process to a processor type, one process to each processor. When more processes than available processors are assigned to a type of processor, the costs of the operations (the SPs) change.
77
Work distribution
Assignment tree for TORC, where P = 4 types of processors were used: one 1.7 GHz Pentium 4 (only one process can be assigned), type 1; one 1.2 GHz AMD Athlon, type 2; one 600 MHz single Pentium III, type 3; eight 550 MHz dual Pentium III, type 4. Four of the processors are not in the tree. When two consecutive processes are assigned to the same node, the values of the SPs change.
78
Work distribution
Use Branch and Bound or backtracking (with node elimination) to search the tree. The theoretical execution model estimates the cost at each node, using the highest values of the SPs among the types of processors considered, multiplied by the number of processes assigned to the processor of that type with the heaviest load.
79
Work distribution
Use Branch and Bound or backtracking (with node elimination) to search the tree, using the theoretical execution model to obtain a lower bound for each node. For example, with an array of processor types (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si, and an array of assignations a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1). The maximum achievable speed is obtained from the speeds of the processors still assignable; the minimum arithmetic cost is obtained from this speed, and the lowest communication costs are obtained from those between the processors in the array of assignations.
80
Work distribution
Theoretical model: sequential cost, computational parallel cost (for large qi) and communication cost of one step, as in the homogeneous case, but evaluated with the maximum values of the SPs among the processes. The APs are p and the assignation array d; the SPs are the one-dimensional array tc and the two-dimensional arrays ts and tw.
81
Work distribution
How to estimate arithmetic SPs: solving a small problem on each type of processor. How to estimate communication SPs:
- using a ping-pong between each pair of processors, and between processes in the same processor (CP1); does not reflect the characteristics of the system
- solving a small problem varying the number of processors, with linear interpolation (CP2); larger installation time
82
Work distribution
Three types of users are considered:
- GU (greedy user): uses all the available processors, with one process per processor.
- CU (conservative user): uses half of the available processors (the fastest), with one process per processor.
- EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity: one process on the fastest processor for low granularity; as many processes as half the available processors, on the appropriate processors, for middle granularity; as many processes as processors, on the appropriate processors, for large granularity.
83
Work distribution
Quotient between the execution time with the parameters selected by each of the selection methods and modelled users and the lowest execution time, in SUNEt.
84
Work distribution
[Table: parameters selected in TORC with CP2, for C = 50000, 100000 and 500000 and granularities 10, 50 and 100; recovered selections include the assignations (1,2), (1,2,4,4) and (1,2,3,4).]
85
Work distribution
Parameters selected in TORC (without the 1.7 GHz Pentium 4), with CP2. Types: one 1.2 GHz AMD Athlon (type 1), one 600 MHz single Pentium III (type 2), eight 550 MHz dual Pentium III (type 3).
[Table: for C = 50000, 100000 and 500000 and granularities 10, 50 and 100; recovered selections include (1,1,2), (1,1,2,3,3,3,3,3,3), (1,1,2,3,3,3,3,3,3,3,3), (1,1,3,3), (1,1,3) and (1,1,2,3).]
86
Work distribution
Quotient between the execution time with the parameters selected by each of the selection methods and modelled users and the lowest execution time, in TORC.
87
Work distribution
Quotient between the execution time with the parameters selected by each of the selection methods and modelled users and the lowest execution time, in TORC (without the 1.7 GHz Pentium 4).
88
Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
89
Hybrid programming
OpenMP: fine-grain parallelism; efficient in SMP; sequential and parallel codes are similar; tools for development and parallelisation; allows run-time scheduling; memory allocation can reduce performance.
MPI: coarse-grain parallelism; more portable; parallel code very different from the sequential one; development and debugging more complex; static assignment of processes; local memories, which facilitates efficient use.
90
Hybrid programming Advantages of Hybrid Programming
To improve scalability; when too many tasks produce load imbalance; for applications with both fine- and coarse-grain parallelism; to reduce the code development time; when the number of MPI processes is fixed; in case of a mixture of functional and data parallelism.
91
Hybrid programming Hybrid Programming in the literature
Most of the papers are about particular applications; some papers present hybrid models; no theoretical models of the execution time are available.
92
Hybrid programming
Systems: networks of dual Pentiums; HPC160 (four processors per node); IBM SP Blue Horizon (144 nodes, each with 8 processors); Earth Simulator (640 × 8 vector processors); …
94
Hybrid programming
Models:
- MPI+OpenMP: OpenMP used for the parallelisation of loops.
- OpenMP+MPI: threads are unsafe; MPI and OpenMP processes in SPMD model; reduces the cost of communications.
96
Hybrid programming
      program main
      include 'mpif.h'
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      call MPI_BCAST(n,1,MPI_INTEGER,0,
     &               MPI_COMM_WORLD,ierr)
      h = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO REDUCTION (+:sum) PRIVATE (x)
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   enddo
!$OMP END PARALLEL DO
      mypi = h * sum
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,
     &                MPI_SUM,0,MPI_COMM_WORLD,ierr)
      call MPI_FINALIZE(ierr)
      stop
      end
97
Hybrid programming
It is not clear whether hybrid programming gives lower execution times (Lanucara, Rovida: Conjugate-Gradient).
98
Hybrid programming
It is not clear whether hybrid programming gives lower execution times (Djomehri, Jin: CFD solver).
99
Hybrid programming
It is not clear whether hybrid programming gives lower execution times (Viet, Yoshinaga, Abderazek, Sowa: linear system).
100
Hybrid programming
Matrix-matrix multiplication: decide which is preferable, MPI SPMD or MPI+OpenMP. MPI+OpenMP needs less memory and fewer communications, but may have worse memory use. [Diagram: the block distribution of the matrices among nodes N0, N1, N2 and, within each node, among processes/threads p0, p1.]
101
Hybrid programming
In the theoretical time model more algorithmic parameters appear:
- 8 processors: p = r×s: 1×8, 2×4, 4×2, 8×1; or p = r×s: 1×4, 2×2, 4×1 with q = u×v: 1×2, 2×1; in total, 6 configurations.
- 16 processors: p = r×s: 1×16, 2×8, 4×4, 8×2, 16×1; with q = u×v: 1×4, 2×2, 4×1; in total, 9 configurations.
102
Hybrid programming
And more system parameters: the cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor), and the cost of arithmetic operations can vary when the number of threads in the node varies. Consequently, the algorithms must be recoded and new models of the execution time must be obtained.
103
Hybrid programming
… and the formulas change. [Diagram: processes P0-P6 on nodes 1-6, with synchronisations inside the nodes and communications between them.] For some systems 6×1 nodes with 1×6 threads could be better, and for others 1×6 nodes with 6×1 threads.
104
Hybrid programming
Open problem: is it possible to automatically generate MPI+OpenMP programs from MPI programs? Maybe for the SPMD model, or at least for some types of programs, such as matrix problems on meshes of processors. And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program plus some description of how the time model was obtained?
105
Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
106
Peer to peer computing
Distributed systems are inherently heterogeneous and dynamic, but they pose additional problems: higher communication cost, and the need for special middleware. The typical paradigms are master/slave and client/server, where different types of processors (users) are considered.
107
Peer to peer computing
Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.
108
Peer to peer computing
Peer to peer: all the processors (users) are at the same level (at least initially), and the community selects, in a democratic and continuous way, the topology of the global network. Would it be interesting to have a P2P system for computing? Is some system of this type available?
109
Peer to peer computing
Would it be interesting to have a P2P system for computing? I think it would be interesting to develop a system of this type, and to let the community decide, in a democratic and continuous way, whether it is worthwhile. Is some system of this type available? I think there is no pure P2P system dedicated to computation.
110
Peer to peer computing
… and other people seem to think the same. Lichun Ji (2003): “… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful.” Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): “… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability.”
111
Peer to peer computing
There are a lot of tools for grid computing: Globus (of course), but does Globus provide computational P2P capacity, or is it a tool with which P2P computational systems can be developed? NetSolve/GridSolve uses a client/server structure. PlanetLab (at present 387 nodes and 162 sites) has, in each site, one principal researcher and one system administrator.
112
Peer to peer computing
For computation on P2P the shared resources are:
- Information: books, papers, …, in the typical way.
- Libraries: one peer takes a library from another peer. A description of the library and of the system is necessary to know whether the library fulfils our requirements.
- Computation: one peer collaborates to solve a problem proposed by another peer. This is the central idea of computation on P2P.
113
Peer to peer computing
Two peers collaborate in the solution of a computational problem using the hierarchy of parallel linear algebra libraries. [Diagram: two peers, each with its own stack of libraries (PLAPACK or ScaLAPACK; reference or machine LAPACK; ATLAS or BLAS; PBLAS and BLACS; machine or reference MPI).]
114
Peer to peer computing
There are different global hierarchies and different libraries. [Diagram: the two peers' differing library stacks.]
115
Peer to peer computing
And the installation information varies from peer to peer, which makes the efficient use of the theoretical model more difficult than in the heterogeneous case. [Diagram: the two peers' library stacks, each library with its own installation information.]
116
Peer to peer computing
Trust problems appear: Does the library solve the problems we need solved? Is the library optimised for the system it claims to be optimised for? Is the installation information correct? Is the system stable? There are trust algorithms for P2P systems; are they (or some modification of them) applicable to these trust problems?
117
Peer to peer computing
Each peer would have the possibility of establishing a policy of use: whether the use of its resources is payable, the percentage of CPU dedicated to computations for the community, and the types of problems it is interested in. And the MAIN PROBLEM: is it worthwhile to develop a P2P system for the management and optimisation of computational codes?