Automatic Optimization in Parallel Dynamic Programming Schemes
Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain. domingo@dif.um.es, dis.um.es/~domingo
Juan-Pedro Martínez, Departamento de Estadística y Matemática Aplicada, Universidad Miguel Hernández de Elche, Spain. jp.martinez@uhm.es
07 May 2019 VECPAR 2004
Our Goal
General goal: to obtain parallel routines with autotuning capacity.
Previous works: linear algebra routines.
This communication: parallel dynamic programming schemes.
In the future: apply the techniques to hybrid, heterogeneous and distributed systems.
Outline
- Modelling Parallel Routines for Autotuning
- Parallel Dynamic Programming Schemes
- Autotuning in Parallel Dynamic Programming Schemes
- Experimental Results
Modelling Parallel Routines for Autotuning
Necessary to predict accurately the execution time and select:
- the number of processes
- the number of processors
- which processors
- the number of rows and columns of processes (the topology)
- the processes-to-processors assignment
- the computational block size (in linear algebra algorithms)
- the communication block size
- the algorithm (polyalgorithms)
- the routine or library (polylibraries)
Modelling Parallel Routines for Autotuning
Cost of a parallel program:
T(n, p) = T_arith + T_comm + T_over - T_overlap
where
- T_arith: arithmetic time
- T_comm: communication time
- T_over: overhead, for synchronization, imbalance, processes creation, ...
- T_overlap: overlapping of communication and computation
Modelling Parallel Routines for Autotuning
Estimation of the time, considering computation and communication divided in a number of steps:
T(n, p) = sum over steps of (T_arith + T_comm)
taking, for each part of the formula, the value of the process which gives the highest value.
Modelling Parallel Routines for Autotuning
The time depends on the problem size (n) and the system size (p), but also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of processors (q) used from the total available.
Modelling Parallel Routines for Autotuning
And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (tc) and the start-up (ts) and word-sending (tw) times.
Modelling Parallel Routines for Autotuning
The values of the System Parameters could be obtained:
- with installation routines associated to the routine we are installing
- from information stored when the library was installed in the system
- at execution time, by testing the system conditions prior to the call to the routine
Modelling Parallel Routines for Autotuning
These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters. In the latter case:
- a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored;
- when a problem of a particular size is solved, the execution time is estimated with the values of the stored size closest to the real size;
- the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.
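The table-based selection just described might be sketched as follows; the stored values, the cost model and all numbers are illustrative, not taken from any installation:

```python
# Sketch of table-based Algorithmic Parameter selection.  At installation
# time, System Parameter values are stored per (problem size, parameter);
# at run time the closest stored problem size is used to predict times.
# All names and numbers here are illustrative, not from the paper.

# Stored at installation: {(problem_size, p): measured tc (microseconds)}
sp_table = {
    (10000, 2): 0.12, (10000, 4): 0.14, (10000, 6): 0.18,
    (100000, 2): 0.10, (100000, 4): 0.11, (100000, 6): 0.13,
}

def closest_stored_size(n, table):
    """Pick the stored problem size closest to the real size n."""
    sizes = {size for (size, _p) in table}
    return min(sizes, key=lambda s: abs(s - n))

def select_parameter(n, table, predict):
    """Return the Algorithmic Parameter p with lowest predicted time."""
    n0 = closest_stored_size(n, table)
    candidates = [(p, tc) for (size, p), tc in table.items() if size == n0]
    return min(candidates, key=lambda pt: predict(n, pt[0], pt[1]))[0]

# A toy cost model: arithmetic n*tc/p plus a fixed per-process overhead.
predict = lambda n, p, tc: n * tc / p + 5000.0 * p

best_p = select_parameter(50000, sp_table, predict)
```

With these invented numbers the per-process overhead dominates, so the smallest candidate p wins.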
Parallel Dynamic Programming Schemes
There are different Parallel Dynamic Programming Schemes. The simple scheme of the "coins problem" is used: given a quantity C and n coin types of values v = (v1, v2, ..., vn), with a quantity q = (q1, q2, ..., qn) of each type, minimize the number of coins used to give C. The granularity of the computation has been varied to study the scheme, not the problem.
Parallel Dynamic Programming Schemes
Sequential scheme:
for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
  complete the table with the formula (for the coins problem, taking k coins of type i):
    T[i][j] = min { T[i-1][j - k*vi] + k : 0 <= k <= qi, k*vi <= j }
endfor
(a table with one row per decision i = 1, ..., n and one column per problem size j = 1, ..., N)
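The sequential scheme with the coins-problem recurrence can be sketched in Python; this is a standard formulation of the bounded coin-change problem, and the variable names are ours:

```python
def coins_dp(C, v, q):
    """Sequential dynamic programming scheme for the coins problem:
    minimum number of coins summing to C, using at most q[i] coins of
    value v[i].  Row i of the table depends only on row i-1, so only
    one previous row is kept."""
    INF = float("inf")
    prev = [0] + [INF] * C          # no coin types yet: only amount 0 reachable
    for i in range(len(v)):         # one decision per coin type
        cur = [INF] * (C + 1)
        for j in range(C + 1):      # each problem size j
            best = INF
            # try k coins of type i (the formula completing the table)
            for k in range(q[i] + 1):
                if k * v[i] > j:
                    break
                if prev[j - k * v[i]] + k < best:
                    best = prev[j - k * v[i]] + k
            cur[j] = best
        prev = cur
    return prev[C]
```

For example, giving 11 with values (1, 2, 5) and five coins of each type needs three coins (5 + 5 + 1).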
Parallel Dynamic Programming Schemes
Parallel scheme:
for i = 1 to number_of_decisions
  in parallel: for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endInParallel
endfor
(row i of the table, columns j = 1, ..., N, is computed at each step by processes P0, P1, P2, ..., PK)
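A shared-memory sketch of this scheme, splitting the inner loop over problem sizes among workers and synchronizing between decisions; Python's `ThreadPoolExecutor` stands in here for whatever parallel mechanism was actually used:

```python
from concurrent.futures import ThreadPoolExecutor

def coins_dp_parallel(C, v, q, workers=4):
    """Parallel scheme sketch: for each decision i, the loop over problem
    sizes j is split into blocks computed concurrently; waiting for all
    blocks before the next decision is the synchronization of the step."""
    INF = float("inf")
    prev = [0] + [INF] * C

    def solve_block(i, lo, hi, prev):
        # optimum solutions with i+1 decisions for sizes lo..hi-1
        out = []
        for j in range(lo, hi):
            best = INF
            for k in range(q[i] + 1):
                if k * v[i] > j:
                    break
                best = min(best, prev[j - k * v[i]] + k)
            out.append(best)
        return lo, out

    chunk = (C + workers) // workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(len(v)):
            cur = [INF] * (C + 1)
            futures = [pool.submit(solve_block, i, lo, min(lo + chunk, C + 1), prev)
                       for lo in range(0, C + 1, chunk)]
            for f in futures:          # implicit barrier between decisions
                lo, vals = f.result()
                cur[lo:lo + len(vals)] = vals
            prev = cur
    return prev[C]
```

The design point the scheme illustrates: only the loop over j parallelizes, because row i depends on the whole of row i-1.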
Parallel Dynamic Programming Schemes
Message-passing scheme:
in each processor Pj:
  for i = 1 to number_of_decisions
    communication step
    obtain the optimum solution with i decisions and the problem sizes Pj has assigned
  endfor
endInEachProcessor
(the columns j = 1, ..., N of the table are distributed in blocks among processors P0, P1, P2, ..., PK)
Autotuning in Parallel Dynamic Programming Schemes
Theoretical model:
- Sequential cost: proportional to the n x C table entries, each computed with up to qi options of arithmetic cost tc
- Computational parallel cost (qi large): the sequential arithmetic cost divided among the p processes
- Communication cost: one communication step per decision, with start-up time ts and word-sending time tw
The only AP is p. The SPs are tc, ts and tw.
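A sketch of how such a model might drive the selection of the only AP, p; the concrete cost formulas and the constants below are illustrative stand-ins, not the paper's model:

```python
def predicted_time(n, C, p, tc, ts, tw):
    """Toy instance of the model: arithmetic work n*C*tc shared among the
    p processes, plus, per decision, p messages of start-up ts and C/p
    words of cost tw each (an assumed communication pattern)."""
    arithmetic = n * C * tc / p
    communication = n * (ts * p + (C / p) * tw)
    return arithmetic + communication

def select_p(n, C, max_p, tc, ts, tw):
    """Choose the number of processes with lowest predicted time."""
    return min(range(1, max_p + 1),
               key=lambda p: predicted_time(n, C, p, tc, ts, tw))
```

With invented SPs tc = 1e-6, ts = 1e-3, tw = 1e-7 and n = 100, C = 10000, the predicted time is 1.1/p + 0.1p, so an intermediate p wins: more processes cut the arithmetic but raise the start-up cost.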
Autotuning in Parallel Dynamic Programming Schemes
How to estimate the SPs:
- arithmetic SPs: solving a small problem
- communication SPs:
  - using a ping-pong (CP1)
  - solving a small problem varying the number of processors (CP2)
  - solving problems of selected sizes in systems of selected sizes (CP3)
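Estimating the arithmetic SP by solving a small problem might look like this minimal sketch; the kernel and its operation count are stand-ins for the real installation routine:

```python
import time

def estimate_tc(solve_small, n_ops):
    """Estimate the arithmetic SP tc from one run of a small problem
    whose operation count n_ops is known, as an installation-time
    routine would do."""
    start = time.perf_counter()
    solve_small()
    elapsed = time.perf_counter() - start
    return elapsed / n_ops

# e.g. time a stand-in kernel of 100000 additions
tc = estimate_tc(lambda: sum(range(100000)), 100000)
```

The communication SPs (CP1-CP3) would be measured analogously, timing message exchanges instead of arithmetic.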
Experimental Results
Systems:
- SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
- PenFE: seven Pentium III + FastEthernet
Varying:
- the problem size: C = 10000, 50000, 100000, 500000, with a large value of qi
- the granularity of the computation (the cost of a computational step)
Experimental Results
- CP1: ping-pong (point-to-point communication). Does not reflect the characteristics of the system.
- CP2: executions with the smallest problem (C = 10000), varying the number of processors. Reflects the characteristics of the system, but the time also changes with C. Larger installation time (6 and 9 seconds).
- CP3: executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), and linear interpolation for other sizes. Larger installation time (76 and 35 seconds).
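The CP3-style linear interpolation between measured sizes can be sketched as follows; the measured values in the example are invented for illustration:

```python
def interpolate_sp(n, measured):
    """CP3-style estimate of a System Parameter for problem size n by
    linear interpolation between the two closest measured sizes;
    sizes outside the measured range are clamped to the nearest one."""
    sizes = sorted(measured)
    if n <= sizes[0]:
        return measured[sizes[0]]
    if n >= sizes[-1]:
        return measured[sizes[-1]]
    for lo, hi in zip(sizes, sizes[1:]):
        if lo <= n <= hi:
            t = (n - lo) / (hi - lo)
            return measured[lo] + t * (measured[hi] - measured[lo])

# invented per-word costs measured at the two installation sizes
measured = {10000: 0.20, 100000: 0.10}
```

Halfway between the two measured sizes this yields the midpoint value, 0.15.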
Experimental Results
Parameter selection: number of processors selected by LT, CP1, CP2 and CP3 for granularities 10, 50 and 100, on SUNEt and PenFE (table not recoverable from the extracted slide).
Experimental Results
Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in SUNEt:
Experimental Results
Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in PenFE:
Experimental Results
Three types of users are considered:
- GU (greedy user): uses all the available processors
- CU (conservative user): uses half of the available processors
- EU (expert user): uses a different number of processors depending on the granularity:
  - 1 for low granularity
  - half of the available processors for middle granularity
  - all the processors for high granularity
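The three user policies can be expressed as a small selection function; the granularity thresholds for "low", "middle" and "high" are assumptions, since the slides do not give them:

```python
def processors_for_user(user, granularity, available):
    """Number of processors each user type would choose.  The thresholds
    separating low/middle/high granularity (50 and 100 here) are
    illustrative assumptions."""
    if user == "GU":                    # greedy: all the processors
        return available
    if user == "CU":                    # conservative: half of them
        return max(1, available // 2)
    # EU: expert, depends on the granularity
    if granularity < 50:                # low granularity
        return 1
    if granularity < 100:               # middle granularity
        return max(1, available // 2)
    return available                    # high granularity
```

Comparing these choices against the model-selected parameter is exactly what the quotients on the following slides measure.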
Experimental Results
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt:
Experimental Results
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE:
Conclusions and future work
The inclusion of autotuning capacities in a Parallel Dynamic Programming Scheme has been considered. Different forms of modelling the scheme and of selecting the parameters have been studied. Experimentally, the selection proves to be satisfactory, and useful in providing users with routines capable of reduced execution times. In the future we plan to apply this technique to other algorithmic schemes, and to hybrid, heterogeneous and distributed systems.