1
Automatic Optimization in Parallel Dynamic Programming Schemes
Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain (dis.um.es/~domingo)
Juan-Pedro Martínez, Departamento de Estadística y Matemática Aplicada, Universidad Miguel Hernández de Elche, Spain
VECPAR 2004
2
Our Goal
General goal: to obtain parallel routines with autotuning capacity
Previous works: linear algebra routines
This communication: parallel dynamic programming schemes
In the future: apply the techniques to hybrid, heterogeneous and distributed systems
3
Outline
Modelling Parallel Routines for Autotuning
Parallel Dynamic Programming Schemes
Autotuning in Parallel Dynamic Programming Schemes
Experimental Results
4
Modelling Parallel Routines for Autotuning
It is necessary to predict the execution time accurately and to select:
The number of processes
The number of processors
Which processors
The number of rows and columns of processes (the topology)
The assignment of processes to processors
The computational block size (in linear algebra algorithms)
The communication block size
The algorithm (polyalgorithms)
The routine or library (polylibraries)
5
Modelling Parallel Routines for Autotuning
Cost of a parallel program:

    t(n, p) = t_arit + t_comm + t_over - t_overlap

t_arit: arithmetic time
t_comm: communication time
t_over: overhead, for synchronization, imbalance, process creation, ...
t_overlap: overlapping of communication and computation
6
Modelling Parallel Routines for Autotuning
Estimation of the time: computation and communication are considered divided into a number of steps, and each part of the formula takes the value of the process which gives the highest value.
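The formulas on this slide were images and did not survive extraction; a hedged LaTeX reconstruction of a per-step, maximum-over-processes estimate (the symbols s, t_{arit}^{k,i} and t_{comm}^{k,i} are our naming):

    T(n, p) \approx \sum_{k=1}^{s} \left( \max_{0 \le i < p} t_{arit}^{k,i} + \max_{0 \le i < p} t_{comm}^{k,i} \right)

where s is the number of steps and t_{arit}^{k,i}, t_{comm}^{k,i} are the arithmetic and communication times of process i in step k.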
7
Modelling Parallel Routines for Autotuning
The time depends on the problem size (n) and the system size (p), but also on some ALGORITHMIC PARAMETERS, such as the block size (b) and the number of processors (q) used out of the total available.
8
Modelling Parallel Routines for Autotuning
And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (t_c), and the start-up (t_s) and word-sending (t_w) times.
9
Modelling Parallel Routines for Autotuning
The values of the System Parameters can be obtained:
With installation routines associated to the routine being installed
From information stored when the library was installed in the system
At execution time, by testing the system conditions prior to the call to the routine
10
Modelling Parallel Routines for Autotuning
These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters. In the latter case:
A multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored.
When a problem of a particular size is solved, the execution time is estimated with the values of the stored size closest to the real size.
The problem is then solved with the values of the Algorithmic Parameters which predict the lowest execution time. A minimal selection sketch follows.
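A sketch of this lookup, assuming a flat table of (problem size, parameter value, modelled time) entries; entry and select_param are our names, not the paper's:

    #include <stdlib.h>

    /* One stored measurement: a problem size, a value of the Algorithmic
       Parameter (here, a number of processors) and the modelled time.   */
    typedef struct { int size; int param; double time; } entry;

    /* Return the parameter value of the entry whose stored size is
       closest to n and, among ties, has the lowest predicted time.  */
    int select_param(const entry *tab, int m, int n) {
        int best = tab[0].param;
        int best_dist = abs(tab[0].size - n);
        double best_time = tab[0].time;
        for (int i = 1; i < m; i++) {
            int dist = abs(tab[i].size - n);
            if (dist < best_dist || (dist == best_dist && tab[i].time < best_time)) {
                best = tab[i].param;
                best_dist = dist;
                best_time = tab[i].time;
            }
        }
        return best;
    }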
11
Parallel Dynamic Programming Schemes
There are different Parallel Dynamic Programming Schemes. The simple scheme of the "coins problem" is used: given a quantity C and n types of coins of values v = (v1, v2, …, vn), with quantities q = (q1, q2, …, qn) of each type, minimize the number of coins used to give C. The granularity of the computation has been varied to study the scheme, not the problem.
12
Parallel Dynamic Programming Schemes
Sequential scheme:

    for i = 1 to number_of_decisions
        for j = 1 to problem_size
            obtain the optimum solution with i decisions and problem size j
        endfor
        complete the table with the recurrence formula
    endfor

[Figure: the number_of_decisions x problem_size dynamic programming table, filled row by row.]
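The recurrence that completes the table was an image and is lost; a standard form for the coins problem (our reconstruction, not necessarily the paper's exact formula) is

    table[i][j] = min over k = 0 .. min(q_i, j / v_i) of  table[i-1][j - k*v_i] + k

and a runnable C sketch under that assumption:

    #include <limits.h>

    #define INF (INT_MAX / 2)  /* "infinity" that survives + k without overflow */

    /* table[i][j] = minimum number of coins among the first i types
       summing to j, using at most q[i-1] coins of value v[i-1] per type. */
    void coins_dp(int n, int C, const int v[], const int q[],
                  int table[n + 1][C + 1]) {
        table[0][0] = 0;
        for (int j = 1; j <= C; j++) table[0][j] = INF;   /* unreachable  */
        for (int i = 1; i <= n; i++)                      /* decisions    */
            for (int j = 0; j <= C; j++) {                /* problem size */
                int best = INF;
                for (int k = 0; k <= q[i - 1] && k * v[i - 1] <= j; k++) {
                    int cand = table[i - 1][j - k * v[i - 1]] + k;
                    if (cand < best) best = cand;
                }
                table[i][j] = best;
            }
    }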
13
Parallel Dynamic Programming Schemes
Parallel scheme:

    for i = 1 to number_of_decisions
        In Parallel: for j = 1 to problem_size
            obtain the optimum solution with i decisions and problem size j
        endfor
        endInParallel
    endfor

[Figure: the table with the columns of each row distributed among processes P0, …, Pp.]
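A shared-memory sketch of this scheme, reusing the names (INF, v, q, table) from the sequential sketch above; OpenMP is our illustrative choice, not something the slides prescribe. Every entry of row i reads only row i-1, so the j loop parallelizes cleanly:

    /* Same recurrence as before, with the inner loop parallelized:
       no thread ever reads what another thread of the same step writes. */
    void coins_dp_par(int n, int C, const int v[], const int q[],
                      int table[n + 1][C + 1]) {
        table[0][0] = 0;
        for (int j = 1; j <= C; j++) table[0][j] = INF;
        for (int i = 1; i <= n; i++) {
            #pragma omp parallel for schedule(static)
            for (int j = 0; j <= C; j++) {
                int best = INF;
                for (int k = 0; k <= q[i - 1] && k * v[i - 1] <= j; k++) {
                    int cand = table[i - 1][j - k * v[i - 1]] + k;
                    if (cand < best) best = cand;
                }
                table[i][j] = best;
            }
        }
    }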
14
Parallel Dynamic Programming Schemes
Message-passing scheme:

    In each processor Pj:
        for i = 1 to number_of_decisions
            communication step
            obtain the optimum solution with i decisions and the problem sizes Pj has assigned
        endfor
    endInEachProcessor

[Figure: the table with column blocks assigned to processors P0, …, Pp, and one communication step per row.]
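A message-passing sketch of the same step structure (MPI; all names are ours). Each process owns a block of columns and, once per decision, the full previous row is reassembled on every process, since entry (i, j) may read columns owned by other processes:

    #include <mpi.h>
    #include <limits.h>

    #define INF (INT_MAX / 2)

    /* This process owns columns [lo, hi]; counts[r]/displs[r] describe
       process r's block for MPI_Allgatherv. row holds a full row of C+1
       entries; block holds this process's part of the new row.          */
    void coins_dp_mpi(int n, int C, const int v[], const int q[],
                      int lo, int hi, const int counts[], const int displs[],
                      int *row, int *block, MPI_Comm comm) {
        for (int j = 0; j <= C; j++) row[j] = (j == 0) ? 0 : INF;
        for (int i = 1; i <= n; i++) {
            for (int j = lo; j <= hi; j++) {       /* own block of row i */
                int best = INF;
                for (int k = 0; k <= q[i - 1] && k * v[i - 1] <= j; k++) {
                    int cand = row[j - k * v[i - 1]] + k;
                    if (cand < best) best = cand;
                }
                block[j - lo] = best;
            }
            /* communication step: everyone obtains the complete row i */
            MPI_Allgatherv(block, hi - lo + 1, MPI_INT,
                           row, counts, displs, MPI_INT, comm);
        }
    }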
15
Autotuning in Parallel Dynamic Programming Schemes
Theoretical model:
Sequential cost, computational parallel cost (q_i large), and communication cost of one step; a hedged reconstruction of these formulas follows.
The only Algorithmic Parameter (AP) is p.
The System Parameters (SPs) are t_c, t_s and t_w.
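The three cost formulas were images and are lost; a hedged reconstruction, assuming each table entry costs t_c and one row of C entries is exchanged per decision (these constants are our assumption, not the paper's):

    T_{seq} = n C t_c \qquad T_{comp} = \frac{n C}{p} t_c \qquad T_{comm} = n (t_s + C t_w)

Under this model the autotuner selects the p that minimizes T_{comp} + T_{comm}.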
16
Autotuning in Parallel Dynamic Programming Schemes
How to estimate the arithmetic SPs: solving a small problem.
How to estimate the communication SPs:
Using a ping-pong (CP1)
Solving a small problem, varying the number of processors (CP2)
Solving problems of selected sizes on systems of selected sizes (CP3)
A minimal ping-pong sketch is shown below.
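A minimal MPI ping-pong (CP1) sketch, assuming the SPs are fitted from the measured one-way time t ≈ t_s + m t_w; all names are ours:

    #include <mpi.h>
    #include <stdlib.h>

    /* Time reps round-trips of an m-word message between ranks 0 and 1;
       the one-way time approximates t_s + m * t_w, so measuring two
       message sizes suffices to solve for t_s and t_w.                  */
    double pingpong(int m, int reps, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        double *buf = malloc(m * sizeof(double));
        MPI_Barrier(comm);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, m, MPI_DOUBLE, 1, 0, comm);
                MPI_Recv(buf, m, MPI_DOUBLE, 1, 0, comm, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, m, MPI_DOUBLE, 0, 0, comm, MPI_STATUS_IGNORE);
                MPI_Send(buf, m, MPI_DOUBLE, 0, 0, comm);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way time */
        free(buf);
        return t;  /* meaningful on ranks 0 and 1 only */
    }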
17
Experimental Results
Systems:
SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
PenFE: seven Pentium III + FastEthernet
Varying:
The problem size: C = 10000, 50000, …
Large values of q_i
The granularity of the computation (the cost of a computational step)
18
Experimental Results
CP1: ping-pong (point-to-point communication). Does not reflect the characteristics of the system.
CP2: executions with the smallest problem (C = 10000), varying the number of processors. Reflects the characteristics of the system, but the time also changes with C. Larger installation time (6 and 9 seconds).
CP3: executions with selected problem sizes (C = 10000, …) and system sizes (p = 2, 4, 6), with linear interpolation for other sizes. Larger installation time (76 and 35 seconds).
19
Experimental Results
Parameter selection: the number of processors chosen by the lowest-time selection (LT) and by CP1, CP2 and CP3, for granularities 10, 50 and 100, on SUNEt and PenFE, for C = 10000 and 50000.
[Table values garbled in extraction; not reproduced.]
20
Experimental Results
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in SUNEt.
21
Experimental Results
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in PenFE.
22
Experimental Results
Three types of users are considered (a sketch of the three policies follows):
GU (greedy user): uses all the available processors.
CU (conservative user): uses half of the available processors.
EU (expert user): uses a different number of processors depending on the granularity: 1 for low granularity, half of the available processors for middle granularity, all the processors for high granularity.
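A minimal sketch of the three policies; the granularity classes come from the experiments, but the names and encoding here are our assumption:

    typedef enum { LOW, MIDDLE, HIGH } granularity;

    /* Number of processors each user type would request out of `avail`. */
    int processors_for(char user, granularity g, int avail) {
        switch (user) {
        case 'G': return avail;                      /* greedy: all        */
        case 'C': return avail / 2;                  /* conservative: half */
        case 'E': return g == LOW ? 1                /* expert: by grain   */
                     : (g == MIDDLE ? avail / 2 : avail);
        default:  return 1;
        }
    }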
23
Experimental Results
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt.
24
Experimental Results
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE.
25
Conclusions and future work
The inclusion of autotuning capacities in a Parallel Dynamic Programming Scheme has been considered.
Different ways of modelling the scheme, and of selecting the parameters, have been studied.
Experimentally, the selection proves to be satisfactory, and useful in providing users with routines capable of reduced execution times.
In the future we plan to apply this technique to other algorithmic schemes, and to hybrid, heterogeneous and distributed systems.