Automatic Optimization in Parallel Dynamic Programming Schemes

Automatic Optimization in Parallel Dynamic Programming Schemes
Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain (domingo@dif.um.es, dis.um.es/~domingo)
Juan-Pedro Martínez, Departamento de Estadística y Matemática Aplicada, Universidad Miguel Hernández de Elche, Spain (jp.martinez@uhm.es)
VECPAR 2004

Our Goal
General goal: to obtain parallel routines with autotuning capacity.
Previous works: linear algebra routines.
This communication: parallel dynamic programming schemes.
In the future: apply the techniques to hybrid, heterogeneous and distributed systems.

Outline
- Modelling Parallel Routines for Autotuning
- Parallel Dynamic Programming Schemes
- Autotuning in Parallel Dynamic Programming Schemes
- Experimental Results

Modelling Parallel Routines for Autotuning
It is necessary to predict the execution time accurately and to select:
- The number of processes
- The number of processors
- Which processors
- The number of rows and columns of processes (the topology)
- The assignment of processes to processors
- The computational block size (in linear algebra algorithms)
- The communication block size
- The algorithm (polyalgorithms)
- The routine or library (polylibraries)

Modelling Parallel Routines for Autotuning
Cost of a parallel program:
t(n) = t_arith + t_comm + t_over - t_overlap
where:
- t_arith: arithmetic time
- t_comm: communication time
- t_over: overhead, for synchronization, imbalance, processes creation, ...
- t_overlap: overlapping of communication and computation

Modelling Parallel Routines for Autotuning
Estimation of the time, considering computation and communication divided into a number of steps:
t(n) = sum over steps i of ( max_j t_arith(i, j) + max_j t_comm(i, j) )
taking, for each part of the formula, the value of the process j which gives the highest value. A sketch of this evaluation is shown below.
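
As a minimal sketch of this estimation (a literal reading of the formula; the array names and the sample values are illustrative, not from the paper):

#include <stdio.h>

/* Predicted time: for each step, take the arithmetic and communication
   terms of the process with the highest cost, and sum over the steps.
   The per-step, per-process costs would come from the model. */
double predicted_time(int steps, int p,
                      double t_arith[steps][p], double t_comm[steps][p]) {
    double total = 0.0;
    for (int i = 0; i < steps; i++) {
        double max_a = 0.0, max_c = 0.0;
        for (int j = 0; j < p; j++) {
            if (t_arith[i][j] > max_a) max_a = t_arith[i][j];
            if (t_comm[i][j] > max_c) max_c = t_comm[i][j];
        }
        total += max_a + max_c;   /* the slowest process dominates each step */
    }
    return total;
}

int main(void) {
    double ta[2][2] = {{1.0, 1.5}, {2.0, 1.0}};   /* illustrative values */
    double tc[2][2] = {{0.2, 0.1}, {0.1, 0.3}};
    printf("predicted time: %g\n", predicted_time(2, 2, ta, tc));  /* 4.0 */
    return 0;
}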

Modelling Parallel Routines for Autotuning
The time depends on the problem size (n) and the system size (p): t = t(n, p). But it also depends on some ALGORITHMIC PARAMETERS (AP), such as the block size (b) and the number of processors (q) used from the total available: t = t(n, p, b, q).

Modelling Parallel Routines for Autotuning
And on some SYSTEM PARAMETERS (SP), which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (tc), the start-up time (ts) and the word-sending time (tw).

Modelling Parallel Routines for Autotuning
The values of the System Parameters can be obtained:
- With installation routines associated to the routine we are installing
- From information stored when the library was installed in the system
- At execution time, by testing the system conditions prior to the call to the routine

Modelling Parallel Routines for Autotuning
These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters. In the latter case:
- A multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored.
- When a problem of a particular size is to be solved, the execution time is estimated with the values of the stored size closest to the real size.
- The problem is then solved with the values of the Algorithmic Parameters which predict the lowest execution time. A sketch of this selection step is shown below.
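
A minimal sketch of this selection step, assuming the only Algorithmic Parameter is the number of processors p (the struct, the function name and the tiny table are illustrative, not the paper's interface):

#include <stdio.h>
#include <stdlib.h>

/* One entry of the stored table: problem size, AP value (p), predicted time. */
typedef struct { int n; int p; double t_pred; } entry;

/* Return the p with the lowest predicted time for the stored size closest to n. */
int select_p(const entry *tab, int len, int n) {
    int best_n = tab[0].n;
    for (int i = 1; i < len; i++)            /* stored size closest to n */
        if (abs(tab[i].n - n) < abs(best_n - n))
            best_n = tab[i].n;
    int p = -1;
    double best_t = 1e300;
    for (int i = 0; i < len; i++)            /* lowest predicted time for it */
        if (tab[i].n == best_n && tab[i].t_pred < best_t) {
            best_t = tab[i].t_pred;
            p = tab[i].p;
        }
    return p;
}

int main(void) {
    entry tab[] = { {10000, 2, 1.8},  {10000, 4, 1.1},  {10000, 6, 1.3},
                    {100000, 2, 17.5}, {100000, 4, 9.8}, {100000, 6, 8.9} };
    printf("selected p = %d\n", select_p(tab, 6, 60000));  /* closest size 100000 -> p = 6 */
    return 0;
}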

Parallel Dynamic Programming Schemes
There are different parallel dynamic programming schemes. The simple scheme of the "coins problem" is used: given a quantity C and n types of coins of values v = (v1, v2, ..., vn), with quantities q = (q1, q2, ..., qn) of each type, minimize the number of coins used to make up C. The granularity of the computation has been varied in order to study the scheme, not the problem.

Parallel Dynamic Programming Schemes
Sequential scheme:

for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
  complete the table with the formula
    T[i][j] = min { T[i-1][j - k*v_i] + k : 0 <= k <= q_i, k*v_i <= j }
endfor

where T[i][j] is the minimum number of coins needed to obtain quantity j using the first i coin types. [Figure: the dynamic programming table, with rows i = 1..n (decisions) and columns j = 1..N (problem sizes)]
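
A runnable C sketch of this sequential scheme for the coins problem (the function and variable names are our own); it keeps only two rows of the table, since row i depends only on row i-1:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define INF (INT_MAX / 2)  /* "no solution", safe against overflow when adding k */

/* Minimum number of coins summing to C, with n coin types of value v[i]
   and available quantity q[i] (the recurrence of the sequential scheme). */
int min_coins(int C, int n, const int v[], const int q[]) {
    int *prev = malloc((C + 1) * sizeof *prev);  /* row i-1 of the table */
    int *cur  = malloc((C + 1) * sizeof *cur);   /* row i of the table */
    prev[0] = 0;
    for (int j = 1; j <= C; j++) prev[j] = INF;  /* no coin types used yet */
    for (int i = 0; i < n; i++) {                /* decision i: coins of type i */
        for (int j = 0; j <= C; j++) {
            cur[j] = INF;
            for (int k = 0; k <= q[i] && k * v[i] <= j; k++)
                if (prev[j - k * v[i]] + k < cur[j])
                    cur[j] = prev[j - k * v[i]] + k;
        }
        int *tmp = prev; prev = cur; cur = tmp;  /* row i becomes row i-1 */
    }
    int best = prev[C];
    free(prev); free(cur);
    return best >= INF ? -1 : best;
}

int main(void) {
    int v[] = {1, 2, 5}, q[] = {10, 10, 10};
    printf("%d\n", min_coins(13, 3, v, q));  /* 4 coins: 5 + 5 + 2 + 1 */
    return 0;
}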

Parallel Dynamic Programming Schemes
Parallel scheme:

for i = 1 to number_of_decisions
  InParallel: for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endInParallel
endfor

[Figure: each row of the table is computed in parallel, with its columns distributed among processes P0, P1, P2, ..., PS, ..., PK-1, PK]

Parallel Dynamic Programming Schemes
Message-passing scheme:

InEachProcessor Pj:
  for i = 1 to number_of_decisions
    communication step
    obtain the optimum solution with i decisions and the problem sizes Pj has assigned
  endfor
endInEachProcessor

[Figure: the N columns of the table distributed in blocks among processors P0, P1, P2, ..., PK-1, PK]
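
A minimal MPI sketch of this message-passing scheme (our own simplification, assuming C+1 is divisible by the number of processes): each process computes its block of quantities, and the communication step rebuilds the complete previous row with MPI_Allgather so that any entry j - k*v_i can be read locally:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define INF (INT_MAX / 2)

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int C = 11, n = 3;                        /* toy instance */
    int v[] = {1, 2, 5}, q[] = {10, 10, 10};
    int cols = (C + 1) / p;                   /* local block width; assumes p divides C+1 */
    int lo = rank * cols;                     /* first quantity owned by this process */

    int *row = malloc((C + 1) * sizeof *row); /* full previous row of the table */
    int *loc = malloc(cols * sizeof *loc);    /* local block of the new row */
    for (int j = 0; j <= C; j++) row[j] = (j == 0) ? 0 : INF;

    for (int i = 0; i < n; i++) {             /* one step per coin type */
        for (int jj = 0; jj < cols; jj++) {
            int j = lo + jj;
            loc[jj] = INF;
            for (int k = 0; k <= q[i] && k * v[i] <= j; k++)
                if (row[j - k * v[i]] + k < loc[jj])
                    loc[jj] = row[j - k * v[i]] + k;
        }
        /* communication step: every process gets the complete new row */
        MPI_Allgather(loc, cols, MPI_INT, row, cols, MPI_INT, MPI_COMM_WORLD);
    }
    if (rank == 0) printf("min coins for C=%d: %d\n", C, row[C]);  /* 3: 5 + 5 + 1 */
    free(row); free(loc);
    MPI_Finalize();
    return 0;
}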

Autotuning in Parallel Dynamic Programming Schemes
Theoretical model, with one communication step per decision and the problem sizes distributed in blocks among the processes Pj:
- Sequential cost, and computational parallel cost (for qi large), in terms of the arithmetic cost tc
- Communication cost per step, in terms of the start-up time ts and the word-sending time tw
The only AP is p. The SPs are tc, ts and tw.
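
To illustrate how such a model drives the choice of p, here is a sketch under an assumed generic cost of the form t(C, p) = n*C*tc/p + n*(p*ts + C*tw); this concrete form is our assumption for illustration, not the formula from the slides:

#include <stdio.h>

/* Assumed generic cost model (illustrative, not the paper's exact formula):
   arithmetic work C*tc per step shared by p processes, plus a per-step
   communication cost of roughly p*ts + C*tw (e.g. a linear allgather). */
static double model(int C, int n, int p, double tc, double ts, double tw) {
    return (double)n * C * tc / p + n * (p * ts + C * tw);
}

/* Pick the p in [1, p_max] with the lowest predicted time. */
int best_p(int C, int n, int p_max, double tc, double ts, double tw) {
    int best = 1;
    for (int p = 2; p <= p_max; p++)
        if (model(C, n, p, tc, ts, tw) < model(C, n, best, tc, ts, tw))
            best = p;
    return best;
}

int main(void) {
    /* illustrative SP values, as would be measured at installation time */
    double tc = 1e-7, ts = 1e-4, tw = 1e-8;
    printf("C = 10000:  p = %d\n", best_p(10000, 100, 8, tc, ts, tw));   /* 3 */
    printf("C = 500000: p = %d\n", best_p(500000, 100, 8, tc, ts, tw));  /* 8 */
    return 0;
}

With these illustrative values the small problem selects few processors and the large one selects all eight, which is the behaviour the autotuning aims to reproduce automatically.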

Autotuning in Parallel Dynamic Programming Schemes
How to estimate the arithmetic SPs: solving a small problem.
How to estimate the communication SPs:
- Using a ping-pong (CP1); a sketch is shown below
- Solving a small problem, varying the number of processors (CP2)
- Solving problems of selected sizes on systems of selected sizes (CP3)
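
A minimal sketch of the CP1 ping-pong (the repetition count and message sizes are illustrative choices): time round trips for two message sizes between processes 0 and 1, and fit t(m) = ts + tw*m to recover the start-up and word-sending times:

#include <mpi.h>
#include <stdio.h>

#define REPS 1000
#define M1 1
#define M2 100000

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    static char buf[M2];
    int sizes[2] = {M1, M2};
    double t[2];

    for (int s = 0; s < 2; s++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++) {
            if (rank == 0) {        /* ping */
                MPI_Send(buf, sizes[s], MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizes[s], MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) { /* pong */
                MPI_Recv(buf, sizes[s], MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, sizes[s], MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t[s] = (MPI_Wtime() - t0) / (2.0 * REPS);  /* one-way time */
    }
    if (rank == 0) {
        double tw = (t[1] - t[0]) / (M2 - M1);     /* per-byte sending time */
        double ts = t[0] - tw * M1;                /* start-up time */
        printf("ts = %g s, tw = %g s/byte\n", ts, tw);
    }
    MPI_Finalize();
    return 0;
}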

Experimental Results
Systems:
- SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
- PenFE: seven Pentium III + FastEthernet
Varying:
- The problem size: C = 10000, 50000, 100000, 500000, with a large value of qi
- The granularity of the computation (the cost of a computational step)

Experimental Results
- CP1: ping-pong (point-to-point communication). Does not reflect the characteristics of the system.
- CP2: executions with the smallest problem (C = 10000), varying the number of processors. Reflects the characteristics of the system, but the time also changes with C. Larger installation time (6 and 9 seconds).
- CP3: executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), with linear interpolation for the other sizes. Larger installation time (76 and 35 seconds).

Experimental Results
[Table: parameter selection; the number of processors selected by the lowest-time execution (LT) and by each method (CP1, CP2, CP3), for granularities 10, 50 and 100 and problem sizes C = 10000, 50000, 100000 and 500000, in SUNEt and PenFE]

Experimental Results
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in SUNEt:

Experimental Results
Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in PenFE:

Experimental Results
Three types of users are considered:
- GU (greedy user): uses all the available processors.
- CU (conservative user): uses half of the available processors.
- EU (expert user): uses a different number of processors depending on the granularity: 1 for low granularity, half of the available processors for medium granularity, and all the processors for high granularity.

Experimental Results
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt:

Experimental Results
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE:

Conclusions and future work
- The inclusion of autotuning capabilities in a parallel dynamic programming scheme has been considered.
- Different ways of modelling the scheme and of selecting the parameters have been studied.
- Experimentally, the selection proves to be satisfactory, and useful in providing users with routines capable of reduced execution times.
- In the future we plan to apply this technique to other algorithmic schemes, and to hybrid, heterogeneous and distributed systems.