Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems


Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems

Javier Cuenca, Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Spain (javiercm@ditec.um.es)
Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain (domingo@dif.um.es, dis.um.es/~domingo)
Juan-Pedro Martínez, Departamento de Estadística y Matemática Aplicada, Universidad Miguel Hernández de Elche, Spain (jp.martinez@uhm.es)

11 November 2018 HeteroPar2004

Our Goal

General goal: to obtain parallel routines with autotuning capacity.
Previous works: linear algebra routines on homogeneous systems.
This communication: parallel dynamic programming schemes on heterogeneous systems.
In the future: apply the techniques to other algorithmic schemes.

Outline

Parallel Dynamic Programming Schemes
Autotuning in Parallel Dynamic Programming Schemes
Work Distribution
Experimental Results

Parallel Dynamic Programming Schemes

There are different parallel dynamic programming schemes. The simple scheme of the "coins problem" is used: given a quantity C and n types of coins with values v = (v1, v2, ..., vn) and quantities q = (q1, q2, ..., qn) of each type, minimize the number of coins used to give C. The granularity of the computation has been varied in order to study the scheme, not the problem.

Parallel Dynamic Programming Schemes

Sequential scheme:

for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
  complete row i of the table with the recurrence formula (formula on slide)
endfor

(The slide shows the table, with rows indexed by decisions i = 1..n and columns by problem sizes j = 1..N.)
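The sequential scheme above, instantiated for the coins problem, can be sketched as follows (a minimal illustration; the function name and table representation are not from the slides):

```python
INF = float("inf")

def min_coins(C, v, q):
    """Coins problem by dynamic programming: row i of the table holds
    the optimum using only the first i coin types; column j is the
    quantity j to be given."""
    n = len(v)
    best = [0] + [INF] * C          # row 0: no coin types available
    for i in range(n):              # one row per decision (coin type)
        row = [INF] * (C + 1)
        for j in range(C + 1):      # the loop parallelised in the scheme
            # try using k coins of value v[i], respecting the quantity q[i]
            for k in range(q[i] + 1):
                if k * v[i] > j:
                    break
                if best[j - k * v[i]] + k < row[j]:
                    row[j] = best[j - k * v[i]] + k
        best = row
    return best[C]
```

Each row depends only on the previous one, which is exactly what makes the inner loop over j parallelisable in the schemes that follow.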

Parallel Dynamic Programming Schemes

Parallel scheme:

for i = 1 to number_of_decisions
  in parallel: for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endInParallel
endfor

(Each row of the table is computed in parallel, its columns distributed among the processes P0, P1, P2, ..., PK.)

Parallel Dynamic Programming Schemes

Message-passing scheme, in each processor Pj:

for i = 1 to number_of_decisions
  communication step
  obtain the optimum solution with i decisions and the problem sizes Pj has assigned
endfor

(Each process holds a block of columns of the table, distributed among P0, P1, P2, ..., PK; a communication step precedes the computation of each row.)

Parallel Dynamic Programming Schemes

There are different possibilities in heterogeneous systems:
Heterogeneous algorithms.
Homogeneous algorithms with assignation of:
  one process to each processor, or
  a variable number of processes to each processor, depending on the relative speed.
The general assignation problem is NP-hard, so heuristic approximations are used.

Parallel Dynamic Programming Schemes

Dynamic programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution. The processes p0, p1, ..., pr of the homogeneous algorithm are mapped onto the processors P0, P1, ..., PK, with faster processors receiving more processes (on the slide: p0, p1 on P0; p2 on P1; p3, p4, p5 on P3; ...; pr-1, pr on PK).
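A minimal sketch of this distribution idea: split the columns of the table among processes in proportion to their relative speeds. The helper name and the rounding policy (leftover columns to the fastest processes) are illustrative assumptions, not taken from the slides:

```python
def distribute_columns(problem_size, speeds):
    """Split problem_size columns among processes with the given
    relative (integer) speeds, proportionally."""
    total = sum(speeds)
    sizes = [problem_size * s // total for s in speeds]
    # hand out the columns lost to integer truncation, fastest first
    rest = problem_size - sum(sizes)
    for idx in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:rest]:
        sizes[idx] += 1
    return sizes
```

For example, with one processor 2 times faster than the other two, `distribute_columns(100, [2, 1, 1])` yields `[50, 25, 25]`.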

Autotuning in Parallel Dynamic Programming Schemes

The model: t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))

Problem size:
  n  number of types of coins
  C  value to give
  v  array of values of the coins
  q  quantity of coins of each type
Algorithmic parameters (AP):
  p  number of processes
  b  block size (here n/p)
  d  processes-to-processors assignment
System parameters (SP):
  tc  cost of basic arithmetic operations
  ts  start-up time
  tw  word-sending time

Autotuning in Parallel Dynamic Programming Schemes

Theoretical model:
Sequential cost (formula on slide).
Computational parallel cost, for qi large (formula on slide).
Communication cost of one step, taking maximum values (formula on slide).
The APs are p and the assignation array d. The SPs are the one-dimensional array tc and the two-dimensional arrays ts and tw.

Work distribution

Assignment tree (P types of processors and p processes): each level assigns one more process to a processor type, and types are chosen in non-decreasing order, so a node for type t only branches to types t, t+1, ..., P. Some limit on the height of the tree (the number of processes) is necessary.

Work distribution

Assignment tree (P types of processors and p processes). For P = 2 and p = 3 the tree has 10 nodes; in general the number of nodes grows combinatorially with P and p (the slide illustrates the counts with Pascal's triangle).
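The node count can be checked by enumerating the non-decreasing assignments directly. The closed form C(P+p, p), counting the root plus every non-decreasing sequence of at most p types, is an assumption consistent with the 10-node example for P = 2, p = 3:

```python
from math import comb

def assignments(P, p, start=1):
    """Yield every non-decreasing assignment of 1..p processes to the
    processor types start..P (one tuple per non-root tree node)."""
    for t in range(start, P + 1):
        yield (t,)
        if p > 1:
            for rest in assignments(P, p - 1, t):
                yield (t,) + rest

def tree_nodes(P, p):
    """Nodes of the assignment tree, counting the root."""
    return 1 + sum(1 for _ in assignments(P, p))
```

For P = 2 and p = 3 this gives 10 nodes, matching both the slide and `comb(2 + 3, 3)`.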

Work distribution

Systems:
SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster), Ethernet.
TORC (Innovative Computing Laboratory): 21 nodes of different types (dual and single Pentium II, III and 4, AMD Athlon, ...), FastEthernet, Myrinet, ...

Work distribution

Assignment tree for SUNEt, with P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5). While one process is assigned to each processor, the tree branches over U5 and U1; when more processes than available processors are assigned to a type of processor, the costs of operations (the SPs) change.

Work distribution

Assignment tree for TORC, with P = 4 types of processors used:
Type 1: one 1.7 GHz Pentium 4 (only one process can be assigned).
Type 2: one 1.2 GHz AMD Athlon.
Type 3: one 600 MHz single Pentium III.
Type 4: eight 550 MHz dual Pentium III.
The 4 individual dual processors are not expanded in the tree: two consecutive processes are assigned to a same node, and the values of the SPs change.

Work distribution

Branch and bound, or backtracking with node elimination, is used to search through the tree. The theoretical execution model estimates the cost at each node, taking the highest SP values among the types of processors considered and multiplying them by the number of processes assigned to the most loaded processor of that type.

Work distribution

The theoretical execution model is also used to obtain a lower bound for each node. For example, with an array of types of processors (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si, and an array of assignations a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1), and the maximum achievable speed is computed from the speeds si of the processors marked in pa (formula on slide). The minimum arithmetic cost is obtained from this speed, and the lowest communication costs from those between the processors in the array of assignations.
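The pa array from the example can be reproduced with a short sketch. Since the slide's bound formula is an image lost in the transcript, summing the speeds of the processors marked in pa is an assumption, as are the helper names:

```python
def possible_assignations(types, a):
    """Mark the processors that hold, or can still receive, a process
    given the partial assignation a. Types are assigned in
    non-decreasing order, so types below the last assigned one are
    closed to further processes."""
    last = a[-1]
    holding = {}
    for t in a:                           # processes already placed per type
        holding[t] = holding.get(t, 0) + 1
    pa = []
    for t in types:
        if holding.get(t, 0) > 0:         # already holds a process
            pa.append(1)
            holding[t] -= 1
        elif t >= last:                   # still open for assignment
            pa.append(1)
        else:
            pa.append(0)
    return pa

def max_achievable_speed(types, speeds, a):
    # assumed bound: total speed of the processors marked in pa
    pa = possible_assignations(types, a)
    return sum(s for s, ok in zip(speeds, pa) if ok)
```

With the slide's example, `possible_assignations((1,1,1,2,2,2,3,3,3,4,4,4), (2,2,3))` yields `[0,0,0,1,1,0,1,1,1,1,1,1]`.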

Experimental Results

Systems:
SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet.
TORC: 11 nodes of different types (one 1.7 GHz Pentium 4 + one 1.2 GHz AMD Athlon + one 600 MHz Pentium III + eight 550 MHz dual Pentium III) + FastEthernet.
Varying:
The problem size: C = 10000, 50000, 100000, 500000, with large values of qi.
The granularity of the computation (the cost of a computational step).

Experimental Results

How to estimate the arithmetic SPs: by solving a small problem on each type of processor.
How to estimate the communication SPs:
CP1: a ping-pong between each pair of processors, and between processes in the same processor. It does not reflect the characteristics of the system.
CP2: solving a small problem while varying the number of processors, with linear interpolation. Larger installation time.
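The interpolation step of CP2 can be sketched as below. The helper name, the measurement format, and the out-of-range behaviour are illustrative assumptions:

```python
def interpolate_sp(measured, p):
    """CP2 sketch: estimate a system parameter for p processes by
    linear interpolation between measured points (p_i, value_i)."""
    pts = sorted(measured)
    for (p0, v0), (p1, v1) in zip(pts, pts[1:]):
        if p0 <= p <= p1:
            return v0 + (v1 - v0) * (p - p0) / (p1 - p0)
    raise ValueError("p outside the measured range")
```

For instance, with values 10.0 measured for 2 processes and 20.0 for 4, the estimate for 3 processes is 15.0.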

Experimental Results

Three types of users are considered:
GU (greedy user): uses all the available processors, with one process per processor.
CU (conservative user): uses half of the available processors (the fastest), with one process per processor.
EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity:
  one process on the fastest processor, for low granularity;
  as many processes as half of the available processors, on the appropriate processors, for middle granularity;
  as many processes as processors, on the appropriate processors, for large granularity.

Experimental Results

Quotient between the execution time with the parameters selected by each of the selection methods and the modelled users, and the lowest execution time, in SUNEt (chart on slide).

Experimental Results

Parameters selection in TORC, with CP2 (table flattened in the transcript; blank cells as on the slide):

  C       gra   LT
  50000   10    (1,2)
          50    (1,2,4,4)
          100
  100000
  500000        (1,2,3,4)

Experimental Results

Parameters selection in TORC (without the 1.7 GHz Pentium 4), with CP2:
Type 1: one 1.2 GHz AMD Athlon.
Type 2: one 600 MHz single Pentium III.
Type 3: eight 550 MHz dual Pentium III.

(Table flattened in the transcript; blank cells as on the slide.)

  C       gra   LT          CP2
  50000   10    (1,1,2)     (1,1,2,3,3,3,3,3,3)
          50                (1,1,2,3,3,3,3,3,3,3,3)
          100   (1,1,3,3)
  100000        (1,1,3)
  500000        (1,1,2,3)

Experimental Results

Quotient between the execution time with the parameters selected by each of the selection methods and the modelled users, and the lowest execution time, in TORC (chart on slide).

Experimental Results

Quotient between the execution time with the parameters selected by each of the selection methods and the modelled users, and the lowest execution time, in TORC without the 1.7 GHz Pentium 4 (chart on slide).

Conclusions and future work

The inclusion of autotuning capacities in a parallel dynamic programming scheme for heterogeneous networks of processors has been considered. Parameters selection is combined with heuristic search in the assignation tree. Experimentally, the selection proves satisfactory, providing users with routines capable of reduced execution times. In the future we plan to apply this technique to other algorithmic schemes.