Advances in the Optimization of Parallel Routines (III)

Presentation transcript:

Advances in the Optimization of Parallel Routines (III)
Domingo Giménez
Departamento de Informática y Sistemas, Universidad de Murcia, Spain
dis.um.es/~domingo
Universidad Politécnica de Valencia, 04 April 2019

Outline
A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries’ hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Hybrid programming
Peer to peer computing

Collaborations and self-references
Algorithmic schemes:
+ J. P. Martínez: Automatic Optimization in Parallel Dynamic Programming Schemes. 2004

Algorithmic schemes
To study ALGORITHMIC SCHEMES, not individual routines. The study could be useful to:
Design libraries to solve problems in different fields: Divide and Conquer, Dynamic Programming, Branch and Bound (La Laguna).
Develop SKELETONS which could be used in parallel programming languages: Skil, Skipper, CARAML, P3L, …

Dynamic Programming
There are different Parallel Dynamic Programming Schemes. The simple scheme of the “coins problem” is used: given a quantity C and n types of coins of values v = (v1, v2, …, vn), with quantities q = (q1, q2, …, qn) of each type, minimize the number of coins used to give C. For example, with C = 8 and v = (1, 2, 5), the optimum uses three coins (5 + 2 + 1).
The granularity of the computation has been varied to study the scheme, not the problem.

Dynamic Programming
Sequential scheme:
for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
  complete the table with the formula
endfor
[Figure: an n × N table indexed by decisions i = 1…n and problem sizes j = 1…N]
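A minimal C sketch of this sequential scheme for the coins problem. The recurrence used here (take k coins of type i and complete with the first i-1 types) is a standard formulation assumed for illustration, since the slide's formula was an image; the values and quantities are made-up examples.

#include <stdio.h>
#include <limits.h>

#define NTYPES 3                     /* n: number of coin types (example) */
#define C      10                    /* quantity to be given (example)    */

int main(void) {
    int v[NTYPES] = {1, 2, 5};       /* coin values (example)             */
    int q[NTYPES] = {C, C, C};       /* quantities of each type (large)   */
    int table[NTYPES + 1][C + 1];    /* table[i][j]: optimum with i types */

    for (int j = 0; j <= C; j++) table[0][j] = (j == 0) ? 0 : INT_MAX;

    for (int i = 1; i <= NTYPES; i++)        /* decisions                 */
        for (int j = 0; j <= C; j++) {       /* problem sizes             */
            int best = INT_MAX;
            for (int k = 0; k * v[i-1] <= j && k <= q[i-1]; k++) {
                int prev = table[i-1][j - k * v[i-1]];
                if (prev != INT_MAX && prev + k < best) best = prev + k;
            }
            table[i][j] = best;              /* complete the table        */
        }
    printf("minimum number of coins for %d: %d\n", C, table[NTYPES][C]);
    return 0;
}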

Dynamic Programming
Parallel scheme:
for i = 1 to number_of_decisions
  in parallel: for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
endInParallel
[Figure: the table columns j are distributed among processors P0, P1, …, PK]
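The same sketch with the j loop parallelised with OpenMP, mirroring the scheme above: the entries of stage i depend only on stage i-1, so they can be computed in parallel (again illustrative data, not the deck's code).

#include <stdio.h>
#include <limits.h>

#define NTYPES 3
#define C      10

int main(void) {
    int v[NTYPES] = {1, 2, 5}, q[NTYPES] = {C, C, C};
    int table[NTYPES + 1][C + 1];

    for (int j = 0; j <= C; j++) table[0][j] = (j == 0) ? 0 : INT_MAX;

    for (int i = 1; i <= NTYPES; i++) {
        /* in parallel: the stage-i entries are mutually independent */
        #pragma omp parallel for schedule(static)
        for (int j = 0; j <= C; j++) {
            int best = INT_MAX;
            for (int k = 0; k * v[i-1] <= j && k <= q[i-1]; k++) {
                int prev = table[i-1][j - k * v[i-1]];
                if (prev != INT_MAX && prev + k < best) best = prev + k;
            }
            table[i][j] = best;
        }
    }
    printf("minimum number of coins: %d\n", table[NTYPES][C]);
    return 0;
}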

Dynamic Programming
Message-passing scheme:
in each processor Pj:
  for i = 1 to number_of_decisions
    communication step
    obtain the optimum solution with i decisions and the problem sizes Pj has assigned
  endfor
endInEachProcessor
[Figure: the N table columns are distributed in blocks among processors P0, P1, …, PK]
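A hedged MPI sketch of the message-passing scheme, assuming a block distribution of the table columns among the processes; the communication step is realised with MPI_Allgatherv, which replicates the previous stage's row before each update (the slide writes the exchange at the top of the loop; this is the same exchange placed at the bottom).

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mpi.h>

#define NTYPES 3
#define C      10

int main(int argc, char **argv) {
    int v[NTYPES] = {1, 2, 5}, q[NTYPES] = {C, C, C};
    int p, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* block distribution of the C+1 column indices among the p processes */
    int *counts = malloc(p * sizeof(int)), *displs = malloc(p * sizeof(int));
    for (int r = 0, off = 0; r < p; r++) {
        counts[r] = (C + 1) / p + (r < (C + 1) % p ? 1 : 0);
        displs[r] = off;
        off += counts[r];
    }
    int lo = displs[rank], cnt = counts[rank];

    int *prev = malloc((C + 1) * sizeof(int));      /* replicated row i-1 */
    int *mine = malloc((cnt > 0 ? cnt : 1) * sizeof(int));
    for (int j = 0; j <= C; j++) prev[j] = (j == 0) ? 0 : INT_MAX;

    for (int i = 1; i <= NTYPES; i++) {
        for (int jj = 0; jj < cnt; jj++) {    /* the sizes Pj has assigned */
            int j = lo + jj, best = INT_MAX;
            for (int k = 0; k * v[i-1] <= j && k <= q[i-1]; k++) {
                int pv = prev[j - k * v[i-1]];
                if (pv != INT_MAX && pv + k < best) best = pv + k;
            }
            mine[jj] = best;
        }
        /* communication step: replicate row i on every process */
        MPI_Allgatherv(mine, cnt, MPI_INT, prev, counts, displs, MPI_INT,
                       MPI_COMM_WORLD);
    }
    if (rank == 0) printf("minimum number of coins: %d\n", prev[C]);
    free(counts); free(displs); free(prev); free(mine);
    MPI_Finalize();
    return 0;
}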

Dynamic Programming
Theoretical model: sequential cost; computational parallel cost (qi large); communication cost of one step of process Pp.
The only AP (algorithmic parameter) is p. The SPs (system parameters) are tc, ts and tw.
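The three cost formulas were images and are lost in the transcript. A hedged reconstruction (an assumption from the scheme above, not the slide's exact expressions), with t_c the cost of a computational step over the n stages:

t_{seq} = n \, C \, t_c
t_{comp} = n \, \frac{C}{p} \, t_c \quad (q_i \text{ large})
t_{comm} = n \left( t_s + \frac{C}{p} \, t_w \right)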

Dynamic Programming
How to estimate arithmetic SPs: solving a small problem.
How to estimate communication SPs:
Using a ping-pong (CP1)
Solving a small problem varying the number of processors (CP2)
Solving problems of selected sizes in systems of selected sizes (CP3)
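A minimal MPI ping-pong in C of the kind CP1 describes (run with at least two processes; the message size and repetition count are arbitrary choices). Timing two different sizes and fitting t = ts + size · tw yields the communication SPs:

#include <stdio.h>
#include <mpi.h>

#define NWORDS 1024              /* message size in words (example) */
#define REPS   100

int main(int argc, char **argv) {
    int rank;
    double buf[NWORDS] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        if (rank == 0) {
            MPI_Send(buf, NWORDS, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NWORDS, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NWORDS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, NWORDS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * REPS);   /* one-way time */
    if (rank == 0)
        printf("one-way time for %d words: %g s\n", NWORDS, t);
    MPI_Finalize();
    return 0;
}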

Dynamic Programming
Experimental results. Systems:
SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
PenFE: seven Pentium III + FastEthernet
Varying:
The problem size: C = 10000, 50000, 100000, 500000, with a large value of qi
The granularity of the computation (the cost of a computational step)

Dynamic Programming
Experimental results:
CP1: ping-pong (point-to-point communication). Does not reflect the characteristics of the system.
CP2: executions with the smallest problem (C = 10000), varying the number of processors. Reflects the characteristics of the system, but the time also changes with C. Larger installation time (6 and 9 seconds).
CP3: executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), with linear interpolation for other sizes. Larger installation time (76 and 35 seconds).

Dynamic Programming
Parameter selection (number of processors selected by LT, CP1, CP2 and CP3), for C = 10000, 50000, 100000 and 500000 and granularity 10, 50 and 100, in SUNEt and PenFE.
[Flattened table; the surviving values (1, 4, 5, 6, 7) cannot be reliably mapped back to their cells.]

Dynamic Programming
Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in SUNEt.

Dynamic Programming
Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in PenFE.

Dynamic Programming
Three types of users are considered:
GU (greedy user): uses all the available processors.
CU (conservative user): uses half of the available processors.
EU (expert user): uses a different number of processors depending on the granularity: 1 for low granularity, half of the available processors for middle granularity, all the processors for high granularity.

Dynamic Programming
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt.

Dynamic Programming
Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE.

Outline
A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries’ hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Hybrid programming
Peer to peer computing

Collaborations and self-references
Heterogeneous systems:
+ G. Carrillo: Installation routines for linear algebra libraries on LANs. 2000
+ J. Cuenca + J. Dongarra + J. González + K. Roche: Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
+ J. Cuenca + J. P. Martínez: Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004

Heterogeneous algorithms
New algorithms with an unbalanced distribution of data are necessary:
Different SPs for different processors.
The APs include a vector of selected processors and a vector of block sizes.
[Figure: Gauss elimination with per-processor block sizes b0, b1, b2; a sketch of such a distribution follows.]
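As an illustration of an unbalanced distribution, a minimal C sketch that chooses per-processor block counts in proportion to relative speeds; the speeds and the total number of blocks are made-up examples, not measured SPs.

#include <stdio.h>

#define P 3                                   /* processors (example)   */

int main(void) {
    double s[P] = {1.0, 1.0, 2.5};            /* relative speeds        */
    int nblocks = 40, b[P], assigned = 0;
    double total = 0.0;
    for (int i = 0; i < P; i++) total += s[i];
    for (int i = 0; i < P; i++) {
        b[i] = (int)(nblocks * s[i] / total); /* proportional share     */
        assigned += b[i];
    }
    b[P-1] += nblocks - assigned;             /* give remainder to last */
    for (int i = 0; i < P; i++)
        printf("processor %d: %d blocks\n", i, b[i]);
    return 0;
}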

Heterogeneous algorithms
Parameter selection:
RI-THE: obtains p and b from the formula (homogeneous distribution).
RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution).
RI-HET: obtains p and a block size for each processor through a reduced number of executions (heterogeneous distribution).

Heterogeneous algorithms
Quotient with respect to the lowest experimental execution time.
[Three charts, one per strategy (RI-THEO, RI-HOMO, RI-HETE), for matrix sizes 500 to 3000; the y-axis shows quotients from 0.5 to 2.]
Heterogeneous system: two SUN Ultra 1 (one manages the file system) and one SUN Ultra 5.
Homogeneous system: five SUN Ultra 1.
Hybrid system: five SUN Ultra 1 and one SUN Ultra 5.

Parameter selection at running time
[Diagram: at installation time the LAR (linear algebra routine) is modelled to obtain its MODEL; SP-estimators are implemented; and the static SPs are estimated using the basic libraries, producing a Static-SP file and an Installation file.]

Parameter selection at running time
[Diagram: the same installation stages, now extended at run time with a call to the NWS, which returns NWS information.]

Parameter selection at running time
The NWS is called and it reports:
·the fraction of available CPU (fCPU)
·the current word sending time (tw_current) for specific n and AP values (n0, AP0).
Then the fraction of available network is calculated (see the sketch below).
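The formula itself was an image and is lost; a hedged reconstruction from the definitions above (an assumption, not the slide's exact expression):

f_{net} = \frac{t_w(n_0, AP_0)}{tw\_current}

i.e. the ratio of the word-sending time measured at installation to the current one reported by the NWS.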

Parameter selection at running time

Situation   CPU avail. nodes 1-4   CPU avail. nodes 5-8     tw-current
A           100%                   100%                     0.7 s
B           80%                    100%                     0.8 s / 0.7 s
C           60%                    100%                     1.8 s / 0.7 s
D           60%                    100% (5,6), 80% (7,8)    1.8 s / 0.7 s / 0.8 s
E           60%                    100% (5,6), 50% (7,8)    1.8 s / 0.7 s / 4.0 s

(Where several tw-current values appear, they correspond to the differently loaded parts of the network.)


Parameter selection at running time
[Diagram: at run time the static SPs are dynamically adjusted with the NWS information, producing the Current-SP values.]

Parameter selection at running time
The values of the SPs are tuned according to the current situation; one plausible adjustment is sketched below.
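The tuning formulas were images and are lost; a hedged reconstruction (an assumption consistent with the NWS quantities above) would scale the static SPs by the measured availability:

t_{c\_current} = \frac{t_{c\_static}}{f_{CPU}} \qquad t_{w\_current} = \frac{t_{w\_static}}{f_{net}} \qquad t_{s\_current} \approx t_{s\_static}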


Parameter selection at running time
[Diagram: with the MODEL and the Current-SP values, the optimum AP values are selected, producing Optimum-AP.]

Parameter selection at running time

Block size b:
n      A    B    C    D    E
1024   32   32   64   64   64
2048   64   64   64   128  128
3072   64   64   128  128  128

Number of nodes to use, p = r × c:
n      A    B    C    D    E
1024   4×2  4×2  2×2  2×2  2×1
2048   4×2  4×2  2×2  2×2  2×1
3072   4×2  4×2  2×2  2×2  2×1

(A–E are the situations of the platform load.)


Parameter selection at running time
[Diagram: finally the LAR is executed with the selected Optimum-AP values.]

Parameter selection at running time
[Charts: quotients for the static model and the dynamic model, for n = 1024, 2048 and 3072, against the situations A–E of the platform load; y-axis from 0% up to 100%–160% depending on n.]

Work distribution
There are different possibilities in heterogeneous systems:
Heterogeneous algorithms (Gauss elimination).
Homogeneous algorithms and assignation of:
one process to each processor (LU factorization)
a variable number of processes to each processor, depending on the relative speed
The general assignation problem is NP-hard, so heuristic approximations are used.

Work distribution
Dynamic Programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution.
[Figure: two copies of the table; with the homogeneous distribution each processor P0 … PK receives one block of columns, while with the heterogeneous distribution the processes p0 … pr are mapped unevenly to processors, e.g. P0 P0 P1 P3 P3 P3 … PS … PK PK]

Work distribution
The model: t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))
Problem size:
n: number of types of coins
C: value to give
v: array of values of the coins
q: quantity of coins of each type
Algorithmic parameters:
p: number of processes
b: block size (here n/p)
d: processes-to-processors assignment
System parameters:
tc: cost of basic arithmetic operations
ts: start-up time
tw: word-sending time

Work distribution
Theoretical model: the same as for the homogeneous case, because the same homogeneous algorithm is used (sequential cost; computational parallel cost, qi large; communication cost of one step of process Pp).
There is a new AP: d.
The SPs are now a unidimensional table (tc) or bidimensional tables (ts, tw).

Work distribution
Assignment tree (P types of processors and p processes): the children of the root are the types 1, 2, 3, …, P, and a node of type t has children t, t+1, …, P, so each branch is a non-decreasing sequence of types.
Some limit on the height of the tree (the number of processes) is necessary.

Work distribution
Assignment tree (P types of processors and p processes): for P = 2 and p = 3 the tree has 10 nodes; in general it has C(P+p, p) nodes, as the binomial coefficients of Pascal's triangle count the nodes level by level.

Work distribution
Assignment tree for SUNEt, with P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5), assigning one process to each processor, up to p processes.
In the nodes, when more processes than available processors are assigned to a type of processor, the costs of the operations (the SPs) change.
[Figure: tree branching over the types U5 and U1.]

Work distribution
Assignment tree for TORC, with P = 4 types of processors used:
one 1.7 GHz Pentium 4 (only one process can be assigned): type 1
one 1.2 GHz AMD Athlon: type 2
one 600 MHz single Pentium III: type 3
eight 550 MHz dual Pentium III: type 4
4 processors are not in the tree. When two consecutive processes are assigned to a same node, the values of the SPs change.

Work distribution
Use Branch and Bound or Backtracking (with node elimination) to search the tree.
Use the theoretical execution model to estimate the cost at each node, taking the highest values of the SPs among those of the types of processors considered, multiplied by the number of processes assigned to the processor of this type with the most charge.

Work distribution
Use Branch and Bound or Backtracking (with node elimination) to search the tree.
Use the theoretical execution model to obtain a lower bound for each node. For example, with an array of types of processors (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si, and an array of assignations a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1), and the maximum achievable speed is obtained from the speeds si of the processors marked in pa. The minimum arithmetic cost is obtained from this speed, and the lowest communication costs are obtained from those between the processors in the array of assignations. A backtracking sketch follows.
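A backtracking sketch in C over this assignment tree (branches are non-decreasing sequences of processor types, up to a height limit). The cost estimate here is a hypothetical stand-in for the theoretical execution model, and the bound mirrors the idea above only schematically; it illustrates the search structure, not the exact heuristic.

#include <stdio.h>
#include <float.h>

#define P_TYPES   4      /* number of processor types (example) */
#define MAX_PROCS 4      /* limit on the height of the tree     */

static double best_time = DBL_MAX;

/* Nodes of the tree are non-decreasing sequences of types.
   'worst' is the slowest relative speed assigned so far.      */
static void search(int *a, int len, int min_type,
                   const double *s, double worst) {
    if (len > 0) {
        double t = 1.0 / (len * worst);   /* hypothetical cost model */
        if (t < best_time) best_time = t;
    }
    if (len == MAX_PROCS) return;
    for (int ty = min_type; ty < P_TYPES; ty++) {
        double w = (s[ty] < worst) ? s[ty] : worst;
        /* bound: no completion through this child can beat best_time */
        if (1.0 / (MAX_PROCS * w) >= best_time) continue;
        a[len] = ty;
        search(a, len + 1, ty, s, w);
    }
}

int main(void) {
    double s[P_TYPES] = {3.0, 2.1, 1.0, 0.9}; /* relative speeds (example) */
    int a[MAX_PROCS];
    search(a, 0, 0, s, 1.0e9);
    printf("best modelled time: %g\n", best_time);
    return 0;
}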

Work distribution
Theoretical model: sequential cost; computational parallel cost (qi large), taking maximum values over the processes; communication cost of one step of process Pp.
The APs are p and the assignation array d.
The SPs are the unidimensional array tc and the bidimensional arrays ts and tw.

Work distribution
How to estimate arithmetic SPs: solving a small problem on each type of processor.
How to estimate communication SPs:
Using a ping-pong between each pair of processors, and between processes in the same processor (CP1). Does not reflect the characteristics of the system.
Solving a small problem varying the number of processors, with linear interpolation (CP2). Larger installation time.

Work distribution
Three types of users are considered:
GU (greedy user): uses all the available processors, with one process per processor.
CU (conservative user): uses half of the available processors (the fastest), with one process per processor.
EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity:
1 process on the fastest processor, for low granularity
as many processes as half of the available processors, on the appropriate processors, for middle granularity
as many processes as processors, on the appropriate processors, for large granularity

Work distribution
Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users, and the lowest execution time, in SUNEt.

Work distribution
Parameter selection in TORC, with CP2, for C = 50000, 100000 and 500000 and granularity 10, 50 and 100: the selected assignations include (1,2), (1,2,4,4) and (1,2,3,4).
[Flattened table; the correspondence of the remaining cells to rows and columns is not recoverable.]

Work distribution
Parameter selection in TORC (without the 1.7 GHz Pentium 4), with CP2:
one 1.2 GHz AMD Athlon: type 1
one 600 MHz single Pentium III: type 2
eight 550 MHz dual Pentium III: type 3
The selected assignations include (1,1,2), (1,1,2,3,3,3,3,3,3), (1,1,2,3,3,3,3,3,3,3,3), (1,1,3,3), (1,1,3) and (1,1,2,3), for C = 50000, 100000 and 500000 and granularity 10, 50 and 100.
[Flattened table; the correspondence of cells to rows and columns is not recoverable.]

Work distribution
Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users, and the lowest execution time, in TORC.

Work distribution
Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users, and the lowest execution time, in TORC (without the 1.7 GHz Pentium 4).

Outline
A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries’ hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Hybrid programming
Peer to peer computing

Hybrid programming
OpenMP:
Fine-grain parallelism
Efficient in SMPs
Sequential and parallel codes are similar
Tools for development and parallelisation
Allows run-time scheduling
Memory allocation can reduce performance
MPI:
Coarse-grain parallelism
More portable
Parallel code very different from the sequential code
Development and debugging more complex
Static assignment of processes
Local memories, which facilitates efficient use

Hybrid programming
Advantages of hybrid programming:
Improved scalability
When too many tasks produce load imbalance
For applications with both fine- and coarse-grain parallelism
Reduction of the code development time
When the number of MPI processes is fixed
In case of a mixture of functional and data parallelism

Hybrid programming
Hybrid programming in the literature:
Most of the papers are about particular applications
Some papers present hybrid models
No theoretical models of the execution time are available

Hybrid programming
Systems:
Networks of dual Pentiums
HPC160 (each node four processors)
IBM SP Blue Horizon (144 nodes, each with 8 processors)
Earth Simulator (640 × 8 vector processors)
…


Hybrid programming
Models:
MPI + OpenMP: OpenMP used for loop parallelisation.
OpenMP + MPI: unsafe threads.
MPI and OpenMP processes in the SPMD model: reduces the cost of communications.


Hybrid programming

program main
include 'mpif.h'
double precision mypi, pi, h, sum, x, f, a
integer n, myid, numprocs, i, ierr
f(a) = 4.d0 / (1.d0 + a*a)
call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
call MPI_BCAST(n,1,MPI_INTEGER,0,
&MPI_COMM_WORLD,ierr)
h = 1.0d0/n
sum = 0.0d0
!$OMP PARALLEL DO REDUCTION (+:sum) PRIVATE (x)
do 20 i = myid+1, n, numprocs
x = h * (dble(i) - 0.5d0)
sum = sum + f(x)
20 enddo
!$OMP END PARALLEL DO
mypi = h * sum
call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,
&MPI_SUM,0,MPI_COMM_WORLD,ierr)
call MPI_FINALIZE(ierr)
stop
end

Hybrid programming
It is not clear whether hybrid programming lowers the execution time. Lanucara, Rovida: Conjugate-Gradient.

Hybrid programming
It is not clear whether hybrid programming lowers the execution time. Djomehri, Jin: CFD Solver.

Hybrid programming
It is not clear whether hybrid programming lowers the execution time. Viet, Yoshinaga, Abderazek, Sowa: Linear system.

Hybrid programming
Matrix-matrix multiplication: MPI SPMD versus MPI+OpenMP; decide which is preferable.
MPI+OpenMP: less memory and fewer communications, but it may have worse memory use.
[Figure: blocks N0, N1, N2 distributed among nodes, each with processes p0, p1.]
A hybrid skeleton is sketched below.
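A hedged C skeleton of the MPI+OpenMP SPMD style discussed here: one MPI process per node with OpenMP threads inside, coarse-grain communication between nodes and fine-grain parallelism within them (the computation is a placeholder, not the matrix multiplication itself).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0, global = 0.0;
    /* fine-grain parallelism inside the node (OpenMP threads) */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += nprocs)
        local += 1.0 / (1.0 + (double)i);

    /* coarse-grain communication between nodes (MPI) */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("result: %g\n", global);
    MPI_Finalize();
    return 0;
}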

Hybrid programming
In the theoretical time model more algorithmic parameters appear:
8 processors: p = r×s: 1×8, 2×4, 4×2, 8×1; or p = r×s: 1×4, 2×2, 4×1 with threads q = u×v: 1×2, 2×1; 6 configurations in total.
16 processors: p = r×s: 1×16, 2×8, 4×4, 8×2, 16×1; q = u×v: 1×4, 2×2, 4×1; 9 configurations in total.

Hybrid programming
And more system parameters:
The cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor).
The cost of arithmetic operations can vary when the number of threads in the node varies.
Consequently, the algorithms must be recoded and new models of the execution time must be obtained.

Hybrid programming
… and the formulas change: communications occur between nodes, and synchronizations among the threads inside each node. For some systems 6×1 nodes with 1×6 threads could be better, and for others 1×6 nodes with 6×1 threads.
[Figure: six nodes, Node 1 … Node 6, with communications between them and synchronizations among the processes P0 … P6 inside a node.]

Hybrid programming
Open problem: is it possible to generate MPI+OpenMP programs automatically from MPI programs? Maybe for the SPMD model, or at least for some types of programs, such as matricial problems on meshes of processors.
And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program plus some description of how the time model has been obtained?

Outline
A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries’ hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Hybrid programming
Peer to peer computing

Peer to peer computing
Distributed systems:
They are inherently heterogeneous and dynamic.
But there are other problems: higher communication cost, and special middleware is necessary.
The typical paradigms are master/slave and client/server, where different types of processors (users) are considered.

Peer to peer computing
Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.

Peer to peer computing
Peer to peer: all the processors (users) are at the same level (at least initially), and the community selects, in a democratic and continuous way, the topology of the global network.
Would it be interesting to have a P2P system for computing? Is some system of this type available?

Peer to peer computing
Would it be interesting to have a P2P system for computing? I think it would be interesting to develop a system of this type, and to leave the community to decide, in a democratic and continuous way, whether it is worthwhile.
Is some system of this type available? I think there is no pure P2P system dedicated to computation.

Peer to peer computing
… and other people seem to think the same:
Lichun Ji (2003): “… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful.”
Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): “… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability.”

Peer to peer computing
There are a lot of tools for grid computing:
Globus (of course); but does Globus provide computational P2P capacity, or is it a tool with which P2P computational systems can be developed?
Netsolve/Gridsolve: uses a client/server structure.
PlanetLab (at present 387 nodes and 162 sites): in each site, one principal researcher and one system administrator.

Peer to peer computing
For computation on P2P the shared resources are:
Information: books, papers, …, in the typical way.
Libraries: one peer takes a library from another peer. A description of the library and the system is necessary to know whether the library fulfils our requests.
Computation: one peer collaborates to solve a problem proposed by another peer. This is the central idea of computation on P2P…

Peer to peer computing
Two peers collaborate in the solution of a computational problem using the hierarchy of parallel linear algebra libraries.
[Figure: the library hierarchies of Peer 1 and Peer 2: PLAPACK / ScaLAPACK on top of PBLAS and BLACS, over reference or machine LAPACK (ATLAS), BLAS, and reference or machine MPI.]

Peer to peer computing
There are different global hierarchies and different libraries.
[Figure: the same two hierarchies, highlighting that the libraries of the two peers differ.]

Peer to peer computing
And the installation information varies, which makes the efficient use of the theoretical model more difficult than in the heterogeneous case.
[Figure: the two hierarchies again, with installation information attached to each library level of each peer.]

Peer to peer computing
Trust problems appear:
Does the library solve the problems we require to be solved?
Is the library optimized for the system it claims to be optimized for?
Is the installation information correct?
Is the system stable?
There are trust algorithms for P2P systems; are they (or some modification of them) applicable to these trust problems?

Peer to peer computing
Each peer would have the possibility of establishing a policy of use:
The use of the resources could be payable
The percentage of CPU dedicated to computations for the community
The types of problems it is interested in
And the MAIN PROBLEM: is it interesting to develop a P2P system for the management and optimization of computational codes?