Research in parallel routines optimization
16 December 2005, Universidad de Murcia
Domingo Giménez, Dpto. de Informática y Sistemas
Javier Cuenca, Dpto. de Ingeniería y Tecnología de Computadores
Universidad de Murcia, http://dis.um.es/~domingo
... and more: J. González (Intel Barcelona), L.P. García (Politécnica Cartagena), A.M. Vidal (Politécnica Valencia), G. Carrillo (?), P. Alberti (U. Magallanes), P. Alonso (Politécnica Valencia), J.P. Martínez (U. Miguel Hernández), J. Dongarra (U. Tennessee), K. Roche (?)

Outline: A little history, Modelling Linear Algebra Routines, Installation routines, Autotuning routines, Modifications to libraries' hierarchy, Polylibraries, Algorithmic schemes, Heterogeneous systems, Hybrid programming, Peer to peer computing


A little history. Parallel optimization in the past: hand-optimization for each platform, which was
● time consuming
● incompatible with hardware evolution
● incompatible with changes in the system (architecture and basic libraries)
● unsuitable for systems with variable workloads
● misused by non-expert users

A little history. Initial solutions to this situation:
● problem-specific solutions
● polyalgorithms
● installation tests

A little history. Problem-specific solutions:
● Brewer (1994): sorting algorithms, differential equations
● Frigo (1997): FFTW, the Fastest Fourier Transform in the West
● LAWRA (1997): Linear Algebra With Recursive Algorithms

A little history. Polyalgorithms:
● Brewer
● FFTW
● PHiPAC (1997): linear algebra

A little history. Installation tests:
● ATLAS (2001): dense linear algebra, sequential
● Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm
● I-LIB (2000): some parallel linear algebra routines

A little history. Parallel optimization today:
● optimization based on computational kernels
● systematic development of routines
● auto-optimization of routines
● middleware for auto-optimization

A little history. Optimization based on computational kernels:
● efficient kernels (BLAS) and algorithms based on these kernels
● auto-optimization of the basic kernels (ATLAS)

A little history. Systematic development of routines:
● FLAME project (R. van de Geijn + E. Quintana + ...): dense linear algebra, based on object-oriented design
● LAWRA: dense linear algebra, for shared memory systems

A little history. Auto-optimization of routines:
● at installation time: ATLAS (Dongarra + Whaley), I-LIB (Kanada + Katagiri + Kuroda), SOLAR (Cuenca + Giménez + González), LFC (Dongarra + Roche)
● at execution time: solve a reduced problem in each processor (Kalinov + Lastovetsky), or use a system evaluation tool (NWS)

A little history. Middleware for auto-optimization:
● LFC: middleware for dense linear algebra software in clusters
● hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher level libraries
● FIBER: proposal of general middleware, evolution of I-LIB
● mpC: for heterogeneous systems

A little history. Parallel optimization in the future?
● skeletons and languages
● heterogeneous and variable-load systems
● distributed systems
● P2P computing

A little history. Skeletons and languages: develop skeletons for parallel algorithmic schemes together with execution time models, and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa).

A little history. Heterogeneous and variable-load systems:
● heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
● homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
● variable-load systems treated as dynamically heterogeneous

A little history. Distributed systems:
● intrinsically heterogeneous and variable-load
● very high cost of communications
● special middleware is necessary (Globus, NWS)
● there can be servers to attend the queries of clients

A little history. P2P computing:
● users can go in and out dynamically
● all the users are of the same type (initially)
● it is distributed, heterogeneous and variable-load
● but special middleware is necessary

Outline: A little history, Modelling Linear Algebra Routines, Installation routines, Autotuning routines, Modifications to libraries' hierarchy, Polylibraries, Algorithmic schemes, Heterogeneous systems, Hybrid programming, Peer to peer computing

Modelling Linear Algebra Routines. It is necessary to predict accurately the execution time and to select:
● the number of processes
● the number of processors
● which processors
● the number of rows and columns of processes (the topology)
● the processes-to-processors assignment
● the computational block size (in linear algebra algorithms)
● the communication block size
● the algorithm (polyalgorithms)
● the routine or library (polylibraries)

Modelling Linear Algebra Routines. Cost of a parallel program:

    t(n, p) = t_arith + t_comm + t_over - t_overlap

where t_arith is the arithmetic time, t_comm the communication time, t_over the overhead (synchronization, imbalance, process creation, ...), and t_overlap the overlapping of communication and computation.

Modelling Linear Algebra Routines. Estimation of the time, considering computation and communication divided into a number of steps:

    t(n, p) = sum over the steps of ( max over the processes of t_arith + max over the processes of t_comm )

taking, for each part of the formula, the value of the process which gives the highest cost.

Modelling Linear Algebra Routines. The time depends on the problem size (n) and the system size (p), but also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors.

Modelling Linear Algebra Routines. And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (t_c), and the start-up (t_s) and word-sending (t_w) times.

Modelling Linear Algebra Routines. LU factorisation (Golub - Van Loan), with the matrix partitioned into blocks:

    ( A11 A12 A13 )   ( L11         ) ( U11 U12 U13 )
    ( A21 A22 A23 ) = ( L21 L22     ) (     U22 U23 )
    ( A31 A32 A33 )   ( L31 L32 L33 ) (         U33 )

Step 1: factorise the current diagonal block (unblocked LU)
Step 2: solve multiple lower triangular systems (blocks U12, U13, ...)
Step 3: solve multiple upper triangular systems (blocks L21, L31, ...)
Step 4: update the south-east blocks

Modelling Linear Algebra Routines. If the blocks are of size 1 the operations are all performed on individual elements and the execution time is of the form (2/3) n^3 t_c, but if the block size is b the cost takes the form (2/3) n^3 k_3 plus lower-order terms in n^2 b k_2, with k_3 and k_2 the costs of operations performed with BLAS 3 or BLAS 2 routines.

Modelling Linear Algebra Routines. But the cost of different operations of the same BLAS level is different, and the theoretical cost could be better modelled with one parameter per basic routine (for LU: k_{2,getf2}, k_{3,trsmm}, k_{3,gemm}). Thus, the number of SYSTEM PARAMETERS increases, and ...

Modelling Linear Algebra Routines. The value of each System Parameter can depend on the problem size (n) and on the values of the Algorithmic Parameters (b). The formula has the form

    t(n) = f(n, AP, SP(n, AP))

and what we want is to obtain the values of the APs with which the lowest execution time is obtained.

Modelling Linear Algebra Routines. The values of the System Parameters could be obtained:
● with installation routines associated to each linear algebra routine
● from information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
● at execution time, by testing the system conditions prior to the call to the routine

Modelling Linear Algebra Routines. These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters. In the latter case a multidimensional table of values, as a function of the problem size and of the Algorithmic Parameters, is stored; when a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size, and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.
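
As a minimal sketch of this selection step (in C, with hypothetical names and a deliberately simplified model; the real formulas are those of the following slides, and the SP values would come from the installation file), the stored table of System Parameter values is looked up at the closest installation size, and all candidate Algorithmic Parameter combinations are evaluated:

    /* Sketch: choose block size b and mesh r x c minimizing the modelled
       time, using SP values measured at installation time. */
    #include <float.h>
    #include <stdlib.h>

    #define NB 4                     /* candidate block sizes */
    #define NR 3                     /* candidate numbers of mesh rows */
    #define NSZ 5                    /* problem sizes used at installation */

    static const int bs[NB] = {16, 32, 64, 128};
    static const int rs[NR] = {1, 2, 4};
    static const int sizes[NSZ] = {512, 1024, 2048, 3072, 4096};

    /* k3[s][b]: cost per flop measured at installation for problem size
       sizes[s] and block size bs[b]; ts, tw: communication parameters.
       Placeholders here; the real values are read from the file. */
    static double k3[NSZ][NB];
    static double ts = 30e-6, tw = 0.1e-6;

    static double model_time(int n, int b, int r, int c, double k3v) {
        double arith = (2.0 / 3.0) * (double)n * n * n / (r * c) * k3v;
        double comm = ((double)n / b) * (ts + b * ((double)n / c) * tw);
        return arith + comm;          /* skeleton of t(n) = f(n, AP, SP) */
    }

    void select_ap(int n, int p, int *b_opt, int *r_opt) {
        int s = 0;                    /* stored size closest to n */
        for (int i = 1; i < NSZ; i++)
            if (abs(sizes[i] - n) < abs(sizes[s] - n)) s = i;
        double best = DBL_MAX;
        for (int ib = 0; ib < NB; ib++)
            for (int ir = 0; ir < NR; ir++) {
                int r = rs[ir], c = p / r;
                if (c < 1 || r * c != p) continue;
                double t = model_time(n, bs[ib], r, c, k3[s][ib]);
                if (t < best) { best = t; *b_opt = bs[ib]; *r_opt = r; }
            }
    }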

Modelling Linear Algebra Routines. Parallel block LU factorisation: distribution of the computations of the first step over the mesh of processors (figure).

Modelling Linear Algebra Routines. Distribution of the computations on the successive steps: second step, third step (figures).

Modelling Linear Algebra Routines. The cost of parallel block LU factorisation:
● tuning Algorithmic Parameters: the block size b, and the 2D mesh of p processors: p = r × c, d = max(r, c)
● System Parameters: the costs of the arithmetic operations (k_{2,getf2}, k_{3,trsmm}, k_{3,gemm}) and the communication parameters (t_s, t_w)

Modelling Linear Algebra Routines. The cost of parallel block QR factorisation:
● tuning Algorithmic Parameters: the block size b, and the 2D mesh of p processors: p = r × c
● System Parameters: the costs of the arithmetic operations (k_{2,geqr2}, k_{2,larft}, k_{3,gemm}, k_{3,trmm}) and the communication parameters (t_s, t_w)

Modelling Linear Algebra Routines. The same basic operations appear repeatedly in different higher level routines: the information generated for one routine (say LU) could be stored and used for other routines (e.g. QR), and a common format is necessary to store the information.

Modelling Linear Algebra Routines. Parallel QR factorisation on an IBM-SP2 with 8 processors (figure: execution time in seconds against problem size for the 'mean', 'model' and 'optimum' cases):
● 'mean' refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time which could be obtained by a non-expert user)
● 'optimum' is the lowest time of all the executions performed with representative values of the Algorithmic Parameters
● 'model' is the execution time with the values selected with the model

Outline: A little history, Modelling Linear Algebra Routines, Installation routines, Autotuning routines, Modifications to libraries' hierarchy, Polylibraries, Algorithmic schemes, Heterogeneous systems, Hybrid programming, Peer to peer computing

Installation Routines. In the formulas of the parallel block LU factorisation, the values of the System Parameters (k_{2,getf2}, k_{3,trsmm}, k_{3,gemm}, t_s, t_w) must be estimated as functions of the problem size (n) and of the Algorithmic Parameters (b, r, c).

Installation Routines. This is done by running, at installation time, Installation Routines associated to the linear algebra routine, and storing the information generated, to be used at running time. Each linear algebra routine must therefore be designed together with the corresponding installation routines, and the installation process must be detailed.

Installation Routines. k_{3,gemm} is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c). Because during the execution the size of the matrix to work with decreases, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.

Installation Routines. For k_{3,trsmm}, two multiple triangular systems are solved, one upper triangular of size b × n/c, and another lower triangular of size n/r × b. Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r. As for the previous parameter, values can be obtained for different problem sizes.

Installation Routines. k_{2,getf2} corresponds to a sequential level-2 LU factorisation of size b × b. At installation time each of the basic routines is executed varying the values of the parameters they depend on, with representative values (selected by the routine designer or the system manager), and the information generated is stored in a file to be used at running time, or included in the code of the linear algebra routine before its installation.

Installation Routines. t_s and t_w appear in communications of three types. In one of them a block of size b × b is broadcast in a row, and the parameter depends on b and c. In another a block of size b × b is broadcast in a column, and the parameter depends on b and r. And in the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors; these parameters depend on n, b, r and c.

Installation Routines. In practice each System Parameter depends on a reduced number of Algorithmic Parameters, but this is known only after the installation process is completed. The routine designer also designs the installation process, and can use his experience to guide the installation. The basic installation process can be designed to allow the intervention of the system manager.

Installation Routines. Some results in different systems (physical and logical platform). [Table: values of k_{3,DTRMM} (≈ k_{3,DGEMM}) in microseconds, for n up to 4096 and several block sizes, on the different platforms: SUN with refBLAS, macBLAS and ATLAS; PIII with ATLAS; PPC with macBLAS; R10K with macBLAS.]

Installation Routines. [Table: values of k_{2,DGEQR2} (≈ k_{2,DLARFT}) in microseconds, for n up to 4096 and several block sizes, on the same platforms.]

Installation Routines. Typically the values of the communication parameters are well estimated with a ping-pong. [Table: t_s / t_w, in microseconds, for n up to 4096: 20 / ... on the Origin 2K (Mac-MPI), 75 / ... on the IBM-SP2 (Mac-MPI), 60 / ... on the PIII cluster (MPICH), 170 / ... on the SUN cluster (MPICH).]
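
A minimal ping-pong sketch (in C with MPI; the message sizes and repetition counts are illustrative): t_s and t_w are recovered by fitting t(m) = t_s + m·t_w to the one-way times of two message sizes, measured between two processes.

    /* Sketch: estimate ts and tw from the one-way times of two
       message sizes; run with (at least) two MPI processes. */
    #include <mpi.h>
    #include <stdio.h>

    static char buf[1 << 20];

    static double oneway(int m, int reps) {
        int rank;
        MPI_Status st;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        return (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int m1 = 1024, m2 = 1 << 20, reps = 100;
        double t1 = oneway(m1, reps), t2 = oneway(m2, reps);
        double tw = (t2 - t1) / (m2 - m1);  /* word-sending time, per byte */
        double ts = t1 - m1 * tw;           /* start-up time */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) printf("ts = %g s  tw = %g s/byte\n", ts, tw);
        MPI_Finalize();
        return 0;
    }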

Outline: A little history, Modelling Linear Algebra Routines, Installation routines, Autotuning routines, Modifications to libraries' hierarchy, Polylibraries, Algorithmic schemes, Heterogeneous systems, Hybrid programming, Peer to peer computing

Autotuning routines. The life cycle:
● DESIGN: modelling the Linear Algebra Routine (LAR)
● INSTALLATION: obtaining information from the system
● RUN-TIME: selection of the parameter values and execution of the LAR

Autotuning routines. DESIGN PROCESS: the LAR (Linear Algebra Routine) is made by the LAR designer. Example of LAR: parallel block LU factorisation.

Autotuning routines. Modelling the LAR: from the LAR, a MODEL is obtained.

Autotuning routines. Modelling the LAR: T_exec = f(SP, AP, n), where SP are the System Parameters, AP the Algorithmic Parameters and n the problem size. This is made by the LAR designer, only once per LAR.

Autotuning routines. For the parallel block LU factorisation: SP: k_3, k_2, t_s, t_w; AP: p = r × c, b; n: problem size.

Autotuning routines. From the MODEL, the SP-estimators are implemented.

Autotuning routines. Implementation of the SP-estimators:
● estimators of the arithmetic SPs: the computation kernel of the LAR, with a similar storage scheme and a similar quantity of data
● estimators of the communication SPs: the communication kernel of the LAR, with a similar kind of communication and a similar quantity of data

Autotuning routines. INSTALLATION PROCESS: done only once per platform, by the system manager.

Autotuning routines. Estimation of the static SPs: the SP-estimators are run, using the basic libraries and the installation file, and the results are stored in the static-SP file.

Autotuning routines. In the estimation of the static SPs:
● basic libraries: the basic communication library (MPI, PVM) and the basic linear algebra library (reference BLAS, machine-specific BLAS, ATLAS)
● installation file: the SP values are obtained using the information (n and AP values) of this file

Autotuning routines. Estimation of the static SPs t_{w-static} (in microseconds, as a function of the message size in Kbytes) and k_{3-static} (in microseconds, as a function of the block size), on a cluster of Pentium III + Fast Ethernet, with ATLAS and MPI as basic libraries (figures).
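
A minimal sketch of an arithmetic SP-estimator (in C; it assumes a CBLAS interface and a POSIX clock, and in practice the sizes would be taken from the installation file): k_{3,gemm} for a given block size is the measured DGEMM time divided by the flop count, for blocks with the shape used in the LU update.

    /* Sketch: cost per flop (k3,gemm) of an m x b times b x n update. */
    #include <cblas.h>
    #include <stdlib.h>
    #include <time.h>

    static double now(void) {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + 1e-9 * t.tv_nsec;
    }

    double estimate_k3(int m, int n, int b) {
        double *A = malloc(sizeof(double) * m * b);
        double *B = malloc(sizeof(double) * b * n);
        double *C = malloc(sizeof(double) * m * n);
        for (int i = 0; i < m * b; i++) A[i] = 1.0;
        for (int i = 0; i < b * n; i++) B[i] = 1.0;
        for (int i = 0; i < m * n; i++) C[i] = 0.0;
        double t0 = now();
        /* C <- C - A * B, the south-east update of the factorisation */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, b, -1.0, A, b, B, n, 1.0, C, n);
        double t = now() - t0;
        free(A); free(B); free(C);
        return t / (2.0 * m * (double)n * b);   /* time per flop */
    }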

Autotuning routines. RUN-TIME PROCESS: the optimum AP values are selected, using the model and the static-SP file, and the LAR is executed with them.

Autotuning routines. Experiments. LAR: block LU factorisation. Platforms: IBM SP2, SGI Origin 2000, NoW. Basic libraries: reference BLAS, machine BLAS, ATLAS.

Autotuning routines. LU on the IBM SP2: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time, varying the values of the parameters (figure).

Autotuning routines. LU on the Origin 2000: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time, varying the values of the parameters (figure).

Autotuning routines. LU on the NoW: quotient between the execution time with the parameters selected by the model and the lowest experimental execution time, varying the values of the parameters (figure).

Outline: A little history, Modelling Linear Algebra Routines, Installation routines, Autotuning routines, Modifications to libraries' hierarchy, Polylibraries, Algorithmic schemes, Heterogeneous systems, Hybrid programming, Peer to peer computing

Modifications to libraries' hierarchy. In the optimization of routines the same individual basic operations appear repeatedly: the cost formulas of LU and QR share System Parameters such as k_{3,gemm}, t_s and t_w.

Modifications to libraries' hierarchy. The information generated to install a routine could be used for other routines without additional experiments:
● t_s and t_w are obtained when the communication library (MPI, PVM, ...) is installed
● k_{3,gemm} is obtained when the basic computational library (BLAS, ATLAS, ...) is installed

Modifications to libraries' hierarchy. To determine:
● the type of experiments necessary for the different routines in the library: are t_s and t_w obtained with a ping-pong, a broadcast, ...? is k_{3,gemm} obtained for small block sizes, ...?
● the format in which the data will be stored, to facilitate their use when installing other routines

Modifications to libraries' hierarchy. The method should be valid not only for one library (the one I am developing) but also for other libraries that I or somebody else will develop in the future; therefore the type of experiments and the format in which the data will be stored must be decided by the Parallel Linear Algebra Community ... and the typical hierarchy of libraries would change.

Modifications to libraries' hierarchy. Typical hierarchy of Parallel Linear Algebra libraries (figure): ScaLAPACK over PBLAS; PBLAS over BLAS and BLACS; LAPACK over BLAS; BLACS over the communication library.

Modifications to libraries' hierarchy. The idea is to include self-optimisation information at the lowest levels of the hierarchy, BLAS and the communication library (figure).

Modifications to libraries' hierarchy. When installing the libraries at a higher level this information can be used, and new information is generated (figure).

Modifications to libraries' hierarchy. And so on at the higher levels (figure).

Modifications to libraries' hierarchy. And new libraries with autotuning capacity could be developed on top (inverse eigenvalue problem, least squares problem, PDE solver), each with its own self-optimisation information (figure).

Modifications to libraries' hierarchy. Movement of information between routines at the different levels of the hierarchy: GETRF from LAPACK (level 1), with its GETRF_manager, its k_3 information and the GETRF model, exchanges information with GEMM from BLAS (level 0), with its GEMM_manager, its k_3 information and the GEMM model (figure).


Modifications to libraries' hierarchy. Architecture of a Self-Optimized Linear Algebra Routine (figure): a SOLAR_manager combines the model (T_exec = f(SP, AP, n), with SP = f(AP, n)), the installation information (problem sizes n_1 ... n_w, AP values AP_1 ... AP_z), one SP_manager per System Parameter (holding the installation SP values and the current SP values), the current problem size n_c, and the current system information (CPU availability %CPU_1 ... %CPU_p and network availability %net), to produce the optimum AP values AP_0 for the call LAR(n, AP).

Outline: A little history, Modelling Linear Algebra Routines, Installation routines, Autotuning routines, Modifications to libraries' hierarchy, Polylibraries, Algorithmic schemes, Heterogeneous systems, Hybrid programming, Peer to peer computing

Polylibraries. Different basic libraries can be available:
● reference BLAS, machine-specific BLAS, ATLAS, ...
● MPICH, machine-specific MPI, PVM, ...
● reference LAPACK, machine-specific LAPACK, ...
● ScaLAPACK, PLAPACK, ...
The idea is to use a number of different basic libraries to develop a polylibrary.

Polylibraries. Typical parallel linear algebra libraries hierarchy: ScaLAPACK, PBLAS, LAPACK, BLAS, BLACS, and MPI, PVM, ... (figure).

Polylibraries. A possible parallel linear algebra polylibraries hierarchy (figures): the BLAS level is replaced by reference BLAS, machine BLAS and ATLAS; the communication level by machine MPI, LAM, MPICH and PVM; the LAPACK level by machine LAPACK, ESSL and reference LAPACK; and the ScaLAPACK level by machine ScaLAPACK, ESSL and reference ScaLAPACK.

Polylibraries. The advantages of polylibraries:
● a library optimised for the system might not be available
● the characteristics of the system can change
● which library is the best may vary according to the routines and the systems
● even for different problem sizes or different data access schemes the preferred library can change
● in a parallel system the file system may be shared by processors of different types

Polylibraries. Architecture of a polylibrary: each basic library (Library_1, Library_2, ...) goes through an installation process which generates a Library Installation File (LIF_1, LIF_2, ...) with the measured performance of each routine; for example, Mflops as a function of the problem sizes n and m for DGEMM, or as a function of n and the leading dimension for DROT (figures).

Polylibraries. The same installation process is repeated for each basic library, and the polylibrary offers an interface routine for each routine (interface routine_1, interface routine_2, ...) on top of the installed libraries and their LIFs (figures).

Polylibraries. Each interface routine selects the library with the information of the LIFs, for example:

    interface routine_1:
        if n < value
            call routine_1 from Library_1
        else
            depending on the data storage
                call routine_1 from Library_1
            or
                call routine_1 from Library_2
        ...
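
A minimal sketch of such an interface routine (in C, with hypothetical names; the three wrappers are stubs standing in for the installed BLAS libraries, and the threshold would be filled in at installation time from the LIFs):

    #include <stdio.h>

    typedef void (*gemm_t)(int n, double *A, double *B, double *C);

    /* Stubs for the wrappers of the installed libraries. */
    static void gemm_ref(int n, double *A, double *B, double *C) { printf("refBLAS n=%d\n", n); }
    static void gemm_mac(int n, double *A, double *B, double *C) { printf("macBLAS n=%d\n", n); }
    static void gemm_atl(int n, double *A, double *B, double *C) { printf("ATLAS   n=%d\n", n); }

    /* Filled in at installation time from the LIFs. */
    static int    n_threshold = 600;
    static gemm_t small_gemm  = gemm_atl;
    static gemm_t large_gemm  = gemm_mac;

    /* The polylibrary interface routine: dispatch on the problem size. */
    void poly_gemm(int n, double *A, double *B, double *C) {
        (n < n_threshold ? small_gemm : large_gemm)(n, A, B, C);
    }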

Polylibraries. Combining polylibraries with other optimisation techniques:
● polyalgorithms
● Algorithmic Parameters: block size, number of processors, logical topology of processors

Polylibraries. Experimental results, with routines of different levels in the hierarchy:
● lowest level: GEMM (matrix-matrix multiplication)
● medium level: the LU and QR factorisations
● highest level: a Lift-and-Project algorithm to solve the inverse additive eigenvalue problem, and an algorithm to solve the Toeplitz least squares problem

Polylibraries. The platforms: SGI Origin 2000, IBM-SP2, and different networks of processors (SUN workstations + Ethernet, PCs + Fast Ethernet, PCs + Myrinet).

Experimental Results: GEMM. Routine: GEMM (matrix-matrix multiplication). Platform: five SUN Ultra 1 + one SUN Ultra 5. Libraries: refBLAS, macBLAS, ATLAS1, ATLAS2, ATLAS5. Algorithms and parameters: Strassen (base size), by blocks (block size), direct method.

Experimental Results: GEMM. The matrix-matrix multiplication interface obtained:

    if processor is SUN Ultra 5
        if problem size < 600: solve using ATLAS5 and the Strassen method, with base size half of the problem size
        else if problem size < 1000: solve using ATLAS5 and the block method, with block size 400
        else: solve using ATLAS5 and the Strassen method, with base size half of the problem size
    else if processor is SUN Ultra 1
        if problem size < 600: solve using ATLAS5 and the direct method
        else if problem size < 1000: solve using ATLAS5 and the Strassen method, with base size half of the problem size
        else: solve using ATLAS5 and the direct method

Experimental Results: GEMM. [Table: for each problem size n, the library, method and parameter selected with the model against the combination with the lowest measured time; the entries combine ATLAS2 and ATLAS5 with the direct, block and Strassen methods.]

Experimental Results: LU. Routine: LU factorisation. Platform: 4 Pentium III + Myrinet. Libraries: ATLAS, BLAS for Pentium II, BLAS for Pentium III.

Experimental Results: LU. The cost of parallel block LU factorisation:
● tuning Algorithmic Parameters: the block size b, and the 2D mesh of p processors: p = r × c, d = max(r, c)
● System Parameters: the costs of the arithmetic operations (k_{2,getf2}, k_{3,trsmm}, k_{3,gemm}) and the communication parameters (t_s, t_w)

Experimental Results: LU. [Table: for each problem size n and each library (ATLAS, BLAS-II, BLAS-III), the block size selected with the model, the experimentally best block size and the theoretical one, with the corresponding times.]

Experimental Results: L&P. Routine: Lift-and-Project method for the inverse additive eigenvalue problem. Platform: dual Pentium III. Library combinations:
● La_Re+B_In_Th: reference LAPACK and the installed BLAS, which uses threads
● La_Re+B_II_Th: reference LAPACK and a freely available BLAS for Pentium II, using threads
● La_In_Th+B_In_Th: LAPACK and BLAS installed for the use of threads
● La_Re+B_In: reference LAPACK and the installed BLAS
● La_Re+B_II: reference LAPACK and a freely available BLAS for Pentium II
● La_Re+B_III: reference LAPACK and a freely available BLAS for Pentium III
● La_In+B_In: LAPACK and BLAS installed in the system and supposedly optimized for the machine

Experimental Results: L&P. The theoretical model of the sequential algorithm cost uses the System Parameters: k_{syev} (LAPACK); k_{3,gemm} and k_{3,diaggemm} (BLAS 3); k_{1,dot}, k_{1,scal} and k_{1,axpy} (BLAS 1).

Experimental Results: L&P (figures).

Outline: A little history, Modelling Linear Algebra Routines, Installation routines, Autotuning routines, Modifications to libraries' hierarchy, Polylibraries, Algorithmic schemes, Heterogeneous systems, Hybrid programming, Peer to peer computing

Algorithmic schemes. The idea is to study ALGORITHMIC SCHEMES, and not individual routines. The study could be useful to:
● design libraries to solve problems in different fields: divide and conquer, dynamic programming, branch and bound (La Laguna)
● develop SKELETONS which could be used in parallel programming languages: Skil, Skipper, CARAML, P3L, ...

Dynamic Programming. There are different parallel dynamic programming schemes. The simple scheme of the "coins problem" is used: given a quantity C and n types of coins of values v = (v_1, v_2, ..., v_n), with quantities q = (q_1, q_2, ..., q_n) of each type, minimize the number of coins used to give C. The granularity of the computation has been varied to study the scheme, not the problem.

Dynamic Programming. Sequential scheme (the table has one row per decision i = 1, ..., n and one column per problem size j = 1, ..., N):

    for i = 1 to number_of_decisions
        for j = 1 to problem_size
            obtain the optimum solution with i decisions and problem size j
        endfor
        complete the table with the formula
    endfor
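
A minimal sketch of this sequential scheme for the coins problem (in C; the recurrence over the number of coins of each type is the natural one for this problem, stated here as an illustration rather than as the exact formula of the original work):

    /* Row i of the table holds, for each quantity j, the minimum
       number of coins using only the first i coin types. */
    #include <limits.h>
    #include <stdlib.h>

    #define INF (INT_MAX / 2)   /* "unreachable quantity" marker */

    int coins(int n, const int *v, const int *q, int C) {
        int *prev = malloc((C + 1) * sizeof(int));
        int *cur  = malloc((C + 1) * sizeof(int));
        prev[0] = 0;
        for (int j = 1; j <= C; j++) prev[j] = INF;
        for (int i = 0; i < n; i++) {          /* one decision per coin type */
            for (int j = 0; j <= C; j++) {
                cur[j] = prev[j];              /* take no coins of type i */
                for (int k = 1; k <= q[i] && k * v[i] <= j; k++)
                    if (prev[j - k * v[i]] + k < cur[j])
                        cur[j] = prev[j - k * v[i]] + k;
            }
            int *t = prev; prev = cur; cur = t;
        }
        int best = prev[C];
        free(prev); free(cur);
        return best;
    }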

Dynamic Programming. Parallel scheme (each row of the table is computed in parallel, its columns distributed among the processors P_0, P_1, ..., P_K):

    for i = 1 to number_of_decisions
        in parallel: for j = 1 to problem_size
            obtain the optimum solution with i decisions and problem size j
        endInParallel
    endfor

Dynamic Programming. Message-passing scheme (the N columns of the table are distributed in blocks among the processors P_0, P_1, ..., P_K):

    in each processor P_j:
        for i = 1 to number_of_decisions
            communication step
            obtain the optimum solution with i decisions and the problem sizes P_j has assigned
        endfor
    endInEachProcessor
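
A minimal message-passing sketch of the same scheme (in C with MPI; it reuses the recurrence of the previous sketch and assumes a block distribution of the quantities): each process computes its block of row i, and the communication step makes the complete row available on every process.

    #include <mpi.h>
    #include <limits.h>
    #include <stdlib.h>

    #define INF (INT_MAX / 2)

    int coins_mpi(int n, const int *v, const int *q, int C) {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int chunk = (C + nprocs) / nprocs;       /* ceil((C+1)/nprocs) */
        int *counts = malloc(nprocs * sizeof(int));
        int *displs = malloc(nprocs * sizeof(int));
        for (int r = 0; r < nprocs; r++) {
            displs[r] = r * chunk;
            counts[r] = displs[r] > C ? 0 :
                        (displs[r] + chunk > C + 1 ? C + 1 - displs[r] : chunk);
        }
        int lo = displs[rank], hi = lo + counts[rank] - 1;
        int *prev = malloc((C + 1) * sizeof(int));
        int *cur  = malloc((C + 1) * sizeof(int));
        prev[0] = 0;
        for (int j = 1; j <= C; j++) prev[j] = INF;
        for (int i = 0; i < n; i++) {
            for (int j = lo; j <= hi; j++) {     /* local block of row i */
                cur[j] = prev[j];
                for (int k = 1; k <= q[i] && k * v[i] <= j; k++)
                    if (prev[j - k * v[i]] + k < cur[j])
                        cur[j] = prev[j - k * v[i]] + k;
            }
            /* communication step: everyone gathers the complete row i */
            MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                           cur, counts, displs, MPI_INT, MPI_COMM_WORLD);
            int *t = prev; prev = cur; cur = t;
        }
        int best = prev[C];
        free(prev); free(cur); free(counts); free(displs);
        return best;
    }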

Dynamic Programming. Theoretical model:
● the sequential cost is proportional to the number of table updates, with constant t_c
● the computational parallel cost (q_i large) is the sequential cost divided among the p processes
● the communication cost is that of one communication step per decision, with start-up (t_s) and word-sending (t_w) components
The only AP is p; the SPs are t_c, t_s and t_w.

Dynamic Programming. How to estimate the arithmetic SPs: solving a small problem. How to estimate the communication SPs:
● using a ping-pong (CP1)
● solving a small problem varying the number of processors (CP2)
● solving problems of selected sizes in systems of selected sizes (CP3)

Dynamic Programming. Experimental results. Systems:
● SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
● PenFE: seven Pentium III + Fast Ethernet
Varying: the problem size (C = 10000, 50000, ...), with large values of q_i, and the granularity of the computation (the cost of a computational step).

Dynamic Programming. Experimental results:
● CP1: a ping-pong (point-to-point communication); it does not reflect the characteristics of the system
● CP2: executions with the smallest problem (C = 10000), varying the number of processors; it reflects the characteristics of the system, but the time also changes with C; larger installation time (6 and 9 seconds)
● CP3: executions with selected problem (C = 10000, ...) and system (p = 2, 4, 6) sizes, and linear interpolation for other sizes; larger installation time (76 and 35 seconds)

Dynamic Programming. [Table: number of processors selected by the theoretical model (LT) and by CP1, CP2 and CP3, for different granularities, in SUNEt and PenFE.]

Dynamic Programming. Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in SUNEt (figure).

Dynamic Programming. Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in PenFE (figure).

Dynamic Programming. Three types of users are considered:
● GU (greedy user): uses all the available processors
● CU (conservative user): uses half of the available processors
● EU (expert user): uses a different number of processors depending on the granularity: 1 for low granularity, half of the available processors for middle granularity, and all the processors for high granularity

Dynamic Programming. Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt (figure).

Dynamic Programming. Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE (figure).

Outline: A little history, Modelling Linear Algebra Routines, Installation routines, Autotuning routines, Modifications to libraries' hierarchy, Polylibraries, Algorithmic schemes, Heterogeneous systems, Hybrid programming, Peer to peer computing

Heterogeneous algorithms. New algorithms with unbalanced distribution of data are necessary:
● different SPs for different processors
● the APs include a vector of selected processors and a vector of block sizes
Gauss elimination (figure): column blocks of sizes b_0, b_1, b_2 assigned cyclically to the processors.
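
A minimal sketch of one way to obtain such an unbalanced distribution (in C; the proportional rule is an assumption for illustration, not necessarily the selection procedure of the original work): each processor receives a block size proportional to its relative speed, so that in a cyclic distribution all processors take roughly the same time per sweep.

    /* s[i]: relative speed of processor i (e.g. 1/tc_i from the
       installation tests); b_total: the sum of the block sizes. */
    void block_sizes(int p, const double *s, int b_total, int *b) {
        double sum = 0.0;
        for (int i = 0; i < p; i++) sum += s[i];
        int assigned = 0;
        for (int i = 0; i < p; i++) {
            b[i] = (int)(b_total * s[i] / sum);   /* proportional share */
            if (b[i] < 1) b[i] = 1;               /* keep everybody busy */
            assigned += b[i];
        }
        b[0] += b_total - assigned;               /* remainder to P0 */
    }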

Heterogeneous algorithms. Parameter selection:
● RI-THE: obtains p and b from the formula (homogeneous distribution)
● RI-HOM: obtains p and b through a reduced number of executions (homogeneous distribution)
● RI-HET: obtains p and b through a reduced number of executions, with each block size adapted to the speed of the corresponding processor

Heterogeneous algorithms. Quotient with respect to the lowest experimental execution time of RI-THE, RI-HOM and RI-HET (figures), in three configurations:
● homogeneous system: five SUN Ultra 1
● hybrid system: five SUN Ultra 1 and one SUN Ultra 5
● heterogeneous system: two SUN Ultra 1 (one manages the file system) and one SUN Ultra 5

Parameter selection at running time. The design and installation phases are as in the autotuning scheme; at run time, NWS is called and its information is incorporated (figures).

Parameter selection at running time. The NWS is called and it reports:
● the fraction of available CPU (f_CPU)
● the current word-sending time (t_w_current) for a specific n and specific AP values (n_0, AP_0)
Then the fraction of available network is calculated as f_network = t_w_static(n_0, AP_0) / t_w_current.
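
A minimal sketch of the adjustment (in C; the scaling rule is the natural one given the reported fractions, stated here as an assumption): the static SPs from the installation file are scaled by the availability that NWS reports.

    /* tc/ts/tw come from the static-SP file; f_cpu and tw_current
       come from the NWS call. */
    typedef struct { double tc, ts, tw; } sp_t;

    sp_t adjust_sp(sp_t st, double f_cpu, double tw_current) {
        sp_t cur = st;
        double f_net = st.tw / tw_current;  /* fraction of available network */
        cur.tc = st.tc / f_cpu;             /* less CPU available: slower arithmetic */
        cur.tw = st.tw / f_net;             /* equals tw_current */
        return cur;
    }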

Parameter selection at running time. Situations of the platform load (8 nodes):
● Situation A: all nodes with 100% CPU available, t_w-current = 0.7 µs
● Situation B: nodes 1-4 with 80% CPU (t_w-current = 0.8 µs), nodes 5-8 with 100% (0.7 µs)
● Situation C: nodes 1-4 with 60% CPU (1.8 µs), nodes 5-8 with 100% (0.7 µs)
● Situation D: nodes 1-4 with 60% CPU (1.8 µs), nodes 5-6 with 100% (0.7 µs), nodes 7-8 with 80% (0.8 µs)
● Situation E: nodes 1-4 with 60% CPU (1.8 µs), nodes 5-6 with 100% (0.7 µs), nodes 7-8 with 50% (4.0 µs)

Parameter selection at running time. The values of the SPs are dynamically adjusted according to the current situation: the current arithmetic cost is obtained from the static one and f_CPU, and the current word-sending time from the static one and f_network (figures).

Parameter selection at running time. With the adjusted SPs, the optimum AP values are selected (figures).

Parameter selection at running time. [Tables: block size and number of nodes to use (p = r × c, with meshes from 4 × 2 down to 2 × 1) selected for each problem size n under the platform load situations A-E.]

Parameter selection at running time. Finally, the LAR is executed with the selected AP values (figures).

Parameter selection at running time. [Figures: deviation from the optimum of the static model and of the dynamic model, for three problem sizes, under the platform load situations A-E.]

Work distribution. There are different possibilities in heterogeneous systems:
● heterogeneous algorithms (Gauss elimination)
● homogeneous algorithms, with the assignment of one process to each processor (LU factorization), or of a variable number of processes to each processor, depending on the relative speeds
The general assignment problem is NP-hard, so heuristic approximations are used.

Work distribution. Dynamic Programming (the coins problem scheme): either a heterogeneous algorithm, or a homogeneous algorithm plus an adequate distribution of the processes to the processors (figure: the columns of the table are distributed among the processes p_0, p_1, ..., p_r, which are in turn assigned to the processors P_0, P_1, ..., P_K, possibly several processes per processor).

Work distribution. The model: t(n, C, v, q, t_c(n, C, v, q, p, b, d), t_s(n, C, v, q, p, b, d), t_w(n, C, v, q, p, b, d)), with:
● problem size: n, the number of types of coins; C, the value to give; v, the array of values of the coins; q, the quantity of coins of each type
● Algorithmic Parameters: p, the number of processes; b, the block size (here n/p); d, the processes-to-processors assignment
● System Parameters: t_c, the cost of the basic arithmetic operations; t_s, the start-up time; t_w, the word-sending time

Work distribution. The theoretical model is the same as in the homogeneous case, because the same homogeneous algorithm is used: sequential cost, computational parallel cost (q_i large) and communication cost. There is a new AP, d, and the SPs are now a one-dimensional table (t_c) or two-dimensional tables (t_s, t_w).

Work distribution. Assignment tree (P types of processors and p processes): each level of the tree assigns one more process to one of the types of processors (figure). Some limit on the height of the tree (the number of processes) is necessary.

Work distribution. Assignment tree with P = 2 and p = 3: 10 nodes; in general the number of nodes is the binomial coefficient (p + P choose P).

Work distribution. Assignment tree in SUNEt, with P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5): one process is assigned to each processor, and when more processes than available processors are assigned to a type of processor, the costs of the operations (the SPs) change (figure).

Work distribution. Assignment tree in TORC, with P = 4 types of processors:
● type 1: one 1.7 GHz Pentium 4 (only one process can be assigned)
● type 2: one 1.2 GHz AMD Athlon
● type 3: one 600 MHz single Pentium III
● type 4: eight 550 MHz dual Pentium III
Some branches are not in the tree, and when two consecutive processes are assigned to a same node the values of the SPs change (figure).

Work distribution. Use branch and bound or backtracking (with node elimination) to search through the tree, using the theoretical execution model to estimate the cost at each node, with the highest values of the SPs among those of the types of processors considered, multiplied by the number of processes assigned to the processor of this type with the highest charge.

Work distribution. The theoretical execution model is also used to obtain a lower bound for each node. For example, with an array of types of processors (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds s_i, and an array of assignations a = (2,2,3), the array of possible assignations is pa = (0,0,0,1,1,0,1,1,1,1,1,1); the maximum achievable speed is the sum of the speeds of the processors marked in pa, the minimum arithmetic cost is obtained from this speed, and the lowest communication costs are obtained from those between processors in the array of assignations.
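
A minimal sketch of this optimistic bound (in C, with illustrative names): the remaining processes can at best run on the still-available processors, so the total speed is bounded by the sum of the speeds marked in pa, and a lower bound on the arithmetic time follows by dividing the total arithmetic work by that speed.

    /* s[i]: relative speed of processor i; pa[i]: 1 if processor i
       can still receive processes under the partial assignment. */
    double speed_bound(int nproc, const double *s, const int *pa) {
        double speed = 0.0;
        for (int i = 0; i < nproc; i++)
            if (pa[i]) speed += s[i];      /* processor i still usable */
        return speed;                      /* upper bound on total speed */
    }

    /* Lower bound on the arithmetic time of any completion of the
       partial assignment. */
    double arith_lower_bound(double work, double speed) {
        return work / speed;
    }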

Work distribution. Theoretical model:
● sequential cost, as in the homogeneous case
● computational parallel cost (q_i large), now with the maximum values over the processes in one step
● communication cost, with the maximum values of the communication parameters
The APs are p and the assignment array d; the SPs are the one-dimensional array t_c and the two-dimensional arrays t_s and t_w.

Work distribution. How to estimate the arithmetic SPs: solving a small problem on each type of processor. How to estimate the communication SPs:
● using a ping-pong between each pair of processors, and between processes in the same processor (CP1); it does not reflect the characteristics of the system
● solving a small problem varying the number of processors, with linear interpolation (CP2); larger installation time

Work distribution. Three types of users are considered:
● GU (greedy user): uses all the available processors, with one process per processor
● CU (conservative user): uses half of the available processors (the fastest), with one process per processor
● EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity: 1 process in the fastest processor for low granularity; a number of processes equal to half of the available processors, in the appropriate processors, for middle granularity; and a number of processes equal to the number of processors, in the appropriate processors, for large granularity

Work distribution. Quotient between the execution time with the parameters selected by each one of the selection methods and the modelled users and the lowest execution time, in SUNEt (figure).

Work distribution
● Parameter selection in TORC, with CP2 (table: for each problem size and granularity, the assignation with the lowest time, LT, against the one selected with CP2; the selections are assignations such as (1,2), (1,2,4,4) and (1,2,3,4)).

Work distribution
● Parameter selection in TORC (without the 1.7 GHz Pentium 4), with CP2:
● Type 1: one 1.2 GHz AMD Athlon
● Type 2: one 600 MHz single Pentium III
● Type 3: eight 550 MHz dual Pentium III
(table: assignations with the lowest time, LT, against those selected with CP2; the selections are assignations such as (1,1,2), (1,1,2,3), (1,1,3,3) and (1,1,2,3,3,3,3,3,3,3,3))

Work distribution
● Ratio of the execution time obtained with the parameters selected by each of the selection methods and modelled users to the lowest execution time, in TORC (figure).

Work distribution
● Ratio of the execution time obtained with the parameters selected by each of the selection methods and modelled users to the lowest execution time, in TORC, without the 1.7 GHz Pentium 4 (figure).

Outline
● A little history
● Modelling Linear Algebra Routines
● Installation routines
● Autotuning routines
● Modifications to libraries’ hierarchy
● Polylibraries
● Algorithmic schemes
● Heterogeneous systems
● Hybrid programming
● Peer to peer computing

Hybrid programming
OpenMP:
● Fine-grain parallelism
● Efficient in SMPs
● Sequential and parallel codes are similar
● Tools for development and parallelisation
● Allows run-time scheduling
● Memory allocation can reduce performance
MPI:
● Coarse-grain parallelism
● More portable
● Parallel code very different from the sequential code
● Development and debugging more complex
● Static assignment of processes
● Local memories, which facilitates their efficient use

Hybrid programming
Advantages of Hybrid Programming:
● To improve scalability
● When too many tasks produce load imbalance
● For applications with both fine-grain and coarse-grain parallelism
● To reduce the code development time
● When the number of MPI processes is fixed
● When functional and data parallelism are mixed

Hybrid programming
Hybrid programming in the literature:
● Most papers are about particular applications
● Some papers present hybrid models
● No theoretical models of the execution time are available

Hybrid programming
Systems:
● Networks of dual Pentiums
● HPC160 (four processors per node)
● IBM SP
● Blue Horizon (144 nodes, 8 processors each)
● Earth Simulator (640×8 vector processors)
● …

Hybrid programming
Models:
● MPI+OpenMP: OpenMP is used for loop parallelisation inside each MPI process
● OpenMP+MPI: threads make MPI calls, which is unsafe unless the MPI implementation is thread-safe
● MPI and OpenMP processes in an SPMD model: reduces the cost of communications
A minimal skeleton of the SPMD hybrid model is sketched below.
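One possible shape of the SPMD hybrid model (a sketch, not code from the talk: the per-thread work is a placeholder, and MPI_THREAD_FUNNELED is requested so that only the master thread touches MPI, avoiding the thread-safety problem just mentioned):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* one MPI process per node; OpenMP threads inside it */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    {
        /* each thread works on its share of the node's data (placeholder) */
        local += omp_get_thread_num() + rank;
    }

    /* communication happens outside the parallel region, master thread only */
    double global;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("result: %f\n", global);
    MPI_Finalize();
    return 0;
}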

Hybrid programming
Example: MPI+OpenMP computation of π (the classic MPI pi program, with the local loop parallelised by OpenMP):

      program main
      include 'mpif.h'
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
c     statement function: the integrand 4/(1+x**2)
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
c     process 0 chooses the number of intervals (setting n was
c     omitted on the slide) and broadcasts it
      if (myid .eq. 0) n = 1000
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      h = 1.0d0 / n
      sum = 0.0d0
c     each process takes every numprocs-th interval; the local loop
c     is shared among the OpenMP threads
!$OMP PARALLEL DO REDUCTION(+:sum) PRIVATE(x)
      do 20 i = myid+1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   enddo
!$OMP END PARALLEL DO
      mypi = h * sum
c     the partial results are added on process 0
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end

Hybrid programming
It is not clear that hybrid programming would lower the execution time (figure: results by Lanucara and Rovida for a conjugate gradient solver).

Hybrid programming
It is not clear that hybrid programming would lower the execution time (figure: results by Djomehri and Jin for a CFD solver).

Hybrid programming
It is not clear that hybrid programming would lower the execution time (figure: results by Viet, Yoshinaga, Abderazek and Sowa for a linear system solver).

Hybrid programming
● Matrix-matrix multiplication: it must be decided whether pure MPI SPMD or MPI+OpenMP is preferable.
● MPI+OpenMP: less memory and fewer communications, but it may make worse use of the memory.
(figure: block distribution of the matrices over nodes N0, N1, N2, with processes/threads p0 and p1 in each node)

Hybrid programming
● In the theoretical time model more Algorithmic Parameters appear (a small utility that enumerates the hybrid configurations is sketched below):
● 8 processors: pure MPI process meshes p=r×s: 1×8, 2×4, 4×2, 8×1; in the hybrid version (four dual nodes), the process meshes p=r×s: 1×4, 2×2, 4×1 combine with the thread meshes q=u×v: 1×2, 2×1, for a total of 6 hybrid configurations
● 16 processors: pure MPI process meshes p=r×s: 1×16, 2×8, 4×4, 8×2, 16×1; in the hybrid version (four quad nodes), the process meshes p=r×s: 1×4, 2×2, 4×1 combine with the thread meshes q=u×v: 1×4, 2×2, 4×1, for a total of 9 hybrid configurations
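The counts above are easy to check (a small C utility, not from the talk; N and th encode, e.g., four nodes of four processors):

#include <stdio.h>

int main(void)
{
    int N = 16, th = 4;            /* 16 processors as 4 nodes x 4 threads */
    int nodes = N / th, count = 0;
    for (int r = 1; r <= nodes; r++)        /* process mesh p = r x s */
        if (nodes % r == 0)
            for (int u = 1; u <= th; u++)   /* thread mesh q = u x v  */
                if (th % u == 0) {
                    printf("p=%dx%d  q=%dx%d\n", r, nodes / r, u, th / u);
                    count++;
                }
    printf("total %d hybrid configurations\n", count);  /* 3*3 = 9 */
    return 0;
}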

Hybrid programming
● And more System Parameters:
● The cost of communications is different inside and outside a node (similar to the heterogeneous case with more than one process per processor)
● The cost of the arithmetic operations can vary when the number of threads in the node varies
● Consequently, the algorithms must be recoded and new models of the execution time must be obtained

Hybrid programming
… and the formulas change:
(figure: processes P0, …, P6 on nodes 1, …, 6, with synchronizations inside the nodes and communications between them)
The formula changes; for some systems 6×1 nodes with 1×6 threads could be better, and for others 1×6 nodes with 6×1 threads.

Hybrid programming
● Open problem:
● Is it possible to automatically generate MPI+OpenMP programs from MPI programs? Perhaps for the SPMD model.
● Or at least for some types of programs, such as matrix problems on meshes of processors?
● And is it possible to obtain the execution time of the MPI+OpenMP program from that of the MPI program, together with some description of how the time model was obtained?

Outline
● A little history
● Modelling Linear Algebra Routines
● Installation routines
● Autotuning routines
● Modifications to libraries’ hierarchy
● Polylibraries
● Algorithmic schemes
● Heterogeneous systems
● Hybrid programming
● Peer to peer computing

Peer to peer computing
● Distributed systems:
● They are inherently heterogeneous and dynamic
● But there are other problems:
● Higher communication costs
● Special middleware is necessary
● The typical paradigms are master/slave and client/server, where different types of processors (users) are considered

Peer to peer computing
Peer-to-Peer Computing. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu. HP Laboratories Palo Alto, 2002.

Peer to peer computing
● Peer to peer:
● All the processors (users) are at the same level (at least initially)
● The community selects, in a democratic and continuous way, the topology of the global network
● Would it be interesting to have a P2P system for computing?
● Is a system of this type available?

Peer to peer computing
● Would it be interesting to have a P2P system for computing?
● I think it would be interesting to develop a system of this type
● And to let the community decide, in a democratic and continuous way, whether it is worthwhile
● Is a system of this type available?
● I think there is no pure P2P system dedicated to computation

Peer to peer computing
● … and other people seem to think the same:
● Lichun Ji (2003): “… P2P networks seem to outperform other approaches largely due to the anonymity of the participants in the peer-network, low network costs and the inexpensive disk-space. Trying to apply P2P principles in the area of distributed computation was significantly less successful”
● Arjav J. Chakravarti, Gerald Baumgartner, Mario Lauria (2004): “… current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability”

Peer to peer computing
● There are a lot of tools for grid computing:
● Globus (of course); but does Globus provide computational P2P capacity, or is it a tool with which P2P computational systems can be developed?
● Netsolve/Gridsolve: uses a client/server structure.
● PlanetLab (at present 387 nodes and 162 sites): each site has one Principal Researcher and one System Administrator.

Peer to peer computing
● For computation on P2P, the shared resources are:
● Information: books, papers, …, in the typical way.
● Libraries: one peer takes a library from another peer. A description of the library and of the system is necessary to know whether the library fulfils our requests.
● Computation: one peer collaborates to solve a problem proposed by another peer. This is the central idea of computation on P2P…

Peer to peer computing
● Two peers collaborate in the solution of a computational problem using the hierarchy of parallel linear algebra libraries.
(figure: Peer 1 with a stack of PLAPACK, machine LAPACK, reference BLAS and reference MPI; Peer 2 with ScaLAPACK, PBLAS, BLACS, reference LAPACK, ATLAS and machine MPI)

Peer to peer computing
● There are:
● Different global hierarchies
● Different libraries
(figure: the two peers' library hierarchies, as on the previous slide)

Peer to peer computing
● And the installation information varies, which makes the efficient use of the theoretical model more difficult than in the heterogeneous case.
(figure: the two peers' hierarchies, each annotated with its installation information)

Peer to peer computing
● Trust problems appear:
● Does the library solve the problems we need solved?
● Is the library optimized for the system it claims to be optimized for?
● Is the installation information correct?
● Is the system stable?
There are trust algorithms for P2P systems; are they (or some modification of them) applicable to these trust problems?

Peer to peer computing
● Each peer would have the possibility of establishing a policy of use:
● Whether the use of its resources must be paid for
● The percentage of CPU dedicated to computations for the community
● The types of problems it is interested in
And the MAIN PROBLEM: is it interesting to develop a P2P system for the management and optimization of computational codes?