Automatic Parameterisation of Parallel Linear Algebra Routines
Domingo Giménez, Javier Cuenca, José González
University of Murcia, SPAIN
Algèbre Linéaire et Arithmétique: Calcul Numérique, Symbolique et Parallèle
Rabat, Morocco, May 2001

Outline
- Current Situation of Linear Algebra Parallel Routines (LAPRs)
- Objective
- Approach I: Analytical Model of the LAPRs
  Application: Jacobi Method on the Origin 2000
- Approach II: Exhaustive Executions
  Application: Gaussian elimination on networks of processors
- Validation with the LU factorization
- Conclusions
- Future Work

Current Situation of Linear Algebra Parallel Routines (LAPRs)
- Linear Algebra: highly optimizable operations
- Optimizations are platform specific
- Traditional method: hand-optimization for each platform

Problems of the traditional method
- Time-consuming
- Incompatible with hardware evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for dynamic systems
- Prone to misuse by non-expert users

Current approaches
ATLAS, FLAME, I-LIB:
- Analyse platform characteristics in detail
- Sequential code
- Empirical results of the LAPR
Pro: automation. Con: high installation time.

Our objective
Develop a methodology for obtaining Automatically Tuned Software
(figure: the execution environment feeds the auto-tuning software)

Methodology
- Routines parameterised with: system parameters, algorithmic parameters
- System parameters obtained at installation time, either:
  - from the analytical model of the routine and simple installation routines that measure the system parameters, or
  - from a reduced number of executions at installation time
- Algorithmic parameters obtained at run time, either:
  - from the analytical model with the system parameters obtained in the installation process, or
  - from the file with information generated in the installation process

Analytical Modelling
- System parameters obtained at installation time:
  analytical model of the routine and simple installation routines to obtain the system parameters
- Algorithmic parameters obtained at run time:
  from the analytical model with the system parameters obtained in the installation process

Analytical Model
The behaviour of the algorithm on the platform is defined by
    T_exec = f(SPs, n, APs)
where
    SPs = f(n, APs)   System Parameters
    APs               Algorithmic Parameters
    n                 Problem Size
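To make this concrete, the sketch below shows how a routine could pick its APs at run time by minimizing the modelled T_exec over a candidate set. All names here are hypothetical, and the toy t_exec_model only stands in for the routine-specific model derived in the following slides.

#include <float.h>
#include <stddef.h>

/* Hypothetical stand-in for the routine-specific model
   T_exec = f(SPs, n, APs); the real model is derived per routine. */
static double t_exec_model(int n, int p, int b,
                           double k3, double ts, double tw) {
    double flops = 2.0 * n * n * n / 3.0;   /* O(n^3) arithmetic */
    double steps = (double)n / b;           /* number of block steps */
    return k3 * flops / p + steps * (ts + tw * (double)n * b);
}

/* Run-time AP selection: keep the (p, b) pair that minimizes the
   modelled execution time for the given problem size n. */
static void choose_aps(int n, double k3, double ts, double tw,
                       int *best_p, int *best_b) {
    static const int ps[] = {1, 2, 4, 8};       /* candidate processor counts */
    static const int bs[] = {16, 32, 64, 128};  /* candidate block sizes */
    double best = DBL_MAX;
    for (size_t i = 0; i < sizeof ps / sizeof *ps; i++)
        for (size_t j = 0; j < sizeof bs / sizeof *bs; j++) {
            double t = t_exec_model(n, ps[i], bs[j], k3, ts, tw);
            if (t < best) { best = t; *best_p = ps[i]; *best_b = bs[j]; }
        }
}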

Analytical Model: LAPR Performance
System Parameters (SPs) capture:
- the hardware platform: physical characteristics and current conditions
- the basic libraries
How to estimate each SP?
1. Obtain the kernel that dominates the performance cost of the LAPR
2. Build an Estimation Routine from this kernel
Two kinds of SPs:
- Arithmetic System Parameters (ASPs)
- Communication System Parameters (CSPs)

Analytical Model
Arithmetic System Parameters (ASPs):
- t_c: arithmetic cost; when BLAS is used, the costs per level are k1, k2 and k3
The computation kernel of the LAPR gives the Estimation Routine, which uses:
- a similar storage scheme
- a similar quantity of data
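As an illustration of such an estimation routine (a sketch, not the authors' code; it assumes a CBLAS interface to the installed BLAS is available), k3 can be measured by timing a block-sized DGEMM and dividing by its flop count:

#include <stdlib.h>
#include <sys/time.h>
#include <cblas.h>   /* CBLAS interface to the installed BLAS */

static double wall_seconds(void) {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1e-6;
}

/* Estimate k3 (time per floating-point operation in a level-3 kernel)
   by timing a b x b matrix-matrix product, b being a candidate block
   size; DGEMM performs 2*b^3 flops here. */
double estimate_k3(int b) {
    double *A = calloc((size_t)b * b, sizeof *A);
    double *B = calloc((size_t)b * b, sizeof *B);
    double *C = calloc((size_t)b * b, sizeof *C);
    double t0 = wall_seconds();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                b, b, b, 1.0, A, b, B, b, 0.0, C, b);
    double t1 = wall_seconds();
    free(A); free(B); free(C);
    return (t1 - t0) / (2.0 * b * b * b);
}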

Analytical Model
Communication System Parameters (CSPs):
- t_s: start-up time
- t_w: word-sending time
The communication kernel of the LAPR gives the Estimation Routine, which uses:
- a similar kind of communication
- a similar quantity of data
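t_s and t_w can be estimated with a standard ping-pong test; the sketch below is an illustration, not the authors' code. Here a "word" is one double, and only ranks 0 and 1 take part: timing messages of two sizes gives two equations t(n) = t_s + t_w * n, solved for the two parameters.

#include <mpi.h>

/* Time one-way transfer of n doubles between ranks 0 and 1
   (half of a measured round trip, averaged over reps iterations). */
static double one_way(double *buf, int n, int rank, int reps) {
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    return (MPI_Wtime() - t0) / (2.0 * reps);
}

void estimate_csps(int rank, double *ts, double *tw) {
    enum { N1 = 1, N2 = 1 << 16, REPS = 100 };
    static double buf[N2];
    double t1 = one_way(buf, N1, rank, REPS);
    double t2 = one_way(buf, N2, rank, REPS);
    *tw = (t2 - t1) / (N2 - N1);   /* per-word cost */
    *ts = t1 - *tw * N1;           /* start-up cost */
}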

Analytical Model
Algorithmic Parameters (APs): values chosen for each execution
- b: block size
- p: number of processors
- r x c: logical topology (grid configuration, logical 2D mesh)

The Methodology, step by step
Pre-installation (manual):
1. Build the Analytical Model: T_exec = f(SPs, n, APs)
2. Write the Estimation Routines for the SPs
Installation on a platform (automatic):
3. Estimate the SPs using the Estimation Routines of step 2
4. Write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize T_exec
Execution:
The user executes the LAPR for a size n; the LAPR obtains the optimal APs
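The Configuration File of step 4 could be as simple as a table mapping each analysed problem size to the best parameters found; the format below is hypothetical, not taken from the talk:

# n      p     b     r x c
512      4     32    2 x 2
1024     8     64    2 x 4
2048     8     64    2 x 4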

Application Example
LAPR: one-sided block Jacobi method to solve the symmetric eigenvalue problem
- Message-passing with MPI
- Logical ring & logical 2D mesh
Platform: SGI Origin 2000

Application Example: Algorithm Scheme
(figure: block storage scheme of the matrix, with block size b and n/r rows per processor)

Application Example: Pre-installation
1. Build the Analytical Model: T_exec = f(SPs, n, APs)

Application Example: Pre-installation
2. Write the Estimation Routines for the SPs:
- k3: matrix-matrix multiplication with DGEMM
- k1: Givens rotation applied to 2 vectors with DROT
- t_s and t_w: communications along the 2 directions of the 2D mesh

Application Example: Installation
3. Estimate the SPs using the Estimation Routines
(table: measured k1 and k3 for block sizes b = 32 to 128; t_s = 20 µs, t_w = 0.1 µs)

Application Example: Executing
(figure: comparison of execution times using different sets of execution parameters, 4 processors)

Application Example: Executing
(figure: comparison of execution times using different sets of execution parameters, 8 processors)

Application Example: Executing
- LAPR: one-sided block Jacobi method
- Algorithmic Parameters: block size, mesh topology
- Platform: SGI Origin 2000 with message passing
- System Parameters: arithmetic costs, communication costs
- Satisfactory reduction of the execution time: from 25% above the optimum down to only 2% above it

Outline
- Current Situation of Linear Algebra Parallel Routines (LAPRs)
- Objective
- Approach I: Analytical Model of the LAPRs
  Application: Jacobi Method on the Origin 2000
- Approach II: Exhaustive Executions
  Application: Gaussian elimination on networks of processors
- Validation with the LU factorization
- Conclusions
- Future Work

Exhaustive Execution
- System parameters obtained at installation time:
  installation routines performing a reduced number of executions
- Algorithmic parameters obtained at run time:
  from the file with information generated in the installation process

Exhaustive Execution
The behaviour of the algorithm on the platform is defined (as in Analytical Modelling) by
    T_exec = f(SPs, n, APs)
where
    SPs = f(n, APs)   System Parameters
    APs               Algorithmic Parameters
    n                 Problem Size

Exhaustive Execution
Identify the Algorithmic Parameters (APs), as in Analytical Modelling: values chosen for each execution
- b: block size
- p: number of processors
- r x c: logical topology (grid configuration, logical 2D mesh)

The Methodology, step by step
Pre-installation (manual):
1. Determine the APs
2. Decide heuristics to reduce execution time in the installation process
Installation on a platform (automatic):
3. The manager decides the problem sizes to be analysed
4. Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize T_exec
Execution:
The user executes the LAPR for a size n; the LAPR obtains the optimal APs
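A hypothetical driver for step 4 is sketched below: unlike the analytical approach, it times real executions of the routine for each candidate block size and records the fastest. The routine being installed is passed in as run_lapr; the file name and candidate values are assumptions, not from the talk.

#include <float.h>
#include <stddef.h>
#include <stdio.h>
#include <mpi.h>

/* For each problem size chosen by the manager, run the routine with
   every candidate block size and record the fastest in a
   Configuration File ("n b" per line). */
static void install(void (*run_lapr)(int n, int b),
                    const int *sizes, int nsizes) {
    static const int bs[] = {16, 32, 64, 128};  /* candidate block sizes */
    FILE *cfg = fopen("lapr.cfg", "w");
    if (cfg == NULL) return;
    for (int s = 0; s < nsizes; s++) {
        int best_b = bs[0];
        double best = DBL_MAX;
        for (size_t j = 0; j < sizeof bs / sizeof *bs; j++) {
            double t0 = MPI_Wtime();
            run_lapr(sizes[s], bs[j]);          /* a real timed execution */
            double t = MPI_Wtime() - t0;
            if (t < best) { best = t; best_b = bs[j]; }
        }
        fprintf(cfg, "%d %d\n", sizes[s], best_b);
    }
    fclose(cfg);
}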

Application Example
LAPR: Gaussian elimination
- Message-passing with MPI
- Logical ring, rowwise block-cyclic striped partitioning
Platform: networks of processors (heterogeneous system)

Application Example: Pre-installation
1. Determine the APs:
- logical ring, rowwise block-cyclic striped partitioning
- p: number of processors
- b: block size for the data distribution (different block sizes b_i in heterogeneous systems)
(figure: rows distributed cyclically in blocks of sizes b_0, b_1, b_2, b_0, b_1, b_2, ...)
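For illustration, here is a sketch (a hypothetical helper, not from the talk) of how a global row maps to its owner under this heterogeneous block-cyclic distribution:

/* Map a global row index to its owning processor under a rowwise
   block-cyclic distribution with per-processor block sizes b[0..p-1].
   One round of the cycle distributes b[0] + b[1] + ... + b[p-1]
   consecutive rows, as in the b_0, b_1, b_2, ... pattern above. */
int row_owner(int row, const int *b, int p) {
    int round = 0;
    for (int i = 0; i < p; i++) round += b[i];
    int offset = row % round;       /* position inside the current cycle */
    for (int i = 0; i < p; i++) {
        if (offset < b[i]) return i;
        offset -= b[i];
    }
    return -1;                      /* unreachable for valid inputs */
}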

Application Example: Pre-installation
2. Decide heuristics to reduce execution time in the installation process:
- Execution time varies in a continuous way with the problem size and the APs
- Consider the system as homogeneous
- Installation can finish:
  - when analytical and experimental predictions coincide
  - when a certain time has been spent on the installation

Application Example: Installation
Homogeneous systems:
3. The manager decides the problem sizes
4. Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize T_exec
Heterogeneous systems:
3. The manager decides the problem sizes
4. Execute and write:
   - a Configuration File: for each n, the APs that minimize T_exec
   - a Speed File, with the relative speeds of the processors in the system

Application Example: Installation Routines
- RI-THE: obtains p and b from the formula
- RI-HOM: obtains p and b through a reduced number of executions
- RI-HET: 1. as RI-HOM; 2. obtains b_i for each processor
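The talk does not spell out how RI-HET derives each b_i in step 2. One natural policy, sketched below under that assumption, makes each processor's block proportional to its relative speed from the Speed File:

/* Derive per-processor block sizes b[i] from relative speeds s[i]
   (e.g. read from the Speed File), keeping the total cycle length B
   fixed: faster processors get proportionally larger blocks. This is
   one plausible policy, not the talk's exact rule. */
void blocks_from_speeds(const double *s, int p, int B, int *b) {
    double total = 0.0;
    for (int i = 0; i < p; i++) total += s[i];
    int assigned = 0;
    for (int i = 0; i < p; i++) {
        b[i] = (int)(B * s[i] / total + 0.5);  /* round to nearest */
        if (b[i] < 1) b[i] = 1;                /* every processor gets work */
        assigned += b[i];
    }
    b[0] += B - assigned;  /* absorb rounding error in the first block */
}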

Application Example: Systems
Three different configurations:
- PLA_HOM: 5 SUN Ultra-1
- PLA_HYB: 5 SUN Ultra-1 + 1 SUN Ultra-5
- PLA_HET: 1 SUN Ultra-1 + 1 SUN Ultra-5 + 1 SUN Ultra-1 (manages the file system)

Application Example: Executing
(figure: experimental results in PLA_HOM, quotient between the execution time with the parameters from the Installation Routine and the optimum execution time)

Application Example: Executing
(figure: experimental results in PLA_HYB, quotient between the execution time with the parameters from the Installation Routine and the optimum execution time)

Application Example: Executing
(figure: experimental results in PLA_HET, quotient between the execution time with the parameters from the Installation Routine and the optimum execution time)

Comparison
Two techniques for automatic tuning of Parallel Linear Algebra Routines:
1. Analytical Modelling: for predictable systems (homogeneous, static, ...) like the Origin 2000
2. Exhaustive Execution: for less predictable systems (heterogeneous, dynamic, ...) like networks of workstations
Both are:
- transparent to the user
- close to the optimum in execution time

Outline
- Current Situation of Linear Algebra Parallel Routines (LAPRs)
- Objective
- Approach I: Analytical Model of the LAPRs
  Application: Jacobi Method on the Origin 2000
- Approach II: Exhaustive Executions
  Application: Gaussian elimination on networks of processors
- Validation with the LU factorization
- Conclusions
- Future Work

Validation with the LU factorization
To validate the methodology it is necessary to experiment with:
- More routines: block LU factorization
- More systems:
  - Architectures: IBM SP2 and Origin 2000
  - Libraries: reference BLAS, machine BLAS, ATLAS

Sequential LU
Analytical Model: T_exec = f(SPs, n, APs)
- SPs: costs of arithmetic operations of the different levels: k1, k2, k3
- APs: block size b
(figure: blocked LU decomposition of the matrix with block size b)

Sequential LU: Comparison on the IBM SP2
(figure: quotient between different execution times and the optimum execution time)

Sequential LU: model execution time vs. optimum execution time
(figure: quotient between the execution time with the parameters provided by the model and the optimum execution time, with different basic libraries, on the SUN Ultra-1)

Parallel LU
Analytical Model: T_exec = f(SPs, n, APs)
- SPs: costs of arithmetic operations (k1, k2, k3) and of communications (t_s, t_w)
- APs: block size b, number of processors p, grid configuration r x c
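The transcript does not reproduce the model itself. For blocked LU on a p = r x c grid, models of this kind typically have the following shape; this is a hedged sketch of the usual form, not the authors' exact formula:

T_{\mathrm{exec}}(n, b, r, c) \approx
    k_3 \, \frac{2n^3}{3rc}                                   % balanced level-3 arithmetic
  + \frac{n}{b}\left(\log_2 r + \log_2 c\right) t_s           % one broadcast per block step in each grid dimension
  + \left(\frac{n^2}{c}\log_2 r + \frac{n^2}{r}\log_2 c\right) t_w   % data volume of the panel and row broadcasts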

Parallel LU: Comparison on the IBM SP2
(figure: quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors)

Parallel LU: Comparison on the Origin 2000
(figure: quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors)

Parallel LU: Conclusions
- The modelling of the algorithm provides satisfactory results in different systems:
  - Origin 2000, IBM SP2
  - reference BLAS, machine BLAS, ATLAS
- The prediction is worse in some cases:
  - when the number of processors increases
  - in multicomputers where communications are more important (IBM SP2)
  In those cases: Exhaustive Executions

Parallel LU: Exhaustive Execution
If the manager installs the routine for sizes 512, 1536 and 2560, and executions are performed for sizes 1024, 2048 and 3072, the execution time is well predicted. The same policy can be used in the installation of other software.
(figure: quotient between the execution time with the parameters provided by the installation process and the optimum execution time, with ScaLAPACK, on the IBM SP2)
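At run time, this policy only needs a lookup of the installed size nearest to the requested n. The sketch below is hypothetical (interpolation between neighbouring sizes is another option) and assumes a two-column "n b" Configuration File like the one written by the installation driver sketched earlier:

#include <stdio.h>
#include <stdlib.h>

/* Return the block size recorded at installation time for the
   installed problem size nearest to n. */
int lookup_b(const char *cfg_path, int n) {
    FILE *cfg = fopen(cfg_path, "r");
    int ni, bi, best_b = 64, best_d = -1;   /* 64: fallback default */
    while (cfg && fscanf(cfg, "%d %d", &ni, &bi) == 2) {
        int d = abs(ni - n);
        if (best_d < 0 || d < best_d) { best_d = d; best_b = bi; }
    }
    if (cfg) fclose(cfg);
    return best_b;
}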

Conclusions
- Parameterisation of Parallel Linear Algebra Routines enables the development of Automatically Tuned Software
- Two techniques can be used: Analytical Modelling, Exhaustive Executions, or a combination of both
- Experiments have been performed on different systems and with different routines

Future Work
- Develop a methodology valid for a wide range of systems, and include it in the design of linear algebra libraries: the methodology must be analysed on more systems and with more routines
- Architecture of an Automatically Tuned Linear Algebra Library
- At the moment routines are analysed individually, but it could be preferable to analyse algorithmic schemes

Architecture of an Automatically Tuned Linear Algebra Library
(figure, built up incrementally over several slides: the library designer provides the Library, the Installation Routines and the Basic Routines Declaration; the manager provides the Basic Routines Library and the Installation File; the Installation step produces the SP File and the AP File, which the Compilation step combines with the Library)