Advances in the Optimization of Parallel Routines (I). Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain. dis.um.es/~domingo. Universidad Politécnica de Valencia, 24 June 2015.

Outline: A little history. Modelling Linear Algebra Routines. Installation routines. Autotuning routines. Modifications to libraries’ hierarchy. Polylibraries. Algorithmic schemes. Heterogeneous systems. Peer-to-peer computing.

Collaborations and self-references. Modelling Linear Algebra Routines: with J. Cuenca and J. González: Modelling the Behaviour of Linear Algebra Algorithms with Message-passing; Towards the Design of an Automatically Tuned Linear Algebra Library; with J. Cuenca, L. P. García, J. González and A. Vidal: Empirical Modelling of Parallel Linear Algebra Routines, 2003.

Collaborations and self-references. Installation routines: with G. Carrillo: Installation routines for linear algebra libraries on LANs; with G. Carrillo, J. Cuenca and J. González: Optimización automática de rutinas paralelas de álgebra lineal (automatic optimization of parallel linear algebra routines), 2000.

Collaborations and self-references. Autotuning routines: with J. Cuenca and J. González: Automatic parameterization of parallel linear algebra routines; with J. Cuenca: Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines, 2002.

Collaborations and self-references. Modifications to the libraries’ hierarchy: with J. Cuenca and J. González: Architecture of an Automatically Tuned Linear Algebra Library.

Collaborations and self-references. Polylibraries: with P. Alberti, P. Alonso, J. Cuenca and A. Vidal: Designing Polylibraries to Speed Up Parallel Computations, 2003.

Collaborations and self-references. Algorithmic schemes: with J. P. Martínez: Automatic Optimization in Parallel Dynamic Programming Schemes, 2004.

Collaborations and self-references. Heterogeneous systems: with J. Cuenca, J. Dongarra, J. González and K. Roche: Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load; with J. Cuenca and J. P. Martínez: Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems, 2004.

Outline: A little history. Modelling Linear Algebra Routines. Installation routines. Autotuning routines. Modifications to libraries’ hierarchy. Polylibraries. Algorithmic schemes. Heterogeneous systems. Peer-to-peer computing.

A little history. Parallel optimization in the past: hand-optimization for each platform; time consuming; incompatible with hardware evolution; incompatible with changes in the system (architecture and basic libraries); unsuitable for systems with variable workloads; misuse by non-expert users.

A little history. Initial solutions to this situation: problem-specific solutions; polyalgorithms; installation tests.

A little history. Problem-specific solutions: Brewer (1994): sorting algorithms, differential equations; Frigo (1997): FFTW, the Fastest Fourier Transform in the West; LAWRA (1997): Linear Algebra With Recursive Algorithms.

A little history. Polyalgorithms: Brewer; FFTW; PHiPAC (1997): linear algebra.

A little history. Installation tests: ATLAS (2001): dense linear algebra, sequential; Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm; I-LIB (2000): some parallel linear algebra routines.

A little history. Parallel optimization today: optimization based on computational kernels; systematic development of routines; auto-optimization of routines; middleware for auto-optimization.

A little history. Optimization based on computational kernels: efficient kernels (BLAS) and algorithms based on these kernels; auto-optimization of the basic kernels (ATLAS).

A little history. Systematic development of routines: the FLAME project (R. van de Geijn + E. Quintana + …): dense linear algebra, based on object-oriented design; LAWRA: dense linear algebra for shared-memory systems.

A little history. Auto-optimization of routines. At installation time: ATLAS (Dongarra + Whaley); I-LIB (Kanada + Katagiri + Kuroda); SOLAR (Cuenca + Giménez + González); LFC (Dongarra + Roche). At execution time: solve a reduced problem in each processor (Kalinov + Lastovetsky); use a system evaluation tool (NWS).

A little history. Middleware for auto-optimization: LFC: middleware for dense linear algebra software in clusters. Hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher-level libraries. FIBER: proposal of general middleware, evolution of I-LIB. mpC: for heterogeneous systems.

A little history. Parallel optimization in the future?: skeletons and languages; heterogeneous and variable-load systems; distributed systems; P2P computing.

A little history. Skeletons and languages: develop skeletons for parallel algorithmic schemes together with execution time models, and provide the users with these libraries (MALLBA: Málaga - La Laguna - Barcelona) or languages (P3L: Pisa).

A little history. Heterogeneous and variable-load systems: heterogeneous algorithms: unbalanced distribution of data (static or dynamic); homogeneous algorithms: more processes than processors and assignment of processes to processors (static or dynamic); variable-load systems treated as dynamically heterogeneous.

A little history. Distributed systems: intrinsically heterogeneous and variable-load; very high cost of communications; special middleware is necessary (Globus, NWS); there can be servers to attend queries from clients.

A little history. P2P computing: users can go in and out dynamically; all the users are of the same type (initially); it is distributed, heterogeneous and variable-load; but special middleware is necessary.

Outline: A little history. Modelling Linear Algebra Routines. Installation routines. Autotuning routines. Modifications to libraries’ hierarchy. Polylibraries. Algorithmic schemes. Heterogeneous systems. Peer-to-peer computing.

Modelling Linear Algebra Routines. To predict the execution time accurately it is necessary to select: the number of processes; the number of processors; which processors; the number of rows and columns of processes (the topology); the assignment of processes to processors; the computational block size (in linear algebra algorithms); the communication block size; the algorithm (polyalgorithms); the routine or library (polylibraries).

Modelling Linear Algebra Routines. Cost of a parallel program: t(n,p) = t_arit + t_com + t_over - t_overlap, where t_arit is the arithmetic time, t_com the communication time, t_over the overhead (synchronization, imbalance, process creation, ...), and t_overlap the overlapping of communication and computation.

Modelling Linear Algebra Routines. Estimation of the time: computation and communication are considered divided into a number of steps; the times of the steps are added, and for each part of the formula the value taken is that of the process which gives the highest value.
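One way to write the estimate just described, assuming p processes and taking the maximum of each term over the processes at every step:

\[
t(n,p) \;=\; \sum_{k=1}^{\text{steps}} \left( \max_{0 \le i < p} t_{arit}^{(k,i)} \;+\; \max_{0 \le i < p} t_{com}^{(k,i)} \right)
\]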

Modelling Linear Algebra Routines. The time depends on the problem size (n) and on the system size (p), but also on some ALGORITHMIC PARAMETERS, such as the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors.

Modelling Linear Algebra Routines. And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (t_c) and the start-up (t_s) and word-sending (t_w) times.

Modelling Linear Algebra Routines. Blocked LU factorisation (Golub - Van Loan): A = LU, with A, L and U partitioned into blocks A_11, ..., A_33, L_11, ..., L_33 (block lower triangular) and U_11, ..., U_33 (block upper triangular). Step 1: unblocked LU factorisation of the first diagonal block. Step 2: solution of multiple lower triangular systems. Step 3: solution of multiple upper triangular systems. Step 4: update of the south-east blocks.
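In 3 x 3 block form, the four steps described above amount to the following (a standard restatement of blocked LU, not taken verbatim from the slide), after which the same process is repeated on the updated south-east part of the matrix:

\[
\begin{aligned}
A_{11} &= L_{11}U_{11} &&\text{(Step 1: unblocked LU of the diagonal block)}\\
U_{1j} &= L_{11}^{-1}A_{1j},\quad j = 2,3 &&\text{(Step 2: multiple lower triangular systems)}\\
L_{i1} &= A_{i1}U_{11}^{-1},\quad i = 2,3 &&\text{(Step 3: multiple upper triangular systems)}\\
A_{ij} &\leftarrow A_{ij} - L_{i1}U_{1j},\quad i,j = 2,3 &&\text{(Step 4: update of the south-east blocks)}
\end{aligned}
\]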

Modelling Linear Algebra Routines. The execution time: if the blocks are of size 1 the operations are all with individual elements, and the cost is approximately 2n^3/3 operations at cost t_c each; if the block size is b most of the work is done inside BLAS 3 kernels, so the dominant 2n^3/3 term is paid at cost k_3 per operation and the lower-order terms at cost k_2, with k_3 and k_2 the costs of operations performed with BLAS 3 and BLAS 2 routines.

Modelling Linear Algebra Routines. But the cost of different operations of the same level is different, so the theoretical cost is better modelled with one System Parameter per basic routine (for blocked LU: separate costs for the matrix-matrix updates, the triangular solves and the unblocked factorisations). Thus the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...

Modelling Linear Algebra Routines. Moreover, the value of each System Parameter can depend on the problem size (n) and on the values of the Algorithmic Parameters (b, ...). The formula therefore has the form t = f(n, AP, SP(n, AP)), and what we want is to obtain the values of the AP with which the lowest execution time is obtained.
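A minimal sketch of that selection step, assuming the analytical model is available as a function and the candidate parameter values are fixed in advance; every name below (AP, t_model, the placeholder formula, the numeric values) is illustrative and not taken from the actual library:

```c
#include <float.h>
#include <stdio.h>

/* Illustrative algorithmic parameters: block size b on an r x c mesh. */
typedef struct { int b, r, c; } AP;

/* Placeholder cost model: in the real system this would evaluate the
   analytical formula t = f(n, AP, SP(n, AP)) with the measured System
   Parameters; here a generic "computation / p + communication" shape
   is used purely for illustration.                                     */
static double t_model(int n, AP ap, double k3, double ts, double tw) {
    int p = ap.r * ap.c;
    double flops = 2.0 * n * n * n / 3.0;                /* LU-like flop count   */
    double comp  = k3 * flops / p;                       /* parallel arithmetic  */
    double comm  = (n / (double)ap.b) *                  /* one exchange per     */
                   (ts + tw * (double)ap.b * n / ap.c);  /* block step (a guess) */
    return comp + comm;
}

/* Choose the AP combination with the lowest predicted execution time. */
static AP select_ap(int n, const AP *cand, int ncand,
                    double k3, double ts, double tw) {
    AP best = cand[0];
    double tbest = DBL_MAX;
    for (int i = 0; i < ncand; i++) {
        double t = t_model(n, cand[i], k3, ts, tw);
        if (t < tbest) { tbest = t; best = cand[i]; }
    }
    return best;
}

int main(void) {
    AP cand[] = { {32,1,4}, {32,2,2}, {64,2,2}, {64,4,1}, {128,2,2} };
    AP best = select_ap(4096, cand, 5, 1e-9, 30e-6, 0.1e-6);
    printf("chosen: b=%d, r=%d, c=%d\n", best.b, best.r, best.c);
    return 0;
}
```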

Modelling Linear Algebra Routines. The values of the System Parameters could be obtained: with installation routines associated to each linear algebra routine; from information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization; or at execution time, by testing the system conditions prior to the call to the routine.

Modelling Linear Algebra Routines. These values can be obtained as simple values (traditional method) or as functions of the Algorithmic Parameters. In the latter case a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored; when a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size, and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.
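A sketch of the closest-stored-size lookup, assuming the table is kept as a flat array of records (the record layout is an assumption, not the real format):

```c
#include <stdlib.h>

/* One stored measurement: value of a System Parameter for a given
   installation problem size and block size (illustrative layout).   */
typedef struct { int n, b; double value; } SPEntry;

/* Return the entry whose installation size n is closest to the actual
   problem size, among the entries recorded for the requested block size. */
const SPEntry *closest_entry(const SPEntry *table, int nentries,
                             int n_actual, int b) {
    const SPEntry *best = NULL;
    for (int i = 0; i < nentries; i++) {
        if (table[i].b != b) continue;
        if (!best || abs(table[i].n - n_actual) < abs(best->n - n_actual))
            best = &table[i];
    }
    return best;   /* NULL if no entry was stored for this block size */
}
```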

Modelling Linear Algebra Routines. Parallel block LU factorisation: figure showing the matrix, the processors and the distribution of computations in the first step.

Modelling Linear Algebra Routines. Distribution of computations on successive steps: figure showing the second and third steps.

Modelling Linear Algebra Routines. The cost of parallel block LU factorisation is expressed in terms of: tuning Algorithmic Parameters: the block size b and the 2D mesh of p processors, p = r × c, with d = max(r, c); System Parameters: the costs of arithmetic operations k_2,getf2, k_3,trsm and k_3,gemm, and the communication parameters t_s and t_w.

Modelling Linear Algebra Routines. The cost of parallel block QR factorisation is expressed in terms of: tuning Algorithmic Parameters: the block size b and the 2D mesh of p processors, p = r × c; System Parameters: the costs of arithmetic operations k_2,geqr2, k_2,larft, k_3,gemm and k_3,trmm, and the communication parameters t_s and t_w.

Modelling Linear Algebra Routines. The same basic operations appear repeatedly in different higher-level routines: the information generated for one routine (say LU) could be stored and used for other routines (e.g. QR), and a common format is necessary to store the information.
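One possible shape for such a common record, with illustrative field names (the real storage format is not specified in the slides):

```c
/* Illustrative common record for the stored information: one measured
   System Parameter value, keyed by the basic kernel it refers to, so
   that values measured when installing LU (e.g. the gemm cost) can be
   reused when installing QR.  Field names are assumptions.            */
typedef struct {
    char   kernel[16];   /* "gemm", "trsm", "getf2", "geqr2", ...   */
    int    level;        /* BLAS level of the kernel (2 or 3)       */
    int    n, b, r, c;   /* installation problem size and AP values */
    double value;        /* measured cost per operation (microseconds) */
} SPRecord;
```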

Modelling Linear Algebra Routines. Parallel QR factorisation on an IBM-SP2 with 8 processors: plot of time (seconds) against problem size for the cases "mean", "model" and "optimum". "Mean" refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time which could be obtained by a non-expert user); "optimum" is the lowest time of all the executions performed with representative values of the Algorithmic Parameters; "model" is the execution time with the values selected with the model.

Modelling Linear Algebra Routines. Parameter selection for the QR algorithm on a network of Pentium III with Fast Ethernet: table of the selected values of b, r and c for p = 4 and p = 8.

Outline: A little history. Modelling Linear Algebra Routines. Installation routines. Autotuning routines. Modifications to libraries’ hierarchy. Polylibraries. Algorithmic schemes. Heterogeneous systems. Peer-to-peer computing.

Installation Routines. In the formulas (parallel block LU factorisation), the values of the System Parameters (k_2,getf2, k_3,trsm, k_3,gemm, t_s, t_w) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c).

Installation Routines. This is done by running, at installation time, Installation Routines associated to the linear algebra routine, and storing the information generated to be used at running time. Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.

Installation Routines. k_3,gemm is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c). Because during the execution the size of the matrix to work with decreases, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.
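A sketch of this measurement, assuming a CBLAS dgemm is available; the update dimensions follow the description above, while the matrix contents, the single (non-repeated) timing and the choice of timer are simplifications:

```c
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

/* Estimate k3_gemm (seconds per floating-point operation) by timing the
   block update C = C - A*B with A of size (n/r) x b and B of size b x (n/c),
   the operation performed on the south-east blocks of the parallel LU.
   A real installation routine would repeat the timing and average.        */
double estimate_k3_gemm(int n, int b, int r, int c) {
    int m = n / r, k = b, q = n / c;
    double *A = malloc((size_t)m * k * sizeof *A);
    double *B = malloc((size_t)k * q * sizeof *B);
    double *C = malloc((size_t)m * q * sizeof *C);
    for (int i = 0; i < m * k; i++) A[i] = 1.0 / (i + 1);
    for (int i = 0; i < k * q; i++) B[i] = 1.0 / (i + 2);
    for (int i = 0; i < m * q; i++) C[i] = 1.0;

    double t0 = now();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, q, k, -1.0, A, k, B, q, 1.0, C, q);
    double t = now() - t0;

    free(A); free(B); free(C);
    return t / (2.0 * m * q * k);   /* a gemm of these sizes performs 2*m*q*k flops */
}
```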

Installation Routines. For k_3,trsm, two kinds of multiple triangular systems are solved: one upper triangular of size b × n/c, and another lower triangular of size n/r × b. Thus two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r. As for the previous parameter, values can be obtained for different problem sizes.

Installation Routines. k_2,getf2 corresponds to a sequential level 2 LU factorisation of size b × b. At installation time each of the basic routines is executed varying the values of the parameters it depends on, with representative values (selected by the routine designer or the system manager), and the information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation.
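A sketch of the corresponding measurement for the unblocked factorisation, calling the Fortran LAPACK routine dgetf2 directly (the hand-written prototype assumes default 32-bit LAPACK integers; a real installation routine would repeat the timing and average):

```c
#include <stdlib.h>
#include <time.h>

/* Fortran LAPACK unblocked LU factorisation (column-major storage). */
extern void dgetf2_(const int *m, const int *n, double *a, const int *lda,
                    int *ipiv, int *info);

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

/* Estimate k2_getf2 (seconds per flop) by timing an unblocked b x b LU,
   the operation performed on each diagonal block in step 1.            */
double estimate_k2_getf2(int b) {
    double *A = malloc((size_t)b * b * sizeof *A);
    int *ipiv = malloc((size_t)b * sizeof *ipiv);
    int info;
    /* Diagonally dominant matrix, so the factorisation is well defined. */
    for (int j = 0; j < b; j++)
        for (int i = 0; i < b; i++)
            A[j * b + i] = (i == j) ? b : 1.0 / (i + j + 1);

    double t0 = now();
    dgetf2_(&b, &b, A, &b, ipiv, &info);
    double t = now() - t0;

    free(A); free(ipiv);
    return t / (2.0 * b * b * b / 3.0);   /* an LU of order b performs ~2b^3/3 flops */
}
```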

Installation Routines. t_s and t_w appear in communications of three types: in one of them a block of size b × b is broadcast in a row, and this parameter depends on b and c; in another a block of size b × b is broadcast in a column, and the parameter depends on b and r; and in the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors, and these parameters depend on n, b, r and c.
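A sketch of how the first of these broadcasts might be timed with MPI, assuming a row-major mapping of the r × c mesh onto the MPI ranks; fitting the measured times against t_s + t_w·b² over several values of b and c would then give the communication parameters:

```c
#include <stdlib.h>
#include <mpi.h>

/* Time the broadcast of a b x b block of doubles along one row of a
   process mesh with c columns (row-major rank -> (row, column) mapping). */
double time_row_broadcast(MPI_Comm comm, int b, int c) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm rowcomm;
    MPI_Comm_split(comm, rank / c, rank, &rowcomm);   /* one communicator per mesh row */

    double *block = calloc((size_t)b * b, sizeof *block);
    MPI_Barrier(rowcomm);
    double t0 = MPI_Wtime();
    MPI_Bcast(block, b * b, MPI_DOUBLE, 0, rowcomm);
    double t = MPI_Wtime() - t0;

    free(block);
    MPI_Comm_free(&rowcomm);
    return t;
}
```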

Installation Routines. In practice each System Parameter depends on a more reduced number of Algorithmic Parameters, but this is known only after the installation process is completed. The routine designer also designs the installation process, and can draw on his experience to guide the installation. The basic installation process can be designed so as to allow the intervention of the system manager.

Installation Routines. Some results in different systems (physical and logical platform): table of the values of k_3,DTRMM (≈ k_3,DGEMM), in microseconds, for different block sizes and problem sizes (n = 512, ...) on the different platforms: SUN1 and SUN5 with refBLAS, macBLAS and ATLAS; PIII with ATLAS; PPC with macBLAS; R10K with macBLAS.

Installation Routines. Table of the values of k_2,DGEQR2 (≈ k_2,DLARFT), in microseconds, for different block sizes and problem sizes (n = 512, ...) on the same platforms: SUN1 and SUN5 with refBLAS, macBLAS and ATLAS; PIII with ATLAS; PPC with macBLAS; R10K with macBLAS.

Installation Routines. Typically the values of the communication parameters are well estimated with a ping-pong: table of the measured values for problem sizes n = 512, ... on cSUN1 with MPICH (... / 7.0), cPIII with MPICH (... / 0.7), IBM-SP2 with Mac-MPI (... / 0.3) and Origin 2K with Mac-MPI (... / 0.1).
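A sketch of such a ping-pong test in MPI; the message sizes, the repetition count and the way t_s and t_w (here per double, not per byte) are extracted from two round-trip times are illustrative choices:

```c
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

/* Classic ping-pong between ranks 0 and 1: the round-trip time for a
   message of m doubles is roughly 2*(t_s + t_w*m), so timing two
   message sizes gives estimates of t_s and t_w.                      */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int sizes[2] = { 1, 100000 };
    double rtt[2];
    for (int k = 0; k < 2; k++) {
        int m = sizes[k];
        double *buf = calloc((size_t)m, sizeof *buf);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int it = 0; it < 100; it++) {
            if (rank == 0) {
                MPI_Send(buf, m, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, m, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, m, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, m, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        rtt[k] = (MPI_Wtime() - t0) / 100.0;
        free(buf);
    }
    if (rank == 0) {
        double tw = (rtt[1] - rtt[0]) / (2.0 * (sizes[1] - sizes[0]));
        double ts = rtt[0] / 2.0 - tw * sizes[0];
        printf("t_s ~ %g s, t_w ~ %g s per double\n", ts, tw);
    }
    MPI_Finalize();
    return 0;
}
```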