Solving Challenging Numerical Linear Algebra Algorithms Using GPU Accelerators Hatem Ltaief KAUST Supercomputing Laboratory Stanimire Tomov University of Tennessee, Knoxville ISC’12 Tutorial, Hamburg June 17, 2012

Outline MAGMA: LAPACK for GPUs Methodology overview Use both GPUs and multicore CPUs MAGMA: from single to multiGPU support One-sided factorizations and linear solvers Two-sided factorizations and eigensolvers Dynamic scheduling approaches to DLA MAGMA algorithms with dynamic scheduling Conclusions

MAGMA: LAPACK for GPUs. MAGMA (Matrix Algebra on GPU and Multicore Architectures) aims to provide LAPACK/ScaLAPACK on hybrid architectures. MAGMA BLAS: a subset of BLAS for GPUs, highly optimized for NVIDIA GPUs, including a fast GEMM for Fermi (in CUBLAS 3.2) [IJHPCA’10]. MAGMA developers & collaborators: UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia); a community effort, similar to LAPACK/ScaLAPACK.

A New Generation of DLA Software: MAGMA Hybrid Algorithms (heterogeneity friendly), relying on a hybrid scheduler and hybrid kernels.

MAGMA Software Stack

MAGMA 1.1: 50+ hybrid LAPACK algorithms have been developed (200+ routines in total); every algorithm comes in 4 precisions (s/c/d/z); there are 3 mixed-precision algorithms (zc & ds); these are hybrid algorithms, expressed in terms of BLAS. MAGMA BLAS: a subset of GPU BLAS, optimized for Tesla and Fermi GPUs.
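As an illustration of the LAPACK-style interface, a minimal usage sketch follows. It is not code from the slides, and the exact prototypes have varied slightly across MAGMA releases (the 1.x series used char-based uplo arguments), so check the magma header of your installation:

#include <magma_v2.h>   /* recent releases; older 1.x releases use magma.h */

/* Factor an SPD matrix that already resides in GPU memory, LAPACK-style. */
int factor_spd_on_gpu(magma_int_t n, magmaDouble_ptr dA, magma_int_t ldda)
{
    magma_int_t info = 0;
    magma_init();                                       /* initialize the MAGMA runtime */
    magma_dpotrf_gpu(MagmaLower, n, dA, ldda, &info);   /* hybrid CPU+GPU Cholesky      */
    magma_finalize();
    return (int)info;                                   /* 0 on success, LAPACK-style   */
}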

MAGMA Methodology: a methodology to use all available resources. MAGMA uses a hybridization methodology based on representing linear algebra algorithms as collections of tasks and data dependencies among them, and properly scheduling the tasks' execution over the multicore and GPU hardware components. Successfully applied to fundamental linear algebra algorithms: one- and two-sided factorizations and solvers; iterative linear solvers and eigensolvers. Productivity: 1) high-level; 2) leveraging prior developments; 3) exceeding the performance of homogeneous solutions. Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs).

Hybrid Algorithms: one-sided factorizations (LU, QR, Cholesky). Hybridization: panels (Level 2 BLAS) are factored on the CPU using LAPACK; trailing matrix updates (Level 3 BLAS) are done on the GPU using “look-ahead”.

A Hybrid Algorithm Example: left-looking hybrid Cholesky factorization in MAGMA 1.0. The difference with LAPACK is the 3 additional lines (shown in red on the slide); line 10 (done on the CPU) is overlapped with work on the GPU.
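To make the CPU/GPU split concrete, here is a minimal single-GPU sketch of such a left-looking hybrid Cholesky (lower triangular, double precision). It is not the MAGMA source: it assumes CUDA, cuBLAS v2 and a Fortran LAPACK are available, that the matrix dA already resides on the GPU, and that work is a pinned host buffer of at least nb*nb doubles. The diagonal block is factored on the CPU with dpotrf while the GPU updates the panel below it, which is the overlap the slide refers to:

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Fortran LAPACK prototype (32-bit integers assumed). */
void dpotrf_(const char *uplo, const int *n, double *a, const int *lda, int *info);

int hybrid_dpotrf_lower(cublasHandle_t handle, cudaStream_t stream,
                        int n, double *dA, int ldda, int nb, double *work)
{
    const double one = 1.0, mone = -1.0;
    int info = 0;
    cudaEvent_t diag_sent;
    cudaEventCreate(&diag_sent);
    cublasSetStream(handle, stream);

    for (int j = 0; j < n && info == 0; j += nb) {
        int jb = (nb < n - j) ? nb : n - j;

        /* Update the diagonal block with the already factored columns (GPU). */
        cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, jb, j,
                    &mone, dA + j, ldda, &one, dA + j + (size_t)j*ldda, ldda);

        /* Send the updated diagonal block to the CPU (asynchronous copy). */
        cudaMemcpy2DAsync(work, jb*sizeof(double),
                          dA + j + (size_t)j*ldda, ldda*sizeof(double),
                          jb*sizeof(double), jb, cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(diag_sent, stream);

        /* Meanwhile the GPU updates the panel below the diagonal block. */
        if (j + jb < n)
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n - j - jb, jb, j,
                        &mone, dA + j + jb, ldda, dA + j, ldda,
                        &one,  dA + j + jb + (size_t)j*ldda, ldda);

        /* The CPU factors the small diagonal block, overlapped with the GEMM above. */
        cudaEventSynchronize(diag_sent);
        dpotrf_("L", &jb, work, &jb, &info);
        if (info != 0) { info += j; break; }

        /* Return the factored block and finish the panel with a triangular solve. */
        cudaMemcpy2DAsync(dA + j + (size_t)j*ldda, ldda*sizeof(double),
                          work, jb*sizeof(double),
                          jb*sizeof(double), jb, cudaMemcpyHostToDevice, stream);
        if (j + jb < n)
            cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, n - j - jb, jb,
                        &one, dA + j + (size_t)j*ldda, ldda,
                        dA + j + jb + (size_t)j*ldda, ldda);
    }
    cudaStreamSynchronize(stream);
    cudaEventDestroy(diag_sent);
    return info;   /* LAPACK-style: 0 on success, >0 if a leading minor is not SPD */
}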

LU Factorization (Single GPU)

From single-GPU to multi-GPU support. Data distribution: 1-D block-cyclic distribution, with block columns of width nb assigned round-robin (GPU 0, GPU 1, GPU 2, GPU 0, ...). Algorithm: the GPU holding the current panel sends it to the CPU; all updates are done in parallel on the GPUs; look-ahead is done with the GPU holding the next panel.
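For reference, the 1-D block-cyclic mapping boils down to two one-liners (an illustrative sketch, not MAGMA code): block column j goes to GPU j mod ngpu, where it becomes that GPU's (j / ngpu)-th local block column.

/* 1-D block-cyclic distribution of block columns over ngpu GPUs (illustrative). */
static inline int gpu_owning_block (int j, int ngpu) { return j % ngpu; }  /* which GPU holds block column j */
static inline int local_block_index(int j, int ngpu) { return j / ngpu; }  /* its position on that GPU       */

/* Example with 3 GPUs: block columns 0,3,6,... live on GPU 0, columns 1,4,7,...
 * on GPU 1, and columns 2,5,8,... on GPU 2, matching the GPU 0 / GPU 1 / GPU 2 /
 * GPU 0 ... pattern of the slide. */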

LU factorization (multiGPUs)

Matrix out of GPU memory

Out of GPU Memory Algorithms: perform left-looking factorizations on sub-matrices that fit in the GPU memory (using the existing algorithms); the rest of the matrix stays on the CPU; left-looking versions minimize writing to the CPU. With A1 the already factored sub-matrix on the CPU, A2 the sub-matrix to be factored on the GPU, and the remainder of the matrix untouched: 1) copy A2 to the GPU; 2) update A2 using A1 (a panel of A1 at a time); 3) factor the updated A2 using the existing hybrid code; 4) copy the factored A2 back to the CPU. Trivially extended to multiple GPUs: A2 is “larger”, with a 1-D block-cyclic distribution, again reusing the existing algorithms.
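A schematic version of this loop is sketched below. The helpers (copy_panel_to_gpu, update_with_panel, factor_on_gpu_hybrid, copy_panel_to_cpu) are hypothetical placeholders standing in for existing building blocks and are declared only so the sketch compiles; NB is the panel width of A1 used during the updates:

/* Hypothetical building blocks (not MAGMA API): */
void copy_panel_to_gpu   (double *A, int lda, int col, int ncols);
void update_with_panel   (double *A, int lda, int panel_col, int col, int ncols);
void factor_on_gpu_hybrid(int col, int ncols);
void copy_panel_to_cpu   (double *A, int lda, int col, int ncols);

enum { NB = 256 };   /* panel width of the already factored part A1 */

/* Left-looking out-of-GPU-memory factorization: A lives on the CPU, and only a
 * block of gpu_cols columns (A2) is resident on the GPU at any time. */
void factor_out_of_core(int n, int gpu_cols, double *A, int lda)
{
    for (int j = 0; j < n; j += gpu_cols) {
        int jw = (gpu_cols < n - j) ? gpu_cols : n - j;

        copy_panel_to_gpu(A, lda, j, jw);          /* 1. copy A2 to the GPU               */
        for (int k = 0; k < j; k += NB)            /* 2. update A2 using A1, one panel    */
            update_with_panel(A, lda, k, j, jw);   /*    of A1 at a time                  */
        factor_on_gpu_hybrid(j, jw);               /* 3. factor A2 with the hybrid code   */
        copy_panel_to_cpu(A, lda, j, jw);          /* 4. copy the factored A2 back        */
    }
}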

Hybrid Algorithms: two-sided factorizations (to bidiagonal, tridiagonal, and upper Hessenberg forms) for eigen- and singular-value problems. Hybridization: trailing matrix updates (Level 3 BLAS) are done on the GPU (similar to the one-sided factorizations); panels (Level 2 BLAS) are hybrid, with operations whose memory footprint is restricted to the panel done on the CPU, while the time-consuming matrix-vector products involving the entire trailing matrix are done on the GPU.
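As a concrete illustration of that split, the sketch below (not MAGMA code) shows the one operation that dominates the tridiagonal-reduction panel, the symmetric matrix-vector product with the full trailing matrix, offloaded to the GPU with the standard cuBLAS DSYMV; the remaining panel bookkeeping, whose memory footprint is just the panel, stays on the CPU:

#include <cublas_v2.h>

/* y := A * v on the GPU, where dA is the (n x n) trailing symmetric matrix and
 * dv the current Householder vector, both already resident in GPU memory. */
void trailing_matrix_symv(cublasHandle_t handle, int n,
                          const double *dA, int ldda,
                          const double *dv, double *dy)
{
    const double one = 1.0, zero = 0.0;
    cublasDsymv(handle, CUBLAS_FILL_MODE_LOWER, n,
                &one, dA, ldda, dv, 1, &zero, dy, 1);
}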

Multi-GPU Two-Sided Factorizations: performance of DSYMV on multiple M2090s. Need high-performance multi-GPU Level 2 BLAS (e.g., 50% of the flops in the tridiagonal reduction). T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL Technical Report, 03/2012.

Hybrid Two-Sided Factorizations

MAGMA Tridiagonalization in DP. 50% of the flops are in SYMV, which is memory bound, i.e., it does not scale well on multicore CPUs; use the GPU’s high memory bandwidth and an optimized SYMV. 8x speedup over 12 Intel cores (2.8 GHz). Keeneland system, using one node: 3 NVIDIA GPUs (1.1 GHz, 5.4 GB each) and 2 x 6 Intel cores (2.8 GHz, 23 GB). T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL Technical Report, 03/2012.
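The 50% figure follows from standard flop counts (a quick derivation, not from the slides): reducing an n x n symmetric matrix to tridiagonal form costs about 4n^3/3 flops in total, and step k performs one SYMV with the (n-k) x (n-k) trailing matrix, so

\sum_{k=1}^{n} 2(n-k)^2 \;\approx\; \frac{2n^3}{3}
\qquad\Longrightarrow\qquad
\frac{2n^3/3}{4n^3/3} \;=\; \frac{1}{2}.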

Further GPU Kernel Optimizations Fermi 2070 A. Abdelfattah, J. Dongarra, D. Keyes and H. Ltaief, Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators, VECPAR, Japan, 2012.

Further GPU Kernel Optimizations Fermi 2070

Further GPU Kernel Optimizations Fermi 2070

From Static to Dynamic Scheduling… Static scheduling may stall in situations where work is available, and relies on hand-tuned optimizations. Hardware heterogeneity, kernel heterogeneity, and separation of concerns all point to a dynamic runtime system.

Punch Lines Productivity! Oh… Did I say Productivity?

Block Algorithms: Panel-Update Sequence. Transformations are blocked/accumulated within the panel (Level 2 BLAS), then applied at once to the trailing submatrix (Level 3 BLAS); the parallelism is hidden inside the BLAS. Fork-join model.
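The panel-update rhythm and its fork-join synchronization can be summarized with the schematic loop below (factor_panel and update_block_column are hypothetical placeholders, not LAPACK routines; the point is the structure, not the kernels):

#include <omp.h>

void factor_panel(int k);               /* Level-2 BLAS panel, largely sequential (hypothetical)      */
void update_block_column(int k, int j); /* Level-3 BLAS update of one block column (hypothetical)     */

void blocked_factorization(int nt)      /* nt = number of block columns */
{
    for (int k = 0; k < nt; k++) {
        factor_panel(k);                /* serial panel: the other cores mostly wait here    */

        #pragma omp parallel for        /* fork: parallel Level-3 BLAS updates               */
        for (int j = k + 1; j < nt; j++)
            update_block_column(k, j);
                                        /* implicit join: barrier before the next panel      */
    }
}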

Block QR Factorization

Fork-Join Paradigm

Leveraging Block Algorithms… Column-major data layout vs. tile data layout.

Lessons Learned from PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures): tile algorithms on homogeneous x86 cores. Parallelism is brought to the fore; may require the redesign of linear algebra algorithms; tile data layout translation; removes unnecessary synchronization points between panel-update sequences; DAG execution, where nodes represent tasks and edges define the dependencies between them; dynamic runtime system environment: QUARK.
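The layout translation itself is straightforward; a minimal sketch (illustrative, not the PLASMA implementation) that copies a column-major LAPACK matrix into contiguous nb x nb tiles, so that each task touches contiguous memory:

#include <stddef.h>

void lapack_to_tile(int m, int n, int nb, const double *A, int lda, double *At)
{
    size_t off = 0;
    for (int j = 0; j < n; j += nb)            /* loop over tile columns */
        for (int i = 0; i < m; i += nb) {      /* loop over tile rows    */
            int tm = (nb < m - i) ? nb : m - i;
            int tn = (nb < n - j) ? nb : n - j;
            for (int jj = 0; jj < tn; jj++)    /* copy one tile, column by column */
                for (int ii = 0; ii < tm; ii++)
                    At[off + ii + (size_t)jj*tm] = A[(i + ii) + (size_t)(j + jj)*lda];
            off += (size_t)tm * tn;            /* next tile starts right after    */
        }
}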

Example: Tile QR Factorization. First panel factorization and the corresponding updates; DAG for a 4x4-tile matrix.

Let’s go crazy! DAG of 20x20 tile QR Factorization

Dynamic Scheduling: conceptually similar to out-of-order processor scheduling because it has a dynamic runtime DAG scheduler, an out-of-order execution flow of fine-grained tasks, and task scheduling as soon as dependencies are satisfied. Producer-consumer. Data-flow programming model: a five-decade-old concept; think "how things connect" rather than "how things happen"; an assembly line; inherently parallel.

Matrices Over Runtime Systems at Exascale (MORSE). Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems." Runtime challenges due to the ever-growing hardware complexity; algorithmic challenges to exploit the hardware capabilities to the fullest. Integrated into the MAGMA software stack.

MAGMA-MORSE: x86 + multiple GPUs. Lessons learned from PLASMA! CUDA-based hybrid systems; new high-performance numerical kernels; StarPU runtime system (Augonnet et al., INRIA, Bordeaux); both x86 and GPUs => hybrid computations; similar to LAPACK in functionality.

Achieving a High Level of Productivity: From Sequential Nested-Loop Code to Parallel Execution

for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k;k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k;k], A[k;n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k;k], A[m;k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m;k], A[k;n], A[m;n], ...);
    }
}

Achieving a High Level of Productivity: From Sequential Nested-Loop Code to Parallel Execution. The same loop nest, now inserting tasks into the runtime system instead of calling the kernels directly:

for (k = 0; k < min(MT, NT); k++) {
    starpu_Insert_Task(&cl_zgeqrt, k, k, ...);
    for (n = k+1; n < NT; n++)
        starpu_Insert_Task(&cl_zunmqr, k, n, ...);
    for (m = k+1; m < MT; m++) {
        starpu_Insert_Task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            starpu_Insert_Task(&cl_ztsmqr, m, n, k, ...);
    }
}

Hybrid Architecture Targeted ⇒ PCI interconnect, 16X, 64 Gb/s: a very thin pipe! ⇒ Fermi (Tesla C-series) GPU: hundreds of CUDA cores, 515 Gflop/s peak.

Performance Charts: Cholesky, QR, LU, Symmetric Matrix Inversion.

Cholesky Performance: 8 Intel x86 cores + 3 Tesla GPUs. E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, Software for GPUs, GPU Computing Gems, vol. 2, 2011.

QR Performance: 8 AMD x86 cores + 4 Tesla GPUs. E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and Distributed Processing Symposium, 2011.

QR Performance +~200 Gflop/s but 12 cores = ~150 Gflop/s 8 AMD x86 cores + 4 Tesla GPUs E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and Distributed Processing Symposium, 2011.

Performance Breakdown. Task distribution observed on StarPU: sgeqrt: 20% of tasks on GPUs; stsmqr: 92.5% of tasks on GPUs. Taking advantage of heterogeneity! Only do what you are good for; don’t do what you are not good for.

Kernel    CPU           GPU           Speedup
sgeqrt    9 Gflop/s     60 Gflop/s    ~6
stsqrt    12 Gflop/s    67 Gflop/s    ~6
sormqr    8.5 Gflop/s   227 Gflop/s   ~27
stsmqr    10 Gflop/s    285 Gflop/s   ~27

LU Performance: 8 Intel x86 cores + 3 Tesla GPUs. E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief and S. Tomov, ACS/IEEE International Conference on Computer Systems and Applications (best paper award).

Symmetric Matrix Inversion: A −1, seriously??? YES! A critical component of the variance-covariance matrix computation in statistics. Three steps: 1) Cholesky factorization (DPOTRF); 2) inverting the Cholesky factor (DTRTRI); 3) calculating the product of the inverted Cholesky factor with its transpose (DLAUUM). Built on previous work by E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou, and L. Rosenberg.
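For reference, the same three steps expressed as plain CPU LAPACK calls (a minimal sketch of the math, not the GPU-scheduled version discussed in the slides); steps 2 and 3 are exactly what LAPACK's DPOTRI performs after a DPOTRF:

/* Fortran LAPACK prototypes (32-bit integers assumed). */
void dpotrf_(const char *uplo, const int *n, double *A, const int *lda, int *info);
void dtrtri_(const char *uplo, const char *diag, const int *n, double *A, const int *lda, int *info);
void dlauum_(const char *uplo, const int *n, double *A, const int *lda, int *info);

/* On entry, the lower triangle of A holds an SPD matrix (column-major, n x n);
 * on exit it holds the lower triangle of A^{-1}. Returns the LAPACK info code. */
int spd_inverse(int n, double *A, int lda)
{
    int info = 0;
    dpotrf_("L", &n, A, &lda, &info);          /* 1. A = L * L^T             (DPOTRF) */
    if (info == 0)
        dtrtri_("L", "N", &n, A, &lda, &info); /* 2. L := L^{-1}             (DTRTRI) */
    if (info == 0)
        dlauum_("L", &n, A, &lda, &info);      /* 3. A^{-1} = L^{-T} L^{-1}  (DLAUUM) */
    return info;
}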

Scheduling Algorithms as DAGs

A −1 Performance: 8 Intel x86 cores + 2 Fermi GPUs. H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC'11, India.

Summary and Future Directions. Two methodologies for solving challenging DLA: static scheduling (performance) and dynamic scheduling (productivity); LAPACK-compliant API; source code freely available in MAGMA. What’s next? Extended numerical functionality; distributed-memory systems.

Collaborators / Support: MAGMA [Matrix Algebra on GPU and Multicore Architectures] team; PLASMA [Parallel Linear Algebra for Scalable Multicore Architectures] team. Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; INRIA, France; KAUST, Saudi Arabia.

Questions?