ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs


ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs Dileep Mardham

Introduction
Sparse direct solvers are a fundamental tool in scientific computing, but sparse factorization can be challenging to accelerate using GPUs. Graphics Processing Units (GPUs) can be quite good at accelerating sparse direct solvers, and they help most when the factorization involves sufficient dense math. For many other cases, however, the prevalence of small and irregular dense operations makes it challenging to significantly accelerate sparse factorization on the GPU.

Example Matrices

PROBLEMS
Many optimizations remain:
- Substantial PCIe transfer overhead
- Kernel launch overhead
- Memory issues

BACKGROUND
A = LL^T: Cholesky factorization, in which the sparse, symmetric positive definite (SPD) matrix A is factored as the product of a sparse lower triangular matrix L and its transpose L^T.
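
For reference, a minimal unblocked dense Cholesky in C++ (my own illustrative sketch, not the author's code; production solvers call LAPACK's DPOTRF for the dense blocks):

```cpp
#include <cmath>
#include <vector>

// Factors a symmetric positive definite n x n matrix A (row-major; only
// the lower triangle is read) in place, so that on return the lower
// triangle holds L with A = L * L^T.
void cholesky_inplace(std::vector<double>& A, int n) {
    for (int j = 0; j < n; ++j) {
        // L(j,j) = sqrt( A(j,j) - sum_{k<j} L(j,k)^2 )
        double d = A[j * n + j];
        for (int k = 0; k < j; ++k) d -= A[j * n + k] * A[j * n + k];
        A[j * n + j] = std::sqrt(d);
        // L(i,j) = ( A(i,j) - sum_{k<j} L(i,k) * L(j,k) ) / L(j,j)
        for (int i = j + 1; i < n; ++i) {
            double s = A[i * n + j];
            for (int k = 0; k < j; ++k) s -= A[i * n + k] * A[j * n + k];
            A[i * n + j] = s / A[j * n + j];
        }
    }
}
```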

Cont.
Sparse Cholesky factorization comes in many flavors:
- Supernodal / multifrontal
- Left-looking / right-looking
Supernodes are collections of similar columns. They:
- provide the opportunity for dense matrix math;
- grow with mesh size due to 'fill' (the larger the model, the larger the supernodes);
- grow faster for solids than for shells.

ELIMINATION TREE
- A DAG that determines the order in which supernodes can be factored.
- Descendant supernodes are referenced multiple times during the factorization.
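
As a sketch of how the tree drives scheduling (my own illustration; the parent[] representation and levelSets name are hypothetical): supernodes whose children are all complete are independent of one another, so grouping nodes by level yields sets that can be factored together, leaves first.

```cpp
#include <algorithm>
#include <vector>

// Group supernodes of an elimination tree into levels. parent[i] is the
// parent supernode of i (or -1 for a root). Elimination trees are numbered
// so that parent[i] > i, so a single forward sweep computes each node's
// level as 1 + the maximum level of its children. All supernodes in the
// same level are independent and can be factored together (e.g., in one
// batched launch), with level 0 (the leaves) first.
std::vector<std::vector<int>> levelSets(const std::vector<int>& parent) {
    const int n = static_cast<int>(parent.size());
    std::vector<int> level(n, 0);
    int maxLevel = 0;
    for (int i = 0; i < n; ++i) {
        if (parent[i] >= 0)
            level[parent[i]] = std::max(level[parent[i]], level[i] + 1);
        maxLevel = std::max(maxLevel, level[i]);
    }
    std::vector<std::vector<int>> sets(maxLevel + 1);
    for (int i = 0; i < n; ++i) sets[level[i]].push_back(i);
    return sets;
}
```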

Operations Involved
Sparse matrix factorization typically makes extensive use of the Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) libraries. The specific double-precision routines used in Cholesky factorization are:
- DPOTRF: direct Cholesky factorization of a dense matrix (LAPACK)
- DTRSM: triangular system solution (BLAS)
- DGEMM: general matrix-matrix multiplication (BLAS)
- DSYRK: symmetric rank-k update (BLAS)
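
The sketch below shows how these routines chain together for one supernodal panel, using the CPU LAPACKE/CBLAS interfaces for brevity (the names nd, no, and factor_supernode are my own, and error checking is omitted):

```cpp
#include <cblas.h>
#include <lapacke.h>

// One column-major supernodal panel: an nd x nd diagonal block D sits on
// top of an no x nd off-diagonal block B, with leading dimension ld = nd+no.
void factor_supernode(double* panel, int nd, int no, int ld) {
    // 1) DPOTRF: Cholesky-factor the dense diagonal block, D = L11 * L11^T.
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nd, panel, ld);

    // 2) DTRSM: solve L21 * L11^T = B for the off-diagonal block L21,
    //    which overwrites B (starting at row nd of the panel).
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                CblasNonUnit, no, nd, 1.0, panel, ld, panel + nd, ld);

    // 3) DSYRK / DGEMM would then subtract L21 * L21^T (and the cross
    //    products) from the ancestor supernodes; omitted here because it
    //    requires the sparse assembly mapping.
}
```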

ALGORITHM
1. 'Batching' can be used to minimize the effect of kernel launch latency.
2. Concurrent kernels (i.e., simultaneous execution of multiple kernels on the GPU using streams) can be used to maximize GPU utilization.
3. By placing a large amount of matrix data on the GPU and performing all of the factorization steps on the GPU, communication across the PCIe bus can be completely avoided.
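
A minimal CUDA sketch of strategy 2, concurrent kernels via streams (my own illustration; small_update is a placeholder for a small per-supernode operation):

```cuda
#include <cuda_runtime.h>

__global__ void small_update(double* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;  // placeholder work
}

int main() {
    const int kStreams = 4;
    const int n = 1 << 14;
    double* buf[kStreams];
    cudaStream_t s[kStreams];
    for (int i = 0; i < kStreams; ++i) {
        cudaMalloc((void**)&buf[i], n * sizeof(double));
        cudaStreamCreate(&s[i]);
    }
    // Kernels issued to different streams may execute concurrently on the
    // device; on a single stream they would serialize behind each other.
    for (int i = 0; i < kStreams; ++i)
        small_update<<<(n + 255) / 256, 256, 0, s[i]>>>(buf[i], n);
    cudaDeviceSynchronize();
    for (int i = 0; i < kStreams; ++i) {
        cudaFree(buf[i]);
        cudaStreamDestroy(s[i]);
    }
    return 0;
}
```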

PLACING LARGE DATA
Approximate overheads for a single small operation:
- Kernel launch: ~6 μs
- PCIe transfer: ~10 μs
- Achieved rate: ~100 Mflop/s
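
One way to read these numbers (my own back-of-the-envelope arithmetic, assuming a small operation of roughly W = 1600 flops; the assumption is not from the slide): an operation that pays both a kernel launch and a PCIe transfer before it can run is rate-limited by those latencies, regardless of the GPU's peak throughput.

```latex
\text{rate} \;\le\; \frac{W}{t_{\text{launch}} + t_{\text{PCIe}}}
            \;=\; \frac{1600\ \text{flops}}{6\,\mu\text{s} + 10\,\mu\text{s}}
            \;=\; 100\ \text{Mflop/s}
```

Keeping the data resident on the GPU removes the PCIe term, and batching amortizes the launch term over many operations.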

BATCHING & CONCURRENT KERNELS
Benchmark data points for many small DGEMMs:
- 8192 DGEMMs: 1.2 Gflop/s each
- 2048 DGEMMs: 4.8 Gflop/s
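
A minimal batched-DGEMM sketch using cuBLAS (the sizes and names are my own assumptions, not the slide's exact benchmark): one cublasDgemmBatched call replaces thousands of individual kernel launches, amortizing the launch latency.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 32;        // each DGEMM is n x n (assumed size)
    const int batch = 2048;  // matrix products per launch
    cublasHandle_t handle;
    cublasCreate(&handle);

    // One contiguous slab per operand set, plus device arrays of pointers
    // to the individual n x n matrices inside each slab.
    double *slabA, *slabB, *slabC;
    cudaMalloc((void**)&slabA, sizeof(double) * n * n * batch);
    cudaMalloc((void**)&slabB, sizeof(double) * n * n * batch);
    cudaMalloc((void**)&slabC, sizeof(double) * n * n * batch);

    std::vector<const double*> hA(batch), hB(batch);
    std::vector<double*> hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = slabA + (size_t)i * n * n;
        hB[i] = slabB + (size_t)i * n * n;
        hC[i] = slabC + (size_t)i * n * n;
    }
    const double **dA, **dB;
    double **dC;
    cudaMalloc((void**)&dA, batch * sizeof(double*));
    cudaMalloc((void**)&dB, batch * sizeof(double*));
    cudaMalloc((void**)&dC, batch * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    // C_i -= A_i * B_i^T for all i, in a single launch: the DSYRK/DGEMM-
    // style update shape that dominates sparse Cholesky.
    const double alpha = -1.0, beta = 1.0;
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n,
                       &alpha, dA, n, dB, n, &beta, dC, n, batch);
    cudaDeviceSynchronize();
    // Cleanup omitted for brevity.
    return 0;
}
```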

STREAMING
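
A minimal double-buffered streaming sketch (my own illustration; factor_subtree is a placeholder kernel standing in for the batched per-level factorization of one subtree). Pinned host memory is required for copies to actually overlap with compute.

```cuda
#include <cuda_runtime.h>

// Placeholder for the real work: in the actual solver, the batched
// factorization of one subtree would run here.
__global__ void factor_subtree(double* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0;
}

int main() {
    const int kSubtrees = 8;   // assumed subtree count
    const size_t n = 1 << 20;  // assumed doubles per subtree
    double* host;
    cudaMallocHost((void**)&host, kSubtrees * n * sizeof(double));  // pinned

    double* dev[2];            // double buffer on the device
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void**)&dev[i], n * sizeof(double));
        cudaStreamCreate(&s[i]);
    }
    for (int k = 0; k < kSubtrees; ++k) {
        const int b = k % 2;
        // Stream b copies subtree k in, factors it, and copies the result
        // out; meanwhile the other stream is still busy with subtree k-1,
        // so PCIe transfers overlap with compute.
        cudaMemcpyAsync(dev[b], host + (size_t)k * n, n * sizeof(double),
                        cudaMemcpyHostToDevice, s[b]);
        factor_subtree<<<(int)((n + 255) / 256), 256, 0, s[b]>>>(dev[b], n);
        cudaMemcpyAsync(host + (size_t)k * n, dev[b], n * sizeof(double),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();
    // Cleanup omitted for brevity.
    return 0;
}
```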

RESULTS
- CPU: dual-socket Intel Xeon E5-2698 v3 (2 x 16-core Haswell) @ 2.30 GHz
- GPU: NVIDIA Tesla K40 with maximum boost clocks of 3004 MHz (memory) and 875 MHz (core)

SPEEDUP VS CPU The average speedup vs. the CPU for all 99 tested matrices is 1.7x

SPEEDUP VS GPU
The average speedup over the baseline GPU implementation for the 99 tested matrices is 1.3x.

CONCLUSION
- Once the portion of A pertaining to a subtree has been copied to the GPU, the entire subtree can be factored without any need for PCIe communication.
- To achieve high computational performance, BLAS/LAPACK operations within an independent level of the elimination tree can be batched to minimize kernel launch overhead.
- Large matrices are decomposed into multiple subtrees, which are streamed through the GPU.

REFERENCES
- Steven C. Rennich, Darko Stosic, and Timothy A. Davis. "Accelerating Sparse Cholesky Factorization on GPUs." Parallel Computing, 2016.
- Timothy A. Davis. Direct Methods for Sparse Linear Systems. SIAM, Philadelphia, 2006.
- R. Mehmood and J. Crowcroft. "Parallel Iterative Solution Method of Large Sparse Linear Equation Systems." Technical Report, University of Cambridge, 2005.
- T. A. Davis. "SuiteSparse." http://www.suitesparse.com