ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs

1 ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs
Dileep Mardham

2 Introduction
Sparse direct solvers are a fundamental tool in scientific computing.
Sparse factorization can be a challenge to accelerate using GPUs.
GPUs (Graphics Processing Units) can be quite good for accelerating sparse direct solvers.
GPUs can help alleviate this cost for factorizations that involve sufficient dense math.
However, for many other cases, the prevalence of small/irregular dense math makes it challenging to significantly accelerate sparse factorization using the GPU.

3 Example Matrices

5 PROBLEMS
Many optimizations remain
Substantial PCIe overhead
Kernel launch overhead
Memory issues

6 BACKGROUND
Cholesky factorization: A = LLᵀ
where the sparse, SPD (symmetric positive definite) matrix A is factored as the product of a sparse lower triangular matrix L and its transpose Lᵀ.
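A tiny worked example (my own, not from the slides): the SPD matrix A = [4 2; 2 3] factors as L = [2 0; 1 √2], and multiplying back gives LLᵀ = [4 2; 2 3].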

7 Cont.
Many flavors: supernodal / multifrontal, left-looking / right-looking
Supernodes are collections of similar columns; they provide the opportunity for dense matrix math and grow with mesh size due to ‘fill’
The larger the model, the larger the supernodes
Supernodes for solids grow faster than supernodes for shells

8 ELIMINATION TREE
DAG: determines the order in which supernodes can be factored
Descendant supernodes are referenced multiple times
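
As a rough illustration of how the tree drives scheduling (my own sketch, not the presentation's code; the parent[] array representation of the elimination tree is an assumption), supernodes can be grouped into levels, and every supernode within a level is independent of the others, so their factorization steps can later be batched:

#include <stdlib.h>

/* Sketch: the elimination tree is given as a parent[] array
 * (parent[s] == -1 for roots).  Leaves get level 0; every other
 * supernode's level is 1 + the maximum level of its children.
 * Supernodes sharing a level do not depend on one another. */
static void etree_levels(int nsuper, const int *parent, int *level)
{
    int *nchild = calloc(nsuper, sizeof(int));
    int *queue  = malloc(nsuper * sizeof(int));
    int head = 0, tail = 0;

    for (int s = 0; s < nsuper; s++) level[s] = 0;
    for (int s = 0; s < nsuper; s++)
        if (parent[s] >= 0) nchild[parent[s]]++;

    for (int s = 0; s < nsuper; s++)      /* leaves enter the queue first */
        if (nchild[s] == 0) queue[tail++] = s;

    while (head < tail) {                 /* children are processed before parents */
        int s = queue[head++], p = parent[s];
        if (p < 0) continue;
        if (level[s] + 1 > level[p]) level[p] = level[s] + 1;
        if (--nchild[p] == 0) queue[tail++] = p;
    }

    free(nchild);
    free(queue);
}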

9 Operations Involved
Sparse matrix factorization typically makes extensive use of the “Basic Linear Algebra Subprograms” (BLAS) and “Linear Algebra Package” (LAPACK) libraries. The specific double-precision BLAS and LAPACK routines used in Cholesky factorization are:
DPOTRF: direct Cholesky factorization of a dense matrix (LAPACK)
DTRSM: triangular system solution (BLAS)
DGEMM: general matrix-matrix multiplication (BLAS)
DSYRK: symmetric matrix-matrix multiplication (BLAS)
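
A minimal sketch of how these routines combine on the GPU for one supernode (my own illustration, not code from the presentation: the panel layout, argument names, and the update buffer are assumptions, and workspace sizing and error checking are omitted), using the cuSOLVER/cuBLAS counterparts of DPOTRF, DTRSM, and DSYRK:

#include <cublas_v2.h>
#include <cusolverDn.h>

/* Sketch: one supernode's dense panel is stored column-major on the GPU
 * as an (n+m) x n block [ D ; B ] with leading dimension lda = n + m:
 * D is the n x n diagonal block, B the m x n block below it. */
void factor_supernode(cusolverDnHandle_t sol, cublasHandle_t blas,
                      double *panel, int n, int m, int lda,
                      double *work, int lwork, int *devInfo,
                      double *update)   /* m x m buffer for the Schur update */
{
    const double one = 1.0, zero = 0.0, minus_one = -1.0;
    double *D = panel;       /* diagonal block */
    double *B = panel + n;   /* off-diagonal block, n rows down */

    /* DPOTRF: D = L11 * L11^T (lower triangle) */
    cusolverDnDpotrf(sol, CUBLAS_FILL_MODE_LOWER, n, D, lda,
                     work, lwork, devInfo);

    /* DTRSM: B := B * L11^{-T}, producing L21 */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                m, n, &one, D, lda, B, lda);

    /* DSYRK: update := -L21 * L21^T, to be scattered into ancestor
     * supernodes (DGEMM handles the rectangular parts of the update). */
    cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                m, n, &minus_one, B, lda, &zero, update, m);
}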

11 ALGORITHM
1. ‘Batching’ can be used to minimize the effect of launch latency
2. Concurrent kernels (i.e. simultaneous execution of multiple kernels on the GPU using streams) can be used to maximize GPU utilization
3. By placing a large amount of matrix data on the GPU and performing all of the factorization steps on the GPU, communication across the PCIe bus can be completely avoided
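
A minimal sketch of point 1 (my own illustration with made-up uniform sizes; the d_A/d_B/d_C pointer arrays and the block size n are assumptions): a single cublasDgemmBatched call issues many small matrix multiplies at once, paying the kernel launch overhead once for the whole batch instead of once per DGEMM.

#include <cublas_v2.h>

/* Sketch: 'batch' small n x n updates C_i := C_i - A_i * B_i^T issued
 * in one launch.  d_A, d_B, d_C are device-resident arrays of device
 * pointers, one entry per small block. */
void batched_updates(cublasHandle_t blas,
                     const double *const *d_A,
                     const double *const *d_B,
                     double *const *d_C,
                     int n, int batch)
{
    const double minus_one = -1.0, one = 1.0;

    cublasDgemmBatched(blas, CUBLAS_OP_N, CUBLAS_OP_T,
                       n, n, n,
                       &minus_one,
                       d_A, n,
                       d_B, n,
                       &one,
                       d_C, n,
                       batch);
}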

12 PLACING LARGE DATA
Kernel: 6 μsec
PCIe: 10 μsec
Flops: 100 Mflops

13 BATCHING & CONCURRENT KERNELS
8192 DGEMMs: 1.2 Gflops each
2048 DGEMMs: 4.8 Gflops

14 Contd.

15 STREAMING
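
A minimal sketch of the streaming idea (my own illustration; process_branch is a hypothetical stand-in for the per-subtree copies and factorization calls, not a real API): independent branches of the elimination tree are issued on separate CUDA streams so that their transfers and kernels can overlap on the GPU.

#include <cuda_runtime.h>
#include <cublas_v2.h>

#define NSTREAMS 4

/* Sketch: round-robin independent branches of work over a few streams. */
void stream_branches(cublasHandle_t blas, int nbranches,
                     void (*process_branch)(cublasHandle_t, int, cudaStream_t))
{
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamCreate(&streams[i]);

    for (int b = 0; b < nbranches; b++) {
        cudaStream_t s = streams[b % NSTREAMS];
        cublasSetStream(blas, s);     /* later cuBLAS calls run on stream s */
        process_branch(blas, b, s);   /* async copies and kernels on s */
    }

    cudaDeviceSynchronize();          /* wait for all streams to drain */
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamDestroy(streams[i]);
}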

16 RESULTS
CPU: dual-socket Intel Xeon E v3 (2 x 16 cores, 2.30 GHz)
GPU: NVIDIA Tesla K40 with maximum boost clocks of 3004 MHz (memory) and 875 MHz (core)

17 SPEEDUP VS CPU The average speedup vs. the CPU for all 99 tested matrices is 1.7x

18 SPEEDUP VS GPU Average speedup for the 99 tested matrices is 1.3x

19 CONCLUSION
Once the A data pertaining to a subtree has been copied to the GPU, the entire subtree can be factored without any need for PCIe communication
To achieve high computational performance, BLAS/LAPACK operations within an independent level of the elimination tree can be batched to minimize kernel launch overhead
Large matrices are decomposed into multiple subtrees which are streamed through the GPU

20 REFERENCES
Steven C. Rennich, Darko Stosic, and Timothy A. Davis, Accelerating sparse Cholesky factorization on GPUs, Parallel Computing.
Timothy A. Davis, Direct Methods for Sparse Linear Systems, SIAM, Philadelphia, 2006.
R. Mehmood and J. Crowcroft, Parallel iterative solution method of large sparse linear equation systems, Technical Report, University of Cambridge, 2005.
T. A. Davis, SuiteSparse.

