ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs

1 ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs
Dileep Mardham

2 Introduction
Sparse direct solvers are a fundamental tool in scientific computing.
Sparse factorization can be a challenge to accelerate using GPUs.
GPUs (Graphics Processing Units) can be quite good for accelerating sparse direct solvers.
GPUs can help alleviate this cost for factorizations that involve sufficient dense math.
However, for many other cases, the prevalence of small/irregular dense math makes it challenging to significantly accelerate sparse factorization using the GPU.

3 Example Matrices

5 PROBLEMS
Many optimizations remain
Substantial PCIe overhead
Kernel launch overhead
Memory issues

6 BACKGROUND
Cholesky factorization: A = LLᵀ
where the sparse, SPD (symmetric positive definite) matrix A is factored as the product of a sparse lower triangular matrix L and its transpose Lᵀ.
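A tiny worked example (my own, not from the slides): the SPD matrix A = [4 2; 2 3] factors as L = [2 0; 1 √2], and multiplying back gives LLᵀ = [4 2; 2 3].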

7 Cont.
Many flavors: supernodal / multifrontal, left-looking / right-looking
Supernodes are collections of similar columns; they provide the opportunity for dense matrix math and grow with mesh size due to ‘fill’
The larger the model, the larger the supernodes
Supernodes for solids grow faster than supernodes for shells

8 ELIMINATION TREE
DAG: determines the order in which supernodes can be factored
Descendant supernodes are referenced multiple times
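
As a rough illustration of how the tree drives scheduling (my own sketch, not the presentation's code; the parent[] array representation of the elimination tree is an assumption), supernodes can be grouped into levels, and every supernode within a level is independent of the others, so their factorization steps can later be batched:

#include <stdlib.h>

/* Sketch: the elimination tree is given as a parent[] array
 * (parent[s] == -1 for roots).  Leaves get level 0; every other
 * supernode's level is 1 + the maximum level of its children.
 * Supernodes sharing a level do not depend on one another. */
static void etree_levels(int nsuper, const int *parent, int *level)
{
    int *nchild = calloc(nsuper, sizeof(int));
    int *queue  = malloc(nsuper * sizeof(int));
    int head = 0, tail = 0;

    for (int s = 0; s < nsuper; s++) level[s] = 0;
    for (int s = 0; s < nsuper; s++)
        if (parent[s] >= 0) nchild[parent[s]]++;

    for (int s = 0; s < nsuper; s++)      /* leaves enter the queue first */
        if (nchild[s] == 0) queue[tail++] = s;

    while (head < tail) {                 /* children are processed before parents */
        int s = queue[head++], p = parent[s];
        if (p < 0) continue;
        if (level[s] + 1 > level[p]) level[p] = level[s] + 1;
        if (--nchild[p] == 0) queue[tail++] = p;
    }

    free(nchild);
    free(queue);
}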

9 Operations Involved
Sparse matrix factorization typically makes extensive use of the “Basic Linear Algebra Subprograms” (BLAS) and “Linear Algebra Package” (LAPACK) libraries. The specific double-precision BLAS and LAPACK routines used in Cholesky factorization are:
DPOTRF: direct Cholesky factorization of a dense matrix (LAPACK)
DTRSM: triangular system solution (BLAS)
DGEMM: general matrix-matrix multiplication (BLAS)
DSYRK: symmetric matrix-matrix multiplication (BLAS)
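
A minimal sketch of how these routines combine on the GPU for one supernode (my own illustration, not code from the presentation: the panel layout, argument names, and the update buffer are assumptions, and workspace sizing and error checking are omitted), using the cuSOLVER/cuBLAS counterparts of DPOTRF, DTRSM, and DSYRK:

#include <cublas_v2.h>
#include <cusolverDn.h>

/* Sketch: one supernode's dense panel is stored column-major on the GPU
 * as an (n+m) x n block [ D ; B ] with leading dimension lda = n + m:
 * D is the n x n diagonal block, B the m x n block below it. */
void factor_supernode(cusolverDnHandle_t sol, cublasHandle_t blas,
                      double *panel, int n, int m, int lda,
                      double *work, int lwork, int *devInfo,
                      double *update)   /* m x m buffer for the Schur update */
{
    const double one = 1.0, zero = 0.0, minus_one = -1.0;
    double *D = panel;       /* diagonal block */
    double *B = panel + n;   /* off-diagonal block, n rows down */

    /* DPOTRF: D = L11 * L11^T (lower triangle) */
    cusolverDnDpotrf(sol, CUBLAS_FILL_MODE_LOWER, n, D, lda,
                     work, lwork, devInfo);

    /* DTRSM: B := B * L11^{-T}, producing L21 */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                m, n, &one, D, lda, B, lda);

    /* DSYRK: update := -L21 * L21^T, to be scattered into ancestor
     * supernodes (DGEMM handles the rectangular parts of the update). */
    cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                m, n, &minus_one, B, lda, &zero, update, m);
}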

11 ALGORITHM
1. ‘Batching’ can be used to minimize the effect of launch latency
2. Concurrent kernels (i.e. simultaneous execution of multiple kernels on the GPU using streams) can be used to maximize GPU utilization
3. By placing a large amount of matrix data on the GPU and performing all of the factorization steps on the GPU, communication across the PCIe bus can be completely avoided
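
A minimal sketch of point 1 (my own illustration with made-up uniform sizes; the d_A/d_B/d_C pointer arrays and the block size n are assumptions): a single cublasDgemmBatched call issues many small matrix multiplies at once, paying the kernel launch overhead once for the whole batch instead of once per DGEMM.

#include <cublas_v2.h>

/* Sketch: 'batch' small n x n updates C_i := C_i - A_i * B_i^T issued
 * in one launch.  d_A, d_B, d_C are device-resident arrays of device
 * pointers, one entry per small block. */
void batched_updates(cublasHandle_t blas,
                     const double *const *d_A,
                     const double *const *d_B,
                     double *const *d_C,
                     int n, int batch)
{
    const double minus_one = -1.0, one = 1.0;

    cublasDgemmBatched(blas, CUBLAS_OP_N, CUBLAS_OP_T,
                       n, n, n,
                       &minus_one,
                       d_A, n,
                       d_B, n,
                       &one,
                       d_C, n,
                       batch);
}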

12 PLACING LARGE DATA
Kernel: 6 μsec
PCIe: 10 μsec
Flops: 100 Mflops

13 BATCHING & CONCURRENT KERNELS
8192 DGEMMs: 1.2 Gflops each
2048 DGEMMs: 4.8 Gflops

14 Contd.

15 STREAMING
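
A minimal sketch of the streaming idea (my own illustration; process_branch is a hypothetical stand-in for the per-subtree copies and factorization calls, not a real API): independent branches of the elimination tree are issued on separate CUDA streams so that their transfers and kernels can overlap on the GPU.

#include <cuda_runtime.h>
#include <cublas_v2.h>

#define NSTREAMS 4

/* Sketch: round-robin independent branches of work over a few streams. */
void stream_branches(cublasHandle_t blas, int nbranches,
                     void (*process_branch)(cublasHandle_t, int, cudaStream_t))
{
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamCreate(&streams[i]);

    for (int b = 0; b < nbranches; b++) {
        cudaStream_t s = streams[b % NSTREAMS];
        cublasSetStream(blas, s);     /* later cuBLAS calls run on stream s */
        process_branch(blas, b, s);   /* async copies and kernels on s */
    }

    cudaDeviceSynchronize();          /* wait for all streams to drain */
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamDestroy(streams[i]);
}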

16 RESULTS
CPU: dual-socket Intel Xeon E v3 (2 x 16 cores, 2.30 GHz)
GPU: NVIDIA Tesla K40 with maximum boost clocks of 3004 MHz (memory) and 875 MHz (core)

17 SPEEDUP VS CPU The average speedup vs. the CPU for all 99 tested matrices is 1.7x

18 SPEEDUP VS GPU Average speedup for the 99 tested matrices is 1.3x

19 CONCLUSION
Once the A data pertaining to a subtree has been copied to the GPU, the entire subtree can be factored without any need for PCIe communication
To achieve high computational performance, BLAS/LAPACK operations within an independent level of the elimination tree can be batched to minimize kernel launch overhead
Large matrices are decomposed into multiple subtrees which are streamed through the GPU

20 REFERENCES
Steven C. Rennich, Darko Stosic, and Timothy A. Davis, Accelerating sparse Cholesky factorization on GPUs, Parallel Computing.
Timothy A. Davis, Direct Methods for Sparse Linear Systems, SIAM, Philadelphia, 2006.
R. Mehmood and J. Crowcroft, Parallel iterative solution method of large sparse linear equation systems, Technical Report, University of Cambridge, 2005.
T. A. Davis, SuiteSparse.

