Slide 1: Utilizing CUDA for Preconditioned GMRES Solvers (DCABES'09)
Shiming Xu (1), Hai Xiang Lin (1), Wei Xue (2), and Ke Wang (3)
(1) Delft Institute of Applied Mathematics, TU Delft
(2) Department of Computer Science & Technology, Tsinghua University
(3) Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences

Slide 2: Outline
- Introduction to Krylov-subspace linear system solvers & preconditioning techniques
- Introduction to GPGPU & NVIDIA CUDA
- GMRES solver on CUDA
- Approximate inverse preconditioner based on A-biconjugation (AINV) on CUDA
- Experiments & results
- Conclusion

Slide 3: Introduction – Krylov Subspace Solvers
- Iterative linear system solvers [2]
- Krylov subspace-based solvers (subspace defined below)
- Popular solvers: GMRES, CG, Bi-CG
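For reference, the Krylov subspace these solvers build their iterates in (standard definition from [2]; the slide's own formula is not in the transcript):

    \mathcal{K}_m(A, r_0) = \mathrm{span}\{ r_0,\, A r_0,\, A^2 r_0,\, \dots,\, A^{m-1} r_0 \}, \qquad r_0 = b - A x_0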

Slide 4: Introduction – Preconditioners (1)
- Iteration count grows with the condition number of the matrix A
- Preconditioners [2,9]: improve the conditioning of the 'actual' matrix used in the iteration
- Left & right preconditioning
- Effective matrix & system: see below
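The effective systems referred to above, reconstructed in the standard form of [2] since the slide's equation is not in the transcript. With preconditioner M \approx A:

    \text{left:}\quad M^{-1} A x = M^{-1} b
    \qquad
    \text{right:}\quad A M^{-1} u = b, \ \ x = M^{-1} u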

Slide 5: Introduction – Preconditioned GMRES
[Algorithm figure; sketch below]
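A sketch of one cycle of restarted, right-preconditioned GMRES(m), following Saad [2] (the slide's own algorithm figure is not reproduced in this transcript):

    r_0 = b - A x_0, \quad \beta = \|r_0\|_2, \quad v_1 = r_0 / \beta
    \text{for } j = 1, \dots, m:
    \quad w = A M^{-1} v_j
    \quad h_{i,j} = (w, v_i), \quad w \leftarrow w - h_{i,j} v_i \quad (i = 1, \dots, j) \quad \text{(orthogonalization)}
    \quad h_{j+1,j} = \|w\|_2, \quad v_{j+1} = w / h_{j+1,j}
    y_m = \arg\min_y \|\beta e_1 - \bar{H}_m y\|_2, \qquad x_m = x_0 + M^{-1} V_m y_m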

Slide 6: Introduction – Preconditioners (2)
- Incomplete factorization-based: incomplete LU/Cholesky factorization [1,2]
  - ILU(0), ILUT, ILU(k), ILUTP, etc.
  - Preconditioning: forward/backward elimination
- Approximate inverse-based:
  - A-biconjugation (AINV) [8]
  - Frobenius-norm minimization
  - Preconditioning: matrix-vector products (contrast spelled out below)
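This distinction drives the GPU mapping later in the talk (a standard observation, spelled out here beyond the slide text): applying an incomplete factorization means two triangular solves, which are inherently sequential, while applying an approximate inverse needs only sparse matrix-vector products:

    \text{ILU:}\quad M = \bar{L}\bar{U}, \quad M^{-1} r \text{ via solving } \bar{L} y = r,\ \bar{U} z = y
    \qquad
    \text{AINV:}\quad M^{-1} = Z D^{-1} W^T, \quad z = Z \left( D^{-1} (W^T r) \right)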

Slide 7: A-biconjugation-based Preconditioner
[Algorithm figure; identity below]
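The identity behind the algorithm in the figure, as given in Benzi & Tuma [8] (reconstructed here since the figure is not in the transcript): the process builds unit upper triangular matrices W and Z whose columns are A-biconjugate, dropping small entries during construction to keep W and Z sparse, so that

    W^T A Z = D \ \ \text{(diagonal)} \quad \Longrightarrow \quad A^{-1} = Z D^{-1} W^T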

Slide 8: Introduction – GPGPU & NVIDIA CUDA
- General-purpose computing on Graphics Processing Units [12]
- NVIDIA CUDA [6]: the first (de facto) widely adopted GPGPU platform
- Characteristics of GPUs:
  - Throughput-oriented, SIMD-style architecture
  - High peak FLOPS and memory bandwidth
  - Massively parallel (thousands of concurrent threads)
  - Weak caching, memory model, and programmability
  - Weak support for branches; no ILP mechanisms

Slide 9: Introduction – GPU/CPU Comparison

                      CPU                              GPU
Sample                Intel i7-920 (Nehalem)           NVIDIA Tesla C1060
Freq.                 2.66 GHz                         1.3 GHz
Peak FLOPS (SP/DP)    85 G / 42.5 G                    624 G / 78 G
Peak bandwidth        ~25 GB/s                         ~100 GB/s
Core configuration    4 physical / 8 virtual cores     10 multiprocessors, 240 stream processors
Cache system          3-level coherent cache           2-level cache
                      (32KB x4, 256KB x4, 8MB)         (24KB x10, 256KB)
SW-managed cache      none                             16KB x30

Slide 10: Introduction – NVIDIA CUDA
[Figures: CUDA thread hierarchy; CUDA device abstraction. Kernel sketch below.]
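A minimal CUDA sketch of the grid/block/thread hierarchy the figures depict (illustrative, not from the slides), using an AXPY-style kernel of the kind GMRES needs later:

    #include <cuda_runtime.h>

    // y <- alpha*x + y; one element per thread.
    __global__ void axpy(int n, float alpha, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the tail block
            y[i] = alpha * x[i] + y[i];
    }

    // Launch: a 1-D grid of 1-D blocks covering all n elements, e.g.
    //   axpy<<<(n + 255) / 256, 256>>>(n, alpha, d_x, d_y);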

Slide 11: Data Formats for Sparse Matrices
- ELLPACK & ELLPACK-based (HYB) [4]: good bandwidth utilization (sketch below)
- CSR/CSC (Compressed Sparse Row/Column)
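A sketch of SpMV over the ELLPACK layout in the style of [4] (array names and the padding convention are illustrative assumptions): with K stored entries per row and column-major val/col arrays, one thread per row gives coalesced loads.

    // y = A*x for an n-row ELL matrix with K entries per row.
    // Padded slots hold val = 0 (with any valid column index),
    // so they contribute nothing to the sum.
    __global__ void spmv_ell(int n, int K, const int *col, const float *val,
                             const float *x, float *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) {
                float v = val[k * n + row];  // column-major: adjacent threads
                int   j = col[k * n + row];  // read adjacent addresses (coalesced)
                sum += v * x[j];
            }
            y[row] = sum;
        }
    }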

Slide 12: GMRES in CUDA – Algorithms
[Figures: G-S; Modified G-S. Formulas below.]
Orthogonalization [11,13]:
- Gram-Schmidt
- Modified Gram-Schmidt
- Gram-Schmidt with re-orthogonalization
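The standard forms of the two schemes in the figures (from [11,13], spelled out beyond the slide text): classical G-S projects the new vector v against all previous bases at once, while modified G-S updates the running remainder, which preserves orthogonality much better in finite precision.

    \text{CGS:}\quad w = v - \sum_{i=1}^{j} (v, q_i)\, q_i
    \qquad
    \text{MGS:}\quad w \leftarrow v;\quad w \leftarrow w - (w, q_i)\, q_i,\quad i = 1, \dots, j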

Slide 13: GMRES in CUDA – Implementation
- Sparse matrix-dense vector products (SpMV)
- Orthogonalization: inner products (sketch below); AXPY operations
- Preconditioner – AINV: close relationship to ILU-type preconditioners; high performance / easy parallelization
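The inner products above reduce across an entire vector, which on CUDA is typically done with a shared-memory tree reduction. A minimal sketch (illustrative, not the authors' kernel; assumes a block size of 256 and a second pass that sums the per-block results):

    // Each block writes one partial sum of x.y into block_sums;
    // block_sums is then reduced in a second kernel or on the host.
    __global__ void dot_partial(int n, const float *x, const float *y,
                                float *block_sums) {
        __shared__ float partial[256];              // one slot per thread
        int tid = threadIdx.x;
        float sum = 0.0f;
        for (int i = blockIdx.x * blockDim.x + tid; i < n;
             i += gridDim.x * blockDim.x)           // grid-stride loop
            sum += x[i] * y[i];
        partial[tid] = sum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0) block_sums[blockIdx.x] = partial[0];
    }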

Slide 14: AINV with a Predefined Sparsity Pattern
- AINV-0: W^T + Z has the same sparsity pattern as A (analogous to ILU(0))
- Preconditioner generation: CSC format for both W and Z
- Preconditioning in GMRES: HYB format

Slide 15: AINV in CUDA
- Parallelization: inner iterations over Lines 4-7 and Lines 8-12 of the AINV algorithm (Slide 7)
- Kernels:
  - Sparse-vector x sparse-vector inner products (Lines 5-6; sketch below)
  - Sparse-vector x sparse-vector updates (Lines 9-11)
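One way to realize the sparse-sparse inner product named above (an illustrative sketch, not the authors' kernel; assumes each sparse vector is stored as a sorted index array plus a value array):

    // Dot product of two sparse vectors by merging their sorted index lists.
    __device__ float sparse_dot(int na, const int *ia, const float *va,
                                int nb, const int *ib, const float *vb) {
        float sum = 0.0f;
        int a = 0, b = 0;
        while (a < na && b < nb) {
            if (ia[a] == ib[b])      sum += va[a++] * vb[b++];  // matching index
            else if (ia[a] < ib[b])  ++a;                       // advance the smaller
            else                     ++b;
        }
        return sum;
    }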

Slide 16: Experiments – Tests
- GMRES kernels:
  - Krylov subspace generation: SpMV
  - Orthogonalization
- AINV-0 preconditioner generation
- AINV-0 preconditioned GMRES iteration

Slide 17: Experiments – Configurations

Table 1. System configuration
CPU           Intel i7-920 (4-core, 2.66 GHz)
Memory        12 GB (DDR-III, 1066 MHz)
GPU           NVIDIA Tesla C1060
GPU memory    4 GB
CUDA version  2.0

Table 2. Test matrices
Matrix       Size    NNZ
Protein      36K     4.3M
Cant         62K     4.0M
WindTunnel   218K    11.6M
Epidem       526K    2.1M
Circuit      171K    959K
Petro        132K    4.2M
OPF          2.1M    8.1M
TDS          25K     160K
Cubes        101K    874K
Parabolic    526K    2.1M

Slide 18: Experiments – SpMV
- 3.7x speed-up in SpMV
- Performance is governed by bandwidth utilization and by the distribution of non-zero counts per row

Slide 19: Experiments – Orthogonalization
- Modified G-S scheme; orthogonalizing 1 vector against up to 64 bases
- Short vectors favor the CPU; long vectors favor the GPU

Slide 20: Experiments – AINV-0 Construction
- 2x speed-up on average
- Performance improves with:
  - Lower matrix bandwidth
  - Fewer non-zeros per row
  - Greater sparsity-pattern similarity between adjacent rows
  - Larger matrix size

Slide 21: Experiments – GMRES Iterations
- Restarted GMRES(64)
- Components: orthogonalization (against 1-64 bases); SpMV with A; preconditioning (left, right, & scaling)
- ~3x speed-up per iteration

Slide 22: Conclusion
- >3x speed-ups for the Krylov-subspace method kernels:
  - >3.5x for Krylov subspace generation
  - ~7x for the orthogonalization process on long vectors
  - 2x for AINV-0 preconditioner generation
  - ~3x per GMRES iteration
- Future work:
  - Optimization of both the CPU & GPU implementations
  - AINV with dynamic fill-ins

Slide 23: References
1. Timothy A. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
2. Yousef Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., SIAM, 2003.
3. BLAS – Basic Linear Algebra Subprograms, http://www.netlib.org/blas/
4. Nathan Bell and Michael Garland, Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, SC'09.
5. Muthu Manikandan Baskaran and Rajesh Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, technical report, 2009.
6. CUDA Zone, http://www.nvidia.com/cuda
7. Vasily Volkov and James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra, SC'08.
8. M. Benzi and M. Tuma, A Sparse Approximate Inverse Preconditioner for Nonsymmetric Linear Systems, SIAM J. Sci. Comput. 19 (1998).
9. Michele Benzi, Preconditioning Techniques for Large Linear Systems: A Survey, Journal of Computational Physics 182 (2002), pp. 418-477.
10. Matthias Christen, Olaf Schenk, and Helmar Burkhart, General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform, Workshop on GPGPU, 2007.
11. L. Giraud, J. Langou, and M. Rozloznik, The Loss of Orthogonality in the Gram-Schmidt Orthogonalization Process, Computers & Mathematics with Applications 50 (2005), pp. 1069-1075.
12. GPGPU, http://www.gpgpu.org
13. W. Hoffmann, Iterative Algorithms for Gram-Schmidt Orthogonalization, Computing 41 (1989), pp. 335-348.
14. Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schroder, Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid, SIGGRAPH'03.

Slide 24: Any Questions? Thank You!

