Utilizing CUDA for Preconditioned GMRES Solvers – DCABES'09
Shiming Xu 1, Hai Xiang Lin 1, Wei Xue 2, and Ke Wang 3
1 Delft Institute of Applied Mathematics, TU Delft
2 Department of Computer Science & Technology, Tsinghua University
3 Laboratory of Parallel Software & Computational Science, Institute of Software, Chinese Academy of Sciences

Outline
- Introduction to Krylov-subspace linear system solvers & preconditioning techniques
- Introduction to GPGPU & NVIDIA CUDA
- GMRES solver on CUDA
- Approximate inverse preconditioner based on A-biconjugation (AINV) on CUDA
- Experiments & Results
- Conclusion

Introduction – Krylov Subspace Solvers
- Iterative linear system solvers [2]
- Krylov subspace-based solvers (see the definition below)
- Popular solvers: GMRES, CG, Bi-CG
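The defining subspace did not survive extraction from the slide; as a reminder (standard notation, consistent with [2]), the m-th Krylov subspace generated by A and the initial residual r_0 = b - A x_0 is

$$ \mathcal{K}_m(A, r_0) = \mathrm{span}\{\, r_0,\ A r_0,\ A^2 r_0,\ \dots,\ A^{m-1} r_0 \,\}, $$

and a Krylov-subspace solver such as GMRES picks its m-th iterate from $x_0 + \mathcal{K}_m(A, r_0)$.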

Introduction – Preconditioners (1)
- Iteration count ~ condition of matrix A
- Preconditioners [2,9]: improve the condition of the 'actual' matrix used in the iteration
- Left & right preconditioning
- Effective matrix & system: stated below
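The slide's equation is an image; a standard statement of combined left and right preconditioning (with preconditioners $M_L$ and $M_R$, consistent with [2]) is

$$ M_L^{-1} A M_R^{-1}\, u = M_L^{-1} b, \qquad x = M_R^{-1} u, $$

so the Krylov method effectively iterates on $M_L^{-1} A M_R^{-1}$ instead of $A$, which ideally has a much smaller condition number.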

Introduction – Preconditioned GMRES
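The algorithm listing on the original slide is an image. As a reminder of what the method computes (a standard formulation, not taken verbatim from the slide): writing $B$ for the preconditioned operator (e.g. $B = M_L^{-1} A M_R^{-1}$), m steps of the Arnoldi process produce an orthonormal basis $V_m$ of the Krylov subspace and an $(m+1)\times m$ upper Hessenberg matrix $\bar{H}_m$ with

$$ B V_m = V_{m+1} \bar{H}_m, $$

and the GMRES iterate is $x_m = x_0 + M_R^{-1} V_m y_m$, where $y_m$ solves the small least-squares problem

$$ y_m = \arg\min_y \|\, \beta e_1 - \bar{H}_m y \,\|_2, \qquad \beta = \|r_0\|_2 . $$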

Introduction – Preconditioners (2)
- Incomplete factorization-based:
  - Incomplete LU/Cholesky factorization [1,2]: ILU(0), ILUT, ILU(k), ILUTP, etc.
  - Preconditioning: forward/backward elimination
- Approximate inverse-based:
  - A-biconjugation (AINV) [8]
  - Frobenius-norm minimization
  - Preconditioning: matrix-vector products (contrast with ILU summarized below)
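The contrast between the two families, summarized in standard notation rather than the slide's own: incomplete factorization builds $M = \tilde{L}\tilde{U} \approx A$, so applying $M^{-1}v$ requires two sparse triangular solves (inherently sequential forward/backward sweeps), whereas a factorized approximate inverse builds $M^{-1} = Z D^{-1} W^{T} \approx A^{-1}$ with sparse $W$, $Z$ and diagonal $D$, so applying $M^{-1}v$ reduces to sparse matrix-vector products, which map naturally onto a GPU.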

A-biconjugate-based Preconditioner
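The derivation on this slide is an image; the underlying A-biconjugation relation of Benzi & Tuma [8] (stated here from the reference, not from the slide) is: find unit upper triangular matrices $W$ and $Z$ such that

$$ W^{T} A Z = D = \mathrm{diag}(d_1, \dots, d_n), $$

which yields the factorized sparse approximate inverse

$$ A^{-1} = Z D^{-1} W^{T}. $$

In the incomplete (AINV) variant, small entries of $W$ and $Z$ are dropped during the biconjugation process so that the factors remain sparse.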

Introduction – GPGPU & NVIDIA CUDA
- General-Purpose computing on Graphics Processing Units [12]
- NVIDIA CUDA [6]: first (de facto) widely adopted platform for GPGPU
- Characteristics of GPUs:
  - Throughput-oriented architecture, SIMD style
  - High peak FLOPS and memory bandwidth
  - Massively parallel (thousands of concurrent threads)
  - Weak caching, memory model, and programmability
  - Weak support for branches, no ILP mechanisms

Introduction – GPU/CPU Comparison

                      CPU                                GPU
Sample                Intel i7-920 (Nehalem)             NVIDIA Tesla C1060
Freq.                 2.66 GHz                           1.3 GHz
Peak FLOPS (SP/DP)    85 G / 42.5 G                      624 G / 78 G
Peak Bandwidth        ~25 GB/s                           ~100 GB/s
Core configuration    4 physical cores, 8 virtual cores  30 multiprocessors, 240 stream processors
Cache system          3-level coherent cache             2-level cache
                      (32KB x4, 256KB x4, 8MB)           (24KB x10, 256KB)
SW-managed cache      None                               16KB x30

Introduction – NVIDIA CUDA
[Figures: CUDA Thread Hierarchy; CUDA Device Abstraction]
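As a minimal illustration of the thread hierarchy the figure refers to (my own sketch, not code from the paper): a kernel is launched over a grid of thread blocks, and each thread derives its global index from blockIdx, blockDim, and threadIdx.

```cuda
#include <cuda_runtime.h>

// Each thread handles one vector element; the global index is derived from
// the grid / block / thread hierarchy.
__global__ void scale(float *x, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last partial block
        x[i] *= alpha;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover n
    scale<<<blocks, threadsPerBlock>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}
```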

Data Formats for Sparse Matrices
- ELLPACK & ELLPACK-based (HYB) [4]: good bandwidth utilization (see the sketch below)
- CSR/CSC (Compressed Sparse Row/Column)
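To make the bandwidth argument concrete, here is a sketch in the style of the ELL kernel of Bell & Garland [4] (my own illustration, not the authors' code). ELLPACK pads every row to the same number K of entries and stores values and column indices in column-major order, so consecutive threads (one per row) read consecutive memory locations, i.e. the loads coalesce:

```cuda
// ELLPACK storage: every row padded to 'K' entries; 'val' and 'col' are
// num_rows x K arrays stored column-major, so entry j of row r sits at
// index j * num_rows + r.  Padding entries use col = -1.
__global__ void spmv_ell(int num_rows, int K,
                         const int *col, const float *val,
                         const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (row < num_rows) {
        float sum = 0.0f;
        for (int j = 0; j < K; ++j) {
            int idx = j * num_rows + row;   // column-major => coalesced loads
            int c   = col[idx];
            if (c >= 0)                     // skip padding entries
                sum += val[idx] * x[c];
        }
        y[row] = sum;
    }
}
```

The HYB format of [4] keeps the bulk of each row in the ELL part and spills the few extra-long rows into a COO part, which keeps K small when per-row nonzero counts are skewed.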

GMRES in CUDA – Algorithms
- Orthogonalization [11,13] (schemes compared below):
  - Gram-Schmidt
  - Modified Gram-Schmidt
  - Gram-Schmidt with re-orthogonalization
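The difference between the first two schemes, stated in standard form (the slide's formulas are images): classical Gram-Schmidt projects the new Krylov vector $w = A v_j$ against all previous basis vectors using the original $w$,

$$ h_{ij} = v_i^{T} w, \quad i = 1,\dots,j, \qquad \tilde{v} = w - \sum_{i=1}^{j} h_{ij}\, v_i, $$

so all inner products can be computed in one batched operation, while modified Gram-Schmidt updates $w$ after each projection,

$$ w \leftarrow w - (v_i^{T} w)\, v_i, \quad i = 1,\dots,j, $$

which is numerically more stable but serializes the j inner products [11,13].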

GMRES in CUDA – Implementation
- Sparse matrix – dense vector products (SpMV)
- Orthogonalization: inner products, AXPY operations (see the kernel sketch below)
- Preconditioner – AINV:
  - Close relationship to ILU-type preconditioners
  - High performance / easy parallelization
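The orthogonalization and update steps reduce to BLAS-1 style kernels. As a hedged example (my own sketch, not the authors' implementation), an AXPY kernel y <- y + alpha*x on the GPU is simply:

```cuda
// y <- y + alpha * x, one thread per vector element.
__global__ void axpy(int n, float alpha, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] += alpha * x[i];
}
```

Inner products additionally require a parallel reduction (per-block partial sums followed by a second pass), which adds fixed overhead per call; this is one reason short vectors favor the CPU and long vectors favor the GPU, as the orthogonalization experiments later show.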

AINV w/ Predefined Sparsity Pattern
- AINV-0: W^T + Z has the same sparsity pattern as A (similar to ILU(0))
- Preconditioner generation: CSC format for both W and Z
- Preconditioning in GMRES: HYB format

AINV in CUDA
- Parallelization: inner iterations on lines 4~7 and 8~12 of the AINV algorithm listing
- Kernels:
  - Sparse-vector / sparse-vector inner products (lines 5~6)
  - Sparse-vector / sparse-vector updates (lines 9~11)

Experiments – Tests
- GMRES kernels:
  - Krylov subspace generation: SpMV
  - Orthogonalization
- AINV-0 preconditioner generation
- AINV-0 preconditioned GMRES iteration

Experiments – Configurations

Table 1. System configuration
CPU           Intel i7-920 (4-core, 2.66 GHz)
Memory        12 GB (DDR-III, 1066 MHz)
GPU           NVIDIA Tesla C1060
GPU Memory    4 GB
CUDA Version  2.0

Table 2. Test matrices
Matrix      Size    NNZ
Protein     36K     4.3M
Cant        62K     4.0M
WindTunnel  218K    11.6M
Epidem      526K    2.1M
Circuit     171K    959K
Petro       132K    4.2M
OPF         2.1M    8.1M
TDS         25K     160K
Cubes       101K    874K
Parabolic   526K    2.1M

Experiments – SpMV
- 3.7x speedup in SpMV
- Performance is determined by:
  - Bandwidth utilization
  - Distribution of the non-zero element count per row

Experiments – Orthogonalization
- Modified G-S scheme
- Orthogonalization of 1 vector against up to 64 basis vectors
- Short vectors: CPU faster; long vectors: GPU faster

Experiments – AINV-0 Construction
- Averaged 2x speed-up
- Performance improves with:
  - Lower matrix bandwidth
  - Fewer non-zeros per row
  - Higher sparsity-pattern similarity between adjacent rows
  - Larger matrix size

Experiments – GMRES Iterations
- Restarted GMRES(64)
- Components:
  - Orthogonalization (against 1~64 basis vectors)
  - SpMV with A
  - Preconditioning: left, right, & scaling
- ~3x speed-up per iteration

Conclusion
- >3x speed-ups for Krylov-subspace method kernels:
  - >3.5x speed-up for Krylov-subspace generation
  - ~7x speed-up for the orthogonalization process for long vectors
  - 2x speed-up for AINV-0 preconditioner generation
  - ~3x speed-up per GMRES iteration
- Future work:
  - Optimization of both CPU & GPU implementations
  - AINV with dynamic fill-ins

References
1. Timothy A. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
2. Yousef Saad, Iterative Methods for Sparse Linear Systems, 2nd Ed., SIAM, 2003.
3. BLAS – Basic Linear Algebra Subprograms.
4. Nathan Bell and Michael Garland, Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, SC'09.
5. Muthu Manikandan Baskaran and Rajesh Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, Technical Report.
6. NVIDIA CUDA Zone.
7. Vasily Volkov and James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra, SC'08.
8. M. Benzi and M. Tuma, A Sparse Approximate Inverse Preconditioner for Nonsymmetric Linear Systems, SIAM J. Sci. Comput. 19 (1998).
9. Michele Benzi, Preconditioning Techniques for Large Linear Systems: A Survey, Journal of Computational Physics 182 (2002).
10. Matthias Christen, Olaf Schenk and Helmar Burkhart, General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform, Workshop on GPGPU.
11. L. Giraud, J. Langou and M. Rozloznik, The Loss of Orthogonality in the Gram-Schmidt Orthogonalization Process, Computers & Mathematics with Applications 50 (2005).
12. GPGPU.
13. W. Hoffmann, Iterative Algorithms for Gram-Schmidt Orthogonalization, Computing 41 (1989).
14. Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schroder, Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid, SIGGRAPH'05.

ANY QUESTIONS? THANK YOU!