Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
Nathan Bell and Michael Garland
Overview
- GPUs deliver high SpMV performance: 10+ GFLOP/s on unstructured matrices, 140+ GByte/s memory bandwidth
- No one-size-fits-all approach: match the method to the matrix structure
- Exploit structure when possible: fast methods for the regular portion, robust methods for the irregular portion
Characteristics of SpMV
- Memory bound: the FLOP:byte ratio is very low
- Generally irregular and unstructured, unlike dense matrix operations (BLAS)
- Computational intensity is low because we read in a lot of data per FLOP (2 FLOPs per nonzero)
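A rough back-of-the-envelope estimate of that ratio (a sketch, assuming double-precision CSR and counting only the 8-byte value and 4-byte column index read per nonzero, ignoring row-pointer and source-vector traffic):

\[ \frac{\text{flops}}{\text{bytes}} \approx \frac{2 \text{ flops per nonzero}}{(8+4) \text{ bytes per nonzero}} = \frac{1}{6} \]

so SpMV is limited by memory bandwidth, not arithmetic throughput.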
Solving Sparse Linear Systems
- Iterative methods: CG, GMRES, BiCGstab, etc.
- Require 100s or 1000s of SpMV operations
- SpMV represents the dominant cost in iterative solvers
Finite-Element Methods
- Discretized on structured or unstructured meshes
- The mesh determines the matrix sparsity structure
Objectives
- Expose sufficient parallelism: develop 1000s of independent threads
- Minimize execution path divergence (SIMD utilization)
- Minimize memory access divergence (memory coalescing)
Sparse Matrix Formats
[Figure: formats arranged on a spectrum from structured to unstructured]
- (CSR) Compressed Sparse Row
- (COO) Coordinate
- (ELL) ELLPACK
- (DIA) Diagonal
- (HYB) Hybrid
No one-size-fits-all solution: different formats are tailored to different sparsity structures.
- DIA and ELL are ideal vector formats, but impose rigid constraints on matrix structure
- In contrast, CSR and COO accommodate unstructured matrices, but at the cost of irregularity in the matrix format
- HYB is a hybrid of ELL and COO that splits the matrix into structured and unstructured portions
Compressed Sparse Row (CSR)
- Rows laid out in sequence
- Inconvenient for fine-grained parallelism; even CPU SIMD doesn't map well to CSR unless you have blocks
- We introduce CSR (scalar) and CSR (vector); a CSR (block) variant doesn't work because 32-thread blocks don't fill the machine
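As a concrete illustration, here is a small matrix in the standard three-array CSR layout (a sketch; the Ap/Aj/Ax names match the kernels later in this deck):

// A 4x4 example:
//
//     [1 0 2 0]
// A = [0 3 0 0]
//     [4 0 5 6]
//     [0 0 0 7]
//
const int    Ap[] = {0, 2, 3, 6, 7};        // row pointers: row i spans [Ap[i], Ap[i+1])
const int    Aj[] = {0, 2, 1, 0, 2, 3, 3};  // column index of each nonzero
const double Ax[] = {1, 2, 3, 4, 5, 6, 7};  // value of each nonzero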
CSR (scalar) kernel
- One thread per row
- Poor memory coalescing
- Unaligned memory access
[Figure: threads 0, 1, 2, 3, 4, ... each assigned one row]
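A minimal sketch of the one-thread-per-row idea (not the tuned kernel from the paper; it assumes the CSR arrays shown above):

__global__ void csr_spmv_scalar(const int num_rows,
                                const int * Ap, const int * Aj,
                                const double * Ax,
                                const double * x, double * y)
{
    const int row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row < num_rows)
    {
        // each thread walks one whole row; neighboring threads touch
        // distant addresses, hence the poor coalescing noted above
        double sum = 0.0;
        for (int jj = Ap[row]; jj < Ap[row + 1]; jj++)
            sum += Ax[jj] * x[Aj[jj]];
        y[row] = sum;
    }
}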
CSR (vector) kernel
- One SIMD vector or warp per row
- Partial memory coalescing
- Unaligned memory access
[Figure: warps 0, 1, 2, 3, 4 each assigned one row]
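A sketch of the warp-per-row variant with a warp-synchronous reduction in shared memory (simplified; assumes a block size of 256 and pre-Volta warp-synchronous semantics, matching the hardware of this era):

#define WARP_SIZE  32
#define BLOCK_SIZE 256

__global__ void csr_spmv_vector(const int num_rows,
                                const int * Ap, const int * Aj,
                                const double * Ax,
                                const double * x, double * y)
{
    __shared__ volatile double sums[BLOCK_SIZE];

    const int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    const int lane      = thread_id & (WARP_SIZE - 1);
    const int row       = thread_id / WARP_SIZE;     // one warp per row

    if (row < num_rows)
    {
        // lanes of a warp read consecutive nonzeros: partial coalescing
        double sum = 0.0;
        for (int jj = Ap[row] + lane; jj < Ap[row + 1]; jj += WARP_SIZE)
            sum += Ax[jj] * x[Aj[jj]];
        sums[threadIdx.x] = sum;

        // intra-warp tree reduction (warp-synchronous, hence volatile)
        if (lane < 16) sums[threadIdx.x] += sums[threadIdx.x + 16];
        if (lane <  8) sums[threadIdx.x] += sums[threadIdx.x +  8];
        if (lane <  4) sums[threadIdx.x] += sums[threadIdx.x +  4];
        if (lane <  2) sums[threadIdx.x] += sums[threadIdx.x +  2];
        if (lane <  1) sums[threadIdx.x] += sums[threadIdx.x +  1];

        if (lane == 0)
            y[row] = sums[threadIdx.x];
    }
}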
ELLPACK (ELL)
- Storage for K nonzeros per row
- Pad rows with fewer than K nonzeros
- Inefficient when row length varies
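A small example of the padded layout (column-major with a stride, matching the ell_spmv kernel near the end of this deck; padding is flagged with a column index of -1):

//     [1 0 2]
// A = [0 3 0]     K = 2 nonzeros per row; row 1 is padded
//     [4 5 0]
//
// stored column-major with stride = num_rows = 3;
// entry n of row r lives at index r + n * stride
const int    Aj[] = {0, 1, 0,   2, -1, 1};  // column indices, -1 marks padding
const double Ax[] = {1, 3, 4,   2,  0, 5};  // values, padded slots hold 0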
Hybrid Format
- ELL handles typical entries
- COO handles exceptional entries, implemented with segmented reduction
- The threshold K represents the typical number of nonzeros per row
- In many cases, such as FEM matrices, ELL stores 90% or more of the matrix entries
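The COO kernel in the paper uses a parallel segmented reduction; as a simpler illustration of the one-thread-per-nonzero idea, here is a sketch using atomicAdd instead (note that native double-precision atomicAdd requires a far newer GPU than the GTX 285 discussed here):

__global__ void coo_spmv_atomic(const int num_nonzeros,
                                const int * rows, const int * cols,
                                const double * vals,
                                const double * x, double * y)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < num_nonzeros)
        // one thread per nonzero: perfectly balanced regardless of
        // row lengths, at the cost of atomic updates to y
        atomicAdd(&y[rows[i]], vals[i] * x[cols[i]]);
}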
Exposing Parallelism
- DIA, ELL & CSR (scalar): one thread per row
- CSR (vector): one warp per row
- COO: one thread per nonzero
- Finer granularity going down the list
Exposing Parallelism
- SpMV performance for matrices with 4M nonzeros and various shapes (rows × columns = 4M)
- We chose the HYB splitting strategy to ensure that HYB = max(ELL, COO) on this chart
- Even if your matrix is stored perfectly in ELL, with fewer than roughly 20K rows there is insufficient parallelism
Execution Divergence
- Variable row lengths can be problematic
- Idle threads in CSR (scalar)
- Idle processors in CSR (vector)
- Robust strategies exist: COO is insensitive to row length
Execution Divergence
Memory Access Divergence
- Uncoalesced memory access is very costly; sometimes mitigated by cache
- Misaligned access is suboptimal: align the matrix format to the coalescing boundary
- Access to the matrix representation: DIA, ELL and COO are fully coalesced; CSR (vector) is partially coalesced; CSR (scalar) is seldom coalesced
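For example, when building the ELL arrays the stride can be rounded up so each column of the data structure starts on a coalescing boundary (the 32-element granularity is an assumption; use the alignment your GPU generation requires):

// round num_rows up to a multiple of 'alignment' so that column n
// of the ELL arrays begins at an aligned address
int aligned_stride(int num_rows, int alignment /* e.g. 32 */)
{
    return alignment * ((num_rows + alignment - 1) / alignment);
}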
Memory Bandwidth (AXPY)
- As long as we obey the memory coalescing rules, we observe high bandwidth utilization and near-peak performance
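For reference, a minimal double-precision AXPY kernel: consecutive threads touch consecutive words, so every read and write is fully coalesced:

__global__ void daxpy(const int n, const double a,
                      const double * x, double * y)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // pure streaming: fully coalesced
}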
Performance Results
- GeForce GTX 285; peak memory bandwidth: 159 GByte/s
- All results in double precision
- Source vector accessed through the texture cache
- Structured matrices: common stencils on regular grids
- Unstructured matrices: wide variety of applications and sparsity patterns
Structured Matrices
Unstructured Matrices
Performance Comparison

System   | Cores    | Clock (GHz) | Notes
GTX 285  | 240      | 1.5         | NVIDIA GeForce GTX 285
Cell     | 8 (SPEs) | 3.2         | IBM QS20 Blade (half)
Core i7  | 4        | 3.0         | Intel Core i7 (Nehalem)

Sources:
- Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, N. Bell and M. Garland, Proc. Supercomputing '09, November 2009
- Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, Samuel Williams et al., Supercomputing 2007
Performance Comparison
- The GTX 285 is faster in all 14 cases
- GPUs have much higher peak bandwidth than multicore platforms, and of that bandwidth we deliver a high fraction
ELL kernel

__global__ void ell_spmv(const int num_rows,
                         const int num_cols,
                         const int num_cols_per_row,
                         const int stride,
                         const int * Aj,      // column indices (int, not double)
                         const double * Ax,   // nonzero values
                         const double * x,
                         double * y)
{
    const int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    const int grid_size = gridDim.x * blockDim.x;

    for (int row = thread_id; row < num_rows; row += grid_size)
    {
        double sum = y[row];
        int offset = row;

        for (int n = 0; n < num_cols_per_row; n++)
        {
            const int col = Aj[offset];
            if (col != -1)
                sum += Ax[offset] * x[col];
            offset += stride;
        }

        y[row] = sum;
    }
}

25-line kernel; DIA is similarly short, CSR (vector) & COO are somewhat more complex. A version with texture caching would use tex1Dfetch(x_texture, col).
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>

int main(void)
{
    // create an empty sparse matrix structure (HYB format)
    cusp::hyb_matrix<int, double, cusp::device_memory> A;

    // load a matrix stored in MatrixMarket format
    cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

    // allocate storage for solution (x) and right hand side (b)
    cusp::array1d<double, cusp::device_memory> x(A.num_rows, 0);
    cusp::array1d<double, cusp::device_memory> b(A.num_rows, 1);

    // solve linear system with the Conjugate Gradient method
    cusp::krylov::cg(A, x, b);

    return 0;
}

http://cusp-library.googlecode.com
Extensions & Optimizations
- Block formats (register blocking): Block CSR, Block ELL
- Block vectors: solve multiple RHS, Block Krylov methods
- Other optimizations: better CSR (vector)
Further Reading
- Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, N. Bell and M. Garland, Proc. Supercomputing '09, November 2009
- Efficient Sparse Matrix-Vector Multiplication on CUDA, NVIDIA Tech Report NVR-2008-004, December 2008
- Optimizing Sparse Matrix-Vector Multiplication on GPUs, M. M. Baskaran and R. Bordawekar, IBM Research Report RC24704, IBM, April 2009
- Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs, J. W. Choi, A. Singh, and R. Vuduc, Proc. ACM SIGPLAN PPoPP, January 2010

CUSP is an open source library (Apache license) of sparse matrix operations and solvers. We are open to collaboration.
Questions? nbell@nvidia.com