
L11: Sparse Linear Algebra on GPUs CS6235

Sparse Linear Algebra

GPU Challenges: How do we partition the computation? What are the memory access patterns? Where do we need a parallel reduction? BUT the good news is that sparse linear algebra performs terribly on conventional architectures, so the poor baseline leaves plenty of room for improvement!

Some common representations

DIA (diagonal format): stores elements along a set of diagonals, identified by their offsets from the main diagonal (offsets = [-2 0 1] in the slide's example). Each thread iterates over the diagonals.
+ Avoids the need to store row/column indices
+ Guarantees coalesced access
- Can be wasteful if a lot of padding is required

ELL (ELLPACK format): stores a fixed set of K elements per row and pads short rows as needed. Best suited when the number of nonzeros is roughly consistent across rows.
- Efficiency rapidly degrades when the number of nonzeros per matrix row varies

(The example matrix A and its DIA/ELL data, indices, and offsets arrays appeared as a figure on the original slide.)
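To make the two layouts concrete, the arrays below spell out the figure that was lost from this slide. The values follow the well-known example matrix from Bell and Garland's SpMV paper, which the surviving fragments (offsets = [-2 0 1] and partial rows such as "1 7" and "2 8") appear to match; treat the specific numbers as an illustration rather than a recovered original.

/* Illustrative 4x4 matrix and its DIA and ELL layouts ('*' = padding,
 * encoded here as a 0 value and a -1 column index):
 *
 *      [ 1 7 0 0 ]
 *  A = [ 0 2 8 0 ]
 *      [ 5 0 3 9 ]
 *      [ 0 6 0 4 ]
 */

/* DIA: three diagonals at offsets -2, 0, +1; row i of 'dia_data' holds the
 * entry of each diagonal that falls in row i of A.                        */
int   dia_offsets[3]    = { -2, 0, 1 };
float dia_data[4][3]    = { { 0, 1, 7 },     /* row 0:  *  1  7 */
                            { 0, 2, 8 },     /* row 1:  *  2  8 */
                            { 5, 3, 9 },     /* row 2:  5  3  9 */
                            { 6, 4, 0 } };   /* row 3:  6  4  * */

/* ELL with K = 3 slots per row: the values and their column indices. */
float ell_data[4][3]    = { { 1, 7, 0 },
                            { 2, 8, 0 },
                            { 5, 3, 9 },
                            { 6, 4, 0 } };
int   ell_indices[4][3] = { { 0, 1, -1 },
                            { 1, 2, -1 },
                            { 0, 2,  3 },
                            { 1, 3, -1 } };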

Some common representations

CSR (Compressed Sparse Row): store only the nonzero elements, with a "ptr" array pointing to the beginning of each row and an "indices" array giving each element's column.

COO (Coordinate format): store the nonzero elements together with their (row, column) coordinates.
+ Performance is largely insensitive to irregularity in the underlying data structure

(The ptr/indices/data and row/indices/data arrays for the example matrix appeared as a figure on the original slide.)
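For the same example matrix, a minimal sketch of the CSR and COO arrays (again using the assumed Bell/Garland values, since the slide's own figure did not survive):

/* CSR: row j's nonzeros occupy data[ptr[j]] .. data[ptr[j+1]-1]. */
int   csr_ptr[5]     = { 0, 2, 4, 7, 9 };
int   csr_indices[9] = { 0, 1,   1, 2,   0, 2, 3,   1, 3 };
float csr_data[9]    = { 1, 7,   2, 8,   5, 3, 9,   6, 4 };

/* COO: one (row, column, value) triple per nonzero. */
int   coo_row[9]     = { 0, 0,   1, 1,   2, 2, 2,   3, 3 };
int   coo_indices[9] = { 0, 1,   1, 2,   0, 2, 3,   1, 3 };
float coo_data[9]    = { 1, 7,   2, 8,   5, 3, 9,   6, 4 };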

CSR Example: a sequential sparse matrix-vector multiply over the CSR arrays, accumulating t = A*x row by row.

for (j = 0; j < nr; j++) {
  for (k = ptr[j]; k < ptr[j+1]; k++)
    t[j] = t[j] + data[k] * x[indices[k]];
}
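On the GPU, the simplest parallelization of this loop assigns one thread per row; this is the CSR "scalar" kernel summarized in the table on the next slide. A minimal CUDA sketch, assuming the ptr/indices/data/x/y arrays already live in device memory (the kernel name and launch configuration are illustrative, not code from the course materials):

// Scalar CSR SpMV: thread i computes row i of y = A*x.
// Adjacent threads walk different rows, so loads from data/indices are
// rarely coalesced -- the weakness noted for CSR(s) in the table.
__global__ void spmv_csr_scalar(int num_rows, const int *ptr,
                                const int *indices, const float *data,
                                const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        for (int k = ptr[row]; k < ptr[row + 1]; k++)
            sum += data[k] * x[indices[k]];
        y[row] = sum;
    }
}

// Illustrative launch: one thread per row, 256 threads per block.
// spmv_csr_scalar<<<(num_rows + 255) / 256, 256>>>(num_rows, d_ptr, d_indices, d_data, d_x, d_y);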

Summary of Representation and Implementation

Kernel   Granularity     Coalescing   Bytes/Flop (32-bit)   Bytes/Flop (64-bit)
DIA      thread : row    full         4                     8
ELL      thread : row    full         6                     10
CSR(s)   thread : row    rare         6                     10
CSR(v)   warp : row      partial      6                     10
COO      thread : nonz   full         8                     12
HYB      thread : row    full         6                     10

Table 1 from Bell/Garland: Summary of SpMV kernel properties.
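The CSR(v) row of the table assigns a 32-thread warp to each row instead of a single thread, which makes the loads from data and indices mostly coalesced. A hedged sketch of that idea using warp shuffles for the in-warp reduction (an approximation of the approach, not the exact Bell/Garland kernel; it assumes the block size is a multiple of 32):

// Vector CSR SpMV: one warp per row; lanes stride through the row's
// nonzeros, then combine their partial sums with warp shuffles.
__global__ void spmv_csr_vector(int num_rows, const int *ptr,
                                const int *indices, const float *data,
                                const float *x, float *y)
{
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    if (row < num_rows) {
        float sum = 0.0f;
        for (int k = ptr[row] + lane; k < ptr[row + 1]; k += 32)
            sum += data[k] * x[indices[k]];
        // reduce the 32 partial sums within the warp
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        if (lane == 0)
            y[row] = sum;
    }
}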

Other Representation Examples

Blocked CSR
– Represent the nonzeros as a set of blocks, usually of fixed size
– Within each block, treat the block as dense and pad it with zeros
– Each block then looks like a standard matrix-vector multiply
– So it performs well for blocks of decent size

Hybrid ELL/COO (see the sketch after this list)
– Exploits two observations: the ELLPACK format is well suited to vector and SIMD architectures, but its efficiency rapidly degrades when the number of nonzeros per matrix row varies; the storage efficiency of the COO format is invariant to the distribution of nonzeros per row
– Find a "K" value that works for most of the matrix
– Use COO for rows with more nonzeros (or even significantly fewer)

Richard Vuduc, Attila Gyulassy, James W. Demmel, and Katherine A. Yelick. Memory hierarchy optimizations and performance bounds for sparse A^T A x. In Proceedings of the 2003 International Conference on Computational Science: Part III (ICCS'03).
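A minimal sketch of the hybrid idea, assuming the matrix has already been split so that each row's first K nonzeros sit in ELL arrays (column-major, padded slots flagged with a -1 index) and the overflow nonzeros sit in COO arrays; the function and array names are made up for illustration, and a sequential version is shown for clarity:

/* Hybrid (HYB) SpMV sketch: regular part in ELL, overflow in COO. */
void spmv_hyb(int num_rows, int K,
              const int *ell_indices, const float *ell_data,   /* ELL part */
              int coo_nnz, const int *coo_row,
              const int *coo_col, const float *coo_val,        /* COO part */
              const float *x, float *y)
{
    /* ELL part: fixed K slots per row; padded slots carry index -1. */
    for (int i = 0; i < num_rows; i++) {
        float sum = 0.0f;
        for (int c = 0; c < K; c++) {
            int idx = ell_indices[c * num_rows + i];   /* column-major */
            if (idx >= 0)
                sum += ell_data[c * num_rows + i] * x[idx];
        }
        y[i] = sum;
    }
    /* COO part: scatter the leftover nonzeros into the result. */
    for (int n = 0; n < coo_nnz; n++)
        y[coo_row[n]] += coo_val[n] * x[coo_col[n]];
}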

Stencil Example

What is a 3-point stencil? A 5-point stencil? 7-point? 9-point? 27-point?

Examples:

3-point (1D):  a[i] = 2*b[i] - (b[i-1] + b[i+1]);
5-point (2D):  a[i][j] = 4*b[i][j] - (b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1]);

(The stencil diagrams and the corresponding matrix pictures from the original slide did not survive extraction.)
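To tie the stencils back to the sparse formats above: the 1D 3-point stencil corresponds to a tridiagonal matrix, which DIA captures with just three diagonals. A small assumed sketch (the grid size N and array names are made up for illustration):

/* The stencil a[i] = 2*b[i] - (b[i-1] + b[i+1]) is the matrix-vector
 * product of b with a tridiagonal matrix: 2 on the main diagonal, -1 on
 * the sub- and super-diagonals.  In DIA form that is offsets {-1, 0, 1}
 * with the boundary rows padded.                                        */
#define N 8
int   offsets[3] = { -1, 0, 1 };
float dia[N][3];                  /* dia[i][d]: diagonal d's entry in row i */

void build_3pt_stencil_dia(void)
{
    for (int i = 0; i < N; i++) {
        dia[i][0] = (i > 0)     ? -1.0f : 0.0f;   /* sub-diagonal   */
        dia[i][1] =  2.0f;                        /* main diagonal  */
        dia[i][2] = (i < N - 1) ? -1.0f : 0.0f;   /* super-diagonal */
    }
}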

Stencil Result (structured matrices)

See Figures 11 and 12 in Bell and Garland.

Unstructured Matrices

See Figures 13 and 14 in Bell and Garland. Note that graphs can also be represented as sparse matrices. What is an adjacency matrix?
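As a small illustration of the graph connection: an adjacency matrix has a nonzero at (i, j) exactly when there is an edge from node i to node j, so the CSR ptr/indices arrays of that matrix are simply the graph's edges grouped by source node. A hypothetical 4-node directed graph (edges 0->1, 0->2, 1->2, 2->3, 3->0) would be stored as:

/* CSR adjacency structure: row i lists the out-neighbors of node i.
 * The data array can be dropped, or kept to hold edge weights.      */
int adj_ptr[5]     = { 0, 2, 3, 4, 5 };
int adj_indices[5] = { 1, 2,   2,   3,   0 };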

PPoPP Paper

What if you customize the representation to the problem? What about additional global data structure modifications (like a blocked representation)?

Strategy
– Apply models and autotuning to identify the best solution for each application

Summary of Results

BELLPACK (blocked ELLPACK) achieves up to 29 Gflop/s in single precision and 15.7 Gflop/s in double precision, up to a 1.8x and 1.5x improvement, respectively, over Bell and Garland.

This Lecture
– Exposure to the issues in a sparse matrix-vector computation on GPUs
– A set of implementations and their expected performance
– A little on how to improve performance through application-specific knowledge and customization of the sparse matrix representation