L12: Sparse Linear Algebra on GPUs CS6963

Administrative Issues
Next assignment, triangular solve
– Due 5PM, Monday, March 8
– handin cs6963 lab 3
Project proposals
– Due 5PM, Wednesday, March 17 (hard deadline)
– handin cs6963 prop

Outline
Project
– A few more suggested topics
Sparse Linear Algebra
Readings:
– "Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors," Bell and Garland (NVIDIA), SC09, Nov. 2009
– "Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs," Choi, Singh, Vuduc, PPoPP '10, Jan. 2010
– "Optimizing sparse matrix-vector multiply on emerging multicore platforms," Journal of Parallel Computing, 35(3), March 2009. (Expanded from SC07 paper.)

More Project Ideas
Suggestions from Kris Sikorski:
– Singular value decomposition
– Newton's method

Sparse Linear Algebra
Suppose you are applying matrix-vector multiply and the matrix has lots of zero elements
– Computation cost? Space requirements?
General sparse matrix representation concepts
– Primarily represent only the nonzero data values
– Auxiliary data structures describe placement of the nonzeros in the "dense matrix"

GPU Challenges
Computation partitioning?
Memory access patterns?
Parallel reduction
BUT, the good news is that sparse linear algebra performs TERRIBLY on conventional architectures, so a poor baseline leaves room for improvement!

Some common representations
Example matrix (the running example from Bell and Garland):

    A = [ 1 7 0 0 ]
        [ 0 2 8 0 ]
        [ 5 0 3 9 ]
        [ 0 6 0 4 ]

DIA: Store elements along a set of diagonals.

    offsets = [ -2 0 1 ]    data = [ * 1 7 ]
                                   [ * 2 8 ]
                                   [ 5 3 9 ]
                                   [ 6 4 * ]

ELL: Store a set of K elements per row and pad as needed. Best suited when the number of nonzeros is roughly consistent across rows.

    data = [ 1 7 * ]    indices = [ 0 1 * ]
           [ 2 8 * ]              [ 1 2 * ]
           [ 5 3 9 ]              [ 0 2 3 ]
           [ 6 4 * ]              [ 1 3 * ]

Compressed Sparse Row (CSR): Store only nonzero elements, with "ptr" pointing to the beginning of each row and "indices" giving each element's column.

    ptr     = [ 0 2 4 7 9 ]
    indices = [ 0 1 1 2 0 2 3 1 3 ]
    data    = [ 1 7 2 8 5 3 9 6 4 ]

COO: Store nonzero elements and their corresponding "coordinates".

    row     = [ 0 0 1 1 2 2 2 3 3 ]
    indices = [ 0 1 1 2 0 2 3 1 3 ]
    data    = [ 1 7 2 8 5 3 9 6 4 ]
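For contrast with the CSR loop on the next slide, a serial COO traversal is a single pass over the stored nonzeros. A minimal sketch, with nnz (the nonzero count) assumed here and the other names matching the arrays above; t is assumed zero-initialized:

    // Serial COO SpMV: one pass over all nnz stored nonzeros;
    // each nonzero is accumulated into the row it belongs to.
    for (int n = 0; n < nnz; n++)
        t[row[n]] += data[n] * x[indices[n]];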

CSR Example

    // Serial CSR SpMV: row j's nonzeros are data[ptr[j]] .. data[ptr[j+1]-1].
    for (j = 0; j < nr; j++) {
        t[j] = 0;
        for (k = ptr[j]; k < ptr[j+1]; k++)
            t[j] = t[j] + data[k] * x[indices[k]];
    }
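The loop above maps directly onto a "scalar" CUDA kernel with one thread per row, the CSR(s) variant summarized on the next slide. A minimal sketch, assuming float data; the kernel and parameter names here are chosen for illustration:

    __global__ void spmv_csr_scalar(int num_rows, const int *ptr,
                                    const int *indices, const float *data,
                                    const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float dot = 0.0f;
            // Each thread walks its own row; neighboring threads touch
            // distant parts of data/indices, so loads rarely coalesce.
            for (int k = ptr[row]; k < ptr[row + 1]; k++)
                dot += data[k] * x[indices[k]];
            y[row] = dot;
        }
    }

This access pattern is why the next slide lists coalescing for CSR(s) as "rare".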

Summary of Representation and Implementation

                                          Bytes/Flop
    Kernel   Granularity     Coalescing   32-bit   64-bit
    DIA      thread : row    full            4        8
    ELL      thread : row    full            6       10
    CSR(s)   thread : row    rare            6       10
    CSR(v)   warp : row      partial         6       10
    COO      thread : nonz   full            8       12
    HYB      thread : row    full            6       10

Table 1 from Bell/Garland: Summary of SpMV kernel properties.
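To make the "warp : row" granularity of CSR(v) concrete, here is a hedged sketch in the spirit of Bell and Garland's vector kernel: one 32-thread warp cooperates on each row, with the 32 partial sums combined via warp shuffles (a modern substitute for their shared-memory reduction). It assumes blockDim.x is a multiple of 32; all names are illustrative:

    __global__ void spmv_csr_vector(int num_rows, const int *ptr,
                                    const int *indices, const float *data,
                                    const float *x, float *y)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = tid & 31;   // lane within the warp
        int row  = tid / 32;   // one warp per row
        if (row < num_rows) {
            float dot = 0.0f;
            // Lanes read consecutive nonzeros, so loads coalesce
            // (partially: rows need not start on 32-element boundaries).
            for (int k = ptr[row] + lane; k < ptr[row + 1]; k += 32)
                dot += data[k] * x[indices[k]];
            // Intra-warp tree reduction of the partial sums.
            for (int offset = 16; offset > 0; offset >>= 1)
                dot += __shfl_down_sync(0xffffffff, dot, offset);
            if (lane == 0)
                y[row] = dot;
        }
    }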

Other Representation Examples
Blocked CSR
– Represent nonzeros as a set of blocks, usually of fixed size
– Within each block, treat as dense and pad the block with zeros
– Each block looks like a standard matvec
– So performs well for blocks of decent size
Hybrid ELL and COO (see the kernel sketch below)
– Find a "K" value that works for most of the matrix
– Use COO for rows with more than K nonzeros (or even significantly fewer)
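For the ELL half of the hybrid, a minimal kernel sketch; it assumes the padded data/indices arrays are stored column-major (entry n of every row contiguous), which is what makes ELL's loads fully coalesce, and that padding slots hold zero:

    __global__ void spmv_ell(int num_rows, int K, const int *indices,
                             const float *data, const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float dot = 0.0f;
            for (int n = 0; n < K; n++) {
                // Column-major layout: neighboring threads read adjacent
                // elements of data/indices, giving coalesced loads.
                float val = data[n * num_rows + row];
                if (val != 0.0f)   // skip zero padding
                    dot += val * x[indices[n * num_rows + row]];
            }
            y[row] = dot;
        }
    }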

Stencil Example
What is a 3-point stencil? 5-point stencil? 7-point? 9-point? 27-point?
How is this represented by a sparse matrix?
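As a worked example: a 3-point stencil updates each point of a 1D grid from itself and its two neighbors, y[i] = a*x[i-1] + b*x[i] + c*x[i+1], so the matrix is tridiagonal. On a 4-point grid (boundary terms dropped):

    A = [ b c 0 0 ]
        [ a b c 0 ]     DIA representation: offsets = [ -1 0 1 ]
        [ 0 a b c ]
        [ 0 0 a b ]

A 5-point stencil on an n-by-n 2D grid likewise produces five diagonals, at offsets [ -n -1 0 1 n ], which is why DIA fits structured matrices so well.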

Stencil Result (structured matrices)
See Figures 11 and 12, Bell and Garland

Unstructured Matrices
See Figures 13 and 14, Bell and Garland

PPoPP paper
What if you customize the representation to the problem?
Additional global data structure modifications (like the blocked representation)?
Strategy
– Apply models and autotuning to identify the best solution for each application

Summary of Results
BELLPACK (blocked ELLPACK) achieves up to 29 Gflop/s in single precision and 15.7 Gflop/s in double precision, up to 1.8x and 1.5x improvement, respectively, over Bell and Garland.

This Lecture
Exposure to the issues in a sparse matrix-vector computation on GPUs
A set of implementations and their expected performance
A little on how to improve performance through application-specific knowledge and customization of the sparse matrix representation

What's Coming
Before Spring Break
– Two application case studies from the newly published Kirk and Hwu
– Review for Midterm (week after spring break)