Lecture 13 Sparse Matrix-Vector Multiplication and CUDA Libraries


Lecture 13 Sparse Matrix-Vector Multiplication and CUDA Libraries
Kyu Ho Park, May 10, 2016
Ref: 1. David Kirk and Wen-mei Hwu, Programming Massively Parallel Processors, MK and NVIDIA. 2. Massimiliano Fatica, CUDA Libraries and CUDA FORTRAN, NVIDIA.

Sparse Matrix
Sparse matrix: a matrix in which the majority of the elements are zero.
Example (4 x 4):
row 0: 1 0 2 0
row 1: 0 0 0 0
row 2: 0 3 4 5
row 3: 6 0 0 7
How do we represent a sparse matrix?

Compressed Sparse Row (CSR) storage format
CSR consists of three arrays: data[], col_index[], and row_ptr[].
data[]: stores all the non-zero values in the sparse matrix.
col_index[]: stores the column index of every non-zero value in the sparse matrix.
row_ptr[]: stores the starting location of each row; a final entry holds the total number of non-zeros.

CSR
For the 4 x 4 example matrix (row 1 is empty, so row_ptr[1] = row_ptr[2]):
data[]      = {1, 2, 3, 4, 5, 6, 7}
col_index[] = {0, 2, 1, 2, 3, 0, 3}
row_ptr[]   = {0, 2, 2, 5, 7}
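As a concrete illustration, a small host-side helper can build these three arrays from a dense matrix. The sketch below is not part of the lecture; the name dense_to_csr and the assumption that data[], col_index[], and row_ptr[] are pre-allocated (the first two large enough for every non-zero, the last with num_rows + 1 entries) are illustrative.

/* Sketch: convert a dense, row-major num_rows x num_cols matrix a[] to CSR.
   Returns the number of non-zeros written. */
int dense_to_csr(const float *a, int num_rows, int num_cols,
                 float *data, int *col_index, int *row_ptr)
{
    int nnz = 0;
    for (int row = 0; row < num_rows; row++) {
        row_ptr[row] = nnz;                      /* where this row starts */
        for (int col = 0; col < num_cols; col++) {
            float v = a[row * num_cols + col];
            if (v != 0.0f) {
                data[nnz] = v;
                col_index[nnz] = col;
                nnz++;
            }
        }
    }
    row_ptr[num_rows] = nnz;                     /* one-past-the-end marker */
    return nnz;
}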

Linear System
A linear system of N equations in N variables can be expressed in the form AX + Y = 0, where A is an N x N matrix, X is a vector of N variables, and Y is a vector of N constant values.

AX + Y = 0
The solutions:
1. Direct method: calculate X = A^(-1) x (-Y).
2. Iterative method, e.g. the conjugate gradient method:
(1) Guess a solution X, calculate AX + Y, and check whether the result is close to the 0 vector.
(2) If not, modify X using a gradient vector formula.
(3) The most time-consuming part of this iterative approach is the evaluation of AX + Y, which is a sparse matrix-vector multiplication.
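As an illustration of steps (1)-(2), the sketch below performs a steepest-descent update for Ax = -Y; the conjugate gradient method adds a search-direction recurrence on top of this residual-based step, which is omitted here. It assumes A is symmetric positive definite and stored in CSR, and that spmv_csr() overwrites its output with A times the input vector. All names are illustrative, not the lecture's code.

#include <stdlib.h>

/* assumed helper: y = A*x with A in CSR (overwrites y) */
void spmv_csr(int n, const float *data, const int *col_index,
              const int *row_ptr, const float *x, float *y);

void solve_sketch(int n, const float *data, const int *col_index,
                  const int *row_ptr, const float *yvec, float *x,
                  int max_iter, float tol)
{
    float *r  = (float *)malloc(n * sizeof(float));   /* residual r = -Y - A*X */
    float *Ar = (float *)malloc(n * sizeof(float));
    for (int it = 0; it < max_iter; it++) {
        spmv_csr(n, data, col_index, row_ptr, x, Ar);          /* Ar = A*X */
        float rr = 0;
        for (int i = 0; i < n; i++) {                          /* step (1) */
            r[i] = -yvec[i] - Ar[i];
            rr  += r[i] * r[i];
        }
        if (rr < tol * tol) break;                   /* AX + Y is close to 0 */
        spmv_csr(n, data, col_index, row_ptr, r, Ar);          /* Ar = A*r */
        float rAr = 0;
        for (int i = 0; i < n; i++) rAr += r[i] * Ar[i];
        float alpha = rr / rAr;                      /* steepest-descent step */
        for (int i = 0; i < n; i++) x[i] += alpha * r[i];      /* step (2) */
    }
    free(r);
    free(Ar);
}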

CSR (recall): data[] = {1, 2, 3, 4, 5, 6, 7}, col_index[] = {0, 2, 1, 2, 3, 0, 3}, row_ptr[] = {0, 2, 2, 5, 7}.

SpMV [1]
A sequential loop that implements SpMV (y = y + A*x, with A in CSR):

for (int row = 0; row < num_rows; row++) {
    float dop = 0;
    int r_start = row_ptr[row];
    int r_end   = row_ptr[row + 1];
    for (int i = r_start; i < r_end; i++) {
        dop += data[i] * x[col_index[i]];   /* dot product of row 'row' with x */
    }
    y[row] += dop;
}
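A minimal driver for this loop, using the CSR arrays of the example matrix and an all-ones x chosen purely for illustration:

#include <stdio.h>

int main(void)
{
    /* CSR arrays of the 4 x 4 example matrix (row 1 is empty) */
    float data[7]      = {1, 2, 3, 4, 5, 6, 7};
    int   col_index[7] = {0, 2, 1, 2, 3, 0, 3};
    int   row_ptr[5]   = {0, 2, 2, 5, 7};

    int   num_rows = 4;
    float x[4] = {1, 1, 1, 1};                   /* dense input vector */
    float y[4] = {0, 0, 0, 0};                   /* accumulated output */

    for (int row = 0; row < num_rows; row++) {   /* the loop from above */
        float dop = 0;
        int r_start = row_ptr[row];
        int r_end   = row_ptr[row + 1];
        for (int i = r_start; i < r_end; i++)
            dop += data[i] * x[col_index[i]];
        y[row] += dop;
    }

    for (int row = 0; row < num_rows; row++)
        printf("y[%d] = %.1f\n", row, y[row]);   /* expected: 3, 0, 12, 13 */
    return 0;
}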

Parallel SpMV in CUDA
Assign one thread to each row of the matrix: thread 0 computes the dot product of row 0 (1 0 2 0) with x, thread 1 handles row 1, and so on.

Parallel SpMV in CUDA [1]

__global__ void SPMV(int num_rows, float *data, int *col_index,
                     int *row_ptr, float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per row */
    if (row < num_rows) {
        float dop = 0;
        int r_start = row_ptr[row];
        int r_end   = row_ptr[row + 1];
        for (int i = r_start; i < r_end; i++) {
            dop += data[i] * x[col_index[i]];
        }
        y[row] += dop;
    }
}
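A host-side launch sequence for this kernel might look as follows. This is a sketch, not the lecture's code: error checking is omitted, the matrix is assumed square so x and y have num_rows entries, and the kernel above is assumed to be in the same .cu file.

#include <cuda_runtime.h>

void spmv_on_gpu(int num_rows, int nnz,
                 const float *h_data, const int *h_col_index,
                 const int *h_row_ptr, const float *h_x, float *h_y)
{
    float *d_data, *d_x, *d_y;
    int *d_col_index, *d_row_ptr;

    cudaMalloc(&d_data,      nnz * sizeof(float));
    cudaMalloc(&d_col_index, nnz * sizeof(int));
    cudaMalloc(&d_row_ptr,   (num_rows + 1) * sizeof(int));
    cudaMalloc(&d_x,         num_rows * sizeof(float));
    cudaMalloc(&d_y,         num_rows * sizeof(float));

    cudaMemcpy(d_data, h_data, nnz * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_index, h_col_index, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_row_ptr, h_row_ptr, (num_rows + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, num_rows * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, num_rows * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;                              /* one thread per row */
    int grid  = (num_rows + block - 1) / block;
    SPMV<<<grid, block>>>(num_rows, d_data, d_col_index, d_row_ptr, d_x, d_y);

    cudaMemcpy(h_y, d_y, num_rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_col_index); cudaFree(d_row_ptr);
    cudaFree(d_x); cudaFree(d_y);
}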

CSR (recall, as the starting point for ELL): data[] = {1, 2, 3, 4, 5, 6, 7}, col_index[] = {0, 2, 1, 2, 3, 0, 3}, row_ptr[] = {0, 2, 2, 5, 7}.

Padding and Transposition: ELL storage format
CSR representation (data / col_index per row), row_ptr = [0, 2, 2, 5, 7]:
row 0: 1 2      0 2
row 1: (empty)
row 2: 3 4 5    1 2 3
row 3: 6 7      0 3
CSR with padding (every row padded to the length of the longest row, * = padding):
row 0: 1 2 *    0 2 *
row 1: * * *    * * *
row 2: 3 4 5    1 2 3
row 3: 6 7 *    0 3 *

ELL Storage Format
Row-major (padded rows of data):
row 0: 1 2 *
row 1: * * *
row 2: 3 4 5
row 3: 6 7 *
Column-major (after transposition): 1 * 3 6 | 2 * 4 7 | * * 5 *
The transposition makes consecutive threads (one per row) access consecutive memory locations.

ELL format [1]
Transposed (column-major) arrays, with * marking padded entries:
data[]      = 1 * 3 6 | 2 * 4 7 | * * 5 *
col_index[] = 0 * 1 0 | 2 * 2 3 | * * 3 *

__global__ void SPMV_ELL(int num_rows, float *data, int *col_index,
                         int num_elem, float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dop = 0;
        for (int i = 0; i < num_elem; i++) {
            /* element i of this row is at offset row + i*num_rows
               because of the column-major (transposed) layout */
            dop += data[row + i*num_rows] * x[col_index[row + i*num_rows]];
        }
        y[row] = dop;
    }
}
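The padded, transposed arrays that this kernel expects can be produced on the host with a sketch like the one below. The function name is an assumption, and padded slots are filled with value 0 and column index 0 so that the extra iterations add nothing to the dot product.

/* Sketch: build column-major ELL arrays from CSR. num_elem is the maximum
   number of non-zeros in any row; ell_data and ell_col_index must hold
   num_rows * num_elem entries. */
void csr_to_ell(int num_rows, const float *data, const int *col_index,
                const int *row_ptr, int num_elem,
                float *ell_data, int *ell_col_index)
{
    for (int row = 0; row < num_rows; row++) {
        int len = row_ptr[row + 1] - row_ptr[row];
        for (int i = 0; i < num_elem; i++) {
            int dst = row + i * num_rows;        /* column-major position */
            if (i < len) {
                ell_data[dst]      = data[row_ptr[row] + i];
                ell_col_index[dst] = col_index[row_ptr[row] + i];
            } else {
                ell_data[dst]      = 0.0f;       /* padding */
                ell_col_index[dst] = 0;
            }
        }
    }
}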

Hybrid Method to Control Padding
Excessive padding in ELL is caused by one or a small number of rows having an exceedingly large number of non-zero elements.
COO (coordinate) format: provides a mechanism to take some elements away from these rows.

COO format
Each non-zero value is stored together with both its row index and its column index. For the example matrix:
data[]      = {1, 2, 3, 4, 5, 6, 7}
col_index[] = {0, 2, 1, 2, 3, 0, 3}
row_index[] = {0, 0, 2, 2, 2, 3, 3}

ELL + COO Hybrid Format
Take element 5 out of row 2 (the longest row) and store it in COO format; the remaining rows then only need to be padded to length 2.
ELL format (column-major data): 1 * 3 6 | 2 * 4 7
row 0: 1 2
row 1: * *
row 2: 3 4
row 3: 6 7
COO format (the element taken out):
data : 5    col_index : 3    row_index : 2
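In the hybrid scheme the ELL kernel computes the regular part of y, and the few COO leftovers are then folded in, for example sequentially on the host. The function and parameter names below (num_coo, coo_data, and so on) are assumptions for this sketch.

/* Sketch: add the COO part of the matrix into y after the ELL kernel has run. */
void add_coo_part(int num_coo, const float *coo_data, const int *coo_row_index,
                  const int *coo_col_index, const float *x, float *y)
{
    for (int i = 0; i < num_coo; i++)
        y[coo_row_index[i]] += coo_data[i] * x[coo_col_index[i]];
}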

JDS (Jagged Diagonal Storage) format
COO helps to regulate the amount of padding in an ELL representation; we can further reduce the padding overhead by sorting and partitioning the rows of a sparse matrix.
Sort the rows according to their length, from the longest to the shortest.

Sorting and Partitioning for Regularization: JDS (Jagged Diagonal Storage) format

Sorting and Partitioning for Regularization: JDS (Jagged Diagonal Storage) format
CSR (rows in original order, values only):
row 0: 1 2 0
row 1: 0 0 0
row 2: 3 4 5
row 3: 6 7 0
JDS (rows sorted from longest to shortest):
3 4 5
1 2 0
6 7 0
0 0 0
Non-zero values:  data[7] = {3, 4, 5, 1, 2, 6, 7}
Column indices:   col_index[7] = {0, 1, 2, 0, 1, 0, 1}
JDS row indices:  jds_row_index[4] = {2, 0, 3, 1}
Section pointers: jds_section_pointer[4] = {0, 3, 7, 7}
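The jds_row_index permutation can be produced on the host by sorting the rows by their length. The sketch below uses a simple selection sort, which is adequate for an illustration; the names are assumptions.

#include <stdlib.h>

/* Sketch: fill jds_row_index with row numbers ordered from the longest row
   to the shortest, using the CSR row_ptr array to obtain row lengths. */
void build_jds_row_index(int num_rows, const int *row_ptr, int *jds_row_index)
{
    int *len = (int *)malloc(num_rows * sizeof(int));
    for (int r = 0; r < num_rows; r++) {
        len[r] = row_ptr[r + 1] - row_ptr[r];
        jds_row_index[r] = r;
    }
    for (int i = 0; i < num_rows; i++)           /* selection sort, descending */
        for (int j = i + 1; j < num_rows; j++)
            if (len[jds_row_index[j]] > len[jds_row_index[i]]) {
                int t = jds_row_index[i];
                jds_row_index[i] = jds_row_index[j];
                jds_row_index[j] = t;
            }
    free(len);
}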

CUDA Libraries [2]
Applications sit on top of three kinds of components: third-party libraries, the NVIDIA libraries (CUFFT, CUBLAS, CUSPARSE, Libm, NPP, Thrust), and the CUDA C Runtime; the libraries themselves are built on the CUDA C Runtime.

CUDA Libraries[2]

NPP

Thrust

CUFFT[2]

CUFFT[2]

CUFFT

CUBLAS

CUSPARSE Library
Operations are grouped into three levels (plus format-conversion routines):
Level 1: operations between a vector in sparse format and a vector in dense format. Functions: axpyi (y = y + a*x), doti (z = y^T * x), etc.
Level 2: operations between a matrix in sparse format and a vector in dense format. Functions: mv (y = a*A*x + b*y), sv (solves a sparse triangular linear system).
Level 3: operations between a matrix in sparse format and a set of dense vectors (a dense matrix). Functions: mm (C = a*A*B + b*C), sm (solves a sparse triangular linear system with multiple right-hand sides).

CUSPARSE
The cuSPARSE functions execute on the device. Operation naming format:
cusparse<T>[<matrix data format>]<operation>[<output matrix format>]
where T = {S, D, C, Z, X} // S: float, D: double, C: cuComplex, Z: cuDoubleComplex, X: a generic type
matrix data format = {dense, COO, CSR, CSC, HYB, ...}
operation = {axpyi, doti, dotci, gthr, gthrz, roti, sctr} (level 1), {mv, sv} (level 2), and {mm, sm} (level 3).
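Putting the naming convention together, cusparseScsrmv = cusparse + S (float) + csr (matrix format) + mv (operation). Below is a minimal sketch using the legacy cuSPARSE API that shipped with CUDA 7.x around the time of this lecture; the csrmv entry points were later replaced by a generic SpMV interface. Device arrays are assumed to be already allocated and filled, and error checking is omitted.

#include <cusparse_v2.h>

/* Sketch: y = alpha*A*x + beta*y with A (m x n, nnz non-zeros) in CSR. */
void csr_mv(int m, int n, int nnz,
            const float *d_val, const int *d_row_ptr, const int *d_col_ind,
            const float *d_x, float *d_y)
{
    cusparseHandle_t handle;
    cusparseMatDescr_t descr;
    cusparseCreate(&handle);
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    float alpha = 1.0f, beta = 0.0f;
    /* cusparse + S + csr + mv */
    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   d_val, d_row_ptr, d_col_ind, d_x, &beta, d_y);

    cusparseDestroyMatDescr(descr);
    cusparseDestroy(handle);
}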

CUSPARSE[2]

Libm[2]

CURAND

CUBLAS[2]

CUBLAS[2]

Reading and Presentation List
1. MRI and CT Processing with MATLAB and CUDA: 강은희, 이주영
2. Matrix Multiplication with CUDA, Robert Hochberg, 2012: 박겨레
3. Optimizing Matrix Transpose in CUDA, Greg Ruetsch and Paulius Micikevicius, 2010: 박일우
4. NVIDIA Profiler User's Guide: 노성철
5. Monte Carlo Methods in CUDA: 조정석
6. Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA: 박주연
7. Deep Learning and Multi-GPU: 박종찬
8. Image Processing with CUDA, Jia Tse, 2006: 최우석, 김치현
9. Image Convolution with CUDA, Victor Podlozhnyuk, 2007: Homework #4

Second Term Reading List
10. Parallel Genetic Algorithm on CUDA Architecture, Petr Pospichal, Jiri Jaros, and Josef Schwarz, 2010: 양은주
11. Texture Memory, Chap. 7 of CUDA by Example: 전민수
12. Atomics, Chap. 9 of CUDA by Example: 이상록
13. Sparse Matrix-Vector Product: 장형욱
14. Solving Ordinary Differential Equations on GPUs: 윤종민
15. Fast Fourier Transform on GPUs: 이한섭
16. Building an Efficient Hash Table on GPU.
17. Efficient CUDA Algorithms for the Maximum Network Flow Problem: 채종욱