GPU Computing CIS-543 Lecture 10: CUDA Libraries Dr. Muhammad Abid, DCIS, PIEAS
Introduction
CUDA libraries aim to maximize productivity and efficiency:
- They are designed as high-level, highly usable APIs with standardized data formats.
- All computation implemented in a library is accelerated using a GPU.
- The APIs of many CUDA libraries are deliberately made similar to those of a standard library in the same domain.
- Reusing a library implementation keeps maintenance overhead low for software developers.
CUDA Libraries
GPU-accelerated libraries:
- NVIDIA cuFFT: Fast Fourier Transforms
- NVIDIA cuBLAS: Linear Algebra (BLAS Library)
- CULA Tools: Linear Algebra
- MAGMA: Next-Generation Linear Algebra
- IMSL Fortran Numerical Library: Mathematics and Statistics
- NVIDIA cuSPARSE: Sparse Linear Algebra
- NVIDIA CUSP: Sparse Linear Algebra and Graph Computations
- AccelerEyes ArrayFire: Mathematics, Signal and Image Processing, and Statistics
- NVIDIA cuRAND: Random Number Generation
- NVIDIA NPP: Image and Signal Processing
- NVIDIA CUDA Math Library: Mathematics
- Thrust: Parallel Algorithms and Data Structures
- HiPLAR: Linear Algebra in R
- Geometry Performance Primitives: Computational Geometry
- Paralution: Sparse Iterative Methods
- AmgX: Core Solvers
- cuDNN: Library of Primitives for Deep Neural Networks
- FFmpeg: Popular Open-Source Multimedia Framework
- cuSOLVER: A Collection of Dense and Sparse Direct Solvers
- NVBIO: High-Throughput Sequence Analysis
- and others
A Common Library Workflow
Many CUDA libraries share concepts, features, and a common workflow when called from a host application:
1. Create a library-specific handle that manages contextual information useful for the library's operation.
2. Allocate device memory for the inputs and outputs of the library function.
3. If the inputs are not already in a library-supported format, convert them so that the library can access them.
4. Populate the pre-allocated device memory with the inputs in a supported format.
5. Configure the library computation to be executed.
6. Execute a library call that offloads the desired computation to the GPU.
7. Retrieve the results of that computation from device memory, possibly in a library-determined format.
8. If necessary, convert the retrieved data to the application's native format.
9. Release CUDA resources.
10. Continue with the remainder of the application.
The cuSPARSE Library
- Implements a wide range of general-purpose sparse linear algebra functions.
- Supports a collection of dense and sparse data formats on which those functions operate.
- Separates its functions into levels:
  - Level 1 functions operate exclusively on dense and sparse vectors.
  - Level 2 functions operate on sparse matrices and dense vectors.
  - Level 3 functions operate on sparse matrices and dense matrices.
cuSPARSE Data Storage Formats
- Dense matrix: contains primarily non-zero values and is stored in a multi-dimensional array.
- Sparse matrices and vectors: consist primarily of zero-valued entries, so they can be represented more compactly by storing only the non-zero values and their coordinates rather than many redundant zeros.
- There are many ways of representing sparse matrices, eight of which are currently supported by cuSPARSE.
cuSPARSE Data Storage Formats
Coordinate (COO): the coordinate sparse matrix format stores each non-zero value together with its row index and its column index.
- The point at which a coordinate-formatted matrix consumes less space than a dense matrix depends on the sparsity of the matrix, the size of the values, and the size of the type used to store their coordinates.
- For example, given a sparse matrix storing 32-bit floating-point values and a coordinate format that uses 32-bit integers for the matrix coordinates, space is saved when fewer than one third of the cells in the matrix contain non-zero values.
- This holds because storing a non-zero entry in this particular coordinate format (value, row index, column index) takes three times the space of storing only the value in a dense format.
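The one-third break-even point can be checked with a little arithmetic. The following is a minimal host-side sketch in C; the function names `dense_bytes`, `coo_bytes`, and `coo_saves_space` are hypothetical helpers introduced here for illustration and are not part of cuSPARSE.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to store a rows x cols matrix densely (every cell stored). */
size_t dense_bytes(size_t rows, size_t cols, size_t value_size) {
    return rows * cols * value_size;
}

/* Bytes needed in COO: each non-zero stores its value, row index, and column index. */
size_t coo_bytes(size_t nnz, size_t value_size, size_t index_size) {
    return nnz * (value_size + 2 * index_size);
}

/* Returns 1 if the COO representation is strictly smaller than the dense one. */
int coo_saves_space(size_t rows, size_t cols, size_t nnz,
                    size_t value_size, size_t index_size) {
    assert(nnz <= rows * cols); /* cannot have more non-zeros than cells */
    return coo_bytes(nnz, value_size, index_size) <
           dense_bytes(rows, cols, value_size);
}
```

With 4-byte values and 4-byte indices, a 30 x 30 matrix (900 cells, 3600 bytes dense) breaks even at 300 non-zeros: `coo_saves_space(30, 30, 299, 4, 4)` returns 1, while `coo_saves_space(30, 30, 300, 4, 4)` returns 0.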
cuSPARSE Data Storage Formats
Compressed Sparse Row (CSR): similar to the coordinate format, except for how row information is stored.
- Rather than storing the row index of each value explicitly, CSR stores, for each row, an offset to where the values belonging to that row begin in the value and column arrays.
- When storing large matrices with many elements per row, representing each row with just an offset and a length is much more efficient than storing a row index for every value.
cuSPARSE Data Storage Formats
Compressed Sparse Row (CSR): the row-offset array R has length nRows+1.
- The length of row i is the difference between the offsets of rows i+1 and i, that is, R[i+1] - R[i].
- Equivalently, the value at R[i+1] is the total number of non-zero values stored in rows 0, 1, …, i.
- The last entry, R[nRows], is the total number of non-zero values in the matrix M.
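To make the role of the row-offset array concrete, here is a minimal host-side sketch in C that builds the three CSR arrays from a small dense matrix. It assumes a row-major dense input and zero-based offsets; `dense_to_csr` and `csr_row_length` are hypothetical helper names, not cuSPARSE functions.

```c
#include <assert.h>

/* Build CSR arrays from a row-major dense matrix.
 * vals and cols need room for every non-zero; rows needs n_rows + 1 entries.
 * Returns the number of non-zero values. */
int dense_to_csr(const float *dense, int n_rows, int n_cols,
                 float *vals, int *cols, int *rows) {
    int nnz = 0;
    for (int i = 0; i < n_rows; i++) {
        rows[i] = nnz; /* offset where row i starts in vals/cols */
        for (int j = 0; j < n_cols; j++) {
            float v = dense[i * n_cols + j];
            if (v != 0.0f) {
                vals[nnz] = v;
                cols[nnz] = j;
                nnz++;
            }
        }
    }
    rows[n_rows] = nnz; /* last entry: total number of non-zeros */
    return nnz;
}

/* Number of non-zeros in row i, derived from adjacent offsets. */
int csr_row_length(const int *rows, int n_rows, int i) {
    assert(i >= 0 && i < n_rows);
    return rows[i + 1] - rows[i];
}
```

For the 3 x 4 matrix with rows (1 0 0 2), (0 0 0 0), (0 3 4 0), this produces values {1, 2, 3, 4}, column indices {0, 3, 1, 2}, and row offsets {0, 2, 2, 4}, so the empty middle row has length 2 - 2 = 0.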
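cuSPARSE Data Storage Formats
CSR: allocating device memory for the three CSR arrays and copying them from the host (n_vals is the number of non-zero values, n_rows the number of rows).
    // Host arrays, assumed to be already allocated and populated.
    float *h_csrVals;
    int *h_csrCols;
    int *h_csrRows;
    // Device arrays.
    float *d_csrVals;
    int *d_csrCols;
    int *d_csrRows;
    // Values and column indices have one entry per non-zero; row offsets have n_rows + 1 entries.
    cudaMalloc((void **)&d_csrVals, n_vals * sizeof(float));
    cudaMalloc((void **)&d_csrCols, n_vals * sizeof(int));
    cudaMalloc((void **)&d_csrRows, (n_rows + 1) * sizeof(int));
    cudaMemcpy(d_csrVals, h_csrVals, n_vals * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_csrCols, h_csrCols, n_vals * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_csrRows, h_csrRows, (n_rows + 1) * sizeof(int), cudaMemcpyHostToDevice);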
Formatting Conversion with cuSPARSE
Performing matrix-vector or matrix-matrix operations in cuSPARSE requires that the inputs and outputs be in CSR, BSR, BSRX, or HYB format.
THE cuBLAS LIBRARY
- A collection of linear algebra routines.
- A port of a legacy linear algebra library, the Basic Linear Algebra Subprograms (BLAS) library.
- Like BLAS, cuBLAS subroutines are split into multiple classes based on the data types on which they operate:
  - Level 1 contains vector-only operations, such as vector addition.
  - Level 2 contains matrix-vector operations, such as matrix-vector multiplication.
  - Level 3 contains matrix-matrix operations, such as matrix multiplication.
THE cuBLAS LIBRARY
- cuBLAS only supports, and is optimized for, dense vector and dense matrix manipulation.
- Because the original BLAS library was written in FORTRAN, BLAS historically uses column-major array storage and one-based indexing.
- This contrasts with the semantics of C/C++, from which cuBLAS is called: C/C++ arrays are row-major, meaning that elements in the same row are stored adjacent to each other.
- For an M × N matrix (M rows, N columns), the linear offset of element (m, n) is:
  - Row-major: f(m, n) = m × N + n
  - Column-major: f(m, n) = n × M + m
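The two indexing formulas can be written directly as code. The sketch below, in host-side C, uses hypothetical helper names (`row_major_index`, `col_major_index`, `row_to_col_major`) and zero-based indices; it illustrates one way a row-major C array could be repacked into column-major order before being handed to a column-major library.

```c
#include <assert.h>

/* Linear offset of element (m, n) in an M x N row-major array. */
int row_major_index(int m, int n, int M, int N) {
    assert(m >= 0 && m < M && n >= 0 && n < N);
    return m * N + n;
}

/* Linear offset of element (m, n) in an M x N column-major array. */
int col_major_index(int m, int n, int M, int N) {
    assert(m >= 0 && m < M && n >= 0 && n < N);
    return n * M + m;
}

/* Repack an M x N row-major array into column-major order. */
void row_to_col_major(const float *src, float *dst, int M, int N) {
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            dst[col_major_index(m, n, M, N)] = src[row_major_index(m, n, M, N)];
        }
    }
}
```

For a 2 x 3 matrix stored row-major as {1, 2, 3, 4, 5, 6}, the column-major layout is {1, 4, 2, 5, 3, 6}: each column's elements become adjacent in memory.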
THE cuBLAS LIBRARY
- For compatibility reasons, the cuBLAS library also uses column-major storage.
- However, the cuBLAS library has no control over the semantics of the C/C++ programming language in which it is built, so it must use zero-based indexing.
- The cuBLAS library comes with two APIs: the legacy API (cublas.h) and the current API (cublas_v2.h).
Managing cuBLAS Data
- All operations are performed on dense cuBLAS vectors or matrices.
- Use cudaMalloc() to allocate device memory.
- Use cublasSetVector/cublasGetVector and cublasSetMatrix/cublasGetMatrix to transfer data between the host and the device. These functions are well optimized for transferring both strided and unstrided data.
    cublasStatus_t cublasSetMatrix(int rows, int cols, int elementSize, const void *hA, int lda, void *dB, int ldb);
    cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);
- The use of the fifth and seventh arguments might be less clear: lda and ldb specify the leading dimension of the source matrix A and the destination matrix B.
- Because storage is column-major, the leading dimension is the total number of rows in the respective full matrix, that is, the distance in elements between the starts of consecutive columns.
- This is useful when only a submatrix of a matrix in host memory is being transferred to the GPU. If the full matrices stored at A and B are being transferred, lda and ldb should both equal M. If only a submatrix of those matrices is being transferred, lda and ldb should be the number of rows of the respective full matrix.
- lda and ldb must always be greater than or equal to rows.
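The effect of the leading dimension is easiest to see in a host-only emulation. The sketch below is plain C, not cuBLAS: `copy_matrix_colmajor` is a hypothetical stand-in that mirrors the argument order of cublasSetMatrix and copies between two host buffers, under the assumption that both are column-major.

```c
#include <assert.h>
#include <string.h>

/* Host-only illustration of a column-major copy with leading dimensions.
 * Copies a rows x cols block from A (leading dimension lda) to B (leading
 * dimension ldb). This is NOT the cuBLAS implementation. */
void copy_matrix_colmajor(int rows, int cols, int elem_size,
                          const void *A, int lda, void *B, int ldb) {
    assert(lda >= rows && ldb >= rows);
    const char *src = (const char *)A;
    char *dst = (char *)B;
    for (int j = 0; j < cols; j++) {
        /* Column j starts j * ld elements after the start of the block. */
        memcpy(dst + (size_t)j * ldb * elem_size,
               src + (size_t)j * lda * elem_size,
               (size_t)rows * elem_size);
    }
}
```

To copy the 2 x 2 submatrix that starts at row 1, column 1 of a 4 x 3 column-major matrix A into a tightly packed 2 x 2 buffer B, one would call `copy_matrix_colmajor(2, 2, sizeof(float), A + 1 * 4 + 1, 4, B, 2)`: lda is 4 (rows of the full source matrix) while ldb is 2 (rows of the destination).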
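Managing cuBLAS Data
    cublasStatus_t cublasSetVector(int n, int elemSize, const void *hx, int incx, void *dy, int incy);
- incx and incy are the strides, in elements, between consecutive entries of the source and destination vectors.
- Given a column-major M × N matrix A on the host and a device vector dV:
    cublasSetVector(M, sizeof(float), A, 1, dV, 1);      // copies the first column of A (M contiguous elements)
    cublasSetVector(N, sizeof(float), A, M, dV, 1);      // copies the first row of A (stride M between elements)
    cublasSetVector(N, sizeof(float), A + i, M, dV, 1);  // copies row i of A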
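The strided calls above can be mimicked on the host to see which elements are selected. The following plain C sketch is not cuBLAS: `copy_vector_strided` is a hypothetical stand-in that mirrors the argument order of cublasSetVector, copies between host buffers, and assumes positive strides.

```c
#include <assert.h>
#include <string.h>

/* Host-only illustration of a strided vector copy: element k of the source
 * is read at offset k * incx and written at offset k * incy (in elements).
 * This is NOT the cuBLAS implementation. */
void copy_vector_strided(int n, int elem_size,
                         const void *x, int incx, void *y, int incy) {
    assert(incx > 0 && incy > 0); /* this sketch handles positive strides only */
    const char *src = (const char *)x;
    char *dst = (char *)y;
    for (int k = 0; k < n; k++) {
        memcpy(dst + (size_t)k * incy * elem_size,
               src + (size_t)k * incx * elem_size,
               (size_t)elem_size);
    }
}
```

For a 3 x 4 column-major matrix A (M = 3, N = 4), `copy_vector_strided(4, sizeof(float), A + 1, 3, v, 1)` gathers row 1 into a contiguous vector v, and `copy_vector_strided(3, sizeof(float), A, 1, v, 1)` copies column 0.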