GPU Computing CIS-543 Lecture 10: CUDA Libraries


GPU Computing CIS-543 Lecture 10: CUDA Libraries
Dr. Muhammad Abid, DCIS, PIEAS

Introduction
CUDA libraries maximize productivity and efficiency:
- designed as high-level, highly usable APIs with standardized data formats
- all computation implemented in the library is accelerated using a GPU
- the APIs of many CUDA libraries are deliberately made similar to those of a standard library in the same domain, which keeps maintenance overhead low for software developers

Introduction

CUDA Libraries: GPU-Accelerated Libraries
- NVIDIA cuFFT: Fast Fourier Transforms
- NVIDIA cuBLAS: Linear Algebra (BLAS Library)
- CULA Tools: Linear Algebra
- MAGMA: Next-generation Linear Algebra
- IMSL Fortran Numerical Library: Mathematics and Statistics
- NVIDIA cuSPARSE: Sparse Linear Algebra
- NVIDIA CUSP: Sparse Linear Algebra and Graph Computations

CUDA Libraries
- AccelerEyes ArrayFire: Mathematics, Signal and Image Processing, and Statistics
- NVIDIA cuRAND: Random Number Generation
- NVIDIA NPP: Image and Signal Processing
- NVIDIA CUDA Math Library: Mathematics
- Thrust: Parallel Algorithms and Data Structures
- HiPLAR: Linear Algebra in R

CUDA Libraries
- Geometry Performance Primitives: Computational Geometry
- Paralution: Sparse Iterative Methods
- AmgX: Core Solvers
- cuDNN: library of primitives for deep neural networks
- FFmpeg: popular open-source multimedia framework
- cuSOLVER: a collection of dense and sparse direct solvers

CUDA Libraries
- NVBIO: High-Throughput Sequence Analysis
- and many more

A Common Library Workflow
Many CUDA libraries share concepts, features, and a common workflow when being called from a host application:
1. Create a library-specific handle that manages contextual information useful for the library's operation.
2. Allocate device memory for inputs and outputs to the library function.
3. If inputs are not already in a library-supported format, convert them to be accessible by the library.

A Common Library Workflow
4. Populate the pre-allocated device memory with inputs in a supported format.
5. Configure the library computation to be executed.
6. Execute a library call that offloads the desired computation to the GPU.
7. Retrieve the results of that computation from device memory, possibly in a library-determined format.
8. If necessary, convert the retrieved data to the application's native format.
9. Release CUDA resources.
10. Continue with the remainder of the application.
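A minimal sketch of this workflow using cuBLAS as the example library (a SAXPY, y = alpha*x + y). Error checking is omitted, and the vector length n and host arrays h_x and h_y are assumed to exist; they are illustrative names, not part of any library:

#include <cuda_runtime.h>
#include <cublas_v2.h>

// Step 1: create the library-specific handle
cublasHandle_t handle;
cublasCreate(&handle);

// Step 2: allocate device memory for inputs and outputs
float *d_x, *d_y;
cudaMalloc((void **)&d_x, n * sizeof(float));
cudaMalloc((void **)&d_y, n * sizeof(float));

// Steps 3-4: populate device memory with inputs (already dense, so no conversion needed)
cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

// Steps 5-6: configure and execute the computation y = alpha * x + y
const float alpha = 2.0f;
cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

// Steps 7-8: retrieve the result (already in the application's native format)
cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);

// Steps 9-10: release CUDA resources and continue with the application
cudaFree(d_x);
cudaFree(d_y);
cublasDestroy(handle);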

The cuSPARSE Library
- implements a wide range of general-purpose sparse linear algebra functions.
- supports a collection of dense and sparse data formats on which those functions operate.
- separates functions into levels: all Level 1 functions operate exclusively on dense and sparse vectors; all Level 2 functions operate on sparse matrices and dense vectors; all Level 3 functions operate on sparse matrices and dense matrices.

The cuSPARSE Library

cuSPARSE Data Storage Formats
Dense matrix: contains primarily non-zero values and is stored in a multi-dimensional array.
Sparse matrices and vectors: consist primarily of zero-valued entries and can be represented more compactly by storing only the non-zero values and their coordinates rather than many redundant zeros. There are many ways of representing sparse matrices, eight of which are currently supported by cuSPARSE.

cuSPARSE Data Storage Formats
Coordinate (COO): the coordinate sparse matrix format stores each non-zero value together with its row index and its column index. The point at which a coordinate-formatted matrix consumes less space than a dense matrix depends on the sparsity of the matrix, the size of the values, and the size of the type used to store their coordinates. For example, given a sparse matrix storing 32-bit floating-point values and a coordinate format that uses 32-bit integers to represent matrix coordinates, space savings are achieved when less than one third of the cells in the matrix contain non-zero values. This is because storing a non-zero entry in this particular coordinate format requires triple the space of storing only the value in a dense format.
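As a small worked example (the matrix and values are hypothetical, chosen only for illustration), a 4 x 4 matrix with five non-zeros needs three arrays of length 5 in COO format:

/* Hypothetical 4x4 sparse matrix:
 *   | 1 0 0 0 |
 *   | 0 2 3 0 |
 *   | 0 0 0 0 |
 *   | 0 4 0 5 |
 */
float cooVals[5] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f };
int   cooRows[5] = { 0,    1,    1,    3,    3    };
int   cooCols[5] = { 0,    1,    2,    1,    3    };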

cuSPARSE Data Storage Formats
Compressed Sparse Row (CSR): similar to the coordinate format, except that rather than storing the row index of every value explicitly, CSR stores an offset into the value and column arrays at which each row's values begin. When storing large matrices with many elements per row, representing each row with simply an offset and a length is clearly much more efficient than storing a row index for every value.

cuSPARSE Data Storage Formats
Compressed Sparse Row (CSR): the row-offset array R has length nRows + 1. The length of row i can be derived from the difference between the offsets of rows i+1 and i. Equivalently, the value at R[i+1] is the total number of non-zero values stored in rows 0, 1, ..., i, so R[nRows] is the total number of non-zero values in M.
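Continuing the hypothetical 4 x 4 matrix from the COO example above, the CSR representation keeps the same value and column arrays but compresses the row indices into offsets:

float csrVals[5] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f };
int   csrCols[5] = { 0, 1, 2, 1, 3 };
int   csrRows[5] = { 0, 1, 3, 3, 5 };  /* nRows + 1 entries; row i spans csrVals[csrRows[i]] .. csrVals[csrRows[i+1] - 1] */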

cuSPARSE Data Storage Formats
CSR: allocating device memory for the three CSR arrays and transferring them from the host:

float *h_csrVals;  int *h_csrCols;  int *h_csrRows;
float *d_csrVals;  int *d_csrCols;  int *d_csrRows;

cudaMalloc((void **)&d_csrVals, n_vals * sizeof(float));
cudaMalloc((void **)&d_csrCols, n_vals * sizeof(int));
cudaMalloc((void **)&d_csrRows, (n_rows + 1) * sizeof(int));

cudaMemcpy(d_csrVals, h_csrVals, n_vals * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_csrCols, h_csrCols, n_vals * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_csrRows, h_csrRows, (n_rows + 1) * sizeof(int), cudaMemcpyHostToDevice);

Format Conversion with cuSPARSE
Performing matrix-vector or matrix-matrix operations in cuSPARSE requires that the inputs and outputs be in the CSR, BSR, BSRX, or HYB format.
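Once the matrix is on the device in CSR form (as in the code above), a sparse matrix-vector multiply y = alpha*A*x + beta*y can be issued. The sketch below uses the generic cusparseSpMV API available in more recent CUDA releases rather than the older format-specific routines; the device vectors d_x and d_y and the sizes n_rows, n_cols, and n_vals are assumed to exist already:

#include <cusparse.h>

cusparseHandle_t handle;
cusparseCreate(&handle);

// Wrap the raw device arrays in cuSPARSE descriptors
cusparseSpMatDescr_t matA;
cusparseDnVecDescr_t vecX, vecY;
cusparseCreateCsr(&matA, n_rows, n_cols, n_vals,
                  d_csrRows, d_csrCols, d_csrVals,
                  CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                  CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
cusparseCreateDnVec(&vecX, n_cols, d_x, CUDA_R_32F);
cusparseCreateDnVec(&vecY, n_rows, d_y, CUDA_R_32F);

// Query and allocate the temporary workspace, then execute y = alpha*A*x + beta*y
float alpha = 1.0f, beta = 0.0f;
size_t bufferSize = 0;
void *buffer = NULL;
cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                        &alpha, matA, vecX, &beta, vecY,
                        CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize);
cudaMalloc(&buffer, bufferSize);
cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
             &alpha, matA, vecX, &beta, vecY,
             CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, buffer);

// Clean up descriptors, workspace, and the handle
cusparseDestroySpMat(matA);
cusparseDestroyDnVec(vecX);
cusparseDestroyDnVec(vecY);
cudaFree(buffer);
cusparseDestroy(handle);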

The cuBLAS Library
- a collection of linear algebra routines.
- a port of a legacy linear algebra library, the Basic Linear Algebra Subprograms (BLAS) library.
- Like BLAS, cuBLAS subroutines are split into multiple classes based on the data types on which they operate: Level 1 contains vector-only operations such as vector addition; Level 2 contains matrix-vector operations such as matrix-vector multiplication; Level 3 contains matrix-matrix operations such as matrix-matrix multiplication.
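One representative call from each level is sketched below; the device arrays d_x, d_y, d_A, d_B, d_C, the scalars alpha and beta, the dimensions m, n, k, and a valid cublasHandle_t handle are assumed to exist (all matrices column-major, with the leading dimension equal to their row count):

// Level 1: y = alpha * x + y (vector-vector)
cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

// Level 2: y = alpha * A * x + beta * y (matrix-vector, A is m x n)
cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m, d_x, 1, &beta, d_y, 1);

// Level 3: C = alpha * A * B + beta * C (matrix-matrix, A is m x k, B is k x n, C is m x n)
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);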

The cuBLAS Library
- only supports, and is optimized for, dense vector and dense matrix manipulation.
- Because the original BLAS library was written in FORTRAN, BLAS historically uses column-major array storage and one-based indexing. This contrasts with the semantics of C/C++, from which cuBLAS is called, where arrays are row-major: elements in the same row are stored adjacent to each other. For an M x N matrix, element (m, n) maps to a linear offset as follows:
  Row-major: f(m, n) = m × N + n
  Column-major: f(m, n) = n × M + m
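A quick worked check with hypothetical sizes: for an M = 3 by N = 4 matrix, element (m, n) = (1, 2) sits at offset 1 × 4 + 2 = 6 in row-major storage but at offset 2 × 3 + 1 = 7 in column-major storage, so the same pointer arithmetic cannot be reused across the two conventions.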

The cuBLAS Library
For compatibility reasons, the cuBLAS library also chooses to use column-major storage. However, cuBLAS has no control over the semantics of the C/C++ programming language in which it is built, so it must use zero-based indexing. The cuBLAS library comes with two APIs: a legacy API (cublas.h) and the current API (cublas_v2.h), which takes an explicit handle argument as used in the examples in this lecture.

Managing cuBLAS Data
All operations are done on dense cuBLAS vectors or matrices:
- use cudaMalloc() to allocate device memory.
- use cublasSetVector/cublasGetVector and cublasSetMatrix/cublasGetMatrix to transfer data between the host and the device; these routines are well optimized for transferring both strided and unstrided data.

cublasStatus_t cublasSetMatrix(int rows, int cols, int elementSize,
                               const void *hA, int lda, void *dB, int ldb);

cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);

The use of the fifth and seventh arguments might be less clear: lda and ldb specify the leading dimensions of the source matrix hA and the destination matrix dB. The leading dimension is the total number of rows in the respective (column-major) matrix. This is useful if only a submatrix of a matrix in host memory is being transferred to the GPU. In other words, if the full matrices stored at hA and dB are being transferred, lda and ldb should both equal M; if only a submatrix is being transferred, lda and ldb should be the number of rows of the corresponding full matrices. lda and ldb must always be greater than or equal to rows.
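A minimal sketch of the submatrix case, with hypothetical names: a P x Q block starting at row r0 and column c0 of a column-major M x N host matrix hA is copied into a densely packed P x Q device matrix dB:

// Source: element (r0, c0) of the full host matrix, with leading dimension M.
// Destination: packed P x Q device matrix, so its leading dimension is P.
cublasSetMatrix(P, Q, sizeof(float),
                hA + c0 * M + r0, M,   /* lda = M: rows of the full host matrix */
                dB, P);                /* ldb = P: rows of the packed device matrix */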

Managing cuBLAS Data
cublasStatus_t cublasSetVector(int n, int elemSize,
                               const void *hx, int incx, void *dx, int incy);

// Copy the first column of a column-major M x N host matrix A (M contiguous elements):
cublasSetVector(M, sizeof(float), A, 1, dV, 1);

// Copy the first row of A (N elements strided by M on the host, packed on the device):
cublasSetVector(N, sizeof(float), A, M, dV, 1);

// Copy row i of A:
cublasSetVector(N, sizeof(float), A + i, M, dV, 1);
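Results are pulled back to the host with the symmetric cublasGetVector call; in this sketch the host buffer h_result is a hypothetical name:

// Copy M contiguous elements from the device vector dV back to the host:
cublasGetVector(M, sizeof(float), dV, 1, h_result, 1);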