© NVIDIA Corporation 2013 CUDA Libraries. © NVIDIA Corporation 2013 Why Use Library No need to reprogram Save time Less bug Better Performance = FUN.

Slides:



Advertisements
Similar presentations
Complete Unified Device Architecture A Highly Scalable Parallel Programming Framework Submitted in partial fulfillment of the requirements for the Maryland.
Advertisements

GPU Computing with OpenACC Directives. subroutine saxpy(n, a, x, y) real :: x(:), y(:), a integer :: n, i $!acc kernels do i=1,n y(i) = a*x(i)+y(i) enddo.
Yafeng Yin, Lei Zhou, Hong Man 07/21/2010
Introduction to the CUDA Platform
GPU Programming using BU Shared Computing Cluster
Computational Physics Linear Algebra Dr. Guy Tel-Zur Sunset in Caruaru by Jaime JaimeJunior. publicdomainpictures.netVersion , 14:00.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011 GPUMemories.ppt GPU Memories These notes will introduce: The basic memory hierarchy.
1 ITCS 5/4145 Parallel computing, B. Wilkinson, April 11, CUDAMultiDimBlocks.ppt CUDA Grids, Blocks, and Threads These notes will introduce: One.
GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.
HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.
1 A Common Application Platform (CAP) for SURAgrid -Mahantesh Halappanavar, John-Paul Robinson, Enis Afgane, Mary Fran Yafchalk and Purushotham Bangalore.
University of Michigan Electrical Engineering and Computer Science Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems.
OpenFOAM on a GPU-based Heterogeneous Cluster
Using CUDA Libraries with OpenACC. 3 Ways to Accelerate Applications Applications Libraries “Drop-in” Acceleration Programming Languages OpenACC Directives.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Parallelization and CUDA libraries Lei Zhou, Yafeng Yin, Hong Man.
HPCC Mid-Morning Break Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery Introduction to the new GPU (GFX) cluster.
Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.
An Introduction to Programming with CUDA Paul Richmond
Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications Published in: Cluster.
Upcrc.illinois.edu OpenMP Lab Introduction. Compiling for OpenMP Open project Properties dialog box Select OpenMP Support from C/C++ -> Language.
CS 591x – Cluster Computing and Programming Parallel Computers Parallel Libraries.
Performance and Energy Efficiency of GPUs and FPGAs
CUDA Linear Algebra Library and Next Generation
1 Intel Mathematics Kernel Library (MKL) Quickstart COLA Lab, Department of Mathematics, Nat’l Taiwan University 2010/05/11.
Implementation of Parallel Processing Techniques on Graphical Processing Units Brad Baker, Wayne Haney, Dr. Charles Choi.
Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.
Implementing a Speech Recognition System on a GPU using CUDA
CS179: GPU Programming Lecture 11: Lab 5 Recitation.
GPU Architecture and Programming
Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD September 7, 2007 PaCT-2007, Pereslavl-Zalessky.
IIIT Hyderabad Scalable Clustering using Multiple GPUs K Wasif Mohiuddin P J Narayanan Center for Visual Information Technology International Institute.
CUDA-based Volume Rendering in IGT Nobuhiko Hata Benjamin Grauer.
Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.
Jie Chen. 30 Multi-Processors each contains 8 cores at 1.4 GHz 4GB GDDR3 memory offers ~100GB/s memory bandwidth.
GPUs: Overview of Architecture and Programming Options Lee Barford firstname dot lastname at gmail dot com.
1. 2 Define the purpose of MKL Upon completion of this module, you will be able to: Identify and discuss MKL contents Describe the MKL EnvironmentDiscuss.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
 Genetic Algorithms  A class of evolutionary algorithms  Efficiently solves optimization tasks  Potential Applications in many fields  Challenges.
Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.
CUDA Basics. Overview What is CUDA? Data Parallelism Host-Device model Thread execution Matrix-multiplication.
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA
Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Present by Chunyi Victor W Lee, Changkyu Kim, Jatin Chhugani,
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.
Programming Massively Parallel Graphics Multiprocessors using CUDA Final Project Amirhassan Asgari Kamiabad
GPU Performance Optimisation Alan Gray EPCC The University of Edinburgh.
GPU VSIPL: Core and Beyond Andrew Kerr 1, Dan Campbell 2, and Mark Richards 1 1 Georgia Institute of Technology 2 Georgia Tech Research Institute.
Parallel Programming & Cluster Computing Linear Algebra Henry Neeman, University of Oklahoma Paul Gray, University of Northern Iowa SC08 Education Program’s.
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore CPS5401 Fall 2013 svmoore.pbworks.com November 12, 2012.
My Coordinates Office EM G.27 contact time:
The Library Approach to GPU Computations of Initial Value Problems Dave Yuen University of Minnesota, U.S.A. with Larry Hanyk and Radek Matyska Charles.
TEMPLATE DESIGN © H. Che 2, E. D’Azevedo 1, M. Sekachev 3, K. Wong 3 1 Oak Ridge National Laboratory, 2 Chinese University.
Martin Kruliš by Martin Kruliš (v1.0)1.
Computer Engg, IIT(BHU)
Analysis of Sparse Convolutional Neural Networks
A survey of Exascale Linear Algebra Libraries for Data Assimilation
CUDA Interoperability with Graphical Environments
Lecture 13 Sparse Matrix-Vector Multiplication and CUDA Libraries
GPU Computing CIS-543 Lecture 10: CUDA Libraries
Lecture 13 Sparse Matrix-Vector Multiplication and CUDA Libraries
D. Gratadour : Introducing YoGA, Yorick with GPU acceleration
GPU VSIPL: High Performance VSIPL Implementation for GPUs
ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs
Nathan Grabaskas: Batched LA and Parallel Communication Optimization
Introduction to cuBLAS
Introduction to CUDA C Slide credit: Slides adapted from
GPU Implementations for Finite Element Methods
VSIPL++: Parallel Performance HPEC 2004
6- General Purpose GPU Programming
Presentation transcript:

© NVIDIA Corporation 2013 CUDA Libraries

© NVIDIA Corporation 2013 Why Use Library No need to reprogram Save time Less bug Better Performance = FUN

© NVIDIA Corporation 2013 CUDA Math Libraries High performance math routines for your applications: cuFFT – Fast Fourier Transforms Library cuBLAS – Complete BLAS Library cuSPARSE – Sparse Matrix Library cuRAND – Random Number Generation (RNG) Library NPP – Performance Primitives for Image & Video Processing Thrust – Templated C++ Parallel Algorithms & Data Structures math.h - C99 floating-point Library Included in the CUDA Toolkit Free

© NVIDIA Corporation 2013 Linear Algebra

© NVIDIA Corporation 2013 A Birds Eye View on Linear Algebra Matrix Solver Matrix Vector

© NVIDIA Corporation 2013 A Birds Eye View on Linear Algebra Matrix Solver Matrix Vector Dense Sparse Multi Node Single Node

© NVIDIA Corporation 2013 Sometimes it seems as if there’s only three … Matrix Solver Matrix Vector Dense Sparse ScaLAPACK LAPACK BLAS Single Node Multi Node

© NVIDIA Corporation but there is more … Matrix Solver Matrix Vector Dense Sparse ScaLAPACK LAPACK BLAS PaStix PLAPACK Paradiso Sparse- BLAS SuperLU WSMP Spooles TAUCS MUMPS UMFPACK EISPACK LINPACK PBLAS Single Node Multi Node

© NVIDIA Corporation 2013 Trilinos PETSc Matlab … and even more.. Matrix Solver Matrix Vector Dense Sparse ScaLAPACK LAPACK BLAS PaStix PLAPACK Paradiso Sparse- BLAS SuperLU WSMP Spooles TAUCS MUMPS UMFPACK EISPACK LINPACK PBLAS Single Node R, IDL, Python, Ruby,.. Multi Node

© NVIDIA Corporation 2013 NVIDIA CUDA Library Approach Provide basic building blocks Make them easy to use Make them fast Provides a quick path to GPU acceleration Enables ISVs to focus on their “secret sauce” Ideal for applications that use CPU libraries

© NVIDIA Corporation 2013 Single Node NVIDIA’s Foundation for LinAlg on GPUs Matrix Solver Matrix Vector Dense Sparse Parallel NVIDIA cuBLAS NVIDIA cuSPARSE

© NVIDIA Corporation 2013 cuBLAS: >1 TFLOPS double-precision cuBLAS 5.0, K20 MKL , Intel SandyBridge 3.10GHZ Up to 1 TFLOPS sustained performance and >8x speedup over Intel MKL Performance may vary based on OS version and motherboard configuration

© NVIDIA Corporation 2013 cuBLAS: Legacy and Version 2 Interface Legacy Interface Convenient for quick port of legacy code Version 2 Interface Reduces data transfer for complex algorithms Return values on CPU or GPU Scalar arguments passed by reference Support for streams and multithreaded environment Batching of key routines

© NVIDIA Corporation 2013 Version 2 Interface helps reducing memory transfers Legacy Interface idx = cublasIsamax(n, d_column, 1); err = cublasSscal(n, 1./d_column[idx], row, 1); Index transferred to CPU, CPU needs vector elements for scale factor

© NVIDIA Corporation 2013 Version 2 Interface helps reducing memory transfers Legacy Interface idx = cublasIsamax(n, d_column, 1); err = cublasSscal(n, 1./d_column[idx], row, 1); Version 2 Interface err = cublasIsamax(handle, n, d_column, 1, d_maxIdx); kernel >> (d_column, d_maxIdx, d_val); err = cublasSscal(handle, n, d_val, d_row, 1); Index transferred to CPU, CPU needs vector elements for scale factor All data remains on the GPU

© NVIDIA Corporation 2013 The cuSPARSE - CUSP Relationship Solver Matrix Vector Dense Sparse Parallel NVIDIA cuSPARSE Single Node

© NVIDIA Corporation 2013 Third Parties Extend the Building Blocks Solver Matrix Vector Dense Sparse IMSL Library FLAME Library Single Node Multi Node

© NVIDIA Corporation 2013 Third Parties Extend the Building Blocks Solver Matrix Vector Dense Sparse IMSL Library FLAME Library Single Node Multi Node

© NVIDIA Corporation 2013 Different Approaches to Linear Algebra CULA tools (dense, sparse) LAPACK based API Solvers, Factorizations, Least Squares, SVD, Eigensolvers Sparse: Krylov solvers, Preconditioners, support for various formats culaSgetrf(M, N, A, LDA, IPIV, INFO) ArrayFire “Matlab-esque” interface \C \Fortran Array container object Solvers, Factorizations, SVD, Eigensolvers array out = lu(A) ArrayFire Matrix Computations

© NVIDIA Corporation 2013 Different Approaches to Linear Algebra (cont.) MAGMA LAPACK conforming API Magma BLAS and LAPACK High performance by utilizing both GPU and CPU magma_sgetrf(M, N, A, LDA, IPIV, INFO) LibFlame LAPACK compatibility interface Infrastructure for rapid linear algebra algorithm development FLASH_LU_piv(A, p) FLAME Library

© NVIDIA Corporation 2013 Toolkits are increasingly supporting GPUs PETSc GPU support via extension to Vec and Mat classes Partially dependent on CUSP MPI parallel, GPU accelerated solvers Trilinos GPU support in KOKKOS package Used through vector class Tpetra MPI parallel, GPU accelerated solvers

© NVIDIA Corporation 2013 Signal Processing

© NVIDIA Corporation 2013 Common Tasks in Signal Processing FilteringCorrelation Segmentation

© NVIDIA Corporation 2013 NVIDIA NPP Parallel Computing Toolbox Libraries for GPU Accelerated Signal Processing ArrayFire Matrix Computations

© NVIDIA Corporation 2013 Basic concepts of cuFFT Interface modeled after FFTW Simple migration from CPU to GPU fftw_plan_dft2_2d => cufftPlan2d “Plan” describes data layout, transformation strategy Depends on dimensionality, layout, type of transform Independent of actual data, direction of transform Reusable for multiple transforms Execution of plan Depends on transform direction, data cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD)

© NVIDIA Corporation 2013 Efficient use of cuFFT Perform multiple transforms with the same plan Use e.g. in forward/inverse transform for convolution, transform at each simulation timestep, etc. Transform in streams cufft functions do not take a stream argument Associate a plan with a stream via cufftSetStream(plan, stream) Batch transforms Concurrent execution of multiple identical transforms Support for 1D, 2D and 3D transforms

© NVIDIA Corporation 2013 High 1D transform performance is key to efficient 2D and 3D transforms Measured on sizes that are exactly powers-of-2 cuFFT 5.0 on K20 Performance may vary based on OS version and motherboard configuration

© NVIDIA Corporation 2013 Basic concepts of NPP Collection of high-performance GPU processing Initial focus on Image, Video and Signal processing Growth into other domains expected Support for multi-channel integer and float data C API => name disambiguates between data types, flavor nppiAdd_32f_C1R (…) “Add” two single channel (“C1”) 32-bit float (“32f”) images, possibly masked by a region of interest (“R”)

© NVIDIA Corporation 2013 NPP features a large set of functions Arithmetic and Logical Operations Add, mul, clamp,.. Threshold and Compare Geometric transformations Rotate, Warp, Perspective transformations Various interpolations Compression jpeg de/compression Image processing Filter, histogram, statistics NVIDIA NPP

© NVIDIA Corporation 2013 GPU-VSIPL – Vector Image Signal Processing Library Open Industry Standard for Signal Processing Focus on embedded space, but support for GPU, CPU,.. Separate memory spaces integral part of API Support for single precision float, fft, matrix factorization GPU enabled version from Georgia Tech Lite and Core API vsip_ccfftmip_f(d->fft_plan_fast,d->z_cmview);

© NVIDIA Corporation 2013 cuRAND

© NVIDIA Corporation 2013 Random Number Generation on GPU Generating high quality random numbers in parallel is hard Don’t do it yourself, use a library! Large suite of generators and distributions XORWOW, MRG323ka, MTGP32, (scrambled) Sobol uniform, normal, log-normal Single and double precision Two APIs for cuRAND Host: Ideal when generating large batches of RNGs on GPU Device: Ideal when RNGs need to be generated inside a kernel

© NVIDIA Corporation 2013 cuRAND: Host vs Device API Host API #include “curand.h” curandCreateGenarator(&gen, CURAND_RNG_PSEUDO_DEFAULT); curandGenerateUniform(gen, d_data, n); Device API #include “curand_kernel.h” __global__ void generate_kernel(curandState *state) { int id = threadIdx.x + blockIdx.x * 64; x = curand(&state[id]); } Generate set of random numbers at once Generate random numbers per thread

© NVIDIA Corporation 2013 cuRAND Performance compared to Intel MKL Performance may vary based on OS version and motherboard configuration cuSPARSE 5.0 on K20X, input and output data on device MKL on Intel SandyBridge 3.10GHz

© NVIDIA Corporation 2013 Next steps..

© NVIDIA Corporation 2013 Thurst: STL-like CUDA Template Library Device and host vector class thrust::host_vector H(10, 1.f); thrust::device_vector D = H; Iterators thrust::fill(D.begin(), D.begin()+5, 42.f); float* raw_ptr = thrust::raw_pointer_cast(D); Algorithms Sort, reduce, transformation, scan,.. thrust::transform(D1.begin(), D1.end(), D2.begin(), D2.end(), thrust::plus ()); // D2 = D1 + D2 C++ STL Features for CUDA

© NVIDIA Corporation 2013 OpenACC: New Open Standard for GPU Computing Faster, Easier, Portability

© NVIDIA Corporation 2013 void vec_add(float *x,float *y,int n) { #pragma acc kernels for (int i=0;i<n;++i) y[i]=x[i]+y[i]; } float *x=(float*)malloc(n*sizeof(float)); float *y=(float*)malloc(n*sizeof(float)); vec_add(x,y,n); free(x); free(y); Vector Addition using OpenACC #pragma acc kernels: run the loop in parallel on GPU void vec_add(float *x,float *y,int n) { for (int i=0;i<n;++i) y[i]=x[i]+y[i]; } float *x=(float*)malloc(n*sizeof(float)); float *y=(float*)malloc(n*sizeof(float)); vec_add(x,y,n); free(x); free(y);

© NVIDIA Corporation 2013 OpenACC Basics Compute construct for offloading calculation to GPU #pragma acc parallel Data construct for controlling data movement between CPU-GPU #pragma acc data copy (list) / copyin (list) / copyout (list) / present (list) #pragma acc parallel for (i=0; i<n;i++) a[i] = a[i] + b[i]; #pragma acc data copy(a[0:n-1]) copyin(b[0:n-1]) { #pragma acc parallel for (i=0; i<n;i++) a[i] = a[i] + b[i]; #pragma acc parallel for (i=0; i<n;i++) a[i] *= 2; }

© NVIDIA Corporation 2013 Math.h: C99 floating-point library + extras

© NVIDIA Corporation 2013 Explore the CUDA (Libraries) Ecosystem CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone: developer.nvidia.com/cuda- tools-ecosystem Attend GTC library talks

© NVIDIA Corporation 2013 Examples

Questions