
Using CUDA Libraries with OpenACC

3 Ways to Accelerate Applications
- Libraries: "drop-in" acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility
CUDA libraries are interoperable with OpenACC.

3 Ways to Accelerate Applications
- Libraries: "drop-in" acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility
CUDA languages are interoperable with OpenACC, too!

CUDA Libraries Overview

GPU Accelerated Libraries: "Drop-in" Acceleration for Your Applications
- NVIDIA cuBLAS
- NVIDIA cuRAND
- NVIDIA cuSPARSE
- NVIDIA NPP (Vector Signal Image Processing)
- NVIDIA cuFFT
- GPU Accelerated Linear Algebra
- Matrix Algebra on GPU and Multicore
- C++ STL Features for CUDA
- Building-block Algorithms for CUDA
- IMSL Library

CUDA Math Libraries
High-performance math routines for your applications:
- cuFFT – Fast Fourier Transforms Library
- cuBLAS – Complete BLAS Library
- cuSPARSE – Sparse Matrix Library
- cuRAND – Random Number Generation (RNG) Library
- NPP – Performance Primitives for Image & Video Processing
- Thrust – Templated C++ Parallel Algorithms & Data Structures
- math.h – C99 floating-point library
Included in the CUDA Toolkit, free of charge. More information on CUDA libraries:

cuFFT: Multi-dimensional FFTs
New in CUDA 4.1:
- Flexible input & output data layouts for all transform types, similar to the FFTW "Advanced Interface"; eliminates extra data transposes and copies
- API is now thread-safe & callable from multiple host threads
- Restructured documentation to clarify data layouts
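
Outside the slides, a minimal sketch of the basic cuFFT workflow that the convolution example later in this deck builds on (assumes d_signal is a device pointer to NX cufftComplex elements already resident on the GPU; error checking omitted):

    #include <cufft.h>

    // In-place 1D complex-to-complex forward transform of NX elements.
    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);                  // one transform in the batch
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD); // execute on device data
    cufftDestroy(plan);                                    // release plan resources

The same plan can be reused for repeated transforms of the same size and type, as the later example does for the forward and inverse passes.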

FFTs up to 10x Faster than MKL
- Measured on sizes that are exactly powers of 2
- 1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs
Benchmark setup: cuFFT 4.1 on Tesla M2090 (ECC on) vs. MKL on a TYAN FT72-B7015 with Xeon CPUs. Performance may vary based on OS version and motherboard configuration.

CUDA 4.1 optimizes 3D transforms
- Consistently faster than MKL
- >3x faster than cuFFT 4.0 on average
Benchmark setup: cuFFT 4.1 on Tesla M2090 (ECC on) vs. MKL on a TYAN FT72-B7015 with Xeon CPUs. Performance may vary based on OS version and motherboard configuration.

cuBLAS: Dense Linear Algebra on GPUs
- Complete BLAS implementation plus useful extensions
- Supports all 152 standard routines for single, double, complex, and double complex
New in CUDA 4.1:
- New batched GEMM API provides >4x speedup over MKL; useful for batches of 100+ small matrices, from 4x4 to 128x128
- 5%-10% performance improvement to large GEMMs
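
For orientation, a hedged sketch of the batched GEMM call style (d_Aarray, d_Barray, d_Carray, m, n, k, and batchCount are placeholders, not names from the slides; the arrays are device arrays of device pointers to column-major matrices; error checking omitted):

    #include <cublas_v2.h>

    // C[i] = alpha * A[i] * B[i] + beta * C[i] for every matrix in the batch.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       m, n, k, &alpha,
                       (const float **)d_Aarray, m,   // m x k matrices, lda = m
                       (const float **)d_Barray, k,   // k x n matrices, ldb = k
                       &beta, d_Carray, m,            // m x n results,  ldc = m
                       batchCount);
    cublasDestroy(handle);

Launching one batched call instead of hundreds of tiny GEMMs is what yields the >4x speedup quoted above for small matrices.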

cuBLAS Level 3 Performance
Up to 1 TFLOPS sustained performance and >6x speedup over Intel MKL (4K x 4K matrix size).
Benchmark setup: cuBLAS 4.1 on Tesla M2090 (Fermi, ECC on) vs. MKL on a TYAN FT72-B7015 with Xeon CPUs. Performance may vary based on OS version and motherboard configuration.

ZGEMM Performance vs. Intel MKL
Benchmark setup: cuBLAS 4.1 on Tesla M2090 (ECC on) vs. MKL on a TYAN FT72-B7015 with Xeon CPUs. Performance may vary based on OS version and motherboard configuration.

cuBLAS Batched GEMM API improves performance on batches of small matrices
Benchmark setup: cuBLAS 4.1 on Tesla M2090 (ECC on) vs. MKL on a TYAN FT72-B7015 with Xeon CPUs. Performance may vary based on OS version and motherboard configuration.

cuSPARSE: Sparse Linear Algebra Routines
- Sparse matrix-vector multiplication & triangular solve
- APIs optimized for iterative methods
New in 4.1:
- Tri-diagonal solver with speedups up to 10x over Intel MKL
- ELL-HYB format offers 2x faster matrix-vector multiplication
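
For readers less familiar with the storage format, the following is not a cuSPARSE call, just a plain-C CPU reference of the CSR sparse matrix-vector product y = A*x that these routines accelerate (rowPtr, colInd, and val are the standard CSR arrays):

    // y = A * x with A stored in CSR: rowPtr has m+1 entries,
    // colInd and val have one entry per nonzero.
    void csr_spmv_reference(int m, const int *rowPtr, const int *colInd,
                            const float *val, const float *x, float *y)
    {
        for (int row = 0; row < m; row++) {
            float sum = 0.0f;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; j++)
                sum += val[j] * x[colInd[j]];
            y[row] = sum;
        }
    }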

cuSPARSE is >6x Faster than Intel MKL*
*Average speedup over single, double, single-complex & double-complex precision.
Benchmark setup: cuSPARSE 4.1 on Tesla M2090 (Fermi, ECC on) vs. MKL on a TYAN FT72-B7015 with Xeon CPUs. Performance may vary based on OS version and motherboard configuration.

Up to 40x Faster with 6 CSR Vectors
Benchmark setup: cuSPARSE 4.1 on Tesla M2090 (Fermi, ECC on) vs. MKL on a TYAN FT72-B7015 with Xeon CPUs. Performance may vary based on OS version and motherboard configuration.

Tri-diagonal Solver Performance vs. MKL
*The parallel GPU implementation does not include pivoting.
Benchmark setup: cuSPARSE 4.1 on Tesla M2090 (Fermi, ECC on) vs. MKL on a TYAN FT72-B7015 with Xeon CPUs. Performance may vary based on OS version and motherboard configuration.

cuRAND: Random Number Generation
- Pseudo- and quasi-RNGs
- Supports several output distributions
- Statistical test results reported in documentation
New commonly used RNGs in CUDA 4.1:
- MRG32k3a RNG
- MTGP11213 Mersenne Twister RNG
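
A minimal host-API sketch (not from the slides) that fills a pre-allocated device array d_data with n uniform floats using the MRG32k3a generator mentioned above (names and seed are placeholders; error checking omitted):

    #include <curand.h>

    // Create a generator, seed it, generate on the device, clean up.
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_data, n);   // d_data is a device pointer
    curandDestroyGenerator(gen);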

cuRAND Performance Compared to Intel MKL
Benchmark setup: cuRAND 4.1 on Tesla M2090 (Fermi, ECC on) vs. MKL on a TYAN FT72-B7015 with Xeon 3.33 GHz CPUs. Performance may vary based on OS version and motherboard configuration.

1000+ New Imaging Functions in NPP 4.1
- The NVIDIA Performance Primitives (NPP) library includes over 2200 GPU-accelerated functions for image & signal processing: arithmetic, logic, conversions, filters, statistics, etc.
- Most are 5x-10x faster than analogous routines in Intel IPP*, with up to 40x speedups.
*Benchmark setup: NPP 4.1 on NVIDIA C2050 (Fermi) vs. IPP 6.1 on a dual-socket Core™ i7 2.67 GHz system.
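
To illustrate the call style only (an assumption, not taken from the slides: check the exact function name and header against the NPP documentation for your CUDA version; d_src1, d_src2, d_dst, and len are placeholder device pointers and a length), an element-wise add of two single-precision device signals is a single call:

    #include <npps.h>

    // Element-wise add of two device arrays of Npp32f into a third.
    NppStatus status = nppsAdd_32f(d_src1, d_src2, d_dst, len);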

Using CUDA Libraries with OpenACC

Sharing data with libraries
- CUDA libraries and OpenACC both operate on device arrays
- OpenACC provides two mechanisms for interop with library calls: the deviceptr data clause and the host_data construct
- Note: the same mechanisms are useful for interop with custom CUDA C/C++/Fortran code

deviceptr Data Clause
deviceptr( list ): declares that the pointers in list are device pointers, so the data they point to need not be allocated on or moved between the host and device.
Example:
  C:       #pragma acc data deviceptr(d_input)
  Fortran: !$acc data deviceptr(d_input)
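
A small sketch of the clause in context (not from the slides; scale_on_device and the doubling loop are illustrative only). Because the array was allocated with cudaMalloc, OpenACC must treat the pointer as device-resident rather than trying to copy it:

    // d_input is already a device pointer (e.g. from cudaMalloc or a CUDA
    // library), so deviceptr tells OpenACC not to allocate or transfer it.
    void scale_on_device(float *d_input, int n)
    {
        #pragma acc data deviceptr(d_input)
        {
            #pragma acc kernels loop
            for (int i = 0; i < n; i++)
                d_input[i] *= 2.0f;   // runs on the GPU, no host copies
        }
    }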

host_data Construct
Makes the address of device data available on the host.
use_device( list ): tells the compiler to use the device address for any variable in list. Variables in the list must be present in device memory due to data regions that contain this construct.
Example:
  C:       #pragma acc host_data use_device(d_input)
  Fortran: !$acc host_data use_device(d_input)
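
A hedged sketch of the complementary direction to the cuFFT example that follows (not from the slides): here the array x is managed by an OpenACC data region, and host_data use_device hands its device address to a CUDA library call, with cublasSscal used purely as an illustration.

    #include <cublas_v2.h>

    // x lives on the host; the data region copies it to the device,
    // and host_data exposes the device address to cuBLAS.
    void scale_with_cublas(cublasHandle_t handle, int n, float *x, float alpha)
    {
        #pragma acc data copy(x[0:n])
        {
            #pragma acc host_data use_device(x)
            {
                cublasSscal(handle, n, &alpha, x, 1);
            }
        }
    }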

Example: 1D convolution using cuFFT
Perform convolution in frequency space:
1. Use cuFFT to transform the input signal and filter kernel into the frequency domain
2. Perform a point-wise complex multiply and scale on the transformed signal
3. Use cuFFT to transform the result back into the time domain
We will perform step 2 using OpenACC. A code walk-through follows; the code is available with the exercises, in exercises/cufft-acc.

Source Excerpt

    // Transform signal and kernel
    error = cufftExecC2C(plan, (cufftComplex *)d_signal,
                         (cufftComplex *)d_signal, CUFFT_FORWARD);
    error = cufftExecC2C(plan, (cufftComplex *)d_filter_kernel,
                         (cufftComplex *)d_filter_kernel, CUFFT_FORWARD);

    // Multiply the coefficients together and normalize the result
    printf("Performing point-wise complex multiply and scale.\n");
    complexPointwiseMulAndScale(new_size, (float *restrict)d_signal,
                                (float *restrict)d_filter_kernel);

    // Transform signal back
    error = cufftExecC2C(plan, (cufftComplex *)d_signal,
                         (cufftComplex *)d_signal, CUFFT_INVERSE);

complexPointwiseMulAndScale must execute on device data.

OpenACC convolution code

    void complexPointwiseMulAndScale(int n, float *restrict signal,
                                     float *restrict filter_kernel)
    {
        // Multiply the coefficients together and normalize the result
        #pragma acc data deviceptr(signal, filter_kernel)
        {
            #pragma acc kernels loop independent
            for (int i = 0; i < n; i++) {
                float ax = signal[2*i];
                float ay = signal[2*i+1];
                float bx = filter_kernel[2*i];
                float by = filter_kernel[2*i+1];
                float s  = 1.0f / n;
                float cx = s * (ax * bx - ay * by);
                float cy = s * (ax * by + ay * bx);
                signal[2*i]   = cx;
                signal[2*i+1] = cy;
            }
        }
    }

Note: The PGI C compiler does not currently support structs in OpenACC loops, so we cast the Complex* pointers to float* pointers and use interleaved indexing.

Linking cuFFT
In the source: #include "cufft.h"
Compiler command-line options:

    CUDA_PATH = /usr/local/pgi/linux86-64/2012/cuda/4.0
    CCFLAGS   = -I$(CUDA_PATH)/include -L$(CUDA_PATH)/lib64 -lcudart -lcufft

- Must use the PGI-provided CUDA toolkit paths
- Must link libcudart and libcufft
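
Under these assumptions, a full compile line might look like the following sketch (the -acc and -Minfo=accel options are standard PGI compiler flags; the file name cufft_acc.c is a placeholder):

    pgcc -acc -Minfo=accel $(CCFLAGS) cufft_acc.c -o cufft_acc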

Results

    cufft-acc]$ ./cufft_acc
    Transforming signal cufftExecC2C
    Performing point-wise complex multiply and scale.
    Transforming signal back cufftExecC2C
    Performing Convolution on the host and checking correctness
    Signal size: , filter size: 33
    Total Device Convolution Time: ms ( for point-wise convolution)
    Test PASSED

The point-wise multiply and scale step is the OpenACC portion; the remainder of the device time is cuFFT + cudaMemcpy.

Summary
- Use the deviceptr data clause to pass pre-allocated device data into OpenACC regions and loops
- Use host_data to get the device address for pointers inside acc data regions
- The same techniques shown here can be used to share device data between OpenACC loops and:
  - your custom CUDA C/C++/Fortran/etc. device code
  - any CUDA library that uses CUDA device pointers

Thank you