GPU Programming EPCC The University of Edinburgh

Overview
Motivation and need for CUDA
Introduction to CUDA
–CUDA kernels and decompositions
–CUDA memory management
–C and Fortran
OpenCL

NVIDIA CUDA
Traditional languages alone are not sufficient for programming GPUs.
CUDA allows NVIDIA GPUs to be programmed in C and C++
–defines language extensions for writing kernels
–kernels execute concurrently in multiple threads on the GPU
–provides API functions for e.g. device memory management
PGI provide a commercial Fortran version of CUDA (very similar to CUDA C)
–July 2013: NVIDIA acquired PGI

GPGPU: Stream Computing
The data set is decomposed into a stream of elements
A single computational function (kernel) operates on each element
–a “thread” is defined as the execution of the kernel on one data element
Multiple cores can process multiple elements in parallel
–i.e. many threads running in parallel
Suitable for data-parallel problems

NVIDIA GPUs have a 2-level hierarchy:
–multiple Streaming Multiprocessors (SMs), each with multiple cores
The number of SMs, and the number of cores per SM, varies across generations.

In CUDA, this is abstracted as a Grid of Thread Blocks
–the multiple blocks in a grid map onto the multiple SMs
–each block in a grid contains multiple threads, mapping onto the cores in an SM
We don’t need to know the exact details of the hardware (number of SMs, cores per SM).
–Instead, we oversubscribe, and the system performs the scheduling automatically.
–Use more blocks than SMs, and more threads than cores.
–The same code will be portable and efficient across different GPU versions.

CUDA dim3 type
CUDA introduces a new dim3 type
–simply a collection of 3 integers, corresponding to each of the X, Y and Z directions
C:
dim3 my_xyz_values(xvalue,yvalue,zvalue);
Fortran:
type(dim3) :: my_xyz_values
my_xyz_values = dim3(xvalue,yvalue,zvalue)

The X component can be accessed as follows:
C:
my_xyz_values.x
Fortran:
my_xyz_values%x
and similarly for Y and Z.
E.g. for my_xyz_values = dim3(6,4,12), my_xyz_values%z has the value 12.

Analogy
You check in to the hotel, as do your classmates
–rooms are allocated in order
The receptionist realises the hotel is less than half full
–and decides you should all move from your room number i to room number 2i
–so that no-one has a neighbour to disturb them

Serial solution:
–The receptionist works out each new room number in turn

Parallel solution:
“Everybody: check your room number. Multiply it by 2, and move to that room.”

Serial solution:
for (i=0;i<N;i++){
    result[i] = 2*i;
}
We can parallelise this by assigning each iteration to a separate CUDA thread.

CUDA C Example
Replace the loop with a function, and add the __global__ specifier to indicate that this function forms a GPU kernel:
__global__ void myKernel(int *result)
{
    int i = threadIdx.x;
    result[i] = 2*i;
}
Use internal CUDA variables to specify the array indices: threadIdx.x is an internal variable unique to each thread in a block, the X component of a dim3 type. Since our problem is 1D, we are not using the Y or Z components (more later).

CUDA C Example
Launch this kernel by calling the function on multiple CUDA threads using the <<<...>>> syntax:
dim3 blocksPerGrid(1,1,1);   //use only one block
dim3 threadsPerBlock(N,1,1); //use N threads in the block
myKernel<<<blocksPerGrid, threadsPerBlock>>>(result);

CUDA FORTRAN Equivalent
Kernel:
attributes(global) subroutine myKernel(result)
    integer, dimension(*) :: result
    integer :: i
    i = threadidx%x
    result(i) = 2*i
end subroutine
Launched as follows:
blocksPerGrid = dim3(1, 1, 1)
threadsPerBlock = dim3(N, 1, 1)
call myKernel<<<blocksPerGrid, threadsPerBlock>>>(result)

CUDA C Example
The previous example uses only 1 block, i.e. only 1 SM on the GPU, so performance will be very poor. In practice, we need to use multiple blocks to utilise all SMs, e.g.:
__global__ void myKernel(int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    result[i] = 2*i;
}
...
dim3 blocksPerGrid(N/256,1,1); //assuming 256 divides N exactly
dim3 threadsPerBlock(256,1,1);
myKernel<<<blocksPerGrid, threadsPerBlock>>>(result);
...
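As a worked example of the index calculation: with 256 threads per block, the thread with threadIdx.x = 3 in the block with blockIdx.x = 2 computes i = 2*256 + 3 = 515, so every thread in the grid receives a unique global index.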

FORTRAN
attributes(global) subroutine myKernel(result)
    integer, dimension(*) :: result
    integer :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    result(i) = 2*i
end subroutine
...
blocksPerGrid = dim3(N/256, 1, 1) !assuming 256 divides N exactly
threadsPerBlock = dim3(256, 1, 1)
call myKernel<<<blocksPerGrid, threadsPerBlock>>>(result)
We have chosen 256 threads per block, which is typically a good number (see practical).

CUDA C Example
A more realistic 1D example: vector addition
__global__ void vectorAdd(float *a, float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}
...
dim3 blocksPerGrid(N/256,1,1); //assuming 256 divides N exactly
dim3 threadsPerBlock(256,1,1);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
...

CUDA FORTRAN Equivalent
attributes(global) subroutine vectorAdd(a, b, c)
    real, dimension(*) :: a, b, c
    integer :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    c(i) = a(i) + b(i)
end subroutine
...
blocksPerGrid = dim3(N/256, 1, 1)
threadsPerBlock = dim3(256, 1, 1)
call vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c)
...

CUDA C Internal Variables
For a 1D decomposition (e.g. the previous examples):
blockDim.x : number of threads per block
–takes the value 256 in the previous example
threadIdx.x : unique to each thread in a block
–ranges from 0 to 255 in the previous example
blockIdx.x : unique to every block in the grid
–ranges from 0 to (N/256)-1 in the previous example
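
The examples so far assume N is an exact multiple of the block size. A minimal sketch of the standard idiom for an arbitrary N (not shown on the slides): round the number of blocks up, pass the length into the kernel, and guard the array access.
__global__ void vectorAdd(int n, float *a, float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // guard: the last block may contain surplus threads
        c[i] = a[i] + b[i];
    }
}
...
dim3 threadsPerBlock(256,1,1);
dim3 blocksPerGrid((N+255)/256,1,1); // integer division, rounded up
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(N, a, b, c);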

CUDA Fortran Internal Variables
For a 1D decomposition (e.g. the previous example):
blockDim%x : number of threads per block
–takes the value 256 in the previous example
threadIdx%x : unique to each thread in a block
–ranges from 1 to 256 in the previous example
blockIdx%x : unique to every block in the grid
–ranges from 1 to (N/256) in the previous example

2D Example
2D or 3D CUDA decompositions are also possible, e.g. for matrix addition (2D):
__global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N])
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    c[i][j] = a[i][j] + b[i][j];
}
int main()
{
    dim3 blocksPerGrid(N/16,N/16,1); // (N/16)x(N/16) blocks/grid (2D)
    dim3 threadsPerBlock(16,16,1);   // 16x16=256 threads/block (2D)
    matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
}
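As a worked example of the 2D mapping: with 16x16 thread blocks, element c[20][7] is computed by thread (7, 4) of block (0, 1), since j = 0*16 + 7 = 7 and i = 1*16 + 4 = 20. Note that the X dimension maps to the column index j and the Y dimension to the row index i.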

CUDA Fortran Equivalent
! Kernel declaration
attributes(global) subroutine matrixAdd(N, a, b, c)
    integer, value :: N
    real, dimension(N,N) :: a, b, c
    integer :: i, j
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    j = (blockidx%y-1)*blockdim%y + threadidx%y
    c(i,j) = a(i,j) + b(i,j)
end subroutine
! Kernel invocation
blocksPerGrid = dim3(N/16, N/16, 1) ! (N/16)x(N/16) blocks/grid (2D)
threadsPerBlock = dim3(16, 16, 1)   ! 16x16=256 threads/block (2D)
call matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(N, a, b, c)

Memory Management - allocation
The GPU has a memory space separate from the host CPU's. Data accessed in kernels must be in GPU memory, so we need to manage GPU memory and copy data to and from it explicitly.
cudaMalloc is used to allocate GPU memory; cudaFree releases it again:
float *a;
cudaMalloc(&a, N*sizeof(float));
...
cudaFree(a);
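
The slides omit error checking, but cudaMalloc (like most CUDA API calls) returns a cudaError_t that is worth inspecting in real code. A minimal sketch (not from the original slides):
#include <stdio.h>
#include <stdlib.h>
...
float *a;
cudaError_t err = cudaMalloc(&a, N*sizeof(float));
if (err != cudaSuccess) {
    // cudaGetErrorString converts the error code to a readable message
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}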

Memory Management - cudaMemcpy
Once we've allocated GPU memory, we need to be able to copy data to and from it. cudaMemcpy does this:
cudaMemcpy(array_device, array_host, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(array_host, array_device, N*sizeof(float), cudaMemcpyDeviceToHost);
The first argument always corresponds to the destination of the transfer.
Transfers between host and device memory are relatively slow and can become a bottleneck, so they should be minimised where possible.
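
Putting the fragments from the previous slides together, here is a minimal but complete vector-addition host program, a sketch assembled from the slide material (assuming, as the slides do, that N is an exact multiple of 256):
#include <stdio.h>
#define N 1024

__global__ void vectorAdd(float *a, float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

int main(void)
{
    float a[N], b[N], c[N];
    float *d_a, *d_b, *d_c; // device copies

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2*i; }

    // allocate device memory
    cudaMalloc(&d_a, N*sizeof(float));
    cudaMalloc(&d_b, N*sizeof(float));
    cudaMalloc(&d_c, N*sizeof(float));

    // copy inputs to the device
    cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

    // launch N/256 blocks of 256 threads
    dim3 blocksPerGrid(N/256,1,1);
    dim3 threadsPerBlock(256,1,1);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

    // copy the result back (blocking, so no explicit synchronisation needed)
    cudaMemcpy(c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost);

    printf("c[1] = %f\n", c[1]); // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}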

CUDA FORTRAN – Data management
Data management is more intuitive than in CUDA C, because Fortran has array syntax and the compiler knows whether a pointer refers to CPU or GPU memory.
Can use allocate() and deallocate() as for host memory:
real, device, allocatable, dimension(:) :: d_a
allocate( d_a(N) )
...
deallocate ( d_a )
Can copy data between host and device using assignment:
d_a = a(1:N)
Or can instead use the CUDA API (similar to C), e.g.
istat = cudaMemcpy(d_a, a, N)

Synchronisation between host and device
Kernel calls are non-blocking: the host program continues immediately after it calls the kernel.
–This allows overlap of computation on the CPU and GPU.
Use cudaThreadSynchronize() to wait for the kernel to finish:
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
//do work on host (that doesn't depend on c)
cudaThreadSynchronize(); //wait for kernel to finish
Standard cudaMemcpy calls are blocking
–Non-blocking variants exist
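
As a hedged illustration of the non-blocking variants mentioned above (not shown on the slides), cudaMemcpyAsync takes an extra stream argument and returns to the host immediately:
cudaStream_t stream;
cudaStreamCreate(&stream);

// enqueue an asynchronous copy; the host continues immediately
// (truly asynchronous transfers also require page-locked "pinned" host memory)
cudaMemcpyAsync(d_a, a, N*sizeof(float), cudaMemcpyHostToDevice, stream);

// ... independent host work here ...

cudaStreamSynchronize(stream); // wait for the copy to complete
cudaStreamDestroy(stream);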

Synchronisation between CUDA threads
Within a kernel, to synchronise between threads in the same block, use the __syncthreads() call.
Threads in the same block can therefore communicate through memory spaces that they share, e.g. assuming x is local to each thread and array is in a shared memory space:
if (threadIdx.x == 0) array[0] = x;
__syncthreads();
if (threadIdx.x == 1) x = array[0];
It is not possible to communicate between different blocks within a kernel: you must instead exit the kernel and start a new one.
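
The slide refers to a shared memory space without showing how one is declared. A minimal sketch of the exchange above using CUDA's __shared__ qualifier (the kernel name and array size are illustrative, not from the slides):
__global__ void exchange(float *data)
{
    __shared__ float array[1];          // visible to all threads in this block

    float x = data[threadIdx.x];        // per-thread local value

    if (threadIdx.x == 0) array[0] = x; // thread 0 writes to shared memory
    __syncthreads();                    // all threads in the block wait here
    if (threadIdx.x == 1) x = array[0]; // thread 1 reads thread 0's value

    data[threadIdx.x] = x;
}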

Compiling CUDA Code
CUDA C code is compiled using nvcc:
nvcc -o example example.cu
CUDA Fortran is compiled using the PGI compiler
–either use the .cuf filename extension for CUDA files
–and/or pass -Mcuda on the compiler command line
pgf90 -Mcuda -o example example.cuf

OpenCL
Open Computing Language (OpenCL): “The Open Standard for Heterogeneous Parallel Programming”
–an open, cross-platform framework for programming modern multicore and heterogeneous systems
Supports a wide range of applications and architectures, including GPUs
–supported on NVIDIA Tesla and AMD FireStream
See the Khronos OpenCL website for more information.

OpenCL vs CUDA on NVIDIA
NVIDIA supports both CUDA and OpenCL as APIs to its hardware
–but puts much more effort into CUDA
–CUDA is more mature, better documented, and performs better
OpenCL and C for CUDA are conceptually very similar
–very similar abstractions, basic functionality, etc.
–different names, e.g. a “thread” (CUDA) becomes a “work-item” (OpenCL)
–porting between the two should in principle be straightforward
OpenCL is a lower-level API than C for CUDA
–more work for the programmer
OpenCL is obviously portable to other systems
–but in reality work will still be needed for efficiency on a different architecture
OpenCL may well catch up with CUDA given time

Example OpenCL kernel
__kernel void vectoradd(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
__kernel and __global are OpenCL extensions to C.
get_global_id is an OpenCL function
–it performs a similar role to CUDA's blockIdx/threadIdx
Invoking kernels is a much more involved process than in CUDA.
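
To give a feel for "much more involved", here is an abbreviated sketch of the standard OpenCL host-side launch sequence (not from the slides; error checking is omitted, and source, a, c, d_c and N are assumed to be defined elsewhere):
// set up the platform, device, context and command queue
cl_platform_id platform;  clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

// the kernel source (a string) is compiled at runtime
cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
cl_kernel k = clCreateKernel(prog, "vectoradd", NULL);

// buffers replace cudaMalloc/cudaMemcpy
cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            N*sizeof(float), a, NULL);
...
clSetKernelArg(k, 0, sizeof(cl_mem), &d_a); // one call per kernel argument
...
// enqueue the kernel: global/local sizes play the role of grid/block
size_t global = N, local = 256;
clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
clEnqueueReadBuffer(q, d_c, CL_TRUE, 0, N*sizeof(float), c, 0, NULL, NULL);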

Summary
Traditional languages alone are not sufficient for programming GPUs.
CUDA allows NVIDIA GPUs to be programmed using C/C++
–defines language extensions and APIs to enable this
PGI provide a commercial Fortran version of CUDA
–very similar to CUDA C, with more intuitive options for data management
We introduced the key CUDA concepts and gave examples.
OpenCL provides another means of programming GPUs in C
–conceptually similar to CUDA, but less mature and lower-level
–supports other hardware as well as NVIDIA GPUs