CUDA Programming Model


CUDA Programming Model
These notes will introduce:
- Basic GPU programming model
- CUDA kernel
- Simple CUDA program to add two vectors together
- Compiling the code on a Linux system
ITCS 4/5145 Parallel Programming, UNC-Charlotte, B. Wilkinson, April 3, 2012

Programming Model
GPUs were historically designed for creating image data for displays. That application involves manipulating image pixels (picture elements), often applying the same operation to each pixel.
SIMD (single instruction multiple data) model: an efficient mode of operation in which the same operation is done on each data element at the same time.

SIMD (Single Instruction Multiple Data) model
Also known as data parallel computation. One instruction specifies the operation, for example a[] = a[] + k, and the ALUs apply it to a[0], a[1], …, a[n-2], a[n-1] at the same time.
Very efficient if this is what you want to do. One program. Computers can be designed to operate this way.
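As a rough illustration of the a[] = a[] + k operation above, the sketch below expresses it as a CUDA kernel in which each thread updates one element. It relies on cudaMalloc, cudaMemcpy and the <<< >>> launch syntax that these notes cover in later slides. The names addConstant and addConstantOnGpu, the array length 8 and the constant 10 are assumptions made for this example, not part of the original notes.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: every thread runs the same instruction sequence on its own element.
__global__ void addConstant(int *a, int k) {
    int i = threadIdx.x;          // thread ID within the block selects the element
    a[i] = a[i] + k;
}

// Hypothetical helper (not from the notes): adds k to each of the n elements
// of host array a using one block of n threads, so n must not exceed the
// device's limit on threads per block.
// Returns 0 on success and 1 if a CUDA call reports an error.
int addConstantOnGpu(int *a, int n, int k) {
    int *devA = NULL;
    size_t size = n * sizeof(int);
    if (cudaMalloc((void**)&devA, size) != cudaSuccess) return 1;
    cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);
    addConstant<<<1, n>>>(devA, k);
    // This copy waits for the kernel to finish before reading the results.
    cudaError_t err = cudaMemcpy(a, devA, size, cudaMemcpyDeviceToHost);
    cudaFree(devA);
    return err == cudaSuccess ? 0 : 1;
}

int main(void) {
    int a[8];
    for (int i = 0; i < 8; i++) a[i] = i;
    if (addConstantOnGpu(a, 8, 10) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int i = 0; i < 8; i++) printf("a[%d] = %d\n", i, a[i]);
    return 0;
}
```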

Single Instruction Multiple Thread Programming Model
A version of SIMD used in GPUs. GPUs use a thread model to achieve very high parallel performance and to hide memory latency.
- Multiple threads each execute the same instruction sequence.
- On a GPU, a very large number of threads (10,000s) is possible.
- Threads are mapped onto the available processors on the GPU (100s of processors, all executing the same program sequence).

Programming applications using SIMT model
- Matrix operations: very amenable to SIMT, since the same operations are done on different elements of the matrices.
- Some “embarrassingly” parallel computations such as Monte Carlo calculations. Monte Carlo calculations use random selections, and the random selections are independent of each other.
- Data manipulations: some sorting can be done quite efficiently.
- …

CUDA kernel routine
To write a SIMT program, one needs to write a code sequence that all the threads on the GPU will execute. In CUDA, this code sequence is called a kernel routine.
Kernel code is regular C, except that one typically needs to use the thread ID in expressions to ensure each thread accesses different data. Example:
    …
    index = ThreadID;
    A[index] = B[index] + C[index];
All threads do this.

CPU and GPU memory
Once compiled, a program has code executed on the CPU and (kernel) code executed on the GPU. The CPU and GPU have separate memories: CPU main memory and GPU global memory.
Need to:
- Explicitly transfer data from CPU to GPU for GPU computation, and
- Explicitly transfer results in GPU memory back to CPU memory.

Basic CUDA program structure
    int main (int argc, char **argv ) {
        // 1. Allocate memory space in device (GPU) for data
        // 2. Allocate memory space in host (CPU) for data
        // 3. Copy data to GPU
        // 4. Call “kernel” routine to execute on GPU
        //    (with CUDA syntax that defines no. of threads and their physical structure)
        // 5. Transfer results from GPU to CPU
        // 6. Free memory space in device (GPU)
        // 7. Free memory space in host (CPU)
        return 0;
    }

1. Allocating memory space in “device” (GPU) for data
Use CUDA malloc routines:
    int size = N * sizeof(int);           // space for N integers
    int *devA, *devB, *devC;              // devA, devB, devC ptrs
    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );
Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20, 2010.
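The allocation calls above ignore their return values. A minimal sketch of checking them follows; cudaMalloc returns a cudaError_t, and cudaGetErrorString turns it into a readable message. The helper name allocDeviceInts is hypothetical and introduced only for this example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA or the notes): allocates space for
// n integers in device memory. Returns the device pointer, or NULL on failure.
int *allocDeviceInts(int n) {
    int *devPtr = NULL;
    cudaError_t err = cudaMalloc((void**)&devPtr, n * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    return devPtr;
}

int main(void) {
    int *devA = allocDeviceInts(256);   // space for 256 integers on the GPU
    if (devA == NULL) return 1;
    printf("Device allocation succeeded\n");
    cudaFree(devA);                     // release the device memory again
    return 0;
}
```

The same pattern applies to the other runtime calls in these notes, since they also return a cudaError_t.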

2. Allocating memory space in “host” (CPU) for data
Use regular C malloc routines:
    int *a, *b, *c;
    …
    a = (int*)malloc(size);
    b = (int*)malloc(size);
    c = (int*)malloc(size);
or statically declare variables:
    #define N 256
    int a[N], b[N], c[N];

3. Transferring data from host (CPU) to device (GPU)
Use CUDA routine cudaMemcpy. The first argument is the destination and the second is the source:
    cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);
where devA and devB are pointers to the destination in the device, and A and B are pointers to the host data.

4. Calling “kernel” routine to execute on device (GPU)
CUDA introduces a syntax addition to C: triple angle brackets mark a call from host code to device code. They contain the organization and number of threads in two parameters:
    myKernel<<< n, m >>>(arg1, … );
n and m define the organization of thread blocks and of threads in a block. For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.
arg1, … are the arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.

Declaring a Kernel Routine
A kernel is defined using the CUDA specifier __global__ (two underscores on each side).
Example: adding two vectors A and B
    #define N 256
    __global__ void vecAdd(int *A, int *B, int *C) {   // Kernel definition
        int i = threadIdx.x;      // CUDA structure that provides thread ID in block
        C[i] = A[i] + B[i];
    }
    int main() {
        // allocate device memory &
        // copy data to device
        // device mem. ptrs devA,devB,devC
        vecAdd<<<1, N>>>(devA,devB,devC);   // Grid of one block, N threads in block
        …
    }
Each of the N threads performs one pair-wise addition:
    Thread 0:   devC[0] = devA[0] + devB[0];
    Thread 1:   devC[1] = devA[1] + devB[1];
    …
    Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];
Loosely derived from CUDA C Programming Guide, v 3.2, 2010, NVIDIA.
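Launching one block of N threads only works while N stays within the device's limit on threads per block. The sketch below queries that limit with cudaGetDeviceProperties and shows how a launch failure can be detected with cudaGetLastError. The names emptyKernel and tryLaunch are hypothetical, and the use of device 0 is an assumption.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// A kernel that does nothing; only the launch configuration matters here.
__global__ void emptyKernel(void) { }

// Hypothetical helper: launches one block of `threads` threads and returns
// the resulting CUDA status.
cudaError_t tryLaunch(int threads) {
    emptyKernel<<<1, threads>>>();
    cudaError_t err = cudaGetLastError();      // reports an invalid launch configuration
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();         // waits for the kernel, reports execution errors
    }
    return err;
}

int main(void) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {   // device 0 assumed
        fprintf(stderr, "No usable CUDA device found\n");
        return 1;
    }
    int limit = prop.maxThreadsPerBlock;
    printf("Max threads per block: %d\n", limit);
    printf("Launch with %d threads: %s\n", limit,
           cudaGetErrorString(tryLaunch(limit)));
    printf("Launch with %d threads: %s\n", limit + 1,
           cudaGetErrorString(tryLaunch(limit + 1)));
    return 0;
}
```

A kernel launch does not return a status itself, which is why the check goes through cudaGetLastError immediately after the launch.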

5. Transferring data from device (GPU) to host (CPU)
Use CUDA routine cudaMemcpy. Again the first argument is the destination and the second is the source:
    cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);
where devC is a pointer in the device and C is a pointer in the host.
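To make the argument order of the two transfer directions concrete, here is a small round-trip sketch: data is copied from a host array to device memory and straight back into a second host array, then compared. The function name roundTripMismatches and the array size 256 are assumptions made for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 256   // number of integers to copy (illustrative)

// Hypothetical helper: copies N integers host -> device -> host and returns
// the number of elements that differ afterwards, or -1 on a CUDA error.
int roundTripMismatches(void) {
    int src[N], dst[N];
    int *dev = NULL;
    size_t size = N * sizeof(int);

    for (int i = 0; i < N; i++) {
        src[i] = i;
        dst[i] = -1;                // marker value that the copy should overwrite
    }

    if (cudaMalloc((void**)&dev, size) != cudaSuccess) return -1;

    // destination first, source second, in both directions
    if (cudaMemcpy(dev, src, size, cudaMemcpyHostToDevice) != cudaSuccess ||
        cudaMemcpy(dst, dev, size, cudaMemcpyDeviceToHost) != cudaSuccess) {
        cudaFree(dev);
        return -1;
    }
    cudaFree(dev);

    int mismatches = 0;
    for (int i = 0; i < N; i++) {
        if (dst[i] != src[i]) mismatches++;
    }
    return mismatches;
}

int main(void) {
    printf("Mismatches after round trip: %d\n", roundTripMismatches());
    return 0;
}
```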

6. Free memory space in “device” (GPU)
Use CUDA cudaFree routine:
    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );

7. Free memory space in host (CPU), if CPU memory was allocated with malloc
Use regular C free routine to deallocate memory previously allocated with malloc:
    free( a );
    free( b );
    free( c );

Complete CUDA program
Adding two vectors, A and B. N elements in A and B, and N threads (without code to load arrays with data).
    #define N 256
    __global__ void vecAdd(int *A, int *B, int *C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }
    int main (int argc, char **argv ) {
        int size = N * sizeof(int);
        int a[N], b[N], c[N], *devA, *devB, *devC;
        cudaMalloc( (void**)&devA, size );
        cudaMalloc( (void**)&devB, size );
        cudaMalloc( (void**)&devC, size );
        cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);
        vecAdd<<<1, N>>>(devA, devB, devC);
        cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);
        cudaFree( devA );
        cudaFree( devB );
        cudaFree( devC );
        return (0);
    }

Complete, with keyboard input for blocks/threads (without timing execution, see later)
    #include <stdio.h>
    #include <cuda.h>
    #include <stdlib.h>
    #include <time.h>
    #define N 4096                          // size of array
    __global__ void add(int *a, int *b, int *c) {
        int tid = blockIdx.x*blockDim.x + threadIdx.x;
        if (tid < N) {                      // threads beyond the array do nothing
            c[tid] = a[tid] + b[tid];
        }
    }
    int main(int argc, char *argv[]) {
        int T = 10, B = 1;                  // threads per block/blocks per grid
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;
        printf("Size of array = %d\n", N);
        do {
            printf("Enter number of threads per block: ");
            scanf("%d", &T);
            printf("\nEnter number of blocks per grid: ");
            scanf("%d", &B);
            if (T * B < N) printf("Error T x B < N, try again\n");
        } while (T * B < N);
        cudaMalloc((void**)&dev_a, N * sizeof(int));
        cudaMalloc((void**)&dev_b, N * sizeof(int));
        cudaMalloc((void**)&dev_c, N * sizeof(int));
        for (int i = 0; i < N; i++) {       // load arrays with some numbers
            a[i] = i;
            b[i] = i*1;
        }
        cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);  // not required: c is overwritten by the kernel
        add<<<B,T>>>(dev_a, dev_b, dev_c);
        cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < N; i++) {
            printf("%d+%d=%d\n", a[i], b[i], c[i]);
        }
        cudaFree(dev_a);                    // clean up
        cudaFree(dev_b);
        cudaFree(dev_c);
        return 0;
    }
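Instead of asking the user for both values, the number of blocks can be derived from the number of threads per block so that T x B >= N always holds. The following is a sketch under stated assumptions: the helper blocksNeeded is hypothetical, and T = 100 is chosen on purpose so that it does not divide N evenly, which makes the tid < N test in the kernel matter.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 4096   // size of array, as in the program above

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {                       // surplus threads in the last block do nothing
        c[tid] = a[tid] + b[tid];
    }
}

// Hypothetical helper: smallest number of blocks such that
// blocks * threadsPerBlock >= n (integer division rounded up).
int blocksNeeded(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main(void) {
    static int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    int T = 100;                         // threads per block (illustrative choice)
    int B = blocksNeeded(N, T);          // 41 blocks: 4100 threads, 4 of them idle

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<B, T>>>(dev_a, dev_b, dev_c);
    cudaError_t err = cudaGetLastError();    // catches an invalid launch configuration
    if (err != cudaSuccess) {
        fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    int mismatches = 0;                      // verify the result on the host
    for (int i = 0; i < N; i++) {
        if (c[i] != a[i] + b[i]) mismatches++;
    }
    printf("B = %d, T = %d, mismatches = %d\n", B, T, mismatches);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
```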

Compiling CUDA programs
NVIDIA provides nvcc, the NVIDIA CUDA “compiler driver”.
- It separates out code for the host and for the device.
- A regular C/C++ compiler is used for the host code (it needs to be available).
- The programmer simply uses nvcc instead of the gcc/cc compiler on a Linux system.
- Command-line options include options for GPU features.

Compiling code - Linux
Command line:
    nvcc -O3 -o <exe> <source_file> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda -lcudart
where:
- -O3 sets the optimization level, if you want optimized code
- -I gives directories for #include files
- -L gives directories for libraries
- -lcuda and -lcudart name the libraries to be linked
A CUDA source file that includes device code has the extension .cu. nvcc separates the code for the CPU and for the GPU and compiles both. A regular C compiler must be installed for the CPU code. A make file is convenient; see next.
See “The CUDA Compiler Driver NVCC” from NVIDIA for more details.

Very simple sample Make file
    NVCC = /usr/local/cuda/bin/nvcc
    CUDAPATH = /usr/local/cuda
    NVCCFLAGS = -I$(CUDAPATH)/include
    LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm
    # (recipe lines must begin with a tab character)
    # A regular C program
    prog1:
            cc -o prog1 prog1.c -lm
    # A C program with X11 graphics
    prog2:
            cc -I/usr/openwin/include -o prog2 prog2.c -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm
    # A CUDA program
    prog3:
            $(NVCC) $(NVCCFLAGS) $(LFLAGS) -o prog3 prog3.cu
    # A CUDA program with X11 graphics
    prog4:
            $(NVCC) $(NVCCFLAGS) $(LFLAGS) -I/usr/openwin/include -o prog4 prog4.cu -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm

Compilation process
- The nvcc “wrapper” divides code into host and device parts.
- The host part is compiled by the regular C compiler (gcc).
- The device part is compiled by the NVIDIA “ptxas” assembler.
- The two compiled parts are combined into one executable.
    nvcc -o prog prog.cu -I/includepath -L/libpath
The executable file is a “fat” binary with both host and device code.

Executing Program
Simply type the name of the executable created by nvcc.
- The file includes all the code for the host and for the device in a “fat binary” file.
- Host code starts running.
- When the first device kernel is encountered, the GPU code is physically sent to the GPU and the function is launched on the GPU. Hence the first launch will be slow!
- The run-time environment (cudart) controls memcpy timing and synchronization.
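One way to observe the slow first launch is to time two identical launches with a host clock, as in the sketch below. The names emptyKernel and timedLaunchMs are hypothetical. Because the first launch here is also the first CUDA call in the program, its time includes CUDA initialization as well as loading the device code, and the actual numbers depend on the system.

```cuda
#include <stdio.h>
#include <chrono>
#include <cuda_runtime.h>

// A kernel that does nothing, so the measured time is launch overhead only.
__global__ void emptyKernel(void) { }

// Hypothetical helper: launches the kernel, waits for it to finish, and
// returns the elapsed host wall-clock time in milliseconds.
double timedLaunchMs(void) {
    auto start = std::chrono::steady_clock::now();
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();             // kernel launches are asynchronous; wait here
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main(void) {
    double first = timedLaunchMs();      // includes one-time setup costs
    double second = timedLaunchMs();     // setup already done
    printf("First launch:  %.3f ms\n", first);
    printf("Second launch: %.3f ms\n", second);
    return 0;
}
```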

Questions