1 Workshop 9: General purpose computing using GPUs: Developing a hands-on undergraduate course on CUDA programming. SIGCSE 2011 - The 42nd ACM Technical Symposium on Computer Science Education. Wednesday March 9, 2011, 7:00 pm - 10:00 pm. Dr. Barry Wilkinson, University of North Carolina Charlotte; Dr. Yaohang Li, Old Dominion University. SIGCSE 2011 Workshop 9, Session1.ppt © 2010 B. Wilkinson. Modification date: Feb 23, 2011.
2 Session 1, 7:15 pm - 8:25 pm: Basic CUDA programming. Presentation (about 35 minutes): kernel calls, data movement, predefined variables, thread organization, code examples. Hands-on experience using remote GPU server (about 35 minutes).
3 CUDA Program. A CUDA program has code to be executed on the CPU and code to be executed on the GPU in one source file (in simple cases) and one executable when compiled. A CUDA kernel is a routine to be executed on the GPU -- a SIMT code sequence. Kernel code will be regular C except that one typically needs to use the thread ID in expressions to ensure each thread accesses different data. When a kernel call is reached in the code for the first time, it will be launched onto the GPU.
4 CPU and GPU Memories. Separate memories on the CPU (host) and the GPU (device)*. Usually need to explicitly transfer data from CPU main memory to GPU global memory for GPU computation, and explicitly transfer results from GPU global memory back to CPU main memory. (Figure: CPU with its main memory and GPU with its global memory, with copies in each direction.) * assuming a separate GPU card. Integrated systems might share some memory.
5 Basic CUDA program structure
int main (int argc, char **argv) {
  1. Allocate memory space in device (GPU) for data
  2. Allocate memory space in host (CPU) for data
  3. Copy data to GPU
  4. Call "kernel" routine to execute on GPU (with CUDA syntax that defines the number of threads and their physical structure)
  5. Transfer results from GPU to CPU
  6. Free memory space in device (GPU)
  7. Free memory space in host (CPU)
  return;
}
6 1. Allocating memory space in "device" (GPU) for data
Use CUDA malloc routines:
int size = N * sizeof(int);   // space for N integers
int *devA, *devB, *devC;      // devA, devB, devC ptrs
cudaMalloc( (void**)&devA, size );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20, 2010.
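The slides omit error handling; as a hedged aside (not in the original slides), CUDA API calls return a cudaError_t status that can be checked -- a minimal sketch:
// Minimal sketch of checking a CUDA API return status.
// Fragment; assumes <stdio.h> and <stdlib.h> are included.
cudaError_t err = cudaMalloc( (void**)&devA, size );
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}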
7 2. Allocating memory space in "host" (CPU) for data
Use regular C malloc routines:
int *a, *b, *c;
…
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
or statically declare variables:
#define N 256
…
int a[N], b[N], c[N];
8 3. Transferring data from host (CPU) to device (GPU)
Use the CUDA routine cudaMemcpy:
cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);
where devA and devB are pointers to the destinations in the device and A and B are pointers to the host data.
9 4. Declaring "kernel" routine to execute on device (GPU)
CUDA introduces a <<<...>>> syntax addition to C for kernel calls:
myKernel<<<n, m>>>(arg1, … );
<<<n, m>>> contains the thread organization for this particular kernel call in two parameters, n and m: for now, we will set n = 1, which says one block, and m = N, which says N threads in this block.
arg1, … -- arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
10 Example – Adding two vectors A and B
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) {   // Kernel definition
  int i = threadIdx.x;
  C[i] = A[i] + B[i];
}
int main() {
  // allocate device memory &
  // copy data to device
  // device mem. ptrs devA, devB, devC
  vecAdd<<<1, N>>>(devA, devB, devC);   // one block of N threads
  …
}
Loosely derived from CUDA C Programming Guide, v3.2, 2010, NVIDIA.
The kernel routine is defined using the CUDA specifier __global__; threadIdx is the CUDA structure that provides the thread ID within the block. Each thread performs one pair-wise addition: Thread 0: devC[0] = devA[0] + devB[0]; Thread 1: devC[1] = devA[1] + devB[1]; Thread 2: devC[2] = devA[2] + devB[2]; etc.
11 5. Transferring data from device (GPU) to host (CPU)
Use the CUDA routine cudaMemcpy:
cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);
where devC is a pointer in the device and C is a pointer in the host.
12 Free memory space
In "device" (GPU) -- use the CUDA cudaFree routine:
cudaFree( devA );
cudaFree( devB );
cudaFree( devC );
In host (CPU) (if the CPU memory was allocated with malloc) -- use the regular C free routine:
free( a );
free( b );
free( c );
13 Complete CUDA program
Adding two vectors, A and B; N elements in A and B, and N threads (without code to load the arrays with data):
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) {
  int i = threadIdx.x;
  C[i] = A[i] + B[i];
}
int main (int argc, char **argv) {
  int size = N * sizeof(int);
  int *a, *b, *c, *devA, *devB, *devC;
  cudaMalloc( (void**)&devA, size );
  cudaMalloc( (void**)&devB, size );
  cudaMalloc( (void**)&devC, size );
  a = (int*)malloc(size);
  b = (int*)malloc(size);
  c = (int*)malloc(size);
  cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
  cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);
  vecAdd<<<1, N>>>(devA, devB, devC);
  cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);
  cudaFree( devA );
  cudaFree( devB );
  cudaFree( devC );
  free( a );
  free( b );
  free( c );
  return (0);
}
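As a hedged illustration (not in the original slide), the arrays could be loaded with data and the result checked on the host, for example:
// Hypothetical host-side initialization and check; the first loop would go
// before the cudaMemcpy calls, the second after the copy back (assumes <stdio.h>).
for (int i = 0; i < N; i++) {       // load arrays with data
    a[i] = i;
    b[i] = 2 * i;
}
…
for (int i = 0; i < N; i++)         // verify results after copy back
    if (c[i] != a[i] + b[i])
        printf("Error at element %d\n", i);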
14 So far, the organization of threads is one block of N threads. GPUs actually execute blocks of threads arranged in 1 or 2 dimensions -- the collection of blocks being called a grid. The threads within each block can themselves be organized in 1-D, 2-D, or 3-D.
15 CUDA SIMT Thread Structure (figure from the CUDA C Programming Guide, v3.2, 2010, NVIDIA: the grid can be 1 or 2 dimensions; a block can be 1, 2 or 3 dimensions). Allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU. Linked to the internal organization of the GPU. Threads in one block execute together.
16 Device characteristics -- some limitations. NVIDIA defines "compute capabilities" 1.0, 1.1, …, which specify the limits and features supported by a device. Compute capability 1.0: maximum number of threads per block = 512; maximum size of the x- and y-dimensions of a thread block = 512; maximum size of each dimension of a grid of thread blocks = 65535.
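These limits can also be queried at run time; a minimal sketch using cudaGetDeviceProperties (not part of the original slides), assuming device 0 and that <stdio.h> is included:
// Query the limits of device 0 at run time.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("Compute capability %d.%d\n", prop.major, prop.minor);
printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("Max block dims: %d x %d x %d\n",
       prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
printf("Max grid dims: %d x %d x %d\n",
       prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);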
17 Defining Grid/Block Structure. Need to provide each kernel call with values for two key structures, the number of blocks in each dimension and the threads per block in each dimension: myKernel<<<B, T>>>(arg1, … ); B – a structure that defines the number of blocks in the grid in each dimension (1-D or 2-D). T – a structure that defines the number of threads in a block in each dimension (1-D, 2-D, or 3-D).
18 1-D grid and/or 1-D blocks. If a 1-D structure is wanted, an integer can be used for B and T in: myKernel<<<B, T>>>(arg1, … ); B – an integer would define a 1-D grid of that size; T – an integer would define a 1-D block of that size. Example: myKernel<<<1, N>>>(arg1, … ); -- one block of N threads.
19 CUDA Built-in Variables for a 1-D grid and 1-D block: threadIdx.x -- the "thread index" within the block in the "x" dimension; blockIdx.x -- the "block index" within the grid in the "x" dimension; blockDim.x -- the "block dimension" in the "x" dimension (i.e. the number of threads in a block in the x dimension). The full global thread ID in the x dimension can be computed by: x = blockIdx.x * blockDim.x + threadIdx.x;
20 Example -- x direction. A 1-D grid and 1-D blocks: 4 blocks, each having 8 threads (gridDim = 4 x 1, blockDim = 8 x 1). Within each block, threadIdx.x runs from 0 to 7; blockIdx.x runs from 0 to 3 across the grid. With linear global addressing, global thread ID = blockIdx.x * blockDim.x + threadIdx.x; e.g. the thread with blockIdx.x = 3 and threadIdx.x = 2 has global ID = 3 * 8 + 2 = 26. Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20, 2010.
21 Code example with a 1-D grid and 1-D blocks: vector addition. N/T blocks map each vector across the grid, one element of each vector per thread; N/T is assumed to be an integer.
#define N 2048   // size of vectors
#define T 256    // number of threads per block
__global__ void vecAdd(int *A, int *B, int *C) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  C[i] = A[i] + B[i];
}
int main (int argc, char **argv) {
  …
  vecAdd<<<N/T, T>>>(devA, devB, devC);   // assumes N/T is an integer
  …
  return (0);
}
Note: __global__ is a CUDA function qualifier (__ is two underscores); a __global__ function must return void.
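If N were not an exact multiple of T, a common variation (a hedged sketch, not from the original slide) rounds the number of blocks up and guards the access in the kernel:
// Handle N that is not a multiple of T.
__global__ void vecAdd(int *A, int *B, int *C, int n) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)                      // guard: extra threads in the last block do nothing
    C[i] = A[i] + B[i];
}
…
int blocks = (N + T - 1) / T;     // round up so all N elements are covered
vecAdd<<<blocks, T>>>(devA, devB, devC, N);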
22 Higher dimensional grids/blocks. A 1-D grid and 1-D block are suitable for processing one-dimensional data. Higher dimensional grids and blocks are convenient for higher dimensional data: processing 2-D arrays might use a two-dimensional grid and two-dimensional block. Higher dimensions might also be needed because of the limitation on the size of a block in each dimension. CUDA provides built-in variables and structures to define the number of blocks in the grid in each dimension and the number of threads in a block in each dimension.
23 Grid/Block Sizes. Built-in CUDA data types and structures define the multidimensional structures. dim3 can be considered essentially as a CUDA-defined structure of unsigned integers x, y, z, i.e.: struct dim3 { unsigned int x, y, z; };
dim3 gridDim -- grid dimensions, x and y (z not used). Number of blocks in the grid = gridDim.x * gridDim.y.
dim3 blockDim -- size of the block dimensions x, y, and z. Number of threads in a block = blockDim.x * blockDim.y * blockDim.z.
24 Example: Initializing Values. To set dimensions, use for example:
dim3 grid(16, 16);    // Grid -- 16 x 16 blocks
dim3 block(32, 32);   // Block -- 32 x 32 threads
myKernel<<<grid, block>>>(...);
which sets: gridDim.x = 16, gridDim.y = 16, blockDim.x = 32, blockDim.y = 32, blockDim.z = 1 (although you do not initialize CUDA structure elements that way).
25 CUDA Built-in Variables for Grid/Block Indices. uint3 can be considered essentially as a CUDA-defined structure of unsigned integers x, y, z, i.e.: struct uint3 { unsigned int x, y, z; };
Block index within the grid: uint3 blockIdx -- blockIdx.x, blockIdx.y (z not used).
Thread index within the block: uint3 threadIdx -- threadIdx.x, threadIdx.y, threadIdx.z.
26 2-D Grids and 2-D Blocks (figure). A thread's global position is given by blockIdx.x * blockDim.x + threadIdx.x in the x dimension and blockIdx.y * blockDim.y + threadIdx.y in the y dimension.
27 Flattening arrays onto linear memory. Generally, memory is allocated dynamically on the device (GPU) and we cannot use two-dimensional indices (e.g. A[row][column]) to access the array as we might otherwise. We need to know how the array is laid out in memory and then compute the distance from the beginning of the array. C uses row-major order -- rows are stored one after the other in memory, i.e. row 0, then row 1, etc.
28 Flattening an array (figure). For an array with N columns, array element a[row][column] = a[offset], where offset = column + row * N (row * number of columns skips over the complete rows before the one wanted). For example, with N = 4 columns, a[2][3] is at offset 3 + 2 * 4 = 11.
29 With one thread per array element:
int col = blockIdx.x*blockDim.x + threadIdx.x;
int row = blockIdx.y*blockDim.y + threadIdx.y;
int index = col + row * N;    // flattened index for this thread
A[index] = …
30 CUDA version using a 2-D grid and 2-D blocks: adding two arrays, where one thread handles one element in each array.
#define N 2048    // size of arrays
__global__ void addMatrix (int *a, int *b, int *c) {
  int col = blockIdx.x*blockDim.x + threadIdx.x;
  int row = blockIdx.y*blockDim.y + threadIdx.y;
  int index = col + row * N;
  if ( col < N && row < N) c[index] = a[index] + b[index];
}
int main() {
  ...
  dim3 dimBlock (16, 16);
  dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
  addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);
  …
}
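As in the 1-D case, if N were not a multiple of the block dimensions, the grid could be rounded up; a hedged sketch (not from the original slide):
// Round the grid up when N is not a multiple of the block size.
// The if (col < N && row < N) guard in the kernel then discards the extra threads.
dim3 dimBlock(16, 16);
dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
             (N + dimBlock.y - 1) / dimBlock.y);
addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);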
31 Compiling CUDA programs: "nvcc". NVIDIA provides nvcc -- the NVIDIA CUDA "compiler driver". It separates out the code for the host and for the device. A regular C/C++ compiler is used for the host (and needs to be available). The programmer simply uses nvcc instead of the gcc/cc compiler on a Linux system. Command line options include options for GPU features.
32 Compiling code - Linux. Command line:
nvcc -O3 -o <executable> <source file> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda -lcudart
(-O3: optimization level, if you want optimized code; -I: directories for #include files; -L: directories for libraries; -l: libraries to be linked.)
A CUDA source file that includes device code has the extension .cu. nvcc separates the code for the CPU and for the GPU and compiles the code. A regular C compiler needs to be installed for the CPU. A Makefile is convenient -- see next. See "The CUDA Compiler Driver NVCC" from NVIDIA for more details.
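For example (with hypothetical file names, not from the original slide), compiling the earlier vector-addition program might look like:
nvcc -O3 -o vecadd vecadd.cu -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda -lcudart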
33 Very simple sample Makefile:
NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda
NVCCFLAGS = -I$(CUDAPATH)/include
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm
prog1:    # a regular C program
	cc -o prog1 prog1.c -lm
prog2:    # a C program with X11 graphics
	cc -I/usr/openwin/include -o prog2 prog2.c -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm
prog3:    # a CUDA program
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o prog3 prog3.cu
prog4:    # a CUDA program with X11 graphics
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -I/usr/openwin/include -o prog4 prog4.cu -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm
34 Compilation process (figure: nvcc invoking gcc and ptxas). The nvcc "wrapper" divides the code into host and device parts. The host part is compiled by the regular C compiler; the device part is compiled by the NVIDIA "ptxas" assembler. The two compiled object files are then combined into one executable, e.g.: nvcc -o prog prog.cu -I/includepath -L/libpath. The executable file is a "fat" binary with both the host and device code.
35 Executing Program. Simply type the name of the executable created by nvcc: ./prog1. The file includes all the code for the host and for the device in a "fat binary" file. The host code starts running; when a device kernel is first encountered, the GPU code is physically sent to the GPU and the function is launched on the GPU. Hence the first launch will be slow!! The run-time environment (cudart) controls memcpy timing and synchronization.
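Because the first launch is slow, a common practice when timing kernels (a hedged sketch, not from the original slides, reusing the earlier example's names) is to launch the kernel once as a warm-up before measuring:
// Warm-up launch so the slow first launch is excluded from timing.
vecAdd<<<N/T, T>>>(devA, devB, devC);   // warm-up (results ignored)
cudaThreadSynchronize();                // wait for the GPU to finish
// ... start timer, launch the kernel again, stop timer ...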
36 Ways to measure time of execution. Generally, instrument the code: measure the time at two places and take the difference. Ways to measure time: C clock() or time() routines; CUDA "events" (seems the best way); the CUDA SDK timer.
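As a hedged sketch of the first option (host-side C timing, not from the original slides), using gettimeofday() for wall-clock time; note that kernel launches are asynchronous, so the GPU must be synchronized before stopping the clock:
// Host-side wall-clock timing (assumes <stdio.h> and <sys/time.h> are included).
struct timeval t1, t2;
gettimeofday(&t1, NULL);                 // start timer
vecAdd<<<N/T, T>>>(devA, devB, devC);    // work being timed (placeholder kernel)
cudaThreadSynchronize();                 // kernel launch is asynchronous -- wait for GPU
gettimeofday(&t2, NULL);                 // stop timer
double ms = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0;
printf("Elapsed time: %f ms\n", ms);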
37 Timing GPU Execution with CUDA events. Code:
cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);    // create event objects
cudaEventCreate(&stop);
cudaEventRecord(start, 0);  // record start event
// ... code being timed (the "time period") goes here ...
cudaEventRecord(stop, 0);   // record end event
cudaEventSynchronize(stop); // wait for all device work to complete
cudaEventElapsedTime(&elapsedTime, start, stop); // time between events
cudaEventDestroy(start);    // destroy start event
cudaEventDestroy(stop);     // destroy stop event
38 Recording Events. cudaEventRecord(event1, 0) records an "event" into the default "stream" (0). The device will record a timestamp for the event when it reaches that event in the stream, that is, after all preceding operations have completed. (Default stream 0 means completed in the CUDA context.) NOTE: This operation is asynchronous and may return before recording the event!
39 Making sure the event is actually recorded. cudaEventSynchronize(event) waits until the named event is actually recorded. The event is recorded when all work the threads must complete prior to the specified event has been done. (Not strictly necessary if there is a synchronous CUDA call in the code.)
40 Measuring time between two events. cudaEventElapsedTime(&time, event1, event2) returns (through the pointer argument) the time elapsed between the two events, in milliseconds. The resolution is approximately ½ millisecond. Timing is measured using the GPU clock.
41 Questions