A short introduction to nVidia's CUDA
Alexander Heinecke, Technical University of Munich (TUM)
Ferienakademie 2007

Overview
1. Differences between CPU and GPU
   1. General CPU/GPU properties
   2. Compare specifications
2. CUDA Programming Model
   1. Application stack
   2. Thread implementation
   3. Memory model
3. CUDA API
   1. Extension of the C/C++ programming language
   2. Example structure of a CUDA application
4. Examples
   1. Matrix addition
   2. Matrix multiplication
   3. Jacobi & Gauss-Seidel
5. Benchmark Results

Differences between CPU and GPU
GPU: nearly all transistors are ALUs
CPU: most of the transistors are cache
(figure taken from [NV1])

AMD Opteron die shot (figure)

Intel Itanium2 Dual-Core die shot (figure)

Intel Core Architecture Pipeline / Simple Example (taken from [IN1])
Successive pipeline cycles, one per row (axis labels on the slide: "Pipeline cycle", "Step 1", "Step 2", "Step 3", "Step 4", ...):
    IFETCH #1
    IFETCH #2   IDEC #1
    IFETCH #3   IDEC #2   OFETCH #1
    IFETCH #4   IDEC #3   OFETCH #2   EXEC #1
    IFETCH #5   IDEC #4   OFETCH #3   EXEC #2   RET #1
    IFETCH #6   IDEC #5   OFETCH #4   EXEC #3   RET #2
    IFETCH #7   IDEC #6   OFETCH #5   EXEC #4   RET #3

nVidia G80 Pipeline (figure)

Properties of CPU and GPU
Property                        Intel Xeon X5355                     nVidia G80 (8800 GTX)
Clock speed                     2.66 GHz                             575 MHz
#Cores / SPEs                   4                                    128
Floats in register
Max. GFlop/s (float)            84 (practical), 85 (theoretical)     460 (practical), 500 (theoretical)
Max. instructions               RAM limited                          2 million G80 ASM instructions
Typical instruction duration    1-2 cycles (SSE)                     min. 4 cycles
Price (€)                       800                                  500

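The theoretical Xeon figure in the table can be reproduced with a short calculation. This is only a sketch, and it rests on one assumption that is not on the slide: each core completes 8 single-precision operations per cycle (one 4-wide SSE addition plus one 4-wide SSE multiplication).

```latex
\[
P_{\text{theo}}
  = N_{\text{cores}} \cdot f_{\text{clock}} \cdot \frac{\text{flops}}{\text{cycle}}
  = 4 \cdot 2.66\,\text{GHz} \cdot 8
  \approx 85\ \text{GFlop/s}
\]
```

The G80 value is not derived here, because the table does not list the per-cycle throughput that the same calculation would need.
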
History: Power of GPUs in the last four years (figure taken from [NV1])

Application stack of CUDA (figure taken from [NV1])

Thread organization in CUDA (figure taken from [NV1])

Memory organization in CUDA (figure taken from [NV1])

Extensions to C (functions and variables)
CUDA code is saved in special files (*.cu).
These files are precompiled by nvcc (the nVidia compiler).
Function type qualifiers decide where a function executes:
– __host__ (CPU only, called by CPU)
– __global__ (GPU only, called by CPU)
– __device__ (GPU only, called by GPU)
Variable type qualifiers: __device__, __constant__, __shared__

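The qualifiers listed above can be seen together in one small .cu file. The following is a minimal sketch, not code from the talk: the names scale_factor, scaled, scale_kernel and scale_on_gpu are hypothetical, the block size of 256 is an arbitrary choice, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// __constant__ variable: read-only for kernels, stored in constant memory.
__constant__ float scale_factor;

// __device__ function: runs on the GPU and can only be called from GPU code.
__device__ float scaled(float x) { return scale_factor * x; }

// __global__ function (kernel): runs on the GPU, launched from the CPU.
__global__ void scale_kernel(float* data, int n)
{
    // __shared__ variable: one copy per thread block.
    // Its size must match the block size used at launch (256, assumed here).
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];          // each thread touches only its own slot
        data[i] = scaled(tile[threadIdx.x]);
    }
}

// __host__ function: ordinary CPU code (the qualifier is optional).
__host__ std::vector<float> scale_on_gpu(std::vector<float> v, float factor)
{
    if (v.empty()) return v;
    const int n = static_cast<int>(v.size());
    const size_t bytes = n * sizeof(float);

    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, bytes);
    cudaMemcpy(d_data, v.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale_factor, &factor, sizeof(float));

    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaMemcpy(v.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return v;
}
```

Calling scale_on_gpu({1, 2, 3}, 2) from host code should return the input scaled by 2; compile the file with nvcc.
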
Example structure of a CUDA application
Use at least two functions to isolate the CUDA code from your application.
First function:
– Init CUDA
– Copy data to device
– Call kernel with execution settings
– Copy data to host and shut down (automatic)
Second function (kernel):
– Contains the problem for ONE thread

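The two-function structure described above can be sketched as follows. This is an illustration under stated assumptions rather than the talk's code: add_kernel, add_on_gpu and the helper ok are hypothetical names, vector addition stands in for the real problem, and initialization is left to the runtime API, which sets up the device implicitly on the first call.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Second function (kernel): the problem for ONE thread.
__global__ void add_kernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: report a failed runtime call, return true on success.
static bool ok(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// First function: isolates all CUDA calls from the rest of the application.
bool add_on_gpu(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c)
{
    if (a.empty() || a.size() != b.size()) return false;
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);
    c.resize(n);

    // Init and allocation (the first runtime call initializes the device).
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool good = ok(cudaMalloc((void**)&d_a, bytes), "cudaMalloc a")
             && ok(cudaMalloc((void**)&d_b, bytes), "cudaMalloc b")
             && ok(cudaMalloc((void**)&d_c, bytes), "cudaMalloc c");

    // Copy data to device.
    good = good && ok(cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice), "copy a")
                && ok(cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice), "copy b");

    // Call kernel with execution settings (grid size, block size).
    if (good) {
        const int threads = 128;
        const int blocks = (n + threads - 1) / threads;
        add_kernel<<<blocks, threads>>>(d_a, d_b, d_c, n);
        good = ok(cudaGetLastError(), "kernel launch");
    }

    // Copy data back to host; this call waits for the kernel to finish.
    good = good && ok(cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost), "copy c");

    // cudaFree on a null pointer is a no-op, so this is safe after a failure.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return good;
}
```

The application only ever calls add_on_gpu, so it can be built without knowing that a GPU is involved.
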
Tested Algorithms (2D Arrays)
All tested algorithms operate on 2D arrays:
– Matrix addition
– Matrix multiplication
– Jacobi & Gauss-Seidel (iterative solvers)

Example Matrix Addition (Init function)
    CUT_DEVICE_INIT();

    // allocate device memory
    float* d_A;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size));
    …
    // copy host memory to device
    CUDA_SAFE_CALL(cudaMemcpy(d_A, ma_a, mem_size, cudaMemcpyHostToDevice));
    …
    cudaBindTexture(0, texRef_MaA, d_A, mem_size);   // texture binding
    …
    // execution settings: one thread per matrix element
    dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
    dim3 grid(n_dim / threads.x, n_dim / threads.y);

    // execute the kernel
    cuMatrixAdd_kernel<<<grid, threads>>>(d_C, n_dim);

    cudaUnbindTexture(texRef_MaA);                   // texture unbinding
    …
    // copy result from device to host
    CUDA_SAFE_CALL(cudaMemcpy(ma_c, d_C, mem_size, cudaMemcpyDeviceToHost));
    …
    CUDA_SAFE_CALL(cudaFree(d_A));

Example Matrix Addition (kernel)
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // index of the first element of this block's tile
    int start = (n_dim * by * BLOCK_SIZE_GPU) + bx * BLOCK_SIZE_GPU;

    C[start + (n_dim * ty) + tx] =
        tex1Dfetch(texRef_MaA, start + (n_dim * ty) + tx) +
        tex1Dfetch(texRef_MaB, start + (n_dim * ty) + tx);

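For comparison, here is a self-contained sketch of the same matrix addition that reads A and B from plain global memory instead of through texture references. That substitution is an assumption made for portability, since the legacy texture-reference API is not available in recent CUDA toolkits. The value 16 for BLOCK_SIZE_GPU and the names matrix_add_kernel and matrix_add are also assumptions; the slides do not state them. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define BLOCK_SIZE_GPU 16   // assumed tile edge length

// One thread per matrix element, row-major n_dim x n_dim matrices.
__global__ void matrix_add_kernel(const float* A, const float* B, float* C, int n_dim)
{
    int col = blockIdx.x * BLOCK_SIZE_GPU + threadIdx.x;
    int row = blockIdx.y * BLOCK_SIZE_GPU + threadIdx.y;
    int idx = row * n_dim + col;   // equals start + (n_dim * ty) + tx on the slide
    C[idx] = A[idx] + B[idx];
}

// Host wrapper: n_dim must be a multiple of BLOCK_SIZE_GPU, as on the slide,
// because the grid size is computed by integer division.
std::vector<float> matrix_add(const std::vector<float>& a,
                              const std::vector<float>& b, int n_dim)
{
    assert(n_dim > 0 && n_dim % BLOCK_SIZE_GPU == 0);
    assert(a.size() == static_cast<size_t>(n_dim) * n_dim && b.size() == a.size());
    const size_t mem_size = a.size() * sizeof(float);

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    cudaMalloc((void**)&d_A, mem_size);
    cudaMalloc((void**)&d_B, mem_size);
    cudaMalloc((void**)&d_C, mem_size);
    cudaMemcpy(d_A, a.data(), mem_size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, b.data(), mem_size, cudaMemcpyHostToDevice);

    dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
    dim3 grid(n_dim / threads.x, n_dim / threads.y);
    matrix_add_kernel<<<grid, threads>>>(d_A, d_B, d_C, n_dim);

    std::vector<float> c(a.size());
    cudaMemcpy(c.data(), d_C, mem_size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return c;
}
```

Each element is read twice and written once for a single addition, which is why the conclusion of the talk singles out matrix addition as memory bound.
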
Example Matrix Multiplication (kernel)
    int tx2 = tx + BLOCK_SIZE_GPU;
    int ty2 = n_dim * ty;

    float Csub1 = 0.0;
    float Csub2 = 0.0;

    int b = bBegin;
    for (int a = aBegin; a <= aEnd; a += aStep)
    {
        // tile of A in shared memory
        __shared__ float As[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU];
        AS(ty, tx) = A[a + ty2 + tx];

        // double-width tile of B in shared memory
        __shared__ float B1s[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU*2];
        B1S(ty, tx) = B[b + ty2 + tx];
        B1S(ty, tx2) = B[b + ty2 + tx2];

        __syncthreads();

        Csub1 += AS(ty, 0) * B1S(0, tx);
        // more calcs

        b += bStep;
    }
    __syncthreads();
    // Write result back

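The slide shows only a fragment of the blocked kernel. A complete, simplified sketch of the same blocking idea is given below. It is not the author's kernel: it computes one output element per thread instead of two (Csub1, Csub2), uses square tiles for both A and B, and assumes BLOCK_SIZE_GPU is 16 and n is a multiple of it. The names matmul_kernel and matmul are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define BLOCK_SIZE_GPU 16   // assumed tile edge length

// C = A * B for row-major n x n matrices, one output element per thread.
__global__ void matmul_kernel(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU];
    __shared__ float Bs[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * BLOCK_SIZE_GPU + ty;
    int col = blockIdx.x * BLOCK_SIZE_GPU + tx;

    float Csub = 0.0f;
    for (int t = 0; t < n / BLOCK_SIZE_GPU; ++t) {
        // Each thread loads one element of the current A tile and B tile.
        As[ty][tx] = A[row * n + t * BLOCK_SIZE_GPU + tx];
        Bs[ty][tx] = B[(t * BLOCK_SIZE_GPU + ty) * n + col];
        __syncthreads();   // tiles fully loaded before anyone reads them

        for (int k = 0; k < BLOCK_SIZE_GPU; ++k)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads();   // everyone done reading before the next load
    }
    C[row * n + col] = Csub;   // write result back
}

std::vector<float> matmul(const std::vector<float>& a,
                          const std::vector<float>& b, int n)
{
    assert(n > 0 && n % BLOCK_SIZE_GPU == 0);
    assert(a.size() == static_cast<size_t>(n) * n && b.size() == a.size());
    const size_t bytes = a.size() * sizeof(float);

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemcpy(d_A, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
    dim3 grid(n / BLOCK_SIZE_GPU, n / BLOCK_SIZE_GPU);
    matmul_kernel<<<grid, threads>>>(d_A, d_B, d_C, n);

    std::vector<float> c(a.size());
    cudaMemcpy(c.data(), d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return c;
}
```

With tiles in shared memory, each element of A and B is fetched from global memory n / BLOCK_SIZE_GPU times instead of n times, which is the point of blocking.
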
Example Jacobi (kernel), no internal loops
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index (shifted by one to skip the boundary)
    int tx = threadIdx.x+1;
    int ty = threadIdx.y+1;

    int ustart = ((by * BLOCK_SIZE_GPU) * n_dim) + (bx * BLOCK_SIZE_GPU);

    float res = tex1Dfetch(texRef_MaF, ustart + (ty * n_dim) + tx) * qh;
    res += tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx - 1) +
           tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx + 1);
    res += tex1Dfetch(texRef_MaU, ustart + ((ty+1) * n_dim) + tx) +
           tex1Dfetch(texRef_MaU, ustart + ((ty-1) * n_dim) + tx);
    res = 0.25f * res;

    ma_u[ustart + (ty * n_dim) + tx] = res;

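A self-contained sketch of this Jacobi step, including the host loop around it, might look as follows. Several points are assumptions rather than the author's code: plain global-memory reads replace the texture fetches, two buffers are swapped after every sweep, BLOCK_SIZE_GPU is 16, and jacobi_kernel and jacobi are hypothetical names. The interior size n_dim - 2 must be a multiple of BLOCK_SIZE_GPU; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <utility>
#include <vector>

#define BLOCK_SIZE_GPU 16   // assumed tile edge length

// One Jacobi sweep over the interior of a row-major n_dim x n_dim grid.
__global__ void jacobi_kernel(const float* u_old, const float* f, float* u_new,
                              int n_dim, float qh)
{
    // +1 skips the boundary row and column, as on the slide.
    int col = blockIdx.x * BLOCK_SIZE_GPU + threadIdx.x + 1;
    int row = blockIdx.y * BLOCK_SIZE_GPU + threadIdx.y + 1;
    int idx = row * n_dim + col;

    float res = f[idx] * qh;
    res += u_old[idx - 1] + u_old[idx + 1];           // left and right neighbour
    res += u_old[idx + n_dim] + u_old[idx - n_dim];   // neighbours in the rows below and above
    u_new[idx] = 0.25f * res;
}

// Runs `sweeps` Jacobi sweeps and returns the updated grid.
std::vector<float> jacobi(std::vector<float> u, const std::vector<float>& f,
                          int n_dim, float qh, int sweeps)
{
    assert(n_dim > 2 && (n_dim - 2) % BLOCK_SIZE_GPU == 0);
    assert(u.size() == static_cast<size_t>(n_dim) * n_dim && f.size() == u.size());
    const size_t bytes = u.size() * sizeof(float);

    float *d_old = nullptr, *d_new = nullptr, *d_f = nullptr;
    cudaMalloc((void**)&d_old, bytes);
    cudaMalloc((void**)&d_new, bytes);
    cudaMalloc((void**)&d_f, bytes);
    // Both buffers start from u so that the boundary values exist in each.
    cudaMemcpy(d_old, u.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_new, u.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_f, f.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
    dim3 grid((n_dim - 2) / BLOCK_SIZE_GPU, (n_dim - 2) / BLOCK_SIZE_GPU);
    for (int s = 0; s < sweeps; ++s) {
        jacobi_kernel<<<grid, threads>>>(d_old, d_f, d_new, n_dim, qh);
        std::swap(d_old, d_new);   // the newest values are now in d_old
    }

    cudaMemcpy(u.data(), d_old, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_old);
    cudaFree(d_new);
    cudaFree(d_f);
    return u;
}
```

Keeping separate input and output buffers is what makes this a Jacobi iteration: every thread reads only values from the previous sweep, regardless of the order in which blocks are scheduled.
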
Example Jacobi (kernel), internal loops
    int tx = threadIdx.x+1;
    int ty = threadIdx.y+1;
    // *some more inits*

    // load to calc u_ij
    __shared__ float Us[BLOCK_SIZE_GPU+2][BLOCK_SIZE_GPU+2];
    US(ty, tx) = tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx);
    // *init edge u*
    …
    for (unsigned int i = 0; i < n_intern_loops; i++)
    {
        res = funk;
        res += US(ty, tx - 1) + US(ty, tx + 1);
        res += US(ty - 1, tx) + US(ty + 1, tx);
        res = 0.25f * res;

        __syncthreads();   // not used in parallel jacobi
        US(ty, tx) = res;
    }
    ma_u[ustart + (ty * n_dim) + tx] = res;

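The parts the slide elides (the extra inits and the edge loading) can be filled in several ways. The sketch below shows one possible reading and should not be taken as the author's kernel: the halo is loaded by the threads on the tile edges, the right-hand side is read from global memory, and two barriers per inner iteration are used so that every iteration reads only values from the previous one. BLOCK_SIZE_GPU is assumed to be 16, jacobi_inner_kernel and jacobi_inner are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define BLOCK_SIZE_GPU 16   // assumed tile edge length

// Several Jacobi iterations on one tile held in shared memory.
// The halo around the tile stays fixed during the inner loop.
__global__ void jacobi_inner_kernel(const float* u_in, const float* f, float* u_out,
                                    int n_dim, float qh, int n_intern_loops)
{
    __shared__ float Us[BLOCK_SIZE_GPU + 2][BLOCK_SIZE_GPU + 2];

    int tx = threadIdx.x + 1;
    int ty = threadIdx.y + 1;
    int idx = (blockIdx.y * BLOCK_SIZE_GPU + ty) * n_dim
            + (blockIdx.x * BLOCK_SIZE_GPU + tx);

    // Load the tile; threads on its edges also load the adjacent halo cell.
    // The halo corners are never read by the 5-point stencil.
    Us[ty][tx] = u_in[idx];
    if (tx == 1)              Us[ty][0]                  = u_in[idx - 1];
    if (tx == BLOCK_SIZE_GPU) Us[ty][BLOCK_SIZE_GPU + 1] = u_in[idx + 1];
    if (ty == 1)              Us[0][tx]                  = u_in[idx - n_dim];
    if (ty == BLOCK_SIZE_GPU) Us[BLOCK_SIZE_GPU + 1][tx] = u_in[idx + n_dim];
    __syncthreads();

    const float funk = f[idx] * qh;   // right-hand side term, constant per cell
    float res = Us[ty][tx];
    for (int i = 0; i < n_intern_loops; ++i) {
        res = funk;
        res += Us[ty][tx - 1] + Us[ty][tx + 1];
        res += Us[ty - 1][tx] + Us[ty + 1][tx];
        res = 0.25f * res;
        __syncthreads();   // all reads of the old values are finished
        Us[ty][tx] = res;
        __syncthreads();   // all new values are visible before the next reads
    }
    u_out[idx] = res;
}

std::vector<float> jacobi_inner(std::vector<float> u, const std::vector<float>& f,
                                int n_dim, float qh, int n_intern_loops)
{
    assert(n_dim > 2 && (n_dim - 2) % BLOCK_SIZE_GPU == 0);
    assert(u.size() == static_cast<size_t>(n_dim) * n_dim && f.size() == u.size());
    const size_t bytes = u.size() * sizeof(float);

    float *d_in = nullptr, *d_out = nullptr, *d_f = nullptr;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMalloc((void**)&d_f, bytes);
    cudaMemcpy(d_in, u.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, u.data(), bytes, cudaMemcpyHostToDevice);   // keeps the boundary
    cudaMemcpy(d_f, f.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
    dim3 grid((n_dim - 2) / BLOCK_SIZE_GPU, (n_dim - 2) / BLOCK_SIZE_GPU);
    jacobi_inner_kernel<<<grid, threads>>>(d_in, d_f, d_out, n_dim, qh, n_intern_loops);

    cudaMemcpy(u.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_f);
    return u;
}
```

Because the halo is not refreshed inside the loop, more than one inner iteration gives a block-wise approximation of the global Jacobi iteration; the host still has to launch the kernel repeatedly to exchange values between tiles.
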
Performance Results (1) (figure)

Performance Results (2) (figure)

Performance Results (3) (figure)

Performance Results (4) (figure)

Conclusion (points to take care of)
– Minimize the number of memory accesses.
– Use unrolling instead of for loops.
– Use blocking algorithms.
– Implement only algorithms with CUDA that are not extremely memory bound (NOT matrix addition).
– Try to avoid the if statement and other control-flow statements (slow).

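Two of the points above, unrolling and avoiding if statements, can be illustrated in one small kernel. This is a hedged sketch with hypothetical names (clamped_sum_kernel, clamped_sums, K); it sums the positive values among K inputs per thread. How much either technique gains depends on the compiler and the hardware generation, so the effect should be measured rather than assumed. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define K 8   // values per thread; a compile-time constant so the loop can be unrolled

// Thread i sums max(x, 0) over its K inputs. Input k of thread i is stored at
// in[k * n_out + i], so neighbouring threads read neighbouring addresses.
__global__ void clamped_sum_kernel(const float* in, float* out, int n_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_out) return;   // bounds guard; only threads of the last block can differ here

    float sum = 0.0f;
    #pragma unroll            // ask nvcc to replace the loop by K copies of its body
    for (int k = 0; k < K; ++k) {
        // Branch-free form of: if (x > 0.0f) sum += x;
        sum += fmaxf(in[k * n_out + i], 0.0f);
    }
    out[i] = sum;
}

std::vector<float> clamped_sums(const std::vector<float>& in, int n_out)
{
    assert(n_out > 0 && in.size() == static_cast<size_t>(n_out) * K);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void**)&d_in, in.size() * sizeof(float));
    cudaMalloc((void**)&d_out, n_out * sizeof(float));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    clamped_sum_kernel<<<(n_out + threads - 1) / threads, threads>>>(d_in, d_out, n_out);

    std::vector<float> out(n_out);
    cudaMemcpy(out.data(), d_out, n_out * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

The remaining bounds guard shows that some control flow is hard to avoid; it is cheap when all threads that execute together take the same path.
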
Appendix - References
[NV1] NVIDIA CUDA Compute Unified Device Architecture, Programming Guide; nVidia Corporation, Version 1.0
[IN1/2/3] Intel Architecture Handbook, Version November 2006
[NR] Numerical Recipes (online generated PDF)