1
Programming with CUDA WS 08/09 Lecture 2 Tue, 28 Oct, 2008
2
Previously
Organization
–Course structure, timings, locations
–People, reference material
Brief intro to GPGPU, CUDA
Sign up for access to the CUDA machine
–Check today during exercise
3
Today
Grading
Course website unchanged
–http://theinf2.informatik.uni-jena.de/For+Students/CUDA.html
The CUDA programming model
–Exercise
4
Grading
Need 50% of marks from exercises to qualify for the final project
Final grade will be determined by an exam based on the project
5
Recap... GPU
–Graphics Processing Unit
–Handles the values of pixels displayed on screen
Highly parallel computation
–Optimized for parallel computations
6
Recap... GPGPU
–General Purpose computing on GPUs
–Many non-graphics applications can be parallelized
These can then be ported to a GPU implementation
8
Recap... CUDA – Compute Unified Device Architecture
–Software: minimal extension to the C programming language
–Hardware: supports the software
Thus, CUDA enables
–GPGPU for non-graphics people
9
CUDA
10
The CUDA Programming Model
11
CUDA
12
GPU as co-processor
The application runs on the CPU (host)
Compute-intensive parts are delegated to the GPU (device)
These parts are written as C functions (kernels)
The kernel is executed on the device simultaneously by N threads
14
GPU as co-processor
Compute-intensive tasks are defined as kernels
The host delegates kernels to the device
The device executes a kernel with N parallel threads
Each thread has a thread ID
The thread ID is accessible in a kernel via the threadIdx variable
15
Example: Vector addition – CPU version
Total time = N * time for 1 addition
(Diagram: a single thread, Thread 1, performs all N additions one after another.)
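For comparison, the sequential CPU version is just a loop over the elements; the function name and signature below are illustrative, not from the slide.

void vecAddCPU (const float* A, const float* B, float* C, int N)
{
    // one thread of control performs all N additions, one after another
    for (int i = 0; i < N; ++i)
        C[i] = A[i] + B[i];
}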
16
Example: Vector addition – GPU version
Total time = time for 1 addition
(Diagram: Thread 1, Thread 2, Thread 3, Thread 4, ..., Thread N each perform one addition in parallel.)
17
CUDA kernel
Example: definition

__global__ void vecAdd (float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
18
CUDA kernel
Example: invocation

int main()
{
    // init host vectors, size N: h_A, h_B, h_C
    // init device
    // copy to device: h_A -> d_A, h_B -> d_B, h_C -> d_C
    vecAdd<<<1, N>>>(d_A, d_B, d_C);
    // copy to host: d_C -> h_C
    // do stuff
    // free host variables
    // free device variables
}
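The commented steps above could be filled in with the CUDA runtime API roughly as follows. This is only a sketch: the vector size, the initial values, and the absence of error checking are assumptions, not part of the slide.

#include <cuda_runtime.h>
#include <stdlib.h>

// vecAdd kernel as defined on the previous slide
__global__ void vecAdd (float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;                     // assumed vector size
    size_t bytes = N * sizeof(float);

    // init host vectors h_A, h_B, h_C
    float* h_A = (float*)malloc(bytes);
    float* h_B = (float*)malloc(bytes);
    float* h_C = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    // init device vectors d_A, d_B, d_C
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    // copy to device: h_A -> d_A, h_B -> d_B
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // launch one block of N threads
    vecAdd<<<1, N>>>(d_A, d_B, d_C);

    // copy to host: d_C -> h_C
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // free device variables
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

    // free host variables
    free(h_A); free(h_B); free(h_C);
    return 0;
}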
19
Thread organization
Threads are organized in blocks
A block can be a 1D, 2D or 3D array of threads
–threadIdx is a 3-component vector
–How many of its components are used depends on how the kernel is called
21
Thread organization
Example of a 1D block

Invoke (in main):
int N;  // assign some value to N
vecAdd<<<1, N>>>(d_A, d_B, d_C);

Access (in kernel):
int i = threadIdx.x;
22
Thread organization
Example of a 2D block

Invoke (in main):
dim3 blockDimension (N, N);  // N pre-assigned
matAdd<<<1, blockDimension>>>(d_A, d_B, d_C);

Access (in kernel):
__global__ void matAdd (float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
23
Thread organization
Similarly, for a 3D block

Invoke (in main):
dim3 blockDimension (N, N, 3);  // N pre-assigned
matAdd<<<1, blockDimension>>>(d_A, d_B, d_C);

Access (in kernel):
int i = threadIdx.x;
int j = threadIdx.y;
int k = threadIdx.z;
24
Thread organization
Each thread in a block has a unique thread ID
The thread ID is NOT the same as threadIdx
–1D block of dim Dx: thread index x, thread ID = x
–2D block of dim (Dx,Dy): thread index (x,y), thread ID = x + y·Dx
–3D block of dim (Dx,Dy,Dz): thread index (x,y,z), thread ID = x + y·Dx + z·Dx·Dy
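As a sketch, these formulas can be evaluated inside a kernel using threadIdx together with the built-in blockDim variable (blockDim.x = Dx, blockDim.y = Dy); the kernel name and output array are illustrative.

__global__ void writeThreadID (int* out)
{
    // thread ID = x + y·Dx + z·Dx·Dy
    int threadID = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;
    out[threadID] = threadID;   // each thread writes to its own unique slot
}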
25
Thread organization
All threads in a block have a shared memory
–Very fast access
For efficient/safe cooperation between threads, use __syncthreads()
–All threads complete execution up to that point, and then resume together
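A minimal sketch of block-level cooperation with __syncthreads(), assuming a single 1D block whose size matches the assumed constant BLOCK_SIZE; the kernel reverses an array via shared memory.

#define BLOCK_SIZE 256   // assumed block size

__global__ void reverseInBlock (float* d_data)
{
    __shared__ float s_data[BLOCK_SIZE];       // visible to all threads of the block

    int i = threadIdx.x;
    s_data[i] = d_data[i];                     // each thread loads one element
    __syncthreads();                           // wait until the whole block has loaded

    d_data[i] = s_data[BLOCK_SIZE - 1 - i];    // safe to read other threads' elements now
}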
26
Memory available to threads
Kernel definition:
__global__ void vecAdd (float* A, float* B, float* C)
// A, B and C reside in global memory
Global memory is slower than shared memory
27
Memory available to threads
Good idea:
–Global -> shared on entry
–Shared -> global on exit

__global__ void doStuff (float* in, float* out)
{
    // init SData, a shared memory buffer
    // copy in -> SData
    // do stuff with SData
    // copy SData -> out
}
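One way the doStuff skeleton might be filled in, purely as an illustration; the block size and the "stuff" (averaging each element with its neighbour) are assumptions, not from the slide.

#define BLOCK_SIZE 256   // assumed block size, one 1D block of BLOCK_SIZE threads

__global__ void doStuff (float* in, float* out)
{
    __shared__ float SData[BLOCK_SIZE];   // fast on-chip staging buffer

    int i = threadIdx.x;
    SData[i] = in[i];                     // global -> shared on entry
    __syncthreads();                      // make sure every element is loaded

    // do stuff with SData: average each element with its right neighbour (example)
    float result = (i + 1 < BLOCK_SIZE) ? 0.5f * (SData[i] + SData[i + 1])
                                        : SData[i];

    out[i] = result;                      // shared -> global on exit
}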
29
Device Compute Capability
The compute capability of a CUDA device is a number of the form Major.Minor
–Major is the major revision number: a fundamental change in card architecture
–Minor is the minor revision number: incremental changes within the major revision
A device is CUDA-ready if its compute capability is >= 1.0
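The compute capability of the installed device can be queried at runtime with the CUDA runtime API; a minimal sketch, assuming device 0 is the card of interest:

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}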
30
All for today
Next time
–Grids of thread blocks
–Memory limitations
–The hardware model
31
On to exercises!