Chapter 4: Parallel Programming in CUDA C
Rui (Ray) Wu
Overview
GPU
Grid vs. block vs. thread
Sum two vectors and striding
PA1
Metrics: speed-up, throughput
GPU structure
blockIdx.x, blockDim.x, threadIdx.x, threadIdx.y
Indices start from 0.
How to define the dimensions? dim3 something(x, y, z)
dim3 my_grid(10); // a grid of 10*1*1 blocks
dim3 my_block(10, 20, 30); // a block of 10*20*30 threads
add<<<my_grid, my_block>>>(…);
Different devices have different maximum dimensions.
CUDA by Example, page 32, has examples that check the maximum dimensions.
Interesting reading materials:
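A minimal sketch of checking those device limits, in the spirit of the CUDA by Example page 32 example (device 0 is assumed; note that the slide's 10*20*30 block would be 6000 threads, which exceeds the 1024-threads-per-block limit of most devices — exactly why checking matters):

```cuda
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dim:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    dim3 my_grid(10);         // a grid of 10*1*1 blocks
    dim3 my_block(10, 20, 3); // 10*20*3 = 600 threads, within the usual 1024 limit
    // add<<<my_grid, my_block>>>(…);  // a launch would go here
    return 0;
}
```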
GPU structure
blockDim.x/y/z gives the number of threads in a block, in that direction.
gridDim.x/y/z gives the number of blocks in the grid, in that direction.
Here is an example: dim3 my_grid(2, 2); dim3 my_block(3, 3);
launched as add<<<my_grid, my_block>>>(…);
If blockIdx.x = 1, threadIdx.y = 0, threadIdx.x = 1, where is this thread?
GPU structure
Here gridDim.x = 2, gridDim.y = 2, blockDim.x = 3, blockDim.y = 3.
If blockIdx.x = 1, blockIdx.y = 0, threadIdx.x = 0, threadIdx.y = 1, where is this thread?
Next class: how to calculate the global thread id (how many threads are in front of the current one).
Sum two vectors and striding
Striding
Vector striding example: in PA0, int tid = blockIdx.x; gave one thread per block.
If we do not have enough threads, what should we do? Use more than one thread per block, and let each thread handle more than one element.
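A sketch of the stride-loop kernel in the CUDA by Example style (the names a, b, c and the length n are placeholders): each thread starts at its global index and jumps ahead by the total number of threads until the whole vector is covered.

```cuda
__global__ void add(const int *a, const int *b, int *c, int n) {
    // Global index of this thread across the whole grid.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;  // stride: total number of threads
    }
}
// Launched e.g. as: add<<<128, 128>>>(d_a, d_b, d_c, n);
```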
Striding
Advantage: scalability.
If you change blockDim and gridDim, you don't need to worry about changing your code.
More information, e.g. on how to grid-stride:
PA1: Matrix addition
Due date: January 30, next Tuesday
Task: add two N*N matrices using CUDA.
Requirements:
Dynamic matrix size: dynamically allocate memory and add keyboard input statements to specify N.
Use LARGE matrices (at least one million elements).
Compare the versions with and without striding.
Try different CUDA grid/block structures and sizes: add keyboard input statements for the number of threads in a block and the number of blocks in the grid.
Include checks for invalid input.
Timing: add statements to time the execution of the code using CUDA events, both for the host-only (CPU) computation and for the device (GPU) computation, and display the results.
Compute and graph the appropriate metrics (runtime, speed-up factor, throughput, …). These metrics are introduced later.
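A minimal sketch of the CUDA-event timing the assignment asks for (the kernel add and its launch configuration are placeholders for your real matrix-addition kernel):

```cuda
#include <stdio.h>

__global__ void add(int *c) { /* placeholder for the real kernel */ }

int main(void) {
    cudaEvent_t start, stop;
    float ms = 0.0f;
    int *d_c;
    cudaMalloc(&d_c, sizeof(int));

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);     // mark the start on the default stream
    add<<<1, 1>>>(d_c);            // the work being timed (placeholder launch)
    cudaEventRecord(stop, 0);      // mark the end
    cudaEventSynchronize(stop);    // wait until `stop` has actually occurred

    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("GPU time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_c);
    return 0;
}
```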
PA1: Matrix addition
Two parts:
Report:
Results: multiple timings of runs of various sizes.
Appropriate graphs.
Code (GitHub ONLY; don't use any library):
Sequential C part.
CUDA part.
Same repository, with output comparing the results.
PA1: Matrix addition
What is throughput? How many numbers can your program add per second.
What is speed-up? Sequential runtime divided by parallel runtime.
Question! If we fill the arrays on the GPU instead of on the CPU, will it be faster or slower? Faster: it avoids data transfer between host and device, and transfers slow your program down.
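A sketch of the idea (array names and launch configuration are placeholders): generate the input data directly in device memory, so no host-to-device cudaMemcpy is needed before the computation.

```cuda
__global__ void fill(int *a, int *b, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < n) {
        a[tid] = tid;       // data is generated directly in device memory
        b[tid] = 2 * tid;
        tid += blockDim.x * gridDim.x;
    }
}

int main(void) {
    const int n = 1000000;
    int *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    fill<<<128, 128>>>(d_a, d_b, n);  // no host-to-device copy needed
    cudaDeviceSynchronize();
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```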
Questions & Comments?