ECE 8823A GPU Architectures
Module 3: CUDA Execution Model - I
© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2012, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
Objective
A more detailed look at kernel execution:
- Data-to-thread assignment
- The organization and scheduling of threads
- Resource assignment at the block level
- Scheduling at the warp level
- Basics of SIMT execution
Reading Assignment
- Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 4
- Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3
- Reference: CUDA Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
A Multi-Dimensional Grid Example
[Figure: the host launches Kernel 1 on Grid 1, a 2x2 arrangement of blocks, and then Kernel 2 on Grid 2. Block (1,1) of Grid 2 is expanded to show its threads, indexed Thread (z,y,x) from (0,0,0) through (1,0,3).]
Built-In Variables
1D-3D grid of thread blocks
- Built-in: gridDim (gridDim.x, gridDim.y, gridDim.z)
- Built-in: blockDim (blockDim.x, blockDim.y, blockDim.z)
Examples:
dim3 dimGrid(32,2,2) - 3D grid of thread blocks
dim3 dimGrid(2,2,1) - 2D grid of thread blocks
dim3 dimGrid(32,1,1) - 1D grid of thread blocks
dim3 dimGrid(ceil(n/256.0),1,1) - 1D grid of thread blocks
my_kernel<<<dimGrid, dimBlock>>>(..)
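The last example above rounds n/256 up so that every element is covered even when n is not a multiple of 256. A minimal plain-C sketch of that host-side computation (the block size 256 and the integer form are illustrative; the integer expression gives the same result as ceil(n/256.0) without the float round-trip):

```c
/* Number of 256-thread blocks needed to cover n elements.
   (n + 255) / 256 is the integer equivalent of ceil(n / 256.0). */
unsigned int blocks_for(unsigned int n) {
    return (n + 255u) / 256u;
}
```

For n = 1000 this yields 4 blocks (1024 threads); the kernel's bounds check must mask off the 24 extra threads in the last block.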
Built-In Variables (2)
1D-3D arrangement of threads in a thread block
- Built-in: blockIdx (blockIdx.x, blockIdx.y, blockIdx.z)
- Built-in: threadIdx (threadIdx.x, threadIdx.y, threadIdx.z)
- All blocks have the same thread configuration
Examples:
dim3 dimBlock(4,2,2) - 3D arrangement of threads in a block
dim3 dimBlock(2,2,1) - 2D arrangement of threads in a block
dim3 dimBlock(32,1,1) - 1D arrangement of threads in a block
my_kernel<<<dimGrid, dimBlock>>>(..)
Built-In Variables (3)
1D-3D arrangement of threads in a thread block
- Built-in: blockIdx (blockIdx.x, blockIdx.y, blockIdx.z)
- Built-in: threadIdx (threadIdx.x, threadIdx.y, threadIdx.z)
- Initialized by the runtime on each kernel launch
- Ranges are fixed by the compute capability of the target device
- You can query the device (later)
2D Examples
Processing a Picture with a 2D Grid
- 16×16 blocks covering a 76×62-pixel picture
- Show the example calling code from the text and identify the thread configuration
- Explain the sizing of blocks and grid dimensions
Row-Major Layout in C/C++
Row*Width + Col = 2*4 + 1 = 9
[Figure: a 4×4 matrix stored linearly as M0 through M15; the element at row 2, column 1 maps to linear offset 9.]
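The slide's arithmetic can be captured in a one-line helper (plain C; the function name is illustrative):

```c
/* Linear offset of element (row, col) in a row-major matrix of
   the given width. Slide's example: row 2, col 1, width 4 -> 9. */
int row_major_index(int row, int col, int width) {
    return row * width + col;
}
```

Consecutive values of col map to consecutive memory addresses, which is what makes the kernels below favor having adjacent threads vary in Col.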
Source Code of the Picture Kernel

__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
  // Calculate the row # of the d_Pin and d_Pout element to process
  int Row = blockIdx.y*blockDim.y + threadIdx.y;

  // Calculate the column # of the d_Pin and d_Pout element to process
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  // each thread computes one element of d_Pout if in range
  if ((Row < m) && (Col < n)) {
    d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
  }
}
Approach Summary
- Determine the storage layout of the data
- Assign a unique ID to each thread
- Map IDs to data (for access)
Figure 4.5: Covering a 76×62 picture with 16×16 blocks.
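The grid sizing behind Figure 4.5 can be sketched on the host in plain C (the struct and function names are illustrative, not from the text):

```c
/* Grid dimensions needed to cover a picture with square blocks.
   The "+ tile - 1" rounds up so that edge pixels are covered. */
struct dims { int x, y; };

struct dims picture_grid(int width_px, int height_px, int tile) {
    struct dims d;
    d.x = (width_px  + tile - 1) / tile;  /* blocks across */
    d.y = (height_px + tile - 1) / tile;  /* blocks down   */
    return d;
}
```

For the 76×62 picture with 16×16 blocks this gives a 5×4 grid (20 blocks, 5120 threads); PictureKernel's (Row < m) && (Col < n) test masks off the threads that fall outside the picture.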
A Simple Running Example: Matrix Multiplication
A simple illustration of the basic features of memory and thread management in CUDA programs:
- Thread index usage
- Memory layout
- Register usage
Assume square matrices for simplicity; leave shared memory usage until later.
Square Matrix-Matrix Multiplication
P = M * N, each of size WIDTH x WIDTH
- Each thread calculates one element of P
- Each row of M is loaded WIDTH times from global memory
- Each column of N is loaded WIDTH times from global memory
[Figure: M, N, and P as WIDTH x WIDTH matrices; a row of M and a column of N combine to produce one element of P.]
Matrix Multiplication: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

[Figure: element P(i,j) is the dot product of row i of M and column j of N. Note that i and j will become global thread indices.]
Kernel Version: Functional Description

for (int k = 0; k < Width; ++k)
  Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

Which thread is at coordinate (Row, Col)? Threads self-allocate: each thread derives its own (Row, Col) from its block and thread indices.
[Figure: the thread at (Row, Col) computes the dot product of row Row of M and column Col of N; blockDim.x and blockDim.y tile the WIDTH x WIDTH result matrix.]
Kernel Function - A Small Example
- Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix; each block has (TILE_WIDTH)² threads
- Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
WIDTH = 4; TILE_WIDTH = 2. Each block has 2*2 = 4 threads; WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks.
[Figure: the 4x4 result matrix P0,0 through P3,3 partitioned into four 2x2 tiles, one per block: Block(0,0), Block(0,1), Block(1,0), Block(1,1).]
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13
A Slightly Bigger Example
WIDTH = 8; TILE_WIDTH = 2. Each block has 2*2 = 4 threads; WIDTH/TILE_WIDTH = 4, so use 4*4 = 16 blocks.
[Figure: the 8x8 result matrix P partitioned into sixteen 2x2 tiles.]
A Slightly Bigger Example (cont.)
WIDTH = 8; TILE_WIDTH = 4. Each block has 4*4 = 16 threads; WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks.
[Figure: the same 8x8 result matrix P partitioned into four 4x4 tiles.]
Kernel Invocation (Host-side Code)

// Setup the execution configuration
// TILE_WIDTH is a #define constant
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Kernel Function

// Matrix multiplication kernel - per-thread code
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;
Work for Block (0,0) in a TILE_WIDTH = 2 Configuration
Col = blockIdx.x * 2 + threadIdx.x = 0 * 2 + threadIdx.x, giving Col = 0, 1
Row = blockIdx.y * 2 + threadIdx.y = 0 * 2 + threadIdx.y, giving Row = 0, 1
[Figure: with blockDim.x = blockDim.y = 2, Block (0,0) computes P0,0, P0,1, P1,0, P1,1 from rows 0-1 of M and columns 0-1 of N.]
Work for Block (0,1)
Col = 1 * 2 + threadIdx.x, giving Col = 2, 3
Row = 0 * 2 + threadIdx.y, giving Row = 0, 1
[Figure: Block (0,1) computes P0,2, P0,3, P1,2, P1,3 from rows 0-1 of M and columns 2-3 of N.]
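The index arithmetic from the two worked examples can be checked with a plain-C stand-in for the kernel's first two lines (the function name and out-parameters are illustrative):

```c
/* Global (Row, Col) of a thread, as computed in the kernel:
   Row = blockIdx.y * blockDim.y + threadIdx.y
   Col = blockIdx.x * blockDim.x + threadIdx.x
   Here blockDim.x = blockDim.y = tile (TILE_WIDTH). */
void thread_coords(int bx, int by, int tx, int ty, int tile,
                   int *row, int *col) {
    *row = by * tile + ty;
    *col = bx * tile + tx;
}
```

For Block (0,1) (blockIdx.x = 1, blockIdx.y = 0) and thread (0,0) with TILE_WIDTH = 2, this gives Row = 0, Col = 2: the top-left element of that block's tile, matching the figure.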
A Simple Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
  }
}

Note the memory access patterns across threads in a block (inner loop).
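The body of the kernel above can be replayed sequentially on the host: a plain-C sketch of the work one thread does for one (Row, Col), with the grid replaced by loops (the function name is illustrative):

```c
/* The work of one kernel thread: the dot product of row Row of M
   and column Col of N, written to P[Row*Width + Col].
   Matrices are row-major, Width x Width, as in the kernel. */
void matmul_one_element(const float *M, const float *N, float *P,
                        int Width, int Row, int Col) {
    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += M[Row * Width + k] * N[k * Width + Col];
        P[Row * Width + Col] = Pvalue;
    }
}
```

Looping over all (Row, Col) pairs reproduces sequentially what the grid does in parallel, which is a convenient way to validate a kernel against a host reference.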
Questions? © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign