ECE 8823A GPU Architectures Module 5: Execution and Resources - I
Reading Assignment: Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6; CUDA Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
Objective: To understand the implications of programming model constructs on the demand for execution resources. To be able to reason about the performance consequences of programming model parameters (thread blocks, warps, memory behaviors, etc.) — a deeper understanding of the architecture is needed for this to be really valuable (later). To understand DRAM bandwidth: the cause of the DRAM bandwidth problem, and the programming techniques that address it — memory coalescing and corner turning. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
Closer Look: Formation of Warps. How do you form warps out of multidimensional arrays of threads? Linearize the thread IDs. [Figure: a 1D thread block split directly into warps, and a 3D thread block — Block (1,1) of a 2x2 grid — with threads (0,0,0) through (1,0,3).]
Formation of Warps. [Figure: 2D and 3D thread blocks — Block (1,1) of a 2x2 grid — placed in linear order: T0,0,0 T0,0,1 T0,0,2 T0,0,3 T0,1,0 T0,1,1 T0,1,2 T0,1,3 T1,0,0 T1,0,1 T1,0,2 T1,0,3 T1,1,0 T1,1,1 T1,1,2 T1,1,3.]
Mapping Thread Blocks to Warps. An example with a warp size of 16 threads: follow row-major order through the z-dimension — linearize, then split into warps, so T0,0 through T3,3 form Warp 0 and T4,0 through T7,3 form Warp 1. This understanding becomes important when optimizing global memory accesses.
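As a sketch of the linearization rule (threadIdx.x varies fastest, then threadIdx.y, then threadIdx.z) — linearThreadId and warpIdInBlock are illustrative helper names, not CUDA built-ins, while warpSize is a CUDA built-in variable:

__device__ unsigned int linearThreadId() {
  // Row-major linearization: x varies fastest, then y, then z
  return threadIdx.z * blockDim.y * blockDim.x
       + threadIdx.y * blockDim.x
       + threadIdx.x;
}

__device__ unsigned int warpIdInBlock() {
  return linearThreadId() / warpSize;  // consecutive linear IDs form a warp
}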
Execution of Warps. Each warp is executed as a SIMD bundle. How do we handle divergent control flow among threads in a warp? What are the execution semantics? How is it implemented? (later) How can we optimize against it? © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
Impact of Control Divergence. Divergence occurs within a warp: branches lead to serialization of the branch-dependent code, e.g., if(…) {…} else {…}. Performance issue: low warp utilization — threads idle while the other path executes, until the paths reconverge. [Figure: idle threads during serialized if/else execution, followed by reconvergence.]
Causes. Traditional nested branches. Loops with a variable number of iterations per thread (loop condition based on thread ID?). Switching on the thread ID, e.g., if (threadIdx.x > 5) {…}. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
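A minimal kernel fragment illustrating these causes (the kernel and buffer names are illustrative, not from the slides):

__global__ void divergenceDemo(int* out, const int* iters) {
  // Cause 1: branching on the thread ID — within warp 0, threads 0-5
  // and threads 6-31 take different paths and are serialized.
  if (threadIdx.x > 5)
    out[threadIdx.x] = 1;
  else
    out[threadIdx.x] = 2;

  // Cause 2: a per-thread loop trip count — the warp keeps issuing the
  // loop body until its slowest thread finishes, while the rest idle.
  int sum = 0;
  for (int i = 0; i < iters[threadIdx.x]; ++i)
    sum += i;
  out[threadIdx.x] += sum;
}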
Control Divergence Mitigation: Algorithmic Approach. Goal: the flexibility of MIMD control flow + the benefits of SIMD execution. Can algorithmic techniques maximize the utilization achieved by a warp?
Reduction. A commonly used strategy for processing large input data sets where there is no required order of processing elements (the operator is associative and commutative): partition the data set into smaller chunks, have each thread process a chunk, then use a reduction tree to summarize the results from each chunk into the final answer. We will focus on the reduction tree step for now. Google and Hadoop MapReduce frameworks are examples of this pattern. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012
A parallel reduction tree algorithm performs N-1 operations in log(N) steps. [Figure: max-reduction of 3 1 7 0 4 1 6 3 — step 1 yields 3 7 4 6, step 2 yields 7 6, step 3 yields 7.] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012
Reduction: Approach 1.

extern __shared__ float partialsum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
  __syncthreads();
  if (t % (2*stride) == 0)
    partialsum[t] += partialsum[t + stride];
}

[Figure: data in shared memory; threads 0, 2, 4, 6 compute 0+1, 2+3, 4+5, 6+7; then threads 0 and 4 compute 0..3 and 4..7; then thread 0 computes 0..7.] O(N) additions, and therefore work efficient? Hardware efficiency? In every iteration the active threads (t % (2*stride) == 0) are spread out across the block, so most warps diverge. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
A Better Strategy. Principle: shift the index usage to ensure high thread utilization within a warp — remap thread indices so the active threads stay consecutive. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012
An Example of 16 Threads. [Figure: for 32 elements, each active thread t (0..15) adds element t+16 to element t — 0+16 through 15+31. The active threads are consecutive, so there is no divergence.] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012
Reduction: Approach 2.

extern __shared__ float partialsum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x/2; stride > 0; stride /= 2) {
  __syncthreads();
  if (t < stride)
    partialsum[t] += partialsum[t + stride];
}

The difference is in which threads diverge! For a thread block of 512 threads: in the first iteration, threads 0-255 take the branch and threads 256-511 do not. For a warp size of 32, all threads in a warp have identical branch conditions — no divergence! Once the number of active threads drops below the warp size, the old problem returns. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
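Putting the pieces together, a minimal sketch of a complete Approach 2 reduction kernel (assumes blockDim.x is a power of two and one input element per thread; reduceSum, blockSums, and BLOCK_SIZE are illustrative names, not from the slides):

#define BLOCK_SIZE 512

__global__ void reduceSum(const float* in, float* blockSums, int n) {
  __shared__ float partialsum[BLOCK_SIZE];
  unsigned int t = threadIdx.x;
  unsigned int i = blockIdx.x * blockDim.x + t;
  partialsum[t] = (i < n) ? in[i] : 0.0f;  // load one element per thread
  __syncthreads();
  for (unsigned int stride = blockDim.x/2; stride > 0; stride /= 2) {
    if (t < stride)
      partialsum[t] += partialsum[t + stride];
    __syncthreads();  // outside the if, so every thread reaches it
  }
  if (t == 0)
    blockSums[blockIdx.x] = partialsum[0];  // one partial sum per block
}

The per-block partial sums can then be reduced by a second kernel launch (or on the host).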
Global Memory Bandwidth. How can we map thread access patterns to global memory addresses to maximize bandwidth utilization? We need to understand the organization of DRAMs — a hierarchy of latencies. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
Basic Organization. [Figure: a DRAM core array — example: 32x32 = 1024-bit array — with a row decoder, sense amps and buffer, a column mux, and I/O pins.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
1Gb Micron DDR2 SDRAM. [Figure: datasheet timing — row access time and column access time.]
Technology Trends. Over the past two decades, DRAM data rates have increased ~1000x while RAS/CAS latencies have decreased by only ~56%. How? Increasing burst length. Courtesy: Synopsys DesignWare Technical Bulletin
DRAM Bursting for an 8x2 Bank. [Figure: address bits go to the decoder; the core array delivers 2 bits at a time to the pins. Non-burst timing pays the core array access delay on every transfer; burst timing pays it once per row and then streams from the row buffer.] Modern DRAM systems are designed to always be accessed in burst mode. Burst bytes are transferred, but discarded when accesses are not to sequential locations. ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Multiple DRAM Banks. [Figure: two banks — Bank 0 and Bank 1 — each with its own decoders, core array, sense amps, and mux, sharing the interface.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
DRAM Bursting for the 8x2 Bank, Continued. [Figure: single-bank burst timing leaves dead time on the interface while the core array is accessed; interleaving bursts from multiple banks reduces the dead time.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
First-order Look at the GPU Off-chip Memory Subsystem. NVIDIA V100 Volta GPU: peak global memory bandwidth = 900 GB/s over a 4096-bit global memory (HBM2) interface. Prior-generation GPUs (e.g., Kepler): 384-bit GDDR5 at 224 GB/s. ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
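As a sanity check on the arithmetic: a 4096-bit interface is 512 bytes wide, so a peak of 900 GB/s corresponds to a per-pin data rate of roughly 900/512 ≈ 1.76 Gb/s (the per-pin rate here is inferred from the peak figure, not a quoted spec).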
Multiple Memory Channels. Divide the memory address space into N parts, where N is the number of memory channels, and assign each portion to a channel. [Figure: banks distributed across Channel 0 through Channel 3.] “You can buy bandwidth but you can’t bribe God” -- Unknown. ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Lessons. Organize data accesses to maximize burst-mode bandwidth: access consecutive locations, via algorithmic strategies + data layout. Thread blocks issue warp-size load/store instructions — 32 addresses for a warp size of 32. Coalesce these accesses into a smaller number of memory transactions to maximize memory bandwidth. More later as we discuss the microarchitecture.
Memory Coalescing. The memory references of a warp’s loads/stores are coalesced into a sequence of memory transactions: accesses that fall within the same segment (e.g., a 128-byte segment) are coalesced. The ability and extent of coalescing depends on the compute capability.
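A sketch contrasting the two extremes (copyCoalesced and copyStrided are illustrative kernels, not from the slides):

__global__ void copyCoalesced(float* dst, const float* src) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  dst[i] = src[i];  // thread k of a warp touches address base + 4k:
                    // one transaction per 128-byte segment
}

__global__ void copyStrided(float* dst, const float* src, int stride) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  dst[i] = src[i];  // addresses 4*stride bytes apart: for large strides,
                    // one transaction per thread
}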
Implications of Memory Coalescing. Coalescing reduces the request rate to the L1 and DRAM. This is distinct from CPU optimizations — why? The hardware needs to be able to re-map the entries returned by each access back to threads. [Figure: SM with warp schedulers, SPs, and register file; L1/shared memory access bandwidth; DRAM access bandwidth.]
Placing a 2D C array into linear memory space. [Figure: the rows of a 2D C array are placed one after another in linear memory, in row-major order of increasing address.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
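The mapping itself is just row-major index arithmetic; as a sketch (idx2d is an illustrative helper, not from the slides):

__host__ __device__ inline int idx2d(int row, int col, int width) {
  // Row 'row' starts at offset row*width; 'col' selects within the row
  return row * width + col;
}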
Base Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
  float Pvalue = 0;
  // each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
  d_P[Row*Width+Col] = Pvalue;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
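A host-side launch sketch for this kernel, assuming Width is a multiple of TILE_WIDTH and that d_M, d_N, and d_P have already been allocated and populated on the device:

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);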
Two Access Patterns. Let’s look at the two access patterns in the kernel: (a) d_M[Row*Width+k] and (b) d_N[k*Width+Col], where k is the loop counter in the inner-product loop of the kernel code. [Figure: Thread 1 and Thread 2 traverse (a) rows of d_M and (b) columns of d_N; both matrices are Width x Width.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
N accesses are coalesced. For d_N[k*Width+Col], the access direction in the kernel code (for one thread) is down a column, but across successive threads in a warp each load iteration falls in one row: in iteration 0, T0..T3 load N0,0 N0,1 N0,2 N0,3; in iteration 1, they load N1,0 N1,1 N1,2 N1,3. Consecutive threads access consecutive addresses. ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
M accesses are not coalesced. For d_M[Row*Width+k], the access direction in the kernel code (within a thread) is along a row, but across successive threads in a warp each load iteration touches a column: in iteration 0, T0..T3 load M0,0 M1,0 M2,0 M3,0; in iteration 1, they load M0,1 M1,1 M2,1 M3,1. Successive threads access addresses a full row (Width elements) apart, which leads to many distinct memory transactions for accessing d_M.
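A worked example in the figures’ simplified view (successive threads assigned successive Row values for d_M and successive Col values for d_N), with Width = 4 and iteration k = 0: for d_N, threads T0..T3 access element offsets k*Width+Col = 0, 1, 2, 3 — consecutive, so one transaction; for d_M, they access Row*Width+k = 0, 4, 8, 12 — a full row apart, so separate transactions.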
Using Shared Memory. [Figure: original access pattern vs. tiled access pattern — copy a tile of d_N into scratchpad (shared) memory, then perform the multiplication with the scratchpad values.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Shared Memory Accesses. Shared memory is banked — there is no coalescing. Data access patterns should be structured to avoid bank conflicts. Low-order interleaved mapping across banks?
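A common mitigation sketch: pad the innermost dimension of a shared-memory tile by one element so that column walks no longer land in the same bank (assumes the usual 32 four-byte-wide banks with low-order interleaving; TILE is an illustrative name):

#define TILE 32

// Declared inside the kernel:
__shared__ float tile[TILE][TILE + 1];  // one padding column
// Without padding, tile[k][c] for k = 0..31 strides by 32 floats, so every
// access maps to the same bank (a 32-way conflict). The extra column makes
// the stride 33 floats, shifting each row to a different bank.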
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  int bx = blockIdx.x; int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;
  // Identify the row and column of the d_P element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;
  // Loop over the d_M and d_N tiles required to compute the d_P element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of d_M and d_N tiles into shared memory
    Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
    Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  d_P[Row*Width+Col] = Pvalue;
}

Accesses in the inner loop are from shared memory, hence coalescing is not necessary — but consider bank conflicts.
Coalescing Behavior. [Figure: in the tiled kernel, both tile loads are coalesced — consecutive threads (tx) read consecutive addresses of the d_M row at column offset m*TILE_WIDTH and of the d_N row m*TILE_WIDTH+ty at column Col; each block writes one TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub of d_P.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Thread Granularity. Consider instruction bandwidth vs. memory bandwidth. Control the amount of work per thread. [Figure: SM pipeline — fetch/decode, warp schedulers, SPs, register file, L1/shared memory, DRAM.]
Thread Granularity Tradeoffs. To preserve instruction bandwidth (and memory bandwidth), increase thread granularity: merge adjacent tiles so that each thread computes several d_P elements while sharing the d_M tile data, as sketched below. [Figure: adjacent Pdsub tiles of d_P reuse the same d_M tile with different d_N tiles.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
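A minimal sketch of the merged-tile inner loop (Nds0, Nds1, Pvalue0, and Pvalue1 are illustrative names, not the slides’ code; each thread now produces two horizontally adjacent d_P elements, so one Mds load feeds two output tiles):

// Inside the tile loop, after collaboratively loading one Mds tile and the
// two Nds tiles that cover columns Col and Col + TILE_WIDTH:
for (int k = 0; k < TILE_WIDTH; ++k) {
  float m = Mds[ty][k];          // one shared d_M read...
  Pvalue0 += m * Nds0[k][tx];    // ...reused for two output elements
  Pvalue1 += m * Nds1[k][tx];
}
// After the loop over all tiles:
d_P[Row*Width + Col] = Pvalue0;
d_P[Row*Width + Col + TILE_WIDTH] = Pvalue1;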
Thread Granularity Tradeoffs (2). Impact on parallelism: fewer thread blocks and more registers per thread. The impact needs to be explored, e.g., via autotuning. [Figure: same tiled matrix-multiply diagram as before.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Any more questions? Read Chapter 6!