ECE 8823A GPU Architectures, Module 5: Execution and Resources - I
Reading Assignment: Kirk and Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Chapter 6; CUDA Programming Guide – guide/#abstract
Objective
To understand the implications of programming model constructs on the demand for execution resources
To be able to reason about the performance consequences of programming model parameters
–Thread blocks, warps, memory behaviors, etc.
–A deeper understanding of the architecture is needed for this to be really valuable (later)
To understand DRAM bandwidth
–Causes of the DRAM bandwidth problem
–Programming techniques that address the problem: memory coalescing, corner turning
Formation of Warps
How do you form warps out of multidimensional arrays of threads?
–Linearize thread IDs
[Figure: a 1D thread block and a 3D thread block within a grid of blocks; in each case the thread IDs are linearized and consecutive threads are grouped into warps]
Formation of Warps (continued)
[Figure: the threads of a 2D/3D thread block laid out in linear order: T0,0,0 T0,0,1 T0,0,2 T0,0,3 T0,1,0 T0,1,1 T0,1,2 T0,1,3 T1,0,0 T1,0,1 T1,0,2 T1,0,3 T1,1,0 T1,1,1 T1,1,2 T1,1,3 – the x index varies fastest, then y, then z]
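As a concrete illustration of this linearization (not from the slides; the helper names linearThreadId and warpId are mine, and the warp size of 32 is the value on current NVIDIA GPUs):

// Hypothetical helpers illustrating how thread IDs are linearized within a block.
// Threads are ordered with x varying fastest, then y, then z; consecutive groups
// of 32 linearized IDs form a warp.
__device__ unsigned int linearThreadId() {
    return threadIdx.z * blockDim.y * blockDim.x
         + threadIdx.y * blockDim.x
         + threadIdx.x;
}
__device__ unsigned int warpId() {
    return linearThreadId() / 32;   // which warp of the block this thread belongs to
}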
Execution of Warps
Each warp is executed as a SIMD bundle
How do we handle divergent control flow among threads in a warp?
–Execution semantics
–How is it implemented? (later)
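A minimal sketch of branch divergence (illustrative only; the kernel and array name out are mine): when threads of the same warp evaluate a condition differently, the hardware executes the two paths one after the other.

__global__ void divergentKernel(int *out) {
    // Even-numbered threads of each warp take the 'if' path and odd-numbered
    // threads take the 'else' path, so the warp serializes the two paths,
    // masking off the non-participating threads in each.
    if (threadIdx.x % 2 == 0)
        out[threadIdx.x] = 1;
    else
        out[threadIdx.x] = 2;
}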
Reduction: Approach 1

__shared__ float partialSum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

[Figure: the Approach-1 reduction tree over threadIdx.x within a thread block – in each step, the thread whose index is a multiple of 2*stride adds in the partial sum that is stride elements away]
Reduction: Approach 2

__shared__ float partialSum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x/2; stride >= 1; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}

The difference is in which threads diverge!
For a thread block of 512 threads
–In each step, the threads with t < stride take the branch and the rest do not
For a warp size of 32, as long as stride is at least 32, all threads in a warp have identical branch conditions – no divergence!
When #active threads < warp size, the old problem returns
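To make the fragments above concrete, here is a minimal sketch of a complete block-level reduction kernel built around the Approach-2 loop; the kernel name blockReduce, the arrays input and blockSums, and the BLOCK_SIZE of 512 are assumptions for this sketch, and the course/textbook code may differ in details.

#define BLOCK_SIZE 512   // assumed launch configuration: BLOCK_SIZE threads per block

// Each block reduces BLOCK_SIZE consecutive input elements and writes one partial sum.
__global__ void blockReduce(const float *input, float *blockSums, int n) {
    __shared__ float partialSum[BLOCK_SIZE];
    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + t;
    partialSum[t] = (i < n) ? input[i] : 0.0f;       // load one element per thread
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride)                              // active threads stay contiguous
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0)
        blockSums[blockIdx.x] = partialSum[0];       // one partial sum per block
}

A second kernel launch (or a small host-side loop) would then reduce the per-block sums in blockSums to the final value.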
Global Memory (DRAM) Bandwidth
[Figure: the ideal view of global memory bandwidth versus the reality of the DRAM system that delivers it]
DRAM Bank Organization
Each core array has about 1M bits
Each bit is stored in a tiny capacitor accessed through one transistor
[Figure: a DRAM bank – the row address drives a row decoder that selects a row of the memory cell core array; the selected row is read by the sense amps into column latches, and a mux driven by the column address narrows the wide internal data path to the narrow off-chip pin interface]
A Very Small (8x2-bit) DRAM Bank
[Figure: an 8x2-bit bank – a 3-bit row address (here 011) is decoded to select one row; the two bits of that row are captured by the sense amps, and a mux selects which bit is driven onto the data pin]
DRAM core arrays are slow.
Reading from a cell in the core array is a very slow process
–DDR: core speed = ½ interface speed
–DDR2/GDDR3: core speed = ¼ interface speed
–DDR3/GDDR4: core speed = ⅛ interface speed
–… likely to be worse in the future
[Figure: one bit line of the core array – each cell is a very small capacitance that stores a data bit, with about 1000 cells connected to each vertical (bit) line feeding the sense amps]
DRAM Bursting
For DDR{2,3} SDRAM cores clocked at 1/N the speed of the interface:
–Load (N × interface width) DRAM bits from the same row at once into an internal buffer, then transfer them in N steps at interface speed
–DDR2/GDDR3: buffer width = 4× interface width
DRAM Bursting (step 1)
[Figure: the row selected by address 010 is read from the core array into the sense amps; the mux sends the first pair of bits over the interface]
DRAM Bursting (step 2)
[Figure: the remaining bits of the same row (address 011) are delivered from the sense amps and buffer through the mux, without another core-array access]
"You can buy bandwidth but you can't bribe God" -- Unknown
DRAM Bursting for the 8x2 Bank
[Figure: timing comparison – non-burst timing pays the core-array access delay before every 2 bits sent to the pins; burst timing pays it once and then streams successive 2-bit beats back to back]
Modern DRAM systems are designed to always be accessed in burst mode. Burst bytes are transferred but discarded when accesses are not to sequential locations.
Multiple DRAM Banks
[Figure: two banks (Bank 0 and Bank 1), each with its own row decoder, sense amps, and mux, sharing the interface; address bits select which bank services an access]
DRAM Bursting for the 8x2 Bank (continued)
[Figure: timing comparison – with a single bank, the interface sits idle (dead time) during each core-array access delay; with multiple banks, accesses overlap and the dead time on the interface is reduced]
First-order Look at the GPU Off-chip Memory Subsystem
NVIDIA GTX280 GPU:
–Peak global memory bandwidth = 141.7 GB/s
–Global memory (GDDR3) interface clocked at 1.1 GHz (DRAM core at 276 MHz)
–For a typical 64-bit interface, we can sustain only about 17.6 GB/s (recall DDR: 2 transfers per clock)
–We need a lot more bandwidth (141.7 GB/s) – thus 8 memory channels
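As a back-of-the-envelope check of these numbers (not from the slides):
per 64-bit channel: 1.1 GHz × 2 transfers/clock × 8 bytes/transfer ≈ 17.6 GB/s
8 such channels (a 512-bit interface): 8 × 17.6 GB/s ≈ 141 GB/s, which matches the quoted 141.7 GB/s peak once the precise ~1107 MHz memory clock is used.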
Multiple Memory Channels
Divide the memory address space into N parts
–N is the number of memory channels
–Assign each portion to a channel
[Figure: banks of the address space interleaved across Channel 0, Channel 1, Channel 2, and Channel 3]
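A small illustrative sketch of such a mapping (assumptions: 4 channels and a 256-byte interleaving granularity; real GPUs pick these parameters differently and may hash address bits):

// Hypothetical low-order interleaved mapping of addresses to channels.
// With 4 channels and 256-byte interleaving, consecutive 256-byte chunks of the
// address space go to channels 0,1,2,3,0,1,2,3,...
enum { NUM_CHANNELS = 4, INTERLEAVE_BYTES = 256 };

unsigned int channelOf(unsigned long long addr) {
    return (unsigned int)((addr / INTERLEAVE_BYTES) % NUM_CHANNELS);
}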
Memory Controller Organization of a Many-Core Processor
GTX280: 30 Streaming Multiprocessors (SMs) connected to 8-channel DRAM controllers through an interconnect
–DRAM controllers are interleaved
–Within DRAM controllers (channels), DRAM banks are interleaved for incoming memory requests
Lessons
Organize data accesses to maximize burst-mode bandwidth
–Access consecutive locations
–Algorithmic strategies + data layout
Thread blocks issue warp-sized load/store instructions
–32 addresses in Fermi
–Coalesce these accesses to create a smaller number of memory transactions and maximize memory bandwidth
–More later as we discuss the microarchitecture
Memory Coalescing
Memory references are coalesced into a sequence of memory transactions
–Accesses within a segment are coalesced (e.g., 128-byte segments)
[Figure: a warp issuing a load (LD); 16 threads × 4 bytes = 64 bytes is an opportunity to coalesce into a single transaction]
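As a minimal sketch (the kernel and the array names in and out are mine), a load in which consecutive threads of a warp read consecutive 4-byte words coalesces into very few wide transactions:

__global__ void coalescedCopy(const float *in, float *out) {
    // Thread k of each warp reads in[base + k]: 32 consecutive 4-byte words fall
    // into one aligned 128-byte segment, so the warp's load can be serviced by a
    // single memory transaction.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}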
Implications of Memory Coalescing
Reduces the request rate to the L1 and DRAM
Distinct from CPU optimizations – why?
Need to be able to re-map entries from each access back to threads
[Figure: the SM pipeline – warp schedulers and the register file feeding the SPs, backed by L1/shared memory and DRAM, each level with its own access bandwidth]
Placing a 2D C Array into Linear Memory Space
[Figure: a 4x4 matrix M laid out in row-major (linearized) order of increasing address: M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3]
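The linearized index follows the usual row-major rule; a one-line helper (illustrative, the name linearIndex is mine) makes the mapping used in the kernels below explicit:

// Row-major mapping of a 2D index into the 1D allocation: element (row, col)
// of a Width-wide matrix lives at offset row*Width + col.
__host__ __device__ int linearIndex(int row, int col, int Width) {
    return row * Width + col;
}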
Base Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and of d_M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of d_P and of d_N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

    d_P[Row*Width+Col] = Pvalue;
}
Two Access Patterns
[Figure: (a) d_M[Row*Width+k] walks along a row of d_M, (b) d_N[k*Width+Col] walks down a column of d_N; k is the loop counter of the inner-product loop in the kernel code, and WIDTH is the matrix dimension]
N accesses are coalesced.
[Figure: for d_N[k*Width+Col], successive threads T0–T3 of a warp access consecutive elements of one row of N in each load iteration (iteration 0: N0,0–N0,3; iteration 1: N1,0–N1,3), while each individual thread moves down a column across iterations – within any one iteration the warp touches adjacent addresses, so the accesses coalesce]
M accesses are not coalesced.
[Figure: for d_M[Row*Width+k], successive threads T0–T3 of a warp access elements in the same column of M in each load iteration (iteration 0: M0,0 M1,0 M2,0 M3,0; iteration 1: M0,1 M1,1 M2,1 M3,1), which are Width elements apart in the linearized layout – the warp's accesses are strided and do not coalesce]
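For contrast with the coalesced copy sketched earlier, a hypothetical kernel (names are mine) in which consecutive threads read elements that are Width floats apart, the d_M-style pattern above, cannot coalesce:

__global__ void stridedRead(const float *in, float *out, int Width) {
    // Thread k of a warp reads in[k*Width]: neighboring threads touch addresses
    // 4*Width bytes apart, so their accesses land in different 128-byte segments
    // and cannot be combined into one transaction.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * Width];
}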
Using Shared Memory
[Figure: original access pattern – each thread reads d_M and d_N directly from global memory; tiled access pattern – tiles of d_M and d_N are first copied into scratchpad (shared) memory, and the multiplication is performed with the scratchpad values]
Shared Memory Accesses
Shared memory is banked
–No coalescing
Data access patterns should be structured to avoid bank conflicts
Low-order interleaved mapping?
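A small illustrative sketch of conflict-free versus conflicting shared memory accesses (assuming the common arrangement of 32 banks with successive 32-bit words in successive banks, as on Fermi-class GPUs; kernel and array names are mine):

__global__ void bankConflictDemo(float *out) {
    __shared__ float buf[32 * 32];
    int t = threadIdx.x;                 // assume blockDim.x == 32 (one warp)
    buf[t] = (float)t;                   // stride-1 writes hit 32 different banks: no conflict
    __syncthreads();
    float a = buf[t];                    // stride-1 reads: conflict-free
    float b = buf[t * 2];                // stride-2 reads: threads t and t+16 share a bank, 2-way conflict
    out[t] = a + b;
}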
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the d_P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the d_M and d_N tiles required to compute the d_P element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of d_M and d_N tiles into shared memory
        Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
        Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    d_P[Row*Width+Col] = Pvalue;
}

The inner-product accesses come from shared memory, hence coalescing is not necessary there – but consider bank conflicts.
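For completeness, a hedged host-side sketch of how this kernel might be configured and launched (the function name launchMatrixMul is mine; it assumes Width is a multiple of TILE_WIDTH and that d_M, d_N, d_P have already been allocated and filled on the device, e.g., via cudaMalloc and cudaMemcpy):

void launchMatrixMul(float *d_M, float *d_N, float *d_P, int Width) {
    // TILE_WIDTH is the same compile-time constant used by the kernel (e.g., 16).
    // One thread per d_P element: TILE_WIDTH x TILE_WIDTH threads per block,
    // and enough blocks to cover the Width x Width output matrix.
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
}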
Coalescing Behavior
[Figure: the tiled kernel's global memory accesses – the block loads a TILE_WIDTH x TILE_WIDTH tile of d_M (starting at column m*TILE_WIDTH of row Row) and of d_N (at column Col) to compute the Pd_sub tile of d_P; consecutive threads load consecutive elements, so both tile loads coalesce]
Thread Granularity
Consider instruction bandwidth vs. memory bandwidth
Control the amount of work per thread
[Figure: the SM pipeline – fetch/decode and warp schedulers feeding the register file and SPs, backed by L1/shared memory and DRAM]
Thread Granularity Tradeoffs
Preserving instruction bandwidth (and memory bandwidth)
–Increase thread granularity
–Merge adjacent tiles: adjacent output tiles share the same d_M tile data, so one thread can compute several d_P elements (see the sketch after the next slide)
[Figure: tiled matrix multiply – adjacent Pd_sub tiles along a row of d_P reuse the same tile of d_M]
Thread Granularity Tradeoffs (2)
Impact on parallelism
–#thread blocks, #registers/thread
–Need to explore the impact: autotuning
[Figure: same tiled matrix multiply diagram as on the previous slide]
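A hedged sketch of the merged-tile idea referenced above (the names MatrixMulKernel2, Nds0, and Nds1 are mine; it assumes Width is a multiple of 2*TILE_WIDTH and that gridDim.x is halved accordingly): each thread now produces two horizontally adjacent d_P elements, so the d_M tile is loaded from global memory once but used twice.

__global__ void MatrixMulKernel2(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds [TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds0[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds1[TILE_WIDTH][TILE_WIDTH];

    int Row  = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col0 = (2*blockIdx.x) * TILE_WIDTH + threadIdx.x;  // first output column
    int Col1 = Col0 + TILE_WIDTH;                          // column in the adjacent tile

    float P0 = 0, P1 = 0;
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        Mds [threadIdx.y][threadIdx.x] = d_M[Row*Width + m*TILE_WIDTH + threadIdx.x];
        Nds0[threadIdx.y][threadIdx.x] = d_N[(m*TILE_WIDTH + threadIdx.y)*Width + Col0];
        Nds1[threadIdx.y][threadIdx.x] = d_N[(m*TILE_WIDTH + threadIdx.y)*Width + Col1];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k) {
            P0 += Mds[threadIdx.y][k] * Nds0[k][threadIdx.x];  // d_M tile loaded once,
            P1 += Mds[threadIdx.y][k] * Nds1[k][threadIdx.x];  // used for both outputs
        }
        __syncthreads();
    }
    d_P[Row*Width + Col0] = P0;
    d_P[Row*Width + Col1] = P1;
}

The per-block shared memory and per-thread register usage grow with the granularity, and the grid has half as many blocks in x, which is exactly the parallelism trade-off noted above.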
ANY MORE QUESTIONS? READ CHAPTER 6!