1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I.

2 Reading Assignment
Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6
CUDA Programming Guide – guide/#abstract

3 Objective
To understand the implications of programming model constructs on demand for execution resources
To be able to reason about performance consequences of programming model parameters
–Thread blocks, warps, memory behaviors, etc.
–Need deeper understanding of architecture to be really valuable (later)
To understand DRAM bandwidth
–Cause of the DRAM bandwidth problem
–Programming techniques that address the problem: memory coalescing, corner turning
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012

4 Formation of Warps
How do you form warps out of multidimensional arrays of threads?
–Linearize thread IDs
[Figure: Grid 1 containing Blocks (0,0), (1,0), (0,1), (1,1); Block (1,1) expanded into Threads (0,0,0) through (1,0,3); a warp is marked in a 1D thread block and in a 3D thread block]

5 Formation of Warps
[Figure: Grid 1 containing Blocks (0,0), (1,0), (0,1), (1,1); Block (1,1) expanded into its threads, for a 2D thread block and a 3D thread block]
Linear order: T 0,0,0 T 0,0,1 T 0,0,2 T 0,0,3 T 0,1,0 T 0,1,1 T 0,1,2 T 0,1,3 T 1,0,0 T 1,0,1 T 1,0,2 T 1,0,3 T 1,1,0 T 1,1,1 T 1,1,2 T 1,1,3
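
The linear order on this slide can be sketched as a small host-side C++ model. The helper names `linearThreadId` and `warpOf` are hypothetical, and the warp size of 32 is an assumption. The slide labels threads as T z,y,x, so x varies fastest.

```cpp
#include <cassert>

// Hypothetical helper: linear ID of thread (x, y, z) in a block whose
// x and y extents are dimX and dimY. x varies fastest, then y, then z.
int linearThreadId(int x, int y, int z, int dimX, int dimY) {
    assert(dimX > 0 && dimY > 0);
    return x + y * dimX + z * dimX * dimY;
}

// Warp index within the block. A warp size of 32 is assumed here.
int warpOf(int linearId, int warpSize = 32) {
    assert(linearId >= 0 && warpSize > 0);
    return linearId / warpSize;
}
```

For the 4 x 2 x 2 block drawn on the slide, T 1,0,0 (z = 1, y = 0, x = 0) gets linear ID 8, which matches its position in the linear order.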

6 Execution of Warps
Each warp is executed as a SIMD bundle
How do we handle divergent control flow among threads in a warp?
–Execution semantics
–How is it implemented? (later)

7 Reduction: Approach 1
    __shared__ float partialsum[];
    ..
    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialsum[t] += partialsum[t+stride];
    }
[Figure: reduction tree over a thread block, indexed by threadIdx.x; partial sums 0+1, 2+3, 4+5, 6+7, then 0..3, 4..7, then 0..7]

8 Reduction: Approach 2
    __shared__ float partialsum[];
    ..
    unsigned int t = threadIdx.x;
    // Start at half the block so the first step pairs thread t with t + blockDim.x/2
    for (unsigned int stride = blockDim.x/2; stride > 0; stride /= 2)
    {
        __syncthreads();
        if (t < stride)
            partialsum[t] += partialsum[t+stride];
    }
Difference is in which threads diverge!
For a thread block of 512 threads
–Threads 0-255 take the branch, 256-511 do not
For a warp size of 32, all threads in a warp have identical branch conditions → no divergence!
When #active threads < warp size → old problem
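
To see which threads diverge under each approach, a host-side C++ sketch can count the warps in which only some threads take the branch for a given stride. The function names are hypothetical, and the model only evaluates the branch conditions `t % (2*stride) == 0` and `t < stride`; it does not run on a GPU.

```cpp
#include <cassert>

// Counts warps in which some, but not all, threads satisfy the predicate.
template <typename Pred>
int divergentWarps(int blockDim, int warpSize, Pred takesBranch) {
    assert(blockDim > 0 && warpSize > 0);
    int divergent = 0;
    for (int base = 0; base < blockDim; base += warpSize) {
        int lanes = 0, taken = 0;
        for (int t = base; t < base + warpSize && t < blockDim; ++t) {
            ++lanes;
            if (takesBranch(t)) ++taken;
        }
        if (taken != 0 && taken != lanes) ++divergent;
    }
    return divergent;
}

// Approach 1: branch condition t % (2*stride) == 0.
int divergentWarpsApproach1(int blockDim, int warpSize, int stride) {
    return divergentWarps(blockDim, warpSize,
                          [=](int t) { return t % (2 * stride) == 0; });
}

// Approach 2: branch condition t < stride.
int divergentWarpsApproach2(int blockDim, int warpSize, int stride) {
    return divergentWarps(blockDim, warpSize,
                          [=](int t) { return t < stride; });
}
```

For a 512-thread block and 32-thread warps, Approach 1 at stride 1 diverges in all 16 warps, while Approach 2 at stride 256 diverges in none. Once the stride drops below the warp size (stride 16, for example), Approach 2 diverges in one warp.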

9 Global Memory (DRAM) Bandwidth
[Figure: two panels, "Ideal" and "Reality"]
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

10 DRAM Bank Organization
Each core array has about 1M bits
Each bit is stored in a tiny capacitor, made of one transistor
[Figure: DRAM bank with Row Addr, Row Decoder, Memory Cell Core Array, Sense Amps, Column Latches, Mux, Column Addr, and a Pin Interface carrying Off-chip Data; the data path is marked Wide inside the bank and Narrow at the pins]

11 A very small (8x2 bit) DRAM Bank
[Figure: 8x2-bit bank with decoder (input 011), sense amps, and mux]

12 DRAM core arrays are slow.
Reading from a cell in the core array is a very slow process
–DDR: Core speed = ½ interface speed
–DDR2/GDDR3: Core speed = ¼ interface speed
–DDR3/GDDR4: Core speed = ⅛ interface speed
–… likely to be worse in the future
[Figure: decoder line and one vertical line to the sense amps; each cell is a very small capacitance that stores a data bit; about 1000 cells are connected to each vertical line]

13 DRAM Bursting.
For DDR{2,3} SDRAM cores clocked at 1/N speed of the interface:
–Load (N × interface width) of DRAM bits from the same row at once to an internal buffer, then transfer in N steps at interface speed
–DDR2/GDDR3: buffer width = 4X interface width

14 DRAM Bursting
[Figure: 8x2-bit bank with decoder (input 010), sense amps, and mux]

15 DRAM Bursting
[Figure: 8x2-bit bank with decoder (input 011), sense amps and buffer, and mux]
“You can buy bandwidth but you can’t bribe God” -- Unknown

16 DRAM Bursting for the 8x2 Bank
[Figure: timelines for non-burst timing and burst timing; each shows address bits to decoder, core array access delay, then 2 bits to pin per transfer]
Modern DRAM systems are designed to be always accessed in burst mode. Burst bytes are transferred but discarded when accesses are not to sequential locations.
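
The burst arithmetic can be written out as a minimal host-side C++ sketch. The names `burstBits` and `burstUtilization` are hypothetical; the only rule taken from the slides is that a core clocked at 1/N of the interface speed fetches N × interface width bits per access.

```cpp
#include <cassert>

// Bits loaded from one row into the internal buffer per core access,
// for a core clocked at 1/n of the interface speed.
int burstBits(int n, int interfaceWidthBits) {
    assert(n > 0 && interfaceWidthBits > 0);
    return n * interfaceWidthBits;
}

// Fraction of a burst that the requester actually uses; the remaining
// bits are transferred and then discarded.
double burstUtilization(int usefulBits, int n, int interfaceWidthBits) {
    int total = burstBits(n, interfaceWidthBits);
    assert(usefulBits >= 0 && usefulBits <= total);
    return static_cast<double>(usefulBits) / total;
}
```

With N = 4 and a 64-bit interface, each burst carries 256 bits. An access pattern that uses only 64 of them gets a quarter of the transferred data, which is why non-sequential accesses waste bandwidth.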

17 Multiple DRAM Banks
[Figure: Bank 0 and Bank 1, each with its own decoder, sense amps, and mux; address bits 0110]

18 DRAM Bursting for the 8x2 Bank
[Figure: timelines showing address bits to decoder, core array access delay, then 2 bits to pin per transfer]
–Single-bank burst timing: dead time on interface
–Multi-bank burst timing: reduced dead time

19 First-order Look at the GPU off-chip memory subsystem
NVIDIA GTX280 GPU:
–Peak global memory bandwidth = 141.7 GB/s
Global memory (GDDR3) interface @ 1.1 GHz
–(Core speed @ 276 MHz)
–For a typical 64-bit interface, we can sustain only about 17.6 GB/s (recall DDR: 2 transfers per clock)
–We need a lot more bandwidth (141.7 GB/s), thus 8 memory channels
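
The bandwidth figures on this slide follow from a one-line formula, sketched here in host-side C++. The function name `peakBandwidthGBs` is hypothetical; the inputs (1.1 GHz interface, 2 transfers per clock, 64-bit channel, 8 channels) are the ones quoted on the slide.

```cpp
#include <cassert>

// Peak bandwidth in GB/s:
// clock (GHz) x transfers per clock x bytes per transfer x channels.
double peakBandwidthGBs(double clockGHz, int transfersPerClock,
                        int widthBits, int channels) {
    assert(clockGHz > 0.0 && transfersPerClock > 0);
    assert(widthBits > 0 && widthBits % 8 == 0 && channels > 0);
    return clockGHz * transfersPerClock * (widthBits / 8) * channels;
}
```

One 64-bit channel gives 1.1 × 2 × 8 = 17.6 GB/s, and eight channels give 140.8 GB/s. The small gap to the quoted 141.7 GB/s is presumably because the slide rounds the interface clock to 1.1 GHz.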

20 Multiple Memory Channels
Divide the memory address space into N parts
–N is number of memory channels
–Assign each portion to a channel
[Figure: Channel 0, Channel 1, Channel 2, Channel 3, each with its banks]

21 Memory Controller Organization of a Many-Core Processor
GTX280: 30 Streaming Multiprocessors (SMs) connected to 8-channel DRAM controllers through an interconnect
–DRAM controllers are interleaved
–Within DRAM controllers (channels), DRAM banks are interleaved for incoming memory requests
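
One way to picture interleaved channels and banks is a low-order mapping from address to (channel, bank), sketched below in host-side C++. Everything specific here is an assumption for illustration: the `locate` helper, the `DramLocation` type, and the interleave granularity are hypothetical, and the slides do not give the actual GTX280 mapping.

```cpp
#include <cassert>
#include <cstdint>

struct DramLocation {
    int channel;
    int bank;
};

// Hypothetical low-order interleaving: consecutive chunks of
// interleaveBytes rotate across channels first, then across banks.
DramLocation locate(std::uint64_t address, int interleaveBytes,
                    int numChannels, int numBanks) {
    assert(interleaveBytes > 0 && numChannels > 0 && numBanks > 0);
    std::uint64_t chunk = address / interleaveBytes;
    DramLocation loc;
    loc.channel = static_cast<int>(chunk % numChannels);
    loc.bank = static_cast<int>((chunk / numChannels) % numBanks);
    return loc;
}
```

Under this mapping, a stream of consecutive addresses visits every channel before returning to the first one, so sequential traffic is spread over all controllers.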

22 Lessons
Organize data accesses to maximize burst mode bandwidth
–Access consecutive locations
–Algorithmic strategies + data layout
Thread blocks issue warp-size load/store instructions
–32 addresses in Fermi
–Coalesce these accesses to create a smaller number of memory transactions → maximize memory bandwidth
–More later as we discuss microarchitecture

23 Memory Coalescing
Memory references are coalesced into a sequence of memory transactions
–Accesses to a segment are coalesced (e.g., 128-byte segments)
[Figure: a warp issues LD; opportunity to coalesce 16*4 = 64 bytes]
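
A rough way to reason about coalescing is to count how many distinct segments one warp-wide load touches. The host-side C++ sketch below does that under stated assumptions: 128-byte segments (the example size on the slide), 32 threads per warp, and 4-byte aligned accesses that never straddle a segment. The helper names are hypothetical, and real hardware rules vary by GPU generation.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Number of distinct segments touched by the addresses of one warp.
int countTransactions(const std::vector<std::uint64_t>& addresses,
                      int segmentBytes = 128) {
    assert(segmentBytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addresses) {
        segments.insert(a / segmentBytes);
    }
    return static_cast<int>(segments.size());
}

// Byte addresses for a warp in which thread i reads base + i * strideBytes.
std::vector<std::uint64_t> warpAddresses(std::uint64_t base, int strideBytes,
                                         int warpSize = 32) {
    assert(strideBytes >= 0 && warpSize > 0);
    std::vector<std::uint64_t> out;
    for (int i = 0; i < warpSize; ++i) {
        out.push_back(base + static_cast<std::uint64_t>(i) * strideBytes);
    }
    return out;
}
```

Consecutive 4-byte accesses from an aligned base need one segment, the same accesses from a base offset by 64 bytes need two, and a 128-byte stride needs one segment per thread.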

24 Implications of Memory Coalescing
Reduce the request rate to L1 and DRAM
Distinct from CPU optimizations – why?
Need to be able to re-map entries from each access back to threads
[Figure: Warp Schedulers, Register File, SP, L1/Shared Memory, DRAM; arrows labeled L1 access bandwidth and DRAM access bandwidth]

25 Placing a 2D C array into linear memory space
[Figure: 4x4 matrix M with rows M 0,0 to M 0,3 through M 3,0 to M 3,3, laid out row by row]
M linearized order in increasing address: M 0,0 M 0,1 M 0,2 M 0,3 M 1,0 M 1,1 M 1,2 M 1,3 M 2,0 M 2,1 M 2,2 M 2,3 M 3,0 M 3,1 M 3,2 M 3,3

26 Base Matrix Multiplication Kernel
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        // Calculate the row index of the d_P element and d_M
        int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
        // Calculate the column index of d_P and d_N
        int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
        d_P[Row*Width+Col] = Pvalue;
    }

27 Two Access Patterns
[Figure: (a) d_M, accessed as d_M[Row*Width+k], and (b) d_N, accessed as d_N[k*Width+Col], each WIDTH x WIDTH, with the accesses of Thread 1 and Thread 2 marked]
k is the loop counter in the inner product loop of the kernel code

28 N accesses are coalesced.
d_N[k*Width+Col]
[Figure: 4x4 matrix N and its linearized layout N 0,0 to N 3,3; the access direction in kernel code (one thread) runs down a column; across successive threads in a warp, T0 T1 T2 T3 read adjacent elements in load iteration 0 and again in load iteration 1]

29 M accesses are not coalesced.
d_M[Row*Width+k]
[Figure: 4x4 matrix M and its linearized layout M 0,0 to M 3,3; the access direction in kernel code (in a thread) runs along a row; across successive threads in a warp, T0 T1 T2 T3 read elements one row apart in load iteration 0 and again in load iteration 1]
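
The difference between the two patterns comes down to the element distance between the accesses of adjacent threads in the same loop iteration. A minimal host-side C++ sketch, with hypothetical helper names, follows the two slides: adjacent threads differ in Col for d_N and, as drawn on this slide, in Row for d_M.

```cpp
#include <cassert>

// Row-major element index, as in d_M[Row*Width+k].
int rowMajorIndex(int row, int col, int width) {
    assert(width > 0);
    return row * width + col;
}

// d_N pattern: in iteration k, adjacent threads read columns col and col+1.
int strideN(int k, int col, int width) {
    return rowMajorIndex(k, col + 1, width) - rowMajorIndex(k, col, width);
}

// d_M pattern as drawn on the slide: in iteration k, adjacent threads
// read rows row and row+1.
int strideM(int row, int k, int width) {
    return rowMajorIndex(row + 1, k, width) - rowMajorIndex(row, k, width);
}
```

The d_N stride is always 1 element, so a warp reads consecutive addresses. The d_M stride equals Width, so the addresses are spread Width elements apart and fall into different bursts for any reasonably large matrix.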

30 Using Shared Memory
[Figure: d_M and d_N (WIDTH x WIDTH) shown with the original access pattern and with the tiled access pattern]
Tiled access pattern:
–Copy into scratchpad memory
–Perform multiplication with scratchpad values

31 Shared Memory Accesses
Shared memory is banked
–No coalescing
Data access patterns should be structured to avoid bank conflicts
Low order interleaved mapping?
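
A low-order interleaved mapping can be modeled in a few lines of host-side C++. The sketch assumes 32 banks of 4-byte words with bank = word index mod 32, and it treats reads of the same word as conflict-free. These are assumptions for illustration, since bank counts and rules differ across GPU generations, and the name `conflictDegree` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>

// Worst-case number of distinct words mapped to a single bank when
// thread i of a warp reads the word at index i * strideWords.
// A result of 1 means the access is conflict-free under this model.
int conflictDegree(int strideWords, int warpSize = 32, int numBanks = 32) {
    assert(strideWords >= 0 && warpSize > 0 && numBanks > 0);
    std::map<int, std::set<int>> wordsPerBank;
    for (int i = 0; i < warpSize; ++i) {
        int word = i * strideWords;
        wordsPerBank[word % numBanks].insert(word);
    }
    int worst = 0;
    for (const auto& entry : wordsPerBank) {
        worst = std::max(worst, static_cast<int>(entry.second.size()));
    }
    return worst;
}
```

Under this model a stride of 1 word is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 32 sends every thread to the same bank. An odd stride such as 33 is conflict-free again.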

32
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
        int bx = blockIdx.x; int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;
        // Identify the row and column of the d_P element to work on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;
        float Pvalue = 0;
        // Loop over the d_M and d_N tiles required to compute the d_P element
        for (int m = 0; m < Width/TILE_WIDTH; ++m) {
            // Collaborative loading of d_M and d_N tiles into shared memory
            Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH+tx];
            Nds[ty][tx] = d_N[(m*TILE_WIDTH+ty)*Width + Col];
            __syncthreads();
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();
        }
        d_P[Row*Width+Col] = Pvalue;
    }
Accesses from shared memory, hence coalescing is not necessary
Consider bank conflicts
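
The loop structure of the tiled kernel can be checked on the host without a GPU. The C++ sketch below is a sequential emulation, not the kernel itself: the loops over (by, bx) stand in for thread blocks, the loops over (ty, tx) stand in for threads, and the Mds and Nds vectors stand in for the shared-memory tiles. The function names and the test matrices are hypothetical, and Width is assumed to be a multiple of the tile size.

```cpp
#include <cassert>
#include <vector>

// Reference: P = M * N for width x width row-major matrices.
std::vector<float> matMulNaive(const std::vector<float>& M,
                               const std::vector<float>& N, int width) {
    std::vector<float> P(width * width, 0.0f);
    for (int row = 0; row < width; ++row)
        for (int col = 0; col < width; ++col)
            for (int k = 0; k < width; ++k)
                P[row * width + col] += M[row * width + k] * N[k * width + col];
    return P;
}

// Sequential emulation of the tiled kernel's load and compute phases.
std::vector<float> matMulTiled(const std::vector<float>& M,
                               const std::vector<float>& N,
                               int width, int tile) {
    assert(tile > 0 && width % tile == 0);
    std::vector<float> P(width * width, 0.0f);
    std::vector<float> Mds(tile * tile), Nds(tile * tile);
    for (int by = 0; by < width / tile; ++by)
        for (int bx = 0; bx < width / tile; ++bx)
            for (int m = 0; m < width / tile; ++m) {
                // Load phase: every "thread" copies one element of each tile.
                for (int ty = 0; ty < tile; ++ty)
                    for (int tx = 0; tx < tile; ++tx) {
                        int row = by * tile + ty;
                        int col = bx * tile + tx;
                        Mds[ty * tile + tx] = M[row * width + m * tile + tx];
                        Nds[ty * tile + tx] = N[(m * tile + ty) * width + col];
                    }
                // Compute phase: partial dot product from the tiles only.
                for (int ty = 0; ty < tile; ++ty)
                    for (int tx = 0; tx < tile; ++tx) {
                        float partial = 0.0f;
                        for (int k = 0; k < tile; ++k)
                            partial += Mds[ty * tile + k] * Nds[k * tile + tx];
                        P[(by * tile + ty) * width + bx * tile + tx] += partial;
                    }
            }
    return P;
}

// Compares both versions on small integer-valued inputs, so the float
// results are exact and can be compared with ==.
bool tiledMatchesNaive(int width, int tile) {
    std::vector<float> M(width * width), N(width * width);
    for (int i = 0; i < width * width; ++i) {
        M[i] = static_cast<float>(i % 7);
        N[i] = static_cast<float>(i % 5 - 2);
    }
    return matMulNaive(M, N, width) == matMulTiled(M, N, width, tile);
}
```

The two sequential phases per tile correspond to the code on either side of the first `__syncthreads()` in the kernel: all loads into the tiles finish before any thread reads them.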

33 Coalescing Behavior
[Figure: tiled multiplication of d_M and d_N into d_P; labels Pd sub, TILE_WIDTH, WIDTH, m*TILE_WIDTH, Row, Col]

34 Thread Granularity
Consider instruction bandwidth vs. memory bandwidth
Control amount of work per thread
[Figure: Fetch/Decode, Warp Schedulers, Register File, SP, L1/Shared Memory, DRAM]

35 Thread Granularity Tradeoffs
[Figure: tiled multiplication of d_M and d_N into d_P; labels Pd sub, TILE_WIDTH, WIDTH, m*TILE_WIDTH, Row, Col]
Preserving instruction bandwidth (memory bandwidth)
–Increase thread granularity
–Merge adjacent tiles: sharing tile data
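
The saving from merging adjacent tiles can be estimated by counting tile loads per phase. The host-side C++ sketch below assumes that horizontally adjacent d_P tiles are merged, so they share the same d_M tile and differ only in their d_N tile. The function names and the `merged` parameter are hypothetical.

```cpp
#include <cassert>

// Tiles fetched from global memory per phase when one block computes
// `merged` horizontally adjacent d_P tiles: one shared d_M tile plus
// one d_N tile per output tile.
int tileLoadsMerged(int merged) {
    assert(merged >= 1);
    return 1 + merged;
}

// The same output tiles computed by separate blocks: each block loads
// its own d_M tile and its own d_N tile.
int tileLoadsSeparate(int merged) {
    assert(merged >= 1);
    return 2 * merged;
}

// Ratio of global-memory tile loads, merged versus separate.
double loadRatio(int merged) {
    return static_cast<double>(tileLoadsMerged(merged)) /
           tileLoadsSeparate(merged);
}
```

Merging two tiles needs 3 tile loads per phase instead of 4, a reduction of one quarter. The cost is more registers and shared memory per block, which is the parallelism tradeoff raised on the next slide.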

36 Thread Granularity Tradeoffs (2)
[Figure: tiled multiplication of d_M and d_N into d_P; labels Pd sub, TILE_WIDTH, WIDTH, m*TILE_WIDTH, Row, Col]
Impact on parallelism
–#TBs, #registers/thread
–Need to explore impact → autotuning

