ECE 8823A GPU Architectures, Module 5: Execution and Resources - I
Reading Assignment: Kirk and Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Chapter 6; CUDA Programming Guide – guide/#abstract
Objective
To understand the implications of programming model constructs on the demand for execution resources
To be able to reason about the performance consequences of programming model parameters
–Thread blocks, warps, memory behaviors, etc.
–A deeper understanding of the architecture is needed for this to be really valuable (later)
To understand DRAM bandwidth
–Causes of the DRAM bandwidth problem
–Programming techniques that address the problem: memory coalescing, corner turning
Formation of Warps
How do you form warps out of multidimensional arrays of threads?
–Linearize thread IDs
[Figure: a 1D thread block and a 3D thread block within a grid of blocks; in each case the thread IDs are linearized and consecutive threads are grouped into warps]
Formation of Warps (continued)
[Figure: the threads of a 2D/3D thread block laid out in linear order: T0,0,0 T0,0,1 T0,0,2 T0,0,3 T0,1,0 T0,1,1 T0,1,2 T0,1,3 T1,0,0 T1,0,1 T1,0,2 T1,0,3 T1,1,0 T1,1,1 T1,1,2 T1,1,3 – the x index varies fastest, then y, then z]
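As a concrete illustration of this linearization (not from the slides; the helper names linearThreadId and warpId are mine, and the warp size of 32 is the value on current NVIDIA GPUs):

// Hypothetical helpers illustrating how thread IDs are linearized within a block.
// Threads are ordered with x varying fastest, then y, then z; consecutive groups
// of 32 linearized IDs form a warp.
__device__ unsigned int linearThreadId() {
    return threadIdx.z * blockDim.y * blockDim.x
         + threadIdx.y * blockDim.x
         + threadIdx.x;
}
__device__ unsigned int warpId() {
    return linearThreadId() / 32;   // which warp of the block this thread belongs to
}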
Execution of Warps
Each warp is executed as a SIMD bundle
How do we handle divergent control flow among threads in a warp?
–Execution semantics
–How is it implemented? (later)
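A minimal sketch of branch divergence (illustrative only; the kernel and array name out are mine): when threads of the same warp evaluate a condition differently, the hardware executes the two paths one after the other.

__global__ void divergentKernel(int *out) {
    // Even-numbered threads of each warp take the 'if' path and odd-numbered
    // threads take the 'else' path, so the warp serializes the two paths,
    // masking off the non-participating threads in each.
    if (threadIdx.x % 2 == 0)
        out[threadIdx.x] = 1;
    else
        out[threadIdx.x] = 2;
}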
Reduction: Approach 1

__shared__ float partialSum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

[Figure: the Approach-1 reduction tree over threadIdx.x within a thread block – in each step, the thread whose index is a multiple of 2*stride adds in the partial sum that is stride elements away]
Reduction: Approach 2

__shared__ float partialSum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x/2; stride >= 1; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}

The difference is in which threads diverge!
For a thread block of 512 threads
–In each step, the threads with t < stride take the branch and the rest do not
For a warp size of 32, as long as stride is at least 32, all threads in a warp have identical branch conditions – no divergence!
When #active threads < warp size, the old problem returns
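To make the fragments above concrete, here is a minimal sketch of a complete block-level reduction kernel built around the Approach-2 loop; the kernel name blockReduce, the arrays input and blockSums, and the BLOCK_SIZE of 512 are assumptions for this sketch, and the course/textbook code may differ in details.

#define BLOCK_SIZE 512   // assumed launch configuration: BLOCK_SIZE threads per block

// Each block reduces BLOCK_SIZE consecutive input elements and writes one partial sum.
__global__ void blockReduce(const float *input, float *blockSums, int n) {
    __shared__ float partialSum[BLOCK_SIZE];
    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + t;
    partialSum[t] = (i < n) ? input[i] : 0.0f;       // load one element per thread
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride)                              // active threads stay contiguous
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0)
        blockSums[blockIdx.x] = partialSum[0];       // one partial sum per block
}

A second kernel launch (or a small host-side loop) would then reduce the per-block sums in blockSums to the final value.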
Global Memory (DRAM) Bandwidth
[Figure: the ideal view of global memory bandwidth versus the reality of the DRAM system that delivers it]
DRAM Bank Organization
Each core array has about 1M bits
Each bit is stored in a tiny capacitor accessed through one transistor
[Figure: a DRAM bank – the row address drives a row decoder that selects a row of the memory cell core array; the selected row is read by the sense amps into column latches, and a mux driven by the column address narrows the wide internal data path to the narrow off-chip pin interface]
A Very Small (8x2-bit) DRAM Bank
[Figure: an 8x2-bit bank – a 3-bit row address (here 011) is decoded to select one row; the two bits of that row are captured by the sense amps, and a mux selects which bit is driven onto the data pin]
DRAM core arrays are slow.
Reading from a cell in the core array is a very slow process
–DDR: core speed = ½ interface speed
–DDR2/GDDR3: core speed = ¼ interface speed
–DDR3/GDDR4: core speed = ⅛ interface speed
–… likely to be worse in the future
[Figure: one bit line of the core array – each cell is a very small capacitance that stores a data bit, with about 1000 cells connected to each vertical (bit) line feeding the sense amps]
DRAM Bursting
For DDR{2,3} SDRAM cores clocked at 1/N the speed of the interface:
–Load (N × interface width) DRAM bits from the same row at once into an internal buffer, then transfer them in N steps at interface speed
–DDR2/GDDR3: buffer width = 4× interface width
DRAM Bursting (step 1)
[Figure: the row selected by address 010 is read from the core array into the sense amps; the mux sends the first pair of bits over the interface]
DRAM Bursting (step 2)
[Figure: the remaining bits of the same row (address 011) are delivered from the sense amps and buffer through the mux, without another core-array access]
"You can buy bandwidth but you can't bribe God" -- Unknown
DRAM Bursting for the 8x2 Bank
[Figure: timing comparison – non-burst timing pays the core-array access delay before every 2 bits sent to the pins; burst timing pays it once and then streams successive 2-bit beats back to back]
Modern DRAM systems are designed to always be accessed in burst mode. Burst bytes are transferred but discarded when accesses are not to sequential locations.
Multiple DRAM Banks
[Figure: two banks (Bank 0 and Bank 1), each with its own row decoder, sense amps, and mux, sharing the interface; address bits select which bank services an access]
DRAM Bursting for the 8x2 Bank (continued)
[Figure: timing comparison – with a single bank, the interface sits idle (dead time) during each core-array access delay; with multiple banks, accesses overlap and the dead time on the interface is reduced]
First-order Look at the GPU Off-chip Memory Subsystem
NVIDIA GTX280 GPU:
–Peak global memory bandwidth = 141.7 GB/s
–Global memory (GDDR3) interface clocked at 1.1 GHz (DRAM core at 276 MHz)
–For a typical 64-bit interface, we can sustain only about 17.6 GB/s (recall DDR: 2 transfers per clock)
–We need a lot more bandwidth (141.7 GB/s) – thus 8 memory channels
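As a back-of-the-envelope check of these numbers (not from the slides):
per 64-bit channel: 1.1 GHz × 2 transfers/clock × 8 bytes/transfer ≈ 17.6 GB/s
8 such channels (a 512-bit interface): 8 × 17.6 GB/s ≈ 141 GB/s, which matches the quoted 141.7 GB/s peak once the precise ~1107 MHz memory clock is used.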
Multiple Memory Channels
Divide the memory address space into N parts
–N is the number of memory channels
–Assign each portion to a channel
[Figure: banks of the address space interleaved across Channel 0, Channel 1, Channel 2, and Channel 3]
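A small illustrative sketch of such a mapping (assumptions: 4 channels and a 256-byte interleaving granularity; real GPUs pick these parameters differently and may hash address bits):

// Hypothetical low-order interleaved mapping of addresses to channels.
// With 4 channels and 256-byte interleaving, consecutive 256-byte chunks of the
// address space go to channels 0,1,2,3,0,1,2,3,...
enum { NUM_CHANNELS = 4, INTERLEAVE_BYTES = 256 };

unsigned int channelOf(unsigned long long addr) {
    return (unsigned int)((addr / INTERLEAVE_BYTES) % NUM_CHANNELS);
}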
Memory Controller Organization of a Many-Core Processor
GTX280: 30 Streaming Multiprocessors (SMs) connected to 8-channel DRAM controllers through an interconnect
–DRAM controllers are interleaved
–Within DRAM controllers (channels), DRAM banks are interleaved for incoming memory requests
Lessons
Organize data accesses to maximize burst-mode bandwidth
–Access consecutive locations
–Algorithmic strategies + data layout
Thread blocks issue warp-sized load/store instructions
–32 addresses in Fermi
–Coalesce these accesses to create a smaller number of memory transactions and maximize memory bandwidth
–More later as we discuss the microarchitecture
Memory Coalescing
Memory references are coalesced into a sequence of memory transactions
–Accesses within a segment are coalesced (e.g., 128-byte segments)
[Figure: a warp issuing a load (LD); 16 threads × 4 bytes = 64 bytes is an opportunity to coalesce into a single transaction]
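As a minimal sketch (the kernel and the array names in and out are mine), a load in which consecutive threads of a warp read consecutive 4-byte words coalesces into very few wide transactions:

__global__ void coalescedCopy(const float *in, float *out) {
    // Thread k of each warp reads in[base + k]: 32 consecutive 4-byte words fall
    // into one aligned 128-byte segment, so the warp's load can be serviced by a
    // single memory transaction.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}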
Implications of Memory Coalescing
Reduces the request rate to the L1 and DRAM
Distinct from CPU optimizations – why?
Need to be able to re-map entries from each access back to threads
[Figure: the SM pipeline – warp schedulers and the register file feeding the SPs, backed by L1/shared memory and DRAM, each level with its own access bandwidth]
Placing a 2D C Array into Linear Memory Space
[Figure: a 4x4 matrix M laid out in row-major (linearized) order of increasing address: M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3]
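The linearized index follows the usual row-major rule; a one-line helper (illustrative, the name linearIndex is mine) makes the mapping used in the kernels below explicit:

// Row-major mapping of a 2D index into the 1D allocation: element (row, col)
// of a Width-wide matrix lives at offset row*Width + col.
__host__ __device__ int linearIndex(int row, int col, int Width) {
    return row * Width + col;
}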
Base Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and of d_M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of d_P and of d_N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

    d_P[Row*Width+Col] = Pvalue;
}
Two Access Patterns
[Figure: (a) d_M[Row*Width+k] walks along a row of d_M, (b) d_N[k*Width+Col] walks down a column of d_N; k is the loop counter of the inner-product loop in the kernel code, and WIDTH is the matrix dimension]
N accesses are coalesced.
[Figure: for d_N[k*Width+Col], successive threads T0–T3 of a warp access consecutive elements of one row of N in each load iteration (iteration 0: N0,0–N0,3; iteration 1: N1,0–N1,3), while each individual thread moves down a column across iterations – within any one iteration the warp touches adjacent addresses, so the accesses coalesce]
M accesses are not coalesced.
[Figure: for d_M[Row*Width+k], successive threads T0–T3 of a warp access elements in the same column of M in each load iteration (iteration 0: M0,0 M1,0 M2,0 M3,0; iteration 1: M0,1 M1,1 M2,1 M3,1), which are Width elements apart in the linearized layout – the warp's accesses are strided and do not coalesce]
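For contrast with the coalesced copy sketched earlier, a hypothetical kernel (names are mine) in which consecutive threads read elements that are Width floats apart, the d_M-style pattern above, cannot coalesce:

__global__ void stridedRead(const float *in, float *out, int Width) {
    // Thread k of a warp reads in[k*Width]: neighboring threads touch addresses
    // 4*Width bytes apart, so their accesses land in different 128-byte segments
    // and cannot be combined into one transaction.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * Width];
}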
Using Shared Memory
[Figure: original access pattern – each thread reads d_M and d_N directly from global memory; tiled access pattern – tiles of d_M and d_N are first copied into scratchpad (shared) memory, and the multiplication is performed with the scratchpad values]
Shared Memory Accesses
Shared memory is banked
–No coalescing
Data access patterns should be structured to avoid bank conflicts
Low-order interleaved mapping?
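A small illustrative sketch of conflict-free versus conflicting shared memory accesses (assuming the common arrangement of 32 banks with successive 32-bit words in successive banks, as on Fermi-class GPUs; kernel and array names are mine):

__global__ void bankConflictDemo(float *out) {
    __shared__ float buf[32 * 32];
    int t = threadIdx.x;                 // assume blockDim.x == 32 (one warp)
    buf[t] = (float)t;                   // stride-1 writes hit 32 different banks: no conflict
    __syncthreads();
    float a = buf[t];                    // stride-1 reads: conflict-free
    float b = buf[t * 2];                // stride-2 reads: threads t and t+16 share a bank, 2-way conflict
    out[t] = a + b;
}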
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the d_P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the d_M and d_N tiles required to compute the d_P element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of d_M and d_N tiles into shared memory
        Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
        Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    d_P[Row*Width+Col] = Pvalue;
}

The inner-product accesses come from shared memory, hence coalescing is not necessary there – but consider bank conflicts.
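For completeness, a hedged host-side sketch of how this kernel might be configured and launched (the function name launchMatrixMul is mine; it assumes Width is a multiple of TILE_WIDTH and that d_M, d_N, d_P have already been allocated and filled on the device, e.g., via cudaMalloc and cudaMemcpy):

void launchMatrixMul(float *d_M, float *d_N, float *d_P, int Width) {
    // TILE_WIDTH is the same compile-time constant used by the kernel (e.g., 16).
    // One thread per d_P element: TILE_WIDTH x TILE_WIDTH threads per block,
    // and enough blocks to cover the Width x Width output matrix.
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
}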
Coalescing Behavior
[Figure: the tiled kernel's global memory accesses – the block loads a TILE_WIDTH x TILE_WIDTH tile of d_M (starting at column m*TILE_WIDTH of row Row) and of d_N (at column Col) to compute the Pd_sub tile of d_P; consecutive threads load consecutive elements, so both tile loads coalesce]
Thread Granularity
Consider instruction bandwidth vs. memory bandwidth
Control the amount of work per thread
[Figure: the SM pipeline – fetch/decode and warp schedulers feeding the register file and SPs, backed by L1/shared memory and DRAM]
Thread Granularity Tradeoffs
Preserving instruction bandwidth (and memory bandwidth)
–Increase thread granularity
–Merge adjacent tiles: adjacent output tiles share the same d_M tile data, so one thread can compute several d_P elements (see the sketch after the next slide)
[Figure: tiled matrix multiply – adjacent Pd_sub tiles along a row of d_P reuse the same tile of d_M]
Thread Granularity Tradeoffs (2)
Impact on parallelism
–#thread blocks, #registers/thread
–Need to explore the impact: autotuning
[Figure: same tiled matrix multiply diagram as on the previous slide]
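A hedged sketch of the merged-tile idea referenced above (the names MatrixMulKernel2, Nds0, and Nds1 are mine; it assumes Width is a multiple of 2*TILE_WIDTH and that gridDim.x is halved accordingly): each thread now produces two horizontally adjacent d_P elements, so the d_M tile is loaded from global memory once but used twice.

__global__ void MatrixMulKernel2(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds [TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds0[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds1[TILE_WIDTH][TILE_WIDTH];

    int Row  = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col0 = (2*blockIdx.x) * TILE_WIDTH + threadIdx.x;  // first output column
    int Col1 = Col0 + TILE_WIDTH;                          // column in the adjacent tile

    float P0 = 0, P1 = 0;
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        Mds [threadIdx.y][threadIdx.x] = d_M[Row*Width + m*TILE_WIDTH + threadIdx.x];
        Nds0[threadIdx.y][threadIdx.x] = d_N[(m*TILE_WIDTH + threadIdx.y)*Width + Col0];
        Nds1[threadIdx.y][threadIdx.x] = d_N[(m*TILE_WIDTH + threadIdx.y)*Width + Col1];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k) {
            P0 += Mds[threadIdx.y][k] * Nds0[k][threadIdx.x];  // d_M tile loaded once,
            P1 += Mds[threadIdx.y][k] * Nds1[k][threadIdx.x];  // used for both outputs
        }
        __syncthreads();
    }
    d_P[Row*Width + Col0] = P0;
    d_P[Row*Width + Col1] = P1;
}

The per-block shared memory and per-thread register usage grow with the granularity, and the grid has half as many blocks in x, which is exactly the parallelism trade-off noted above.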
ANY MORE QUESTIONS? READ CHAPTER 6!