ECE 8823A GPU Architectures Module 5: Execution and Resources - I
Reading Assignment: Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6; CUDA Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
Objective: To understand the implications of programming model constructs on the demand for execution resources. To be able to reason about the performance consequences of programming model parameters (thread blocks, warps, memory behaviors, etc.) — a deeper understanding of the architecture is needed for this to be really valuable (later). To understand DRAM bandwidth: the cause of the DRAM bandwidth problem, and the programming techniques that address it — memory coalescing and corner turning. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
Closer Look: Formation of Warps. How do you form warps out of multidimensional arrays of threads? Linearize the thread IDs. [Figure: a 1D thread block split directly into warps, and a 3D thread block — Block (1,1) of a 2x2 grid — with threads (0,0,0) through (1,0,3).]
Formation of Warps. [Figure: 2D and 3D thread blocks — Block (1,1) of a 2x2 grid — placed in linear order: T0,0,0 T0,0,1 T0,0,2 T0,0,3 T0,1,0 T0,1,1 T0,1,2 T0,1,3 T1,0,0 T1,0,1 T1,0,2 T1,0,3 T1,1,0 T1,1,1 T1,1,2 T1,1,3.]
Mapping Thread Blocks to Warps. An example with a warp size of 16 threads: follow row-major order through the z-dimension — linearize, then split into warps, so T0,0 through T3,3 form Warp 0 and T4,0 through T7,3 form Warp 1. This understanding becomes important when optimizing global memory accesses.
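As a sketch of the linearization rule (threadIdx.x varies fastest, then threadIdx.y, then threadIdx.z) — linearThreadId and warpIdInBlock are illustrative helper names, not CUDA built-ins, while warpSize is a CUDA built-in variable:

__device__ unsigned int linearThreadId() {
  // Row-major linearization: x varies fastest, then y, then z
  return threadIdx.z * blockDim.y * blockDim.x
       + threadIdx.y * blockDim.x
       + threadIdx.x;
}

__device__ unsigned int warpIdInBlock() {
  return linearThreadId() / warpSize;  // consecutive linear IDs form a warp
}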
Execution of Warps. Each warp is executed as a SIMD bundle. How do we handle divergent control flow among threads in a warp? What are the execution semantics? How is it implemented? (later) How can we optimize against it? © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
Impact of Control Divergence. Divergence occurs within a warp: branches lead to serialization of the branch-dependent code, e.g., if(…) {…} else {…}. Performance issue: low warp utilization — threads idle while the other path executes, until the paths reconverge. [Figure: idle threads during serialized if/else execution, followed by reconvergence.]
Causes. Traditional nested branches. Loops with a variable number of iterations per thread (loop condition based on thread ID?). Switching on the thread ID, e.g., if (threadIdx.x > 5) {…}. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
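A minimal kernel fragment illustrating these causes (the kernel and buffer names are illustrative, not from the slides):

__global__ void divergenceDemo(int* out, const int* iters) {
  // Cause 1: branching on the thread ID — within warp 0, threads 0-5
  // and threads 6-31 take different paths and are serialized.
  if (threadIdx.x > 5)
    out[threadIdx.x] = 1;
  else
    out[threadIdx.x] = 2;

  // Cause 2: a per-thread loop trip count — the warp keeps issuing the
  // loop body until its slowest thread finishes, while the rest idle.
  int sum = 0;
  for (int i = 0; i < iters[threadIdx.x]; ++i)
    sum += i;
  out[threadIdx.x] += sum;
}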
Control Divergence Mitigation: Algorithmic Approach. Goal: the flexibility of MIMD control flow + the benefits of SIMD execution. Can algorithmic techniques maximize the utilization achieved by a warp?
Reduction. A commonly used strategy for processing large input data sets where there is no required order of processing elements (the operator is associative and commutative): partition the data set into smaller chunks, have each thread process a chunk, then use a reduction tree to summarize the results from each chunk into the final answer. We will focus on the reduction tree step for now. Google and Hadoop MapReduce frameworks are examples of this pattern. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012
A parallel reduction tree algorithm performs N-1 operations in log(N) steps. [Figure: max-reduction of 3 1 7 0 4 1 6 3 — step 1 yields 3 7 4 6, step 2 yields 7 6, step 3 yields 7.] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012
Reduction: Approach 1.

extern __shared__ float partialsum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
  __syncthreads();
  if (t % (2*stride) == 0)
    partialsum[t] += partialsum[t + stride];
}

[Figure: data in shared memory; threads 0, 2, 4, 6 compute 0+1, 2+3, 4+5, 6+7; then threads 0 and 4 compute 0..3 and 4..7; then thread 0 computes 0..7.] O(N) additions, and therefore work efficient? Hardware efficiency? In every iteration the active threads (t % (2*stride) == 0) are spread out across the block, so most warps diverge. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
A Better Strategy. Principle: shift the index usage to ensure high thread utilization within a warp — remap thread indices so the active threads stay consecutive. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012
An Example of 16 Threads. [Figure: for 32 elements, each active thread t (0..15) adds element t+16 to element t — 0+16 through 15+31. The active threads are consecutive, so there is no divergence.] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012
Reduction: Approach 2.

extern __shared__ float partialsum[];
...
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x/2; stride > 0; stride /= 2) {
  __syncthreads();
  if (t < stride)
    partialsum[t] += partialsum[t + stride];
}

The difference is in which threads diverge! For a thread block of 512 threads: in the first iteration, threads 0-255 take the branch and threads 256-511 do not. For a warp size of 32, all threads in a warp have identical branch conditions — no divergence! Once the number of active threads drops below the warp size, the old problem returns. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
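Putting the pieces together, a minimal sketch of a complete Approach 2 reduction kernel (assumes blockDim.x is a power of two and one input element per thread; reduceSum, blockSums, and BLOCK_SIZE are illustrative names, not from the slides):

#define BLOCK_SIZE 512

__global__ void reduceSum(const float* in, float* blockSums, int n) {
  __shared__ float partialsum[BLOCK_SIZE];
  unsigned int t = threadIdx.x;
  unsigned int i = blockIdx.x * blockDim.x + t;
  partialsum[t] = (i < n) ? in[i] : 0.0f;  // load one element per thread
  __syncthreads();
  for (unsigned int stride = blockDim.x/2; stride > 0; stride /= 2) {
    if (t < stride)
      partialsum[t] += partialsum[t + stride];
    __syncthreads();  // outside the if, so every thread reaches it
  }
  if (t == 0)
    blockSums[blockIdx.x] = partialsum[0];  // one partial sum per block
}

The per-block partial sums can then be reduced by a second kernel launch (or on the host).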
Global Memory Bandwidth. How can we map thread access patterns to global memory addresses to maximize bandwidth utilization? We need to understand the organization of DRAMs — a hierarchy of latencies. © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
Basic Organization. [Figure: a DRAM core array — example: 32x32 = 1024-bit array — with a row decoder, sense amps and buffer, a column mux, and I/O pins.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
1Gb Micron DDR2 SDRAM. [Figure: datasheet timing — row access time and column access time.]
Technology Trends. Over the past two decades, DRAM data rates have increased ~1000x while RAS/CAS latencies have decreased by only ~56%. How? Increasing burst length. Courtesy: Synopsys DesignWare Technical Bulletin
DRAM Bursting for an 8x2 Bank. [Figure: address bits go to the decoder; the core array delivers 2 bits at a time to the pins. Non-burst timing pays the core array access delay on every transfer; burst timing pays it once per row and then streams from the row buffer.] Modern DRAM systems are designed to always be accessed in burst mode. Burst bytes are transferred, but discarded when accesses are not to sequential locations. ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Multiple DRAM Banks. [Figure: two banks — Bank 0 and Bank 1 — each with its own decoders, core array, sense amps, and mux, sharing the interface.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
DRAM Bursting for the 8x2 Bank, Continued. [Figure: single-bank burst timing leaves dead time on the interface while the core array is accessed; interleaving bursts from multiple banks reduces the dead time.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
First-order Look at the GPU Off-chip Memory Subsystem. NVIDIA V100 Volta GPU: peak global memory bandwidth = 900 GB/s over a 4096-bit global memory (HBM2) interface. Prior-generation GPUs (e.g., Kepler): 384-bit GDDR5 at 224 GB/s. ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
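As a sanity check on the arithmetic: a 4096-bit interface is 512 bytes wide, so a peak of 900 GB/s corresponds to a per-pin data rate of roughly 900/512 ≈ 1.76 Gb/s (the per-pin rate here is inferred from the peak figure, not a quoted spec).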
Multiple Memory Channels. Divide the memory address space into N parts, where N is the number of memory channels, and assign each portion to a channel. [Figure: banks distributed across Channel 0 through Channel 3.] “You can buy bandwidth but you can’t bribe God” -- Unknown. ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Lessons. Organize data accesses to maximize burst-mode bandwidth: access consecutive locations, via algorithmic strategies + data layout. Thread blocks issue warp-size load/store instructions — 32 addresses for a warp size of 32. Coalesce these accesses into a smaller number of memory transactions to maximize memory bandwidth. More later as we discuss the microarchitecture.
Memory Coalescing. The memory references of a warp’s loads/stores are coalesced into a sequence of memory transactions: accesses that fall within the same segment (e.g., a 128-byte segment) are coalesced. The ability and extent of coalescing depends on the compute capability.
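A sketch contrasting the two extremes (copyCoalesced and copyStrided are illustrative kernels, not from the slides):

__global__ void copyCoalesced(float* dst, const float* src) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  dst[i] = src[i];  // thread k of a warp touches address base + 4k:
                    // one transaction per 128-byte segment
}

__global__ void copyStrided(float* dst, const float* src, int stride) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  dst[i] = src[i];  // addresses 4*stride bytes apart: for large strides,
                    // one transaction per thread
}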
Implications of Memory Coalescing. Coalescing reduces the request rate to the L1 and DRAM. This is distinct from CPU optimizations — why? The hardware needs to be able to re-map the entries returned by each access back to threads. [Figure: SM with warp schedulers, SPs, and register file; L1/shared memory access bandwidth; DRAM access bandwidth.]
Placing a 2D C array into linear memory space. [Figure: the rows of a 2D C array are placed one after another in linear memory, in row-major order of increasing address.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
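The mapping itself is just row-major index arithmetic; as a sketch (idx2d is an illustrative helper, not from the slides):

__host__ __device__ inline int idx2d(int row, int col, int width) {
  // Row 'row' starts at offset row*width; 'col' selects within the row
  return row * width + col;
}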
Base Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
  float Pvalue = 0;
  // each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
  d_P[Row*Width+Col] = Pvalue;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
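A host-side launch sketch for this kernel, assuming Width is a multiple of TILE_WIDTH and that d_M, d_N, and d_P have already been allocated and populated on the device:

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);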
Two Access Patterns. Let’s look at the two access patterns in the kernel: (a) d_M[Row*Width+k] and (b) d_N[k*Width+Col], where k is the loop counter in the inner-product loop of the kernel code. [Figure: Thread 1 and Thread 2 traverse (a) rows of d_M and (b) columns of d_N; both matrices are Width x Width.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
N accesses are coalesced. For d_N[k*Width+Col], the access direction in the kernel code (for one thread) is down a column, but across successive threads in a warp each load iteration falls in one row: in iteration 0, T0..T3 load N0,0 N0,1 N0,2 N0,3; in iteration 1, they load N1,0 N1,1 N1,2 N1,3. Consecutive threads access consecutive addresses. ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
M accesses are not coalesced. For d_M[Row*Width+k], the access direction in the kernel code (within a thread) is along a row, but across successive threads in a warp each load iteration touches a column: in iteration 0, T0..T3 load M0,0 M1,0 M2,0 M3,0; in iteration 1, they load M0,1 M1,1 M2,1 M3,1. Successive threads access addresses a full row (Width elements) apart, which leads to many distinct memory transactions for accessing d_M.
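A worked example in the figures’ simplified view (successive threads assigned successive Row values for d_M and successive Col values for d_N), with Width = 4 and iteration k = 0: for d_N, threads T0..T3 access element offsets k*Width+Col = 0, 1, 2, 3 — consecutive, so one transaction; for d_M, they access Row*Width+k = 0, 4, 8, 12 — a full row apart, so separate transactions.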
Using Shared Memory. [Figure: original access pattern vs. tiled access pattern — copy a tile of d_N into scratchpad (shared) memory, then perform the multiplication with the scratchpad values.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Shared Memory Accesses. Shared memory is banked — there is no coalescing. Data access patterns should be structured to avoid bank conflicts. Low-order interleaved mapping across banks?
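A common mitigation sketch: pad the innermost dimension of a shared-memory tile by one element so that column walks no longer land in the same bank (assumes the usual 32 four-byte-wide banks with low-order interleaving; TILE is an illustrative name):

#define TILE 32

// Declared inside the kernel:
__shared__ float tile[TILE][TILE + 1];  // one padding column
// Without padding, tile[k][c] for k = 0..31 strides by 32 floats, so every
// access maps to the same bank (a 32-way conflict). The extra column makes
// the stride 33 floats, shifting each row to a different bank.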
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  int bx = blockIdx.x; int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;
  // Identify the row and column of the d_P element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;
  // Loop over the d_M and d_N tiles required to compute the d_P element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of d_M and d_N tiles into shared memory
    Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
    Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  d_P[Row*Width+Col] = Pvalue;
}

Accesses in the inner loop are from shared memory, hence coalescing is not necessary — but consider bank conflicts.
Coalescing Behavior. [Figure: in the tiled kernel, both tile loads are coalesced — consecutive threads (tx) read consecutive addresses of the d_M row at column offset m*TILE_WIDTH and of the d_N row m*TILE_WIDTH+ty at column Col; each block writes one TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub of d_P.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Thread Granularity. Consider instruction bandwidth vs. memory bandwidth. Control the amount of work per thread. [Figure: SM pipeline — fetch/decode, warp schedulers, SPs, register file, L1/shared memory, DRAM.]
Thread Granularity Tradeoffs. To preserve instruction bandwidth (and memory bandwidth), increase thread granularity: merge adjacent tiles so that each thread computes several d_P elements while sharing the d_M tile data, as sketched below. [Figure: adjacent Pdsub tiles of d_P reuse the same d_M tile with different d_N tiles.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
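A minimal sketch of the merged-tile inner loop (Nds0, Nds1, Pvalue0, and Pvalue1 are illustrative names, not the slides’ code; each thread now produces two horizontally adjacent d_P elements, so one Mds load feeds two output tiles):

// Inside the tile loop, after collaboratively loading one Mds tile and the
// two Nds tiles that cover columns Col and Col + TILE_WIDTH:
for (int k = 0; k < TILE_WIDTH; ++k) {
  float m = Mds[ty][k];          // one shared d_M read...
  Pvalue0 += m * Nds0[k][tx];    // ...reused for two output elements
  Pvalue1 += m * Nds1[k][tx];
}
// After the loop over all tiles:
d_P[Row*Width + Col] = Pvalue0;
d_P[Row*Width + Col + TILE_WIDTH] = Pvalue1;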
Thread Granularity Tradeoffs (2). Impact on parallelism: fewer thread blocks and more registers per thread. The impact needs to be explored, e.g., via autotuning. [Figure: same tiled matrix-multiply diagram as before.] ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
Any more questions? Read Chapter 6!