GPU Computing CIS-543 Lecture 09: Shared and Constant Memory


GPU Computing CIS-543 Lecture 09: Shared and Constant Memory Dr. Muhammad Abid, DCIS, PIEAS

Introduction to Shared Memory
Shared memory is on-chip, program-managed memory and a key enabler for many high-performance computing applications. HPC applications cannot rely on GPU caches alone: the caches are difficult to reason about because hundreds of threads share the L1 cache and thousands of threads share the L2 cache. Shared memory has roughly 20 to 30 times lower latency than device memory and more than 10 times higher bandwidth.

Introduction to Shared Memory
Each SM's shared memory is partitioned among the thread blocks resident on that SM; a block's allocation has the same lifetime as the block, so heavy shared memory usage can limit parallelism. The programmer has full control over when data is moved into and evicted out of shared memory. Shared memory supports multicast and broadcast, is accessed on a per-warp basis, and can greatly reduce the global memory bandwidth needed by kernels.

Introduction to Shared Memory

Applications of Shared Memory
Shared memory serves as an intra-block thread communication channel, as a program-managed cache for global memory data that exploits temporal locality in a program, and as scratchpad memory for transforming data access patterns to improve global memory access patterns.

Shared Memory Allocation
Shared memory arrays can be 1D, 2D, or 3D, declared either locally inside a CUDA kernel or globally in a CUDA source file, and allocated either statically or dynamically. Static allocation: __shared__ float tile[size_y][size_x]; with compile-time-constant dimensions. Dynamic allocation: only 1D arrays are allowed, declared as extern __shared__ float tile[]; with the size in bytes passed as the third launch configuration parameter, e.g. kernel<<<g, b, size * sizeof(float)>>>(...). See the sketch below.
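As a minimal sketch (not from the slides; the kernel names and sizes are illustrative), the two allocation styles and the launch syntax look like this:

#define SIZE_Y 32
#define SIZE_X 32

// Static allocation: dimensions are compile-time constants.
__global__ void staticSharedKernel(float *out)
{
    __shared__ float tile[SIZE_Y][SIZE_X];
    tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.y * SIZE_X + threadIdx.x] = tile[threadIdx.y][threadIdx.x];
}

// Dynamic allocation: only 1D, size supplied at launch time.
__global__ void dynamicSharedKernel(float *out, int n)
{
    extern __shared__ float tile[];
    int i = threadIdx.x;
    if (i < n) {
        tile[i] = (float)i;
        out[i] = tile[i];
    }
}

// Launch: the third configuration parameter is the dynamic shared memory size in bytes.
// dynamicSharedKernel<<<grid, block, n * sizeof(float)>>>(d_out, n);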

Shared Memory Banks
To achieve high bandwidth, shared memory is divided into 32 equally sized memory modules, called banks, over a 1D address space. If a shared memory load or store issued by a warp accesses no more than one memory location per bank, the operation can be serviced by one memory transaction. Otherwise, the operation is serviced by multiple memory transactions, decreasing memory bandwidth utilization.

Shared Memory Conflicts
A bank conflict occurs when multiple addresses in a warp's request map to the same memory bank, causing the request to be replayed; the hardware splits such a request into as many separate conflict-free transactions as necessary. Shared memory access patterns fall into three categories: parallel access (multiple addresses spread across multiple banks), serial access (multiple addresses within the same bank), and broadcast access (a single address read in a single bank).

Shared Memory Conflicts
A bank conflict does not occur when threads from the same warp access the same address: for reads, the word is broadcast to the requesting threads; for writes, the word is written by only one of the threads, and which thread performs the write is undefined. A minimal access-pattern sketch follows.
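As an illustrative sketch (not from the slides; assumes 32 banks of 4-byte width and a 32x32 thread block), the same shared array can be accessed conflict-free or with a 32-way conflict depending on which thread index varies across the warp:

#define BDIM 32

__global__ void bankAccessDemo(int *out)
{
    __shared__ int tile[BDIM][BDIM];

    // Conflict-free: consecutive threads of a warp (threadIdx.x = 0..31) write
    // consecutive 32-bit words, which land in 32 different banks.
    tile[threadIdx.y][threadIdx.x] = threadIdx.x;
    __syncthreads();

    // Potential 32-way conflict: consecutive threads read words that are BDIM (= 32)
    // words apart, so they all map to the same bank and the request is replayed.
    out[threadIdx.y * BDIM + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}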

Shared Memory Access Patterns (figure): (a) no bank conflict; (b) no bank conflict; (c) no bank conflict because the threads access the same address.

Shared Memory Access Mode
The shared memory bank width defines which shared memory addresses fall in which banks: 4 bytes (32 bits) for devices of compute capability 2.x, and 8 bytes (64 bits) for devices of compute capability 3.x. For 4-byte banks, the mapping from address to bank index is: bank index = (byte address / 4 bytes per bank) % 32 banks, so successive 32-bit words map to successive banks, as the helper sketch below illustrates.
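A small sketch of this mapping (the helper name bankIndex4B is illustrative, assuming 4-byte banks):

__host__ __device__ inline unsigned bankIndex4B(unsigned byteAddress)
{
    // bank index = (byte address / 4 bytes per bank) % 32 banks
    return (byteAddress / 4u) % 32u;
}
// Example: byte addresses 0, 4, 8, ..., 124 map to banks 0, 1, 2, ..., 31;
// byte address 128 wraps back to bank 0, so successive words hit successive banks.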

Shared Memory Access Mode
On a Fermi device, the bank width is 32 bits and there are 32 banks; each bank has a bandwidth of 32 bits per two clock cycles. (Figure: bank mapping for Fermi.)

Shared Memory Access Mode
Kepler devices have 32 shared memory banks with two access modes: 64-bit mode (bank width 64 bits) and 32-bit mode (bank width 32 bits). In 64-bit mode, the mapping is: bank index = (byte address / 8 bytes per bank) % 32 banks, so successive 64-bit words map to successive banks, and each bank has a bandwidth of 64 bits per clock cycle. If two threads access any sub-word within the same 64-bit word, there is no conflict. For the same access pattern, 64-bit mode on Kepler always causes the same number of bank conflicts or fewer than Fermi.

Shared Memory Access Mode 32-bit mode, Kepler device

Shared Memory Access Mode 64-bit mode, Kepler device, no conflict

Access Mode Configuration
The bank size can be queried and set with:
cudaError_t cudaDeviceGetSharedMemConfig(cudaSharedMemConfig *pConfig);
cudaError_t cudaDeviceSetSharedMemConfig(cudaSharedMemConfig config);
The supported bank configurations are cudaSharedMemBankSizeDefault, cudaSharedMemBankSizeFourByte, and cudaSharedMemBankSizeEightByte. Changing the shared memory bank size does not increase shared memory usage or affect kernel occupancy, but it can have a major effect on performance. Recall that Kepler devices support 4-byte and 8-byte shared memory access modes, with 4-byte mode as the default. Changing the shared memory configuration between kernel launches might require an implicit device synchronization point. A larger bank size may yield higher bandwidth for shared memory accesses, but may result in more bank conflicts depending on the application's shared memory access patterns. A minimal usage sketch follows.
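A minimal usage sketch of these calls (error handling omitted; on devices that do not support 8-byte banks the request is assumed to have no effect or to return an error):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaSharedMemConfig config;
    cudaDeviceGetSharedMemConfig(&config);          // query the current bank size
    printf("current bank size setting: %d\n", (int)config);

    // Request 8-byte banks before launching kernels that read 64-bit values from
    // shared memory; shared memory usage and occupancy are unchanged either way.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    return 0;
}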

Shared Memory Size
On devices where L1 cache and shared memory share the same on-chip storage, the split between them can be configured.
Per-device configuration: cudaError_t cudaDeviceSetCacheConfig(cudaFuncCache cacheConfig);
Per-kernel configuration: cudaError_t cudaFuncSetCacheConfig(const void* func, enum cudaFuncCache cacheConfig);
cacheConfig can be cudaFuncCachePreferNone, cudaFuncCachePreferShared, cudaFuncCachePreferL1, or cudaFuncCachePreferEqual. Prefer more shared memory when a kernel uses more shared memory; prefer more L1 cache when a kernel uses more registers. See the sketch below.
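A sketch of both calls (the kernel name someKernel is a placeholder, not from the slides):

#include <cuda_runtime.h>

__global__ void someKernel(float *data) { /* shared-memory-heavy kernel body */ }

void configureCaches()
{
    // Device-wide preference: favor shared memory over L1 for subsequent kernels.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

    // Per-kernel preference: a register-heavy kernel may instead prefer a larger L1.
    cudaFuncSetCacheConfig(someKernel, cudaFuncCachePreferL1);
}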

Using Shared Memory
When writing a kernel, focus on two concepts: how data elements map across memory banks, and how thread indices map to shared memory offsets. The optimal case is threads in the same warp accessing separate banks. The two running examples are square shared memory and rectangular shared memory.

Memory Padding
Memory padding is one way to avoid bank conflicts: add a word of padding after every N elements, where N is the number of banks. This changes the mapping from words to banks. (Figure: bank conflict vs. no bank conflict.) A minimal padding sketch follows.
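A minimal padding sketch (assumes 32 banks of 4-byte words and a 32x32 thread block; the kernel name is illustrative):

#define BDIM 32

__global__ void transposeTilePadded(int *out, const int *in)
{
    // The "+ 1" column is the padding word: it shifts each row by one bank,
    // so the column-wise read below no longer hits a single bank 32 times.
    __shared__ int tile[BDIM][BDIM + 1];

    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * BDIM + threadIdx.x];
    __syncthreads();
    out[threadIdx.y * BDIM + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}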

Matrix-Matrix Multiplication using Shared Memory

Base Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

    d_P[Row*Width+Col] = Pvalue;
}
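A hedged host-side launch sketch for this kernel (assumes square matrices with Width a multiple of TILE_WIDTH, taken here as 16, and device buffers already allocated and filled; the wrapper name is illustrative):

#define TILE_WIDTH 16

void launchMatrixMul(float *d_M, float *d_N, float *d_P, int Width)
{
    dim3 block(TILE_WIDTH, TILE_WIDTH);                 // one thread per output element
    dim3 grid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // one block per output tile
    MatrixMulKernel<<<grid, block>>>(d_M, d_N, d_P, Width);
}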

Idea: Use Shared Memory to Reuse Global Memory Data
Each input element is read by WIDTH threads. Load each element into shared memory once and have several threads use the local copy to reduce the required memory bandwidth; this is the basis of tiled algorithms. (Figure: matrices M, N, and P, each WIDTH x WIDTH.)

Outline of the Technique
Identify a block/tile of global memory content that is accessed by multiple threads. Load the tile from global memory into on-chip memory. Have those threads access their data from the on-chip memory. Then move on to the next tile.

Shared Memory Blocking: Basic Idea (figure: threads reading directly from global memory vs. threads reading a copy of the global memory data staged in on-chip memory).

Work for Block (0,0) in a TILE_WIDTH = 2 Configuration (figure: 4x4 matrices M, N, and P; for block (0,0), Row = 0 * 2 + threadIdx.y and Col = 0 * 2 + threadIdx.x, covering rows 0-1 and columns 0-1 of P).

Tiled Multiply
(Figure: Md, Nd, Pd, and the Pdsub tile, with TILE_WIDTH x TILE_WIDTH tiles indexed by bx, by and tx, ty.) Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd.

Loading a Tile
All threads in a block participate: in the basic tiled code, each thread loads one Md element and one Nd element. Assign the data elements to threads such that the accesses within each warp are coalesced; the innermost dimension (e.g. x) should be a multiple of the warp size (32).

Work for Block (0,0) (figures: successive phases in which the 2x2 tiles of N and M are copied into shared memory arrays SN and SM and used to accumulate P0,0, P0,1, P1,0, and P1,1).

Work for the Next Block (figures: the corresponding phases for the next block, which loads different tiles of M and N but follows the same pattern).

Barrier Synchronization
__syncthreads() is an API function call in CUDA: all threads in the same block must reach the __syncthreads() before any of them can move on. It is best used to coordinate tiled algorithms, ensuring that all elements of a tile have been loaded before they are used, and that all elements of a tile have been consumed before it is overwritten.

Loading an Input Tile
(Figure: tile 0 of Md and Nd.) Accessing tile 0 with 2D indexing: M[Row][tx] and N[ty][Col], where Row = by * TILE_WIDTH + ty and Col = bx * TILE_WIDTH + tx.

Loading an Input Tile
(Figure: tile 1 of Md and Nd.) Accessing tile 1 with 2D indexing: M[Row][1*TILE_WIDTH+tx] and N[1*TILE_WIDTH+ty][Col].

Loading Input Tile m
However, M and N are dynamically allocated, so only 1D indexing can be used:
M[Row][m*TILE_WIDTH+tx] becomes M[Row*Width + m*TILE_WIDTH + tx]
N[m*TILE_WIDTH+ty][Col] becomes N[(m*TILE_WIDTH+ty)*Width + Col]
(Figure: the m-th tiles of d_M and d_N and the Pdsub tile of d_P.)

Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        ds_M[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
        ds_N[ty][tx] = d_N[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += ds_M[ty][k] * ds_N[k][tx];
        __syncthreads();
    }
    d_P[Row*Width + Col] = Pvalue;
}
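With this tiling, each d_M and d_N element is loaded from global memory once per tile phase rather than once per output element, so global memory traffic drops by roughly a factor of TILE_WIDTH compared to the base kernel, at the cost of 2 * TILE_WIDTH * TILE_WIDTH * sizeof(float) bytes of shared memory per block; the launch configuration is the same as for the base kernel.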

Compare with the Base Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

    d_P[Row*Width+Col] = Pvalue;
}