GPU Programming

Assignment 4
- Consists of two programming assignments: concurrency and GPU programming
- Requires a computer with a CUDA/OpenCL/DirectCompute-compatible GPU
- Due Jun 07
- We have no final exams

GPU Resources
- Download the CUDA toolkit from the web
- Very good textbook: Programming Massively Parallel Processors by Wen-mei Hwu and David Kirk
- Available at

Acknowledgments
- Slides and material from Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)

Why GPU Programming
- More processing power + higher memory bandwidth
- GPU in every PC and workstation: massive volume and potential impact

Current CPU
- 4 cores, 4-float-wide SIMD, 3 GHz, 48-96 GFLOPS
- 2x HyperThreaded
- 64 kB L1 cache per core
- 20 GB/s to memory
[Diagram: CPU 0-3 sharing an L2 cache]

Current GPU
- 32 cores, 32-float-wide SIMD, 1 GHz, 1 TeraFLOP
- 32x "HyperThreaded"
- 64 kB L1 cache per core
- 150 GB/s to memory
- $200, 200 W
[Diagram: SIMD cores sharing an L2 cache]

Bandwidth and Capacity
[Diagram: CPU (50 GFLOPS) with 4-6 GB of CPU RAM at 10 GB/s; GPU (1 TFLOP) with 1 GB of GPU RAM at 100 GB/s; CPU RAM and GPU RAM linked at 1 GB/s]
- All values are approximate

CUDA: "Compute Unified Device Architecture"
- General-purpose programming model
  - User kicks off batches of threads on the GPU
  - GPU = dedicated super-threaded, massively data-parallel co-processor
- Targeted software stack: compute-oriented drivers, language, and tools
- Driver for loading computation programs into the GPU

Languages with Similar Capabilities
- CUDA, OpenCL, DirectCompute
- You are free to use any of the above for assignment 4
- I will focus on CUDA for the rest of the lecture
- The same abstractions are present in all three, with different (and confusing) names

CUDA Programming Model
- The GPU is a compute device that:
  - Is a coprocessor to the CPU (the host)
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
- A GPU program is called a kernel
- Differences between GPU and CPU threads:
  - GPU threads are extremely lightweight, with very little creation overhead
  - A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few

A CUDA Program
1. Host performs some CPU computation
2. Host copies input data into the device
3. Host instructs the device to execute a kernel
4. Device executes the kernel and produces results
5. Host copies the results back
6. Go to step 1
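
In code, that flow looks roughly like the sketch below; the kernel MyKernel, the wrapper RunOnce, and the 256-thread block size are illustrative assumptions, not part of the course material.

    #include <cuda_runtime.h>

    // Placeholder kernel: doubles each element (illustrative only).
    __global__ void MyKernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    void RunOnce(const float* h_in, float* h_out, int n) {
        size_t size = n * sizeof(float);
        float *d_in, *d_out;

        // 2. Host copies input data into the device
        cudaMalloc(&d_in, size);
        cudaMalloc(&d_out, size);
        cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

        // 3./4. Host instructs the device to execute the kernel
        MyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

        // 5. Host copies the results back
        cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);

        cudaFree(d_in);
        cudaFree(d_out);
    }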

A CUDA Kernel Is an SPMD Program (SPMD = Single Program, Multiple Data)
- All threads run the same code
- Each thread uses its id to:
  - Operate on different memory addresses
  - Make control decisions

Kernel:
    ...
    i = input[tid];
    o = f(i);
    output[tid] = o;
    ...

A CUDA Kernel Is an SPMD Program
- All threads run the same code
- Each thread uses its id to:
  - Operate on different memory addresses
  - Make control decisions
- Difference from SIMD: threads can execute different control flow, at a performance cost

Kernel:
    ...
    i = input[tid];
    if (i % 2 == 0)
        o = f(i);
    else
        o = g(i);
    output[tid] = o;
    ...
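
A complete rendering of this kernel might look as follows; f() and g() are hypothetical per-element functions, and the data is taken to be int so that the i % 2 test from the slide is well defined.

    __device__ int f(int x) { return 2 * x; }
    __device__ int g(int x) { return x + 1; }

    __global__ void SpmdKernel(const int* input, int* output, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;   // each thread derives its own id
        if (tid >= n) return;                              // guard for a partial last block

        int i = input[tid];                 // different memory address per thread
        int o = (i % 2 == 0) ? f(i) : g(i); // data-dependent control decision
        output[tid] = o;
    }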

Threads Organization
- Kernel threads = grid of thread blocks (1D or 2D)
- Thread block = array of threads (1D, 2D, or 3D)
- Simplifies memory addressing for multidimensional data
[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; Grid 1 contains Blocks (0,0), (1,0), (0,1), (1,1), and each block contains Threads (0,0), (1,0), (0,1), (1,1)]
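
In code, this organization is exposed through the built-in index variables. A sketch for 2D data follows; the kernel name Scale2D, the array layout, and the 16x16 block size are assumptions.

    // A 2D grid of 2D blocks mapped onto a width x height array (sketch).
    __global__ void Scale2D(float* data, int width, int height) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // x position in the whole grid
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // y position in the whole grid
        if (col < width && row < height)
            data[row * width + col] *= 2.0f;               // simple multidimensional addressing
    }

    // Launch (host side): 16x16 threads per block, enough blocks to cover the array.
    // dim3 block(16, 16);
    // dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    // Scale2D<<<grid, block>>>(d_data, width, height);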

Threads within a Block
- Execute in lock step
- Can share memory
- Can synchronize with each other
[Diagram: a CUDA thread block with thread ids 0 ... m, all running the same thread program. Courtesy: John Nickolls, NVIDIA]

CUDA Function Declarations

                                    Executed on the:    Only callable from the:
__device__ float DeviceFunc()       device              device
__global__ void  KernelFunc()       device              host
__host__   float HostFunc()         host                host

- __global__ defines a kernel function; must return void
- __device__ and __host__ can be used together

CUDA Function Declarations (cont.)
- __device__ functions cannot have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
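
A small sketch of the qualifiers in use; the function names (Square, Halve, Transform) are illustrative, not from the slides.

    // __host__ __device__: compiled for both the CPU and the GPU.
    __host__ __device__ float Square(float x) { return x * x; }

    // __device__: executed on and callable only from the device.
    __device__ float Halve(float x) { return 0.5f * x; }

    // __global__: a kernel; returns void and is launched from the host.
    __global__ void Transform(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = Halve(Square(data[i]));
    }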

Putting it all together

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);
dim3 DimBlock(4, 8, 8);
KernelFunc<<<DimGrid, DimBlock>>>(...);

CUDA Memory Model
- Registers: read/write, per thread
- Local memory: read/write, per thread
- Shared memory: read/write, per block
- Global memory: read/write, per grid
- Constant memory: read only, per grid
- Texture memory: read only, per grid
[Diagram: a grid of blocks; each block has its own shared memory and per-thread registers; the host and all threads access global, constant, and texture memory]

Memory Access Efficiency
- Registers: fast
- Local memory: not cached -> slow; registers spill into local memory
- Shared memory: on chip -> fast
- Global memory: not cached -> slow
- Constant memory: cached -> fast if good reuse
- Texture memory: cached -> fast if good reuse
[Diagram: same memory hierarchy as on the previous slide]

CUDA Variable Type Qualifiers

Variable declaration                           Memory      Scope     Lifetime
__device__ __local__    int LocalVar;          local       thread    thread
__device__ __shared__   int SharedVar;         shared      block     block
__device__              int GlobalVar;         global      grid      application
__device__ __constant__ int ConstantVar;       constant    grid      application

- __device__ is optional when used with __local__, __shared__, or __constant__
- Automatic variables without any qualifier reside in a register, except arrays, which reside in local memory
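
The qualifiers from the table in a short sketch; the names (kScale, gFlag, QualifierDemo, tile) and the assumption of at most 256 threads per block are mine, not the course's.

    __constant__ float kScale = 2.0f;     // constant memory: read-only, grid scope

    __device__ int gFlag;                 // global memory, application lifetime
                                          // (unused here; shown only for the qualifier)

    __global__ void QualifierDemo(const float* in, float* out, int n) {
        __shared__ float tile[256];       // shared memory: one copy per block
        int local = threadIdx.x;          // automatic variable: lives in a register
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        tile[local] = (gid < n) ? in[gid] : 0.0f;   // stage data in shared memory
        __syncthreads();                            // whole block reaches the barrier
        if (gid < n)
            out[gid] = kScale * tile[local];
    }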

Variable Type Restrictions
- Pointers can only point to memory allocated or declared in global memory:
  - Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
  - Obtained as the address of a global variable: float* ptr = &GlobalVar;

Simple Example: Matrix Multiplication

Matrix Multiplication
- P = M * N, all of size WIDTH x WIDTH
- Simple strategy: one thread calculates one element of P
- M and N are loaded WIDTH times from global memory
[Diagram: matrices M, N, P of size WIDTH x WIDTH]

GPU Matrix Multiplication: Host

float *M, *N, *P;
int width;
int size = width * width * sizeof(float);

cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
cudaMalloc(&Pd, size);

// call kernel

cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
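
For completeness, a sketch of the same host code as one function, with the device pointers declared and the kernel call filled in as on the following slides; the wrapper name MatrixMulOnDevice is assumed, not from the slides.

    // The kernel itself is defined on the following slides.
    __global__ void MatrixMul(float* Md, float* Nd, float* Pd, int width);

    void MatrixMulOnDevice(const float* M, const float* N, float* P, int width) {
        int size = width * width * sizeof(float);
        float *Md, *Nd, *Pd;                       // device pointers (implicit on the slides)

        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Pd, size);

        // call kernel: one block of width x width threads, as on the next slides
        dim3 dimGrid(1, 1);
        dim3 dimBlock(width, width);
        MatrixMul<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        cudaFree(Md);
        cudaFree(Nd);
        cudaFree(Pd);
    }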

GPU Matrix Multiplication: Host
- How many threads do we need?
[Diagram: matrices M, N, P of size WIDTH x WIDTH]

GPU Matrix Multiplication: Host

dim3 dimGrid(1, 1);
dim3 dimBlock(width, width);
MatrixMul<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

[Diagram: matrices M, N, P of size WIDTH x WIDTH]

GPU Matrix Multiplication: Kernel

__global__ void MatrixMul(float* Md, float* Nd, float* Pd, int width)
{
    Pd[ty*width + tx] = ...
}

Short forms: tx = threadIdx.x; ty = threadIdx.y;
[Diagram: thread (tx, ty) computes element (tx, ty) of Pd from Md and Nd]

GPU Matrix Multiplication: Kernel

__global__ void MatrixMul(...) {
    float r = 0;
    for (k = 0; k < width; k++) {
        r += Md[ty*width + k] * Nd[k*width + tx];
    }
    Pd[ty*width + tx] = r;
}

[Diagram: thread (tx, ty) walks across row ty of Md and down column tx of Nd]
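
Assembled into one complete kernel, with the short forms declared; one possible rendering of the slide code:

    // Simple (non-tiled) matrix multiplication: one thread per Pd element.
    __global__ void MatrixMul(float* Md, float* Nd, float* Pd, int width) {
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        float r = 0;
        for (int k = 0; k < width; ++k)
            r += Md[ty * width + k] * Nd[k * width + tx];  // row of Md dot column of Nd
        Pd[ty * width + tx] = r;
    }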

Only One Thread Block Used
- One block of threads computes matrix Pd; each thread computes one element of Pd
- Each thread:
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and one addition for each pair of Md and Nd elements
- Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block
[Diagram: Grid 1 with a single block; thread (2, 2) shown computing its Pd element]

How about performance on G80?
- All threads access global memory for their input matrix elements
- Compute: 346.5 GFLOPS (peak)
- Memory bandwidth: 86.4 GB/s
[Diagram: memory hierarchy as on the CUDA Memory Model slide]

How about performance on G80?
- All threads access global memory for their input matrix elements
  - Two memory accesses (8 bytes) per floating-point multiply-add
  - That is 4 B of memory traffic per FLOP
  - 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  - 86.4 GB/s limits the code to 21.6 GFLOPS
- The actual code runs at about 15 GFLOPS
- Need to drastically cut down memory accesses to get closer to the peak GFLOPS

G80 Example: Executing Thread Blocks
- Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
  - Up to 8 blocks per SM, as resources allow
  - An SM in G80 can take up to 768 threads: e.g. 256 (threads/block) * 3 blocks, or 128 (threads/block) * 6 blocks, etc.
- Threads run concurrently
  - The SM maintains thread/block ids
  - The SM manages/schedules thread execution
[Diagram: SM 0 and SM 1, each with an MT issue unit, SPs, and shared memory, with thread blocks (threads t0 t1 t2 ... tm) assigned to each]

G80 Example: Thread Scheduling
- Each block is executed as 32-thread warps
  - Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 * 3 = 24 warps
[Diagram: a Streaming Multiprocessor with instruction L1, instruction fetch/dispatch, shared memory, SPs, and SFUs; the warps (t0 t1 t2 ... t31) of Block 1 and Block 2 are queued on it]

SM Warp Scheduling
- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
[Diagram: the SM multithreaded warp scheduler interleaving, over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96]

G80 Block Granularity Considerations
- For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks? (Each SM can take at most 8 blocks and at most 768 threads.)
  - 8x8: 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  - 16x16: 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
  - 32x32: 1024 threads per block. Not even one block can fit into an SM!

A Common Programming Strategy
- Global memory resides in device memory (DRAM): much slower access than shared memory
- So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  - Partition data into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    - Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    - Copying results from shared memory back to global memory

A Common Programming Strategy (cont.)
- Constant memory also resides in device memory (DRAM): much slower access than shared memory
  - But... cached! Highly efficient access for read-only data
- Carefully divide data according to access patterns:
  - Read-only -> constant memory (very fast if in cache)
  - Read/write, shared within a block -> shared memory (very fast)
  - Read/write within each thread -> registers (very fast)
  - Read/write inputs/results -> global memory (very slow)

Idea: Use Shared Memory to Reuse Global Memory Data
- Each input element is read by WIDTH threads
- Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
- Tiled algorithms
[Diagram: matrices M, N, P of size WIDTH x WIDTH; thread (tx, ty) shown reusing loaded elements]

Tiled Multiply
- Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Diagram: Md, Nd, and Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) computes the sub-matrix Pdsub, thread (tx, ty) one element of it]

A Small Example: 2x2 Tiling of P
[Diagram: the elements Md0,0 ... Md3,1 and Nd0,0 ... Nd1,3, and the 4x4 result Pd0,0 ... Pd3,3 divided into 2x2 tiles]

Every Md and Nd element is used exactly twice in generating a 2x2 tile of P

Access order (each row lists the products accumulated by one thread):
P0,0 (thread 0,0): M0,0*N0,0   M1,0*N0,1   M2,0*N0,2   M3,0*N0,3
P1,0 (thread 1,0): M0,0*N1,0   M1,0*N1,1   M2,0*N1,2   M3,0*N1,3
P0,1 (thread 0,1): M0,1*N0,0   M1,1*N0,1   M2,1*N0,2   M3,1*N0,3
P1,1 (thread 1,1): M0,1*N1,0   M1,1*N1,1   M2,1*N1,2   M3,1*N1,3

Breaking Md and Nd into Tiles
- Break up the inner-product loop of each thread into phases
- At the beginning of each phase, load the Md and Nd elements that everyone needs during the phase into shared memory
- Everyone accesses the Md and Nd elements from shared memory during the phase
[Diagram: the 2x2 tiling example again, showing the Md and Nd tiles loaded in each phase]

Tiled Kernel

__global__ void Tiled(float* Md, float* Nd, float* Pd, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // compute Pvalue
    Pd[Row*Width + Col] = Pvalue;
}

Tiled Kernel: Computing Pvalue

//...
float Pvalue = 0;
// Loop over the Md and Nd tiles required
for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of Md and Nd tiles
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
}
Pd[Row*Width + Col] = Pvalue;
//...

CUDA Code: Kernel Execution Configuration

// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
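
Putting the last three slides together: the tiled kernel plus its launch as one compilable sketch. It assumes, as the slides do, that Width is a multiple of TILE_WIDTH, and fixes TILE_WIDTH at 16.

    #define TILE_WIDTH 16

    __global__ void Tiled(float* Md, float* Nd, float* Pd, int Width) {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;

        // Row and column of the Pd element this thread works on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;

        float Pvalue = 0;
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {     // loop over tiles (phases)
            // Collaboratively load one tile of Md and one tile of Nd into shared memory
            Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
            __syncthreads();

            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();
        }
        Pd[Row * Width + Col] = Pvalue;
    }

    // Host side: the execution configuration from this slide, plus the launch.
    // dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    // dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    // Tiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);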

First-order Size Considerations in G80
- Each thread block should have many threads
  - TILE_WIDTH of 16 gives 16*16 = 256 threads
- There should be many thread blocks
  - A 1024*1024 Pd gives 64*64 = 4096 thread blocks
  - TILE_WIDTH of 16 gives each SM 3 blocks, 768 threads (full capacity)
- Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
- Memory bandwidth is no longer a limiting factor

Tiled Multiply
- Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
- Each thread computes one element of Pdsub
[Diagram: Md, Nd, Pd with tile indices m and k marking the phases of the computation]

G80 Shared Memory and Threading
- Each SM in G80 has 16 KB of shared memory (the size is implementation dependent!)
- For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2 KB of shared memory
  - Shared memory alone could support up to 8 thread blocks actively executing
  - That allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  - The threading model limits the number of thread blocks to 3, so shared memory is not the limiting factor here
- The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time
- Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  - The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!

Parallel Memory Architecture
- In a parallel machine, many threads access memory
  - Therefore, memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized
[Diagram: banks 0 through 15]

Bank Addressing Examples
- No bank conflicts: linear addressing, stride == 1
- No bank conflicts: random 1:1 permutation
[Diagram: threads 0-15 mapped one-to-one onto banks 0-15 in both cases]

Bank Addressing Examples
- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8
[Diagram: with stride 2, pairs of threads share a bank; with stride 8, eight threads (x8) land on each of banks 0 and 8]

How addresses map to banks on G80
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks, so bank = address % 16
- Same as the size of a half-warp
  - No bank conflicts between different half-warps, only within a single half-warp

Shared Memory Bank Conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - Must serialize the accesses
  - Cost = max # of simultaneous accesses to a single bank

Linear Addressing
- Given:
    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];
- This is only bank-conflict-free if s shares no common factors with the number of banks
  - 16 on G80, so s must be odd
[Diagram: thread-to-bank mappings for s=1 and s=3, both conflict-free]
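
A quick host-side illustration (not course code) of which bank each thread of a half-warp would hit for shared[s * threadIdx.x]; a repeated bank number within one stride means a conflict.

    #include <stdio.h>

    int main(void) {
        const int NUM_BANKS = 16;    // G80: 16 banks, same as the half-warp size
        const int HALF_WARP = 16;
        const int strides[] = {1, 2, 3, 8};

        for (int i = 0; i < 4; ++i) {
            int s = strides[i];
            printf("stride %d: banks", s);
            for (int t = 0; t < HALF_WARP; ++t)
                printf(" %2d", (s * t) % NUM_BANKS);   // bank = word address % 16
            printf("\n");
        }
        return 0;   // strides 1 and 3 are conflict-free; strides 2 and 8 are not
    }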

Control Flow Instructions
- Main performance concern with branching is divergence
  - Threads within a single warp take different paths
  - Different execution paths are serialized in G80: the control paths taken by the threads in a warp are traversed one at a time until there are no more
- A common case: avoid divergence when the branch condition is a function of the thread ID
  - Example with divergence: if (threadIdx.x > 2) { }
    - This creates two different control paths for threads in a block
    - Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  - Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { }
    - Also creates two different control paths for threads in a block
    - Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path
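
The two examples as minimal kernels, a sketch with WARP_SIZE taken as 32:

    #define WARP_SIZE 32

    // Divergent: threads 0-2 of the first warp take a different path from the rest.
    __global__ void DivergentBranch(int* out) {
        int v;
        if (threadIdx.x > 2) v = 1;
        else                 v = 2;
        out[threadIdx.x] = v;
    }

    // Non-divergent: the condition is constant within each warp, so whole warps
    // follow the same path.
    __global__ void UniformBranch(int* out) {
        int v;
        if (threadIdx.x / WARP_SIZE > 2) v = 1;
        else                             v = 2;
        out[threadIdx.x] = v;
    }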