ECE 8823A GPU Architectures Module 4: Memory Model and Locality © David Kirk/NVIDIA and Wen-mei Hwu, 2007-2012, ECE408/CS483/ECE498al, University of Illinois

ECE 8823A GPU Architectures Module 4: Memory Model and Locality © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign

Reading Assignment
Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 5
CUDA Programming Guide – guide/#abstract

Objective
To understand the different memory spaces in the CUDA programming model
To learn to efficiently use the important levels of the CUDA memory hierarchy
– Registers, shared memory, global memory
– Tiled algorithms and barrier synchronization

The Von Neumann Model
[Figure: block diagram of the Von Neumann model with Memory, I/O, a Processing Unit (ALU, Reg File), and a Control Unit (PC, IR)]
Consequences of limited global memory bandwidth

Going back to the program
Every instruction needs to be fetched from memory, decoded, then executed.
– The decode stage typically accesses the register file
Instructions come in three flavors: Operate, Data transfer, and Program Control Flow.
An example instruction cycle is the following: Fetch | Decode | Execute | Memory

Operate Instructions
Example of an operate instruction: ADD R1, R2, R3
Instruction cycle for an operate instruction: Fetch | Decode | Execute | Memory
[Figure annotation: Register Space]

Memory Access Instructions
Examples of memory access instructions:
LDR R1, R2, offset
STR R1, R2, offset
Instruction cycle for a memory access instruction: Fetch | Decode | Execute | Memory
[Figure annotation: Global Address Space]

Registers vs Memory
Registers are “free”
– No additional memory access instruction
– Very fast to use; however, there are very few of them
– More energy efficient
Memory is expensive (slow), but very large
Additional levels
– Cache/scratch pad

Programmer View of CUDA Memories
Each thread can:
– Read/write per-thread registers (~1 cycle)
– Read/write per-block shared memory (~5 cycles)
– Read/write per-grid global memory (~500 cycles)
– Read only per-grid constant memory (~5 cycles with caching)
[Figure: CUDA memory diagram. The Host connects to the per-grid Global Memory and Constant Memory; Block (0, 0) and Block (1, 0) each have their own Shared Memory; Thread (0, 0) and Thread (1, 0) in each block have their own Registers]
Relationship to the Programming Model?

Automatic Variables
[Figure: the CUDA memory diagram (Host, Grid, Global Memory, Constant Memory, per-block Shared Memory, per-thread Registers), annotated with where automatic variables and automatic array variables are placed]
Automatic variables
Automatic array variables
Private version created for each thread

Matrix Multiplication Revisited
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        // Calculate the row index of the d_P element and d_M
        int Row = blockIdx.y*blockDim.y+threadIdx.y;
        // Calculate the column index of d_P and d_N
        int Col = blockIdx.x*blockDim.x+threadIdx.x;
        if ((Row < Width) && (Col < Width)) {
            float Pvalue = 0;
            // each thread computes one element of the block sub-matrix
            for (int k = 0; k < Width; ++k)
                Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
            d_P[Row*Width+Col] = Pvalue;
        }
    }
[Figure annotations: the automatic variables are private to each thread; the on-chip resources shown are Register File (128 KB), L1 (16 KB), and Shared Memory (48 KB)]

Shared Memory
[Figure: the CUDA memory diagram with the per-block Shared Memory highlighted]
Shared by threads in a block
Private version of a shared variable created for each thread block
Parallel access

Shared Memory in CUDA
A special type of memory whose contents are explicitly declared and used in the source code
– Located in the processor
– Accessed at much higher speed than global memory (in both latency and throughput)
– Still accessed by memory access instructions
– Commonly referred to as scratchpad memory in computer architecture
– Shared by threads in a block
Algorithmic optimizations to use shared memory

Global Memory
[Figure: the CUDA memory diagram with the per-grid Global Memory highlighted]
Visible to all threads
Persistent
Means for threads to collaborate across thread blocks
Note: no synchronization between thread blocks
Availability of atomics and fences (used in some circumstances)
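
Because thread blocks cannot synchronize with one another inside a kernel, one common way for them to cooperate is an atomic update to a location in global memory. The following is a minimal sketch of that idea, not code from the slides; the names sumKernel and sumOnDevice are hypothetical, and the host wrapper assumes n > 0.

```cuda
#include <cuda_runtime.h>

// Every thread adds one input element into a single accumulator in global
// memory. atomicAdd keeps the read-modify-write safe across all blocks.
__global__ void sumKernel(const int* d_in, int n, int* d_total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(d_total, d_in[i]);
}

// Hypothetical host wrapper (assumes n > 0). Returns true on success and
// writes the sum of h_in[0..n-1] to *h_total.
bool sumOnDevice(const int* h_in, int n, int* h_total) {
    int *d_in = nullptr, *d_total = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_total, sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sumKernel<<<blocks, threads>>>(d_in, n, d_total);

    // The blocking copy below also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_total, d_total, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return err == cudaSuccess;
}
```

Funnelling every thread through one address serializes the updates, so this pattern is best read as an illustration of cross-block communication rather than as a fast reduction.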

Constant Memory
[Figure: the CUDA memory diagram with the per-grid Constant Memory highlighted]
Visible to all threads
Persistent
Declaration of constants across all threads
Cached for faster access

Use of Constants: 1D Convolution
    __global__ void convolution_1D_basic_kernel(float *N, float *M, float *P,
                                                int Mask_Width, int Width)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        float Pvalue = 0;
        int N_start = i - (Mask_Width/2);
        // Loop over the mask elements
        for (int j = 0; j < Mask_Width; j++) {
            // Skip ghost nodes that fall outside N[]
            if (N_start + j >= 0 && N_start + j < Width) {
                Pvalue += N[N_start + j] * M[j];
            }
        }
        P[i] = Pvalue;
    }
[Figure: arrays N[], M[] (constant array), and P[], with ghost nodes beyond the ends of N[]; annotations mark variables that are unique per thread and M[] as constant across threads]

Use of Constant Memory
Consider
– Size of M[] is typically small
– M[] is constant
– All threads access the same elements at the same time
[Figure: arrays N[], M[] (constant array), and P[], with ghost nodes]

Management
[Figure: the CUDA memory diagram with Constant Memory highlighted]
Constant memory must be explicitly managed
– Allocation: constants are treated as global variables
– Copying: copied into global memory and can be cached into a separate, dedicated cache
Kernel functions access constants as global variables
– Cached in the constant cache
– Broadcast capability to all threads in a warp

Modified: 1D Convolution
    __global__ void convolution_1D_basic_kernel(float *N, float *P,
                                                int Mask_Width, int Width)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        float Pvalue = 0;
        int N_start = i - (Mask_Width/2);
        for (int j = 0; j < Mask_Width; j++) {
            if (N_start + j >= 0 && N_start + j < Width) {
                // M is the constant array, accessed as a global variable
                Pvalue += N[N_start + j] * M[j];
            }
        }
        P[i] = Pvalue;
    }
[Figure: arrays N[], M[] (constant array), and P[], with ghost nodes]
M[] not passed in as a parameter
Accessed as a global array
Transfer using a different API function: cudaMemcpyToSymbol(dest, src, size)
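
A self-contained sketch of how the constant-memory version might be wired together on the host is given here. The file-scope __constant__ declaration, the MAX_MASK_WIDTH bound, the extra i < Width guard, and the names convolution_1D_const_kernel and convolve1D are assumptions added for illustration; only cudaMemcpyToSymbol and the kernel body come from the slide.

```cuda
#include <cuda_runtime.h>

#define MAX_MASK_WIDTH 16  // assumed upper bound on the mask size

// The mask lives in constant memory at file scope; it is not a kernel argument.
__constant__ float M[MAX_MASK_WIDTH];

__global__ void convolution_1D_const_kernel(const float* N, float* P,
                                            int Mask_Width, int Width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= Width) return;  // guard added here; the slide omits it
    float Pvalue = 0;
    int N_start = i - (Mask_Width / 2);
    for (int j = 0; j < Mask_Width; j++) {
        // Ghost elements outside N[] contribute zero.
        if (N_start + j >= 0 && N_start + j < Width)
            Pvalue += N[N_start + j] * M[j];
    }
    P[i] = Pvalue;
}

// Hypothetical host wrapper: copies the mask with cudaMemcpyToSymbol,
// runs the kernel, and copies the result back. Returns true on success.
bool convolve1D(const float* h_N, const float* h_M, float* h_P,
                int Mask_Width, int Width) {
    if (Mask_Width > MAX_MASK_WIDTH || Width <= 0) return false;
    size_t bytes = Width * sizeof(float);
    float *d_N = nullptr, *d_P = nullptr;
    if (cudaMalloc(&d_N, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_P, bytes) != cudaSuccess) {
        cudaFree(d_N);
        return false;
    }
    cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(M, h_M, Mask_Width * sizeof(float));

    int threads = 256;
    int blocks = (Width + threads - 1) / threads;
    convolution_1D_const_kernel<<<blocks, threads>>>(d_N, d_P, Mask_Width, Width);

    cudaError_t err = cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_N);
    cudaFree(d_P);
    return err == cudaSuccess;
}
```

With N = {1, 2, 3, 4, 5} and a mask of three ones, each output is the sum of an element and its in-range neighbors, so the two boundary outputs see one ghost element each.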

CUDA Variable Type Qualifiers
Variable declaration                        Memory     Scope    Lifetime
int LocalVar;                               register   thread
__device__ __shared__ int SharedVar;        shared     block
__device__ int GlobalVar;                   global     grid     application
__device__ __constant__ int ConstantVar;    constant   grid     application
__device__ is optional when used with __shared__ or __constant__
Automatic variables without any qualifier reside in a register
– Except per-thread arrays that reside in global memory
Note programmer controlled placement
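
To make the qualifier table concrete, here is a small sketch that declares one variable of each kind and uses them together. The kernel qualifierDemo, the wrapper runQualifierDemo, and the fixed launch of 2 blocks of 4 threads are hypothetical choices for illustration only.

```cuda
#include <cuda_runtime.h>

__device__ int GlobalVar;      // global memory, visible to the whole grid
__constant__ int ConstantVar;  // constant memory, read-only inside kernels

__global__ void qualifierDemo(int* d_out) {
    __shared__ int SharedVar;    // one copy per thread block
    int LocalVar = threadIdx.x;  // automatic scalar: one copy per thread

    // One thread per block initializes the block's shared variable.
    if (threadIdx.x == 0) SharedVar = ConstantVar + blockIdx.x;
    __syncthreads();  // make SharedVar visible to the rest of the block

    d_out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar + LocalVar;

    // A single thread in the grid writes the global variable.
    if (blockIdx.x == 0 && threadIdx.x == 0) GlobalVar = ConstantVar;
}

// Hypothetical host wrapper: launches 2 blocks of 4 threads, so h_out must
// hold 8 ints. Returns true on success.
bool runQualifierDemo(int constantValue, int* h_out, int* h_global) {
    int* d_out = nullptr;
    if (cudaMalloc(&d_out, 8 * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpyToSymbol(ConstantVar, &constantValue, sizeof(int));

    qualifierDemo<<<2, 4>>>(d_out);

    cudaError_t e1 = cudaMemcpy(h_out, d_out, 8 * sizeof(int),
                                cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpyFromSymbol(h_global, GlobalVar, sizeof(int));
    cudaFree(d_out);
    return e1 == cudaSuccess && e2 == cudaSuccess;
}
```

Each output element equals the constant plus the block index plus the thread index, which shows that SharedVar differs per block while LocalVar differs per thread.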

Refining the Memory Hierarchy
For best performance, the index into the constant cache should not be a function of thread ID!
Separate read-only cache that the compiler uses, distinct from the constant cache
– Accessed with additional qualifier (__restrict__)
[Figure: the CUDA memory diagram (Host, Grid, Global Memory, Constant Memory, per-block Shared Memory, per-thread Registers)]
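
As a hedged illustration of the __restrict__ qualifier, the sketch below marks the input pointer as both const and __restrict__. This tells the compiler the data is read-only and not aliased, which permits, but does not force, loads through the read-only data cache on GPUs that have one. The names scaleKernel and scaleOnDevice are hypothetical.

```cuda
#include <cuda_runtime.h>

// `in` is declared const and __restrict__: read-only and not aliased by `out`.
__global__ void scaleKernel(const float* __restrict__ in,
                            float* __restrict__ out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Hypothetical host wrapper (assumes n > 0). Returns true on success.
bool scaleOnDevice(const float* h_in, float* h_out, float a, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    scaleKernel<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, a, n);

    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

Note that the index here does depend on the thread ID, which is fine for the read-only cache; the warning about thread-dependent indices applies to the constant cache.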

Where to Declare Variables?
[Figure: yes/no decision diagram. yes: global, constant. no: register (automatic), shared, local]

    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
        __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
        // ... rest of the kernel body
    }

A Common Programming Strategy
Global memory resides in device memory (DRAM): slow access
So, a profitable way of performing computation on the device is to tile input data to take advantage of fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
  Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
  Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
  Copying results from shared memory to global memory

Matrix-Matrix Multiplication using Shared Memory

Base Matrix Multiplication Kernel
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        // Calculate the row index of the d_P element and d_M
        int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
        // Calculate the column index of d_P and d_N
        int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
        d_P[Row*Width+Col] = Pvalue;
    }

How about performance on Fermi?
[Figure: the CUDA memory diagram (Host, Grid, Global Memory, Constant Memory, per-block Shared Memory, per-thread Registers)]
All threads access global memory for their input matrix elements
– Two memory accesses (8 bytes) per floating point multiply-add
– 4 bytes of memory traffic per floating-point operation (4 B/FLOP)
– 4*1,000 = 4,000 GB/s required to achieve peak FLOP rating
– 150 GB/s limits the code at 37.5 GFLOPS
The actual code runs at about 25 GFLOPS
Need to drastically cut down memory accesses to get closer to the peak 1,000 GFLOPS
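
The bandwidth bound can be written out as a short worked calculation. It uses only the slide's numbers and counts one multiply-add as two floating-point operations.

```latex
\begin{align*}
\text{traffic per FLOP} &= \frac{2 \times 4\ \text{B}}{2\ \text{FLOP}} = 4\ \text{B/FLOP} \\
\text{bandwidth needed at peak} &= 4\ \text{B/FLOP} \times 1{,}000\ \text{GFLOPS} = 4{,}000\ \text{GB/s} \\
\text{rate attainable at } 150\ \text{GB/s} &= \frac{150\ \text{GB/s}}{4\ \text{B/FLOP}} = 37.5\ \text{GFLOPS}
\end{align*}
```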

Shared Memory Blocking Basic Idea
[Figure: two panels. In the first, Thread 1, Thread 2, … access the data "in" directly in Global Memory. In the second, "in" is held in On-chip Memory between Global Memory and Thread 1, Thread 2, …]

Basic Concept of Blocking/Tiling
In a congested traffic system, a significant reduction in the number of vehicles can greatly reduce the delay seen by all vehicles
– Carpooling for commuters
– Blocking/Tiling for global memory accesses
  drivers = threads, cars = data

Some computations are more challenging to block/tile than others.
Some carpools may be easier than others
– More efficient if neighbors are also classmates or co-workers
– Some vehicles may be more suitable for carpooling
Similar variations exist in blocking/tiling

Carpools need synchronization.
Good: when people have similar schedules
Bad: when people have very different schedules
[Figure: two timelines for Worker A and Worker B, with the activities sleep, work, dinner in the first and sleep, work, dinner, party in the second]

Same with Blocking/Tiling
Good: when threads have similar access timing
Bad: when threads have very different timing
[Figure: two timelines for Thread 1 and Thread 2 plotted against Time]

Outline of Technique
Identify a block/tile of global memory content that is accessed by multiple threads
Load the block/tile from global memory into on-chip memory
Have the multiple threads access their data from the on-chip memory
Move on to the next block/tile

Idea: Use Shared Memory to reuse global memory data
Each input element is read by WIDTH threads.
Load each element into Shared Memory and have several threads use the local version to reduce the memory bandwidth
– Tiled algorithms
[Figure: matrices M, N, and P of size WIDTH x WIDTH, with thread indices tx and ty marking one P element]

Work for Block (0,0) in a TILE_WIDTH = 2 Configuration
Col = 0 * 2 + threadIdx.x (Col = 0, Col = 1)
Row = 0 * 2 + threadIdx.y (Row = 0, Row = 1)
[Figure: 4x4 matrices M, N, and P with elements indexed 0,0 through 3,3; the two factors in each formula are labeled blockIdx.x, blockDim.x and blockIdx.y, blockDim.y]

Tiled Multiply
Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Figure: Md, Nd, and Pd of size WIDTH x WIDTH divided into TILE_WIDTH x TILE_WIDTH tiles; block indices bx, by and thread indices tx, ty select the Pd sub-matrix (Pdsub)]

Loading a Tile
All threads in a block participate
– Each thread loads one Md element and one Nd element in the basic tiled code
Assign the loaded element to each thread such that the accesses within each warp are coalesced (more later).

Work for Block (0,0)
[Figure sequence (five animation frames): 4x4 matrices M, N, and P with elements indexed 0,0 through 3,3, and the 2x2 tiles M 0,0 M 0,1 M 1,0 M 1,1 and N 0,0 N 0,1 N 1,0 N 1,1 held in shared memory (SM)]

Barrier Synchronization
An API function call in CUDA
– __syncthreads()
All threads in the same block must reach the __syncthreads() before any can move on
Best used to coordinate tiled algorithms
– To ensure that all elements of a tile are loaded
– To ensure that all elements of a tile are consumed
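
A small sketch of why the barrier matters: each block below reverses its own segment of an array through shared memory. Without the __syncthreads() call, a thread could read a shared-memory slot that another thread has not yet written. BLOCK_SIZE, reverseInBlock, and reverseBlocks are hypothetical names, and the array length is assumed to be an exact multiple of BLOCK_SIZE.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 64  // assumed block size; one array segment per block

__global__ void reverseInBlock(int* d_data) {
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK_SIZE;

    tile[t] = d_data[base + t];  // phase 1: each thread loads one element
    __syncthreads();             // every load must finish before any read below
    d_data[base + t] = tile[BLOCK_SIZE - 1 - t];  // phase 2: read another thread's slot
}

// Hypothetical host wrapper: h_data holds numBlocks * BLOCK_SIZE ints and is
// reversed in place, segment by segment. Returns true on success.
bool reverseBlocks(int* h_data, int numBlocks) {
    if (numBlocks <= 0) return false;
    size_t bytes = (size_t)numBlocks * BLOCK_SIZE * sizeof(int);
    int* d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    reverseInBlock<<<numBlocks, BLOCK_SIZE>>>(d_data);

    cudaError_t err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err == cudaSuccess;
}
```

This example needs only the "tile is loaded" barrier. A second barrier, for "tile is consumed", becomes necessary when the same shared array is overwritten in a loop, as in tiled matrix multiplication.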

Barriers and Thread Behaviors
[Figure: execution timelines for Thread 0 through Thread N-1 plotted against Time]

Loading an Input Tile
Accessing tile 0 with 2D indexing:
Row = by * TILE_WIDTH + ty
M[Row][tx]
N[ty][Col]
[Figure: Md, Nd, and Pd of size WIDTH x WIDTH divided into TILE_WIDTH x TILE_WIDTH tiles, with tile index m, inner index k, block indices bx, by, and thread indices tx, ty]
©Wen-mei W. Hwu and David Kirk/NVIDIA, Urbana, August 13-17, 2012

Loading an Input Tile
Accessing tile 1 with 2D indexing:
Row = by * TILE_WIDTH + ty
M[Row][1*TILE_WIDTH+tx]
N[1*TILE_WIDTH+ty][Col]
[Figure: Md, Nd, and Pd of size WIDTH x WIDTH divided into TILE_WIDTH x TILE_WIDTH tiles, with tile index m, inner index k, block indices bx, by, and thread indices tx, ty]

Loading Input Tile m
However, M and N are dynamically allocated and can only use 1D indexing:
M[Row][m*TILE_WIDTH+tx] becomes M[Row*Width + m*TILE_WIDTH + tx]
N[m*TILE_WIDTH+ty][Col] becomes N[(m*TILE_WIDTH+ty) * Width + Col]
[Figure: d_M, d_N, and d_P of size WIDTH x WIDTH with TILE_WIDTH x TILE_WIDTH tiles; the tile offset m*TILE_WIDTH and the Row and Col of the Pd sub-matrix element are marked]

Tiled Matrix Multiplication Kernel
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
        __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
        int bx = blockIdx.x; int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;
        // Identify the row and column of the d_P element to work on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;
        float Pvalue = 0;
        // Loop over the d_M and d_N tiles required to compute the d_P element
        // (assumes Width is a multiple of TILE_WIDTH)
        for (int m = 0; m < Width/TILE_WIDTH; ++m) {
            // Collaborative loading of d_M and d_N tiles into shared memory
            ds_M[ty][tx] = d_M[Row*Width + m*TILE_WIDTH+tx];
            ds_N[ty][tx] = d_N[Col+(m*TILE_WIDTH+ty)*Width];
            __syncthreads();   // wait until the whole tile is loaded
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += ds_M[ty][k] * ds_N[k][tx];
            __syncthreads();   // wait until the whole tile is consumed
        }
        d_P[Row*Width+Col] = Pvalue;
    }
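
The kernel above assumes that Width is a multiple of TILE_WIDTH. As a hedged extension that is not part of the slides, the sketch below adds boundary checks so that any Width works, and pairs the kernel with a host-side driver. The names MatrixMulTiledChecked and tiledMatMul and the choice TILE_WIDTH = 16 are assumptions for illustration.

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size

// Tiled multiply with boundary checks: out-of-range tile entries are loaded
// as 0 so they add nothing to the dot product.
__global__ void MatrixMulTiledChecked(const float* d_M, const float* d_N,
                                      float* d_P, int Width) {
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;
    float Pvalue = 0;
    int numTiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;

    for (int m = 0; m < numTiles; ++m) {
        int mCol = m * TILE_WIDTH + tx;  // column of d_M this thread loads
        int nRow = m * TILE_WIDTH + ty;  // row of d_N this thread loads
        ds_M[ty][tx] = (Row < Width && mCol < Width) ? d_M[Row * Width + mCol] : 0.0f;
        ds_N[ty][tx] = (nRow < Width && Col < Width) ? d_N[nRow * Width + Col] : 0.0f;
        __syncthreads();  // tile fully loaded
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += ds_M[ty][k] * ds_N[k][tx];
        __syncthreads();  // tile fully consumed before it is overwritten
    }
    // No early return above: every thread must reach both barriers.
    if (Row < Width && Col < Width) d_P[Row * Width + Col] = Pvalue;
}

// Hypothetical host driver for Width x Width row-major matrices.
bool tiledMatMul(const float* h_M, const float* h_N, float* h_P, int Width) {
    if (Width <= 0) return false;
    size_t bytes = (size_t)Width * Width * sizeof(float);
    float *d_M = nullptr, *d_N = nullptr, *d_P = nullptr;
    if (cudaMalloc(&d_M, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_N, bytes) != cudaSuccess) { cudaFree(d_M); return false; }
    if (cudaMalloc(&d_P, bytes) != cudaSuccess) { cudaFree(d_M); cudaFree(d_N); return false; }
    cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);

    int tiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;
    dim3 grid(tiles, tiles);
    dim3 block(TILE_WIDTH, TILE_WIDTH);
    MatrixMulTiledChecked<<<grid, block>>>(d_M, d_N, d_P, Width);

    cudaError_t err = cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);
    return err == cudaSuccess;
}
```

Threads outside the matrix still take part in loading and in both barriers; only the final store is guarded. Returning early from such threads would leave the remaining threads of the block waiting at __syncthreads().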

Compare with the Base Kernel
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        // Calculate the row index of the d_P element and d_M
        int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
        // Calculate the column index of d_P and d_N
        int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
        d_P[Row*Width+Col] = Pvalue;
    }

First-order Size Considerations
Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
– TILE_WIDTH of 32 gives 32*32 = 1024 threads
For 16, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
For 32, each block performs 2*1024 = 2048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations
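
Dividing the operation counts by the load counts shows what these numbers imply: the number of mul/add operations per global memory load equals TILE_WIDTH. This is a worked restatement of the slide's figures, not an additional measurement.

```latex
\begin{align*}
\text{TILE\_WIDTH} = 16: &\quad \frac{256 \times (2 \times 16)}{2 \times 256} = \frac{8{,}192}{512} = 16 \ \text{operations per load} \\
\text{TILE\_WIDTH} = 32: &\quad \frac{1024 \times (2 \times 32)}{2 \times 1024} = \frac{65{,}536}{2{,}048} = 32 \ \text{operations per load}
\end{align*}
```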

Shared Memory and Threading
Each SM in Fermi has 16KB or 48KB shared memory*
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
– Can potentially have up to 8 thread blocks actively executing
  This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
– The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB shared memory usage per thread block, allowing 2 or 6 thread blocks active at the same time
Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
– The 86.4GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
*Configurable vs L1, total 64KB

Resource Constraints
Number of available registers also limits the number of thread blocks that can concurrently execute
– Reduction is in size of a thread block
– Impact on utilization can be significant
Auto-tune based on querying of device properties
– e.g., fix tile size based on available shared memory size
– Need to change the preceding tiled MM code

Device Query
Number of devices in the system
    int dev_count;
    cudaGetDeviceCount(&dev_count);
Capability of devices
    cudaDeviceProp dev_prop;
    for (int i = 0; i < dev_count; i++) {
        cudaGetDeviceProperties(&dev_prop, i);
        // decide if device has sufficient resources and capabilities
    }
cudaDeviceProp is a built-in C structure type
– dev_prop.maxThreadsPerBlock
– dev_prop.sharedMemPerBlock
– …
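
One way the queried properties might feed into tile-size selection is sketched below. The helper chooseTileWidth and its policy (the largest power-of-two tile, up to 32, whose two float tiles fit in shared memory and whose thread count fits in a block) are assumptions for illustration, as is the reporting function reportDevices.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical policy: the largest power-of-two tile width (at most 32) such
// that two tile x tile float arrays fit in shared memory and tile*tile
// threads fit in one block. Returns 0 if no tile width fits.
int chooseTileWidth(size_t sharedMemPerBlock, int maxThreadsPerBlock) {
    for (int tile = 32; tile >= 1; tile /= 2) {
        size_t sharedNeeded = 2 * (size_t)tile * tile * sizeof(float);
        if (sharedNeeded <= sharedMemPerBlock && tile * tile <= maxThreadsPerBlock)
            return tile;
    }
    return 0;
}

// Prints the relevant properties of every device and the tile width the
// policy above would pick. Returns the device count, or -1 on a CUDA error.
int reportDevices() {
    int dev_count = 0;
    if (cudaGetDeviceCount(&dev_count) != cudaSuccess) return -1;
    for (int i = 0; i < dev_count; i++) {
        cudaDeviceProp dev_prop;
        if (cudaGetDeviceProperties(&dev_prop, i) != cudaSuccess) return -1;
        int tile = chooseTileWidth(dev_prop.sharedMemPerBlock,
                                   dev_prop.maxThreadsPerBlock);
        printf("device %d (%s): %zu B shared memory per block, "
               "%d threads per block, tile width %d\n",
               i, dev_prop.name, dev_prop.sharedMemPerBlock,
               dev_prop.maxThreadsPerBlock, tile);
    }
    return dev_count;
}
```

A tile width chosen at run time cannot size a statically declared __shared__ array, so a kernel using it would need dynamically sized shared memory (an extern __shared__ array whose size is given in the launch configuration) or one compiled kernel variant per tile width.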

Summary: Typical Structure of a CUDA Program
Global variables declaration
– __host__
– __device__... __global__, __constant__, __texture__
Function prototypes
– __global__ void kernelOne(…)
– float handyFunction(…)
Main ()
– allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
– transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl…)
– execution configuration setup
– kernel call: kernelOne<<<execution configuration>>>( args… );
– transfer results from device to host: cudaMemcpy(h_GlblVarPtr,…)
– optional: compare against golden (host computed) solution
(repeat as needed)
Kernel: void kernelOne(type args,…)
– variables declaration: auto, __shared__
  automatic variables transparently assigned to registers
– __syncthreads()…
Other functions
– float handyFunction(int inVar…);

ANY MORE QUESTIONS? READ CHAPTER 5!