
Slides from “PMPP” book
David B. Kirk, NVIDIA
Wen-mei W. Hwu, UIUC

Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads
All threads run the same code (SPMD)
Each thread has an ID that it uses to compute memory addresses and make control decisions
[Figure: an array of threads indexed by threadID 0, 1, 2, …, each running the same per-thread code:
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;]
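As a minimal sketch (not from the slides), the per-thread code above can be wrapped in a complete kernel; the kernel name funcKernel and the stand-in computation are assumptions:

// Minimal sketch: each thread uses its ID to pick the element it reads and writes.
__global__ void funcKernel(const float* input, float* output)
{
    int threadID = threadIdx.x;      // 1D thread index within the block
    float x = input[threadID];
    float y = 2.0f * x;              // stand-in for func(x)
    output[threadID] = y;
}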

Thread Blocks: Scalable Cooperation
Divide monolithic thread array into multiple blocks
Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
Threads in different blocks cannot cooperate
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N - 1, each an array of threads indexed by threadID and each running the same per-thread code:
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;]

Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data
Image processing
Solving PDEs on volumes
…
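A minimal sketch (not from the slides) of how block and thread IDs are typically combined into a global data index; the 1D layout and the kernel name are assumptions:

__global__ void work(const float* input, float* output, int n)
{
    // Global 1D index: which element of the data this thread works on.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard against threads past the end of the data
        output[i] = input[i];
}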

CUDA Memory Model Overview
Global memory
Main means of communicating R/W data between host and device
Contents visible to all threads
Long latency access
We will focus on global memory for now
Constant and texture memory will come later
[Figure: the host and a grid of blocks; each thread has registers, each block has shared memory, and all blocks share global memory.]

CUDA Device Memory Allocation
cudaMalloc()
Allocates object in the device global memory
Requires two parameters
Address of a pointer to the allocated object
Size of allocated object
cudaFree()
Frees object from device global memory
Pointer to freed object
[Figure: the same memory-model diagram, highlighting global memory.]

CUDA Device Memory Allocation (cont.)
Code example:
Allocate a 64 * 64 single precision float array
Attach the allocated storage to Md
“d” is often used to indicate a device data structure

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);
cudaFree(Md);
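A minimal sketch (not from the slides) of checking the status these API calls return; the error-handling policy is an assumption, and the fragment assumes <stdio.h> and <stdlib.h> are included:

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    // cudaGetErrorString turns the status code into a readable message.
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}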

CUDA Host-Device Data Transfer
cudaMemcpy()
Memory data transfer
Requires four parameters
Pointer to destination
Pointer to source
Number of bytes copied
Type of transfer
Host to Host
Host to Device
Device to Host
Device to Device
Asynchronous transfer
[Figure: the same memory-model diagram, showing transfers between host memory and device global memory.]

CUDA Host-Device Data Transfer (cont.)
Code example:
Transfer a 64 * 64 single precision float array
M is in host memory and Md is in device memory
cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
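A minimal sketch (not from the slides) of the asynchronous variant, reusing the M, Md, and size names above; the stream handling shown is an assumption:

cudaStream_t stream;
cudaStreamCreate(&stream);
// Returns to the host before the copy completes; the stream orders the work.
cudaMemcpyAsync(Md, M, size, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);   // block the host until the copy has finished
cudaStreamDestroy(stream);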

Calling a Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);          // 5000 thread blocks
dim3 DimBlock(4, 8, 8);         // 256 threads per block
size_t SharedMemBytes = 64;     // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
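A minimal sketch (not from the slides) of that asynchrony: the launch returns to the host immediately, and cudaDeviceSynchronize() blocks until the kernel finishes; the kernel name dummyKernel is an assumption:

__global__ void dummyKernel(void) { }

int main(void)
{
    dim3 DimGrid(100, 50);                   // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);                  // 256 threads per block
    dummyKernel<<<DimGrid, DimBlock>>>();    // returns to the host right away
    cudaDeviceSynchronize();                 // block until the kernel has finished
    return 0;
}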

Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
Leave shared memory usage until later
Local, register usage
Thread ID usage
Memory data transfer API between host and device
Assume square matrix for simplicity

Square Matrix Multiplication Example
P = M * N of size WIDTH x WIDTH
Without tiling:
One thread calculates one element of P
M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH.]

Memory Layout of a Matrix in C
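The figure on this slide is not reproduced here; it shows the row-major layout C uses. As a minimal sketch, the linearized index used throughout the following code is (row, col, Width, and M are assumed names matching the later slides):

// Row-major layout: element (row, col) of a Width x Width matrix M
// is stored at offset row * Width + col in the flat array.
float element = M[row * Width + col];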

A Simple Host Version in C
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}
[Figure: row i of M and column j of N combine to produce element (i, j) of P.]

(Host-side Code)
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
  int size = Width * Width * sizeof(float);
  float *Md, *Nd, *Pd;
  …
  // 1. Allocate and load M, N to device memory
  cudaMalloc(&Md, size);
  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
  cudaMalloc(&Nd, size);
  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

  // Allocate P on the device
  cudaMalloc(&Pd, size);

(Host-side Code)
  // 2. Kernel invocation code – to be shown later
  …
  // 3. Read P from the device
  cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

  // Free device matrices
  cudaFree(Md);
  cudaFree(Nd);
  cudaFree(Pd);
}

Step 4: Kernel Function
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;

Step 4: Kernel Function (cont.)
  for (int k = 0; k < Width; ++k)
  {
    float Melement = Md[threadIdx.y*Width+k];
    float Nelement = Nd[k*Width+threadIdx.x];
    Pvalue += Melement * Nelement;
  }
  Pd[threadIdx.y*Width+threadIdx.x] = Pvalue;
}
[Figure: thread (tx, ty) walks row ty of Md and column tx of Nd to produce one element of Pd.]

(Host-side Code)
// Setup the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Only One Thread Block Used
One block of threads computes matrix Pd
Each thread computes one element of Pd
Each thread
Loads a row of matrix Md
Loads a column of matrix Nd
Performs one multiply and one addition for each pair of Md and Nd elements
Compute to off-chip memory access ratio close to 1:1 (not very high)
Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 contains a single Block 1; Thread (2, 2) reads a row of Md and a column of Nd to compute one element of Pd.]

Matrix Multiplication Using Multiple Blocks
Break up Pd into tiles
Each block calculates one tile
Each thread calculates one element
Block size equals tile size
[Figure: Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) computes the sub-matrix Pdsub, and thread (tx, ty) within it computes one element.]

A Small Example
TILE_WIDTH = 2; the 4 x 4 matrix P is covered by four 2 x 2 blocks:
Block(0,0): P0,0  P1,0  P0,1  P1,1
Block(1,0): P2,0  P3,0  P2,1  P3,1
Block(0,1): P0,2  P1,2  P0,3  P1,3
Block(1,1): P2,2  P3,2  P2,3  P3,3

A Small Example: Multiplication
[Figure: the 4 x 4 product Pd together with the Md rows and Nd columns that are combined to produce its elements.]

Revised Matrix Multiplication Kernel using Multiple Blocks
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Calculate the row index of the Pd element and M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of Pd and N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  float Pvalue = 0;
  // Each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

  Pd[Row*Width+Col] = Pvalue;
}
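As a quick check against the small example above (TILE_WIDTH = 2, Width = 4): the thread with blockIdx = (1, 0) and threadIdx = (0, 1) gets
Row = blockIdx.y * TILE_WIDTH + threadIdx.y = 0 * 2 + 1 = 1
Col = blockIdx.x * TILE_WIDTH + threadIdx.x = 1 * 2 + 0 = 2
so it writes Pd[1 * 4 + 2], i.e. element P2,1, which indeed belongs to Block(1,0).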

(Host-side Code)
// Setup the execution configuration
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

A Common Programming Strategy
Global memory resides in device memory (DRAM) – much slower access than shared memory
So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
Partition data into subsets that fit into shared memory
Handle each data subset with one thread block by:
Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
Copying results from shared memory to global memory
(A sketch of this pattern applied to the matrix multiplication kernel is shown below.)
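A minimal sketch (not from these slides; the book develops this kernel later) of the tiled strategy applied to the matrix multiplication above; it assumes square TILE_WIDTH x TILE_WIDTH blocks and Width divisible by TILE_WIDTH:

#define TILE_WIDTH 16

__global__ void MatrixMulTiled(float* Md, float* Nd, float* Pd, int Width)
{
    // Per-block shared-memory tiles of Md and Nd.
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0;

    // Walk over the tiles of Md and Nd needed for this Pd element.
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Each thread loads one element of each tile (memory-level parallelism).
        Mds[threadIdx.y][threadIdx.x] = Md[Row * Width + m * TILE_WIDTH + threadIdx.x];
        Nds[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + Col];
        __syncthreads();                       // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)   // compute from fast shared memory
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
        __syncthreads();                       // wait before overwriting the tiles
    }
    Pd[Row * Width + Col] = Pvalue;            // copy the result back to global memory
}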

A Common Programming Strategy (Cont.)
Constant memory also resides in device memory (DRAM) – much slower access than shared memory
But… cached!
Highly efficient access for read-only data
Carefully divide data according to access patterns
R/Only → constant memory (very fast if in cache)
R/W shared within Block → shared memory (very fast)
R/W within each thread → registers (very fast)
R/W inputs/results → global memory (very slow)
For texture memory usage, see the NVIDIA documentation.
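A minimal sketch (not from the slides) of the read-only case: a small coefficient table placed in the cached constant memory space; the array name and size are assumptions:

// Read-only on the device, so it can live in cached constant memory.
__constant__ float coeffs[16];

void uploadCoeffs(const float* hostCoeffs)
{
    // Copy 16 floats from the host into the constant memory symbol.
    cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
}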

Parallel Reduction
Given an array of values, “reduce” them to a single value in parallel
Examples
Sum reduction: sum of all values in the array
Max reduction: maximum of all values in the array
Typical parallel implementation:
Recursively halve # of threads, add two values per thread
Takes log(n) steps for n elements, requires n/2 threads

A Vector Reduction Example
Assume an in-place reduction using shared memory
The original vector is in device global memory
The shared memory is used to hold a partial sum vector
Each iteration brings the partial sum vector closer to the final sum
The final solution will be in element 0

A simple implementation
Assume we have already loaded the array into
__shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}
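A minimal sketch (not from the slides) of the loading step assumed above; the names input and BLOCK_SIZE are assumptions, with BLOCK_SIZE matching the number of threads per block:

#define BLOCK_SIZE 256

__shared__ float partialSum[BLOCK_SIZE];

unsigned int t = threadIdx.x;
unsigned int start = blockIdx.x * blockDim.x;   // first element handled by this block
partialSum[t] = input[start + t];               // one element per thread
__syncthreads();                                // all loads finish before the reduction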

Vector Reduction with Branch Divergence
[Figure: iterations of the reduction over the array elements. Iteration 1: threads 0, 2, 4, 6, 8, 10 compute 0+1, 2+3, 4+5, 6+7, 8+9, 10+11. Iteration 2: partial sums 0..3, 4..7, 8..11. Iteration 3: partial sums 0..7, 8..15.]

Some Observations
In each iteration, two control flow paths will be sequentially traversed for each warp
Threads that perform addition and threads that do not
Threads that do not perform addition may cost extra cycles depending on the implementation of divergence
No more than half of the threads will be executing at any time
All odd-index threads are disabled right from the beginning!
On average, less than ¼ of the threads will be activated for all warps over time
After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence
This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated per warp until all warps retire

Shortcomings of the implementation
Assume we have already loaded the array into
__shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}

BAD: divergence due to interleaved branch decisions

A better implementation
Assume we have already loaded the array into
__shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x / 2; stride >= 1; stride >>= 1)
{
  __syncthreads();
  if (t < stride)
    partialSum[t] += partialSum[t+stride];
}
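A minimal sketch (not from the slides) of what typically follows the loop: as noted earlier, the block's sum ends up in element 0, so one thread writes it out; the blockSums array is an assumption:

if (t == 0)
    blockSums[blockIdx.x] = partialSum[0];   // one partial result per thread block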

No Divergence until < 16 sub-sums
[Figure: the first step of the better implementation over 32 elements. Threads 0 through 15 compute 0+16, 1+17, …, 15+31; the active threads are contiguous, so later steps halve the number of consecutive active threads.]