GPU Programming Lecture 5: Convolution

Presentation transcript:

GPU Programming Lecture 5: Convolution © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012

Objective
- To learn convolution, an important parallel computation pattern
  - Widely used in signal, image, and video processing
  - Foundational to the stencil computations used in many science and engineering applications
- Important techniques
  - Taking advantage of cache memories

Convolution Computation
- Each output element is a weighted sum of neighboring input elements.
- The weights are defined by the convolution kernel, which is also called the convolution mask.
- The same convolution mask is typically used for all elements of the array.

Gaussian Blur
[Figure: a simple integer Gaussian kernel.]

1D Convolution Example
- Commonly used for audio processing
- Mask size is usually an odd number for symmetry (5 in this example)
- Calculation of P[2]
[Figure: N = (1, 2, 3, 4, 5, 6, 7), M = (3, 4, 5, 4, 3). P[2] = N[0]*M[0] + N[1]*M[1] + N[2]*M[2] + N[3]*M[3] + N[4]*M[4] = 3 + 8 + 15 + 16 + 15 = 57.]

1D Convolution Boundary Condition
- Calculating output elements near the boundaries (beginning and end) of the input array requires dealing with "ghost" elements.
- Different policies are possible (fill with 0, replicate the boundary values, etc.).
[Figure: calculation of P[1] with the ghost element filled in as 0: 0*3 + 1*4 + 2*5 + 3*4 + 4*3 = 38.]
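The 1D example can be checked with a small sequential reference. This is only a sketch: the function names `conv1d_reference` and `conv1d_example` are hypothetical, the input and mask values are the ones that appear in the example figures, and ghost elements are assumed to be 0.

```c
#include <assert.h>

/* Sequential reference for 1D convolution with zero-filled ghost elements.
   n: input of length width, m: mask of odd length mask_width,
   p: output of length width. */
void conv1d_reference(const float *n, const float *m, float *p,
                      int mask_width, int width)
{
    assert(mask_width % 2 == 1);
    for (int i = 0; i < width; i++) {
        float sum = 0.0f;
        int start = i - mask_width / 2;      /* leftmost input index under the mask */
        for (int j = 0; j < mask_width; j++) {
            int k = start + j;
            if (k >= 0 && k < width)         /* skip ghost elements (treated as 0) */
                sum += n[k] * m[j];
        }
        p[i] = sum;
    }
}

/* Runs the reference on the example data and returns P[i]. */
float conv1d_example(int i)
{
    static const float n[7] = {1, 2, 3, 4, 5, 6, 7};
    static const float m[5] = {3, 4, 5, 4, 3};
    float p[7];
    assert(i >= 0 && i < 7);
    conv1d_reference(n, m, p, 5, 7);
    return p[i];
}
```

Calling `conv1d_example(2)` reproduces the interior case and `conv1d_example(1)` the boundary case with one ghost element.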

A 1D Convolution Kernel with Boundary Condition Handling
All elements outside the array are treated as 0.

    __global__ void basic_1D_conv(float *N, float *M, float *P, int Mask_Width, int Width) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        float Pvalue = 0;
        int N_start_point = i - (Mask_Width/2);
        for (int j = 0; j < Mask_Width; j++) {
            if (N_start_point + j >= 0 && N_start_point + j < Width) {
                Pvalue += N[N_start_point + j]*M[j];
            }
        }
        P[i] = Pvalue;
    }

Is the kernel function completely correct? Not quite: when the grid contains more threads than Width, threads with i >= Width write past the end of P. Guarding the body with a bounds check on i fixes this:

    __global__ void basic_1D_conv(float *N, float *M, float *P, int Mask_Width, int Width) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < Width) {   // threads beyond the end of the array do nothing
            float Pvalue = 0;
            int N_start_point = i - (Mask_Width/2);
            for (int j = 0; j < Mask_Width; j++) {
                if (N_start_point + j >= 0 && N_start_point + j < Width) {
                    Pvalue += N[N_start_point + j]*M[j];
                }
            }
            P[i] = Pvalue;
        }
    }

2D Convolution
[Figure: a 5x5 mask M applied to a 2D input N. The 25 element-wise products for the output element P[2][2] sum to 321.]

2D Convolution Boundary Condition
[Figure: the 5x5 mask M applied at an output element on the edge of N. Mask cells that fall on ghost elements contribute 0, and the remaining products sum to 112.]

    M =  1 2 3 2 1
         2 3 4 3 2
         3 4 5 4 3
         2 3 4 3 2
         1 2 3 2 1
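A sequential 2D reference makes both figures easy to reproduce. The sketch below assumes zero-filled ghost elements and row-major storage. The names `conv2d_reference` and `conv2d_example` are hypothetical. The mask and the top-left part of N follow the figures, while the remaining entries of the 7x7 input are an assumed reconstruction that does not affect the two values quoted on the slides.

```c
#include <assert.h>

#define MASK_WIDTH 5

/* Sequential reference for 2D convolution; ghost elements outside N count as 0.
   n and p are row-major arrays holding height*width floats. */
void conv2d_reference(const float *n, const float m[MASK_WIDTH][MASK_WIDTH],
                      float *p, int height, int width)
{
    const int r = MASK_WIDTH / 2;            /* mask radius */
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            float sum = 0.0f;
            for (int i = 0; i < MASK_WIDTH; i++) {
                for (int j = 0; j < MASK_WIDTH; j++) {
                    int y = row - r + i;
                    int x = col - r + j;
                    if (y >= 0 && y < height && x >= 0 && x < width)
                        sum += n[y * width + x] * m[i][j];
                }
            }
            p[row * width + col] = sum;
        }
    }
}

/* Runs the reference on a 7x7 example and returns P[row][col]. */
float conv2d_example(int row, int col)
{
    static const float n[49] = {
        1, 2, 3, 4, 5, 6, 7,
        2, 3, 4, 5, 6, 7, 8,
        3, 4, 5, 6, 7, 8, 9,
        4, 5, 6, 7, 8, 5, 6,
        5, 6, 7, 8, 5, 6, 7,
        6, 7, 8, 9, 0, 1, 2,   /* lower rows: assumed reconstruction */
        7, 8, 9, 0, 1, 2, 3
    };
    static const float m[MASK_WIDTH][MASK_WIDTH] = {
        {1, 2, 3, 2, 1},
        {2, 3, 4, 3, 2},
        {3, 4, 5, 4, 3},
        {2, 3, 4, 3, 2},
        {1, 2, 3, 2, 1}
    };
    float p[49];
    assert(row >= 0 && row < 7 && col >= 0 && col < 7);
    conv2d_reference(n, m, p, 7, 7);
    return p[row * 7 + col];
}
```

Here `conv2d_example(2, 2)` corresponds to the interior figure and `conv2d_example(1, 0)` to the boundary figure.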

Two Optimizations
- Use constant memory for the convolution mask
- Use the tiling technique

Access Pattern for M
- M is referred to as the mask (a.k.a. kernel, filter, etc.)
- Calculating every output element in P needs M
  - Total of O(P*M) reads of M, where P and M stand for the number of elements in each
- M is not changed during kernel execution
- Bonus: M elements are accessed in the same order when calculating all P elements
- M is a good candidate for constant memory

Programmer View of CUDA Memories (Review)
Each thread can:
- Read/write per-thread registers (~1 cycle)
- Read/write per-block shared memory (~5 cycles)
- Read/write per-grid global memory (~500 cycles)
- Read only per-grid constant memory (~5 cycles with caching)
[Figure: a grid with blocks (0, 0) and (1, 0), each with its own shared memory/L1 cache and per-thread registers; global memory and constant memory serve the whole grid and are accessible from the host.]
Global, constant, and texture memory spaces are persistent across kernels called by the same application.

Memory Hierarchies
- If we had to go to main memory every time we needed a piece of data, computers would take much longer to do anything.
- On today's processors, main memory accesses take hundreds of cycles.
- One solution: caches (backpack analogy).

Cache
- A cache is a unit of volatile memory storage.
- A cache is an "array" of cache lines.
- A cache line can usually hold data from several consecutive memory addresses.
- When data is requested from memory, an entire cache line is loaded into the cache in an attempt to reduce main memory requests.
- A cache is transparent to the program; a scratchpad is not.

Caches (Cont'd)
Some definitions:
- Spatial locality: data elements stored in consecutive memory locations are accessed consecutively.
- Temporal locality: the same data element is accessed multiple times within a short period of time.
Both spatial locality and temporal locality improve the performance of caches.

Scratchpad vs. Cache
- Scratchpad (shared memory in CUDA) is another type of temporary storage used to relieve main memory contention.
- In terms of distance from the processor, scratchpad is similar to L1 cache.
- Unlike a cache, scratchpad memory requires explicit data transfer instructions.

Cache Coherence Protocol
A mechanism for caches to propagate updates made by their local processor to the other caches (processors). The backpack analogy applies here as well.
[Figure: a chip with several processors, each with its own registers and L1 cache, all connected to main memory.]

CPU and GPU Have Different Caching Philosophies
- CPU L1 caches are usually coherent
  - L1 is also replicated for each core
  - Even data that will be changed can be cached in L1
  - An update to the local cache copy invalidates (or, less commonly, updates) the copies in other caches
  - Expensive in terms of hardware and disruption of service (like cleaning bathrooms at airports)
- GPU L1 caches are usually incoherent
  - Avoid caching data that will be modified

How to Use Constant Memory
- Host code allocates and initializes the variable the same way as any other variable that needs to be copied to the device.
- Use cudaMemcpyToSymbol(dest, src, size) to copy the variable into device memory.
- Declaring the destination with __constant__ and filling it through this copy function tells the device that the kernel will not modify the variable, so it can be safely cached.

More on Constant Caching
- Each SM has its own L1 cache
  - Low-latency, high-bandwidth access by all threads
- However, threads in one SM have no way to update the L1 cache in other SMs
  - No L1 cache coherence
- This is not a problem if a variable is NOT modified by a kernel.

Some Header File Stuff for M

    #define MASK_WIDTH 5

    // Mask structure declaration
    typedef struct {
        unsigned int width;
        unsigned int height;
        float* elements;
    } Mask;

AllocateMask()

    #include <stdlib.h>   // malloc, rand, RAND_MAX, NULL

    // Allocate a mask of dimensions height*width
    // If init == 0, initialize to all zeroes.
    // If init == 1, perform random initialization.
    // If init == 2, initialize mask parameters,
    // but do not allocate memory
    Mask AllocateMask(int height, int width, int init) {
        Mask M;
        M.width = width;
        M.height = height;
        int size = M.width * M.height;
        M.elements = NULL;

        // don't allocate memory on option 2
        if (init == 2) return M;

        M.elements = (float*) malloc(size*sizeof(float));
        for (unsigned int i = 0; i < M.height * M.width; i++) {
            M.elements[i] = (init == 0) ? (0.0f) : (rand() / (float)RAND_MAX);
        }
        return M;
    }

Host Code

    // global variable, outside any function
    __constant__ float Mc[MASK_WIDTH][MASK_WIDTH];
    …
    // allocate N, P, initialize N elements, copy N to Nd
    Mask M;
    M = AllocateMask(MASK_WIDTH, MASK_WIDTH, 1);
    // initialize M elements
    ….
    cudaMemcpyToSymbol(Mc, M.elements, MASK_WIDTH * MASK_WIDTH * sizeof(float));
    ConvolutionKernel<<<dimGrid, dimBlock>>>(Nd, Pd, width, height);

Observation on Convolution
Each element in N is used in calculating up to MASK_WIDTH * MASK_WIDTH elements in P.
[Figure: overlapping mask windows over N.]

Input tiles need to be larger than output tiles.
[Figure: an input tile and the smaller output tile computed from it.]

Dealing with Mismatch
- Use a thread block that matches the input tile
  - Why? Each thread loads one element of the input tile
  - Special case: halo cells
- Not all threads participate in calculating output
  - There will be if statements and control divergence

Shifting from output coordinates to input coordinates
[Figure: an output tile of TILE_SIZE x TILE_SIZE inside a thread block of BLOCK_SIZE x BLOCK_SIZE, indexed by tx and ty.]

Shifting from output coordinates to input coordinates

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row_o = blockIdx.y * TILE_SIZE + ty;
    int col_o = blockIdx.x * TILE_SIZE + tx;
    int row_i = row_o - 2; // Assumes kernel size is 5
    int col_i = col_o - 2; // Assumes kernel size is 5

Threads that load halo cells outside N should use 0.0 instead.

Taking Care of Boundaries

    float output = 0.0f;
    __shared__ float Ns[TILE_SIZE+KERNEL_SIZE-1][TILE_SIZE+KERNEL_SIZE-1];
    if ((row_i >= 0) && (row_i < N.height) && (col_i >= 0) && (col_i < N.width))
        Ns[ty][tx] = N.elements[row_i*N.width + col_i];
    else
        Ns[ty][tx] = 0.0f;
    __syncthreads(); // the whole input tile must be loaded before any thread reads Ns

Some threads do not participate in calculating output.

    if (ty < TILE_SIZE && tx < TILE_SIZE) {
        for (int i = 0; i < 5; i++)        // 5 is the mask width
            for (int j = 0; j < 5; j++)
                output += Mc[i][j] * Ns[i+ty][j+tx];
    }

Some threads do not write output.

    // Threads with ty or tx >= TILE_SIZE only load halo cells; their row_o/col_o
    // fall in a neighboring tile, so they must not write.
    if (ty < TILE_SIZE && tx < TILE_SIZE && row_o < P.height && col_o < P.width)
        P.elements[row_o * P.width + col_o] = output;
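The load, compute, and write steps of the tiled kernel can be emulated on the host to check the index arithmetic. This is a rough sketch rather than the course's code: the loops over (ty, tx) stand in for the threads of a block, a local array stands in for shared memory, `emulate_block` and `tiled_example` are hypothetical names, TILE_SIZE = 4 is an arbitrary choice, and the lower rows of the 7x7 input are an assumed reconstruction.

```c
#include <assert.h>

#define TILE_SIZE 4
#define KERNEL_SIZE 5
#define BLOCK_SIZE (TILE_SIZE + KERNEL_SIZE - 1)

/* Emulates thread block (bx, by) of the tiled 2D convolution on the host. */
static void emulate_block(const float *n, const float mc[KERNEL_SIZE][KERNEL_SIZE],
                          float *p, int height, int width, int bx, int by)
{
    float ns[BLOCK_SIZE][BLOCK_SIZE];      /* stands in for __shared__ Ns */

    /* Phase 1: every thread loads one input element, or 0 for a ghost cell. */
    for (int ty = 0; ty < BLOCK_SIZE; ty++) {
        for (int tx = 0; tx < BLOCK_SIZE; tx++) {
            int row_i = by * TILE_SIZE + ty - KERNEL_SIZE / 2;
            int col_i = bx * TILE_SIZE + tx - KERNEL_SIZE / 2;
            int inside = row_i >= 0 && row_i < height && col_i >= 0 && col_i < width;
            ns[ty][tx] = inside ? n[row_i * width + col_i] : 0.0f;
        }
    }
    /* On the GPU, a barrier separates the two phases. */

    /* Phase 2: only the first TILE_SIZE x TILE_SIZE threads compute and write. */
    for (int ty = 0; ty < TILE_SIZE; ty++) {
        for (int tx = 0; tx < TILE_SIZE; tx++) {
            int row_o = by * TILE_SIZE + ty;
            int col_o = bx * TILE_SIZE + tx;
            float output = 0.0f;
            for (int i = 0; i < KERNEL_SIZE; i++)
                for (int j = 0; j < KERNEL_SIZE; j++)
                    output += mc[i][j] * ns[i + ty][j + tx];
            if (row_o < height && col_o < width)
                p[row_o * width + col_o] = output;
        }
    }
}

/* Runs every block of the grid over a 7x7 example and returns P[row][col]. */
float tiled_example(int row, int col)
{
    static const float n[49] = {
        1, 2, 3, 4, 5, 6, 7,
        2, 3, 4, 5, 6, 7, 8,
        3, 4, 5, 6, 7, 8, 9,
        4, 5, 6, 7, 8, 5, 6,
        5, 6, 7, 8, 5, 6, 7,
        6, 7, 8, 9, 0, 1, 2,   /* lower rows: assumed reconstruction */
        7, 8, 9, 0, 1, 2, 3
    };
    static const float mc[KERNEL_SIZE][KERNEL_SIZE] = {
        {1, 2, 3, 2, 1}, {2, 3, 4, 3, 2}, {3, 4, 5, 4, 3},
        {2, 3, 4, 3, 2}, {1, 2, 3, 2, 1}
    };
    const int height = 7, width = 7;
    const int grid = (width + TILE_SIZE - 1) / TILE_SIZE;  /* blocks per dimension */
    float p[49];
    assert(row >= 0 && row < height && col >= 0 && col < width);
    for (int by = 0; by < grid; by++)
        for (int bx = 0; bx < grid; bx++)
            emulate_block(n, mc, p, height, width, bx, by);
    return p[row * width + col];
}
```

With these inputs the emulation should agree with a direct, untiled convolution at every output element, including elements such as P[6][6] that are produced by a partially filled tile.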

Setting Block Size

    #define BLOCK_SIZE (TILE_SIZE + 4)
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

In general, block size should be tile size + (kernel size - 1).
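The launch geometry implied by this rule can be sketched in plain C. The slides do not show how dimGrid is computed, so the ceiling division over output tiles below is an assumption, and `LaunchDims` and `launch_dims` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical helper type: square block edge plus grid extent in blocks. */
typedef struct {
    int block;    /* threads per block edge (input tile edge) */
    int grid_x;   /* blocks along the width */
    int grid_y;   /* blocks along the height */
} LaunchDims;

/* Block edge = tile size + (kernel size - 1); one block per output tile,
   rounding up so partial tiles at the right and bottom edges are covered. */
LaunchDims launch_dims(int width, int height, int tile_size, int kernel_size)
{
    LaunchDims d;
    assert(tile_size > 0 && kernel_size > 0 && kernel_size % 2 == 1);
    d.block = tile_size + kernel_size - 1;
    d.grid_x = (width + tile_size - 1) / tile_size;
    d.grid_y = (height + tile_size - 1) / tile_size;
    return d;
}
```

For a 100 x 50 output with 8 x 8 output tiles and a 5 x 5 mask, this gives 12 x 12 threads per block and a 13 x 7 grid.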

In General
- BLOCK_SIZE is limited by the maximum number of threads in a thread block
- Thread coarsening
  - Input tile sizes could be integer multiples of BLOCK_SIZE
  - Each thread calculates n output points
  - n is limited by the shared memory size
- KERNEL_SIZE is decided by application needs

A Simple Analysis
For a small 8x8 output tile and a 5x5 mask:
- 12x12 = 144 N elements need to be loaded into shared memory
- The calculation of each P element needs to access 25 N elements
- 8x8x25 = 1600 global memory accesses are converted into shared memory accesses
- A reduction of 1600/144, or about 11x

In General
- (Tile_Width + Mask_Width - 1)^2 elements of N need to be loaded into shared memory
- The calculation of each element of P needs to access Mask_Width^2 elements of N
- Tile_Width^2 * Mask_Width^2 global memory accesses are converted into shared memory accesses

Bandwidth Reduction for 2D
The reduction is Mask_Width^2 * Tile_Width^2 / (Tile_Width + Mask_Width - 1)^2

    Tile_Width                     8      16     32     64
    Reduction (Mask_Width = 5)     11.1   16     19.7   22.1
    Reduction (Mask_Width = 9)     20.3   36     51.8   64
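The table entries follow directly from the formula. As a quick check, a minimal sketch (the function name `bandwidth_reduction_2d` is hypothetical) evaluates it for any tile and mask width:

```c
#include <assert.h>

/* Ratio of global memory accesses without tiling to accesses with tiling:
   Mask_Width^2 * Tile_Width^2 / (Tile_Width + Mask_Width - 1)^2 */
double bandwidth_reduction_2d(int tile_width, int mask_width)
{
    assert(tile_width > 0 && mask_width > 0);
    double outputs = (double)tile_width * tile_width;     /* P elements per tile */
    double per_output = (double)mask_width * mask_width;  /* N reads per P element */
    double side = (double)(tile_width + mask_width - 1);  /* input tile edge */
    return outputs * per_output / (side * side);
}
```

For example, `bandwidth_reduction_2d(8, 5)` evaluates 1600/144, the 8x8 tile case analyzed earlier, and the ratio approaches Mask_Width^2 as the tile grows.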