
1 GPU Programming Lecture 5: Convolution © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012

2 Objective To learn convolution, an important parallel computation pattern –Widely used in signal, image, and video processing –Foundational to the stencil computation used in many science and engineering applications Important techniques –Taking advantage of cache memories © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 2

3 Convolution Computation Each output element is a weighted sum of neighboring input elements The weights are defined as the convolution kernel –Convolution kernel is also called convolution mask –The same convolution mask is typically used for all elements of the array. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 3
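As a concrete reference (a serial sketch added here, not code from the slides), the weighted sum for a 1D array can be written as:

// Serial 1D convolution: each P[i] is a weighted sum of the N elements around i,
// with the weights taken from the mask M (out-of-range neighbors treated as 0;
// boundary handling is discussed on a later slide).
void conv1d_serial(const float *N, const float *M, float *P,
                   int Mask_Width, int Width) {
    for (int i = 0; i < Width; i++) {
        float Pvalue = 0.0f;
        int start = i - Mask_Width / 2;        // center the mask on element i
        for (int j = 0; j < Mask_Width; j++) {
            int idx = start + j;
            if (idx >= 0 && idx < Width)
                Pvalue += N[idx] * M[j];
        }
        P[i] = Pvalue;
    }
}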

4 Gaussian Blur © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 4 Simple Integer Gaussian Kernel
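The slide's figure is not reproduced in this transcript; one commonly used simple integer approximation of a 5x5 Gaussian kernel (an assumption, not necessarily the exact values shown on the slide) is the outer product of the binomial row {1, 4, 6, 4, 1}:

// 5x5 integer Gaussian approximation; the weights sum to 256, so each weighted
// sum is divided by 256 to normalize.
const int gaussian5x5[5][5] = {
    { 1,  4,  6,  4, 1 },
    { 4, 16, 24, 16, 4 },
    { 6, 24, 36, 24, 6 },
    { 4, 16, 24, 16, 4 },
    { 1,  4,  6,  4, 1 },
};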

5 1D Convolution Example Commonly used for audio processing –Mask size is usually an odd number of elements for symmetry (5 in this example) Calculation of P[2] [Figure: with N = {1, 2, 3, 4, 5, 6, 7} and M = {3, 4, 5, 4, 3}, the element-wise products are 3, 8, 15, 16, 15, so P[2] = 57]

6 1D Convolution Boundary Condition Calculation of output elements near the boundaries (beginning and end) of the input array needs to deal with “ghost” elements –Different policies (0, replication of boundary values, etc.) [Figure: calculation of P[1], with the missing left neighbor filled in as a ghost value of 0: P[1] = 0·3 + 1·4 + 2·5 + 3·4 + 4·3 = 38]
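Besides treating ghost elements as 0, the "replication of boundary values" policy mentioned above can be sketched as follows (an added illustration, not code from the slides):

// Replicate (clamp-to-edge) policy: out-of-range indices are replaced by the
// nearest valid index instead of contributing 0.
float conv1d_point_clamped(const float *N, const float *M,
                           int Mask_Width, int Width, int i) {
    float Pvalue = 0.0f;
    int start = i - Mask_Width / 2;
    for (int j = 0; j < Mask_Width; j++) {
        int idx = start + j;
        if (idx < 0)         idx = 0;           // replicate left boundary value
        if (idx > Width - 1) idx = Width - 1;   // replicate right boundary value
        Pvalue += N[idx] * M[j];
    }
    return Pvalue;
}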

7 A 1D Convolution Kernel with Boundary Condition Handling

__global__ void basic_1D_conv(float *N, float *M, float *P, int Mask_Width, int Width) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    float Pvalue = 0;
    int N_start_point = i - (Mask_Width/2);
    for (int j = 0; j < Mask_Width; j++) {
        if (N_start_point + j >= 0 && N_start_point + j < Width) {
            Pvalue += N[N_start_point + j]*M[j];
        }
    }
    P[i] = Pvalue;
}

All elements outside the array are treated as 0
Is the kernel function completely correct?
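One likely answer (an observation added here, not stated on the slide): when Width is not a multiple of the block size, threads with i >= Width write past the end of P. A minimal guarded sketch:

// Hedged sketch: the same kernel with a guard so extra threads in the last
// block do nothing instead of writing out of bounds.
__global__ void basic_1D_conv_guarded(float *N, float *M, float *P,
                                      int Mask_Width, int Width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= Width) return;                    // thread has no output element
    float Pvalue = 0.0f;
    int N_start_point = i - (Mask_Width / 2);
    for (int j = 0; j < Mask_Width; j++) {
        if (N_start_point + j >= 0 && N_start_point + j < Width)
            Pvalue += N[N_start_point + j] * M[j];
    }
    P[i] = Pvalue;
}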

8 2D Convolution

9 2D Convolution Boundary Condition [Figure: calculation of an output element near the edge of a 2D input array with a 5x5 mask; input positions that fall outside the array are treated as ghost cells with value 0]
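For reference (an added sketch, not code from the slides), a serial 2D convolution with the zero/ghost-cell policy is useful for checking GPU results; the names below are hypothetical:

// Serial 2D convolution with the zero policy. N and P are row-major arrays of
// size height x width; M is a square mask of size mask_width x mask_width.
void conv2d_serial(const float *N, const float *M, float *P,
                   int height, int width, int mask_width) {
    int r = mask_width / 2;                               // mask radius
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            float val = 0.0f;
            for (int i = 0; i < mask_width; i++) {
                for (int j = 0; j < mask_width; j++) {
                    int y = row - r + i, x = col - r + j;
                    if (y >= 0 && y < height && x >= 0 && x < width)
                        val += N[y * width + x] * M[i * mask_width + j];
                }
            }
            P[row * width + col] = val;
        }
    }
}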

10 Two optimizations © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 10 Use Constant Memory for the convolution mask Use tiling technique

11 Access Pattern for M M is referred to as the mask (a.k.a. kernel, filter, etc.) Calculation of all output P elements needs M Total of O(P*M) reads of M M is not changed during kernel execution Bonus - M elements are accessed in the same order when calculating all P elements M is a good candidate for Constant Memory © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 11

12 Programmer View of CUDA Memories (Review) Each thread can: –Read/write per-thread registers (~1 cycle) –Read/write per-block shared memory (~5 cycles) –Read/write per-grid global memory (~500 cycles) –Read-only per-grid constant memory (~5 cycles with caching) [Figure: grid with two blocks, each with its own shared memory/L1 cache and per-thread registers; global memory and constant memory are shared by the whole grid and accessible from the host] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012

13 Memory Hierarchies If every time we needed a piece of data we had to go to main memory to get it, computers would take a lot longer to do anything On today’s processors, main memory accesses take hundreds of cycles One solution: Caches

14 Cache A cache is a unit of volatile memory storage A cache is an “array” of cache lines A cache line can usually hold data from several consecutive memory addresses When data is requested from memory, an entire cache line is loaded into the cache, in an attempt to reduce main memory requests © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 14

15 Caches - Cont’d Some definitions: –Spatial locality: data elements stored in consecutive memory locations are accessed close together in time –Temporal locality: the same data element is accessed multiple times within a short period of time Both spatial locality and temporal locality improve the performance of caches © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 15
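A small example of both kinds of locality in convolution-style code (illustrative only, not from the slides):

// The inner loop of convolution exhibits both kinds of locality.
float weighted_sum(const float *N, const float *M, int start, int Mask_Width) {
    float Pvalue = 0.0f;
    for (int j = 0; j < Mask_Width; j++) {
        // Spatial locality: N[start], N[start+1], ... sit at consecutive
        // addresses, so one cache line serves several of these reads.
        // Temporal locality: the same few M[j] values are re-read for every
        // output element shortly after their last use, so they tend to stay cached.
        Pvalue += N[start + j] * M[j];
    }
    return Pvalue;
}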

16 © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 16 Scratchpad vs. Cache Scratchpad (shared memory in CUDA) is another type of temporary storage used to relieve main memory contention. In terms of distance from the processor, scratchpad is similar to L1 cache. Unlike cache, scratchpad does not necessarily hold a copy of data that is also in main memory It requires explicit data transfer instructions, whereas cache doesn’t

17 Cache Coherence Protocol A mechanism for caches to propagate updates made by their local processor to other caches (processors) [Figure: a chip with several processors, each with its own registers and L1 cache, all sharing main memory] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 17

18 CPUs and GPUs have different caching philosophies CPU L1 caches are usually coherent –L1 is also replicated for each core –Even data that will be changed can be cached in L1 –Updates to a local cache copy invalidate (or, less commonly, update) copies in other caches –Expensive in terms of hardware and disruption of services (like cleaning bathrooms at airports..) GPU L1 caches are usually incoherent –Avoid caching data that will be modified © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 18

19 How to Use Constant Memory Host code allocates and initializes the variable the same way as any other variable that needs to be copied to the device Use cudaMemcpyToSymbol(dest, src, size) to copy the variable into device memory This copy function tells the device that the variable will not be modified by the kernel and can be safely cached. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 19

20 More on Constant Caching Each SM has its own L1 cache –Low latency, high bandwidth access by all threads However, there is no way for threads in one SM to update the L1 cache in other SMs –No L1 cache coherence This is not a problem if a variable is NOT modified by a kernel. [Figure: the same CUDA memory diagram as the earlier review slide, with constant memory highlighted] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012

21 Some Header File Stuff for M

#define MASK_WIDTH 5

// Mask Structure declaration
typedef struct {
    unsigned int width;
    unsigned int height;
    float* elements;
} Mask;

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 21

22 AllocateMask

// Allocate a mask of dimensions height*width
// If init == 0, initialize to all zeroes.
// If init == 1, perform random initialization.
// If init == 2, initialize mask parameters,
//    but do not allocate memory
Mask AllocateMask(int height, int width, int init)
{
    Mask M;
    M.width = width;
    M.height = height;
    int size = M.width * M.height;
    M.elements = NULL;

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 22

23 AllocateMask() (Cont.)

    // don't allocate memory on option 2
    if(init == 2) return M;

    M.elements = (float*) malloc(size*sizeof(float));
    for(unsigned int i = 0; i < M.height * M.width; i++) {
        M.elements[i] = (init == 0) ? (0.0f) : (rand() / (float)RAND_MAX);
        if(rand() % 2) M.elements[i] = - M.elements[i];
    }
    return M;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 23

24 Host Code

// global variable, outside any function
__constant__ float Mc[MASK_WIDTH][MASK_WIDTH];
…
// allocate N, P, initialize N elements, copy N to Nd
Mask M;
M = AllocateMask(MASK_WIDTH, MASK_WIDTH, 1);
// initialize M elements
….
cudaMemcpyToSymbol(Mc, M.elements, MASK_WIDTH * MASK_WIDTH * sizeof(float));
ConvolutionKernel<<<dimGrid, dimBlock>>>(Nd, Pd, width, height);
// note: the <<<...>>> launch configuration was lost in transcription;
// dimGrid/dimBlock are a reconstruction

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 24

25 Observation on Convolution Each element in N is used in calculating up to MASK_WIDTH * MASK_WIDTH elements in P [Figure: the same input element is covered by each of the MASK_WIDTH x MASK_WIDTH overlapping mask positions centered on its neighbors] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 25

26 Input tiles need to be larger than output tiles. [Figure: an output tile and the corresponding input tile, which extends the output tile by the mask halo on every side] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 26

27 Dealing with Mismatch Use a thread block that matches input tile –Why? Each thread loads one element of the input tile –Special case: halo cells Not all threads participate in calculating output –There will be if statements and control divergence © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 27

28 Shifting from output coordinates to input coordinates [Figure: the TILE_SIZE x TILE_SIZE output tile positioned inside the larger BLOCK_SIZE x BLOCK_SIZE input tile, with thread indices tx, ty] © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 28

29 Shifting from output coordinates to input coordinates

int tx = threadIdx.x;
int ty = threadIdx.y;
int row_o = blockIdx.y * TILE_SIZE + ty;
int col_o = blockIdx.x * TILE_SIZE + tx;
int row_i = row_o - 2; // Assumes kernel size is 5
int col_i = col_o - 2; // Assumes kernel size is 5

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 29
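The constant 2 above is the mask radius for a 5-point mask; a version that does not hard-code it (a small sketch using the KERNEL_SIZE macro from the later slides) would be:

int row_i = row_o - (KERNEL_SIZE / 2);   // shift up by the mask radius
int col_i = col_o - (KERNEL_SIZE / 2);   // shift left by the mask radius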

30 Threads that load halo cells outside N should store 0.0 (the ghost-element value) © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 30

31 Taking Care of Boundaries

float output = 0.0f;
__shared__ float Ns[TILE_SIZE + KERNEL_SIZE - 1][TILE_SIZE + KERNEL_SIZE - 1];

if((row_i >= 0) && (row_i < N.height) &&
   (col_i >= 0) && (col_i < N.width))
    Ns[ty][tx] = N.elements[row_i * N.width + col_i];
else
    Ns[ty][tx] = 0.0f;

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 31
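Note (added here; the corresponding line is not shown on these transcribed slides): between loading Ns and using it on the next slide, the kernel needs a barrier so that every thread sees the elements loaded by the others:

__syncthreads();   // wait until the whole input tile is in shared memory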

32 Some threads do not participate in calculating output.

if(ty < TILE_SIZE && tx < TILE_SIZE){
    for(int i = 0; i < 5; i++)
        for(int j = 0; j < 5; j++)
            output += Mc[i][j] * Ns[i+ty][j+tx];
}

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 32

33 Some threads do not write output

if(row_o < P.height && col_o < P.width)
    P.elements[row_o * P.width + col_o] = output;

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 33

34 Setting Block Size

#define BLOCK_SIZE (TILE_SIZE + 4)
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

In general, block size should be tile size + (kernel size - 1) © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 34
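Written with the macros already used on these slides (the +4 above corresponds to KERNEL_SIZE - 1 for a 5-point mask), the general rule is:

#define BLOCK_SIZE (TILE_SIZE + KERNEL_SIZE - 1)   // one thread per input-tile element
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);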

35 In General BLOCK_SIZE is limited by the maximum number of threads in a thread block Thread Coarsening –Input tile sizes could be integer multiples of BLOCK_SIZE –Each thread calculates n output points (see the sketch below) –n is limited by the shared memory size KERNEL_SIZE is decided by application needs © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 35
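A minimal sketch of thread coarsening in 1D (an illustration added here, not code from the slides; it omits shared memory so only the "each thread computes n outputs" idea is shown, and COARSEN and all names are hypothetical):

#define COARSEN 4   // hypothetical coarsening factor n

__global__ void coarsened_1D_conv(const float *N, const float *M, float *P,
                                  int Mask_Width, int Width) {
    // Each thread produces COARSEN consecutive output elements.
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
    for (int k = 0; k < COARSEN; k++) {
        int i = base + k;
        if (i >= Width) return;            // past the end of P: nothing left to do
        float Pvalue = 0.0f;
        int start = i - Mask_Width / 2;
        for (int j = 0; j < Mask_Width; j++) {
            int idx = start + j;
            if (idx >= 0 && idx < Width)   // zero policy for ghost elements
                Pvalue += N[idx] * M[j];
        }
        P[i] = Pvalue;
    }
}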

36 A Simple Analysis For a small 8x8 output tile example with a 5x5 mask: 12x12 = 144 N elements need to be loaded into shared memory The calculation of each P element needs to access 25 N elements 8x8x25 = 1600 global memory accesses are converted into shared memory accesses A reduction of 1600/144 ≈ 11x © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 36

37 In General (Tile_Width + Mask_Width - 1)^2 elements of N need to be loaded into shared memory The calculation of each element of P needs to access Mask_Width^2 elements of N Tile_Width^2 * Mask_Width^2 global memory accesses are converted into shared memory accesses © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 37

38 Bandwidth Reduction for 2D The reduction is Mask_Width^2 * Tile_Width^2 / (Tile_Width + Mask_Width - 1)^2

Tile_Width                    8      16     32     64
Reduction (Mask_Width = 5)    11.1   16     19.7   22.1
Reduction (Mask_Width = 9)    20.3   36     51.8   64

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012 38
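As a quick check (an added sketch, not from the slides), the table entries can be reproduced with a few lines of host code:

#include <stdio.h>

// Prints the bandwidth-reduction factor for each (Mask_Width, Tile_Width) pair
// using the formula on the slide above.
int main(void) {
    int masks[] = {5, 9}, tiles[] = {8, 16, 32, 64};
    for (int m = 0; m < 2; m++)
        for (int t = 0; t < 4; t++) {
            double mw = masks[m], tw = tiles[t];
            double reduction = (mw * mw) * (tw * tw)
                             / ((tw + mw - 1) * (tw + mw - 1));
            printf("Mask_Width=%d Tile_Width=%d reduction=%.1f\n",
                   masks[m], tiles[t], reduction);
        }
    return 0;
}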

