1
Tiled 2D Convolution © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
2
Usage of Convolution
Convolutions are used in many engineering and mathematical applications. Many types of blur filters and edge-detection filters are implemented as convolutions. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
3
Example Source Image
4
Blur convolution filter applied to the source image
5
Another Example of a Blur Filter
Blur Filter (mask) © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
6
Edge Detection Filter © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
7
Another Use of an Edge Filter
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
8
2D Convolution
[Figure: input array N, 5x5 mask M, and output array P with example values]
The M matrix (the mask, also called the convolution filter matrix or convolution kernel matrix) specifies how to take a neighborhood of pixels or data elements, multiply each of them by the corresponding mask element, and accumulate the products into a single value of the output. For a 5x5 mask we have 24 surrounding elements plus the element itself contributing to each output. Calculating every element of P always requires all 25 elements of M (the elements of M are accessed by all threads that produce output). M does not change during the computation and is usually small. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
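As a point of reference before any GPU tiling, a minimal sequential sketch of this multiply-and-accumulate is shown below. It assumes a 5x5 mask and treats out-of-range neighbors as zero (the same ghost-cell convention used later in the deck); the function and variable names are mine, not the course's.

#define KERNEL_SIZE 5

/* Sequential 2D convolution: for every output element, multiply the 5x5
   neighborhood of N by the mask M and accumulate into a single value of P. */
void convolution2D_cpu(const float *N, const float *M, float *P,
                       int height, int width)
{
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            float value = 0.0f;
            for (int i = 0; i < KERNEL_SIZE; i++) {
                for (int j = 0; j < KERNEL_SIZE; j++) {
                    int r = row + i - KERNEL_SIZE / 2;   /* neighbor row */
                    int c = col + j - KERNEL_SIZE / 2;   /* neighbor column */
                    if (r >= 0 && r < height && c >= 0 && c < width)
                        value += M[i * KERNEL_SIZE + j] * N[r * width + c];
                    /* out-of-range neighbors (ghost cells) contribute 0 */
                }
            }
            P[row * width + col] = value;
        }
    }
}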
9
Output Tiling and Thread Index (P)
Use a thread block to calculate a tile of P. Each output tile is TILE_SIZE in both x and y:
col_o = blockIdx.x * TILE_SIZE + tx;
row_o = blockIdx.y * TILE_SIZE + ty;
To parallelize convolution (very similar to vector addition and vector multiplication), you need to think about the number of thread blocks and the number of threads per block required to calculate all the output points. Divide the total output image into output tiles, and distinguish between output tiles and input tiles. An output tile is tied to the threading structure: output tiles are assigned to thread blocks. In practice tiles could be rectangular, but for our purposes we will assume square tiles. Each thread's indexes are associated with the column and row of the output pixel it computes. The input and output tile sizes are not the same because of the halo cells. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
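The deck does not show the host-side launch for this tiling, so the following is only a hedged sketch of what it might look like; the names launchConvolution, P_width, and P_height are placeholders of mine, and BLOCK_SIZE = TILE_SIZE + KERNEL_SIZE - 1 anticipates the block-size slide later in the deck.

#define KERNEL_SIZE 5
#define TILE_SIZE   12
#define BLOCK_SIZE  (TILE_SIZE + KERNEL_SIZE - 1)

/* One thread block per output tile; each block is sized to the larger input
   tile so that every input-tile element has a thread available to load it. */
void launchConvolution(int P_width, int P_height)
{
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((P_width  + TILE_SIZE - 1) / TILE_SIZE,    /* enough tiles to cover all columns */
                 (P_height + TILE_SIZE - 1) / TILE_SIZE);   /* enough tiles to cover all rows */
    /* convolutionKernel<<<dimGrid, dimBlock>>>(...);          kernel body is built up on later slides */
}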
10
Tiling N
Each N element is used in calculating up to KERNEL_SIZE * KERNEL_SIZE (5x5 = 25 in the example) P elements. This can be seen in the following slides, where we denote the neighbors as E (east), W (west), N (north), and S (south). The tiling of N is much more complicated than the output tiling, and it is where bugs tend to appear in the code. The calculation of each output takes a whole mask's worth of input elements: the output value of a point depends on the values of all its neighbors within the mask. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
11
Window of Usage
The middle red element is used in calculating: 1 middle element + 16 elements (2E, 2W, 2N, 2S, 2NE, 2NW, 2SE, 2SW) + ... (continued on the next slide).
[Figure: the 5x5 window of usage shifted across the example input, highlighting the positions in which the marked element participates]
Think about a particular element of the input (say the 3). It is used when calculating the output value at its own position, and also the output value of its right neighbor. If you move the window of usage to the right (up to twice), the 3 is still being used; move it further to the right and it is no longer used. You can easily see that each element is used in the calculation of 25 output elements. The data reuse is proportional to the size of your kernel (your mask). © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
12
It is also used in calculating 8 more elements
[Figure: three more shifted positions of the 5x5 window of usage]
Hence, the middle red element (and each element of N) is reused in calculating 25 elements: 1 middle element + 16 elements (2E, 2W, 2N, 2S, 2NE, 2NW, 2SE, 2SW) + 8 more elements. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
13
High-Level Input Tiling Strategy
Load a tile of N into shared memory (SM). The input tile size should be somewhat larger than the kernel size so that many filter operations can share the data in the tile (TILE_SIZE vs. KERNEL_SIZE). What we would like to do is load a whole bunch of N elements into shared memory: since each element is potentially used 25 times, this eliminates a lot of redundant global memory traffic. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
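The following slides build this kernel up fragment by fragment. As a hedged sketch of the declarations those fragments rely on: the array names Ns and Mc match the later slides, but the kernel signature and the constant values are my assumptions.

#define KERNEL_SIZE 5
#define TILE_SIZE   12
#define BLOCK_SIZE  (TILE_SIZE + KERNEL_SIZE - 1)

/* The mask is small, read-only, and read by every thread, so it fits well in constant memory. */
__constant__ float Mc[KERNEL_SIZE][KERNEL_SIZE];

__global__ void convolutionKernel(const float *N, float *P, int height, int width)
{
    /* One input tile (output tile plus halo) cached in shared memory per block. */
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];
    /* ... loading, boundary handling, and the compute loop appear on the following slides ... */
}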
14
Input tiles need to be larger than output tiles.
[Figure: an input tile (larger square) and the output tile it produces (smaller square)]
The dimension of the input tile is equal to the output tile dimension + (mask size - 1). If the kernel size is 5, we load two more elements on each side. This gives us all the input elements we need, but it also introduces the complexity of mapping thread indexes to the input versus the output, because the input tile and the output tile now have different sizes. We need to be clear about whether a thread index refers to the input tile or the output tile to avoid confusion. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
15
Dealing with Mismatch
Use a thread block that matches the input tile.
Each thread loads one element of the input tile.
Some threads do not participate in calculating output.
There will be if statements and control divergence. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
16
Shifting from output coordinates to input coordinates
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
17
Shifting from output coordinates to input coordinates
int tx = threadIdx.x;
int ty = threadIdx.y;
int row_o = blockIdx.y * TILE_SIZE + ty;   // output row this thread is responsible for
int col_o = blockIdx.x * TILE_SIZE + tx;   // output column this thread is responsible for
int row_i = row_o - 2;                     // input row to load: shift up by the halo width, (5 - 1) / 2 = 2
int col_i = col_o - 2;                     // input column to load: shift left by the halo width
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
18
Threads that load halo cells outside N should return 0.0
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
19
Taking Care of Boundaries
float output = 0.0f;
if ((row_i >= 0) && (row_i < N.height) &&
    (col_i >= 0) && (col_i < N.width)) {
    Ns[ty][tx] = N.elements[row_i * N.width + col_i];   // load one input element into the shared tile
} else {
    Ns[ty][tx] = 0.0f;                                  // ghost cells outside N contribute zero
}
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
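One step the deck does not show between this load and the multiply-accumulate on the next slide is a barrier: every thread must wait until the whole tile Ns has been filled before any thread starts reading it. A one-line sketch of that missing step:

__syncthreads();   // wait until all threads have finished loading the input tile into Ns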
20
Some threads do not participate in calculating output.
if (ty < TILE_SIZE && tx < TILE_SIZE) {
    for (i = 0; i < 5; i++) {
        for (j = 0; j < 5; j++) {
            output += Mc[i][j] * Ns[i + ty][j + tx];   // multiply-accumulate over the 5x5 mask
        }
    }
}
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
21
Some threads do not write output
if (row_o < P.height && col_o < P.width)
    P.elements[row_o * P.width + col_o] = output;
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
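Putting the fragments from the preceding slides together, a complete kernel might look like the sketch below. The raw-pointer signature, the constant values, and the kernel name are my assumptions (the slides use a Matrix struct with .elements, .width, and .height); note also that the final store is nested inside the participation test, so threads that only load halo cells never write to P.

#define KERNEL_SIZE 5
#define TILE_SIZE   12
#define BLOCK_SIZE  (TILE_SIZE + KERNEL_SIZE - 1)

__constant__ float Mc[KERNEL_SIZE][KERNEL_SIZE];        // convolution mask

__global__ void tiledConvolution2D(const float *N, float *P, int height, int width)
{
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];        // input tile (output tile plus halo)

    int tx = threadIdx.x, ty = threadIdx.y;

    // Output element this thread is responsible for (if it participates in the compute phase).
    int row_o = blockIdx.y * TILE_SIZE + ty;
    int col_o = blockIdx.x * TILE_SIZE + tx;

    // Input element this thread loads: shift up and left by the halo width.
    int row_i = row_o - KERNEL_SIZE / 2;
    int col_i = col_o - KERNEL_SIZE / 2;

    // Every thread loads one input element; ghost cells outside N become 0.0f.
    if (row_i >= 0 && row_i < height && col_i >= 0 && col_i < width)
        Ns[ty][tx] = N[row_i * width + col_i];
    else
        Ns[ty][tx] = 0.0f;
    __syncthreads();                                    // the tile must be complete before anyone reads it

    // Only the TILE_SIZE x TILE_SIZE "inner" threads compute and write an output value.
    if (ty < TILE_SIZE && tx < TILE_SIZE) {
        float output = 0.0f;
        for (int i = 0; i < KERNEL_SIZE; i++)
            for (int j = 0; j < KERNEL_SIZE; j++)
                output += Mc[i][j] * Ns[i + ty][j + tx];

        if (row_o < height && col_o < width)            // guard against partial tiles at the image edge
            P[row_o * width + col_o] = output;
    }
}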
22
Setting Block Size
#define BLOCK_SIZE (TILE_SIZE + 4)
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
In general, block size should be tile size + (kernel size - 1). © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
23
Tiling Benefit Analysis
Start with KERNEL_SIZE = 5. Each point in an input tile is used multiple times:
Each boundary point (blue) is used at least 9 times.
Each second-layer point (yellow) is used at least 16 times.
Each inner point (red) is used 25 times. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
24
Reuse Analysis For TILE_SIZE = 12
44 boundary points (blue boundary elements)
36 second-layer points (yellow boundary elements)
64 inside points (inside red elements)
Total uses >= 44*9 + 36*16 + 64*25 = 396 + 576 + 1600 = 2572
Average reuse >= 2572/144 = 17.9
As TILE_SIZE increases, the average reuse approaches 25. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
25
In General
The number of boundary layers is proportional to KERNEL_SIZE.
The maximal reuse of each data point is (KERNEL_SIZE)^2. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
26
In General
BLOCK_SIZE is limited by the maximal number of threads in a thread block.
The input tile size could be N*TILE_SIZE + (KERNEL_SIZE-1) by having each thread calculate N output points (thread coarsening); a sizing sketch follows below.
N is limited by the shared memory size.
KERNEL_SIZE is decided by application needs. © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
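A small hedged sketch of these sizing relationships; the name COARSENING stands for the per-thread output count ("N" in the bullet above) and is mine, not the deck's, as is the example value.

#define KERNEL_SIZE 5
#define TILE_SIZE   12
#define COARSENING  2                                                  /* outputs per thread ("N" above), an assumed value */
#define INPUT_TILE_SIZE (COARSENING * TILE_SIZE + (KERNEL_SIZE - 1))   /* 2*12 + 4 = 28 */

/* Shared memory needed per block to hold one input tile of floats:
   28 * 28 * sizeof(float) = 3136 bytes, which must fit within the per-block shared memory limit. */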