CS/EE 217: GPU Architecture and Parallel Programming - Tiled Convolution. © David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, 2007-2012.

Presentation transcript:

CS/EE 217: GPU Architecture and Parallel Programming - Tiled Convolution. © David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois.

Objective
To learn about tiled convolution algorithms:
- Some intricate aspects of tiling algorithms
- Output tiles versus input tiles

Tiled 1D Convolution: Basic Idea
[Figure: the input array N is divided into tiles (Tile 0 through Tile 3); each block loads its tile plus halo elements from the neighboring tiles, with ghost elements (zeros) at the array boundaries, and computes the corresponding tile of the output P.]

Loading the left halo

int n = Mask_Width/2;
int halo_index_left = (blockIdx.x - 1)*blockDim.x + threadIdx.x;
if (threadIdx.x >= blockDim.x - n) {
  N_ds[threadIdx.x - (blockDim.x - n)] = (halo_index_left < 0) ? 0 : N[halo_index_left];
}

[Figure: example with i = 6 and halo_index_left = 2, showing the left halo elements of N copied into the front of N_ds.]

Loading the internal elements

N_ds[n + threadIdx.x] = N[blockIdx.x*blockDim.x + threadIdx.x];

[Figure: example with i = 6, showing each thread copying its own element of N into N_ds at offset n.]

Loading the right halo

int halo_index_right = (blockIdx.x + 1)*blockDim.x + threadIdx.x;
if (threadIdx.x < n) {
  N_ds[n + blockDim.x + threadIdx.x] = (halo_index_right >= Width) ? 0 : N[halo_index_right];
}

[Figure: example with i = 6 and halo_index_right = 10, showing the right halo elements of N copied into the end of N_ds.]

__global__ void convolution_1D_basic_kernel(float *N, float *P, int Mask_Width, int Width) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  __shared__ float N_ds[TILE_SIZE + MAX_MASK_WIDTH - 1];
  int n = Mask_Width/2;

  // Load the left halo elements (out-of-range indices become ghost cells, i.e. 0)
  int halo_index_left = (blockIdx.x - 1)*blockDim.x + threadIdx.x;
  if (threadIdx.x >= blockDim.x - n) {
    N_ds[threadIdx.x - (blockDim.x - n)] = (halo_index_left < 0) ? 0 : N[halo_index_left];
  }

  // Load the internal elements of the tile
  N_ds[n + threadIdx.x] = N[blockIdx.x*blockDim.x + threadIdx.x];

  // Load the right halo elements (out-of-range indices become ghost cells, i.e. 0)
  int halo_index_right = (blockIdx.x + 1)*blockDim.x + threadIdx.x;
  if (threadIdx.x < n) {
    N_ds[n + blockDim.x + threadIdx.x] = (halo_index_right >= Width) ? 0 : N[halo_index_right];
  }

  __syncthreads();

  float Pvalue = 0;
  for (int j = 0; j < Mask_Width; j++) {
    Pvalue += N_ds[threadIdx.x + j]*M[j];
  }
  P[i] = Pvalue;
}
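
The kernel above references TILE_SIZE, MAX_MASK_WIDTH, and the mask array M without declaring them, so they are assumed to live at file scope, with M in constant memory. The following host-side sketch is my addition (not from the slides) showing one plausible way to declare those symbols and launch the kernel; the sizes are examples, Width is assumed to be a multiple of TILE_SIZE, and error checking is omitted.

// Hypothetical file-scope declarations assumed by the kernel (example sizes).
#define TILE_SIZE 256
#define MAX_MASK_WIDTH 10
__constant__ float M[MAX_MASK_WIDTH];   // convolution mask, read through the constant cache

// Host-side sketch: copy data and the mask to the device, then launch one block
// per TILE_SIZE-wide output tile with one thread per output element.
void convolve_1d_host(const float *h_N, const float *h_M, float *h_P,
                      int Width, int Mask_Width) {
  float *d_N, *d_P;
  cudaMalloc(&d_N, Width * sizeof(float));
  cudaMalloc(&d_P, Width * sizeof(float));
  cudaMemcpy(d_N, h_N, Width * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpyToSymbol(M, h_M, Mask_Width * sizeof(float));   // initialize constant memory

  dim3 dimBlock(TILE_SIZE, 1, 1);
  dim3 dimGrid((Width - 1) / TILE_SIZE + 1, 1, 1);
  convolution_1D_basic_kernel<<<dimGrid, dimBlock>>>(d_N, d_P, Mask_Width, Width);

  cudaMemcpy(h_P, d_P, Width * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_N);
  cudaFree(d_P);
}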

Shared Memory Data Reuse (elements of N_ds, Mask_Width is 5)
- Element 2 is used by thread 4 (1X)
- Element 3 is used by threads 4, 5 (2X)
- Element 4 is used by threads 4, 5, 6 (3X)
- Element 5 is used by threads 4, 5, 6, 7 (4X)
- Element 6 is used by threads 4, 5, 6, 7 (4X)
- Element 7 is used by threads 5, 6, 7 (3X)
- Element 8 is used by threads 6, 7 (2X)
- Element 9 is used by thread 7 (1X)
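
As a rough worked check (my addition, not on the slide): without tiling, each output element would read Mask_Width elements of N directly from global memory, while the tiled kernel loads each stretch of N_ds from global memory only once. For an internal tile, the block replaces TILE_SIZE * Mask_Width global reads with TILE_SIZE + Mask_Width - 1 loads:

// Small sketch computing the reduction in global-memory accesses for an
// internal tile (my numbers; the 4-thread, Mask_Width = 5 case matches the
// reuse counts listed above).
#include <stdio.h>

int main(void) {
  int tile_size = 4, mask_width = 5;
  int untiled = tile_size * mask_width;        // 20 global reads of N without tiling
  int tiled   = tile_size + mask_width - 1;    // 8 global reads to fill N_ds
  printf("global reads: %d -> %d (%.2fx reduction)\n",
         untiled, tiled, (double)untiled / tiled);
  return 0;
}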

Ghost Cells
[Figure: a 7-element array N (N[0] through N[6]) padded with zero-valued ghost cells at both ends; each output P[0] through P[6] is computed from a window of N, and windows near the boundaries read ghost cells instead of real elements.]

__global__ void convolution_1D_basic_kernel(float *N, float *P, int Mask_Width, int Width) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  __shared__ float N_ds[TILE_SIZE];

  N_ds[threadIdx.x] = N[i];
  __syncthreads();

  int This_tile_start_point = blockIdx.x * blockDim.x;
  int Next_tile_start_point = (blockIdx.x + 1) * blockDim.x;
  int N_start_point = i - (Mask_Width/2);

  float Pvalue = 0;
  for (int j = 0; j < Mask_Width; j++) {
    int N_index = N_start_point + j;
    if (N_index >= 0 && N_index < Width) {
      if ((N_index >= This_tile_start_point) && (N_index < Next_tile_start_point)) {
        // Internal element: read from shared memory
        Pvalue += N_ds[threadIdx.x + j - (Mask_Width/2)]*M[j];
      } else {
        // Halo element: read from global memory (likely served by the caches)
        Pvalue += N[N_index] * M[j];
      }
    }
  }
  P[i] = Pvalue;
}

2D Convolution with Tiling (P)
Use a thread block to calculate a tile of P:
- Thread block size is determined by TILE_SIZE.

Tiling N
Each element in the tile is used in calculating up to MASK_SIZE * MASK_SIZE elements of P (this holds for all elements in the tile).

High-Level Tiling Strategy
Load a tile of N into shared memory (SM):
- All threads participate in loading.
- A subset of threads then use each N element in SM.
[Figure: the tile of N loaded into shared memory, annotated with TILE_SIZE and KERNEL_SIZE.]

Output Tiling and Thread Index (P)
Use a thread block to calculate a tile of P:
- Each output tile is TILE_SIZE wide in both x and y.

col_o = blockIdx.x * TILE_SIZE + tx;
row_o = blockIdx.y * TILE_SIZE + ty;

Input tiles need to be larger than output tiles.
[Figure: an output tile and the larger input tile that surrounds it.]

Dealing with Mismatch
Use a thread block that matches the input tile:
- Each thread loads one element of the input tile.
- Some threads do not participate in calculating output.
- There will be if statements and control divergence.

Setting up blocks

#define O_TILE_WIDTH 12
#define BLOCK_WIDTH (O_TILE_WIDTH + 4)

dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
dim3 dimGrid((imageWidth - 1)/O_TILE_WIDTH + 1, (imageHeight - 1)/O_TILE_WIDTH + 1, 1);

In general, block width = tile width + mask width - 1; here the "+ 4" corresponds to a 5-wide mask.

Using constant memory for the mask
Since the mask is used by all threads and is not modified:
- All threads in a warp access the same mask location at any given time.
- Take advantage of the cacheable constant memory!
- Magnify effective memory bandwidth without consuming shared memory.

Syntax:
__global__ void convolution_2D_kernel(float *P, float *N, int height, int width, int channels, const float * __restrict__ M) {
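
As a concrete sketch (my addition, not from the deck) of placing the mask in constant memory: declare a file-scope __constant__ array and initialize it from the host with cudaMemcpyToSymbol. The name Mc and the 5x5 size are assumptions chosen to match the Mc[i][j] reference on a later slide; the const ... __restrict__ parameter form shown above is an alternative that lets the compiler route reads through the read-only data cache instead.

#define MASK_SIZE 5

// File-scope constant-memory mask; visible to all kernels in this translation unit.
__constant__ float Mc[MASK_SIZE][MASK_SIZE];

// Host-side initialization: copy the host mask into the __constant__ symbol.
void setup_mask(const float h_Mask[MASK_SIZE][MASK_SIZE]) {
  cudaMemcpyToSymbol(Mc, h_Mask, MASK_SIZE * MASK_SIZE * sizeof(float));
}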

Shifting from output coordinates to input coordinates
[Figure: the input tile and output tile for thread (0,0); the input tile is shifted up and to the left of the output tile by the mask radius.]

Shifting from output coordinates to input coordinates

int tx = threadIdx.x;
int ty = threadIdx.y;
int row_o = blockIdx.y * TILE_SIZE + ty;
int col_o = blockIdx.x * TILE_SIZE + tx;
int row_i = row_o - 2;   // MASK_SIZE/2
int col_i = col_o - 2;   // MASK_SIZE/2

Threads whose halo elements fall outside N must not read from N; they store 0 (a ghost cell) into shared memory instead, as shown on the next slide.

Taking Care of Boundaries

float output = 0.0f;
if ((row_i >= 0) && (row_i < N.height) && (col_i >= 0) && (col_i < N.width)) {
  Ns[ty][tx] = N.elements[row_i*N.width + col_i];
} else {
  Ns[ty][tx] = 0.0f;
}

Some threads do not participate in calculating output.

if (ty < TILE_SIZE && tx < TILE_SIZE) {
  for (i = 0; i < MASK_SIZE; i++) {
    for (j = 0; j < MASK_SIZE; j++) {
      output += Mc[i][j] * Ns[i+ty][j+tx];
    }
  }
}

Some threads do not write output.

if (row_o < P.height && col_o < P.width)
  P.elements[row_o * P.width + col_o] = output;
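
Putting the fragments from the last few slides together, a complete 2D tiled kernel might look like the sketch below. This is my assembly, not the instructors' exact code: for brevity it takes flat float arrays plus height and width instead of the N.elements / P.elements struct fields used on the slides, and it assumes the 5x5 mask Mc in constant memory (declared as in the earlier sketch) with one BLOCK_WIDTH x BLOCK_WIDTH thread block per TILE_SIZE x TILE_SIZE output tile.

// Assembled sketch of the 2D tiled convolution kernel (assumption-laden example).
// Launch with dimBlock(BLOCK_WIDTH, BLOCK_WIDTH) and a grid that covers the
// output in TILE_SIZE steps in x and y.
#define TILE_SIZE 12
#define MASK_SIZE 5
#define BLOCK_WIDTH (TILE_SIZE + MASK_SIZE - 1)

__constant__ float Mc[MASK_SIZE][MASK_SIZE];   // 5x5 mask, as in the earlier sketch

__global__ void convolution_2D_tiled_kernel(const float *N, float *P,
                                            int height, int width) {
  __shared__ float Ns[BLOCK_WIDTH][BLOCK_WIDTH];

  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int row_o = blockIdx.y * TILE_SIZE + ty;   // output coordinates
  int col_o = blockIdx.x * TILE_SIZE + tx;
  int row_i = row_o - MASK_SIZE / 2;         // shift to input coordinates
  int col_i = col_o - MASK_SIZE / 2;

  // Every thread loads one input element; out-of-bounds loads become ghost cells (0).
  if (row_i >= 0 && row_i < height && col_i >= 0 && col_i < width)
    Ns[ty][tx] = N[row_i * width + col_i];
  else
    Ns[ty][tx] = 0.0f;
  __syncthreads();

  // Only the threads that map to the output tile compute and write a result.
  if (ty < TILE_SIZE && tx < TILE_SIZE) {
    float output = 0.0f;
    for (int i = 0; i < MASK_SIZE; i++)
      for (int j = 0; j < MASK_SIZE; j++)
        output += Mc[i][j] * Ns[i + ty][j + tx];
    if (row_o < height && col_o < width)
      P[row_o * width + col_o] = output;
  }
}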

In General
- BLOCK_SIZE is limited by the maximum number of threads in a thread block.
- Input tile sizes could be k*TILE_SIZE + (MASK_SIZE - 1):
  - That is the 1D case; what is the corresponding size for 2D convolution?
  - Achieved by having each thread calculate k output points (thread coarsening).
  - k is limited by the shared memory size.
- MASK_SIZE is decided by application needs.
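
A quick worked check of these limits (my numbers, using the sizes from the "Setting up blocks" slide): with O_TILE_WIDTH = 12 and a 5x5 mask, each block has 16 x 16 = 256 threads but only 12 x 12 = 144 of them compute an output, so a little over half the threads do useful arithmetic; larger tiles, or coarsening with k > 1, raise that fraction at the cost of more shared memory per block.

// Sketch computing how many threads in a block actually produce output when the
// thread block matches the input tile (example sizes only).
#include <stdio.h>

int main(void) {
  int o_tile_width = 12, mask_size = 5;
  int block_width = o_tile_width + mask_size - 1;   // 16
  int threads = block_width * block_width;          // 256 threads per block
  int producers = o_tile_width * o_tile_width;      // 144 threads write an output
  printf("%d of %d threads compute output (%.1f%%)\n",
         producers, threads, 100.0 * producers / threads);
  return 0;
}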

ANY MORE QUESTIONS? READ CHAPTER 8.
© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois.