Convolution Layer Optimization

ECE408/CS483/CSE408 Fall 2017
Carl Pearson, pearson@Illinois.edu

Sequential Code: Convolution Layer Forward Path

void convLayer_forward(int B, int M, int C, int H_in, int W_in, int K,
                       float* X, float* W, float* Y)
{
  int H_out = H_in - K + 1;
  int W_out = W_in - K + 1;
  for (int b = 0; b < B; ++b)                // for each image in the batch
    for (int m = 0; m < M; ++m)              // for each output feature map
      for (int h = 0; h < H_out; ++h)        // for each output element
        for (int w = 0; w < W_out; ++w) {
          Y[b, m, h, w] = 0;
          for (int c = 0; c < C; ++c)        // sum over all input feature maps
            for (int p = 0; p < K; ++p)      // KxK filter
              for (int q = 0; q < K; ++q)
                Y[b, m, h, w] += X[b, c, h + p, w + q] * W[m, c, p, q];
        }
}

(Y[b, m, h, w] is shorthand for logical 4D indexing into a flat array.)
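
For reference, the following is a minimal, compilable sketch of the same loop nest with the 4D indices linearized into flat offsets. The index helpers y4/x4/w4 and the parameter names H_in/W_in are illustrative assumptions, not part of the original slide.

// Sketch of the forward pass with explicit flat indexing.
// y4/x4/w4 are hypothetical index helpers, not from the slides.
static inline int y4(int b, int m, int h, int w, int M, int H_out, int W_out) {
  return ((b * M + m) * H_out + h) * W_out + w;
}
static inline int x4(int b, int c, int h, int w, int C, int H_in, int W_in) {
  return ((b * C + c) * H_in + h) * W_in + w;
}
static inline int w4(int m, int c, int p, int q, int C, int K) {
  return ((m * C + c) * K + p) * K + q;
}

void convLayer_forward_flat(int B, int M, int C, int H_in, int W_in, int K,
                            const float* X, const float* W, float* Y) {
  int H_out = H_in - K + 1;
  int W_out = W_in - K + 1;
  for (int b = 0; b < B; ++b)
    for (int m = 0; m < M; ++m)
      for (int h = 0; h < H_out; ++h)
        for (int w = 0; w < W_out; ++w) {
          float acc = 0.0f;
          for (int c = 0; c < C; ++c)
            for (int p = 0; p < K; ++p)
              for (int q = 0; q < K; ++q)
                acc += X[x4(b, c, h + p, w + q, C, H_in, W_in)]
                     * W[w4(m, c, p, q, C, K)];
          Y[y4(b, m, h, w, M, H_out, W_out)] = acc;
        }
}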

A Small Convolution Layer Example
[Figure: image b in the mini-batch has three input channels X[b,0,_,_], X[b,1,_,_], X[b,2,_,_]; each is convolved with the corresponding filter W[0,0,_,_], W[0,1,_,_], W[0,2,_,_], and the per-channel results are summed into a single output channel Y[b,0,_,_].]

Step c = 0: the KxK filter W[0,0,_,_] is applied to X[b,0,_,_]; the first output element of Y[b,0,_,_] accumulates 3 + 13 + 2 = 18.

Step c = 1: the contribution of X[b,1,_,_] is added: 18 + 7 + 3 + 3 = 31.

Step c = 2: the contribution of X[b,2,_,_] is added: 31 + 3 + 6 + 11 = 51, the final value of that output element.

Parallelism in a Convolution Layer
- All output feature maps can be calculated in parallel. This is usually a small number, not sufficient by itself to fully utilize a GPU.
- All pixels of each output feature map can be calculated in parallel: all rows in parallel, and all pixels within a row in parallel. This is a large number, but it diminishes in the deeper layers.
- All input feature maps can be processed in parallel, but combining their contributions requires atomic operations or a tree reduction (see the sketch below).
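
To illustrate the last point, here is a minimal sketch, not from the slides, in which each block handles one (output map, input channel) pair for a single image and accumulates that channel's contribution with atomicAdd. The kernel name, launch configuration, and flat-index layout are assumptions.

// Hypothetical sketch: one block per (output map m, input channel c) pair,
// for a single image (X and Y point at that image's data).
// Assumes gridDim = (M, C, ceil(H_out / blockDim.y)), blockDim.x >= W_out,
// and that Y has been zero-initialized before the launch.
__global__ void ConvLayer_ChannelParallel_Sketch(int C, int H_in, int W_in, int K,
                                                 const float* X, const float* W, float* Y) {
  int H_out = H_in - K + 1;
  int W_out = W_in - K + 1;
  int m = blockIdx.x;                            // output feature map
  int c = blockIdx.y;                            // input channel handled by this block
  int h = blockIdx.z * blockDim.y + threadIdx.y; // output row
  int w = threadIdx.x;                           // output column
  if (h < H_out && w < W_out) {
    float partial = 0.0f;
    for (int p = 0; p < K; ++p)
      for (int q = 0; q < K; ++q)
        partial += X[(c * H_in + h + p) * W_in + (w + q)]
                 * W[((m * C + c) * K + p) * K + q];
    // Several blocks (one per channel) update the same output element,
    // so the accumulation must be atomic.
    atomicAdd(&Y[(m * H_out + h) * W_out + w], partial);
  }
}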

Design of a Basic Kernel
- Each block computes a tile of output pixels, TILE_WIDTH pixels in each dimension.
- The first (x) dimension of the grid maps to the M output feature maps.
- The second (y) dimension of the grid maps to the tiles within an output feature map.

Host Code for a Basic Kernel: CUDA Grid
W_out and H_out are the output feature map width and height.

#define TILE_WIDTH 16                      // we will use 4 for the small examples
int W_grid = W_out / TILE_WIDTH;           // number of horizontal tiles per output map
int H_grid = H_out / TILE_WIDTH;           // number of vertical tiles per output map
int Z = H_grid * W_grid;                   // total number of tiles per output map
dim3 blockDim(TILE_WIDTH, TILE_WIDTH, 1);
dim3 gridDim(M, Z, 1);
ConvLayerForward_Kernel<<<gridDim, blockDim>>>(…);
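
The integer division above assumes W_out and H_out are exact multiples of TILE_WIDTH. A common variant, shown here as an assumption rather than something on the slide, rounds the tile counts up so partial tiles are still covered; the kernel then needs a bounds check on h and w.

// Hedged variant: round the tile counts up so output maps whose sizes are not
// multiples of TILE_WIDTH are fully covered; threads past the edge must be
// masked off inside the kernel.
int W_grid = (W_out + TILE_WIDTH - 1) / TILE_WIDTH;
int H_grid = (H_out + TILE_WIDTH - 1) / TILE_WIDTH;
int Z = H_grid * W_grid;
dim3 blockDim(TILE_WIDTH, TILE_WIDTH, 1);
dim3 gridDim(M, Z, 1);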

A Small Example: CUDA Grid and Thread Blocks
- Assume 4 output feature maps (M = 4), so there are 4 blocks in the x dimension.
- Each output feature map is an 8x8 image (W_out = H_out = 8).
- If we use tiles of 4 pixels on each side (TILE_WIDTH = 4), each map is covered by a 2x2 arrangement of tiles, giving 4 blocks in the y dimension.
- The first two blocks in each column (y = 0, 1) calculate the top row of tiles in the corresponding output feature map; the last two (y = 2, 3) calculate the bottom row.
[Figure: output feature maps and their tiles.]

A Basic Conv. Layer Forward Kernel (Code is Incomplete!)

__global__ void ConvLayerForward_Basic_Kernel(int C, int W_grid, int K,
                                              float* X, float* W, float* Y)
{
  int m = blockIdx.x;                                       // output feature map
  int h = (blockIdx.y / W_grid) * TILE_WIDTH + threadIdx.y; // output row
  int w = (blockIdx.y % W_grid) * TILE_WIDTH + threadIdx.x; // output column
  float acc = 0.;
  for (int c = 0; c < C; c++)        // sum over all input channels
    for (int p = 0; p < K; p++)      // loop over KxK filter
      for (int q = 0; q < K; q++)
        acc += X[c, h + p, w + q] * W[m, c, p, q];
  Y[m, h, w] = acc;
}
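
Since the slide's version is deliberately incomplete (multi-index shorthand, no batch dimension, no bounds check), here is a hedged sketch of a fuller version. The flat-index arithmetic, the use of blockIdx.z for the batch index, and the bounds check are assumptions, not part of the original slide; TILE_WIDTH is the same macro used in the host code above.

// Hypothetical completion of the basic kernel, one image per blockIdx.z.
// Assumes gridDim = (M, H_grid * W_grid, B), blockDim = (TILE_WIDTH, TILE_WIDTH, 1),
// so gridDim.x == M.
__global__ void ConvLayerForward_Flat_Kernel(int C, int H_in, int W_in, int K, int W_grid,
                                             const float* X, const float* W, float* Y) {
  int H_out = H_in - K + 1;
  int W_out = W_in - K + 1;
  int b = blockIdx.z;                                         // image in the batch
  int m = blockIdx.x;                                         // output feature map
  int h = (blockIdx.y / W_grid) * TILE_WIDTH + threadIdx.y;   // output row
  int w = (blockIdx.y % W_grid) * TILE_WIDTH + threadIdx.x;   // output column
  if (h < H_out && w < W_out) {                               // mask threads past the edge
    float acc = 0.0f;
    for (int c = 0; c < C; ++c)
      for (int p = 0; p < K; ++p)
        for (int q = 0; q < K; ++q)
          acc += X[((b * C + c) * H_in + h + p) * W_in + (w + q)]
               * W[((m * C + c) * K + p) * K + q];
    Y[((b * gridDim.x + m) * H_out + h) * W_out + w] = acc;   // gridDim.x == M
  }
}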

Some Observations
- The amount of parallelism is quite high as long as the total number of pixels across all output feature maps is large, which matches the CNN architecture well.
- Each input tile is loaded multiple times, once by every block whose output tile requires it, so the kernel is not very efficient in its use of global memory bandwidth.

Sequential Code for the Forward Path of a Subsampling (Pooling) Layer

void poolingLayer_forward(int M, int H, int W, int K, float* Y, float* S, float* b)
{
  for (int m = 0; m < M; m++)            // for each output feature map
    for (int h = 0; h < H/K; h++)        // for each output element
      for (int w = 0; w < W/K; w++) {
        S[m, h, w] = 0.;
        for (int p = 0; p < K; p++)      // average over KxK input samples
          for (int q = 0; q < K; q++)
            S[m, h, w] += Y[m, K*h + p, K*w + q] / (K*K);
        // add bias and apply non-linear activation
        S[m, h, w] = sigmoid(S[m, h, w] + b[m]);
      }
}

Implementing a Convolution Layer with Matrix Multiplication

Function to generate the "unrolled" X

void unroll(int B, int C, int H, int W, int K, float *X, float *X_unroll)
{
  int H_out = H - K + 1;
  int W_out = W - K + 1;
  for (int b = 0; b < B; ++b)
    for (int c = 0; c < C; ++c) {
      int w_base = c * (K*K);                    // starting row of channel c's section
      for (int p = 0; p < K; ++p)
        for (int q = 0; q < K; ++q)
          for (int h = 0; h < H_out; ++h)
            for (int w = 0; w < W_out; ++w) {
              int w_unroll = w_base + p * K + q; // row index: (c, p, q)
              int h_unroll = h * W_out + w;      // column index: output pixel (h, w)
              X_unroll[b, w_unroll, h_unroll] = X[b, c, h + p, w + q];
            }
    }
}

Each X_unroll[b] is a (C*K*K) x (H_out*W_out) matrix, matching the matrix multiplication formulation below.

Simple Matrix Multiplication
[Figure: the convolution filters form one matrix and the unrolled input feature maps form another; each element of their product matrix is an output feature map pixel. The highlighted inner product generates element 0 of output feature map 0.]
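
To make the formulation concrete, here is a hedged host-side sketch, not from the slides, that computes the forward pass for one image as a plain matrix multiplication of the M x (C*K*K) filter matrix with the (C*K*K) x (H_out*W_out) unrolled input. The function name and the row-major layouts are illustrative assumptions.

// Hypothetical sketch: convolution as matrix multiplication for a single image.
// W_mat    : M x (C*K*K) filter matrix (each filter laid out as one row)
// X_unroll : (C*K*K) x (H_out*W_out) unrolled input produced above
// Y        : M x (H_out*W_out) output (one row per output feature map)
void convLayer_forward_gemm(int M, int C, int K, int H_out, int W_out,
                            const float* W_mat, const float* X_unroll, float* Y) {
  int inner = C * K * K;
  int cols  = H_out * W_out;
  for (int m = 0; m < M; ++m)
    for (int j = 0; j < cols; ++j) {
      float acc = 0.0f;
      for (int i = 0; i < inner; ++i)
        acc += W_mat[m * inner + i] * X_unroll[i * cols + j];
      Y[m * cols + j] = acc;   // pixel j of output feature map m
    }
}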

Tiled Matrix Multiplication: 2x2 Example
[Figure: with a 2x2 tile, each block calculates one output tile, 2 elements from each output map, and each input element is reused 2 times from shared memory.]

Tiled Matrix Multiplication: 2x4 Example
[Figure: with a 2x4 tile, each block calculates one output tile, 4 elements from each output map, and each input element is still reused 2 times from shared memory.]

Efficiency Analysis: Total Input Replication
- Replicated input features are shared among output maps.
- There are H_out * W_out output feature map elements, and each requires K*K elements from the input feature maps.
- So the total number of input elements after replication is H_out * W_out * K * K for each input feature map.
- The total number of elements in each original input feature map is only (H_out + K - 1) * (W_out + K - 1).

Analysis of a Small Example
- H_out = 2, W_out = 2, K = 2, and there are 3 input maps (channels).
- The total number of input elements in the replicated ("unrolled") input matrix is 3*2*2*2*2 = 48.
- Each original input map has (2+2-1)*(2+2-1) = 9 elements, so the replication factor is (3*2*2*2*2)/(3*3*3) = 48/27 ≈ 1.78.
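
The same ratio can be computed for any layer shape. A small helper like the one below, included purely as an illustration rather than part of the lecture, reproduces the 1.78 figure for this example.

#include <stdio.h>

// Replication factor of the unrolled input matrix:
// replicated elements per input map / original elements per input map.
// (The number of channels cancels out of the ratio.)
static double replication_factor(int H_out, int W_out, int K) {
  double replicated = (double)H_out * W_out * K * K;
  double original   = (double)(H_out + K - 1) * (W_out + K - 1);
  return replicated / original;
}

int main(void) {
  // The small example from the slide: H_out = W_out = 2, K = 2 gives 16/9 = 1.78.
  printf("%.2f\n", replication_factor(2, 2, 2));
  return 0;
}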

Memory Access Efficiency of the Original (Tiled 2D) Convolution Algorithm
- Assume we use tiled 2D convolution.
- For input elements: each output tile has TILE_WIDTH^2 elements, and each input tile has (TILE_WIDTH + K - 1)^2 elements.
- Without tiling, the total number of input feature map element accesses per output tile would be TILE_WIDTH^2 * K^2.
- So the reduction factor of the tiled algorithm is K^2 * TILE_WIDTH^2 / (TILE_WIDTH + K - 1)^2.
- The convolution filter weight elements are also reused within each output tile.

Efficiency of Tiled Matrix Multiplication
- Assume we use TILE_WIDTH x TILE_WIDTH input and output tiles.
- Each replicated input feature map element is reused TILE_WIDTH times, and each convolution filter weight element is also reused TILE_WIDTH times.
- Matrix multiplication is better if TILE_WIDTH is larger than K^2 * TILE_WIDTH^2 / (TILE_WIDTH + K - 1)^2, the reuse factor of tiled 2D convolution (see the comparison below).
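
As a quick sanity check of that comparison, a short program like the following, included only as an illustration, prints both reuse factors for a few tile widths and filter sizes.

#include <stdio.h>

// Compare the input-reuse factor of tiled 2D convolution,
// K^2 * T^2 / (T + K - 1)^2, against the reuse factor of tiled
// matrix multiplication, which is simply TILE_WIDTH (T).
int main(void) {
  int tiles[]   = {8, 16, 32};
  int filters[] = {3, 5, 7};
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j) {
      double T = tiles[i], K = filters[j];
      double conv_reuse = (K * K * T * T) / ((T + K - 1) * (T + K - 1));
      printf("T=%2.0f K=%1.0f  tiled-conv reuse=%6.2f  tiled-GEMM reuse=%4.0f\n",
             T, K, conv_reuse, T);
    }
  return 0;
}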

Problem with Later Stages
- The size (H_out, W_out) of each output feature map decreases in the later stages of the CNN.
- TILE_WIDTH may therefore be limited to very small values relative to K.
- The benefit of 2D tiling diminishes as we go down the pipeline; this is an intrinsic problem for 2D tiled convolution.

Tiled Matrix Multiplication: More Stable Performance Across Sizes
- The filter matrix is an M x (C*K*K) matrix; the expanded input feature map matrix is a (C*K*K) x (H_out*W_out) matrix.
- Except for the height of the filter-bank matrix (M), all dimensions depend on products of the convolution parameters, not on the parameters themselves.
- While individual parameters can be small, their products tend to be large, so the amount of work per kernel launch remains large even in the later stages of the CNN.

Some Other Optimizations
- Use CUDA streams to overlap the transfer of the next set of input feature maps with the processing of the previous set (a sketch follows below).
- Create unrolled matrix elements on the fly, only when they are loaded into shared memory.
- Use more advanced algorithms, such as FFT-based convolution.
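
As an illustration of the first point only, here is a hedged two-stream sketch; it is not from the slides, and the kernel name (convKernel), buffer layout, and chunking are assumptions. It also assumes the host input buffer is pinned (allocated with cudaMallocHost) so the copies can truly overlap with execution.

// Hypothetical double-buffered pipeline (illustrative names throughout):
// while the GPU processes chunk i in stream[i % 2], the host copies chunk
// i + 1 in the other stream. convKernel stands in for a forward kernel above.
void run_pipelined(float* h_X, float* d_X[2], float* d_Y[2],
                   size_t chunkElems, int numChunks,
                   dim3 gridDim, dim3 blockDim) {
  cudaStream_t stream[2];
  for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

  for (int i = 0; i < numChunks; ++i) {
    int s = i % 2;
    // Copy the next set of input feature maps asynchronously ...
    cudaMemcpyAsync(d_X[s], h_X + (size_t)i * chunkElems,
                    chunkElems * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);
    // ... and launch the convolution on that chunk in the same stream, so it
    // starts only after its copy finishes, overlapping with the other stream.
    convKernel<<<gridDim, blockDim, 0, stream[s]>>>(d_X[s], d_Y[s]);
  }
  for (int s = 0; s < 2; ++s) {
    cudaStreamSynchronize(stream[s]);
    cudaStreamDestroy(stream[s]);
  }
}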