Parallel Multi Channel Convolution using General Matrix Multiplication
Towards faster and smaller CNNs. Hello everyone. I'm Aravind Vasudevan, a postdoc at Trinity College Dublin, here presenting my work on making convolutional neural networks go faster by writing better inference primitives. Let's dive straight in. Aravind Vasudevan, Andrew Anderson and David Gregg. The 28th Annual IEEE International Conference on Application-specific Systems, Architectures and Processors, July 10th 2017.
Agenda: CNN primer; current implementations of the convolutional layer; im2col & im2row; proposed implementation; results; road ahead.
The outline of the talk is fairly straightforward. First, in keeping with the theme of the morning, I will present a very brief overview of Convolutional Neural Networks: what they are, how they are constructed, and so on. I will then construct multi channel convolution from the basics, so apologies to those of you who already know this well. Then I will discuss one of the most widely used implementations of the convolutional layer, followed by our proposed implementation. I will wrap up the talk with some brief results and the possible future work this opens up.
CNN Primer
[Slide diagram: the network as a black box producing class probabilities (Dog, Space, Taco) that are compared against the ground truth; the error drives backpropagation through the convolutional layers, each mapping a 3D tensor input to a 3D tensor output.]
A convolutional neural network is a deep neural network that is primarily used for image classification. Treat it as a black box: feed in an input and get a probability distribution over classes as output.
Single Channel Single Kernel Convolution (SCSK)
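For reference, a minimal sketch of SCSK convolution as direct loops (my own illustrative NumPy code, assuming 'valid' padding and unit stride, not code from the talk):

```python
import numpy as np

def scsk_conv(image, kernel):
    """Single channel, single kernel 'valid' convolution by direct loops.
    image: (H, W); kernel: (k, k); output: (H - k + 1, W - k + 1)."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Dot product of the k x k input patch with the kernel.
            out[y, x] = np.sum(image[y:y + k, x:x + k] * kernel)
    return out
```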
Multiple Channel Single Kernel Convolution (MCSK)
[Slide diagram: the single channel convolution results of each input channel are accumulated (+=) into one output.]
Multiple Channel Multiple Kernel Convolution (MCMK)
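And the full MCMK case as direct loops, which also covers MCSK as the M == 1 special case (again an illustrative NumPy sketch with 'valid' padding and unit stride):

```python
import numpy as np

def mcmk_conv(image, kernels):
    """MCMK 'valid' convolution by direct loops: output feature map m is the
    sum over input channels c of a single channel convolution with kernel (m, c).
    MCSK is the special case M == 1.
    image: (C, H, W); kernels: (M, C, k, k); output: (M, H - k + 1, W - k + 1)."""
    M, C, k, _ = kernels.shape
    _, H, W = image.shape
    out = np.zeros((M, H - k + 1, W - k + 1))
    for m in range(M):
        for c in range(C):
            for y in range(out.shape[1]):
                for x in range(out.shape[2]):
                    out[m, y, x] += np.sum(image[c, y:y + k, x:x + k] * kernels[m, c])
    return out
```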
Importance of convolutional layers
Figure: Distribution of forward inference time for AlexNet on CPU. Figure credit [1]. This graph from Yangqing Jia's thesis illustrates the importance of optimizing the convolutional layers, so a proper choice of implementation for the convolutional layer is key: about 89% of forward inference time is spent on convolutional layers. [1] Jia, Yangqing. Learning Semantic Image Representations at a Large Scale. PhD thesis, University of California, Berkeley, 2014.
Current implementations
Loop based methods: the Caffe loop, the Knights Landing loops, and the sum of single channel convolutions; and im2col.
Convolutional layer implementation – im2col
This process of covering the input with overlapping patches and transforming it into an intermediate patch matrix is called "im2col". Colloquially, the name im2col is now used for the entire operation of the im2col transformation followed by the GEMM.
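A minimal sketch of the im2col-plus-GEMM approach just described (illustrative NumPy, 'valid' padding and unit stride assumed; the k*k-fold input replication is visible in the size of the patch matrix):

```python
import numpy as np

def im2col_conv(image, kernels):
    """im2col followed by one GEMM, for 'valid' padding and unit stride.
    image: (C, H, W); kernels: (M, C, k, k)."""
    M, C, k, _ = kernels.shape
    _, H, W = image.shape
    out_h, out_w = H - k + 1, W - k + 1
    # Patch matrix: (C*k*k) x (out_h*out_w). Every input element appears in up
    # to k*k patches, which is where the k*k-fold memory expansion comes from.
    cols = np.empty((C * k * k, out_h * out_w))
    for y in range(out_h):
        for x in range(out_w):
            cols[:, y * out_w + x] = image[:, y:y + k, x:x + k].reshape(-1)
    # Kernel matrix: M x (C*k*k); a single GEMM produces the whole output.
    return (kernels.reshape(M, -1) @ cols).reshape(M, out_h, out_w)
```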
im2col: widely used in popular deep learning frameworks.
Expands the input by a factor of k*k. The transformation process suffers from poor locality; locality gets better with the im2row transformation. Yangqing Jia's blog post suggests he switched from the nested loops to im2col in Caffe to leverage efficient implementations of GEMM.
Vector-Scalar Multiplication
1x1 SCSK – Proposed method from first principles. Our method stems from a small observation: consider 1x1 single channel single kernel convolution, which is just a vector-scalar multiplication.
Matrix-Vector Multiplication (GEMV)
1x1 MCSK – Proposed method from first principles. With multiple input channels and a single 1x1 kernel, the convolution becomes a matrix-vector multiplication (GEMV).
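Written out for the 1x1 case (my notation, not the slides'): with a single channel the kernel is one scalar $w$, and with $C$ channels the weights form a vector, so

$$\text{SCSK: } O(y,x) = w\,I(y,x), \qquad \text{MCSK: } O(y,x) = \sum_{c=1}^{C} w_c\,I_c(y,x).$$

Flattening the image into a $C \times HW$ matrix makes the MCSK case a single GEMV, and stacking the $M$ kernel vectors as rows of an $M \times C$ matrix makes the MCMK case a single GEMM.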
Matrix-Matrix Multiplication (GEMM)
1x1 MCMK – Proposed method from first principles. With multiple kernels as well, the 1x1 convolution becomes a matrix-matrix multiplication (GEMM).
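The 1x1 MCMK case as a single GEMM (illustrative NumPy sketch, not the released primitives):

```python
import numpy as np

def conv1x1_as_gemm(image, kernels):
    """A 1x1 MCMK convolution is exactly one GEMM, with no data replication:
    the (M x C) kernel matrix times the (C x H*W) image matrix.
    image: (C, H, W); kernels: (M, C, 1, 1)."""
    M, C = kernels.shape[:2]
    _, H, W = image.shape
    out = kernels.reshape(M, C) @ image.reshape(C, H * W)
    return out.reshape(M, H, W)
```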
Our method – kn2col. This idea of transforming a 1x1 multi channel multi kernel convolution into a GEMM becomes the cornerstone of our algorithm. Every channel of the image forms a column. We take all the A-th pixel location elements and store them as a column-major sub-matrix, so the left-top pixel of kernel 1 becomes column 1, and so on. We do this for all the other pixel locations to form the kernel-patch-matrix. Note that this transformation is free, as we can do it ahead of time since the kernels are known a priori. One GEMM of the image and kernel gives us a temporary output that is blown up by a factor of k*k. This is what we call the tube-of-toothpaste problem: when you try to squeeze the memory expansion out of the input side, it shows up on the output side. Some post-pass magic on this temporary result gives us the final result. Along similar lines, one can imagine kn2row, where the kernel and image are laid out in rows and we compute kernel times image to get the output; so, a transpose of this process.
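A rough, self-contained sketch of the kn2row idea just mentioned (kernel times image), in my own illustrative NumPy with 'same' zero padding and unit stride assumed; this is not the authors' implementation, only the shape of the computation: one GEMM of the pre-arranged kernel matrix with the image, then the shift-add post pass that folds the enlarged temporary output into the final result.

```python
import numpy as np

def kn2row_conv(image, kernels):
    """Sketch of the kn2row idea for 'same' zero padding and unit stride:
    one GEMM of the pre-arranged (k*k*M x C) kernel matrix with the (C x H*W)
    image, then a shift-add post pass that folds the k*k temporary planes
    into the final M x H x W output."""
    M, C, k, _ = kernels.shape
    _, H, W = image.shape
    p = k // 2
    # Ahead-of-time kernel rearrangement (free, since kernels are known a priori).
    kmat = kernels.transpose(2, 3, 0, 1).reshape(k * k * M, C)
    # Single GEMM; the temporary result is k*k times larger than the final output.
    temp = (kmat @ image.reshape(C, H * W)).reshape(k, k, M, H, W)
    out = np.zeros((M, H, W))
    for ky in range(k):
        for kx in range(k):
            dy, dx = ky - p, kx - p
            # Accumulate the plane for kernel offset (ky, kx), shifted by (dy, dx).
            oy0, oy1 = max(0, -dy), min(H, H - dy)
            ox0, ox1 = max(0, -dx), min(W, W - dx)
            out[:, oy0:oy1, ox0:ox1] += temp[ky, kx, :, oy0 + dy:oy1 + dy,
                                             ox0 + dx:ox1 + dx]
    return out
```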
Results – GoogLeNet
Results – VGG-16
Tube of toothpaste – In numbers
C and M vary differently through the layers. The im2 methods expand the input to H*W*C*k*k elements, while the kn2 methods expand the output to H*W*M*k*k elements. im2: H*W*k*k*C; kn2: H*W*k*k*M.
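As a rough illustration of that trade-off (the layer sizes below are hypothetical, chosen for illustration and not taken from the paper):

```python
def temp_buffer_elements(H, W, C, M, k):
    """Intermediate buffer sizes, in elements, for the two families of methods."""
    im2 = H * W * C * k * k   # im2col / im2row: replicated input patch matrix
    kn2 = H * W * M * k * k   # kn2col / kn2row: k*k-times larger temporary output
    return im2, kn2

# Hypothetical 3x3 layer with H = W = 56, C = 64, M = 192 (illustrative sizes):
# im2 needs ~1.8M temporary elements, kn2 ~5.4M; for a layer where C > M the
# comparison flips, which is why C and M varying across layers matters.
print(temp_buffer_elements(56, 56, 64, 192, 3))
```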
Road Ahead: accumulating versions of the kn2 methods (prevents the output memory explosion); more optimizations to the im2 transformations; frequency domain implementations (FFT based convolution, Winograd domain convolution); triNNity inference primitives (ASPLOS 2018), with code to be released by the end of 2017. Winograd – Nervana.
Key Takeaways: MCMK convolution can leverage optimized GEMM libraries. Memory layouts have a significant impact on performance. Aggressive constant propagation is key. There is no one method to rule them all! An optimization framework for memory layouts and implementations is needed.
Multi channel multi kernel convolution can be easily expressed as some composition of GEMMs without the need for input data replication, and perhaps without the need for output memory explosion either. I am preaching to the choir here, but memory layouts have a significant impact on performance. Giving the compiler as much information as possible at compile time works really well: so, template parameters as opposed to function parameters. From our experiments for this work and our current work, we see that there is no one method that outperforms everything for a given layer, which leads me to my final point: there is a need for an optimization framework that is architecture aware, to make choices about memory layouts and implementations of the primitives in order to minimize the total runtime of the network.
Thank You
Backup slides