

1 Communication-Minimizing 2D Convolution in GPU Registers. Forrest N. Iandola, David Sheffield, Michael Anderson, P. Mangpo Phothilimthana, Kurt Keutzer. University of California, Berkeley. forresti@eecs.berkeley.edu

2 Overview
Convolution is a recurring computational pattern in a broad range of computer vision applications.
Memory communication is the bottleneck for convolution on modern GPUs.
How to minimize memory communication overhead in convolution:
–Texture cache
–Loop blocking
Up to 4.5x speedup over existing GPU implementations from NVIDIA, OpenCV, and others.

3 Why focus on convolution?
The Berkeley ParLab project identified 15 recurring computational patterns in computer vision.
Convolution underlies feature extraction and sliding-window object detection, typically with small filters (2x2 – 7x7).
If we want fast computer vision, we need fast convolution.
[Figure: the 15 computer vision patterns; CVPR 2007 – 2011 object recognition track]
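As a reference point for the discussion that follows, the "typical" 2D convolution the slides analyze can be sketched as a direct nested loop. This is an illustrative Python model, not the paper's GPU code; for the symmetric filters used here, convolution and correlation coincide, so the filter is not flipped.

```python
def conv2d_valid(image, filt):
    """Direct 2D 'valid' convolution sketch: each output pixel is a
    weighted sum of a k x k window of input pixels."""
    h, w = len(image), len(image[0])
    k = len(filt)
    out = [[0.0] * (w - k + 1) for _ in range(h - k + 1)]
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            acc = 0.0
            for fy in range(k):
                for fx in range(k):
                    acc += image[y + fy][x + fx] * filt[fy][fx]
            out[y][x] = acc
    return out

# 5x5 ramp image convolved with a 3x3 box (averaging) filter.
image = [[float(5 * r + c) for c in range(5)] for r in range(5)]
box = [[1.0 / 9.0] * 3 for _ in range(3)]
result = conv2d_valid(image, box)
print(len(result), len(result[0]))  # 3 3
```

Note that the innermost loops touch k*k input pixels per output pixel; how often those loads hit registers versus DRAM is exactly what the rest of the talk optimizes.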

4 What limits the performance of convolution?
The roofline model [1] divides a program's execution time into two parts:
–Computational cost (GFLOP/s)
–Communication cost (GB/s): memory traffic, I/O, etc.
No program can outperform the hardware bound on computation or communication.
[1] S. Williams, A. Waterman, D. Patterson. Roofline: An Insightful Visual Performance Model for Floating Point Programs and Multicore Architectures. Communications of the ACM, 2009.
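The roofline bound can be shown numerically. The peak numbers below are hypothetical round figures for illustration, not the paper's measurements: attainable throughput is the minimum of the compute peak and the arithmetic intensity times memory bandwidth.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Attainable GFLOP/s = min(compute peak,
                                arithmetic intensity * memory bandwidth)."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

# Hypothetical GPU: 3000 GFLOP/s peak compute, 190 GB/s DRAM bandwidth.
# A 3x3 convolution that reads each 4-byte input pixel once from DRAM
# does about 17 flops (9 multiplies + 8 adds) per pixel loaded.
ai = 17 / 4.0  # flops per byte, under this idealized-reuse assumption
print(roofline_gflops(3000.0, 190.0, ai))  # 807.5 -> memory bound
```

Because 807.5 GFLOP/s is far below the 3000 GFLOP/s compute peak, this kernel sits on the sloped, memory-bounded part of the roofline, which is why the talk focuses on communication.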

5 What limits the performance of convolution?
[Figure: roofline model of computational performance, with a memory-bounded region (sloped) and a computation-bounded region (flat)]

6 What limits the performance of convolution?
Convolution on NVIDIA GPUs:
–Communication between the GPU's off-chip DRAM and on-chip caches is the bottleneck.
–This does not include communication between the CPU and GPU, though that can also be an issue.
If we want fast computer vision, we need fast convolution. If we want fast convolution on GPUs, we need to optimize memory communication.

7 Exploiting the GPU Memory Architecture
[Figure: memory hierarchy of an NVIDIA GTX680. Per-multiprocessor memory: registers, L1 cache / shared memory (893 GB/s), and texture cache (129 Gtexels/s); the multiprocessors share an L2 cache and off-chip global memory (DRAM, 123 GB/s); CPU DRAM connects to the GPU at 8 GB/s]

8 Data Reuse with Loop Blocking
Typical implementation: no data reuse at the register level (9 input pixels loaded per output pixel).

9 Data Reuse with Loop Blocking
Our approach: reuse data by doing more work per thread.
Typical implementation: no register-level reuse, 9 input pixels per output pixel.
Blocked: 16 input pixels produce 4 output pixels, i.e. 4 inputs per output.
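The register-blocking arithmetic on this slide can be sketched by counting loads. This is an illustrative model, not the paper's CUDA kernel: a thread computing a block_h x block_w tile of outputs with a k x k filter reads a (block_h+k-1) x (block_w+k-1) input tile into registers and reuses it across all of its outputs.

```python
def loads_per_output(block_h, block_w, k):
    """Return (input pixels loaded, output pixels produced,
    loads per output) for one thread's register block."""
    inputs = (block_h + k - 1) * (block_w + k - 1)
    outputs = block_h * block_w
    return inputs, outputs, inputs / outputs

print(loads_per_output(1, 1, 3))  # (9, 1, 9.0)   no blocking
print(loads_per_output(2, 2, 3))  # (16, 4, 4.0)  2x2 output block
```

Growing the block keeps shrinking loads per output (a 4x4 block with a 3x3 filter needs only 2.25 loads per output), but only until the tile no longer fits in the thread's registers, which is the trade-off the design-space exploration on the next slide measures.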

10 Exploring the Memory Communication Design Space
[Figure: design-space exploration for convolution with 3x3 filters on an NVIDIA GTX680 (Kepler)]

11–17 Comparison with Related Work
[Figure series: inverse roofline model on an NVIDIA GTX680 (Kepler) comparing related implementations; ours, with texture cache and blocking, achieves up to a 4.5x speedup]

18 Are we done?
Are we done optimizing memory communication? I think so: we achieved the memory bandwidth bound for small filters.
Future work: optimize computation some more!

19 Conclusions
If we want fast computer vision, we need fast convolution. If we want fast convolution on GPUs, we need to optimize memory communication.
Up to 4.5x faster than existing GPU languages and libraries.
Download our code! https://github.com/forresti/convolution
–Use/modify it for your language/library/application

