FPGA Acceleration of Convolutional Neural Networks
Sachin Kumawat
Electrical and Computer Engineering, University of California, Davis
Today’s Discussion
- Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2016. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1–8.
- Aydonat, Utku, et al. "An OpenCL™ Deep Learning Accelerator on Arria 10." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.
FPGA Design Challenges
- Low memory bandwidth: typical off-chip DDR3 DRAMs provide 20-30 GB/s (the Nvidia Titan X's GDDR5 devices provide up to 336.5 GB/s)
- Limited on-chip BRAM: trade-off between cache size and bandwidth
- Full utilization of compute resources (DSP blocks, logic-based multipliers/adders)
- Achieving the best possible operating frequency: needs deeply pipelined implementations; improves as technology scales
Popular Solutions
- Low off-chip memory BW: use double buffering; load the next set of data into a secondary buffer while the first buffer is being reused (sketched below)
- Low on-chip memory BW: tile IFMs/OFMs to save on buffer depth [1] [2]
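To make the double-buffering idea concrete, here is a minimal Python sketch (not from any of the cited papers): while the compute engine consumes one buffer, the next tile is loaded into the other. The names load_tile() and compute_tile() are hypothetical stand-ins for a DMA burst and the PE array.

```python
def load_tile(tile_idx):
    # Stand-in for a DMA burst from off-chip DDR into an on-chip buffer.
    return [tile_idx] * 4

def compute_tile(buf):
    # Stand-in for the PE array consuming one tile.
    return sum(buf)

def run(num_tiles):
    buffers = [None, None]            # two on-chip (ping-pong) buffers
    buffers[0] = load_tile(0)         # prologue: fill the first buffer
    results = []
    for t in range(num_tiles):
        ping, pong = t % 2, (t + 1) % 2
        if t + 1 < num_tiles:
            # In hardware this load overlaps with the compute below,
            # hiding the off-chip access latency.
            buffers[pong] = load_tile(t + 1)
        results.append(compute_tile(buffers[ping]))
    return results

print(run(4))
```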
Popular Solutions
- Compute resource utilization: design efficient Processing Elements (PEs) and vectorize independent computations (SIMD)
- Operating frequency: liberal use of registers for a short critical path; HLS CAD tools are designed to synthesize deep pipelines [3]
CNN Layer Structure [1]
Finding Optimal Memory and Architecture Parameters
- Roofline model proposed by Zhang et al. [1]: a comprehensive analytical model to select design parameters by exploring the design space they span
- SIMD factor: Tn x Tm
- Buffer depth: Tr x Tc (tiled loop nest sketched below)
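A minimal Python sketch of how these tiling parameters partition a convolution loop nest, assuming the loop structure popularized by [1]: the layer dimensions and tile sizes below are made up for illustration, Tm/Tn bound the computation unrolled into the Tm x Tn PE array, and Tr/Tc bound the on-chip output buffer.

```python
import numpy as np

# Hypothetical layer dimensions and tile sizes, for illustration only.
M, N, R, C, K = 8, 4, 6, 6, 3          # out FMs, in FMs, out rows/cols, kernel
Tm, Tn, Tr, Tc = 4, 2, 3, 3            # tiling parameters from the design space

ifm = np.random.rand(N, R + K - 1, C + K - 1)
w   = np.random.rand(M, N, K, K)
ofm = np.zeros((M, R, C))

# Outer loops walk over tiles (data movement between DRAM and on-chip buffers);
# the (m, n) loops inside a tile are what the accelerator unrolls into Tm x Tn PEs.
for m0 in range(0, M, Tm):
    for n0 in range(0, N, Tn):
        for r0 in range(0, R, Tr):
            for c0 in range(0, C, Tc):
                for r in range(r0, min(r0 + Tr, R)):
                    for c in range(c0, min(c0 + Tc, C)):
                        for m in range(m0, min(m0 + Tm, M)):
                            for n in range(n0, min(n0 + Tn, N)):
                                ofm[m, r, c] += np.sum(w[m, n] * ifm[n, r:r+K, c:c+K])

# Check against a direct computation of the same convolution.
ref = np.zeros_like(ofm)
for m in range(M):
    for n in range(N):
        for r in range(R):
            for c in range(C):
                ref[m, r, c] += np.sum(w[m, n] * ifm[n, r:r+K, c:c+K])
assert np.allclose(ofm, ref)
```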
Shortcomings of the Analytical Model in [1]
- Design space explored only for convolution, but CNNs have other layers
- Implementation inefficiency: fixed kernel size for easier pipelining of loops; they achieve 61.62 GFLOPs/s with 448 PEs at 100 MHz (see the check below)
  - At 100 MHz, with 2 operations per cycle (1 multiplication + 1 addition), each PE at 100% efficiency can perform 2 x 100 MFLOPs/s = 0.2 GFLOPs/s
  - Each PE actually performs on average 61.62 / 448 = 0.1375 GFLOPs/s
  - This gives a PE efficiency for convolution of 0.1375 / 0.2 = 68.77%
- Does not consider the "blockness" of BRAMs and just assumes all available BRAM is one monolithic chunk of on-chip memory; this does not scale well
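The efficiency figure above can be re-derived directly; the snippet below only reproduces the numbers quoted on this slide.

```python
# Re-deriving the PE efficiency quoted for the design in [1].
achieved_gflops = 61.62      # reported conv throughput
num_pes = 448                # number of processing elements
freq_hz = 100e6              # 100 MHz clock
ops_per_cycle = 2            # 1 multiply + 1 add per PE per cycle

peak_per_pe = ops_per_cycle * freq_hz / 1e9      # 0.2 GFLOPs/s per PE
actual_per_pe = achieved_gflops / num_pes        # ~0.1375 GFLOPs/s per PE
print(f"PE efficiency = {actual_per_pe / peak_per_pe:.2%}")   # ~68.77%
```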
Caffeine: a runtime-configurable library to design and semi-automatically map CNN accelerators onto FPGAs [4]
Two major contributions:
- Optimizations for fully connected layers as well
- Integration with the Caffe deep learning framework
Caffeine stands for CAFfe Fpga EngINE
Optimizing the Fully Connected Layer (FCN) [4]
Two main strategies:
- Input major: a batch of elements from different FCN input vectors maps to the same input feature map in CONV (sketched below)
- Weight major: FCN input vectors map to CONV weight kernels, and FCN weight vectors map to CONV input FMs
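A minimal numpy sketch of the input-major view (not the authors' code; the dimensions are made up for illustration): batching FCN input vectors turns the FCN into a 1x1 convolution in which each "pixel" position is one element of the batch, so the CONV engine's weight reuse applies.

```python
import numpy as np

# Hypothetical FCN dimensions, for illustration only.
num_in, num_out, batch = 6, 5, 4
W = np.random.rand(num_out, num_in)        # FCN weight matrix
X = np.random.rand(num_in, batch)          # a batch of FCN input vectors (columns)

# Input-major view: input feature map n holds element n of every vector in the
# batch, so the FCN becomes a 1x1 convolution over 'batch' pixels.
ofm = np.zeros((num_out, batch))
for m in range(num_out):                   # output feature maps
    for n in range(num_in):                # input feature maps
        ofm[m] += W[m, n] * X[n]           # 1x1 kernel applied to all 'pixels'

assert np.allclose(ofm, W @ X)             # same result as the plain batched FCN
```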
Design Space Exploration for FCN [4]
Performance Results [4]
Based on the previous roofline model in [1]:
- Configuration used for Virtex-7: <Tm, Tn, Tr × Tc, Tk1 × Tk2> = <54, 54, ?, ?> (Tr × Tc and Tk1 × Tk2 not reported by the authors)
- Configuration used for Kintex KU060: <Tm, Tn, Tr × Tc, Tk1 × Tk2> = <32, 32, 6272, 25>
Based on the revised roofline model in [4]:
- Weight-major mapping used with <batch, ker> = <32, 1>
Issues still remaining in Caffeine …
The underlying approach used in Caffeine is similar to what was proposed in [1], so it suffers from some of the same issues:
- Kernel size fixed to <Tk1 x Tk2> = <5 x 5>. This leaves a big gap between achievable and achieved performance:

  VGG Processing Efficiency   Virtex-7 690T   Kintex KU060
  Conv only                   74.83%          73.25%
  Full network                41.65%          62.85%

- The on-chip memory model is carried over from the previous model and hence still not addressed
An OpenCL™ Deep Learning Accelerator on Arria 10 [5]
- Work by Intel Corporation, Toronto (formerly the Altera Toronto Technology Center)
- Shows the use of Winograd transformations to reduce computational complexity in CNNs
- Parallelizes more forms of independent computation than Caffeine
Winograd Minimal Filtering [7]
3x3 2D Convolution with the Winograd Transform (figure courtesy Intel Nervana [6]; yields speedups of 2-3x over NVIDIA's cuDNN v4 kernels)
More about Winograd [8]
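To make the minimal-filtering idea concrete, here is a small Python sketch of the standard 1-D F(2,3) fast filtering identities discussed in [8]: it produces 2 outputs of a 3-tap filter with 4 multiplications instead of 6 (the example data below is made up).

```python
# F(2,3): 2 outputs of a 3-tap FIR using 4 multiplications instead of 6.
def winograd_f2_3(d, g):
    d0, d1, d2, d3 = d                     # 4 input samples
    g0, g1, g2 = g                         # 3 filter taps
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2    # the filter-side factors can be
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2    # precomputed once per kernel
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
direct = [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
          d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]
fast = winograd_f2_3(d, g)
assert all(abs(a - b) < 1e-12 for a, b in zip(fast, direct))
```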
Deep Learning Accelerator (DLA) Architecture [5]
Performance Results [5]
- Synthesized for <Cvec, Kvec, Qvec, Wvec> = <8, 48, 4, 6>; total PEs ~2432
- Throughput results presented only for AlexNet
- Full-network throughput = 1382 GFLOPs/s (half precision, not IEEE floating-point compliant)
- AlexNet conv-only average efficiency = 70.5% (batch size 1)
- AlexNet full-network efficiency = 91.1% (conv batch size: 1, fully connected layer batch size: 96)
References
[1] Zhang, Chen, et al. "Optimizing FPGA-based accelerator design for deep convolutional neural networks." Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015.
[2] https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/ug/ug_ram_rom.pdf
[3] https://www.altera.com/en_US/pdfs/literature/wp/wp-201406-acceleware-opencl-on-fpgas-for-gpu-programmers.pdf
[4] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2016. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1–8.
[5] Aydonat, Utku, et al. "An OpenCL™ Deep Learning Accelerator on Arria 10." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.
[6] https://www.intelnervana.com/winograd-2/
[7] https://homes.cs.washington.edu/~cdel/presentations/Fast_Algorithms_for_Convolutional_Neural_Networks_Slides_reading_group_uw_delmundo_slides.pdf
[8] Lavin, Andrew, and Scott Gray. "Fast algorithms for convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.