FPGA Acceleration of Convolutional Neural Networks
Sachin Kumawat Electrical and Computer Engineering University of California, Davis
Today’s Discussion

Zhang, Chen, et al. "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks." 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2016, pp. 1–8.
Aydonat, Utku, et al. "An OpenCL™ Deep Learning Accelerator on Arria 10." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.
FPGA Design Challenges
Low memory bandwidth: off-chip DDR3 DRAM typically provides on the order of 10 GB/s (by comparison, the Nvidia Titan X's GDDR5 devices provide up to 336.5 GB/s)
Limited on-chip BRAM available: caching size vs. bandwidth trade-off
Full utilization of computation resources (DSP blocks, logic-based multipliers/adders)
Achieving the best possible operating frequency: deeply pipelined implementations are needed; this improves as technology scales
Popular Solutions

Low off-chip memory BW: use dual buffering; load the next set of data into a secondary buffer while the first buffer is being reused (sketched below)
Low on-chip memory BW: tile IFMs/OFMs to save on buffer depth [1][2]
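A minimal C++ sketch of the dual-buffering (ping-pong) pattern. The load/compute stages, tile size, and names here are illustrative assumptions; in an HLS flow the two calls inside the loop would be made to overlap (e.g. in a dataflow region), while as plain software this only demonstrates the access pattern.

```cpp
#include <cstdio>
#include <cstring>

constexpr size_t TILE = 256;  // words per tile (illustrative)

// Hypothetical stand-in for a DRAM burst read into an on-chip buffer.
void load_tile(const float* dram, float* buf, size_t t) {
    std::memcpy(buf, dram + t * TILE, TILE * sizeof(float));
}

// Hypothetical stand-in for the compute stage (a toy reduction).
void compute_tile(const float* buf, float* out, size_t t) {
    float acc = 0.0f;
    for (size_t i = 0; i < TILE; ++i) acc += buf[i];
    out[t] = acc;
}

// Dual buffering: fetch tile t+1 into the idle buffer while tile t
// is being consumed, hiding the memory latency behind the compute.
void process(const float* dram, float* out, size_t num_tiles) {
    static float ping[TILE], pong[TILE];
    load_tile(dram, ping, 0);                  // prologue: fill first buffer
    for (size_t t = 0; t < num_tiles; ++t) {
        float* cur  = (t & 1) ? pong : ping;   // buffer being consumed
        float* next = (t & 1) ? ping : pong;   // buffer being refilled
        if (t + 1 < num_tiles) load_tile(dram, next, t + 1);
        compute_tile(cur, out, t);
    }
}

int main() {
    static float dram[4 * TILE], out[4];
    for (size_t i = 0; i < 4 * TILE; ++i) dram[i] = 1.0f;
    process(dram, out, 4);
    std::printf("out[0] = %f\n", out[0]);      // expect 256
    return 0;
}
```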
Popular Solutions

Compute resource utilization: design efficient processing elements (PEs) and vectorize independent computations (SIMD); see the sketch below
Operating frequency: liberal use of registers for a short critical path; HLS CAD tools are designed to synthesize deep pipelines [3]
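A minimal HLS-flavoured C++ sketch of such a PE array (the pragmas are ignored by an ordinary compiler; TM, TN, and all names are illustrative assumptions): TM x TN independent multiply-accumulates, fully unrolled so the tools can map each one onto a DSP-backed MAC unit.

```cpp
#include <cstdio>

constexpr int TM = 4;  // output-FM parallelism (illustrative)
constexpr int TN = 2;  // input-FM parallelism (illustrative)

// One pipeline stage of the PE array: TM x TN independent MACs.
// Fully unrolled, HLS instantiates TM*TN multiplier/adder pairs;
// PIPELINE II=1 requests one new input set per clock cycle.
void pe_array(float acc[TM], const float in[TN], const float w[TM][TN]) {
#pragma HLS PIPELINE II=1
    for (int m = 0; m < TM; ++m) {
#pragma HLS UNROLL
        for (int n = 0; n < TN; ++n) {
#pragma HLS UNROLL
            acc[m] += w[m][n] * in[n];
        }
    }
}

int main() {  // tiny software testbench
    float acc[TM] = {0}, in[TN] = {1.0f, 2.0f};
    float w[TM][TN] = {{1, 1}, {2, 2}, {3, 3}, {4, 4}};
    pe_array(acc, in, w);
    std::printf("acc[0] = %f\n", acc[0]);  // expect 1*1 + 1*2 = 3
    return 0;
}
```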
CNN Layer Structure [1]
Finding Optimal Memory and Architecture Parameters
Roofline model proposed by Zhang et al. [1]: a comprehensive analytical model to select design parameters by exploring the design space spanned by them (a sketch of the sweep follows below).
SIMD factor: Tn x Tm
Buffer depth: Tr x Tc
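A software sketch of the design-space walk the roofline model implies: for each candidate <Tm, Tn>, compare the computational roof against the bandwidth ceiling and keep the best attainable point. The platform constants, the fixed Tr x Tc tile, and the once-per-tile traffic model below are simplifying assumptions, not the exact formulation from [1].

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Placeholder platform constants (NOT the values from [1]):
    const double freq_ghz = 0.1;   // 100 MHz clock
    const double bw_gBps  = 4.0;   // off-chip bandwidth, GB/s
    const int    max_macs = 448;   // parallel MACs the DSP budget sustains
    const int K = 3;               // kernel size

    double best = 0.0; int best_tm = 0, best_tn = 0;
    for (int tm = 1; tm <= 64; ++tm) {
        for (int tn = 1; tn <= 64; ++tn) {
            if (tm * tn > max_macs) continue;  // compute-resource constraint
            const int tr = 14, tc = 14;        // tile fixed for brevity
            // FLOPs per tile vs. bytes moved per tile (4-byte floats),
            // assuming each operand is fetched from DRAM once per tile:
            double flops = 2.0 * tm * tn * tr * tc * K * K;
            double bytes = 4.0 * (tm * tn * K * K                   // weights
                                  + tn * (tr + K - 1) * (tc + K - 1) // inputs
                                  + tm * tr * tc);                   // outputs
            double ctc       = flops / bytes;             // FLOP per byte
            double comp_roof = 2.0 * tm * tn * freq_ghz;  // GFLOP/s ceiling
            double attain    = std::min(comp_roof, ctc * bw_gBps);
            if (attain > best) { best = attain; best_tm = tm; best_tn = tn; }
        }
    }
    std::printf("best <Tm,Tn> = <%d,%d>, attainable %.2f GFLOP/s\n",
                best_tm, best_tn, best);
    return 0;
}
```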
Shortcomings of the Analytical model in [1]
Design space explored only for convolution, but CNNs have other layers.
Implementation inefficiency: fixed kernel size for easier pipelining of loops.
They achieve 61.62 GFLOP/s with 448 PEs at 100 MHz. At 100 MHz, with 2 operations per cycle (1 multiplication + 1 addition):
@100% efficiency, each PE can perform 2 x 100 MFLOP/s = 0.2 GFLOP/s
Each PE actually performs on average 61.62 / 448 = 0.1375 GFLOP/s
This gives a PE efficiency for convolution of 0.1375 / 0.2 = 68.77%
Does not consider the "blockness" of BRAMs and just assumes all available BRAM is one chunk of on-chip memory. This does not scale well.
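The same arithmetic written as a single formula, with the general form of PE efficiency as achieved throughput over the peak of all PEs:

```latex
\[
\eta_{\mathrm{PE}}
  = \frac{P_{\text{achieved}}}{2 \cdot f \cdot N_{\mathrm{PE}}}
  = \frac{61.62\ \mathrm{GFLOP/s}}{2 \times 0.1\ \mathrm{GHz} \times 448}
  = 68.77\%
\]
```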
Caffeine: a runtime-configurable library to design and semi-autonomously map CNN accelerators onto FPGAs [4]
Two major contributions:
Performs optimizations for fully connected layers as well
Provides integration with Caffe, the deep learning framework
Caffeine stands for CAFfe Fpga EngINE
Optimizing Fully Connected Layer (FCN) [4]
Two main strategies:
Input major: batches of elements from different input vectors of the fully connected layer map to the same input feature map in CONV (a toy illustration follows below).
Weight major: input vectors of the fully connected layer map to the weight kernels of CONV, and weight vectors of the fully connected layer map to the input FMs of CONV.
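A toy C++ illustration of the input-major idea (a sketch under assumed shapes, not Caffeine's actual code): B input vectors of the FC layer are laid out as B "pixels" of an N-channel input feature map, so a single 1x1 convolution over that map computes the whole batch, and each weight fetched from DRAM is reused B times.

```cpp
#include <cstdio>
#include <vector>

// FC layer: out[m] = sum_n W[m][n] * in[n], with M outputs, N inputs.
// Input-major mapping: B different input vectors become B pixel positions
// of one N-channel feature map; a 1x1 convolution then yields, at each
// output pixel, the FC result for one batch element.
int main() {
    const int M = 3, N = 4, B = 5;              // toy sizes (illustrative)
    std::vector<float> W(M * N, 0.5f);          // FC weights, M x N
    std::vector<float> ifm(N * B);              // N channels x B pixels
    for (int i = 0; i < N * B; ++i) ifm[i] = float(i % 7);

    std::vector<float> ofm(M * B, 0.0f);        // M channels x B pixels
    for (int m = 0; m < M; ++m)                 // 1x1 "convolution":
        for (int b = 0; b < B; ++b)             // each pixel = one batch item
            for (int n = 0; n < N; ++n)
                ofm[m * B + b] += W[m * N + n] * ifm[n * B + b];

    std::printf("batch 0, output 0 = %f\n", ofm[0]);
    return 0;
}
```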
Design Space Exploration for FCN [4]
Performance Results [4]
Based on the previous roofline model in [1]:
Configuration used for Virtex 7: <Tm, Tn, Tr × Tc, Tk1 × Tk2> = <54, 54, ?, ?>¹
Configuration used for Kintex KU060: <Tm, Tn, Tr × Tc, Tk1 × Tk2> = <32, 32, 6272, 25>
Based on the revised roofline model in [4]:
Weight-major mapping used with <batch, ker> = <32, 1>
¹ Not reported by the authors
Issues still remaining in Caffeine …
The underlying approach used in Caffeine is similar to what was proposed in [1], so it suffers from some of the same issues:
Kernel size is fixed to <Tk1 x Tk2> = <5 x 5>. This results in a big gap between achievable and achieved performance:
The on-chip memory model is carried over from the previous work and hence still not addressed.

VGG Processing Efficiency | Virtex 690T | Kintex KU060
Conv only                 | 74.83%      | 73.25%
Full network              | 41.65%      | 62.85%
An OpenCL™ Deep Learning Accelerator on Arria 10 [5]
Work by Intel's Toronto team (the former Altera Toronto Technology Center)
Shows the use of Winograd transformations to reduce computational complexity in CNNs
Parallelizes more forms of independent computation than Caffeine
Winograd Minimal Filtering
Figure: Winograd minimal filtering algorithm [7]
Figure: Winograd minimal filtering, continued [7]
Figure: 3x3 2D convolution with the Winograd transform, mapping IFMs to OFMs (courtesy Intel Nervana [6]; yields speed-ups of 2-3x over NVIDIA's cuDNN v4 kernels)
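As a concrete instance, here is the 1-D minimal filter F(2,3) that underlies the 2-D transform, following Lavin and Gray [7]: two outputs of a 3-tap filter computed with 4 multiplies instead of the direct method's 6. Function and variable names are illustrative.

```cpp
#include <cstdio>

// Winograd F(2,3): two outputs of a 3-tap FIR with 4 multiplies
// (direct evaluation would need 2 x 3 = 6 multiplies).
void winograd_f23(const float d[4], const float g[3], float y[2]) {
    // Filter transform (can be precomputed once per kernel):
    float g0 = g[0];
    float g1 = 0.5f * (g[0] + g[1] + g[2]);
    float g2 = 0.5f * (g[0] - g[1] + g[2]);
    float g3 = g[2];
    // Input transform + element-wise products (the 4 multiplies):
    float m0 = (d[0] - d[2]) * g0;
    float m1 = (d[1] + d[2]) * g1;
    float m2 = (d[2] - d[1]) * g2;
    float m3 = (d[1] - d[3]) * g3;
    // Output (inverse) transform:
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}

int main() {
    float d[4] = {1, 2, 3, 4}, g[3] = {0.5f, 1.0f, -0.25f}, y[2];
    winograd_f23(d, g, y);
    // Check against direct convolution: y[i] = sum_k d[i+k] * g[k]
    for (int i = 0; i < 2; ++i) {
        float ref = d[i] * g[0] + d[i + 1] * g[1] + d[i + 2] * g[2];
        std::printf("y[%d] = %f (direct: %f)\n", i, y[i], ref);
    }
    return 0;
}
```

In the 2-D case, nesting this transform as F(2x2, 3x3) reduces the multiplies per output tile from 36 to 16, which is the source of the speed-ups quoted above.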
More about Winograd [8]
Deep Learning Accelerator (DLA) Architecture [5]
Performance results [5]
Synthesized for <Cvec, Kvec, Qvec, Wvec> = <8, 48, 4, 6>; total PEs ≈ 2432
Throughput results presented only for AlexNet
Full-network throughput = 1382 GFLOP/s (half precision, not IEEE-754 compliant)
AlexNet conv-only average efficiency = 70.5% (batch size 1)
AlexNet full-network efficiency = 91.1% (conv batch size: 1, fully connected layer batch size: 96)
References

Zhang, Chen, et al. "Optimizing FPGA-based accelerator design for deep convolutional neural networks." Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015.
programmers.pdf
Zhang, Chen, et al. "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks." 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2016, pp. 1–8.
Aydonat, Utku, et al. "An OpenCL™ Deep Learning Accelerator on Arria 10." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.
orks_Slides_reading_group_uw_delmundo_slides.pdf
Lavin, Andrew, and Scott Gray. "Fast algorithms for convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.