FPGA Acceleration of Convolutional Neural Networks

FPGA Acceleration of Convolutional Neural Networks
Sachin Kumawat, Electrical and Computer Engineering, University of California, Davis

Today's Discussion
- Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks." Proceedings of the 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2016, 1-8.
- Aydonat, Utku, et al. "An OpenCL™ Deep Learning Accelerator on Arria 10." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.

FPGA Design Challenges
- Low memory bandwidth: typical off-chip DDR3 DRAM provides 20-30 GB/s (for comparison, the Nvidia Titan X's GDDR5 devices deliver up to 336.5 GB/s)
- Limited on-chip BRAM, forcing a trade-off between caching size and bandwidth
- Full utilization of computation resources (DSP blocks, logic-based multipliers/adders)
- Achieving the best possible operating frequency: requires deeply pipelined implementations; improves as technology scales

Popular Solutions
- Low off-chip memory bandwidth: use dual buffering, loading the next set of data into a secondary buffer while the first buffer is being reused (see the sketch below)
- Low on-chip memory bandwidth: tile IFMs/OFMs to save on buffer depth [1] [2]
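To make the dual-buffering point concrete, here is a minimal HLS-style C++ sketch of the ping-pong pattern; TILE_WORDS, load_tile, and compute_tile are illustrative placeholders of my own, not code from [1] or [2].

  // Dual-buffering (ping-pong): overlap the fetch of tile t+1 with the
  // compute on tile t. All names and sizes are illustrative assumptions.
  const int TILE_WORDS = 1024;

  // Burst-read one tile from off-chip memory into an on-chip buffer.
  void load_tile(const float *ddr, float *buf, int t) {
    for (int i = 0; i < TILE_WORDS; ++i)
      buf[i] = ddr[t * TILE_WORDS + i];
  }

  // Placeholder computation on one on-chip tile.
  void compute_tile(const float *buf, float *out, int t) {
    for (int i = 0; i < TILE_WORDS; ++i)
      out[t * TILE_WORDS + i] = 2.0f * buf[i];
  }

  void process(const float *ddr, float *out, int num_tiles) {
    float bufA[TILE_WORDS], bufB[TILE_WORDS];
    load_tile(ddr, bufA, 0);                    // prologue: fill the first buffer
    for (int t = 0; t < num_tiles; ++t) {
      float *cur  = (t % 2 == 0) ? bufA : bufB; // buffer being consumed
      float *next = (t % 2 == 0) ? bufB : bufA; // buffer being filled
      if (t + 1 < num_tiles)
        load_tile(ddr, next, t + 1);            // fetch tile t+1 ...
      compute_tile(cur, out, t);                // ... while computing on tile t
    }
  }

Because the load of tile t+1 overlaps the compute on tile t, off-chip latency is hidden whenever the compute time per tile is at least as long as the transfer time.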

Popular Solutions
- Compute resource utilization: design efficient Processing Elements and vectorize independent computations (SIMD)
- Operating frequency: use registers liberally to keep critical paths short; HLS CAD tools are designed to synthesize deep pipelines [3]

CNN Layer Structure [1]
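For reference in the slides that follow, here is a plain C++ sketch of the convolution-layer loop nest this structure describes, in the notation of [1] (M output FMs, N input FMs, R x C output size, K x K kernels, stride S); the concrete sizes below are made-up examples.

  // Convolution layer as a loop nest, notation of [1]. Sizes are examples.
  const int M = 64, N = 32, R = 28, C = 28, K = 5, S = 1;

  float out_fm[M][R][C];                         // output feature maps (zero-initialized)
  float in_fm[N][R * S + K - 1][C * S + K - 1];  // input feature maps
  float wts[M][N][K][K];                         // kernel weights

  void conv_layer() {
    for (int row = 0; row < R; ++row)
      for (int col = 0; col < C; ++col)
        for (int to = 0; to < M; ++to)           // output FMs
          for (int ti = 0; ti < N; ++ti)         // input FMs
            for (int i = 0; i < K; ++i)          // kernel rows
              for (int j = 0; j < K; ++j)        // kernel cols
                out_fm[to][row][col] +=
                    wts[to][ti][i][j] * in_fm[ti][S * row + i][S * col + j];
  }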

Finding Optimal Memory and Architecture Parameters
Roofline model proposed by Zhang et al. [1]: a comprehensive analytical model that selects design parameters by exploring the design space they span.
- SIMD factor: Tn x Tm
- Buffer depth: Tr x Tc
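A sketch of where these parameters enter the loop nest from the previous slide (reusing its arrays and constants), loosely following the tiled structure in [1]; the tile sizes here are arbitrary. On the FPGA, the two innermost loops become the Tm x Tn PE/SIMD array, and the Tr x Tc tile bounds the on-chip buffer depth.

  // Tiled loop nest: Tr x Tc output tiles live in on-chip buffers,
  // the Tm x Tn innermost loops are unrolled into parallel MACs.
  const int Tm = 4, Tn = 2, Tr = 7, Tc = 7;  // illustrative tile sizes

  void conv_layer_tiled() {
    for (int row = 0; row < R; row += Tr)
      for (int col = 0; col < C; col += Tc)
        for (int to = 0; to < M; to += Tm)
          for (int ti = 0; ti < N; ti += Tn) {
            // On an FPGA this is where IFM/OFM/weight tiles are staged in
            // BRAM, dual-buffered as on the "Popular Solutions" slide.
            for (int i = 0; i < K; ++i)
              for (int j = 0; j < K; ++j)
                for (int trr = row; trr < row + Tr && trr < R; ++trr)
                  for (int tcc = col; tcc < col + Tc && tcc < C; ++tcc)
                    for (int too = to; too < to + Tm && too < M; ++too)   // unrolled
                      for (int tii = ti; tii < ti + Tn && tii < N; ++tii) // unrolled
                        out_fm[too][trr][tcc] += wts[too][tii][i][j] *
                            in_fm[tii][S * trr + i][S * tcc + j];
          }
  }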

Shortcomings of the Analytical Model in [1]
- Design space explored only for convolution, but CNNs have other layers as well.
- Implementation inefficiency from the fixed kernel size (chosen for easier pipelining of loops): they achieve 61.62 GFLOP/s with 448 PEs at 100 MHz. At 100 MHz, with 2 operations per cycle (1 multiplication + 1 addition), each PE at 100% efficiency can perform 2 x 100 MFLOP/s = 0.2 GFLOP/s; each PE actually performs on average 61.62 / 448 = 0.1375 GFLOP/s, giving a PE efficiency for convolution of 0.1375 / 0.2 = 68.77%.
- Does not consider the "blockness" of BRAMs; it simply treats all available BRAM as one chunk of on-chip memory, which does not scale well.

Caffeine: a runtime-configurable library to design and semi-automatically map CNN accelerators onto FPGAs [4]
Two major contributions:
- Performs optimizations for fully connected layers as well
- Provides integration with the Caffe deep learning framework
(Caffeine stands for CAFfe Fpga EngINE.)

Optimizing the Fully Connected Layer (FCN) [4]
Two main strategies:
- Input major: a batch of elements from different FCN input vectors maps to the same input feature map of CONV.
- Weight major: FCN input vectors map to the weight kernels of CONV, and FCN weight vectors map to the input FMs of CONV.
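A rough illustration of the input-major idea, under the simplifying assumption that the FCN is a plain matrix-vector product: batching B input vectors and treating the batch index as the "pixel" position of a 1x1 input feature map lets the same CONV MAC loop serve the FCN. This is a paraphrase of the mapping, not Caffeine's actual code; all names and sizes are illustrative.

  // Input-major FCN-to-CONV sketch: B batched input vectors act as B
  // "pixels" of an input feature map, so the CONV MAC pattern
  // (out += weight * in) is reused unchanged. Sizes are illustrative.
  const int IN_DIM = 256, OUT_DIM = 128, B = 32;

  float fc_in[IN_DIM][B];        // ti -> input FM index, b -> "pixel" position
  float fc_wt[OUT_DIM][IN_DIM];  // one 1x1 "kernel" per (output FM, input FM) pair
  float fc_out[OUT_DIM][B];      // zero-initialized output "feature maps"

  void fcn_as_conv() {
    for (int to = 0; to < OUT_DIM; ++to)   // like CONV output FMs
      for (int ti = 0; ti < IN_DIM; ++ti)  // like CONV input FMs
        for (int b = 0; b < B; ++b)        // like CONV spatial positions
          fc_out[to][b] += fc_wt[to][ti] * fc_in[ti][b];
  }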

Design Space Exploration for FCN [4]

Performance Results [4]
Based on the previous roofline model in [1]:
- Configuration used for Virtex-7: <Tm, Tn, Tr × Tc, Tk1 × Tk2> = <54, 54, ?, ?> (tile sizes not reported by the authors)
- Configuration used for Kintex KU060: <Tm, Tn, Tr × Tc, Tk1 × Tk2> = <32, 32, 6272, 25>
Based on the revised roofline model in [4]:
- Weight-major mapping used with <batch, ker> = <32, 1>

Issues Still Remaining in Caffeine
The underlying approach used in Caffeine is similar to the one proposed in [1], so it suffers from some of the same issues:
- Kernel size fixed to <Tk1 x Tk2> = <5 x 5>, which leaves a big gap between achievable and achieved performance:

  VGG Processing Efficiency | Virtex 690T | Kintex KU060
  Conv only                 | 74.83%      | 73.25%
  Full network              | 41.65%      | 62.85%

- The on-chip memory model is carried over from [1], so that issue is still not addressed.

An OpenCL™ Deep Learning Accelerator on Arria 10 [5]
- Work by Intel Toronto (formerly the Altera Toronto Technology Center)
- Shows the use of Winograd transformations to reduce computational complexity in CNNs
- Parallelizes more forms of independent computation than Caffeine

Winograd Minimal Filtering [7]
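To make the idea concrete, here is a minimal C++ sketch of the 1D minimal filtering algorithm F(2,3) as derived in [8]: 2 outputs of a 3-tap filter from 4 multiplications instead of 6 (the filter-side sums are constant per filter and would normally be precomputed).

  // Winograd F(2,3): two outputs of a 3-tap filter with 4 multiplies (vs. 6).
  // d = 4 input samples, g = 3 filter taps, y = 2 outputs.
  void winograd_f2_3(const float d[4], const float g[3], float y[2]) {
    float m1 = (d[0] - d[2]) * g[0];
    float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    float m4 = (d[1] - d[3]) * g[2];
    y[0] = m1 + m2 + m3;  // = d0*g0 + d1*g1 + d2*g2
    y[1] = m2 - m3 - m4;  // = d1*g0 + d2*g1 + d3*g2
  }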

3x3 2D Convolution with Winograd Transform (courtesy Intel Nervana [6]; yields speed-ups of 2-3x over NVIDIA's cuDNN v4 kernels)
[figure showing IFMs and OFMs]
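For completeness, a sketch of the 2D case F(2x2, 3x3) using the transform matrices published in [8]: Y = A^T ((G g G^T) .* (B^T d B)) A, where .* is the element-wise product. The helper code below is my own illustrative implementation, not Intel's kernel; the 16 element-wise multiplies replace the 36 MACs of direct 3x3 convolution on a 2x2 output tile.

  // 2D Winograd F(2x2, 3x3) with the transform matrices from [8].
  static const float BT[4][4] = {{1,0,-1,0},{0,1,1,0},{0,-1,1,0},{0,1,0,-1}};
  static const float G[4][3]  = {{1,0,0},{0.5f,0.5f,0.5f},{0.5f,-0.5f,0.5f},{0,0,1}};
  static const float AT[2][4] = {{1,1,1,0},{0,1,-1,-1}};

  // y = 2x2 output tile from a 4x4 input tile d and 3x3 filter g.
  void winograd_f2x2_3x3(const float d[4][4], const float g[3][3], float y[2][2]) {
    float U[4][4], V[4][4], M[4][4], t43[4][3], t44[4][4], t24[2][4];

    // U = G * g * G^T  (filter transform, usually precomputed per layer)
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 3; ++j) {
        t43[i][j] = 0;
        for (int k = 0; k < 3; ++k) t43[i][j] += G[i][k] * g[k][j];
      }
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j) {
        U[i][j] = 0;
        for (int k = 0; k < 3; ++k) U[i][j] += t43[i][k] * G[j][k];
      }

    // V = B^T * d * B  (input transform)
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j) {
        t44[i][j] = 0;
        for (int k = 0; k < 4; ++k) t44[i][j] += BT[i][k] * d[k][j];
      }
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j) {
        V[i][j] = 0;
        for (int k = 0; k < 4; ++k) V[i][j] += t44[i][k] * BT[j][k];
      }

    // M = U .* V  (the only general multiplies: 16 instead of 36)
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j) M[i][j] = U[i][j] * V[i][j];

    // Y = A^T * M * A  (output transform)
    for (int i = 0; i < 2; ++i)
      for (int j = 0; j < 4; ++j) {
        t24[i][j] = 0;
        for (int k = 0; k < 4; ++k) t24[i][j] += AT[i][k] * M[k][j];
      }
    for (int i = 0; i < 2; ++i)
      for (int j = 0; j < 2; ++j) {
        y[i][j] = 0;
        for (int k = 0; k < 4; ++k) y[i][j] += t24[i][k] * AT[j][k];
      }
  }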

More about Winograd [8]

Deep Learning Accelerator (DLA) Architecture [5]

Performance Results [5]
- Synthesized for <Cvec, Kvec, Qvec, Wvec> = <8, 48, 4, 6>; ~2432 PEs in total
- Throughput results presented only for AlexNet
- Full-network throughput = 1382 GFLOP/s (half precision, not IEEE floating-point compliant)
- AlexNet conv-only average efficiency = 70.5% (batch size 1)
- AlexNet full-network efficiency = 91.1% (conv batch size 1, fully connected layer batch size 96)

References
[1] Zhang, Chen, et al. "Optimizing FPGA-based accelerator design for deep convolutional neural networks." Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015.
[2] https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/ug/ug_ram_rom.pdf
[3] https://www.altera.com/en_US/pdfs/literature/wp/wp-201406-acceleware-opencl-on-fpgas-for-gpu-programmers.pdf
[4] Zhang, Chen, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks." Proceedings of the 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2016, 1-8.
[5] Aydonat, Utku, et al. "An OpenCL™ Deep Learning Accelerator on Arria 10." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.
[6] https://www.intelnervana.com/winograd-2/
[7] https://homes.cs.washington.edu/~cdel/presentations/Fast_Algorithms_for_Convolutional_Neural_Networks_Slides_reading_group_uw_delmundo_slides.pdf
[8] Lavin, Andrew, and Scott Gray. "Fast algorithms for convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.