C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs
Shuo Wang1, Zhe Li2, Caiwen Ding2, Bo Yuan3, Qinru Qiu2, Yanzhi Wang2, and Yun Liang1
1CECA, Peking University, China; 2Syracuse University, USA; 3City University of New York, USA

FPGA-Accelerated DNNs

RNNs
Applications: voice control, machine translation, music composition, and image captioning.

LSTM-based RNNs
LSTM (Long Short-Term Memory) is a neural network architecture capable of learning both long-term and short-term dependencies. A cell state is added to preserve the long-term information.

Google LSTM
Large model size: the Google LSTM model is 32 MB.
Skewed computation intensity: the computation involves complicated data dependencies.

C-LSTM Framework
Input: an LSTM architecture specification. Output: an optimized FPGA implementation.
Software level: structured compression and model training.
Hardware level: hardware optimization and an automatic synthesis toolchain.

Compression Techniques Overview
Model compression techniques fall into three categories:
Parameter pruning: Deep Compression (ICLR'16), Structured Sparsity (NIPS'16), Less is More (ECCV'16); ESE (FPGA'17) applies pruning to LSTMs on FPGAs.
Quantization: Limited Precision (ICML'15), BinaryConnect (NIPS'15), BinaryNet (CoRR'16).
Structured matrix: Circulant Matrix (ICCV'15), Deep Fried (structured transform); our work, C-LSTM (FPGA'18), belongs to this category.

Parameter Pruning
Pruning yields a sparse matrix stored in CSR format (data plus indices), which is hardware unfriendly:
Unbalanced workload: nonzeros are distributed unevenly across partitions (e.g., 0 : 2 : 1 : 1).
Extra storage footprint: an index must be stored for every nonzero.
Irregular memory access: random access is slow.
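To make the CSR drawbacks concrete, here is a small numpy/scipy sketch (an illustration, not code from the paper) showing the per-nonzero index overhead and the 0 : 2 : 1 : 1 row-workload imbalance from the slide:

```python
# Sketch: CSR keeps one column index per nonzero (extra storage), and
# row lengths vary (unbalanced workload across row partitions).
import numpy as np
from scipy.sparse import csr_matrix

W = np.array([[0, 0, 0, 0],
              [5, 0, 3, 0],
              [0, 2, 0, 0],
              [0, 0, 0, 7]], dtype=np.float32)
S = csr_matrix(W)

print(S.data)             # nonzero values: [5. 3. 2. 7.]
print(S.indices)          # one column index per nonzero -> extra storage
print(np.diff(S.indptr))  # nonzeros per row: [0 2 1 1] -> unbalanced workload
```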

Structured Matrix
Circulant matrix: a 4 x 4 circulant matrix is fully determined by a single defining vector (w00 w01 w02 w03); every row is a circular shift of that vector. Circulant projection therefore compresses the 4 x 4 original matrix into a 1 x 4 dense vector.
Block-circulant matrix: the weight matrix is partitioned into circulant blocks. For example, a 6 x 9 original matrix composed of 3 x 3 circulant blocks is structurally compressed into a 2 x 9 dense matrix.
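A minimal numpy sketch (illustrative, not the paper's code) of how a circulant block is reconstructed from its single defining vector:

```python
# A circulant block is fully determined by one defining vector, so an
# n x n block can be stored as just n values.
import numpy as np

def circulant(c):
    # First column is c; each row is a right circular shift of the row
    # above, so C[i, j] = c[(i - j) % n].
    n = len(c)
    return np.stack([np.roll(c, j) for j in range(n)], axis=1)

C = circulant(np.array([1.0, 2.0, 3.0, 4.0]))
print(C)
# [[1. 4. 3. 2.]
#  [2. 1. 4. 3.]
#  [3. 2. 1. 4.]
#  [4. 3. 2. 1.]]
```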

Circulant Convolution
Multiplying a circulant matrix by an input vector is equivalent to a circular convolution between the matrix's defining vector and the input. After compression, the matrix-vector multiplication therefore becomes a circulant convolution over the dense representation.

Circulant Convolution Acceleration
The circular convolution is accelerated with the Fast Fourier Transform: apply the FFT to the defining vector of each circulant block and to the corresponding input block, multiply element-wise, accumulate across blocks, and apply the inverse FFT (IFFT) to produce the output block.
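The identity behind the acceleration can be checked in a few lines of numpy; this is an assumed demonstration of the math, not the paper's FPGA implementation:

```python
# For a circulant matrix C defined by vector w, C @ x equals the circular
# convolution IFFT(FFT(w) * FFT(x)), reducing O(k^2) work to O(k log k).
import numpy as np

def circulant(c):
    n = len(c)
    return np.stack([np.roll(c, j) for j in range(n)], axis=1)

rng = np.random.default_rng(0)
w = rng.standard_normal(8)   # defining vector of one k x k circulant block
x = rng.standard_normal(8)   # one k-element slice of the input vector

y_direct = circulant(w) @ x                               # O(k^2)
y_fft = np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real  # O(k log k)

print(np.allclose(y_direct, y_fft))  # True
```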

Circulant Convolution Complexity Analysis
An m x n weight matrix built from k x k circulant sub-matrices is structurally compressed into an m/k x n dense matrix of defining vectors, which is hardware friendly:
Storage complexity: reduced from O(m·n) to O(m·n/k).
Computational complexity: reduced from O(m·n) to O(m·n·log k / k).
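The counts behind these bounds, written out (a back-of-envelope derivation consistent with the slide, not taken verbatim from the paper):

```latex
% The m x n matrix contains (m/k)(n/k) circulant blocks of size k x k.
% Storage: each block keeps only its defining vector of k values.
% Compute: each block costs O(k log k) via FFT instead of O(k^2).
\[
\underbrace{\frac{m}{k}\cdot\frac{n}{k}}_{\#\text{blocks}} \times k
  = \frac{mn}{k} \quad\text{(storage)},
\qquad
\frac{m}{k}\cdot\frac{n}{k} \times k\log k
  = \frac{mn\log k}{k} \quad\text{(compute)}.
\]
```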

Model Training
Varying the circulant block size k trades off compression ratio against accuracy. Training outputs the corresponding inference models (e.g., block size = 8, block size = 16).

C-LSTM Hardware Level
Hardware optimization and implementation: circulant convolution, activation functions, and scheduling and pipelining.
Automatic synthesis toolchain: operator scheduling, C/C++ operator templates, code generator, and synthesis backend.

Circulant Convolution: Local Memory Promotion and Redundancy Elimination
Local memory promotion: the intermediate buffers of the FFT-accelerated circulant convolution are promoted to the on-chip BRAMs of the FPGA.
Redundancy elimination: because the inputs are real-valued, the FFT outputs are mirrored (conjugate symmetric), so about half of the outputs can be eliminated.
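The mirrored-output property is the conjugate symmetry of the real-input FFT; a small numpy check (an illustration, not the paper's code):

```python
# For real input of length n, only n//2 + 1 of the n FFT outputs are
# independent; numpy's rfft computes exactly that half-spectrum.
import numpy as np

x = np.random.default_rng(0).standard_normal(8)
full = np.fft.fft(x)   # 8 complex outputs
half = np.fft.rfft(x)  # 5 complex outputs (n//2 + 1)

print(np.allclose(full[:5], half))                       # True
print(np.allclose(full[5:], np.conj(full[1:4][::-1])))   # mirrored half
```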

Circulant Convolution: FFT/IFFT Decoupling
The block-circulant convolution is transformed from its coupled form, in which every block performs its own FFT, element-wise multiplication, and IFFT, into a decoupled form that factors the FFT and IFFT stages out of the per-block computation.

Activation Functions
y = sigmoid(x) and y = tanh(x) are implemented by piecewise linearization. With 22 segments, the approximation error is below 1%.
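A sketch of the piecewise-linear approximation in numpy; the segment count matches the slide, but the clipping range and uniform breakpoints are assumptions, not the paper's exact segment table:

```python
# Approximate sigmoid with 22 uniform linear segments on a fixed range;
# in hardware the breakpoint table would live in on-chip memory.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

SEGMENTS = 22
LO, HI = -8.0, 8.0                      # assumed clipping range
xs = np.linspace(LO, HI, SEGMENTS + 1)  # segment breakpoints
ys = sigmoid(xs)                        # stored table values

def sigmoid_pwl(x):
    x = np.clip(x, LO, HI)
    return np.interp(x, xs, ys)         # linear interpolation per segment

grid = np.linspace(-10, 10, 10001)
err = np.max(np.abs(sigmoid_pwl(grid) - sigmoid(grid)))
print(f"max abs error: {err:.4f}")      # well under 1% with 22 segments
```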

LSTM Pipelining: Putting It All Together
The LSTM datapath combines circulant convolution operators with element-wise operators (x, ·, tanh, +, σ). Each operator's parallelization degree determines its throughput in tuples per clock cycle; if the degrees are mismatched, the slowest operator bounds the pipeline and the resources of the faster operators are wasted.

Operator Scheduler: Coarse- and Fine-Grained Pipelining
Step 1: partition the LSTM module into 3 stages.
Step 2: build a fine-grained pipeline inside each stage (initiation interval II = 1).
Step 3: add double buffers between each pair of concatenated stages.
Step 4: adjust the parallelism degrees to maximize throughput under the resource constraint.
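A toy software analogy (assumed, not the generated HLS code) of the resulting coarse-grained pipeline: the in_flight list plays the role of the double buffers, so all three stages work on consecutive inputs in the same tick:

```python
# Coarse-grained pipeline with buffers between stages: while a stage
# produces into one buffer, its successor consumes the previous result,
# so the stages overlap across consecutive inputs.
def pipeline(stages, inputs):
    n = len(stages)
    in_flight = [None] * n               # buffered result of each stage
    outputs = []
    stream = list(inputs) + [None] * n   # extra ticks to flush the pipeline
    for item in stream:
        if in_flight[-1] is not None:    # last stage emits a finished result
            outputs.append(in_flight[-1])
        for i in range(n - 1, 0, -1):    # all stages "fire" in the same tick
            prev = in_flight[i - 1]
            in_flight[i] = stages[i](prev) if prev is not None else None
        in_flight[0] = stages[0](item) if item is not None else None
    return outputs

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]  # toy stages
print(pipeline(stages, range(5)))  # [-1, 1, 3, 5, 7]
```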

Scheduling Result . . . · Execution Timeline x + σ stage 1 stage 2 tanh + σ double buffers double buffers stage 1 stage 2 stage 3 stage 1 stage 2 stage 3 time input 1 input 1 input 2 input 1 input 2 input 3 input 2 input 3 input 4 input 3 input 4 input 5 . . .

Synthesis Backend

Experimental Setup
Comparison baseline: ESE (FPGA'17).
LSTM architecture: Google LSTM.
Benchmark suite: TIMIT.
FPGA platforms: Xilinx FPGAs.
Software tools: SDAccel 2017.1.

Experimental Results: Performance
Throughput (frames per second): up to 21.2x speedup over the ESE baseline.
Execution latency: up to 7.0x reduction.

Experimental Results: Energy Efficiency and Accuracy
Energy efficiency (frames per second per watt): up to 33.5x improvement.
PER degradation: comparable to the baseline (phone error rate on TIMIT).

Conclusion
Circulant matrix compression: reduces both storage and computational complexity; the accuracy degradation is small and acceptable; the compressed circulant matrix is dense and hardware friendly.
LSTM inference engine created by C-LSTM: > 10x throughput speedup, > 4x latency reduction, > 19x energy efficiency improvement, and < 1.3% accuracy degradation.

Thank you!