1
C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs
Shuo Wang¹, Zhe Li², Caiwen Ding², Bo Yuan³, Qinru Qiu², Yanzhi Wang², and Yun Liang¹
¹CECA, Peking University, China; ²Syracuse University, USA; ³City University of New York, USA
2
FPGA-Accelerated DNNs
3
RNN applications: voice control, machine translation, music composition, image captioning
4
LSTM-based RNNs
LSTM (Long Short-Term Memory): a recurrent neural network capable of learning both long-term and short-term dependencies. A cell state is added to preserve the long-term information.
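For reference, the standard LSTM cell equations are sketched below (textbook formulation; the Google LSTM model evaluated later additionally uses peephole connections and an output projection). The weight matrices W and U dominate both the model size and the computation, which is why they are the target of the compression that follows.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), &
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), &
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```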
5
Google LSTM: Challenges
Large model size (the Google LSTM model is 32 MB); skewed computation intensity; complicated data dependencies.
6
C-LSTM Framework
Input: LSTM architecture specification. Software level: structured compression and model training. Hardware level: hardware optimization and an automatic synthesis toolchain. Output: an optimized FPGA implementation.
7
Compression Techniques Overview
Model compression techniques:
- Parameter pruning: Deep Compression (ICLR'16), Structured Sparsity (NIPS'16), Less is More (ECCV'16) → ESE (FPGA'17)
- Quantization: Limited Precision (ICML'15), BinaryConnect (NIPS'15), BinaryNet (CoRR'16)
- Structured matrix: Circulant Matrix (ICCV'15), Deep Fried Convnets, Structured Transforms → Our work: C-LSTM (FPGA'18)
8
Parameter Pruning Is Hardware-Unfriendly!
Pruning yields a sparse matrix stored in CSR format (data + indices). Problems: the workload is unbalanced across partitions (e.g., 0 : 2 : 1 : 1 nonzeros per partition), the indices add an extra storage footprint, and memory access becomes irregular (random access is slow).
9
Structured Matrix Circulant Matrix Block-Circulant Matrix
A 4 × 4 circulant matrix is fully determined by its defining vector [w00 w01 w02 w03]: every row is a cyclic shift of that vector, so circulant projection compresses the 4 × 4 original matrix into a 1 × 4 dense vector. A block-circulant matrix applies the same idea block-wise: structured compression reduces a 6 × 9 original matrix built from 3 × 3 circulant blocks to a 2 × 9 dense matrix holding only each block's defining vector.
10
Circulant Convolution
Multiplying a circulant matrix by an input vector is equivalent to circularly convolving the matrix's defining vector with that input; compressing the circulant matrix to its dense defining vector therefore turns the matrix-vector multiply into a circulant convolution.
11
Circulant Convolution Acceleration
The circulant convolution is accelerated with the Fast Fourier Transform: each input block x_j is transformed by an FFT, multiplied element-wise with the FFT of the corresponding defining vector w_ij, the products are accumulated (∑), and an IFFT produces each output block y_i.
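A minimal numpy sketch of this scheme (the function name and array layout are illustrative, not from the paper); it assumes the first-column convention for each circulant block, so each block product is a true circular convolution:

```python
import numpy as np
from scipy.linalg import circulant

def block_circulant_matvec(W_vecs, x, k):
    """y = W @ x for a block-circulant W, computed with FFTs.

    W_vecs: (m/k, n/k, k) array of defining vectors, one per k-by-k
            circulant block (first-column convention).
    x:      length-n input vector.
    """
    p, q, _ = W_vecs.shape
    X = np.fft.fft(x.reshape(q, k), axis=1)   # FFT of each input block
    W = np.fft.fft(W_vecs, axis=2)            # FFT of each defining vector
    # Element-wise products in the frequency domain, accumulated
    # over the q column blocks; one IFFT per output block.
    Y = np.einsum('pqk,qk->pk', W, X)
    return np.fft.ifft(Y, axis=1).real.reshape(p * k)

# Sanity check against an explicit block-circulant matrix.
rng = np.random.default_rng(0)
p, q, k = 2, 3, 4
W_vecs = rng.standard_normal((p, q, k))
x = rng.standard_normal(q * k)
dense = np.block([[circulant(W_vecs[i, j]) for j in range(q)]
                  for i in range(p)])
assert np.allclose(dense @ x, block_circulant_matvec(W_vecs, x, k))
```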
12
Circulant Convolution Complexity Analysis
An m × n weight matrix partitioned into k × k circulant sub-matrices is structurally compressed into an (m/k) × n dense matrix of defining vectors. Hardware friendly! Storage complexity is reduced from O(m·n) to O(m·n/k); computational complexity is reduced from O(m·n) to O(m·n·log k / k).
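A quick worked example (assuming the 32 MB Google LSTM figure from earlier and block size k = 16; these particular numbers are illustrative):

```latex
\text{storage: } 32\,\text{MB} \times \tfrac{1}{k} = 2\,\text{MB},
\qquad
\text{compute: } \frac{O(m n)}{O(m n \log_2 k / k)}
  = \frac{k}{\log_2 k} = \frac{16}{4} = 4\times \text{ fewer operations.}
```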
13
Model Training: Varying the Circulant Block Size k
The block size k trades compression ratio against accuracy; training outputs the corresponding inference models (e.g., block size = 8, block size = 16).
14
C-LSTM Hardware Level: Hardware Optimization & Implementation
Hardware optimizations: circulant convolution, activation functions, and scheduling & pipelining.
Automatic synthesis toolchain: operator scheduling → C/C++ operator templates → code generator → synthesis backend.
15
Circulant Convolution
Local memory promotion: the intermediate buffers of the FFT-accelerated circulant convolution (FFT → multiply-accumulate → IFFT) are promoted to the on-chip BRAMs of the FPGA.
Redundancy elimination: for real-valued inputs the outputs of the FFT are mirrored (conjugate-symmetric), so about half of the outputs can be eliminated.
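The redundancy elimination corresponds to computing a real-input FFT; a minimal numpy sketch (numpy's rfft/irfft keep only the k/2 + 1 non-mirrored bins):

```python
import numpy as np

def circulant_conv_rfft(w, x):
    """Circular convolution of two real length-k vectors using only
    the non-redundant half of the spectrum (k//2 + 1 complex bins)."""
    k = len(x)
    return np.fft.irfft(np.fft.rfft(w) * np.fft.rfft(x), n=k)
```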
16
Circulant Convolution
FFT/IFFT decoupling: the block circulant convolution is transformed from its coupled form, where each block product performs its own FFT and IFFT, into a decoupled form, where the FFT and IFFT stages are separated from the frequency-domain accumulation and can be shared across blocks.
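The algebra behind the decoupled form is the linearity of the (I)FFT: accumulation can happen in the frequency domain, with one IFFT per output block instead of one per block product. A small numpy check (illustrative shapes, not the hardware mapping):

```python
import numpy as np

rng = np.random.default_rng(1)
k, q = 8, 3
W = rng.standard_normal((q, k))   # defining vectors of one block row
x = rng.standard_normal((q, k))   # input vector split into q blocks

# Coupled form: an IFFT inside every block product, then a sum.
coupled = sum(np.fft.ifft(np.fft.fft(W[j]) * np.fft.fft(x[j]))
              for j in range(q)).real

# Decoupled form: accumulate spectra first, run a single IFFT.
Y = sum(np.fft.fft(W[j]) * np.fft.fft(x[j]) for j in range(q))
decoupled = np.fft.ifft(Y).real

assert np.allclose(coupled, decoupled)
```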
17
Activation Functions: Piecewise Linearization
Sigmoid and tanh are approximated by piecewise-linear functions; 22 segments keep the error rate below 1%.
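A minimal software sketch of piecewise linearization (the slides do not give the segment breakpoints, so uniform segments over an assumed clipping range of [-8, 8] are used here):

```python
import numpy as np

def make_pwl(fn, lo=-8.0, hi=8.0, segments=22):
    """Approximate fn with `segments` linear pieces on [lo, hi];
    inputs outside the range are clipped to the saturated ends."""
    xs = np.linspace(lo, hi, segments + 1)  # segment breakpoints
    ys = fn(xs)                             # exact values at breakpoints
    return lambda x: np.interp(np.clip(x, lo, hi), xs, ys)

pwl_sigmoid = make_pwl(lambda x: 1.0 / (1.0 + np.exp(-x)))
pwl_tanh = make_pwl(np.tanh)
```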
18
LSTM Pipelining: Putting It All Together
The circulant convolution operators (×) and the element-wise operators (tanh, +, σ) have very different computation intensities. If all operators get the same parallelization degree, stage throughputs (tuples / clock cycle) diverge, e.g., 8 / 8 = 1 for the convolution operators versus 8 / 1 = 8 for the element-wise ones, so the faster operators sit idle and their resources are wasted.
19
Operator Scheduler · Coarse & Fine-grained Pipeline
Step 1: partition the LSTM module into 3 stages.
Step 2: build a fine-grained pipeline for each stage (II = 1).
Step 3: add double buffers between each pair of concatenated stages.
Step 4: adjust the parallelism degrees to maximize throughput under the resource constraint (a toy sketch of this step follows).
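A toy sketch of step 4 under a deliberately simplified cost model (resource usage proportional to parallelism degree; both the model and the function are hypothetical, not the paper's scheduler): greedily widen the bottleneck stage until the budget is exhausted.

```python
def balance_degrees(work, budget):
    """work[i]: operations per input tuple in stage i.
    Returns parallelism degrees that minimize the slowest stage's
    cycles-per-tuple, assuming resources ~ sum of degrees."""
    degrees = [1] * len(work)
    while sum(degrees) < budget:
        # Find the current bottleneck (most cycles per tuple).
        i = max(range(len(work)), key=lambda j: work[j] / degrees[j])
        degrees[i] += 1
    return degrees

# Example: one heavy circulant-convolution stage and two light
# element-wise stages, under a budget of 16 parallel lanes.
print(balance_degrees([84, 8, 8], 16))   # most lanes go to stage 0
```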
20
Scheduling Result: Execution Timeline
The three stages, separated by double buffers, form a coarse-grained pipeline: while stage 1 works on input n+2, stage 2 works on input n+1 and stage 3 on input n, so successive inputs (input 1, input 2, input 3, ...) overlap across the stages over time.
21
Synthesis Backend
22
Experimental Setup
Comparison baseline: ESE (FPGA'17)
LSTM architecture: Google LSTM
Benchmark suite: TIMIT
FPGA platforms
Software tools: SDAccel
23
Experimental Results: Performance
Throughput (frames per second): up to 21.2× speedup over the ESE baseline. Execution latency: up to 7.0× reduction.
24
Experimental Results: Energy Efficiency
Energy efficiency (frames per second per watt): up to 33.5× improvement. PER degradation: comparable to the baseline (phone error rate on TIMIT).
25
Conclusion
Circulant matrix compression reduces both storage and computational complexity; the accuracy degradation is small and acceptable; the compressed circulant matrix is dense and hardware friendly.
The LSTM inference engine created by C-LSTM achieves > 10× throughput speedup, > 4× latency reduction, and > 19× energy efficiency speedup, with < 1.3% accuracy degradation.
26
Thank you!