
REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs
Caiwen Ding2, Shuo Wang1, Ning Liu2, Kaidi Xu2, Yanzhi Wang2, and Yun Liang1
1 CECA, Peking University, China
2 Northeastern University, USA

FPGA Accelerated DNNs

YOLO based Object Detection

YOLO Model for FPGAs
Large Model Size: YOLO model size (32MB)
Heterogeneous Resources: logic blocks, DSP blocks, block RAMs

Parameter Pruning
Hardware unfriendly! Pruning leaves a sparse matrix, stored in CSR format (data plus indices), which leads to:
Unbalanced workload: partitioning the sparse matrix gives a workload of 0 : 2 : 1 : 1
Extra storage footprint: the indices
Irregular memory access: random access is slow
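A minimal sketch of the point above, assuming a made-up 4 x 4 pruned matrix whose rows hold 0, 2, 1 and 1 non-zeros (the values and the one-row-per-processing-element split are illustrative, not taken from the slides). It builds the CSR arrays by hand and counts the work per row and the index overhead.

```python
# Toy 4 x 4 pruned weight matrix; the values are made up for illustration.
dense = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, -1.2, 0.0],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 0.3, 0.0, 0.0],
]


def to_csr(matrix):
    """Convert a dense row-major matrix to CSR: (data, column indices, row pointers)."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0.0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


data, indices, indptr = to_csr(dense)

# Non-zeros per row: the work each processing element gets if rows are split one per element.
work_per_row = [indptr[i + 1] - indptr[i] for i in range(len(dense))]

# Column indices and row pointers are stored in addition to the surviving weights.
index_overhead = len(indices) + len(indptr)
```

Here four surviving weights need nine index entries, and one row has no work at all while another has twice the work of the rest.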

Structured Matrix
Circulant Matrix: a 4 x 4 original matrix (w00 ... w33) is compressed by circulant projection into a 4 x 4 circulant matrix, which is fully defined by a 1 x 4 dense vector (w00 w01 w02 w03).
Block-Circulant Matrix: a 6 x 9 original matrix is structurally compressed into a 2 x 9 dense matrix (w00 w01 w02 w03 w04 w05 ... ; w30 w31 w32 w33 w34 w35 ...).
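The compression can be sketched by going the other way: expanding the stored defining vectors back into the full matrix. The function names and the indexing convention (the defining vector is the first column of each block) are assumptions for illustration; the slides do not fix the orientation. The block size of 3 is inferred from the 6 x 9 and 2 x 9 shapes.

```python
def circulant(vec):
    """Expand a defining vector into a k x k circulant matrix.

    Assumed convention: entry (i, j) is vec[(i - j) % k], so vec is the first column.
    """
    k = len(vec)
    return [[vec[(i - j) % k] for j in range(k)] for i in range(k)]


def block_circulant(defining, k):
    """Expand a p x (q*k) table of defining vectors into a (p*k) x (q*k) matrix."""
    rows = []
    for block_row in defining:
        blocks = [circulant(block_row[c:c + k]) for c in range(0, len(block_row), k)]
        for i in range(k):
            # Row i of the block row is row i of every block, concatenated.
            rows.append([x for blk in blocks for x in blk[i]])
    return rows


# 2 x 9 table of stored weights (placeholder integers) -> 6 x 9 block-circulant matrix.
compressed = [list(range(1, 10)), list(range(11, 20))]
full = block_circulant(compressed, k=3)
stored = sum(len(r) for r in compressed)
expanded = sum(len(r) for r in full)
```

Only 18 numbers are stored for a matrix with 54 entries, a factor of k = 3.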

Circulant Convolution Acceleration
Block-circulant weight matrix x input (x0 x1 x2 x3 x4 x5) = output (y0 y1 y2 y3 y4 y5)
FFT-Accelerated Circulant Convolution (Fast Fourier Transformation): FFT, ∑ and IFFT stages map the input x0 ... x5 to the output y0 ... y5
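A rough sketch of the idea, relying on the standard identity that a circulant matrix-vector product is a circular convolution and can therefore be computed as IFFT(FFT(w) * FFT(x)). Everything here is illustrative: the function names are hypothetical, the first-column convention is an assumption, and a block size of 4 is used only because the small radix-2 FFT below needs a power-of-two length. For a block row, the per-block products are summed in the frequency domain so that a single IFFT is needed.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for t in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * t / n) * odd[t]
        out[t] = even[t] + twiddle
        out[t + n // 2] = even[t] - twiddle
    return out


def circulant_matvec_direct(w, x):
    """Reference O(k^2) product with the circulant matrix whose first column is w."""
    k = len(w)
    return [sum(w[(i - j) % k] * x[j] for j in range(k)) for i in range(k)]


def circulant_matvec_fft(w, x):
    """Same product in O(k log k): element-wise multiply in the frequency domain."""
    k = len(w)
    prod = [a * b for a, b in zip(fft(w), fft(x))]
    return [v.real / k for v in fft(prod, inverse=True)]


def block_row_matvec_fft(block_vectors, x_blocks):
    """One block row: accumulate the products of all blocks, then apply one IFFT."""
    k = len(block_vectors[0])
    acc = [0j] * k
    for w, x in zip(block_vectors, x_blocks):
        acc = [a + fw * fx for a, fw, fx in zip(acc, fft(w), fft(x))]
    return [v.real / k for v in fft(acc, inverse=True)]
```

Calling `circulant_matvec_fft([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 0.0, -1.0])` agrees with the direct product up to floating-point rounding.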

Circulant Convolution Complexity Analysis
An m x n matrix made of k x k circulant sub-matrices is structurally compressed into an m/k x n dense matrix. Hardware friendly!
Storage complexity reduced from O(m·n) to O(m·n/k)
Computational complexity reduced from O(m·n) to O(m·n·log k/k)
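To make the orders of growth concrete, here is a small back-of-the-envelope calculation. The layer size (1024 x 1024) and block size (16) are arbitrary example values, and the operation count ignores constant factors, so it is an order-of-magnitude estimate only.

```python
import math


def dense_cost(m, n):
    """Weights stored and multiplications for a dense m x n matrix-vector product."""
    return {"weights": m * n, "multiplies": m * n}


def block_circulant_cost(m, n, k):
    """Counts for an m x n matrix built from k x k circulant blocks.

    Each block stores k weights and costs on the order of k * log2(k) operations
    with FFT-based convolution (constant factors ignored).
    """
    blocks = (m // k) * (n // k)
    return {"weights": blocks * k, "operations_order": blocks * k * math.log2(k)}


dense = dense_cost(1024, 1024)
compressed = block_circulant_cost(1024, 1024, 16)
storage_reduction = dense["weights"] / compressed["weights"]
compute_reduction = dense["multiplies"] / compressed["operations_order"]
```

With these example values the storage shrinks by a factor of k = 16 and the operation count by k / log2(k) = 4.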

Quantization Techniques Overview
Fixed Bitwidth (ICLR'16)
Ternary Bitwidth (NIPS'16)
Binary Bitwidth (ECCV'16)
Power of Two (ICCV'15)
Categories: Equal Distance, Non-Equal Distance
Our Work: REQ-YOLO (FPGA'19)

REQ-YOLO Framework
ADMM based Training: Structured Compression, Mixed Distance Quantization
FPGA-friendly Inference Acceleration: Hardware Optimization, Automatic Synthesis Toolchain
From YOLO Architecture Specification to Optimized FPGA Implementation

Data Quantization Approaches
Equal Distance (levels 001 010 011 100 101, equal distances): High Accuracy, Complex Multiplication
Power of Two (levels 0001 0010 0100 1000, exponential distances): Low Accuracy, Simple Multiplication (Shift)
We propose Mixed Distance quantization: combines equal + exponential distances, resource-aware, decent accuracy, better hardware utilization
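The trade-off between the two approaches can be sketched with two small level sets. The level grids, the sample weight 0.7 and the integer activation 96 are made-up illustration values, not figures from the slides.

```python
def quantize(value, levels):
    """Map a value to the nearest quantization level."""
    return min(levels, key=lambda q: abs(q - value))


# Equal distance: evenly spaced levels 0.125, 0.25, ..., 1.0.
equal_levels = [i / 8 for i in range(1, 9)]
# Power of two: exponentially spaced levels 1, 0.5, 0.25, ..., 2^-7.
pow2_levels = [2.0 ** -e for e in range(8)]

weight = 0.7
q_equal = quantize(weight, equal_levels)  # small error, but needs a real multiplier
q_pow2 = quantize(weight, pow2_levels)    # larger error near 1.0, where levels are sparse

# Multiplying an integer activation by 2^-e is just a right shift by e bits.
activation = 96
shifted = activation >> 2  # same as activation * 2^-2
```

The equal-distance grid lands closer to the original weight, while the power-of-two level replaces the multiplication with a shift.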

Mixed Distance Quantization
More balanced! Mixed distances, compared with exponential distances (0001 0010 0100 1000)
Mixed Distance Encoding (example: sign 1, primary 0011, secondary 10)
Sign bit
Primary bits for coarse-grained offsets
Secondary bits for fine-grained offsets
Simpler hardware!! Shift 2 bits, shift 1 bit, then an addition
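A sketch of why this is cheap in hardware, under a loudly stated assumption: the weight magnitude is taken to be the sum of two powers of two, 2^-p + 2^-s, with a separate sign bit. The exact bit layout on the slide is not recoverable from the transcript, so the function and its parameters are hypothetical. Under that assumption a multiplication becomes two shifts and one addition.

```python
def multiply_mixed(x, sign, p, s):
    """Multiply integer x by sign * (2^-p + 2^-s) using two shifts and an add.

    Hypothetical encoding for illustration: sign is 0 (positive) or 1 (negative),
    p and s are the two shift amounts. Right shifts floor, so the result is exact
    only when x is a multiple of 2^max(p, s).
    """
    magnitude = (x >> p) + (x >> s)
    return -magnitude if sign else magnitude


activation = 96
positive = multiply_mixed(activation, sign=0, p=1, s=2)  # 96 * (0.5 + 0.25)
negative = multiply_mixed(activation, sign=1, p=1, s=2)  # 96 * -(0.5 + 0.25)
```

No multiplier is needed: the weight 0.75 is applied as a 1-bit shift, a 2-bit shift and an addition.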

Resource-Aware Quantization
Equal Distance alone: bottleneck
Mixed Distance alone: bottleneck
Layer-by-Layer Resource-Aware Quantization: combines equal distance and mixed distance across layers

Resource & Accuracy Aware Quantization

Training Approaches
ADMM based Training Framework
ADMM: Alternating Direction Method of Multipliers
Consider the optimization problem, rewrite it, and decompose it into two subproblems
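The equations on this slide did not survive in the transcript. For orientation, the textbook scaled form of ADMM is sketched here; the symbols (W for the weights, Z for the auxiliary copy, U for the scaled dual variable, f for the training loss, g for the indicator function of the constraint set S, and the penalty ρ) are assumptions and may differ from the notation on the slide.

```latex
% Original problem: minimize the loss subject to a constraint set S
\min_{W} \; f(W) + g(W), \qquad
g(W) = \begin{cases} 0 & \text{if } W \in S \\ +\infty & \text{otherwise} \end{cases}

% Rewritten with an auxiliary variable Z
\min_{W, Z} \; f(W) + g(Z) \quad \text{subject to } W = Z

% Decomposed into two subproblems plus a dual update (scaled form)
W^{t+1} = \arg\min_{W} \; f(W) + \frac{\rho}{2} \left\| W - Z^{t} + U^{t} \right\|_2^2
Z^{t+1} = \arg\min_{Z} \; g(Z) + \frac{\rho}{2} \left\| W^{t+1} - Z + U^{t} \right\|_2^2
U^{t+1} = U^{t} + W^{t+1} - Z^{t+1}
```

When g is an indicator function, the second subproblem reduces to a Euclidean projection of W^{t+1} + U^{t} onto S.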

ADMM for Weight Quantization
ADMM based quantization for FFT based acceleration
Perform weight mapping in the weight domain
Higher compression ratio and lower accuracy degradation
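A toy sketch of ADMM-style quantization, not the authors' training code. The loss is replaced by a simple quadratic pulling the weights toward fixed target values, so the weight update has a closed form; in real training that step would be gradient descent on the network loss. The level set, targets and penalty rho are made-up illustration values.

```python
def project(values, levels):
    """Euclidean projection onto the quantization set: nearest level per weight."""
    return [min(levels, key=lambda q: abs(q - v)) for v in values]


def admm_quantize(target, levels, rho=1.0, iterations=20):
    """ADMM for: minimize 0.5 * ||w - target||^2 subject to w in the level set.

    w-update: closed-form minimizer of the quadratic loss plus the penalty term.
    z-update: projection of (w + u) onto the quantization levels.
    u-update: accumulate the remaining gap between w and z (scaled dual variable).
    """
    z = project(target, levels)
    u = [0.0] * len(target)
    w = list(target)
    for _ in range(iterations):
        w = [(t + rho * (zi - ui)) / (1.0 + rho) for t, zi, ui in zip(target, z, u)]
        z = project([wi + ui for wi, ui in zip(w, u)], levels)
        u = [ui + wi - zi for ui, wi, zi in zip(u, w, z)]
    return w, z


levels = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]
weights, quantized = admm_quantize([0.9, -0.3, 0.26, -1.4], levels)
```

For this toy loss the result coincides with projecting the targets directly; the point of the sketch is the structure of the three alternating updates, with the continuous weights drifting toward their quantized copies.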

Experimental Setup
YOLO Architecture: Tiny YOLO
Benchmark Suite: DJI benchmark (IoU), Pascal (IoU)
FPGA Platforms
Software Tools: SDAccel 2017.1
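Both benchmarks are scored with IoU (intersection over union). As a reminder of what that metric measures, here is a minimal sketch; the (x_min, y_min, x_max, y_max) box layout is an assumption, and the benchmarks may use a different box encoding.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap, clamped at zero when the boxes do not intersect.
    overlap_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    overlap_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = overlap_w * overlap_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

A predicted box that matches the ground truth exactly scores 1.0, and disjoint boxes score 0.0.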

Experimental Results (GPU, FPGA, REQ-YOLO)
Summary
Performance: at least 7X higher throughput over GPU implementation; at least 15X higher throughput over previous FPGA implementation
Energy Efficiency: at least 3X higher energy efficiency over GPU implementation; at least 4X higher energy efficiency over previous FPGA implementation

Experimental Results: Resource Utilization
Consistently improved utilization across different FPGA resources

Experimental Results: Accuracy Degradation
Accuracy degradation is within 6%

Conclusion
Resource and Accuracy Aware Quantization: reduces both storage and computational complexity; resource utilization is improved; accuracy degradation is considered
YOLO Inference Engine Created by REQ-YOLO: higher throughput; higher energy efficiency; < 6% accuracy degradation

Thank you!