Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs


Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs
Yao Chen(1), Jiong He(1), Xiaofan Zhang(2), Cong Hao(2), Deming Chen(1,2)
(1) Advanced Digital Sciences Center, Singapore
(2) University of Illinois at Urbana-Champaign, IL, USA

Outline
- Background, challenges and motivation
- Cloud-DNN overview
- Synthesizable library design
- System architecture generation
- Cloud-DNN framework in detail
- Evaluations
- Conclusion and future work

Background
- DNNs are driving the next-generation AI revolution
- Many applications are latency-sensitive, and DNNs are computationally intensive
- CPUs/GPUs are commonly used and easy to program, but CPUs may not deliver the required performance and GPUs are very power-hungry
- ASICs can deliver high performance and energy efficiency, but suffer from long time-to-market and little flexibility
- FPGAs offer a low-latency, energy-efficient, reconfigurable platform for both cloud and embedded scenarios, but are hard to program

Background: FPGA in the cloud
Cloud workloads bring high-definition pictures/videos, streaming input, and big data volumes, under implementation constraints such as memory and logic resources.

Challenges and our solutions
- Difficult to program for high performance and fast deployment. Our solution: a parameterized C++ template library for high-performance accelerator IP generation, with HLS implementation and optimization.
- Difficult to implement without timing violations. Our solution: split the network into sub-nets with architectural considerations.
- Difficult to support both hardware and software. Our solution: APIs for accelerator control and automated software generation.

Cloud-DNN: an open-source framework targeting cloud FPGAs
- Our tool chain maps DNN models to cloud implementations
- Synthesizable function library: optimized for performance and suitable for the HLS design flow
- Automated generation flow: design space exploration (DSE) for optimization, plus API and call-function generation
- Avoids tedious hardware programming and verification; explores the optimal hardware configuration
- A system-level solution for DNN applications


Cloud-DNN Overview
The flow proceeds in four stages: Extract, Analyze, Generate, and Implement. It is built on three components: (1) the synthesizable library, (2) the system architecture, and (3) the generation flow.


Layer function IPs
Layer functions are grouped into Tensor-to-Tensor (TT) operations (Conv, Pooling, InnerProduct, DeConv, BN, ...) and Element-wise (EW) operations (ReLU, Sigmoid, Tanh, ...), illustrated on a sample pipeline of CONV_1, ReLU_1, POOL_1, and FC_1 layers.

Layer accelerator template
- C++ template classes, easily configured with template parameters
- Fully synthesizable and flexible
- Element-wise functions are implemented as in-line functions
- Tensor-to-tensor functions are implemented as layer templates

Sub-net template
Each sub-net is mapped to one of the FPGA's dies (Die0, Die1, Die2).

Synthesizable library
Template classes and parameters. Examples:
Conv_acc<DataType, Tm_0, Tn_0, Tr_0, Tc_0> convAcc0;
Conv_acc<DataType, Tm_1, Tn_1, Tr_1, Tc_1> convAcc1;
Pool_acc< …
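The transcript shows only the instantiation lines above, not the template body. As an illustration, a heavily simplified sketch of what such a parameterized convolution template might look like is given below. The member names, the `run` method, and the fixed 3x3 kernel are assumptions for illustration, not the actual Cloud-DNN interface; the key idea is that the tile sizes Tm/Tn/Tr/Tc are compile-time template parameters, so an HLS tool can fully unroll and pipeline the tile loops.

```cpp
#include <cstddef>

// Minimal sketch of a parameterized layer accelerator template in the
// style of the Conv_acc examples above (hypothetical interface).
// Tm: output-channel tile, Tn: input-channel tile,
// Tr x Tc: output tile size; K is fixed to 3 here for brevity.
template <typename DataType, int Tm, int Tn, int Tr, int Tc>
struct Conv_acc {
    static constexpr int K = 3;

    // Compute one output tile: Tm output channels over a Tr x Tc
    // window, accumulating over Tn input channels and a K x K kernel.
    void run(const DataType in[Tn][Tr + K - 1][Tc + K - 1],
             const DataType w[Tm][Tn][K][K],
             DataType out[Tm][Tr][Tc]) {
        for (int m = 0; m < Tm; ++m)
            for (int r = 0; r < Tr; ++r)
                for (int c = 0; c < Tc; ++c) {
                    DataType acc = 0;
                    for (int n = 0; n < Tn; ++n)
                        for (int i = 0; i < K; ++i)
                            for (int j = 0; j < K; ++j)
                                acc += w[m][n][i][j] * in[n][r + i][c + j];
                    out[m][r][c] = acc;
                }
    }
};
```

Because every tile dimension is known at compile time, two instantiations such as `Conv_acc<float, 32, 4, 14, 14>` and `Conv_acc<float, 16, 8, 28, 28>` synthesize into two differently sized hardware IPs from the same source.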

Accelerator models
Resource and performance models are built for the convolution and pooling accelerators.
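The transcript does not reproduce the models' equations. As a hedged sketch, a tile-based latency estimate of the kind commonly used for HLS convolution accelerators (e.g., in the style of Zhang et al. [20]) can be written as follows; the function name and the assumption of one cycle per pass over an unrolled Tm x Tn multiplier array are illustrative, not the exact Cloud-DNN model.

```cpp
#include <cstdint>

// Hedged sketch of a tile-based performance model for a convolution
// accelerator. M/N: output/input feature-map channels, R/C: output
// rows/cols, K: kernel size; Tm/Tn: channel tile sizes that are
// unrolled in hardware. The model assumes one cycle per pass over the
// Tm x Tn multiplier array, with the remaining loops sequential.
static std::uint64_t ceil_div(std::uint64_t a, std::uint64_t b) {
    return (a + b - 1) / b;
}

std::uint64_t conv_cycles(std::uint64_t M, std::uint64_t N,
                          std::uint64_t R, std::uint64_t C,
                          std::uint64_t K,
                          std::uint64_t Tm, std::uint64_t Tn) {
    return ceil_div(M, Tm) * ceil_div(N, Tn) * R * C * K * K;
}
```

A design-space explorer can sweep (Tm, Tn) pairs, keep those whose DSP and BRAM estimates fit the die, and pick the pair minimizing `conv_cycles`.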


System architecture generation
- One sub-net per die
- Controllable connections between dies during system generation
- Shell integration, with DMA enabled
- Physical constraints allocate the modules to different locations

API function support
- The host processor manages the accelerator status
- The application is shielded from hardware details
- The necessary data-movement functions are provided
- Implemented as lightweight functions
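The transcript does not show the API itself; the sketch below only illustrates the host-side control flow such lightweight functions might expose. All names here (dma_write, acc_start, acc_wait, dma_read, run_inference) are hypothetical, not the real Cloud-DNN API, and the "accelerator" is stubbed with a byte-increment over an in-memory buffer so the flow can run on a plain CPU.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

namespace fake_fpga {
// Stand-in for the FPGA's DDR, so the host flow below is executable.
std::vector<std::uint8_t> dev_mem(256);

void dma_write(std::size_t off, const void* src, std::size_t len) {
    std::memcpy(dev_mem.data() + off, src, len);
}
void dma_read(std::size_t off, void* dst, std::size_t len) {
    std::memcpy(dst, dev_mem.data() + off, len);
}
void acc_start() {
    // Stand-in for kicking off the accelerator: adds 1 to every byte.
    for (auto& b : dev_mem) b = static_cast<std::uint8_t>(b + 1);
}
void acc_wait() { /* would poll a "done" register on real hardware */ }
}  // namespace fake_fpga

// Typical host flow: the application never touches hardware details,
// it only moves data in, starts the accelerator, waits, and reads back.
void run_inference(const std::uint8_t* in, std::uint8_t* out,
                   std::size_t n) {
    fake_fpga::dma_write(0, in, n);
    fake_fpga::acc_start();
    fake_fpga::acc_wait();
    fake_fpga::dma_read(0, out, n);
}
```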


Cloud-DNN in detail
A configuration file specifies the accelerators: Acc_num, Acc_type, and Acc_parameters.
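No concrete configuration file appears in the transcript; a hypothetical fragment using only the three fields named on this slide might look like the following (the syntax and all values are illustrative, not the actual Cloud-DNN format):

```
Acc_num: 2
Acc_type: Conv, Pool
Acc_parameters: Tm=32, Tn=4, Tr=14, Tc=14
```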


Evaluations
Experimental setup:
- Local cloud: UltraScale+ VCU118 platform (Xilinx VU9P FPGA chip)
- AWS cloud: f1.2xlarge instance (Xilinx VU9P FPGA chip)
VU9P resources: Logic 2585K, BRAM 75.9 Mb, DSP48 6840, URAM 270 Mb
Resource evaluation results are reported for Sub-net_0.

Comparison with software implementations
- CPU platform: Xeon E5-2430 with 32 GB DDR
- GPU platform: Xeon E5-2609 + Pascal Titan X GPU
- Results: 2.18x faster than the GPU and 104x faster than the CPU

Evaluations
Comparison with other designs: speedups of 2.7x, 1.6x, 6.6x, 11.2x, and 2.6x over prior designs.
[5] Jiantao Qiu et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proc. of FPGA, 2016.
[20] Chen Zhang et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proc. of FPGA, 2015.
[26] Chen Zhang et al. Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. In Proc. of ICCAD, 2016.
[27] Yijin Guan et al. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In Proc. of FCCM, 2017.
[28] Yufei Ma et al. An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks. In Proc. of FPL, 2017.


Conclusion & future work
- An open-source framework including a synthesizable layer-function library, automated structural optimization, and code generation
- A more energy-efficient solution than competing designs
- Enables fast development of high-performance, power-efficient on-cloud DNN acceleration systems
Future work:
- More functional support (e.g., RNNs)
- Performance enhancement
- More framework support: TensorFlow, MXNet, etc.
- Extension of the design methodology: RTL-HLS hybrid, etc.
https://github.com/microideax/Open-Dnn.git

Thank you! Q&A