Xuechao Wei, Peng Zhang, Cody Hao Yu, and Jim Wu

Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on AWS F1 FPGA
Xuechao Wei, Peng Zhang, Cody Hao Yu, and Jim Wu
Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China
Computer Science Department, University of California, Los Angeles, CA, USA
Falcon Computing Solutions, Inc., Los Angeles, CA, USA

Falcon Computing Solutions
- An early-stage company focused on FPGA-based acceleration solutions, with offices in Santa Clara, Los Angeles, and Beijing
- Vision: provide seamless acceleration solutions that deliver high performance and energy efficiency for compute-intensive applications, on-premises or in the cloud
- Leveraging years of research under co-founder Dr. Jason Cong, Chancellor's Professor and Director of the Center for Domain-Specific Computing at UCLA
- Raised more than $10M in venture funding in the past 2 years
- Executive team from Intel, Altera, Xilinx, Synopsys, and Magma
- >30 years in the FPGA industry, >30 years of university research

DNN Design Challenges on FPGAs
- High performance and throughput DNN architecture
- Maximum resource utilization and frequency on SSI devices
- Irregularities in different DNN layers
- Design portability across different DNN models and FPGA devices

Systolic Array Architecture

Stacked Systolic Array Architecture

Two-Phase Design Space Exploration

    for(i = 0; i < 128; i++)
      for(o = 0; o < 192; o++)
        for(c = 0; c < 13; c++)
          for(r = 0; r < 13; r++)
            for(p = 0; p < 3; p++)
              for(q = 0; q < 3; q++)
                out[o][r][c] += w[o][i][p][q] * in[i][r+p][c+q];

- Determine the single systolic array structure
- Determine the number of systolic arrays
- Stream buffer management

Programming Model

    for(i = 0; i < 128; i++)
      for(o = 0; o < 192; o++)
        for(c = 0; c < 13; c++)
          for(r = 0; r < 13; r++)
            for(p = 0; p < 3; p++)
              for(q = 0; q < 3; q++)
                out[o][r][c] += w[o][i][p][q] * in[i][r+p][c+q];

Merlin Compiler
- Pure C/C++ based flow enabling SW programmers to develop FPGA-accelerated applications
- Highly integrated flow with automatic optimization, greatly improving productivity
- Advanced code transformation delivering highest QoR without FPGA expertise
[Figure: Merlin Compiler flow. Labels: C/C++, kernel code to accelerate, GCC, CPU, Merlin Compiler, Merlin optimization library, FPGA binary, FPGA]

Experiment Result
- Baseline: single-layer implementation on AWS F1
- Irregularity: stacked systolic array without floorplanning gives higher computation efficiency
- Frequency: stacked systolic array with floorplanning gives higher clock frequency

Summary
- A low-latency DNN accelerator design based on stacked systolic arrays, achieving 2 TOPS on the AWS F1 FPGA
- An automated algorithm that partitions resources between systolic arrays and FPGA dies for multiple DNN layers, achieving 90% resource utilization and 240 MHz
- An end-to-end, push-button automation flow from high-level C code to FPGA-accelerated DNNs in datacenters

THANK YOU!