Overcoming Resource Underutilization in Spatial CNN Accelerators

Presentation transcript:

Overcoming Resource Underutilization in Spatial CNN Accelerators
Yongming Shen, Michael Ferdman, Peter Milder
COMPAS Lab, Stony Brook University

CNN on FPGAs
Convolutional Neural Networks (CNNs)
- Best-known method for object recognition [Simonyan, ICLR2015]
- Massive amount of computation
- Highly parallel, amenable to hardware acceleration
FPGA-based CNN Acceleration
- Lots of compute units (DSP slices) to exploit parallelism
- Reconfigurable fabric: CNN-tailored caches and dataflow
- Energy efficient
[Figure: an input image passes through Conv. Layer 0, Conv. Layer 1, and Conv. Layer 2 and is classified as "DOG"]
Takeaway: CNNs are highly parallel and amenable to FPGA acceleration
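For reference, a convolutional layer is just a deep loop nest of multiply-accumulates. The sketch below is a plain NumPy version (stride 1, no padding, illustrative only), not the accelerator's dataflow; it shows that every (output map, output position) result is independent, which is the parallelism an FPGA compute array exploits.

```python
import numpy as np

def conv_layer(inp, weights):
    """Direct convolution: inp is (N, H, W) input feature maps,
    weights is (M, N, K, K); returns (M, H-K+1, W-K+1) output maps."""
    N, H, W = inp.shape
    M, _, K, _ = weights.shape
    out = np.zeros((M, H - K + 1, W - K + 1))
    for m in range(M):                  # output feature maps   (independent)
        for r in range(H - K + 1):      # output rows           (independent)
            for c in range(W - K + 1):  # output columns        (independent)
                for n in range(N):      # input feature maps    (reduction)
                    for i in range(K):
                        for j in range(K):
                            out[m, r, c] += weights[m, n, i, j] * inp[n, r + i, c + j]
    return out
```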

The Underutilization Problem
State of the art: whole FPGA → one processor [Zhang, FPGA2015]
- Single-CLP (Convolutional Layer Processor) design
Dimension mismatch causes dynamic underutilization
- Part of the compute array becomes idle
- Only 74% average utilization with AlexNet on a Virtex-7 485T
- Less than 30% average utilization on larger FPGAs (~10K DSP slices)
[Figure: one CLP fills the FPGA; utilization is 100% on Conv. Layer 0 but only 40% on Conv. Layer 1 and 45% on Conv. Layer 2]
Takeaway: "one size fits all" → low dynamic arithmetic-unit utilization
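Concretely, the idle cycles come from rounding a layer's feature-map counts up to the compute-array dimensions. The sketch below is a minimal utilization model assuming the Tm x Tn (output-map x input-map) tiling of the CLP in [Zhang, FPGA2015]; the layer and tile sizes are illustrative, not the paper's design points.

```python
from math import ceil

def clp_utilization(M, N, Tm, Tn):
    """Dynamic utilization of a Tm x Tn multiplier array on a layer with
    M output feature maps and N input feature maps. The array works on
    Tm x Tn (output map, input map) pairs at a time, so a layer needs
    ceil(M/Tm) * ceil(N/Tn) passes; a partially filled pass leaves
    multipliers idle."""
    useful = M * N                                   # MACs that do real work
    issued = Tm * Tn * ceil(M / Tm) * ceil(N / Tn)   # MAC slots occupied
    return useful / issued

# A CLP sized for a large layer is fully used there, but mostly idle on a
# small early layer with only 3 input feature maps (illustrative sizes).
print(clp_utilization(256, 224, Tm=64, Tn=7))  # 1.0
print(clp_utilization(48, 3, Tm=64, Tn=7))     # ~0.32
```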

Our Multi-CLP Solution
Single large CLP → multiple smaller CLPs
- Total DSP-slice consumption remains the same
- Each CLP optimized for a subset of layers
- Multiple images in flight at the same time, which avoids data dependencies between CLPs
- AlexNet on Virtex-7 485T (simulation): 74% util. → 97% util., 1.3x speedup
- AlexNet on FPGAs with ~10K DSP slices (model): 30% util. → 96% util., 3.2x speedup
[Figure: the single large CLP on the FPGA is replaced by three smaller CLPs (CLP0, CLP1, CLP2)]
Takeaway: specialization → ~100% dynamic arithmetic-unit utilization
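To make the pipelining argument concrete, here is a minimal compute-bound throughput model, a sketch rather than the paper's optimizer: the layer shapes, tile sizes, and 1024-multiplier budget are illustrative assumptions, and memory stalls are ignored. It shows why sizing each CLP for its own layers, with several images in flight, raises bottleneck-limited throughput even though the total DSP count is unchanged.

```python
from math import ceil

def conv_cycles(M, N, R, C, K, Tm, Tn):
    """Cycles for one conv layer on a CLP that computes a Tm x Tn tile of
    (output map, input map) pairs per cycle: M/N feature maps, R x C
    output positions, K x K kernels. Compute-bound model only."""
    return ceil(M / Tm) * ceil(N / Tn) * R * C * K * K

# Illustrative AlexNet-like layers: (M, N, R, C, K).
layers = [
    (96, 3, 55, 55, 11),
    (256, 96, 27, 27, 5),
    (384, 256, 13, 13, 3),
]

# Single CLP using all 1024 multipliers (128 x 8), running layers in sequence.
single = sum(conv_cycles(*l, Tm=128, Tn=8) for l in layers)

# Three smaller CLPs using the same 1024 multipliers in total
# (48*3 + 64*10 + 48*5 = 1024), each sized for its own layer. With several
# images in flight the CLPs form a pipeline, so throughput is set by the
# slowest CLP rather than by the sum over layers.
multi = max(
    conv_cycles(*layers[0], Tm=48, Tn=3),   # 144 multipliers
    conv_cycles(*layers[1], Tm=64, Tn=10),  # 640 multipliers
    conv_cycles(*layers[2], Tm=48, Tn=5),   # 240 multipliers
)

print(f"single-CLP: {single} cycles/image, multi-CLP: {multi} cycles/image")
print(f"speedup: {single / multi:.2f}x")  # ~1.3x for these made-up sizes
```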