Overcoming Resource Underutilization in Spatial CNN Accelerators

Presentation transcript:

Overcoming Resource Underutilization in Spatial CNN Accelerators
Yongming Shen, Michael Ferdman, Peter Milder
COMPAS Lab, Stony Brook University

CNN on FPGAs

Convolutional Neural Networks (CNNs)
- Best known method for object recognition [Simonyan, ICLR2015]
- Massive amount of computation
- Highly parallel, amenable to hardware acceleration

FPGA-based CNN Acceleration
- Lots of compute units (DSP slices)
- Exploit parallelism
- Reconfigurable fabric
- CNN-tailored caches and dataflow
- Energy efficient

[Figure: Conv. Layer 0, Conv. Layer 1, Conv. Layer 2; output label "DOG"]

CNNs are highly parallel and amenable to FPGA acceleration

The Underutilization Problem

State-of-the-art: whole FPGA → one processor [Zhang, FPGA2015]
- Single-CLP (Convolutional Layer Processor)

Dimension mismatch causes dynamic underutilization
- Part of the compute array becomes idle
- Only 74% average utilization with AlexNet on Virtex-7 485T
- Less than 30% average utilization on larger FPGAs (~10K DSP slices)

[Figure: a single CLP on the FPGA running Conv. Layer 0, Conv. Layer 1, and Conv. Layer 2; utilization labels shown: 100%, 40%, 45%]

"One size fits all" → low dynamic arithmetic-unit utilization
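
The dimension mismatch can be made concrete with a small model. The following is a minimal sketch, assuming a CLP whose compute array handles `tn` input feature maps and `tm` output feature maps per step, in the style of the design cited as [Zhang, FPGA2015]. The function name, the array size, and the layer sizes are illustrative assumptions, not figures from the slides.

```python
from math import ceil

def clp_utilization(n_in, m_out, tn, tm):
    """Fraction of a tn x tm multiplier array doing useful work on one layer.

    n_in / m_out: input / output feature-map counts of the layer.
    tn / tm: assumed array dimensions (hypothetical parameters).
    """
    # The array covers tn inputs and tm outputs per step, so the layer needs
    # ceil(n_in / tn) * ceil(m_out / tm) steps; partial steps leave units idle.
    steps = ceil(n_in / tn) * ceil(m_out / tm)
    useful = n_in * m_out
    return useful / (steps * tn * tm)

# Hypothetical 7 x 64 array (448 multipliers) and made-up layer shapes
for n_in, m_out in [(3, 48), (48, 128), (7, 64)]:
    print(n_in, m_out, round(clp_utilization(n_in, m_out, 7, 64), 3))
```

In this model a layer with only 3 input feature maps leaves 4 of the 7 input lanes idle, which is the kind of idle compute the slide describes; a layer whose dimensions match the array exactly reaches full utilization.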

Our Multi-CLP Solution

Single large CLP → multiple smaller CLPs
- Total DSP-slice consumption remains the same
- Each CLP optimized for a subset of layers

Multiple images in flight at the same time
- Avoids data dependency between CLPs

AlexNet on Virtex-7 485T (simulation)
- 74% util. → 97% util., 1.3x speedup

AlexNet on FPGAs with ~10K DSP slices (model)
- 30% util. → 96% util., 3.2x speedup

[Figure: FPGA with a single CLP versus FPGA partitioned into CLP0, CLP1, CLP2]

Specialization → ~100% dynamic arithmetic-unit utilization
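
The effect of splitting one CLP into several can be sketched with a toy cycle model. This is a simplified illustration under stated assumptions, not the authors' optimizer: each CLP is a `tn x tm` multiplier array, a layer is a hypothetical tuple `(n, m, r, c, k)` of input maps, output maps, output rows, output columns, and kernel size, and memory bandwidth and pipeline fill are ignored. The two layers are made up and are not AlexNet layers.

```python
from math import ceil

def layer_cycles(layer, tn, tm):
    # Cycles for one layer on a tn x tm array (compute only, assumed model)
    n, m, r, c, k = layer
    return ceil(n / tn) * ceil(m / tm) * r * c * k * k

def layer_macs(layer):
    # Useful multiply-accumulates in the layer
    n, m, r, c, k = layer
    return n * m * r * c * k * k

def utilization(clps):
    """clps: list of ((tn, tm), [layers assigned to that CLP]).

    Returns (arithmetic-unit utilization, cycles per image).
    """
    units = sum(tn * tm for (tn, tm), _ in clps)
    # CLPs work concurrently on different images, so the slowest CLP
    # determines how often a finished image comes out.
    epoch = max(sum(layer_cycles(l, tn, tm) for l in layers)
                for (tn, tm), layers in clps)
    macs = sum(layer_macs(l) for _, layers in clps for l in layers)
    return macs / (units * epoch), epoch

layer_a = (3, 32, 32, 32, 3)   # few input maps, large output (hypothetical)
layer_b = (32, 32, 8, 8, 3)    # many input maps, small output (hypothetical)

single = [((5, 32), [layer_a, layer_b])]           # one CLP, 160 multipliers
multi = [((3, 32), [layer_a]), ((8, 8), [layer_b])]  # 96 + 64 = 160 multipliers

print(utilization(single))
print(utilization(multi))
```

With the same 160 multipliers, the single CLP in this toy example is busy about 70% of the time, while the two specialized CLPs match their layers exactly and finish an image every 9216 cycles instead of every 13248. Real designs also have to balance the work across CLPs and respect on-chip memory and bandwidth limits, which this sketch leaves out.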