Streaming Architecture for Large-Scale Quantized Neural Networks on an FPGA-Based Dataflow Platform. Authors: Chaim Baskin, Natan Liss, Evgenii Zheltonozhskii, Alex M. Bronstein and Avi Mendelson. IPDPS-RAW 2018. Hello, my name is Chaim; I am a PhD student at the Technion. I am happy to present this joint work with my co-authors.

List of Topics: Motivation · Background · Design Methodologies for FPGA-Based Systems · Streaming Architecture for QNNs · Evaluation of the Proposed Architecture · Conclusions

Research Motivation: Deep neural networks (DNNs) are widely used by applications that run on a range of computer architectures. Currently, most DNN applications run on multiprocessors and/or GPUs, which require a lot of power but deliver very fast inference and training. FPGAs are good candidates for replacing GPUs: they are more energy efficient, but have slower memories and limited floating-point performance. (I think there is no special need to explain why DNNs are so popular and widely used in fields such as computer vision, NLP, and autonomous vehicles.)

Research Motivation: To reduce the number of memory accesses and fit bigger NNs on FPGAs, we want to compress the network as much as possible. Using binarized neural networks (BNNs) is one of the proposed solutions for implementing DNNs on FPGAs.

List of Topics: Motivation · Background · Design Methodologies for FPGA-Based Systems · Streaming Architecture for QNNs · Evaluation of the Proposed Architecture · Conclusions

Background - Neural Networks: An NN needs to be trained before it can be used. Training involves significantly more computation (data centers) and time than using the network (inference). Inference can potentially run on a wide range of devices with smaller computational abilities.

Background - Convolutional Neural Networks

Background - Convolutional Neural Networks: AlexNet. This architecture is the baseline for many NN techniques; it was the first major success of NNs on ImageNet, with roughly 60M parameters.

Background - Convolutional Neural Networks: ResNet-18. Based on skip-connection blocks; state-of-the-art ImageNet classification accuracy has been achieved with this type of CNN.

Background - Quantization: It has been shown that network parameters contain a lot of redundant information. DL frameworks such as TensorFlow, Torch, and Caffe have added support for reduced precision (FP16 and INT8), and the accuracy of reduced-precision networks is comparable to full precision. There are different methods of quantization; the most basic one is: Q(x, bw) = Clip(round(x / bw) × bw, min, max)
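As an illustration, a minimal C sketch of this quantizer (bw is the quantization step; min and max are the clipping bounds):

    #include <math.h>

    /* Q(x, bw) = Clip(round(x / bw) * bw, min, max):
       round x to the nearest multiple of the step bw, then clip. */
    static float quantize(float x, float bw, float min, float max) {
        float q = roundf(x / bw) * bw;  /* round to the nearest step */
        if (q < min) q = min;           /* clip from below */
        if (q > max) q = max;           /* clip from above */
        return q;
    }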

Background - Binarized Neural Networks: Binarization is the extreme case of quantization (each parameter is 1 bit). In this manner we can avoid multiplication operations. Multiplication is energy-expensive and time-consuming, and an FPGA has a limited number of multiplier units.

Background - Binarized Neural Networks [figure: a full-precision network compared with its binarized counterpart]

List of Topics: Motivation · Background · Design Methodologies for FPGA-Based Systems · Streaming Architecture for QNNs · Evaluation of the Proposed Architecture · Conclusions

Functional vs. Data Decomposition: Data decomposition assumes the same kernel operates on different data sets at a given time. Functional decomposition assumes a "pipeline" (or dataflow) behavior. From a theoretical point of view, the computational power of the two methods is the same. Most mainstream parallel SW environments, e.g., CUDA, Cilk, and OpenMP, assume data decomposition. The two techniques can be intermixed.

Functional Decomposition: the Maxeler heterogeneous environment. [diagram: CPUs connected over PCIe to Dataflow Engines with on-board memory]

Programming with MaxCompiler [diagram: CPU code (SLiC, MaxelerOS) communicating over PCI Express with the DFE chip and its memory]

Main CPU code (.c):

    #include "MaxSLiCInterface.h"
    #include "Calc.max"
    int *x, *y;
    for (int i = 0; i < DATA_SIZE; i++)
        y[i] = x[i] * x[i] + 30;
    Calc(x, DATA_SIZE);

MyKernel (.java):

    DFEVar x = io.input("x", dfeInt(32));
    DFEVar result = x * x + 30;
    io.output("y", result, dfeInt(32));

Manager (.java):

    Manager m = new Manager("Calc");
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", CPU), link("y", CPU));
    m.createSLiCInterface();
    m.build();

(Note: the C/C++ and Java used here are restricted versions of the languages supported by this environment.)

Programming with MaxCompiler (streaming y to on-board memory) [diagram: as above, with the output y routed to the DFE's local memory]

Main CPU code (.c):

    #include "MaxSLiCInterface.h"
    #include "Calc.max"
    int *x, *y;
    Calc(x, DATA_SIZE);

MyKernel (.java): the same kernel as before.

Manager (.java):

    Manager m = new Manager("Calc");
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", CPU), link("y", LMEM_LINEAR1D));
    m.createSLiCInterface();
    m.build();

Kernel Streaming [animation across slides: the input stream 1, 2, 3, 4, 5 advances one value per tick through the x → (x·x) → (+30) pipeline; as the pipeline fills and drains, the outputs 31, 34, 39, 46, 55 emerge on stream y]
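To make the animation concrete, here is a minimal C model of what the streamed kernel computes, one input per tick (a software sketch, not MaxCompiler code):

    #include <stdio.h>

    int main(void) {
        int x[] = {1, 2, 3, 4, 5};          /* the input stream */
        for (int i = 0; i < 5; i++) {
            int y = x[i] * x[i] + 30;       /* one result per tick once the pipeline is full */
            printf("%d -> %d\n", x[i], y);  /* 1->31, 2->34, 3->39, 4->46, 5->55 */
        }
        return 0;
    }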

List of Topics: Motivation · Background · Design Methodologies for FPGA-Based Systems · Streaming Architecture for QNNs · Evaluation of the Proposed Architecture · Conclusions

Streaming Architecture of QNN on FPGA - Hardware Implementation Overview: For the first time, we implement a full ImageNet classification architecture on an FPGA (the implementation is based on [Hubara et al.]). The weights and activation-function outputs were chosen to be 1 bit and 2 bits wide, respectively.

Streaming Architecture of QNN on FPGA - Hardware Implementation Overview: Pre-trained weights and normalization parameters are stored on the CPU side; the computations required for inference are performed on the DFE side. The streaming architecture allows the current layer to begin computing its output once enough data has accumulated in its internal buffer.

Streaming Architecture of QNN on FPGA - Hardware Implementation Overview: As soon as all the data required to compute a particular output pixel is present, the pixel is computed and passed to the next layer (see the sketch below). Since each layer is represented in the DFE Manager by a single function call, building the network is similar to building a model in high-level frameworks such as TensorFlow, PyTorch, and Caffe.
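As a rough illustration of this buffering scheme (not the authors' implementation; all names and sizes are assumptions), a layer computing a 3×3 convolution can emit an output pixel as soon as three input rows are available in its line buffer:

    /* Sketch: the layer consumes one input pixel per cycle into a small
       line buffer and emits an output pixel as soon as the 3x3 window
       for that position is complete. W and the names are illustrative. */
    #define W 32                   /* assumed feature-map width */

    static int linebuf[3][W];      /* the last three input rows */
    static int row = 0, col = 0;

    /* Called once per arriving pixel; returns 1 when *out_pixel is valid. */
    int layer_push(int in_pixel, int *out_pixel) {
        linebuf[row % 3][col] = in_pixel;
        int ready = (row >= 2 && col >= 2);   /* full 3x3 window present */
        if (ready) {
            int acc = 0;
            for (int r = 0; r < 3; r++)
                for (int c = 0; c < 3; c++)
                    acc += linebuf[(row - 2 + r) % 3][col - 2 + c];  /* weights omitted */
            *out_pixel = acc;
        }
        if (++col == W) { col = 0; row++; }   /* advance the raster scan */
        return ready;
    }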

Streaming Architecture of QNN on FPGA - Hardware Implementation Overview: Due to the compact model size of QNNs, all NN parameters are kept in on-chip memory. Due to computation overlap, both the latency and the initiation interval are small.

Streaming Architecture of QNN on FPGA - Convolution: Execution of the convolution kernel starts with its inputs: weights, global normalization parameters, and feature maps. We replaced the element-wise multiplication of feature maps with their corresponding weights by the XNOR-popcount algorithm.
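For illustration, a generic C sketch of the XNOR-popcount dot product over bit-packed {-1, +1} vectors (the packing layout and names are assumptions, not the paper's kernel code):

    #include <stdint.h>

    /* Dot product of two {-1,+1} vectors stored as packed bits
       (bit = 1 encodes +1, bit = 0 encodes -1), nbits values total:
       matches = popcount(XNOR(a, b)); dot = 2*matches - nbits. */
    static int popcount64(uint64_t v) {
        int c = 0;
        while (v) { v &= v - 1; c++; }       /* clear the lowest set bit */
        return c;
    }

    int xnor_dot(const uint64_t *a, const uint64_t *b, int nwords, int nbits) {
        int matches = 0;
        for (int i = 0; i < nwords; i++)
            matches += popcount64(~(a[i] ^ b[i]));  /* XNOR, then popcount */
        matches -= 64 * nwords - nbits;             /* discard padding bits */
        return 2 * matches - nbits;                 /* back to +/-1 arithmetic */
    }

No multipliers are needed: the whole multiply-accumulate collapses to bitwise XNOR plus a population count, which maps cheaply onto FPGA logic.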

Streaming Architecture of QNN on FPGA - Convolution: All the weights received by the FPGA are represented as 32-bit floating-point numbers. Due to the use of global normalization, the amount of memory required to store the normalization parameters is relatively small.

Streaming Architecture of QNN on FPGA - Convolution

Streaming Architecture of QNN on FPGA - Global Batch Normalization and Activation Function: Pixels at the same position in all feature maps use the same normalization parameters. As shown in FINN [Umuroglu et al.], batch normalization followed by a one-bit activation can be replaced by a threshold function. The n-bit uniform activation (quantization) divides the input range into 2^n equally sized intervals, each mapped to a single output value of the activation function.
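A hedged C sketch of the threshold-based n-bit activation for n = 2 (the threshold values are placeholders; in the real design they would fold in the normalization parameters):

    /* 2-bit uniform activation via thresholds: instead of normalizing
       and then quantizing, compare the raw accumulator against
       2^n - 1 precomputed thresholds. Values below are illustrative. */
    static const int thresholds[3] = {-8, 0, 8};   /* 2^2 - 1 thresholds */

    int activate_2bit(int acc) {
        int level = 0;
        for (int i = 0; i < 3; i++)
            if (acc >= thresholds[i]) level++;     /* count thresholds crossed */
        return level;                              /* output in {0, 1, 2, 3} */
    }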

List of Topics: Motivation · Background · Design Methodologies for FPGA-Based Systems · Streaming Architecture for QNNs · Evaluation of the Proposed Architecture · Conclusions

Methodology: We evaluated two common CNN architectures, AlexNet and ResNet-18. As baselines we used GPU-based systems with an NVIDIA Tesla P100 (12 GB) and a GeForce GTX 1080; we measured performance and power consumption. The FPGA was evaluated on three common data sets: CIFAR-10, ImageNet, and STL-10.

Evaluation of the Proposed Architecture - Power: Our architecture consumes at least 15× less power than the GPU-based solutions.

Evaluation of the Proposed Architecture - Performance: GPUs outperform our implementation on large inputs, but our streaming architecture is still fast enough to meet real-time requirements, achieving more than 60 fps.

Evaluation of the Proposed Architecture - Energy: 4× lower energy consumption (energy = power × run time); e.g., with 15× lower power, even a run time roughly 3.75× longer still yields 4× lower energy.

Evaluation of the Proposed Architecture - Impact of Input Size: Our streaming architecture scales well and effectively utilizes resources on both single and multiple FPGAs. [plots: BRAM, FF, and LUT utilization vs. input size] For example, increasing the input size from 32×32 to 96×96 increases the utilization of all resource types by approximately 5%.

List of Topics: Motivation · Background · Design Methodologies for FPGA-Based Systems · Streaming Architecture for QNNs · Evaluation of the Proposed Architecture · Conclusions

Conclusions: We have shown a streaming architecture for QNNs that scales well to large input sizes and large NNs. For inputs of up to 144×144, resource utilization is small enough to fit in a single Stratix V 5SGSD8 FPGA. The run time is only a few times longer than the GPU's, which lets us speculate that next-generation FPGAs could outperform GPUs in both performance and power/energy consumption. The demonstrated performance is achieved by the streaming architecture itself, without involving off-chip memory.