1
Streaming Architecture for Large-Scale Quantized Neural Networks on an FPGA-Based Dataflow Platform
Authors: Chaim Baskin, Natan Liss, Evgenii Zheltonozhskii, Alex M. Bronstein, and Avi Mendelson. IPDPS-RAW 2018.
Hello, my name is Chaim; I am a PhD student at the Technion. I am happy to present this joint work with the fellows named above.
2
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for QNN
Evaluation of the Proposed Architecture
Conclusions
3
Research Motivation
Deep Neural Networks (DNNs) are widely used by applications that run on a range of computer architectures.
Currently, most DNN applications run on multiprocessors and/or GPUs: these require a lot of power, but inference and training are very fast.
FPGAs are good candidates for replacing GPUs: they are more energy efficient, but have slower memories and limited floating-point (FP) performance.
(I think there is no special need to explain why DNNs are so popular and widely used in fields such as CV, NLP, autonomous vehicles, etc.)
4
Research Motivation
To reduce the number of memory accesses and fit bigger NNs on FPGAs, we want to compress the network as much as possible.
Using Binarized Neural Networks (BNNs) is one of the proposed solutions for implementing DNNs on FPGAs.
5
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
6
Background - Neural Networks
A NN needs to be trained before it can be used.
Training involves significantly more computation (data centers) and time than using the network (inference).
Inference can potentially run on a wide range of devices with smaller computational capabilities.
7
Background - Convolutional Neural Networks
8
Background - Convolutional Neural Networks: AlexNet
This architecture is the baseline for many NN techniques.
It was the first success of NNs on ImageNet.
About 60M parameters.
9
Background - Convolutional Neural Networks: ResNet-18
Based on skip-connection blocks.
State-of-the-art ImageNet classification accuracy has been achieved with this type of CNN.
10
Background - Quantization
It has been shown that network parameters contain a lot of redundant information.
DL frameworks such as TensorFlow, Torch, and Caffe added support for reduced precision (FP16 and INT8).
The accuracy of reduced-precision networks was comparable to full precision.
There are different quantization methods. The most basic one is:
Q(x, bw) = Clip(round(x / bw) × bw, min, max)
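As an illustration, here is a minimal Java sketch of this quantizer; the method names and the example step/range are ours, not from the slides (bw is the quantization step size):

  // Uniform quantization: round x to the nearest multiple of the step size,
  // then clip the result to the [min, max] range.
  public class UniformQuantizer {
      static double quantize(double x, double bw, double min, double max) {
          double q = Math.round(x / bw) * bw;      // round(x / bw) × bw
          return Math.min(Math.max(q, min), max);  // Clip(..., min, max)
      }

      public static void main(String[] args) {
          System.out.println(quantize(0.37, 0.25, -1.0, 1.0)); // -> 0.25
          System.out.println(quantize(1.90, 0.25, -1.0, 1.0)); // -> 1.0 (clipped)
      }
  }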
11
Background - Binarized Neural Network
Binarization is the extreme case of quantization (each parameter is 1 bit).
In this manner we can avoid multiplication (Mul) operations (see the sketch below).
Mul is energy-expensive and time-consuming, and an FPGA has a limited number of Mul units.
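A minimal Java sketch of the idea (illustrative, not the paper's implementation): a binarized weight is just a sign, so multiplying by it reduces to conditional negation and needs no Mul unit.

  // Binarize a weight to {-1, +1} with the sign function; a product with a
  // binarized weight then needs only conditional negation, not a multiplier.
  public class Binarize {
      static int sign(double w) { return w >= 0 ? 1 : -1; }

      static double mulBinary(double x, int w) {
          return (w == 1) ? x : -x;  // replaces x * w
      }

      public static void main(String[] args) {
          System.out.println(mulBinary(3.5, sign(-0.2))); // -> -3.5
      }
  }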
12
Background - Binarized Neural Network
[Figure: a full-precision network vs. its binarized counterpart]
13
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
14
Functional vs. Data Decomposition
Data decomposition assumes that the same kernel operates on different data sets at a given time.
Functional decomposition assumes a "pipeline" (or dataflow) behavior.
From a theoretical point of view, the computational power of the two methods is the same.
Most mainstream parallel SW environments, e.g., CUDA, Cilk, and OpenMP, assume data decomposition.
The two techniques can be intermixed.
15
Functional Decomposition: The Maxeler Heterogeneous Environment
[Diagram: CPUs connected over PCIe to Dataflow Engines with attached Memory]
16
Programming with MaxCompiler
A MaxCompiler design has three pieces: the host CPU code, the Manager that wires streams between the CPU, the chip, and memory (over PCI Express, through SLiC and MaxelerOS), and the kernel itself, which here computes y = x*x + 30.

Manager (.java):
  Manager m = new Manager("Calc");
  Kernel k = new MyKernel();
  m.setKernel(k);
  m.setIO(link("x", CPU), link("y", CPU));
  m.createSLiCInterface();
  m.build();

CPU Code (.c):
  #include "MaxSLiCInterface.h"
  #include "Calc.max"
  int *x, *y;
  // the original CPU loop being offloaded:
  for (int i = 0; i < DATA_SIZE; i++)
      y[i] = x[i] * x[i] + 30;
  // the call that runs the same computation on the DFE:
  Calc(x, DATA_SIZE);

MyKernel (.java):
  DFEVar x = io.input("x", dfeInt(32));
  DFEVar result = x * x + 30;
  io.output("y", result, dfeInt(32));

Note that the C/C++ and Java used here are restricted versions specific to this environment.
17
Programming with MaxCompiler
The same design, but with the Manager now routing the output stream y to the DFE's on-board memory (LMEM) instead of back to the CPU:

Manager (.java):
  Manager m = new Manager("Calc");
  Kernel k = new MyKernel();
  m.setKernel(k);
  m.setIO(link("x", CPU), link("y", LMEM_LINEAR1D));
  m.createSLiCInterface();
  m.build();

CPU Code (.c):
  #include "MaxSLiCInterface.h"
  #include "Calc.max"
  int *x, *y;
  Calc(x, DATA_SIZE);

MyKernel (.java):
  DFEVar x = io.input("x", dfeInt(32));
  DFEVar result = x * x + 30;
  io.output("y", result, dfeInt(32));
18
Kernel Streaming
[Animation across several slides: the input stream 1, 2, 3, 4, 5 enters the x*x + 30 kernel one element per tick; the squaring and adding stages overlap in time, and the output stream 31, 34, 39, 46, 55 emerges once the pipeline has filled.]
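To make the tick-by-tick behavior concrete, here is a small software model of the streaming kernel in plain Java; the one-register pipeline depth is an assumption for illustration, not the actual DFE latency.

  // Software model of the streaming kernel y = x*x + 30: a two-stage pipeline
  // (square, then add) that consumes one input per tick and, once filled,
  // emits one output per tick, so the stages work on different elements in parallel.
  import java.util.ArrayDeque;
  import java.util.Queue;

  public class StreamSim {
      public static void main(String[] args) {
          int[] xs = {1, 2, 3, 4, 5};
          Queue<Integer> between = new ArrayDeque<>(); // register between stages
          for (int x : xs) {
              between.add(x * x);                      // stage 1: multiplier
              if (between.size() > 1)                  // stage 2 lags one tick
                  System.out.println(between.remove() + 30); // stage 2: adder
          }
          while (!between.isEmpty())                   // drain the pipeline
              System.out.println(between.remove() + 30);
      }
  }

Running it prints 31, 34, 39, 46, 55, matching the animation's output stream.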
28
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
29
Streaming Architecture of QNN on FPGA: Hardware Implementation Overview
For the first time, we implement a full ImageNet classification architecture on FPGAs (the implementation is based on [Hubara et al.]).
The widths of the weights and activation-function outputs were chosen to be 1 bit and 2 bits, respectively.
30
Streaming Architecture of BNN on FPGA: Hardware Implementation Overview
Pre-trained weights and normalization parameters are stored on the CPU side.
The computations required for inference are performed on the DFE side.
The streaming architecture allows the current layer to begin calculating its output once enough data has accumulated in its internal buffer.
31
Streaming Architecture of BNN on FPGA: Hardware Implementation Overview
As soon as all the data required to calculate a particular output pixel is present, that pixel is calculated and passed to the next layer.
Since each layer is represented in the DFE Manager by a single function call, building the network is similar to building one in high-level frameworks such as TensorFlow, PyTorch, or Caffe (see the sketch below).
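As an analogy only (this is not the MaxCompiler API), the following self-contained Java sketch shows the "one call per layer" building style; the scale layer is a hypothetical stand-in for a real convolution layer.

  // Hypothetical illustration: each layer is one function object, and the
  // network is built by wiring the calls in sequence, much like stacking
  // layers in TensorFlow or PyTorch.
  import java.util.Arrays;
  import java.util.function.UnaryOperator;

  public class NetBuilder {
      // A "layer" maps one (flattened) feature map to the next.
      static UnaryOperator<double[]> scale(double s) { // stand-in for a layer
          return in -> {
              double[] out = in.clone();
              for (int i = 0; i < out.length; i++) out[i] *= s;
              return out;
          };
      }

      public static void main(String[] args) {
          UnaryOperator<double[]> layer1 = scale(2.0);
          UnaryOperator<double[]> layer2 = scale(0.5);
          // one call per layer, chained in order:
          double[] y = layer2.apply(layer1.apply(new double[]{1, 2, 3}));
          System.out.println(Arrays.toString(y)); // [1.0, 2.0, 3.0]
      }
  }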
32
Streaming Architecture of BNN on FPGA: Hardware Implementation Overview
Due to the compact model size of QNNs, all NN parameters are kept in on-chip memory.
Due to computation overlap, the latency and the initiation interval are small.
33
Streaming Architecture of BNN on FPGA: Convolution
The execution of the convolution kernel starts with inputs for weights, global normalization parameters, and feature maps.
We replaced the element-wise multiplication of feature maps with their corresponding weights by the XNOR-popcount algorithm (see the sketch below).
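A minimal Java sketch of the XNOR-popcount trick, assuming +1 is encoded as bit 1 and -1 as bit 0 (the encoding and the packing into a 64-bit word are our illustrative choices):

  // Binary dot product via XNOR-popcount: XNOR yields a 1 bit wherever the
  // two signs match, popcount counts the matches, and 2*matches - n recovers
  // the dot product over {-1, +1} values without any multiplications.
  public class XnorPopcount {
      static int binaryDot(long a, long w, int n) {  // n = number of valid bits
          long mask = (n == 64) ? -1L : (1L << n) - 1;
          long xnor = ~(a ^ w) & mask;               // 1 where signs match
          int matches = Long.bitCount(xnor);         // popcount
          return 2 * matches - n;
      }

      public static void main(String[] args) {
          // a = (+1,+1,-1,+1), w = (+1,-1,-1,+1), read from bit 0 up:
          // dot = 1 - 1 + 1 + 1 = 2
          System.out.println(binaryDot(0b1011, 0b1001, 4)); // -> 2
      }
  }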
34
Streaming Architecture of BNN on FPGA: Convolution
All the weights received by the FPGA are represented as 32-bit floating-point numbers.
Due to the use of global normalization, the amount of memory required to store the normalization parameters is relatively small.
35
Streaming Architecture of BNN on FPGA: Convolution
36
Streaming Architecture of BNN on FPGA: Global Batch Normalization and Activation Function
Pixels in the same position across all feature maps use the same normalization parameters.
As shown in FINN [Umuroglu et al.], batch normalization followed by a one-bit activation can be replaced by a threshold function.
The n-bit uniform activation (quantization) divides the input range into 2^n equally sized ranges; each range is mapped to a single output value of the activation function (see the sketch below).
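The following Java sketch illustrates both cases; the threshold value and the input range are illustrative, not taken from the paper.

  // 1-bit case: batch normalization followed by a sign activation collapses
  // into a single comparison against a precomputed threshold (after FINN).
  // n-bit case: a uniform activation splits [lo, hi) into 2^n equal ranges
  // and maps each range to one output level.
  public class ThresholdActivation {
      static int binaryAct(int popcountSum, int threshold) {
          return popcountSum >= threshold ? 1 : 0;   // BN + sign == compare
      }

      static int uniformAct(double x, double lo, double hi, int n) {
          int levels = 1 << n;                       // 2^n output values
          double step = (hi - lo) / levels;          // equal range width
          int bin = (int) Math.floor((x - lo) / step);
          return Math.max(0, Math.min(levels - 1, bin)); // clip to valid bins
      }

      public static void main(String[] args) {
          System.out.println(binaryAct(37, 32));            // -> 1
          System.out.println(uniformAct(0.6, 0.0, 1.0, 2)); // 2-bit: -> bin 2
      }
  }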
37
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
38
Methodology
We evaluated two common CNN architectures: AlexNet and ResNet-18.
As a baseline, we used GPU-based systems with an Nvidia Tesla P100 12GB and a GeForce GTX 1080.
We measured performance and power consumption.
The FPGA was evaluated on three common data sets: CIFAR-10, ImageNet, and STL-10.
39
Evaluation of the Proposed Architecture: Power
Our architecture draws at least 15× lower power than the GPU-based solutions.
40
Evaluation of the Proposed Architecture: Performance
GPUs outperform our implementation on large inputs, but our proposed streaming architecture is still fast enough to meet real-time requirements, achieving more than 60 fps.
41
Evaluation of the Proposed Architecture: Energy Impact
4× lower energy consumption (energy = power × run time).
42
Evaluation of the Proposed Architecture: Impact of Picture Size
Our streaming architecture scales well and can effectively utilize resources on both single and multiple FPGAs.
[Plots: BRAM, FF, and LUT utilization vs. input size.]
For example, increasing the input size from 32×32 to 96×96 increases utilization by approximately 5% for all resource types.
43
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
44
Conclusions
We have presented a streaming architecture for QNNs that scales well to large input sizes and large NNs.
For inputs up to 144×144, resource utilization is small enough to fit a single Stratix V 5SGSD8 FPGA.
The run time is only a few times longer than the GPUs', which allows us to speculate that next-generation FPGAs could outperform GPUs in both performance and power/energy consumption.
The demonstrated performance is achieved thanks to the streaming architecture, without involving off-chip memory.