1
Streaming Architecture for Large-Scale Quantized Neural Networks on an FPGA-Based Dataflow Platform
Authors: Chaim Baskin, Natan Liss, Evgenii Zheltonozhskii, Alex M. Bronstein, and Avi Mendelson. IPDPS-RAW 2018.
Hello, my name is Chaim; I am a PhD student at the Technion. I am happy to present this joint work with the fellows named above.
2
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for QNN
Evaluation of the Proposed Architecture
Conclusions
3
Research Motivation
Deep Neural Networks (DNNs) are widely used by applications that run on a range of computer architectures.
Currently, most DNN applications run on multiprocessors and/or GPUs: these require a lot of power, but inference and training are very fast.
FPGAs are good candidates for replacing GPUs: they are more energy efficient, but have slower memories and limited floating-point (FP) performance.
(I think there is no special need to explain why DNNs are so popular and widely used in fields such as CV, NLP, autonomous vehicles, etc.)
4
Research Motivation
To reduce the number of memory accesses and fit bigger NNs on FPGAs, we want to compress the network as much as possible.
Using Binarized Neural Networks (BNNs) is one of the proposed solutions for implementing DNNs on FPGAs.
5
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
6
Background - Neural Networks
A NN needs to be trained before it can be used.
Training involves significantly more computation (data centers) and time than using the network (inference).
Inference can potentially run on a wide range of devices with smaller computational capabilities.
7
Background - Convolutional Neural Networks
8
Background - Convolutional Neural Networks: AlexNet
This architecture is the baseline for many NN techniques.
It was the first success of NNs on ImageNet.
About 60M parameters.
9
Background - Convolutional Neural Networks: ResNet-18
Based on skip-connection blocks.
State-of-the-art ImageNet classification accuracy has been achieved with this type of CNN.
10
Background - Quantization
It has been shown that network parameters contain a lot of redundant information.
DL frameworks such as TensorFlow, Torch, and Caffe added support for reduced precision (FP16 and INT8).
The accuracy of reduced-precision networks was comparable to full precision.
There are different quantization methods. The most basic one is:
Q(x, bw) = Clip(round(x / bw) × bw, min, max)
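As an illustration, here is a minimal Java sketch of this quantizer; the method names and the example step/range are ours, not from the slides (bw is the quantization step size):

  // Uniform quantization: round x to the nearest multiple of the step size,
  // then clip the result to the [min, max] range.
  public class UniformQuantizer {
      static double quantize(double x, double bw, double min, double max) {
          double q = Math.round(x / bw) * bw;      // round(x / bw) × bw
          return Math.min(Math.max(q, min), max);  // Clip(..., min, max)
      }

      public static void main(String[] args) {
          System.out.println(quantize(0.37, 0.25, -1.0, 1.0)); // -> 0.25
          System.out.println(quantize(1.90, 0.25, -1.0, 1.0)); // -> 1.0 (clipped)
      }
  }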
11
Background - Binarized Neural Network
Binarization is the extreme case of quantization (each parameter is 1 bit).
In this manner we can avoid multiplication (Mul) operations (see the sketch below).
Mul is energy-expensive and time-consuming, and an FPGA has a limited number of Mul units.
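A minimal Java sketch of the idea (illustrative, not the paper's implementation): a binarized weight is just a sign, so multiplying by it reduces to conditional negation and needs no Mul unit.

  // Binarize a weight to {-1, +1} with the sign function; a product with a
  // binarized weight then needs only conditional negation, not a multiplier.
  public class Binarize {
      static int sign(double w) { return w >= 0 ? 1 : -1; }

      static double mulBinary(double x, int w) {
          return (w == 1) ? x : -x;  // replaces x * w
      }

      public static void main(String[] args) {
          System.out.println(mulBinary(3.5, sign(-0.2))); // -> -3.5
      }
  }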
12
Background - Binarized Neural Network
[Figure: a full-precision network vs. its binarized counterpart]
13
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
14
Functional vs. Data Decomposition
Data decomposition assumes that the same kernel operates on different data sets at a given time.
Functional decomposition assumes a "pipeline" (or dataflow) behavior.
From a theoretical point of view, the computational power of the two methods is the same.
Most mainstream parallel SW environments, e.g., CUDA, Cilk, and OpenMP, assume data decomposition.
The two techniques can be intermixed.
15
Functional Decomposition: The Maxeler Heterogeneous Environment
[Diagram: CPUs connected over PCIe to Dataflow Engines with attached Memory]
16
Programming with MaxCompiler
A MaxCompiler design has three pieces: the host CPU code, the Manager that wires streams between the CPU, the chip, and memory (over PCI Express, through SLiC and MaxelerOS), and the kernel itself, which here computes y = x*x + 30.

Manager (.java):
  Manager m = new Manager("Calc");
  Kernel k = new MyKernel();
  m.setKernel(k);
  m.setIO(link("x", CPU), link("y", CPU));
  m.createSLiCInterface();
  m.build();

CPU Code (.c):
  #include "MaxSLiCInterface.h"
  #include "Calc.max"
  int *x, *y;
  // the original CPU loop being offloaded:
  for (int i = 0; i < DATA_SIZE; i++)
      y[i] = x[i] * x[i] + 30;
  // the call that runs the same computation on the DFE:
  Calc(x, DATA_SIZE);

MyKernel (.java):
  DFEVar x = io.input("x", dfeInt(32));
  DFEVar result = x * x + 30;
  io.output("y", result, dfeInt(32));

Note that the C/C++ and Java used here are restricted versions specific to this environment.
17
Programming with MaxCompiler
The same design, but with the Manager now routing the output stream y to the DFE's on-board memory (LMEM) instead of back to the CPU:

Manager (.java):
  Manager m = new Manager("Calc");
  Kernel k = new MyKernel();
  m.setKernel(k);
  m.setIO(link("x", CPU), link("y", LMEM_LINEAR1D));
  m.createSLiCInterface();
  m.build();

CPU Code (.c):
  #include "MaxSLiCInterface.h"
  #include "Calc.max"
  int *x, *y;
  Calc(x, DATA_SIZE);

MyKernel (.java):
  DFEVar x = io.input("x", dfeInt(32));
  DFEVar result = x * x + 30;
  io.output("y", result, dfeInt(32));
18
Kernel Streaming
[Animation across several slides: the input stream 1, 2, 3, 4, 5 enters the x*x + 30 kernel one element per tick; the squaring and adding stages overlap in time, and the output stream 31, 34, 39, 46, 55 emerges once the pipeline has filled.]
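To make the tick-by-tick behavior concrete, here is a small software model of the streaming kernel in plain Java; the one-register pipeline depth is an assumption for illustration, not the actual DFE latency.

  // Software model of the streaming kernel y = x*x + 30: a two-stage pipeline
  // (square, then add) that consumes one input per tick and, once filled,
  // emits one output per tick, so the stages work on different elements in parallel.
  import java.util.ArrayDeque;
  import java.util.Queue;

  public class StreamSim {
      public static void main(String[] args) {
          int[] xs = {1, 2, 3, 4, 5};
          Queue<Integer> between = new ArrayDeque<>(); // register between stages
          for (int x : xs) {
              between.add(x * x);                      // stage 1: multiplier
              if (between.size() > 1)                  // stage 2 lags one tick
                  System.out.println(between.remove() + 30); // stage 2: adder
          }
          while (!between.isEmpty())                   // drain the pipeline
              System.out.println(between.remove() + 30);
      }
  }

Running it prints 31, 34, 39, 46, 55, matching the animation's output stream.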
28
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
29
Streaming Architecture of QNN on FPGA: Hardware Implementation Overview
For the first time, we implement a full ImageNet classification architecture on FPGAs (the implementation is based on [Hubara et al.]).
The widths of the weights and activation-function outputs were chosen to be 1 bit and 2 bits, respectively.
30
Streaming Architecture of BNN on FPGA: Hardware Implementation Overview
Pre-trained weights and normalization parameters are stored on the CPU side.
The computations required for inference are performed on the DFE side.
The streaming architecture allows the current layer to begin calculating its output once enough data has accumulated in its internal buffer.
31
Streaming Architecture of BNN on FPGA: Hardware Implementation Overview
As soon as all the data required to calculate a particular output pixel is present, that pixel is calculated and passed to the next layer.
Since each layer is represented in the DFE Manager by a single function call, building the network is similar to building one in high-level frameworks such as TensorFlow, PyTorch, or Caffe (see the sketch below).
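As an analogy only (this is not the MaxCompiler API), the following self-contained Java sketch shows the "one call per layer" building style; the scale layer is a hypothetical stand-in for a real convolution layer.

  // Hypothetical illustration: each layer is one function object, and the
  // network is built by wiring the calls in sequence, much like stacking
  // layers in TensorFlow or PyTorch.
  import java.util.Arrays;
  import java.util.function.UnaryOperator;

  public class NetBuilder {
      // A "layer" maps one (flattened) feature map to the next.
      static UnaryOperator<double[]> scale(double s) { // stand-in for a layer
          return in -> {
              double[] out = in.clone();
              for (int i = 0; i < out.length; i++) out[i] *= s;
              return out;
          };
      }

      public static void main(String[] args) {
          UnaryOperator<double[]> layer1 = scale(2.0);
          UnaryOperator<double[]> layer2 = scale(0.5);
          // one call per layer, chained in order:
          double[] y = layer2.apply(layer1.apply(new double[]{1, 2, 3}));
          System.out.println(Arrays.toString(y)); // [1.0, 2.0, 3.0]
      }
  }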
32
Streaming Architecture of BNN on FPGA: Hardware Implementation Overview
Due to the compact model size of QNNs, all NN parameters are kept in on-chip memory.
Due to computation overlap, the latency and the initiation interval are small.
33
Streaming Architecture of BNN on FPGA: Convolution
The execution of the convolution kernel starts with inputs for weights, global normalization parameters, and feature maps.
We replaced the element-wise multiplication of feature maps with their corresponding weights by the XNOR-popcount algorithm (see the sketch below).
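A minimal Java sketch of the XNOR-popcount trick, assuming +1 is encoded as bit 1 and -1 as bit 0 (the encoding and the packing into a 64-bit word are our illustrative choices):

  // Binary dot product via XNOR-popcount: XNOR yields a 1 bit wherever the
  // two signs match, popcount counts the matches, and 2*matches - n recovers
  // the dot product over {-1, +1} values without any multiplications.
  public class XnorPopcount {
      static int binaryDot(long a, long w, int n) {  // n = number of valid bits
          long mask = (n == 64) ? -1L : (1L << n) - 1;
          long xnor = ~(a ^ w) & mask;               // 1 where signs match
          int matches = Long.bitCount(xnor);         // popcount
          return 2 * matches - n;
      }

      public static void main(String[] args) {
          // a = (+1,+1,-1,+1), w = (+1,-1,-1,+1), read from bit 0 up:
          // dot = 1 - 1 + 1 + 1 = 2
          System.out.println(binaryDot(0b1011, 0b1001, 4)); // -> 2
      }
  }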
34
Streaming Architecture of BNN on FPGA: Convolution
All the weights received by the FPGA are represented as 32-bit floating-point numbers.
Due to the use of global normalization, the amount of memory required to store the normalization parameters is relatively small.
35
Streaming Architecture of BNN on FPGA: Convolution
36
Streaming Architecture of BNN on FPGA: Global Batch Normalization and Activation Function
Pixels in the same position across all feature maps use the same normalization parameters.
As shown in FINN [Umuroglu et al.], batch normalization followed by a one-bit activation can be replaced by a threshold function.
The n-bit uniform activation (quantization) divides the input range into 2^n equally sized ranges; each range is mapped to a single output value of the activation function (see the sketch below).
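The following Java sketch illustrates both cases; the threshold value and the input range are illustrative, not taken from the paper.

  // 1-bit case: batch normalization followed by a sign activation collapses
  // into a single comparison against a precomputed threshold (after FINN).
  // n-bit case: a uniform activation splits [lo, hi) into 2^n equal ranges
  // and maps each range to one output level.
  public class ThresholdActivation {
      static int binaryAct(int popcountSum, int threshold) {
          return popcountSum >= threshold ? 1 : 0;   // BN + sign == compare
      }

      static int uniformAct(double x, double lo, double hi, int n) {
          int levels = 1 << n;                       // 2^n output values
          double step = (hi - lo) / levels;          // equal range width
          int bin = (int) Math.floor((x - lo) / step);
          return Math.max(0, Math.min(levels - 1, bin)); // clip to valid bins
      }

      public static void main(String[] args) {
          System.out.println(binaryAct(37, 32));            // -> 1
          System.out.println(uniformAct(0.6, 0.0, 1.0, 2)); // 2-bit: -> bin 2
      }
  }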
37
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
38
Methodology
We evaluated two common CNN architectures: AlexNet and ResNet-18.
As a baseline, we used GPU-based systems with an Nvidia Tesla P100 12GB and a GeForce GTX 1080.
We measured performance and power consumption.
The FPGA was evaluated on three common data sets: CIFAR-10, ImageNet, and STL-10.
39
Evaluation of the Proposed Architecture: Power
Our architecture draws at least 15× lower power than the GPU-based solutions.
40
Evaluation of the Proposed Architecture: Performance
GPUs outperform our implementation on large inputs, but our proposed streaming architecture is still fast enough to meet real-time requirements, achieving more than 60 fps.
41
Evaluation of the Proposed Architecture: Energy Impact
4× lower energy consumption (energy = power × run time).
42
Evaluation of the Proposed Architecture: Impact of Picture Size
Our streaming architecture scales well and can effectively utilize resources on both single and multiple FPGAs.
[Plots: BRAM, FF, and LUT utilization vs. input size.]
For example, increasing the input size from 32×32 to 96×96 increases utilization by approximately 5% for all resource types.
43
List of Topics
Motivation
Background
Design Methodologies for FPGA-Based Systems
Streaming Architecture for BNN
Evaluation of the Proposed Architecture
Conclusions
44
Conclusions
We have presented a streaming architecture for QNNs that scales well to large input sizes and large NNs.
For inputs up to 144×144, resource utilization is small enough to fit a single Stratix V 5SGSD8 FPGA.
The run time is only a few times longer than the GPUs', which allows us to speculate that next-generation FPGAs could outperform GPUs in both performance and power/energy consumption.
The demonstrated performance is achieved thanks to the streaming architecture, without involving off-chip memory.