Layer-wise Performance Bottleneck Analysis of Deep Neural Networks

Presentation transcript:

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao. University of California, Santa Cruz; *The College of William and Mary. The First International Workshop on Architectures for Intelligent Machines. Thanks for the introduction. Hi everyone, let's start the presentation. I am Hengyu Zhao, the first author of this work. This work is a collaboration with Professor Adwait Jog from The College of William and Mary and with my advisor, Professor Jishen Zhao.

As we all know, deep neural networks have been applied to many areas, such as facial recognition, AlphaGo, and autonomous driving. Deep learning technologies are changing our world and our lifestyles.

Training NVIDIA, Intel, and AMD all develop products for deep neural network training, but running the training on GPUs is the most common approach today.

Inference This work focuses on TRAINING.

ImageNet Large Scale Visual Recognition Challenge Recent Winners Modern neural networks go deeper! AlexNet: 8 layers (2012); VGG: 16 or 19 layers (2014); GoogLeNet: 22 layers (2014); ResNet: 152 layers (2015). Here are several recent winners of the ImageNet Large Scale Visual Recognition Challenge. The trend is clear: modern neural networks keep going deeper.

Challenges With Deeper Models and Larger Batches More layers also bring many issues. Deeper DNNs consume far more hardware resources, as VGG does in this figure. Batch size is the number of input samples processed per training iteration, and increasing it is a good way to improve training efficiency. However, when the batch size reaches 256, memory consumption becomes very large, and a single GPU can no longer meet the requirement.
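To give a feel for these numbers, here is a small back-of-the-envelope sketch (an illustration, not data from the slides) that estimates the forward-pass activation memory of a few representative VGG-16 convolutional layers as the batch size grows, assuming the standard 224x224 ImageNet configuration and 4-byte floats; training needs roughly this much again because activations are kept for backpropagation.

```python
# Rough activation-memory estimate for a few VGG-16 conv layers
# (standard 224x224 ImageNet configuration assumed, FP32 activations).
layers = {
    # name: (channels, height, width) of the output feature map
    "conv1_1": (64, 224, 224),
    "conv2_1": (128, 112, 112),
    "conv3_1": (256, 56, 56),
    "conv4_1": (512, 28, 28),
    "conv5_1": (512, 14, 14),
}

BYTES_PER_FLOAT = 4

def activation_mb(batch_size):
    """Memory (MB) for the output activations of the layers above."""
    elems = sum(c * h * w for (c, h, w) in layers.values())
    return batch_size * elems * BYTES_PER_FLOAT / (1024 ** 2)

for batch in (32, 64, 128, 256):
    print(f"batch {batch:3d}: ~{activation_mb(batch):7.1f} MB of activations "
          "for these five layers alone")
```

At a batch size of 256 even this handful of layers accounts for several gigabytes of activations, which is why a single GPU struggles once the full network, weights, and gradients are included.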

Problem Larger models need more powerful new architectures! This brings us to the problem.

Motivation To build powerful new systems, we need to address the compute and memory bottlenecks in deep learning. That raises the question: where exactly do these bottlenecks lie?

Our Work We build a layer-wise model for training VGG-16 and AlexNet on GPUs. We identify GPU performance bottlenecks in compute and cache resources, by characterizing the performance and data access behaviors of AlexNet and VGG-16 models in a layer-wise manner.

Background

Machine Learning AlexNet This is the architecture of AlexNet. Its layers can be divided into two groups: feature extraction layers, which extract input features and whose operations are mostly convolutions, and classification layers, such as fully connected layers, which analyze the features and classify input images into groups. We then profile the network layer by layer according to layer type, as sketched below.
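As an illustration of this kind of layer-wise grouping (a minimal pycaffe sketch, assuming a standard AlexNet deploy prototxt; the path below is a placeholder, not one from the paper):

```python
import caffe

# Placeholder path: substitute your own AlexNet deploy prototxt.
net = caffe.Net('models/bvlc_alexnet/deploy.prototxt', caffe.TEST)

feature_extraction, classification, other = [], [], []
for name, layer in zip(net._layer_names, net.layers):
    if layer.type in ('Convolution', 'Pooling', 'LRN'):
        feature_extraction.append(name)   # convolution-dominated feature layers
    elif layer.type == 'InnerProduct':
        classification.append(name)       # fully connected layers
    else:
        other.append(name)                # ReLU, Dropout, Softmax, ...

print('Feature extraction layers:', feature_extraction)
print('Classification layers:    ', classification)
```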

Forward propagation and backward propagation Forward propagation computes each layer's feature map from its input, which is the output of the previous layer. Backward propagation computes the gradient maps from the loss produced by the loss function and then updates the weights.
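To make the two phases concrete, here is a minimal NumPy sketch (an illustration, not code from the paper) of forward propagation, backward propagation, and the weight update for a single fully connected layer with a softmax cross-entropy loss:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 512))        # batch of 256 inputs (previous layer's output)
W = rng.standard_normal((512, 10)) * 0.01  # this layer's weights
t = rng.integers(0, 10, size=256)          # target class labels

# Forward propagation: this layer's output feature map.
y = x @ W

# Softmax cross-entropy loss gradient with respect to y.
p = np.exp(y - y.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
grad_y = p.copy()
grad_y[np.arange(len(t)), t] -= 1.0
grad_y /= len(t)

# Backward propagation: gradient maps for the weights and for the layer input.
grad_W = x.T @ grad_y   # used to update this layer's weights
grad_x = grad_y @ W.T   # passed back to the previous layer

# Weight update (plain SGD).
lr = 0.01
W -= lr * grad_W
```

Note that the backward pass performs two matrix products (for grad_W and grad_x) where the forward pass performs one, which gives some intuition for the roughly 2x backpropagation-to-forward latency ratios measured later.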

GPU Architecture [Figure: multiple SMs, each with a private L1 cache, all sharing an L2 cache and device memory] Here is a general GPU architecture: each SM has a private L1 cache, and the L2 cache is shared by all SMs.
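If PyCUDA is available, the hardware parameters this analysis cares about, the SM count and the size of the shared L2 cache, can be queried directly (a small sketch, not part of the original slides):

```python
import pycuda.driver as drv

drv.init()
dev = drv.Device(0)
attrs = dev.get_attributes()

num_sms = attrs[drv.device_attribute.MULTIPROCESSOR_COUNT]   # SMs, each with a private L1
l2_bytes = attrs[drv.device_attribute.L2_CACHE_SIZE]         # L2 shared by all SMs

print(f"{dev.name()}: {num_sms} SMs, "
      f"{l2_bytes / (1024 ** 2):.1f} MB shared L2 cache")
```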

Experiment Setup Models: AlexNet and VGG-16 Dataset: ImageNet Framework: Caffe
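For reference, a minimal pycaffe training loop for such a setup might look like the sketch below (the solver prototxt path is a placeholder; the slides do not show the exact harness used for the measurements):

```python
import caffe

caffe.set_device(0)
caffe.set_mode_gpu()

# Placeholder path: a Caffe solver prototxt describing AlexNet or VGG-16
# training on ImageNet with the desired batch size.
solver = caffe.SGDSolver('models/alexnet/solver.prototxt')

# Each step runs one forward pass, one backward pass, and a weight update.
solver.step(20)
```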

Real Machine Characterization We characterize execution time and instruction counts, the L1 cache, the L2 cache, and memory.
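The slides do not show the profiling harness itself; as a rough sketch, per-layer forward times can be collected in pycaffe by restricting which layers run (a coarse illustration only; the stall times and cache hit rates reported next require a hardware profiler such as nvprof rather than Python timers):

```python
import time
import caffe

caffe.set_device(0)
caffe.set_mode_gpu()

# Placeholder path for an AlexNet deploy network.
net = caffe.Net('models/bvlc_alexnet/deploy.prototxt', caffe.TEST)

# Time each layer's forward pass individually.
for name in net._layer_names:
    start = time.perf_counter()
    net.forward(start=name, end=name)
    elapsed_ms = (time.perf_counter() - start) * 1e3
    print(f'{name:12s} {elapsed_ms:8.3f} ms')
```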

Real Machine Characterization Execution time & stall time First, convolutional (CONV) layers execute for much longer than fully connected (FCN) layers. Second, CONV inter-layers dominate the execution time of all CONV layers; these inter-layers also execute more instructions than the other layers (Figure 5(a), Figure 6(a)). Third, execution time and instruction count increase as we increase the batch size from 32 to 256. Finally, for both CONV and FCN layers, the execution time of backpropagation can be over 2x that of forward propagation. Why? Convolutional layers are more compute intensive.

Execution time & stall time Normalized stall time: CONV inter-layers incur much more stall time than the other layers.

Computation latency ratio: backpropagation to forward propagation with a batch size of 256 in AlexNet and VGG-16. Why? Because backward propagation takes most of the execution time.

Executed instructions: AlexNet CONV inter-layers are compute intensive. Memory accesses are the dominant operations in DNN training.

Executed instructions: VGG-16 Data access is performance-critical for both CONV inter-layers and FCN layers.

L1 cache: AlexNet The working set of CONV inter-layers does not fit in the L1 caches. This observation is consistent with the long data access stall times of these layers. The reason can be either or both of the following: a) their working set does not fit in the L1 caches; b) they have low data access locality (our later evaluation of L2 access behavior shows that the latter is not the case). The CONV input layer (CONV1) has a high L1 hit rate, but the hit rate drops as CONV layers get deeper. Finally, L1 throughput and hit rate appear stable across batch sizes for CONV layers. A back-of-the-envelope estimate of this working set follows below.
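As a rough sanity check on the working-set claim (an illustration, not a measurement from the paper), the sketch below sizes the per-image data touched by AlexNet's conv3, a typical inter-layer, against a 24 KB L1 cache, assuming the standard AlexNet shapes and FP32 data:

```python
BYTES = 4  # FP32

# Standard AlexNet conv3 (an inter-layer), per input image:
in_c, in_h, in_w = 256, 13, 13      # input feature map (pool2 output)
out_c, out_h, out_w = 384, 13, 13   # output feature map
k = 3                               # kernel size

input_kb = in_c * in_h * in_w * BYTES / 1024
weight_kb = out_c * in_c * k * k * BYTES / 1024
output_kb = out_c * out_h * out_w * BYTES / 1024
total_kb = input_kb + weight_kb + output_kb

L1_KB, L2_KB = 24, 4 * 1024  # cache sizes considered in this study

print(f"input   {input_kb:8.1f} KB")
print(f"weights {weight_kb:8.1f} KB")
print(f"output  {output_kb:8.1f} KB")
print(f"total   {total_kb:8.1f} KB vs {L1_KB} KB L1, {L2_KB} KB L2")
```

Even for a single image, the conv3 weights alone are two orders of magnitude larger than a 24 KB L1, while the whole per-image working set is on the order of the 4 MB L2, which is consistent with the L2 hit rates reported next.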

L1 cache: VGG-16 FCN layers have higher L1 cache throughput.

L2 cache: AlexNet CONV inter-layers yield much higher hit rates in the 4 MB L2 cache than in the 24 KB L1 caches. The execution time of FCN layers is much shorter than that of CONV layers, so their throughput is higher.

L2 cache: VGG-16 We also profile VGG-16; the results are consistent with those of AlexNet.

L2 cache: VGG-16 As such, these layers have sufficient locality, especially for read requests, provided the GPU integrates caches large enough to accommodate their working set.

Memory: AlexNet

Memory: VGG-16 CONV inter-layers have much higher memory write throughput because they have lower L2 write hit rates. FCN layers have much higher memory read throughput because they have lower L2 read hit rates. The VGG-16 results have a shape similar to the AlexNet results.
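The link between L2 hit rate and memory throughput can be illustrated with a small estimate (a sketch with placeholder numbers, not figures from the paper; the 32-byte sector size is an assumption typical of NVIDIA GPUs):

```python
SECTOR_BYTES = 32  # assumed size of a DRAM transaction generated by an L2 miss

def dram_traffic_gb(l2_accesses, l2_hit_rate):
    """Approximate DRAM traffic implied by a stream of L2 accesses."""
    misses = l2_accesses * (1.0 - l2_hit_rate)
    return misses * SECTOR_BYTES / 1e9

# Placeholder numbers purely for illustration: for the same number of L2
# write accesses, a lower write hit rate produces proportionally more
# write traffic to device memory.
for label, hit_rate in [("CONV inter-layer", 0.40), ("FCN layer", 0.85)]:
    print(f"{label:17s} L2 write hit {hit_rate:.0%} -> "
          f"~{dram_traffic_gb(2e8, hit_rate):.1f} GB written to DRAM")
```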

Conclusion The execution time of convolutional inter-layers dominates the total execution time; in particular, backpropagation of these inter-layers consumes significantly more execution time than forward propagation. The working set of convolutional inter-layers does not fit in the L1 cache, while the convolutional input layer can exploit the L1 cache effectively. The interconnect network can also be a performance bottleneck that substantially increases GPU memory bandwidth demand.

Thank you.