Layer-wise Performance Bottleneck Analysis of Deep Neural Networks

Presentation transcript:

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao. University of California, Santa Cruz; *The College of William and Mary. The First International Workshop on Architectures for Intelligent Machines. Thanks for the introduction. Hi everyone, let's start the presentation. I am Hengyu Zhao, the first author of this work. This work is a collaboration with Professor Adwait Jog from The College of William and Mary and with my advisor, Professor Jishen Zhao.

As we all know, deep neural networks have been applied to many areas, such as facial recognition, AlphaGo, and autonomous driving. Deep learning technologies are changing our world and our lifestyles.

Training NVIDIA, Intel, and AMD all develop products for deep neural network training, but running the training on GPUs is the most common approach today.

Inference This work focuses on TRAINING.

ImageNet Large Scale Visual Recognition Challenge Recent Winners Modern neural networks go deeper! AlexNet: 8 layers (2012); VGG: 16 or 19 layers (2014); GoogLeNet: 22 layers (2014); ResNet: 152 layers (2015). Here are several recent winners of the ImageNet Large Scale Visual Recognition Challenge. The trend is clear: modern neural networks keep going deeper.

Challenges With Deeper Models and Larger Batches More layers also bring many issues. Deeper DNNs consume far more hardware resources, as VGG does in this figure. Batch size is the number of input samples processed per training iteration, and increasing it is a good way to improve training efficiency. However, when the batch size reaches 256, memory consumption becomes very large, and a single GPU can no longer meet the requirement.
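To give a feel for these numbers, here is a small back-of-the-envelope sketch (an illustration, not data from the slides) that estimates the forward-pass activation memory of a few representative VGG-16 convolutional layers as the batch size grows, assuming the standard 224x224 ImageNet configuration and 4-byte floats; training needs roughly this much again because activations are kept for backpropagation.

```python
# Rough activation-memory estimate for a few VGG-16 conv layers
# (standard 224x224 ImageNet configuration assumed, FP32 activations).
layers = {
    # name: (channels, height, width) of the output feature map
    "conv1_1": (64, 224, 224),
    "conv2_1": (128, 112, 112),
    "conv3_1": (256, 56, 56),
    "conv4_1": (512, 28, 28),
    "conv5_1": (512, 14, 14),
}

BYTES_PER_FLOAT = 4

def activation_mb(batch_size):
    """Memory (MB) for the output activations of the layers above."""
    elems = sum(c * h * w for (c, h, w) in layers.values())
    return batch_size * elems * BYTES_PER_FLOAT / (1024 ** 2)

for batch in (32, 64, 128, 256):
    print(f"batch {batch:3d}: ~{activation_mb(batch):7.1f} MB of activations "
          "for these five layers alone")
```

At a batch size of 256 even this handful of layers accounts for several gigabytes of activations, which is why a single GPU struggles once the full network, weights, and gradients are included.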

Problem Larger models need more powerful new architectures! This brings us to the problem.

Motivation To build powerful new systems, we need to address the compute and memory bottlenecks in deep learning. That raises the question: where exactly do these bottlenecks lie?

Our Work We build a layer-wise model for training VGG-16 and AlexNet on GPUs. We identify GPU performance bottlenecks in compute and cache resources, by characterizing the performance and data access behaviors of AlexNet and VGG-16 models in a layer-wise manner.

Background

Machine Learning AlexNet This is the architecture of AlexNet. Its layers can be divided into two groups: feature extraction layers, which extract input features and whose operations are mostly convolutions, and classification layers, such as fully connected layers, which analyze the features and classify input images into groups. We then profile the network layer by layer according to layer type, as sketched below.
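As an illustration of this kind of layer-wise grouping (a minimal pycaffe sketch, assuming a standard AlexNet deploy prototxt; the path below is a placeholder, not one from the paper):

```python
import caffe

# Placeholder path: substitute your own AlexNet deploy prototxt.
net = caffe.Net('models/bvlc_alexnet/deploy.prototxt', caffe.TEST)

feature_extraction, classification, other = [], [], []
for name, layer in zip(net._layer_names, net.layers):
    if layer.type in ('Convolution', 'Pooling', 'LRN'):
        feature_extraction.append(name)   # convolution-dominated feature layers
    elif layer.type == 'InnerProduct':
        classification.append(name)       # fully connected layers
    else:
        other.append(name)                # ReLU, Dropout, Softmax, ...

print('Feature extraction layers:', feature_extraction)
print('Classification layers:    ', classification)
```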

Forward propagation and backward propagation Forward propagation computes each layer's feature map from its input, which is the output of the previous layer. Backward propagation computes the gradient maps from the loss produced by the loss function and then updates the weights.
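To make the two phases concrete, here is a minimal NumPy sketch (an illustration, not code from the paper) of forward propagation, backward propagation, and the weight update for a single fully connected layer with a softmax cross-entropy loss:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 512))        # batch of 256 inputs (previous layer's output)
W = rng.standard_normal((512, 10)) * 0.01  # this layer's weights
t = rng.integers(0, 10, size=256)          # target class labels

# Forward propagation: this layer's output feature map.
y = x @ W

# Softmax cross-entropy loss gradient with respect to y.
p = np.exp(y - y.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
grad_y = p.copy()
grad_y[np.arange(len(t)), t] -= 1.0
grad_y /= len(t)

# Backward propagation: gradient maps for the weights and for the layer input.
grad_W = x.T @ grad_y   # used to update this layer's weights
grad_x = grad_y @ W.T   # passed back to the previous layer

# Weight update (plain SGD).
lr = 0.01
W -= lr * grad_W
```

Note that the backward pass performs two matrix products (for grad_W and grad_x) where the forward pass performs one, which gives some intuition for the roughly 2x backpropagation-to-forward latency ratios measured later.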

GPU Architecture [Figure: multiple SMs, each with a private L1 cache, all sharing an L2 cache and device memory] Here is a general GPU architecture: each SM has a private L1 cache, and the L2 cache is shared by all SMs.
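If PyCUDA is available, the hardware parameters this analysis cares about, the SM count and the size of the shared L2 cache, can be queried directly (a small sketch, not part of the original slides):

```python
import pycuda.driver as drv

drv.init()
dev = drv.Device(0)
attrs = dev.get_attributes()

num_sms = attrs[drv.device_attribute.MULTIPROCESSOR_COUNT]   # SMs, each with a private L1
l2_bytes = attrs[drv.device_attribute.L2_CACHE_SIZE]         # L2 shared by all SMs

print(f"{dev.name()}: {num_sms} SMs, "
      f"{l2_bytes / (1024 ** 2):.1f} MB shared L2 cache")
```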

Experiment Setup Models: AlexNet and VGG-16 Dataset: ImageNet Framework: Caffe
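For reference, a minimal pycaffe training loop for such a setup might look like the sketch below (the solver prototxt path is a placeholder; the slides do not show the exact harness used for the measurements):

```python
import caffe

caffe.set_device(0)
caffe.set_mode_gpu()

# Placeholder path: a Caffe solver prototxt describing AlexNet or VGG-16
# training on ImageNet with the desired batch size.
solver = caffe.SGDSolver('models/alexnet/solver.prototxt')

# Each step runs one forward pass, one backward pass, and a weight update.
solver.step(20)
```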

Real Machine Characterization We characterize execution time and instruction counts, the L1 cache, the L2 cache, and memory.
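The slides do not show the profiling harness itself; as a rough sketch, per-layer forward times can be collected in pycaffe by restricting which layers run (a coarse illustration only; the stall times and cache hit rates reported next require a hardware profiler such as nvprof rather than Python timers):

```python
import time
import caffe

caffe.set_device(0)
caffe.set_mode_gpu()

# Placeholder path for an AlexNet deploy network.
net = caffe.Net('models/bvlc_alexnet/deploy.prototxt', caffe.TEST)

# Time each layer's forward pass individually.
for name in net._layer_names:
    start = time.perf_counter()
    net.forward(start=name, end=name)
    elapsed_ms = (time.perf_counter() - start) * 1e3
    print(f'{name:12s} {elapsed_ms:8.3f} ms')
```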

Real Machine Characterization Execution time & stall time First, convolutional (CONV) layers execute for much longer than fully connected (FCN) layers. Second, CONV inter-layers dominate the execution time of all CONV layers; these inter-layers also execute more instructions than the other layers (Figure 5(a), Figure 6(a)). Third, execution time and instruction count increase as we increase the batch size from 32 to 256. Finally, for both CONV and FCN layers, the execution time of backpropagation can be over 2x that of forward propagation. Why? Convolutional layers are more compute intensive.

Execution time & stall time Normalized stall time: CONV inter-layers incur much more stall time than the other layers.

Computation latency ratio: backpropagation to forward propagation with a batch size of 256 in AlexNet and VGG-16. Why? Because backward propagation takes most of the execution time.

Executed instructions: AlexNet CONV inter-layers are compute intensive. Memory accesses are the dominant operations in DNN training.

Executed instructions: VGG-16 Data access is performance-critical for both CONV inter-layers and FCN layers.

L1 cache: AlexNet The working set of CONV inter-layers does not fit in the L1 caches. This observation is consistent with the long data access stall times of these layers. The reason can be either or both of the following: a) their working set does not fit in the L1 caches; b) they have low data access locality (our later evaluation of L2 access behavior shows that the latter is not the case). The CONV input layer (CONV1) has a high L1 hit rate, but the hit rate drops as CONV layers get deeper. Finally, L1 throughput and hit rate appear stable across batch sizes for CONV layers. A back-of-the-envelope estimate of this working set follows below.
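As a rough sanity check on the working-set claim (an illustration, not a measurement from the paper), the sketch below sizes the per-image data touched by AlexNet's conv3, a typical inter-layer, against a 24 KB L1 cache, assuming the standard AlexNet shapes and FP32 data:

```python
BYTES = 4  # FP32

# Standard AlexNet conv3 (an inter-layer), per input image:
in_c, in_h, in_w = 256, 13, 13      # input feature map (pool2 output)
out_c, out_h, out_w = 384, 13, 13   # output feature map
k = 3                               # kernel size

input_kb = in_c * in_h * in_w * BYTES / 1024
weight_kb = out_c * in_c * k * k * BYTES / 1024
output_kb = out_c * out_h * out_w * BYTES / 1024
total_kb = input_kb + weight_kb + output_kb

L1_KB, L2_KB = 24, 4 * 1024  # cache sizes considered in this study

print(f"input   {input_kb:8.1f} KB")
print(f"weights {weight_kb:8.1f} KB")
print(f"output  {output_kb:8.1f} KB")
print(f"total   {total_kb:8.1f} KB vs {L1_KB} KB L1, {L2_KB} KB L2")
```

Even for a single image, the conv3 weights alone are two orders of magnitude larger than a 24 KB L1, while the whole per-image working set is on the order of the 4 MB L2, which is consistent with the L2 hit rates reported next.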

L1 cache: VGG-16 FCN layers have higher L1 cache throughput.

L2 cache: AlexNet CONV inter-layers yield much higher hit rates in the 4 MB L2 cache than in the 24 KB L1 caches. The execution time of FCN layers is much shorter than that of CONV layers, so their throughput is higher.

L2 cache: VGG-16 We also profile VGG-16; the results are consistent with those of AlexNet.

L2 cache: VGG-16 As such, these layers have sufficient locality, especially for read requests, provided the GPU integrates caches large enough to accommodate their working set.

Memory: AlexNet

Memory: VGG-16 CONV inter-layers have much higher memory write throughput because they have lower L2 write hit rates. FCN layers have much higher memory read throughput because they have lower L2 read hit rates. The VGG-16 results have a shape similar to the AlexNet results.
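The link between L2 hit rate and memory throughput can be illustrated with a small estimate (a sketch with placeholder numbers, not figures from the paper; the 32-byte sector size is an assumption typical of NVIDIA GPUs):

```python
SECTOR_BYTES = 32  # assumed size of a DRAM transaction generated by an L2 miss

def dram_traffic_gb(l2_accesses, l2_hit_rate):
    """Approximate DRAM traffic implied by a stream of L2 accesses."""
    misses = l2_accesses * (1.0 - l2_hit_rate)
    return misses * SECTOR_BYTES / 1e9

# Placeholder numbers purely for illustration: for the same number of L2
# write accesses, a lower write hit rate produces proportionally more
# write traffic to device memory.
for label, hit_rate in [("CONV inter-layer", 0.40), ("FCN layer", 0.85)]:
    print(f"{label:17s} L2 write hit {hit_rate:.0%} -> "
          f"~{dram_traffic_gb(2e8, hit_rate):.1f} GB written to DRAM")
```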

Conclusion The execution time of convolutional inter-layers dominates the total execution time; in particular, backpropagation of these inter-layers consumes significantly more execution time than forward propagation. The working set of convolutional inter-layers does not fit in the L1 cache, while the convolutional input layer can exploit the L1 cache effectively. The interconnect network can also be a performance bottleneck that substantially increases GPU memory bandwidth demand.

Thank you.