Benchmarking Deep Learning Inference

Benchmarking Deep Learning Inference
Sharan Narang, June 28, 2017

Deep learning works today for several different applications. "Does it work efficiently?" Or rather, "Is deep learning fast?"

What can AI do for us?
- Help us communicate with devices
- Help us communicate with each other
- Find what we are looking for
- Drive us to work

Scaling with Data

How Large is our Data?

Model Sizes

Deep Learning Training
- Large amount of data
- Large and complex models

Training Many Large Models Quickly
We need to complete the cycle fast to explore many ideas: Idea → Code → Results → Idea.

Need for Speed

DeepBench
The first open-source benchmarking tool to measure deep learning training performance.

What is DeepBench?
- A benchmarking tool for neural network libraries and the underlying hardware used to train deep learning models.
- Includes a curated list of deep learning operations and workloads that are important and widely used in industry.

Training Operations
- Matrix multiply
- Convolution
- Recurrent operations
- Communication cost
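
DeepBench itself implements these kernels in C++/CUDA against vendor libraries; as a rough illustration of the measurement idea, here is a minimal GEMM timing sketch in NumPy (the warmup and repeat counts are illustrative choices, not DeepBench's):

```python
import time
import numpy as np

def benchmark_gemm(m, n, k, warmup=5, repeats=50):
    """Time an (m x k) @ (k x n) FP32 matrix multiply, averaged over repeats."""
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    for _ in range(warmup):          # warm up caches and any lazy initialization
        a @ b
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    return (time.perf_counter() - start) / repeats * 1e3   # ms per call

# One of the matrix sizes from the inference results later in the talk
print(f"3072x1024 times 1024x1: {benchmark_gemm(3072, 1, 1024):.3f} ms")
```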

Where does DeepBench fit in? The stack, top to bottom:
- Deep learning frameworks, e.g. PaddlePaddle, TensorFlow
- Neural network libraries, e.g. cuDNN, MKL
- Hardware
DeepBench sits at the neural network library level, benchmarking the libraries and the hardware beneath them.

Deep Learning Inference
- Hardware differs between training and inference: e.g. AWS instances vs. a dedicated cluster, and the I/O costs are different.
- The end goal differs from training, where the aim is to reduce training time.
- Inference has latency and real-time constraints.
- The model may need to be adapted before deployment.

Model Changes
[Figure: a bidirectional model, which consumes inputs across the whole time axis before producing outputs, is converted to a forward-only model that produces outputs as inputs arrive.]
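
To make that change concrete, here is a hypothetical PyTorch sketch (not from the talk; layer sizes are illustrative): a bidirectional recurrent layer needs the entire input sequence before it can emit its first output, while a forward-only layer can carry state across chunks and stream.

```python
import torch
import torch.nn as nn

seq = torch.randn(100, 1, 161)   # (time, batch, features); sizes illustrative

# Bidirectional: the backward direction needs the full sequence before any
# output can be produced, so the model cannot stream.
bi = nn.GRU(input_size=161, hidden_size=512, bidirectional=True)
out_bi, _ = bi(seq)              # out_bi: (100, 1, 1024)

# Forward-only: hidden state is carried across chunks, enabling streaming.
fwd = nn.GRU(input_size=161, hidden_size=512)
h = None
for chunk in seq.split(10):      # process 10 frames at a time as they "arrive"
    out_chunk, h = fwd(chunk, h) # out_chunk: (10, 1, 512)
```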

Precision
- Training uses single-precision 32-bit floating point numbers. FP32: 8 bits of exponent, 23 bits of mantissa.
- A fixed-point representation with 8 bits is sufficient for inference.
- Centering and normalization?
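
As a sketch of what 8-bit inference looks like in practice (a generic symmetric linear-quantization scheme, not necessarily the one used at Baidu):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of an FP32 tensor to int8."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(1024, 1024).astype(np.float32)   # weights (illustrative)
a = np.random.randn(1024, 32).astype(np.float32)     # activations

qw, sw = quantize_int8(w)
qa, sa = quantize_int8(a)

# Multiply in integer arithmetic, accumulate in int32, rescale once at the end.
y = (qw.astype(np.int32) @ qa.astype(np.int32)).astype(np.float32) * (sw * sa)

ref = w @ a
print(f"max relative error: {np.abs(y - ref).max() / np.abs(ref).max():.4f}")
```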

Batch Size

Batch Dispatch for Efficiency
[Figure: requests arriving over time are grouped into batches before being dispatched to the model.]
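
The idea behind batch dispatch, as described in Baidu's Deep Speech deployment work, is to group requests that arrive close together in time into one batched forward pass, trading a few milliseconds of latency for much better hardware utilization. A simplified sketch of the pattern (names, batch limit, and timeout are illustrative):

```python
import queue
import threading
import numpy as np

MAX_BATCH = 8
WAIT_S = 0.005              # illustrative: wait at most 5 ms for stragglers

req_queue = queue.Queue()   # items are (input_array, reply_queue) pairs

def run_model(inputs):
    """Stand-in for one batched forward pass over all queued inputs."""
    return [x.sum() for x in inputs]

def dispatcher():
    while True:
        batch = [req_queue.get()]                  # block for the first request
        try:
            while len(batch) < MAX_BATCH:          # gather more, up to a limit
                batch.append(req_queue.get(timeout=WAIT_S))
        except queue.Empty:
            pass                                   # timed out: run what we have
        outputs = run_model([inp for inp, _ in batch])
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)                       # return each caller's result

threading.Thread(target=dispatcher, daemon=True).start()

# A client submits one request and waits for its own result.
reply = queue.Queue()
req_queue.put((np.ones(4), reply))
print(reply.get())          # 4.0
```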

Sparse Neural Networks
[Figure: a dense neural network compared with a sparse neural network, where most connections have been removed.]
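
Whether sparsity actually pays off depends on kernel and hardware support. A quick SciPy sketch comparing dense and sparse matrix-vector products at the 95%-sparse size from the results below (timings will vary widely by platform):

```python
import time
import numpy as np
from scipy import sparse

m, k, density = 7680, 2560, 0.05       # ~95% of weights pruned away
w_dense = np.random.rand(m, k).astype(np.float32)
mask = np.random.rand(m, k) < density
w_sparse = sparse.csr_matrix(np.where(mask, w_dense, 0.0))
x = np.random.rand(k, 1).astype(np.float32)

def avg_ms(fn, repeats=100):
    fn()                                # warmup
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats * 1e3

print(f"dense:  {avg_ms(lambda: w_dense @ x):.3f} ms")
print(f"sparse: {avg_ms(lambda: w_sparse @ x):.3f} ms")
```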

Deployment Platform

Inference workloads are significantly different from training
- Model changes, low precision, batch size, sparsity.
- You can't take training kernels and deploy them. Focus on inference and pick the right kernels for it.

DeepBench Updates
- Built a list of inference kernels to help figure out the best processor based on application requirements.
- Guide hardware vendors to develop better hardware for inference.

Inference Operations
- Matrix multiply
- Convolution operations
- Recurrent operations
- Sparse operations (inference-only kernel)
- Smaller batch sizes
- Low precision
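
In spirit, the benchmark is a curated list of problem shapes run through a single harness. A schematic of how such a kernel list might look in Python (the field names and entries are illustrative, not DeepBench's actual format; the shapes are borrowed from the results below):

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    op: str            # "gemm", "conv", or "sparse_gemm"
    shape: tuple       # operation-specific dimensions
    batch: int         # inference batch sizes are small, often 1
    precision: str     # "fp32", "fp16", or "int8"
    sparsity: float    # fraction of zero weights; 0.0 means dense

INFERENCE_KERNELS = [
    Kernel("gemm", (3072, 1024, 1), batch=1, precision="int8", sparsity=0.0),
    Kernel("conv", (112, 112, 64, 1, 1, 64), batch=1, precision="int8", sparsity=0.0),
    Kernel("sparse_gemm", (7680, 2560, 1), batch=1, precision="fp32", sparsity=0.95),
]

for kern in INFERENCE_KERNELS:
    print(f"{kern.op:<12} shape={kern.shape} batch={kern.batch} "
          f"{kern.precision} sparsity={kern.sparsity}")
```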

Latency
- Measuring the latency of individual operations and kernels isn't representative of end-to-end latency.
- Measuring latency properly involves benchmarking complete applications with deep learning frameworks.
- For server deployment, a user's network bandwidth will have a significant impact on latency.

Training Updates to DeepBench
- New recurrent layer: Gated Recurrent Unit (GRU)
- Low-precision 16-bit training
- New kernels from different models

DeepBench Inference Results

Benchmarks – Matrix Multiply

Matrix Sizes              Server Deployment (ms)   Device Deployment (ms)
3072 x 1024, 1024 x 1     0.01                     3.71
5124 x 2048, 2048 x 700   0.55                     212.84
35 x 2048, 2048 x 700     0.07                     1.94

Benchmarks – Convolutions

Input Size       Filter Size   # of Filters   Server Deployment (ms)   Device Deployment (ms)
112 x 112 x 64   1 x 1         64             0.04                     670
28 x 28 x 512                  128            0.02                     391
7 x 7 x 512      3 x 3         512            0.10                     149

Benchmarks – Sparse Matrix Multiply

Matrix Sizes             Sparsity   Server Deployment (ms)   Device Deployment (ms)
7680 x 2560, 2560 x 1    0.95       0.03                     1.01
7680 x 2560, 2560 x 1    0.9        0.07                     2.10
10752 x 3584, 3584 x 1              0.06                     1.99

How do I use it?
- The DeepBench blog post has more details: https://svail.github.io/DeepBench-update/
- The GitHub repository has the kernels, results, and software required for the benchmark: https://github.com/baidu-research/DeepBench

Community Involvement
- Deep learning researchers can provide new operations and workloads specific to their applications.
- Software developers working on neural network or linear algebra libraries can contribute results for inference or training platforms.
- Hardware vendors and startups can contribute results for these benchmarks using their hardware and libraries.

Sharan Narang
sharan@baidu.com
http://research.baidu.com
Silicon Valley AI Lab