Search-Based Approaches to Accelerate Deep Learning

Presentation transcript:

Search-Based Approaches to Accelerate Deep Learning. Zhihao Jia, Stanford University, 9/9/2019.

Deep Learning is Everywhere: Convolutional Neural Networks, Recurrent Neural Networks, Neural Architecture Search, Reinforcement Learning.

Deep Learning Deployment is Challenging. Diverse and complex DNN models must run on distributed, heterogeneous hardware platforms. Which operators should be executed, and how should these operators be distributed across devices?

Existing Approach: Heuristic Optimizations. Given a DNN architecture, current frameworks apply rule-based graph optimizations (e.g., rule-based operator fusion) and parallelize it across devices with data/model parallelism. These heuristics miss model- and hardware-specific optimizations, so performance is suboptimal.

Search-Based Optimizations. A search space of possible strategies + a cost model and a search algorithm = optimized strategies. Challenge 1: how to build a search space that includes optimized strategies? Challenge 2: how to efficiently explore the search space?
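The same three ingredients recur throughout the talk. As a minimal, hypothetical sketch (not FlexFlow or XFlow code; `neighbors` stands in for the search space and `cost` for the cost model):

```python
import random

# Minimal sketch of the recipe: candidates are drawn from the search space,
# scored by the cost model, and the cheapest strategy found is kept.
# `neighbors` and `cost` are assumed helpers, not part of any real system.
def search(initial, neighbors, cost, num_iters=1000):
    best, best_cost = initial, cost(initial)
    for _ in range(num_iters):
        candidate = random.choice(neighbors(best))   # sample the search space
        c = cost(candidate)                          # evaluate with the cost model
        if c < best_cost:                            # keep the best strategy so far
            best, best_cost = candidate, c
    return best
```

The two challenges map onto the two arguments: `neighbors` determines whether good strategies are reachable at all, and the quality of `cost` together with the search loop determines how quickly they are found.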

Overview. The same recipe, a search space of possible strategies plus a cost model and a search algorithm, yields optimized strategies in two settings. For parallelization, the search space is the SOAP search space, the search algorithm is Markov Chain Monte Carlo, and the result is fast parallelization strategies that outperform data/model parallelism by up to 3.3x. For graph optimizations, the search space is auto-generated graph substitutions, the search algorithm is a cost-based backtracking search, and the result is optimized computation graphs that outperform rule-based operator fusion by up to 2.9x.

Beyond Data and Model Parallelism for Deep Neural Networks (ICML’18, SysML’19)

Current Approaches: Data and Model Parallelism. Data parallelism is the default strategy in existing DNN frameworks. Manually designed strategies [1, 2] combine data and model parallelism to accelerate specific DNNs. Automatically generated strategies: ColocRL [3] uses reinforcement learning to find device placements for model parallelism. Exploring dimensions beyond data and model parallelism can further accelerate DNN training (by up to 3.3x).
[1] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. 2014.
[2] Wu et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. 2016.
[3] Mirhoseini et al. Device placement optimization with reinforcement learning. 2017.

The SOAP Search Space: Samples, Operators, Attributes, Parameters.

The SOAP Search Space. Samples: partitioning training samples (Data Parallelism). (Figure: parallelizing a 1D convolution, with the Sample dimension split across GPU1-GPU4.)

The SOAP Search Space. Samples: partitioning training samples (Data Parallelism). Operators: partitioning DNN operators (Model Parallelism). (Figure: three convolutions, Convolution#1-Convolution#3, assigned to GPU1-GPU3.)

The SOAP Search Space. Samples: partitioning training samples (Data Parallelism). Operators: partitioning DNN operators (Model Parallelism). Attributes: partitioning attributes in a sample (e.g., different pixels). (Figure: parallelizing a 1D convolution, with the Pixel dimension split across GPU1-GPU4.)

The SOAP Search Space. Samples: partitioning training samples (Data Parallelism). Operators: partitioning DNN operators (Model Parallelism). Attributes: partitioning attributes in a sample (e.g., different pixels). Parameters: partitioning parameters in an operator. (Figure: parallelizing a 1D convolution, with the Parameter dimension split across GPU1-GPU4.)
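As a rough sketch of what a point in this search space looks like (hypothetical data structures, not FlexFlow's actual API), a strategy can be viewed as a per-operator choice of partition degrees along the parallelizable dimensions:

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical sketch of a SOAP parallelization strategy (not FlexFlow's real
# data structures): each operator chooses how many ways to partition its
# Sample, Attribute, and Parameter dimensions; the Operator dimension is
# captured by giving every operator its own configuration.

@dataclass
class OpConfig:
    sample: int = 1      # partitions of the training-sample (batch) dimension
    attribute: int = 1   # partitions of per-sample attributes (e.g., pixels)
    parameter: int = 1   # partitions of the operator's parameters

    def degree(self) -> int:
        return self.sample * self.attribute * self.parameter

@dataclass
class Strategy:
    per_op: Dict[str, OpConfig] = field(default_factory=dict)

# Example on 4 GPUs: data parallelism for conv1, parameter parallelism for conv2.
strategy = Strategy(per_op={
    "conv1": OpConfig(sample=4),
    "conv2": OpConfig(parameter=4),
})
assert all(cfg.degree() == 4 for cfg in strategy.per_op.values())
```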

Hybrid Parallelism in SOAP. Example parallelization strategies for a 1D convolution; different strategies perform the same computation.
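To make "different strategies perform the same computation" concrete, here is a small NumPy sketch (illustrative only, unrelated to FlexFlow's implementation) that computes a 1D convolution whole, partitioned along the Sample dimension, and partitioned along the Parameter (output-channel) dimension, and checks that all three agree:

```python
import numpy as np

# Sketch: a 1D convolution computed whole, partitioned along the Sample
# dimension (data parallelism), and partitioned along the Parameter dimension
# (output channels). All three produce the same result.

def conv1d(x, w):
    # x: (batch, length), w: (out_channels, kernel); "valid" convolution
    k = w.shape[1]
    windows = np.stack([x[:, i:i + k] for i in range(x.shape[1] - k + 1)], axis=1)
    return windows @ w.T          # (batch, out_length, out_channels)

x = np.random.randn(8, 16)        # 8 samples of length 16
w = np.random.randn(4, 3)         # 4 output channels, kernel size 3

whole = conv1d(x, w)

# Sample dimension: split the batch across 4 "GPUs", concatenate the outputs.
by_sample = np.concatenate([conv1d(xs, w) for xs in np.array_split(x, 4, axis=0)], axis=0)

# Parameter dimension: split the output channels across 4 "GPUs".
by_param = np.concatenate([conv1d(x, ws) for ws in np.array_split(w, 4, axis=0)], axis=2)

assert np.allclose(whole, by_sample) and np.allclose(whole, by_param)
```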

A possible parallelization strategy in the SOAP search space. (Figure: the 1D convolution partitioned across GPU1-GPU4 along both the Sample and Parameter dimensions, shown alongside plain data parallelism.)

A possible parallelization strategy in the SOAP search space. (Figure: the same hybrid strategy compared with data parallelism, annotated with the resulting speedup.)

FlexFlow. Inputs: a DNN architecture (operators such as Conv, Concat, and MatMul) and a device topology (CPUs and GPUs connected by a network). The execution optimizer pairs a search algorithm (MCMC search) with a cost model (an execution simulator): the search algorithm proposes candidate strategies, the simulator predicts their performance, and the best found strategy is handed to the distributed runtime.
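A minimal sketch of a Metropolis-style MCMC search loop in the spirit of this slide (illustrative only, not FlexFlow's actual code; `simulate` stands in for the execution simulator and `propose` for a random re-parallelization of one operator):

```python
import math
import random

# Illustrative MCMC search over parallelization strategies. `simulate(strategy)`
# plays the role of the execution simulator (cost model), and `propose(strategy)`
# randomly changes one operator's parallelization; both are assumed helpers.
def mcmc_search(initial, propose, simulate, num_iters=10000, beta=0.05):
    current, current_cost = initial, simulate(initial)
    best, best_cost = current, current_cost
    for _ in range(num_iters):
        candidate = propose(current)
        cost = simulate(candidate)
        # Always accept improvements; accept regressions with a probability that
        # decays exponentially with how much worse they are, to escape local minima.
        if cost < current_cost or random.random() < math.exp(-beta * (cost - current_cost)):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost
```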

Evaluation. Training throughput (samples per second) versus the number of nodes (four K80 GPUs per node): FlexFlow is 1.7x faster than the baseline. Speedup over state-of-the-art parallelization per DNN: AlexNet 3.3x, ResNet-50 1.1x, Inception-v3 1.6x, RNNTC 1.7x, RNNLM 1.9x, GNMT 2.4x.

Overview (revisited). The first half covered parallelization: the SOAP search space explored with Markov Chain Monte Carlo, producing fast parallelization strategies. The second half applies the same recipe to graph optimizations: a search space of auto-generated graph substitutions explored with a cost-based backtracking search, producing optimized computation graphs.

Optimizing DNN Computation with Automated Generation of Graph Substitutions (SysML’19)

Current Practice: Rule-Based Graph Transformations. Apply graph transformations designed by domain experts, e.g., fuse a convolution and a relu into a "conv + relu" operator. (Figure: a graph with Input, Conv3x3, Conv1x1, Add, and Relu nodes before and after each convolution is fused with its following relu.)
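As a toy sketch of what one hand-written rule looks like (a hypothetical graph representation, not TensorFlow's or XFlow's internals), pattern-match a Relu whose input is a Conv and rewrite the pair into a single fused operator:

```python
from dataclasses import dataclass
from typing import List

# Toy sketch of a rule-based graph rewrite on a hypothetical IR:
# fuse a Conv followed by a Relu into a single "Conv+Relu" operator.

@dataclass
class Node:
    op: str                      # e.g. "Conv3x3", "Relu", "Add"
    inputs: List["Node"]

def fuse_conv_relu(graph: List[Node]) -> List[Node]:
    # Pass 1: find Relu nodes whose single input is a Conv; remember those Convs.
    # (A real rule would also check that the Conv has no other consumers.)
    fused_convs = {id(n.inputs[0]) for n in graph
                   if n.op == "Relu" and len(n.inputs) == 1
                   and n.inputs[0].op.startswith("Conv")}
    # Pass 2: rebuild the graph, dropping fused Convs and rewriting the Relus.
    out: List[Node] = []
    for node in graph:
        if id(node) in fused_convs:
            continue                                   # absorbed into the fused op
        if node.op == "Relu" and len(node.inputs) == 1 and id(node.inputs[0]) in fused_convs:
            conv = node.inputs[0]
            out.append(Node(op=conv.op + "+Relu", inputs=conv.inputs))
        else:
            out.append(node)
    return out

# Example: Input -> Conv3x3 -> Relu becomes Input -> Conv3x3+Relu.
inp = Node("Input", [])
conv = Node("Conv3x3", [inp])
relu = Node("Relu", [conv])
print([n.op for n in fuse_conv_relu([inp, conv, relu])])  # ['Input', 'Conv3x3+Relu']
```

Each new operator or graph shape needs another rule of this kind, which is the scalability problem raised on the next slides.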

Limitations of Rule-Based Approaches. Robustness: experts’ heuristics do not apply to all DNNs/hardware. For example, users report: "When I turned on XLA (TensorFlow’s graph optimizer), the training speed is about 20% slower." "With XLA, my program is almost 2x slower than without XLA."

Limitations of Rule-Based Approaches. Robustness: experts’ heuristics do not apply to all DNNs/hardware. Scalability: new operators and graph structures require more rules. Performance: subtle optimizations for specific DNNs/hardware are missed; TensorFlow involves ~4K LOC to optimize a new operator.

A Missing Graph Optimization. Starting from a graph of Conv3x3+Relu, Conv1x1, Add, and Relu nodes, a sequence of substitutions (enlarge convs, fuse convs, fuse conv & add, fuse conv & relu), introducing a Split along the way, rewrites the graph step by step. The final graph is 1.3x faster on V100 but 10% slower on K80.

Can we automatically find these optimizations? Automatically generated graph substitutions

XFlow. Operator specifications feed a graph substitution generator, which produces candidate substitutions; a graph substitution verifier checks them, yielding verified substitutions; a cost-based search algorithm then applies the verified substitutions to rewrite an input computation graph into an optimized computation graph.
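A rough sketch of such a cost-based backtracking search (illustrative only; `substitutions` and `cost` are assumed stand-ins for the verified substitutions and the cost model, not XFlow's real interfaces):

```python
import heapq

# Illustrative cost-based backtracking search: repeatedly apply verified graph
# substitutions, always expanding the cheapest graph seen so far, and keep the
# best graph found. `substitutions` is a list of functions graph -> list of
# rewritten graphs, and `cost(graph)` estimates runtime; both are assumed.
def optimize(graph, substitutions, cost, alpha=1.05, budget=10000):
    best, best_cost = graph, cost(graph)
    queue = [(best_cost, 0, graph)]           # priority queue ordered by cost
    counter = 1                               # tie-breaker for the heap
    seen = {repr(graph)}                      # crude dedup; a real system would canonicalize graphs
    while queue and budget > 0:
        budget -= 1
        c, _, g = heapq.heappop(queue)
        if c > alpha * best_cost:             # prune graphs much worse than the best
            continue
        for subst in substitutions:
            for new_g in subst(g):
                key = repr(new_g)
                if key in seen:
                    continue
                seen.add(key)
                new_c = cost(new_g)
                if new_c < best_cost:
                    best, best_cost = new_g, new_c
                heapq.heappush(queue, (new_c, counter, new_g))
                counter += 1
    return best
```

The pruning factor alpha discards graphs that are much more expensive than the best one found so far, which keeps the search tractable while still allowing temporary detours through slightly worse graphs.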

End-to-end Inference Performance. Using ~500 automatically generated substitutions, XFlow is competitive with the state of the art on conventional DNNs and outperforms it on unconventional DNNs, with measured speedups of 1.0x, 1.3x, 1.4x, 1.5x, and 2.9x across the evaluated models.

Open Problems. Can we design better search spaces for parallelization and graph optimizations? Can we find more efficient search algorithms? Can we use search-based optimizations in other domains?

Conclusion. Search-based optimization combines a search space of possible strategies with a cost model and a search algorithm to obtain optimized strategies. For parallelization, the SOAP search space explored with Markov Chain Monte Carlo yields fast parallelization strategies; for graph optimizations, auto-generated graph substitutions explored with a cost-based backtracking search yield optimized computation graphs. Code: https://github.com/flexflow/FlexFlow

Backup Slides