Gandiva: Introspective Cluster Scheduling for Deep Learning

Presentation transcript:

Gandiva: Introspective Cluster Scheduling for Deep Learning
Wencong Xiao (Beihang University & Microsoft Research), Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, Lidong Zhou
Joint work of MSR Asia and MSR India. Published at OSDI '18.

Gandiva: Introspective Cluster Scheduling for Deep Learning
- A new scheduler architecture catering to the key characteristics of deep learning training
- System innovations bring an order-of-magnitude efficiency gain

Deep Learning Training vs. Big Data Processing
Goal:
- Deep learning training: find a qualified neural model (a quality model has high accuracy and a small footprint)
- Big data processing: compute over a set of data
Process:
- Deep learning training: trial and error over hyper-parameters; a trial job can last hours or days; the search may take many jobs (~100); a job may be stopped early if it is not promising; AutoML automates the process
- Big data processing: run one job to completion; the job may have sub-tasks (e.g., MapReduce); sub-tasks are short-lived (minutes)
Performance:
- Deep learning training: find the model as fast as possible; the performance of any single job might not matter
- Big data processing: run the job as fast as possible
Our goal is to build a system to speed up this process.

Implication: The More Parallel Jobs, the Better
- Which model is better? (Figure: trial jobs on a GPU cluster)
- Good trial jobs surface earlier; bad trial jobs stop earlier
- More effective search of the hyper-parameter space
- Requires time-slicing of jobs, but GPUs are not efficiently virtualizable

Implication: Long-Running Jobs vs. a Changing Environment
- Long-running jobs need to adapt to the changing environment
- Need the ability to migrate jobs (Figure: example of a 2-GPU job placed across Server 1 and Server 2)

Complication
- Sensitivity to locality and interference varies across jobs
- How do we know a scheduling decision is beneficial? Is a migration beneficial if it gives better locality but more interference?

Opportunity: The Computation Boundary of Deep Learning
- Deep learning training is iterative: computation is divided into super-steps (i.e., mini-batches) separated by global synchronization
- The mini-batch is the natural computation boundary: it enables light-weight profiling, and time-slicing and migration at the barrier (a minimal sketch follows)
(Figure: ResNet50 on ImageNet data.)
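To make the mini-batch boundary concrete, here is a minimal Python sketch (not from the paper) of a training loop that reports each mini-batch's duration to the scheduler. `train_step` and `report` are hypothetical callbacks standing in for the framework's training step and the scheduler's profiling channel.

```python
import time
from typing import Callable, Iterable


def profiled_training_loop(
    minibatches: Iterable,
    train_step: Callable[[object], None],
    report: Callable[[int, float], None],
) -> None:
    """Run one training step per mini-batch and report its duration."""
    for step, batch in enumerate(minibatches):
        start = time.monotonic()
        train_step(batch)  # forward/backward/update for one mini-batch
        elapsed = time.monotonic() - start
        # Mini-batch boundary: workers have synchronized and GPU memory
        # usage is at its minimum, so this is the cheapest point to
        # profile, time-slice, checkpoint, or migrate the job.
        report(step, elapsed)
```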

Our Approach
- Time-slicing and migration as the primitives for scheduling (similar to an OS)
- Mitigate head-of-line blocking; explore more trials in parallel
- Overheads: time-slicing 50~250 ms, migration about a second
(A toy scheduling sketch follows.)
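As an illustration of time-slicing as a scheduling primitive, the toy Python sketch below round-robins trial jobs over a single GPU. `suspend_fn` and `resume_fn` are assumed hooks that pause and resume a job at its next mini-batch boundary, and the quantum value is arbitrary; the real system also packs and migrates jobs across GPUs and servers.

```python
from collections import deque


class TimeSlicingScheduler:
    """Toy round-robin time-slicing of one GPU among many trial jobs."""

    def __init__(self, suspend_fn, resume_fn, quantum_s=60.0):
        self.suspend = suspend_fn    # pause a job at its next mini-batch boundary
        self.resume = resume_fn      # resume a previously suspended job
        self.quantum_s = quantum_s   # how long each job keeps the GPU
        self.run_queue = deque()     # jobs waiting for the GPU
        self.running = None          # job currently on the GPU

    def submit(self, job):
        """New trials join the rotation instead of waiting in a FIFO queue."""
        self.run_queue.append(job)

    def tick(self):
        """Call every `quantum_s` seconds: rotate the GPU to the next job."""
        if self.running is not None:
            self.suspend(self.running)
            self.run_queue.append(self.running)
        if self.run_queue:
            self.running = self.run_queue.popleft()
            self.resume(self.running)
        else:
            self.running = None
```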

Our Approach (continued)
- Introspection: application-aware profiling (time per mini-batch)
- Continuous, introspective scheduling to adapt quickly to the changing environment
- Efficient implementation by exploiting the predictability of training: checkpointing at the mini-batch boundary with minimal memory overhead
(A sketch of the introspective feedback loop follows.)
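A minimal sketch of the introspective feedback loop, under the assumption that each job exposes profiling hooks (`avg_minibatch_time`, `wait_for_minibatches`) backed by the per-mini-batch reports above: apply a scheduling change, observe the new time per mini-batch, and revert the change if the job got slower. The 10% threshold is an assumption for illustration, not the paper's value.

```python
def introspective_apply(job, apply_change, undo_change, window=20, tolerance=1.1):
    """Apply a scheduling change (e.g., migrate or pack `job`) and keep it
    only if the observed time per mini-batch does not regress."""
    before = job.avg_minibatch_time(window)   # baseline mini-batch time
    apply_change(job)                         # e.g., move the job to another server
    job.wait_for_minibatches(window)          # let it run long enough to measure
    after = job.avg_minibatch_time(window)    # mini-batch time after the change
    if after > tolerance * before:            # assumed threshold: >10% slowdown
        undo_change(job)                      # the change was not beneficial; revert
        return False
    return True
```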

A Comparison: Big Data Scheduler vs. Deep Learning Scheduler
- Granularity: MapReduce task or DFG vs. mini-batch boundary
- Decision: one-time vs. continuous/introspective
- Profiling: system-level (CPU/GPU utilization, disk I/O) vs. application-level (time per mini-batch)

Performance Highlights
- Time-slicing: less than 2% overhead
- Migration: 50x faster, saving 98% of the overhead; migration cost is under one second even for multi-GPU, multi-node jobs
(Figure: measured with ResNet50.)

Performance Highlights
- AutoML model exploration: 1.5 hrs vs. 20.3 hrs, a 13.6x speedup
- Setup: VGG-like model, >90% accuracy, 40 trials; two AutoML sessions, each using 8 GPUs; background DLT jobs in a 100-GPU cluster

Beyond Research: An Open Source Stack for AI Innovation
- OpenPAI platform (December 2017): cluster management for AI training and a marketplace for AI asset sharing
- NNI, Neural Network Intelligence (September 2018): a toolkit for automated machine learning experiments
- MMdnn (November 2017): a tool to convert, visualize, and diagnose deep neural network models
- Tools for AI (September 2017): an extension to build, test, and deploy deep learning/AI solutions
- Compiler infrastructure (TBD): compile-time and runtime optimization

Thank you!
