Gandiva: Introspective Cluster Scheduling for Deep Learning
Wencong Xiao (Beihang University & Microsoft Research)
Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, Lidong Zhou
Joint work of MSR Asia and MSR India
Published in OSDI'18
Gandiva: Introspective Cluster Scheduling for Deep Learning
- A new scheduler architecture catering to the key characteristics of deep learning training
- System innovations bring an order-of-magnitude efficiency gain
Deep Learning Training vs. Big Data Processing
- Goal: find a qualified neural model (quality model: high accuracy, small footprint) vs. process a set of data
- Process (deep learning): trial-and-error over hyper-parameters, judged by training accuracy; a trial job can last for hours or days; it may take many jobs (~100); a job may be stopped early if not promising; AutoML automates the process
- Process (big data): run one job to completion; a job may have sub-tasks (e.g., MapReduce); sub-tasks are short-lived (minutes)
- Performance: find the model as fast as possible (the performance of any single job may NOT matter) vs. run the job as fast as possible
- Our goal is to build a system that speeds up this process
Implication – The More Parallel Jobs, the Better
- Which model is better? Good trial jobs surface earlier; bad trial jobs stop earlier
- More effective search of the hyper-parameter space
- Need time-slicing of jobs on the GPU cluster – GPUs are not efficiently virtualizable
Implication – Long-Running Jobs vs. a Changing Environment
- Need to adapt long-running jobs to the changing environment
- Example: need the ability to migrate jobs
- (Diagram: a 2-GPU job placed across Server 1 and Server 2)
Complication
- Sensitivity to locality and interference varies across jobs
- How do we know a decision, such as a migration, is beneficial?
- Better locality, but with more interference?
Opportunity – Computation Boundary of Deep Learning
- Deep learning training is iterative and runs in mini-batches
- Computation is divided into super-steps (i.e., mini-batches), separated by global synchronization
- Mini-batch as the computation boundary: light-weight profiling, plus time-slicing and migration at the barrier (a profiling sketch follows)
- (Figure: ResNet-50 on ImageNet data)
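To make the mini-batch barrier concrete, here is a minimal profiling sketch in Python. It is illustrative only: the `run_minibatch` callable stands in for one forward/backward pass plus the global synchronization, and is an assumed interface, not Gandiva's actual API.

```python
import time
from collections import deque


class MinibatchProfiler:
    """Track time-per-mini-batch over a sliding window of recent iterations."""

    def __init__(self, window=20):
        self.samples = deque(maxlen=window)

    def record(self, seconds):
        self.samples.append(seconds)

    def mean_time(self):
        # Average time per mini-batch, or None if nothing has been recorded yet.
        return sum(self.samples) / len(self.samples) if self.samples else None


def profiled_loop(run_minibatch, profiler, num_minibatches):
    """Time each super-step; the barrier at the end of a mini-batch is the only
    instrumentation point needed, which keeps profiling light-weight."""
    for _ in range(num_minibatches):
        start = time.monotonic()
        run_minibatch()                        # forward + backward + global sync
        profiler.record(time.monotonic() - start)
```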
Our Approach
- Time-slicing and migration as the primitives for scheduling (similar to an OS)
- Mitigate head-of-line blocking; explore more trials in parallel
- Time-slicing (50~250 ms); migration (~1 second)
- (A suspend/resume sketch at the mini-batch barrier follows)
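Below is a rough sketch of time-slicing at the mini-batch barrier, written against PyTorch. The scheduler-facing hooks (`run_minibatch`, `suspend_requested`, `wait_for_gpu_grant`) are hypothetical names used for illustration; only the idea of parking state in host memory at the barrier comes from the talk.

```python
import torch


def suspend_to_host(model):
    """At the barrier: move parameters and gradients to host memory and release
    cached GPU allocations so another job can use the device."""
    model.to("cpu")
    torch.cuda.empty_cache()


def resume_on_gpu(model, device="cuda"):
    """Restore state onto a (possibly different) GPU; nothing inside a
    mini-batch is lost because we only switch at the boundary."""
    model.to(device)


def time_sliced_loop(model, run_minibatch, suspend_requested, wait_for_gpu_grant,
                     num_minibatches):
    for _ in range(num_minibatches):
        run_minibatch(model)               # one super-step, ends at the global sync
        if suspend_requested():            # the scheduler's time slice expired
            suspend_to_host(model)
            device = wait_for_gpu_grant()  # block until a GPU is granted back
            resume_on_gpu(model, device)
```

Switching only at the boundary keeps the amount of state that has to move small, which is consistent with the "minimum memory overhead" point on the next slide.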
Our Approach
- Introspection: application-aware profiling (time-per-mini-batch)
- Continuous, introspective scheduling to adapt quickly to the changing environment
- Efficient implementation by exploiting the predictability of training
- Checkpointing at the mini-batch boundary with minimal memory overhead
- (A sketch of the introspective feedback loop follows)
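One way to read "continuous and introspective": treat each placement change as tentative, keep watching time-per-mini-batch, and undo the change if throughput regresses. The sketch below illustrates that feedback loop using the MinibatchProfiler sketched earlier; `apply_placement`, `rollback_placement`, and `run_some_minibatches` are assumed hooks, not Gandiva's real interface.

```python
def introspective_adjust(job, profiler, new_placement,
                         apply_placement, rollback_placement,
                         run_some_minibatches, tolerance=0.03):
    """Tentatively apply a new placement (e.g., migrate for better locality)
    and keep it only if observed time-per-mini-batch does not regress."""
    baseline = profiler.mean_time()        # throughput before the change
    apply_placement(job, new_placement)    # takes effect at the next barrier

    run_some_minibatches(job)              # profiler keeps recording meanwhile
    after = profiler.mean_time()

    if baseline is not None and after is not None \
            and after > baseline * (1 + tolerance):
        # Better locality but more interference, say: the change hurt, undo it.
        rollback_placement(job)
        return False
    return True
```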
A Comparison to Big Data Schedulers
- Granularity: MapReduce task or DFG (big data) vs. mini-batch boundary (deep learning)
- Decision: one-time (big data) vs. continuous/introspective (deep learning)
- Profiling: system-level, e.g., CPU/GPU utilization and disk I/O (big data) vs. application-level, time-per-mini-batch (deep learning)
Performance Highlights
- Time-slicing: less than 2% overhead
- Migration: 50x faster, saving ~98% of the overhead; sub-second cost even for multi-GPU, multi-node migration
- (Figure: ResNet-50)
Performance Highlights
- AutoML model exploration: 1.5 hrs vs. 20.3 hrs, a 13.6x speedup
- VGG-like model, >90% accuracy, 40 trials
- Two AutoML sessions, each using 8 GPUs, with background DLT jobs in a 100-GPU cluster
Beyond Research: An Open Source Stack for AI Innovation
- OpenPAI platform (December 2017): cluster management for AI training and a marketplace for AI asset sharing
- NNI – Neural Network Intelligence (September 2018): a toolkit for automated machine learning experiments
- MMdnn (November 2017): a tool to convert, visualize, and diagnose deep neural network models
- Tools for AI (September 2017): an extension to build, test, and deploy deep learning/AI solutions
- Compiler infrastructure (TBD): compile-time and runtime optimization
Thank you!