1
SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Logan Stafman*, Haoyu Zhang *, Andrew Or, Michael J Freedman
2
Machine Learning in Data Science
Machine learning in clusters has become ubiquitous. Growth in training data and model sizes leads to high resource contention.
3
Machine Learning Overview
Repeatedly update the model; a usable model is available after each iteration. Loss indicates how well a model is trained. [Diagram: choose initial model → update model → model/loss, repeating]
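The iterative loop above can be sketched in a few lines; this is a toy gradient-descent example on a quadratic loss (all names illustrative, not from SLAQ):

```python
# Minimal sketch of iterative training: the model is updated repeatedly,
# a usable model exists after every iteration, and the loss tracks how
# well trained it is. Toy quadratic loss, illustrative only.

def train(initial_w, lr=0.1, iterations=20):
    w = initial_w
    losses = []
    for _ in range(iterations):
        loss = (w - 3.0) ** 2        # loss: distance from the optimum at 3.0
        grad = 2.0 * (w - 3.0)       # gradient of the loss
        losses.append(loss)
        w -= lr * grad               # update model; after this line, a
                                     #   (partially trained) model is available
    return w, losses

w, losses = train(0.0)
```

Each pass through the loop corresponds to one "Update Model" step in the diagram, with the loss recorded per iteration.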
4
Cluster Schedulers Use Fairness
Schedulers use “fair-share” algorithms to evenly allocate resources. ML jobs have three key features that make fair-share an inappropriate choice: they are approximate, they exhibit diminishing returns, and training is an exploratory process.
5
Machine Learning is Approximate
[Diagram: an ML job sends tasks to workers; each worker holds a model replica and data shards, and sends model updates back to the job's central model.]
6
ML Jobs Have Diminishing Returns
Roughly 80% of the work is done in the first 20% of the time. This pattern is common across many ML algorithms, but applies only to convex optimization problems.
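The "80% of the work in 20% of the time" pattern can be checked on a hypothetical sublinear O(1/k) loss trace (synthetic numbers, not measurements from the talk):

```python
# Illustrative check of diminishing returns on a synthetic O(1/k)
# convergence curve: how much of the total loss reduction happens
# in the first 20% of iterations?

iters = 100
loss = [1.0 / (k + 1) for k in range(iters)]        # loss after iteration k
total_reduction = loss[0] - loss[-1]                # reduction over the full run
early_reduction = loss[0] - loss[iters // 5 - 1]    # reduction after 20% of iters
fraction = early_reduction / total_reduction
```

For this curve, well over 80% of the total loss reduction lands in the first 20 iterations, which is exactly why equal-share allocation wastes resources on nearly-converged jobs.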
7
ML Training is Exploratory
Data scientists run and rerun experiments while varying hyperparameters, dataset features, and model structures, and might launch several experiments concurrently. [Diagram: modify features / modify hyperparameters / modify model structure → run ML training algorithm, in a loop]
8
Our Solution: SLAQ
SLAQ uses knowledge of ML applications’ losses to allocate resources based on loss reduction, not resource fairness. This helps applications quickly produce approximate models with high predictive power.
9
SLAQ Design Challenges
Find a universal way to measure how much work an application has done. Accurately predict an application’s loss and runtime. Do this online, as many of these jobs are “one-off” jobs.
10
SLAQ Approach
ML jobs send progress reports to the scheduler, which predicts future loss online. [Diagram: jobs send loss updates to the scheduler; the scheduler runs prediction and resource allocation; each job sends tasks to workers holding model replicas and data shards, which return model updates.]
11
Unifying Different ML Metrics
Desired properties of a unified metric: applicable to all algorithms, comparable magnitudes, known range, predictable. Candidate metrics: accuracy, PRC, F1 score, confusion matrix.
12
Unifying Different ML Metrics
Desired properties of a unified metric: applicable to all algorithms, comparable magnitudes, known range, predictable. Candidate metrics: accuracy/PRC/F1 score/confusion matrix, loss, normalized loss, ∆loss, and normalized ∆loss.
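One way to make ∆loss comparable across jobs is to scale each iteration's loss reduction by the largest reduction that job has seen; this normalization choice is a hedged sketch in the spirit of the slides, not SLAQ's exact formula:

```python
# Sketch of a normalized ∆loss: scale each iteration's loss reduction by
# the job's largest observed reduction, so jobs on very different loss
# scales become comparable. Illustrative normalization, not SLAQ's code.

def normalized_delta_loss(loss_history):
    deltas = [loss_history[i] - loss_history[i + 1]
              for i in range(len(loss_history) - 1)]
    largest = max(deltas)
    if largest <= 0:
        return [0.0 for _ in deltas]
    return [d / largest for d in deltas]

# Two jobs with loss magnitudes differing by 100x yield the same curve:
job_a = normalized_delta_loss([100.0, 50.0, 30.0, 25.0])
job_b = normalized_delta_loss([1.0, 0.5, 0.3, 0.25])
```

Both jobs normalize to the same ∆loss curve, which is the "comparable magnitudes" property the slide asks for.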
13
ML algorithms have similar normalized ∆loss
14
Iteration CPU-time is Predictable
15
Fitting Loss Curves to Predict Progress
Convex optimization converges sublinearly at O(1/n) or O(1/n^2), or superlinearly at O(μ^n) with μ ≤ 1. Use a weighted loss history to fit a curve and extrapolate future loss.
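The curve-fitting step can be sketched as a weighted least-squares fit of a sublinear model loss(k) ≈ a + b/k; the weighting scheme and model family here are assumptions in the spirit of the slide, not SLAQ's exact implementation:

```python
import numpy as np

# Sketch of loss-curve fitting: fit loss(k) ≈ a + b/k to the observed
# history, weighting recent iterations more heavily, then extrapolate
# to predict the loss at a future iteration. Illustrative only.

def fit_and_predict(loss_history, future_iteration):
    k = np.arange(1, len(loss_history) + 1, dtype=float)
    X = np.column_stack([np.ones_like(k), 1.0 / k])   # features [1, 1/k]
    w = np.sqrt(k / k.sum())                          # recent points weigh more
    coef, *_ = np.linalg.lstsq(X * w[:, None],
                               np.array(loss_history) * w, rcond=None)
    a, b = coef
    return a + b / future_iteration

history = [1.0 / k for k in range(1, 11)]             # an exact O(1/k) curve
pred = fit_and_predict(history, 100)
```

On this synthetic curve the fit recovers a = 0, b = 1, so the predicted loss at iteration 100 is 1/100.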
16
How To Assign Resources
Assigning resources is an optimization problem. Different optimization goals are available, as with fair resource scheduling: for example, maximizing total quality vs. maximizing minimum quality.
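Under the "maximize total quality" goal, one natural allocation is greedy: repeatedly give the next core to the job with the highest predicted marginal loss reduction. The prediction table below is a hypothetical stand-in for SLAQ's fitted curves, so this is a sketch of the idea rather than the scheduler's actual algorithm:

```python
# Greedy allocation sketch: each core goes to the job whose predicted
# loss reduction improves most from one more core (maximizing total
# quality). predicted_reduction is a hypothetical stand-in for the
# scheduler's fitted loss-curve predictions.

def allocate(num_cores, predicted_reduction):
    """predicted_reduction[j][c] = predicted loss reduction of job j
    when holding c cores, for c = 0..num_cores."""
    alloc = [0] * len(predicted_reduction)
    for _ in range(num_cores):
        # marginal gain of one more core, per job
        gains = [predicted_reduction[j][alloc[j] + 1] -
                 predicted_reduction[j][alloc[j]]
                 for j in range(len(alloc))]
        best = max(range(len(gains)), key=gains.__getitem__)
        alloc[best] += 1
    return alloc

# Job 0 is nearly converged (small marginal gains after its first core);
# job 1 still gains steadily, so it attracts most of the cores:
table = [[0, 10, 12, 13, 13.5], [0, 4, 8, 12, 16]]
alloc = allocate(4, table)
```

With concave (diminishing-returns) prediction curves, this greedy rule maximizes the summed predicted quality gain, which is why nearly-converged jobs naturally cede resources.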
17
Experimental Setup
20 Amazon c3.8xlarge instances, for a total of 600+ cores. Tested with many ML algorithms spanning classification, regression, and unsupervised learning: GBT, SVM, logistic regression, linear regression, k-means, LDA, MLPC — most algorithms in Spark MLlib. Model sizes vary from 10 KB to 10 MB, with 200 GB+ of training data. Workload: a synthetic trace of ML jobs arriving on average every 15 seconds.
18
SLAQ jobs have lower average loss
Average loss is lower, so average model quality is higher.
19
Resources belong to newer jobs
For models converged to 97% or less of their total loss reduction, SLAQ is faster; more than twice as fast for models converged to 80%.
20
Conclusion
SLAQ is a scheduler for ML jobs that allocates resources based on work accomplished (loss reduction), not resource fairness. Average quality improvement up to 73%; delay reduction up to 44%.