1
SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Logan Stafman*, Haoyu Zhang *, Andrew Or, Michael J Freedman
2
Machine Learning in Data Science
Machine learning in clusters has become ubiquitous. Growth in training data and model sizes leads to high resource contention.
3
Machine Learning Overview
Repeatedly update the model; a usable model is available after each iteration. Loss indicates how well a model is trained. [Diagram: choose initial model → update model → model/loss, repeating]
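The iterative loop above can be sketched in a few lines; this is a toy gradient-descent example on a quadratic loss (all names illustrative, not from SLAQ):

```python
# Minimal sketch of iterative training: the model is updated repeatedly,
# a usable model exists after every iteration, and the loss tracks how
# well trained it is. Toy quadratic loss, illustrative only.

def train(initial_w, lr=0.1, iterations=20):
    w = initial_w
    losses = []
    for _ in range(iterations):
        loss = (w - 3.0) ** 2        # loss: distance from the optimum at 3.0
        grad = 2.0 * (w - 3.0)       # gradient of the loss
        losses.append(loss)
        w -= lr * grad               # update model; after this line, a
                                     #   (partially trained) model is available
    return w, losses

w, losses = train(0.0)
```

Each pass through the loop corresponds to one "Update Model" step in the diagram, with the loss recorded per iteration.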
4
Cluster Schedulers Use Fairness
Schedulers use “fair-share” algorithms to evenly allocate resources. ML jobs have three key features that make fair-share an inappropriate choice: they are approximate, they exhibit diminishing returns, and training is an exploratory process.
5
Machine Learning is Approximate
[Diagram: an ML job sends tasks to workers; each worker holds a model replica and data shards, and sends model updates back to the job's central model.]
6
ML Jobs Have Diminishing Returns
Roughly 80% of the work is done in the first 20% of the time. This pattern is common across many ML algorithms, but applies only to convex optimization problems.
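The "80% of the work in 20% of the time" pattern can be checked on a hypothetical sublinear O(1/k) loss trace (synthetic numbers, not measurements from the talk):

```python
# Illustrative check of diminishing returns on a synthetic O(1/k)
# convergence curve: how much of the total loss reduction happens
# in the first 20% of iterations?

iters = 100
loss = [1.0 / (k + 1) for k in range(iters)]        # loss after iteration k
total_reduction = loss[0] - loss[-1]                # reduction over the full run
early_reduction = loss[0] - loss[iters // 5 - 1]    # reduction after 20% of iters
fraction = early_reduction / total_reduction
```

For this curve, well over 80% of the total loss reduction lands in the first 20 iterations, which is exactly why equal-share allocation wastes resources on nearly-converged jobs.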
7
ML Training is Exploratory
Data scientists run and rerun experiments while varying hyperparameters, dataset features, and model structures, and might launch several experiments concurrently. [Diagram: modify features / modify hyperparameters / modify model structure → run ML training algorithm, in a loop]
8
Our Solution: SLAQ
SLAQ uses knowledge of ML applications’ losses to allocate resources based on loss reduction, not resource fairness. This helps applications quickly produce approximate models with high predictive power.
9
SLAQ Design Challenges
Find a universal way to measure how much work an application has done. Accurately predict an application’s loss and runtime. Do this online, as many of these jobs are “one-off” jobs.
10
SLAQ Approach
ML jobs send progress reports to the scheduler, which predicts future loss online. [Diagram: jobs send loss updates to the scheduler; the scheduler runs prediction and resource allocation; each job sends tasks to workers holding model replicas and data shards, which return model updates.]
11
Unifying Different ML Metrics
Desired properties of a unified metric: applicable to all algorithms, comparable magnitudes, known range, predictable. Candidate metrics: accuracy, PRC, F1 score, confusion matrix.
12
Unifying Different ML Metrics
Desired properties of a unified metric: applicable to all algorithms, comparable magnitudes, known range, predictable. Candidate metrics: accuracy/PRC/F1 score/confusion matrix, loss, normalized loss, ∆loss, and normalized ∆loss.
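One way to make ∆loss comparable across jobs is to scale each iteration's loss reduction by the largest reduction that job has seen; this normalization choice is a hedged sketch in the spirit of the slides, not SLAQ's exact formula:

```python
# Sketch of a normalized ∆loss: scale each iteration's loss reduction by
# the job's largest observed reduction, so jobs on very different loss
# scales become comparable. Illustrative normalization, not SLAQ's code.

def normalized_delta_loss(loss_history):
    deltas = [loss_history[i] - loss_history[i + 1]
              for i in range(len(loss_history) - 1)]
    largest = max(deltas)
    if largest <= 0:
        return [0.0 for _ in deltas]
    return [d / largest for d in deltas]

# Two jobs with loss magnitudes differing by 100x yield the same curve:
job_a = normalized_delta_loss([100.0, 50.0, 30.0, 25.0])
job_b = normalized_delta_loss([1.0, 0.5, 0.3, 0.25])
```

Both jobs normalize to the same ∆loss curve, which is the "comparable magnitudes" property the slide asks for.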
13
ML algorithms have similar normalized ∆loss
14
Iteration CPU-time is Predictable
15
Fitting Loss Curves to Predict Progress
Convex optimization converges sublinearly at O(1/n) or O(1/n^2), or superlinearly at O(μ^n) with μ ≤ 1. Use a weighted loss history to fit a curve and extrapolate future loss.
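The curve-fitting step can be sketched as a weighted least-squares fit of a sublinear model loss(k) ≈ a + b/k; the weighting scheme and model family here are assumptions in the spirit of the slide, not SLAQ's exact implementation:

```python
import numpy as np

# Sketch of loss-curve fitting: fit loss(k) ≈ a + b/k to the observed
# history, weighting recent iterations more heavily, then extrapolate
# to predict the loss at a future iteration. Illustrative only.

def fit_and_predict(loss_history, future_iteration):
    k = np.arange(1, len(loss_history) + 1, dtype=float)
    X = np.column_stack([np.ones_like(k), 1.0 / k])   # features [1, 1/k]
    w = np.sqrt(k / k.sum())                          # recent points weigh more
    coef, *_ = np.linalg.lstsq(X * w[:, None],
                               np.array(loss_history) * w, rcond=None)
    a, b = coef
    return a + b / future_iteration

history = [1.0 / k for k in range(1, 11)]             # an exact O(1/k) curve
pred = fit_and_predict(history, 100)
```

On this synthetic curve the fit recovers a = 0, b = 1, so the predicted loss at iteration 100 is 1/100.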
16
How To Assign Resources
Assigning resources is an optimization problem. Different optimization goals are available, as with fair resource scheduling: for example, maximizing total quality vs. maximizing minimum quality.
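Under the "maximize total quality" goal, one natural allocation is greedy: repeatedly give the next core to the job with the highest predicted marginal loss reduction. The prediction table below is a hypothetical stand-in for SLAQ's fitted curves, so this is a sketch of the idea rather than the scheduler's actual algorithm:

```python
# Greedy allocation sketch: each core goes to the job whose predicted
# loss reduction improves most from one more core (maximizing total
# quality). predicted_reduction is a hypothetical stand-in for the
# scheduler's fitted loss-curve predictions.

def allocate(num_cores, predicted_reduction):
    """predicted_reduction[j][c] = predicted loss reduction of job j
    when holding c cores, for c = 0..num_cores."""
    alloc = [0] * len(predicted_reduction)
    for _ in range(num_cores):
        # marginal gain of one more core, per job
        gains = [predicted_reduction[j][alloc[j] + 1] -
                 predicted_reduction[j][alloc[j]]
                 for j in range(len(alloc))]
        best = max(range(len(gains)), key=gains.__getitem__)
        alloc[best] += 1
    return alloc

# Job 0 is nearly converged (small marginal gains after its first core);
# job 1 still gains steadily, so it attracts most of the cores:
table = [[0, 10, 12, 13, 13.5], [0, 4, 8, 12, 16]]
alloc = allocate(4, table)
```

With concave (diminishing-returns) prediction curves, this greedy rule maximizes the summed predicted quality gain, which is why nearly-converged jobs naturally cede resources.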
17
Experimental Setup
20 Amazon c3.8xlarge instances, for a total of 600+ cores. Tested with many ML algorithms spanning classification, regression, and unsupervised learning: GBT, SVM, logistic regression, linear regression, k-means, LDA, MLPC — most algorithms in Spark MLlib. Model sizes vary from 10 KB to 10 MB, with 200 GB+ of training data. Workload: a synthetic trace of ML jobs arriving on average every 15 seconds.
18
SLAQ jobs have lower average loss
Average loss is lower, so average model quality is higher.
19
Resources belong to newer jobs
For models converged to 97% or less of their total loss reduction, SLAQ is faster; more than twice as fast for models converged to 80%.
20
Conclusion
SLAQ is a scheduler for ML jobs that allocates resources based on work accomplished (loss reduction), not resource fairness. Average quality improvement up to 73%; delay reduction up to 44%.