Jockey: Guaranteed Job Latency in Data Parallel Clusters
Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca
DATA PARALLEL CLUSTERS
Predictability
Deadlines
VARIABLE LATENCY
The latency of the same job varies from run to run, by as much as 4.3x.
Why does latency vary?
1. Pipeline complexity
2. Noisy execution environment
MICROSOFT'S DATA PARALLEL CLUSTERS
Cosmos: CosmosStore (storage), Dryad (dataflow execution), SCOPE (query language)
DRYAD'S DAG WORKFLOW
A Cosmos cluster runs pipelines; each pipeline consists of jobs with deadlines.
A job is a DAG of stages, and each stage consists of parallel tasks.
EXPRESSING PERFORMANCE TARGETS
Priorities? Not expressive enough.
Weights? Difficult for users to set.
Utility curves? Capture deadline & penalty.
OUR GOAL
Maximize utility while minimizing resources by dynamically adjusting the allocation.
JOCKEY
Setting: large clusters, many users, and prior executions of the same job.
JOCKEY – MODEL
f(job state, allocation) -> remaining run time
JOCKEY – CONTROL LOOP
JOCKEY – MODEL
Simplify the job state to a single progress value:
f(progress, allocation) -> remaining run time
JOCKEY – PROGRESS INDICATOR
For each stage, track two quantities: total running + total queuing time, and # complete / total tasks.
Job progress combines the per-stage values: Stage 1 + Stage 2 + Stage 3. One plausible combination is sketched below.
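A minimal sketch of how such a progress indicator might be computed. The `StageStats` fields and the weighting scheme are illustrative assumptions, not Jockey's exact formula (the backup slides list the features the real indicator combines).

```python
# Sketch of a per-stage progress indicator (illustrative names;
# the weighting is an assumption, not Jockey's exact formula).
from dataclasses import dataclass

@dataclass
class StageStats:
    tasks_complete: int
    tasks_total: int        # assumed > 0
    running_time: float     # total task run time so far (seconds)
    queuing_time: float     # total task queuing time so far (seconds)

def job_progress(stages: list[StageStats]) -> float:
    """Combine per-stage completion fractions, weighting each stage
    by its share of total running + queuing time."""
    total_time = sum(s.running_time + s.queuing_time for s in stages)
    if total_time == 0:
        return 0.0
    return sum(
        (s.tasks_complete / s.tasks_total)
        * (s.running_time + s.queuing_time) / total_time
        for s in stages
    )
```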
JOCKEY – CONTROL LOOP
The job model precomputes the remaining run time for each (progress, allocation) pair:

               10 nodes   20 nodes   30 nodes
1% complete    60 min     40 min     25 min
2% complete    59 min     39 min     24 min
3% complete    58 min     37 min     22 min
4% complete    56 min     36 min     21 min
5% complete    54 min     34 min     20 min

Walkthrough (the time to deadline shrinks as the job runs):
Deadline 50 min at 1% complete -> 20 nodes suffice (40 min).
Deadline 40 min at 3% complete -> 20 nodes still suffice (37 min).
Deadline 30 min at 5% complete -> grow to 30 nodes (20 min).
The sketch below implements this lookup.
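A sketch of the lookup performed in the walkthrough, using the hypothetical table values above; `pick_allocation` and the table layout are illustrative names, not Jockey's API.

```python
# {progress: {nodes: predicted remaining minutes}}, from the table above.
predictions = {
    0.01: {10: 60, 20: 40, 30: 25},
    0.03: {10: 58, 20: 37, 30: 22},
    0.05: {10: 54, 20: 34, 30: 20},
}

def pick_allocation(progress: float, minutes_to_deadline: float) -> int:
    """Smallest allocation predicted to finish before the deadline;
    fall back to the maximum if none suffices."""
    row = predictions[progress]
    for nodes in sorted(row):
        if row[nodes] <= minutes_to_deadline:
            return nodes
    return max(row)

# The three walkthrough steps from the slide:
assert pick_allocation(0.01, 50) == 20   # 40 min <= 50 min
assert pick_allocation(0.03, 40) == 20   # 37 min <= 40 min
assert pick_allocation(0.05, 30) == 30   # 20 min <= 30 min
```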
JOCKEY – MODEL
f(progress, allocation) -> remaining run time
Analytic model? Machine learning? Jockey uses a simulator.
JOCKEY
Problem                 Solution
Pipeline complexity     Use a simulator
Noisy environment       Dynamic control
Jockey in Action
A real job on a production cluster at ~80% CPU load.
Initial deadline: 140 minutes.
New deadline: 70 minutes -> Jockey raises the allocation, then releases resources due to excess pessimism.
"Oracle" allocation: total allocation-hours / deadline (see the sketch below).
Early in the run, the available parallelism is less than the allocation; later, the allocation stays above the oracle.
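For concreteness, a tiny sketch of the "oracle" baseline described above: a constant allocation that spreads the job's total work evenly up to the deadline, assuming perfect knowledge of its total allocation-hours. The function name and example numbers are invented.

```python
def oracle_allocation(total_allocation_hours: float,
                      deadline_hours: float) -> float:
    """Constant allocation that exactly consumes the job's total
    work by the deadline (requires hindsight)."""
    return total_allocation_hours / deadline_hours

# e.g. a job needing 60 allocation-hours with a 2-hour deadline
# would hold a constant 30 nodes.
assert oracle_allocation(60, 2) == 30
```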
Evaluation
Production cluster, 21 jobs.
Did jobs meet their SLOs? What was the impact on the cluster?
Evaluation: jobs which met the SLO
Jockey missed only 1 of 94 deadlines.
Simpler policies met deadlines only by allocating too many resources.
The simulator made good predictions: 80% finish before the deadline.
The control loop is stable and successful.
Conclusion
Data parallel jobs are complex, yet users demand deadlines.
Jobs run in shared, noisy clusters, making simple models inaccurate.
Jockey meets the deadline by pairing a simulator (for pipeline complexity) with a control loop (for the noisy environment).
Questions?
Andrew Ferguson
Co-authors: Peter Bodík (Microsoft Research), Srikanth Kandula (Microsoft Research), Eric Boutin (Microsoft), Rodrigo Fonseca (Brown)
Backup Slides
Utility Curves
Deadline-shaped: full value before the deadline, dropping after it.
For single jobs, the scale of the curve doesn't matter.
For multiple jobs, use financial penalties.
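A minimal sketch of such a utility curve, assuming a flat reward up to the deadline and a linear financial penalty after it; the constants are made up.

```python
def utility(completion_min: float, deadline_min: float,
            reward: float = 100.0, penalty_per_min: float = 5.0) -> float:
    """Full reward on time; a growing financial penalty when late."""
    if completion_min <= deadline_min:
        return reward
    return reward - penalty_per_min * (completion_min - deadline_min)
```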
Jockey resource allocation control loop, made robust with:
1. Slack
2. Hysteresis
3. Dead zone
(Diagram: prediction -> run time -> utility)
Resource sharing in Cosmos
Resources are allocated with a form of fair sharing across business groups and their jobs (like the Hadoop FairScheduler or CapacityScheduler).
Each job is guaranteed a number of tokens as dictated by cluster policy; each running or initializing task uses one token, released on task completion.
A token is a guaranteed share of CPU and memory.
To increase efficiency, unused tokens are re-allocated to jobs with available work, as sketched below.
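An illustrative sketch of token-based sharing along these lines; the redistribution order and all names are assumptions, not the actual Cosmos policy.

```python
def allocate_tokens(guaranteed: dict[str, int],
                    runnable: dict[str, int]) -> dict[str, int]:
    # Start from each job's guaranteed share, capped by its runnable work.
    alloc = {j: min(guaranteed[j], runnable[j]) for j in guaranteed}
    spare = sum(guaranteed.values()) - sum(alloc.values())
    # Re-allocate unused tokens to jobs that still have work available.
    for j in sorted(alloc, key=lambda k: runnable[k] - alloc[k], reverse=True):
        extra = min(spare, runnable[j] - alloc[j])
        alloc[j] += extra
        spare -= extra
    return alloc

# e.g. allocate_tokens({"A": 10, "B": 10}, {"A": 4, "B": 14})
# -> {"A": 4, "B": 14}: B absorbs A's unused tokens, up to B's work.
```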
Jockey progress indicator
Many features of a job could feed a progress indicator; earlier work (ParaTimer) concentrated on the fraction of tasks completed.
Our indicator is very simple, but we found it performs best for Jockey's needs. It combines:
Total vertex initialization time
Total vertex run time
Fraction of completed vertices
Comparison with ARIA
ARIA uses analytic models, designed for 3 stages: Map, Shuffle, Reduce.
Jockey's control loop is robust due to control-theory improvements.
ARIA was tested on a small (66-node) cluster without a network bottleneck.
We believe Jockey is a better match for production DAG frameworks such as Hive, Pig, etc.
Jockey latency prediction: C(p, a)
Event-based simulator
– Same scheduling logic as the actual Job Manager
– Captures important features of job progress
– Does not model input size variation or speculative re-execution of stragglers
– Inputs: job algebra, distributions of task timings, probabilities of failures, allocation
Analytic model
– Inspired by Amdahl's Law: T = S + P/N
– S is the remaining work on the critical path, P is all remaining work, N is the number of machines
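A worked instance of the analytic model above, with S = 10 minutes of critical-path work and P = 600 machine-minutes of total remaining work (example numbers, not from the paper).

```python
def predicted_remaining_time(S: float, P: float, N: int) -> float:
    """Amdahl's-Law-style estimate: serial critical path plus the
    parallelizable remainder spread over N machines."""
    return S + P / N

# Doubling the machines only shrinks the parallel term:
assert predicted_remaining_time(10, 600, 20) == 40.0
assert predicted_remaining_time(10, 600, 40) == 25.0
```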
Jockey resource allocation control loop
Executes in Dryad's Job Manager.
Inputs: fraction of completed tasks in each stage, time the job has spent running, utility function, precomputed values (for speedup).
Output: number of tokens to allocate.
Improved with techniques from control theory, as sketched below.
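A hedged sketch of how slack, hysteresis, and a dead zone might shape the loop's output; the constants and the grow-fast/release-slow policy are illustrative interpretations, not Jockey's actual controller.

```python
def next_allocation(current: int, desired: int,
                    slack: float = 1.2, dead_zone: int = 2,
                    step: int = 5) -> int:
    target = round(desired * slack)          # slack: over-provision a bit
    if abs(target - current) <= dead_zone:   # dead zone: ignore tiny deltas
        return current
    if target > current:                     # grow immediately to protect
        return target                        # the deadline
    return max(target, current - step)       # release slowly (hysteresis-
                                             # like damping) to avoid flapping
```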
Jockey architecture
Offline: job profile -> simulator -> latency predictions.
During job runtime: running job -> job stats -> resource allocation control loop (with latency predictions and the utility function) -> resource allocation.
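Putting the pieces together, a sketch of one control-loop tick under the architecture above; all names are illustrative, and it reuses the shapes of the `predictions` table and `utility` sketches from earlier.

```python
def control_loop_tick(progress: float,
                      elapsed_min: float,
                      predictions: dict[float, dict[int, float]],
                      utility_fn) -> int:
    """Return the node count to request for this tick."""
    # Look up the prediction row nearest the current progress.
    nearest = min(predictions, key=lambda p: abs(p - progress))
    remaining = predictions[nearest]   # {nodes: predicted minutes left}
    # Prefer the highest utility; among ties, the smallest allocation.
    return max(remaining,
               key=lambda n: (utility_fn(elapsed_min + remaining[n]), -n))

# Example, reusing the table and utility sketches above:
# control_loop_tick(0.012, 10.0, predictions,
#                   lambda completion: utility(completion, deadline_min=50))
```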