Jockey: Guaranteed Job Latency in Data Parallel Clusters. Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca.

Presentation transcript:

Jockey: Guaranteed Job Latency in Data Parallel Clusters. Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca

DATA PARALLEL CLUSTERS

Predictability

DATA PARALLEL CLUSTERS: Deadline

VARIABLE LATENCY: 4.3x

Why does latency vary? 1. Pipeline complexity. 2. Noisy execution environment.

Cosmos: MICROSOFT'S DATA PARALLEL CLUSTERS. Components: CosmosStore, Dryad, SCOPE.

DRYAD'S DAG WORKFLOW
In a Cosmos cluster, a pipeline of jobs runs against a deadline; each job is a DAG of stages, and each stage consists of tasks.

EXPRESSING PERFORMANCE TARGETS
Priorities? Not expressive enough. Weights? Difficult for users to set. Utility curves? Capture deadline and penalty.

OUR GOAL
Maximize utility while minimizing resources by dynamically adjusting the allocation.

Jockey
Targets large clusters with many users, and relies on prior executions of the job.

JOCKEY – MODEL
f(job state, allocation) -> remaining run time

JOCKEY – CONTROL LOOP

JOCKEY – MODEL
Job state is summarized by a progress indicator: f(progress, allocation) -> remaining run time

JOCKEY – PROGRESS INDICATOR
For each stage: (# complete / total tasks) x (total running + total queuing).
Job progress = Stage 1 + Stage 2 + Stage 3 + ...
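A minimal sketch of this per-stage progress computation, assuming profiled running and queuing totals are available per stage; all names here are illustrative, not Jockey's actual API:

```python
from dataclasses import dataclass

@dataclass
class StageStats:
    complete: int         # tasks finished so far
    total: int            # total tasks in the stage
    total_running: float  # profiled total task running time for the stage
    total_queuing: float  # profiled total task queuing time for the stage

def job_progress(stages: list[StageStats]) -> float:
    """Sum each stage's completed fraction, weighted by running + queuing time."""
    return sum(
        (s.complete / s.total) * (s.total_running + s.total_queuing)
        for s in stages
    )

# Example: three stages at different completion levels.
stages = [
    StageStats(complete=80, total=100, total_running=400.0, total_queuing=100.0),
    StageStats(complete=20, total=200, total_running=900.0, total_queuing=300.0),
    StageStats(complete=0,  total=50,  total_running=250.0, total_queuing=50.0),
]
print(job_progress(stages))  # progress expressed in weighted time units of completed work
```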

JOCKEY – CONTROL LOOP
The job model maps (progress, allocation) to predicted remaining run time:

               10 nodes     20 nodes     30 nodes
1% complete    60 minutes   40 minutes   25 minutes
2% complete    59 minutes   39 minutes   24 minutes
3% complete    58 minutes   37 minutes   22 minutes
4% complete    56 minutes   36 minutes   21 minutes
5% complete    54 minutes   34 minutes   20 minutes

Example queries: deadline 50 min at 1% completion; deadline 40 min at 3% completion; deadline 30 min at 5% completion. In each case the control loop picks an allocation whose predicted remaining time meets the deadline.
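A hedged sketch of that control decision, assuming the model is simply a lookup table like the one above (progress % -> allocation -> predicted minutes); table values and function names are illustrative:

```python
# Model table: progress percent -> {node count: predicted remaining minutes}.
MODEL = {
    1: {10: 60, 20: 40, 30: 25},
    2: {10: 59, 20: 39, 30: 24},
    3: {10: 58, 20: 37, 30: 22},
    4: {10: 56, 20: 36, 30: 21},
    5: {10: 54, 20: 34, 30: 20},
}

def pick_allocation(progress_pct: int, minutes_to_deadline: float) -> int:
    """Return the smallest allocation predicted to finish before the deadline."""
    predictions = MODEL[progress_pct]
    feasible = [nodes for nodes, remaining in predictions.items()
                if remaining <= minutes_to_deadline]
    # If nothing meets the deadline, fall back to the largest allocation.
    return min(feasible) if feasible else max(predictions)

print(pick_allocation(1, 50))  # -> 20 (40 min <= 50 min deadline)
print(pick_allocation(5, 30))  # -> 30 (20 min <= 30 min deadline)
```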

JOCKEY – MODEL
f(progress, allocation) -> remaining run time. An analytic model? Machine learning? Jockey uses a simulator.

JOCKEY
Problem               Solution
Pipeline complexity   Use a simulator
Noisy environment     Dynamic control

Jockey in Action
A real job on a production cluster with ~80% CPU load.

Jockey in Action
Initial deadline: 140 minutes. New deadline: 70 minutes; resources are released due to excess pessimism.
"Oracle" allocation: the total allocation-hours needed to just meet the deadline. Figure annotations: available parallelism less than allocation; allocation above oracle.

Evaluation
Production cluster, 21 jobs. Was the SLO met? What was the impact on the cluster?

Evaluation: jobs which met the SLO
Jockey missed only 1 of 94 deadlines, though it allocated too many resources. The simulator made good predictions: 80% finish before the deadline. The control loop is stable and successful.

Conclusion
Data parallel jobs are complex, yet users demand deadlines. Jobs run in shared, noisy clusters, making simple models inaccurate.

Jockey: a simulator plus a control loop to meet deadlines.

Questions? Andrew Ferguson
Co-authors: Peter Bodík (Microsoft Research), Srikanth Kandula (Microsoft Research), Eric Boutin (Microsoft), Rodrigo Fonseca (Brown)

Backup Slides

Utility Curves
The curve captures the deadline. For single jobs, the scale doesn't matter; for multiple jobs, use financial penalties.
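As an illustration, a deadline-based utility curve can be modeled as a step function that drops at the deadline; the shape and values below are assumptions consistent with the slide, not the paper's exact curves:

```python
def utility(completion_minutes: float, deadline: float,
            reward: float = 1.0, penalty: float = -1.0) -> float:
    """Step utility: full reward before the deadline, a financial penalty after.

    For a single job only the shape matters (any positive reward works);
    with multiple jobs, reward/penalty magnitudes encode real financial costs.
    """
    return reward if completion_minutes <= deadline else penalty

print(utility(45, deadline=50))  # 1.0  (met the deadline)
print(utility(55, deadline=50))  # -1.0 (missed: pay the penalty)
```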

Jockey: resource allocation control loop
Stabilized with three control-theory techniques: 1. Slack. 2. Hysteresis. 3. Dead zone.
(Figure labels: prediction, run time, utility.)
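A hedged sketch of how these three techniques might wrap a raw allocation target; parameter names and values are illustrative assumptions, not Jockey's tuned settings:

```python
def stabilized_allocation(raw_target: int, current: int,
                          slack: float = 1.2, dead_zone: float = 0.1,
                          shrink_delay_ticks: int = 3,
                          ticks_below: int = 0) -> int:
    """Apply slack, a dead zone, and hysteresis to a raw allocation target.

    - Slack: over-provision the target to absorb prediction error.
    - Dead zone: ignore changes smaller than a fraction of the current allocation.
    - Hysteresis: grow immediately, but shrink only after the target has stayed
      below the current allocation for several consecutive control ticks.
    """
    target = int(raw_target * slack)                 # slack
    if abs(target - current) < dead_zone * current:  # dead zone
        return current
    if target > current:                             # grow right away
        return target
    # Hysteresis: only release resources after sustained lower demand.
    return target if ticks_below >= shrink_delay_ticks else current

print(stabilized_allocation(raw_target=25, current=20))                 # 30: grow
print(stabilized_allocation(raw_target=10, current=20, ticks_below=1))  # 20: hold
print(stabilized_allocation(raw_target=10, current=20, ticks_below=3))  # 12: shrink
```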

Resource sharing in Cosmos
- Resources are allocated with a form of fair sharing across business groups and their jobs (like the Hadoop FairScheduler or CapacityScheduler).
- Each job is guaranteed a number of tokens, as dictated by cluster policy; each running or initializing task uses one token, which is released on task completion.
- A token is a guaranteed share of CPU and memory.
- To increase efficiency, unused tokens are re-allocated to jobs with available work.
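A minimal sketch of this token accounting, with the class, names, and the simplified borrowing rule (no preemption) all being assumptions for illustration:

```python
class TokenPool:
    """Guaranteed tokens per job; idle capacity flows to jobs with pending work."""

    def __init__(self, guarantees: dict[str, int]):
        self.guarantees = guarantees                    # job -> guaranteed tokens
        self.in_use = {job: 0 for job in guarantees}    # job -> tokens held

    def acquire(self, job: str) -> bool:
        """A starting or initializing task takes one token, if one is available.

        A job can always use its guaranteed tokens; beyond that it may borrow
        from the cluster's unused capacity (simplified: no preemption).
        """
        total_capacity = sum(self.guarantees.values())
        if (self.in_use[job] < self.guarantees[job]
                or sum(self.in_use.values()) < total_capacity):
            self.in_use[job] += 1
            return True
        return False

    def release(self, job: str) -> None:
        """Token released when the task completes."""
        self.in_use[job] -= 1

pool = TokenPool({"jobA": 2, "jobB": 1})
print(pool.acquire("jobA"), pool.acquire("jobA"))  # True True (guaranteed)
print(pool.acquire("jobA"))                        # True (borrows jobB's idle token)
```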

Jockey progress indicator
- Many features of a job can be used to build a progress indicator; earlier work (ParaTimer) concentrated on the fraction of tasks completed.
- Our indicator is very simple, but we found it performs best for Jockey's needs. It combines: total vertex initialization time, total vertex run time, and the fraction of completed vertices.

Comparison with ARIA
- ARIA uses analytic models, designed for three stages: Map, Shuffle, Reduce.
- Jockey's control loop is robust due to control-theory improvements.
- ARIA was tested on a small (66-node) cluster without a network bottleneck.
- We believe Jockey is a better match for production DAG frameworks such as Hive, Pig, etc.

Jockey latency prediction: C(p, a)
Event-based simulator:
- Same scheduling logic as the actual Job Manager
- Captures important features of job progress
- Does not model input-size variation or speculative re-execution of stragglers
- Inputs: job algebra, distributions of task timings, probabilities of failures, allocation
Analytic model:
- Inspired by Amdahl's Law: T = S + P/N
- S is the remaining work on the critical path, P is all remaining work, N is the number of machines
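A short sketch of the analytic alternative in the spirit of T = S + P/N; the function name and example numbers are illustrative:

```python
def remaining_time_amdahl(critical_path_work: float,
                          total_remaining_work: float,
                          machines: int) -> float:
    """Analytic estimate T = S + P/N.

    critical_path_work (S): serial work remaining on the critical path.
    total_remaining_work (P): all remaining work, spread across N machines.
    """
    return critical_path_work + total_remaining_work / machines

# Example: 10 min of critical-path work, 600 machine-minutes of remaining
# work, 30 machines -> predicted 30 minutes remaining.
print(remaining_time_amdahl(10.0, 600.0, 30))  # 30.0
```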

Jockey resource allocation control loop
- Executes in Dryad's Job Manager.
- Inputs: fraction of completed tasks in each stage, time the job has spent running, the utility function, and precomputed values (for speedup).
- Output: number of tokens to allocate.
- Improved with techniques from control theory.
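Putting the pieces together, a hedged end-to-end sketch of one control tick, reusing MODEL and pick_allocation from the earlier lookup-table sketch; all names are illustrative and the real loop runs inside Dryad's Job Manager:

```python
def control_tick(progress_pct: int, elapsed_min: float,
                 deadline_min: float) -> int:
    """One pass of the loop: read progress, consult the model, emit an allocation.

    The deadline is expressed relative to job start, so the remaining time
    budget shrinks as the job runs.
    """
    budget = deadline_min - elapsed_min  # minutes left before the deadline
    return pick_allocation(progress_pct, budget)

# At 3% complete, 10 minutes into a job with a 40-minute deadline:
print(control_tick(3, 10.0, 40.0))  # budget 30 -> 30 nodes (22 min <= 30)
```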

Jockey architecture
Offline: the job profile and job stats feed the simulator, which produces latency predictions.
During job runtime: the resource allocation control loop combines the latency predictions with the utility function to set the allocation for the running job.