Cloud Scheduling
CS 525 Spring 2011, Advanced Distributed Systems
Presented by: Muntasir Raihan Rahman and Anupam Das
Department of Computer Science, UIUC
Papers Presented
1. Improving MapReduce Performance in Heterogeneous Environments [OSDI 2008]
   Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica (UC Berkeley RAD Lab)
2. Quincy: Fair Scheduling for Distributed Computing Clusters [SOSP 2009]
   Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, Andrew Goldberg (MSR Silicon Valley)
3. Reining in the Outliers in Map-Reduce Clusters using Mantri [OSDI 2010]
   Ganesh Ananthanarayanan, Ion Stoica (UC Berkeley RAD Lab); Srikanth Kandula, Albert Greenberg (MSR); Yi Lu (UIUC); Bikas Saha, Edward Harris (Microsoft Bing)
Quincy: Fair Scheduling for Distributed Computing Clusters [SOSP 2009]
Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg (MSR Silicon Valley)
Motivation
Fairness:
– The existing Dryad scheduler is unfair because it is greedy.
– Small jobs submitted later must wait for a large job to finish.
Data locality:
– HPC jobs fetch data from a SAN, so data and computation need not be co-located.
– Data intensive workloads have storage attached to the computers.
– Scheduling tasks near their data improves performance.
Fair Sharing
Job X takes t seconds when it runs exclusively on a cluster.
X should take no more than Jt seconds when the cluster has J concurrent jobs.
Formally, for N computers and J jobs, each job should get at least N/J computers.
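
The fairness target above can be written as two one-line formulas. The following is a minimal sketch; the function names are hypothetical, and rounding N/J down to a whole number of computers is an assumption that the slide does not state.

```python
def guaranteed_computers(n_computers: int, n_jobs: int) -> int:
    """Lower bound on computers per job: N / J, rounded down (assumption)."""
    if n_jobs <= 0:
        raise ValueError("need at least one job")
    return n_computers // n_jobs


def runtime_bound(exclusive_seconds: float, n_jobs: int) -> float:
    """Upper bound on a job's runtime under fair sharing: J * t."""
    return n_jobs * exclusive_seconds


# A 240-computer cluster shared by 4 concurrent jobs.
print(guaranteed_computers(240, 4))  # computers each job should get
print(runtime_bound(100.0, 4))       # bound for a job that takes 100 s alone
```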
Quincy Assumptions
Homogeneous clusters
– Heterogeneity is discussed in the next paper [LATE scheduler].
Global uniform cost measure
– E.g. Quincy assumes that the cost of preempting a running job can be expressed in the same units as the cost of data transfer.
Fine Grain Resource Sharing
For MPI jobs, coarse grain scheduling:
– Devote a fixed set of computers to a particular job.
– Static allocation; the allocation rarely changes.
For data intensive jobs (MapReduce, Dryad):
– We need fine grain resource sharing: multiplex all computers in the cluster between all jobs.
– Large datasets are attached to each computer.
– Tasks are independent, so killing and restarting a task is less costly.
Example of Coarse Grain Sharing
Example of Fine Grain Sharing
Goals of Quincy
Fair sharing with locality. For N computers and J jobs:
– Each job gets at least N/J computers.
– Data locality: place tasks near their data to avoid network bottlenecks.
– This is a multi-constraint optimization problem with trade-offs: fairness and data locality are optimized jointly.
– These objectives can be at odds!
Cluster Architecture
Baseline: Queue Based Scheduler
Flow Based Scheduler = Quincy
Main idea: [Matching = Scheduling]
– Construct a graph based on the scheduling constraints and the cluster architecture.
– Assign costs to the edges, so that every matching has a total cost.
– Finding a min cost flow on the graph is equivalent to finding the cheapest feasible schedule.
– Each task is either scheduled on a computer or remains unscheduled.
– Fairness constrains the number of tasks scheduled for each job.
New Goal
Minimize matching cost while obeying fairness constraints.
– Instead of making local (greedy) decisions, solve the problem globally.
Issues:
– How to construct the graph?
– How to embed fairness and locality constraints in the graph?
Details are in the appendix of the paper.
Graph Construction
Start with a directed graph representation of the cluster architecture.
Graph Construction (2)
Add an unscheduled node $U_j$ for each job j.
Each worker task has an edge to $U_j$.
There is a single edge from $U_j$ to the sink.
Edges from tasks to $U_j$ have a high cost.
The cost and flow on the edge from $U_j$ to the sink control fairness.
Fairness is controlled by adjusting the number of tasks allowed for each job.
Graph Construction (3)
Add edges from tasks (T) to computers (C), racks (R), and the cluster (X).
cost(T-C) << cost(T-R) << cost(T-X).
This gives control over data locality.
A zero-cost edge from the root task to its computer avoids preempting the root task.
A Feasible Matching
The cost of a T-U edge increases over time.
A new cost is assigned to the T-C edge of a scheduled task; it increases over time.
Final Graph
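
The graph construction can be tried out on a toy instance. The sketch below is illustrative only: it builds one job with three tasks, one rack with two computers, a cluster aggregator X and an unscheduled node U, then solves min cost flow with a textbook successive-shortest-paths routine. The node names, the edge costs (1, 5, 10, 20) and the super-source "S" feeding one unit of flow into each task are all assumptions made for the example; they are not Quincy's actual cost model or solver.

```python
from collections import defaultdict


class MinCostFlow:
    """Successive shortest paths with Bellman-Ford; fine for tiny graphs."""

    def __init__(self):
        self.adj = defaultdict(list)  # node -> indices into self.edges
        self.edges = []               # [head, residual capacity, cost]

    def add_edge(self, u, v, cap, cost):
        """Add u->v plus its residual twin; return the forward edge index."""
        self.adj[u].append(len(self.edges))
        self.edges.append([v, cap, cost])
        self.adj[v].append(len(self.edges))
        self.edges.append([u, 0, -cost])
        return len(self.edges) - 2

    def solve(self, source, sink):
        flow = cost = 0
        while True:
            dist, via = {source: 0}, {}
            changed = True
            while changed:            # Bellman-Ford relaxation passes
                changed = False
                for u in list(dist):
                    for ei in self.adj[u]:
                        v, cap, c = self.edges[ei]
                        if cap > 0 and dist[u] + c < dist.get(v, float("inf")):
                            dist[v], via[v], changed = dist[u] + c, ei, True
            if sink not in dist:
                return flow, cost
            # Walk back from the sink, find the bottleneck, then augment.
            path, v = [], sink
            while v != source:
                path.append(via[v])
                v = self.edges[via[v] ^ 1][0]
            push = min(self.edges[ei][1] for ei in path)
            for ei in path:
                self.edges[ei][1] -= push
                self.edges[ei ^ 1][1] += push
            flow += push
            cost += push * dist[sink]


g = MinCostFlow()
tasks, computers = ["t1", "t2", "t3"], ["c1", "c2"]
choice_edges = {}  # (task, target) -> edge index, to read the schedule back


def link(task, target, cost):
    choice_edges[(task, target)] = g.add_edge(task, target, 1, cost)


for t in tasks:
    g.add_edge("S", t, 1, 0)  # one unit of flow per task
    link(t, "X", 10)          # run anywhere in the cluster: expensive
    link(t, "U", 20)          # stay unscheduled: most expensive
link("t1", "c1", 1)           # t1's data lives on c1
link("t2", "c1", 1)           # t2's data lives on c1 too
link("t2", "r1", 5)           # t2 also has data elsewhere in rack r1
g.add_edge("X", "r1", 2, 0)
for c in computers:
    g.add_edge("r1", c, 1, 0)
    g.add_edge(c, "T", 1, 0)  # each computer runs at most one task
g.add_edge("U", "T", len(tasks), 0)  # this capacity is the fairness knob

flow, cost = g.solve("S", "T")
placement = {t: tgt for (t, tgt), ei in choice_edges.items()
             if g.edges[ei][1] == 0}  # saturated edge = chosen option
print(flow, cost, placement)
```

In this instance t1 and t2 both prefer c1. The global solution gives c1 to t1, sends t2 through the rack node to the other computer, and leaves t3 unscheduled. That is the kind of trade-off a purely local greedy choice can miss.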
Workloads
Typical Dryad jobs (Sort, Join, PageRank, WordCount, Prime).
Prime is used as a worst-case job that hogs the cluster if started first.
240 computers in the cluster. 8 racks, computers per rack.
More than one metric is used for evaluation.
Experiments
Experiments (2)
Experiments (3)
Experiments (4)
Experiments (5)
Conclusion
New computational model for data intensive computing.
Elegant mapping of scheduling to min cost flow/matching.
Discussion Points
– Min cost flow is recomputed from scratch each time a change occurs. Improvement: incremental flow computation.
– No theoretical stability guarantee.
– Fairness measure: Quincy controls the number of tasks for each job; are there other measures?
– Correlation constraints?
– Other models: auctions to allocate resources.
– Selfish behavior: jobs manipulating costs.
– Heterogeneous data centers.
– Centralized Quincy controller: single point of failure.
Improving MapReduce Performance in Heterogeneous Environments
Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica
Background: Map-Reduce Phases
[Figure: input records are split across map tasks, which emit key-value pairs such as (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5); the shuffle groups the pairs by key (k1 with v1, v3, v5; k2 with v2, v4); reduce tasks produce the output records.]
Motivation
1. MapReduce is becoming popular
– Open-source implementation: Hadoop, used by Yahoo!, Facebook, and others
– Scale: 20 PB/day at Google, O(10,000) nodes at Yahoo, 3000 jobs/day at Facebook
2. Utility computing services like Amazon Elastic Compute Cloud (EC2) provide cheap on-demand computing
– Price: 10 cents / VM / hour
– Scale: thousands of VMs
– Caveat: less control over performance
So even a small increase in performance has a significant impact.
Motivation
MapReduce (Hadoop) performance depends on the task scheduler, which handles stragglers.
The task scheduler makes its decisions on the assumption that cluster nodes are homogeneous.
However, this assumption does not hold in heterogeneous environments like Amazon EC2.
Goal
Improve the performance of speculative execution, which deals with rerunning stragglers:
– Define a new scheduling metric.
– Choose the right machines to run speculative tasks.
– Cap the amount of speculative execution.
Speculative Execution
Slow nodes/stragglers are the main reason jobs do not finish in time.
So, to reduce response time, straggling tasks are speculatively re-executed on other free nodes.
How can this be done in a heterogeneous environment?
Speculative Execution in Hadoop
Each task has an associated progress score in [0, 1]; how it is computed depends on the task type:
– Map task: fraction of input data read.
– Reduce task: 1/3 × (copy phase progress) + 1/3 × (sort phase progress) + 1/3 × (reduce phase progress).
E.g. a task halfway through the reduce phase scores 1/3 × 1 + 1/3 × 1 + 1/3 × 1/2 = 5/6.
Speculative execution threshold: progress < avgProgress – 0.2
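
The progress score and the speculation test can be written down directly. This is a minimal sketch with hypothetical function names, not Hadoop's code; the 0.2 cutoff is taken from the "20% threshold" that the deck mentions under Assumption 5.

```python
def map_progress(fraction_of_input_read: float) -> float:
    """Map task score: fraction of input data read."""
    return fraction_of_input_read


def reduce_progress(copy: float, sort: float, reduce: float) -> float:
    """Reduce task score: each phase's completed fraction counts for 1/3."""
    return (copy + sort + reduce) / 3


def is_speculated(progress: float, avg_progress: float,
                  threshold: float = 0.2) -> bool:
    """Hadoop-style test: progress < avgProgress - threshold."""
    return progress < avg_progress - threshold


# A reduce task that finished copy and sort and is halfway through reduce.
score = reduce_progress(copy=1.0, sort=1.0, reduce=0.5)
print(score)                       # 5/6
print(is_speculated(score, 0.9))   # not far enough behind an average of 0.9
```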
Hadoop’s Assumptions
1. Nodes can perform work at exactly the same rate.
2. Tasks progress at a constant rate throughout time.
3. There is no cost to launching a speculative task on an idle node.
4. The three phases of execution take approximately the same time.
5. Tasks with a low progress score are stragglers.
6. Maps and Reduces require roughly the same amount of work.
Breaking Down the Assumptions
The first 2 assumptions are about homogeneity, but heterogeneity exists due to:
1. Multiple generations of hardware.
2. Co-location of multiple VMs on the same physical host.
Breaking Down the Assumptions
Assumption 3: There is no cost to launching a speculative task on an idle node.
Not true in situations where resources are shared, e.g.:
– Network bandwidth
– Disk I/O
Breaking Down the Assumptions
Assumption 4: The three phases of execution take approximately the same time.
In practice the copy phase of a reduce task is the slowest, while the other 2 phases are relatively fast.
Suppose 40% of the reducers have finished the copy phase and quickly completed their remaining work, while the remaining 60% are near the end of the copy phase.
– Average progress: 0.4 × 1 + 0.6 × 1/3 = 60%
– Progress of the remaining 60% of reduce tasks: 33.33%
– 33.33% < 60% – 20%, so progress < avgProgress – 0.2 holds and all of these tasks are treated as stragglers.
Breaking Down the Assumptions
Assumption 5: Tasks with a low progress score are stragglers.
Suppose a task has finished 80% of its work, and from that point onward gets really slow.
Because of the 20% threshold it can never be speculated: avgProgress is at most 100%, so the cutoff avgProgress – 20% is at most 80%, and a progress of 80% is never below it.
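
Both counterexamples (Assumptions 4 and 5) can be checked numerically. The sketch below assumes ten reduce tasks to stand in for the 40% / 60% split and uses the 20% cutoff; the variable names are hypothetical.

```python
THRESHOLD = 0.2  # the 20% cutoff used by the progress-score heuristic

# Assumption 4: 4 of 10 reducers are done, 6 are at the end of the copy phase.
progress = [1.0] * 4 + [1 / 3] * 6
avg_progress = sum(progress) / len(progress)
flagged = [p for p in progress if p < avg_progress - THRESHOLD]
print(avg_progress, len(flagged))  # every unfinished reducer is flagged

# Assumption 5: a task stuck at 80% is never flagged, even in the most
# favourable case where the average progress is 100%.
stuck = 0.8
best_case_cutoff = 1.0 - THRESHOLD
print(stuck < best_case_cutoff)
```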
Problems With Speculative Execution
1. Too many backups, thrashing shared resources like network bandwidth
2. Wrong tasks backed up
3. Backups may be placed on slow nodes
Example: ~80% of reduces were observed to be backed up; most of the backups lost to the originals, and the network was thrashed.
Idea: Use Progress Rate
Instead of using the progress score, compute progress rates, and back up tasks that are “far enough” below the mean.
Progress Rate = Progress Score / Execution Time
Problem: can still select the wrong tasks.
Progress Rate Example
A job with 3 tasks.
[Figure: timeline in minutes with marks at 1 min and 2 min. Node 1 runs 1 task/min; Node 2 is 3x slower; Node 3 is 1.9x slower.]
Progress Rate Example
What if the job had 5 tasks?
[Figure: timeline in minutes for Tasks 1 to 5 on Nodes 1 to 3, with a free slot at the 2 min mark.]
At 2 min:
– Node 2: progress rate 0.33, time left 1 min.
– Node 3: progress rate 0.53, time left 1.8 min.
Node 2 is selected because it is slowest, but the scheduler should back up Node 3’s task!
LATE Scheduler
Longest Approximate Time to End
Primary assumption: the best task to speculatively execute is the one that will finish furthest in the future.
Secondary assumption: tasks make progress at an approximately constant rate.
Estimated time left = (1 – Progress Score) / Progress Rate
Caveat: can still select the wrong tasks.
LATE Scheduler
[Figure: the same 5-task timeline, viewed at the 2 min mark.]
– Node 2’s task: progress = 66%, estimated time left: (1 – 0.66) / (1/3) = 1 min
– Node 3’s Task 5: progress = 5.3%, estimated time left: (1 – 0.05) / (1/1.9) = 1.8 min
LATE correctly picks Node 3 and launches a copy of Task 5.
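
The contrast between the two heuristics can be reproduced with the numbers from the example. This is a small sketch: the progress scores and rates are the ones quoted on the slides, and the function and variable names are hypothetical.

```python
def progress_rate(progress_score: float, elapsed_min: float) -> float:
    """Progress Rate = Progress Score / Execution Time."""
    return progress_score / elapsed_min


def estimated_time_left(progress_score: float, rate: float) -> float:
    """LATE estimate: (1 - Progress Score) / Progress Rate."""
    return (1 - progress_score) / rate


# State at the 2 min mark, using the rates from the example.
running = {
    "Node 2": {"score": 0.66, "rate": 1 / 3},
    "Node 3": {"score": 0.05, "rate": 1 / 1.9},
}

# Progress-rate heuristic: back up the task with the lowest rate.
by_rate = min(running, key=lambda n: running[n]["rate"])

# LATE: back up the task with the longest estimated time left.
by_late = max(
    running,
    key=lambda n: estimated_time_left(running[n]["score"], running[n]["rate"]),
)

print(by_rate, by_late)
```

The progress-rate rule lands on Node 2, while the time-left estimate ranks Node 3's task first, matching the slide.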
Other Details of LATE
Cap the number of speculative tasks (SpeculativeCap: 10%)
– Avoids unnecessary speculation
– Limits contention and the resulting loss of throughput
Select fast nodes to launch backups (SlowNodeThreshold: 25th percentile)
– Based on total work performed
Only back up tasks that are sufficiently slow (SlowTaskThreshold: 25th percentile)
– Based on task progress rate
Does not consider data locality
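
Put together, the three thresholds give a decision rule that runs whenever a node asks for work. The sketch below is one possible reading, not the Hadoop implementation: the `Task` class, the nearest-rank percentile, the use of `<=` at the thresholds, and the sample data are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

SPECULATIVE_CAP = 0.10      # fraction of task slots that may run backups
SLOW_NODE_PERCENTILE = 25   # nodes at or below this are too slow for backups
SLOW_TASK_PERCENTILE = 25   # only tasks at or below this rate are candidates


@dataclass
class Task:
    name: str
    progress_score: float   # in [0, 1]
    elapsed_min: float      # assumed > 0
    has_backup: bool = False

    @property
    def rate(self) -> float:
        return self.progress_score / self.elapsed_min

    @property
    def time_left(self) -> float:
        return float("inf") if self.rate == 0 else (1 - self.progress_score) / self.rate


def percentile(values, pct):
    """Nearest-rank percentile (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def pick_backup(tasks, node_work, node, speculative_running, total_slots):
    """Return the task to back up on `node`, or None."""
    if speculative_running >= SPECULATIVE_CAP * total_slots:
        return None  # cap reached
    if node_work[node] <= percentile(list(node_work.values()), SLOW_NODE_PERCENTILE):
        return None  # requesting node is itself slow
    slow_rate = percentile([t.rate for t in tasks], SLOW_TASK_PERCENTILE)
    candidates = [t for t in tasks if not t.has_backup and t.rate <= slow_rate]
    return max(candidates, key=lambda t: t.time_left, default=None)


tasks = [Task("A", 0.66, 2.0), Task("B", 0.05, 0.1)]
tasks += [Task(f"fast{i}", 0.9, 1.0) for i in range(6)]
node_work = {"n1": 10, "n2": 3, "n3": 6, "n4": 8}  # total work done per node

choice = pick_backup(tasks, node_work, "n1", speculative_running=0, total_slots=20)
print(choice.name if choice else None)
```

Both A and B pass the slow-task filter here, and B is chosen because its estimated time left is longer. A request from the slowest node, or one made after the cap is reached, gets no backup.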
Evaluation Environment
Environments:
– Amazon EC2 ( nodes)
– Small local testbed (9 nodes)
Heterogeneity setup:
– Assigning a varying number of VMs to each node
– Running CPU and I/O intensive jobs to intentionally create stragglers
EC2 Sort with Heterogeneity
Each host sorted 128MB, with a total of 30GB of data.
Average 27% speedup over native, 31% over no backups.
EC2 Sort with Stragglers
Each node sorted 256MB, with a total of 25GB of data.
Stragglers were created with 4 CPU-intensive processes (800KB array sort) and 4 disk-intensive processes (dd tasks).
Average 58% speedup over native, 220% over no backups.
EC2 Grep and Wordcount with Stragglers
Grep: 36% gain over native, 57% gain over no backups
WordCount: 8.5% gain over native, 179% gain over no backups
Remarks
Pros:
– Considers heterogeneity that appears in real-life systems.
– LATE speculatively executes, on fast nodes, the tasks that hurt response time the most.
– LATE caps speculative tasks to avoid overloading resources.
Cons:
– Does not consider data locality.
– Tasks may require different amounts of computation.
Discussion Points
– What is the impact of allowing more than one speculative copy of a given task to run?
– How would LATE perform on larger VMs?
– How could we use data locality to improve the performance of LATE?
– How generic are the optimizations made by LATE?
Reining in the Outliers in Map-Reduce Clusters using Mantri
Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris
What this Paper is About
Current schemes (e.g. Hadoop, LATE) duplicate long-running tasks based on some metrics.
Mantri: a cause- and resource-aware mitigation scheme
– Case by case analysis: takes distinct actions based on cause
– Considers the opportunity cost of actions
The Outlier Problem: Causes and Solutions
What are outliers?
– Stragglers: tasks that take >= 1.5 times as long as the median task in that phase
– Re-computes: tasks that are re-run because their output was lost (not considered in the LATE paper)
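
The straggler definition translates into a one-line filter over the task durations of a phase. This is a minimal sketch; the function name and the sample durations are made up for illustration.

```python
from statistics import median


def find_stragglers(durations: dict, factor: float = 1.5) -> list:
    """Tasks whose duration is >= factor times the phase median."""
    cutoff = factor * median(durations.values())
    return [name for name, d in durations.items() if d >= cutoff]


# Durations (seconds) of the tasks in one phase; the median is 12.
phase = {"a": 10, "b": 12, "c": 11, "d": 30, "e": 18}
print(find_stragglers(phase))
```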
Frequency of Outliers
(1) The median phase has 10% stragglers and no recomputes.
(2) 10% of the stragglers take > 10x longer.
Mantri [Resource Aware Restart]
Problem: Outliers due to machine contention.
Idea: Restart tasks elsewhere in the cluster.
Challenge: The earlier the better, but restart or duplicate?
Mantri solution:
– Do either iff $P(t_{new} < t_{rem})$ is high.
– Kill and restart only if the remaining time is so large that a restart is likely to finish earlier: $t_{rem} > E(t_{new}) + C$.
– Start a duplicate only if the total amount of resources consumed decreases: $P(t_{rem} > t_{new}\,(c+1)/c)$ is high, where c is the number of copies currently running.
– Continuously observe and kill wasteful copies. At most 3 copies exist.
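
The two tests can be sketched with an empirical sample of t_new standing in for its distribution. The code below is not Mantri's implementation: the function names, the use of past durations as samples of t_new, and the `min_probability` default of 0.25 (a placeholder for "is high") are assumptions for illustration.

```python
from statistics import mean

MAX_COPIES = 3  # at most 3 copies of a task exist


def should_kill_restart(t_rem: float, t_new_samples: list, margin: float) -> bool:
    """Restart test: t_rem > E(t_new) + C, with C passed in as `margin`."""
    return t_rem > mean(t_new_samples) + margin


def should_duplicate(t_rem: float, t_new_samples: list, copies_running: int,
                     min_probability: float = 0.25) -> bool:
    """Duplicate test: P(t_rem > t_new * (c + 1) / c) exceeds a cutoff."""
    if copies_running < 1:
        raise ValueError("the original copy counts as one running copy")
    if copies_running >= MAX_COPIES:
        return False
    factor = (copies_running + 1) / copies_running
    hits = sum(1 for t_new in t_new_samples if t_rem > t_new * factor)
    return hits / len(t_new_samples) > min_probability


samples = [10, 12, 14, 16]  # durations of comparable tasks, in seconds
print(should_kill_restart(40, samples, margin=5))
print(should_duplicate(30, samples, copies_running=1))
```

With one copy running, the factor (c+1)/c is 2, so a duplicate pays off only when the remaining time is likely to exceed twice the time a fresh copy would need.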
Mantri [Network Aware Placement]
Problem: Tasks reading input across the network experience variable congestion.
Idea: Avoid hot-spots; keep traffic on a link proportional to its bandwidth.
Challenges: Global co-ordination, congestion detection.
Insights:
– Local control is a good approximation.
– Link utilization averages out in the long term, and is steady in the short term.
If rack i has $d_i$ map output and bandwidths $u_i$, $v_i$ available on its uplink and downlink, place an $a_i$ fraction of reduces such that: [equation not captured in the slide text]
Mantri [Avoid Recomputations]
Problem: Due to unavailable input, tasks have to be recomputed.
Insight:
– 50% of recomputes are on 5% of machines
– Cost to recompute vs. cost to replicate
[Figure: task M1 feeds task M2] $t_{redo} = r_2\,(t_2 + t_1^{redo})$
The cost to recompute depends on data loss probabilities and time taken, and also recursively looks at prior phases.
Mantri preferentially acts on more costly inputs.
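
The recursive recompute cost can be unrolled over a chain of phases. In the sketch below the function name is hypothetical, and it is assumed (the slide does not say so) that the recursion bottoms out at a first phase whose own input is never lost, so that its redo cost is r_1 t_1.

```python
def expected_redo_time(phases: list) -> float:
    """Apply t_redo_i = r_i * (t_i + t_redo_{i-1}) along a chain.

    `phases` lists (loss_probability, run_time) from first to last phase;
    the redo cost before the first phase is taken to be 0 (assumption).
    """
    redo = 0.0
    for loss_probability, run_time in phases:
        redo = loss_probability * (run_time + redo)
    return redo


# M1: output lost with probability 0.1, takes 10 s to rerun.
# M2: output lost with probability 0.2, takes 5 s to rerun.
print(expected_redo_time([(0.1, 10.0), (0.2, 5.0)]))
```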
Mantri [Data Aware Task Ordering]
Problem: Workload imbalance causes tasks to straggle.
Idea: Restarting outliers that are lengthy is counter-productive.
Insight:
– Theorem [Graham, 1969]: scheduling tasks longest-processing-time first is at most 33% worse than the optimal schedule.
Mantri solution:
– Schedule tasks in a phase in descending order of input size.
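
The effect of the ordering shows up even on a tiny example. The sketch below uses greedy list scheduling, where each task goes to the currently least-loaded slot, and compares arrival order against descending size. Using input size as a stand-in for processing time and the sample sizes are assumptions for illustration.

```python
import heapq


def greedy_makespan(sizes: list, slots: int) -> int:
    """Assign tasks in the given order to the least-loaded slot."""
    loads = [0] * slots
    heapq.heapify(loads)
    for size in sizes:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size)
    return max(loads)


sizes = [2, 3, 4, 6, 2, 2]  # task input sizes, in arrival order
arrival_order = greedy_makespan(sizes, slots=3)
longest_first = greedy_makespan(sorted(sizes, reverse=True), slots=3)
print(arrival_order, longest_first)
```

Here the large task arrives late and lands on an already loaded slot in arrival order, while descending order places it first and finishes the phase sooner.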
Summary
Outliers are a significant problem and arise from many causes.
Mantri: cause- and resource-aware mitigation outperforms prior schemes.
Discussion Points
– Too many schemes packed together, no unifying theme!
– Mantri does case by case analysis for each cause; what if the causes are interdependent?