Lecture 14: Combating Outliers in MapReduce Clusters
Xiaowei Yang
References:
– Reining in the Outliers in Map-Reduce Clusters using Mantri, by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris
– us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx
MapReduce
– Decouples customized data operations from mechanisms to scale
– Is widely used: Cosmos (based on SVC’s Dryad) + Bing; Google; Hadoop inside Yahoo! and on Amazon’s Cloud (AWS)
– Data: e.g., the Internet, click logs, bio/genomic data
[Figure: log(size of dataset), from GB (10^9) through TB and PB to EB, versus log(size of cluster); regions labeled "HPC, parallel databases" and "mapreduce"]
An Example
– Goal: find frequent search queries to Bing
– What the user says: SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X
– How it works: the job manager assigns work to tasks and gets progress reports; map tasks read file blocks (file block 0 to 3) and write locally; reduce tasks produce output blocks (output block 0, 1)
[Figure: Read, Map, and Reduce stages connecting file blocks, job manager, tasks, and output blocks]
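The example query can be read as a map step that emits (query, 1) pairs and a reduce step that sums per query and keeps the frequent ones. Below is a minimal single-process sketch, assuming in-memory lists stand in for file blocks. The function names and the sample data are hypothetical and are not Cosmos, Dryad, or Hadoop APIs.

```python
from collections import defaultdict


def map_block(block):
    """Map: emit (query, 1) for every query found in one input block."""
    return [(query, 1) for query in block]


def reduce_counts(pairs, threshold):
    """Reduce: sum the counts per query and keep those with Freq > threshold."""
    freq = defaultdict(int)
    for query, count in pairs:
        freq[query] += count
    return {q: c for q, c in freq.items() if c > threshold}


def frequent_queries(blocks, threshold):
    """Run every map task, gather the output, then run a single reduce."""
    pairs = []
    for block in blocks:
        pairs.extend(map_block(block))
    return reduce_counts(pairs, threshold)


# Hypothetical input: two file blocks of search queries.
blocks = [["weather", "news", "weather"], ["news", "weather", "maps"]]
result = frequent_queries(blocks, threshold=1)
```

In a real cluster each call to map_block would be a separate task scheduled by the job manager, which is where outliers enter the picture.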
Outliers slow down map-reduce jobs
– We find that: [Figure: job phases Map.Read 22K, Map.Move 15K, Map 13K, Reduce 51K; barrier; file system]
Goals
– Speeding up jobs improves productivity
– Predictability supports SLAs
– … while using resources efficiently
What is an outlier
– A phase (map or reduce) has n tasks and s slots (available compute resources)
– Ideally, every task takes T seconds to run, so run time = $\lceil n/s \rceil \cdot T$
– In practice, $t_i = f(\text{datasize}, \text{code}, \text{machine}, \text{network})$
– A naïve scheduler …
– Goal is to be closer to …
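To make the gap between the ideal run time and an actual schedule concrete, here is a small sketch. It assumes the naïve scheduler simply hands the next task to whichever slot frees up first; that reading, the function names, and the sample task times are assumptions for illustration.

```python
import heapq
import math


def ideal_runtime(n, s, T):
    """Phase run time when all n tasks take exactly T seconds on s slots."""
    return math.ceil(n / s) * T


def greedy_runtime(task_times, s):
    """Run tasks in the given order, each on the slot that frees up first."""
    slots = [0.0] * s  # time at which each slot becomes free
    heapq.heapify(slots)
    finish = 0.0
    for t in task_times:
        start = heapq.heappop(slots)
        end = start + t
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish


uniform = greedy_runtime([10] * 6, 3)                     # no outliers
with_outlier = greedy_runtime([10, 10, 10, 10, 10, 40], 3)  # one slow task
```

With six 10-second tasks on three slots both functions give 20 seconds. Making the last task take 40 seconds stretches the phase to 50 seconds, although the total work only grew from 60 to 90 task-seconds.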
From a phase to a job
– A job may have many phases
– An outlier in an early phase has a cumulative effect
– Data loss may cause multi-phase recompute outliers
Why outliers?
– Problem: Due to unavailable input, tasks have to be recomputed
– Delay due to a recompute readily cascades
[Figure: map, sort, and reduce phases showing the delay due to a recompute]
Previous work
– The original MapReduce paper observed the problem but didn’t deal with it in depth
– Solution was to duplicate the slow tasks
– Drawbacks: some duplicates may be unnecessary; duplicates use extra resources; placement may be the problem
Quantifying the Outlier Problem
– Approach: understand the problem first before proposing solutions; understanding often leads to solutions
1. Prevalence of outliers
2. Causes of outliers
3. Impact of outliers
Why bother? Frequency of Outliers
– stragglers = Tasks that take more than 1.5 times as long as the median task in that phase
– recomputes = Tasks that are re-run because their output was lost
– 50% of phases have 10% stragglers and no recomputes
– 10% of the stragglers take >10X longer
[Figure: frequency of stragglers and outliers]
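The straggler definition can be applied directly to the task durations of one phase. A minimal sketch follows; the strict comparison, the function name, and the sample durations are assumptions.

```python
from statistics import median


def find_stragglers(durations, factor=1.5):
    """Return indices of tasks that run longer than factor x the phase median."""
    m = median(durations)
    return [i for i, t in enumerate(durations) if t > factor * m]


# Hypothetical task durations (seconds) for one phase.
durations = [10, 12, 11, 9, 30, 10, 120]
stragglers = find_stragglers(durations)
```

Here the median is 11 seconds, so the threshold is 16.5 seconds and the tasks at indices 4 and 6 are flagged; the last one also runs more than 10X longer than the median.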
Causes of outliers: data skew
– In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) correspond to a large amount of data over the network
– Duplicating will not help!
Non-outliers can be improved as well
– 20% of them run 55% longer than the median
Problem: Tasks reading input over the network experience variable congestion
– Uneven placement is typical in production
– Reduce tasks are placed at the first available slot
[Figure: a reduce task reading map output]
Causes of outliers: cross-rack traffic
– 70% of cross-rack traffic is reduce traffic
– Tasks in a spot with a slow network run slower
– Tasks compete for the network among themselves
– Reduce reads from every map
– Reduce is put into any spare slot
– 50% of phases take 62% longer to finish than with ideal placement
Cause of outliers: bad and busy machines
– 50% of recomputes happen on 5% of the machines
– Recompute increases resource usage
Outliers cluster by time
– Resource contention might be the cause
Recomputes cluster by machines
– Data loss may cause multiple recomputes
Why bother? Cost of outliers
– What-if analysis: replays logs in a trace-driven simulator
– At median, jobs are slowed down by 35% due to outliers
Mantri Design
High-level idea
– Cause aware, and resource aware
– Runtime = f (input, network, machine, dataToProcess, …)
– Fix each problem with different strategies
Resource-aware restarts
– Duplicate or kill long outliers
When to restart
– Every $\Delta$ seconds, tasks report progress
– Estimate $t_{rem}$ and $t_{new}$
– $\gamma = 3$
– Schedule a duplicate if the total running time is smaller: $P(c \cdot t_{rem} > (c+1) \cdot t_{new}) > \delta$
– When there are available slots, restart if the reduction in time is more than the restart time: $E(t_{rem} - t_{new}) > \rho \Delta$
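The two tests above can be sketched as follows. This assumes that the uncertainty in the remaining time of the running copy and in the time of a fresh copy is available as lists of samples, that the two are independent, and that c is the number of copies currently running. The function names, the sample values, and the values used for δ, ρ, and ∆ are hypothetical.

```python
from itertools import product


def prob(event, rem_samples, new_samples):
    """Empirical probability of event(t_rem, t_new) over all sample pairs."""
    pairs = list(product(rem_samples, new_samples))
    return sum(1 for r, n in pairs if event(r, n)) / len(pairs)


def should_duplicate(rem_samples, new_samples, c, delta):
    """Duplicate if P(c * t_rem > (c + 1) * t_new) > delta."""
    p = prob(lambda r, n: c * r > (c + 1) * n, rem_samples, new_samples)
    return p > delta


def should_restart_with_free_slots(rem_samples, new_samples, rho, period):
    """With spare slots: act if E(t_rem - t_new) > rho * period."""
    mean_rem = sum(rem_samples) / len(rem_samples)
    mean_new = sum(new_samples) / len(new_samples)
    return mean_rem - mean_new > rho * period


# Hypothetical estimates (seconds) for one slow task with a single copy.
t_rem = [100, 120, 140]
t_new = [30, 40, 50]
dup = should_duplicate(t_rem, t_new, c=1, delta=0.25)
restart = should_restart_with_free_slots(t_rem, t_new, rho=3, period=10)
```

The first test compares total slot time: c copies running for t_rem against c + 1 copies running for t_new. A duplicate is launched only when it is likely to lower the total, which is what keeps the restarts resource aware.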
Network Aware Placement
– Compute the rack location for each task
– Find the placement that minimizes the maximum data transfer time
– If rack i has $d_i$ map output and $u_i$, $v_i$ bandwidths available on uplink and downlink, place $a_i$ fraction of reduces such that: …
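One way to make the placement objective concrete is sketched below. It assumes that, with $D = \sum_i d_i$, rack i uploads $d_i (1 - a_i)$ to reduces elsewhere and downloads $(D - d_i) a_i$ for its own reduces, so its transfer time is $\max(d_i (1 - a_i) / u_i,\ (D - d_i) a_i / v_i)$. That cost model, the binary search used to minimize the slowest rack, and all names are assumptions for illustration, not necessarily the solver Mantri uses.

```python
def transfer_time(a, d, u, v):
    """Transfer time of the slowest rack for placement fractions a."""
    total = sum(d)
    worst = 0.0
    for a_i, d_i, u_i, v_i in zip(a, d, u, v):
        up = d_i * (1 - a_i) / u_i        # map output leaving rack i
        down = (total - d_i) * a_i / v_i  # map output entering rack i
        worst = max(worst, up, down)
    return worst


def place_reduces(d, u, v, iters=60):
    """Binary-search the smallest max transfer time; return the fractions a."""
    total = sum(d)

    def bounds(t):
        # Range of a_i that lets rack i finish both directions within t.
        lo = [max(0.0, 1 - t * u_i / d_i) if d_i > 0 else 0.0
              for d_i, u_i in zip(d, u)]
        hi = [min(1.0, t * v_i / (total - d_i)) if total > d_i else 1.0
              for d_i, v_i in zip(d, v)]
        return lo, hi

    def feasible(t):
        lo, hi = bounds(t)
        return (all(l <= h for l, h in zip(lo, hi))
                and sum(lo) <= 1 <= sum(hi))

    t_lo = 0.0
    t_hi = max(max(d_i / u_i, (total - d_i) / v_i)
               for d_i, u_i, v_i in zip(d, u, v))
    for _ in range(iters):
        mid = (t_lo + t_hi) / 2
        if feasible(mid):
            t_hi = mid
        else:
            t_lo = mid

    # Start from the lower bounds and spread what is left over the slack.
    lo, hi = bounds(t_hi)
    slack = 1 - sum(lo)
    room = sum(h - l for l, h in zip(lo, hi))
    if room <= 0:
        return lo
    return [l + slack * (h - l) / room for l, h in zip(lo, hi)]


# Hypothetical racks: rack 0 holds three times as much map output as rack 1.
a = place_reduces(d=[30, 10], u=[1, 1], v=[1, 1])
```

For this input the search places about 75% of the reduces on rack 0, the rack that already holds most of the map output, giving a maximum transfer time of 7.5 time units.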
Avoid recomputation
– Replicating the output
– Restart a task if data are lost
– Replicate the output of the tasks that are most costly to recompute
Data-aware task ordering
– Outliers due to large input
– Schedule tasks in descending order of dataToProcess
– At most 33% worse than optimal scheduling
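The effect of the ordering can be seen with a small sketch. It assumes a task's run time is proportional to its dataToProcess and that tasks are handed to the first free slot; the function names and the sample sizes are hypothetical. The 33% figure matches the classic 4/3 bound for longest-processing-time-first list scheduling.

```python
import heapq


def makespan(task_sizes, slots):
    """Greedy list scheduling: each task goes to the slot that frees up first."""
    loads = [0.0] * slots
    heapq.heapify(loads)
    for size in task_sizes:
        heapq.heappush(loads, heapq.heappop(loads) + size)
    return max(loads)


def data_aware_order(task_sizes):
    """Order tasks by descending amount of data to process."""
    return sorted(task_sizes, reverse=True)


# Hypothetical phase: four small tasks and one large task, two slots.
sizes = [1, 1, 1, 1, 6]
arrival = makespan(sizes, slots=2)
ordered = makespan(data_aware_order(sizes), slots=2)
```

In arrival order the large task starts last and the phase takes 8 units. Starting it first lets the small tasks share the other slot, and the phase finishes in 6 units.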
Estimation of $t_{rem}$ and $t_{new}$
– d: input data size
– $d_{read}$: the amount read
Estimation of $t_{new}$
– processRate: estimated over all tasks in the phase
– locationFactor: machine performance
– d: input size
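A minimal sketch of how the quantities listed above could combine into the two estimates. It assumes the running copy keeps reading at the rate observed so far, that processRate is expressed as time per unit of input, and that locationFactor scales it for the chosen machine. The exact formulas, and the names t_elapsed, t_wrapup, and sched_lag, are assumptions rather than the definitions used by Mantri.

```python
def estimate_t_rem(t_elapsed, d, d_read, t_wrapup=0.0):
    """Remaining time of a running task, extrapolated from progress so far."""
    if d_read <= 0:
        raise ValueError("no progress reported yet")
    # The unread input is assumed to flow at the observed rate d_read / t_elapsed.
    return t_elapsed * (d - d_read) / d_read + t_wrapup


def estimate_t_new(process_rate, location_factor, d, sched_lag=0.0):
    """Run time of a fresh copy: per-unit time x machine factor x input size."""
    return process_rate * location_factor * d + sched_lag


# Hypothetical task: 250 of 1000 units read after 60 seconds.
t_rem = estimate_t_rem(t_elapsed=60, d=1000, d_read=250)
t_new = estimate_t_new(process_rate=0.5, location_factor=2.0, d=100)
```

Here the running copy still needs an estimated 180 seconds, since three quarters of its input is unread and the first quarter took 60 seconds.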
Results
– Deployed in production Cosmos clusters: prototype Jan’10, baking on pre-prod. clusters, release May’10
– Trace-driven simulations: thousands of jobs; mimic workflow, task runtime, data skew, failure prob.; compare with existing schemes and idealized oracles
Evaluation Methodology
– Mantri runs on production clusters
– Baseline is the results from Dryad
– Use trace-driven simulations to compare with other systems
Comparing jobs in the wild
– w/ and w/o Mantri for one month of jobs in the Bing production cluster
– Jobs that each repeated at least five times during May (release) vs. Apr 1-30 (pre-release)
In production, restarts…
– improve on native Cosmos by 25% while using fewer resources
In trace-replay simulations, restarts…
– are much better dealt with in a cause-, resource-aware manner
– Each job repeated thrice
[Figure: CDF of % cluster resources]
Network-aware Placement
– Equal: all links have the same bandwidth
– Start: available bandwidth is the same as at the start
– Ideal: available bandwidth at run time
Protecting against recomputes
[Figure: CDF of % cluster resources]
Summary
a) Reduce recomputation: preferentially replicate costly-to-recompute tasks
b) Poor network: each job locally avoids network hot-spots
c) Bad machines: quarantine persistently faulty machines
d) DataToProcess: schedule in descending order of data size
e) Others: restart or duplicate tasks, cognizant of resource cost. Prune
Conclusion
– Outliers in map-reduce clusters are a significant problem
– They happen due to many causes: interplay between storage, network and map-reduce
– Cause-, resource-aware mitigation improves on prior art