1
Lecture 14: Combating Outliers in MapReduce Clusters (Xiaowei Yang)
2
References:
– "Reining in the Outliers in Map-Reduce Clusters using Mantri" by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris
– http://research.microsoft.com/en-us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx
3
[Figure: log(size of dataset), GB (10^9) through TB (10^12), PB (10^15), EB (10^18), vs. log(size of cluster), 1 through 10^5; HPC and parallel databases sit at smaller scales, MapReduce at the large-data, large-cluster corner; example datasets: the Internet, click logs, bio/genomic data.]
MapReduce decouples customized data operations from the mechanisms to scale, and is widely used:
– Cosmos (based on SVC's Dryad) + Scope @ Bing
– MapReduce @ Google
– Hadoop inside Yahoo! and on Amazon's Cloud (AWS)
4
How it works: an example. Goal: find frequent search queries to Bing. What the user says:
SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X
[Figure: a job manager assigns work and gets progress; map tasks read file blocks 0-3 and write their output locally; reduce tasks produce output blocks 0 and 1.]
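To make the slide's data flow concrete, here is a minimal Python sketch of the same query-frequency job. The function names, the toy driver, and the in-memory shuffle are illustrative, not the Cosmos/Scope API:

```python
from collections import defaultdict

def map_phase(file_block):
    # Emit (query, 1) for every query in this input block.
    for query in file_block:
        yield query, 1

def reduce_phase(query, counts, threshold):
    # Sum the counts for one query; keep it only if frequent.
    freq = sum(counts)
    if freq > threshold:
        yield query, freq

# Toy driver standing in for the job manager: shuffle map output
# by key, then hand each key's values to a reduce task.
blocks = [["cats", "dogs"], ["cats", "cats"], ["news"]]
shuffle = defaultdict(list)
for block in blocks:                       # "map tasks"
    for key, value in map_phase(block):
        shuffle[key].append(value)
for key, values in shuffle.items():        # "reduce tasks"
    for query, freq in reduce_phase(key, values, threshold=1):
        print(query, freq)                 # -> cats 3
```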
5
We find that outliers slow down map-reduce jobs. [Figure: job pipeline over the file system with a barrier before reduce; per-phase task counts: Map.Read 22K, Map.Move 15K, Map 13K, Reduce 51K.] Goals: speeding up jobs improves productivity; predictability supports SLAs; … while using resources efficiently.
6
What is an outlier? A phase (map or reduce) has n tasks and s slots (available compute resources). If every task took T seconds to run, a naive scheduler would finish in the ideal run time of ceil(n/s) * T. In practice each task's runtime varies: t_i = f(datasize, code, machine, network). The goal is to bring actual completion time closer to the ideal.
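A quick worked example of that bound, with made-up numbers, shows how a single outlier dominates:

```python
import math

n, s, T = 100, 25, 10          # 100 tasks, 25 slots, 10 s per task
ideal = math.ceil(n / s) * T   # 4 waves * 10 s = 40 s

# One straggler taking 5x the normal time stretches the last wave:
straggler = 5 * T
actual = (math.ceil(n / s) - 1) * T + straggler  # 30 s + 50 s = 80 s
print(ideal, actual)           # -> 40 80: one task doubles the job
```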
7
From a phase to a job: a job may have many phases. An outlier in an early phase has a cumulative effect on later phases, and data loss may cause multi-phase recompute outliers.
8
Why outliers? Problem: due to unavailable input, tasks have to be recomputed. [Figure: map, sort, reduce pipeline; a delay due to a recompute in the map phase readily cascades into the sort and reduce phases.]
9
Previous work: the original MapReduce paper observed the problem but didn't deal with it in depth. Its solution was to duplicate the slow tasks. Drawbacks:
– Some duplicates may be unnecessary
– Duplicates use extra resources
– When placement is the problem, duplication doesn't help
10
Quantifying the Outlier Problem. Approach: understand the problem before proposing solutions; understanding often leads to solutions.
1. Prevalence of outliers
2. Causes of outliers
3. Impact of outliers
11
Why bother? Frequency of outliers.
– Stragglers = tasks that take 1.5 times the median task runtime in that phase
– Recomputes = tasks that are re-run because their output was lost
50% of phases have 10% stragglers and no recomputes; 10% of the stragglers take >10x longer.
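Under that definition, flagging stragglers from a phase's task runtimes is straightforward; a sketch with made-up runtimes:

```python
import statistics

runtimes = [10, 11, 9, 12, 10, 31, 48]        # seconds, one per task
median = statistics.median(runtimes)           # 11 here
stragglers = [t for t in runtimes if t >= 1.5 * median]
print(stragglers)                              # -> [31, 48]
```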
12
Causes of outliers: data skew. In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) correspond to large amounts of data over the network. Duplicating such tasks will not help: a duplicate has just as much data to process!
13
Non-outliers can be improved as well: 20% of them take 55% longer than the median task.
14
Problem: tasks reading input over the network experience variable congestion. [Figure: reduce tasks reading map output across racks.] Such uneven placement is typical in production: reduce tasks are placed at the first available slot.
15
Causes of outliers: cross-rack traffic. 70% of cross-rack traffic is reduce traffic, since every reduce reads from every map and a reduce is put into any spare slot. Tasks in a spot with a slow network run slower, and tasks compete for the network among themselves. As a result, 50% of phases take 62% longer to finish than under ideal placement.
16
Causes of outliers: bad and busy machines. 50% of recomputes happen on 5% of the machines, and recomputes increase resource usage.
17
Outliers cluster by time: resource contention might be the cause. Recomputes cluster by machine: data loss may cause multiple recomputes.
18
Why bother? Cost of outliers (a what-if analysis that replays logs in a trace-driven simulator): at the median, jobs are slowed down by 35% due to outliers.
19
Mantri Design
20
High-level idea: be cause-aware and resource-aware. Runtime = f(input, network, machine, dataToProcess, …), so fix each cause with a different strategy.
21
Resource-aware restarts: duplicate, or kill and restart, long-running outliers.
22
When to restart: every Δ seconds, tasks report progress; from these reports, estimate t_rem (remaining time of the running copy) and t_new (predicted runtime of a fresh copy).
23
Schedule a duplicate only if the total running time is expected to shrink, i.e., if P(c · t_rem > (c+1) · t_new) > δ, where c is the number of copies currently running. When there are available slots, restart if the expected reduction in time exceeds the restart cost: E(t_rem − t_new) > ρ · Δ. To avoid thrashing on noisy estimates, a task is killed or duplicated at most γ = 3 times.
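A minimal sketch of these two tests; the paired-sample probability estimate and the default parameter values here are illustrative assumptions, not Mantri's internals:

```python
def should_duplicate(t_rem_samples, t_new_samples, c, delta=0.25):
    # Duplicate iff P(c * t_rem > (c+1) * t_new) > delta, i.e. the
    # expected total work shrinks despite running one more copy.
    hits = sum(c * r > (c + 1) * n
               for r, n in zip(t_rem_samples, t_new_samples))
    return hits / len(t_rem_samples) > delta

def should_restart_idle(t_rem, t_new, rho=2.0, delta_secs=10.0):
    # With idle slots, restart iff E(t_rem - t_new) > rho * delta_secs.
    return (t_rem - t_new) > rho * delta_secs

# Only one of three sampled outcomes favors duplication, but that
# already clears the delta = 0.25 bar:
print(should_duplicate([100, 120, 80], [40, 70, 90], c=1))  # -> True
```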
24
Network-Aware Placement: compute the rack location for each task and find the placement that minimizes the maximum data transfer time. If rack i has d_i map output and u_i, v_i bandwidths available on its uplink and downlink, place an a_i fraction of the reduces in rack i so as to minimize
max_i max( d_i · (1 − a_i) / u_i , (D − d_i) · a_i / v_i ), where D = Σ_j d_j
(rack i uploads the (1 − a_i) share of its own map output and downloads the a_i share of everyone else's).
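A toy version of that optimization by direct search over a two-rack example; the data sizes and bandwidths are made up, and this sketches the objective, not Mantri's actual solver:

```python
def transfer_time(a, d, u, v):
    # a[i]: fraction of reduces in rack i; d[i]: map output in rack i.
    D = sum(d)
    return max(max(d[i] * (1 - a[i]) / u[i],       # rack i's uploads
                   (D - d[i]) * a[i] / v[i])       # rack i's downloads
               for i in range(len(d)))

# Two racks: sweep the fraction placed in rack 0 (rack 1 gets the rest).
d, u, v = [80.0, 20.0], [1.0, 1.0], [1.0, 1.0]
best = min((transfer_time([x / 100, 1 - x / 100], d, u, v), x / 100)
           for x in range(101))
print(best)   # -> (16.0, 0.8): most reduces go where the data is
```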
25
Avoid recomputation by replicating task output:
– Restart a task early if its input data is lost
– Replicate the output that is most costly to recompute
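The implied cost-benefit test: replicate an output when the expected cost of recomputing it exceeds the cost of copying it. A sketch with illustrative numbers:

```python
def should_replicate(p_loss, t_recompute, t_replicate):
    # Expected recompute cost vs. certain replication cost.
    return p_loss * t_recompute > t_replicate

# E.g., a 5% loss chance on a 400 s recompute chain justifies
# a 15 s replication: 0.05 * 400 = 20 > 15.
print(should_replicate(0.05, 400.0, 15.0))   # -> True
```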
26
Data-aware task ordering: outliers arise from large inputs, so schedule tasks in descending order of dataToProcess. This greedy order is at most 33% worse than optimal scheduling.
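A sketch of that ordering as the classic longest-processing-time greedy: biggest tasks first, each assigned to the earliest-free slot (task sizes are made up):

```python
import heapq

def schedule(data_sizes, s):
    # Assign tasks in descending size to the earliest-free slot;
    # return the makespan (finish time of the last slot).
    slots = [0.0] * s                    # per-slot finish times
    heapq.heapify(slots)
    for size in sorted(data_sizes, reverse=True):
        heapq.heappush(slots, heapq.heappop(slots) + size)
    return max(slots)

print(schedule([9, 7, 6, 5, 4, 3], s=2))   # -> 17.0 (optimal here)
```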
27
Estimation of t_rem and t_new. For t_rem, the inputs are d (the task's input data size) and d_read (the amount read so far); the natural estimate extrapolates the elapsed time linearly over the data still to be read, plus a wrap-up term for the work after the read completes.
28
Estimation of t_new: roughly processRate × locationFactor × d, plus a scheduling lag, where processRate is estimated from all tasks in the phase, locationFactor reflects the performance of the candidate machine, and d is the input size.
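A sketch of both estimators under the linear-extrapolation reading above; the wrap-up time and scheduling lag are illustrative constants, not values from the paper:

```python
def estimate_t_rem(t_elapsed, d, d_read, t_wrapup=5.0):
    # Extrapolate elapsed time over the data still to be read,
    # then add the (assumed) post-read wrap-up work.
    return t_elapsed * (d - d_read) / d_read + t_wrapup

def estimate_t_new(process_rate, location_factor, d, sched_lag=2.0):
    # Predicted runtime of a fresh copy: per-byte rate for the phase,
    # scaled by how good this machine/network location is.
    return process_rate * location_factor * d + sched_lag

print(estimate_t_rem(t_elapsed=30.0, d=100.0, d_read=25.0))   # -> 95.0
print(estimate_t_new(0.4, 1.2, d=100.0))                      # -> 50.0
```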
29
Results:
– Deployed in production Cosmos clusters: prototype Jan '10, baking on pre-production clusters, release May '10
– Trace-driven simulations: thousands of jobs, mimicking workflow, task runtime, data skew, and failure probability; compared with existing schemes and idealized oracles
30
Evaluation Methodology: Mantri runs on production clusters, with results from Dryad as the baseline; trace-driven simulations compare against other systems.
31
Comparing jobs in the wild, with and without Mantri, for one month of jobs in the Bing production cluster: 340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release).
32
In production, restarts improve on native Cosmos by 25% while using fewer resources.
33
In trace-replay simulations (each job repeated thrice), restarts are dealt with much better in a cause- and resource-aware manner. [Figure: CDF over jobs of % cluster resources used.]
34
Network-aware placement, compared against three bandwidth models: Equal (all links have the same bandwidth), Start (bandwidth as measured at the start), and Ideal (actual available bandwidth at run time).
35
Protecting against recomputes. [Figure: CDF over jobs of % cluster resources used.]
36
Summary:
a) Reduce recomputation: preferentially replicate the output of costly-to-recompute tasks
b) Poor network: each job locally avoids network hot-spots
c) Bad machines: quarantine persistently faulty machines
d) DataToProcess: schedule in descending order of data size
e) Others: restart or duplicate tasks, cognizant of resource cost; prune unnecessary copies
37
Conclusion: outliers in map-reduce clusters are a significant problem; they happen due to many causes, including the interplay between storage, network, and map-reduce computation. Cause- and resource-aware mitigation improves on prior art.