Slide 1: Combating Outliers in map-reduce
Srikanth Kandula, Ganesh Ananthanarayanan, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Ed Harris
Slide 2: map-reduce is widely used
[Figure: log(size of dataset), from GB (10^9) through TB (10^12) and PB (10^15) to EB (10^18), vs. log(size of cluster), from 1 to 10^5; HPC and parallel databases occupy the small-cluster region, map-reduce the large-cluster region. Example datasets: the Internet, click logs, bio/genomic data.]
map-reduce decouples operations on data (user code) from the mechanisms to scale, and is widely used:
– Cosmos (based on SVC's Dryad) + Scope at Bing
– MapReduce at Google
– Hadoop inside Yahoo! and on Amazon's cloud (AWS)
Slide 3: How it works: an example
Goal: find frequent search queries to Bing. What the user says:
SELECT Query, COUNT(*) AS Freq FROM QueryTable GROUP BY Query HAVING Freq > X
[Figure: the job manager assigns work and gets progress; tasks read file blocks 0-3, run Map with a local write, then Reduce produces output blocks 0-1.]
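To make the flow concrete, here is a minimal sketch of the map and reduce logic this job expresses. It is plain Python, not the Scope/Cosmos API; the function names and the toy input are illustrative only.

```python
from collections import defaultdict

# Illustrative sketch (not the Scope/Cosmos API): the map phase emits
# (query, 1) pairs from its file block; the reduce phase sums counts
# per query and keeps those above the threshold X.

def map_phase(file_block):
    for query in file_block:
        yield (query, 1)

def reduce_phase(pairs, x):
    counts = defaultdict(int)
    for query, n in pairs:
        counts[query] += n
    return {q: c for q, c in counts.items() if c > x}

blocks = [["bing", "maps"], ["bing", "news"], ["bing"]]
pairs = [p for b in blocks for p in map_phase(b)]
print(reduce_phase(pairs, x=2))  # {'bing': 3}
```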
Slide 4: Outliers slow down map-reduce jobs
We find that outliers slow down map-reduce jobs.
[Figure: a job's workflow with a barrier at the file system between phases; task counts per phase: Map.Read 22K, Map.Move 15K, Map 13K, Reduce 51K.]
Goals:
– speeding up jobs improves productivity
– predictability supports SLAs
– … while using resources efficiently
Slide 5: This talk…
Identify fundamental causes of outliers:
– concurrency leads to contention for resources
– heterogeneity (e.g., disk loss rate)
– map-reduce artifacts
Current schemes merely duplicate long-running tasks.
Mantri: a cause- and resource-aware mitigation scheme that
– takes distinct actions based on cause
– considers the resource cost of actions
Results from a production deployment.
Slide 6: Why bother? Frequency of outliers
stragglers = tasks that take more than 1.5 times the median task in that phase
recomputes = tasks that are re-run because their output was lost
The median phase has 10% stragglers and no recomputes; 10% of the stragglers take >10X longer.
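The straggler definition above is easy to operationalize. A small sketch follows; the function name and the threshold parameter are ours, not Mantri's.

```python
import statistics

def stragglers(task_durations, factor=1.5):
    """Flag tasks that run longer than `factor` times the phase median."""
    median = statistics.median(task_durations)
    return [i for i, t in enumerate(task_durations) if t > factor * median]

print(stragglers([10, 11, 9, 12, 40]))  # [4]
```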
Slide 7: Why bother? Cost of outliers
(A what-if analysis that replays logs in a trace-driven simulator.)
At the median, jobs are slowed down by 35% due to outliers.
Slide 8: Why outliers? Recomputes
Problem: due to unavailable input, tasks have to be recomputed.
[Figure: map, sort, and reduce phases; a delay due to a recompute in the map phase readily cascades into the downstream phases.]
Slide 9: Why outliers? Recomputes (continued)
Problem: due to unavailable input, tasks have to be recomputed.
(Simple) idea: replicate intermediate data; use the copy if the original is unavailable.
Challenges: What data to replicate? Where? What if we still miss data?
Insight: 50% of the recomputes are on 5% of the machines.
Slide 10: Why outliers? Cost to recompute vs. cost to replicate
(Simple) idea: replicate intermediate data; use the copy if the original is unavailable.
t = predicted runtime of a task
r = predicted probability of a recompute at a machine
t_rep = cost to copy data over within the rack
For a chain where machine M1's output feeds M2: t_redo = r_2 (t_2 + t_1^redo).
Mantri replicates output iff t_redo > t_rep, preferentially acting on the more costly recomputes.
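A sketch of that comparison in code, assuming the base case t_1^redo = r_1 · t_1 for the head of the chain; the numbers and function names are illustrative, not Mantri's implementation.

```python
def t_redo(r, t, upstream_redo=0.0):
    """Expected cost to regenerate a task's output: probability of a
    recompute times (task runtime plus the cost of regenerating its
    unavailable inputs)."""
    return r * (t + upstream_redo)

def should_replicate(r, t, upstream_redo, t_rep):
    # Replicate intermediate output only when the expected recompute
    # cost exceeds the cost of copying the data within the rack.
    return t_redo(r, t, upstream_redo) > t_rep

# Chain M1 -> M2 as on the slide (numbers are made up):
redo_m1 = t_redo(r=0.05, t=100)                 # t_1^redo = r_1 * t_1
print(should_replicate(r=0.05, t=200, upstream_redo=redo_m1, t_rep=8))
```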
Slide 11: Why outliers? Cross-rack traffic
Problem: tasks reading input over the network experience variable congestion.
[Figure: reduce tasks reading map output across racks; uneven placement is typical in production, because reduce tasks are placed at the first available slot.]
Slide 12: Why outliers? Cross-rack traffic (continued)
Problem: tasks reading input over the network experience variable congestion.
Idea: avoid hot-spots; keep the traffic on a link proportional to its bandwidth.
If rack i has d_i map output and u_i, v_i bandwidths available on its uplink and downlink, place an a_i fraction of the reduces in rack i so as to minimize the slowest link's transfer time: minimize max_i max( d_i (1 - a_i) / u_i, (D - d_i) a_i / v_i ), where D = Σ_j d_j.
Challenges: Global coordination across jobs? Where is the congestion?
Insights: local control is a good approximation (each job balances its own traffic); link utilizations average out on the long term and are steady on the short term.
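One way to realize "traffic proportional to bandwidth" is to balance each rack's uplink and downlink finish times in closed form and then normalize the fractions. This per-rack heuristic is a simplification we introduce for illustration, not the paper's exact solver.

```python
def place_reduces(d, u, v):
    """Fraction of reduce tasks per rack. Balancing each rack's uplink
    time d_i*(1-a_i)/u_i against its downlink time (D-d_i)*a_i/v_i gives
    a_i = d_i*v_i / (d_i*v_i + (D-d_i)*u_i); we then normalize so the
    fractions sum to 1 (a simplification of the joint optimization)."""
    D = sum(d)
    a = [di * vi / (di * vi + (D - di) * ui) for di, ui, vi in zip(d, u, v)]
    s = sum(a)
    return [ai / s for ai in a]

# Three racks: rack 0 holds most of the map output but has a slow
# uplink, so it should run a larger share of the reduces.
print(place_reduces(d=[60, 30, 10], u=[1, 10, 10], v=[10, 10, 10]))
```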
Slide 13: Why outliers? Persistently slow machines
Persistently slow machines rarely cause outliers: the cluster software (Autopilot) quarantines persistently faulty machines.
Slide 14: Why outliers? Data skew
Problem: about 25% of outliers occur because a task has more dataToProcess.
In an ideal world, we could divide work evenly… In practice, Mantri schedules tasks in descending order of dataToProcess, as sketched below.
Theorem [due to Graham, 1969]: doing so is no more than 33% worse than the optimal.
Solution: ignoring these outliers is better than the state of the art (duplicating them)!
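The scheduling rule is Graham's longest-processing-time-first heuristic. A minimal sketch; the slot model and the numbers are ours.

```python
import heapq

def lpt_schedule(task_sizes, n_slots):
    """Graham's LPT rule: assign tasks in descending order of
    dataToProcess to the currently least-loaded slot. The makespan is
    at most 4/3 - 1/(3m) of optimal, i.e. no more than 33% worse."""
    heap = [(0.0, slot) for slot in range(n_slots)]  # (load, slot id)
    heapq.heapify(heap)
    assignment = {}
    for size in sorted(task_sizes, reverse=True):
        load, slot = heapq.heappop(heap)
        assignment.setdefault(slot, []).append(size)
        heapq.heappush(heap, (load + size, slot))
    return assignment

print(lpt_schedule([7, 5, 4, 3, 3, 2], n_slots=2))
# {0: [7, 3, 2], 1: [5, 4, 3]} -- both slots finish at 12
```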
Slide 15: Why outliers? Contention
Problem: 25% of outliers remain, likely due to contention at the machine.
Idea: restart tasks elsewhere in the cluster.
Challenge: the earlier the better, but should we restart the outlier or start a pending task?
[Figure: a running task with predicted time remaining t_rem vs. a potential restart with predicted time t_new.]
– If the predicted time is much better, kill the original and restart elsewhere.
– Else, if other tasks are pending, duplicate iff it saves both time and resources: save time and resources iff P(t_rem > (c+1)/c · t_new) > δ, where c copies are currently running.
– Else (no pending work), duplicate iff the expected savings are high.
– Continuously observe and kill wasteful copies.
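The decision list above can be read as straight-line code. In this sketch the thresholds, and the point estimate standing in for P(t_rem > (c+1)/c · t_new), are assumptions of ours.

```python
def act_on_outlier(t_rem, t_new, copies, pending_tasks,
                   much_better=3.0):
    """Decision sketch for a running outlier (thresholds are made up):
    t_rem  - predicted time left for the current copy
    t_new  - predicted time for a fresh copy elsewhere
    copies - number of copies currently running (c)
    """
    # Kill-and-restart when a fresh copy is predicted to be much faster.
    if t_new * much_better < t_rem:
        return "kill original, restart elsewhere"
    # With pending work, duplicate only if it saves time AND resources;
    # a point estimate stands in for P(t_rem > (c+1)/c * t_new) > delta.
    if pending_tasks > 0:
        return ("duplicate" if t_rem > (copies + 1) / copies * t_new
                else "leave alone")
    # No pending work: duplicate when the expected savings are high.
    return "duplicate" if t_rem > 2 * t_new else "leave alone"

print(act_on_outlier(t_rem=90, t_new=20, copies=1, pending_tasks=5))
# kill original, restart elsewhere
```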
Slide 16: Summary
a) preferentially replicate costly-to-recompute tasks
b) each job locally avoids network hot-spots
c) quarantine persistently faulty machines
d) schedule in descending order of data size
e) restart or duplicate tasks, cognizant of resource cost; prune wasteful copies
Theme: cause- and resource-aware action; an explicit attempt to decouple the solutions, with partial success.
Slide 17: Results
Deployed in production Cosmos clusters:
– prototype Jan '10, baking on pre-production clusters, released May '10
Trace-driven simulations:
– thousands of jobs
– mimic workflow, task runtime, data skew, failure probability
– compare with existing schemes and idealized oracles
Slide 18: In production, restarts…
improve on native Cosmos by 25% while using fewer resources.
Slide 19: Comparing jobs in the wild
340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release).
[Figure: CDF over jobs of % cluster resources.]
Slide 20: In trace-replay simulations, restarts…
are much better dealt with in a cause- and resource-aware manner.
[Figure: CDF over jobs of % cluster resources.]
Slide 21: Protecting against recomputes
[Figure: CDF over jobs of % cluster resources.]
Slide 22: Conclusion
Outliers in map-reduce clusters are a significant problem. They happen due to many causes:
– the interplay between storage, network, and map-reduce
Cause- and resource-aware mitigation improves on prior art.
Slide 23: Back-up
Slide 24: Network-aware Placement