Reining in the Outliers in MapReduce Jobs using Mantri

1 Reining in the Outliers in MapReduce Jobs using Mantri
Ganesh Ananthanarayanan†, Srikanth Kandula*, Albert Greenberg*, Ion Stoica†, Yi Lu*, Bikas Saha*, Ed Harris* († UC Berkeley, * Microsoft). Presented by Daniar H. Kurniawan, The University of Chicago, 2018.

2 MapReduce Jobs Basis of analytics in modern Internet services
E.g., Dryad, Hadoop. Job → {Phase} → {Task}. Graph flow consists of pipelines as well as strict blocks. Where does it fit into the big picture?

3 Example Dryad Job Graph
[Figure: example Dryad job graph. Input is read from the distributed file system and flows through the phases EXTRACT (Map.1, Map.2), AGGREGATE_PARTITION, FULL_AGGREGATE (Reduce.1, Reduce.2), PROCESS and COMBINE (Join), then back to the distributed file system. Legend: phase; pipeline; blocked until input is done.]

4 Log Analysis from Production
Logs from a production cluster with thousands of machines, sampled over six months. 10,000+ jobs, 80PB of data, 4PB of network transfers. Task-level details. Production and experimental jobs.

5 Outliers hurt! Tasks that run longer than the rest in the phase
Median phase has 10% outliers, running for >10x longer. Slow down jobs by 35% at median. Operational inefficiency: unpredictability in completion times affects SLAs, hurts development productivity, wastes compute cycles.

6 Mantri: A system that mitigates outliers based on root-cause analysis
Why do outliers occur? [Figure: a task reads its input and then executes.] Causes: input unavailable, network congestion, local contention, workload imbalance.

7 Mantri’s Outlier Mitigation
Avoid Recomputation; Network-aware Task Placement; Duplicate Outliers; Cognizant of Workload Imbalance

8 Recomputes: Illustration
[Figure: (a) barrier phases, (b) cascading recomputes. Each panel compares the actual and the ideal timeline and marks the difference as inflation. Legend: normal task; recompute task.]

9 What causes recomputes? [1]
Faulty machines: bad disks, non-persistent hardware quirks. The set of faulty machines varies with time, not constant (4%).

10 What causes recomputes? [2]
Transient machine load: recomputes correlate with machine load. Requests for data access are dropped.

11 Replicate costly outputs
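$M_R$: recompute probability of a machine. [Figure: a chain Task 1, Task 2, Task 3, with $M_{R2}$ and $M_{R3}$ marked on Task 2 and Task 3.] Expected recompute time: $T_{Recomp} = M_{R3}\,(1 - M_{R2})\,T_3 + M_{R3}\,M_{R2}\,(T_3 + T_2)$. The first term is the case where only Task 3 is recomputed; the second is the case where both Task 3 and Task 2 are recomputed. The calculation is recursive. Replicate the output when the replication time is smaller: $T_{Rep} < T_{Recomp}$.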
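The replicate-or-recompute comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration for the three-task chain only, not Mantri's implementation; the function and parameter names (`expected_recompute_time`, `should_replicate`, `m_r2`, `m_r3`, `t_rep`) are made up here, and the inputs in the usage lines are invented numbers.

```python
def expected_recompute_time(t3, t2, m_r3, m_r2):
    """Expected time to regenerate Task 3's output (two-level chain only).

    m_r3, m_r2: recompute probabilities for Task 3 and Task 2.
    t3, t2: run times of Task 3 and Task 2.
    """
    only_task3 = m_r3 * (1 - m_r2) * t3        # Task 2's output is still readable
    task3_and_task2 = m_r3 * m_r2 * (t3 + t2)  # Task 2 has to be rerun as well
    return only_task3 + task3_and_task2


def should_replicate(t_rep, t3, t2, m_r3, m_r2):
    """Replicate when copying the output costs less than the expected recompute."""
    return t_rep < expected_recompute_time(t3, t2, m_r3, m_r2)


# Illustrative numbers (hypothetical, not from the talk).
t_recomp = expected_recompute_time(t3=100, t2=50, m_r3=0.1, m_r2=0.2)
cheap_copy = should_replicate(5, 100, 50, 0.1, 0.2)
costly_copy = should_replicate(20, 100, 50, 0.1, 0.2)
```

With these inputs the expected recompute time is about 11 time units, so a 5-unit copy is worth making and a 20-unit copy is not.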

12 Transient Failure Causes
Recomputes manifest in clutches. A machine is prone to cause recomputes until the problem is fixed (load abates, a critical process restarts, etc.). Clue: at least r recomputes within a time window t on a machine.
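The "at least r recomputes within a time window t" clue can be sketched as a per-machine sliding window. The class name `RecomputeMonitor`, its method, and the sample values are hypothetical, and the sketch assumes timestamps arrive in non-decreasing order for each machine.

```python
from collections import defaultdict, deque


class RecomputeMonitor:
    """Flags a machine once it has caused at least r recomputes within t time units."""

    def __init__(self, r, t):
        self.r = r
        self.t = t
        self._events = defaultdict(deque)  # machine -> timestamps inside the window

    def record(self, machine, timestamp):
        """Record one recompute; return True if the machine is now suspect."""
        window = self._events[machine]
        window.append(timestamp)
        # Drop events that are older than the window.
        while timestamp - window[0] > self.t:
            window.popleft()
        return len(window) >= self.r


monitor = RecomputeMonitor(r=3, t=10)
flags = [monitor.record("m1", ts) for ts in (0, 4, 9)]
other_machine = monitor.record("m2", 9)
after_quiet_period = monitor.record("m1", 30)
```

A scheduler could use the returned flag to decide when speculative recomputation of that machine's unread outputs is worthwhile; that coupling is not shown here.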

13 Speculative Recomputes
Anticipatorily recompute tasks whose outputs are unread. [Figure: a task fails to read its input data (read fail); speculative recomputes are issued for the tasks whose output data is still unread.]

14 Mantri’s Outlier Mitigation
Avoid Recomputation: Preferential Replication + Speculative Recomp.; Network-aware Task Placement; Duplicate Outliers; Cognizant of Workload Imbalance

15 Reduce Tasks
Tasks access the output of tasks from previous phases. The reduce phase accounts for 74% of total traffic. [Figure: distributed file system, local reads by map tasks, network reads by reduce tasks, with one reduce task marked as an outlier.]

16 Variable Congestion
[Figure: a reduce task reading map output across racks.] Data locality is used for everything else, but not for reduce. The uplink of the rack is the congestion hotspot. Smart placement smooths hotspots.

17 Traffic-based Allotment
Goal: minimize phase completion time. For every rack: $d$: data; $u$: available uplink bandwidth; $v$: available downlink bandwidth. Solve for the task allocation fractions, $a_i$.

18 Local Control is a good approx.
Goal: minimize phase completion time. For every rack: $d$: data, $D$: data over all racks; $u$: available uplink bandwidth; $v$: available downlink bandwidth. Link utilizations average out in the long term and are steady in the short term. Let rack $i$ have an $a_i$ fraction of the tasks. Time uploading: $T_u = d_i\,(1 - a_i) / u_i$. Time downloading: $T_d = (D - d_i)\,a_i / v_i$. $\mathrm{Time}_i = \max\{T_u, T_d\}$.
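The per-rack formulas above translate directly into code. The sketch below evaluates Time_i for a given a_i and also computes the fraction at which upload and download times are equal for one rack considered in isolation; that balance point minimizes max{T_u, T_d} for that rack because T_u falls and T_d rises as a_i grows. It does not enforce that the fractions sum to 1 across racks, so it is an illustration of the formulas rather than the allocation procedure used by Mantri. Function names and sample values are hypothetical.

```python
def rack_time(d_i, D, u_i, v_i, a_i):
    """Time_i = max{T_u, T_d} for rack i holding d_i of the D units of data."""
    t_up = d_i * (1 - a_i) / u_i    # data leaving the rack over its uplink
    t_down = (D - d_i) * a_i / v_i  # data entering the rack over its downlink
    return max(t_up, t_down)


def balanced_fraction(d_i, D, u_i, v_i):
    """a_i at which T_u == T_d for this rack alone (assumes D > 0, u_i > 0, v_i > 0)."""
    return d_i * v_i / (d_i * v_i + (D - d_i) * u_i)


# Illustrative rack: 40 of 100 data units, uplink 10, downlink 20.
a_star = balanced_fraction(40, 100, 10, 20)
t_star = rack_time(40, 100, 10, 20, a_star)
t_low = rack_time(40, 100, 10, 20, 0.3)
t_high = rack_time(40, 100, 10, 20, 0.8)
```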

19 Mantri’s Outlier Mitigation
Avoid Recomputation: Preferential Replication + Speculative Recomp.; Network-aware Task Placement: Traffic on link proportional to bandwidth; Duplicate Outliers; Cognizant of Workload Imbalance

20 Contentions cause outliers
Tasks contend for local resources: processor, memory, etc. Duplicate tasks elsewhere in the cluster. Current schemes duplicate towards the end of the phase (e.g., LATE [OSDI 2008]). Duplicate an outlier or schedule a pending task?

21 Resource-Aware Restart
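[Figure: timeline of a running task. At time "now" the running copy has $t_{rem}$ remaining, and a potential restart would take $t_{new}$.] Save time and resources: act when $P\big((c + 1)\,t_{new} < c\,t_{rem}\big)$ is high, that is, when $c + 1$ copies each running for $t_{new}$ would consume less than the $c$ current copies each running for $t_{rem}$. Continuously observe and kill wasteful copies.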
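A minimal sketch of the resource-aware check follows. It assumes the test compares the resources used by c + 1 copies that each run for t_new against c copies that each run for t_rem, and it estimates the probability from a list of (t_rem, t_new) estimate pairs. The function names, the sampling approach, and the numbers are assumptions for illustration, not Mantri's estimator.

```python
def duplicate_saves_resources(t_rem, t_new, c=1):
    """True when c + 1 copies running for t_new cost less than c copies running for t_rem."""
    return (c + 1) * t_new < c * t_rem


def probability_of_saving(samples, c=1):
    """Fraction of (t_rem, t_new) estimate pairs for which a duplicate saves resources."""
    samples = list(samples)
    hits = sum(1 for t_rem, t_new in samples if duplicate_saves_resources(t_rem, t_new, c))
    return hits / len(samples)


# Hypothetical estimates for one running task.
estimates = [(100, 40), (100, 60), (100, 45), (100, 90)]
p_save = probability_of_saving(estimates, c=1)
```

With one running copy (c = 1) the check reduces to t_new being less than half of t_rem, so duplicates are launched only for tasks that are far behind.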

22 Mantri’s Outlier Mitigation
Avoid Recomputation: Preferential Replication + Speculative Recomp.; Network-aware Task Placement: Traffic on link proportional to bandwidth; Duplicate Outliers: Resource-Aware Restart; Cognizant of Workload Imbalance

23 Workload Imbalance
A quarter of the outlier tasks have more data to process. Unequal key partitions for reduce tasks. Ignoring these is better than duplicating them. Schedule tasks in descending order of data to process, since time ∝ (data to process) [Graham ‘69]. At worst, 33% worse than optimal.
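Scheduling in descending order of data to process is the classic longest-processing-time-first heuristic. The sketch below assumes run time is proportional to input size and that each task goes to whichever slot frees up first; the function name, the slot model, and the sample sizes are illustrative choices, not details from the talk.

```python
import heapq


def schedule_longest_first(task_sizes, num_slots):
    """Assign tasks to slots, largest input first, always picking the least-loaded slot.

    Returns (assignment, makespan), where assignment maps task index -> slot index.
    """
    order = sorted(range(len(task_sizes)), key=lambda i: task_sizes[i], reverse=True)
    slots = [(0, s) for s in range(num_slots)]  # (current load, slot index)
    heapq.heapify(slots)
    assignment = {}
    for i in order:
        load, s = heapq.heappop(slots)          # slot that becomes free first
        assignment[i] = s
        heapq.heappush(slots, (load + task_sizes[i], s))
    makespan = max(load for load, _ in slots)
    return assignment, makespan


assignment, makespan = schedule_longest_first([7, 5, 4, 3, 3], num_slots=2)
```

For these sizes the heuristic finishes at 12 while the best possible split finishes at 11 (7 + 4 versus 5 + 3 + 3), which is inside the 33% bound quoted on the slide.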

24 Mantri’s Outlier Mitigation
Avoid Recomputation: Preferential Replication + Speculative Recomp.; Network-aware Task Placement: Traffic on link proportional to bandwidth; Duplicate Outliers: Resource-Aware Restart; Cognizant of Workload Imbalance: Schedule in descending order of size. Principles: predict to act early; be resource-aware; act based on the cause. The techniques range from reactive to proactive.

25 Results
Deployed in production Bing clusters. Trace-driven simulations: mimic workflow, failures, data skew; compare with existing and idealized schemes.

26 Jobs in the Wild
Jobs faster by 32% at median, consuming fewer resources. Act early: duplicates issued when a task is 42% done (77% for Dryad). Light: issues fewer copies (0.47x as many as Dryad). Accurate: 2.8x higher success rate of copies.

27 Recomputation Avoidance
Eliminates most recomputes with minimal extra resources. Replication and speculation work well in tandem.

28 Network-Aware Placement
Bandwidth approximations: Mantri approximates the ideal well.

29 Summary
From measurements in a production cluster: outliers are a significant problem and are due to an interplay between storage, network and map-reduce. Mantri is a cause- and resource-aware mitigation. Deployment shows encouraging results. “Reining in the Outliers in MapReduce Clusters using Mantri”, USENIX OSDI 2010

