1
Fair Scheduling for MapReduce Clusters
Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Khaled Elmeleegy+, Scott Shenker, Ion Stoica
UC Berkeley, *Facebook Inc, +Yahoo! Research
2
Motivation
MapReduce / Hadoop was originally designed for high-throughput batch processing. Today's workload is far more diverse:
– Many users want to share a cluster (engineering, marketing, business intelligence, etc.)
– The vast majority of jobs are short (ad-hoc queries, sampling, periodic reports)
– Response time is critical (interactive queries, deadline-driven reports)
How can we efficiently share MapReduce clusters between users?
3
Example: Hadoop at Facebook 2000-node, 20 PB data warehouse, growing at 40 TB/day Applications: data mining, spam detection, ads 200 users (half non-engineers) 7500 MapReduce jobs / day Median: 84s
4
Approaches to Sharing
Hadoop default scheduler (FIFO)
– Problem: short jobs get stuck behind long ones
Separate clusters
– Problem 1: poor utilization
– Problem 2: costly data replication (full replication is nearly infeasible at Facebook / Yahoo! scale; partial replication prevents cross-dataset queries)
5
Our Work: Hadoop Fair Scheduler Fine-grained sharing at level of map & reduce tasks Predictable response times and user isolation Example with 3 job pools:
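As a rough illustration of how task slots might be divided among three pools, here is a minimal sketch in Python. It is not the actual Hadoop Fair Scheduler code, and the pool names, weights, and demands are made up for the example.

```python
def fair_shares(total_slots, pools):
    """pools: name -> {'weight': w, 'demand': most slots the pool can currently use}."""
    shares = {name: 0.0 for name in pools}
    active = set(pools)
    remaining = float(total_slots)
    while active and remaining > 1e-9:
        total_weight = sum(pools[n]['weight'] for n in active)
        handed_out = 0.0
        for n in list(active):
            ideal = remaining * pools[n]['weight'] / total_weight
            grant = min(ideal, pools[n]['demand'] - shares[n])
            shares[n] += grant
            handed_out += grant
            if shares[n] >= pools[n]['demand']:
                active.discard(n)   # this pool is satisfied; others share the surplus
        if handed_out == 0:
            break
        remaining -= handed_out
    return shares

# Example with 3 hypothetical pools sharing 100 map slots.
print(fair_shares(100, {
    'engineering': {'weight': 2, 'demand': 80},
    'marketing':   {'weight': 1, 'demand': 10},
    'bi':          {'weight': 1, 'demand': 60},
}))
# -> roughly {'engineering': 60, 'marketing': 10, 'bi': 30}
```

Satisfied pools drop out of the computation and their unused slots flow to the busier pools, which is the basic behavior a pool-based fair scheduler provides on top of Hadoop's slot model.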
6
What This Talk is About Two challenges for fair sharing in MapReduce – Data locality (conflicts with fairness) – Task interdependence (reduces can’t finish until maps are done, leading to slot hoarding) Some solutions Current work: Mesos project
7
Outline Challenge 1: data locality – Solution: delay scheduling Challenge 2: reduce slot hoarding – Solution: copy-compute splitting Current work: Mesos
8
The Problem
[Figure: a master scheduling Job 1 and Job 2 across slave nodes; the blocks of File 1 and File 2 are spread over the slaves, and the scheduling order determines which job's tasks launch next]
9
The Problem
Problem: the fair scheduling decision hurts locality, and is especially bad for jobs with small input files
[Figure: the same setup as the previous slide; under fair sharing, tasks are launched on slaves that do not hold their input blocks]
10
Locality vs. Job Size at Facebook
[Figure: data locality as a function of job size; the smallest size bin contains 58% of jobs]
11
Special Instance: Sticky Slots
Under fair sharing, locality can be poor even when all jobs have large input files
Problem: jobs get “stuck” in the same set of task slots
– When a task in job j finishes, the slot it was running in is given back to j, because j is below its share
– Bad because each job's data files are spread out across all the nodes
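To see why slots become sticky, here is a tiny simulation (an illustration, not Hadoop code): when two jobs sit exactly at their fair share, the job whose task just finished is always the one furthest below its share, so strict fairness hands the freed slot straight back to it.

```python
import random

SLOTS = 10
fair_share = SLOTS / 2                  # two jobs sharing the cluster equally
running = {"job_1": 5, "job_2": 5}      # tasks currently running per job

def assign_freed_slot():
    # Strict fairness: give the slot to the job furthest below its fair share.
    return min(running, key=lambda j: running[j] - fair_share)

for step in range(20):
    finished = random.choice(list(running))   # some task of this job finishes
    running[finished] -= 1                     # ...freeing the slot it was using
    winner = assign_freed_slot()               # that job is now the only one below share
    running[winner] += 1
    assert winner == finished                  # so it always wins its old slot back
print("every freed slot went straight back to the job that freed it (sticky slots)")
```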
13
Solution: Delay Scheduling Relax queuing policy to make jobs wait for a limited time if they cannot launch local tasks Result: Very short wait time (1-5s) is enough to get nearly 100% locality
14
Delay Scheduling Example
Idea: wait a short time to get data-local scheduling opportunities
[Figure: the same setup as before; instead of taking the next free slot, a job waits ("Wait!") until a slave holding its data frees up]
15
Delay Scheduling Details
Scan jobs in the order given by the queuing policy, picking the first job that is permitted to launch a task
Jobs must wait before being permitted to launch non-local tasks:
– If wait < T1, only allow node-local tasks
– If T1 < wait < T2, also allow rack-local tasks
– If wait > T2, also allow off-rack tasks
Increase a job's time waited whenever it is skipped
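A minimal sketch of this logic in Python (not Hadoop's implementation); the Job/Task structures and the threshold values are illustrative assumptions.

```python
NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2

class Task:
    def __init__(self, locality_level):
        self.locality_level = locality_level

class Job:
    def __init__(self, locality_by_node):
        # locality_by_node: node name -> best locality level achievable on that node
        self.locality_by_node = locality_by_node
        self.wait_time = 0

    def best_task_for(self, node):
        if not self.locality_by_node:
            return None                    # no runnable tasks left
        return Task(self.locality_by_node.get(node, OFF_RACK))

def allowed_level(wait_time, T1, T2):
    if wait_time < T1:
        return NODE_LOCAL                  # only node-local tasks while the wait is short
    if wait_time < T2:
        return RACK_LOCAL                  # then also rack-local
    return OFF_RACK                        # eventually allow anything

def schedule(node, sorted_jobs, T1=5, T2=10):
    """Pick a task for a freed slot on `node`; sorted_jobs comes from the queuing policy."""
    for job in sorted_jobs:
        task = job.best_task_for(node)
        if task is None:
            continue
        if task.locality_level <= allowed_level(job.wait_time, T1, T2):
            if task.locality_level == NODE_LOCAL:
                job.wait_time = 0          # reset the wait once a local task launches
            return job, task
        job.wait_time += 1                 # skipped for locality: the job's wait grows
    return None
```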
16
Analysis
When is it worth it to wait, and for how long?
Waiting is worth it if E(wait time) < cost to run non-locally
Simple model: data-local scheduling opportunities arrive following a Poisson process
Expected response-time gain from delay scheduling:
E(gain) = (1 − e^(−w/t)) (d − t)
where w = wait amount, t = expected time to get a local scheduling opportunity, d = delay from running non-locally
Under this model, the optimal wait time is infinity
17
Analysis Continued
E(gain) = (1 − e^(−w/t)) (d − t)
Typical values of t and d:
t = (avg task length) / (file replication × tasks per node)
d ≥ avg task length
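Plugging hypothetical numbers into the formula from the slides (a 100-second average task, 3 replicas, and 8 task slots per node are assumptions for illustration) shows why a wait of just a few seconds already captures most of the available gain:

```python
import math

avg_task = 100.0                                  # seconds (assumed)
replication, tasks_per_node = 3, 8                # assumed cluster parameters
t = avg_task / (replication * tasks_per_node)     # ~4.2s to see a local opportunity
d = avg_task                                      # delay from running non-locally

def expected_gain(w):
    return (1 - math.exp(-w / t)) * (d - t)

for w in (1, 5, 30):
    print(f"wait {w:>2}s -> expected gain {expected_gain(w):.1f}s")
# The gain approaches d - t as w grows, so in this simple model waiting longer never
# hurts, and a wait of a few seconds already gets most of the benefit.
```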
18
More Analysis
What if some nodes are occupied by long tasks? Suppose we have:
– A fraction φ of active tasks that are long
– R replicas of each block
– L task slots per node
For a given data block, P(all replicas blocked by long tasks) = φ^(RL)
With R = 3 and L = 6, this is less than 2% as long as φ < 0.8
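A quick numerical check of this claim (the fraction of long tasks, written φ above, appears as phi here):

```python
R, L = 3, 6
for phi in (0.5, 0.7, 0.8):
    p = phi ** (R * L)
    print(f"phi = {phi}: P(all replicas blocked) = {p:.4f}")
# phi = 0.8 gives 0.8**18 ~ 0.018, i.e. just under 2%.
```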
19
Evaluation
100-node EC2 cluster, 4 cores/node
Job submission schedule based on job sizes and inter-arrival times at Facebook
– 100 jobs grouped into 9 “bins” of sizes
Three workloads: IO-heavy, CPU-heavy, mixed
Three schedulers: FIFO, fair sharing, fair sharing + delay scheduling (wait time = 5s)
20
Results for IO-Heavy Workload
[Figure: job response times for small jobs (1-10 input blocks), medium jobs (50-800 input blocks), and large jobs (4800 input blocks); annotated gains of up to 5x and 40%]
21
Results for IO-Heavy Workload
22
Sticky Slots Microbenchmark
5-50 big concurrent jobs on a 100-node EC2 cluster, with 5s delay scheduling
[Figure: up to a 2x improvement with delay scheduling]
23
Summary
Delay scheduling works well under two conditions:
– A sufficient fraction of tasks are short relative to jobs
– There are many locations where a task can run efficiently (blocks replicated across nodes, multiple task slots per node)
Generalizes beyond fair sharing & MapReduce:
– Applies to any queuing policy that gives an ordered job list (used in the Hadoop Fair Scheduler to implement a hierarchical policy)
– Applies to placement needs other than data locality
24
Outline Challenge 1: data locality – Solution: delay scheduling Challenge 2: reduce slot hoarding – Solution: copy-compute splitting Current work: Mesos
25
The Problem
[Figure: map slots and reduce slots over time, shared by Job 1 and Job 2]
26
The Problem
[Figure: Job 2 maps done]
27
The Problem
[Figure: Job 2 reduces done]
28
The Problem
[Figure: Job 3 submitted]
29
The Problem
[Figure: Job 3 maps done]
30
The Problem
[Figure: map and reduce slot usage over time for Jobs 1-3]
31
The Problem
Problem: Job 3 can't launch reduces until Job 1 finishes
32
Solutions to Reduce Slot Hoarding
Task killing (preemption):
– If a job's fair share is not met after a timeout, kill tasks from jobs over their share
– Pros: simple, predictable
– Cons: wasted work
– Available in Hadoop 0.21 and CDH 3
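A minimal sketch of what such preemption could look like (an illustration, not the Hadoop 0.21 / CDH 3 code); the timeout value, the Job fields, and the kill policy are assumptions.

```python
PREEMPTION_TIMEOUT = 600   # seconds a job may sit below its fair share (assumed value)

class Job:
    def __init__(self, name, running_tasks, fair_share, last_time_at_share):
        self.name = name
        self.running_tasks = running_tasks
        self.fair_share = fair_share
        self.last_time_at_share = last_time_at_share

    def kill_task(self):
        self.running_tasks -= 1   # in a real scheduler this would kill a running task

def preempt_if_needed(jobs, now):
    # Jobs below their share for longer than the timeout are considered starved.
    starved = [j for j in jobs
               if j.running_tasks < j.fair_share
               and now - j.last_time_at_share > PREEMPTION_TIMEOUT]
    needed = sum(j.fair_share - j.running_tasks for j in starved)
    # Reclaim slots from the jobs that are furthest over their share.
    for job in sorted(jobs, key=lambda j: j.running_tasks - j.fair_share, reverse=True):
        while needed > 0 and job.running_tasks > job.fair_share:
            job.kill_task()        # wasted work: the killed task must be rerun later
            needed -= 1
    return needed <= 0
```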
33
Solutions to Reduce Slot Hoarding
Copy-compute splitting:
– Divide reduce tasks into two types of tasks: “copy tasks” and “compute tasks”
– Use separate admission control for the two: a larger number of copy slots per node (with a per-job cap) and a smaller number of compute slots that tasks “graduate” into
– Pros: no wasted work, overlaps IO & CPU more
– Cons: extra memory, may cause IO contention
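A minimal sketch of the two-level admission control this describes; the slot counts and the per-job cap are illustrative, not values from the talk.

```python
COPY_SLOTS_PER_NODE = 8        # many copy slots: fetching map output is mostly IO
COMPUTE_SLOTS_PER_NODE = 2     # fewer compute slots: reduce logic needs CPU and memory
PER_JOB_COPY_CAP = 4           # keeps one job from hoarding all the copy slots

class Node:
    def __init__(self):
        self.copying = {}      # job -> number of reduce tasks fetching map output
        self.computing = 0     # reduce tasks that have "graduated" into compute slots

    def can_start_copy(self, job):
        total = sum(self.copying.values())
        return (total < COPY_SLOTS_PER_NODE and
                self.copying.get(job, 0) < PER_JOB_COPY_CAP)

    def start_copy(self, job):
        if self.can_start_copy(job):
            self.copying[job] = self.copying.get(job, 0) + 1
            return True
        return False

    def graduate(self, job):
        """Move a reduce task from the copy phase into a compute slot, if one is free."""
        if self.computing < COMPUTE_SLOTS_PER_NODE and self.copying.get(job, 0) > 0:
            self.copying[job] -= 1
            self.computing += 1
            return True
        return False           # otherwise keep copying; no compute slot is hoarded
```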
34
Copy-Compute Splitting in a Single Job
[Figure: map and reduce activity over time, with the point where maps are done marked]
35
Copy-Compute Splitting in a Single Job
[Figure: each reduce split into a copy phase (only IO) and a compute phase (only computation)]
36
Copy-Compute Splitting in a Single Job
[Figure: copy and compute phases of the reduces shown against the map timeline]
37
Copy-Compute Splitting Evaluation
Initial experiments showed a 10% gain for a single job and 1.5-2x response-time gains over naïve fair sharing in a multi-job scenario
Copy-compute splitting is not yet implemented anywhere (task killing was good enough for now)
38
Outline Challenge 1: data locality – Solution: delay scheduling Challenge 2: reduce slot hoarding – Solution: copy-compute splitting Current work: Mesos
39
Challenges in Hadoop Cluster Management
– Isolation, both fault and resource isolation (e.g. if co-locating production and ad-hoc jobs)
– Testing and deploying upgrades
– JobTracker scalability (in larger clusters)
– Running non-Hadoop jobs: MPI, streaming MapReduce, etc.
40
Mesos
Mesos is a cluster resource manager over which multiple instances of Hadoop and other parallel applications can run
[Figure: production Hadoop, ad-hoc Hadoop, and MPI instances running on Mesos across the cluster's nodes]
41
What's Different About It?
Other resource managers exist today, including:
– Batch schedulers (e.g. Torque, Sun Grid Engine)
– VM schedulers (e.g. Eucalyptus)
However, these have two problems with Hadoop-like workloads:
– Data locality is hurt due to static partitioning
– Utilization is compromised because jobs hold nodes for their full duration
Mesos addresses these through two features:
– Fine-grained sharing at the level of tasks
– Two-level scheduling where jobs control placement
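A minimal sketch of the two-level idea (the manager hands out resources, and each job or framework decides what to place where); this is an illustration only, not the Mesos API, and the class and field names are made up.

```python
class Task:
    def __init__(self, cpus, mem):
        self.cpus, self.mem = cpus, mem

class Framework:
    """A hypothetical framework (e.g. one Hadoop instance) that prefers data-local nodes."""
    def __init__(self, name, preferred_nodes):
        self.name = name
        self.preferred_nodes = preferred_nodes
        self.allocated = 0

    def resource_offer(self, node, cpus, mem):
        # Level 2: the framework controls placement. It accepts the offer only if the
        # node holds its data; otherwise it declines and waits for a better offer
        # (the same idea as delay scheduling).
        if node in self.preferred_nodes and cpus >= 1:
            return [Task(cpus=1, mem=min(mem, 1024))]
        return []

class Master:
    """Level 1: decides which framework each resource offer goes to (here, simple fairness)."""
    def __init__(self, frameworks):
        self.frameworks = frameworks

    def offer(self, node, cpus, mem):
        framework = min(self.frameworks, key=lambda f: f.allocated)
        tasks = framework.resource_offer(node, cpus, mem)
        for t in tasks:
            framework.allocated += t.cpus
        return tasks
```

Because sharing happens per task and the framework can decline offers, nodes are not statically partitioned and are returned to the pool as soon as a task finishes.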
42
Mesos Architecture
[Figure: a Mesos master coordinating Mesos slaves; Hadoop 0.19, Hadoop 0.20, and MPI schedulers submit jobs through the master, which launches the corresponding executors and tasks on the slaves]
43
Mesos Status
Implemented in 10,000 lines of C++
– Efficient event-driven messaging lets the master scale to 10,000 scheduling decisions / second
– Fault tolerance using ZooKeeper
– Supports Hadoop, MPI & Torque
Test deployments at Twitter and Facebook
Open source, at www.github.com/mesos
44
Thanks! Any questions? matei@eecs.berkeley.edu