Multi-Resource Packing for Cluster Schedulers
Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, Aditya Akella
Performance of cluster schedulers
We observe that:
- Resources are fragmented, i.e. machines are running below capacity
- Even at 100% usage, goodput is much smaller due to over-allocation
- Even Pareto-efficient multi-resource fair schemes result in much lower performance
Tetris: up to 40% improvement in makespan¹ and job completion time, with near-perfect fairness
¹ Time to finish a set of jobs
Findings from analysis of Bing and Facebook traces
Diversity in multi-resource requirements:
- Tasks need varying amounts of each resource
- Demands for resources are weakly correlated
Multiple resources become tight. This matters because there is no single bottleneck resource:
- Enough cross-rack network bandwidth to use all CPU cores
Upper bounding potential gains:
- Reduce makespan (time to finish a set of jobs) by up to 49%
- Reduce avg. job completion time by up to 46%
Why so bad #1
Production schedulers neither pack tasks nor consider all their relevant resource demands
#1 Resource Fragmentation
#2 Over-allocation
Resource Fragmentation (RF)
Setup: Machine A with 4 GB memory, Machine B with 4 GB memory; tasks T1: 2 GB, T2: 2 GB, T3: 4 GB
- Current schedulers: avg. task completion time = 1.33 t
- "Packer" scheduler: avg. task completion time = 1 t
Current schedulers:
- Resources are allocated in terms of slots
- Free resources cannot be assigned to tasks
- RF increases with the number of resources being allocated!
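The two averages on this slide can be reproduced with a small simulation. This is only a sketch under stated assumptions: every task runs for exactly one t, the "current" scheduler is modeled as placing each task on the machine with the most free memory, and the packer is modeled as best fit. The function names (`avg_completion_time`, `spread`, `pack`) are hypothetical and not taken from any scheduler.

```python
def avg_completion_time(tasks, capacities, choose):
    """Average completion time, in units of t, of tasks that each run for one t."""
    assert all(mem <= max(capacities) for _, mem in tasks), "every task must fit somewhere"
    pending = list(tasks)            # (name, memory_gb), in arrival order
    finish = {}
    now = 0
    while pending:
        free = list(capacities)      # all tasks last one t, so machines are empty at each step
        waiting = []
        for name, mem in pending:
            fits = [m for m, f in enumerate(free) if f >= mem]
            if fits:
                free[choose(fits, free)] -= mem
                finish[name] = now + 1
            else:
                waiting.append((name, mem))
        pending = waiting
        now += 1
    return sum(finish.values()) / len(finish)


def spread(fits, free):
    """Assumed model of a current scheduler: use the machine with the most free memory."""
    return max(fits, key=lambda m: free[m])


def pack(fits, free):
    """Assumed model of a packer: best fit, i.e. the fullest machine that still fits."""
    return min(fits, key=lambda m: free[m])


TASKS = [("T1", 2), ("T2", 2), ("T3", 4)]
MACHINES = [4, 4]                    # Machine A and Machine B, 4 GB each

print(avg_completion_time(TASKS, MACHINES, spread))  # about 1.33 (T3 waits one step)
print(avg_completion_time(TASKS, MACHINES, pack))    # 1.0 (all three run at once)
```

Spreading T1 and T2 leaves 2 GB free on each machine, 4 GB in total, yet T3 cannot use it. That stranded capacity is the fragmentation the slide refers to.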
Over-Allocation
Setup: Machine A with 4 GB memory and 20 MB/s network; tasks T1: 2 GB memory, 20 MB/s network; T2: 2 GB memory, 20 MB/s network; T3: 2 GB memory
- Current schedulers: avg. task completion time = 2.33 t
- "Packer" scheduler: avg. task completion time = 1.33 t
Current schedulers:
- Not all of a task's resource demands are explicitly allocated
- Disk and network are over-allocated
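The over-allocation numbers can be checked the same way. The sketch below assumes that every task takes one t when uncontended, that a scheduler which ignores network admits tasks by memory alone, and that over-subscribed network stretches the whole batch by the over-subscription factor. The function name and the batch model are illustrative assumptions, not the behavior of any particular scheduler.

```python
def avg_completion_time(tasks, mem_cap, net_cap, check_network):
    """Average completion time in units of t on one machine, scheduling in batches."""
    pending = list(tasks)            # (name, memory_gb, network_mbps), in arrival order
    finish = {}
    now = 0.0
    while pending:
        batch, waiting = [], []
        mem_used = net_used = 0
        for task in pending:
            _, mem, net = task
            fits = mem_used + mem <= mem_cap
            if check_network:        # the packer also checks the network dimension
                fits = fits and net_used + net <= net_cap
            if fits:
                batch.append(task)
                mem_used += mem
                net_used += net
            else:
                waiting.append(task)
        assert batch, "every task must fit on an empty machine"
        # Assumed contention model: over-allocated network stretches the whole batch.
        now += max(1.0, net_used / net_cap)
        for name, _, _ in batch:
            finish[name] = now
        pending = waiting
    return sum(finish.values()) / len(finish)


TASKS = [("T1", 2, 20), ("T2", 2, 20), ("T3", 2, 0)]

print(avg_completion_time(TASKS, 4, 20, check_network=False))  # about 2.33
print(avg_completion_time(TASKS, 4, 20, check_network=True))   # about 1.33
```

With memory-only admission, T1 and T2 share the 20 MB/s link and both finish at 2t, then T3 finishes at 3t. Checking network as well pairs T1 with T3 and defers T2.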
Why so bad #2
Multi-resource fairness schemes do not help either
- Work conserving != no fragmentation or over-allocation
- Treat the cluster as a big bag of resources
  - Hides the impact of resource fragmentation
- Assume a job has a fixed resource profile
  - Different tasks in the same job have different demands
  - The schedule impacts a job's current resource profile
  - Can schedule to create complementary profiles
- Pareto¹-efficient != performant
Packer scheduler vs. DRF: avg. job completion time 50%, makespan 33%
¹ No job can increase its share without decreasing the share of another
Current schedulers
1. Resource fragmentation
2. Over-allocation
3. Fair allocations sacrifice performance
Competing objectives: cluster efficiency vs. job completion time vs. fairness
# 1 Pack tasks along multiple resources to improve cluster efficiency and reduce makespan
Theory
Multi-resource packing of tasks is similar to multi-dimensional bin packing
- Balls could be tasks
- Bin could be (machine, time)
- APX-hard¹
Avoiding fragmentation looks like tight bin packing
- Reduces # of bins used -> reduces makespan

Practice
Existing heuristics do not directly apply here:
- They assume balls of a fixed size, but task demands vary with time / machine placed and are elastic
- They assume balls are known a priori, but the scheduler must cope with online arrival of jobs, dependencies, cluster activity

¹ APX-hard is a strict subset of NP-hard
# 1 A packing heuristic
Packing tasks to machines = multi-dimensional bin packing
- Ball = task resource demand vector
- Bin = machine available resource vector
Fit: task resource demand vector < machine resource vector
Alignment score (A)
"A" works because:
1. Check for fit ensures no over-allocation (addresses over-allocation)
2. Bigger balls get bigger scores
3. Abundant resources are used first (addresses resource fragmentation)
4. Can spread load across machines
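The slide does not spell out the alignment score. A minimal sketch, assuming A is the dot product of the task's demand vector and the machine's available-resource vector, each normalized by machine capacity (the form described in the Tetris paper), could look like this. The helper names and the example numbers are hypothetical.

```python
def fits(demand, available):
    """Fit check: the task may only be placed if no resource would be over-allocated."""
    return all(d <= a for d, a in zip(demand, available))


def alignment_score(demand, available, capacity):
    """Dot product of demand and availability, normalized so units are comparable."""
    return sum((d / c) * (a / c) for d, a, c in zip(demand, available, capacity))


def pick_task(tasks, available, capacity):
    """Return the name of the fitting task with the highest alignment score, or None."""
    candidates = {name: d for name, d in tasks.items() if fits(d, available)}
    if not candidates:
        return None
    return max(candidates, key=lambda n: alignment_score(candidates[n], available, capacity))


# Hypothetical machine: 16 cores and 32 GB in total, 8 cores and 4 GB currently free.
CAPACITY = (16, 32)
AVAILABLE = (8, 4)
TASKS = {
    "cpu_heavy": (4, 2),   # (cores, GB)
    "mem_heavy": (1, 4),
    "too_big": (10, 1),    # fails the fit check on cores
}

print(pick_task(TASKS, AVAILABLE, CAPACITY))  # cpu_heavy
```

In this example cores are the abundant resource on the machine, so the task that mostly needs cores scores highest. The fit check rejects `too_big` before any score is computed.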
# 2 Faster average job completion time
CHALLENGE # 2 Job completion time heuristic
Shortest Remaining Time First¹ (SRTF) schedules jobs in ascending order of their remaining time
Q: What is the shortest "remaining time"?
"Remaining work" = remaining # tasks & tasks' durations & tasks' resource demands
A job completion time heuristic:
- Gives a score P to every job
- Extends SRTF to incorporate multiple resources
¹ SRTF: M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS'99]
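One way to turn "remaining work" into a number is to sum, over a job's unfinished tasks, the task duration times its normalized resource demands. The sketch below assumes that form; the function name, the capacity vector, and the two example jobs are hypothetical.

```python
def remaining_work(tasks, capacity):
    """Sum over unfinished tasks of duration * (sum of capacity-normalized demands)."""
    return sum(
        duration * sum(d / c for d, c in zip(demand, capacity))
        for duration, demand in tasks
    )


CAPACITY = (16, 32)                  # hypothetical (cores, GB) used for normalization

# Job X: 2 long tasks. Job Y: 10 short, small tasks.
JOB_X = [(10, (2, 4))] * 2           # (duration, (cores, GB))
JOB_Y = [(1, (1, 2))] * 10

print(remaining_work(JOB_X, CAPACITY))  # 5.0
print(remaining_work(JOB_Y, CAPACITY))  # 1.25
```

Job Y has more remaining tasks but less remaining work, so a multi-resource SRTF ordering would place it first. Counting tasks alone would order the two jobs the other way round.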
CHALLENGE # 2 Job completion time heuristic
Packing efficiency or completion time?
- A alone: delays job completion time
- P alone: loss in packing efficiency
Combine A and P scores!
1: among J runnable jobs
2: score(j) = A(t, R) + P(j)
3:   maximized over tasks t in j with demand(t) ≤ R (resources free)
4: pick j*, t* = argmax score(j)
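The four pseudocode steps can be sketched as follows. The assumptions are: A is a dot product over already-normalized vectors, P(j) is given as a number where higher is better (for example, negated remaining work), and `weight` is a hypothetical parameter for balancing the two terms that does not appear on the slide.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def pick_job_and_task(jobs, free, p_score, weight=1.0):
    """Return (score, job, task) maximizing A(t, R) + weight * P(j), or None.

    jobs:    {job: {task: demand_vector}}, demands normalized to machine capacity
    free:    free-resource vector R, normalized the same way
    p_score: {job: P(j)}, higher is better (assumed: negated remaining work)
    """
    best = None
    for job, tasks in jobs.items():
        for task, demand in tasks.items():
            if all(d <= r for d, r in zip(demand, free)):   # demand(t) <= R
                score = dot(demand, free) + weight * p_score[job]
                if best is None or score > best[0]:
                    best = (score, job, task)
    return best


FREE = (0.5, 0.5)
JOBS = {
    "long": {"t1": (0.5, 0.4)},      # packs slightly better
    "short": {"t2": (0.4, 0.4)},     # has far less remaining work
}
P = {"long": -0.30, "short": -0.05}

print(pick_job_and_task(JOBS, FREE, P)[1:])              # ('short', 't2')
print(pick_job_and_task(JOBS, FREE, P, weight=0.0)[1:])  # ('long', 't1')
```

Setting `weight=0.0` recovers pure packing, which picks the long job here. Adding P lets a job with little remaining work win when its packing score is close.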
# 3 Achieve performance and fairness
# 3 Fairness heuristic
- A says: "task i should go here to improve packing efficiency"
- P says: "schedule job j next to improve job completion time"
- Fairness says: "this set of jobs should be scheduled next"
Feasible solution which typically can satisfy all of them
Performance and fairness do not mix well in general
But ... we can get "perfect fairness" and much better performance
# 3 Fairness heuristic
Fairness is not a tight constraint
- Long-term fairness, not short-term fairness
- Lose a bit of fairness for a lot of gains in performance
Heuristic: fairness knob, F ∈ [0, 1)
- Pick the best-for-performance task from among the 1-F fraction of jobs furthest from fair share
- F = 0: most efficient scheduling
- F → 1: close to perfect fairness
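The knob can be sketched as a filter applied before the performance-based choice. The sketch assumes each job has a "deficit" (fair share minus current share, larger meaning further below fair share) and a single performance score. Both inputs, the function names, and the rounding rule (round up, keep at least one job) are illustrative assumptions.

```python
import math


def eligible_jobs(deficit, F):
    """Keep the (1 - F) fraction of jobs that are furthest below their fair share."""
    assert 0 <= F < 1
    ranked = sorted(deficit, key=deficit.get, reverse=True)
    keep = max(1, math.ceil((1 - F) * len(ranked)))   # assumed rounding rule
    return ranked[:keep]


def pick_job(perf_score, deficit, F):
    """Best-for-performance job among those the fairness knob leaves eligible."""
    return max(eligible_jobs(deficit, F), key=perf_score.get)


DEFICIT = {"a": 0.3, "b": 0.1, "c": -0.2, "d": 0.0}   # hypothetical values
PERF = {"a": 0.2, "b": 0.6, "c": 0.9, "d": 0.5}       # e.g. a combined A + P score

print(pick_job(PERF, DEFICIT, F=0.0))    # c: every job is eligible
print(pick_job(PERF, DEFICIT, F=0.5))    # b: only a and b are eligible
print(pick_job(PERF, DEFICIT, F=0.75))   # a: only the most deprived job is eligible
```

As F grows, the choice moves from the best-performing job overall to the job furthest from its fair share, matching the F = 0 and F → 1 endpoints on the slide.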
Putting it all together
We saw:
- Packing efficiency
- Prefer small remaining work
- Fairness knob
Other things in the paper:
- Estimate task demands
- Deal with inaccuracies, barriers
- Ingestion / evacuation
YARN architecture, with changes to add Tetris (shown in orange):
- Job Manager: multi-resource asks; barrier hint
- Node Manager: track resource usage; enforce allocations
- Cluster-wide Resource Manager: new logic to match tasks to machines (+packing, +SRTF, +fairness)
- Messages exchanged: asks, offers, allocations, resource availability reports
Evaluation
- Pluggable scheduler in YARN 2.4
- 250-machine cluster deployment
- Replay Bing and Facebook traces
Efficiency
Tetris vs.            Makespan    Avg. Job Compl. Time
DRF                   28 %        35 %
Capacity Scheduler    29 %        30 %
Gains from:
- avoiding fragmentation
- avoiding over-allocation
[Figure: utilization (%) over time (s), one panel for Tetris and one for Capacity Scheduler; annotations: "Over-allocation", "Lower value => higher resource fragmentation"]
Fairness
The fairness knob quantifies the extent to which Tetris adheres to fair allocation
                                       No Fairness    F = 0.25    Full Fairness
                                       (F = 0)                    (F → 1)
Makespan                               50 %           25 %        10 %
Job Compl. Time                        40 %           35 %        23 %
Avg. Slowdown [over impacted jobs]     25 %           5 %         2 %
- Pack efficiently along multiple resources
- Prefer jobs with less "remaining work"
- Incorporate fairness
- Combine heuristics that improve packing efficiency with those that lower average job completion time
- Achieving desired amounts of fairness can coexist with improving cluster performance
- Implemented inside YARN; trace-driven simulations and deployment show encouraging initial results
We are working towards a YARN check-in
Backup slides
Estimating resource demands
Peak-usage demand estimates from:
- finished tasks in the same phase
- collecting statistics from recurring jobs
- input size/location of tasks
Placement impacts network/disk requirements
Resource Tracker:
- reports unused resources
- is aware of other cluster activities: ingestion and evacuation
[Figure: Machine 1 inbound network (MBytes/sec) over time (sec), showing "In Network Used" and "In Network Free"; annotations: "Peak Demand", "Under-utilization"]
Packer scheduler vs. DRF
Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across all users
Cluster: [18 cores, 36 GB memory]
Job: [task profile], # tasks
- A: [1 core, 2 GB], 18
- B: [3 cores, 1 GB], 6
- C
DRF scheduler
- Job schedule: all three jobs run in every interval (6 tasks, 2 tasks and 2 tasks per interval)
- Resources used: 18 cores, 16 GB in each of the intervals ending at t, 2t, 3t
- Durations: A: 3t, B: 3t, C: 3t
Packer scheduler
- Job schedule: one job per interval (18 tasks, then 6 tasks, then 6 tasks)
- Resources used: 18 cores, 36 GB in the first interval; 18 cores, 6 GB in each of the next two
- Durations: A: t, B: 2t, C: 3t
33% improvement in avg. job completion time (2t vs. 3t)
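The dominant shares behind the DRF schedule can be checked with a few lines. The sketch assumes, as one reading of the flattened figure, that DRF runs 6 tasks of A and 2 tasks of B per interval. Job C is left out because its task profile is not given in the text. `dominant_share` is a hypothetical helper name.

```python
from fractions import Fraction

CLUSTER = (18, 36)                   # (cores, GB memory)
PROFILE = {"A": (1, 2), "B": (3, 1)} # per-task (cores, GB), from the slide


def dominant_share(task_profile, running_tasks, cluster):
    """Largest fraction of any single cluster resource held by the job."""
    return max(Fraction(d * running_tasks, c) for d, c in zip(task_profile, cluster))


# Assumed DRF allocation per interval: 6 tasks of A, 2 tasks of B.
print(dominant_share(PROFILE["A"], 6, CLUSTER))   # 1/3
print(dominant_share(PROFILE["B"], 2, CLUSTER))   # 1/3

# Average job completion time from the durations on the slide, in units of t.
drf_avg = Fraction(3 + 3 + 3, 3)
packer_avg = Fraction(1 + 2 + 3, 3)
print((drf_avg - packer_avg) / drf_avg)           # 1/3, i.e. the 33% improvement
```

Both jobs end up with a dominant share of one third, which is what DRF aims for. Every job then takes 3t to finish, whereas running the jobs one after another finishes A and B earlier without delaying C.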
Packing efficiency does not achieve everything
Achieving packing efficiency does not necessarily improve job completion time
Machines 1, 2: [2 cores, 4 GB]
Job: [task profile], # tasks
- A: [2 cores, 3 GB], 6
- B: [1 core, 2 GB], 2
Pack
- Resources used: 4 cores, 6 GB in each of the intervals ending at t, 2t, 3t; 2 cores, 4 GB in the interval ending at 4t
- Durations: A: 3t, B: 4t
No Pack
- Resources used: 2 cores, 4 GB in the interval ending at t; 4 cores, 6 GB in each of the intervals ending at 2t, 3t, 4t
- Durations: A: 4t, B: t
29% improvement in avg. job completion time without packing (2.5t vs. 3.5t)
Ingestion / evacuation
Other cluster activities which produce background traffic:
- Ingestion = storing incoming data for later analytics
  e.g. some clusters report volumes of up to 10 TB per hour
- Evacuation = data evacuated and re-replicated before maintenance operations
  e.g. rack decommission for machine re-imaging
Resource Tracker reports are used by Tetris to avoid contention between its tasks and these activities
Workload analysis
Alternative Packing Heuristics
Fairness vs. Efficiency
Virtual Machine Packing != Tetris
Virtual machine packing: consolidating VMs, with multi-dimensional resource requirements, onto the fewest number of servers
But it focuses on different challenges, not task packing:
- Balance load across servers
- Ensure VM availability in spite of failures
- Allow for quick software and hardware updates
No corresponding entity to a job, hence job completion time is inexpressible
Explicit resource requirements (e.g. small VM) make VM packing simpler
Barrier knob, b ∈ [0, 1)
Tetris gives preference to the last tasks in a stage
- Offer resources to tasks in a stage preceding a barrier, where a b fraction of tasks have finished
- b = 1: no tasks preferentially treated
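A minimal sketch of the barrier check, assuming the scheduler knows how many tasks of a stage have finished and that the check is a plain threshold comparison. The function name and the example counts are hypothetical.

```python
def prefer_stage(finished, total, b):
    """True if at least a b fraction of the stage's tasks have already finished."""
    return total > 0 and finished / total >= b


print(prefer_stage(95, 100, b=0.9))   # True: the stragglers of this stage get preference
print(prefer_stage(50, 100, b=0.9))   # False
print(prefer_stage(99, 100, b=1.0))   # False: with b = 1 no running stage qualifies
```

With b = 1 the condition only holds once every task has finished, at which point there is nothing left to prefer. This matches the "no tasks preferentially treated" case above.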
Starvation prevention
Could it take a long time to accommodate large tasks?
But ...
1. Most tasks have demands within one order of magnitude of one another
2. Machines report resource availability to the scheduler periodically
   - The scheduler learns about all the resources freed up by tasks that finished in the preceding period together => can do reservation for large tasks
Cluster load vs. Tetris performance