Multi-Resource Packing for Cluster Schedulers Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, Aditya Akella.

Performance of cluster schedulers

We observe that:
 Resources are fragmented, i.e., machines are running below capacity
 Even at 100% usage, goodput is much smaller due to over-allocation
 Even Pareto-efficient multi-resource fair schemes result in much lower performance

Tetris: up to 40% improvement in makespan¹ and job completion time, with near-perfect fairness

¹ Makespan: time to finish a set of jobs

Findings from analysis of Bing and Facebook traces

Diversity in multi-resource requirements:
 Tasks need varying amounts of each resource
 Demands for resources are weakly correlated

This matters because multiple resources become tight and there is no single bottleneck resource:
 e.g., there is enough cross-rack network bandwidth to use all CPU cores

Upper bound on potential gains:
 reduce makespan by up to 49%
 reduce avg. job completion time by up to 46%

Why so bad, #1: Production schedulers neither pack tasks nor consider all their relevant resource demands
 #1 Resource fragmentation
 #2 Over-allocation

Resource fragmentation (RF)

[Figure: machines A and B, each with 4 GB of memory; tasks T1: 2 GB, T2: 2 GB, T3: 4 GB, placed over time by a current scheduler vs. a "packer" scheduler.]

Current schedulers allocate resources in terms of slots, so free resources can be left unable to be assigned to tasks: with T1 and T2 spread across both machines, T3 (4 GB) must wait, and avg. task completion time = 1.33t. A "packer" scheduler co-locates T1 and T2, leaving a whole machine for T3: avg. task completion time = 1t.

RF increases with the number of resources being allocated!

Over-allocation

[Figure: machine A with 4 GB memory and 20 MB/s network; tasks T1 and T2 each need 2 GB memory + 20 MB/s network, T3 needs 2 GB memory only.]

Not all task resource demands are explicitly allocated: current schedulers allocate memory and cores, while disk and network get over-allocated. Running T1 and T2 together over-allocates the network and both slow down, giving avg. task completion time = 2.33t. A "packer" scheduler that accounts for the network runs T1 with T3 first and then T2: avg. task completion time = 1.33t.

Why so bad, #2: Multi-resource fairness schemes do not help either
 They treat the cluster as one big bag of resources, which hides the impact of resource fragmentation
 They assume a job has a fixed resource profile, but different tasks in the same job have different demands; the schedule itself impacts a job's current resource profile and can be chosen to create complementary profiles

Work conserving != no fragmentation, no over-allocation
Pareto-efficient¹ != performant

Packer scheduler vs. DRF:  avg. job completion time 50% better,  makespan 33% better

¹ Pareto-efficient: no job can increase its share without decreasing the share of another

Current schedulers suffer from:
1. Resource fragmentation
2. Over-allocation
3. Fair allocations that sacrifice performance

Competing objectives: cluster efficiency vs. job completion time vs. fairness

#1 Pack tasks along multiple resources to improve cluster efficiency and reduce makespan

Theory vs. practice

Multi-resource packing of tasks is similar to multi-dimensional bin packing: balls are tasks, bins are machines (over time). The problem is APX-hard¹.

Avoiding fragmentation looks like tight bin packing: reducing the number of bins used reduces makespan.

Existing heuristics do not directly apply here. They assume balls of a fixed size that are known a priori, whereas tasks:
 have demands that vary with time and with the machine where they are placed
 are elastic
 arrive online, along with dependencies and other cluster activity

¹ APX-hard is a strict subset of NP-hard

#1 Packing heuristic

Packing tasks to machines = multi-dimensional bin packing:
 Ball = task resource demand vector
 Bin = machine available resource vector

Alignment score (A): for each machine, consider only tasks that fit (task demand vector < machine available resource vector, to ensure no over-allocation) and score them by how well their demands align with the machine's free resources.

"A" works because:
1. Checking for fit ensures no over-allocation
2. Bigger balls get bigger scores
3. Abundant resources are used first, which reduces resource fragmentation
4. Load can be spread across machines
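To make the heuristic concrete, here is a minimal Python sketch of the fit check and alignment score; Tetris computes A as the dot product of the task's demand vector and the machine's free-resource vector, but the dictionary-based resource vectors and example numbers below are illustrative, not the actual YARN implementation.

```python
from typing import Dict

def fits(demand: Dict[str, float], free: Dict[str, float]) -> bool:
    # Step 1: check for fit on every resource, so nothing is over-allocated.
    return all(demand[r] <= free.get(r, 0.0) for r in demand)

def alignment_score(demand: Dict[str, float], free: Dict[str, float]) -> float:
    # Alignment score A: dot product of the task's demand vector and the
    # machine's free-resource vector; only computed for tasks that fit.
    if not fits(demand, free):
        return float("-inf")
    return sum(demand[r] * free[r] for r in demand)

# Illustrative numbers: a memory-heavy task aligns better with the machine
# that has more free memory, steering tasks toward abundant resources.
task = {"cores": 2, "mem_gb": 4, "net_mbps": 10}
print(alignment_score(task, {"cores": 8, "mem_gb": 16, "net_mbps": 100}))  # higher
print(alignment_score(task, {"cores": 8, "mem_gb": 4,  "net_mbps": 100}))  # lower
```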

#2 Faster average job completion time

#2 Job completion time heuristic

Shortest Remaining Time First¹ (SRTF) schedules jobs in ascending order of their remaining time.

Q: What is the shortest "remaining time"? "Remaining work" = remaining # of tasks & task durations & task resource demands.

The job completion time heuristic:
 gives a score P to every job
 extends SRTF to incorporate multiple resources

¹ SRTF – M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS '99]
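A rough sketch of the "remaining work" score, assuming remaining work is approximated as the sum, over a job's unfinished tasks, of duration times total resource demand; Tetris's exact scalarization and normalization differ, and the field names here are hypothetical.

```python
from typing import Dict, List

def remaining_work(tasks: List[Dict]) -> float:
    # "Remaining work" combines the remaining #tasks, their durations and
    # their resource demands; here: duration x summed demand, over
    # unfinished tasks. (A simplification of the paper's formulation.)
    return sum(t["duration"] * sum(t["demand"].values())
               for t in tasks if not t["done"])

def srtf_score(tasks: List[Dict]) -> float:
    # Score P: higher for jobs with less remaining work, so taking the
    # max-P job mimics SRTF extended to multiple resources.
    return -remaining_work(tasks)
```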

Job completion time heuristic: combine the A and P scores

Packing efficiency or completion time? Using A alone can delay job completion time; using P alone can lose packing efficiency. So:

1: among J runnable jobs
2: score(j) = A(t, R) + ε·P(j)
3: where t = the task in j with max A such that demand(t) ≤ R (free resources)
4: pick j*, t* = argmax score(j)
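Putting the two scores together, a sketch of the per-machine matching loop, reusing the `fits`, `alignment_score` and `srtf_score` helpers above; the weight `eps` standing in for ε and the job/task dictionary shapes are assumptions made for illustration.

```python
def pick_task(jobs, machine_free, eps=0.1):
    # For each runnable job: take its best-fitting pending task by alignment
    # score A, add eps * P(job), and keep the overall argmax (job, task).
    best, best_score = None, float("-inf")
    for job in jobs:
        fitting = [t for t in job["pending"] if fits(t["demand"], machine_free)]
        if not fitting:
            continue
        task = max(fitting, key=lambda c: alignment_score(c["demand"], machine_free))
        score = (alignment_score(task["demand"], machine_free)
                 + eps * srtf_score(job["tasks"]))
        if score > best_score:
            best, best_score = (job, task), score
    return best  # None if no pending task fits this machine
```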

#3 Achieve performance and fairness

#3 Fairness heuristic

 A says: "task i should go here to improve packing efficiency"
 P says: "schedule job j next to improve job completion time"
 Fairness says: "this set of jobs should be scheduled next"

There is typically a feasible solution that can satisfy all of them. Performance and fairness do not mix well in general, but we can get "perfect fairness" and much better performance.

#3 Fairness heuristic

Fairness is not a tight constraint:
 long-term fairness matters, not short-term fairness
 lose a bit of fairness for a lot of gains in performance

Heuristic: pick the best-for-performance task from among the (1 − F) fraction of jobs furthest from their fair share.

Fairness knob, F ∈ [0, 1):
 F = 0: most efficient scheduling
 F → 1: close to perfect fairness
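A sketch of the fairness knob as a filter over runnable jobs; the deficit calculation (fair share minus current allocation) and the field names are assumptions, but the knob's effect, restricting the performance heuristics to the ⌈(1 − F)·|jobs|⌉ jobs furthest from their fair share, follows the slide.

```python
import math

def fairness_filter(jobs, F):
    # F in [0, 1): F = 0 keeps every job (most efficient scheduling);
    # F -> 1 keeps only the most underserved jobs (close to perfect fairness).
    k = max(1, math.ceil((1.0 - F) * len(jobs)))
    # Sort by how far below its fair share each job currently is (assumed fields).
    by_deficit = sorted(jobs, key=lambda j: j["fair_share"] - j["allocated"],
                        reverse=True)
    return by_deficit[:k]

# The packing / SRTF heuristics then pick the best-for-performance task
# from fairness_filter(runnable_jobs, F) instead of from all runnable jobs.
```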

Putting it all together

We saw:
 packing efficiency
 prefer small remaining work
 fairness knob

Other things in the paper:
 estimating task demands
 dealing with inaccuracies, barriers
 ingestion / evacuation

[Figure: YARN architecture with the changes that add Tetris shown in orange. Job Managers send multi-resource asks and barrier hints to the cluster-wide Resource Manager, which runs new logic to match tasks to machines (+packing, +SRTF, +fairness) and returns allocations; Node Managers send resource availability reports, track resource usage, and enforce allocations.]

Evaluation
 Pluggable scheduler in YARN 2.4
 250-machine cluster deployment
 Replay of Bing and Facebook traces

Efficiency

Tetris gains (from avoiding fragmentation and avoiding over-allocation):

                                 Makespan   Avg. job compl. time
 Tetris vs. Capacity Scheduler     29%            30%
 Tetris vs. DRF                    28%            35%

[Plot: cluster utilization (%) over time (s) under the Capacity Scheduler and under Tetris; lower values indicate higher resource fragmentation, and the marked regions indicate over-allocation.]

Fairness

The fairness knob quantifies the extent to which Tetris adheres to fair allocation:

                                      No fairness (F = 0)   Full fairness (F → 1)   F = 0.25
 Makespan improvement                        50%                   10%                25%
 Job completion time improvement             40%                   23%                35%
 Avg. slowdown [over impacted jobs]          25%                    2%                 5%

 Pack efficiently along multiple resources
 Prefer jobs with less "remaining work"
 Incorporate fairness

 We combine heuristics that improve packing efficiency with those that lower average job completion time
 Achieving desired amounts of fairness can coexist with improving cluster performance
 Implemented inside YARN; trace-driven simulations and deployment show encouraging initial results

We are working towards a YARN check-in.

Backup slides

Estimating resource demands

Peak demand estimates come from:
 o finished tasks in the same phase (peak usage)
 o the Resource Tracker collecting statistics from recurring jobs
 o the input size and location of tasks (placement impacts network/disk requirements)

The Resource Tracker also:
 o reports unused resources (under-utilization)
 o is aware of other cluster activities: ingestion and evacuation

[Plot: Machine 1 inbound network traffic (MBytes/s) over time (s), showing in-network used vs. in-network free.]
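A minimal sketch of the peak-usage estimate drawn from finished tasks of the same phase; taking the per-resource maximum is one plausible reading of "peak usage demands estimates", and the data shapes are assumed.

```python
from typing import Dict, List

def estimate_peak_demand(finished_task_usage: List[Dict[str, float]]) -> Dict[str, float]:
    # For each resource, take the maximum usage observed across already
    # finished tasks of the same phase; phases without history would fall
    # back to recurring-job statistics or input size/location (not shown).
    peak: Dict[str, float] = {}
    for usage in finished_task_usage:
        for resource, value in usage.items():
            peak[resource] = max(peak.get(resource, 0.0), value)
    return peak
```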

Packer scheduler vs. DRF

Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across all users.

Cluster: [18 cores, 36 GB memory]. Jobs (task profile, # tasks): A [1 core, 2 GB], 18; B [3 cores, 1 GB], 6; C [profile not recovered from the transcript].

[Figure: job schedules and per-interval resources used (t, 2t, 3t) under DRF vs. a packer scheduler.]

DRF runs a mix of tasks from A, B and C in every interval, so all jobs finish at 3t (durations A: 3t, B: 3t, C: 3t; avg. 3t). The packer runs the jobs back to back (durations A: t, B: 2t, C: 3t; avg. 2t), a 33% improvement in avg. job completion time.
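For reference, a sketch of the dominant-share computation that DRF equalizes; the allocation dictionaries are illustrative, using this example's 18-core / 36 GB cluster and the task profiles of jobs A and B.

```python
def dominant_share(allocated, capacity):
    # A user's dominant share is its largest per-resource share of the cluster.
    return max(allocated[r] / capacity[r] for r in allocated)

# DRF repeatedly grants the next task to the user with the smallest dominant
# share, which is why it interleaves jobs in every interval here:
capacity = {"cores": 18, "mem_gb": 36}
print(dominant_share({"cores": 6, "mem_gb": 12}, capacity))  # e.g. A with 6 tasks -> ~0.33
print(dominant_share({"cores": 6, "mem_gb": 2}, capacity))   # e.g. B with 2 tasks -> ~0.33
```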

Packing efficiency does not achieve everything

Achieving packing efficiency does not necessarily improve job completion time.

Machines 1, 2: [2 cores, 4 GB]. Jobs (task profile, # tasks): A [2 cores, 3 GB], 6; B [1 core, 2 GB], 2.

[Figure: job schedules and per-interval resources used (t … 4t) with and without packing.]

Pack: A's tasks keep both machines full for three intervals, so durations are A: 3t, B: 4t (avg. 3.5t). No pack: B's two small tasks run first, giving A: 4t, B: t (avg. 2.5t), a 29% improvement in avg. job completion time.

Ingestion / evacuation

Other cluster activities produce background traffic:
 ingestion = storing incoming data for later analytics; e.g., some clusters report volumes of up to 10 TB per hour
 evacuation = data evacuated and re-replicated before maintenance operations; e.g., rack decommissioning for machine re-imaging

The Resource Tracker reports these activities, and Tetris uses the reports to avoid contention between its tasks and these activities.

Workload analysis

Alternative packing heuristics

Fairness vs. efficiency

Fairness vs. efficiency

Virtual machine packing != Tetris

VM packing consolidates VMs, with multi-dimensional resource requirements, onto the fewest number of servers. But it focuses on different challenges, not task packing:
 balance load across servers
 ensure VM availability in spite of failures
 allow for quick software and hardware updates
 there is NO corresponding entity to a job, so job completion time is inexpressible
 explicit resource requirements (e.g., a "small" VM) make VM packing simpler

Barrier knob, b ∈ [0, 1)

Tetris gives preference to the last tasks in a stage: it offers resources to tasks in a stage preceding a barrier once a fraction b of that stage's tasks have finished.
 b = 1: no tasks are preferentially treated
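A tiny sketch of how the barrier knob could gate this preference; the stage representation and field names are hypothetical.

```python
def barrier_preferred(stages, b):
    # Preferentially offer resources to stages that precede a barrier once at
    # least a fraction b of their tasks have finished; b = 1 disables the
    # preference (no unfinished stage can qualify).
    return [s for s in stages
            if s["precedes_barrier"] and s["finished"] >= b * s["total"]]
```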

Starvation prevention

Could it take a long time to accommodate large tasks? In practice, no:
1. Most tasks have demands within one order of magnitude of one another.
2. Machines report resource availability to the scheduler periodically, so the scheduler learns about all the resources freed up by tasks that finished in the preceding period at once and can make reservations for large tasks.

Cluster load vs. Tetris performance