Altruistic Scheduling in Multi-Resource Clusters

Altruistic Scheduling in Multi-Resource Clusters. Robert Grandl, University of Wisconsin—Madison; Mosharaf Chowdhury, University of Michigan; Aditya Akella, University of Wisconsin—Madison; Ganesh Ananthanarayanan, Microsoft. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16). ISBN 978-1-931971-33-1.

Outline: Motivation, Problem Statement, Approach, Design Details, Implementation, Evaluation Results

Scheduler: Resource scheduling remains a key building block of modern data-intensive clusters. A scheduler must deal with multiple resources and complex DAG structures, provide performance isolation, and ensure both high performance and high efficiency.

Scheduler: Current state-of-the-art algorithms do not optimize over all of these constraints. Our analysis compares the following schedulers: Dominant Resource Fairness (DRF) for increasing multi-resource fairness, Shortest Job First (SJF) for minimizing average job completion time, and Tetris for increasing average resource utilization. Each scheduler outperforms its counterparts only on its preferred metric and significantly underperforms on the secondary metrics.

Optimizing Fairness, Performance, Efficiency

Outline: Problem Statement

Problem Statement: Given a collection of jobs – along with information about individual tasks' expected multi-resource requirements, durations, and DAG dependencies – we must schedule them such that each job receives a fair share of cluster resources, jobs complete as fast as possible, and the schedule is work-conserving. Can we build a scheduler that improves performance and efficiency without sacrificing performance isolation?

Goals for the Scheduler: We expect an ideal scheduler to satisfy the following goals. Fast completion: each DAG should complete as fast as possible. Work conservation: available resources should not remain unused. Starvation freedom: no DAG should starve for arbitrarily long periods. Note that offline DAG scheduling is NP-complete for all of these objectives (fairness, performance, and efficiency).

Outline: Approach

How Can We Do Better? The goal is to ensure performance isolation while remaining competitive on the secondary metrics. We observe two key characteristics of jobs: the all-or-nothing characteristic, and user-perceived performance isolation. Key insight: greed is not always good, and users do not care about short-term fair-share guarantees.

Approach: Relax the short-term fairness constraint and delay some tasks to make way for tasks that have a greater need to run now. In this altruistic approach, jobs voluntarily give up some resources; the scheduler uses these leftover resources to schedule jobs closest to completion and packs the remaining tasks to maximize efficiency.

Determining Altruism: At a high level, the approach is to determine and redistribute the leftover resources to improve efficiency. We first need to determine how many resources can be considered leftover and used for altruistic scheduling. We make observations in two categories of job properties: stage-level observations (number of stages, barriers) and path-level observations (length of the critical path, number of disjoint paths).

Stage-Level Observations: The number of stages in a DAG provides an approximation of its complexity.

Path-Level Observations: The number of disjoint paths gives a measure of how many sequences of stages can run in parallel.
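To make these DAG properties concrete, here is a small sketch (illustration only, not from the paper) that computes the stage count and critical-path length of a stage DAG; the graph encoding and per-stage durations are hypothetical.

# Sketch: stage count and critical-path length of a DAG given as
# {stage: [children]} plus per-stage durations (hypothetical values).
from functools import lru_cache

dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
duration = {"A": 10, "B": 20, "C": 5, "D": 15}

def critical_path_length(dag, duration):
    @lru_cache(maxsize=None)
    def longest_from(stage):
        # longest remaining duration starting at this stage
        rest = max((longest_from(c) for c in dag[stage]), default=0)
        return duration[stage] + rest
    roots = set(dag) - {c for kids in dag.values() for c in kids}
    return max(longest_from(r) for r in roots)

print(len(dag), critical_path_length(dag, duration))  # 4 stages, length 45 (A -> B -> D)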

Correlating Properties and Altruism: The number of barriers in a DAG has a positive correlation with altruism (0.75); the number of stages has a positive correlation (0.66); the critical path length has a positive correlation (0.71); the number of disjoint paths has a positive correlation (0.57).

Outline: Design Details

CARBYNE: A new scheduler with an altruistic, long-term approach. Each job yields a fraction of its resources without impacting its own completion time, and the leftover resources are used to improve the secondary metrics.

Approach. Theorem: altruism will not inflate any job's completion time in the offline case – i.e., unless new jobs arrive or existing jobs depart – for any inter-job scheduler. Solution approach: develop an offline scheduler, then modify it to work in the online case.

Offline Altruistic Scheduling operates at three levels: (1) how to perform inter-job scheduling to maximize the amount of leftover resources; (2) how an intra-job scheduler should determine how much a job can contribute to the leftover; and (3) how to redistribute the leftover across jobs.

Increasing Leftover (Inter-Job Scheduling): Use a closed-form version of DRF for inter-job scheduling. DRF elongates individual job completion times the most, due to multi-resource fair-sharing considerations, which means fair schedulers provide the most opportunities for altruistic scheduling.
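For context, a minimal sketch of the closed-form, divisible-allocation DRF idea this level builds on (not CARBYNE's implementation): every job receives the same dominant share, scaled until some resource saturates. The capacities and per-unit demands below are hypothetical.

# Sketch: closed-form DRF for divisible allocations (hypothetical numbers).
capacities = {"cpu": 90.0, "mem": 180.0}
demands = {"jobA": {"cpu": 1.0, "mem": 4.0},   # demand per unit of allocation
           "jobB": {"cpu": 3.0, "mem": 1.0}}

def drf_allocation(capacities, demands):
    # dominant share consumed per unit of each job's allocation
    dom = {j: max(d[r] / capacities[r] for r in capacities) for j, d in demands.items()}
    # give every job the same dominant share x, chosen as large as capacities allow
    x = min(capacities[r] / sum(d[r] / dom[j] for j, d in demands.items())
            for r in capacities)
    return {j: round(x / dom[j], 2) for j in demands}   # units allocated per job

print(drf_allocation(capacities, demands))  # {'jobA': 30.0, 'jobB': 20.0}, CPU saturated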

Determining Leftover (Intra-Job Scheduling): Schedule only those tasks that must start running now for job Jk to complete within the next Tk duration, and altruistically donate the rest of the allocated resources for redistribution. This is done by performing a reverse/backward packing of tasks from Tk back to the current time.
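To illustrate the intuition, here is a minimal sketch under strong simplifying assumptions (one divisible resource, independent tasks of a single job): tasks are pushed as late as possible before the expected finish time Tk, so only tasks whose latest possible start time has already arrived must run now; the rest of the job's share is donated. Names and numbers are hypothetical, not CARBYNE's code.

# Sketch (assumptions: one divisible resource, independent tasks of one job).
# A task whose latest start time (deadline - duration) is still in the future
# can be postponed; only the others must run, and the unused share is donated.
def must_run_now(tasks, share, deadline, now=0.0):
    """tasks: list of (name, duration, demand); share: resources guaranteed to the job."""
    run_now, used = [], 0.0
    for name, dur, demand in tasks:
        if deadline - dur <= now:      # cannot be postponed any further
            run_now.append(name)
            used += demand
    return run_now, max(share - used, 0.0)

tasks = [("t1", 100.0, 2.0), ("t2", 40.0, 2.0), ("t3", 40.0, 2.0)]
print(must_run_now(tasks, share=6.0, deadline=100.0))  # (['t1'], 4.0): t2 and t3 can wait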

Redistribution (Leftover Scheduling): Leftover scheduling has two goals: minimizing average JCT by scheduling tasks from the jobs closest to completion (Shortest-Remaining-Time-First), and maximizing efficiency by packing as many of the remaining unscheduled tasks as possible.
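A minimal sketch of leftover redistribution under the same simplified single-resource model: jobs are visited in SRTF order and their pending tasks are packed greedily into the leftover capacity. All names and the greedy packing rule are illustrative assumptions, not the paper's procedure.

# Sketch: redistribute leftover capacity, visiting jobs in SRTF order and
# greedily packing their pending tasks (single divisible resource assumed).
def redistribute(leftover, jobs):
    """jobs: list of (remaining_time, [(task, demand), ...]) tuples."""
    scheduled = []
    for remaining_time, pending in sorted(jobs, key=lambda j: j[0]):  # SRTF order
        for task, demand in pending:
            if demand <= leftover:            # greedy packing into the leftover
                scheduled.append(task)
                leftover -= demand
    return scheduled, leftover

jobs = [(300.0, [("a1", 3.0), ("a2", 3.0)]),   # far from completion
        (50.0,  [("b1", 2.0), ("b2", 2.0)])]   # closest to completion
print(redistribute(leftover=5.0, jobs=jobs))   # (['b1', 'b2'], 1.0): a1, a2 do not fit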

Pseudocode (four figure slides; the algorithm listings were not transcribed).

From Offline to Online: The arrival of new jobs breaks the invariant required by the theorem. In practice, this has marginal impact, and only on a handful of jobs, because there are many parallel jobs to work with and an individual task's requirement is much smaller than the resources available. So, do we need to do anything more?

Other Considerations. Data locality: an altruistically delayed data-local task is likely to still find data locality when it is eventually scheduled. Straggler mitigation: CARBYNE is likely to prioritize speculative tasks during leftover scheduling because it selects jobs in SRTF order. Handling task failures: CARBYNE does not distinguish between new and restarted tasks, but it must recalculate the estimated completion time.

Outline: Implementation

Implementation Details: Enabling altruistic scheduling requires two key components. Local altruistic resource management: a module in each application must determine how many resources it can yield. Leftover resource management: a module reallocates the yielded resources. Both are implemented in Apache YARN.

Note about YARN/Tez: The scheduling machinery is split into three parts. Node Manager: runs on every machine; responsible for running tasks and reporting available resources. Job Manager (AM): runs on a few machines; holds job context information. Resource Manager: cluster-wide, usually runs on only one machine; assigns tasks to machines.

Implementation: RPC mechanism. We extended the Ask data structure: AsksDEFAULT carries tasks that the job must run now in order not to be slowed down, while AsksALTRUISTIC carries tasks that it may run if the job scheduler tries to use all of the allocated resources. Tez Job Manager (AM): implements the IntraJobScheduler procedure from the pseudocode and performs reverse packing to identify the minimum set of tasks that should run right now.
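Schematically, the extended ask could look like the following (the actual change extends YARN's Java resource-request records; this Python dataclass is only an illustration with hypothetical field names).

# Schematic illustration only (not YARN code): a job's asks split into
# must-run and altruistic parts, as sent from the AM to the RM.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskAsk:
    task_id: str
    cpu: float
    mem_gb: float

@dataclass
class JobAsks:
    asks_default: List[TaskAsk] = field(default_factory=list)     # must run now to avoid slowdown
    asks_altruistic: List[TaskAsk] = field(default_factory=list)  # may run on leftover resources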

Implementation: YARN's Resource Manager is triggered whenever an NM reports available resources. It periodically computes the jobs' DRF allocations and propagates them to the AMs, and it schedules task requests from the jobs' AsksDEFAULT, doing packing to reduce job completion time.
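A minimal, self-contained sketch of how an RM-side handler might process a node heartbeat under this split (hypothetical names, not YARN code): must-run asks are served first in the inter-job order, then altruistic asks are packed into whatever capacity remains.

# Hypothetical RM-side heartbeat handler (illustration only):
# serve must-run asks first, then pack altruistic asks into the leftover.
def on_node_heartbeat(free_cpu, free_mem, jobs):
    """jobs: list of dicts with 'asks_default' and 'asks_altruistic' lists of
    (task_id, cpu, mem_gb) tuples, already ordered by the inter-job policy (DRF)."""
    placements = []
    for phase in ("asks_default", "asks_altruistic"):    # must-run first, leftover second
        for job in jobs:
            remaining = []
            for task_id, cpu, mem in job[phase]:
                if cpu <= free_cpu and mem <= free_mem:  # greedy packing onto the node
                    placements.append(task_id)
                    free_cpu -= cpu
                    free_mem -= mem
                else:
                    remaining.append((task_id, cpu, mem))
            job[phase] = remaining
    return placements

jobs = [{"asks_default": [("t1", 2, 8)], "asks_altruistic": [("t2", 2, 8)]}]
print(on_node_heartbeat(free_cpu=4, free_mem=16, jobs=jobs))  # ['t1', 't2']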

Outline: Implementation Results

Workload: Public benchmarks and traces: TPC-DS, TPC-H, and BigBench; Microsoft traces with 30,000 job DAGs and millions of tasks; Facebook traces with 7,000 jobs and 650,000 tasks spanning six hours.

Methodology: Jobs are randomly chosen from one of the benchmarks. Arrivals follow a Poisson process with an average inter-arrival time of 20 seconds. Each experiment is run three times, and the median is reported.
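As a side note on this setup, a Poisson arrival process has exponentially distributed inter-arrival times, so such a workload could be generated with a sketch like the following (job count and seed are arbitrary).

# Sketch: arrival times of a Poisson process with mean inter-arrival time 20 s.
import itertools, random

random.seed(0)
arrivals = list(itertools.accumulate(random.expovariate(1 / 20.0) for _ in range(1000)))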

Setup. Cluster: 100 bare-metal servers; each machine has 20 cores, 128 GB of memory, a 128 GB SSD, and a 10 Gbps NIC. Simulator: a trace-driven simulator that replays job traces and mimics various aspects of the logs, handling jobs with different arrival times and dependencies.

Demand Estimation: CARBYNE relies on estimates of tasks' resource demands across CPU, memory, disk, and the network. It uses the history of prior runs for recurring jobs; manual annotation is an option as well.

Evaluation Metrics: Average job completion time, with Factor of Improvement = (duration in a compared approach) / (duration in CARBYNE); makespan, the total length of the schedule; and Jain's fairness index. For example, a job that takes 120 s under a compared approach and 100 s under CARBYNE gives a factor of improvement of 1.2.
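Jain's fairness index is not defined on the slide; the standard definition is J(x) = (sum x_i)^2 / (n * sum x_i^2), which is 1 for a perfectly even allocation and 1/n in the most skewed case. A minimal sketch:

# Jain's fairness index: 1.0 means perfectly equal shares, 1/n is the worst case.
def jain_index(shares):
    n = len(shares)
    return sum(shares) ** 2 / (n * sum(s * s for s in shares))

print(jain_index([1, 1, 1, 1]))   # 1.0
print(jain_index([4, 0, 0, 0]))   # 0.25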

Outline: Evaluation Results

Performance vs. Efficiency vs. Fairness (Offline Case)

Performance vs. Efficiency vs. Fairness (Online Case)

JCT Improvements Across Entire Workloads

Large Scale Simulation on Traces

Impact of Contention

Impact of Misestimation

Impact of Altruism

Better DAG Scheduler

Conclusion: A novel scheduling policy (altruism rather than greed) that significantly improves the secondary metrics. To further optimize, the paper suggests the use of other fair schedulers.