Altruistic Scheduling in Multi-Resource Clusters
Robert Grandl, University of Wisconsin—Madison; Mosharaf Chowdhury, University of Michigan; Aditya Akella, University of Wisconsin—Madison; Ganesh Ananthanarayanan, Microsoft. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16).
Outline
- Motivation
- Problem Statement
- Approach
- Design Details
- Implementation
- Evaluation Results
Scheduler
Resource scheduling remains a key building block of modern data-intensive clusters. A scheduler must:
- Deal with multiple resources
- Handle complex DAG structures
- Provide performance isolation
- Ensure high performance and efficiency
Scheduler
Current state-of-the-art algorithms do not optimize over all constraints. For our analysis, we compare the following schedulers:
- Dominant Resource Fairness (DRF) – for increasing multi-resource fairness
- Shortest Job First (SJF) – for minimizing average job completion time
- Tetris – for increasing average resource utilization
Each scheduler outperforms its counterparts only on its preferred metric and significantly underperforms on the secondary metrics.
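The three policies rank work by different criteria. A toy single-snapshot sketch of each policy's ordering rule (all job names, demand vectors, and share numbers below are made up for illustration, not taken from the paper):

```python
# Toy jobs: (name, remaining_time, cpu_demand, mem_demand)
jobs = [("A", 30, 0.6, 0.2), ("B", 10, 0.3, 0.8), ("C", 50, 0.5, 0.5)]
free = (0.9, 0.1)  # free (cpu, mem) fractions on a machine

# SJF: run the job with the shortest remaining time first
sjf_order = sorted(jobs, key=lambda j: j[1])

# Tetris: prefer the task whose demand vector best aligns with
# the free resources (dot product of demand and availability)
tetris_order = sorted(jobs, key=lambda j: -(j[2] * free[0] + j[3] * free[1]))

# DRF: allocate next to the job with the smallest dominant share
# (the max, over resources, of its current share of the cluster)
shares = {"A": (0.2, 0.1), "B": (0.1, 0.4), "C": (0.3, 0.2)}
drf_next = min(shares, key=lambda j: max(shares[j]))
```

With this snapshot, SJF picks B (shortest), Tetris picks A (best aligned with the CPU-rich machine), and DRF serves A next (lowest dominant share) – three different decisions from the same state.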
Optimizing Fairness, Performance, Efficiency
Problem Statement
Given a collection of jobs – along with information about individual tasks' expected multi-resource requirements, durations, and DAG dependencies – we must schedule them such that each job receives a fair share of cluster resources, jobs complete as fast as possible, and the schedule is work-conserving.
Can we build a scheduler that improves performance and efficiency without sacrificing performance isolation?
Goals for Scheduler
We expect such an ideal scheduler to satisfy the following goals:
- Fast completion: each DAG should complete as fast as possible
- Work conservation: available resources should not remain unused
- Starvation freedom: no DAG should starve for arbitrarily long periods
Offline DAG scheduling is NP-complete for each of the objectives – fairness, performance, and efficiency.
How Can We Do Better?
The goal is to ensure performance isolation and still be competitive on the secondary metrics. We observe two key characteristics of jobs:
- All-or-nothing characteristic
- User-perceived performance isolation
Key insight: greed is not always good – users don't care about short-term fair-share guarantees.
Approach
Relax the short-term fairness constraint: delay some tasks to make way for tasks that have a greater need to run now.
Altruistic approach – tasks voluntarily give up some resources, and the scheduler uses these leftover resources to:
- Schedule jobs closest to completion
- Pack the remaining tasks to maximize efficiency
Determining Altruism
At a high level, the approach is to determine and redistribute the leftover resources to improve efficiency. We first need to determine: how many resources can be considered leftover and used for altruistic scheduling?
We make observations in two categories of job properties:
- Stage-level observations – number of stages, barriers
- Path-level observations – length of the critical path, number of disjoint paths
Stage-Level Observations
The number of stages in a DAG provides an approximation of its complexity.
Path-Level Observations
The number of disjoint paths gives a measure of how many sequences of stages can run in parallel.
Correlating Properties and Altruism
Each DAG property is positively correlated with the opportunity for altruism:
- Number of barriers in a DAG: 0.75
- Number of stages in a DAG: 0.66
- Critical path length of a DAG: 0.71
- Number of disjoint paths in a DAG: 0.57
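Correlations like these are Pearson coefficients between a DAG property and a measure of how much a job can yield. A minimal sketch (the property values and leftover fractions below are made-up illustrative data, not the paper's measurements):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: stages per DAG vs. fraction of resources the job could yield
num_stages = [2, 4, 6, 9, 12]
leftover_fraction = [0.05, 0.18, 0.22, 0.35, 0.41]
r = pearson(num_stages, leftover_fraction)  # strongly positive for this data
```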
CARBYNE
A new scheduler with an altruistic, long-term approach:
- Jobs yield a fraction of their resources without impacting their completion times
- The leftover resources are used to improve secondary attributes
Approach
Theorem: altruism will not inflate any job's completion time in the offline case – i.e., unless new jobs arrive or existing jobs depart – for any inter-job scheduler.
Solution approach: develop an offline scheduler, then modify it to work in the online case.
Offline Altruistic Scheduling
Operates at three levels:
- How to perform inter-job scheduling to maximize the amount of leftover resources?
- How should an intra-job scheduler determine how much a job should contribute to leftover?
- How to redistribute the leftover across jobs?
Increasing Leftover (Inter-Job Scheduling)
Use a closed-form version of DRF for inter-job scheduling. It elongates individual job completion times the most, due to multi-resource fair-sharing considerations – so fair schedulers provide the most opportunities for altruistic scheduling.
Determining Leftover (Intra-Job Scheduling)
- Schedule only those tasks that must start running now for job Jk to complete within the next Tk duration
- Altruistically donate the rest of the resources for redistribution
- Perform a reverse/backward packing of tasks from Tk back to the current time
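Reverse packing can be pictured as scheduling backwards from the deadline: place each stage as late as possible, and any stage whose latest start is still in the future need not run now. A simplified sketch for a chain of dependent stages with a single resource (the helper name is ours, not the paper's):

```python
def must_run_now(stage_durations, Tk):
    """Backward-pack a chain of stages from deadline Tk toward time 0.
    Returns the indices of stages whose latest possible start is <= 0,
    i.e. the tasks that must start now for the job to finish by Tk."""
    t = Tk
    latest_start = []
    for d in reversed(stage_durations):   # pack from the deadline backwards
        t -= d
        latest_start.append(t)
    latest_start.reverse()                # latest_start[i] is for stage i
    return [i for i, s in enumerate(latest_start) if s <= 0]
```

For stages of length [5, 10, 5] and Tk = 20, only the first stage must start now; with a tighter Tk = 15, the first two must. Everything else can be donated as leftover.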
Redistribution (Leftover Scheduling)
Leftover scheduling has two goals:
- Minimizing the average JCT by scheduling tasks from jobs that are closest to completion, using Shortest-Remaining-Time-First (SRTF)
- Maximizing efficiency by packing as many unscheduled tasks as possible
Pseudocode
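The pseudocode figures did not survive extraction. A hedged Python sketch of the overall loop as described on the preceding slides – inter-job fair shares, intra-job donation, then SRTF leftover redistribution (all names and the equal-split stand-in for DRF are ours, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    remaining_time: float                           # estimated time to completion
    must_run: list = field(default_factory=list)    # demands of tasks that must start now
    runnable: list = field(default_factory=list)    # demands of optional extra tasks

def fair_share(jobs, capacity):
    """Placeholder for DRF: here, an equal split of capacity across jobs."""
    return {j.name: capacity / len(jobs) for j in jobs}

def intra_job(job, share):
    """A job altruistically keeps only what its must-run tasks need
    (found by backward packing in the paper) and donates the rest."""
    needed = sum(job.must_run)
    return needed, max(0.0, share - needed)

def schedule(jobs, capacity):
    shares = fair_share(jobs, capacity)
    used, leftover = {}, 0.0
    for j in jobs:
        needed, donated = intra_job(j, shares[j.name])
        used[j.name] = needed
        leftover += donated
    # Leftover scheduling: SRTF over jobs closest to completion,
    # packing small optional tasks first within each job
    for j in sorted(jobs, key=lambda j: j.remaining_time):
        for t in sorted(j.runnable):
            if t <= leftover:
                leftover -= t
                used[j.name] += t
    return used, leftover
```

With a capacity of 10 split between a short job A (must run 2, extras 1+1) and a longer job B (must run 4, extra 2), the donated 4 units go first to A's extras, then B's, leaving the cluster fully packed.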
From Offline to Online
The arrival of new jobs breaks the invariant of the theorem. In practice, this has marginal impact (and only on a handful of jobs), because:
- There are lots of parallel jobs to work with
- Individual task requirements are much smaller than the resources available
So we don't need to do anything more?
Other Considerations
- Data locality: an altruistically delayed data-local task is likely to find data locality again when it is eventually scheduled
- Straggler mitigation: CARBYNE is likely to prioritize speculative tasks during leftover scheduling because it selects jobs in SRTF order
- Handling task failures: CARBYNE does not distinguish between new and restarted tasks; it must recalculate the estimated completion time
Implementation Details
Enabling altruistic scheduling requires two key components:
- Local altruistic resource management: a module in each application must determine how many resources it can yield
- Leftover resource management: a module to reallocate the yielded resources
Implemented in Apache YARN.
Note about YARN/Tez
The scheduling procedure is divided into three parts:
- Node Manager: runs on every machine; responsible for running tasks and reporting available resources
- Job Manager: runs on a few machines; holds job context information
- Resource Manager: cluster-wide, runs on only one machine (usually); assigns tasks to machines
Implementation: RPC Mechanism and Tez Job Manager (AM)
We extended the Ask data structure:
- AsksDEFAULT – for tasks that a job must run in order not to be slowed down
- AsksALTRUISTIC – for tasks that it may run if the job scheduler tries to use all the allocated resources
The Tez Job Manager (AM) implements the IntraJobScheduler procedure from the pseudocode: it does reverse packing to identify the minimum set of tasks that should run right now.
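The split Ask structure can be pictured as follows – a Python sketch with hypothetical field names, whereas the actual implementation extends YARN's Java resource-request objects:

```python
from dataclasses import dataclass, field

@dataclass
class Resources:
    cpu: int = 0        # cores
    memory: int = 0     # MB

@dataclass
class Ask:
    """An AM's resource request, split by urgency as in CARBYNE."""
    asks_default: list = field(default_factory=list)     # must run now, or the job slows down
    asks_altruistic: list = field(default_factory=list)  # may run if resources remain

    def must_have(self):
        """Total resources needed to avoid inflating the job's completion time."""
        return Resources(
            cpu=sum(r.cpu for r in self.asks_default),
            memory=sum(r.memory for r in self.asks_default),
        )
```

The Resource Manager can satisfy asks_default first and treat asks_altruistic as candidates for leftover scheduling.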
Implementation: YARN's Resource Manager
- Triggered whenever an NM reports available resources
- Periodically computes jobs' DRF allocations and propagates them to the AMs
- Schedules task requests from jobs' AsksDEFAULT, using packing to reduce job completion time
Workload
- Public benchmarks: traces from TPC-DS, TPC-H, and BigBench
- Microsoft: job DAGs with millions of tasks
- Facebook: jobs with 650,000 tasks spanning six hours
Methodology
- Jobs are randomly chosen from one of the benchmarks
- Arrival rate follows a Poisson distribution, with an average inter-arrival time of 20 seconds
- Each experiment is run three times, and the median is presented
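Poisson arrivals with a 20-second mean inter-arrival time can be generated by drawing exponentially distributed gaps, e.g. (a generic sketch, not the paper's harness):

```python
import random

def arrival_times(n_jobs, mean_gap=20.0, seed=42):
    """Job arrival timestamps (seconds) for a Poisson process:
    inter-arrival gaps are i.i.d. exponential with the given mean."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(1.0 / mean_gap)  # exponential gap, mean 20 s
        times.append(t)
    return times
```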
Setup
- Cluster: 100 bare-metal servers; each machine has 20 cores, 128 GB of memory, a 128 GB SSD, and a 10 Gbps NIC
- Simulator: replays job traces and mimics various aspects of the logs, handling jobs with different arrival times and dependencies
Demand Estimation
- CARBYNE relies on estimates of tasks' resource demands across CPU, memory, disk, and the network
- It uses the history of prior runs for recurring jobs
- Manual annotation is an option as well
Evaluation Metrics
- Average job completion time, reported as Factor of Improvement = (duration in an approach) / (duration in CARBYNE)
- Makespan – the total length of the schedule
- Jain's fairness index
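The two derived metrics are straightforward to compute; a small helper sketch:

```python
def factor_of_improvement(duration_other, duration_carbyne):
    """> 1 means CARBYNE finished the job faster than the other approach."""
    return duration_other / duration_carbyne

def jains_index(allocations):
    """Jain's fairness index: 1.0 for perfectly equal allocations,
    falling toward 1/n as one user dominates."""
    n = len(allocations)
    total = sum(allocations)
    return total * total / (n * sum(x * x for x in allocations))
```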
Evaluation Results
Performance vs. Efficiency vs. Fairness (Offline Case)
Performance vs. Efficiency vs. Fairness (Online Case)
JCT Improvements Across Entire Workloads
Large Scale Simulation on Traces
Impact of Contention
Impact of Misestimation
Impact of Altruism
Better DAG Scheduler
Conclusion
- A novel scheduling policy (greedy vs. altruistic)
- Significantly improves secondary metrics
- To optimize further, the paper suggests the use of other fair schedulers