Altruistic Scheduling in Multi-Resource Clusters
Robert Grandl, University of Wisconsin—Madison; Mosharaf Chowdhury, University of Michigan; Aditya Akella, University of Wisconsin—Madison; Ganesh Ananthanarayanan, Microsoft. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16).
Outline
- Motivation
- Problem Statement
- Approach
- Design Details
- Implementation
- Evaluation Results
Scheduler
Resource scheduling remains a key building block of modern data-intensive clusters. A scheduler must:
- Deal with multiple resources
- Handle complex DAG structures
- Provide performance isolation
- Ensure high performance and efficiency
Scheduler
Current state-of-the-art algorithms do not optimize over all constraints. For our analysis, we compare the following schedulers:
- Dominant Resource Fairness (DRF) – for increasing multi-resource fairness
- Shortest Job First (SJF) – for minimizing average job completion time
- Tetris – for increasing average resource utilization
Each scheduler outperforms its counterparts only on its preferred metric and significantly underperforms on the secondary metrics.
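The three policies rank work by different criteria. A toy single-snapshot sketch of each policy's ordering rule (all job names, demand vectors, and share numbers below are made up for illustration, not taken from the paper):

```python
# Toy jobs: (name, remaining_time, cpu_demand, mem_demand)
jobs = [("A", 30, 0.6, 0.2), ("B", 10, 0.3, 0.8), ("C", 50, 0.5, 0.5)]
free = (0.9, 0.1)  # free (cpu, mem) fractions on a machine

# SJF: run the job with the shortest remaining time first
sjf_order = sorted(jobs, key=lambda j: j[1])

# Tetris: prefer the task whose demand vector best aligns with
# the free resources (dot product of demand and availability)
tetris_order = sorted(jobs, key=lambda j: -(j[2] * free[0] + j[3] * free[1]))

# DRF: allocate next to the job with the smallest dominant share
# (the max, over resources, of its current share of the cluster)
shares = {"A": (0.2, 0.1), "B": (0.1, 0.4), "C": (0.3, 0.2)}
drf_next = min(shares, key=lambda j: max(shares[j]))
```

With this snapshot, SJF picks B (shortest), Tetris picks A (best aligned with the CPU-rich machine), and DRF serves A next (lowest dominant share) – three different decisions from the same state.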
Optimizing Fairness, Performance, Efficiency
Problem Statement
Given a collection of jobs – along with information about individual tasks' expected multi-resource requirements, durations, and DAG dependencies – we must schedule them such that each job receives a fair share of cluster resources, jobs complete as fast as possible, and the schedule is work-conserving.
Can we build a scheduler that improves performance and efficiency without sacrificing performance isolation?
Goals for Scheduler
We expect such an ideal scheduler to satisfy the following goals:
- Fast completion: each DAG should complete as fast as possible
- Work conservation: available resources should not remain unused
- Starvation freedom: no DAG should starve for arbitrarily long periods
Offline DAG scheduling is NP-complete for each of the objectives – fairness, performance, and efficiency.
How Can We Do Better?
The goal is to ensure performance isolation and still be competitive on the secondary metrics. We observe two key characteristics of jobs:
- All-or-nothing characteristic
- User-perceived performance isolation
Key insight: greed is not always good – users don't care about short-term fair-share guarantees.
Approach
Relax the short-term fairness constraint: delay some tasks to make way for tasks that have a greater need to run now.
Altruistic approach – tasks voluntarily give up some resources, and the scheduler uses these leftover resources to:
- Schedule jobs closest to completion
- Pack the remaining tasks to maximize efficiency
Determining Altruism
At a high level, the approach is to determine and redistribute the leftover resources to improve efficiency. We first need to determine: how many resources can be considered leftover and used for altruistic scheduling?
We make observations in two categories of job properties:
- Stage-level observations – number of stages, barriers
- Path-level observations – length of the critical path, number of disjoint paths
Stage-Level Observations
The number of stages in a DAG provides an approximation of its complexity.
Path-Level Observations
The number of disjoint paths gives a measure of how many sequences of stages can run in parallel.
Correlating Properties and Altruism
Each DAG property is positively correlated with the opportunity for altruism:
- Number of barriers in a DAG: 0.75
- Number of stages in a DAG: 0.66
- Critical path length of a DAG: 0.71
- Number of disjoint paths in a DAG: 0.57
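Correlations like these are Pearson coefficients between a DAG property and a measure of how much a job can yield. A minimal sketch (the property values and leftover fractions below are made-up illustrative data, not the paper's measurements):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: stages per DAG vs. fraction of resources the job could yield
num_stages = [2, 4, 6, 9, 12]
leftover_fraction = [0.05, 0.18, 0.22, 0.35, 0.41]
r = pearson(num_stages, leftover_fraction)  # strongly positive for this data
```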
CARBYNE
A new scheduler with an altruistic, long-term approach:
- Jobs yield a fraction of their resources without impacting their completion times
- The leftover resources are used to improve secondary attributes
Approach
Theorem: altruism will not inflate any job's completion time in the offline case – i.e., unless new jobs arrive or existing jobs depart – for any inter-job scheduler.
Solution approach: develop an offline scheduler, then modify it to work in the online case.
Offline Altruistic Scheduling
Operates at three levels:
- How to perform inter-job scheduling to maximize the amount of leftover resources?
- How should an intra-job scheduler determine how much a job should contribute to leftover?
- How to redistribute the leftover across jobs?
Increasing Leftover (Inter-Job Scheduling)
Use a closed-form version of DRF for inter-job scheduling. It elongates individual job completion times the most, due to multi-resource fair-sharing considerations – so fair schedulers provide the most opportunities for altruistic scheduling.
Determining Leftover (Intra-Job Scheduling)
- Schedule only those tasks that must start running now for job Jk to complete within the next Tk duration
- Altruistically donate the rest of the resources for redistribution
- Perform a reverse/backward packing of tasks from Tk back to the current time
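Reverse packing can be pictured as scheduling backwards from the deadline: place each stage as late as possible, and any stage whose latest start is still in the future need not run now. A simplified sketch for a chain of dependent stages with a single resource (the helper name is ours, not the paper's):

```python
def must_run_now(stage_durations, Tk):
    """Backward-pack a chain of stages from deadline Tk toward time 0.
    Returns the indices of stages whose latest possible start is <= 0,
    i.e. the tasks that must start now for the job to finish by Tk."""
    t = Tk
    latest_start = []
    for d in reversed(stage_durations):   # pack from the deadline backwards
        t -= d
        latest_start.append(t)
    latest_start.reverse()                # latest_start[i] is for stage i
    return [i for i, s in enumerate(latest_start) if s <= 0]
```

For stages of length [5, 10, 5] and Tk = 20, only the first stage must start now; with a tighter Tk = 15, the first two must. Everything else can be donated as leftover.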
Redistribution (Leftover Scheduling)
Leftover scheduling has two goals:
- Minimizing the average JCT by scheduling tasks from jobs that are closest to completion, using Shortest-Remaining-Time-First (SRTF)
- Maximizing efficiency by packing as many unscheduled tasks as possible
Pseudocode
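The pseudocode figures did not survive extraction. A hedged Python sketch of the overall loop as described on the preceding slides – inter-job fair shares, intra-job donation, then SRTF leftover redistribution (all names and the equal-split stand-in for DRF are ours, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    remaining_time: float                           # estimated time to completion
    must_run: list = field(default_factory=list)    # demands of tasks that must start now
    runnable: list = field(default_factory=list)    # demands of optional extra tasks

def fair_share(jobs, capacity):
    """Placeholder for DRF: here, an equal split of capacity across jobs."""
    return {j.name: capacity / len(jobs) for j in jobs}

def intra_job(job, share):
    """A job altruistically keeps only what its must-run tasks need
    (found by backward packing in the paper) and donates the rest."""
    needed = sum(job.must_run)
    return needed, max(0.0, share - needed)

def schedule(jobs, capacity):
    shares = fair_share(jobs, capacity)
    used, leftover = {}, 0.0
    for j in jobs:
        needed, donated = intra_job(j, shares[j.name])
        used[j.name] = needed
        leftover += donated
    # Leftover scheduling: SRTF over jobs closest to completion,
    # packing small optional tasks first within each job
    for j in sorted(jobs, key=lambda j: j.remaining_time):
        for t in sorted(j.runnable):
            if t <= leftover:
                leftover -= t
                used[j.name] += t
    return used, leftover
```

With a capacity of 10 split between a short job A (must run 2, extras 1+1) and a longer job B (must run 4, extra 2), the donated 4 units go first to A's extras, then B's, leaving the cluster fully packed.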
From Offline to Online
The arrival of new jobs breaks the invariant of the theorem. In practice, this has marginal impact (and only on a handful of jobs), because:
- There are lots of parallel jobs to work with
- Individual task requirements are much smaller than the resources available
So we don't need to do anything more?
Other Considerations
- Data locality: an altruistically delayed data-local task is likely to find data locality again when it is eventually scheduled
- Straggler mitigation: CARBYNE is likely to prioritize speculative tasks during leftover scheduling because it selects jobs in SRTF order
- Handling task failures: CARBYNE does not distinguish between new and restarted tasks; it must recalculate the estimated completion time
Implementation Details
Enabling altruistic scheduling requires two key components:
- Local altruistic resource management: a module in each application must determine how many resources it can yield
- Leftover resource management: a module to reallocate the yielded resources
Implemented in Apache YARN.
Note about YARN/Tez
The scheduling procedure is divided into three parts:
- Node Manager: runs on every machine; responsible for running tasks and reporting available resources
- Job Manager: runs on a few machines; holds job context information
- Resource Manager: cluster-wide, runs on only one machine (usually); assigns tasks to machines
Implementation: RPC Mechanism and Tez Job Manager (AM)
We extended the Ask data structure:
- AsksDEFAULT – for tasks that a job must run in order not to be slowed down
- AsksALTRUISTIC – for tasks that it may run if the job scheduler tries to use all the allocated resources
The Tez Job Manager (AM) implements the IntraJobScheduler procedure from the pseudocode: it does reverse packing to identify the minimum set of tasks that should run right now.
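The split Ask structure can be pictured as follows – a Python sketch with hypothetical field names, whereas the actual implementation extends YARN's Java resource-request objects:

```python
from dataclasses import dataclass, field

@dataclass
class Resources:
    cpu: int = 0        # cores
    memory: int = 0     # MB

@dataclass
class Ask:
    """An AM's resource request, split by urgency as in CARBYNE."""
    asks_default: list = field(default_factory=list)     # must run now, or the job slows down
    asks_altruistic: list = field(default_factory=list)  # may run if resources remain

    def must_have(self):
        """Total resources needed to avoid inflating the job's completion time."""
        return Resources(
            cpu=sum(r.cpu for r in self.asks_default),
            memory=sum(r.memory for r in self.asks_default),
        )
```

The Resource Manager can satisfy asks_default first and treat asks_altruistic as candidates for leftover scheduling.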
Implementation: YARN's Resource Manager
- Triggered whenever an NM reports available resources
- Periodically computes jobs' DRF allocations and propagates them to the AMs
- Schedules task requests from jobs' AsksDEFAULT, using packing to reduce job completion time
Workload
- Public benchmarks: traces from TPC-DS, TPC-H, and BigBench
- Microsoft: job DAGs with millions of tasks
- Facebook: jobs with 650,000 tasks spanning six hours
Methodology
- Jobs are randomly chosen from one of the benchmarks
- Arrival rate follows a Poisson distribution, with an average inter-arrival time of 20 seconds
- Each experiment is run three times, and the median is presented
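Poisson arrivals with a 20-second mean inter-arrival time can be generated by drawing exponentially distributed gaps, e.g. (a generic sketch, not the paper's harness):

```python
import random

def arrival_times(n_jobs, mean_gap=20.0, seed=42):
    """Job arrival timestamps (seconds) for a Poisson process:
    inter-arrival gaps are i.i.d. exponential with the given mean."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(1.0 / mean_gap)  # exponential gap, mean 20 s
        times.append(t)
    return times
```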
Setup
- Cluster: 100 bare-metal servers; each machine has 20 cores, 128 GB of memory, a 128 GB SSD, and a 10 Gbps NIC
- Simulator: replays job traces and mimics various aspects of the logs, handling jobs with different arrival times and dependencies
Demand Estimation
- CARBYNE relies on estimates of tasks' resource demands across CPU, memory, disk, and the network
- It uses the history of prior runs for recurring jobs
- Manual annotation is an option as well
Evaluation Metrics
- Average job completion time, reported as Factor of Improvement = (duration in an approach) / (duration in CARBYNE)
- Makespan – the total length of the schedule
- Jain's fairness index
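The two derived metrics are straightforward to compute; a small helper sketch:

```python
def factor_of_improvement(duration_other, duration_carbyne):
    """> 1 means CARBYNE finished the job faster than the other approach."""
    return duration_other / duration_carbyne

def jains_index(allocations):
    """Jain's fairness index: 1.0 for perfectly equal allocations,
    falling toward 1/n as one user dominates."""
    n = len(allocations)
    total = sum(allocations)
    return total * total / (n * sum(x * x for x in allocations))
```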
Evaluation Results
Performance vs. Efficiency vs. Fairness (Offline Case)
Performance vs. Efficiency vs. Fairness (Online Case)
JCT Improvements Across Entire Workloads
Large Scale Simulation on Traces
Impact of Contention
Impact of Misestimation
Impact of Altruism
Better DAG Scheduler
Conclusion
- A novel scheduling policy (greedy vs. altruistic)
- Significantly improves secondary metrics
- To optimize further, the paper suggests the use of other fair schedulers