Altruistic Scheduling in Multi-Resource Clusters

1 Altruistic Scheduling in Multi-Resource Clusters
Robert Grandl, University of Wisconsin–Madison; Mosharaf Chowdhury, University of Michigan; Aditya Akella, University of Wisconsin–Madison; Ganesh Ananthanarayanan, Microsoft. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16)

2 Outline (Motivation): Problem Statement, Approach, Design Details, Implementation, Evaluation, Results

3 Scheduler
Resource scheduling remains a key building block of modern data-intensive clusters. A scheduler must:
Deal with multiple resources
Handle complex DAG structures
Provide performance isolation
Ensure performance
Ensure high efficiency

4 Scheduler
Current state-of-the-art algorithms do not optimize over all constraints. For our analysis, we compare the following schedulers:
Dominant Resource Fairness (DRF): for increasing multi-resource fairness
Shortest Job First (SJF): for minimizing average job completion time
Tetris: for increasing average resource utilization
Each scheduler outperforms its counterparts only on its preferred metric and significantly underperforms on the secondary metrics
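To make the DRF comparison concrete, here is a minimal, illustrative sketch of DRF's core rule (repeatedly serve the user with the lowest dominant share). The cluster capacity and per-task demands are made up for illustration; this is not the paper's implementation.

```python
# Hedged sketch of Dominant Resource Fairness (DRF): toy cluster, toy
# demands, and a simplification (stop entirely once the chosen user's
# next task no longer fits).
def dominant_share(allocation, capacity):
    """A user's dominant share is the max fraction it holds of any resource."""
    return max(allocation[r] / capacity[r] for r in capacity)

def drf_allocate(demands, capacity, max_rounds=1000):
    """Repeatedly grant one task to the user with the lowest dominant share."""
    alloc = {u: {r: 0.0 for r in capacity} for u in demands}
    used = {r: 0.0 for r in capacity}
    for _ in range(max_rounds):
        user = min(demands, key=lambda u: dominant_share(alloc[u], capacity))
        need = demands[user]
        if any(used[r] + need[r] > capacity[r] for r in capacity):
            break  # simplification: stop once the picked user cannot fit
        for r in capacity:
            alloc[user][r] += need[r]
            used[r] += need[r]
    return alloc
```

With the classic example from the DRF literature (9 CPUs, 18 GB; user A needs <1 CPU, 4 GB> per task, user B needs <3 CPUs, 1 GB>), this sketch ends with A running 3 tasks and B running 2.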

5 Optimizing Fairness, Performance, Efficiency

6 Outline (Problem Statement): Motivation, Approach, Design Details, Implementation, Evaluation, Results

7 Problem Statement Given a collection of jobs – along with information about individual tasks’ expected multi-resource requirements, durations, and DAG dependencies – we must schedule them such that each job receives a fair share of cluster resources, jobs complete as fast as possible, and the schedule is work-conserving. Can we build a scheduler that improves performance and efficiency without sacrificing performance isolation?

8 Goals for Scheduler
An ideal scheduler should satisfy the following goals:
Fast completion: each DAG should complete as fast as possible
Work conservation: available resources should not remain unused
Starvation freedom: no DAG should starve for arbitrarily long periods
However, offline DAG scheduling is NP-complete for each of the objectives: fairness, performance, and efficiency

9 Outline (Approach): Motivation, Problem Statement, Design Details, Implementation, Evaluation, Results

10 How Can We Do Better?
The goal is to ensure performance isolation while remaining competitive on the secondary metrics. We observe two key characteristics of jobs:
All-or-nothing characteristic
User-perceived performance isolation
Key insight: greed is not always good; users do not care about short-term fair-share guarantees

11 Approach
Relax the short-term fairness constraint: delay some tasks to make way for tasks that have a greater need to run now
Altruistic approach: tasks voluntarily give up some resources
The scheduler uses these leftover resources to:
Schedule jobs closest to completion
Pack the remaining tasks to maximize efficiency

12 Determining Altruism
At a high level, the approach is to determine and redistribute the leftover resources to improve efficiency. We first need to determine how many resources can be considered leftover and used for altruistic scheduling. We make observations in two categories of job properties:
Stage-level observations: number of stages, barriers
Path-level observations: length of the critical path, number of disjoint paths

13 Stage Level Observations
The number of stages in a DAG provides an approximation of its complexity

14 Path Level Observations
The number of disjoint paths gives a measure of how many sequences of stages can run in parallel

15 Correlating Properties and Altruism
Each DAG property has a positive correlation with the opportunity for altruism:
Number of barriers: 0.75
Number of stages: 0.66
Critical path length: 0.71
Number of disjoint paths: 0.57
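For reference, coefficients like these are typically Pearson correlations between a DAG property and the measured altruism opportunity. A minimal sketch of the computation, on made-up inputs (not the paper's data):

```python
# Hedged sketch: Pearson correlation between two series, e.g. a DAG
# property (number of stages) vs. observed altruism opportunity.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```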

16 Outline (Design Details): Motivation, Problem Statement, Approach, Implementation, Evaluation, Results

17 CARBYNE
A new scheduler with an altruistic, long-term approach
Jobs yield a fraction of their resources without impacting their own completion times
The leftover resources are used to improve the secondary metrics

18 Approach
Theorem: altruism will not inflate any job's completion time in the offline case (i.e., unless new jobs arrive or existing jobs depart) for any inter-job scheduler
Solution approach: develop an offline scheduler, then modify it to work in the online case

19 Offline Altruistic Scheduling
Operates at three levels:
How should inter-job scheduling be performed to maximize the amount of leftover resources?
How should an intra-job scheduler determine how much a job contributes to the leftover?
How should the leftover be redistributed across jobs?

20 Increasing Leftover (Inter-Job Scheduling)
Use a closed-form version of DRF for inter-job scheduling
DRF elongates individual job completion times the most, due to its multi-resource, fair-sharing considerations
Consequently, fair schedulers provide the most opportunities for altruistic scheduling

21 Determining Leftover (Intra-Job Scheduling)
Schedule only those tasks that must start running for job Jk to complete within the next Tk duration
Altruistically donate the rest of the resources for redistribution
Perform a reverse/backward packing of tasks from Tk back to the current time

22 Redistribution (Leftover Scheduling)
Leftover scheduling has two goals:
Minimizing the average JCT by scheduling tasks from jobs that are closest to completion, using Shortest-Remaining-Time-First (SRTF)
Maximizing efficiency by packing as many unscheduled tasks as possible
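A toy sketch of the two goals combined: visit jobs in SRTF order and pack each job's next task into the leftover pool if it fits. The tuple layout and single-task-per-job simplification are assumptions for illustration.

```python
# Hedged sketch of leftover redistribution: SRTF order plus best-effort
# packing into whatever leftover resources remain.
def leftover_schedule(jobs, leftover):
    """jobs: list of (remaining_time, task_demand) pairs.
    leftover: mutable dict of free resources (consumed in place).
    Returns job indices granted leftover resources, in SRTF order."""
    scheduled = []
    # Shortest-Remaining-Time-First: favor jobs closest to completion.
    for i, (remaining, demand) in sorted(enumerate(jobs),
                                         key=lambda e: e[1][0]):
        # Pack the task only if it fits in every resource dimension.
        if all(demand[r] <= leftover[r] for r in leftover):
            for r in leftover:
                leftover[r] -= demand[r]
            scheduled.append(i)
    return scheduled
```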

23-26 Pseudocode (shown as figures in the original deck; not included in this transcript)

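The pseudocode slides are not reproduced in this transcript; the following is a hypothetical Python rendering of one scheduling round at the three levels described earlier (inter-job fair share, intra-job donation, leftover redistribution in SRTF order). The job-dictionary layout and all names are assumptions for illustration, not CARBYNE's actual pseudocode.

```python
# Hedged, high-level sketch of one three-level scheduling round.
def schedule_step(jobs, capacity):
    """jobs: {name: {"share": {...}, "needed_now": {...},
                     "extra_want": {...}, "remaining": t}}.
    Returns (assignments, leftover) for this round."""
    leftover = dict(capacity)
    assignments = {}
    # Levels 1-2: each job is offered its fair (e.g. DRF) share but takes
    # only what it needs to finish on time, donating the rest.
    for name, job in jobs.items():
        take = {r: min(job["share"][r], job["needed_now"][r])
                for r in capacity}
        assignments[name] = take
        for r in capacity:
            leftover[r] -= take[r]
    # Level 3: redistribute the donated leftover in SRTF order, favoring
    # jobs closest to completion.
    for name, job in sorted(jobs.items(), key=lambda kv: kv[1]["remaining"]):
        extra = {r: min(leftover[r], job["extra_want"][r]) for r in capacity}
        for r in capacity:
            assignments[name][r] += extra[r]
            leftover[r] -= extra[r]
    return assignments, leftover
```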
27 From Offline to Online
The arrival of new jobs breaks the invariant required by the Theorem. In practice, this has a marginal impact (and only on a handful of jobs), because:
There are lots of parallel jobs to work with
Individual task requirements are much smaller than the available resources
So, we don't need to do anything more

28 Other Considerations
Data locality: an altruistically delayed data-local task is likely to still find data locality when it is eventually scheduled
Straggler mitigation: CARBYNE is likely to prioritize speculative tasks during leftover scheduling because it selects jobs in SRTF order
Handling task failures: CARBYNE does not distinguish between new and restarted tasks; it must recalculate the estimated completion time

29 Outline (Implementation): Motivation, Problem Statement, Approach, Design Details, Evaluation, Results

30 Implementation Details
Enabling altruistic scheduling requires two key components:
Local altruistic resource management: a module in each application determines how many resources it can yield
Leftover resource management: a module reallocates the yielded resources
Implemented in Apache YARN

31 Note about YARN/Tez
The scheduling procedure is split into three parts:
Node Manager: runs on every machine; responsible for running tasks and reporting available resources
Job Manager (AM): runs on a few machines; holds job context information
Resource Manager: cluster-wide, usually runs on only one machine; assigns tasks to machines

32 Implementation: RPC Mechanism and Tez Job Manager (AM)
We extended the Ask data structure:
AsksDEFAULT: tasks that the job must run now in order not to be slowed down
AsksALTRUISTIC: tasks that the job may run if the scheduler tries to use all the allocated resources
The Tez Job Manager (AM):
Implements the IntraJobScheduler procedure from the pseudocode
Performs reverse packing to identify the minimum set of tasks that must run now
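The split Ask structure can be rendered as a small sketch. The real implementation lives in YARN/Tez (Java), so the field names and layout below are assumptions for illustration only.

```python
# Hypothetical rendering of the extended Ask structure: requests split
# into "default" asks (must run now) and "altruistic" asks (deferrable).
from dataclasses import dataclass, field

@dataclass
class Ask:
    demand: dict        # per-resource demand, e.g. {"cpu": 2, "mem": 4}
    priority: int = 0   # illustrative field, not from the paper

@dataclass
class JobAsks:
    asks_default: list = field(default_factory=list)     # must run now
    asks_altruistic: list = field(default_factory=list)  # may be deferred
```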

33 Implementation: YARN's Resource Manager
Triggered whenever an NM reports available resources
Periodically computes jobs' DRF allocations and propagates them to the AMs
Schedules task requests from jobs' AsksDEFAULT, packing to reduce job completion time

34 Outline (Evaluation): Motivation, Problem Statement, Approach, Design Details, Implementation, Results

35 Workload
Public benchmarks: traces from TPC-DS, TPC-H, and BigBench
Microsoft job DAGs with millions of tasks
Facebook jobs with 650,000 tasks spanning six hours

36 Methodology
Jobs are randomly chosen from one of the benchmarks
Arrivals follow a Poisson distribution with an average inter-arrival time of 20 seconds
Each experiment is run three times, and the median is presented

37 Setup
Cluster: 100 bare-metal servers; each machine has 20 cores, 128 GB of memory, a 128 GB SSD, and a 10 Gbps NIC
Simulator: replays job traces and mimics various aspects of the logs, handling jobs with different arrival times and dependencies

38 Demand Estimation
CARBYNE relies on estimates of tasks' resource demands across CPU, memory, disk, and the network
It uses the history of prior runs for recurring jobs; manual annotation is an option as well

39 Evaluation Metrics
Average Job Completion Time (JCT)
Factor of Improvement = (duration in an approach) / (duration in CARBYNE)
Makespan: the total length of the schedule
Jain's Fairness Index
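The two derived metrics can be sketched directly; the function names are illustrative, but the formulas follow the slide's definition of factor of improvement and the standard definition of Jain's fairness index.

```python
# Hedged sketch of the evaluation metrics named on the slide.
def factor_of_improvement(duration_other, duration_carbyne):
    """> 1 means CARBYNE completed the workload faster than the other
    approach; e.g. 2.0 means twice as fast."""
    return duration_other / duration_carbyne

def jains_fairness_index(allocations):
    """Jain's index: 1.0 for perfectly equal allocations; 1/n at worst."""
    n = len(allocations)
    return sum(allocations) ** 2 / (n * sum(x * x for x in allocations))
```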

40 Outline (Results): Motivation, Problem Statement, Approach, Design Details, Implementation, Evaluation

41 Performance vs. Efficiency vs. Fairness
(Offline Case)

42 Performance vs. Efficiency vs. Fairness
(Online Case)

43 JCT Improvements Across Entire Workloads

44 Large Scale Simulation on Traces

45 Impact of Contention

46 Impact of Misestimation

47 Impact of Altruism

48 Better DAG Scheduler

49 Conclusion
A novel scheduling policy (greedy vs. altruistic)
Significantly improves the secondary metrics
To further optimize, the paper suggests the use of other fair schedulers

