Packing Tasks with Dependencies


Packing Tasks with Dependencies
Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni
University of Wisconsin, Madison and Microsoft

The Cluster Scheduling Problem
Goal: match tasks to resources to achieve
- High cluster utilization
- Fast job completion
- Guarantees (deadlines, fair shares)
Constraint: scale → scheduling decisions must be fast ("fast-twitch")
Large and high-value deployments, e.g., Spark, Yarn*, Mesos*, Cosmos
Today, schedulers are simple and (as we show) performance can improve a lot

Jobs have heterogeneous DAGs
User queries → query optimizer → job DAG (Dryad, Spark-SQL, Hive, …)
- DAGs have deep and complex structures
- Task durations range from <1s to >100s
- Tasks use different amounts of resources
[Figure: example DAG. Legend: node size = number of tasks (1 to 1000); node shading = task duration (0s to 200s); edges either need a shuffle or can be local.]

Challenges in scheduling heterogeneous DAGs (n tasks, d resources)
- CPSched (critical-path scheduling) cannot overlap tasks with complementary demands
- Packers [1] do not handle dependencies
[Figure: example DAG; tasks 0–5 are annotated with (duration, demand vector), with durations εT, T, (1+2ε)T, and (1+5ε)T and demand vectors such as (0.84, ε), (0.85, ε), (0.6, 0.3), (0.2, 0.5), (0.2, 0.6), (0.2, 0.1).]
[1] Tetris: Multi-resource Packing for Cluster Schedulers, SIGCOMM'14

Challenges in scheduling heterogeneous DAGs (continued)
- Simple heuristics lead to poor schedules: production DAGs run roughly 50% slower than lower bounds
- Even simple variants of "packing dependent tasks" are NP-hard problems
- Prior analytical solutions miss practical concerns: multiple resources; complex dependencies; machine-level fragmentation; scale; online arrivals; …

Goal: given an annotated DAG and available resources, compute a good schedule, and do so under a practical model

Main ideas for one DAG
Existing schedulers: a task becomes schedulable only after all of its parents have finished
Graphene: identifies troublesome tasks, places them first, and places the other tasks around the trouble
[Figure: schedules drawn as resources vs. time, for existing schedulers and for Graphene.]
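To fix notation for the sketches that follow, here is a minimal DAG model. This is our own illustration, not Graphene's actual data structures: each task carries an estimated duration, a multi-resource demand vector, and its parents. The runnable() helper captures what existing schedulers dispatch from; Graphene's offline pass is not limited to this set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    duration: float            # estimated runtime, e.g. seconds
    demands: tuple             # e.g. (cpu_fraction, mem_fraction)
    parents: frozenset = frozenset()

def runnable(tasks, finished):
    """Tasks whose parents have all finished -- the only tasks existing
    schedulers consider; Graphene instead plans troublesome tasks first."""
    return [t for t in tasks if t not in finished and t.parents <= finished]
```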

Does placing troublesome tasks first help?
Revisit the example DAG.
[Figure: timeline in which the long-running tasks 1, 3, and 4 are placed first and tasks 0, 2, and 5 are packed around them.]
If troublesome tasks = long-running tasks, Graphene → OPT

How to choose troublesome tasks T?
The optimal choice is intractable (recall: NP-hard)
Graphene: T = tasks with duration longer than l and/or stage fragmentation score ≥ f
- Vary l and f; run BuildSchedule(T) for each choice; pick the most compact schedule
Extensions: explore different choices of T in parallel; recurse; memoize; …
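A minimal sketch of the threshold sweep, under the Task model above. Here build_schedule stands in for the paper's BuildSchedule (place T first, pack the rest around it, return an object with a makespan); the cutoff grids and the frag_score proxy are our illustrative assumptions, not Graphene's exact definitions.

```python
from itertools import product

def frag_score(task):
    # Hypothetical stand-in for the stage fragmentation score: tasks
    # with skewed demands tend to leave unusable slivers on machines.
    return max(task.demands) - min(task.demands)

def pick_troublesome(tasks, build_schedule,
                     duration_cutoffs=(10.0, 30.0, 60.0),
                     frag_cutoffs=(0.5, 0.7, 0.9)):
    best, best_makespan = None, float("inf")
    for l, f in product(duration_cutoffs, frag_cutoffs):
        # T = long tasks and/or fragmenting tasks for this (l, f) choice
        T = {t for t in tasks if t.duration > l or frag_score(t) >= f}
        schedule = build_schedule(T)   # place T first, pack the rest around it
        if schedule.makespan < best_makespan:
            best, best_makespan = schedule, schedule.makespan
    return best
```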

Schedule dead-ends
After T is placed, the remaining tasks split into parents P, children C, and siblings S of T. Which placement orders are legitimate?
- Since some parents and children of a task are already placed with T, a careless order may be unable to place that task (a dead-end).
- When placing the tasks in P, Graphene goes backwards in time: a task is placed only after all of its children are placed.
- Graphene explores all dead-end-free orders (e.g., P backwards, then S and C forwards) and first extends T to its transitive closure: T ← TransitiveClosure(T).
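A sketch of the dead-end-free placement order, again under the Task model above. Here `space` stands in for a hypothetical resources-by-time placement object with place_earliest/place_latest methods, and sorting by ancestor count is just a cheap way to get a topological order; the real packer is richer.

```python
def ancestors(task):
    """All transitive predecessors of a task."""
    out = set(task.parents)
    for p in task.parents:
        out |= ancestors(p)
    return out

def place_around(space, T, tasks):
    # Split the remaining tasks relative to T (assumes T was already
    # extended to its transitive closure, so the split is clean).
    rest = set(tasks) - T
    P = {t for t in rest if any(t in ancestors(x) for x in T)}  # parents of T
    C = {t for t in rest if ancestors(t) & T}                   # children of T
    S = rest - P - C                                            # siblings
    # Parents go backwards in time: a task is placed only after all of
    # its children are, so it can never be boxed in by them.
    for task in sorted(P, key=lambda t: -len(ancestors(t))):
        space.place_latest(task)
    # Siblings and children go forwards, in dependency order.
    for group in (S, C):
        for task in sorted(group, key=lambda t: len(ancestors(t))):
            space.place_earliest(task)
```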

Main ideas for one DAG (recap)
- Identify troublesome tasks T and place them first
- Systematically place the remaining tasks (P, S, C) around T to avoid dead-ends
[Figure: resources vs. time; T is placed first, with P, S, and C packed around it.]

One DAG → multiple DAGs
We computed an offline schedule for one DAG; production clusters have many DAGs running concurrently.

Convert the offline schedule to a priority order on tasks
Offline order for the example DAG: t1 → t3 → t4 → {t0, t2, t5}
[Figure: two concurrent copies of the example DAG (Job1: tasks 0–5; Job2: tasks 0'–5'). At runtime, dispatching tasks in the offline priority order finishes around 2T, while CPSched takes until 3T–4T.]
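The handoff itself is tiny: all the runtime needs from the offline pass is a total order, for example each task's start time in the per-DAG schedule. A sketch, with the schedule represented as a plain dict:

```python
def priority_order(offline_schedule):
    """offline_schedule: {task: start_time} for one DAG. The runtime
    keeps only this order; the offline schedule itself is discarded."""
    return sorted(offline_schedule, key=offline_schedule.get)

# For the example DAG this yields [t1, t3, t4, t0, t2, t5]
# (ties among t0, t2, t5 broken arbitrarily).
```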

Main ideas for multiple DAGs
Convert each DAG's offline schedule into a priority order on its tasks
Online, enforce schedule priority along with heuristics for:
- multi-resource packing
- "SRPT" (shortest remaining processing time) to lower average job completion time
- a bounded amount of unfairness
- overbooking; …
Trade-offs: schedule priority is best for one DAG but may lose packing efficiency overall; packing may delay job completion; fairness may counteract all the others. We show that {best "perf" | bounded unfairness} ≈ best "perf". (A sketch of combining these signals follows below.)
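To make the combination concrete, here is a minimal sketch of scoring a runnable task against a machine. The fields (task.priority from the offline order, task.demands and machine.free as same-length resource vectors, task.job.remaining_work for SRPT) and the weights are our assumptions; bounded unfairness and overbooking are omitted, and the paper defines how Graphene actually balances these signals.

```python
def score(task, machine, w_pack=1.0, w_srpt=0.1, w_prio=0.1):
    if any(d > f for d, f in zip(task.demands, machine.free)):
        return None                             # does not fit (no overbooking here)
    pack = sum(d * f for d, f in zip(task.demands, machine.free))  # alignment, as in packers
    srpt = -task.job.remaining_work             # favor jobs closest to completion
    prio = -task.priority                       # favor earlier tasks in the offline order
    return w_pack * pack + w_srpt * srpt + w_prio * prio

def pick_task(machine, runnable_tasks):
    scored = [(score(t, machine), t) for t in runnable_tasks]
    scored = [(s, t) for s, t in scored if s is not None]
    return max(scored, key=lambda st: st[0])[1] if scored else None
```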

Graphene summary & implementation
- Offline, schedule each DAG by placing troublesome tasks first
- Online, enforce the priority order over tasks along with other heuristics
[Figure: implementation in a YARN-like architecture. Each application master (AM) runs a Schedule Constructor over its DAG; node managers (NMs) send heartbeats to the resource manager (RM), which makes task assignments using priority order + packing + bounded unfairness + overbooking.]

Implementation details
- DAG annotations
- Bundling: improve schedule quality without hurting scheduling latency (sketched below)
- Co-existence with (many) other scheduler features
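The transcript does not spell out the bundling mechanism, so the following is only a guess at its flavor: instead of greedily committing to one task per node heartbeat, evaluate small subsets ("bundles") drawn from the top-k runnable tasks and commit the subset that fills the machine best; capping k and the bundle size bounds the added scheduling latency. All names here are hypothetical.

```python
from itertools import chain, combinations

def fits(bundle, machine):
    # Sum demands dimension-wise and check against free resources.
    used = [sum(dim) for dim in zip(*(t.demands for t in bundle))]
    return all(u <= f for u, f in zip(used, machine.free))

def best_bundle(machine, runnable_tasks, k=8, max_bundle=3):
    top = sorted(runnable_tasks, key=lambda t: t.priority)[:k]  # cheap pre-filter
    candidates = chain.from_iterable(
        combinations(top, n) for n in range(1, max_bundle + 1))
    feasible = [b for b in candidates if fits(b, machine)]
    # Commit the bundle that uses the machine most fully.
    return max(feasible, key=lambda b: sum(sum(t.demands) for t in b), default=None)
```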

Evaluation
Prototype: 200-server multi-core cluster; TPC-DS, TPC-H, …, GridMix to replay traces; jobs arrive online
Simulations: traces from production Microsoft Cosmos and Yarn clusters; compare with many alternatives

Results - 1 [20K DAGs from Cosmos]
[Figure: improvement in job completion time for Packing vs. Packing + Deps., against the lower bound; reference marks at 1.5x and 2x.]

Results - 2 [200 jobs from TPC-DS, 200-server cluster]
[Figure: job completion time improvements for Tez + Packing vs. Tez + Packing + Deps.]

Conclusions
- Scheduling heterogeneous DAGs well requires an online solution that handles multiple resources and dependencies
- Graphene: offline, construct a per-DAG schedule by placing troublesome tasks first; online, enforce schedule priority along with other heuristics
- A new lower bound shows Graphene is nearly optimal for half of the DAGs
- Experiments show gains in job completion time, makespan, …
- Graphene generalizes to DAGs in other settings

When the scheduler works with erroneous task profiles
[Figure: fractional change in JCT vs. the amount of error injected into task profiles, bucketed as (−0.75, −0.50), (−0.50, −0.25), (+0.25, +0.50), (+0.50, +0.75).]

DAG annotations
Graphene uses per-stage average duration and demands of {cpu, mem, net, disk}
- Almost all frameworks already have users annotate cpu and mem
- Recurring jobs [1] have predictable profiles (after correcting for input size)
- Ongoing work builds profiles for ad-hoc jobs: sample and project [2]; program analysis [3]
[1] RoPE, NSDI'12; … [2] Perforator, SOCC'16; … [3] SPEED, POPL'09; …
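As a toy illustration of profile reuse for recurring jobs: the linear duration scaling and the history layout below are our assumptions; RoPE and Perforator describe real estimators.

```python
def estimate_stage_profile(history, job_template, stage, input_bytes):
    """Reuse a stored profile from earlier runs of a recurring job,
    corrected for this run's input size."""
    prof = history[(job_template, stage)]
    # e.g. prof = {'duration': 42.0, 'input_bytes': 1 << 30,
    #              'demands': {'cpu': 0.4, 'mem': 0.2, 'net': 0.1, 'disk': 0.1}}
    scale = input_bytes / prof['input_bytes']
    return {'duration': prof['duration'] * scale,   # duration grows with input
            'demands': prof['demands']}             # per-task demands stay roughly stable
```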

Using Graphene to schedule other DAGs

Characterizing DAGs in Cosmos clusters

Characterizing DAGs in Cosmos clusters – 2

Runtime of production DAGs

Job completion times on different workloads

Makespan Fairness

Comparison with other alternatives

Online Pseudocode