Hawk: Hybrid Datacenter Scheduling

Hawk: Hybrid Datacenter Scheduling. Pamela Delgado, Florin Dinu, Anne-Marie Kermarrec, Willy Zwaenepoel. July 10th, 2015. This talk presents a new kind of scheduler.

Introduction: datacenter scheduling. Let's take a look at the scheduling problem. In a datacenter we have a cluster composed of many nodes (typically tens of thousands), and we have a set of jobs, each normally divided into tasks so that it can run in parallel. Between the two sits the scheduler (or resource manager). The goal of the scheduler is to efficiently assign job tasks to nodes in the cluster, and this can be done in different ways.
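To fix the vocabulary used in the rest of the talk, here is a minimal sketch of the entities involved; the type names and fields are illustrative assumptions, not Hawk's actual code.

```python
from dataclasses import dataclass, field

# Illustrative data model (an assumption, not Hawk's implementation):
# a job is a bag of parallel tasks, each node holds a queue of pending
# work, and a scheduler maps tasks to nodes.

@dataclass
class Task:
    job_id: int
    estimated_runtime_s: float  # e.g., estimated from past runs

@dataclass
class Job:
    job_id: int
    tasks: list[Task]

@dataclass
class Node:
    node_id: int
    queue: list[Task] = field(default_factory=list)  # pending work
```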

Introduction: centralized scheduling. We can have one centralized scheduler in charge of scheduling all of the jobs across the whole cluster. Since everything goes through this one component, it has perfect visibility of what is running, where, and when. As a consequence it can place tasks in the best possible way. However, there is a catch.

Introduction: centralized scheduling. If there are too many incoming jobs, the centralized scheduler can get overwhelmed, and jobs have to wait in a queue and suffer from head-of-line blocking.

Introduction: centralized scheduling. Good: placement. Not so good: scheduling latency.

Introduction: distributed scheduling. What if we schedule in a distributed way instead? We can get better scheduling latency; the best case is one scheduler per job. However, distributed schedulers typically have outdated information about the cluster status, or even no information at all.

Introduction: distributed scheduling. Good: scheduling latency. Not so good: placement.

Outline:
1) Introduction
2) Hawk hybrid scheduling: rationale and design
3) Evaluation: simulation and real cluster
4) Conclusion

Hybrid scheduling. Can we get the best of both worlds? The answer is yes. (The previous talk also introduced a hybrid scheduling approach; that work was done in parallel, and the two groups were not aware of each other.)

Hawk: hybrid scheduling. Long jobs → centralized; short jobs → distributed. All long jobs are scheduled by one centralized scheduler, and short jobs by distributed schedulers. But how do we distinguish long from short jobs?

Hawk: hybrid scheduling. Long vs. short: a job's estimated execution time is compared against a cutoff. Why classify by time? Because of the heterogeneity of jobs: they differ in nature, like having mice and elephants. A sketch of this routing follows.
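Here is a minimal sketch of the classification, assuming a fixed cutoff and an externally supplied runtime estimate; the cutoff value below is hypothetical, not the one used in the paper.

```python
# Hedged sketch of Hawk's job routing: jobs whose estimated execution
# time exceeds a cutoff go to the centralized scheduler, the rest to a
# distributed scheduler. CUTOFF_SECONDS is made up for illustration.

CUTOFF_SECONDS = 90.0

def route_job(estimated_runtime_s: float) -> str:
    """Return which scheduler should handle the job."""
    if estimated_runtime_s > CUTOFF_SECONDS:
        return "centralized"   # long job: careful placement pays off
    return "distributed"       # short job: scheduling latency matters

print(route_job(5.0))     # -> distributed
print(route_job(600.0))   # -> centralized
```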

Rationale for Hawk. Typical production workloads contain few long jobs, which consume most of the resources, and many short jobs, which consume little.

Rationale for Hawk (continued). [Chart: percentage of long jobs vs. percentage of task-seconds for long jobs (occupancy ratio), per production trace.] Task-seconds measure total work: the sum over a job's tasks of each task's duration. Source: Design Insights for MapReduce from Diverse Production Workloads, Chen et al., 2012.

Rationale for Hawk (continued). The takeaway: long jobs are a minority of the jobs, but they take up most of the resources, i.e., most of the task-seconds.
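A toy calculation (made-up numbers, not from the trace) of why a small minority of long jobs can dominate task-seconds:

```python
# Toy numbers: 5 long jobs of 1000 tasks x 100 s, and 95 short jobs of
# 10 tasks x 5 s. Task-seconds = number of tasks x task duration.
long_task_seconds = 5 * 1000 * 100    # 500,000
short_task_seconds = 95 * 10 * 5      # 4,750

total = long_task_seconds + short_task_seconds
print(f"long jobs: {5 / 100:.0%} of jobs, "
      f"{long_task_seconds / total:.1%} of task-seconds")
# -> long jobs: 5% of jobs, 99.1% of task-seconds
```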

Hawk: hybrid scheduling. Centralized side, long jobs: they take the bulk of the resources, so good placement pays off; they are few, so the centralized scheduler keeps a reasonable scheduling latency. Distributed side, short jobs: they take few resources, so not-so-good placement is an acceptable trade; they are latency-sensitive, so they get fast scheduling.

Hawk: hybrid scheduling. The best of both worlds: good scheduling latency for short jobs, good placement for long jobs. Next: how Hawk does distributed scheduling.

Hawk: distributed scheduling. Let's look at how Hawk schedules short jobs: it uses the probing technique introduced in Sparrow and adds work-stealing on top.

Hawk: distributed scheduling. First, Sparrow's probing technique.

Sparrow (SOSP 2013). For each task, the distributed scheduler sends random reservations (probes) to worker nodes, two per task (the power of two choices); whichever probed node frees up first runs the task.
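A minimal sketch of the probing step, assuming per-task probes and omitting Sparrow's late-binding handshake; the names and the probe-ratio constant are illustrative.

```python
import random

# Minimal sketch of Sparrow-style probing: for each task, enqueue a
# reservation ("probe") at d = 2 randomly sampled workers (the power
# of two choices). The late-binding step, where a worker whose probe
# reaches the head of its queue pulls the actual task, is omitted.

PROBE_RATIO = 2

def place_probes(num_tasks: int, worker_queues: list) -> None:
    """Enqueue PROBE_RATIO probes per task on distinct random workers."""
    for task_id in range(num_tasks):
        for queue in random.sample(worker_queues, PROBE_RATIO):
            queue.append(("probe", task_id))

queues = [[] for _ in range(10)]          # ten idle workers
place_probes(num_tasks=4, worker_queues=queues)
print(sum(len(q) for q in queues))        # -> 8 probes in total
```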

Hawk: distributed scheduling. Sparrow probing (SOSP 2013) plus work-stealing. What goes wrong with probing alone under high load?

Sparrow and high load. Under high load, random placement has a low likelihood of finding a free node, so Sparrow by itself does not meet our goals.

Sparrow and high load. High load + job heterogeneity → head-of-line blocking: a short task can get queued behind a long one, as the toy example below shows.
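A toy illustration of head-of-line blocking in a single FIFO worker queue (made-up numbers):

```python
# Toy head-of-line blocking example: in a FIFO worker queue, a 1-second
# task stuck behind a 100-second task waits 100x its own duration.

queue = [("long", 100.0), ("short", 1.0)]  # (task kind, duration in s)

clock = 0.0
for kind, duration in queue:
    print(f"{kind} task waits {clock:.0f} s, then runs {duration:.0f} s")
    clock += duration
# -> the short task waits 100 s to do 1 s of work
```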

Hawk work-stealing. A node becomes free: its queue has drained.

Hawk work-stealing. 1. The free node contacts a randomly chosen node and asks for probes. 2. The contacted node hands over the short-job reservations waiting in its queue.

Hawk work-stealing. Under high load there is a high probability that the contacted node has a backlog, so the free node rarely comes away empty-handed.
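A minimal sketch of the stealing step, assuming a simplified hand-over policy (the function name and the "take all queued short probes" rule are illustrative; the paper specifies the precise policy):

```python
import random

# Hedged sketch of Hawk work-stealing: a node whose queue has drained
# picks one random peer and takes over the short-job probes waiting in
# that peer's queue. Under high load the peer very likely has a
# backlog, so the free node rarely comes away empty-handed.

def steal_short_probes(free_queue: list, peer_queues: list) -> int:
    """Move short-job probes from one random peer to the free node."""
    victim = random.choice(peer_queues)
    stolen = [p for p in victim if p["kind"] == "short"]
    victim[:] = [p for p in victim if p["kind"] != "short"]  # keep rest
    free_queue.extend(stolen)
    return len(stolen)

peers = [[{"kind": "long"}, {"kind": "short"}, {"kind": "short"}]
         for _ in range(5)]
my_queue: list = []
print(steal_short_probes(my_queue, peers), "short probes stolen")  # -> 2
```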

Hawk cluster partitioning. There is no coordination between the centralized and the distributed schedulers; the challenge is that long tasks could occupy every node, leaving no free nodes for the mice (short jobs). Hawk therefore reserves a small partition of the cluster.

Hawk cluster partitioning. Short jobs can be scheduled anywhere; long jobs only on the non-reserved nodes.
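A sketch of the resulting placement constraint; RESERVED_FRACTION and the "fixed prefix of node ids" choice are illustrative assumptions, not the paper's parameters.

```python
# Hedged sketch of Hawk cluster partitioning: a small slice of the
# nodes is reserved for short jobs only. Short jobs may run anywhere;
# long jobs may only use the non-reserved nodes.

RESERVED_FRACTION = 0.10  # illustrative, not the paper's value

def eligible_nodes(job_kind: str, node_ids: list) -> list:
    """Node ids on which a job of the given kind may be placed."""
    num_reserved = int(len(node_ids) * RESERVED_FRACTION)
    reserved = set(node_ids[:num_reserved])  # assume a fixed prefix
    if job_kind == "short":
        return node_ids                      # anywhere in the cluster
    return [n for n in node_ids if n not in reserved]

nodes = list(range(100))
print(len(eligible_nodes("short", nodes)))  # -> 100
print(len(eligible_nodes("long", nodes)))   # -> 90
```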

Hawk design summary. Hybrid scheduling: long → centralized, short → distributed. Work-stealing. Cluster partitioning.

Evaluation, part 1: simulation. We use the Sparrow simulator with a Google trace, varying the number of nodes to vary cluster utilization. We measure job running time and report the 50th and 90th percentiles for short and long jobs, normalized to Sparrow.
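The reported numbers can be reproduced from per-job running times as below; the sample values are placeholders, not measurements from the paper.

```python
import statistics

# Sketch of the reported metric: each job's running time under Hawk is
# divided by the same job's running time under the Sparrow baseline,
# then the 50th and 90th percentiles are reported.

hawk_runtimes    = [10.0, 12.0, 30.0, 45.0, 9.0]   # seconds, per job
sparrow_runtimes = [14.0, 12.0, 60.0, 41.0, 18.0]  # same jobs, baseline

normalized = [h / s for h, s in zip(hawk_runtimes, sparrow_runtimes)]
deciles = statistics.quantiles(normalized, n=10)   # 9 decile cut points
p50, p90 = deciles[4], deciles[8]
print(f"50th percentile: {p50:.2f}, 90th percentile: {p90:.2f}")
# values below 1 mean Hawk is faster than the baseline
```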

Simulated results: short jobs. [Chart: short-job running times normalized to Sparrow; a value of 1 matches Sparrow, lower is better.] Hawk is better across the board. Note that the metric is job running time, which includes waiting time on the workers; this is distinct from scheduling latency. With respect to Sparrow, Hawk does well.

Simulated results: long jobs. [Chart: long-job running times normalized to Sparrow; lower is better.] Hawk is better except under high load.

Simulated results: long jobs. At very high utilization, long jobs suffer because part of the cluster is reserved exclusively for short jobs: the cost of partitioning.

Decomposing Hawk. We compare Hawk minus the centralized scheduler, Hawk minus stealing, and Hawk minus partitioning, all normalized to full Hawk.

Decomposing Hawk: no centralized scheduler. Long-job running times go up because tasks from different jobs queue one after another. Short jobs do slightly better: with long-job performance decreased, fewer short tasks encounter queueing.

Decomposing Hawk: no stealing. Short jobs are greatly penalized (up to 19.6x Hawk's running time): their tasks sit queued behind long tasks. Long jobs are slightly penalized because they share queues with more short tasks.

Decomposing Hawk: no partitioning. Short jobs do badly (up to 11.9x Hawk's running time), stuck behind long tasks, which can now occupy any node. Long jobs do slightly better because they can be scheduled on more nodes.

Decomposing Hawk: summary. Removing any one of the components reduces Hawk's performance.

Sensitivity analysis. We varied: incorrect estimates of runtime, the long/short cutoff, the details of stealing, and the size of the small partition.

Sensitivity analysis. Bottom line: Hawk is relatively stable to all of these variations; see the paper for details.

Evaluation, part 2: implementation. The implementation consists of a Hawk scheduler plus a Hawk daemon on each node.

Experiment. A 100-node cluster runs a subset of the Google trace, compressed so the workload suits the smaller cluster; we vary the inter-arrival time to vary cluster utilization. We measure job running time and report the 50th and 90th percentiles for short and long jobs, normalized to Sparrow.

Short jobs. [Chart: short-job running times vs. inter-arrival time / mean task run time; lower is better.] At the 90th percentile the results are not as good: fewer jobs are tested there, so corner cases weigh more.

Long jobs. [Chart: long-job running times vs. inter-arrival time / mean task run time; lower is better.] Again, the 90th percentile is less good, since fewer jobs are tested (corner cases).

Implementation takeaways. 1. Hawk works well on a real cluster. 2. There is good correspondence between the implementation and the simulation.

Related work. Centralized: Hadoop (EuroSys'10), Quincy (SOSP'09). Two-level: YARN (SoCC'13), Mesos (NSDI'11). Distributed: Omega (EuroSys'13), Sparrow (SOSP'13). Hybrid: Mercury. There is a lot of work in this area exploring the same trade-off; these are a few examples.

Conclusion. Hawk is a hybrid scheduler: long jobs go to a centralized scheduler and short jobs to distributed schedulers, with work-stealing and cluster partitioning. Hawk provides good results for both short and long jobs, even under high cluster utilization.