Hawk: Hybrid Datacenter Scheduling
Pamela Delgado, Florin Dinu, Anne-Marie Kermarrec, Willy Zwaenepoel
July 10th, 2015
We present a new kind of scheduler.
Introduction: datacenter scheduling
Let's take a look at the scheduling problem. In a data center we have a cluster composed of many nodes (typically tens of thousands), and we have a set of jobs, each normally divided into tasks so that it can run in parallel. Between the two sits the scheduler (or resource manager). The goal of the scheduler is to efficiently assign job tasks to nodes in the cluster. We can do this in different ways.
Introduction: Centralized scheduling
One option is a single centralized scheduler in charge of scheduling all of the jobs on the whole cluster. Since everything goes through this one component, it has perfect visibility of what is running, where, and when; as a consequence it can place tasks in the best possible way. However, there is a catch.
Introduction: Centralized scheduling
If too many jobs arrive, the centralized scheduler can get overwhelmed: jobs have to wait in its queue and suffer from head-of-line blocking.
Introduction: Centralized scheduling
Good: placement. Not so good: scheduling latency.
Introduction: Distributed scheduling
Now, what if we schedule in a distributed way? We can get better scheduling latency; the best case would be one scheduler per job. However, distributed schedulers typically have outdated information about the cluster status, or even no information at all.
Introduction: Distributed scheduling
Good: scheduling latency. Not so good: placement.
Outline
1) Introduction
2) Hawk hybrid scheduling: rationale, design
3) Evaluation: simulation, real cluster
4) Conclusion
The emphasis is on Hawk.
Hybrid scheduling
Can we get the best of both worlds, combining a centralized scheduler with distributed ones? The answer is yes. The previous talk also introduced a hybrid scheduling approach; that work was done in parallel, and we were not aware of each other.
Hawk: Hybrid scheduling
Long jobs are scheduled centrally; short jobs are scheduled in a distributed way. How does Hawk do hybrid scheduling? All long jobs are scheduled by one centralized scheduler, and short jobs by distributed schedulers. Since we talk about long and short jobs, how do we distinguish them?
Hawk: Hybrid scheduling
Long vs. short: a job's estimated execution time is compared against a cut-off. Why is an estimate needed? Because production workloads are heterogeneous: jobs differ greatly in nature, like mice and elephants.
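To make the split concrete, here is a minimal Python sketch of the classification and routing step. The cutoff value, the Job fields, and the scheduler interface are illustrative assumptions, not Hawk's actual parameters or code.

import random
from dataclasses import dataclass

CUTOFF_SECONDS = 90.0  # illustrative long/short boundary (assumed value)

@dataclass
class Job:
    name: str
    estimated_runtime: float  # estimated execution time, e.g. from previous runs

def submit(job, centralized, distributed_schedulers):
    """Route a job: long jobs go to the single centralized scheduler,
    short jobs to one of the distributed schedulers."""
    if job.estimated_runtime > CUTOFF_SECONDS:
        centralized.schedule(job)  # long job: careful, informed placement
    else:
        random.choice(distributed_schedulers).schedule(job)  # short job: low latency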
Rationale for Hawk
In typical production workloads, a few long jobs consume most of the resources, while many short jobs consume little.
Rationale for Hawk (continued)
For production workloads, compare the percentage of long jobs with the percentage of task-seconds that long jobs account for (a task-second is one task running for one second, so task-seconds measure resource consumption), as well as the occupancy ratio.
Source: Design Insights for MapReduce from Diverse Production Workloads, Chen et al., 2012.
Rationale for Hawk (continued)
Long jobs are a minority of the jobs, but they take up most of the resources (most of the task-seconds).
Hawk: hybrid scheduling
Long jobs go to the centralized scheduler: they use the bulk of the resources, so they benefit from good placement, and because there are few of them the centralized scheduler still achieves reasonable scheduling latency.
Short jobs go to the distributed schedulers: they are latency-sensitive and need fast scheduling, and because they use few resources we can trade away some placement quality.
Hawk: hybrid scheduling
Best of both worlds: good scheduling latency for the short jobs and good placement for the long jobs. Next: how Hawk does distributed scheduling.
Hawk: Distributed scheduling
Two ingredients: Sparrow-style probing and work-stealing. Let's look at how Hawk does distributed scheduling: we use the probing technique introduced in Sparrow, and we add work-stealing on top.
Hawk: Distributed scheduling
First, let's look at how Sparrow's probing technique works.
Sparrow
For each task, the distributed scheduler places reservations on randomly chosen nodes ("power of two choices"). This probing technique was introduced by Sparrow (SOSP 2013).
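As an illustration, here is a rough Python sketch of Sparrow-style probing, assuming each worker exposes a simple reservation queue; the probe ratio of two and the late-binding behavior follow the Sparrow paper, but the objects and method names are hypothetical.

import random

PROBE_RATIO = 2  # "power of two choices": reservations sent per task

def probe(task_id, workers):
    """Send a reservation for the task to PROBE_RATIO randomly chosen workers.
    Workers queue the reservation; whichever worker dequeues it first asks the
    scheduler for the actual task (late binding), and the others discard it."""
    for worker in random.sample(workers, PROBE_RATIO):
        worker.reservation_queue.append(task_id)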
Hawk: Distributed scheduling
Sparrow probing and work-stealing. Having seen Sparrow's probing, what happens to it under high load?
Sparrow and high load
With random placement under high load, there is a low likelihood of finding a free node, so Sparrow by itself is not good enough for our goals.
Sparrow and high load
High load plus job heterogeneity leads to head-of-line blocking: short tasks end up queued behind long tasks. For instance, with two random probes per task, at 90% cluster load both probed nodes are busy with probability roughly 0.9 × 0.9 ≈ 0.81.
Hawk work-stealing
At some point a node becomes free, while short-task reservations may still be waiting in queues elsewhere.
Hawk work-stealing
1. The free node contacts a randomly chosen node and asks for its queued probes.
2. The contacted node hands over the short-job reservations waiting in its queue.
Hawk work-stealing
Under high load there is a high probability that the randomly contacted node has a backlog, so the free node usually finds short-job reservations to take over.
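A simplified Python sketch of the stealing step, under the assumption that each node keeps a reservation queue whose entries are tagged short or long; the real protocol steals only the short reservations queued behind a long task, which this sketch glosses over.

import random

def steal_short_reservations(idle_node, all_nodes):
    """An idle node contacts one randomly chosen node and takes over the
    short-job reservations queued there; long-job reservations stay put."""
    victim = random.choice([n for n in all_nodes if n is not idle_node])
    stolen = [r for r in victim.reservation_queue if r.is_short]
    victim.reservation_queue = [r for r in victim.reservation_queue if not r.is_short]
    idle_node.reservation_queue.extend(stolen)
    return len(stolen)  # under high load this is usually non-zero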
Hawk cluster partitioning
There is no coordination between the centralized and the distributed schedulers, so the challenge is that there may be no free nodes left for the mice (the short jobs). Hawk therefore reserves a small partition of the cluster for short jobs.
Hawk cluster partitioning
Short jobs can be scheduled anywhere; long jobs are placed only on the non-reserved nodes.
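The placement constraint can be expressed as a small filter over the node list, sketched below in Python; the 10% reserved fraction and the assumption that the reserved nodes come first in the list are illustrative, not Hawk's configuration.

def candidate_nodes(job_is_short, nodes, reserved_fraction=0.10):
    """Short jobs may land on any node; long jobs are kept off the small
    partition reserved for short jobs."""
    if job_is_short:
        return nodes
    num_reserved = int(len(nodes) * reserved_fraction)
    return nodes[num_reserved:]  # skip the reserved partition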
Hawk design summary
Hybrid scheduler: long jobs centralized, short jobs distributed.
Work-stealing.
Cluster partitioning.
Evaluation: 1. Simulation
We use the Sparrow simulator with a Google trace and vary the number of nodes to vary cluster utilization. We measure job running time and report the 50th and 90th percentiles for short and long jobs, normalized to Sparrow.
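For concreteness, the reported metric can be computed as in the following Python sketch; normalizing percentile-by-percentile against Sparrow is my reading of the slides, so treat the exact procedure as an assumption.

import numpy as np

def normalized_job_runtime_percentiles(hawk_runtimes, baseline_runtimes):
    """50th and 90th percentile of job running time, normalized to the
    baseline scheduler (values below 1.0 mean Hawk is faster)."""
    return {p: np.percentile(hawk_runtimes, p) / np.percentile(baseline_runtimes, p)
            for p in (50, 90)}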
Simulated results: short jobs
Lower is better; 1.0 is Sparrow. Hawk is better across the board. Note that the metric is job running time (which includes time spent waiting in queues), not scheduling latency; with respect to Sparrow we do well.
Simulated results: long jobs
Lower is better. Hawk is better except under very high load.
Simulated results: long jobs
At very high utilization long jobs suffer slightly because part of the cluster is reserved exclusively for short jobs (the partitioning).
Decomposing Hawk
We remove one component at a time: Hawk minus the centralized scheduler, Hawk minus stealing, Hawk minus partitioning (results normalized to full Hawk).
Decomposing Hawk: no centralized scheduler
Long-job running times increase because tasks from different jobs queue one after another. Short jobs do slightly better: with long jobs performing worse, fewer short tasks encounter queueing.
Decomposing Hawk: no stealing
(The short-job bar goes off the chart, to 19.6.) Short jobs are greatly penalized: their tasks sit queued behind long tasks. Long jobs are slightly penalized because they now share the queues with more short tasks.
Decomposing Hawk: no partitioning
(The short-job bar goes off the chart, to 11.9.) Short jobs do badly: they get stuck behind long tasks on any node. Long jobs do slightly better because they can be scheduled on more nodes.
Decomposing Hawk: summary
Removing any one of the components reduces Hawk's performance.
Sensitivity analysis
Incorrect estimates of runtime; the long/short cutoff; the details of stealing; the size of the small partition.
Sensitivity analysis
Bottom line: Hawk is relatively stable to variations in all of these parameters. See the paper for details.
Evaluation: 2. Implementation
The implementation consists of the Hawk scheduler plus a Hawk daemon on each node.
Experiment
100-node cluster, a subset of the Google trace. We vary the inter-arrival time (compressing the trace) to vary cluster utilization. We measure job running time and report the 50th and 90th percentiles for short and long jobs, normalized to Sparrow.
Short jobs
Lower is better; the x-axis is inter-arrival time divided by mean task running time. At the 90th percentile the simulator's prediction matches less well because fewer jobs fall there (corner cases).
Long jobs
Lower is better; the x-axis is inter-arrival time divided by mean task running time. At the 90th percentile the simulator's prediction matches less well because fewer jobs fall there (corner cases).
Implementation
1. Hawk works well on a real cluster.
2. There is good correspondence between the implementation and the simulation.
Related work
Centralized: Hadoop, Quincy (EuroSys'10, SOSP'09).
Two-level: YARN, Mesos (SoCC'13, NSDI'11).
Distributed: Omega, Sparrow (EuroSys'13, SOSP'13).
Hybrid: Mercury.
There is a lot of work in this area, each design making its own trade-off; these are just a few examples.
Conclusion
Hawk is a hybrid scheduler: long jobs are scheduled centrally, short jobs in a distributed way, with work-stealing and cluster partitioning. Hawk provides good results for both short and long jobs, even under high cluster utilization.