Job-aware Scheduling in Eagle: Divide and Stick to Your Probes
Pamela Delgado, Diego Didona, Florin Dinu, Willy Zwaenepoel
I. Data-center scheduling
[Figure: jobs 1 … N, each made up of several tasks, pass through a scheduler that places the tasks on cluster nodes.]
The context of this presentation is data-center scheduling.
Outline: Introduction; Eagle: Divide; Eagle: Stick to Your Probes; Evaluation; Conclusion
I. Data-center scheduling challenges
Heterogeneous workloads: short vs long tasks
Problem: head-of-line blocking (short behind long)
[Figure: a queue in which short tasks wait together with a long task.]
In data-center scheduling we face several challenges. The first is a workload that combines tasks with a long execution time and tasks with a short execution time. For the purpose of this talk, a job that has short tasks is called a short job.
I. Data-center scheduling challenges
Scheduler-induced stragglers. Problem: non-job-aware scheduling
Large scale: tens of thousands of tasks/second, tens of thousands …
[Figure: timeline of tasks 1 … n of one job on the cluster; task x finishes later than the others and determines the job completion time.]
When one task finishes later than the others, the job completion time suffers. Schedulers schedule at the task level, which leads to non-job-aware scheduling. Scale matters both in terms of cluster size and in terms of load.
II. Eagle contributions
Divide: novel technique to avoid head-of-line blocking
Stick to Your Probes: decentralized job-awareness
Hybrid scheduler: both techniques are built on top of a hybrid scheduler to get the necessary scalability
So what is hybrid scheduling? Hybrid means a mix of centralized and distributed scheduling. How does it work?
I. Hybrid scheduling: long centralized
[Figure: a centralized scheduler places long tasks (L) on cluster nodes.]
I. Hybrid scheduling: short distributed
[Figure: several distributed schedulers send probes for short tasks (s) to cluster nodes, some of which already hold long tasks (L).]
Speaker note: not use late binding.
II.1. Problem: Head-of-line blocking
Short behind long
High likelihood (long = many resources)
[Figure: a long task at the head of the queue with three short tasks behind it.]
A short task is enqueued behind a long task that is either waiting in the queue or already running.
II.1. Rationale for Divide
Expected completion time of a task grows linearly with the variance of task execution times*
DIVIDE by execution time
[Figure: long tasks and short tasks placed in separate queues.]
*Pollaczek-Khinchine formula: Queueing Systems, Volume 1: Theory. L. Kleinrock, 1975
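The rationale can be made concrete with a small worked example. The sketch below applies the Pollaczek-Khinchine mean-wait formula for a single M/G/1 queue, W = λ·E[S²] / (2·(1 − ρ)) with ρ = λ·E[S]. The task mix (90% tasks of 1 s, 10% tasks of 100 s), the 50% utilisation, and the function name `pk_mean_wait` are assumptions chosen for illustration; they do not come from the talk or from Eagle's implementation.

```python
def pk_mean_wait(arrival_rate, mean_service, second_moment):
    """Mean queueing delay of an M/G/1 queue (Pollaczek-Khinchine formula)."""
    rho = arrival_rate * mean_service  # utilisation of the single server
    if rho >= 1:
        raise ValueError("queue is unstable (utilisation >= 1)")
    return arrival_rate * second_moment / (2 * (1 - rho))


# Hypothetical workload: 90% short tasks (1 s) and 10% long tasks (100 s).
p_short, short, long_ = 0.9, 1.0, 100.0
utilisation = 0.5

# One shared queue: short and long tasks are mixed.
mixed_mean = p_short * short + (1 - p_short) * long_
mixed_second_moment = p_short * short**2 + (1 - p_short) * long_**2
mixed_wait = pk_mean_wait(utilisation / mixed_mean, mixed_mean, mixed_second_moment)

# A queue that only ever sees short tasks, at the same utilisation.
short_wait = pk_mean_wait(utilisation / short, short, short**2)

print(f"mean wait, mixed queue:      {mixed_wait:.2f} s")
print(f"mean wait, short-only queue: {short_wait:.2f} s")
```

Under these assumptions the high second moment of the mixed workload dominates the waiting time, which is the effect that dividing by execution time is meant to remove for short tasks.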
II.1. Dynamic division
[Figure: long tasks and short tasks occupy separate groups of nodes; the division between the groups is dynamic.]
II.1. Eagle – Divide
IDEA: Dynamic partitioning
Succinct State Sharing
* Centralized: send bitmap of nodes with long tasks
* Distributed: based on the bitmap, avoid nodes with long tasks
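A minimal sketch of the bitmap idea, under stated assumptions: nodes are numbered from 0, the bitmap is a plain Python integer with bit i set when node i holds a long task, and the function names are hypothetical. It illustrates the technique only and is not Eagle's code or wire format.

```python
import random


def build_bitmap(nodes_with_long_tasks):
    """Centralized side: set bit i for every node i that holds a long task."""
    bitmap = 0
    for node_id in nodes_with_long_tasks:
        bitmap |= 1 << node_id
    return bitmap


def has_long_task(bitmap, node_id):
    """Distributed side: test one node's bit."""
    return bool((bitmap >> node_id) & 1)


def choose_probe_targets(bitmap, num_nodes, num_probes, rng=random):
    """Pick random probe targets among nodes whose bit is not set."""
    candidates = [n for n in range(num_nodes) if not has_long_task(bitmap, n)]
    # Assumption: if fewer free nodes than probes exist, probe all of them.
    return rng.sample(candidates, min(num_probes, len(candidates)))


bitmap = build_bitmap([0, 2, 5])
targets = choose_probe_targets(bitmap, num_nodes=8, num_probes=3, rng=random.Random(0))
```

The bitmap a distributed scheduler holds can be out of date, so a probe may still reach a node that has meanwhile received a long task; this sketch does not model that case.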
II.1. Eagle – Divide
[Figure: the centralized scheduler places long tasks (L) on nodes; probes from the distributed schedulers that reach a node holding a long task are rejected and rescheduled.]
II.1. Eagle – Divide
No head-of-line blocking
Dynamic: because the division is dynamic, resource wastage is mitigated
Scalable: no burden on the centralized scheduler
Succinct: bitmap
II.2. Problem: stragglers
[Figure: a distributed scheduler with task 1 and task 2 sends probes to nodes n1 … n4; one task is waiting to execute on a busy node while another node is free.]
Completely distributed schedulers, like those in Hawk, Sparrow and Tarcil, send random probes to nodes.
II.2. Rationale
Expected completion time of a job is proportional to the number of jobs present in the system*
Better to finish one job entirely than to execute many jobs partially
[Figure: jobs 1 … N, each with several tasks.]
*Little's formula: A proof for the queuing formula: L = λW. J. D. C. Little, 1961
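A small worked example of this rationale, assuming a single execution slot, three jobs of four unit-length tasks each, and hypothetical helper names. It compares finishing jobs one after another with interleaving their tasks, and relates the result to Little's formula L = λW.

```python
def completion_times(order):
    """Map each job id to the time its last task finishes.

    `order` lists one job id per unit-length task, run back to back on one slot.
    """
    finish = {}
    for t, job in enumerate(order, start=1):
        finish[job] = t  # later tasks of the same job overwrite earlier ones
    return finish


def mean(values):
    values = list(values)
    return sum(values) / len(values)


num_jobs, tasks_per_job = 3, 4
job_by_job = [j for j in range(num_jobs) for _ in range(tasks_per_job)]
interleaved = [j for _ in range(tasks_per_job) for j in range(num_jobs)]

for name, order in (("job by job", job_by_job), ("interleaved", interleaved)):
    w = mean(completion_times(order).values())   # mean job completion time W
    throughput = num_jobs / len(order)           # jobs completed per time unit
    print(f"{name}: W = {w}, mean jobs in system L = {throughput * w}")
```

Both orders do the same total work in the same total time, so the throughput is identical; the order that keeps fewer jobs in the system on average is the one with the lower mean completion time.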
II.2. Eagle – Stick to Your Probes
IDEA: Get a job out of the system ASAP
Sticky Batch Probing
* Probe STICKS to a node.
* Probe can execute more tasks.
II.2. Eagle – Stick to Your Probes
[Figure: a distributed scheduler with task 1 and task 2 sends probes to nodes n1 … n4; the probe sticks to its node.]
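The effect of a sticky probe can be sketched with a toy model. The assumptions are loud ones: each probed node becomes available to the job at a known time, a non-sticky probe runs exactly one task and then leaves, and a sticky probe keeps asking the job's scheduler for the next task until none remain. The function name and the numbers are hypothetical.

```python
import heapq
from collections import deque


def job_completion_time(node_free_at, task_durations, sticky):
    """Completion time of one job whose probes sit on the given nodes.

    node_free_at[i] is the time at which node i reaches the job's probe.
    """
    pending = deque(task_durations)
    events = [(free_at, node) for node, free_at in enumerate(node_free_at)]
    heapq.heapify(events)
    finish = 0.0
    while pending and events:
        now, node = heapq.heappop(events)      # next node to serve this job
        done = now + pending.popleft()         # run one task of the job
        finish = max(finish, done)
        if sticky:
            heapq.heappush(events, (done, node))  # the probe stays on the node
    if pending:
        raise ValueError("not enough probes to run every task")
    return finish


# Two 1 s tasks; one probed node is busy until t = 10, the other is free at t = 0.
one_task_per_probe = job_completion_time([10.0, 0.0], [1.0, 1.0], sticky=False)
sticky_probe = job_completion_time([10.0, 0.0], [1.0, 1.0], sticky=True)
```

In this toy case the free node runs both tasks when the probe sticks, so the job no longer waits for the busy node.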
II.2. Eagle – Stick to Your Probes
Job-awareness
Straggler mitigation
Decentralized
Speaker note: end on a high note.
II. Eagle – Recap
Divide: dynamically divide nodes for short/long tasks
Stick to Your Probes: a probe sticks to the node and is able to execute more tasks
Hybrid scheduler
Queue reordering: Shortest Remaining Processing Time (SRPT)
Related work has shown the advantages of queue reordering.
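SRPT queue reordering can be sketched as a priority queue keyed on an estimate of remaining processing time. The class below is a generic illustration with hypothetical names; how the estimate is obtained is an assumption left outside the sketch, and starvation of long entries is not handled.

```python
import heapq
import itertools


class SrptQueue:
    """Node queue that always serves the entry with the least remaining time."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: earlier arrival first

    def push(self, task_id, remaining_time):
        heapq.heappush(self._heap, (remaining_time, next(self._arrival), task_id))

    def pop(self):
        _, _, task_id = heapq.heappop(self._heap)
        return task_id

    def __len__(self):
        return len(self._heap)


queue = SrptQueue()
queue.push("a", 30.0)
queue.push("b", 5.0)
queue.push("c", 12.0)
served = [queue.pop() for _ in range(len(queue))]
```

A plain FIFO queue would serve `a` first; the reordered queue serves the shortest entry first and falls back to arrival order on ties.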
III. Evaluation – simulation
Event-driven simulator
Google trace: half a million jobs
15,000 to 23,000 nodes
Measure: job running time
Report: 50th, 90th and 99th percentiles for short jobs
III.A. Hawk
Hybrid scheduler
Work stealing: free nodes steal tasks from other nodes to try to avoid head-of-line blocking
As we will see, this does not really avoid head-of-line blocking.
III.A. Eagle vs Hawk
[Figure: short job running times; lower is better.]
Better across the board
We show only short jobs because long jobs are scheduled in the same Least Work Left (LWL) fashion in both systems.
III.A. Eagle vs Hawk
Why are we better?
Comparison of Eagle and Hawk on: avoiding head-of-line blocking (short behind long: Eagle none, Hawk some), job-aware scheduling, queue reordering
Partitioning + stealing do not get rid of all short-behind-long cases
Stealing is randomized
III.B. State-of-the-art (SOTA)
[Apollo+] Schedule all jobs in Least Work Left (LWL)
[Apollo+] Distributed: waiting times updated at heartbeat interval (Google: 3 s)
[Yaq-d*] Queue reordering: SRPT
+Apollo: Scalable and coordinated scheduling for cloud-scale computing. E. Boutin et al., OSDI '14
*Efficient queue management for cluster scheduling. J. Rasley et al., EuroSys '16
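Least Work Left placement can be illustrated with a few lines of code. This is a generic sketch that assumes the scheduler knows an estimate of the queued work per node and of each task's duration; the function name is hypothetical, and the sketch does not model heartbeat delays or concurrent schedulers.

```python
def place_lwl(work_left, task_durations):
    """Assign each task to the node with the least estimated work left.

    Returns the chosen node per task and the updated per-node estimates.
    """
    work = list(work_left)  # copy so the caller's estimates are not modified
    assignment = []
    for duration in task_durations:
        node = min(range(len(work)), key=work.__getitem__)  # first minimum wins
        assignment.append(node)
        work[node] += duration  # the new task adds to that node's backlog
    return assignment, work


assignment, updated = place_lwl([5.0, 2.0, 9.0], [4.0, 1.0, 3.0])
```

Each task is bound to exactly one node at placement time, so the quality of the decision depends on how fresh the per-node estimates are.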
III.B. Eagle vs SOTA
[Figure: short job running times from lower to higher load; lower is better.]
Better across the board
Better at higher loads
The same at lower loads
III.B. Eagle vs SOTA
Why are we better?
Eagle: more flexible task assignment
SOTA: task assigned to one node
SOTA heartbeats: stale information
SOTA: concurrent scheduling
III. Evaluation – Implementation
Spark plug-in
100-node cluster
Subset of Google trace
Measure: job running time
Report: 50th, 90th and 99th percentiles for short jobs
Compare to Hawk
The other system was not available to us.
III. Evaluation – Implementation
[Figure: results on a subset of the Google trace; lower is better.]
Eagle works well in a real cluster
Better at higher loads
The same at lower loads
IV. Conclusion
Eagle: two new techniques to improve scheduling of data-parallel jobs in data centers
Succinct State Sharing (Divide): no head-of-line blocking. SSS dynamically divides nodes into long and short partitions.
Sticky Batch Probing (Stick to Your Probes): job-aware. With SBP, a probe sticks until the job is done.