Job-aware Scheduling in Eagle: Divide and Stick to Your Probes

Presentation transcript:

Job-aware Scheduling in Eagle: Divide and Stick to Your Probes Pamela Delgado, Diego Didona, Florin Dinu, Willy Zwaenepoel

I. Data-center scheduling. The context of this presentation is data-center scheduling. [Figure: jobs 1 … N, each made up of tasks, go through a scheduler that places the tasks on a cluster.] Outline: Introduction; Eagle: Divide; Eagle: Stick to Your Probes; Evaluation; Conclusion.

I. Data-center scheduling challenges: heterogeneous workloads. Data-center workloads combine tasks with long execution times and tasks with short execution times. For the purpose of this talk, a job that has short tasks is called a short job. Problem: head-of-line blocking, where short tasks wait behind long ones. [Figure: a queue in which short tasks sit behind a long task.]

I. Data-center scheduling challenges: scheduler-induced stragglers and large scale. When one task of a job finishes later than the others, the job completion time suffers. Problem: schedulers work at the task level, which leads to non-job-aware scheduling. Large scale, both in terms of cluster size (tens of thousands) and in terms of load (tens of thousands of tasks per second). [Figure: tasks 1 … n of a job over time on the cluster; task x finishes last and determines the job completion time.]

II. Eagle contributions. Divide: a novel technique to avoid head-of-line blocking. Stick to Your Probes: decentralized job-awareness. Both are built on top of a hybrid scheduler to get the necessary scalability. So what is hybrid scheduling, and how does it work? Hybrid means a mix of centralized and distributed scheduling.

I. Hybrid scheduling: long jobs are scheduled centrally. [Figure: a centralized scheduler places long tasks (L) on the cluster nodes.]

I. Hybrid scheduling: short jobs are scheduled in a distributed way. [Figure: several distributed schedulers send probes for short tasks (s) to nodes that also hold long tasks (L).] Speaker note: "not use late binding".

II.1. Problem: head-of-line blocking. A short task is enqueued behind a long task that is either ahead of it in the queue or already running. The likelihood is high because long tasks account for many resources. [Figure: short tasks queued behind a long task at the head of the queue.]

II.1. Rationale for Divide. The expected completion time of a task grows with the variance of task execution times (Pollaczek-Khinchine formula; L. Kleinrock, Queueing Systems, Volume 1: Theory, 1975). Hence Divide: separate long tasks from short tasks by execution time. [Figure: long tasks on one side, short tasks on the other.]
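The rationale above can be illustrated with a rough numerical sketch. It assumes a single M/G/1 queue (Poisson arrivals, one server), for which the Pollaczek-Khinchine formula gives the mean queueing delay as λ·E[S²] / (2·(1 − ρ)) with ρ = λ·E[S]. The function name `pk_mean_wait`, the task durations and the arrival rates are invented for illustration and are not taken from the talk.

```python
def pk_mean_wait(arrival_rate, durations, probs):
    """Mean queueing delay of an M/G/1 queue (Pollaczek-Khinchine formula).

    durations/probs describe a discrete service-time distribution.
    """
    mean_s = sum(p * s for s, p in zip(durations, probs))
    second_moment = sum(p * s * s for s, p in zip(durations, probs))
    rho = arrival_rate * mean_s  # utilization
    if rho >= 1:
        raise ValueError("unstable queue: utilization must be below 1")
    return arrival_rate * second_moment / (2 * (1 - rho))


# Hypothetical mixed queue: 90% short tasks (1 s), 10% long tasks (100 s).
mixed_wait = pk_mean_wait(0.05, [1.0, 100.0], [0.9, 0.1])

# Hypothetical short-only queue at the same utilization (0.545).
short_only_wait = pk_mean_wait(0.545, [1.0], [1.0])

print(f"mixed queue:      {mixed_wait:.2f} s mean wait")
print(f"short-only queue: {short_only_wait:.2f} s mean wait")
```

At equal utilization, the short-only queue waits far less than the mixed one, because the second moment of the service time is dominated by the rare long tasks. This is only a single-queue model; it does not capture a cluster with many nodes.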

II.1. Dynamic division. [Figure: nodes divided between long tasks and short tasks.]

II.1. Eagle – Divide. Idea: dynamic partitioning through Succinct State Sharing. Centralized scheduler: sends a bitmap of the nodes that hold long tasks. Distributed schedulers: use the bitmap to avoid those nodes.

II.1. Eagle – Divide. [Figure: the centralized scheduler places long tasks (L) on nodes; when a distributed scheduler sends a probe to a node that holds a long task, the node rejects it and the scheduler reschedules the probe elsewhere.]
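The bitmap, reject and reschedule steps can be sketched as follows. This is a minimal illustration under stated assumptions, not Eagle's implementation: the function names, the integer-as-bitmap encoding, and the rule that a scheduler simply marks a rejecting node as busy before retrying are all hypothetical choices made here.

```python
import random


def long_task_bitmap(nodes_with_long_tasks):
    """Centralized view: one bit per node, set if the node holds a long task."""
    bitmap = 0
    for node in nodes_with_long_tasks:
        bitmap |= 1 << node
    return bitmap


def has_long_task(bitmap, node):
    return (bitmap >> node) & 1 == 1


def place_short_probe(num_nodes, stale_bitmap, actual_long_nodes, rng):
    """Distributed view: probe a random node that the bitmap marks as free.

    The bitmap may be stale. A node that has received a long task in the
    meantime rejects the probe; the scheduler records that and retries.
    Returns the chosen node, or None if every node holds a long task.
    """
    known = stale_bitmap
    while True:
        candidates = [n for n in range(num_nodes) if not has_long_task(known, n)]
        if not candidates:
            return None
        node = rng.choice(candidates)
        if node in actual_long_nodes:
            known |= 1 << node  # probe rejected: remember and reschedule
            continue
        return node


# The bitmap says nodes 0-2 hold long tasks; nodes 3 and 4 got one since.
bitmap = long_task_bitmap([0, 1, 2])
chosen = place_short_probe(8, bitmap, {0, 1, 2, 3, 4}, random.Random(0))
print("short probe placed on node", chosen)
```

Each rejection removes one candidate, so the retry loop always terminates.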

II.1. Eagle – Divide: properties. No head-of-line blocking. Dynamic: because the division is dynamic, resource wastage is mitigated. Scalable: no burden on the centralized scheduler. Succinct: the shared state is a bitmap.

II.2. Problem: stragglers. Completely distributed schedulers, as in Hawk, Sparrow and Tarcil, send random probes to nodes. [Figure: a distributed scheduler has sent probes for task 1 and task 2 to nodes n1 to n4; one task is still waiting to execute while another node is free.]

II.2. Rationale. The expected completion time of a job is proportional to the number of jobs present in the system (Little's formula, L = λW; J.D.C. Little, "A proof for the queuing formula: L = λW", 1961). It is therefore better to finish one job entirely than to execute many jobs partially. [Figure: jobs 1 … N, each made up of tasks.]
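A toy calculation makes the "finish one job entirely" argument concrete. Little's formula L = λW says that, for a fixed arrival rate λ, fewer jobs in the system (L) goes together with a shorter mean time in the system (W). The sketch below assumes a single execution slot and unit-length tasks; the function name and the two-job workload are hypothetical.

```python
def mean_job_completion(order, tasks_per_job):
    """Mean job completion time on a single slot with unit-length tasks.

    order: sequence of job ids, one entry per executed task.
    tasks_per_job: mapping from job id to its number of tasks.
    """
    remaining = dict(tasks_per_job)
    finish = {}
    for time, job in enumerate(order, start=1):
        remaining[job] -= 1
        if remaining[job] == 0:
            finish[job] = time  # the job leaves the system here
    return sum(finish.values()) / len(finish)


jobs = {"job1": 3, "job2": 3}

# Many jobs executed partially: tasks of the two jobs are interleaved.
interleaved = mean_job_completion(
    ["job1", "job2", "job1", "job2", "job1", "job2"], jobs)

# One job finished entirely before the next one starts.
one_at_a_time = mean_job_completion(
    ["job1", "job1", "job1", "job2", "job2", "job2"], jobs)

print(interleaved, one_at_a_time)
```

Both schedules do the same total work, but finishing job1 first removes it from the system earlier and lowers the mean completion time.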

II.2. Eagle – Stick to Your Probes. Idea: get a job out of the system as soon as possible. Sticky Batch Probing: a probe sticks to a node, and a probe can execute more tasks.

II.2. Eagle – Stick to Your Probes. [Figure: a distributed scheduler with task 1 and task 2 and nodes n1 to n4; the probe sticks to its node.]
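The effect of a sticky probe can be sketched with a small simulation. It is a simplified model, not the system's code: `job_finish_time` and its inputs are hypothetical, each probe is reduced to the time at which it reaches the head of its node's queue, and other work queued on the same node is ignored.

```python
import heapq


def job_finish_time(probe_ready_times, task_durations, sticky):
    """Finish time of one job whose tasks are pulled by probes.

    probe_ready_times: time at which each probe reaches the head of its
        node's queue (one probe per node).
    sticky: if True, a probe stays on its node and pulls the next task of
        the same job; if False, a probe runs a single task and is done.
    """
    pending = list(task_durations)[::-1]  # pop() hands out tasks in order
    ready = list(probe_ready_times)
    heapq.heapify(ready)
    finish = 0.0
    while pending and ready:
        start = heapq.heappop(ready)      # earliest available probe
        done = start + pending.pop()
        finish = max(finish, done)
        if sticky:
            heapq.heappush(ready, done)   # the probe sticks to its node
    if pending:
        raise ValueError("not enough probes for the remaining tasks")
    return finish


# One node is free now; the other two are busy until t=10 and t=12.
probes = [0.0, 10.0, 12.0]
tasks = [2.0, 2.0, 2.0]
print(job_finish_time(probes, tasks, sticky=False))
print(job_finish_time(probes, tasks, sticky=True))
```

Without stickiness the job waits for the busy nodes; with a sticky probe the free node runs all three tasks back to back.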

II.2. Eagle – Stick to Your Probes: properties. Job-awareness. Straggler mitigation. Decentralized.

II. Eagle – Recap. Divide: dynamically divide nodes for short and long tasks. Stick to Your Probes: a probe sticks to the node and is able to execute more tasks. Hybrid scheduler with queue reordering: Shortest Remaining Processing Time (SRPT). Related work has shown the advantages of queue reordering.
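SRPT reordering of a node queue can be sketched with a priority queue. This is a minimal illustration: the class name `SrptQueue` and the idea of keying each entry on an estimate of the job's remaining processing time are assumptions made here, and no starvation safeguard is modelled.

```python
import heapq
import itertools


class SrptQueue:
    """Node queue that always serves the entry with the least remaining time."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: earlier arrival first

    def push(self, task_id, remaining_time):
        heapq.heappush(self._heap, (remaining_time, next(self._arrival), task_id))

    def pop(self):
        _, _, task_id = heapq.heappop(self._heap)
        return task_id

    def __len__(self):
        return len(self._heap)


queue = SrptQueue()
queue.push("a", 30.0)
queue.push("b", 5.0)
queue.push("c", 5.0)
queue.push("d", 12.0)

served = [queue.pop() for _ in range(len(queue))]
print(served)
```

Entries with equal remaining time are served in arrival order, which keeps the reordering deterministic.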

III. Evaluation: simulation. Event-driven simulator. Google trace with half a million jobs. 15,000 to 23,000 nodes. Metric: job running time. Reported: 50th, 90th and 99th percentiles for short jobs.
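For readers who want to reproduce this kind of report on their own measurements, the sketch below computes the three percentiles with the nearest-rank method. The talk does not say which percentile definition was used, so the method, the function name and the sample data are assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical running times (seconds) of 100 short jobs.
running_times = list(range(1, 101))

report = {pct: percentile(running_times, pct) for pct in (50, 90, 99)}
print(report)
```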

III.A. Hawk. Hybrid scheduler with work stealing: free nodes steal tasks from other nodes in an attempt to avoid head-of-line blocking. As we will see, this does not really avoid head-of-line blocking.

III.A. Eagle vs Hawk. Short job running times (lower is better). Eagle is better across the board. Only short jobs are shown because long jobs are scheduled in the same Least Work Left (LWL) fashion in both systems.

III.A. Eagle vs Hawk: why are we better? Head-of-line blocking: Eagle none, Hawk some. Job-aware scheduler: Eagle yes, Hawk no. Queue reordering: Eagle yes, Hawk no. In Hawk, partitioning plus stealing does not get rid of all short-behind-long cases, and stealing is randomized.

III.B. State of the art (SOTA). Schedule all jobs in Least Work Left (LWL) fashion [Apollo]. Distributed: waiting times are updated at a heartbeat interval [Apollo]; for the Google trace, 3 s. Queue reordering with SRPT [Yaq-d]. References: [Apollo] E. Boutin et al., "Apollo: Scalable and coordinated scheduling for cloud-scale computing", OSDI'14. [Yaq-d] J. Rasley et al., "Efficient queue management for cluster scheduling", EuroSys'16.
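Least Work Left placement from a heartbeat snapshot can be sketched as below. This is a simplified illustration, not Apollo's algorithm: the function name, the node names and the backlog figures are hypothetical, and a "snapshot" is just a dictionary of estimated queued work per node that stays fixed until the next heartbeat.

```python
def place_tasks(snapshot, task_durations):
    """Place tasks on the node with the least work left.

    snapshot: estimated queued work (seconds) per node, as of the last
        heartbeat. The scheduler updates only its own private copy.
    Returns the chosen node for each task, in order.
    """
    view = dict(snapshot)  # private copy; other schedulers do not see it
    placements = []
    for duration in task_durations:
        node = min(view, key=view.get)
        placements.append(node)
        view[node] += duration
    return placements


heartbeat = {"n1": 5.0, "n2": 1.0, "n3": 3.0}

# One scheduler placing two tasks accounts for its own first placement.
single = place_tasks(heartbeat, [4.0, 1.0])

# Two schedulers acting concurrently on the same snapshot pick the same node.
scheduler_a = place_tasks(heartbeat, [4.0])
scheduler_b = place_tasks(heartbeat, [4.0])

print(single, scheduler_a, scheduler_b)
```

The second case shows how information that is only refreshed at heartbeats lets concurrent schedulers pile work onto one node.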

III.B. Eagle vs SOTA. Short job running times (lower is better), from lower to higher load. Eagle is better across the board at higher loads and the same at lower loads.

III.B. Eagle vs SOTA: why are we better? Eagle: more flexible task assignment. SOTA: each task is assigned to one node; heartbeats carry stale information; schedulers place tasks concurrently.

III. Evaluation: implementation. Spark plug-in. 100-node cluster. Subset of the Google trace. Metric: job running time. Reported: 50th, 90th and 99th percentiles for short jobs. Compared to Hawk only, because the other system is not available to us.

III. Evaluation: implementation results. Subset of the Google trace (lower is better). Eagle works well in a real cluster: better at higher loads, the same at lower loads.

IV. Conclusion. Eagle introduces two new techniques to improve the scheduling of data-parallel jobs in data centers. Succinct State Sharing (Divide): dynamically divides nodes into long and short partitions, so there is no head-of-line blocking. Sticky Batch Probing (Stick to Your Probes): a probe sticks until the job is done, which makes scheduling job-aware.