Scheduling Jobs Across Geo-distributed Datacenters

Slides:

Advertisements

Similar presentations

CPU Scheduling Tanenbaum Ch 2.4 Silberchatz and Galvin Ch 5.

Advertisements

CH 5. CPU Scheduling Basic Concepts F CPU Scheduling  context switching u CPU switching for another process u saving old PCB and loading.

CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.

RUN: Optimal Multiprocessor Real-Time Scheduling via Reduction to Uniprocessor Paul Regnier † George Lima † Ernesto Massa † Greg Levin ‡ Scott Brandt ‡

Before start… Earlier work single-path routing in sensor networks

Efficient Scheduling of Heterogeneous Continuous Queries Mohamed A. Sharaf Panos K. Chrysanthis Alexandros Labrinidis Kirk Pruhs Advanced Data Management.

OPTIMAL SERVER PROVISIONING AND FREQUENCY ADJUSTMENT IN SERVER CLUSTERS Presented by: Xinying Zheng 09/13/ XINYING ZHENG, YU CAI MICHIGAN TECHNOLOGICAL.

Scheduling of Parallel Jobs In a Heterogeneous Multi-Site Environment By Gerald Sabin from Ohio State Reviewed by Shengchao Yu 02/2005.

ROBUST RESOURCE ALLOCATION OF DAGS IN A HETEROGENEOUS MULTI-CORE SYSTEM Luis Diego Briceño, Jay Smith, H. J. Siegel, Anthony A. Maciejewski, Paul Maxwell,

Analysis of Simulation Results Chapter 25. Overview  Analysis of Simulation Results  Model Verification Techniques  Model Validation Techniques  Transient.

The Common Shock Model for Correlations Between Lines of Insurance

CAMP: Fast and Efficient IP Lookup Architecture Sailesh Kumar, Michela Becchi, Patrick Crowley, Jonathan Turner Washington University in St. Louis.

ANOVA: Analysis of Variance.

On Reducing Mesh Delay for Peer- to-Peer Live Streaming Dongni Ren, Y.-T. Hillman Li, S.-H. Gary Chan Department of Computer Science and Engineering The.

Job Scheduling P. (Saday) Sadayappan Ohio State University.

HPC HPC-5 Systems Integration High Performance Computing 1 Application Resilience: Making Progress in Spite of Failure Nathan A. DeBardeleben and John.

1/L Get Training or Wait? Long-Run Employment Effects of Training Programs for the Unemployed in West Germany Bernd Fitzenberger, Aderonke Osikominu, and.

A Comparison of Application-Level and Router-Assisted Hierarchical Schemes for Reliable Multicast Part 2 of the paper Pavlin Radoslavov, Christos Papadopoulos,

1 Performance Impact of Resource Provisioning on Workflows Gurmeet Singh, Carl Kesselman and Ewa Deelman Information Science Institute University of Southern.

Scheduling Jobs Across Geo-distributed Datacenters Chien-Chun Hung, Leana Golubchik, Minlan Yu Department of Computer Science University of Southern California.

Single Season Study Design. 2 Points for consideration Don’t forget; why, what and how. A well designed study will:  highlight gaps in current knowledge.

Dynamic Resource Allocation for Shared Data Centers Using Online Measurements By- Abhishek Chandra, Weibo Gong and Prashant Shenoy.

Resource Specification Prediction Model Richard Huang joint work with Henri Casanova and Andrew Chien.

OPERATING SYSTEMS CS 3502 Fall 2017

Chris Cai, Shayan Saeed, Indranil Gupta, Roy Campbell, Franck Le

Measurement-based Design

Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Data Partition Dr. Xiao Qin Auburn University.

April 6, 2001 Gary Kimura Lecture #6 April 6, 2001

PA an Coordinated Memory Caching for Parallel Jobs

On Optimal Distributed Kalman Filtering in Non-ideal Situations

Server Allocation for Multiplayer Cloud Gaming

Comparison of the Three CPU Schedulers in Xen

Chapter 6: CPU Scheduling

Chapter 6: CPU Scheduling

Understanding and Exploiting Amazon EC2 Spot Instances

Multi-Core Parallel Routing

CPSC 531: System Modeling and Simulation

Software Reliability Models.

Matching Methods & Propensity Scores

Module 5: CPU Scheduling

Matching Methods & Propensity Scores

Chapter 7 Sampling Distributions.

3: CPU Scheduling Basic Concepts Scheduling Criteria

Chapter5: CPU Scheduling

Strayer University at Arlington, VA

P. (Saday) Sadayappan Ohio State University

CARP: Compression-Aware Replacement Policies

Qingbo Zhu, Asim Shankar and Yuanyuan Zhou

Chapter 6: CPU Scheduling

Matching Methods & Propensity Scores

Load Balancing/Sharing/Scheduling Part II

Chapter 7 Sampling Distributions.

© 2017 by McGraw-Hill Education

Hawk: Hybrid Datacenter Scheduling

Chapter 7 Sampling Distributions.

Uniprocessor Process Management & Process Scheduling

Psych 231: Research Methods in Psychology

Size-Based Scheduling Policies with Inaccurate Scheduling Information

Adaptive Choice of Information Sources

Chapter 6: CPU Scheduling

NET 424: REAL-TIME SYSTEMS (Practical Part)

Chapter 5: CPU Scheduling

Module 5: CPU Scheduling

Chapter 6: CPU Scheduling

Chapter 7 Sampling Distributions.

Uniprocessor Process Management & Process Scheduling

Tiresias A GPU Cluster Manager for Distributed Deep Learning

Module 5: CPU Scheduling

Approximate Mean Value Analysis of a Database Grid Application

Presentation transcript:

Scheduling Jobs Across Geo-distributed Datacenters Tarek Elgamal CS525 4/12/2016

Motivation Contributions A job scheduled on multiple datacenters have potential skews in dividing its tasks among datacenters Job completion time depends on last task across datacenters Shortest Remaining Processing Time (SRPT) scheduling fails to optimize the average job completion time in that case The paper presents two approaches for job scheduling across geo-distributed datacenters: Reordering: add-on to any scheduling approach, proven to provide non-decreasing performance improvement under some assumptions SWAG: stand-alone scheduling approach Contributions

Summary of Findings SWAG outperforms the competitors (Global-SRPT, Independent-SRPT) in all settings, including those improved by Reordering Reordering shows more significant improvement in Independent-SRPT than Global-SRPT SWAG ‘s running time is 4.5 ms (significantly smaller than average task duration in the evaluation, which is 2s) SWAG has less communication overhead than other competitors that require communication Reordering and SWAG are robust to inaccuracies in estimating tasks duration

Pros Cons Significant improvement in job completion time over SRTP (27% for Reordering and 50% for SWAG) Simple and intuitive heuristics Clear illustrative example to highlight SRPT shortcomings Proven non-decreasing performance for Reordering Evaluation is based on real job traces Evaluation is simulation-based, no real- environment evaluation Impractical assumptions (most of them are mentioned explicitly in the paper): homogenous task durations Homogenous datacenter capacity and latency Guaranteed data locality Support Single-stage jobs (no DAGs) Non-negligible comp. and comm. overheads for both approaches Comparison with SRPT only and with one primary metric (job completion time) e scheduling running time is very high with SWAG when the cluster utilization is around 78%.

Thoughts Tasks durations within jobs and across jobs are assumed to be homogenous, heterogeneity is not discussed properly Paper addresses heterogeneous task durations within datacenters but not in the global controller In the evaluation, Pareto distribution with shape parameter 1.259 is assumed Task durations with values more than 3 has low impact on cumulative distribution (i.e. most of task durations across jobs range from 1s-3s) The above observation might explain why the inaccuracies in task durations does not affect SWAG (Task durations are already in a small range) Cumulative Pareto distribution Task durations above 3s has low probability The observation might explain why the experiments shows that inaccuracies in task durations does not affect SWAG (Task durations are already in a small range) Is SWAG the same as global-SRPT but with considering the current state of queue at each datacenter ?

Questions What other metrics can be used? Is avg. job completion time enough? What happens if the distributed tasks of a job require results from the previous task that executed at some other location? Is SWAG the same as global-SRPT but with considering the current state of queue at each datacenter ? Are the overhead calculations reliable in simulation-based experiments?