1/22 Optimization of Google Cloud Task Processing with Checkpoint-Restart Mechanism Speaker: Sheng Di Coauthors: Yves Robert, Frédéric Vivien, Derrick Kondo, Franck Cappello

2/22 Outline
- Background of Google Cloud Task Processing
- System Overview
- Research Formulation
- Optimization of Fault-tolerance
  - Optimization of the Number of Checkpoints
  - Adaptive Optimization of Fault Tolerance
  - Local disk vs. Shared disk
- Performance Evaluation
- Conclusion and Future Work

3/22 Background
- Google trace (released in 2011): 670,000 jobs, 2,500,000 tasks, 12,000 nodes
  - One-month period (29 days)
  - Various events, resource requests/allocations, job/task lengths, various attributes, etc.
- Two types of jobs in the Google trace: sequential-task jobs and bag-of-task jobs
  - 4,000 application types, such as MapReduce
- Failure events occur often for some tasks!
- Most task lengths are short (a few to dozens of minutes), so task execution is sensitive to checkpointing cost.

4/22 System Overview
- User Interface: receives tasks
- Task Scheduling: coordinates resource competition among hosts
- Resource Allocation: coordinates resource usage within a particular host

5/22 System Overview (Cont'd)
[Figure: task processing procedure]

6/22 Research Formulation
- Analysis of the Google trace: task failure intervals, task lengths, job structure
- Equidistant checkpointing model: the checkpointing interval for a particular task is fixed
- Task execution model (suppose $K$ failures between task entry and task exit):
  $T_w(\text{task}) = T_e(\text{task}) + (x-1)C + \sum_{i=1}^{K} \text{roll-back-loss}_i + \sum_{i=1}^{K} \text{restart-cost}_i$
  where $T_w$ is the task's wall-clock time, $T_e$ its productive time, $C$ the checkpoint cost, and $x$ the number of checkpointing intervals
- Objective: minimize $E(T_w(\text{task}))$
  - Random variable: $K$ (number of task failure events)
- Goal: compute the optimal number of checkpoints for a Google task
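A minimal numeric sketch (ours, not from the paper) of the wall-clock model above. It assumes a failure strikes uniformly at random within an interval, so the expected roll-back loss per failure is half an interval, and it uses a hypothetical fixed restart cost R:

```python
# Sketch of E[T_w] = T_e + (x-1)C + sum(roll-back losses) + sum(restart costs)
# under equidistant checkpointing. Assumptions: expected roll-back loss per
# failure is half an interval; R is a hypothetical fixed restart cost.

def expected_wall_clock(T_e, x, C, mean_failures, R):
    """Expected wall-clock time with x equidistant checkpointing intervals."""
    interval = T_e / x                        # productive length per interval
    rollback = mean_failures * interval / 2   # expected total roll-back loss
    restart = mean_failures * R               # expected total restart cost
    return T_e + (x - 1) * C + rollback + restart

# Example task from the next slide: T_e = 18 s, C = 2 s, 2 expected failures.
for x in range(1, 7):
    print(x, expected_wall_clock(18, x, 2, 2, 1))
```

Running this shows the expected wall-clock time bottoming out at x = 3 intervals, matching the worked example on the next slide.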

7/22 Optimization of the Number of Checkpoints: New Formula
Theorem 1, Formula (3): $x^* = \sqrt{\frac{T_e \cdot E(Y)}{2C}}$
- $x^*$: the optimal number of checkpointing intervals
- $T_e$: task execution length (productive length)
- $E(Y)$: the task's expected number of failures (characterized by MNOF, the mean number of failures)
- $C$: checkpoint cost (time increment per checkpoint)
Example: a task's productive length is 18 seconds, C = 2 s, and the expected number of failures during its execution is 2.
- Optimal number of checkpointing intervals: sqrt(18*2/(2*2)) = 3
- Optimal checkpointing interval: 18/3 = 6 seconds
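Formula (3) as code, reconstructed from the slide's worked example (the variable names are ours):

```python
import math

def optimal_num_intervals(T_e, mnof, C):
    """Optimal number of equidistant checkpointing intervals, Formula (3)."""
    x_star = math.sqrt(T_e * mnof / (2 * C))
    return max(1, round(x_star))  # use a whole number of intervals

x = optimal_num_intervals(18, 2, 2)
print(x, 18 / x)  # -> 3 intervals of 6.0 seconds, matching the example
```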

8/22 Optimization of the Number of Checkpoints: Discussion
Formula (3) does not depend on a probability distribution, unlike Young's formula.
Young's formula (proposed in 1977), optimal checkpoint interval: $T_c = \sqrt{2\,C\,T_f}$
- $C$: checkpointing cost
- $T_f$: mean time between failures (MTBF)
Conditions:
(1) Task failure intervals follow an exponential distribution
(2) Checkpoint cost C is far smaller than the checkpoint interval $T_c$ (derived via a Taylor series with a second-order approximation)
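A small illustration (ours): when failures are exponential and the expected failure count is E(Y) = T_e / MTBF, Formula (3) yields the same interval as Young's formula, previewing Corollary 1 on a later slide:

```python
import math

def young_interval(C, mtbf):
    return math.sqrt(2 * C * mtbf)          # Young's optimal interval

def formula3_interval(T_e, mnof, C):
    return T_e / math.sqrt(T_e * mnof / (2 * C))  # T_e / x*

C, T_e = 2.0, 18.0
mtbf = 9.0                                  # 2 expected failures in 18 s
print(young_interval(C, mtbf))              # -> 6.0
print(formula3_interval(T_e, T_e / mtbf, C))  # -> 6.0, the two coincide
```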

9/22 Optimization of the Number of Checkpoints: Discussion
The assumption of exponentially distributed failure intervals makes Young's formula unsuitable for Google task processing.
[Figure: distribution of Google task failure intervals, broken down by priority]

10/22 Optimization of the Number of Checkpoints: Discussion
Corollary 1: Young's formula is a special case of Formula (3), under two important conditions:
- Task failure intervals follow an exponential distribution
- Checkpointing cost is small

11/22 Optimization of the Number of Checkpoints: Discussion
Our Formula (3) is easier to apply in practice than Young's formula:
- Young's formula depends on MTBF, and MTBF may be hard to predict precisely, due to:
  - Unsynchronized clocks across hosts
  - The inevitable influence of checkpointing cost
  - Significant delays in failure detection
- By contrast, MNOF is easy to record accurately.

12/22 Adaptive Optimization of Checkpoint Positions
- Problem: what if the probability distribution of failure intervals (i.e., the failure rate) changes over time? This is possible due to changeable task priority.
- Objective: design an adaptive algorithm that dynamically suits the changing failure rates.
- Question: do the optimal checkpoint positions change as the remaining workload decreases over time?
- Solution: we only need to monitor MNOF, regardless of the decreasing remaining workload, because of Theorem 2.
[Figure: optimal checkpoint intervals re-evaluated at the k-th and (k+1)-th checkpoints]

13/22 Adaptive Optimization of Fault Tolerance (Cont'd)
Theorem 2: the optimal number of checkpointing intervals computed at the (k+1)-th checkpoint position is consistent with the optimal number computed at the k-th checkpoint position, so the remaining optimal checkpoint positions do not move as the workload shrinks (as long as the failure rate is unchanged).
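A hedged sketch of the adaptive scheme suggested by Theorem 2: at each checkpoint, re-estimate MNOF from the failures observed so far and recompute the optimal split of the remaining productive work. All names here are ours, not the paper's:

```python
import math

def replan_intervals(remaining_work, failures_so_far, productive_so_far, C):
    """Recompute the optimal number of intervals for the remaining work,
    scaling the expected failure count (MNOF) to the remaining length."""
    failure_rate = failures_so_far / max(productive_so_far, 1e-9)
    expected_failures = failure_rate * remaining_work  # MNOF for the remainder
    x = math.sqrt(remaining_work * expected_failures / (2 * C))
    return max(1, round(x))

# If the observed failure rate is unchanged, the replanned positions coincide
# with the original ones (Theorem 2); if a priority change shifts the rate,
# the checkpointing interval adapts automatically.
print(replan_intervals(remaining_work=12.0, failures_so_far=1,
                       productive_so_far=6.0, C=2.0))
```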

14/22 Local Disk vs. Shared Disk Checkpointing
Characterization based on BLCR (Berkeley Lab Checkpoint/Restart).
[Table: operation time costs of setting a checkpoint on a local disk vs. a shared disk]

15/22 Performance Evaluation
Experimental setting:
- We built a testbed based on the Google trace, in a cluster with hundreds of VM instances running across 16 nodes (16*8 cores, 16*16 GB memory, Xen 4.0, BLCR).
- We call it GloudSim (Google-trace based cloud simulation system) [under review by HiPC'13].
- We reproduce Google task execution as closely as possible to the Google trace, e.g.:
  - Task arrivals follow the trace or some distribution
  - Task memory usage is reproduced from the Google trace
  - Task failure events are reproduced from the Google trace
  - Each job is chosen from among all sample jobs in the trace

16/22 Performance Evaluation (Cont'd)
Experimental results: job Workload-Processing Ratio (WPR).
[Figure: checkpointing effect with precise prediction (of MNOF and MTBF)]
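The slides use WPR without an explicit formula; a natural reading, which we assume here rather than take from the slides, is the fraction of a job's wall-clock time spent on productive work:

```latex
\mathrm{WPR}(\text{job}) = \frac{T_e(\text{job})}{T_w(\text{job})} \in (0, 1]
```

Under this reading, WPR = 1 corresponds to a failure-free, checkpoint-free run, and lower values mean more time lost to checkpoints, roll-backs, and restarts.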

17/22 Performance Evaluation (Cont'd)
[Figure: distribution of WPR with different C/R formulas]

18/22 Performance Evaluation (Cont'd)
MNOF and MTBF with respect to priority in the Google trace: MNOF is stable across task lengths, while MTBF is not (ranging from 179 to 4199 seconds).

19/22 Performance Evaluation (Cont'd)
Min/avg/max WPR with respect to different priorities: our formula outperforms Young's formula by 3-10%.

20/22 Performance Evaluation (Cont'd)
Wall-clock lengths of 10,000 job executions.
Conclusion: job wall-clock lengths are often seconds longer under Young's formula than under ours.

21/22 Performance Evaluation (Cont'd)
[Figure: adaptive algorithm vs. static algorithm]

22/22 Conclusion and Future Work
Selected conclusions:
- Our Formula (3) outperforms Young's formula by 3-10 percent for Google task processing.
- Job wall-clock lengths are often seconds longer under Young's formula than under ours.
- The worst WPR under the dynamic algorithm stays around 0.8, compared to 0.5 under the static algorithm.
Future work:
- Port our theorems to more settings, such as MPI applications over cloud platforms.

23/22 Thanks for your attention!! Contact me at: