Presented by Chen, Ting-Wei Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids Maria Chtepen, Filip H.A. Claeys, Bart Dhoedt,

Presentation transcript:

Presented by Chen, Ting-Wei
Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids
Maria Chtepen, Filip H.A. Claeys, Bart Dhoedt, Member, IEEE, Filip De Turck, Member, IEEE, Piet Demeester, Senior Member, IEEE, and Peter A. Vanrolleghem

2 Table of Contents
Introduction
Adaptive Checkpointing Heuristics
Replication-Based Heuristics
Conclusion and Future Work

3 Introduction
A novel fault-tolerant algorithm combines
– Checkpointing
– Replication
It is evaluated in
– A newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE)

4 Introduction (cont.)
Simulation
– Runs employ workload and system parameters taken from the logs of several large-scale parallel production systems
– Uses the discrete-event grid simulator DSiDE

5 Introduction (cont.)
Comparable throughput and fault tolerance
– Static checkpointing with optimal parameters
– Replication with optimal parameters

6 Adaptive Checkpointing Heuristics
The Checkpointing Model
– Limits
    Runtime overhead (C)
    Network latency (L)
    Recovery delay (R)
– Concentrates on the reduction of the checkpointing runtime overhead
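As a rough illustration of why the runtime overhead C matters, the sketch below counts the checkpoints a job takes under a fixed interval and the time they add. The function name, the rule that a checkpoint is taken after every `interval` seconds of useful work, and the failure-free run are simplifying assumptions introduced here, not details from the slides.

```python
import math

def periodic_checkpoint_overhead(exec_time, interval, overhead_c):
    """Return (checkpoint count, total runtime) for a failure-free run.

    Assumption: one checkpoint is taken after every `interval` seconds of
    useful work, and none is taken at the very end of the job.
    """
    if exec_time <= 0 or interval <= 0:
        raise ValueError("exec_time and interval must be positive")
    # Count only the checkpoints that fall strictly before job completion.
    count = math.ceil(exec_time / interval) - 1
    return count, exec_time + count * overhead_c
```

Under these assumptions, a one-hour job with a ten-minute interval and C = 30 seconds takes five checkpoints, so shortening the interval trades extra runtime for less lost work after a failure.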

7 Adaptive Checkpointing Heuristics (cont.)
– Problem
    Assumes that the execution time can be determined exactly in advance
– Simulation
    Gives the upper bounds of the algorithms' performance with respect to this parameter

8 Adaptive Checkpointing Heuristics (cont.)
Last Failure Dependent Checkpointing (LastFailureCP)
– Goal
    To reduce the overhead

9 Adaptive Checkpointing Heuristics (cont.)
Mean Failure Dependent Checkpointing (MeanFailureCP)
– LastFailureCP only considers checkpoint omissions
– MeanFailureCP modifies the checkpointing interval based on runtime information
    The remaining job execution time
    The average failure interval of the resource
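The slide names the two runtime inputs but not the update rule itself. One possible sketch follows, under the assumption that the interval is lengthened when the job is expected to finish before the next failure and shortened otherwise. The function name, the fixed `step`, and the clamping bounds are hypothetical.

```python
def adapt_interval(current, remaining_time, mean_failure_interval,
                   step, min_interval, max_interval):
    """Return the next checkpointing interval (all values in seconds).

    Assumed rule (not taken from the slides): checkpoint less often when the
    remaining execution time is shorter than the resource's average failure
    interval, and more often otherwise.
    """
    if remaining_time < mean_failure_interval:
        proposed = current + step
    else:
        proposed = current - step
    # Clamp so the interval never collapses to zero or grows without bound.
    return max(min_interval, min(max_interval, proposed))
```

For example, `adapt_interval(600, 1000, 5000, 300, 300, 3600)` lengthens the interval to 900 seconds, because the job has less time left than the resource typically stays up.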

10 Adaptive Checkpointing Heuristics (cont.)
DSiDE Simulation Environment
– Goal
    Validate the proposed heuristics
– Architecture
    DExec
    DGen
– Each DSiDE event has a time stamp
    Provided a priori or at runtime
– Supports several types of dynamic system modifications

11 Adaptive Checkpointing Heuristics (cont.)
[Figure: The DSiDE simulator architecture]

12 Adaptive Checkpointing Heuristics (cont.)
– The resource performed useful computations
– Total grid availability
– DSiDE provides a set of events to specify network links and routes

13 Adaptive Checkpointing Heuristics (cont.)
Simulation Results
– To compare the performance
    Checkpointing heuristics
    Realistic workload
    System failure model

14 Adaptive Checkpointing Heuristics (cont.)
– Job submission time
    80% between 7 a.m. and 9 p.m.
    20% between 9 p.m. and 7 a.m.
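A workload generator could reproduce this day/night split roughly as follows. This is only a sketch: the function name is hypothetical, and drawing times uniformly within each window is an assumption, since the slide gives only the 80/20 proportions.

```python
import random

def sample_submit_hour(rng):
    """Draw a submission time as an hour of day.

    80% of draws fall in the 7 a.m. to 9 p.m. window, the rest in the
    9 p.m. to 7 a.m. window. Uniform sampling inside each window is an
    assumption made for this sketch.
    """
    if rng.random() < 0.8:
        return rng.uniform(7.0, 21.0)
    # The night window spans midnight: 9 p.m. to 7 a.m. is 10 hours.
    return (21.0 + rng.uniform(0.0, 10.0)) % 24.0

rng = random.Random(0)  # fixed seed so runs are repeatable
hours = [sample_submit_hour(rng) for _ in range(20000)]
day_fraction = sum(7.0 <= h <= 21.0 for h in hours) / len(hours)
```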

15 Adaptive Checkpointing Heuristics (cont.)
– Execution time
    More than 80 percent of all submitted jobs have medium execution times, from 1 hour to 6 hours

16 Adaptive Checkpointing Heuristics (cont.)
– As the checkpointing interval (I) decreases, longer jobs can get processed
– An increase in job runtime is in effect
– The results
    The results achieved with PeriodicCP are partially improved by LastFailureCP due to the omission of redundant checkpoints
    The technique provides the best results for short checkpointing intervals
    The effectiveness of LastFailureCP strongly depends on failure periodicity

17 Adaptive Checkpointing Heuristics (cont.)
Failures occur quite periodically
– Can easily be predicted by the algorithm
– LastFailureCP will perform similarly to PeriodicCP
The fully dynamic scheme of MeanFailureCP proves to be the most effective
A selective increase in checkpointing keeps the number of processed jobs and the average execution time of MeanFailureCP more or less constant
For the PeriodicCP and LastFailureCP algorithms, the performance drops considerably

18 Replication-based Heuristics
Load-Dependent Replication (LoadDependentRep)
– Provides fault tolerance in distributed environments through replication
    Idle resources can be utilized to run job copies without significantly delaying the execution of the original job

19 Replication-based Heuristics (cont.)
– The algorithm requires a number of parameters to be provided in advance
    Minimum number of job copies (Rep_min)
    Maximum number of job copies (Rep_max)
    The CPU limit (CL)

20 Replication-based Heuristics (cont.)
– The outcome of comparing the CPU availability (CA) with CL determines the choice of the next job to be scheduled
    CA >= CL: a job with fewer than Rep_max replicas
    0 < CA < CL: a job with fewer than Rep_min replicas
    CA = 0: skip the current scheduling round
– When one of the job duplicates finishes, the other replicas are automatically canceled
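The three cases above can be sketched as a small selection routine. The function names and the `replica_counts` mapping are hypothetical, and picking the eligible job with the fewest active copies is an assumption; the slide only states which jobs are eligible in each case.

```python
def replica_limit(ca, cl, rep_min, rep_max):
    """Return the replica cap for this scheduling round, or None to skip it."""
    if ca <= 0:
        return None      # CA = 0: skip the current scheduling round
    if ca >= cl:
        return rep_max   # enough idle CPUs: jobs below Rep_max are eligible
    return rep_min       # 0 < CA < CL: only jobs below Rep_min are eligible

def next_job(replica_counts, ca, cl, rep_min, rep_max):
    """Pick the next job to schedule, or None if nothing is eligible.

    `replica_counts` maps job id -> number of active copies. Preferring the
    job with the fewest copies is an assumption made for this sketch.
    """
    limit = replica_limit(ca, cl, rep_min, rep_max)
    if limit is None:
        return None
    eligible = [job for job, n in replica_counts.items() if n < limit]
    if not eligible:
        return None
    return min(eligible, key=lambda job: (replica_counts[job], job))
```

With Rep_min = 1, Rep_max = 3, and CL = 40, a job that already has one copy is eligible only while CA stays at or above 40.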

21 Replication-based Heuristics (cont.)
Failure Detection and Load Dependent Replication (FailureDependentRep)
– Increases the fault tolerance of the previously discussed LoadDependentRep heuristic
– Offers a higher level of fault tolerance compared to solely replication-based strategies
– Does not ensure job execution

22 Replication-based Heuristics (cont.)
Adaptive Checkpoint and Replication-Based Fault Tolerance (CombinedFT)
– Dynamically switches between both techniques based on runtime information on system load
    Checkpointing mode
    Replication mode

23 Replication-based Heuristics (cont.)
– Checkpointing mode
    CPU availability is low (CA < CL)
    CombinedFT rolls back the earlier distributed active job replicas (AR_j) and starts job checkpointing
– AR_j > 0
– AR_j = 0 & CA > 0
– AR_j = 0 & CA = 0 & ∃ i: AR_i > 1
– AR_j = 0 & CA = 0 & ¬∃ i: AR_i > 1

24 Replication-based Heuristics (cont.)
– Replication mode
    Entered when either the system load decreases or enough resources recover from failure (CA >= CL)
    All jobs with fewer than Rep_max replicas are considered for submission to the available resources
    A job is assigned to the fastest resource, connected to a grid site S with the maximum Speed_S and the smallest number of identical replicas
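The mode switch and the placement rule can be sketched as follows. The function names and the site dictionaries are hypothetical, and the priority order (fewest identical replicas first, then highest speed) is an assumption, since the slide does not say how the two criteria are weighed.

```python
def select_mode(ca, cl):
    """Checkpointing mode when CPU availability is low, else replication."""
    return "checkpointing" if ca < cl else "replication"

def pick_site(job, sites):
    """Choose a grid site for one more replica of `job`.

    `sites` is a list of dicts with hypothetical keys "name", "speed", and
    "replicas" (job id -> copies already placed at that site). Assumption:
    fewer identical replicas takes priority, then higher speed.
    """
    if not sites:
        return None
    best = min(sites, key=lambda s: (s["replicas"].get(job, 0), -s["speed"]))
    return best["name"]

sites = [
    {"name": "A", "speed": 10, "replicas": {"j1": 1}},
    {"name": "B", "speed": 5, "replicas": {}},
    {"name": "C", "speed": 8, "replicas": {}},
]
```

Here a new replica of `j1` goes to site C, the fastest site that does not already hold a copy, while a job with no replicas anywhere goes to site A.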

25 Replication-based Heuristics (cont.)
Simulation Results
– Approaches
    Unconditional RL(1)
    Unconditional RL(2)
    Unconditional RL(3)
    LoadDependentRL(1, 3, 40)
    FailureDependentRL(1, 3, 40)
    MeanFailureCP
    CombinedFT

26 Replication-based Heuristics (cont.)

27 Replication-based Heuristics (cont.)

28 Conclusion and Future Work
Fault tolerance forms an important problem
– Job checkpointing
– Replication
Evaluated in the DSiDE grid simulator
The runtime overhead characteristic of periodic checkpointing can be reduced

29 Conclusion and Future Work (cont.)
Advantage
– When the distributed system properties are not known in advance, both techniques can best be applied
Future Work
– Scheduling methods will be considered

Presented by Chen, Ting-Wei
Thank you for your attention