1
Presented by Chen, Ting-Wei
Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids
Maria Chtepen, Filip H.A. Claeys, Bart Dhoedt, Member, IEEE, Filip De Turck, Member, IEEE, Piet Demeester, Senior Member, IEEE, and Peter A. Vanrolleghem
2
Table of Contents
Introduction
Adaptive Checkpointing Heuristics
Replication-Based Heuristics
Conclusion and Future Work
3
Introduction
A novel fault-tolerant algorithm combines
– Checkpointing
– Replication
It is evaluated in
– A newly developed grid simulation environment
    Dynamic Scheduling in Distributed Environments (DSiDE)
4
Introduction (cont.)
Simulation
– Runs employ workload and system parameters
    Taken from several large-scale parallel production systems’ logs
– Uses the discrete event grid simulator DSiDE
5
Introduction (cont.)
Throughput and fault tolerance comparable to
– Static checkpointing with optimal parameters
– Replication with optimal parameters
6
Adaptive Checkpointing Heuristics
The Checkpointing Model
– Limits
    Runtime overhead (C)
    Network latency (L)
    Recovery delay (R)
– Concentrates on the reduction of the checkpointing runtime overhead
7
Adaptive Checkpointing Heuristics (cont.)
– Problem
    Assumes the execution time can be exactly determined in advance
– Simulation
    Gives the upper bounds of the algorithms' performance with respect to this parameter
8
Adaptive Checkpointing Heuristics (cont.)
Last Failure Dependent Checkpointing (LastFailureCP)
– Goal
    To reduce the overhead
9
Adaptive Checkpointing Heuristics (cont.)
Mean Failure Dependent Checkpointing (MeanFailureCP)
– Unlike LastFailureCP, which only considers checkpoint omissions
– Modifies the checkpointing interval based on the runtime information
    The remaining job execution time
    The average failure interval of the resource
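The slide names the two inputs that MeanFailureCP uses but does not give the update rule itself. The sketch below is therefore only an illustration under an assumed rule: lengthen the interval when the job is expected to finish before the next failure, and shorten it otherwise. The function name, the `step` factor, and the `min_interval` floor are all hypothetical and do not come from the presentation.

```python
def adapt_interval(interval, remaining_time, mean_failure_interval,
                   step=0.5, min_interval=1.0):
    """Return an adjusted checkpointing interval.

    ASSUMPTION: the multiplicative update rule, `step`, and `min_interval`
    are illustrative choices, not taken from the slides.
    """
    if remaining_time < mean_failure_interval:
        # The job will probably finish before the next expected failure,
        # so checkpoint less often.
        return interval * (1 + step)
    # A failure is expected before the job ends, so checkpoint more often,
    # but never below the configured floor.
    return max(min_interval, interval * (1 - step))


if __name__ == "__main__":
    # Times are in arbitrary but consistent units (for example, minutes).
    print(adapt_interval(10, remaining_time=5, mean_failure_interval=20))
    print(adapt_interval(10, remaining_time=50, mean_failure_interval=20))
```

Both inputs are runtime information, so a scheduler could call such a function each time a checkpoint is due.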
10
Adaptive Checkpointing Heuristics (cont.)
DSiDE Simulation Environment
– Goal
    Validate the proposed heuristics
– Architecture
    DExec
    DGen
– Each DSiDE event has a time stamp
    Provided a priori or at runtime
– Supports several types of dynamic system modifications
11
Adaptive Checkpointing Heuristics (cont.)
[Figure: The DSiDE simulator architecture]
12
Adaptive Checkpointing Heuristics (cont.)
– The resource performed useful computations
– Total grid availability
– DSiDE provides a set of events to specify network links and routes
13
Adaptive Checkpointing Heuristics (cont.)
Simulation Results
– To compare the performance
    Checkpointing heuristics
    Realistic workload
    System failure model
14
Adaptive Checkpointing Heuristics (cont.)
– Submission time
    80% (7 a.m. to 9 p.m.)
    20% (9 p.m. to 7 a.m.)
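The 80/20 split above can be turned into a small workload sampler. This is a minimal sketch, not DSiDE code: it assumes submissions are uniformly distributed inside each window, which the slide does not state, and the function name is hypothetical.

```python
import random


def sample_submission_hour(rng):
    """Return a submission time as an hour of day in [0, 24).

    80% of jobs fall in the 7 a.m. to 9 p.m. window, 20% in the
    9 p.m. to 7 a.m. window. ASSUMPTION: uniform within each window.
    """
    if rng.random() < 0.8:
        return rng.uniform(7.0, 21.0)
    # The night window is 10 hours long and wraps around midnight.
    return (21.0 + rng.uniform(0.0, 10.0)) % 24.0


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for repeatability
    hours = [sample_submission_hour(rng) for _ in range(10_000)]
    daytime = sum(1 for h in hours if 7.0 <= h <= 21.0) / len(hours)
    print(f"daytime share: {daytime:.2f}")
```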
15
Adaptive Checkpointing Heuristics (cont.)
– Execution time
    More than 80 percent of all submitted jobs have medium execution times
    1 hour to 6 hours
16
Adaptive Checkpointing Heuristics (cont.)
– As the checkpointing interval I decreases, longer jobs can get processed
– The increase in job runtime is in effect
– The results
    The results achieved with PeriodicCP are partially improved by LastFailureCP due to the omission of redundant checkpoints
    The technique provides the best results for short checkpointing intervals
    The effectiveness of LastFailureCP strongly depends on failure periodicity
17
Adaptive Checkpointing Heuristics (cont.)
Failures occur quite periodically
– Can easily be predicted by the algorithm
– LastFailureCP will perform similarly to PeriodicCP
The fully dynamic scheme of MeanFailureCP proves to be the most effective
Selective increase in checkpointing keeps the number of processed jobs and the average execution time of MeanFailureCP more or less constant
For the PeriodicCP and LastFailureCP algorithms, the performance drops considerably
18
Replication-based Heuristics
Load-Dependent Replication (LoadDependentRep)
– Providing fault tolerance in distributed environments through replication
    Idle resources can be utilized to run job copies without significantly delaying the execution of the original job
19
Replication-based Heuristics (cont.)
– The algorithm requires a number of parameters to be provided in advance
    Minimum number of job copies (Rep_min)
    Maximum number of job copies (Rep_max)
    The CPU limit (CL)
20
Replication-based Heuristics (cont.)
– The outcome of comparing the CPU availability (CA) with CL determines the choice for the next job to be scheduled
    CA ≥ CL (a job with fewer than Rep_max replicas)
    0 < CA < CL (a job with fewer than Rep_min replicas)
    CA = 0 (skip the current scheduling round)
– When one of the job duplicates finishes, the other replicas are automatically canceled
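The three cases above can be sketched as a single selection function. This is only an illustration: the data layout (a dictionary of replica counts), the function name, and the tie-breaking rule of preferring the job with the fewest replicas are assumptions, not details given in the presentation.

```python
def pick_job(replica_counts, ca, cl, rep_min, rep_max):
    """Choose the next job to replicate, or None to skip this round.

    replica_counts maps a job id to its current number of copies.
    ASSUMPTION: among eligible jobs, the one with the fewest copies wins.
    """
    if ca == 0:
        return None  # no CPU available: skip the current scheduling round
    # Plenty of CPU (CA >= CL): allow up to Rep_max copies per job.
    # Scarce CPU (0 < CA < CL): only bring jobs up to Rep_min copies.
    limit = rep_max if ca >= cl else rep_min
    eligible = [job for job, n in replica_counts.items() if n < limit]
    if not eligible:
        return None
    return min(eligible, key=lambda job: replica_counts[job])


if __name__ == "__main__":
    counts = {"a": 1, "b": 0, "c": 3}
    # Illustrative parameter values only.
    print(pick_job(counts, ca=50, cl=40, rep_min=1, rep_max=3))
    print(pick_job(counts, ca=0, cl=40, rep_min=1, rep_max=3))
```

Canceling the remaining replicas once one copy finishes would be handled separately by the scheduler and is left out of this sketch.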
21
Replication-based Heuristics (cont.)
Failure Detection and Load Dependent Replication (FailureDependentRep)
– Increases the fault tolerance of the previously discussed LoadDependentRep heuristic
– Offers a higher level of fault tolerance compared to solely replication-based strategies
– Does not ensure job execution
22
Replication-based Heuristics (cont.)
Adaptive Checkpoint and Replication-Based Fault Tolerance (CombinedFT)
– Dynamically switches between both techniques based on runtime information on system load
    Checkpointing mode
    Replication mode
23
Replication-based Heuristics (cont.)
– Checkpointing mode
    CPU availability is low (CA < CL)
    CombinedFT rolls back
        The earlier distributed active job replicas (AR_j)
    Starts job checkpointing
– AR_j > 0
– AR_j = 0 & CA > 0
– AR_j = 0 & CA = 0 & ∃ i: AR_i > 1
– AR_j = 0 & CA = 0 & ¬∃ i: AR_i > 1
24
Replication-based Heuristics (cont.)
– Replication mode
    Either the system load decreases
    Or enough resources recover from failure (CA ≥ CL)
    All jobs with fewer than Rep_max replicas are considered for submission to the available resources
    Assign to the fastest resource connected to a grid site S with the maximum Speed_S
    The smallest number of identical replicas
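The mode switch and the replication-mode placement rule can be sketched as follows. This is one reading of the slide, not the authors' implementation: the job with the fewest replicas is paired with the fastest resource at the site with the highest Speed_S. The `Resource` class, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    site_speed: float  # Speed_S of the grid site the resource belongs to
    speed: float       # speed of the resource itself (assumed field)


def select_mode(ca, cl):
    """Replication mode when CA >= CL, checkpointing mode when CA < CL."""
    return "replication" if ca >= cl else "checkpointing"


def place_replica(replica_counts, free_resources, rep_max):
    """Return (job, resource name) for the next replica, or None.

    ASSUMPTION: the job with the fewest replicas is served first and is
    sent to the fastest free resource at the site with the highest Speed_S.
    """
    eligible = [j for j, n in replica_counts.items() if n < rep_max]
    if not eligible or not free_resources:
        return None
    job = min(eligible, key=lambda j: replica_counts[j])
    best = max(free_resources, key=lambda r: (r.site_speed, r.speed))
    return job, best.name


if __name__ == "__main__":
    pool = [Resource("r1", 10, 5), Resource("r2", 20, 3), Resource("r3", 20, 4)]
    if select_mode(ca=45, cl=40) == "replication":
        print(place_replica({"j1": 2, "j2": 1, "j3": 3}, pool, rep_max=3))
```

The checkpointing-mode cases (the AR_j conditions) are not covered here, because the slides list the conditions without the action taken in each case.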
25
Replication-based Heuristics (cont.)
Simulation Results
– Approaches
    Unconditional RL(1)
    Unconditional RL(2)
    Unconditional RL(3)
    LoadDependentRL(1, 3, 40)
    FailureDependentRL(1, 3, 40)
    MeanFailureCP
    CombinedFT
26
Replication-based Heuristics (cont.)
27
Replication-based Heuristics (cont.)
28
Conclusion and Future Work
Fault tolerance forms an important problem
– Job checkpointing
– Replication
Evaluated in the DSiDE grid simulator
The runtime overhead characteristic of periodic checkpointing can be reduced
29
Conclusion and Future Work (cont.)
Advantage
– When the distributed system properties are not known in advance, both techniques can best be applied
Future Work
– Scheduling methods will be considered
30
Presented by Chen, Ting-Wei
Thank you for your attention