Published by Kory O’Connor. Modified over 9 years ago.
1/20 Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
  Sheng Di, Mohamed Slim Bouguerra, Leonardo Bautista-Gomez, Franck Cappello
  INRIA and ANL, 2013
2/20 Outline
  Background of Multi-level Checkpoint Model
  Problem Formulation
  Optimization of Multi-level Checkpoint Model
    Optimizing Checkpoint Intervals for Each Level
    Optimizing the Selection of Levels
  Performance Evaluation
  Conclusion and Future Work
3/20 Background of Multi-level Ckpt Model
  The traditional ckpt/restart model always stores checkpoint files on a Parallel File System (PFS).
  A PFS is centrally controlled, so it becomes a bottleneck for large-scale applications.
  For example, our experiments show that the checkpoint overhead on the PFS increases quickly with problem size and execution scale:
    # cores     128       256        512        1024
    Ckpt cost   7.4 sec   10.8 sec   16.8 sec   43.1 sec
4/20 Background of Multi-level Ckpt Model
  Existing multi-level checkpoint toolkits:
  Scalable Checkpoint/Restart Library (SCR), SC’10
    RAM disk / local disk
    Partner-copy / XOR encoding
    Parallel File System (PFS), e.g., NFS
  Fault Tolerance Interface (FTI), SC’11
    Local disk: stores ckpt files on the local disk
    Partner-copy: stores ckpt files on the local disk and on a partner's disk
    Reed-Solomon encoding (RS-encoding)
    Parallel File System (PFS), e.g., NFS
5/20 Problem Formulation
  Different types of failures:
    CPL1: There are no hardware failures, only software errors.
    CPL2: There are non-adjacent hardware failures.
    CPL3: There are a few adjacent hardware failures.
    CPL4: There are a lot of hardware failures.
6/20 Problem Formulation
  Figure: the process of running an HPC application with failures under the multi-level checkpoint model
7/20 Problem Formulation
  Our objective: minimize the expected wall-clock length of each HPC application with
    an optimized selection of levels
    optimized checkpoint intervals on each level
  Mathematical expectation of the wall-clock length:
    [Equation not preserved. Its annotated terms: productive time, number of levels, number of ckpt intervals at level i, ckpt overhead, rollback loss, restart cost, number of failures at level i, probability]
8/20 Optimization of Multi-level Checkpoint Model
  E(T_w) is convex in x_i, where x_i denotes the number of ckpt intervals at level i.
  We therefore obtain the optimal solution x_i* by solving the simultaneous equations for i = 1, 2, 3, ..., L.
  [Equations not preserved]
9/20 Optimization of Multi-level Checkpoint Model
  Optimizing Checkpoint Intervals
  Simplified equations: [not preserved]
  We solve them with an iterative algorithm that computes x_i^(k+1) from the values at iteration k. Error per iteration:
    k=0: err=0.2
    k=1: err=0.08
    k=2: err=0.005
    k=3: err=0.0001
    ......
  We use Young’s formula to initialize x_i^(0).
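The iterative scheme on this slide can be illustrated with a minimal sketch. The slide's own equations were not preserved, so the code assumes a simplified cost model that is not necessarily the authors': W = T_e + sum_i C_i x_i + sum_i lambda_i W (T_e / (2 x_i) + R_i), where W is the expected wall-clock length, T_e the productive time, and lambda_i an assumed failure rate at level i. Its first-order condition gives x_i = sqrt(lambda_i W T_e / (2 C_i)), which depends on W and is therefore solved by fixed-point iteration. All function names and parameter values are hypothetical.

```python
import math

def young_init(te, c, lam):
    """Young's formula: interval = sqrt(2*C/lambda), so x = te / interval."""
    return [te * math.sqrt(l / (2.0 * ci)) for ci, l in zip(c, lam)]

def expected_wallclock(te, c, r, lam, x):
    """Assumed model (not the authors' exact equation), solved for W:
    W = te + sum(C_i*x_i) + sum(lam_i * W * (te/(2*x_i) + R_i))."""
    a = te + sum(ci * xi for ci, xi in zip(c, x))
    b = sum(l * (te / (2.0 * xi) + ri) for l, xi, ri in zip(lam, x, r))
    if b >= 1.0:
        raise ValueError("failure rates too high for this simplified model")
    return a / (1.0 - b)

def optimize_intervals(te, c, r, lam, tol=1e-6, max_iter=100):
    """Fixed-point iteration on the first-order conditions of the assumed model.
    Returns (x, expected wall-clock length, iterations used).
    x is treated as continuous; a real schedule would round it."""
    x = young_init(te, c, lam)  # x^(0) from Young's formula
    for k in range(max_iter):
        w = expected_wallclock(te, c, r, lam, x)
        x_new = [math.sqrt(l * w * te / (2.0 * ci)) for ci, l in zip(c, lam)]
        err = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if err < tol:
            return x, expected_wallclock(te, c, r, lam, x), k + 1
    return x, expected_wallclock(te, c, r, lam, x), max_iter

# Hypothetical 4-level setting (seconds and failures per second).
te = 5000.0
c = [10.0, 30.0, 50.0, 240.0]    # checkpoint overheads
r = [10.0, 30.0, 50.0, 240.0]    # restart costs
lam = [2e-4, 1e-4, 5e-5, 2e-5]   # assumed failure rates per level
x_opt, w_opt, iters = optimize_intervals(te, c, r, lam)
print(x_opt, w_opt, iters)
```

Under this assumed model, Young's formula corresponds to evaluating the same expression with W replaced by T_e, which is why it serves as a natural starting point for the iteration.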
10/20 Optimization of Multi-level Checkpoint Model
  Optimizing Checkpoint Intervals
  How fast is our iterative algorithm?
    With an error threshold of 10^-6, the algorithm converges in only about 20-30 iterations.
  What is the performance gain of our method over the traditional Young’s formula?
    Suppose there are 8 levels and the application execution length is 1000 to 9000 seconds.
    The checkpoint overheads on the 8 levels are 10, 30, 45, 50, 55, 60, 65, and 240 seconds per checkpoint.
    Numerical simulation shows that our method outperforms Young’s formula by 4.2% to 17.8%.
11/20 Optimization of Multi-level Checkpoint Model
  Optimizing Selection of Checkpoint Levels
  For a particular combination of levels, the computation cost is only about 30 iterations.
  It is therefore feasible to traverse all combinations of levels to find the optimal selection.
  Suppose there are 8 levels: there are 2^8 - 1 = 255 different combinations of levels, and the total computation cost is 255 * 30 = 7650 iterations, which is very small.
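The exhaustive traversal described on this slide can be sketched as follows. The function names are hypothetical, and cost_fn stands for whatever routine returns the optimized expected wall-clock length of a given selection (for example, an interval optimizer run on just those levels); how failures at an unselected level are handled is left to that routine.

```python
from itertools import combinations

def all_level_selections(num_levels):
    """Yield every non-empty subset of levels 0..num_levels-1 as a tuple."""
    for size in range(1, num_levels + 1):
        yield from combinations(range(num_levels), size)

def best_selection(num_levels, cost_fn):
    """Return (selection, cost) with the smallest cost_fn(selection).
    cost_fn is supplied by the caller; it is an assumption of this sketch."""
    scored = ((sel, cost_fn(sel)) for sel in all_level_selections(num_levels))
    return min(scored, key=lambda pair: pair[1])

# Reproduce the slide's arithmetic for 8 levels and about 30 iterations each.
selections = list(all_level_selections(8))
total_iterations = len(selections) * 30
print(len(selections), total_iterations)
```

Brute force is reasonable here only because the number of levels is small; the count of subsets doubles with every added level.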
12/20 Optimization of Multi-level Checkpoint Model
  Analysis of a Practical Case: FTI
  There are 4 levels: local disk, partner-copy, RS-encoding, and PFS.
  C_lf, C_pc, C_rs, C_pf denote the ckpt overheads.
  R_lf, R_pc, R_rs, R_pf denote the restart overheads.
13/20 Optimization of Multi-level Checkpoint Model
  Analysis of a Practical Case: FTI
  The target simultaneous equations derived from convex optimization (first-order derivatives) are: [equations not preserved]
  The solution to these equations must be optimal.
  An iterative method finds it very quickly.
14/20 Performance Evaluation
  Experimental Setting
  Evaluation Type A: Numerical Simulation
    Evaluates a large number of cases with different parameters, including ckpt overheads, restart costs, application lengths, etc.
  Evaluation Type B: Real Experiment
    Validates the feasibility of our optimal checkpoint model in a real use case, the FTI scenario.
    MPI program used in our experiment: Heat distribution
15/20 Performance Evaluation
  Checkpoint Overhead of FTI on the FUSION cluster
  Key indicator: Workload Processing Ratio (WPR) = productive time / wall-clock length
  Figure panels (plots not preserved): 26 MB per proc, 57 MB per proc
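As a small worked example of the WPR definition, the sketch below computes the ratio for one run. The function name and the timing values are hypothetical and are not measurements from the slides.

```python
def workload_processing_ratio(productive_time, wall_clock_length):
    """WPR = productive time / wall-clock length, a value in [0, 1]."""
    if wall_clock_length <= 0:
        raise ValueError("wall-clock length must be positive")
    if not 0 <= productive_time <= wall_clock_length:
        raise ValueError("productive time must lie within the wall-clock length")
    return productive_time / wall_clock_length

# Hypothetical run: 5000 s of useful work inside a 6250 s execution.
wpr = workload_processing_ratio(5000.0, 6250.0)
print(wpr)
```

A WPR closer to 1 means that less of the run was spent on checkpointing, rollback, and restart.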
16/20 Performance Evaluation
  Different Selections of Checkpoint Levels
  Simulation Settings [table not preserved]
17/20 Performance Evaluation
  Different Selections of Checkpoint Levels
  Simulation Results [plots not preserved]
  Improvement: 10-20%
18/20 Performance Evaluation
  Experimental Results on the FUSION cluster [plots not preserved]
19/20 Conclusion
  Optimal Multi-level Checkpoint/Restart Model
  Key theoretical conclusions:
    Ckpt intervals on each level can be optimized by a fast iterative method (it converges within about 30 iterations).
    The ckpt intervals are optimal according to convex-optimization theory.
  Key simulation/experimental results:
    For FTI, the iterative optimal method with the best selection of levels is better than other solutions by up to 20%.
    In other settings, such as 8 levels, optimized selection of levels can improve performance by 50% in some cases.
20/20 Future Work
  In the future, we plan to:
    evaluate our optimal ckpt/restart model using more complex MPI programs, such as CESM, on real clusters at larger scales.
    improve robustness and stability by taking into account possible prediction errors in checkpoint overheads and execution length.
    optimize the execution scale (# of processes) based on checkpoint overheads for applications with a specific productive time.
21/20 Thanks!! Contact me at: disheng222@gmail.com