A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale
Gengbin Zheng, Xiang Ni, Laxmikant V. Kale
Parallel Programming Lab, University of Illinois at Urbana-Champaign
Motivation
As machines grow in size, MTBF decreases: Jaguar averaged 2.33 failures per day from 2008 to 2010, so applications have to tolerate faults.
Challenges for exascale:
Disk-based checkpointing (to reliable NFS storage) is slow
System-level checkpointing can be expensive
Scalable checkpoint/restart can be a communication-intensive process
Job schedulers prevent fault-tolerance support in the runtime
Motivation (cont.)
Applications on future exascale machines need fast, low-cost, and scalable fault tolerance support.
Previous work: a double in-memory checkpoint/restart scheme, in the production version of Charm++ since 2004.
Double in-memory Checkpoint/Restart Protocol
[Figure: objects A through J are distributed across PE0 to PE3; each object has two checkpoint copies (checkpoint 1 and checkpoint 2) placed on different PEs. When PE1 crashes (one processor is lost), its objects are restored on the surviving PEs from the remaining checkpoint copies.]
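To make the protocol concrete, here is a minimal standalone C++ sketch of the double in-memory idea (not the Charm++ implementation): each object's checkpoint is kept on its home PE and on a buddy PE, so one PE failure can never destroy both copies. The buddy choice and data layout are illustrative assumptions.

```cpp
// Minimal sketch of double in-memory checkpointing, not the Charm++ code.
#include <cstdio>
#include <map>
#include <string>

const int NUM_PES = 4;

struct Checkpoint { std::string data; };          // serialized object state

// ckpt[pe] models the checkpoint copies held in PE `pe`'s memory.
std::map<std::string, Checkpoint> ckpt[NUM_PES];

int buddy(int pe) { return (pe + 1) % NUM_PES; }  // illustrative buddy choice

// Checkpoint an object: one copy locally, one on the buddy PE.
void checkpoint(int homePe, const std::string& obj, const std::string& state) {
  ckpt[homePe][obj] = {state};
  ckpt[buddy(homePe)][obj] = {state};
}

// Restart after `failedPe` crashes: every object is still recoverable from
// the surviving copy on some other PE.
void restart(int failedPe) {
  ckpt[failedPe].clear();                         // that memory is gone
  for (int pe = 0; pe < NUM_PES; ++pe)
    for (const auto& kv : ckpt[pe])
      std::printf("object %s recoverable from PE%d\n", kv.first.c_str(), pe);
}

int main() {
  checkpoint(1, "A", "state-of-A");  // A lives on PE1, buddy copy on PE2
  restart(1);                        // PE1 crashes; A survives on PE2
}
```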
Runtime Support for FT
Implemented in the Charm++/Adaptive MPI runtime
Automatic checkpointing of threads, including stack and heap (via isomalloc)
User helper functions to pack and unpack data, checkpointing only the live variables (a sketch follows)
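A minimal sketch of such a pack/unpack helper using Charm++'s PUP (Pack/UnPack) framework; the Particle class and its fields are hypothetical, though the pup() method and the PUP::er operator are the standard Charm++ idiom.

```cpp
#include "pup.h"  // Charm++ PUP (Pack/UnPack) framework

// Hypothetical user class: pup() tells the runtime which live variables go
// into a checkpoint and how to restore them; the same routine is used for
// sizing, packing, and unpacking.
class Particle {
  double x, y, z;     // position
  double vx, vy, vz;  // velocity
public:
  void pup(PUP::er &p) {
    p | x;  p | y;  p | z;
    p | vx; p | vy; p | vz;
  }
};
```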
Local Disk-Based Protocol
Double in-memory checkpointing raises a memory concern: pick a checkpointing time where the global state is small (MD, N-body, quantum chemistry).
Double in-disk checkpointing makes use of the local disk (or SSD), also does not rely on any reliable storage, and is useful for applications with a very large memory footprint (see the sketch below).
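A minimal sketch, assuming POSIX-style stdio, of how the in-disk variant might write a checkpoint copy to node-local storage; the path and function name are illustrative, not from the paper.

```cpp
// The in-disk variant writes each checkpoint copy to node-local disk or SSD
// instead of holding it in RAM; no shared "reliable" file system is involved.
#include <cstdio>
#include <string>
#include <vector>

bool writeLocalCheckpoint(int pe, int epoch, const std::vector<char>& buf) {
  // Node-local scratch path (illustrative).
  std::string path = "/tmp/ckpt_pe" + std::to_string(pe) +
                     "_epoch" + std::to_string(epoch) + ".bin";
  std::FILE* f = std::fopen(path.c_str(), "wb");
  if (!f) return false;
  size_t written = std::fwrite(buf.data(), 1, buf.size(), f);
  std::fclose(f);
  return written == buf.size();
}

int main() {
  std::vector<char> state = {'A', 'B', 'C'};        // serialized object state
  return writeLocalCheckpoint(/*pe=*/0, /*epoch=*/1, state) ? 0 : 1;
}
```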
Previous Results: Performance Comparisons with Traditional Disk-based Checkpointing
Previous Results: Restart with Load Balancing
LeanMD, Apoa1, 128 processors
Previous Results: Recovery Performance
10 crashes, 128 processors, checkpoint every 10 time steps
FT on MPI-based Charm++
Practical challenge: the job scheduler kills the entire job when a process fails.
MPI-based Charm++ is portable across major supercomputers.
A fault injection scheme in the MPI machine layer: DieNow() makes an MPI process stop responding.
Faults are detected by keep-alive messages (sketched below), and spare processors replace the failed ones.
Demonstrated on 64K cores of a BG/P machine.
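A minimal sketch of keep-alive failure detection with spare-processor takeover; the structures, timeout, and spare list are illustrative assumptions, not the Charm++ MPI machine layer.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

using Clock = std::chrono::steady_clock;
const auto TIMEOUT = std::chrono::seconds(5);

struct Peer {
  Clock::time_point lastHeartbeat = Clock::now();
  bool failed = false;
};

std::vector<Peer> peers(64);            // one entry per monitored process
std::vector<int> sparePes = {64, 65};   // idle spares held in reserve

void onHeartbeat(int pe) { peers[pe].lastHeartbeat = Clock::now(); }

// Called periodically: a peer silent for longer than TIMEOUT is declared
// dead, and a spare PE is assigned to take over its rank and restore its
// objects from the surviving in-memory checkpoints.
void detectFailures() {
  auto now = Clock::now();
  for (size_t pe = 0; pe < peers.size(); ++pe) {
    if (!peers[pe].failed && now - peers[pe].lastHeartbeat > TIMEOUT) {
      peers[pe].failed = true;
      if (!sparePes.empty()) {
        int spare = sparePes.back();
        sparePes.pop_back();
        std::printf("PE%zu failed; spare PE%d takes over\n", pe, spare);
      }
    }
  }
}

int main() {
  onHeartbeat(3);     // PE3 reports in
  detectFailures();   // nothing has timed out yet
}
```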
Performance at Large Scale
Optimizations for Scalability
Communication bottlenecks: checkpoint/restart takes O(P) time.
Optimizations (two are sketched below):
Collectives: switch the O(P) barrier to a tree-based barrier
Stale message handling: an epoch number, plus a phase that discards stale messages as quickly as possible
Small messages: streaming optimization
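Minimal sketches of two of these optimizations; all names and the fan-out are illustrative assumptions, not the Charm++ source.

```cpp
#include <cstdio>

// (1) Tree-based barrier: PEs report completion up a k-ary tree instead of
// all sending to PE0, turning the O(P) message load at the root into
// O(log P) rounds with O(k) messages per node.
const int K = 4;  // tree fan-out
int parent(int pe) { return pe == 0 ? -1 : (pe - 1) / K; }

// (2) Epoch numbers: every message carries the sender's restart epoch; a
// restart bumps the local epoch, so in-flight messages from before the
// crash are recognized as stale and discarded as soon as they arrive.
struct Msg { int epoch; /* ...payload... */ };
int currentEpoch = 0;

bool acceptMessage(const Msg& m) {
  return m.epoch == currentEpoch;  // drop stale pre-restart traffic
}

int main() {
  std::printf("parent of PE9 is PE%d\n", parent(9));          // PE2 when K = 4
  Msg stale{0};
  currentEpoch = 1;                                           // a restart occurred
  std::printf("stale accepted? %d\n", acceptMessage(stale));  // prints 0
}
```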
LeanMD Checkpoint Time before/after Optimization
Checkpoint Time for Jacobi/MPI
Running on AMPI, on Kraken
LeanMD Restart Time
Conclusions and Future Work
After optimization, in-memory checkpointing is scalable towards exascale.
Future work:
Non-blocking checkpointing
Compressing checkpoint data