1 A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale
Gengbin Zheng Xiang Ni Laxmikant V. Kale Parallel Programming Lab University of Illinois at Urbana-Champaign

2 Motivation
- As machines grow in size, MTBF decreases
  - Jaguar had 2.33 failures/day on average from 2008 to 2010
  - Applications have to tolerate faults
- Challenges for exascale:
  - Disk-based (reliable NFS disk) checkpointing is slow
  - System-level checkpointing can be expensive
  - Scalable checkpoint/restart can be a communication-intensive process
  - Job schedulers prevent fault tolerance support in the runtime
FTXS 2012

3 Motivation (cont.)
- Applications on future exascale machines need fast, low-cost, and scalable fault tolerance support
- Previous work: double in-memory checkpoint/restart scheme
  - In the production version of Charm++ since 2004

4 Double In-memory Checkpoint/Restart Protocol
[Diagram: objects A-J are distributed across PE0-PE3; each object keeps checkpoint 1 on its home PE and checkpoint 2 on a buddy PE. When PE1 crashes (one processor lost), the surviving PEs restore the lost objects from the remaining checkpoints and redistribute them across PE0, PE2, and PE3.]

5 Runtime Support for FT
- Implemented in the Charm++/Adaptive MPI runtime
- Automatic checkpointing of threads
  - Including stack and heap (isomalloc)
- User helper functions
  - To pack and unpack data
  - Checkpointing only the live variables

6 Local Disk-Based Protocol
- Double in-memory checkpointing
  - Memory concern
  - Pick a checkpointing time when the global state is small
    - MD, N-body, quantum chemistry
- Double in-disk checkpointing
  - Makes use of local disk (or SSD)
  - Also does not rely on any reliable storage
  - Useful for applications with a very big memory footprint

7 Previous Results: Performance Comparisons with Traditional Disk-based Checkpointing

8 Previous Results: Restart with Load Balancing
LeanMD, Apoa1, 128 processors

9 Previous Result: Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps

10 FT on MPI-based Charm++
- Practical challenge: the job scheduler
  - The job scheduler kills the entire job when a process fails
  - MPI-based Charm++ is portable across major supercomputers
- A fault injection scheme in the MPI machine layer
  - DieNow(): the MPI process stops responding
  - The fault is detected via keep-alive messages
- Spare processors replace failed ones
- Demonstrated on 64K cores of a BG/P machine

11 Performance at Large Scale

12 Optimization for Scalability
- Communication bottleneck: checkpoint/restart takes O(P) time
- Optimizations:
  - Collectives (barriers): switch the O(P) barrier to a tree-based barrier
  - Stale message handling: epoch numbers, plus a phase to discard stale messages as quickly as possible
  - Small messages: streaming optimization

13 LeanMD Checkpoint Time before/after Optimization

14 Checkpoint Time for Jacobi/MPI
- Kraken
- Running on AMPI

15 LeanMD Restart Time

16 Conclusions and Future Work
- In-memory checkpointing, after optimization, is scalable towards exascale
- Future work:
  - Non-blocking checkpointing
  - Compressing checkpoint data
