1 A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale
Gengbin Zheng Xiang Ni Laxmikant V. Kale Parallel Programming Lab University of Illinois at Urbana-Champaign

2 Motivation
- As machines grow in size, MTBF decreases
  - Jaguar had 2.33 failures/day on average from 2008 to 2010
  - Applications have to tolerate faults
- Challenges for exascale:
  - Disk-based (reliable NFS disk) checkpointing is slow
  - System-level checkpointing can be expensive
  - Scalable checkpoint/restart can be a communication-intensive process
  - Job schedulers prevent fault tolerance support in the runtime
FTXS 2012

3 Motivation (cont.)
- Applications on future exascale machines need fast, low-cost, and scalable fault tolerance support
- Previous work: double in-memory checkpoint/restart scheme
  - In the production version of Charm++ since 2004

4 Double In-memory Checkpoint/Restart Protocol
[Diagram: objects A-J are distributed across PE0-PE3; each object keeps checkpoint 1 on its home PE and checkpoint 2 on a buddy PE. When PE1 crashes (one processor lost), the surviving PEs restore the lost objects from the remaining checkpoints and redistribute them across PE0, PE2, and PE3.]

5 Runtime Support for FT
- Implemented in the Charm++/Adaptive MPI runtime
- Automatic checkpointing of threads
  - Including stack and heap (isomalloc)
- User helper functions
  - To pack and unpack data
  - Checkpointing only the live variables

6 Local Disk-Based Protocol
- Double in-memory checkpointing
  - Memory concern
  - Pick a checkpointing time when the global state is small
    - MD, N-body, quantum chemistry
- Double in-disk checkpointing
  - Makes use of local disk (or SSD)
  - Also does not rely on any reliable storage
  - Useful for applications with a very big memory footprint

7 Previous Results: Performance Comparisons with Traditional Disk-based Checkpointing

8 Previous Results: Restart with Load Balancing
LeanMD, Apoa1, 128 processors

9 Previous Result: Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps

10 FT on MPI-based Charm++
- Practical challenge: the job scheduler
  - The job scheduler kills the entire job when a process fails
  - MPI-based Charm++ is portable across major supercomputers
- A fault injection scheme in the MPI machine layer
  - DieNow(): the MPI process stops responding
  - The fault is detected via keep-alive messages
- Spare processors replace failed ones
- Demonstrated on 64K cores of a BG/P machine

11 Performance at Large Scale

12 Optimization for Scalability
- Communication bottleneck: checkpoint/restart takes O(P) time
- Optimizations:
  - Collectives (barriers): switch the O(P) barrier to a tree-based barrier
  - Stale message handling: epoch numbers, plus a phase to discard stale messages as quickly as possible
  - Small messages: streaming optimization

13 LeanMD Checkpoint Time before/after Optimization

14 Checkpoint Time for Jacobi/MPI
- Kraken
- Running on AMPI

15 LeanMD Restart Time

16 Conclusions and Future Work
- In-memory checkpointing, after optimization, is scalable towards exascale
- Future work:
  - Non-blocking checkpointing
  - Compressing checkpoint data
