
1 Gengbin Zheng, Xiang Ni, Laxmikant V. Kale. Parallel Programming Lab, University of Illinois at Urbana-Champaign

2 Motivation
As machines grow in size, MTBF decreases: Jaguar averaged 2.33 failures/day from 2008 to 2010, so applications have to tolerate faults.
Challenges for exascale:
- Disk-based checkpointing (to reliable NFS disk) is slow
- System-level checkpointing can be expensive
- Scalable checkpoint/restart is a communication-intensive process
- Job schedulers prevent fault tolerance support in the runtime

3 Motivation (cont.)
Applications on future exascale machines need fast, low-cost, and scalable fault tolerance support.
Previous work: a double in-memory checkpoint/restart scheme, in the production version of Charm++ since 2004.

4 Double In-memory Checkpoint/Restart Protocol
[Figure: objects A-J distributed across PE0-PE3; each object keeps two checkpoint copies (checkpoint 1 and checkpoint 2) on different processors. When PE1 crashes (1 processor lost), its objects are restored on the surviving processors from the remaining checkpoint copies.]
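On the application side, this protocol is driven by a single runtime call. Below is a minimal sketch, assuming a chare named Driver, entry methods iterate()/resume(), and a 10-step checkpoint period (none of which are from the talk); the matching .ci interface declarations are omitted. CkStartMemCheckpoint() is the Charm++ call that stores the two in-memory copies and invokes the callback when the checkpoint completes, or after a restart has recovered from it.

```cpp
// Minimal sketch (not the talk's code): triggering the double in-memory
// checkpoint from an application's iteration loop. The chare, entry
// methods, and period are illustrative assumptions; the generated
// .decl.h/.def.h from the .ci interface file are omitted.
#include "charm++.h"

class Driver : public CBase_Driver {
  int step = 0;
public:
  Driver() { thisProxy.iterate(); }
  Driver(CkMigrateMessage *m) {}        // required for migration/restart

  void iterate() {
    ++step;
    // ... one application timestep ...
    if (step % 10 == 0) {               // checkpoint every 10 steps (assumed)
      // Ask the runtime to store two in-memory copies of every object;
      // resume() runs once both copies are safely stored, or after a
      // restart recovers from them.
      CkCallback cb(CkIndex_Driver::resume(), thisProxy);
      CkStartMemCheckpoint(cb);
    } else {
      thisProxy.iterate();
    }
  }
  void resume() { thisProxy.iterate(); }
};
```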

5 Runtime Support for FT
- Automatically checkpointing threads, including stack and heap (isomalloc)
- User helper functions to pack and unpack data, checkpointing only the live variables (a pup() sketch follows below)
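A minimal sketch of such a helper, using Charm++'s PUP framework; the class and member names are illustrative assumptions, not from the talk. The same pup() routine packs state for checkpointing and unpacks it on restart, and skipping transient members is how "only the live variables" get checkpointed.

```cpp
// Hedged sketch of a user pack/unpack helper via Charm++'s PUP framework.
#include "pup.h"
#include "pup_stl.h"   // PUP support for std::vector and other STL types
#include <vector>

class Patch {
  int nAtoms;
  std::vector<double> positions;   // live data worth checkpointing
  double *scratch;                 // transient buffer, rebuilt on demand
public:
  void pup(PUP::er &p) {
    p | nAtoms;
    p | positions;
    // 'scratch' is deliberately not pupped: it is dead across a
    // checkpoint and gets reallocated lazily after restart.
    if (p.isUnpacking()) scratch = nullptr;
  }
};
```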

6 Local Disk-Based Protocol
Double in-memory checkpointing raises a memory concern; one mitigation is to pick a checkpointing time where the global state is small (MD, N-body, quantum chemistry).
Double in-disk checkpointing makes use of the local disk (or SSD), also does not rely on any reliable shared storage, and is useful for applications with a very big memory footprint (a local-disk write sketch follows).
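For illustration only (this is not Charm++ code): the node-local write that the in-disk variant relies on amounts to dumping the serialized object state to a local path and forcing it to disk, so recovery never depends on shared reliable storage. The path and the two-copy rotation scheme below are assumptions.

```cpp
// Illustrative sketch: write a serialized checkpoint buffer to local
// disk (or SSD) and fsync it. Path and naming are assumptions.
#include <cstdio>
#include <string>
#include <vector>
#include <unistd.h>

bool writeLocalCheckpoint(int pe, int ckptId, const std::vector<char> &buf) {
  std::string path = "/tmp/ckpt_pe" + std::to_string(pe) +
                     "_v" + std::to_string(ckptId % 2);  // two rotating copies
  FILE *f = std::fopen(path.c_str(), "wb");
  if (!f) return false;
  bool ok = std::fwrite(buf.data(), 1, buf.size(), f) == buf.size();
  ok = ok && std::fflush(f) == 0 && fsync(fileno(f)) == 0;  // force to device
  return std::fclose(f) == 0 && ok;
}
```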

7 Previous Results: Performance Comparison with Traditional Disk-Based Checkpointing

8 Previous Results: Restart with Load Balancing (LeanMD, Apoa1, 128 processors)

9 Previous Results: Recovery Performance (10 crashes, 128 processors, checkpoint every 10 time steps)

10 LeanMD with the Apoa1 benchmark: 90K atoms, 8498 objects

11 FT on MPI-based Charm++
Practical challenge: the job scheduler kills the entire job when a process fails. MPI-based Charm++ is portable across major supercomputers.
- A fault injection scheme in the MPI machine layer: DieNow() makes an MPI process stop responding
- Fault detection by keep-alive messages (sketch below)
- Spare processors replace failed ones
Demonstrated on 64K cores of a BG/P machine.
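A minimal sketch of the keep-alive detection idea, not the actual machine-layer code: each process timestamps pings from a buddy and declares the buddy dead after a timeout, at which point a spare processor is activated and the job restarts from the latest checkpoint. The timeout value and the structure are assumptions.

```cpp
// Hedged sketch of keep-alive failure detection.
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct BuddyWatch {
  Clock::time_point lastPing = Clock::now();
  std::chrono::milliseconds timeout{5000};   // detection timeout (assumed)

  void onKeepAlive() { lastPing = Clock::now(); }   // called on each ping

  bool buddySuspectedDead() const {
    return Clock::now() - lastPing > timeout;
  }
};

// Poll loop run by the runtime: when the buddy is suspected dead,
// activate a spare process and restart from the latest checkpoint.
void pollOnce(BuddyWatch &w) {
  if (w.buddySuspectedDead()) {
    std::puts("buddy failed: activating spare, restarting from checkpoint");
    // ... recovery protocol would run here ...
  }
}
```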

12 Performance at Large Scale

13 Optimizations for Scalability
Communication bottlenecks: checkpoint/restart takes O(P) time. Optimizations (tree-barrier sketch below):
- Collectives (barriers): switch the O(P) barrier to a tree-based barrier
- Stale message handling: an epoch number, plus a phase to discard stale messages as quickly as possible
- Small messages: streaming optimization
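A minimal sketch of the tree-based barrier arithmetic; the fan-out K and the message protocol described in the comments are assumptions, not the Charm++ implementation. Each rank reports completion to its parent and the root releases the tree, so the barrier takes O(log_K P) rounds instead of O(P).

```cpp
// Sketch: k-ary reduction tree replacing a central O(P) barrier.
#include <cstdio>
#include <vector>

const int K = 4;  // tree fan-out (assumed)

int parentOf(int rank) { return rank == 0 ? -1 : (rank - 1) / K; }

std::vector<int> childrenOf(int rank, int P) {
  std::vector<int> kids;
  for (int c = K * rank + 1; c <= K * rank + K && c < P; ++c)
    kids.push_back(c);
  return kids;
}

// Protocol per rank: wait for "done" from all children, send "done" to
// the parent; the root then broadcasts "release" back down the tree.
// Each rank handles at most K+1 messages, giving O(log_K P) depth.
int main() {
  int P = 16, rank = 5;
  std::printf("rank %d: parent %d, %zu children\n",
              rank, parentOf(rank), childrenOf(rank, P).size());
}
```

The epoch-number fix is complementary: once every message carries the current checkpoint epoch, a receiver can discard anything tagged with an older epoch in constant time.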

14 LeanMD Checkpoint Time Before/After Optimization

15 Checkpoint Time for Jacobi/AMPI (Kraken)

16 LeanMD Restart Time

17 Conclusions and Future Work
In-memory checkpointing, after these optimizations, scales toward exascale.
A short paper has been accepted at the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012).
Future work: non-blocking checkpointing.

