FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI
Gengbin Zheng, Lixia Shi, Laxmikant V. Kale
Parallel Programming Lab, University of Illinois at Urbana-Champaign
Motivation
- As machines grow in size, MTBF decreases
- Applications have to tolerate faults
- Applications need fast, low-cost, and scalable fault tolerance support
- Fault tolerant runtime for:
  - Charm++ (parallel C++ language and runtime)
  - Adaptive MPI
Requirements
- Low impact on fault-free execution
- Fast and automatic restart capability
- No reliance on extra (spare) processors
- Maintain execution efficiency after restart
- No reliance on any fault-free component
- No assumption of stable storage
Background
- Checkpoint-based methods
  - Coordinated: blocking [Tamir84], non-blocking [Chandy85]
    - CoCheck, Starfish, Clip: fault tolerant MPI
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced: [Briatico84], does not scale well
- Log-based methods
  - Message logging
Design Overview
- Coordinated checkpointing scheme
  - Simple, low overhead on fault-free execution
  - Suited to iterative scientific applications
- Double checkpointing
  - Tolerates one failure at a time
- In-memory (diskless) checkpointing
  - Efficient for applications with a small memory footprint
- No extra processors required
  - Program continues to run on the remaining processors
  - Load balancing after restart
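The double-checkpointing guarantee above can be sketched in a few lines. This is a minimal illustration, not the runtime's actual code; the buddy mapping buddy(pe) = (pe + 1) mod N is an assumption made for the example.

```cpp
#include <cassert>

// Illustrative buddy assignment (an assumption for this sketch, not the
// runtime's actual mapping): each processor's checkpoint is also stored on
// the next processor, modulo the number of processors.
int buddy(int pe, int numPes) { return (pe + 1) % numPes; }

// With two copies of every checkpoint on distinct processors, any single
// failure leaves at least one copy of each object's state alive.
bool survivesSingleFailure(int numPes, int crashedPe) {
    for (int pe = 0; pe < numPes; ++pe) {
        bool localCopyAlive = (pe != crashedPe);
        bool buddyCopyAlive = (buddy(pe, numPes) != crashedPe);
        if (!localCopyAlive && !buddyCopyAlive)
            return false;   // both copies lost: unrecoverable
    }
    return true;
}
```

This is why the scheme tolerates one failure at a time: two simultaneous failures could take out both copies of some checkpoint.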
Charm++: Processor Virtualization
- User view vs. system implementation
- Charm++: parallel C++ with data-driven objects (chares)
  - Chares are migratable
  - Asynchronous method invocation
- Adaptive MPI
  - Implemented on Charm++ with migratable threads
  - Multiple virtual processors on a physical processor
Benefits of Virtualization
- Latency tolerance: adaptive overlap of communication and computation
- Supports migration of virtual processors for load balancing
- Checkpoint data: objects (instead of a process image)
- Checkpoint == migrate object to another processor
Checkpoint Protocol
- Adopts a coordinated checkpointing strategy
- Charm++ runtime provides the checkpointing functionality
  - Programmers decide what to checkpoint
- Each object packs its data and sends it to two different (buddy) processors
- Charm++ runtime data is also checkpointed
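The per-object checkpoint step can be sketched as follows, assuming a simplified pack() in the spirit of Charm++'s PUP framework; the names JacobiChunk, checkpointTo, and the in-memory store are hypothetical, not the real API.

```cpp
#include <cstring>
#include <map>
#include <vector>

// Hypothetical object with serializable state (real Charm++ objects pack
// and unpack themselves through the PUP framework).
struct JacobiChunk {
    int iteration = 0;
    std::vector<double> grid;

    // Flatten the object's state into a byte buffer.
    std::vector<char> pack() const {
        std::vector<char> buf(sizeof iteration + grid.size() * sizeof(double));
        std::memcpy(buf.data(), &iteration, sizeof iteration);
        std::memcpy(buf.data() + sizeof iteration, grid.data(),
                    grid.size() * sizeof(double));
        return buf;
    }
};

// Simulated in-memory checkpoint store: processor id -> held checkpoints.
using CkptStore = std::map<int, std::vector<std::vector<char>>>;

// Each object packs its data and identical copies are kept on two different
// (buddy) processors, so one failure cannot destroy both copies.
void checkpointTo(const JacobiChunk& obj, int buddy1, int buddy2,
                  CkptStore& store) {
    std::vector<char> ckpt = obj.pack();
    store[buddy1].push_back(ckpt);
    store[buddy2].push_back(std::move(ckpt));
}
```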
Restart Protocol
- Initiated by the failure of a physical processor
- Every object rolls back to the state preserved in the most recent checkpoint
- Combined with the load balancer to sustain performance after restart
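The rollback step can be illustrated in the same simplified setting: after a crash, each object's state is recovered from whichever of its two checkpoint copies survived. The helpers survivingCopy and restoredIteration are assumptions of this sketch, not runtime functions.

```cpp
#include <cstring>
#include <stdexcept>
#include <vector>

// A checkpoint copy tagged with the processor that holds it.
struct CkptCopy {
    int holderPe;
    std::vector<char> bytes;
};

// Pick a surviving copy after crashedPe failed; with double checkpointing
// on two distinct processors, at least one copy always survives.
const std::vector<char>& survivingCopy(const CkptCopy& a, const CkptCopy& b,
                                       int crashedPe) {
    if (a.holderPe != crashedPe) return a.bytes;
    if (b.holderPe != crashedPe) return b.bytes;
    throw std::runtime_error("both checkpoint copies lost");
}

// Unpack the iteration counter from the flat buffer (mirror of a
// hypothetical pack() that wrote an int first).
int restoredIteration(const std::vector<char>& bytes) {
    int iter = 0;
    std::memcpy(&iter, bytes.data(), sizeof iter);
    return iter;
}
```

Restoring every object this way rolls the whole computation back to the last coordinated checkpoint, after which the load balancer redistributes the restored objects over the surviving processors.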
Checkpoint/Restart Protocol
[Diagram: objects A-J distributed over PE0-PE3, each object keeping checkpoint copies 1 and 2 on two buddy processors; after PE1 crashes (1 processor lost), its objects are restored on the surviving processors PE0, PE2, PE3 from the in-memory checkpoints]
Local Disk-Based Protocol
- Double in-memory checkpointing
  - Memory concern: pick a checkpointing time when the global state is small
- Double in-disk checkpointing
  - Makes use of local disk
  - Also does not rely on any reliable storage
  - Useful for applications with a very big memory footprint
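The in-disk variant swaps the in-memory store for each buddy's local disk while keeping the same two-copy scheme. A minimal file-based sketch, with illustrative helper names and paths (not the runtime's actual layout):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Write one checkpoint copy to a local-disk file; returns false on failure.
bool writeCheckpoint(const std::string& path, const std::vector<char>& data) {
    std::FILE* f = std::fopen(path.c_str(), "wb");
    if (!f) return false;
    size_t written = std::fwrite(data.data(), 1, data.size(), f);
    std::fclose(f);
    return written == data.size();
}

// Read a checkpoint copy back from local disk (empty vector if missing).
std::vector<char> readCheckpoint(const std::string& path) {
    std::vector<char> data;
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return data;
    char buf[4096];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        data.insert(data.end(), buf, buf + n);
    std::fclose(f);
    return data;
}
```

Because each copy lives on a node's local disk rather than a shared file system, the scheme still avoids depending on any central reliable storage.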
Performance Evaluation
- IA-32 Linux cluster at NCSA
  - 512 dual 1 GHz Intel Pentium III processors
  - 1.5 GB RAM each
  - Connected by both Myrinet and 100 Mbit Ethernet
Checkpoint Overhead Evaluation
- Jacobi3D MPI
- Up to 128 processors
- Myrinet vs. 100 Mbit Ethernet
Single Checkpoint Overhead
- AMPI Jacobi3D
- Problem size: 200 MB
- 128 processors
Comparison of Program Execution Time
- Jacobi (200 MB data size) on up to 128 processors
- 8 checkpoints in 100 steps
Performance Comparison with Traditional Disk-Based Checkpointing
Recovery Performance
- Molecular dynamics simulation application: LeanMD
- Apoa1 benchmark (92K atoms)
- 128 processors
- Crash simulated by killing processes
- No backup processors
- With load balancing
Performance Improvement with Load Balancing
- LeanMD, Apoa1, 128 processors
Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps
LeanMD with Apoa1 Benchmark
- 92K atoms
- 8498 objects
Future Work
- Apply our scheme on extremely large parallel machines
- Reduce the memory usage of the protocol
- Message logging

Paper appeared in IPDPS'04.