
1 FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI
Gengbin Zheng, Lixia Shi, Laxmikant V. Kale
Parallel Programming Lab, University of Illinois at Urbana-Champaign

2 Cluster 2004: Motivation
- As machines grow in size, MTBF decreases, so applications have to tolerate faults
- Applications need fast, low-cost, and scalable fault tolerance support
- Fault tolerant runtime for:
  - Charm++ (parallel C++ language and runtime)
  - Adaptive MPI

3 Requirements
- Low impact on fault-free execution
- Fast and automatic restart
- No reliance on extra processors
- Maintains execution efficiency after restart
- No reliance on any fault-free component
- Does not assume stable storage

4 Background
- Checkpoint-based methods
  - Coordinated: blocking [Tamir84], non-blocking [Chandy85]
    - Co-Check, Starfish, Clip: fault-tolerant MPI systems
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced: [Briatico84], doesn't scale well
- Log-based methods
  - Message logging

5 Design Overview
- Coordinated checkpointing scheme
  - Simple, with low overhead on fault-free execution
  - Well suited to iterative scientific applications
- Double checkpointing
  - Tolerates one failure at a time
- In-memory (diskless) checkpointing
  - Efficient for applications with a small memory footprint
- When no extra processors are available, the program continues to run on the remaining processors
- Load balancing after restart

6 Charm++: Processor Virtualization
- User view vs. system implementation
- Charm++: parallel C++ with data-driven objects (chares)
  - Chares are migratable
  - Asynchronous method invocation
- Adaptive MPI
  - Implemented on Charm++ with migratable threads
  - Multiple virtual processors per physical processor

7 Benefits of Virtualization
- Latency tolerance: adaptive overlap of communication and computation
- Supports migration of virtual processors for load balancing
- Checkpoint data consists of objects (instead of a process image)
  - Checkpoint == migrate object to another processor
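The "checkpoint == migrate" equivalence on this slide rests on objects being able to serialize their full state. Charm++ does this with its PUP (Pack/UnPack) framework; the sketch below is a simplified stand-in (the `Buffer` and `Chare` types are illustrative assumptions, not the real Charm++ API) showing the idea: the same packed bytes can either be stored as a checkpoint or shipped to another processor to recreate the object.

```cpp
#include <cstring>
#include <vector>

// Toy byte buffer standing in for a pack/unpack stream.
struct Buffer {
    std::vector<char> data;
    size_t pos = 0;
    void put(const void* src, size_t n) {
        const char* p = static_cast<const char*>(src);
        data.insert(data.end(), p, p + n);
    }
    void get(void* dst, size_t n) {
        std::memcpy(dst, data.data() + pos, n);
        pos += n;
    }
};

// A toy migratable object: its full state is (step, values).
struct Chare {
    int step = 0;
    std::vector<double> values;

    // Serialize the object's state into the buffer.
    void pack(Buffer& b) const {
        b.put(&step, sizeof step);
        size_t n = values.size();
        b.put(&n, sizeof n);
        b.put(values.data(), n * sizeof(double));
    }
    // Reconstruct the object's state from the buffer,
    // e.g. on another processor after migration or restart.
    void unpack(Buffer& b) {
        b.get(&step, sizeof step);
        size_t n;
        b.get(&n, sizeof n);
        values.resize(n);
        b.get(values.data(), n * sizeof(double));
    }
};
```

Packing on one processor and unpacking on another reproduces the same object, which is why checkpointing reuses the migration machinery for free.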

8 Checkpoint Protocol
- Adopts a coordinated checkpointing strategy
- The Charm++ runtime provides the checkpointing functionality
- Programmers can decide what to checkpoint
- Each object packs its data and sends it to two different (buddy) processors
- Charm++ runtime data is checkpointed as well
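The double-checkpoint step above can be sketched as follows. This is a single-process simulation, not the runtime itself, and the buddy choice ((pe+1) and (pe+2) mod N) is an illustrative assumption rather than the actual mapping: the point is only that with two copies on distinct processors, any single failure leaves at least one copy alive.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simulated cluster: checkpoints[pe] holds the checkpoints this PE
// keeps on behalf of other PEs (owner PE -> packed state).
struct Cluster {
    int n;  // number of processors
    std::vector<std::map<int, std::string>> checkpoints;

    explicit Cluster(int n) : n(n), checkpoints(n) {}

    // Two distinct buddy processors for each PE (assumed mapping).
    std::pair<int, int> buddies(int pe) const {
        return { (pe + 1) % n, (pe + 2) % n };
    }

    // Coordinated checkpoint step: store one copy on each buddy.
    void checkpoint(int pe, const std::string& packed_state) {
        auto [b1, b2] = buddies(pe);
        checkpoints[b1][pe] = packed_state;  // first copy
        checkpoints[b2][pe] = packed_state;  // second copy
    }

    // After `crashed` fails, pe's state is recoverable if at least
    // one surviving buddy still holds a copy.
    bool recoverable(int pe, int crashed) const {
        auto [b1, b2] = buddies(pe);
        return (b1 != crashed && checkpoints[b1].count(pe)) ||
               (b2 != crashed && checkpoints[b2].count(pe));
    }
};
```

With n >= 3 and distinct buddies, every object (including those on the crashed processor) survives one failure at a time, matching the design goal on slide 5.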

9 Restart Protocol
- Initiated by the failure of a physical processor
- Every object rolls back to the state preserved in the most recent checkpoint
- Combined with the load balancer to sustain performance
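For the iterative applications this scheme targets, the cost of a restart is bounded by the checkpoint period: only the iterations since the last checkpoint are redone. A minimal sketch, with a trivial counter standing in for one object's iteration state (the failure injection is simulated, not the runtime's detection mechanism):

```cpp
// Sketch of rollback recovery for an iterative application: on a
// failure, roll back to the most recent checkpoint and re-execute.
struct IterativeApp {
    int step = 0;        // current iteration
    int saved_step = 0;  // checkpointed iteration (held on buddies in the real scheme)
    int recomputed = 0;  // iterations redone after rollback

    void checkpoint() { saved_step = step; }
    void rollback() {
        recomputed += step - saved_step;  // lost work = steps since checkpoint
        step = saved_step;
    }

    // Run to `target` steps, checkpointing every `period` steps and
    // injecting a single simulated failure at step `fail_at`.
    void run(int target, int period, int fail_at) {
        bool failed = false;
        while (step < target) {
            if (step % period == 0) checkpoint();
            ++step;
            if (!failed && step == fail_at) {
                failed = true;
                rollback();
            }
        }
    }
};
```

With a period of 10 and a failure at step 27, only the 7 steps since the checkpoint at step 20 are recomputed; shortening the period trades checkpoint overhead against recomputation.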

10 Checkpoint/Restart Protocol (diagram)
[Figure: objects A-J distributed across PE0-PE3, each with checkpoint copies (checkpoint 1 and checkpoint 2) on two buddy processors. After PE1 crashes (one processor lost), its objects are restored from the surviving checkpoint copies onto the remaining processors.]

11 Local Disk-Based Protocol
- Double in-memory checkpointing raises a memory concern
  - Mitigation: pick a checkpointing time at which the global state is small
- Double in-disk checkpointing
  - Makes use of local disk
  - Also does not rely on any reliable storage
  - Useful for applications with a very big memory footprint
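The disk variant keeps the same double-buddy structure but writes each copy to a node-local disk instead of RAM, so the duplicated footprint no longer competes with application memory. A minimal sketch; the per-node directory paths are an assumption for illustration, and no shared or reliable storage is involved:

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Write one checkpoint copy for `owner_pe` to a node-local directory
// (one call per buddy node in the full scheme).
void write_checkpoint(const std::string& node_dir, int owner_pe,
                      const std::string& packed_state) {
    std::ofstream out(node_dir + "/ckpt_pe" + std::to_string(owner_pe));
    out << packed_state;
}

// Read the copy back during restart from a surviving buddy's local disk.
std::string read_checkpoint(const std::string& node_dir, int owner_pe) {
    std::ifstream in(node_dir + "/ckpt_pe" + std::to_string(owner_pe));
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Because each copy lives on an independent local disk, losing one node (and its disk) still leaves the other copy readable, preserving the "no reliable storage" property of the in-memory scheme.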

12 Performance Evaluation
- IA-32 Linux cluster at NCSA
  - 512 dual 1 GHz Intel Pentium III processors
  - 1.5 GB RAM per processor
  - Connected by both Myrinet and 100 Mbit Ethernet

13 Checkpoint Overhead Evaluation
- Jacobi3D MPI
- Up to 128 processors
- Myrinet vs. 100 Mbit Ethernet

14 Single Checkpoint Overhead
- AMPI Jacobi3D
- Problem size: 200 MB
- 128 processors

15 Comparisons of Program Execution Time
- Jacobi (200 MB data size) on up to 128 processors
- 8 checkpoints in 100 steps

16 Performance Comparisons with Traditional Disk-Based Checkpointing

17 Recovery Performance
- Molecular dynamics simulation application: LeanMD
- ApoA1 benchmark (92K atoms)
- 128 processors
- Crash simulated by killing processes
- No backup processors
- With load balancing

18 Performance Improvement with Load Balancing
- LeanMD, ApoA1, 128 processors

19 Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps

20 LeanMD with ApoA1 Benchmark
- 90K atoms
- 8498 objects

21 Future Work
- Use our scheme on some extremely large parallel machines
- Reduce the memory usage of the protocol
- Message logging
(Paper appeared in IPDPS'04)

