1
Fault Tolerance in Charm++
Gengbin Zheng
10/11/2005
Parallel Programming Lab, University of Illinois at Urbana-Champaign
2
Motivation
- As machines grow in size, the mean time between failures (MTBF) decreases
- Applications have to tolerate faults
- Applications need fast, low-cost, and scalable fault tolerance support
- Fault tolerant runtime for:
  - Charm++
  - Adaptive MPI (AMPI)
3
Outline
- Disk checkpoint/restart
- FTC-Charm++: in-memory checkpoint/restart
- Proactive fault tolerance
- FTL-Charm++: message logging
4
Disk Checkpoint/Restart
5
Checkpoint/Restart
- Simplest scheme for application fault tolerance
- Any long-running application saves its state to disk periodically at certain points
  - Coordinated checkpointing strategy (barrier)
- State information is saved in a directory of your choosing
- Checkpointing of the application data is done by invoking the pup routine of all objects (see the sketch below)
- Restore also uses pup, so no additional application code is needed (pup is all you need)
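To make the pup mechanism concrete, here is a minimal sketch assuming a hypothetical user class Block with a step counter and a data array; PUP::er, pup.h, and pup_stl.h come from Charm++, everything else is illustrative.

```cpp
// Minimal sketch (illustrative, not from the slides): a user object whose
// state is saved and restored through its pup routine. "Block" and its
// members are hypothetical; a real chare would also forward to the pup of
// its generated base class.
#include "pup.h"
#include "pup_stl.h"
#include <vector>

class Block {
  int step;                   // current iteration number
  std::vector<double> field;  // local simulation data
public:
  // Called by the runtime both when packing a checkpoint and when
  // unpacking it at restart; the same code handles both directions.
  void pup(PUP::er &p) {
    p | step;
    p | field;                // pup_stl.h provides operator| for std::vector
  }
};
```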
6
Checkpointing the Job
- In Charm++, use:
  void CkStartCheckpoint(char* dirname, const CkCallback& cb)
  - Called on one processor; the callback cb resumes execution when the checkpoint is complete (see the usage sketch below)
- In AMPI, use:
  MPI_Checkpoint( );
  - Collective call; returns when the checkpoint is complete
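Below is a sketch of how the Charm++ call might be issued. The main chare Main, its entry method resumeFromCheckpoint, and the readonly proxy mainProxy are hypothetical names; CkStartCheckpoint, CkCallback, and CkPrintf are the existing Charm++ calls.

```cpp
// Sketch (hypothetical chare/entry names): requesting a coordinated disk
// checkpoint from the main chare. Assumes the usual generated headers
// (main.decl.h) and the Main chare declaration from the program's .ci file.
#include "charm++.h"

void Main::startCheckpoint() {
  // Resume in Main::resumeFromCheckpoint() once the checkpoint is on disk.
  CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
  CkStartCheckpoint("log", cb);      // "log" is the checkpoint directory
}

void Main::resumeFromCheckpoint() {
  CkPrintf("Checkpoint written; continuing\n");
}
```

An AMPI program would instead issue the collective MPI_Checkpoint call listed above at a point that all ranks reach.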
7
Restart Job from Checkpoint
- The charmrun option ++restart is used to restart:
  ./charmrun +p4 ./pgm ++restart log
- The number of processors need not be the same
- Parallel objects are redistributed when needed
8
FTC-Charm++ In-Memory Checkpoint/Restart
9
Disk vs. In-Memory Scheme
- Disk checkpointing suffers:
  - Needs user intervention to restart a job
  - Assumes reliable storage (disk)
  - Disk I/O is slow
- In-memory checkpoint/restart scheme:
  - Online version of the previous scheme
  - Low impact on fault-free execution
  - Provides fast and automatic restart capability
  - Does not rely on extra processors
  - Maintains execution efficiency after restart
  - Does not rely on any fault-free component
  - Does not assume stable storage
10
Overview
- Coordinated checkpointing scheme
  - Simple, low overhead on fault-free execution
  - Targets iterative scientific applications
- Double checkpointing
  - Tolerates one failure at a time
- In-memory checkpointing
  - Diskless checkpointing
  - Efficient for applications with a small memory footprint
- When there are no extra processors
  - The program continues to run on the remaining processors
  - Load balancing at restart
11
Checkpoint Protocol
- Similar to the previous scheme
  - Coordinated checkpointing strategy
  - Programmers decide what to checkpoint
- void CkStartMemCheckpoint(CkCallback &cb) (see the sketch below)
- Each object packs its data and sends it to two different (buddy) processors
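A minimal sketch of how an iterative application might trigger the in-memory checkpoint, assuming hypothetical names (Main, step, resumeIteration, mainProxy, CHECKPOINT_PERIOD); CkStartMemCheckpoint and CkCallback are the Charm++ calls named above.

```cpp
// Sketch (hypothetical names): double in-memory checkpoint every
// CHECKPOINT_PERIOD iterations of an iterative application. Assumes the
// generated main.decl.h and the Main chare declaration.
#include "charm++.h"

static const int CHECKPOINT_PERIOD = 10;

void Main::iterate() {
  ++step;
  if (step % CHECKPOINT_PERIOD == 0) {
    // Each object's pup'ed state is stored on two buddy processors;
    // execution resumes at the callback once the checkpoint is complete.
    CkCallback cb(CkIndex_Main::resumeIteration(), mainProxy);
    CkStartMemCheckpoint(cb);
    return;
  }
  resumeIteration();
}
```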
12
Restart Protocol
- Initiated by the failure of a physical processor
- Every object rolls back to the state preserved in the most recent checkpoint
- Combined with the load balancer to sustain performance
13
Checkpoint/Restart Protocol
[Figure: objects A-J distributed over PE0-PE3, each with checkpoint 1 and checkpoint 2 stored on buddy processors; when PE1 crashes (one processor lost), its objects are restored on the remaining processors from those checkpoints.]
14
Local Disk-Based Protocol
- Double in-memory checkpointing
  - Memory concern
  - Pick a checkpointing time when the global state is small
- Double in-disk checkpointing
  - Makes use of local disk
  - Also does not rely on any reliable storage
  - Useful for applications with a very big memory footprint
15
Compiling FTC-Charm++
- Build Charm++ with the "syncft" option:
  ./build charm++ net-linux syncft -O
- The command-line switch +ftc_disk selects local-disk instead of in-memory checkpointing:
  ./charmrun ./pgm +ftc_disk
16
Performance Evaluation
- IA-32 Linux cluster at NCSA
  - 512 dual 1 GHz Intel Pentium III processors
  - 1.5 GB RAM per processor
  - Connected by both Myrinet and 100 Mbit Ethernet
17
Performance Comparison with Traditional Disk-Based Checkpointing
18
Recovery Performance
- Molecular dynamics simulation application: LeanMD
- Apoa1 benchmark (92K atoms)
- 128 processors
- Crash simulated by killing processes
- No backup processors
- With load balancing
19
Performance Improvement with Load Balancing
- LeanMD, Apoa1, 128 processors
20
Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps
21
LeanMD with the Apoa1 Benchmark
- 90K atoms
- 8498 objects
22
Proactive Fault Tolerance
23
Motivation
- So far, the run-time reacts to a failure after it happens
- Instead, proactively migrate away from a processor that is about to fail
- Modern hardware supports early fault indication
  - SMART protocol, motherboard temperature sensors, Myrinet interface cards
- Possible to create a mechanism for fault prediction
24
Requirements
- Response time should be as low as possible
- No new processes should be required
- Collective operations should still work
- Efficiency loss should be proportional to the loss of computing power
25
System
- The application is warned of an impending fault via a signal
- The processor, memory, and interconnect should continue to work correctly for some time after the warning
- The run-time ensures that the application continues to run on the remaining processors even if one processor crashes
26
Solution Design
- Migrate Charm++ objects off the warned processor
- Point-to-point message delivery should continue to work
- Collective operations should cope with the possible loss of multiple processors
  - Modify the runtime system's reduction tree to remove the warned processor
- A minimal number of processors should be affected
- The runtime system should remain load balanced after a processor has been evacuated
27
Proactive FT: Current Status
- Support for multiple faults is ready; currently testing support for simultaneous faults
- Faults are simulated via a signal sent to the process
- The current version is fully integrated into Charm++ and AMPI
- Example: sweep3d (MPI code) on NCSA's Tungsten
[Figure: processor utilization originally, after the fault, and after load balancing]
28
How to Use
- Part of the default version of Charm++
  - No extra compiler flags required
  - This code does not get executed until a warning occurs
- Any detection system can be plugged in
  - Can send a signal (SIGUSR1) to the process on the compute node
  - Can call a method (CkDecideEvacPe) to evacuate a processor, as sketched below
- Works with any Charm++ or AMPI program
  - AMPI programs need to be built with -memory isomalloc
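As referenced above, a hypothetical fault-prediction hook could drive the evacuation; readCpuTemperature, TEMP_LIMIT, and checkHealthAndEvacuate are assumptions made for illustration, while CkDecideEvacPe, CkMyPe, and CkPrintf are the Charm++ calls.

```cpp
// Sketch (illustrative): a hypothetical fault-prediction hook that evacuates
// the local processor when a temperature sensor crosses a threshold.
#include "charm++.h"

static const double TEMP_LIMIT = 85.0;   // assumed threshold in Celsius

double readCpuTemperature();              // hypothetical sensor query (not defined here)

void checkHealthAndEvacuate() {
  double tempC = readCpuTemperature();
  if (tempC > TEMP_LIMIT) {
    CkPrintf("[%d] temperature %.1f C above limit, evacuating\n",
             CkMyPe(), tempC);
    CkDecideEvacPe();                     // migrate objects off this PE
  }
}
```

Equivalently, an external monitoring system can simply send SIGUSR1 to the Charm++ process on the suspect node.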
29
FTL-Charm++ Message Logging
30
Motivation
- Checkpointing is not fully automatic
- Coordinated checkpointing is expensive
- Checkpoint/rollback doesn't scale
  - All nodes are rolled back just because one crashed
  - Even nodes independent of the crashed node are restarted
31
Design
- Message logging
  - Sender-side message logging
- Asynchronous checkpoints
  - Each processor has a buddy processor
  - Stores its checkpoint in the buddy's memory
  - Checkpoints on its own (no barrier)
32
Messages to Remote Chares
- Chare P is the sender, chare Q the receiver
- If the message has been seen by the receiver earlier, its ticket number (TN) is marked as received
- Otherwise, the receiver creates a new TN and stores it (see the sketch below)
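To make the ticketing step concrete, the following is an illustrative sketch only, not the actual FTL-Charm++ code: a plain C++ analogue of a receiver-side table keyed by a (sender, sequence number) pair, both of which are assumptions about how messages are identified.

```cpp
// Illustrative sketch only (not the FTL-Charm++ implementation): a
// receiver-side table that issues ticket numbers (TNs) so that a message
// seen again later receives the same TN as before.
#include <map>
#include <utility>

struct TicketTable {
  using Key = std::pair<int, int>;  // (sender id, sender sequence number)
  std::map<Key, int> issued;        // TNs already handed out
  int nextTN = 0;

  // Returns the TN for this message: reuses the stored TN if the message
  // has been seen earlier, otherwise creates and records a new one.
  int ticketFor(int senderId, int seqNum) {
    auto it = issued.find(Key(senderId, seqNum));
    if (it != issued.end())
      return it->second;            // seen before: same ticket as last time
    issued[Key(senderId, seqNum)] = nextTN;
    return nextTN++;
  }
};
```

Reusing the stored TN is what lets a message replayed from the sender's log after a crash be processed in the same order as in the original execution.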
33
Status
- Most of Charm++ and AMPI has been ported
- Support for migration has not yet been implemented in the fault-tolerant protocol
- Parallel restart not yet implemented
- Not in the Charm++ main branch
34
Thank You!
Free source, binaries, manuals, and more information at:
http://charm.cs.uiuc.edu/
Parallel Programming Lab at the University of Illinois