
1 Fault Tolerance in Charm++
Sayantan Chakravorty
Parallel Programming Laboratory

2 Overview
- Motivation
- Research Goals
- Basic Scheme
- Problems
- Solutions
- Status

3 Motivation
As machines grow in size:
- Lower MTTF
- Checkpointing time is higher
- Restart time is higher
- If MTBF < restart time, checkpoint/restart is impossible
Plausible figures:
  Number of Nodes: 100,000
  MTTF for a Node: 4 years (99.997% reliability)
  MTTF for the System: 20 minutes
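
As a quick sanity check of these figures (assuming independent node failures, so the system MTTF is roughly the per-node MTTF divided by the node count):

\[
\text{MTTF}_{\text{system}} \approx \frac{\text{MTTF}_{\text{node}}}{N} = \frac{4\ \text{years}}{100{,}000} \approx \frac{2.1 \times 10^{6}\ \text{minutes}}{10^{5}} \approx 21\ \text{minutes},
\]

which matches the 20-minute system MTTF quoted above.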

4 Costly Checkpoint/Restart
- Synchronous checkpoints are too costly
- Asynchronous checkpoints might cause cascading rollbacks
- All nodes have to be restarted after a crash
Restarting 99,999 nodes because 1 crashed is an inefficient use of resources
With a low MTBF, a large amount of computation time is wasted rolling back and re-computing
Even nodes that are independent of the crashed node are restarted

5 Idea
[Figure: application progress vs. execution time, comparing plain checkpoint/restart with our scheme. T_c marks the checkpoint cost and T_p the restart cost; after a crash, our scheme loses less progress than full checkpoint/restart.]

6 Research Goals
Asynchronous checkpoints
- Each process takes a checkpoint independently
- No overhead for synchronization
- Processes need not stop while checkpointing
- Prevent cascading rollbacks
Restart the crashed processor
- Only the crashed processor should be rolled back to its previous checkpoint
- Fast restart

7 Research Goals (contd.)
Low runtime cost
- While the system is fault-free, the cost of fault tolerance should be low
Implemented in Charm++
- Virtualization and the message-driven paradigm
- Charm++ is latency tolerant
- Migration of objects is available
Extend to Adaptive MPI

8 Basic Scheme
Each object takes its checkpoint asynchronously
An object logs every message it sends to other objects (see the sketch below)
When a process crashes, another is restarted in its place
- The replacement might come from a pool of extra processors, or be another process on one of the existing processors
- Objects might later be migrated away from the extra process, and the residual process can be cleaned up
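
For concreteness, here is a minimal sketch of a sender-side message log (the names and structures are hypothetical illustrations, not the actual Charm++ runtime code):

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch: each object keeps a copy of every message it sends,
// keyed by destination and tagged with a sequence number, so the messages
// can be resent if the destination's processor crashes.
struct LoggedMessage {
    uint64_t seqNum;            // sequence number assigned at send time
    std::vector<char> payload;  // serialized message contents
};

class MessageLog {
public:
    // Record an outgoing message before handing it to the network.
    void record(int destObject, uint64_t seqNum, std::vector<char> payload) {
        log_[destObject].push_back({seqNum, std::move(payload)});
    }

    // Once the destination has taken a checkpoint covering messages up to
    // 'ackedSeq', the corresponding log entries can be garbage-collected.
    void truncate(int destObject, uint64_t ackedSeq) {
        auto &entries = log_[destObject];
        entries.erase(
            std::remove_if(entries.begin(), entries.end(),
                           [=](const LoggedMessage &m) { return m.seqNum <= ackedSeq; }),
            entries.end());
    }

    // On a crash of the destination's processor, replay everything logged.
    const std::vector<LoggedMessage> &messagesFor(int destObject) {
        return log_[destObject];
    }

private:
    std::map<int, std::vector<LoggedMessage>> log_;
};
```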

9 The rest of the scheme
When an object is restarted, it restarts from its last checkpoint
All objects that sent messages to the restarted object must resend every message it received since that checkpoint
Duplicate messages generated by reprocessing the resent messages should be ignored
- Sequence-number-based windowing can be used, as sketched below
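
A minimal sketch of such windowing on the receiver side (per-sender counters are an assumption here; sequence numbers are taken to start at 1):

```cpp
#include <cstdint>
#include <map>

// Hypothetical sketch: the receiver remembers the highest sequence number
// already processed from each sender; a resent message at or below that
// number is a duplicate and is dropped. This relies on messages from a
// given sender being processed in increasing sequence-number order, which
// the ticket scheme described later guarantees.
class DuplicateFilter {
public:
    // Returns true if the message is new and should be processed,
    // false if it is a duplicate generated by a resend.
    bool accept(int senderId, uint64_t seqNum) {
        uint64_t &highest = highestSeen_[senderId];  // value-initialized to 0
        if (seqNum <= highest) return false;         // already processed: drop
        highest = seqNum;                            // advance the window
        return true;
    }

private:
    std::map<int, uint64_t> highestSeen_;
};
```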

10 [Figure: objects Blue, Pink, Red, and Green distributed across PE 0, PE 1, and PE 2. Each processor holds logs of the sequence-numbered messages its objects have sent (e.g., 11, 14, 15 from Blue and 31, 33 from Pink), and object checkpoints are written to checkpoint storage.]

11 Correctness
The result of the program should be unchanged by the crash
The state of the restarted chare, after it has received the resent messages, should be the same as before the crash

12 State of a Chare
Modified only by messages
- So the resent messages carry all the data needed to bring the chare up to date
Order of message processing
- The same messages processed in a different order might lead to a different chare state
- Messages must be processed in the same order after the restart as they were originally
- This order needn't be a specific order known to the user; any order selected by the system will do, as long as the system can repeat it after a crash

13 Without ordering
[Figure: chares A, B, C, and D exchange messages; with no imposed order, the receiver may pass through a different sequence of states S1, S2, S3 after a restart than it did originally.]

14 Solution of the Ordering Problem
Who decides the order?
- The best place to do it is the chare that is going to receive the messages
Get a ticket number from the receiver and label each message with it
Process all messages in increasing order of ticket numbers
Store a copy of the message, along with its ticket number, on the sender side
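
A minimal sketch of the receiver side of this ticket protocol (hypothetical names; the slides do not give the runtime's actual data structures):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch: the receiver hands out monotonically increasing
// ticket numbers and processes messages strictly in ticket order,
// buffering any message that arrives ahead of its turn.
class TicketedReceiver {
public:
    // A sender asks for a ticket before sending; the ticket is attached
    // to the message (and to the sender-side log entry).
    uint64_t requestTicket() { return nextTicket_++; }

    // Called when a ticketed message arrives, possibly out of order.
    void deliver(uint64_t ticket, std::vector<char> msg) {
        pending_[ticket] = std::move(msg);
        // Drain every buffered message whose turn has come.
        while (!pending_.empty() && pending_.begin()->first == nextToProcess_) {
            process(pending_.begin()->second);
            pending_.erase(pending_.begin());
            ++nextToProcess_;
        }
    }

private:
    void process(const std::vector<char> &msg) { /* application handler */ }

    uint64_t nextTicket_ = 0;     // next ticket to hand out
    uint64_t nextToProcess_ = 0;  // next ticket expected by the drain loop
    std::map<uint64_t, std::vector<char>> pending_;  // early arrivals
};
```

After a restart, the receiver's counters would be restored from its checkpoint and the senders would resend their logged, already-ticketed messages, so the drain loop replays them in the original order.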

15 With Ordering
[Figure: the same scenario with ticket numbers T1 and T2 attached to the messages, so the receiver reaches states S2 and S3 in an order that can be repeated after a restart.]

16 Pros and Cons
Advantages
- Defines an order among messages on the receiving side that can be repeated after a crash and restart
Disadvantages
- Increases the latency of communication
- Increases the per-message overhead

17 Logging local messages
When a processor crashes, both the receiving object and the message log of a local message disappear
Obvious solution:
- Get a ticket from the receiving object using a function call
- Send a copy of the message to a "buddy" processor and wait for the ack
- Then deliver the message to the receiver
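
A minimal sketch of this local-message path, reusing the TicketedReceiver sketch from the ordering slide (BuddyChannel is a hypothetical stand-in for the remote log and its acknowledgement):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct LocalMessage {
    uint64_t ticket;
    std::vector<char> payload;
};

// Hypothetical stand-in: ships a copy of a local message to the buddy
// processor and blocks until the buddy acknowledges having logged it.
struct BuddyChannel {
    void logAndWaitForAck(const LocalMessage &m) {
        // In a real runtime: a network send followed by a blocking wait
        // for the buddy's acknowledgement.
        (void)m;
    }
};

// Deliver a message between two objects on the same processor.
void sendLocal(TicketedReceiver &receiver, BuddyChannel &buddy,
               std::vector<char> payload) {
    // 1. Get a ticket directly from the receiver: a plain function call,
    //    since sender and receiver share a processor.
    uint64_t ticket = receiver.requestTicket();

    // 2. Log a copy on the buddy processor and wait for the ack, so the
    //    message survives even if this whole processor crashes.
    buddy.logAndWaitForAck({ticket, payload});

    // 3. Only now deliver the message locally.
    receiver.deliver(ticket, std::move(payload));
}
```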

18 Implementation Issues
Migration makes things much more complicated
As a first cut, we are implementing it for groups
- They don't migrate
- Much simpler than chare arrays

19 Status
Ongoing project
- A fault-tolerant version of Charm++
- Currently aimed at small clusters
- The present implementation is limited to non-migrating objects
- Testing on simple test cases like Jacobi

20 Future Work
Immediate aims
- Extend the implementation to cover migratable objects
- A detection scheme suitable for BlueGene/L
- Test it on the BlueGene simulator
- Implement a fault-tolerant version of Adaptive MPI
- Optimize performance to reduce runtime overhead
- Test a full-scale application like NAMD on fault-tolerant Charm++ running on the BlueGene simulator

