Fault Tolerance in Charm++
Sayantan Chakravorty
Parallel Programming Laboratory
Overview
- Motivation
- Research Goals
- Basic Scheme
- Problems
- Solutions
- Status
Motivation
- As machines grow in size, MTTF drops. Plausible figures:
    Number of nodes:      100,000
    MTTF for a node:      4 years (99.997% reliability)
    MTTF for the system:  20 minutes
- At the same time, checkpointing time is higher and restart time is higher.
- If the MTBF is shorter than the restart time, checkpoint/restart becomes impossible.
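These figures are consistent under the usual assumption that node failures are independent, so the system MTTF is the per-node MTTF divided by the node count:

    MTTF_system ≈ MTTF_node / N = (4 × 365 × 24 × 60 min) / 100,000 ≈ 21 min ≈ 20 min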
Costly Checkpoint/Restart
- Synchronous checkpoints are too costly.
- Asynchronous checkpoints can cause cascading rollbacks.
- All nodes must be restarted after a crash: it is an inefficient use of resources to restart 99,999 nodes just because 1 crashed. Even nodes that are independent of the crashed node are restarted.
- With a low MTBF, a large fraction of computation time is wasted rolling back and re-computing.
Idea
[Figure: application progress vs. execution time around a crash, comparing plain checkpoint/restart with our scheme; T_c marks the checkpoint cost and T_p the restart cost]
Research Goals
- Asynchronous checkpoints: each process takes a checkpoint independently, with no synchronization overhead; processes need not stop while checkpointing.
- Prevent cascading rollbacks.
- Restart the crashed processor: only the crashed processor should be rolled back to its previous checkpoint.
- Fast restart.
Research Goals (contd)
- Low runtime cost: while the system is fault-free, the cost of fault tolerance should be low.
- Implemented in Charm++: its virtualization and message-driven paradigm make Charm++ latency tolerant, and migration of objects is available.
- Extend to Adaptive MPI.
Basic Scheme
- Each object takes its checkpoint asynchronously.
- An object logs every message it sends to a different object (sketched below).
- When a process crashes, a replacement is started, either on a pool of spare processors or as another process on the same processor.
- Objects may later be migrated away from the replacement process, and the residual process cleaned up.
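A minimal sketch of the sender-side message log in plain C++ (the class and member names are illustrative, not the Charm++ API):

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sender-side message log: every outgoing message is
// copied into the log, keyed by destination, before it is sent.
struct LoggedMessage {
    uint64_t seqNum;              // per-destination sequence number
    std::vector<char> payload;    // serialized message contents
};

class MessageLog {
public:
    // Record a copy of an outgoing message; return its sequence number.
    uint64_t logSend(int destObject, const std::vector<char>& payload) {
        uint64_t seq = ++nextSeq_[destObject];   // numbering starts at 1
        log_[destObject].push_back({seq, payload});
        return seq;
    }

    // After a crash, replay every message sent to destObject since its
    // last checkpoint (identified by the checkpointed sequence number).
    template <typename ResendFn>
    void resendSince(int destObject, uint64_t lastCheckpointedSeq, ResendFn resend) {
        for (const LoggedMessage& m : log_[destObject])
            if (m.seqNum > lastCheckpointedSeq)
                resend(destObject, m);
    }

    // Once destObject has checkpointed, earlier entries can be discarded.
    void truncate(int destObject, uint64_t checkpointedSeq) {
        auto& entries = log_[destObject];
        entries.erase(std::remove_if(entries.begin(), entries.end(),
                          [&](const LoggedMessage& m) { return m.seqNum <= checkpointedSeq; }),
                      entries.end());
    }

private:
    std::map<int, uint64_t> nextSeq_;
    std::map<int, std::vector<LoggedMessage>> log_;
};
```

Logging on the sender side means the log survives a crash of the receiver's processor; local messages, where sender and receiver crash together, are the exception handled later.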
The Rest of the Scheme
- A restarted object resumes from its last checkpoint.
- All objects that sent messages to the restarted object must resend every message it received since that checkpoint.
- Duplicate messages generated by reprocessing the resent messages must be ignored; sequence-number-based windowing can be used (sketched below).
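A sketch of receiver-side duplicate filtering, assuming each sender numbers its messages sequentially as in the log sketch above (the names are illustrative):

```cpp
#include <cstdint>
#include <map>

// Hypothetical receiver-side filter: messages carry a per-sender
// sequence number, and anything at or below the highest number
// already processed from that sender is a duplicate.
class DuplicateFilter {
public:
    // Returns true if the message is new and should be processed.
    bool accept(int sender, uint64_t seqNum) {
        uint64_t& highest = highestSeen_[sender];
        if (seqNum <= highest)
            return false;      // already processed: drop the duplicate
        highest = seqNum;      // advance the window
        return true;
    }
private:
    std::map<int, uint64_t> highestSeen_;  // per-sender high-water mark
};
```

Because resent messages carry their original sequence numbers, anything at or below the high-water mark is recognized as a duplicate and dropped.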
Parallel Programming Laboratory10 Blue Pink31 33 Red Pink Blue Green , 14, 15 15, 16, 17 31, , 13 32, 34 PE 0 PE 1 PE 2 Checkpoint Storage
Correctness
- The result of the program should be unchanged by the crash.
- The state of the restarted chare, after it has received the resent messages, should be the same as before the crash.
State of a Chare
- A chare's state is modified only by messages, so the resent messages carry all the data needed to bring the chare up to date.
- Order of message processing matters: the same messages processed in a different order might lead to a different chare state.
- Messages must therefore be processed in the same order after the restart as they were originally.
- This order needn't be a specific order known to the user; any order selected by the system will do, as long as the system can repeat it after a crash.
[Figure: without ordering, object A may process messages S1, S2, and S3 from B, C, and D in a different order after a restart]
Solution to the Ordering Problem
- Who decides the order? The best place to do it is the chare that is going to receive the messages.
- The sender gets a ticket number from the receiver and labels each message with it.
- The receiver processes all messages in increasing order of ticket numbers (see the sketch after the next figure).
- A copy of the message, along with its ticket number, is stored on the sender side.
Parallel Programming Laboratory15 B A C D T 1 S 3 S 2 T 2 With Ordering
Pros and Cons
- Advantage: defines an order among messages on the receiving side that can be repeated after a crash and restore.
- Disadvantages: the latency of communication increases, and the per-message overhead grows.
Logging Local Messages
- When a processor crashes, both the receiving object and the message log of a local message disappear together.
- Obvious solution: get a ticket from the receiving object using a function call; send a copy of the message to a "buddy" processor and wait for the ack; then deliver the message to the receiver (sketched below).
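A sketch of that three-step local-message path; Receiver and Buddy are assumed interfaces for illustration, not Charm++ classes:

```cpp
#include <cstdint>
#include <map>
#include <vector>

using Message = std::vector<char>;

// Assumed local receiver: hands out tickets via a plain function call,
// since sender and receiver share a processor.
struct Receiver {
    uint64_t nextTicket = 0;
    uint64_t requestTicket() { return nextTicket++; }
    void deliver(uint64_t ticket, const Message& msg) { /* process msg */ }
};

// Assumed buddy: holds the log copy on a *different* processor, so it
// survives a crash here. The real call would block until acknowledged.
struct Buddy {
    std::map<uint64_t, Message> log;
    void store(uint64_t ticket, const Message& msg) { log[ticket] = msg; }
};

void sendLocalMessage(Receiver& recv, Buddy& buddy, const Message& msg) {
    uint64_t ticket = recv.requestTicket();  // 1. ticket via function call
    buddy.store(ticket, msg);                // 2. remote log copy, wait for ack
    recv.deliver(ticket, msg);               // 3. only then deliver locally
}
```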
Implementation Issues
- Migration makes things much more complicated.
- As a first cut, we are implementing the scheme for groups: they don't migrate and are much simpler than chare arrays.
Status
- Ongoing project: a fault-tolerant version of Charm++.
- Currently aimed at small clusters.
- The present implementation is limited to non-migrating objects.
- Testing on simple test cases such as Jacobi.
Future Work
- Immediate aims: extend the implementation to cover migratable objects; design a failure-detection scheme suitable for BlueGene/L and test it on the BlueGene simulator.
- Implement a fault-tolerant version of Adaptive MPI.
- Optimize performance to reduce runtime overhead.
- Test a full-scale application such as NAMD on fault-tolerant Charm++ running on the BlueGene simulator.