Snapshots, checkpoints, rollback, and restart Larry Rudolph MIT CSAIL January 2005
Overview Computers fail When they fail, work gets lost Should save work often to reduce the pain But I back up my disk only when I hear that someone else’s disk crashed Does this make sense?
Checkpoints This talk is concerned about running programs crashing (not disks) We use the term “checkpoint” to be a process that stores sufficient program state to nonvolatile storage (disk) so that the program can be restarted from that point
Do Checkpoints matter? For short computations, just rerun -- NO For long running computations -- YES An OS runs long but it doesn’t matter For parallel programs, it is harder (interesting) Stop all processes and all communication Write all relevant state to disk
OS does all the important stuff Checkpointing incurs overhead stops computation; uses network & disk bw System initiated (helps users, system pays) airlines would rather you bought another ticket if flight was cancelled Application initiated (looking out for #1) Knows best places, e.g. outer loop
Checkpointing Overhead Failure I C Assume periodic checkpointing done every I work period the overhead of checkpointing is C When failure happens, redo work since ckpt
Checkpointing Overhead Failure I C Don’t do checkpoints if C > I Surprisingly, our study show that this improves performance of existing systems!
Key Idea: Just say no (if you don’t want to do it) Application initiates checkpoints (at best place) It asks the system to actually perform it The system can decide to not do it The system knows more about the overall system state than the application
Key Idea: Just say no (if you don’t want to)
Risk-based decision Recall, skip checkpoint if I < C (not worth it) Really want to say: Cost of doing ckpt vs cost of not doing ckpt Let p be probability of a failure skip checkpoint if pI < C
Derivation Expected cost of skipping > expected cost of performing p(2I + C) + (1-p)(0) > p(I + 2C) + (1-p)C pI > pC + (1-p)C pI > C
Failure prediction Examination of logs, can predict failures based on temp, power spikes, retries, alarms have already shown success rate of 70% System has good idea of “p” System has good idea of cost of failure
Results are good (trust me) Hard to believe real numbers hard to get real statistics
When can I trust you? If I ask someone to deliver a message My wife will; my kids might forget The post office will always do it The internet will only make a best effort (TCP/IP Drops packets) If I ask you to save my seat, will you?
How to do checkpoints? Shared memory supercomputers, no choice On message-passing clusters, can do better
Fault Tolerance Example: bank a sends $50 to bank b a must remove $50, b must add $50 If a fails after sending, they both have the $50 If b fails after receiving, neither has the $50
Checkpoint in message- passing systems Each processor or node does a checkpoint of its local state The system takes a snapshot, checkpoints plus messages not yet part of state
Consistent State A system is consistent if: all received messages have been sent all sent messages have been received If an agent has rolled back, all msgs sent since previous checkpoint are invalid and must be removed. If an agent saves its state after receiving a msg, the sender must also save its state
Chandy & Lamport One guy begins a snapshot by sending markers to everyone else Any agent getting a marker, sends markers to everyone else etc ...
Marker Sent Before Message Ensure consistency during snapshot
C&L Evaluation Good news: it works Bad news: too many messages
Extension Do not snapshot the entire system Snapshot only connected components Let any node initiate a snapshot Let any node initiate a roll back
Snapshot per subgroup multiple snapshots per column Agents that fail cause only column to rollback with lots of failures, there is better progress
Snapshot per subgroup When there is a phase change, may need all to roll back If happens rarely, ok Can adapt snapshot time to match this model Take snapshot when connected component size is small enough System-wide knowledge
Conclusion Failures are part of life Must learn how to mitigate their effect Want to understand differences/similarities between parallel programs and pervasive Humans intuitively know some of this but ...