Snapshots, checkpoints, rollback, and restart

Larry Rudolph MIT CSAIL January 2005

Overview
Computers fail
When they fail, work gets lost
Should save work often to reduce the pain
But I back up my disk only when I hear that someone else’s disk crashed
Does this make sense?

Checkpoints
This talk is concerned with running programs crashing (not disks)
We use the term “checkpoint” for a procedure that stores sufficient program state to nonvolatile storage (disk) so that the program can be restarted from that point

Do Checkpoints matter?
For short computations, just rerun -- NO
For long running computations -- YES
An OS runs long but it doesn’t matter
For parallel programs, it is harder (interesting)
Stop all processes and all communication
Write all relevant state to disk

OS does all the important stuff
Checkpointing incurs overhead: stops computation; uses network & disk bandwidth
System initiated (helps users, system pays)
airlines would rather you bought another ticket if flight was cancelled
Application initiated (looking out for #1)
Knows best places, e.g. outer loop

Checkpointing Overhead
[Figure: timeline labelled Failure, I, C]
Assume periodic checkpointing, done after every work period of length I
The overhead of each checkpoint is C
When a failure happens, redo the work done since the last checkpoint
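The model on this slide can be tried out with a small simulation. This is only a sketch under stated assumptions: the function name `run_with_checkpoints` and its parameters are hypothetical, restart itself is assumed to cost nothing, and a failure during a checkpoint is assumed to lose that checkpoint too.

```python
def run_with_checkpoints(total_work, interval, ckpt_cost, failure_times):
    """Return the wall-clock time needed to finish total_work units of work.

    interval  -- work done between checkpoints (I on the slide)
    ckpt_cost -- time spent writing one checkpoint (C on the slide)
    failure_times -- wall-clock instants at which a failure occurs
    """
    failures = sorted(failure_times)
    now = 0.0      # current wall-clock time
    saved = 0.0    # work safely stored by the last completed checkpoint
    while saved < total_work:
        segment = min(interval, total_work - saved)
        end = now + segment + ckpt_cost
        # First failure that lands inside this work period or its checkpoint.
        hit = next((f for f in failures if now <= f < end), None)
        if hit is None:
            saved += segment       # checkpoint completed, work is safe
            now = end
        else:
            failures.remove(hit)
            now = hit              # work since the last checkpoint is redone
    return now
```

With 10 units of work, I = 5 and C = 1, a failure-free run takes 12 time units; a single failure at time 8 throws away 2 units of work and stretches the run to 14.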

Checkpointing Overhead
[Figure: timeline labelled Failure, I, C]
Don’t do checkpoints if C > I
Surprisingly, our study shows that this improves performance of existing systems!

Key Idea: Just say no (if you don’t want to do it)
Application initiates checkpoints (at best place)
It asks the system to actually perform it
The system can decide to not do it
The system knows more about the overall system state than the application

Key Idea: Just say no (if you don’t want to)

Risk-based decision
Recall, skip checkpoint if I < C (not worth it)
Really want to compare: cost of doing ckpt vs cost of not doing ckpt
Let p be the probability of a failure
Skip checkpoint if pI < C

Derivation
Perform the checkpoint when: expected cost of skipping > expected cost of performing
p(2I + C) + (1-p)(0) > p(I + 2C) + (1-p)C
pI > pC + (1-p)C
pI > C
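The two sides of the inequality can be written out directly, which makes it easy to check that comparing expected costs gives the same answer as the short rule pI > C. The function names here are hypothetical; the cost expressions are copied from the slide as given.

```python
def expected_cost_skip(p, I, C):
    """Expected cost of skipping the checkpoint: p(2I + C) + (1-p)(0)."""
    return p * (2 * I + C) + (1 - p) * 0


def expected_cost_perform(p, I, C):
    """Expected cost of performing the checkpoint: p(I + 2C) + (1-p)C."""
    return p * (I + 2 * C) + (1 - p) * C


def should_checkpoint(p, I, C):
    """Perform the checkpoint only when skipping is expected to cost more."""
    return expected_cost_skip(p, I, C) > expected_cost_perform(p, I, C)
```

For example, with I = 10 and C = 2, a failure probability of 0.5 gives pI = 5 > 2, so the checkpoint is taken; with p = 0.1, pI = 1 < 2 and the system says no.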

Failure prediction
By examining logs, can predict failures
based on temp, power spikes, retries, alarms
have already shown success rate of 70%
System has good idea of “p”
System has good idea of cost of failure

Results are good (trust me)
Hard to believe real numbers
Hard to get real statistics

When can I trust you?
If I ask someone to deliver a message
My wife will; my kids might forget
The post office will always do it
The internet will only make a best effort (IP may drop packets)
If I ask you to save my seat, will you?

How to do checkpoints?
Shared memory supercomputers, no choice
On message-passing clusters, can do better

Fault Tolerance
Example: bank a sends $50 to bank b
a must remove $50, b must add $50
If a fails after sending and rolls back, they both have the $50
If b fails after receiving and rolls back, neither has the $50
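The bank example can be replayed in a few lines. This is a sketch, not anything from the talk: the `Bank` class and `transfer_then_fail` are hypothetical names, both banks are assumed to start with $100, and each is assumed to have checkpointed just before the transfer.

```python
class Bank:
    def __init__(self, balance):
        self.balance = balance
        self._saved = balance      # last checkpointed balance

    def checkpoint(self):
        self._saved = self.balance

    def rollback(self):
        self.balance = self._saved


def transfer_then_fail(failing):
    """Bank a sends $50 to bank b, then the named bank fails and rolls back.

    Returns the final (a, b) balances.
    """
    a, b = Bank(100), Bank(100)
    a.checkpoint()
    b.checkpoint()
    a.balance -= 50                # a removes $50 and sends it
    b.balance += 50                # b receives it and adds $50
    (a if failing == "a" else b).rollback()
    return a.balance, b.balance
```

`transfer_then_fail("a")` ends with balances (100, 150): the $50 exists twice. `transfer_then_fail("b")` ends with (50, 100): the $50 has vanished. In both cases only one side rolled back, which is what a consistent snapshot has to rule out.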

Checkpoint in message-passing systems
Each processor or node does a checkpoint of its local state
The system takes a snapshot: the checkpoints plus the messages not yet part of any state

Consistent State
A system is consistent if:
all received messages have been sent
all sent messages have been received
If an agent has rolled back, all msgs sent since its previous checkpoint are invalid and must be removed
If an agent saves its state after receiving a msg, the sender must also save its state
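The two conditions can be checked mechanically once each checkpoint records which messages its agent had sent and received. The sketch below assumes every message carries a unique id and that the function name `check_snapshot` and its dictionary layout are free choices, not part of the talk.

```python
def check_snapshot(sent, received):
    """Compare the send and receive records stored in a set of checkpoints.

    sent[p]     -- ids of messages agent p had sent when it checkpointed
    received[p] -- ids of messages agent p had received when it checkpointed

    Returns (orphans, in_flight):
      orphans   -- received but never sent according to the checkpoints
      in_flight -- sent but not yet received according to the checkpoints
    """
    all_sent = set().union(*sent.values())
    all_received = set().union(*received.values())
    orphans = all_received - all_sent
    in_flight = all_sent - all_received
    return orphans, in_flight
```

A non-empty `orphans` set means some sender would have to roll back further (or the receiver must discard the message). A non-empty `in_flight` set is acceptable only if the snapshot stores those messages alongside the checkpoints.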

Chandy & Lamport
One agent begins a snapshot by saving its state and sending markers to everyone else
Any agent getting a marker for the first time saves its state and sends markers to everyone else
etc ...

Marker Sent Before Message
Ensure consistency during snapshot
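The marker rule can be illustrated with a toy simulation of a Chandy & Lamport style snapshot. It is a sketch under several assumptions that do not come from the slides: channels are FIFO queues between every ordered pair of nodes, application state is a single integer balance, and all class and method names are hypothetical.

```python
from collections import deque

MARKER = "MARKER"


class Node:
    def __init__(self, state):
        self.state = state          # live application state (a balance)
        self.recorded_state = None  # local checkpoint, None until taken
        self.channel_log = {}       # sender -> messages recorded as in flight
        self.recording = set()      # incoming channels still being recorded


class Network:
    def __init__(self, states):
        self.nodes = {n: Node(s) for n, s in states.items()}
        # One FIFO channel per ordered pair (sender, receiver).
        self.channels = {(s, r): deque()
                         for s in states for r in states if s != r}

    def send(self, src, dst, amount):
        self.nodes[src].state -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, name):
        """Checkpoint `name` and put a marker on each outgoing channel."""
        node = self.nodes[name]
        node.recorded_state = node.state
        others = [n for n in self.nodes if n != name]
        node.recording = set(others)
        node.channel_log = {n: [] for n in others}
        for n in others:
            # The marker is queued before any message sent after the checkpoint.
            self.channels[(name, n)].append(MARKER)

    def deliver(self, src, dst):
        """Deliver the oldest item on the channel src -> dst."""
        msg = self.channels[(src, dst)].popleft()
        node = self.nodes[dst]
        if msg is MARKER:
            if node.recorded_state is None:
                self.start_snapshot(dst)   # first marker: checkpoint now
            node.recording.discard(src)    # stop recording this channel
        else:
            node.state += msg
            if src in node.recording:
                # Arrived after our checkpoint but before src's marker.
                node.channel_log[src].append(msg)

    def snapshot_total(self):
        """Money in the snapshot: checkpoints plus recorded in-flight messages."""
        total = sum(n.recorded_state for n in self.nodes.values())
        total += sum(sum(msgs) for n in self.nodes.values()
                     for msgs in n.channel_log.values())
        return total
```

Because each marker travels behind every message its sender issued before checkpointing, a message is either reflected in the receiver's checkpoint or logged as in flight, never both and never neither. Running a transfer in each direction while a snapshot is in progress and then calling `snapshot_total()` gives the same total as before the transfers.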

C&L Evaluation
Good news: it works
Bad news: too many messages

Extension
Do not snapshot the entire system
Snapshot only connected components
Let any node initiate a snapshot
Let any node initiate a roll back
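One way to read "connected components" is as the set of nodes linked, directly or indirectly, by messages exchanged since their last checkpoints. That reading is an assumption, as are the function name `snapshot_group` and the `(sender, receiver)` pair format used in this sketch.

```python
def snapshot_group(start, messages):
    """Return the nodes that must snapshot (or roll back) together with start.

    messages -- iterable of (sender, receiver) pairs exchanged since the
                last checkpoints; direction is ignored when linking nodes.
    """
    neighbours = {}
    for sender, receiver in messages:
        neighbours.setdefault(sender, set()).add(receiver)
        neighbours.setdefault(receiver, set()).add(sender)

    group, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for other in neighbours.get(node, ()):
            if other not in group:
                group.add(other)
                frontier.append(other)
    return group
```

With messages a→b, b→c and d→e, a snapshot started at a involves only {a, b, c}; nodes d and e are untouched, and a node that has exchanged no messages forms a group of one.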

Snapshot per subgroup
multiple snapshots per column
Agents that fail cause only their column to roll back
with lots of failures, there is better progress

Snapshot per subgroup
When there is a phase change, may need all to roll back
If happens rarely, ok
Can adapt snapshot time to match this model
Take snapshot when connected component size is small enough
System-wide knowledge

Conclusion
Failures are part of life
Must learn how to mitigate their effect
Want to understand differences/similarities between parallel programs and pervasive
Humans intuitively know some of this but ...
