Snapshots, checkpoints, rollback, and restart


1 Snapshots, checkpoints, rollback, and restart
Larry Rudolph MIT CSAIL January 2005

2 Overview
Computers fail
When they fail, work gets lost
Should save work often to reduce the pain
But I back up my disk only when I hear that someone else’s disk crashed
Does this make sense?

3 Checkpoints
This talk is concerned with running programs crashing (not disks)
We use the term “checkpoint” to mean a procedure that stores sufficient program state in nonvolatile storage (disk) so that the program can be restarted from that point
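
A minimal sketch of that definition, assuming the program state fits in a small dictionary and that Python's `pickle` is an acceptable on-disk format. The names `save_checkpoint`, `load_checkpoint`, and `run`, and the `crash_at` parameter used to simulate a failure, are illustrative and not from the talk.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write state to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # push the bytes to nonvolatile storage
    os.replace(tmp_path, path)     # a crash never leaves a half-written file

def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)

def run(path, total_steps, checkpoint_every, crash_at=None):
    """Sum 0..total_steps-1, resuming from the checkpoint at path if present."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

Calling `run` again with the same path after a crash resumes from the last saved step instead of starting over; only the work done since that checkpoint is repeated.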

4 Do Checkpoints matter?
For short computations, just rerun -- NO
For long running computations -- YES
  An OS runs long but it doesn’t matter
For parallel programs, it is harder (interesting)
  Stop all processes and all communication
  Write all relevant state to disk

5 OS does all the important stuff
Checkpointing incurs overhead
  stops computation; uses network & disk bandwidth
System initiated (helps users, system pays)
  airlines would rather you bought another ticket if your flight was cancelled
Application initiated (looking out for #1)
  Knows best places, e.g. outer loop

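6 Checkpointing Overhead
[Figure: timeline of work periods I, each followed by a checkpoint C, with a failure marked]
Assume periodic checkpointing is done after every work period I
The overhead of checkpointing is C
When a failure happens, redo the work done since the last checkpoint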
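
The model on this slide can be sketched as a little arithmetic. This assumes a strictly periodic timeline (work for I, then checkpoint for C, repeating from time 0) and that a checkpoint interrupted by the failure is unusable; the function names are illustrative.

```python
def work_lost(failure_time, interval, ckpt_cost):
    """Work that must be redone when a failure hits at failure_time.

    Assumed timeline: [work I][checkpoint C][work I][checkpoint C]...
    """
    offset = failure_time % (interval + ckpt_cost)
    # During the work phase, everything since the last checkpoint is lost.
    # During the checkpoint phase, the checkpoint is incomplete, so the
    # whole preceding work period is lost.
    return min(offset, interval)

def overhead_fraction(interval, ckpt_cost):
    """Share of failure-free running time spent writing checkpoints."""
    return ckpt_cost / (interval + ckpt_cost)
```

For example, with I = 10 and C = 2, a failure at time 25 falls one unit into the third work period, so one unit of work is redone.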

7 Checkpointing Overhead
[Figure: the same timeline of work periods I, checkpoints C, and a failure]
Don’t do checkpoints if C > I
Surprisingly, our study shows that this improves performance of existing systems!

8 Key Idea: Just say no (if you don’t want to do it)
Application initiates checkpoints (at best place)
It asks the system to actually perform it
The system can decide to not do it
The system knows more about the overall system state than the application
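
The request-and-decline handshake above can be sketched as follows. `CheckpointGate`, `run_application`, and the `policy` callable are hypothetical names for illustration; the talk does not specify an interface.

```python
class CheckpointGate:
    """System-side gatekeeper: the application asks, the system may decline."""

    def __init__(self, policy):
        self.policy = policy   # callable returning True if the system agrees
        self.performed = 0
        self.skipped = 0

    def request(self, do_checkpoint):
        """Run do_checkpoint only if the system's policy allows it."""
        if self.policy():
            do_checkpoint()
            self.performed += 1
            return True
        self.skipped += 1
        return False

def run_application(gate, iterations):
    """Application side: request a checkpoint at the end of each outer loop."""
    saved_at = []
    for i in range(iterations):
        # ... one outer-loop iteration of real work would go here ...
        gate.request(lambda i=i: saved_at.append(i))
    return saved_at
```

The application keeps choosing where a checkpoint is cheapest; the policy, which can see system-wide state, chooses whether it happens.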

9 Key Idea: Just say no (if you don’t want to)

10 Risk-based decision
Recall, skip checkpoint if I < C (not worth it)
Really want to say: cost of doing ckpt vs cost of not doing ckpt
Let p be the probability of a failure
Skip checkpoint if pI < C

11 Derivation
Perform the checkpoint when: expected cost of skipping > expected cost of performing
p(2I + C) + (1-p)(0) > p(I + 2C) + (1-p)C
pI > pC + (1-p)C
pI > C
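
A short sketch that evaluates both sides of the inequality numerically. The cost terms are copied from the slide; the function names are illustrative.

```python
def expected_cost_skip(p, I, C):
    """Expected cost if the checkpoint is skipped (left-hand side)."""
    return p * (2 * I + C) + (1 - p) * 0

def expected_cost_perform(p, I, C):
    """Expected cost if the checkpoint is performed (right-hand side)."""
    return p * (I + 2 * C) + (1 - p) * C

def should_checkpoint(p, I, C):
    """True when skipping is expected to cost more than performing."""
    return expected_cost_skip(p, I, C) > expected_cost_perform(p, I, C)
```

The difference between the two costs simplifies to pI - C, so with I = 10 and C = 2 the checkpoint is worth performing at p = 0.5 (pI = 5) but not at p = 0.1 (pI = 1).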

12 Failure prediction
By examining logs, failures can be predicted
  based on temperature, power spikes, retries, alarms
  a success rate of 70% has already been shown
System has a good idea of “p”
System has a good idea of the cost of failure

13 Results are good (trust me)
Hard to believe real numbers
Hard to get real statistics

14 When can I trust you?
If I ask someone to deliver a message
  My wife will; my kids might forget
  The post office will always do it
  The internet will only make a best effort (TCP/IP drops packets)
If I ask you to save my seat, will you?

15 How to do checkpoints? Shared memory supercomputers, no choice
On message-passing clusters, can do better

16 Fault Tolerance
Example: bank a sends $50 to bank b
a must remove $50, b must add $50
If a fails after sending, they both have the $50
If b fails after receiving, neither has the $50
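
The two failure cases can be reproduced in a few lines. This sketch assumes each bank starts with $100, took a local checkpoint just before the transfer, and recovers by rolling back to that checkpoint; those details are illustrative and not from the slide.

```python
def transfer_with_failure(failed):
    """Return (balance_a, balance_b) after a $50 transfer and one failure.

    failed is "a", "b", or None for the failure-free case.
    """
    a, b = 100, 100
    ckpt_a, ckpt_b = a, b   # independent local checkpoints before the transfer
    a -= 50                 # a removes $50 and sends the message
    b += 50                 # b receives the message and adds $50
    if failed == "a":
        a = ckpt_a          # a rolls back and forgets that it paid
    elif failed == "b":
        b = ckpt_b          # b rolls back and forgets that it was paid
    return a, b
```

The banks hold $200 in total. After a's rollback the balances sum to $250 (both have the $50); after b's rollback they sum to $150 (neither has it).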

17 Checkpoint in message-passing systems
Each processor or node does a checkpoint of its local state
The system takes a snapshot: checkpoints plus messages not yet part of state

18 Consistent State
A system is consistent if:
  all received messages have been sent
  all sent messages have been received
If an agent has rolled back, all messages sent since the previous checkpoint are invalid and must be removed
If an agent saves its state after receiving a message, the sender must also save its state
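
The two conditions can be checked mechanically over a set of local checkpoints. This sketch assumes each checkpoint records the IDs of the messages its agent has sent and received; that record layout and the name `check_consistency` are illustrative.

```python
def check_consistency(checkpoints):
    """Return (orphans, in_flight) for a set of local checkpoints.

    checkpoints maps agent name -> {"sent": set of ids, "received": set of ids}.
    orphans:   received according to one checkpoint, sent according to none.
    in_flight: sent according to one checkpoint, received according to none.
    """
    sent, received = set(), set()
    for state in checkpoints.values():
        sent |= state["sent"]
        received |= state["received"]
    return sorted(received - sent), sorted(sent - received)

def is_consistent(checkpoints):
    """Both lists empty: every received message was sent, and vice versa."""
    orphans, in_flight = check_consistency(checkpoints)
    return not orphans and not in_flight
```

An orphan means the receiver saved its state after a message that the sender's checkpoint does not remember sending, which is the case the last rule on the slide forbids. An in-flight message has to be saved as part of the snapshot or it is lost on restart.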

19 Chandy & Lamport
One guy begins a snapshot by sending markers to everyone else
Any agent getting a marker sends markers to everyone else, etc.

20 Marker Sent Before Message
Ensure consistency during snapshot
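
A minimal single-threaded simulation of the marker protocol, assuming reliable FIFO channels between every pair of agents and account balances as the local state. The `Network` and `Process` classes and the fixed delivery order are simplifications for illustration, not the talk's implementation.

```python
from collections import deque

class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.recorded_balance = None  # local state captured for the snapshot
        self.recording = {}           # sender -> still recording that channel?
        self.channel_log = {}         # sender -> amounts caught in flight

class Network:
    def __init__(self, balances):
        self.procs = {n: Process(n, b) for n, b in balances.items()}
        # One FIFO channel per ordered pair (sender, receiver).
        self.channels = {(s, r): deque()
                         for s in balances for r in balances if s != r}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(("money", amount))

    def _record_and_send_markers(self, p):
        p.recorded_balance = p.balance
        for (s, r), chan in self.channels.items():
            if s == p.name:
                chan.append(("marker", None))  # marker follows earlier messages
            if r == p.name:
                p.recording[s] = True
                p.channel_log[s] = []

    def start_snapshot(self, initiator):
        self._record_and_send_markers(self.procs[initiator])

    def deliver(self, src, dst):
        kind, amount = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if kind == "money":
            p.balance += amount
            if p.recording.get(src):
                p.channel_log[src].append(amount)  # in flight at snapshot time
        else:
            if p.recorded_balance is None:
                self._record_and_send_markers(p)   # first marker seen
            p.recording[src] = False               # this channel is now closed

    def run_to_completion(self):
        while any(self.channels.values()):
            for (s, r), chan in self.channels.items():
                if chan:
                    self.deliver(s, r)

    def snapshot_total(self):
        return sum(p.recorded_balance
                   + sum(sum(log) for log in p.channel_log.values())
                   for p in self.procs.values())
```

Because a marker travels behind every message sent before the sender recorded its state, each such message is either already in the receiver's recorded state or captured in its channel log, so the snapshot total matches the real total.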

21

22 C&L Evaluation
Good news: it works
Bad news: too many messages

23 Extension
Do not snapshot the entire system
Snapshot only connected components
Let any node initiate a snapshot
Let any node initiate a rollback
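
One way to find the group that has to take part, sketched under the assumption that the system keeps a list of which pairs of nodes have exchanged messages since the last snapshot. The name `snapshot_group` and that input format are illustrative.

```python
from collections import defaultdict, deque

def snapshot_group(initiator, communications):
    """Return the connected component of the communication graph
    that contains the initiating node.

    communications is an iterable of (node, node) pairs that have
    exchanged messages since the last snapshot.
    """
    neighbours = defaultdict(set)
    for a, b in communications:
        neighbours[a].add(b)
        neighbours[b].add(a)

    group = {initiator}
    frontier = deque([initiator])
    while frontier:                       # breadth-first search
        node = frontier.popleft()
        for peer in neighbours[node]:
            if peer not in group:
                group.add(peer)
                frontier.append(peer)
    return group
```

Nodes outside the returned set have not communicated with the initiator, directly or indirectly, so under this assumption they can be left out of the snapshot or rollback.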

24 Snapshot per subgroup
Multiple snapshots per column
Agents that fail cause only their column to roll back
With lots of failures, there is better progress

25 Snapshot per subgroup
When there is a phase change, all may need to roll back
  If this happens rarely, ok
Can adapt snapshot time to match this model
  Take snapshot when connected component size is small enough
System-wide knowledge

26 Conclusion
Failures are part of life
Must learn how to mitigate their effect
Want to understand differences/similarities between parallel programs and pervasive
Humans intuitively know some of this but ...

