Snapshots, checkpoints, rollback, and restart

Snapshots, checkpoints, rollback, and restart Larry Rudolph MIT CSAIL January 2005

Overview Computers fail. When they fail, work gets lost. Should save work often to reduce the pain. But I back up my disk only when I hear that someone else's disk crashed. Does this make sense?

Checkpoints This talk is concerned with running programs crashing (not disks). We use the term "checkpoint" for the process of storing sufficient program state to nonvolatile storage (disk) so that the program can be restarted from that point.

Do Checkpoints matter? For short computations, just rerun -- NO. For long-running computations -- YES. An OS runs long, but it doesn't matter. For parallel programs, it is harder (interesting): stop all processes and all communication, then write all relevant state to disk.

OS does all the important stuff Checkpointing incurs overhead: it stops computation and uses network and disk bandwidth. System initiated (helps users, system pays): airlines would rather you bought another ticket if your flight was cancelled. Application initiated (looking out for #1): the application knows the best places to checkpoint, e.g. the outer loop.

Checkpointing Overhead Assume periodic checkpointing is done after every work period I; the overhead of each checkpoint is C. When a failure happens, redo all work since the last checkpoint.

Checkpointing Overhead Don't do checkpoints if C > I. Surprisingly, our study shows that this rule alone improves the performance of existing systems!
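The cost model on these slides can be sketched as a small simulation. Everything below is my own illustration (function and parameter names are not from the talk): work proceeds in intervals of length I, a checkpoint costs C, and a failure during an interval rolls the program back to its last checkpoint, or to the start if it never checkpointed.

```python
import random

def time_to_finish(total_work, interval, ckpt_cost, p_fail, checkpoint, seed=1):
    """Total time to complete `total_work` units of work.

    Work proceeds in intervals of length `interval` (the I of the slide);
    after each successful interval we may pay `ckpt_cost` (the C) to
    checkpoint.  With probability `p_fail` an interval is hit by a failure
    and we roll back to the last checkpoint -- or to the very start if we
    never checkpointed."""
    rng = random.Random(seed)
    t = 0.0         # elapsed wall-clock time
    done = 0.0      # work preserved by the last checkpoint
    progress = 0.0  # work completed since the last checkpoint
    while done + progress < total_work:
        t += interval                    # attempt one interval of work
        if rng.random() < p_fail:
            progress = 0.0               # failure: redo work since checkpoint
        else:
            progress += interval
            if checkpoint:
                t += ckpt_cost           # pay the checkpoint overhead
                done, progress = done + progress, 0.0
    return t
```

With no failures the overhead is pure loss: `time_to_finish(10, 1, 0.5, 0.0, False)` finishes in 10 time units versus 15 with checkpointing, which is the slide's point that checkpointing can hurt when failures are rare.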

Key Idea: Just say no (if you don't want to do it) The application initiates checkpoints (at the best place) and asks the system to actually perform them. The system can decide not to do it: the system knows more about the overall system state than the application does.

Risk-based decision Recall: skip the checkpoint if I < C (not worth it). What we really want to compare is the cost of doing the checkpoint vs. the cost of not doing it. Let p be the probability of a failure; skip the checkpoint if pI < C.

Derivation Expected cost of skipping > expected cost of performing:
p(2I + C) + (1 − p)(0) > p(I + 2C) + (1 − p)C
pI > pC + (1 − p)C
pI > C
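The derivation can be checked numerically. In this sketch (function names are mine): if we skip the checkpoint and a failure hits, we redo the current and previous intervals and then finally checkpoint (2I + C); if we perform it, we always pay C, and a failure additionally costs a redone interval and a second checkpoint (I + 2C).

```python
def expected_cost_skip(p, I, C):
    # Skip: nothing extra happens unless a failure hits (probability p),
    # in which case we redo two intervals of work and then checkpoint.
    return p * (2 * I + C) + (1 - p) * 0

def expected_cost_perform(p, I, C):
    # Perform: we always pay C; on a failure we additionally redo one
    # interval and checkpoint again.
    return p * (I + 2 * C) + (1 - p) * C

# The difference is p(2I+C) - p(I+2C) - (1-p)C = pI - C, so skipping
# is cheaper exactly when pI < C -- the slide's rule.
for p in (0.01, 0.1, 0.5):
    for I, C in ((10, 2), (10, 4), (1, 10)):
        cheaper_to_skip = expected_cost_skip(p, I, C) < expected_cost_perform(p, I, C)
        assert cheaper_to_skip == (p * I < C)
```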

Failure prediction By examining logs, failures can be predicted from temperature, power spikes, retries, and alarms; such predictors have already shown a success rate of 70%. The system has a good idea of p, and a good idea of the cost of a failure.

Results are good (trust me) The numbers are hard to believe, and real statistics are hard to get.

When can I trust you? If I ask someone to deliver a message: my wife will; my kids might forget; the post office will always do it; the internet will only make a best effort (TCP/IP drops packets). If I ask you to save my seat, will you?

How to do checkpoints? On shared-memory supercomputers there is no choice; on message-passing clusters we can do better.

Fault Tolerance Example: bank a sends $50 to bank b. a must remove $50; b must add $50. If a fails after sending, they both have the $50. If b fails after receiving, neither has the $50.
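The two failure cases can be made concrete with a toy model (class and variable names are mine) in which each bank checkpoints and rolls back its balance independently:

```python
class Bank:
    """A toy bank that can checkpoint and roll back its balance."""
    def __init__(self, balance):
        self.balance = balance
        self.ckpt = balance
    def checkpoint(self):
        self.ckpt = self.balance
    def rollback(self):          # crash and restart from the checkpoint
        self.balance = self.ckpt

# Case 1: a fails after sending -- its rollback forgets the withdrawal,
# but b already added the money: the $50 is duplicated.
a, b = Bank(100), Bank(0)
a.balance -= 50
b.balance += 50
a.rollback()
assert a.balance + b.balance == 150   # total was 100: money created

# Case 2: b fails after receiving -- its rollback forgets the deposit,
# but a already removed the money: the $50 is lost.
a, b = Bank(100), Bank(0)
a.balance -= 50
b.balance += 50
b.rollback()
assert a.balance + b.balance == 50    # total was 100: money destroyed
```

Either way the system total changes, which is why the two checkpoints must form a consistent cut with the transfer message accounted for.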

Checkpoint in message-passing systems Each processor or node checkpoints its local state. The system takes a snapshot: the checkpoints plus the messages not yet part of any state.

Consistent State A recorded system state is consistent if every message recorded as received has also been recorded as sent; messages sent but not yet received are in flight and must be recorded as channel state. If an agent has rolled back, all messages it sent since its previous checkpoint are invalid and must be removed. If an agent saves its state after receiving a message, the sender must also save its state.
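The consistency condition can be checked mechanically. In this sketch (names and message ids are mine), a recorded cut is consistent when the set of messages recorded as received is a subset of those recorded as sent:

```python
def is_consistent(sent, received):
    """A recorded global state is consistent if every message recorded as
    received was also recorded as sent (no 'orphan' messages).  Messages
    in sent - received are in flight and belong to the channel state."""
    return received <= sent                        # set inclusion

assert is_consistent({"m1", "m2"}, {"m1"})         # m2 is in flight: fine
assert not is_consistent({"m1"}, {"m1", "m2"})     # m2 was never sent: orphan
```

An orphan like m2 arises exactly when a sender rolls back past a send whose receipt the receiver has already checkpointed, which is why the sender must also save its state.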

Chandy & Lamport One agent begins a snapshot by sending markers to everyone else. Any agent getting its first marker records its state and sends markers to everyone else, etc.
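A minimal runnable sketch of the marker algorithm (all names and the two-process driver are mine; channels are modelled as FIFO queues): on its first marker a process records its state and floods markers, then records each incoming channel until that channel's own marker arrives.

```python
from collections import deque

class Process:
    """One process in a toy Chandy-Lamport run over FIFO channels."""
    def __init__(self, pid, peers):
        self.pid, self.peers = pid, peers
        self.state = set()        # application state: messages applied so far
        self.snap_state = None    # local state recorded for the snapshot
        self.chan = {}            # src -> messages recorded as in flight
        self.done = set()         # incoming channels whose marker arrived

    def record(self, net):
        self.snap_state = set(self.state)        # 1. record local state
        for p in self.peers:                     # 2. send a marker on every
            net[p].append((self.pid, "MARKER"))  #    outgoing channel
        self.chan = {p: [] for p in self.peers}  # 3. start recording inputs

    def receive(self, src, msg, net):
        if msg == "MARKER":
            if self.snap_state is None:          # first marker: record now;
                self.record(net)                 # channel src->self is empty
            self.done.add(src)                   # marker closes this channel
        else:
            self.state.add(msg)                  # apply application message
            if self.snap_state is not None and src not in self.done:
                self.chan[src].append(msg)       # it was in flight at the cut

# A deterministic two-process run: A snapshots while m1 and m3 are in flight.
net = {"A": deque([("B", "m3")]),                # B -> A, still in flight
       "B": deque([("A", "m1")])}                # A -> B, still in flight
procs = {"A": Process("A", ["B"]), "B": Process("B", ["A"])}

procs["A"].record(net)                           # A initiates the snapshot
net["B"].append(("A", "m2"))                     # sent after the marker

while any(net.values()):                         # deliver in FIFO order
    for pid, q in net.items():
        if q:
            src, msg = q.popleft()
            procs[pid].receive(src, msg, net)
```

The snapshot cuts cleanly: m1 lands in B's recorded state, m3 is captured as in-flight channel state on B→A, and m2, sent after the marker, is excluded entirely.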

Marker Sent Before Message Sending the marker before any subsequent messages ensures consistency during the snapshot.

C&L Evaluation Good news: it works. Bad news: too many messages.

Extension Do not snapshot the entire system: snapshot only connected components. Let any node initiate a snapshot; let any node initiate a rollback.

Snapshot per subgroup Multiple snapshots per column. Agents that fail cause only their column to roll back; with lots of failures, this makes better progress.

Snapshot per subgroup When there is a phase change, all may need to roll back; if that happens rarely, it is OK. We can adapt the snapshot time to match this model: take a snapshot when the connected-component size is small enough. This requires system-wide knowledge.

Conclusion Failures are part of life; we must learn how to mitigate their effects. We want to understand the differences and similarities between parallel programs and pervasive computing. Humans intuitively know some of this, but...