Checkpointing and Recovery

Purpose
Consider a long-running application
  – Regularly checkpoint the application (an expensive task)
  – In case of failure, restore to the previous checkpoint
What happens in the case of a distributed application?
  – One (or more) processes fail
  – Restoration to the previous checkpoint should be done consistently

What to Save?
Depends on the application
  – Could be as simple as just the program counter
  – Could be the state of the entire process, including messages received, etc.

Stable Storage
Checkpoints must survive failure of processes (including failure during a disk write)
  – A simple approach for stable storage
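
The slide does not say which approach it has in mind. One common technique, sketched here as an assumption rather than as the author's method, is to write the new checkpoint to a temporary file, force it to disk, and then atomically rename it over the previous checkpoint. A crash during the write then leaves the old checkpoint untouched. The names `write_checkpoint` and `read_checkpoint` are hypothetical.

```python
import os
import tempfile


def write_checkpoint(path: str, data: bytes) -> None:
    """Replace the checkpoint at `path` with `data` in one atomic step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to the disk
        # Readers see either the old checkpoint or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_checkpoint(path: str) -> bytes:
    """Return the most recently completed checkpoint."""
    with open(path, "rb") as f:
        return f.read()
```

This sketch only covers a process crash in the middle of a write. Surviving the loss of the disk itself would need replication, which is not shown.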

Approaches
Asynchronous
  – The local checkpoints at different processes are taken independently
Synchronous
  – The local checkpoints at different processes are coordinated
  – They need not be taken at the same time

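Asynchronous Checkpointing
Problem
  – Domino effect: rolling back the failed process can force other processes to roll back in turn
[Figure: domino effect triggered by a failed process]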

Other Issues with Asynchronous Checkpointing
Useless checkpoints
Need for garbage collection
Recovery requires significant coordination

Asynchronous Checkpointing (Continued)
Identify dependencies between different checkpoint intervals
This information is stored along with the checkpoints in stable storage
When a failed process recovers, it requests this information from the other processes to determine the need for rollback

Two Examples of Asynchronous Checkpointing
Bhargava and Lian
Wang et al.

Algorithm by Bhargava et al.
Draw an edge from $c_{i,x}$ to $c_{j,y}$ if either
  – $i = j$ and $y = x + 1$
  – $i \neq j$ and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$
where $I_{i,x}$ is the interval between $c_{i,x-1}$ and $c_{i,x}$
The rollback recovery line is used for recovery as well as garbage collection
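
A minimal Python sketch of how a recovery line could be computed from such a graph. It relies on assumptions the slide does not state: each process's current volatile state is modeled as one extra node after its last stable checkpoint, a failure marks that node, and every node reachable from a marked node is also rolled back. The function name `recovery_line` and its argument layout are hypothetical.

```python
from collections import defaultdict, deque


def recovery_line(last_index, messages, failed):
    """Return, per process, the index of the checkpoint to restore.

    last_index[p]: index of p's volatile (current) state; checkpoints
                   0 .. last_index[p] - 1 are on stable storage.
    messages:      (sender, send_interval, receiver, recv_interval) tuples,
                   where interval x ends at checkpoint x.
    failed:        set of processes whose volatile state was lost.
    """
    edges = defaultdict(set)
    # Rule 1: same process, consecutive checkpoints.
    for p, last in last_index.items():
        for x in range(last):
            edges[(p, x)].add((p, x + 1))
    # Rule 2: message sent in I(i, x) and received in I(j, y).
    for i, x, j, y in messages:
        edges[(i, x)].add((j, y))

    # Mark everything reachable from the lost volatile states.
    marked = set()
    queue = deque((p, last_index[p]) for p in failed)
    while queue:
        node = queue.popleft()
        if node in marked:
            continue
        marked.add(node)
        queue.extend(edges[node])

    # The recovery line is the last unmarked node of each process.
    return {
        p: max(x for x in range(last + 1) if (p, x) not in marked)
        for p, last in last_index.items()
    }
```

With `last_index = {"P": 2, "Q": 2}`, messages `[("Q", 1, "P", 1), ("P", 2, "Q", 1)]` and `failed = {"P"}`, both processes fall back to checkpoint 0, which is the domino effect. With no messages, only P rolls back, to checkpoint 1.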

Algorithm by Wang et al.
Difference
  – If a message sent from $I_{i,x}$ is received in $I_{j,y}$, then draw an edge from $c_{i,x-1}$ to $c_{j,y}$
The recovery line obtained is similar to that by Bhargava and Lian
Advantage
  – The number of useful checkpoints is at most $N(N+1)/2$
  – This can be shown by bounding the number of checkpoints that are ahead of the recovery line

Coordinated Checkpointing
Using diffusing computation
  – How can we use diffusing computation to obtain a consistent snapshot?

Algorithm by Tamir and Sequin
Blocking checkpoint
  – A coordinator decides when a checkpoint is taken
  – The coordinator sends a request message to all processes
  – Each process stops executing, flushes the channels, takes a tentative checkpoint, and replies to the coordinator
  – When all processes have replied, the coordinator asks them to make their tentative checkpoints permanent
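
The steps above can be sketched as a small single-threaded simulation in Python. This is an illustration under simplifying assumptions, not the published algorithm: process state is a single number, a message is an amount added to the receiver's state on delivery, and the class and method names are hypothetical.

```python
class Process:
    """A process whose state is a number; messages are amounts to add."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.running = True
        self.in_transit = []   # messages sent to this process, not yet delivered
        self.tentative = None
        self.permanent = None

    def on_checkpoint_request(self):
        self.running = False                    # stop executing
        while self.in_transit:                  # flush the incoming channels
            self.state += self.in_transit.pop(0)
        self.tentative = self.state             # take a tentative checkpoint
        return True                             # reply to the coordinator

    def on_commit(self):
        self.permanent = self.tentative         # tentative becomes permanent
        self.tentative = None
        self.running = True                     # resume the computation


def coordinated_checkpoint(processes):
    """Coordinator: request, wait for every reply, then commit."""
    replies = [p.on_checkpoint_request() for p in processes]
    if not all(replies):
        return False
    for p in processes:
        p.on_commit()
    return True
```

For example, if P1 holds 70, P2 holds 50, and a message carrying 30 is in transit to P2, the permanent checkpoints are 70 and 80, so nothing in transit is lost. Because the simulation is sequential, no process sends while others are flushing; a real implementation has to make sure every sender has stopped before a channel counts as flushed. In this sketch a process holds at most one permanent and one tentative checkpoint at a time.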

Algorithm by Tamir and Sequin
How many checkpoints need to be stored per process?

Tamir and Sequin assume a fully connected graph?
  – How would you do it if the graph were not fully connected?
Use diffusing computation
  – Each node stops its "original computation" when it propagates the diffusing computation
  – Each node takes a tentative checkpoint at completion
  – Channel flushing is achieved in between

Checkpointing in Timed Systems
What if clocks are perfectly synchronized?

Checkpointing in Timed Systems
What if clocks are loosely synchronized?
  – Max clock drift, $\epsilon$, is known
All processes take a checkpoint at a fixed (local) time
  – After the checkpoint, a process does not send any messages for $2\epsilon$
  – The set of local checkpoints is guaranteed to be consistent
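
A small brute-force check of why a silence period of twice the drift bound is enough. It rests on an interpretation the slide does not spell out: every local clock stays within `eps` of real time, so two clocks can disagree by up to `2 * eps`, and message delay is never negative. Time is in integer ticks, and the function name `orphan_possible` is hypothetical.

```python
from itertools import product


def orphan_possible(eps, silence, checkpoint_time=1000):
    """Can a message sent after the sender's checkpoint arrive before
    the receiver's checkpoint, for some pair of clock offsets?"""
    for off_s, off_r in product(range(-eps, eps + 1), repeat=2):
        # local time = real time + offset, so the checkpoint taken at local
        # time `checkpoint_time` happens at real time checkpoint_time - offset.
        sender_ckpt = checkpoint_time - off_s
        receiver_ckpt = checkpoint_time - off_r
        # Worst case: sent as soon as the silence ends, zero network delay.
        earliest_receive = sender_ckpt + silence
        if earliest_receive < receiver_ckpt:
            return True   # receipt recorded, send not recorded: inconsistent
    return False
```

Under these assumptions `orphan_possible(5, 10)` is `False`, while `orphan_possible(5, 9)` is `True`: one tick less than twice the bound already admits an inconsistent pair of checkpoints.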

Minimal Checkpoint Coordination
Approach by Koo and Toueg
  – Require processes to take a checkpoint only if they have to

Checkpointing with HLC
Have everyone take a snapshot at the same value
  – Simplify by choosing c = 0

Logging Protocols
Pessimistic
Optimistic
Causal

Concept of Logging
If a restarted process is guaranteed to behave as it did before the failure, then other processes need not be aborted
  – Log non-deterministic events

Definitions
Depend(m)
  – Processes that depend on m
Stable(m)
  – m is stored on stable storage
Log(m)
  – Processes that have logged m
C
  – Set of failed processes

Pessimistic Protocols
Not Stable(m) => |Depend(m)| = 0
What if
  – Not Stable(m) => |Depend(m)| <= 1
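
A minimal Python sketch of the first condition: a message is written to the log before it is delivered, so no process, not even the receiver, can depend on a message that is not yet stable. A plain list stands in for stable storage, which is an assumption made to keep the example short; the class name and state-update rule are hypothetical.

```python
class PessimisticProcess:
    """Logs every message before delivering it to the application."""

    def __init__(self, stable_log=None):
        # Stand-in for stable storage; a real system would write to disk.
        self.stable_log = stable_log if stable_log is not None else []
        self.state = 0

    def receive(self, message):
        self.stable_log.append(message)   # Stable(m) holds from here on
        self._deliver(message)            # only now may state depend on m

    def _deliver(self, message):
        # Order-sensitive update, so replay must follow the logged order.
        self.state = self.state * 31 + message

    @classmethod
    def recover(cls, stable_log):
        """Rebuild the pre-failure state by replaying the log in order."""
        process = cls(stable_log=list(stable_log))
        for message in process.stable_log:
            process._deliver(message)
        return process
```

After a crash, `PessimisticProcess.recover(log)` reaches the same state the failed process had, so no other process has to roll back.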

Causal Protocols
Save m in the volatile memory of other processes
  – Ensure Depend(m) ⊆ Log(m)
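
One way to maintain this inclusion is to piggyback, on every outgoing message, the log entries the sender currently holds, so that any process that comes to depend on m also holds a copy of m's entry. The Python sketch below illustrates that idea under simplifying assumptions: entries are never pruned, a dictionary stands in for volatile memory, and the names `CausalProcess` and `log_set` are hypothetical.

```python
class CausalProcess:
    """Keeps delivery records in volatile memory and piggybacks them."""

    def __init__(self, name):
        self.name = name
        self.delivered = 0
        self.volatile_log = {}   # message id -> (sender, receiver, order)

    def send(self, msg_id, payload):
        # Attach every record this process holds to the outgoing message.
        return {"id": msg_id, "payload": payload,
                "piggyback": dict(self.volatile_log)}

    def deliver(self, message, sender_name):
        # Adopt the sender's records first, then record this delivery.
        self.volatile_log.update(message["piggyback"])
        self.delivered += 1
        self.volatile_log[message["id"]] = (sender_name, self.name, self.delivered)


def log_set(msg_id, processes):
    """Log(m): names of the processes that hold a record for m."""
    return {p.name for p in processes if msg_id in p.volatile_log}
```

If A delivers m1 and then sends m2 to B, and B then sends m3 to C, all three processes depend on m1 and all three hold its record, so Depend(m1) ⊆ Log(m1) in this run.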