A Survey of Rollback-Recovery Protocols in Message-Passing Systems M. Elnozahy, L. Alvisi, Y. Wang, D. Johnson Carnegie Mellon University Presented by:


A Survey of Rollback-Recovery Protocols in Message-Passing Systems M. Elnozahy, L. Alvisi, Y. Wang, D. Johnson Carnegie Mellon University Presented by: Tina Chhabra

Rollback-Recovery Techniques
- Checkpoint-based protocols
  - Rely solely on checkpointing for system state restoration
  - Coordinated, uncoordinated, communication-induced
- Log-based protocols
  - Combine checkpointing with logging of nondeterministic events
  - Pessimistic, optimistic, causal

Rollback Recovery
- Focuses on long-running applications
- Treats a distributed system as a collection of application processes that communicate through the network
- Message-passing systems complicate rollback recovery because messages induce inter-process dependencies during failure-free operation

System Model
- A message-passing system consists of a fixed number of processes that communicate only through messages
- A process execution is a sequence of state intervals, each started by a nondeterministic event

Consistent System States
- A consistent system state is one in which, if a process's state reflects a message receipt, then the state of the corresponding sender reflects sending that message
- A fundamental goal of any rollback-recovery protocol is to bring the system into a consistent state when inconsistencies occur because of a failure
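The definition above can be checked mechanically. The following is a minimal sketch, assuming each process state is summarized by the sets of message ids it has sent and received; `ProcessState`, `is_consistent`, and the message ids are illustrative names, not part of the original slides.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessState:
    """Snapshot of one process: ids of messages sent and received so far."""
    sent: set = field(default_factory=set)
    received: set = field(default_factory=set)


def is_consistent(states, sender_of):
    """Return True if every recorded receipt has a matching recorded send.

    states: dict process name -> ProcessState
    sender_of: dict message id -> name of the sending process
    """
    for state in states.values():
        for msg in state.received:
            if msg not in states[sender_of[msg]].sent:
                return False  # receipt without the corresponding send
    return True


sender_of = {"m1": "P0"}

# m1 was sent but is still in transit: consistent.
in_transit = {"P0": ProcessState(sent={"m1"}), "P1": ProcessState()}

# P1 reflects receiving m1, but P0 does not reflect sending it: inconsistent.
orphan_receipt = {"P0": ProcessState(), "P1": ProcessState(received={"m1"})}
```

Note the asymmetry: a message that was sent but not yet received does not violate consistency, while a receipt without a send does.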

Checkpointing Protocols
- Each process periodically saves its state on stable storage
- A consistent global checkpoint is a set of local checkpoints, one from each process, forming a consistent state
- To minimize the amount of lost work, the system should be restored to the most recent consistent global checkpoint, called the recovery line

The Domino Effect
- Upon the failure of one or more processes, dependencies may force some of the processes that did not fail to roll back, creating rollback propagation
- Rollback propagation may extend back to the initial state of the computation, losing all the work performed before the failure; this is called the domino effect

Logging Protocols
- Rely on the piecewise deterministic (PWD) assumption
- Can recover a failed process and replay its execution as it occurred before the failure
- Generally not susceptible to the domino effect
- A state interval is recoverable if there is sufficient information to replay the execution up to that state interval despite future failures
- A state interval is stable if the determinant of the nondeterministic event that started it is logged on stable storage
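As a rough illustration of the PWD assumption, the sketch below treats message delivery order as the only nondeterministic input and logs it as the determinant. The `Accumulator` class and `replay` helper are hypothetical names chosen for this example.

```python
class Accumulator:
    """Toy process whose state depends on the order of delivered values."""

    def __init__(self, state=0):
        self.state = state

    def deliver(self, value):
        # Deterministic given the input, but sensitive to delivery order.
        self.state = self.state * 2 + value


def replay(checkpoint_state, determinants):
    """Rebuild a process from a checkpoint plus the logged delivery order."""
    process = Accumulator(checkpoint_state)
    for value in determinants:
        process.deliver(value)
    return process


# Failure-free run: the network happened to deliver 3, then 1, then 4.
original = Accumulator()
original.deliver(3)
checkpoint = original.state   # checkpoint taken after the first delivery
log = []                      # determinants logged after the checkpoint
for value in (1, 4):
    log.append(value)
    original.deliver(value)

recovered = replay(checkpoint, log)                  # same order as before
reordered = replay(checkpoint, list(reversed(log)))  # order not preserved
```

Replaying the logged order reproduces the pre-failure state, while replaying the same messages in a different order does not, which is why the determinants have to be logged.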

Logging Protocols contd.
- States X, Y, and Z form the maximum recoverable state, the most recent recoverable consistent system state
- Suppose processes P1 and P2 fail before logging the determinants corresponding to the deliveries of m6 and m5, respectively
- Message m7 becomes an orphan message because process P2 cannot guarantee the regeneration of the same m6 during recovery, and P1 cannot guarantee the regeneration of the same m7 without the original m6
- Process P0 becomes an orphan process and is forced to roll back

Garbage Collection
- Deletion of useless recovery information
- A common approach is to identify the recovery line and discard all information relating to events that occurred before that line
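A minimal sketch of this approach, assuming checkpoints are identified by increasing integer indices per process and that the recovery line is already known. `collect_garbage` and the sample data are illustrative only.

```python
def collect_garbage(checkpoints, recovery_line):
    """Keep, for each process, the recovery-line checkpoint and anything newer.

    checkpoints: dict process name -> ascending list of checkpoint indices
    recovery_line: dict process name -> index of its checkpoint on the line
    """
    return {
        name: [c for c in indices if c >= recovery_line[name]]
        for name, indices in checkpoints.items()
    }


checkpoints = {"P0": [0, 1, 2, 3], "P1": [0, 1, 2]}
line = {"P0": 2, "P1": 1}
kept = collect_garbage(checkpoints, line)
```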

Checkpoint-Based Rollback Recovery
- Restores the system state to the most recent consistent set of checkpoints
- Does not guarantee that pre-failure execution can be deterministically regenerated after a rollback
- Uncoordinated, coordinated, communication-induced

Uncoordinated Checkpointing
- Allows each process maximum autonomy in deciding when to take checkpoints
- To determine a consistent global checkpoint during recovery, the processes record the dependencies among their checkpoints during failure-free operation

Uncoordinated Checkpointing contd.
If a failure occurs:
1. The recovering process initiates rollback by broadcasting a dependency request message to collect all the dependency information maintained by each process
2. Each process stops its execution and replies with the dependency information saved on stable storage and associated with its current state
3. The initiator calculates the recovery line and broadcasts a rollback request message
4. A process whose current state belongs to the recovery line resumes execution; otherwise, it rolls back to an earlier checkpoint indicated by the recovery line
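The recovery-line calculation performed by the initiator can be sketched as repeated rollback propagation. This is a simplified model, not the protocol from the survey: it assumes each saved state is summarized by the ids of messages sent and received, that the initial state has no receipts, and that every history holds at least the initial state plus the current one. All names and the sample histories are hypothetical.

```python
def recovery_line(histories, sender_of, failed):
    """Return, per process, the index of the state it should resume from.

    histories: dict name -> list of (sent_ids, received_ids), oldest first.
        Entry 0 is the initial state, the last entry is the current
        (volatile) state, and the entries in between are checkpoints.
    sender_of: dict message id -> name of the sending process
    failed: set of names whose volatile state was lost
    """
    # Failed processes start from their latest checkpoint, others from now.
    line = {name: len(h) - (2 if name in failed else 1)
            for name, h in histories.items()}

    def has_orphan_receipt(name):
        _, received = histories[name][line[name]]
        return any(
            msg not in histories[sender_of[msg]][line[sender_of[msg]]][0]
            for msg in received
        )

    changed = True
    while changed:
        changed = False
        for name in histories:
            while has_orphan_receipt(name):
                line[name] -= 1   # roll back one checkpoint
                changed = True
    return line


sender_of = {"m1": "P0", "m2": "P1", "m3": "P0"}
histories = {
    "P0": [(set(), set()), ({"m1"}, {"m2"}), ({"m1", "m3"}, {"m2"})],
    "P1": [(set(), set()), (set(), {"m1"}), ({"m2"}, {"m1", "m3"})],
}

no_failure = recovery_line(histories, sender_of, failed=set())
after_p0_fails = recovery_line(histories, sender_of, failed={"P0"})
```

With these interleaved checkpoints, a single failure of P0 propagates until both processes are back at their initial states, which is the domino effect in miniature.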

Uncoordinated Checkpointing contd.
Advantage
- Each process may take a checkpoint when it is most convenient
Disadvantages
- Possibility of the domino effect
- A process may take a useless checkpoint that will never be part of a global consistent state
- Forces each process to maintain multiple checkpoints

Coordinated Checkpointing
- Requires processes to orchestrate their checkpoints in order to form a consistent global state
- A straightforward approach is to block communications while the checkpointing protocol executes

Coordinated Checkpointing contd.
1. A coordinator takes a checkpoint and broadcasts a request message to all processes, asking them to take a checkpoint
2. Upon receiving this message, a process stops its execution, flushes all the communication channels, takes a tentative checkpoint, and sends an acknowledgment message back to the coordinator
3. After receiving acknowledgments from all processes, the coordinator broadcasts a commit message
4. After receiving the commit message, each process removes the old permanent checkpoint and makes the tentative checkpoint permanent
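The blocking protocol above can be sketched as a small in-memory simulation. Channel flushing is omitted, and the abort path taken when an acknowledgment is missing is an assumption of this sketch rather than something stated on the slide. `Process`, `coordinated_checkpoint`, and the `reachable` parameter are illustrative names.

```python
class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.permanent = state   # last committed checkpoint
        self.tentative = None
        self.blocked = False

    def on_request(self):
        self.blocked = True            # stop execution during the protocol
        self.tentative = self.state    # take a tentative checkpoint
        return True                    # acknowledgment to the coordinator

    def on_commit(self):
        self.permanent = self.tentative  # replaces the old permanent checkpoint
        self.tentative = None
        self.blocked = False

    def on_abort(self):
        # Assumption of this sketch: discard the tentative checkpoint.
        self.tentative = None
        self.blocked = False


def coordinated_checkpoint(processes, reachable=lambda p: True):
    """Run one checkpointing round; return True if it committed."""
    contacted = [p for p in processes if reachable(p)]
    acks = [p.on_request() for p in contacted]
    if len(acks) == len(processes) and all(acks):
        for p in processes:
            p.on_commit()
        return True
    for p in contacted:
        p.on_abort()
    return False


procs = [Process("P0", "a0"), Process("P1", "b0")]
procs[0].state, procs[1].state = "a1", "b1"
first_round = coordinated_checkpoint(procs)
after_first = [p.permanent for p in procs]

procs[0].state = "a2"
second_round = coordinated_checkpoint(procs, reachable=lambda p: p.name != "P1")
after_second = [p.permanent for p in procs]
```

Because a checkpoint only becomes permanent after every process has acknowledged, the set of permanent checkpoints always forms a consistent global state, and only one permanent checkpoint per process needs to be kept.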

Coordinated Checkpointing contd.
Advantages
- Simplifies recovery
- Not susceptible to the domino effect
- Reduces storage overhead and eliminates the need for garbage collection
Disadvantage
- Large latency involved in committing output

Communication-Induced Checkpointing
- Avoids the domino effect while allowing processes to take some of their checkpoints independently
- Processes may be forced to take additional checkpoints because process independence is constrained to guarantee the progress of the recovery line
- Protocol-related information is piggybacked on each application message
- The receiver of the message uses this information to determine if it has to take a forced checkpoint to advance the recovery line

Communication-Induced Checkpointing contd.
- Model-based checkpointing
  - System maintains checkpoint and communication structures that prevent the domino effect
- Index-based checkpointing
  - System uses an indexing scheme for the local and forced checkpoints so that the checkpoints of the same index at all processes form a consistent state
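One common way to realize index-based checkpointing is to piggyback the sender's checkpoint index on every message and force a checkpoint at the receiver whenever the incoming index is larger than its own. The sketch below follows that idea; the class, method names, and message layout are assumptions made for illustration.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.index = 0
        self.checkpoints = [(0, "initial")]   # (index, kind)

    def local_checkpoint(self):
        """Checkpoint taken independently, at the process's convenience."""
        self.index += 1
        self.checkpoints.append((self.index, "local"))

    def send(self, payload):
        # Protocol information piggybacked on the application message.
        return {"payload": payload, "index": self.index}

    def receive(self, message):
        if message["index"] > self.index:
            # Forced checkpoint, taken before the message is delivered.
            self.index = message["index"]
            self.checkpoints.append((self.index, "forced"))
        return message["payload"]


p0, p1 = Process("P0"), Process("P1")
p0.local_checkpoint()            # P0 moves to index 1 on its own
p1.receive(p0.send("hello"))     # P1 is behind, so it takes a forced checkpoint
p0.receive(p1.send("reply"))     # indices match, no forced checkpoint
```

Taking the forced checkpoint before delivery keeps the receipt out of the checkpoint with the lower index, so checkpoints carrying the same index never record a message whose send lies after the sender's checkpoint with that index.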

Log-Based Rollback Recovery
- Makes explicit use of the fact that a process execution can be modeled as a sequence of deterministic state intervals, each starting with the execution of a nondeterministic event
- Assumes that all nondeterministic events can be identified and their corresponding determinants can be logged to stable storage
- Guarantees that upon recovery of all failed processes, the system does not contain any orphan process

Log-Based Rollback Recovery contd.
During failure-free operation
- Each process logs the determinants of all the nondeterministic events that it observes onto stable storage
- Each process also takes checkpoints to reduce the extent of rollback during recovery
After a failure occurs
- The failed processes recover by using the checkpoints and logged determinants to replay the corresponding nondeterministic events precisely as they occurred during the pre-failure execution
Three variants: pessimistic, optimistic, causal

Pessimistic Logging
- Designed under the assumption that a failure can occur after any nondeterministic event in the computation
- The assumption is "pessimistic" since in reality failures are rare
- The determinant of each nondeterministic event is logged to stable storage before the event is allowed to affect the computation
- Abides by the always-no-orphans condition
- The observable state of each process is always recoverable
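The ordering rule (log first, then deliver) can be sketched as follows. Stable storage is simulated with an in-memory list, and the determinant is simplified to carry the message payload as well. `StableStorage` and `PessimisticProcess` are hypothetical names.

```python
class StableStorage:
    """Stand-in for a disk; a real write here would be synchronous."""

    def __init__(self):
        self.determinants = []

    def write(self, determinant):
        self.determinants.append(determinant)


class PessimisticProcess:
    def __init__(self, storage):
        self.storage = storage
        self.state = []

    def deliver(self, sender, seq, payload):
        # 1. Log the determinant and wait for the write to complete.
        self.storage.write((sender, seq, payload))
        # 2. Only then let the event affect the computation.
        self.state.append(payload)

    @classmethod
    def recover(cls, storage):
        """Rebuild the state by replaying every logged determinant in order."""
        process = cls(storage)
        for _sender, _seq, payload in storage.determinants:
            process.state.append(payload)   # replay without logging again
        return process


storage = StableStorage()
process = PessimisticProcess(storage)
process.deliver("P0", 1, "x")
process.deliver("P2", 1, "y")
pre_failure_state = list(process.state)

process = None                                    # simulate a crash
recovered = PessimisticProcess.recover(storage)
```

Every event that influenced the state is already on stable storage at the moment of the crash, so no other process can depend on something that cannot be replayed. The price is one blocking write per delivery.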

Pessimistic Logging contd.
- Suppose processes P1 and P2 fail and restart from checkpoints B and C
- They roll forward using their determinant logs to deliver the same sequence of messages as in the pre-failure execution
- Once recovery is complete, both processes will be consistent with the state of P0

Pessimistic Logging contd.
Advantages
- Orphans are never created
- Simplified recovery and garbage collection
Disadvantage
- High failure-free performance overhead

Optimistic Logging
- Makes the optimistic assumption that logging will complete before a failure occurs
- Determinants are kept in a volatile log, which is periodically flushed to stable storage
- Does not require the application to block waiting for the determinants to be written to stable storage
- If a process fails, the determinants in its volatile log will be lost
- If the failed process sent a message during any of the state intervals that cannot be recovered, the receiver of the message becomes an orphan process and must roll back

Optimistic Logging contd.
- Suppose P2 fails before the determinant for m5 is logged to stable storage
- Process P1 becomes an orphan process and must roll back to undo the effects of receiving the orphan message m6
- The rollback of P1 forces P0 to roll back to undo the effects of receiving message m7
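The scenario above can be reproduced with a small model that tracks, for each process, the latest state interval of every other process it depends on. This sketch only detects orphans and does not perform the rollback itself. The class, the dependency-tracking scheme, and the sender name "P3" for m5 are assumptions for illustration.

```python
class OptimisticProcess:
    def __init__(self, name):
        self.name = name
        self.interval = 0      # index of the current state interval
        self.volatile = []     # determinants still in memory
        self.stable = []       # determinants flushed to stable storage
        self.depends_on = {}   # other process -> latest interval we depend on

    def send(self):
        # Piggyback our interval and transitive dependencies on the message.
        return (self.name, self.interval, dict(self.depends_on))

    def deliver(self, message):
        sender, interval, deps = message
        self.interval += 1                # each delivery starts a new interval
        self.volatile.append(message)     # logged in memory only, no blocking
        for name, idx in list(deps.items()) + [(sender, interval)]:
            if name != self.name:
                self.depends_on[name] = max(self.depends_on.get(name, 0), idx)

    def flush(self):
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def crash(self):
        """Lose the volatile log; return the last recoverable interval."""
        self.volatile.clear()
        self.interval = len(self.stable)
        return self.interval


def is_orphan(process, failed_name, recoverable_interval):
    return process.depends_on.get(failed_name, 0) > recoverable_interval


p0, p1, p2 = (OptimisticProcess(n) for n in ("P0", "P1", "P2"))
p2.deliver(("P3", 0, {}))   # m5, from a process called P3 in this sketch
p1.deliver(p2.send())       # m6
p0.deliver(p1.send())       # m7
recoverable = p2.crash()    # the determinant of m5 was never flushed
orphans = [p.name for p in (p0, p1) if is_orphan(p, "P2", recoverable)]
```

Had P2 called `flush()` before crashing, its interval 1 would have been recoverable and neither P1 nor P0 would be an orphan.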

Optimistic Logging contd.
Advantage
- Low failure-free performance overhead
Disadvantages
- Allows orphans to be created
- Complicated recovery and garbage collection

Causal Logging
- Has the failure-free performance advantages of optimistic logging while retaining most of the advantages of pessimistic logging
- Allows each process to commit output independently and never creates orphans
- Limits the rollback of any failed process to the most recent checkpoint on stable storage
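One way to picture how causal logging avoids orphans without blocking writes is that determinants travel with the messages they causally precede, so any process that depends on an event also holds the information needed to replay it. The sketch below is a simplification under that assumption: determinants include the payload, nothing is ever flushed to stable storage, and all names are hypothetical.

```python
class CausalProcess:
    def __init__(self, name):
        self.name = name
        self.delivered = []   # payloads in delivery order
        self.det_log = []     # determinants known here (own and piggybacked)

    def send(self, payload):
        # Piggyback every determinant we know about on the outgoing message.
        return {"sender": self.name, "payload": payload,
                "determinants": list(self.det_log)}

    def deliver(self, message):
        for det in message["determinants"]:
            if det not in self.det_log:
                self.det_log.append(det)
        order = len(self.delivered)
        self.det_log.append(
            (self.name, order, message["sender"], message["payload"]))
        self.delivered.append(message["payload"])


def recover(name, survivors):
    """Replay a failed process from determinants held by the survivors."""
    dets = {d for s in survivors for d in s.det_log if d[0] == name}
    process = CausalProcess(name)
    for _name, _order, sender, payload in sorted(dets, key=lambda d: d[1]):
        process.deliver(
            {"sender": sender, "payload": payload, "determinants": []})
    return process


p0, p1, p2 = CausalProcess("P0"), CausalProcess("P1"), CausalProcess("P2")
p1.deliver(p0.send("a"))
p1.deliver(p2.send("b"))
p0.deliver(p1.send("c"))      # P0 now also holds P1's two determinants

recovered_p1 = recover("P1", survivors=[p0, p2])
```

A delivery at P1 that happened after its last send would not be recoverable in this sketch, but no other process could depend on it either, so no orphan would result.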

Causal Logging contd.
Advantage
- Low performance overhead
Disadvantage
- May require complex recovery and garbage collection

Comparison

Questions??