1 Rollback-Recovery Protocols II Mahmoud ElGammal.

Slides:



Advertisements
Similar presentations
Rollback-Retry Techniques & Checnkpointing Protocols.
Advertisements

Faults and Recovery Ludovic Henrio INRIA - projet OASIS Sources: - A survey of rollback-recovery protocols in message-passing systems.
Database Systems, 8 th Edition Concurrency Control with Time Stamping Methods Assigns global unique time stamp to each transaction Produces explicit.
Computer Science Lecture 18, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
Database Replication techniques: a Three Parameter Classification Authors : Database Replication techniques: a Three Parameter Classification Authors :
Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
1 Minggu 8, Pertemuan 16 Transaction Management (cont.) Matakuliah: T0206-Sistem Basisdata Tahun: 2005 Versi: 1.0/0.0.
Copyright 2004 Koren & Krishna ECE655/Ckpt Part.11.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.
CS-550 (M.Soneru): Recovery [SaS] 1 Recovery. CS-550 (M.Soneru): Recovery [SaS] 2 Recovery Computer system recovery: –Restore the system to a normal operational.
1 Message Logging Pessimistic & Optimistic CS717 Lecture 10/16/01-10/18/01 Kamen Yotov
1 CS 194: Distributed Systems Distributed Commit, Recovery Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering.
Causal Logging : Manetho Rohit C Fernandes 10/25/01.
CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture XI: Distributed Transactions.
CMPT Dr. Alexandra Fedorova Lecture XI: Distributed Transactions.
CMPT Dr. Alexandra Fedorova Lecture XI: Distributed Transactions.
Academic Year 2014 Spring. MODULE CC3005NI: Advanced Database Systems “DATABASE RECOVERY” (PART – 1) Academic Year 2014 Spring.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems M. Elnozahy, L. Alvisi, Y. Wang, D. Johnson Carnegie Mellon University Presented by:
Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems.
Switch off your Mobiles Phones or Change Profile to Silent Mode.
Reliable Communication in the Presence of Failures Based on the paper by: Kenneth Birman and Thomas A. Joseph Cesar Talledo COEN 317 Fall 05.
Distributed Transactions Chapter 13
Distributed Systems CS Fault Tolerance- Part III Lecture 19, Nov 25, 2013 Mohammad Hammoud 1.
1 8.3 Reliable Client-Server Communication So far: Concentrated on process resilience (by means of process groups). What about reliable communication channels?
EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Fault Tolerant Systems
A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.
12. Recovery Study Meeting M1 Yuuki Horita 2004/5/14.
CS5204 – Operating Systems 1 Checkpointing-Recovery.
Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.
CprE 545: Fault Tolerant Systems (G. Manimaran), Iowa State University1 CprE 545: Fault Tolerant Systems Rollback Recovery Protocols.
Chapter 16 Recovery Yonsei University 1 st Semester, 2015 Sanghyun Park.
Coordinated Checkpointing Presented by Sarah Arnold 1.
Rollback-Recovery Protocols in Message-Passing Systems Based on A Survey of Rollback-Recovery Protocols in Message-Passing Systems by Mootaz Elnozahy Lorenzo.
Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.
Sapna E. George, Ing-Ray Chen Presented By Yinan Li, Shuo Miao April 14, 2009.
Academic Year 2014 Spring. MODULE CC3005NI: Advanced Database Systems “DATABASE RECOVERY” (PART – 2) Academic Year 2014 Spring.
EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Middleware for Fault Tolerant Applications Lihua Xu and Sheng Liu Jun, 05, 2003.
Rollback-Recovery Protocols I Message Passing Systems Nabil S. Al Ramli.
Ludovic Henrio INRIA - projet OASIS
EEC 688/788 Secure and Dependable Computing Lecture 5 Wenbing Zhao Cleveland State University
1 Fault Tolerance and Recovery Mostly taken from
Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems.
1 Distributed Systems 2007/08 Rollback-Recovery Alberto Montresor Università di Trento This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike.
Database Recovery Techniques
8.6. Recovery By Hemanth Kumar Reddy.
Prepared by Ertuğrul Kuzan
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
Operating System Reliability
EECS 498 Introduction to Distributed Systems Fall 2017
Commit Protocols CS60002: Distributed Systems
Outline Announcements Fault Tolerance.
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
Operating System Reliability
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Chapter 10 Transaction Management and Concurrency Control
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
ECE 753: FAULT-TOLERANT COMPUTING
Operating System Reliability
Operating System Reliability
Presentation transcript:

1 Rollback-Recovery Protocols II Mahmoud ElGammal

Rollback-Recovery II Taxonomy CS 5204 – Fall,

Rollback-Recovery II Communication-Induced Checkpointing Avoid the domino effect without requiring all checkpoints to be coordinated. Processes take two kinds of checkpoints: local and forced. Local checkpoints can be taken independently. Forced checkpoints must be taken to guarantee the eventual progress of the recovery line. No special coordination messages are exchanged to determine when forced checkpoints should be taken. Protocol-specific information is piggybacked on each application message. The receiver uses this information to decide if it should take a forced checkpoint. CS 5204 – Fall,

Rollback-Recovery II Communication-Induced Checkpointing Notation How does a receiver decide when to take a forced checkpoint? A checkpoint is useless if and only if it is part of a Z-cycle. The receiver should determine if past communication and checkpoint patterns can lead to the creation of useless checkpoints. CS 5204 – Fall, Z-pathZ-cycle

Rollback-Recovery II Communication-Induced Checkpointing Notation Checkpoint c 2,2 is useless under any failure scenario. P 0 must create a forced checkpoint before delivering m5 to break the m 3 -m 4 -m 5 Z-cycle. CS 5204 – Fall,

Rollback-Recovery II Communication-Induced Checkpointing CIC protocols have been classified in two types:  Model-based Protocols: Take more forced checkpoints than is probably necessary, because without explicit coordination, no process has complete information about the global system state.  Index-based protocols: Guarantee that checkpoints having the same index at different processes form a consistent state. CS 5204 – Fall,

Rollback-Recovery II Taxonomy CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery Process execution is modeled as a sequence of deterministic state intervals, each starting with the execution of a nondeterministic event. Non-deterministic event: the receipt of a message or an internal event (something that affects the process). Deterministic event: sending a message (an effect caused by the process). CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery All non-deterministic events can be identified and their determinants are logged to stable storage.  Determinant: the information need to “replay” the occurrence of a non- deterministic event. During failure-free operation, each process logs the determinants of all the non-deterministic events it observes onto stable storage. Each process also takes checkpoints to reduce the extent of rollback during recovery. After a failure occurs, the failed processes recover by using the checkpoints and logged determinants to replay the corresponding nondeterministic events precisely as they occurred during the pre-failure execution. CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery The pre-failure execution of a failed process can be reconstructed during recovery up to the first nondeterministic event whose determinant is not logged. Upon recovery of all failed processes, the system does not contain any orphan process: a process whose state depends on a nondeterministic event that cannot be reproduced during recovery: (The No-Orphans Consistency Condition) A process p becomes an orphan when p itself doesn’t fail and p’s state depends on the execution of a nondeterministic event e whose determinant cannot be recovered from stable storage or from the volatile memory of a surviving process. CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery Key parameters: Failure-free performance overhead. Output-commit latency. Simplicity of recovery and garbage collection. Potential for rolling back correct processes. CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery / Pessimistic Logging Assumes that a failure can occur after any nondeterministic. The determinant of each nondeterministic event is logged to stable storage before the event is allowed to affect the computation. Employs synchronous logging (a strengthening of the always-no-orphans condition): CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery / Pessimistic Logging CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery / Pessimistic Logging Advantages:  Processes can send messages to the outside world without running a special protocol.  Processes restart from their most recent checkpoint, limiting the extent of execution that has to be replayed.  Recovery is simplified because the effects of a failure are confined only to the processes that fail.  Garbage collection is simple. Disadvantages:  Synchronous logging incurs a high performance penalty during failure-free operation. CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery / Optimistic Logging Determinants of non-deterministic events are logged asynchronously: determinants are kept in a volatile log which is periodically flushed to stable storage. Assumes that logging will complete before a failure occurs. Allows the temporary creation of orphan processes, but none should exist by the time recovery is complete. CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery / Optimistic Logging If a process fails, the determinants in its volatile log will be lost, and the state intervals that were started by such events cannot be recovered. If the failed process sent a message during any of these state intervals, the receiver of such message becomes an orphan process and must rollback to undo the effects of receiving the message. To perform these rollbacks correctly, causal dependencies must be tracked. CS 5204 – Fall, m5 still in volatile storage

Rollback-Recovery II Log-Based Rollback Recovery / Optimistic Logging Advantages:  Incurs little overhead during failure-free execution. Disadvantages:  More complicated recovery and garbage collection than pessimistic logging: Must track causal dependencies. May need to keep multiple checkpoints. Output commit requires multi-host coordination to ensure that no failure scenario can revoke the output. CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery / Causal Logging Has the failure-free performance advantages of optimistic logging while retaining most of the advantages of optimistic logging. Avoids synchronous access to stable storage except during output commit. Similar to pessimistic logging in:  Allows each process to commit output independently.  Never creates orphan processes.  Limits the rollback of any failed process to the most recent checkpoint. Cost: a more complex recovery protocol. CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery / Causal Logging Ensures the always-no-orphans property by ensuring that the determinant of each non- deterministic event that causally precedes the state of a process is either stable or it is available locally to that process. Processes piggyback the non-stable determinants in their volatile log on the messages they send to other processes. CS 5204 – Fall,

Rollback-Recovery II Log-Based Rollback Recovery / Causal Logging CS 5204 – Fall, P 0 will be able to “guide” the recovery of P 1 and P 2 since it knows the order in which P 1 should replay messages m 1 and m 3 to reach the state from which P 1 sends m 4. Similarly for P 2.

Rollback-Recovery II CS 5204 – Fall,

Rollback-Recovery II Concluding Remarks Key properties: performance overhead, storage overhead, ease of output commit, ease of garbage collection, ease of recovery, freedom from domino effect, freedom from orphan processes, and the extent of rollback. Coordinated checkpointing generally simplifies recovery and garbage collection, and yields good performance in practice. the nondeterministic nature of communication-induced checkpointing protocols complicates garbage collection and degrades performance. Log-based rollback recovery is often a natural choice for applications that frequently interact with the outside world. CS 5204 – Fall,

Rollback-Recovery II Thanks! Questions? CS 5204 – Fall,