A Survey of Rollback-Recovery Protocols in Message-Passing Systems.


1. Introduction
- Terms
  - Transparent rollback recovery
  - Checkpoint
  - Rollback propagation
  - Domino effect
  - Checkpoint-based rollback recovery
  - Log-based rollback recovery
  - Garbage collection
  - Interactions with the outside world

2. Background and Definitions
- System Model
  - A message-passing system consists of a fixed number of processes
  - Rollback-recovery protocols assume that the communication network is immune to partitioning
    - Immunity to partitioning is a separate question from network reliability, on which protocols differ
  - A process execution is a sequence of state intervals
  - Piecewise deterministic (PWD) assumption (log-based rollback recovery)
    - The system can detect and capture sufficient information about the nondeterministic events that initiate the state intervals

2. Background and Definitions (Cont.)
- Consistent System States
  - Goal of a rollback-recovery protocol: bring the system back to a consistent state when a failure causes inconsistencies
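
A consistent state can be checked mechanically: no message may be recorded as received unless its send is recorded too. The sketch below assumes a hypothetical `Message` record and a `cut` that maps each process to the last state interval included in the global state. Neither name comes from the slides.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: str
    send_interval: int  # state interval of the sender in which the message was sent
    receiver: str
    recv_interval: int  # state interval of the receiver in which it was received


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """Return True if no message is recorded as received but not sent.

    cut[p] is the index of the last state interval of process p that the
    global state includes (an illustrative encoding, not from the slides).
    """
    for m in messages:
        sent = m.send_interval <= cut[m.sender]
        received = m.recv_interval <= cut[m.receiver]
        if received and not sent:
            return False  # orphan message: the state is inconsistent
    return True


history = [Message("P0", 2, "P1", 3)]
print(is_consistent({"P0": 2, "P1": 3}, history))  # send and receive both included
print(is_consistent({"P0": 2, "P1": 2}, history))  # message still in transit
print(is_consistent({"P0": 1, "P1": 3}, history))  # receive recorded without its send
```

A message that is sent but not yet received leaves the state consistent. Only the third cut, which records a receive without the matching send, is rejected.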

2. Background and Definitions (Cont.)
- Interactions with the Outside World
  - The outside world cannot roll back (e.g., a printer cannot undo what it has printed)
  - OWP (Outside World Process)
    - Cannot fail
    - Cannot maintain state
    - Cannot participate in the recovery protocol
  - Output commit problem
    - Before sending a message to the OWP, the system must ensure that the state from which the message is sent will be recovered despite any future failure
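
One simple way to respect the output commit rule is to hold each output until the state that produced it is on stable storage. The sketch below assumes a checkpoint-only process with a hypothetical `Process` class. It illustrates the rule rather than any protocol from the survey.

```python
class Process:
    def __init__(self):
        self.state = 0
        self.stable_state = None  # last state written to (simulated) stable storage
        self.pending = []         # outputs requested but not yet committed
        self.released = []        # outputs actually delivered to the outside world

    def compute(self, delta):
        self.state += delta

    def request_output(self, text):
        # Do not release yet: the current state is still volatile.
        self.pending.append(text)

    def checkpoint(self):
        # Once the state is stable, the execution that produced the pending
        # outputs can no longer be undone, so they are safe to release.
        self.stable_state = self.state
        self.released.extend(self.pending)
        self.pending.clear()

    def crash_and_recover(self):
        # Volatile state and uncommitted outputs are lost on failure.
        self.state = self.stable_state if self.stable_state is not None else 0
        self.pending.clear()


p = Process()
p.compute(5)
p.request_output("a")
p.checkpoint()            # "a" is released only now
p.compute(1)
p.request_output("b")     # never released: the process fails first
p.crash_and_recover()
print(p.state, p.released)
```

The delay between `request_output` and `checkpoint` is the output commit latency that the later slides list as a drawback of checkpoint-based schemes.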

2. Background and Definitions (Cont.)
- In-Transit Messages
  - Messages that have been sent but not yet received
  - Rollback-recovery protocols must guarantee the delivery of in-transit messages
  - Handling depends on whether a reliable or an unreliable communication protocol is assumed

2. Background and Definitions (Cont.)
- Logging Protocols
  - Rely on the piecewise deterministic (PWD) assumption
    - Identify all nondeterministic events and log, for each one, a determinant that contains the information necessary to replay the event
  - Not susceptible to the domino effect
  - Ensure consistency without expensive checkpoints
  - Well suited to applications that interact repeatedly with the OWP
- Stable Storage
  - Holds checkpoints, event logs, and other recovery-related information
  - Must ensure that the recovery data persist
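
The replay idea can be sketched in a few lines. The example assumes that message delivery is the only nondeterministic event and that a determinant is the hypothetical tuple (receive sequence number, sender, payload). Real protocols usually log the payload separately, so treat the layout as an illustration only.

```python
class Process:
    def __init__(self):
        self.state = 0
        self.delivered = 0

    def deliver(self, sender, payload):
        # Deterministic transition whose result depends on delivery order.
        self.state = (self.state * 31 + payload) % 1_000_003
        self.delivered += 1


def run_with_logging(process, arrivals, log):
    for sender, payload in arrivals:
        # Determinant: (receive sequence number, sender, payload).
        log.append((process.delivered, sender, payload))
        process.deliver(sender, payload)


def replay(log):
    # Re-deliver the logged messages in their original receive order.
    recovered = Process()
    for _, sender, payload in sorted(log):
        recovered.deliver(sender, payload)
    return recovered


original = Process()
log = []
run_with_logging(original, [("P1", 7), ("P2", 3), ("P1", 9)], log)
print(replay(log).state == original.state)
```

Because the determinants fix the delivery order, the recovered process reaches the same state as the failed one without any checkpoint taken after the start of the run.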

2. Background and Definitions (Cont.)
- Garbage Collection
  - The deletion of useless recovery information
  - Checkpoints and event logs consume storage resources
  - Discarding useless information itself incurs overhead
  - Garbage collection strategies differ in
    - Complexity
    - Invocation frequency

3. Checkpoint-based Rollback Recovery
- Restores the system state to the most recent consistent set of checkpoints
- Does not rely on the PWD assumption
  - No need to detect, log, or replay nondeterministic events
  - Easier to implement
  - Does not guarantee that pre-failure execution can be deterministically regenerated after a rollback
- Categories: uncoordinated checkpointing, coordinated checkpointing, communication-induced checkpointing (CIC)

3. Checkpoint-based Rollback Recovery

3.1. Uncoordinated Checkpointing
- Overview
  - Allows each process the maximum autonomy
- Advantage
  - Each process takes a checkpoint when it is most convenient
- Disadvantages
  - Possibility of the domino effect
  - A process may take useless checkpoints that will never be part of a globally consistent state
  - Requires a garbage collection algorithm
  - Not suitable for applications with frequent output commits

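3.1. Uncoordinated Checkpointing (Cont.)
- Simple example (figure: processes P_i and P_j, checkpoints c_{i,x-1} and c_{j,y-1}, message m)
  - P_i piggybacks (i, x) on message m
  - P_j records the dependency from I_{i,x} to I_{j,y}
  - Recovery messages: dependency request, dependency reply, rollback request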
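
The dependency tracking in this example can be sketched as follows. The `Process` class, the dictionary message format, and the convention that checkpoint c_{i,x} closes interval I_{i,x} are assumptions made for illustration.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1   # index x of the current checkpoint interval I_{pid,x}
        self.deps = []      # dependencies recorded during the current interval
        self.stable = []    # (checkpoint index, dependencies saved with it)

    def send(self, payload):
        # Piggyback (i, x) on every outgoing message.
        return {"payload": payload, "from": (self.pid, self.interval)}

    def receive(self, msg):
        # Record the dependency from I_{i,x} to the current interval I_{j,y}.
        self.deps.append((msg["from"], (self.pid, self.interval)))

    def checkpoint(self):
        # Save the dependencies together with the checkpoint on stable storage.
        self.stable.append((self.interval, list(self.deps)))
        self.deps.clear()
        self.interval += 1


pi, pj = Process("Pi"), Process("Pj")
pj.checkpoint()                 # Pj is now in interval 2
pj.receive(pi.send("hello"))    # sent in I_{Pi,1}, received in I_{Pj,2}
pj.checkpoint()
print(pj.stable[-1])
```

During recovery, the saved dependency lists are what a process would return in its dependency reply.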

3.1. Uncoordinated Checkpointing (Cont.)
- Recovery Line Calculation
  - Rollback-dependency graph

Rollback-dependency graph
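
The recovery line can be computed by reachability on the rollback-dependency graph. The sketch assumes the usual construction, which the slide shows only as a figure: an edge runs from c_{i,x} to c_{j,y} when a message is sent in I_{i,x} and received in I_{j,y}, and from c_{i,x} to c_{i,x+1} within a process. The volatile state of each process is modeled as one extra node, and all function and parameter names are hypothetical.

```python
from collections import defaultdict, deque


def recovery_line(last_ckpt, messages, failed):
    """Return, per process, the checkpoint index it must restart from.

    last_ckpt[p]: index of p's latest stable checkpoint (c_{p,0} is the
                  initial state); node (p, last_ckpt[p] + 1) is the volatile state.
    messages:     tuples (sender, x, receiver, y) for a message sent in
                  I_{sender,x} and received in I_{receiver,y}.
    failed:       set of failed processes.
    """
    edges = defaultdict(set)
    for p, last in last_ckpt.items():
        for x in range(last + 1):
            edges[(p, x)].add((p, x + 1))
    for s, x, r, y in messages:
        edges[(s, x)].add((r, y))

    # Mark the volatile states of failed processes, then everything reachable.
    marked = {(p, last_ckpt[p] + 1) for p in failed}
    queue = deque(marked)
    while queue:
        node = queue.popleft()
        for nxt in edges[node]:
            if nxt not in marked:
                marked.add(nxt)
                queue.append(nxt)

    # The last unmarked node of each process belongs to the recovery line.
    return {
        p: max(x for x in range(last + 2) if (p, x) not in marked)
        for p, last in last_ckpt.items()
    }


last = {"P0": 1, "P1": 1}
print(recovery_line(last, [("P1", 2, "P0", 2)], {"P1"}))
print(recovery_line(last, [("P0", 1, "P1", 1), ("P1", 2, "P0", 1)], {"P1"}))
```

In the first call both processes restart from their latest checkpoint. In the second, the two messages chain the rollbacks so that both processes return to index 0, the start of the computation.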

3.1. Uncoordinated Checkpointing (Cont.)
- Checkpoint graph
  - Similar to the rollback-dependency graph, except that when a message is sent from I_{i,x} and received in I_{j,y}, a directed edge is drawn from c_{i,x-1} to c_{j,y}

Checkpoint graph

3.1. Uncoordinated Checkpointing (Cont.)
- The Domino Effect
  - Cascaded rollback: processes may roll back all the way to the beginning of the computation
  - A single process failure can cause the loss of all the work done

3.2. Coordinated Checkpointing
- Overview
  - Orchestrates the checkpoints of all processes so that they form a consistent global state
- Advantages
  - Simplifies recovery
  - Not susceptible to the domino effect
  - Reduces storage overhead
  - Does not need garbage collection
- Disadvantage
  - Large latency involved in committing output

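3.2. Coordinated Checkpointing (Cont.)
- Straightforward approach
  - Block communications while the checkpointing protocol executes
  - Message flow (figure labels: coordinator, P_i, c_{j,y-1}, c_{j,y})
    - The coordinator sends a checkpoint request
    - Each process stops execution, flushes all communication channels, takes a tentative checkpoint, and returns an acknowledgment message
    - The coordinator sends a commit message, after which the tentative checkpoint becomes c_{j,y}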
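
The blocking protocol has a two-phase shape that a short synchronous simulation can show. The `Participant` class, the `healthy` flag, and the abort path are assumptions added for illustration. Channel flushing is not modeled.

```python
class Participant:
    def __init__(self, name, state, healthy=True):
        self.name, self.state, self.healthy = name, state, healthy
        self.blocked = False
        self.tentative = None
        self.permanent = None

    def on_request(self):
        if not self.healthy:
            return "nack"
        self.blocked = True           # stop execution while the protocol runs
        self.tentative = self.state   # take the tentative checkpoint
        return "ack"

    def on_commit(self):
        self.permanent = self.tentative   # tentative checkpoint becomes permanent
        self.tentative = None
        self.blocked = False

    def on_abort(self):
        self.tentative = None             # keep the previous permanent checkpoint
        self.blocked = False


def coordinated_checkpoint(participants):
    # Phase 1: request tentative checkpoints and collect acknowledgments.
    acks = [p.on_request() for p in participants]
    # Phase 2: commit only if every participant acknowledged.
    if all(a == "ack" for a in acks):
        for p in participants:
            p.on_commit()
        return True
    for p in participants:
        p.on_abort()
    return False


group = [Participant("P0", 3), Participant("P1", 7)]
print(coordinated_checkpoint(group), [p.permanent for p in group])
```

Every participant stays blocked between the request and the commit, which is the cost that the non-blocking variants try to remove.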

3.2. Coordinated Checkpointing (Cont.)
- Non-blocking Checkpoint Coordination
  - Without blocking, an application message can make the checkpoint inconsistent
  - Problem: prevent a process from receiving application messages that could make the checkpoint inconsistent

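3.2. Coordinated Checkpointing (Cont.)
- With FIFO channels: the checkpoint request (marker) travels ahead of any post-checkpoint message
- With non-FIFO channels: the marker is piggybacked on every post-checkpoint message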
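
The FIFO case can be sketched with explicit per-channel queues. A process that takes a checkpoint puts a marker on each outgoing channel before any later message, and a process that sees a marker for the first time takes its own checkpoint before delivering anything else. The `Network` and `Process` classes are hypothetical, and recording of channel state is left out.

```python
from collections import deque

MARKER = object()  # checkpoint request sent ahead of post-checkpoint messages


class Process:
    def __init__(self, pid, net):
        self.pid, self.net = pid, net
        self.state = 0
        self.checkpoint = None

    def send(self, dst, value):
        self.net.channels[(self.pid, dst)].append(value)

    def take_checkpoint(self):
        self.checkpoint = self.state
        # Send a marker on every outgoing channel before any further message.
        for (src, _dst), channel in self.net.channels.items():
            if src == self.pid:
                channel.append(MARKER)

    def deliver(self, msg):
        if msg is MARKER:
            if self.checkpoint is None:   # first marker: checkpoint now
                self.take_checkpoint()
        else:
            self.state += msg


class Network:
    def __init__(self, pids):
        self.channels = {(a, b): deque() for a in pids for b in pids if a != b}
        self.procs = {p: Process(p, self) for p in pids}

    def step(self, src, dst):
        # Deliver the oldest message on channel src -> dst (FIFO order).
        self.procs[dst].deliver(self.channels[(src, dst)].popleft())


net = Network(["P", "Q"])
p, q = net.procs["P"], net.procs["Q"]
p.take_checkpoint()      # P initiates; the marker enters channel P -> Q
p.send("Q", 5)           # post-checkpoint message, queued behind the marker
net.step("P", "Q")       # Q sees the marker and checkpoints first
net.step("P", "Q")       # only then does Q deliver the message
net.step("Q", "P")       # P ignores Q's marker: it already has a checkpoint
print(p.checkpoint, q.checkpoint, q.state)
```

Q's checkpoint excludes the message that P sent after its own checkpoint, so the two checkpoints are consistent.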

3.2. Coordinated Checkpointing (Cont.)
- Checkpointing with Synchronized Clocks
  - Uses loosely synchronized clocks
  - Triggers the local checkpointing actions of all processes at approximately the same time, without a checkpoint initiator
  - Needs no exchange of messages
  - If a failure occurs, it is detected within the specified time and the protocol is aborted

3.2. Coordinated Checkpointing (Cont.)
- Checkpointing and Communication Reliability
  - Problem (figure: message m between processes p and q, with checkpoints c_p and c_q)
  - Solution
    - Reliable channel: all in-transit messages must be saved by their intended destinations
    - Unreliable channel assumption: in-transit messages need not be saved

- Minimal Checkpoint Coordination
  - Coordinated checkpointing requires all processes to participate in every checkpoint
  - It is therefore desirable to reduce the number of processes involved
- Minimal Checkpoint Coordination Protocol
  - First phase
    - The initiator identifies all processes with which it has communicated since the last checkpoint
    - It sends them a request
    - Each process that receives the request repeats these steps until no more processes can be identified
  - Second phase
    - All identified processes take a checkpoint
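
The first phase amounts to a transitive closure over the "communicated since the last checkpoint" relation. The sketch assumes that relation is available as a hypothetical dictionary `communicated_with`. A real protocol discovers it by exchanging request messages.

```python
def processes_to_checkpoint(initiator, communicated_with):
    """Return the set of processes that must take part in the checkpoint.

    communicated_with[p] is the set of processes p has communicated with
    since its last checkpoint (illustrative input, not a protocol message).
    """
    identified = {initiator}
    frontier = [initiator]
    while frontier:
        p = frontier.pop()
        for q in communicated_with.get(p, ()):
            if q not in identified:   # q receives a request and repeats the step
                identified.add(q)
                frontier.append(q)
    return identified


comm = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"E"}, "E": set()}
print(sorted(processes_to_checkpoint("A", comm)))
```

Processes D and E never communicated with the initiator's group, so they skip this checkpoint, which is the saving the protocol aims for.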

3.3. Communication-induced Checkpointing
- Overview
  - Avoids the domino effect without requiring all checkpoints to be coordinated
  - Local checkpoints: taken independently
  - Forced checkpoints: taken to guarantee the eventual progress of the recovery line
  - Forced checkpoints prevent the creation of useless checkpoints
  - Needs no special coordination messages: protocol information is piggybacked on application messages

3.3. Communication-induced Checkpointing
- Z-path

3.3. Communication-induced Checkpointing
- Z-cycle
  - A Z-path that begins and ends with the same checkpoint
  - Example: the Z-path [m5, m3, m4] is a Z-cycle
  - It can be proved that a checkpoint is useless if and only if it is part of a Z-cycle

3.3. Communication-induced Checkpointing
- Model-based Protocols
  - Maintain a checkpoint and communication structure that prevents useless checkpoints
  - The MRS model
    - Ensures that, within each checkpoint interval, all message-receiving events precede all message-sending events
  - Another way (Bartlett, 1981)
    - Take a checkpoint immediately before every message-sending event
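
The slide does not say how the MRS property is maintained. One common approach, assumed here, is to force a checkpoint before a receive whenever the process has already sent a message in the current interval. The class and helper names are hypothetical.

```python
class MRSProcess:
    def __init__(self):
        self.intervals = [[]]          # events of each checkpoint interval
        self.sent_in_interval = False
        self.forced = 0

    def checkpoint(self):
        self.intervals.append([])      # a checkpoint starts a new interval
        self.sent_in_interval = False

    def send(self):
        self.intervals[-1].append("send")
        self.sent_in_interval = True

    def receive(self):
        if self.sent_in_interval:
            # A receive after a send would break MRS: checkpoint first.
            self.forced += 1
            self.checkpoint()
        self.intervals[-1].append("recv")


def satisfies_mrs(interval):
    # True if no "recv" appears after the first "send" in the interval.
    seen_send = False
    for event in interval:
        if event == "send":
            seen_send = True
        elif seen_send:
            return False
    return True


proc = MRSProcess()
for action in ["recv", "send", "send", "recv", "recv", "send", "recv"]:
    proc.receive() if action == "recv" else proc.send()
print(proc.forced, proc.intervals)
```

Each forced checkpoint cuts the interval just before the offending receive, so every interval keeps its receives ahead of its sends.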

3.3. Communication-induced Checkpointing
- Index-based Protocols
  - Guarantee, through forced checkpoints if necessary, that
    - if c_{i,m} → c_{j,n}, then ts(c_{j,n}) ≥ ts(c_{i,m}), where ts(c) is the timestamp of checkpoint c
    - consecutive local checkpoints of a process have increasing timestamps
  - The system uses an indexing scheme for the local and forced checkpoints, so that the checkpoints with the same index at all processes form a consistent state
  - Claimed performance advantages over other checkpointing styles
    - CIC allows processes considerable autonomy
    - Scales well with a large number of processes
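
A minimal index-based scheme can be sketched as follows. Each process piggybacks its current checkpoint index on every message, and a receiver whose index is lower takes a forced checkpoint with the piggybacked index before it processes the message. The `Process` class and the integer state are assumptions for illustration. The slide does not name a specific protocol.

```python
class Process:
    def __init__(self):
        self.index = 0
        self.state = 0
        self.checkpoints = {0: 0}   # checkpoint index -> saved state
        self.forced = 0

    def local_checkpoint(self):
        # Independent checkpoint: the index increases by one.
        self.index += 1
        self.checkpoints[self.index] = self.state

    def send(self, value):
        # Piggyback the current checkpoint index on the message.
        return (self.index, value)

    def receive(self, msg):
        index, value = msg
        if index > self.index:
            # Forced checkpoint, taken before the message is processed.
            self.index = index
            self.checkpoints[index] = self.state
            self.forced += 1
        self.state += value


p, q = Process(), Process()
p.local_checkpoint()        # P takes checkpoint 1 on its own
q.receive(p.send(5))        # Q is forced to take checkpoint 1 first
print(p.checkpoints[1], q.checkpoints[1], q.state, q.forced)
```

Q's checkpoint with index 1 predates the delivery of the message that P sent after its own checkpoint 1, so the two checkpoints with index 1 record no orphan message.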