EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

10/20/2015 EEC688/788: Secure & Dependable Computing Wenbing Zhao

Outline
- Checkpointing and logging
  - Checkpoint-based protocols
    - Uncoordinated checkpointing
    - Coordinated checkpointing
  - Logging-based protocols
    - Pessimistic logging
    - Optimistic logging
    - Causal logging

Uncoordinated Checkpointing
- Uncoordinated checkpointing gives each process full autonomy and appears to be simple. However, we do not recommend it, for two reasons:
  - The checkpoints taken might not be useful for reconstructing a consistent global state
    - This can cause cascading rollback to the initial state (the domino effect)
  - To enable the selection of a set of consistent checkpoints during recovery, the dependencies among checkpoints have to be determined and recorded together with each checkpoint
    - Extra overhead and complexity => not simple after all

Cascading Rollback Problem
- Last checkpoint: C1,1, taken by P1 before P1 crashed
- C0,1 at P0 cannot be used because it is inconsistent with C1,1 => P0 rolls back to C0,0
- C2,1 at P2 cannot be used because it fails to reflect the sending of m6 => P2 rolls back to C2,0
- As a result, C3,1 and C3,0 cannot be used => P3 rolls back to its initial state

Cascading Rollback Problem
- The rollback of P3 to its initial state invalidates C2,0 => P2 rolls back to its initial state
- P1 rolls back to C1,0 because P2 rolled back to its initial state
- This invalidates the use of C0,0 at P0 => P0 rolls back to its initial state
- The rollback of P0 to its initial state invalidates the use of C1,0 at P1 => P1 rolls back to its initial state
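The cascade can be reproduced with a small fixed-point computation. The following is a minimal sketch, not the algorithm from the lecture: it assumes every process restarts from a checkpoint, that events are stamped with per-process local times, and that the only inconsistency considered is an orphan message (received before the receiver's checkpoint but sent after the sender's). The function name `recovery_line` and the two-process example are hypothetical and do not correspond to the figure on the slide.

```python
from bisect import bisect_left

def recovery_line(checkpoints, messages):
    """Compute a consistent set of checkpoints to restore.

    checkpoints: {process: sorted list of local checkpoint times, where
                  time 0 stands for the initial state}
    messages:    list of (sender, send_time, receiver, recv_time), with
                  times local to the sender and receiver respectively
    """
    # Start optimistically from the latest checkpoint of every process.
    line = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Orphan message: the receiver's checkpoint reflects the receipt,
            # but the sender's checkpoint does not reflect the send.
            if sent > line[sender] and received <= line[receiver]:
                times = checkpoints[receiver]
                # Roll the receiver back to its latest checkpoint taken
                # strictly before the receipt.
                line[receiver] = times[bisect_left(times, received) - 1]
                changed = True
    return line

# Hypothetical example: A and B each have one checkpoint besides the initial state.
ckpts = {"A": [0, 5], "B": [0, 4]}
msgs = [("A", 2, "B", 3),   # A sends before its checkpoint, B receives before its checkpoint
        ("B", 6, "A", 4)]   # B sends after its checkpoint, A receives before its checkpoint
print(recovery_line(ckpts, msgs))
```

In this example the second message forces A back to its initial state, which in turn turns the first message into an orphan and forces B back as well, so both processes lose all their work.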

Tamir and Sequin Global Checkpointing Protocol
- One of the processes is designated as the coordinator; the others are participants
- The coordinator uses a two-phase commit protocol to keep the checkpoints consistent
  - Global checkpointing is carried out atomically: all or nothing
  - First phase: create a quiescent point of the distributed system
  - Second phase: ensure the atomic switchover from the old checkpoint to the new one

Tamir and Sequin Global Checkpointing Protocol
- Control messages for coordination
  - CHECKPOINT message: initiates a global checkpoint and creates the quiescent point
  - SAVED message: informs the coordinator that a participant has completed its local checkpoint
  - FAULT message: a timeout occurred, so global checkpointing should abort
  - RESUME message: informs participants that it is time to resume normal operation
- Every control message except SAVED is sent on all outgoing channels except the one leading back to the node it was received from
- CHECKPOINT certificate: keeps track of whether a CHECKPOINT message has been received on every incoming channel
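The forwarding rule and the certificate can be sketched as follows. This is an illustrative sketch only: it assumes channels are identified by the name of the neighbouring process, and that SAVED goes to a single upstream neighbour. The names `Control`, `forward_targets`, and `CheckpointCertificate` are hypothetical.

```python
from enum import Enum

class Control(Enum):
    CHECKPOINT = "CHECKPOINT"
    SAVED = "SAVED"
    FAULT = "FAULT"
    RESUME = "RESUME"

def forward_targets(kind, outgoing, received_from, upstream):
    """Return the neighbours a control message should be sent to."""
    if kind is Control.SAVED:
        # Assumption: SAVED travels only towards the coordinator.
        return [upstream]
    # Every other control message goes to all outgoing channels
    # except the one it arrived on.
    return [ch for ch in outgoing if ch != received_from]

class CheckpointCertificate:
    """Tracks whether CHECKPOINT has arrived on every incoming channel."""

    def __init__(self, incoming):
        self.pending = set(incoming)

    def record(self, channel):
        """Note a CHECKPOINT on `channel`; return True once all have arrived."""
        self.pending.discard(channel)
        return not self.pending

cert = CheckpointCertificate(["P0", "P2"])
print(forward_targets(Control.CHECKPOINT, ["P0", "P2", "P3"], "P0", "P0"))
print(cert.record("P0"), cert.record("P2"))
```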

Tamir and Sequin Global Checkpointing Protocol
- Typos (p. 24, Figure 2.4 and p. 25, Figure 2.5): "Final state machine" => "Finite state machine"

Tamir and Sequin Global Checkpointing Protocol
- SAVED: sent to the upstream node

Tamir and Sequin Global Checkpointing Protocol: Example
- P0 channel state: m0
- P1 channel state: m1
- P2 channel state: empty

Tamir and Sequin Global Checkpointing Protocol: Proof of Correctness
- Claim: the protocol produces a consistent global state
- Proof: a consistent global state admits only two scenarios:
  - All msgs sent by one process prior to taking its local checkpoint have been received before the other process takes its local checkpoint
    - This is the case if no process sends any msg after the global checkpoint is initiated
  - Some msgs sent by one process prior to taking its local checkpoint might arrive after the other process has checkpointed its state, but they are logged for replay
    - Msgs received after the initiation of global checkpointing are logged, but not executed, which ensures this property
- Note that if a process fails, the global checkpointing aborts

Chandy and Lamport Distributed Snapshot Protocol
- The CL snapshot protocol is a nonblocking protocol
  - The TS checkpointing protocol is blocking
  - The CL protocol is more desirable for applications that do not wish to suspend normal operation
  - However, the CL protocol is concerned only with how to obtain a consistent global checkpoint
- CL protocol: no coordinator; any node may initiate a global checkpoint
- Data structures
  - Marker message: equivalent to the CHECKPOINT message
  - Marker certificate: keeps track of whether a marker has been received on every incoming channel
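The marker handling at a single node can be sketched as below. This is a simplified illustration of the standard Chandy-Lamport rules under stated assumptions: channels are FIFO and named after the neighbouring process, the application state is a single integer, and `send` is a hypothetical callback supplied by the caller. The class name `SnapshotNode` is also hypothetical.

```python
MARKER = "MARKER"

class SnapshotNode:
    def __init__(self, incoming, outgoing, send):
        self.incoming = list(incoming)
        self.outgoing = list(outgoing)
        self.send = send            # hypothetical callback: send(dest, payload)
        self.state = 0              # application state (a running total)
        self.saved_state = None     # local checkpoint, once taken
        self.channel_state = {}     # incoming channel -> msgs recorded in transit
        self.awaiting = set()       # marker certificate: channels still missing a marker

    def _checkpoint(self):
        self.saved_state = self.state
        self.channel_state = {ch: [] for ch in self.incoming}
        self.awaiting = set(self.incoming)
        for dest in self.outgoing:
            self.send(dest, MARKER)

    def initiate(self):
        """Any node may start a snapshot; there is no coordinator."""
        self._checkpoint()

    def on_marker(self, channel):
        """Handle a marker; return True once the certificate is complete."""
        if self.saved_state is None:
            self._checkpoint()      # first marker: checkpoint, then relay markers
        self.awaiting.discard(channel)  # stop recording this channel
        return not self.awaiting

    def on_message(self, channel, amount):
        if channel in self.awaiting:
            # Sent before the sender's checkpoint, received after ours:
            # the message belongs to the channel state.
            self.channel_state[channel].append(amount)
        self.state += amount        # normal operation is never suspended

sent = []
p1 = SnapshotNode(["P0", "P2"], ["P0", "P2"], lambda dest, m: sent.append((dest, m)))
p1.on_message("P0", 5)
p1.on_marker("P0")
p1.on_message("P2", 3)
done = p1.on_marker("P2")
print(p1.saved_state, p1.channel_state, done)
```

Because the node keeps executing messages while it records channel state, the sketch shows the nonblocking property: only the bookkeeping changes during the snapshot.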

CL Distributed Snapshot Protocol

Example
- P0 channel state: m0 (P1-to-P0 channel)
- P1 channel state: m1 (P2-to-P1 channel)
- P2 channel state: empty

Comparison of TS & CL Protocols
- Similarities
  - Both rely on control msgs to coordinate checkpointing
  - Both capture channel state in virtually the same way
    - Start logging channel state upon receiving the 1st checkpoint msg from another channel
    - Stop logging channel state after receiving the checkpoint msg on the incoming channel
  - Communication overhead is similar

Comparison of TS & CL Protocols
- Differences: strategies for producing a global checkpoint
  - The TS protocol suspends normal operation upon the 1st checkpoint msg, while CL does not
  - The TS protocol captures channel state prior to taking a checkpoint, while CL captures channel state after taking a checkpoint
  - The TS protocol is more complete and robust than CL
    - It has a fault handling mechanism

Log Based Protocols
- Work might be lost upon recovery when using checkpoint-based protocols
- By logging messages, we may be able to recover the system to where it was prior to the failure
- System model: the execution of a process is modeled as a set of consecutive state intervals
  - Each interval is initiated by a nondeterministic event or by the initial state
  - We assume the only type of nondeterministic event is the receiving of a message

Log Based Protocols
- In practice, logging is always used together with checkpointing
  - Limits the recovery time: start with the latest checkpoint instead of the initial state
  - Limits the size of the log: after taking a checkpoint, previously logged events can be purged
- Logging protocol types:
  - Pessimistic logging: msgs are logged prior to execution
  - Optimistic logging: msgs are logged asynchronously
  - Causal logging: nondeterministic events that are not yet logged (to stable storage) are piggybacked on each msg sent
- For optimistic and causal logging, the dependencies among processes have to be tracked => more complexity, longer recovery time

Pessimistic Logging
- Synchronously log every incoming message to stable storage prior to execution
- Each process periodically checkpoints its state: no need for coordination
- Recovery: a process restores its state from the last checkpoint and replays all logged incoming msgs
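The log-then-execute discipline and the local recovery step can be sketched as below. This is a minimal sketch under stated assumptions: stable storage is simulated with in-memory attributes, message receipt is the only nondeterministic event, and execution is a deterministic function of the current state and the message. The class `LoggedProcess` and its arithmetic `_execute` step are hypothetical.

```python
class LoggedProcess:
    def __init__(self):
        self.state = 0        # volatile state, lost on a crash
        self.checkpoint = 0   # simulated stable storage: last checkpoint
        self.log = []         # simulated stable storage: msgs since the checkpoint

    def deliver(self, msg):
        self.log.append(msg)  # log synchronously BEFORE executing
        self._execute(msg)

    def _execute(self, msg):
        # Deterministic given the current state and the message.
        self.state = self.state * 31 + msg

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()      # msgs already reflected in the checkpoint can be purged

    def crash(self):
        self.state = None     # volatile state is gone

    def recover(self):
        # Purely local: restore the checkpoint, then replay the logged msgs in order.
        self.state = self.checkpoint
        for msg in self.log:
            self._execute(msg)

p = LoggedProcess()
p.deliver(1)
p.deliver(2)
p.take_checkpoint()
p.deliver(3)
p.deliver(4)
before = p.state
p.crash()
p.recover()
print(p.state == before)
```

Purging the log at each checkpoint reflects the earlier point that checkpointing bounds both the recovery time and the log size.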

Pessimistic Logging: Example
- Pessimistic logging can cope with concurrent failures and the recovery of two or more processes

Benefits of Pessimistic Logging
- Processes do not need to track their dependencies
  - The logging mechanism is easy to implement and less error prone
- Output commit is automatically ensured
- No need to carry out coordinated global checkpointing
  - By replaying the logged msgs, a process can always bring itself to a state consistent with other processes
- Recovery can be done completely locally
  - The only impact on other processes: duplicate msgs (which can be discarded)
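One way for other processes to discard the duplicates produced during replay is a per-sender sequence number. The slides do not say how duplicates are detected, so this is an assumption-laden sketch: it assumes reliable FIFO channels and that a recovering sender reuses the same sequence numbers when it re-sends. The class name `DedupReceiver` is hypothetical.

```python
class DedupReceiver:
    def __init__(self):
        self.next_seq = {}    # sender -> next expected sequence number
        self.delivered = []   # payloads handed to the application

    def receive(self, sender, seq, payload):
        """Deliver a msg unless it is a duplicate; return True if delivered."""
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            # Already seen: the sender regenerated it while replaying its log.
            return False
        self.next_seq[sender] = seq + 1
        self.delivered.append(payload)
        return True

r = DedupReceiver()
r.receive("P1", 0, "a")
r.receive("P1", 1, "b")
r.receive("P1", 0, "a")   # duplicate re-sent during P1's recovery, dropped
r.receive("P1", 2, "c")
print(r.delivered)
```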