EEC 688/788 Secure and Dependable Computing


EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University wenbing@ieee.org Maulik no show 10/5/2009

Outline: midterm result; checkpointing and logging; system models; checkpoint-based protocols.

Midterm Result: high 100, low 55, mean 83.8. By question: Q1 41.5/50, Q2 19/20, Q3 9.1/10, Q4 14/20.

Checkpointing and Logging: Checkpointing and logging are the most essential techniques for achieving dependability. By themselves, they provide rollback recovery. They are also used in more sophisticated dependability schemes. Checkpoint: a copy of the system state, which can be used to restore the system to the state it was in when the checkpoint was taken. Checkpointing: the action of taking a copy of the system state, typically periodically. Logging: recording incoming and outgoing messages and other events.

Rollback Recovery vs. Rollforward Recovery

System Models: distributed system model; global state (consistent, inconsistent); distributed system model redefined; piecewise deterministic assumption; output commit; stable storage.

System Models: Distributed system. A distributed system (DS) consists of N processes. A process may interact with other processes only by sending and receiving messages. A process may interact with another process within the DS or with a process in the outside world. Fault model: fail-stop.

System Models: Process state and global state. Process state: defined by the process's entire address space in the OS; the relevant information can be captured by user-supplied APIs. Global state: the state of the entire distributed system; it is not a simple aggregation of the states of the processes.

Capturing Global State: A global state can be captured using a set of individual checkpoints. Inconsistent state: the checkpoints reflect a message that has been received but not sent.

Capturing Global State: Example. P0 manages bank account A and P1 manages bank account B. m0: deposit $100 to B (sent after P0 has debited A). P0 takes checkpoint C0 before the debit operation. P1 takes checkpoint C1 after depositing the $100. Scenario: P0 crashes after sending m0, and P1 crashes after taking C1. If the global state is reconstructed from C0 and C1, it would appear that P1 got $100 from nowhere.

Capturing Global State: Example. P0 takes checkpoint C0 after sending m0 (so C0 reflects the debit of $100). P1 takes checkpoint C1 after depositing the $100. The dependency between P0 and P1 is captured by C0 and C1, so the global state can be reconstructed correctly from C0 and C1.

Capturing Global State: Example. P0 takes checkpoint C0 after sending m0 (so C0 reflects the debit of $100). P1 takes checkpoint C1 before receiving m0 but after sending m1. P2 takes checkpoint C2 before receiving m1. If C0, C1, and C2 are used to reconstruct the global state, it would appear that m0 has been sent but not received: $100 has been debited from A but not deposited to B. However, the reconstructed global state is still regarded as consistent because it could have occurred: m0 and m1 are still in transit, and they form the channel state.
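
The three examples above can be checked mechanically. The following is a minimal sketch, assuming each checkpoint records the IDs of the messages its process has sent and received so far; the `Checkpoint` record and the function names are hypothetical and not part of the lecture. A set of checkpoints is treated as inconsistent when some message is recorded as received but not as sent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint record: IDs of messages sent/received so far."""
    process: str
    sent: frozenset = field(default_factory=frozenset)
    received: frozenset = field(default_factory=frozenset)


def orphan_messages(checkpoints):
    """Return the IDs of messages recorded as received but never as sent."""
    all_sent = set()
    all_received = set()
    for cp in checkpoints:
        all_sent |= cp.sent
        all_received |= cp.received
    return all_received - all_sent


def is_consistent(checkpoints):
    """A global state is consistent when it contains no orphan message."""
    return not orphan_messages(checkpoints)


# Example 1: C0 taken before the debit (m0 not yet sent), C1 after the deposit.
example1 = [Checkpoint("P0"), Checkpoint("P1", received=frozenset({"m0"}))]

# Example 2: C0 taken after sending m0, C1 after the deposit.
example2 = [Checkpoint("P0", sent=frozenset({"m0"})),
            Checkpoint("P1", received=frozenset({"m0"}))]

# Example 3: m0 and m1 have been sent but not yet received (in transit).
example3 = [Checkpoint("P0", sent=frozenset({"m0"})),
            Checkpoint("P1", sent=frozenset({"m1"})),
            Checkpoint("P2")]

print(is_consistent(example1))  # False: m0 received but not sent
print(is_consistent(example2))  # True
print(is_consistent(example3))  # True: in-transit messages are allowed
```

Note the asymmetry: a message that is sent but not received is acceptable because it can be attributed to the channel state, whereas a message that is received but not sent cannot be explained by any real execution.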

Distributed System Model Redefined: A distributed system consists of a set of N processes and a set of channels. Processes: each process consists of a set of states and a set of events; one of the states is the initial state; a change of state is caused by an event. Channels: each channel is a unidirectional, reliable communication channel between two processes; the state of a channel consists of the set of messages in transit in the channel; a pair of neighboring processes is connected by a pair of channels, one in each direction. An event (such as the sending or receiving of a message) at a process may change the state of the process and the state of the channel it is associated with, if any.

Back on the Global State Example: The global state consists of C0, C1, and C2, together with the channel states. Channel state from P0 to P1: m0. Channel state from P1 to P2: m1.
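
As an illustration of how the channel states in this example could be derived, here is a small sketch. It assumes each checkpoint records, per channel, which messages were sent and which were received; the function name and the dictionary layout are hypothetical. The state of a channel is then the list of messages recorded as sent on it but not yet recorded as received.

```python
def channel_states(sent, received):
    """Compute the in-transit messages for every channel.

    sent:     dict mapping (src, dst) -> message IDs recorded as sent
              in the checkpoint of src
    received: dict mapping (src, dst) -> message IDs recorded as received
              in the checkpoint of dst
    """
    states = {}
    for channel, msgs in sent.items():
        already_received = set(received.get(channel, []))
        # Keep the sending order; drop anything the receiver already has.
        states[channel] = [m for m in msgs if m not in already_received]
    return states


# C0 records m0 as sent, C1 records m1 as sent; C1 and C2 record no receipts.
sent = {("P0", "P1"): ["m0"], ("P1", "P2"): ["m1"]}
received = {}

states = channel_states(sent, received)
print(states)  # {('P0', 'P1'): ['m0'], ('P1', 'P2'): ['m1']}
```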

Piecewise Deterministic Assumption: Using only checkpoints to restore the system state after a crash means that any execution after the last checkpoint is lost. Logging the events that occur between two checkpoints ensures full recovery. Piecewise deterministic assumption: all nondeterministic events can be identified, and for each such event, sufficient information (referred to as the determinant) must be logged so that the event can be recreated deterministically. Examples: the receiving of a message, system calls, timeouts, etc. Note that the sending of a message is not a nondeterministic event (it is determined by another nondeterministic event or by the initial state).
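
The idea of restoring a checkpoint and then replaying logged determinants can be sketched as follows. This is a toy illustration under stated assumptions, not a real recovery library: the process state is a single account balance, the only nondeterministic events are message receipts, and a plain dictionary named `stable` stands in for stable storage. All names are hypothetical.

```python
def apply_event(balance, msg):
    """Deterministic state transition for one received message."""
    kind, amount = msg
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount
    raise ValueError(f"unknown message kind: {kind}")


class LoggedProcess:
    def __init__(self, stable):
        self.stable = stable                  # stand-in for stable storage
        self.stable.setdefault("checkpoint", 0)
        self.stable.setdefault("log", [])
        self.balance = self.stable["checkpoint"]

    def checkpoint(self):
        """Save the state; earlier log entries are no longer needed."""
        self.stable["checkpoint"] = self.balance
        self.stable["log"] = []

    def receive(self, msg):
        """Log the determinant (here, the message itself), then apply it."""
        self.stable["log"].append(msg)
        self.balance = apply_event(self.balance, msg)


def recover(stable):
    """Restore the last checkpoint and replay the logged messages in order."""
    process = LoggedProcess(stable)
    for msg in stable["log"]:
        process.balance = apply_event(process.balance, msg)
    return process


stable = {}
p = LoggedProcess(stable)
p.receive(("deposit", 100))
p.checkpoint()                   # checkpointed balance: 100
p.receive(("deposit", 50))
p.receive(("withdraw", 30))      # balance is now 120
del p                            # simulate a crash: volatile state is lost

recovered = recover(stable)
print(recovered.balance)         # 120, not the checkpointed 100
```

Replay works here only because `apply_event` is deterministic given the logged message, which is the point of the assumption.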

Output Commit: Once a message is sent to the outside world, the state of the distributed system may be exposed to the outside world. Should a failure occur, the outside world cannot be relied upon for recovery. Output commit problem: to ensure that the recovered state is consistent with the external view, sufficient recovery information must be logged prior to the sending of a message to the outside world. A distributed system usually receives messages from, and sends messages to, the outside world, e.g., the clients of the services provided by the distributed system.
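
A minimal sketch of the ordering constraint described above, assuming a process that buffers determinants in memory and flushes them to stable storage before any externally visible send. The class, its attributes, and the use of Python lists as stand-ins for stable storage and for the outside world are all hypothetical.

```python
class OutputCommitProcess:
    def __init__(self):
        self.volatile_log = []   # determinants not yet on stable storage
        self.stable_log = []     # stand-in for stable storage
        self.outside_world = []  # stand-in for messages seen by clients

    def receive(self, msg):
        """Buffer the determinant in memory; cheap, but lost on a crash."""
        self.volatile_log.append(msg)

    def flush(self):
        """Move all buffered determinants to stable storage."""
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

    def send_external(self, reply):
        """Output commit: log everything the reply depends on, then send."""
        self.flush()
        self.outside_world.append(reply)


p = OutputCommitProcess()
p.receive("request-1")
p.receive("request-2")
p.send_external("reply-2")

print(p.stable_log)    # ['request-1', 'request-2']
print(p.volatile_log)  # []
```

The design trade-off is latency: every external send waits for a flush, which is why messages exchanged inside the system can be logged more lazily than messages to the outside world.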

Stable Storage: Checkpoints and events must be logged to stable storage that can survive failures, so that they are available for recovery. Forms of stable storage include redundant disks (RAID-1, RAID-5) and replicated file systems (GFS).
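
Redundancy such as RAID or a replicated file system sits below the file API, but the application still has to write each checkpoint so that a crash in the middle of the write does not destroy the previous checkpoint. The sketch below shows one common way to do this with the Python standard library: write to a temporary file, force it to disk with `os.fsync`, then atomically rename it with `os.replace`. The function names and the JSON encoding of the state are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Persist state so that path holds either the old or the new checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # force the data to the storage device
        os.replace(tmp_path, path)    # atomic switchover to the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)       # leave the old checkpoint untouched
        raise


def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "p0.ckpt")
    save_checkpoint(ckpt, {"balance": 100})
    save_checkpoint(ckpt, {"balance": 120})
    restored = load_checkpoint(ckpt)

print(restored)  # {'balance': 120}
```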

Checkpoint-Based Protocols: uncoordinated protocols; coordinated protocols.

Uncoordinated Checkpointing: Each process has full autonomy in deciding when to take a checkpoint, so the approach appears to be simple. However, it is not recommended, for two reasons: (1) the checkpoints taken might not be useful for reconstructing a consistent global state; (2) recovery may cause a cascading rollback all the way to the initial state (the domino effect). To enable the selection of a set of consistent checkpoints during recovery, the dependencies among checkpoints have to be determined and recorded together with each checkpoint. This adds overhead and complexity, so the approach is not simple after all.

Cascading Rollback Problem: The last checkpoint taken by P1 before it crashed is C1,1. C0,1 at P0 cannot be used because it is inconsistent with C1,1, so P0 rolls back to C0,0. C1,1 at P1 cannot be used because it is inconsistent with C2,1, so P1 rolls back to C1,0. C2,1 and C3,1 cannot be used because they are inconsistent with C1,0, so P2 and P3 have to roll back to C2,0 and C3,0. C3,0 cannot be used either because of C2,0, so P3 rolls back to its initial state.

Cascading Rollback Problem: The rollback of P3 to its initial state invalidates C2,0, so P2 rolls back to its initial state. P1 rolls back to C1,0 because of the rollback of P2 to its initial state. This invalidates the use of C0,0 at P0, so P0 rolls back to its initial state. The rollback of P0 to its initial state invalidates the use of C1,0 at P1, so P1 rolls back to its initial state.
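
The cascading rollback can be reproduced with a small fixed-point computation. The sketch below does not use the four-process scenario from the slides, whose figure is not reproduced in this transcript; it uses a hypothetical two-process scenario instead. Assumptions: checkpoint index 0 denotes the initial state, interval i is the execution after checkpoint i and before checkpoint i+1, and each message is described by the intervals in which it was sent and received. A process is rolled back further whenever it would keep the receipt of a message whose sending has been undone.

```python
def recovery_line(latest, messages):
    """Return the latest consistent checkpoint index for each process.

    latest:   dict mapping process -> index of its most recent checkpoint
    messages: iterable of (sender, send_interval, receiver, recv_interval)
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for sender, s_int, receiver, r_int in messages:
            send_undone = s_int >= line[sender]   # send happened after the line
            recv_kept = r_int < line[receiver]    # receive is inside the line
            if send_undone and recv_kept:
                # Orphan message: roll the receiver back before the receive.
                line[receiver] = r_int
                changed = True
    return line


# Hypothetical domino scenario: each checkpoint is taken just after
# receiving a message that the other process sent after its own checkpoint.
domino = [
    ("P1", 0, "P0", 0),  # a
    ("P0", 1, "P1", 0),  # b
    ("P1", 1, "P0", 1),  # c
    ("P0", 2, "P1", 1),  # d
]
print(recovery_line({"P0": 2, "P1": 2}, domino))
# {'P0': 0, 'P1': 0}: both processes end up at their initial state

# Without message c the chain is broken and the rollback stops early.
no_domino = [m for m in domino if m != ("P1", 1, "P0", 1)]
print(recovery_line({"P0": 2, "P1": 2}, no_domino))
# {'P0': 2, 'P1': 1}
```

The loop terminates because every update strictly lowers one checkpoint index and no index can drop below 0.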

Tamir and Sequin Global Checkpointing Protocol: One of the processes is designated as the coordinator; the others are participants. The coordinator uses a two-phase commit protocol to ensure the consistency of the checkpoints. Global checkpointing is carried out atomically: all or nothing. First phase: create a quiescent point of the distributed system. Second phase: ensure the atomic switchover from the old checkpoint to the new one.

Tamir and Sequin Global Checkpointing Protocol: Control messages for coordination. CHECKPOINT: initiates a global checkpoint and creates a quiescent point. SAVED: informs the coordinator that a participant has completed its local checkpoint. FAULT: a timeout occurred, so global checkpointing should abort. RESUME: informs the participants that it is time to resume normal operation. A control message is sent on all outgoing channels except the one to the process it was received from.
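
The two phases and the four control messages can be illustrated with a heavily simplified simulation. This is a sketch under stated assumptions, not the protocol as published: the coordinator calls each participant directly instead of forwarding control messages along channels, a participant's timeout is simulated with a hypothetical `fail` flag, and participants are assumed to discard the tentative checkpoint and resume after an abort. All class and function names are made up for illustration.

```python
class Participant:
    def __init__(self, name, state, fail=False):
        self.name = name
        self.state = state
        self.committed = None   # last checkpoint that was switched in
        self.tentative = None   # checkpoint taken in phase one
        self.running = True
        self.fail = fail        # hypothetical flag simulating a timeout

    def on_checkpoint(self):
        """CHECKPOINT: stop normal execution and take a tentative checkpoint."""
        self.running = False
        if self.fail:
            return "FAULT"
        self.tentative = self.state
        return "SAVED"

    def on_resume(self):
        """RESUME: switch over to the new checkpoint and continue."""
        self.committed = self.tentative
        self.tentative = None
        self.running = True

    def on_fault(self):
        """FAULT: drop the tentative checkpoint, keep the old one."""
        self.tentative = None
        self.running = True     # assumption: normal operation continues


def global_checkpoint(participants):
    """Run one round; return True if the new global checkpoint was committed."""
    # Phase one: create a quiescent point and collect SAVED or FAULT replies.
    replies = [p.on_checkpoint() for p in participants]
    success = all(reply == "SAVED" for reply in replies)
    # Phase two: all-or-nothing switchover.
    for p in participants:
        if success:
            p.on_resume()
        else:
            p.on_fault()
    return success


group = [Participant("P1", 10), Participant("P2", 20)]
print(global_checkpoint(group))           # True
print([p.committed for p in group])       # [10, 20]

group[0].state = 11
group[1].state = 21
group[1].fail = True                      # P2 times out in the next round
print(global_checkpoint(group))           # False
print([p.committed for p in group])       # [10, 20]: old checkpoint kept
```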

Tamir and Sequin Global Checkpointing Protocol: Example

Tamir and Sequin Global Checkpointing Protocol: Proof of Correctness. Claim: the protocol produces a consistent global state. Proof: a consistent global state arises in only two scenarios. (1) All messages sent by one process prior to taking its local checkpoint have been received before the other process takes its local checkpoint. This is the case if no process sends any message after the global checkpoint is initiated. (2) Some messages sent by one process prior to taking its local checkpoint might arrive after the other process has checkpointed its state, but they are logged for replay. Messages received after the initiation of global checkpointing are logged but not executed, which ensures this property. Note that if a process fails, the global checkpointing is aborted.