CIS 620 Advanced Operating Systems

Lecture 11 – Fault Tolerance Prof. Timothy Arndt BU 331

Fault Tolerance
Dependable systems have the following requirements: availability, reliability, safety, and maintainability.

Fault Tolerance
Faults can be classified as transient, intermittent, or persistent.

Failure Models
Different types of failures.

Type of failure               Description
Crash failure                 A server halts, but is working correctly until it halts
Omission failure              A server fails to respond to incoming requests
  Receive omission            A server fails to receive incoming messages
  Send omission               A server fails to send messages
Timing failure                A server's response lies outside the specified time interval
Response failure              The server's response is incorrect
  Value failure               The value of the response is wrong
  State transition failure    The server deviates from the correct flow of control
Arbitrary failure             A server may produce arbitrary responses at arbitrary times

Failure Masking by Redundancy
Triple modular redundancy.
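
The voter at the heart of triple modular redundancy can be sketched in a few lines. This is a minimal illustration, not a hardware design: the function name `majority_vote` is an assumption, and it simply returns whatever value at least two of the three replicated units produced.

```python
from collections import Counter

def majority_vote(a, b, c):
    """Return the value reported by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three replicas disagree, so a single fault can no longer be masked.
        raise ValueError("no majority: all three replicas disagree")
    return value
```

With one faulty replica the fault is masked: `majority_vote(7, 7, 9)` yields the value produced by the two healthy units.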

Flat Groups versus Hierarchical Groups
Communication in a flat group. Communication in a simple hierarchical group

Agreement in Faulty Systems
The Byzantine generals problem for 3 loyal generals and 1 traitor. (a) The generals announce their troop strengths (in units of 1 kilosoldier). (b) The vectors that each general assembles based on (a). (c) The vectors that each general receives in step 3.

Agreement in Faulty Systems
The same as in the previous slide, except now with 2 loyal generals and 1 traitor.
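
The two cases can be compared with a small simulation. This is a sketch under stated assumptions: the function `byzantine_agreement`, the `lie` callback that decides what the traitor sends, and the troop strengths used in the usage note are all hypothetical. Each loyal general keeps an element only if a strict majority of the vectors it receives agree on it, and marks it unknown (`None`) otherwise.

```python
from collections import Counter

def byzantine_agreement(strengths, traitor, lie):
    """Return the result vector computed by each loyal general.

    strengths: dict mapping general id -> true troop strength.
    traitor:   id of the single faulty general.
    lie:       hypothetical function (receiver, element) -> value the traitor sends.
    """
    ids = sorted(strengths)

    def send(sender, receiver, element, honest_value):
        # Loyal generals relay the honest value; the traitor sends anything.
        return lie(receiver, element) if sender == traitor else honest_value

    # Every general announces its strength; each general assembles a vector.
    vectors = {
        g: {j: strengths[j] if j == g else send(j, g, j, strengths[j]) for j in ids}
        for g in ids
    }

    results = {}
    for g in ids:
        if g == traitor:
            continue
        # General g receives the assembled vector of every other general.
        received = [
            {j: send(s, g, j, vectors[s][j]) for j in ids}
            for s in ids if s != g
        ]
        # Keep an element only if a strict majority of received vectors agree.
        result = {}
        for j in ids:
            value, count = Counter(vec[j] for vec in received).most_common(1)[0]
            result[j] = value if count > len(received) / 2 else None
        results[g] = result
    return results
```

With a traitor that sends a different value to every receiver, for example `lie = lambda receiver, element: f"lie-{receiver}-{element}"`, three loyal generals and one traitor agree on every loyal general's strength, while two loyal generals and one traitor end up with no majority for any element.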

RPC Failure Semantics
This gets hard and ugly.
Can't find the server.
  Need some sort of out-of-band response from the client stub to the client:
    Ada exceptions
    C signals
    Multithread the client and start the "exception" thread.
  This loses transparency (centralized systems don't have this).

RPC Failures
Lost request message.
  This is easy if known, that is, if we are sure the request was lost.
  Also easy if idempotent and we think it might be lost.
  Simply retransmit the request.
  Assumes the client still knows the request.
Lost reply message.
  If it is known the reply was lost, have the server retransmit.

RPC Failures
Assumes the server still has the reply.
How long should the server hold the reply?
  Wait forever for the reply to be ack'ed? No!
  Discard after "enough" time.
  Discard after we receive another request from this client.
  Ask the client if the reply was received.
  Keep resending the reply.
What if we are not sure whether we lost the request or the reply?
  If the server is stateless, it doesn't know and the client can't tell!
  If idempotent, simply retransmit the request.

RPC Failures
What if the server is not idempotent and can't tell if we lost the request or the reply?
  Use sequence numbers so the server can tell that this is a new request, not a retransmission of a request it has already done.
  Doesn't work for stateless servers.
Server crashes
  Did it crash before or after doing some nonidempotent action?
  Can't tell from messages.
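
The sequence-number idea can be sketched as follows. This is an illustrative, stateful server, not any particular RPC library: the class `DedupServer`, its `handle` method, and the "add to a balance" operation are assumptions chosen because adding is not idempotent. The server keeps only the most recent reply per client, which corresponds to discarding a saved reply once another request from the same client arrives.

```python
class DedupServer:
    """Stateful server that filters retransmissions by sequence number."""

    def __init__(self):
        self.balance = 0
        self.last = {}  # client id -> (sequence number, saved reply)

    def handle(self, client_id, seq, amount):
        cached = self.last.get(client_id)
        if cached is not None and cached[0] == seq:
            # Retransmission of a request already executed: resend the saved reply.
            return cached[1]
        self.balance += amount           # the non-idempotent action
        reply = self.balance
        self.last[client_id] = (seq, reply)
        return reply
```

A client that times out and resends request 1 gets the saved reply back, and the balance is changed only once.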

RPC Failures
From databases, we get the idea of transactions and commits. This really does solve the problem but is not cheap.
Fairly easy to get “at least once” (try the request again if the timer expires) or “at most once” (give up if the timer expires) semantics.
Hard to get “exactly once” without transactions.
To be more precise: a transaction either happens exactly once or not at all (sounds like at most once), and the client knows which.

RPC Failures
Client crashes
Orphan computations exist.
Again, transactions work but are expensive.
We can have the rebooted client start another epoch; all computations of the previous epoch are killed and clients resubmit.
It is better to let old computations with owners that can be found continue.
This isn’t a great solution.

RPC Failures
An orphan may hold locks or might have done something not easily undone.
Serious programming is needed.

Basic Reliable-Multicasting Schemes
A simple solution to reliable multicasting when all receivers are known and are assumed not to fail: message transmission and reporting feedback.

Nonhierarchical Feedback Control
Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.

Hierarchical Feedback Control
The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children. A local coordinator handles retransmission requests.

Virtual Synchrony The logical organization of a distributed system to distinguish between message receipt and message delivery

Virtual Synchrony The principle of virtual synchronous multicast.

Message Ordering

Process P1    Process P2     Process P3
sends m1      receives m1    receives m2
sends m2      receives m2    receives m1

Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

Message Ordering

Process P1    Process P2     Process P3     Process P4
sends m1      receives m1    receives m3    sends m3
sends m2      receives m3    receives m1    sends m4
              receives m2    receives m2
              receives m4    receives m4

Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.
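
FIFO-ordered delivery is commonly implemented with per-sender sequence numbers and a hold-back queue, which also illustrates the difference between receiving a message and delivering it. The sketch below assumes that scheme: the class `FifoReceiver` and its fields are hypothetical names, and sequence numbers are assumed to start at 0 for each sender.

```python
class FifoReceiver:
    """Delivers each sender's messages in the order they were sent."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # sender -> {sequence number: message} received early
        self.delivered = []  # messages handed to the application, in order

    def receive(self, sender, seq, message):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # duplicate of a message that was already delivered
        queue = self.held.setdefault(sender, {})
        queue[seq] = message
        # Deliver every message that is now in order for this sender.
        while expected in queue:
            self.delivered.append(queue.pop(expected))
            expected += 1
        self.next_seq[sender] = expected
```

If m2 from P1 arrives before m1, it is received but held back; messages from a different sender such as P4 are delivered independently, so no order is imposed between senders.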

Implementing Virtual Synchrony
Multicast                  Basic Message Ordering     Total-ordered Delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes

Six different versions of virtually synchronous reliable multicasting.

Implementing Virtual Synchrony
Process 4 notices that process 7 has crashed and sends a view change.
Process 6 sends out all its unstable messages, followed by a flush message.
Process 6 installs the new view when it has received a flush message from everyone else.

Transaction, Recovery and Concurrency Control
Transactions are the units to be considered in both recovery and concurrency control.
Recovery techniques are needed to ensure that transactions complete successfully despite system failures.
Concurrency control is needed to prevent concurrent transactions from interfering with each other.
Locking and timestamping are the major techniques for concurrency control.

Properties of Transactions
Atomic - a transaction is an atomic unit of processing; it is either performed in its entirety or not performed at all.
Consistent - a transaction must take the database from one consistent state to another.
Isolated - changes made by a transaction should not be seen by other transactions until the initial transaction is committed.
Durable - changes that are made and committed by a transaction must never be lost.

Atomicity
Either an operation completes fully or the operation does not happen at all.
Low-level atomic operations are built into hardware.
High-level atomic operations are called transactions.
The transaction manager ensures that all transactions either complete (“committed transaction”) or have no effect on the database (“aborted transaction”).

System Model
[Figure: data items A and B on disk and in memory; transactions T1 and T2 each have a local buffer and access the items with read(A) and write(B).]

Assumptions
Each data item can be read and written only once by one single transaction. It can be modified many times in the local buffer.
Values are written onto the global buffer space in the same sequence that the write orders are issued.
If the transaction reads a value from the database, and this value was previously modified, it always gets the latest value.
A data item can be a file, relation, record, physical page, etc., determined by the designer.

Transaction Example
Transfer $50 from account A to account B.
Transaction structure:
  1. read(A)
  2. A := A - 50
  3. write(A)
  4. read(B)
  5. B := B + 50
  6. write(B)
If initial values are A = 180 and B = 100, then after the execution A = 130 and B = 150.
If the system crashes between steps 3 and 6, then the database is in an inconsistent state.
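
The inconsistency after a crash can be reproduced with a small sketch. The function `transfer`, the dictionary standing in for the database, and the `crash_after_step` parameter are illustrative assumptions; the six steps follow the transaction structure for the transfer from A to B.

```python
def transfer(db, amount=50, crash_after_step=None):
    """Run the six-step transfer on dict db, optionally 'crashing' after a step."""

    def step_done(n):
        if crash_after_step == n:
            raise RuntimeError(f"simulated crash after step {n}")

    a = db["A"]          # 1. read(A)
    step_done(1)
    a = a - amount       # 2. A := A - 50
    step_done(2)
    db["A"] = a          # 3. write(A)
    step_done(3)
    b = db["B"]          # 4. read(B)
    step_done(4)
    b = b + amount       # 5. B := B + 50
    step_done(5)
    db["B"] = b          # 6. write(B)
    step_done(6)
```

Starting from A = 180 and B = 100, a complete run leaves A = 130 and B = 150. A crash after step 3 leaves A = 130 and B = 100, so $50 has disappeared from the total.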

Transaction Model
A transaction must see a consistent database.
During transaction execution the database may be inconsistent.
When the transaction is committed (completed successfully), the database must be consistent.
A transaction which does not complete successfully is termed “aborted”.
An aborted transaction must be “rolled back”, meaning the database is returned to its state prior to the transaction.

Problems in Ensuring Atomicity
Transaction failure: the transaction cannot complete due to user abort or an internal error condition.
System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock).
System crash: a power failure or other hardware failure causes the system to crash.
Disk failure: a head crash or similar failure destroys all or part of disk storage.

Management of Failure
Volatile storage
  does not survive system crashes
  examples: main memory, cache memory
Nonvolatile storage
  survives system crashes
  examples: disk, tape
Stable storage
  a mythical form of storage that survives all failures
  approximated by maintaining multiple copies on distinct nonvolatile media

Write-ahead Log
A log file is kept on stable storage.
Each transaction Ti, when it starts, registers itself on the log by writing <Ti, starts>.
Whenever Ti executes write(X), the record <Ti, X, old-value, new-value> is written sequentially on the log, and then the write(X) is executed.
When Ti reaches its last statement, the record <Ti, commits> is added to the log.

Write-ahead Log
If X is modified, then its corresponding log record is always first written on the log and then written on the database.
Before Ti is committed, all its corresponding log records must be in stable storage.
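
The write-ahead rule can be sketched as a tiny in-memory model. The class `WriteAheadLog` and its method names are assumptions, a Python list stands in for the log on stable storage, and a dictionary stands in for the database; a real system would force each record to disk before continuing.

```python
class WriteAheadLog:
    """Toy write-ahead log: every update is logged before it is applied."""

    def __init__(self, db):
        self.db = db    # stands in for the database on disk
        self.log = []   # stands in for the log file on stable storage

    def start(self, txn):
        self.log.append((txn, "starts"))

    def write(self, txn, item, new_value):
        # Log record first: (transaction, item, old value, new value).
        self.log.append((txn, item, self.db[item], new_value))
        # Only then perform the write itself.
        self.db[item] = new_value

    def commit(self, txn):
        self.log.append((txn, "commits"))
```

Because each record carries both the old and the new value, the same log supports undoing an uncommitted transaction and redoing a committed one.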

Example Transactions

T1:                  T2:
read(A)              read(A)
A := A + 50          A := A + 10
read(B)              write(A)
B := B + 100         read(D)
write(B)             D := D - 10
read(C)              read(E)
C := 2C              read(B)
write(C)             E := E + B
A := A + B + C       write(E)
write(A)             D := D + E
                     write(D)

Initial values: A = 100, B = 300, C = 5, D = 60, E = 80

Log Records
   1. <T1, starts>
   2. <T1, B, 300, 400>
   3. <T1, C, 5, 10>
   4. <T1, A, 100, 560>
   5. <T1, commits>
   6. <T2, starts>
   7. <T2, A, 560, 570>
   8. <T2, E, 80, 480>
   9. <T2, D, 60, 530>
  10. <T2, commits>

Database Values
    I. B = 400
   II. C = 10
  III. A = 560
   IV. A = 570
    V. E = 480
   VI. D = 530

Possible Order of Writes
Order in time (Arabic numerals are log records, Roman numerals are database writes): 1, 2, 3, 4, 5, I, II, III, 6, 7, 8, IV, 9, 10, V, VI

Consequences of a Crash
After a crash, the log is examined.
Various actions are taken, depending on the last instruction written on the log.
Example:

Last instruction (i)    Action
i = 0                   nothing
1 <= i <= 4             undo(T1)
5 <= i <= 9             redo(T1), undo(T2)
i >= 10                 redo(T1), redo(T2)

Algorithm
Redo all transactions for which the log has both start and commit operations.
Undo all transactions for which the log has a start operation but no commit operation.
Remarks:
  In a multitasking system, more than one transaction may need to be undone.
  If a system crashes during the recovery stage, the new recovery must still give correct results.
  In this algorithm, a large number of transactions must be redone, since we don’t know how far behind the database updates are.
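
The redo/undo rule can be sketched over a log held as a list of tuples. The function `recover` and the tuple layout are assumptions for illustration: `(txn, "starts")`, `(txn, item, old_value, new_value)`, and `(txn, "commits")`. Undo restores old values scanning backwards, redo reapplies new values scanning forwards, and both only assign fixed values, so running recovery again after a crash during recovery gives the same result.

```python
def recover(log, db):
    """Undo uncommitted transactions, then redo committed ones."""
    committed = {rec[0] for rec in log if rec[1:] == ("commits",)}

    # Undo: walk the log backwards, restoring old values of uncommitted writes.
    for rec in reversed(log):
        if len(rec) == 4 and rec[0] not in committed:
            _, item, old_value, _ = rec
            db[item] = old_value

    # Redo: walk the log forwards, reapplying new values of committed writes.
    for rec in log:
        if len(rec) == 4 and rec[0] in committed:
            _, item, _, new_value = rec
            db[item] = new_value
    return db
```

For a crash in which the last log record written is T2's update of E, T1 is redone and T2 is undone, so A returns to 560 while B and C keep T1's values.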

Incremental Log with Deferred Updates
Each transaction Ti, when it starts, registers itself on the log by writing <Ti, starts>.
Whenever Ti executes write(X), the record <Ti, X, new-value> is written on the log.
When Ti reaches its last statement, the record <Ti, commits> is added to the log.
Before Ti is committed, all its corresponding log records must be in stable storage.
Use the log records to perform the actual updates to the database after the commit.
When the system crashes, we only need to redo transactions for which the log has both start and commit operations.

Checkpointing
During execution, in addition to the activities of the previous method, periodically perform checkpointing:
  force log buffers on log
  force database buffers on database
  force “checkpoint record” on log
During recovery:
  undo all transactions that have not been committed
  redo all transactions that have been committed after a checkpoint

Checkpoint Example
[Figure: timeline of transactions T1, T2, T3 and T4 against a checkpoint and a later system failure. T1 is okay, T2 and T3 are redone, T4 is undone.]
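
The classification in the checkpoint example can be expressed as a short function. The name `classify`, the representation of each transaction as a `(start, commit)` pair, and the numeric times in the usage note are hypothetical; the slide gives only the outcome for T1 to T4, not actual times.

```python
def classify(transactions, checkpoint, failure):
    """Map each transaction name to 'okay', 'redo' or 'undo'.

    transactions: dict name -> (start_time, commit_time or None).
    """
    actions = {}
    for name, (start, commit) in transactions.items():
        if commit is None or commit > failure:
            actions[name] = "undo"   # never committed before the failure
        elif commit <= checkpoint:
            actions[name] = "okay"   # its updates were forced at the checkpoint
        else:
            actions[name] = "redo"   # committed after the checkpoint
    return actions
```

With a checkpoint at time 3 and a failure at time 7, `classify({"T1": (0, 2), "T2": (1, 5), "T3": (4, 6), "T4": (5, None)}, 3, 7)` leaves T1 alone, redoes T2 and T3, and undoes T4.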

Other Types of Failure
Transaction failure
  undo transaction
  message to user - transaction bad
Disk failure
  restore the database from backup copy, rerun transactions

Concurrency Control
Concurrency control is needed to handle problems that can occur when concurrent transactions execute:
  Lost Update: an update to a data item by some transaction is overwritten by another interleaved transaction without knowledge of the initial update.
  Temporary Update (Dirty Read): a transaction reads a data item updated by another transaction that later fails.
  Incorrect Summary: a transaction calculating an aggregate function uses some but not all updated data items of another transaction.

Lost Update Example

T1                T2
read(X)
X := X - N
                  read(X)
                  X := X + M
write(X)
read(Y)
                  write(X)
Y := Y + N
write(Y)
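
A lost update can be replayed step by step. The sketch assumes one particular interleaving, stated in the comments: T2 reads X before T1 writes it, and T2 writes X after T1 does. The function name `lost_update_schedule` and the sample values are illustrative.

```python
def lost_update_schedule(x, y, n, m):
    """Replay an interleaving of T1 and T2 in which T1's update of X is lost."""
    t1_x = x           # T1: read(X)
    t1_x = t1_x - n    # T1: X := X - N
    t2_x = x           # T2: read(X), still sees the old value of X
    t2_x = t2_x + m    # T2: X := X + M
    x = t1_x           # T1: write(X)
    t1_y = y           # T1: read(Y)
    x = t2_x           # T2: write(X), overwrites T1's update
    t1_y = t1_y + n    # T1: Y := Y + N
    y = t1_y           # T1: write(Y)
    return x, y
```

With X = 80, Y = 20, N = 5 and M = 4, any serial execution ends with X = 79 and Y = 25, but this interleaving ends with X = 84 because the subtraction of N from X has been overwritten.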

Temporary Update Example
T1                T2
read(X)
X := X - N
write(X)
                  read(X)
                  X := X + M
                  write(X)
read(Y)
(T1 fails here and must be rolled back)

Incorrect Summary Example
T1                T2
read(X)
X := X - N
write(X)
                  read(X)
                  sum := sum + X
                  read(Y)
                  sum := sum + Y
read(Y)
Y := Y + N
write(Y)

Locking
All of the problems of concurrent users can be solved by a concurrency control technique called locking.
When a transaction needs to be sure that some object will not have its value changed, it acquires a lock on that object.
Locks may be either exclusive locks (X locks) or shared locks (S locks).

Deadlock
The use of locks solves the concurrency problems seen earlier, but it may lead to the problem of deadlock. This can be resolved in one of two ways (for centralized systems):
  The system can maintain a Wait-For Graph (the graph of who is waiting for whom). When there is a cycle in this graph, a state of deadlock exists and one of the transactions in the cycle must be rolled back.
  A simpler solution is to use a timeout mechanism to roll back transactions which have been inactive for a certain period of time.
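
Detecting deadlock in a Wait-For Graph amounts to finding a cycle in a directed graph. The sketch below uses a depth-first search; the function name `find_deadlock` and the dictionary representation (transaction mapped to the set of transactions it waits for) are assumptions.

```python
def find_deadlock(wait_for):
    """Return a list of transactions forming a cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}

    def visit(txn, path):
        color[txn] = GREY
        path.append(txn)
        for other in wait_for.get(txn, ()):
            state = color.get(other, WHITE)
            if state == GREY:
                # 'other' is already on the current path: a cycle exists.
                return path[path.index(other):]
            if state == WHITE:
                cycle = visit(other, path)
                if cycle:
                    return cycle
        path.pop()
        color[txn] = BLACK
        return None

    for txn in list(wait_for):
        if color.get(txn, WHITE) == WHITE:
            cycle = visit(txn, [])
            if cycle:
                return cycle
    return None
```

Any transaction in the returned cycle is a candidate victim to roll back; choosing which one is a policy decision that this sketch leaves open.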

Concurrency Problem Example
Database consisting of two items, X and Y.
Only criterion of correctness: X = Y.
Consider the following two transactions T1 and T2:

T1: X := X + 1       T2: X := 2X
    Y := Y + 1           Y := 2Y

Each transaction when executed alone maintains correctness (i.e., X = Y before and after its execution).

Serializability Serializability is the criterion for correctness for concurrency control. A given interleaved schedule of transactions is considered correct if it is serializable - i.e. produces the same results as some serial schedule

Terminology
Given a set of transactions, any execution of those transactions is called a schedule.
Executing the transactions without interleaving constitutes a serial schedule.
A schedule that is not serial is an interleaved schedule.
Two schedules are said to be equivalent if they are guaranteed to produce the same result.
A schedule is serializable if it is equivalent to some serial schedule.
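
These definitions can be tried out on the two-item example, taking T1 as X := X + 1; Y := Y + 1 and T2 as X := 2X; Y := 2Y. The sketch is only a spot check for one initial state, not a proof of equivalence, which requires the same result for every initial state. The names `OPS`, `run` and `matches_a_serial_schedule` are illustrative.

```python
# Each operation maps the pair (x, y) to a new pair.
OPS = {
    "T1.a": lambda x, y: (x + 1, y),   # T1: X := X + 1
    "T1.b": lambda x, y: (x, y + 1),   # T1: Y := Y + 1
    "T2.a": lambda x, y: (2 * x, y),   # T2: X := 2X
    "T2.b": lambda x, y: (x, 2 * y),   # T2: Y := 2Y
}
T1 = ["T1.a", "T1.b"]
T2 = ["T2.a", "T2.b"]

def run(schedule, x, y):
    """Execute the operations of a schedule in order and return (x, y)."""
    for op in schedule:
        x, y = OPS[op](x, y)
    return x, y

def matches_a_serial_schedule(schedule, x, y):
    """True if the schedule gives the same result as T1;T2 or T2;T1 from (x, y)."""
    serial_results = {run(T1 + T2, x, y), run(T2 + T1, x, y)}
    return run(schedule, x, y) in serial_results
```

Starting from X = Y = 1, the interleaving T1.a, T2.a, T1.b, T2.b gives (4, 4), the same as the serial schedule T1 then T2. The interleaving T1.a, T2.a, T2.b, T1.b gives (4, 3), which matches no serial schedule and violates X = Y.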

Two-Phase Locking Theorem
The 2PL Theorem is an important result which states the following: if all transactions obey the “two-phase locking protocol”, then all possible interleaved schedules are serializable.
The two-phase locking protocol says:
  Before operating on any object, a transaction must acquire a lock on that object.
  After releasing a lock, a transaction must never go on to acquire any more locks.
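
The two rules of the protocol can be checked mechanically for a single transaction. In this sketch the function name `obeys_two_phase_locking` and the encoding of a transaction as a list of `(action, object)` pairs are assumptions, and shared and exclusive locks are not distinguished.

```python
def obeys_two_phase_locking(ops):
    """Check one transaction, given as a list of (action, object) pairs.

    action is "lock", "unlock", "read" or "write".
    """
    held = set()
    shrinking = False   # becomes True once the first lock is released
    for action, obj in ops:
        if action == "lock":
            if shrinking:
                return False   # acquired a lock after releasing one
            held.add(obj)
        elif action == "unlock":
            held.discard(obj)
            shrinking = True
        elif obj not in held:
            return False       # operated on an object without holding its lock
    return True
```

A transaction that unlocks X and then locks Y fails the check, as does one that reads X without ever locking it.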

Two-Phase Commit
When several databases take part in a single transaction, a protocol called Two-Phase Commit is used.
Each database is assumed to have its own local “resource manager”.
A single system component called the Coordinator controls the whole process.

Two-Phase Commit The protocol consists of two phases
In the first phase, each participant in the transaction force-writes all log entries for local resources to the log. If the force-write is successful, the resource manager responds “OK”, otherwise it replies “Not OK” When the coordinator receives replies from all participants, it force-writes an entry to its own physical log - either “commit” or “rollback”. The Coordinator then informs each participant of its decision and the local resource managers must obey the decision
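
The two phases can be sketched without any networking or failure handling. The function `two_phase_commit` and the representation of each participant as a callable that reports whether its force-write succeeded are assumptions; timeouts, crashes and recovery are deliberately left out.

```python
def two_phase_commit(participants):
    """Run both phases and return (decision, coordinator_log, outcome).

    participants: dict name -> callable returning True if the participant's
    force-write of its log entries succeeded.
    """
    # Phase 1: every resource manager force-writes its log and replies.
    replies = {
        name: "OK" if force_write() else "Not OK"
        for name, force_write in participants.items()
    }

    # Phase 2: the coordinator logs its decision before informing anyone.
    decision = "commit" if all(r == "OK" for r in replies.values()) else "rollback"
    coordinator_log = [decision]

    # Every participant must obey the coordinator's decision.
    outcome = {name: decision for name in participants}
    return decision, coordinator_log, outcome
```

A single “Not OK” reply is enough to roll back the transaction at every participant.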

Two-Phase Commit The finite state machine for the coordinator in 2PC.
The finite state machine for a participant.

Two-Phase Commit

State of Q    Action by P
COMMIT        Make transition to COMMIT
ABORT         Make transition to ABORT
INIT          Make transition to ABORT
READY         Contact another participant

Actions taken by a participant P when residing in state READY and having contacted another participant Q.

Two-Phase Commit
actions by coordinator:
    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }
Outline of the steps taken by the coordinator in a two-phase commit protocol.

Two-Phase Commit
actions by participant:
    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received;  /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT
            write GLOBAL_COMMIT to local log;
        else if DECISION == GLOBAL_ABORT
            write GLOBAL_ABORT to local log;
    } else {
        write VOTE_ABORT to local log;
        send VOTE_ABORT to coordinator;
    }
Steps taken by participant process in 2PC.

Two-Phase Commit
actions for handling decision requests:  /* executed by separate thread */
    while true {
        wait until any incoming DECISION_REQUEST is received;  /* remain blocked */
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip;  /* participant remains blocked */
    }
Steps taken for handling incoming decision requests.

Three-Phase Commit Finite state machine for the coordinator in 3PC
Finite state machine for a participant
