Database Recovery.


System crash: example
Transaction Ti transfers $50 from account A to account B.
Initial values: A = $1000, B = $2000. Final values: A = $950, B = $2050.

  begin_transaction
    read(A)
    A := A - 50
    write(A)
    read(B)
    B := B + 50
    write(B)
  commit

Suppose the system crashes after the new value of A has reached the disk but before the new value of B has. Possible recovery procedures:
- Reexecute Ti: results in A = $900, B = $2050 (inconsistent state).
- Do not reexecute Ti: results in A = $950, B = $2000 (inconsistent state).

DBMS Transaction Subsystem
- The transaction manager coordinates transactions on behalf of application programs. It communicates with the scheduler.
- The scheduler handles concurrency control. Its objective is to maximize concurrency without allowing concurrently executing transactions to interfere with one another.
- The recovery manager ensures that, after a failure, the database is restored to the consistent state it was in before the failure occurred.
- The buffer manager is responsible for the transfer of data between disk storage and main memory.
[Figure: components of the transaction subsystem (labels: Transaction manager, Scheduler, Recovery, Buffer, Systems, Access, File)]

Outline
- Failure Classification
- Storage Structure and Data Access
- Recovery and Atomicity
- Log-Based Recovery
- Recovery With Concurrent Transactions
- Shadow Paging


Failure Classification: types of failure
- Transaction failure
- System crash
- Disk failure
- Physical problems and catastrophes

Failure Classification: transaction failure
Types of errors that may cause a transaction to fail:
- Logical errors: the transaction cannot complete due to some internal error condition (e.g., division by zero).
- System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock).

Failure Classification: system crash
A power failure or other hardware or software failure causes the system to crash.
- The content of volatile storage is lost.
- Transaction processing is halted.
- Nonvolatile storage remains intact.
Fail-stop assumption: hardware errors and software bugs bring system processing to a halt, and the contents of non-volatile storage are assumed not to be corrupted by a system crash. Database systems have numerous integrity checks to prevent corruption of disk data.

Failure Classification: disk failure
A head crash or a failure during a data transfer operation destroys all or part of disk storage. Destruction is assumed to be detectable: disk drives use checksums to detect failures.

Failure Classification: physical problems and catastrophes
An open-ended list of problems that includes power or air-conditioning failure, fire, theft, sabotage, overwriting disks or tapes by mistake, mounting of the wrong tape by the operator, etc.

Storage Structure
- Volatile storage: does not survive system crashes. Examples: main memory, cache memory.
- Nonvolatile storage: survives system crashes. Examples: disk, tape, non-volatile (battery backed up) RAM.
- Stable storage: a mythical form of storage that survives all failures. It is closely approximated by maintaining multiple copies on distinct non-volatile media.

Data Access
Physical blocks are those blocks residing on the disk. Buffer blocks are the blocks residing temporarily in main memory.
Block movements between disk and main memory are initiated through the following two operations:
- input(B) transfers the physical block B to main memory.
- output(B) transfers the buffer block B to the disk, and replaces the appropriate physical block there.

Data Access
Each transaction Ti has a private work area in which local copies of all data items accessed and updated by it are kept. Ti's local copy of a data item X is called xi.
We assume, for simplicity, that each data item fits in, and is stored inside, a single block.
[Figure: main memory holding the buffer (blocks A and B) and the work areas of T1 (x1, y1) and T2 (x2); input(A) and output(B) move blocks between disk and buffer, read-item(X) and write-item(X) move items between buffer and work area]

A transaction transfers data items between system buffer blocks and its private work area using the following operations:
- read(X) assigns the value of data item X to the local variable xi.
- write(X) assigns the value of local variable xi to data item X in the buffer block.
Both commands may require an input(BX) instruction to be issued before the assignment, if the block BX in which X resides is not already in memory.

Transactions:
- perform read(X) when accessing X for the first time; all subsequent accesses are to the local copy;
- execute write(X) after the last access.
output(BX) need not immediately follow write(X). The system can perform the output operation when it deems fit.
If the system crashes after the write(X) operation was executed but before output(BX) was executed, the new value of X is never written to disk and, thus, is lost.

Example
Given: T1 wants to read item X and update item Y. Item X is in block A, while item Y is in block B. Block B is already in memory.
Operations shown in the figure: input(A), read(X), read(Y), write(Y), output(B).
[Figure: main memory with buffer blocks A (holding X) and B (holding Y) and the work area of T1 (x1, y1)]
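The data-access operations above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular DBMS implements its buffer manager: the dictionaries `disk`, `buffer` and `block_of`, the item values 10 and 20, and the update `Y := Y + X` are all assumptions made up for the example.

```python
# Hypothetical in-memory model of input/output/read/write.
disk = {"A": {"X": 10}, "B": {"Y": 20}}   # physical blocks on disk
buffer = {}                                # buffer blocks in main memory
block_of = {"X": "A", "Y": "B"}            # block in which each item resides


def input_block(b):
    """input(B): copy physical block B from disk into the buffer."""
    buffer[b] = dict(disk[b])


def output_block(b):
    """output(B): write buffer block B to disk, replacing the physical block."""
    disk[b] = dict(buffer[b])


def read(work_area, x):
    """read(X): copy item X from its buffer block into the work area."""
    b = block_of[x]
    if b not in buffer:            # block not in memory yet: issue input
        input_block(b)
    work_area[x] = buffer[b][x]


def write(work_area, x):
    """write(X): copy the local value of X into its buffer block."""
    b = block_of[x]
    if b not in buffer:
        input_block(b)
    buffer[b][x] = work_area[x]


# T1 reads X and updates Y; block B is already in memory, block A is not.
input_block("B")
t1 = {}                            # private work area of T1
read(t1, "X")                      # triggers input(A)
read(t1, "Y")
t1["Y"] += t1["X"]                 # assumed update, for illustration only
write(t1, "Y")                     # changes the buffer copy only
disk_before_output = disk["B"]["Y"]
output_block("B")                  # the new value reaches the disk here
```

Between `write` and `output_block` the disk still holds the old value of Y, which is exactly the window in which a crash loses the update.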

Recovery Algorithms
Recovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failures.
Recovery algorithms have two parts:
- Actions taken during normal transaction processing, to ensure that enough information exists to recover from failures.
- Actions taken after a failure, to recover the database contents to a state that ensures atomicity, consistency and durability.

System crash
Given: Ti transfers $50 from account A to account B. Initial values: A = $1000; B = $2000.

  begin_transaction
    read(A)
    A := A - 50
    write(A)
    read(B)
    B := B + 50
    write(B)
  commit

In case of a failure:
- The system must have enough information to be able to keep track of the values of A and B.
- The system must be able to recover, that is, perform actions after the failure such that the database is consistent.

Recovery and Atomicity
Modifying the database without ensuring that the transaction will commit may leave the database in an inconsistent state.
Several output operations may be required for a transaction Ti (to output A and B). A failure may occur after one of these modifications has been made but before all of them are made.

Recovery and Atomicity
Ti transfers $50 from account A to account B. Initial values: A = $1000, B = $2000. Final values: A = $950, B = $2050.
Suppose a system crash occurs during the execution of Ti, after output(BA) has taken place but before output(BB) is executed. Possible recovery procedures:
- Reexecute Ti: results in A = $900, B = $2050 (inconsistent state).
- Do not reexecute Ti: results in A = $950, B = $2000 (inconsistent state).

To ensure atomicity despite failures: first output information describing the modifications to stable storage without modifying the database itself. Assume (initially) that transactions run serially, that is, one after the other.

Two Recovery Approaches
- Log-based recovery
- Shadow paging


Log-Based Recovery
A log is kept on stable storage. It is a sequence of log records, and maintains a record of update activities on the database.
- When transaction Ti starts, it registers itself by writing a <Ti start> log record.
- Before Ti executes write(Xj), a log record <Ti, Xj, V1, V2> is written, where V1 is the value of Xj before the write, and V2 is the value to be written to Xj.
The log record notes that Ti has performed a write on data item Xj: Xj had value V1 before the write, and will have value V2 after the write.

Log-Based Recovery
- When Ti finishes its last statement, the log record <Ti commit> is written.
- When Ti is aborted, the log record <Ti abort> is written.
We assume for now that log records are written directly to stable storage (that is, they are not buffered).
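As a rough illustration of the four record types, the log can be modeled as an append-only list of small record objects. The class names `Start`, `Update`, `Commit` and `Abort`, the helper `logged_write`, and the use of a Python list to stand in for stable storage are assumptions of this sketch, not part of the scheme itself.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Start:            # <Ti start>
    txn: str


@dataclass(frozen=True)
class Update:           # <Ti, Xj, V1, V2>
    txn: str
    item: str
    old: Any            # V1: value before the write
    new: Any            # V2: value to be written


@dataclass(frozen=True)
class Commit:           # <Ti commit>
    txn: str


@dataclass(frozen=True)
class Abort:            # <Ti abort>
    txn: str


log = []                # stands in for stable storage; append-only


def logged_write(txn, db, item, new_value):
    """Append the update record first, then perform the write."""
    log.append(Update(txn, item, db[item], new_value))
    db[item] = new_value


db = {"A": 1000, "B": 2000}
log.append(Start("T0"))
logged_write("T0", db, "A", db["A"] - 50)
logged_write("T0", db, "B", db["B"] + 50)
log.append(Commit("T0"))
```

Here `db` plays the role of the data items being written; whether the write goes to the database immediately or is deferred depends on the scheme in use.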

Log-Based Recovery
Two approaches using logs:
- Deferred database modification
- Immediate database modification


Deferred Database Modification: what?
The deferred database modification scheme:
- records all modifications to the log;
- defers the execution of all the write operations of a transaction until the transaction partially commits.
A transaction has partially committed once the final action of the transaction has been executed.
Assume that transactions execute serially.

T1: withdrawal transaction
Transaction states: active, partially committed, committed.

  Time   T1                    balx
  t1     begin_transaction     100
  t2     read(balx)
  t3     balx = balx - 10
  t4     write(balx)           90
  t5     end_transaction (commit)

When the transaction finishes its final statement, it enters the partially committed state. The actual output may still be residing in memory; thus, a hardware failure may preclude its successful completion.
When the last of the information is written out to the disk, the transaction enters the committed state. In the event of a failure, the updates of the transaction can then be re-created.

Deferred Database Modification: what

  begin_transaction
    read(A)
    A := A - 50
    write(A)        (deferred)
    read(B)
    B := B + 50
    write(B)        (deferred)
  commit            (partially commits; the database is then updated)

Deferred Database Modification: how
- The transaction starts by writing a <Ti start> record to the log.
- A write(X) operation results in a log record <Ti, X, V> being written, where V is the new value for X. Note: the old value is not needed for this scheme.
- The write is not performed on X at this time, but is deferred.
- When Ti partially commits, <Ti commit> is written to the log. The log records are then read and used to actually execute the previously deferred writes.
- Once the updating of the database is completed, the transaction enters the committed state.

Deferred Database Modification: illustration

  Transaction Ti          Log
  begin_transaction       <Ti start>
    read-item(A)
    A := A - 50
    write-item(A)         <Ti, A, 950>
    read(B)
    B := B + 50
    write-item(B)         <Ti, B, 2050>
  commit                  <Ti commit>      (partially committed)

Before: A = 1000, B = 2000. After the database is modified using the log records: A = 950, B = 2050.

Deferred Database Modification: example
T0 and T1, where T0 executes before T1. Initial values: A = $1000, B = $2000, C = $700.

  T0: begin                    T1: begin
        read-item(A)                 read-item(C)
        A := A - 50                  C := C - 100
        write-item(A)                write-item(C)
        read-item(B)               commit
        B := B + 50
        write-item(B)
      commit

Each transaction starts by writing <Ti start> to the log, each write(X) results in a log record <Ti, X, V> where V is the new value for X, and <Ti commit> is written when Ti partially commits:

  <T0 start>
  <T0, A, 950>
  <T0, B, 2050>
  <T0 commit>
  <T1 start>
  <T1, C, 600>
  <T1 commit>

What happens if T0 and T1 successfully commit?
Log after T0: <T0 start>, <T0, A, 950>, <T0, B, 2050>, <T0 commit>.
When T0 partially commits, the log records are read and used to actually execute the previously deferred writes of T0. Once the updating of the database is completed by T0, T0 enters the committed state.
Database: A = 950, B = 2050.

What happens if T0 and T1 successfully commit? (continued)
Log after T1: <T1 start>, <T1, C, 600>, <T1 commit>.
When T1 partially commits, the log records are read and used to actually execute the previously deferred writes of T1. Once the updating of the database is completed by T1, T1 enters the committed state.
Database: C = 600.
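The normal-processing side of deferred modification can be sketched as follows. This is a simplified model under stated assumptions: transactions run serially, log records are plain tuples, and the helper `run_deferred` with its `(item, function)` description of writes is invented for this example.

```python
def run_deferred(txn, writes, db, log):
    """Run one transaction under deferred database modification.

    `writes` is a list of (item, function) pairs: an assumed, simplified
    way to describe "read X, compute a new value, write X".
    """
    log.append((txn, "start"))                 # <Ti start>
    local = {}                                 # private work area
    for item, fn in writes:
        value = fn(local.get(item, db[item]))
        local[item] = value
        log.append((txn, item, value))         # <Ti, X, V>: new value only
        # The database is NOT modified here: the write is deferred.
    log.append((txn, "commit"))                # Ti partially commits
    for rec in log:                            # execute the deferred writes
        if len(rec) == 3 and rec[0] == txn:
            db[rec[1]] = rec[2]


db = {"A": 1000, "B": 2000, "C": 700}
log = []
run_deferred("T0", [("A", lambda a: a - 50), ("B", lambda b: b + 50)], db, log)
run_deferred("T1", [("C", lambda c: c - 100)], db, log)
```

Because the database is touched only after the commit record is in the log, a transaction that never reaches its commit record leaves no trace in the database, which is why this scheme needs no undo.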

When can crashes occur?
- While the transaction is executing the original updates, or
- while recovery action is being taken.

How is recovery done?
During recovery after a crash, a transaction needs to be redone if and only if both <Ti start> and <Ti commit> are in the log.
Redoing a transaction Ti (redo(Ti)) sets the values of all data items updated by the transaction to the new values.
The redo operation must be idempotent; that is, executing it several times must be equivalent to executing it once. This characteristic guarantees correct behavior even if a failure occurs during recovery.


Case 1: the crash occurs just after the log record for the step write(B) of T0 has been written to stable storage.

  Log at the time of the crash:
  <T0 start>
  <T0, A, 950>
  <T0, B, 2050>

During recovery after a crash, a transaction needs to be redone if and only if both <Ti start> and <Ti commit> are in the log. Since no commit record <T0 commit> appears in the log, no redo action is taken.
Values after recovery: A = 1000; B = 2000; C = 700.
It is assumed that the transaction was never executed. Log records of incomplete transactions can be deleted from the log.


Case 2: the crash occurs just after the log record for the step write(C) of T1 has been written to stable storage.

  Log at the time of the crash:
  <T0 start>
  <T0, A, 950>
  <T0, B, 2050>
  <T0 commit>
  <T1 start>
  <T1, C, 600>

- redo(T0) is performed, since both <T0 start> and <T0 commit> exist in the log.
- No redo is performed for T1, since no commit record <T1 commit> appears in the log.
After redo(T0): A = 950; B = 2050; C = 700.


Case 3: the crash occurs just after the log record <T1 commit> has been written to stable storage.

  Log at the time of the crash:
  <T0 start>
  <T0, A, 950>
  <T0, B, 2050>
  <T0 commit>
  <T1 start>
  <T1, C, 600>
  <T1 commit>

Two redo operations, redo(T0) and redo(T1), are performed.
Final values after redo(T0) and redo(T1): A = 950; B = 2050; C = 600.

Case 4: suppose a second system crash occurs during recovery from the first crash.

  Log:
  <T0 start>
  <T0, A, 950>
  <T0, B, 2050>
  <T0 commit>
  <T1 start>
  <T1, C, 600>
  <T1 commit>

For each commit record <Ti commit> in the log, redo(Ti) is performed again. Since redo writes values independent of the values currently in the database, the result of a successful attempt at redo is the same as though redo had succeeded the first time.
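The four cases can be replayed with a small sketch of recovery under deferred modification. The tuple encoding of log records and the function name `recover_deferred` are assumptions of this example; the rule it implements (redo exactly those transactions with both a start and a commit record) is the one stated above.

```python
def recover_deferred(log, db):
    """Redo every transaction that has both <Ti start> and <Ti commit>."""
    started = {r[0] for r in log if r[1:] == ("start",)}
    committed = {r[0] for r in log if r[1:] == ("commit",)}
    to_redo = started & committed
    for rec in log:                          # forward scan of the log
        if len(rec) == 3 and rec[0] in to_redo:
            _txn, item, new_value = rec
            db[item] = new_value             # absolute value, so idempotent
    return db


full_log = [
    ("T0", "start"), ("T0", "A", 950), ("T0", "B", 2050), ("T0", "commit"),
    ("T1", "start"), ("T1", "C", 600), ("T1", "commit"),
]
initial = {"A": 1000, "B": 2000, "C": 700}

case1 = recover_deferred(full_log[:3], dict(initial))   # crash before <T0 commit>
case2 = recover_deferred(full_log[:6], dict(initial))   # crash before <T1 commit>
case3 = recover_deferred(full_log, dict(initial))       # crash after <T1 commit>
case4 = recover_deferred(full_log, dict(case3))         # recovery run a second time
```

`case4` equals `case3` because each redo assigns the logged value rather than recomputing it (for example, it sets A to 950 instead of subtracting 50 again).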


Immediate Database Modification
The immediate database modification scheme allows database updates of an uncommitted transaction to be made as the writes are issued. Since undoing may be needed, update log records must contain both the old value and the new value.

Immediate Database Modification: How?
- The transaction starts by writing a <Ti start> record to the log.
- Before executing a write operation, the log record <Ti, Xj, V1, V2> is written, where V1 is the old value and V2 is the new value.
- The log record <Ti, Xj, V1, V2> must be written to stable storage before the item Xj is written in the database. That is, before output(B) is executed, the log record <Ti, Xj, V1, V2> is written to stable storage.
- The scheme can be extended to postpone log record output, so long as, prior to execution of an output(B) operation for a data block B, all log records corresponding to items in B have been flushed (written) to stable storage.
- Output of updated blocks can take place at any time before or after the transaction commits.
- The order in which blocks are output can be different from the order in which they are written.
- When Ti partially commits, <Ti commit> is written to the log.

Immediate Database Modification: example
T0 subtracts 50 from A and adds 50 to B; T1 subtracts 100 from C. T0 executes before T1. Old values: A = 1000, B = 2000, C = 700. New values: A = 950, B = 2050, C = 600.

1. The transaction starts by writing a <Ti start> record to the log:
     <T0 start>
2. Before executing a write operation, the log record <Ti, Xj, V1, V2> is written, where V1 is the old value and V2 is the new value:
     <T0, A, 1000, 950>
     <T0, B, 2000, 2050>
3. Output of updated blocks can take place at any time before or after the transaction commits. Here the database now holds A = 950, B = 2050.
4. When Ti partially commits, <Ti commit> is written to the log:
     <T0 commit>
5. T1 proceeds in the same way. The complete log and database:

  Log                       Database
  <T0 start>
  <T0, A, 1000, 950>
  <T0, B, 2000, 2050>
                            A = 950
                            B = 2050
  <T0 commit>
  <T1 start>
  <T1, C, 700, 600>
                            C = 600
  <T1 commit>

Immediate vs Deferred

  IMMEDIATE
  Log                       Database
  <T0 start>
  <T0, A, 1000, 950>
  <T0, B, 2000, 2050>
                            A = 950
                            B = 2050
  <T0 commit>
  <T1 start>
  <T1, C, 700, 600>
                            C = 600
  <T1 commit>

  DEFERRED
  Log                       Database
  <T0 start>
  <T0, A, 950>
  <T0, B, 2050>
  <T0 commit>
                            A = 950
                            B = 2050
  <T1 start>
  <T1, C, 600>
  <T1 commit>
                            C = 600

The recovery procedure has two operations:
- undo(Ti) restores the value of all data items updated by Ti to their old values, going backwards from the last log record for Ti.
- redo(Ti) sets the value of all data items updated by Ti to the new values, going forward from the first log record for Ti.

Both operations must be idempotent. That is, even if the operation is executed multiple times, the effect is the same as if it is executed once. This is needed since operations may get re-executed during recovery.

How to recover after a failure: Transaction Ti needs to be undone if the log contains the record <Ti start>, but does not contain the record <Ti commit>. Transaction Ti needs to be redone if the log contains both the record <Ti start> and the record <Ti commit>. Undo operations are performed first, then redo operations.
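The undo-then-redo rule can be sketched as below. The tuple encoding `(txn, item, old, new)` for update records and the function name `recover_immediate` are assumptions of this sketch; transactions are assumed to have run serially, as in the examples of this section.

```python
def recover_immediate(log, db):
    """Recovery for immediate modification: undo first, then redo."""
    committed = {r[0] for r in log if r[1:] == ("commit",)}
    # undo: backwards through the log, restoring old values of every
    # transaction that has update records but no commit record
    for rec in reversed(log):
        if len(rec) == 4 and rec[0] not in committed:
            _txn, item, old, _new = rec
            db[item] = old
    # redo: forwards through the log, reapplying new values of every
    # committed transaction
    for rec in log:
        if len(rec) == 4 and rec[0] in committed:
            _txn, item, _old, new = rec
            db[item] = new
    return db


log = [
    ("T0", "start"), ("T0", "A", 1000, 950), ("T0", "B", 2000, 2050),
    ("T0", "commit"),
    ("T1", "start"), ("T1", "C", 700, 600), ("T1", "commit"),
]

# Database contents assumed to be on disk at the moment of each crash.
case1 = recover_immediate(log[:3], {"A": 950, "B": 2050, "C": 700})
case2 = recover_immediate(log[:6], {"A": 950, "B": 2050, "C": 600})
case3 = recover_immediate(log, {"A": 950, "B": 2050, "C": 600})
```

Both loops assign logged values rather than recomputing them, so running `recover_immediate` again on its own output changes nothing.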

Case 1: the crash occurs just after the log record for the step write-item(B) of T0 has been written to stable storage.

  Log                       Database values
  <T0 start>
  <T0, A, 1000, 950>
  <T0, B, 2000, 2050>
                            A = 950; B = 2050
  crash

Transaction Ti needs to be undone if the log contains the record <Ti start> but does not contain the record <Ti commit>. Since <T0 start> exists but no commit record <T0 commit> appears in the log, undo(T0) is performed.
How is undo(T0) carried out? Going backwards from the last log record for T0:
- <T0, B, 2000, 2050>: B is restored to 2000;
- <T0, A, 1000, 950>: A is restored to 1000.

Case 2: the crash occurs just after the log record for the step write-item(C) of T1 has been written to stable storage.

  Log                       Database values
  <T0 start>
  <T0, A, 1000, 950>
  <T0, B, 2000, 2050>
                            A = 950; B = 2050
  <T0 commit>
  <T1 start>
  <T1, C, 700, 600>
                            C = 600
  crash

- undo(T1) is performed, since only <T1 start> is in the log.
- redo(T0) is performed, since both <T0 start> and <T0 commit> exist in the log.
How does the system recover?
First, undo(T1), going backwards from the last log record for T1:
- <T1, C, 700, 600>: C is restored to 700.
Second, redo(T0), going forward from the first log record for T0:
- <T0, A, 1000, 950>: A is updated to 950;
- <T0, B, 2000, 2050>: B is updated to 2050.

Case 3: the crash occurs just after the log record <T1 commit> has been written to stable storage.

  Log                       Database values
  <T0 start>
  <T0, A, 1000, 950>
  <T0, B, 2000, 2050>
                            A = 950; B = 2050
  <T0 commit>
  <T1 start>
  <T1, C, 700, 600>
                            C = 600
  <T1 commit>
  crash

Two redo operations, redo(T0) and redo(T1), are performed.
First, redo(T0), going forward from the first log record for T0:
- <T0, A, 1000, 950>: A is updated to 950;
- <T0, B, 2000, 2050>: B is updated to 2050.
Second, redo(T1), going forward from the first log record for T1:
- <T1, C, 700, 600>: C is updated to 600.

Problems in the recovery procedure
- Searching the entire log is time-consuming (the log must be searched to determine which transactions are to be redone or undone).
- We might unnecessarily redo transactions which have already output their updates to the database.

Example: in Case 3 (crash just after <T1 commit> has been written to stable storage), A, B and C have already been updated in the database (A = 950, B = 2050, C = 600), so redo(T0) and redo(T1) are unnecessary.

Checkpoints
To streamline the recovery procedure, periodic checkpointing must be performed during execution, in addition to maintaining the log (with deferred or immediate updates). A checkpoint requires the following actions to take place:
1. Output all log records currently residing in main memory onto stable storage.
2. Output all modified buffer blocks to the disk.
3. Write a log record <checkpoint> onto stable storage.
Note: while a checkpoint is in progress, transactions are not allowed to perform any update actions such as writing to a buffer block or writing a log record.

Checkpoints
During recovery we need to consider only:
- the most recent transaction Ti that started before the checkpoint (but had not completed prior to the checkpoint), and
- transactions that started after Ti.
[Figure: timeline of transactions T1 to T4 relative to the checkpoint at time tc and the system failure at time tf]

Example: the crash occurs just after the log record <T1 commit> has been written to stable storage. T0 subtracts 50 from A and adds 50 to B; T1 subtracts 100 from C.

  Log                       Database values
  <T0 start>
  <T0, A, 1000, 950>
  <T0, B, 2000, 2050>
                            A = 950; B = 2050
  <T0 commit>
  <T1 start>
  <checkpoint>
  <T1, C, 700, 600>
                            C = 600
  <T1 commit>
  crash

Step 1: Scan backwards from the end of the log to find the most recent <checkpoint> record.
Step 2: Continue scanning backwards until a record <Ti start> is found. Here that record is <T1 start>.
Step 3: Only the part of the log following this start record needs to be considered. The earlier part of the log can be ignored during recovery, and can be erased whenever desired.
Step 4: For all transactions (starting from Ti or later) with no <Ti commit>, execute undo(Ti). (Done only in case of immediate modification.) Here no undo operations need to be undertaken.
Step 5: Scanning forward in the log, for all transactions starting from Ti or later with a <Ti commit>, execute redo(Ti). Here redo(T1) is executed: C is updated to 600. With the checkpoint, the unnecessary redo(T0) is eliminated.
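The five steps can be sketched for the serial case as follows. The tuple encoding of log records, the `CHECKPOINT` marker and the function names are assumptions of this example; the sketch also falls back to the whole log when no checkpoint (or no earlier start record) is found, which is a conservative choice rather than part of the procedure above.

```python
CHECKPOINT = ("checkpoint",)


def relevant_suffix(log):
    """Steps 1-3: return the part of the log that recovery must examine."""
    i = len(log) - 1
    while i >= 0 and log[i] != CHECKPOINT:        # step 1: find <checkpoint>
        i -= 1
    if i < 0:
        return log                                # no checkpoint: whole log
    while i >= 0 and log[i][1:] != ("start",):    # step 2: find <Ti start>
        i -= 1
    return log[max(i, 0):]                        # step 3: ignore earlier part


def recover_with_checkpoint(log, db):
    tail = relevant_suffix(log)
    committed = {r[0] for r in tail if r[1:] == ("commit",)}
    for rec in reversed(tail):                    # step 4: undo, backwards
        if len(rec) == 4 and rec[0] not in committed:
            db[rec[1]] = rec[2]
    redone = []
    for rec in tail:                              # step 5: redo, forwards
        if len(rec) == 4 and rec[0] in committed:
            db[rec[1]] = rec[3]
            if rec[0] not in redone:
                redone.append(rec[0])
    return redone


log = [
    ("T0", "start"), ("T0", "A", 1000, 950), ("T0", "B", 2000, 2050),
    ("T0", "commit"),
    ("T1", "start"), CHECKPOINT, ("T1", "C", 700, 600), ("T1", "commit"),
]
db = {"A": 950, "B": 2050, "C": 600}              # assumed disk state at crash
redone = recover_with_checkpoint(log, db)
```

Only T1 appears in `redone`: the records of T0 lie before the start record found in step 2 and are never examined.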

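Example of Checkpoints
[Figure: timeline of transactions T1 to T4 relative to the checkpoint at time tc and the system failure at time tf]
- T1 can be ignored (its updates were already output to disk due to the checkpoint).
- T2 and T3 are redone.
- T4 is undone.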

Given: T1 subtracts 100 from A; T2 increases the value of A by 10%. The initial value of A is 1000. We allow immediate modification of the database.

  T1                  T2                  A
                                          1000
  read(A)
  A := A - 100
  write(A)                                900
                      A := A * 1.1        990
                      commit
  failure
  abort/rollback

Review: Immediate Database Modification
When recovering after failure: Transaction Ti needs to be undone if the log contains the record <Ti start>, but does not contain the record <Ti commit>. Transaction Ti needs to be redone if the log contains both the record <Ti start> and the record <Ti commit>. Undo operations are performed first, then redo operations.

Review: Checkpoints
During recovery we need to consider only the most recent transaction Ti that started before the checkpoint, and transactions that started after Ti.
1. Scan backwards from the end of the log to find the most recent <checkpoint> record.
2. Continue scanning backwards until a record <Ti start> is found.
3. Only the part of the log following this start record needs to be considered. The earlier part of the log can be ignored during recovery, and can be erased whenever desired.
4. For all transactions (starting from Ti or later) with no <Ti commit>, execute undo(Ti), scanning backwards. (Done only in case of immediate modification.)
5. Scanning forward in the log, for all transactions starting from Ti or later with a <Ti commit>, execute redo(Ti).

The same schedule with its log records:

  T1                  T2                  Log                    A
  read-item(A)                            <T1 start>
  A := A - 100
  write-item(A)                           <T1, A, 1000, 900>     900
                      A := A * 1.1        <T2 start>
                                          <T2, A, 900, 990>      990
                      commit              <T2 commit>
  checkpoint (tc)
  failure (tf)
  abort/rollback

- T2 can be ignored (its updates were already output to disk due to the checkpoint).
- T1 is not finished; it has to be undone.

Why is this schedule unrecoverable?
Log: <T1 start>, <T1, A, 1000, 900>, <T2 start>, <T2, A, 900, 990>, <T2 commit>, checkpoint, failure.
It does not follow strict 2PL: T2 read and overwrote A while T1, which had written A, was still uncommitted.
Based on the recovery procedure for immediate modification, T1 has to be undone, since the log contains <T1 start> but not <T1 commit>. However, undoing T1 restores A to 1000, so the update of the committed transaction T2, which was computed from T1's uncommitted (dirty) value 900, would be lost. Therefore T1 cannot be undone correctly.

Recovery With Concurrent Transactions
- All transactions share a single disk buffer and a single log.
- A buffer block can have data items updated by one or more transactions.
- We assume concurrency control using strict two-phase locking, that is, a transaction T does not release any of its exclusive (write) locks until after it commits or aborts.
Hence, no other transaction can read or write an item that is written by T unless T has committed, leading to a strict schedule for recoverability (the updates of uncommitted transactions should not be visible to other transactions).

The checkpointing technique and the actions taken on recovery have to be changed, since several transactions may be active when a checkpoint is performed.
Checkpoints are performed as before, except that the checkpoint log record is now of the form <checkpoint L>, where L is the list of transactions active at the time of the checkpoint.
We assume that no updates to the buffer or the log are in progress while the checkpoint is carried out (this can be relaxed in advanced techniques).

Example with strict 2PL. Initial values: A = 1000; B = 500. Locks are released only after commit or abort.

  T1                  T2                        Log
  write-lock(A)                                 <T1 start>
  read-item(A)
  A := A - 100
  write-item(A)                                 <T1, A, 1000, 900>     (A = 900)
                      A := A * 1.1  **          <T2 start>
                      commit
                                                <checkpoint [T2, T1]>
  write-lock(B)
  read-item(B)
  B := B + 100
  write-item(B)                                 <T1, B, 500, 600>
  …
  abort/rollback

** T2 cannot proceed because T1 holds the lock on A.

How the system recovers from crash
Log: <T1 start> <T1, A, 1000, 900> <T2 start> <checkpoint [T2, T1]> <T1, B, 500, 600>
Step 1. Initialize undo-list and redo-list to empty.
undo-list = []
redo-list = []

Step 2. Scan the log backwards from the end, stopping when the first <checkpoint L> record is found. For each record found during the scan:
a. if the record is <Ti commit>, add Ti to redo-list
b. if the record is <Ti start>, then if Ti is not in redo-list, add Ti to undo-list
Then, for every Ti in L, if Ti is not in redo-list, add Ti to undo-list.
Log: <T1 start> <T1, A, 1000, 900> <T2 start> <checkpoint [T2, T1]> <T1, B, 500, 600>
Result: redo-list = [], undo-list = [T2, T1]

Step 3. Once the redo-list and undo-list have been constructed, rescan the log backwards from the most recent record, stopping when <Ti start> records have been encountered for every Ti in undo-list. During the rescan, perform undo for each log record that belongs to a transaction in undo-list.
Log: <T1 start> <T1, A, 1000, 900> <T2 start> <checkpoint [T2, T1]> <T1, B, 500, 600>
undo-list = [T2, T1], so perform undo(T2) and undo(T1).

Step 4. Locate the most recent <checkpoint L> record (this may involve scanning the log forward if the record was passed in step 3).
Log: <T1 start> <T1, A, 1000, 900> <T2 start> <checkpoint [T2, T1]> <T1, B, 500, 600>
undo-list = [T2, T1], redo-list = []

Step 5. Scan the log forwards from the <checkpoint L> record to the end of the log. During the scan, perform redo for each log record that belongs to a transaction on redo-list.
Log: <T1 start> <T1, A, 1000, 900> <T2 start> <checkpoint [T2, T1]> <T1, B, 500, 600>
redo-list = [], so nothing is redone.

Once the lists have been constructed in step 2, undo-list consists of incomplete transactions, which must be undone, and redo-list consists of finished transactions, which must be redone.
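The five steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a full recovery manager: the tuple encoding of log records and the function name recover are hypothetical, and the database is modelled as a plain dictionary.

```python
def recover(log, db):
    """Restart recovery with <checkpoint L> records (immediate modification).

    Assumed record encoding: ("start", T), ("update", T, item, old, new),
    ("commit", T), ("checkpoint", [T, ...]).
    """
    undo_list, redo_list = [], []                      # Step 1

    # Step 2: scan backwards up to the most recent checkpoint record.
    cp_index, active_at_cp = 0, []
    for i in range(len(log) - 1, -1, -1):
        rec = log[i]
        if rec[0] == "commit":
            redo_list.append(rec[1])
        elif rec[0] == "start" and rec[1] not in redo_list:
            undo_list.append(rec[1])
        elif rec[0] == "checkpoint":
            cp_index, active_at_cp = i, rec[1]
            break
    for t in active_at_cp:
        if t not in redo_list and t not in undo_list:
            undo_list.append(t)

    # Step 3: scan backwards, restoring old values for undo-list
    # transactions, until every one of their start records has been seen.
    pending = set(undo_list)
    for rec in reversed(log):
        if not pending:
            break
        if rec[0] == "update" and rec[1] in undo_list:
            db[rec[2]] = rec[3]
        elif rec[0] == "start":
            pending.discard(rec[1])

    # Steps 4 and 5: scan forwards from the checkpoint, reapplying new
    # values for redo-list transactions.
    for rec in log[cp_index:]:
        if rec[0] == "update" and rec[1] in redo_list:
            db[rec[2]] = rec[4]
    return undo_list, redo_list


# The log from the example above; db is the state found on disk.
log = [("start", "T1"), ("update", "T1", "A", 1000, 900), ("start", "T2"),
       ("checkpoint", ["T2", "T1"]), ("update", "T1", "B", 500, 600)]
db = {"A": 900, "B": 600}
undo_list, redo_list = recover(log, db)
print(undo_list, redo_list, db)
```

For this log the sketch yields undo-list [T2, T1], an empty redo-list, and the restored values A = 1000 and B = 500, matching the walk-through.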

Shadow Paging
Shadow paging is an alternative to log-based recovery.
It is useful if transactions execute serially. It is hard to extend shadow paging to the concurrent processing of multiple transactions.

In shadow paging, the DB is partitioned into fixed-length units called pages (or blocks).
These pages need not be stored in any particular order. However, there must be a way to find the ith page of the DB for any given i.
[Figure: database disk holding pages 1 to n]

To find the ith page of the DB for any given i, a page table, which is kept in memory, is used.
The page table has n entries, one for each database page. Each entry in the page table contains a pointer to a page on the DB disk.
[Figure: page table in memory, with entries 1 to n pointing to the DB disk pages]

Given: Tj wants to update item X, and item X is in page (block) 5.
Whenever a page is about to be written for the first time, shadow paging is undertaken.
[Figure: main memory, showing buffer block 5 containing X and the work area of Tj holding X1; labels input(A), read-item(X), write-item(X)]

Shadow paging idea
Maintain two page tables during the lifetime of a transaction: the current page table and the shadow page table.
When transaction Tj starts, both page tables are identical.
The shadow page table, which is saved on nonvolatile storage, is never modified during transaction execution.
The current page table may be modified when the transaction performs a write operation.
[Figure: current page table (memory) and shadow page table (disk), each with entries 1 to 6 pointing to the same DB pages on disk]

X is found in page 5. There are unused pages (free blocks) on the DB disk.
Tj performs a write-item(X) operation. Whenever any page is about to be written for the first time:
1. A copy of the page (page 5) is made onto an unused page: the system finds an unused page, deletes it from the list of free page frames, and copies the contents of page 5 to it.
2. The current page table is then made to point to the copy.
3. The update is performed on the copy.
[Figure: entry 5 of the current page table (memory) points to the new copy of page 5, while entry 5 of the shadow page table (disk) still points to the old page 5]
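The three steps for a first write can be sketched as follows. The data layout (a dictionary of disk slots, a free list, and page tables mapping page numbers to slots) and the function name write_item are assumptions chosen for illustration only.

```python
def write_item(disk, free_list, current_table, shadow_table, page_no, item, value):
    """Write one item; the first write to a page triggers copy-on-write."""
    if current_table[page_no] == shadow_table[page_no]:
        # 1. Take an unused slot off the free list and copy the page into it.
        new_slot = free_list.pop(0)
        disk[new_slot] = dict(disk[current_table[page_no]])
        # 2. Point the current page table at the copy.
        current_table[page_no] = new_slot
    # 3. Perform the update on the copy.
    disk[current_table[page_no]][item] = value


# Hypothetical setup: slots 0 and 1 hold pages 5 and 6, slots 2 and 3 are free.
disk = {0: {"X": 10}, 1: {"Y": 20}, 2: None, 3: None}
free_list = [2, 3]
shadow_table = {5: 0, 6: 1}
current_table = dict(shadow_table)   # identical when the transaction starts

write_item(disk, free_list, current_table, shadow_table, 5, "X", 11)
print(current_table[5], shadow_table[5], disk[0], disk[2])
```

The old page in slot 0 and the shadow page table are untouched, so the pre-transaction state remains reachable through the shadow page table.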

[Figure: main memory after the first write of page 5; the buffer block now corresponds to the new copy of page 5, and the work area of Tj holds X1]
To commit a transaction (here Tj):
1. Flush (output) all pages modified by the transaction from main memory to disk.
2. Output the current page table to disk.
3. Make the current page table the new shadow page table, as follows:
a. Keep a pointer to the shadow page table at a fixed (known) location on disk.
b. Update the pointer to point to the current page table on disk.
[Figure: the copy of the current page table on disk, whose entry 5 points to the new page 5, becomes the shadow page table; the old shadow page table still points to the old page 5]

To commit a transaction in shadow paging (summary)
1. Flush (output) all pages modified by the transaction from main memory to disk.
2. Output the current page table to disk.
3. Make the current page table the new shadow page table, as follows:
a. Keep a pointer to the shadow page table at a fixed (known) location on disk.
b. To make the current page table the new shadow page table, simply update the pointer to point to the current page table on disk.

Shadow Paging
Once the pointer to the shadow page table has been written, the transaction is committed.
No recovery is needed after a crash; new transactions can start right away, using the shadow page table.
Pages not pointed to by the current/shadow page table should be freed (garbage collected).

Transaction Tj committed
The old page 5 is freed.
[Figure: the new shadow page table points to the new page 5; the old shadow page table and the old page 5 are no longer referenced]

Shadow Paging: When a crash occurs
If a crash occurs before the transaction finally commits (before step 3 of the commit procedure completes), the system reverts to the state prior to the execution of the transaction.
If a crash occurs after the transaction has committed (after step 3 of the commit procedure completes), the effects of the transaction are preserved; no redo operations need to be invoked.
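The commit procedure and its crash behaviour can be illustrated with a small model. Everything here is a sketch under assumptions: the classes Disk and Transaction, the crash_before_step parameter that simulates a crash, and whole-page writes are hypothetical simplifications, not an actual DBMS interface.

```python
class Disk:
    """Hypothetical nonvolatile storage: page slots, page tables, one root pointer."""
    def __init__(self, pages):
        self.slots = dict(enumerate(pages))                       # slot -> contents
        self.tables = {0: {i + 1: i for i in range(len(pages))}}  # id -> page table
        self.root = 0           # fixed location naming the shadow page table


class Transaction:
    def __init__(self, disk):
        self.disk = disk
        self.current = dict(disk.tables[disk.root])   # current page table (memory)
        self.buffer = {}                              # slot -> unflushed contents

    def write(self, page, value):
        shadow = self.disk.tables[self.disk.root]
        if self.current[page] == shadow[page]:        # first write to this page
            self.current[page] = max(list(self.disk.slots) + list(self.buffer)) + 1
        self.buffer[self.current[page]] = value       # whole-page write to the copy

    def commit(self, crash_before_step=None):
        if crash_before_step == 1:
            return
        self.disk.slots.update(self.buffer)           # 1. flush modified pages
        if crash_before_step == 2:
            return
        table_id = max(self.disk.tables) + 1          # 2. output current page table
        self.disk.tables[table_id] = dict(self.current)
        if crash_before_step == 3:
            return
        self.disk.root = table_id                     # 3. switch the pointer


def read_after_restart(disk, page):
    """After a restart only the table named by the root pointer is used."""
    return disk.slots[disk.tables[disk.root][page]]


disk = Disk(["a", "b", "c"])
t1 = Transaction(disk)
t1.write(2, "b-new")
t1.commit(crash_before_step=3)          # crash before the pointer is updated
value_after_crash = read_after_restart(disk, 2)

t2 = Transaction(disk)
t2.write(2, "b-final")
t2.commit()                             # all three steps complete
value_after_commit = read_after_restart(disk, 2)
print(value_after_crash, value_after_commit)
```

In the first case the root pointer still names the old table, so page 2 reads as "b"; in the second case the pointer has been switched and the new value is visible without any redo. The slot and table left behind by the crashed transaction are the garbage that needs to be collected.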

Advantages of Shadow Paging
No overhead of writing log records.
Recovery is trivial.

Disadvantages of Shadow Paging
Copying the entire page table is very expensive.
Commit overhead is high: every updated page, and the page table, must be flushed.
Data gets fragmented (related pages get separated on disk).
After every transaction completion, the database pages containing old versions of modified data need to be garbage collected.
It is hard to extend the algorithm to allow transactions to run concurrently; log-based schemes are easier to extend.

Stable-Storage Implementation
Maintain multiple copies of each block on separate disks. Copies can be kept at remote sites to protect against disasters such as fire or flooding.
Failure during data transfer can still result in inconsistent copies. A block transfer can result in:
Successful completion
Partial failure: the destination block has incorrect information
Total failure: the destination block was never updated

Stable-Storage Implementation: protecting storage media from failure during data transfer
Part I: Execute the output operation as follows (assuming two copies of each block):
1. Write the information onto the first physical block.
2. When the first write successfully completes, write the same information onto the second physical block.
3. The output is completed only after the second write successfully completes.
Part II: Copies of a block may differ due to a failure during an output operation. To recover from failure:
1. First find inconsistent blocks.
Expensive solution: compare the two copies of every disk block.
Better solution: record in-progress disk writes on non-volatile storage (non-volatile RAM or a special area of disk). Use this information during recovery to find blocks that may be inconsistent, and only compare copies of these.
2. If either copy of an inconsistent block is detected to have an error (bad checksum), overwrite it with the other copy. If both have no error but are different, overwrite the second block with the first block.
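Parts I and II can be sketched with two in-memory dictionaries standing in for the two disks. The block layout, the use of a CRC-32 checksum from Python's zlib module, and the names stable_write and recover_block are assumptions for illustration; the fail_after_first flag merely simulates a crash between the two writes.

```python
import zlib


def make_block(data):
    """A block stores its data together with a checksum of that data."""
    return {"data": data, "crc": zlib.crc32(data)}


def is_valid(block):
    return block is not None and zlib.crc32(block["data"]) == block["crc"]


def stable_write(disk1, disk2, block_no, data, fail_after_first=False):
    """Part I: write the first copy, then the second; done only after both."""
    disk1[block_no] = make_block(data)
    if fail_after_first:
        return False                      # simulated crash between the writes
    disk2[block_no] = make_block(data)
    return True


def recover_block(disk1, disk2, block_no):
    """Part II, step 2: repair one block that may be inconsistent."""
    b1, b2 = disk1.get(block_no), disk2.get(block_no)
    if not is_valid(b1) and is_valid(b2):
        disk1[block_no] = dict(b2)        # first copy is bad: take the second
    elif is_valid(b1) and (not is_valid(b2) or b1 != b2):
        disk2[block_no] = dict(b1)        # second copy bad or different: take the first
    # If both copies are invalid, nothing can be repaired here.


disk1 = {0: make_block(b"old")}
disk2 = {0: make_block(b"old")}
stable_write(disk1, disk2, 0, b"new", fail_after_first=True)
recover_block(disk1, disk2, 0)
print(disk1[0]["data"], disk2[0]["data"])
```

After recovery both copies hold the new data, because the first copy was written completely and its checksum is valid. A real implementation would call recover_block only for the blocks recorded as in progress on non-volatile storage.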

Seatwork
Log: <T1 start> <T1, D, 20> <T1 commit> [checkpoint] <T4 start> <T4, B, 15> <T4, A, 20> <T4 commit> <T2 start> <T2, B, 12> <T3 start> <T3, A, 30> <T2, D, 25> System crash
[Table: operations of transactions T1 to T4: read(A), read(B), read(D), write(B), write(A), write(D), read(C), write(C)]
Describe the recovery process using deferred updates (of concurrent transactions) with checkpointing.

Seatwork
Log: <T1 start> <T1, D, 20> <T1 commit> [checkpoint] <T4 start> <T4, B, 15> <T4, A, 20> <T4 commit> <T2 start> <T2, B, 12> <T3 start> <T3, A, 30> <T2, D, 25> System crash
[Table: operations of transactions T1 to T4: read(A), read(B), read(D), write(B), write(A), write(D), read(C), write(C)]
Describe the recovery process from the system crash using immediate updates with checkpointing. Specify which transactions are rolled back, which operations in the log are redone and which (if any) are undone, and whether any cascading rollback takes place.
