6/25/2015Transactional Information Systems16-1 Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery Gerhard Weikum and Gottfried Vossen “Teamwork is essential. It allows you to blame someone else.”(Anonymous) © 2002 Morgan Kaufmann ISBN
Part III: Recovery
- 11 Transaction Recovery
- 12 Crash Recovery: Notion of Correctness
- 13 Page-Model Crash Recovery Algorithms
- 14 Object-Model Crash Recovery Algorithms
- 15 Special Issues of Recovery
- 16 Media Recovery
- 17 Application Recovery
Chapter 16: Media Recovery
- 16.2 Log-based Method
  - Database Backup and Archive Logging
  - Database Restore
  - Analysis of MTTDL
- 16.3 Storage Redundancy
  - Techniques Based on Mirroring
  - Techniques Based on Error-Correcting Codes
- 16.4 Disaster Recovery
- 16.5 Lessons Learned

“More than any time in history mankind faces a crossroads. One path leads to despair and utter hopelessness, the other to total extinction. Let us pray that we have the wisdom to choose correctly.” (Woody Allen)
Failure Model and Assessment Criteria

Failures whose repair requires media recovery:
- disk failures (damaged media)
- corrupted pages on disk (single-block read errors)
- environmental failures: fire, water damage, disasters
- serious bugs in operational server software
- erroneous user input

Assessment criteria:
- availability: MTTF / (MTTF + MTTR)
- survivability level: number of simultaneous failures that can be repaired
- mean time to data loss (MTTDL)
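The availability criterion can be illustrated with a small calculation. This is only a sketch: the function name and the sample figures are hypothetical and not taken from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: a failure every 999 hours on average, one hour to repair.
example = availability(mttf_hours=999.0, mttr_hours=1.0)
```

Halving the repair time improves availability about as much as doubling the time between failures, which is why fast restore matters as much as reliable hardware.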
Log-based Media Recovery
(with a limited, pragmatic form of environmental recovery by selectively skipping log entries)

2-step recovery:
1. Replace the failed disk (or remap the corrupted disk blocks) and reload the data from the backup copy.
2. Redo history using the archive log, starting from the begin of the most recent complete backup that finished successfully. This copes with rollbacks and crashes/restarts that occurred between the time of the backup and the media recovery, using compensation log entries (CLEs). Then undo the loser transactions, as in crash recovery.
Components for Log-based Media Recovery

[Figure: the stable database, the files for the stable log, the archive log, the database backups (June 20, June 27, July 4), and a shadow database. The log timeline shows begin-backup and end-backup entries, begin(t_i), begin(t_k), end(t_i), write(x, ...), a soft crash, and a disk failure, together with the MediaRecoveryLSN, the redo pass, and the undo pass.]
Database Backup and Archive Logging

Backup:
- complete or incremental (modified pages only)
- online backup of selected tablespaces: creates a “fuzzy” copy on backup disk(s) or tape, containing updates of active transactions
- scans the page-mapping table and resets the modified flags
- scan position is saved in checkpoint log entries
- may copy (stale) pages directly from disk (bypassing the cache)

Archive log:
- copies (replicates) all log entries from the stable log since the begin of the most recent complete backup that finished successfully
- can garbage-collect log entries older than
  MediaRecoveryLSN := min {begin-backup log entry of most recent completed backup, SystemRedoLSN as of begin-backup, current OldestUndoLSN}
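The MediaRecoveryLSN rule can be sketched as follows. The record type `BackupRecord` and its field names are hypothetical. The sketch also assumes that LSNs are plain integers and that the backup with the largest begin-backup LSN is the most recent one.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupRecord:
    begin_backup_lsn: int          # LSN of the begin-backup log entry
    system_redo_lsn_at_begin: int  # SystemRedoLSN as of begin-backup
    completed: bool                # False while the backup is still running


def media_recovery_lsn(backups: Iterable[BackupRecord], oldest_undo_lsn: int) -> int:
    """Log entries older than the returned LSN may be garbage-collected."""
    completed = [b for b in backups if b.completed]
    if not completed:
        raise ValueError("no completed backup: the log must not be truncated")
    latest = max(completed, key=lambda b: b.begin_backup_lsn)
    return min(latest.begin_backup_lsn,
               latest.system_redo_lsn_at_begin,
               oldest_undo_lsn)


history = [
    BackupRecord(100, 90, True),
    BackupRecord(200, 180, True),
    BackupRecord(300, 290, False),  # still in progress, so it is ignored
]
truncation_point = media_recovery_lsn(history, oldest_undo_lsn=195)
```

In this example the SystemRedoLSN at the begin of the latest completed backup is the binding constraint, because the fuzzy copy may miss updates that were still only in the cache when the backup started.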
Database Restore

restore (pageset):
   for each page in pageset do
      identify the most recent (incremental or complete) backup that contains a copy of the page;
      copy the page onto the replaced disk;
   end /*for*/;
   perform redo pass on the archive log using the redo-history algorithm, starting from MediaRecoveryLSN and ignoring all log entries not referring to pageset;
   perform analysis pass on the log, starting from most recent checkpoint, to identify loser transactions;
   perform undo pass on the log for loser transactions;

Restore can be accelerated by
- parallelizing redo,
- merging multiple incremental backups offline into a complete backup, and/or
- applying redo offline to a backup copy (“shadow database”).
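The page-copy loop and the redo pass of the restore procedure can be sketched in a few lines. The types `Page` and `LogEntry`, the in-memory dictionaries standing in for backups, and the rule of overwriting a page with the logged value are simplifying assumptions. The analysis and undo passes are left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class Page:
    lsn: int    # LSN of the last update reflected in this page copy
    value: str


@dataclass(frozen=True)
class LogEntry:
    lsn: int
    page_id: str
    new_value: str


def restore(pageset: Set[str], backups: List[Dict[str, Page]],
            archive_log: Iterable[LogEntry], media_recovery_lsn: int) -> Dict[str, Page]:
    """backups is ordered from oldest to newest; each maps page ids to page copies."""
    restored: Dict[str, Page] = {}
    for page_id in pageset:
        for backup in reversed(backups):       # the most recent backup wins
            if page_id in backup:
                src = backup[page_id]
                restored[page_id] = Page(src.lsn, src.value)
                break
        else:
            raise KeyError(f"no backup contains page {page_id}")
    # Redo pass: skip entries before MediaRecoveryLSN or outside the pageset.
    for entry in sorted(archive_log, key=lambda e: e.lsn):
        if entry.lsn < media_recovery_lsn or entry.page_id not in restored:
            continue
        page = restored[entry.page_id]
        if page.lsn < entry.lsn:               # update not yet in the page copy
            page.value, page.lsn = entry.new_value, entry.lsn
    return restored
```

The page-LSN test keeps the redo pass idempotent: an update already captured by the fuzzy backup copy is not applied a second time.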
Correctness and Quality of Log-based Media Recovery

Theorem 16.1: The backup/log-based media recovery algorithm provides correct recovery after media failures by reconstructing the data such that it captures exactly all winner transactions in the original serialization order.
Analysis of MTTDL (1)

Markov chain model with four states:
- db ok; backup and log ok
- db failed; backup and log ok
- db failed; backup or log failed
- db ok; backup or log failed

[Figure: state-transition diagram whose edges are labeled with the rates 1/MTTF, 2/MTTF, 1/MTTR_backup, and 1/MTTR_recovery.]
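Analysis of MTTDL (2)

- $r_{ij}$: transition rate from state $i$ to state $j$
- $E_{ij} = E[\text{time from entering state } i \text{ until entering state } j]$
- $H_i = E[\text{time between entering and leaving state } i] = 1 / \sum_{j \neq i} r_{ij}$
- $p_{ik} = P[\text{transition from } i \text{ to } k \mid \text{state } i \text{ is left}] = r_{ik} / \sum_{j \neq i} r_{ij}$

Solve for the given Markov chain:

$E_{12} = H_1 + p_{13} E_{32}$
$E_{13} = H_1 + p_{12} E_{23}$
$E_{14} = H_1 + p_{12} E_{24} + p_{13} E_{34}$
$E_{21} = H_2$
$E_{23} = H_2 + p_{21} E_{13}$
$E_{24} = H_2 + p_{21} E_{14}$
$E_{31} = H_3$
$E_{32} = H_3 + p_{31} E_{12}$
$E_{34} = H_3 + p_{31} E_{14}$

yielding the MTTDL as $E_{14}$, the expected time from the initial state 1 until state 4 is entered.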
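Substituting the equations for E24 and E34 into the one for E14 gives E14 = (H1 + p12 H2 + p13 H3) / (1 - p12 p21 - p13 p31). The sketch below evaluates this expression. The assignment of rates to edges in `mttdl` is an assumption about the diagram (state 2 = database failed, state 3 = backup or log failed, state 4 = both), and the function names are hypothetical.

```python
def expected_time_to_loss(r12: float, r13: float, r21: float,
                          r24: float, r31: float, r34: float) -> float:
    """E14 for a chain with edges 1->2, 1->3, 2->1, 2->4, 3->1, 3->4."""
    h1 = 1.0 / (r12 + r13)      # mean holding time in state 1
    p12, p13 = r12 * h1, r13 * h1
    h2 = 1.0 / (r21 + r24)
    p21 = r21 * h2
    h3 = 1.0 / (r31 + r34)
    p31 = r31 * h3
    return (h1 + p12 * h2 + p13 * h3) / (1.0 - p12 * p21 - p13 * p31)


def mttdl(mttf: float, mttr_recovery: float, mttr_backup: float) -> float:
    """MTTDL under the assumed edge labeling (all times in the same unit)."""
    return expected_time_to_loss(
        r12=1.0 / mttf,           # database disk fails
        r13=2.0 / mttf,           # backup or log disk fails (two disks)
        r21=1.0 / mttr_recovery,  # media recovery completes
        r24=2.0 / mttf,           # backup or log fails during recovery
        r31=1.0 / mttr_backup,    # backup/log redundancy is re-established
        r34=1.0 / mttf,           # database disk fails meanwhile
    )
```

With an MTTF far larger than both repair times, the denominator becomes very small, so the MTTDL exceeds the MTTF of a single disk by several orders of magnitude.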
Mirrored Disk Pairs

Storage redundancy techniques provide protection against disk failures with continuous availability; recovery rebuilds the contents of the failed disk on a hot spare.

Mirrored disk pair (x' denotes the replica of block x):
   disk 1: block 1.1 = 2.1', block 1.2 = 2.2', block 1.3 = 2.3', block 1.4 = 2.4', ...
   disk 2: block 2.1 = 1.1', block 2.2 = 1.2', block 2.3 = 1.3', block 2.4 = 1.4', ...
Mirrored disk pair:
   disk 3: block 3.1 = 4.1', block 3.2 = 4.2', block 3.3 = 4.3', block 3.4 = 4.4', ...
   disk 4: block 4.1 = 3.1', block 4.2 = 3.2', block 4.3 = 3.3', block 4.4 = 3.4', ...

Writes are routed to both disks of a pair; reads are optimized for seek time or load balance.
Declustered Mirroring

Layout for four disks (x' denotes the replica of block x; blocks j.m+1, j.m+2, ... form the replica area of disk j):
   disk 1: 1.1 = 2.m+1', 1.2 = 3.m+2', 1.3 = 4.m+3', 1.4 = 2.m+4', ...; 1.m+1 = 4.1', 1.m+2 = 3.2', 1.m+3 = 2.3', 1.m+4 = 4.4', ...
   disk 2: 2.1 = 3.m+1', 2.2 = 4.m+2', 2.3 = 1.m+3', 2.4 = 3.m+4', ...; 2.m+1 = 1.1', 2.m+2 = 4.2', 2.m+3 = 3.3', 2.m+4 = 1.4', ...
   disk 3: 3.1 = 4.m+1', 3.2 = 1.m+2', 3.3 = 2.m+3', 3.4 = 4.m+4', ...; 3.m+1 = 2.1', 3.m+2 = 1.2', 3.m+3 = 4.3', 3.m+4 = 2.4', ...
   disk 4: 4.1 = 1.m+1', 4.2 = 2.m+2', 4.3 = 3.m+3', 4.4 = 1.m+4', ...; 4.m+1 = 3.1', 4.m+2 = 2.2', 4.m+3 = 1.3', 4.m+4 = 3.4', ...

- For group size G, the replicas of the blocks on disk j are placed round-robin on disks j+1, ..., G, 1, ..., j-1.
- Consistent with the layout above, the copy of block j.k of disk j is on disk ((j + ((k-1) mod (G-1))) mod G) + 1.
- Less performance degradation during rebuild, because the contents of the failed disk are read from G-1 disks.
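The round-robin placement can be checked with a short function. The index arithmetic below is one way to express the rule so that it reproduces the four-disk layout shown on this slide; the function name is hypothetical, and disks and blocks are numbered from 1.

```python
def replica_disk(j: int, k: int, group_size: int) -> int:
    """Disk holding the replica of block j.k when replicas of disk j are
    placed round-robin on disks j+1, ..., G, 1, ..., j-1."""
    if group_size < 2 or not 1 <= j <= group_size or k < 1:
        raise ValueError("need G >= 2, 1 <= j <= G, and k >= 1")
    offset = 1 + (k - 1) % (group_size - 1)   # cycles through 1, ..., G-1
    return (j - 1 + offset) % group_size + 1


# Replica disks for blocks 1.1 to 1.4 in a group of four disks.
row_for_disk_1 = [replica_disk(1, k, 4) for k in range(1, 5)]
```

Because the offset never reaches G, a block is never mirrored onto its own disk, and consecutive blocks of one disk land on different partner disks.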
RAID-4: Parity Groups

RAID (redundant arrays of independent disks): lower storage overhead than mirroring, but higher write cost.

- For each block k of disks 1, ..., G, maintain a parity block on a dedicated parity disk G+1.
- Upon a write to block k of disk j:
  new parity (1.k, ..., G.k) on parity disk G+1 := old parity (1.k, ..., G.k) XOR old contents (j.k) XOR new contents (j.k)
- Upon failure of disk j, block j.k can be reconstructed from blocks 1.k, ..., (j-1).k, (j+1).k, ..., G.k and the parity block (G+1).k.
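Parity maintenance and reconstruction can be sketched as follows, assuming that the parity block is the bitwise XOR of the data blocks of its group and that all blocks have the same length. The helper names are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(blocks: Sequence[bytes]) -> bytes:
    """Parity block of a group: XOR over all data blocks."""
    return reduce(xor_blocks, blocks)


def updated_parity(old_parity: bytes, old_contents: bytes, new_contents: bytes) -> bytes:
    """Small-write rule: new parity = old parity XOR old contents XOR new contents."""
    return xor_blocks(xor_blocks(old_parity, old_contents), new_contents)


def reconstruct(surviving_blocks: Sequence[bytes], parity: bytes) -> bytes:
    """Contents of the lost block: XOR of the parity and all surviving data blocks."""
    return reduce(xor_blocks, surviving_blocks, parity)
```

The small-write rule touches only the written block and the parity block, which is where the higher write cost comes from: each logical write turns into two reads and two writes.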
Illustration of RAID-4 (Parity Groups)

[Figure, two panels. During normal operation: the data disks 1, ..., N hold blocks j.1, j.2, j.3, j.4, ...; the parity disk holds one parity block over (1.k, ..., N.k) for each k; the spare disk is unused. During repair: disk 2 has failed, and its blocks 2.1, 2.2, 2.3, ... are rebuilt on the spare disk.]
RAID-5: Parity Striping

Eliminates the bottleneck of a single parity disk by placing the parity blocks of a group round-robin across the group's disks (striping): the parity block for the N blocks with number k resides on disk ((k+N-1) mod (N+1)) + 1.

[Figure: disks 1, ..., N+1; the parity block for the blocks with number 1 is on disk N+1, the one for number 2 on disk 1, the one for number 3 on disk 2, and so on.]
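The placement rule for parity blocks is easy to tabulate. The function name below is hypothetical; disks and block numbers start at 1, as on the slide.

```python
def parity_disk(k: int, n: int) -> int:
    """Disk (1..N+1) holding the parity block for the N data blocks with number k."""
    if n < 1 or k < 1:
        raise ValueError("need N >= 1 and k >= 1")
    return (k + n - 1) % (n + 1) + 1


# Parity placement for block numbers 1 to 6 in a group with N = 4 data blocks.
placement = [parity_disk(k, 4) for k in range(1, 7)]
```

Over any N+1 consecutive block numbers every disk holds exactly one parity block, so the parity-update load is spread evenly.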
Extended RAID Systems

Reducing the small-write penalty:
- parity logging (possibly in safe RAM) to defer and batch parity writes
- floating parity blocks written to convenient tracks (with dynamically adjusted block-mapping table)

Parity block declustering (clustered RAID):
- construct parity blocks for groups of G blocks and spread them uniformly across C > G+1 disks
- shorter rebuild because of lower per-disk extra load in degraded mode

Coping with multiple disk failures:
- use an appropriate error-correcting code (e.g., Reed-Solomon code) (RAID-6) to mask two disk failures within a disk group
Parity-Block Declustering (Clustered RAID)

[Figure: combinatorial block design for C = 5 disks and G = 3, with the data blocks of groups 1 to 5 and the parity blocks parity 1 to parity 5 spread across disks 1 to 5.]

Requirements for the placement of n parity block groups:
- for each group of G+1 blocks, the blocks must be on different disks
- each disk holds n/C parity blocks
- for the m = n(G+1)/C groups represented by the blocks of a given disk, the mG blocks that belong to these groups are evenly distributed across all other C-1 disks
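The three placement requirements can be verified mechanically for a candidate layout. The checker and the sample layout below are hypothetical. The layout is one possible design for C = 5 and G = 3 in which each group omits a different disk, and it is not claimed to be the exact layout of the figure.

```python
from collections import Counter
from typing import List, Tuple

Group = Tuple[Tuple[int, ...], int]   # (disks of the G data blocks, parity disk)


def satisfies_requirements(groups: List[Group], c: int, g: int) -> bool:
    n = len(groups)
    members = [set(data) | {parity} for data, parity in groups]
    # 1. The G+1 blocks of every group are on different disks.
    if any(len(data) != g or len(m) != g + 1
           for (data, _), m in zip(groups, members)):
        return False
    # 2. Every disk holds n/C parity blocks.
    parity_count = Counter(parity for _, parity in groups)
    if any(parity_count[d] * c != n for d in range(1, c + 1)):
        return False
    # 3. Every disk is in m = n(G+1)/C groups, and the other blocks of those
    #    groups are spread evenly over the remaining C-1 disks.
    for d in range(1, c + 1):
        own = [m for m in members if d in m]
        if len(own) * c != n * (g + 1):
            return False
        load = Counter(o for m in own for o in m if o != d)
        if len({load[o] for o in range(1, c + 1) if o != d}) != 1:
            return False
    return True


# Hypothetical design for C = 5, G = 3: group i leaves out one disk.
sample_layout: List[Group] = [
    ((1, 2, 3), 4), ((1, 2, 3), 5), ((2, 4, 5), 1), ((1, 4, 5), 3), ((3, 4, 5), 2),
]
```

A layout with a dedicated parity disk, as in RAID-4, fails the second requirement, which is the property that spreads the rebuild load in degraded mode.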
Rebuild Algorithms

- rebuild the failed disk online without interrupting accesses to the data that resided on the failed disk
- reconstruct blocks of the failed disk on demand
- optimizations:
  - redirect disk-reads to the new disk for blocks that are already rebuilt
  - maintain parity as during normal operation for blocks that are already rebuilt
  - cache blocks that are reconstructed for regular accesses and write them to the new disk when convenient (piggyback rebuilding work on regular disk-reads, thus rebuilding popular blocks early)
Disk-Read Optimization in Degraded Mode

disk-read (block (N+1).k):
   if block (N+1).k has already been rebuilt then
      fetch (block (N+1).k);
   else
      fetch (block 1.k); ...; fetch (block N.k) using the algorithm as during normal operation;
      contents of block (N+1).k := 1.k XOR 2.k XOR ... XOR N.k;
      return the contents of block (N+1).k;
      flush (block (N+1).k) at the discretion of the disk scheduling for disk N+1;
      mark block (N+1).k as rebuilt;
   end /*if*/;
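A minimal in-memory sketch of the degraded-mode read follows. The class `DegradedArray` is hypothetical, disks are modeled as lists of byte strings, and the flush to the spare disk happens immediately rather than at the discretion of a disk scheduler.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class DegradedArray:
    def __init__(self, surviving: List[List[bytes]]):
        self.surviving = surviving   # surviving[d][k]: block k of surviving disk d
        # Spare disk replacing the failed disk; None marks "not yet rebuilt".
        self.spare: List[Optional[bytes]] = [None] * len(surviving[0])

    def read_failed_block(self, k: int) -> bytes:
        rebuilt = self.spare[k]
        if rebuilt is not None:
            return rebuilt           # already rebuilt: one read from the spare disk
        # Reconstruct from block k of all surviving disks (data and parity).
        contents = reduce(xor_blocks, (disk[k] for disk in self.surviving))
        self.spare[k] = contents     # piggyback the rebuild on this regular read
        return contents
```

Frequently read blocks are thus rebuilt first, and later reads of the same block cost a single disk access again.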
Disk-Write Optimization in Degraded Mode

disk-write (block (N+1).k):
   if block (N+1).k has already been rebuilt then
      fetch (block (N+1).k) unless the block is still available in RAM;
      fetch (parity block j.k of the parity group to which (N+1).k belongs);
   else
      fetch (block 1.k); ...; fetch (block N.k);
      old contents of block (N+1).k := 1.k XOR 2.k XOR ... XOR N.k;
      let j.k be the parity block of this parity group;
   end /*if*/;
   compute new parity block j.k := old contents of block j.k XOR old contents of block (N+1).k XOR new contents of block (N+1).k;
   flush (block (N+1).k) using the block's new contents;
   flush (block j.k) using new parity as block contents;
   mark block (N+1).k as rebuilt;
Optimized Online Rebuild Algorithm

rebuild (disk N+1) on spare disk:
   for each block k of the failed disk N+1 do
      if the block has not yet been rebuilt then
         disk-write (block (N+1).k) using the algorithm for disk-writes in degraded mode, with low priority for the resulting fetch and flush I/O requests;
      end /*if*/;
   end /*for*/;
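The degraded-mode write and the rebuild loop can be sketched together. The class `RebuildingArray` is hypothetical. The sketch assumes that the failed disk held only data blocks and that the parity of every row sits on one surviving disk (RAID-4 style), and it ignores I/O priorities.

```python
from functools import reduce
from typing import Dict, List, Optional


def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class RebuildingArray:
    def __init__(self, surviving: Dict[int, List[bytes]], parity_disk: int):
        self.surviving = surviving       # disk id -> blocks, including the parity disk
        self.parity_disk = parity_disk   # id of the surviving disk that holds parity
        num_blocks = len(surviving[parity_disk])
        self.spare: List[Optional[bytes]] = [None] * num_blocks  # None = not rebuilt

    def write_failed_block(self, k: int, new_contents: Optional[bytes] = None) -> None:
        old = self.spare[k]
        if old is None:                  # reconstruct the old contents from the survivors
            old = reduce(xor_blocks, (disk[k] for disk in self.surviving.values()))
        if new_contents is None:         # pure rebuild: write back the old contents
            new_contents = old
        parity = self.surviving[self.parity_disk]
        parity[k] = xor_blocks(xor_blocks(parity[k], old), new_contents)
        self.spare[k] = new_contents     # storing the block also marks it as rebuilt

    def rebuild(self) -> None:
        for k in range(len(self.spare)):
            if self.spare[k] is None:    # skip blocks rebuilt by regular accesses
                self.write_failed_block(k)
```

Blocks that regular writes have already placed on the spare disk are skipped, so the background pass only fills in the remainder.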
Specific Considerations for Disaster Recovery

- Backup resides at a remote site
- Maintain the archive log at the remote site by log shipping:
  - within distributed transactions (or even replicate the database remotely), or
  - without transactional control, but preserving the serialization order of log entries (with the risk of losing the tail of the log)
- Backup server could even be a “hot standby” (with failover similar to a data-sharing cluster architecture)
Lessons Learned

- The redo-history recovery algorithm is also appropriate for media recovery, based on a backup database and an archive log: MediaRecoveryLSN marks the log-truncation point and the redo starting point
- Log-based media recovery is the most versatile method; storage-redundancy techniques are attractive for continuous availability
- Mirroring (with declustering) and RAID-5 are commodities; clustered RAID is the best technique in terms of MTTDL and MTTR, but complex to implement (needs block design)
- Disaster recovery can adopt media recovery techniques with a remote backup/replication site