1
Database Recovery – Heisenbugs, Houdini’s Tricks, and the Paranoia of Science
(Rationale, State of the Art, and Research Directions)
Gerhard Weikum
weikum@cs.uni-sb.de
http://www-dbs.cs.uni-sb.de
Outline:
- What and Why?
- Where and How?
  - Correct Undo
  - Correct and Efficient Redo
- Quantitative Guarantees
2
Why This Topic?
- exotic and underrated
- fits well with the theme of the graduate studies program
- critical for dependable Internet-based e-services
- I wrote a textbook on the topic
Paranoia of science: hope (and strive) for the best, but prepare for the worst.
"All hardware and software systems must be verified." (Wolfgang Paul, U Saarland)
"Software testing (and verification) doesn't scale." (James Hamilton, MS and ex-IBM)
"If you can't make a problem go away, make it invisible." (David Patterson, UC Berkeley – Recovery-Oriented Computing)
Recovery: repair data after software failures
3
Why Exactly is Recovery Needed?
[Figure: database cache and log buffer in volatile memory above the stable database and stable log on stable storage; pages (p, q, z, b, with LSNs such as 88, 155, 215–219 in their headers) move between cache and stable database via fetch and flush, and the log buffer is written to the stable log via force. The stable log tail holds 215 write(b,t1), 216 write(q,t2), 217 write(z,t1), 218 commit(t2), 219 write(q,t1), 220 begin(t3).]
Atomic transactions (consistent state transitions on the db)
Log entries: physical (before- and after-images), logical (record ops), or physiological (page transitions), timestamped by LSNs
4
How Does Recovery Work?
- Analysis pass: determine winners vs. losers (by scanning the stable log)
- Redo pass: redo winner writes (by applying the logged op)
- Undo pass: undo loser writes (by applying the inverse op)
[Figure: with the stable log 215 write(b,t1), 216 write(q,t2), 217 write(z,t1), 218 commit(t2), analysis yields losers {t1} and winners {t2}; recovery then performs redo(216, write(q,t2)) and undo(217, write(z,t1)), undo(215, write(b,t1)), consulting the LSN stored in each page header.]
The LSN in the page header implements a testable state.
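The LSN test is what makes redo idempotent: a logged update is reapplied only if the page on disk is older than the log entry, so replaying the same entry twice is harmless. A minimal sketch in C, with illustrative names not taken from the slides:

#include <stdint.h>

typedef uint64_t lsn_t;

typedef struct page {
    lsn_t page_seq_no;            /* LSN of the last update applied to this page */
    char  contents[8192];
} page_t;

typedef struct log_entry {
    lsn_t log_seq_no;             /* LSN assigned when the entry was appended */
    void (*redo)(struct page *);  /* logged operation, replayable on the page */
} log_entry_t;

/* Conditional redo: the LSN comparison implements the testable state. */
void redo_if_needed(page_t *p, const log_entry_t *e) {
    if (p->page_seq_no < e->log_seq_no) {
        e->redo(p);
        p->page_seq_no = e->log_seq_no;
    }
}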
5
Heisenbugs and the Recovery Rationale
"Self-healing" recovery = amnesia + data repair (from "trusted" store) + re-init
Failure causes:
- power: 2,000 h MTTF (better with UPS)
- chips: 100,000 h MTTF
- system software: 200–1,000 h MTTF
- telecom lines: 4,000 h MTTF
- admin error: ??? (better with auto-admin)
- disks: 800,000 h MTTF
- environment: > 20,000 h MTTF
Transient software failures are the main problem: Heisenbugs (non-repeatable exceptions caused by the stochastic confluence of very rare events).
Failure model for crash recovery:
- fail-stop (no dynamic salvation/resurrection code)
- soft failures (stable storage survives)
6
Goal: Continuous Availability
Business apps and e-services demand 24 x 7 availability; 99.999 % availability would be acceptable (5 min outage/year).
[Figure: two-state model; the system goes from up to down at rate 1/MTTF and from down to up at rate 1/MTTR.]
Stationary availability = MTTF / (MTTF + MTTR)
Downtime costs (per hour):
- Brokerage operations: $6.4 million
- Credit card authorization: $2.6 million
- Ebay: $225,000
- Amazon: $180,000
- Airline reservation: $89,000
- Cell phone activation: $41,000
(Source: Internet Week, 4/3/2000)
State of the art: DB servers 99.99 to 99.999 %; Internet e-services (e.g., Ebay) 99 %
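A quick worked example of the availability formula, with illustrative numbers that are not from the slides: a server that fails about once a month (MTTF ≈ 720 h) and restarts in 4 minutes (MTTR ≈ 0.067 h) achieves

\[ A = \frac{\mathrm{MTTF}}{\mathrm{MTTF}+\mathrm{MTTR}} = \frac{720}{720 + 0.067} \approx 0.99991, \]

i.e., roughly "four nines". With the failure rate fixed, the remaining lever is MTTR, which is why the rest of the talk is about bounding restart (redo) time.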
7
What Do We Need to Optimize for?
- Correctness (and simplicity) in the presence of many subtleties
- Fast restart (for high availability), by bounding the log and minimizing page fetches
- Low overhead during normal operation, by minimizing forced log writes (and page flushes)
- High transaction concurrency during normal operation
Tradeoffs & tensions among these goals!
8
State of the Art: Houdini’s Tricks
In practice: tricks, patents, and recovery gurus.
Very little work on verification, and only for toy algorithms:
- C. Wallace / N. Soparkar / Y. Gurevich
- D. Kuo / A. Fekete / N. Lynch
- C. Martin / K. Ramamritham
9
Outline
- What and Why?
- Where and How?
  - Correct Undo
  - Correct and Efficient Redo
- Quantitative Guarantees
Murphy's Law: Whatever can go wrong, will go wrong.
10
Data Structures for Logging & Recovery

type Page: record of
    PageNo: id;
    PageSeqNo: id;
    Status: (clean, dirty);
    Contents: array [PageSize] of char;
end;
persistent var StableDatabase: set[PageNo] of Page;
var DatabaseCache: set[PageNo] of Page;

type LogEntry: record of
    LogSeqNo: id;
    TransId: id;
    PageNo: id;
    ActionType: (write, full-write, begin, commit, rollback);
    UndoInfo: array of char;
    RedoInfo: array of char;
    PreviousSeqNo: id;
end;
persistent var StableLog: list[LogSeqNo] of LogEntry;
var LogBuffer: list[LogSeqNo] of LogEntry;

Modeled in a functional manner with a test operation on states:
write s on page p is in StableDatabase ⟺ StableDatabase[p].PageSeqNo ≥ s
write s on page p is in CachedDatabase ⟺
    ( (p ∈ DatabaseCache ∧ DatabaseCache[p].PageSeqNo ≥ s)
    ∨ (p ∉ DatabaseCache ∧ StableDatabase[p].PageSeqNo ≥ s) )
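The same structures rendered in C, as a hedged sketch (the type widths, PAGE_SIZE, and the pointer-based undo/redo fields are assumptions for illustration):

#include <stdint.h>

#define PAGE_SIZE 8192

typedef uint64_t lsn_t;
typedef uint32_t page_no_t;
typedef uint32_t trans_id_t;

typedef enum { PAGE_CLEAN, PAGE_DIRTY } page_status_t;
typedef enum { ACT_WRITE, ACT_FULL_WRITE, ACT_BEGIN, ACT_COMMIT, ACT_ROLLBACK } action_t;

typedef struct {
    page_no_t     page_no;
    lsn_t         page_seq_no;     /* LSN of the last write applied here */
    page_status_t status;
    char          contents[PAGE_SIZE];
} page_t;

typedef struct {
    lsn_t      log_seq_no;
    trans_id_t trans_id;
    page_no_t  page_no;
    action_t   action_type;
    char      *undo_info;          /* inverse op / before-image */
    char      *redo_info;          /* logged op / after-image   */
    lsn_t      previous_seq_no;    /* backward chain within the transaction */
} log_entry_t;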
11
Correctness Criterion
A crash recovery algorithm is correct if it guarantees that, after a system failure, the cached database will eventually be equivalent (i.e., reducible) to a serial execution of the committed transactions that coincides with the serialization order of the history.
12
Simple Redo Pass

redo pass ( ):
    min := LogSeqNo of oldest log entry in StableLog;
    max := LogSeqNo of most recent log entry in StableLog;
    for i := min to max do
        if StableLog[i].TransId not in losers then
            pageno := StableLog[i].PageNo;
            fetch (pageno);
            case StableLog[i].ActionType of
                full-write:
                    full-write (pageno) with contents from StableLog[i].RedoInfo;
                write:
                    if DatabaseCache[pageno].PageSeqNo < i then
                        read and write (pageno) according to StableLog[i].RedoInfo;
                        DatabaseCache[pageno].PageSeqNo := i;
                    end /*if*/;
            end /*case*/;
        end /*if*/;
    end /*for*/;
13
Correctness of Simple Redo & Undo
Invariants during redo pass (compatible with serialization): [formula shown as an image on the original slide]
Invariant of undo pass (based on serializability / reducibility of the history): [formula shown as an image on the original slide]
14
Redo Optimization 1: Log Truncation & Checkpoints
Continuously advance the log begin (garbage collection): for redo, all entries for page p that precede the last flush action for p can be dropped; call that position RedoLSN(p).
SystemRedoLSN := min{RedoLSN(p) | dirty page p}
Track dirty cache pages and periodically write a DirtyPageTable (DPT) into a checkpoint log entry:

type DPTEntry: record of
    PageNo: id;
    RedoSeqNo: id;
end;
var DirtyPages: set[PageNo] of DPTEntry;

+ add potentially dirty pages during the analysis pass
+ a flush-behind demon flushes dirty pages in the background
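A minimal sketch of deriving the redo starting point from the DPT (the names and the flat-array representation are illustrative):

#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;

typedef struct {
    uint32_t page_no;
    lsn_t    redo_seq_no;   /* oldest log entry that may need redo for this page */
} dpt_entry_t;

/* The stable log can be truncated below the returned LSN: every older
   entry is already reflected in the stable database. */
lsn_t system_redo_lsn(const dpt_entry_t *dpt, size_t n) {
    lsn_t min = UINT64_MAX;           /* empty DPT: nothing to redo */
    for (size_t i = 0; i < n; i++)
        if (dpt[i].redo_seq_no < min)
            min = dpt[i].redo_seq_no;
    return min;
}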
15
Redo Pass with CP and DPT

redo pass ( ):
    cp := MasterRecord.LastCP;
    SystemRedoLSN := min{cp.DirtyPages[p].RedoSeqNo};
    max := LogSeqNo of most recent log entry in StableLog;
    for i := SystemRedoLSN to max do
        if (StableLog[i].ActionType = write or full-write)
        and StableLog[i].TransId not in losers then
            pageno := StableLog[i].PageNo;
            if pageno in DirtyPages and i >= DirtyPages[pageno].RedoSeqNo then
                fetch (pageno);
                if DatabaseCache[pageno].PageSeqNo < i then
                    read and write (pageno) according to StableLog[i].RedoInfo;
                    DatabaseCache[pageno].PageSeqNo := i;
                else
                    DirtyPages[pageno].RedoSeqNo := DatabaseCache[pageno].PageSeqNo + 1;
                end;
            end;
        end;
    end;
16
Example for Redo with CP and DPT
[Figure: transactions t1–t5 issue 10:w(a), 20:w(b), 30:w(c), 40:w(d), 50:w(d), 60:w(a), 70:w(d), 80:w(b), with flush(a), flush(d), flush(b) in between; the checkpoint's DPT records b: RedoLSN=20, c: RedoLSN=30, d: RedoLSN=70, and the stable database pages carry LSNs 7, 50, 60, 80. At the first restart, the analysis pass rebuilds the DPT and the redo pass starts at the SystemRedoLSN, skipping redo of log entries whose effects already reached the stable database.]
17
Benefits of Redo Optimization
- can save page fetches
- can plan a prefetching schedule for the dirty pages (order and timing of page fetches) to minimize disk-arm seek time and rotational latency: seek optimization, rotational optimization, or global optimization (a TSP based on t_seek + t_rot); a minimal sketch follows below
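A hedged sketch of the simplest variant, seek optimization: sort the pending dirty-page fetches by disk position so the arm sweeps once across the disk instead of zig-zagging. (The disk model and names are illustrative; weighing in rotational position turns the ordering problem into the TSP mentioned above.)

#include <stdlib.h>

typedef struct {
    unsigned page_no;
    unsigned cylinder;    /* assumed disk position of the page */
} fetch_req_t;

static int by_cylinder(const void *a, const void *b) {
    const fetch_req_t *x = a, *y = b;
    return (x->cylinder > y->cylinder) - (x->cylinder < y->cylinder);
}

/* Order the redo prefetch queue as one elevator sweep over the disk. */
void plan_prefetch(fetch_req_t *reqs, size_t n) {
    qsort(reqs, n, sizeof *reqs, by_cylinder);
}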
18
Redo Optimization 2: Flush Log Entries
During normal operation: log flush actions (without forcing the log buffer).
During the analysis pass of restart: construct a more recent DPT.

analysis pass ( ) returns losers, DirtyPages:
    ...
    if (StableLog[i].ActionType = write or full-write)
    and StableLog[i].PageNo not in DirtyPages then
        DirtyPages += StableLog[i].PageNo;
        DirtyPages[StableLog[i].PageNo].RedoSeqNo := i;
    end;
    if StableLog[i].ActionType = flush then
        DirtyPages -= StableLog[i].PageNo;
    end;
    ...

Advantages:
- can save many page fetches
- allows better prefetch scheduling
19
Example for Redo with Flush Log Entries
[Figure: the same schedule as in the previous example (10:w(a) ... 80:w(b) with flush(a), flush(d), flush(b) and the checkpoint DPT b: RedoLSN=20, c: RedoLSN=30, d: RedoLSN=70), but now the flushes appear in the log; the analysis pass therefore removes flushed pages from the DPT, and the redo pass can skip their log entries without fetching the pages.]
20
Redo Optimization 3: Full-write Log Entries
During normal operation: "occasionally" log the full after-image of a page p (it absorbs all previous writes to that page).
During the analysis pass of restart: construct an enhanced DPT with DirtyPages[p].RedoSeqNo := LogSeqNo of the full-write.
During the redo pass of restart: all log entries for a page that precede its most recent full-write log entry can be ignored.
Advantages:
- can skip many log entries
- can plan a better schedule for page fetches
- allows better log truncation
(A sketch of the analysis-pass bookkeeping follows below.)
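A minimal sketch of how the analysis pass could combine write, full-write, and flush entries when rebuilding the DPT (illustrative names; the DPT is modeled as an array indexed by page number for brevity):

#include <stdint.h>

typedef uint64_t lsn_t;
typedef enum { ACT_WRITE, ACT_FULL_WRITE, ACT_FLUSH } action_t;

typedef struct { int present; lsn_t redo_seq_no; } dpt_slot_t;

/* A write dirties a page at its first LSN, a full-write resets the page's
   RedoSeqNo (earlier entries are absorbed), and a flush entry removes the
   page from the DPT entirely. */
void analysis_step(dpt_slot_t *dpt, uint32_t page, action_t a, lsn_t lsn) {
    switch (a) {
    case ACT_WRITE:
        if (!dpt[page].present) {
            dpt[page].present = 1;
            dpt[page].redo_seq_no = lsn;
        }
        break;
    case ACT_FULL_WRITE:
        dpt[page].present = 1;
        dpt[page].redo_seq_no = lsn;   /* redo for this page starts here */
        break;
    case ACT_FLUSH:
        dpt[page].present = 0;
        break;
    }
}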
21
Correctness of Optimized Redo
Invariant during the optimized redo pass: [formula shown as an image on the original slide]
Builds on correct DPT construction during the analysis pass.
22
Why is Page-based Redo Logging Good Anyway?
- much faster than logical redo: logical redo would require replaying high-level operations with many page fetches (random I/O)
- can be applied to selected pages independently
- can exploit perfectly scalable parallelism for redo of a large multi-disk db
- can efficiently reconstruct corrupted pages when disk tracks have errors (without full media recovery for the entire db)
- testable state with very low overhead
23
Outline
- What and Why?
- Where and How?
  - Correct Undo
  - Correct and Efficient Redo
- Quantitative Guarantees
Corollary to Murphy's Law: Murphy was an optimist.
24
Winners Follow Losers
[Figure: t1: 10:w(a); t2: 20:w(a); t3: 30:w(a); t4: 40:w(b), 50:w(a), then rollback; t5: 60:w(a) after the restart completes; a crash interrupts the sequence.]
Occurs because of:
- rollbacks during normal operation
- repeated crashes
- soft crashes between a backup and a media failure
- concurrency control with commutative ops on the same page
25
Winners Follow Losers (cont.)
[Figure: page a starts at LSN 2; redo(10: w(a), t1) raises it to 10, redo(30: w(a), t3) to 30, redo(60: w(a), t5) to 60; undo(50: w(a), t4) then has no correct LSN to stamp into the page (the slide shows the artificial value 49).]
Problems:
- old losers prevent log truncation
- LSN-based testable state becomes infeasible
26
Treat Completed Losers as Winners and Redo History
[Figure: the schedule 10:w(a), 20:w(a), 30:w(a), 40:w(b), 50:w(a), 60:w(a) of t1–t5, where the rollbacks append the compensating writes 25: w⁻¹(a), 55: w⁻¹(a), and 57: w⁻¹(b) to the log.]
Solution:
- Undo generates compensation log entries (CLEs)
- Undo steps are redone during restart
27
CLE Backward Chaining for Bounded Log Growth (& More Goodies)
[Figure: loser t4 with begin t4, 40: w(b), 50: w(a) is rolled back via CLE 55: w⁻¹(a) and CLE 57: w⁻¹(b), after which t4 is completed (commit t4).]
Each CLE points to the predecessor of the inverted write, so repeated undo never compensates a compensation and the log cannot grow unboundedly across crashes.
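A hedged C sketch of CLE creation during undo, matching the pseudocode on the next slide: the CLE's NextUndoSeqNo points to the predecessor of the inverted write, so a restart that crashes in mid-undo resumes behind the entries that are already compensated instead of compensating them again. (The names and the log-append stub are assumptions for illustration.)

#include <stdint.h>
#include <stdlib.h>

typedef uint64_t lsn_t;

typedef struct {
    lsn_t log_seq_no;
    lsn_t previous_seq_no;    /* backward chain of the transaction */
    lsn_t next_undo_seq_no;   /* CLEs only: where undo continues   */
    int   is_cle;
} log_entry_t;

/* Stand-in for the real log-append primitive. */
static lsn_t next_lsn = 100;
static lsn_t append_to_log(log_entry_t *e) { (void)e; return next_lsn++; }

/* Compensate one write of a loser transaction; returns the LSN at which
   undo must continue, i.e., the inverted write's predecessor. */
lsn_t compensate(const log_entry_t *to_undo, lsn_t trans_last_lsn) {
    log_entry_t *cle = calloc(1, sizeof *cle);
    cle->is_cle = 1;
    cle->previous_seq_no  = trans_last_lsn;
    cle->next_undo_seq_no = to_undo->previous_seq_no;  /* skip the undone write */
    cle->log_seq_no = append_to_log(cle);
    /* ...apply the inverse operation (UndoInfo) to the page here... */
    return cle->next_undo_seq_no;
}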
29
Undo Algorithm

...
while exists t in losers with losers[t].LastSeqNo <> nil do
    nexttrans := TransNo in losers such that
        losers[nexttrans].LastSeqNo = max {losers[x].LastSeqNo | x in losers};
    nextentry := losers[nexttrans].LastSeqNo;
    case StableLog[nextentry].ActionType of
        compensation:
            losers[nexttrans].LastSeqNo := StableLog[nextentry].NextUndoSeqNo;
        write:
            ...
            newentry.LogSeqNo := new sequence number;
            newentry.ActionType := compensation;
            newentry.PreviousSeqNo := ActiveTrans[transid].LastSeqNo;
            newentry.NextUndoSeqNo := StableLog[nextentry].PreviousSeqNo;
            ActiveTrans[transid].LastSeqNo := newentry.LogSeqNo;
            LogBuffer += newentry;
            ...
30
Correctness of Undo Pass
Invariant of the undo pass (assuming tree reducibility of the history): [formula shown as an image on the original slide]
31
Outline
- What and Why?
- Where and How?
  - Correct Undo
  - Correct and Efficient Redo
- Quantitative Guarantees
There are three kinds of lies: lies, damned lies, and statistics. (Benjamin Disraeli)
97.3 % of all statistics are made up. (Anonymous)
32
Towards Quantitative Guarantees
Online control for a given system configuration:
- observe the length of the stable log and the dirty-page fraction
- estimate the redo time based on these observations
- take countermeasures if the estimated redo time becomes too long (log truncation, flush-behind demon priority)
Planning of the system configuration:
- (stochastically) predict the (worst-case) redo time as a function of workload and system characteristics
- configure the system (cache size, caching policy, #disks, flush-behind demon, etc.) to guarantee an upper bound
(A crude first-order estimator is sketched below.)
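A deliberately crude sketch of the online estimate (all names and rates are illustrative assumptions): redo time is roughly the sequential scan of the log tail plus the random I/O for the dirty pages. If the estimate exceeds the availability budget, the system can trigger a checkpoint or raise the flush-behind demon's priority.

/* Hedged first-order estimate of restart redo time.
   log_bytes:     stable log length beyond the SystemRedoLSN
   dirty_pages:   current number of entries in the dirty page table
   scan_bps:      sequential log-scan bandwidth (bytes/second)
   fetch_seconds: average random fetch time per page               */
double estimate_redo_seconds(double log_bytes, double dirty_pages,
                             double scan_bps, double fetch_seconds) {
    return log_bytes / scan_bps + dirty_pages * fetch_seconds;
}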
33
Key Factors for Bounded Redo Time
[Figure: timeline from checkpoint to crash for pages 1–4; each page alternates between "in cache & dirty", "in cache & clean", and "not in cache" as write accesses and flushes occur.]
1) # dirty pages → # db I/Os for redo
2) longest dirty lifetime → log scan time
3) # log entries for dirty pages at crash time → # redo steps
34
Analysis of DB Page Fetches During Redo
N pages with: reference probabilities β₁ ... β_N (β_i ≥ β_{i+1}), write probabilities ω₁ ... ω_N, cache residence probabilities c₁ ... c_N, and per-page flush probabilities f₁ ... f_N = P[flush-behind demon flushes the page within time 1/λ], given the page reference arrival rate λ and the I/O rate per disk.
Per-page Markov model with states 1: out (of cache), 2: clean (in cache), 3: dirty (in cache) and transition probabilities
p₁₁ = 1 − c_i, p₁₂ = c_i,
p₂₁ = 1 − c_i, p₂₂ = c_i (1 − ω_i), p₂₃ = c_i ω_i,
p₃₁ = 1 − c_i, p₃₂ = c_i f_i, p₃₃ = c_i (1 − f_i).
Solve the stationary equations for d_i := P[page i is dirty] = π₃, with pages permuted such that d_i ≥ d_{i+1}.
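For completeness, the stationary equations of this three-state chain and the closed form they yield for d_i (a reconstruction from the transition probabilities above; the slide shows only the setup):

\[ \pi_j \;=\; \sum_{k \in \{1,2,3\}} \pi_k\, p_{kj}, \qquad \sum_{j=1}^{3} \pi_j \;=\; 1 . \]

Since every state returns to state 1 with probability 1 − c_i, we get π₁ = 1 − c_i and π₂ + π₃ = c_i; the balance equation π₃ = π₂ c_i ω_i + π₃ c_i (1 − f_i) then gives

\[ d_i \;=\; \pi_3 \;=\; \frac{c_i^2\, \omega_i}{\,1 - c_i(1 - f_i) + c_i \omega_i\,} . \]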
35
Subproblem: Analysis of Cache Residence
[Figure: a page reference string observed backward from "now", with the backward distance b_K to the K-th most recent reference of a page.]
With LRU-K cache replacement (e.g., K = 2) and cache size M:
N pages with reference probabilities β₁ ... β_N (β_i ≥ β_{i+1}) and write probabilities ω₁ ... ω_N; derive the cache residence probabilities c₁ ... c_N.
36
Subproblem: Analysis of Flush Probability
[Figure: request arrivals and departures over time at a disk queue.]
Exhaustive-service vacations: whenever the disk queue is empty, invoke the flush-behind demon for a time period T.
M/G/1 vacation-server model for each disk: response time R = wait time + service time S, with vacation time T and utilization ρ.
Laplace transforms of the response time R are available (see Takagi 1991); choose the maximum T such that P[R > t] ≤ ε (Chernoff bound), e.g., t = 10 ms, ε = 0.05.
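The slide's formulas did not survive extraction; the standard result it builds on (Takagi 1991) is the decomposition of the waiting time in an M/G/1 queue with server vacations. Writing S*(s) and V*(s) for the Laplace-Stieltjes transforms of the service and vacation times, a textbook reconstruction (not copied from the slide) is

\[ W^*(s) \;=\; \frac{(1-\rho)\,s}{\,s - \lambda\bigl(1 - S^*(s)\bigr)\,} \cdot \frac{1 - V^*(s)}{s\, \mathrm{E}[V]}, \qquad R^*(s) \;=\; W^*(s)\, S^*(s), \qquad \rho = \lambda\, \mathrm{E}[S], \]

i.e., the ordinary M/G/1 waiting time plus an independent residual-vacation delay. A Chernoff-style bound P[R > t] ≤ e^{−θt} R*(−θ), for θ > 0 within the region of convergence, then yields the largest vacation period T that still meets the response-time target.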
37
So What?
Predictability: given the workload and the configuration, we can predict the performance.
This, e.g., allows us to auto-tune the system: given the workload and the performance goals, derive a suitable configuration.
38
Outline
- What and Why?
- Where and How?
  - Correct Undo
  - Correct and Efficient Redo
- Quantitative Guarantees
More than any time in history mankind faces a crossroads. One path leads to despair and utter hopelessness, the other to total extinction. Let us pray that we have the wisdom to choose correctly. (Woody Allen)
39
Where Can We Contribute?
- turn the sketch of correctness reasoning into verified recovery code
- turn the sketch of restart-time analysis into a quality guarantee
- address all other aspects of recovery
- bigger picture of dependability and quality guarantees
40
L1LogRecMaster *InitL1LogIOMgr (Segment *logsegment, int LogSize,
                                int LogPageSize, int newlog)
{
    L1LogRecMaster *recmgr = NULL;
    BitVector p = 1;
    int i;

    NEW (recmgr, 1);
    recmgr->Log = logsegment;
    A_LATCH_INIT (&(recmgr->BufferLatch));
    A_LATCH_INIT (&(recmgr->UsedListLatch));
    A_LATCH_INIT (&(recmgr->FixCountLatch));
    recmgr->FixCount = 0;
    recmgr->StoreCounter = 0;
    recmgr->PendCount = 0;
    A_SEM_INIT (&(recmgr->PendSem), 0);
    NEW (recmgr->Frame, BUFFSIZE);
    recmgr->Frame[0] = (LogPageCB *) MALLOC (LogPageSize * BUFFSIZE);
    for (i = 1; i < BUFFSIZE; i++)
        recmgr->Frame[i] = (LogPageCB *) (((char *) recmgr->Frame[i - 1]) + LogPageSize);
    recmgr->InfoPage = (InfoPageCB *) MALLOC (LogPageSize);
    recmgr->UsedList = InitHash (HASHSIZE, HashPageNo, NULL, NULL, ComparePageNo, 4);
    recmgr->ErasedFragList = InitHash (HASHSIZE, HashPageNo, NULL, NULL, ComparePageNo, 4);
    for (i = 0; i < sizeof(BitVector) * 8; i++) {
        recmgr->Mask[i] = p;
        p = p * 2;
    }
    if (newlog) {
        recmgr->InfoPage->LogSize = LogSize;
        recmgr->InfoPage->PageSize = LogPageSize;
        recmgr->InfoPage->Key = 0;
        recmgr->InfoPage->ActLogPos = 1;
        recmgr->InfoPage->ActBufNr = 0;
        recmgr->InfoPage->ActSlotNr = 0;
        InitLogFile (recmgr, logsegment);
        LogIOWriteInfoPage (recmgr);
    } else
        ReadInfoPage (recmgr);
    return (recmgr);
} /* end InitL1LogIOMgr */
41
static int
log_fill(dblp, lsn, addr, len)
    DB_LOG *dblp;
    DB_LSN *lsn;
    void *addr;
    u_int32_t len;
{
    ...
    while (len > 0) {                 /* Copy out the data. */
        if (lp->b_off == 0)
            lp->f_lsn = *lsn;
        if (lp->b_off == 0 && len >= bsize) {
            nrec = len / bsize;
            if ((ret = __log_write(dblp, addr, nrec * bsize)) != 0)
                return (ret);
            addr = (u_int8_t *)addr + nrec * bsize;
            len -= nrec * bsize;
            ++lp->stat.st_wcount_fill;
            continue;
        }
        /* Figure out how many bytes we can copy this time. */
        remain = bsize - lp->b_off;
        nw = remain > len ? len : remain;
        memcpy(dblp->bufp + lp->b_off, addr, nw);
        addr = (u_int8_t *)addr + nw;
        len -= nw;
        lp->b_off += nw;
        /* If we fill the buffer, flush it. */
        if (lp->b_off == bsize) {
            if ((ret = __log_write(dblp, dblp->bufp, bsize)) != 0)
                return (ret);
            lp->b_off = 0;
            ++lp->stat.st_wcount_fill;
        }
    }
    return (0);
}
42
Where Can We Contribute?
- turn the sketch of correctness reasoning into verified recovery code
- turn the sketch of restart-time analysis into a quality guarantee
- address all other aspects of recovery
- bigger picture of dependability and quality guarantees
43
What Else is in the Jungle?
- more optimizations for logging and recovery of specific data structures (e.g., B+-tree indexes)
- incremental restart (early admission of new transactions while redo or undo is still in progress)
- failover recovery for clusters with shared disks and distributed memory
- media recovery, remote replication, and configuration for availability goals (M. Gillmann)
- process and message recovery for failure masking to applications and users (G. Shegalov, collaboration with MSR)
44
Where Can We Contribute?
- turn the sketch of correctness reasoning into verified recovery code → sw engineering for verification
- turn the sketch of restart-time analysis into a quality guarantee → sys architecture for predictability
- address all other aspects of recovery
- bigger picture of dependability and quality guarantees