1 A Theory of Redo Recovery David Lomet Microsoft Research, Redmond Mark Tuttle HP Research, Cambridge.

2 Big Picture
- Redo recovery requires
  - Good db state
  - Replay of the right operations
- Good state updates: conflict order not required
  - Write-read conflicts can be ignored
  - Some db "variables" irrelevant (don't need to update them)
- Synchronize state update & ops replayed
  - Captured in recovery invariant
  - We prove that maintaining invariant ⇒ recovery
- Current recovery methods: maintain invariant
  - Show how current methods work (e.g. ARIES redo)
  - Show how "new" methods could work
- Much simpler than our VLDB'95 paper

3 Conflict State Graph (CSG)
- Conflict graph ("borrowed" from concurrency control)
  - Nodes are log operations; edges are conflicts (RW, WR, WW)
- State graph SG
  - Add writes(node): { …} of vars updated
  - State for SG: { | in writes(n) and n is last node in state graph with x in vars(n)}
- Final state S_final of CSG is desired recovered state
- Any prefix of a state graph is a state graph
  - Prefix: node in prefix ⇒ predecessor in prefix
- State of any prefix of CSG can be recovered by
  - Replaying operations in suffix in conflict graph order
- We will relax CSG requirements
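
The conflict graph described on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' construction: the three operations and their read and write sets are assumptions chosen to resemble the O, P, Q example used elsewhere in the talk, and `conflict_edges` is a hypothetical helper name.

```python
from itertools import combinations

# Hypothetical log, in log order: (name, readset, writeset).
# The read and write sets are assumptions made for this illustration.
LOG = [
    ("O", {"x"}, {"x"}),
    ("P", {"x"}, {"y"}),
    ("Q", {"x"}, {"x"}),
]

def conflict_edges(log):
    """Return {(earlier, later): set of conflict kinds} for a log."""
    edges = {}
    # combinations() keeps log order, so the first op of each pair is earlier.
    for (a, ra, wa), (b, rb, wb) in combinations(log, 2):
        kinds = set()
        if wa & rb:
            kinds.add("WR")  # earlier op writes what the later op reads
        if ra & wb:
            kinds.add("RW")  # earlier op reads what the later op writes
        if wa & wb:
            kinds.add("WW")  # both ops write the same variable
        if kinds:
            edges[(a, b)] = kinds
    return edges

EDGES = conflict_edges(LOG)
```

Under these assumed read and write sets, O to P is a pure write-read edge, P to Q is a pure read-write edge, and O to Q carries all three conflict kinds.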

4 Conflict State Graph & States
[Figure: conflict state graph with nodes O, P and Q, each shown with readset {x} and writes { }. States shown: x=0,y=0; x=1,y=0; x=1,y=2; S_final: x=3,y=2. Edge labels: write-read edge; write-read & write-write & read-write edge; read-write edge.]

5 Installation Graph
- Example: initial stable state: {x=0, y=0}
  - O: x ← x+1
  - P: y ← x+1
  - After O, P, state is {x=1, y=2}
  - Flush y (written by P) to disk: stable state is {x=0, y=2}
  - Replay O: generates correct state {x=1, y=2}
  - O's readset x unchanged by P's installation
  - Even though write-read edge orders P after O
- Installation graph: conflict graph without write-read edges
- Installation state graph (ISG): same writes(n) for node n as conflict state graph
  - State of any prefix of ISG can be recovered
  - More prefixes (states) because of fewer edges
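
The example on this slide can be checked mechanically. The sketch below assumes an initial state of x=0, y=0 and models O and P as Python functions on a state dictionary. The helper `installation_edges` is a hypothetical name for "drop the write-read conflicts", not something taken from the paper.

```python
def op_O(s):
    """O: x <- x+1"""
    return {**s, "x": s["x"] + 1}

def op_P(s):
    """P: y <- x+1"""
    return {**s, "y": s["x"] + 1}

initial = {"x": 0, "y": 0}
after_O_P = op_P(op_O(initial))          # state after running O, then P

# Install only P's write (flush y); x keeps its old stable value.
stable = {**initial, "y": after_O_P["y"]}

# Replaying O alone from the stable state reaches the same final state,
# because installing y did not change O's readset {x}.
recovered = op_O(stable)

def installation_edges(conflict_edges):
    """Remove WR conflicts; keep an edge only if another conflict remains."""
    return {pair: kinds - {"WR"}
            for pair, kinds in conflict_edges.items()
            if kinds - {"WR"}}
```

An edge that is only write-read disappears, which is what allows P's write to be installed before O's. An edge that also carries a read-write or write-write conflict is retained.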

6 Installation State Graph & States
[Figure: installation state graph with nodes O, P and Q, each shown with readset {x} and writes { }. States shown: x=0,y=0; x=1,y=0; x=0,y=2; x=1,y=2; x=3,y=2. Labels: removed write-read edge; retained read-write edge; retained write-write & read-write edge; ISG recoverable state.]

7 Exposed Variables
- Example
  - O1: x ← z+1
  - O2: x ← 25
  - After O2, we don't care about x value of O1
- Variable x is unexposed after ops I ({O1} here) if
  - min conflict op in Ops(log) – I writes x
  - Without reading it
- x's value is a "don't care" when x is unexposed
  - This is an example of physical logging
- Prefix of installation graph explains state S if values of exposed variables in S are the same as values in state of prefix of ISG
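
A rough way to test exposure, assuming log order stands in for conflict order: find the first not-yet-installed operation that touches the variable and see whether it reads it. The function name `is_exposed` and the rule for a variable that no remaining operation touches are assumptions of this sketch.

```python
def is_exposed(var, log, installed):
    """log: list of (name, readset, writeset) in log order."""
    for name, reads, writes in log:
        if name in installed:
            continue
        if var in reads:
            return True    # next access reads var, so its value matters
        if var in writes:
            return False   # blind overwrite: current value is a don't-care
    # Assumption: a variable never touched again holds its final value.
    return True

# The slide's example: O1: x <- z+1, O2: x <- 25.
LOG7 = [("O1", {"z"}, {"x"}), ("O2", set(), {"x"})]
```

With I = {O1}, the next operation touching x is O2, which writes x without reading it, so x is unexposed and the stable value of x need not reflect O1.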

8 Potentially Recoverable State
- Potentially recoverable state: a state from which the replay of a subset of operations of the conflict graph, in conflict order, will produce the recovered state S_final
- Theorem: If S is a state explained by a prefix of the installation graph, then S is potentially recoverable

9 REDO Test & Recovery Procedure
- REDO: tests ops in conflict order log scan
  - Yes (true): replay operation
  - No (false): bypass operation
- redo_set = {O | REDO(O..) & O on scanned log}
- Recover procedure:
    Set log scan point to "checkpoint"
    while not at log end
        O ← current log operation
        State = if REDO(O, State, Log, Analysis)
                then O(State)
                else State
        Advance log scan point to next operation
    end
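
The Recover procedure translates almost directly into Python. This is a minimal sketch under simplifying assumptions: the REDO test here takes only the operation name and the state (the slide's version also receives the log and analysis information), and the example log, stable state and `INSTALLED` set are made up for illustration.

```python
def recover(state, log, checkpoint, redo):
    """Scan log from checkpoint; replay each op for which redo(...) is true.

    log:  list of (name, fn) pairs; fn maps a state dict to a new state dict.
    redo: hypothetical REDO test taking (name, state) and returning a bool.
    Returns the recovered state and the list of replayed op names (redo_set).
    """
    redo_set = []
    for name, fn in log[checkpoint:]:
        if redo(name, state):
            state = fn(state)       # yes: replay the operation
            redo_set.append(name)
        # no: bypass the operation and leave the state unchanged
    return state, redo_set

# Assumed example: O: x <- x+1, P: y <- x+1, with only P's write installed.
LOG9 = [("O", lambda s: {**s, "x": s["x"] + 1}),
        ("P", lambda s: {**s, "y": s["x"] + 1})]
INSTALLED = {"P"}
STATE, REDO_SET = recover({"x": 0, "y": 2}, LOG9, 0,
                          lambda name, s: name not in INSTALLED)
```

Here the stable state already holds y=2, so only O is replayed and the recovered state is x=1, y=2.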

10 Recovery
- Recoverable system: a system with
  - a potentially recoverable state S_pot
  - Replay of O's in redo_set from S_pot produces S_final
- Inv: ops(Log) – redo_set defines prefix of the installation state graph that explains State
  - Every system change must be atomic transition maintaining Inv
- Corollary: Given a state, log, checkpoint, and an execution of Recover (identifying redo_set)
  - If Inv holds
  - Then system is recoverable
- Only specific potentially recoverable state is recoverable
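
The "defines a prefix" half of Inv is easy to state in code. The sketch below checks only that ops(Log) – redo_set is closed under predecessors in the installation graph. It does not check the second half, that the prefix explains the current state. The edge set is an assumption based on the O, P, Q example (write-read edge O to P removed, edges into Q retained).

```python
def is_prefix(installed, edges):
    """True if every predecessor of an installed op is also installed."""
    return all(a in installed for (a, b) in edges if b in installed)

def installed_ops(log_ops, redo_set):
    """ops(Log) - redo_set: the operations that recovery will not replay."""
    return set(log_ops) - set(redo_set)

# Assumed installation graph edges for the O, P, Q example.
INSTALL_EDGES = {("O", "Q"), ("P", "Q")}
```

For example, a redo_set of {O, Q} leaves {P} installed, which is a prefix. A redo_set of {O, P} leaves {Q} installed, which is not, because Q's predecessors are missing.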

11 Write Graph
- Write graph: start from installation state graph
  - Collapse set of nodes (acyclic): merges nodes
  - Add new node for next operation
  - Add edge (collapse cycles)
  - Remove a write of an unexposed variable
    - We do not care about values of unexposed variables
- Write graph captures entire system state
  - Prefix that is stable
  - Suffix in cache
- Cache manager uses write graph
  - To maintain potentially recoverable state
  - Usually by collapsing suffix node into stable prefix

12 Write Graph {via Node Collapse}
[Figure: write graph based on the installation graph, with nodes O, P and Q, each shown with readset {x} and writes { }. Collapsed node n: Ops(n) = {O,P}, Writes(n) = { }. States shown: x=0,y=0; x=0,y=2; x=1,y=0; x=1,y=2; x=3,y=2. Annotations: removed write-read edge; write graph remains acyclic; fewer states; keep only one version of each variable in cache; retained read-write edge translates to flush order for cache manager.]

13 Managing Recovery
[Figure: stable state (write graph prefix, usually a single node), the log (O1, O2, O3), and volatile state (suffix of write graph, in cache). Collapsing to "install" X is atomic: it updates the state and removes O3 from redo_set.]

14 Physiological Recovery
- Physiological recovery (e.g. ARIES)
  - Operation form: read A, write A
  - Log op has LSN
  - Variable tagged: LSN of last log op writing it
  - REDO: op's LSN > variable LSN ⇒ "Yes" (replay)
- Our explanation
  - Ops writing variable collapsed to one cache node
  - Flushing page to stable state (root of write graph)
    - Collapses cache node into stable state node
    - Keeps state potentially recoverable
    - redo test ⇒ node's ops removed from redo_set
  - Maintains invariant Inv
    - [state change; redo_set change] is atomic
- Physical and logical recovery described in paper
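
The LSN-based REDO test can be illustrated with a toy redo pass. This is a sketch of the general idea only, not ARIES itself: the page layout, the log record shape `(lsn, page_id, fn)` and the example values are all assumptions.

```python
def redo_pass(pages, log):
    """Replay log records whose LSN exceeds the LSN tagged on their page.

    pages: {page_id: {"lsn": int, "value": ...}}
    log:   list of (lsn, page_id, fn) in LSN order; fn maps old value to new.
    Returns the list of LSNs that were replayed.
    """
    replayed = []
    for lsn, page_id, fn in log:
        page = pages[page_id]
        if lsn > page["lsn"]:                  # REDO test: "yes", replay
            page["value"] = fn(page["value"])
            page["lsn"] = lsn                  # tag moves with the state change
            replayed.append(lsn)
    return replayed

# Hypothetical single-page example: three logged updates to page A.
LSN_LOG = [
    (1, "A", lambda v: v + 10),
    (2, "A", lambda v: v + 2),
    (3, "A", lambda v: v * 2),
]
# The stable copy of A was flushed after LSN 2, so it carries page LSN 2.
PAGES = {"A": {"lsn": 2, "value": 12}}
REPLAYED = redo_pass(PAGES, LSN_LOG)
```

Because the page LSN is stored with the page contents, flushing the page changes the state and shrinks redo_set in one step. Running the pass a second time replays nothing.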

15 Extended LSN Method
- Generalize physiological ops
  - read/write multiple variables
  - Our example: ops can read X, write Y (like P)
    - also read X, write X
- LSNs still effective for REDO test
  - Flush synchronizes change to state and redo_set
- Cache management
  - Now requires flush of one variable before another
  - Our theory captures this careful write requirement
- Consider B-tree split (B-link-tree) *
  - Next slide shows "half split" graphically
  - Must also post index term for new node
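
The careful write requirement can be pictured as a cache that remembers which variables must reach disk first. The class below is a hypothetical toy, not a description of any real cache manager: `flush_after` is an invented parameter naming the variables that have to be flushed before this one.

```python
class Cache:
    """Toy cache manager that enforces a flush order between variables."""

    def __init__(self):
        self.dirty = {}         # variable -> cached value
        self.must_follow = {}   # variable -> variables to flush first
        self.stable = {}        # what has reached disk

    def write(self, var, value, flush_after=()):
        self.dirty[var] = value
        self.must_follow.setdefault(var, set()).update(flush_after)

    def flush(self, var):
        # Only predecessors that are still dirty block the flush.
        pending = {v for v in self.must_follow.get(var, ()) if v in self.dirty}
        if pending:
            raise RuntimeError(f"flush {sorted(pending)} before {var}")
        self.stable[var] = self.dirty.pop(var)
        self.must_follow.pop(var, None)
```

For a split that moves half of node X into new node Y and then removes those records from X, the update to X would be written with `flush_after=("Y",)`. If the shrunken X reached disk first and the system crashed, replaying "read X, write Y" would no longer find the moved records in X.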

16 Extended Recovery {B-link-tree Split}
[Figure: write graph for a B-link-tree split with old node X and new node Y. Operation labels: "Move half to node Y" (read X, write Y); "Update node X" (remove Y records). Nodes O, P and Q are each shown with readset {x} and writes { }; collapsed node n has Ops(n) = {O,P} and Writes(n) = { }. States shown: x=0,y=0; x=0,y=2; x=1,y=0; x=1,y=2; x=3,y=2. Annotations: flush Y before X; in SQL Server 6.0.]

17 Recoverable Systems Summary
- Cache management keeps state potentially recoverable
  - Very generally via write graph
  - Derived from installation state graph
- Maintains invariant Inv
  - So that replayed operations are correct set
  - By synchronizing changes to redo_set with changes to state

18 Questions?

19 Outline
- Foundation
  - Conflict graph, state graphs, recovered state
- Abstract Recovery
  - Cache management: maintaining state
    - Installation order: weaker update order than conflict order
  - Recovery
    - Recovery procedure, redo test
  - Invariant: guarantees correct recovery
    - Coordinating state before failure with recovery execution after failure
- Recoverable Systems
  - Write graphs for maintaining potentially recoverable state
  - Maintaining recovery invariant
  - Explaining current recovery methods

20 Managing the Cache
- Stable state: prefix of write graph
  - Usually a single node
  - Means stable state potentially recoverable
- Cache: usually contains write graph suffix
  - Volatile state, which is lost during system crash
  - Usually collapsing nodes so that one node per "variable"
- State update: move a minimum write graph node in cache to stable state atomically
  - Start with potentially recoverable state
  - Atomic transition, frequently node collapse
  - New potentially recoverable state

21 Maintaining Recovery Invariant
- Potentially recoverable state only "half" of job
  - Ops(log) – redo_set must explain state
  - Jobs need to be synchronized to enforce Inv
- Examples: stable state is root of write graph
  - Logical recovery (in paper)
  - Physical recovery (in paper)
  - Physiological recovery *
  - Extended recovery *

22 Logical Recovery
- Logical recovery with arbitrary log ops (System R)
  - Quiesce and write shadow "checkpoint" to disk
    - By dumping cache contents to disk shadow pages
  - Disk shadow is installed atomically
    - Replacing old versions of shadow variables
- Our explanation
  - Shadow coalesced on disk is single write graph node
    - Encompassing all changes from last checkpoint
    - Hence is a write graph prefix
  - Shadow "installed" atomically via pointer swing
    - Accomplished by writing new pointer in checkpoint record to log
  - Log is truncated with the writing of the checkpoint record
    - All prior records are added to checkpoint
    - Which "installs" all earlier operations simultaneously with stable state update, hence maintaining Inv
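
The pointer swing can be modelled as a single record that names both the current shadow version and the point where the redo scan starts. This toy `ShadowStore` class is an assumption made for illustration and is not how System R lays out its storage. Replacing the record in one assignment stands in for the atomic write of the checkpoint record.

```python
class ShadowStore:
    """Toy stable storage: page versions plus one checkpoint record."""

    def __init__(self, initial):
        self.versions = [dict(initial)]
        # The checkpoint record names the current version and the redo start.
        self.checkpoint = {"version": 0, "redo_start": 0}
        self.log = []

    def append(self, op_name):
        self.log.append(op_name)

    def install(self, cache):
        # Dump cache contents to new shadow pages; not yet the stable state.
        shadow = {**self.versions[self.checkpoint["version"]], **cache}
        self.versions.append(shadow)
        # One record write swings the pointer and moves the redo start,
        # so the state change and the redo_set change happen together.
        self.checkpoint = {"version": len(self.versions) - 1,
                           "redo_start": len(self.log)}

    def stable_state(self):
        return self.versions[self.checkpoint["version"]]

    def redo_set(self):
        return self.log[self.checkpoint["redo_start"]:]
```

Until `install` replaces the checkpoint record, recovery still sees the old version and the full redo_set. Afterwards it sees the new version and an empty redo_set, with no state in between.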

23 Physical Recovery
- Physical recovery writes entire page
  - Pages are written back to disk
  - When prefix of log contains only pages already written back, log is truncated
    - Via checkpoint record indicating redo pass start
  - All records scanned during recovery are replayed
    - REDO(op) always is "yes"
- Our explanation
  - Operations are blind writes of single variable: read set is empty
  - All variables with operations not in checkpoint are unexposed
    - These operations are replayed during recovery
    - They never read
  - Writing to those variables leaves them unexposed
    - However, they are now set to be installed
  - Installation occurs when checkpoint record is written
    - Operations now not part of redo scan are thus installed
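
Because physical log records are blind writes, replaying every record after the redo start gives the same result whatever mix of old and new pages survived the crash. The sketch below assumes each log record is a `(page, value)` pair; the page names and values are made up.

```python
def physical_redo(stable, log, redo_start):
    """Replay every blind write from redo_start; REDO is always 'yes'."""
    state = dict(stable)
    for page, value in log[redo_start:]:
        state[page] = value      # never reads the old value
    return state

# Hypothetical log of whole-page writes.
PHYS_LOG = [("p1", "a"), ("p2", "b"), ("p1", "c")]
```

Starting from `{"p1": "a", "p2": "stale"}` or from `{"p1": "c"}`, a redo pass from position 0 yields the same pages. Moving `redo_start` past a record is what counts as installing it.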

24 Our Goal
- REDO recovery explanation (not all of recovery)
  - Cache management: stage data to stable state
    - Goal: fewer writes & less constrained order
    - Some methods require careful write ordering: why?
  - Recovery: which ops to replay
    - And how to coordinate state changes with replay changes
  - Provably ensure "recoverability"
- Disclaimers
  - Abstract story: real recovery needs more
  - Simpler operation model than past work
  - Not everything is explained:
    - All actually used recovery techniques are handled
    - But not all recovery techniques we know of are "quite" captured

25 System Model
- State: { …}
- Operation:
  - readset(O): set of variables read by O
  - writeset(O): set of variables written by O
  - Operations are atomic: system must ensure atomicity
- Operation sequence
  - Sequence of ops O_1, O_2, … O_k … O_final
- State sequence
  - Sequence of states S_1, S_2, … S_k … S_final generated by op seq from S_0
  - O_k precedes (leads to) S_k when executed "against" S_{k-1}
- Recovery goal
  - From some state and a record of operations (on log)
  - Reproduce last state in sequence, S_final
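
The system model maps naturally onto a small data type. In this sketch an operation sees only the variables in its readset and may write only the variables in its writeset. The `Operation` class, the `run` helper and the two example operations are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Operation:
    name: str
    readset: frozenset
    writeset: frozenset
    fn: Callable[[dict], dict]   # maps readset values to new writeset values

    def apply(self, state):
        inputs = {v: state[v] for v in self.readset}   # op sees only its readset
        outputs = self.fn(inputs)
        assert set(outputs) <= self.writeset           # op writes only its writeset
        return {**state, **outputs}                    # new state; old one untouched

def run(initial, ops):
    """Return [S_0, S_1, ..., S_final] for an operation sequence."""
    states = [dict(initial)]
    for op in ops:
        states.append(op.apply(states[-1]))
    return states

O = Operation("O", frozenset({"x"}), frozenset({"x"}), lambda r: {"x": r["x"] + 1})
P = Operation("P", frozenset({"x"}), frozenset({"y"}), lambda r: {"y": r["x"] + 1})
STATES = run({"x": 0, "y": 0}, [O, P])
```

Each `apply` call builds a fresh dictionary, so every intermediate state S_k stays available for inspection. The last element of `STATES` plays the role of S_final.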

