1 A Theory of Redo Recovery David Lomet Microsoft Research, Redmond Mark Tuttle HP Research, Cambridge.

Presentation transcript:

1 A Theory of Redo Recovery David Lomet Microsoft Research, Redmond Mark Tuttle HP Research, Cambridge

2 Big Picture
- Redo Recovery requires
  - Good db state
  - Replay of the right operations
- Good state updates: conflict order not required
  - Write-read conflicts can be ignored
  - Some db “variables” irrelevant (don’t need to update them)
- Synchronize State update & ops replayed
  - Captured in recovery Invariant
  - We prove that maintaining invariant ⇒ recovery
- Current recovery methods: maintain invariant
  - Show how current methods work (e.g. ARIES redo)
  - Show how “new” methods could work
- Much simpler than our VLDB’95 paper

3 Conflict State Graph (CSG)
- Conflict graph (“Borrowed” from Concurrency Control)
  - Nodes are log operations; Edges: conflicts (RW, WR, WW)
- State graph SG
  - Add writes(node): the set {(x, v), …} of variable-value pairs for the vars updated
  - State for SG: {(x, v) | (x, v) in writes(n) and n is last node in state graph with x in vars(n)}
- Final state S_final of CSG is desired recovered state
- Any prefix of a state graph is a state graph
  - Prefix: node in prefix ⇒ predecessor in prefix
- State of any prefix of CSG can be recovered by
  - Replaying operations in suffix in conflict graph order
- We will relax CSG requirements
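
The conflict edges described on this slide can be sketched as follows. This is a minimal illustration, not the authors' formalization: the `Op` type and `conflict_edges` helper are hypothetical names, the log is assumed to be a list in execution order, and the write sets for O, P and Q (x, y and x respectively) are assumptions chosen to match the running example on the later slides.

```python
from typing import NamedTuple

class Op(NamedTuple):
    name: str
    reads: frozenset
    writes: frozenset

def conflict_edges(log):
    """Return {(earlier, later): conflict kinds} for a log given in execution order."""
    edges = {}
    for i, a in enumerate(log):
        for b in log[i + 1:]:
            kinds = set()
            if a.writes & b.reads:
                kinds.add("WR")   # a writes something b reads
            if a.reads & b.writes:
                kinds.add("RW")   # a reads something b overwrites
            if a.writes & b.writes:
                kinds.add("WW")   # both write the same variable
            if kinds:
                edges[(a.name, b.name)] = kinds
    return edges

# Assumed read/write sets for the running example (hypothetical for Q).
log = [
    Op("O", frozenset({"x"}), frozenset({"x"})),
    Op("P", frozenset({"x"}), frozenset({"y"})),
    Op("Q", frozenset({"x"}), frozenset({"x"})),
]
edges = conflict_edges(log)
```

Under these assumed sets, O to P carries only a write-read conflict, P to Q only a read-write conflict, and O to Q all three kinds, which matches the edge labels on the next slide.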

4 Conflict State Graph & States
[Figure: conflict state graph over operations O, P and Q, each with readset {x}. States shown: x=0,y=0; x=1,y=0; x=1,y=2; S_final: x=3,y=2. Edge labels: write-read edge; read-write edge; write-read & write-write & read-write edge.]

5 Installation Graph
- Example: Initial stable state: {x=0, y=0}
  - O: x ← x+1
  - P: y ← x+1
  - After O, P, state is {x=1, y=2}
  - Flush y to disk: stable state is {x=0, y=2}
  - Replay O: generates correct state {x=1, y=2}
  - O’s readset x unchanged by P’s installation (y written by P)
  - Even though Write-Read edge orders P after O
- Installation graph: conflict graph without write-read edges
- Installation state graph (ISG): same writes(n) for node n as conflict state graph
- State of any prefix of ISG can be recovered
  - More prefixes (states) because of fewer edges
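
The example on this slide can be replayed in a few lines. This is a sketch under stated assumptions: the initial values x=0 and y=0 are taken from the state diagrams on the neighbouring slides, states are plain dicts, and `is_prefix` is a hypothetical helper that checks the prefix condition (every predecessor of a node in the set is also in the set).

```python
def apply_O(state):
    # O: x <- x + 1
    return {**state, "x": state["x"] + 1}

def apply_P(state):
    # P: y <- x + 1
    return {**state, "y": state["x"] + 1}

initial = {"x": 0, "y": 0}                # assumed initial stable state
cached = apply_P(apply_O(initial))        # volatile state after O, then P
stable = {**initial, "y": cached["y"]}    # only y is flushed to disk
recovered = apply_O(stable)               # redo O alone; P counts as installed

def is_prefix(nodes, edges):
    """A node set is a prefix if every edge into it also starts inside it."""
    return all(src in nodes for (src, dst) in edges if dst in nodes)

conflict = {("O", "P")}    # write-read edge: O writes x, P reads x
installation = set()       # same graph with the write-read edge dropped
```

Here `recovered` equals `cached`, so installing P before O is harmless. The set {P} is a prefix of the installation graph but not of the conflict graph, which is the sense in which fewer edges give more recoverable states.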

6 Installation State Graph & States
[Figure: installation state graph over operations O, P and Q, each with readset {x}. States shown: x=0,y=0; x=1,y=0; x=0,y=2; x=1,y=2; x=3,y=2. Labels: removed write-read edge; retained read-write edge; retained write-write & read-write edge; ISG recoverable state.]

7 Exposed Variables
- Example
  - O1: x ← z+1
  - O2: x ← 25
  - After O2, we don’t care about x value of O1
- Variable x is unexposed after ops I ({O1} here) if the minimal op in conflict order among Ops(log) - I that accesses x writes x without reading it
- x’s value is a “don’t care” when x is unexposed
  - This is example of Physical Logging
- Prefix of installation graph explains state S if values of exposed variables in S are the same as values in state of prefix of ISG
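
A small sketch of the exposed-variable test, assuming log order stands in for conflict order and each log entry is a hypothetical `(name, reads, writes)` tuple. Treating a variable that no remaining operation touches as exposed is an assumption made here (its stable value must already be the final one); the slide does not spell out that case.

```python
def is_exposed(var, log, installed):
    """log: list of (name, reads, writes) in log order; installed: set of op names."""
    for name, reads, writes in log:
        if name in installed:
            continue
        if var in reads:
            return True     # first uninstalled op touching var reads it
        if var in writes:
            return False    # blind write: the old value is a don't-care
    return True             # assumption: no later accessor, value must already be final

log = [("O1", {"z"}, {"x"}), ("O2", set(), {"x"})]
x_after_O1 = is_exposed("x", log, {"O1"})
z_after_O1 = is_exposed("z", log, {"O1"})
```

With O1 installed, x is unexposed because O2 overwrites it without reading it, while z stays exposed.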

8 Potentially Recoverable State
- Potentially recoverable state: a state from which the replay of a subset of operations of the conflict graph, in conflict order, will produce the recovered state S_final
- Theorem: If S is a state explained by a prefix of the installation graph, then S is potentially recoverable

9 REDO Test & Recovery Procedure
- REDO: tests ops in conflict-order log scan
  - Yes (true): replay operation
  - No (false): bypass operation
- redo_set = {O | REDO(O..) & O on scanned log}
- Recover Procedure:
    Set log scan point to “checkpoint”
    while not at log end
        O ← current log operation
        State = if REDO(O, State, Log, Analysis)
                Then O(State)
                Else State
        Advance log scan point to next operation
    End
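
The Recover procedure above translates almost line for line into code. This is a simplified sketch: the REDO test is passed in as a function of `(op, state)` only, dropping the Log and Analysis arguments of the slide, and the dict-based operation records and the `installed` set are hypothetical stand-ins for whatever a real system uses.

```python
def recover(state, log, checkpoint, redo):
    """Scan the log from the checkpoint, replaying each op the REDO test accepts."""
    redo_set = []
    for op in log[checkpoint:]:
        if redo(op, state):
            state = op["apply"](state)   # Then O(State)
            redo_set.append(op["name"])
        # Else State: the op is bypassed and the state is left unchanged
    return state, redo_set

log = [
    {"name": "O", "apply": lambda s: {**s, "x": s["x"] + 1}},   # O: x <- x + 1
    {"name": "P", "apply": lambda s: {**s, "y": s["x"] + 1}},   # P: y <- x + 1
]
installed = {"P"}   # assumed: P's write of y already reached stable storage
final_state, redo_set = recover(
    {"x": 0, "y": 2}, log, 0, lambda op, s: op["name"] not in installed
)
```

Starting from the stable state x=0, y=2, only O is replayed, and the result is x=1, y=2.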

10 Recovery
- Recoverable system: a system with
  - a potentially recoverable state S_pot
  - Replay of O’s in redo_set from S_pot produces S_final
- Inv: ops(Log) - redo_set defines prefix of the installation state graph that explains State
  - Every system change must be atomic transition maintaining Inv
- Corollary: Given a state, log, checkpoint, and an execution of Recover (identifying redo_set)
  - If Inv holds
  - Then System is recoverable
- Only specific potentially recoverable state is recoverable

11 Write Graph
- Write graph: start from installation state graph
  - Collapse set of nodes (acyclic): merges nodes
  - Add new node for next operation
  - Add edge (collapse cycles)
  - Remove a write of an unexposed variable
    - We do not care about values of unexposed variables
- Write graph captures entire system state
  - Prefix that is stable
  - Suffix in cache
- Cache Manager uses write graph
  - To maintain potentially recoverable state
  - Usually by collapsing suffix node into stable prefix
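
Node collapse can be pictured as merging the operation sets and keeping one value per variable. The sketch below is only an illustration: the dict representation of a node and the `collapse` helper are hypothetical, the written values (x=1, y=2, x=3) are assumed from the running example, and the acyclicity check that a real write graph needs is omitted.

```python
def collapse(earlier, later):
    """Merge two write-graph nodes; for a shared variable the later write wins."""
    return {
        "ops": earlier["ops"] | later["ops"],
        "writes": {**earlier["writes"], **later["writes"]},
    }

node_O = {"ops": {"O"}, "writes": {"x": 1}}
node_P = {"ops": {"P"}, "writes": {"y": 2}}
node_Q = {"ops": {"Q"}, "writes": {"x": 3}}

n = collapse(node_O, node_P)      # collapsed node with Ops(n) = {O, P}
all_ops = collapse(n, node_Q)     # only the latest version of x survives
```

Merging the write sets this way is what lets the cache keep a single version of each variable.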

12 Write Graph {via Node Collapse}: Fewer States
[Figure: write graph, based on the installation graph, over operations O, P and Q, each with readset {x}, with collapsed node n where Ops(n) = {O,P}. States shown: x=0,y=0; x=1,y=0; x=0,y=2; x=1,y=2; x=3,y=2. Labels: removed write-read edge; write graph remains acyclic; keep only one version of each variable in cache; retained read-write edge translates to flush order for cache manager.]

13 Managing Recovery
[Figure: the log holds O1, O2, O3. Stable state: write graph prefix, usually a single node. Volatile state: suffix of write graph, in cache. Labels: collapse to “install” X; updating state and removing O3 from redo_set is atomic.]

14 Physiological Recovery
- Physiological recovery (e.g. ARIES)
  - Operation Form: read A, write A
  - Log Op has LSN
  - Variable tagged: LSN of last log op writing it
  - REDO: op’s LSN > variable LSN ⇒ “Yes” (Replay)
- Our explanation
  - Ops writing variable collapsed to one cache node
  - Flushing page to stable state (root of write graph)
    - Collapses cache node into stable state node
    - Keeps state potentially recoverable
    - redo test ⇒ node’s ops removed from redo_set
  - Maintains invariant Inv
    - [state change; redo_set change] is atomic
- Physical and Logical Recovery described in paper
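
The LSN-based REDO test can be sketched as below. This is a toy model rather than ARIES itself: pages are dicts with a `value` and an `lsn` field, each log record is a hypothetical `(lsn, page_id, fn)` tuple for an operation that reads and writes a single page, and the analysis and undo passes are left out.

```python
def redo_pass(pages, log):
    """Replay each op whose LSN exceeds the LSN stamped on the page it writes."""
    replayed = []
    for lsn, page_id, fn in log:
        page = pages[page_id]
        if lsn > page["lsn"]:                # REDO test: "Yes"
            page["value"] = fn(page["value"])
            page["lsn"] = lsn                # tag page with the last op applied
            replayed.append(lsn)
    return replayed

# Page A was flushed after LSN 2; page B was never flushed.
pages = {"A": {"value": 11, "lsn": 2}, "B": {"value": 0, "lsn": 0}}
log = [
    (1, "A", lambda v: v + 1),
    (2, "A", lambda v: v + 10),
    (3, "B", lambda v: v + 5),
    (4, "A", lambda v: v * 2),
]
replayed = redo_pass(pages, log)
```

Because the page LSN is written together with the page contents, a single flush changes both the stable state and the outcome of the REDO test, which is the atomic [state change; redo_set change] the slide refers to.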

15 Extended LSN Method
- Generalize physiological ops
  - read/write multiple variables
  - Our example: ops can read X, write Y (like P)
    - also read X, write X
- LSNs still effective for REDO test
  - Flush synchronizes change to state and redo_set
- Cache management
  - Now requires flush of one variable before another
  - Our theory captures this careful write requirement
- Consider B-tree split: (B-link-tree) *
  - Next slide shows “half split” graphically
  - Must also post index term for new node
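
Why the flush order matters can be seen in a small simulation of the half split. Everything here is an illustrative assumption: pages are dicts with `recs` and `lsn`, the split is modelled as two log records (`move_half` reads X and writes Y, `drop_moved` reads X and writes X), and the REDO test compares the op's LSN with the LSN of the page it writes.

```python
def move_half(pages):     # LSN 1: read X, write Y
    recs = pages["X"]["recs"]
    return recs[len(recs) // 2:]

def drop_moved(pages):    # LSN 2: read X, write X
    recs = pages["X"]["recs"]
    return recs[:len(recs) // 2]

LOG = [(1, "Y", move_half), (2, "X", drop_moved)]

def redo(stable):
    """Replay ops whose LSN exceeds the LSN of the page they write."""
    pages = {k: dict(v) for k, v in stable.items()}
    for lsn, target, fn in LOG:
        if lsn > pages[target]["lsn"]:
            pages[target] = {"recs": fn(pages), "lsn": lsn}
    return {k: v["recs"] for k, v in pages.items()}

# Crash after flushing Y only: X is still the pre-split page.
y_first = redo({"X": {"recs": [1, 2, 3, 4], "lsn": 0},
                "Y": {"recs": [3, 4], "lsn": 1}})
# Crash after flushing X only: replaying the split reads the truncated X.
x_first = redo({"X": {"recs": [1, 2], "lsn": 2},
                "Y": {"recs": [], "lsn": 0}})
```

Flushing Y first recovers both halves. Flushing X first loses records 3 and 4, because the replayed split reads a version of X that has already been overwritten; this is the read-write edge that the cache manager must turn into a flush order.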

16 Extended Recovery {B-link-tree Split}
[Figure: half split of old node X into new node Y, drawn with the write graph for O, P and Q (each with readset {x}) and collapsed node n where Ops(n) = {O,P}. States shown: x=0,y=0; x=1,y=0; x=0,y=2; x=1,y=2; x=3,y=2. Labels: move half to node Y (read X, write Y); update node X to remove Y records; flush Y before X; in SQL Server 6.0.]

17 Recoverable Systems Summary
- Cache management keeps state potentially recoverable
  - Very generally via write graph
  - Derived from installation state graph
- Maintains invariant INV
  - so that replayed operations are correct set
  - By synchronizing changes to redo_set with changes to state

18 Questions?

19 Outline
- Foundation
  - Conflict graph, state graphs, recovered state
- Abstract Recovery
  - Cache Management: maintaining state
    - Installation order: weaker update order than conflict order
  - Recovery
    - Recovery procedure, redo test
  - Invariant: guarantees correct recovery
    - Coordinating state before failure with recovery execution after failure
- Recoverable Systems
  - Write graphs for maintaining potentially recoverable state
  - Maintaining recovery invariant
  - Explaining current recovery methods

20 Managing the Cache
- Stable state: prefix of write graph
  - Usually a single node
  - Means stable state potentially recoverable
- Cache: usually contains write graph suffix
  - Volatile state, which is lost during system crash
  - Usually collapsing nodes so that one node per “variable”
- State update: move a minimum write graph node in cache to stable state atomically
  - Start with potentially recoverable state
  - Atomic transition, frequently node collapse
  - New potentially recoverable state

21 Maintaining Recovery Invariant
- Potentially recoverable state only “half” of job
  - Ops(log) - Redo_set must explain state
  - Jobs need to be synchronized to enforce INV
- Examples: Stable state is root of write graph
  - Logical recovery (in paper)
  - Physical recovery (in paper)
  - Physiological recovery *
  - Extended recovery *

22 Logical Recovery
- Logical recovery with arbitrary log ops: System R
  - Quiesce and write shadow “checkpoint” to disk
    - By dumping cache contents to disk shadow pages
  - Disk shadow is installed atomically
    - Replacing old versions of shadow variables
- Our explanation
  - Shadow coalesced on disk is single write graph node
    - Encompassing all changes from last checkpoint
    - Hence is a write graph prefix
  - Shadow “installed” atomically via pointer swing
    - Accomplished by writing new pointer in checkpoint record to log
  - Log is truncated with the writing of the checkpoint record
    - All prior records are added to checkpoint
    - Which “installs” all earlier operations simultaneously with stable state update, hence maintaining Inv

23 Physical Recovery
- Physical recovery writes entire page
  - Pages are written back to disk
  - When prefix of log contains only pages already written back, log is truncated
    - Via checkpoint record indicating redo pass start
  - All records scanned during recovery are replayed
    - REDO(op) always is “yes”
- Our explanation
  - Operations are blind writes of single variable: read set is empty
  - All variables with operations not in checkpoint are unexposed
    - These operations are replayed during recovery
    - They never read
  - Writing to those variables leaves them unexposed
    - However, they are now set to be installed
  - Installation occurs when checkpoint record is written
    - Operations now not part of redo scan are thus installed
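
A minimal sketch of physical redo, assuming each log record is a hypothetical `(variable, value)` pair and `redo_start` is the position named by the checkpoint record. Because every operation is a blind write and REDO always answers "yes", the stale values of logged variables never influence the result.

```python
def physical_redo(stable, log, redo_start):
    """Replay every record from redo_start on; each one is a blind write."""
    state = dict(stable)
    for var, value in log[redo_start:]:
        state[var] = value    # no read of the old value
    return state

log = [("a", 1), ("b", 2), ("a", 3)]

# Three different crash-time stable states for the same log.
r1 = physical_redo({"a": 0, "b": 0}, log, 0)
r2 = physical_redo({"a": 1, "b": 2}, log, 0)
r3 = physical_redo({"a": 99, "b": 2}, log, 0)
```

All three runs end in the same state, which illustrates why variables written after the checkpoint are unexposed: whatever they held at crash time is a don't-care.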

24 Our Goal
- REDO Recovery explanation (Not all of recovery)
  - Cache management: stage data to stable state
    - Goal: fewer writes & less constrained order
    - Some methods require careful write ordering: why?
  - Recovery: which ops to replay
    - And how to coordinate state changes with replay changes
  - Provably ensure “recoverability”
- Disclaimers
  - Abstract story: real recovery needs more
  - Simpler operation model than past work
  - Not everything is explained:
    - All actually used recovery techniques are handled
    - But not all recovery techniques we know of are “quite” captured

25 System Model
- State: {(x, v), …}, a set of variable-value pairs
- Operation:
  - readset(O): set of variables read by O
  - writeset(O): set of variables written by O
  - Operations are atomic; system must ensure atomicity
- Operation Sequence
  - Sequence of ops O_1, O_2, … O_k … O_final
- State Sequence
  - Sequence of states S_1, S_2, … S_k … S_final generated by op seq from S_0
  - O_k precedes (leads to) S_k when executed “against” S_{k-1}
- Recovery goal
  - From some state and a record of operations (on log)
  - Reproduce last state in sequence S_final
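
The state sequence of this model can be generated directly. The sketch assumes states are dicts and operations are functions from state to state; the `run` helper is hypothetical, and the definition of Q (x ← x+2) is an assumption chosen so that the final state matches x=3, y=2 from the earlier diagrams.

```python
def run(initial, ops):
    """Return [S_0, S_1, ..., S_final], where S_k is O_k applied to S_{k-1}."""
    states = [dict(initial)]
    for op in ops:
        states.append(op(states[-1]))
    return states

ops = [
    lambda s: {**s, "x": s["x"] + 1},   # O: x <- x + 1
    lambda s: {**s, "y": s["x"] + 1},   # P: y <- x + 1
    lambda s: {**s, "x": s["x"] + 2},   # Q: assumed x <- x + 2
]
states = run({"x": 0, "y": 0}, ops)
s_final = states[-1]
```

Recovery then amounts to reproducing `s_final` from whatever stable state survived the crash plus the logged operations.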