Undo for Recovery: Approaches and Models

Slides:



Advertisements
Similar presentations
SDN Controller Challenges
Advertisements

Fabián E. Bustamante, Winter 2006 Recovery Oriented Computing Embracing Failure A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-
High Availability Group 08: Võ Đức Vĩnh Nguyễn Quang Vũ
June 23rd, 2009Inflectra Proprietary InformationPage: 1 SpiraTest/Plan/Team Deployment Considerations How to deploy for high-availability and strategies.
Recovery 10/18/05. Implementing atomicity Note, when a transaction commits, the portion of the system implementing durability ensures the transaction’s.
Chapter 15 Transaction Management. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Transaction basics Concurrency.
CS-550 (M.Soneru): Recovery [SaS] 1 Recovery. CS-550 (M.Soneru): Recovery [SaS] 2 Recovery Computer system recovery: –Restore the system to a normal operational.
Exchange server Mail system Four components Mail user agent (MUA) to read and compose mail Mail transport agent (MTA) route messages Delivery agent.
Transactions and Recovery
Module 8 Implementing Backup and Recovery. Module Overview Planning Backup and Recovery Backing Up Exchange Server 2010 Restoring Exchange Server 2010.
Overview SAP Basis Functions. SAP Technical Overview Learning Objectives What the Basis system is How does SAP handle a transaction request Differentiating.
Presented by INTRUSION DETECTION SYSYTEM. CONTENT Basically this presentation contains, What is TripWire? How does TripWire work? Where is TripWire used?
IT:Network:Applications Fall  Running one “machine” inside another “machine”  OS in Virtual machines sees ◦ CPU(s) ◦ Memory ◦ Disk ◦ USB ◦ etc.
Selling the Database Edition for Oracle on HP-UX November 2000.
IT – DBMS Concepts Relational Database Theory.
Introduction to the Enterprise Library. Sounds familiar? Writing a component to encapsulate data access Building a component that allows you to log errors.
Test Organization and Management
Functions of a Database Management System
Recovery-Oriented Computing User Study Training Materials October 2003.
Building Undo for Operators: An Undoable Store Aaron Brown ROC Research Group University of California, Berkeley Winter 2003 ROC Research Retreat.
Enterprise Storage A New Approach to Information Access Darren Thomas Vice President Compaq Computer Corporation.
Hardware Definitions –Port: Point of connection –Bus: Interface Daisy Chain (A=>B=>…=>X) Shared Direct Device Access –Controller: Device Electronics –Registers:
Module 2: Installing and Maintaining ISA Server. Overview Installing ISA Server 2004 Choosing ISA Server Clients Installing and Configuring Firewall Clients.
Lecture 12 Recoverability and failure. 2 Optimistic Techniques Based on assumption that conflict is rare and more efficient to let transactions proceed.
INTRUSION DETECTION SYSYTEM. CONTENT Basically this presentation contains, What is TripWire? How does TripWire work? Where is TripWire used? Tripwire.
Database Systems Recovery & Concurrency Lecture # 20 1 st April, 2011.
A Quick Look At How Works Understanding the basics of how works can make life a lot easier for any user. Especially those who are interested.
Spheres of Undo: A Framework for Extending Undo Aaron Brown January 2004 ROC Retreat.
Please note that the session topic has changed
Computer Science Lecture 19, page 1 CS677: Distributed OS Last Class: Fault tolerance Reliable communication –One-one communication –One-many communication.
Undo for Recovery: Approaches and Models Aaron Brown UC Berkeley ROC Group.
Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems.
Chapter 6 Protecting Your Files
SQL Database Management
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
Wallpaper only – on screen during welcome and chat
Chapter Objectives In this chapter, you will learn:
Chapter 19: Network Management
Database Recovery Techniques
Database Recovery Techniques
Database Administration
SMTP SMTP stands for Simple Mail Transfer Protocol. SMTP is used when is delivered from an client, such as Outlook Express, to an server.
8.6. Recovery By Hemanth Kumar Reddy.
Embracing Failure: A Case for Recovery-Oriented Computing
Self Healing and Dynamic Construction Framework:
SMTP SMTP stands for Simple Mail Transfer Protocol. SMTP is used when is delivered from an client, such as Outlook Express, to an server.
Addressing Human Error with Undo
Noah Treuhaft UC Berkeley ROC Group ROC Retreat, January 2002
Bringing Undo to system admin: a new paradigm for recovery
Database Systems: Design, Implementation, and Management Tenth Edition
Software Design and Architecture
Operating System Reliability
Operating System Reliability
Storage Virtualization
Managing Multi-user Databases
Recovery-Oriented Computing
An Introduction to Computer Networking
SpiraTest/Plan/Team Deployment Considerations
Operating System Reliability
Operating System Reliability
TRIP WIRE INTRUSION DETECTION SYSYTEM Presented by.
Chapter 6 – Architectural Design
Database Backup and recovery
Outline Chapter 2 (cont) OS Design OS structure
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
Operating System Reliability
Database Recovery 1 Purpose of Database Recovery
Systems Operations and Support
Operating System Reliability
Operating System Reliability
Presentation transcript:

Undo for Recovery: Approaches and Models Aaron Brown UC Berkeley ROC Group ROC Retreat, January 2002

Outline Motivation The Three R’s: An Undo paradigm Undo challenges System models for Undo Status

Motivation Human operator error is a major cause of system failures people make mistakes, especially under stress systems are not tolerant of human error during system administration Undo is an effective paradigm for tolerating human error recovery-based: repairs unanticipated mistakes familiar model: ubiquitous in productivity applications but rarely found in sysadmin tools

Motivation (2) Undo “fringe benefits” makes sysadmin’s job easier, improving maintainability better maintainability => better dependability enables trial-and-error learning builds sysadmin’s understanding of system helps shift recovery burden from sysadmin to users export recovery to users via familiar undo model example: NetApp snapshots for file restores helps recover from more than just human error SW/HW failure, security breaches, virus infections, ...

Outline Motivation The Three R’s: An Undo paradigm Undo challenges System models for Undo Status

An Undo paradigm ROC Undo combines time travel with repair Let’s talk about what undo should look like An Undo paradigm ROC Undo combines time travel with repair The Three R’s of Undo Rewind: roll system state backwards in time Repair: fix latent or active errors automatically or via human intervention Replay: roll system state forward, replaying user interactions lost during rewind we assume a service model with well-defined user actions All three steps are critical rewind enables undo repair lets user/administrator fix problems replay preserves updates, propagates fixes forward

Undo examples: email Coarse-grained Undo Fine-grained Undo Make sure to talk about replay in these examples! Undo examples: email Coarse-grained Undo roll back OS or app. upgrade without losing user data revert system-wide configuration changes “go back in time” to retroactively install virus filter on email server; effects of virus are squashed on redo Fine-grained Undo undo deletion of a user, mailbox, or email message reverse changes to a user’s profile or filtering rules maybe even unsend mail (?) Undo paradigm must support both granularities esp. if changes unknown

Context Traditional Undo gives only two R’s undo/repair or undo/replay includes models presented at last retreat What about checkpoints, reboot, DB logging? checkpoints gives Rewind only reboot may give Repair, but only for “Heisenbugs” logging can give all 3 R’s but RDBMS logging is insufficient, since system state changes are interdependent and non-transactional 3R-logging requires understanding of dependencies and attention to externalized events

Outline Motivation The Three R’s: An Undo paradigm Undo challenges System models for Undo Status

Challenges in 3R paradigm Integrated coarse- and fine-grained undo coarse-grained undo best handled by physical logging disk snapshots, non-overwriting disks, etc. operates below the OS/application layer fine-grained undo best handled by logical logging at application level, where event semantics are known best: hybrid system with physical logging for Rewind and logical logging for Replay caveat: limits full 3R semantics to logically-logged system state; allows simple undo/redo of unmonitored state i.e., redo of unmonitored state won’t propagate repairs

Challenges in 3R paradigm (2) Handling externalized events externalized event: event visible outside of system example: user downloads/reads email message example: user forwards email message over the Internet undo can invalidate externalized events repair can cause events to change/disappear on replay result: inconsistency between system and external env’t solutions depend on acceptable level of inconsistency human users willing to accept inconsistency in some apps issue compensating or explanatory events delay execution of externalized events for an undo window convert external to internal by expanding system boundary

Challenges in 3R paradigm (3) Managing state dependencies Rewind/Repair cycle can invalidate logged events example: retroactive installation of virus filter invalidates delivery of virus-laden messages Replay system must understand dependencies between logged state and state touched during repair Managing state logs what events and user interactions to log for redo? include administrative events like SW installs? how to present logs to human for editing during repair Minimizing performance and overhead penalties for now, we assume abundant disk and CPU resources

Outline Motivation The Three R’s: An Undo paradigm Undo challenges System models for Undo Status

Towards system models for undo Goal: abstract model for undo-capable system template for constructing undoable services needed to analyze generality and limitations of undo Model components state entities state update events (analogue of transactions) event queues and logs untracked system changes Assumptions storage layer that supports bidirectional time-travel via non-overwriting FS, snapshots, etc. Email as example application

Simple model Entire system is one state entity Email Service State User updates (IMAP) - user state - mailboxes - application - operating system Email delivery (SMTP) synch. untracked changes Time-travel storage Analysis simple, easy to implement, easier to trust, most general huge overhead for fine-grained undo operations serialization bottleneck at single queue/log difficult to distinguish different users’ events

Hierarchical model System composed of multiple state entities each state entity supports undo as in simple model state entities join hierarchically to give multiple granularities of undo Email delivery (SMTP) User updates (IMAP) untracked changes Time- travel store u s e r m u x virus filter Email Service State User 1 state TT store User 2 state TT store Analysis multiple undo granularities reduces overhead, bottlenecks distributed undo possible greater complexity; tricky to coordinate different layers

Outline Motivation The Three R’s: An Undo paradigm Undo challenges System models for Undo Status

Status Refining system model for undo hierarchical seems best bet, but many issues to solve feedback welcome! Learning about real-world email systems to help calibrate the undo model working with Sun/iPlanet Messaging Server team likely will get source code access Continuing maintainability benchmarking work helps illustrate what kinds of things need undo Preparing for proof-of-concept implementation in the context of iPlanet Messaging Server