Undo for Recovery: Approaches and Models


Undo for Recovery: Approaches and Models
Aaron Brown, UC Berkeley ROC Group
ROC Retreat, January 2002

Outline
- Motivation
- The Three R’s: An Undo paradigm
- Undo challenges
- System models for Undo
- Status

Motivation
- Human operator error is a major cause of system failures
  - people make mistakes, especially under stress
  - systems are not tolerant of human error during system administration
- Undo is an effective paradigm for tolerating human error
  - recovery-based: repairs unanticipated mistakes
  - familiar model: ubiquitous in productivity applications
  - but rarely found in sysadmin tools

Motivation (2)
- Undo “fringe benefits”
  - makes sysadmin’s job easier, improving maintainability
    - better maintainability => better dependability
  - enables trial-and-error learning
    - builds sysadmin’s understanding of system
  - helps shift recovery burden from sysadmin to users
    - export recovery to users via familiar undo model
    - example: NetApp snapshots for file restores
  - helps recover from more than just human error
    - SW/HW failure, security breaches, virus infections, ...


An Undo paradigm
- ROC Undo combines time travel with repair
- The Three R’s of Undo
  - Rewind: roll system state backwards in time
  - Repair: fix latent or active errors
    - automatically or via human intervention
  - Replay: roll system state forward, replaying user interactions lost during rewind
    - we assume a service model with well-defined user actions
- All three steps are critical
  - rewind enables undo
  - repair lets user/administrator fix problems
  - replay preserves updates, propagates fixes forward
(Speaker note: Let’s talk about what undo should look like.)
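The Rewind/Repair/Replay cycle can be illustrated with a minimal Python sketch. It is not the ROC implementation: the class `UndoableMailStore`, its methods, and the `accept` policy hook are all hypothetical names chosen for illustration, and the sketch assumes that whole-state snapshots stand in for time-travel storage and that every user action is a simple `(user, message)` delivery.

```python
import copy


class UndoableMailStore:
    """Toy email store: snapshots support Rewind, an action log supports Replay."""

    def __init__(self):
        self.mailboxes = {}              # user -> list of delivered messages
        self.log = []                    # every (user, message) delivery attempt
        self.snapshots = []              # (log position, copy of mailboxes)
        self.accept = lambda msg: True   # delivery policy; Repair may replace it

    def deliver(self, user, msg):
        """A well-defined user action: log it, then apply it if policy allows."""
        self.log.append((user, msg))
        if self.accept(msg):
            self.mailboxes.setdefault(user, []).append(msg)

    def snapshot(self):
        """Record the current state and log position; return a snapshot id."""
        self.snapshots.append((len(self.log), copy.deepcopy(self.mailboxes)))
        return len(self.snapshots) - 1

    def rewind(self, snap_id):
        """Rewind: restore the snapshot and return the actions lost by doing so."""
        log_pos, saved = self.snapshots[snap_id]
        lost = self.log[log_pos:]
        self.log = self.log[:log_pos]
        self.mailboxes = copy.deepcopy(saved)
        return lost

    def replay(self, actions):
        """Replay: re-execute the lost actions against the repaired system."""
        for user, msg in actions:
            self.deliver(user, msg)


store = UndoableMailStore()
snap = store.snapshot()
store.deliver("alice", "hello")
store.deliver("alice", "VIRUS payload")
store.deliver("bob", "lunch?")

lost = store.rewind(snap)                       # Rewind
store.accept = lambda msg: "VIRUS" not in msg   # Repair: retroactive virus filter
store.replay(lost)                              # Replay
print(store.mailboxes)   # {'alice': ['hello'], 'bob': ['lunch?']}
```

Replay goes through the same `deliver` path as live traffic, which is how the repair propagates forward: the legitimate messages are preserved while the infected one is squashed.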

Undo examples: email
- Coarse-grained Undo
  - roll back OS or app. upgrade without losing user data
  - revert system-wide configuration changes
  - “go back in time” to retroactively install virus filter on email server; effects of virus are squashed on redo
- Fine-grained Undo
  - undo deletion of a user, mailbox, or email message
  - reverse changes to a user’s profile or filtering rules
  - maybe even unsend mail (?)
- Undo paradigm must support both granularities
  - esp. if changes unknown
(Speaker note: Make sure to talk about replay in these examples!)

Context
- Traditional Undo gives only two R’s
  - undo/repair or undo/replay
  - includes models presented at last retreat
- What about checkpoints, reboot, DB logging?
  - checkpoints give Rewind only
  - reboot may give Repair, but only for “Heisenbugs”
  - logging can give all 3 R’s
    - but RDBMS logging is insufficient, since system state changes are interdependent and non-transactional
    - 3R-logging requires understanding of dependencies and attention to externalized events


Challenges in 3R paradigm
- Integrated coarse- and fine-grained undo
  - coarse-grained undo best handled by physical logging
    - disk snapshots, non-overwriting disks, etc.
    - operates below the OS/application layer
  - fine-grained undo best handled by logical logging
    - at application level, where event semantics are known
  - best: hybrid system with physical logging for Rewind and logical logging for Replay
    - caveat: limits full 3R semantics to logically-logged system state; allows simple undo/redo of unmonitored state
    - i.e., redo of unmonitored state won’t propagate repairs

Challenges in 3R paradigm (2)
- Handling externalized events
  - externalized event: event visible outside of system
    - example: user downloads/reads email message
    - example: user forwards email message over the Internet
  - undo can invalidate externalized events
    - repair can cause events to change/disappear on replay
    - result: inconsistency between system and external environment
  - solutions depend on acceptable level of inconsistency
    - human users willing to accept inconsistency in some apps
    - issue compensating or explanatory events
    - delay execution of externalized events for an undo window
    - convert external to internal by expanding system boundary
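One of the listed options, delaying externalized events for an undo window, can be sketched as follows. This is an illustrative assumption rather than anything from the talk: `DelayedOutbox` and its methods are hypothetical, and the clock is passed in explicitly as `now` (in seconds) so the behavior is deterministic.

```python
class DelayedOutbox:
    """Holds outbound events until an undo window expires, so undo can cancel them."""

    def __init__(self, undo_window):
        self.undo_window = undo_window   # seconds an event stays cancellable
        self.pending = {}                # event id -> (release time, payload)
        self.sent = []                   # payloads that have left the system
        self._next_id = 0

    def submit(self, payload, now):
        """Queue an event; it is externalized only after the window has passed."""
        event_id = self._next_id
        self._next_id += 1
        self.pending[event_id] = (now + self.undo_window, payload)
        return event_id

    def cancel(self, event_id):
        """Undo an event that is still internal. Returns False if already sent."""
        return self.pending.pop(event_id, None) is not None

    def flush(self, now):
        """Externalize every event whose window has expired; return what was sent."""
        due = sorted(eid for eid, (t, _) in self.pending.items() if t <= now)
        released = [self.pending.pop(eid)[1] for eid in due]
        self.sent.extend(released)
        return released


outbox = DelayedOutbox(undo_window=30.0)
keep = outbox.submit("forward: report.pdf", now=0.0)
drop = outbox.submit("forward: infected.exe", now=5.0)
outbox.cancel(drop)              # still inside the window, so undo is clean
print(outbox.flush(now=30.0))    # ['forward: report.pdf']
```

The trade-off is latency: a longer window makes more events undoable without inconsistency, but delays every externalized action by the same amount.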

Challenges in 3R paradigm (3)
- Managing state dependencies
  - Rewind/Repair cycle can invalidate logged events
    - example: retroactive installation of virus filter invalidates delivery of virus-laden messages
  - Replay system must understand dependencies between logged state and state touched during repair
- Managing state logs
  - what events and user interactions to log for redo?
    - include administrative events like SW installs?
  - how to present logs to human for editing during repair
- Minimizing performance and overhead penalties
  - for now, we assume abundant disk and CPU resources


Towards system models for undo
- Goal: abstract model for undo-capable system
  - template for constructing undoable services
  - needed to analyze generality and limitations of undo
- Model components
  - state entities
  - state update events (analogue of transactions)
  - event queues and logs
  - untracked system changes
- Assumptions
  - storage layer that supports bidirectional time-travel via non-overwriting FS, snapshots, etc.
- Email as example application

Simple model
- Entire system is one state entity
- [Figure: Email Service State (user state, mailboxes, application, operating system); inputs: User updates (IMAP), Email delivery (SMTP), untracked changes; other labels: synch., Time-travel storage]
- Analysis
  - simple, easy to implement, easier to trust, most general
  - huge overhead for fine-grained undo operations
  - serialization bottleneck at single queue/log
  - difficult to distinguish different users’ events

Hierarchical model
- System composed of multiple state entities
  - each state entity supports undo as in simple model
  - state entities join hierarchically to give multiple granularities of undo
- [Figure labels: Email delivery (SMTP), User updates (IMAP), untracked changes, Time-travel store, user mux, virus filter, Email Service State, User 1 state with TT store, User 2 state with TT store]
- Analysis
  - multiple undo granularities
  - reduces overhead, bottlenecks
  - distributed undo possible
  - greater complexity; tricky to coordinate different layers
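A rough Python sketch of the hierarchical idea follows, under simplifying assumptions: each state entity keeps its own event log and snapshots (standing in for a per-entity time-travel store), and a parent service routes events to per-user children. The names `StateEntity` and `EmailService` are hypothetical, and the sketch omits the virus filter, untracked changes, and cross-layer coordination that the slide flags as the hard part.

```python
class StateEntity:
    """One undoable unit with its own state, event log, and snapshots."""

    def __init__(self, name):
        self.name = name
        self.state = []         # e.g. the messages in one user's mailbox
        self.log = []           # state update events applied to this entity
        self._snapshots = []    # (log position, copy of state)

    def apply(self, event):
        self.log.append(event)
        self.state.append(event)

    def snapshot(self):
        self._snapshots.append((len(self.log), list(self.state)))
        return len(self._snapshots) - 1

    def rewind(self, snap_id):
        """Restore this entity only; return the events lost by the rewind."""
        pos, saved = self._snapshots[snap_id]
        lost = self.log[pos:]
        del self.log[pos:]
        self.state = list(saved)
        return lost


class EmailService:
    """Parent entity: routes each event to the matching per-user child entity."""

    def __init__(self, users):
        self.users = {u: StateEntity(u) for u in users}

    def deliver(self, user, msg):
        self.users[user].apply(msg)

    def snapshot_all(self):
        return {u: e.snapshot() for u, e in self.users.items()}

    def rewind_all(self, snaps):
        """Coarse-grained undo: rewind every child to its snapshot."""
        return {u: self.users[u].rewind(s) for u, s in snaps.items()}


svc = EmailService(["alice", "bob"])
snaps = svc.snapshot_all()
svc.deliver("alice", "a1")
svc.deliver("bob", "b1")

# Fine-grained undo: only alice's entity is rewound; bob is untouched.
print(svc.users["alice"].rewind(snaps["alice"]))   # ['a1']
print(svc.users["bob"].state)                      # ['b1']
```

Because each user has a separate log, a per-user undo touches only that user's events, which is the overhead and bottleneck reduction the analysis mentions. `rewind_all` gives the coarse granularity by composing the children.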


Status
- Refining system model for undo
  - hierarchical seems best bet, but many issues to solve
  - feedback welcome!
- Learning about real-world email systems to help calibrate the undo model
  - working with Sun/iPlanet Messaging Server team
  - likely will get source code access
- Continuing maintainability benchmarking work
  - helps illustrate what kinds of things need undo
- Preparing for proof-of-concept implementation
  - in the context of iPlanet Messaging Server