Undo for Recovery: Approaches and Models Aaron Brown UC Berkeley ROC Group.

Slides:



Advertisements
Similar presentations
Presented by Nikita Shah 5th IT ( )
Advertisements

Fabián E. Bustamante, Winter 2006 Recovery Oriented Computing Embracing Failure A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-
Draft-lemonade-imap-submit-01.txt “Forward without Download” Allow IMAP client to include previously- received message (or parts) in or as new message.
High Availability Group 08: Võ Đức Vĩnh Nguyễn Quang Vũ
Recovery 10/18/05. Implementing atomicity Note, when a transaction commits, the portion of the system implementing durability ensures the transaction’s.
Keith Burns Microsoft UK Mission Critical Database.
Overview Distributed vs. decentralized Why distributed databases
Exchange 2010 Overview Name Title Group. What You Tell Us Communication overload Globally distributed customers and partners High cost of communications.
Transactions and Recovery
Implementing High Availability
Module 8 Implementing Backup and Recovery. Module Overview Planning Backup and Recovery Backing Up Exchange Server 2010 Restoring Exchange Server 2010.
Overview SAP Basis Functions. SAP Technical Overview Learning Objectives What the Basis system is How does SAP handle a transaction request Differentiating.
H-1 Network Management Network management is the process of controlling a complex data network to maximize its efficiency and productivity The overall.
Presented by INTRUSION DETECTION SYSYTEM. CONTENT Basically this presentation contains, What is TripWire? How does TripWire work? Where is TripWire used?
1 Motivation Goal: Create and document a black box availability benchmark Improving dependability requires that we quantify the ROC-related metrics.
IT:Network:Applications Fall  Running one “machine” inside another “machine”  OS in Virtual machines sees ◦ CPU(s) ◦ Memory ◦ Disk ◦ USB ◦ etc.
Distributed File Systems Concepts & Overview. Goals and Criteria Goal: present to a user a coherent, efficient, and manageable system for long-term data.
Database Systems: Design, Implementation, and Management Ninth Edition
Introduction to Systems Analysis and Design Trisha Cummings.
Module 10: Designing an AD RMS Infrastructure in Windows Server 2008.
Introduction to the Enterprise Library. Sounds familiar? Writing a component to encapsulate data access Building a component that allows you to log errors.
9/10/20151 Hyperion Enterprise 6.5 New Features & Functionality Robert Cybulski, CPA Finit Solutions.
Test Organization and Management
ShopKeeper was designed from the ground up to manage your entire fleet maintenance operations … from 1 user to 100, including full security features that.
IT:Network:Applications.  How messaging servers work  Initial tips for success Exchange management  Server roles  Exchange Server Management  Message.
Data Flow Diagrams.
Recovery-Oriented Computing User Study Training Materials October 2003.
Chapter Fourteen Windows XP Professional Fault Tolerance.
Building Undo for Operators: An Undoable Store Aaron Brown ROC Research Group University of California, Berkeley Winter 2003 ROC Research Retreat.
Undo: Update and Futures Aaron Brown ROC Research Group University of California, Berkeley Summer 2003 ROC Retreat 5 June 2003.
Architecture styles Pipes and filters Object-oriented design Implicit invocation Layering Repositories.
Distributed Computing COEN 317 DC2: Naming, part 1.
Rewind, Repair, Replay: Three R’s to improve dependability Aaron Brown and David Patterson ROC Research Group University of California at Berkeley SIGOPS.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
MapReduce and GFS. Introduction r To understand Google’s file system let us look at the sort of processing that needs to be done r We will look at MapReduce.
INFO1408 Database Design Concepts Week 15: Introduction to Database Management Systems.
9 Systems Analysis and Design in a Changing World, Fourth Edition.
Data Sharing. Data Sharing in a Sysplex Connecting a large number of systems together brings with it special considerations, such as how the large number.
9 Systems Analysis and Design in a Changing World, Fourth Edition.
Computer Science and Engineering Computer System Security CSE 5339/7339 Session 21 November 2, 2004.
INTRUSION DETECTION SYSYTEM. CONTENT Basically this presentation contains, What is TripWire? How does TripWire work? Where is TripWire used? Tripwire.
Database Systems Recovery & Concurrency Lecture # 20 1 st April, 2011.
Database Security Cmpe 226 Fall 2015 By Akanksha Jain Jerry Mengyuan Zheng.
1 5/18/2007ã 2007, Spencer Rugaber Architectural Styles and Non- Functional Requirements Jan Bosch. Design and Use of Software Architectures. Addison-Wesley,
A Quick Look At How Works Understanding the basics of how works can make life a lot easier for any user. Especially those who are interested.
Lecture 4 Page 1 CS 111 Online Modularity and Virtualization CS 111 On-Line MS Program Operating Systems Peter Reiher.
Spheres of Undo: A Framework for Extending Undo Aaron Brown January 2004 ROC Retreat.
TOPIC 7.0 LINUX SERVICES AND CONFIGURATION. ROOT USER Root user is called “super user” because it has power far beyond those of mortal user. As root,
MCSE Guide to Microsoft Exchange Server 2003 Administration Chapter One Introduction to Exchange Server 2003.
Please note that the session topic has changed
Computer Science Lecture 19, page 1 CS677: Distributed OS Last Class: Fault tolerance Reliable communication –One-one communication –One-many communication.
Prepare, Migrate, and Operate with Barracuda. Who is Barracuda Networks? NYSE: CUDA Founded in 2003, IPO 2013 Headquartered in Silicon Valley, offices.
Rewind, Repair, Replay: Three R’s to cope with operator error Aaron Brown UC Berkeley ROC Group IBM Almaden, 22 March 2002.
Fix: Windows 10 Error Code 0x in Mail App u/6/b/ /alexwaston14/reimage-system-repair/ /pages/Reimage-Repair-Tool/
Improve the Performance, Scalability, and Reliability of Applications in the Cloud with jetNEXUS Load Balancer for Microsoft Azure MICROSOFT AZURE ISV.
9 Systems Analysis and Design in a Changing World, Fifth Edition.
SQL Database Management
Database Administration
Embracing Failure: A Case for Recovery-Oriented Computing
draft-lemonade-imap-submit-01.txt “Forward without Download”
Addressing Human Error with Undo
Bringing Undo to system admin: a new paradigm for recovery
Storage Virtualization
Recovery-Oriented Computing
Undo for Recovery: Approaches and Models
TRIP WIRE INTRUSION DETECTION SYSYTEM Presented by.
Chapter 6 – Architectural Design
Time Gathering Systems Secure Data Collection for IBM System i Server
Systems Operations and Support
Authority on Demand Control Authority Rights & Emergency Access
Presentation transcript:

Undo for Recovery: Approaches and Models Aaron Brown UC Berkeley ROC Group

2 Motivation Human operator error is a major cause of system failures –systems are not tolerant of human error during system administration Undo effectively tolerates human error –recovery-based: repairs unanticipated mistakes –familiar model: ubiquitous in productivity applications Undo has “fringe benefits” –makes sysadmin’s job easier, improving maintainability –enables trial-and-error learning –helps shift recovery burden from sysadmin to users –helps recover from more than just human error »SW/HW failure, security breaches, virus infections,...

3 An Undo paradigm ROC Undo combines time travel with repair The Three R’s of Undo –Rewind: roll system state backwards in time –Repair: fix latent or active errors »automatically or via human intervention –Replay: roll system state forward, replaying user interactions lost during rewind »we assume a service model with well-defined user actions All three steps are critical –rewind enables undo –repair lets user/administrator fix problems –replay preserves updates, propagates fixes forward Let’s talk about what undo should look like

4 Undo examples: Coarse-grained Undo –roll back OS or app. upgrade without losing user data –revert system-wide configuration changes –“go back in time” to retroactively install virus filter on server; effects of virus are squashed on redo Fine-grained Undo –undo deletion of a user, mailbox, or message –reverse changes to a user’s profile or filtering rules –maybe even unsend mail (?) Undo paradigm must support both granularities esp. if changes unknown Make sure to talk about replay in these examples!

5 Challenges in 3R paradigm Handling externalized events –externalized event: event visible outside of system »example: user downloads/reads message »example: user forwards message over the Internet –undo can invalidate externalized events »repair can cause events to change/disappear on replay »result: inconsistency between system and external env’t –solutions depend on acceptable level of inconsistency »human users willing to accept inconsistency in some apps »issue compensating or explanatory events »delay execution of externalized events for an undo window »convert external to internal by expanding system boundary

6 Challenges in 3R paradigm (2) Integrated coarse- and fine-grained undo –coarse-grained undo best handled by physical logging –fine-grained undo best handled by logical logging –best: hybrid system with physical logging for Rewind and logical logging for Replay »caveat: limits full 3R semantics to logically-logged system state; allows simple undo/redo of unmonitored state i.e., redo of unmonitored state won’t propagate repairs Managing state dependencies –Rewind/Repair cycle can invalidate logged events –Replay system must understand dependencies between logged state and state touched during repair

7 Towards system models for undo Goal: abstract model for undo-capable system –template for constructing undoable services –needed to analyze generality and limitations of undo Model components –state entities –state update events (analogue of transactions) –event queues and logs –untracked system changes Assumptions –storage layer that supports bidirectional time-travel »via non-overwriting FS, snapshots, etc. as example application

8 Simple model –Analysis +simple, easy to implement, easier to trust, most general –huge overhead for fine-grained undo operations –serialization bottleneck at single queue/log –difficult to distinguish different users’ events Entire system is one state entity Service State - user state - mailboxes - application - operating system Time-travel storage delivery (SMTP) User updates (IMAP) synch. untracked changes

9 Hierarchical model System composed of multiple state entities –each state entity supports undo as in simple model –state entities join hierarchically to give multiple granularities of undo –Analysis +multiple undo granularities reduces overhead, bottlenecks +distributed undo possible –greater complexity; tricky to coordinate different layers delivery (SMTP) User updates (IMAP) untracked changes Time- travel store User 1 state TT store User 2 state TT store useruser muxmux virus filter Service State

10 Status Refining system model for undo –hierarchical seems best bet, but many issues to solve –feedback welcome! Learning about real-world systems –to help calibrate the undo model –working with Sun/iPlanet Messaging Server team »likely will get source code access Continuing maintainability benchmarking work –helps illustrate what kinds of things need undo Preparing for proof-of-concept implementation –in the context of iPlanet Messaging Server