Addressing Human Error with Undo

Addressing Human Error with Undo
Aaron Brown
ROC Retreat, June 2001

Outline
- Motivation: importance of human error during system maintenance
- Challenge: providing recovery from human error
- Solution: undo
  - defining an undo paradigm for system administration
  - implementation techniques for sysadmin undo
- Status and plans

Motivation: human error is important
- Half of system failures are from human error
  - Oracle: half of DB failures due to human error (1999)
  - Gray/Tandem: 42% of failures from human administrator errors (1986)
  - Murphy/Gent study of VAX systems (1993):
    [Figure: Causes of system crashes. Axes: Time (1985-1993) vs. % of System Crashes. Series: System management, Software failure, Hardware failure, Other. Value labels on the chart: 18%, 53%, 18%, 10%.]

Human error is important (2)
- More data: telephone network failures
  - FCC records, 1992-4; from Kuhn, Computer 30(4), ’97
  - half of outages, outage-minutes are human-related
  - about 25% are direct result of maintenance errors by phone company workers

Don’t just blame the operator!
- Typical response of system designers is to blame the operator. But...
- Psychology shows that human errors are inevitable [see J. Reason, Human Error, 1990]
  - humans prone to slips & lapses even on familiar tasks
    - 60% of errors are on “skill-based” automatic tasks
  - also prone to mistakes when tasks become difficult
    - 30% of errors on “rule-based” reasoning tasks
    - 10% of errors on “knowledge-based” tasks that require novel reasoning from first principles
- Allowing human error can even be beneficial
  - mistakes are a part of trial-and-error reasoning
  - trial & error is needed to solve knowledge-based tasks
  - fear of error can stymie innovation and learning

Recovery from human error
- ROC principle: recovery from human error, not avoidance
  - accepts inevitability of errors
  - promotes better human-system interaction by enabling trial-and-error
  - improves other forms of system recovery
- Recovery mechanism: Undo
  - ubiquitous and well-proven in productivity applications
  - unusual in system maintenance
    - primitive versions exist (backup, standby machines, ...)
    - but not well-matched to human error or interaction patterns

Undo paradigms
- An effective undo paradigm matches the needs of its target environment
  - cannot reuse existing undo paradigms for system maintenance
- We need a new undo paradigm for maintenance
  - plan:
    - lay out the design space
    - pick a tentative undo paradigm
    - carry out experiments to validate the paradigm
- Underlying assumption: service model
  - single application
  - users access via well-defined network requests

Issue #1: Choice of undo model
- Undo model defines the view of past history
- Spectrum of model options, from simplicity to flexibility:
  - single undo
  - single undo/redo
  - multiple linear undo
  - multiple linear undo/redo
  - linearized branching undo/redo
  - branching undo/redo
  - branching undo/redo w/deletion
  - example applications marked on the spectrum: MS Office, emacs
- Important choices:
  - undo only, or undo/redo?
  - single, linear, or branching?
  - deletion or no deletion?
- Tentative choice for maintenance undo
  - Branching is important for trial and error
  - [Figure: trial-and-error history pattern, with states numbered 1 to 5 and an undo step "u"]
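To make the difference between linear and branching undo concrete, here is a minimal Python sketch of a branching undo/redo history. It is an illustration only, not the design from the talk: the `HistoryNode` and `BranchingHistory` names, the tree layout, and the string "states" are all assumptions made for this example.

```python
from typing import Any, List, Optional


class HistoryNode:
    """One recorded state in the history tree (hypothetical structure)."""

    def __init__(self, state: Any, parent: Optional["HistoryNode"] = None) -> None:
        self.state = state
        self.parent = parent
        self.children: List["HistoryNode"] = []


class BranchingHistory:
    """Branching undo/redo: undo never discards states; new work forks a branch."""

    def __init__(self, initial_state: Any) -> None:
        self.root = HistoryNode(initial_state)
        self.current = self.root

    def record(self, state: Any) -> HistoryNode:
        """Record a new state as a child of the current one."""
        node = HistoryNode(state, parent=self.current)
        self.current.children.append(node)
        self.current = node
        return node

    def undo(self) -> Any:
        """Step back to the parent state; the abandoned branch stays in the tree."""
        if self.current.parent is None:
            raise ValueError("already at the oldest recorded state")
        self.current = self.current.parent
        return self.current.state

    def redo(self, branch: int = -1) -> Any:
        """Step forward along one child branch (default: the most recent one)."""
        if not self.current.children:
            raise ValueError("nothing to redo from this state")
        self.current = self.current.children[branch]
        return self.current.state


# Trial-and-error pattern: try, undo, try something else, then compare both.
history = BranchingHistory("v0")
history.record("v1")
history.record("v2-bad-config")    # first attempt goes wrong
history.undo()                     # back to v1; the failed attempt is kept
history.record("v2-fixed")         # second attempt forks a new branch
history.undo()
first_attempt = history.redo(0)    # revisit the first attempt
history.undo()
second_attempt = history.redo()    # return to the most recent attempt
```

In a linear undo/redo model, recording "v2-fixed" after the undo would have discarded "v2-bad-config"; keeping both branches is what lets an administrator go back and compare attempts.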

More undo issues
2) Representation
  - does undo act on states or actions?
  - how are the states/actions named? (TBD)
3) Selection of undo points
  - granularity: undo points at each state change/action? or at checkpoints of some granularity?
  - are undo points administrator- or system-defined?
(Tentative maintenance undo choices are shown in red on the original slide.)

More undo issues (2)
4) Scope of undo
  - “what state can be recovered by undo?”
  - single-node, multi-node, multi-node+network?
  - on each node:
    - system hardware state: BIOS, hardware configs?
    - disk state: user, application, OS/system?
    - soft state: process, OS, full-machine checkpoints?
(Tentative maintenance undo goals are shown in red on the original slide.)

More undo issues (3)
5) Transparency to service user
  - ideally:
    - undo of system state preserves user data & updates
    - user always sees consistent, forward-moving timeline
    - undo has no user-visible impact on data or service availability

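Context: other undo mechanisms
Design axes compared for each undo mechanism: undo model, representation, undo-point selection, scope, transparency

- Desired maintenance-undo semantics
  - undo model: branching undo/redo
  - representation: state, naming TBD
  - undo-point selection: automatic checkpoints
  - scope: all disk & HW, all nodes & network
  - transparency: high
- Geoplex site failover
  - undo model: single undo
  - representation: state, unnamed
  - undo-point selection: varies; usu. automatic checkpoints
  - scope: entire system
- Tape backup
  - undo model: single or multiple linear undo
  - representation: state, ad-hoc naming
  - undo-point selection: manual checkpoints
  - scope: disk (1 FS), single node
  - transparency: low
- GoBack®
  - undo model: linearized branching undo/redo
  - representation: state, temporal naming
  - scope: disk (all), single node
  - transparency: low-medium
- Netapp Snapshots
  - undo model: multiple linear undo
  - representation: state, temporal naming
  - scope: disk (all), single server
- DBMS logging (for txn abort)
  - representation: hybrid, unnamed
  - scope: single txn, app-level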

Implementing maintenance undo
- Saving state: disk
  - apply snapshot or logging techniques to disk state
    - e.g., NetApp- or VMware-style block snapshots, or LFS
  - all state, including OS, application binaries, config files
  - leverage excess of cheap, fast storage
  - integrate “time travel” with native storage mechanism for efficiency
- Saving state: hardware
  - periodically discover and log hardware configuration
  - can’t automatically undo all hardware changes, but can direct administrator to restore configuration
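As a rough illustration of snapshot-based state saving for disk, the following Python sketch keeps point-in-time copies of a block map and rolls the live state back to one of them. It is a toy in-memory model under stated assumptions: `SnapshotStore` and its methods are hypothetical names, and it does not reproduce how NetApp, VMware, or LFS actually implement snapshots.

```python
from typing import Dict, List


class SnapshotStore:
    """Toy in-memory block store with snapshots (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._live: Dict[int, bytes] = {}             # block number -> contents
        self._snapshots: List[Dict[int, bytes]] = []  # saved block maps

    def write(self, block: int, data: bytes) -> None:
        """Overwrite one block in the live state."""
        self._live[block] = data

    def read(self, block: int) -> bytes:
        """Read one block from the live state (empty if never written)."""
        return self._live.get(block, b"")

    def snapshot(self) -> int:
        """Save the current block map and return its snapshot id.

        Only the map is copied; the immutable block contents are shared
        between the live state and the snapshot.
        """
        self._snapshots.append(dict(self._live))
        return len(self._snapshots) - 1

    def rollback(self, snapshot_id: int) -> None:
        """Restore the live state to a saved snapshot (the undo step)."""
        self._live = dict(self._snapshots[snapshot_id])


# Example: an upgrade overwrites a config block, then is undone.
store = SnapshotStore()
store.write(0, b"config-v1")
before_upgrade = store.snapshot()
store.write(0, b"config-v2-broken")
store.write(1, b"new-binary")
store.rollback(before_upgrade)
```

After the rollback, block 0 holds its pre-upgrade contents again and block 1 is gone, which matches the slide's point that all disk state (OS, binaries, config files) is covered by the same mechanism.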

Implementing maintenance undo (2)
- Providing transparency
  - queue & log user requests at edge of system, in format of original request protocol
  - correlate undo points to points in request log
  - snoop/replay log to satisfy user requests during undo
  - [Figure: timeline of user requests (REQ) showing current (real) time, system logical time, and the point where undo is invoked (u)]
- An undo UI
  - should visually display branching structure
  - must provide way to name and select undo points, show changes between points
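The transparency steps above (log requests at the edge, tie undo points to log positions, replay after undo) can be sketched in Python as follows. This is a simplified assumption-laden model: `RequestLog`, `undo_and_replay`, the in-memory `mailbox`, and the `checkpoints` dictionary are hypothetical names invented for the example, and real requests would be stored in their original protocol format rather than as short strings.

```python
from typing import Callable, Dict, List


class RequestLog:
    """Logs user requests at the system edge and ties undo points to log positions."""

    def __init__(self) -> None:
        self._requests: List[str] = []
        self._undo_points: Dict[str, int] = {}

    def log(self, request: str) -> None:
        """Append one user request to the log."""
        self._requests.append(request)

    def mark_undo_point(self, name: str) -> None:
        """Correlate a named undo point with the current end of the log."""
        self._undo_points[name] = len(self._requests)

    def requests_since(self, name: str) -> List[str]:
        """Return the requests logged after the named undo point."""
        return list(self._requests[self._undo_points[name]:])


def undo_and_replay(log: RequestLog, name: str,
                    restore: Callable[[str], None],
                    apply: Callable[[str], None]) -> List[str]:
    """Restore system state to an undo point, then replay later user requests."""
    restore(name)
    replayed = log.requests_since(name)
    for request in replayed:
        apply(request)
    return replayed


# Example with a toy mail service whose state is a list of delivered messages.
mailbox: List[str] = []
checkpoints: Dict[str, List[str]] = {}
log = RequestLog()


def deliver(request: str) -> None:
    mailbox.append(request)


def handle(request: str) -> None:
    log.log(request)      # log at the edge before the service sees it
    deliver(request)


def restore(name: str) -> None:
    mailbox[:] = checkpoints[name]


handle("msg-1")
checkpoints["before-upgrade"] = list(mailbox)
log.mark_undo_point("before-upgrade")
handle("msg-2")
mailbox.append("garbage-from-bad-upgrade")   # administrator error damages state
handle("msg-3")
replayed = undo_and_replay(log, "before-upgrade", restore, deliver)
```

After the undo, the damage from the maintenance error is gone while the user messages that arrived in the meantime are still delivered, which is the consistent, forward-moving timeline the transparency goal asks for.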

Status and plans
- Status
  - starting human experiments to pin down undo paradigm
    - subjects are asked to configure and upgrade a 3-tier e-commerce system using HOWTO-style documentation
    - we monitor their mistakes and identify where and how undo would be useful
  - experiments also used to evaluate existing undo mechanisms like those in GoBack and VMware
- Plans
  - finalize choice of undo paradigm
  - build proof-of-concept implementation in Internet email service on ROC-1 cluster
  - evaluate effectiveness and transparency with further experiments