An Overview of Human Error
Drawn from J. Reason, Human Error, Cambridge University Press, 1990
Aaron Brown, CS 294-4 ROC Seminar

Outline
- Human error and computer system failures
- A theory of human error
- Human error and accident theory
- Addressing human error

Dependability and human error
Industry data shows that human error is the largest contributor to reduced dependability:
- HP HA labs: human error is the #1 cause of failures (2001)
- Oracle: half of database failures are due to human error (1999)
- Gray/Tandem: 42% of failures stem from human administrator errors (1986)
- Murphy/Gent study of VAX systems (1993), causes of system crashes 1985-1993: system management 53%, software failure 18%, hardware failure 10% (other causes make up the remainder)

Learning from other fields: PSTN
- FCC-collected data on outages in the US public switched telephone network
- Metric: breakdown of customer calls blocked by system outages, excluding natural disasters (Jan-June 2001)
- Human error accounts for 56% of all blocked calls
- Comparison with 1992-94 data shows that human error is the only factor that is not improving over time

Learning from other fields: PSTN
PSTN trends, 1992-94 vs. 2001, in millions of blocked customer minutes per month:

    Cause                   1992-94    2001
    Human error: company         98     176
    Human error: external       100      75
    Hardware                     49       ?
    Software                     15      12
    Overload                    314      60
    Vandalism                     5       3

Learning from experiments
Human error rates during maintenance of a software RAID system:
- Participants attempted to repair RAID disk failures by replacing the broken disk and reconstructing the data
- Each participant repeated the task several times
- Data were aggregated across 5 participants
Observed error types: fatal data loss, unsuccessful repair, system ignored fatal input, user error requiring intervention, and user error recovered by the user.
Total number of trials: Windows 35, Solaris 33, Linux 31.

Learning from experiments
- Errors occur despite experience: training and familiarity don't eliminate errors, but the types of errors change (mistakes vs. slips/lapses)
- System design affects error-susceptibility

Outline
- Human error and computer system failures
- A theory of human error
- Human error and accident theory
- Addressing human error

A theory of human error
(distilled from J. Reason, Human Error, 1990)
Preliminaries: the three stages of cognitive processing for tasks:
1) Planning: a goal is identified and a sequence of actions is selected to reach it
2) Storage: the selected plan is stored in memory until it is appropriate to carry it out
3) Execution: the plan is implemented by carrying out the actions it specifies

A theory of human error (2)
Each cognitive stage has an associated form of error:
- Slips (execution stage): incorrect execution of a planned action. Example: a miskeyed command.
- Lapses (storage stage): incorrect omission of a stored, planned action. Examples: skipping a step on a checklist, or forgetting to restore normal valve settings after maintenance.
- Mistakes (planning stage): the plan itself is not suitable for achieving the desired goal. Example: the Three Mile Island (TMI) operators prematurely disabling the high-pressure injection (HPI) pumps.
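
Stated as data, the mapping is small enough to write down directly. A minimal Python sketch (the enum names are mine; only the taxonomy itself is Reason's):

    from enum import Enum

    class Stage(Enum):
        PLANNING = "planning"
        STORAGE = "storage"
        EXECUTION = "execution"

    class ErrorForm(Enum):
        MISTAKE = "mistake"  # the plan itself is wrong for the goal
        LAPSE = "lapse"      # a stored step is omitted
        SLIP = "slip"        # a planned action is executed incorrectly

    # Each cognitive stage has exactly one characteristic error form.
    ERROR_FORM_FOR_STAGE = {
        Stage.PLANNING: ErrorForm.MISTAKE,
        Stage.STORAGE: ErrorForm.LAPSE,
        Stage.EXECUTION: ErrorForm.SLIP,
    }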

Origins of error: the GEMS model
GEMS, the Generic Error-Modeling System, is an attempt to understand the origins of human error. It identifies three levels of cognitive task processing:
- Skill-based: familiar, automatic procedural tasks, usually low-level, like knowing to type "ls" to list files
- Rule-based: tasks approached by pattern-matching against a set of internal problem-solving rules: "observed symptoms X mean the system is in state Y"; "if the system is in state Y, I should probably do Z to fix it"
- Knowledge-based: tasks approached by reasoning from first principles, when rules and experience don't apply
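
The rule-based level in particular maps naturally onto condition-action rules. A minimal sketch of that level; the rule contents here are hypothetical illustrations, not from the talk:

    # Condition-action rules of the kind GEMS describes at the rule-based
    # level: "symptoms X mean state Y; in state Y, do Z to fix it".
    RULES = [
        # (symptom predicate, diagnosed state, corrective action)
        (lambda s: s.get("disk_errors", 0) > 0, "failing disk", "replace the disk"),
        (lambda s: s.get("load", 0.0) > 0.95,   "overloaded",   "shed or add capacity"),
    ]

    def rule_based_response(symptoms):
        """Pattern-match observed symptoms against the internal rule set."""
        for matches, state, action in RULES:
            if matches(symptoms):
                return state, action
        # No rule fires: fall back to slow, error-prone knowledge-based
        # reasoning from first principles.
        return None, "reason from first principles (knowledge-based level)"

    print(rule_based_response({"disk_errors": 3}))  # ('failing disk', 'replace the disk')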

GEMS and errors
Errors can occur at each level:
- Skill-based: slips and lapses, usually errors of inattention or misplaced attention
- Rule-based: mistakes, usually the result of picking an inappropriate rule, caused by a misconstrued view of system state, over-zealous pattern matching, frequency gambling, or deficient rules
- Knowledge-based: mistakes due to incomplete or inaccurate understanding of the system, confirmation bias, overconfidence, cognitive strain, ...
Errors can also result from operating at the wrong level: humans are reluctant to move from the rule-based to the knowledge-based level even when the rules aren't working.

Error frequencies
In raw frequencies, SB >> RB > KB:
- 61% of errors occur at the skill-based level
- 27% of errors occur at the rule-based level
- 11% of errors occur at the knowledge-based level
But if we look at opportunities for error, the order reverses: humans perform vastly more SB tasks than RB tasks, and vastly more RB than KB, so a given KB task is more likely to result in error than a given RB or SB task.
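
The reversal is easy to see with a back-of-envelope calculation; the 61/27/11 split is from the slide, while the task counts are illustrative assumptions:

    # Illustrative assumption: skill-based task opportunities vastly outnumber
    # rule-based ones, which vastly outnumber knowledge-based ones.
    opportunities = {"SB": 100_000, "RB": 10_000, "KB": 1_000}
    error_share   = {"SB": 0.61,    "RB": 0.27,   "KB": 0.11}   # from the slide
    total_errors  = 1_000                                       # arbitrary scale

    for level in ("SB", "RB", "KB"):
        per_opportunity = error_share[level] * total_errors / opportunities[level]
        print(f"{level}: {per_opportunity:.4f} errors per task opportunity")
    # SB: 0.0061, RB: 0.0270, KB: 0.1100 -- the order reverses.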

Error detection and correction
The basic detection mechanism is self-monitoring: periodic attentional checks, measurement of progress toward the goal, discovery of surprise inconsistencies, ...
Effectiveness of self-detection of errors:
- SB errors: 75-95% detected (avg 86%), though some lapse-type errors were resistant to detection
- RB errors: 50-90% detected (avg 73%)
- KB errors: 50-80% detected (avg 70%)
Including correction tells a different story:
- SB: ~70% of all errors detected and corrected
- RB: ~50% detected and corrected
- KB: ~25% detected and corrected
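
Dividing the two sets of figures gives a rough sense of how correction falls off faster than detection (a back-of-envelope calculation, assuming the corrected figures are fractions of all errors):

    detected  = {"SB": 0.86, "RB": 0.73, "KB": 0.70}   # average self-detection rates
    corrected = {"SB": 0.70, "RB": 0.50, "KB": 0.25}   # detected AND corrected

    for level in ("SB", "RB", "KB"):
        fixed_if_detected = corrected[level] / detected[level]
        print(f"{level}: {fixed_if_detected:.0%} of detected errors also get corrected")
    # SB: 81%, RB: 68%, KB: 36% -- correction degrades much faster than detection.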

Outline
- Human error and computer system failures
- A theory of human error
- Human error and accident theory
- Addressing human error

Human error and accident theory
Major systems accidents ("normal accidents") start with an accumulation of latent errors, and most of those latent errors are human errors:
- Latent slips/lapses, particularly in maintenance. Example: the misconfigured valves at TMI.
- Latent mistakes in system design, organization, and planning, particularly of emergency procedures. Example: flowcharts that omit unforeseen paths.
Invisible latent errors change system reality without altering the operator's mental model, so seemingly-correct actions can then trigger accidents.

Accident theory (2)
Accidents are exacerbated by human errors made during the operator's response:
- RB errors are made through lack of experience with the system in failure states; training is rarely sufficient to develop a rule base that captures system response outside of normal bounds
- KB reasoning is hindered by system complexity and cognitive strain: system complexity prohibits mental modeling, and the stress of an emergency encourages RB approaches while diminishing KB effectiveness
- System visibility is limited by automation and "defense in depth", resulting in improper rule choices and faulty KB reasoning

Outline
- Human error and computer system failures
- A theory of human error
- Human error and accident theory
- Addressing human error
  - general guidelines
  - the ROC approach: system-level undo

Addressing human error
Challenges:
- Humans are inherently fallible, and errors are inevitable
- Hard-to-detect latent errors can be more troublesome than front-line errors
- Human psychology must not be ignored, especially the SB/RB/KB distinction and human behavior at each level
General approach: error-tolerance rather than error-avoidance. "It is now widely held among human reliability specialists that the most productive strategy for dealing with active errors is to focus upon controlling their consequences rather than upon striving for their elimination." (Reason, p. 246)

The Automation Irony
Automation is not the cure for human error:
- Automation addresses the easy SB/RB tasks, leaving the complex KB tasks for the human, and humans are ill-suited to KB tasks, especially under stress
- Automation hinders understanding and mental modeling: it decreases system visibility and increases complexity, operators don't get hands-on control experience, and so the rule set for RB tasks and the mental models for KB tasks stay weak
- Automation shifts the error source from operator errors to design errors, and design errors are harder to detect, tolerate, and fix

Building robustness to human error
- Discover and correct latent errors; this means overcoming the human tendency to wait until an emergency before responding
- Increase system visibility: don't hide complexity behind automated mechanisms
- Take errors into account in operator training: include error scenarios, promote exploratory trial-and-error approaches, and emphasize the positive side of errors: learning from mistakes

Building robustness to human error
Reduce opportunities for error (Don Norman):
- Convey a good conceptual model to the user through consistent design
- Design tasks to match human limits: working memory, problem-solving abilities
- Make visible what the options are and what the consequences of actions will be
- Exploit natural mappings: between intentions and possible actions, between actual state and what is perceived, ...
- Use constraints to guide the user to the next action or decision
- Design for errors: assume they will occur, plan for error recovery, make it easy to reverse actions, and make it hard to perform irreversible ones (see the sketch below)
- When all else fails, standardize: ease of use matters more, so standardize only as a last resort
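
One concrete reading of "make it easy to reverse actions" is to route destructive operations through a reversible layer instead of performing them directly. A minimal sketch; the trash-directory approach and all names here are hypothetical, not from the talk:

    import shutil
    from pathlib import Path

    TRASH = Path("/var/trash")   # hypothetical staging area for "deleted" files

    def remove(path: str) -> Path:
        """Reversible delete: move the file aside instead of unlinking it."""
        src = Path(path)
        TRASH.mkdir(parents=True, exist_ok=True)
        dst = TRASH / src.name
        shutil.move(str(src), str(dst))
        return dst               # keep this handle so the action can be undone

    def undo_remove(trashed: Path, original: str) -> None:
        """Reverse a remove() by moving the file back where it was."""
        shutil.move(str(trashed), original)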

Building robustness to human error
Acknowledge human behavior in system design:
- Interfaces should allow the user to explore via experimentation; to help at the KB level, provide tools to run experiments and test hypotheses without having to run them on the high-risk, irreversible plant, or make system state always reversible
- Provide feedback to increase error observability (RB level)
- At the RB level, provide symbolic cues and confidence measures; try to give more elaborate, integrated cues to avoid "strong-but-wrong" RB errors
- Provide overview displays at the edge of the visual periphery to avoid attentional capture at the SB level
- Simultaneously present data in forms useful at the SB, RB, and KB levels
- Provide external memory aids to help at the KB level, including externalized representations of different options/schemas

Human error: the ROC approach
ROC is focusing on system-level techniques for human error tolerance, complementary to UI innovations.
Goal: provide a forgiving operator environment:
- expect human error and tolerate it
- allow the operator to experiment safely and test hypotheses
- make it possible to detect and fix latent errors
Approach: undo for system administration

Repairing the Past with Undo
The Three R's: undo meets time travel.
- Rewind: roll system state backwards in time
- Repair: fix the latent or active error, automatically or via human intervention
- Redo: roll system state forward, replaying the user interactions lost during rewind
This is not your ordinary word-processor undo: it allows the sysadmin to go back in time to fix latent errors after they have manifested. (A toy model of the cycle follows.)
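
A toy model of the 3R cycle, assuming a system whose state changes only through logged, replayable operations; the class and method names are illustrative, not the ROC implementation:

    import copy

    class ThreeRSystem:
        """Toy model of the 3R cycle over a log of replayable operations."""

        def __init__(self, initial_state):
            self.state = initial_state
            self.checkpoints = []   # checkpoints[i] = state just before operation i
            self.log = []           # the replayable user interactions
            self.pending = []       # interactions awaiting redo after a rewind

        def execute(self, op):
            """Apply an operation, checkpointing state and logging it for replay."""
            self.checkpoints.append(copy.deepcopy(self.state))
            self.log.append(op)
            op(self.state)

        def rewind(self, n_ops):
            """Roll system state back n_ops operations, keeping them for redo."""
            i = len(self.log) - n_ops
            self.state = copy.deepcopy(self.checkpoints[i])
            self.pending = self.log[i:]
            del self.log[i:], self.checkpoints[i:]

        def repair(self, fix):
            """Fix the latent or active error in the rewound state."""
            fix(self.state)

        def redo(self):
            """Replay the user interactions that were lost during rewind."""
            for op in self.pending:
                self.execute(op)
            self.pending = []

Deep-copying state before every operation is deliberately naive; the point is the control flow: rewind restores earlier state while preserving the interaction log, repair edits that state, and redo replays the log against the repaired world.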

Undo details
Examples where Undo would help:
- reverse the effects of a mistyped command (rm -rf *)
- roll back a software upgrade without losing user data
- retroactively install a virus filter on an email server, so that the effects of the virus are squashed on redo
The 3 R's vs. checkpointing, reboot, and logging:
- checkpointing gives Rewind only
- reboot may give Repair, but only for "Heisenbugs"
- logging can give all 3 R's, but it needs more than RDBMS-style logging, since system state changes are interdependent and non-transactional
3R logging requires careful dependency tracking, plus attention to state granularity and to externalized events (sketched below).
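
Externalized events are the subtle part: output that has already reached the outside world (a delivered email, a printed receipt) cannot silently be recomputed during Redo. A hedged sketch of one possible compensation strategy; every name here is hypothetical:

    def send_compensation(old_output, new_output):
        # Hypothetical stub: tell the outside world that earlier output was bad,
        # e.g. "the message you received contained a virus; a clean copy follows".
        print(f"compensating: {old_output!r} superseded by {new_output!r}")

    def redo_with_compensation(state, pending, externalized):
        """Replay logged operations, reconciling output that already escaped.

        `externalized` maps each operation to the output the pre-rewind timeline
        already sent into the world. That output cannot be un-sent, so if replay
        now produces something different, we issue a compensating action instead
        of pretending the old timeline never happened.
        """
        for op in pending:
            new_output = op(state)
            old_output = externalized.get(op)
            if old_output is not None and old_output != new_output:
                send_compensation(old_output, new_output)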

Summary
- Humans are critical to system dependability: human error is the single largest cause of failures
- Human error is inescapable ("to err is human"), yet we blame the operator instead of fixing systems
- Human error comes in many forms: mistakes, slips, and lapses at the KB/RB/SB levels of operation, but it is nearly always detectable
- The best way to address human error is tolerance, through mechanisms like undo; human-aware UI design can help too