Recovery Oriented Computing (ROC) Dave Patterson and a cast of 1000s: Aaron Brown, Pete Broadwell, George Candea †, Mike Chen, James Cutler †, Prof. Armando.

Slides:



Advertisements
Similar presentations
Test process essentials Riitta Viitamäki,
Advertisements

Large-Scale Distributed Systems Andrew Whitaker CSE451.
Fabián E. Bustamante, Winter 2006 Recovery Oriented Computing Embracing Failure A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-
VoipNow Core Solution capabilities and business value.
A Simple Way to Estimate the Cost of Downtime Dave Patterson EECS Department University of California, Berkeley
Pinpoint: Problem Determination in Large, Dynamic Internet Services Mike Chen, Emre Kıcıman, Eugene Fratkin {emrek,
Psychological Aspects of Risk Management and Technology – G. Grote ETHZ, Fall09 Psychological Aspects of Risk Management and Technology – Overview.
Slide 1 Dave Patterson University of California at Berkeley January 2002 A New Focus for a New Century: Availability and Maintainability.
Slide 1 Dave Patterson University of California at Berkeley HPCA 8 Keynote February 2002
ROC Solid: A Recovery Oriented Computing Perspective Dave Patterson and a cast of 1000s: Aaron Brown, Pete Broadwell, George Candea †, Mike Chen, James.
SWE Introduction to Software Engineering
MSIS 110: Introduction to Computers; Instructor: S. Mathiyalakan1 Systems Design, Implementation, Maintenance, and Review Chapter 13.
CS 603 Failure Models April 12, Fault Tolerance in Distributed Systems Perfect world: No Failures –W–We don’t live in a perfect world Non-distributed.
Slide 1 Dave Patterson University of California at Berkeley January 2002 A New Focus for a New Century: Availability and Maintainability.
Recovery Oriented Computing (ROC) Dave Patterson and a cast of 1000s: Aaron Brown, Pete Broadwell, George Candea †, Mike Chen, James Cutler †, Prof. Armando.
Recovery Oriented Computing (ROC) Dave Patterson, with a cast of 1000s: Aaron Brown, Pete Broadwell, George Candea †,Mike Chen, James Cutler †, Patricia.
1 Today More on random testing + symbolic constraint solving (“concolic” testing) Using summaries to explore fewer paths (SMART) While preserving level.
CalStan 3/2011 VIRAM-1 Floorplan – Tapeout June 01 Microprocessor –256-bit media processor –12-14 MBytes DRAM – Gops –2W at MHz –Industrial.
Recovery Oriented Computing: Update Armando Fox (in loco Patterson) Summer ROC Retreat, June 2002.
Slide 1 Dave Patterson University of California at Berkeley FAST Keynote January 2002
Slide 1 Recovery-Oriented Computing Dave Patterson and Aaron Brown University of California at Berkeley In cooperation.
ATIF MEHMOOD MALIK KASHIF SIDDIQUE Improving dependability of Cloud Computing with Fault Tolerance and High Availability.
Business Continuity and Disaster Recovery Chapter 8 Part 2 Pages 914 to 945.
Reliability Andy Jensen Sandy Cabadas.  Understanding Reliability and its issues can help one solve them in relatable areas of computing Thesis.
Dr.Backup Online Backup Service (888) (toll free)
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 9 Slide 1 Critical Systems Specification 2.
Recovery Oriented Computing (ROC) Aaron Brown*, Pete Broadwell, George Candea †, Mike Chen, Leonard Chung*, James Cutler †, Armando Fox †, Archana Ganapathi*,
Principles of Information Systems, Sixth Edition Systems Design, Implementation, Maintenance, and Review Chapter 13.
Written by Ben Edwards Georgia CTAE Resource Network 2010 Basic Troubleshooting Or: Repairing the Inevitable System Crash.
Object-Oriented Software Engineering Practical Software Development using UML and Java Chapter 1: Software and Software Engineering.
Failure Analysis of the PSTN: 2000 Patricia Enriquez Mills College Oakland, California Mentors: Aaron Brown David Patterson.
CS 4001Mary Jean Harrold 1 Can We Trust the Computer?
Chapter 11 Maintaining the System System evolution Legacy systems Software rejuvenation.
CompSci Self-Managing Systems Shivnath Babu.
Object-Oriented Software Engineering Practical Software Development using UML and Java Chapter 1: Software and Software Engineering.
Principles of Information Systems, Sixth Edition Systems Design, Implementation, Maintenance, and Review Chapter 13.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development 3.
Server Virtualization
Recovery Oriented Computing (ROC) Dave Patterson and a cast of 1000s: Aaron Brown, Pete Broadwell, George Candea †, Mike Chen, James Cutler †, Prof. Armando.
CS 505: Thu D. Nguyen Rutgers University, Spring CS 505: Computer Structures Fault Tolerance Thu D. Nguyen Spring 2005 Computer Science Rutgers.
CompSci Self-Managing Systems Shivnath Babu.
Design - programming Cmpe 450 Fall Dynamic Analysis Software quality Design carefully from the start Simple and clean Fewer errors Finding errors.
What is virtualization? virtualization is a broad term that refers to the abstraction of computer resources in order to work with the computer’s complexity.
Principles of Information Systems, Sixth Edition 1 Systems Design, Implementation, Maintenance, and Review Chapter 13.
Forensic Software Engineering: Are Software Failures Symptomatic of Systemic Problems? Chris Johnson, University of Glasgow My name is Elisabeth.
Using HTTP Access Logs To Detect Application-Level Failures In Internet Services Peter Bodík ‡, Greg Friedman †, Lukas Biewald †, Helen Levine §, George.
CSE SW Measurement and Quality Engineering Copyright © , Dennis J. Frailey, All Rights Reserved CSE8314M15 version 5.09Slide 1 SMU CSE.
CS223: Software Engineering Lecture 2: Introduction to Software Engineering.
Week 5-6 MondayTuesdayWednesdayThursdayFriday Testing III No reading Group meetings Testing IVSection ZFR due ZFR demos Progress report due Readings out.
1 The System life cycle 16 The system life cycle is a series of stages that are worked through during the development of a new information system. A lot.
Slide 1 Recovery-Oriented Computing Aaron Brown, Dan Hettenna, David Oppenheimer, Noah Treuhaft, Leonard Chung, Patty Enriquez, Susan Housand, Archana.
Rewind, Repair, Replay: Three R’s to cope with operator error Aaron Brown UC Berkeley ROC Group IBM Almaden, 22 March 2002.
 Software reliability is the probability that software will work properly in a specified environment and for a given amount of time. Using the following.
Can We Trust the Computer? FIRE, Chapter 4. What Can Go Wrong? What are the risks and reasons for computer failures? How much risk must or should we accept?
Tutorial 4 IT323.  Q1. As a software project manager in a company that specializes in the development of software for the offshore oil industry, you.
INTRODUCTION TO DESKTOP SUPPORT
Adam Backman Chief Cat Wrangler – White Star Software
Welcome to the Winter 2004 ROC Retreat
Embracing Failure: A Case for Recovery-Oriented Computing
Large Distributed Systems
University of California at Berkeley
Recovery-Oriented Computing
Recovery Oriented Computing: A New Research Agenda for a New Century
Recovery Oriented Computing: A New Research Agenda for a New Century
CS240: Advanced Programming Concepts
Information Systems, Ninth Edition
Software Verification, Validation, and Acceptance Testing
Basic Troubleshooting Or: Repairing the Inevitable System Crash
Recovery Oriented Computing: A New Research Agenda for a New Century
University of California at Berkeley
Presentation transcript:

Recovery Oriented Computing (ROC) Dave Patterson and a cast of 1000s: Aaron Brown, Pete Broadwell, George Candea †, Mike Chen, James Cutler †, Prof. Armando Fox †, Emre Kıcıman †, David Oppenheimer, and Jonathan Traupman U.C. Berkeley, † Stanford University April 2003

Slide 2 Outline The past: where we have been The present: new realities and challenges A future: Recovery-Oriented Computing (ROC) ROC techniques and principles

Slide 3 The past: research goals and assumptions of last 20 years Goal #1: Improve performance Goal #2: Improve performance Goal #3: Improve cost-performance Simplifying Assumptions –Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance or repair) –Software will eventually be bug free (Hire better programmers!) –Hardware MTBF is already very large (~100 years between failures), and will continue to increase –Maintenance costs irrelevant vs. Purchase price (maintenance a function of price, so cheaper helps)

Slide 4 Learning from other fields: disasters Common threads in accidents ~3 Mile Island 1.More multiple failures than you believe possible, because latent errors accumulate 2. Operators cannot fully understand system because errors in implementation, measurement system, warning systems. Also complex, hard to predict interactions 3.Tendency to blame operators afterwards (60-80%), but they must operate with missing, wrong information 4.The systems are never all working fully properly: bad warning lights, sensors out, things in repair 5.Emergency Systems are often flawed. At 3 Mile Island, 2 valves in wrong position; parts of a redundant system used only in an emergency. Facility running under normal operation masks errors in error handling Source: Charles Perrow, Normal Accidents: Living with High Risk Technologies, Perseus Books, 1990

Slide 5 Learning from other fields: human error Two kinds of human error 1) slips/lapses: errors in execution 2) mistakes: errors in planning –errors can be active (operator error) or latent (design error, management error) Human errors are inevitable –“humans are furious pattern-matchers” »sometimes the match is wrong –cognitive strain leads brain to think up least-effort solutions first, even if wrong Humans can self-detect errors –about 75% of errors are immediately detected Source: J. Reason, Human Error, Cambridge, 1990.

Slide 6 Human error Human operator error is the leading cause of dependability problems in many domains Operator error cannot be eliminated –humans inevitably make mistakes: “to err is human” –automation irony tells us we can’t eliminate the human Source: D. Patterson et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD , March Public Switched Telephone Network Average of 3 Internet Sites Sources of Failure

Slide 7 The ironies of automation Automation doesn’t remove human influence –shifts the burden from operator to designer »designers are human too, and make mistakes »unless designer is perfect, human operator still needed Automation can make operator’s job harder –reduces operator’s understanding of the system »automation increases complexity, decreases visibility »no opportunity to learn without day-to-day interaction –uninformed operator still has to solve exceptional scenarios missed by (imperfect) designers »exceptional situations are already the most error-prone Need tools to help, not replace, operator Source: J. Reason, Human Error, Cambridge University Press, mention human-aware automation

Slide 8 Learning from others: Bridges 1800s: 1/4 iron truss railroad bridges failed! Safety is now part of Civil Engineering DNA Techniques invented since 1800s: –Learn from failures vs. successes –Redundancy to survive some failures –Margin of safety 3X-6X vs. calculated load –(CS&E version of safety margin?) What will people of future think of our computers?

Slide 9 Where we are today MAD TV, “Antiques Roadshow, 3005 AD” VALTREX: “Ah ha. You paid 7 million Rubex too much. My suggestion: beam it directly into the disposal cube. These pieces of crap crashed and froze so frequently that people became violent! Hargh!” “Worthless Piece of Crap: 0 Rubex”

Slide 10 Recovery-Oriented Computing Philosophy “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” — Shimon Peres (“Peres’s Law”) People/HW/SW failures are facts, not problems Recovery/repair is how we cope with them Improving recovery/repair improves availability –UnAvailability = MTTR MTTF –1/10th MTTR just as valuable as 10X MTBF (assuming MTTR much less than MTTF) ROC also helps with maintenance/TCO –since major Sys Admin job is recovery after failure Since TCO is 5-10X HW/SW $, if necessary spend disk/DRAM/CPU resources for recovery

Slide 11 ROC Summary 21 st Century Research challenge is Synergy with Humanity, Dependability, Security/Privacy 2002: Peres’s Law greater than Moore’s Law? –Must cope with fact that people, SW, HW fail –Industry may soon compete on recovery time v. SPEC Recovery Oriented Computing is one path for operator synergy, dependability for servers –Failure data collection + Benchmarks to evaluate –Partitioning, Redundandy, Diagnosis, Partial Restart, Input/Fault Insertion, Undo, Margin of Safety Significantly reducing MTTR (people/SW/HW) => better Dependability & Cost of Ownership

Slide 12 Interested in ROCing? More research opportunities than 2 university projects can cover. Many could help with: –Failure data collection, analysis, and publication –Create/Run Recovery benchmarks: compare (by vendor) databases, files systems, routers, … –Invent, evaluate techniques to reduce MTTR and TCO in computation, storage, and network systems –(Lots of low hanging fruit) “If it’s important, how can you say it’s impossible if you don’t try?” Jean Monnet, a founder of European Union