Recovery Oriented Computing (ROC)

Slide 1 Recovery Oriented Computing (ROC)
Aaron Brown*, Pete Broadwell, George Candea †, Mike Chen, Leonard Chung*, James Cutler †, Armando Fox †, Archana Ganapathi*, Andy Huang †, Billy Kakes, Ben Ling †, Calvin Ling, Emre Kıcıman †, David Oppenheimer, David Patterson, and Jonathan Traupman
U.C. Berkeley, † Stanford University
January 2003
(* Looking for jobs)

Slide 2 Recovery-Oriented Computing Philosophy
“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” (Shimon Peres, “Peres’s Law”)
People/HW/SW failures are facts, not problems
Recovery/repair is how we cope with them
Improving recovery/repair improves availability
  – Unavailability = MTTR / MTTF
  – 1/10th MTTR just as valuable as 10X MTBF (assuming MTTR much less than MTTF)
ROC also helps with maintenance/TCO
  – since major Sys Admin job is recovery after failure
Since TCO is 5-10X HW/SW $, if necessary spend disk/DRAM/CPU resources for recovery
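
The availability claim on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the MTTF and MTTR figures are made-up values, not numbers from the talk, and the function names are hypothetical. It compares the slide's approximation MTTR / MTTF with the exact steady-state form MTTR / (MTTF + MTTR).

```python
def approx_unavailability(mttf_hours, mttr_hours):
    """The slide's approximation, reasonable when MTTR is much less than MTTF."""
    return mttr_hours / mttf_hours


def exact_unavailability(mttf_hours, mttr_hours):
    """Steady-state fraction of time spent in repair."""
    return mttr_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: one failure per 1000 hours, one hour to repair.
baseline = approx_unavailability(1000.0, 1.0)

# Option A: repair ten times faster (1/10th MTTR).
faster_repair = approx_unavailability(1000.0, 0.1)

# Option B: fail ten times less often (10X MTTF).
rarer_failure = approx_unavailability(10000.0, 1.0)

print(f"baseline      {baseline:.6f}")
print(f"1/10th MTTR   {faster_repair:.6f}")
print(f"10X MTTF      {rarer_failure:.6f}")
```

Under these assumed numbers both options cut unavailability by the same factor of ten, which is the sense in which the slide calls them equally valuable.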

Slide 3 MTTR more valuable than MTTF???
Threshold => non-linear return on improvement
  – 8 to 11 second abandonment threshold on Internet
  – 30 second NFS client/server threshold
  – Satellite tracking and 10 minute vs. 2 minute MTTR
eBay 4 hour outage, 1st major outage in year
  – More people in single event worse for reputation?
  – One 4-hour outage/year => NY Times => stock?
  – What if 1-minute outage/day for a year? (250X improvement in MTTR, 365X worse in MTTF)
MTTF normally predicted vs. observed
  – Include environmental error, operator error, app bug?
  – Much easier to verify MTTR than MTTF!
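
The comparison between one 4-hour outage per year and a 1-minute outage every day can be worked through directly. This is a rough sketch with hypothetical function and variable names; it assumes a 365-day year and treats every outage in a scenario as having the same duration. Note that 4 hours is 240 minutes, so the computed MTTR ratio is 240, which the slide appears to round to 250X.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def yearly_downtime_minutes(outages_per_year, minutes_per_outage):
    """Total downtime in a year when every outage lasts the same time."""
    return outages_per_year * minutes_per_outage


def availability(outages_per_year, minutes_per_outage):
    """Fraction of the year the service is up."""
    downtime = yearly_downtime_minutes(outages_per_year, minutes_per_outage)
    return 1 - downtime / MINUTES_PER_YEAR


# Scenario 1: a single 4-hour outage in the year.
single_outage = yearly_downtime_minutes(1, 4 * 60)

# Scenario 2: a 1-minute outage every day.
daily_outage = yearly_downtime_minutes(365, 1)

mttr_ratio = (4 * 60) / 1      # how much shorter each outage is
failure_ratio = 365 / 1        # how much more often outages happen

print(single_outage, daily_outage)   # minutes of downtime per year
print(mttr_ratio, failure_ratio)
```

By raw availability the daily 1-minute outage is the worse of the two, with more total downtime per year. The slide's question is whether users and the press would nonetheless see it as less damaging, because each outage stays close to the abandonment thresholds listed above.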

Slide 4 Five “ROC Solid” Principles
1. Given errors occur, design to recover rapidly
2. Given humans make errors, build tools to help operator find and repair problems
  – e.g., undo; hot swap; graceful, gradual SW upgrade
3. Extensive sanity checks during operation
  – To discover failures quickly (and to help debug)
  – Report to operator (and remotely to developers)
4. Any error message in HW or SW can be routinely invoked, scripted for regression test
  – To test emergency routines during development
  – To validate emergency routines in field
  – To train operators in field
5. Recovery benchmarks to measure progress
  – Recreate performance benchmark competition
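
Principle 4 can be illustrated with a toy regression test that invokes an error path on demand. Everything in this sketch is hypothetical (the `FlakyService` class, the restart-and-retry policy, the counts); it is not code from the ROC project, only one way a scripted fault might exercise a recovery routine.

```python
class FlakyService:
    """Hypothetical component whose error path can be triggered on demand."""

    def __init__(self):
        self.inject_failures = 0  # number of upcoming calls that should fail
        self.restarts = 0

    def handle(self, request):
        if self.inject_failures > 0:
            self.inject_failures -= 1
            raise RuntimeError("injected fault")
        return f"ok:{request}"

    def restart(self):
        # Stand-in for a real recovery action such as restarting a component.
        self.restarts += 1


def handle_with_recovery(service, request, max_attempts=3):
    """Restart and retry on failure; return (response, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return service.handle(request), attempt
        except RuntimeError:
            service.restart()
    raise RuntimeError("recovery failed")


def run_recovery_regression():
    """Scripted fault injection: force two failures, then check recovery."""
    service = FlakyService()
    service.inject_failures = 2
    response, attempts = handle_with_recovery(service, "ping")
    return response, attempts, service.restarts


print(run_recovery_regression())
```

The number of attempts needed to recover is a crude stand-in for the kind of measurement a recovery benchmark (principle 5) would record; a real benchmark would presumably measure elapsed recovery time under a realistic workload.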

Slide 5 Recent Publications 1/4
Patterson, D. A. A simple way to estimate the cost of downtime. 16th Systems Administration Conference (LISA), Nov.
Oppenheimer, D., Aaron B. Brown, Jonathan Traupman, Pete Broadwell, and David A. Patterson. Practical issues in dependability benchmarking. Second Workshop on Evaluating and Architecting System Dependability (EASY), October.
Oppenheimer, D. and D. A. Patterson. Architecture, operation, and dependability of large-scale Internet services: three case studies. IEEE Internet Computing, Sept./Oct. 2002.

Slide 6 Recent Publications 2/4
Brown, A. and D. A. Patterson. Rewind, Repair, Replay: Three R's to Dependability. 10th ACM SIGOPS European Workshop, Saint-Émilion, France, September.
George Candea and Armando Fox. A Utility-Centered Approach to Building Dependable Infrastructure Services. 10th ACM SIGOPS European Workshop (EW-2002), Saint-Émilion, France, September.
Oppenheimer, D. and D. A. Patterson. Studying and using failure data from large-scale Internet services. 10th ACM SIGOPS European Workshop, Saint-Émilion, France, September 2002.

Slide 7 Recent Publications 3/4
George Candea, James Cutler, Armando Fox, Rushabh Doshi, Priyank Garg, Rakesh Gowda. Reducing Recovery Time in a Small Recursively Restartable System. International Conference on Dependable Systems and Networks (DSN-2002), Washington, D.C., June.
Merzbacher, M. and Dan Patterson. Measuring End-User Availability on the Web: Practical Experience. International Performance and Dependability Symposium, Washington, D.C., June 2002.

Slide 8 Recent Publications 4/4
Broadwell, P., N. Sastry and J. Traupman. FIG: A Prototype Tool for Online Verification of Recovery Mechanisms. Workshop on Self-Healing, Adaptive and self-MANaged Systems (SHAMAN), New York, NY, June.
Talks: “Recovery Oriented Computing.” David Patterson. Presented at Princeton University, University of Illinois, and University of Michigan, October 2002.