Recovery-Oriented Computing

Slides:



Advertisements
Similar presentations
Making the System Operational
Advertisements

Ed Duguid with subject: MACE Cloud
McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 20 Systems Operations and Support.
The Case for Drill-Ready Cloud Computing Vision Paper Tanakorn Leesatapornwongsa and Haryadi S. Gunawi 1.
Fabián E. Bustamante, Winter 2006 Recovery Oriented Computing Embracing Failure A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-
Slide 1 Dave Patterson University of California at Berkeley January 2002 A New Focus for a New Century: Availability and Maintainability.
Challenges in Large Enterprise Data Management James Hamilton Microsoft SQL Server
Chapter 11: Maintaining and Optimizing Windows Vista
CalStan 3/2011 VIRAM-1 Floorplan – Tapeout June 01 Microprocessor –256-bit media processor –12-14 MBytes DRAM – Gops –2W at MHz –Industrial.
Recovery Oriented Computing: Update Armando Fox (in loco Patterson) Summer ROC Retreat, June 2002.
Recovery Oriented Computing (ROC) Dave Patterson and a cast of 1000s: Aaron Brown, Pete Broadwell, George Candea †, Mike Chen, James Cutler †, Prof. Armando.
AgVantage IT Services Systems Management Team Partnered with You and IBM® Agenda Disaster Recovery Service Disaster Recovery Service IT Visors IT Visors.
1 CSE 403 Reliability Testing These lecture slides are copyright (C) Marty Stepp, They may not be rehosted, sold, or modified without expressed permission.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 12: Managing and Implementing Backups and Disaster Recovery.
1 Motivation Goal: Create and document a black box availability benchmark Improving dependability requires that we quantify the ROC-related metrics.
Slide 1 Recovery-Oriented Computing Dave Patterson and Aaron Brown University of California at Berkeley In cooperation.
VAP What is a Virtual Application ? A virtual application is an application that has been optimized to run on virtual infrastructure. The application software.
CS527: (Advanced) Topics in Software Engineering Overview of Software Quality Assurance Tao Xie ©D. Marinov, T. Xie.
ATIF MEHMOOD MALIK KASHIF SIDDIQUE Improving dependability of Cloud Computing with Fault Tolerance and High Availability.
2. Fault Tolerance. 2 Fault - Error - Failure Fault = physical defect or flow occurring in some component (hardware or software) Error = incorrect behavior.
Installing Windows Vista Lesson 2. Skills Matrix Technology SkillObjective DomainObjective # Performing a Clean Installation Set up Windows Vista as the.
Recovery Oriented Computing (ROC) Aaron Brown*, Pete Broadwell, George Candea †, Mike Chen, Leonard Chung*, James Cutler †, Armando Fox †, Archana Ganapathi*,
CS 390 Unix Programming Summer Unix Programming - CS 3902 Course Details Online Information Please check.
Chapter 11 Maintaining the System System evolution Legacy systems Software rejuvenation.
 Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet). 
CompSci Self-Managing Systems Shivnath Babu.
Slide 1 Breaking databases for fun and publications: availability benchmarks Aaron Brown UC Berkeley ROC Group HPTS 2001.
Reliability and Recovery CS Introduction to Operating Systems.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development 3.
Fault Tolerance Benchmarking. 2 Owerview What is Benchmarking? What is Dependability? What is Dependability Benchmarking? What is the relation between.
CompSci Self-Managing Systems Shivnath Babu.
Paperless Timesheet Management Project Anant Pednekar.
Introduction to the new mainframe © Copyright IBM Corp., All rights reserved. 1 Main Frame Computing Objectives Explain why data resides on mainframe.
IT Applications Theory Slideshows By Mark Kelly Vceit.com Version 2 – updated for 2016 CLOUD COMPUTING CLOUD COMPUTING.
CERN IT Department CH-1211 Genève 23 Switzerland t Migration from ELFMs to Agile Infrastructure CERN, IT Department.
Slide 1 Recovery-Oriented Computing Aaron Brown, Dan Hettenna, David Oppenheimer, Noah Treuhaft, Leonard Chung, Patty Enriquez, Susan Housand, Archana.
Undo for Recovery: Approaches and Models Aaron Brown UC Berkeley ROC Group.
Installing Windows 7 Lesson 2.
Database recovery contd…
Welcome to the Winter 2004 ROC Retreat
Supporting Windows 8.1 Krystle Portocarrero | Training Experts Inc.
Hardware & Software Reliability
Embracing Failure: A Case for Recovery-Oriented Computing
Build a low-touch, highly scalable cloud with IBM SmartCloud Provisioning Academic Initiative © 2011 IBM Corporation.
Large Distributed Systems
Addressing Human Error with Undo
Andy Wang CIS 5930 Computer Systems Performance Analysis
Noah Treuhaft UC Berkeley ROC Group ROC Retreat, January 2002
Fault Tolerance & Reliability CDA 5140 Spring 2006
Maximum Availability Architecture Enterprise Technology Centre.
Software Reliability Definition: The probability of failure-free operation of the software for a specified period of time in a specified environment.
Bringing Undo to system admin: a new paradigm for recovery
Software Development Life Cycle
How to Create Mac OS X Recovery Partition?
Basic Computer Maintenance
Aliexpress Clone Script to Empower Your eCommerce Business
Recovery-Oriented Computing
How To Fix AOL Desktop Update Error AOL Helpline Number
Information Technology
Relate to Clients on a business level
Undo for Recovery: Approaches and Models
Mattan Erez The University of Texas at Austin July 2015
Workshop.
Hardware-less Testing for RAS Software
Traditional Virtualized Infrastructure
Systems Operations and Support
The Troubleshooting theory
Test Cases, Test Suites and Test Case management systems
Database Backup and Recovery
Windows 10 An Operating System
Presentation transcript:

Recovery-Oriented Computing Dave Patterson and Aaron Brown University of California at Berkeley {patterson,abrown}@cs.berkeley.edu In cooperation with Armando Fox, Stanford University fox@cs.stanford.edu http://roc.CS.Berkeley.EDU/ December 2001

The past: goals and assumptions of last 15 years Goal #1: Improve performance Goal #2: Improve performance Goal #3: Improve cost-performance Assumptions Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance or repair) Software will eventually be bug free (good programmers write bug-free code, debugging works) Hardware MTBF is already very large (~100 years between failures), and will continue to increase

Today, after 15 years of improving performance Availability is now the vital metric for servers near-100% availability is becoming mandatory for e-commerce, enterprise apps, online services, ISPs but, service outages are frequent 65% of IT managers report that their websites were unavailable to customers over a 6-month period 25%: 3 or more outages outage costs are high social effects: negative press, loss of customers who “click over” to competitor $500,000 to $5,000,000 per hour in lost revenues Source: InternetWeek 4/3/2000

New goals: ACME Availability Change Maintainability 24x7 delivery of service to users Change support rapid deployment of new software, apps, UI Maintainability reduce burden on system administrators (cost of ownership ~5X cost of purchase) provide helpful, forgiving sysadmin environments Evolutionary Growth allow easy system expansion over time without sacrificing availability or maintainability

Where does ACME stand today? Availability: failures are common Traditional fault-tolerance doesn’t solve the problems Change In back-end system tiers, software upgrades difficult, failure-prone, or ignored For application service over WWW, daily change Maintainability human operator error is single largest failure source? system maintenance environments are unforgiving Evolutionary growth 1U-PC cluster front-ends scale, evolve well back-end scalability still limited

Recovery-Oriented Computing Philosophy “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” — Shimon Peres Failures are a fact, and recovery/repair is how we cope with them Improving recovery/repair improves availability UnAvailability = MTTR MTTF 1/10th MTTR just as valuable as 10X MTBF (assuming MTTR much less than MTTF) Since major Sys Admin job is recovery after failure, ROC also helps with maintenance If necessary, start with clean slate, sacrifice disk space and performance for ACME

ROC Approach Change system administration environment to make it forgiving Develop ACME benchmarks to test old systems and new ideas to measure improvement Fastest to Recover from Failures v. Fastest on SPEC Work with companies to get real data on failure causes and patterns, feedback on approach Cluster technology that enables partition systems, insert faults, test outputs

One idea: Undo for Sysadmin Major goal of ROC is to provide an Undo for system administration to create an environment that forgives operator error to let sysadmins fix latent errors even after they’re manifested this is no ordinary word processor undo! The Three R’s: undo meets time travel Rewind: roll system state backwards in time Repair: fix latent or active error automatically or via human intervention Redo: roll system state forward, replaying user interactions lost during rewind

Discussion Topics Focus on recovery: appropriate? How much of a problem is human error? Undo: would it be useful? How hard is it to diagnose problems? Benchmarks to evaluate progress in this area? Future events on this topic?

Backup Slides for Questions

Discussion Topics (long version) Focus on recovery: appropriate? is it an important part of what you do? what other parts of the sysadmin experience (for Internet services) should we look into? How much of a problem is human error? are there more user errors or sysadmin errors? Undo: would it be useful? how far back in time would it need to go? should it be exported to users, e.g. undo for email? How hard is it to diagnose problems? would you use or trust automated diagnosis tools?

Discussion Topics (long version), p2 Benchmarks to evaluate progress in this area measuring dependability of human-operated systems how to measure sysadmin satisfaction with system? Future events on this topic would you come to a workshop at the next LISA? what do you think about a hands-on workshop where we spend the morning using various systems then spend the afternoon analyzing and discussing results?

ROC Enabler: ACME benchmarks Traditional benchmarks focus on performance assume perfect hardware, software, human operators New benchmarks needed to drive progress toward ACME, evaluate ROC success How else convince skeptics to adopt new technology? Need workload of typical failures normal behavior (99% conf.) QoS degradation failure Repair Time

Tools for Recovery: Detection System enables input insertion, output check of all modules (including fault insertion) To check module sanity to find failures faster To test correctness of recovery mechanisms insert (random) faults and known-incorrect inputs also enables availability benchmarks To expose & remove latent errors from system To train/expand experience of operator Periodic reports to management on skills To discover if warning systems are broken

Repairing the Past (2) 3 cases needing Undo reverse the effects of a mistyped command (rm –rf *) roll back a software upgrade without losing user data “go back in time” to retroactively install virus filter on email server; effects of virus are squashed on redo The 3 R’s vs. checkpointing, reboot, logging checkpointing gives Rewind only reboot may give Repair, but only for “Heisenbugs” logging can give all 3 R’s but need more than RDBMS logging, since system state changes are interdependent and non-transactional 3R-logging requires careful dependency tracking, and attention to state granularity and externalized events

Organizations we’re working with Traditional Computer Companies: HP, IBM, Microsoft, ... Internet companies: Amazon, Google, Yahoo, ... Startup companies (wouldn’t recognize names) Stanford University (Armando Fox) E.g., CS 294-4 class Visit them + twice a year, 3-day retreats at Lake Tahoe to review progress, get new insights, build team spirit, have fun, ...

Systems using for ROC ROC-I: Cluster of 64 PCs modified with ability for HW isolation, fault insertion, monitoring Cluster of 40 IBM PCs: each with 2 GB DRAM, 1 gigabit Ethernet, gigabit switch,HW monitor, each running Vmware virtual machine monitor (software layer)

Why ROC? Working on relevant, important systems topic with no jeopardy from near-term products New, exciting, research field, so lots of opportunities Interacting with many interesting companies, research labs, Stanford University Contact Dave Patterson (patterson@cs) or Aaron Brown (abrown@cs) or David Oppenheimer (davidopp@cs) See http://ROC.cs.berkeley.edu