Fabián E. Bustamante, Winter 2006: Recovery Oriented Computing, Embracing Failure

Presentation transcript:

Recovery Oriented Computing: Embracing Failure
Fabián E. Bustamante, Winter 2006
A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-oriented computing (ROC), HPTS, 2001
A little of … A. B. Brown and D. A. Patterson, Undo for operators: Building an undoable store, USENIX ATC 2003 (Best paper)

CS 395/495 Autonomic Computing Systems EECS, Northwestern University

Availability and today’s apps
Availability is the most important metric for modern computer systems
Availability used to be a solved problem
–Expensive fault-tolerant server
–Vendor-supplied high-availability database system
–All behind a well-firewalled box
Today’s apps are quite different
–Distributed, heterogeneous environment
–Conglomeration of interconnected systems: databases, application servers, middleware, web servers
So: 65% of surveyed sites suffered a customer-visible outage at least once in a 6-month period; 25% suffered three or more in the same period

Problem with assumptions
Basic model
–Hardware and software can be built with negligible failure rates
–Failure modes of systems can be predicted and tolerated
–Maintenance and repair are error-free procedures
More realistically
–Hardware and software failures are inevitable
–Human failures are inevitable
–Unanticipated failures are inevitable
Your only option: get used to it and embrace failure, i.e. Recovery Oriented Computing (ROC)

HW & SW failures are inevitable
Software: functionality is king. A constant race to offer new functionality → sloppy people & buggy code
Hardware: razor-thin margins mean no $ for high-quality, fault-tolerant hardware → commodity, failure-prone hardware
Scale only multiplies the problem!
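To make "scale only multiplies the problem" concrete, here is a minimal arithmetic sketch. It assumes components fail independently and uses an illustrative 1% per-component failure probability; both the independence assumption and the number are hypothetical, not figures from the slides.

```python
def prob_any_failure(p_component: float, n_components: int) -> float:
    """Probability that at least one of n independent components fails.

    Assumption (not from the slides): failures are independent and every
    component has the same failure probability over the period considered.
    """
    return 1.0 - (1.0 - p_component) ** n_components


# Illustrative only: a 1% per-component failure probability.
for n in (1, 10, 100, 1000):
    print(f"{n:5d} components -> P(at least one failure) = "
          f"{prob_any_failure(0.01, n):.4f}")
```

Under these assumptions a failure somewhere in the system becomes close to certain as the component count grows, even though each part is individually reliable.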

Human failures are inevitable
Large systems rely on human beings for
–Maintenance and repair
–Software configuration and upgrading
–Performance tuning
–Diagnosing and fixing failures
Human beings make mistakes
–At a rate of % under stress
–Share of failures attributed to humans: 70% in electronic systems, 20-53% in missile systems, 60-70% in aircraft failures, 50% in VAX systems, 42% in Tandem systems, ….
But modern systems do not take into account the possibility of human failure

Unanticipated failures are inevitable
Could you solve this with good engineering?
–Not really
Perrow’s work on high-risk technology
–Large servers: complex, reasonably tightly coupled systems, performing complex tasks under human guidance … prone to “normal accidents”
–Accidents that arise from the multiple and unexpected hidden interactions of smaller failures and the recovery systems designed to handle them

Recovery Oriented Computing
Focus on repair instead of avoiding failures
Recovery needs to be a first-class part of the system
It must
–Ensure problems are detected fast (for containment)
–Provide assistance in diagnosing their root cause
–Provide trustworthy repair mechanisms
–Tolerate errors during recovery
–Complement fault tolerance (redundancy is thus necessary)
–Automatically track the health of all components, so it should include fault-injection mechanisms
–…
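One way to see why focusing on repair pays off is the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The formula is not on the slide; it is brought in here as a well-known definition, and the hour values below are purely illustrative assumptions.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up.

    mttf_hours: mean time to failure; mttr_hours: mean time to repair.
    """
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers, chosen only to compare the two strategies.
baseline = availability(1000, 10)        # fail every ~1000 h, 10 h to repair
fewer_failures = availability(2000, 10)  # failures made half as frequent
faster_repair = availability(1000, 5)    # repair made twice as fast

print(f"baseline:       {baseline:.6f}")
print(f"fewer failures: {fewer_failures:.6f}")
print(f"faster repair:  {faster_repair:.6f}")
```

In this sketch halving the repair time buys the same availability as doubling the time between failures, which is the sense in which recovery can be treated as a first-class goal rather than an afterthought.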

Undoable store
You have undo for Office, but not for admins?!
Undo incorporates three steps
–Rewind: physically roll the system back to before the damage
–Repair: do not constrain admins in what repairs they can make
–Replay: logically bring the system forward again, incorporating the repair
Two challenges in the 3Rs model
–Timeline management: record the system timeline so that you can edit it during repair and re-execute it during replay
–Keep the system consistent from an external observer’s point of view (even ‘after’ repair)
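The three Rs can be sketched with a toy key-value store. This is a minimal illustration under loud assumptions, not the undoable store from the paper: the class `UndoableStore`, its methods, and the `put`/`delete` operations are all hypothetical names; rewind is done with in-memory snapshots; and the external-consistency challenge is not handled at all.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class UndoableStore:
    """Toy store with a timeline log (hypothetical, for illustration)."""
    state: dict = field(default_factory=dict)
    timeline: list = field(default_factory=list)   # logged (op, key, value)
    snapshots: list = field(default_factory=list)  # (timeline index, state copy)

    def checkpoint(self) -> int:
        """Save a rewind point and return its number."""
        self.snapshots.append((len(self.timeline), copy.deepcopy(self.state)))
        return len(self.snapshots) - 1

    def apply(self, op: str, key: str, value=None, log: bool = True) -> None:
        """Execute one operation, recording it on the timeline unless replaying."""
        if log:
            self.timeline.append((op, key, value))
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")

    def undo(self, snapshot_no: int, repair) -> None:
        log_index, saved = self.snapshots[snapshot_no]
        # Rewind: physically restore the state from before the damage.
        self.state = copy.deepcopy(saved)
        self.snapshots = self.snapshots[: snapshot_no + 1]
        # Repair: the admin edits the recorded timeline however they see fit.
        tail = repair(self.timeline[log_index:])
        self.timeline = self.timeline[:log_index] + tail
        # Replay: re-execute the edited timeline on top of the restored state.
        for op, key, value in tail:
            self.apply(op, key, value, log=False)


store = UndoableStore()
store.apply("put", "alice", "mail-1")
point = store.checkpoint()
store.apply("delete", "alice")        # the operator's mistake
store.apply("put", "bob", "mail-2")   # legitimate work done after the mistake

# Repair by dropping the mistaken delete from the timeline, then replay.
store.undo(point, lambda tail: [v for v in tail if v != ("delete", "alice", None)])
print(store.state)
```

The later, legitimate write to `bob` survives the undo because it is replayed from the timeline, while the mistaken delete is edited out during repair.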

Undo system architecture
[Figure: architecture diagram. Components: User, Undo Proxy, Service App, Time travel storage, Timeline log, Undo Manager, Control UI. Edge labels: Control, Verbs. Annotations: “To be able to roll back the system”; “Service specific”; “In part to make the undo manager generic”.]
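The split between service-specific verbs and a generic undo manager might look roughly like the sketch below. It assumes the "Service specific" annotation refers to the verbs and proxy, which the flattened diagram does not make certain. Every name here (`Verb`, `Deliver`, `Expunge`, `UndoProxy`, `UndoManager`) is hypothetical, and a plain dict stands in for the service application.

```python
from dataclasses import dataclass
from typing import Protocol


class Verb(Protocol):
    """What the generic undo manager needs from any service-specific verb."""
    def execute(self, service: dict) -> None: ...


@dataclass(frozen=True)
class Deliver:
    """Service-specific verb (hypothetical): deliver a message to a mailbox."""
    mailbox: str
    message: str

    def execute(self, service: dict) -> None:
        service.setdefault(self.mailbox, []).append(self.message)


@dataclass(frozen=True)
class Expunge:
    """Service-specific verb (hypothetical): remove a whole mailbox."""
    mailbox: str

    def execute(self, service: dict) -> None:
        service.pop(self.mailbox, None)


class UndoProxy:
    """Sits between users and the service; logs each verb, then applies it."""
    def __init__(self, service: dict, timeline: list) -> None:
        self.service = service
        self.timeline = timeline

    def handle(self, verb: Verb) -> None:
        self.timeline.append(verb)
        verb.execute(self.service)


class UndoManager:
    """Generic: replays logged verbs without knowing what they mean."""
    def __init__(self, timeline: list) -> None:
        self.timeline = timeline

    def replay(self, service: dict, start: int = 0) -> None:
        for verb in self.timeline[start:]:
            verb.execute(service)


timeline: list = []
live: dict = {}
proxy = UndoProxy(live, timeline)
proxy.handle(Deliver("alice", "hello"))
proxy.handle(Deliver("bob", "hi"))
proxy.handle(Expunge("bob"))

# Replaying the timeline against an empty service rebuilds the same state.
rebuilt: dict = {}
UndoManager(timeline).replay(rebuilt)
print(rebuilt == live)
```

Because the manager only calls `execute`, supporting another service would mean writing new verbs and a proxy for it while leaving the manager untouched.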