
Why Recovery Should Be Free, And Often Can Be
Armando Fox, Stanford University
June 2003 ROC Retreat

Recovery Should Be Free, and Can Be
- We already espouse arguments for lowering MTTR:
  - It mitigates the impact of failures on the service as a whole [Fox & Patterson, 2002]
  - It results in higher end-user-perceived availability, given the same overall availability [Xie et al., 2002]
  - etc.
  - Tim Chou, Oracle: it may be more important to make recovery predictable (so we can plan provisioning, anticipate the impact of an outage, etc.) ... and if we understand recovery, we can optimize its speed
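For reference (not on the original slide), the standard steady-state relation behind these MTTR arguments is $A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR})$, so unavailability $1 - A \approx \mathrm{MTTR}/\mathrm{MTTF}$ when $\mathrm{MTTF} \gg \mathrm{MTTR}$. For example, MTTF = 1000 hours with MTTR = 1 hour and MTTF = 100 hours with MTTR = 0.1 hour both give $A \approx 99.9\%$; the Xie et al. slides later in this deck explore why the second, fast-recovering system can still look better to end users.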

Real win: Recovery management is hard
- Determining when to recover is hard
  - How to detect that something's wrong?
  - How do you know when recovery is really necessary? (fail-stutter, etc.)
  - Will recovery make things worse? (cascading recovery)
- Knowing what happens when you recover is hard
  - Will a particular recovery technique work? (the machinery needed to perform the recovery may itself be broken)
  - What is the effect on online performance? (recovery can be expensive)
  - What if you needlessly "over-recover"? (the cost of making a mistake is high)
- If recovery were predictable and fast, it would simplify both failure detection and recovery management.

Simplifying Recovery Management: Crash-Only Software
- Goal: enforce simple invariants on recovery behavior, from outside the component(s) being recovered
- A crash-only component provides a PWR switch, and stop = crash:
  - clean shutdown = loss of power = kernel panic = ...
- One way to go down → one way to come up: start = recover
- The power switch is external → uniform behavior: kill -9, "turning off" (process-killing) a VM, pulling the power cord
  - Intuition: the "infrastructure" supporting the power switch is usually simpler than the applications using it, and common across all those applications
- Can crash-only software actually be built, and if so, how?
  - (a) provide building blocks
  - (b) formalize the crash-only (C/O) definition and provide developer tools
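To make the "start = recover" invariant concrete, here is a minimal sketch of a crash-only component, assuming an append-only log as stable storage; the class, method, and file names are invented for illustration and are not from the talk:

```python
import json
import os

class CrashOnlyCounter:
    """Toy crash-only component: the only way to stop it is to kill the
    process, and the only way to start it is the recovery path, which
    replays an append-only log. There is no separate 'clean shutdown' or
    'normal startup' code path."""

    def __init__(self, log_path="state.log"):
        self.log_path = log_path
        self.value = 0
        self._recover()          # start = recover, on every start

    def _recover(self):
        # The recovery path doubles as the startup path: replay whatever
        # reached stable storage before the last crash.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    self.value += json.loads(line)["delta"]

    def increment(self, delta=1):
        # Write-ahead: once the record is on disk, a crash at any point
        # leaves the component recoverable; in-memory state is a cache.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"delta": delta}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.value += delta

# "stop = crash": there is deliberately no shutdown() method. An operator
# (or a recovery manager) stops the component with kill -9 and restarts it,
# exercising the same _recover() path used on every start.
```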

Crash-only Building Blocks
- JAGR/ROC-2, a self-recovering J2EE application server [Candea et al., WIAPP 2003]
  - Micro-reboots are used for recovery; application-generic failure-path inference is used to determine the recovery strategy
  - Significantly improves performability relative to redeploying the whole application
- SSM: a crash-only session state manager [Ling & Fox, AMS 2003]
- DStore: a crash-only persistent single-key state manager [Huang & Fox, submitted to SRDS 2003]
  - Similar in spirit to HP Labs' FAB [Frolund, Saito et al., 2003]
- Common features of both SSM and DStore:
  - Redundancy is used for persistence
  - Workload semantics are exploited to simplify the consistency model and recovery
  - Recovery = restart; it is safe to reboot any node at any time
  - It is safe to coerce any failure to a crash (fail-stop) at any time
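As a rough illustration of "redundancy used for persistence" and "safe to reboot any node at any time," here is a sketch of a replicated put/get in the style of a crash-only session store; the brick representation, replica counts, and method names are assumptions, not the published SSM or DStore designs:

```python
import random

class RedundantSessionStore:
    """Each write goes to several 'bricks' (storage nodes), so any single
    brick can be crash-rebooted at any time without losing data and
    without a special recovery protocol."""

    def __init__(self, bricks, write_copies=3, required_acks=2):
        self.bricks = bricks              # list of dicts standing in for nodes
        self.write_copies = write_copies  # replicas attempted per write
        self.required_acks = required_acks

    def put(self, key, value):
        targets = random.sample(self.bricks, self.write_copies)
        acked = []
        for brick in targets:
            if brick.get("__up__", True):  # a crashed brick simply never acks
                brick[key] = value
                acked.append(brick)
        if len(acked) < self.required_acks:
            raise IOError("too few replicas acked; the client may retry")
        return acked                       # caller remembers where copies live

    def get(self, key, replicas):
        for brick in replicas:             # any one surviving copy suffices
            if brick.get("__up__", True) and key in brick:
                return brick[key]
        raise KeyError(key)
```

Because every value lives on several bricks and a read needs only one copy, "recovery" for a brick is just a restart with empty memory: no log replay and no handoff, which is what makes reboot-at-any-time safe.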

Building blocks, cont.
- Pinpoint: statistical-anomaly-based failure detection
  - The standard tension: accuracy vs. precision (the false-positives problem)
- Different clustering techniques seem to be good at detecting different kinds of problems
  - A surprising result from a CS241 project: character-frequency histograms are a good application-generic way to detect end-user-visible failures
  - Mostly integrated with JAGR and SSM
  - On the burner: discussions with BEA Systems about integrating Pinpoint into WebLogic Server
- Insight: if the cost of "over-recovering" is low, aggressive statistics-based failure detection becomes more appealing
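A minimal sketch of the character-frequency idea, assuming each HTTP response is compared against a baseline histogram built from known-good pages; the distance metric, threshold, and function names are assumptions, not the CS241 project's actual method:

```python
from collections import Counter

def char_histogram(page: str) -> dict:
    """Normalized character-frequency histogram of an HTML response."""
    counts = Counter(page)
    total = sum(counts.values()) or 1
    return {ch: n / total for ch, n in counts.items()}

def looks_anomalous(response: str, baseline: dict, threshold: float = 0.2) -> bool:
    """Flag a response whose character distribution drifts far from the
    baseline. An error page (stack trace, 'HTTP 500', truncated body)
    usually has a very different character mix than a normally rendered
    page, which is why this crude signal is application-generic."""
    hist = char_histogram(response)
    chars = set(hist) | set(baseline)
    l1_distance = sum(abs(hist.get(c, 0.0) - baseline.get(c, 0.0)) for c in chars)
    return l1_distance > threshold
```

In use, the baseline would be averaged over responses collected while the service is known to be healthy; detections feed a recovery manager that can afford to over-recover because micro-reboots are cheap.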

Toward a crash-only formalism
- Component frameworks force you into certain application-writing patterns:
  - Inter-EJB calls go through a runtime-managed level of indirection
  - Restrictions on how persistent state management can be expressed
  - Restrictions on state sharing: difficult to do without using an explicit external store
  - Hypothesis: these are the elements that allow crash-only behavior to work
- Ongoing work: formalize crash-only software
  - One possibility: observational equivalence with respect to a request stream
  - Could be expressed using a design pattern or denotational semantics
  - Ideally, this will lead to a tool ("co-lint") that tells you whether your component is crash-only
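One informal way to read "observational equivalence with respect to a request stream": a client that sees only the responses cannot tell whether a crash and recovery happened between requests. The toy component and test harness below are invented purely to illustrate that reading; they are not part of any proposed formalism:

```python
class ToyKVStore:
    """Toy crash-only key-value component used only for this sketch."""
    def __init__(self):
        self.log = []        # stands in for stable storage (survives a crash)
        self.mem = {}        # volatile state, lost on a crash
        self.recover()

    def recover(self):
        self.mem = {}
        for key, value in self.log:
            self.mem[key] = value

    def handle(self, request):
        op, key, *rest = request
        if op == "put":
            self.log.append((key, rest[0]))   # reaches stable storage first
            self.mem[key] = rest[0]
            return "ok"
        return self.mem.get(key)

    def crash(self):
        self.mem = {}        # volatile state vanishes; the log survives

def responses(component, requests, crash_points=()):
    out = []
    for i, request in enumerate(requests):
        if i in crash_points:
            component.crash()
            component.recover()               # start = recover
        out.append(component.handle(request))
    return out

# The crash-free run and the run with a forced crash produce the same
# observable responses, which is the equivalence the slide gestures at.
reqs = [("put", "a", 1), ("get", "a"), ("put", "b", 2), ("get", "b")]
assert responses(ToyKVStore(), reqs) == responses(ToyKVStore(), reqs, crash_points={2})
```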

Summary: Toward a Crash-only World
- Goal: simplify recovery management
  - Diagnosis: statistical methods are even more appealing if the cost of making a mistake is low
  - Recovery: crash-only enforces invariants about what happens when recovery is attempted
  - Allows aggressive use of fault-model enforcement [Martin et al., 2002]
- Good progress on providing building blocks for application writers
  - JAGR: a J2EE application server that allows fast recovery via micro-reboots and application-generic fault injection
  - SSM: a crash-only session state store (in the process of being integrated with JAGR)
  - DStore: a crash-only persistent single-key store
  - Pinpoint: statistics-based failure detection (integrated with JAGR, mostly integrated with SSM)

Xie et al.: MTTR and End-User Availability
- Let A_U = user-perceived unavailability and A_S = system unavailability
- Hypothesis: if users retry failed requests, and the retry succeeds because the system recovered quickly, they will perceive higher availability
- When the retry rate is sufficiently frequent, A_U approaches A_S (for A_S = 99.3%, this threshold is … sec)
- Method: model user retry behavior and system failure/recovery using Markov models; solve using numerical methods
- Finding: given two systems with the same A_S, the one with shorter MTTR (even though it also has lower MTTF) appears better to the user
- Goal of this project: validate that result empirically (Jeff Raymakers, Yee-Jiun Song, Wendy Tobagus)
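A toy Monte Carlo version of the retry idea, which approximates rather than reproduces the paper's Markov model; the exponential up/down assumption, retry policy, and all parameter values below are illustrative assumptions:

```python
import random

def perceived_failure_rate(mttf, mttr, retry_interval, max_retries=3,
                           n_requests=200_000, seed=0):
    """Fraction of requests a user perceives as failed, given that a
    request arriving during downtime still 'succeeds' if one of a few
    retries lands after the system has recovered."""
    rng = random.Random(seed)
    system_unavail = mttr / (mttf + mttr)
    failures = 0
    for _ in range(n_requests):
        if rng.random() >= system_unavail:
            continue                       # request arrived while system was up
        # With exponential repair times, the remaining downtime seen by a
        # randomly arriving request is again exponential with mean MTTR.
        remaining_down = rng.expovariate(1.0 / mttr)
        recovered_in_time = max_retries * retry_interval >= remaining_down
        if not recovered_in_time:
            failures += 1
    return failures / n_requests

# Two systems with identical unavailability (0.7%), but the fast-recovering
# one lets 30-second retries mask most outages from the user.
print(perceived_failure_rate(mttf=9930.0, mttr=70.0, retry_interval=30.0))
print(perceived_failure_rate(mttf=993.0,  mttr=7.0,  retry_interval=30.0))
```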

[Figure: user-perceived unavailability vs. retry rate. Beyond a "sweet spot," higher user retry rates yield little further improvement in perceived availability.]

[Figure: user-perceived unavailability for variable MTTR with fixed system availability (low MTTR → low MTTF). Surprise: at low MTTR, lowering MTTR and MTTF at the same time makes user-perceived unavailability worse; MTTF eventually catches up with you.]

[Figure: optimization choices, comparing a fixed-MTTF path and a fixed-MTTR path in terms of system unavailability and user-perceived unavailability.]

Results Summary
- We can find a "sweet spot" (for a given system availability) beyond which higher user retry rates yield little benefit.
- For two systems of a given availability, the one with lower MTTR does not always yield better user-perceived availability.
- For a given system, we can determine whether improving MTTR or MTTF will yield more user-visible benefit.

"Clean" shutdown vs. restart?
- It is impractical to guarantee zero crashes, so robust systems must be crash-safe anyway
  - In that case, why support any other kind of shutdown?
  - Historically, for performance (avoiding synchronous writes, doing buffering/caching, etc.); this leads to replicated/mirrored state, more code, and special recovery code paths...
- Crash-only software must (a) be crash-safe and (b) recover quickly
- Total recovery time may be shorter even if the crash is forced
- Windows XP can be (mostly) crash-rebooted for upgrades
- VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on)

Why Crash-Only Simplifies Recovery
- "Hardware works, software doesn't"
  - Hardware interlocks, timers, etc. have small state spaces of behavior, hence high confidence that they will work as designed
  - The crash-only PWR switch is a way to approach that same property for software
- Crash-only makes recovery policies easier to reason about
  - Opportunity to aggressively apply software rejuvenation
  - The "recovery" code is exercised on every restart; there are no exotic-but-rarely-used code paths
  - "Over-recovery" may be OK from a performability standpoint: if recovery is free (in performance and correctness), you stop thinking about it as recovery and start thinking about it as a normal aspect of operation

Towards a Crash-Only World
- Existing software that is crash-only or nearly crash-only:
  - Stateless applications: most Web servers
  - Most RDBMSs: crash-safe, but with long recovery
  - Postgres, BerkeleyDB/Sleepycat: the "recovery" code path is the main code path
  - Some appliance storage devices: a separate but fairly fast recovery path
- Our goals...
  - Focus on Internet ("3-tier") applications, which are already "crash-mostly" except for the persistence tier(s)
  - Make the application server, middle-tier persistence, and (to the extent possible) the back-end tier truly crash-only
  - Deploy application-generic failure-detection techniques (which may over-recover, but the goal is to make that OK)
  - Quantify the improvement (we hope!) in performability resulting from these changes
  - By doing this in the middleware, any application on that middleware can benefit