1
Why Recovery Should Be Free, And Often Can Be
Armando Fox, Stanford University
June 2003 ROC Retreat
2
© 2003 Armando Fox
Recovery Should Be Free, and Can Be
- Already espouse arguments about lowering MTTR:
  - Mitigates impact on service as a whole [Fox & Patterson, 2002]
  - Results in higher end-user-perceived availability, given same overall availability [Xie et al. 2002]
  - etc.
  - Tim Chou, Oracle: maybe more important to make recovery predictable (so can plan provisioning, anticipate impact of outage, etc.)... if we understand it, we can optimize its speed
3
Real win: Recovery management is hard
- Determining when to recover is hard
  - How to detect that something’s wrong?
  - How do you know when recovery is really necessary? (fail-stutter, etc.)
  - Will recovery make things worse? (cascading recovery)
- Knowing what happens when you recover is hard
  - Will a particular recovery technique work? (the machinery needed to perform the recovery may also be broken)
  - What is the effect on online performance? (recovery can be expensive)
  - What if you needlessly “over-recover”? (cost of making a mistake is high)
- If recovery were predictable and fast, it would simplify both failure detection and recovery management.
4
Simplifying Recovery Management: Crash-Only Software
- Goal: enforce simple invariants on recovery behavior, from outside the component(s) being recovered
- Crash-only component provides PWR switch: stop = crash
  - clean shutdown = loss of power = kernel panic = ...
- One way to go down -> one way to come up: start = recover
- Power switch is external -> uniform behavior
  - kill -9, “turning off” (process kill) a VM, pull power cord
  - Intuition: the “infrastructure” supporting the power switch is usually simpler than the applications using it, and common across all those applications
- Can crash-only software actually be built, and if so, how?
  - (a) provide building blocks
  - (b) formalize C/O definition and provide developer
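A minimal sketch of the "stop = crash, start = recover" idea, written in Python for illustration. The `CrashOnlyStore` class, its JSON file layout, and the temp-file naming are all hypothetical and are not taken from JAGR, SSM, or DStore. The component has no shutdown method: the only way down is to abandon it, and the only way up is the constructor, which always runs the recovery path.

```python
import json
import os
import tempfile


class CrashOnlyStore:
    """Hypothetical crash-only key-value component (illustration only)."""

    def __init__(self, path):
        # start = recover: the constructor is the only way to come up,
        # and it always runs the same recovery path.
        self.path = path
        self.data = self._recover()

    def _recover(self):
        # Discard any half-written temp file left behind by a crash mid-write.
        tmp = self.path + ".tmp"
        if os.path.exists(tmp):
            os.remove(tmp)
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def put(self, key, value):
        # Persist before acknowledging: write a temp file, fsync it, then
        # atomically rename it over the committed file.
        new_data = {**self.data, key: value}
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(new_data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        self.data = new_data

    def get(self, key):
        return self.data.get(key)

    # Deliberately no close() or shutdown(): stop = crash.


def demo():
    path = os.path.join(tempfile.mkdtemp(), "state.json")
    store = CrashOnlyStore(path)
    store.put("user", "alice")
    # Simulate a crash in the middle of a later write: a partial temp file
    # exists, and the in-memory object is simply abandoned.
    with open(path + ".tmp", "w") as f:
        f.write('{"user": "bo')
    del store
    recovered = CrashOnlyStore(path)  # start = recover
    return recovered.get("user")
```

Because every start goes through `_recover`, the recovery code is exercised on every run rather than only after rare failures, which is the property the slide argues for.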
5
Crash-only Building Blocks
- JAGR/ROC-2, a self-recovering J2EE app server [Candea et al., WIAPP 2003]
  - Micro-reboots used for recovery, application-generic failure-path inference used for determining recovery strategy
  - Significantly improves performability relative to whole-app redeploy
- SSM: a CO session state manager [Ling, Fox, AMS 2003]
- DStore: a CO persistent single-key state manager [Huang, Fox, submitted to SRDS 2003]
  - Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003]
- Common features of both SSM and DStore:
  - Redundancy used for persistence
  - Workload semantics exploited to simplify consistency model & recovery
  - Recovery = restart, safe to reboot any node at any time
  - Safe to coerce any failure to a crash (fail-stop) at any time
6
Building blocks, cont.
- Pinpoint, statistical-anomaly-based failure detection
  - Standard tension: accuracy vs. precision (false positives problem)
- Different clustering techniques seem to be good at detecting different kinds of problems
  - Surprising result from a CS241 project: character-frequency histograms are a good app-generic way to detect end-user-visible failures
  - Mostly integrated with JAGR and SSM
  - On burner: discussions with BEA Systems for integrating into WebLogic Server
- Insight: if cost of “over-recovering” is low, aggressive statistics-based failure detection becomes more appealing
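The character-frequency idea can be sketched roughly as follows. This is an illustration only: the slides do not say which distance measure or threshold the CS241 project or Pinpoint used, so the L1 distance, the `threshold=0.5` default, and all function names here are assumptions.

```python
from collections import Counter


def char_histogram(text):
    """Return the normalized character-frequency histogram of a response body."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def histogram_distance(h1, h2):
    """L1 distance between two histograms (0 = identical, 2 = disjoint)."""
    chars = set(h1) | set(h2)
    return sum(abs(h1.get(c, 0.0) - h2.get(c, 0.0)) for c in chars)


def looks_anomalous(response, baseline_hist, threshold=0.5):
    """Flag a response whose histogram is far from a known-good baseline.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return histogram_distance(char_histogram(response), baseline_hist) > threshold


# Hypothetical pages: a known-good page, a similar good page, and an error dump.
GOOD = "<html><body><h1>Your cart</h1><p>3 items, total $42.00</p></body></html>"
SIMILAR = "<html><body><h1>Your cart</h1><p>5 items, total $17.50</p></body></html>"
ERROR = "java.lang.NullPointerException at com.example.Cart.total(Cart.java:57)"
```

The detector needs no knowledge of the application: a page that normally consists of HTML markup has a very different character mix from a stack trace or an empty body. When over-recovering is cheap, a loose threshold that occasionally misfires is tolerable.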
7
Toward a crash-only formalism
- Component frameworks force you into certain app-writing patterns
  - Inter-EJB calls through runtime-managed level of indirection
  - Restrictions on how persistent state mgt can be expressed
  - Restrictions on state sharing: difficult to do without using explicit external store
  - Hypothesis: these are the elements that allow C/O to work
- Ongoing work: formalize crash-only SW
  - One possibility: observational equivalence with respect to a request stream
  - Can be expressed using a design pattern or denotational semantics
  - Ideally, will lead to a tool (“co-lint”) telling you whether your component is crash-only
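One way to read "observational equivalence with respect to a request stream" is as a test: replay the same requests with a crash injected at each point and check that the responses match a crash-free run. The sketch below assumes that reading. It is not the formalism or the "co-lint" tool the slide refers to, it only injects crashes between requests (never mid-request), and every name in it is hypothetical.

```python
class DurableCounter:
    """All state lives in the external store, so a restart loses nothing."""

    def __init__(self, store):
        self.store = store

    def handle(self, request):
        self.store["n"] = self.store.get("n", 0) + request
        return self.store["n"]


class VolatileCounter:
    """Keeps its running total in process memory, which a crash wipes out."""

    def __init__(self, store):
        self.n = 0

    def handle(self, request):
        self.n += request
        return self.n


def run(component_cls, requests, crash_before=None):
    """Replay requests, optionally crashing and restarting before one of them."""
    store = {}  # stands in for an external persistent store that survives crashes
    component = component_cls(store)
    outputs = []
    for i, req in enumerate(requests):
        if i == crash_before:
            # Crash + restart: drop the instance and construct a fresh one.
            component = component_cls(store)
        outputs.append(component.handle(req))
    return outputs


def is_crash_only(component_cls, requests):
    """True if a crash at any point leaves the observed responses unchanged."""
    reference = run(component_cls, requests)
    return all(
        run(component_cls, requests, crash_before=i) == reference
        for i in range(len(requests))
    )
```

For the request stream `[1, 2, 3]`, the durable component passes and the volatile one fails, which matches the slide's hypothesis that pushing state into an explicit external store is what makes crash-only behavior possible.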
8
Summary: Toward a Crash-only World
- Goal: simplify recovery management
  - diagnosis: statistical methods even more appealing if the cost of making a mistake is low
  - recovery: crash-only enforces invariants about what happens when recovery is attempted
  - allows aggressive use of fault model enforcement [Martin et al 2002]
- Good progress on providing building blocks for app writers
  - JAGR: J2EE app server that allows fast recovery via micro-reboots and application-generic fault injection
  - SSM: a crash-only session state store (in process of integrating with JAGR)
  - DStore: a crash-only persistent single-key store
  - Pinpoint: statistics-based failure detection (integrated with JAGR, mostly integrated with SSM)
9
Xie et al: MTTR and End-User Availability
Let A_U = user-perceived unavailability, A_S = system unavailability
- Hypothesis: if users retry failed requests, and retry succeeds because system had fast recovery, they will perceive higher availability
- When the retry rate is sufficiently high, A_U approaches A_S (for A_S = 99.3%, this threshold is 200-300 sec)
- Method: model user retry behavior and system failure/recovery using Markov models; solve using numerical methods
- Finding: Given 2 systems with same A_S, the one with shorter MTTR (even though it also has lower MTTF) appears better to the user.
- Goal of this project: validate that result empirically (Jeff Raymakers, Yee-Jiun Song, Wendy Tobagus)
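To make the retry hypothesis concrete, here is a deliberately simplified sketch. It is not the model of Xie et al.: it assumes a two-state (up/down) continuous-time Markov chain with exponential failure and repair times, a user who arrives at a random moment, and a fixed number of retries at a fixed interval. The function name and parameters are illustrative.

```python
import math


def perceived_unavailability(mttf, mttr, retry_interval, retries):
    """Probability that a request and all of its retries find the system down.

    Assumes a two-state Markov chain with failure rate 1/mttf and repair rate
    1/mttr, observed in steady state. All times share one unit (e.g. seconds).
    """
    lam = 1.0 / mttf  # failure rate (up -> down)
    mu = 1.0 / mttr   # repair rate (down -> up)
    p_down = lam / (lam + mu)  # steady-state system unavailability
    # Probability the system is down one retry interval after being seen down
    # (standard transition probability of a two-state chain).
    stay_down = p_down + (1.0 - p_down) * math.exp(-(lam + mu) * retry_interval)
    # By the Markov property, each retry is an independent draw of stay_down.
    return p_down * stay_down ** retries
```

With zero retries the result is just the system unavailability. With retries, a system that recovers quickly is more likely to be back up by the time the user tries again, so two systems with equal availability can differ in what the user perceives. This simple model always favors the shorter MTTR, so it illustrates only the basic hypothesis and not any subtler interaction between MTTR and MTTF.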
10
User perceived unavailability vs retry rate
[Figure: user perceived unavailability vs. retry rate, with the “sweet spot” marked]
Higher user retry rates yield little improvement in perceived availability.
11
[Figure: variable MTTR, but fixed system availability (low MTTR -> low MTTF), with the “sweet spot” marked]
At low MTTR, lowering MTTR and MTTF at the same time results in worse user perceived unavailability!
Surprise! MTTF eventually catches up with you
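The "low MTTR -> low MTTF" coupling follows from the usual steady-state definition of availability, assumed here (the slides do not spell it out). Writing A for system availability:

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
\quad\Longrightarrow\quad
\mathrm{MTTF} = \frac{A}{1 - A}\,\mathrm{MTTR}
```

With A held fixed, MTTF is proportional to MTTR. For example, at A = 0.993 the ratio A/(1 - A) is about 141.9, so halving MTTR also halves MTTF and failures become twice as frequent.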
12
Optimization Choices
[Figure labels: Fixed MTTF, Fixed MTTR, System Unavailability, User Perceived Unavailability]
13
Results Summary
- We can find a “sweet spot” (for a given system availability) beyond which higher user retry rates yield little benefit.
- For two systems of a given availability, the one with lower MTTR does not always yield better user perceived availability.
- For a given system, we can determine whether improving MTTR or MTTF will yield more user-visible benefits.
14
“Clean” shutdown vs. restart?
- Impractical to guarantee zero crashes -> robust systems must be crash-safe anyway
  - In that case, why support any other kind of shutdown?
  - Historically, for performance (avoid synchronous writes, do buffering/caching, etc.): leads to replicated/mirrored state, more code, special recovery code paths...
- Crash-only software must: (a) be crash-safe & (b) recover quickly
- Total recovery time may be shorter even if crash is forced
- WinXP can be (mostly) crash-rebooted for upgrades
- VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on)
15
Why Crash-Only Simplifies Recovery
- “Hardware works, software doesn’t”
  - Hardware interlocks, timers, etc. have small state spaces of behavior, hence high confidence they will work as designed
  - Crash-only PWR switch is a way to approach that same property for software
- Crash-only makes recovery policies easier to reason about
  - Opportunity to aggressively apply SW rejuvenation
  - “Recovery” code exercised on every restart; no exotic-but-rarely-used code paths
  - “Over-recovery” may be OK from performability standpoint: if recovery is free (performance & correctness), you stop thinking about it as recovery and start thinking about it as normal aspect of operation
16
Towards a Crash-Only World
- Existing software that is crash-only or near-crash-only
  - Stateless apps: most Web servers
  - Most RDBMS’s: crash-safe, but long recovery
  - Postgres, BerkeleyDB/Sleepycat: “recovery” codepath is the main codepath
  - Some appliance storage devices: separate but pretty fast recovery path
- Our goals...
  - Focus on Internet (“3 tier”) applications; already “crash-mostly” except for persistence tier(s)
  - Make the app server, middle-tier persistence, and back-end tier (to the extent possible) truly crash-only
  - Deploy application-generic failure detection techniques (which may over-recover, but the goal is to make that OK)
  - Quantify improvement (we hope!) in performability resulting from these changes
  - By doing it in the middleware, any app on that middleware can benefit