Published by Russell Carson. Modified over 9 years ago.
1
ROC@Stanford Progress Report
Armando Fox, with George Candea, James Cutler, Ben Ling, Andy Huang
2
© 2002 Armando Fox
Philosophical Direction
- Use only dynamic, observed behavior to determine recovery technique/policy
- Application-independent recovery techniques
- Specialize designs for fast recovery
- Putting it all together: all software should be crash-only
3
Dynamic, Observed Behavior
- A priori fault models are suspect. Base recovery strategy only on dynamically observed behavior.
  - Behavior may change as system or workload evolves => addresses a key difference between Internet-oriented ROC systems and traditional mission-critical systems
- Kinds of observations
  - PinPoint: use statistical analysis to determine which groups of components are correlated with observed external faults
  - Automatic failure-propagation inference: use fault injection and tracing to determine propagation paths and extent of different kinds of faults
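The correlation idea behind the PinPoint bullet can be illustrated with a deliberately simplified sketch. It is not PinPoint's actual analysis: it assumes each request is traced as the set of components it touched plus a pass/fail flag, and ranks components by the failure rate of the requests that used them. The function name `rank_suspects` and the sample traces are hypothetical.

```python
from collections import defaultdict

def rank_suspects(traces):
    """Rank components by failure rate among the requests that used them.

    traces: iterable of (components_used, failed) pairs, where
    components_used is a set of component names (hypothetical format).
    """
    seen = defaultdict(int)         # requests that touched the component
    failed_with = defaultdict(int)  # of those, how many failed
    for components, failed in traces:
        for name in set(components):
            seen[name] += 1
            if failed:
                failed_with[name] += 1
    scores = {name: failed_with[name] / seen[name] for name in seen}
    # Highest failure rate first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical request traces: every request through "cart" fails.
traces = [
    ({"web", "cart", "db"}, True),
    ({"web", "catalog", "db"}, False),
    ({"web", "cart"}, True),
    ({"web", "catalog"}, False),
]
ranking = rank_suspects(traces)
```

In this toy data "cart" ranks first because it appears only in failed requests, while "web" appears in every request and so carries little signal on its own.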
4
Making techniques application-generic
- True application-generic recovery is hard [Lowell & Chen]
  - But that's because "generic" applications are too unconstrained
- Idea: if an application uses a particular "rich runtime", that runtime may constrain application structure
- Example: J2EE, a widely used enterprise application framework
  - Modular Java applications, well-defined component boundaries
  - Rich runtime system ("application server") provides services for deployment/undeployment, naming, load balancing, integration with Web servers & databases, etc.
  - Instrument the platform with generic methods for fault injection and recovery (e.g., using Recursive Restartability)
  - Generic mechanisms: timeouts, exception propagation
  - Parametrizable mechanisms: progress counters, application-level pings
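As an illustration of a parametrizable mechanism, here is a minimal sketch of a progress-counter check. It assumes each component exposes a monotonically increasing counter that a monitor samples periodically; the class name `ProgressMonitor` and the `max_stalled_checks` parameter are hypothetical and not part of any J2EE API.

```python
class ProgressMonitor:
    """Flags a component as hung when its progress counter stops advancing."""

    def __init__(self, max_stalled_checks):
        # The per-application parameter: how many unchanged samples to tolerate.
        self.max_stalled_checks = max_stalled_checks
        self.last_seen = {}
        self.stalled = {}

    def check(self, name, counter_value):
        """Record one sample; return True if the component looks hung."""
        if self.last_seen.get(name) != counter_value:
            # Counter moved (or first sample): the component is making progress.
            self.last_seen[name] = counter_value
            self.stalled[name] = 0
            return False
        self.stalled[name] += 1
        return self.stalled[name] >= self.max_stalled_checks
```

The mechanism itself is generic; only the threshold (and what counts as "progress") has to be supplied per application.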
5
Example: Automatic Failure Propagation Inference
- When a failure occurs in a particular software component of an application, how far does it propagate?
  - i.e., what part(s) of the application must be recovered
  - Traditionally, failure propagation information is derived by hand
- Our approach: modify J2EE application server to allow capture of failure-propagation information in any J2EE app
- Automatic Failure-Propagation Inference (AFPI) for JBoss:
  + automatically and dynamically generates f-maps with no performance overhead
  + no application knowledge required
  + finds dependencies that other analyses might miss, omits "false" dependencies that don't result in actual failure propagation
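The inject-and-observe loop behind an f-map can be sketched on a simulated application. This is not the JBoss implementation: the call graph `CALLS`, the set `MASKED` (edges where the caller handles the callee's failure), and both function names are hypothetical stand-ins for a real fault injector and tracer.

```python
# Hypothetical application: which component calls which.
CALLS = {"web": ["cart", "ads"], "cart": ["db"], "ads": [], "db": []}
# Assumed: "web" catches failures of "ads", so that static dependency
# never propagates a failure at run time.
MASKED = {("web", "ads")}

def inject_and_observe(faulty):
    """Simulate injecting a fault and return every component seen to fail."""
    failed = {faulty}
    changed = True
    while changed:
        changed = False
        for caller, callees in CALLS.items():
            if caller in failed:
                continue
            if any(c in failed and (caller, c) not in MASKED for c in callees):
                failed.add(caller)
                changed = True
    return failed

def infer_f_map(components, observe):
    """Map each component to the other components its failure propagates to."""
    return {c: observe(c) - {c} for c in components}

f_map = infer_f_map(CALLS, inject_and_observe)
```

Because the map is built from observed failures rather than from the call graph, the `web -> ads` dependency does not show up: a fault in "ads" propagates nowhere in this simulation.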
6
Design for Fast Recovery
- Recursive Restartability as a technique for recovery assumes...
  - For correctness: all components are independent and restartable (i.e., no data loss or other bad effects)
  - For performance: restarts are relatively fast
- For stateless components, this is "easy"; what about stateful components?
  - Correctness: e.g., filesystems may suffer data loss if the OS is not cleanly shut down
  - Performance: e.g., commercial RDBMSs are crash-safe, but take a long time (minutes to hours) to recover
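A minimal sketch of restart escalation, under the assumption that components are arranged in a restart tree and that recovery restarts the smallest subtree first, widening to the parent only if the failure persists. The `Node` class, `recover` function, and the toy tree are hypothetical illustrations, not the project's code.

```python
class Node:
    """A node in a hypothetical restart tree; leaves are restartable components."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

def recover(node, restart, healthy):
    """Restart node's subtree, escalating to ancestors until healthy() is True.

    Returns the names of the subtrees that were restarted, in order.
    """
    attempted = []
    while node is not None:
        for leaf in node.leaves():
            restart(leaf.name)
        attempted.append(node.name)
        if healthy():
            break
        node = node.parent
    return attempted

# Toy scenario: the failure is reported at "a1" but only clears once "a2" restarts.
a1, a2, b = Node("a1"), Node("a2"), Node("B")
root = Node("root", [Node("A", [a1, a2]), b])
restarted = []
attempted = recover(a1, restarted.append, lambda: "a2" in restarted)
```

Here the cheap restart of "a1" alone does not cure the failure, so recovery escalates to subtree "A" and stops there without touching "B".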
7
Fast-Recovering State Stores
- Isolate state exclusively in state store components; make all other "application logic" components stateless
- Instead of building a general state store, specialize it for its intended use
  - Goal: identify combination of specializations that facilitates construction of a very-large-scale state store (O(10^3) requests/sec on O(10^6) entries) with near-zero recovery time
- Possible axes for specialization...
  - Is state shared across clients or not? (user profile/session state vs. updating a message board)
  - How powerful must the query API be? (single-key lookup, free-text search, fully relational...)
  - What is the intended lifetime of state? (short/session, long/forever)
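To make one point on these axes concrete, here is a sketch of a store specialized for unshared, short-lived state with single-key lookup only. It is an assumption-laden illustration, not the design the slides refer to: `SessionStore`, its `ttl_seconds` parameter, and the injected `clock` callable are all hypothetical.

```python
class SessionStore:
    """Single-key, expiring store (hypothetical sketch).

    Narrow API: put/get by key, no scans or joins. Entries expire on their
    own, so a restarted instance could start empty and rely on clients
    re-creating sessions (an assumption of this sketch).
    """

    def __init__(self, ttl_seconds, clock):
        self.ttl = ttl_seconds
        self.clock = clock  # callable returning the current time in seconds
        self.entries = {}

    def put(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.entries[key]  # lazily drop expired state
            return None
        return value
```

Giving up sharing, rich queries, and long lifetimes is what keeps both the implementation and its recovery story small.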
8
Putting it together: crash-only software
- Already assumed: software must be able to recover from a crash rapidly and correctly
- But if it can do that... then why include separate code paths for "clean shutdown"?
- All software should be crash-only; this makes it robust, easy to administer/upgrade, and amenable to RR as a recovery technique (among others)
- Current explorations:
  - RR-ifying the platform (J2EE appserver) vs. individual applications
  - Improving ability to detect anomalies and failure correlations using path-based statistical analysis
  - Designing crash-only state stores for both session state and persistent state
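The crash-only idea can be sketched as a component with no shutdown method at all: stopping means killing it, and starting always runs recovery. The example below is a minimal illustration under stated assumptions (an append-only log of JSON records, one per line, synced on every write); the class name `CrashOnlyCounter` and the record format are hypothetical.

```python
import json
import os

class CrashOnlyCounter:
    """A counter with no clean-shutdown path: start == recover, stop == crash."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.total = 0
        self._recover()
        self._log = open(log_path, "ab")

    def _recover(self):
        """Replay complete records and drop a torn final record, if any."""
        if not os.path.exists(self.log_path):
            return
        good_bytes = 0
        with open(self.log_path, "rb") as f:
            for raw in f:
                try:
                    if not raw.endswith(b"\n"):
                        raise ValueError("torn record")
                    self.total += json.loads(raw)["add"]
                except (ValueError, KeyError, TypeError):
                    break
                good_bytes += len(raw)
        with open(self.log_path, "r+b") as f:
            f.truncate(good_bytes)

    def add(self, n):
        # Make the record durable before acknowledging the update.
        self._log.write(json.dumps({"add": n}).encode("utf-8") + b"\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.total += n
```

Because every start goes through `_recover`, the recovery path is exercised constantly rather than only after rare failures, which is part of the argument for dropping the separate clean-shutdown path.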
9
Outrageous Opinions session tomorrow
- Tomorrow after dinner: controversial ideas/opinions, open challenges, predicting the future, ...
  - Please sign up on easel (coming this afternoon)
  - ~5-8 minutes per person to pound the pulpit and stimulate later discussion
- Retreat proceedings, slides, etc. (mostly) online
  - Internet keyword "retreat" :-) or http://retreat or 10.0.0.1