Slide 1: Recovery-Oriented Computing: Stanford ROC Updates
Armando Fox

Slide 2: Progress
- Graduations:
  - Ben Ling (SSM, a cheap-recovery session state manager)
  - Jamie Cutler (refactoring satellite groundstation software architecture to apply ROC techniques)
- Andy Huang: DStore, a persistent cluster-based hash table (CHT)
  - Consistency model concretized
  - Cheap recovery exploited for fast recovery triggered by statistical monitoring
  - Cheap recovery exploited for online repartitioning

Slide 3: More Progress
- George Candea: microreboots at the EJB level in J2EE apps
  - Shown to recover from a variety of injected faults
  - J2EE app session state factored out into SSM, making the J2EE app crash-only
  - Demo during the poster session
- Emre Kiciman: Pinpoint, further exploration of anomaly-based failure detection [in a minute]

Slide 4: Fast Recovery Meets Anomaly Detection
1. Use anomaly-detection techniques to infer (possible) failures
2. Act on alarms using low-overhead "micro-recovery" mechanisms (sketched below this slide)
   - Microreboots in EJB apps
   - Node- or process-level reboot in DStore or SSM
3. Occasional false positives are OK since recovery is so cheap
These ideas will be developed at the panel tonight, and form topics for the breakouts tomorrow.
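A minimal sketch of the detect-then-micro-recover loop this slide describes, in Python for illustration. The Monitor and AppServer classes are hypothetical stand-ins (the real systems wire Pinpoint into JBoss microreboots or DStore/SSM process restarts); only the policy, acting on every alarm because recovery is cheap, comes from the slide.

    import time
    from dataclasses import dataclass, field

    # Hypothetical stand-ins for the anomaly detector and the app server.
    @dataclass
    class Monitor:
        flagged: list = field(default_factory=list)

        def poll(self):
            """Return (and clear) the components currently flagged."""
            out, self.flagged = self.flagged, []
            return out

    class AppServer:
        def microreboot(self, component: str):
            # Sub-second, state-safe restart of a single component.
            print(f"microrebooting {component}")

    def control_loop(monitor: Monitor, server: AppServer,
                     rounds: int = 3, period_sec: float = 1.0):
        # Act on every alarm: recovery is so cheap that an occasional
        # false positive only costs one unnecessary microreboot.
        for _ in range(rounds):
            for comp in monitor.poll():
                server.microreboot(comp)
            time.sleep(period_sec)

    m = Monitor(flagged=["CartEJB"])
    control_loop(m, AppServer(), rounds=1, period_sec=0.0)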

Slide 5: Recovery-Oriented Computing: Updates on Pinpoint
Emre Kıcıman and Armando Fox, {emrek, fox}@cs.stanford.edu

Slide 6: What Is This Talk About?
- Overview of recent Pinpoint experiments, including observations on fault behaviors
- Comparison with other app-generic fault detectors
- Tests of Pinpoint limitations
- Status of deployment at real sites

Slide 7: Pinpoint: Overview
- Goal: app-generic, high-level failure detection
  - For app-level faults, detection is a significant fraction of MTTR (75%!)
  - Existing monitors are hard to build/maintain, or miss high-level faults
- Approach: monitor, aggregate, and analyze low-level behaviors that correspond to high-level semantics
  - Component interactions
  - Structure of runtime paths
  - Analysis of per-node statistics (req/sec, memory usage, ...) without a priori thresholds
- Assumption: anomalies are likely to be faults
  - Look for anomalies over time, or across peers in the cluster

Slide 8: Recap: 3 Steps to Pinpoint
1. Observe low-level behaviors that reflect app-level behavior
   - Likely to change iff application behavior changes
   - App-transparent instrumentation!
2. Model normal behavior and look for anomalies
   - Assume most of the system is working most of the time
   - Look for anomalies over time and across peers
   - No a priori app-specific info!
3. Correlate anomalous behavior to likely causes
   - Assume an observed connection between anomaly and cause
   - Finally, notify the admin or reboot the component

Slide 9: An Internet Service...
(diagram: HTTP front-ends, application components, and middleware in front of databases)

Slide 10: A Failure...
(same diagram, with one application component marked with an X)
- Failures behave differently than normal
- Look for anomalies in patterns of internal behavior

Slide 11: Patterns: Path Shapes
(diagram: a request's path traced through front-ends, application components, and databases)

Slide 12: Patterns: Component Interactions
(diagram: weighted interaction links between components across the tiers)

Slide 13: Outline
- Overview of recent Pinpoint experiments
  - Observations on fault behaviors (current section)
- Comparison with other app-generic fault detectors
- Tests of Pinpoint limitations
- Status of deployment at real sites

Slide 14: Compared to Other Anomaly Detection...
- Labeled and unlabeled training sets
  - If we know the end user saw a failure, Pinpoint can help with localization
  - But often we're trying to catch failures that end-user-level detectors would miss
  - "Ground truth" for the latter is HTML page checksums + database table snapshots
- Current analyses are done offline
  - Eventual goal is to move online, with new models being trained and rotated in periodically
- Alarms must be actionable
  - Microreboots (tomorrow) allow acting on alarms even when there are false positives

Slide 15: Fault and Error Injection Behavior
- Injected 4 types of faults and errors:
  - Declared and runtime exceptions
  - Method-call omissions
  - Source-code bug injections (details on the next page)
- Results ranged in severity (% of requests affected)
- 60% of faults caused cascades, affecting secondary requests
- We fared most poorly on the "minor" bugs

Fault type    | Num | Severe (>90%) | Major (>1%) | Minor (<1%)
Declared ex   | 41  | 20%           | 56%         | 24%
Runtime ex    | 41  | 17%           | 59%         | 24%
Call omission | 41  | 5%            | 73%         | 22%
Src code bug  | 47  | 13%           | 76%         | 11%

Slide 16: Experience with Bug Injection
- Wrote a Java code modifier to inject bugs
  - Injects 6 kinds of bugs into code in Petstore 1.3
  - Limited to bugs that would not be caught by the compiler and are easy to inject -> no major structural bugs
  - Double-check fault existence by checksumming HTML output
- Not trivial to inject bugs that turn into failures!
  - 1st try: inject 5-10 bugs into random spots in each component. Ran 100 experiments; only 4 caused any changes!
  - 2nd try: exhaustive enumeration of potential "bug spots". Found a total of 41 active spots out of 1000s.
  - The rest is straight-line code with no trivial bug spots, or dead code.

Slide 17: Source Code Bugs (Detail)
(a toy injector sketch follows this list)
- Loop errors: inverts loop conditions; injected 15
  while(b) {stmt;} -> while(!b) {stmt;}
- Misassignment: replaces the LHS of an assignment; injected 1
  i=f(a); -> j=f(a);
- Misinitialization: clears a variable initialization; injected 2
  int i=20; -> int i=0;
- Misreference: replaces a variable reference; injected 6
  avail=onStock-Ordered; -> avail=onStock-onOrder;
- Off-by-one: replaces a comparison operator; injected 17
  if(a > b) {...} -> if(a >= b) {...}
- Synchronization: removes synchronization code; injected 0
  synchronized { stmt; } -> { stmt; }
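As a toy illustration of how such an injector can work, here is a Python sketch of just one of the six rules (the off-by-one operator swap) applied textually to Java source. The real tool was a Java code modifier operating on the code itself; the regex below is an assumption-laden simplification that only handles simple comparison expressions.

    import re

    # One injection rule, off-by-one: turn '>' into '>='. The lookarounds
    # skip '->', '>>' and '>='; generics like List<String> would still
    # fool it, which is why the real injector worked on parsed Java.
    OFF_BY_ONE = re.compile(r"(?<![-><=])>(?![=>])")

    def inject_off_by_one(java_source: str, occurrence: int = 0) -> str:
        """Replace the occurrence-th '>' comparison with '>='."""
        spots = list(OFF_BY_ONE.finditer(java_source))
        if occurrence >= len(spots):
            return java_source              # no such "bug spot" here
        m = spots[occurrence]
        return java_source[:m.start()] + ">=" + java_source[m.end():]

    print(inject_off_by_one("if (qty > limit) { reject(order); }"))
    # -> if (qty >= limit) { reject(order); }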

Slide 18: Outline
- Overview of recent Pinpoint experiments, including observations on fault behaviors
- Comparison with other app-generic fault detectors (current section)
- Tests of Pinpoint limitations
- Status of deployment at real sites

Slide 19: Metrics: Recall and Precision
- Recall = C/T: how much of the target was identified
- Precision = C/R: how much of the results were correct
  - Also, precision = 1 - false positive rate
(Venn diagram: Results (R) and Target (T) overlap in Correctly Identified (C))
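These definitions translate directly into code; a small set-based helper, with made-up request IDs for illustration:

    def recall_precision(results: set, target: set):
        """Recall = |C|/|T|, Precision = |C|/|R|, with C = R intersect T."""
        correct = results & target
        recall = len(correct) / len(target) if target else 1.0
        precision = len(correct) / len(results) if results else 1.0
        return recall, precision

    # Toy example: the detector flags 4 requests, 3 of them truly faulty,
    # out of 6 faulty requests total.
    flagged = {"r1", "r2", "r3", "r9"}
    faulty = {"r1", "r2", "r3", "r4", "r5", "r6"}
    print(recall_precision(flagged, faulty))   # (0.5, 0.75)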

Slide 20: Metrics: Applying Recall and Precision
- Detection: do failures in the system cause detectable anomalies?
  - Recall = % of failures actually detected as anomalies
  - Precision = 1 - (false positive rate); ~1.0 in our experiments
- Identification (given a failure is detected):
  - Recall = how many actually-faulty requests are returned
  - Precision = what % of returned requests are faulty = 1 - (false positive rate)
  - Using HTML page checksums as ground truth
- Workload: PetStore 1.1 and 1.3 (significantly different versions), plus RUBiS

Slide 21: Fault Detection: Recall (All Fault Types)
(chart of detection recall by fault type and analysis)
- Minor faults were hardest to detect, especially for component-interaction analysis

Slide 22: FD Recall (Severe & Major Faults Only)
(chart)
- Major faults are those that affect >1% of requests
- For these faults, Pinpoint has significantly higher recall than other low-level detectors

Slide 23: Detecting Source Code Bugs
- Source code bugs were the hardest to detect
  - Path-shape and component-interaction analyses individually detected 7-12% of all faults, 37% of major faults
  - HTTP monitoring detected 10% of all faults
- We did better than HTTP logs, but that's no excuse
- Other faults: Pinpoint strictly better than HTTP and HTML detection
- Source code bugs: the detectors are complementary; together they detected 15%

Slide 24: Faulty Request Identification
(scatter plot of identification precision vs. recall, annotated with four regimes:)
- Failures injected but not detected
- Failures detected, faulty requests identified as such
- Failures not detected, but low false positives (good requests marked faulty)
- Failures detected, but high rate of misidentification of faulty requests (false positives)
- HTTP monitoring has perfect precision since it's a "ground truth indicator" of a server fault
- Path-shape analysis pulls more points out of the bottom-left corner

Slide 25: Faulty Request Identification (build slide: same plot and observations as Slide 24)

Slide 26: Adjusting Precision
- α = 1: recall = 68%, precision = 14%
- α = 4: recall = 34%, precision = 93%
- Low recall for faulty-request identification still detects 83% of fault experiments

Slide 27: Outline
- Overview of recent Pinpoint experiments, including observations on fault behaviors
- Comparison with other app-generic fault detectors
- Tests of Pinpoint limitations (current section)
- Status of deployment at real sites

Slide 28: Outline
- Overview of recent Pinpoint experiments, including observations on fault behaviors
- Comparison with other app-generic fault detectors
- Tests of Pinpoint limitations
- Status of deployment at real sites (current section)

Slide 29: Status of Real-World Deployment
- Deploying parts of Pinpoint at 2 large sites
- Site 1:
  - Instrumenting middleware to collect request paths for path-shape and component-interaction analysis
  - Feasibility study completed; instrumentation in progress...
- Site 2:
  - Applying peer-analysis techniques developed for SSM and DStore
  - Metrics (e.g., req/sec, memory usage, ...) are already being collected
  - Beginning analysis and testing...

Slide 30: Summary
- Fault injection experiments showed a range of behavior
  - Faults cascading to other requests; a range of severity
- Pinpoint performed better than existing low-level monitors
  - Detected ~90% of major component-level errors (exceptions, etc.)
  - Even in the worst-case experiments (source code bugs), Pinpoint provided a complementary improvement to existing low-level monitors
- Currently validating Pinpoint in two real-world services

Slide 31: Detail Slides

Slide 32: Limitations: Independent Requests
- Pinpoint assumes request-reply with independent requests
- Monitored an RMI-based J2EE system (ECPerf 1.1)
  - It is request-reply, but requests are not independent, nor is the unit of work (UoW) well defined
  - Assume: UoW = 1 RMI call
  - Most RMI calls resulted in short paths (1 component)
  - Injected faults do not change these short paths
  - When anomalies occurred, they were rarely in the faulty path...
- Solution? Redefine the UoW as multiple RMI calls
  - => paths capture more behavioral changes
  - => the redefined UoW is likely app-specific

Slide 33: Limitations: Well-Defined Peers
- Pinpoint assumes component peer groups are well defined
- But behavior can depend on context
  - Example: a naming server in a cluster
  - Front-end servers mostly send lookup requests
  - Back-end servers mostly respond to lookups
- Result: no component matches the "average" behavior
  - Both front-end and back-end naming servers look "anomalous"!
- Solution? Extend component IDs to include logical location...

Slide 34: Bonus Slides

Slide 35: Example: Application-Level Failure
(screenshot of a travel-site itinerary page)
- No itinerary is actually available on this page
- The ticket was bought in March for travel in April
- But the website (superficially) appears to be working
- Heartbeats, pings, and HTTP-GET tests are not likely to detect the problem

Slide 36: Application-Level Failures
- Application-level failures are common
  - >60% of sites have user-visible (incl. app-level) failures [BIG-SF]
- Detection is a major portion of recovery time
  - TellMe: detecting app-level failures is 75% of recovery time [CAK04]
  - 65% of user-visible failures were mitigable by earlier detection [OGP03]
- Existing monitoring techniques aren't good enough
  - Low-level monitors (pings, heartbeats, HTTP error monitoring): + app-generic/low maintenance; - miss high-level failures
  - High-level, app-specific tests: + can catch many app-level failures; - app-specific/hard to maintain; - test-coverage problem

Slide 37: Testbed and Faultload
- Instrumented JBoss/J2EE middleware
  - J2EE: state management, naming, etc. -> a good layer of indirection
  - JBoss: open source; millions of downloads; real deployments
  - Track EJBs, JSPs, HTTP, RMI, JDBC, JNDI
  - With synchronous reporting: 2-40 ms latency hit; 17% throughput decrease
- Testbed applications: Petstore 1.3, Petstore 1.1, RUBiS, ECPerf
- Test strategy: inject faults, measure detection rate
  - Declared and undeclared exceptions
  - Omitted calls: the app is not likely to handle these at all
  - Source code bugs (e.g., off-by-one errors)

Slide 38: PCFGs Model Normal Path Shapes
- Probabilistic context-free grammar (PCFG)
  - Represents the likely calls made by each component
  - Learn the probabilities of rules from observed paths
- Anomalous path shapes
  - Score a path by summing the deviations of P(observed calls) from average
  - Detected 90% of faults in our experiments
(diagram: sample paths A->B->C and A->{B,C}; learned PCFG: S->A p=1; A->B p=.5; A->BC p=.5; B->C p=.5; B->$ p=.5; C->$ p=1)

Slide 39: Use PCFG to Score Paths
- Measure the difference between an observed path and the average
  - Score(path) = ∑_i (1/n_i - P(r_i)), where r_i is the i-th rule used in the path and n_i is the number of known rules with the same left-hand side
  - Higher scores are anomalous
- Detected 90% of faults in our experiments
(same sample-paths and learned-PCFG diagram as the previous slide; a sketch of both slides follows)
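A sketch of the machinery on these two slides: learn rule probabilities from observed paths, then score new paths with the formula above. Treating each "rule" as a (caller, called-components) production, and reading 1/n_i as the uniform probability over the n_i known rules for that caller, is my interpretation of the slide, not something the talk spells out.

    from collections import Counter, defaultdict

    # A path is a list of productions (caller, children-tuple), e.g. the
    # slide's sample paths: S->A, A->(B,), B->(C,) and S->A, A->(B,C).
    def learn_pcfg(paths):
        counts = defaultdict(Counter)
        for path in paths:
            for caller, children in path:
                counts[caller][children] += 1
        return {caller: {kids: n / sum(c.values()) for kids, n in c.items()}
                for caller, c in counts.items()}

    def score(path, grammar):
        # Score(path) = sum_i (1/n_i - P(r_i)); n_i = number of known
        # rules for that caller. Higher scores are more anomalous.
        s = 0.0
        for caller, children in path:
            rules = grammar.get(caller, {})
            n = max(len(rules), 1)
            s += 1.0 / n - rules.get(children, 0.0)  # unseen rule: P = 0
        return s

    training = [
        [("S", ("A",)), ("A", ("B",)), ("B", ("C",))],
        [("S", ("A",)), ("A", ("B", "C"))],
    ]
    g = learn_pcfg(training)
    print(score(training[0], g))                      # 0.0: familiar shape
    print(score([("S", ("A",)), ("A", ("X",))], g))   # 0.5: novel call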

Slide 40: Separating Good from Bad Paths
- Use a dynamic threshold to detect anomalies
  - Alarm when unexpectedly many paths fall above the Nth percentile (see the sketch below)
(histogram: score distribution under normal operation vs. distribution with faults)
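A sketch of that dynamic threshold, assuming the detector alarms when the fraction of new paths scoring above the training set's Nth percentile is several times the (100-N)% expected under normal operation. The slack factor of 3 is an invented placeholder; the talk does not give the exact rule.

    import numpy as np

    def detect(train_scores, new_scores, n=99, slack=3.0):
        """Alarm when unexpectedly many paths score above the n-th
        percentile of normal-operation scores."""
        cut = np.percentile(train_scores, n)
        frac = float(np.mean(np.asarray(new_scores) > cut))
        expected = (100 - n) / 100.0    # fraction we'd see if all is well
        return frac > slack * expected

    rng = np.random.default_rng(0)
    normal = rng.normal(0.0, 1.0, 10_000)            # baseline path scores
    quiet = rng.normal(0.0, 1.0, 1_000)              # healthy traffic
    faulty = np.concatenate([rng.normal(0.0, 1.0, 900),
                             rng.normal(4.0, 1.0, 100)])  # 10% anomalous
    print(detect(normal, quiet))    # False
    print(detect(normal, faulty))   # True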

Slide 41: Anomalies in Component Interaction
- Weighted links model component interaction
(diagram: a component with interaction links weighted w0=.4, w1=.3, w2=.2, w3=.1)

Slide 42: Scoring CI Models
- Score with a χ² (chi-square) test of goodness of fit
  - Probability that the same process generated both patterns
  - Makes no assumptions about the shape of the distribution
(diagram: normal pattern w0=.4, w1=.3, w2=.2, w3=.1 vs. observed counts n0=30, n1=10, n2=40, n3=20 -> anomaly!)
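The comparison on this slide maps directly onto scipy.stats.chisquare; the numbers below are the ones from the diagram.

    from scipy.stats import chisquare

    weights = [0.4, 0.3, 0.2, 0.1]     # normal link weights (the model)
    observed = [30, 10, 40, 20]        # counts for the component under test

    total = sum(observed)
    expected = [w * total for w in weights]   # [40.0, 30.0, 20.0, 10.0]

    stat, p = chisquare(f_obs=observed, f_exp=expected)
    print(f"chi2 = {stat:.1f}, p = {p:.1e}")  # chi2 = 45.8, p tiny
    # A tiny p-value means it's very unlikely the same process generated
    # both patterns, so this component's interactions are flagged anomalous.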

Slide 43: Two Kinds of False Positives
- Algorithmic false positives
  - No anomaly exists, but the statistical technique made a mistake
- Semantic false positives
  - Correctly found an anomaly, but the anomaly is not a failure

Slide 44: Resilient Against Semantic FPs
- Tested against normal changes:
  1. Varied workload from "browse & purchase" to "only browse"
  2. Minor upgrade from Petstore 1.3.1 to 1.3.2
  - Path-shape analysis found NO differences
  - Component-interaction changes stayed below the threshold
- For predictable, major changes:
  - Consider lowering Pinpoint's sensitivity until retraining completes
  - -> A window of vulnerability, but better than false positives
- Q: What is the rate of normal changes? How quickly can we retrain?
  - Minor changes every day, but only to parts of the site
  - Training speed -> how quickly is the service exercised?

Slide 45: Related Work
- Detection and localization:
  - Richardson: performance failure detection
  - Infospect: searches for logical inconsistencies in observed configuration
  - Event/alarm correlation systems: use dependency models to quiesce/collapse correlated alarms
- Request tracing:
  - Magpie: tracing for performance modeling/characterization
  - Mogul: discovering majority behavior in black-box distributed systems
- Compilers & PL:
  - DIDUCE: hypothesize invariants, report when they're broken
  - Bug Isolation Project: correlate crashes with state, across real runs
  - Engler: analyze static code for patterns and anomalies -> bugs

Slide 46: Conclusions
- Monitoring path shapes and component interactions...
  - ...is easy to instrument and app-generic
  - ...captures behaviors that are likely to change when the application fails
- Model the normal pattern of behavior, look for anomalies
  - Key assumption: most of the system is working most of the time
- Anomaly detection detects high-level failures, and is deployable
  - Resilient to (at least some) normal changes to the system
- Current status:
  - Deploying in a real, large Internet service
  - Anomaly-detection techniques for "structure-less" systems

Slide 47: More Information
http://www.stanford.edu/~emrek/
- Detecting Application-Level Failures in Component-Based Internet Services. Emre Kiciman, Armando Fox. In submission.
- Session State: Beyond Soft State. Benjamin Ling, Emre Kiciman, Armando Fox. NSDI '04.
- Path-Based Failure and Evolution Management. Chen, Accardi, Kiciman, Lloyd, Patterson, Fox, Brewer. NSDI '04.

Slide 48: Localize Failures with a Decision Tree
- Search for features that occur with bad items, but not good ones
- Decision trees
  - A classification function
  - Each branch in the tree tests a feature
  - Leaves of the tree give the classification
- Learn a decision tree to classify good/bad examples
  - But we won't use it for classification
  - Just look at the learned classifier and extract its questions as features (sketch below)
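A sketch of the idea with scikit-learn; the request features, component names, and labels are fabricated for illustration. Note the tree is mined for its split features, not used to classify.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Toy data: one row per request, one binary column per component
    # ("did this request touch that component?"). Hypothetical names.
    components = ["Catalog", "Cart", "Checkout", "Inventory"]
    X = np.array([
        [1, 0, 0, 1],   # failed
        [0, 1, 0, 1],   # failed
        [1, 1, 0, 1],   # failed
        [1, 0, 0, 0],   # ok
        [0, 1, 0, 0],   # ok
        [1, 1, 1, 0],   # ok
    ])
    y = np.array([1, 1, 1, 0, 0, 0])    # 1 = request failed

    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # Extract the features tested at internal nodes: these are the
    # components most correlated with failure, i.e. the suspects.
    t = clf.tree_
    suspects = [components[t.feature[i]] for i in range(t.node_count)
                if t.children_left[i] != -1]
    print(suspects)     # ['Inventory']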

Slide 49: Illustrative Decision Tree
(figure: an example learned decision tree over request features)

Slide 50: Results: Comparing Localization Rate
(chart comparing localization rates across techniques)

Slide 51: Monitoring "Structure-less" Systems
- N replicated storage bricks handle read/write requests
  - No complicated interactions or requests -> cannot do structural anomaly detection!
- Alternative features (performance, memory usage, etc.)
- Activity statistics: how often did a brick do something?
  - Messages received/sec, dropped/sec, etc.
  - Same across all peers, assuming a balanced workload
  - Treat anomalies as likely failures (see the sketch after this slide)
- State statistics: what is the current state of the system?
  - Memory usage, queue length, etc.
  - Similar pattern across peers, but may not be in phase
  - Look for patterns in the time series; differences in patterns indicate a failure at a node
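A minimal sketch of the peer comparison for activity statistics: flag any brick whose requests/sec sits far from the peer median, measured in median absolute deviations. The statistic and the cutoff k are illustrative choices, not the talk's.

    import numpy as np

    def anomalous_peers(stats: dict, k: float = 5.0):
        """Flag bricks whose activity statistic deviates from the peer
        median by more than k median-absolute-deviations. Assumes a
        balanced workload, so healthy peers should look alike."""
        names = list(stats)
        vals = np.array([stats[n] for n in names], dtype=float)
        med = np.median(vals)
        mad = np.median(np.abs(vals - med)) or 1e-9   # avoid divide-by-zero
        return [n for n, v in zip(names, vals) if abs(v - med) / mad > k]

    reqs_per_sec = {"brick1": 1010, "brick2": 995, "brick3": 1003,
                    "brick4": 310, "brick5": 1001}    # brick4 is limping
    print(anomalous_peers(reqs_per_sec))   # ['brick4']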

Slide 52: Surprising Patterns in Time-Series
1. Discretize the time series into a string [Keogh]:
   [0.2, 0.3, 0.4, 0.6, 0.8, 0.2] -> "aaabba"
2. Calculate the frequencies of short substrings in the string:
   "aa" occurs twice; "ab", "bb", "ba" each occur once.
3. Compare the frequencies to normal; look for substrings that occur much less or much more than normal.
(a sketch of the three steps follows)
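A small Python sketch of the three steps. The fixed 0.5 breakpoint reproduces the slide's "aaabba" example and stands in for Keogh-style SAX discretization, which normalizes the series and picks breakpoints from the Gaussian; the deviation score is likewise a simple stand-in.

    from collections import Counter

    def discretize(series, breakpoints=(0.5,), alphabet="ab"):
        """Step 1: map values to symbols (stand-in for SAX [Keogh])."""
        def sym(x):
            for i, b in enumerate(breakpoints):
                if x < b:
                    return alphabet[i]
            return alphabet[len(breakpoints)]
        return "".join(sym(x) for x in series)

    def ngram_freqs(s, n=2):
        """Step 2: relative frequencies of length-n substrings."""
        grams = [s[i:i + n] for i in range(len(s) - n + 1)]
        return {g: c / len(grams) for g, c in Counter(grams).items()}

    def surprise(normal, current):
        """Step 3: how far substring frequencies drift from normal."""
        keys = set(normal) | set(current)
        return sum(abs(normal.get(k, 0) - current.get(k, 0)) for k in keys)

    print(discretize([0.2, 0.3, 0.4, 0.6, 0.8, 0.2]))   # 'aaabba', as above

    baseline = ngram_freqs(discretize([0.2, 0.3, 0.4, 0.6, 0.8, 0.2] * 20))
    stuck = ngram_freqs(discretize([0.9, 0.85, 0.9, 0.8, 0.9, 0.95] * 4))
    print(surprise(baseline, baseline))   # 0.0
    print(surprise(baseline, stuck))      # large: "bb" dominates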

Slide 53: Inject Failures into the Storage System
- Injected a performance failure every 60 s in one brick
  - Slowed all requests by 1 ms
- Pinpoint detects the failures within 1-2 periods
- Does not detect anomalies during normal behavior (including workload changes and GC)
- Current issue: too many magic numbers
  - Working on improving these techniques to remove or automatically choose the magic numbers

Slide 54: Responding to Anomalies
- Want a policy for responding to anomalies
- Cross-check for failure:
  1. If no cause is correlated with the anomaly -> not a failure
  2. Check user behavior for excessive reloads
  3. Persistent anomaly? Check for recent state changes
- Recovery actions:
  1. Reboot the component or app
  2. Roll back the failed request, try again
  3. Roll back the software to the last known good state
  4. Notify an administrator

