A Path-based Approach to Managing Failures and Evolution Mike Chen, Anthony Accardi 1, Emre Kıcıman, Jim Lloyd 2, Dave Patterson, Armando Fox, Eric Brewer.

A Path-based Approach to Managing Failures and Evolution Mike Chen, Anthony Accardi (1), Emre Kıcıman, Jim Lloyd (2), Dave Patterson, Armando Fox, Eric Brewer. UC Berkeley, Tellme (1), Stanford Univ., eBay (2)

NSDI 2004Slide 2 Need for Fast Recovery  Failures are common and costly –Daily partial site outages for large sites. –Downtime: $300K - $6million/hr.  Challenges: –Lots of potential sources of faults. –Multiple independent faults. –Distributed runtime behavior (e.g. load balancing)  Observation: very short outages are “free” –Cost of downtime is not linear.

NSDI 2004Slide 3 Need for Rapid Evolution  Competition drives demand for new features and bug fixes –Switching cost is low. –Single administrative domain lowers upgrade barrier.  Challenges: –Short release cycles: weekly and bi-weekly for new features at eBay and Tellme, shorter for bug fixes. –Distributed runtime behavior  Observation: trend towards application server frameworks –E.g. J2EE, .NET, etc.

NSDI 2004Slide 4 Current Approaches to Understand Systems  2 extremes of granularity  Problems: –Dispersed execution context –Local context often insufficient –“Blackbox” components [Figure: granularity axis from the external (end-to-end) view of eBay to the “micro” view, e.g. code-level debuggers (X = 3, Y = true)]

NSDI 2004Slide 5 “Macro” Approach  Captures the relationship between components and their aggregate behavior –Complements both end-to-end tools and “micro” analysis tools. [Figure: three views of eBay: external (end to end); “macro” view of web servers (WS), app servers (App), and DB; “micro” view, e.g. code-level debuggers (X = 3, Y = true)]

NSDI 2004Slide 6 First Step: Path-based Analysis  Paths record runtime properties of requests –components used (name, version, etc) –timestamps  Two principles 1. Use paths as the core abstraction 2. Apply statistical analysis to a large number of paths  Focus on correctness –In addition to performance (MSR’s Magpie, HP’s WebMon and Project 5) [Figure: a request traversing Web A/B, App A/B/C, and DB A/B. Example path: 1. Web A, t = 1; 2. App A, t = 23; 3. App B, t = 30; 4. DB B, t = 56; …]
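A minimal sketch of how a path could be represented as data, using the example path from this slide. The class and field names (`Observation`, `Path`, `request_id`) are illustrative assumptions, not the data model of Pinpoint, ObsLogs, or SuperCAL.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    """One tracer record: which component saw the request, and when."""
    component: str
    timestamp: int          # same abstract time units as the slide (t = 1, 23, ...)
    version: str = ""       # optional component version


@dataclass
class Path:
    """All observations that belong to a single request, in time order."""
    request_id: str         # hypothetical identifier used to group observations
    observations: List[Observation] = field(default_factory=list)

    def components(self) -> List[str]:
        """Names of the components the request touched, in order."""
        return [o.component for o in self.observations]

    def interval(self, i: int, j: int) -> int:
        """Elapsed time between the i-th and j-th observations."""
        return self.observations[j].timestamp - self.observations[i].timestamp


# The example path from the slide.
example = Path("request-1", [
    Observation("Web A", 1),
    Observation("App A", 23),
    Observation("App B", 30),
    Observation("DB B", 56),
])
print(example.components())
print(example.interval(0, 3))
```

Keeping the raw observations, rather than only a summary, is what lets later analyses pick any pair of observations as a sub-path.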

NSDI 2004Slide 7 Architecture  Observation includes: –Component/resource names, version, … –Timestamps  Application-generic tracing –By instrumenting the application servers (e.g. < 1K lines for JBoss, a J2EE app server) –Request-centric: associate system events to user-visible events –Performance overhead: 1-3% for eBay [Figure: architecture diagram. Labels: request; Web, App, and DB Tracers; observation; Aggregator; Path; Storage; Query interface; Analysis Engines (Detection, Diagnosis, Viz); Ops/QA/Dev]

NSDI 2004Slide 8 3 Path-based Frameworks
Paths Framework | Site | Description | Physical Tiers | # of Machines | # of Requests | Apps Hosted
Pinpoint | - | Research prototype based on J2EE | 2-3 | - | - | Java
ObsLogs | Tellme | Enterprise voice application network | (5) | Hundreds | Millions per day | VoiceXML
SuperCAL | eBay | Online auction | 2-3 | 2000+ | Billions per day | C++, Java
 eBay Stats –1TB raw logs/day (150GB gzipped), 200Mbps peak –2K app servers, 40 SuperCAL machines

NSDI 2004Slide 9 Talk Outline  Motivation and Approach  Failure Management –Failure detection via path anomalies –Failure diagnosis using machine learning methods  Evolution Management –Application-generic dependency tracking –Detecting and diagnosing changes  Conclusions

NSDI 2004Slide 10 Failure Management  Goal: minimize impact of failures –User-visible failures => $$$ lost  78% of recovery time is spent on detection and diagnosis [Figure: failure timeline with stages Detection, Diagnosis, Recovery, Repair, Impact Analysis, and Feedback; a 78% label marks detection and diagnosis]

NSDI 2004Slide 11 Fast Recovery Challenges  Many potential causes of failures –SW bugs, hardware, configuration, network, DB, … –Multiple independent failures  Lots of data –Many small, but tolerable failures –Real-time detection/diagnosis  Root cause might not be captured in logs –Tradeoff between logging granularity and overhead  Observation: exact root cause may not be required for many recovery techniques

NSDI 2004Slide 12 Failure Detection Concepts  Path collisions –Incomplete paths interrupted by other requests.  Structural anomalies –Learn a set of “good” paths, and flag unseen paths. –Extended to use probabilistic models. [Figure: requests flowing through Web A/B, App A/B/C, and DB A/B]

NSDI 2004Slide 13 Structural Anomalies in Path Shapes  Probabilistic Context Free Grammar (PCFG) –Represents likely calls made by each component –Learn probabilities of rules based on observed paths  Anomalous path shapes –Score a path by calculating the deviations of P(observed calls) from average.  Detected 90% of injected faults in our experiments [Figure: two sample paths over components A, B, C and the learned PCFG: S → A (p=1); A → B (p=.5); A → B C (p=.5); B → C (p=.5); B → $ (p=.5); C → $ (p=1)]
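The grammar on this slide can be reproduced with a small sketch. It assumes a path is a call tree written as nested `(component, [children])` tuples, and it scores a path by its average negative log-probability per rule. That scoring rule and the `floor` probability for unseen rules are simplifying assumptions, not necessarily the deviation-from-average score used in the talk.

```python
import math
from collections import Counter, defaultdict


def expand(node):
    """Yield one (caller, callees) rule per component in the call tree."""
    name, children = node
    yield (name, tuple(child[0] for child in children) or ("$",))
    for child in children:
        yield from expand(child)


def path_rules(root):
    """All grammar rules used by one path, including the start rule S -> root."""
    return [("S", (root[0],))] + list(expand(root))


def learn_pcfg(paths):
    """Estimate rule probabilities as relative frequencies over observed paths."""
    counts = defaultdict(Counter)
    for path in paths:
        for lhs, rhs in path_rules(path):
            counts[lhs][rhs] += 1
    return {lhs: {rhs: n / sum(c.values()) for rhs, n in c.items()}
            for lhs, c in counts.items()}


def anomaly_score(grammar, path, floor=1e-6):
    """Average negative log-probability per rule; unseen rules get `floor`."""
    rules = path_rules(path)
    return sum(-math.log(grammar.get(lhs, {}).get(rhs, floor))
               for lhs, rhs in rules) / len(rules)


# Two sample paths: A calls B which calls C, and A calls both B and C.
chain = ("A", [("B", [("C", [])])])
fanout = ("A", [("B", []), ("C", [])])
grammar = learn_pcfg([chain, fanout])

unseen = ("A", [("C", [])])   # A calling only C never occurred in training
print(anomaly_score(grammar, chain), anomaly_score(grammar, unseen))
```

A path that uses a rule never seen in training scores much higher than either training path, which is the property a threshold can then act on.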

NSDI 2004Slide 14 Failure Diagnosis Concepts  Idea: all bad paths touch the root cause –Look for path properties common to failed requests, e.g. components used in all failed paths –Extended to use probabilistic models.  Limitation: –Inter-path dependency [Figure: requests flowing through Web A/B, App A/B/C, and DB A/B]
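The basic idea, before any probabilistic extension, can be sketched as a set intersection over failed paths plus a per-component failure rate. The input format (a list of `(components, succeeded)` pairs) and the sample data are hypothetical.

```python
from collections import Counter


def common_to_failures(paths):
    """Components that appear in every failed path."""
    failed = [set(components) for components, ok in paths if not ok]
    return set.intersection(*failed) if failed else set()


def failure_rates(paths):
    """Fraction of the paths through each component that failed."""
    seen, bad = Counter(), Counter()
    for components, ok in paths:
        for component in set(components):
            seen[component] += 1
            if not ok:
                bad[component] += 1
    return {component: bad[component] / seen[component] for component in seen}


# Hypothetical paths: (components touched, request succeeded?)
paths = [
    (["Web A", "App A", "DB A"], True),
    (["Web A", "App B", "DB B"], False),
    (["Web B", "App B", "DB A"], False),
    (["Web B", "App C", "DB B"], True),
]
print(common_to_failures(paths))
print(failure_rates(paths)["App B"])
```

The failure rate is a softer signal than the strict intersection, which helps when a faulty component fails only some of the requests that touch it.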

NSDI 2004Slide 15 Failure Diagnosis  Summarize each path into features:
Path | Type | Name | Pool | Host | Version | DB | Status
1 | URL | ViewFeedback | Cgi0 | 134 | 1.2.1 | FeedbackDB, UserDB, … | NullPointer
2 | URL | Bid | Cgi2 | 231 | 1.0.3 | PriceDB | Success
3 | XML | … | … | … | … | … | …
 What features of requests correlate with failures (e.g. NullPointerException)? –Request type, name, pool, host, version, DB, or a combination of these? –Different causes require different recovery techniques

NSDI 2004Slide 16 Borrow Statistical Learning Techniques  Cast as feature selection problem in machine learning  Use decision trees because results are easily interpretable 1. Learn the tree from data (with failed paths) 2. The edges that lead to failed nodes are the candidates
Features (Type, Name, Machine) and class label (Status):
Type | Name | Machine | Status
URL | MyFeedback | X | NullPointer
URL | Login | X | Success
XML | ViewFeedback | Y | Success
URL | Respond | Y | Timeout
… | … | … | …
[Figure: decision tree that splits on Machine (X, Y) and then on Request Name (MyFeedback, Login, ViewFeedback, Respond), with leaves labeled NullPointer, Success, Success, Timeout]
Diagnosis: 1) Machine X and MyFeedback 2) Machine Y and Respond
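The two steps on this slide can be sketched with a tiny ID3-style learner that splits on the feature with the highest information gain and reports the conditions leading to failed leaves. This is a plain textbook variant used for illustration; it is not the adapted C4.5 or the MinEntropy algorithm evaluated in the talk, and the four rows are only the visible rows of the slide's table.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_feature(rows, features, label):
    """Feature whose split gives the largest information gain."""
    base = entropy([r[label] for r in rows])

    def gain(feature):
        remainder = 0.0
        for value in {r[feature] for r in rows}:
            subset = [r[label] for r in rows if r[feature] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder

    return max(features, key=gain)


def failed_leaves(rows, features, label="Status", ok="Success", conds=()):
    """Grow the tree recursively; return (conditions, status) for failed leaves."""
    if len({r[label] for r in rows}) == 1 or not features:
        majority = Counter(r[label] for r in rows).most_common(1)[0][0]
        return [] if majority == ok else [(conds, majority)]
    feature = best_feature(rows, features, label)
    rest = [f for f in features if f != feature]
    leaves = []
    for value in sorted({r[feature] for r in rows}):
        subset = [r for r in rows if r[feature] == value]
        leaves += failed_leaves(subset, rest, label, ok, conds + ((feature, value),))
    return leaves


rows = [
    {"Type": "URL", "Name": "MyFeedback", "Machine": "X", "Status": "NullPointer"},
    {"Type": "URL", "Name": "Login", "Machine": "X", "Status": "Success"},
    {"Type": "XML", "Name": "ViewFeedback", "Machine": "Y", "Status": "Success"},
    {"Type": "URL", "Name": "Respond", "Machine": "Y", "Status": "Timeout"},
]
for conditions, status in failed_leaves(rows, ["Type", "Name", "Machine"]):
    print(conditions, "=>", status)
```

With only four rows, the request name alone separates every outcome, so this toy tree never needs to split on the machine. A real trace with many requests per name and machine would produce combined conditions like those in the slide's diagnosis.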

NSDI 2004Slide 17 Diagnosis Results of Decision Trees  Recall vs precision tradeoff –Recall: % of true faults identified –Precision: 1 – false positive rate  Decision trees –C4.5 w/ adaptation: a standard decision tree algorithm –MinEntropy: a greedy variant that finds one leaf with the most failures. Actual results from eBay deployment. –Association rules: data mining algorithm that computes the conditional probabilities for all combinations of features [Figure: results plot with a point labeled “perfect”]

NSDI 2004Slide 18 Talk Outline  Motivation and Approach  Failure Management  Evolution Management –Application-generic dependency tracking –Detecting and diagnosing expected and unexpected changes  Conclusions

NSDI 2004Slide 19 Tracking Dependency  Current approaches –Manual approaches are error-prone and slow –Static analysis captures possible system behavior, whereas runtime analysis captures the actual behavior  Paths directly capture application structure –Application-generic tracking of actual dependency (zero changes to applications) [Figure: Rubis, a J2EE auction application, hosted on Pinpoint/JBoss]

NSDI 2004Slide 20 Automatically Derived State Dependency  Paths associate requests with internal state –Coupling of requests through shared state –Easily extended to track fine-grained (e.g. row-level) state sharing [Table: requests (rows) vs. database tables (columns: Product, Signon, Account, Banner, Inventory) for PetStore, a J2EE e-commerce application, hosted on Pinpoint/JBoss; R = read, W = write. Row entries, with column positions lost: VerifySignin R, R, R; Cart R, R, R/W; CommitOrder R, W; Category R; Search R, R; ProductDetails R, R/W; NewAccount R, R; Checkout W]
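A request-by-table matrix like this one can be derived mechanically once each path records which tables it read or wrote. The sketch below assumes observations arrive as `(request type, table, operation)` triples; the request and table names are invented for illustration and are not PetStore's.

```python
from collections import defaultdict


def state_dependencies(observations):
    """Build {request: {table: "R" | "W" | "R/W"}} from (request, table, op) triples."""
    matrix = defaultdict(lambda: defaultdict(set))
    for request, table, op in observations:
        matrix[request][table].add(op)
    return {request: {table: "/".join(sorted(ops)) for table, ops in tables.items()}
            for request, tables in matrix.items()}


def coupled_requests(deps):
    """Pairs of request types that share a table which at least one of them writes."""
    pairs = []
    for a in sorted(deps):
        for b in sorted(deps):
            if a < b:
                shared = deps[a].keys() & deps[b].keys()
                if any("W" in deps[a][t] or "W" in deps[b][t] for t in shared):
                    pairs.append((a, b))
    return pairs


# Hypothetical observations extracted from paths.
observations = [
    ("Browse", "items", "R"),
    ("Buy", "items", "R"),
    ("Buy", "items", "W"),
    ("Buy", "orders", "W"),
    ("History", "orders", "R"),
    ("Login", "users", "R"),
]
deps = state_dependencies(observations)
print(deps["Buy"])
print(coupled_requests(deps))
```

Two requests that only read the same table are not reported as coupled, since neither can affect what the other sees.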

NSDI 2004Slide 21 Detecting/Diagnosing Changes  Paths provide a flexible mechanism to profile any sub-path –Take the interval between any two observations –Drill down to identify problematic sub-paths  Statistical analysis simultaneously examines thousands of sub-paths –Use non-parametric tests (e.g. Mann-Whitney) –Thousands of sub-paths tested for every Tellme release [Figure: a path drawn as a sequence of observations; a sub-path is the interval between any two of them]
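As an illustration of the non-parametric test named above, here is a self-contained Mann-Whitney U test applied to per-sub-path latency samples from two releases. It uses the normal approximation without a tie correction, which is a simplification suited to a sketch. The sub-path names, the latencies, and the `alpha` cutoff are hypothetical.

```python
import math


def mann_whitney(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns (U statistic for x, approximate p-value).
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):          # tied values share the average rank
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))


def changed_subpaths(old, new, alpha=0.01):
    """Names of sub-paths whose latency distribution differs between releases."""
    return sorted(name for name in old
                  if mann_whitney(old[name], new[name])[1] < alpha)


# Hypothetical latency samples (ms) per sub-path for two releases.
old = {"lookup": [10, 11, 12, 13, 14, 15, 16, 17],
       "render": [5, 6, 7, 8, 9, 10, 11, 12]}
new = {"lookup": [20, 21, 22, 23, 24, 25, 26, 27],
       "render": [6, 5, 8, 7, 10, 9, 12, 11]}
print(changed_subpaths(old, new))
```

When thousands of sub-paths are tested at once, some will fall below any fixed `alpha` by chance, so a production version would also need a correction for multiple comparisons.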

NSDI 2004Slide 22 Detecting/Diagnosing App-level Changes  Paths enable simultaneous testing of many sub-paths –drill down to diagnose specific slow sub-paths [Figure: box plots (lower quartile, median, upper quartile, outliers) for 2 versions of a Tellme application; change detected in 1 sub-path in 1 application]

NSDI 2004Slide 23 Detecting/Diagnosing App-level Changes  Paths enable simultaneous testing of many sub-paths –drilling down to diagnose the specific slow sub-paths [Figure: box plots (lower quartile, median, upper quartile, outliers) for 2 versions of 2 Tellme applications and 3 sub-paths; no changes]

NSDI 2004Slide 24 Detecting/Diagnosing App-level Changes  Paths enable simultaneous testing of many sub-paths –drilling down to diagnose the specific slow sub-paths [Figure: 3 versions of 2 Tellme applications and 3 sub-paths; app fixed]

NSDI 2004Slide 25 Detecting/Diagnosing Platform Changes  Look for consistent deviation across applications [Figure: 2 versions of a Tellme platform; change detected in 1 sub-path in 1 application]

NSDI 2004Slide 26 Detecting/Diagnosing Platform Changes  Look for consistent deviation across applications [Figure: 2 versions of a Tellme platform; consistent changes across all apps]

NSDI 2004Slide 27 Detecting/Diagnosing Platform Changes  Look for consistent deviation across applications [Figure: 3 versions of a Tellme platform; platform fixed]

NSDI 2004Slide 28 Lessons Learned  Separate the path analysis logic from observation instrumentation –Improves maintainability and extensibility  Data is cheap –Allows the use of simple statistical algorithms  Live workload –Important to support online use of tools  Record “attempts” –Failed components/resources may not record observations properly

NSDI 2004Slide 29 Summary  Paths + statistical analysis: –Improves failure detection and diagnosis to support fast recovery. –Automates dependency tracking and change analysis to support rapid and correct evolution.  Deployed and evaluated on real systems –Pinpoint, Tellme, and eBay  Future work: –Wide-area systems and systems that span multiple administrative domains

NSDI 2004Slide 30 Thank You  Acknowledgements –Berkeley/Stanford ROC Research Group –Professor Michael Jordan and Alice Zheng –Shepherd Miguel Castro and anonymous reviewers  For more info: –Google, Yahoo, or MSN Search for Mike Chen

NSDI 2004Slide 31 Backup Slides

NSDI 2004Slide 32 Recovery Time Saving  Expected Time Saved = E(Manual Diag. + Recovery) – E(Automated & Manual Diag. + Recovery) –Use diagnosis time based on experience: diagnosis time: Automated = 1 min, Manual (perfect) = 15 min; recovery time (w/ verification) = 5 min [Figure: plot with axes Time Saved (min) and Noise Filtering Threshold]  $50K to $1 million saved

NSDI 2004Slide 33 Show eBay’s Complex System Diagram  Show a few path examples

NSDI 2004Slide 34 Failure Management Process  Detection  Isolation  Diagnosis  Impact Analysis  Repair  Feedback

NSDI 2004Slide 35 MinEntropy  Entropy measures the randomness of data –E.g. if failure is evenly distributed (very random), then entropy is high  Rank features by the normalized entropy –E.g. if root cause is a machine failure, then entropy will be low in the host dimension. Since all types of requests will fail on that host, the entropy in the request type dimension will be higher.  Implemented at eBay –Greedy approach searches for the leaf node with most failures –Pros: fast (<1s for 100K txns and scales linearly) –Cons: optimized for single faults; features may not be independent (i.e. pool and host)
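The ranking step can be sketched as follows, using the error counts that are visible in the MinEntropy example slide (rows elided there with "…" are simply left out). Normalizing by the log of the number of values per feature is an assumption; the slides do not say which normalization eBay used.

```python
import math


def normalized_entropy(error_counts):
    """Entropy of the error distribution over a feature's values, scaled to [0, 1]."""
    total = sum(error_counts.values())
    k = len(error_counts)
    if total == 0 or k < 2:
        return 0.0
    h = -sum((n / total) * math.log2(n / total)
             for n in error_counts.values() if n > 0)
    return h / math.log2(k)


def rank_features(tables):
    """Features ordered from most concentrated failures to most evenly spread."""
    return sorted(tables, key=lambda feature: normalized_entropy(tables[feature]))


# Error counts per feature value, taken from the visible rows of the example slide.
tables = {
    "TxType": {"URL": 4350, "SQL": 47, "EMAIL": 12, "XSLT": 0},
    "Pool": {"Cgi0": 12, "Cgi1": 4002, "Cgi2": 30, "Cgi3": 8, "Cgi4": 5},
    "Machine": {"Attila": 1985, "Lenin": 2002, "Marcus": 4, "Scipio": 0},
    "URL": {"MyEBay": 636, "MyEBaySeller": 512, "MyEBayLogin": 736},
    "Label": {"E293": 3987, "E291": 15},
}
for feature in rank_features(tables):
    worst = max(tables[feature], key=tables[feature].get)
    print(feature, round(normalized_entropy(tables[feature]), 3), worst)
```

On these counts the build label, pool, and transaction type come out with low entropy while the URL dimension is close to uniform, which is consistent with the alert text on that slide.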

NSDI 2004Slide 36 MinEntropy example
TxType | Errors: URL 4350; SQL 47; EMAIL 12; XSLT 0; …
Pool | Errors: Cgi0 12; Cgi1 4002; Cgi2 30; Cgi3 8; Cgi4 5; …
Machine | Errors: Attila 1985; Lenin 2002; Marcus 4; Scipio 0; …
URL | Errors: MyEBay 636; MyEBaySeller 512; MyEBayLogin 736; …
Label | Errors: E293 3987; E291 15
Alert: Build E293 causing URL error storm (not specific to any URL) in pool CGI1

NSDI 2004Slide 37 Association Rules  Data mining technique to compute item sets –e.g. Shoppers who bought this item also shopped for …  Metrics –Confidence: (# of A & B) / # of A, i.e. the conditional probability of B given A –Support: (# of A & B) / total # of txns  Generates rules for all possible sets –e.g. machine=abc, txn=login => status=NullPointerException (conf: 0.1, support: 0.02)  Applied to failure diagnosis –Find all rules that have failed status on the right –Pros: looks at combinations of features –Cons: generates many rules
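The two metrics can be computed directly from their definitions. In this sketch each transaction is a dict of feature values, and the 50 transactions are synthetic, constructed only so that the example rule reproduces the confidence and support quoted on the slide.

```python
def rule_metrics(txns, antecedent, consequent):
    """Return (confidence, support) for the rule antecedent => consequent."""
    n_a = sum(1 for t in txns if antecedent.items() <= t.items())
    n_ab = sum(1 for t in txns
               if antecedent.items() <= t.items() and consequent.items() <= t.items())
    confidence = n_ab / n_a if n_a else 0.0
    support = n_ab / len(txns) if txns else 0.0
    return confidence, support


# Synthetic transactions: 10 logins on machine abc (1 failed), 40 on machine xyz.
txns = (
    [{"machine": "abc", "txn": "login", "status": "NullPointerException"}]
    + [{"machine": "abc", "txn": "login", "status": "Success"}] * 9
    + [{"machine": "xyz", "txn": "login", "status": "Success"}] * 40
)
confidence, support = rule_metrics(
    txns,
    antecedent={"machine": "abc", "txn": "login"},
    consequent={"status": "NullPointerException"},
)
print(confidence, support)
```

Confidence answers "how often does this combination fail", while support answers "how much of the traffic does this rule cover", so a rule can score high on one and low on the other.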

NSDI 2004Slide 38 Adapting Association Rules  Sample output (rules containing failures):
TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
 Rank by the size of item sets if support and conf are equal –TxnName = LeaveFeedback
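The tie-breaking rule can be sketched as a sort key: highest confidence first, then the smallest item set. The tuple layout is an assumption, and support is left out because the sample output does not show it.

```python
def rank_rules(rules):
    """Sort rules by descending confidence, then by fewest antecedent items."""
    return sorted(rules, key=lambda rule: (-rule[1], len(rule[0])))


# The four sample rules from the slide: (antecedent, confidence).
rules = [
    ({"TxnType": "URL", "Pool": "icgi2", "TxnName": "LeaveFeedback"}, 0.28),
    ({"Pool": "icgi2", "TxnName": "LeaveFeedback"}, 0.28),
    ({"TxnType": "URL", "TxnName": "LeaveFeedback"}, 0.28),
    ({"TxnName": "LeaveFeedback"}, 0.28),
]
best_antecedent, best_confidence = rank_rules(rules)[0]
print(best_antecedent)
```

Preferring the smallest item set picks the most general explanation: adding the pool or transaction type does not raise the confidence, so those features carry no extra information here.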

NSDI 2004Slide 39 Failure Management  Goal: minimize impact of failure –User-visible failures => $$$ lost  Fast failure detection and diagnosis are critical to availability –78% of recovery time is spent on detection and diagnosis [Figure: failure timeline with stages Detection, Diagnosis, Recovery, Repair, Impact Analysis, and Feedback]

NSDI 2004Slide 40 PCFG Thresholding ● Set a threshold for declaring anomalies – Static threshold: any request > 99th or 99.5th percentile – Dynamic threshold: when proportions don't match known good.
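A static threshold of this kind can be sketched with a nearest-rank percentile over anomaly scores from known-good traffic. The choice of the nearest-rank method and the synthetic scores are assumptions; the dynamic variant is not shown because the slide gives no detail on it.

```python
import math


def percentile_threshold(scores, pct=99.0):
    """Nearest-rank percentile of the given scores."""
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def flag_anomalies(scores, threshold):
    """Scores strictly above the threshold are declared anomalous."""
    return [s for s in scores if s > threshold]


# Synthetic anomaly scores for 200 requests.
scores = list(range(1, 201))
threshold_99 = percentile_threshold(scores, 99.0)
threshold_995 = percentile_threshold(scores, 99.5)
print(threshold_99, flag_anomalies(scores, threshold_99))
print(threshold_995, flag_anomalies(scores, threshold_995))
```

By construction a static percentile always flags a fixed share of requests, even when nothing is wrong, which is one motivation for the dynamic alternative.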

NSDI 2004Slide 41 Failure Diagnosis Experiments  Data set –10 one-minute traces, 4 with 2 independent faults (total of 14 independent faults) –About 1/8 of the whole site (640 potential single-faults)  Metrics –Recall: % of true faults identified = (# of identified faults) / (# of true faults) –Precision: 1 – false positive rate = (# of identified faults) / (# of predicted faults)
Type | Name | Pool | Machine | Version | Database | Status
10 | 300 | 15 | 260 | 7 | 40 | 8
Host | DB | Host, Host | Host, DB | Host, SW | DB, SW
2 | 4 | 1 | 1 | 1 | 1
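Both metrics follow directly from the ratios given on this slide. In the sketch below a fault is a hypothetical `(feature, value)` pair, and the example sets are invented.

```python
def recall_precision(true_faults, predicted_faults):
    """Recall = identified / true faults; precision = identified / predicted faults."""
    identified = true_faults & predicted_faults
    recall = len(identified) / len(true_faults) if true_faults else 0.0
    precision = len(identified) / len(predicted_faults) if predicted_faults else 0.0
    return recall, precision


# Hypothetical trace with two injected faults and three predictions.
true_faults = {("Machine", "X"), ("Database", "D1")}
predicted = {("Machine", "X"), ("Pool", "P3"), ("Name", "N7")}
recall, precision = recall_precision(true_faults, predicted)
print(recall, precision)
```

An algorithm that predicts many candidates can reach high recall at the cost of precision, which is the tradeoff the diagnosis results compare.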

NSDI 2004Slide 42 eBay’s Site  2 physical tiers –Web server/app server + DB –Apps in both Java (WebSphere) and C++  SuperCAL (Centralized Application Logging) –API for app developer to log anything to CAL –Platform logs common path features: cookie, host, URL, DB table(s), status, etc.  Stats –1TB raw logs/day (150GB gzipped), 200Mbps peak –2K app servers, 40 SuperCAL machines How to diagnose accurately and efficiently???
