Published by Felicity Carbine. Modified over 9 years ago.
1
A Path-based Approach to Managing Failures and Evolution
Mike Chen, Anthony Accardi (1), Emre Kıcıman, Jim Lloyd (2), Dave Patterson, Armando Fox, Eric Brewer
UC Berkeley, Tellme (1), Stanford Univ., eBay (2)
2
NSDI 2004, Slide 2
Need for Fast Recovery
Failures are common and costly
– Daily partial site outages for large sites.
– Downtime: $300K - $6 million/hr.
Challenges:
– Lots of potential sources of faults.
– Multiple independent faults.
– Distributed runtime behavior (e.g. load balancing).
Observation: very short outages are “free”
– Cost of downtime is not linear.
3
Need for Rapid Evolution
Competition drives demand for new features and bug fixes
– Switching cost is low.
– Single administrative domain lowers upgrade barrier.
Challenges:
– Short release cycles
  – Weekly and bi-weekly for new features at eBay and Tellme, shorter for bug fixes.
– Distributed runtime behavior
Observation: trend towards application server frameworks
– E.g. J2EE, .NET, etc.
4
Current Approaches to Understand Systems
2 extremes of granularity
Problems:
– Dispersed execution context
– Local context often insufficient
– “Blackbox” components
[Figure: a granularity axis running from the external (end-to-end) view of eBay to the “micro” view, e.g. code-level debuggers showing X = 3, Y = true]
5
“Macro” Approach
Captures the relationship between components and their aggregate behavior
– Complements both end-to-end tools and “micro” analysis tools.
[Figure: the “macro” view of eBay (web servers WS, application servers App, and a DB) placed between the external (end-to-end) view and the “micro” view, e.g. code-level debuggers showing X = 3, Y = true]
6
First Step: Path-based Analysis
Paths record runtime properties of requests
– components used (name, version, etc.)
– timestamps
Two principles
1. Use paths as the core abstraction
2. Apply statistical analysis to a large number of paths
Focus on correctness
– In addition to performance (MSR’s Magpie, HP’s WebMon and Project 5)
[Figure: a request traversing Web A/Web B, App A/App B/App C, and DB A/DB B, recorded as a path: 1. Web A, t = 1; 2. App A, t = 23; 3. App B, t = 30; 4. DB B, t = 56; …]
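A path as described on this slide can be sketched as an ordered list of observations. The following is a minimal illustration, not the authors' data model: the class names `Observation` and `Path`, the methods, and the request id `request-1` are all assumptions made for the example. The sample values are the ones shown in the slide's figure.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One tracer record: which component handled the request, and when."""
    component: str
    timestamp: int
    version: str = ""


@dataclass
class Path:
    """All observations recorded on behalf of a single request (hypothetical model)."""
    request_id: str
    observations: list = field(default_factory=list)

    def record(self, component, timestamp, version=""):
        self.observations.append(Observation(component, timestamp, version))

    def components(self):
        """Component names in timestamp order."""
        ordered = sorted(self.observations, key=lambda o: o.timestamp)
        return [o.component for o in ordered]

    def interval(self, first, second):
        """Elapsed time between the earliest observations at two components."""
        seen = {}
        for o in sorted(self.observations, key=lambda o: o.timestamp):
            seen.setdefault(o.component, o.timestamp)
        return seen[second] - seen[first]


# The example path from the slide's figure; the request id is made up.
path = Path("request-1")
for name, t in [("Web A", 1), ("App A", 23), ("App B", 30), ("DB B", 56)]:
    path.record(name, t)

print(path.components())
print(path.interval("App A", "DB B"))
```

Keeping the raw observations, rather than a precomputed summary, is what lets later analyses pick any pair of observations and measure the interval between them.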
7
Architecture
Observation includes:
– Component/resource names, version, …
– Timestamps
Application-generic tracing
– By instrumenting the application servers
  – E.g. < 1K lines for JBoss, a J2EE app server
– Request-centric
  – Associate system events to user-visible events
– Performance overhead
  – 1-3% for eBay
[Figure: Web, App, and DB Tracers emit observations for each request to an Aggregator; paths are kept in Storage behind a Query interface that serves the Analysis Engines (Detection, Diagnosis, Viz) used by Ops/QA/Dev]
8
3 Path-based Frameworks
Paths Framework | Site | Description | Physical Tiers | # of Machines | # of Requests | Apps Hosted
Pinpoint | - | Research prototype based on J2EE | 2-3 | - | - | Java
ObsLogs | Tellme | Enterprise voice application network | (5) | Hundreds | Millions per day | VoiceXML
SuperCAL | eBay | Online auction | 2-3 | 2000+ | Billions per day | C++, Java
eBay Stats
– 1TB raw logs/day (150GB gzipped), 200Mbps peak
– 2K app servers, 40 SuperCAL machines
9
Talk Outline
Motivation and Approach
Failure Management
– Failure detection via path anomalies
– Failure diagnosis using machine learning methods
Evolution Management
– Application-generic dependency tracking
– Detecting and diagnosing changes
Conclusions
10
Failure Management
Goal: minimize impact of failures
– User-visible failures => $$$ lost
78% of recovery time is spent on detection and diagnosis
[Figure: failure timeline with the labels Detection, Diagnosis, Recovery, Repair, Impact Analysis, and Feedback; 78% marked on the timeline]
11
Fast Recovery Challenges
Many potential causes of failures
– SW bugs, hardware, configuration, network, DB, …
– Multiple independent failures
Lots of data
– Many small, but tolerable failures
– Real-time detection/diagnosis
Root cause might not be captured in logs
– Tradeoff between logging granularity and overhead
Observation: exact root cause may not be required for many recovery techniques
12
Failure Detection Concepts
Path collisions
– Incomplete paths interrupted by other requests.
Structural anomalies
– Learn a set of “good” paths, and flag unseen paths.
– Extended to use probabilistic models.
[Figure: requests flowing through Web A/Web B, App A/App B/App C, and DB A/DB B]
13
Structural Anomalies in Path Shapes
Probabilistic Context Free Grammar (PCFG)
– Represents likely calls made by each component
– Learn probabilities of rules based on observed paths
Anomalous path shapes
– Score a path by calculating the deviations of P(observed calls) from average.
Detected 90% of injected faults in our experiments
[Figure: two sample paths over components A, B, and C, and the learned PCFG: S → A (p=1); A → B (p=.5); A → B C (p=.5); B → C (p=.5); B → $ (p=.5); C → $ (p=1)]
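The rule-learning and scoring steps on this slide can be illustrated with a short sketch. It makes several assumptions not stated in the slides: a path is represented as a nested `(component, [sub-calls])` tuple, the two sample paths are taken to be a chain A → B → C and a fan-out where A calls B and C, and the score sums, for each observed rule, how far its probability falls below the average probability of the rules sharing its left-hand side. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def rules(tree):
    """Yield (caller, callees) rules from a call tree given as (name, [subtrees])."""
    name, children = tree
    # A component that makes no calls produces the terminal rule name -> $.
    yield name, tuple(child[0] for child in children) or ("$",)
    for child in children:
        yield from rules(child)


def learn_pcfg(paths):
    """Estimate rule probabilities as relative frequencies over observed paths."""
    counts = defaultdict(Counter)
    for path in paths:
        counts["S"][(path[0],)] += 1  # start rule: S -> root component
        for lhs, rhs in rules(path):
            counts[lhs][rhs] += 1
    return {lhs: {rhs: n / sum(c.values()) for rhs, n in c.items()}
            for lhs, c in counts.items()}


def anomaly_score(pcfg, path):
    """Sum of shortfalls of each observed rule's probability below the average.

    Assumed scoring rule: the average for a caller is 1 / (number of rules
    learned for it); a never-seen rule has probability 0.
    """
    score = 0.0
    for lhs, rhs in rules(path):
        known = pcfg.get(lhs, {})
        average = 1.0 / len(known) if known else 1.0
        score += max(0.0, average - known.get(rhs, 0.0))
    return score


chain = ("A", [("B", [("C", [])])])        # A calls B, B calls C
fanout = ("A", [("B", []), ("C", [])])     # A calls B and C
pcfg = learn_pcfg([chain, fanout])

unseen = ("A", [("C", [("B", [])])])       # a shape never observed
print(anomaly_score(pcfg, chain), anomaly_score(pcfg, unseen))
```

Under these assumptions the learned probabilities match the grammar in the slide's figure, and a path built only from frequently seen rules scores zero.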
14
Failure Diagnosis Concepts
Idea: all bad paths touch the root cause
– Look for path properties common to failed requests
  – E.g. components used in all failed paths
– Extended to use probabilistic models.
Limitation:
– Inter-path dependency
[Figure: requests flowing through Web A/Web B, App A/App B/App C, and DB A/DB B, with App B highlighted]
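The basic idea, before the probabilistic extension, amounts to a set intersection over failed paths. The sketch below is an illustration only: the function name, the input shape, and the sample paths are assumptions, and the secondary ranking by how often successful paths use a component is an added convenience rather than something the slides describe.

```python
def common_to_failures(paths):
    """Rank components that appear in every failed path.

    paths: iterable of (components, failed) pairs, where components is a set
    of component names and failed is a bool.
    Returns (component, share of successful paths using it) pairs, with the
    components least used by successful paths first.
    """
    failed = [set(c) for c, bad in paths if bad]
    succeeded = [set(c) for c, bad in paths if not bad]
    if not failed:
        return []
    suspects = set.intersection(*failed)
    ranked = []
    for component in suspects:
        used = sum(component in c for c in succeeded)
        share = used / len(succeeded) if succeeded else 0.0
        ranked.append((component, share))
    return sorted(ranked, key=lambda item: (item[1], item[0]))


# Hypothetical observations: two failed paths and two successful ones.
observed = [
    ({"Web A", "App A", "DB A"}, False),
    ({"Web A", "App B", "DB B"}, True),
    ({"Web B", "App B", "DB B"}, True),
    ({"Web B", "App C", "DB B"}, False),
]
print(common_to_failures(observed))
```

A strict intersection breaks down with multiple independent faults, since no single component is then common to all failed paths, which is one reason to move to the statistical methods on the following slides.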
15
Failure Diagnosis
Summarize each path into features:
Path | Type | Name | Pool | Host | Version | DB | Status
1 | URL | ViewFeedback | Cgi0 | 134 | 1.2.1 | FeedbackDB, UserDB, … | NullPointer
2 | URL | Bid | Cgi2 | 231 | 1.0.3 | PriceDB | Success
3 | XML | … | … | … | … | … | …
What features of requests correlate with failures (e.g. NullPointerException)?
– Request type, name, pool, host, version, DB, or a combination of these?
– Different causes require different recovery techniques
16
Borrow Statistical Learning Techniques
Cast as feature selection problem in machine learning
Use decision trees because results are easily interpretable
1. Learn the tree from data (with failed paths)
2. The edges that lead to failed nodes are the candidates
Features (Type, Name, Machine) and class label (Status):
Type | Name | Machine | Status
URL | MyFeedback | X | NullPointer
URL | Login | X | Success
XML | ViewFeedback | Y | Success
URL | Respond | Y | Timeout
… | … | … | …
[Figure: decision tree splitting on Machine (X, Y) and then on Request Name (MyFeedback, Login, ViewFeedback, Respond), with leaves NullPointer, Success, Success, and Timeout]
Diagnosis: 1) Machine X and MyFeedback 2) Machine Y and Respond
17
Diagnosis Results of Decision Trees
Recall vs precision tradeoff
– Recall: % of true faults identified
– Precision: 1 – false positive rate
Decision trees
– C4.5 w/ adaptation
  – A standard decision tree algorithm
– MinEntropy
  – A greedy variant that finds one leaf with the most failures
  – Actual results from eBay deployment
– Association rules
  – Data mining algorithm that computes the conditional probabilities for all combinations of features
[Figure: recall vs. precision plot, with the “perfect” point marked]
18
Talk Outline
Motivation and Approach
Failure Management
Evolution Management
– Application-generic dependency tracking
– Detecting and diagnosing expected and unexpected changes
Conclusions
19
Tracking Dependency
Current approaches
– Manual approaches are error-prone and slow
– Static analysis captures possible system behavior vs. runtime analysis which captures the actual behavior
Paths directly captures application structure
– Application-generic tracking of actual dependency
  – Zero changes to applications
[Figure: Rubis, a J2EE auction application, hosted on Pinpoint/JBoss]
20
Automatically Derived State Dependency
Paths associate requests with internal state
– Coupling of requests through shared state
  – Easily extended to track fine-grained (e.g. row-level) state sharing
[Table: PetStore, a J2EE e-commerce application, hosted on Pinpoint/JBoss. Columns are the database tables Product, Signon, Account, Banner, and Inventory; rows are requests; R = read, W = write. The entries per request, in column order with empty cells lost in extraction: VerifySignin: R, R, R; Cart: R, R, R/W; CommitOrder: R, W; Category: R; Search: R, R; ProductDetails: R, R/W; NewAccount: R, R; Checkout: W]
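A request-by-table dependency matrix of this kind can be derived by aggregating the database accesses seen along each path. The sketch below assumes each path has already been reduced to `(request, table, mode)` triples; the function names and the sample requests and tables are invented for the example and are not the PetStore data.

```python
from collections import defaultdict


def dependency_table(accesses):
    """Build {request: {table: "R", "W", or "R/W"}} from (request, table, mode) triples."""
    modes = defaultdict(lambda: defaultdict(set))
    for request, db_table, mode in accesses:
        modes[request][db_table].add(mode)
    return {request: {t: "/".join(sorted(m)) for t, m in tables.items()}
            for request, tables in modes.items()}


def coupled_requests(table):
    """Request pairs that share a table which at least one of them writes."""
    pairs = set()
    requests = sorted(table)
    for i, a in enumerate(requests):
        for b in requests[i + 1:]:
            for t in table[a].keys() & table[b].keys():
                if "W" in table[a][t] or "W" in table[b][t]:
                    pairs.add((a, b, t))
    return sorted(pairs)


# Hypothetical accesses extracted from paths.
accesses = [
    ("Browse", "Catalog", "R"),
    ("Order", "Catalog", "R"),
    ("Order", "Stock", "R"),
    ("Order", "Stock", "W"),
    ("Restock", "Stock", "W"),
]
table = dependency_table(accesses)
print(table)
print(coupled_requests(table))
```

Because the matrix comes from observed paths, it reflects the state sharing that actually occurred at runtime, and tracking row identifiers instead of table names would give the finer-grained variant mentioned on the slide.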
21
Detecting/Diagnosing Changes
Paths provides a flexible mechanism to profile any sub-path
– Take the interval between any two observations
– Drill down to identify problematic sub-paths
Statistical analysis simultaneously examines thousands of sub-paths
– Use non-parametric tests (e.g. Mann-Whitney)
– Thousands of sub-paths tested for every Tellme release
[Figure: a path drawn as a sequence of observations (obs)]
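Comparing sub-path latencies between two releases with a Mann-Whitney test might look like the sketch below. It is a simplified illustration: the p-value uses the normal approximation with no tie or continuity correction, the `alpha` cut-off and all names and sample latencies are assumptions, and a production analysis over thousands of sub-paths would also need to account for multiple comparisons.

```python
import math


def mann_whitney(before, after):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (u, p). u counts the (before, after) pairs in which the `after`
    value is larger, with ties counted as one half. No tie or continuity
    correction is applied, so p is approximate.
    """
    n, m = len(before), len(after)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0
            for b in before for a in after)
    mean = n * m / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


def changed_subpaths(old, new, alpha=0.01):
    """Names of sub-paths whose latency samples differ between two releases."""
    return sorted(name for name in old.keys() & new.keys()
                  if mann_whitney(old[name], new[name])[1] < alpha)


# Hypothetical latency samples per sub-path for two releases.
old = {"lookup": [10, 12, 11, 13, 12, 11, 10, 12],
       "render": [5, 6, 5, 7, 6, 5, 6, 7]}
new = {"lookup": [18, 20, 19, 21, 20, 19, 22, 18],
       "render": [6, 5, 7, 5, 6, 6, 5, 7]}
print(changed_subpaths(old, new))
```

A rank-based test makes no assumption about the shape of the latency distribution, which suits latency data with long tails and outliers.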
22
Detecting/Diagnosing App-level Changes
Paths enables simultaneous testing of many sub-paths
– drill down to diagnose specific slow sub-paths
[Figure: box plots (lower quartile, median, upper quartile, outliers) for 2 versions of a Tellme application. Change detected in 1 sub-path in 1 application]
23
Detecting/Diagnosing App-level Changes
Paths enables simultaneous testing of many sub-paths
– drilling down to diagnose the specific slow sub-paths
[Figure: box plots (lower quartile, median, upper quartile, outliers) for 2 versions of 2 Tellme applications and 3 sub-paths. No changes]
24
Detecting/Diagnosing App-level Changes
Paths enables simultaneous testing of many sub-paths
– drilling down to diagnose the specific slow sub-paths
[Figure: 3 versions of 2 Tellme applications and 3 sub-paths. App fixed]
25
Detecting/Diagnosing Platform Changes
Look for consistent deviation across applications
[Figure: 2 versions of a Tellme platform. Change detected in 1 sub-path in 1 application]
26
Detecting/Diagnosing Platform Changes
Look for consistent deviation across applications
[Figure: 2 versions of a Tellme platform. Consistent changes across all apps]
27
Detecting/Diagnosing Platform Changes
Look for consistent deviation across applications
[Figure: 3 versions of a Tellme platform. Platform fixed]
28
Lessons Learned
Separate the path analysis logic from observation instrumentation
– Improves maintainability and extensibility
Data is cheap
– Allows the use of simple statistical algorithms
Live workload
– Important to support online use of tools
Record “attempts”
– Failed components/resources may not record observations properly
29
Summary
Paths + statistical analysis:
– Improves failure detection and diagnosis to support fast recovery.
– Automates dependency tracking and change analysis to support rapid and correct evolution.
Deployed and evaluated on real systems
– Pinpoint, Tellme, and eBay
Future work:
– Wide-area systems and systems that span multiple administrative domains
30
Thank You
Acknowledgements
– Berkeley/Stanford ROC Research Group
– Professor Michael Jordan and Alice Zheng
– Shepherd Miguel Castro and anonymous reviewers
For more info:
– Google, Yahoo, or MSN Search for Mike Chen
31
Backup Slides
32
Recovery Time Saving
Expected Time Saved = E(Manual Diag. + Recovery) – E(Automated & Manual Diag. + Recovery)
– Use diagnosis time based on experience:
  – Diagnosis time: Automated = 1 min, Manual (perfect) = 15 min
  – Recovery time (w/ verification) = 5 min
[Figure: time saved (min) vs. noise filtering threshold]
$50K to $1 million saved
33
Show eBay’s Complex System Diagram
Show a few path examples
34
Failure Management Process
[Figure: process stages Detection, Isolation, Diagnosis, Impact Analysis, Repair, and Feedback]
35
MinEntropy
Entropy measures the randomness of data
– E.g. if failure is evenly distributed (very random), then entropy is high
Rank features by the normalized entropy
– E.g. if root cause is a machine failure, then entropy will be low in the host dimension. Since all types of requests will fail on that host, the entropy in the request type dimension will be higher.
Implemented at eBay
– Greedy approach searches for the leaf node with most failures
– Pros: fast (<1s for 100K txns and scales linearly)
– Cons:
  – Optimized for single faults
  – Features may not be independent (e.g. pool and host)
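The feature-ranking step can be sketched as follows. This is one reading of the slide, not eBay's implementation: entropy is computed over how failures spread across a feature's values and normalized by the log of the number of distinct values seen for that feature; the function names, the `status` field, and the sample transactions are assumptions.

```python
import math
from collections import Counter


def normalized_entropy(failed_values, domain_size):
    """Entropy of failures across one feature's values, scaled to [0, 1]."""
    if domain_size < 2 or not failed_values:
        return 0.0
    counts = Counter(failed_values)
    total = len(failed_values)
    h = sum(-(n / total) * math.log(n / total) for n in counts.values())
    return h / math.log(domain_size)


def rank_features(txns, features):
    """Return (entropy, feature, value with most failures), lowest entropy first."""
    failed = [t for t in txns if t["status"] != "Success"]
    ranked = []
    for f in features:
        values = [t[f] for t in failed]
        domain = {t[f] for t in txns}
        worst = Counter(values).most_common(1)[0][0] if values else None
        ranked.append((normalized_entropy(values, len(domain)), f, worst))
    return sorted(ranked, key=lambda r: r[0])


# Hypothetical transactions: every failure is on host X, spread over both types.
txns = [
    {"host": "X", "type": "URL", "status": "Failed"},
    {"host": "X", "type": "SQL", "status": "Failed"},
    {"host": "X", "type": "URL", "status": "Failed"},
    {"host": "X", "type": "SQL", "status": "Failed"},
    {"host": "Y", "type": "URL", "status": "Success"},
    {"host": "Y", "type": "SQL", "status": "Success"},
]
ranking = rank_features(txns, ["type", "host"])
print(ranking[0])
```

The feature with the lowest normalized entropy is where failures are most concentrated, matching the slide's machine-failure example in which the host dimension ranks ahead of request type.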
36
MinEntropy example
TxType | Errors
URL | 4350
SQL | 47
EMAIL | 12
XSLT | 0
… | …
Pool | Errors
Cgi0 | 12
Cgi1 | 4002
Cgi2 | 30
Cgi3 | 8
Cgi4 | 5
… | …
Machine | Errors
Attila | 1985
Lenin | 2002
Marcus | 4
Scipio | 0
… | …
URL | Errors
MyEBay | 636
MyEBaySeller | 512
MyEBayLogin | 736
… | …
Label | Errors
E293 | 3987
E291 | 15
Alert: Build E293 causing URL error storm (not specific to any URL) in pool CGI1
37
Association Rules
Data mining technique to compute item sets
– e.g. Shoppers who bought this item also shopped for …
Metrics
– Confidence: (# of A & B) / # of A
  – Conditional probability of B given A
– Support: (# of A & B) / total # of txns
Generates rules for all possible sets
– e.g. machine=abc, txn=login => status=NullPointerException (conf:0.1, support=0.02)
Applied to failure diagnosis
– Find all rules that have a failed status on the right
– Pros: looks at combinations of features
– Cons: generates many rules
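The two metrics follow directly from counting matching transactions. The sketch below evaluates a single rule rather than mining all item sets; the function name and the sample transactions are made up for the example, with feature names borrowed from the slide's sample rule.

```python
def rule_metrics(txns, antecedent, consequent):
    """Confidence and support of the rule antecedent => consequent.

    txns is a list of dicts; antecedent and consequent are dicts of
    feature=value conditions.
    """
    def matches(txn, conditions):
        return all(txn.get(k) == v for k, v in conditions.items())

    with_a = [t for t in txns if matches(t, antecedent)]
    with_a_and_b = [t for t in with_a if matches(t, consequent)]
    confidence = len(with_a_and_b) / len(with_a) if with_a else 0.0
    support = len(with_a_and_b) / len(txns) if txns else 0.0
    return confidence, support


# Hypothetical log: 10 transactions, 4 logins on machine abc, 1 of which failed.
txns = (
    [{"machine": "abc", "txn": "login", "status": "NullPointerException"}]
    + [{"machine": "abc", "txn": "login", "status": "Success"}] * 3
    + [{"machine": "xyz", "txn": "bid", "status": "Success"}] * 6
)
confidence, support = rule_metrics(
    txns,
    {"machine": "abc", "txn": "login"},
    {"status": "NullPointerException"},
)
print(confidence, support)
```

Confidence is normalized by the transactions matching the left-hand side, while support is normalized by the whole log, so a rule can have high confidence and still cover very few transactions.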
38
Adapting Association Rules
Sample output (rules containing failures):
TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
Rank by the size of item sets if support and conf are equal
– TxnName = LeaveFeedback
39
Failure Management
Goal: minimize impact of failure
– User-visible failures => $$$ lost
Fast failure detection and diagnosis are critical to availability
– 78% of recovery time is spent on detection and diagnosis
[Figure: failure timeline with the labels Detection, Diagnosis, Recovery, Repair, Impact Analysis, and Feedback]
40
PCFG Thresholding
Set a threshold for declaring anomalies
– Static threshold: any request > 99th or 99.5th percentile
– Dynamic threshold: when proportions don't match known good.
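The static variant can be sketched as a percentile cut-off over anomaly scores collected during known-good operation. The nearest-rank percentile definition, the function names, and the sample scores are assumptions; the dynamic variant is not covered here.

```python
import math


def percentile(scores, pct):
    """Nearest-rank percentile: smallest score with at least pct% of scores at or below it."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def flag_anomalies(baseline_scores, new_scores, pct=99):
    """Return the new scores that exceed the chosen percentile of the baseline."""
    threshold = percentile(baseline_scores, pct)
    return [s for s in new_scores if s > threshold]


# Hypothetical baseline of 200 anomaly scores from known-good traffic.
baseline = list(range(1, 201))
print(percentile(baseline, 99), percentile(baseline, 99.5))
print(flag_anomalies(baseline, [50, 198, 199, 250]))
```

A fixed percentile flags a constant share of requests even when nothing is wrong, which is presumably the motivation for also comparing proportions against known-good behavior.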
41
Failure Diagnosis Experiments
Data set
– 10 one-minute traces, 4 with 2 independent faults
  – total of 14 independent faults
– About 1/8 of the whole site (640 potential single-faults)
Metrics
– Recall: % of true faults identified = (# of identified faults) / (# of true faults)
– Precision: 1 – false positive rate = (# of identified faults) / (# of predicted faults)
Potential single-faults per feature (640 in total):
Type | Name | Pool | Machine | Version | Database | Status
10 | 300 | 15 | 260 | 7 | 40 | 8
Faults per trace (10 traces):
Host | DB | Host, Host | Host, DB | Host, SW | DB, SW
2 | 4 | 1 | 1 | 1 | 1
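The two metrics defined on this slide reduce to set arithmetic once true and predicted faults are expressed as comparable items. In the sketch below the function name and the `(feature, value)` encoding of a fault are assumptions made for illustration.

```python
def recall_precision(true_faults, predicted_faults):
    """Recall = identified / true faults; precision = identified / predicted faults."""
    true_faults, predicted_faults = set(true_faults), set(predicted_faults)
    identified = true_faults & predicted_faults
    recall = len(identified) / len(true_faults) if true_faults else 0.0
    precision = len(identified) / len(predicted_faults) if predicted_faults else 0.0
    return recall, precision


# Hypothetical trace with two injected faults and four predictions.
true_faults = {("Machine", "X"), ("Database", "D")}
predicted = {("Machine", "X"), ("Pool", "P"), ("Name", "N"), ("Version", "V")}
print(recall_precision(true_faults, predicted))
```

Predicting more candidates can only raise recall but tends to lower precision, which is the tradeoff the diagnosis results are plotted against.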
42
eBay’s Site
2 physical tiers
– Web server/app server + DB
– Apps in both Java (WebSphere) and C++
SuperCAL (Centralized Application Logging)
– API for app developer to log anything to CAL
– Platform logs common path features: cookie, host, URL, DB table(s), status, etc.
Stats
– 1TB raw logs/day (150GB gzipped), 200Mbps peak
– 2K app servers, 40 SuperCAL machines
How to diagnose accurately and efficiently?