Comprehensive Depiction of Configuration-dependent Performance Anomalies in Distributed Server Systems Christopher Stewart, Ming Zhong, Kai Shen, and Thomas.

Comprehensive Depiction of Configuration-dependent Performance Anomalies in Distributed Server Systems
Christopher Stewart, Ming Zhong, Kai Shen, and Thomas O’Neill
University of Rochester
Presented at the 2nd USENIX Workshop on Hot Topics in System Dependability

2 Context
Distributed server systems
- Example: J2EE application servers
- Many system configurations: switches that control runtime execution
- Wide range of workload conditions: exogenous demands for system resources
Example J2EE runtime conditions
- System configurations: concurrency limit, component placement
- Workload conditions: request rate

3 Presumptions
Performance expectations based on knowledge of system design are reasonable
- Lead developers: high-level algorithms
- Administrators: day-to-day experience
Example expectation: Little’s Law
- The average number of requests in the system equals the average arrival rate times the average time a request spends in the system
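
As a rough illustration of how Little's Law turns into a checkable expectation, the sketch below computes the expected number of in-flight requests and compares it with a measured value. The function names, the sample numbers, and the 20% tolerance are illustrative assumptions, not values from the talk.

```python
def expected_in_system(arrival_rate, mean_time_in_system):
    """Little's Law: mean requests in system = arrival rate * mean time in system."""
    return arrival_rate * mean_time_in_system


def relative_error(expected, actual):
    """Expectation error as a fraction of the expected value (assumed definition)."""
    return abs(actual - expected) / expected


# Hypothetical measurements: 200 requests/sec, 50 ms mean time in system.
expected = expected_in_system(arrival_rate=200.0, mean_time_in_system=0.05)
observed = 14.0  # hypothetical measured mean number of in-flight requests

error = relative_error(expected, observed)
is_anomaly = error > 0.20  # assumed tolerance, chosen only for this example
print(f"expected={expected:.1f} observed={observed:.1f} error={error:.0%} anomaly={is_anomaly}")
```

A fixed tolerance is used here only to keep the example short; the talk selects the threshold by clustering on the expectation error.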

4 Problem Statement
Dependable performance is important for system management
- QoS scheduling
- SLA negotiations
Performance anomalies (runtime conditions in which performance falls below expectations) are not uncommon
[Figure: Real Performance Anomalies. Throughput, actual vs. expectation, across component placement strategies.]

5 Goals
Previous work: anomaly characterization can aid the debugging process and guide online avoidance
- [AGU-SOSP99, QUI-SOSP05, CHE-NSDI04, COH-SOSP05, KEL-WORLDS05]
- Focused on specific runtime conditions (e.g., those encountered during a particular execution)
We wish to depict all anomalous conditions
Comprehensive depictions can:
- Aid the debugging of production systems before distribution
- Enable preemptive avoidance of anomalies in live systems

6 Approach
Our depictions are derived in a 3-step process:
1. Generate performance expectations by building a comprehensive whole-system performance model
2. Search for anomalous runtime conditions
3. Extrapolate a comprehensive anomaly depiction
Challenges:
- The model must consider a wide range of system configurations
- A systematic method is needed to determine the anomaly error threshold
- An appropriate method is needed to detect correlations between runtime conditions and anomalies

7 Outline
- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / Conclusion

8 Comprehensive Performance Expectations
Modeling the configuration space is hard
- Configurations have complex effects on performance
- Considering a wide range of configurations increases model complexity
Our modeling methodology
- Build performance models as a hierarchy of sub-models
- Sub-models can be independently adjusted to consider new system configurations

9 Rules for Our Sub-Model Hierarchies
The output of each sub-model is a workload property
- Workload property: internal demands for resources (e.g., CPU consumption)
The inputs to each sub-model are either
1. workload properties
2. system configuration settings
Sub-models on the highest level produce performance expectations
Workload properties at the lowest level, called canonical workload properties, can be measured independently of system configurations
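
The rules above can be sketched as a small two-level hierarchy. Everything concrete here is a hypothetical stand-in: the property names, the configuration keys, and the formulas are invented to show the shape of the hierarchy, and are not the sub-models used in the talk or in [STE-NSDI05].

```python
# Canonical workload properties: measured once, independent of configuration.
# (Hypothetical names and values.)
canonical = {"cpu_sec_per_request": 0.004, "remote_calls_per_request": 3.0}

# System configuration settings (hypothetical).
config = {"nodes": 4, "remote_call_cpu_sec": 0.001, "fraction_calls_remote": 0.5}


def cpu_demand_per_request(canonical, config):
    """Lower-level sub-model: takes workload properties and configuration
    settings as inputs and outputs another workload property (CPU seconds)."""
    remote_calls = canonical["remote_calls_per_request"] * config["fraction_calls_remote"]
    return canonical["cpu_sec_per_request"] + remote_calls * config["remote_call_cpu_sec"]


def expected_throughput(cpu_demand, config):
    """Top-level sub-model: outputs a performance expectation (requests/sec),
    assuming CPU is the only bottleneck and load is spread evenly over nodes."""
    return config["nodes"] / cpu_demand


demand = cpu_demand_per_request(canonical, config)
throughput = expected_throughput(demand, config)
print(f"cpu demand = {demand:.4f} s/request, expected throughput = {throughput:.1f} req/s")
```

Because each sub-model only sees workload properties and configuration settings, supporting a new configuration switch means adjusting one function rather than the whole model.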

10 A Hierarchy of Sub-Models
We leverage the workload properties of earlier work [STE-NSDI05]
Advantages
- Sub-models have meaning
Limitations
- Configuration dependencies may make sub-models complex
[Figure: Hierarchy of sub-models for J2EE application servers.]

12 Determination of the Anomaly Error Threshold
Sometimes slight discrepancies between actual and expected performance should be tolerated
Leniency depends on the end use of the depiction
- For online avoidance: focus on error magnitude
  - Large errors may induce poor management decisions
  - Sensitivity analysis of system management functions
- For debugging: focus on targeted performance bugs
  - Noisy depictions will mislead debuggers
  - Group anomalies with the same root cause

13 Anomaly Error Threshold for Debugging
Observation: anomaly manifestations due to the same cause are more likely to share similar error magnitude than unrelated anomaly manifestations
Root causes can be grouped by clustering based on the expectation error

14 Anomaly Error Threshold for Debugging
Knee points mark cluster boundaries
Knee-point selection
- Higher magnitude emphasizes large anomalies
- Low magnitude captures multiple anomalies
Validation: we notice that knee points disappear when problems are resolved
[Figure: Expectation Error Clustering. Expectation error (0% to 100%) for response time and throughput over sample runtime conditions (0 to 1600, sorted on expectation error), with the knee marked.]
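
One simple way to locate a knee in sorted expectation errors is to look for the largest drop between neighboring samples. The sketch below uses that heuristic. It is an assumption made for illustration, since the slides do not say how knee points are computed, and the error values are made up.

```python
def knee_threshold(errors):
    """Return a threshold placed in the largest gap between consecutive
    expectation errors once they are sorted from largest to smallest.
    Requires at least two samples."""
    ordered = sorted(errors, reverse=True)
    gaps = [(ordered[i] - ordered[i + 1], i) for i in range(len(ordered) - 1)]
    _, knee_index = max(gaps)
    # Put the threshold halfway across the gap that forms the knee.
    return (ordered[knee_index] + ordered[knee_index + 1]) / 2


# Hypothetical expectation errors (fractions) for eight sampled runtime conditions.
errors = [0.02, 0.05, 0.03, 0.62, 0.58, 0.04, 0.65, 0.01]

threshold = knee_threshold(errors)
anomalies = [e for e in errors if e > threshold]
print(f"threshold={threshold:.3f}, anomalous samples={sorted(anomalies)}")
```

This version finds a single knee. Capturing several clusters, as the lower-magnitude option on the slide suggests, would mean keeping more than one of the large gaps.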

16 Decision Tree Based Anomaly Depictions
Decision trees correlate anomalies to problematic runtime conditions
- Interpretable: unlike neural nets, SVMs, perceptrons
- No prior knowledge: unlike Bayesian trees [COH-OSDI04]
- Versatile
Example rules
- If a=0: anomaly
- If a=1, b=0: normal
- If a=1, b=1: anomaly
White-box usage for debugging hints: prefer shorter, easily interpreted trees
Black-box usage for avoidance: prefer longer, more precise trees
[Figure: example decision trees over runtime conditions a, b, c, with leaves labeled Anomaly or Normal and leaf probabilities (Anomaly 80% prob., Normal 70% prob., Anomaly 90% prob.); sample input a=0, b=1, c=2, ….]
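
To show how a decision tree can recover rules like the ones on this slide, here is a minimal ID3-style learner in plain Python, trained on four hypothetical samples that follow those rules. It is a generic textbook sketch, not the authors' implementation, and the dictionary-based sample format is an assumption.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def build_tree(rows, labels, features):
    """Return a leaf label, or (feature, {value: subtree}) for an inner node."""
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # majority label

    def split_entropy(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    # Lowest weighted entropy after the split = highest information gain.
    best = min(features, key=split_entropy)
    remaining = [f for f in features if f != best]
    children = {}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        children[value] = build_tree(
            [r for r, _ in subset], [l for _, l in subset], remaining
        )
    return (best, children)


def classify(tree, row):
    """Walk from the root to a leaf using the row's condition values."""
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[row[feature]]
    return tree


# Hypothetical samples consistent with the slide's rules.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["anomaly", "anomaly", "normal", "anomaly"]

tree = build_tree(rows, labels, ["a", "b"])
print(tree)
print(classify(tree, {"a": 1, "b": 0}))
```

On this data, splitting on a and splitting on b score equally, so the root is decided by feature order. Either tree classifies all four samples correctly.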

17 Design Recap
We wish to depict performance anomalies across a wide range of system configurations and workload conditions
1. Derive performance expectations via a hierarchy of sub-models
2. Search for anomalous runtime conditions with a carefully selected anomaly error threshold
3. Use decision trees to extrapolate a comprehensive anomaly depiction

19 Depiction Assisted Debugging
System: JBoss
- 8 runtime conditions (including app type)
- 4-machine cluster, 2.66 GHz CPU
Found and fixed 3 performance anomalies
- One is shown in detail below
[Figure: Depiction of a real performance anomaly: a misunderstood J2EE configuration that manifests when multiple components are placed on node 2.]

20 Discovered Anomalies
1. A misunderstood J2EE configuration caused remote invocations to unintentionally execute locally
2. A mishandled out-of-memory error under high concurrency caused the Tomcat 5.0 servlet container to drop requests
3. A circular dependency in the component invocation sequences caused connection timeouts under certain component placement strategies

22 Discussion
Limitations
- Cannot detect non-deterministic anomalies
- Is it model inaccuracy or a performance anomaly? Requires manual investigation, but the model is much less complex than the system
- Debugging is still a manual process
Future work
- Short term: investigate more system configurations
- Short term: depict anomalies in more systems
- Long term: more systematic depiction-assisted debugging methods

23 Take Away
Comprehensive depictions of performance anomalies across a wide range of runtime conditions can aid debugging and avoidance
We have designed and implemented an approach to:
- Model a wide range of system configurations
- Determine anomalous conditions
- Depict the anomalies in an easy-to-interpret fashion
We have already used our approach to find 3 performance bugs
