1
Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control
Ira Cohen, Jeffrey S. Chase et al.
2
Introduction
Networked systems continue to grow in scale
Complex behavior stems from the interaction of:
  Workload
  Software structure
  Hardware
  Traffic conditions
  System goals
A pervasive system is needed to manage such a system
Examples: HP’s OpenView, IBM’s Tivoli (aggregate data and display it graphically)
3
Introduction
Two approaches to building self-managing systems
A priori models
  Event-condition-action rules
  Not based on real systems
  Disadvantages:
    Difficult and costly
    Unreliable, does not take account of all
4
Introduction
Statistical learning techniques
  Assume little to no domain knowledge, hence “general”
  Problem: we still have to identify techniques powerful enough to induce effective models that are:
    Efficient
    Accurate
    Robust
5
Goals
Automatically analyze instrumentation data from network services in order to forecast, diagnose, and repair failure conditions
We use Tree-Augmented Naïve Bayesian networks (TANs) as the basis for diagnosis and forecasting from system-level instrumentation in a 3-tier network service
TANs are widely used in various fields, but not in the context of computer systems
6
Goals
Analyzed data from 124 metrics gathered from a 3-tier e-commerce site under synthetic load
  httperf
  Java PetStore as platform
The TAN model selects a combination of metrics and threshold values that correlates with compliance with Service Level Objectives (SLOs) for average response time
Results later
7
What is a TAN?
A Bayesian network is an annotated directed acyclic graph encoding a joint probability distribution
Naïve Bayesian network
  State variable S is the only parent of all other vertices
  Assumes all metrics are conditionally independent given S
TANs also consider relationships among the metrics themselves, with the constraint that each metric has at most one parent other than S
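As a rough illustration of the difference between the two structures, the factorizations can be written out as below. The notation is an assumption made here, not taken from the slides: M_1, ..., M_n are the metrics and M_{p_i} is the single metric parent of M_i (the root metric of the tree has no metric parent, so its term reduces to P(M_i | S)).

```latex
% Naive Bayes: every metric depends only on the state S
P(S, M_1, \dots, M_n) = P(S) \prod_{i=1}^{n} P(M_i \mid S)

% TAN: each metric may additionally depend on one other metric M_{p_i}
P(S, M_1, \dots, M_n) = P(S) \prod_{i=1}^{n} P(M_i \mid M_{p_i}, S)
```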
8
Why Use a TAN?
Based on the premise that a relatively small subset of metrics and threshold values is sufficient to approximate the distribution accurately
Outperforms generalized Bayesian networks and other alternatives in both:
  Cost
  Accuracy
9
Why use a TAN?
Useful for forecasting failures and violations
  Possible to induce models that predict SLO violations in the near future, even when the system is stable
  An automated controller can invoke the model directly to:
    Identify an impending violation
    Respond
      Loading
      Adding resources
Cheap model to induce
  Possible to maintain multiple models
  Periodic refresh
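One way to read the forecasting idea is that the classifier is trained on metrics from one interval paired with the SLO state observed some intervals later. The sketch below shows only that pairing step; the function name `forecast_pairs`, the `horizon` parameter, and the sample values are illustrative assumptions, not details from the presentation.

```python
def forecast_pairs(metrics, states, horizon):
    """Pair the metric sample at interval t with the SLO state at t + horizon.

    metrics: list of per-interval metric vectors
    states:  list of per-interval SLO states (0 = compliant, 1 = violation)
    horizon: number of intervals to look ahead (hypothetical parameter)
    """
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    if len(metrics) != len(states):
        raise ValueError("metrics and states must have the same length")
    # The last `horizon` metric samples have no future state to pair with.
    return [(metrics[t], states[t + horizon])
            for t in range(len(states) - horizon)]


# Made-up data: one metric per interval, violation in the third interval.
example_metrics = [[0.2], [0.5], [0.9], [0.4]]
example_states = [0, 0, 1, 0]
example_pairs = forecast_pairs(example_metrics, example_states, horizon=1)
print(example_pairs)
```

With `horizon=0` the same function yields the ordinary diagnosis pairing, so a single training routine could serve both uses.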
10
Setup
System is a 3-tier web service
  Apache
  Middleware (BEA WebLogic)
  Oracle database
3 servers, with HP OpenView collecting statistics
Load generator is httperf
SLO indicator processes the logs to determine compliance
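The SLO indicator step can be sketched as follows, assuming the logs have already been grouped into measurement intervals and that the SLO is a threshold on average response time (as stated on the Goals slide). The function name `slo_states`, the millisecond unit, and the sample numbers are hypothetical.

```python
from statistics import mean


def slo_states(response_times_by_interval, threshold_ms):
    """Label each interval 1 (SLO violation) or 0 (compliant).

    An interval violates the SLO when its average response time exceeds
    threshold_ms. Every interval is assumed to contain at least one request.
    """
    return [1 if mean(times) > threshold_ms else 0
            for times in response_times_by_interval]


# Made-up response times (ms) for three intervals.
example_intervals = [[80.0, 120.0], [300.0, 500.0], [90.0]]
example_labels = slo_states(example_intervals, threshold_ms=200.0)
print(example_labels)
```

These binary labels play the role of the state variable S that the classifier is trained to predict from the collected metrics.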
11
Interpretability and Modifiability
TANs offer other advantages
  Interpretability
  Modifiability
The influence of each metric can be quantified in a probabilistic model
The analysis catalogs each type of violation according to the metrics and values that correlate with observed instances
  Strength is given by the probability of the value occurring in the different states
Gives insight into the causes of violations and how to repair them
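A minimal sketch of how a metric's influence might be quantified: compare how likely the observed value is under the violation state versus the compliant state. Several simplifications here are assumptions rather than the presenters' method: each metric is modeled with a per-state Gaussian, the metric-to-metric parent of the TAN is ignored, and all names and parameter values are made up.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and std dev sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def attribute_violation(sample, params):
    """Rank metrics whose observed value is more likely under violation.

    sample: {metric_name: observed_value}
    params: {metric_name: {"compliant": (mu, sigma), "violation": (mu, sigma)}}
    Returns (metric_name, log-likelihood ratio) pairs, strongest first,
    keeping only metrics with a positive ratio.
    """
    scores = []
    for name, value in sample.items():
        mu_c, sd_c = params[name]["compliant"]
        mu_v, sd_v = params[name]["violation"]
        ratio = gaussian_logpdf(value, mu_v, sd_v) - gaussian_logpdf(value, mu_c, sd_c)
        if ratio > 0:
            scores.append((name, ratio))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-state parameters for two metrics.
example_params = {
    "cpu_util": {"compliant": (0.3, 0.1), "violation": (0.8, 0.1)},
    "disk_io": {"compliant": (50.0, 10.0), "violation": (55.0, 10.0)},
}
example_sample = {"cpu_util": 0.85, "disk_io": 50.0}
example_attribution = attribute_violation(example_sample, example_params)
print([name for name, _ in example_attribution])
```

In this made-up sample only `cpu_util` is implicated, because its value is far more probable under the violation parameters, while the `disk_io` reading sits at its compliant mean.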
12
Workloads
Vary several characteristics
  Aggregate request rate
  Number of concurrent connections
  Fraction of data-intensive vs. app-intensive requests
This exercises the model-induction methodology by providing it with a wide range of (M, P) pairs, where
  M = sample of values for system metrics
  P = vector of app-level performance measurements
13
Workloads
RAMP: increasing concurrency
STEP: background + step function
  Constant background traffic
  Bursty, hour-long bursts
BUGGY: increasing aggregate request rate
14
Results
Varied SLO thresholds to explore the effect on induced models
  To evaluate the accuracy of models under varying conditions
Trained and evaluated a TAN classifier for each of 31 different SLO definitions
Baselines: accuracy of the 60th-percentile SLO classifier (MOD) and of CPU as the metric
15
Results
Overall balanced accuracy (BA) of TAN is 87-94%
  90+% for all experiments
  6% false alarms for 2 experiments, 17% for BUGGY
A single metric (CPU) is not sufficient to capture the pattern of SLO violations
A small number of metrics (3-8) is sufficient to capture the pattern
Sensitive to workload and SLO definition
  MOD always has a high detection rate, but generates false alarms at an increasing rate as the SLO threshold increases
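For readers unfamiliar with the metric, here is a small sketch of how these figures relate. It takes BA to mean balanced accuracy, computed as the average of the detection rate and one minus the false alarm rate; that definition is stated here as an assumption, and the label vectors are made up.

```python
def balanced_accuracy(actual, predicted):
    """Return (balanced accuracy, detection rate, false alarm rate).

    actual, predicted: sequences of 0 (compliant) and 1 (violation).
    Detection rate   = fraction of true violations that were flagged.
    False alarm rate = fraction of compliant intervals that were flagged.
    """
    positives = sum(1 for a in actual if a == 1)
    negatives = sum(1 for a in actual if a == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one violation and one compliant sample")
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    detection = true_pos / positives
    false_alarm = false_pos / negatives
    return (detection + (1.0 - false_alarm)) / 2.0, detection, false_alarm


# Made-up labels: 4 violations (3 detected), 6 compliant (1 false alarm).
example_actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
example_predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
example_ba, example_det, example_fa = balanced_accuracy(example_actual, example_predicted)
print(round(example_ba, 3), example_det, round(example_fa, 3))
```

Averaging the two per-class rates keeps a classifier from scoring well simply by always predicting the more common state, which matters when violations are rare.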
16
Conclusion
TANs are attractive for self-managing systems
  Build system models automatically
  No a priori knowledge required
  Generalize to a wide range of conditions
  Zero in on the most relevant metrics
  Practical
17
Conclusion
Possible future work: adapting this to changing conditions
Close the loop for automated diagnosis and control
Ultimately, the most successful model is a hybrid of:
  Automatically induced models
  A priori models
18
Questions?