Early Fault Detection and Failure Prediction in Large Software Systems
Felix Salfner and Miroslaw Malek
Department of Computer Science, Humboldt University Berlin, Germany
April 28, 2003

Outline
- Our goal
- Description of the model
- Validation of the model
- Two applications using the failure predictor
- Work in progress
- Conclusions

Our Goal: Highly-Available Component-Based Software Systems
[Figure. Labels: System; Comp 1, Comp 2, Comp 3, ...; Res a, Res b, Res c, ...; Service A, Service B, Service C, ...; Event logs (event type over time t); Fault detection; High-level failure prediction.]

Mathematical Model View
[Figure. Labels: Stochastic occurrence of faults; Errors; Failures; System; time axis t; TS 1 ... TS n; Model; Fault detection (t - Δt); Failure prediction (t + Δt).]

Model Description
- The model contains patterns of events
  - Failure prediction: patterns that lead to failures.
  - Early fault detection: patterns that identify and locate faults.
  - Patterns reflect temporal behavior of the system.
  - Patterns are modeled as paths in an acyclic directed graph.
  - Events are characterized by multiple system properties.
- Two-phase approach
  - Model construction:
    » Analyze system behavior with the help of past logfiles.
    » Extract patterns by means of clustering algorithms.
    » Construct a generalized model.
  - Model application:
    » Wait for the occurrence of events.
    » Check whether the event matches known patterns (paths).
    » If true, calculate probability and timeframe for every path.
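The application phase can be illustrated with a minimal sketch. It assumes the graph is stored as a dictionary that maps each node to its outgoing branches, each carrying a relative likelihood and an expected delay. The node names, the numbers, and the function `predict_failures` are hypothetical and not taken from the slides; nodes without successors are assumed to be failures.

```python
def predict_failures(graph, node, probability=1.0, delay=0.0):
    """Depth-first search from the node matched by the latest event.

    Yields (failure, probability, time_to_failure) for every path that
    starts at `node`. Likelihoods are multiplied and delays are added
    along each path.
    """
    branches = graph.get(node, [])
    if not branches:  # no successors: assumed to be a failure node
        yield node, probability, delay
        return
    for successor, likelihood, dt in branches:
        yield from predict_failures(
            graph, successor, probability * likelihood, delay + dt
        )


# Hypothetical graph: node -> [(successor, relative likelihood, delay in s)]
GRAPH = {
    "memory_high": [("swap_active", 0.75, 30.0), ("shm_high", 0.25, 10.0)],
    "swap_active": [("no_process_memory", 1.0, 60.0)],
    "shm_high": [("no_shared_memory", 1.0, 20.0)],
}

for failure, p, t in predict_failures(GRAPH, "memory_high"):
    print(f"{failure}: p={p:.2f}, expected in {t:.0f} s")
```

Because the graph is acyclic, the recursion always terminates, and each path costs one multiplication and one addition per edge.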

Model construction
- Identify target positions in a logfile
- Cut out segments preceding the target positions (extract history)
- Each segment forms one path in the graph
- Group events by means of clustering algorithms
- Simplify the graph
- Calculate relative likelihoods of branches
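A rough sketch of the construction steps, under simplifying assumptions: the log is a time-ordered list of `(timestamp, event_type)` tuples, target positions are events of one given type, and the history is a fixed time window. The clustering and graph-simplification steps are left out, and segments are merged as a plain prefix tree. The function names and the log format are hypothetical.

```python
from collections import defaultdict


def extract_segments(log, target_type, window):
    """For each target event, collect the event types seen in the
    `window` time units before it (the extracted history)."""
    segments = []
    for i, (t_target, event_type) in enumerate(log):
        if event_type != target_type:
            continue
        history = [
            e for (t, e) in log[:i]
            if t_target - window <= t < t_target and e != target_type
        ]
        segments.append(history)
    return segments


def branch_likelihoods(segments):
    """Merge segments into a prefix tree and return, for every prefix,
    the relative likelihood of each event that follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for segment in segments:
        prefix = ()
        for event in segment:
            counts[prefix][event] += 1
            prefix += (event,)
    return {
        prefix: {e: c / sum(children.values()) for e, c in children.items()}
        for prefix, children in counts.items()
    }


log = [(0, "A"), (1, "B"), (2, "FAILURE"),
       (10, "A"), (11, "C"), (12, "FAILURE")]
segments = extract_segments(log, "FAILURE", window=5)
print(branch_likelihoods(segments))
```

In this toy log both histories start with `A` and then diverge, so the branch after `A` splits evenly between `B` and `C`.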

Model application
- Example: Measure memory usage each time an event occurs
- Two types of failures:
  - No process memory available
  - No shared memory available

Validation of the model
- Focus on telecommunication systems such as those of AT&T or Siemens
  - Large software system
  - Component / container based software architecture
  - Distributed computing system (5 to 5000 servers)
- Large data set: 500 MB per day of operation
- Validation of selected paths by domain experts

Failure-Specific Dynamic Recovery
- Failure-specific recovery scheme
- Risk levels for different failure types
- Dynamic recovery
  - Low risk:
    » Predicted probability of failure occurrence is below risk level
    » Leave out checkpointing and acceptance test
    » Reduce computational overhead
    » Gain efficiency
  - High risk:
    » Predicted probability of failure occurrence is above risk level
    » Checkpointing and acceptance test have to be carried out
    » Reduce lost computation in case of failure
[Figure. Timeline of repeated blocks: Computation, Checkpointing, Acceptance test, Computation, Checkpointing, Acceptance test, ...]
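The decision rule can be sketched as follows. This is only an illustration: the function names, the dictionary layout, and the numeric risk levels are hypothetical. The slide does not say what happens when the predicted probability equals the risk level, so the sketch assumes the conservative choice and treats that case as high risk.

```python
def is_high_risk(predicted, risk_levels):
    """True if any failure type's predicted probability reaches its
    risk level (equality counts as high risk: an assumption)."""
    return any(p >= risk_levels[failure] for failure, p in predicted.items())


def plan_steps(predicted, risk_levels):
    """Return the steps to execute for the next block of work."""
    steps = ["computation"]
    if is_high_risk(predicted, risk_levels):
        # High risk: pay the overhead to limit lost computation.
        steps += ["checkpointing", "acceptance test"]
    return steps


# Hypothetical per-failure-type risk levels
RISK_LEVELS = {"no_process_memory": 0.3, "no_shared_memory": 0.5}

print(plan_steps({"no_process_memory": 0.1, "no_shared_memory": 0.2}, RISK_LEVELS))
print(plan_steps({"no_process_memory": 0.6, "no_shared_memory": 0.2}, RISK_LEVELS))
```

Keeping one risk level per failure type lets an expensive-to-recover failure trigger checkpointing at a lower predicted probability than a cheap one.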

Evaluating Proactive Measures
- Patterns describe system behavior in the presence of faults: How does the system usually run into failure situations?
- Proactive techniques take countermeasures to prevent the system from running into failure situations.
- The model facilitates evaluation of proactive measures while they are applied to a running system.
[Figure. Labels: Normal operation; Proactive measure; Failure.]

Work in Progress
- Online learning
  - Include new patterns when failures are identified
  - Prune nodes that are rarely used
- Integration of health paths
  - Include cases where no failure occurred
- Introduce probability densities to nodes
  - Now: Ranges for node parameters
  - Future: Probability densities
  - A path's probability also depends on the deviation from the center of a given distribution
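The difference between ranges and densities can be made concrete with a small sketch. The slides do not name a distribution, so the Gaussian-shaped weight below is purely an assumption, as are the function names and parameters. The weight is left unnormalized so that an observation at the center of the distribution leaves the path's probability unchanged.

```python
import math


def range_match(value, low, high):
    """Current scheme: a node either matches (1.0) or does not (0.0)."""
    return 1.0 if low <= value <= high else 0.0


def density_weight(value, center, spread):
    """Assumed future scheme: a Gaussian-shaped weight that is 1.0 at
    the center and decays with the deviation from it."""
    z = (value - center) / spread
    return math.exp(-0.5 * z * z)


def path_weight(observations, nodes):
    """Multiply the per-node weights along one path.

    `nodes` is a list of (center, spread) pairs, one per observation.
    """
    weight = 1.0
    for value, (center, spread) in zip(observations, nodes):
        weight *= density_weight(value, center, spread)
    return weight


print(range_match(12.0, 8.0, 12.0))     # inside the range: full match
print(density_weight(12.0, 10.0, 2.0))  # one spread away from the center
```

With ranges, a value at the edge of the interval counts as much as one in the middle; with a density, the same value lowers the weight of the path.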

Conclusions
- Temporal: Temporal system behavior is directly incorporated into the model.
- Effective: Calculations during the model's application can be performed effectively. Only a depth-first search with a few additional multiplications and additions is needed.
- Intuitive: The model is intuitive since paths express correlations in a formalism that is easily understandable.
- Hybrid: It is extensible to a hybrid model since it can be supplemented by paths obtained from classic system analysis (within one model).
