Fault Detection and Diagnosis. Outline Fault management functionality Event correlations concept Techniques.


Fault Detection and Diagnosis

Outline
- Fault management functionality
- Event correlation concept
- Techniques

Definitions
A fault may cause hundreds of alarms. We need to be able to do the following:
- Detect the existence of faults
- Locate faults
An alarm
- An external manifestation of a fault
  - Generated by components
  - Observable, e.g., via messages
An alarm represents a symptom of a fault.
An event
- An occurrence of interest, e.g., an alarm message

Fault Management Functionalities
Fault detection
- Should be real-time
- Techniques can be based on active schemes (e.g., polling) or event-based schemes (where a system component reports that it has detected a failure)
Fault location
- Is it a link, a system component, or an application component?
Determine corrective actions
Carry out corrective actions and determine their effectiveness

Alarm (Event) Correlation
Alarm explosion
- A single problem (e.g., a router is down) might trigger multiple symptoms
There could be too many alarms for an administrator to handle. Techniques used to help:
- Compression: reduction of multiple occurrences of an alarm into a single alarm
- Count: replacement of a number of occurrences of an alarm with a new alarm
- Suppression: inhibiting a low-priority alarm in the presence of a higher-priority alarm
- Boolean: substitution of a set of alarms satisfying a condition with a new alarm
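The four correlation techniques above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Alarm` record, its fields, and the convention that a larger `priority` means a more important alarm are assumptions, not part of the slides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str    # component that raised the alarm (hypothetical field)
    kind: str      # e.g. "link_down" (hypothetical field)
    priority: int  # assumed convention: larger means more important


def compress(alarms):
    """Compression: collapse repeated occurrences of an alarm into one."""
    seen, result = set(), []
    for alarm in alarms:
        if alarm not in seen:
            seen.add(alarm)
            result.append(alarm)
    return result


def count(alarms, threshold, new_kind):
    """Count: replace `threshold` or more occurrences of an alarm with a new alarm."""
    result = []
    for alarm, n in Counter(alarms).items():
        if n >= threshold:
            result.append(Alarm(alarm.source, new_kind, alarm.priority))
        else:
            result.extend([alarm] * n)
    return result


def suppress(alarms):
    """Suppression: drop alarms that are outranked by a higher-priority alarm."""
    if not alarms:
        return []
    top = max(a.priority for a in alarms)
    return [a for a in alarms if a.priority == top]


def boolean(alarms, condition, new_alarm):
    """Boolean: if the alarm set satisfies `condition`, substitute one new alarm."""
    return [new_alarm] if condition(alarms) else list(alarms)
```

For instance, `compress` applied to three identical "router down" alarms leaves a single alarm for the administrator to look at.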

Fault Diagnosis Major application of alarm correlation is fault diagnosis Useful in fault location

Faults and Alarms

The previous figure shows that correlation c1 detects the fault f1 and that correlation c2 detects the fault f2. Correlating c1 and c2 into the correlation c0 allows the diagnosis of the fault f0.

Example Let a1, a2, a3, a4, a5 be alarms generated by client processes, each indicating that a client process is not getting a response from a server. Since a1, a2, a3 were generated by client processes trying to contact the same server, correlation techniques can be used to infer that this server may be the problem. Similar comments apply to a4 and a5.

Example From the perspective of client processes, the servers (at the second level of the previous figure) are at fault. However, it may be observed that alarms were also generated by these two servers. These alarms indicate that each of the two servers is not getting a response and that both were trying to contact the same server. This is another correlation.

Rule-Based Reasoning
Based on expert systems
Intended to represent heuristic knowledge as rules.
Components
- Knowledge Base (KB): Contains the expert rules that describe the action to be taken when a specific condition occurs, e.g., if-then-else
- Working Memory (WM): Stores information such as the system/network topology and data collected through the monitoring of application and network components.

Rule-Based Reasoning
Components (continued)
- Inference engine: matches the current state of the system (as represented by the monitored data) against the left-hand side of a rule in the knowledge base in order to trigger the action.
The rules are meant to encapsulate expert knowledge
Why rule-based reasoning?
- Rules are interpreted, which means that rules can be changed without recompiling.
- Since expert knowledge can be wrong and/or incomplete, this feature is very useful.
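A minimal sketch of the three components, assuming rules are stored as (condition, action) pairs and the working memory is a plain dictionary. The rule contents and key names (`cpu_load`, `ping_ok`) are hypothetical.

```python
def run_rules(knowledge_base, working_memory):
    """Inference engine: fire every rule whose condition matches the current state."""
    actions = []
    for condition, action in knowledge_base:
        if condition(working_memory):
            actions.append(action)
    return actions


# Knowledge base: expert rules as (condition, action) pairs (hypothetical rules).
knowledge_base = [
    (lambda wm: wm["cpu_load"] > 0.9, "report overload"),
    (lambda wm: not wm["ping_ok"], "report host unreachable"),
]

# Working memory: monitored data about the system (hypothetical values).
working_memory = {"cpu_load": 0.95, "ping_ok": True}

fired = run_rules(knowledge_base, working_memory)
```

Because the rules are ordinary data, a rule that turns out to be wrong or missing can be replaced or appended at run time without touching `run_rules`.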

Approaches Model-based Fault propagation Model traversing Case-based reasoning

Model-Based Reasoning
Uses structural knowledge, e.g., via object-oriented models
To illustrate model-based reasoning, we will look at network equipment.
A network equipment class is used to describe a type of network equipment.
Network equipment classes are organized into a hierarchy using class/subclass relations.

Model-Based Reasoning
The root class is a generic class that represents the most general information common to all network equipment, e.g., vendor name.
The next level of the hierarchy describes the basic network equipment classes such as physical-trunk class and switch class.
Example: The trunk class refers to T1 trunk class and T3 trunk class.
An association class represents a relationship between two object classes that represents connectivity.

Model-Based Reasoning The network configuration model is constructed from the instances of individual network equipment. The network configuration model is a graph consisting of objects instantiated from the network equipment classes. If there is an edge between two objects then this implies that the two objects are connected.

Example [Figure: network configuration with nodes BGS, MC, WSC, Csd_syslab, Csd_res, Stats, Genetics, Geophys, zoo; legend: router, hub, fiber hub, 10Base-T, fiber optic]

Example
There is a Network Equipment class that has attributes used to describe a piece of network equipment. At a minimum assume that there is an attribute representing the name of the vendor.
You can subclass the following from the Network Equipment class:
- Connector
- Cable Segment

Example
The Connector class has two subclasses:
- Hub
- Router
The Cable Segment class also has two subclasses:
- 10Base-T
- Fiber Optic
The Hub class has two subclasses:
- Fiber Hub
- Local Hub

Example
Network Equipment
  Connector
    Hub
      Fiber Hub
      Local Hub
    Router
  Cable Segment
    10Base-T
    Fiber Optic
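The class hierarchy and the network configuration model can be sketched as ordinary Python classes. The `connect` method, the `neighbours` attribute, and the instance names are assumptions made for illustration; the slides only specify the class names and the vendor attribute.

```python
class NetworkEquipment:
    """Root class: information common to all network equipment."""

    def __init__(self, name, vendor="unknown"):
        self.name = name
        self.vendor = vendor
        self.neighbours = set()  # edges of the network configuration graph

    def connect(self, other):
        """Add an undirected edge: the two objects are connected."""
        self.neighbours.add(other)
        other.neighbours.add(self)


class Connector(NetworkEquipment):
    """Equipment that joins cable segments."""


class CableSegment(NetworkEquipment):
    """A physical cable."""


class Hub(Connector):
    """Generic hub."""


class Router(Connector):
    """Router."""


class TenBaseT(CableSegment):
    """10Base-T cable segment."""


class FiberOptic(CableSegment):
    """Fiber optic cable segment."""


class FiberHub(Hub):
    """Hub with fiber ports."""


class LocalHub(Hub):
    """Hub serving a local segment."""


# A tiny configuration model: router -- 10Base-T cable -- local hub.
router = Router("r1", vendor="ExampleCo")
cable = TenBaseT("c1")
hub = LocalHub("h1")
router.connect(cable)
cable.connect(hub)
```

The configuration model is then the graph formed by the `neighbours` sets of the instantiated objects.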

Model-Based Reasoning
Message Class Hierarchy
- Describes messages created by elements of a network configuration model.
- These messages report symptoms. They are essentially alarms.

Example A hub can generate a message stating that one of its ports is “bad”. In fact, there is a message associated with each port. A hub can generate a message stating that the cable segment on a particular port does not seem to be functioning. This can be encapsulated in a message class called Hub_Report_Port_Bad. This can be subclassed into messages generated by each type of hub.

Model-Based Reasoning
Correlations
- Conditions under which the correlations are asserted
- Correlations are stated in terms of message classes.
- At run-time there is a match between an instantiation and a message class.
- Example: If a local hub is connected to more than one router and there are two messages from the routers indicating that there are problems connecting to the local hub, then this is a good indication that the hub is the problem.
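The local-hub example can be written as a small correlation check. This is a sketch under assumed data shapes: `connections` maps a hub name to the set of routers connected to it, and each message is a `(router, hub)` pair meaning "router reports a problem connecting to hub". All names are hypothetical.

```python
def suspect_local_hubs(connections, messages):
    """Return hubs connected to more than one router for which at least
    two distinct connected routers reported a connection problem."""
    suspects = []
    for hub, routers in connections.items():
        reporting = {r for r, h in messages if h == hub and r in routers}
        if len(routers) > 1 and len(reporting) >= 2:
            suspects.append(hub)
    return suspects


connections = {"hub1": {"r1", "r2"}, "hub2": {"r3"}}
messages = [("r1", "hub1"), ("r2", "hub1"), ("r3", "hub2")]
suspects = suspect_local_hubs(connections, messages)
```

Here `hub2` is not flagged: a single report from its only router does not distinguish a hub fault from a router or cable fault.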

Fault Propagation Based on models that describe which symptoms will be observed if a specific fault occurs. Monitors typically collect managed data at network elements and detect out of tolerance conditions, generating appropriate alarms. An event model is used by a management application to analyze these alarms. The event model represents knowledge of events and their causal relationships.

Fault Propagation (Coding Approach)
Correlation is concerned with the analysis of causal relations among events. The notation e → f denotes that the event e causes the event f. Causality is a partial order between events. The relation → may be described by a causality graph whose directed edges represent causality.
Distinguish between fault events (faults or problems) and symptom events (symptoms). Nodes of a causality graph may be marked as problems (P) or symptoms (S). Some symptoms are not directly caused by faults, but rather by other symptoms.

Fault Propagation (Coding Approach) Example Causality Graph

Fault Propagation (Coding Approach)
The correlation problem
- A correlation p → s means that problem p can cause a chain of events leading to the symptom s.
- This can be represented by a graph.

Fault Propagation (Coding Approach) A Correlation Graph

Fault Propagation (Coding Approach)
For each fault (problem) p, the correlation graph provides a vector that summarizes the information available about which symptoms p can cause. This is referred to as the code of the problem.
Alarms may also be described using a vector assigning measures of 1 and 0 to observed and unobserved symptoms.
The alarm correlation problem is that of finding problems whose codes optimally match an observed alarm vector.

Fault Propagation (Coding Approach)
Example codes (look at the correlation graph example); the vector entries correspond to symptoms 6, 9 and 10, in that order
- 1 = (0,1,1): This indicates that problem 1 causes symptoms 9 and 10
- 2 = (1,0,1): This indicates that problem 2 causes symptoms 6 and 10
- 11 = (0,1,1): This indicates that problem 11 causes symptoms 9 and 10

Fault Propagation (Coding Approach)
Example alarm vector
- Assume that alarms indicating symptoms 9 and 10 have been observed.
- a = (0,1,1)
We can infer that either problem 1 or problem 11 matches the observation a. These two problems have identical codes and hence are indistinguishable. The fault management application may have to do additional tests.
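The matching step can be sketched as follows, using the three example codes. The sketch assumes that "optimally match" means minimal Hamming distance between a code and the alarm vector; the function names are illustrative.

```python
# Codes from the example; vector entries stand for symptoms 6, 9 and 10.
CODES = {1: (0, 1, 1), 2: (1, 0, 1), 11: (0, 1, 1)}


def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def best_matches(codes, alarm_vector):
    """Return all problems whose code is closest to the observed alarm vector."""
    best = min(hamming(code, alarm_vector) for code in codes.values())
    return sorted(p for p, code in codes.items() if hamming(code, alarm_vector) == best)


candidates = best_matches(CODES, (0, 1, 1))
```

When more than one problem is returned, as for problems 1 and 11 here, the codes alone cannot separate them and further tests are needed.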

Fault Propagation (Coding Approach)
A codebook is an array of the vectors just defined.
The number of symptoms associated with a single problem may be very large.
- Sometimes a much smaller set of symptoms is selected to accomplish a desired level of distinction among problems.

Fault Propagation (Coding Approach) Example Codebook [Tables on two slides: one column per problem (p1, p2, p3, p4, p5, ...); the entries did not survive extraction]

Fault Propagation (Coding Approach)
Distinction among problems is measured by the Hamming distance between their codes.
The radius of a codebook is one half of the minimal Hamming distance among codes.
When the radius is 0.5, the minimal Hamming distance is 1, so any two codes differ in at least one symptom and the codebook provides distinction between problems.
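A short sketch of the radius computation. The codebooks used here are made-up examples, not the ones from the slides.

```python
from itertools import combinations


def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def codebook_radius(codebook):
    """Half of the minimal Hamming distance over all pairs of codes."""
    distances = [hamming(u, v) for u, v in combinations(codebook.values(), 2)]
    return min(distances) / 2


# Hypothetical codebooks: problem name -> code over the selected symptoms.
distinct = {"p1": (1, 0, 0), "p2": (0, 1, 0), "p3": (1, 1, 1)}
barely_distinct = {"p1": (1, 0), "p2": (1, 1)}
ambiguous = {"p1": (0, 1, 1), "p2": (0, 1, 1)}
```

A radius of 0 means that at least two problems share a code and cannot be told apart by the selected symptoms.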

Fault Propagation (Coding Approach)
Is this easy to apply to application processes?
- No
Why?
- Applications are dynamic
- The coding approach assumes the system is fairly static.

Model Traversing
Reconstructs fault propagation at run time using relationships between objects
Begins with the managed object that generated the event
Works best when the object relationships are graph-like and easy to obtain, since they must be obtained at run-time
- Performance
- Potential parallelism
Weaknesses
- Lack of flexibility
- Not well-structured like fault propagation

Model Traversing
Characteristics
- Event-Driven: The fault management application is passive until an event arrives. This event is the reporting of a symptom.
- Correlation: Decides whether two events result from the same primary fault.
- Relationship Exploration: The fault management application correlates events by detecting special relationships between the source objects of those events.

Model Traversing
Event reports should have the following information:
- Symptom type
- Source
- Target
- etc.
If symptom s_i's target is the same as s_j's source, then this is an indication that s_i is a secondary symptom. This allows us to ignore certain alarms.
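The secondary-symptom test can be sketched as a filter over event reports. Reports are reduced to `(source, target)` pairs here, and the sample reports are hypothetical.

```python
def primary_symptoms(reports):
    """Keep only reports whose target is not itself the source of another report.

    A report (s, t) is treated as secondary when t also appears as a source:
    the object it complains about is itself complaining about something else.
    """
    sources = {source for source, _ in reports}
    return [(source, target) for source, target in reports if target not in sources]


# Hypothetical cascade: P2 times out on P1, which is itself timing out on P4.
reports = [("P2", "P1"), ("P1", "P4"), ("P3", "P4")]
remaining = primary_symptoms(reports)
```

Only the reports that point at an object which has not reported anything itself remain for further diagnosis.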

Model Traversing
For each event, construct a graph of objects (models) related to the source object of that event.
When two such graphs touch each other, i.e., contain at least one common object, the events which initiated their construction are regarded as correlated. Possibly these two events are the result of the same fault.
If s_i is correlated with s_j and s_j is correlated with s_k, then through transitivity we can conclude that s_i is a secondary symptom.
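A minimal sketch of the "graphs touch" test, assuming the object relationships are available as an adjacency dictionary and that each graph is grown to a fixed number of hops. The `depth` parameter and the object names are assumptions.

```python
def related_objects(relations, source, depth):
    """Collect the objects reachable from `source` within `depth` relationship hops."""
    seen = {source}
    frontier = [source]
    for _ in range(depth):
        next_frontier = []
        for obj in frontier:
            for neighbour in relations.get(obj, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen


def correlated(relations, source_a, source_b, depth=2):
    """Two events are correlated if the graphs built from their sources share an object."""
    graph_a = related_objects(relations, source_a, depth)
    graph_b = related_objects(relations, source_b, depth)
    return bool(graph_a & graph_b)


# Hypothetical relationships: which object depends on which.
relations = {
    "web": ["db_host"],
    "batch": ["db_host"],
    "mail": ["mail_host"],
}
```

With these relationships, events from `web` and `batch` are correlated through `db_host`, while an event from `mail` is not correlated with either.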

Model Traversing
The process of eliminating symptom reports may result in reports that have the same target.
Example:
- s_1 and t
- s_2 and t
It might be necessary to construct possible paths of objects between s_1 and t as well as between s_2 and t.
Nodes in common are good candidates for the faults.

Model Traversing
We will now discuss the building of graphs.
The algorithm for building graphs uses relationships between network hardware and software components to search for the root cause of a problem.
It assumes that information about the relationships between the components is available (e.g., through a database).
It assumes that there are functions including these:
- getNextHop(source, target, B): Get the node representing the next entity (the one that comes after B) on the path between source and target. Note that this may return more than one entity.
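One way to build paths on top of such a function is sketched below. The slides only give the signature of getNextHop; the Python name `get_next_hop`, the table-driven implementation, and the depth-first enumeration are assumptions made for illustration.

```python
def build_paths(get_next_hop, source, target):
    """Enumerate every path from source to target by repeatedly asking for the next hop.

    get_next_hop(source, target, current) is assumed to return a list of
    entities that may follow `current` on the way from source to target.
    """
    paths = []
    stack = [[source]]
    while stack:
        path = stack.pop()
        current = path[-1]
        if current == target:
            paths.append(path)
            continue
        for nxt in get_next_hop(source, target, current):
            if nxt not in path:  # avoid cycles
                stack.append(path + [nxt])
    return paths


# Hypothetical relationship table standing in for the database lookup.
NEXT = {
    "P1": ["chocolate"],
    "chocolate": ["hub"],
    "hub": ["strawberry"],
    "strawberry": ["rpcd.strawberry"],
    "rpcd.strawberry": ["P4"],
}


def get_next_hop(source, target, current):
    return NEXT.get(current, [])


paths = build_paths(get_next_hop, "P1", "P4")
```

Because a lookup may return several entities, `build_paths` returns a list of paths rather than a single path.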

Model Traversing Example
Assume the following configuration of processes and machines. All machines are connected through the Ethernet.
- P1 is on chocolate; P2 is on peppermint
- P3 is on vanilla; P4 is on strawberry
- P5 is on doublefudge; P6 is on mintchip
Communication is through remote procedure calls. This basically requires that all communication go through a daemon process on the server host's machine. We will call this rpcd.

Model Traversing The call structure is depicted in the following graph: [Figure: call graph over the processes P1, P2, P3, P4, P5 and P6; edges not preserved]

Model Traversing Example
Assume that P4 terminates abnormally, causing a cascade of timeouts.
Correlation will result in focusing on these event reports:
- (P1,P4)
- (P3,P4)
Not enough to diagnose the fault.
- It's all at the process level.
- There are still many entities or objects to examine since you do not want everything generating a message.

Model Traversing Example Starting with P1 the next component (node) along the path of the connection between P1 and P4 is identified. Between P1 and P4 are many entities. We will start out with a vertical search which basically results in the fact that P1 is running on a host machine called chocolate

Model Traversing
chocolate is connected to the hub through an ethernet connection cable. The hub is connected through an ethernet connection cable to strawberry, where P4 is running. Thus we can say that the path is the following:
- P1, chocolate, ethernet connection cable, hub, ethernet connection cable, strawberry, rpcd.strawberry, P4
The path between P3 and P4 is the following:
- P3, vanilla, ethernet connection cable, hub, ethernet connection cable, strawberry, rpcd.strawberry, P4

Model Traversing Example
This suggests that we can narrow down the problem to the entities common to both paths: hub, ethernet connection cable, strawberry, rpcd.strawberry, P4.
At this point, the fault management application may want to poll for additional information. The polling may check to see if something is up or not. An example is applying the ping operation to the host machine called strawberry.
What if every entity is up? This may indicate that strawberry is overloaded. An indication of an overload can be found by measuring the CPU load.
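The narrowing step amounts to intersecting the two paths. In this sketch the cables are given distinct labels such as `cable:hub-strawberry` so that the cable into chocolate is not confused with the cable into strawberry; those labels are an assumption, since the slides call every cable "ethernet connection cable".

```python
def common_candidates(*paths):
    """Return the entities that appear on every path, in the order of the first path."""
    common = set(paths[0]).intersection(*paths[1:])
    return [node for node in paths[0] if node in common]


path_p1 = ["P1", "chocolate", "cable:chocolate-hub", "hub",
           "cable:hub-strawberry", "strawberry", "rpcd.strawberry", "P4"]
path_p3 = ["P3", "vanilla", "cable:vanilla-hub", "hub",
           "cable:hub-strawberry", "strawberry", "rpcd.strawberry", "P4"]

candidates = common_candidates(path_p1, path_p3)
```

Each remaining candidate can then be polled (for example with a ping or a CPU load query) to separate "down" from "overloaded".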

Model Traversing Building the graphs requires structural information and the use of rules.

Model Traversing Implementation
What management services are needed?
- To detect and report symptoms, one could use application instrumentation.
- The instrumentation library should most likely talk with a management process (or agent).
- The agent sends an event report to the event server.
- The event server may have a set of rules for symptom correlation.
- After correlation, a task may be invoked that does relationship exploration and the final diagnosis.

Model Traversing Implementation
Information Needed
- Information representing the relationships between hardware components and software components is needed.
- This needs to be stored in a database or a directory service (e.g., X.500).
- An API needs to be defined to retrieve this information.
- Rules can be used to help construct the graph.

Model Traversing Implementation
Information Needed
- How is the information collected?
- Many different techniques. Examples include:
  - Processes (using instrumentation) may have to register and have their information put into the database.
  - Network information may have to be entered manually.

Model Traversing Summary
- Performs very quickly once the model is built
  - The model can be constructed incrementally during normal processing; there is no need to wait until a failure
- Can operate in parallel
- Can accommodate multiple events; different starting points can result in the same problem element
- Requires a model that reflects the run-time system
  - A system that changes too fast is a problem

Case-based Reasoning (CBR)
Objective
- Learn from experience
- Solutions to novel problems
- Avoid extensive maintenance
Basic idea: recall, adapt and execute episodes of former problem-solving in an attempt to deal with a current problem

Case-based Reasoning Approach

Case-Based Reasoning Strategy
Useful for domains in which a body of knowledge with a case structure exists or is easily obtainable
Case structure:
- Set of fields or “slots”
- Capture “essential” information
Yield discriminators
- Set of fields highly correlated with problems or solutions
Need to find “closest” match
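Finding the "closest" match can be sketched as a similarity score over slots. The scoring rule (fraction of the problem's slots that a case matches exactly), the slot names, and the case library are all assumptions for illustration; real systems typically weight the discriminating slots.

```python
def similarity(case_slots, problem_slots):
    """Fraction of the problem's slots whose values the stored case matches exactly."""
    if not problem_slots:
        return 0.0
    matches = sum(
        1 for key, value in problem_slots.items()
        if key in case_slots and case_slots[key] == value
    )
    return matches / len(problem_slots)


def closest_case(case_library, problem_slots):
    """Recall the stored case most similar to the current problem."""
    return max(case_library, key=lambda case: similarity(case["slots"], problem_slots))


# Hypothetical case library.
case_library = [
    {"slots": {"symptom": "timeout", "layer": "application", "host_up": True},
     "solution": "restart server process"},
    {"slots": {"symptom": "timeout", "layer": "network", "host_up": False},
     "solution": "check cable and hub port"},
]

problem = {"symptom": "timeout", "layer": "network", "host_up": False}
recalled = closest_case(case_library, problem)
```

The recalled solution is only a starting point; the adapt step still has to decide whether it applies to the current problem.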

Case-based Reasoning Adapt

Case-based Reasoning Summary
- Needs well-defined cases
- Likely to work well when problems are “close” to existing solutions
- Problem selecting solutions when “not so close”
  - Dangerous in following actions?
  - How to adapt?

Summary
- Variety of approaches
- Mostly applied in network management scenarios
  - More controlled?
  - Better understanding of problems?
- Limited experience in application management