Fault Management
Mani Subramanian, “Network Management: Principles and Practice”, Addison-Wesley, 2000.
Fault Management
The process of locating and correcting network problems and faults
o A fault is a failure of a network component that results in loss of connectivity
It is the most important functional management area
Process, 5 steps:
o Identify faults
  o Gather information via traps (linkDown, egpNeighborLoss) and polling
  o Traps alone may not be sufficient
  o Is a received trap an important one?
o Locate the fault
  o Detect all failed components and trace down the tree topology to the source (e.g., after an interface card failure on a router, all connected components will indicate a failure)
  o Isolate the fault with network and SNMP tools
  o Use artificial intelligence / correlation techniques
o Restore service (high priority)
o Identify the root cause of the problem (trouble ticket)
o Resolve the problem
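The "locate the fault" step can be illustrated as a walk over the topology tree: a failed component whose parent is still healthy is a likely source, while failures below it are probably side effects. The following is a minimal sketch, assuming a hypothetical topology given as a child-to-parent map and a set of components currently reporting failure; the function and device names are illustrative and not taken from the source.

```python
def locate_sources(parent, failed):
    """Return failed components whose parent is not failed (likely fault sources)."""
    sources = set()
    for node in failed:
        up = parent.get(node)  # None for the root of the tree
        if up is None or up not in failed:
            sources.add(node)
    return sources


# Hypothetical topology: an interface card on a router feeds a switch and two hosts.
parent = {
    "router": None,
    "if_card": "router",
    "switch": "if_card",
    "host_a": "switch",
    "host_b": "switch",
    "host_c": "router",
}
# Everything behind the interface card reports a failure.
failed = {"if_card", "switch", "host_a", "host_b"}

print(locate_sources(parent, failed))
```

Here only the interface card is reported, because every other failed component sits below another failed component.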
Network Restoration- example
[Figure: IP data layer (IP/MPLS, DiffServ packet QoS) over an intelligent transport routing/protection switch backbone; virtual router topology; collapsed hierarchy, improved efficiency]
Traffic is successfully restored only after failure notification and a round-trip configuration/confirmation:
o Failure detected
o Source notified
o Message received and resources configured; ACK sent
o Resources successfully set up; traffic restored
Preliminaries
An event is an exceptional condition in the operation of the network
o Software failure
o Performance bottleneck
o Configuration inconsistencies
o Intrusion attempts
Network management operations
o Monitoring events
o Interpreting events
o Handling events
A single problem event may cause many symptom events
o Correlating symptom events to identify and localize the underlying problems
Illustrative scenario
o A client application exchanges data over a TCP connection with a DB server
o Distinct domains, each administered by a different organization
Illustrative scenario
Problem scenario
o A clock at an interface in WAN2 that supports a T3 link loses sync 4 times a second for 0.25 ms
o This intermittent noise causes a loss of 0.1% of the T3 capacity
o This small noise causes bit errors in a large number of packets routed over link C-D
o Bit errors cause packet losses, either at routers (if the IP header is corrupted) or at destinations
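The 0.1% figure follows directly from the outage pattern: 4 outages of 0.25 ms each add up to 1 ms lost per second. A short sketch of the arithmetic is given below; the nominal T3 line rate of 44.736 Mbit/s is an assumption added here for illustration and is not stated in the source.

```python
OUTAGES_PER_SECOND = 4
OUTAGE_SECONDS = 0.25e-3          # 0.25 ms per loss of sync
T3_BITS_PER_SECOND = 44.736e6     # nominal T3 rate (assumption, not from the source)

# Seconds of outage per second of operation, i.e. the fraction of capacity lost.
lost_fraction = OUTAGES_PER_SECOND * OUTAGE_SECONDS
# Bits that fall inside an outage window each second at the assumed line rate.
lost_bits = lost_fraction * T3_BITS_PER_SECOND

print(f"fraction of capacity lost: {lost_fraction:.1%}")
print(f"bits affected per second: {lost_bits:,.0f}")
```

Although the fraction is small, the affected bits are spread over four separate bursts every second, which is why many distinct packets can be hit.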
Illustrative scenario
o Performance of the TCP connection degrades due to packet loss
  o The TCP sender interprets the loss as congestion and reduces its window
  o TCP increases its window gradually until a new packet loss occurs
  o Because of the recurring noise, the TCP window never grows large
o DB transactions by the client will last longer
o DB server performance will degrade due to record lock-outs, causing frequent aborts of remote transactions
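The window behavior described above can be illustrated with a toy additive-increase/multiplicative-decrease model. This is a deliberately simplified sketch, not a model of a real TCP stack: it ignores slow start and timeouts, and it assumes (hypothetically) that a loss occurs at a fixed interval of round trips. All names and parameter values are illustrative.

```python
def aimd_windows(rounds, loss_every, start=1.0, cap=64.0):
    """Trace of a congestion window under additive increase, multiplicative decrease.

    A loss is assumed on every `loss_every`-th round trip; on all other
    round trips the window grows by one segment, up to `cap`.
    """
    window, trace = start, []
    for rtt in range(1, rounds + 1):
        if rtt % loss_every == 0:
            window = max(1.0, window / 2)   # loss read as congestion: halve the window
        else:
            window = min(cap, window + 1)   # no loss: grow gradually
        trace.append(window)
    return trace


clean = aimd_windows(rounds=200, loss_every=1000)  # no loss within 200 round trips
noisy = aimd_windows(rounds=200, loss_every=5)     # frequent noise-induced loss

print("largest window without loss:", max(clean))
print("largest window with recurring loss:", max(noisy))
```

With recurring loss the window stays far below the cap, which is the mechanism behind the longer database transactions in the scenario.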
Illustrative scenario
Three important points
o Problems propagate among related objects and may be amplified by various protocol mechanisms
o A single problem can cause numerous observable events in multiple domains
o Some problems are not observable where they originate
  o The WAN2 domain may observe minor error events at the T3 interface, but these events may be indistinguishable from normal operating noise
  o WAN2 may be unaware that there is a problem
Challenges
o Determine which events to monitor and how to analyze them
  o Operations staff must know the operational parameters of managed objects and the significance of their events
o Correlate events and coordinate among different domains
o Automate the management activities (manual processing does not scale)
Modeling the Scenario
o Partition the system into multiple management domains (e.g., enterprise domain, ED, and router domain, RD)
o Each domain has a domain manager (DM) to monitor, correlate, and handle its events
o A DM may subscribe to receive notifications from other domains
o The ED sees the RD as a single entity connecting LAN1 and LAN2
Modeling the Scenario
o Any problem in the connection is seen as an RD problem
o Inside each domain, finer-grained correlation can determine the particular problem using symptoms from other domains
o Example: packet loss, in the form of degraded TCP performance, is detected by the ED, not by the RD. This symptom is received by the RD and can be correlated with other observable symptoms to isolate the “clock problem”.
[Figure callout: detects only IP header corruption]
Automating Event Management
o An automated event management system (AEMS) must accurately model and store knowledge of the underlying system and its associated events
  o Static information: associated with managed objects, such as SNMP traps, thresholds for MIB variables, etc.
  o Dynamic information: reflects addition, removal, and upgrades of network devices, etc.
o The process of automation is that of developing correlation algorithms to analyze observable events
o Correlation algorithms must be:
  o Scalable to large networks involving complex systems
  o Able to handle a large number of symptoms caused by a single problem
  o Fast: real-time correlation
  o Robust: the loss of a single alarm or the generation of a spurious event should not affect the decision (insensitive or resilient to noise)
Problems and Symptoms
o A problem is an event that can be handled directly, e.g., a faulty interface
  o Some problems are directly observable; others are observed indirectly through their symptoms
o Symptoms are observable events
  o Degraded application performance is a symptom of a faulty interface
  o Symptoms cannot be handled directly; they persist unless the problem is resolved
o Problems and symptoms propagate from one object to another
  o Noise in WAN → bit errors in link C-D → loss of packets at routers → poor TCP performance → frequent transaction aborts in the DB server
Event Correlation System
o Monitors typically collect managed data at network elements and detect out-of-tolerance conditions, generating appropriate alarms
o The correlator uses an event model to analyze these alarms
o The event model represents knowledge of various events and their causal relationships
  o The event model depends on the knowledge of human experts
o The correlator determines the common problems that caused the observed alarms
Event Knowledge
The Modeler’s event knowledge contains the following information for each class of managed objects:
o The data attributes of objects of this class (e.g., MIB variables)
o The set of events that are observable within instances of this class (e.g., a particular MIB variable is above threshold), or by asynchronous event notifications
o The set of events caused by each problem. This set can include events within the object, as well as events in other objects to which the object is related
o The problems that can originate within instances of this class
o The relationships in which an instance of the class can be involved
o The events and/or problems that are exported by instances of the class
Coding Approach for Event Correlation
o Treat the complete set of events caused by a problem as a “code” that identifies the problem
o Correlation is the process of decoding the set of observed symptoms
  o Determine which problem has these symptoms as its code
o Note: traditionally, alarms are correlated through searches over the event model knowledge base
  o The complexity of the search limits scalability
  o The event model is a large database, and the set of received alarms or symptoms may also be quite large
Coding Approach for Event Correlation
Two phases:
o Codebook selection
  o Select a subset of events for monitoring: the codebook
  o The codebook is an optimal subset of events that must be monitored to distinguish the problems of interest from one another
  o Ensure a desired level of noise tolerance
  o Algorithms must decode or infer the problem in the presence of lost alarms or spurious alarms
o Decoding
  o Find the problem whose associated symptoms (i.e., code) match the observed symptoms most closely
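The decoding phase can be sketched as a nearest-code search: compare the observed symptom vector with each problem's code and pick the closest. The example below is a minimal illustration, assuming a hypothetical three-problem codebook over six symptoms and using the number of differing bits as the closeness measure; none of the codes or names come from the source.

```python
def hamming(a, b):
    """Number of positions in which two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(codebook, observed):
    """Return (distance, problems) for the code(s) closest to the observed vector."""
    best = min(hamming(code, observed) for code in codebook.values())
    matches = sorted(p for p, code in codebook.items()
                     if hamming(code, observed) == best)
    return best, matches


# Hypothetical codebook: one code per problem, one bit per monitored symptom.
codebook = {
    "P1": (1, 1, 1, 0, 0, 0),
    "P2": (0, 0, 1, 1, 1, 0),
    "P3": (1, 0, 0, 0, 1, 1),
}

print(decode(codebook, (1, 1, 1, 0, 0, 0)))  # exact match
print(decode(codebook, (0, 1, 1, 0, 0, 0)))  # one symptom lost
print(decode(codebook, (1, 1, 1, 0, 0, 1)))  # one spurious symptom
```

Returning every problem at the minimum distance, rather than a single winner, makes ambiguous observations visible to the caller.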
Causality Graph Models
Correlation is concerned with the analysis of causality relations among events
o e → f denotes that event e causes event f
o Causality is a partial order relation between events
o The relation can be described by a graph whose nodes represent the events and whose edges represent causality
Causality Graph Models
[Figure: causality graph, with callouts marking an event that is neither a symptom nor a problem, and causal equivalence]
o A symptom caused by another symptom does not contribute any information about the problem
o All these indirect symptoms can be eliminated without loss of information
[Figure: correlation graph]
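One way to see why indirect symptoms carry no extra information is to derive, for each problem, the set of symptoms reachable from it in the causality graph. The sketch below assumes a small hypothetical graph (the events P1, P2, E1, S1, S2, S3 are invented for illustration) and computes a 0/1 vector per problem by following causality edges through intermediate events.

```python
def problem_codes(causes, problems, symptoms):
    """Map each problem to a 0/1 vector over `symptoms` using graph reachability."""
    codes = {}
    for p in problems:
        seen, stack = set(), [p]
        while stack:
            event = stack.pop()
            for nxt in causes.get(event, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        codes[p] = tuple(int(s in seen) for s in symptoms)
    return codes


# Hypothetical causality graph: E1 is neither a problem nor a symptom,
# and S3 is an indirect symptom caused only by symptom S2.
causes = {
    "P1": ["E1"],
    "E1": ["S1", "S2"],
    "P2": ["S2"],
    "S2": ["S3"],
}
symptoms = ["S1", "S2", "S3"]
codes = problem_codes(causes, ["P1", "P2"], symptoms)
print(codes)

# The S3 column always equals the S2 column, so S3 adds no distinguishing power.
print(all(code[1] == code[2] for code in codes.values()))
```

In this toy graph the intermediate event E1 disappears from the result entirely, and the indirect symptom S3 simply duplicates S2.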
Correlation
o Information contained in the correlation graph must be converted into codes, one for each problem in the graph
o A code for a problem p is a vector p of 0s and 1s; each bit corresponds to a symptom in the graph
o Example: the code is of length 3 (3 symptoms) after ordering of the symptoms (e.g., ):
  o The code for p1 is p1 = (1, 0, 1); this means p1 causes symptoms S3 and S9
  o p2 = (1, 1, 0) and p11 = (1, 0, 1)
[Figure: correlation graph]
o Event correlation is finding the problems whose codes optimally match an observed symptom vector
Correlation
o What happens when we observe symptoms S3 and S9?
  o Both P1 and P11 match the observed vector
  o We know there is a problem but cannot identify it, since both problems have identical codes
o What happens when we observe the symptom vector (0, 1, 0)?
  o Two possibilities: (1) a false event, or (2) P3 occurred but one symptom was lost
  o The interpretation depends on whether loss is more likely than false alarm generation
[Figure: correlation graph]
o If spurious or lost symptoms are unlikely, the information provided by S9 is redundant: (1, 0) and (1, 1) are sufficient to correlate event vectors
o The subset of symptoms required to provide the desired level of distinction between problems is called the codebook
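Both observations above can be checked mechanically: group problems by code to find those that cannot be told apart, and test whether dropping a symptom column reduces the number of distinct codes. The sketch below uses the three codes quoted for P1, P2, and P11; the helper names are illustrative, and the third column is taken to be S9 on the assumption that the symptom ordering ends with S9.

```python
from collections import defaultdict


def ambiguous_groups(codes):
    """Return groups of problems that share an identical code."""
    groups = defaultdict(list)
    for problem, code in codes.items():
        groups[code].append(problem)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def drop_column(codes, index):
    """Return the codes with the symptom at position `index` removed."""
    return {p: code[:index] + code[index + 1:] for p, code in codes.items()}


codes = {"P1": (1, 0, 1), "P2": (1, 1, 0), "P11": (1, 0, 1)}

print(ambiguous_groups(codes))

# Dropping the last column (assumed to be S9) keeps the same number of distinct codes.
reduced = drop_column(codes, 2)
print(reduced)
print(len(set(codes.values())) == len(set(reduced.values())))
```

This check only covers the noise-free case; when lost or spurious symptoms are likely, a column that looks redundant here may still be worth monitoring.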
Correlation- example
o The codebook contains only three symptoms
o The codebook distinguishes among all problems; however, it guarantees distinction by only a single symptom
  o A loss or spurious generation of S4 will result in a decoding error
o Distinction between problems is measured by the Hamming distance between their codes
  o The radius is ½ the Hamming distance
o This codebook is not resilient to noise
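The noise resilience of a codebook can be estimated from the smallest Hamming distance between any two of its codes. The sketch below computes that minimum distance and the corresponding radius for two hypothetical codebooks; the codes are invented for illustration and are not the ones shown on the slide.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions in which two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def codebook_radius(codebook):
    """Return (minimum pairwise Hamming distance, radius = half that distance)."""
    d_min = min(hamming(a, b) for a, b in combinations(codebook.values(), 2))
    return d_min, d_min / 2


# Hypothetical three-symptom codebook: two codes differ in only one symptom.
weak = {"P1": (1, 0, 0), "P2": (1, 1, 0), "P3": (0, 1, 1)}
# Hypothetical six-symptom codebook: every pair of codes differs in four symptoms.
strong = {
    "P1": (1, 1, 1, 0, 0, 0),
    "P2": (0, 0, 1, 1, 1, 0),
    "P3": (1, 0, 0, 0, 1, 1),
}

print(codebook_radius(weak))
print(codebook_radius(strong))
```

In the weak codebook a single lost or spurious symptom can turn one valid code into another, so the error goes unnoticed; monitoring more symptoms increases the distance at the cost of more events to collect.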
Correlation- example
o Event vectors {011100, , , } will be decoded as P1 with a single symptom loss, and {111110, } is interpreted as P1 with a single spurious symptom
o When two erroneous symptoms occur, the decoder will detect the error but cannot correctly (uniquely) decode the event (e.g., P1 and P4)
Correlation- Advantages