Fault Management*

* Mani Subramanian, “Network Management: Principles and Practice”, Addison-Wesley, 2000.


Fault Management
- The process of locating and correcting network problems and faults
  o A fault is a failure of a network component that results in loss of connectivity
- It is the most important functional management area
- Process, 5 steps:
  1. Identify faults
     o Gather information via traps (linkDown, egpNeighborLoss) and polling
     o Traps may not be sufficient
     o Is a received trap an important one?
  2. Locate the fault
     o Detect all failed components and trace down the tree topology to the source (e.g., an interface card failure on a router -> all connected components will indicate a failure)
     o Isolate the fault with network and SNMP tools
     o Use artificial intelligence / correlation techniques
  3. Restore service (high priority)
  4. Identify the root cause of the problem (trouble ticket)
  5. Resolve the problem
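The "trace down the tree topology to the source" step can be sketched as follows. This is a minimal illustration, not a method given in the slides: the topology, the node names, and the function `locate_root_faults` are all hypothetical. It assumes the topology is a tree stored as a child-to-parent map and reports every failed node whose parent is still healthy.

```python
def locate_root_faults(parent, failed):
    """Return the failed nodes whose parent has not failed (topmost failures)."""
    return sorted(node for node in failed if parent.get(node) not in failed)


# Hypothetical topology: child -> parent (None marks the root of the tree).
parent = {
    "router": None,
    "if1": "router",
    "sw1": "if1",
    "hostA": "sw1",
    "hostB": "sw1",
    "if2": "router",
    "hostC": "if2",
}

# An interface card failure makes everything behind it report a failure too.
failed = {"if1", "sw1", "hostA", "hostB"}

print(locate_root_faults(parent, failed))  # ['if1']
```

Everything below `if1` is reported as failed, but only `if1` has a healthy parent, so it is the candidate source.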

Network Restoration: example
[Figure: IP data layer (IP/MPLS, DiffServ packet QoS) over an intelligent transport routing/protection-switch backbone; virtual router topology; collapsed hierarchy, improved efficiency]
- Traffic is successfully restored only after failure notification and a round-trip configuration/confirmation.
- Sequence shown in the figure:
  o Failure detected
  o Source notified
  o Message received and resources configured; ACK sent
  o Resources successfully set up; traffic restored

Preliminaries
- An event is an exceptional condition in the operation of the network
  o Software failure
  o Performance bottleneck
  o Configuration inconsistencies
  o Intrusion attempts
- Network management operations
  o Monitoring events
  o Interpreting events
  o Handling events
- A single problem event may cause many symptom events
  o Symptom events are correlated to identify and localize the underlying problems

Illustrative scenario
- A client application exchanges data over a TCP connection with a DB server
- The path crosses distinct domains, each administered by a different organization

Illustrative scenario
Problem scenario
- A clock at an interface in WAN2 that supports a T3 link loses sync 4 times a second for 0.25 ms each time
  -> intermittent noise causing a loss of 0.1% of the T3 capacity (4 x 0.25 ms = 1 ms out of every second)
  -> this small amount of noise causes bit errors in a large number of packets routed over C-D
- Bit errors cause packet losses, either at routers (if the IP header is corrupted) or at destinations

Illustrative scenario
- Performance of the TCP connection degrades due to packet loss
- The TCP sender interprets the loss as congestion and reduces its window
- TCP increases its window gradually until the next packet loss
- Because the noise keeps causing losses, the TCP window never grows large
- DB transactions by the client will last longer
- DB server performance will degrade due to record lock-outs, causing frequent aborts of remote transactions
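The window behavior described above can be illustrated with a toy additive-increase/multiplicative-decrease loop. This is a deliberately simplified sketch, not real TCP: it ignores slow start, timeouts, and fast recovery, and the parameters (`loss_every`, `max_window`) are assumptions chosen only to make the effect visible.

```python
def simulate_aimd(rounds, loss_every, max_window=64):
    """Toy AIMD: +1 segment per round, halve on loss (loss_every=0 means no loss)."""
    window, history = 1, []
    for r in range(1, rounds + 1):
        if loss_every and r % loss_every == 0:
            window = max(1, window // 2)          # loss is read as congestion
        else:
            window = min(max_window, window + 1)  # gradual increase
        history.append(window)
    return history


clean = simulate_aimd(100, loss_every=0)  # no noise on the path
noisy = simulate_aimd(100, loss_every=5)  # a loss every 5th round

print(max(clean), max(noisy))
```

With periodic losses the window keeps being cut back before it can grow, which is the throughput degradation the client observes.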

Illustrative scenario
Three important points
- Problems propagate among related objects and may be amplified by various protocol mechanisms
- A single problem can cause numerous observable events in multiple domains
- Some problems are not observable where they originate:
  o The WAN2 domain may observe minor error events at the T3 interface, but these events may be indistinguishable from normal operating noise -> WAN2 may be unaware that there is a problem
Challenges
- Determine which events to monitor and how to analyze them
  o Operations staff must know the operational parameters of managed objects and the significance of their events
- Correlation of events and coordination among different domains
- Automating the management activities (manual processing does not scale)

Modeling the Scenario
- Partition the system into multiple management domains (e.g., an enterprise domain, ED, and a router domain, RD)
- Each domain has a domain manager (DM) to monitor, correlate and handle its events
- A DM may subscribe to receive notifications from other domains
- The ED sees the RD as a single entity connecting LAN1 and LAN2

Modeling the Scenario
- Any problem in the connection is seen by the ED as an RD problem
- Inside each domain, finer-grained correlation can determine the particular problem using symptoms from other domains
- Example: packet loss, seen as degraded TCP performance, is detected by the ED, not by the RD
  -> this symptom is received by the RD and can be correlated with other observable symptoms to isolate the “clock problem”
[Figure annotation: detects only IP header corruption]

Automating Event Management
- An automated event management system (AEMS) must accurately model and store knowledge of the underlying system and its associated events
  o Static information: associated with managed objects, such as SNMP traps, thresholds for MIB variables, etc.
  o Dynamic information: reflects additions, removals, and upgrades of network devices, etc.
- Automation consists of developing correlation algorithms to analyze observable events
- Correlation algorithms must be:
  o Scalable to large networks involving complex systems
  o Able to handle a large number of symptoms caused by a single problem
  o Fast (real-time correlation)
  o Robust: the loss of a single alarm or the generation of a spurious event should not affect the decision -> insensitive or resilient to noise

Problems and Symptoms
- A problem is an event that can be handled directly, e.g., a faulty interface
  o Some problems are directly observable; others can be observed only indirectly, through their symptoms
- Symptoms are observable events
  o Degraded application performance is a symptom of a faulty interface
  o Symptoms cannot be handled directly; they persist until the problem is resolved
- Problems and symptoms propagate from one object to another
  o Noise in the WAN -> bit errors on link C-D -> loss of packets at routers -> poor TCP performance -> frequent transaction aborts in the DB server

Event Correlation System
- Monitors typically collect managed data at network elements and detect out-of-tolerance conditions, generating appropriate alarms
- The correlator uses an event model to analyze these alarms
- The event model represents knowledge of various events and their causal relationships
  o The event model depends on the knowledge of human experts
- The correlator determines the common problems that caused the observed alarms
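The monitor's role (detecting out-of-tolerance conditions and raising alarms) can be sketched as a simple threshold check. This is an illustrative assumption about how such a monitor might look, not the system described in the slides; the object names, the sampled values, and the threshold table are hypothetical.

```python
def detect_alarms(samples, thresholds):
    """Return (object, variable, value) for every sample above its threshold."""
    alarms = []
    for obj, variable, value in samples:
        limit = thresholds.get(variable)
        if limit is not None and value > limit:
            alarms.append((obj, variable, value))
    return alarms


# Hypothetical per-variable thresholds and polled samples.
thresholds = {"ifInErrors": 100, "cpuLoad": 90}
samples = [
    ("routerC.t3", "ifInErrors", 250),
    ("routerD.t3", "ifInErrors", 12),
    ("dbServer", "cpuLoad", 95),
    ("dbServer", "diskFree", 5),  # no threshold configured, so never alarmed
]

for alarm in detect_alarms(samples, thresholds):
    print(alarm)
```

The alarms produced this way are the raw input that the correlator then analyzes against the event model.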

Event Knowledge
The modeler’s event knowledge contains the following information for each class of managed objects:
- The data attributes of objects of this class (e.g., MIB variables)
- The set of events that are observable within instances of this class (e.g., a particular MIB variable is above threshold), or by asynchronous event notifications
- The set of events caused by each problem. This set can include events within the object, as well as events in other objects to which the object is related
- The problems that can originate within instances of this class
- The relationships in which an instance of the class can be involved
- The events and/or problems that are exported by instances of the class

Coding Approach for Event Correlation
- Treat the complete set of events caused by a problem as a “code” that identifies the problem
- Correlation is the process of decoding the set of observed symptoms
  o Determine which problem has these symptoms as its code
  o Note: traditionally, alarms are correlated through searches over the event model knowledge base
- The complexity of the search limits scalability
  o The event model is a large database, and the set of received alarms or symptoms may also be quite large

Coding Approach for Event Correlation
Two phases:
- Codebook selection
  o Select a subset of events for monitoring: the codebook
  o The codebook is an optimal subset of events that must be monitored to distinguish the problems of interest from one another
  o Ensure a desired level of noise tolerance
  o Algorithms must decode or infer the problem in the presence of lost alarms or spurious alarms
- Decoding
  o Find the problem whose associated symptoms (i.e., code) match the observed symptoms most closely
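The slides do not give the selection algorithm, so the following is only a greedy heuristic sketch of the codebook selection phase, and it is not guaranteed to find the optimal subset. It assumes each problem's full code is known as a 0/1 tuple and keeps adding the symptom column that most improves pairwise separation until every pair of problems differs in at least `d` selected symptoms. The codes and the names `pair_score` and `select_codebook` are hypothetical.

```python
from itertools import combinations


def pair_score(codes, cols, d):
    """Sum over problem pairs of min(d, Hamming distance restricted to cols)."""
    total = 0
    for a, b in combinations(codes.values(), 2):
        dist = sum(a[c] != b[c] for c in cols)
        total += min(d, dist)
    return total


def select_codebook(codes, d=1):
    """Greedily pick symptom indices until all problem pairs differ in >= d of them."""
    n_symptoms = len(next(iter(codes.values())))
    n_pairs = len(codes) * (len(codes) - 1) // 2
    chosen = []
    while pair_score(codes, chosen, d) < d * n_pairs:
        current = pair_score(codes, chosen, d)
        remaining = [c for c in range(n_symptoms) if c not in chosen]
        best = max(remaining, key=lambda c: pair_score(codes, chosen + [c], d),
                   default=None)
        if best is None or pair_score(codes, chosen + [best], d) == current:
            raise ValueError("problems cannot be separated at the requested distance")
        chosen.append(best)
    return sorted(chosen)


# Hypothetical full codes over four symptoms (indices 0..3).
codes = {"P1": (1, 1, 0, 1), "P2": (1, 0, 1, 1), "P3": (0, 1, 1, 0)}

print(select_codebook(codes, d=1))  # [0, 1]
print(select_codebook(codes, d=2))  # [0, 1, 2]
```

Asking for a larger `d` buys more noise tolerance at the cost of monitoring more symptoms, which is the trade-off the codebook selection phase has to settle.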

Causality Graph Models
- Correlation is concerned with the analysis of causality relations among events
  o e -> f denotes that event e causes event f
  o Causality is a partial order relation between events
  o The relation -> can be described by a graph whose nodes represent the events and whose edges represent causality

Causality Graph Models
[Figure: causality graph and the correlation graph derived from it; annotations mark an event that is neither a symptom nor a problem, and causal equivalence]
- A symptom caused by another symptom does not contribute any information about the problem
- All such indirect symptoms can be eliminated without loss of information
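One way to picture the step from a causality graph to a problem-to-symptom mapping is a reachability computation: for each problem, collect the observable symptoms it causes directly or through intermediate events. The sketch below does only that; it does not merge causally equivalent events or prune indirect symptoms, and the graph and the name `correlation_graph` are hypothetical.

```python
def correlation_graph(edges, problems, symptoms):
    """Map each problem to the sorted list of symptoms reachable from it."""
    successors = {}
    for cause, effect in edges:
        successors.setdefault(cause, set()).add(effect)

    result = {}
    for problem in problems:
        seen, stack = set(), [problem]
        while stack:  # depth-first walk; `seen` also guards against cycles
            node = stack.pop()
            for nxt in successors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result[problem] = sorted(seen & set(symptoms))
    return result


# Hypothetical causality edges (cause, effect); E1 is neither problem nor symptom.
edges = [("P1", "E1"), ("E1", "S1"), ("P1", "S2"), ("P2", "S2")]

print(correlation_graph(edges, ["P1", "P2"], {"S1", "S2"}))
# {'P1': ['S1', 'S2'], 'P2': ['S2']}
```

Intermediate events such as `E1` drop out of the result because only problems and observable symptoms are kept.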

Correlation
- The information contained in the correlation graph must be converted into codes, one for each problem in the graph
- A code for a problem P is a vector p of 0s and 1s. Each bit corresponds to a symptom in the graph
- Example: the code has length 3 (3 symptoms), after ordering the symptoms
  o The code for P1 is p1 = (1, 0, 1), which means P1 causes symptoms S3 and S9
  o p2 = (1, 1, 0) and p11 = (1, 0, 1)
- Event correlation is finding the problems whose codes optimally match an observed symptom vector
[Figure: correlation graph]
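Turning a problem-to-symptom mapping into code vectors can be sketched as below. The codes reproduce the values on the slide, but the name of the middle symptom is missing from the transcript, so `Sx` is a placeholder, and `build_codes` is a hypothetical helper.

```python
def build_codes(causes, symptom_order):
    """Encode each problem as a 0/1 tuple, one bit per symptom in the given order."""
    return {
        problem: tuple(int(s in caused) for s in symptom_order)
        for problem, caused in causes.items()
    }


# "Sx" stands in for the middle symptom, whose name is not recoverable here.
symptom_order = ["S3", "Sx", "S9"]
causes = {
    "P1": {"S3", "S9"},
    "P2": {"S3", "Sx"},
    "P11": {"S3", "S9"},
}

codes = build_codes(causes, symptom_order)
print(codes)  # {'P1': (1, 0, 1), 'P2': (1, 1, 0), 'P11': (1, 0, 1)}
```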

Correlation
- What happens when we observe symptoms S3 and S9?
  o Both P1 and P11 match the observed vector
  o We know there is a problem but cannot identify it, since both problems have identical codes
- What happens when we observe the vector (0, 1, 0)? Two possibilities:
  (1) a false (spurious) event, or
  (2) P2 occurred but one symptom was lost
  o The interpretation depends on whether loss is more likely than false alarm generation
- If spurious or lost symptoms are unlikely, the information provided by S9 is redundant -> the codes (1, 0) and (1, 1) are sufficient to correlate event vectors
- The subset of symptoms required to provide the desired level of distinction between problems is called the codebook
[Figure: correlation graph]
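Both observations above can be checked with a few lines, using the codes from the example. The helper names `exact_matches` and `project` are hypothetical; the sketch only performs exact matching and column projection, with no noise handling.

```python
def exact_matches(codes, observed):
    """Return every problem whose code equals the observed vector."""
    return sorted(p for p, code in codes.items() if code == observed)


def project(codes, keep):
    """Restrict every code to the symptom positions listed in `keep`."""
    return {p: tuple(code[i] for i in keep) for p, code in codes.items()}


codes = {"P1": (1, 0, 1), "P2": (1, 1, 0), "P11": (1, 0, 1)}

# Observing S3 and S9 is ambiguous: two problems share the code (1, 0, 1).
print(exact_matches(codes, (1, 0, 1)))  # ['P1', 'P11']

# Dropping the third symptom (S9) loses no distinguishing power.
reduced = project(codes, keep=[0, 1])
print(reduced)  # {'P1': (1, 0), 'P2': (1, 1), 'P11': (1, 0)}
```

The reduced codes still separate P2 from the other two, and P1 and P11 were already indistinguishable, so S9 adds nothing when noise is not a concern.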

Correlation: example
- The codebook contains only three symptoms
- The codebook distinguishes among all problems; however, it guarantees distinction by only a single symptom
  o A loss or spurious generation of S4 will result in a decoding error
- Distinction between problems is measured by the Hamming distance between their codes
  o The radius is half the Hamming distance
- This codebook is not resilient to noise
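The Hamming distance and radius can be computed directly from a codebook. The three-symptom codes below are hypothetical (the slide's own codebook is in a figure that did not survive), chosen so that the minimum distance is 1, as in the case described above.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def codebook_radius(codes):
    """Half the smallest pairwise Hamming distance between problem codes."""
    d_min = min(hamming(a, b) for a, b in combinations(codes.values(), 2))
    return d_min / 2


# Hypothetical three-symptom codebook.
codes = {"P1": (1, 0, 0), "P2": (1, 1, 0), "P3": (0, 1, 1)}

print(hamming(codes["P1"], codes["P2"]))  # 1
print(codebook_radius(codes))             # 0.5
```

A radius of 0.5 means that a single lost or spurious symptom can already turn one problem's code into another's.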

Correlation: example
- Event vectors {011100, , , } will be decoded as P1 with a single symptom loss, and {111110, } will be interpreted as P1 with a single spurious symptom
- When two symptom errors occur, the decoder will detect the error but cannot correctly (uniquely) decode the event (e.g., P1 and P4 are equally likely)
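Decoding by closest match can be sketched as a minimum-Hamming-distance search. The six-symptom codes for P1 and P4 below are assumptions: they are consistent with the two vectors quoted on the slide but are not given in the transcript, and `decode` is a hypothetical helper.

```python
def decode(codes, observed):
    """Return (closest problems, distance); more than one problem means a tie."""
    dist = {
        p: sum(x != y for x, y in zip(code, observed))
        for p, code in codes.items()
    }
    best = min(dist.values())
    return sorted(p for p, d in dist.items() if d == best), best


# Assumed codes at Hamming distance 4 from each other.
codes = {"P1": (1, 1, 1, 1, 0, 0), "P4": (1, 1, 0, 0, 1, 1)}

print(decode(codes, (0, 1, 1, 1, 0, 0)))  # (['P1'], 1)        one symptom lost
print(decode(codes, (1, 1, 1, 1, 1, 0)))  # (['P1'], 1)        one spurious symptom
print(decode(codes, (1, 1, 0, 1, 1, 0)))  # (['P1', 'P4'], 2)  two errors: a tie
```

With two errors the observed vector sits at equal distance from both codes, so the decoder can report that something is wrong but cannot name a single problem.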

Correlation- Advantages