FAULT MANAGEMENT

Definition: Fault management can be defined as the real-time or near-real-time monitoring of the elements of a computer communications network, with attendant resolution of related problems. This function addresses both hardware and software activities that are part of system operation. The monitored elements are both physical and functional, and the problems addressed can include connectivity errors, equipment failures, performance bottlenecks, and performance discontinuities. Fault management may obtain some of its inputs from other functions, such as configuration management, but inputs are evaluated in the same way regardless of their source.

Description: The fault management function interfaces with the various system components either directly, as with network elements, or indirectly via subnetwork managers, such as LAN managers. In the OSI scheme for network management, the fault management function, like the other management functions, interfaces with the network through the Systems Management Application Entity (SMAE) located at each node, via the Common Management Information Protocol (CMIP) gateway. A node may be anything from a single piece of routing equipment to a complex array of data processors. The key feature is that the node must somehow terminate the incoming data channels and reconnect them at the outgoing side.

Fig 2-2 illustrates the basic issues involved in signal detection. The crosshatched area labeled P(D) represents the probability of detection, that is, the probability of detecting a fault given that a fault has actually occurred. Sometimes a fault is declared when, in fact, none exists; this is represented as the probability of a false alarm, P(FA). The curve labeled as noise includes all non-fault or false alarm reports. Signals that pass the criteria for declaration as actual fault events but are not faults at all are false alarms. Where the detection threshold is placed determines how many such false alarms are experienced. As the detection threshold is shifted to the left, the probability of false alarms rises, but so does the probability of detecting all actual faults. The trick is to separate the noise and event curves as much as possible, in order to maximize the detection capability while minimizing the risk of false alarms.
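The threshold trade-off described above can be sketched numerically. The example below is only an illustration and is not taken from the figure: it assumes that the noise and event curves are both Gaussian with the same spread, and the names `tail_probability` and `detection_rates`, along with the means and sigma, are hypothetical choices.

```python
import math

def tail_probability(threshold, mean, sigma):
    """Return P(X > threshold) for X ~ Normal(mean, sigma)."""
    return 0.5 * math.erfc((threshold - mean) / (sigma * math.sqrt(2.0)))

def detection_rates(threshold, noise_mean=0.0, event_mean=3.0, sigma=1.0):
    """Return (P(D), P(FA)) for a given detection threshold.

    Assumption: a signal is declared a fault when it exceeds the threshold.
    P(FA) is the area of the noise curve above the threshold,
    P(D) is the area of the event curve above the threshold.
    """
    p_d = tail_probability(threshold, event_mean, sigma)
    p_fa = tail_probability(threshold, noise_mean, sigma)
    return p_d, p_fa

# Shifting the threshold to the left (lower values) raises both P(D) and P(FA).
for threshold in (2.5, 1.5, 0.5):
    p_d, p_fa = detection_rates(threshold)
    print(f"threshold={threshold:.1f}  P(D)={p_d:.3f}  P(FA)={p_fa:.3f}")
```

Raising `event_mean` relative to `noise_mean` in this sketch corresponds to separating the two curves: for the same threshold, P(D) increases while P(FA) stays unchanged.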

Fig 2-3 shows the conceptual causes of ambiguity. In some systems, status information may come over separate channels, or may be interleaved with the operational data. In Case A, the accumulation of information about the status of the system elements is spread over n elements, so confusion may arise as to which elements are at fault when all status reports are processed. Even though status reports are usually identified with the equipment that issued them, a fault may ripple through the data stream to affect equipment in other parts of the system. In Case B, possible simultaneous failures, or single failures that affect all channels at the same time, may confuse the deciphering of status messages. Ambiguity may arise where there is decoupling between the data traffic configuration and the incoming status messages.

Fault management must satisfy one or more of the following ten tasks:
1. Spontaneous Error Reporting - An SMAE, at each network management nodal interface of the network, can send and receive timely error reports between itself and another SMAE.
2. Cumulative Error Gathering - A designated SMAE can gather error information on behalf of another SMAE within the system. The designated SMAE can poll error counters within other SMAEs on a periodic basis and can reset each counter as it is polled.
3. Error Threshold Alarm - Any SMAE can be configured to send threshold reports to another SMAE, based on previously set error thresholds. Current thresholds can be determined, and the counters that are compared against the thresholds can be reset.
4. Event Logging - Any SMAE can send all event reports to another SMAE, providing for the initialization and termination of event logging.
5. Confidence and Diagnostic Testing - Any SMAE may request any other SMAE to perform testing and to report the results back to the requestor.
6. Repair Action Reporting - An SMAE may request from another SMAE the status of any resource that has previously been reported as faulty.
7. Trace Communication Path - Cooperating SMAEs can test interconnecting communications paths and report results back to a requesting SMAE.
8. Resource Reinitialization - An SMAE can request another SMAE to set the initial state(s) of some resource to known parameter(s).
9. Event Tracing - One SMAE can request another SMAE to start or stop logging specific events locally, and to report back the status of this exercise.
10. Fault Management Information Gathering - This facility allows one SMAE to collect, dump, and analyze local information in support of other SMAEs making such requests.
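Tasks 2 to 4 (cumulative error gathering, error threshold alarm, and event logging) can be illustrated with a small sketch. This is not CMIP and not any real management API: `ManagedNode` and `Collector` are hypothetical stand-ins for a polled SMAE and a designated SMAE, and the threshold value is an arbitrary assumption.

```python
from dataclasses import dataclass, field

@dataclass
class ManagedNode:
    """Hypothetical stand-in for an SMAE that keeps a local error counter."""
    name: str
    error_count: int = 0

    def record_error(self, n=1):
        self.error_count += n

    def poll_and_reset(self):
        """Return the counter value and reset it as it is polled (task 2)."""
        count, self.error_count = self.error_count, 0
        return count

@dataclass
class Collector:
    """Hypothetical designated SMAE that polls nodes on behalf of others."""
    threshold: int
    log: list = field(default_factory=list)

    def poll(self, nodes):
        """Poll every node once and return the names that crossed the threshold."""
        alarms = []
        for node in nodes:
            count = node.poll_and_reset()
            self.log.append((node.name, count))   # event logging (task 4)
            if count >= self.threshold:           # threshold alarm (task 3)
                alarms.append(node.name)
        return alarms

router = ManagedNode("router-1")
bridge = ManagedNode("bridge-7")
router.record_error(5)
bridge.record_error(1)

collector = Collector(threshold=3)
print(collector.poll([router, bridge]))  # only router-1 reached the threshold
print(collector.poll([router, bridge]))  # counters were reset, so no alarms
```

Resetting each counter at poll time means every report covers only the interval since the previous poll, which keeps the threshold comparison tied to a known period.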

Faults and Failure Mechanisms
1. Single Stuck-at Faults - These faults occur due to an inability to make a transition between logic levels in some circuit. This is usually caused by alternate failures within the circuitry or by some type of deposition of material during the manufacturing process.
2. Multiple Stuck-at Faults - These types of faults may be the result of manufacturing defects caused by over-etching, an incorrect PC layer, etc. For non-interactive faults in this category, tests can be devised to expose each separately, while interactive faults may not be revealed with standard tests.
3. Bridging Faults - These faults can be thought of as variations of the single and multiple stuck-at faults. Foreign material touching two or more component parts or circuit traces is the most common form of bridging fault. A not-so-elegant approach to such problems has been to implement trace cuts as a fix.
4. Intermittent Faults - Any fault that occurs and then disappears without intervening action taken to correct it is an intermittent fault. Such faults are usually discovered by examining the physical implementation of the unit in question. Vibration is a likely cause of intermittent faults, as is thermal stress.
5. Memory Faults - Chip miniaturization, with its constraints upon separations, is a major cause of such faults. The ever tighter constraints brought about by transitions from LSI to VLSI to VHSIC technology, where the separation between traces is 0.5 microns or less, cause such faults to be more common.
6. Time-Dependent Faults - Such faults are usually linked to physical rather than electronic roots. Variation in the memory refresh times of a display monitor is an example of a faulty condition that can be attributed to physical conditions.
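The idea of devising a test to expose a stuck-at fault can be shown on a toy circuit. The sketch below is an assumption-laden illustration, not a method from the slides: the circuit y = (a AND b) OR c, the line names, and the helper `detecting_vectors` are all hypothetical. A test vector exposes a fault when the faulty output differs from the fault-free output.

```python
from itertools import product

def circuit(a, b, c, stuck=None):
    """Evaluate y = (a AND b) OR c.

    `stuck` is an optional (line_name, value) pair that forces one line
    ("a", "b", "c", the internal line "n1", or the output "y") to 0 or 1,
    modelling a single stuck-at fault.
    """
    def line(name, value):
        if stuck is not None and stuck[0] == name:
            return stuck[1]
        return value

    a = line("a", a)
    b = line("b", b)
    c = line("c", c)
    n1 = line("n1", a & b)    # internal AND-gate output
    return line("y", n1 | c)  # OR-gate output

def detecting_vectors(stuck):
    """Return every input vector whose output differs from the fault-free circuit."""
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, stuck=stuck)]

print(detecting_vectors(("n1", 0)))  # vectors exposing n1 stuck-at-0
print(detecting_vectors(("n1", 1)))  # vectors exposing n1 stuck-at-1
```

Exhaustive enumeration is workable only for a handful of inputs; it is used here because it makes the definition of a detecting test explicit.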

THE END