Network Monitoring Chu-Sing Yang Department of Electrical Engineering National Cheng Kung University
Outline
- Introduction
- Network monitoring architecture
- Performance monitoring
- Fault monitoring
- Accounting monitoring
Introduction
- Network monitoring
  - Observes and analyzes the status and behavior of the end systems, intermediate systems, and subnetworks that make up the configuration to be managed
- Three major design areas for network monitoring
  - Access to monitored information
    - How to define monitoring information
    - How to get that information from a resource to a manager
  - Design of monitoring mechanisms
    - How best to obtain information from resources
  - Application of monitored information
    - How the monitored information is used in various management functional areas
Network-Monitoring Information
- Static information
  - Characterizes the current configuration and the elements in it
    - Example: the number and identification of ports on a router
  - Is typically generated by the element involved
  - Is made available to a manager by an agent or a proxy
- Dynamic information
  - Is related to events in the network
    - Example: a change of state of a protocol machine
    - Example: transmission of a packet on a network
  - Is collected and stored by the network element responsible for the underlying events
Network-Monitoring Information (cont.)
- Statistical information
  - Is derived from dynamic information
    - Example: the average number of packets transmitted per unit time
  - Is generated by any system that has access to the underlying dynamic information
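As a minimal sketch of how statistical information can be derived from dynamic information, the following computes an average packet rate from recorded packet timestamps. The function name and the sample data are illustrative assumptions, not part of the slides.

```python
def average_packet_rate(timestamps, window_seconds):
    """Return the mean number of packets per second over a window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return len(timestamps) / window_seconds


# Hypothetical dynamic information: packet transmission times in seconds.
packet_times = [0.1, 0.4, 0.9, 1.5, 2.2, 2.8, 3.1, 3.9]

# Statistical information derived from it: packets per second over 4 seconds.
rate = average_packet_rate(packet_times, window_seconds=4.0)
print(rate)
```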
Monitoring Real-Time System
Network-Monitoring System
- Monitoring application
  - Includes the network-monitoring functions that are visible to the user
  - Performance monitoring, fault monitoring, accounting monitoring
- Manager function
  - Is the module at the network monitor
  - Performs the basic monitoring function of retrieving information from other elements
- Agent function
  - Gathers and records management information for one or more network elements
  - Communicates the information to the monitor
- Managed objects
  - Are the management information that represents resources and their activities
- Monitoring agent
  - An additional module concerned with statistical information
  - Generates summaries and statistical analyses of management information
Network Monitoring Configurations
Network monitor
- Includes agent software and a set of managed objects
- To ensure that the monitor continues to perform its function, it should
  - Monitor the load on itself and on the network
  - Monitor its own status and behavior
  - Monitor the amount of network management traffic into and out of itself
External monitors (remote monitors)
- Include one or more agents that monitor traffic on a network
Proxy agent
- Used if network elements do not share a common network management protocol with the network monitor
Two-Tier Management Communication: The Model
[Figure labels: Network Management System with Manager and Database; Network carrying Queries and Unsolicited Events; Network Elements: three Managed Elements with Agents, one Unmanaged Element]
Two-Tier Management Communication: The Real World
[Figure labels: Network Management System: CiscoWorks, HP-OpenView; Network carrying Queries and Unsolicited Events; Network Elements: Router, Call Manager, Printer, Router, Switch]
Three-Tier Management Communication: The Model
[Figure labels: NMS with Manager and MDB; RMON Probe; Proxy Agent; Network Elements: Managed Element with Agent, Unmanaged Element]
Three-Tier Management Communication: The Real World
[Figure labels: Network Management System: CiscoWorks, Concord eHealth; SwitchProbe; Managed Element: Switch]
Polling and Event Reporting
- Information that is useful for network monitoring is collected and stored by agents and made available to one or more manager systems
- Polling
  - Is a request-response interaction between a manager and an agent
  - The manager queries any agent and requests the values of various information elements
  - Is used to generate a report on behalf of a user and to respond to specific user queries
Event Reporting
- Agent may generate a report
  - Periodically, to give the manager its current status
  - When a significant or unusual event occurs
- Manager
  - Is a listener waiting for incoming information
  - Preconfigures or sets the reporting period
- Benefits
  - Useful for detecting problems as soon as they occur
  - More efficient than polling for monitoring objects whose states or values change relatively infrequently
Polling
- Manager
  - Queries any agent and requests the values of various information elements
  - Learns about the configuration it is managing
  - Periodically obtains an update of conditions
  - Investigates an area in detail after being alerted to a problem
- Agent
  - Responds with information from its MIB
  - Reports information matching certain criteria
  - Supplies the manager with information about the structure of the MIB at the agent
Polling vs. Event Reporting
- Factors in the choice
  - The amount of network traffic generated by each method
  - Robustness in critical situations
  - The time delay in notifying the network manager
  - The amount of processing in managed devices
  - The tradeoffs of reliable versus unreliable transfer
  - The network-monitoring applications being supported
  - The contingencies required in case a notifying device fails before sending a report
- In general
  - SNMP approach: polling
  - Telecommunications management systems: both
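The two interaction styles can be contrasted with a small in-memory sketch. The `Agent` and `Manager` classes, their method names, and the dictionary used as a stand-in MIB are all hypothetical; no real management protocol is used here.

```python
class Agent:
    """Hypothetical agent holding a tiny MIB as a dictionary."""

    def __init__(self, name, mib):
        self.name = name
        self.mib = dict(mib)
        self.listeners = []  # managers registered for event reports

    def get(self, key):
        # Polling: answer a manager's request from the local MIB.
        return self.mib[key]

    def set_value(self, key, value):
        # Event reporting: notify listeners only when a value changes.
        changed = self.mib.get(key) != value
        self.mib[key] = value
        if changed:
            for listener in self.listeners:
                listener.on_event(self.name, key, value)


class Manager:
    """Hypothetical manager that can poll agents and listen for events."""

    def __init__(self):
        self.events = []

    def poll(self, agents, key):
        # One request-response exchange per agent.
        return {agent.name: agent.get(key) for agent in agents}

    def on_event(self, agent_name, key, value):
        self.events.append((agent_name, key, value))


manager = Manager()
router = Agent("router1", {"ifOperStatus": "up"})
router.listeners.append(manager)

polled = manager.poll([router], "ifOperStatus")
router.set_value("ifOperStatus", "down")  # state change: one event report
router.set_value("ifOperStatus", "down")  # no change: no report
```

In this sketch the manager pays one exchange per agent per poll regardless of whether anything changed, while the event path generates traffic only on a change of state, which is the efficiency argument made for infrequently changing objects.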
Performance Indicators
- Difficulties in the selection and use of indicators
  - There are too many indicators in use
  - The meanings of most indicators are not yet clearly understood
  - Some indicators are supported by some manufacturers only
  - Most indicators are not suitable for comparison with each other
  - Indicators are accurately measured but incorrectly interpreted
  - The calculation of indicators takes too much time, and the final results can hardly be used for controlling the environment
Performance Indicators
- Service-oriented measures (the highest priority)
  - Availability
  - Response time
  - Accuracy
- Efficiency-oriented measures
  - Throughput
  - Utilization
Availability
- The percentage of time that a network system, a component, or an application is available to a user
- Availability is based on the reliability of the individual components of a network
  - MTBF: mean time between failures
  - MTTR: mean time to repair
  - Availability = MTBF / (MTBF + MTTR)
- Availability of a system depends on the availability of its individual components plus the system organization
  - Redundant components
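The availability formula can be checked with a short calculation. The MTBF and MTTR figures below are made-up illustrative values.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails every 980 hours on average, 20 hours to repair.
a = availability(mtbf_hours=980.0, mttr_hours=20.0)
print(round(a, 4))
```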
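Component availability: A = 0.98
- Two components in series: A(serial) = 0.98 x 0.98 = 0.9604 (about 0.96)
- Unavailability of one component: 1 - A = 0.02
- Unavailability of two components in parallel: 0.02 x 0.02 = 0.0004
- Two components in parallel: A(parallel) = 1 - 0.0004 = 0.9996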
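A minimal sketch of the serial and parallel calculations, assuming component failures are independent. The function names are illustrative.

```python
import math


def serial_availability(components):
    """All components must be up: multiply the availabilities."""
    return math.prod(components)


def parallel_availability(components):
    """At least one component must be up: 1 minus the product of unavailabilities."""
    return 1.0 - math.prod(1.0 - a for a in components)


a_serial = serial_availability([0.98, 0.98])
a_parallel = parallel_availability([0.98, 0.98])
print(round(a_serial, 4), round(a_parallel, 4))
```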
Availability (cont.)
- Functional availability for a dual-link system
  - Nonpeak periods account for 40% of requests; either link can handle the traffic load
  - During peak periods (the remaining 60% of requests), both links are required to handle the full load, but one link can handle 80% of the peak load
- Af = (capability when 1 link is up) * Pr[1 link up] + (capability when 2 links are up) * Pr[2 links up]
- With link availability A = 0.9:
  - Af(nonpeak) = 1 * [A(1-A) + (1-A)A] + 1 * (A*A) = 0.18 + 0.81 = 0.99
  - Af(peak) = 0.8 * [A(1-A) + (1-A)A] + 1 * (A*A) = 0.144 + 0.81 = 0.954
  - Af = 0.6 * Af(peak) + 0.4 * Af(nonpeak) = 0.5724 + 0.396 = 0.9684
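The dual-link example can be worked through in code. This sketch assumes the two links fail independently and that the overall figure is the request-weighted sum of the peak and nonpeak values (60% and 40% of requests). The function name is illustrative.

```python
def functional_availability(a, capability_one_link, capability_two_links=1.0):
    """Expected fraction of load served by two links of availability a."""
    p_one = 2 * a * (1 - a)  # exactly one of the two links is up
    p_two = a * a            # both links are up
    return capability_one_link * p_one + capability_two_links * p_two


A = 0.9
af_nonpeak = functional_availability(A, capability_one_link=1.0)
af_peak = functional_availability(A, capability_one_link=0.8)

# Weight by the share of requests in each period: 60% peak, 40% nonpeak.
af = 0.6 * af_peak + 0.4 * af_nonpeak
print(round(af_nonpeak, 4), round(af_peak, 4), round(af, 4))
```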
Base Requirements for Availability
- Secure facilities
- Power systems
- Circuit diversity
- Intra-chassis redundancy
- Dual power supplies
- Online Insertion and Removal
- Multi-processor design
Response Time
- The time it takes for a response to appear at a user’s terminal after a user action calls for it
- The cost of shorter response time
  - Computer processing power: increased processing power means increased cost
  - Competing requirements: providing rapid response time to some processes may penalize other processes
- Productivity increases as rapid response times are achieved
- A response time of up to 2 seconds is acceptable for most interactive applications
System Response Time
Elements of Response Time
Accuracy
- The percentage of time that no errors occur in the transmission and delivery of information
- Protocols have built-in error-correction mechanisms
  - Data link and TCP protocols
- Monitoring the rate of errors may indicate
  - An intermittently faulty line
  - A source of noise or interference
Throughput
- The rate at which application-oriented events occur
- Is an application-oriented measure
  - Number of transactions of a given type over a period of time
  - Number of customer sessions for a given application during a period of time
  - Number of calls for a circuit-switched environment
- It is useful to track these measures over time
  - Helps locate performance trouble spots
Utilization
- The percentage of the theoretical capacity of a resource (e.g., multiplexer, transmission line, switch) that is being used
- Is a more fine-grained measure than throughput
- Used to search for potential bottlenecks and areas of congestion
- Response time usually increases exponentially as the utilization of a resource increases
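As a rough sketch, link utilization can be computed from two readings of an octet counter taken some interval apart. The function name and the figures are illustrative assumptions, and counter wrap-around is deliberately ignored.

```python
def utilization_percent(octets_start, octets_end, interval_seconds, capacity_bps):
    """Percentage of a link's theoretical capacity used during an interval."""
    if interval_seconds <= 0 or capacity_bps <= 0:
        raise ValueError("interval and capacity must be positive")
    bits_transferred = (octets_end - octets_start) * 8
    return 100.0 * bits_transferred / (interval_seconds * capacity_bps)


# Hypothetical readings: 37.5 million octets in 60 seconds on a 10 Mbps link.
u = utilization_percent(0, 37_500_000, interval_seconds=60, capacity_bps=10_000_000)
print(u)
```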
Simple Efficiency Analysis
Performance-Monitoring Function
- Three components of performance monitoring
  - Performance measurement
    - Gathers statistics about network traffic and timing
    - Accomplished by agent modules that observe the behavior of nodes (e.g., number of connections, traffic per connection)
    - An external (remote) monitor can offload the processing requirement from operational nodes to a dedicated system
  - Performance analysis
    - Consists of software for reducing and presenting the data
  - Synthetic traffic generation
    - Permits the network to be observed under a controlled load
Performance Measurement Reports
- Host communication matrix
- Group communication matrix
- Packet type histogram
- Data packet size histogram
- Throughput-utilization distribution
- Packet interarrival time histogram
- Channel acquisition delay histogram
- Communication delay histogram
- Collision count histogram
- Transmission count histogram
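Two of these reports, the host communication matrix and the data packet size histogram, can be sketched from a list of observed packets. The packet tuples and the 512-byte bucket width are made-up assumptions for illustration.

```python
from collections import Counter

# Hypothetical observations: (source host, destination host, size in bytes).
packets = [
    ("A", "B", 64),
    ("A", "B", 1500),
    ("B", "C", 64),
    ("A", "B", 512),
    ("C", "A", 64),
]

# Host communication matrix: packet count per (source, destination) pair.
matrix = Counter((src, dst) for src, dst, _ in packets)


def size_bucket(size, width=512):
    """Return the inclusive (low, high) byte range that contains size."""
    low = (size // width) * width
    return (low, low + width - 1)


# Data packet size histogram: packet count per size range.
size_histogram = Counter(size_bucket(size) for _, _, size in packets)

print(matrix.most_common(1))
print(sorted(size_histogram.items()))
```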
Inquiry Concerns: Possible Errors and Inefficiencies
- Are there source-destination (S-D) pairs with unusually heavy traffic?
- Are some packet types of unusually high frequency, indicating an error or an inefficient protocol?
- What is the distribution of data packet sizes?
- What are the channel acquisition and communication delay distributions?
- Are collisions a factor in getting packets transmitted?
- What are the channel utilization and throughput?
Inquiry Concerns: Increasing Traffic Load
- What is the effect of traffic load on utilization, throughput, and time delay?
- When does traffic load start to degrade system performance?
- What is the tradeoff among stability, throughput, and delay?
- What is the maximum capacity of the channel under normal operating conditions?
- How many active users are necessary to reach this maximum?
Inquiry Concerns: Varying Packet Sizes
- Do larger packets increase or decrease throughput and delay?
- How does constant packet size affect utilization and delay?
Statistical versus Exhaustive Measurement
- When an agent is monitoring a heavy load of traffic, it may not be practical to collect exhaustive data
  - Example: monitoring the total number of packets in a given time period between each source-destination (S-D) pair on the LAN
- The agent samples the traffic stream to estimate the value of the random variable
  - Statistical methods: probabilities
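One simple way to sample a traffic stream is to keep each packet independently with a fixed probability and scale the counts back up. This is only one possible sampling scheme, chosen here for illustration; the function name, sample rate, and traffic data are assumptions.

```python
import random


def estimate_pair_counts(packets, sample_rate, rng):
    """Estimate packets per (source, destination) pair from a random sample."""
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    sampled = {}
    for src, dst in packets:
        # Keep each packet independently with probability sample_rate.
        if rng.random() < sample_rate:
            sampled[(src, dst)] = sampled.get((src, dst), 0) + 1
    # Scale the sampled counts back up to estimate the true totals.
    return {pair: count / sample_rate for pair, count in sampled.items()}


# Hypothetical traffic: 10,000 packets from host A to host B.
traffic = [("A", "B")] * 10_000

estimate = estimate_pair_counts(traffic, sample_rate=0.1, rng=random.Random(42))
exact = estimate_pair_counts(traffic, sample_rate=1.0, rng=random.Random(42))
print(estimate, exact)
```

With a 10% sample the agent inspects roughly a tenth of the packets, at the cost of an estimate that only approximates the exact count.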
Fault Monitoring
- Objective
  - Identify faults as quickly as possible after they occur, and identify the cause of each fault so that remedial action may be taken
- Problems of fault observation (locating and diagnosing faults)
  - Unobservable faults
    - Certain faults are inherently unobservable locally
    - Example: a deadlock between cooperating distributed processes may not be observable locally
  - Partially observable faults
    - A node failure may be observable, but the observation may be insufficient to pinpoint the problem
    - Example: the failure of a low-level protocol
  - Uncertainty in observation
    - Lack of response from a remote device may mean that the device is stuck, the network is partitioned, congestion caused the response to be delayed, or the local timer is faulty
Fault Monitoring (cont.)
- Problems in fault isolation
  - Multiple potential causes
    - When multiple technologies are involved, the potential points of failure and the types of failure increase
  - Too many related observations
    - A single failure may generate many secondary failures
  - Interference between diagnosis and local recovery procedures
    - Local recovery procedures may destroy important evidence concerning the nature of the fault, preventing diagnosis
  - Absence of automated testing tools
    - Testing to isolate faults is difficult and costly to administer
Fault Monitoring
Fault-Monitoring Functions
- Detect faults
  - Agent reports errors independently to one or more managers
  - Agent maintains a log of significant events and errors
  - Criteria for issuing a fault report
    - Chosen to avoid overloading
- Anticipate faults
  - Set up thresholds
    - Example: packet loss rate
- An effective user interface
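Fault anticipation with thresholds can be sketched as a simple check on the packet loss rate. The threshold values (1% warning, 5% alarm) and the function names are illustrative assumptions, not values from the slides.

```python
def packet_loss_rate(sent, received):
    """Fraction of sent packets that were not received."""
    if sent == 0:
        return 0.0
    return (sent - received) / sent


def check_threshold(sent, received, warn_at=0.01, alarm_at=0.05):
    """Classify the loss rate against hypothetical warning and alarm thresholds."""
    rate = packet_loss_rate(sent, received)
    if rate >= alarm_at:
        return "alarm"
    if rate >= warn_at:
        return "warning"
    return "ok"


for sent, received in [(1000, 1000), (1000, 980), (1000, 900)]:
    print(sent, received, check_threshold(sent, received))
```

Reporting only when a threshold is crossed, rather than on every lost packet, is one way to keep the number of fault reports manageable.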
Test a Fault Monitoring System
- Connectivity test
- Data integrity test
- Protocol integrity test
- Data saturation test
- Connection saturation test
- Response-time test
- Loopback test
- Function test
- Diagnostic test
Accounting Monitoring
- Keeps track of users’ usage of network resources
  - An internal accounting system assesses the overall usage of resources and determines the cost of shared resources to each department
  - A system that offers a public service
- Resources that may be subject to accounting
  - Communications facilities: LANs, WANs, leased lines, dial-up lines, and PBX systems
  - Computer hardware: workstations and servers
  - Software and systems: applications and utility software in servers, a data center, and end-user sites
  - Services: all commercial communication and information services
Collect Accounting Data
- Based on the requirements of the organization
- Communications-related accounting data might be gathered and maintained on each user
  - User identification
  - Receiver
  - Number of packets
  - Security level: identifies the transmission and processing priorities
  - Time stamps: associated with each transmission and processing event (e.g., transaction start and stop times)
  - Network status codes: indicate the nature of any detected errors or malfunctions
  - Resources used
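The fields listed above map naturally onto a record type. The following is a minimal sketch; the field names, types, and sample values are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AccountingRecord:
    """One hypothetical communications accounting entry for a user."""
    user_id: str
    receiver: str
    packets: int
    security_level: int
    start_time: float      # transaction start, seconds
    stop_time: float       # transaction stop, seconds
    status_code: str       # nature of any detected error or malfunction
    resources_used: tuple


def packets_per_user(records):
    """Total packets sent by each user across all records."""
    totals = defaultdict(int)
    for record in records:
        totals[record.user_id] += record.packets
    return dict(totals)


records = [
    AccountingRecord("alice", "srv1", 120, 1, 0.0, 2.5, "ok", ("lan",)),
    AccountingRecord("bob", "srv2", 40, 2, 1.0, 1.8, "ok", ("lan", "wan")),
    AccountingRecord("alice", "srv2", 80, 1, 3.0, 4.0, "timeout", ("wan",)),
]
totals = packets_per_user(records)
print(totals)
```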
Summary
- Network monitoring is the most fundamental aspect of automated network management
- Gathers information about the status and behavior of network elements
  - Static information
  - Dynamic information
  - Statistical information
- Agents collect local management information and transmit it to one or more NMSs
- Each NMS includes network management application software plus software for communication with agents
Summary
- Performance monitoring
  - Availability
  - Response time
  - Accuracy
  - Throughput
  - Utilization
- Fault monitoring
  - Identifies faults as quickly as possible
  - Identifies the cause of the fault so that corrective action can be taken
  - The fault-monitoring function is complicated
- Accounting monitoring
  - Gathers usage information for each resource