1
Connie Logg, Joint Techs Workshop February 4-9, 2006
IEPM-BW: Bandwidth Change Detection and Traceroute Analysis and Visualization
2
BW Change Detection: Important
Know what you are looking for
  How long must a change persist before alerting?
  What threshold to use for alerting (drop of N %)?
What probes provide quality data and are relevant?
  May differ between network types and technologies
Once an alert is detected, what circumstances must be met before another alert is generated for the same or a new drop?
Alerting and forecasting/predicting future performance are two different things; however, the data taken may be relevant to both
Remember: you don't want to respond to every little glitch, because more probing may escalate a minor momentary congestion event
3
What to do with ALERTS
Study them for accuracy and relevance
What information would help diagnose the drop?
  Were there traceroute changes?
  Do changes in other probes seem to have occurred in the same time frame?
  Was there an increase in the ping RTT times?
  If TCP RTT is available, was there a change in that?
  What does OWAMP show (to be implemented)?
4
Algorithm - Simplified
Stream of data: t0 ... tn
2 buffers: history buffer (hbuff) and trigger buffer (tbuff), with sizes hmax and tmax
Load data t0-thmax into the history buffer and calculate the baseline histmean (hm) and histsd (hsd)
5
Algorithm - Simplified
Loop over data t = {thmax ... tn}
  If t > hm - 2*hsd: tbuffoldest -> hbuff, t -> hbuff, drop hbuffoldest, calc hm & hsd, next
  If t <= hm - 2*hsd: t -> tbuff
    If size(tbuff) < tmax, next
    Calc tbuff mean (tm); if (hm - tm)/hm > threshold, generate an alert, tbuff -> hbuff, calc hm & hsd, next
Once an alert is generated, the drop threshold must be met again from the tm, or the data stream must recover for 1/2 of the drop time.
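The loop above can be sketched in Python roughly as follows. This is a minimal illustration, not the IEPM-BW code: the function name `detect_drops` is hypothetical, the sample standard deviation is an assumption (the slide does not say which estimator is used), a full trigger buffer is assumed to be flushed into the history buffer whether or not it raised an alert, and the post-alert rule (threshold met again from tm, or recovery for half the drop time) is left out.

```python
from collections import deque
from statistics import mean, stdev


def detect_drops(data, hmax, tmax, threshold):
    """Return (index, hm, tm) for each point at which an alert would fire.

    Assumptions not stated on the slide: sample standard deviation,
    flushing a full trigger buffer into history even without an alert,
    and no post-alert suppression.
    """
    # A deque with maxlen drops its oldest entry on every append.
    hbuff = deque(data[:hmax], maxlen=hmax)
    tbuff = deque()
    hm, hsd = mean(hbuff), stdev(hbuff)
    alerts = []

    for i in range(hmax, len(data)):
        t = data[i]
        if t > hm - 2 * hsd:
            # Normal point: the oldest trigger value (if any) and t both
            # move into the history buffer, then the baseline is updated.
            if tbuff:
                hbuff.append(tbuff.popleft())
            hbuff.append(t)
            hm, hsd = mean(hbuff), stdev(hbuff)
            continue

        # Suspicious point: hold it in the trigger buffer.
        tbuff.append(t)
        if len(tbuff) < tmax:
            continue

        # Trigger buffer is full: compare its mean with the baseline.
        tm = mean(tbuff)
        if (hm - tm) / hm > threshold:
            alerts.append((i, hm, tm))
        hbuff.extend(tbuff)
        tbuff.clear()
        hm, hsd = mean(hbuff), stdev(hbuff)

    return alerts
```

A single low point followed by normal values never fills the trigger buffer, so it does not raise an alert; only a run of tmax low points whose mean is below the baseline by more than the threshold does.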
6
Overview What we currently look for
Look for a drop lasting at least 6 hours
Look for a drop of 33%
Before reporting another drop, require 3 hours of restored throughput
[Figure: bandwidth vs. time, annotated: drop of 33% for 6 hours; up for at least 3 hours; 33% drop for 6 more hours]
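As a rough illustration of how these settings might map onto the algorithm's parameters, the sketch below converts a duration into a number of samples and checks the 33% criterion. The helper names and the 90-minute probe interval used in the usage note are hypothetical; the slides do not state the probing frequency.

```python
import math


def buffer_size(duration_hours, probe_interval_minutes):
    """Number of samples needed to span duration_hours (rounded up)."""
    return math.ceil(duration_hours * 60 / probe_interval_minutes)


def is_drop(baseline, current, threshold=0.33):
    """True if current is below baseline by more than the threshold fraction."""
    return (baseline - current) / baseline > threshold
```

With an assumed probe every 90 minutes, `buffer_size(6, 90)` gives a trigger buffer of 4 samples for the 6-hour drop, and `buffer_size(3, 90)` gives 2 samples for the 3-hour recovery.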
7
Observations: Traceroute changes occasionally coincide with bandwidth drops
Challenge: How do you define a traceroute change, and which changes have the highest priority?
  Checksum error
  Duplicate responding or non-responding hop
  ! annotations
  IP addresses differ in the 4th octet (or 3rd and 4th octets)
How do you quickly review traceroute changes?
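One way to make "what counts as a change" concrete is to grade each hop by how many octets differ and report the most significant difference along the route. The sketch below assumes IPv4 addresses as dotted strings; the function names, the category labels, and the severity ordering are illustrative choices, not the tool's actual definitions.

```python
# Categories ordered from least to most significant (an assumed ordering).
SEVERITY = ["no change", "4th octet only", "3rd and 4th octets", "other"]


def classify_hop_change(old_ip, new_ip):
    """Grade the difference between two IPv4 addresses at the same hop."""
    old, new = old_ip.split("."), new_ip.split(".")
    if old == new:
        return "no change"
    if old[:3] == new[:3]:
        return "4th octet only"
    if old[:2] == new[:2]:
        # The 3rd octet differs; the 4th may or may not.
        return "3rd and 4th octets"
    return "other"


def classify_route_change(old_route, new_route):
    """Return the most significant per-hop change between two routes."""
    if len(old_route) != len(new_route):
        return "other"
    grades = [classify_hop_change(a, b) for a, b in zip(old_route, new_route)]
    return max(grades, key=SEVERITY.index, default="no change")
```

Grading by octet lets a reviewer de-prioritize changes that most likely stay within the same subnet and focus on routes that moved to a different network.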
8
Traceroute Visualization
One compact page per day
One row per host, one column per hour
One character per traceroute to indicate pathology or change (period (.) = no change)
Identify unique routes with a number
Inspect the route associated with a route number
Provide for analysis of long-term route evolution
[Figure annotations: route # at start of day gives an idea of route stability; multiple route changes (due to GEANT), later restored to original route; period (.) means no change]
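The row encoding described above can be sketched as follows: each unique route gets a number, the first traceroute of the day shows its route number, and later traceroutes show a period when the route is unchanged. Printing the new route number on a change, numbering routes from 0, and the function name are assumptions made for illustration; the pathology characters are not modelled.

```python
def summarize_routes(routes):
    """Build one summary row from a day's traceroutes for a single host.

    routes: list of hop lists in time order.
    Returns (row, route_ids), where row has one cell per traceroute and
    route_ids maps each unique route (as a tuple of hops) to its number.
    """
    route_ids = {}
    cells = []
    previous = None
    for hops in routes:
        key = tuple(hops)
        # Assign the next free number the first time a route is seen.
        rid = route_ids.setdefault(key, len(route_ids))
        cells.append("." if key == previous else str(rid))
        previous = key
    return " ".join(cells), route_ids
```

A host whose route changes once and is later restored would produce a row such as `0 . 1 0`, which makes the restoration to the original route visible at a glance.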
9
Pathology Encodings
[Figure callouts on the traceroute summary table:]
  Probe type
  No change
  Change but same AS
  Change in only 4th octet
  End host not pingable
  Hop does not respond
  Stutter
  ICMP checksum
  ! annotation (!X)
10
Navigation
traceroute to CCSVSN04.IN2P3.FR ( ), 30 hops max, 38 byte packets
 1 rtr-gsr-test ( ) ms
 …
13 in2p3-lyon.cssi.renater.fr ( ) ms !X

#rt# firstseen lastseen route
 , , , ,..., xxx.xxx
 , , ,137,..., xxx.xxx
 , , , ,..., xxx.xxx
 , , , ,..., xxx.xxx
 , , , ,...,(n/a), xxx.xxx
 , , , ,..., xxx.xxx
 , , , ,..., xxx.xxx
 , , ,(n/a),..., xxx.xxx
 , , , ,(n/a),..., xxx.xxx
11
AS information
12
Changes in network topology (BGP) can result in dramatic changes in performance
[Figure: snapshot of traceroute summary table (remote host by hour), with samples of traceroute trees generated from the table; Los-Nettos (100 Mbps)]
[Figure: ABwE measurement, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am; Mbits/s; series: dynamic BW capacity (DBC), cross-traffic (XT), available BW = (DBC-XT)]
  Drop in performance (from original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech), then back to original path
  ESnet-LosNettos segment in the path (100 Mbits/s)
  Changes detected by IEPM-Iperf and ABwE
Notes:
1. Caltech misrouted via Los-Nettos 100 Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto-detect and notify
13
New Graphical Map Display
New Traceroute Map Display
14
Quality Control – Bandwidth Monitoring
It is good to have a local target host for a sanity check. The problem here was that the monitoring host rebooted into single-CPU mode after maintenance had been performed on it.
15
More Sanity Checks
Target host was not completely installed: process cleanup did not have the Perl modules that it needed to kill lingering processes (needs install check)
16
Probe Correlation
Traceroute change affected all 3 probes:
  Pathchirp analysis shows drop
  Multi-stream iperf shows drop
  Single-stream iperf shows drop
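A correlation check of this kind could be automated by asking which probes raised an alert close in time to a traceroute change. The sketch below is only illustrative: the function name, the dictionary layout, and the 2-hour window are assumptions, not part of IEPM-BW.

```python
from datetime import timedelta


def probes_alerting_near(change_time, alerts_by_probe, window=timedelta(hours=2)):
    """Return the sorted names of probes with an alert within `window`
    of a traceroute change.

    alerts_by_probe: dict mapping probe name -> list of datetime alert times.
    """
    return sorted(
        probe
        for probe, times in alerts_by_probe.items()
        if any(abs(t - change_time) <= window for t in times)
    )
```

If every probe type appears in the result for the same route change, the drop is more likely to be a real path problem than an artifact of one measurement tool.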
17
Analysis Results
Results are sent to interested parties with links to graphs, data, and traceroute analysis
Alerts are saved in the ALERT table and graphs are saved in the GRAPH table for future reference
On every analysis run (about every 2 hours), a table is generated showing which alerts occurred for which probes and when. It has links to the more detailed alert information.
Reports on the last month's alerts are generated nightly from these tables.
18
Future Improvements
Integrate ping RTTmin and RTTmax analysis
Optimize code for speed of execution: estimate mean and std dev
Upload alerts to MonALISA: what info?
Compare detection algorithms (KS, HW, PCA?)
Recommendations on data-taking frequencies and on how to define the trigger and history buffer sizes still need more exploring
Implement prediction/forecasting algorithm(s)
QUESTIONS?