Evaluation of Techniques to Detect Significant Performance Problems using End-to-end Active Network Measurements
Mahesh Chhaparia & Les Cottrell, SLAC
2006 IEEE/IFIP Network Operations & Management Symposium

Forecasting Network Performance
Predicting how long a file transfer will take requires forecasting network and application performance. However, such forecasting is beset with problems. These include seasonal (e.g. diurnal) variations in the measurements; the increasing difficulty of making accurate, minimally intrusive active network measurements, especially on high-speed (>1 Gbits/s) networks and with Network Interface Card (NIC) offloading; the intrusiveness of making more realistic active measurements on the network; the differences between network and large-file-transfer performance; and the difficulty of getting sufficient relevant passive measurements to enable forecasting. We discuss each of these problems, compare and contrast the effectiveness of various solutions, look at how some of the methods may be combined, and identify practical ways to move forward.
Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM)
Outline
Why do we want forecasting & anomaly detection?
What are we using for the input data, and what are its problems?
How do we make forecasts and detect anomalies?
First approaches
The real world
Results
Conclusions & futures
Uses of Techniques
Automated problem identification: admins cannot review hundreds of graphs each day.
Alerts for network administrators, e.g. bandwidth changes in time series (iperf, SNMP).
Alerts for systems people: OS/host metrics.
Anomalies for security.
Forecasts (a by-product of the techniques) for Grid middleware, e.g. replica manager, data placement.
Data
Measurement Topology
40 target hosts in 13 countries.
Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s.
Paths traverse ~50 ASes and 15 major Internet providers.
5 targets are at PoPs, the rest at end sites.
Using Active IEPM-BW measurements
Focus on high performance for a few hosts needing to send data to a small number of collaborator sites.
Makes regular measurements:
ping (RTT, connectivity), traceroute
pathchirp, pathload, ABwE (packet-pair dispersion)
iperf (single & multi-stream), thrulay
Lots of analysis and visualization.
Running at CERN, SLAC, FNAL, BNL, and Caltech to about 40 remote sites.
Forecasting and Anomaly detection
Anomaly Detection
An anomaly occurs when the actual value differs significantly from the expected value, so forecasts are needed to find anomalies.
The focus was initially on abing (ABwE) time-series measurements: one measurement every 3 minutes, low network impact, BUT very noisy, so a hard test case.
Plateau, the most intuitive
For each observation:
If it lies outside the history-buffer mean mh ± b*sh, add it to the trigger buffer;
otherwise add it to the history buffer and remove the oldest point from the trigger buffer.
When the trigger buffer holds more than t points, a trigger is issued:
if (mh - mt) / mh > D and 90% of the trigger points arrived in the last T minutes, declare an event and move the trigger buffer into the history buffer.
Parameters: history length = 1 day, trigger length t = 3 hours, b = 2 standard deviations.
We set the history-buffer length to one day in order to minimize the lag between the history mean and the observations due to diurnal changes.
[Figure: observations vs. time, showing the history mean, the history mean - 2*stdev band, the trigger-buffer occupancy (% full), and a detected event.]
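To make the procedure concrete, here is a minimal Python sketch of the plateau algorithm as described above. The buffer sizes are given in samples (with 3-minute measurements, ~480 samples is one day and ~60 is three hours); the function and parameter names are illustrative, and the "90% of trigger points within the last T minutes" check is omitted for brevity.

```python
from collections import deque
import statistics

def plateau_detector(observations, hist_len=480, trig_len=60, b=2.0, D=0.1):
    """Sketch of the plateau algorithm: points far from the history
    mean accumulate in a trigger buffer; a full trigger buffer whose
    mean has dropped by more than D (relative) raises an event."""
    history = deque(maxlen=hist_len)   # ~1 day of 3-minute samples
    trigger = deque(maxlen=trig_len)   # ~3 hours of 3-minute samples
    events = []
    for t, y in enumerate(observations):
        if len(history) < 2:           # need 2 points for a stdev
            history.append(y)
            continue
        mh = statistics.mean(history)
        sh = statistics.stdev(history)
        if abs(y - mh) > b * sh:
            trigger.append(y)          # outside mh +/- b*sh: suspect point
        else:
            history.append(y)          # normal point refreshes the history
            if trigger:
                trigger.popleft()      # and ages out one trigger entry
        if len(trigger) == trig_len:   # trigger buffer is full
            mt = statistics.mean(trigger)
            if (mh - mt) / mh > D:     # relative drop exceeds threshold D
                events.append(t)
                history.extend(trigger)  # adopt the new level as history
                trigger.clear()
    return events
```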
Plateau (Seasons & false alerts)
Congestion on a Monday following a quiet weekend makes the forecast too high and raises a false alert.
A history buffer whose length is not one day also causes the history mean to drift out of sync with the diurnal cycle of the observations.
K-S
For each observation, compare the previous 100 observations with the next 100 observations:
compute the maximum vertical difference between the two empirical CDFs (the Kolmogorov-Smirnov statistic);
compare it with the difference expected between random CDFs;
express the result as a % difference.
The trigger buffer reporting an event well after the start of a step-down is partly an artifact: it could, for example, report the start of the event as the time the trigger buffer reached 10% full. Even so, K-S is more accurate at pinning down the time when the change was greatest.
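A minimal sketch of this K-S scan, using SciPy's two-sample Kolmogorov-Smirnov test; the `threshold` cut-off on the statistic is an illustrative assumption, not a value from the paper.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_events(y, w=100, threshold=0.5):
    """Slide a point along the series; at each position compare the
    previous w observations with the next w via the two-sample K-S
    statistic (max vertical distance between the two empirical CDFs,
    in [0, 1]).  Positions whose statistic exceeds `threshold` are
    flagged as candidate change points."""
    y = np.asarray(y, dtype=float)
    stats = np.zeros(len(y))
    for i in range(w, len(y) - w):
        d, _p = ks_2samp(y[i - w:i], y[i:i + w])
        stats[i] = d                    # % difference = 100 * d
    return np.where(stats > threshold)[0], stats
```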
K-S (Seasons and False Alerts)
Bandwidth drops between 19:00 & 22:00 Pacific Time (7:00-10:00 am Pakistan time) cause more anomalous events around this time.
Seasonal Changes
Use the Holt-Winters (H-W) technique:
a triple exponentially weighted moving average, with three terms, each with its own parameter (a, b, g), accounting for local smoothing, long-term seasonal smoothing, and trends:
O(t) = a*(y(t) / S(t-w)) + (1 - a)*(O(t-1) + T(t-1))   # overall (local) level
S(t) = b*(y(t) / O(t)) + (1 - b)*S(t-w)                # seasonal effect
T(t) = g*(O(t) - O(t-1)) + (1 - g)*T(t-1)              # trend
F(t+m) = (O(t) + m*T(t)) * S(t-w+m)                    # forecast m periods ahead
where w is the number of periods that complete a season, m is the number of periods ahead being forecast, and t is the time. The trend component for our data is flat.
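A minimal Python sketch of these recurrences for one-step-ahead forecasting. The simple first-season initialisation below is an assumption made for brevity; the paper initialises from the NIST Engineering Statistics Handbook, as noted on the next slide.

```python
def holt_winters(y, w, alpha, beta, gamma):
    """One-step-ahead multiplicative Holt-Winters following the
    slide's equations.  y: regularly spaced series; w: samples per
    season (one week here); alpha/beta/gamma: local, seasonal and
    trend smoothing parameters (a, b, g on the slide)."""
    base = sum(y[:w]) / w                 # mean of the first season
    S = [v / base for v in y[:w]]         # crude initial seasonal indices
    O, T = y[w - 1], 0.0                  # slides note the trend is flat
    F = [None] * len(y)                   # one-step-ahead forecasts
    for t in range(w, len(y)):
        F[t] = (O + T) * S[t - w]         # F(t) = (O(t-1) + T(t-1)) * S(t-w)
        O_prev = O
        O = alpha * (y[t] / S[t - w]) + (1 - alpha) * (O + T)  # overall level
        S.append(beta * (y[t] / O) + (1 - beta) * S[t - w])    # seasonal
        T = gamma * (O - O_prev) + (1 - gamma) * T             # trend
    return F
```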
H-W Implementation
Needs regularly spaced data (otherwise going back one season is difficult and the seasons get out of sync):
Interpolate the data: select a bin size and average the points in each bin.
If a bin in the first week has no points, fill it with data from future weeks; for following weeks, fill missing bins from the previous week.
Initial values for the smoothing come from the NIST "Engineering Statistics Handbook".
Choose the parameters by minimizing (1/N) Σ (F_t - y_t)², where F_t is the forecast for time t as a function of the parameters and y_t is the observation at time t (a sketch of this fit follows below).
A week is special and defines a cycle of seasons; we do nothing special with the day. Note that a week's worth of data is needed to get going.
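A sketch of the parameter fit just described, minimising the mean squared one-step forecast error with SciPy and the holt_winters() sketch above; the bounds and starting point are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_hw(y, w):
    """Choose (alpha, beta, gamma) by minimising the objective on
    the slide, (1/N) * sum((F_t - y_t)^2), over the smoothing
    parameters; bins with no forecast (the first season) are
    excluded from the error."""
    def mse(params):
        F = holt_winters(y, w, *params)
        errs = [(f - v) ** 2 for f, v in zip(F, y) if f is not None]
        return sum(errs) / len(errs)
    res = minimize(mse, x0=np.array([0.5, 0.5, 0.1]),
                   bounds=[(0.01, 0.999)] * 3, method="L-BFGS-B")
    return res.x          # fitted (alpha, beta, gamma)
```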
H-W and alerts
H-W is a forecasting technique; it needs to be complemented with a method to identify events:
If a percentage of the residuals within a time window lie outside twice the EWMA of the absolute deviation, generate an event (HWE); see the sketch below.
Alternatively, apply Plateau to the H-W residuals (PHR), or K-S to the H-W residuals (KHR).
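A minimal sketch of the HWE rule, assuming the 80-minute window and 80% fraction from the later example slide (window = 27 samples of 3-minute data); the EWMA smoothing constant `lam` is an illustrative assumption, while the factor of two comes from the slide above.

```python
def hw_events(y, F, window=27, frac=0.8, k=2.0, lam=0.1):
    """Smooth the absolute H-W residuals with an EWMA, and raise an
    event when more than `frac` of the points in a sliding window lie
    outside k times that deviation estimate."""
    dev, outside, events = 0.0, [], []
    for t, (f, v) in enumerate(zip(F, y)):
        if f is None:                      # no forecast yet (first season)
            outside.append(False)
            continue
        r = abs(v - f)                     # absolute residual
        dev = lam * r + (1 - lam) * dev    # EWMA of the absolute deviation
        outside.append(r > k * dev)        # outside the deviation envelope?
        recent = outside[-window:]
        if len(recent) == window and sum(recent) > frac * window:
            events.append(t)
    return events
```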
Results
Evaluation
Created a library of time series covering 100 days from June through September 2004 for 40 hosts.
Analyzed them using Plateau and saved all events where the trigger buffer filled (no filter on step size); 23 hosts yielded ~120 candidate events.
Event types: steps; diurnal changes; congestion from cron jobs, bandwidth tests, flash crowds.
Classified the ~120 events by whether they were interesting: a large, sharp drop in bandwidth persisting for >> 3 hours.
Plateau was the easiest to understand and tune, and was also the first to be developed.
Classification is subjective: "large" means (mh - mt)/mh > 10% (we also looked at 30%), and "sharp" means the step occurs in < 4 hours.
Compare (K-S & Plateau)
K-S shows results similar to Plateau's.
As the parameters are adjusted to reduce false positives, the number of missed events increases.
E.g. for Plateau (b = 2) with a 3-hour trigger buffer filled to 90% in < 220 minutes and a 1-day history buffer, versus K-S with ±100 observations, the effect of the threshold D = (mh - mt)/mh is:

D      False positives   Missed events
10%    16%               8%
30%    2%                32%
MB technique
MB applied to iperf data from SLAC to Caltech: it oscillates wildly as it tries to track individual spikes, and misses the important step down and back up.
HW technique
Minimizing the square of the residuals to estimate the initial H-W parameters gave a fairly wide range of values for the local smoothing parameter (α = … to 0.95, median = … ± 0.12) and the seasonal parameter (β = … to 0.999, median = 0.22 ± 0.2).
Forecasts were poor during the initial weeks of data, due to not having good initial estimates; this also suggests there may not be a single suitable set of parameters for all paths.
Example
Local smoothing: 99% of the weight from the last 24 hours.
Linear trend: 50% of the weight from the last 24 hours.
Seasonal component: mainly from last week, but includes several weeks.
Event definition: within an 80-minute window, 80% of the points lie outside the deviation envelope.
The deviations are smoothed absolute residuals (1-hour averages). Note the difference between weekend and weekday behavior.
[Figure: observations, forecast, and deviation envelope vs. time, with the weekend and weekday regions marked.]
PHR and KHR
Both were able to detect one-off step-downs.
PHR shows no false positives.
KHR raises several false events, e.g. on weekend data. During weekends the H-W residuals are close to 0, since the data usually mirror the previous weekend's and H-W makes good forecasts from the past week's seasonal cycle. During weekdays, higher network usage causes larger fluctuations, so the residuals are more spread out, although still small in absolute value. Because K-S compares two frequency distributions irrespective of the relative change in values, applying it across a weekday/weekend boundary effectively compares two very different residual distributions, thereby raising false events.
This interesting observation highlights a weakness of K-S for our current application.
Comparison of various techniques
Conclusions
A few paths (~10%) have strong seasonal effects.
Plateau & K-S work well when seasonal effects are weak.
K-S detects both step-downs and step-ups, and gives an accurate time estimate of the event (good for correlations).
H-W is promising for seasonal effects, but:
it is more complex and requires more parameters, which may not be easy to estimate;
it requires regularly spaced data (an interpolation step);
it can be used to remove seasonal effects before applying Plateau.
CPU time can depend critically on the parameters chosen; e.g. increasing the K-S range from ±100 to ±400 observations increases the CPU time by a factor of 14.
H-W works, but its effectiveness still needs to be quantified.
Current & Future Work
Try a different objective function to minimize for the H-W parameters.
Future development with PCA: enable looking at multiple measurements simultaneously, e.g. RTT, loss, capacity, …; multiple routes.
Interpolate heavyweight/infrequent measurements based on lightweight, more frequent ones.
Explore passive Netflow data.
Manually study events, leading to the development of automated diagnosis of events: traceroutes, router utilizations, host measurements.
More information
SLAC Plateau implementation
SLAC H-W implementation: www-iepm.slac.stanford.edu/monitoring/forecast/hw.html
NIST Engineering Statistics Handbook
IEPM-BW Measurement Infrastructure
Appendix
KS technique
K-S applied to iperf data from SLAC to Caltech: it does not distinguish between increases and decreases in the data, and thus detects roughly twice as many events.