The CALgorithm for Detecting Bandwidth Changes

Connie Logg SLAC

Motivation
Throughput to a target host decreases dramatically, and questions arise:
Why?
When did it start?
What was the duration?
Is the decrease periodic?
Was it associated with a route change?
Could we have known and avoided being affected?
What other destinations are affected?

The Challenge
Lightweight detection techniques
We do not want to consume bandwidth to measure bandwidth
Quick-response detection requires frequent measurements
We must automate the detection and generate alerts automatically

Measurement Tools
IEPM-BW framework
ABWE
Traceroute
Ping

Methodology - I
Every few minutes (currently 10), run the tests to the target hosts
Ping
  See if we can ping the host. If not, it is not the end of the world unless we get an “unknown host” response
  Results are logged to a flat file

Methodology - II
Traceroute
  Run forward and, if possible, reverse traceroutes
  We need to be able to ssh to the target to run the reverse traceroute (not always possible)
  Depending upon traceroute restrictions in the route, we may need to run ICMP traceroute instead of UDP traceroute
  Record the traceroute results in a flat file

Methodology - III
ABWE runs continuously, taking a measurement every minute
The data is put into an Oracle database, one point per minute

Analysis
“triganal” analyzes the ABWE data for decreases in throughput (it can also detect throughput increases, but we are not concerned about those now)
Note that the ABWE data arrives once a minute
The “triganal” parameters depend on the data frequency and on how long a drop must persist before we alert on it

“triganal” – Methodology - I
There are two data buffers. The first is a history buffer, histbuf, where the processed data is stored.
  It has a minimum size, histmin, which is the minimum amount of data required to “prime the pump”. This data is loaded into the history buffer when the algorithm is invoked.
  It has a maximum size, histmax, which is the maximum amount of data allowed in histbuf. As new values are added to histbuf, if size(histbuf) > histmax, the oldest values are removed.
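A minimal Python sketch of the history buffer described above. It assumes a collections.deque with maxlen can stand in for histbuf; the function name make_histbuf and the sample values are hypothetical, and the original implementation may well differ.

```python
from collections import deque


def make_histbuf(priming_values, histmin, histmax):
    """Prime a bounded history buffer with the first histmin values."""
    if len(priming_values) < histmin:
        raise ValueError("not enough data to prime the pump")
    # maxlen makes the deque discard its oldest entry once histmax is exceeded
    return deque(priming_values[:histmin], maxlen=histmax)


histbuf = make_histbuf([90.0, 95.0, 100.0], histmin=3, histmax=4)
histbuf.append(105.0)   # size is now 4, equal to histmax
histbuf.append(110.0)   # the oldest value (90.0) is dropped
print(list(histbuf))    # [95.0, 100.0, 105.0, 110.0]
```

Letting the container enforce histmax keeps the size check out of the main loop.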

“triganal” – Methodology - II
A trigger buffer, trigbuf, where the data considered possible “trigger data” is stored for ongoing analysis.
  trigdur is the number of points which, once loaded into trigbuf, trigger the alert analysis. Because the ABWE data arrives once a minute, trigdur also represents the number of minutes the throughput must be depressed before the trigger buffer data is evaluated to see if it constitutes an “alert” situation.

“triganal” – Methodology - III
The algorithm is controlled by various parameters which must be tuned to the nature of the data:
  histmax, histmin, and trigdur, which were discussed previously
  Sensitivity ($sens), a multiplicative factor applied to the standard deviation to determine whether a new data point goes into the trigger buffer or whether it is an outlier

“triganal” – Methodology - IV
Threshold parameters are % change values which determine whether the contents of the full trigger buffer constitute a major alert or a minor alert. Major and minor alerts are implemented to allow for “tuning” of the algorithm.
Generally we set them equal, which disables minor alerts: minorthresh = majorthresh
Assume they are: majorthresh (40%) and minorthresh (40%)

“triganal” – Methodology - V
As the data is processed in time order, two functions (qtrigger and qoutlier) are applied to each new data point to determine which buffer (histbuf or trigbuf) it is loaded into and in what state.
“Outlier” data is loaded into the buffer as negative data, and the script which calculates the mean and standard deviation (calcstats) does not include the negative data in its calculations.
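The negative-value convention can be sketched in Python as follows. This is an illustration only: the slides name calcstats but do not show it, so the signature, the use of the sample standard deviation, and the example values are all assumptions.

```python
import statistics


def calcstats(buf):
    """Return (mean, standard deviation) of the unflagged values in buf."""
    kept = [v for v in buf if v >= 0]  # negative entries are flagged and skipped
    mean = statistics.mean(kept)
    # assumption: sample standard deviation; the slides do not say which form is used
    sd = statistics.stdev(kept) if len(kept) > 1 else 0.0
    return mean, sd


# -500.0 stands for a measurement of 500.0 that was flagged as an outlier
histmean, histsd = calcstats([10.0, 20.0, -500.0, 30.0])
print(histmean, histsd)  # 20.0 10.0
```

Storing the flag in the sign keeps the original magnitude recoverable with abs(), which the later steps rely on. It only works for strictly positive measurements.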

“triganal” – Methodology - VI
qtrigger determines whether a value qualifies for the trigger buffer.
$val = value currently being examined
histmean = mean of the data in the history buffer
histsd = standard deviation of the data in the history buffer
if (($val > histmean + sens*histsd) or ($val < histmean - sens*histsd))
  then qtrigger = true
  else qtrigger = false

“triganal” – Methodology - VII
qoutlier determines whether a value is so far out of range that it is an outlier and should not be included in the mean and standard deviation calculations.
if (($val > histmean + sens*histsd*2) or ($val < histmean - sens*histsd*2))
  then qoutlier = true
  else qoutlier = false
(Note that one might use the variance instead of 2*histsd, but we find that the variance does not work well.)
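The two tests translate almost directly into Python. The sketch below passes histmean, histsd, and sens as explicit arguments, which is an assumption about the interface; the numeric values are hypothetical and chosen only to show the two bands.

```python
def qtrigger(val, histmean, histsd, sens):
    """True when val lies more than sens standard deviations from histmean."""
    return val > histmean + sens * histsd or val < histmean - sens * histsd


def qoutlier(val, histmean, histsd, sens):
    """True when val lies more than 2*sens standard deviations from histmean."""
    return val > histmean + sens * histsd * 2 or val < histmean - sens * histsd * 2


# With histmean=100, histsd=10, sens=2 the trigger band is [80, 120]
# and the outlier band is [60, 140].
for val in (95.0, 75.0, 50.0):
    print(val, qtrigger(val, 100.0, 10.0, 2.0), qoutlier(val, 100.0, 10.0, 2.0))
# 95.0 False False
# 75.0 True False
# 50.0 True True
```

Every outlier is also a trigger value, since the outlier band is the wider of the two.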

“triganal” – Startup
Prime histbuf with histmin values, then calculate the histmean and histsd values.
Initialize the data direction: $curdir = none

The Master “triganal” Loop - I
Loop over the values in the data set with the following algorithm.
(start of data input loop)
Is $val NOT a trigger value? Then:
  If (abs(($val - histmean)/histmean) < .1), add the value to histbuf but do not include it in the stats ($val = -$val). This is to avoid flatlining the distribution and making histsd very small.
  Calculate the stats histmean and histsd
  If (size(trigbuf) > 0) then remove the oldest trigbuf value. We want to age out the data in the trigger buffer if we are recovering from a bandwidth drop.
  Go to start of data input loop

The Master “triganal” Loop - II
$val is a trigger value: the direction of change is important.
$curdir = current direction, relative to histmean, of the data which is in trigbuf
If ($val > histmean) then the “direction” $valdir = up, else $valdir = down
If trigbuf is empty, $curdir = $valdir
If qoutlier($val) then $val = -$val
Save $val in trigbuf

The Master “triganal” Loop - III
If ($curdir ne “none” and $curdir ne $valdir), the data has changed direction and we abort the trigger state:
  Save the absolute values of the trigbuf data into histbuf
  Calculate new histbuf stats
  Clear the trigger buffer trigbuf
  Increment $aborttrigger
  $curdir = “none”
  Go to start of the data input loop

The Master “triganal” Loop - IV
If the trigger buffer is not full:
  Go to start of data input loop
If the trigger buffer is full:
  Calculate the trigmean and trigsd of the absolute values of all the trigbuf data
  Calculate the percent change: $perchange = 100*(histmean - trigmean)/histmean
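A short worked example of the percent-change calculation, as a Python sketch. The function name percent_change and the numbers are hypothetical; only the formula comes from the slide.

```python
import statistics


def percent_change(histmean, trigmean):
    """Percent drop of the trigger-buffer mean relative to the history mean."""
    return 100 * (histmean - trigmean) / histmean


trigbuf = [-450.0, 500.0, 550.0]                       # -450.0 was flagged as an outlier
trigmean = statistics.mean([abs(v) for v in trigbuf])  # mean of the absolute values
perchange = percent_change(800.0, trigmean)
print(perchange)  # 37.5
```

With majorthresh set to 40%, a drop of 37.5% like this one would not exceed the threshold.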

The Master “triganal” Loop - V
If this is NOT a drop in throughput, that is (trigmean >= histmean):
  Add abs(trigbuf values) to histbuf
  Clear trigbuf
  Recalculate the histbuf stats
  Go to start of the data input loop

The Master “triganal” Loop - VI
It IS a drop in throughput. Does $perchange exceed majorthresh?
Compare it to the previous still-active alert, if there is one ($alertmean ne 0):
$trigchange = 100*($alertmean - $trigmean)/$alertmean
If ($trigchange > majorthresh) then
Generate a major alert
Add the absolute values of the trigbuf data to histbuf
Clear trigbuf, reset its stats to 0, and calculate new histbuf stats
Set $alertmean = trigmean (preserve alert status)
Go to start of the data input loop

The Master “triganal” Loop - VII
There was no previous alert ($alertmean = 0):
  Generate the alert
  $alertmean = trigmean
  Add abs(trigbuf values) to histbuf
  Recalculate histbuf stats
  Clear trigbuf
  Go to start of the data input loop
END OF DATA INPUT LOOP – all data is processed
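The whole loop can be pulled together in a compact Python sketch. It is one reading of the slides, not the original code, and it rests on several assumptions: measurements are strictly positive, the sample standard deviation is used, the previous stats are kept when fewer than two unflagged values remain, a full trigger buffer that raises no alert is simply moved into histbuf, and $alertmean is never reset because the slides do not say when it is cleared. The default parameter values and the test series are hypothetical.

```python
import statistics
from collections import deque


def calcstats(buf, previous):
    """Mean and sample standard deviation of the unflagged (non-negative) values."""
    kept = [v for v in buf if v >= 0]
    if len(kept) < 2:
        return previous  # assumption: keep the old stats when too little data is unflagged
    return statistics.mean(kept), statistics.stdev(kept)


def triganal(data, histmin=5, histmax=20, trigdur=3, sens=2.0, majorthresh=40.0):
    """Return (alerts, aborttrigger); alerts is a list of (index, perchange)."""
    if histmin < 2 or len(data) < histmin:
        raise ValueError("need at least histmin >= 2 priming values")
    histbuf = deque(data[:histmin], maxlen=histmax)  # oldest values drop off automatically
    histmean, histsd = calcstats(histbuf, (0.0, 0.0))
    trigbuf, curdir, alertmean, aborttrigger, alerts = [], None, 0.0, 0, []

    def flush():
        """Move abs(trigbuf) into histbuf, clear trigbuf, and return fresh stats."""
        histbuf.extend(abs(v) for v in trigbuf)
        trigbuf.clear()
        return calcstats(histbuf, (histmean, histsd))

    for i in range(histmin, len(data)):
        val = data[i]
        dev = abs(val - histmean)
        if dev <= sens * histsd:  # Loop I: not a trigger value
            if dev / histmean < 0.1:
                val = -val  # stored, but excluded from the stats
            histbuf.append(val)
            histmean, histsd = calcstats(histbuf, (histmean, histsd))
            if trigbuf:
                trigbuf.pop(0)  # age out the oldest trigger value
            continue

        valdir = "up" if val > histmean else "down"  # Loop II: trigger value
        if not trigbuf:
            curdir = valdir
        if dev > 2 * sens * histsd:
            val = -val  # outlier flag
        trigbuf.append(val)

        if curdir is not None and curdir != valdir:  # Loop III: direction changed
            histmean, histsd = flush()
            aborttrigger += 1
            curdir = None
            continue

        if len(trigbuf) < trigdur:  # Loop IV: trigger buffer not yet full
            continue
        trigmean = statistics.mean([abs(v) for v in trigbuf])
        perchange = 100 * (histmean - trigmean) / histmean

        if trigmean < histmean and perchange > majorthresh:  # Loops VI and VII
            if alertmean == 0:
                new_alert = True
            else:
                new_alert = 100 * (alertmean - trigmean) / alertmean > majorthresh
            if new_alert:
                alerts.append((i, perchange))
                alertmean = trigmean
        histmean, histsd = flush()  # also covers Loop V (not a drop)
        curdir = None
    return alerts, aborttrigger


alerts, aborts = triganal([100.0, 110.0, 90.0, 105.0, 95.0, 50.0, 52.0, 48.0])
print(alerts, aborts)  # [(7, 50.0)] 0
```

In this run the first five values prime histbuf (histmean is 100), the next three fill trigbuf with trigdur = 3 depressed points, and the 50% drop exceeds majorthresh, so a single alert is raised at index 7.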

Final Steps
Create the plots for the HTML pages
That’s all Folks
