Forecasting Network Performance

Presentation transcript:

Forecasting Network Performance
Les Cottrell, Grid Performance Workshop, Edinburgh, June 22-23, 2005
http://www.slac.stanford.edu/grp/scs/net/talk05/predict-edinburgh05.ppt
Predicting how long a file transfer will take requires forecasting network and application performance. However, such forecasting is beset with problems. These include seasonal (e.g. diurnal) variations in the measurements; the increasing difficulty of making accurate, low-intrusiveness active measurements, especially on high speed (>1 Gbits/s) networks and with Network Interface Card (NIC) offloading; the intrusiveness of making more realistic active measurements on the network; the differences between network and large file transfer performance; and the difficulty of getting sufficient relevant passive measurements to enable forecasting. We will discuss each of these problems, compare and contrast the effectiveness of various solutions, look at how some of the methods may be combined, and identify practical ways to move forward.
Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM)

Outline
Why do we want forecasting & anomaly detection?
What are we using for the input data
  And what are the problems
How do we make forecasts, detect anomalies?
  First approaches
  The real world
Results
Conclusions & Futures
  Possible uses

Uses of Techniques Automated problem identification: Alerts for network administrators, e.g. Bandwidth changes in time-series, iperf, SNMP Alerts for systems people OS/Host metrics Anomalies for security Forecasts (are a fallout of the techniques) for Grid Middleware, e.g. replica manager, data placement

Data

Using Active IEPM-BW measurements Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. HEP tiered model Makes regular measurements Ping (RTT, connectivity), traceroute pathchirp, ABwE (packet pair dispersion) iperf (single & multi-stream), thrulay, Bbftp (file transfer application) Looking at GridFTP but complex requiring renewing certificates Lots of analysis and visualization Running at CERN, SLAC, FNAL, BNL, Caltech to about 40 remote sites http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html

ABwE/abing
Uses packet pair dispersion of 20 packets to provide: Capacity, X-traffic, available bandwidth
At 3 minute intervals
Very noisy time series data
Moving averaged over 1 hour
[Figure label: Capacity]

Pathchirp/Rice/INCITE From PAM paper, pathchirp more accurate but Ten times as long (10s vs 1s) More network traffic (~factor of 10) Pathload factor of 10 again more IEPM-BW now supports both

BUT… Packet pair dispersion relies on accurate timing of inter packet separation At > 1Gbps this is getting beyond resolution of Unix clocks AND 10GE NICs are offloading function Coalescing interrupts, Large Send & Receive Offload, TOE Need to work with TOE vendors Turn off offload Do timing in NICs

Iperf vs thrulay
Iperf has multi streams
Thrulay more manageable & gives RTT
They agree well
Throughput ~ 1/avg(RTT)
[Figure: achievable throughput (Mbits/s) and RTT (ms), showing minimum, average, and maximum RTT]

BUT… At 10 Gbits/s on a transatlantic path, slow start takes over 6 seconds. To get 90% of the measurement in congestion avoidance, need to measure for 1 minute (52.5 GBytes at 7 Gbits/s, today’s typical performance)
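The arithmetic behind this slide can be checked with a short sketch. It assumes, as a simplification, that the whole test runs at the quoted rate, so the byte count is an upper bound; the function name `measurement_plan` and its parameters are illustrative and not part of any IEPM-BW tool.

```python
def measurement_plan(slow_start_s, steady_fraction, rate_gbits_per_s):
    """Return (test duration in seconds, data volume in GBytes).

    slow_start_s     -- time spent in slow start (6 s on the slide)
    steady_fraction  -- share of the test that should be congestion avoidance
    rate_gbits_per_s -- assumed throughput for the whole test (simplification)
    """
    # Slow start may occupy at most (1 - steady_fraction) of the test.
    duration_s = slow_start_s / (1.0 - steady_fraction)
    # Gbits -> GBytes: divide by 8.
    gbytes = duration_s * rate_gbits_per_s / 8.0
    return duration_s, gbytes


duration_s, gbytes = measurement_plan(6.0, 0.90, 7.0)
print(f"{duration_s:.1f} s, {gbytes:.1f} GBytes")  # prints "60.0 s, 52.5 GBytes"
```

A one-minute test at 7 Gbits/s therefore moves roughly 52.5 GBytes, which is why such measurements are considered intrusive.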

Passive Use Netflow records at border Per flow provide start/stop time, bytes/packets etc. Collect records for several weeks Divide by remote site, add parallel streams Fold data onto one week, see bands at known capacities

Netflow 2/2 Use existing traffic, no extra traffic Works on fast networks

Forecasting and Anomaly detection

Anomaly Detection Anomaly is when the actual value significantly differs from the expected value So need forecasts to find anomalies Focus has been on ABwE time-series measurements: Packet pair dispersion on 20 packets Send 20 packet pairs back to back and measure one-way packet separation at remote end Minimum gives an indication of bottleneck capacity of link Measurement each 3 minutes Low network impact BUT very noisy so hard test case

Plateau, most intuitive
For each observation:
  If it is outside the history buffer mean mh ± b*sh (sh = standard deviation of the history buffer), add it to the trigger buffer
  Else add it to the history buffer, and remove the oldest point from the trigger buffer
When the trigger buffer holds > t points, a trigger is issued:
  Check if (mh - mt) / mh > D (mt = trigger buffer mean) & 90% of the trigger points fall in the last T mins; if so, we have an event
  Move the trigger buffer to the history buffer
Parameters: history length = 1 day, t = trigger length = 3 hours, b = standard deviations = 2
We set the history buffer length to one day in order to minimize the lag between the history mean and the observations due to diurnal changes.
[Figure: observations and a detected event, with trigger % full, history mean, and history mean – 2 * stdev]
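The steps on this slide can be sketched as follows. This is a minimal illustration, not the SLAC Perl implementation: the function name, the parameter names, and the warm-up rule (fill the history buffer before testing anything) are assumptions, and the "90% of trigger points in the last T minutes" check is left out for brevity.

```python
from collections import deque
from statistics import mean, pstdev


def plateau_events(observations, history_len=480, trigger_len=60,
                   b=2.0, drop=0.10):
    """Return indices at which a step-down event is declared.

    history_len, trigger_len -- buffer sizes in points (hypothetical defaults)
    b    -- number of standard deviations defining the band around the mean
    drop -- threshold D on the relative drop (mh - mt) / mh
    """
    history = deque(maxlen=history_len)   # oldest points fall off automatically
    trigger = deque()
    events = []
    for i, y in enumerate(observations):
        if len(history) < history_len:    # warm-up (assumption): fill history first
            history.append(y)
            continue
        m_h, s_h = mean(history), pstdev(history)
        if abs(y - m_h) > b * s_h:
            trigger.append(y)             # outside the band: candidate point
        else:
            history.append(y)
            if trigger:
                trigger.popleft()         # a normal point ages out one candidate
        if len(trigger) >= trigger_len:
            m_t = mean(trigger)
            if m_h > 0 and (m_h - m_t) / m_h > drop:
                events.append(i)          # large enough drop: report an event
            history.extend(trigger)       # the new level becomes history
            trigger.clear()
    return events
```

With one measurement every 3 minutes, a one-day history is 480 points and a 3-hour trigger buffer is 60 points, which is where the defaults above come from.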

K-S (Kolmogorov-Smirnov)
For each observation:
  Compare the previous 100 observations with the next 100 observations
  Measure the vertical difference between the two CDFs
  How does it differ from random CDFs?
  Expressed as % difference
Compare K-S with Plateau: the trigger buffer reporting the event well after the start of the step down is partially an artifact. It could, for example, report the start of the event as the time when the trigger buffer reached 10% full. However, K-S is still more accurate in defining the time when the change was greatest.
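A minimal sketch of the sliding two-sample K-S comparison described here, in plain Python. The function names and the small default window are illustrative; the SLAC version was written in C and its details are not shown in the slides.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest vertical distance between the empirical CDFs of a and b (0..1)."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x.
        f_a = bisect_right(a, x) / len(a)
        f_b = bisect_right(b, x) / len(b)
        d = max(d, abs(f_a - f_b))
    return d


def ks_series(obs, window=100):
    """For each index i, compare the `window` points before i with the
    `window` points starting at i. Returns a list of (i, statistic) pairs."""
    return [(i, ks_statistic(obs[i - window:i], obs[i:i + window]))
            for i in range(window, len(obs) - window + 1)]


# Toy usage: a clean step down at index 20 gives the maximum statistic there.
series = ks_series([100] * 20 + [50] * 20, window=10)
print(max(series, key=lambda p: p[1]))  # prints (20, 1.0)
```

Multiplying the statistic by 100 gives the percentage difference, and the index of the maximum is a natural estimate of when the change was greatest.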

Compare
Results between K-S & plateau very similar, using K-S coefficient threshold = 70%
Current plateau only finds negative changes
  Useful to see when condition returns to normal
K-S implemented in C and executes faster than Plateau (in Perl), depends on parameters
K-S more formalized
Plateau and K-S work well for non seasonal observations (e.g. small changes day/night)
Plateau takes about 14 mins for 100 days
H-W FNAL takes 7 mins
K-S takes:
  1.08 min on ±100 points
  3.07 min on ±200 points
  14.75 min on ±400 points

Seasons & false alerts
Congestion on Monday following a quiet weekend causes a high forecast, which gives an alert
Also, a history buffer whose length is not one day causes the history mean to be out of sync with the observations

Diurnal Variation People arriving at work between 19:00 & 22:00 PDT (7:00 & 10:00 PK time) cause sudden drop in dynamic capacity

Effect on events Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time) Causes more anomalous events around this time

Seasonal Changes
Use Holt-Winters (H-W) technique:
  Uses triple exponentially weighted moving average (EWMA)
  EWMA(i) = α * Obs(i) + (1 - α) * EWMA(i-1)
  Three terms, each with its own parameter (α, β, γ), that take into account local smoothing, long term seasonal smoothing, and trends
  The trend component for our data is flat.
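To make the three smoothing terms concrete, here is a minimal sketch of the additive form of Holt-Winters. The additive form is chosen for simplicity and is an assumption: the slides do not say which variant each implementation uses. The initial values (first-season mean, zero trend) and the names `level_w`, `trend_w`, and `season_w` are likewise illustrative.

```python
def holt_winters_additive(y, season_len, level_w, trend_w, season_w):
    """One-step-ahead forecasts for y[season_len:], additive Holt-Winters.

    level_w, trend_w, season_w -- the three smoothing weights, each in (0, 1)
    """
    # Crude initial values (assumption): the first season sets level and shape.
    level = sum(y[:season_len]) / season_len
    trend = 0.0
    seasonal = [v - level for v in y[:season_len]]
    forecasts = []
    for t in range(season_len, len(y)):
        s = seasonal[t % season_len]          # seasonal term from one season ago
        forecasts.append(level + trend + s)   # forecast made before seeing y[t]
        prev_level = level
        # Each update below is an EWMA of a different quantity.
        level = level_w * (y[t] - s) + (1 - level_w) * (level + trend)
        trend = trend_w * (level - prev_level) + (1 - trend_w) * trend
        seasonal[t % season_len] = season_w * (y[t] - level) + (1 - season_w) * s
    return forecasts
```

On a perfectly periodic series the forecasts reproduce the series, and the trend term stays at zero, matching the slide's remark that the trend component is flat for this kind of data.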

H-W Implementation
Need regularly spaced data (else going back one season is difficult, and gets out of sync):
  Interpolate data: select bin size
  Average points in bin
  If no points in first week bin then get data from future weeks
  For following weeks, missing data bins filled from previous week
Initial values for smoothing from NIST “Engineering Statistics Handbook”
Choose parameters by minimizing (1/N) Σ (F_t − y_t)^2
  F_t = forecast for time t as function of parameters, y_t = observation at time t
A week is special and defines a cycle of seasons. We do nothing special with the day.
Note we need a week's worth of data to get going
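The binning and gap-filling steps can be sketched as below. This is an illustration under stated assumptions: timestamps are seconds relative to the start of the first week, the function name `regularize` is hypothetical, and a slot that is empty in every week is simply left as `None`.

```python
def regularize(samples, bin_s, week_s, n_weeks):
    """Average irregular (timestamp_s, value) samples into fixed-size bins
    and fill empty bins from the same slot in a neighbouring week."""
    bins_per_week = int(week_s // bin_s)
    n_bins = bins_per_week * n_weeks
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for ts, value in samples:
        i = int(ts // bin_s)
        if 0 <= i < n_bins:
            sums[i] += value
            counts[i] += 1
    series = [sums[i] / counts[i] if counts[i] else None for i in range(n_bins)]
    # First week: take the same slot from the nearest future week that has data.
    for i in range(bins_per_week):
        if series[i] is None:
            for j in range(i + bins_per_week, n_bins, bins_per_week):
                if series[j] is not None:
                    series[i] = series[j]
                    break
    # Following weeks: take the same slot from the previous week.
    for i in range(bins_per_week, n_bins):
        if series[i] is None:
            series[i] = series[i - bins_per_week]
    return series
```

Filling from the same slot one week away, rather than from the neighbouring bin, keeps the filled values in phase with the weekly cycle that the seasonal term relies on.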

H-W Implementation
Three implementations evaluated (two new)
FNAL (Maxim Grigoriev)
  Inspiration for evaluating this method
Part of RRD (Brutlag)
  Limited control over what it produces and how it works
SLAC
  Implemented NIST formulation, different formulation/parameter values from Brutlag/FNAL, also added minimize sums of squares to get parms

Results

Example
Local smoothing 99% weight for last 24 hours
Linear trend 50% last 24 hours
Seasonal mainly from last week, but includes several weeks
Within an 80 minute window, 80% points outside deviation envelope ≡ event
Deviations are smoothed absolute residuals
Note the difference in weekend vs weekday
[Figure: observations, 1 hr avg, forecast, and deviations, with weekend and weekdays marked]
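The event rule on this slide (80% of the points in an 80 minute window outside the deviation envelope) can be sketched as follows. Several details are assumptions made for illustration: the envelope half-width is `width` times the smoothed absolute residual, the deviation is smoothed with a simple EWMA of weight `dev_w`, and `dev0` seeds the deviation. The slides do not give these values.

```python
def envelope_events(obs, forecast, window=27, frac=0.8,
                    dev_w=0.1, width=2.0, dev0=1.0):
    """Return indices where at least `frac` of the last `window` points lie
    outside forecast +/- width * deviation.

    window=27 approximates 80 minutes of 3-minute samples (assumption).
    """
    dev = dev0
    outside = []
    events = []
    for i, (y, f) in enumerate(zip(obs, forecast)):
        resid = abs(y - f)
        outside.append(resid > width * dev)      # test against current envelope
        dev = dev_w * resid + (1 - dev_w) * dev  # smoothed absolute residual
        recent = outside[-window:]
        if len(recent) == window and sum(recent) >= frac * window:
            events.append(i)
    return events
```

Because the deviation adapts to the residuals, a persistent step widens the envelope after a while, so the run of flagged indices ends on its own; the first index of a run is the natural event time.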

Evaluation
Created a library of time series for 100 days from June through Sep 2004 for 40 hosts
Analyzed using Plateau and saved all events where trigger buffer filled (no filters on size of step)
  23 hosts had 120 candidate events
  Event types: steps; diurnal changes; congestion from cron jobs, bandwidth tests, flash crowds
Classify ~120 events as to whether interesting
  Large, sharp drop in bandwidth, persist for >> 3hrs
  Classification is subjective, large (mh-mt)/mh > 10%, also looked at 30%, step occurs in < 4 hours
Plateau easiest to understand and tune etc., also first to be developed.

Results
K-S shows similar results to Plateau
As adjust parameters to reduce false positives then increase missed events
E.g. for plateau with trigger buffer = 3 hrs filled to 90% in < 220 minutes, history buffer = 1 day, effect of threshold D = (mh-mt)/mh
Plateau (b=2):
  D     False   Miss
  10%   16%     8%
  30%   2%      32%
[Figure label: K-S with ± 100 observations]

Conclusions
A few paths (10%) have strong seasonal effects
Plateau & K-S work well if only weak seasonal effects
K-S detects both step downs & ups, also gives accurate time estimate of event (good for correlations)
H-W promising for seasonal effects, but:
  Is more complex, and requires more parameters which may not be easy to estimate
  Requires regular data (interpolation step)
CPU time can depend critically on parameters chosen, e.g. increasing K-S range from ±100 to say ±400 increases CPU time by factor 14
H-W works, still need to quantify its effectiveness
Looking at PCA to evaluate multiple metrics simultaneously (e.g. fwd & bwd traffic, RTT) AND multiple paths

Future Work Future Development in PCA Enable looking at multiple measurements simultaneously E.g. RTT, loss, capacity …; multiple routes Neural networks to interpolate heavyweight/infrequent measurements based on light weight more frequent Continue Netflow passive exploration

Some Uses: Detect anomalies reliably (few false positives, few misses): Make extra measurements related to anomaly, e.g. ping, traceroute, performance history etc. Notify people (e.g. via email) Forecast into future taking account diurnal changes: Make long-term (hours – days) integrated estimates of performance with probabilities Use for data location selection

Apply forecasts to Router utilizations to find bottlenecks Get measurements from Internet2/ESnet/Geant SONAR project via NMWG web services Save as time series, forecast for each interface For given path and duration forecast most probable bottlenecks Use MPLS to apply QoS at bottlenecks (rather than for the entire path) for selected applications

More information
SLAC Plateau implementation: www.acm.org/sigs/sigcomm/sigcomm2004/workshop_papers/nts26-logg1.pdf
SLAC H-W implementation: www-iepm.slac.stanford.edu/monitoring/forecast/hw.html
Eng. Statistics Handbook: http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc435.htm
IEPM-BW Measurement Infrastructure: http://www-iepm.slac.stanford.edu/

Events
Can look at residuals (F_t − y_t), or χ²
Could use K-S or plateau on: residuals, or on the local smoothing (i.e. after removing long term seasonal effects)

Mark Burgess Method A two dimensional time-series approach in order to classify a periodic, adaptive threshold for service level anomaly detection An iterative algorithm is applied to history analysis on this periodic time to provide a smooth roll-off in the significance of the data with time. This method was originally designed to detect anomalous behavior on a single host.

Compare with KS
Iperf from SLAC to Caltech, Feb & Mar 05
The KS technique works very well for long-term anomalous variations in Internet end-to-end traffic.
The Mark Burgess technique detects anomalies for each and every unwanted huge spike/variation (real time).
[Figures: KS result; Mark Burgess technique result]

PCA
PCA is a coordinate transformation method that maps a given set of data points onto new axes. These axes are called the principal axes or principal components.
For network anomaly detection, PCA divides the data into normal & abnormal subspaces
Procedure:
  Arrange the data into matrix form
  Zero-mean the matrix data (subtract each column's mean)
  Calculate the covariance matrix
  Calculate the principal components
  Applying the formula (I − P Pᵀ)(data matrix) yields the result, where P is the matrix of principal components.
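The procedure can be sketched in plain Python for the simplest case. As an assumption for illustration, only the first principal component is treated as the normal subspace, so P has a single column; the component is found by power iteration from a uniform start vector, which would fail if that vector happened to be orthogonal to the leading component. The function names are hypothetical.

```python
def top_component(cov, iters=200):
    """Leading eigenvector of a symmetric matrix, by power iteration."""
    n = len(cov)
    v = [1.0 / n ** 0.5] * n
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_residuals(rows):
    """Norm of (I - p p^T) x for each zero-meaned row x, where p is the
    first principal component. Large values lie in the abnormal subspace."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    p = top_component(cov)
    norms = []
    for c in centered:
        score = sum(ci * pi for ci, pi in zip(c, p))       # projection on p
        resid = [ci - score * pi for ci, pi in zip(c, p)]  # (I - p p^T) c
        norms.append(sum(x * x for x in resid) ** 0.5)
    return norms
```

For example, with two metrics that normally move together, a row where one rises while the other falls gets a large residual norm, while rows that follow the usual relationship stay near zero.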

PCA Results on SLAC-BINP (June-Sep, 2004)
[Figure labels: due to 10% rise in dbcap; anomalous; good events; 10% rise in RTT]
Caught all the events that were detected by HW, Plateau and KS
Can work on multiple parameters
Tested PCA on six routes so far: SLAC-FZK, SLAC-DESY, SLAC-CALTECH, SLAC-NIIT, SLAC-BINP, SLAC-UMICH