1 Evaluation of Techniques to Detect Significant Performance Problems using End-to-end Active Network Measurements Les Cottrell, SLAC 2006 IEEE/IFIP Network.


Presentation transcript:

1 Evaluation of Techniques to Detect Significant Performance Problems using End-to-end Active Network Measurements Les Cottrell, SLAC 2006 IEEE/IFIP Network Operations & Management Symposium Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM)

2 Outline
Why do we want forecasting & anomaly detection?
What are we using for the input data?
  – And what are the problems?
How do we make forecasts and detect anomalies?
  – First approaches
  – The real world
Results
Conclusions & futures
Possible uses

3 Uses of Techniques
Automated problem identification:
  – Admins cannot review 100s of graphs each day
  – Alerts for network administrators, e.g. bandwidth changes in time series, iperf, SNMP
  – Alerts for systems people: OS/host metrics
  – Anomalies for security
Forecasts (a by-product of the techniques) for Grid middleware, e.g. replica manager, data placement

4 Data

5 Measurement Topology
40 target hosts in 13 countries
Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s
Paths traverse ~50 ASes and 15 major Internet providers
5 targets at PoPs, rest at end sites

6 Using Active IEPM-BW measurements
Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. the HEP tiered model
Makes regular measurements:
  – Ping (RTT, connectivity), traceroute
  – pathchirp, pathload, ABwE (packet pair dispersion)
  – iperf (single & multi-stream), thrulay
Lots of analysis and visualization
Running at CERN, SLAC, FNAL, BNL, Caltech to about 40 remote sites
  – bw.slac.stanford.edu/slac_wan_bw_tests.html

7 abing
Uses packet pair dispersion of 20 packets to provide:
  – Capacity, cross-traffic, available bandwidth
  – At 3 minute intervals
  – Very noisy time series data
Started with this:
  – Knew the developer
  – Simple method
[Figure: capacity, moving-averaged over 1 hour. Diagram: minimum packet spacing is set at the bottleneck and preserved on higher-speed links.]
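
The packet-pair idea on the abing slide (minimum spacing is set at the bottleneck and preserved on faster links) can be sketched as below. This is a generic packet-pair capacity estimate, not abing's actual estimator; the function name, the packet size and the gap values are illustrative assumptions.

```python
def capacity_from_dispersion(packet_bytes, gaps_s):
    """Estimate bottleneck capacity in bits/s from packet-pair arrival gaps.

    Assumption: the smallest observed gap is the spacing imposed by the
    bottleneck link, so capacity = packet size / minimum gap.
    """
    positive = [g for g in gaps_s if g > 0]
    if not positive:
        raise ValueError("need at least one positive inter-packet gap")
    return packet_bytes * 8 / min(positive)


# Hypothetical gaps (seconds) for 1500-byte packets.
estimate = capacity_from_dispersion(1500, [0.00012, 0.00015, 0.00020])
print(f"{estimate / 1e6:.1f} Mbits/s")
```

Larger gaps in the sample reflect queueing behind cross-traffic, which is one reason a single train of 20 packets gives a noisy series.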

8 Available bandwidth
Accuracy:
  – From the PAM paper, pathload is most accurate, followed by pathchirp; abing has problems at lower speeds
Overhead:
  – Pathload has 100 times the network overhead of pathchirp and abing
Measurement duration:
  – Abing takes < 1 sec per measurement
  – Pathchirp takes ~10 secs
  – Pathload takes tens of seconds, depends on RTT, and can time out
Consistency of results:
  – Abing very noisy
  – Pathchirp in between
  – Pathload smoother, multi-modal
[Figure: pathload and pathchirp time series, SLAC-Caltech, March '06]

9 Iperf vs thrulay
Both give TCP achievable throughput
Thrulay is more manageable & gives RTT
They agree well
Throughput ~ 1/avg(RTT)
For big RTT need multiple streams
[Figure: thrulay results; axes: RTT (ms), achievable throughput (Mbits/s); series: minimum, maximum and average RTT]
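
One way to read "Throughput ~ 1/avg(RTT)" and the need for multiple streams on long paths is the window-limited bound for TCP: a sender can have at most one window in flight per round trip. The sketch below assumes a fixed window per stream; the 64 KiB window and the RTT values are illustrative and do not come from the slides.

```python
def window_limited_throughput(window_bytes, rtt_s, streams=1):
    """Upper bound on TCP throughput in bits/s with a fixed window per stream."""
    if rtt_s <= 0:
        raise ValueError("RTT must be positive")
    return streams * window_bytes * 8 / rtt_s


# Hypothetical 64 KiB window: doubling the RTT halves the bound,
# and adding parallel streams raises it proportionally.
for rtt in (0.05, 0.1, 0.2):
    single = window_limited_throughput(65536, rtt)
    four = window_limited_throughput(65536, rtt, streams=4)
    print(f"RTT {rtt * 1000:.0f} ms: {single / 1e6:.2f} vs {four / 1e6:.2f} Mbits/s")
```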

10 Forecasting and Anomaly detection

11 Anomaly Detection
An anomaly is when the actual value differs significantly from the expected value
  – So need forecasts to find anomalies
  – Focus was initially on abing time-series measurements:
      Measurement every 3 minutes
      Low network impact
      BUT very noisy, so a hard test case

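12 Plateau, most intuitive
For each observation:
  – If it lies outside the history buffer mean $m_h$ ± a set number of standard deviations $s_h$, add it to the trigger buffer
  – Else add it to the history buffer, and remove the oldest point from the trigger buffer
When the trigger buffer holds more points than the trigger length, a trigger is issued:
  – Check whether $(m_h - m_t)/m_h > D$ (with $m_t$ the trigger buffer mean) and the trigger buffer became 90% full within the last T minutes; if so, an event is declared
  – Move the trigger buffer to the history buffer
Parameters: history length = 1 day, trigger length = 3 hours, number of standard deviations = 2
[Figure: observations, history mean, history mean − 2 × stdev, trigger buffer % full, and detected events (*)]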
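
The Plateau steps on this slide can be sketched as follows. This is a simplified illustration, not the SLAC Perl implementation: buffers are sized in points rather than time, the "90% full within T minutes" timing check is omitted, and the names `plateau_events`, `n_std` and `drop` are assumptions introduced here.

```python
from collections import deque
from statistics import mean, pstdev


def plateau_events(values, history_len=20, trigger_len=5, n_std=2.0, drop=0.3):
    """Return indices at which a sustained drop (a plateau event) is declared."""
    history = deque(maxlen=history_len)  # recent "normal" observations
    trigger = deque()                    # recent outliers
    events = []
    for i, x in enumerate(values):
        if len(history) < history_len:
            history.append(x)            # warm-up: fill the history buffer first
            continue
        m_h, s_h = mean(history), pstdev(history)
        if abs(x - m_h) > n_std * s_h:
            trigger.append(x)            # outside the envelope
        else:
            history.append(x)            # normal point; oldest history point drops out
            if trigger:
                trigger.popleft()        # and the trigger buffer drains by one
        if len(trigger) >= trigger_len:
            m_t = mean(trigger)
            if m_h != 0 and (m_h - m_t) / m_h > drop:
                events.append(i)         # relative drop exceeds the threshold D
            history.extend(trigger)      # the new level becomes the history
            trigger.clear()
    return events


# Hypothetical series: steady around 100, then a step down to 50.
series = [100, 102, 98, 101, 99] * 4 + [50] * 5
print(plateau_events(series))
```

Because only `(m_h - m_t) / m_h > drop` is tested, this sketch, like the Plateau version described in the slides, reports step downs only.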

13 K-S (Kolmogorov-Smirnov)
For each observation, compare the previous 100 observations with the next 100 observations:
  – Compare the vertical difference in the CDFs
  – How does it differ from random CDFs?
  – Expressed as a % difference
[Figure: comparison of K-S with Plateau]
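
A minimal sketch of the windowed K-S comparison, assuming the reported "% difference" is the maximum vertical distance between the two empirical CDFs expressed as a fraction. The slides do not say exactly how the percentage is derived, and the SLAC version is written in C, so the function names and the small window used in the example are illustrative only.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Maximum vertical distance between the empirical CDFs of a and b."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)  # fraction of a that is <= x
        fb = bisect_right(sb, x) / len(sb)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d


def ks_change_points(values, window=100, threshold=0.7):
    """Indices where the windows before and after differ by at least threshold."""
    hits = []
    for i in range(window, len(values) - window + 1):
        before = values[i - window:i]
        after = values[i:i + window]
        if ks_statistic(before, after) >= threshold:
            hits.append(i)
    return hits


# Hypothetical step change, using a small window to keep the example short.
print(ks_change_points([10] * 5 + [1] * 5, window=3))
```

The statistic is symmetric in its two samples, so step ups and step downs are both flagged. Each position sorts and scans two full windows, so the cost grows quickly with the window size.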

14 Compare
Results from K-S & Plateau are very similar, using a K-S coefficient threshold of 70%
The current Plateau implementation only finds negative changes
  – Useful to see when conditions return to normal
K-S is implemented in C and executes faster than Plateau (in Perl), depending on parameters
K-S is more formalized
Plateau and K-S work well for non-seasonal observations (e.g. small day/night changes)

15 Seasons & false alerts
Congestion on Monday following a quiet weekend causes a high forecast, which gives an alert
Also, a history buffer whose length is not one day causes the history mean to be out of sync with the observations

16 Effect on events
Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time)
Causes more anomalous events around this time

17 Seasonal Changes
Use the Holt-Winters (H-W) technique:
  – Uses a triple exponentially weighted moving average
  – Three terms, each with its own smoothing parameter, that take into account local smoothing, long-term seasonal smoothing, and trends
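
A sketch of triple exponential smoothing with its three terms (level, trend, seasonal), each with its own parameter. This uses the additive form with a simple initialisation for brevity, which may differ from the NIST formulation used at SLAC; the names `alpha`, `beta`, `gamma` and the example series are assumptions made here.

```python
def holt_winters_additive(y, m, alpha, beta, gamma):
    """One-step-ahead forecasts for y[m:], additive seasonality of period m."""
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons")
    level = sum(y[:m]) / m                            # mean of the first season
    trend = (sum(y[m:2 * m]) - sum(y[:m])) / (m * m)  # average per-step change
    season = [y[i] - level for i in range(m)]         # first-season offsets
    forecasts = []
    for t in range(m, len(y)):
        s = season[t % m]
        forecasts.append(level + trend + s)           # forecast made before seeing y[t]
        new_level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % m] = gamma * (y[t] - new_level) + (1 - gamma) * s
        level = new_level
    return forecasts


# Hypothetical perfectly periodic series with period 4 and no trend.
print(holt_winters_additive([10, 20, 30, 20] * 3, 4, 0.5, 0.5, 0.5))
```

For a perfectly periodic input the forecasts reproduce the series, which makes a convenient sanity check before trying real measurements.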

18 H-W Implementation
Need regularly spaced data (else going back one season is difficult, and gets out of sync):
  – Interpolate data: select a bin size and average the points in each bin
  – If there are no points in a first-week bin, get data from future weeks
  – For following weeks, missing data bins are filled from the previous week
Initial values for smoothing from the NIST "Engineering Statistics Handbook"
Choose parameters by minimizing $\frac{1}{N}\sum_t (F_t - y_t)^2$
  – $F_t$ = forecast for time $t$ as a function of the parameters, $y_t$ = observation at time $t$
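
The regularisation and fitting steps on this slide can be sketched as below: average samples into fixed bins, fill empty first-season bins from later seasons and other empty bins from the previous season, and score a set of forecasts by mean squared error. The function names, the bin sizes and the sample data are illustrative assumptions, not the SLAC code.

```python
def bin_series(samples, bin_s, n_bins, season_bins):
    """Average (timestamp_s, value) samples into n_bins regular bins."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for ts, v in samples:
        k = int(ts // bin_s)
        if 0 <= k < n_bins:
            sums[k] += v
            counts[k] += 1
    binned = [sums[k] / counts[k] if counts[k] else None for k in range(n_bins)]
    # First season: borrow from the same slot in a later season.
    for k in range(min(season_bins, n_bins)):
        j = k + season_bins
        while binned[k] is None and j < n_bins:
            binned[k] = binned[j]
            j += season_bins
    # Later seasons: copy the same slot from one season earlier.
    for k in range(season_bins, n_bins):
        if binned[k] is None:
            binned[k] = binned[k - season_bins]
    return binned


def mean_squared_error(forecasts, observations):
    """(1/N) * sum of (F_t - y_t)^2 over paired forecasts and observations."""
    pairs = list(zip(forecasts, observations))
    return sum((f - y) ** 2 for f, y in pairs) / len(pairs)


# Hypothetical irregular samples, 60 s bins, a "season" of 2 bins.
print(bin_series([(0, 10), (30, 20), (130, 5), (190, 7)], 60, 4, 2))
```

Parameters could then be chosen by evaluating `mean_squared_error` over a grid of candidate smoothing values and keeping the smallest. A slot that is empty in every season stays `None` in this sketch.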

19 H-W Implementation
Three implementations evaluated (two new):
  – FNAL (Maxim Grigoriev): inspiration for evaluating this method
  – Part of RRD (Brutlag): limited control over what it produces and how it works
  – SLAC: implemented the NIST formulation, with a different formulation and parameter values from Brutlag/FNAL; also added minimization of the sum of squares to get the parameters

20 Results

21 Example
Local smoothing: 99% weight for last 24 hours
Linear trend: 50% for last 24 hours
Seasonal: mainly from last week, but includes several weeks
Within an 80 minute window, 80% of points outside the deviation envelope ≡ event
[Figure: observations, forecast and deviations, 1 hr average, weekdays vs weekend]
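
The event rule on this slide (a given fraction of points in a window falling outside the deviation envelope) can be sketched as below. The window here is counted in points rather than minutes, and the function name and the envelope values are hypothetical.

```python
def envelope_events(observed, lower, upper, window=5, fraction=0.8):
    """Indices ending a window in which >= fraction of points leave the envelope."""
    outside = [not (lo <= y <= hi) for y, lo, hi in zip(observed, lower, upper)]
    events = []
    for end in range(window, len(outside) + 1):
        if sum(outside[end - window:end]) >= fraction * window:
            events.append(end - 1)  # index of the last point in the window
    return events


# Hypothetical data: forecast envelope 8..12, observations drop from 10 to 1.
obs = [10] * 5 + [1] * 5
print(envelope_events(obs, [8] * 10, [12] * 10))
```

With measurements every 3 minutes, an 80 minute window corresponds to roughly 27 points.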

22 Evaluation
Created a library of time series for 100 days from June through Sep 2004 for 40 hosts
Analyzed using Plateau and saved all events where the trigger buffer filled (no filters on size of step)
  – 23 hosts had 120 candidate events
  – Event types: steps; diurnal changes; congestion from cron jobs, bandwidth tests, flash crowds
Classified the ~120 events as to whether interesting
  – Large, sharp drop in bandwidth, persisting for >> 3 hrs

23 Results
K-S shows similar results to Plateau
As parameters are adjusted to reduce false positives, missed events increase
  – E.g. for Plateau with trigger buffer = 3 hrs filled to 90% in < 220 minutes and history buffer = 1 day, the effect of the threshold $D = (m_h - m_t)/m_h$ is:

    D      False positives   Missed events
    10%    16%               8%
    30%    2%                32%

We are generating alerts from events & gathering extra diagnostics to send to network admins
[Figure: K-S with ±100 observations vs Plateau with 2 standard deviations]

24 Conclusions
A few paths (10%) have strong seasonal effects
Plateau & K-S work well if there are only weak seasonal effects
  – K-S detects both step downs & step ups, and also gives an accurate time estimate of the event (good for correlations)
H-W is promising for seasonal effects, but:
  – Is more complex, and requires more parameters, which may not be easy to estimate
  – Requires regular data (interpolation step)
  – Can be used to remove seasonal effects and then apply Plateau
CPU time can depend critically on the parameters chosen, e.g. increasing the K-S range from ±100 to, say, ±400 increases CPU time by a factor of 14
H-W works, but its effectiveness still needs to be quantified

25 Current & Future Work
Different objective function to minimize for H-W parameters
Effect of other metrics:
  – Frequency of measurements, speed of detection
  – Noisiness (min-RTT and pathload are smoother)
Future development in PCA
  – Enable looking at multiple measurements simultaneously, e.g. RTT, loss, capacity …; multiple routes
Neural networks, wavelets, ARIMA …
Interpolate heavyweight/infrequent measurements based on lightweight, more frequent ones
Netflow passive exploration
Manually study events, leading to development of automated diagnosis of events: traceroutes, router utilizations, host measurements

26 More information
SLAC Plateau implementation
  – www.acm.org/sigs/sigcomm/sigcomm2004/workshop_papers/nts26-logg1.pdf
SLAC H-W implementation
  – www-iepm.slac.stanford.edu/monitoring/forecast/hw.html
Eng. Statistics Handbook –
IEPM-BW Measurement Infrastructure –