Experiences in Traceroute and Available Bandwidth Change Analysis


Experiences in Traceroute and Available Bandwidth Change Analysis Connie Logg, Les Cottrell & Jiri Navratil SIGCOMM’04 Workshops September 3, 2004 Modern data intensive science such as HENP requires the ability to copy large amounts of data between collaborating sites. This in turn requires high-performance, reliable end-to-end network paths and the ability to take advantage of them. End-users thus need both long-term and near real-time estimates of the network and application performance of such paths for planning, setting expectations, and trouble-shooting. The IEPM-BW (Internet End-to-end Performance Monitoring - BandWidth) project was instigated in 2001 to meet the above needs for the BaBar HENP community. This produced a toolkit for monitoring Round Trip Times (RTT), TCP throughput (iperf), file copy throughput (bbftp, bbcp and GridFTP), traceroute and, more recently, lightweight cross-traffic and available bandwidth measurements (ABwE). Since then it has been extended to LHC, CDF, D0, ESnet, Grid, and high performance network Research & Education sites. About 60-70 paths are now being monitored (including about 50 remote sites); the monitoring toolkit has been installed at ten sites and is in production at three or four sites, in particular FNAL (for CMS, CDF and D0) and SLAC (for BaBar and PPDG). Each monitoring site is relatively independent, and the monitoring is designed to map to the design of modern HENP tiering of sites, i.e. it is hierarchical rather than full mesh. The monitoring toolkit is installed at the site and its contact chooses the remote hosts it wishes to monitor. Work is in progress to analyze and visualize the traceroute measurements and to automatically detect anomalous step-down changes in bandwidth. Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP.

Motivation High Energy Nuclear Physics (HENP) analysis requires the effective distribution of large amounts of data between collaborators world wide. In 2001 we started development on a project, IEPM-BW (Internet End-to-end Performance Monitoring of BandWidth), for the network paths to our collaborators. It has evolved over time from network intensive measurements (run about every 90 minutes) to lightweight non-intensive measurements (run about every 3-5 minutes).

IEPM-BW Version 1 IEPM-BW version 1 (September 2001) performed sequential heavyweight (iperf and file transfer tests) and lightweight (ping & traceroute) measurements to a few dozen collaborator nodes; it could only be run a few times a day. Concurrently, an Available BandWidth Estimation (ABWE) tool was being developed by Jiri Navratil to perform lightweight available bandwidth (ABW) and link dynamic bottleneck capacity (DBCAP) estimates.

IEPM-BW Version 2 IEPM-BW version 2 incorporated ABWE as a probing tool, and extensive comparisons were made between the heavyweight iperf and file transfer tests and the ABWE results. ABWE results tracked well with iperf and file transfer tests in many cases.

Examples [Figure: 28-day bandwidth history for the path from SLAC to CALTECH, with ABWE, iperf (1 stream) and bbftp series. Annotations: forward routing changes; reverse routing changes; new CENIC path at 1000 Mbits/s; drop to 622 Mbits/s path; back to new CENIC path.] During this time we can see 3 different situations caused by different routing from SLAC to CALTECH. [Figure: scatter plots of iperf versus ABW on different paths (range 20-800 Mbits/s), showing agreement of the two methods over a 28-day history.]

Challenges The monitoring was very useful to us, but: there were too many graphs and reports to examine manually every day; we could only run it a few times a day; we needed to automate what the brain does, i.e. pick out changes. Changes of concern: route and bandwidth.

Traceroute Analysis Need a way to visualize traceroutes taken at regular intervals to several tens of remote hosts. Report the pathologies identified. Allow quick visual inspection for: multiple route changes, significant route changes, and pathologies. Drill down to more detailed information: histories, topologies, related bandwidth & alerts.

Display Many Routes on Single Page One page per day. One row per host, one column per hour. Identify unique routes with a number. Be able to inspect the route associated with a route number. Provide for analysis of long term route evolutions. Use a single character to identify a route that has not significantly changed; the character identifies the pathology of the route (usually a period (.) = no change). [Figure: traceroute summary table with callouts: route # at start of day gives an idea of route stability; multiple route changes (due to GEANT), later restored to original route; period (.) means no change.]

Pathology Encodings There are several pathologies associated with traceroutes, and we needed to find a way to encode them: hop does not respond (*); end host does not respond, i.e. 30 hops (|); stutters ("); hop change only affects 4th octet (:); hop change but address in same AS (a); ICMP checksum (orange); ! annotation, e.g. network unreachable, admin blocked; multi-homed host; probe type: UDP or ICMP.
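The encodings above can be illustrated with a small classifier. This is only a sketch under stated assumptions, not the IEPM-BW code: the function name `classify_route`, the `asn_of` lookup callback, the use of `None` for a non-responding hop, the reading of "stutter" as the same hop appearing twice in a row, and the precedence between symbols are all hypothetical choices made for illustration. Colour-coded and annotation-based encodings (ICMP checksum, `!` annotations, multi-homing, probe type) are left out.

```python
def classify_route(prev, cur, asn_of, max_hops=30):
    """Return a one-character summary symbol for `cur` relative to `prev`.

    prev, cur: lists of hop addresses (dotted-quad strings); None marks a hop
               that did not respond (an assumption of this sketch).
    asn_of:    hypothetical callable mapping an address to an AS number or None.
    Returns None when the change is significant, i.e. the caller should
    display a route number instead of a symbol.
    """
    # End host never answered: traceroute ran out to the hop limit.
    if len(cur) >= max_hops and cur[-1] is None:
        return "|"
    # At least one intermediate hop did not respond.
    if any(hop is None for hop in cur):
        return "*"
    # Stutter, read here as the same hop reported twice in a row.
    if any(a == b for a, b in zip(cur, cur[1:])):
        return '"'
    if cur == prev:
        return "."
    if len(cur) != len(prev):
        return None
    changed = [(p, c) for p, c in zip(prev, cur) if p != c]
    # Every changed hop differs only in its 4th octet.
    if all(p is not None and p.rsplit(".", 1)[0] == c.rsplit(".", 1)[0]
           for p, c in changed):
        return ":"
    # Every changed hop stays inside the same autonomous system.
    if all(p is not None and asn_of(p) is not None and asn_of(p) == asn_of(c)
           for p, c in changed):
        return "a"
    return None
```

The order of the checks decides which symbol wins when several pathologies occur in one traceroute; a real display would need to settle that order deliberately.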

Pathology Encodings [Figure: annotated traceroute summary table with callouts: change but same AS; no change; probe type; change in only 4th octet; end host not pingable; hop does not respond; stutter; multihomed; ICMP checksum; ! annotation (!X).]

Navigation traceroute to CCSVSN04.IN2P3.FR (134.158.104.199), 30 hops max, 38 byte packets 1 rtr-gsr-test (134.79.243.1) 0.102 ms … 13 in2p3-lyon.cssi.renater.fr (193.51.181.6) 154.063 ms !X

History Channel

AS’ information

Changes in network topology (BGP) can result in dramatic changes in performance. [Figure: snapshot of the traceroute summary table (remote host by hour) and samples of traceroute trees generated from the table, showing the Los-Nettos (100 Mbps) segment.] Notes: 1. Caltech misrouted via Los-Nettos 100 Mbps commercial net 14:00-17:00. 2. ESnet/GEANT working on routes from 2:00 to 14:00. 3. A previous occurrence went unnoticed for 2 months. 4. Next step is to auto detect and notify. [Figure: ABwE measurements in Mbits/s, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am, showing dynamic BW capacity (DBC), cross-traffic (XT) and available BW = (DBC-XT). Annotations: drop in performance when the original path SLAC-CENIC-Caltech changed to SLAC-ESnet-LosNettos (100 Mbps)-Caltech, with the ESnet-LosNettos segment in the path (100 Mbits/s); back to original path; changes detected by IEPM-Iperf and ABwE.]

Data Display Many different ways to look at traceroute data: output from the traceroute command, tabular format (facilitates comparisons), and topology maps. Tabular format:
#date      #time    #hops epoch      rtno node            route
08/31/2004 11:13:19 14    1093975999 3    node1.cesnet.cz ...,134.55.209.1,...,134.55.209.58,62.40.103.213,...,195.113.xxx.xxx
08/31/2004 11:23:37 14    1093976617 2    node1.cesnet.cz ...,134.55.209.1,...,134.55.209.200,62.40.103.214,...,195.113.xxx.xxx
08/31/2004 11:33:38 14    1093977218 3    node1.cesnet.cz ...,134.55.209.1,...,134.55.209.58,62.40.103.213,...,195.113.xxx.xxx
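As an illustration of why the tabular format facilitates comparisons, the sketch below parses records in that layout and reports where the route number changes between consecutive samples for the same node. The names `TraceRecord`, `parse_line` and `route_changes` are hypothetical, and the seven whitespace-separated fields are assumed from the header row shown on the slide; elided hops (`...`) are kept as ordinary tokens.

```python
from typing import NamedTuple


class TraceRecord(NamedTuple):
    date: str
    time: str
    hops: int
    epoch: int
    rtno: int
    node: str
    route: tuple


def parse_line(line):
    """Parse one data row: date, time, hops, epoch, rtno, node, route."""
    date, time, hops, epoch, rtno, node, route = line.split(None, 6)
    return TraceRecord(date, time, int(hops), int(epoch), int(rtno), node,
                       tuple(route.strip().split(",")))


def route_changes(records):
    """Return (epoch, old_rtno, new_rtno) for each change between neighbours."""
    changes = []
    for prev, cur in zip(records, records[1:]):
        if prev.node == cur.node and prev.rtno != cur.rtno:
            changes.append((cur.epoch, prev.rtno, cur.rtno))
    return changes


# The three sample rows from the slide.
SAMPLE = [
    "08/31/2004 11:13:19 14 1093975999 3 node1.cesnet.cz "
    "...,134.55.209.1,...,134.55.209.58,62.40.103.213,...,195.113.xxx.xxx",
    "08/31/2004 11:23:37 14 1093976617 2 node1.cesnet.cz "
    "...,134.55.209.1,...,134.55.209.200,62.40.103.214,...,195.113.xxx.xxx",
    "08/31/2004 11:33:38 14 1093977218 3 node1.cesnet.cz "
    "...,134.55.209.1,...,134.55.209.58,62.40.103.213,...,195.113.xxx.xxx",
]

if __name__ == "__main__":
    for change in route_changes([parse_line(row) for row in SAMPLE]):
        print(change)
```

On the sample rows this reports two changes, route 3 to route 2 and back to route 3, ten minutes apart.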

Data Display Historical list of routes (route number, first seen date, last seen date, hops):
#rt# firstseen  lastseen   route
0    1086844945 1089705757 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
1    1087467754 1089702792 ...,192.68.191.83,171.64.1.132,137,...,131.215.xxx.xxx
2    1087472550 1087473162 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
3    1087529551 1087954977 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
4    1087875771 1087955566 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,(n/a),131.215.xxx.xxx
5    1087957378 1087957378 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
6    1088221368 1088221368 ...,192.68.191.146,134.55.209.1,134.55.209.6,...,131.215.xxx.xxx
7    1089217384 1089615761 ...,192.68.191.83,137.164.23.41,(n/a),...,131.215.xxx.xxx
8    1089294790 1089432163 ...,192.68.191.83,137.164.23.41,137.164.22.37,(n/a),...,131.215.xxx.xxx
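A historical list like the one above could be built by giving each distinct hop sequence a number the first time it is seen and tracking its first and last sighting. The following is a minimal sketch of that bookkeeping, assuming exact hop-by-hop equality defines a "unique route"; the function name `build_route_history` and the dictionary keys are hypothetical, and the addresses in the usage note are documentation-range placeholders rather than measured data.

```python
def build_route_history(samples):
    """Number distinct routes in order of first appearance.

    samples: iterable of (epoch, route) pairs, where route is a sequence of
             hop addresses. Returns a list of dicts sorted by route number.
    """
    history = {}
    for epoch, route in samples:
        key = tuple(route)
        entry = history.get(key)
        if entry is None:
            # First sighting: the next free route number is the current count.
            history[key] = {"rtno": len(history), "firstseen": epoch,
                            "lastseen": epoch, "route": key}
        else:
            entry["firstseen"] = min(entry["firstseen"], epoch)
            entry["lastseen"] = max(entry["lastseen"], epoch)
    return sorted(history.values(), key=lambda e: e["rtno"])
```

For example, feeding it `[(100, ["192.0.2.1", "198.51.100.1"]), (200, ["192.0.2.1", "198.51.100.2"]), (300, ["192.0.2.1", "198.51.100.1"])]` yields route 0 seen from 100 to 300 and route 1 seen only at 200. Routes in the slide's list that look identical differ in hops elided by `...`.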

Summary One page per day to eyeball for route changes. Links are provided for ease of further examination. The tool does not yet alert on traceroute changes, but it is integrated with Bandwidth Change Analysis.

Bandwidth Change Analysis Purpose is to generate “alerts” about “major” drops in available bandwidth and/or link capacity. Simplistically: data is buffered into History and Trigger buffers. Examine the time spacing of the data and calculate the size of the History and Trigger buffers; we have chosen a “History” buffer of about 24 hours and a “Trigger” buffer of about 3 hours. Pick a threshold of change to alert on (~40%). Start with the oldest data. Load about 3 hours data into the history buffer. Calculate the mean and standard deviation (histmean and histsd).

Methodology Start examining the data in order of oldest to newest. If value > histmean - 2*histsd: put it in the history buffer, remove the oldest value from the trigger buffer, and recalculate histmean and histsd. Else put it in the trigger buffer; if the trigger buffer is not full, continue with the next value, and if it is full, evaluate the trigger condition.

Trigger Buffer is Full Calculate the trigger buffer mean (trigmean) and standard deviation (trigsd). If (histmean - trigmean)/histmean > threshold: 1st level alert. Load the trigger buffer data into the History buffer, clear the trigger buffer, and continue processing the data. NOTE: The process actually has various filtering conditions in it. This is not the end: it only identifies first level trigger conditions.
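The history/trigger buffer procedure described on these slides can be sketched as follows. This is an illustrative reading, not the production IEPM-BW code: the function name `detect_drops` and its parameters are hypothetical, buffer sizes are given in data points rather than hours, the population standard deviation is used, the trigger buffer is folded into the history whether or not an alert fires, and the additional filtering conditions mentioned in the note are omitted.

```python
from collections import deque
from statistics import mean, pstdev


def detect_drops(values, hist_len, trig_len, warmup, threshold=0.4, k=2.0):
    """Return the indices in `values` at which a first-level alert fires.

    values:    measurements ordered oldest to newest (e.g. Mbits/s).
    hist_len:  maximum number of points kept in the history buffer.
    trig_len:  number of points that makes the trigger buffer "full".
    warmup:    number of initial points loaded straight into the history.
    threshold: relative drop (histmean - trigmean) / histmean that alerts.
    k:         a point is "normal" if value > histmean - k * histsd.
    """
    history = deque(values[:warmup], maxlen=hist_len)
    trigger = deque()
    alerts = []
    histmean, histsd = mean(history), pstdev(history)

    for i in range(warmup, len(values)):
        value = values[i]
        if value > histmean - k * histsd:
            # Normal point: extend the history and age out one trigger point.
            history.append(value)
            if trigger:
                trigger.popleft()
            histmean, histsd = mean(history), pstdev(history)
        else:
            trigger.append(value)
            if len(trigger) >= trig_len:
                trigmean = mean(trigger)
                if histmean > 0 and (histmean - trigmean) / histmean > threshold:
                    alerts.append(i)
                # Fold the trigger data into the history and start over.
                history.extend(trigger)
                trigger.clear()
                histmean, histsd = mean(history), pstdev(history)
    return alerts
```

With a series that hovers around 100 for 20 points and then sits at 50, `detect_drops(series, hist_len=20, trig_len=5, warmup=10)` raises a single alert once five low points have accumulated; afterwards the lower level becomes part of the history, so it is not reported again. A longer trigger buffer delays the alert but ignores shorter dips.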

Examples

Examples

Challenges Diurnal variations. [Figure: time series of capacity, available bandwidth, RTT and cross traffic.]

Challenges Unusual variations. [Figure: trigger buffer length 10 points.] In this case, we wanted to know that this was happening. [Figure: trigger buffer length 30 points.]

Considerations From the performance monitoring perspective of managing production networks, we are primarily concerned about pathologies that interfere with the production process. We are not really interested in the minor ebb and flow of network traffic. We do want to be alerted to pathologies which may be affecting the production process.

Challenges There are many algorithms which may be useful in analyzing data for various types of variations, including variations with which we may not be concerned. Developing code for this is challenging and complex, but can be done. Problem: the CPU power and “elapsed” time needed to analyze monitoring data with all these analysis tools are impractical; most of us cannot afford supercomputers or farms to do it. Analysis and identification of “events” must be timely.

Solutions Quick first level trigger analysis which can be done frequently to check for “events”. Provide a web page for looking at general health and first level trigger occurrences. Can also invoke immediate but synchronized more extensive tests to verify drops. Input event data (and longer term data) into more sophisticated analysis to filter for serious “alert”s. Save event signatures for future reference.

IEPM-BW Future IEPM-BW Version 3 is being architected to facilitate frequent lightweight available bandwidth and link capacity measurements. Will use an SQL database to manage the probe, monitoring host, and target host specifications as well as the probe data and analysis results. Frequent lightweight first trigger level change analysis.

Long Term Facility for scheduling on-demand and automatic heavyweight bandwidth tests in response to triggers. Automatically feed results into more complex analysis code to filter only for “real” alerts. Distributed monitoring.

References ABWE: A Practical Approach to Available Bandwidth Estimation, Jiri Navratil and Les Cottrell. Automated Event Detection for Active Measurement Systems, A. J. McGregor and H-W. Braun, Passive and Active Measurements 2001. Overview of IEPM-BW Bandwidth Testing of Bulk Data Transfer, Les Cottrell and Connie Logg. Experiences and Results from a New High Performance Network and Application Monitoring Toolkit, Les Cottrell, Connie Logg, and I-Heng Mei. Correlating Internet Performance Changes and Route Changes to assist in Trouble-Shooting from an End-User Perspective, Connie Logg, Jiri Navratil, and Les Cottrell. Miscellaneous, SLAC.

Future - continues Further develop Bandwidth Change Analysis. Now have a 1st level trigger mechanism. Develop more extensive analysis to analyze identified events. Develop algorithms to automatically conduct other tests and integrate those results into further trigger analysis.

IEPM-BW Version 3 Architecture changes: SQL database for host specifications, probe tool specifications, probe scheduling mechanism, analysis results, and knowledge base. Scheduling mechanisms for lightweight vs heavyweight probes. Distributed monitoring and remote data retrieval for “grid” analysis. Change analysis (route and bandwidth) and alerts.

… and Topology Choose times and hosts and submit request. Nodes are colored by ISP. Mouseover shows node names. Click on a node to see subroutes. Click on an end node to see its path back. Also can get raw traceroutes with AS’. [Figure: topology map by hour of day, with labels SLAC, ESnet, GEANT, JAnet, CESnet, IN2P3, DL, CLRC and alternate routes.]

In Progress Code is being rewritten to: allow for standalone use; integrate into IEPM-BW version 3; integrate with Bandwidth Change Analysis.

Bandwidth Change Analysis Available BandWidth Estimation (ABWE), developed by Jiri Navratil, is used to perform frequent probes for link capacity, available bandwidth and cross traffic load. During its development we noticed that ABWE results and iperf tracked very closely. Iperf is network intensive and it is practical to do only a few measurements a day. ABWE is very lightweight and can do measurements about every 3 minutes for about 60 nodes. We took a look at using ABWE measurements for monitoring and alerting on bandwidth changes.

Futures IEPM-BW Version 3: asynchronous frequent lightweight probes; synchronous, less frequent heavyweight probes which can be used to check out changes indicated by lightweight probes. Technology is changing rapidly: new routing, buffering, hardware, software, protocols etc. will require new probe techniques. Provide a framework within which to evaluate new probe techniques.

Futures IEPM-BW will continue to be useful for developing countries

More information Where to get it: Example: http://www.slac.stanford.edu/comp/net/bandwidth-tests/hercules/tracesummaries/today.html Topology: http://pcgiga.cern.ch:8080/cgi-bin/pnets.pl IEPM-BW home page: http://www-iepm.slac.stanford.edu/bw/ ABwE lightweight bandwidth estimation: http://www-iepm.slac.stanford.edu/abing/

[Figure: 28-day bandwidth history for the path from SLAC to CALTECH, with ABwE, iperf (1 stream), bbftp and RTT series. Annotations: forward routing changes; reverse routing changes; new CENIC path at 1000 Mbits/s; drop to 622 Mbits/s path; drop to 100 Mbits/s by routing (BGP) errors; back to new CENIC path.] During this time we can see several different situations caused by different routing from SLAC to CALTECH. ABwE also works well on DSL and wireless networks. [Figure: scatter plots of iperf versus ABw on different paths (range 20-800 Mbits/s), showing agreement of the two methods over a 28-day history.]