Experiences in Traceroute and Available Bandwidth Change Analysis
Connie Logg, Les Cottrell & Jiri Navratil
SIGCOMM'04 Workshops, September 3, 2004
http://www.slac.stanford.edu/cgi-wrap/getdoc/slac-pub-10518.pdf

Modern data-intensive science such as HENP requires the ability to copy large amounts of data between collaborating sites. This in turn requires high-performance, reliable end-to-end network paths and the ability to take advantage of them. End users thus need both long-term and near real-time estimates of the network and application performance of such paths for planning, setting expectations, and troubleshooting.

The IEPM-BW (Internet End-to-end Performance Monitoring - BandWidth) project was instigated in 2001 to meet the above needs for the BaBar HENP community. This produced a toolkit for monitoring Round Trip Times (RTT), TCP throughput (iperf), file copy throughput (bbftp, bbcp and GridFTP), traceroute, and more recently lightweight cross-traffic and available bandwidth measurements (ABwE). Since then it has been extended to LHC, CDF, D0, ESnet, Grid, and high-performance network Research & Education sites. About 60-70 paths are now being monitored (including about 50 remote sites), and the monitoring toolkit has been installed at ten sites and is in production at three or four, in particular FNAL (for CMS, CDF and D0) and SLAC (for BaBar and PPDG).

Each monitoring site is relatively independent, and the monitoring is designed to map to the tiering of modern HENP sites, i.e. it is hierarchical rather than full mesh. The monitoring toolkit is installed at a site, and the site contact chooses the remote hosts it wishes to monitor. Work is currently in progress to analyze and visualize the traceroute measurements and to automatically detect anomalous step-down changes in bandwidth.

Partially funded by the DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM); also supported by IUPAP.
Motivation

High Energy Nuclear Physics (HENP) analysis requires the effective distribution of large amounts of data between collaborators worldwide.

In 2001 we started development of a project, IEPM-BW (Internet End-to-end Performance Monitoring of BandWidth), for the network paths to our collaborators.

It has evolved over time from network-intensive measurements (run about every 90 minutes) to lightweight, non-intensive measurements (run about every 3-5 minutes).
IEPM-BW Version 1

IEPM-BW version 1 (September 2001) performed sequential heavyweight (iperf and file transfer) and lightweight (ping & traceroute) measurements to a few dozen collaborator nodes; it could only be run a few times a day.

Concurrently, an Available BandWidth Estimation tool (ABwE) was being developed by Jiri to perform lightweight estimates of available bandwidth (ABW), link capacity (DBCAP), and cross traffic (XTR).
IEPM-BW Version 2

IEPM-BW version 2 incorporated ABwE as a probing tool, and extensive comparisons were made between the heavyweight iperf and file transfer tests and the ABwE results.

The ABwE results tracked well with the iperf and file transfer tests in many cases.
Examples

[Figure: 28 days of bandwidth history (ABwE, iperf 1 stream, bbftp). During this time we can see three different situations caused by different routing from SLAC to Caltech: forward and reverse routing changes, a new 1000 Mbits/s CENIC path, a drop to a 622 Mbits/s path, and a return to the new CENIC path.]

[Figure: scatter plots of iperf versus ABW on different paths (range 20-800 Mbits/s) showing agreement of the two methods over the 28-day history.]
Challenges

The monitoring was very useful to us, but:
- There were too many graphs and reports to examine manually every day
- We could only run the probes a few times a day
- We needed to automate what the brain does: pick out changes

Changes of concern included route and bandwidth changes.
Traceroute Analysis

Need a way to visualize traceroutes taken at regular intervals to several tens of remote hosts:
- Report the pathologies identified
- Allow quick visual inspection for:
  - Multiple route changes
  - Significant route changes
  - Pathologies
- Drill down to more detailed information:
  - Histories
  - Topologies
  - Bandwidth monitoring data
Display Many Routes on a Single Page

- One compact page per day
- One row per host, one column per hour
- One character per traceroute to indicate pathology or change (usually a period (.) = no change)
- Identify unique routes with a number
- Be able to inspect the route associated with a route number
- Provide for analysis of long-term route evolution
- The route # at the start of the day gives an idea of route stability

[Figure annotations: multiple route changes (due to GEANT), later restored to the original route; a period (.) means no change.]
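The one-character-per-traceroute row described above can be sketched as follows. This is a hypothetical helper for illustration, not the toolkit's actual code: it assigns each distinct route a number in order of first appearance, and emits a period when the route is unchanged from the previous probe, otherwise the new route's number.

```python
def summarize_routes(routes):
    """Produce one display character per traceroute measurement:
    '.' if the route is unchanged from the previous probe, otherwise
    the route number (assigned in order of first appearance)."""
    route_ids = {}   # route (tuple of hops) -> route number
    chars = []
    prev = None
    for route in routes:
        if route not in route_ids:
            route_ids[route] = len(route_ids)
        # Unchanged routes compress to '.', so a day's row is mostly dots
        chars.append('.' if route == prev else str(route_ids[route]))
        prev = route
    return ''.join(chars), route_ids
```

For example, five probes where the route flaps once and returns produce the row `0.10.`, making the change and the restoration visible at a glance.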
Pathology Encodings

There are several pathologies associated with traceroutes, and we needed a way to encode them:
- Probe type: UDP or ICMP
- Hop does not respond (*)
- End host does not respond, i.e. 30 hops (|)
- Stutters (") - hop replies more than once
- Hop change only affects the 4th octet (:)
- Hop change but address in same AS (a)
- ICMP checksum (orange)
- ! Annotation, e.g. network unreachable, admin blocked
- Multi-homed host
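Since each traceroute gets a single display character, multiple pathologies in one probe must collapse to one symbol. A minimal sketch of such an encoding, using the legend above; the key names and the precedence ordering are assumptions for illustration, not the toolkit's actual scheme:

```python
# Display characters from the slide's legend (assumed key names)
PATHOLOGY_CHARS = {
    'dest_no_reply':  '|',   # end host does not respond (30 hops)
    'hop_no_reply':   '*',   # a hop did not respond
    'annotation':     '!',   # e.g. network unreachable, admin blocked
    'stutter':        '"',   # a hop replied more than once
    'same_as_change': 'a',   # hop changed but stayed in the same AS
    'octet4_change':  ':',   # change confined to the 4th octet
}

# Illustrative precedence: more severe pathologies win the character slot
PRECEDENCE = ['dest_no_reply', 'hop_no_reply', 'annotation',
              'stutter', 'same_as_change', 'octet4_change']

def encode(pathologies):
    """Collapse a probe's pathology list to one display character;
    '.' (no change) when nothing noteworthy was observed."""
    for p in PRECEDENCE:
        if p in pathologies:
            return PATHOLOGY_CHARS[p]
    return '.'
```

So a probe that both stuttered and changed its 4th octet would display as `"`, while a clean probe displays as `.`.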
Pathology Encodings

[Figure: example summary page annotated with the encodings above - probe type, no change, change in same AS, change in only the 4th octet, end host not pingable, hop does not respond, stutter, multihomed, ICMP checksum, ! annotation (!X).]
Navigation

traceroute to CCSVSN04.IN2P3.FR (134.158.104.199), 30 hops max, 38 byte packets
 1 rtr-gsr-test (134.79.243.1) 0.102 ms
 ...
13 in2p3-lyon.cssi.renater.fr (193.51.181.6) 154.063 ms !X

#rt# firstseen  lastseen   route
0    1086844945 1089705757 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
1    1087467754 1089702792 ...,192.68.191.83,171.64.1.132,137,...,131.215.xxx.xxx
2    1087472550 1087473162 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
3    1087529551 1087954977 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
4    1087875771 1087955566 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,(n/a),131.215.xxx.xxx
5    1087957378 1087957378 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
6    1088221368 1088221368 ...,192.68.191.146,134.55.209.1,134.55.209.6,...,131.215.xxx.xxx
7    1089217384 1089615761 ...,192.68.191.83,137.164.23.41,(n/a),...,131.215.xxx.xxx
8    1089294790 1089432163 ...,192.68.191.83,137.164.23.41,137.164.22.37,(n/a),...,131.215.xxx.xxx
History Channel
AS information
Esnet-LosNettos segment in the path

Changes in network topology (BGP) can result in dramatic changes in performance.

Notes:
1. Caltech misrouted via the Los-Nettos 100 Mbps commercial network 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. The next step is to auto-detect and notify

[Figure: snapshot of the traceroute summary table and hourly samples of traceroute trees generated from it, showing the Esnet-Los-Nettos (100 Mbits/s) segment in the path to the remote host.]

[Figure: ABwE measurements, one per minute for 24 hours (Thurs Oct 9 9:00am to Fri Oct 10 9:01am), showing the drop in performance when the path changed from SLAC-CENIC-Caltech to SLAC-Esnet-LosNettos (100 Mbps)-Caltech, and the return to the original path. The dynamic bandwidth capacity (DBC) changes were detected by both IEPM-Iperf and ABwE; available bandwidth = DBC - cross-traffic (XT), in Mbits/s.]
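The available bandwidth relation used in the figure is simple enough to state directly. A one-line sketch (the function name and the clamping at zero are our own, for illustration):

```python
def available_bw(dbc, xt):
    """Available bandwidth per the ABwE relation: dynamic bandwidth
    capacity (DBC) minus cross-traffic (XT), in Mbits/s. Clamped at
    zero, since cross-traffic estimates can transiently exceed DBC."""
    return max(dbc - xt, 0.0)
```

For the Los-Nettos segment above, a 100 Mbits/s DBC with 30 Mbits/s of cross-traffic leaves about 70 Mbits/s available.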
Data Display

There are many different and useful ways to look at traceroute data:

Output from the traceroute command.

Tabular format (facilitates comparisons):
#date      #time    #hops epoch      rtno node            route
08/31/2004 11:13:19 14    1093975999 3    node1.cesnet.cz ...,134.55.209.1,...,134.55.209.58,62.40.103.213,...,195.113.xxx.xxx
08/31/2004 11:23:37 14    1093976617 2    node1.cesnet.cz ...,134.55.209.1,...,134.55.209.200,62.40.103.214,...,195.113.xxx.xxx
08/31/2004 11:33:38 14    1093977218 3    node1.cesnet.cz ...,134.55.209.1,...,134.55.209.58,62.40.103.213,...,195.113.xxx.xxx

Topology maps.
Data Display

Historical list of routes (route number, first-seen date, last-seen date, hops):
#rt# firstseen  lastseen   route
0    1086844945 1089705757 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
1    1087467754 1089702792 ...,192.68.191.83,171.64.1.132,137,...,131.215.xxx.xxx
2    1087472550 1087473162 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
3    1087529551 1087954977 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
4    1087875771 1087955566 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,(n/a),131.215.xxx.xxx
5    1087957378 1087957378 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
6    1088221368 1088221368 ...,192.68.191.146,134.55.209.1,134.55.209.6,...,131.215.xxx.xxx
7    1089217384 1089615761 ...,192.68.191.83,137.164.23.41,(n/a),...,131.215.xxx.xxx
8    1089294790 1089432163 ...,192.68.191.83,137.164.23.41,137.164.22.37,(n/a),...,131.215.xxx.xxx
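Maintaining such a first-seen/last-seen table from a stream of timestamped traceroutes can be sketched as below. The helper name and record layout are hypothetical; epochs are Unix timestamps as in the table above.

```python
def route_history(observations):
    """Build a route-history table from (epoch, route) observations.
    Each distinct route gets a number (in order of first appearance)
    plus first-seen and last-seen epochs, mirroring the #rt#/firstseen/
    lastseen columns of the display."""
    table = {}
    for epoch, route in observations:
        if route not in table:
            table[route] = {'rt': len(table), 'first': epoch, 'last': epoch}
        else:
            # Routes can recur after an excursion, so extend last-seen
            table[route]['last'] = max(table[route]['last'], epoch)
    return table
```

A route that disappears and later returns (like route 0 above, whose last-seen epoch postdates several other routes' first-seen epochs) simply has its last-seen timestamp extended.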
Summary

We are quite happy with our traceroute analysis:
- One page per day to eyeball for route changes
- Links provided for ease of further examination
- We do not alert on traceroute changes, but the traceroute information is integrated with the Bandwidth Change Analysis

IEPM-BW in Action
Bandwidth Change Analysis

The purpose is to generate "alerts" about "major" drops in available bandwidth and/or link capacity which may be impacting our physics analysis production work.

Inspired by the "Plateau Algorithm" for analyzing ping data, from McGregor and Braun.

The implementation is described in the paper, so I am not going to go over it here.
Overview

We currently feed a week's worth of data into the analysis for development and verification purposes.

The actual implementation processes that week's worth of data in a sliding window of currently about 24 hours.

A 1st-level "alert" is generated when there is a period of depressed bandwidth for about 3 hours. Our depression threshold is currently 40%.

There is one web page that displays an analysis overview, with links to more details.
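A toy sketch of a plateau-style step-down trigger in the spirit of McGregor and Braun may make the idea concrete. The buffer sizes, seeding rule, and function name are illustrative assumptions, not the implementation described in the paper: samples near the historical mean feed a history buffer, depressed samples feed a trigger buffer, and a 1st-level alert fires when the trigger buffer fills (i.e. the depression is sustained).

```python
from collections import deque

def plateau_trigger(samples, history_len=50, trigger_len=10, threshold=0.4):
    """Return the sample index at which a 1st-level alert fires, or None.
    A sample more than `threshold` (40%) below the history mean counts
    toward the trigger buffer; any recovered sample resets it, so only
    sustained depressions of `trigger_len` consecutive samples alert."""
    history = deque(maxlen=history_len)
    trigger = deque(maxlen=trigger_len)
    for i, s in enumerate(samples):
        if len(history) < 5:             # seed the history first
            history.append(s)
            continue
        mean = sum(history) / len(history)
        if s < (1.0 - threshold) * mean:
            trigger.append(s)            # depressed sample
            if len(trigger) == trigger_len:
                return i                 # sustained drop: 1st-level alert
        else:
            trigger.clear()              # recovery resets the trigger
            history.append(s)
    return None
```

With 3-5 minute probes, a trigger length of a few dozen samples corresponds to the roughly 3-hour depression period mentioned above.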
Gigabit Link Examples
Kilobit Link Example

[Figure: today's page.]
Challenges

Diurnal variations.

[Figure: plots showing diurnal variation in capacity, available bandwidth, RTT, and cross traffic.]
Challenges

Unusual variations. In this case, we wanted to know that this was happening.

[Figure: examples with trigger buffer lengths of 10 points and 30 points.]
Considerations

From the performance monitoring perspective of managing a production network, we are primarily concerned with pathologies that interfere with the production process.

We are not really interested in the minor ebb and flow of network traffic.

We are not interested in monitoring the entire "Grid"; we are interested in monitoring what our users are seeing.
Challenges

There are many algorithms which may be useful in analyzing the data for various types of variation with which we may not be concerned. Developing code for this is challenging and complex, but it can be done.

Problem: the CPU power and elapsed time needed to analyze the monitoring data with all these analysis tools is impractical; most of us cannot afford supercomputers or farms to do it.

Analysis and identification of "events" must be timely.
Solutions

- A quick first-level trigger analysis which can be run frequently to check for "events"
- Provide a web page for looking at general health and first-level trigger occurrences
- Can also invoke immediate but synchronized, more extensive tests to verify drops
- Input event data (and longer-term data) into more sophisticated analysis to filter for serious "alerts"
- Save event signatures for future reference
IEPM-BW Future

IEPM-BW version 3 is being architected to facilitate frequent, lightweight available bandwidth and link capacity measurements.

It will use an SQL database to manage the probe, monitoring host, and target host specifications, as well as the probe data and analysis results.

Frequent lightweight 1st-level trigger change analysis.
Long Term

- A facility for scheduling on-demand and automatic heavyweight bandwidth tests in response to triggers
- Automatically feed results into more complex analysis code to filter only for "real" alerts
- Distributed, self-contained, independent monitoring systems
- Working on a technique to encode bandwidth variation signatures (such as diurnal variations)
References

- ABwE: A Practical Approach to Available Bandwidth Estimation, Jiri Navratil and Les Cottrell
- Automated Event Detection for Active Measurement Systems, A. J. McGregor and H-W. Braun, Passive and Active Measurements 2001
- Overview of IEPM-BW Bandwidth Testing of Bulk Data Transfer, Les Cottrell and Connie Logg
- Experiences and Results from a New High Performance Network and Application Monitoring Toolkit, Les Cottrell, Connie Logg, and I-Heng Mei
- Correlating Internet Performance Changes and Route Changes to Assist in Trouble-Shooting from an End-User Perspective, Connie Logg, Jiri Navratil, and Les Cottrell
- Miscellaneous, SLAC
Future (continued)

- Further develop the Bandwidth Change Analysis; we now have the 1st-level trigger mechanism
- Develop more extensive analysis to analyze identified events
- Develop algorithms to automatically conduct other tests and integrate those results into further trigger analysis
IEPM-BW Version 3

Architecture changes:
- SQL database for host specifications, probe tool specifications, the probe scheduling mechanism, analysis results, and a knowledge base
- Scheduling mechanisms for lightweight vs. heavyweight probes
- Distributed monitoring and remote data retrieval for "grid" analysis
- Change analysis (route and bandwidth) and alerts