Connie Logg, Joint Techs Workshop February 4-9, 2006


IEPM-BW: Bandwidth Change Detection and Traceroute Analysis and Visualization

BW Change Detection: Important
Know what you are looking for:
How long must a change persist before alerting?
What threshold to use for alerting (drop of N%)?
What probes provide quality data and are relevant? This may differ between network types and technologies.
Once an alert is detected, what circumstances must be met before another alert is generated for the same or a new drop?
Alerting and forecasting/predicting future performance are two different things; however, the data taken may be relevant to both.
Remember: you don't want to respond to every little glitch, and more probing may escalate a minor momentary congestion event.

What to do with ALERTS
Study them for accuracy and relevance
What information would help diagnose the drop?
Were there traceroute changes?
Do changes in other probes seem to have occurred in the same time frame?
Was there an increase in the ping RTT times?
If TCP RTT is available, was there a change in that?
What does OWAMP show? (to be implemented)

Algorithm - Simplified
Stream of data: t0 ... tn
Two buffers: a history buffer (hbuff) and a trigger buffer (tbuff), of sizes hmax and tmax
Load data t0 ... thmax into the history buffer and calculate the baseline histmean (hm) and histsd (hsd)

Algorithm - Simplified
Loop over data t = {thmax+1 ... tn}:
If t > hm - 2*hsd: move the oldest tbuff entry to hbuff, add t to hbuff, drop the oldest hbuff entry, recalculate hm and hsd, next
If t <= hm - 2*hsd: add t to tbuff
If size(tbuff) < tmax, next
Otherwise calculate the tbuff mean (tm); if (hm - tm)/hm > threshold, generate an alert; move tbuff to hbuff, recalculate hm and hsd, next
Once an alert has been generated, another is raised only if the drop threshold is met again relative to tm, or after the data stream has recovered for half of the drop time.
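The loop above can be sketched in Python as follows. This is a minimal illustration, not the IEPM-BW code: the function name `detect_drops` is hypothetical, the population standard deviation is an assumed choice, the trigger buffer is assumed to be flushed into the history buffer whether or not an alert fires, and the post-alert suppression rule is left out.

```python
from collections import deque
from statistics import mean, pstdev


def detect_drops(samples, hmax, tmax, threshold):
    """Return the sample indices at which an alert would be generated."""
    if len(samples) <= hmax:
        return []
    # History buffer holds the most recent hmax "normal" samples;
    # maxlen makes the deque discard its oldest entry on append.
    hbuff = deque(samples[:hmax], maxlen=hmax)
    tbuff = deque()  # trigger buffer: samples well below the baseline
    hm, hsd = mean(hbuff), pstdev(hbuff)
    alerts = []
    for i in range(hmax, len(samples)):
        t = samples[i]
        if t > hm - 2 * hsd:
            # Normal sample: age one trigger entry into history, add t.
            if tbuff:
                hbuff.append(tbuff.popleft())
            hbuff.append(t)
            hm, hsd = mean(hbuff), pstdev(hbuff)
            continue
        tbuff.append(t)
        if len(tbuff) < tmax:
            continue
        # Trigger buffer is full: compare its mean with the baseline.
        tm = mean(tbuff)
        if hm > 0 and (hm - tm) / hm > threshold:
            alerts.append(i)
        # Assumption: the trigger buffer is flushed into history either way.
        hbuff.extend(tbuff)
        tbuff.clear()
        hm, hsd = mean(hbuff), pstdev(hbuff)
    return alerts
```

With one sample per hour, `hmax`, `tmax` and `threshold=0.33` would be chosen to match the drop duration and depth being looked for; the right buffer sizes depend on the probe interval, which the slides do not state.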

Overview: What we currently look for
Look for a drop lasting at least 6 hours
Look for a drop of 33%
Before reporting another drop, require 3 hours of restored throughput
[Figure: bandwidth versus time, showing throughput up for at least 3 hours, a 33% drop lasting 6 hours, and a further 33% drop lasting 6 more hours]

Observations: Traceroute changes occasionally coincide with bandwidth drops
Challenge: How do you define a traceroute change, and which changes have the highest priority?
Checksum error
Duplicate responding or non-responding hop
! annotations
IP addresses differ in the 4th octet (or the 3rd and 4th octets)
How do you quickly review traceroute changes?

Traceroute Visualization
One compact page per day
One row per host, one column per hour
One character per traceroute to indicate pathology or change (a period (.) means no change)
Identify unique routes with a number
Inspect the route associated with a route number
Provide for analysis of long-term route evolution
[Figure annotations: the route number at the start of the day gives an idea of route stability; multiple route changes (due to GEANT), later restored to the original route]

Pathology Encodings
[Figure legend; the symbols themselves are not in the transcript]
Probe type
No change
Change but same AS
Change in only 4th octet
End host not pingable
Hop does not respond
Stutter
ICMP checksum
! Annotation (!X)
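As a rough illustration of how two of these categories ("no change" and "change in only 4th octet") could be told apart, the sketch below compares consecutive routes hop by hop and emits one character per comparison. The function names and the `:` and `*` symbols are hypothetical; only the period for "no change" comes from the slides, and the other pathologies (stutter, checksum, same-AS changes) are not handled.

```python
# Hypothetical symbols, except "." which the slides use for "no change".
SYMBOLS = {"no_change": ".", "fourth_octet_only": ":", "changed": "*"}


def classify_change(prev_route, curr_route):
    """Classify the difference between two routes (lists of IPv4 strings)."""
    if prev_route == curr_route:
        return "no_change"
    if len(prev_route) != len(curr_route):
        return "changed"
    for a, b in zip(prev_route, curr_route):
        # A hop counts as a real change if its first three octets differ.
        if a != b and a.split(".")[:3] != b.split(".")[:3]:
            return "changed"
    return "fourth_octet_only"


def day_row(routes):
    """One character per traceroute, each compared with the one before it."""
    return "".join(
        SYMBOLS[classify_change(prev, curr)]
        for prev, curr in zip(routes, routes[1:])
    )
```

A day of 24 hourly traceroutes to one host would then collapse into a single short string, which is what makes the one-row-per-host page compact.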

Navigation
traceroute to CCSVSN04.IN2P3.FR (134.158.104.199), 30 hops max, 38 byte packets
1 rtr-gsr-test (134.79.243.1) 0.102 ms
…
13 in2p3-lyon.cssi.renater.fr (193.51.181.6) 154.063 ms !X
#rt# firstseen lastseen route
0 1086844945 1089705757 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
1 1087467754 1089702792 ...,192.68.191.83,171.64.1.132,137,...,131.215.xxx.xxx
2 1087472550 1087473162 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
3 1087529551 1087954977 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
4 1087875771 1087955566 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,(n/a),131.215.xxx.xxx
5 1087957378 1087957378 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
6 1088221368 1088221368 ...,192.68.191.146,134.55.209.1,134.55.209.6,...,131.215.xxx.xxx
7 1089217384 1089615761 ...,192.68.191.83,137.164.23.41,(n/a),...,131.215.xxx.xxx
8 1089294790 1089432163 ...,192.68.191.83,137.164.23.41,137.164.22.37,(n/a),...,131.215.xxx.xxx
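A route table of this shape can be maintained with a few lines of Python. The sketch below is an assumption about how such bookkeeping might work, not the IEPM-BW implementation: `record_route` is a hypothetical name, routes are keyed by their comma-joined hop list, and the timestamps are treated as Unix epoch seconds, which is what the values in the table appear to be.

```python
def record_route(table, hops, timestamp):
    """Return the route number for hops, adding a new row if it is unseen.

    table is a list of dicts with keys "route", "firstseen", "lastseen";
    a route's number is its index in the list.
    """
    key = ",".join(hops)
    for number, entry in enumerate(table):
        if entry["route"] == key:
            # Known route: only the last-seen time moves forward.
            entry["lastseen"] = max(entry["lastseen"], timestamp)
            return number
    table.append({"route": key, "firstseen": timestamp, "lastseen": timestamp})
    return len(table) - 1
```

Because a number is assigned once and never reused, a route that disappears and later returns keeps its original number, which fits the slide's example of a path being restored to the original route.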

AS’ information

Changes in network topology (BGP) can result in dramatic changes in performance
[Figure: snapshot of the traceroute summary table (remote host by hour) and samples of traceroute trees generated from the table]
[Figure: ABwE measurements, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am, in Mbits/s, showing dynamic bandwidth capacity (DBC), cross-traffic (XT), and available bandwidth = DBC - XT. Annotations: drop in performance when the path changed from the original SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech; ESnet-LosNettos segment in the path (100 Mbits/s); back to original path; changes detected by IEPM-Iperf and ABwE]
Notes:
1. Caltech misrouted via Los-Nettos 100Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto-detect and notify

New Graphical Map Display New Traceroute Map Display

Quality Control: Bandwidth Monitoring
It is good to have a local target host for a sanity check. The problem here was that the monitoring host rebooted into single-CPU mode after maintenance had been performed on it.

More Sanity Checks
The target host, iepm-bw@caltech, was not completely installed: the process cleanup did not have the Perl modules it needed to kill lingering processes (an install check is needed).

Probe Correlation
A traceroute change affected all 3 probes:
Pathchirp analysis shows a drop
Multi-stream iperf shows a drop
Single-stream iperf shows a drop

Analysis Results
Email is sent to interested parties with links to graphs, data, and traceroute analysis.
Alerts are saved in the ALERT table and graphs in the GRAPH table for future reference.
On every analysis run, about every 2 hours, a table is generated showing which alerts occurred for which probes and when. It has links to the more detailed alert information.
Reports on the last month's alerts are generated nightly from these tables.

Future Improvements
Integrate ping RTTmin and RTTmax analysis
Optimize code for speed of execution: estimate mean and std dev
Upload alerts to MonALISA: what info?
Compare detection algorithms (KS, HW, PCA?)
Recommendations on data-taking frequencies and on how to define the trigger and history buffer sizes still need more exploration
Implement prediction/forecasting algorithm(s)
QUESTIONS?