End-to-end Monitoring of High Performance Network Paths. Les Cottrell, Connie Logg, Jerrod Williams, SLAC. For the ESCC meeting, Columbus, Ohio, July 2004.


1 End-to-end Monitoring of High Performance Network Paths
Les Cottrell, Connie Logg, Jerrod Williams, SLAC, for the ESCC meeting, Columbus, Ohio, July 2004.
Partially funded by the DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM); also supported by IUPAP.

2 Need
Data-intensive science (e.g. HENP) needs to share data at high speeds
– Needs high-performance, reliable e2e paths and the ability to use them
End users need long- and short-term estimates of network and application performance for:
– Planning, setting expectations & troubleshooting
You can't manage what you can't measure

3 IEPM-BW Toolkit:
– Enables regular, E2E measurements with user-selectable:
  Tools: iperf (single & multi-stream), bbftp, bbcp, GridFTP, ping (RTT), traceroute
  Periods (with randomization)
  Remote hosts to monitor
– Hierarchical to match the tiered approach of the BaBar & LHC computation / collaboration infrastructures
– Includes:
  Auto clean-up of hung processes at both ends
  Management tools to look for failures (unreachable hosts, failing tools etc.)
  Web navigation of results
  Visualization of data as time series, histograms, scatter plots, tables
  Access to data in machine-readable form
  Documentation on host etc. requirements, program logic manuals, methods
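As a rough illustration of the "periods (with randomization)" option above, a measurement loop of the following shape could be used; the 90-minute base period, the jitter fraction, the remote host name and the iperf arguments are illustrative assumptions, not IEPM-BW defaults.

```python
import random
import subprocess
import time

BASE_PERIOD_S = 90 * 60    # assumed base measurement period (90 minutes)
JITTER = 0.2               # randomize start times by +/-20% to avoid synchronized probes

while True:
    # run one high-impact measurement against an (assumed) remote host
    subprocess.run(["iperf", "-c", "remote.example.org", "-t", "10"])
    # sleep for the base period plus random jitter before the next measurement
    time.sleep(BASE_PERIOD_S * (1 + random.uniform(-JITTER, JITTER)))
```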

4 Requirements
– Requires:
  Monitoring toolkit installed on a Linux monitoring host
  – Host provided & administered by monitoring-site personnel
  – No need for root privileges
  – Appropriate iperf, bbftp etc. ports to be opened
  – SLAC can do the initial install & configuration for the monitoring host
    » 50-line configuration file for each remote host tells where directories and applications are located, options for the various tools etc. (mainly defaults)
  Small toolkit installed at the remote (monitored) hosts
  Ssh access to an account at the remote hosts
  – This is the biggest problem with deployment
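To make the "50-line configuration file for each remote host" concrete, the kind of information it holds (directories, application locations, ports, per-tool options, mostly defaults) is sketched below as a Python dictionary; the keys and values are invented for illustration and are not the actual IEPM-BW configuration syntax.

```python
# Hypothetical per-remote-host configuration: shows the kind of information the
# real 50-line file holds, not its actual format or keys.
REMOTE_HOST_CONFIG = {
    "host": "remote.example.org",
    "ssh_account": "iepm",                       # account reached via ssh
    "remote_toolkit_dir": "/home/iepm/iepm-bw",  # where the small remote toolkit lives
    "paths": {
        "iperf": "/usr/local/bin/iperf",
        "bbftp": "/usr/local/bin/bbftp",
        "bbcp": "/usr/local/bin/bbcp",
    },
    "ports": {"iperf": 5001, "bbftp": 5021},     # ports that must be opened
    "tool_options": {"iperf": "-w 4M -t 10"},    # per-tool options (mainly defaults)
}
```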

5 Achievable throughput & file transfer
IEPM-BW
– High-impact (iperf, bbftp, GridFTP …) measurements at … min intervals
[Screenshot: time-series plots with a selectable focal area; traces include forward and reverse route changes, min and avg RTT, iperf, bbftp, iperf1 and abing]

6 Visualization: traceroutes
Compact table to see correlations between many routes
Identify significant changes in routes
– Differences in > 1 hop, NOT the same first 3 octets, NOT the same AS
Report all traceroute pathologies:
– ! annotations, ICMP checksum errors, non-responding interfaces, unreachable end host, stutters, multi-homed end host
Note, we observe:
– Most route changes (>98%) do not result in significant performance changes
– Many performance changes (~50±20%) are NOT due to route changes
  Applications, host congestion, level 2 changes etc.
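A minimal sketch of the "significant route change" rule described above (more than one hop differs, ignoring differing hops that share the first three octets or the same AS); the hop representation and the asn_of lookup are assumptions, not the traceanal implementation.

```python
def significant_route_change(route_a, route_b, asn_of):
    """Return True if two traceroutes (lists of hop IP strings) differ 'significantly':
    more than one hop differs, ignoring hops that share the first 3 octets or the
    same AS.  asn_of(ip) -> AS number is a caller-supplied lookup (assumed)."""
    differing = 0
    for hop_a, hop_b in zip(route_a, route_b):
        if hop_a == hop_b:
            continue
        same_subnet = hop_a.split(".")[:3] == hop_b.split(".")[:3]
        same_as = asn_of(hop_a) == asn_of(hop_b)
        if not (same_subnet or same_as):
            differing += 1
    differing += abs(len(route_a) - len(route_b))   # unmatched hops count as differences
    return differing > 1
```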

7 Route table example
Compact, so many routes can be seen at once
[Screenshot of the route table, annotated with:]
– History navigation
– Multiple route changes (due to GEANT), later restored to the original route
– Available bandwidth
– Raw traceroute logs for debugging
– Textual summary of traceroutes to the ISP
– Description of route numbers with the date last seen
– User-readable (web table) routes for this host for this day
– Route # at the start of the day gives an idea of route stability
– Mouseover for hops & RTT

8 Another example
[Screenshot annotated with:]
– TCP probe type
– Host not pingable
– Intermediate router does not respond
– ICMP checksum error
– Level change
– Get AS information for routes

9 Topology
Choose times and hosts and submit the request
[Screenshot: topology map spanning DL, CLRC, IN2P3, CESnet, ESnet, JAnet, GEANT and SLAC, with alternate routes shown by hour of day]
– Nodes colored by ISP
– Mouseover shows node names
– Click on a node to see subroutes
– Click on an end node to see its path back
– Also can get raw traceroutes with ASes

10 IEPM-BW HENP Deployment, June 2004
Measurements from SLAC & FNAL
– BaBar, CMS, D0, CDF remote hosts in 12 countries
Toolkits needed in the monitoring & remote hosts
Range of bandwidths: 500 Kbps to 1 Gbps

11 Working on:
Provide more options for security at remote hosts
Web services API access to data
Provide & integrate a low-network-utilization tool:
– ~25% of Abilene traffic is network measurement
Automate detection of anomalous step changes in performance
Evaluate using QoS or HSTCP-LP to reduce the impact of iperf traffic
– Evidence that it causes packet loss (ESnet/FNAL/SLAC)

12 Simplify remote security
Currently use ssh to start and kill servers, check things etc.
Instead, run the servers all the time at the remote host
– Check & restart with a cron job
– Also kill hung processes with cron jobs
– More work for the remote admin
– More difficult to check why things are not working
At NASA it is very hard to get an account (requires training etc.), so this will be a work-around
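A sketch of the cron-driven approach described above, run periodically on the remote host to restart a dead measurement server and kill hung measurement processes; the process names, restart command and the one-hour limit are assumptions, not the IEPM-BW scripts.

```python
#!/usr/bin/env python3
"""Illustrative watchdog: restart dead servers, kill hung measurement processes."""
import subprocess

SERVERS = {"iperf": ["iperf", "-s", "-D"]}      # server name -> restart command (assumed)
HUNG_PROCS = ("bbftp", "bbcp")                  # client tools that sometimes hang (assumed)
MAX_AGE_S = 3600                                # anything older than an hour is 'hung'

def pids_of(name):
    """Return the PIDs of processes whose name matches exactly."""
    out = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(pid) for pid in out.stdout.split()]

def age_seconds(pid):
    """Elapsed seconds since the process started (ps 'etimes' column)."""
    out = subprocess.run(["ps", "-o", "etimes=", "-p", str(pid)],
                         capture_output=True, text=True)
    return int(out.stdout.strip() or 0)

for name, command in SERVERS.items():
    if not pids_of(name):                       # server is not running: restart it
        subprocess.Popen(command)

for name in HUNG_PROCS:
    for pid in pids_of(name):
        if age_seconds(pid) > MAX_AGE_S:        # long-running measurement: assume hung
            subprocess.run(["kill", str(pid)])
```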

13 Data Access
Interactive, web accessible
– Most data can be downloaded in space- or comma-separated form etc. (accessible via a link or to a program, e.g. using lynx to access the URL)
– However, non-standard
Web services (GGF NMWG definitions)
– Working (with Warren Matthews / GATech / I2) on defining / providing access to traceroutes for AMP & IEPM-LITE
– MonALISA is accessing data via Web services

Characteristic                               Tool name
path.bandwidth.achievable.TCP                iperf
path.bandwidth.achievable.TCP.multiStream    iperf, bbftp, bbcp, GridFTP
path.bandwidth.capacity                      ABwE
path.bandwidth.utilization                   ABwE

14 Low-impact bandwidth measurement
Goals:
– Make a measurement in under a second rather than tens of seconds
– Inject little network traffic
– Provide reasonable agreement with more intense methods (e.g. iperf)
Enables:
– Measurements of low-performance links (e.g. to developing countries)
– Helps avoid the need for scheduling
– More frequent measurements (minutes vs. hours)
– Lower impact, more friendly

15 Low-impact bandwidth
Use 20 packet pairs to roughly estimate the dynamic bandwidth capacity & cross-traffic, then Available = Capacity – Xtraffic
– Capacity is derived from the minimum pair separation; Xtraffic from the packet-pair dispersion
[Plot: dynamic bandwidth capacity (DBC), available bandwidth = DBC – cross-traffic, cross-traffic, iperf and ABwE traces; SLAC to Caltech, Mar 19, 2004]
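The packet-pair arithmetic above can be sketched as follows; this only illustrates the idea (capacity from the least-dispersed pair, available bandwidth from the average dispersion, cross-traffic as the difference) and is not the ABwE code; the 1500-byte packet size is an assumption.

```python
def packet_pair_estimate(recv_gaps_s, packet_bytes=1500):
    """Estimate dynamic bandwidth capacity (DBC), cross-traffic and available
    bandwidth (bits/s) from the inter-packet gaps (seconds) measured at the
    receiver for ~20 back-to-back packet pairs."""
    bits = packet_bytes * 8
    dbc = bits / min(recv_gaps_s)                             # capacity from the minimum pair separation
    avg_rate = bits / (sum(recv_gaps_s) / len(recv_gaps_s))   # rate implied by the average dispersion
    xtraffic = dbc - avg_rate                                 # extra dispersion attributed to cross-traffic
    available = dbc - xtraffic                                # Available = Capacity - Xtraffic
    return dbc, xtraffic, available

# e.g. gaps of ~12 microseconds for 1500-byte packets correspond to roughly 1 Gbit/s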

16 Anomalous Event Detection
Too many graphs to scan by hand, need to automate
– SLAC to Caltech link performance dropped by a factor of 5 for ~1 month before being noticed; fixed within 4 hours of reporting
Looking for long-term step-down changes in bandwidth
Use a modified "plateau" algorithm from NLANR
– Divide data into a history buffer and a trigger buffer
– If y < μ_h − β·σ_h then put y in the trigger buffer, else in the history buffer
– When the trigger buffer fills: if μ_t < δ·μ_h, then we have an event
  (μ = buffer mean, σ = standard deviation; subscripts h and t denote the history and trigger buffers; β and δ are sensitivity parameters)

17 Anomalous Event Detection (cont.)
The length of the trigger buffer determines how long a step-down must last before it is interesting; we use 1 to 3 hours
– E.g. with 20 mins we saw 9 events, with 40 mins 3, with 60 mins none
Works well unless there are strong (>40%) diurnal changes
– Next step: incorporate diurnal checks
The events shown were caused by an application on the Caltech host (not network related)
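A minimal sketch of the plateau-style detection described on the two slides above, assuming simple fixed-size history and trigger buffers; the buffer sizes, β, δ and the reset-on-normal-sample behavior are illustrative choices, not the exact NLANR/IEPM-BW parameters.

```python
from collections import deque
from statistics import mean, stdev

def plateau_events(samples, hist_len=100, trig_len=10, beta=2.0, delta=0.8):
    """Return indices where a persistent step-down is detected.

    A sample below (history mean - beta * history std) goes to the trigger buffer,
    otherwise to the history buffer.  When the trigger buffer fills and its mean is
    below delta * history mean, an event is declared and the new level becomes the
    baseline."""
    history = deque(maxlen=hist_len)
    trigger = deque(maxlen=trig_len)
    events = []
    for i, y in enumerate(samples):
        if len(history) < 10:              # gather some history before judging
            history.append(y)
            continue
        mu_h, sigma_h = mean(history), stdev(history)
        if y < mu_h - beta * sigma_h:
            trigger.append(y)              # suspiciously low value
        else:
            history.append(y)
            trigger.clear()                # assumption: a normal sample resets the trigger
        if len(trigger) == trig_len and mean(trigger) < delta * mu_h:
            events.append(i)
            history.extend(trigger)        # adopt the new (lower) level as the baseline
            trigger.clear()
    return events
```

With trig_len chosen so the buffer spans 1 to 3 hours of samples, only step-downs that persist at least that long are reported, matching the behavior described above.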

18 Putting it together
[Network diagram spanning ESnet, CENIC, Abilene, SLAC, Supernet and SOX]

19 Future plans
Looking for funding…
Integrate it all
Improve distribution and management tools
Add monitoring sites, e.g. HENP tier 0 & 1 sites such as CERN, BNL, IN2P3, DESY …; ESnet, StarLight, Caltech …
Add extra functionality:
– Improved event detection, including diurnals and multivariate methods
– Filter alerts
– Upon detecting an anomaly, gather relevant information (network, host etc.) including on-demand measurements (e.g. NDT) and prepare a web page & email
– Improved web services access

20 Thanks: Development
Jiri Navratil (Prague) – bandwidth estimation (ABwE)
Paola Grosso (SLAC) & Warren Matthews (GATech) – web services
Maxim Grigoriev (FNAL) – event detection, IEPM visualization, major monitoring site
Ruchi Gupta (Stanford) – event visualization
Prof. Arshad Ali & Fahad Khalid (NIIT, Pakistan) – data collection after an event
Rich Carlson (I2) – NDT

21 Thanks: on-going
Foreign:
– Andrew Daviel (TRIUMF), Simon Leinen (SWITCH), Olivier Martin (CERN), Sven Ubik (CESnet), Kars Ohrenberg (DESY), Bruno Hoeft (FZK), Dominique (IN2P3), Fabrizio Coccetti (INFN), Cristina Bulfon (INFN), Yukio Karita (KEK), Takashi Ichihara (RIKEN), Yoshinori Kitasuji (APAN), Antony Antony (NIKHEF), Arshad Ali (NIIT), Serge Belov (BINP), Robin Tasker (DL & RAL), Yee Ting Lee (UCL), Richard Hughes-Jones (Manchester)
US:
– Shawn McKee (Michigan), Tom Hacker (Michigan), Eric Boyd (I2), Stanislav Shalunov (SOX), George Uhl (GSFC), Brian Tierney (LBNL), John Hicks (Indiana), John Estabrook (UIUC), Maxim Grigoriev (FNAL), Joe Izen (UT Dallas), Chris Griffin (U Florida), Tom Dunigan (ORNL), Dantong Yu (BNL), Suresh Singh (Caltech), Chip Watson (JLab), Robert Lukens (JLab), Shane Canon (NERSC), Kevin Walsh (SDSC), David Lapsley (MIT/Haystack/ISI-E)

22 More information
IEPM-BW home page –
Comparison of Internet E2E Measurement infrastructures –
http://www-iepm.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html
ABwE lightweight bandwidth estimation –
Anomalous Event Detection –
IEPM Web Services –

23 Extra Slides

24 Web Services
See
Working for: RTT, loss, capacity, available bandwidth, achievable throughput
No schema defined for traceroute (hop list)
PingER
– Definition WSDL
– path.delay.roundTrip in ms (min/avg/max + RTTs), path.loss.roundTrip, IPDV (ms)
  Also dups, out-of-order, IPDV, TCP throughput estimate
  Required to provide packet size, units, timestamp, src, dst
– path.bandwidth.available, path.bandwidth.utilized, path.bandwidth.capacity
Mainly for recent data; need to make real-time data accessible
Used by MonALISA, so coordination is needed to change definitions
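For illustration only, a web-service client call of the general shape implied above might look like the following sketch; the WSDL location, operation name and parameters are placeholders (the real definitions are in the PingER WSDL shown on a later slide), and zeep is just one convenient SOAP client library.

```python
from zeep import Client  # generic WSDL-aware SOAP client; any would do

# Placeholder WSDL URL and operation name -- not the real PingER service definition.
client = Client("https://example.invalid/pinger.wsdl")
result = client.service.getMeasurement(
    characteristic="path.delay.roundTrip",   # one of the characteristics listed above
    src="monitor.example.org",
    dst="remote.example.org",
)
print(result)
```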

25 Perl access to PingER

26 PingER WSDL

27 Output from script

28 Perl AMP traceroute

29 AMP traceroute output

30 Intermediate-term access
Provide access to the analyzed data in tables via .tsv format download from web pages.
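A small sketch of fetching and parsing one of these .tsv tables programmatically; the URL below is a placeholder, not a real IEPM-BW page.

```python
import csv
import urllib.request

# Placeholder URL; substitute the .tsv download link from the analysis web page.
URL = "http://www-iepm.example.org/analysis/table.tsv"

with urllib.request.urlopen(URL) as response:
    text = response.read().decode("utf-8")

rows = list(csv.reader(text.splitlines(), delimiter="\t"))
header, data = rows[0], rows[1:]
print(header)
print(f"{len(data)} rows downloaded")
```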

31 Bulk Data
For long-term detailed data, we tar and zip the data on demand. Mainly for PingER data.
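On-demand bundling of the long-term detailed data could look roughly like this; the directory layout and file names are invented for illustration.

```python
import tarfile

def bundle(data_dir, out_path="pinger-data.tar.gz"):
    """Tar and gzip a directory of detailed measurement data on demand."""
    with tarfile.open(out_path, "w:gz") as archive:
        archive.add(data_dir, arcname="pinger-data")
    return out_path

# e.g. bundle("/data/pinger/2004")   # hypothetical path
```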

32 ABwE vs. iperf: 28 days of bandwidth history
During this time we can see several different situations caused by different routing from SLAC to Caltech:
– Drop to 100 Mbits/s caused by routing (BGP) errors
– Drop to a 622 Mbits/s path, then back to the new CENIC path
– New CENIC path at 1000 Mbits/s
– Reverse and forward routing changes
[Plot traces: RTT, bbftp, iperf 1 stream]
Scatter-plot graphs of iperf versus ABwE on different paths (range 20–800 Mbits/s) showing agreement of the two methods (28 days of history)

33 Changes in network topology (BGP) can result in dramatic changes in performance
[Figure: snapshot of the traceroute summary table and samples of traceroute trees generated from the table]
[Plot: ABwE measurements, one per minute, for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am; shows a drop in performance (from the original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech) and a return to the original path; changes were detected by IEPM-Iperf and ABwE; the ESnet-LosNettos segment in the path is 100 Mbits/s; quantities plotted by hour per remote host: dynamic BW capacity (DBC), cross-traffic (XT), available BW = DBC − XT, in Mbits/s]
Notes:
1. Caltech was misrouted via the Los-Nettos 100 Mbps commercial net, 14:00-17:00
2. ESnet/GEANT were working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto-detect and notify