Lessons Learned Monitoring the WAN


Lessons Learned Monitoring the WAN. Les Cottrell, SLAC. ESnet R&D Advisory Workshop, April 23, 2007, Arlington, Virginia. Forecasting Network Performance: Predicting how long a file transfer will take requires forecasting network and application performance. However, such forecasting is beset with problems. These include seasonal (e.g. diurnal) variations in the measurements, the increasing difficulty of making accurate yet minimally intrusive active measurements, especially on high-speed (>1 Gbits/s) networks and with Network Interface Card (NIC) offloading, the intrusiveness of making more realistic active measurements on the network, the differences between network and large-file-transfer performance, and the difficulty of getting sufficient relevant passive measurements to enable forecasting. We will discuss each of these problems, compare and contrast the effectiveness of various solutions, look at how some of the methods may be combined, and identify practical ways to move forward. Partially funded by DOE and by Internet2.

Uses of Measurements
Automated problem identification & troubleshooting: alerts for network administrators (e.g. outages, baselines, bandwidth changes in time series, iperf, SNMP) and alerts for systems people (OS/host metrics). Forecasts for Grid middleware, e.g. replica manager, data placement. Engineering, planning, SLAs (set & verify), setting expectations. Also (not addressed here): security (spotting anomalies, intrusion detection) and accounting. This sets the stage: why do we need to monitor at all?

WAN History
PingER (1994), IEPM-BW (2001), Netflow. End-to-end, active, regular measurements from the end-user view; all hosts are owned by individual sites, while the core is mainly centrally designed & developed (homogeneous), with contributions from FNAL, GATech and NIIT (close collaboration). Why are you monitoring? Network trouble management, planning, auditing/setting SLAs, and Grid forecasting are very different goals, though they may use the same measurements.

PingER (1994)
The PingER project was originally (1995) for measuring network performance for the US, European & Japanese HEP community; it now covers mainly R&E sites. Extended this century to measure the Digital Divide, in collaboration with the ICTP Science Dissemination Unit (http://sdu.ictp.it) and ICFA/SCIC (http://icfa-scic.web.cern.ch/ICFA-SCIC/). Covers >120 countries (99% of the world's connected population), with >35 monitor sites in 14 countries, and uses the ubiquitous ping facility.
The size of the Internet infrastructure is a good indication of a country's progress towards an information-based economy. Africa's Internet infrastructure is the least developed in the world, with on average less than 1 in 100 people having access; however, averages obscure the great diversity of the African continent, reflected in wide variations in levels of Internet use. Measuring the number of users is not easy in developing countries because many people share accounts, use corporate and academic networks, or visit the rapidly growing number of cyber cafés, telecentres and business services. Furthermore, simply counting users does not capture the extent of use, from those who write a couple of emails a week to people who spend many hours a day browsing, transacting, streaming or downloading. New measures of Internet activity are therefore needed. One increasingly popular indicator is the amount of international Internet bandwidth used by a country (the 'size of the pipe'), measured in Kbits/s or Mbits/s. Most Internet traffic in a developing country is international (75-90%), so international capacity relative to population size gives a ready indication of the extent of Internet activity. In Africa some of these international links may only be as big as the circuit used by a small or medium-sized business, or even a broadband home user, in a developed country (about 128 Kbits/s, roughly 3-4 times standard dial-up modem speeds). In most cases these are confined to very small and poor African countries, but many other regulatory, historic and social factors also influence the extent of Internet use.
With the above in mind, PingER uses the Internet ping facility to measure international Internet performance in terms of RTT, loss, and derived throughput to characterize a country's performance. Coverage is missing in Central Africa, Myanmar, Cambodia, the Arabian Peninsula, parts of W. Africa, Guyana, Suriname and French Guiana. PingER monitors 44 sites in S. Asia, and is maybe the most extensive active E2E monitoring in the world.
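The "derived throughput" obtained from ping RTT and loss is commonly computed with the Mathis et al. TCP approximation, BW ≈ MSS / (RTT · √loss). A minimal sketch of that calculation follows; the 1460-byte MSS and the constant factor C = 1 are illustrative assumptions, not necessarily PingER's exact parameters.

```python
import math

def derived_throughput_mbps(rtt_ms: float, loss_fraction: float,
                            mss_bytes: int = 1460, c: float = 1.0) -> float:
    """Approximate achievable TCP throughput from ping RTT and loss
    using the Mathis et al. formula: BW ~ C * MSS / (RTT * sqrt(p))."""
    if loss_fraction <= 0:
        return float("inf")  # formula is undefined at zero loss
    rtt_s = rtt_ms / 1000.0
    bw_bytes_per_s = c * mss_bytes / (rtt_s * math.sqrt(loss_fraction))
    return bw_bytes_per_s * 8 / 1e6  # convert to Mbits/s

# Example: 200 ms RTT with 1% loss -> roughly 0.6 Mbits/s
print(round(derived_throughput_mbps(200, 0.01), 2))
```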

PingER Design Details
The PingER design dates from 1994 (no web services or RRD, security not a big concern, etc.). It is simple: no remote software (ping is available everywhere), no probe development, and a monitor host install takes a sys-admin about half a day. Data are centrally gathered, archived and analyzed, so the hard jobs (archiving, analysis, visualization) do NOT require distribution; there is only one copy. The database is flat ASCII files (raw data and analyzed data, one file per host pair per day); compression saves a factor of 6 (~100 GB). Data are available via the web (lots of use, some of it unexpected, often analyzed with Excel).

PingER Lessons
- Measurement code was rewritten twice: once to add extra data, once to document (perldoc), parameterize and simplify installation.
- Gathering code (uses LYNX or FTP to pull data to the archive) has needed no major modifications in 10 years.
- Most of the development effort went into downloading, analyzing, visualizing and managing the data.
- New ways to use the data (jitter, out-of-order, duplicates, derived throughput, MOS) all required studying the data, then implementing and integrating.
- Dirty data (pathologies not related to the network) require filtering or filling before analysis (see the sketch below).
- Had to develop an easy make/install download, instructions and FAQ; new installs still require communication: prerequisites, name registration, getting cron jobs running, getting the web server running, unblocking, clarifying documentation (often for non-native English speakers).
- Documentation (tutorials, help, FAQs), publicity (brochures, papers, maps, presentations/travel) and getting funding/writing proposals all take effort.
- Monitor the availability of monitor sites (hosts stop working: security blocks, hosts replaced, the site forgets) and of critical remote sites (beacons); nudge contacts and choose replacements (automatically updating the monitor sites). Developed tools to simplify/automate this.
- Validate/update metadata (name, address, institute, lat/long, contact …) in the database (needs easy update).
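As a purely illustrative companion to the "dirty data" lesson above, here is a minimal filter that drops samples whose pathologies point at the monitoring host rather than the network, while keeping genuine loss. The field names and thresholds are assumptions, not PingER's actual rules.

```python
def clean_samples(samples):
    """samples: list of dicts with 'rtt_ms' (float or None) and 'sent'/'received' counts.
    Returns the samples judged usable for analysis."""
    cleaned = []
    for s in samples:
        if s["sent"] == 0:
            continue                      # probe never ran: monitoring-host problem, not network data
        if s["received"] == 0 and s["rtt_ms"] is None:
            cleaned.append(s)             # genuine 100% loss: keep, this is network information
            continue
        if s["rtt_ms"] is not None and s["rtt_ms"] < 0.01:
            continue                      # sub-10 microsecond RTT on a WAN path: host clock artifact
        cleaned.append(s)
    return cleaned
```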

IEPM-BW (2001)
40 target hosts in 13 countries; bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s; paths traverse ~50 ASes and 15 major Internet providers. 5 targets are at PoPs, the rest at end sites. Added Sunnyvale for UltraLight. Covers all US ATLAS Tier 0, 1 and 2 sites. Recently added FZK and QAU. The main author (Connie Logg) has retired.

IEPM-BW Design Details
More focused than PingER: fewer sites (e.g. BaBar collaborators), more intense, more probe tools (iperf, thrulay, pathload, traceroute, owamp, bbftp …), more flexibility. The complete code set (measurement, archive, analysis, visualization) runs at each monitoring site, so the data are distributed; this needs a dedicated host, and remote sites need code installed. Probes were originally executed remotely via ssh, which still needed code installed and brought security, account (training required) and recovery problems. Major changes over time: use servers rather than ssh for remote hosts; use MySQL for configuration databases rather than requiring perl scripts; provide management tools for configuration data etc.; add/replace probes. Tools include ABwE, pathchirp, bbcp, bbftp.

IEPM-BW Lessons (1)
Problems & recommendations:
- Need the right versions of MySQL, gnuplot, perl (and modules) installed on hosts.
- All possible failure modes for probe tools need to be understood and accommodated.
- Timeout everything and clean up hung processes (see the sketch below); keep logfiles for a day or so for debugging; review how processes run against Netflow (mainly manual).
- Scheduling: don't run file transfer, iperf, thrulay and pathload at the same time on the same path; limit the duration and frequency of intensive probes so they do not impact the network.
- Hosts lose disks, upgrade the OS, lose DNS, have applications upgraded (e.g. gnuplot), get the IEPM database zapped, etc. Need backups.
- Have a local host as a target for sanity checks (e.g. to spot monitoring-host-based issues), and monitor the monitoring host load (e.g. Ganglia, Nagios …).
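A minimal sketch of the "timeout everything, clean up hung processes" recommendation above: wrap each probe in a hard deadline and kill it if it wedges. The command and timeout value are illustrative.

```python
import subprocess

def run_probe(cmd, timeout_s=300):
    """Run a measurement probe, returning (exit_code, stdout), or (None, '') on timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; log and move on so the
        # scheduler does not accumulate hung probes.
        return None, ""

# Illustrative usage: a short ping probe with a one-minute hard limit.
rc, out = run_probe(["ping", "-c", "10", "www.slac.stanford.edu"], timeout_s=60)
```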

IEPM-BW Lessons (2)
Different paths need different probes (related to performance and interest). Experiences with probes (a lot of work to understand, analyze & compare):
- OWAMP vs ping: owamp needs a server and accurate time; ping gives only round-trip measures but is available everywhere, though it may be blocked.
- Traceroute: need to analyze the significance of the results.
- Packet-pair separation: ABwE is noisy and inaccurate, especially on Gbps paths; pathchirp is better, pathload best (the most intense, approaching iperf); all have problems at 10 Gbps; look at pathneck.
- TCP: thrulay gives more information and is more manageable than iperf; need to keep TCP buffers optimized/updated.
- File transfer: disk-to-disk comes close to iperf/thrulay, but measures the file/disk system rather than the network; still important to the end user.
Adding new hosts is still not easy.

Other Lessons (1)
Traceroute is no good for layers 1 & 2. Packet-pair dispersion exceeds the available timing granularity at 10 Gbps. A network admin cannot review thousands of graphs each day: need event detection, alert notification and diagnosis assistance. Comparing QoS vs best effort requires adding path reservation. Keeping TCP buffer parameters optimized is difficult; the network & configurations are not static. Forecasting is hard if the path is congested, and needs to account for diurnal etc. variations.

Examples of real data
[Time-series plots, Nov 05 - Mar 06: Caltech thrulay (~800 Mbits/s; misconfigured windows, a new path, very noisy); UToronto miperf (~250 Mbits/s; seasonal effects, daily & weekly); UTDallas pathchirp, thrulay and iperf (~120 Mbits/s; events on Mar-10-06 and Mar-20-06).]
Some series are seasonal, others are not, and events may affect multiple metrics. Events can be caused by host or site congestion. Few route changes result in bandwidth changes (~20%), and many significant events are not associated with route changes (~50%).

Netflow et al.
The switch identifies a flow by its src/dst addresses, ports and protocol, and cuts a record for each flow: src, dst, ports, protocol, TOS, start and end time. Collect the records and analyze them. There is no intrusive traffic, and it reflects real traffic, collaborators and applications; no accounts/passwords/certificates/keys, no reservations, etc. It characterizes traffic: top talkers, applications, flow lengths etc. It may be usable for forecasting for some sites and for event detection (security also wants it). The LHC-OPN requires edge routers to provide Netflow data. Internet2 backbone: http://netflow.internet2.edu/weekly/; SLAC: www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html. NetraMet and SCAMPI are a couple of non-commercial flow projects; IPFIX is an IETF standardization effort for Netflow-type passive monitoring.

Netflow limitations
There can be a lot of data to collect each day (hundreds of MBytes to GBytes), needing a lot of CPU. The use of dynamic ports makes it harder to identify the application: GridFTP, bbcp and bbftp can use fixed ports (but may not), P2P often uses dynamic ports, and for FTP port 21 is only used as a control channel. Instead, discriminate the type of flow based on headers (not relying on ports): types such as bulk data, interactive …; discriminators such as inter-arrival time, length of flow, packet length, volume of flow. Use machine learning/neural nets to cluster flows, e.g. http://www.pam2004.org/papers/166.pdf (see the sketch below). Aggregation of parallel flows needs care, but is not difficult. SCAMPI/FFPF/MAPI allow more flexible flow definitions than Netflow; see www.ist-scampi.org/.
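A minimal sketch of clustering flow records by the discriminators named above (inter-arrival time, flow duration, mean packet length, flow volume) so that bulk-data flows separate from interactive ones. It assumes NumPy and scikit-learn are available; the feature set and the choice of k-means are illustrative, not the exact method of the cited PAM 2004 work.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_flows(flows, n_clusters=2):
    """flows: iterable of dicts with 'interarrival_s', 'duration_s',
    'mean_pkt_bytes', 'volume_bytes'. Returns a cluster label per flow."""
    X = np.array([[f["interarrival_s"], f["duration_s"],
                   f["mean_pkt_bytes"], f["volume_bytes"]] for f in flows])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)   # normalize each feature
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```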

NG: perfSONAR
Our future focus (for us the 3rd generation), with an NSF proposal. Open source, open community: both end users (LHC, GATech, SLAC, Delaware) and network providers (ESnet, I2, GEANT, European NRENs, Brazil, …); can it achieve critical mass? Many developers from multiple fields; it requires, from the get-go, shared code, documentation and collaboration. Hopefully not as dependent on funding as a single team, so more persistent? It provides transparent gathering and storage of measurements, both from NRENs and end users, sharing of information across autonomous domains, standard formats, a more comprehensive view, and AAA to protect sensitive data. It reduces debugging time: access to multiple components of the path, with no need to play telephone tag. Currently it is mainly middleware; it still needs data mining and visualization, topology also at layers 1 & 2, forecasting, and event detection and diagnosis.

Challenges
Probe tools fail at >1 Gbps. Dedicated circuits, QoS, layers 1 & 2. The impact of new technology (e.g. TOE NICs). Automated event detection, alerts and diagnosis. Integrating passive & active measurements. Tying in end systems, file/disk systems and applications. Sustainability (funding disappears). Factorise components (measure, archive, analyze) AND tasks, and provide standard, published interfaces. Engage the community (multiple developers, users, providers: this has its own challenges).

Questions, More information Comparisons of Active Infrastructures: www.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html Some active public measurement infrastructures: www-iepm.slac.stanford.edu/ www-iepm.slac.stanford.edu/pinger/ www.slac.stanford.edu/grp/scs/net/talk06/IEPM-BW%20Deployment.ppt e2epi.internet2.edu/owamp/ amp.nlanr.net/ (no longer available) Monitoring tools www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html www.caida.org/tools/ Google for iperf, thrulay, bwctl, pathload, pathchirp Event detection www.slac.stanford.edu/grp/scs/net/papers/noms/noms14224-122705-d.doc

More Slides

Active E2E Monitoring
Usually at layer 3 or 4; layers 1 and 2 are less well exploited/understood/related to applications, and there are lots of instances: FC, FICON, 10GE, SONET, OC192, SDH … Check vendor specs, e.g. Cisco, Juniper etc.
SONET monitoring example (from Endace): the PHYMON occupies 1U (OC3/12) or 2U (OC48/192) of vertical rack space and is equipped with two 10/100/1000 copper Ethernet interfaces for control and reporting via LAN. Key features:
- Monitors up to two OC3/OC12/OC48/OC192 network links.
- Detects link-layer failures: LOS-S, LOF-S, AIS-L, REI-L, RDI-L, AIS-P, LOP-P, UNEQ-P, REI-P, RDI-P.
- Derives errors: CV, ES, ESA, ESB, SES and UAS according to the Bellcore GR-253, Issue 2 Rev 2 standard.
- Sends SNMP traps for all failures and error thresholds according to user configuration.
- Reports current status in real time via telnet, ssh or serial connection; reports accumulated status over 15m, 1h, 8h, 24h, 7d intervals; retains historical data for 35 days.
- Supplies all the underlying data for the SNMP SONET MIB (RFC 2558).
Techniques: loopback, test patterns (BERT, e.g. ones & zeroes, various ITU-T specs), loss of signal, out of frame, loss of frame, errored seconds, code violations, unavailable seconds, alarms, near & far end, how often, history.

E.g. Using Active IEPM-BW measurements
Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. the HEP tiered model. Makes regular measurements with probe tools: ping (RTT, connectivity), owamp (one-way delay), traceroute (routes), pathchirp and pathload (available bandwidth), iperf (single & multi-stream) and thrulay (achievable throughput); also supports bbftp and bbcp (file-transfer applications, not network). Looking at GridFTP, but it is complex, requiring certificate renewal. The choice of probes depends on the importance of the path: for major paths (tier 0, 1 & some 2) use the full suite; for tier 3 use just ping and traceroute. Running at major HEP sites (CERN, SLAC, FNAL, BNL, Caltech, Taiwan, SNV) to about 40 remote sites. http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html

IEPM-BW Measurement Topology
[Map of the measurement topology, with labels for Taiwan and TWAREN.] 40 target hosts in 13 countries; bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s; paths traverse ~50 ASes and 15 major Internet providers. 5 targets are at PoPs, the rest at end sites. Added Sunnyvale for UltraLight; adding FZK Karlsruhe.

Top page

Probes: Ping/traceroute
Ping is still useful: is the path connected / the node reachable? RTT, jitter, loss. Great for low-performance links (e.g. Digital Divide), e.g. AMP (NLANR) / PingER (SLAC). Nothing to install, but it may be blocked. OWAMP/I2 is similar but one-way; it needs a server installed at the other end and good timers. Now built into IEPM-BW. Traceroute needs good visualization (traceanal/SLAC) and is of no use on a dedicated λ at layer 1 or 2; however, we still want to know the topology of the paths.

Probes: Packet-Pair Dispersion
[Diagram: packets leave with minimum spacing, are spread apart at the bottleneck, and the spacing is preserved on higher-speed links downstream.] Used by pathload, pathchirp and ABwE to estimate available bandwidth: send packets with a known separation and see how the separation changes due to the bottleneck. Can be of low network intrusiveness, e.g. ABwE uses only 20 packets per direction and is fast (<1 second). From the PAM paper (http://www.pam2005.org/PDF/34310310.pdf), pathchirp is more accurate than ABwE, but takes ten times as long (10 s vs 1 s) and generates more network traffic (~a factor of 10); pathload is a factor of 10 more again. IEPM-BW now supports ABwE, pathchirp and pathload.
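The packet-pair idea above reduces to a simple calculation: the receiver-side spacing of two back-to-back packets bounds the narrow-link capacity as C ≈ packet size / dispersion. A minimal sketch with illustrative numbers:

```python
def capacity_from_dispersion(pkt_bytes: int, dispersion_s: float) -> float:
    """Estimate the bottleneck capacity in Mbits/s from the packet size and
    the measured inter-packet spacing at the receiver."""
    return pkt_bytes * 8 / dispersion_s / 1e6

# 1500-byte packets arriving 120 microseconds apart -> ~100 Mbits/s bottleneck
print(round(capacity_from_dispersion(1500, 120e-6)))
```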

BUT…
Packet-pair dispersion relies on accurate timing of the inter-packet separation. At >1 Gbps this is getting beyond the resolution of Unix clocks, AND 10GE NICs are offloading functions (interrupt coalescing, Large Send & Receive Offload, TOE). Options: work with TOE vendors; turn off offload (Neterion supports multiple channels and can eliminate offload to get more accurate timing in the host); do the timing in the NICs (no standards for the interfaces); possibly use packet trains, e.g. pathneck.
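Some back-of-the-envelope support for the timing problem above: the wire time of a 1500-byte packet at various line rates, compared against a roughly microsecond timestamp granularity on commodity hosts (the granularity figure is an assumption, not from the slide).

```python
# Wire time of a 1500-byte packet at several line rates.
PKT_BITS = 1500 * 8
for gbps in (1, 2.5, 10):
    wire_time_us = PKT_BITS / (gbps * 1e9) * 1e6
    print(f"{gbps:>4} Gbps: {wire_time_us:.2f} us per 1500-byte packet")
# At 10 Gbps the spacing is ~1.2 us, i.e. at or below typical host clock
# resolution, before interrupt coalescing and NIC offload perturb it further.
```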

Achievable Throughput
Use TCP or UDP to send as much data as possible, memory to memory, from source to destination. Tools: iperf (bwctl/I2), netperf, thrulay (from Stas Shalunov/I2), udpmon … Pseudo file copy: bbcp also has a memory-to-memory mode to avoid disk/file problems.

BUT…
At 10 Gbits/s on a transatlantic path, slow start takes over 6 seconds. To get 90% of the measurement in congestion avoidance you need to measure for about 1 minute (5.25 GBytes at 7 Gbits/s, today's typical performance). This needs scheduling to scale, and even then … it's not disk-to-disk or application-to-application, so also use bbcp, bbftp, or GridFTP.
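A hedged back-of-the-envelope version of the argument above. The ~180 ms transatlantic RTT, 1500-byte segments and slow-start growth of ~1.5x per RTT (delayed ACKs) are assumptions not taken from the slide; with them, slow start comes out at roughly 5 s and the required test length at about a minute, the same order as the figures quoted.

```python
import math

def slow_start_seconds(rate_gbps=10.0, rtt_s=0.18, mss_bytes=1500, growth=1.5):
    """RTTs needed for the window to reach the bandwidth-delay product,
    times the RTT, gives an estimate of the slow-start duration."""
    bdp_segments = rate_gbps * 1e9 * rtt_s / (mss_bytes * 8)
    rtts_to_fill = math.log(bdp_segments, growth)
    return rtts_to_fill * rtt_s

ss = slow_start_seconds()          # ~5 s with these assumptions (slide quotes >6 s)
test_s = ss / (1 - 0.9)            # 90% in congestion avoidance -> roughly a minute
print(f"slow start ~{ss:.1f} s, so a test needs ~{test_s:.0f} s")
```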

AND … For testbeds such as UltraLight, UltraScienceNet etc. have to reserve the path So the measurement infrastructure needs to add capability to reserve the path (so need API to reservation application) OSCARS from ESnet developing a web services interface (http://www.es.net/oscars/): For lightweight have a “persistent” capability For more intrusive, must reserve just before make measurement

Visualization & Forecasting in Real World

Examples of real data
[Time-series plots, Nov 05 - Mar 06: Caltech thrulay (~800 Mbits/s; misconfigured windows, a new path, very noisy); UToronto miperf (~250 Mbits/s; seasonal effects, daily & weekly); UTDallas pathchirp, thrulay and iperf (~120 Mbits/s; events on Mar-10-06 and Mar-20-06).]
Some series are seasonal, others are not, and events may affect multiple metrics. Events can be caused by host or site congestion. Few route changes result in bandwidth changes (~20%), and many significant events are not associated with route changes (~50%).

Scatter plots & histograms
Scatter plots quickly identify correlations between metrics. [Example scatter plots of thrulay, pathchirp and iperf throughput (Mbits/s) against RTT (ms).] Histograms quickly identify variability or multimodality. [Example histograms of pathchirp and thrulay throughput (Mbits/s).]

Changes in network topology (BGP) can result in dramatic changes in performance.
[Figure: snapshot of a traceroute summary table, sample traceroute trees generated from it, and a 24-hour time series of ABwE measurements (one per minute, Thurs Oct 9 9:00am to Fri Oct 10 9:01am) showing dynamic bandwidth capacity (DBC), cross-traffic (XT) and available bandwidth (= DBC - XT) in Mbits/s, with the drop in performance while the ESnet-LosNettos (100 Mbits/s) segment was in the path and the recovery on return to the original path.]
Notes: 1. Caltech was misrouted via the Los-Nettos 100 Mbps commercial network from 14:00 to 17:00 (from the original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos(100 Mbps)-Caltech). 2. ESnet/GEANT were working on routes from 2:00 to 14:00. 3. A previous occurrence went unnoticed for 2 months. 4. The next step is to auto-detect and notify. The changes were detected by IEPM-iperf and ABwE.

On the other hand
Route changes may affect the RTT (in yellow) yet have no noticeable effect on available bandwidth or throughput. [Time-series plot showing available bandwidth, achievable throughput and route changes.]

However…
Elegant graphics are great for understanding problems, BUT there can be thousands of graphs to look at (many site pairs, many devices, many metrics). We need automated problem recognition AND diagnosis, so we are developing tools to reliably detect significant, persistent changes in performance, initially using a simple plateau algorithm to detect step changes (see the sketch below).
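A minimal sketch of a plateau-style step detector of the kind described above: compare a short trigger buffer against a longer history buffer and flag persistent drops. The buffer lengths and the 2-sigma threshold are illustrative choices, not the tuned production values.

```python
from collections import deque
from statistics import mean, stdev

def detect_steps(series, hist_len=50, trig_len=5, n_sigma=2.0):
    """Return indices where the recent (trigger) mean falls well below the
    history mean, i.e. candidate persistent step-down events."""
    history, trigger, events = deque(maxlen=hist_len), deque(maxlen=trig_len), []
    for i, value in enumerate(series):
        trigger.append(value)
        if len(history) == hist_len and len(trigger) == trig_len:
            mu, sigma = mean(history), stdev(history)
            if mean(trigger) < mu - n_sigma * sigma:
                events.append(i)          # persistent drop relative to history
        history.append(value)             # history lags the trigger buffer
    return events
```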

Seasonal Effects on events Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time) Causes more anomalous events around this time

Forecasting
Over-provisioned paths should have pretty flat time series: short/local-term smoothing, long-term linear trends, seasonal smoothing. But seasonal trends (diurnal, weekly) need to be accounted for on about 10% of our paths, so we use Holt-Winters triple exponential weighted moving averages.
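A minimal additive Holt-Winters (triple exponential smoothing) sketch of the kind of forecaster described above. The smoothing constants, the seasonal period (e.g. 24 hourly samples per day) and the crude initialisation are illustrative assumptions.

```python
def holt_winters_forecast(series, period=24, alpha=0.2, beta=0.05, gamma=0.1, steps=24):
    """Additive Holt-Winters forecast; assumes at least two full seasons of data."""
    # crude initialisation from the first two seasons
    level = sum(series[:period]) / period
    trend = (sum(series[period:2 * period]) - sum(series[:period])) / period ** 2
    seasonal = [series[i] - level for i in range(period)]

    for i in range(period, len(series)):
        s = seasonal[i % period]
        last_level = level
        level = alpha * (series[i] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[i % period] = gamma * (series[i] - level) + (1 - gamma) * s

    # forecast `steps` points ahead, reusing the seasonal indices
    return [level + (h + 1) * trend + seasonal[(len(series) + h) % period]
            for h in range(steps)]
```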

Experimental Alerting Have false positives down to reasonable level (few per week), so sending alerts to developers Saved in database Links to traceroutes, event analysis, time-series

Passive vs Active monitoring
Active: Pro: regularly spaced data on known paths, and measurements can be made on demand. Con: adds traffic to the network and can interfere with real data and measurements. What about Passive? We need a replacement for the active packet-pair and throughput measurements, and are evaluating whether we can use Netflow for this.

Netflow et al.
The switch identifies a flow by its src/dst addresses, ports and protocol, and cuts a record for each flow: src, dst, ports, protocol, TOS, start and end time. Collect the records and analyze them. It can be a lot of data to collect each day (hundreds of MBytes to GBytes), needing a lot of CPU. There is no intrusive traffic, and it reflects real traffic, collaborators and applications; no accounts/passwords/certificates/keys, no reservations, etc. It characterizes traffic: top talkers, applications, flow lengths etc. The LHC-OPN requires edge routers to provide Netflow data. Internet2 backbone: http://netflow.internet2.edu/weekly/; SLAC: www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html. NetraMet and SCAMPI are a couple of non-commercial flow projects; IPFIX is an IETF standardization effort for Netflow-type passive monitoring.

Typical day’s flows Divide by remote site, aggregate parallel streams Very much work in progress Look at SLAC border Typical day: ~ 28K flows/day ~ 75 sites with > 100KB bulk-data flows Few hundred flows > GByte Collect records for several weeks Filter 40 major collaborator sites, big (> 100KBytes) flows, bulk transport apps/ports (bbcp, bbftp, iperf, thrulay, scp, ftp …) Divide by remote site, aggregate parallel streams Look at throughput distribution

Netflow et al.
The throughput distributions show peaks at known capacities and RTTs; the RTT-related peaks might suggest windows are not optimized, with peaks at the default OS window size (BW = Window/RTT).
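The window-limited rule quoted above, BW = Window/RTT, worked through for a hypothetical default 64 KByte window at a few RTTs; the figures are illustrative only.

```python
def window_limited_mbps(window_bytes, rtt_ms):
    """Throughput ceiling imposed by a fixed TCP window over a given RTT."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6

for rtt in (20, 80, 160):
    print(f"64 KB window, {rtt:>3} ms RTT -> {window_limited_mbps(64 * 1024, rtt):5.1f} Mbits/s")
# e.g. ~26 Mbits/s at 20 ms, ~6.6 Mbits/s at 80 ms, ~3.3 Mbits/s at 160 ms
```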

How many sites have enough flows?
In May '05 we found 15 sites at the SLAC border with >1440 flows (about one every 30 minutes), maybe enough for time-series forecasting of seasonal effects. Three of the sites (Caltech, BNL, CERN) were actively monitored; the rest were "free". Only about 10% of sites show big seasonal effects in the active measurements, and the remainder need fewer flows, so this is promising.

Mining data for sites Real application use (bbftp) for 4 months Gives rough idea of throughput (and confidence) for 14 sites seen from SLAC

Multi-months
[Plot: bbcp throughput from SLAC to Padova over several months.] Fairly stable with time, but with large variance; many non-network-related factors.

Netflow limitations
The use of dynamic ports makes it harder to identify the application: GridFTP, bbcp and bbftp can use fixed ports (but may not), P2P often uses dynamic ports, and for FTP port 21 is only used as a control channel. Instead, discriminate the type of flow based on headers (not relying on ports): types such as bulk data, interactive …; discriminators such as inter-arrival time, length of flow, packet length, volume of flow; use machine learning/neural nets to cluster flows, e.g. http://www.pam2004.org/papers/166.pdf. Aggregation of parallel flows needs care, but is not difficult. Netflow can be used for giving a performance forecast, but it is unclear whether it can be used for detecting steps in performance. SCAMPI/FFPF/MAPI allow more flexible flow definitions than Netflow; see www.ist-scampi.org/.

Conclusions Some tools fail at higher speeds Throughputs often depend on non-network factors: Host: interface speeds (DSL, 10Mbps Enet, wireless), loads, resource congestion Configurations (window sizes, hosts, number of parallel streams) Applications (disk/file vs mem-to-mem) Looking at distributions by site, often multi-modal Predictions may have large standard deviations Need automated assist to diagnose events

In Progress Working on Netflow viz (currently at BNL & SLAC) then work with other LHC sites to deploy Add support for pathneck Look at other forecasters: e.g. ARMA/ARIMA, maybe Kalman filters, neural nets Working on diagnosis of events Multi-metrics, multi-paths Signed collaborative agreement with Internet2 to collaborate with PerfSONAR Provide web services access to IEPM data Provide analysis forecasting and event detection to PerfSONAR data Use PerfSONAR (e.g. router) data for diagnosis Provide viz of PerfSONAR route information Apply to LHCnet Look at layer 1 & 2 information