1 Internet Performance Monitoring for the HENP Community Les Cottrell & Warren Matthews – SLAC Presented at the Passive & Active Measurement Workshop, University of Waikato, New Zealand April 3, 2000 Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP
2 Overview Requirements PingER Validations Results Quality of Service IPv6 Monitoring Summary
3 HENP Requirements Large experiments with collaborators in over 50 countries –Hundreds or even > 1000 people on experiment Data volumes of PetaBytes or even ExaBytes (10^18) Distributed access: –Bulk transfer to regional centers –Fast database queries –Smooth interactive sessions ICFA created standing committee to review Inter-regional Connectivity Mainly use National Research Networks Set expectations, help troubleshoot, planning input
4 PingER Measurements from –30 monitors in 15 countries –Over 500 remote hosts –Over 70 countries –Over 2100 monitor-remote site pairs Over 50% of HENP collaborator sites are explicitly monitored as remote sites by PingER project –Atlas (37%), BaBar (68%), Belle (23%), CDF (73%), CMS (31%), D0 (60%), LEP (44%), Zeus (35%), PPDG (100%), RHIC(64%) Remainder covered by Beacons –Currently 56, extending to 76
5 Beacons & UK seen from ESnet Sites in UK track one another, so can represent with single site 2 Beacons in UK Indicates common source of congestion Increased capacity by 155 times in 5 years Effect of ACLs Direct peering between JANet and ESnet
6 PingER Deployment Jan-00
7 Validations: Ping vs. Surveyor Scatter plot Ping RTT vs Surveyor RTT gives R^2 ~
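To make the R^2 figure of merit concrete, here is a minimal sketch that computes the squared Pearson correlation between two paired RTT series, as one would for a scatter plot of one tool's RTT against another's. The function name and any sample values are illustrative assumptions, not PingER or Surveyor code or data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation coefficient of two paired series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross products about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return (sxy * sxy) / (sxx * syy)
```

A value near 1 means the two tools track each other point by point; a value near 0 means they do not, even if their long-term averages agree.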
8 RIPE vs Surveyor 1/2 Little short term correlation even for time differences of < 2 secs Little structure; outliers don’t match
9 RIPE vs Surveyor 2/2 Optimum agreement if displace RIPE by ~ 0.2 ms (packet size difference)
10 PingER vs AMP Little obvious short term agreement (R^2 < 0.1) Same if compare ping vs. ping Avg Ping distribution agrees with AMP Both show >=95% of samples are msec R^2 > 0.95 for min & avg Time series
11 Rate Limiting 1/2 Have identified about 2% of sites probably limiting Using Sting (Stefan Savage) & SynAck (SLAC) tools to identify loss(sting or synack probes) << loss(ping) blocked 884 rounds of 10 ICMP packets each, out of 903 islamabad-server2.comsats.net.pk –blocked 554 out of 903 leonis.nus.edu.sg –blocked all non 56Byte packets All low loss with sting or synack
12 Rate Limiting 2/2 “Tail-drop” behavior Rate-limiting kicks in after the first few packets and hence later packets are more likely to be dropped Calculate slope and histogram slope frequency for all nodes, look at outliers (8) Added as PingER metric, Still validating, some sites consistent others vary from month to month
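The slope calculation mentioned above can be sketched as follows: for each position in a round of pings, compute the fraction of rounds in which that packet was lost, then fit a least-squares line through loss versus position. A clearly positive slope is the "tail-drop" signature. The data layout (a list of rounds, each a list of booleans) and the function names are assumptions for illustration, not the PingER implementation.

```python
def loss_by_position(rounds):
    """rounds: list of rounds; each round is a list of booleans,
    True if the reply for that packet position was received."""
    if not rounds:
        raise ValueError("no rounds given")
    n_pos = len(rounds[0])
    lost = [0] * n_pos
    for rnd in rounds:
        for i, ok in enumerate(rnd):
            if not ok:
                lost[i] += 1
    return [count / len(rounds) for count in lost]


def slope(ys):
    """Least-squares slope of ys against positions 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two positions")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    return sxy / sxx
```

Computing `slope(loss_by_position(rounds))` for every node and histogramming the results would then let the outliers stand out.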
13 Results: How are the U.S. Nets doing? In general performance is good (i.e. loss <= 1%). Edu (vBNS/Abilene) is catching up with ESnet XIWT (70% .com) 3-5 times worse than ESnet or I2
14 Europe seen from U.S. 650ms 200 ms 7% loss 10% loss 1% loss Monitor site Beacon site (~10% sites) HENP country Not HENP Not HENP & not monitored
15 Asia seen from U.S. 3.6% loss 10% loss 0.1% loss 640 ms 450 ms 250ms
16 Latin America, Africa & Australasia 4% Loss 2% Loss 350 ms 700ms 170 ms 220 ms
17 Quality of Service: How to improve More bandwidth –Keep network load low (< 30%) –Costs (at least in the W) are coming down dramatically, but non-trivial to keep up Reserved/managed bandwidth generally on ATM via PVCs today Differentiated services
18 Effect of more & managed bandwidth German Universities as good as DESY after Oct-99 upgrade DFN closes Perryman POP loses direct ESnet peering Peering re-established via 60 Hudson RTT Loss
19 RTT from ESnet to Groups of Sites ITU G ms RTT limit for voice
20 Loss seen from ESnet to groups of Sites ITU limit for loss
21 Bulk transfer - Performance Trends Bandwidth TCP < 1460/(RTT * sqrt(loss)) Note: E. Europe not catching up ESnet Flattening out
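As a worked example of the bound above, the sketch below evaluates 1460/(RTT * sqrt(loss)), where 1460 is taken to be the TCP maximum segment size in bytes, RTT is in seconds and loss is a fraction between 0 and 1. The function name and the choice of units are assumptions made here for illustration.

```python
from math import sqrt


def tcp_bandwidth_bound(rtt_s, loss, mss_bytes=1460):
    """Upper bound on TCP throughput in bytes per second.

    rtt_s: round trip time in seconds
    loss:  packet loss as a fraction (0 < loss <= 1)
    """
    if rtt_s <= 0:
        raise ValueError("RTT must be positive")
    if not 0 < loss <= 1:
        raise ValueError("loss must be in (0, 1]; the bound diverges at 0")
    return mss_bytes / (rtt_s * sqrt(loss))


# Example: 100 ms RTT with 1% loss gives 1460 / (0.1 * 0.1) bytes/s.
bound = tcp_bandwidth_bound(0.1, 0.01)
print(f"{bound * 8 / 1000:.0f} kbit/s")
```

Because loss enters under a square root while RTT enters linearly, halving the RTT helps more than halving the loss.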
22 Interactive apps - Jitter IPDD(i) = RTT(i) - RTT(i-1)
23 Interactive apps - Jitter IPDD(i) = RTT(i) - RTT(i-1)
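The IPDD definition above translates directly into code. The sketch below takes a list of successive RTT samples in milliseconds and returns the differences between neighbours; the helper that reports the fraction of differences beyond a threshold is an added illustration, not something defined in the slides.

```python
def ipdd(rtts):
    """IPDD(i) = RTT(i) - RTT(i-1) for successive RTT samples."""
    return [later - earlier for earlier, later in zip(rtts, rtts[1:])]


def fraction_exceeding(deltas, threshold_ms):
    """Fraction of delay differences whose magnitude exceeds threshold_ms."""
    if not deltas:
        raise ValueError("no delay differences given")
    return sum(1 for d in deltas if abs(d) > threshold_ms) / len(deltas)
```

For example, `ipdd([100, 110, 105])` gives `[10, -5]`.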
24 SLAC-CERN Jitter ITU/TIPHON delay jitter threshold (75 ms)
25 Voice over IP: Reachability Within N. America & W. Europe, loss, RTT and jitter are acceptable for VoIP But what about reachability?
26 Availability – Outage Probability Surveyor probes randomly 2/second Measure the time (outage length) during which consecutive probes don’t get through
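One way to turn a probe log into outage lengths is sketched below. It assumes each probe has a send time in seconds and a flag saying whether it got through, and it defines an outage as running from the first lost probe of a run to the next probe that succeeds. That definition, and the function name, are assumptions for illustration; the slides do not specify how Surveyor data were reduced.

```python
def outage_lengths(times, received):
    """Return the duration of each run of consecutive lost probes.

    times:    probe send times in seconds, in increasing order
    received: booleans, True if the probe got through
    An outage still in progress at the end of the data is ignored.
    """
    if len(times) != len(received):
        raise ValueError("times and received must have equal length")
    outages = []
    start = None  # send time of the first lost probe in the current run
    for t, ok in zip(times, received):
        if not ok and start is None:
            start = t
        elif ok and start is not None:
            outages.append(t - start)
            start = None
    return outages
```

A histogram of the returned lengths, normalised by the number of probes, gives an empirical outage probability as a function of outage length.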
27 Error free seconds Typical US phone company objectives are % What do we see for the Internet using Surveyor measurements
28 Small amount of bandwidth carved off ESnet connection to provide native IPv6 service to SLAC 6REN RTR-IPv6 IPv6 Monitoring Production IPv6 allocation 2001:400:0808::/48 Addresses are in DNS PingER6 Scylla Charybdis Switch IPv6 VLAN VLAN allows deployment throughout SLAC SLAC
29 Porting PingER to PingER6 Recompiled Linux (Red Hat 6.0) kernel with IPv6 support Downloaded & installed inet-apps (including ping) from inner.net and patch for glibc-2.1 systems Wrote Perl module to provide IPv6 DNS lookup Got remote IPv6 sites to monitor –10 countries, 40 sites Currently one monitoring site at SLAC –6TAP to start soon –China? Remote Sites
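The original IPv6 DNS lookup was a Perl module that is not shown in the slides. As a rough modern equivalent, the sketch below uses Python's standard `socket.getaddrinfo` restricted to `AF_INET6`. The function name is hypothetical and this is not the PingER6 code.

```python
import socket


def lookup_ipv6(host):
    """Return the distinct IPv6 addresses that host resolves to."""
    infos = socket.getaddrinfo(host, None, socket.AF_INET6, socket.SOCK_DGRAM)
    addresses = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        addr = sockaddr[0]  # sockaddr is (address, port, flowinfo, scope_id)
        if addr not in addresses:
            addresses.append(addr)
    return addresses
```

A host with no IPv6 address raises `socket.gaierror`, which a monitoring script would want to catch and log rather than treat as packet loss.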
30 How does it look? % loss The weekend RTT RTT Between SLAC and Purdue in Nov/Dec 1999 IPv6 IPv4 Nov/Dec 1999 Much of current 6BONE is congested
31 Summary Long term agreement between AMP, PingER, Surveyor, & RIPE –need persistent structure (e.g. congestion or route changes) for short term point by point agreement Rate limiting still a minor effect, but could become a problem, trying to get good signature International performance from US to sites outside W. Europe, JP, KR, SG, TW is generally poor to bad Managed bandwidth can be big help. ESnet & Internet 2 doing well, even for VoIP, except reachability has a way to go PingER ported to IPv6, 6BONE congested
32 More Information This talk: – IEPM/PingER home site –www-iepm.slac.stanford.edu/ Comparison of Surveyor & RIPE & PingER –www.slac.stanford.edu/comp/net/wan-mon/surveyor-vs-pinger.html Detecting ICMP Rate Limiting – IPv6 Monitoring –