Internet Monitoring - Results


1 Internet Monitoring - Results
Les Cottrell, SLAC. Presented at the ICFA Meeting, CERN, Mar 1998. Partially funded by the MICS joint SLAC/LBL proposal on Internet End-to-end Performance Monitoring (IEPM). 3/4/98 \\pcbackup\users\cottrell\icfa\icfa-mar98.ppt

2 Outline of Talk
What, why & how are we (the ESnet/HENP community) measuring? What PingER measurement reports are available and what do they show: short, intermediate & long term; grouping and multi-site visualization. Traffic volume & traceroute measurements. Summary: deployment/development, Internet performance, next steps, collaborations, NIMI/IPWT. I won't talk about the actual tools and will only briefly cover the method (they were covered by Dave Martin's presentation); I will mainly dwell on the long term trend reports and on how we use the results of the tools to better understand the Internet.

3 Why go to the effort?
Apparent quality of the Internet is getting worse as size and demands increase. The Internet is woefully under-measured & under-instrumented. The Internet is very diverse: no single path is typical. Users need: realistic expectations; planning information; guidelines for setting and validating SLAs; information to help in identifying problems; help to decide where to apply resources. Demands are driven by: increase in the number of users, increase in power available at the desktop and in servers, newer applications (more graphics based, video, voice etc.), and the need for better QoS.

4 Importance of Response Time
Time is the scarcest and most valuable commodity. Studies in the late 70s and early 80s showed the economic value of Rapid Response Time:
0-0.4s: High productivity interactive response
0.4-2s: Fully interactive regime
2-12s: Sporadically interactive regime
12s-600s: Break in contact regime
>600s: Batch regime
There is a threshold around 4-5s above which complaints increase rapidly. Note that the TCP/IP timeout caused by a packet loss is of the order of 4-5 seconds. For some newer Internet applications there are other thresholds: for voice a threshold appears at about 100ms. Above that point the delay causes difficulty for people trying to have a conversation, and frustration grows.
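The response-time regimes on this slide can be written as a small lookup. This is a minimal Python sketch, not part of the original talk. It assumes that a response time exactly on a boundary falls into the slower regime (the slide does not say), and the name `response_regime` is hypothetical.

```python
from bisect import bisect_right

# Upper bounds (in seconds) of the first four regimes listed on the slide.
_BOUNDS = [0.4, 2.0, 12.0, 600.0]
_REGIMES = [
    "High productivity interactive response",
    "Fully interactive regime",
    "Sporadically interactive regime",
    "Break in contact regime",
    "Batch regime",
]


def response_regime(seconds):
    """Return the regime name for a response time given in seconds.

    Assumption: a value exactly on a boundary (e.g. 0.4 s) is placed
    in the slower of the two adjacent regimes.
    """
    if seconds < 0:
        raise ValueError("response time cannot be negative")
    return _REGIMES[bisect_right(_BOUNDS, seconds)]


# A 4-5 s TCP retransmission timeout lands in the sporadically interactive regime.
print(response_regime(4.5))
```

Under this reading, a single lost packet is enough to push an otherwise highly interactive session into the sporadically interactive regime.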

5 Perception of Poor Packet Loss
Above 4-6% packet loss video conferencing becomes irritating, and non-native language speakers become unable to communicate. The occurrence of long delays of 4 seconds or more at a frequency of 4-5% or more is also irritating for interactive activities such as telnet and X windows. Above 10-12% packet loss there is an unacceptable level of back-to-back packet loss and extremely long timeouts, connections start to get broken, and video conferencing is unusable.
The scarcest and most valuable commodity is time. Studies in the late 70s and early 80s by Walt Doherty of IBM and others showed the economic value of Rapid Response Time:
0-0.4s: High productivity interactive response
0.4-2s: Fully interactive regime
2-12s: Sporadically interactive regime
12s-600s: Break in contact regime
>600s: Batch regime
There is a threshold around 4-5s where complaints increase rapidly. For some newer Internet applications there are other thresholds: for voice a threshold appears at about 100ms. Above that point the delay causes difficulty for people trying to have a conversation, and frustration grows.

6 Our Main Metric is Ping
“Universally available”, easy to understand, no software for clients to install. Low network impact. Provides useful real-world measures of loss, response time, reachability and unpredictability. Avoid pinging routers: they drop pings addressed to the router when busy. Prefer lightly loaded or consistently loaded hosts, e.g. a name server or mail gateway.

7 Ping Response vs Web Response 1/2
[Scatter plot: HTTP GET response (ms) vs. minimum ping response (ms)] R**2 ~ 0.6, i.e. about 60% of the variance in the GET response can be explained by the ping response. More importantly, there is a lower limit around y = 2*x. This is related to the GET taking 2 round trips (SYN/ACK & GET/response) versus ping taking a single round trip. The lower limit shows that, given the minimum ping response, one can get an idea of the best possible Web response.

8 Ping Response vs Web Response 2/2
Interquartile distance is ~ 250 ms (green lines show quartiles). FWHM ~ 120 ms.

9 Ranked packet loss for 3 months
[Plot labels: Stanford, Rome, UK, Cincinnati] Note that the X axis scale changes; the plots show ESnet sites are much better for SLAC. Note the big variations from month to month. The poor U of Cincinnati performance was a cause for concern since SLAC has a strong collaboration with them; we worked with various people to try to improve it. The appearance of SLAC as the worst in November is an anomaly caused by problems on one particular day between the monitoring host and the host being monitored. The difference between the U of Colorado to SLAC and the Colorado State links is quite evident: U Colorado has a vBNS connection with good peering to ESnet, but Colorado State does not. Poor performance between SLAC and the Stanford U Medical Center (separated by 2 miles) was due to poor connectivity at MAE-West; this has now been bypassed with a microwave link. The bad connectivity to Rome is not reflected by other Italian sites. UK sites (RL & Glasgow) have worse performance than between SLAC and Beijing or SLAC & Novosibirsk.

10 Sawtooth Effect
[Plot annotations: 2 * capacity (+ 2 Mbps); added 45 Mbps (quadrupled capacity); 3 * capacity (+ 9 Mbps); holidays] Adding extra capacity to the UK-US link in April 96, Feb 97 and Jul 98 improves (reduces) packet loss, but the slack is soon taken up. There is also a distinct dip around the New Year holiday season. The TEN-34 link improved access to Europe and mirror sites; at the same time the ANS/Sprint link balancing was improved. We recently added a second UK site (University of Glasgow) to ensure, by pair-wise comparison, that the effects are not unique to RL.

11 RAL Last 180 Days plot
Lines are simply cubic spline fits to aid the eye. Upper green and black points are response time in ms. Red & blue are weekday loss; cyan is weekend loss. Note the weekend/weekday differences (cyan vs blue). Note the Xmas/New Year lull. Also note the quick onset of saturation at end August & September.

12 Italian sites look similar to each other

13 Representative International HENP Site Loss Jan-95 thru Nov-97
Note RL (UK) saw-tooths as UK-US bandwidth is added (Apr-96, Feb-97, Aug-97). This indicates the importance of keeping a log of what happened on routes, possibly with regular pathchars, or at least traceroutes.

14 Aggregation
Group measurements, for example:
by area (e.g. N. America E, N. America E, W. Europe/Japan, others, by country)
trans-oceanic links, intercontinental links
separation, e.g. number of hops, time zones crossed, IXPs crossed
ISP (ESnet, vBNS/I2, ...)
by monitoring site
one site seen from multiple sites
common interest/affiliation (XIWT, HENP …)
user selectable

15 Group Selection (all sites monitoring CERN)
Select one of these groups: CMU, CNAF, RL, FNAL, SLAC, DESY, Carleton, RMKI, CERN, KEK. Allows the user to select which group of links (out of > 500) to display results for. Note that some collection sites ping multiple hosts at a given site, which provides a check for consistency.

16 Group Response Time Jan-95 thru Nov-97
Response time improved by between 1 and 2.5% / month. Response & loss show similar improvements. Take care with new sites. Data are for prime time (7am - 7pm weekdays) as seen from SLAC. The increase in international response time was caused by the addition of IHEP, Novosibirsk and FZU (in CZ). If we remove these additions we get just under 1% improvement/month (i.e. pretty much like the others). This points out the need to examine results for biases.

17 Network Quiescence
Frequency of zero packet loss (for all time, not cut on prime time). E.g. a network busy 8 work hrs/day, 5 weekdays per week, and quiescent otherwise would have a quiescence of ~ 75% ~ (total hrs/wk - 5 wkdays/wk * 8 hrs/day) / total hrs/wk = (168 - 40) / 168. It is clear that connectivity between SLAC and ESnet sites is best. This is a bit similar to the phone companies' idea of error-free seconds, except that it is a frequency rather than a number.
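The quiescence arithmetic can be checked with a short Python sketch. It is an illustration only: the function name `quiescence_percent` and the synthetic week of hourly samples are assumptions, not the PingER implementation.

```python
def quiescence_percent(loss_samples):
    """Percentage of samples in which no packet loss was observed."""
    if not loss_samples:
        raise ValueError("need at least one sample")
    zero_loss = sum(1 for loss in loss_samples if loss == 0)
    return 100.0 * zero_loss / len(loss_samples)


# Synthetic week of hourly samples: lossy during 5 working days of
# 8 hours each, loss-free for the remaining hours of the week.
HOURS_PER_WEEK = 7 * 24          # 168
BUSY_HOURS = 5 * 8               # 40
week = [5.0] * BUSY_HOURS + [0.0] * (HOURS_PER_WEEK - BUSY_HOURS)

print(round(quiescence_percent(week), 1))  # 76.2, i.e. the slide's "~ 75%"
```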

18 Ping Loss Quality
Want a quick-to-grasp indicator of link quality. Loss is the most sensitive indicator: the loss of a packet requires a ~ 4 sec TCP retry timeout. Studies by IBM on the economic value of response time showed there is a threshold around 4-5 secs where complaints increase.
0-1% = Good
1%-2.5% = Acceptable
2.5%-5% = Poor
5%-12% = Very Poor
> 12% = Bad
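The quality bands can be expressed as a simple classifier. This is a hedged Python sketch. It assumes the Acceptable band runs from 1% to 2.5% (the lower bound is inferred from the neighbouring bands) and that a value exactly on a boundary belongs to the better band, consistent with "> 12% = Bad". The name `loss_quality` is hypothetical.

```python
from bisect import bisect_left

# Upper limits (percent packet loss) of the first four quality bands.
_LIMITS = [1.0, 2.5, 5.0, 12.0]
_LABELS = ["Good", "Acceptable", "Poor", "Very Poor", "Bad"]


def loss_quality(loss_percent):
    """Map a packet loss percentage onto the slide's quality labels.

    Assumption: a loss exactly equal to a limit (e.g. 12%) is placed
    in the better of the two adjacent bands.
    """
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss must be between 0 and 100 percent")
    return _LABELS[bisect_left(_LIMITS, loss_percent)]


for loss in (0.3, 2.0, 4.0, 8.0, 20.0):
    print(f"{loss:5.1f}% -> {loss_quality(loss)}")
```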

19 Quality Distributions
ESnet median is good quality. All other groups are poor or very poor. Critical to have good peering. The poor performance of non-ESnet sites (seen from SLAC) is due to poor performance as traffic traverses the interchanges between ESnet & the rest of the Internet.

20 Multi Collection Site Visualization
[Figure labels: Collection Sites; Remote Sites; Median ping loss on link] Remote sites ordered by number of collection sites that ... Can select: grouping by site, by TLD, by continent; which metric to display (loss, response, quiescence, unpredictability, unreachability); which month's data to view. Placing the mouse over the ? in each box provides the number of links included in the data.

21 Intercontinental Grouping (Loss)
Move mouse over ? to see # links. Looks pretty bad for intercontinental use.

22 Top Level Domain Grouping (Loss)
Mouseover on the red dots gives more information on the TLD (e.g. ch=Switzerland). Diagonals are within a TLD.

23 TLD (Response Time)

24 Grouping Details
Select metric. Select group. Sort. Color for quality. Also provides Excel for DIY at bottom.

25 Recent Transoceanic trends

26 By Monitoring Site

27 CERN Monitoring TLDs

28 ESnet bytes accepted by site for Jan ‘98
[Chart labels: Exchanges; LBL/ESnet] After eliminating the exchanges, LBL (5 Mbps averaged over the month) and ESnet, the top 10 are: LANL, LLNL, BNL, FNAL, DOE1, ORNL, CEBAF, SLAC, PNL, GA.

29 US HENP Traffic Growth
Exponential growth of 3-6% per month. Measuring traffic needs SNMP access to the router, i.e. administrative rights, so it is either done for a site's external router links or is done by the ISP (in this case ESnet). In some controlled cases, e.g. the CERN transatlantic link, one can look at the traffic carried and note the relation between it and the bandwidth available at the bottleneck when congestion appears. LBL 6% growth, BNL 5.4% growth, SLAC 4.8% growth, ANL 4.1% growth, FNAL 3.2% growth, CEBAF 3.1% growth. ANL traffic is probably going directly into the ATM cloud and so is not being measured at the router. Note the CEBAF growth as it turns on.
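Because the growth is compound, a monthly rate r translates into a doubling time of ln 2 / ln(1 + r) months. The Python sketch below illustrates that conversion for a few of the quoted rates. It assumes the monthly rate stays constant, which the slide does not claim, and the function names are hypothetical.

```python
import math


def doubling_time_months(monthly_growth_percent):
    """Months needed for traffic to double at a constant monthly growth rate."""
    if monthly_growth_percent <= 0:
        raise ValueError("growth rate must be positive")
    rate = monthly_growth_percent / 100.0
    return math.log(2) / math.log1p(rate)


def annual_growth_factor(monthly_growth_percent):
    """Factor by which traffic grows over 12 months of compound growth."""
    return (1 + monthly_growth_percent / 100.0) ** 12


# Monthly growth figures quoted on the slide for three of the labs.
for site, pct in [("LBL", 6.0), ("SLAC", 4.8), ("CEBAF", 3.1)]:
    print(f"{site}: doubles in {doubling_time_months(pct):.1f} months, "
          f"x{annual_growth_factor(pct):.2f} per year")
```

At 6% per month traffic roughly doubles every year; at 3% per month it takes closer to two years.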

30 Multi Router Traffic Grapher (MRTG)
[MRTG plot annotations: CERN-US E1 (2 Mbps) link; added 2nd 2 Mbps link] Can see weekend vs weekday utilization differences. This link is heavily used; the other E1 link is more lightly used. They will balance better when the two US ends are colocated. The monthly peak/average ratio for CERN (cgate1) is about 3 to 4; for SLAC it is about 16. The peak/average ratio may be useful for indicating link congestion. It is useful to compare peaks with the capacity available. Can compare the monthly average with the ESnet monthly octets accepted from sites. Need a summary of how often the link is close to full utilization.

31 Traffic Volume for Germany (DFN)
[Upper graph: DFN T1 utilization for 15 Jan ‘98 (5 min averages); green = to US, blue = from US. Lower graph: # of 2 min periods in Dec-96 with peak utilization > y %, shown for traffic from the US and to the US] The upper graph, from MRTG, shows the line is reasonably loaded. The lower graph gives an idea of how often the line is at 100% utilization etc.: it plots the number of 2 minute intervals in a month when the link was observed to be busier than x%. The area under the curve is an indicator of how busy/saturated the link is. Note the traffic imbalance, typically 2-3x as much going out as coming in; also, the loads are typically not during U.S. prime hours (they are sucking us dry).
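The lower graph is a count of intervals whose utilization exceeds each threshold. A minimal Python sketch of that tally follows. The sample values are invented for illustration and `exceedance_counts` is a hypothetical name, not part of MRTG.

```python
def exceedance_counts(utilization_percent, thresholds):
    """For each threshold, count the samples whose utilization exceeds it."""
    return {
        threshold: sum(1 for u in utilization_percent if u > threshold)
        for threshold in thresholds
    }


# Invented per-interval utilization samples (percent of link capacity).
samples = [10, 35, 60, 85, 100, 100, 45, 95]
counts = exceedance_counts(samples, [0, 50, 90])
print(counts)  # {0: 8, 50: 5, 90: 3}
```

A curve that stays high as the threshold approaches 100% indicates a link that spends much of its time saturated.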

32 Capacity/Load Ratios
Looking at the ratio of link capacity to average load: most ESnet links show ratios of a few to several tens. The international links (CERN-Perryman (~4), DFN (~5), Italy (~4), KEK (~10), Canada (15)) show ratios of 4-15. The worst link appears to be the MAE-W-ESnet link, with a ratio of about 1.5; however, this may not be the bottleneck link. For shared networks without some form of quality of service guarantees, one way to ensure excellent performance is to over-provision the links so they have sufficient capacity to handle instantaneous loads.

33 Bottleneck Identification
Traceroute from/to multiple sites can identify common path segments in the maps. Can see the onset of losses with TracePing. Pathchar can identify bottlenecks. Then need to work on: avoiding bottlenecks (new peering); getting bottleneck owners to improve. This is difficult: there are lots of potential bottlenecks, bottlenecks move, and they are not under our control.

34 TracePing (Oxford)
Runs traceroute to remote sites each hour, then pings along the route. Archives data. Can see route changes, short & long term, and the onset of problems in time & space. Written by John MacAllister of Oxford U. Being converted from VMS/DCL to Windows/Perl. Multiple routes seen.

35 Traceroute
Reverse traceroute servers. TracePing. Topology map (from TRIUMF): ellipses show nodes on the route; an open ellipse is the measurement node; a blue ellipse is a node that is not reachable. Keeping history.

36 GUI Traceroute (e.g. VisualRoute)

37 Summary: Deployment/Development
ESnet/HENP has 14 collection sites in 8 countries collecting data on > 500 links involving 22 countries. XIWT/IPWT deployed ~ 10 collection sites using the PingER tools. 600MB/month/link, 6 bps/link, .25 analysis site, FTE on analysis. HEPNRC gathering, archiving. Long term reports are being ported to HEPNRC from SLAC. Long term analysis today usually requires a tool like SAS.
XIWT/IPWT want to: measure the performance of members' own networks; get tests to validate and understand what to recommend to other commercial customers and for what purposes; build a community within XIWT so it can evolve to address harder issues. They have chosen the PingER tools for deployment. Collection sites (Mar-98): West Group, Bell South (2), Digital (2), HP, Intel, Hughes, NIST, SBC. They are looking for an analysis/archive site.
SAS/Oracle can cost several tens of thousands of dollars. Need indexing for rapid lookup. Usually a bit of overkill as an analysis tool (we don't use much in the way of sophisticated statistics).

38 Summary: Internet Performance
Performance within ESnet is good. Performance between ESnet & other sites is poor to very poor on average; one of the main causes is congestion points, so peering is critical. Intercontinental performance is very poor to bad. ESnet traffic accepted from major HENP labs is growing by 3-6% per month. Response time is improving by 1-2% / month. Packet loss between SLAC & other sites is improving by 3% / month.

39 Summary: Internet Performance (continued)
Links to sites outside N. America vary from good (KEK) to bad. Some of the bad sites are to be expected, e.g. FSU, China, Czech Republic; some are surprises, such as the UK. CERN, France and Germany are acceptable to poor. Provide monthly summary tables with lots of statistical measures, to allow faster generation of long term reports and more robust metrics. Extend grouping, e.g. by AS, country, time zones crossed, more geographic regions, user selectable, by experiment, by community, by collection site. Summaries (c.f. Weather Map, top 10s, weekly, Consumer Reports).

40 Summary: Next Steps
Improve tools: make long term reports at the analysis site available & understandable. Look into prediction (extrapolations, develop models, configure and validate with data). Pursue IETF Surveyor & NIMI deployment. Need consistent measurements of link loading, e.g. MRTG at end sites' external routers, at ATM switches, and at internal (e.g. ESnet) routers. Need to know the capacity of the link being monitored. Is anyone interested in passive measurements, i.e. measuring the performance of real traffic? These need to be correlated with other measurements. Need throughput measures and correlations to simpler measures such as ping response or packet loss, e.g. via NIMI.

41 National Internet Measurement Infrastructure (NIMI)
Secure, scalable infrastructure for scheduling monitoring and gathering data. Minimal amount of human intervention. Inexpensive probe built on a PC FreeBSD platform. Dynamic: can add/modify measurement suites; initially includes:
Traceroute
TReno: measures bulk transfer throughput
Poip: one way ping
Based on Vern Paxson's NPD work, which ran at 30 sites in 1994/1995. Security uses public key pairs for authentication and encryption. By design: decentralized control, simple configuration and maintenance. FNAL claim it took a couple of hours to install the software; after that, folks from PSC administer it by remote control. Hardware is cheap ($2-3K) for 200MHz, 64MB, 4GB, Enet, modem + optional GPS. A standard dedicated platform reduces concerns about biases caused by server loading. PSC, LBNL and FNAL are in place, SLAC's is being configured, and we are working with CERN (CH), RAL (UK), KEK (JP). Want to get NIMIs placed at strategic network points to get a better idea of overall network performance.

42 Asymmetric One-way Delays
[Plots: loss (0-20%) and delay (0-300 ms) for Advanced to U Chicago and for U Chicago to Advanced] N.B. one way response time is very important for voice: it needs to be better than 100 msec or people start stepping into one another's conversations. PC hardware with GPS is located at ANS & 23 CSG partner sites. Measures one way loss & response time using clock synchronization; the metrics are defined by IETF/IPPM. 8 sites are now operational, monitoring 56 paths (N*(N-1)). Results show there can be big asymmetries (asymmetric loading & routing). Willing to deploy (at their cost) at 5 DOE sites.
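The path count follows from counting ordered pairs of sites, since each direction of a one-way measurement is a separate path. A small Python sketch, using placeholder site names rather than the real sites:

```python
from itertools import permutations


def one_way_paths(sites):
    """Return every ordered (source, destination) pair of distinct sites."""
    return list(permutations(sites, 2))


# Placeholder names standing in for the 8 operational sites.
sites = [f"site{i}" for i in range(1, 9)]
paths = one_way_paths(sites)
print(len(paths))  # 8 * 7 = 56
```

Both (site1, site2) and (site2, site1) appear in the result, which is what allows asymmetric loss and delay to be observed.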

43 NIMI
Deployed at PSC, LBL, FNAL; platforms are being configured at SLAC & CERN. As NIMI becomes more real, will start to use it as infrastructure for IPPM Surveyors. Security allows full policy control over any box you own, or delegation of all or subsets; it uses ACLs with authentication for requests, and encryption to prevent sniffing. Host identification is accomplished through the use of public key/private key technology. Authentication and encryption use the RSA reference library. Looking at additional security options to better support its use outside the U.S.: can provide 2 distributions, one with full security and one with none; looking at possible support for 40 bit keys (crackable in 2 hours on a PC, but the session is probably over by then, and a new key is then used), or in early deployment simply turning off encryption.

44 Summary: Collaborations
Lots of collaboration:
SLAC & HEPNRC
14 collection sites, ~ 400 remote sites
Collection site tools
CERN & CNAF/ICFA
Oxford/TracePing
MapPing/MAPNet/NLANR
TRIUMF traceroute topology map
NIMI/LBNL & Surveyor/IETF
XIWT/IPWT
Talks at IETF, XIWT, ICFA, ESCC ...

45 More Information
ICFA Monitoring WG home page (links to status report, meeting notes, how to access data, and code)
WAN Monitoring at SLAC (has lots of links)
Tutorial on WAN Monitoring
MapPing Tool
NIMI

