WAN Monitoring Issues
Prepared by Les Cottrell, SLAC, for the NASA/LSN Workshop on Optical Network Testbeds, NASA Ames, August 9-11, 2004
www.slac.stanford.edu/grp/scs/net/talk03/jet-aug04.ppt
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP

The Problem
Distributed systems are very hard. "A distributed system is one in which I can't get my work done because a computer I've never heard of has failed." (Butler Lampson)
The network is deliberately transparent, so it is hard to find out information about how it is working.
The bottleneck can be in any of the following components: the applications; the OS; the disks, NICs, bus, memory, etc. on the sender or receiver; the network switches and routers; and so on.
Problems may not be logical: most problems are operator errors, misconfigurations, or bugs.
When building distributed systems, we often observe unexpectedly low performance, the reasons for which are usually not obvious. Just when you think you've cracked it, in steps security.
The GGF Grid High Performance Networking group is trying to bring together networkers, application writers and users by creating documents on "Top ten things network engineers wish grid programmers knew" and vice versa: http://www.csm.ornl.gov/ghpn/
Understanding is hard: the Internet is an immense, moving target; traditional mathematical tools (e.g. Poisson distributions) don't work; we are looking for invariants and need parsimonious models. See Vern Paxson's work, e.g. http://www.icir.org/vern/talks/vp-painfully-hard.UCB-mig.99.ps.gz
The top three networking problems, according to a paper by Claudia DeLuna of JPL, are Ethernet duplex mismatch, host configuration and bad media. A failure-cause breakdown for three Internet sites indicated that 51% of failures were caused by operator error ("Self-Repairing Computers", Scientific American, June 2003).
Reviewing the user-reported, long-lasting (typically days, i.e. excluding router reboots or reconfiguration time-outs) WAN problems that SLAC saw over the last two years, the biggest contributors (30%) were a combination of misconfigured routers (loose unicast RPF filters, wrong buffer sizes, a poorly chosen backup route), misconfigured switches (needed a reboot, a PVC incorrectly rate-limited), and firewalls (limiting throughput, resetting the window-scaling option). Note these are mainly engineering problems or bugs, as opposed to problems we need to research in order to know how to fix each one individually. However, we do need to investigate how to accurately and automatically identify and report the location and cause of such problems for the end-user.

E2E Monitoring Goals
Solving the E2E performance problem is the critical problem for the user.
Improve e2e throughput for data-intensive apps in high-speed WANs.
Provide the ability to do performance analysis & fault detection in a Grid computing environment.
Provide accurate, detailed, & adaptive monitoring of all distributed components, including the network.

Anatomy of a Problem
A user complains: "Hey, this is not working right!" Yet each party along the path sees nothing wrong:
Applications developer: "Others are getting in OK; not our problem."
System administrator: "The computer is working OK. Everything is AOK."
LAN administrator: "Talk to the other guys."
Campus networking: "Looks fine. No other complaints."
Gigapop: "All the lights are green."
Backbone: "We don't see anything wrong. The network is lightly loaded."
How do you solve a problem along a path? (From an Internet2 E2E presentation by Russ Hobby.)
It may not be realistic to try to solve this as shown. We need to divide and conquer, e.g. identify the AS responsible for the segment where the problem lies and report to them with relevant supporting information. To do this we need tools to partition the problem, such as the sketch below.
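As an illustration (ours, not from the talk), a few lines of Python can partition a path by origin AS: run the system traceroute and map each hop to an AS via a public IP-to-ASN service such as Team Cymru's whois server. The target host and the parsing details are assumptions for the sketch.

```python
import re
import socket
import subprocess

def traceroute_ips(host):
    """Return the hop IPs reported by the system traceroute (-n = numeric)."""
    out = subprocess.run(["traceroute", "-n", host],
                         capture_output=True, text=True).stdout
    return re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)", out, re.M)

def ip_to_asn(ip):
    """Ask whois.cymru.com (port 43) for the origin AS of an IP."""
    with socket.create_connection(("whois.cymru.com", 43), timeout=10) as s:
        s.sendall((ip + "\r\n").encode())
        reply = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            reply += chunk
    # Reply is "AS | IP | AS Name" plus a header line; take the last line.
    lines = [l for l in reply.decode().splitlines() if l.strip()]
    return lines[-1].split("|")[0].strip()

if __name__ == "__main__":
    for hop in traceroute_ips("www.slac.stanford.edu"):
        print(hop, "-> AS" + ip_to_asn(hop))
```

Grouping consecutive hops by AS then tells you which operator to contact about the segment where loss or delay first appears.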

Needs
Measurement tools to quickly, accurately and automatically identify problems.
Automatically take action to investigate and gather information, e.g. on-demand measurements (see the sketch below).
Tools need to scale to 10Gbps and beyond.
Standard ways to discover, request and report the results of measurements: the GGF/NMWG schemas.
Share information with people and apps across a federation of measurement infrastructures.
Web services are in their early days: they have a steep learning curve and the schemas are not mature.
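A minimal sketch (ours; it assumes Unix ping and traceroute and an illustrative host and threshold) of the "automatically take action" idea: poll packet loss and, when it crosses a threshold, gather an on-demand traceroute to attach to the trouble report.

```python
import re
import subprocess
import time

HOST, LOSS_THRESHOLD = "www.slac.stanford.edu", 5.0   # illustrative values

def ping_loss(host, count=10):
    """Return the percent packet loss reported by ping."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    m = re.search(r"([\d.]+)% packet loss", out)
    return float(m.group(1)) if m else 100.0

while True:
    loss = ping_loss(HOST)
    if loss > LOSS_THRESHOLD:
        # Problem detected: gather supporting evidence on demand.
        trace = subprocess.run(["traceroute", HOST],
                               capture_output=True, text=True).stdout
        print("loss %.0f%% to %s; route was:\n%s" % (loss, HOST, trace))
    time.sleep(300)   # poll every 5 minutes
```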

Achieving throughput
Users can't achieve the throughput available (the "wizard gap"); it is a big step just to know what is achievable (see the worked example below).
Most users are unaware of the bottleneck bandwidth on the path.
Spreadsheet: \cottrell\iepm\wizard.xls
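A worked example (ours, with round illustrative numbers) of one common cause of the wizard gap: TCP throughput is bounded by window/RTT, so an untuned default socket buffer caps a long path far below its bottleneck bandwidth.

```python
WINDOW = 64 * 1024   # bytes: a common default socket buffer
RTT = 0.200          # seconds: e.g. a US-Europe round trip

max_bps = WINDOW * 8 / RTT
print("max TCP throughput: %.1f Mbits/s" % (max_bps / 1e6))  # ~2.6 Mbits/s

# To fill a 1 Gbit/s path at 200 ms the window must equal the
# bandwidth-delay product:
bdp = 1e9 / 8 * RTT  # bytes
print("window for 1 Gbit/s: %.0f MBytes" % (bdp / 1e6))      # 25 MBytes
```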

User throughput
C. Asia, Russia, S.E. Europe, L. America, M. East, China: 4-5 yrs behind. India, Africa: 7 yrs behind.
S.E. Europe, Russia: catching up. Latin America, Mid East, China: keeping up. India, Africa: falling behind.
Important for policy makers.
Spreadsheet: \cottrell\iepm\esnet-to-all-longterm.xls
The CERN data only goes back to Aug-01; it confirms S.E. Europe & Russia are catching up, and India & Africa are falling behind.
PingER is arguably the most extensive set of measurements of the end-to-end performance of the Internet, going back almost ten years. Measurements are available from over 30 sites in 13 countries to sites in over 100 countries. We use the PingER results to: demonstrate how Internet performance to the regions of the world has evolved over the last 9 years; identify regions that have poor connectivity, how far they are behind the developed world, and whether they are catching up or falling further behind; and illustrate the correlation between the UN Technology Achievement Index and Internet performance.
Ghana, Nigeria and Uganda are all on satellite links with 800-1100ms RTTs. The losses to Ghana & Nigeria are 8-12%, while to Uganda they are 1-3%. The routes are different: from SLAC to Ghana via ESnet-Worldcom-UUNET; to Nigeria via CalREN-Qwest-Telianet-New Skies satellite; to Uganda via ESnet-Level3-Intelsat. For both Ghana and Nigeria there are no losses (for 100 pings) until the last hop, where over 40 of 100 packets were lost. For Uganda the losses (3 in 100 packets) also occur at the last hop.
Worksheets: for trends: \\Zwinsan2\c\cottrell\iepm\esnet-to-all-longterm.xls; for Africa: \\Zwinsan2\c\cottrell\iepm\africa.xls

Hi-perf Challenges
Packet loss is hard to measure by ping: for 10% accuracy on a BER of 1/10^8 it takes ~1 day at 1 ping/sec, and ping loss ≠ TCP loss.
Iperf/GridFTP throughput at 10Gbits/s: to measure the stable (congestion avoidance) state for 90% of a test takes ~60 secs, i.e. ~75GBytes. This requires scheduling, which implies authentication etc.
Packet pair dispersion needs only a few tens or hundreds of packets, however: timing granularity in the host is hard (sub-μsec), and NICs may buffer (e.g. coalesce interrupts, or TCP offload), so the information is needed from the NIC or before it.
Security: blocked ports, firewalls, keys vs. one-time passwords, varying policies, Kerberos vs. ssh, etc.
Slow start on a 200ms RTT takes about 8 secs at 10Gbps; at 1Gbps it takes ~6 secs.
A BER of 1/10^8 is not that high; for example the "SURA Optical Network Cookbook" (see http://www1.sura.org/3000/opcook.pdf) suggests that a BER of 1/10^9 is typical.
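Some worked arithmetic (ours; one plausible reading of the talk's round figures) behind two of the numbers above.

```python
# Ping at 1 packet/s against a path with BER ~ 1e-8:
BER = 1e-8
pkt_bits = 1500 * 8        # a full-size ping packet
p_loss = BER * pkt_bits    # ~1.2e-4 chance a given ping is lost
print("%.1f hours per loss event" % (1 / p_loss / 3600))   # ~2.3 hours
# so accumulating the handful of loss events needed for even a rough
# estimate takes on the order of a day.

# Iperf at 10 Gbit/s: 60 s of congestion-avoidance transfer moves
print("%.0f GBytes" % (10e9 * 60 / 8 / 1e9))               # 75 GBytes
```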

Passive measurements
Security & privacy concerns: SNMP access to routers; sniffers see all traffic. Anonymization to address the privacy concerns can remove much of the usefulness of the data.
Keeping up with capture and analysis is hard: often only headers are kept, or the traffic is sampled, and sampling can introduce biases.
Vast amounts of data; needs excellent data-mining tools.
Gives utilization, retries, etc. (see the SNMP sketch below).
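A minimal sketch (ours; it assumes the pysnmp library, a router permitting read-only SNMP, and illustrative host/interface values) of the SNMP side: poll the IF-MIB octet counter twice and derive link utilization from the delta.

```python
import time
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

ROUTER, IFINDEX, IFSPEED_BPS = "rtr1.example.edu", 1, 1e9  # illustrative

def if_in_octets(host, ifindex):
    """One SNMP GET of IF-MIB::ifInOctets for the given interface."""
    err, status, _, varbinds = next(getCmd(
        SnmpEngine(), CommunityData("public"),
        UdpTransportTarget((host, 161)), ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifInOctets", ifindex))))
    if err or status:
        raise RuntimeError(str(err or status))
    return int(varbinds[0][1])

t0, c0 = time.time(), if_in_octets(ROUTER, IFINDEX)
time.sleep(60)
t1, c1 = time.time(), if_in_octets(ROUTER, IFINDEX)
# 32-bit counter wrap is ignored here; a real poller must handle it.
util = (c1 - c0) * 8 / (t1 - t0) / IFSPEED_BPS
print("inbound utilization: %.1f%%" % (100 * util))
```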

Optical
Could be a whole new playing field, with today's tools no longer applicable:
No jitter (so packet pair dispersion is no use).
Instrumented TCP stacks a la Web100 may not be relevant.
Layer 1 & 2 switches make traceroute less useful.
Losses so low that ping is not viable as a measure.
High speeds make some current techniques fail or become more difficult (timing, amounts of data, etc.).
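Worked numbers (ours) for why packet-pair and other timing-based techniques get hard at these speeds: the dispersion to be measured shrinks below what host clocks can resolve.

```python
pkt_bits = 1500 * 8   # a full-size probe packet
for bw in (100e6, 1e9, 10e9):
    print("%6.0f Mbit/s bottleneck -> %6.2f us between packet pairs"
          % (bw / 1e6, pkt_bits / bw * 1e6))
# 100 Mbit/s -> 120 us; 1 Gbit/s -> 12 us; 10 Gbit/s -> 1.2 us:
# at 10 Gbit/s the gap to resolve is ~1 us, hence the sub-usec host
# timing problem noted under "Hi-perf Challenges".
```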