1 Internet Monitoring & Tools Les Cottrell – SLAC Presented at the HEP Networking, Grids and Digital Divide meeting Daegu, Korea May 23-27, 2005 Partially.

Slides:



Advertisements
Similar presentations
Martin Suchara, Ryan Witt, Bartek Wydrowski California Institute of Technology Pasadena, U.S.A. TCP MaxNet Implementation and Experiments on the WAN in.
Advertisements

CPSC Network Layer4-1 IP addresses: how to get one? Q: How does a host get IP address? r hard-coded by system admin in a file m Windows: control-panel->network->configuration-
QoS Solutions Confidential 2010 NetQuality Analyzer and QPerf.
University of Helwan / Egypt, Sept 18 – Oct 3, 2010
Internet Control Message Protocol (ICMP)
1 Traceanal: a tool for analyzing and representing traceroutes Les Cottrell, Connie Logg, Ruchi Gupta, Jiri Navratil SLAC, for the E2Epi BOF, Columbus.
Internet Traffic Patterns Learning outcomes –Be aware of how information is transmitted on the Internet –Understand the concept of Internet traffic –Identify.
1 Correlating Internet Performance & Route Changes to Assist in Trouble- shooting from an End-user Perspective Les Cottrell, Connie Logg, Jiri Navratil.
1 SLAC Internet Measurement Data Les Cottrell, Jerrod Williams, Connie Logg, Paola Grosso SLAC, for the ISMA Workshop, SDSC June,
1 Evaluation of Techniques to Detect Significant Performance Problems using End-to-end Active Network Measurements Les Cottrell, SLAC 2006 IEEE/IFIP Network.
Network Traffic Measurement and Modeling CSCI 780, Fall 2005.
Introduction to Management Information Systems Chapter 5 Data Communications and Internet Technology HTM 304 Fall 07.
Copyright © 2005 Department of Computer Science CPSC 641 Winter Network Traffic Measurement A focus of networking research for 20+ years Collect.
Internet Bandwidth Measurement Techniques Muhammad Ali Dec 17 th 2005.
Network Measurement Bandwidth Analysis. Why measure bandwidth? Network congestion has increased tremendously. Network congestion has increased tremendously.
Bandwidth Estimation: Metrics Mesurement Techniques and Tools By Ravi Prasad, Constantinos Dovrolis, Margaret Murray and Kc Claffy IEEE Network, Nov/Dec.
Ch. 28 Q and A IS 333 Spring Q1 Q: What is network latency? 1.Changes in delay and duration of the changes 2.time required to transfer data across.
Internet Traffic Management Prafull Suryawanshi Roll No - 04IT6008.
Sven Ubik, CESNET TNC2004, Rhodos, 9 June 2004 Performance monitoring of high-speed networks from NREN perspective.
Reading Report 14 Yin Chen 14 Apr 2004 Reference: Internet Service Performance: Data Analysis and Visualization, Cross-Industry Working Team, July, 2000.
Network Layer4-1 NAT: Network Address Translation local network (e.g., home network) /24 rest of.
KEK Network Qi Fazhi KEK SW L2/L3 Switch for outside connections Central L2/L3 Switch A Netscreen Firewall Super Sinet Router 10GbE 2 x GbE IDS.
Hands-on Networking Fundamentals
PingER: Research Opportunities and Trends R. Les Cottrell, SLAC University of Malaya.
Lecture 2 TCP/IP Protocol Suite Reference: TCP/IP Protocol Suite, 4 th Edition (chapter 2) 1.
Chapter 4. After completion of this chapter, you should be able to: Explain “what is the Internet? And how we connect to the Internet using an ISP. Explain.
1 IP: putting it all together Part 2 G53ACC Chris Greenhalgh.
Internet Traffic Management. Basic Concept of Traffic Need of Traffic Management Measuring Traffic Traffic Control and Management Quality and Pricing.
© 2006 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 Identifying Application Impacts on Network Design Designing and Supporting Computer.
Our Last Class!!  summary  what does the future look like?
TCP/IP Essentials A Lab-Based Approach Shivendra Panwar, Shiwen Mao Jeong-dong Ryoo, and Yihan Li Chapter 5 UDP and Its Applications.
1 Internet Monitoring Les Cottrell – SLAC Presented at NUST Institute of Information Technology (NIIT) Rawalpindi, Pakistan, March 15, 2005 Partially funded.
© 2002, Cisco Systems, Inc. All rights reserved..
POSTECH DP&NM Lab. Internet Traffic Monitoring and Analysis: Methods and Applications (1) 4. Active Monitoring Techniques.
1 Next Few Classes Networking basics Protection & Security.
NetFlow: Digging Flows Out of the Traffic Evandro de Souza ESnet ESnet Site Coordinating Committee Meeting Columbus/OH – July/2004.
workshop eugene, oregon What is network management? System & Service monitoring  Reachability, availability Resource measurement/monitoring.
1 Using Netflow data for forecasting Les Cottrell SLAC and Fawad Nazir NIIT, Presented at the CHEP06 Meeting, Mumbai India, February
1 Network Measurements Les Cottrell – SLAC Lecture # 4 presented at the 26 th International Nathiagali Summer College on Physics and Contemporary Needs,
Linux Networking and Security
Integration of AMP & Tracenol By: Qasim Bilal Lone.
Chapter 8: Internet Operation. Network Classes Class A: Few networks, each with many hosts All addresses begin with binary 0 Class B: Medium networks,
1 Overview of IEPM-BW - Bandwidth Testing of Bulk Data Transfer Tools Connie Logg & Les Cottrell – SLAC/Stanford University Presented at the Internet 2.
IEPM-BW: Bandwidth Change Detection and Traceroute Analysis and Visualization Connie Logg, Joint Techs Workshop February 4-9, 2006.
11 Network Measurements Les Cottrell – SLAC Lecture # 3 presented at the Workshop on Scientific Information in the Digital Age: Access and Dissemination.
OS Services And Networking Support Juan Wang Qi Pan Department of Computer Science Southeastern University August 1999.
Securing the Network Infrastructure. Firewalls Typically used to filter packets Designed to prevent malicious packets from entering the network or its.
1 Network Measurement Summary ESCC, Feb Joe Metzger ESnet Engineering Group Lawrence Berkeley National Laboratory.
1 Internet End-to-end Monitoring Project - Overview Les Cottrell – SLAC/Stanford University Partially funded by DOE/MICS Field Work Proposal on Internet.
1 High Performance Network Monitoring Challenges for Grids Les Cottrell, SLAC Presented at the International Symposium on Grid Computing 2006, Taiwan
1 Passive and Active Monitoring on a High-performance Network Les Cottrell, Warren Matthews, Davide Salomoni, Connie Logg – SLAC
1 IEPM/PingER Project Les Cottrell, SLAC DoE 2004 PI Network Research Meeting, FNAL Sep ‘04
정하경 MMLAB Fundamentals of Internet Measurement: a Tutorial Nevil Brownlee, Chris Lossley, “Fundamentals of Internet Measurement: a Tutorial,” CMG journal.
1 12-Jan-16 OSI network layer CCNA Exploration Semester 1 Chapter 5.
Internet Connectivity and Performance for the HEP Community. Presented at HEPNT-HEPiX, October 6, 1999 by Warren Matthews Funded by DOE/MICS Internet End-to-end.
1 PingER performance to Bangladesh Prepared by Les Cottrell, SLAC for Prof. Hilda Cerdeira May 27, 2004 Partially funded by DOE/MICS Field Work Proposal.
Data Communication Network Models
1 Internet Traffic Measurement and Modeling Carey Williamson Department of Computer Science University of Calgary.
1 WAN Monitoring Prepared by Les Cottrell, SLAC, for the Joint Engineering Taskforce Roadmap Workshop JLab April 13-15,
TCP/IP1 Address Resolution Protocol Internet uses IP address to recognize a computer. But IP address needs to be translated to physical address (NIC).
1 IEPM / PingER project & PPDG Les Cottrell – SLAC Presented at the NGI workshop, Berkeley, 7/21/99 Partially funded by DOE/MICS Field Work Proposal on.
Connect communicate collaborate Performance Metrics & Basic Tools Robert Stoy, DFN EGI TF, Madrid September 2013.
Firewalls. Overview of Firewalls As the name implies, a firewall acts to provide secured access between two networks A firewall may be implemented as.
1 Performance Network Monitoring for the LHC Grid Les Cottrell, SLAC International ICFA Workshop on Grid Activities within Large Scale International Collaborations,
© 2006 Cisco Systems, Inc. All rights reserved.Cisco Public 1 OSI network layer CCNA Exploration Semester 1 – Chapter 5.
Using Netflow data for forecasting
Connie Logg, Joint Techs Workshop February 4-9, 2006
Wide Area Networking at SLAC, Feb ‘03
Experiences in Traceroute and Available Bandwidth Change Analysis
Experiences in Traceroute and Available Bandwidth Change Analysis
Presentation transcript:

1 Internet Monitoring & Tools Les Cottrell – SLAC Presented at the HEP Networking, Grids and Digital Divide meeting Daegu, Korea May 23-27, 2005 Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP

2 Overview Why is measurement difficult yet important? LAN vs WAN SNMP Effects of measurement interval Passive Active New challenges

3 Why is measurement difficult? Internet's evolution as a composition of independently developed and deployed protocols, technologies, and core applications Diversity, highly unpredictable, hard to find “invariants” Rapid evolution & change, no equilibrium so far –Findings may be out of date Measurement not high on vendors list of priorities –Resources/skill focus on more interesting an profitable issues –Tools lacking or inadequate –Implementations poor & not fully tested with new releases ISPs worried about providing access to core, making results public, & privacy issues The phone connection oriented model (Poisson distributions of session length etc.) does not work for Internet traffic (heavy tails, self similar behavior, multi-fractals etc.)

4 Add to that … Distributed systems are very hard –A distributed system is one in which I can't get my work done because a computer I've never heard of has failed. Butler Lampson Network is deliberately transparent The bottlenecks can be in any of the following components: –the applications –the OS –the disks, NICs, bus, memory, etc. on sender or receiver –the network switches and routers, and so on Problems may not be logical –Most problems are operator errors, configurations, bugs When building distributed systems, we often observe unexpectedly low performance the reasons for which are usually not obvious Just when you think you’ve cracked it, in steps security

5 Why is measurement important? End users & network managers need to be able to identify & track problems Choosing an ISP, setting a realistic service level agreement, and verifying it is being met Choosing routes when more than one is available Setting expectations: –Deciding which links need upgrading –Deciding where to place collaboration components such as a regional computing center, software development –How well will an application work (e.g. VoIP) Application steering (e.g. forecasting) –Grid middleware, e.g. replication manager

6 Passive vs. Active Monitoring Active injects traffic on demand Passive watches things as they happen –Network device records information Packets, bytes, errors … kept in MIBs retrieved by SNMP –Devices (e.g. probe) capture/watch packets as they pass Router, switch, sniffer, host in promiscuous (tcpdump) Complementary to one another: –Passive: does not inject extra traffic, measures real traffic Polling to gather data generates traffic, also gathers large amounts of data –Active: provides explicit control on the generation of packets for measurement scenarios testing what you want, when you need it. Injects extra artificial traffic Can do both, e.g. start active measurement and look at passively

7 Passive tools SNMP Hardware probes e.g. Sniffer, NetScout, can be stand-alone or remotely access from a central management station Software probes: snoop, tcpdump, require promiscous access to NIC card, i.e. root/sudo access Flow measurement: netramet, OCxMon/CoralReef, Netflow Sharing measurements runs into security/privacy issues

8 Example : Passive site border monitoring Use Cisco Netflow in Catalyst 6509 with MSFC, on SLAC border Gather about 200MBytes/day of flow data The raw data records include source and destination addresses and ports, the protocol, packet, octet and flow counts, and start and end times of the flows –Much less detailed than saving headers of all packets, but good compromise –Top talkers history and daily (from & to), tlds, vlans, protocol and application utilization Use for network & security

9 SLAC Traffic profile SLAC offsite links: OC3 to ESnet, 1Gbps to Stanford U & thence OC12 to I2 OC48 to NTON Profile bulk-data xfer dominates SSH FTP HTTP Mbps in Mbps out Last 6 months 2 Days bbftp iperf

10 Top talkers by protocol Hostname MBytes/day (log scale) Volume dominated by single Application - bbcp

11 Flow sizes Heavy tailed, in ~ out, UDP flows shorter than TCP, packet~bytes 75% TCP-in < 5kBytes, 75% TCP-out < 1.5kBytes (<10pkts) UDP 80% < 600Bytes (75% < 3 pkts), ~10 * more TCP than UDP Top UDP = AFS (>55%), Real(~25%), SNMP(~1.4%) SNMP Real A/V AFS file server

12 Flow lengths 60% of TCP flows less than 1 second Would expect TCP streams longer lived –But 60% of UDP flows over 10 seconds, maybe due to heavy use of AFS

13 Some Active Measurement Tools Ping connectivity, RTT & loss –flavors of ping, fping, Linux vs Solaris ping –but blocking & rate limiting Alternative synack, but can look like DoS attack Sting: measures one way loss Traceroute –Reverse traceroute servers –Traceroute archives Combining ping & traceroute, –traceping, pingroute Pathchar, pchar, pipechar, bprobe, abing etc. Iperf, netperf, ttcp, FTP …

14 Path characterization Pathchar/pchar –sends multiple packets of varying sizes to each router along route –plot min RTT vs packet size to get bandwidth –calculate differences to get individual hop characteristics –measures for each hop: BW, queuing, delay/hop –can take a long time –may be able to ID location of bottleneck Abing/pathload/pathchirp –Sends packets with known separation, measure separation at other end –Much faster –Finds bottleneck bw Bottleneck Min spacing At bottleneck Spacing preserved On higher speed links

15 Network throughput Iperf/thrulay –Client generates & sends UDP or TCP packets –Server receives receives packets –Can select port, maximum window size, port, duration, parallel streams, Mbytes to send etc. –Client/server communicate packets seen etc. –Reports on throughput Requires sever to be installed at remote site, i.e. friendly administrators or logon account and password Applications –GridFTP, bbcp, bbftp (single, multi-stream file to file)

16 Intrusiveness vs Accuracy Iperf pathload abing pathchirp abing pathchirp pathload Spruce Iperf CAIDA

17 Active Measurement Projects PingER (ping) AMP (ping) One way delay: –Surveyor (now defunct), RIPE (mainly Europe), owamp IEPM-BW (bandwidth, throughput …) NIMI (mainly a design infrastructure) NWS (mainly for forecasting) Skitter All projects measure routes For a detailed comparison see: – –

18 Some Challenges High performance links Dedicated circuits Visualizing topologies, e.g.traceroutes Reviewing thousands of graphs to spot anomalies –Automated anomalous event detection –Gathering more information & alerting Guiding middleware –Need long term forecasts, –Web services –E.g. scheduling wavelengths, or QoS services

19 Hi-perf Challenges Packet loss hard to measure by ping –For 10% accuracy on BER 1/10^8 ~ 1 day at 1/sec –Ping loss ≠ TCP loss Iperf/GridFTP throughput at 10Gbits/s –To measure stable (congestion avoidance) state for 90% of test takes ~ 60 secs ~ 75GBytes –Requires scheduling implies authentication etc. Using packet pair dispersion can use only few tens or hundreds of packets, however: –Timing granularity in host is hard (sub μsec) –NICs may buffer (e.g. coalesce interrupts. or TCP offload) so need info from NIC or before Security: blocked ports, firewalls, keys vs. one time passwords, varying policies … etc.

20 Anomalous Event Detection Relatively easy to spot steps in performance if the time series is normally pretty flat –Plateau algorithm, nicely intuitive – Kolmogorov Smirnov

21 Seasonal Variations Unfortunately some (10-29%) paths show large diurnal changes These can cause false positives

22 Holt Winters So use Holt-Winters triple exponential weighted moving averages –Short term smoothing –Long term linear trends –Seasonal smoothing Much better agreement, removes diurnal & week start false positives Also gives long term forecasts – can use for scheduling etc.

23 Visualizing traceroutes One compact page per day One row per host, one column per hour One character per traceroute to indicate pathology or change (usually period(.) = no change) Identify unique routes with a number –Be able to inspect the route associated with a route number –Provide for analysis of long term route evolutions Route # at start of day, gives idea of route stability Multiple route changes (due to GEANT), later restored to original route Period (.) means no change

24 Pathology Encodings Stutter Probe type End host not pingable ICMP checksum Change in only 4 th octet Hop does not respond No change Multihomed ! Annotation (!X) Change but same AS

25 Navigation traceroute to CCSVSN04.IN2P3.FR ( ), 30 hops max, 38 byte packets 1 rtr-gsr-test ( ) ms … 13 in2p3-lyon.cssi.renater.fr ( ) ms !X #rt# firstseen lastseen route , , , ,..., xxx.xxx , , ,137,..., xxx.xxx , , , ,..., xxx.xxx , , , ,..., xxx.xxx , , , ,...,(n/a), xxx.xxx , , , ,..., xxx.xxx , , , ,..., xxx.xxx , , ,(n/a),..., xxx.xxx , , , ,(n/a),..., xxx.xxx

26 History Channel

27 AS’ information

28 Changes in network topology (BGP) can result in dramatic changes in performance Snapshot of traceroute summary table Samples of traceroute trees generated from the table ABwE measurement one/minute for 24 hours Thurs Oct 9 9:00am to Fri Oct 10 9:01am Drop in performance (From original path: SLAC-CENIC-Caltech to SLAC-Esnet-LosNettos (100Mbps) -Caltech ) Back to original path Changes detected by IEPM-Iperf and AbWE Esnet-LosNettos segment in the path (100 Mbits/s) Hour Remote host Dynamic BW capacity (DBC) Cross-traffic (XT) Available BW = (DBC-XT) Mbits/s Notes: 1. Caltech misrouted via Los-Nettos 100Mbps commercial net 14:00-17:00 2. ESnet/GEANT working on routes from 2:00 to 14:00 3. A previous occurrence went un-noticed for 2 months 4. Next step is to auto detect and notify Los-Nettos (100Mbps)

29 Dedicated Optical Circuits Could be whole new playing field, today’s tools no longer applicable: –No jitter (so packet pair dispersion no use) –Instrumented TCP stacks a la Web100 may not be relevant –Layer 1 switches make traceroute less useful –Losses so low, ping not viable to measure –High speeds make some current techniques fail or more difficult (timing, amounts of data etc.)

30 Future work Apply E2E anomaly detection to multiple metrics (RTT, available bandwidth, achievable throughput), multi-routes Apply forecasting & anomaly detection to passive data –If patterns stay stable for a long time (weeks) Put together time series from multiple separate flows Interpolate and use with Holt-Winters Detect not just size of bottleneck but location –Then can apply QoS just to poor link rather than whole path

31 More Information Tutorial on monitoring – RFC 2151 on Internet tools – Network monitoring tools – Ping – IEPM/PingER home site –www-iepm.slac.stanford.edu/www-iepm.slac.stanford.edu/ IEEE Communications, May 2000, Vol 38, No 5, pp

32 Simplified SLAC DMZ Network, 2001 swh-root Swh-dmz rtr-msfc-dmz slac-rt1.es.net ESnet Stanford Internet2 OC12 link 622Mbps 155Mbps OC3 link(*) SLAC Internal Network (*) Upgrade to OC12 has been requested (#) This link will be replaced with a OC48 POS card for the 6500 when available Etherchannel 4 gbps 10Mbps Ethernet 1Gbps Ethernet NTON 2.4Gbps OC48 link (#) 100Mbps Ethernet Dial up &ISDN

33 Flow lengths Distribution of netflow lengths for SLAC border –Log-log plots, linear trendline = power law –Netflow ties off flows after 30 minutes –TCP, UDP & ICMP “flows” are ~log-log linear for longer (hundreds to 1500 seconds) flows (heavy-tails) –There are some peaks in TCP distributions, timeouts? Web server CGI script timeouts (300s), TCP connection establishment (default 75s), TIME_WAIT (default 240s), tcp_fin_wait (default 675s) TCPUDP ICMP

34 Traceroute technical details Rough traceroute algorithm ttl=1; #To 1 st router port=33434; #Starting UDP port while we haven’t got UDP port unreachable { send UDP packet to host:port with ttl get response if time exceeded note roundtrip time else if UDP port unreachable quit print output ttl++; port++ } Can appear as a port scan –SLAC gets about one complaint every 2 weeks.

35 Time series UDP TCP Outgoing Incoming Cat q vs. ISL

36 Power law fit parameters by time Just 2 parameters provide a reasonable description of the flow size distributions

37 Averaging/Sampling intervals Typical measurements of utilization are made for 5 minute intervals or longer in order not to create much impact. Interactive human interactions require second or sub-second response So it is interesting to see the difference between measurement made with different time frames.

38 Utilization with different averaging times Same data, measured Mbits/s every 5 secs Average over different time intervals Does not get a lot smoother May indicate multi-fractal behavior 5 secs 5 mins 1 hour

39 Lot of heavy FTP activity The difference depends on traffic type Only 20% difference in max & average

40 Not your normal Internet site Ames IXP: approximately 60-65% was HTTP, about 13% was NNTP Uwisc: 34% HTTP, 24% FTP, 13% Napster

41 PingER cont. Monitor timestamps and sends ping to remote site at regular intervals (typically about every 30 minutes) Remote site echoes the ping back Monitor notes current and send time and gets RTT Discussing installing monitor site in Pakistan –provide real experience of using techniques –get real measurements to set expectations, identify problem areas, make recommendations –provide access to data for developing new analysis techniques, for statisticians etc.

42 PingER Measurements from –38 monitors in 14 countries –Over 600 remote hosts –Over 120 countries –Over 3300 monitor-remote site pairs –Measurements go back to Jan-95 –Reports on RTT, loss, reachability, jitter, reorders, duplicates … Uses ubiquitous “ping” facility of TCP/IP Countries monitored –Contain over 80% of world population –99% of online users of Internet

43 Surveyor & RIPE, NIMI Surveyor & RIPE use dedicated PCs with GPS clocks for synchronization –Measure 1 way delays and losses –Surveyor mainly for Internet 2 –RIPE mainly for European ISPs NIMI (National Internet Measurement Infrastructure) more of an infrastructure for measurements and some tools (I.e. currently does not have public available data,regularly updated) –Mainly full mesh measurements on demand

44 Skitter Makes ping & route measurements to tens of thousands of sites around the world. Site selection varies based on web site hits. –Provide loss & RTTs –Skitter & PingER are main 2 sites to monitor developing world.

45 “Where is” a host – cont. Find the Autonomous System (AS) administering –Use reverse traceroute server with AS identification, e.g.: … 14 lhr.comsats.net.pk ( ) [AS COMSATS] 711 ms (ttl=242) –Get contacts for ISPs (if know ISP or AS): Gives ISP name, web page, phone number, , hours etc. –Review list of AS's ordered by Upstream AS Adjacency Tells what AS is upstream of an ISP –Look at real-time information about the global routing system from the perspectives of several different locations around the Internet Use route views at Triangulate RTT measurements to unknown host from multiple places

46 Who do you tell Local network support people Internet Service Provider (ISP) usually done by local networker –Use puck.nether.net/netops/nocs.cgi to find ISPpuck.nether.net/netops/nocs.cgi –Use to find upstream ISPswww.telstra.net/ops/bgp/bgp-as-upsstm.txt Give them the ping and traceroute results

47 Achieving throughput User can’t achieve throughput available (Wizard gap) Big step just to know what is achievable