Network Path and Application Diagnostics
Matt Mathis, John Heffner, Ragu Reddy
4/24/06, PathDiag20060424.ppt

Outline

What is the real problem?
– Lessons from Web100
– A new perspective
Path and lower layer diagnosis
– The pathdiag tool
– A diagnostic server
Application and upper layer diagnosis
– LAN bench testing
Future plans

TCP tuning requires expert knowledge

By design, TCP/IP hides the ’net from upper layers
– TCP/IP provides basic reliable data delivery
– The “hourglass” between applications and networks
This is a good thing, because it allows:
– Invisible recovery from data loss, etc.
– Old applications to use new networks
– New applications to use old networks
But then (nearly) all problems have the same symptom
– Less than expected performance
– The details are hidden from nearly everyone

TCP tuning is painful debugging

All problems reduce performance
– But the specific symptoms are hidden
Any one problem can prevent good performance
– Completely masking all other problems
Trying to fix the weakest link of an invisible chain
– The general tendency is to guess and “fix” random parts
– Repairs are sometimes “random walks”
– Repairing one problem at a time is the best case

The Web100 project

When there is a problem, just ask TCP
– TCP has the ideal vantage point: in between the application and the network
– TCP already “measures” key network parameters
  Round Trip Time (RTT), available data capacity, etc.; many more can be added
– TCP can identify the bottleneck: why did it stop sending data?
– TCP can even adjust itself: “autotuning” eliminates one major class of flaws
See:
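
Web100 itself required an instrumented kernel, but the idea of asking TCP directly can be illustrated on stock Linux with the TCP_INFO socket option, which exposes a small subset of the same per-connection instrumentation. A minimal sketch (the helper name and field selection are mine; the offsets follow the long-stable leading fields of struct tcp_info in linux/tcp.h):

```python
import socket
import struct

def tcp_stats(sock):
    """Read a few TCP instruments (Linux TCP_INFO) from a connected socket."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    # struct tcp_info begins with 8 one-byte fields (state, ca_state,
    # retransmits, probes, backoff, options, wscale nibbles, flags byte),
    # followed by u32s: rto, ato, snd_mss, rcv_mss, unacked, sacked, lost,
    # retrans, fackets, last_data_sent, last_ack_sent, last_data_recv,
    # last_ack_recv, pmtu, rcv_ssthresh, rtt, rttvar, snd_ssthresh, snd_cwnd...
    vals = struct.unpack("8B24I", raw)
    return {"retrans": vals[15], "rtt_us": vals[23], "cwnd_pkts": vals[26]}

s = socket.create_connection(("example.com", 80))
s.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
s.recv(4096)         # force at least one round trip before sampling
print(tcp_stats(s))  # e.g. {'retrans': 0, 'rtt_us': 23750, 'cwnd_pkts': 10}
```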

The next step

Web100 tools still require too much expertise
– They are not really end-user tools
– Too easy to overlook problems
– Current diagnostic procedures are still cumbersome
New insight from Web100 experience
– Nearly all symptoms scale with round trip time
New NSF-funded project: Network Path and Application Diagnosis (NPAD)

Nearly all symptoms scale with RTT

For example: TCP buffer space, network loss and reordering, etc.
On a short path TCP can compensate for the flaw
Local client to server: all applications work
– Including all standard diagnostics
Remote client to server: all applications fail
– Leading to faulty implication of other components

Examples of flaws that scale

Chatty application (e.g., 50 transactions per request)
– On a 1 ms LAN, this adds 50 ms to user response time
– On a 100 ms WAN, this adds 5 s to user response time
Fixed TCP socket buffer space (e.g., 32 kBytes)
– On a 1 ms LAN, limits throughput to 200 Mb/s
– On a 100 ms WAN, limits throughput to 2 Mb/s
Packet loss (e.g., 0.1% loss at 1500 bytes)
– On a 1 ms LAN, models predict 300 Mb/s
– On a 100 ms WAN, models predict 3 Mb/s
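
The arithmetic behind these figures is simple enough to check directly. The sketch below is a back-of-the-envelope model, not pathdiag's internals: a serialized chatty exchange costs transactions × RTT, a fixed buffer caps throughput at window/RTT, and the Mathis et al. model bounds loss-limited throughput at roughly (MSS/RTT)·C/√p. The constant C is an assumption; ~0.87 (delayed ACKs) reproduces the slide's numbers to within rounding.

```python
from math import sqrt

def chatty_delay(transactions, rtt):
    """Added response time when transactions are serialized over the path."""
    return transactions * rtt                    # seconds

def buffer_limit(window_bytes, rtt):
    """Throughput ceiling of a fixed socket buffer: window / RTT."""
    return window_bytes * 8 / rtt                # bits per second

def mathis_limit(mss_bytes, rtt, loss, c=0.87):
    """Loss-limited throughput, rate <= (MSS/RTT) * C/sqrt(p).
    C ~ 0.87 assumes delayed ACKs; other model variants use ~1.22."""
    return (mss_bytes * 8 / rtt) * c / sqrt(loss)

for rtt in (0.001, 0.100):                       # 1 ms LAN vs 100 ms WAN
    print(f"RTT {rtt * 1000:3.0f} ms: "
          f"chatty app adds {chatty_delay(50, rtt):5.2f} s, "
          f"32 kB buffer caps at {buffer_limit(32 * 1024, rtt) / 1e6:6.1f} Mb/s, "
          f"0.1% loss caps at {mathis_limit(1460, rtt, 0.001) / 1e6:6.1f} Mb/s")
```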

The confounded problems

For nearly all network flaws
– The only symptom is reduced performance
– But the reduction is scaled by RTT
On short paths, most flaws are undetectable
– False pass for even the best conventional diagnostics
– Leads to faulty inductive reasoning about flaw locations
– This is the essence of the “end-to-end” problem
– Current state-of-the-art diagnosis relies on tomography and complicated inference techniques

The solutions

New diagnostic techniques to compensate for “symptom scaling”
For path testing (and lower layers)
– Test path sections using an instrumented application that can extrapolate test results to a long path
For applications (and upper layers)
– Bench test over an (emulated) ideal long path

Testing the path

Need to test short path sections to localize a flaw
– But “symptom scaling” normally hides a failing section
New tool (“pathdiag”):
– Measures the performance of each short section
  Uses Web100 to collect detailed statistics: loss, delay, queuing properties, etc.
– Uses models to extrapolate results to the full path, assuming the rest of the path is ideal
  You have to specify the end-to-end performance goal (data rate and RTT)
– Pass/fail on the basis of the extrapolated performance
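
To make the extrapolation concrete, here is a hedged sketch of the idea (my own simplification, not pathdiag's actual code): invert the same models used above to derive what the measured section must deliver for the full path to meet the end-to-end goal, assuming everything else is ideal.

```python
from math import sqrt

def section_requirements(target_rate_bps, target_rtt_s, mss_bytes=1460, c=0.87):
    """What the end-to-end goal implies for any section of the path."""
    window_bytes = target_rate_bps * target_rtt_s / 8       # buffering needed
    # Invert the loss model rate = (MSS/RTT) * C/sqrt(p) for a loss budget:
    p_max = (c * mss_bytes * 8 / (target_rate_bps * target_rtt_s)) ** 2
    return window_bytes, p_max

window, p_max = section_requirements(100e6, 0.100)          # 100 Mb/s at 100 ms
print(f"window needed: {window / 1e6:.2f} MB, loss budget: {p_max:.1e}")
# -> roughly 1.25 MB of buffering and a loss budget near 1e-6.  A section
#    that exceeds this loss rate fails, no matter how good its throughput
#    looks over its own short RTT.
```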

Deploy as a Diagnostic Server

Use pathdiag in a Diagnostic Server (DS)
Specify the end-to-end target performance
– From server (S) to client (C): RTT and data rate
Measure the performance from DS to C
– Use Web100 in the DS to collect detailed statistics
– Extrapolate performance assuming an ideal backbone
Pass/fail on the basis of extrapolated performance

Example 1 - good news

Example 1, continued

Example 2 - not so good

Example 2, continued

Key pathdiag/DS features

Results are intended for end users
– Provides a list of specific items to be corrected
  Failed tests are showstoppers for HPN apps
– Includes explanations and tutorial information
– Details for escalation to network or system admins
Coverage for a majority of OS and network flaws
– Most of the remaining flaws can be detected with pathdiag in the client, or with traceroute
– Eliminates nearly all(?) false-pass results
Tests become more sensitive on shorter paths
– Conventional diagnostics become less sensitive
– Depending on the models, perhaps too sensitive: the new problem is false fail (e.g., queue space tests)

Key features, continued

Flaws no longer completely mask other flaws
– A single test often detects several flaws, e.g., both OS and network flaws in the same test
– They can be repaired concurrently
Archived DS results include raw Web100 data
– Can be reprocessed with updated reporting software: new reports from old data
– Critical feedback for the NPAD project: we really want to collect “interesting” failures

Status

Public servers are now online. See:
Version 1.0 available for download
– Follow the download link
– Requires current Web100 kernel patches
– The server should be faster than the clients
Version 1.1 is coming soon
– Better support for non-local testing
– Better support for TeraGrid-scale testing

Blast from the past

Same base algorithm as “Windowed Ping” [Mathis, INET ’94]
– Aka “mping”
– See:
– A killer diagnostic in use at PSC in the early 90s
– Stopped working with the advent of “fast path” routers
Uses a simple fixed-window protocol
– Scans the window size in 1-second steps
– Measures data rate, loss rate, RTT, etc. as the window changes
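
As a rough illustration of what the window scan reveals, here is a toy single-bottleneck model (the capacity, queue size, and base RTT are made-up parameters, and this is not the mping code): while the window is below the bandwidth-delay product the data rate grows linearly, then the rate pins at the bottleneck while RTT grows with the standing queue, and once the queue overflows the loss rate climbs.

```python
def scan(capacity_pps, queue_pkts, base_rtt):
    """Predict what a fixed-window scan sees across one bottleneck link."""
    bdp = capacity_pps * base_rtt                     # bandwidth-delay product, pkts
    for window in range(2, 41, 2):
        rate = min(window / base_rtt, capacity_pps)   # delivered packets/s
        standing = max(0.0, window - bdp)             # packets beyond the BDP
        rtt = base_rtt + min(standing, queue_pkts) / capacity_pps
        loss = max(0.0, standing - queue_pkts) / window
        print(f"win={window:3d}  rate={rate:7.0f} pps  "
              f"rtt={rtt * 1000:5.1f} ms  loss={loss:4.0%}")

scan(capacity_pps=1000, queue_pkts=10, base_rtt=0.010)  # 1000 pps, 10 ms path
```

Reading the three curves together is the diagnostic: the knee in the rate curve gives the bottleneck capacity, the RTT growth gives the queue depth, and the onset of loss marks queue overflow.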

Diagnosing applications

Goal: tools to “bench test” applications in the lab
– Client and server on the same LAN: the app developer has easy access to all components
– Emulate a long ideal path between client and server
  Also checks some OS and TCP features; several different techniques (next topic)
The developer gets first-hand experience with delay
– If it fails in the lab, it will not work on a WAN
– Cannot blame the network
– Cannot repeal the speed of light
– Has to fix the application

Emulating delay

Multiple techniques to emulate long paths
– Scenic routing via tunnels
– Kernel delays (e.g., netem, nistnet, dummynet)
– Application (pipe) delay via a proxy (see the sketch below)
We have ~5 techniques prototyped/under test, trading off:
– Kernel hacking vs. non-privileged users
– Ease of use/ease of installation
– Maximum data rate
– Authenticity of the delay
Not ready for prime time
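
As one hedged example of the “application (pipe) delay” technique, here is a minimal TCP proxy that holds each chunk for a fixed one-way delay before forwarding it (the addresses, ports, and 50 ms figure are illustrative; this is not the NPAD prototype). Its per-chunk sleep also caps throughput at one buffer per delay interval, exactly the “maximum data rate” tradeoff noted above.

```python
import socket
import threading
import time

DELAY = 0.05                   # added one-way delay: 50 ms each way ~ 100 ms RTT
LISTEN = ("127.0.0.1", 9000)   # where the client connects (illustrative)
TARGET = ("127.0.0.1", 8080)   # the real server under test (illustrative)

def pipe(src, dst):
    """Relay src -> dst, holding each chunk for DELAY seconds."""
    while True:
        data = src.recv(65536)
        if not data:
            break
        time.sleep(DELAY)      # naive: also limits rate to one chunk per DELAY
        dst.sendall(data)
    try:
        dst.shutdown(socket.SHUT_WR)
    except OSError:
        pass                   # peer already closed

def handle(client):
    server = socket.create_connection(TARGET)
    threading.Thread(target=pipe, args=(client, server), daemon=True).start()
    pipe(server, client)       # relay the reverse direction in this thread

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(LISTEN)
listener.listen(5)
while True:
    conn, _ = listener.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```

Pointing the client at the proxy instead of the server then gives a first approximation of WAN behavior on a LAN; the kernel-level techniques above trade ease of use for more authentic delay.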

Try it!