Characterization and Evaluation of TCP and UDP-based Transport on Real Networks. Les Cottrell, Sun SuperG Spring 2005, April 2005. Slides: www.slac.stanford.edu/grp/scs/net/talk05/superg-apr05.ppt




Presentation transcript:

1 Characterization and Evaluation of TCP and UDP-based Transport on Real Networks
Les Cottrell, Sun SuperG Spring 2005, April 2005
Partially funded by the DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM)

2 Overview
What's the problem with getting high-performance data transport on high-speed networks?
Possible solutions:
–Software
–Hardware
Measurements and comparisons:
–Offloads, OSes, jumbos, parallel streams, NICs
SC2004 Bandwidth Challenge (101 Gbits/s):
–Some results
–What was special
Conclusions
Who needs it anyway?

3 What's the problem?
Standard TCP recovers very slowly from a congestion event (perceived packet loss):
–To remove the need to wait for each packet to be acknowledged, the sender fills the pipe with a "window" of unacknowledged packets
–Filling a 10 Gbps pipe between California and Switzerland requires a window of 137,500 × 1500-byte packets
–If TCP sees congestion, it reduces the throughput (window) by a factor of 2 (Multiplicative Decrease)
–It then adds only one packet to the window per round trip (Additive Increase)
–OK for the low-speed 64 kbps links of the original Internet
–BUT: at 10 Gbps with RTT = 165 ms it takes hours to recover from a single loss
–So losses of << 1 in 10^14 are needed!
(Diagram: Sunnyvale-Geneva path, RTT 165 ms, 1500-byte MTU, stock TCP; the sender keeps ~1 RTT of packets in flight before the first ACK returns.)
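A back-of-the-envelope check of those numbers; the path parameters come from the slide, the arithmetic is only an illustrative sketch, not part of the talk:

```python
# Window needed to fill the pipe, and Reno recovery time after one loss,
# for a 10 Gbps Sunnyvale-Geneva path (RTT ~165 ms, 1500-byte packets).

link_gbps = 10.0          # path capacity
rtt = 0.165               # round-trip time in seconds
mtu_bytes = 1500          # standard Ethernet MTU

# Bandwidth-delay product expressed in packets
bdp_bits = link_gbps * 1e9 * rtt
window_pkts = bdp_bits / (mtu_bytes * 8)
print(f"window to fill the pipe: {window_pkts:,.0f} packets")   # ~137,500

# After a loss, Reno halves the window and regains one packet per RTT,
# so it needs roughly window/2 RTTs to get back to full speed.
recovery_s = (window_pkts / 2) * rtt
print(f"recovery after a single loss: {recovery_s / 3600:.1f} hours")  # ~3 hours
```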

4 Possible solutions
Jumbo frames (1500-byte standard => 9000 bytes): a factor of 6 improvement in recovery rate
–Not an IEEE standard
–May break some UDP applications
–Not supported on many LANs
Sender-side modifications only; the HENP model is a few big senders and lots of smaller receivers
–Simplifies deployment: only a few hosts at a few sending sites
–So no Dynamic Right Sizing (DRS) at the receiver
Must run on production networks (hard to deploy a new Internet)
–No router modifications (XCP/ECN)

5 Parallel streams
Use multiple parallel TCP streams for an application such as FTP (a minimal sketch follows below)
–No modifications to the kernel
–Improves throughput by roughly the number of streams
–Very effective
–But: may be unfair to others, especially with large numbers of streams
–And the window size AND the number of streams have to be optimized simultaneously, which may change from day to day or even hour to hour on congested networks
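A minimal sketch of the parallel-streams idea, not the tool used in the talk: the transfer is split across several TCP connections so each loss event only halves one stream's window. The endpoint, stream count and buffer size are illustrative placeholders; real tools (e.g. GridFTP, bbcp) also tune the per-stream window.

```python
import socket
import threading

HOST, PORT = "receiver.example.org", 5001    # placeholder endpoint
STREAMS = 8                                  # number of parallel TCP flows
CHUNK = b"x" * 65536                         # 64 KB application writes
BYTES_PER_STREAM = 100 * 1024 * 1024         # 100 MB per stream

def send_stream(window_bytes=4 * 1024 * 1024):
    with socket.create_connection((HOST, PORT)) as s:
        # Request a large socket buffer (window); the kernel may clamp it.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, window_bytes)
        sent = 0
        while sent < BYTES_PER_STREAM:
            s.sendall(CHUNK)
            sent += len(CHUNK)

threads = [threading.Thread(target=send_stream) for _ in range(STREAMS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```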

6 Software Transports
Advanced TCP stacks, being developed on Linux
–Aim to overcome the AIMD congestion behavior of Reno-based TCPs while preserving stability and fairness
–BUT: SLAC "datamovers" are all based on Solaris, while the advanced TCPs are currently Linux-only
–SLAC production systems people are concerned about non-standard kernels, and about keeping TCP patches current with security patches for the SLAC-supported Linux version
So also very interested in transports that run in user space (no kernel modifications)
–Evaluate UDT (a reliable UDP-based transport) from UIC

7 Reno single stream
Low performance on fast long-distance paths
–AIMD (add a=1 packet to cwnd per RTT; decrease cwnd by a factor b=0.5 on congestion)
–Net effect: recovers slowly and does not effectively use the available bandwidth, so throughput is poor
Observations from SLAC-to-CERN measurements:
–Congestion has a dramatic effect and recovery is slow
–Remaining flows do not take up the slack when a flow is removed
–Multiple streams increase the recovery rate
–RTT increases when TCP achieves its best throughput
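A toy per-RTT model of the AIMD behaviour described above (a=1 packet per RTT, cwnd halved on loss), useful only to visualise why recovery is slow; the loss probability is an illustrative assumption, not a measurement from the SLAC-CERN path.

```python
import random

def reno_cwnd(rtts=10000, loss_per_rtt=1e-3, a=1, b=0.5, start=10):
    """Return the cwnd trace (in packets) of a simplified Reno flow."""
    cwnd, trace = start, []
    for _ in range(rtts):
        if random.random() < loss_per_rtt:
            cwnd = max(1, cwnd * b)    # multiplicative decrease
        else:
            cwnd += a                  # additive increase, 1 packet per RTT
        trace.append(cwnd)
    return trace

trace = reno_cwnd()
print(f"mean cwnd: {sum(trace)/len(trace):.0f} packets, peak: {max(trace):.0f}")
```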

8 New TCP stacks
Adjust recovery (faster than linear) and back-off (less dramatic)
–Only available on Linux today, still in development
One of the best performers is HTCP
–Throughput is high
–Big effects on RTT when it achieves its best throughput
–Flows share equally and are stable; two flows share equally (SLAC-CERN)
–Appears to need more than one flow to achieve the best throughput
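To illustrate "faster than linear" recovery: H-TCP grows its per-RTT increment with the time elapsed since the last loss instead of keeping it at one packet. The alpha() form below follows the published H-TCP proposal (Leith and Shorten); the constants are from that paper and are assumptions here, not the tuning used in the SLAC tests.

```python
def htcp_alpha(delta, delta_low=1.0):
    """Packets added to cwnd per RTT; delta = seconds since the last loss."""
    if delta <= delta_low:
        return 1.0                       # behave like Reno just after a loss
    d = delta - delta_low
    return 1.0 + 10.0 * d + (d / 2.0) ** 2

for t in (0.5, 1, 2, 5, 10, 20):
    print(f"{t:5.1f} s after loss: +{htcp_alpha(t):7.1f} packets/RTT (Reno: +1)")
```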

9 Hardware Assists
For 1 Gbits/s paths, CPU, bus etc. are not a problem; for 10 Gbits/s they are important
NIC assistance to the CPU is becoming popular:
–Checksum offload
–Interrupt coalescence
–Large send/receive offload (LSO/LRO)
–TCP Offload Engine (TOE)
Several vendors offer 10 Gbits/s NICs, and at least one offers a 1 Gbits/s NIC
–But a TOE currently restricts one to the NIC vendor's TCP implementation
Most focus is on the LAN
–Cheap alternative to InfiniBand, Myrinet etc.
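A quick way to see which of these NIC assists a Linux host currently has enabled is `ethtool -k`. This sketch just wraps that command; the interface name "eth0" is a placeholder and ethtool must be installed.

```python
import subprocess

def offload_settings(iface="eth0"):
    """Return the offload-related lines from 'ethtool -k <iface>'."""
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True, check=True).stdout
    wanted = ("tx-checksumming", "rx-checksumming",
              "tcp-segmentation-offload", "large-receive-offload")
    return {line.split(":")[0].strip(): line.split(":")[1].strip()
            for line in out.splitlines()
            if line.strip().startswith(wanted)}

print(offload_settings())
```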

10 10 Gbps tests
Sun Fire Vx0z servers, Linux and Solaris 10, Chelsio and Neterion NICs
–Back-to-back (LAN) testing at SLAC, and SNV to LA
–At SC2004, two 10 Gbps dedicated paths between Pittsburgh and Sunnyvale
–Using Solaris 10 (build 69) and Linux 2.6
–On Sun Fire Vx0z (dual and quad 2.4 GHz 64-bit AMD Opterons) with PCI-X 133 MHz, 64 bit
–Only 1500-byte MTUs
Achievable performance limits (using iperf):
–Reno TCP (multiple flows) vs UDTv2
–TOE (Chelsio) vs no TOE (Neterion/S2io)
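The achievable-rate numbers on the following slides came from iperf-style memory-to-memory tests. A minimal way one might reproduce such a setup (the host is a placeholder; the options shown are standard iperf client options, and the exact window/stream values used in the talk are not implied here):

```python
import subprocess

HOST = "receiver.example.org"   # placeholder; the server side runs "iperf -s -w 16M"
cmd = ["iperf", "-c", HOST,
       "-w", "16M",             # request a 16 MB TCP window
       "-P", "4",               # 4 parallel streams
       "-t", "60",              # 60-second test
       "-i", "5"]               # report every 5 seconds
subprocess.run(cmd, check=True)
```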

11 Results - Summary
UDT limit was ~4.45 Gbits/s
–CPU limited
TCP limit was about 7.5 ± 0.07 Gbps, regardless of whether LAN (back-to-back) or WAN
–Gating factor = PCI-X 133 MHz
–Raw bandwidth 8.53 Gbps
–But transfers are broken into segments to allow interleaving
–E.g. with a maximum memory read byte count of 4096 bytes, the Intel PRO/10GbE LR NIC limit is 6.83 Gbits/s
One host with 4 CPUs and 2 NICs sent 11.5 ± 0.2 Gbps to two dual-CPU hosts with 1 NIC each
Two hosts to two hosts (1 NIC/host): 9.07 Gbps goodput forward and 5.6 Gbps reverse
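Where the ~7.5 Gbps ceiling comes from, as simple arithmetic on the bus figures quoted above; this is only an illustrative check of the raw PCI-X rate, not a bus-level model of the segmentation overhead.

```python
# PCI-X 133 MHz, 64-bit raw bandwidth vs the measured TCP ceiling.
bus_mhz = 133.33
bus_bits = 64
raw_gbps = bus_mhz * 1e6 * bus_bits / 1e9
print(f"raw PCI-X bandwidth: {raw_gbps:.2f} Gbps")            # ~8.53 Gbps

measured_gbps = 7.5
print(f"measured TCP ceiling: {measured_gbps} Gbps "
      f"({measured_gbps / raw_gbps:.0%} of raw)")              # ~88% of raw
```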

12 CPU Utilization
For Neterion with LSO and Linux:
–The receiver needs about 20% less CPU than the sender at high throughput
–The sender appears to use increasingly more CPU than the receiver as throughput rises
–A single stream is limited by the 1.8 GHz CPU

13 Effect of Jumbos
Throughput, SLAC-CENIC LA (1 stream, 2 MB window, with LSO, Neterion(S2io)/Linux):
–1500 B MTU: 1.8 Gbps
–9000 B MTU: 6 Gbps
Sender CPU cost, GHz/Gbps (single stream with LSO, Neterion/Linux):
–1500 B MTU: 0.5 ± 0.13 GHz/Gbps (single sender to single receiver; needs extending to multiple receivers)
–9000 B MTU: 0.3 ± 0.07 GHz/Gbps
–A factor of 1.7 improvement
For Neterion with LSO and Linux on the WAN, jumbos have a huge effect on performance and also improve CPU utilization
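One reason jumbos cut the GHz/Gbps cost: at a given rate the host handles six times fewer packets, and much of the per-packet work (interrupts, ACK processing) scales with packets rather than bytes. Pure arithmetic on the two MTU sizes from the slide:

```python
rate_gbps = 10.0
for mtu in (1500, 9000):
    pps = rate_gbps * 1e9 / (mtu * 8)
    print(f"{mtu:5d} B MTU at {rate_gbps:.0f} Gbps: {pps:,.0f} packets/s")
# 1500 B -> ~833,333 pkt/s; 9000 B -> ~138,889 pkt/s
```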

14 Effect of LSO
V20z 1.8 GHz, Linux 2.6, S2io NIC, 2 streams SLAC to Caltech, 8 MB window:
–With LSO: 7.4 Gbits/s
–Without LSO: 5.4 Gbits/s
LAN (3 streams, 164 KB window):
–Solaris => Linux: 6.4 Gbps (no LSO support in Solaris 10 at the moment)
–Linux => Solaris 10: 4.8 Gbps (LSO turned off at the sender)
–Linux => Solaris 10: 7.54 Gbps (LSO turned on)
For Neterion with Linux on the LAN, LSO improves CPU utilization by a factor of 1.4; if one is CPU limited this will also improve throughput.

15 Solaris vs Linux, single stream
Send from one to the other, single stream
Compare sending from Linux with Neterion + LSO against sending from Solaris 10 without LSO
–LSO support in Solaris is coming soon
With one stream the Solaris sender sends faster, and Solaris is slightly better in GHz/Gbps

16 Solaris vs Linux, multiple streams
When optimizing for multiple streams, the Linux + LSO sender is better
Solaris without LSO performs poorly with multiple streams (LSO- or OS-related?)
–Its GHz/Gbps is poorer than Linux + LSO for multiple streams
(Plot: LAN, 9400 B MTU, S2io NIC, 1/2/4 MB windows; peaks of 7.5 Gbps and 6.4 Gbps.)

17 Chelsio
Chelsio to Chelsio (TOE), with 2.4 GHz V20zs from Pittsburgh to SNV, 1500-byte MTUs
–Reliably able to get Gbps (16 streams)
–GHz/Gbps for Chelsio (MTU = 1500 B) is ~ that of Neterion (9000 B)

18 SC2004: Tenth of a Terabit/s Challenge
–Joint Caltech, SLAC, FNAL, CERN, UF, SDSC, BR, KR, …
–Gbps waves to HEP on the show floor
–Bandwidth challenge: aggregate throughput of 101 Gbps
–FAST TCP

19 Bandwidth Challenge
A large collaboration of academia and industry
–Took a lot of "wizards" to make it work
–>100 Gbps aggregate

20 What was special?
End-to-end, application-to-application, single and multiple streams (not just internal backbone aggregate speeds)
TCP has not run out of steam yet; it scales from modem speeds into the multi-Gbits/s region
–TCP is well understood and mature, with many good features: reliability etc.
–Friendly on shared networks
New TCP stacks only need to be deployed at the sender
–Often just a few data sources, many destinations
–No modifications to backbone routers etc.
–No need for jumbo frames
Used Commercial Off The Shelf (COTS) hardware and software

21 What was special 2/2
Raises the bar on expectations for applications and users
–Some applications can use Internet backbone speeds
–Provides planning information
The network is looking less like a bottleneck and more like a catalyst/enabler
–Reduces the need to colocate data and CPU
–No longer need to literally ship truck- or plane-loads of data around the world
–Worldwide collaborations of people working with large amounts of data become increasingly possible

22 Conclusions
Today's limit (PCI-X) is about 7.5 Gbits/s
Jumbos can be a big help
LSO is helpful (Neterion)
For best throughput a Linux + LSO sender is better; without LSO, Solaris provides more throughput
–Need to revisit when Solaris supports LSO
–Also untangle Solaris vs Linux differences
Solaris without LSO has problems with multiple streams
TOE (Chelsio) allows one to avoid 9000-byte MTUs
–Need further study of 9000 B MTUs for Chelsio
–Try Chelsio on Solaris
Longer paths (trans-Atlantic)
New buses (PCI-Express)

23 Conclusions
Need testing on real networks
–Controlled simulation and emulation are critical for understanding
–BUT results need to be verified, and can look different than expected
–Most important for transoceanic paths
UDT looks promising, but still needs work for > 6 Gbits/s
Need to evaluate the various offloads (TOE, LSO, ...)
New buses (PCI-X 266 MHz and PCI-Express) are important; need NICs/hosts to support them, then evaluate

24 Who needs it?
HENP - the current driver
–Many hundreds of Mbits/s and multi-TByte files transferred across the Atlantic each day
–The SLAC BaBar experiment already has a PByte stored
–Tbits/s and exabytes (10^18 bytes) stored in a decade
Data-intensive science:
–Astrophysics, global weather, bioinformatics, fusion, seismology, ...
Industries such as aerospace, medicine, security, ...
Future:
–Media distribution: Gbits/s = 2 full-length DVD movies/minute
100 Gbits/s is equivalent to:
–Downloading the Library of Congress in < 14 minutes
–Three full-length DVDs in a second
Will sharing movies be like sharing music today?
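A back-of-the-envelope check of the 100 Gbps comparisons above. The Library of Congress size (~10 TB, a commonly quoted ballpark) and the 4.7 GB single-layer DVD are assumptions, not figures from the talk.

```python
rate_gbps = 100.0
loc_tb = 10.0                      # assumed ~10 TB for the Library of Congress
dvd_gb = 4.7                       # single-layer DVD

loc_seconds = loc_tb * 1e12 * 8 / (rate_gbps * 1e9)
print(f"Library of Congress: {loc_seconds / 60:.1f} minutes")   # ~13.3 minutes

dvds_per_second = rate_gbps * 1e9 / (dvd_gb * 1e9 * 8)
print(f"DVDs per second: {dvds_per_second:.1f}")                # ~2.7
```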

25 Acknowledgements
Gary Buhrmaster*, Parakram Khandpur*, Harvey Newman c, Yang Xia c, Xun Su c, Dan Nae c, Sylvain Ravot c, Richard Hughes-Jones m, Michael Chen +, Larry McIntosh s, Frank Leers s, Leonid Grossman n, Alex Aizman n
SLAC*, Caltech c, Manchester University m, Chelsio +, Sun s, Neterion(S2io) n

26 Further Information
–Web site with lots of plots and analysis
–Inter-protocol comparison (Journal of Grid Computing, PFLDnet04)
–SC2004 details: www-iepm.slac.stanford.edu/monitoring/bulk/sc2004/

27 Table: median throughput (Gbps, with IQR) and sender CPU cost (GHz/Gbps, with IQR) by sending OS, LSO setting and MTU size, for Linux (LSO on and off) and Solaris (LSO off) senders.

28 When will it have an impact?
ESnet traffic has doubled each year since 1990
SLAC capacity has increased by 90%/year since 1982
–SLAC Internet traffic increased by a factor of 2.5 in the last year
International throughput increased by a factor of 10 in 4 years
So traffic increases by a factor of 10 every 3.5 to 4 years, so in:
–3.5 to 5 years: 622 Mbps => 10 Gbps
–3-4 years: 155 Mbps => 1 Gbps
–3.5-5 years: 45 Mbps => 622 Mbps
Looking further ahead:
–100s of Gbits/s for high-speed production network end connections
–10 Gbps will be mundane for R&E and business
–Home broadband: doubling roughly every year, 100 Mbits/s by the end of the decade
–Aggressive goal: 1 Gbps to all Californians by 2010
(Chart: throughput from the US, Mbits/s.)
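The "factor of 10 every 3.5 to 4 years" follows directly from the quoted growth rates; simple compound-growth arithmetic as a sketch:

```python
import math

for label, annual_growth in [("ESnet, doubling/year", 2.0),
                             ("SLAC capacity, +90%/year", 1.9)]:
    years_per_10x = math.log(10) / math.log(annual_growth)
    print(f"{label}: x10 every {years_per_10x:.1f} years")
# doubling/year -> ~3.3 years; +90%/year -> ~3.6 years
```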