Performance Engineering E2EpiPEs and FastTCP
Internet2 member meeting - Indianapolis
World Telecom Geneva
October 15, 2003

Agenda
- High TCP performance over wide area networks:
  - TCP at Gbps speed
  - MTU bias
  - RTT bias
  - TCP fairness
- How to use 100% of the link capacity with TCP Reno
  - Network buffers impact
- New Internet2 Land Speed Record

Single TCP stream performance under periodic losses
- Bandwidth available = 1 Gbps
- Loss rate = 0.01%:
  - LAN bandwidth utilization = 99%
  - WAN bandwidth utilization = 1.2%
- TCP throughput is much more sensitive to packet loss in WANs than in LANs
  - TCP’s congestion control algorithm (AIMD) is not suited to gigabit networks
  - Poor, limited feedback mechanisms
  - The effect of packet loss is disastrous
- TCP is inefficient in networks with a high bandwidth*delay product
- The future performance of computational grids looks bad if we continue to rely on the widely deployed TCP Reno
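
The gap between the LAN and WAN figures can be illustrated with the well-known Mathis et al. approximation for Reno throughput under random loss, throughput ≈ (MSS/RTT) · sqrt(3/2) / sqrt(p). The slide does not say which model produced its numbers, so the formula choice, the WAN parameters (120 ms RTT and 1,460-byte MSS, borrowed from the Geneva-Chicago connection described elsewhere in the talk) and the hypothetical 0.1 ms LAN RTT are all assumptions. Treat this as a rough sketch.

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state Reno throughput (Mathis et al. model)."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)

def utilization(capacity_bps, mss_bytes, rtt_s, loss_rate):
    """Fraction of the link used; the model output is capped at link capacity."""
    modelled = mathis_throughput_bps(mss_bytes, rtt_s, loss_rate)
    return min(modelled, capacity_bps) / capacity_bps

CAPACITY = 1e9   # 1 Gbps, from the slide
LOSS = 1e-4      # 0.01% loss rate, from the slide

wan = utilization(CAPACITY, 1460, 0.120, LOSS)    # assumed WAN path: 120 ms RTT
lan = utilization(CAPACITY, 1460, 0.0001, LOSS)   # hypothetical LAN: 0.1 ms RTT

print(f"WAN utilization: {wan:.1%}")
print(f"LAN utilization: {lan:.1%}")
```

Under these assumptions the WAN case lands at roughly 1.2%, in line with the slide, while the LAN case is limited by the link rather than by loss (the model simply saturates, whereas the slide reports 99%).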

Responsiveness (I)
The responsiveness measures how quickly the connection returns to using the network link at full capacity after a loss, assuming that the congestion window size is equal to the bandwidth-delay product when the packet is lost:
$$\text{responsiveness} = \frac{C \cdot RTT^2}{2 \cdot MSS}$$
C: capacity of the link
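
A short derivation shows where the expression C · RTT² / (2 · MSS) comes from. It assumes standard Reno behaviour (the window is halved on a loss and then grows by one segment per RTT) and expresses C and MSS in consistent units (bits per second and bits). The symbol W, the window in segments at the moment of loss, is introduced here only for illustration.

```latex
\begin{align*}
W &= \frac{C \cdot RTT}{MSS}
  && \text{(window at loss, in segments, equal to the BDP)} \\
\text{RTTs needed to recover} &= W - \frac{W}{2} = \frac{W}{2}
  && \text{(one segment gained per RTT)} \\
\text{responsiveness} &= \frac{W}{2} \cdot RTT = \frac{C \cdot RTT^2}{2 \cdot MSS}
\end{align*}
```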

Responsiveness (II)
Case | C | RTT (ms) | MSS (Byte) | Responsiveness
Typical LAN today | 1 Gb/s | 2 (worst case) | [...] | [...] ms
WAN Geneva-Chicago | 1 Gb/s | [...] | [...] | [...] min
WAN Geneva-Sunnyvale | 1 Gb/s | [...] | [...] | [...] min
WAN Geneva-Tokyo | 1 Gb/s | [...] | [...] | [...] h 04 min
WAN Geneva-Sunnyvale | 2.5 Gb/s | [...] | [...] | [...] min
Future WAN CERN-Starlight | 10 Gb/s | [...] | [...] | [...] h 32 min
Future WAN link CERN-Starlight (Jumbo Frame) | 10 Gb/s | [...] | [...] | 15 min
The Linux kernel 2.4.x implements delayed acknowledgments, which double the responsiveness time. The values above therefore have to be multiplied by two.
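
The table entries can be recomputed from the responsiveness formula, read here as C · RTT² / (2 · MSS). The sketch below is illustrative only: the function name and the `delayed_ack` flag are hypothetical, and the inputs are the Geneva-Chicago connection parameters quoted elsewhere in the talk (1 Gbit/s, 120 ms, 1,460 bytes).

```python
def responsiveness_s(capacity_bps, rtt_s, mss_bytes, delayed_ack=False):
    """Time for Reno to regain full rate after one loss: C * RTT^2 / (2 * MSS)."""
    seconds = capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)
    # Per the slide, delayed ACKs multiply the responsiveness by two.
    return 2 * seconds if delayed_ack else seconds

# Illustrative inputs: 1 Gb/s link, 120 ms RTT, 1,460-byte MSS.
base = responsiveness_s(1e9, 0.120, 1460)
with_delack = responsiveness_s(1e9, 0.120, 1460, delayed_ack=True)

print(f"without delayed ACKs: {base / 60:.1f} min")
print(f"with delayed ACKs:    {with_delack / 60:.1f} min")
```

For these inputs the recovery time is on the order of ten minutes, and about twice that once delayed acknowledgments are taken into account.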

Single TCP stream
TCP connection between Geneva and Chicago: C = 1 Gbit/s; MSS = 1,460 Bytes; RTT = 120 ms
- Time to increase the throughput from 100 Mbps to 900 Mbps = 35 minutes
- Loss occurs when the bandwidth reaches the pipe size
- 75% bandwidth utilization (assuming no buffering)
- Cwnd < BDP:
  - Throughput < Bandwidth
  - RTT constant
  - Throughput = Cwnd / RTT
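
Both the ramp time and the 75% utilization can be approximated with a simple AIMD model. This is a sketch under stated assumptions, not the method used for the measurement: the window grows by one MSS per RTT, delayed ACKs halve that growth (as the talk notes for Linux 2.4.x), and the window oscillates between half the BDP and the BDP with no buffering. The function names are illustrative.

```python
def ramp_time_s(start_bps, end_bps, rtt_s, mss_bytes, delayed_ack=True):
    """Time for additive increase to raise the rate from start_bps to end_bps."""
    # One MSS per RTT means the rate grows by MSS / RTT^2 bits/s every second.
    growth_bps_per_s = mss_bytes * 8 / rtt_s ** 2
    if delayed_ack:
        growth_bps_per_s /= 2   # assumption: delayed ACKs halve the growth
    return (end_bps - start_bps) / growth_bps_per_s

def sawtooth_utilization(capacity_bps, rtt_s, mss_bytes):
    """Average utilization when cwnd cycles from BDP/2 up to BDP (no buffer)."""
    bdp_segments = int(capacity_bps * rtt_s / (mss_bytes * 8))
    windows = range(bdp_segments // 2, bdp_segments + 1)
    return sum(windows) / (len(windows) * bdp_segments)

ramp = ramp_time_s(100e6, 900e6, 0.120, 1460)
avg = sawtooth_utilization(1e9, 0.120, 1460)

print(f"ramp 100 -> 900 Mbps: {ramp / 60:.0f} min")
print(f"average utilization:  {avg:.0%}")
```

With these inputs the ramp comes out at a little over half an hour, the same order as the 35 minutes measured, and the sawtooth average is about 75%.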

Measurements with Different MTUs
TCP connection between Geneva and Chicago: C = 1 Gbit/s; RTT = 120 ms
- In both cases: 75% link utilization
- A large MTU accelerates the growth of the window
- The time to recover from a packet loss decreases with a large MTU
- A larger MTU reduces the per-frame overhead (saves CPU cycles, reduces the number of packets)
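
The two MTU effects can be put into rough numbers. The sketch below assumes MTUs of 1500 and 9000 bytes with MSS values of 1,460 and 8,960 bytes (figures used elsewhere in the talk), ignores link-layer framing overhead, and assumes recovery time is inversely proportional to the MSS. The helper names are illustrative.

```python
def packets_per_second(rate_bps, mtu_bytes):
    """Packets the hosts must handle per second at a given line rate."""
    return rate_bps / (mtu_bytes * 8)

def recovery_speedup(small_mss_bytes, large_mss_bytes):
    """How much faster the window regrows, assuming one MSS gained per RTT."""
    return large_mss_bytes / small_mss_bytes

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {packets_per_second(1e9, mtu):,.0f} packets/s at 1 Gbit/s")

print(f"recovery speed-up with jumbo frames: {recovery_speedup(1460, 8960):.1f}x")
```

At 1 Gbit/s the larger MTU cuts the packet rate by a factor of six, and the window regrows about six times faster after a loss.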

MTU and Fairness
[Figure: testbed topology between CERN (GVA) and Starlight (Chi): Host #1 and Host #2 at each site, GbE switch, routers (R), POS 2.5 Gbps link, 1 GE links, bottleneck.]
- Two TCP streams share a 1 Gbps bottleneck
- RTT = 117 ms
- MTU = 1500 Bytes; avg. throughput over a period of 4000 s = 50 Mb/s
- MTU = 9000 Bytes; avg. throughput over a period of 4000 s = 698 Mb/s
- Factor of 14!
- Connections with a large MTU increase their rate quickly and grab most of the available bandwidth
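
The MTU bias can be reproduced qualitatively with a toy AIMD simulation. This is a minimal sketch, not a model of the actual testbed: it assumes both flows see the same RTT, gain one MSS per RTT, and halve their rate at the same instant whenever the combined rate exceeds the bottleneck (perfectly synchronized losses). The MSS values of 1,460 and 8,960 bytes are assumed to correspond to the two MTUs.

```python
def simulate_shared_bottleneck(capacity_bps, rtt_s, mss_list_bytes, duration_s):
    """Average rate of each AIMD flow sharing one bottleneck (synchronized loss)."""
    rates = [0.0 for _ in mss_list_bytes]
    totals = [0.0 for _ in mss_list_bytes]
    rounds = int(duration_s / rtt_s)
    for _ in range(rounds):
        # Additive increase: each flow gains one MSS per RTT.
        rates = [r + mss * 8 / rtt_s for r, mss in zip(rates, mss_list_bytes)]
        # Multiplicative decrease: every flow halves when the link overflows.
        if sum(rates) > capacity_bps:
            rates = [r / 2 for r in rates]
        totals = [t + r for t, r in zip(totals, rates)]
    return [t / rounds for t in totals]

small, large = simulate_shared_bottleneck(1e9, 0.117, [1460, 8960], 4000)

print(f"MSS 1460: {small / 1e6:.0f} Mb/s")
print(f"MSS 8960: {large / 1e6:.0f} Mb/s")
print(f"ratio:    {large / small:.1f}")
```

In this idealized model the throughput ratio equals the MSS ratio, about 6 to 1. The measured factor of 14 is larger, so the sketch explains the direction of the bias but not its full magnitude.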

RTT and Fairness
[Figure: testbed topology between CERN (GVA), Starlight (Chi) and Sunnyvale: Host #1 and Host #2, GbE switch, routers (R), POS 2.5 Gb/s and POS 10 Gb/s links, 10GE and 1 GE links, bottleneck.]
- Two TCP streams share a 1 Gbps bottleneck
- CERN-Sunnyvale: RTT = 181 ms; avg. throughput over a period of 7000 s = 202 Mb/s
- CERN-Starlight: RTT = 117 ms; avg. throughput over a period of 7000 s = 514 Mb/s
- MTU = 9000 bytes
- The connection with the small RTT increases its rate quickly and grabs most of the available bandwidth
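
A back-of-the-envelope check of the RTT bias: with the same MSS, a Reno flow's rate grows by MSS / RTT² per second, so under the simplifying assumption of synchronized losses the long-run throughput ratio of two flows is roughly the inverse ratio of their squared RTTs. That assumption is not stated in the talk, and the function name is illustrative.

```python
def predicted_rate_ratio(rtt_short_s, rtt_long_s):
    """Throughput ratio (short-RTT flow over long-RTT flow), same MSS assumed."""
    return (rtt_long_s / rtt_short_s) ** 2

predicted = predicted_rate_ratio(0.117, 0.181)   # Starlight vs Sunnyvale RTTs
measured = 514 / 202                             # average throughputs from the slide

print(f"predicted ratio: {predicted:.2f}")
print(f"measured ratio:  {measured:.2f}")
```

The predicted ratio (about 2.4) is close to the measured one (about 2.5), which suggests the squared-RTT dependence accounts for most of the unfairness seen here.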

How to use 100% of the bandwidth?
- Single TCP stream GVA-CHI
- MSS = 8960 Bytes; throughput = 980 Mbps
- Cwnd > BDP (bandwidth-delay product) => Throughput = Bandwidth
- RTT increases
- Extremely large buffer at the bottleneck
- Network buffers have an important impact on performance
- Do buffers have to be dimensioned to scale with the BDP?
- Why not use the end-to-end delay as a congestion indication?
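
The relation between window, buffer and RTT can be sketched with a simple fluid model. The assumptions are loud ones: a single flow, a base RTT of 120 ms (the Geneva-Chicago figure quoted elsewhere in the talk), and a hypothetical 10 MB bottleneck buffer. Any window beyond the BDP sits in the buffer and shows up as extra delay.

```python
def bdp_bytes(capacity_bps, base_rtt_s):
    """Bandwidth-delay product in bytes."""
    return capacity_bps * base_rtt_s / 8

def steady_state(cwnd_bytes, capacity_bps, base_rtt_s, buffer_bytes):
    """Return (throughput_bps, rtt_s) for a fixed congestion window."""
    bdp = bdp_bytes(capacity_bps, base_rtt_s)
    # Bytes beyond the BDP queue at the bottleneck, up to the buffer size.
    queued = min(max(cwnd_bytes - bdp, 0.0), buffer_bytes)
    rtt = base_rtt_s + queued * 8 / capacity_bps
    throughput = min(cwnd_bytes * 8 / rtt, capacity_bps)
    return throughput, rtt

C, RTT, BUF = 1e9, 0.120, 10e6   # BUF is a hypothetical buffer size
bdp = bdp_bytes(C, RTT)
print(f"BDP: {bdp / 1e6:.0f} MB")

for factor in (0.5, 1.0, 1.5):
    tput, rtt = steady_state(factor * bdp, C, RTT, BUF)
    print(f"cwnd = {factor:.1f} x BDP: {tput / 1e6:.0f} Mbps, RTT = {rtt * 1e3:.0f} ms")
```

In this model the link is fully used once the window exceeds the BDP, at the price of a growing RTT, which is why the queueing delay itself could serve as a congestion signal.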

Single stream TCP performance
Date | From Geneva to | Size of transfer | Duration (s) | RTT (ms) | MTU (Bytes) | IP version | Throughput | Record/Award
Feb 27 | Sunnyvale | 1.1 TByte | [...] | [...] | [...] | IPv[...] | [...] Gbps | Internet2 LSR; CENIC award; Guinness World Record
May 27 | Tokyo | 65.1 GByte | [...] | [...] | [...] | IPv4 | 931 Mbps |
May 2 | Chicago | 385 GByte | [...] | [...] | [...] | IPv6 | 919 Mbps |
May 2 | Chicago | 412 GByte | [...] | [...] | [...] | IPv6 | 983 Mbps | Internet2 LSR
NEW submission (Oct-11): 5.65 Gbps from Geneva to Los Angeles across the LHCnet, Starlight, Abilene and CENIC.

Early 10 Gb/s 10,000 km TCP Testing
- Single TCP stream at 5.65 Gbps
- Transferring a full CD in less than 1 s
- Uncongested network
- No packet loss during the transfer
- Probably qualifies as new Internet2 LSR
[Figure: monitoring of the Abilene traffic in LA]
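
The "full CD in less than 1 s" claim is easy to check. The CD capacity of 700 MB is an assumption, since the slide does not give a size, and the calculation ignores protocol overhead.

```python
def transfer_time_s(size_bytes, rate_bps):
    """Time to move size_bytes at a sustained rate of rate_bps."""
    return size_bytes * 8 / rate_bps

CD_BYTES = 700e6   # assumed CD capacity (700 MB); not stated on the slide
cd_time = transfer_time_s(CD_BYTES, 5.65e9)

print(f"CD transfer time at 5.65 Gbps: {cd_time:.2f} s")
```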

Conclusion
- The future performance of computational grids looks bad if we continue to rely on the widely deployed TCP Reno
- How should fairness be defined?
  - Taking into account the MTU
  - Taking into account the RTT
- Larger packet size (jumbogram: payload larger than 64K)
  - Is the standard MTU the largest bottleneck?
  - New Intel 10GE cards: MTU = 16K
  - J. Cain (Cisco): “It’s very difficult to build switches to switch large packets such as jumbogram”
- Our vision of the network: “The network, once viewed as an obstacle for virtual collaborations and distributed computing in grids, can now start to be viewed as a catalyst instead. Grid nodes distributed around the world will simply become depots for dropping off information for computation or storage, and the network will become the fundamental fabric for tomorrow's computational grids and virtual supercomputers”