TCP transfers over high latency/bandwidth networks
Internet2 Member Meeting, HENP working group session
April 9-11, 2003, Arlington

T. Kelly, University of Cambridge
J.P. Martin-Flatin, O. Martin, CERN
S. Low, Caltech
L. Cottrell, SLAC
S. Ravot, Caltech
Context
- High Energy Physics (HEP)
  - The LHC model shows that data at the experiment will be stored at a rate of 100-1500 Mbytes/sec throughout the year.
  - Many Petabytes per year of stored and processed binary data will be accessed and processed repeatedly by the worldwide collaborations.
- New backbone capacities advancing rapidly to the 10 Gbps range
- TCP limitation
  - Additive increase and multiplicative decrease policy
- TCP fairness
  - Effect of the MTU
  - Effect of the RTT
- New TCP implementations
  - Grid DT
  - Scalable TCP
  - FAST TCP
  - High-Speed TCP
- Internet2 Land Speed Record
Time to recover from a single loss
- TCP reactivity
  - The time needed to increase the throughput by 120 Mbit/s is larger than 6 min for a connection between Chicago and CERN.
- A single loss is disastrous
  - A TCP connection reduces its bandwidth use by half after a loss is detected (multiplicative decrease).
  - A TCP connection increases its bandwidth use slowly (additive increase).
  - TCP throughput is much more sensitive to packet loss in WANs than in LANs.
Responsiveness (I)
The responsiveness measures how quickly the connection goes back to using the network link at full capacity after experiencing a loss, assuming that the congestion window size is equal to the bandwidth-delay product when the packet is lost:

    Responsiveness = C * RTT^2 / (2 * MSS)

where C is the capacity of the link, RTT is the round-trip time and MSS is the maximum segment size (with C in bit/s, MSS in bits and RTT in seconds, the responsiveness is in seconds).
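As an illustration, here is a minimal sketch (not part of the original slides) of how the responsiveness formula can be evaluated; the capacity, RTT and MSS values below are assumed example profiles, not measurements from the talk.

```python
# Responsiveness of standard TCP (AIMD): time to grow cwnd from half the
# bandwidth-delay product back to the full BDP at one MSS per RTT.
def responsiveness(capacity_bps, rtt_s, mss_bytes, delayed_ack=False):
    """Return C * RTT^2 / (2 * MSS) in seconds."""
    recovery = capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)
    # Linux 2.4.x delayed acknowledgments roughly double the recovery time.
    return 2 * recovery if delayed_ack else recovery

# Illustrative profiles (assumed values).
profiles = [
    ("LAN 1 Gb/s, RTT 2 ms, MSS 1460 B",            1e9,  0.002, 1460),
    ("WAN 1 Gb/s, RTT 117 ms, MSS 1460 B",          1e9,  0.117, 1460),
    ("WAN 10 Gb/s, RTT 117 ms, MSS 1460 B",         1e10, 0.117, 1460),
    ("WAN 10 Gb/s, RTT 117 ms, MSS 8960 B (jumbo)", 1e10, 0.117, 8960),
]
for name, c, rtt, mss in profiles:
    print(f"{name}: {responsiveness(c, rtt, mss):.1f} s")
```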
Responsiveness (II)

Case                               C                        RTT (ms)         MSS (byte)   Responsiveness
Typical LAN                        ... Mb/s                 [2 ; 20]         1460         [1.7 ms ; 171 ms]
Typical LAN today                  1 Gb/s                   2 (worst case)   ...          ... ms
Future LAN                         10 Gb/s                  2 (worst case)   ...          ... s
WAN Geneva-Chicago                 1 Gb/s                   ...              ...          ... min
WAN Geneva-Sunnyvale               1 Gb/s                   ...              ...          ... min
WAN Geneva-Tokyo                   1 Gb/s                   ...              ...          ... h 04 min
WAN Geneva-Sunnyvale               2.5 Gb/s                 ...              ...          ... min
Future WAN CERN-Starlight          10 Gb/s                  ...              ...          ... h 32 min
Future WAN link CERN-Starlight     10 Gb/s (jumbo frame)    ...              ...          15 min

The Linux kernel 2.4.x implements delayed acknowledgments. Due to delayed acknowledgments, the responsiveness is doubled; the values above therefore have to be multiplied by two.
Effect of the MTU on the responsiveness
[Figure: effect of the MTU on a transfer between CERN and StarLight (RTT = 117 ms, bandwidth = 1 Gb/s)]
- A larger MTU improves the TCP responsiveness because cwnd increases by one MSS each RTT.
- Wire speed could not be reached with the standard MTU.
- A larger MTU reduces the overhead per frame (saves CPU cycles, reduces the number of packets).
MTU and Fairness
[Testbed: two hosts at CERN (GVA) and two at StarLight (Chi); 1 GE host links into a GbE switch, routers, POS 2.5 Gb/s transatlantic link; the 1 GE link is the bottleneck]
- Two TCP streams share a 1 Gb/s bottleneck.
- RTT = 117 ms
- First stream: MTU = 3000 bytes; average throughput over a period of 7000 s = 243 Mb/s
- Second stream: MTU = 9000 bytes; average throughput over a period of 7000 s = 464 Mb/s
- Link utilization: 70.7%
RTT and Fairness
[Testbed: hosts at CERN (GVA), StarLight (Chi) and Sunnyvale; 1 GE host links into a GbE switch, routers, POS 2.5 Gb/s and POS 10 Gb/s / 10GE links; the 1 GE link is the bottleneck]
- Two TCP streams share a 1 Gb/s bottleneck.
- First stream, CERN-Sunnyvale: RTT = 181 ms; average throughput over a period of 7000 s = 202 Mb/s
- Second stream, CERN-StarLight: RTT = 117 ms; average throughput over a period of 7000 s = 514 Mb/s
- MTU = 9000 bytes
- Link utilization = 71.6%
Effect of buffering on end-hosts
[Figure: measurement divided into Area #1 and Area #2; testbed: host at GVA and host at CHI connected through routers over POS 2.5 Gb/s and 1 GE]
- Setup
  - RTT = 117 ms
  - Jumbo frames
  - Transmit queue of the network device = 100 packets (i.e. 900 kBytes)
- Area #1
  - Cwnd < BDP => Throughput < Bandwidth
  - RTT constant
  - Throughput = Cwnd / RTT
- Area #2
  - Cwnd > BDP => Throughput = Bandwidth
  - RTT increases (proportional to Cwnd)
- Link utilization larger than 75%
Buffering space on end-hosts
- Link utilization near 100% if:
  - No congestion in the network
  - No transmission error
  - Buffering space = bandwidth-delay product
  - TCP buffer size = 2 * bandwidth-delay product
    => Congestion window size always larger than the bandwidth-delay product
- txqueuelen is the transmit queue length of the network device.
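To make the sizing rule above concrete, here is a minimal sketch (not part of the original slides) that derives the buffer sizes it suggests; the 1 Gb/s / 117 ms path and the 9000-byte frames are assumed example values.

```python
# Sketch: derive the buffer sizes suggested above for an assumed path.
# Assumptions: 1 Gb/s path, RTT = 117 ms, 9000-byte jumbo frames.
CAPACITY_BPS = 1e9
RTT_S = 0.117
MTU_BYTES = 9000

bdp_bytes = CAPACITY_BPS * RTT_S / 8              # bandwidth-delay product
tcp_buffer_bytes = 2 * bdp_bytes                  # TCP socket buffers = 2 * BDP
txqueuelen_pkts = int(bdp_bytes / MTU_BYTES)      # device transmit queue = one BDP worth of packets

print(f"BDP             : {bdp_bytes / 1e6:.1f} MB")
print(f"TCP buffer size : {tcp_buffer_bytes / 1e6:.1f} MB")
print(f"txqueuelen      : {txqueuelen_pkts} packets of {MTU_BYTES} bytes")
```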
Linux patch "Grid DT"
- Parameter tuning
  - New parameter to better start a TCP transfer
    - Set the value of the initial SSTHRESH
- Modifications of the TCP algorithms (RFC 2001)
  - Modification of the well-known congestion avoidance algorithm
    - During congestion avoidance, for every acknowledgement received, cwnd increases by A * (segment size) * (segment size) / cwnd. This is equivalent to increasing cwnd by A segments each RTT. A is called the additive increment (see the sketch below).
  - Modification of the slow start algorithm
    - During slow start, for every acknowledgement received, cwnd increases by M segments. M is called the multiplicative increment.
  - Note: A = 1 and M = 1 in TCP Reno.
- Smaller backoff
  - Reduce the strong penalty imposed by a loss
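The following is a minimal simulation sketch of the modified increments described above, based on the per-ACK update rules stated on this slide; it is an illustration, not the actual Grid DT kernel patch.

```python
# Sketch of the Grid DT window updates described on this slide (cwnd in bytes).
# A = additive increment, M = multiplicative increment; A = M = 1 gives TCP Reno.

def on_ack(cwnd, ssthresh, mss, A=1, M=1):
    """Return the new cwnd after one acknowledgement."""
    if cwnd < ssthresh:
        # Slow start: grow by M segments per ACK.
        return cwnd + M * mss
    # Congestion avoidance: grow by A*MSS*MSS/cwnd per ACK,
    # i.e. roughly A segments per RTT.
    return cwnd + A * mss * mss / cwnd

def on_loss(cwnd, backoff=0.5):
    """Multiplicative decrease; Grid DT allows a smaller backoff than Reno's 0.5."""
    return cwnd * (1 - backoff)

# Example: one RTT worth of ACKs in congestion avoidance with A = 7.
mss = 1460
cwnd = 100 * mss
acks_per_rtt = int(cwnd / mss)
for _ in range(acks_per_rtt):
    cwnd = on_ack(cwnd, ssthresh=64 * mss, mss=mss, A=7)
print(f"cwnd grew by about {(cwnd - 100 * mss) / mss:.1f} segments in one RTT")
```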
Grid DT
- Only the sender's TCP stack has to be modified.
- Very simple modifications to the TCP/IP stack.
- Alternative to multi-stream TCP transfers
  - Single stream vs. multiple streams:
    - it is simpler
    - startup/shutdown are faster
    - fewer keys to manage (if it is secure)
- Virtual increase of the MTU.
- Compensates for the effect of delayed ACKs.
- Can improve "fairness"
  - between flows with different RTTs
  - between flows with different MTUs
Effect of the RTT on the fairness
- Objective: improve fairness between two TCP streams with different RTTs and the same MTU.
- We can adapt the model proposed by Matt Mathis by taking into account a higher additive increment.
- Assumptions:
  - Approximate a packet loss probability p by assuming that each flow delivers 1/p consecutive packets followed by one drop.
  - Under these assumptions, the congestion window of each flow oscillates with a period T0.
  - If the receiver acknowledges every packet, then the congestion window opens by x (the additive increment) packets each RTT.
- By modifying the congestion increment dynamically according to the RTT, fairness among TCP connections can be guaranteed (a sketch of the resulting relation is given below).
[Figure: cwnd evolution under periodic loss; sawtooth between W/2 and W with period T0. The original slide also gave the relation between the two flows' periods and the number of packets delivered by each stream in one period.]
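The missing relations can be reconstructed along the usual Mathis-style argument; the derivation below is a sketch under the assumptions listed above, not a verbatim copy of the original formulas.

With an additive increment of x packets per RTT, the window oscillates between W/2 and W with period
\[ T_0 = \frac{W}{2x}\,RTT , \]
and each flow delivers
\[ \frac{3}{4}W \cdot \frac{W}{2x} = \frac{3W^2}{8x} = \frac{1}{p}
   \quad\Rightarrow\quad W = \sqrt{\frac{8x}{3p}} \]
packets per period, giving an average throughput of
\[ B = \frac{(1/p)\,MSS}{T_0} = \frac{MSS}{RTT}\sqrt{\frac{3x}{2p}} . \]
Equal throughput for two flows seeing the same loss rate p therefore requires \(x_1 / RTT_1^2 = x_2 / RTT_2^2\), i.e. an additive increment proportional to the square of the RTT (consistent with the "Next Work" slide).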
Effect of the RTT on the fairness
[Testbed: same Sunnyvale - StarLight (Chi) - CERN (GVA) setup as on the "RTT and Fairness" slide]
- TCP Reno performance (see slide #8):
  - First stream, GVA-Sunnyvale: RTT = 181 ms; average throughput over a period of 7000 s = 202 Mb/s
  - Second stream, GVA-CHI: RTT = 117 ms; average throughput over a period of 7000 s = 514 Mb/s
  - Link utilization: 71.6%
- Grid DT tuning in order to improve fairness between two TCP streams with different RTTs:
  - First stream, GVA-Sunnyvale: RTT = 181 ms, additive increment A = 7; average throughput = 330 Mb/s
  - Second stream, GVA-CHI: RTT = 117 ms, additive increment B = 3; average throughput = 388 Mb/s
  - Link utilization: 71.8%
Effect of the MTU
[Testbed: same CERN (GVA) - StarLight (Chi) setup as on the "MTU and Fairness" slide]
- Two TCP streams share a 1 Gb/s bottleneck.
- RTT = 117 ms
- First stream: MTU = 3000 bytes, additive increment = 3; average throughput over a period of 6000 s = 310 Mb/s
- Second stream: MTU = 9000 bytes, additive increment = 1; average throughput over a period of 6000 s = 325 Mb/s
- Link utilization: 61.5%
Next Work
- Taking into account the value of the MTU in the evaluation of the additive increment:
  - Define a reference MTU.
  - For example:
    - Reference: MTU = 9000 bytes => additive increment = 1
    - MTU = 1500 bytes => additive increment = 6
    - MTU = 3000 bytes => additive increment = 3
- Taking into account the square of the RTT in the evaluation of the additive increment:
  - Define a reference RTT.
  - For example:
    - Reference: RTT = 10 ms => additive increment = 1
    - RTT = 100 ms => additive increment = 100
    - RTT = 200 ms => additive increment = 400
- Combining the two formulas above (a combined sketch follows below).
- Periodic evaluation of the RTT and the MTU.
- How to define the references?
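A minimal sketch of how the two scalings above could be combined; the exact formula on the original slide was not preserved, so the combination below (inverse MTU times RTT squared, relative to the chosen references) is an assumption consistent with the examples given.

```python
# Sketch: additive increment scaled by the reference MTU and the square of the RTT,
# matching the examples above (MTU 9000 -> 1, 1500 -> 6; RTT 10 ms -> 1, 200 ms -> 400).
MTU_REF_BYTES = 9000
RTT_REF_MS = 10

def additive_increment(mtu_bytes, rtt_ms):
    """Combined increment: (MTU_ref / MTU) * (RTT / RTT_ref)^2."""
    return (MTU_REF_BYTES / mtu_bytes) * (rtt_ms / RTT_REF_MS) ** 2

print(additive_increment(9000, 10))    # 1.0  (reference)
print(additive_increment(1500, 10))    # 6.0
print(additive_increment(9000, 200))   # 400.0
print(additive_increment(3000, 117))   # combined example, ~410
```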
Scalable TCP
- For cwnd > lwnd, replace AIMD with a new algorithm:
  - for each ACK in an RTT without loss:
    - cwnd_{i+1} = cwnd_i + a
  - for each window experiencing loss:
    - cwnd_{i+1} = cwnd_i - (b * cwnd_i)
- Kelly's proposal during his internship at CERN: (lwnd, a, b) = (16, 0.01, 0.125)
  - Trade-off between fairness, stability, variance and convergence
- Advantages:
  - Responsiveness improves dramatically for gigabit networks.
  - Responsiveness is independent of capacity.
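As an illustration, here is a small simulation sketch of the update rule above with (lwnd, a, b) = (16, 0.01, 0.125); it approximates the per-ACK behaviour and is not the actual kernel implementation.

```python
# Sketch of the Scalable TCP window update described above (cwnd in packets).
LWND, A, B = 16, 0.01, 0.125

def on_ack(cwnd):
    """Per-ACK increase: legacy congestion avoidance below lwnd, scalable increase above."""
    if cwnd <= LWND:
        return cwnd + 1.0 / cwnd      # standard AIMD-style increase below lwnd
    return cwnd + A                   # scalable: + a per ACK => ~1% growth per RTT

def on_loss(cwnd):
    """Window reduction after a loss event."""
    return max(LWND, cwnd - B * cwnd)

# One RTT without loss roughly multiplies cwnd by (1 + a).
cwnd = 1000.0
for _ in range(int(cwnd)):            # ~cwnd ACKs arrive per RTT
    cwnd = on_ack(cwnd)
print(f"after one RTT: cwnd ~= {cwnd:.0f} packets")   # ~1010
```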
Scalable TCP: Responsiveness Independent of Capacity
Scalable TCP vs. TCP NewReno: Benchmarking
- Responsiveness for RTT = 200 ms and MSS = 1460 bytes:
  - Scalable TCP: 2.7 s
  - TCP NewReno (AIMD):
    - ~3 min at 100 Mbit/s
    - ~1 h 10 min at 2.5 Gbit/s
    - ~4 h 45 min at 10 Gbit/s
- Bulk throughput tests with C = 2.5 Gbit/s. Flows transfer 2 Gbytes and start again for 1200 s.
[Figure: throughput vs. number of flows for TCP, TCP with a new device driver, and Scalable TCP]
- For details, see paper and code at: ...
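The responsiveness figures above can be checked with a short calculation; the sketch below uses the AIMD formula from the "Responsiveness (I)" slide and, for Scalable TCP, the time to undo a 12.5% window reduction at roughly 1% growth per RTT (an assumption based on the (lwnd, a, b) values given earlier).

```python
import math

RTT_S = 0.2          # 200 ms
MSS_BYTES = 1460
A, B = 0.01, 0.125   # Scalable TCP parameters from the previous slides

def aimd_recovery(capacity_bps):
    """AIMD recovery time: C * RTT^2 / (2 * MSS), in seconds."""
    return capacity_bps * RTT_S ** 2 / (2 * MSS_BYTES * 8)

def scalable_recovery():
    """RTTs needed to grow back from (1-b)*W to W at (1+a) per RTT, times RTT."""
    return math.log(1 / (1 - B)) / math.log(1 + A) * RTT_S

for c in (100e6, 2.5e9, 10e9):
    print(f"AIMD at {c / 1e9:g} Gbit/s: {aimd_recovery(c) / 60:.0f} min")
print(f"Scalable TCP: {scalable_recovery():.1f} s")   # ~2.7 s, independent of capacity
```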
FAST TCP
- Equilibrium properties
  - Uses end-to-end delay and loss
  - Achieves any desired fairness, expressed by a utility function
  - Very high utilization (99% in theory)
- Stability properties
  - Stability for arbitrary delay, capacity, routing & load
  - Robust to heterogeneity, evolution, ...
  - Good performance
    - Negligible queueing delay & loss (with ECN)
    - Fast response
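The slides do not give the FAST window update itself; as a hedged illustration, the sketch below follows the delay-based rule described in the FAST TCP authors' publications (Jin, Wei, Low), where the window is steered so that roughly alpha packets stay queued in the network. The gamma and alpha values and the toy queue model are assumptions.

```python
# Sketch of a FAST-style delay-based window update (not taken from these slides):
#   w <- min(2w, (1 - gamma) * w + gamma * (baseRTT / RTT * w + alpha))
GAMMA = 0.5      # smoothing factor (assumed value)
ALPHA = 200.0    # target number of packets kept queued in the path (assumed value)

def fast_update(w, base_rtt, rtt):
    """One periodic window update from the measured RTT and the minimum RTT seen."""
    target = (base_rtt / rtt) * w + ALPHA
    return min(2 * w, (1 - GAMMA) * w + GAMMA * target)

# Toy single-bottleneck model: queueing delay appears once the window exceeds the BDP.
base_rtt = 0.117                              # seconds
capacity = 1e9 / (1460 * 8)                   # packets per second on a 1 Gb/s link
bdp = base_rtt * capacity                     # bandwidth-delay product in packets
w = 1000.0
for _ in range(200):
    queue = max(0.0, w - bdp)
    rtt = base_rtt + queue / capacity
    w = fast_update(w, base_rtt, rtt)
print(f"steady-state window ~= {w:.0f} packets (BDP ~= {bdp:.0f}, plus alpha)")
```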
FAST TCP performance
- FAST, standard MTU
- Utilization averaged over > 1 hr
[Figure: throughput for 1, 2, 7, 9 and 10 flows over runs of roughly 1 hr, 6 hr, 1.1 hr and 6 hr; average utilizations of 95%, 92%, 90% and 88%]
FAST TCP performance
- FAST, standard MTU
- Utilization averaged over 1 hr
[Figure: comparison of Linux TCP (txq = 100), Linux TCP (txq = ...) and FAST on 1 Gb/s and 2 Gb/s paths; average utilizations of 19%, 27% and 92% in one case and of 16% and 48% in the other]
Internet2 Land Speed Record
- On February 27-28, 2003, over a Terabyte of data was transferred in less than an hour between the Level(3) Gateway in Sunnyvale, near SLAC, and CERN.
- The data passed through the TeraGrid router at StarLight from memory to memory as a single TCP/IP stream at an average rate of 2.38 Gbit/s (using large windows and 9 KByte "jumbo frames").
- This beat the former record by a factor of approximately 2.5 and used the US-CERN link at 99% efficiency.
Internet2 LSR testbed
Conclusion
- To achieve high throughput over high latency/bandwidth networks, we need to:
  - Set the initial slow start threshold (ssthresh) to an appropriate value for the delay and bandwidth of the link.
  - Avoid loss
    - by limiting the max cwnd size
  - Recover fast if a loss occurs:
    - larger cwnd increment
    - smaller window reduction after a loss
    - larger packet size (jumbo frames)
- Is the standard MTU the largest bottleneck?
- How to define fairness?
  - Taking into account the MTU
  - Taking into account the RTT
- Which is the best new TCP implementation?