1 FAST TCP for Multi-Gbps WAN: Experiments and Applications
Les Cottrell & Fabrizio Coccetti, SLAC
Prepared for Internet2, Washington, April
Partially funded by the DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM) and by the SciDAC base program.
2 Outline
High throughput challenges
New TCP stacks
Tests on unloaded (testbed) links
–Performance of multi-streams
–Performance of various stacks
Tests on production networks
–Stack comparisons with single streams
–Stack comparisons with multiple streams
–Fairness
Where do I find out more?
3 High Speed Challenges
After a loss it can take over an hour for stock TCP (Reno) to recover to maximum throughput at 1 Gbits/s
–i.e. a loss rate of 1 in ~2 Gpkts (3 Tbits), or a BER of 1 in 3.6*10^12
PCI bus limitations (66 MHz * 64 bit = 4.2 Gbits/s at best)
At 2.5 Gbits/s and 180 ms RTT, requires a 120 MByte window (see the sketch after this slide)
Some tools (e.g. bbcp) will not allow a large enough window
–(bbcp is limited to 2 MBytes)
Slow start at 1 Gbits/s takes about 5-6 secs on a 180 ms link
–i.e. if we want 90% of the measurement in the stable (non slow start) regime, we need to measure for 60 secs
–need to ship >700 MBytes at 1 Gbits/s
[Plot: Sunnyvale-Geneva, 1500 Byte MTU, stock TCP]
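The window and slow-start figures above follow from the bandwidth-delay product. Below is a minimal back-of-the-envelope sketch of that arithmetic, assuming 1500 B packets, a socket buffer of roughly 2x the BDP (my assumption about where the 120 MByte figure comes from), and a slow-start growth factor closer to 1.5 per RTT under delayed ACKs; it is illustrative only, not taken from the slides.

```python
import math

def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return rate_bps * rtt_s / 8

def slow_start_secs(rate_bps, rtt_s, mss=1500, growth=1.5):
    """Rough time for slow start to reach the BDP.
    growth=2 is textbook doubling per RTT; ~1.5 with delayed ACKs."""
    pkts = bdp_bytes(rate_bps, rtt_s) / mss
    return rtt_s * math.log(pkts, growth)

rtt = 0.180  # Sunnyvale-Geneva RTT quoted on the slide
print("BDP at 2.5 Gbits/s: ~%.0f MBytes" % (bdp_bytes(2.5e9, rtt) / 1e6))
print("~2x BDP socket buffer: ~%.0f MBytes" % (2 * bdp_bytes(2.5e9, rtt) / 1e6))
print("Slow start at 1 Gbits/s: ~%.1f s" % slow_start_secs(1e9, rtt))
```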
4 New TCP Stacks
Reno (AIMD) based: loss indicates congestion
–Back off less when congestion is seen
–Recover more quickly after backing off
Scalable TCP: exponential recovery
–Tom Kelly, "Scalable TCP: Improving Performance in Highspeed Wide Area Networks", submitted for publication, December
High Speed TCP: same as Reno at low performance, then increases the window more and more aggressively as the window grows, using a table
Vegas based: RTT indicates congestion
–Caltech FAST TCP: quicker response to congestion, but ... (the window update rules are sketched below)
[Plot: Standard, Scalable and High Speed TCP window growth; crossover at cwnd = 38 pkts ~ 0.5 Mbits]
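To make the differences concrete, here is a rough sketch of the per-ACK and per-loss window updates. The Scalable TCP constants (a = 0.01, b = 0.125) are from Kelly's paper and the FAST rule follows the form in the Caltech papers, but the alpha and gamma values below are placeholder assumptions, and High Speed TCP's table (RFC 3649) is only summarized in a comment; none of this is the kernel code used in the tests.

```python
def reno(cwnd, loss):
    # Standard AIMD: grow ~1 packet per RTT (1/cwnd per ACK), halve on loss.
    return cwnd / 2.0 if loss else cwnd + 1.0 / cwnd

def scalable(cwnd, loss, a=0.01, b=0.125):
    # Scalable TCP: fixed fractional increase per ACK and a small
    # multiplicative decrease, so recovery time no longer grows with cwnd.
    return cwnd * (1.0 - b) if loss else cwnd + a

# High Speed TCP (RFC 3649) behaves like Reno below ~38 packets, then looks up
# larger increase factors a(w) and gentler decrease factors b(w) from a table
# as cwnd grows -- the 38 pkts ~ 0.5 Mbits crossover shown on the slide.

def fast(cwnd, base_rtt, rtt, alpha=200, gamma=0.5):
    # FAST TCP (delay based): move toward the window that keeps roughly
    # alpha packets queued in the path, independent of loss.
    target = (base_rtt / rtt) * cwnd + alpha
    return min(2 * cwnd, (1 - gamma) * cwnd + gamma * target)
```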
5 Typical testbed
[Diagram: testbed spanning Sunnyvale, Chicago, Amsterdam and Geneva (SNV-CHI-AMS-GVA, >10,000 km), with 2.5 Gbits/s and OC192/POS (10 Gbits/s) links between GSR and T640 routers; 12*2cpu servers and 4 disk servers at one site, 6*2cpu servers at Sunnyvale, and 6*2cpu servers with 4 disk servers at the other end. The Sunnyvale section was deployed for SC2002 (Nov 02) (EU+US).]
6 Testbed Collaborators and sponsors
Caltech: Harvey Newman, Steven Low, Sylvain Ravot, Cheng Jin, Xiaoling Wei, Suresh Singh, Julian Bunn
SLAC: Les Cottrell, Gary Buhrmaster, Fabrizio Coccetti
LANL: Wu-chun Feng, Eric Weigle, Gus Hurwitz, Adam Englehart
NIKHEF/UvA: Cees de Laat, Antony Antony
CERN: Olivier Martin, Paolo Moroni
ANL: Linda Winkler
DataTAG, StarLight, TeraGrid, SURFnet, NetherLight, Deutsche Telekom, Information Society Technologies
Cisco, Level(3), Intel
DoE, European Commission, NSF
7 Windows and Streams
It is well accepted that multiple streams (n) and/or big windows are important to achieve optimal throughput
Effectively reduces the impact of a loss by 1/n and improves the recovery time by 1/n (see the sketch below)
The optimum windows & streams change as the path changes (e.g. with utilization), so n is hard to optimize
Can be unfriendly to others
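A minimal sketch of why splitting one big-window flow into n parallel Reno flows softens a single loss, assuming each of the n flows carries 1/n of the bandwidth-delay product and standard Reno halving with ~1 packet per RTT regrowth (an idealization, not a measurement):

```python
def single_loss_effect(bdp_pkts, rtt_s, n_streams):
    """Aggregate window drop and recovery time when ONE stream sees a loss.
    Assumes the BDP is split evenly over n Reno streams."""
    per_stream_window = bdp_pkts / n_streams
    # Only the affected stream halves its window.
    aggregate_drop_fraction = (per_stream_window / 2) / bdp_pkts  # = 1/(2n)
    # Reno regrows ~1 packet per RTT, so the deficit refills in W/(2n) RTTs.
    recovery_secs = (per_stream_window / 2) * rtt_s
    return aggregate_drop_fraction, recovery_secs

bdp = 15000  # ~1 Gbits/s * 180 ms / 1500 B packets
for n in (1, 4, 16):
    drop, rec = single_loss_effect(bdp, 0.180, n)
    print("n=%2d  drop=%4.1f%% of aggregate  recovery ~%.0f s" % (n, 100 * drop, rec))
```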
8 Even with big windows (1MB) still need multiple streams with standard TCP
ANL, Caltech & RAL reach a knee (between 2 and 24 streams); above this the gain in throughput is slow
Above the knee, performance still improves slowly, perhaps because the large number of streams squeezes out other traffic and takes more than a fair share
9 Stock vs FAST TCP, MTU=1500B
Need to measure all parameters to understand the effects of parameters and configurations:
–Windows, streams, txqueuelen, TCP stack, MTU, NIC card
–A lot of variables
Examples of 2 TCP stacks:
–FAST TCP no longer needs multiple streams; this is a major simplification (reduces the number of variables to tune by 1)
[Plots: Stock TCP, 1500B MTU, 65 ms RTT; FAST TCP, 1500B MTU, 65 ms RTT]
10 TCP stacks with 1500B txqueuelen
11 Jumbo frames, new TCP stacks at 1 Gbits/s (SNV-GVA)
But:
–Jumbos are not part of the GE or 10GE standard
–Not widely deployed in end networks
12 Production network tests
[Diagram: paths from SLAC via Stanford, ESnet, CalREN, Abilene, SURFnet and APAN (OC12 to OC192 links) through Seattle, Chicago, Sunnyvale, Amsterdam and Geneva to the remote hosts. RTTs: Caltech 25 ms, APAN 147 ms, NIKHEF 158 ms, CERN 202 ms. Legend: host running "new" TCP, host running Reno TCP, remote host.]
All 6 hosts have 1GE interfaces (2 SLAC hosts send simultaneously)
Competing flows, no jumbos
13 High Speed TCP vs Reno – 1 stream
Checked Reno vs Reno from 2 hosts: very similar, as expected
2 separate SLAC hosts sending simultaneously to 1 receiver (2 iperf processes), 8MB window, TCP config pre-flushed, 1500B MTU (socket-level window setting sketched below)
Bursty RTT = congestion?
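The tests themselves were run with iperf; purely as an illustration of what the 8 MB window setting means at the socket level, a minimal sketch might look like the following. The helper, host and port are hypothetical, and on Linux the kernel may clamp the request to its net.core.wmem_max / rmem_max and tcp_wmem / tcp_rmem limits.

```python
import socket

WINDOW = 8 * 1024 * 1024  # 8 MB, as used in the SLAC-to-CERN tests

def connect_with_big_window(host, port):
    # Hypothetical helper: request large send/receive buffers BEFORE
    # connecting, so window scaling can be negotiated for them.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, WINDOW)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, WINDOW)
    s.connect((host, port))
    # Report what the kernel actually granted (it may be less, or doubled).
    print("granted SO_SNDBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    return s
```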
14 N.B. large RTT => congestion?
15 Large RTTs => poor FAST
16 Scalable vs multi-streams: SLAC to CERN, duration 60s, RTT 207ms, 8MB window
17 FAST & Scalable vs. multi-stream Reno (SLAC > CERN, ~230ms)
Reno 1 stream: 87 Mbits/s average; FAST 1 stream: 244 Mbits/s average
Reno 8 streams: 150 Mbits/s average; FAST 1 stream: 200 Mbits/s average
Bottleneck capacity 622 Mbits/s
For short durations the results are very noisy and hard to distinguish
Congestion events often synchronize
18 Scalable & FAST TCP with 1 stream vs Reno with n streams
19 Fairness: FAST vs Reno
1 stream, 16MB window, SLAC to CERN
Reno alone: 221 Mbits/s
FAST alone: 240 Mbits/s
Reno (45 Mbits/s) & FAST (285 Mbits/s) when competing
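The slides do not name a fairness metric; one common way to summarize numbers like these is Jain's fairness index, sketched below with the figures from this slide (1.0 means a perfectly even split, 1/n is the worst case).

```python
def jain_index(rates):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (1/n, 1]."""
    return sum(rates) ** 2 / (len(rates) * sum(r * r for r in rates))

print("Reno vs FAST competing:", round(jain_index([45, 285]), 2))  # ~0.65
print("Perfectly fair split:  ", jain_index([165, 165]))           # 1.0
```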
20 Summary (very preliminary)
With a single flow & an empty network:
–Can saturate 2.5 Gbits/s with standard TCP & jumbos
–Can saturate 1 Gbits/s with the new stacks & 1500B frames, or with standard TCP & jumbos
With a production network:
–FAST can take a while to get going
–Once going, FAST TCP with one stream looks good compared to multi-stream Reno
–FAST can back down early compared to Reno
–More work needed on fairness
Scalable:
–Does not look as good vs. multi-stream Reno
21 What's next?
Go beyond 2.5 Gbits/s
Disk-to-disk throughput & useful applications
–Need faster cpus (an extra ~60% MHz per Mbits/s over plain TCP for disk-to-disk; see the rough estimate below); understand how to use multi-processors
Further evaluate the new stacks with real-world links and other equipment
–Other NICs
–Response to congestion, pathologies
–Fairness
–Deploy for some major (e.g. HENP/Grid) customer applications
Understand how to make 10GE NICs work well with 1500B MTUs
Move from "hero" demonstrations to commonplace use
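Reading the slide's figure as the familiar rule of thumb of roughly 1 MHz of CPU per 1 Mbit/s of TCP throughput, with an extra ~60% on top for the disk-to-disk path (this reading is my assumption, not a statement from the slides), a back-of-the-envelope estimate of the CPU needed looks like this:

```python
def cpu_mhz_needed(rate_mbps, mhz_per_mbps=1.0, disk_overhead=0.60):
    """Very rough estimate: ~1 MHz per Mbit/s for TCP alone,
    plus ~60% more when disks are involved (assumed reading of the slide)."""
    return rate_mbps * mhz_per_mbps * (1.0 + disk_overhead)

for gbps in (1.0, 2.5, 10.0):
    print("%4.1f Gbits/s disk-to-disk -> ~%.1f GHz of CPU" %
          (gbps, cpu_mhz_needed(gbps * 1000) / 1000))
```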
22 More Information
10GE tests
–www-iepm.slac.stanford.edu/monitoring/bulk/10ge/
–sravot.home.cern.ch/sravot/Networking/10GbE/10GbE_test.html
TCP stacks
–netlab.caltech.edu/FAST/
–datatag.web.cern.ch/datatag/pfldnet2003/papers/kelly.pdf
Stack comparisons
–www-iepm.slac.stanford.edu/monitoring/bulk/fast/
–www-iepm.slac.stanford.edu/monitoring/bulk/tcpstacks/
23 Extras
24 FAST TCP vs. Reno – 1 stream
N.B. the RTT curve for Caltech shows why FAST performs poorly against Reno (too polite?)
25 Scalable vs. Reno - 1 stream 8MB windows, 2 hosts, competing
26 Other high speed gotchas
Large windows and large numbers of streams can cause the last stream to take a long time to close
Linux memory leak
Linux TCP configuration caching
What is the window size actually used/reported?
32-bit counters in iperf and routers wrap; need the latest releases with 64-bit counters (see the sketch below)
Effects of txqueuelen (the number of packets queued for the NIC)
Routers do not pass jumbos
Performance differs between drivers and NICs from different manufacturers
–May require tuning a lot of parameters
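As a concrete illustration of the counter-wrap gotcha, here is a quick calculation of how long a 32-bit byte counter lasts at these rates; which counters actually wrap depends on the iperf or router version, so treat the numbers as indicative only.

```python
def wrap_time_secs(rate_bps, counter_bits=32):
    """Seconds until a byte counter of the given width wraps at a given rate."""
    return (2 ** counter_bits) / (rate_bps / 8.0)

for gbps in (0.622, 1.0, 2.5, 10.0):
    print("%6.3f Gbits/s: 32-bit byte counter wraps after ~%.0f s" %
          (gbps, wrap_time_secs(gbps * 1e9)))
```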