1
Characterization and Evaluation of TCP and UDP-based Transport on Real Networks
Les Cottrell, Saad Ansari, Parakram Khandpur, Ruchi Gupta, Richard Hughes-Jones, Michael Chen, Larry McIntosh, Frank Leers
SLAC, Manchester University, Chelsio and Sun
Protocols for Fast Long Distance Networks, Lyon, France, February 2005
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP
2
Project goals
- Evaluate various techniques for achieving high bulk throughput on fast, long-distance, real production WAN links
- Compare & contrast: ease of configuration, throughput, convergence, fairness, stability etc., for different RTTs
- Recommend “optimum” techniques for data-intensive science (BaBar) transfers using bbftp, bbcp, GridFTP
- Validate simulator & emulator findings & provide feedback
Problems with real networks:
- They do not stay still: routes change, links are upgraded, pings are blocked, hosts go down, hosts have configurations changed, routers are misconfigured and/or crash
- Must avoid impacting production traffic; need agreement of remote administrators and/or network folks
- Many of the stacks are alpha or beta releases and live in the kernel: kernels crash, the developer wants to add new features while the measurer wants stability, or previous measurements must be thrown out
- Some stacks were not available in time (Westwood+, GridDT)
3
Techniques rejected
- Jumbo frames: not an IEEE standard, may break some UDP applications, not supported on the SLAC LAN
- Dynamic Right Sizing (DRS): we use sender modifications only, since the HENP model is a few big senders and lots of smaller receivers, which simplifies deployment (only a few hosts at a few sending sites)
- Router modifications (XCP/ECN): we run on production networks
4
Software Transports
Advanced TCP stacks
- To overcome the AIMD congestion behavior of Reno-based TCPs
- BUT: SLAC “datamovers” are all Solaris-based, while the advanced TCPs are currently Linux-only
- SLAC production systems people are concerned about non-standard kernels and about keeping TCP patches current with security patches for the SLAC-supported Linux version
- So also very interested in a transport that runs in user space (no kernel mods): evaluate UDT from the UIC folks
5
Hardware Assists
- For 1 Gbit/s paths, CPU, bus etc. are not a problem; for 10 Gbit/s they are more important
- NIC assistance to the CPU is becoming popular:
  - Checksum offload
  - Interrupt coalescence
  - Large send/receive offload (LSO/LRO)
  - TCP Offload Engine (TOE)
- Several vendors for 10 Gbit/s NICs, at least one for 1 Gbit/s NICs
- But TOE currently restricts one to the NIC vendor’s TCP implementation
- Most focus is on the LAN: a cheap alternative to InfiniBand, Myrinet etc.
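As a rough illustration (not part of the original study), the sketch below shells out to ethtool on a Linux host to see which of the offloads listed above a NIC/driver advertises. The interface name is a placeholder, and it assumes ethtool is installed.

```python
import subprocess

# Sketch: query NIC offload settings on a Linux host via ethtool.
# "eth0" is a placeholder for the 1G/10G interface under test.
IFACE = "eth0"
out = subprocess.run(["ethtool", "-k", IFACE],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    # e.g. "rx-checksumming: on", "tcp-segmentation-offload: on",
    #      "large-receive-offload: off"
    if any(k in line for k in ("checksum", "segmentation-offload",
                               "large-receive-offload")):
        print(line.strip())
```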
6
Protocols Evaluated
TCP (implementations as of April 2004):
- Linux 2.4 New Reno with SACK: single and parallel streams (Reno)
- Scalable TCP (Scalable)
- Fast TCP
- HighSpeed TCP (HSTCP)
- HighSpeed TCP Low Priority (HSTCP-LP)
- Binary Increase Control TCP (BICTCP)
- Hamilton TCP (HTCP)
- Layering TCP (LTCP)
UDP:
- UDT v2
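As a side note not taken from the slides: the stacks above required patched 2.4 kernels at the time, whereas modern Linux kernels let you pick the congestion control at run time. A minimal sketch of the per-socket mechanism, assuming the requested module (here BIC) is built, loaded, and permitted on the host:

```python
import socket

# Choose a congestion control algorithm per socket (Linux only;
# socket.TCP_CONGESTION is available in Python 3.6+). Assumes "bic" is
# loaded and listed in net.ipv4.tcp_allowed_congestion_control.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bic")
# Read back the algorithm actually in effect for this socket.
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16).strip(b"\x00"))
```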
7
Methodology (1 Gbit/s)
- Chose 3 paths from SLAC: Caltech (10 ms), Univ. of Florida (80 ms), CERN (180 ms)
- Used iperf/TCP and UDT/UDP to generate traffic (see the sketch after this list)
- Each run was 16 minutes, in 7 regions
- [Slide diagram: iperf or UDT at SLAC sending TCP/UDP through the bottleneck to Caltech/UFL/CERN, plus 1/s ICMP ping traffic; region lengths of 4 min and 2 min]
- Slow start lasts ~4-6 seconds, so for a 2-minute region ~95% of the measurement is after initial slow start
- iperf and UDT report incremental throughputs at one-second intervals
- txqueuelen = 1000, apart from Fast where it is 100
- Each run repeated 3-5 times at different times
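A minimal sketch of how one such run might be driven, assuming iperf2 on the measurement host; the remote host name is a placeholder, and the scheduling of flows joining and leaving across the 7 regions is left out.

```python
import subprocess, time

# Hypothetical driver for a single 16-minute iperf/TCP run (standard iperf2
# flags). Adding/removing flows per region would be handled by launching
# further iperf processes on a schedule.
HOST = "remote.example.org"      # stand-in for the Caltech/UFL/CERN host
RUN_SECONDS = 16 * 60            # one 16-minute run
cmd = ["iperf", "-c", HOST,
       "-t", str(RUN_SECONDS),   # run length in seconds
       "-i", "1",                # report incremental throughput every second
       "-f", "m"]                # report in Mbit/s
with open("iperf_%d.log" % int(time.time()), "w") as log:
    subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)
```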
8
Behavior Indicators
- Achievable throughput
- Stability: S = σ/μ (standard deviation / average)
- Intra-protocol fairness F (the formula appears as an image on the slide; see the sketch below)
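The fairness formula did not survive transcription, so the sketch below computes S as defined above and, as an assumption, uses the standard Jain fairness index F = (Σxᵢ)² / (n·Σxᵢ²) for intra-protocol fairness.

```python
import statistics

def stability(samples):
    """S = sigma/mu over the per-second throughput reports of one flow/region."""
    return statistics.pstdev(samples) / statistics.mean(samples)

def fairness(per_flow_throughputs):
    """Intra-protocol fairness. The slide's formula is an image, so this
    assumes the usual Jain index: F = (sum x_i)^2 / (n * sum x_i^2)."""
    n = len(per_flow_throughputs)
    return (sum(per_flow_throughputs) ** 2
            / (n * sum(x * x for x in per_flow_throughputs)))

# Example: two flows averaging 450 and 300 Mbit/s give F ~ 0.96.
print(fairness([450.0, 300.0]))
```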
9
Behavior wrt RTT
- 10 ms (Caltech): throughput, stability S (small is good), minimum fairness F over regions 2 thru 6 (closer to 1 is better)
  - Excluding FAST: ~720±64 Mbps, S ~ 0.18±0.04, F ~ 0.95
  - FAST: ~400±120 Mbps, S = 0.33, F ~ 0.88
- 80 ms (U. Florida): all ~350±103 Mbps, S = 0.3±0.12, F ~ 0.82
- 180 ms (CERN): all ~340±130 Mbps, S = 0.42±0.17, F ~ 0.81
The stability and fairness effects are more manifest at longer RTT, so we focus on CERN.
10
Reno single stream
- Low performance on fast long-distance paths
- AIMD: add a = 1 packet to cwnd per RTT, decrease cwnd by a factor b = 0.5 on congestion (see the sketch below)
- Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput
- Remaining flows do not take up the slack when a flow is removed
- Multiple streams increase the recovery rate
- Congestion has a dramatic effect; recovery is slow
- RTT increases when it achieves best throughput
- [Plot: SLAC to CERN, stacked graphs, 1-second iperf measurements smoothed to 5-second intervals]
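A toy sketch of the AIMD rule above and why recovery is so slow on a path like SLAC-CERN; the 1 Gbit/s, 180 ms, 1500-byte figures are taken from the setup described earlier, and the rest is schematic (no slow start, byte counting, or SACK recovery).

```python
# Reno's AIMD dynamics with the slide's parameters:
# additive increase a = 1 segment per RTT, multiplicative decrease b = 0.5.
def aimd_step(cwnd, loss, a=1.0, b=0.5):
    """One RTT of congestion avoidance: halve on loss, else add one segment."""
    return cwnd * b if loss else cwnd + a

# Why recovery is slow on a fast long path (~180 ms, 1 Gbit/s, 1500 B MSS):
rtt_s, link_bps, mss_bytes = 0.180, 1e9, 1500
bdp_segments = link_bps * rtt_s / (mss_bytes * 8)   # ~15,000 segments
cwnd = aimd_step(bdp_segments, loss=True)           # a single loss halves cwnd
rtts_to_recover = bdp_segments - cwnd               # +1 segment per RTT
print(rtts_to_recover * rtt_s / 60, "minutes to regain the full window")  # ~22 min
```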
11
Fast
- Also uses RTT to detect congestion (see the sketch below)
- RTT is very stable: σ(RTT) ~ 9 ms vs 37±0.14 ms for the others
- 2nd flow never gets an equal share of the bandwidth
- Big drops in throughput, which take several seconds to recover from
- The stable RTT suggests it is friendly to other flows, since it does not produce congestion (as measured by RTT)
- [Plot: SLAC-CERN]
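For context, a simplified version of the delay-based window update used by FAST-like stacks, following the published FAST TCP rule; α and γ here are illustrative placeholders, not the values used in these tests.

```python
# Delay-based window update (simplified from the published FAST TCP rule):
# the window grows while the measured RTT stays close to the propagation
# delay (base_rtt), and shrinks as queueing delay builds up -- congestion is
# sensed from RTT, not from packet loss.
def fast_update(w, base_rtt, rtt, alpha=200.0, gamma=0.5):
    target = (base_rtt / rtt) * w + alpha
    return min(2 * w, (1 - gamma) * w + gamma * target)

# Example: with no queueing (rtt == base_rtt) the window grows by ~alpha*gamma;
# with heavy queueing it converges down toward a smaller window.
print(fast_update(10000, base_rtt=0.170, rtt=0.172))
```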
12
HTCP
- One of the best performers: throughput is high
- Big effects on RTT when it achieves best throughput
- Flows share equally
- Appears to need >1 flow to achieve best throughput
- Other runs with HTCP show similar behavior on two flows: the two flows share equally
- [Plot: SLAC-CERN]
13
BICTCP Needs > 1 flow for best throughput
14
UDTv2
- Similar behavior to the better TCP stacks
- RTT very variable at best throughputs
- Intra-protocol sharing is good
- Behaves well as flows are added & removed
15
Overall

Proto      Avg thru (Mbps)   S (σ/μ)   min(F)   σ(RTT) ms   MHz/Mbps
Scalable   423±115           0.27      0.83     22          0.64
BIC        412±117           0.28      0.98     55          0.71
HTCP       402±113           --        0.99     57          0.65
UDT        390±136           0.35      0.95     49          1.2
LTCP       376±137           0.36      0.56     41          0.67
Fast       335±110           0.33      0.58     9           0.66
HSTCP      255±187           0.73      0.79     25          0.9
Reno       248±163           0.6       0.63     --          --
HSTCP-LP   228±114           0.5       --       33          --

- Scalable is one of the best, but its inter-protocol fairness is poor (see Bullot et al.)
- BIC & HTCP are about equal
- UDT is close, BUT CPU intensive (it used to be much worse, by a factor of 10)
- Fast gives low RTT values & variability
- All TCP protocols use similar CPU (HSTCP looks poor because its throughput is low)
16
10 Gbps tests
- At SC2004, using two 10 Gbps dedicated paths between Pittsburgh and Sunnyvale
- Using Solaris 10 (build 69) and Linux 2.6
- On Sunfire Vx0z (dual & quad 2.4 GHz 64-bit AMD Opterons) with PCI-X 133 MHz, 64 bit
- Only 1500-byte MTUs
- Measured achievable performance limits (using iperf)
- Reno TCP (multi-flows) vs UDTv2; TOE (Chelsio) vs no TOE (S2io)
17
Results
- UDT limit was ~4.45 Gbit/s (CPU limited)
- TCP limit was about 7.5±0.07 Gbps, regardless of:
  - Whether LAN (back to back) or WAN (the WAN used a 2 MB window & 16 streams)
  - Whether Solaris 10 or Linux 2.6
  - Whether S2io or Chelsio NIC
- Gating factor: PCI-X
  - Raw bandwidth is 8.53 Gbps, but transfers are broken into segments to allow interleaving
  - E.g. with a max memory read byte count of 4096 bytes, the Intel Pro/10GbE LR NIC limit is 6.83 Gbit/s
- One host with 4 CPUs & 2 NICs sent 11.5±0.2 Gbps to two dual-CPU hosts with 1 NIC each
- Two hosts to two hosts (1 NIC/host): 9.07 Gbps goodput forward & 5.6 Gbps reverse
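A back-of-envelope check of the PCI-X numbers above; the efficiency figure is simply the ratio of the quoted limits, not an independent measurement.

```python
# PCI-X 133 MHz x 64-bit raw bandwidth, and the effective bus efficiency
# implied by the quoted 6.83 Gbit/s limit with a 4096-byte max memory read
# byte count (ratio derived from the slide's numbers).
bus_hz, bus_width_bits = 133.33e6, 64
raw_gbps = bus_hz * bus_width_bits / 1e9      # ~8.53 Gbit/s raw
quoted_limit_gbps = 6.83                      # Intel Pro/10GbE LR, MMRBC = 4096 B
print("raw: %.2f Gbit/s, efficiency: %.0f%%"
      % (raw_gbps, 100 * quoted_limit_gbps / raw_gbps))   # ~80%
```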
18
TCP CPU Utilization
- CPU power is important; each CPU = 2.4 GHz
- Throughput increases with the number of flows
- Utilization is not a linear function of throughput; it depends on the number of flows too
- Normalize as GHz/Gbps (see the sketch below)
- [Plot: Chelsio + TOE + Linux 2.6.6 vs S2io + checksum offload + Solaris 10]
- S2io supports LSO, but Solaris 10 did not, so it was not used
- Microsoft reports 0.017 GHz/Gbps with Windows + S2io/LSO, 1 flow
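How that figure of merit is computed, as a sketch; the exact accounting used in the measurements (e.g. per-core vs whole-host utilization) may differ, and the example numbers are illustrative, not results from the tests.

```python
# Normalize CPU cost as GHz consumed per Gbit/s delivered.
def ghz_per_gbps(cpu_utilization, n_cpus, cpu_ghz, throughput_gbps):
    """CPU cycles consumed per unit of delivered bandwidth."""
    return cpu_utilization * n_cpus * cpu_ghz / throughput_gbps

# Illustrative only: two 2.4 GHz CPUs at 60% utilization moving 7.5 Gbit/s.
print(ghz_per_gbps(0.60, 2, 2.4, 7.5), "GHz/Gbps")   # ~0.38
```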
19
Conclusions
- Need testing on real networks
  - Controlled simulation & emulation are critical for understanding, BUT results need verifying and can look different than expected (e.g. Fast)
  - Most important for transoceanic paths
- UDT looks promising, but still needs work for >6 Gbit/s
- Need to evaluate the various offloads (TOE, LSO ...)
- Need to repeat inter-protocol fairness vs Reno
- New buses are important; need NICs to support them, then evaluate
20
Further Information
- Web site with lots of plots & analysis
- Inter-protocol comparison (Journal of Grid Comp, PFLD04)
- SC2004 details: www-iepm.slac.stanford.edu/monitoring/bulk/sc2004/