Web100/Net100 at Oak Ridge National Lab Tom Dunigan August 1, 2002
UT-BATTELLE U.S. Department of Energy Oak Ridge National Laboratory
Web100 at ORNL
  Funding and goals
  Web100 tools and insights
    – Java bandwidth server
    – instrumented probes and log daemon
    – trace daemons
    – my favorite Web100 variables
  TCP tuning with Web100
    – tuning daemon (WAD)
    – tuning buffer sizes, slow-start, AIMD/VMSS, delayed ACK, reordering, parallel
  Web100 needs
Net100: developing network-aware operating systems
  DOE-funded (Office of Science) project ($1M/yr, 3 yrs beginning 9/01)
  Principal investigators
    – Matt Mathis, PSC
    – Brian Tierney, LBNL
    – Tom Dunigan, ORNL
      Florence Fowler
      Nagi Rao
  Objective:
    – measure and understand end-to-end network and application performance
    – tune network applications (grid and bulk transfer)
    – first-year emphasis: bulk transfer over high delay/bandwidth nets
  Components (leverage Web100)
    – Network Tool Analysis Framework (NTAF)
      tool design and analysis
      active network probes and passive sensors
      network metrics database
    – transport protocol analysis
    – tuning daemon (WAD) to tune network flows based on network metrics
Web100 tools
  Java applet bandwidth/client tester
    – measure in/out data rates
    – report flow characteristics
    – Try it
    – INSIGHTS: what happened, what you can expect
      from server log:
      – 25,755 flows
      – 53% with loss, 23% timeouts
  Post-transfer statistics
    – ttcp100/iperf100
    – Web100 daemon
      avoid modifying applications
      log designated paths/ports/variables
    – INSIGHTS: later...
Web100 tools
  Tracer daemon
    – collect Web100 variables at 0.1 second intervals
    – config file specifies
      source/port
      dest/port
      Web100 variables (current/delta)
    – log to disk with timestamp and CID
    – C and Python (LBL-based)
    – INSIGHTS:
      watch uninstrumented apps (GridFTP)
      analyze flow dynamics with plots (cwnd, ssthresh, re-xmits, RTT…)
      analyze tuned flows
      aggregate parallel flow data
  # traced config file
  #local lport remote rport
  #v=value d=delta
  d PktsOut
  d PktsRetrans
  v CurrentCwnd
  v SampledRTT
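The v/d markers in the traced config can be read as "log the current value" versus "log the change since the previous sample". The following is a minimal Python sketch of that idea, not the tracer daemon's actual code: `parse_spec` and `sample_row` are hypothetical helper names, the connection-selector line is omitted, snapshots are assumed to be plain dictionaries, and the numbers are made up for illustration.

```python
def parse_spec(lines):
    """Parse 'v Name' / 'd Name' lines, skipping '#' comments and blank lines."""
    spec = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, name = line.split()
        if kind not in ("v", "d"):
            raise ValueError(f"unknown marker {kind!r}")
        spec.append((kind, name))
    return spec


def sample_row(spec, prev, cur):
    """Build one log row: current value for 'v', change since prev for 'd'."""
    return [cur[name] if kind == "v" else cur[name] - prev[name]
            for kind, name in spec]


CONFIG = """\
# traced config file
#v=value d=delta
d PktsOut
d PktsRetrans
v CurrentCwnd
v SampledRTT
"""

spec = parse_spec(CONFIG.splitlines())

# Hypothetical snapshots taken 0.1 s apart (illustrative numbers only).
prev = {"PktsOut": 1000, "PktsRetrans": 2, "CurrentCwnd": 131400, "SampledRTT": 80}
cur = {"PktsOut": 1050, "PktsRetrans": 3, "CurrentCwnd": 146000, "SampledRTT": 81}

row = sample_row(spec, prev, cur)
print(row)
```

Logging counters such as PktsOut as deltas gives per-interval rates directly, while gauges such as CurrentCwnd are only meaningful as instantaneous values.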
My favorite Web100 variables
  Post-transfer
    – CurrentMSS/Timeouts: PIX firewall problems
    – RetransThresh: out of order packets
    – MaxCwnd/MaxSsthresh: path capacity, Linux 2.4 caching
    – MinRTT/MaxRTT/*RTO: queuing, bandwidth-delay
    – SendStall/OtherReductions: Linux 2.4 slowdowns
    – MaxRwinRcvd/Sndbuf: buffer limits, Web100 wscale clamp
    – CongestionSignals/PacketsRetrans: loss intensity
    – SndLimTime*: bottleneck
  Dynamic
    – CongestionSignals/PacketsRetrans/CurrentCwnd: type of loss, when (ss)
    – SampledRTT: queuing delays
    – CurrentSsthresh/PktsOut: recovery, timeouts
    – CurrentRwinRcvd: Linux 2.4 window advertisement
PIX SACK problem
  Web100 reports timeouts into ORNL, not at other sites ??
  Theory 1: yet another Linux 2.4 TCP feature
    our TCP-over-UDP: no timeouts
  Tcpdump/tcptrace/xplot of flow both inside and outside ORNL
  ? Tcptrace bug -- SACK blocks wrong for one of the dumps… NOT.
  ORNL PIX firewall randomizes TCP sequence numbers, but fails to adjust SACK blocks
  RESULT: TCP timeouts
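To illustrate why unadjusted SACK blocks are useless to the sender, here is a small Python sketch of sequence-number translation by a middlebox. It is an assumed model, not the PIX implementation: the offset value, the sequence numbers, and the helper names `translate` and `in_window` are all hypothetical. The likely consequence, that a sender which cannot use the SACK blocks falls back to timeouts, is consistent with the result reported on the slide.

```python
MOD = 2**32  # TCP sequence numbers are 32-bit and wrap around


def translate(seq, offset):
    """Shift a sequence number by a per-connection offset, modulo 2**32."""
    return (seq + offset) % MOD


def in_window(seq, snd_una, snd_nxt):
    """True if seq lies in [snd_una, snd_nxt] under 32-bit wraparound."""
    return (seq - snd_una) % MOD <= (snd_nxt - snd_una) % MOD


OFFSET = 0x5A5A1234             # hypothetical randomization offset
snd_una, snd_nxt = 1000, 21000  # sender's own view of its outstanding data

# The receiver sees shifted sequence numbers and SACKs a block in that space.
sack_from_receiver = (translate(5000, OFFSET), translate(8000, OFFSET))

# A correct middlebox undoes the offset on the SACK edges as well.
fixed = tuple(translate(edge, -OFFSET) for edge in sack_from_receiver)

# The faulty behavior: SACK edges are passed through untouched.
buggy = sack_from_receiver

print(all(in_window(e, snd_una, snd_nxt) for e in fixed))  # edges usable
print(any(in_window(e, snd_una, snd_nxt) for e in buggy))  # edges unusable
```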
TCP tuning with Web100+/Net100
  Path characterization (NTAF)
    – both active and passive measurement
    – database of measurement data
    – NTAF/Web100 hosts at PSC, NCAR, LBL, ORNL
  Application tuning (tuning daemon, WAD)
    – Web100 extensions
      disable Linux 2.4 caching/SendStall
      event notification
      more tuning options
    – daemon tunes application at start up
      static tuning information
      query NTAF and calculate optimum TCP parameters
    – dynamically tune application (Web100 feedback)
      adjust parameters during flow
      split optimum among parallel flows
  Transport protocol optimizations
    – what to tune?
    – is it fair? stable?
Net100 TCP tuning
  TCP performance
    – reliable/stable/fair
    – need buffer = bandwidth*RTT
      ORNL/NERSC (80 ms, OC12) need 6 MB
    – TCP slow-start and loss recovery proportional to MSS/RTT
      slow on today’s high delay/bandwidth paths
    – TCP is lossy by design
  TCP tuning
    – set optimal (?) buffer size
    – avoid losses
      modified slow-start
      reduce bursts
      anticipate (Vegas?) loss
      reorder threshold
    – speed recovery
      bigger MTU or “virtual MSS”
      modified AIMD (0.5,1)
      delayed ACKs and initial window
  ns simulation: 500 Mb/s link, 80 ms RTT
    Packet loss early in slow start. Standard TCP with delayed ACKs takes 10 minutes to recover!
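The buffer rule (buffer = bandwidth*RTT) can be checked with a few lines of Python. This sketch assumes the nominal OC-12 line rate of 622.08 Mb/s; usable payload bandwidth is somewhat lower, so the result is an upper bound. The function name `bdp_bytes` is only illustrative.

```python
OC12_BPS = 622.08e6  # nominal OC-12 line rate, bits per second


def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes: data in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


buf = bdp_bytes(OC12_BPS, 0.080)  # ORNL/NERSC: 80 ms RTT
print(f"{buf / 1e6:.1f} MB")      # about 6 MB, matching the slide
```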
Net100 TCP tuning
  Work-around Daemon (WAD)
    – tune unknowing sender/receiver at startup and/or during flow
    – Web100 kernel extensions
      uses netlink to alert daemon of socket open/close
      besides existing Web100 buffer tuning, new code and WAD_* variables
      knobs to disable Linux 2.4 caching and SendStall
    – config file with static tuning data
      mode specifies dynamic tuning (Floyd AIMD, NTAF buffer size, concurrent streams)
    – daemon periodically polls NTAF for fresh tuning data
    – written in C (LBL has Python version)
  WAD config file
    [bob]
    src_addr:
    src_port: 0
    dst_addr:
    dst_port: 0
    mode: 1
    sndbuf:
    rcvbuf:
    wadai: 6
    wadmd: 0.3
    maxssth: 100
    divide: 1
    reorder: 9
    delack: 0
    floyd: 1
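The WAD config file looks INI-like, so a reader experimenting with similar files could load one with Python's standard `configparser`. This is a sketch under that assumption only; the real WAD is written in C and its parser may differ. The function name `load_wad_config` is hypothetical, all numeric fields are treated as plain integers or floats, and fields that are blank on the slide are kept as None rather than guessed.

```python
import configparser

WAD_CONF = """\
[bob]
src_addr:
src_port: 0
dst_addr:
dst_port: 0
mode: 1
sndbuf:
rcvbuf:
wadai: 6
wadmd: 0.3
maxssth: 100
divide: 1
reorder: 9
delack: 0
floyd: 1
"""


def load_wad_config(text):
    """Parse an INI-style WAD config into {flow_name: settings} (sketch only)."""
    parser = configparser.ConfigParser(delimiters=(":",))
    parser.read_string(text)
    flows = {}
    for name in parser.sections():
        sec = parser[name]
        flows[name] = {
            "src_port": sec.getint("src_port"),
            "dst_port": sec.getint("dst_port"),
            "mode": sec.getint("mode"),
            "wadai": sec.getint("wadai"),      # additive increase, segments per RTT
            "wadmd": sec.getfloat("wadmd"),    # multiplicative decrease factor
            "maxssth": sec.getint("maxssth"),
            "divide": sec.getint("divide"),
            "reorder": sec.getint("reorder"),
            "delack": sec.getint("delack"),
            "floyd": sec.getint("floyd"),
            # Blank on the slide: keep as None instead of inventing values.
            "src_addr": sec.get("src_addr") or None,
            "dst_addr": sec.get("dst_addr") or None,
            "sndbuf": sec.get("sndbuf") or None,
            "rcvbuf": sec.get("rcvbuf") or None,
        }
    return flows


flows = load_wad_config(WAD_CONF)
print(flows["bob"]["wadai"], flows["bob"]["wadmd"])
```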
WAD tuning results (your mileage may vary …)
  Classic buffer tuning: ORNL to PSC, OC12, 80 ms RTT
    network-challenged app. gets 10 Mb/s
    same app., WAD/NTAF tuned buffer gets 143 Mb/s
  Virtual MSS
    tune TCP’s additive increase (WAD_AI)
    add k segments per RTT during recovery
    k=6 like GigE jumbo frame
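A back-of-the-envelope model shows why a larger additive increase speeds recovery. The Python sketch below assumes linear growth of `ai` segments per RTT from a one-segment window up to the full bandwidth-delay product, 1500-byte segments, and delayed ACKs modeled as halving the increase to 0.5 segments per RTT. These are simplifying assumptions, not the ns simulation from the talk, and `recovery_seconds` is an illustrative name.

```python
def recovery_seconds(bandwidth_bps, rtt_s, mss_bytes=1500, ai=1.0, start_segments=1):
    """Time for linear window growth to reach the bandwidth-delay product."""
    full_window = bandwidth_bps * rtt_s / 8 / mss_bytes  # window in segments
    rtts = max(0.0, (full_window - start_segments) / ai)
    return rtts * rtt_s


# 500 Mb/s link, 80 ms RTT (the simulated path mentioned earlier in the talk).
for label, ai in [("delayed ACKs", 0.5), ("standard AI", 1.0), ("virtual MSS, k=6", 6.0)]:
    t = recovery_seconds(500e6, 0.080, ai=ai)
    print(f"{label:18s} {t / 60:5.1f} min")

# k=6 standard 1500-byte segments add 9000 bytes per RTT, one jumbo frame.
jumbo_equivalent = 6 * 1500
```

Under these assumptions the delayed-ACK case comes out at roughly nine minutes, the same order as the ten minutes reported from simulation, and k=6 cuts the standard recovery time by a factor of six.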
WAD tuning
  Modified slow-start and AI
    ORNL to NERSC, OC12, 80 ms RTT
    often losses in slow start
    WAD tuned Floyd slow-start (WAD_MaxThresh) and AI (6)
  WAD tuned AIMD and slow start
    ORNL to CERN, OC12, 150 ms RTT
    parallel streams AIMD (1/(2k),k)
    WAD tune single stream (0.125,4) WAD_MD
  Can tuned single stream compete with parallel streams?
    pre-tune Floyd AIMD or dynamically adjust
    tune concurrent flows -- subdivide buffer
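The (1/(2k),k) rule follows from treating k standard (0.5,1) streams as one aggregate: together they add k segments per RTT, and a single loss halves only one stream, which shrinks the aggregate window by 1/(2k) when the streams are equal. A minimal Python sketch of that mapping, with `aggregate_aimd` as an illustrative name:

```python
def aggregate_aimd(k):
    """Return (multiplicative decrease, additive increase) emulating k parallel streams."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 / (2 * k), float(k)


md, ai = aggregate_aimd(4)
print(md, ai)  # the single-stream setting (0.125, 4) from the slide

# One loss event on an aggregate window of 4000 segments:
window = 4000.0
after_loss = window * (1 - md)  # single tuned stream loses 1/8 of its window
```

With k=1 the mapping reduces to standard TCP's (0.5,1).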
Net100 TCP tuning
  Reorder threshold
    seeing more out of order packets
    WAD tune a bigger reorder threshold
    Linux 2.4 does a good job already
    LBL to ORNL (using our TCP-over-UDP): dup3 case had 289 retransmits, but all were unneeded!
  WAD could turn off delayed ACKs -- 2x improvement in recovery rate and slow-start
    Linux 2.4 already turns off delayed ACKs for initial slow-start
  WARNING: could be unfair, probably stable; use only on intranet
  Web100 has proven very useful for experimenting with TCP tuning options.
Futures
  Net100
    – analyze effectiveness of current tuning options
    – NTAF probes -- characterizing a path to tune a flow
    – additional tuning algorithms
    – parallel/multipath selection/tuning
    – WAD-to-WAD tuning
  Web100 extensions
    – Web100 trace files -- log all data efficiently
    – variable for count of duplicate data segments at receiver
    – remove wscale restriction