Sustained Wide-Area TCP Memory Transfers Over Dedicated Connections
Nagi Rao, Oak Ridge National Laboratory, raons@ornl.gov
Don Towsley, Gayane Vardoyan, University of Massachusetts
Brad Settlemyer, Los Alamos National Laboratory
Ian Foster, Raj Kettimuthu, Argonne National Laboratory
IEEE International Symposium on High Performance and Smart Computing
August 26, 2015, New York
Sponsored by the U.S. Department of Defense and the U.S. Department of Energy
Outline
- Motivation and Background
- Throughput Measurements
- Emulation Testbed
- Throughput Analytical Models
  - TCP over dedicated connections and small memory transfers
  - Monotonicity of throughput
  - Concavity Analysis
- Conclusions
Background
- A class of HPC applications requires memory transfers over wide-area connections
  - computational monitoring of code running on a supercomputer
  - coordinated computations on two remote supercomputers
  - dedicated network connections are increasingly deployed, also in the commercial space (big data)
- Network transfers: current models and experiments are mainly driven by Internet connections
  - shared connections: losses due to other traffic
  - analytical models: complicated, to handle "complex" traffic
- Memory transfers over dedicated connections
  - limited measurements: most common measurements are over the Internet
  - analytical models: the majority are based on losses (internal and external)
  - somewhat "easier" environment to collect measurements: dedicated connections are easier to emulate/simulate
TCP Throughput Models and Measurements
- Existing TCP memory transfer models
  - the congestion avoidance mode is prominent: throughput derived using loss rates; slow start is often "approximated out"
  - round trip time appears in the denominator (with a coefficient)
  - the loss rate parameter appears in the denominator
- For small datasets over dedicated connections:
  - there are very few, often no, losses
  - models with a loss rate parameter in the denominator become unstable
- Summary of our measurements: 10 Gbps emulated connections
  - five TCP versions: Reno, Scalable TCP, Hamilton TCP, Highspeed TCP, CUBIC (default in Linux)
  - range of rtt: 0-366 ms; cross-country connections: ~100 ms
- Observations:
  - throughput depends critically on rtt
  - different TCP versions are effective in different rtt ranges
  - throughput profiles have two modes, in sharp contrast to the well-known convex profile
    - convex: longer rtt
    - concave: shorter to medium rtt
TCP Throughput Profiles
- Most common TCP throughput profile: a convex function of rtt
  - example, Mathis et al.: throughput at rtt tau with loss rate p scales as T(tau) ~ (MSS / tau) * (C / sqrt(p)) (numeric illustration below)
- Observed dual-mode profiles in throughput measurements (CUBIC, Scalable TCP):
  - smaller RTT: concave region
  - larger RTT: convex region
[Figure: measured throughput (Gbps) vs. RTT (ms) for CUBIC and Scalable TCP, showing the concave region at smaller RTT and the convex region at larger RTT]
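To make the behavior of loss-based profiles concrete, the minimal sketch below evaluates a Mathis-style estimate; the MSS value and constant C are illustrative assumptions, not values from the talk. It shows the convex decay in rtt and the divergence as the loss rate approaches zero, which is why such models break down on dedicated connections with almost no losses.

```python
import math

def mathis_throughput_gbps(rtt_s, loss_rate, mss_bytes=1500, c=1.22):
    """Mathis-style loss-based estimate T ~ (MSS/rtt) * (C/sqrt(p)), in Gbps.

    Illustrative only: MSS and the constant C are assumptions, not measured values.
    """
    bytes_per_sec = (mss_bytes / rtt_s) * (c / math.sqrt(loss_rate))
    return bytes_per_sec * 8 / 1e9

# Convex decay in rtt at a fixed loss rate (rtt in seconds, spanning 10-366 ms):
print([round(mathis_throughput_gbps(r, 1e-4), 3) for r in (0.01, 0.05, 0.1, 0.2, 0.366)])

# The estimate diverges as the loss rate approaches zero (dedicated connections):
print(round(mathis_throughput_gbps(0.1, 1e-12), 1))   # far above any 10 Gbps capacity
```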
Desired Features of the Concave Region
- A concave region is very desirable
  - throughput does not decay as fast
  - the rate of decrease slows down as rtt increases
- A function is concave iff its derivative is non-increasing; this is not satisfied by the Mathis model (a short check follows this slide)
- Measurements: throughput profiles for rtt 0-366 ms
  - concavity appears in the small-rtt region
  - only for some TCP versions: CUBIC, Hamilton TCP, Scalable TCP
  - not for others: Reno, Highspeed TCP
  - these congestion control variants are loadable Linux kernel modules
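As a quick check of the convexity claim (standard calculus, not taken from the slides), write the Mathis-style profile as a function of rtt and differentiate twice, with segment size M, constant C, and loss rate p held fixed:

```latex
% Mathis-style profile as a function of rtt \tau:
T(\tau) = \frac{M\,C}{\tau\sqrt{p}}, \qquad
T'(\tau) = -\frac{M\,C}{\tau^{2}\sqrt{p}} < 0, \qquad
T''(\tau) = \frac{2\,M\,C}{\tau^{3}\sqrt{p}} > 0 .
```

Since T'' > 0 for all tau > 0, the loss-based profile is strictly convex in rtt and cannot reproduce the concave small-rtt region seen in the measurements.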
Our Contributions
- An analytical model that captures the concave region
  - for smaller datasets over dedicated connections there are very few, often no, losses
  - models with a loss rate parameter in the denominator become unstable
  - transfers last beyond slow start (e.g., unlike Mellia et al. 2002)
- Result 1: throughput is a decreasing function of RTT
- Result 2: robust slow start leads to a concave profile for smaller rtt
- Observations from measurements:
  - throughput depends critically on rtt
  - different TCP versions are effective in different rtt ranges
  - throughput profiles have two modes, in sharp contrast to the well-known convex profile of TCP
    - convex: longer rtt
    - concave: shorter to medium rtt
- Practical solution: choose the TCP version for the connection rtt
  - five TCP versions: Reno, Scalable TCP, Hamilton TCP, Highspeed TCP, CUBIC (default in Linux)
  - load the corresponding kernel module in Linux (a per-socket selection sketch follows this slide)
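On Linux, once the corresponding module is loaded (e.g., tcp_htcp, tcp_highspeed), the congestion control algorithm can also be selected per socket. A minimal Python sketch; the host, port, and chosen algorithm are assumptions for illustration:

```python
import socket

def connect_with_cc(host, port, cc="htcp"):
    """Open a TCP connection using a specific congestion control algorithm.

    Requires Linux with the matching module loaded (e.g., 'modprobe tcp_htcp');
    available names are listed in /proc/sys/net/ipv4/tcp_available_congestion_control.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_CONGESTION takes the algorithm name as bytes (Linux-specific socket option).
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, cc.encode())
    s.connect((host, port))
    return s

# Hypothetical usage: pick Hamilton TCP for a short-rtt dedicated connection.
# sock = connect_with_cc("remote.example.org", 5001, cc="htcp")
```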
Multiple Multi-Bus Host-to-Host Memory/Disk/File Transfers
[Diagram: host-to-host transfer path through NICs, LAN switches, and routers, with hosts also attached via HBAs/FCAs to storage switches, storage controllers, and disk arrays]
- memory-to-memory measurements: iperf
- disk/file profiles: xddprof
Throughput Measurements: 10GigE ANUE Connection Emulator
- Hosts: bohr04 and bohr05, HP 48-core 4-socket Linux hosts, connected through an ANUE 10GigE emulator
- Connection emulation:
  - latency: 0-800 ms
  - segment loss: 0.1% with Periodic, Poisson, Gaussian, or Uniform distributions (not used here)
  - physical packets are sent to/from the hosts and delayed inside the emulator for the chosen rtt
  - more accurate than simulators
- Measurements: iperf TCP transfers between bohr04 and bohr05 over a 10 Gbps emulated connection with configurable RTT and loss rate (a software-emulation sketch follows this slide)
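The testbed uses a hardware ANUE emulator; where such hardware is not available, a similar rtt sweep can be approximated in software. A minimal sketch driving Linux netem and iperf from Python (requires root; the interface name, server address, and sweep values are assumptions):

```python
import subprocess

RTTS_MS = [0, 11, 22, 45, 91, 183, 366]   # assumed sweep, mirroring the 0-366 ms range
SERVER = "bohr05.example.org"             # hypothetical host running 'iperf -s'
DEV = "eth0"                              # hypothetical interface facing the server

for rtt in RTTS_MS:
    # Adding the full delay on this host's egress adds it once to the round-trip time.
    subprocess.run(["tc", "qdisc", "replace", "dev", DEV, "root",
                    "netem", "delay", f"{rtt}ms"], check=True)
    # Run a 20-second iperf TCP transfer and report the rate in Gbits/s.
    out = subprocess.run(["iperf", "-c", SERVER, "-t", "20", "-f", "g"],
                         check=True, capture_output=True, text=True)
    print(f"rtt={rtt}ms\n{out.stdout}")
```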
Throughput Measurements: SONET OC192 ANUE Emulator
- Hosts: feynman 1 and feynman 2, HP 16-core 2-socket Linux hosts
- Path: Ethernet-SONET conversion (E300), fiber loopback, 10GE-OC192 ANUE emulator
- Connection emulation:
  - latency: 0-800 ms
  - loss: 0% only (the OC192 ANUE is unable to emulate losses)
- Available congestion control modules: CUBIC (default), Scalable TCP, Reno, Hamilton TCP, Highspeed TCP
- Measurements: iperf TCP transfers between feynman 1 and feynman 2 over a 9.6 Gbps emulated connection with configurable RTT
CUBIC and Scalable TCP
- For dedicated 10G links: fairly stable statistics
[Figure: throughput (Gbps) vs. RTT (ms) for Scalable TCP and CUBIC]
CUBIC and Scalable TCP
- For dedicated OC192 (9.6 Gbps) connections: not all TCP versions have concave regions
[Figure: throughput profiles for Scalable TCP, Hamilton TCP, CUBIC, Reno, and Highspeed TCP]
TCP Parameters and Dynamics
- TCP and connection parameters:
  - congestion window: W(t)
  - instantaneous throughput: roughly W(t) / tau
  - connection capacity: C
  - connection round trip time: tau
  - average throughput over an observation window of length T
- Dedicated connections: close-to-zero losses
- Two regimes (a toy per-rtt simulation follows this slide):
  (a) Slow start: W increases by 1 for each acknowledgement until it reaches the slow start threshold
  (b) Congestion avoidance: W increases with acknowledgements and decreases when a loss is inferred (such as a time-out)
    - details of the increase and decrease depend on the TCP version
    - simple case, Reno: Additive Increase, Multiplicative Decrease (AIMD)
      - acknowledgement: W <- W + 1/W (roughly +1 per rtt)
      - loss inferred: W <- W/2
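A toy discrete-time sketch of the two regimes at per-rtt granularity; the threshold, capacity, and loss handling here are illustrative assumptions, not the talk's model:

```python
def window_trace(rtts, capacity_pkts, ssthresh, loss_rtts=()):
    """Per-RTT congestion window trace: slow start, then Reno-style AIMD.

    capacity_pkts: bandwidth-delay product in segments (window capped here).
    loss_rtts: RTT indices at which a loss is inferred (multiplicative decrease).
    """
    w, trace = 1.0, []
    for k in range(rtts):
        trace.append(w)
        if k in loss_rtts:
            w = max(w / 2.0, 1.0)            # multiplicative decrease
        elif w < ssthresh:
            w = min(2.0 * w, capacity_pkts)  # slow start: doubles every rtt
        else:
            w = min(w + 1.0, capacity_pkts)  # congestion avoidance: +1 per rtt
    return trace

# Small bandwidth-delay product: capacity is reached while still in slow start.
print(window_trace(10, capacity_pkts=64, ssthresh=128))
# Larger bandwidth-delay product: a loss ends slow start before capacity is reached.
print(window_trace(10, capacity_pkts=10_000, ssthresh=256, loss_rtts={8}))
```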
TCP Dynamics for Dedicated Connections
- Dynamics are dominated by acknowledgements:
  - most losses, if any, are caused by TCP itself (buffer overflows)
  - very rarely, physical layer losses lead to TCP/IP losses
- Resultant dynamics:
  - the congestion window mostly increases, through slow start and into congestion avoidance
  - sending and receiving rates are limited by
    - send and receive host buffers: TCP, IP, kernel, NIC, IO and others
    - connection capacity and round-trip time
  - throughput increases until it reaches the connection capacity
    - small delay-bandwidth product: transition during slow start
    - large delay-bandwidth product: transition after slow start
- To a first-order approximation: for smaller data transfers, the overall qualitative properties of throughput can be inferred without the finer details of the congestion avoidance phase
TCP Throughput Peaks During Slow Start
- Short connections: smaller delay-bandwidth product
[Figure: throughput vs. time; link capacity is reached during slow start, before any loss event and congestion avoidance]
TCP Throughput Peaks During Slow Start
- Longer connections: higher delay-bandwidth product
[Figure: throughput vs. time; slow start ends with a loss event and the transfer continues in congestion avoidance before capacity is reached]
Throughput Estimation: Decrease with rtt
- For smaller rtt, link capacity is reached while still in slow start
- Throughput is measured as the average over the period of observation (data transferred divided by the observation time)
- During slow start, throughput doubles every rtt
- Monotonicity of throughput then reduces to showing that the data transferred within the observation period is non-increasing in rtt (a reconstruction sketch follows this slide)
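A hedged reconstruction of the slow-start accounting; the notation is mine and is offered only as an illustration of how the estimate can be set up, not as the paper's exact model. With segment size M and round trip time tau:

```latex
% Window after k round trips (slow start), and data sent by time t = k\tau:
W(k\tau) = 2^{k}, \qquad D(k\tau) = M\sum_{i=0}^{k-1} 2^{i} = M\,(2^{k}-1).

% Capacity C is reached when W \approx C\tau/M, i.e. after about
k_{C}(\tau) \approx \log_{2}\frac{C\tau}{M} \ \text{ round trips}, \qquad
t_{C}(\tau) \approx \tau \log_{2}\frac{C\tau}{M}.

% Average throughput over an observation window of length T > t_{C}(\tau):
\Theta(\tau) \approx \frac{1}{T}\Bigl[ D\bigl(t_{C}(\tau)\bigr) + C\bigl(T - t_{C}(\tau)\bigr) \Bigr].
```

Since t_C(tau) grows with tau and data accumulates more slowly than C before capacity is reached, Theta(tau) decreases with tau, consistent qualitatively with Result 1.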
Concave Profile of Throughput
- Condition for a concave throughput profile: the derivative of throughput with respect to rtt is non-increasing over the small-rtt region
- Substituting the slow-start throughput expression, the condition reduces to concavity of a simpler function of rtt
- Concavity of the simpler form follows from its derivative being decreasing in rtt (continued in the sketch below)
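Continuing the reconstruction sketched after the previous slide (again my own notation, offered as an illustration of how the reduction can go, under the assumption that capacity is reached within slow start, T > t_C(tau)):

```latex
% Substituting D(t_C(\tau)) = C\tau - M into the average-throughput expression:
\Theta(\tau) \approx C - \frac{1}{T}\Bigl[\, C\tau\log_{2}\frac{C\tau}{M} - C\tau + M \Bigr],
\qquad
\Theta''(\tau) \approx -\frac{C}{T}\cdot\frac{1}{\tau\ln 2} < 0 .
```

Concavity of Theta thus reduces to convexity of the single term tau * log2(C tau / M) (the linear terms do not affect curvature), whose derivative log2(C tau / M) + 1/ln 2 is increasing in tau.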
TCP Throughput: Decreases in General
- Monotonicity for smaller rtt is established above; it suffices to extend the argument to the congestion avoidance regions (CA1, ...)
- Using the generic form of the window dynamics, it suffices to verify conditions (i) and (ii) on the increase and decrease terms
- In summary, throughput decreases with rtt for all TCP versions considered
UDP-Based Transport: UDT
- For dedicated 10G links, UDT provides higher throughput than CUBIC (Linux default TCP)
- The TCP/UDT throughput transition point depends on connection parameters: rtt, loss rate, host and NIC parameters, IP and UDP parameters
- Disk-to-disk transfers (xdd) have lower transfer rates
[Figure: single-stream throughput for UDT, CUBIC, xdd-read, and xdd-write]
Conclusions and Future Work
- Summary: for smaller memory transfers over dedicated connections
  - collected systematic throughput measurements over different rtt ranges
  - throughput profiles have two modes, in sharp contrast to the well-known convex profiles
    - convex: longer rtt
    - concave: shorter to medium rtt
  - developed analytical models to explain the two modes
- Our results also lead to practical solutions
  - choose the TCP version for the connection rtt: Reno, Scalable TCP, Hamilton TCP, Highspeed TCP, CUBIC (default in Linux)
  - load the corresponding congestion avoidance kernel module
- Future work
  - large data transfers: congestion avoidance region
  - disk and file transfers: effects of IO limits
  - TCP-UDT trade-offs
  - parallel TCP streams
Thank you