The Effects of Systemic Packet Loss on Aggregate TCP Flows. Thomas J. Hacker, May 8, 2002, Internet2 Member Meeting
Problem: High performance computing community is making use of parallel TCP sockets to increase end-to-end throughput. There are concerns about the effectiveness, fairness, and efficiency of parallel flows. This research uses simulation to investigate the effectiveness, fairness, and efficiency questions. Based on simulations with an empirically based loss model, parallel TCP is effective and efficient, but not always fair. May be possible to improve fairness.
Outline: Introduction, Motivation, Background, Simulation, Evaluation, Conclusion
Introduction: HPC community needs high speed bulk throughput. Using parallel TCP flows to increase throughput. Examples: Bbcp (Stanford Linear Accelerator, SLAC); Globus (Argonne National Lab); GridFTP (Grid Forum and ANL); Storage Resource Broker (San Diego Supercomputer Center); PSockets Library (University of Illinois at Chicago). SLAC has extensive measurements that demonstrate successful use.
Introduction: Actual end-to-end network throughput is much less than expected. Host and network tuning helps a little. Infrastructure upgrades help a little. But after tuning, throughput is still much less than expected. Network measurements gathered from infrastructure show available unused bandwidth (“head room”). Observed packet loss rates from transfers are too high to support high throughput bulk data transfers.
Introduction: Networking community discourages use of parallel TCP flows. May cause congestion collapse at worst. Unfair to single stream flows at best. This is based on the belief that packet losses are due exclusively to network overload.
Motivation: This research examines the use of parallel TCP flows on shared networks. Goals of the research are to determine if parallel TCP is effective, fair, and efficient.
Motivation: Effective: Does the use of parallel TCP flows increase aggregate throughput? Fair: Does the use of parallel TCP flows steal bandwidth from competing TCP flows? Efficient: Does the use of parallel TCP flows improve the overall efficiency of the network bottleneck?
Background: Factors that affect TCP throughput. Maximum Segment Size (MSS): maximum TCP segment size, limited by the maximum frame size supported by the network. Round Trip Time (RTT): depends on the length of the network and the load on the network (queueing delays). Packet Loss Rate: number of packets dropped / number of packets transmitted; packet losses are considered a sign of overload.
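The three factors on this slide are commonly combined in the Mathis et al. approximation for steady-state TCP throughput, rate ≈ (MSS / RTT) · C / sqrt(p). The slides do not name this formula, so using it here is an assumption; the function name and the constant C = sqrt(3/2) are illustrative choices. A minimal sketch:

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Approximate upper bound on single-flow TCP throughput in bits/sec.

    Assumes the Mathis et al. model: rate = (MSS / RTT) * C / sqrt(p).
    The default C = sqrt(3/2) is an assumption, not a value from the slides.
    """
    if not 0.0 < loss_rate <= 1.0:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes * 8.0 / rtt_s) * c / math.sqrt(loss_rate)


if __name__ == "__main__":
    # Hypothetical path: 1460-byte MSS, 60 ms RTT, 0.01% packet loss.
    rate = mathis_throughput_bps(1460, 0.060, 1e-4)
    print(f"{rate / 1e6:.1f} Mb/s")
```

Under this model throughput scales linearly with MSS, inversely with RTT, and inversely with the square root of the loss rate, which is why a small non-congestion loss rate can cap throughput well below link capacity.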
Background: Packet Loss. Most dynamic factor of the three. High rates of packet loss limit throughput. Cause assumed to be exclusively overload. Statistical distribution of packet loss is important.
Background: Sources of Packet Loss. Network bottleneck overload. Other sources: hardware and software bugs; faulty hardware; others…
Background: Implication. When there is no congestion, packet loss from other sources limits throughput. Evidence of non-congestion packet loss: lack of recorded drops in routers; underutilized network links; packet drops present in TCP sessions that are not due to overload.
Background: Parallel TCP flows. Overcome effects of packet loss on throughput. Recover from loss faster than a single stream. Average out effects of non-congestion-related packet losses.
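The faster recovery can be illustrated with a back-of-the-envelope model. The sketch below assumes standard TCP behavior (a flow halves its window on a loss and regains one packet per round trip), that the aggregate window is split evenly across flows, and that a single loss hits only one flow. The function name and the numbers are hypothetical.

```python
def loss_response(total_window, n_flows):
    """Return (aggregate window after one loss, RTTs to regain the old aggregate).

    Assumptions (not from the slides): even split of the window across flows,
    one flow halves on the loss, every flow adds one packet per round trip.
    """
    per_flow = total_window / n_flows
    lost = per_flow / 2.0                 # only the flow that saw the loss halves
    after_loss = total_window - lost
    rtts_to_recover = lost / n_flows      # aggregate regains n_flows packets per RTT
    return after_loss, rtts_to_recover


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        after, rtts = loss_response(100.0, n)
        print(f"n={n}: window after loss={after}, RTTs to recover={rtts}")
```

In this model a single stream with a 100-packet window drops to 50 and needs 50 round trips to recover, while four flows sharing the same window drop to 87.5 and recover in about 3 round trips.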
Simulation: ns2 simulation built to investigate the effectiveness, fairness, and efficiency of parallel TCP flows.
Simulation: Loss model in simulator is critical. Measurements from real transfers used to build loss model: 153 data transfers from U-M to Caltech, performed over 3 days; packet traces from experiments analyzed to extract losses. Source of loss: network operations centers certified no router drops during test; bandwidth graph for network bottleneck showed underutilization.
Simulation Observed Loss Characteristics
Simulation Right hand side of histogram
Simulation: Left hand side of histogram: intraburst losses; collection of exponential distributions; between 61% and 78% of analyzed intrabursts fit an exponential distribution. Right hand side of histogram: interburst losses; fits a normal distribution.
Simulation: Loss Models Considered. Constant Loss Probability; Random I.I.D.; Poisson Loss Arrival; Unconditional and Conditional Loss (a.k.a. 2-state Markov or Gilbert); Kth Order Markov Loss Model; Extended Gilbert Model.
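As a concrete instance of the simplest bursty model in this list, the sketch below simulates a 2-state Markov (Gilbert) loss process per packet. It assumes the simple variant in which every packet sent while in the "bad" state is lost; the transition probabilities are hypothetical and are not taken from the measurements.

```python
import random


def gilbert_losses(n_packets, p_good_to_bad, p_bad_to_good, seed=0):
    """Return a list of booleans, True where a packet is lost.

    Simple Gilbert variant (an assumption): packets are lost exactly when
    the chain is in the bad state. The chain starts in the good state.
    """
    rng = random.Random(seed)
    in_bad = False
    lost = []
    for _ in range(n_packets):
        lost.append(in_bad)
        if in_bad:
            if rng.random() < p_bad_to_good:
                in_bad = False
        elif rng.random() < p_good_to_bad:
            in_bad = True
    return lost


if __name__ == "__main__":
    # Hypothetical parameters: rare entry into the bad state, short bursts.
    trace = gilbert_losses(100_000, p_good_to_bad=0.001, p_bad_to_good=0.5)
    print("observed loss rate:", sum(trace) / len(trace))
```

For this variant the long-run loss rate is p_good_to_bad / (p_good_to_bad + p_bad_to_good), and the mean burst length is 1 / p_bad_to_good packets, which is what distinguishes it from a constant loss probability with the same average.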
Simulation: 6-state Markov model selected. 6 states were enough to simulate throughput equivalent to observed throughput. Markov chain used to drive a Markov Modulated Poisson Process (MMPP). 1 state is the loss state; 5 states are no-loss states. Sojourn times and transition probabilities taken from observed data. Poisson loss model used for the loss state.
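A rough sketch of how such a Markov-modulated loss process can be generated is shown below. It is not the ns2 implementation used in the study. It assumes exponentially distributed sojourn times, and every number in it (transition matrix, mean sojourn times, loss rate) is a made-up placeholder rather than a value fitted to the U-M to Caltech traces.

```python
import random

# Hypothetical parameters: states 0-4 are no-loss states, state 5 is the loss state.
TRANSITIONS = [[0.0 if i == j else 0.2 for j in range(6)] for i in range(6)]
MEAN_SOJOURN = [1.0, 2.0, 4.0, 8.0, 16.0, 0.05]   # seconds spent per visit
LOSS_RATE = [0.0, 0.0, 0.0, 0.0, 0.0, 200.0]      # Poisson losses per second


def simulate_mmpp_losses(transitions, mean_sojourn, loss_rate, horizon, seed=0):
    """Return sorted loss timestamps in [0, horizon) from a modulated Poisson process."""
    rng = random.Random(seed)
    states = range(len(transitions))
    state, now, losses = 0, 0.0, []
    while now < horizon:
        # Exponential sojourn time in the current state (an assumption).
        leave = min(now + rng.expovariate(1.0 / mean_sojourn[state]), horizon)
        if loss_rate[state] > 0.0:
            # Poisson arrivals: exponential gaps between losses while in this state.
            t = now + rng.expovariate(loss_rate[state])
            while t < leave:
                losses.append(t)
                t += rng.expovariate(loss_rate[state])
        now = leave
        state = rng.choices(states, weights=transitions[state])[0]
    return losses


if __name__ == "__main__":
    events = simulate_mmpp_losses(TRANSITIONS, MEAN_SOJOURN, LOSS_RATE, horizon=600.0)
    print(len(events), "loss events in 600 simulated seconds")
```

Changing the seed changes the loss pattern while keeping the same statistical model, which mirrors the practice of repeating each simulation with different random seeds.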
Simulation: MultiState Loss Model in ns2 used to implement MMPP loss model. Extension made to ns2 to support MultiState Loss Model on multiple links in the simulator. Each simulation instance was run 10 times with different random seeds for the loss model. Total number of simulations was over 3000.
Evaluation: Effectiveness, Fairness, Efficiency
Evaluation: Effectiveness. Question: Does the use of parallel TCP flows increase aggregate throughput? Addressing the question: between 1 and 6 parallel flows simulated; no cross traffic.
Evaluation Effectiveness Results
Evaluation: Effectiveness Conclusion. Parallel flows improve aggregate throughput in the presence of systemic non-congestion-related packet loss. Corroboration of simulation results with observed results.
Evaluation: Fairness. Question: Does the use of parallel TCP flows steal bandwidth from competing TCP flows? Addressing the question: between 1 and 12 parallel flows; between 1 and 5 cross streams of competing single stream traffic.
Evaluation: Reading the Graphs. Total parallel flow throughput. Total single stream flow throughput. Network bottleneck is 100 Mb/sec.
Evaluation: Fairness Conclusions. Fair when there is more than approximately 10% unused bandwidth. Unfair when there is no available bandwidth. Parallel TCP flows steal bandwidth from competing single stream flows to increase throughput when there is no unused bandwidth.
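One way to see why a saturated bottleneck is unfair: if every TCP flow with the same round trip time ends up with roughly the same per-flow share (an idealization assumed here, not a result from the slides), an application that opens n flows collects n shares. The sketch below uses the 100 Mb/sec bottleneck from the simulations; the function name and flow counts are illustrative.

```python
def equal_share_split(capacity_mbps, n_parallel, n_single):
    """Split a saturated bottleneck assuming every flow gets an equal share.

    Returns (total for the parallel application, total for all single streams).
    The equal-per-flow share is an idealized assumption.
    """
    per_flow = capacity_mbps / (n_parallel + n_single)
    return n_parallel * per_flow, n_single * per_flow


if __name__ == "__main__":
    for n in (1, 4, 12):
        parallel, single = equal_share_split(100.0, n, 5)
        print(f"{n} parallel vs 5 single: {parallel:.1f} / {single:.1f} Mb/s")
```

With unused bandwidth in the bottleneck this zero-sum picture does not apply, which is consistent with the fairness observed below roughly 90% utilization.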
Evaluation: Improving Fairness. Parallel flow aggressiveness is due to: increased recovery rate over single stream; fractional response to packet drops. If we could make parallel flows only as aggressive as a single stream, can we preserve effectiveness and efficiency while improving fairness?
Evaluation: Slight modification to the TCP congestion avoidance algorithm. If n parallel flows are used, increase congestion window one packet for every n packets successfully transmitted, rather than one packet for every one packet successfully transmitted. Overall aggressiveness of n parallel flows is then the same as one single TCP flow. Simulation for 1 and 5 cross streams run with 1 to 20 parallel streams to investigate boundaries.
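The effect of the modification on window growth can be sketched as follows. The code reads the rule as scaling each flow's additive increase to 1/n packet per loss-free round trip instead of 1 packet; that per-round-trip reading, the function names, and the starting window are assumptions made for illustration, and loss response is not modeled.

```python
def grow_window(cwnd, rtts, n_flows=1, modified=False):
    """Window of one flow after `rtts` loss-free round trips of congestion avoidance."""
    increase = 1.0 / n_flows if modified else 1.0
    for _ in range(rtts):
        cwnd += increase
    return cwnd


def aggregate_growth(n_flows, rtts, modified, start=10.0):
    """Total packets added across all n_flows windows after `rtts` round trips."""
    per_flow = grow_window(start, rtts, n_flows, modified)
    return n_flows * (per_flow - start)


if __name__ == "__main__":
    for n in (1, 5, 20):
        plain = aggregate_growth(n, rtts=20, modified=False)
        scaled = aggregate_growth(n, rtts=20, modified=True)
        print(f"n={n}: unmodified +{plain:.0f} packets, modified +{scaled:.0f} packets")
```

Under this reading, n unmodified flows open their aggregate window n times faster than a single flow, while n modified flows open it at the single-flow rate.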
Evaluation: Parallel flows with modification are about ½ as aggressive as parallel flows with no modification. Also found some asymptotic behavior as the number of parallel flows increased.
Evaluation Asymptotic behavior Derived aggregate throughput of parallel flow with modified TCP
Evaluation: Fairness Conclusions. Fair when there is more than 10% available bandwidth in bottleneck. Parallel flows steal from single stream flows when bottleneck is over 90% utilized. TCP modification: reduces aggressiveness; curbs ability of parallel flow to steal bandwidth as number of flows increases.
Evaluation: Efficiency Results. Efficiency is increased when parallel flows are used if there is unused bandwidth in bottleneck. When all nodes use the same number of parallel flows: efficiency maintained; fairness maintained.
Conclusions: Parallel flows are: effective; fair when bottleneck is utilized less than 90%; unfair when bottleneck is near saturation; efficient. TCP congestion avoidance algorithm can be modified to: reduce aggressiveness by approximately 1/2; maintain effectiveness and efficiency.
Future Work: Implement modified algorithm for assessment. Further investigate loss models: parameterization of loss models; assessment of end-to-end network loss characteristics. Investigate optimal TCP response to observed loss characteristics. Investigate stochastic analysis of parallel TCP over wide area networks.