Advanced Computer Networking Internet Congestion Control
Principles of Congestion Control. Informally: “too many sources sending too much data too fast for the network to handle”. Manifestations: lost packets (buffer overflow at routers) and long delays (queuing in router buffers). A highly important problem! [Figure: hosts H1, H2, H3 and router R1; arrivals A1(t) at 10 Mb/s and A2(t) at 100 Mb/s, departures D(t) at 1.5 Mb/s] behnam shafagaty
Causes/costs of congestion: scenario 1. Two senders, two receivers; one router with infinite buffers; no retransmission.
Causes/costs of congestion: scenario 1. Throughput increases with load, up to a maximum total load of C (C/2 per session). Delays grow large when the router is congested. The load is stochastic.
Causes/costs of congestion: scenario 2. One router with finite buffers; the sender retransmits lost packets.
Causes/costs of congestion: scenario 2. Always: $\lambda_{in} = \lambda_{out}$ (goodput); we would like to maximize goodput. “Perfect” retransmission, only when a packet is lost: $\lambda'_{in} > \lambda_{out}$. Actual retransmission of delayed (not lost) packets makes $\lambda'_{in}$ larger than in the perfect case for the same $\lambda_{out}$.
Causes/costs of congestion: scenario 2. [Figure: plots of $\lambda_{out}$ versus $\lambda'_{in}$] “Costs” of congestion: more work (retransmissions) for a given goodput; unneeded retransmissions mean the link carries (and delivers) multiple copies of a packet.
Packet delay and throughput as functions of load
Congestion Control. Congestion control involves two tasks: detect congestion, and limit the sending rate.
TCP & AQM. [Figure: feedback loop between sources and links.] Sources run TCP (Reno, Vegas) and set rates xi(t); links run DropTail, RED, REM, PI or AVQ and feed back a congestion measure pl(t). Example congestion measures pl(t): loss (Reno), queuing delay (Vegas).
TCP Congestion Control. End-to-end control (no network assistance). Assumes that long delays (and packet loss) are due to congestion.
Congestion Control II. TCP uses slow start and Additive Increase/Multiplicative Decrease (AIMD) to deal with congestion. Van Jacobson outlined these ideas in 1988. Slow start, roughly: whenever starting traffic or recovering from congestion, start cwnd at the size of a single segment and increase it (up to a point) as ACKs show up.
AIMD (Additive Increase / Multiplicative Decrease). CongestionWindow (cwnd) is a variable held by the TCP source for each connection; it is set based on the perceived level of congestion. The host receives implicit (packet drop) or explicit (packet mark) indications of internal congestion.
MaxWindow = min(CongestionWindow, AdvertisedWindow)
EffectiveWindow = MaxWindow - (LastByteSent - LastByteAcked)
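The two window formulas can be checked with a few lines of Python. This is only a sketch: the function name and the sample byte counts are made up for illustration, and clamping the result at zero is an assumption the slide does not state.

```python
def effective_window(cwnd, advertised_window, last_byte_sent, last_byte_acked):
    """Bytes the sender may still put on the wire right now."""
    max_window = min(cwnd, advertised_window)
    in_flight = last_byte_sent - last_byte_acked
    # Assumption: never report a negative window.
    return max(0, max_window - in_flight)

# cwnd is the tighter limit here: 8000 allowed, 5000 already in flight.
print(effective_window(cwnd=8000, advertised_window=16000,
                       last_byte_sent=15000, last_byte_acked=10000))  # 3000
```

Whichever of cwnd and the advertised window is smaller sets the limit, so a large receiver buffer does not help a congested path.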
Additive Increase. Additive increase is a reaction to perceived available capacity. Linear increase, basic idea: for each “cwnd's worth” of packets sent, increase cwnd by 1 packet. In practice, cwnd is incremented fractionally for each arriving ACK. With cwnd measured in bytes:
increment = MSS × (MSS / cwnd)
cwnd = cwnd + increment
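A small sketch of the per-ACK increment, assuming cwnd is kept in bytes so that each ACK adds MSS × (MSS / cwnd). The MSS value of 1460 bytes and the function name are illustrative assumptions.

```python
MSS = 1460  # bytes; assumed segment size for illustration

def on_ack_additive_increase(cwnd_bytes, mss=MSS):
    """Grow cwnd by a fraction of one segment for each arriving ACK."""
    return cwnd_bytes + mss * (mss / cwnd_bytes)

cwnd = 10 * MSS
for _ in range(10):          # one window's worth of ACKs
    cwnd = on_ack_additive_increase(cwnd)

# After a full window of ACKs, cwnd has grown by just under one segment.
print(cwnd / MSS)
```

The growth is slightly less than one full segment per RTT because cwnd is already a little larger by the time the later ACKs arrive.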
Additive Increase. [Figure: source/destination timeline; one packet is added each RTT]
Multiplicative Decrease. The key assumption is that a dropped packet and the resultant timeout are due to congestion at a router or a switch. Multiplicative decrease: TCP reacts to a timeout by halving cwnd. cwnd is not allowed to fall below the size of a single packet.
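The halving rule with its one-packet floor, as a minimal sketch (cwnd counted in packets; the function name is hypothetical):

```python
def on_timeout(cwnd_pkts):
    """Halve cwnd on a timeout, but never go below one packet."""
    return max(1.0, cwnd_pkts / 2)

cwnd = 16.0
trace = [cwnd]
for _ in range(5):           # five timeouts in a row
    cwnd = on_timeout(cwnd)
    trace.append(cwnd)
print(trace)  # [16.0, 8.0, 4.0, 2.0, 1.0, 1.0]
```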
AIMD: Some Notes. It has been shown that AIMD is a necessary condition for TCP congestion control to be stable. Because the simple congestion control mechanism involves timeouts that cause retransmissions, it is important that hosts have an accurate timeout mechanism. Timeouts are set as a function of the average RTT and the standard deviation of the RTT.
Typical TCP Congestion Window Evolution
AIMD: Two users, One link. [Figure: rate of User 2 versus rate of User 1, with the fairness line and the bandwidth (BW) limit]
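Why AIMD drifts toward the fairness line can be seen in a toy model. The sketch below assumes both users see congestion at the same time, add 1 unit per step while the link has room, and halve together when the total exceeds the capacity; the numbers are arbitrary.

```python
def aimd_two_users(x1, x2, capacity, steps):
    """Synchronous toy AIMD: +1 each when under capacity, halve both when over."""
    for _ in range(steps):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2      # multiplicative decrease
        else:
            x1, x2 = x1 + 1, x2 + 1      # additive increase
    return x1, x2

x1, x2 = aimd_two_users(80.0, 10.0, capacity=100.0, steps=200)
# The gap started at 70; every halving cuts it in two, increases leave it unchanged.
print(abs(x1 - x2))
```

Additive steps keep the difference between the two rates constant, while each multiplicative step halves it, so the rates converge toward equal shares.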
Slow Start. Linear additive increase takes too long to ramp up a new TCP connection from cold start. Beginning with TCP Tahoe, the slow start mechanism was added to provide an initial exponential increase in the size of cwnd.
Slow Start. 1. The source starts with cwnd = 1. 2. Every time an ACK arrives, cwnd is incremented. cwnd is effectively doubled per RTT “epoch”. Two slow start situations: at the very beginning of a connection (cold start), and when the connection goes dead waiting for a timeout to occur (i.e., the advertised window goes to zero!).
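A sketch of why “one more packet per ACK” doubles cwnd every RTT, assuming no loss and that every packet in a window is acknowledged within the same round (function name is illustrative):

```python
def slow_start_rounds(rounds, cwnd=1):
    """Return cwnd (in packets) at the start of each RTT round."""
    history = [cwnd]
    for _ in range(rounds):
        acks = cwnd              # one ACK per packet sent in this round
        for _ in range(acks):
            cwnd += 1            # increment on every ACK
        history.append(cwnd)
    return history

print(slow_start_rounds(4))  # [1, 2, 4, 8, 16]
```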
Slow Start. [Figure: source/destination timeline; one packet is added per ACK]
Fast Retransmit. Basic idea: use duplicate ACKs to signal a lost packet. Upon receipt of three duplicate ACKs, the TCP sender retransmits the lost packet.
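Duplicate-ACK counting can be sketched as follows. The ACK stream is a plain list of cumulative ACK numbers, which is a simplification; the function name and threshold parameter are illustrative.

```python
def fast_retransmit_triggers(acks, dup_threshold=3):
    """Return the ACK numbers whose third duplicate would trigger a retransmit."""
    triggers = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == dup_threshold:
                triggers.append(ack)     # retransmit once, on the third duplicate
        else:
            last_ack = ack
            dup_count = 0
    return triggers

# ACK 3 arrives once, then four duplicates: a single retransmit is triggered.
print(fast_retransmit_triggers([1, 2, 3, 3, 3, 3, 3, 7]))  # [3]
```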
Fast Retransmit. Generally, fast retransmit eliminates about half of the timeouts. This yields roughly a 20% improvement in throughput. Note: fast retransmit does not eliminate all timeouts, because small window sizes at the source may not generate enough duplicate ACKs.
Fast Retransmit. [Figure: fast retransmit based on three duplicate ACKs]
TCP Congestion Window Trace
Fast Recovery. Fast recovery was added with TCP Reno. In congestion avoidance mode, if duplicate ACKs are received, reduce cwnd to half. If n successive duplicate ACKs are received, we know that the receiver got n segments after the lost segment: advance cwnd by that number.
Adaptive Retransmissions. RTT: round trip time between a pair of hosts on the Internet. How to set the TimeOut value? The timeout value is set as a function of the expected RTT. Consequences of a bad choice?
Original Algorithm. Keep a running average of RTT and compute TimeOut as a function of this RTT. Send a packet and keep timestamp ts. When the ACK arrives, record timestamp ta. SampleRTT = ta - ts
Original Algorithm. Compute a weighted average:
EstimatedRTT = α × EstimatedRTT + (1 - α) × SampleRTT
Original TCP spec: α in the range 0.8 to 0.9.
TimeOut = 2 × EstimatedRTT
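A worked example of the weighted average, as a sketch. The value α = 0.875 is an assumed choice inside the 0.8 to 0.9 range, and the RTT numbers (in milliseconds) are invented.

```python
def update_estimated_rtt(estimated_rtt, sample_rtt, alpha=0.875):
    """Exponentially weighted moving average of RTT samples."""
    return alpha * estimated_rtt + (1 - alpha) * sample_rtt

def timeout(estimated_rtt):
    """Original rule: timeout is twice the estimate."""
    return 2 * estimated_rtt

est = update_estimated_rtt(estimated_rtt=100.0, sample_rtt=200.0)
print(est, timeout(est))  # 112.5 225.0
```

A single sample of 200 ms moves a 100 ms estimate only to 112.5 ms, which shows how heavily the average favors history.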
Karn/Partridge Algorithm. An obvious flaw in the original algorithm: whenever there is a retransmission, it is impossible to know whether to associate the ACK with the original packet or the retransmitted packet.
Associating the ACK?
Karn/Partridge Algorithm. Do not measure SampleRTT for a packet that was sent more than once. For each retransmission, set TimeOut to double the last TimeOut. (Note: this is a form of exponential backoff, based on the belief that the lost packet is due to congestion.)
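Both rules fit in a few lines. This is a sketch with hypothetical helper names; an exchange is modeled as a (times_sent, measured_rtt) pair, which is an assumption made for illustration.

```python
def usable_rtt_samples(exchanges):
    """Keep RTT samples only from segments that were sent exactly once."""
    return [rtt for times_sent, rtt in exchanges if times_sent == 1]

def timeout_after_retransmissions(timeout, retransmissions):
    """Double the timeout once per retransmission."""
    for _ in range(retransmissions):
        timeout *= 2
    return timeout

# The middle segment was retransmitted, so its ambiguous sample is dropped.
print(usable_rtt_samples([(1, 0.10), (2, 0.35), (1, 0.12)]))  # [0.1, 0.12]
print(timeout_after_retransmissions(1.0, 3))                  # 8.0
```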
Jacobson/Karels Algorithm. The problem with the original algorithm is that it did not take into account the variance of SampleRTT.
Difference = SampleRTT - EstimatedRTT
EstimatedRTT = EstimatedRTT + (δ × Difference)
Deviation = Deviation + δ × (|Difference| - Deviation)
where δ is a fraction between 0 and 1.
Jacobson/Karels Algorithm. TCP computes the timeout using both the mean and the variance of RTT:
TimeOut = µ × EstimatedRTT + Φ × Deviation
where, based on experience, µ = 1 and Φ = 4.
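One update step of the Jacobson/Karels estimator, as a sketch. It assumes the usual incremental form Deviation = Deviation + δ × (|Difference| - Deviation) and picks δ = 0.125 for illustration (the slides only say δ lies between 0 and 1); the RTT values are invented and in milliseconds.

```python
def jacobson_karels(estimated_rtt, deviation, sample_rtt,
                    delta=0.125, mu=1, phi=4):
    """Return the updated (EstimatedRTT, Deviation, TimeOut)."""
    difference = sample_rtt - estimated_rtt
    estimated_rtt = estimated_rtt + delta * difference
    deviation = deviation + delta * (abs(difference) - deviation)
    time_out = mu * estimated_rtt + phi * deviation
    return estimated_rtt, deviation, time_out

print(jacobson_karels(estimated_rtt=100.0, deviation=10.0, sample_rtt=140.0))
# (105.0, 13.75, 160.0)
```

A noisy sample raises the deviation term, so the timeout widens much faster than the mean alone would suggest.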
Algorithms
Early TCP (pre-1988). Go-back-N ARQ: detects loss from timeout; retransmits from the lost packet onward. Receiver window flow control: prevents overflows at the receive buffer. Flow control: self-clocking.
Why Flow Control? In October 1986, the Internet had its first congestion collapse. On the link from LBL to UC Berkeley (400 yards, 3 hops, 32 Kbps), throughput dropped to 40 bps: a factor of ~1000 drop! In 1988, Van Jacobson proposed TCP flow control.
Effect of Congestion. Packet loss; retransmission; reduced throughput. Congestion collapse is due to unnecessarily retransmitted packets and undelivered or unusable packets. Congestion may continue after the overload! [Figure: throughput versus load]
Window Flow Control. [Figure: source and destination timelines; the source sends packets 1, 2, ..., W and receives ACKs, about W packets per RTT] A lost packet is detected by a missing ACK.
Window flow control. Limit the number of packets in the network to window W. Source rate = W × MSS / RTT bps. If W is too small, then rate « capacity. If W is too big, then rate > capacity, which leads to congestion. Adapt W to the network (and conditions): W = BW × RTT.
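A worked example of W = BW × RTT, as a sketch. It assumes the source rate is W × MSS / RTT and uses made-up numbers: a 10 Mb/s path, 100 ms RTT and 1500-byte packets.

```python
def bdp_packets(bandwidth_bps, rtt_s, mss_bytes=1500):
    """Window (in packets) that just fills the pipe: BW x RTT."""
    return bandwidth_bps * rtt_s / (8 * mss_bytes)

def source_rate_bps(window_pkts, rtt_s, mss_bytes=1500):
    """Rate achieved by sending window_pkts packets every RTT."""
    return window_pkts * mss_bytes * 8 / rtt_s

w = bdp_packets(10e6, 0.1)
print(round(w, 1))                           # about 83.3 packets
print(round(source_rate_bps(w / 2, 0.1)))    # half the window gives half the rate
```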
Congestion Control. TCP seeks to achieve high utilization, avoid congestion, and share bandwidth. Window flow control: source rate = W / RTT packets/sec. Adapt W to the network (and conditions): W = BW × RTT.
TCP Window Flow Controls. Receiver flow control: avoid overloading the receiver; set by the receiver; awnd is the receiver (advertised) window. Network flow control: avoid overloading the network; set by the sender, which infers the available network capacity; cwnd is the congestion window. Set W = min(cwnd, awnd).
Receiver Flow Control. The receiver advertises awnd with each ACK. The window awnd is closed when data is received and ack'd, and opened when data is read. The size of awnd can be the performance limit (e.g. on a LAN); a sensible default is ~16 kB.
Network Flow Control. The source calculates cwnd from indications of network congestion. Congestion indications: losses, delay, marks. Algorithms to calculate cwnd: Tahoe, Reno, Vegas, RED, REM …
TCP Congestion Controls. Tahoe (Jacobson 1988): Slow Start, Congestion Avoidance, Fast Retransmit. Reno (Jacobson 1990): Fast Recovery. Vegas (Brakmo & Peterson 1994): new Congestion Avoidance. RED (Floyd & Jacobson 1993): probabilistic marking. REM (Athuraliya & Low 2000): clear buffer, match rate.
Variants. Tahoe & Reno: NewReno, SACK, rate-halving, modifications for high performance. AQM: RED, ARED, FRED, SRED; BLUE, SFB; REM, PI, AVQ.
TCP Tahoe (Jacobson 1988). [Figure: window versus time, alternating between SS and CA] SS: Slow Start. CA: Congestion Avoidance.
Slow Start. Start with cwnd = 1 (slow start). On each successful ACK, increment cwnd: cwnd ← cwnd + 1. Exponential growth of cwnd each RTT: cwnd ← 2 × cwnd. Enter CA when cwnd >= ssthresh.
Slow Start. [Figure: sender/receiver timeline of data packets and ACKs; cwnd grows 1, 2, 4, 8 over successive RTTs] cwnd ← cwnd + 1 (for each ACK)
Congestion Avoidance. Starts when cwnd >= ssthresh. On each successful ACK: cwnd ← cwnd + 1/cwnd. Linear growth of cwnd each RTT: cwnd ← cwnd + 1.
Congestion Avoidance. [Figure: sender/receiver timeline of data packets and ACKs; cwnd grows 1, 2, 3, 4, one per RTT] cwnd ← cwnd + 1 (for each cwnd ACKs)
Packet Loss. Assumption: loss indicates congestion. Packet loss is detected by retransmission timeouts (RTO timer) or duplicate ACKs (at least 3). [Figure: packets 1 to 7 and their acknowledgements]
Fast Retransmit. Waiting for a timeout takes quite long. The sender retransmits immediately after 3 dupACKs, without waiting for the timeout. It adjusts ssthresh: flightsize = min(awnd, cwnd); ssthresh ← max(flightsize/2, 2). It then enters Slow Start (cwnd = 1).
Successive Timeouts. When there is a timeout, double the RTO. Keep doing so for each lost retransmission (exponential back-off). Max 64 seconds [1]. Max 12 retransmits [1]. [1] Net/3 BSD
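The back-off schedule implied by these limits can be listed with a short sketch. The initial RTO of 1 second and the function name are assumptions for illustration; this is not taken from the BSD source.

```python
def rto_schedule(initial_rto=1.0, max_rto=64.0, max_retransmits=12):
    """Timeout (seconds) in force before each successive retransmission."""
    schedule = []
    rto = initial_rto
    for _ in range(max_retransmits):
        schedule.append(rto)
        rto = min(rto * 2, max_rto)      # double, but cap at max_rto
    return schedule

print(rto_schedule())
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 64.0, 64.0, 64.0, 64.0, 64.0]
```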
Summary: Tahoe. Basic ideas: gently probe the network for spare capacity; drastically reduce the rate on congestion; windowing is self-clocking. Other functions: round trip time estimation, error recovery.
for every ACK {
    if (W < ssthresh) then W++    (SS)
    else W += 1/W                 (CA)
}
for every loss {
    ssthresh = W/2
    W = 1
}
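The Tahoe summary translates almost directly into Python. This sketch follows the pseudocode literally (window in packets, one call per ACK or per loss); the function names are hypothetical.

```python
def tahoe_on_ack(w, ssthresh):
    """Slow start below ssthresh, congestion avoidance at or above it."""
    if w < ssthresh:
        return w + 1          # SS: one packet per ACK
    return w + 1 / w          # CA: about one packet per RTT

def tahoe_on_loss(w):
    """Return (new window, new ssthresh) after a loss."""
    return 1, w / 2

w, ssthresh = 1, 8
for _ in range(7):            # seven ACKs take the window from 1 to 8
    w = tahoe_on_ack(w, ssthresh)
print(w)                      # 8
w = tahoe_on_ack(w, ssthresh)
print(w)                      # 8.125, now in congestion avoidance
print(tahoe_on_loss(w))       # (1, 4.0625)
```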
TCP Tahoe
TCP Reno (Jacobson 1990). [Figure: window versus time, with SS, CA and fast retransmission/fast recovery phases]
Fast recovery. Motivation: prevent the “pipe” from emptying after fast retransmit. Idea: each dupACK represents a packet having left the pipe (successfully received). Enter FR/FR after 3 dupACKs: set ssthresh ← max(flightsize/2, 2); retransmit the lost packet; set cwnd ← ssthresh + ndup (window inflation); wait until W = min(awnd, cwnd) is large enough, then transmit new packet(s). On a non-dup ACK (1 RTT later), set cwnd ← ssthresh (window deflation) and enter CA. After FR/FR, when CA is entered, cwnd is half of the window at the time the loss was detected, so the effect of a loss is to halve the window. [Source: RFC 2581; Fall & Floyd, “Simulation based Comparison of Tahoe, Reno, and SACK TCP”]
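Window inflation and deflation can be traced with a small sketch. The class and method names are invented for illustration, windows are counted in packets, and only the cwnd and ssthresh bookkeeping is modeled (no actual packet transmission).

```python
class RenoFastRecovery:
    """cwnd/ssthresh bookkeeping for fast retransmit / fast recovery."""

    def __init__(self, cwnd, awnd):
        self.cwnd = cwnd
        self.awnd = awnd
        self.ssthresh = None
        self.in_recovery = False

    def on_third_dupack(self):
        flightsize = min(self.awnd, self.cwnd)
        self.ssthresh = max(flightsize / 2, 2)
        self.cwnd = self.ssthresh + 3        # inflation with ndup = 3
        self.in_recovery = True

    def on_extra_dupack(self):
        if self.in_recovery:
            self.cwnd += 1                   # one more packet has left the pipe

    def on_new_ack(self):
        if self.in_recovery:
            self.cwnd = self.ssthresh        # deflation, then congestion avoidance
            self.in_recovery = False

reno = RenoFastRecovery(cwnd=8, awnd=64)
reno.on_third_dupack()
reno.on_extra_dupack()
reno.on_extra_dupack()
inflated = reno.cwnd
reno.on_new_ack()
print(inflated, reno.cwnd)  # 9.0 4.0
```

The window ends at half its pre-loss value, which matches the statement that a loss effectively halves the window.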
Example: FR/FR. [Figure: sender/receiver timeline with packets 1 to 11, and the evolution of cwnd and ssthresh until FR/FR exits] Fast retransmit: retransmit on 3 dupACKs. Fast recovery: inflate the window while repairing the loss, to fill the pipe.
Summary: Reno. Basic ideas: fast recovery avoids slow start. dupACKs: fast retransmit + fast recovery. Timeout: retransmit + slow start. [Figure: state diagram linking slow start, congestion avoidance, FR/FR and retransmit, with dupACKs and timeout as transitions]
NewReno: Motivation. [Figure: timelines of sender S and destination D, marking FR/FR, 8 unack'd packets, and the timeout] On 3 dupACKs, the receiver has packets 2, 4, 6, 8; cwnd = 8; the sender retransmits packet 1 and enters FR/FR. The next dupACK increments cwnd to 9. After an RTT, the ACK for packets 1 and 2 arrives; the sender exits FR/FR with cwnd = 5 and 8 unack'ed packets. No more ACKs arrive, so the sender must wait for a timeout. Example: cwnd = 10. The sender sends packets 1, 2, …, 10. Packets 1, 3, …, 9 are lost; packets 2, 4, …, 10 are received. When 3 dupACKs have been received, the receiver has received (at least) packets 2, 4, 6, 8. The sender retransmits packet 1 and waits until the dupACK triggered by packet 10 has arrived, followed by the ACK triggered by the retransmitted packet 1, which acknowledges packets 1 and 2. This last ACK takes Reno out of fast recovery with cwnd = 5. There are now 8 outstanding packets: 3, 4, …, 10, so the sender cannot transmit any packet. The sender will not receive any more dupACKs, since the window has been exhausted. It must wait until the timer expires for packet 3, then retransmit and go to slow start.
NewReno (Fall & Floyd ’96; RFC 2582). Motivation: multiple losses within a window. A partial ACK acknowledges some but not all of the packets outstanding at the start of FR. In Reno, a partial ACK takes the sender out of FR and deflates the window, so the sender may have to wait for a timeout before proceeding. Idea: a partial ACK indicates lost packets. NewReno stays in FR/FR and retransmits immediately; it retransmits 1 lost packet per RTT until all lost packets from that window are retransmitted. This eliminates the timeout.
SACK (Mathis, Mahdavi, Floyd, Romanow ’96; RFC 2018, RFC 2883). Motivation: Reno and NewReno retransmit at most 1 lost packet per RTT, so the pipe can be emptied during FR/FR with multiple losses. Idea: SACK provides a better estimate of the packets in the pipe; the SACK TCP option describes received packets. On 3 dupACKs: retransmit, halve the window, enter FR. Maintain pipe = packets in pipe: increment when lost or new packets are sent, decrement when a dupACK is received. Transmit a (lost or new) packet when pipe < cwnd. Exit FR when all packets outstanding when FR was entered are acknowledged. [Sources: M. Mathis, J. Mahdavi, S. Floyd and A. Romanow, “TCP Selective Acknowledgement Options”, RFC 2018, Oct. 1996; K. Fall and S. Floyd, “Simulation-based comparisons of Tahoe, Reno and SACK TCP”, Computer Communication Review, July 1996]
TCP Vegas (Brakmo & Peterson 1994). [Figure: window versus time, with SS and CA phases] Reno with a new congestion avoidance algorithm. Converges (provided the buffer is large)!
Congestion avoidance. Each source estimates the number of its own packets in the pipe from the RTT, and adjusts its window to maintain the estimate between αd and βd.
for every RTT {
    if W/RTTmin - W/RTT < α then W++
    if W/RTTmin - W/RTT > β then W--
}
for every loss
    W := W/2
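The per-RTT Vegas rule can be sketched as below. The thresholds are written alpha and beta and are treated as rates in packets per second; their values (10 and 30) and the RTT figures are invented, and the function names are hypothetical.

```python
def vegas_on_rtt(w, rtt, rtt_min, alpha, beta):
    """Adjust the window once per RTT from the gap between expected and actual rate."""
    diff = w / rtt_min - w / rtt     # expected rate minus actual rate
    if diff < alpha:
        return w + 1                 # too few packets queued: speed up
    if diff > beta:
        return w - 1                 # too many packets queued: slow down
    return w

def vegas_on_loss(w):
    return w / 2

print(vegas_on_rtt(20, rtt=0.100, rtt_min=0.1, alpha=10, beta=30))  # 21
print(vegas_on_rtt(20, rtt=0.125, rtt_min=0.1, alpha=10, beta=30))  # 19
```

When the measured RTT equals the minimum RTT there is no queueing, so the window grows; once the RTT inflates enough, the window shrinks without waiting for a loss.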
Implications. Congestion measure = end-to-end queueing delay. At equilibrium: zero loss; stable window at full utilization; approximately weighted proportional fairness; nonzero queue, larger for more sources. Convergence to equilibrium: converges if there is sufficient network buffer; oscillates like Reno otherwise.
Wireless. TCP Reno uses loss as the congestion measure. In wireless networks, significant losses are due to fading, interference and handover, not buffer overflow (congestion). Halving the window is too drastic: small throughput, low utilization.
Proposed solutions. Ideas: hide noncongestion losses from the source; inform the source of noncongestion losses. Approaches: link layer error control; split TCP; snoop agent; SACK+ELN (Explicit Loss Notification). Source: Balakrishnan, Padmanabhan, Seshan and Katz, “A comparison of mechanisms for improving TCP performance over wireless links”, ToN, 5(6):756-769, Dec 1997
Third approach. Problem: Reno uses loss as the congestion measure. Two types of losses: congestion loss (retransmit + reduce window) and noncongestion loss (retransmit). Previous approaches: hide noncongestion losses; indicate noncongestion losses. Our approach: eliminate congestion losses (buffer overflows).
Third approach. [Figure: REM-capable router and host] Do not use loss as the congestion measure: Vegas, REM. Idea: REM clears the buffer, so only noncongestion losses remain; the source retransmits lost packets without reducing the window.
Performance: Goodput