TCP and UDP
2 The Internet Transport Layer: The Internet supports two transport layer protocols. Reliable: the Transmission Control Protocol (TCP). Unreliable: the User Datagram Protocol (UDP).
3 UDP: UDP is an unreliable transport protocol that can be used in the Internet. UDP does not provide connection management, flow or error control, or guaranteed in-order packet delivery. UDP is almost a “null” transport layer.
4 Why UDP? No connection needs to be set up. Throughput may be higher because UDP packets are easier to process, especially at the source. The user doesn’t care whether the data is transmitted reliably. The user wants to implement his or her own transport protocol.
5 UDP Frame Format (each row is 32 bits wide)
Source Port (16 bits) | Destination Port (16 bits)
UDP length (16 bits) | UDP checksum, optional (16 bits)
Data
6 UDP checksum: Goal: detect “errors” (e.g., flipped bits) in the transmitted segment. Sender: treat the segment contents as a sequence of 16-bit integers; the checksum is the 1’s complement of the 1’s complement sum of the segment contents; the sender puts the checksum value into the UDP checksum field. Receiver: compute the checksum of the received segment and check whether the computed checksum equals the checksum field value. NO: error detected. YES: no error detected, but there may be errors nonetheless (more later).
7 Internet Checksum Example: Note: when adding numbers, a carry out of the most significant bit must be added back into the result. Example: add two 16-bit integers, wrap the carry around to obtain the sum, then complement the sum to obtain the checksum.
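The checksum procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a full UDP implementation: the function names and the two sample 16-bit words are assumptions chosen for the example, and the pseudo-header that a real UDP checksum also covers is left out.

```python
def ones_complement_sum(words):
    """Add 16-bit words, wrapping any carry out of bit 15 back into the sum."""
    total = 0
    for word in words:
        total += word
        # Fold the carry (bit 16) back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(words):
    """Checksum = 1's complement of the 1's complement sum."""
    return ~ones_complement_sum(words) & 0xFFFF


# Illustrative values (assumed, not taken from the slide).
words = [0b1110011001100110, 0b1101010101010101]
wrapped_sum = ones_complement_sum(words)
checksum = internet_checksum(words)
print(f"sum      = {wrapped_sum:016b}")
print(f"checksum = {checksum:016b}")

# Receiver check: data words plus the checksum must sum to all ones.
print(ones_complement_sum(words + [checksum]) == 0xFFFF)
```

A receiver that gets all ones from the final sum has detected no error; any other result means at least one bit changed in transit.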
8 TCP: TCP provides the end-to-end reliable connection that IP alone cannot support. The protocol: frame format, connection management, retransmission, flow control, congestion control.
9 TCP Frame Format (each row is 32 bits wide)
Source Port | Destination Port
Sequence Number
Acknowledgement Number
HL | URG ACK PSH RST SYN FIN | Window Size
Checksum | Urgent Pointer
Options (0 or more 32-bit words)
Data
10 TCP Frame Fields: Source & Destination Ports: 16-bit port identifiers carried in each packet. Sequence number: the sequence number of the first data byte in the packet. Acknowledgement number: the sequence number of the next byte expected by the receiver.
11 TCP Frame Fields (cont’d): Window size: specifies how many bytes may be sent, starting at the byte acknowledged. Checksum: checksums the TCP header, the data, and a pseudo-header that contains the IP address fields. Urgent Pointer: points to urgent data in the TCP data field.
12 TCP Frame Fields (cont’d): Header bits: URG = urgent pointer field in use. ACK = indicates whether the frame contains an acknowledgement. PSH = data has been “pushed”; it should be delivered to higher layers right away. RST = indicates that the connection should be reset. SYN = used to establish connections. FIN = used to release a connection.
13 TCP Connection Establishment: Three-way handshake between Host A and Host B. (1) A to B: SYN (seq=x). (2) B to A: SYN (seq=y, ACK=x+1). (3) A to B: ACK (seq=x+1, ACK=y+1).
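14 TCP Connection Tear-down: Two double handshakes between Host A and Host B. (1) A to B: FIN (seq=x). (2) B to A: ACK (ACK=x+1); the A->B direction is torn down. (3) B to A: FIN (seq=y). (4) A to B: ACK (ACK=y+1); the B->A direction is torn down.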
15 TCP Retransmission: When a packet remains unacknowledged for a period of time, TCP assumes it is lost and retransmits it. TCP tries to calculate the round trip time (RTT) for a packet and its acknowledgement. From the RTT, TCP can guess how long it should wait before timing out.
16 Round Trip Time (RTT): RTT = time for the packet to arrive at the destination + time for the ACK to return from the destination.
17 RTT Calculation: The sender transmits a 2K segment (SEQ=0) at 0.9 sec, and the receiver’s ACK = 2048 arrives back at 2.2 sec. RTT = 2.2 sec - 0.9 sec = 1.3 sec.
18 Smoothing the RTT measurement: First, we must smooth the round trip time because of variations in delay within the network: SRTT = α × SRTT + (1 - α) × RTT, where RTT is the measurement from the arriving ACK. The smoothed round trip time (SRTT) weights previously received RTTs by the parameter α; α is typically equal to 0.875.
19 Retransmission Timeout Interval (RTO): The timeout value is then calculated by multiplying the smoothed RTT by some factor (greater than 1) called β: Timeout = β × SRTT. The coefficient β is included to allow for some variation in the round trip times.
20 Example: Initial SRTT = 1.50 s, α = 0.875, β = 4.0
RTT Meas. | SRTT | Timeout
1.5 s | 1.50 s | 6.00 s
1.0 s | 1.44 s | 5.76 s
2.2 s | 1.54 s | 6.16 s
1.0 s | 1.47 s | 5.88 s
0.8 s | 1.39 s |
3.1 s | |
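The smoothing and timeout arithmetic can be checked with a short script. This is a sketch under stated assumptions: the function name is hypothetical, the weight 0.875 and factor 4.0 are the example's parameters, and each SRTT is rounded to two decimals (half up) before the next step, which is an assumption about how the table's figures were produced.

```python
from decimal import Decimal, ROUND_HALF_UP

ALPHA = Decimal("0.875")  # weight given to the previous SRTT
BETA = Decimal("4.0")     # timeout factor
CENT = Decimal("0.01")    # round every result to two decimals


def smooth_rtt(initial_srtt, samples, alpha=ALPHA, beta=BETA):
    """Return (RTT, SRTT, Timeout) rows, one per RTT measurement."""
    srtt = Decimal(initial_srtt)
    rows = []
    for sample in samples:
        rtt = Decimal(sample)
        # SRTT = alpha * SRTT + (1 - alpha) * RTT
        srtt = (alpha * srtt + (1 - alpha) * rtt).quantize(CENT, rounding=ROUND_HALF_UP)
        # Timeout = beta * SRTT
        timeout = (beta * srtt).quantize(CENT, rounding=ROUND_HALF_UP)
        rows.append((rtt, srtt, timeout))
    return rows


rows = smooth_rtt("1.50", ["1.5", "1.0", "2.2", "1.0", "0.8", "3.1"])
for rtt, srtt, timeout in rows:
    print(f"RTT {rtt} s  SRTT {srtt} s  Timeout {timeout} s")
```

Decimal arithmetic is used instead of floats so that values such as 1.535 round the same way on every run.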
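21 Problem with RTT Calculation: The sender transmits a 2K segment (SEQ=0), times out, and retransmits the same segment. When ACK = 2048 finally arrives, it is unclear which transmission it acknowledges, so what is the RTT?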
22 Karn’s Algorithm: Retransmission ambiguity: measure the RTT from the original data segment, or measure the RTT from the most recent segment? Either way there is a problem in the RTT estimate. One solution: never update RTT measurements based on acknowledgements for retransmitted packets. Problem: a sudden change in RTT can cause the system never to update the RTT, for example when a primary path failure leads to a slower secondary path.
23 Karn’s algorithm: Use back-off as part of the RTO computation. Whenever a packet is lost, the RTO is increased by a factor. Use this increased RTO as the RTO estimate for the next segment (not the value derived from SRTT). Only after an acknowledgment is received for a segment that was transmitted successfully without retransmission is the timer set to the new RTO obtained from SRTT.
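The back-off rule can be illustrated with a small timer class. This is a minimal sketch, not an implementation of any real TCP stack: the class name `KarnTimer`, the back-off factor of 2, and the weights `alpha` and `beta` are all assumptions made for the example.

```python
class KarnTimer:
    """Hypothetical retransmission timer illustrating Karn's algorithm."""

    def __init__(self, srtt, alpha=0.875, beta=2.0, backoff=2.0):
        self.srtt = srtt
        self.alpha = alpha      # smoothing weight (assumed value)
        self.beta = beta        # timeout factor (assumed value)
        self.backoff = backoff  # RTO multiplier on loss (assumed value)
        self.rto = beta * srtt

    def on_timeout(self):
        # Packet presumed lost: back off the RTO, leave SRTT untouched.
        self.rto *= self.backoff

    def on_ack(self, rtt_sample, was_retransmitted):
        if was_retransmitted:
            # Ambiguous sample: ignore it and keep the backed-off RTO.
            return
        # Clean sample: update SRTT and derive the RTO from it again.
        self.srtt = self.alpha * self.srtt + (1 - self.alpha) * rtt_sample
        self.rto = self.beta * self.srtt


timer = KarnTimer(srtt=1.0)
timer.on_timeout()                           # RTO backs off
timer.on_ack(5.0, was_retransmitted=True)    # sample ignored
timer.on_ack(2.0, was_retransmitted=False)   # RTO recomputed from SRTT
print(timer.srtt, timer.rto)
```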
24 Another Problem with RTT Calculation: RTT measurements can sometimes fluctuate severely; the smoothed RTT (SRTT) is not a good reflection of the round-trip time in these cases. Solution: use the Jacobson/Karels algorithm: Error = RTT - SRTT; SRTT = SRTT + g × Error; Dev = Dev + h × (|Error| - Dev); Timeout = SRTT + k × Dev, where g and h are gains between 0 and 1 and k is a multiplier on the deviation.
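25 Jacobson/Karels Algorithm Example: Starting from an initial SRTT and Dev, apply the following update to each RTT measurement: Error = RTT - SRTT; SRTT = SRTT + (g × Error); Dev = Dev + [h × (|Error| - Dev)]; Timeout = SRTT + (k × Dev). Table columns: RTT Meas., SRTT, Error, Dev., Timeout. RTT measurements: 1.5 s, 1.0 s, 2.2 s, 1.0 s, 0.8 s, 3.1 s, 2.0 s.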
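The update rule can be run over the RTT measurements with a few lines of Python. This is a sketch only: the gains `g = 0.125` and `h = 0.25`, the deviation multiplier `k = 4`, and the starting values SRTT = 1.5 s and Dev = 0 are assumptions (commonly cited choices), not values recovered from the slide.

```python
def jacobson_karels(samples, srtt, dev, g=0.125, h=0.25, k=4.0):
    """Return (RTT, SRTT, Error, Dev, Timeout) rows, one per measurement.

    g, h and k are assumed parameter values for illustration.
    """
    rows = []
    for rtt in samples:
        error = rtt - srtt                  # Error = RTT - SRTT
        srtt = srtt + g * error             # SRTT = SRTT + g * Error
        dev = dev + h * (abs(error) - dev)  # Dev = Dev + h * (|Error| - Dev)
        timeout = srtt + k * dev            # Timeout = SRTT + k * Dev
        rows.append((rtt, srtt, error, dev, timeout))
    return rows


# Assumed starting point: SRTT = 1.5 s, Dev = 0.
rows = jacobson_karels([1.5, 1.0, 2.2, 1.0, 0.8, 3.1], srtt=1.5, dev=0.0)
for rtt, srtt, error, dev, timeout in rows:
    print(f"{rtt:4.1f} {srtt:7.4f} {error:8.4f} {dev:7.4f} {timeout:7.4f}")
```

Because the timeout tracks the deviation as well as the mean, a run of widely varying samples raises the timeout faster than a fixed multiple of SRTT would.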
26 Example RTT computation
27 TCP Flow Control: TCP uses a modified version of the sliding window. In acknowledgements, TCP uses the “Window size” field to tell the sender how many bytes it may transmit. TCP sequence numbers count bytes, not packets.
28 TCP Flow Control (cont’d): Important information in TCP/IP packet headers. Send direction: the number of bytes in the packet (N), contained in the IP header, and the sequence number of the first data byte in the packet (SEQ), contained in the TCP header. Receive direction (ACK bit set): the sequence number of the next expected byte (ACK) and the window size at the receiver (WIN), both contained in the TCP header.
29 Example TCP session
remus: telnet scully 7
remus:$ tcpdump -S host scully
Kernel filter, protocol ALL, datagram packet socket
tcpdump: listening on all devices
15:15: eth0 > remus.4706 > scully.echo: S : (0) win
15:15: eth0 < scully.echo > remus.4706: S : (0) ack win
15:15: eth0 > remus.4706 > scully.echo: : (0) ack win
30 Example TCP session
Packet 1: 15:15: eth0 > remus.4706 > scully.echo: S : (0) win (DF)
Packet 2: 15:15: eth0 < scully.echo > remus.4706: S : (0) ack win 8760 <mss 1460>
Packet 3: 15:15: eth0 > remus.4706 > scully.echo: : (0) ack win
Fields, left to right: timestamp, source IP/port, destination IP/port, flags, start sequence number, end sequence number, acknowledgement number, window, options.
31 TCP data transfer
Packet 4: 15:15: eth0 > remus.4706 > scully.echo: P : (3) ack win
Packet 5: 15:15: eth0 < scully.echo > remus.4706: P : (3) ack win 8760
The value in parentheses, (3), is the number of data bytes in the packet.
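32 TCP Flow Control (cont’d): Example with a 4K receiver buffer (initially empty).
(1) The sending application does a 2K write. Sender to receiver: 2K, SEQ=0. The receiver’s buffer now holds 2K and has 2K empty.
(2) Receiver to sender: ACK = 2048, WIN = 2048.
(3) The sending application does a 3K write. Sender to receiver: 2K, SEQ=2048. The receiver’s buffer is now full.
(4) Receiver to sender: ACK = 4096, WIN = 0. The sender is blocked.
(5) The receiving application reads 2K. Receiver to sender: ACK = 4096, WIN = 2048. The sender may send up to 2K.
(6) Sender to receiver: 1K, SEQ=4096. The receiver’s buffer now holds 2K plus 1K.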
33 TCP Flow Control (cont’d): Piggybacking allows more efficient bidirectional communication. A segment from A to B carries the data from A to B (N, SEQ) together with the ACK for data from B to A (ACK, WIN). A segment from B to A carries the data from B to A together with the ACK for data from A to B.
34 TCP Congestion Control: Recall: the network layer is responsible for congestion control. However, TCP/IP blurs the distinction. In TCP/IP, the network layer (IP) simply handles routing and packet forwarding; congestion control is done end-to-end by TCP.
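35 Self-Clocking Model: The sender reaches the receiver over a fast link followed by a bottleneck link; data flows toward the receiver and ACKs flow back. Steps: 1. Send burst. 2. Receive data packet. 3. Send acknowledgement. 4. Receive acknowledgement. 5. Send a data packet. Given: Pb = Pr = Ar = Ab (in units of time), where P is the spacing between data packets and A is the spacing between ACKs (b = bottleneck link, r = receiver). Sending a packet on each ACK keeps the bottleneck link busy.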
36 Changing bottleneck bandwidth: One router with finite buffers; the sender retransmits lost packets. Host A sends into finite shared output link buffers and Host B receives. λin: original data. λ'in: original data, plus retransmitted data. λout: rate delivered at Host B.
37 TCP Congestion Control: Goal: achieve the self-clocking state, even if we don’t know the bandwidth of the bottleneck, and even though the bottleneck may change over time. Two phases keep the bottleneck busy. Slow-start ramps up to the bottleneck limit; packet loss signals that we passed the bandwidth of the bottleneck. Congestion Avoidance tries to maintain self-clocking mode once established.
38 TCP Congestion Window: TCP introduces a second window, called the “congestion window”. This window maintains TCP’s best estimate of the amount of outstanding data to allow in the network to achieve self-clocking.
39 TCP Congestion Window: To determine how many bytes it may send, the sender takes the minimum of the receiver window and the congestion window. Example: if the receiver window says the sender can transmit 8K, but the congestion window is only 4K, then the sender may only transmit 4K. If the congestion window is 8K but the receiver window says the sender can transmit 4K, then the sender may only transmit 4K.
40 TCP Slow Start Phase: TCP defines the “maximum segment size” (MSS) as the maximum amount of data a TCP segment can carry (not including the TCP header). TCP Slow Start: the congestion window starts small, at 1 segment size. Each time a transmitted segment is acknowledged, the congestion window is increased by one maximum segment size. On each ACK, cwnd = cwnd + 1 (counted in segments).
41 TCP Slow Start (cont’d)
Congestion Window Size | Event
1K | A sends 1 segment to B; B ACKs the segment
2K | A sends 2 segments to B; B ACKs both segments
4K | A sends 4 segments to B; B ACKs all four segments
8K | A sends 8 segments to B; B ACKs all eight segments
16K | … and so on
42 TCP Slow Start (cont’d): The congestion window size grows exponentially (i.e., it keeps on doubling). Packet losses indicate congestion. Packet losses are determined by using timers at the sender. When a timeout occurs, the congestion window is reduced to one maximum segment size and everything starts over.
43 TCP Slow Start: When the connection begins, increase the rate exponentially until the first loss event: double CongWin every RTT, which is done by incrementing CongWin for every ACK received. Summary: the initial rate is slow but ramps up exponentially fast. (Figure: Host A sends one segment to Host B in the first RTT, two segments in the second, and four segments in the third.)
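The doubling follows directly from the per-ACK increment, as the sketch below shows. It assumes an idealized connection in which every segment of a window is acknowledged within one RTT and nothing is lost; the function name and the 1K segment size are illustrative choices.

```python
def slow_start_rounds(rounds, mss=1024):
    """Congestion window (bytes) at the start of each RTT during slow start."""
    cwnd = mss
    history = [cwnd]
    for _ in range(rounds):
        acks = cwnd // mss      # one ACK per segment sent in this RTT
        for _ in range(acks):
            cwnd += mss         # each ACK grows the window by one MSS
        history.append(cwnd)
    return history


print(slow_start_rounds(4))
```

Each RTT produces as many ACKs as there were segments in the window, so adding one segment per ACK doubles the window.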
44 TCP Slow Start (cont’d): (Figure: congestion window versus transmission number; the window drops back to 1 maximum segment size at each timed-out transmission.)
45 TCP Slow Start (cont’d): TCP Slow Start by itself is inefficient. Although the congestion window builds exponentially, it drops to 1 segment size every time a packet times out. This leads to low throughput.
46 TCP Linear Increase Threshold: Establish a threshold at which the rate increase becomes linear instead of exponential, to improve efficiency. Algorithm: Start the threshold (ssthresh) at 64K. Begin in slow start. Once the threshold is passed, increase the congestion window by only 1 segment size for each congestion window of data transmitted: for each ACK received, cwnd = cwnd + (mss*mss)/cwnd. If a timeout occurs, reset the congestion window size to 1 segment and set the threshold to max(2*mss, 1/2 of MIN(sliding window, congestion window)).
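The per-ACK and per-timeout rules can be written as two small functions. This is a sketch with hypothetical function names; it assumes window sizes in bytes, integer arithmetic, and that slow start applies while cwnd is strictly below the threshold (the behaviour at exact equality is a choice made here).

```python
def on_ack(cwnd, ssthresh, mss):
    """Return the new congestion window after one ACK."""
    if cwnd < ssthresh:
        return cwnd + mss               # slow start: one MSS per ACK
    return cwnd + (mss * mss) // cwnd   # linear mode: about one MSS per window


def on_timeout(cwnd, sliding_window, mss):
    """Return (new cwnd, new threshold) after a timeout."""
    threshold = max(2 * mss, min(sliding_window, cwnd) // 2)
    return mss, threshold


KB = 1024
print(on_ack(4 * KB, 64 * KB, KB))        # slow start step
print(on_ack(32 * KB, 32 * KB, KB))       # linear step
print(on_timeout(40 * KB, 64 * KB, KB))   # back to 1 segment, threshold halved
```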
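47 TCP Linear Increase Threshold Phase: Example: maximum segment size = 1K; assume SSthresh = 32K. (Figure: congestion window versus transmission number, with marks at 1K, 20K, 32K, and 40K and the thresholds drawn as horizontal lines.) A timeout occurs when MIN(sliding window, congestion window) = 40K, so the threshold drops to 20K and the congestion window restarts at 1K.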
48 TCP Fast Retransmit: Another enhancement to TCP congestion control. Idea: when the sender sees 3 duplicate ACKs, it assumes something went wrong, and the packet is retransmitted immediately instead of waiting for it to time out. Why? ACKs are sent by the receiver when it receives a packet, so a duplicate ACK implies that something is getting through. This is better than waiting for a timeout.
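49 TCP Fast Retransmit Example: MSS = 1K.
Receiver to sender: ACK = 2048, WIN = 31K
Sender to receiver: 1K, SEQ=2048 (lost)
Sender to receiver: 1K, SEQ=3072. Receiver to sender: ACK = 2048, WIN = 30K (Duplicate ACK #1)
Sender to receiver: 1K, SEQ=4096. Receiver to sender: ACK = 2048, WIN = 29K (Duplicate ACK #2)
Sender to receiver: 1K, SEQ=5120. Receiver to sender: ACK = 2048, WIN = 28K (Duplicate ACK #3)
Fast Retransmit occurs (the 2nd packet is now retransmitted without waiting for it to time out)
Sender to receiver: 1K, SEQ=6144. Receiver to sender: ACK = 2048, WIN = 27K
Sender to receiver: 1K, SEQ=2048 (retransmission). Receiver to sender: ACK = 7168, WIN = 26K (ACK of new data)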
50 TCP Fast Recovery: Yet another enhancement to TCP congestion control. Idea: don’t do a slow start after a fast retransmit. Instead, use this algorithm: (1) Drop the threshold to max(2*mss, 1/2 of MIN(sliding window, congestion window)). (2) Set the congestion window to threshold + 3 * MSS. (3) For each duplicate ACK (after the fast retransmit), increment the congestion window by MSS. (4) When the next non-duplicate ACK arrives, set the congestion window equal to the threshold.
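The four steps translate into three small functions. This is a sketch with hypothetical function names, assuming window sizes in bytes; the sample numbers (sliding window 28K, congestion window 20K, MSS 1K) are chosen for illustration.

```python
KB = 1024


def enter_fast_recovery(sliding_window, cwnd, mss):
    """Steps 1 and 2: return (threshold, cwnd) right after the fast retransmit."""
    threshold = max(2 * mss, min(sliding_window, cwnd) // 2)
    return threshold, threshold + 3 * mss


def on_duplicate_ack(cwnd, mss):
    """Step 3: each further duplicate ACK inflates the window by one MSS."""
    return cwnd + mss


def on_new_ack(threshold):
    """Step 4: the first non-duplicate ACK deflates the window to the threshold."""
    return threshold


threshold, cwnd = enter_fast_recovery(28 * KB, 20 * KB, KB)
print(threshold // KB, cwnd // KB)   # threshold and window in K
cwnd = on_duplicate_ack(cwnd, KB)
print(cwnd // KB)
cwnd = on_new_ack(threshold)
print(cwnd // KB)
```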
51 TCP Fast Recovery Example: Continuing with the Fast Retransmit example (MSS = 1K; SW = sliding window, TH = congestion threshold, CW = congestion window).
Before the third duplicate ACK: SW=29K, TH=15K, CW=20K
Third duplicate ACK arrives (ACK = 2048, WIN = 28K): SW=28K, TH=15K, CW=20K
Fast Retransmit occurs (1K, SEQ=2048): SW=28K, TH=10K, CW=13K
Duplicate ACK for 1K, SEQ=6144 arrives (ACK = 2048, WIN = 27K): SW=27K, TH=10K, CW=14K
ACK of new data arrives (ACK = 7168, WIN = 26K): SW=26K, TH=10K, CW=10K
52 Resulting TCP Sawtooth: (Figure: congestion window versus transmission number, with marks at 1K, 20K, 32K, and 40K; slow start is followed by linear mode, producing a sawtooth around the bottleneck capacity.) In steady state, the window oscillates around the bottleneck’s capacity (i.e., the number of outstanding bytes in transit).
53 TCP Recap: Timeout computation: the timeout is a function of 2 values, the weighted average of the sampled RTTs and the sampled deviation of the RTTs. Congestion control: the goal is to keep the self-clocking pipe full in spite of changing network conditions. 3 key variables: the sliding window (receiver flow control), the congestion window (sender flow control), and the threshold (the sender’s line between slow start and linear mode).
54 TCP Recap (cont): Slow start: add 1 segment to the congestion window for each ACK; this doubles the congestion window’s volume each RTT. Linear mode (Congestion Avoidance): add 1 segment’s worth of data for each congestion window of data acknowledged; this adds 1 segment per RTT.
55 Algorithm Summary: TCP Congestion Control: When CongWin is below Threshold, the sender is in the slow-start phase and the window grows exponentially. When CongWin is above Threshold, the sender is in the congestion-avoidance phase and the window grows linearly. When a triple duplicate ACK occurs, Threshold is set to max(FlightSize/2, 2*mss) and CongWin is set to Threshold + 3*mss (fast retransmit, fast recovery). When a timeout occurs, Threshold is set to max(FlightSize/2, 2*mss) and CongWin is set to 1 MSS. FlightSize: the amount of data that has been sent but not yet acknowledged.
56 TCP sender congestion control
Event | State | TCP Sender Action | Commentary
ACK receipt for previously unacked data | Slow Start (SS) | CongWin = CongWin + MSS; if (CongWin > Threshold) set state to “Congestion Avoidance” | Resulting in a doubling of CongWin every RTT
ACK receipt for previously unacked data | Congestion Avoidance (CA) | CongWin = CongWin + MSS * (MSS/CongWin) | Additive increase, resulting in increase of CongWin by 1 MSS every RTT
Loss event detected by triple duplicate ACK | SS or CA | Threshold = max(FlightSize/2, 2*mss); CongWin = Threshold + 3*mss; set state to “Congestion Avoidance” | Fast recovery, implementing multiplicative decrease. CongWin will not drop below 1 MSS.
Timeout | SS or CA | Threshold = max(FlightSize/2, 2*mss); CongWin = 1 MSS; set state to “Slow Start” | Enter slow start
Duplicate ACK | SS or CA | Increment duplicate ACK count for segment being acked | CongWin and Threshold not changed
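The event table maps naturally onto a small state machine. The class below is a sketch only: the name `TcpSender`, its method names, the integer arithmetic, and the caller-supplied `flight_size` argument are assumptions made for illustration, and no real segments are sent.

```python
class TcpSender:
    """Hypothetical sender tracking CongWin, Threshold and state per event."""

    def __init__(self, mss, threshold):
        self.mss = mss
        self.cong_win = mss
        self.threshold = threshold
        self.state = "SS"
        self.dup_acks = 0

    def on_new_ack(self):
        # ACK receipt for previously unacked data.
        self.dup_acks = 0
        if self.state == "SS":
            self.cong_win += self.mss
            if self.cong_win > self.threshold:
                self.state = "CA"
        else:
            self.cong_win += self.mss * self.mss // self.cong_win

    def on_duplicate_ack(self, flight_size):
        # Count duplicates; the third one is treated as a loss event.
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.threshold = max(flight_size // 2, 2 * self.mss)
            self.cong_win = self.threshold + 3 * self.mss
            self.state = "CA"

    def on_timeout(self, flight_size):
        self.threshold = max(flight_size // 2, 2 * self.mss)
        self.cong_win = self.mss
        self.state = "SS"
        self.dup_acks = 0


sender = TcpSender(mss=1000, threshold=4000)
for _ in range(4):
    sender.on_new_ack()
print(sender.state, sender.cong_win)
```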
57 TCP Fairness: Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K. (Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.)
58 Why is TCP fair? Two competing sessions: additive increase gives a slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally. (Figure: Connection 2 throughput plotted against Connection 1 throughput, both axes running to R, with the equal bandwidth share line. Congestion avoidance: additive increase. Loss: decrease window by a factor of 2.)
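A toy simulation shows why additive increase with multiplicative decrease pulls two sessions toward an equal share. It is a simplified model under stated assumptions: both sessions increase by one unit per step, both see a loss and halve their rate whenever their combined rate exceeds the link capacity, and the function name and starting rates are illustrative.

```python
def aimd(x1, x2, capacity, steps):
    """Return the two session rates after the given number of steps."""
    for _ in range(steps):
        x1 += 1.0                  # additive increase
        x2 += 1.0
        if x1 + x2 > capacity:     # loss: both sessions halve
            x1 /= 2.0
            x2 /= 2.0
    return x1, x2


start = (1.0, 50.0)
end = aimd(*start, capacity=100.0, steps=200)
print(abs(start[0] - start[1]), abs(end[0] - end[1]))
```

Additive increase leaves the gap between the two rates unchanged, while every halving cuts it in two, so the gap shrinks toward zero over repeated loss events.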
59 Fairness (more): Fairness and UDP: multimedia apps often do not use TCP because they do not want their rate throttled by congestion control. Instead they use UDP: pump audio/video at a constant rate and tolerate packet loss. Research area: TCP friendly. Fairness and parallel TCP connections: nothing prevents an app from opening parallel connections between 2 hosts; Web browsers do this. Example: a link of rate R supports 9 connections. If a new app asks for 1 TCP connection, it gets rate R/10; if the new app asks for 11 TCP connections, it gets about R/2 (11 of the 20 connections).
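The parallel-connection arithmetic can be checked directly. The sketch assumes the link is split equally per connection, as in the fairness goal; the function name is hypothetical.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of the link rate R obtained by the new app's connections."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))    # one new connection among ten
print(app_share(9, 11))   # eleven new connections among twenty
```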
60 Delay modeling: Q: How long does it take to receive an object from a Web server after sending a request? Ignoring congestion, delay is influenced by TCP connection establishment, data transmission delay, and slow start. Notation and assumptions: assume one link between client and server of rate R; S: MSS (bits); O: object size (bits); no retransmissions (no loss, no corruption). Window size: first assume a fixed congestion window of W segments, then a dynamic window, modeling slow start.
61 Fixed congestion window (1): First case: WS/R > RTT + S/R. The ACK for the first segment in the window returns before a window’s worth of data has been sent. delay = 2RTT + O/R
62 Fixed congestion window (2): Second case: WS/R < RTT + S/R. The sender waits for an ACK after sending a window’s worth of data. delay = 2RTT + O/R + (K-1)[S/R + RTT - WS/R], where K is the number of windows that cover the object.
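Both fixed-window cases fit in one function. This is a sketch under the model's assumptions (one link of rate R, no loss); the function name is hypothetical, K is taken as the object size divided by the window size and rounded up, and the sample numbers are illustrative.

```python
import math


def fixed_window_delay(O, S, R, W, rtt):
    """Delay to fetch an object of O bits with a fixed window of W segments."""
    delay = 2 * rtt + O / R
    stall = S / R + rtt - W * S / R   # idle time after each window
    if stall > 0:
        # Second case: the sender waits for an ACK after every window but the last.
        K = math.ceil(O / (W * S))    # number of windows covering the object
        delay += (K - 1) * stall
    return delay


# S = 1000 bits, R = 1000 bits/s, O = 8000 bits (illustrative values)
print(fixed_window_delay(8000, 1000, 1000, W=8, rtt=1.0))  # first case
print(fixed_window_delay(8000, 1000, 1000, W=2, rtt=3.0))  # second case
```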
63 TCP Delay Modeling: Slow Start (1): Now suppose the window grows according to slow start. We will show that the delay for one object is delay = 2RTT + O/R + P[RTT + S/R] - (2^P - 1)S/R, where P = min{Q, K-1} is the number of times TCP idles at the server, Q is the number of times the server would idle if the object were of infinite size, and K is the number of windows that cover the object.
64 TCP Delay Modeling: Slow Start (2): Example: O/S = 15 segments, K = 4 windows, Q = 2, P = min{K-1, Q} = 2, so the server idles P = 2 times. Delay components: 2 RTT for connection establishment and request; O/R to transmit the object; the time the server idles due to slow start.
65 TCP Delay Modeling (3)
66 TCP Delay Modeling (4): Recall that K = the number of windows that cover the object. How do we calculate K? The calculation of Q, the number of idles for an infinite-size object, is similar (see HW).
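One way to compute K and the slow-start delay is to walk through the windows directly. The sketch below is an illustration under the model's assumptions (one link of rate R, no loss, windows of 1, 2, 4, ... segments); the function name is hypothetical. K is the smallest number of doubling windows whose sizes add up to at least O/S, and the server idles after the k-th window for S/R + RTT - 2^(k-1) S/R whenever that quantity is positive.

```python
import math


def slow_start_delay(O, S, R, rtt):
    """Return (delay, K, idles) for an object of O bits under slow start."""
    segments = math.ceil(O / S)
    # K: smallest number of windows of sizes 1, 2, 4, ... covering the object.
    K, covered = 0, 0
    while covered < segments:
        covered += 2 ** K
        K += 1
    delay = 2 * rtt + O / R
    idles = 0
    for k in range(1, K):   # no idle time after the last window
        stall = S / R + rtt - (2 ** (k - 1)) * S / R
        if stall > 0:
            delay += stall
            idles += 1
    return delay, K, idles


# O/S = 15 segments with S/R = 1 s and RTT = 2 s (illustrative values)
print(slow_start_delay(15000, 1000, 1000, 2.0))
```

Summing the stall times window by window avoids computing Q separately; the count of positive stalls plays the role of P.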