TCP Congestion Control at the Network Edge

1 TCP Congestion Control at the Network Edge
COS 561: Advanced Computer Networks
Jennifer Rexford, Fall 2017 (TTh 1:30-2:50 in CS 105)

2 Original TCP Design
(Figure: end hosts communicating across the Internet)

3 Wireless and Data-Center Networks
(Figure: a wireless access network at the edge of the Internet)

4 Wireless Networks

5 TCP Design Setting
- Relatively low packet loss (e.g., hopefully less than 1%)
  - Okay to retransmit lost packets from the sender
- Loss is caused primarily by congestion
  - Use loss as an implicit signal of congestion, and reduce the sending rate
- Relatively stable round-trip times
  - Use the RTT estimate in the retransmission timer
- End-points are always on, with stable IP addresses
  - Use IP addresses as end-point identifiers

6 Problems in Wireless Networks
- Limited bandwidth
- High latencies
- High bit-error rates
- Temporary disconnections
- Slow handoffs
- Mobile device disconnects to save energy, bearers, etc.

7 Link-Level Retransmission
- Retransmit over the wireless link
  - Hide packet losses from the end-to-end connection by retransmitting lost packets on the wireless link
- Works for any transport protocol

8 Split Connection
- Two TCP connections
  - Between the fixed host and the base station
  - Between the base station and the mobile device
- Other optimizations
  - Compression, just-in-time delivery, etc.

9 Burst Optimization
- Radio wakeup is expensive
  - Establish a bearer; use battery and signaling resources
- Burst optimization
  - Send bigger chunks less often, to allow the mobile device to go to the idle state

10 Lossless Handover
- Mobile moves from one base station to another
  - Packets in flight still arrive at the old base station, and could lead to bursty loss (and TCP timeout)
- Old base station can buffer packets
  - Send buffered packets to the new base station

11 Freezing the Connection
- Mobile device can predict a temporary disconnection (e.g., fading, handoff)
- Mobile can ask the fixed host to stop sending
  - Advertise a receive window of 0
- Benefits
  - Avoids wasted transmission of data
  - Avoids loss that triggers timeouts, decrease in cwnd, etc.

12 Data-Center Networks

13 Modular Network Topology
- Containers
- Racks
  - Multiple servers
  - Top-of-rack switches

14 Tree-Like Topologies
(Topology figure: core routers (CR) and aggregation routers (AR) above layers of switches (S) connecting the servers (A))
- Many equal-cost paths
- Small round-trip times (e.g., < 250 microseconds)

15 Commodity Switches
- Low-cost switches, especially for top-of-rack switches
- Simple memory architecture
  - Small packet-buffer space
  - Buffer shared over all input ports
  - Simple drop-tail queues

16 Multi-Tier Applications
(Figure: a front-end server fans out requests to aggregators, which fan out to many workers)

17 Application Mix
- Partition-aggregate workflow
  - Multiple workers working in parallel
  - A straggler slows down the entire system
  - Many workers send their responses at the same time
- Diverse mix of traffic
  - Low latency for short flows
  - High throughput for long flows
- Multi-tenancy
  - Many tenants sharing the same platform
  - Running the network at high levels of utilization
  - Small number of large flows on links

18 TCP Incast Problem
- Multiple workers transmitting to one aggregator
  - Many flows traversing the same link
  - Burst of packets sent at (nearly) the same time into a relatively small switch memory
  - Leading to high packet loss
- Some results are slow to arrive
  - May be excluded from the final results
- Developer software changes
  - Limit the size of worker responses
  - Randomize the sending time for responses (see the sketch below)
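A minimal sketch of the second developer-side mitigation, adding a small random jitter before a worker responds so that synchronized workers do not burst into the same shallow switch buffer. The function and parameter names (send_worker_response, max_jitter_ms) are hypothetical, not part of any real framework.

```python
import random
import socket
import time

def send_worker_response(aggregator_addr, payload, max_jitter_ms=10):
    """Hypothetical incast mitigation: delay each worker's response by a small
    random amount so that many workers answering the same query do not all
    burst into the top-of-rack switch buffer at the same instant."""
    time.sleep(random.uniform(0, max_jitter_ms) / 1000.0)  # de-synchronize senders
    with socket.create_connection(aggregator_addr) as s:
        s.sendall(payload)
```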

19 Queue Buildup
- Mix of long and short flows
  - Long flows fill up the buffers in switches, causing queuing delay (and loss) for the short flows
  - E.g., queuing delay of 1-14 milliseconds
- Large relative to propagation delay
  - E.g., 100 microseconds intra-rack, 250 microseconds inter-rack
  - Leading to RTT variance and a big throughput drop
- Shared switch buffers
  - Short flows on one port affected by long flows on other ports

20 TCP Outcast Problem
- Mix of flows at two different input ports
  - Many inter-rack flows and a few intra-rack flows, destined for the same output
- Burst of packet arrivals
  - Arriving on one input port, causing bursty loss for the other
- Harmful to the intra-rack flows
  - Lose multiple packets; loss detected by timeout
  - Irony: worse throughput despite lower RTT!

21 Delayed Acknowledgments
- Sending ACKs can be expensive
  - E.g., send a 40-byte ACK packet for each data packet
- Delayed ACKs reduce the overhead
  - Receiver waits before sending the ACK, in the hope of piggybacking the ACK on a response
- Delayed-ACK mechanism
  - Set a timer when the data arrives (e.g., 200 msec)
  - Piggyback the ACK, send an ACK for every other packet, or send an ACK after the timer expires
- The timeout for a delayed ACK is an eternity!
  - Disable delayed ACKs, or shorten the timer (see the sketch below)
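A minimal sketch of one way to suppress delayed ACKs on the receiver, assuming Linux's TCP_QUICKACK socket option; the option is Linux-specific, is not sticky, and other operating systems expose different knobs.

```python
import socket

# TCP_QUICKACK is Linux-specific; fall back to its <netinet/tcp.h> value (12)
# if this Python build does not expose the constant.
TCP_QUICKACK = getattr(socket, "TCP_QUICKACK", 12)

def recv_with_quickack(sock, bufsize=4096):
    """Sketch: ask the kernel to ACK immediately instead of delaying.
    TCP_QUICKACK is cleared by the kernel, so it is re-armed around every read."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_QUICKACK, 1)
    data = sock.recv(bufsize)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_QUICKACK, 1)  # re-arm after the read
    return data
```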

22 Data-Center TCP (DCTCP)
- Key observation
  - TCP reacts to the presence of congestion, not to the extent of congestion
- Measuring the extent of congestion
  - Mark packets when the buffer exceeds a threshold
- Reacting to congestion
  - Reduce cwnd in proportion to the fraction of marked packets (see the sketch below)
- Benefits
  - React early, as the queue starts to build
  - Prevent harm to packets on other ports
  - Get workers to reduce their sending rate early
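A minimal sketch of the sender-side window update described in the DCTCP paper: the fraction of ECN-marked packets in the last window is smoothed with gain g, and cwnd is cut in proportion to that estimate instead of being halved. The per-window update cadence is simplified here.

```python
def dctcp_update(cwnd, alpha, marked_pkts, total_pkts, g=1.0 / 16):
    """Sketch of the DCTCP window update: estimate the fraction of marked
    packets (F) with an EWMA, then back off in proportion to the estimate."""
    F = marked_pkts / max(total_pkts, 1)          # extent of congestion this window
    alpha = (1 - g) * alpha + g * F               # smoothed congestion estimate
    if marked_pkts > 0:
        cwnd = max(cwnd * (1 - alpha / 2), 1.0)   # gentle, proportional backoff
    return cwnd, alpha
```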

23 Poor Multi-Path Load Balancing
- Multiple shortest paths between pairs of hosts
  - Spread the load over multiple paths
- Equal-cost multipath (ECMP)
  - Round robin or hash-based (see the sketch below)
- Uneven load
  - Elephant flows congest some paths while other paths are lightly loaded
- Reducing congestion
  - Careful routing of elephant flows
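A minimal sketch of hash-based ECMP path selection over the flow 5-tuple; the hash function and field names are illustrative, not what any particular switch implements.

```python
import hashlib

def ecmp_pick_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Sketch of hash-based ECMP: hash the 5-tuple so every packet of a flow
    follows the same path, while different flows spread across paths.
    Two elephant flows can still hash onto the same path, which is the
    uneven-load problem noted above."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % num_paths
```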

24 Discussion: DCTCP paper

25 Questions
- What problem does the paper solve, and why?
- What are the novel/interesting aspects of the solution?
- What are the strengths of the paper?
- What are the weaknesses of the paper?
- What follow-on work would you recommend to build on, extend, or strengthen the work?

26 TCP Performance Debugging
Optional reading: the SNAP paper

27 Challenges of Datacenter Diagnosis
- Large, complex applications
  - Hundreds of application components
  - Tens of thousands of servers
- New performance problems
  - Code is updated to add features or fix bugs
  - Components change while the app is in operation
  - Constant influx of new developers
- Old performance problems (human factors)
  - Developers may not understand the network well
  - Low-level mechanisms such as Nagle's algorithm, delayed ACK, and silly-window-syndrome avoidance are a mystery to developers without a networking background, yet can have a significant performance impact

28 Diagnosis in Data Centers
- App logs (#requests/sec, response time, e.g., 1% of requests > 200 ms delay): application-specific
- Packet traces (filter the trace for long-delay requests): too expensive; a few packet-sniffer machines can monitor only a couple of racks of servers
- Switch logs (#bytes/packets per minute): too coarse-grained
- SNAP (diagnoses network-application interactions in the host OS): generic, fine-grained, and lightweight

29 Collect Data in TCP Stack
- TCP understands network-application interactions
  - Flow control: how much data apps want to read/write
  - Congestion control: network delay and congestion
- Collect TCP-level statistics
  - Defined by RFC 4898
  - Already exists in today's Linux and Windows OSes (see the sketch below)
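A minimal sketch of polling per-connection statistics on Linux via the TCP_INFO socket option (Linux's counterpart to the RFC 4898 statistics mentioned above). The struct tcp_info layout is kernel-specific, so the offsets below are an assumption that should be checked against <linux/tcp.h>; this is not how SNAP itself collects data.

```python
import socket
import struct

def tcp_info_snapshot(sock):
    """Poll a few TCP_INFO fields for one connection (Linux-only sketch)."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 192)
    # First 8 single-byte fields: state, ca_state, retransmits, probes,
    # backoff, options, wscale bitfield, app_limited bitfield.
    (state, ca_state, retransmits, probes,
     backoff, options, wscales, app_limited) = struct.unpack_from("8B", raw, 0)
    # 32-bit counters follow at offset 8: rto, ato, snd_mss, rcv_mss,
    # unacked, sacked, lost, retrans, ...
    rto, ato, snd_mss, rcv_mss, unacked, sacked, lost, retrans = \
        struct.unpack_from("8I", raw, 8)
    return {"state": state, "lost": lost, "retrans": retrans,
            "rto_us": rto, "snd_mss": snd_mss}
```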

30 TCP-Level Statistics
- Cumulative counters
  - Packet loss: #FastRetrans, #Timeout
  - RTT estimation: #SampleRTT, #SumRTT
  - Receiver: RwinLimitTime
  - Calculate the difference between two polls; counters catch every event even when the polling interval is large
- Instantaneous snapshots
  - #Bytes in the send buffer
  - Congestion window size, receiver window size
  - Representative snapshots based on Poisson sampling: periodic sampling can miss interesting values, while Poisson sampling (PASTA: Poisson Arrivals See Time Averages) gives a statistically accurate overview independent of the underlying distribution (see the sketch below)
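A minimal sketch of Poisson-sampled polling for the instantaneous variables, assuming a caller-supplied snapshot_fn (hypothetical) that returns whatever values are being tracked.

```python
import random
import time

def poisson_poll(snapshot_fn, mean_interval_s=0.5, duration_s=60):
    """Sketch of Poisson sampling: exponentially distributed gaps between
    snapshots give an unbiased view of instantaneous variables (PASTA),
    unlike fixed-period polling that can alias with application behavior."""
    samples, deadline = [], time.time() + duration_s
    while time.time() < deadline:
        time.sleep(random.expovariate(1.0 / mean_interval_s))
        samples.append(snapshot_fn())  # e.g., bytes in send buffer, cwnd, rwnd
    return samples
```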

31 Life of a Data Transfer
(Stages: sender app -> send buffer -> network -> receiver; this simple life cycle shows where performance impairments can happen and underlies the taxonomy SNAP uses.)
- Application generates the data
  - No network problem
- Copy data to the send buffer
  - Send buffer not large enough
- TCP sends data to the network
  - Fast retransmission
  - Timeout
- Receiver receives the data and ACKs
  - Not reading fast enough (CPU, disk, etc.)
  - Not ACKing fast enough (delayed ACK)

32 SNAP Architecture
- At each host, for every connection: collect data (with a tunable polling rate to reduce overhead), then run a performance classifier
- Input from the management system: topology and routing information, and the mapping from connections to processes/apps
- Cross-connection correlation: connections sharing the same switch/link, host, or application code
- Output: the offending app, host, link, or switch

33 Pinpoint Problems via Correlation
- Correlation over a shared switch/link/host
  - Packet loss for all the connections going through one switch/host
  - Pinpoint the problematic switch (see the sketch below)
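A minimal sketch of this kind of correlation, assuming hypothetical inputs: conn_stats maps a connection id to whether loss was observed in the interval, and conn_to_switch maps a connection id to the switch it traverses (from the topology/routing input). The threshold is illustrative.

```python
from collections import defaultdict

def suspect_switches(conn_stats, conn_to_switch, loss_threshold=0.8):
    """Flag a switch when most connections traversing it saw loss
    in the same polling interval."""
    lossy, total = defaultdict(int), defaultdict(int)
    for conn, saw_loss in conn_stats.items():
        sw = conn_to_switch[conn]
        total[sw] += 1
        lossy[sw] += int(saw_loss)
    return [sw for sw in total if lossy[sw] / total[sw] >= loss_threshold]
```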

34 Pinpoint Problems via Correlation
- Correlation over an application
  - The same application has problems on all machines
  - Report aggregated application behavior

35 Reducing SNAP Overhead
- Data volume: 120 bytes per connection per poll
- CPU overhead
  - 5% for polling 1K connections with a 500 ms interval
  - 35% for polling 5K connections with a 50 ms interval
  - Increases with #connections and polling frequency
- Solution: adaptive tuning of the polling frequency (see the sketch below)
  - Reduce the polling frequency to stay within a target CPU budget
  - Devote more polling to the more problematic connections
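A minimal sketch of adaptive tuning of the polling interval against a CPU budget; the target, bounds, and doubling/halving policy are illustrative assumptions, not SNAP's actual controller.

```python
def adapt_poll_interval(interval_s, cpu_usage, cpu_target=0.05,
                        min_s=0.05, max_s=5.0):
    """Back off the polling rate when measured CPU overhead exceeds the target,
    and poll more often (e.g., for problematic connections) when there is headroom."""
    if cpu_usage > cpu_target:
        interval_s *= 2          # doubling the interval halves the polling rate
    elif cpu_usage < 0.5 * cpu_target:
        interval_s /= 2          # spare budget: poll twice as often
    return min(max(interval_s, min_s), max_s)
```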

36 Characterizing Performance Limitations
#Apps that are limited for > 50% of the time, by stage of the transfer:
- Send buffer not large enough: 1 app
- Network (fast retransmission, timeout): 6 apps, with self-inflicted packet loss (incast)
- Receiver not reading fast enough (CPU, disk): 8 apps
- Receiver not ACKing fast enough (delayed ACK): 144 apps
Notes: a connection is always limited by one component, and being limited by the app is fine for the network; small writes that trigger Nagle's algorithm interact badly with delayed ACK.

37 Three Example Problems
- Delayed ACK affects delay-sensitive apps
- The congestion window allows sudden bursts
- Significant timeouts for low-rate flows

38 Problem 1: Delayed ACK
- Delayed ACK (a standard TCP receiver mechanism, not something added by developers or operators in the data center) affected many delay-sensitive apps
  - Even #packets per record: ~1,000 records/sec
  - Odd #packets per record: ~5 records/sec (the final packet waits up to 200 ms for the delayed ACK)
- Delayed ACK was used to reduce bandwidth usage and server interrupts
  (Figure: host A sends data to B; B ACKs every other packet, or after the 200 ms timer)
- Proposed solution: delayed ACK should be disabled in data centers
  - Disabling has a cost (e.g., for a configuration-file distribution service with ~1M connections)

39 Problem 2: Sudden Bursts
- Developers increase the congestion window to reduce delay
  - E.g., to send 64 KB of data within 1 RTT
- Developers intentionally keep the congestion window large
  - Disable slow-start restart in TCP, which otherwise drops cwnd after an idle time (see the sketch below)
  - Goal: cwnd is large enough at any time, instead of being reduced after idle periods
(Figure: an aggregator distributing requests; congestion window size over time)
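A minimal sketch of checking the Linux knob behind this behavior; the sysctl net.ipv4.tcp_slow_start_after_idle (set to 0, e.g., via sysctl, to keep cwnd large across idle periods) is Linux-specific, and this helper is illustrative only.

```python
def slow_start_restart_enabled(path="/proc/sys/net/ipv4/tcp_slow_start_after_idle"):
    """Return True if Linux will shrink cwnd after an idle period.
    Operators who want sudden full-window bursts set this knob to 0."""
    with open(path) as f:
        return f.read().strip() == "1"
```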

40 Slow Start Restart
- SNAP diagnosis
  - Significant packet loss
  - The congestion window is too large after an idle period
- Proposed solutions
  - Change apps to send less data during congestion
  - New transport protocols that consider both congestion and delay

41 Problem 3: Timeouts for Low-rate Flows
- SNAP diagnosis
  - More fast retransmissions for high-rate flows (1-10 MB/s)
  - More timeouts for low-rate flows (10-100 KB/s)
- Problem
  - Low-rate flows are not the cause of congestion, but suffer more from it
- Proposed solutions
  - Reduce the timeout value in the TCP stack
  - New ways to handle packet loss for small flows

42 Queue Management Background Slides

43 Router
(Figure: a processor in the control plane; a switching fabric and line cards in the data plane)

44 Line Cards (Interface Cards, Adaptors)
- Packet handling
  - Packet forwarding
  - Buffer management
  - Link scheduling
  - Packet filtering
  - Rate limiting
  - Packet marking
  - Measurement
(Figure: receive from the link, look up, and transmit to/from the switching fabric)

45 Packet Switching and Forwarding
(Figure: router R1 with four links, each with an ingress and an egress; a packet labeled "4" arrives on link 1's ingress, an egress link is chosen, and the packet departs on link 4)

46 Queue Management Issues
- Scheduling discipline
  - Which packet to send?
  - Some notion of fairness? Priority?
- Drop policy
  - When should you discard a packet?
  - Which packet to discard?
- Goal: balance throughput and delay
  - Huge buffers minimize drops, but add to queuing delay (thus higher RTT, longer slow start, ...)

47 FIFO Scheduling and Drop-Tail
- Access to the bandwidth: first-in first-out queue
  - Packets are only differentiated when they arrive
- Access to the buffer space: drop-tail queuing
  - If the queue is full, drop the incoming packet (see the sketch below)
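A minimal sketch of a drop-tail FIFO, the queue discipline described above: packets are served in arrival order and an arriving packet is discarded when the buffer is full.

```python
from collections import deque

class DropTailQueue:
    """FIFO queue with tail-drop when the buffer is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.q = deque()

    def enqueue(self, pkt):
        if len(self.q) >= self.capacity:
            return False          # tail drop: the arriving packet is lost
        self.q.append(pkt)
        return True

    def dequeue(self):
        return self.q.popleft() if self.q else None
```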

48 Bursty Loss From Drop-Tail Queuing
- TCP depends on packet loss
  - Packet loss is the indication of congestion
  - TCP's additive increase drives the network into loss
- Drop-tail leads to bursty loss
  - Congested link: many packets encounter a full queue
  - Synchronization: many connections lose packets at once

49 Slow Feedback from Drop Tail
- Feedback comes only when the buffer is completely full
  - ... even though the buffer has been filling for a while
- Plus, the filling buffer is increasing RTT
  - ... making detection even slower
- Better to give early feedback
  - Get one or two connections to slow down before it's too late!

50 Random Early Detection (RED)
- Router notices that the queue is getting full
  - ... and randomly drops packets to signal congestion
- Packet drop probability
  - Drop probability increases as the average queue length increases
  - Below a minimum threshold, don't drop; otherwise, set the drop probability as a function f(avg queue length) (see the sketch below)
(Figure: drop probability versus average queue length)
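A minimal sketch of the classic RED drop curve and the averaged queue length it is based on; the thresholds and weights are illustrative parameters, and real variants differ in detail.

```python
def update_avg_qlen(avg_qlen, qlen, w=0.002):
    """EWMA of the instantaneous queue length; the small weight is what
    makes RED tolerant of short bursts."""
    return (1 - w) * avg_qlen + w * qlen

def red_drop_probability(avg_qlen, min_th, max_th, max_p=0.1):
    """No drops below min_th, probability rising linearly to max_p at max_th,
    and forced drop beyond max_th."""
    if avg_qlen < min_th:
        return 0.0
    if avg_qlen >= max_th:
        return 1.0
    return max_p * (avg_qlen - min_th) / (max_th - min_th)
```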

51 Properties of RED
- Drops packets before the queue is full
  - In the hope of reducing the rates of some flows
- Drops packets in proportion to each flow's rate
  - High-rate flows are selected more often
- Drops are spaced out in time
  - Helps desynchronize the TCP senders
- Tolerant of burstiness in the traffic
  - By basing decisions on the average queue length

52 Problems With RED
- Hard to get the tunable parameters just right
  - How early to start dropping packets?
  - What slope for the increase in drop probability?
  - What time scale for averaging the queue length?
- RED has mixed adoption in practice
  - If the parameters aren't set right, RED doesn't help
- Many other variations in the research community
  - Names like "Blue" (self-tuning), "FRED", ...

53 Feedback: From Loss to Notification
- Early dropping of packets
  - Good: gives early feedback
  - Bad: has to drop the packet to give the feedback
- Explicit Congestion Notification (ECN)
  - Router marks the packet with an ECN bit
  - Sending host interprets the mark as a sign of congestion (see the sketch below)
  - Requires participation of both hosts and routers
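A minimal sketch of the router-side choice ECN enables: given a congestion probability (computed, for example, as in the RED sketch above), mark ECN-capable packets instead of dropping them, so the sender still gets feedback without losing data. The function and return values are illustrative.

```python
import random

def red_ecn_action(pkt_ecn_capable, congestion_prob):
    """Decide what a RED/ECN router does with an arriving packet:
    mark it (if the flow negotiated ECN), drop it, or forward it untouched."""
    if random.random() < congestion_prob:
        return "mark" if pkt_ecn_capable else "drop"
    return "forward"
```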

