OTCP: SDN-Managed Congestion Control for Data Center Networks
Simon Jouet
simon.jouet@glasgow.ac.uk
https://netlab.dcs.gla.ac.uk
School of Computing Science

Background on TCP
“For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working – exponential backoff –”
Congestion Avoidance and Control, Van Jacobson, 1988
Conservative Congestion Control Settings
- Minimum Retransmission Timeout (RTOmin): 200ms
- Initial Retransmission Timeout (RTOinit): 1s
- Initial Congestion Window (IW): 10 segments
IEEE/IFIP NOMS - 26/04/2016

Partition Aggregate Traffic
- Light request to workers
- Synchronous replies
- Multiple flows
- Typical of DC applications: MapReduce, Memcached, Apache Spark, …
[Figure: partition-aggregate exchange; labels: Query k, Reply k, Bottleneck link]

TCP Throughput Incast Collapse
- Many flows share the same egress queue
- Packets dropped when buffers are full
- RTO is used as recovery mechanism
- Bursts of traffic separated by long idle periods
- Results in low throughput and long flow completion times
[Figure: buffer occupancy over time; labels: S, RTOinit (1s), IW = 3, RTO (>200ms), RTO, 2x RTO]

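The long idle periods come from the retransmission timer doubling after each consecutive timeout. A minimal sketch of that schedule, assuming a plain doubling rule with no upper cap and no RTT estimation; the function name `backoff_schedule` is hypothetical and only illustrates the arithmetic:

```python
def backoff_schedule(first_rto_ms, timeouts):
    """Return the wait (in ms) before each of `timeouts` consecutive retransmissions.

    Assumption: the timer simply doubles after every timeout.
    """
    return [first_rto_ms * 2 ** i for i in range(timeouts)]


# Three consecutive timeouts starting from the conservative 200 ms RTOmin.
waits = backoff_schedule(200, 3)
idle_ms = sum(waits)  # total time the sender spends waiting
print(waits, idle_ms)
```

Starting from a smaller RTO shortens every term of the schedule, which is the lever the rest of the talk relies on.
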
DC Networks
“[…] a WSC server is deployed in a relatively well-known environment, leading to possible optimizations for increased performance. […] lower packet losses than in long-distance Internet connections. Thus we can tune transport or messaging parameters (timeouts, window sizes, etc.) for higher communication efficiency.”
The Datacenter as a Computer, Luiz André Barroso, Urs Hölzle, 2009
Compute environment specific settings
- RTOmin = Route latency
- RTOmax = Route + Buffer latency
- CWNDmax = Route BDP
- CWNDinit (IW) = BDP / Flow fan-in
In the DC the network properties are known or discoverable
2 to 3 orders of magnitude difference from the Internet and its conservative values
[Figure: topology with Controller, Core, Agg and ToR tiers; link labels: 1G 1ms, 1G 0.2ms, 10x1G 0.1ms; x10]

OTCP Information Gathering
- Add timestamp to topology discovery (OFDP): Controller – Switch – Switch – Controller
- OpenFlow Request/Reply: Controller – Switch – Controller
- ARP probe packets: Controller – Switch – Host – Switch – Controller
- Port status for link speed
- Queue config for buffer sizes
[Figure: topology with Controller; x10]

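One way the three probe paths above could be combined into a per-link latency is sketched here. The slides do not give the formula, so this is an assumption: the one-way controller-to-switch delay is taken as half of the OpenFlow request/reply round trip, and both halves are subtracted from the timestamped discovery probe. The function name and the sample numbers are hypothetical.

```python
def link_latency_us(probe_us, ctrl_rtt_a_us, ctrl_rtt_b_us):
    """Estimate the switch A -> switch B latency in microseconds.

    probe_us:      Controller -> A -> B -> Controller time of the timestamped probe
    ctrl_rtt_a_us: Controller -> A -> Controller round trip
    ctrl_rtt_b_us: Controller -> B -> Controller round trip
    Assumption: control-channel delay is symmetric, so one way is RTT / 2.
    """
    one_way_a = ctrl_rtt_a_us / 2
    one_way_b = ctrl_rtt_b_us / 2
    # Clamp at zero so measurement noise never yields a negative latency.
    return max(0.0, probe_us - one_way_a - one_way_b)


# Made-up measurements, for illustration only.
print(link_latency_us(1500.0, 600.0, 400.0))
```

The same subtraction would apply to the ARP probe path, with the host leg left as the remainder.
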
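OTCP Calculations
Network properties
- Buffer depth of 60 packets
- Throughput of 1Gbps
- Expected flow fan-in α = 100
Example: Flow through Core
- Measured latency 5571µs
$RTO_{min} = 6\,\mathrm{ms}$
$RTO_{max} = RTO_{min} + \frac{60 \cdot MSS}{1\,\mathrm{Gbps}} \cdot 10 = 12.771\,\mathrm{ms}$
$RTO_{init} = RTO_{max} \cdot 2 = 25\,\mathrm{ms}$
$BDP = Latency \cdot 1\,\mathrm{Gbps} = 476\,MSS$
$IW = \frac{BDP}{\alpha} = 5$
Note: the 12.771ms value appears to correspond to the measured 5.571ms latency (before rounding up to RTOmin) plus 10 × 0.72ms of buffer drain time with 1500-byte packets.
[Figure: topology with Controller; x10]
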
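The Core example can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the authors' implementation: a 1500-byte packet is assumed for the buffer drain time and a 1460-byte MSS for the BDP, because those values reproduce the 12.771ms and 476 MSS shown on the slide; rounding RTOmin up and rounding IW to the nearest segment are likewise assumptions. The function name `otcp_parameters` is hypothetical.

```python
import math

PACKET_BYTES = 1500  # assumed packet size for buffer drain time
MSS_BYTES = 1460     # assumed MSS for expressing the BDP in segments


def otcp_parameters(latency_us, rate_bps, buffer_pkts, queues, fan_in):
    """Derive per-route settings from measured latency and known link properties."""
    # Time to drain one full buffer at line rate, in microseconds.
    drain_us = buffer_pkts * PACKET_BYTES * 8 * 1_000_000 / rate_bps
    rto_min_ms = math.ceil(latency_us / 1000)
    # Uses the measured latency rather than the rounded RTOmin (assumption).
    rto_max_ms = (latency_us + queues * drain_us) / 1000
    rto_init_ms = 2 * rto_max_ms
    # Bandwidth-delay product in whole segments (integer arithmetic).
    bdp_mss = latency_us * rate_bps // (8 * MSS_BYTES * 1_000_000)
    iw = max(1, round(bdp_mss / fan_in))
    return {
        "rto_min_ms": rto_min_ms,
        "rto_max_ms": rto_max_ms,
        "rto_init_ms": rto_init_ms,
        "cwnd_max_mss": bdp_mss,
        "iw": iw,
    }


# Flow through Core: 5571 us latency, 1 Gbps, 60-packet buffers, factor 10, fan-in 100.
core = otcp_parameters(5571, 10**9, 60, 10, 100)
print(core)
```

The slide lists RTOinit as 25ms, so the unrounded product of 2 and RTOmax is presumably truncated before use; the sketch leaves it unrounded.
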
Parameters Propagation
- Controller exposes a northbound JSON/REST API
- Agents in the end-hosts connect to the API endpoint
- Controller calculates per-route congestion control values
- Pushes to agents on topological changes
- Agent updates the host routing table

Route  RTT (µs)  RTOmin (ms)  RTOmax (ms)  RTOinit (ms)  CWNDmax (MSS)  IW
ToR    629       1            2.069        4             49
Agg    1485      2            5.805        12            127
Core   5571      6            12.771       25            476            5

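As an illustration of the last step, an agent on a Linux host could translate a pushed entry into per-route options of the `ip route` command, which accepts `rto_min` and `initcwnd`. The JSON field names, the prefix, gateway and device below are hypothetical, and the slides do not say how the actual agent applies RTOmax, RTOinit or CWNDmax, so those are left out of this sketch. The command is only built, not executed.

```python
import json


def route_command(prefix, via, dev, params):
    """Build an `ip route replace` argument list from one pushed route entry."""
    cmd = ["ip", "route", "replace", prefix, "via", via, "dev", dev]
    if "rto_min_ms" in params:
        # iproute2 time values accept an `ms` suffix.
        cmd += ["rto_min", f"{params['rto_min_ms']}ms"]
    if "iw" in params:
        # Initial congestion window, in segments.
        cmd += ["initcwnd", str(params["iw"])]
    return cmd


# Hypothetical payload carrying the Core row of the table above.
pushed = json.loads('{"rto_min_ms": 6, "iw": 5}')
cmd = route_command("10.0.0.0/8", "10.1.0.1", "eth0", pushed)
print(" ".join(cmd))
```

Returning an argument list rather than a shell string lets the caller pass it to `subprocess.run` without quoting issues; applying it requires root privileges.
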
OTCP Improvements
- Match the congestion control settings to the network
- Improve flow completion time
- Improve throughput and goodput
- Improve flow fairness
- Reduce latency jitter
[Figure: buffer occupancy over time; labels: S, RTOinit (4ms), RTO (1ms), IW = 1]

FCT Evaluation
[Figure: (a) Mean FCT (s); (b) 95th percentile FCT (s)]

Goodput Evaluation
[Figure: CDF of goodput for flows experiencing incast collapse]

Conclusion
- Implemented OTCP
- Centralized, controller-based measurement for congestion control settings
- Calculates per-route parameters based on the operating environment
- Improves soft-realtime partition-aggregate traffic
- 12x FCT improvement at the mean, 31x at the 95th percentile
- Low and stable latency, no bursts from the IW
- Higher and fairer goodput

Questions?