TCP & Data Center Networking


TCP & Data Center Networking: Overview
- TCP Incast Problem & Possible Solutions
- DC-TCP
- MPTCP (Multipath TCP)
Please read the following papers: [InCast], [DC-TCP], [MPTCP].
CSci5221: TCP and Data Center Networking

TCP Congestion Control: Recap
Designed to address the network congestion problem: reduce sending rates when the network is congested.
How to detect network congestion at end systems? Treat packet losses (and re-ordering) as a sign of network congestion.
How to adjust sending rates dynamically? AIMD (additive increase, multiplicative decrease):
- no packet loss in one RTT: W → W + 1
- packet loss in one RTT: W → W/2
How to determine the initial sending rate? Probe the available network bandwidth via "slow start": W := 1; no loss in one RTT: W → 2W.
Fairness: assume everyone uses the same algorithm.
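The per-RTT rules above can be captured in a few lines. Below is a minimal sketch (illustrative only, assuming a per-RTT update and a hypothetical ssthresh for the slow-start/AIMD switch-over, which the slide does not mention):

def update_cwnd(cwnd, ssthresh, loss_in_rtt):
    # One congestion-window update per RTT, following the slide's rules.
    if loss_in_rtt:
        ssthresh = max(cwnd / 2.0, 1.0)   # multiplicative decrease: W -> W/2
        return ssthresh, ssthresh          # (Reno-style; Tahoe would restart at W = 1)
    if cwnd < ssthresh:
        return cwnd * 2, ssthresh          # slow start: W -> 2W per RTT
    return cwnd + 1, ssthresh              # congestion avoidance: W -> W + 1 per RTT

# Example: start with W = 1, grow for 8 RTTs, then halve on a loss.
cwnd, ssthresh = 1.0, 64.0
for rtt, loss in enumerate([False] * 8 + [True]):
    cwnd, ssthresh = update_cwnd(cwnd, ssthresh, loss)
    print(f"RTT {rtt}: cwnd = {cwnd}")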

TCP Congestion Control: Devils in the Details
How to detect packet losses, e.g., as opposed to late-arriving packets? Estimate the (average) RTT and set a time-out threshold called the RTO (Retransmission Time-Out) timer; packets arriving very late are treated as if they were lost!
RTT and RTO estimation: Jacobson's algorithm. Compute estRTT and devRTT using exponential smoothing:
- estRTT := (1-a)·estRTT + a·sampleRTT   (a > 0 small, e.g., a = 0.125)
- devRTT := (1-a)·devRTT + a·|sampleRTT - estRTT|
Set RTO conservatively: RTO := max{minRTO, estRTT + 4·devRTT}, where minRTO = 200 ms.
Aside: there are many variants of TCP: Tahoe, Reno, Vegas, ...
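A small sketch of this estimator (illustrative; the class name and the initialisation of devRTT are assumptions, not from the slide):

MIN_RTO = 0.200  # 200 ms floor on the retransmission timeout, as on the slide

class RttEstimator:
    def __init__(self, first_sample, a=0.125):
        self.a = a
        self.est_rtt = first_sample
        self.dev_rtt = first_sample / 2.0      # common initialisation (assumption)
    def update(self, sample_rtt):
        # exponential smoothing of the mean RTT and its deviation
        self.dev_rtt = (1 - self.a) * self.dev_rtt + self.a * abs(sample_rtt - self.est_rtt)
        self.est_rtt = (1 - self.a) * self.est_rtt + self.a * sample_rtt
        return self.rto()
    def rto(self):
        return max(MIN_RTO, self.est_rtt + 4 * self.dev_rtt)

# With a ~100 us data-center RTT, the 200 ms floor completely dominates the estimate:
est = RttEstimator(first_sample=100e-6)
print(est.update(100e-6))   # -> 0.2 s, i.e., clamped to minRTO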

But… Internet vs. data center network:
- Internet propagation delay: 10-100 ms
- data center propagation delay: ~0.1 ms
- packet size 1 KB, link capacity 1 Gbps → packet transmission time ≈ 0.01 ms

What's Special about Data Center Transport?
- Application requirements (particularly, low latency)
- Particular traffic patterns: customer-facing and internal traffic often co-exist; internal traffic comes from, e.g., the Google File System, MapReduce, …
- Commodity switches with shallow buffers
- And time is money!

How does search work? The Partition/Aggregate Application Structure
A query fans out from a top-level aggregator (TLA) to mid-level aggregators (MLAs) and on to worker nodes; partial results are aggregated back up the tree. (The slide illustrates this with a search over Picasso quotes, e.g., "Everything you can imagine is real.")
- Time is money: strict deadlines (SLAs) at each level, e.g., 250 ms, 50 ms, and 10 ms in the figure.
- A missed deadline means a lower-quality result.
- Many requests per query, so tail latency matters.

Data Center Workloads
- Partition/Aggregate (Query): bursty, delay-sensitive
- Short messages [50KB-1MB] (coordination, control state): delay-sensitive
- Large flows [1MB-100MB] (data update): throughput-sensitive

Flow Size Distribution
- More than 65% of flows are < 1 MB.
- More than 95% of bytes come from flows > 1 MB.

A Simple Data Center Network Model
N servers send to a single aggregator through one switch with a small buffer B; each link has capacity C (Ethernet: 1-10 Gbps), packets are of size S_DATA, and the round-trip time (RTT) is 10-100 us. A logical data block (S, e.g., 1 MB) is spread across the N servers, each holding one Server Request Unit (SRU, e.g., 32 KB).
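As a rough illustration of why this setup is fragile, consider how much synchronized data arrives at the bottleneck port at once (a sketch; the SRU size is the slide's example, but the 256 KB shared buffer is an assumed typical shallow-buffer value, not from the slide):

SRU = 32 * 1024          # bytes per server response (slide example)
BUFFER = 256 * 1024      # assumed switch output-port buffer, in bytes
for n_servers in (4, 8, 16, 32, 64):
    burst = n_servers * SRU
    verdict = "fits" if burst <= BUFFER else "overflows"
    print(f"{n_servers:3d} servers -> {burst // 1024:5d} KB burst ({verdict} a {BUFFER // 1024} KB buffer)")
# Once the synchronized burst exceeds the buffer, packets are dropped, some flows hit
# the 200 ms RTO, and the bottleneck link sits idle while the aggregator waits: incast.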

TCP Incast Problem (Vasudevan et al., SIGCOMM'09)
Synchronized fan-in congestion, caused by the Partition/Aggregate pattern. In the figure's timeline, the aggregator sends requests to its workers (Worker 1-4 shown); responses are sent back, responses 7-8 are dropped at the congested port while 1-6 complete, and with RTOmin = 200 ms the affected senders sit in TCP timeout – the link stays idle – until 7-8 are finally resent.

TCP Throughput Collapse
Cluster setup: 1 Gbps Ethernet, unmodified TCP, an S50 switch, 1 MB block size. Cause of the throughput collapse: coarse-grained TCP timeouts.
Here are the results of the experiment. On the y-axis we have throughput (goodput), and on the x-axis the number of servers involved in the transfer. Initially the throughput is 900 Mbps, close to the maximum achievable in the network. As we scale the number of servers, by around 7 servers we notice a drastic collapse in throughput down to 100 Mbps (an order of magnitude lower than the maximum). This TCP throughput collapse is called TCP Incast, and its cause is coarse-grained TCP timeouts.

Incast in Bing
(Figure: MLA query completion time in ms, from a production monitoring tool.)
1. Incast really happens – this is an actual screenshot from a production tool.
2. People care; they have worked around it at the application level by jittering responses.
3. They care about the 99.9th percentile, i.e., 1 in 1000 customers.

Problem Statement
How do we provide high goodput for data center applications, given that TCP retransmission timeouts cause TCP throughput degradation under these conditions?
- High-speed, low-latency network (RTT ≤ 0.1 ms)
- Highly multiplexed link (e.g., 1000 flows)
- Highly synchronized flows on the bottleneck link
- Limited switch buffer size (e.g., 32 KB)

One Quick Fix: µsecond TCP + no minRTO
µsecond Retransmission Timeouts (RTO): RTO = max( minRTO, f(RTT) ).
RTO is the max of two values, and we have to make sure both are in µseconds. That means lowering the minRTO bound from 200 ms to µseconds (or getting rid of it, i.e., 0), and tracking RTT in µseconds rather than in milliseconds as today.
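A tiny illustration of the fix, reusing the RTO formula above (a sketch; the example RTT and deviation values are assumed, not from the slide):

def rto(est_rtt, dev_rtt, min_rto):
    return max(min_rto, est_rtt + 4 * dev_rtt)

est_rtt, dev_rtt = 100e-6, 20e-6              # ~100 us data-center RTT (assumed values)
print(rto(est_rtt, dev_rtt, min_rto=0.200))   # stock TCP: 0.2 s -- minRTO dominates
print(rto(est_rtt, dev_rtt, min_rto=0.0))     # us-granularity TCP, no minRTO: 180 us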

Solution: µsecond TCP + no minRTO
(Figure: throughput in Mbps vs. number of servers, proposed solution compared with unmodified TCP; higher is better, and the red line is the quick fix.)
Our solution of using microsecond-granularity timers solves the TCP throughput collapse. The red line is the result of running servers with our modified TCP stack: it sustains high throughput for up to 47 servers here, and we have found in simulation that the solution scales to thousands of servers.

TCP in the Data Center
TCP does not meet the demands of applications: it requires large queues for high throughput, which adds significant latency and wastes buffer space (especially bad with shallow-buffered switches).
Operators work around TCP problems with ad-hoc, inefficient, often expensive solutions, and with no solid understanding of the consequences and tradeoffs.

Queue Buildup
Large flows build up queues, which increases latency for short flows (in the figure, Sender 1 and Sender 2 share the path to the Receiver).
How was this supported by measurements? Measurements in a Bing cluster show that for 90% of packets RTT < 1 ms, while for the other 10% of packets 1 ms < RTT < 15 ms.

Data Center Transport Requirements
1. High burst tolerance – incast due to Partition/Aggregate is common.
2. Low latency – short flows, queries.
3. High throughput – continuous data updates, large file transfers.
The challenge is to achieve these three together.

DCTCP: Main Idea
React in proportion to the extent of congestion: reduce the window size based on the fraction of marked packets.
ECN marks              TCP                DCTCP
1 0 1 1 1 1 0 1 1 1    cut window by 50%  cut window by 40%
0 0 0 0 0 0 0 0 0 1    cut window by 50%  cut window by 5%
Start with: "How can we extract multi-bit information from a single-bit stream of ECN marks?"

DCTCP: Algorithm
Switch side: mark packets (ECN) when the queue length exceeds a threshold K; don't mark below K (a single marking threshold within the buffer B).
Sender side: maintain a running average of the fraction of packets marked, α := (1-g)·α + g·F, where F is the fraction of packets marked over the last RTT and g is a small gain. Adaptive window decrease: W := W·(1 - α/2). Note: the effective decrease factor is between 1 and 2 (no cut when α = 0, a TCP-style halving when α = 1).
This is a very simple marking mechanism, without all the tunings other AQMs have. On the source side, the source is trying to estimate the fraction of packets getting marked, using the observation that there is a stream of ECN marks coming back – more information in the stream than in any single bit – and it tries to maintain smooth rate variations so as to operate well even with shallow buffers and only a few flows (no statistical multiplexing). F is measured over the last RTT; in TCP there is always a way to get the next RTT's worth of data from the window size – it comes from TCP's self-clocking. DCTCP only changes the decrease. This simplest version makes a lot of sense and is so generic that it could be applied to any algorithm – CTCP, CUBIC – changing how it cuts its window while leaving the increase part to what it already does. One has to be careful here.
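A compact sketch of the sender-side logic just described (illustrative, per-RTT rather than per-ACK; g = 1/16 is a commonly used gain, assumed here rather than taken from the slide):

class DctcpSender:
    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd
        self.g = g            # estimation gain (assumed value)
        self.alpha = 0.0      # running estimate of the fraction of marked packets
    def on_rtt_end(self, acked_pkts, marked_pkts):
        # Call once per RTT with counts of ACKed and ECN-marked packets.
        F = marked_pkts / max(acked_pkts, 1)               # fraction marked this RTT
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked_pkts > 0:
            self.cwnd = self.cwnd * (1 - self.alpha / 2)   # gentle, proportional cut
        else:
            self.cwnd += 1                                 # usual additive increase
        return self.cwnd

s = DctcpSender()
print(s.on_rtt_end(acked_pkts=10, marked_pkts=1))   # mild congestion -> tiny cut
for _ in range(60):                                 # sustained heavy marking:
    s.on_rtt_end(acked_pkts=10, marked_pkts=10)     # alpha -> 1, cuts approach 50%
print(round(s.alpha, 2))                            # ~0.98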

DCTCP vs. TCP
(Figure: switch queue length in KBytes over time.)
Setup: Windows 7 hosts, Broadcom 1 Gbps switch – real hardware, not ns2. Scenario: 2 long-lived flows, ECN marking threshold = 30 KB.

Multi-path TCP (MPTCP)
In a data center with rich path diversity (e.g., Fat-Tree or BCube), can we use multipath to get higher throughput? Initially, there is one flow.

In a BCube data center, can we use multipath to get higher throughput? Initially, there is one flow. A new flow starts. Its direct route collides with the first flow.

In a BCube data center, can we use multipath to get higher throughput? Initially, there is one flow. A new flow starts. Its direct route collides with the first flow. But it also has longer routes available, which don’t collide.

The MPTCP Protocol
MPTCP is a replacement for TCP which lets you use multiple paths simultaneously. Applications keep using the standard socket API; underneath, MPTCP runs multiple TCP subflows over IP, each bound to its own address (addr1, addr2). The sender stripes packets across paths; the receiver puts the packets back in the correct order.

Design goal 1: Multipath TCP should be fair to regular TCP at shared bottlenecks
To be fair, Multipath TCP should take as much capacity as TCP at a bottleneck link, no matter how many paths it is using (the figure shows a multipath TCP flow with two subflows sharing a bottleneck with a regular TCP flow).
Strawman solution: run "½ TCP" on each path.
This is the very first thing that comes to mind with multipath TCP, and it's something that many other people have solved in different ways. It is just a warm-up: Design Goal 3 is a much "richer" generalization of this goal, which accommodates different topologies and different RTTs, so there's no point giving an evaluation here.

Design goal 2: MPTCP should use efficient paths
Example (from the figure): three flows share three 12 Mb/s links. Each flow has a choice of a 1-hop and a 2-hop path. How should we split its traffic?

Design goal 2: MPTCP should use efficient paths
If each flow split its traffic 1:1 between its 1-hop and 2-hop paths, each flow would get 8 Mb/s on the 12 Mb/s links.

Design goal 2: MPTCP should use efficient paths
If each flow split its traffic 2:1, each flow would get 9 Mb/s.

Design goal 2: MPTCP should use efficient paths
If each flow split its traffic 4:1, each flow would get 10 Mb/s.

Design goal 2: MPTCP should use efficient paths
If each flow split its traffic ∞:1, i.e., sent everything on its 1-hop path, each flow would get 12 Mb/s, fully using the links.

Design goal 2: MPTCP should use efficient paths
Theoretical solution (Kelly+Voice 2005; Han, Towsley et al. 2006). Theorem: MPTCP should send all its traffic on its least-congested paths. This will lead to the most efficient allocation possible, given a network topology and a set of available paths.
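The split example above can be checked with a few lines of arithmetic (a sketch of the symmetric three-flow, three-link case as read from the figure: x is a flow's 1-hop rate, y its 2-hop rate, and each 12 Mb/s link carries one flow's 1-hop traffic plus two flows' 2-hop traffic, so x + 2y = 12):

LINK = 12.0
for ratio in (1, 2, 4, 1000):             # 1:1, 2:1, 4:1, ~inf:1 splits
    y = LINK / (ratio + 2)                # solve ratio*y + 2*y = LINK, with x = ratio*y
    x = ratio * y
    print(f"{ratio}:1 split -> per-flow throughput = {x + y:.1f} Mb/s")
# Prints 8.0, 9.0, 10.0 and ~12.0 Mb/s, matching the figures on the previous slides and
# showing why shifting traffic onto the less-congested 1-hop paths is the efficient choice.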

Design goal 3: MPTCP should be fair compared to TCP
Example: a WiFi path (high loss, small RTT) and a 3G path (low loss, high RTT). Design Goal 2 says to send all your traffic on the least congested path, in this case 3G. But 3G has a high RTT, hence it will give low throughput.
Goal 3a. A Multipath TCP user should get at least as much throughput as a single-path TCP would on the best of the available paths.
Goal 3b. A Multipath TCP flow should take no more capacity on any link than a single-path TCP would.

How does MPTCP try to achieve all this? Design goals:
Goal 1. Be fair to TCP at bottleneck links.
Goal 2. Use efficient paths ...
Goal 3. ... as much as we can, while being fair to TCP.
Goal 4. Adapt quickly when congestion changes.
Goal 5. Don't oscillate.
Goal 1 is redundant: "be fair to TCP" (Goal 3) means two things – fair to the user, i.e., the user doesn't suffer by switching from TCP to MPTCP, and fair to the network, i.e., the network doesn't suffer when users switch from TCP to MPTCP – so Goal 3 subsumes Goal 1. The paper discusses Goal 4 at length. For Goal 5, we didn't see any oscillations in our evaluation, and theory papers predict no oscillation for an idealized model. So, how does MPTCP try to achieve all this?

How does MPTCP congestion control work?
Maintain a congestion window w_r, one window for each path, where r ∊ R ranges over the set of available paths.
- Increase w_r for each ACK on path r, by min( max_s(w_s/RTT_s²) / (Σ_s w_s/RTT_s)² , 1/w_r ) – the coupled MPTCP increase.
- Decrease w_r for each drop on path r, by w_r/2.
MPTCP works pretty much like TCP, i.e., the window increases and decreases on each path. The decrease is the same as TCP's, and when there is only one path available the increase reduces to the TCP formula. We derived a throughput formula for this congestion control algorithm, and checked that it satisfies the design goals.
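A sketch of this coupled increase and decrease (illustrative only: windows in packets, per-ACK updates, and all class and variable names are assumptions, not from the talk):

class MptcpFlow:
    def __init__(self, subflows):
        # subflows: dict path_name -> {'w': cwnd_in_packets, 'rtt': rtt_in_seconds}
        self.sub = subflows
    def _coupled_increase(self):
        # max_s(w_s / rtt_s^2) / (sum_s w_s / rtt_s)^2
        num = max(s['w'] / s['rtt'] ** 2 for s in self.sub.values())
        den = sum(s['w'] / s['rtt'] for s in self.sub.values()) ** 2
        return num / den
    def on_ack(self, r):
        s = self.sub[r]
        s['w'] += min(self._coupled_increase(), 1.0 / s['w'])  # never more aggressive than TCP
    def on_loss(self, r):
        s = self.sub[r]
        s['w'] = max(s['w'] / 2.0, 1.0)                        # same halving as TCP

# Example: a low-RTT WiFi-like subflow and a high-RTT 3G-like subflow.
flow = MptcpFlow({'wifi': {'w': 10.0, 'rtt': 0.01}, '3g': {'w': 10.0, 'rtt': 0.1}})
flow.on_ack('wifi'); flow.on_loss('3g')
print(flow.sub)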

How does MPTCP congestion control work? (continued)
Same windows and update rules as above. Design goal 3: at any potential bottleneck S that path r might be in, look at the best that a single-path TCP could get, and compare it to what I'm getting.

How does MPTCP congestion control work? (continued)
Design goal 2: we want to shift traffic away from congestion. To achieve this, we increase windows in proportion to their size, so less-congested paths (which keep larger windows) attract more of the traffic.

MPTCP chooses efficient paths in a BCube data center, hence it gets high throughput. MPTCP shifts its traffic away from the congested link. Initially, there is one flow. A new flow starts. Its direct route collides with the first flow. But it also has longer routes available, which don’t collide.

MPTCP chooses efficient paths in a BCube data center, hence it gets high throughput.
(Figure: average throughput in Mb/s for three traffic matrices – permutation, sparse, and local.)
Packet-level simulations of BCube (125 hosts, 25 switches, 100 Mb/s links) measured average throughput for three traffic matrices. For two of the traffic matrices, MPTCP and the ½ TCP strawman were equally good; for the third, MPTCP got 19% higher throughput.