TCP Incast in Data Center Networks

TCP Incast in Data Center Networks A study of the problem and proposed solutions

Outline TCP Incast - Problem Description Motivation and challenges Proposed Solutions Evaluation of proposed solutions Conclusion References

TCP Incast – Problem Description Incast terminology: barrier-synchronized workload, SRU (Server Request Unit), goodput vs. throughput, MTU, BDP, and TCP acronyms such as RTT, RTO, CA, and AIMD.

TCP Incast – Problem A typical deployment scenario in data centers

TCP Incast - Problem Many-to-one barrier-synchronized workload: the receiver requests k blocks of data, each block striped across S storage servers. Each server responds with a "fixed" amount of data (fixed-fragment workload). The client does not request block k+1 until all fragments of block k have been received. Datacenter scenario: k = 100, S = 1-48, fragment size = 256KB.
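To make the workload concrete, here is a minimal sketch of the client side of such a barrier-synchronized read, assuming a hypothetical request_fragment() helper; it is illustrative only and not the benchmark code used in the cited studies.

    # Minimal sketch of the client side of a barrier-synchronized ("incast") read.
    # request_fragment() is a hypothetical placeholder, not code from the cited studies.
    from concurrent.futures import ThreadPoolExecutor

    NUM_BLOCKS = 100                                  # k blocks per transfer
    SERVERS = ["storage-%d" % i for i in range(48)]   # S storage servers
    FRAGMENT_SIZE = 256 * 1024                        # 256KB per server per block

    def request_fragment(server, block_id):
        """Placeholder: fetch one 256KB fragment of block_id from server over TCP."""
        raise NotImplementedError

    def read_block(block_id):
        # All S requests go out almost simultaneously, so all S responses
        # arrive as a synchronized burst at the single bottleneck link.
        with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
            fragments = list(pool.map(lambda s: request_fragment(s, block_id), SERVERS))
        return b"".join(fragments)

    def read_all_blocks():
        for block_id in range(NUM_BLOCKS):
            # Barrier: block k+1 is not requested until every fragment of
            # block k has arrived, so one timed-out flow stalls the whole client.
            read_block(block_id)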

TCP Incast - Problem Goodput Collapse

TCP Incast - Problem Switch buffers are inherently small, typically 32KB-128KB per port. The bottleneck switch buffer is overwhelmed by the servers' synchronized sending, so the switch drops packets. RTT is typically 1-2ms in datacenters while RTOmin is 200ms, so dropped packets are not retransmitted soon. All the senders that have already delivered their data must wait until the dropped packet is retransmitted, and the large RTO delays that retransmission, collapsing goodput.
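As a rough back-of-the-envelope illustration (my own numbers, not a measurement from the cited papers): 48 fragments of 256KB make roughly a 12MB block, which takes about 100ms to deliver over a 1Gbps bottleneck when nothing is lost, so a single 200ms RTOmin stall roughly triples the block completion time and cuts goodput to about a third:

\[
\frac{48 \times 256\,\mathrm{KB}}{1\,\mathrm{Gbps}} \approx 100\,\mathrm{ms},
\qquad
\frac{100\,\mathrm{ms}}{100\,\mathrm{ms} + 200\,\mathrm{ms}} \approx \frac{1}{3}\ \text{of the loss-free goodput.}
\]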

Outline TCP Incast - Problem Description Motivation and challenges Proposed Solutions Evaluation of proposed solutions Conclusion References

Motivation Internet datacenters support a myriad of services and applications (Google, Microsoft, Yahoo, Amazon). The vast majority of datacenters use TCP for communication between nodes. Some companies, such as Facebook, have adopted UDP as their transport protocol to avoid TCP incast, pushing the responsibility for flow control up to application-layer protocols. The unique workloads (e.g., MapReduce, Hadoop), scale, and environment of internet datacenters violate the WAN assumptions under which TCP was originally designed. Example: in a web search application, many workers respond nearly simultaneously to a search query; similarly, key-value pairs from many Mappers are transferred to the appropriate Reducers during the MapReduce shuffle stage.

Incast in Bing (Microsoft) Ref: slide from Albert Greenberg's (Microsoft) presentation at SIGCOMM'10

Challenges Minimal changes to the TCP implementation are desired. RTOmin cannot be decreased below 1ms, because operating systems lack the high-resolution timers needed for RTO. Both internal and external flows must be addressed. Large switch buffers are too costly. The solution needs to be easily deployable and cost effective.

Outline TCP Incast - Problem Description Characteristics of the problem and challenges Proposed Solutions Evaluation of proposed solutions Conclusion References

Proposed Solutions Solutions can be divided into: application-level solutions; transport-layer solutions; and transport-layer solutions aided by switch ECN and QCN capabilities. An alternative way to categorize the solutions: avoid timeouts in TCP; reduce RTOmin; replace TCP; or call on lower-layer functionality such as Ethernet flow control for help.

Understanding the problem… A collaborative study by EECS Berkeley and Intel Labs [1] focused on showing that the problem is general, deriving an analytical model, and studying the impact of various modifications to TCP on incast behavior.

Different RTO Timers Observations: the initial goodput minimum occurs at the same number of servers; a smaller RTO timer value gives a faster goodput "recovery" rate; the rate of decrease after the local maximum is the same across different RTOmin settings.

Decreasing the RTO gives a proportional increase in goodput. Surprisingly, a 1ms RTO with delayed ACKs enabled performed better: with delayed ACKs disabled at 1ms, the high rate of ACKs over-drives the sender's congestion window and causes fluctuations in the smoothed RTT.

Quantitative model: D = total amount of data to be sent (100 blocks of 256KB); L = total transfer time of the workload without any RTO events; R = number of RTO events during the transfer; S = number of servers; r = minimum RTO timer value; I = inter-packet wait time. R and I were modeled from empirically observed behavior. Net goodput is the total data divided by the total transfer time.
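The goodput expression itself did not survive the transcript; based on the variable definitions above, the model in [1] has the general form below (a reconstruction from those definitions, so the exact placement of I may differ slightly from the paper):

\[
\mathrm{goodput} \;=\; \frac{D}{L + R \cdot (r + I)}
\]

that is, the loss-free transfer time L is inflated by R timeout events, each costing roughly the minimum RTO r plus the inter-packet wait time I.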

Key Observations A smaller minimum RTO timer value means larger goodput values at the initial minimum. The initial goodput minimum occurs at the same number of senders, regardless of the minimum RTO value. The second-order goodput peak occurs at a higher number of senders for a larger RTO timer value. The smaller the RTO timer value, the faster the rate of recovery between the goodput minimum and the second-order goodput maximum. After the second-order goodput maximum, the slope of the goodput decrease is the same for different RTO timer values.

Application-level solution [5] No changes are required to the TCP stack or network switches. The idea is to schedule the servers' responses to the same data block so that the bottleneck buffer is never overrun and no data loss occurs (a sketch of the idea follows). Caveats: genuine retransmissions still cause trouble; scheduling at the application level cannot be easily synchronized; there is limited control over the transport layer.
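A minimal sketch of the scheduling idea, assuming a fixed group size and the same hypothetical request_fragment() helper as the earlier sketch (the group size and helper are illustrative, not the scheduler proposed in [5]):

    # Application-level incast avoidance: stagger the servers' responses so that
    # only a small group transmits into the bottleneck at any one time.
    from concurrent.futures import ThreadPoolExecutor

    GROUP_SIZE = 4   # illustrative: small enough that one group's burst fits the switch buffer

    def request_fragment(server, block_id):
        """Placeholder fragment fetch (same hypothetical helper as the earlier sketch)."""
        raise NotImplementedError

    def read_block_scheduled(servers, block_id):
        fragments = []
        for start in range(0, len(servers), GROUP_SIZE):
            group = servers[start:start + GROUP_SIZE]
            # Only GROUP_SIZE servers send concurrently; the next group is not
            # asked until this group finishes, so the buffer is never overrun.
            with ThreadPoolExecutor(max_workers=GROUP_SIZE) as pool:
                fragments.extend(pool.map(lambda s: request_fragment(s, block_id), group))
        return b"".join(fragments)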

Application level solution

Application level solution

ICTCP - Incast Congestion Control for TCP in Data Center Networks [8] Features: the solution dynamically adjusts the TCP receive window; it can be implemented on the receiver side only; it focuses on avoiding packet losses before incast congestion occurs; a test implementation was built as a Windows NDIS driver. Novelties: available bandwidth is used to coordinate the receive-window increase across all incoming connections; per-flow congestion control is performed independently in slotted time on the scale of the RTT; the receive-window adjustment is based on the ratio of the difference between measured and expected throughput to the expected throughput.

Design considerations The receiver knows how much throughput is achieved and how much bandwidth is available. An overly aggressive window control constrains TCP performance, while too little control does not prevent incast congestion. Only low-latency flows (RTT less than 2ms) are considered. The receive-window increase is determined by the available bandwidth. The frequency of receive-window-based congestion control should be set per flow. The receive-window-based scheme should adjust the window according to both link congestion and application requirements.

ICTCP Algorithm Control trigger: available bandwidth. Calculate the available bandwidth and estimate the potential per-flow throughput increase before increasing any receive window. Time is divided into slots of two sub-slots each: for each network interface, measure the available bandwidth in the first sub-slot and spend a quota for window increases in the second sub-slot. Ensure the total receive-window increase stays within the available bandwidth measured in the first sub-slot. A sketch of this bookkeeping follows.
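A minimal sketch of the per-interface bookkeeping, assuming the available bandwidth takes the form BWA = max(0, alpha*C - BWT) with alpha around 0.9 (the constant and the class/method names here are assumptions, not verbatim from [8]):

    # Sketch of ICTCP's per-interface available-bandwidth quota.
    ALPHA = 0.9           # headroom factor (assumed value)
    LINK_CAPACITY = 1e9   # C: interface capacity in bits/s (1Gbps here)

    class InterfaceQuota:
        def __init__(self):
            self.quota_bps = 0.0    # bandwidth quota left for window increases

        def end_of_first_subslot(self, measured_incoming_bps):
            # First sub-slot: measure incoming traffic BWT and derive the
            # available bandwidth BWA that window increases may consume.
            self.quota_bps = max(0.0, ALPHA * LINK_CAPACITY - measured_incoming_bps)

        def try_consume(self, needed_bps):
            # Second sub-slot: a connection may enlarge its receive window only
            # if the estimated extra throughput still fits within the quota.
            if needed_bps <= self.quota_bps:
                self.quota_bps -= needed_bps
                return True
            return False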

ICTCP Algorithm Per-connection control interval: 2*RTT. The shortest time scale on which a connection's throughput can be estimated for receive-window adjustment is one RTT, so ICTCP uses a control interval of 2*RTT per connection: one RTT of latency for the adjusted window to take effect, and one additional RTT to measure throughput with the newly adjusted window. A connection may increase its window, based on its newly observed throughput and the current available bandwidth, only when the current time falls in the second global sub-slot and more than 2*RTT has elapsed since its last receive-window adjustment.

ICTCP Algorithm Window adjustment on a single connection. The receive window is adjusted based on the connection's measured incoming throughput: the measured throughput reflects the application's current demand on that connection, while the expected throughput is what the connection could achieve if it were constrained only by the receive window. Define the throughput-difference ratio as (expected - measured) / expected, and adjust as follows. If the ratio is below a first (small) threshold, increase the receive window by one MSS, provided the current time is in the global second sub-slot and there is enough quota of available bandwidth on the interface; decrease the quota correspondingly. If the ratio is above a second (larger) threshold for three continuous RTTs, decrease the receive window by one MSS, down to a minimum of 2*MSS. Otherwise, keep the current receive window.
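A minimal sketch of this per-connection adjustment, using assumed threshold values gamma1 = 0.1 and gamma2 = 0.5 (the values I recall from [8], but treat them as assumptions) and the InterfaceQuota sketch above:

    # Sketch of ICTCP's per-connection receive-window adjustment; assumes it is
    # called once per control interval (2*RTT) for the connection.
    GAMMA1, GAMMA2 = 0.1, 0.5   # assumed thresholds
    MSS = 1460                  # bytes

    def adjust_rwnd(rwnd, rtt, measured_bps, in_second_subslot, quota, high_ratio_streak):
        """Return (new_rwnd, new_high_ratio_streak); quota is an InterfaceQuota."""
        expected_bps = max(measured_bps, rwnd * 8.0 / rtt)   # window-limited upper bound
        ratio = (expected_bps - measured_bps) / expected_bps

        if ratio <= GAMMA1:
            # Throughput is close to what the window allows: try to grow by one MSS,
            # but only in the second sub-slot and only within the bandwidth quota.
            extra_bps = MSS * 8.0 / rtt
            if in_second_subslot and quota.try_consume(extra_bps):
                return rwnd + MSS, 0
            return rwnd, 0
        if ratio >= GAMMA2:
            # Window is much larger than needed: shrink by one MSS only after the
            # condition has persisted for three consecutive checks, floor at 2*MSS.
            high_ratio_streak += 1
            if high_ratio_streak >= 3:
                return max(2 * MSS, rwnd - MSS), 0
            return rwnd, high_ratio_streak
        return rwnd, 0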

ICTCP Algorithm Fairness controller for multiple connections. Fairness is considered only for low-latency flows, and windows are decreased for fairness only when the available bandwidth drops below 0.2C. For a decrease, the receive window of selected connections is cut by one MSS; the selected connections are those whose receive window is larger than the average window over all connections. Fairness on window increase is achieved automatically by the window-adjustment rule above.
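And a small sketch of this fairness step, reusing the assumed MSS and LINK_CAPACITY names from the sketches above:

    # Sketch of ICTCP's fairness controller: when the interface is nearly saturated
    # (BWA < 0.2*C), trim one MSS from connections whose receive window is above
    # the average, nudging all low-latency flows toward equal windows.
    def fairness_step(rwnds, available_bps):
        """rwnds: dict mapping connection id -> receive window in bytes."""
        if not rwnds or available_bps >= 0.2 * LINK_CAPACITY:
            return rwnds
        avg = sum(rwnds.values()) / len(rwnds)
        return {conn: (max(2 * MSS, w - MSS) if w > avg else w)
                for conn, w in rwnds.items()}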

ICTCP Experimental Results Testbed: 47 servers; one LB4G 48-port Gigabit Ethernet switch; Gigabit Ethernet Broadcom NICs at the hosts; Windows Server 2008 R2 Enterprise, 64-bit.

Issues with ICTCP Scalability to a large number of TCP connections is a concern, because the receive window may have to drop below 1 MSS, degrading TCP performance. Open questions remain around extending ICTCP to general cases where the sender and receiver are not under the same switch and the bottleneck is not the last-hop link to the receiver, and around applying ICTCP to future high-bandwidth, low-latency networks.

DCTCP Features A TCP-like protocol for data centers that uses ECN (Explicit Congestion Notification) to provide multi-bit congestion feedback to the end hosts. The claim is that DCTCP delivers better throughput than TCP while using 90% less buffer space, provides high burst tolerance and low latency for short flows, and can handle a 10X increase in foreground and background traffic without a significant performance hit.

DCTCP Overview Applications in data centers largely require: low latency for short flows; high burst tolerance; and high utilization for long flows. Short flows have real-time deadlines of roughly 10-100ms, while high utilization for long flows is essential because those flows continuously update internal data structures. The study analyzed production traffic from approximately 6,000 servers, about 150 TB of traffic over a period of 1 month. Query traffic (2KB to 20KB responses) experiences the incast impairment.

DCTCP Overview (Contd.) DCTCP uses the ECN capability available in most modern switches, deriving multi-bit congestion feedback from the single-bit stream of ECN marks. The essence of the proposal is to keep switch buffer occupancies persistently low while maintaining high throughput; queue length at the switches is controlled with an Active Queue Management (AQM) approach that uses explicit feedback from congested switches. The claim is that only about 30 lines of code in TCP and the setting of a single parameter on the switches are needed. DCTCP addresses three impairments: incast (our focus here), queue buildup, and buffer pressure.

DCTCP Algorithm The algorithm reacts to the extent of congestion rather than just its presence, by deriving multi-bit feedback from the single-bit sequence of ECN marks. It has three components: simple marking at the switch, ECN-echo at the receiver, and a controller at the sender.

DCTCP Simple marking at the switch: an arriving packet is marked with the CE (Congestion Experienced) codepoint if the instantaneous queue occupancy is greater than the marking threshold K; marking is based on instantaneous, not average, queue length. ECN-Echo at the receiver: in standard TCP with ECN, the receiver sets ECN-Echo on all ACKs until it receives CWR from the sender; a DCTCP receiver instead sets ECN-Echo only on ACKs for packets that carried the CE codepoint.
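A minimal sketch of these two rules (simplified: it ignores delayed ACKs, which the DCTCP paper handles with a small two-state machine at the receiver; the threshold value here is illustrative):

    # Sketch of DCTCP's switch marking rule and the receiver's echo rule.
    K_PACKETS = 20   # marking threshold K, in packets (illustrative value)

    def switch_mark(packet, instantaneous_queue_len):
        # Mark CE based on the instantaneous queue length, not an average.
        if instantaneous_queue_len > K_PACKETS:
            packet["CE"] = True
        return packet

    def receiver_ack(packet):
        # Standard ECN keeps echoing congestion until CWR arrives; DCTCP instead
        # echoes exactly the packets that carried CE, preserving the marking fraction.
        return {"ack": packet["seq"], "ECE": packet.get("CE", False)}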

DCTCP Controller at the sender: the sender maintains α, a running estimate of the fraction of its packets that are marked, updated once per window of data. α close to 0 indicates low congestion and α close to 1 indicates high congestion. Whereas TCP cuts its window in half on congestion, DCTCP scales the cut by α: cwnd ← cwnd × (1 − α/2).
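A minimal sketch of the sender-side estimator; the EWMA form alpha ← (1 − g)·alpha + g·F follows the DCTCP paper, while g = 1/16 is a commonly cited choice treated here as an assumption:

    # Sketch of the DCTCP sender: update alpha once per window of data and scale
    # the window cut by alpha instead of always halving.
    G = 1.0 / 16   # EWMA gain (assumed value)

    class DctcpSender:
        def __init__(self, cwnd):
            self.cwnd = cwnd
            self.alpha = 0.0

        def on_window_end(self, acked_packets, marked_packets):
            # F: fraction of packets that came back with ECN-Echo in the last window.
            frac = float(marked_packets) / acked_packets if acked_packets else 0.0
            self.alpha = (1 - G) * self.alpha + G * frac
            if marked_packets:
                # Congestion seen this window: cwnd <- cwnd * (1 - alpha/2).
                # alpha near 1 behaves like TCP's halving; alpha near 0 barely cuts.
                self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))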

DCTCP Analysis models the point at which the window reaches W* (when the queue is at the critical marking point K). The maximum queue size Qmax depends on the number of synchronously sending servers N, and a lower bound for the marking threshold K can be derived from it.
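The bound itself was lost with the slide's figure; as I recall from the DCTCP paper, the guideline derived from this analysis has the form below, with the constant reproduced from memory and best treated as approximate:

\[
K \;>\; \frac{C \times RTT}{7}
\]

where C is the bottleneck capacity in packets per second and RTT the round-trip time, i.e., the marking threshold need only be a small fraction of the bandwidth-delay product.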

DCTCP How does DCTCP solve incast? TCP suffers timeouts when N > 10, whereas DCTCP senders receive ECN marks early and slow their rate. DCTCP still suffers timeouts when N is large enough to overwhelm a static buffer allocation; the remedy there is dynamic buffering at the switch.

Outline TCP Incast - Problem Description Motivation and challenges Proposed Solutions Evaluation of proposed solutions Conclusion References

Evaluation of proposed solutions Application-level solution: genuine retransmissions lead to cascading timeouts and hence congestion; scheduling at the application level cannot be easily synchronized; limited control over the transport layer. ICTCP: a solution that needs minimal change and is cost effective, but scalability to a large number of TCP connections is an issue, extending it to handle congestion in general cases is only partly solved, and future high-bandwidth, low-latency networks will need extra support from link-layer technologies. DCTCP: a solution that needs minimal change but requires switch support, and it needs dynamic buffering for larger numbers of senders.

Conclusion No solution completely solves the problem, other than configuring a smaller RTO. Existing solutions pay little attention to foreground and background traffic together. We need solutions that are cost effective, require minimal change to the environment, and, of course, actually solve incast!

References
[1] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," in Proc. of ACM WREN, 2009.
[2] S. Kulkarni and P. Agrawal, "A Probabilistic Approach to Address TCP Incast in Data Center Networks," in Distributed Computing Systems Workshops (ICDCSW), 2011.
[3] Peng Zhang, Hongbo Wang, and Shiduan Cheng, "Shrinking MTU to Mitigate TCP Incast Throughput Collapse in Data Center Networks," in Communications and Mobile Computing (CMC), 2011.
[4] Yan Zhang and N. Ansari, "On Mitigating TCP Incast in Data Center Networks," in Proc. IEEE INFOCOM, 2011.
[5] Maxim Podlesny and Carey Williamson, "An Application-Level Solution for the TCP-Incast Problem in Data Center Networks," in IWQoS '11: Proceedings of the 19th International Workshop on Quality of Service, IEEE, June 2011.
[6] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan, "Data Center TCP (DCTCP)," in SIGCOMM '10: Proceedings of ACM SIGCOMM, August 2010.
[7] Hongyun Zheng, Changjia Chen, and Chunming Qiao, "Understanding the Impact of Removing TCP Binary Exponential Backoff in Data Centers," in Communications and Mobile Computing (CMC), 2011.
[8] Haitao Wu, Zhenqian Feng, Chuanxiong Guo, and Yongguang Zhang, "ICTCP: Incast Congestion Control for TCP in Data Center Networks," in Co-NEXT '10: Proceedings of the 6th International Conference, ACM, November 2010.