TCP Incast in Data Center Networks


1 TCP Incast in Data Center Networks
A study of the problem and proposed solutions

2 Outline
TCP Incast - Problem Description
Motivation and challenges
Proposed Solutions
Evaluation of proposed solutions
Conclusion
References

4 TCP Incast – Problem Description
Incast terminology used throughout this deck:
Barrier-synchronized workload
SRU (Server Request Unit)
Goodput vs. throughput
MTU, BDP
TCP acronyms such as RTT, RTO, CA (congestion avoidance), AIMD

5 TCP Incast – Problem
A typical deployment scenario in data centers

6 TCP Incast – Problem
Many-to-one, barrier-synchronized workload:
The receiver requests k blocks of data from S storage servers; each block is striped across the S servers.
Each server responds with a "fixed" amount of data (fixed-fragment workload).
The client does not request block k+1 until all fragments of block k have been received.
Typical datacenter scenario: k = 100 blocks, S = 1-48 servers, fragment size = 256KB. (A minimal sketch of this request pattern follows below.)
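A minimal sketch of the fixed-fragment, barrier-synchronized pattern described above. The helper `request_fragment`, the `server.read` call, and the use of a thread pool are illustrative assumptions, not part of any cited implementation:

```python
import concurrent.futures

FRAGMENT_SIZE = 256 * 1024  # fixed SRU size per server (256 KB)

def request_fragment(server, block_id):
    # Placeholder for a real network read; each server returns its
    # fixed-size fragment of the requested block.
    return server.read(block_id, FRAGMENT_SIZE)

def fetch_blocks(servers, num_blocks=100):
    """Barrier-synchronized reads: block k+1 is not requested until
    every fragment of block k has arrived."""
    for block_id in range(num_blocks):
        with concurrent.futures.ThreadPoolExecutor(len(servers)) as pool:
            # All servers answer the same block at (nearly) the same time --
            # this synchronized burst is what overflows the switch buffer.
            fragments = list(pool.map(lambda s: request_fragment(s, block_id), servers))
        yield b"".join(fragments)
```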

7 TCP Incast - Problem Goodput Collapse

8 TCP Incast – Problem
Switch buffers are inherently small, typically 32KB-128KB per port.
The bottleneck switch buffer is overwhelmed by the servers' synchronized sending, so the switch drops packets.
RTT is typically 1-2ms in datacenters while RTOmin is 200ms; this gap means dropped packets are not retransmitted quickly.
Because of the barrier, all the senders that have already delivered their fragments must wait until the dropped packet is retransmitted, and the large RTO delays that retransmission, so goodput collapses. (A back-of-the-envelope calculation follows below.)
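A rough back-of-the-envelope calculation of the mismatch, assuming a 1 Gbps access link (an assumption consistent with the Gigabit testbeds later in the deck) and the 256KB fragment size from the workload description:

```python
link_rate = 1e9 / 8          # 1 Gbps expressed in bytes/sec (assumed link speed)
fragment  = 256 * 1024       # bytes in one server's response
rtt       = 1e-3             # ~1 ms datacenter RTT
rto_min   = 200e-3           # default minimum RTO

ideal_time = fragment / link_rate   # ~2.1 ms to drain one fragment at line rate
idle_rtts  = rto_min / rtt          # one timeout idles the barrier for ~200 RTTs
print(f"one fragment: {ideal_time * 1e3:.1f} ms; one timeout: {rto_min * 1e3:.0f} ms "
      f"(~{idle_rtts:.0f} RTTs idle, ~{rto_min / ideal_time:.0f}x the fragment transfer time)")
```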

9 Outline
TCP Incast - Problem Description
Motivation and challenges
Proposed Solutions
Evaluation of proposed solutions
Conclusion
References

10 Motivation
Internet datacenters support a myriad of services and applications (Google, Microsoft, Yahoo, Amazon).
The vast majority of datacenters use TCP for communication between nodes.
Companies like Facebook have adopted UDP as their transport protocol to avoid TCP incast, pushing the responsibility for flow control up to the application-layer protocols.
The unique workloads (e.g., MapReduce, Hadoop), scale, and environment of internet datacenters violate the WAN assumptions under which TCP was originally designed.
Example: in a web search application, many workers respond nearly simultaneously to a search query; likewise, key-value pairs from many Mappers are transferred to the appropriate Reducers during the shuffle stage.

11 Incast in Bing (Microsoft)
Ref: slide from Albert Greenberg's (Microsoft) presentation at SIGCOMM'10

12 Challenges
Only minimal changes to the TCP implementation should be needed.
RTOmin cannot be decreased below 1ms because commodity operating systems do not support such high-resolution RTO timers.
Both internal and external flows have to be addressed.
Large switch buffers cannot be afforded because they are costly.
The solution must be easy to deploy and cost effective.

13 Outline
TCP Incast - Problem Description
Characteristics of the problem and challenges
Proposed Solutions
Evaluation of proposed solutions
Conclusion
References

14 Proposed Solutions
Solutions can be divided into:
Application-level solutions
Transport-layer solutions
Transport-layer solutions aided by the switch's ECN and QCN capabilities
An alternative way to categorize the solutions:
Avoiding timeouts in TCP
Reducing RTOmin
Replacing TCP
Relying on lower-layer mechanisms such as Ethernet flow control

15 Understanding the problem…
A collaborative study by UC Berkeley EECS and Intel Labs [1].
The study focused on showing that the problem is general, deriving an analytical model, and studying the impact of various modifications to TCP on incast behavior.

16 Different RTO Timers
Observations:
The initial goodput minimum occurs at the same number of servers.
A smaller RTO timer value gives a faster goodput "recovery" rate.
The rate of decrease after the local maximum is the same across different minimum RTO settings.

17 Decreasing the RTO gives a proportional increase in goodput
Surprisingly, a 1ms RTO with delayed ACKs enabled was the better performer.
With delayed ACKs disabled at a 1ms RTO, the high rate of ACKs forces the sender to keep overriding its congestion window, causing fluctuations in the smoothed RTT estimate.

18 Quantitative Model
D: total amount of data to be sent (100 blocks of 256KB)
L: total transfer time of the workload without any RTO events
R: number of RTO events during the transfer
S: number of servers
r: value of the minimum RTO timer
I: inter-packet wait time
R and I were modeled from empirically observed behavior.
Net goodput is the total data D divided by the total transfer time, i.e. L plus the delay contributed by the R RTO events (see the reconstruction below).
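The "Net goodput" equation itself did not survive extraction. A reconstruction consistent with the definitions above, treating the exact placement of I as an assumption (the refined model in [1] may differ), is:

```latex
\text{goodput}(S) \;=\; \frac{D}{\,L \;+\; R(S)\,\bigl(r + I(S)\bigr)\,}
```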

19 Key Observations
A smaller minimum RTO timer value means larger goodput at the initial minimum.
The initial goodput minimum occurs at the same number of senders, regardless of the minimum RTO timer value.
The second-order goodput peak occurs at a higher number of senders for a larger RTO timer value.
The smaller the RTO timer value, the faster the recovery between the goodput minimum and the second-order goodput maximum.
After the second-order goodput maximum, the slope of the goodput decrease is the same for different RTO timer values.

20 Application-level solution [5]
No changes required to the TCP stack or network switches.
Based on scheduling server responses to the same data block so that no data loss occurs (a sketch follows below).
Caveats:
Genuine retransmissions remain problematic (they can cascade into timeouts).
Scheduling at the application level cannot be easily synchronized.
There is limited control over the transport layer.
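A sketch of the scheduling idea, assuming the client limits how many servers answer a block concurrently so that their combined responses fit in the switch buffer; the wave-based grouping and helper names are illustrative assumptions, not the exact mechanism of [5]:

```python
def schedule_responses(servers, block_id, buffer_bytes, fragment_bytes, fetch):
    """Request a block in waves: each wave is small enough that the
    concurrent responses fit in the bottleneck switch buffer."""
    wave_size = max(1, buffer_bytes // fragment_bytes)   # senders allowed per wave
    fragments = []
    for i in range(0, len(servers), wave_size):
        wave = servers[i:i + wave_size]
        # Servers in later waves are asked only after the current wave finishes,
        # so no synchronized burst ever exceeds the buffer.
        fragments.extend(fetch(server, block_id) for server in wave)
    return fragments
```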

21 Application level solution

22 Application level solution

23 ICTCP: Incast Congestion Control for TCP in Data Center Networks [8]
Features:
Dynamically adjusts the advertised receive window to throttle senders.
Can be implemented on the receiver side only.
Focuses on avoiding packet losses before incast congestion occurs.
Test implementation as a Windows NDIS driver.
Novelties in the solution:
Uses the available bandwidth to coordinate receive-window increases across all incoming connections.
Per-flow congestion control is performed independently in slotted time on the order of the RTT.
The receive-window adjustment is based on the ratio of the difference between measured and expected throughput to the expected throughput.

24 Design considerations
The receiver knows how much throughput it has achieved and how much bandwidth is still available.
An overly tight receive-window control can constrain TCP performance, while a loose one does not prevent incast congestion.
Only low-latency flows (RTT less than 2ms) are considered.
The receive-window increase is determined by the available bandwidth.
Receive-window based congestion control should operate per flow.
The receive window should be adjusted according to both link congestion and application requirements.

25 ICTCP Algorithm – Control trigger: available bandwidth
Calculate the available bandwidth and estimate the potential per-flow throughput increase before increasing any receive window.
Time is divided into slots of two sub-slots each.
For each network interface, measure the available bandwidth in the first sub-slot and compute the quota for window increases in the second sub-slot.
Ensure the total receive-window increase stays within the available bandwidth measured in the first sub-slot. (A sketch of the quota computation follows below.)
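A sketch of the first-sub-slot measurement and quota computation. The formula BW_A = max(0, α·C − BW_T) with α ≈ 0.9 follows the ICTCP paper [8], but the function and parameter names here are illustrative assumptions:

```python
ALPHA = 0.9   # fraction of link capacity ICTCP aims to keep in use (per [8])

def quota_for_second_subslot(link_capacity_bps, bytes_received, subslot_seconds):
    """First sub-slot: measure incoming throughput BW_T on the interface.
    The available bandwidth BW_A = max(0, ALPHA*C - BW_T) becomes the quota
    that receive-window increases may consume in the second sub-slot."""
    bw_t = bytes_received * 8 / subslot_seconds        # measured throughput (bps)
    bw_a = max(0.0, ALPHA * link_capacity_bps - bw_t)  # headroom left on the link
    return bw_a
```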

26 ICTCP Algorithm – Per-connection control interval: 2*RTT
To estimate the throughput of a TCP connection for receive-window adjustment, the shortest usable time scale is that connection's RTT.
The control interval for a TCP connection in ICTCP is therefore 2*RTT:
one RTT for the adjusted window to take effect, and
one additional RTT to measure the throughput with the newly adjusted window.
A connection may increase its window, based on the newly observed throughput and the current available bandwidth, only if the current time falls in the second global sub-slot and more than 2*RTT has elapsed since its last receive-window adjustment.

27 ICTCP Algorithm – Window adjustment for a single connection
The receive window is adjusted based on the connection's measured incoming throughput.
The measured throughput reflects the application's current demand on that TCP connection.
The expected throughput is what the connection could achieve if it were constrained only by the receive window.
Define the ratio of the throughput difference, d_b = (expected − measured) / expected, and adjust the receive window as follows (see the sketch below):
Increase the receive window by one MSS if d_b is small, the current time is in the global second sub-slot, and there is enough quota of available bandwidth on the interface; decrease the quota correspondingly.
Decrease the receive window by one MSS if d_b stays large for three continuous RTTs. The minimum receive window is 2*MSS.
Otherwise, keep the current receive window.
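A sketch of one per-connection adjustment decision. The thresholds γ1 = 0.1 and γ2 = 0.5 are taken from the ICTCP paper [8] but should be treated as assumed constants here, and the function signature is illustrative:

```python
GAMMA1, GAMMA2 = 0.1, 0.5   # low/high thresholds on the throughput-difference ratio

def adjust_rwnd(rwnd, mss, measured_bps, rtt, quota_bps,
                in_second_subslot, over_gamma2_rtts):
    """One ICTCP-style receive-window decision for a single connection.
    Returns (new_rwnd, remaining_quota_bps, over_gamma2_rtts)."""
    expected_bps = max(measured_bps, rwnd * 8 / rtt)    # what rwnd would allow
    d_b = (expected_bps - measured_bps) / expected_bps  # throughput-difference ratio
    increase_cost = mss * 8 / rtt                       # bandwidth one extra MSS needs

    if d_b <= GAMMA1 and in_second_subslot and quota_bps >= increase_cost:
        # Connection is using nearly all its window: grow by one MSS,
        # charging the increase against the interface quota.
        return rwnd + mss, quota_bps - increase_cost, 0
    if d_b > GAMMA2:
        over_gamma2_rtts += 1
        if over_gamma2_rtts >= 3:                       # sustained over-allocation
            return max(2 * mss, rwnd - mss), quota_bps, 0
        return rwnd, quota_bps, over_gamma2_rtts
    return rwnd, quota_bps, 0                           # otherwise keep the window
```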

28 ICTCP Algorithm – Fairness controller for multiple connections
Fairness is considered only among low-latency flows.
Windows are decreased for fairness only when the available bandwidth BW_A < 0.2*C.
For a window decrease, cut the receive window by one MSS on selected TCP connections: those whose receive window is larger than the average window of all connections (sketched below).
Window increases need no special handling; fairness is achieved automatically by the window-adjustment rules above.
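A sketch of the fairness step under the slide's own conditions (trigger only when BW_A < 0.2·C, trim only above-average windows); the list-based representation of connections is an illustrative simplification:

```python
def fairness_decrease(rwnds, mss, bw_a, capacity_bps):
    """When the link is nearly saturated, cut one MSS from connections whose
    receive window is above the average, nudging flows toward equal shares."""
    if bw_a >= 0.2 * capacity_bps:
        return rwnds                      # enough headroom: no fairness action
    avg = sum(rwnds) / len(rwnds)
    return [max(2 * mss, w - mss) if w > avg else w for w in rwnds]
```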

29 ICTCP Experimental Results
Testbed:
47 servers
One LB4G 48-port Gigabit Ethernet switch
Broadcom Gigabit Ethernet NICs in the hosts
Windows Server 2008 R2 Enterprise, 64-bit

30

31 Issues with ICTCP
Scalability to a large number of TCP connections is a concern: the per-connection receive window would have to drop below 1 MSS, degrading TCP performance.
Extending ICTCP to handle congestion in general cases, where the sender and receiver are not under the same switch and the bottleneck link is not the last hop to the receiver, remains open.
So does adapting ICTCP to future high-bandwidth, low-latency networks.

32 DCTCP Features
A TCP-like protocol for data centers.
Uses ECN (Explicit Congestion Notification) to provide multi-bit feedback to the end hosts.
The claim is that DCTCP provides better throughput than TCP while using 90% less buffer space.
Provides high burst tolerance and low latency for short flows.
Can also handle a 10X increase in foreground and background traffic without a significant performance hit.

33 DCTCP Overview
Applications in data centers largely require:
low latency for short flows,
high burst tolerance, and
high utilization for long flows.
Short flows have real-time deadlines on the order of milliseconds.
High utilization for long flows is essential because these flows continuously refresh internal data structures.
The study analyzed production traffic from application servers: roughly 150 TB of traffic over a one-month period.
Query traffic (2KB to 20KB flows) experiences the incast impairment.

34 DCTCP Overview (contd.)
DCTCP uses the ECN capability available in most modern switches.
It derives multi-bit feedback on congestion from the single-bit stream of ECN marks.
The essence of the proposal is to keep switch buffer occupancies persistently low while maintaining high throughput.
To control queue length at the switches, it uses an Active Queue Management (AQM) approach based on explicit feedback from congested switches.
The authors claim that only about 30 lines of code in TCP and the setting of a single parameter on the switches are needed.
DCTCP addresses three problems: incast (our focus here), queue buildup, and buffer pressure.

35 DCTCP Algorithm
Concentrates on the extent of congestion rather than just its presence, deriving multi-bit feedback from the single-bit sequence of marks.
Three components of the algorithm:
simple marking at the switch,
ECN-Echo at the receiver, and
a controller at the sender.

36 DCTCP – Simple marking at the switch; ECN-Echo at the receiver
Marking at the switch: an arriving packet is marked with the CE (Congestion Experienced) codepoint if the queue occupancy is greater than K, the marking threshold. Marking is based on the instantaneous queue length, not an average.
ECN-Echo at the receiver: in standard TCP/ECN, the receiver sets ECN-Echo on all ACKs until it receives CWR from the sender. A DCTCP receiver instead sets ECN-Echo only when a CE codepoint is seen on the received packet.

37 DCTCP – Controller at the sender
The sender maintains an estimate α of the fraction of packets that are marked, updated once per window of data.
α close to 0 indicates low congestion; α close to 1 indicates high congestion.
Whereas TCP cuts its window in half, DCTCP uses α to size the cut:
cwnd ← cwnd × (1 − α/2)
(A sketch of the estimator and window cut follows below.)
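A minimal sketch of this estimator and cut, following the DCTCP paper's update rule α ← (1 − g)·α + g·F, where F is the fraction of marked packets in the last window; the gain g = 1/16 used here is an assumed illustrative value:

```python
G = 1.0 / 16   # EWMA gain g; a small illustrative value, not mandated by the slide

def dctcp_on_window_end(alpha, marked_pkts, total_pkts, cwnd):
    """Update the running estimate of congestion extent once per window and,
    if any packet in that window was marked, cut cwnd in proportion to it."""
    f = marked_pkts / max(1, total_pkts)   # F: fraction of CE-marked packets
    alpha = (1 - G) * alpha + G * f        # alpha <- (1 - g)*alpha + g*F
    if marked_pkts:                        # at least one ECN-Echo seen this window
        cwnd *= (1 - alpha / 2)            # gentle cut, unlike TCP's halving
    return alpha, cwnd
```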

38 DCTCP – Choosing the marking threshold K
Models the point at which the window reaches W*, i.e. when K sits at the critical point.
The maximum queue size Q_max depends on the number N of synchronously sending servers.
A lower bound for K can then be derived (see below).
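The bound itself was lost in extraction; the guideline stated in the DCTCP paper [6], with C the link capacity in packets per second, is (quoted from that paper rather than derived on this slide):

```latex
K \;>\; \frac{C \times RTT}{7}
```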

39 DCTCP – How does DCTCP solve incast?
TCP suffers from timeouts when N > 10.
DCTCP senders receive ECN marks and slow their rate.
DCTCP still suffers timeouts when N is large enough to overwhelm the static buffer; the remedy is dynamic buffering.

40 Outline
TCP Incast - Problem Description
Motivation and challenges
Proposed Solutions
Evaluation of proposed solutions
Conclusion
References

41 Evaluation of proposed solutions
Application-level solution:
Genuine retransmissions → cascading timeouts → congestion.
Scheduling at the application level cannot be easily synchronized.
Limited control over the transport layer.
ICTCP: a solution that needs minimal change and is cost effective, but
scalability to a large number of TCP connections is an issue,
extending ICTCP to handle congestion in general cases is only partially solved, and
future high-bandwidth, low-latency networks will need extra support from link-layer technologies.
DCTCP: a solution that needs minimal change but requires switch support, and
dynamic buffering is required for larger numbers of senders.

42 Conclusion
No solution completely solves the problem, other than configuring a smaller RTOmin.
The proposed solutions pay little attention to handling foreground and background traffic together.
We need solutions that are cost effective, require minimal changes to the environment, and, of course, actually solve incast!

43 References
[1] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," in Proc. ACM WREN, 2009.
[2] S. Kulkarni and P. Agrawal, "A Probabilistic Approach to Address TCP Incast in Data Center Networks," in Proc. IEEE ICDCS Workshops (ICDCSW), 2011.
[3] P. Zhang, H. Wang, and S. Cheng, "Shrinking MTU to Mitigate TCP Incast Throughput Collapse in Data Center Networks," in Proc. Communications and Mobile Computing (CMC), 2011.
[4] Y. Zhang and N. Ansari, "On Mitigating TCP Incast in Data Center Networks," in Proc. IEEE INFOCOM, 2011.
[5] M. Podlesny and C. Williamson, "An Application-Level Solution for the TCP-Incast Problem in Data Center Networks," in Proc. 19th IEEE International Workshop on Quality of Service (IWQoS), June 2011.
[6] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, "Data Center TCP (DCTCP)," in Proc. ACM SIGCOMM, August 2010.
[7] H. Zheng, C. Chen, and C. Qiao, "Understanding the Impact of Removing TCP Binary Exponential Backoff in Data Centers," in Proc. Communications and Mobile Computing (CMC), 2011.
[8] H. Wu, Z. Feng, C. Guo, and Y. Zhang, "ICTCP: Incast Congestion Control for TCP in Data Center Networks," in Proc. ACM CoNEXT, November 2010.

