Data Center TCP (DCTCP)
Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan
Microsoft Research, Stanford University
Presented by Liyong

Outline
- Introduction
- Communications in Data Centers
- The DCTCP Algorithm
- Analysis and Experiments
- Conclusion

Introduction

Data Center Packet Transport
Cloud computing service providers – Amazon, Microsoft, Google – need to build highly available, high-performance computing and storage infrastructure from low-cost, commodity components.

We focus on soft real-time applications, supporting:
- Web search
- Retail
- Advertising
- Recommendation systems
These applications require three things from the data center network:
- Low latency for short flows
- High burst tolerance
- High utilization for long flows

Two major contributions:
1. Measure and analyze production traffic, extracting application patterns and needs, and identify the impairments that hurt performance.
2. Propose Data Center TCP (DCTCP) and evaluate it at 1 and 10 Gbps on ECN-capable commodity switches.

Production traffic
Over 150 TB of compressed data, collected over the course of a month from ~6,000 servers. The measurements reveal that 99.91% of traffic in this data center is TCP traffic, consisting of three kinds:
- query traffic (2 KB to 20 KB in size)
- delay-sensitive short messages (100 KB to 1 MB)
- throughput-sensitive long flows (1 MB to 100 MB)
The key lesson from these measurements is that to meet the requirements of such a diverse mix of short and long flows, switch buffer occupancies need to be persistently low, while maintaining high throughput for the long flows. DCTCP is designed to do exactly this.

TCP in the Data Center
TCP does not meet the demands of these applications:
- Incast: suffers from bursty packet drops, and is not fast enough to utilize spare bandwidth.
- Builds up large queues: adds significant latency and wastes precious buffer space, which is especially bad with shallow-buffered switches.
Operators work around TCP's problems with ad-hoc, inefficient, often expensive solutions.
Our solution: Data Center TCP.

Communications in Data Centers

Partition/Aggregate Application Structure

Partition/Aggregate
[Figure: a Top-Level Aggregator (TLA) fans a request out to Mid-Level Aggregators (MLAs) and then to worker nodes; e.g., a query for "Picasso" is partitioned across workers, whose partial answers (famous quotes) are aggregated back up the tree.]
Time is money: strict deadlines (SLAs) apply at each level of the tree, e.g. 250 ms, 50 ms, and 10 ms. A missed deadline means a lower quality result.

Workloads
- Partition/Aggregate (query) – delay-sensitive
- Short messages [50 KB–1 MB] (coordination, control state) – delay-sensitive
- Large flows [1 MB–50 MB] (data update) – throughput-sensitive

Impairments
Three impairments arise at shallow-buffered switches: Incast, Queue Buildup, and Buffer Pressure.

Switches
Most commodity switches used in clusters are shared-memory switches that aim to exploit statistical multiplexing gain through a logically common packet buffer available to all switch ports. Packets arriving on an interface are stored in a high-speed multi-ported memory shared by all the interfaces. Memory from the shared pool is dynamically allocated to a packet by an MMU, which attempts to give each interface as much memory as it needs while preventing unfairness by dynamically adjusting the maximum amount of memory any one interface can take. Building large multi-ported memories is very expensive, so most cheap switches are shallow-buffered, with packet buffer the scarcest resource. These shallow packet buffers cause three specific performance impairments, which we discuss next.
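To make the dynamic allocation concrete, here is a minimal Python sketch of a shared-buffer MMU in the spirit of the classic dynamic-threshold scheme (Choudhury–Hahne). The class, the `alpha` factor, and the admission rule are illustrative assumptions, not the paper's switch implementation.

```python
class SharedBufferSwitch:
    """Toy model of a shared-memory switch MMU using dynamic
    thresholds: each port may occupy at most alpha * (free buffer),
    so a heavily loaded port cannot starve the others."""

    def __init__(self, total_bytes, num_ports, alpha=1.0):
        self.total = total_bytes
        self.used = [0] * num_ports   # bytes currently queued per port
        self.alpha = alpha

    def free(self):
        return self.total - sum(self.used)

    def admit(self, port, pkt_bytes):
        # Dynamic cap: generous when the buffer is empty, and it
        # shrinks automatically as the shared pool fills up.
        cap = self.alpha * self.free()
        if self.used[port] + pkt_bytes > cap:
            return False              # drop: port is over its fair share
        self.used[port] += pkt_bytes
        return True

    def depart(self, port, pkt_bytes):
        self.used[port] -= pkt_bytes
```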

Incast
[Figure: Workers 1–4 reply to an aggregator at the same time; the synchronized burst overflows the switch buffer and triggers a TCP timeout, with RTOmin = 300 ms.]
Synchronized mice collide – caused by Partition/Aggregate.

Queue Buildup
[Figure: two senders share a bottleneck link to one receiver; the big flow's queue delays the short flow's packets.]
Big flows build up queues, increasing latency for short flows. Measurements in a Bing cluster:
- For 90% of packets: RTT < 1 ms
- For 10% of packets: 1 ms < RTT < 15 ms

Buffer Pressure
Because the packet buffer is shared, long flows traversing other ports consume buffer space and leave less headroom for short flows; indeed, the loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports. The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.

Data Center Transport Requirements
1. High burst tolerance – incast due to Partition/Aggregate is common.
2. Low latency – short flows, queries.
3. High throughput – large file transfers.
The challenge is to achieve all three together.

Balance Between Requirements
- Deep buffers: queuing delays increase latency.
- Shallow buffers: bad for bursts and throughput.
- Reduced RTOmin (SIGCOMM '09): doesn't help latency.
- AQM (e.g. RED): the average queue is not fast enough for incast.
DCTCP's objective: low queue occupancy with high throughput.

The DCTCP Algorithm

Review: TCP Congestion Control
Four stages:
- Slow start
- Congestion avoidance
- Fast retransmit
- Fast recovery
A router must maintain one or more queues per port, so controlling the queue is important. There are two kinds of queue control algorithms:
- Queue management algorithms: manage the queue length by dropping packets when necessary.
- Queue scheduling algorithms: determine the next packet to send.
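As a refresher, here is a minimal sketch of the window dynamics behind these four stages. It is illustrative only: real stacks track far more state (SACK, timers, pacing), and the function shape is an assumption.

```python
def tcp_window_update(cwnd, ssthresh, event, mss=1.0):
    """Toy TCP congestion window update, in units of segments.
    event is 'ack', 'dupack3' (fast retransmit/recovery), or 'timeout'."""
    if event == 'ack':
        if cwnd < ssthresh:
            cwnd += mss                # slow start: exponential growth
        else:
            cwnd += mss * mss / cwnd   # congestion avoidance: ~ +1 segment/RTT
    elif event == 'dupack3':
        ssthresh = max(cwnd / 2, 2 * mss)
        cwnd = ssthresh                # fast recovery: halve, skip slow start
    elif event == 'timeout':
        ssthresh = max(cwnd / 2, 2 * mss)
        cwnd = mss                     # severe loss: restart from slow start
    return cwnd, ssthresh
```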

Two classes of queue management algorithms
Passive queue management: drop packets only after the queue is full.
- Traditional methods: drop-tail, random drop, drop-front.
- Drawbacks:
  - Lock-out: a few flows occupy the queue exclusively, preventing packets from other flows from entering.
  - Full queues: congestion signals are sent only when the queue is full, so the queue remains in the full state for long periods.
Active queue management (AQM): act before the queue is full.

AQM (dropping packets before the queue is full)
RED (Random Early Detection) [RFC 2309]:
- Calculate the average queue length (aveQ) to estimate the degree of congestion.
- Calculate a drop probability P according to the degree of congestion, using two thresholds min_th and max_th:
  - aveQ < min_th: don't drop packets
  - min_th < aveQ < max_th: drop packets with probability P
  - aveQ > max_th: drop all packets
- Drawback: drops packets even when the queue isn't full.
ECN (Explicit Congestion Notification) [RFC 3168]: feeds congestion information back to the sender by marking packets instead of dropping them.
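A minimal sketch of RED's control law as described above. The EWMA weight `w`, `max_p`, and the class shape are illustrative choices; real RED implementations add refinements such as counting packets since the last drop and handling idle periods.

```python
import random

class RedQueue:
    """Minimal RED sketch: an EWMA of the queue length drives a
    drop probability that ramps from 0 at min_th to max_p at max_th."""

    def __init__(self, min_th, max_th, max_p=0.1, w=0.002):
        self.min_th, self.max_th = min_th, max_th
        self.max_p, self.w = max_p, w
        self.ave_q = 0.0

    def on_arrival(self, current_qlen):
        # EWMA estimate of the degree of congestion (aveQ)
        self.ave_q = (1 - self.w) * self.ave_q + self.w * current_qlen
        if self.ave_q < self.min_th:
            return 'enqueue'
        if self.ave_q >= self.max_th:
            return 'drop'              # or mark, if ECN is enabled
        p = self.max_p * (self.ave_q - self.min_th) / (self.max_th - self.min_th)
        return 'drop' if random.random() < p else 'enqueue'
```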

ECN
Routers or switches must support it (be ECN-capable).
Two bits in the ECN field of the IP header:
- ECT (ECN-Capable Transport): set by the sender, indicating whether the sender's transport protocol supports ECN.
- CE (Congestion Experienced): set by routers or switches to indicate that congestion has occurred.
Two flag bits in the TCP header:
- ECN-Echo (ECE): the receiver notifies the sender that it has received a CE-marked packet.
- CWR (Congestion Window Reduced): the sender notifies the receiver that it has decreased its congestion window.
ECN integrates with RED.
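For reference, the IP codepoints from RFC 3168 with a simplified marking rule as a sketch; the `router_forward` helper and its policy are an illustration, not a complete router implementation.

```python
# The two ECN bits in the IP header (RFC 3168 codepoints):
NOT_ECT = 0b00   # sender's transport does not support ECN
ECT_1   = 0b01   # ECN-Capable Transport
ECT_0   = 0b10   # ECN-Capable Transport (equivalent for this purpose)
CE      = 0b11   # Congestion Experienced, set by a router/switch

def router_forward(ecn_bits, congested):
    """Returns (new_ecn_bits, dropped). A congested ECN router
    marks instead of dropping, but only for ECN-capable packets."""
    if congested:
        if ecn_bits in (ECT_0, ECT_1):
            return CE, False          # mark the packet, keep it
        return ecn_bits, True         # not ECN-capable: must drop
    return ecn_bits, False
```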

ECN working principle
[Figure: (1) the source sets ECT in the IP header; (2) a congested router sets CE; (3) the destination echoes the mark back to the source via the ECN-Echo flag in the TCP header of the ACK; (4) the source sets CWR after reducing its congestion window.]

Review: The TCP/ECN Control Loop
[Figure: two senders share a bottleneck switch that sets a 1-bit ECN mark; the receiver echoes marks back to the senders.]
ECN = Explicit Congestion Notification

Two Key Ideas
1. React in proportion to the extent of congestion, not its presence. This reduces variance in sending rates, lowering queuing requirements.
2. Mark based on instantaneous queue length. This gives fast feedback to better deal with bursts.

ECN mark stream       | TCP               | DCTCP
1 0 1 1 1 1 0 1 1 1   | cut window by 50% | cut window by 40%
0 0 0 0 0 0 0 0 0 1   | cut window by 50% | cut window by 5%

Data Center TCP Algorithm
Switch side:
- Mark packets (set CE) when the instantaneous queue length > K.
[Figure: a buffer of size B with marking threshold K; packets arriving above K are marked, those below K are not.]
Sender side:
- Maintain a running estimate α of the fraction of packets marked. In each RTT:
  α ← (1 − g) × α + g × F   (1)
  where F is the fraction of packets that were marked in the last window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α.
- Adaptive window decrease:
  W ← W × (1 − α/2)
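A minimal sender-side sketch of these updates, with the window in packets and one update per window of data. The paper's experiments use g = 1/16; the class and method names here are assumptions for illustration.

```python
class DctcpSender:
    """Sketch of DCTCP's sender-side logic. Window units are
    packets; alpha is updated once per window of data (~1 RTT)."""

    def __init__(self, g=1 / 16.0, cwnd=10.0):
        self.g = g          # EWMA weight for new samples (paper uses 1/16)
        self.alpha = 0.0    # running estimate of the fraction of marked packets
        self.cwnd = cwnd

    def on_window_acked(self, acked_pkts, marked_pkts):
        F = marked_pkts / acked_pkts                   # fraction marked this window
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked_pkts > 0:
            # React in proportion to the extent of congestion:
            # alpha = 1 halves the window like TCP; small alpha trims gently.
            self.cwnd *= (1 - self.alpha / 2)
        else:
            self.cwnd += 1                             # additive increase, +1 per RTT
```

With α = 0.1 the window is trimmed by only 5%, while α = 1 reproduces TCP's 50% cut, matching the table under "Two Key Ideas" above.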

DCTCP in Action
[Figure: instantaneous queue length over time for TCP vs. DCTCP.]
Setup: Windows 7, Broadcom 1 Gbps switch. Scenario: 2 long-lived flows, K = 30 KB.

Analysis and Experiments

Why it Works
1. High burst tolerance: large buffer headroom → bursts fit; aggressive marking → sources react before packets are dropped.
2. Low latency: small buffer occupancies → low queuing delay.
3. High throughput: ECN averaging → smooth rate adjustments, low variance in cwnd.

Analysis
[Figure: window-size sawtooth over time. The window grows to W* + 1; packets sent in this last RTT are marked; the window is then cut to (W* + 1)(1 − α/2).]

We are interested in computing the following quantities:
- The maximum queue size (Q_max)
- The amplitude of queue oscillations (A)
- The period of oscillations (T_C)

Consider N infinitely long-lived flows with identical round-trip times RTT, sharing a single bottleneck link of capacity C. We further assume that the N flows are synchronized. The queue size at time t is given by
  Q(t) = N × W(t) − C × RTT   (3)
where W(t) is the window size of a single source.
To compute the fraction of marked packets, α, let S(W1, W2) denote the number of packets sent by a sender while its window size increases from W1 to W2 > W1. Since this takes W2 − W1 round-trip times, during which the average window size is (W1 + W2)/2,
  S(W1, W2) = (W2² − W1²)/2   (4)

Let W* = (C × RTT + K)/N. This is the critical window size at which the queue size reaches K and the switch starts marking packets with the CE codepoint. During the RTT it takes for the sender to react to these marks, its window size increases by one more packet, reaching W* + 1. Hence
  α = S(W*, W* + 1) / S((W* + 1)(1 − α/2), W* + 1)   (5)
Plugging (4) into (5) and rearranging, we get
  α² (1 − α/4) = (2W* + 1)/(W* + 1)² ≈ 2/W*   (6)
Assuming α is small, this can be simplified as α ≈ √(2/W*). We can now compute A, T_C and Q_max.

Note that the amplitude of oscillation in the window size of a single flow, D, is given by
  D = (W* + 1) − (W* + 1)(1 − α/2) = (W* + 1) α/2   (7)
Since there are N flows in total,
  A = N × D = N(W* + 1) α/2 ≈ (N/2) √(2W*) = (1/2) √(2N(C × RTT + K))   (8)
  T_C = D (in round-trip times)   (9)
Finally, using (3), we have
  Q_max = N(W* + 1) − C × RTT = K + N   (10)

How do we set the DCTCP parameters?
- Marking threshold (K). The minimum value of the queue occupancy sawtooth is given by
    Q_min = Q_max − A = K + N − (1/2) √(2N(C × RTT + K))
  Choose K so that this minimum is larger than zero, i.e. the queue does not underflow. This results in
    K > (C × RTT)/7
- Estimation gain (g). The estimation gain g must be chosen small enough to ensure the exponential moving average (1) "spans" at least one congestion event. Since a congestion event occurs every T_C round-trip times, we choose g such that
    (1 − g)^T_C > 1/2
  Plugging in (9) with the worst-case value N = 1 results in the following criterion:
    g < 1.386 / √(2(C × RTT + K))
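Plugging concrete numbers into these guidelines, as a sketch; the link speed, RTT, and packet size below are illustrative, not the paper's testbed values.

```python
import math

def dctcp_parameters(link_gbps, rtt_us, n_flows=1, pkt_bytes=1500):
    """Evaluate the guidelines above for a concrete link."""
    c_rtt = link_gbps * 1e9 / 8 * rtt_us * 1e-6 / pkt_bytes  # C x RTT in packets
    k = c_rtt / 7                               # smallest K satisfying K > (C x RTT)/7
    g_max = 1.386 / math.sqrt(2 * (c_rtt + k))  # gain criterion, worst case N = 1
    w_star = (c_rtt + k) / n_flows
    alpha = math.sqrt(2 / w_star)               # steady-state marking fraction, from (6)
    q_max = k + n_flows                         # equation (10)
    return {'K_min_pkts': k, 'g_max': g_max, 'alpha': alpha, 'Q_max_pkts': q_max}

# Example: a 10 Gbps link with 100 us RTT gives C x RTT ~ 83 packets,
# so K must exceed ~12 packets and g should stay below ~0.1.
print(dctcp_parameters(10, 100))
```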

Evaluation
Implemented in the Windows stack. Real hardware, 1 Gbps and 10 Gbps experiments:
- 90-server testbed
- Broadcom Triumph: 48 1G ports, 4 MB shared memory
- Cisco Catalyst 4948: 48 1G ports, 16 MB shared memory
- Broadcom Scorpion: 24 10G ports, 4 MB shared memory
Numerous benchmarks: incast, queue buildup, buffer pressure.

Experiment setup
45 servers with 1G links connected to a Triumph switch, plus one 10G server for the external connection.
- 1 Gbps links: K = 20
- 10 Gbps link: K = 65
Generate query and background traffic: over 10 minutes, 200,000 background flows and 188,000 queries.
Metric: flow completion time for queries and background flows.
We use RTOmin = 10 ms for both TCP and DCTCP.

Baseline
[Figure: flow completion times for background flows (95th percentile) and for query flows, comparing TCP and DCTCP.]
- Low latency for short flows.
- High burst tolerance for query flows.

Scaled Background & Query: 10x Background, 10x Query
[Figure: 95th-percentile completion times at 10x load, comparing TCP and DCTCP.]

These results make three key points:
- First, if our data center used DCTCP it could handle 10x larger query responses and 10x larger background flows while performing better than it does with TCP today.
- Second, while using deep-buffered switches (without ECN) improves the performance of query traffic, it makes the performance of short transfers worse, due to queue buildup.
- Third, while RED improves the performance of short transfers, it does not improve the performance of query traffic, due to queue length variability.

Conclusions
DCTCP satisfies all our requirements for data center packet transport:
- Handles bursts well
- Keeps queuing delays low
- Achieves high throughput
Features:
- A very simple change to TCP plus a single switch parameter K.
- Based on ECN mechanisms already available in commodity switches.