Lecture 16, Computer Networks (198:552): Congestion Control in Data Centers
Transport inside the DC: the Internet side has 100 Kbps–100 Mbps links and ~100 ms latency, while the fabric connecting the servers has 10–40 Gbps links and ~10–100 μs latency.
Transport inside the DC: the fabric is the interconnect for distributed compute workloads (web apps, caches, databases, map-reduce, HPC, monitoring). While some data center traffic is sent across the Internet, the majority of it flows between servers within the data center and never leaves it.
What's different about DC transport? Network characteristics: very high link speeds (Gb/s) and very low latency (microseconds). Application characteristics: large-scale distributed computation. Challenging traffic patterns: a diverse mix of mice and elephants, plus incast. Cheap switches: single-chip shared-memory devices with shallow buffers.
Additional degrees of flexibility in the DC: flow priorities and deadlines, preemption and termination of flows, coordination with switches, and packet header changes to propagate information.
Data center workloads: mice and elephants! Short messages (e.g., query, coordination) need low latency; large flows (e.g., data update, backup) need high throughput.
Incast: synchronized fan-in congestion. Many workers reply to an aggregator at once, the switch buffer overflows, and the losing flows stall on TCP timeouts (RTO_min = 300 ms). Vasudevan et al. (SIGCOMM '09).
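To make the failure mode concrete, here is a back-of-the-envelope toy model (my own sketch, not from the paper): N workers each return an S-byte response through one switch port at the same instant, anything beyond what the shallow buffer plus one RTT of link drain can absorb is dropped, and a flow that loses its tail must wait out RTO_min. All parameter values are illustrative assumptions.

```python
# Toy incast model: synchronized responses into one shallow-buffered port.
# All parameter values below are illustrative assumptions, not measurements.

LINK_BPS = 1e9          # 1 Gbps bottleneck link
RTT      = 100e-6       # 100 microsecond round-trip time
BUFFER_B = 128 * 1024   # shallow shared switch buffer, in bytes
RESP_B   = 32 * 1024    # each worker's response size, in bytes
RTO_MIN  = 0.3          # 300 ms minimum retransmission timeout

def query_completion_time(num_workers: int) -> float:
    """Rough completion time when num_workers reply simultaneously."""
    offered = num_workers * RESP_B
    drained = LINK_BPS / 8 * RTT          # bytes the link drains in one RTT
    ideal = offered * 8 / LINK_BPS        # serialization time with no loss
    if offered <= BUFFER_B + drained:
        return ideal                      # burst fits: no drops, no timeout
    return ideal + RTO_MIN                # tail drops stall some flows on RTO

for n in (2, 4, 8, 16, 32):
    print(f"{n:2d} workers -> ~{query_completion_time(n) * 1000:7.2f} ms")
```

The point of the toy model is the cliff: once the synchronized burst exceeds the buffer, completion time jumps from about a millisecond to RTO_min, which is why the 300 ms timeout dominates incast latency.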
Incast in Microsoft Bing: MLA query completion time (ms). Incast really happens; this is a measurement from a production tool. People care, and they have worked around it at the application layer by jittering requests over a 10 ms window; when jittering is switched off (around 8:30 am in the trace), completion times spike. They care about the 99.9th percentile, i.e., 1 in 1000 customers. Jittering trades off the median for the high percentiles.
DC transport requirements: low latency (short messages, queries), high throughput (continuous data updates, backups), and high burst tolerance (incast). The challenge is to achieve these together.
Data Center TCP (DCTCP), Mohammad Alizadeh et al., SIGCOMM '10
TCP is widely used in the data center: apps use familiar interfaces, and TCP is deeply ingrained in the apps and in developers' minds. However, TCP was not really designed for data center environments. Working around TCP's problems is complex and leads to ad-hoc, inefficient, often expensive solutions, and practical deployment is hard, so keep it simple!
Review: the TCP algorithm. Additive increase: W → W + 1 per round-trip time. Multiplicative decrease: W → W/2 per drop or ECN mark, giving the familiar sawtooth in window size (rate) over time. ECN = Explicit Congestion Notification, a single bit marked by the switch and echoed by the receiver; DCTCP is based on this existing ECN framework in TCP.
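As a refresher, a minimal sketch of the AIMD update just described (illustrative Python, one update per RTT; the function name and loop structure are my assumptions, not a real TCP stack):

```python
# Minimal AIMD sketch: one congestion-window update per RTT.
# cwnd is measured in MSS-sized packets.

def tcp_aimd_update(cwnd: float, congestion_signal: bool) -> float:
    """congestion_signal: a drop or ECN mark was seen in the last RTT."""
    if congestion_signal:
        return max(cwnd / 2.0, 1.0)   # multiplicative decrease: W -> W/2
    return cwnd + 1.0                 # additive increase: W -> W + 1

# Toy sawtooth: a congestion event every 20th RTT.
cwnd = 1.0
for rtt in range(100):
    cwnd = tcp_aimd_update(cwnd, congestion_signal=(rtt % 20 == 19))
```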
TCP buffer requirement. Bandwidth-delay product rule of thumb: a single flow needs C × RTT of buffering for 100% throughput; with buffer size B < C × RTT, throughput falls below 100%, while B ≥ C × RTT sustains it. How much buffering TCP needs for high throughput is well studied and is known in the literature as the buffer sizing problem. If we can lower the variance of the sending rates, we can reduce the buffering requirement, and that is exactly what DCTCP is designed to do.
Reducing buffer requirements. Appenzeller et al. (SIGCOMM '04): with a large number n of flows, a buffer of C × RTT / √n is enough for 100% throughput, so in some circumstances we don't need big buffers.
Reducing buffer requirements (continued). We can't rely on this statistical-multiplexing benefit in the DC: measurements show typically only 1–2 large flows at each server. Key observation: with low variance in sending rate, small buffers suffice.
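A quick worked example of the two sizing rules, evaluated for assumed data center numbers (the 10 Gbps link, 100 μs RTT, and 1500-byte packet size are illustrative assumptions):

```python
# Buffer sizing rules of thumb, evaluated for assumed DC parameters.
from math import sqrt

C_BPS = 10e9     # 10 Gbps link (assumption)
RTT_S = 100e-6   # 100 microsecond round-trip time (assumption)
MTU_B = 1500     # bytes per packet, for intuition only

bdp = C_BPS / 8 * RTT_S                       # C x RTT, single-flow rule
print(f"C x RTT = {bdp / 1e3:.1f} KB (~{bdp / MTU_B:.0f} packets)")

for n in (1, 2, 100, 10000):                  # Appenzeller: C x RTT / sqrt(n)
    print(f"n = {n:5d} flows -> buffer {bdp / sqrt(n) / 1e3:7.2f} KB")
```

With only 1–2 large flows per server, the √n term buys essentially nothing, which is why the DC case falls back on needing low-variance senders instead.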
DCTCP main idea: extract multi-bit feedback from the single-bit stream of ECN marks, and reduce the window size based on the fraction of marked packets.
DCTCP main idea, by example. For an ECN mark stream of 1 0 1 1 1 (80% marked), TCP cuts the window by 50% while DCTCP cuts it by 40%; for 0 0 0 0 0 0 0 0 0 1 (10% marked), TCP still cuts by 50% while DCTCP cuts by only 5%. The resulting window trace is much smoother: the standard deviation of the window size is 33.6 KB for TCP versus 11.5 KB for DCTCP.
DCTCP algorithm. Switch side: mark packets (set the ECN bit) when the queue length exceeds a threshold K; this is a very simple marking mechanism without the tunings of other AQMs. Sender side: each RTT, estimate the fraction F of packets that were marked from the stream of ECN echoes, and maintain a running average α ← (1 − g)·α + g·F. Adaptive window decrease: W ← W·(1 − α/2), so the window is cut by a factor between 1 and 2 depending on the extent of congestion. The goal is to keep rate variations smooth so DCTCP operates well with shallow buffers and only a few flows (no statistical multiplexing). Only the decrease rule changes; the increase is left as in TCP, so the same idea could be applied to other algorithms (e.g., CTCP, CUBIC) by changing only how they cut their window.
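A minimal sketch of the sender-side bookkeeping described above (the EWMA gain g = 1/16 and the once-per-window update follow the paper's description; the class structure, names, and simplified accounting are my assumptions):

```python
# Sender-side DCTCP sketch: update alpha once per window of data, then
# apply the adaptive decrease. Illustrative only, not a real TCP stack.

class DctcpSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1.0 / 16):
        self.cwnd = cwnd     # congestion window, in packets
        self.alpha = 0.0     # running estimate of fraction of marked packets
        self.g = g           # EWMA gain

    def on_window_acked(self, acked: int, marked: int) -> None:
        """Called once per RTT with counts of ACKed and ECN-marked packets."""
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked > 0:
            # Adaptive decrease: cut in proportion to the extent of congestion.
            self.cwnd = max(self.cwnd * (1 - self.alpha / 2), 1.0)
        else:
            # The increase is unchanged from TCP: +1 packet per RTT.
            self.cwnd += 1.0
```

When no packets are marked, α decays toward zero, so an isolated mark causes only a gentle cut; a fully marked window drives α toward 1 and the cut toward TCP's halving.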
Queue length, DCTCP vs. TCP (KBytes). Experiment: 2 flows (Win 7 stack), Broadcom 1 Gbps switch, ECN marking threshold = 30 KB. With DCTCP the buffer is mostly empty; DCTCP mitigates incast by creating a large buffer headroom.
Why it works. 1. Low latency: small buffer occupancies → low queuing delay. 2. High throughput: ECN averaging → smooth rate adjustments, low variance. 3. High burst tolerance: large buffer headroom → bursts fit; aggressive marking → sources react before packets are dropped.
Setting parameters: a bit of analysis. How much buffering does DCTCP need for 100% throughput? We need to quantify the queue size oscillations (stability). In the sawtooth model, the window grows to W* + 1, the packets sent in that RTT are marked, and the window is then cut to (W* + 1)(1 − α/2).
How small can queues be without loss of throughput? The analysis of the oscillation amplitude yields the marking threshold needed for full throughput: for TCP, K > C × RTT; for DCTCP, K > (1/7) C × RTT.
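Plugging in assumed data center numbers shows how small the DCTCP threshold can be (the 10 Gbps link, 100 μs RTT, and 1500-byte packet size are illustrative assumptions):

```python
# Marking-threshold rules from the analysis above, for assumed DC parameters.

C_BPS = 10e9     # 10 Gbps link (assumption)
RTT_S = 100e-6   # 100 microsecond round-trip time (assumption)
MTU_B = 1500     # bytes per packet, for intuition only

bdp = C_BPS / 8 * RTT_S            # C x RTT in bytes
k_tcp = bdp                        # TCP needs K > C x RTT
k_dctcp = bdp / 7                  # DCTCP needs K > (1/7) C x RTT

print(f"TCP:   K > {k_tcp / 1e3:6.1f} KB (~{k_tcp / MTU_B:.0f} packets)")
print(f"DCTCP: K > {k_dctcp / 1e3:6.1f} KB (~{k_dctcp / MTU_B:.0f} packets)")
```

At these numbers the DCTCP threshold is on the order of a dozen 1500-byte packets, which fits comfortably in the shallow buffers of commodity single-chip switches.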
Convergence time: DCTCP takes at most ~40% more RTTs than TCP to converge ("Analysis of DCTCP", SIGMETRICS 2011). Intuition: DCTCP makes smaller adjustments than TCP, but makes them much more frequently.
Bing benchmark (baseline): completion times for background flows and query flows. People always want to get more out of their network, which motivates scaling the traffic up.
Bing benchmark (scaled 10x): completion time (ms) for two traffic classes within the same experiment, query traffic (incast bursts) and short messages (delay-sensitive). Deep buffers fix incast but increase latency; DCTCP is good for both incast and latency.
Discussion. Between throughput, delay, and convergence time, which metrics are you willing to give up, and why? Are there other factors that may determine the choice of K and B besides loss of throughput and maximum queue size? How would you improve on DCTCP? How could you add flow prioritization on top of DCTCP?
Acknowledgment Slides heavily adapted from material by Mohammad Alizadeh