6.888: Lecture 3 – Data Center Congestion Control
Mohammad Alizadeh
Spring
Transport inside the DC
– INTERNET: 100Kbps–100Mbps links, ~100ms latency
– DC fabric (servers): 10–40Gbps links, ~10–100μs latency
Transport inside the DC
– The fabric is an interconnect for distributed compute workloads: web apps, databases, map-reduce, HPC, monitoring, caches
What’s Different About DC Transport?
Network characteristics
– Very high link speeds (Gb/s); very low latency (microseconds)
Application characteristics
– Large-scale distributed computation
Challenging traffic patterns
– Diverse mix of mice & elephants
– Incast
Cheap switches
– Single-chip shared-memory devices; shallow buffers
Data Center Workloads: Mice & Elephants
– Short messages (e.g., query, coordination) need Low Latency
– Large flows (e.g., data update, backup) need High Throughput
Incast [Vasudevan et al., SIGCOMM’09]
– Synchronized fan-in congestion: many workers (Worker 1…4) respond to an aggregator at once
– The burst overflows the switch buffer, triggering a TCP timeout (RTOmin = 300 ms)
Incast in Bing
– Requests are jittered over a 10ms window; jittering switched off around 8:30 am
– [Figure: MLA query completion time (ms) over the day]
– Jittering trades off the median for the high percentiles
DC Transport Requirements
1. Low Latency – short messages, queries
2. High Throughput – continuous data updates, backups
3. High Burst Tolerance – incast
The challenge is to achieve these together
High Throughput vs. Low Latency
– Baseline fabric latency (propagation + switching): 10 microseconds
– High throughput requires buffering for rate mismatches… but this adds significant queuing latency
Data Center TCP
TCP in the Data Center
– TCP [Jacobson et al. ’88] is widely used in the data center: more than 99% of the traffic
– Operators work around TCP problems: ad-hoc, inefficient, often expensive solutions
– TCP is deeply ingrained in applications
– Practical deployment is hard → keep it simple!
Review: The TCP Algorithm
– Additive Increase: W → W+1 per round-trip time
– Multiplicative Decrease: W → W/2 per drop or ECN mark (ECN = Explicit Congestion Notification, a 1-bit signal from the switch)
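To make the loop concrete, here is a minimal sketch of the AIMD update in Python; the window values and the boolean ECN signal are illustrative, not taken from any real stack.

    # A minimal sketch of TCP's AIMD loop as described above; window sizes
    # and the ecn_marked feedback signal are illustrative placeholders.

    def tcp_aimd_update(window: float, ecn_marked: bool) -> float:
        """One round-trip's worth of the classic TCP update."""
        if ecn_marked:
            return window / 2.0      # Multiplicative Decrease: W -> W/2
        return window + 1.0          # Additive Increase: W -> W+1 per RTT

    # Example: the window halves on any mark, regardless of how severe
    # the congestion actually is -- the single ECN bit carries no more info.
    w = 10.0
    for marked in [False, False, True, False]:
        w = tcp_aimd_update(w, marked)
        print(w)   # 11.0, 12.0, 6.0, 7.0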
TCP Buffer Requirement
– Bandwidth-delay product rule of thumb: a single flow needs B ≥ C×RTT of buffering for 100% throughput
– With B < C×RTT, throughput drops below 100%
Reducing Buffer Requirements
– Appenzeller et al. (SIGCOMM ’04): with a large # of flows, C×RTT/√N is enough
Reducing Buffer Requirements
– Appenzeller et al. (SIGCOMM ’04): with a large # of flows, C×RTT/√N is enough
– Can’t rely on this stat-mux benefit in the DC: measurements show typically only 1–2 large flows at each server
– Key observation: low variance in sending rate → small buffers suffice (see the worked numbers below)
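For a sense of scale, a quick back-of-the-envelope calculation of the two buffer-sizing rules; the 10Gbps link speed, 100μs RTT, and flow count are illustrative assumptions, not values from the slides.

    # Illustrative buffer-sizing arithmetic for the two rules above.
    import math

    C = 10e9 / 8          # 10 Gbps link, in bytes/sec (assumed)
    RTT = 100e-6          # 100 microsecond round-trip time (assumed)
    N = 10000             # many concurrent long flows (Internet core case)

    bdp = C * RTT                      # rule of thumb: C x RTT
    stat_mux = bdp / math.sqrt(N)      # Appenzeller et al.: C x RTT / sqrt(N)

    print(f"C x RTT           = {bdp / 1e3:.0f} KB")       # 125 KB
    print(f"C x RTT / sqrt(N) = {stat_mux / 1e3:.2f} KB")  # 1.25 KB

With only 1–2 large flows per server, as in the DC measurements above, the √N term buys essentially nothing, which is the slide’s point.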
DCTCP: Main Idea
– Extract multi-bit feedback from the single-bit stream of ECN marks: reduce window size based on the fraction of marked packets
– Stream with most packets marked: TCP cuts window by 50%; DCTCP cuts by 40%
– Stream with only one packet marked: TCP cuts window by 50%; DCTCP cuts by 5%
[Figure: window size (bytes) vs. time (sec) for TCP and DCTCP]
DCTCP: Algorithm
Switch side:
– Mark packets when queue length > K (mark above K, don’t mark below; buffer size B)
Sender side:
– Maintain a running average of the fraction of packets marked: α ← (1 − g)·α + g·F, where F is the fraction marked in the last window
– Adaptive window decrease: W ← (1 − α/2)·W
– Note: the decrease factor is between 1 (α ≈ 0) and 2 (α = 1); a sketch of this logic appears below
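A minimal sketch of the sender-side logic in Python, assuming feedback is tallied once per window of data; g = 1/16 follows the value used in the DCTCP paper’s experiments, and the bookkeeping is simplified for illustration.

    # A minimal sketch of DCTCP's sender-side logic as described above.

    class DctcpSender:
        def __init__(self, cwnd: float = 10.0, g: float = 1.0 / 16):
            self.cwnd = cwnd
            self.alpha = 0.0   # running estimate of fraction of marked packets
            self.g = g         # EWMA gain (1/16 in the paper's experiments)

        def on_window_of_acks(self, acked: int, marked: int) -> None:
            """Called once per window of data: update alpha, then cwnd."""
            F = marked / acked if acked else 0.0
            # alpha <- (1 - g) * alpha + g * F
            self.alpha = (1 - self.g) * self.alpha + self.g * F
            if marked > 0:
                # Adaptive decrease: W <- W * (1 - alpha/2).
                # alpha near 0 => mild cut; alpha = 1 => halve, like TCP.
                self.cwnd *= 1 - self.alpha / 2
            else:
                self.cwnd += 1   # same additive increase as TCP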
DCTCP vs TCP
– Experiment: 2 flows (Win 7 stack), Broadcom 1Gbps switch, ECN marking threshold = 30KB
– [Figure: queue length (KBytes) over time]
– With DCTCP the buffer is mostly empty; DCTCP mitigates incast by creating a large buffer headroom
Why it Works
1. Low Latency: small buffer occupancies → low queuing delay
2. High Throughput: ECN averaging → smooth rate adjustments, low variance
3. High Burst Tolerance: large buffer headroom → bursts fit; aggressive marking → sources react before packets are dropped
DCTCP Deployments
Discussion
What You Said
Austin: “The paper's performance comparison to RED seems arbitrary, perhaps RED had traction at the time? Or just convenient as the switches were capable of implementing it?”
Evaluation
– Implemented in Windows stack
– Real hardware, 1Gbps and 10Gbps experiments on a 90-server testbed:
  – Broadcom Triumph: 48 1G ports, 4MB shared memory
  – Cisco CAT4948: 48 1G ports, 16MB shared memory
  – Broadcom Scorpion: 24 10G ports, 4MB shared memory
– Numerous micro-benchmarks: throughput and queue length, multi-hop, queue buildup, buffer pressure, fairness and convergence, incast, static vs dynamic buffer mgmt
– Bing cluster benchmark
Bing Benchmark (baseline)
[Figure: completion times for Background Flows and Query Flows]
Bing Benchmark (scaled 10x)
– [Figure: completion time (ms) for Query Traffic (incast bursts) and Short Messages (delay-sensitive)]
– Deep buffers fix incast, but increase latency
– DCTCP is good for both incast & latency
What You Said
Amy: “I find it unsatisfying that the details of many congestion control protocols (such at these) are so complicated!... can we create a parameter-less congestion control protocol that is similar in behavior to DCTCP or TIMELY?”
Hongzi: “Is there a general guideline to tune the parameters, like alpha, beta, delta, N, T_low, T_high, in the system?”
A bit of Analysis
– How much buffering does DCTCP need for 100% throughput?
– Need to quantify queue size oscillations (stability)
– [Figure: window size sawtooth between (W*+1)(1−α/2) and W*+1; packets sent in the last RTT before the peak are marked; switch buffer B with marking threshold K]
A bit of Analysis
– How small can queues be without loss of throughput?
– Result: K > (1/7)·C×RTT suffices for DCTCP; for TCP, K > C×RTT (sketched below)
– What assumptions does the model make?
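For reference, a condensed version of the steady-state sawtooth analysis from the DCTCP paper, assuming N synchronized long-lived flows over a link of capacity C; the algebra here is a sketch, not a substitute for the paper’s derivation.

    % Steady-state sawtooth analysis (sketch), following the DCTCP paper:
    % N synchronized long-lived flows, capacity C, round-trip time RTT,
    % marking threshold K; W* is the window size at which marking begins.
    \begin{align*}
      W^* &= \frac{C \cdot RTT + K}{N}
        && \text{(queue first reaches $K$)} \\
      \alpha^2\!\left(1 - \frac{\alpha}{4}\right)
        &= \frac{2W^* + 1}{(W^* + 1)^2} \approx \frac{2}{W^*}
        \;\Rightarrow\; \alpha \approx \sqrt{2/W^*} \\
      A &= N \cdot \frac{(W^* + 1)\,\alpha}{2}
        \approx \frac{1}{2}\sqrt{2N\,(C \cdot RTT + K)}
        && \text{(queue oscillation amplitude)}
    \end{align*}

Requiring the queue to stay above zero at the bottom of this sawtooth is what yields the K > (1/7)·C×RTT bound quoted above.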
What You Said
Anurag: “In both the papers, one of the difference I saw from TCP was that these protocols don’t have the “slow start” phase, where the rate grows exponentially starting from 1 packet/RTT.”
Convergence Time
– DCTCP takes at most ~40% more RTTs than TCP (“Analysis of DCTCP: Stability, Convergence, and Fairness,” SIGMETRICS 2011)
– Intuition: DCTCP makes smaller adjustments than TCP, but makes them much more frequently
[Figure: rate convergence of TCP vs. DCTCP]
TIMELY Slides by Radhika Mittal (Berkeley)
Qualities of RTT
– Fine-grained and informative
– Quick response time
– No switch support needed
– End-to-end metric
– Works seamlessly with QoS
RTT correlates with queuing delay
What You Said
Ravi: “The first thing that struck me while reading these papers was how different their approaches were. DCTCP even states that delay-based protocols are "susceptible to noise in the very low latency environment of data centers" and that "the accurate measurement of such small increases in queuing delay is a daunting task". Then, I noticed that there is a 5 year gap between these two papers… “
Arman: “They had to resort to extraordinary measures to ensure that the timestamps accurately reflect the time at which a packet was put on wire…”
Accurate RTT Measurement
Hardware-Assisted RTT Measurement
– Hardware timestamps: mitigate noise in measurements
– Hardware acknowledgements: avoid processing overhead
Hardware vs Software Timestamps
– Kernel timestamps introduce significant noise in RTT measurements compared to HW timestamps
Impact of RTT Noise
– Throughput degrades with increasing noise in RTT
– Precise RTT measurement is crucial
TIMELY Framework
Overview
Timestamps → RTT Measurement Engine → RTT → Rate Computation Engine → Rate → Pacing Engine → Paced Data
RTT Measurement Engine
– RTT = t_completion − t_send − Serialization Delay
– t_send: when the segment leaves the sender; t_completion: when the HW ack returns from the receiver
– Subtracting the serialization delay isolates propagation & queuing delay (see the sketch below)
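A minimal sketch of this computation in Python; the function name, field layout, and 10Gbps line rate are assumptions for illustration.

    # A minimal sketch of the RTT computation above; the 10 Gbps line
    # rate is an illustrative assumption.

    LINE_RATE_BPS = 10e9   # NIC line rate, used to compute serialization delay

    def timely_rtt(t_send: float, t_completion: float, seg_bytes: int) -> float:
        """RTT = t_completion - t_send - serialization delay.

        Subtracting the serialization delay leaves only propagation and
        queuing delay, so large segments don't inflate the signal.
        """
        serialization = seg_bytes * 8 / LINE_RATE_BPS
        return t_completion - t_send - serialization

    # Example: a 64KB segment at 10 Gbps has ~52us of serialization delay.
    print(timely_rtt(0.0, 150e-6, 64 * 1024))  # ~97.6e-6 seconds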
Algorithm Overview: Gradient-based Increase / Decrease
– RTT gradient = 0: RTT steady, queue neither building nor draining
– RTT gradient > 0: RTT rising, queue building → decrease rate
– RTT gradient < 0: RTT falling, queue draining → increase rate
– Purpose: navigate the throughput-latency tradeoff and ensure stability
Why Does Gradient Help Stability?
– Feedback on higher-order derivatives: the source observes not only the error, but the change in error
– Lets the source “anticipate” future state (as with the derivative term in a PD controller)
What You Said
Arman: “I also think that deducing the queue length from the gradient model could lead to miscalculations. For example, consider an Incast scenario, where many senders transmit simultaneously through the same path. Noting that every packet will see a long, yet steady, RTT, they will compute a near-zero gradient and hence the congestion will continue.”
Algorithm Overview
– RTT < T_low: Additive Increase, for better burst tolerance
– RTT > T_high: Multiplicative Decrease, to keep tail latency within acceptable limits
– Otherwise: gradient-based increase / decrease, to navigate the throughput-latency tradeoff and ensure stability (a sketch follows below)
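Putting the pieces together, a hedged sketch of TIMELY’s rate computation engine in Python; the thresholds and gains (T_LOW, T_HIGH, DELTA, BETA, ALPHA, MIN_RTT) are illustrative placeholders rather than the paper’s tuned values, and details such as the hyper-active increase (HAI) mode are omitted.

    # A sketch of TIMELY's rate computation engine, per the algorithm above.
    # All constants are illustrative assumptions, not the paper's values.

    T_LOW = 50e-6      # below this RTT: additive increase (burst tolerance)
    T_HIGH = 500e-6    # above this RTT: multiplicative decrease (tail latency)
    DELTA = 10e6       # additive step, in bits/sec
    BETA = 0.8         # multiplicative decrease factor
    ALPHA = 0.1        # EWMA gain for smoothing RTT differences
    MIN_RTT = 20e-6    # used to normalize the gradient

    class TimelyRateComputation:
        def __init__(self, rate: float = 1e9):
            self.rate = rate
            self.prev_rtt = None
            self.rtt_diff = 0.0    # smoothed RTT difference

        def on_new_rtt(self, rtt: float) -> float:
            if self.prev_rtt is None:
                self.prev_rtt = rtt
                return self.rate
            new_diff = rtt - self.prev_rtt
            self.prev_rtt = rtt
            self.rtt_diff = (1 - ALPHA) * self.rtt_diff + ALPHA * new_diff
            gradient = self.rtt_diff / MIN_RTT   # normalized RTT gradient

            if rtt < T_LOW:                      # far from congestion: ramp up
                self.rate += DELTA
            elif rtt > T_HIGH:                   # bound tail latency
                self.rate *= 1 - BETA * (1 - T_HIGH / rtt)
            elif gradient <= 0:                  # queue steady or draining
                self.rate += DELTA
            else:                                # queue building: back off
                self.rate *= 1 - BETA * gradient
            return self.rate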
Discussion
Implementation Set-up
– TIMELY is implemented in the context of RDMA: RDMA write and read primitives are used to invoke NIC services
– Priority Flow Control (PFC) is enabled in the network fabric: the RDMA transport in the NIC is sensitive to packet drops, and PFC sends out pause frames to ensure a lossless network
“Congestion Spreading” in Lossless Networks
– [Figure: PFC PAUSE frames propagating back through the fabric, back-pressuring upstream switches and spreading congestion]
TIMELY vs PFC
What You Said
Amy: “I was surprised to see that TIMELY performed so much better than DCTCP. Did the lack of an OS-bypass for DCTCP impact performance? I wish that the authors had offered an explanation for this result.”
Next time: Load Balancing