Packet Transport Mechanisms for Data Center Networks
Mohammad Alizadeh, NetSeminar (April 12, 2012), Stanford University

Data Centers
Huge investments: R&D, business – upwards of $250 million for a mega data center.
Most global IP traffic originates or terminates in data centers. In 2011 (Cisco Global Cloud Index): ~315 exabytes in WANs, ~1,500 exabytes in data centers.

This talk is about packet transport inside the data center.

[Diagram: servers connected through the data center fabric to the Internet]

[Diagram: servers connected through the fabric to the Internet. Labels: Layer 3: TCP; Layer 3: DCTCP, Layer 2: QCN]

TCP in the Data Center
TCP is widely used in the data center (99.9% of traffic). But TCP does not meet the demands of applications:
– Requires large queues for high throughput, which adds significant latency due to queuing delays and wastes costly buffers (especially bad with shallow-buffered switches).
Operators work around TCP problems:
– Ad-hoc, inefficient, often expensive solutions
– No solid understanding of consequences, tradeoffs

Roadmap: Reducing Queuing Latency
Baseline fabric latency (propagation + switching): 10–100μs.
TCP: ~1–10ms → DCTCP & QCN: ~100μs → HULL: ~zero latency.

Data Center TCP
with Albert Greenberg, Dave Maltz, Jitu Padhye, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan (SIGCOMM 2010)

Case Study: Microsoft Bing
A systematic study of transport in Microsoft’s DCs – identify impairments, identify requirements.
Measurements from a 6,000-server production cluster; more than 150TB of compressed data over a month.

Search: A Partition/Aggregate Application
[Diagram: a top-level aggregator (TLA) partitions a query (“Picasso”) across mid-level aggregators (MLAs) and worker nodes; the workers’ answers (Picasso quotes) are aggregated back up the tree]
Strict deadlines (SLAs) at each level, e.g., 250ms, 50ms, 10ms. A missed deadline → lower quality result.

Incast
[Diagram: Workers 1–4 respond to the aggregator simultaneously; a lost response suffers a TCP timeout with RTO_min = 300 ms]
Synchronized fan-in congestion, caused by Partition/Aggregate. See Vasudevan et al. (SIGCOMM ’09).

Incast in Bing
[Plot: MLA query completion time (ms); requests are jittered over a 10ms window, and jittering is switched off around 8:30 am]
Jittering trades off the median against the high percentiles.

Data Center Workloads & Requirements
– Partition/Aggregate (query) → high burst tolerance
– Short messages, 50KB–1MB (coordination, control state) → low latency
– Large flows, 1MB–100MB (data update) → high throughput
The challenge is to achieve these three together.

Tension Between Requirements
– Deep buffers: queuing delays increase latency.
– Shallow buffers: bad for bursts & throughput.
We need: low queue occupancy & high throughput.

TCP Buffer Requirement
Bandwidth-delay product rule of thumb: a single flow needs C×RTT of buffering for 100% throughput.
[Plots: throughput vs. buffer size B; throughput is 100% when B ≥ C×RTT and falls below 100% when B < C×RTT]
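
As a rough illustration with assumed numbers (a 10Gbps link and a 100μs RTT, in line with the fabric latencies quoted earlier):

```latex
B = C \times RTT = 10\,\mathrm{Gb/s} \times 100\,\mu\mathrm{s}
  = 10^{10}\,\mathrm{b/s} \times 10^{-4}\,\mathrm{s}
  = 10^{6}\,\mathrm{bits} \approx 125\,\mathrm{KB}\ \text{of buffering per flow.}
```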

Reducing Buffer Requirements
Appenzeller et al. (SIGCOMM ’04): with a large number N of flows, a buffer of C×RTT/√N is enough.
[Plots: window size (rate) and buffer size over time; throughput remains at 100%]

Reducing Buffer Requirements
Appenzeller et al. (SIGCOMM ’04): with a large number N of flows, C×RTT/√N is enough.
But we can’t rely on this statistical-multiplexing benefit in the DC – measurements show typically only 1–2 large flows at each server.
Key observation: low variance in sending rates → small buffers suffice.
Both QCN & DCTCP reduce variance in sending rates:
– QCN: explicit multi-bit feedback and “averaging”
– DCTCP: implicit multi-bit feedback from ECN marks

DCTCP: Main Idea
How can we extract multi-bit feedback from the single-bit stream of ECN marks?
– Reduce the window size based on the fraction of marked packets.

ECN marks                 TCP                 DCTCP
Heavily marked stream     Cut window by 50%   Cut window by 40%
Sparsely marked stream    Cut window by 50%   Cut window by 5%
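
For intuition, using the α/2 cut rule spelled out on the next slide (and assuming α has converged to the marked fraction):

```latex
\text{window cut} = \frac{\alpha}{2}:\qquad
\alpha \approx 0.8 \Rightarrow 40\%\ \text{cut},\qquad
\alpha \approx 0.1 \Rightarrow 5\%\ \text{cut},
```

whereas TCP cuts by 50% in both cases.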

DCTCP: Algorithm
Switch side: mark packets when queue length > K. [Figure: a queue of size B with marking threshold K; mark above K, don’t mark below]
Sender side:
– Maintain a running average of the fraction of packets marked: α ← (1 − g)α + g·F, where F is the fraction of packets marked in the last window.
– Adaptive window decrease: W ← W(1 − α/2).
– Note: the decrease factor is between 1 and 2.
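
A minimal Python sketch of the sender-side logic above (the per-window update structure and the names are illustrative assumptions, not the Windows implementation):

```python
# Minimal sketch of DCTCP's sender-side congestion control.
# Assumes one update per window of data (roughly one per RTT); a real
# stack tracks ECN-echo flags per ACK and also handles slow start,
# retransmission timeouts, etc.

class DctcpSender:
    def __init__(self, init_cwnd=10.0, g=1.0 / 16):
        self.cwnd = init_cwnd   # congestion window, in packets
        self.alpha = 0.0        # running estimate of the marked fraction
        self.g = g              # EWMA gain (the paper suggests g = 1/16)

    def on_window_acked(self, acked_pkts, marked_pkts):
        """Update alpha and the window once per window of ACKed data."""
        frac_marked = marked_pkts / max(acked_pkts, 1)
        # Running average of the fraction of marked packets.
        self.alpha = (1 - self.g) * self.alpha + self.g * frac_marked
        if marked_pkts > 0:
            # Adaptive decrease: cut by alpha/2 (between 0% and 50%).
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            # Additive increase when no congestion was signaled.
            self.cwnd += 1.0
        return self.cwnd

# Example: a lightly marked window (1 of 10 packets) barely shrinks cwnd,
# while a heavily marked window causes a cut approaching 50%.
sender = DctcpSender()
print(sender.on_window_acked(acked_pkts=10, marked_pkts=1))
```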

DCTCP vs TCP
Setup: Windows 7, Broadcom 1Gbps switch. Scenario: 2 long-lived flows, ECN marking threshold = 30KB.
[Plot: instantaneous queue length (KBytes) over time for DCTCP vs TCP]

Evaluation
Implemented in the Windows stack. Real hardware, 1Gbps and 10Gbps experiments:
– 90-server testbed
– Broadcom Triumph: 48 1G ports, 4MB shared memory
– Cisco Cat4948: 48 1G ports, 16MB shared memory
– Broadcom Scorpion: 24 10G ports, 4MB shared memory
Numerous micro-benchmarks: throughput and queue length, multi-hop, queue buildup, buffer pressure, fairness and convergence, incast, static vs. dynamic buffer management.
Bing cluster benchmark.

Bing Benchmark
[Plots: completion time (ms) for query traffic (bursty, incast-prone) and for short messages (delay-sensitive)]
Deep buffers fix incast but make latency worse; DCTCP is good for both incast & latency.

Analysis of DCTCP
with Adel Javanmard, Balaji Prabhakar (SIGMETRICS 2011)

DCTCP Fluid Model
[Block diagram: a source with AIMD window W(t) and a low-pass filter (LPF) producing α(t); the aggregate rate N·W(t)/RTT(t) feeds the switch queue q(t) served at capacity C; the switch generates marks p(t) (1 if q(t) > K, else 0), which reach the source after a delay R*]
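
The equations this diagram represents, reconstructed here from the published DCTCP fluid-model analysis (notation as in the diagram; d is the propagation delay):

```latex
\frac{dW}{dt} = \frac{1}{R(t)} - \frac{W(t)\,\alpha(t)}{2\,R(t)}\,p(t - R^*), \qquad
\frac{d\alpha}{dt} = \frac{g}{R(t)}\bigl[p(t - R^*) - \alpha(t)\bigr], \qquad
\frac{dq}{dt} = N\,\frac{W(t)}{R(t)} - C,
\qquad\text{with}\quad
p(t) = \mathbf{1}\{q(t) > K\}, \quad R(t) = d + \frac{q(t)}{C}.
```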

Fluid Model vs ns2 Simulations
Parameters: N = {2, 10, 100}, C = 10Gbps, d = 100μs, K = 65 pkts, g = 1/16.
[Plots: fluid model vs. ns2 queue dynamics for N = 2, N = 10, and N = 100]

Normalization of Fluid Model
We make a change of variables; the resulting normalized system depends on only two parameters.

Equilibrium Behavior: Limit Cycles
The system has a periodic limit cycle solution.
[Example: plot of the periodic limit cycle]

Stability of Limit Cycles
Let X* = the set of points on the limit cycle, and define the distance from a state to X*. The limit cycle is locally asymptotically stable if there exists δ > 0 such that every trajectory starting within distance δ of X* converges to X*.
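
In symbols (a standard formulation, filling in the definitions elided on the slide):

```latex
d(x, X^*) := \inf_{y \in X^*} \lVert x - y \rVert, \qquad
X^*\ \text{is locally asymptotically stable} \iff
\exists\, \delta > 0:\; d\bigl(x(0), X^*\bigr) < \delta \;\Rightarrow\; \lim_{t \to \infty} d\bigl(x(t), X^*\bigr) = 0.
```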

Poincaré Map
[Figure: successive crossings x1, x2 of a transversal section, with x2 = P(x1); the limit cycle corresponds to a fixed point x*_α = P(x*_α)]
Stability of the Poincaré map ↔ stability of the limit cycle.

Stability Criterion
Theorem: The limit cycle of the DCTCP system is locally asymptotically stable if and only if ρ(Z1·Z2) < 1.
– J_F is the Jacobian matrix with respect to x.
– T = (1 + h_α) + (1 + h_β) is the period of the limit cycle.
Proof idea: show that P(x*_α + δ) = x*_α + Z1·Z2·δ + O(|δ|²).
We have numerically checked this condition.

Parameter Guidelines
How big does the marking threshold K need to be to avoid queue underflow?
[Figure: queue length oscillating between the marking threshold K and the buffer size B]
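
For reference, the rule of thumb from the DCTCP SIGCOMM 2010 paper (the refined constant from the analysis presented here may differ slightly):

```latex
K > \frac{C \times RTT}{7}
```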

HULL: Ultra Low Latency
with Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, Masato Yasuda (to appear in NSDI 2012)

What do we want? ~Zero latency (TCP: ~1–10ms, DCTCP: ~100μs). How do we get this?
[Figure: incoming traffic vs. link capacity C; TCP requires a large queue, DCTCP keeps the queue near the marking threshold K]

Phantom Queue
Key idea: associate congestion with link utilization, not buffer occupancy – a virtual queue (Gibbens & Kelly 1999; Kunniyur & Srikant 2001).
[Figure: a “bump on the wire” after the switch: a phantom queue that drains at γC (link speed C) and marks packets above its marking threshold]
γ < 1 creates “bandwidth headroom”.
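
A minimal Python sketch of a phantom queue in this spirit: a counter drained at γC that ECN-marks packets once it exceeds a threshold (names, units, and default values are illustrative assumptions):

```python
# Phantom (virtual) queue sketch: no packets are actually buffered here;
# a counter simply tracks how far arrivals outrun a virtual drain rate
# of gamma * C, and packets are marked when the counter exceeds a threshold.

class PhantomQueue:
    def __init__(self, link_rate_bps, gamma=0.95, mark_thresh_bytes=6000):
        self.drain_rate = gamma * link_rate_bps / 8.0  # bytes/sec at gamma*C
        self.thresh = mark_thresh_bytes
        self.backlog = 0.0       # virtual backlog in bytes
        self.last_time = 0.0     # time of the previous update (seconds)

    def on_packet(self, now, pkt_bytes):
        """Account for one departing packet; return True if it should be marked."""
        # Drain the virtual queue for the time elapsed since the last packet.
        elapsed = now - self.last_time
        self.backlog = max(0.0, self.backlog - self.drain_rate * elapsed)
        self.last_time = now
        # Charge the packet against the slower virtual link.
        self.backlog += pkt_bytes
        return self.backlog > self.thresh
```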

Throughput & Latency vs. PQ Drain Rate
[Plots: throughput and mean switch latency as a function of the phantom-queue drain rate]

The Need for Pacing
TCP traffic is very bursty – made worse by CPU-offload optimizations like Large Send Offload and interrupt coalescing – which causes spikes in queuing and increases latency.
[Example trace: a 1Gbps flow on a 10G NIC sends 65KB bursts every 0.5ms]
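
A minimal Python sketch of a pacer that spreads such bursts out at a target rate (in HULL this is done in NIC hardware after segmentation; this software model is only illustrative):

```python
# Simple pacer sketch: packets are queued and released no faster than a
# configured rate, so a 65KB burst leaves as evenly spaced packets.

from collections import deque

class Pacer:
    def __init__(self, rate_bps):
        self.rate = rate_bps / 8.0   # bytes per second
        self.queue = deque()
        self.next_release = 0.0      # earliest time the next packet may leave

    def enqueue(self, pkt_bytes):
        self.queue.append(pkt_bytes)

    def dequeue(self, now):
        """Release at most one packet if its paced departure time has arrived."""
        if not self.queue or now < self.next_release:
            return None
        pkt = self.queue.popleft()
        # Space departures so the long-run rate stays at self.rate.
        self.next_release = now + pkt / self.rate
        return pkt
```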

Throughput & Latency vs. PQ Drain Rate (with Pacing)
[Plots: throughput and mean switch latency vs. phantom-queue drain rate, with pacing enabled]

The HULL Architecture
Phantom queues + hardware pacers + DCTCP congestion control.

More Details…
[Diagram: host (application → DCTCP congestion control → LSO → NIC pacer) sends into the switch, which keeps a nearly empty queue; a phantom queue on the link (speed C) marks with ECN at threshold γ×C; large flows pass through the pacer, small flows bypass it]
Hardware pacing is applied after segmentation in the NIC. Mice flows skip the pacer and are not delayed.

Dynamic Flow Experiment (20% load)
Senders → 1 receiver; 80% 1KB flows, 20% 10MB flows.
[Table: switch latency (μs, avg / 99th) and 10MB FCT (ms, avg / 99th) for TCP, DCTCP-30K, and DCTCP-PQ950-Pacer]
The DCTCP-PQ950-Pacer configuration shows roughly a 93% decrease in switch latency, at the cost of roughly a 17% increase in 10MB flow completion time.

Slowdown Due to Bandwidth Headroom
Processor-sharing model for elephants: on a link of capacity 1 with total load ρ, a flow of size x takes x/(1 − ρ) on average to complete.
Example (ρ = 40%): slowdown = 50%, not 20%.
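
Working through the example, assuming the bandwidth headroom corresponds to running the phantom queue at γ = 0.8 (an assumed value matching a 20% headroom):

```latex
T_{\text{full}}(x) = \frac{x}{1 - \rho}, \qquad
T_{\gamma}(x) = \frac{x/\gamma}{1 - \rho/\gamma} = \frac{x}{\gamma - \rho},
\qquad
\frac{T_{\gamma}}{T_{\text{full}}} = \frac{1 - \rho}{\gamma - \rho} = \frac{0.6}{0.4} = 1.5,
```

so elephants slow down by 50%, not by the 20% one might guess from giving up 20% of the bandwidth.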

Slowdown: Theory vs Experiment
[Plots: measured vs. predicted slowdown for DCTCP-PQ800, DCTCP-PQ900, and DCTCP-PQ950]

Summary
– QCN: IEEE 802.1Qau standard for congestion control in Ethernet
– DCTCP: will ship with Windows 8 Server
– HULL: combines DCTCP, phantom queues, and hardware pacing to achieve ultra-low latency

Thank you!