Data Center TCP (DCTCP)
Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan
Microsoft Research / Stanford University
Presented by Yi-Zong (Andrew) Ou. Credits: slides modified from the original presentation.

Data Center Packet Transport
Data centers are vast, representing huge investments in R&D and business.
For packet transport inside the DC, TCP rules (99.9% of traffic).
How does TCP do in the DC?

TCP in the Data Center
TCP does not meet demands in the DC. It suffers from bursty packet drops (incast [SIGCOMM '09]) and builds up large queues, which adds significant latency and wastes precious buffers. This is especially bad with shallow-buffered switches, and deep-buffered switches are expensive.
Current solutions to these TCP problems are ad hoc, inefficient, and often expensive, with no solid understanding of their consequences and tradeoffs.

Case Study: Microsoft Bing
Initial experimental measurements from a 6000-server production cluster.
Instrumentation passively collects application-level, socket-level, and selected packet-level logs: more than 150 TB of compressed data over a month.
Note: this production environment differs from the testbed.

Partition/Aggregate Design Pattern
Used in many large-scale web applications: web search, social-network composition, ad selection.
Example: Facebook's multiget is a Partition/Aggregate workload, with web servers acting as aggregators and memcached servers as workers.

Partition/Aggregate Application Structure
[Figure: a top-level aggregator (TLA) fans a query out to mid-level aggregators (MLAs), which fan out to worker nodes; the example query assembles Picasso quotations from the workers.]
Time is money: strict deadlines (SLAs) at each level, e.g., 250 ms end-to-end, 50 ms per MLA, 10 ms per worker. A missed deadline means a lower-quality result.

Workloads in the DC and their metrics
Partition/Aggregate queries [2KB-20KB]: delay-sensitive.
Short messages [50KB-1MB] (coordination, control-state updates): delay-sensitive.
Large flows [1MB-50MB] (internal data updates): throughput-sensitive.

Switch Design and Performance Impairments
Three impairments tied to switch design: incast, queue buildup, and buffer pressure.

Switches
Shared-memory switches gain performance by sharing a logically common packet buffer: memory from the shared pool is dynamically allocated to arriving packets by an MMU. When an interface reaches its memory limit, newly arriving packets are dropped.
Question: could we just add more memory? There is a tradeoff between memory size and cost: deep-buffered switches are much more expensive, so shallow-buffered switches are more common. Shallow buffers give rise to the three impairments: incast, queue buildup, and buffer pressure.
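To make the shared-buffer behavior concrete, here is a minimal sketch of a shared-memory pool (class and method names are our own simplification; real MMUs use dynamic per-port thresholds rather than a fixed cap):

```python
# Minimal sketch, not a vendor's actual MMU logic: ports draw packet buffers
# from a common pool and drop when the pool or their per-port cap is exhausted.
class SharedBufferSwitch:
    def __init__(self, pool_bytes, per_port_limit):
        self.free = pool_bytes            # shared memory remaining
        self.per_port_limit = per_port_limit
        self.port_usage = {}              # bytes queued per output port

    def enqueue(self, port, pkt_bytes):
        used = self.port_usage.get(port, 0)
        # Drop if the shared pool is exhausted or this port hit its cap.
        if pkt_bytes > self.free or used + pkt_bytes > self.per_port_limit:
            return False                  # packet dropped
        self.free -= pkt_bytes
        self.port_usage[port] = used + pkt_bytes
        return True

    def dequeue(self, port, pkt_bytes):
        self.port_usage[port] -= pkt_bytes
        self.free += pkt_bytes
```

Note how one persistently full port shrinks `self.free` for everyone: that shared-pool coupling is exactly the buffer-pressure impairment described later.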

Incast
[Figure: an aggregator requests data from Workers 1-4; their synchronized responses overflow the switch port toward the aggregator, causing a TCP timeout with RTOmin = 300 ms.]
Incast is due to the synchronization created by the Partition/Aggregate pattern.

Incast: possible solutions
What are possible solutions to the incast problem? Jittering responses by a random delay, or batching responses in small groups.
Pros: avoids the incast problem. Cons: increases the median response time.
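As an illustration of the jittering option, a minimal sketch (the send_response helper is hypothetical; the 10 ms window matches the Bing deployment on the next slide):

```python
import random
import time

# Illustrative sketch of response jittering: each worker delays its reply by a
# random offset inside a small window, de-synchronizing the burst that would
# otherwise slam the aggregator's switch port all at once.
JITTER_WINDOW_S = 0.010  # 10 ms window, as in the production trace below

def reply_with_jitter(response, send_response):
    time.sleep(random.uniform(0, JITTER_WINDOW_S))  # trades median latency...
    send_response(response)                          # ...for fewer incast drops
```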

Incast in the production environment
Requests are jittered over a 10 ms window; in the trace shown, jittering was switched off around 8:30 am.
[Figure: MLA query completion time (ms) over the day.]
Jittering trades off the median against the high percentiles.

Queue Buildup
[Figure: two senders share a switch port toward one receiver.]
Big flows build up queues, which increases latency for short flows. Measurements in the Bing cluster: for 90% of packets, RTT < 1 ms; for 10% of packets, 1 ms < RTT < 15 ms.

Buffer Pressure
Long, greedy TCP flows build up queues on their interfaces. In a shared-memory switch, that buildup reduces the buffer available to absorb burst traffic from the Partition/Aggregate pattern on other ports. Result: packet loss and timeouts.

DCTCP Objectives
Solve incast, queue buildup, and buffer pressure.
Objectives: high burst tolerance, low latency for small flows, and high throughput for large flows, all with commodity hardware.

The DCTCP Algorithm

ECN: Explicit Congestion Notification
ECN allows end-to-end notification of network congestion without dropping packets, achieved by setting a mark in the IP header to indicate congestion.
ECN-Echo (ECE): the receiver signals the sender to reduce the amount of information it sends.
Congestion Window Reduced (CWR): the sender acknowledges that the congestion-indication echo was received.
ECN is already available in commodity hardware, and DCTCP relies on it.
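For reference, the bits involved, as defined by RFC 3168 (the codepoint and flag values are standard; the helper function is our own):

```python
# IP header: the two low-order bits of the former TOS byte carry ECN state.
IP_ECN_NOT_ECT = 0b00   # sender is not ECN-capable
IP_ECN_ECT_1   = 0b01   # ECN-Capable Transport (1)
IP_ECN_ECT_0   = 0b10   # ECN-Capable Transport (0)
IP_ECN_CE      = 0b11   # Congestion Experienced: set by the switch

# TCP header flags byte: the two ECN-related flags.
TCP_FLAG_ECE = 0x40     # ECN-Echo: receiver -> sender
TCP_FLAG_CWR = 0x80     # Congestion Window Reduced: sender -> receiver

def ip_is_congestion_marked(tos_byte: int) -> bool:
    """True if the switch set the CE codepoint on this packet."""
    return (tos_byte & 0b11) == IP_ECN_CE
```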

DCTCP Overview
DCTCP achieves its goals by reacting in proportion to the extent of congestion, not merely its presence.
Switch side: a simple marking scheme sets the Congestion Experienced (CE) codepoint as soon as queue occupancy exceeds a fixed, small threshold.
Source side: the sender reduces its window by a factor that depends on the fraction of marked packets; the larger the fraction, the larger the decrease.

Data Center TCP Algorithm
Switch side: mark packets (set CE, Congestion Experienced) when the queue length exceeds the threshold K; don't mark below K. (K sits well below the total buffer size B.)
Receiver side: in standard TCP, a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation (the CWR flag) from the sender that the congestion notification was received. In DCTCP, the receiver instead conveys the exact sequence of marked packets back to the sender.
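A hedged sketch of the DCTCP receiver's ACK generation under delayed ACKs, following the two-state machine described in the paper (class names and the callback interface are ours):

```python
# Invariant: every ACK carries ECE iff the packets it covers arrived CE-marked,
# so the sender can count marked packets exactly despite delayed ACKs.
class DctcpReceiver:
    DELAYED_ACK_EVERY = 2          # m: ACK once per m packets in steady state

    def __init__(self, send_ack):
        self.send_ack = send_ack   # callback: send_ack(ece: bool)
        self.ce_state = False      # was the last packet CE-marked?
        self.unacked = 0           # packets received since the last ACK

    def on_packet(self, ce_marked: bool):
        if ce_marked != self.ce_state:
            # CE state changed: immediately ACK the packets seen so far,
            # echoing the *old* state, then flip state.
            if self.unacked:
                self.send_ack(ece=self.ce_state)
            self.ce_state = ce_marked
            self.unacked = 0
        self.unacked += 1
        if self.unacked >= self.DELAYED_ACK_EVERY:
            self.send_ack(ece=self.ce_state)
            self.unacked = 0
```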

DCTCP Algorithm
Sender side: maintain a running average of the fraction of packets marked, α. With F the fraction of packets marked in the most recent RTT and g a small gain:
α ← (1 − g)·α + g·F
The window then decreases adaptively each RTT in which marks were received:
cwnd ← cwnd × (1 − α/2)
Note the decrease factor lies between 1 and 2: when highly congested (α = 1), cwnd is cut in half, just as in TCP; under light congestion (small α), cwnd decreases only a little.
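A minimal sketch of the sender-side update (g = 1/16 is the gain suggested in the paper; the class interface is our own):

```python
G = 1.0 / 16  # EWMA gain g from the paper

class DctcpSender:
    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.alpha = 0.0

    def on_rtt_end(self, acked_pkts, marked_pkts):
        # F: fraction of packets marked in this RTT (window of data).
        frac = marked_pkts / acked_pkts if acked_pkts else 0.0
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - G) * self.alpha + G * frac
        if marked_pkts:
            # Reduce in proportion to congestion: cwnd <- cwnd * (1 - alpha/2).
            # alpha = 1 recovers TCP's halving; small alpha barely cuts cwnd.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
```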

Does DCTCP solve the problems?
1. Queue buildup: DCTCP senders start reacting as soon as the queue length exceeds K, reducing queuing delay; the buffer headroom that remains can absorb micro-bursts.
2. Buffer pressure: a congested port's queue does not grow excessively large, so a few hot ports cannot consume all the shared memory.
3. Incast: marking early and aggressively on the instantaneous queue length makes sources react quickly and shrink follow-up bursts, preventing buffer overflow and timeouts.

Evaluation
Implemented in the Windows stack; real hardware, 1 Gbps and 10 Gbps experiments on a 90-server testbed.
Switches: Broadcom Triumph (48 1G ports, 4 MB shared memory, shallow), Cisco Catalyst 4948 (48 1G ports, 16 MB shared memory, deep), Broadcom Scorpion (24 10G ports, 4 MB shared memory, shallow).
Numerous micro-benchmarks: throughput and queue length, multi-hop, queue buildup, buffer pressure, fairness and convergence, incast, static vs. dynamic buffer management.
Plus a cluster traffic benchmark.

Cluster Traffic Benchmark
Emulate traffic within one rack of a Bing cluster: 45 1G servers, plus a 10G server for external traffic. Generate query and background traffic with flow sizes and arrival times following distributions seen in Bing. Metric: flow completion time for queries and background flows.

Baseline
[Figures: flow completion times for background flows and query flows.]
DCTCP achieves low latency for short flows, high throughput for long flows, and high burst tolerance for query flows.

Scaled Background & Query
[Figure: high-percentile completion times with 10x background and 10x query traffic.]

Conclusions
DCTCP satisfies all requirements for data center packet transport: it handles bursts well, keeps queuing delays low, and achieves high throughput.
Features: a very simple change to TCP plus a single switch parameter, built on a mechanism (ECN) already available in commodity hardware, so no hardware replacement is needed.

Thank you

Data Center Transport Requirements
1. High burst tolerance: incast due to Partition/Aggregate is common.
2. Low latency: short flows, queries.
3. High throughput: continuous data updates, large file transfers.
The challenge is to achieve all three together.

Tension Between Requirements
Deep buffers: queuing delays increase latency.
Shallow buffers: bad for bursts and throughput.
Reduced RTOmin (SIGCOMM '09): doesn't help latency.
AQM (e.g., RED): the average queue is not fast enough for incast.
DCTCP's objective: low queue occupancy with high throughput, giving high burst tolerance, high throughput, and low latency at once.

Analysis
How low can DCTCP maintain queues without losing throughput? How do we set the DCTCP parameters? We need to quantify the queue-size oscillations (stability).
[Figure: the window-size sawtooth over time. The window grows to W* + 1; packets sent in that RTT are marked; the sender then cuts the window to (W* + 1)(1 − α/2), and the cycle repeats.]

Small Queues & TCP Throughput: The Buffer Sizing Story
Bandwidth-delay product rule of thumb: a single flow needs a buffer of B = C × RTT for 100% throughput.
Appenzeller rule of thumb (SIGCOMM '04): with a large number N of flows, B = C × RTT / √N is enough.
But we can't rely on that statistical-multiplexing benefit in the DC: measurements show typically 1-2 big flows at each server, at most 4. A worked example follows below.
Real rule of thumb: low variance in sending rate → small buffers suffice.
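To make the two rules concrete, a small arithmetic sketch (the 10 Gbps capacity and 100 µs RTT are assumed example numbers for a data center, not measurements from the paper):

```python
import math

C_bits_per_s = 10e9   # link capacity (assumed example value)
RTT_s = 100e-6        # round-trip time (assumed example value)
N_flows = 2           # typical number of big flows per DC server

bdp_bytes = C_bits_per_s * RTT_s / 8
print(f"BDP rule (1 flow):   {bdp_bytes / 1e3:.0f} KB")   # ~125 KB
print(f"Appenzeller (N={N_flows}):  {bdp_bytes / math.sqrt(N_flows) / 1e3:.0f} KB")
# With N this small, C*RTT/sqrt(N) barely shrinks the requirement -- which is
# why the slide argues the DC cannot lean on statistical multiplexing.
```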

Analysis
Quantifying the queue-size oscillations answers both questions: how low DCTCP can keep queues without losing throughput, and how to set its parameters. The result: DCTCP needs 85% less buffer than TCP.
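A brief sketch of the steady-state model behind this claim, for N synchronized flows on a link of capacity C with marking threshold K (results as derived in the DCTCP paper; see the paper for the full derivation):

```latex
% Critical window: the per-flow window at which the queue crosses K.
W^{*} = \frac{C \times RTT + K}{N}

% Steady-state marked fraction (approximation for small \alpha):
\alpha \approx \sqrt{2 / W^{*}}

% Amplitude of the queue-size oscillation:
A = \frac{N}{2}\,(W^{*} + 1)\,\alpha
  \;\approx\; \frac{1}{2}\sqrt{2N\,(C \times RTT + K)}

% Marking-threshold guideline so the queue never drains to zero:
K > \frac{C \times RTT}{7}
```

The key point is that the oscillation amplitude grows only as the square root of the bandwidth-delay product, which is what lets DCTCP keep queues, and hence buffers, small.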

DCTCP in Action
Setup: Windows 7 hosts, Broadcom 1 Gbps switch. Scenario: 2 long-lived flows, K = 30 KB.

Review: The TCP/ECN Control Loop
[Figure: two senders send through a switch to a receiver; the switch sets a 1-bit ECN mark, which the receiver echoes back to the senders.]
ECN = Explicit Congestion Notification.

Does DCTCP meet the objectives?
1. High burst tolerance: large buffer headroom means bursts fit; aggressive marking means sources react before packets are dropped.
2. Low latency: small buffer occupancies mean low queuing delay.
3. High throughput: ECN averaging gives smooth rate adjustments with low variance.