Experiments with Data Center Congestion Control Research Wei Bai APNet 2017, Hong Kong
The opinions expressed in this talk do not represent the official positions of HKUST or Microsoft.
Data Center Congestion Control Research: 2009: TCP retransmissions; 2010: DCTCP, ICTCP; 2011: D3, MPTCP; 2012: PDQ, D2TCP, ECN*; 2013: pFabric; 2014: PASE, Fastpass, CP; 2015: DCQCN, TIMELY, PIAS; ...
This talk is about our experience with the PIAS project. Joint work with Li Chen, Kai Chen, Dongsu Han, Chen Tian, and Hao Wang. Published at HotNets 2014, NSDI 2015, and ToN 2017.
Outline PIAS mechanisms Implementation efforts for NSDI submission Efforts after NSDI Takeaway from PIAS experience
Flow Completion Time (FCT) is Key. Data center applications desire low latency for short messages, which drives app performance and user experience. Goal of DCN transport: minimize FCT. Many data center applications, such as web search, machine learning, databases, and caches, desire ultra-low latency for short messages, because the completion times of these messages directly determine the user experience. Therefore, one of the most important design goals for data center transport is to minimize flow completion times, especially for short flows. To address this challenge, there are many flow scheduling proposals. [click]
PIAS Key Idea: PIAS performs Multi-Level Feedback Queue (MLFQ) scheduling to emulate Shortest Job First (SJF). (Figure: K priority queues, from Priority 1 at the top, highest, down to Priority K at the bottom, lowest.) The design rationale of PIAS is to perform multi-level feedback queueing to emulate shortest job first. There are K priority queues in MLFQ; 1 is the highest priority and K is the lowest. During a flow's lifetime, its priority is gradually reduced. [click]
PIAS Key Idea: PIAS performs MLFQ to emulate SJF. (Figure: a flow's packets moving down through Priority 1, Priority 2, ..., Priority K.) For example, consider one flow. Its first packet is assigned to Priority 1. [click] The second packet is demoted to Priority 2. [click] The third packet's priority is reduced further. [click] Eventually, if the flow is large enough, its last packets are assigned to the lowest priority. The key point is that MLFQ is very simple and does not require flow size information to do the scheduling.
PIAS Key Idea: PIAS performs MLFQ to emulate SJF. In general, under PIAS short flows finish in the higher priority queues while large ones finish in the lower priority queues, emulating SJF, which is effective for heavy-tailed DCN traffic. With MLFQ, small flows are more likely to finish in higher priority queues and large flows in lower priority queues, so small flows are prioritized over large flows, emulating SJF, especially for heavy-tailed DCN workloads.
How to implement MLFQ? Implementing MLFQ at the switch directly is not scalable: it requires the switch to keep per-flow state. (Figure: K priority queues, Priority 1 through Priority K, inside the switch.) MLFQ is not supported on commodity switches because it is hard to track per-flow state on switches.
How to implement MLFQ? Decouple MLFQ: stateless Priority Queueing at the switch (a built-in function) plus stateful Packet Tagging at the end host. There are K priorities P_i (1 ≤ i ≤ K) and K−1 demotion thresholds α_j (1 ≤ j ≤ K−1); the threshold for demotion from P_{j−1} to P_j is α_{j−1}. To solve this problem, we decouple the multi-level feedback queue. On the switch, we enable strict priority queueing [click]. At the end hosts, we add a packet tagging module [click]. Its logic is simple: P_1 is the highest priority; the packet tagging module maintains per-flow state and compares each flow's bytes-sent counter with the demotion thresholds. Once a flow's bytes sent exceed α_{j−1}, its subsequent packets are marked with P_j. For example, [click]
How to implement MLFQ? (Example, continued.) Walk through the example: a flow starts with its packets tagged P_1; once its bytes sent exceed α_1, its following packets are tagged P_2, and so on down the priority levels until P_K. A small sketch of this tagging logic is shown below.
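The following is a minimal sketch (not the PIAS source code) of mapping a flow's bytes sent to a priority level given the demotion thresholds. The number of queues K and the threshold values are assumed example values, purely for illustration.

```c
/* Sketch: given a flow's bytes sent so far and the K-1 demotion thresholds
 * alpha[0..K-2], return the priority (1 = highest, K = lowest) to tag on
 * the flow's next outgoing packet. K and the thresholds are illustrative. */
#include <stdint.h>

#define K 8  /* number of priority queues (assumed example value) */

/* alpha[j] is the demotion threshold from P_{j+1} to P_{j+2}, in bytes */
static const uint64_t alpha[K - 1] = {
    10 * 1024, 100 * 1024, 1024 * 1024, 10UL * 1024 * 1024,
    20UL * 1024 * 1024, 50UL * 1024 * 1024, 100UL * 1024 * 1024
};

/* Priority for a packet of a flow that has already sent bytes_sent bytes */
static int pias_priority(uint64_t bytes_sent)
{
    int j;

    for (j = 0; j < K - 1; j++) {
        if (bytes_sent < alpha[j])
            return j + 1;   /* still within priority P_{j+1} */
    }
    return K;               /* beyond all thresholds: lowest priority */
}
```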
Threshold vs. Traffic Mismatch. DCN traffic is highly dynamic, and a fixed set of thresholds fails to track traffic variation, leading to a mismatch. (Figure: example comparing an ideal threshold of 20KB with a too-small threshold of 10KB and a too-big threshold of 1MB, for 10MB flows across the high and low priority queues, with ECN.) To determine the demotion thresholds for MLFQ, we built a simple queueing-theory model and derived the thresholds from flow size distributions. However, traffic is highly dynamic in data centers, across both time and space, so a mismatch between traffic and demotion thresholds is unavoidable. We need to handle this mismatch to keep PIAS effective in highly dynamic data center networks. [click]
PIAS in 1 Slide. PIAS packet tagging: maintain flow state and mark packets with priorities. PIAS switch: enable strict priority queueing and ECN. PIAS rate control: employ Data Center TCP (DCTCP) to react to ECN. To summarize, PIAS has three components: a packet tagging module at the end hosts that tracks per-flow state and marks packets with priorities; switches configured with strict priority queueing and ECN marking; and DCTCP for rate control. This is the whole PIAS system.
Outline Key idea of PIAS Implementation efforts for NSDI submission Efforts after NSDI Takeaway from PIAS experience
Implementation Stages. (1) ECN-based transport: DCTCP at the end host, ECN marking at the switch. (2) MLFQ scheduling: packet tagging module at the end host, priority queueing at the switch. (3) Evaluation: measure FCT using realistic traffic. We divided the implementation effort into three stages. In the first stage ... [click] In the second stage ... [click] In the third stage ...
Integrate DCTCP into Linux Kernel DCTCP was not integrated into Linux in 2014 Linux patch provided by the authors
Integrate DCTCP into Linux Kernel. DCTCP was not integrated into Linux in 2014; a Linux patch was provided by the authors. PASS. So it was very easy to build a kernel with the DCTCP algorithm.
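For reference, DCTCP was later merged into mainline Linux (as the tcp_dctcp module, starting with kernel 3.18), so no out-of-tree patch is needed on recent kernels. A minimal sketch of selecting it for a single socket, assuming DCTCP is compiled in and ECN negotiation is enabled (net.ipv4.tcp_ecn):

```c
/* Sketch: select DCTCP as the congestion control for one TCP socket on a
 * kernel that ships tcp_dctcp. Error handling is kept minimal. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    const char *cc = "dctcp";
    char buf[16] = {0};
    socklen_t len = sizeof(buf);

    if (fd < 0)
        return 1;

    /* TCP_CONGESTION picks the congestion control module for this socket;
     * the system-wide default is the net.ipv4.tcp_congestion_control sysctl. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
        perror("setsockopt(TCP_CONGESTION)");

    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
        printf("congestion control: %s\n", buf);
    return 0;
}
```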
ECN Marking at the Switch. Switch hardware: Pica8 P-3295 1G switch. Switch OS: PicOS 2.1.0. Next, we needed to configure ECN marking at the switch. The switch we used is the Pica8 P-3295, which has 48 1G ports and runs PicOS 2.1.0. At first glance, configuring ECN marking looks easy: just find the command in the manual and type it in.
ECN Marking at the Switch. There is no "ECN" or "RED" anywhere in the PicOS document. Why does our switch not support ECN? [click] We could not find ECN or RED in the manual. [click] Why doesn't our switch support ECN? That is not what the DCTCP paper suggested. We emailed Pica8, and they replied that the switching chip has the capability, but they had not implemented the feature in the switch OS yet.
ECN Marking at the Switch. Switch model (bottom to top): switching chip hardware; switching chip interfaces (expose all hardware features); switch OS (exposes only some hardware features). Solution (with help from Dongsu): use the Broadcom shell to configure ECN/RED. After studying this switch model, I finally understood what that answer meant. A switch actually has three layers. The bottom layer is the hardware switching chip; many switch vendors do not build their own chips but buy them from chip vendors such as Broadcom and Barefoot. On top of the hardware, the chip vendors provide switching chip interfaces that can control all hardware features. [click] The top layer is the switch OS, which switch companies develop on top of the chip interfaces; due to time limitations and customer requirements, it only exposes control interfaces for some hardware features.
DCTCP Performance Validation. TCP incast experiment; TCP RTOmin: 200ms; static switch buffer allocation. Expected result: DCTCP greatly outperforms TCP. Actual result: DCTCP delivers similar or even worse performance, and some flows experience 3-second timeout delays.
Result Analysis. Why do flows experience 3-second timeouts? Hint: HZ = 1 second.
Result Analysis. Why do flows experience 3-second timeouts? Root cause: many SYN packets get dropped; the ECN bits of SYN packets are 00 (Non-ECT). More generally, SYN, FIN, and pure ACK packets are Non-ECT, and the switch drops Non-ECT packets whenever the queue length exceeds the marking threshold. Solution: mark all TCP packets as ECT using iptables.
Packet Tagging Module. A loadable kernel module; a shim layer between TCP/IP and Qdisc. (Figure: the stack — Application in user space; TCP, IP, Packet Tagging, Qdisc, and the NIC driver in kernel space.)
Packet Tagging Module A loadable kernel module Shim layer between TCP/IP and Qdisc Netfilter hooks to intercept packets
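Below is a minimal sketch of how such a shim can intercept outgoing packets with Netfilter and rewrite the DSCP field, also forcing ECT(0) as discussed earlier. This is not the PIAS source: the ~Linux 3.18 hook prototype is assumed, the per-flow lookup is a placeholder, and the DSCP-to-queue mapping depends on the switch configuration.

```c
/* Sketch of a Netfilter-based packet tagging hook (Linux ~3.18 hook
 * prototype assumed). It intercepts outgoing IPv4 TCP packets, writes the
 * flow's priority into the DSCP field, and sets ECT(0). */
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>
#include <net/dsfield.h>
#include <net/inet_ecn.h>

/* Placeholder: the real module looks up the flow's bytes sent in a
 * hash table (next slide) and applies the demotion thresholds. */
static u8 flow_priority_dscp(struct sk_buff *skb)
{
    return 1;
}

static unsigned int pias_tag_hook(const struct nf_hook_ops *ops,
                                  struct sk_buff *skb,
                                  const struct net_device *in,
                                  const struct net_device *out,
                                  int (*okfn)(struct sk_buff *))
{
    struct iphdr *iph = ip_hdr(skb);

    if (!iph || iph->protocol != IPPROTO_TCP)
        return NF_ACCEPT;

    /* Overwrite the whole TOS byte: new DSCP (priority) plus ECT(0). */
    ipv4_change_dsfield(iph, 0,
                        (flow_priority_dscp(skb) << 2) | INET_ECN_ECT_0);
    return NF_ACCEPT;
}

static struct nf_hook_ops pias_tag_ops = {
    .hook     = pias_tag_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_POST_ROUTING,
    .priority = NF_IP_PRI_LAST,
};

static int __init pias_tag_init(void)  { return nf_register_hook(&pias_tag_ops); }
static void __exit pias_tag_exit(void) { nf_unregister_hook(&pias_tag_ops); }

module_init(pias_tag_init);
module_exit(pias_tag_exit);
MODULE_LICENSE("GPL");
```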
Packet Tagging Module. A loadable kernel module; a shim layer between TCP/IP and Qdisc; Netfilter hooks to intercept packets; per-flow state kept in a hash table with linked lists (chaining). (Figure: hash buckets Linked List 1–5 with Flows 1–5 chained under them.)
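A plain-C sketch of such a per-flow table follows, keyed by the TCP/IP 4-tuple. It is illustrative only: the in-kernel version would use kernel allocation and locking (next slide) and something like jhash instead of the toy hash below.

```c
/* Sketch: fixed-size hash table with chaining for per-flow state. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 1024

struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct flow_entry {
    struct flow_key key;
    uint64_t bytes_sent;       /* compared against the demotion thresholds */
    struct flow_entry *next;   /* chaining within a bucket */
};

static struct flow_entry *buckets[NUM_BUCKETS];

static unsigned int flow_hash(const struct flow_key *k)
{
    /* toy mixing function; a kernel module could use jhash() */
    uint32_t h = k->saddr ^ k->daddr ^ ((uint32_t)k->sport << 16 | k->dport);
    h ^= h >> 16;
    return h % NUM_BUCKETS;
}

/* Find the flow entry, creating it on first use; returns NULL on OOM. */
static struct flow_entry *flow_lookup_or_create(const struct flow_key *k)
{
    unsigned int b = flow_hash(k);
    struct flow_entry *e;

    for (e = buckets[b]; e; e = e->next)
        if (memcmp(&e->key, k, sizeof(*k)) == 0)
            return e;

    e = calloc(1, sizeof(*e));
    if (!e)
        return NULL;
    e->key = *k;
    e->next = buckets[b];
    buckets[b] = e;
    return e;
}
```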
Kernel Programming. Bugs are likely to cause kernel panics.
Kernel Programming. After implementing a small feature, test it! Use printk to get useful diagnostic information. Common errors: spinlock functions (e.g., spin_lock_irqsave vs. spin_lock) and vmalloc vs. kmalloc (different types of memory and allocation contexts). Pair programming helps.
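An illustrative sketch of the two pitfalls on this slide, assuming the flow table is touched both from a Netfilter hook (atomic/softirq context) and from a process-context control path; this is not PIAS code, just the general pattern.

```c
/* Sketch of common kernel-programming pitfalls in this kind of module. */
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>

static DEFINE_SPINLOCK(table_lock);

/* Called from the Netfilter hook (atomic context): must not sleep. */
static void *alloc_entry_atomic(size_t size)
{
    /* GFP_ATOMIC never sleeps; GFP_KERNEL here could sleep and oops. */
    return kmalloc(size, GFP_ATOMIC);
}

/* Called at module init (process context): large table, may sleep. */
static void *alloc_table(size_t size)
{
    /* vmalloc gives virtually contiguous memory, fine for big tables,
     * but must never be called from atomic context. */
    return vmalloc(size);
}

/* Process-context path touching state shared with the hook. */
static void update_shared_state(void)
{
    unsigned long flags;

    /* Use the irq-saving variant for data shared with softirq context.
     * A bare spin_lock() can deadlock if the softirq path preempts the
     * holder on the same CPU and spins on the same lock. */
    spin_lock_irqsave(&table_lock, flags);
    /* ... update the flow table ... */
    spin_unlock_irqrestore(&table_lock, flags);
}
```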
Priority Queueing at the Switch. Easy to configure using PicOS or the Broadcom shell, but there is an undesirable interaction with ECN/RED: each queue is essentially a link with varying capacity, so it needs a dynamic queue-length marking threshold, while existing ECN/RED schemes (per-queue, per-port, and shared buffer pool) only support static thresholds. Our choice: per-port ECN/RED, which cannot fully preserve the scheduling policy.
Evaluation: Flow Completion Time (FCT). Measuring FCT at the sender side would be T(last ACK received) − T(first packet sent), but in practice the TCP sender does not know when the last ACK is received. We therefore measure FCT at the receiver side: the receiver sends a request to the sender for the desired amount of data, and FCT = T(entire response received) − T(request sent).
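A minimal sketch of this receiver-side measurement, assuming a simple request/response protocol in which the receiver asks the sender for `size` bytes; the 4-byte request format and function names are illustrative, not the actual traffic generator.

```c
/* Sketch: receiver-side FCT measurement. Send a request for `size` bytes,
 * read the full response, report T(all data received) - T(request sent). */
#include <arpa/inet.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

/* fd: connected TCP socket to the sender; size: requested flow size (bytes).
 * Returns the FCT in microseconds, or -1 on error. */
double measure_fct_us(int fd, uint32_t size)
{
    struct timespec start, end;
    uint32_t req = htonl(size);   /* illustrative wire format */
    char buf[64 * 1024];
    uint32_t received = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    if (write(fd, &req, sizeof(req)) != sizeof(req))
        return -1;

    while (received < size) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n <= 0)
            return -1;
        received += (uint32_t)n;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_us(start, end);
}
```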
Outline Key idea of PIAS Implementation efforts for NSDI submission Efforts after NSDI Takeaway from PIAS experience
Implementation Efforts. Improve the traffic generator: use persistent TCP connections; better user interfaces; it has since been used in other papers (e.g., ClickNP). Improve the packet tagging module: identify message boundaries within TCP connections; monitor TCP send buffer occupancy using jprobe hooks. Evaluation on Linux kernel 3.18.
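A rough sketch of the jprobe idea: probe tcp_sendmsg and treat a write into a drained send buffer as the start of a new message. This assumes the ~3.18 tcp_sendmsg signature (jprobes were removed from later kernels), and the boundary heuristic and names here are illustrative rather than the exact PIAS logic.

```c
/* Sketch: jprobe on tcp_sendmsg to spot message boundaries (Linux ~3.18). */
#include <linux/aio.h>
#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/socket.h>
#include <net/sock.h>
#include <net/tcp.h>

/* Must mirror the prototype of tcp_sendmsg in ~3.18 kernels. */
static int jtcp_sendmsg(struct kiocb *iocb, struct sock *sk,
                        struct msghdr *msg, size_t size)
{
    if (sk->sk_wmem_queued == 0) {
        /* Send buffer had fully drained before this write, so this write
         * plausibly starts a new message; the real module would reset the
         * flow's bytes-sent counter here. */
        pr_debug("pias: possible message boundary, size=%zu\n", size);
    }

    jprobe_return();   /* mandatory: hand control back to the real function */
    return 0;          /* never reached */
}

static struct jprobe tcp_send_jprobe = {
    .entry = jtcp_sendmsg,
    .kp = { .symbol_name = "tcp_sendmsg" },
};

static int __init jp_init(void)  { return register_jprobe(&tcp_send_jprobe); }
static void __exit jp_exit(void) { unregister_jprobe(&tcp_send_jprobe); }

module_init(jp_init);
module_exit(jp_exit);
MODULE_LICENSE("GPL");
```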
Some Research Questions. How to do ECN marking with multiple queues? Our solution (per-port ECN/RED) sacrifices the scheduling policy in exchange for good throughput and latency. How does the switch manage its buffer? Incast only happens with static buffer allocation.
Research Efforts. ECN marking with multiple queues (2015-2016): MQ-ECN [NSDI'16] dynamically adjusts per-queue queue-length thresholds; TCN [CoNEXT'16] uses sojourn time as the congestion signal. Buffer management (2016-2017): BCC [APNet'17], buffer-aware congestion control for extremely shallow-buffered data centers, which adds one more ECN/RED configuration on the shared buffer.
Outline Key idea of PIAS Implementation efforts for NSDI submission Efforts after NSDI Takeaway from PIAS experience
Takeaway: start implementing as soon as you start a project. A good implementation not only makes the paper stronger but also unveils many research problems.
Cloud & Mobile Group at MSRA Research Area: Cloud Computing, Networking, Mobile Computing We are now looking for full-time researchers and research interns Feel free to talk to me or send emails to my manager Thomas Moscibroda
Thank You