Lecture 17, Computer Networks (198:552)


Packet Scheduling in Data Centers (Lecture 17, Computer Networks 198:552)

Datacenter transport
Goal: complete flows quickly / meet deadlines
- Short flows (e.g., query, coordination): need low latency
- Large flows (e.g., data update, backup): need high throughput

Low-latency congestion control (e.g., DCTCP)
- Keeps network queues small (while sustaining high throughput)
- Want: implicitly prioritize mice; with long queues, mice suffer head-of-line (HOL) blocking
- Can we do better?

The opportunity
Many datacenter apps/platforms know flow sizes or deadlines in advance:
- Key/value stores
- Data processing
- Web search
[Diagram: front-end server fanning out to aggregators and workers]

[Figure: the datacenter fabric abstracted as one big switch, with hosts H1-H9 appearing on both the TX and RX sides]

Flow scheduling on a giant switch
- Datacenter transport = flow scheduling on a giant switch, subject to ingress and egress capacity constraints
- Objective? Minimize average FCT; minimize missed deadlines
[Figure: hosts H1-H9 on the TX and RX sides of the giant switch]

Example: minimize average FCT
- Flows A, B, and C have sizes 1, 2, and 3
- They arrive at the same time and share the same bottleneck link

Example: minimize average FCT
- Fair sharing: completion times 3, 5, 6; mean FCT 4.67
- Shortest flow first: completion times 1, 3, 6; mean FCT 3.33
[Figure: throughput-vs-time plots for the two schedules]
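To make the arithmetic concrete, here is a minimal Python sketch (not from the lecture) that reproduces both schedules for the slide's flows of sizes 1, 2, and 3 on a unit-rate bottleneck:

```python
def fct_fair_sharing(sizes):
    """Processor sharing: all active flows split the link equally."""
    remaining = sorted(sizes)
    t, fcts = 0.0, []
    while remaining:
        n = len(remaining)
        dt = remaining[0] * n        # time for the smallest flow to finish
        t += dt
        remaining = [r - dt / n for r in remaining[1:]]
        fcts.append(t)
    return fcts

def fct_sjf(sizes):
    """Shortest flow first: run each flow to completion, smallest first."""
    t, fcts = 0.0, []
    for s in sorted(sizes):
        t += s
        fcts.append(t)
    return fcts

for name, fn in [("fair sharing", fct_fair_sharing), ("SJF", fct_sjf)]:
    fcts = fn([1, 2, 3])
    print(name, fcts, "mean:", round(sum(fcts) / len(fcts), 2))
# fair sharing [3.0, 5.0, 6.0] mean: 4.67
# SJF [1.0, 3.0, 6.0] mean: 3.33
```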

Optimal flow scheduling for average FCT
- NP-hard for a multi-link network [Bar-Noy et al.]
- Shortest Flow First: a 2-approximation

How can we schedule flows based on flow criticality in a distributed way?
[Figure: packets from several flows arranged into some transmission order]

pFabric (Mohammad Alizadeh et al., SIGCOMM'13)

pFabric in 1 slide
- Packets carry a single priority number, e.g., priority = remaining flow size
- pFabric switches: send highest-priority / drop lowest-priority packets; very small buffers (20-30 KB for a 10 Gb/s fabric)
- pFabric hosts: send/retransmit aggressively; minimal rate control, just enough to prevent congestion collapse
Main idea: decouple scheduling from rate control
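As a rough illustration (my sketch, not the paper's hardware design), the switch-side behavior amounts to a tiny buffer that evicts the lowest-priority packet when full and transmits the highest-priority packet first; smaller numbers are more urgent since priority is remaining flow size:

```python
class PFabricPort:
    """Sketch of one pFabric egress queue (capacity is an assumed value)."""

    def __init__(self, capacity_pkts=24):    # roughly 20-30 KB of packets
        self.capacity = capacity_pkts
        self.buf = []                         # list of (priority, payload)

    def enqueue(self, prio, pkt):
        self.buf.append((prio, pkt))
        if len(self.buf) > self.capacity:
            # Buffer full: drop the lowest-priority (largest-number)
            # packet, which may be the one that just arrived.
            self.buf.remove(max(self.buf, key=lambda p: p[0]))

    def dequeue(self):
        # Transmit the highest-priority packet (the next slide refines
        # this rule to avoid starvation within a flow).
        if not self.buf:
            return None
        best = min(self.buf, key=lambda p: p[0])
        self.buf.remove(best)
        return best
```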

Starvation prevention
- Use remaining flow size as priority: what happens? A flow's earlier packets carry larger remaining sizes (lower priority) than its later packets, so they could starve behind them
- Rule: transmit the earliest packet of the flow whose packet has the highest priority
[Figure: per-packet priority numbers illustrating the rule]
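A sketch of that dequeue rule (again my rendering, with an assumed tuple layout): pick the flow owning the most urgent packet, but send that flow's earliest packet so its data leaves in order:

```python
def dequeue_starvation_free(buf):
    """buf: list of (priority, seq_no, flow_id) tuples; a smaller
    priority number (remaining flow size) is more urgent."""
    if not buf:
        return None
    # Flow that owns the highest-priority packet in the buffer.
    urgent_flow = min(buf, key=lambda p: p[0])[2]
    # Among that flow's packets, transmit the earliest one, so a flow's
    # later (more urgent) packets cannot starve its earlier ones.
    pkt = min((p for p in buf if p[2] == urgent_flow), key=lambda p: p[1])
    buf.remove(pkt)
    return pkt
```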

pFabric switch
- Scheduling boils down to a sort over essentially unlimited priorities
- Thought to be difficult in hardware: existing switches support only 4-16 priority classes
- But pFabric queues are very small: 51.2 ns (one minimum-size packet at 10 Gb/s) to find the min/max of ~600 numbers
- A binary comparator tree takes ~10 clock cycles; current ASIC clocks are ~1 ns
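The slide's budget checks out with simple arithmetic; the sketch below assumes 64 B minimum-size packets, which is where 51.2 ns comes from:

```python
import math

budget_ns = 64 * 8 / 10e9 * 1e9          # one 64 B packet at 10 Gb/s
print(budget_ns)                          # 51.2 ns per dequeue decision

entries = 600                             # ~packets in a 20-30 KB buffer
levels = math.ceil(math.log2(entries))    # depth of a binary comparator tree
print(levels)                             # 10 levels -> ~10 ns at a ~1 ns
                                          # clock, well under the budget
```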

pFabric rate control
- A minimal version of the TCP algorithm
- Start at line rate: initial window larger than the BDP
- No retransmission timeout estimation: fixed RTO at a small multiple of the round-trip time
- Reduce window size upon packet drops; window increase is the same as TCP (slow start, congestion avoidance, ...)
- After multiple consecutive timeouts, enter "probe mode": send minimum-size packets until the first ACK
Discussion: what about queue buildup? Why window control at all?
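A minimal sketch of that sender logic, with assumed constants (the paper's exact parameters may differ):

```python
class PFabricSender:
    """Sketch of pFabric's minimal, TCP-derived rate control."""

    def __init__(self, bdp_pkts=12, rtt_s=12e-6):
        self.cwnd = bdp_pkts + 1     # start above the BDP: line rate
        self.rto_s = 3 * rtt_s       # fixed RTO, no RTT estimation
        self.consecutive_timeouts = 0
        self.probing = False

    def on_ack(self):
        self.consecutive_timeouts = 0
        self.probing = False
        self.cwnd += 1 / self.cwnd   # TCP-style congestion avoidance
                                     # (slow start omitted for brevity)

    def on_drop(self):
        # Back off only enough to prevent congestion collapse.
        self.cwnd = max(1.0, self.cwnd / 2)

    def on_timeout(self):
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= 3:   # assumed threshold
            self.probing = True   # send min-size probes until the first ACK
```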

Why does pFabric work?
- Key invariant: at any instant, the highest-priority packet (according to the ideal algorithm) is available at the switch
- Priority scheduling: high-priority packets traverse the fabric as quickly as possible
- What about dropped packets? The lowest-priority packet is not needed until all other packets depart
- Buffer > BDP: enough time (> 1 RTT) to retransmit it
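The buffer-versus-BDP comparison is easy to verify with assumed datacenter numbers (10 Gb/s links, ~12 us fabric RTT):

```python
link_bps, rtt_s = 10e9, 12e-6       # assumed datacenter parameters
bdp_bytes = link_bps * rtt_s / 8
print(bdp_bytes / 1e3, "KB")        # 15.0 KB

# The 20-30 KB pFabric buffer exceeds this BDP, so a dropped
# lowest-priority packet sits behind more than one RTT of traffic:
# its aggressive retransmission can arrive before its turn comes.
```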

[Figure: overall mean FCT vs. load, web search workload]

[Figure: mice FCT (<100 KB) vs. load, web search workload; panels for the average and the 99th percentile]

[Figure: average elephant FCT (>10 MB) vs. load, data mining workload]
Why the gap? The packet drop rate at 80% load is 4.3%

Discussion
- Priority-based scheduling: pros and cons? Gaming? Starvation?
- Implementation using a small number of priority queues? How to set priority thresholds?
- When is the "big switch" abstraction appropriate?

PIAS: information-agnostic scheduling (Wei Bai et al., NSDI'15)

State of the art for packet/flow scheduling
- pFabric, PDQ, and PASE assume prior knowledge of flow sizes, approximate ideal preemptive shortest job first (SJF), and rely on customized network elements
- Can we minimize FCT without flow size information, using only commodity components?

Multi-level feedback queue (MLFQ)
Emulate shortest job first (SJF) without job size information

Realization with commodity components
- Don't keep per-flow sizes or per-flow state in the switch
- The endpoint maintains per-flow state and sets the priority as a packet tag
- Switches map packets to queues using thresholds on the packet tags
- What about bad thresholds? Use ECN to keep queues short, and DCTCP at the endpoints to react smoothly to ECN marks
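A sketch of the endpoint side (the demotion thresholds below are hypothetical placeholders; PIAS derives its thresholds from the measured flow size distribution):

```python
from collections import defaultdict

# Hypothetical demotion thresholds in bytes; smaller tag = higher priority.
DEMOTION_THRESHOLDS = [10_000, 100_000, 1_000_000]

bytes_sent = defaultdict(int)   # per-flow state lives at the endpoint

def tag_packet(flow_id, pkt_len):
    """Return the priority tag for this packet. The switch only needs to
    map tags onto its few strict-priority queues."""
    sent = bytes_sent[flow_id]
    bytes_sent[flow_id] = sent + pkt_len
    for tag, threshold in enumerate(DEMOTION_THRESHOLDS):
        if sent < threshold:
            return tag                      # flow still under this level
    return len(DEMOTION_THRESHOLDS)         # largest flows: lowest priority
```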

Discussion
- What happens if a flow has a bursty traffic pattern: transmit for x seconds, wait y seconds, then transmit for x again?
- What if a flow transmits just enough to always stay in the highest-priority queue?

MLFQ rules (Arpaci-Dusseau)

Acknowledgment
Slides heavily adapted from material by Mohammad Alizadeh, Chi-Yao Hong, and Wei Bai