Presto: Edge-based Load Balancing for Fast Datacenter Networks Keqiang He, Eric Rozner, Kanak Agarwal, Wes Felter, John Carter, Aditya Akella
Background
Datacenter networks support a wide variety of traffic.
Elephants (throughput sensitive): data ingestion, VM migration, backups.
Mice (latency sensitive): search, gaming, web, RPCs.
The Problem
Network congestion: flows of both types suffer.
Example: elephant throughput is cut by half; TCP RTT is increased by 100x per hop (Rasley, SIGCOMM'14).
SLA is violated, revenue is impacted.
Traffic Load Balancing Schemes

Scheme               Hardware changes   Transport changes   Granularity      Pro-/reactive
ECMP                 No                 No                  Coarse-grained   Proactive
Centralized          No                 No                  Coarse-grained   Reactive (control loop)
MPTCP                No                 Yes                 Fine-grained     Reactive
CONGA/Juniper VCF    Yes                No                  Fine-grained     Proactive
Presto               No                 No                  Fine-grained     Proactive

Proactive: try to avoid network congestion in the first place.
Reactive: mitigate congestion after it already happens.
Presto
Goal: near optimally load balance the network at fast speeds.
Near perfect load balancing without changing hardware or transport.
Utilize the software edge (vSwitch).
Leverage TCP offloading features below the transport layer.
Work at 10 Gbps and beyond.
Presto at a High Level
[Diagram: spine-leaf topology; each host runs TCP/IP over a vSwitch over its NIC]
The sender's vSwitch splits traffic into near uniform-sized data units and proactively distributes them evenly over the symmetric network.
The receiver masks packet reordering due to multipathing below the transport layer.
Outline Sender Receiver Evaluation
What Granularity Should We Load-balance on?
Per-flow: elephant collisions.
Per-packet: high computational overhead; heavy reordering, including for mice flows.
Flowlets: bursts of packets separated by an inactivity timer; effectiveness depends on workloads (see the sketch below).
  Small inactivity timer: a lot of reordering, mice flows fragmented.
  Large inactivity timer: large flowlets (hash collisions).
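For reference, flowlet splitting can be sketched as below: a new flowlet starts whenever the gap between consecutive packets of a flow exceeds the inactivity timer. This is an illustrative Python toy, not any particular switch implementation, and the timer values are arbitrary.

```python
# Toy flowlet detection: a new flowlet begins when the inter-packet gap within a
# flow exceeds the inactivity timer. Timer values are arbitrary for illustration.

def assign_flowlets(arrival_times_us, inactivity_timer_us=500):
    """Map each packet arrival time (microseconds) to a flowlet id."""
    flowlet_ids = []
    flowlet = 0
    last = None
    for t in arrival_times_us:
        if last is not None and t - last > inactivity_timer_us:
            flowlet += 1          # long idle gap: start a new flowlet
        flowlet_ids.append(flowlet)
        last = t
    return flowlet_ids

# A small timer fragments the flow; a large timer keeps it as one big flowlet.
arrivals = [0, 100, 200, 900, 1000, 5000]
print(assign_flowlets(arrivals, inactivity_timer_us=500))   # [0, 0, 0, 1, 1, 2]
print(assign_flowlets(arrivals, inactivity_timer_us=5000))  # [0, 0, 0, 0, 0, 0]
```

The two calls illustrate the trade-off on the slide: a small timer fragments mice flows and invites reordering, while a large timer produces large flowlets that can still collide.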
Presto LB Granularity
Presto: load-balance on flowcells.
What is a flowcell? A set of TCP segments with a bounded byte count. The bound is the maximal TCP Segmentation Offload (TSO) size (64KB in our implementation), which maximizes the benefit of TSO at high speed.
What is TSO? TCP/IP hands a large segment to the NIC, which performs segmentation and checksum offload to produce MTU-sized Ethernet frames.
Example 1: TCP segments of 25KB, 30KB, 30KB. The first two segments form a 55KB flowcell; adding the third would exceed 64KB, so it starts a new flowcell.
Example 2: TCP segments of 1KB, 5KB, 1KB. They form a single 7KB flowcell (the whole flow is one flowcell).
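A minimal Python sketch of the flowcell boundary rule illustrated by the two examples, assuming the 64KB bound from the talk (illustrative only; the name split_into_flowcells is made up, and this is not the OVS/kernel datapath code):

```python
# Minimal sketch of flowcell formation, assuming the 64KB bound from the talk.
# Illustrative Python, not the actual OVS/kernel implementation.

FLOWCELL_BOUND = 64 * 1024  # maximal TSO size used as the flowcell bound

def split_into_flowcells(segment_sizes):
    """Group a flow's TCP segment sizes (in bytes) into flowcells of
    at most FLOWCELL_BOUND bytes each."""
    flowcells = []
    current = []
    current_bytes = 0
    for size in segment_sizes:
        # If adding this segment would exceed the bound, close the flowcell.
        if current and current_bytes + size > FLOWCELL_BOUND:
            flowcells.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        flowcells.append(current)
    return flowcells

# Example 1 from the slides: 25KB + 30KB fit in one flowcell; the next 30KB starts a new one.
print(split_into_flowcells([25 * 1024, 30 * 1024, 30 * 1024]))
# Example 2: a small flow (1KB, 5KB, 1KB) stays in a single 7KB flowcell.
print(split_into_flowcells([1024, 5 * 1024, 1024]))
```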
Presto Sender
[Diagram: Host A and Host B, each with TCP/IP, vSwitch and NIC, connected through a leaf-spine fabric]
The controller installs label-switched paths.
Presto Sender (continued)
The vSwitch at Host A receives TCP segment #1 (50KB). For flowcell #1, the vSwitch encodes the flowcell ID and rewrites the label (id, label); the NIC then uses TSO and chunks segment #1 into MTU-sized packets.
The vSwitch next receives TCP segment #2 (60KB). For flowcell #2, it again encodes the flowcell ID and rewrites the label, and the NIC uses TSO to chunk segment #2 into MTU-sized packets, which travel a different labeled path. A sketch of this sender-side logic follows.
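A highly simplified sketch of the sender-side logic just described: assign each segment to a flowcell and pick one of the controller-installed labels per flowcell. The class name, the round-robin label choice, and the structure are assumptions for illustration; the slides do not specify how labels are chosen, and the real logic lives in the OVS datapath.

```python
# Illustrative sketch of Presto sender-side flowcell tagging, assuming the
# controller has installed a set of label-switched paths (labels 0..N-1).
# Hypothetical structure; not the actual OVS code.

FLOWCELL_BOUND = 64 * 1024

class PrestoSenderState:
    def __init__(self, num_labels):
        self.num_labels = num_labels
        self.flowcell_id = 0      # per-flow flowcell counter
        self.bytes_in_cell = 0    # bytes sent in the current flowcell

    def tag_segment(self, seg_len):
        """Return (flowcell_id, label) for a TCP segment of seg_len bytes."""
        if self.bytes_in_cell and self.bytes_in_cell + seg_len > FLOWCELL_BOUND:
            # Segment would overflow the current flowcell: start a new one.
            self.flowcell_id += 1
            self.bytes_in_cell = 0
        self.bytes_in_cell += seg_len
        # Spread flowcells over the installed paths (round-robin by id is an
        # illustrative choice, not the paper's exact mechanism).
        label = self.flowcell_id % self.num_labels
        return self.flowcell_id, label

state = PrestoSenderState(num_labels=4)
print(state.tag_segment(50 * 1024))  # segment #1 -> flowcell 0
print(state.tag_segment(60 * 1024))  # segment #2 -> flowcell 1, a different label
```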
Benefits
Most flows are smaller than 64KB [Benson, IMC'11], so the majority of mice are not exposed to reordering.
Most bytes come from elephants [Alizadeh, SIGCOMM'10], so traffic is routed in uniform-sized units.
Fine-grained and deterministic scheduling over disjoint paths yields near optimal load balancing.
Presto Receiver
Major challenges: packet reordering for large flows due to multipath; distinguishing loss from reordering; must be fast (10G and beyond) and light-weight.
Intro to GRO
Generic Receive Offload (GRO) is the reverse process of TSO; it sits between the NIC hardware and TCP/IP in the OS.
[Animation: MTU-sized packets P1..P5 arrive from the NIC; GRO merges the in-order packets at the head of the queue into one large segment, P1 – P5]
Large TCP segments are pushed up at the end of a batched IO event (i.e., a polling event).
Merging packets in GRO creates fewer segments and avoids using substantially more cycles at TCP/IP and above [Menon, ATC'08]. If GRO is disabled, throughput is ~6Gbps with 100% CPU usage of one core.
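To make the merging behavior concrete, here is a toy model of a GRO batch in Python. Real GRO operates on skbs inside the kernel; this sketch only mimics the merging of consecutive in-sequence MTU-sized packets within one polling event, with packets modeled as (start_seq, length) tuples.

```python
# Toy model of standard GRO: merge consecutive, in-sequence packets within a
# batched IO (polling) event, then push the merged segments up to TCP/IP.
# Not kernel code; packets are modeled as (start_seq, length) tuples.

def gro_batch(packets):
    """Merge in-order packets; flush the current segment on a sequence gap."""
    pushed = []            # segments handed to TCP/IP, as (start_seq, length)
    cur_start, cur_len = None, 0
    for seq, length in packets:
        if cur_start is not None and seq == cur_start + cur_len:
            cur_len += length          # contiguous: grow the current segment
        else:
            if cur_start is not None:  # gap: push up what we have so far
                pushed.append((cur_start, cur_len))
            cur_start, cur_len = seq, length
    if cur_start is not None:
        pushed.append((cur_start, cur_len))   # end of the polling event
    return pushed

# In-order arrival P1..P5 (MTU payload of 1448 bytes each) -> one big segment.
mtu = 1448
in_order = [(i * mtu, mtu) for i in range(5)]
print(gro_batch(in_order))   # [(0, 7240)]
```

Running the same function on an out-of-order arrival (as in the next slides) produces many small segments, which is exactly the problem Presto's receiver has to solve.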
Reordering Challenges
[Animation: packets arrive at the NIC out of order: P1, P2, P3, P6, P4, P7, P5, P8, P9]
GRO is designed to be fast and simple; it pushes up the existing segment immediately when 1) there is a gap in sequence number, 2) the MSS is reached, or 3) a timeout fires.
With this arrival order, GRO pushes up P1 – P3, P6, P4, P7, P5, and P8 – P9 as separate segments.
As a result, GRO is effectively disabled: lots of small packets are pushed up to TCP/IP, causing huge CPU processing overhead and poor TCP performance due to massive reordering.
Improved GRO to Mask Reordering for TCP
Idea: merge packets belonging to the same flowcell into one TCP segment, then check whether the resulting segments are in order before pushing them up.
[Animation: the same out-of-order arrival (P1, P2, P3, P6, P4, P7, P5, P8, P9), where P1 – P5 belong to flowcell #1 and P6 – P9 to flowcell #2; per-flowcell merging yields two large in-order segments, P1 – P5 and P6 – P9, which are pushed up to TCP/IP]
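A toy sketch of the per-flowcell merging idea, reusing the (flowcell_id, start_seq, length) packet model. It deliberately ignores the loss-vs-reordering checks discussed next and is not the kernel implementation, which modifies the GRO path itself.

```python
# Toy model of Presto's modified GRO: merge packets per flowcell, so reordering
# *between* flowcells does not break merging. Illustrative only.

def presto_gro_batch(packets):
    """packets: iterable of (flowcell_id, start_seq, length)."""
    segments = {}   # flowcell_id -> [start_seq, length] of the merged segment
    for cell, seq, length in packets:
        if cell in segments and seq == segments[cell][0] + segments[cell][1]:
            segments[cell][1] += length          # contiguous within its flowcell
        else:
            segments[cell] = [seq, length]       # first packet seen for this flowcell
    # Push up one merged segment per flowcell, in sequence order.
    return sorted((start, length) for start, length in segments.values())

mtu = 1448
# Arrival order from the slides: P1 P2 P3 P6 P4 P7 P5 P8 P9,
# where P1-P5 belong to flowcell #1 and P6-P9 to flowcell #2.
order = [1, 2, 3, 6, 4, 7, 5, 8, 9]
packets = [(1 if p <= 5 else 2, (p - 1) * mtu, mtu) for p in order]
print(presto_gro_batch(packets))   # two large segments: P1-P5 and P6-P9
```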
Benefits: 1) large TCP segments are pushed up, which is CPU efficient; 2) packet reordering is masked from TCP below the transport layer.
Issue: how can we tell loss from reordering? Both create gaps in sequence numbers. Loss should be pushed up immediately; reordered packets should be held and put in order.
Loss vs Reordering
Presto sender: packets in one flowcell are sent on the same path (a 64KB flowcell takes ~51us on a 10G network).
Heuristic: a sequence number gap within a flowcell is assumed to be a loss.
Action: no need to wait; push up immediately.
[Animation: P2 of flowcell #1 is lost. The gap falls within flowcell #1, so GRO does not wait: it pushes up P1 and P3 – P5 immediately (and later P6 – P9), letting TCP detect the loss quickly.]
Benefits: most losses happen within a flowcell and are captured by this heuristic, so TCP can react quickly to losses.
Corner case: losses at flowcell boundaries.
[Animation: the packet at the flowcell boundary, P6 (the first packet of flowcell #2), is lost. Because the gap falls at a flowcell boundary, GRO cannot immediately tell loss from reordering: P1 – P5 are pushed up, while P7 – P9 are held and pushed up only after an adaptive timeout, an estimation of the extent of reordering.]
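The decision rule can be sketched as follows: a gap inside a flowcell is flushed immediately (treated as loss), while a gap at a flowcell boundary is held until an adaptive timeout. This is only a sketch under the slides' assumptions; the timeout value is a stand-in, not the paper's estimator, and the function name is hypothetical.

```python
# Toy decision rule for Presto's loss-vs-reordering heuristic.
# A gap inside a flowcell is treated as loss and flushed immediately; a gap at a
# flowcell boundary is treated as possible reordering and held for a timeout.
# The timeout value is a placeholder, not the paper's adaptive estimator.

def handle_gap(gap_inside_flowcell, waited_us, adaptive_timeout_us=100):
    """Return 'push_up_now' or 'hold' for packets after a sequence gap."""
    if gap_inside_flowcell:
        # Flowcell packets share a path, so an in-flowcell gap is very likely loss:
        # push up immediately so TCP can react quickly.
        return "push_up_now"
    # Boundary gap: the missing packet may still be in flight on another path.
    # Hold until the adaptive timeout (an estimate of the extent of reordering).
    return "push_up_now" if waited_us >= adaptive_timeout_us else "hold"

print(handle_gap(gap_inside_flowcell=True, waited_us=0))    # loss case: push_up_now
print(handle_gap(gap_inside_flowcell=False, waited_us=20))  # boundary: hold
print(handle_gap(gap_inside_flowcell=False, waited_us=150)) # timeout fired: push_up_now
```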
Evaluation
Implemented in OVS 2.1.2 & Linux kernel 3.11.0 (1500 LoC in the kernel).
Testbed: 8 IBM RackSwitch G8246 10G switches in a spine-leaf topology, 16 hosts.
Performance evaluation: compared with ECMP, MPTCP and Optimal; metrics: TCP RTT, throughput, loss, fairness and FCT.
Microbenchmark: Presto's effectiveness in handling reordering
[Plot: CDF of segment size (KB) at the receiver]
Stride-like workload; the sender runs Presto; the receiver varies (unmodified GRO vs Presto GRO).
Unmodified GRO: 4.6Gbps with 100% CPU of one core.
Presto GRO: 9.3Gbps with 69% CPU of one core (6% additional CPU overhead compared with the zero packet reordering case).
Evaluation: Throughput
[Plot: throughput (Mbps) across workloads]
Presto's throughput is within 1 – 4% of Optimal, even when network utilization is near 100%. In non-shuffle workloads, Presto improves upon ECMP by 38-72% and upon MPTCP by 17-28%.
Optimal: all hosts are attached to one single non-blocking switch.
Evaluation: Latency
[Plot: CDF of TCP round trip time (msec), stride workload]
Presto's 99.9th percentile TCP RTT is within 100us of Optimal and 8x smaller than ECMP's.
Additional Evaluation
Presto scales to multiple paths.
Presto handles congestion gracefully (loss rate, fairness index).
Comparison to flowlet switching.
Comparison to local, per-hop load balancing.
Trace-driven evaluation.
Impact of north-south traffic.
Impact of link failures.
Conclusion
Presto moves a network function, load balancing, out of datacenter network hardware into the software edge.
No changes to hardware or transport.
Performance is close to a giant switch.
The results are promising; we believe that other network functions can also be implemented at the software edge.
Thanks!