HULA: Scalable Load Balancing Using Programmable Data Planes

Presentation transcript:

HULA: Scalable Load Balancing Using Programmable Data Planes
Naga Katta (Princeton), Mukesh Hira (VMware), Changhoon Kim (Barefoot Networks), Anirudh Sivaraman (MIT), Jennifer Rexford (Princeton)

Datacenter Load Balancing
- Multiple network paths
- High bisection bandwidth
- Volatile traffic
- Multiple tenants

A Good Load Balancer
- Multiple network paths → track path performance, choose the best path
- High bisection bandwidth → fine-grained load balancing
- Volatile traffic → decisions in the dataplane
- Multiple tenants → load balancing in the network

CONGA (SIGCOMM’14)

Datapath LB: Challenges
- Tracks all paths → large FIBs
- Requires a custom ASIC

Programmable Commodity Switches
- Vendor-agnostic: uniform programming interface (P4)
- Today's trend → cheaper
- Reconfigurable in the field: adapt or add dataplane functionality
- Examples: RMT, Intel FlexPipe, Cavium XPliant, etc.

Programmable Switches - Capabilities
More than OpenFlow 1.x:
- Programmable parsing
- Match-action pipeline with per-stage memory
- Stateful memory (register arrays)
- Access to switch metadata

[Figure: a P4 program is compiled onto the switch pipeline: ingress, programmable parser, match-action stages each with their own memory, queue/buffer, egress, deparser.]
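To make these blocks concrete, here is a minimal P4_16 program for the v1model architecture (the talk predates P4_16, so treat this as a sketch rather than the authors' code; all names are illustrative). It shows each capability named on the slide: a programmable parser, a match-action table, stateful memory via a register array, and use of switch metadata.

#include <core.p4>
#include <v1model.p4>

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}

struct headers_t  { ethernet_t ethernet; }
struct metadata_t { }

parser MyParser(packet_in pkt, out headers_t hdr,
                inout metadata_t meta,
                inout standard_metadata_t std_meta) {
    state start {
        pkt.extract(hdr.ethernet);        // programmable parsing
        transition accept;
    }
}

control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

control MyIngress(inout headers_t hdr, inout metadata_t meta,
                  inout standard_metadata_t std_meta) {
    register<bit<32>>(512) pkt_count;     // stateful memory (register array)

    action forward(bit<9> port) {
        std_meta.egress_spec = port;      // switch metadata drives forwarding
    }
    action drop_pkt() { mark_to_drop(std_meta); }

    table l2_fwd {                        // one match-action stage
        key = { hdr.ethernet.dstAddr : exact; }
        actions = { forward; drop_pkt; }
        default_action = drop_pkt();
    }

    apply {
        l2_fwd.apply();
        // Count packets per ingress port; the register persists across packets.
        bit<32> cnt;
        pkt_count.read(cnt, (bit<32>)std_meta.ingress_port);
        pkt_count.write((bit<32>)std_meta.ingress_port, cnt + 1);
    }
}

control MyEgress(inout headers_t hdr, inout metadata_t meta,
                 inout standard_metadata_t std_meta) { apply { } }

control MyComputeChecksum(inout headers_t hdr, inout metadata_t meta) { apply { } }

control MyDeparser(packet_out pkt, in headers_t hdr) {
    apply { pkt.emit(hdr.ethernet); }
}

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;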

HULA: Hop-by-hop Utilization-aware Load-balancing Architecture
- Distance-vector-like propagation: periodic probes carry path utilization
- Each switch chooses the best downstream path and maintains only the best next hop → scales to large topologies
- Programmable at line rate, written in P4

HULA: Scalable and Programmable

Objective                        | P4 feature
Probe propagation                | Programmable parsing
Monitor path performance         | Link state metadata
Choose best path, route flowlets | Stateful memory and comparison operators

Probes carry path utilization

Let's dig a little deeper. In HULA, probes originate at the ToRs (sent by the servers or by the ToRs themselves) and are replicated onto multiple paths as they travel up through the aggregation switches to the spines. Each switch replicates only the necessary set of probes to its neighbors, so the probe overhead stays minimal.

P4 primitives used: a new header format, programmable parsing, and read/write access to packet metadata.
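As a sketch of what that probe header might look like in P4_16, extending the skeleton above (field names and widths are my assumptions; per the next slide, the probe carries the originating ToR ID and the maximum utilization seen along the path):

header hula_t {
    bit<24> dst_tor;    // ToR that originated the probe
    bit<8>  path_util;  // max link utilization seen along the path so far
}

// Parser fragment: hula_t would be added to headers_t, and the parser
// would branch here on some demux field, e.g. a reserved IP protocol
// number (an assumption, not the paper's encoding).
state parse_hula {
    pkt.extract(hdr.hula);
    transition accept;
}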

Probes carry path utilization

[Figure: probes originated by ToR 10 reach switch S1 (which serves ToR 1) through different neighbors, each carrying the maximum utilization seen along its path: e.g., 80% via S2, 60% via S3, 50% via S4.]

Each switch identifies the best downstream path

The switch takes the minimum among the probe-reported path utilizations and stores it in its local table: here S1 learns that the best path to ToR 10 goes through S4 at 50% utilization. S1 then sends its own view of the best path to its upstream switches, which process these probes and update their neighbors in turn. This gives a distance-vector-style propagation of utilization information across the network, while each switch keeps only the best next hop per destination.

Best hop table at S1:
Dst    | Best hop | Path util
ToR 10 | S4       | 50%
ToR 1  | S2       | 10%
...    |          |
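A sketch of this probe processing in P4_16/v1model, as a fragment of the ingress control from the earlier skeleton (register names and sizes, and the local-utilization source meta.local_util, are assumptions; the comparison logic follows the slide):

register<bit<8>>(1024) best_path_util;  // per-destination-ToR: util of best path
register<bit<9>>(1024) best_hop;        // per-destination-ToR: best next hop (port)

apply {
    if (hdr.hula.isValid()) {
        bit<32> idx = (bit<32>)hdr.hula.dst_tor;

        // A path's utilization is the max over its links, so fold in the
        // local link utilization (assumed to be computed by a separate
        // estimator and placed in metadata).
        if (hdr.hula.path_util < meta.local_util) {
            hdr.hula.path_util = meta.local_util;
        }

        // Keep only the least-utilized downstream path per destination ToR.
        // The probe arrived on the port that leads back toward its origin.
        bit<8> cur_best;
        best_path_util.read(cur_best, idx);
        if (hdr.hula.path_util < cur_best) {
            best_path_util.write(idx, hdr.hula.path_util);
            best_hop.write(idx, std_meta.ingress_port);
        }

        // Advertise our current best view when replicating the probe upstream.
        best_path_util.read(hdr.hula.path_util, idx);
    }
}

The paper's full update rule is slightly richer: a probe arriving via the current best hop updates the stored utilization even when it is worse, so a degrading path can be displaced; the sketch omits that detail.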

Switches load balance flowlets

In the opposite direction, the switches route data packets by splitting flows into smaller groups of packets called flowlets. Flowlets are subsequences of packets within a flow separated by a large enough inter-packet gap that, even when different flowlets are sent on different paths, the destination is unlikely to see packet reordering.

P4 primitives used: read/write access to stateful memory; comparison and arithmetic operators.

[Figure: a data packet from ToR 1 to ToR 10 arrives at S1, which forwards it according to the flowlet and best hop tables.]

Flowlet table:
Dest   | Timestamp | Next hop
ToR 10 | 1         | S4
...    |           |

Best hop table:
Dest   | Best hop | Path util
ToR 10 | S4       | 50%
ToR 1  | S2       | 10%
...    |          |

P4 code snippet:

    // A gap larger than THRESH starts a new flowlet: rebind it to the
    // current best next hop for the packet's destination ToR.
    if (curr_time - flowlet_time[flow_hash] > THRESH) {
        flowlet_hop[flow_hash] = best_hop[packet.dst_tor];
    }
    metadata.nxt_hop = flowlet_hop[flow_hash];
    flowlet_time[flow_hash] = curr_time;
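The snippet above is pseudocode; one way to realize it with v1model registers in P4_16 is sketched below. FLOWLET_GAP, the table size, meta.dst_tor, and the IPv4/TCP headers are assumptions; best_hop is the register from the previous sketch.

const bit<48> FLOWLET_GAP = 48w500;     // inter-packet gap in µs; value illustrative

register<bit<48>>(65536) flowlet_time;  // last-seen timestamp per flowlet slot
register<bit<9>>(65536)  flowlet_hop;   // chosen next hop per flowlet slot

apply {
    // Hash the 5-tuple into a flowlet slot.
    bit<32> slot;
    hash(slot, HashAlgorithm.crc16, 32w0,
         { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.ipv4.protocol,
           hdr.tcp.srcPort, hdr.tcp.dstPort },
         32w65536);

    bit<48> last_seen;
    flowlet_time.read(last_seen, slot);
    bit<9> hop;
    flowlet_hop.read(hop, slot);

    // A large enough gap starts a new flowlet: rebind it to the current
    // best next hop for the destination ToR.
    if (std_meta.ingress_global_timestamp - last_seen > FLOWLET_GAP) {
        best_hop.read(hop, (bit<32>)meta.dst_tor);
        flowlet_hop.write(slot, hop);
    }

    flowlet_time.write(slot, std_meta.ingress_global_timestamp);
    std_meta.egress_spec = hop;
}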

Evaluated Topology

[Figure: a leaf-spine topology with spines S1 and S2, aggregation switches A1–A4, and leaf switches L1 and L2 with 8 servers per leaf; links are 40 Gbps; one link is failed for the failure experiments.]

Evaluation Setup
- NS2 packet-level simulator
- RPC-based workload generator
- Empirical flow size distributions: web search and data mining
- End-to-end metric: average flow completion time (FCT)

Compared with
- ECMP: flow-level hashing at each switch
- CONGA': CONGA within each leaf-spine pod, ECMP on flowlets for traffic across pods (based on communication with the CONGA authors)

HULA handles high load much better

HULA keeps queue occupancy low

HULA is stable on link failure

HULA - Summary
- Scalable to large topologies: HULA distributes congestion state
- Adaptive to network congestion: proactive path probing
- Reliable when failures occur
- Programmable in P4!

This leads to a design that is topology-oblivious: it makes no assumptions about symmetry or the number of hops in the network. Distributing only the relevant utilization state to the switches avoids concentrating a huge amount of state at the sender. Since each switch makes its routing decision based on the utilization information carried by the probes, no separate source-routing mechanism is necessary. Most interestingly, all of this logic is programmable in P4; I can discuss the details offline if you are interested.

Backup

HULA: Scalable, Adaptable, Programmable LB Scheme

[Figure: comparison of ECMP, SWAN/B4, MPTCP, CONGA, and HULA along five axes: congestion aware, application agnostic, dataplane timescale, scalable, and programmable dataplanes; HULA satisfies all five.]