Enabling Programmable Infrastructure for Multi-Tenant Data Centers

1 Enabling Programmable Infrastructure for Multi-Tenant Data Centers
Muhammad Shahbaz (FPO Talk) Adviser: Nick Feamster Readers: Jen Rexford and Ben Pfaff (VMware) Examiners: Mike Freedman and Wyatt Lloyd Hi, my name is Shahbaz. Today, I am going to talk about my work on enabling programmable infrastructure for multi-tenant data centers. (next slide)

2 Multi-Tenant Data Centers
Host hundreds of thousands of Tenants Workloads Data centers have evolved over the years both in scale and complexity. These mega-scale data centers (click) host hundreds of thousands of tenants, each of which runs (click) workloads (click) supporting a number of different services. (next slide)

3 Multi-Tenant Data Centers
Infrastructure Services Load Balancing Advanced Telemetry e.g., L4 and utilization-aware LB e.g., In-band Network Telemetry (INT) Enhanced Routing e.g., source-controlled routing Physical-to-Virtual Gateway To operate at such scale and complexity, data-center operators need to constantly introduce (click) new infrastructure services in their network such as (click) load balancing, advanced telemetry, enhanced routing and switching, physical-to-virtual gateways, security and more. (next slide) e.g., GRE, VXLAN, and STT Security e.g., Firewall

4 Multi-Tenant Data Centers
Infrastructure Services Load Balancing Advanced Telemetry e.g., L4 and utilization-aware LB e.g., In-band Network Telemetry (INT) Enhanced Routing e.g., source-controlled routing Programmable Control Physical-to-Virtual Gateway In order to do so, they require (click) programmable (flexible) control over their infrastructure. (next slide) e.g., GRE, VXLAN, and STT Security e.g., Firewall

5 Multi-Tenant Data Centers
(next slide)

6 Multi-Tenant Data Centers
Fixed-function switches However, these data-center networks, today, are made up of (click) fixed-function switches (both hardware and software). It’s difficult to make changes to these switches. (next slide) vSwitch vSwitch vSwitch vSwitch

7 Multi-Tenant Data Centers
Fixed-function switches 1. Require a forklift upgrade 2. Wait on vendors for new switches For example, in the case of hardware switches, (click) making a change requires a forklift upgrade, and only after waiting on switch vendors for multiple years to make that change available in their new switches. For example, it took VXLAN four years to become available in mainstream switches, and it is now the most dominant feature used in data centers. (next slide) vSwitch vSwitch vSwitch vSwitch

8 Multi-Tenant Data Centers
Fixed-function switches Similarly, making changes to software switches, running inside the servers, is also not an easy task. One might assume that, since these switches are defined in software, (next slide) vSwitch vSwitch vSwitch vSwitch

9 Multi-Tenant Data Centers
Fixed-function switches it should be *very* easy to change their behavior. (next slide) vSwitch vSwitch vSwitch vSwitch Easy?

10 Multi-Tenant Data Centers
Fixed-function switches Well, that’s not the case. Most of the logic that enables fast packet forwarding in these software switches resides in the kernel. Writing kernel code requires domain expertise that most network operators lack, introducing significant barriers for developing and deploying new features. (click) Thus, as in the case of hardware switches, network operators would have to wait for the primary switch developers (e.g., OVS team) to add their requested change to the codebase. After that it can take up to a year before the change actually becomes available in the mainstream code. (next slide) vSwitch vSwitch vSwitch vSwitch Easy? Wait on switch developers for new changes

11 Multi-Tenant Data Centers
Programmable Control This fixed-function nature of these switches, (next slide)

12 Multi-Tenant Data Centers
Programmable Control limits the control that data-center operators can have over their network. This makes it hard for operators to quickly improve their networks and innovate as workload requirements change. (next slide)

13 Multi-Tenant Data Centers
Fixed-function switches Emerging data-plane solutions have started replacing these existing fixed-function hardware switches (next slide) vSwitch vSwitch vSwitch vSwitch

14 Multi-Tenant Data Centers
Programmable switches with programmable switches. These switches allow customizing their behavior (next slide) vSwitch vSwitch vSwitch vSwitch

15 Multi-Tenant Data Centers
Programmable switches Language using a data-plane language like P4. (next slide) vSwitch vSwitch vSwitch vSwitch

16 Multi-Tenant Data Centers
Programmable switches Language Routing: switch.p4 router.p4 source-routing.p4 Load Balancing: L4L7_load_balancer.p4 conga.p4 hula.p4 Monitoring: int.p4 marple.p4 sonata.p4 Advanced Apps: ddos_detection.p4 Proprietary Apps … This allows network operators to add customized features as new P4 programs. (next slide) vSwitch vSwitch vSwitch vSwitch

17 Multi-Tenant Data Centers
Programmable switches Language However, software switches are still fixed function. In this thesis, my first contribution is (next slide) Still fixed-function switches vSwitch vSwitch vSwitch vSwitch

18 My Thesis Contributions
Language a programmable software switch, called PISCES, that allows customizations using P4 without requiring direct modifications to switch source code. (next slide) PISCES PISCES PISCES PISCES a. A Programmable Software Switch: PISCES

19 Multi-Tenant Data Centers
Programmable Control With this network operators can now have (next slide)

20 Multi-Tenant Data Centers
Programmable Control complete (programmable) control over all switches in their network. (next slide) a. Software Switch: PISCES

21 Multi-Tenant Data Centers
Language And can use a single language to express customizations for both software and hardware switches. (next slide) PISCES PISCES PISCES PISCES

22 Multi-Tenant Data Centers
However, switch-level programmability alone is not sufficient to help build infrastructure services. These services run in a distributed manner using switches spanning the entire data-center network. If implemented naively, these services can lead to poor performance and can exhaust the already limited resources of a data center (point out these resources, e.g., flow-table sizes, network bandwidth, etc.). Fortunately, the design of modern multi-tenant data centers presents unique, domain-specific characteristics that network operators can exploit to implement efficient infrastructure services (point out some characteristics, e.g., symmetric topologies and short paths). (next slide) PISCES PISCES PISCES PISCES

23 Characteristics of Data-Center Topologies
Core - Symmetric Spine Leaf Data-center topologies tend to be symmetric, having a tiered structure consisting of core, spine, and leaf layers. (next slides) Hypervisor PISCES PISCES PISCES PISCES Processes: VMs, containers, etc.

24 Characteristics of Data-Center Topologies
Core - Symmetric - Short Paths Spine Leaf They have a limited number of switches on any individual path. (next slide) Hypervisor PISCES PISCES PISCES PISCES Processes: VMs, containers, etc.

25 Characteristics of Data-Center Topologies
Core - Symmetric - Short Paths - Co-located Placement Spine Leaf Finally, to minimize internal bandwidth use, tenant virtual machines (VMs) tend to cluster in the same part of the data-center topology. In my second contribution, we show how programmable switches help build scalable services by exploiting these unique characteristics of multi-tenant data centers. (next slide) Hypervisor PISCES PISCES PISCES PISCES Processes: VMs, containers, etc.

26 My Thesis Contributions
b. A Scalable Infrastructure Service for Multicast: Elmo To enable in-network multicast in multi-tenant data centers: Using both programmable software and hardware switches And exploiting unique characteristics of data-center networks. For that, we build an example scalable infrastructure service, called Elmo, (click) that enables in-network multicast in data centers using both programmable software and hardware switches. (next slide) PISCES PISCES PISCES PISCES

27 My Thesis Contributions
b. Infrastructure Service: Elmo Programmable Control With Elmo, we demonstrate that along with having programmable control, it’s equally important to have efficient mechanisms for implementing scalable infrastructure services. (next slide) a. Software Switch: PISCES

28 a. PISCES: A Programmable, Protocol-Independent Software Switch
SIGCOMM’16 a. PISCES: A Programmable, Protocol-Independent Software Switch Muhammad Shahbaz1 Sean Choi2, Ben Pfaff3, Changhoon Kim4, Nick Feamster1, Nick McKeown2, and Jen Rexford1 Let me first discuss my first contribution PISCES … (next slide) 1. Princeton, 2. Stanford, 3. VMware, 4. Barefoot Networks

29 Software Switches Language vSwitch vSwitch vSwitch vSwitch
(next slide) vSwitch vSwitch vSwitch vSwitch

30 Software Switches Fast Packet Forwarding OVS vSwitch vSwitch
Software switches, like Open vSwitch, (click) are built on a large, complex codebase that sets up the machinery for fast packet forwarding (next slide) vSwitch

31 Software Switches DPDK Kernel NetDev … OVS vSwitch
like Kernel, DPDK, and more. (next slide)

32 Software Switches Packet Processing Logic OVS DPDK Kernel NetDev …
vSwitch Packet Processing Logic OVS DPDK Kernel NetDev And at the same time, a programmer has to specify the logic for packet processing (next slide)

33 Match-Action Pipeline
Software Switches vSwitch Parser Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev like parser and match-action pipeline, (click) using methods and interfaces exposed by the underlying machinery. (next slide)

34 Match-Action Pipeline
Software Switches Requires domain expertise in: Network protocol design Software development vSwitch Develop Test Deploy Parser Match-Action Pipeline … large, complex codebases. OVS Complex APIs DPDK Kernel NetDev Thus, making changes in these switches is a formidable undertaking requiring expertise in (click) network protocol design and (click) software development with the ability to develop, test, and deploy a large, complex codebase. (click) It can take up to 3 to 6 months to push a new feature into the mainstream code. And then it can take years to push the code into mainline Linux distributions like Ubuntu and Fedora. (click) Furthermore, customization requires not only incorporating changes into switch code, but also maintaining these customizations across different versions of the switch. (next slide) Can take 3-6 months to get a new feature in. Maintaining changes across releases

35 Match-Action Pipeline
Software Switches vSwitch Parser Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev As protocol designers we are interested in specifying how to parse packet headers and the structure of the match-action tables (i.e., which header fields to match and which actions to perform on matching headers). (next slide)

36 Match-Action Pipeline
Software Switches vSwitch Parser Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev We do not need to understand the complexities of the underlying codebases and the complex APIs that they expose to enable fast packet forwarding. (next slide)

37 Match-Action Pipeline
Software Switches vSwitch Parser Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev So what should we do about this? How can we enable protocol designers to specify only the packet processing logic without worrying about the underlying machinery? To address this issue, (next slide)

38 PISCES: A Programmable Software Switch
Parser Match-Action Pipeline PISCES Parser Parser Match-Action Pipeline Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev we present PISCES, a programmable switch that (click) separates the packet processing logic from the underlying forwarding machinery and (next slide)

39 PISCES: A Programmable Software Switch
Parser Match-Action Pipeline DSL PISCES Parser Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev lets protocol designers specify it using a high-level domain-specific language (DSL) without requiring direct modifications to the underlying switch source code. (next slide)

40 PISCES: A Programmable Software Switch
Parser Match-Action Pipeline P4 is an open-source language.[1] Describes different aspects of a packet processor: Packet headers and fields Metadata Parser Actions Match-Action Tables (MATs) Control Flow PISCES Parser Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev For this work, we choose P4: a data-plane language for programming protocol-independent packet processors. (click) It’s an open-source language that existing hardware switch vendors are using to specify the behavior of their switches. Thus, to provide a unified abstraction for all switches in the network, we choose P4 for our switch as well. Also, it’s a domain-specific language that is easier for network operators to work with, allowing them to describe key aspects of a packet processor, e.g., packet headers and fields, parser, tables, actions, and control flow of the match-action pipeline. (next slide) [1]

41 PISCES: A Programmable Software Switch
Parser Match-Action Pipeline PISCES Parser Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev And, we choose OVS as our switch target; a widely used hypervisor switch in today’s multi-tenant data centers. We modify OVS so that it acts as a generic engine whose behavior is described via the DSL. (next slide)
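To make the idea of a generic, specification-driven engine concrete, here is a minimal conceptual sketch in Python (not PISCES code and not P4 syntax): the parse layout and the match-action table are plain data, so changing the protocol means editing a specification rather than the engine's source. All names, field widths, and the action strings below are illustrative assumptions.

```python
# Conceptual sketch only: a tiny "protocol-independent" engine whose parse
# layout and match-action table are supplied as data, not hard-coded in C.

HEADER_SPECS = {
    # header name -> ordered (field, width-in-bytes) pairs (illustrative)
    "ethernet": [("dst", 6), ("src", 6), ("ethertype", 2)],
}

def parse(packet, header):
    """Slice raw bytes into named fields according to the spec."""
    fields, offset = {}, 0
    for name, width in HEADER_SPECS[header]:
        fields[name] = packet[offset:offset + width]
        offset += width
    return fields

def apply_table(fields, table):
    """A match-action table is just data: (match-dict, action-name) pairs."""
    for match, action in table:
        if all(fields.get(k) == v for k, v in match.items()):
            return action
    return "drop"

# Example "program": send packets for one destination MAC out of port 1.
l2_table = [({"dst": bytes.fromhex("aabbccddeeff")}, "output:1")]
pkt = bytes.fromhex("aabbccddeeff" "112233445566" "0800") + b"payload"
print(apply_table(parse(pkt, "ethernet"), l2_table))   # -> output:1
```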

42 PISCES: A Programmable Software Switch
Parser Match-Action Pipeline PISCES Compiler Parser Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev A P4-to-OVS compiler compiles the P4 code to OVS. It generates (click) the parser and match-action pipeline code needed to (click) build the corresponding switch executable. (next slide) Executable

43 Research Goals Quantify reduction in complexity, i.e., is expressing customizations in P4 easier than making direct modifications to the OVS source code? Performance optimizations, i.e., is there any overhead in compiling P4 programs to OVS? If so, can we mitigate these overheads via compiler optimizations? In this work, (click) our first research goal is to quantify the reduction in complexity, i.e., whether expressing customizations in P4 is easier than direct modifications to the OVS source code. (click) Our second goal is to measure the additional performance overhead of compiling a P4 program to OVS and to build optimizations that reduce it. (next slide)

44 Quantifying Reduction in Complexity
We evaluate two categories of complexity: Development complexity of developing baseline packet processing logic for a software switch Change complexity of making changes and maintaining an existing software switch We evaluate two categories of complexity. (click) The first one is the development complexity, i.e., the amount of effort needed in building baseline packet processing logic of a software switch (or in other words building a switch from scratch). (click) The second one is the change complexity i.e., the effort needed in making and maintaining a change in an existing switch. (next slide)

45 Quantifying Reduction in Complexity
Development complexity 40x 20x We measure the development complexity by comparing the native OVS with the equivalent baseline functionality implemented in PISCES. We use three metrics: (click) lines of code, (click) method count, and (click) average method size. (Note that these measurements only consider the set of code that is responsible for match, parse and action.) We see that PISCES reduces the lines of code by (click) a factor of 40 and the average method size by (click) a factor of 20. (next slide) (show a comparison of P4 and C/C++, show self contained in there)

46 Quantifying Reduction in Complexity
Change complexity For the change complexity, (click) we compare the effort required in adding support for a new header field in a protocol that is otherwise already supported in OVS and in PISCES. We add support for three fields (…) and measure how many lines and files need to be changed in adding those fields. We see that modifying just a few lines of code in a single P4 file in PISCES is sufficient to support a new field, whereas in OVS, the corresponding change often requires hundreds of lines of changes over tens of files. (next slide)
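To illustrate the spirit of this change-complexity experiment (the specific fields measured in the paper are not reproduced here), consider the conceptual Python engine sketched earlier: supporting a new, hypothetical field is a self-contained, few-line edit to the header specification, whereas in a hand-written switch the same field typically touches parsing, flow-key, and action code spread across many files. The "tenant_label" field below is invented purely for illustration.

```python
# Hypothetical example only: a tagged variant of the Ethernet header that
# carries an extra 16-bit "tenant_label" field. In a spec-driven switch,
# this declaration is the entire change needed for the parser to expose it.
HEADER_SPECS = {
    "ethernet":        [("dst", 6), ("src", 6), ("ethertype", 2)],
    "ethernet_tagged": [("dst", 6), ("src", 6), ("tenant_label", 2), ("ethertype", 2)],
}
```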

47 Quantifying Reduction in Complexity
So, does PISCES reduce complexity? The resulting code simplicity makes it easier to: Implement Deploy Maintain custom software switches Yes! So, does PISCES reduce complexity? [click] Yes, it does. [click] The resulting code simplicity should make it easier for protocol designers to implement, deploy, and maintain custom software switches.

48 Performance Optimizations
Our second research goal is performance optimizations. (next slide)

49 Performance Optimizations
Parser Match-Action Pipeline PISCES Compiler Performance overhead? Parser Match-Action Pipeline OVS Complex APIs DPDK Kernel NetDev As we are compiling a P4 program to our modified OVS switch, we want to see if doing so incurs any (click) overhead on performance. (next slide) Executable

50 Naïve Compilation from P4 to OVS (L2L3-ACL)
Performance overhead of 40% We measure this overhead by comparing the performance of the native OVS---a hand-written software switch---to PISCES. We see that a naïve compilation of our benchmark application (which is essentially a router with an access-control list) shows (click) that PISCES has a performance overhead of about 40% compared to the native OVS. To understand the causes behind this overhead, let us first look at the forwarding model that both P4 and OVS support. (next slide)

51 (Post-Pipeline Editing)
P4 Forwarding Model (Post-Pipeline Editing) Ingress Packet Parser Egress Checksum Verify Packet Deparser Checksum Update Match-Action Tables In the P4 packet forwarding model, (click) a packet parser (click) identifies the headers and (click) extracts them as packet header fields, essentially making a copy of the content of the packets. (click) The checksum verify block then (click) verifies the checksum based on the header fields specified in the P4 program. (click) The match-action tables (click) operate on these header fields. (click) A checksum update block (click) updates the checksum. (click) Finally, a packet deparser (click) writes the changes from these header fields back onto the packet before (click) sending it out of the egress port. (click) We call this mode of operating on header fields “post-pipeline editing.” (next slide) Header Fields

52 OVS Forwarding Model Slow-Path Fast-Path Ingress Packet Parser
Whereas, in OVS packet forwarding model, the processing is divided into a slow- and fast-path. (click) A packet parser, in the fast path, only (click) identifies the headers. (next slide) Ingress Packet Parser

53 OVS Forwarding Model Egress Match-Action Tables Slow-Path Fast-Path
Flow Rule Miss The packet is then looked up in a match-action cache. If there is a (click) miss, the packet is sent to the match-action tables (that form the actual switch pipeline). (click) A new flow rule is calculated and installed in the match-action cache. (click) And the original packet, as processed by the match-action pipeline, is sent to the egress. (next slide) Match-Action Cache Ingress Packet Parser

54 OVS Forwarding Model Egress Match-Action Tables Slow-Path Fast-Path
Next time, when another (click) packet belonging to the same flow enters the switch, (click) the parser identifies the headers as before. (next slide) Match-Action Cache Ingress Packet Parser

55 OVS Forwarding Model Match-Action Tables Egress Slow-Path Fast-Path
Hit This time the cache will result in a hit, and (click) the packet is processed and sent to the egress without traversing the match-action pipeline. (next slide) Match-Action Cache Ingress Packet Parser Egress
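A conceptual Python sketch of this fast-path/slow-path split follows (illustrative only; real OVS caching, with its microflow and megaflow caches, is far more involved, and the flow-key and action strings here are assumptions):

```python
# Conceptual sketch of a caching forwarding model: a miss consults the full
# match-action pipeline (slow path) and installs a flow rule in the cache;
# later packets of the same flow hit the cache and skip the pipeline.

class CachingSwitch:
    def __init__(self, pipeline):
        self.pipeline = pipeline   # slow path: full match-action tables
        self.cache = {}            # fast path: exact-match flow cache

    def process(self, flow_key):
        if flow_key in self.cache:                # hit: stay on the fast path
            return self.cache[flow_key]
        actions = self.pipeline(flow_key)         # miss: run the full pipeline
        self.cache[flow_key] = actions            # install a rule in the cache
        return actions

sw = CachingSwitch(pipeline=lambda key: f"output:{hash(key) % 4}")
sw.process(("10.0.0.1", "10.0.0.2", 80))   # miss -> slow path, rule installed
sw.process(("10.0.0.1", "10.0.0.2", 80))   # hit  -> fast path only
```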

56 (Inline Editing) OVS Forwarding Model Match-Action Tables Egress
Slow-Path Fast-Path In OVS, tables directly operate on the headers inside the packet (i.e., no copy is maintained). (click) We call this mode of operating on packet header fields “inline editing.” (next slide) Match-Action Cache Ingress Packet Parser Egress

57 PISCES Forwarding Model (Modified OVS)
Supports both editing modes: Inline Editing Post-pipeline Editing Match-Action Tables Slow-Path Fast-Path In order to map P4 to OVS, we modified OVS to provide support for (click) both post-pipeline and inline editing modes. We call this modified OVS model, a PISCES forwarding model. (next slide) Match-Action Cache Ingress Packet Parser Packet Deparser Egress Checksum Verify Checksum Update
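The difference between the two editing modes can be sketched in a few lines of Python (purely conceptual, not PISCES code; the packet layout is an assumption). Post-pipeline editing pays for an extra copy of the header fields plus a deparse step, while inline editing touches the packet buffer on every action; roughly, which mode is cheaper depends on how much the program modifies headers, which is why the choice shows up among the optimizations discussed shortly.

```python
# Conceptual contrast of the two header-editing modes (illustrative only).

def post_pipeline_editing(packet):
    fields = {"eth_dst": bytes(packet[0:6])}   # parser copies fields out
    fields["eth_dst"] = b"\xff" * 6            # actions edit the copy
    packet[0:6] = fields["eth_dst"]            # deparser writes the copy back
    return packet

def inline_editing(packet):
    packet[0:6] = b"\xff" * 6                  # actions edit the packet in place
    return packet

pkt = bytearray.fromhex("aabbccddeeff112233445566" "0800") + b"payload"
assert post_pipeline_editing(pkt[:]) == inline_editing(pkt[:])
```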

58 performance overhead is
Compiling P4 to OVS Ingress Packet Parser Packet Deparser Egress Checksum Verify Checksum Update Match-Action Tables P4 Naïve compilation performance overhead is 40% Match-Action Tables So, now the problem is how to efficiently compile the P4 forwarding model (click) to this modified OVS forwarding model. (click) As mentioned earlier, the naïve compilation has a performance overhead of 40%. (next slide) OVS Match-Action Cache Ingress Packet Parser Packet Deparser Egress Checksum Verify Checksum Update

59 Causes of Performance Overhead
Match-Action Tables Cache Misses Match-Action Cache Ingress Packet Parser Packet Deparser Egress Checksum Verify Checksum Update We observe that there are two main aspects that significantly affect the performance of PISCES. (click) The first one is the number of CPU cycles consumed in processing a single packet. And the second one is the number of cache misses. CPU Cycles per Packet

60 Factors Affecting CPU Cycles
To understand the causes of CPU cycles per packet, we looked at the cycles consumed by each component of the forwarding model (in the fast path) e.g., parser and match-action cache. … (next slide)

61 Factors Affecting CPU Cycles
Extra copy of headers Fully-specified checksums Parsing unused header fields and more … We studied different factors that affected the CPU cycles per-packet (click) like … (next slide)

62 Performance Optimizations
Per-Packet Cost (CPU Cycles) Inline vs. post-pipeline editing Incremental checksum Parser specialization Action specialization Action coalescing To mitigate these factors, we implement a number of optimizations to reduce the CPU cycles consumed per packet.
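As one example from this list, the arithmetic behind incremental checksum update (in the spirit of RFC 1624) can be sketched as follows. This is an illustration of the idea, not the PISCES compiler's actual code; the helper names are assumptions.

```python
def ones_complement_add(a, b):
    """16-bit one's-complement addition with end-around carry."""
    s = a + b
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return s

def incremental_checksum(old_cksum, old_field, new_field):
    """HC' = ~(~HC + ~m + m'): update the checksum for one changed 16-bit
    field instead of recomputing it over the entire header."""
    acc = ones_complement_add(~old_cksum & 0xFFFF, ~old_field & 0xFFFF)
    acc = ones_complement_add(acc, new_field & 0xFFFF)
    return ~acc & 0xFFFF

# Quick check: a two-word "header" 0x1111 0x2222 has checksum ~(0x3333) = 0xCCCC.
# Changing 0x1111 -> 0x1112 should give ~(0x3334) = 0xCCCB.
assert incremental_checksum(0xCCCC, 0x1111, 0x1112) == 0xCCCB
```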

63 Performance Optimizations
Per-Packet Cost (CPU Cycles) Cache Misses Inline vs. post-pipeline editing Incremental checksum Parser specialization Action specialization Action coalescing Stage assignment Cached field modifications We implement two optimizations called … to reduce the cache miss rate due to the previous two factors. (next slide)

64 Performance Optimizations for L2L3-ACL
Performance overhead of < 2% With all optimizations together, we are able to significantly reduce the cycle consumption of the parser and actions for our L2L3-ACL application, and there is a two-fold increase in the throughput compared to the un-optimized PISCES implementation. (click) … (click) The performance overhead between the optimized version of PISCES and the native OVS is now less than 2%. (next slide)

65 PISCES: A Programmable Software Switch
Programs written in PISCES, using P4, are 40 times more concise than native software code. With hardly any performance overhead! PISCES So, in summary … (click) PISCES is the first software switch that allows specifying custom packet-processing logic in a high-level DSL, called P4 PISCES programs are about 40 times more concise than the equivalent programs in native code (click) with hardly any performance cost! (next slide) @mshahbaz: point out that PISCES is being moved into the mainline OVS. OVS

66 b. Elmo: Source-Routed Multicast for Cloud Services
Resubmit to EuroSys’19 b. Elmo: Source-Routed Multicast for Cloud Services Muhammad Shahbaz1 Lalith Suresh2, Jen Rexford1, Nick Feamster1, Ori Rottenstreich3, and Mukesh Hira2 Let’s now look at the second contribution of my thesis, Elmo: Source-Routed Multicast for Cloud Services. @Shahbaz: connect with PISCES i.e., we show an infrastructure service that scales by exploiting programmable data planes and unique characteristics of data centers. 1. Princeton, 2. VMware, 3. Technion

67 Cloud Workloads Exhibit 1-to-Many Comm. Patterns
Cloud workloads are commonly driven by systems that deliver large amounts (next slide) vSwitch vSwitch vSwitch vSwitch

68 Cloud Workloads Exhibit 1-to-Many Comm. Patterns
of data to groups of endpoints. (next slide)

69 Cloud Workloads Exhibit 1-to-Many Comm. Patterns
Cloud providers support hundreds of thousands of tenants, each of which may run tens to hundreds of such workloads. (next slide) Tenants: … and more

70 Cloud Workloads Exhibit 1-to-Many Comm. Patterns
Distributed Programming Frameworks (e.g., Hadoop and Spark) Publish-Subscribe Systems (e.g., ZeroMQ and RabbitMQ) Replication (e.g., for Databases and state machines) Infrastructure Applications (e.g., VMware NSX and OpenStack) Streaming Telemetry (e.g., Ganglia Monitoring System) Common workloads include: (click) streaming telemetry where hosts continuously send telemetry data in incremental updates to a set of collectors, (click) replication for databases and state machines (e.g., in PAXOS or other consensus algorithms), (click) distributed programming frameworks (like Spark, and Hadoop), (click) publish-subscribe systems (like ZeroMQ and RabbitMQ) for publishing messages to multiple receivers, (click) infrastructure applications (like VMware NSX) running on top of a provider for replicating broadcast, unknown unicast, and multicast traffic. (click) and more (next slide) and more …

71 Cloud Workloads Exhibit 1-to-Many Comm. Patterns
Multicast These workloads naturally suggest the use of (click) native multicast, yet none of the data centers enable (next slide) vSwitch vSwitch vSwitch vSwitch

72 Cloud Workloads Exhibit 1-to-Many Comm. Patterns
Multicast it today. (next slide) vSwitch vSwitch vSwitch vSwitch

73 Approaches to Multicast
Good Bad Control Overhead Traffic Overhead No. of Groups End-host Overhead Tenant Isolation Bisection Bandwidth Run at Line Rate IP Multicast SDN-based Multicast with rule aggregation App. Layer Multicast Existing approaches to multicast are limited in one or more ways, making them infeasible for cloud deployments. For example, (click) IP Multicast faces lots of control-plane challenges with regard to stability under membership churn, is limited in the number of groups it can support, and can’t utilize the full bisection bandwidth in data centers. (click) SDN-based Multicast alleviates many of the control-plane issues but is still limited in the number of groups it can support. (click) Increasing the number of groups using flow aggregation introduces other overheads. (click) Application-layer solutions are the primary way of supporting multicast in multi-tenant data centers, today. However, these solutions introduce a lot of redundant traffic overhead, impose significant CPU load, and cannot operate at line rate with predictable latencies. Due to these limitations, certain classes of workloads (e.g., workloads introduced by financial apps) cannot use today’s cloud-based infrastructure at all. (click) Existing source-routed solutions (like Bloom Filters and BIER) encode forwarding state inside packets. But these solutions require unorthodox processing at the switches (like loops) and are not able to process multicast packets at line rate. (next slide) Source-Routed Multicast (e.g., Bloom Filters and BIER)

74 Approaches to Multicast
Good Bad Control Overhead Traffic Overhead No. of Groups End-host Overhead Tenant Isolation Bisection Bandwidth Run at Line Rate IP Multicast SDN-based Multicast with rule aggregation App. Layer Multicast Elmo, on the other hand, is a source-routed multicast solution designed to operate at line rate, introducing negligible traffic overhead while utilizing the entire bisection bandwidth of the data-center network. (next slide) @Shahbaz: point out that for a solution to be deployable in cloud data centers it needs to address all of the issues listed as columns, above. Source-Routed Multicast (e.g., Bloom Filters and BIER) Elmo

75 Elmo: Source-Routed Multicast for Cloud Services
(next slide)

76 Elmo: Source-Routed Multicast for Cloud Services
In Elmo, a (click) software switch encodes (click) the multicast forwarding policy inside packets, (click x2) and network switches read this policy to forward packets to the receivers. (next slide) vSwitch vSwitch vSwitch vSwitch

77 Elmo: Source-Routed Multicast for Cloud Services
Key challenges: How to efficiently encode multicast forwarding policy inside packets? How to process this encoding at line rate? Thus, there are two key challenges that we need to address … (next slide)

78 Elmo: Source-Routed Multicast for Cloud Services
Core - Symmetric - Short Paths - Co-located Placement Spine Leaf Elmo (click) exploits unique characteristics of data-center topologies and tenants’ VM placement to (click) efficiently encode multicast forwarding policy inside packets, and (click) modern programmable data planes to (click) process it at line rate. (next slide) Hypervisor PISCES PISCES PISCES PISCES Processes: VMs, containers, etc.

79 Encoding Multicast Tree
Key design decisions: Encoding switch output ports in a bitmap Encoding on the logical topology Sharing bitmap across switches Dealing with limited header space using default p-rules Reducing traffic overhead using s-rules There are (click) five key design decisions that enable Elmo to generate an efficient encoding of multicast trees, one that is both compact and simple for switches to process at line rate. (next slide)

80 Encoding Multicast Tree
(next slide) PISCES PISCES PISCES PISCES

81 1. Encoding Switch Output Ports in a Bitmap
1011 1011 1011 1011 Downstream S0 S1 S2 S3 S4 S5 S6 S7 10 10 01 01 11 11 L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 @mshahbaz: point out that we only generate the encoding for network switches (and not the software switches). We use (click) bitmaps for each network switch to specify which ports should forward the packet. Using this is desirable for two reasons: (click) (1) a bitmap is the internal data structure that network switches use to direct a packet to multiple output ports and (2) switches typically either have many output ports in the set or are not even part of the multicast tree. (next slide) 1. Bitmap is an internal data structure in switches for replicating packets 2. Switches typically have many ports participating in a multicast tree or none at all PISCES PISCES PISCES PISCES

82 1. Encoding Switch Output Ports in a Bitmap
1011 1011 1011 1011 Upstream Downstream S0 S1 S2 S3 S4 S5 S6 S7 Bitmap + Flag Bit 10 10 01 01 11 11 L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 Second, for switches in the upstream path of the packet (e.g., leaf and spine), we maintain a flag bit (along with the bitmap) to indicate that the switch should forward the packet upstream using the configured multipath scheme (e.g., ECMP or CONGA). (next slide) PISCES PISCES PISCES PISCES
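A small Python sketch of what a single p-rule entry carries, per the description above; the field names, types, and port numbering are illustrative assumptions, not Elmo's wire format.

```python
from dataclasses import dataclass

@dataclass
class PRuleEntry:
    port_bitmap: int              # bit i set -> replicate the packet out of port i
    multipath_flag: bool = False  # upstream switches: also pick an uplink (e.g., ECMP/CONGA)

def ports_to_bitmap(ports):
    bitmap = 0
    for p in ports:
        bitmap |= 1 << p
    return bitmap

leaf_rule = PRuleEntry(port_bitmap=ports_to_bitmap([0, 2, 3]))                   # downstream only
spine_rule = PRuleEntry(port_bitmap=ports_to_bitmap([1]), multipath_flag=True)   # and upstream
```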

83 2. Encoding on the Logical Topology
Core C0 C1 C2 C3 1011 1011 1011 1011 Spine S0 S1 S2 S3 S4 S5 S6 S7 10 10 01 01 11 11 Leaf L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 Next, we exploit the symmetry in data-center topologies. Core switches and spine switches, in each pod, share the same bitmap, hence, (next slide) PISCES PISCES PISCES PISCES

84 2. Encoding on the Logical Topology
Logical Core C0 C1 C2 C3 1011 P0 P1 P2 P3 Logical Spine S0 S1 S2 S3 S4 S5 S6 S7 10 01 11 Leaf L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 we group these switches together, as a logical switch, and maintain only one bitmap (and switch ID) for each of these logical switches. (next slide) PISCES PISCES PISCES PISCES

85 3. Sharing Bitmap Across Switches
Logical Core C0 C1 C2 C3 1011 P0 P1 P2 P3 Logical Spine S0 S1 S2 S3 S4 S5 S6 S7 10 01 11 Leaf L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 To further reduce the number of bitmaps we need to encode in the packet, we let switches on the same layer share a bitmap by allowing some extra transmissions. For example, (click) L5 and L7 can share a bitmap, which is the bitwise OR of their individual bitmaps (i.e., 001111). This would generate two extra transmissions: one at L5’s fourth port and one at L7’s sixth port. (click) We wrote a clustering algorithm (a greedy version of Min-K-Union) to determine which switches should share a bitmap while yielding the minimum number of extra transmissions. (next slide) Clustering Algorithm (a greedy version of Min-K-Union) to determine which switches should share a bitmap while minimizing extra transmissions. PISCES PISCES PISCES PISCES L5, L7 001111 p-rule
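Here is a sketch of that clustering step in Python, using the four leaf bitmaps from the example topology; this is a toy greedy flavor of Min-K-Union restricted to pairs, not Elmo's actual algorithm or parameters.

```python
from itertools import combinations

# Leaf switch -> output-port bitmap, as drawn on the slide.
leaves = {"L0": 0b100111, "L5": 0b001011, "L6": 0b101101, "L7": 0b001110}

def extra_transmissions(group):
    """Extra packet copies if every switch in `group` uses the OR of their bitmaps."""
    shared = 0
    for sw in group:
        shared |= leaves[sw]
    return sum(bin(shared).count("1") - bin(leaves[sw]).count("1") for sw in group)

# Greedily pick the cheapest pair of leaves to merge into one shared p-rule.
best = min(combinations(leaves, 2), key=extra_transmissions)
print(best, extra_transmissions(best))
# Both ("L5", "L7") and ("L0", "L6") tie at 2 extra transmissions here,
# matching the groupings chosen on the slides.
```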

86 4. Dealing with Limited Header Space using Default p-Rule
1011 P0 P1 P2 P3 S0 S1 S2 S3 S4 S5 S6 S7 10 01 11 L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 There is a fixed number of bitmaps we can encode in the packet header as p-rules. Thus, even after applying the techniques so far, we may still have some switches left. To handle these leftover switches, we introduce the notion of a default p-rule, whose bitmap is the bitwise OR of all the switch bitmaps not assigned a non-default p-rule. For example, if we have space for only one p-rule and allow only two extra transmissions per p-rule, (click) then L5 and L7 will share the p-rule. (click) L0 and L6 are the leftover switches, which will be assigned to the default p-rule, generating two more extra transmissions. (next slide) PISCES L0, L6 101111 PISCES PISCES PISCES L5, L7 001111 default p-rule p-rule

87 5. Reducing Traffic Overhead using s-Rules
1011 P0 P1 P2 P3 S0 S1 S2 S3 S4 S5 S6 S7 10 01 11 L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 In the limiting case, the default p-rule can lead to a lot of extra transmissions, essentially broadcasting packets from all the switches assigned to it. To reduce the traffic overhead without increasing the packet header size, we exploit the fact that switches already support multicast group tables. Thus, before assigning a switch to the default p-rule, we first check if the switch has space in its multicast group table. If so, we install an entry in the switch (i.e., an s-rule), and assign to the default p-rule only those switches that don’t have spare s-rule capacity. So, in our example, with an s-rule capacity of one, both leaf switches L0 and L6 will have an (next slide) PISCES L0, L6 101111 PISCES PISCES PISCES L5, L7 001111 Broadcast default p-rule p-rule

88 5. Reducing Traffic Overhead using s-Rules
1011 P0 P1 P2 P3 S0 S1 S2 S3 S4 S5 S6 S7 10 01 11 L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 s-rule entry now, instead of the default p-rule. (next slide) PISCES PISCES PISCES PISCES L5, L7 001111 s-rule entry each p-rule
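The assignment policy described across the last few slides (p-rules while header space lasts, then s-rules while switch tables have space, and the default p-rule as the last resort) can be sketched as follows; the function name, parameters, and data structures are illustrative, not Elmo's implementation.

```python
def assign_rules(switch_bitmaps, p_rule_budget, s_rule_capacity):
    """switch_bitmaps: {switch: bitmap}. Returns (p_rules, s_rules, default_p_rule)."""
    p_rules, s_rules, leftovers = {}, {}, {}
    for sw, bm in switch_bitmaps.items():
        if len(p_rules) < p_rule_budget:
            p_rules[sw] = bm                 # carried in the packet header
        elif s_rule_capacity.get(sw, 0) > 0:
            s_rules[sw] = bm                 # installed in the switch's group table
            s_rule_capacity[sw] -= 1
        else:
            leftovers[sw] = bm               # folded into the default p-rule
    default_p_rule = 0
    for bm in leftovers.values():
        default_p_rule |= bm                 # bitwise OR of everything left over
    return p_rules, s_rules, default_p_rule

# Mirroring the example: one p-rule (already shared by L5 and L7) and one spare
# s-rule entry each in L0 and L6, so the default p-rule ends up empty.
p, s, default_bm = assign_rules(
    {"L5,L7": 0b001111, "L0": 0b100111, "L6": 0b101101},
    p_rule_budget=1,
    s_rule_capacity={"L0": 1, "L6": 1},
)
print(p, s, bin(default_bm))
```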

89 Encoding Multicast Tree
1011 P0 P1 P2 P3 S0 S1 S2 S3 S4 S5 S6 S7 10 01 11 L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 So, after applying all the techniques, the encoding for our example will look like the following: (next slide) PISCES PISCES PISCES PISCES

90 Encoding Multicast Tree
1011 C0 C1 C2 C3 1011 P0 P1 P2 P3 00-1 P2,P3:11 P0:10 S0 S1 S2 S3 S4 S5 S6 S7 10 01 11 L5,L7:001111 L0 L1 L2 L3 L4 L5 L6 L7 L0,L6:101111 100111 001011 101101 001110 With one p-rule per layer (and no more than two extra transmissions per p-rule) and no s-rules, we will get the encoding like this … (click x5) @mshahbaz: point out that downstream p-rules remain the same and only the upstream p-rules are updated per sender. Also, how we don’t need to store ids for the upstream switches. (next slide) Sender PISCES PISCES PISCES PISCES Parameters: 1 p-rule per downstream layer (2 extra transmissions) and no s-rules Default p-rule

91 Encoding Multicast Tree
1011 C0 C1 C2 C3 1011 P0 P1 P2 P3 00-1 P2,P3:11 P0:10 S0 S1 S2 S3 S4 S5 S6 S7 10 01 11 L5,L7:001111 L0 L1 L2 L3 L4 L5 L6 L7 L0,L6:101111 100111 001011 101101 001110 (click) and with additional s-rule capacity, (next slide) Sender PISCES PISCES PISCES PISCES Parameters: 1 p-rule per downstream layer (2 extra transmissions) and 1 s-rule per switch Default p-rule

92 Encoding Multicast Tree
1011 C0 C1 C2 C3 1011 P0 P1 P2 P3 00-1 P2,P3:11 S0 S1 S2 S3 S4 S5 S6 S7 10 01 11 L5,L7:001111 L0 L1 L2 L3 L4 L5 L6 L7 100111 001011 101101 001110 spine switches in pod P0, and leaf switches (L0 and L6) now have an s-rule entry instead of the default p-rule. @mshahbaz: point out the difference b/w extra transmissions with and without s-rules. (next slide) Sender s-rule entry each PISCES PISCES PISCES PISCES Parameters: 1 p-rule per downstream layer (2 extra transmissions) and 1 s-rule per switch

93 Encoding Multicast Tree
Computes the encoding Encoding Multicast Tree Controller 1011 C0 C1 C2 C3 P0 P1 P2 P3 00-1 P2,P3:11 S0 S1 S2 S3 S4 S5 S6 S7 L5,L7:001111 L0 L1 L2 L3 L4 L5 L6 L7 A controller receives join requests from the tenant through (click) an API, provided by the cloud provider, and (click) computes a compact encoding for the multicast tree. The encoding consists of a list of packet rules (or p-rules), each containing one or more switch IDs and an output-port bitmap, along with a default p-rule; these are encapsulated as a header on the packet. It also includes switch rules (or s-rules) as a mechanism for reducing traffic overhead. (click) The controller installs flow rules on software switches specifying where to forward the packet and which p-rules to push onto it, (next slide) s-rule entry each PISCES vSwitch PISCES PISCES Elmo Header

94 Encoding Multicast Tree
Computes the encoding Encoding Multicast Tree Controller s-rules C0 C1 C2 C3 P0 P1 P2 P3 S0 S1 S2 S3 S4 S5 S6 S7 L0 L1 L2 L3 L4 L5 L6 L7 PISCES PISCES PISCES PISCES flow rules specifying: output ports and the Elmo header to push on the packet 00-1 1011 P2,P3:11 L5,L7:001111 Elmo Header

95 Encoding Multicast Tree
Computes the encoding Encoding Multicast Tree Controller s-rules C0 C1 C2 C3 P0 P1 P2 P3 S0 S1 S2 S3 S4 S5 S6 S7 L0 L1 L2 L3 L4 L5 L6 L7 The software switch intercepts the packet originating from the VM. It sends the packet as-is to (click) any neighboring VMs in the tree and (click) pushes the p-rules on the packet before forwarding it to the leaf switch (L0 in this case). (click x4) Switches in the network find a matching p-rule and forward the packet to the associated output ports. If a packet contains no matching p-rule, the switch checks for an s-rule matching the destination IP address and forwards the packet accordingly. (click) Finally, software switches receive the packet. By that time all p-rules have been removed from the packet by the network switches. Software switches then forward the packet to the receiving VMs. (next slide) PISCES PISCES PISCES PISCES flow rules specifying: output ports and the Elmo header to push on the packet 00-1 1011 P2,P3:11 L5,L7:001111 Elmo Header
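Putting the pieces together, the lookup order a network switch applies to an Elmo packet (a p-rule carrying its own ID, then its s-rule table, then the default p-rule) might look like this in conceptual Python; the real logic runs in the switch data plane, and all structures and values here are illustrative assumptions.

```python
def forwarding_bitmap(switch_id, header_p_rules, default_p_rule, s_rule_table, group_ip):
    """
    header_p_rules: list of (set_of_switch_ids, bitmap) carried in the packet.
    s_rule_table:   this switch's multicast group table, {group_ip: bitmap}.
    default_p_rule: fallback bitmap from the packet header (or None).
    """
    for switch_ids, bitmap in header_p_rules:   # 1. a p-rule naming this switch?
        if switch_id in switch_ids:
            return bitmap
    if group_ip in s_rule_table:                # 2. otherwise, this switch's s-rule
        return s_rule_table[group_ip]
    return default_p_rule                       # 3. last resort: the default p-rule

# Example: leaf L6 holds an s-rule for this group, so the default p-rule is ignored.
print(bin(forwarding_bitmap(
    switch_id="L6",
    header_p_rules=[({"L5", "L7"}, 0b001111)],
    default_p_rule=0b101111,
    s_rule_table={"239.1.1.1": 0b101101},
    group_ip="239.1.1.1",
)))   # -> 0b101101
```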

96 Evaluation We evaluated three main aspects of Elmo:
Scalability: Elmo scales to millions of multicast groups with minimal flow-table usage and control-plane update overhead on network switches. End-to-End Applications: Applications run unmodified and benefit from reduced CPU and bandwidth utilization. Hardware resource requirements: Elmo is inexpensive to implement in switching ASICs. add speaker notes. (next slide)

97 Elmo Scales to Millions of Groups
For a multi-rooted Clos topology with 27K hosts (having 48 hosts per rack), 1M multicast groups (with group sizes based on IBM’s WVE trace), and a p-rule header of 325 bytes: Unicast Overlay In (click) a multi-rooted Clos topology with 27K hosts and 1M multicast groups, with group sizes based on a production trace from IBM’s Websphere Virtual Enterprise and a p-rule header of 325 bytes: (click) (i) Elmo can encode, without using default p-rules, 89% of groups when no extra transmission is allowed and up to 99% of groups when this condition is relaxed. (click) (ii) 95% of switches have fewer than 4059 rules when no extra transmission is allowed, and a maximum of 107 rules when 12 extra transmissions are allowed per p-rule. (click) (iii) traffic overhead is negligible for cases where we allow extra transmissions and is identical to ideal multicast when no extra transmission is allowed. (next slide)

98 Elmo Scales to Millions of Groups
For a multi-rooted Clos topology with 27K hosts (having 48 hosts per rack), 1M multicast groups (with group sizes based on IBM’s WVE trace), and a p-rule header of 325 bytes: Elmo reduces control-plane update overhead on network switches and directs updates to hypervisor switches instead. On average, a membership change to a group triggers an update to 50% of hypervisor switches, fewer than 0.006% of leaf and 0.002% of spine switches relevant to that group’s multicast tree. point out how most updates are handled by hypervisor switches. (next slide)

99 Applications Run Unmodified with No Overhead
Subscriber Publisher With unicast, the throughput at subscribers decreases as we increase the number of subscribers because the publisher becomes the bottleneck; the publisher services a single subscriber at 185K rps on average and drops to about 0.25K rps for 256 subscribers. With Elmo, the throughput remains the same regardless of the number of subscribers and averages 185K rps throughout. (click) The CPU usage of the publisher VM (and the underlying host) also increases with increasing number of subscribers. With Elmo, the CPU usage remains constant regardless of the number of subscribers (i.e., 4.97%)

100 Elmo Operates within the Header Size Limit of Switch ASICs
For a 256-port, 200 mm2 baseline switching ASIC that can parse a 512-byte packet header: 190 bytes for other protocols (e.g., datacenter protocols take about 90 bytes) For (click) a 256-port, 200 mm2 baseline switching ASIC that can parse a 512-byte packet header, (click) Elmo consumes only 63.5% of header space even with 30 p-rules, (click) still having 190 bytes for other protocols (e.g., in data-center networks, protocols take about 90 bytes). (next slide)

101 Elmo’s Primitives are Inexpensive to Implement in Switch ASICs
For a 256-port, 200 mm2 baseline switching ASIC that can parse a 512-byte packet header: As a comparison, Conga consumes 2% of area and Finally, Elmo’s primitives for bitmap-based output-port selection add only (click) % in area costs. This cost is quite modest when compared to CONGA and Banzai schemes that consume an additional switch area of 2% and 12%, respectively. (next slide) Banzai consumes 12% of area.

102 Elmo: Source-Routed Multicast for Cloud Services
Designed for multi-tenant data centers Compactly encodes multicast policy inside packets Operates at line rate using programmable data planes Elmo A Scalable Multicast Service So, in summary … (click) Elmo is a scalable multicast service, (click) deployable in today’s multi-tenant data centers, supporting millions of multicast groups and tenant isolation while utilizing the full bisection bandwidth of the network. (click) Elmo takes advantage of the unique characteristics of data-center topologies to compactly encode forwarding rules inside packets and (click) uses programmable data planes to process them at line rate. (next slide)

103 Conclusion add speaker notes (next slide)

104 a. Software Switch: PISCES
Conclusion b. Infrastructure Service: Elmo Programmable Control My thesis contributions In conclusion, I make two contributions in this thesis: (1) PISCES to enable network operators to gain programmable control over their network, and (2) Elmo, a scalable infrastructure service for in-network multicast in data centers. (click) (next slide) a. Software Switch: PISCES

105 Work not part of this thesis
PVPP: A Flexible Low-Level Backend for Software Data Planes APNet 2017 NetASM: An IR for Programmable Data Planes SOSR 2015 Kinetic: Verifiable Dynamic Network Control NSDI 2015 SDX: A Software Defined Internet Exchange SIGCOMM 2014 OSNT: Open Source Network Tester IEEE Network 2014 Network Prog. Data Planes Network Compilers Network Management add speaker notes (next slide) Network Management Network Testing

106 Thanks! … all for making this Ph.D. an amazing experience.

