Bringing the Power of eBPF to Open vSwitch


Linux Plumbers 2018. William Tu, Joe Stringer, Yifeng Sun, Yi-Hung Wei (VMware Inc. and Cilium.io)

Outline
- Introduction and Motivation
- OVS-eBPF Project
- OVS-AF_XDP Project
- Conclusion

What is OVS?
An SDN controller programs ovs-vswitchd over OpenFlow. ovs-vswitchd is the slow path; the datapath is the fast path that forwards packets. The OVS datapath consists of three stages: parse, lookup, and actions.

OVS Linux Kernel Datapath
The slow path, ovs-vswitchd, runs in userspace and talks to the kernel over a netlink socket. The fast path is the OVS kernel module, which hooks into device RX above the driver, alongside the kernel's IP/routing stack. Packets flow from the hardware, through the driver, into the OVS kernel module.

OVS-eBPF

OVS-eBPF Motivation
Maintenance cost when adding a new datapath feature:
- Time to upstream and time to backport
- Maintaining ABI compatibility between different kernel and OVS versions
- Different backported kernels, e.g. RHEL or grsecurity-patched kernels
- Bugs in compat code are easy to introduce and often non-obvious to fix
Implementing datapath functionality in eBPF:
- Reduces dependencies on specific kernel versions
- Gives more opportunities for experiments
With eBPF, ovs-vswitchd ships the eBPF programs and loads them into the kernel, so the datapath always matches the userspace version. This lets OVS add application-specific functions without worrying about the kernel ABI, extend features more quickly, and reduce maintenance and compatibility issues.

What is eBPF?
- A way to write a restricted C program that runs inside the Linux kernel, on a virtual machine in the kernel; safety is guaranteed by the BPF verifier.
- Maps: efficient key/value stores that reside in kernel space and can be shared between eBPF programs and userspace applications.
- Helper functions: a core set of kernel-defined function calls that eBPF programs use to retrieve data from, or push data to, the kernel. The available helpers differ per eBPF program type.
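
As a concrete illustration, here is a minimal restricted-C program in this style, written with the legacy libbpf map-definition macros of that era (header paths and section names vary between libbpf versions); the map and program names are illustrative, not taken from the OVS-eBPF code:

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>      /* TC_ACT_OK */
#include <bpf/bpf_helpers.h>    /* SEC(), helper prototypes, legacy bpf_map_def */

/* A per-CPU array map with a single counter slot; layout is illustrative. */
struct bpf_map_def SEC("maps") pkt_counter = {
    .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(__u64),
    .max_entries = 1,
};

SEC("classifier")
int count_packets(struct __sk_buff *skb)
{
    __u32 key = 0;

    /* bpf_map_lookup_elem() is a kernel-defined helper; the verifier
     * requires the NULL check before the value may be dereferenced. */
    __u64 *count = bpf_map_lookup_elem(&pkt_counter, &key);
    if (count)
        (*count)++;

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```

Userspace can read the same map through the bpf() syscall wrappers on the map fd, which is how maps are shared between an eBPF program and an application such as ovs-vswitchd.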

OVS-eBPF Project Goal
Re-write the OVS kernel datapath entirely with eBPF:
- ovs-vswitchd (the slow path, in userspace) controls and manages the eBPF datapath
- eBPF maps serve as the channels between them
- The eBPF datapath is specific to ovs-vswitchd
The eBPF datapath runs in the kernel at the TC hook (TC is the kernel packet queuing subsystem that provides QoS) and is split into parse, lookup, and action stages. ovs-vswitchd creates the maps and loads the eBPF programs.

Headers/Metadata Parsing
- Define a flow key similar to struct sw_flow_key in the kernel
- Parse protocols from the packet data and metadata from struct __sk_buff (the extra metadata is Linux skb specific and not defined in P4)
- Save the flow key in a per-CPU eBPF map, as in the sketch below
Difficulties:
- The stack is heavily used
- The program is very branchy
- No longer tied to the kernel ABI
We started by writing our own parser, but parsing is fairly standard and existing code could be reused, e.g. https://github.com/iovisor/bcc/blob/master/src/cc/frontends/p4/README.md
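
A much-reduced sketch of that idea, again using the legacy libbpf map macros; the struct layout and names are illustrative and far smaller than the real flow key:

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Illustrative, cut-down flow key; the real sw_flow_key carries far more
 * protocol fields plus skb metadata. */
struct flow_key {
    __u8  eth_src[6];
    __u8  eth_dst[6];
    __u16 eth_type;
    __u32 ipv4_src;
    __u32 ipv4_dst;
    __u8  ip_proto;
};

/* One slot per CPU: the parser writes the key here and later stages read
 * it, which keeps the large key off the small eBPF stack. */
struct bpf_map_def SEC("maps") percpu_flow_key = {
    .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(struct flow_key),
    .max_entries = 1,
};

SEC("classifier")
int ovs_parser(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    __u32 zero = 0;

    struct flow_key *key = bpf_map_lookup_elem(&percpu_flow_key, &zero);
    if (!key)
        return TC_ACT_OK;

    if ((void *)(eth + 1) > data_end)      /* bounds check demanded by the verifier */
        return TC_ACT_OK;

    __builtin_memcpy(key->eth_src, eth->h_source, 6);
    __builtin_memcpy(key->eth_dst, eth->h_dest, 6);
    key->eth_type = eth->h_proto;

    if (eth->h_proto == bpf_htons(ETH_P_IP)) {
        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return TC_ACT_OK;
        key->ipv4_src = ip->saddr;
        key->ipv4_dst = ip->daddr;
        key->ip_proto = ip->protocol;
    }

    return TC_ACT_OK;   /* the next stage would do the flow-table lookup */
}

char _license[] SEC("license") = "GPL";
```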

Review: Flow Lookup in Kernel Datapath
Slow path:
1. Ingress: the flow lookup misses and triggers an upcall
2. Miss upcall (netlink): ovs-vswitchd receives the packet, does flow translation, and programs a flow entry into the flow table (EMC + megaflow) in the OVS kernel module
3. Flow installation (netlink): the OVS kernel datapath installs the flow entry
4. Actions: the OVS kernel datapath receives the packet back and executes the actions on it
Fast path: subsequent packets hit the flow cache.
When a packet arrives on a vport, the kernel module extracts its flow key and looks it up in the flow table. On a match it executes the associated actions; on a miss it queues the packet to userspace, which typically sets up a flow so that further packets of the same type are handled entirely in-kernel.

Flow Lookup in eBPF Datapath
Slow path:
1. Ingress: the parser walks the protocol headers, hashes the flow key, and looks it up in an eBPF hash map; a miss triggers an upcall
2. Miss upcall (perf ring buffer): the perf ring buffer, a fast and efficient messaging channel for eBPF, carries the packet and its metadata to ovs-vswitchd (see the sketch below)
3. Flow installation (TLV -> fixed-length array -> eBPF map): ovs-vswitchd does flow translation and programs the flow entry into the eBPF map
4. Actions: ovs-vswitchd sends the packet back down to trigger the lookup again
Fast path: subsequent packets hit the flow cache (an eBPF hash map). This avoids forward/backward flow-key compatibility issues: network protocols evolve over time, new protocols become important, and newer versions must be able to parse additional protocols as part of the flow key; someday it may even be desirable to drop support for obsolete protocols.
Limitation on flow installation: the existing datapath uses TLVs for both matches and actions, and the TLV format is currently not supported by the BPF verifier. In eBPF we can drop the TLV-formatted matches and use a fixed binary view of the flow, since kernel and userspace no longer need to handle mismatched versions. Actions, however, are logically a list, and eBPF makes it difficult to validate TLVs, so the current solution is to expand the actions into a fixed-length array. This works for now but may be an area for improvement.
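
The slides name a perf ring buffer as the upcall channel; a hedged sketch of what that step could look like, with illustrative map, struct, and program names, is:

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Perf event array used as the upcall channel to ovs-vswitchd.
 * With max_entries == 0, libbpf sizes it to the number of CPUs at load time. */
struct bpf_map_def SEC("maps") upcall_events = {
    .type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
    .key_size    = sizeof(int),
    .value_size  = sizeof(__u32),
    .max_entries = 0,
};

/* Illustrative metadata sent ahead of the packet bytes. */
struct upcall_md {
    __u32 ifindex;
    __u32 pkt_len;
};

SEC("classifier/upcall")
int ovs_upcall(struct __sk_buff *skb)
{
    struct upcall_md md = {
        .ifindex = skb->ifindex,
        .pkt_len = skb->len,
    };

    /* BPF_F_CURRENT_CPU selects this CPU's ring; OR-ing the packet length
     * into the upper 32 bits asks the helper to append that many bytes of
     * packet data after the metadata. */
    __u64 flags = BPF_F_CURRENT_CPU | ((__u64)skb->len << 32);
    bpf_perf_event_output(skb, &upcall_events, flags, &md, sizeof(md));

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```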

Review: OVS Kernel Datapath Actions
The flow table maps each flow to a list of actions (Act1, Act2, Act3, ...) to execute on the packet. Example datapath actions:
- Flooding: Datapath actions=output:9,output:5,output:10,...
- Mirror and push VLAN: Datapath actions=output:3,push_vlan(vid=17,pcp=0),output:2
- Tunnel: Datapath actions=set(tunnel(tun_id=0x5,src=2.2.2.2,dst=1.1.1.1,ttl=64,flags(df|key))),output:1
More examples:
- Datapath actions: 9,55,10,55,66,11,77,88,9,1
- Datapath actions: set(ipv4(ttl=1)),2,4
- Datapath actions: 10,set(ipv4(src=192.168.3.91)),11,set(ipv4(src=192.168.3.90)),13
- Datapath actions: set(ipv4(src=192.168.3.90)),10,set(ipv4(src=192.168.0.1)),11
- Datapath actions: trunc(100),3,push_vlan(vid=17,pcp=0),2

eBPF Datapath Actions
A list of actions to execute on the packet.
Challenges:
- Limited eBPF program size (maximum 4K instructions)
- Variable number of actions per flow: BPF disallows loops so the verifier can guarantee that every program terminates
Solution (sketched below): make each action type its own eBPF program and tail call the next action: FlowTable -> map lookup -> eBPF Act1 -> tail call -> map lookup -> eBPF Act2 -> tail call -> ...
Side effect: a tail call has limited context and does not return, so the action metadata and the action list are kept in a map.
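
A hedged sketch of the tail-call chaining, with illustrative program and map names (the real datapath would read the next action index from the per-flow action array kept in a map):

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Program array: the loader populates one slot per action type.
 * The indices here are illustrative. */
struct bpf_map_def SEC("maps") action_progs = {
    .type        = BPF_MAP_TYPE_PROG_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(__u32),
    .max_entries = 32,
};

enum { ACT_OUTPUT = 0, ACT_SET_MAC = 1, ACT_PUSH_VLAN = 2 };

SEC("classifier/act_set_mac")
int act_set_mac(struct __sk_buff *skb)
{
    /* ... rewrite the destination MAC here ... */

    /* Jump to the next action; which index to use would come from the
     * per-flow action list kept in a map (not shown). A successful tail
     * call never returns, so there is no fall-through to clean up. */
    bpf_tail_call(skb, &action_progs, ACT_OUTPUT);

    /* Reached only if the tail call fails, e.g. the slot is empty. */
    return TC_ACT_SHOT;
}

char _license[] SEC("license") = "GPL";
```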

Performance Evaluation
Setup: a DPDK packet generator sends 64-byte packets at 14.88 Mpps into one port of a 16-core Intel Xeon E5 2650 2.4 GHz server with 32 GB memory and an Intel X3540-AT2 dual-port 10G NIC. OVS (br0, with the eBPF datapath attached) receives packets on the ingress port, forwards them to the other port, and the receive rate is measured on the egress side.
- Compare the OVS kernel datapath and the eBPF datapath
- Measure single-flow, single-core performance with Linux kernel 4.9-rc3 on the OVS server

OVS Kernel and eBPF Datapath Performance
OVS kernel DP actions:
- Output: 1.34 Mpps
- Set dst_mac: 1.23 Mpps
- Set GRE tunnel: 0.57 Mpps
eBPF DP actions:
- Redirect (no parser, lookup, or actions): 1.90 Mpps
- Output: 1.12 Mpps
- Set dst_mac: 1.14 Mpps
- Set GRE tunnel: 0.48 Mpps
All measurements are single flow, single core. The goal is not the absolute numbers (throughput could be scaled by distributing over multiple cores) but the relative performance of the eBPF datapath against the current OVS kernel datapath, i.e. how much overhead the eBPF architecture adds to the design.

Conclusion and Future Work
Features:
- Megaflow support and basic conntrack are in progress
- Packet (de)fragmentation and ALG support are under discussion
Lessons learned:
- Writing large eBPF programs is still hard, even for experienced C programmers
- Lack of debugging tools
- The OVS datapath logic is difficult to express in eBPF
- It is not yet clear how much performance headroom remains

OVS-AF_XDP

OVS-AF_XDP Motivation
Pushing all OVS datapath features into eBPF is hard:
- A large flow key on the stack
- A wide variety of protocols and actions
- A dynamic number of actions applied to each flow
Idea: retrieve packets from the kernel as fast as possible and reuse the userspace datapath for flow processing, with less kernel compatibility burden than the OVS kernel module. Because the OVS kernel module is so generic (a flow-based model supporting many protocols and actions), pushing the complete datapath logic into the kernel is difficult; a better path may be to pull packets out of the kernel even earlier than today and do the forwarding in the userspace datapath.

OVS Userspace Datapath (dpif-netdev)
In this model both the slow path and the fast path run in userspace: the SDN controller programs ovs-vswitchd, which drives the userspace datapath on top of the DPDK library and the hardware.

XDP and AF_XDP
- XDP (eXpress Data Path): an eBPF hook point at the network device driver level
- AF_XDP: a new socket type that receives/sends raw frames at high speed
- An XDP program steers received frames into the AF_XDP socket (see the sketch below)
- The userspace program manages the Rx/Tx rings and the Fill/Completion rings
- Zero copy from the DMA buffer into userspace memory, the umem
(Diagram from "DPDK PMD for AF_XDP")
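
A minimal sketch of the XDP side, assuming a BPF_MAP_TYPE_XSKMAP keyed by RX queue id; the map name and size are illustrative:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* XSKMAP: each slot holds an AF_XDP socket. Userspace (e.g. the OVS netdev)
 * registers its socket for the RX queue it owns. */
struct bpf_map_def SEC("maps") xsks_map = {
    .type        = BPF_MAP_TYPE_XSKMAP,
    .key_size    = sizeof(int),
    .value_size  = sizeof(int),
    .max_entries = 64,
};

SEC("xdp")
int xdp_redirect_xsk(struct xdp_md *ctx)
{
    /* Steer the frame to the AF_XDP socket bound to this RX queue. If no
     * socket is registered in the slot, the redirect fails and the frame
     * is dropped; older kernels require the flags argument to be 0. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}

char _license[] SEC("license") = "GPL";
```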

OVS-AF_XDP Project Goal
Use an AF_XDP socket as a fast channel from the kernel driver (with its XDP program) up to the userspace OVS datapath, bypassing the regular network stack; all flow processing happens in userspace in ovs-vswitchd.

AF_XDP umem and rings Introduction
- umem: a memory region made of multiple 2KB chunk elements; ring descriptors point to umem elements
- Receive: the Rx ring is where the user receives packets; the Fill ring supplies chunks for the kernel to receive into
- Transmit: the Tx ring is where the user places packets to send; the Completion ring is where the kernel signals that a send has completed
- One Rx/Tx ring pair per AF_XDP socket; one Fill/Completion ring pair per umem region
A setup sketch follows.
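
A setup sketch using libbpf's AF_XDP helper API (xsk.h; newer trees ship it as part of libxdp). The tiny 8-frame umem only mirrors the walkthrough below; real code would use many more frames and handle errors more carefully:

```c
#include <stdlib.h>
#include <unistd.h>
#include <bpf/xsk.h>

#define NUM_FRAMES 8        /* matches the 8-element umem in the walkthrough */
#define FRAME_SIZE 2048     /* 2KB chunks, as on the slides */

struct xsk_ring_prod fill_ring, tx_ring;
struct xsk_ring_cons comp_ring, rx_ring;
struct xsk_umem *umem;
struct xsk_socket *xsk;
void *umem_bufs;

static int setup_xsk(const char *ifname, __u32 queue_id)
{
    struct xsk_umem_config cfg = {
        .fill_size      = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        .comp_size      = XSK_RING_CONS__DEFAULT_NUM_DESCS,
        .frame_size     = FRAME_SIZE,
        .frame_headroom = 0,
    };
    int ret;

    /* umem: one contiguous region carved into 2KB chunks. */
    ret = posix_memalign(&umem_bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE);
    if (ret)
        return -ret;

    /* One Fill/Completion ring pair per umem region. */
    ret = xsk_umem__create(&umem, umem_bufs, NUM_FRAMES * FRAME_SIZE,
                           &fill_ring, &comp_ring, &cfg);
    if (ret)
        return ret;

    /* One Rx/Tx ring pair per AF_XDP socket bound to (ifname, queue_id). */
    return xsk_socket__create(&xsk, ifname, queue_id, umem,
                              &rx_ring, &tx_ring, NULL);
}
```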

OVS-AF_XDP: Packet Reception (0)
The umem consists of 8 elements, and the umem mempool holds all of them: {1, 2, 3, 4, 5, 6, 7, 8}. Both the Fill ring and the Rx ring are empty.

OVS-AF_XDP: Packet Reception (1)
GET four elements from the umem mempool (now {5, 6, 7, 8}; elements 1-4 are in use) and program them into the Fill ring, which now holds 1, 2, 3, 4. The Rx ring is still empty.

OVS-AF_XDP: Packet Reception (2)
The kernel receives four packets, puts them into the four umem chunks, and moves the descriptors to the Rx ring for the user: the Rx ring now holds 1, 2, 3, 4 and the Fill ring is empty. The mempool is still {5, 6, 7, 8}.

OVS-AF_XDP: Packet Reception (3)
GET four more elements and program them into the Fill ring (5, 6, 7, 8) so the kernel can keep receiving packets; the umem mempool is now empty. The Rx ring still holds 1, 2, 3, 4.

OVS-AF_XDP: Packet Reception (4)
OVS userspace processes the packets on the Rx ring (1, 2, 3, 4); the Fill ring holds 5, 6, 7, 8 and the mempool is empty.

OVS-AF_XDP: Packet Reception (5)
OVS userspace finishes packet processing and recycles the elements to the umem mempool, which is now {1, 2, 3, 4}; the Fill ring still holds 5, 6, 7, 8 and the Rx ring is empty. We are back to state (1). A receive-loop sketch follows.
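
A hedged sketch of the receive path above, using libbpf's xsk ring accessors; umem_get()/umem_put() are hypothetical stand-ins for the umempool described on the later slides:

```c
#include <bpf/xsk.h>

#define RX_BATCH 4   /* four packets, as in the walkthrough above */

/* Rings and buffer area as set up when the socket was created. */
extern struct xsk_ring_cons rx_ring;
extern struct xsk_ring_prod fill_ring;
extern void *umem_bufs;
extern __u64 umem_get(void);
extern void  umem_put(__u64 addr);

static void afxdp_rx(void)
{
    __u32 idx_rx = 0, idx_fill = 0;

    /* Steps (2) and (4): see how many descriptors the kernel moved to Rx. */
    __u32 rcvd = xsk_ring_cons__peek(&rx_ring, RX_BATCH, &idx_rx);
    if (!rcvd)
        return;

    /* Step (3): immediately hand fresh umem chunks to the Fill ring so the
     * kernel can keep receiving while this batch is processed. */
    if (xsk_ring_prod__reserve(&fill_ring, rcvd, &idx_fill) == rcvd) {
        for (__u32 i = 0; i < rcvd; i++)
            *xsk_ring_prod__fill_addr(&fill_ring, idx_fill++) = umem_get();
        xsk_ring_prod__submit(&fill_ring, rcvd);
    }

    /* Steps (4) and (5): process each packet, then recycle its chunk. */
    for (__u32 i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&rx_ring, idx_rx++);
        void *pkt = xsk_umem__get_data(umem_bufs, desc->addr);

        (void)pkt;              /* dp_packet setup and flow processing go here */
        umem_put(desc->addr);
    }
    xsk_ring_cons__release(&rx_ring, rcvd);
}
```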

OVS-AF_XDP: Packet Transmission (0)
OVS userspace has four packets to send. The umem mempool is full, {1, 2, 3, 4, 5, 6, 7, 8}, and the Tx and Completion rings are empty.

OVS-AF_XDP: Packet Transmission (1)
GET four elements from the umem mempool (now {5, 6, 7, 8}), copy the packet contents into them, and place the descriptors on the Tx ring (1, 2, 3, 4). The Completion ring is empty.

OVS-AF_XDP: Packet Transmission (2)
Issue the sendmsg() syscall; the kernel tries to send the packets on the Tx ring (1, 2, 3, 4).

OVS-AF_XDP: Packet Transmission (3)
The kernel finishes sending and moves the four elements to the Completion ring (1, 2, 3, 4) for the user. The Tx ring is empty again.

OVS-AF_XDP: Packet Transmission (4)
OVS sees that the send operation is done and recycles (PUTs) the four elements back to the umem mempool, which is again {1, 2, 3, 4, 5, 6, 7, 8}. A transmit sketch follows.
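
A hedged sketch of the transmit path above, again with hypothetical umem_get()/umem_put() pool helpers. The kick here uses sendto() with an empty message, a common way to issue the send syscall the slides refer to:

```c
#include <string.h>
#include <sys/socket.h>
#include <bpf/xsk.h>

extern struct xsk_ring_prod tx_ring;
extern struct xsk_ring_cons comp_ring;
extern struct xsk_socket *xsk;
extern void *umem_bufs;
extern __u64 umem_get(void);
extern void  umem_put(__u64 addr);

/* Steps (1) and (2): copy packets into fresh umem chunks, post them on the
 * Tx ring, and kick the kernel with a send syscall. */
static int afxdp_tx(void **pkts, __u32 *lens, __u32 n)
{
    __u32 idx = 0;

    if (xsk_ring_prod__reserve(&tx_ring, n, &idx) < n)
        return -1;                       /* Tx ring full; caller retries */

    for (__u32 i = 0; i < n; i++) {
        struct xdp_desc *desc = xsk_ring_prod__tx_desc(&tx_ring, idx++);

        desc->addr = umem_get();         /* take a free chunk from the pool */
        desc->len  = lens[i];
        memcpy(xsk_umem__get_data(umem_bufs, desc->addr), pkts[i], lens[i]);
    }
    xsk_ring_prod__submit(&tx_ring, n);

    /* The OVS port batches this kick (see the backup slide); here it is
     * issued on every call for simplicity. */
    return sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
}

/* Steps (3) and (4): reap completed sends and recycle their chunks. */
static void afxdp_complete(void)
{
    __u32 idx = 0;
    __u32 done = xsk_ring_cons__peek(&comp_ring, 32, &idx);

    for (__u32 i = 0; i < done; i++)
        umem_put(*xsk_ring_cons__comp_addr(&comp_ring, idx++));
    if (done)
        xsk_ring_cons__release(&comp_ring, done);
}
```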

Optimizations
- OVS pmd (poll-mode driver) netdev for rx/tx. Before: call the poll() syscall and wait for new I/O. After: a dedicated thread busy-polls the Rx ring.
- UMEM memory pool: a fast data structure for GET and PUT of umem elements.
- Packet metadata allocation. Before: allocate the metadata when a packet is received. After: pre-allocate and pre-initialize it.
- Batching the sendmsg system call.

Umempool Design
The umempool keeps track of available umem elements: GET takes out N elements, PUT puts back N elements, and every ring access needs a GET or PUT. Three designs:
- LILO-List_head: embedded in the umem buffer, linked by a list_head, push/pop style
- FIFO-ptr_ring: a pointer ring with head and tail pointers
- LIFO-ptr_array: a pointer array with push/pop style access

LIFO-ptr_array Design
Idea: a pointer array used as a stack over the 2K umem chunks. Each ptr_array element contains a umem address; the producer PUTs elements at the top and increments top, and the consumer GETs elements from the top and decrements top. A sketch follows.
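
A sketch of the LIFO-ptr_array pool, assuming one pool per pmd thread so no locking is shown; the names are illustrative:

```c
#include <stdint.h>

/* A pointer array of umem addresses used as a stack. */
struct umem_pool {
    uint64_t *array;     /* one slot per umem chunk address */
    uint32_t  top;       /* number of free chunks currently in the pool */
    uint32_t  size;      /* capacity of the array */
};

/* PUT: producer pushes n addresses on top, and top grows. */
static inline void umem_put_n(struct umem_pool *p, const uint64_t *addrs, uint32_t n)
{
    for (uint32_t i = 0; i < n && p->top < p->size; i++)
        p->array[p->top++] = addrs[i];
}

/* GET: consumer pops up to n addresses from the top, and top shrinks. */
static inline uint32_t umem_get_n(struct umem_pool *p, uint64_t *addrs, uint32_t n)
{
    uint32_t got = 0;

    while (got < n && p->top > 0)
        addrs[got++] = p->array[--p->top];
    return got;
}
```

A LIFO stack tends to hand back recently used chunks, which keeps their buffers warm in cache compared with the FIFO ptr_ring variant.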

Packet Metadata Allocation
Every packet in OVS needs metadata, struct dp_packet, with its packet-data-independent fields initialized. Two designs:
- Embedded in the umem packet buffer: reserve the first 256 bytes of each chunk for struct dp_packet, similar to the DPDK mbuf design
- Separate from the umem packet buffer: allocate an array of struct dp_packet that maps one-to-one to umem chunks, similar to the skb_array design

Packet Metadata Allocation: embedded in the umem packet buffer
Each 2K chunk of the umem memory region starts with the packet metadata, followed by the umem packet buffer itself.

Packet Metadata Allocation: separate from the umem packet buffer
The packet metadata lives in another memory region that maps one-to-one to the 2K umem chunks (sketched below).
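
A sketch of the separate-array lookup, with a cut-down stand-in for struct dp_packet since the real structure lives in the OVS tree; the array and helper names are illustrative:

```c
#include <stdint.h>

#define FRAME_SIZE 2048U        /* 2KB umem chunks, as above */

/* Minimal stand-in for OVS's struct dp_packet. */
struct dp_packet_md {
    void    *data;
    uint32_t size;
};

/* Pre-allocated and pre-initialized array, one entry per umem chunk. */
extern struct dp_packet_md *md_array;

/* Map a umem chunk address to its metadata slot: the one-to-one mapping is
 * just the chunk index, i.e. the address divided by the frame size. */
static inline struct dp_packet_md *md_for_addr(uint64_t umem_addr)
{
    return &md_array[umem_addr / FRAME_SIZE];
}
```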

Performance Evaluation
Setup: a DPDK packet generator sends 64-byte packets at 19 Mpps into one port of a 16-core Intel Xeon E5 2650 2.4 GHz server with 32 GB memory, using a Netronome NFP-4000 or Intel XL710 40GbE NIC. OVS (br0, with the AF_XDP userspace datapath) receives on the ingress port, forwards to the egress port, and the receive rate is measured on the other side.
- Measure single-flow, single-core performance with Linux kernel 4.19-rc3 and OVS 2.9
- AF_XDP zero-copy mode enabled

Performance Evaluation
Experiments:
- OVS-AFXDP rxdrop: parse, lookup, and action = drop
- OVS-AFXDP l2fwd: parse, lookup, and action = set_mac, then output to the received port
- XDPSOCK (the AF_XDP benchmark tool) rxdrop/l2fwd: simply drop/forward without touching packets
Results (LIFO-ptr_array plus separate metadata allocation performs best):
- rxdrop: 19 Mpps
- l2fwd: 17 Mpps (XDPSOCK) vs. 14 Mpps (OVS-AFXDP)

Conclusion and Discussion
Future work:
- Follow up on new kernel AF_XDP optimizations
- Try virtual devices (vhost/virtio) with VM-to-VM traffic
- Bring feature parity between the userspace and kernel datapaths
Discussion:
- Usage model: number of XSKs, number of queues, number of pmd/non-pmd threads
- Comparison with DPDK in terms of deployment difficulty
- Whether there is remaining performance headroom is not yet clear

Like? Dislike? Questions? Thank you!

Batching sendmsg syscall
Place a batch of 32 packets on the Tx ring, then issue the send syscall.
- Design 1: check whether these 32 packets are on the Completion ring, then recycle; if not, keep issuing send
- Design 2: check whether any 32 packets are on the Completion ring, then recycle
- Design 3: issue the sendmsg syscall