Open vSwitch HW Offload over DPDK
India DPDK Summit, Bangalore, March 2018
Virtual Switch Is Overloaded as We Go Cloud Native
- Bare metal: few apps per server; low mobility; low resource utilization
- Hypervisor virtualization, apps running in VMs: 10s-100s of VMs per server; good mobility; high resource utilization
- OS resource virtualization, apps running in containers: 1000s of containers per server; great portability; high resource utilization; high performance
[Diagram: bare-metal, VM (hypervisor + OVS), and container (Docker Engine + OVS) stacks side by side]
Telco and Cloud Applications Expose OVS Weaknesses
- Low packet performance: vanilla OVS delivers ~0.5 Mpps on a 10G link, roughly 1/80 of bare-metal performance for voice applications
- Low efficiency: many CPU cores must be dedicated to packet processing to reach even a fraction of bare-metal performance
- Latency and packet drops: latency is high and unpredictable; queues build up and packets get dropped, directly affecting user experience for real-time applications
How Do We Solve the Problem? ASAP2: Accelerated Switching and Packet Processing
What does it do?
- Offloads the OVS data plane to the eSwitch in the NIC
- Maintains the SDN control plane
- Leverages standard, open APIs; everything is upstreamed
Key advantages:
- Higher throughput
- Lower, more deterministic latency
- Lower CPU overhead, higher efficiency
Common Operations in Networking
- Most network functions share the same data-path operations:
  - Packet classification (into flows)
  - An action based on the classification result
- Mellanox NICs can offload both the classification and the actions to hardware (see the sketch below)
[Diagram: packets in -> NIC (classification A/B/...N -> action A/B/...N) -> processed packets out]
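As a reference point for what gets offloaded, here is a minimal software sketch of the classify-then-act pattern described above; the types and helpers (flow_key, flow_entry, classify, act) are illustrative, not taken from OVS or this deck. ASAP2 moves both steps into the NIC's eSwitch.

    /* Illustrative sketch of the classify-then-act pattern that ASAP2
     * offloads to the NIC; types and names are hypothetical. */
    #include <stdint.h>
    #include <string.h>

    struct flow_key {                 /* classification fields (e.g. a 5-tuple) */
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    enum flow_action { ACT_FORWARD, ACT_DROP, ACT_ENCAP };

    struct flow_entry {
        struct flow_key  key;         /* assumed zero-initialized, so padding compares equal */
        enum flow_action action;
        uint16_t         out_port;
    };

    /* Classification: find the flow entry matching the packet's key.
     * (A real datapath uses a hash table; a linear scan keeps the sketch short.) */
    static const struct flow_entry *
    classify(const struct flow_entry *table, int n, const struct flow_key *key)
    {
        for (int i = 0; i < n; i++)
            if (memcmp(&table[i].key, key, sizeof(*key)) == 0)
                return &table[i];
        return NULL;                  /* miss: handled by the slow path */
    }

    /* Action: act on the classification result. */
    static void
    act(const struct flow_entry *e /*, packet */)
    {
        switch (e->action) {
        case ACT_FORWARD: /* send to e->out_port */               break;
        case ACT_DROP:    /* free the packet */                   break;
        case ACT_ENCAP:   /* add a tunnel header, then forward */ break;
        }
    }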
Accelerated Virtual Switch (ASAP2-Flex)
Flex HW Acceleration for vSwitch/vRouter
- Offloads some elements of the data path to the NIC, but not the entire data path
- Data still flows through the vSwitch
- Works with para-virtualized VMs (not SR-IOV)
Offload examples:
- Classification offload
  - The application provides a flow spec and a flow ID
  - Classification is done in HW, which attaches the flow ID on a match
  - The vSwitch then classifies based on the flow ID rather than the full flow spec
  - rte_flow is used to configure the classification
- VXLAN encap/decap (see the hedged sketch below)
- VLAN add/remove
- QoS
[Diagram: vSwitch acceleration - para-virt VM over OVS in the hypervisor; TC / DPDK offload data path to the ConnectX-4 eSwitch through the PF]
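To make the VXLAN decap item above concrete, here is a hedged rte_flow sketch that asks the NIC to strip the VXLAN encapsulation and mark the inner packet for the vSwitch. The VXLAN_DECAP action landed in DPDK releases newer than the one this deck targets, so its availability (and whether a fate action such as QUEUE is required alongside it) depends on your DPDK version and PMD.

    /* Hedged sketch: offload VXLAN decap plus flow marking with rte_flow.
     * RTE_FLOW_ACTION_TYPE_VXLAN_DECAP exists only in newer DPDK releases;
     * check your release notes and PMD capabilities. */
    #include <rte_flow.h>

    static struct rte_flow *
    offload_vxlan_decap(uint16_t port_id, uint32_t flow_id,
                        struct rte_flow_error *err)
    {
        struct rte_flow_attr attr = { .ingress = 1 };

        /* Match any VXLAN-over-IPv4/UDP packet (NULL spec = wildcard). */
        struct rte_flow_item pattern[] = {
            { .type = RTE_FLOW_ITEM_TYPE_ETH },
            { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
            { .type = RTE_FLOW_ITEM_TYPE_UDP },
            { .type = RTE_FLOW_ITEM_TYPE_VXLAN },
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };

        /* Decapsulate in HW, report a flow ID so the vSwitch can skip
         * re-classification, and deliver to a host queue (in Flex mode the
         * packet still traverses the vSwitch). */
        struct rte_flow_action_mark  mark  = { .id = flow_id };
        struct rte_flow_action_queue queue = { .index = 0 };
        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_VXLAN_DECAP },
            { .type = RTE_FLOW_ACTION_TYPE_MARK,  .conf = &mark },
            { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        return rte_flow_create(port_id, &attr, pattern, actions, err);
    }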
HW Classification Offload Concept
- For every OVS flow, the dpif (DP_IF) layer should use DPDK rte_flow to classify, with either a tag (report flow ID) or a drop action
- When a packet is received, the tag ID is used instead of classifying the packet again
Example: OVS sets action Y for flow X
- Add an rte_flow rule that tags flow X with ID 0x1234
- Configure the datapath to apply action Y when mbuf->fdir.id == 0x1234 (see the sketch below)
Example: OVS action drop for flow Z
- Use the rte_flow DROP and COUNT actions to drop and count flow Z
- Use the rte_flow counter to read flow statistics
[Diagram: OVS-vswitchd and the DPDK dpif configure rte_flow rules in the NIC; HW marks flow X packets with ID 0x1234; the PMD delivers mbuf->fdir.id = 0x1234 and the datapath applies OVS action Y]
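A minimal sketch of the "tag flow X with 0x1234" step using the DPDK flow API. The match fields (IPv4 destination plus UDP destination port) and the QUEUE fate action are placeholders; OVS builds the real pattern from the datapath flow key, and implementations typically spread marked traffic with RSS instead.

    /* Sketch: program the NIC to mark packets of "flow X" with ID 0x1234 so
     * the datapath can apply OVS action Y without re-classifying them. */
    #include <rte_flow.h>
    #include <rte_byteorder.h>

    static struct rte_flow *
    offload_flow_x(uint16_t port_id, struct rte_flow_error *err)
    {
        struct rte_flow_attr attr = { .ingress = 1 };

        /* Illustrative match: IPv4 dst 10.0.0.1, UDP dst port 53. */
        struct rte_flow_item_ipv4 ip_spec  = { .hdr.dst_addr = RTE_BE32(0x0a000001) };
        struct rte_flow_item_ipv4 ip_mask  = { .hdr.dst_addr = RTE_BE32(0xffffffff) };
        struct rte_flow_item_udp  udp_spec = { .hdr.dst_port = RTE_BE16(53) };
        struct rte_flow_item_udp  udp_mask = { .hdr.dst_port = RTE_BE16(0xffff) };

        struct rte_flow_item pattern[] = {
            { .type = RTE_FLOW_ITEM_TYPE_ETH },  /* any L2 */
            { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ip_spec,  .mask = &ip_mask },
            { .type = RTE_FLOW_ITEM_TYPE_UDP,  .spec = &udp_spec, .mask = &udp_mask },
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };

        /* MARK: the PMD delivers the ID with the mbuf, where the datapath
         * maps 0x1234 -> "apply OVS action Y". */
        struct rte_flow_action_mark  mark  = { .id = 0x1234 };
        struct rte_flow_action_queue queue = { .index = 0 };
        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_MARK,  .conf = &mark },
            { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        return rte_flow_create(port_id, &attr, pattern, actions, err);
    }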
Flow Tables Overview
- Multiple tables
- Programmable table size
- Programmable table cascading
- Dedicated, isolated tables for the hypervisor and/or VMs
- Practically unlimited table size: millions of rules/flows are supported
Flow Tables – Classification
Key fields (example):
- Ethernet (Layer 2): destination MAC, 2 outer VLANs / priority, Ethertype
- IP (v4/v6): source address, destination address, protocol / next header
- TCP/UDP: source port, destination port
- Flexible field extraction via "Flexparse"
All fields mandated by OpenFlow are covered. (A hedged rte_flow mapping of these fields is sketched below.)
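A hedged sketch of how these key fields map onto rte_flow pattern items (destination MAC, outer VLAN ID, IPv4 source and TCP destination port here; the remaining fields follow the same spec/mask scheme). The exact struct layouts of the ETH and VLAN items have shifted slightly across DPDK releases, so treat the field names as version-dependent.

    /* Hedged sketch: eSwitch classification key fields expressed as
     * rte_flow pattern items. Struct layouts vary slightly by DPDK release. */
    #include <string.h>
    #include <rte_flow.h>
    #include <rte_byteorder.h>

    static void
    build_l2_l4_pattern(struct rte_flow_item pattern[5])
    {
        /* Statics keep the spec/mask storage valid after the function returns. */
        static struct rte_flow_item_eth  eth_spec, eth_mask;
        static struct rte_flow_item_vlan vlan_spec = { .tci = RTE_BE16(100) };    /* VLAN ID 100 */
        static struct rte_flow_item_vlan vlan_mask = { .tci = RTE_BE16(0x0fff) }; /* match VID only */
        static struct rte_flow_item_ipv4 ip_spec   = { .hdr.src_addr = RTE_BE32(0xc0a80101) }; /* 192.168.1.1 */
        static struct rte_flow_item_ipv4 ip_mask   = { .hdr.src_addr = RTE_BE32(0xffffffff) };
        static struct rte_flow_item_tcp  tcp_spec  = { .hdr.dst_port = RTE_BE16(80) };
        static struct rte_flow_item_tcp  tcp_mask  = { .hdr.dst_port = RTE_BE16(0xffff) };

        /* Destination MAC 00:11:22:33:44:55, fully masked. */
        const uint8_t dmac[6] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };
        memcpy(eth_spec.dst.addr_bytes, dmac, sizeof(dmac));
        memset(eth_mask.dst.addr_bytes, 0xff, sizeof(dmac));

        pattern[0] = (struct rte_flow_item){ .type = RTE_FLOW_ITEM_TYPE_ETH,  .spec = &eth_spec,  .mask = &eth_mask  };
        pattern[1] = (struct rte_flow_item){ .type = RTE_FLOW_ITEM_TYPE_VLAN, .spec = &vlan_spec, .mask = &vlan_mask };
        pattern[2] = (struct rte_flow_item){ .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ip_spec,   .mask = &ip_mask   };
        pattern[3] = (struct rte_flow_item){ .type = RTE_FLOW_ITEM_TYPE_TCP,  .spec = &tcp_spec,  .mask = &tcp_mask  };
        pattern[4] = (struct rte_flow_item){ .type = RTE_FLOW_ITEM_TYPE_END };
    }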
Flow Tables – Actions
Actions*:
- Steering and forwarding
- Drop / allow
- Counter set
- Send to monitor QP
- Encapsulation / decapsulation
- Report flow ID
Additional actions in newer NICs:
- Header rewrite
- MPLS and NSH encap/decap
- Flexible encap/decap
- Hairpin mode
* Not all combinations are supported
(A hedged drop-and-count sketch follows below.)
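To illustrate the counter and drop actions (and the "drop and count flow Z" case from two slides back), here is a hedged sketch; note that rte_flow_query()'s prototype has shifted slightly between DPDK releases (older ones take an action-type enum rather than an action struct).

    /* Hedged sketch: drop and count a flow in HW, then read the counter
     * for statistics/aging. The match is left wide open for brevity. */
    #include <rte_flow.h>

    static struct rte_flow *
    offload_drop_and_count(uint16_t port_id, struct rte_flow_error *err)
    {
        struct rte_flow_attr attr = { .ingress = 1 };

        struct rte_flow_item pattern[] = {
            { .type = RTE_FLOW_ITEM_TYPE_ETH },   /* a real rule matches "flow Z" */
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };

        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_COUNT },   /* HW counter */
            { .type = RTE_FLOW_ACTION_TYPE_DROP },    /* drop in the eSwitch */
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        return rte_flow_create(port_id, &attr, pattern, actions, err);
    }

    /* Read hit/byte counters for the rule. */
    static int
    read_flow_stats(uint16_t port_id, struct rte_flow *flow,
                    struct rte_flow_query_count *out, struct rte_flow_error *err)
    {
        const struct rte_flow_action count = { .type = RTE_FLOW_ACTION_TYPE_COUNT };

        return rte_flow_query(port_id, flow, &count, out, err);
    }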
OVS-DPDK Using Flex HW Classification Offload
- For every datapath rule, add an rte_flow rule carrying a flow ID
- The flow-ID cache can hold flow rules well in excess of 1M
- When a received packet carries a flow ID that hits the cache, there is no need to re-classify the packet to find its rule (see the RX-side sketch below)
[Diagram: flow-ID cache in the OVS-DPDK datapath]
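An RX-side sketch of that fast path, assuming hypothetical datapath helpers (flow_id_cache_lookup(), dp_execute_actions(), full_classify_and_execute()) and the classic mbuf mark fields: PKT_RX_FDIR_ID and hash.fdir.hi, which is where the earlier slide's mbuf->fdir.id lives in practice (the flag was renamed in later DPDK releases).

    /* Sketch of the datapath RX loop with the flow-ID cache. The dp_* and
     * flow_id_cache_* helpers are hypothetical stand-ins for OVS-DPDK internals. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    struct dp_flow;   /* hypothetical cached datapath rule */
    const struct dp_flow *flow_id_cache_lookup(uint32_t flow_id);
    void dp_execute_actions(const struct dp_flow *f, struct rte_mbuf *m);
    void full_classify_and_execute(struct rte_mbuf *m);

    static void
    rx_burst_with_mark(uint16_t port_id, uint16_t queue_id)
    {
        struct rte_mbuf *pkts[BURST];
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, pkts, BURST);

        for (uint16_t i = 0; i < n; i++) {
            struct rte_mbuf *m = pkts[i];

            if (m->ol_flags & PKT_RX_FDIR_ID) {
                /* HW already classified the packet: the rte_flow MARK value
                 * is delivered in hash.fdir.hi. */
                const struct dp_flow *f = flow_id_cache_lookup(m->hash.fdir.hi);
                if (f != NULL) {
                    dp_execute_actions(f, m);     /* e.g. OVS "action Y" */
                    continue;
                }
            }
            /* No mark, or a cache miss: fall back to full SW classification. */
            full_classify_and_execute(m);
        }
    }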
Performance
Setup: a single core per PMD, single queue.

Case             #flows   Base (Mpps)   Offload (Mpps)   Improvement
Wire to virtio   1        5.8           8.7              50%
Wire to wire     1        6.9           11.7             70%
Wire to wire     512      4.2           11.2             267%

Code submitted by Yuanhan Liu; planned to be integrated into OVS 2.10.
Full OVS Offload (ASAP2-Direct)
Full HW Offload for vSwitch/vRouter Acceleration
- Offload the whole packet processing pipeline to the embedded switch (eSwitch)
- Split the control plane and the forwarding plane
- Forwarding plane: the embedded switch, removing the cost of the data plane in SW
- SR-IOV based
Representors
- We use VF representors: representor ports are a netdev modeling of the eSwitch ports
- A VF representor supports the following operations:
  - Sending packets from the host CPU to the VF (OVS re-injection)
  - Receiving eSwitch "miss" packets
  - Flow configuration (add/remove)
  - Reading flow statistics (for aging and stats reporting)
- Representor devices are switchdev instances
(A minimal DPDK-side sketch follows below.)
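Since a representor shows up as a regular ethdev port on the DPDK side of this model, the two traffic operations above reduce to ordinary bursts on the representor's queues. A minimal sketch, with port and queue numbering assumed:

    /* Sketch: a VF representor used like any other DPDK ethdev port.
     * TX on the representor re-injects packets toward the VF; RX on the
     * representor delivers eSwitch "miss" packets for SW processing. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    /* Re-inject packets from the host datapath to the VF. */
    static uint16_t
    reinject_to_vf(uint16_t repr_port_id, struct rte_mbuf **pkts, uint16_t n)
    {
        return rte_eth_tx_burst(repr_port_id, 0 /* queue */, pkts, n);
    }

    /* Poll eSwitch miss packets; the slow path handles them and can then
     * install an rte_flow rule so subsequent packets stay in HW. */
    static uint16_t
    poll_miss_packets(uint16_t repr_port_id, struct rte_mbuf **pkts)
    {
        return rte_eth_rx_burst(repr_port_id, 0 /* queue */, pkts, BURST);
    }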
OVS-DPDK with Full HW Offloads
- OVS-DPDK with a direct data path to the VM; switchdev SR-IOV offloads are already implemented in kernel OVS
- Use the DPDK 'slow' path for exception flows or features the HW does not support
- Allow DPDK to use the control and data path of the embedded switch
- Representor ports are exposed over the PF
  - Dedicated RX & TX data-path queues per representor
  - Packets are sent to / received from a VF through its representor
- Offloads: ACLs, steering, routing, encap/decap, flow counters, IPsec
- Co-exists with para-virtualized solutions
- The rte_flow API will be extended to support full HW offload in DPDK 18.05 (a hedged sketch follows below)
[Diagram: para-virt guests (virtio) and a VF guest (VF driver); OVS-DPDK netdev with uplink and VF representor ports (rte_flow / switchdev) over the PF; NIC embedded switch connecting the uplink and VFs]
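A hedged sketch of what a fully offloaded rule can look like once the 18.05 flow API extensions are available: a "transfer" rule that steers matching uplink traffic inside the embedded switch straight to the VF's port, so the packets never touch the host CPU. The transfer attribute and PORT_ID action are the upstream names, but exact availability and semantics depend on the DPDK release and PMD.

    /* Hedged sketch (DPDK 18.05+ flow API): offload a whole flow to the
     * embedded switch. attr.transfer places the rule in the eSwitch domain;
     * PORT_ID forwards matches to the DPDK port backing the VF representor. */
    #include <rte_flow.h>
    #include <rte_byteorder.h>

    static struct rte_flow *
    offload_full_flow(uint16_t uplink_port_id, uint16_t vf_repr_port_id,
                      struct rte_flow_error *err)
    {
        struct rte_flow_attr attr = {
            .ingress  = 1,
            .transfer = 1,                /* rule lives in the embedded switch */
        };

        /* Illustrative match: IPv4 traffic to 10.0.0.2 arriving on the uplink. */
        struct rte_flow_item_ipv4 ip_spec = { .hdr.dst_addr = RTE_BE32(0x0a000002) };
        struct rte_flow_item_ipv4 ip_mask = { .hdr.dst_addr = RTE_BE32(0xffffffff) };
        struct rte_flow_item pattern[] = {
            { .type = RTE_FLOW_ITEM_TYPE_ETH },
            { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ip_spec, .mask = &ip_mask },
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };

        /* Forward in HW to the VF (identified by its representor's port ID). */
        struct rte_flow_action_port_id to_vf = { .id = vf_repr_port_id };
        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_PORT_ID, .conf = &to_vf },
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        return rte_flow_create(uplink_port_id, &attr, pattern, actions, err);
    }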
OVS-DPDK With and Without Full HW Offload

Test               Full HW offload   Without offload   Benefit
1 flow, VXLAN      66M PPS           7.6M PPS (VLAN)   8.6x
60K flows, VXLAN   19.8M PPS         1.9M PPS          10.4x