
1 Integrating OpenStack with DPDK for High Performance Applications
OpenStack Summit 2018 Vancouver

2 Who are we? Yasufumi Ogawa (@yogawa) Core maintainer in DPDK/SPP
Tetsuro Nakamura: Core maintainer in "networking-spp", active in Nova - NFV/Placement
Hello, everyone! My name is Yasufumi Ogawa, and I am here with Tetsuro Nakamura. Today we would like to talk about how to improve the performance of DPDK applications on virtual machines, focusing on OpenStack.

3 Agenda
DPDK: Strategies for High Performance; Examples of How to Configure; Motivation and Use Case - SPP (Soft Patch Panel). OpenStack: Bring the tuning settings to OpenStack (CPU pinning, Emulator threads policy, NUMA Architecture); Manage DPDK vhost-user interfaces.
In this presentation, I will first explain strategies for getting better performance with DPDK. I will also introduce our product SPP, which stands for Soft Patch Panel. As a telecom service provider, we have developed network service systems based on virtualization technologies, aiming to achieve the same level of performance as dedicated hardware. Then Tetsuro will talk about how to get that performance on OpenStack. OpenStack provides APIs for detailed configuration, but they are not sufficient to get the maximum performance. He will explain what the problems are and how we propose to improve them.

4 Strategies for High Performance
There are three strategies for getting better performance with DPDK:
1. Configuration considering the hardware architecture: NUMA and CPU layout, hugepages, memory channels.
2. Optimization of VM resource assignment: isolcpus, taskset.
3. Writing efficient code: reduce memory copies, communicate between lcores via rings.

5 Configurations for DPDK
DPDK provides several options for optimizing to the architecture.
(1) CPU layout: decide the core assignment with the '-l' option when launching a DPDK app, e.g. $ sudo /path/to/app -l <core list>. The main thread and each worker thread run on their own cores (core 0, core 1, core 2, ...).
(2) Memory channels: give the number of memory channels with '-n' so that DPDK adds the appropriate padding for loading/storing packets.
DPDK provides configurations for optimizing CPU and memory use. You can run worker threads on designated CPUs to utilize them at 100%; which CPUs you run worker threads on depends on your hardware architecture, i.e. on NUMA. For memory, you should give the number of memory channels to DPDK so it can pad packet data for efficient memory access. In this example, with two channels and 4-rank DIMMs, the first packet is located at channel 0, rank 0, and the second packet should be located at channel 1, rank 1 to optimize memory access; DPDK adds padding to adjust the starting point accordingly.
[Slide diagram: channel/rank layout showing the padding inserted so that, for 2 channels and 4-rank DIMMs, the second packet starts at the channel 1, rank 1 address.]
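For example, a typical launch line pins the application to specific cores with '-l' and tells the EAL the number of memory channels with '-n'. This is only a sketch: the binary path, core list and channel count are placeholders, and the channel count should match your DIMM population (which you can check, for instance, with dmidecode -t memory).
$ sudo /path/to/app -l 0-3 -n 4 -- <application-specific options>
# -l 0-3 : run the main thread and workers on cores 0-3
# -n 4   : tell the EAL this machine has 4 memory channels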

6 Configurations for VM
CPU assignment is not controllable by default, but it is doable.
(1) isolcpus: use the isolcpus Linux kernel parameter to isolate CPUs from the Linux scheduler and reduce context switches.
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=3-7"
(2) taskset: update the affinity of vCPU threads with the taskset command to pin them, e.g.:
$ sudo taskset -pc 4 <pid>
pid <pid>'s current affinity list: 0-31
pid <pid>'s new affinity list: 4
If you run DPDK applications on virtual machines, you should consider using isolcpus and taskset. isolcpus is a kernel parameter that isolates CPUs from the kernel scheduler, and taskset is a command that sets CPU affinity. With them you can assign CPUs exclusively to the VMs.
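A short end-to-end sketch of the same flow; the isolated range follows the grub example above, and the QEMU process name match and thread ID 2625 are illustrative only:
$ cat /sys/devices/system/cpu/isolated            # verify isolcpus took effect after reboot
3-7
$ ps -T -p $(pgrep -f qemu-system-x86 | head -1)  # list the QEMU threads (vCPU + emulator)
$ sudo taskset -pc 4 2625                         # pin one vCPU thread to isolated core 4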

7 Motivation - Large Scale Telco-Services on NFV
Large-scale cloud for telecom services. Service Function Chaining for virtual network appliances. Flexibility, maintainability and high performance. Various kinds of service apps on VMs: L2 switch, L3 router, load balancer, MPLS, firewall, audio, video, web service, security, monitoring, DPI, etc.
We at NTT have developed NFV technologies to realize a large-scale cloud system for telecom services. This is an image of a network service platform using service function chaining. OVS or SR-IOV is usually used in such a situation; however, neither meets all of the requirements for flexibility, maintainability and high performance at the same time. This is the reason why we have developed SPP.

8 SPP (Soft Patch Panel)
Change the network path with a patch-panel-like simple interface. High-speed packet processing with DPDK. Update the network configuration dynamically without terminating services.
SPP behaves like a patch panel: a patch panel is used for connecting LAN cables between servers, and SPP uses DPDK for connecting ports and forwarding packets. SPP makes it possible to change network paths between VMs through a patch-panel-like simple interface. SPP can also connect to applications running on the host, for example a vSwitch or a firewall, which means it allows users to combine any service components on the host and guests. If you want to apply filtering rules before forwarding packets to VMs, you can insert a security application into the path.

9 SPP (Soft Patch Panel) Multi-process Application
The primary process is a resource manager. Secondary processes are workers for packet forwarding. Several virtual ports are supported: ring PMD, vhost PMD, pcap PMD, etc.
SPP is derived from the DPDK multi-process sample application. Here is SPP running on a host. The primary process is a resource manager: it initializes mempools, mbufs and other DPDK resources and provides them to the secondary processes. The secondary processes are workers for packet forwarding. SPP supports several types of virtual port, such as ring, vhost and pcap, so you can send packets to any application running on the host or on guest VMs via SPP.
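To illustrate the primary/secondary split, this is the generic DPDK multi-process launch pattern. It is only a sketch using the EAL's --proc-type option, and the binary names are placeholders; SPP's actual binaries and options differ, see the SPP documentation:
$ sudo ./resource_manager -l 0 -n 4 --proc-type=primary   -- <app options>   # owns mempools/mbufs
$ sudo ./forwarder        -l 1 -n 4 --proc-type=secondary -- <app options>   # attaches to shared resources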

10 Performance: SPP, OVS-DPDK and SR-IOV through 1 ~ 8 VMs
Environment: CPU: Xeon E5-2690v3 (12 cores/24 threads), NIC: Intel X520 DP 10Gb DA/SFP+ Server Adapter, DPDK v16.07. Traffic: 64-byte packets at 10 Gbps. Topology: pktgen on Host#1, l2fwd VMs chained on Host#2 (SPP / OVS / SR-IOV), up to 8 VMs.
At the end of my part, I would like to show the results of a performance test: the throughput of inter-VM communication for SPP, OVS and SR-IOV. Here is the environment. The traffic is 10 Gbps with a packet size of 64 bytes. The VMs are connected in series, from 1 to 8 VMs; the maximum is 8 because of the limitation on the number of cores. We used DPDK v16.07, which is a little old, because we needed the ivshmem interface. Ivshmem is a shared memory mechanism for passing data directly between the host and guest VMs.

11 Performance SPP ring achieves the best performance
SPP ring achieves the best performance. SPP vhost keeps about 7 Mpps for up to 8 VMs.
Here is the result. The x-axis is the number of VMs and the y-axis is the throughput in Mpps. SPP ivshmem has the highest performance, keeping wire rate even with eight VMs. On the other hand, SPP vhost, the red line, is about 10 Mpps for one VM and about 7 Mpps for more VMs. The purple line is the result for SR-IOV: its performance is good for up to two VMs, but drops sharply for three or more. SPP provides high performance together with the flexibility of a patch-panel-like interface.

12 Can we bring performance tunings into the OpenStack world?
From now on, let's look into the OpenStack world. The main topic here is: can we bring those performance tunings from the DPDK world into the OpenStack world? I'll answer this question in this presentation.

13 A Basic Gap: DPDK-based VMs want to know the underlying hardware to tune performance as well as possible. OpenStack is a tool for *virtualization*: end users should never be aware of the underlying hardware.
First of all, this is a hint for answering the question. There is a basic gap between DPDK and OpenStack. In the DPDK world, VMs want to know the underlying hardware architecture as well as possible to get more performance. In the OpenStack world, on the other hand, we virtualize the hardware architecture: we hide the hardware from end users, and end users should never be aware of it.

14 It's not that easy. The gap makes it complex. Let's see the status today, the pain points, and possible improvements for "Rocky".
So it's not that easy: the gap between DPDK and OpenStack makes things complex. Let's see the status today, that is, how we enable those performance tunings in the OpenStack community, and let's also see the pain points. For some of those pain points, solutions are being proposed towards Rocky, so let's have a brief review of them in this presentation.

15 Agenda
DPDK: Strategies for High Performance; Examples of How to Configure; Motivation and Use Case - SPP (Soft Patch Panel). OpenStack: Bring the tuning settings to OpenStack (CPU pinning, Emulator threads policy, NUMA Architecture); Manage DPDK vhost-user interfaces.
The first item of this chapter is "Bring the tuning settings to OpenStack". For performance tuning we would like to talk here about the CPU pinning feature and the NUMA architecture.

16 Bring the tuning settings to OpenStack 1. CPU pinning
Agenda: Bring the tuning settings to OpenStack. 1. CPU pinning: 1-1. How to assign cores - Service Setup; 1-2. How to assign cores - VM deployment; 1-3. Pain Points in Queens; 1-4. Proposed improvements for Rocky. 2. Emulator threads policy. 3. NUMA Architecture.
Let's get started with the CPU pinning feature. This is about how to assign cores to the system, including the VMs and the vSwitch. We first look at how it works today, then at the pain points, and finally at the proposed improvements for the Rocky release.

17 1-1. How to assign cores - Service Setup
Isolate CPUs manually so that the CPUs used by the vSwitch and the VMs are not scheduled to other processes:
[/etc/default/grub]
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=2-15"
Set the CPUs used by the vSwitch via a configuration file ("0x3e" means CPUs 2-5):
[local.conf]
DPDK_PORT_MAPPINGS = 00:04.0#phys1#1#0x3e
Reserve the rest of the CPUs for VMs:
[local.conf]
vcpu_pin_set = 6-15
To assign cores to the services on a compute node, you set static parameters on that node. Before starting the OpenStack services, you need to set isolcpus in the Linux kernel parameters. This isolates CPUs from the OS scheduler, preventing them from being used by other processes, so that the vSwitch and the VMs can use all of the CPU time on those cores. If we want more performance, we should also assign CPUs to the vSwitch statically from the isolated CPUs. Finally, we reserve some of the isolated CPUs for VMs on the compute host before we actually deploy VMs; so far this is specified in the parameter called "vcpu_pin_set". All of this is done at service setup time, and then we start the OpenStack compute service.
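If you configure nova directly rather than through devstack's local.conf, the same reservation goes into nova.conf on the compute node; this is a Queens-era sketch, and the CPU range is just the example value from this slide:
# /etc/nova/nova.conf (compute node)
[DEFAULT]
vcpu_pin_set = 6-15    # CPUs that nova may hand out to instances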

18 1-2. How to assign cores - VM deployment
Use "cpu_policy" in the flavor extra specs.
hw:cpu_policy=shared (default): vCPUs "float" across all the CPUs in vcpu_pin_set.
hw:cpu_policy=dedicated: each vCPU is "pinned" to a dedicated CPU.
With the previous service setup configuration, we have 4 CPUs for the vSwitch and 10 CPUs for VMs; 2 are reserved for the OS. Now let's deploy VMs to this machine. By default, the vCPUs of a VM "float" across all the CPUs in the "vcpu_pin_set" that we configured before. With this you can use more virtual CPUs than there are physical CPUs, but you get less performance, and the performance is unpredictable because the CPUs are shared with other VMs. If you want to pin vCPUs to dedicated physical CPUs, use the cpu_policy option: if you set cpu_policy to dedicated, each vCPU is pinned to a dedicated CPU picked from vcpu_pin_set. With this you get more performance, but a lower accommodation rate, since the CPUs are not really virtualized any more.
[Slide diagram: host CPUs 0-15 split into OS, vSwitch, and vcpu_pin_set.] shared: can use more vCPUs than physical CPUs, but less and unpredictable performance. dedicated: more performance, but a lower accommodation rate.
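For reference, the policy is set as a flavor extra spec; "m1.dpdk" is just a placeholder flavor name:
$ openstack flavor set m1.dpdk --property hw:cpu_policy=dedicated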

19 (ref.) Performance Difference
PowerEdge R730: CPU: E5-2690v3 (2.60GHz, 12 cores), NIC: Intel X520 DP 10Gb DA/SFP+. DPDK 17.11, hugepages: 1GB. Traffic: 64-byte UDP.
Breaking the dedicated status of a CPU core, that is, breaking CPU pinning, is very critical for high performance compared to other tuning parameters. We ran an experiment with a sample DPDK application on a virtual machine on SPP and measured its traffic rate. We used DPDK 17.11 with 1 GB hugepages, and the load traffic was 64-byte UDP packets. When we used isolcpus and CPU pinning so that the DPDK application process was not disturbed by other processes, the throughput went up by nearly 3 times, which is a significant change.
[Chart: throughput (Gbps), "harmed by other process" vs. "isolcpus + CPU pinning"; about a 2.98x improvement.]

20 1-3. Pain Points 1
VMs with different "cpu_policy"s can't be colocated on the same host. This is because a "shared" VM would float on pCPUs pinned to a "dedicated" VM, which results in unpredictable performance degradation.
The pain point today is that we can't take both: we can't deploy VMs with both policies, shared and dedicated, to one host. This is because the vCPU threads of a "shared" VM float across physical CPUs, including the CPUs pinned to a VM with the "dedicated" policy. This breaks the "dedicated" status of the pinned vCPUs and results in unpredictable performance degradation across the whole system.
[Slide diagram: host CPUs 0-15 split into OS, vSwitch, and vcpu_pin_set, with a shared VM floating over the pinned CPUs.]

21 1-3. Pain Points 2
There is no way to assign both dedicated and shared vCPUs to one VM. We want to save cores for housekeeping tasks for the OS and for DPDK cores used for control tasks. Example architecture of a DPDK application: one master core and several slave cores; hw:cpu_policy=mixed is not supported!
In the same way, we have no way to assign both dedicated and shared virtual CPUs to one VM. This is needed because we want to save cores for housekeeping tasks of the operating system and for DPDK cores that run control threads. For example, our product SPP is based on DPDK's multi-process model, using one master core to talk to its controller and multiple slave cores to process data path packets. This means we want full performance for the slave cores, but we don't need it for the master core, so we would like to save cores for such controlling and managing threads.
[Slide diagram: host CPUs 0-15 with a VM whose master core would be shared and whose slave cores would be dedicated.]

22 1-4. Proposed improvements for Rocky
Service setup options: deprecate the "vcpu_pin_set" option and introduce "cpu_shared_set" and "cpu_dedicated_set" instead; they are reported to Placement as the "VCPU" and "PCPU" resource classes respectively. VM deployment options: deprecate the "hw:cpu_policy" option and request each resource class directly, e.g. "resources:VCPU=2&resources:PCPU=4".
To achieve this feature and architecture, a spec is now being proposed towards the Rocky release. Here we deprecate the "vcpu_pin_set" option, and instead introduce "cpu_shared_set" and "cpu_dedicated_set". They are reported to Placement as the "VCPU" and "PCPU" resource classes respectively: VCPU stands for shared CPUs, and PCPU stands for dedicated CPUs. These are set when the operator sets up a compute node. For VM deployment, we are going to deprecate the "hw:cpu_policy" option; instead, each resource class is requested directly, e.g. "resources:VCPU=2&resources:PCPU=4". The next page shows the process in detail.
spec: Standardize CPU resource tracking

23 1-4. Proposed improvements for Rocky
Set up compute hosts with both VCPUs and PCPUs, then simply request them via the flavor, e.g. resources:VCPU=1, resources:PCPU=3. One vCPU floats across the "VCPU"s; the other vCPUs are pinned to dedicated "PCPU"s.
First we set up a compute host with both shared CPUs and dedicated CPUs. They are reported periodically by the virt driver to the placement service, so placement knows this compute node has 2 shared CPUs and 8 dedicated CPUs. Now let's deploy a VM. A user simply requests both kinds of vCPUs via the VM's flavor, for example one shared CPU and three dedicated CPUs for one VM. Nova then asks placement which compute node is available, and deploys the VM if one is. The one shared vCPU floats across the CPUs in cpu_shared_set, and the other three vCPUs are pinned to dedicated CPUs in cpu_dedicated_set, so they never conflict any more. This spec is not yet approved, but it is ready for review, and NTT is willing to contribute to nova for this feature, so any feedback from NFV operators is very welcome.
[Slide diagram: the virt driver reports to the Placement Service that this compute node has 2 VCPUs and 8 PCPUs; host CPUs split into OS, vSwitch, cpu_shared_set and cpu_dedicated_set.]
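A sketch of how this would look once implemented; the option names follow the proposed spec, the values mirror the example above, and none of this is available in Queens:
# /etc/nova/nova.conf (compute node)
[compute]
cpu_shared_set = 4-5        # reported to Placement as VCPU
cpu_dedicated_set = 6-15    # reported to Placement as PCPU
# flavor requesting one shared and three dedicated CPUs ("m1.dpdk" is a placeholder)
$ openstack flavor set m1.dpdk --property resources:VCPU=1 --property resources:PCPU=3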

24 Bring the tuning settings to OpenStack 1. CPU pinning
Agenda: Bring the tuning settings to OpenStack. 1. CPU pinning. 2. Emulator threads policy: 2-1. What are emulator threads?; 2-2. Emulator threads policy options; 2-3. Pain points; 2-4. Proposed improvements for Rocky. 3. NUMA Architecture.
The next topic is the emulator threads policy, which is also important for tuning performance in OpenStack. In this chapter we first describe what emulator threads are and introduce the options available in OpenStack. We then look into the pain points of this feature and explain what it will become in the Rocky release.

25 2-1. What are emulator threads?
A VM (QEMU) process has "emulator threads" and "vCPU threads". vCPU threads: one thread per guest vCPU, used for guest CPU execution. Emulator threads: one or more threads per guest instance, not associated with any guest vCPU, used for the QEMU main event loop, asynchronous I/O operation completion, SPICE display I/O, etc.
$ pstree -p 2606
qemu-system-x86(2606)─┬─{qemu-system-x8}(2607)
                      ├─{qemu-system-x8}(2623)
                      ├─{qemu-system-x8}(2624)
                      ├─{qemu-system-x8}(2625)
                      └─{qemu-system-x8}(2626)
(threads 2623-2626 are the vCPU threads in this example)
Okay, let me explain what they are. A QEMU process consists of two kinds of threads: "emulator threads" and "vCPU threads". You can see them with the pstree command on Linux. There is one vCPU thread per guest vCPU, so if an instance has 4 vCPUs, its QEMU process has 4 vCPU threads; these threads are used for guest CPU execution. Emulator threads, on the other hand, are one or more threads per guest instance, not associated with any guest vCPU; they are used for the main QEMU event loop and I/O. In terms of performance tuning, you should take care not to let these emulator threads steal time from the vCPU threads, which run the actual instructions for fast data path packet processing.
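With the libvirt/KVM backend you can also inspect, per domain, which host CPUs the two kinds of threads are allowed to run on; the domain name below is a placeholder:
$ virsh vcpupin instance-00000001      # current CPU affinity of each vCPU thread
$ virsh emulatorpin instance-00000001  # current CPU affinity of the emulator threads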

26 2-2. Emulator threads policy options
You should take care not to let the emulator threads steal time from the vCPU threads, which run the actual instructions for fast data path packet processing.
hw:emulator_threads_policy=share (default): emulator threads run on the same CPUs as the vCPU threads.
hw:emulator_threads_policy=isolate: emulator threads are isolated to a dedicated CPU.
In the OpenStack world, by default, emulator threads run on the same physical CPUs as the vCPU threads. This is not optimal for performance, since it steals time from the vCPU threads, which run the actual instructions for fast data path packet processing. If you want to avoid this performance degradation, you can set the emulator_threads_policy option to isolate. With this option, the emulator threads are pinned to a dedicated physical CPU, so the vCPU threads are no longer disturbed and are completely isolated from the emulator threads.
[Slide diagram: host CPUs 0-15 (OS, vSwitch, vcpu_pin_set); with "share" the emulator threads sit on the vCPU threads' CPUs, with "isolate" they get their own CPU.]
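For reference, this policy is also a flavor extra spec and is used together with dedicated CPU pinning; "m1.dpdk" is a placeholder flavor name:
$ openstack flavor set m1.dpdk --property hw:cpu_policy=dedicated --property hw:emulator_threads_policy=isolate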

27 2-3. Pain Points
You should take care not to let the emulator threads steal time from the vCPU threads, which run the actual instructions for fast data path packet processing. With hw:emulator_threads_policy=isolate, emulator threads are isolated to a dedicated CPU. Question: do we want to consume one dedicated CPU for every instance's emulator threads? Not really: it is the vCPU threads that process the fast data path packets, not the emulator threads.
But a question remains: do we want to consume one dedicated CPU for every instance's emulator threads? The answer is no, because it is the vCPU threads that process fast data path packets, not the emulator threads; we don't need that performance for the emulator threads. In other words, the emulator threads could run on shared CPUs, which would give us a higher accommodation rate.
[Slide diagram: host CPUs 0-15 (OS, vSwitch, vcpu_pin_set) with one dedicated CPU consumed only by emulator threads.]

28 2-4. Proposed improvements for Rocky
"hw:emulator_threads_policy=share" will try to run the emulator threads on the CPUs in "cpu_shared_set" and fall back to the legacy behavior if that is unavailable. "hw:emulator_threads_policy=isolate" will remain the same.
hw:emulator_threads_policy=share (default): emulator threads try to float across the "VCPU"s.
hw:emulator_threads_policy=isolate: emulator threads are pinned to one dedicated CPU.
This is the view proposed for the Rocky release. The spec for this feature is already approved and implemented on the current master branch, which is very nice, and we want to say thank you to the community here. The behavior is that when we set emulator_threads_policy=share, which is the default, nova tries to run the emulator threads on the CPUs in "cpu_shared_set". If "cpu_shared_set" is not set by the operator, it falls back to the legacy behavior, where the emulator threads disturb the vCPU threads. The isolate policy remains the same, so it is still possible to pin the emulator threads to one dedicated physical CPU.
[Slide diagram: host CPUs 0-15 split into OS, vSwitch, cpu_shared_set and cpu_dedicated_set; with "share" the emulator threads float over cpu_shared_set.]
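A sketch of the Rocky-style configuration, assuming the operator has defined cpu_shared_set (the values and flavor name are placeholders):
# /etc/nova/nova.conf (compute node)
[compute]
cpu_shared_set = 4-5
# with "share", emulator threads float over cpu_shared_set instead of the instance's pinned CPUs
$ openstack flavor set m1.dpdk --property hw:cpu_policy=dedicated --property hw:emulator_threads_policy=share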

29 Bring the tuning settings to OpenStack 1. CPU pinning
Agenda: Bring the tuning settings to OpenStack. 1. CPU pinning. 2. Emulator threads policy. 3. NUMA Architecture: 3-1. What is NUMA?; 3-2. NUMA strategy in OpenStack; 3-3. Pain Points; 3-4. Proposed Improvements for Rocky.
Okay, now let's move on to the next topic, the NUMA architecture. We first explain what NUMA is, and then the NUMA strategy in OpenStack, especially for the KVM driver with libvirt. We also mention its pain points and the proposed improvements for the Rocky release.

30 3-1. What is NUMA? NUMA stands for Non-Uniform Memory Access.
The access cost to memory is different (not symmetric), and we want to avoid remote access for NFV applications. Therefore, in OpenStack with the libvirt/KVM backend, the NUMA architecture of an instance *always* reflects the underlying physical NUMA architecture (see the next page).
NUMA stands for Non-Uniform Memory Access. If your machine has a NUMA-designed memory architecture, the cost of a memory access differs depending on which NUMA node's memory you access, and the performance delta between local and remote access is not trivial for an NFV application. Therefore, if you use OpenStack with the libvirt/KVM backend, the virtual NUMA architecture of a virtual machine always reflects the underlying physical NUMA architecture; in other words, to some extent it is intentionally not virtualized.
[Slide diagram: two sockets, NUMA0 and NUMA1, each with its own cores and memory, showing local vs. remote access.]
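Before deciding on CPU and memory placement, the host's NUMA layout can be checked with standard Linux tools:
$ numactl --hardware     # nodes, CPU lists and memory size per node
$ lscpu | grep -i numa   # NUMA node count and per-node CPU lists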

31 (ref.) Performance Difference
PowerEdge R740: CPU: Xeon GOLD 5118 (2.30GHz), NIC: Intel X710 DP 10Gb DA/SFP+. DPDK 17.11, hugepages: 1GB. Traffic: 64-byte UDP.
This graph shows the performance difference between a setup with remote memory access and one with only local access. In this experiment we used a sample DPDK application on a virtual machine on SPP and measured its traffic rate, once with SPP and the VM on the same NUMA node and once with them on different NUMA nodes. We used DPDK 17.11 with 1 GB hugepages, and the load traffic was 64-byte UDP packets. The difference in throughput was about 1.75 times, which is quite significant for a network function. This is why we think being aware of the NUMA node is very important.
[Chart: throughput (Gbps), remote access vs. local access; about a 1.75x difference.]

32 3-2. NUMA strategy in OpenStack
Let's think about deploying instances to a host with 2 NUMA nodes. With the CPU pinning feature, nova picks the dedicated CPUs from only one NUMA node, and each VM's memory is allocated on the same NUMA node as its CPUs (hw:cpu_policy=dedicated).
In the OpenStack world, using the libvirt/KVM backend, the virtual NUMA architecture of a virtual machine always reflects the underlying physical NUMA architecture, which lets us stay NUMA-aware, and that is nice. Let me explain this point a bit more. Let's think about deploying VMs to a host with 2 NUMA nodes. If the CPU pinning feature is requested by a user, nova decides that the VM gets one instance NUMA node and picks the dedicated CPUs from only one physical NUMA node; it never picks dedicated CPUs from the other NUMA node. And each VM's virtual memory is allocated from physical memory on the same NUMA node as the dedicated physical CPUs on which its vCPUs run.
[Slide diagram: host CPUs 0-31 across NUMA0 and NUMA1 (OS, vSwitch, and a vcpu_pin_set per node), with VM1 and VM2 each pinned within a single node.]

33 3-2. NUMA strategy in OpenStack
The host has 18 CPUs left: 6 from NUMA node 0 and 12 from NUMA node 1. Can we deploy a VM with 16 "dedicated" CPUs to this host? The answer is "No", because neither NUMA node has room for that.
In the situation in this picture, the host has 26 available cores and 8 of them are in use, so 18 CPUs are left: 6 free CPUs on NUMA node 0 and 12 free CPUs on NUMA node 1. The question is: can we deploy a single-NUMA-node instance with 16 CPUs and the "dedicated" cpu policy to this host? The answer is no. We have 18 free CPUs on the host, but we do not have 16 free CPUs on any single NUMA node. If you want to use this host, you have to specify a 2-NUMA-node VM with 6 CPUs on one node and 10 CPUs on the other. This is what I meant when I said the NUMA architecture is not virtualized in OpenStack with KVM.
[Slide diagram: host CPUs 0-31 across NUMA0 and NUMA1 (OS, vSwitch, and a vcpu_pin_set per node), with VM1 and VM2 already pinned.]
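One way to fit such a host is to request a two-node guest topology explicitly via flavor extra specs. This is only an illustrative sketch: the flavor name and memory split are placeholders, it assumes the flavor has 16 vCPUs and 6144 MB of RAM, and the vCPU split (6 + 10) mirrors the example above:
$ openstack flavor set m1.numa \
    --property hw:cpu_policy=dedicated \
    --property hw:numa_nodes=2 \
    --property hw:numa_cpus.0=0,1,2,3,4,5 --property hw:numa_mem.0=2048 \
    --property hw:numa_cpus.1=6,7,8,9,10,11,12,13,14,15 --property hw:numa_mem.1=4096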

34 3-3. Pain Points/Improvements for Rocky
NUMA node information is very important for deployment. However, the placement API exposes no information about NUMA nodes. In Rocky, we propose using placement to see NUMA resources.
Today: Compute host - PCPU:26 (used:8), MEMORY_MB:4096 (used:2048), DISK_GB:300 (used:200).
Proposed: Compute host - DISK_GB:300 (used:200); NUMA 0 - PCPU:10 (used:4), MEMORY_MB:2048 (used:1024); NUMA 1 - PCPU:16 (used:4), MEMORY_MB:2048 (used:1024).
NUMA node information is very important for NFV operators at deployment time, but the placement API only tells us the total CPUs on a compute host. So for Rocky we propose using placement to see NUMA resources. The spec is being proposed and is under review. Modeling a compute host's NUMA architecture in placement would be optional and done via the nova configuration file. We are fine with that proposal and would be grateful if this feature is approved.
spec: Enable NUMA in Placement
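For reference, the inventories that placement exposes today can be listed with the osc-placement CLI plugin, if it is installed; under the proposal each NUMA node would additionally appear as a child resource provider (the provider UUID is a placeholder):
$ openstack resource provider list
$ openstack resource provider inventory list <provider-uuid>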

35 Manage DPDK vhost-user interfaces
vhost-user is the standard choice of interface for communication between DPDK-based switches and VMs. However, the number of interfaces is limited to 32 ports by default because of the memory requirements. The number of SR-IOV VFs is limited in a similar way and is going to be managed in Placement; we would like to have a similar solution for managing vhost-user interfaces.
When you use SPP or other DPDK vSwitches, vhost-user is the standard choice of interface for communication between the switch and the VMs on a compute node. However, the number of interfaces is limited to 32 ports by default because of DPDK's memory requirements. The number of SR-IOV VFs is limited in a similar way and is going to be managed in Placement, but so far there is no way to manage the number of vhost-user ports, so we would like to have the same scheme for vhost-user interfaces.
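As an illustration of the interface itself (not of how SPP manages its ports), a vhost-user port can be created on the DPDK side with the net_vhost virtual device, for example in testpmd; the core list and socket path are placeholders:
$ sudo ./testpmd -l 1-2 -n 4 --vdev 'net_vhost0,iface=/tmp/sock0,queues=1' -- -i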

36 Summary
We introduced our DPDK product "Soft Patch Panel (SPP)", which is available from OpenStack using "networking-spp". There are many parameters for performance tuning for SPP as well as for other DPDK applications. We already have several schemes to tune these parameters in OpenStack, and for further optimization, new features are being proposed and are under community review for the Rocky release.
Soft Patch Panel: networking-spp: ...feel free to contact us!!


Download ppt "Integrating OpenStack with DPDK for High Performance Applications"

Similar presentations


Ads by Google