
1 Consideration and Design for High Performance Packet Forwarding with DPDK
Tetsuya Murakami (tetsuya.murakami@ipinfusion.com), CTO, IP Infusion

2 To achieve high performance…
Optimizing the code…
Utilizing DPDK APIs…
Utilizing CPU extension instructions…

3 Topics to achieve high performance
CPU Instruction Cache…
CPU Data Cache…
Non-voluntary Context Switch…
Disclaimer: the analysis/investigation in the following slides depends on the CPU architecture; it was done on an Intel Sandy Bridge CPU.

4 CPU Instruction Cache (1)
[Diagram: Rx thread on Core#0 and worker threads (ACL, IPSec, L2TP) on Core#1, Core#3, and Core#4, each with its own instruction cache; a worker's instruction fetch misses the cache. Test machine: Intel Core i (Sandy Bridge), 3.3 GHz, 32 nm; RAM: 16 GB (4 x 4 GB), 667 MHz.]
Maximum instruction cache penalty: 28 cycles + 49 ns = 57.5 ns.
Packet arrival interval on a 10 Gbps interface with the shortest (64-byte) packets: 67 ns.
When an instruction cache miss happens, the time slot for almost one packet is wasted.
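The arithmetic behind these two numbers can be checked directly; a small C program (illustrative only, using the cycle and latency figures from the slide) reproduces them:

    #include <stdio.h>

    int main(void)
    {
        /* Miss penalty from the slide: ~28 CPU cycles of cache latency
         * plus ~49 ns of memory latency, at 3.3 GHz. */
        double penalty_ns = 28.0 / 3.3 + 49.0;            /* ~57.5 ns */

        /* Shortest Ethernet frame on the wire: 64B frame + 8B preamble
         * + 12B inter-frame gap = 84B = 672 bits at 10 Gbps. */
        double interval_ns = 672.0 / 10.0;                /* ~67.2 ns */

        printf("max cache-miss penalty : %.1f ns\n", penalty_ns);
        printf("64B packet interval    : %.1f ns\n", interval_ns);
        return 0;
    }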

5 CPU Instruction Cache (2)
[Diagram: the code for the Rx thread and the ACL, IPSec, and L2TP worker threads is aligned as much as possible with the per-core instruction caches on Core#0, Core#1, Core#3, and Core#4.]
Rx thread:

    static inline int __attribute__((section("fwdr_common_in_main")))
    hsl_validate_rcvd_pkt (hsl_fib_id_t fib_id, struct hsl_ip *iph)
    {
        .....
    }

    inline void __attribute__((section("fwdr_common_in_main")))
    hsl_ip_decrease_ttl (struct hsl_ip *iph)
    .....

    int __attribute__((section("fwdr_common_in_main")))
    hsl_mpls_pkt_rcv (struct hsl_if *ifpl3, u_char *pkt, int pkt_len, void *hw_pkt)
    ....

IPSec sub thread:

    int __attribute__((section("fwdr_common_in_sub_ipsec")))
    hsl_ipsec_encrypt (struct hsl_if *ifpl3, struct hsl_if *nh_if, void *hw_pkt)
    {
        .....
    }

    int __attribute__((section("fwdr_common_in_sub_ipsec")))
    hsl_ipsec_decrypt4 (struct hsl_if *ifpl3, u_char *pkt, void *hw_pkt)
    ....

    int __attribute__((section("fwdr_common_in_sub_ipsec")))
    hsl_hw_pkt_ipsec_rcv (struct hsl_if *l3ifp_src, hsl_fib_id_t fib_id, void *hw_pkt)
    .....
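The section attributes above only name the sections; placing those sections next to each other in the final binary is a link-time decision. A GNU ld fragment along these lines (an assumption about the build setup, not shown in the slides) could be passed via -Wl,-T to keep each thread's fast-path code contiguous:

    /* hot.lds -- hypothetical partial linker script */
    SECTIONS
    {
        .fwdr_hot : {
            *(fwdr_common_in_main)        /* Rx-thread fast path    */
            *(fwdr_common_in_sub_ipsec)   /* IPSec worker fast path */
        }
    }
    INSERT AFTER .text;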

6 CPU Data Cache (1)
[Diagram: the Rx thread on Core#0 hands packets to worker threads (ACL, IPSec, L2TP) on Core#1, Core#3, and Core#4; as a worker's processing time grows, the packet's lines in the Rx thread's data cache expire.]
If the processing time on a worker thread (e.g., ACL) gets longer, the packet data cached by the Rx thread may expire.
After processing, the worker thread may hand the packet back to the Rx thread to forward it based on the IP/MPLS forwarding table; by then, the packet's data cache on the Rx thread may have expired.
The automatic (hardware) prefetcher does not help here: it only covers sequential memory accesses, and the packet data is not in contiguous memory.
Prefetch instructions can be used, but a prefetch request tends to be dropped once the number of outstanding prefetch requests reaches the hardware maximum.
Maximum data cache penalty: 28 cycles + 49 ns = 57.5 ns.
Packet arrival interval on a 10 Gbps interface with the shortest packets: 67 ns.
If the packet's data cache has expired, the time slot for almost one packet is wasted.
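DPDK exposes such software prefetches through rte_prefetch0(). A minimal sketch of the usual idiom (prefetching a few packets ahead while processing a burst; the look-ahead depth of 4 is illustrative):

    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    static void
    process_burst(struct rte_mbuf **pkts, uint16_t nb_rx)
    {
        uint16_t i;

        /* Prime the cache with the first few packets. */
        for (i = 0; i < nb_rx && i < 4; i++)
            rte_prefetch0(rte_pktmbuf_mtod(pkts[i], void *));

        for (i = 0; i < nb_rx; i++) {
            /* Prefetch 4 packets ahead; the hint may be dropped if too
             * many prefetch requests are already outstanding. */
            if (i + 4 < nb_rx)
                rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 4], void *));
            /* ... parse/process the headers of pkts[i] ... */
        }
    }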

7 CPU Data Cache (2)
[Diagram: the worker threads touch the packet data periodically, updating the packet's data cache on the Rx thread's core.]
To keep the data cache for a given packet from expiring on the Rx thread, a worker thread touches the packet data periodically whenever it will hand the packet back to the Rx thread, as sketched below.
Merely reading the packet data on the worker thread cannot update the packet's data cache on the Rx thread; the cache is updated only when the worker thread writes the packet data.
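A minimal sketch of such a "touch" (touch_pkt_lines is a hypothetical helper, not from the slides; per the slide, a write rather than a plain read is what updates the cache state across cores):

    #include <stdint.h>
    #include <rte_mbuf.h>

    static inline void
    touch_pkt_lines(struct rte_mbuf *m)
    {
        volatile uint8_t *p = rte_pktmbuf_mtod(m, volatile uint8_t *);
        uint32_t off, len = rte_pktmbuf_data_len(m);

        /* Write one byte per 64B cache line; the value is unchanged,
         * but the store dirties the line. */
        for (off = 0; off < len; off += 64)
            p[off] = p[off];
    }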

8 CPU Data Cache Alignment (1)
[Diagram: an L1 data cache (8 KB: 64B x 64 blocks, 2-way) over a contiguous memory pool; packet buffers at offsets 0, 4096, 8192, ..., 61440 all place their headers in the 10th block of a 4096B stride, so they compete for the same cache index and cannot all be cached.]
The CPU data cache is indexed by the position of the target 64B line within a 4096B indexing stride.
If the packet buffer size is the same as the stride, every packet selects the same index, because the normally accessed packet headers (L2, L3, L4) sit at the same position in each packet buffer.

9 CPU Data Cache Alignment (2)
[Diagram: with 4160B packet buffers (one extra 64B block) at offsets 0, 4160, 8320, ..., 58240, the header falls in the 10th, 11th, 12th, ..., 24th block of successive strides, so the buffers map to different cache indexes.]
By adding one extra 64B block per buffer, the packet buffer size no longer matches the data cache indexing stride.
Even when the same area of the packet buffer is accessed, the index within the 64B x 64-block stride now differs from buffer to buffer, so multiple packets can be cached at the same time.
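The effect of the extra 64B block can be demonstrated with a small model of the indexing scheme described above (illustrative only; it assumes the slide's 64B x 64-block, 4096B indexing stride):

    #include <stdio.h>
    #include <stdint.h>

    /* Cache index of an address under a 4096B stride of 64B lines. */
    static unsigned set_index(uintptr_t addr) { return (addr / 64) % 64; }

    int main(void)
    {
        int i;

        /* 4096B buffers: the header at offset 640 (the 10th block)
         * maps to the same index in every buffer. */
        for (i = 0; i < 4; i++)
            printf("4096B buffer %d: header -> block %u\n",
                   i, set_index((uintptr_t)i * 4096 + 640));

        /* 4160B buffers (one extra 64B block): the same header offset
         * maps to a different index per buffer (10th, 11th, 12th, ...). */
        for (i = 0; i < 4; i++)
            printf("4160B buffer %d: header -> block %u\n",
                   i, set_index((uintptr_t)i * 4160 + 640));
        return 0;
    }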

10 Non-voluntary Context Switch (1)
Normally a DPDK application busy-loops on a dedicated CPU core in order to poll for received packets, process them, and forward them:

    /* main processing loop */
    static int
    main_loop(__attribute__((unused)) void *dummy)
    {
        struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
        ....
        while (1) {
            cur_tsc = rte_rdtsc();
            ....
        }
    }

The DPDK application runs on a dedicated CPU core and never issues system calls, precisely to avoid context switches.
However, context switches still happen periodically on the CPU core where the DPDK application is running.
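For reference, a complete minimal version of such a poll loop (a sketch, not the slide's actual code; port 0, queue 0, and the burst size are illustrative, and error handling is omitted):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define MAX_PKT_BURST 32

    static int
    main_loop(__attribute__((unused)) void *dummy)
    {
        struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
        uint16_t nb_rx;

        /* Busy loop on a dedicated core: no blocking system calls. */
        while (1) {
            nb_rx = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
                                     pkts_burst, MAX_PKT_BURST);
            /* ... classify/process pkts_burst[0..nb_rx-1] ... */
            if (nb_rx > 0)
                rte_eth_tx_burst(0 /* port */, 0 /* queue */,
                                 pkts_burst, nb_rx);
        }
        return 0;
    }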

11 Non-voluntary Context Switch (2)

    bash-4.2# ps ax | grep hsl
      2182 ?        Ssl  182:56 /usr/sbin/hsl -d
    bash-4.2# ls /proc/2182/task/
    2182  2203  2210  2234  2238  2268
    2187  2204  2225  2237  2267
    bash-4.2# cat /proc/2182/task/2226/status
    Name:   lcore-slave-1
    State:  R (running)
    Tgid:   2182
    Pid:    2226
    PPid:   1
    TracerPid: 0
    <snip...>
    Threads: 15
    <snip...>
    Cpus_allowed: 02
    Cpus_allowed_list: 1
    <snip...>
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 2

Non-voluntary context switches happened on the CPU core where the DPDK application is running (the lcore thread is pinned to core 1, per Cpus_allowed_list). When a context switch happens, packet processing stops and some packets can be dropped.
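The same counters can also be polled programmatically; a minimal sketch that prints a thread's context-switch counters from procfs (/proc/thread-self requires Linux 3.17 or later; on older kernels use /proc/self/task/<tid>/status):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/thread-self/status", "r");

        if (f == NULL)
            return 1;
        /* Print the voluntary/nonvoluntary context-switch lines. */
        while (fgets(line, sizeof(line), f) != NULL)
            if (strstr(line, "ctxt_switches") != NULL)
                fputs(line, stdout);
        fclose(f);
        return 0;
    }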

12 Non-voluntary Context Switch (3)
Who is causing the non-voluntary context switches on the CPU core where the DPDK application is running? The kworker threads in the Linux kernel:

    bash-4.2# ps ax | grep kworker
        5 ?        S      0:00 [kworker/u:0]
        8 ?        S      0:00 [kworker/1:0]
       10 ?        S      0:00 [kworker/0:1]
       12 ?        S      0:00 [kworker/2:0]
       15 ?        S      0:00 [kworker/3:0]
       18 ?        S      0:00 [kworker/4:0]
       21 ?        S      0:00 [kworker/5:0]
       24 ?        S      0:00 [kworker/6:0]
       37 ?        S      0:00 [kworker/6:1]
       38 ?        S      0:00 [kworker/3:1]
       53 ?        S      0:00 [kworker/4:1]
       54 ?        S      0:00 [kworker/5:1]
       55 ?        S      0:00 [kworker/1:1]
       59 ?        S      0:00 [kworker/u:1]
     1096 ?        S      0:00 [kworker/2:2]
     2267 ?        S      0:00 [kworker/0:2]

What is kworker? A kworker is a Linux kernel thread that processes deferred kernel "work" (work queues). Several of them appear in the process list: kworker/0:1 is one on the first CPU core, kworker/1:1 one on the second, and so on.
Why does kworker hog the CPU? To find out why a kworker is consuming CPU time, capture CPU backtraces: watch the processor load (with top or similar) and, during a moment of high kworker load, run "echo l > /proc/sysrq-trigger" (on Ubuntu this requires a root shell, e.g. via sudo -s). Repeat this several times, then look at the backtraces near the end of the dmesg output; whatever appears most frequently should point to the source of the problem.
It looks like the kworker threads are needed by the Linux kernel to handle events such as I/O completions, timers, and interrupts. However, not all of those events need to be handled on the CPU cores where DPDK applications are running, and Linux provides no option to disable individual event sources for the kworker threads.
"echo … > /proc/sys/vm/stat_interval" merely increases the interval at which the vmstat kworker runs.
A Linux platform better suited to VNF solutions may be needed.

13 Performance (1)
[Test topology: IXIA <-> SR-IOV Intel 10G NIC (ixgbe driver) <-> IPI VirNOS VM on KVM <-> SR-IOV Intel 10G NIC (ixgbe driver) <-> IXIA, with IPv4/IPv6 frames; host OS: CentOS 6.5.]
Host machine spec:
CPU: Intel Xeon, 3.4 GHz
RAM: 16 GB
NIC: Intel 82599EB 10 Gigabit SFI/SFP+ (VF)
VM spec:
CPU: 5 vCores
RAM: 8 GB
NIC: SR-IOV

14 Performance (2)
Sent rate (PPS): 14.8 Mpps (10 Gbps with 64-byte packets)
Received rate (PPS): 14.8 Mpps (10 Gbps with 64-byte packets)


