Data Plane Acceleration
Receiving Packets at the Speed of Light
Alexander Duyck, Networking Services Team, Red Hat
February 6th, 2015
Agenda
- Identifying the problem
  - Why is it difficult to receive packets at line rate?
  - Where do the bottlenecks lie?
- Fixing the problem
  - xmit_more
  - dma_rmb()/dma_wmb()
  - napi_alloc_skb()/napi_alloc_frag()
  - fib_table_lookup()
- What is left to be done
Identifying the Problem
It is possible with netperf to receive at 10 Gb/s line rate with 1514-byte frames, using TSO on transmit and GSO on receive; the maximum frame rate at that size is 812,744 pps. With 60-byte frames, line rate is much more difficult to reach. Each frame carries only 24 B of additional on-wire overhead (preamble/SFD, FCS, and inter-frame gap), so a minimal frame occupies 84 B on the wire:
(1 Gb/s × 125 MB/Gb) / 84 B/frame = 1.488 Mpps
(10 Gb/s × 125 MB/Gb) / 84 B/frame = 14.88 Mpps
Why Is It Difficult to Receive Packets at Line Rate?
It is a simple matter of timing:
(84 B/p × 8 b/B) / 1 Gb/s = 672 ns/p
(84 B/p × 8 b/B) / 10 Gb/s = 67.2 ns/p
(84 B/p × 8 b/B) / 100 Gb/s = 6.72 ns/p
L2 cache latency on Ivy Bridge is about 12 cycles, and a 3.4 GHz i7-4930K executes 3.4 cycles per nanosecond, so a single L2 hit costs 12 cycles / 3.4 cycles/ns ≈ 3.5 ns.
Where Do the Bottlenecks Lie?
Bottlenecks can be broken into 3 categories:
- MMIO and barriers: readl()/writel(), rmb()/wmb()
- Atomic operations: get_page()/put_page(), spin_lock()/spin_unlock()
- Large memory footprint and memory initialization: fib_table_lookup(), build_skb(), memcpy()
Fixing the Problem: xmit_more
Coalesces the MMIO doorbell write over multiple frames, reducing transmit time by up to 100 ns per frame. Also adds a burst option to pktgen, allowing line-rate testing at 10 Gb/s using a single CPU.
Fixing the Problem: dma_rmb()/dma_wmb()
Drivers used rmb() to order the read of the descriptor status against the read of its data, but rmb() enforces ordering between coherent and non-coherent accesses and maps to expensive primitives such as lfence. What drivers actually need is coherent-to-coherent ordering: x86 is already strongly ordered, so no barrier instruction is required, while other architectures can use lighter primitives such as lwsync. Total savings: about 7 ns per frame on an i7-4930K.
Fixing the Problem: napi_alloc_skb()/napi_alloc_frag()
netdev_alloc_frag() used per-CPU variables with local_irq_save()/local_irq_restore() to prevent contention. NAPI context is a per-CPU softirq, so no other NAPI instance can run on the same CPU while it is active. By adding napi_alloc_frag() we can drop the local_irq_save() and save 11 ns per allocated frame.
Fixing the Problem: fib_table_lookup() (Ongoing)
Multiple trie structures represent the forwarding tables. The longest-prefix-match algorithm was O(N²) and has been reduced to O(N), bringing worst-case lookup time from nearly 1 µs down to 140 ns. Each trie node still consumes up to 2 cache lines and could be reduced to 1, and the leaf_info objects could be removed by linking each leaf directly to its FIB alias list.
What Is Left to Be Done
- Further work still needed for fib_table_lookup()
- Explore optimizing primitives for memset()/memcpy(): build_skb() - 20 ns, memcpy() - 9 ns
- Work ongoing to batch-allocate memory: kmem_cache_alloc_array()/kmem_cache_free_array()
- Explore batching in other areas of the receive path
- Consider aggregating frames from the same flow to coalesce costs such as fib_table_lookup()