Data Plane Acceleration
Receiving Packets at the Speed of Light
Alexander Duyck, Networking Services Team, Red Hat
February 6th, 2015
Agenda
- Identifying the problem
  - Why is it difficult to receive packets at line rate?
  - Where do the bottlenecks lie?
- Fixing the problem
  - xmit_more
  - dma_rmb()/dma_wmb()
  - napi_alloc_skb()/napi_alloc_frag()
  - fib_table_lookup()
- What is left to be done
Identifying the Problem
It is possible with netperf to receive at 10 Gb/s line rate with 1514-byte frames, using TSO on transmit and GSO on receive; the maximum frame rate at that size is 812,744 pps. With 60-byte frames, line rate is much more difficult to reach. Each frame carries only 24 B of additional on-wire overhead (preamble/SFD, FCS, and inter-frame gap), so a minimal frame occupies 84 B on the wire:
(1 Gb/s × 125 MB/Gb) / 84 B/frame = 1.488 Mpps
(10 Gb/s × 125 MB/Gb) / 84 B/frame = 14.88 Mpps
Why Is It Difficult to Receive Packets at Line Rate?
It is a simple matter of timing:
(84 B/p × 8 b/B) / 1 Gb/s = 672 ns/p
(84 B/p × 8 b/B) / 10 Gb/s = 67.2 ns/p
(84 B/p × 8 b/B) / 100 Gb/s = 6.72 ns/p
L2 cache latency on Ivy Bridge is about 12 cycles, and a 3.4 GHz i7-4930K executes 3.4 cycles per nanosecond, so a single L2 hit costs 12 cycles / 3.4 cycles/ns ≈ 3.5 ns.
Where Do the Bottlenecks Lie?
Bottlenecks can be broken into 3 categories:
- MMIO and barriers: readl()/writel(), rmb()/wmb()
- Atomic operations: get_page()/put_page(), spin_lock()/spin_unlock()
- Large memory footprint and memory initialization: fib_table_lookup(), build_skb(), memcpy()
Fixing the Problem: xmit_more
Coalesces the MMIO doorbell write over multiple frames, reducing transmit time by up to 100 ns per frame. Also adds a burst option to pktgen, allowing line-rate testing at 10 Gb/s using a single CPU.
Fixing the Problem: dma_rmb()/dma_wmb()
Drivers used rmb() to order the read of the descriptor status against the read of its data, but rmb() enforces ordering between coherent and non-coherent accesses and maps to expensive primitives such as lfence. What drivers actually need is coherent-to-coherent ordering: x86 is already strongly ordered, so no barrier instruction is required, while other architectures can use lighter primitives such as lwsync. Total savings: about 7 ns per frame on an i7-4930K.
Fixing the Problem: napi_alloc_skb()/napi_alloc_frag()
netdev_alloc_frag() used per-CPU variables with local_irq_save()/local_irq_restore() to prevent contention. NAPI context is a per-CPU softirq, so no other NAPI instance can run on the same CPU while it is active. By adding napi_alloc_frag() we can drop the local_irq_save() and save 11 ns per allocated frame.
Fixing the Problem: fib_table_lookup() (Ongoing)
Multiple trie structures represent the forwarding tables. The longest-prefix-match algorithm was O(N²) and has been reduced to O(N), bringing worst-case lookup time from nearly 1 µs down to 140 ns. Each trie node still consumes up to 2 cache lines and could be reduced to 1, and the leaf_info objects could be removed by linking each leaf directly to its FIB alias list.
What Is Left to Be Done
- Further work still needed for fib_table_lookup()
- Explore optimizing primitives for memset()/memcpy(): build_skb() - 20 ns, memcpy() - 9 ns
- Work ongoing to batch-allocate memory: kmem_cache_alloc_array()/kmem_cache_free_array()
- Explore batching in other areas of the receive path
- Consider aggregating frames from the same flow to coalesce costs such as fib_table_lookup()