TX bulking and qdisc layer

TX bulking and qdisc layer
Session: Linux packet processing performance improvements
Jesper Dangaard Brouer (Red Hat)
John Fastabend (Intel)
John Ronciak (Intel)
Linux Plumbers Conference, 16th Oct 2014

What will you learn?
- Unlocking the full potential of the driver (TX only)
- The xmit_more API for bulking
- The challenge: bulking without adding latency
- Qdisc layer bulk dequeue, which depends on BQL
- Existing aggregation: GSO/GRO
- Qdisc locking is nasty; amortizing the locking cost
- Future: lockless qdisc
- What about RX?

Unlocked: Driver TX potential
- Pktgen: 14.8 Mpps on a single core (10G wirespeed)
- Primary trick: bulking packet descriptors to HW
- What is going on: defer the tailptr write that notifies the HW
  - A very expensive write to non-cacheable memory
  - Hard to perf profile: the write to the device does not show up at the MMIO point, so the next LOCK op is likely "blamed"

API: skb->xmit_more
- The SKB is extended with an xmit_more indicator
- The stack uses this to indicate that another packet will be given immediately after ->ndo_start_xmit() returns
- Driver usage (see the sketch below): unless the TX queue is filled, simply add the packet to the TX queue and defer the expensive notification to the HW
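
Below is a minimal sketch of how a driver might honor the indicator. It assumes the 2014-era skb->xmit_more field (later replaced by netdev_xmit_more()); the mydrv_*/my_* helpers are hypothetical, not real kernel APIs.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch only: mydrv_get_ring(), my_queue_tx_desc(), my_ring_full() and
 * my_write_tailptr() are hypothetical driver helpers. */
static netdev_tx_t mydrv_start_xmit(struct sk_buff *skb,
                                    struct net_device *dev)
{
        struct mydrv_ring *ring = mydrv_get_ring(dev, skb);

        my_queue_tx_desc(ring, skb);    /* post descriptor, no HW notify yet */

        /* Only do the expensive tailptr write (non-cacheable MMIO) when the
         * stack says no further packet follows, or the ring is filling up. */
        if (!skb->xmit_more || my_ring_full(ring))
                my_write_tailptr(ring);

        return NETDEV_TX_OK;
}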

Challenge: bulking without added latency
- Hard part: use the bulk API without adding latency
- Principle: only bulk when really needed
  - Based on a solid indication from the stack
- Do NOT speculatively delay TX
  - Don't bet on packets arriving shortly
  - Hard to resist... as benchmarking would look good

Use SKB lists for bulking
- Changed: the stack xmit layer was adjusted to work with SKB lists
  - Simply uses the existing skb->next ptr, e.g. see dev_hard_start_xmit()
  - The skb->next ptr is simply used as the xmit_more indication
- Lock amortization (see the sketch below)
  - The TXQ lock is no longer a per-packet cost
  - dev_hard_start_xmit() sends an entire SKB list while holding the TXQ lock (HARD_TX_LOCK)
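
A simplified sketch of the list-walking pattern, loosely modeled on the dev_hard_start_xmit()/sch_direct_xmit() path; error handling, stats and the requeue path are omitted, and xmit_skb_list_sketch() is an illustrative name.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Send an entire SKB list while taking the TXQ lock once, instead of
 * locking per packet. Sketch only; the real code lives in net/core/dev.c. */
static netdev_tx_t xmit_skb_list_sketch(struct sk_buff *first,
                                        struct net_device *dev,
                                        struct netdev_queue *txq)
{
        struct sk_buff *skb = first;
        netdev_tx_t ret = NETDEV_TX_OK;

        HARD_TX_LOCK(dev, txq, smp_processor_id());
        while (skb) {
                struct sk_buff *next = skb->next;

                skb->next = NULL;
                /* "next != NULL" is exactly the xmit_more indication the
                 * driver sees via ->ndo_start_xmit(). */
                ret = netdev_start_xmit(skb, dev, txq, next != NULL);
                if (!dev_xmit_complete(ret))
                        break;
                skb = next;
        }
        HARD_TX_UNLOCK(dev, txq);
        return ret;
}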

Existing aggregation in the stack: GRO/GSO
- The stack already has packet aggregation facilities
  - GRO (Generic Receive Offload)
  - GSO (Generic Segmentation Offload)
  - TSO (TCP Segmentation Offload)
- Allowing bulking of these introduces no added latency
- The xmit layer adjustments allowed this: validate_xmit_skb() handles segmentation if needed (roughly as sketched below)
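
A rough illustration of the GSO part of that step: segment a GSO skb into an skb->next-linked list just before handing it to the driver, so the layers above can keep working on one big skb. The helper name is illustrative, the real validate_xmit_skb() also handles checksum fallback and linearization, and the netif_needs_gso() signature shown here follows current kernels.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static struct sk_buff *segment_if_needed(struct sk_buff *skb)
{
        netdev_features_t features = netif_skb_features(skb);

        if (netif_needs_gso(skb, features)) {
                struct sk_buff *segs = skb_gso_segment(skb, features);

                if (IS_ERR(segs))
                        return NULL;    /* real code frees skb and counts the drop */
                if (segs) {
                        consume_skb(skb);
                        skb = segs;     /* segments linked via skb->next */
                }
        }
        return skb;
}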

Qdisc layer bulk dequeue
- A queue in a qdisc is a very solid opportunity for bulking
  - Packets are already delayed, so it is easy to construct an skb list
  - Rare case of actually reducing latency
- Decreases the cost of dequeue (locks) and HW TX (see the sketch below)
  - Before: a per-packet cost
  - Now: the cost is amortized over packets
- Qdisc locking has extra locking cost
  - Due to the __QDISC___STATE_RUNNING state
  - Only a single CPU runs dequeue (per qdisc)
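
A simplified sketch of the bulk-dequeue idea, loosely based on the sch_generic.c logic; the real try_bulk_dequeue_skb() also enforces the single-TXQ restriction and validates the skbs before xmit.

#include <net/sch_generic.h>

/* Dequeue a burst of packets into an skb->next-linked list, bounded by
 * the BQL byte budget, so the TXQ lock and HW tailptr write are paid
 * once per list instead of once per packet. Sketch only. */
static struct sk_buff *bulk_dequeue_sketch(struct Qdisc *q,
                                           struct netdev_queue *txq)
{
        struct sk_buff *head = q->dequeue(q);
        struct sk_buff *tail = head;
        int bytelimit;

        if (!head)
                return NULL;

        bytelimit = qdisc_avail_bulklimit(txq) - qdisc_pkt_len(head);
        while (bytelimit > 0) {
                struct sk_buff *skb = q->dequeue(q);

                if (!skb)
                        break;
                bytelimit -= qdisc_pkt_len(skb);
                tail->next = skb;
                tail = skb;
        }
        tail->next = NULL;

        return head;    /* caller xmits the whole list under one TXQ lock */
}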

Qdisc locking is nasty
- Always 6 LOCK operations (6 * 8ns = 48ns):
  1. Lock qdisc root_lock (also for the direct xmit case)
     - Enqueue + possible dequeue
     - Enqueue can exit if another CPU is running dequeue
     - Dequeue takes __QDISC___STATE_RUNNING
  2. Unlock qdisc root_lock
  3. Lock TXQ
     - Xmit to HW
  4. Unlock TXQ
  5. Lock qdisc root_lock (can release STATE_RUNNING)
     - Check for more/newly enqueued pkts
     - Softirq reschedule (if out of quota or need_resched)
  6. Unlock qdisc root_lock

Qdisc bulking needs BQL
- Qdisc bulking is only supported for BQL-enabled drivers
  - Implement BQL in your driver now! (see the sketch below)
- Needed to avoid overshooting NIC capacity
  - Overshooting causes requeueing of packets
  - The current qdisc layer requeue causes Head-of-Line blocking
  - Future: better requeue handling in individual qdiscs?
- Extensive experiments show BQL is very good at limiting requeues
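
For drivers that have not implemented it yet, BQL accounting boils down to two calls: report bytes posted on xmit and bytes reclaimed on TX completion. A minimal sketch follows; netdev_tx_sent_queue() and netdev_tx_completed_queue() are the real BQL helpers, while the mydrv_* ring type and helpers are hypothetical.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct mydrv_ring {                     /* hypothetical driver ring state */
        struct netdev_queue *txq;
        /* ... descriptor ring, head/tail indexes, etc. ... */
};

static void mydrv_xmit_one(struct mydrv_ring *ring, struct sk_buff *skb)
{
        mydrv_post_descriptor(ring, skb);               /* hypothetical */
        netdev_tx_sent_queue(ring->txq, skb->len);      /* BQL: bytes queued */
}

static void mydrv_tx_clean(struct mydrv_ring *ring)
{
        unsigned int pkts = 0, bytes = 0;

        mydrv_reclaim_descriptors(ring, &pkts, &bytes); /* hypothetical */
        /* BQL: report completed work; this lets the stack size the HW queue
         * and keep packets in the qdisc, where they can be bulk-dequeued. */
        netdev_tx_completed_queue(ring->txq, pkts, bytes);
}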

Future work (qdisc)
- A proper qdisc requeue facility
  - Only implement it for qdiscs that care
  - BQL might already reduce requeues enough
- Allow bulking for qdiscs with one-to-many TXQs
  - Currently limited to the TCQ_F_ONETXQUEUE flag
  - Requires some fixes to the requeue system
- Test on small OpenWRT routers
  - The CPU-saving benefit might be larger there

Future: lockless qdisc
- Motivation for a lockless (cmpxchg-based) qdisc
  - The direct xmit case (qdisc len == 0) "fast-path" still requires taking all 6 locks!
  - Enqueue cost (qdisc len > 0) reduced from 16ns to 10ns
- Measurements show huge potential for savings (lockless ring queue, cmpxchg-based implementation)
  - With TCQ_F_CAN_BYPASS: saving 60ns
    - Difficult to implement 100% correctly
  - Without the direct xmit case: saving 50ns

Qdisc RCU status
- Qdisc layer changes needed to support a lockless qdisc
  - All classifiers converted to RCU
  - Bstats/qstats per CPU
  - Do we want xmit stats per CPU?

Ingress qdisc
- Audit the RCU paths one more time
- Remove the ingress qdisc lock

What about RX?
- TX looks good now; how do we fix RX?
- Experiments show:
  - Highly tuned setup: RX max 6.5 Mpps
  - Forward test, single CPU: only 1-2 Mpps
- Alexei started optimizing the RX path: from 6.5 Mpps to 9.4 Mpps via build_skb() and skb prefetch tuning

The End
- Thanks! Getting to this level of performance has been the joint work of, and feedback from, many people
- Download slides here: http://people.netfilter.org/hawk/presentations/
- Discussion...