NUMA scaling issues in 10GbE
NetConf 2009
PJ Waskiewicz
Intel Corp., LAN Access Division
NUMA balancing on 10GbE
- No affinity to a socket when a driver loads.
- insmod runs as a single thread, so the node from which static structures are allocated is indeterminate.
- No linkage (currently) between where driver buffers are allocated and where userspace apps run. Is this important?
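To make the userspace half of that question concrete, here is a minimal sketch of pinning a process to the CPUs of one NUMA node on Linux. It assumes the standard sysfs layout under /sys/devices/system/node; the helper names (parse_cpulist, node_cpus, pin_to_node) are hypothetical and are not part of any driver or library.

```python
import os

def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def node_cpus(node, sysfs_root="/sys/devices/system/node"):
    """Return the CPUs that belong to a NUMA node, or [] if it cannot be read."""
    path = os.path.join(sysfs_root, f"node{node}", "cpulist")
    try:
        with open(path) as f:
            return parse_cpulist(f.read())
    except OSError:
        return []

def pin_to_node(node, pid=0):
    """Restrict a process (0 means the caller) to one node's CPUs. Linux only."""
    cpus = node_cpus(node)
    if not cpus:
        return False  # unknown node: leave the affinity untouched
    os.sched_setaffinity(pid, cpus)
    return True
```

With the default local allocation policy, memory a pinned process touches afterwards tends to come from that node. Whether this helps depends on the driver's buffers living on the same node, which is exactly the missing linkage the slide points at.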
Current issues observed in 10GbE
- Scaling multiple ports of 10GbE can cause NUMA memory bandwidth bottlenecks.
- What happens in systems with a PCIe slot affinitized to a socket? How does the driver know, and allocate accordingly?
- The kernel currently references everything per core (per_cpu lists, etc.). Moving to billions of cores, referencing things per node starts to make more sense.
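One way to see which node a PCIe device is attached to is the numa_node attribute that Linux exposes for PCI devices in sysfs. The sketch below reads it from userspace. The function names are hypothetical, and the PCI address in the usage note is only a placeholder. Inside the kernel, a driver can ask the equivalent question with dev_to_node() and pass the result to a node-aware allocator such as kmalloc_node(); that part is not shown here.

```python
import os

def parse_numa_node(text):
    """Parse the contents of a numa_node file; negative means 'no known affinity'."""
    value = int(text.strip())
    return None if value < 0 else value

def pci_numa_node(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node of a PCI device (e.g. "0000:01:00.0"), or None."""
    path = os.path.join(sysfs_root, bdf, "numa_node")
    try:
        with open(path) as f:
            return parse_numa_node(f.read())
    except (OSError, ValueError):
        return None  # device missing, attribute missing, or unparsable
```

For example, pci_numa_node("0000:01:00.0") returns the node number on a platform that reports slot affinity, and None where the firmware gives no affinity (the file then contains -1).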
Tiny snapshot of balancing issue
Thoughts on which direction to go
- This problem isn't solved yet, and it affects almost everyone. It becomes even worse approaching 40GbE and 100GbE.
- Have an API of sorts to “properly” allocate memory for drivers in a NUMA environment.
- Move towards using a single queue (or a small set) per NUMA node, instead of a queue per CPU core. Inter-socket performance is much better than inter-node.
- Is there any benefit in trying to drive NUMA affinitization into userspace (possibly through the recent Flow Steering work from Tom Herbert)?
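The queue-per-node idea can be illustrated with a small mapping exercise. This is only a sketch of the bookkeeping, not of any driver: the cpu_to_node dictionary is an assumed input (it could be built from sysfs), and both function names are hypothetical.

```python
def queue_per_cpu(cpu_to_node):
    """Today's model: every CPU gets its own queue index."""
    return {cpu: q for q, cpu in enumerate(sorted(cpu_to_node))}

def queue_per_node(cpu_to_node):
    """Proposed model: all CPUs on the same NUMA node share one queue index."""
    nodes = sorted(set(cpu_to_node.values()))
    node_queue = {node: q for q, node in enumerate(nodes)}
    return {cpu: node_queue[node] for cpu, node in cpu_to_node.items()}
```

On a hypothetical two-node, eight-CPU system, the per-CPU mapping needs eight queues while the per-node mapping needs two, so the number of queues scales with sockets rather than with cores.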