Memory vs. Networking LSF/MM-summit March, 2017 Jesper Dangaard Brouer

Memory vs. Networking: Provoking and fixing memory bottlenecks, focused on the page allocator
Jesper Dangaard Brouer, Principal Engineer, Red Hat
LSF/MM Summit, March 2017

Overview
- Mind-boggling speeds
- New technology: XDP (eXpress Data Path)
  - Limited by page allocator speed
- Page recycling
  - Every driver reinvents page recycling
  - Generalizing page recycling
- Microbenchmarking the page allocator
  - Huge improvements by Mel Gorman
  - Strange overhead from NUMA

Mind-boggling speeds
- 10 Gbit/s wirespeed at the smallest packet size: 14.88 Mpps (million packets per second)
  - 67.2 ns between packets; at 3 GHz that is roughly 201 cycles per packet (arithmetic worked out below)
- 100 Gbit/s with 1514-byte packets: 8.1 Mpps (123 ns per packet)
- Trick: bulking/batching to amortize the per-packet cost
  - Easy to think: bulk 10 packets, get 10x the cycle budget, 2010 cycles
  - Not that easy: bulking APIs do not give linear scaling; e.g. the SLUB bulk APIs gave a 60% speedup
- Kernel-bypass solutions show it is possible, via bulking and their own memory allocators
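The per-packet budget above follows from the minimum Ethernet frame (64 bytes) plus the 20 bytes of preamble and inter-frame gap on the wire; a worked version of the arithmetic, matching the numbers on the slide:

```latex
\frac{10\,\mathrm{Gbit/s}}{(64 + 20)\,\mathrm{B} \times 8\,\mathrm{bit/B}} \approx 14.88\,\mathrm{Mpps},
\qquad
\frac{1}{14.88\,\mathrm{Mpps}} \approx 67.2\,\mathrm{ns},
\qquad
67.2\,\mathrm{ns} \times 3\,\mathrm{GHz} \approx 201\,\mathrm{cycles}.
```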

XDP - eXpress Data Path
- The netdev community's XDP technology (approximately kernel v4.9)
- Shows these speeds are possible, as long as:
  1. we avoid talking to the MM layer
  2. the page is kept DMA mapped
  3. we stay on the same CPU
- Single-CPU performance (mlx5, 50 Gbit/s):
  - XDP_DROP: 17 Mpps (a minimal XDP_DROP program is sketched below)
  - XDP_TX: 10 Mpps (TX out the same interface)
- Use cases: DDoS mitigation and load balancing (Facebook)
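For readers unfamiliar with XDP, this is roughly what an XDP_DROP program looks like; a minimal illustrative sketch using the libbpf helper header, not the benchmark program behind the numbers above:

```c
/* Minimal XDP program: drop every packet at the driver RX hook,
 * before any SKB is allocated. Illustrative sketch only. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";
```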

XDP "real" forwarding is missing
- XDP packet forwarding between devices is not implemented yet, due to performance concerns and missing RX bulking
- Here we cannot avoid interacting with the MM layer
  - A driver-local, device-specific recycle trick is not sufficient
- Imagined forwarding to another device:
  - No SKB; a "raw" page is transferred, described by offset + length
  - It sits on the remote device's TX queue until DMA TX completion
  - At that point the page needs to be freed or "returned"

Driver page recycling
- All high-speed NIC drivers do page recycling, for two reasons:
  1. the page allocator is too slow
  2. avoiding the DMA mapping cost
- Different variations per driver
- Want to generalize this: every driver developer is reinventing a page recycle mechanism

Need a "handle" to the page after it leaves the driver
- Some drivers do opportunistic recycling today (sketched below):
  - Bump the page refcnt and keep pages in a queue (Intel drivers use the RX ring itself as this queue)
  - On alloc, check whether the queue-head page has refcnt == 1
    - If so, reuse it
    - Else, remove it from the queue and call put_page(); thus the last caller of put_page() frees it for real
  - Issue: DMA-unmap must be called on packets still in flight
- This looks good in benchmarks, but will it work for:
  - Real use cases: many sockets that need some queueing
  - TCP sockets: packets are kept for retransmit until ACKed
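A hedged sketch of this opportunistic recycle check; struct my_rx_ring and struct rx_buf are hypothetical stand-ins for driver-private state, while page_ref_count(), put_page() and dma_unmap_page() are the real kernel calls involved:

```c
/* Hedged sketch of opportunistic page recycling on the RX alloc path.
 * The ring/buffer types are hypothetical; real drivers keep equivalent
 * per-queue state and implement their own variants of this check. */
#include <linux/mm.h>
#include <linux/dma-mapping.h>

struct rx_buf {
	struct page *page;
	dma_addr_t   dma;
};

struct my_rx_ring {
	struct device *dev;
	struct rx_buf *bufs;
	unsigned int   next_to_clean;
};

static struct page *rx_try_recycle(struct my_rx_ring *ring)
{
	struct rx_buf *buf = &ring->bufs[ring->next_to_clean];

	/* refcnt == 1 means the stack has dropped its reference, so the
	 * page (and its DMA mapping) can be reused directly. */
	if (page_ref_count(buf->page) == 1)
		return buf->page;

	/* Page is still in flight elsewhere: unmap it and drop our ref;
	 * the last caller of put_page() frees it for real. */
	dma_unmap_page(ring->dev, buf->dma, PAGE_SIZE, DMA_FROM_DEVICE);
	put_page(buf->page);
	buf->page = NULL;
	return NULL;	/* caller falls back to the page allocator */
}
```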

Generalizing the problem statement
- Drivers receive DMA-mapped pages
  - They want to keep pages DMA mapped (for performance reasons)
  - They think the page allocator is too slow
- What can the MM layer do to address this use case?
  - A faster PCP (per-CPU pages) cache?
    - But can this ever compete with driver-local recycling? XDP_DROP returns the page into an array with no locks (protected by NAPI/softirq running)
  - Could MM provide an API for a per-device page allocator: a limited-size cache that keeps pages DMA mapped for this device? (See the API sketch below.)
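A purely hypothetical sketch of what such a per-device, DMA-mapped page cache API could look like; none of these names exist in the kernel, they only illustrate the shape of the request:

```c
/* Hypothetical API sketch: a limited-size cache, owned by one device,
 * handing out pages that stay DMA mapped for that device. Nothing here
 * is an existing kernel interface. */
#include <linux/types.h>
#include <linux/dma-direction.h>

struct device;
struct page;
struct dev_page_cache;		/* opaque, e.g. one instance per RX queue */

struct dev_page_cache *dev_page_cache_create(struct device *dev,
					     unsigned int max_pages,
					     enum dma_data_direction dir);

/* Fast path: pop a still-mapped page from the cache; fall back to
 * alloc_page() + dma_map_page() when the cache is empty. */
struct page *dev_page_cache_alloc(struct dev_page_cache *cache);

/* Return a page; when the cache is full, unmap and free it instead. */
void dev_page_cache_put(struct dev_page_cache *cache, struct page *page);

/* Unmap and free all cached pages. */
void dev_page_cache_destroy(struct dev_page_cache *cache);
```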

Even more generic
- What I'm basically asking for: a destructor callback (illustrated below)
  - When a page's refcnt reaches 0 (plus a separate call to allow refcnt == 1 when safe), call a device-specific destructor callback
  - This call is allowed to steal the page
  - The callback gets the page plus private data (in this case the DMA address)
- Roadblocks:
  - Need a page flag
  - Need room to store the data (DMA address) and the callback (which can be a table-lookup id)
  - (Looking at the page->compound_dtor infrastructure)
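A hypothetical illustration of the callback shape being asked for; neither the type nor the registration function below exists in the kernel, they only make the request concrete:

```c
/* Hypothetical destructor-callback idea: when the last reference to a
 * device-owned page is dropped, offer the page back to the device
 * instead of returning it to the page allocator. Nothing here exists. */
struct page;

typedef bool (*page_release_cb_t)(struct page *page, unsigned long priv);

/* Imagined registration: 'priv' would carry e.g. the DMA address, and
 * the callback could be stored as a small table-lookup id in the
 * struct page, similar to the compound_dtor mechanism. */
int set_page_release_callback(struct page *page,
			      page_release_cb_t cb, unsigned long priv);

/* Imagined hook inside the page-free path: if a callback is registered
 * and returns true, the page has been "stolen" (recycled by the device)
 * and must not be freed by the page allocator. */
```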

Microbenchmarking time
- Quoting Lord Kelvin: "If you cannot measure it, you cannot improve it"
- What is the performance of the page allocator? How far are we from Jesper's target?
- Do be aware: this is "zoom-in" microbenchmarking, designed to isolate and measure a specific code path, magnifying bottlenecks
- Don't get scared: the provoked lock congestion should not occur for real workloads

History: benchmarking the page order-0 fast path
- Micro-benchmark of the order-0 fast path (sketched below)
  - Tests the PCP (per-CPU pages) lists: simply recycle the same page
  - Only shows the optimal achievable performance
- Graph shows the performance-improvement history; all credit goes to Mel Gorman
  - Approx. 48% improvement!!! :-) 276 → 143 cycles
- Benchmark ran on an Intel i7-4790K CPU @ 4.00 GHz
- Kernel and git commit for the last test: 4.11.0-rc1-vanilla-00088-gec3b93ae0bf4
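A minimal kernel-module sketch of what this fast-path micro-benchmark does: allocate and immediately free the same order-0 page so only the per-CPU lists are exercised. This is an assumed reconstruction for illustration, not the actual page_bench test code:

```c
/* Order-0 fast-path micro-benchmark sketch: alloc_page() + put_page()
 * in a tight loop, so every free refills the PCP list that the next
 * allocation hits. Assumed reconstruction, not the real test module. */
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/timex.h>

static int __init pagebench_init(void)
{
	const unsigned long loops = 100000000UL;
	unsigned long i;
	cycles_t start, stop;
	struct page *page;

	start = get_cycles();
	for (i = 0; i < loops; i++) {
		page = alloc_page(GFP_ATOMIC);	/* order-0 fast path */
		if (unlikely(!page))
			return -ENOMEM;
		put_page(page);			/* immediately recycled via the PCP list */
	}
	stop = get_cycles();

	pr_info("order-0 fast path: %llu cycles per alloc+free\n",
		(unsigned long long)((stop - start) / loops));
	return 0;
}

static void __exit pagebench_exit(void) { }

module_init(pagebench_init);
module_exit(pagebench_exit);
MODULE_LICENSE("GPL");
```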

Cost when the page order increases (kernel 4.11-rc1)
- Red line: no surprises here
  - As expected, cost goes up with order; good curve until order-3
- Yellow line: cost amortized per 4K chunk
  - A trick used by some drivers (allocate a higher-order page and hand out 4K pieces)
  - Want to avoid this trick:
    - An attacker can pin down memory
    - Bad for concurrent workloads: reclaim/compaction stalls
- Kernel benchmarked: 4.11.0-rc1-vanilla-00088-gec3b93ae0bf4, SMP PREEMPT
  - NUMA-related config:
    CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
    CONFIG_NUMA_BALANCING=y
    CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
    CONFIG_NUMA=y
    CONFIG_X86_64_ACPI_NUMA=y
    # CONFIG_NUMA_EMU is not set
    CONFIG_USE_PERCPU_NUMA_NODE_ID=y
    CONFIG_ACPI_NUMA=y

Pressuring the PCP lists with many outstanding pages
- Test: allocate N order-0 pages before freeing them (sketched below)
- Results as expected:
  - Clearly shows that the size of the PCP cache is 128
  - Fairly good at amortizing the cost (it does bulking of 32 internally)
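A sketch of the test pattern described above, holding N pages outstanding before freeing any of them; an assumed reconstruction for illustration, not the actual test code:

```c
/* "N outstanding pages" test sketch: once N exceeds the PCP list size,
 * allocations and frees must fall back to the zone free lists. Assumed
 * reconstruction, not the real benchmark module. */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>

static int bench_outstanding(unsigned int n)
{
	struct page **pages;
	unsigned int i;

	pages = kmalloc_array(n, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	for (i = 0; i < n; i++) {		/* N pages outstanding ... */
		pages[i] = alloc_page(GFP_ATOMIC);
		if (!pages[i])
			break;
	}
	while (i--)				/* ... before any are freed */
		put_page(pages[i]);

	kfree(pages);
	return 0;
}
```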

Moving pages cross-CPU
- Networking often does RX on CPU-A, but processes and frees on CPU-B
- This benchmark isolates moving pages cross-CPU
- Baseline: ptr_ring, 31 cycles per CPU
  - Best possible performance for a cross-CPU queue
- PCP cross-CPU (order-0), alloc_pages + put_page:
  - Cost on both CPUs increased to 210 cycles
  - Cross-CPU issue, or outstanding-pages issue? (210 - 31) * 2 = 358 cycles is too high
  - Hot lock: zone->lock (even though the PCP does bulking of 32)
  - Single-CPU cost is 145 cycles (with no outstanding pages)
- Delaying 1 page and issuing prefetchw before put_page() (sketched below):
  - CPU-B cost reduced by 58 cycles, to 152
  - Helps the CPU cache coherency protocol
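A hedged sketch of the "delay one page + prefetchw" trick on the freeing CPU; the function and its one-deep delay slot are illustrative (real code would keep this in per-CPU or driver state), while prefetchw() and put_page() are the real kernel calls:

```c
/* "Delay 1 page + prefetchw before put_page()" sketch for the remote
 * (freeing) CPU: prefetch the struct page we will dirty on the *next*
 * call, so its cacheline is already owned for write when put_page()
 * finally touches it. Illustrative only. */
#include <linux/prefetch.h>
#include <linux/mm.h>

static void cpu_b_delayed_free(struct page *page, struct page **delay_slot)
{
	prefetchw(page);		/* start pulling the cacheline in for write */

	if (*delay_slot)
		put_page(*delay_slot);	/* free the page received on the previous call */

	*delay_slot = page;		/* defer this page until the next call */
}
```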

Disabling zone_statistics (via no-NUMA)
- Micro-benchmark of the order-0 fast path
  - Tests the PCP (per-CPU pages) lists: simply recycle the same page; only shows the optimal achievable performance
- Disable CONFIG_NUMA and thereby zone_statistics (kernel 4.11-rc1):
  - 143 cycles normally, with NUMA
  - 97 cycles with NUMA disabled
  - Extra cost of 46 cycles (32%); most of it appears to originate from the call to zone_statistics()
- Benchmark ran on an Intel i7-4790K CPU @ 4.00 GHz
- Kernel and git commit for the last test: 4.11.0-rc1-vanilla-00088-gec3b93ae0bf4

End slide
- Use case: XDP redirect out another device
  - Needs super-fast return/free of pages
- Can this be integrated into the page allocator, or do we keep doing driver-specific opportunistic recycle hacks?

EXTRA SLIDES

Comparing against no-NUMA for the outstanding-pages test
- Test: allocate N order-0 pages before freeing them
- Results as expected:
  - With or without NUMA, the results still follow the same curve
  - The offset is almost constant

Page allocator bulk API benchmarks
- Took over some bulk-allocation patches from Mel
- Rebasing the patches to the latest kernel ran into an issue… no results to present :-(
  - Fails in the zone_watermark_fast check
- Did write page_bench04_bulk (a hypothetical usage sketch of such a bulk API follows below)
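For context, a hypothetical sketch of how a driver-style consumer might use a page-allocator bulk API of this kind; the prototype below is assumed purely for illustration and is not the interface from the patches mentioned above:

```c
/* Hypothetical bulk-allocation usage sketch: one call amortizes the
 * zone->lock and watermark checks over many pages, instead of paying
 * them once per alloc_page() call. The prototype is assumed, not the
 * actual patch interface. */
#include <linux/gfp.h>
#include <linux/mm.h>

/* Assumed prototype: allocate up to 'count' order-0 pages into 'pages',
 * returning how many were actually allocated. */
int alloc_pages_bulk_sketch(gfp_t gfp, unsigned int count,
			    struct page **pages);

static int rx_ring_refill(struct page **ring_pages, unsigned int budget)
{
	/* Refill an RX ring in one shot rather than page by page. */
	return alloc_pages_bulk_sketch(GFP_ATOMIC, budget, ring_pages);
}
```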