Generic page-pool recycle facility?

Jesper Dangaard Brouer, Principal Engineer, Red Hat
MM-summit 2016: April 18th-19th

Intro slide: Motivation for page recycling
- Bottlenecks in both the page allocator and the DMA APIs
  - Many driver-specific workarounds, and unfortunate side-effects of those workarounds
- Motivation (1): primarily performance motivated
  - Building a “packet-page”/XDP-level forward/drop facility
- Motivation (2): drivers are reinventing this
  - Clean up the open-coded driver approaches?!
- Motivation (3): other use-cases
  - Like supporting zero-copy RX

Optimization principle behind the page-pool idea
- Untapped optimization potential
  - Recycle pages instead of always returning them to the page allocator
- Opens up a number of optimizations, by shifting computation and setup cost to the points where pages enter or leave the pool (see the API sketch below)
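
A minimal sketch of what such a pool API could look like. All pp_* names are hypothetical illustrations of the principle, not existing kernel interfaces:

```c
/* Per-page setup cost (e.g. DMA mapping) is paid when a page enters or
 * leaves the pool; the per-packet fast path reduces to a pop/push. */
struct pp_pool;   /* opaque pool object, one per device/RX-queue */

struct pp_pool *pp_create(struct device *dev, unsigned int pool_size);
void pp_destroy(struct pp_pool *pool);    /* dma_unmap + free all pages */

struct page *pp_alloc(struct pp_pool *pool);              /* fast: pop  */
void pp_recycle(struct pp_pool *pool, struct page *page); /* fast: push */
```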

DMA bottleneck: mostly on PowerPC
- On archs like PowerPC, the DMA API is the bottleneck
- Driver work-around: amortize the DMA-call cost
  - Alloc large-order (compound) pages
  - dma_map the compound page, hand out page-fragments for the RX ring, and dma_unmap later when the last RX page-fragment is seen (sketched below)
- Bad side-effect: the DMA page is considered 'read-only'
  - Because the dma_unmap call can be destructive
  - (a NOP instruction on x86)
- The read-only side-effect causes netstack overhead:
  - alloc new writable memory, copy over the IP headers, and adjust the offset pointer into the RX page
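
A sketch of that driver work-around, using hypothetical frag_cache names rather than any specific driver's code:

```c
#include <linux/gfp.h>
#include <linux/dma-mapping.h>

struct frag_cache {
        struct page *page;    /* order-N compound page, DMA-mapped once */
        dma_addr_t   dma;
        unsigned int offset;  /* next free fragment offset */
        unsigned int size;    /* PAGE_SIZE << order */
};

static bool frag_cache_refill(struct device *dev, struct frag_cache *fc,
                              unsigned int order)
{
        fc->page = alloc_pages(GFP_ATOMIC | __GFP_COMP, order);
        if (!fc->page)
                return false;

        fc->size   = PAGE_SIZE << order;
        fc->offset = 0;
        /* One expensive dma_map call, amortized over all fragments;
         * dma_unmap_page() is deferred until the last fragment is seen,
         * which is why the page must be treated as read-only until then. */
        fc->dma = dma_map_page(dev, fc->page, 0, fc->size, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, fc->dma)) {
                __free_pages(fc->page, order);
                fc->page = NULL;
                return false;
        }
        return true;
}
```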

Idea to solve the DMA mapping cost (credit: Alexei)
- Keep these pages DMA-mapped to the device
  - The page-pool recycles pages back to the originating device
  - Avoids the need to call dma_unmap
- Only call dma_map() when setting up pages
  - And dma_unmap when a page leaves the pool
- This should solve both issues
  - Removes the cost of DMA map/unmap
  - DMA pages can be considered writable (dma_sync determines when); see the alloc-path sketch below
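
A sketch of the resulting alloc path, continuing the hypothetical pp_* names (pp_pop, pp_store_dma, and pp_dma_addr are assumed helpers, as is the pool->dev field):

```c
static struct page *pp_alloc(struct pp_pool *pool)
{
        struct page *page = pp_pop(pool);  /* recycled, still DMA-mapped */
        dma_addr_t dma;

        if (!page) {
                page = alloc_page(GFP_ATOMIC);
                if (!page)
                        return NULL;
                /* Slow path: dma_map only when a page first enters the pool */
                dma = dma_map_page(pool->dev, page, 0, PAGE_SIZE,
                                   DMA_FROM_DEVICE);
                if (dma_mapping_error(pool->dev, dma)) {
                        put_page(page);
                        return NULL;
                }
                pp_store_dma(page, dma);
        }
        /* Fast path: hand ownership to the device with a sync, never a
         * map/unmap; this is also what keeps the page writable. */
        dma_sync_single_for_device(pool->dev, pp_dma_addr(page),
                                   PAGE_SIZE, DMA_FROM_DEVICE);
        return page;
}
```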

DMA trick: “spelling it out”
- For the DMA “keep-mapped” trick to work
  - Pages must be returned to the originating device
  - To keep the “static” DMA mapping valid
- Without storing info in struct page, it is troublesome to track the originating device
  - Needed at TX DMA completion time on another device
  - (also need to track the DMA unmap addr for PowerPC)
- Any meta-data tracking the originating device cannot be free'ed until after TX DMA completes
  - Could use page->private (sketched below)
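
A sketch of the page->private idea: stash the owning pool in the page itself, so TX completion on another device can still find the return path. Helper names are hypothetical, and note that page->private alone cannot also hold the DMA unmap address this slide asks for:

```c
static inline void pp_set_owner(struct page *page, struct pp_pool *pool)
{
        set_page_private(page, (unsigned long)pool);
}

static inline void pp_return_page(struct page *page)
{
        struct pp_pool *pool = (struct pp_pool *)page_private(page);

        if (pool)
                pp_recycle(pool, page);  /* back to the originating device */
        else
                put_page(page);          /* not pool-backed: normal free */
}
```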

Page allocator too slow
- On x86, DMA is NOT the bottleneck
  - Besides the side-effect of read-only pages
- XDP (eXpress Data Path) performance target: 14.8 Mpps (10Gbit/s wire-speed, smallest packets)
  - Approx 201 cycles per packet at 3 GHz (3.0e9 cycles/sec / 14.88e6 pps ≈ 201)
- Single page order-0: costs 277 cycles
  - alloc_pages() + __free_pages()
  - (Mel's patchset reduced this to 231 cycles)

Work-around for the slow page allocator
- Drivers use the same trick as the DMA work-around:
  - Alloc a larger-order page and hand out fragments (sketched below)
  - E.g. a page of order-3 (32K) costs 503 cycles (Mel: 397 cycles)
  - Handing out 4K blocks amortizes this to 503 / 8 ≈ 62 cycles per block
- Problematic due to memory pin-down “attacks”
  - Google disables this driver feature
- See this as a bulking trick
  - Instead implement a page bulk-alloc API?
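
The hand-out side of the same trick, continuing the hypothetical frag_cache sketch from the DMA slide:

```c
static void *frag_alloc(struct frag_cache *fc)
{
        void *block;

        if (!fc->page || fc->offset >= fc->size)
                return NULL;  /* caller must frag_cache_refill() first */

        block = page_address(fc->page) + fc->offset;
        fc->offset += PAGE_SIZE;  /* hand out 4K blocks */
        return block;
}
```

Note how the pin-down problem follows directly from this layout: a single long-lived 4K block keeps the whole 32K compound page referenced.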

Benchmark: page allocator (optimal case, 1 CPU, no congestion)
- Cycle cost increases with page order size
  - But partitioning the page into 4K fragments amortizes the cost
- Mel Gorman's kernel tree: http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git
  - Tested branch: mm-vmscan-node-lru-v5r9

Issues with: higher-order pages
- A hidden bulking trick
  - Alloc a larger-order page, hand out fragments
- Troublesome:
  1. Fast sometimes; other times it requires reclaim/compaction, which can stall for prolonged periods of time
  2. A clever attacker can pin down memory
     - Especially relevant for the end-host TCP/IP use-case
  3. Does not scale as well under concurrent workloads

Concurrent CPU scaling micro-benchmark
- Order-0 pages scale well
- Order-3 pages scale badly, even counting per 4K fragment
  - The advantage is already lost with 2 concurrent CPUs

Page-pool cooperating
- Avoid keeping too many pages:
- Steady state (RX rate = TX rate, no queue)
  - Only requires RX ring size + outstanding TX DMA
  - Thus the restricted pool size can be small (see the recycle sketch below)
- Overload/burst state (RX rate > TX rate) causes a queue
  - “Good queue” behavior: absorbs bursts
  - “Bad queue” (long-standing queue): potential for OOM
  - Today this is handled at different levels, e.g. socket queue limits
- Potential for detecting a “bad queue” at this level
  - Allow the page allocator to reclaim pool pages
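
A sketch of a size-capped recycle path under these assumptions (fields and helpers hypothetical): the pool keeps at most RX-ring-size + outstanding-TX-DMA pages, and overflow goes straight back to the page allocator so a long-standing queue cannot pin pool memory:

```c
static void pp_recycle(struct pp_pool *pool, struct page *page)
{
        if (pool->count < pool->size_limit) {
                pp_push(pool, page);  /* stays DMA-mapped, ready for reuse */
                pool->count++;
        } else {
                /* "Bad queue" pressure: unmap and let the page allocator
                 * reclaim the page instead of growing the pool. */
                dma_unmap_page(pool->dev, pp_dma_addr(page),
                               PAGE_SIZE, DMA_FROM_DEVICE);
                put_page(page);
        }
}
```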

Big question: how integrated with the MM-layer?
- Big “all-in” approach: become an allocator
  - Like slub: use struct page
- Minimum: a page pointer back to the page_pool
  - And the DMA unmap address (see the struct sketch below)
  - Build as a shell around the page allocator
- How to keep track of “outstanding” pages?
  - Plus track the DMA unmap addr per page
- API users keep track of which pool to return to
  - At TX completion time, the return info is needed
  - Thus, the meta-data is kept around too long (cache-cold)
  - Might be a trick to avoid this, by syncing on the page refcnt
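
The “minimum” variant spelled out as a struct: what must be known per page for the return path. Whether this state lives in struct page itself (the “all-in” option) or in a shell around the page allocator is exactly the open question; names are illustrative:

```c
struct pp_page_info {
        struct pp_pool *pool;  /* return path: pool (and device) owning the page */
        dma_addr_t      dma;   /* DMA unmap address, e.g. needed on PowerPC */
};
```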

Novel recycle trick by Intel drivers
- Issue: getting pages recycled back into the pool without meta-data tracking the return-pool
- Use the page ref count
  - To see if TX is done, checked when RX looks at the page
- Split pages in two halves
  - Keep pages in the RX ring (the tracking structure)
  - On RX, if the page refcnt is low (<= 2), then reuse the other half to refill the RX ring (else do a normal alloc); sketched below
- In effect this recycles the page whenever TX completes faster than one trip around the RX ring
- Still adds 2x atomic ops per packet
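
A loose sketch of that trick, modeled on the slide's description rather than any one driver's exact code:

```c
struct rx_buf {
        struct page *page;
        unsigned int page_offset;  /* 0 or PAGE_SIZE / 2: which half is in use */
};

static bool rx_buf_reuse_half(struct rx_buf *buf)
{
        /* refcnt <= 2: the RX ring holds one ref and at most one half is
         * still in flight; higher means TX/stack is not done with it. */
        if (page_ref_count(buf->page) > 2)
                return false;  /* fall back to a normal page alloc */

        buf->page_offset ^= PAGE_SIZE / 2;  /* hand out the other half */
        get_page(buf->page);  /* one of the 2x atomic ops per packet */
        return true;
}
```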

Other use-cases: RX zero-copy
- Currently: NIC RX zero-copy is not allowed
  - It could leak kernel memory information left in the page
- But we know pages are recycled back into the pool
  - Clear the memory when a new page enters the pool (sketched below)
  - RX zero-copy is then safe, though it could still “leak” packet data
- Early demux: HW filters can direct traffic to a specific RX queue
  - Create a page-pool per RX-queue
- Idea: alloc pages from a virtual address space (premapped)
  - Needs fairly close integration with the MM-layer
  - (not compatible with the Intel driver trick)
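
A sketch of that safety rule (hypothetical helpers; assumes a page where page_address() gives a valid kernel mapping):

```c
static void pp_page_enter(struct pp_pool *pool, struct page *page)
{
        /* One-time cost at pool entry: a zeroed page can never expose
         * stale kernel memory to a zero-copy consumer. Recycled pages
         * can still "leak" earlier packet data within the same pool. */
        clear_page(page_address(page));
        pp_push(pool, page);
}
```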

Other use-cases: using huge pages for RX
- Make the page-pool API hide page boundaries
  - Driver is unaware of the page order used
- Idea: huge-page RX zero-copy
  - The page-pool hands out page-frags for the RX ring
  - The huge page gets memory-mapped into userspace
    - Done to reduce TLB misses for userspace
  - Zero-copy to userspace
    - Netmap or DPDK could run on top
  - Use NIC HW filters; create an RX queue with this pool strategy
- Hard limit on the number of huge pages

Concluding discussion!?
- Already active discussions on the mailing list…
- Must fix the DMA problem causing read-only pages
  - Maybe just have an “ugly” solution for x86?
- Leaning towards something on top of the page allocator
  - Only focus on the performance use-case
  - Down-prioritize the RX zero-copy use-case?
  - Use a field in struct page for the pool return path
- Want a page bulk-alloc API… please!
  - For faster refill of the page-pool
  - The large-order page trick is problematic (as described)