Memory vs. Networking: Provoking and fixing memory bottlenecks

Memory vs. Networking: Provoking and fixing memory bottlenecks
Jesper Dangaard Brouer, Principal Engineer, Red Hat
Date: October 2016
Venue: NetConf, Tokyo, Japan

Memory vs. Networking
The network stack provokes bottlenecks in the memory allocators, and lots of work is needed in the MM area:
- kmem_cache (SLAB/SLUB) allocator: bulk API almost done, more users please!
- Page allocator: baseline performance is too slow (see later graphs)
- Drivers: page recycle caches have limits, and do not address all areas of the problem space

MM: Status on kmem_cache bulk
Discovered that IP forwarding was hitting the slowpath in the kmem_cache/SLUB allocator.
Solution: bulk APIs for kmem_cache (SLAB+SLUB). Status: upstream since kernel 4.6.
- Netstack uses bulk free of SKBs in NAPI context, exploiting the bulking opportunity at DMA-TX completion: a 4-5% performance improvement for IP forwarding
- Generic kfree_bulk API
- Rejected: netstack bulk alloc of SKBs, as the number of RX packets was unknown
A usage sketch of the bulk APIs follows below.
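
The two calls in question are kmem_cache_alloc_bulk() and kmem_cache_free_bulk(). A minimal sketch of how a consumer might batch allocations and frees; "my_cache", BATCH and the NAPI framing are illustrative, not taken from the talk:

    /* Hedged sketch: amortizing per-object allocator cost with the
     * kmem_cache bulk API (upstream since ~4.6).
     */
    #include <linux/slab.h>

    #define BATCH 16

    static void bulk_demo(struct kmem_cache *my_cache)
    {
            void *objs[BATCH];
            int got;

            /* Grab up to BATCH objects in one call; returns the number
             * actually allocated (0 on failure). */
            got = kmem_cache_alloc_bulk(my_cache, GFP_ATOMIC, BATCH, objs);
            if (!got)
                    return;

            /* ... use objs[0..got-1], e.g. as SKBs in a NAPI poll loop ... */

            /* Return all objects in one call, e.g. at DMA-TX completion,
             * letting the allocator stay on its fastpath. */
            kmem_cache_free_bulk(my_cache, got, objs);
    }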

MM: kmem_cache bulk, more use-cases
Network stack:
- Need explicit bulk free from the TCP stack; NAPI bulk free is not active for TCP (it keeps the reference too long)
- Use kfree_bulk() for skb->head (when allocated with kmalloc)
- Use the bulk free API for qdisc delayed free
RCU use-case:
- Use the kfree_bulk() API for delayed RCU free
Other kernel subsystems? A kfree_bulk() sketch follows below.
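
kfree_bulk() itself is a thin wrapper that frees an array of kmalloc'ed pointers in one call. A minimal sketch, with the buffer size and count invented for illustration:

    /* Hedged sketch: freeing a batch of kmalloc'ed buffers with
     * kfree_bulk(), as a qdisc or RCU deferred-free path might batch
     * its work.
     */
    #include <linux/slab.h>

    #define NBUFS 8

    static void kfree_bulk_demo(void)
    {
            void *bufs[NBUFS];
            size_t i, n = 0;

            for (i = 0; i < NBUFS; i++) {
                    void *p = kmalloc(256, GFP_KERNEL);
                    if (!p)
                            break;
                    bufs[n++] = p;
            }

            /* One call frees the whole batch; objects belonging to the
             * same slab can be detected and freed together. */
            if (n)
                    kfree_bulk(n, bufs);
    }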

SKB overhead sources
Sources of overhead for SKBs (struct sk_buff):
- Memory alloc+free: addressed by the kmem_cache bulk API
- Clearing the SKB: need to clear 4 cache-lines!
- Read-only RX pages: cause more expensive construction of the SKB

SKB allocations: with read-only pages
Most drivers have read-only RX pages, which cause more expensive SKB setup:
- Alloc a separate writable mem area
- memcpy over the RX packet headers
- Store skb_shared_info in the writable area
- Set up pointers and offsets into the RX page-"frag"
Reason: a performance trade-off
- The page allocator is too slow
- The DMA-API is expensive on some platforms (with IOMMU)
- Hack: alloc and DMA-map larger pages, and "chop up" the page
- Side-effect: read-only RX page-frames, due to unpredictable DMA unmap time
A sketch of this construction follows below.
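
A rough sketch of the copy-headers construction, loosely modeled on what drivers such as ixgbe do; HDR_LEN, the truesize value and the function framing are illustrative, and page refcounting is omitted:

    /* Hedged sketch: build an SKB when the RX page is read-only.
     * Headers are copied into a small writable allocation; the
     * payload stays in the DMA page as a page-"frag".
     */
    #include <linux/skbuff.h>

    #define HDR_LEN 128     /* bytes of headers copied to writable area */

    static struct sk_buff *rx_copy_build_skb(struct napi_struct *napi,
                                             struct page *page,
                                             unsigned int offset,
                                             unsigned int pkt_len)
    {
            unsigned int copy = min_t(unsigned int, pkt_len, HDR_LEN);
            void *data = page_address(page) + offset;
            struct sk_buff *skb;

            /* Separate writable area: holds headers + skb_shared_info */
            skb = napi_alloc_skb(napi, HDR_LEN);
            if (!skb)
                    return NULL;

            /* memcpy the packet headers out of the read-only DMA page */
            memcpy(skb_put(skb, copy), data, copy);

            /* Point at the rest of the payload inside the RX page-frag */
            if (pkt_len > copy)
                    skb_add_rx_frag(skb, 0, page, offset + copy,
                                    pkt_len - copy, PAGE_SIZE / 2);
            return skb;
    }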

Benchmark: Page allocator (optimal case, 1 CPU, no congestion)
- A single page (order-0) is too slow for the 10Gbit/s budget (see the budget calculation below)
- Cycle cost increases with page order size, but partitioning the page into 4K fragments amortizes the cost
Mel Gorman's kernel tree: http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git
Tested branch: mm-vmscan-node-lru-v5r9
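
For reference, the 10Gbit/s cycle budget can be derived with simple arithmetic; a minimal userspace calculation, assuming minimum-size 64B Ethernet frames (plus 20B of preamble/inter-frame-gap wire overhead) and a 3 GHz CPU:

    /* Cycle budget at 10 Gbit/s line rate, minimum-size frames. */
    #include <stdio.h>

    int main(void)
    {
            double link_bps   = 10e9;                  /* link speed, bits/sec */
            double frame_bits = (64 + 20) * 8;         /* frame + wire overhead */
            double pps        = link_bps / frame_bits; /* ~14.88 Mpps */
            double ns_per_pkt = 1e9 / pps;             /* ~67.2 ns/packet */
            double cycles     = ns_per_pkt * 3.0;      /* ~201 cycles @ 3GHz */

            printf("%.2f Mpps, %.1f ns/packet, %.0f cycles @ 3GHz\n",
                   pps / 1e6, ns_per_pkt, cycles);
            return 0;
    }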

Issues with: Higher order pages
Performance workaround: alloc a larger order page and hand out fragments, amortizing the alloc cost over several packets (sketch below). Troublesome because:
1. Sometimes fast, but other times it requires reclaim/compaction, which can stall for prolonged periods of time
2. A clever attacker can pin down memory; especially relevant for the end-host TCP/IP use-case
3. Does not scale as well under concurrent workloads
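
A sketch of the chop-up trick under discussion; the frag_cache structure is invented for illustration and refcount handling is simplified:

    /* Hedged sketch: allocate one order-3 (32K) page and hand out 4K
     * fragments, hitting the page allocator once per 8 packets.
     * Real drivers add recycling and more careful refcounting.
     */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    #define RX_ORDER 3      /* 2^3 * 4K = 32K backing allocation */

    struct frag_cache {
            struct page *page;
            unsigned int offset;
    };

    static void *frag_alloc_4k(struct frag_cache *fc)
    {
            void *buf;

            if (!fc->page) {
                    fc->page = alloc_pages(GFP_ATOMIC | __GFP_COMP, RX_ORDER);
                    if (!fc->page)
                            return NULL;
                    fc->offset = 0;
            }

            buf = page_address(fc->page) + fc->offset;
            get_page(fc->page);     /* each fragment holds a page reference */

            fc->offset += 4096;
            if (fc->offset >= (PAGE_SIZE << RX_ORDER)) {
                    put_page(fc->page);     /* drop the cache's own reference */
                    fc->page = NULL;
            }
            return buf;
    }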

Concurrent CPUs scaling micro-benchmark
Danger of higher order pages with parallel workloads:
- Order-0 pages scale well
- Order-3 pages scale badly, even counting per 4K fragment; the advantage is already lost with 2 concurrent CPUs

RX-path: Make RX pages writable
Need to make RX pages writable. Why is the page (considered) read-only?
- Due to dma_unmap() time: several page fragments (packets) are in-flight, and dma_unmap() is only called for the last fragment in the RX ring queue
- DMA engine unmap semantics allow overwriting memory (not a problem on Intel)
Simple solution: use one packet per page, and call dma_unmap() before using the page (sketched below). My solution is the page_pool.
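
A minimal sketch of that simple solution, assuming an illustrative descriptor layout:

    /* Hedged sketch: "one packet per page". The page is DMA-unmapped
     * before the stack touches it, so the CPU can safely modify it.
     */
    #include <linux/dma-mapping.h>
    #include <linux/mm.h>

    struct rx_desc {
            struct page *page;
            dma_addr_t dma;
    };

    static struct page *rx_get_writable_page(struct device *dev,
                                             struct rx_desc *desc)
    {
            /* After this the DMA engine may no longer write to the page,
             * so headers can be rewritten in place. */
            dma_unmap_page(dev, desc->dma, PAGE_SIZE, DMA_FROM_DEVICE);
            return desc->page;
    }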

Page pool: Generic solution, many advantages
5 features of a recycling page pool (per device):
1. Faster than page-allocator speed: as a specialized allocator it requires fewer checks
2. DMA IOMMU mapping cost removed, by keeping the page mapped (credit to Alexei)
3. Makes the page writable, via a predictable DMA unmap point
4. OOM protection at device level: the feedback loop knows the number of outstanding pages
5. Zero-copy RX, solving memory early demux; depends on HW filters into RX queues

Page pool: Design
Idea presented at MM-summit, April 2016. Basic concept for the page_pool:
- Pages are recycled back into the originating pool, creating a feedback loop that helps limit the pages in the pool
- Drivers still need to handle the dma_sync part
- Page-pool handles dma_map/unmap, essentially as constructor and destructor calls
- Page free/return to the page-pool: either the SKB free path knows about it and calls page pool free, or put_page() handles it via a page flag
An API sketch follows below.
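
For reference, the page_pool API that later landed upstream (around kernel 4.18, <net/page_pool.h>) looks roughly like the sketch below; exact signatures have shifted between kernel versions, so treat this as illustrative:

    /* Hedged sketch of driver-side page_pool usage. */
    #include <net/page_pool.h>

    static struct page_pool *rxq_pool_setup(struct device *dev)
    {
            struct page_pool_params pp = {
                    .flags     = PP_FLAG_DMA_MAP,  /* pool does dma_map/unmap */
                    .order     = 0,                /* one 4K page per packet */
                    .pool_size = 1024,             /* roughly the RX ring size */
                    .nid       = NUMA_NO_NODE,
                    .dev       = dev,
                    .dma_dir   = DMA_FROM_DEVICE,
            };

            return page_pool_create(&pp);   /* ERR_PTR() on failure */
    }

    /* In the RX path:
     *   struct page *page = page_pool_dev_alloc_pages(pool);
     * On free/recycle (e.g. from NAPI context):
     *   page_pool_put_page(pool, page, true);
     */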

Page-pool: opportunity – feedback loop
Today: unbounded RX page allocations by drivers can cause OOM (Out-of-Memory) situations, handled via skb->truesize and queue limits.
The page pool provides a feedback loop (given pages are recycled back to the originating pool), allowing the pages/memory allowed per RXq to be bounded:
- Simple solution: configure a fixed memory limit (toy sketch below)
- Advanced solution: track steady-state; can function as a "Circuit Breaker" (see RFC draft link)
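
A toy sketch of the simple fixed-limit idea; all names are invented for illustration, and this is not the upstream implementation:

    /* Hedged sketch: refuse new pages once too many are outstanding
     * for this RX queue; returning pages closes the feedback loop.
     */
    #include <linux/atomic.h>
    #include <linux/mm.h>

    struct rxq_pool {
            atomic_t outstanding;   /* pages currently held outside the pool */
            int limit;              /* configured per-RXq page budget */
    };

    static struct page *rxq_alloc_bounded(struct rxq_pool *q)
    {
            if (atomic_inc_return(&q->outstanding) > q->limit) {
                    atomic_dec(&q->outstanding);
                    return NULL;    /* feedback: drop early, avoid OOM */
            }
            return alloc_page(GFP_ATOMIC);
    }

    static void rxq_return_page(struct rxq_pool *q, struct page *page)
    {
            put_page(page);
            atomic_dec(&q->outstanding);    /* recycle closes the loop */
    }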

SKB clearing cost is high
Options for addressing the clearing cost:
- Smaller/diet SKB (currently 4 cache-lines): too hard!
- Faster clearing: hand-optimized clearing only saves 10 cycles; clear larger contiguous mem (during the bulk alloc API)
- Delay clearing: don't clear on alloc (inside the driver); issue is knowing which fields the driver updated. Clear sections later, inside netstack RX, allowing prefetchw to have effect
The clearing in question is shown below.
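
For context, the clearing in question is the memset that SKB constructors perform over the first part of struct sk_buff, roughly as the kernel's __build_skb() does:

    #include <linux/skbuff.h>

    static void skb_clear_demo(struct sk_buff *skb)
    {
            /* Zero everything up to the tail field: about 4 cache-lines
             * of struct sk_buff. This memset is the cost the slide
             * targets; simplified from the kernel's __build_skb(). */
            memset(skb, 0, offsetof(struct sk_buff, tail));
    }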

Topic: RX-MM-allocator – Alternative
Prerequisite: the page is writable.
Idea: no SKB alloc calls during RX! Don't alloc the SKB; create it inside the head- or tail-room of the data-page, with skb_shared_info placed at the end of the data-page.
Issues / pitfalls:
- Clearing the SKB section is likely expensive
- SKB truesize increase(?)
- Needs a full page per packet (ixgbe does a page recycle trick)
For comparison, see the build_skb() sketch below.
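
The existing build_skb() helper already realizes half of this idea: skb_shared_info lives at the end of the data buffer, and only struct sk_buff itself still comes from a kmem_cache. A hedged sketch of driver usage, with the headroom handling simplified:

    /* Hedged sketch: build an SKB around an existing writable data
     * page. build_skb() places skb_shared_info at the end of the
     * buffer; the slide's proposal goes one step further and embeds
     * struct sk_buff in the page too. "data" points into a writable
     * RX page.
     */
    #include <linux/skbuff.h>

    static struct sk_buff *rx_wrap_page(void *data, unsigned int pkt_len)
    {
            /* frag_size covers the whole buffer incl. tailroom + shinfo */
            struct sk_buff *skb = build_skb(data, PAGE_SIZE);

            if (!skb)
                    return NULL;
            skb_reserve(skb, NET_SKB_PAD);  /* headroom for the stack */
            skb_put(skb, pkt_len);          /* mark payload length */
            return skb;
    }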

The end
kfree_bulk(16, slides);