Memory vs. Networking: Provoking and fixing memory bottlenecks

Memory vs. Networking: Provoking and fixing memory bottlenecks
Jesper Dangaard Brouer, Principal Engineer, Red Hat
Date: October 2016
Venue: NetConf, Tokyo, Japan

Memory vs. Networking
The network stack provokes bottlenecks in the memory allocators, and lots of work is needed in the MM area:
- kmem_cache (SLAB/SLUB) allocator: bulk API almost done, more users please!
- Page allocator: baseline performance is too slow (see later graphs)
- Drivers: page recycle caches have limits, and do not address all areas of the problem space

MM: Status on kmem_cache bulk
Discovered that IP forwarding was hitting the slowpath in the kmem_cache/SLUB allocator.
Solution: bulk APIs for kmem_cache (SLAB+SLUB). Status: upstream since kernel 4.6.
- Netstack uses bulk free of SKBs in NAPI context, exploiting the bulking opportunity at DMA-TX completion: a 4-5% performance improvement for IP forwarding
- Generic kfree_bulk API
- Rejected: netstack bulk alloc of SKBs, as the number of RX packets was unknown
A usage sketch of the bulk APIs follows below.
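
The two calls in question are kmem_cache_alloc_bulk() and kmem_cache_free_bulk(). A minimal sketch of how a consumer might batch allocations and frees; "my_cache", BATCH and the NAPI framing are illustrative, not taken from the talk:

    /* Hedged sketch: amortizing per-object allocator cost with the
     * kmem_cache bulk API (upstream since ~4.6).
     */
    #include <linux/slab.h>

    #define BATCH 16

    static void bulk_demo(struct kmem_cache *my_cache)
    {
            void *objs[BATCH];
            int got;

            /* Grab up to BATCH objects in one call; returns the number
             * actually allocated (0 on failure). */
            got = kmem_cache_alloc_bulk(my_cache, GFP_ATOMIC, BATCH, objs);
            if (!got)
                    return;

            /* ... use objs[0..got-1], e.g. as SKBs in a NAPI poll loop ... */

            /* Return all objects in one call, e.g. at DMA-TX completion,
             * letting the allocator stay on its fastpath. */
            kmem_cache_free_bulk(my_cache, got, objs);
    }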

MM: kmem_cache bulk, more use-cases
Network stack:
- Need explicit bulk free from the TCP stack; NAPI bulk free is not active for TCP (it keeps the reference too long)
- Use kfree_bulk() for skb->head (when allocated with kmalloc)
- Use the bulk free API for qdisc delayed free
RCU use-case:
- Use the kfree_bulk() API for delayed RCU free
Other kernel subsystems? A kfree_bulk() sketch follows below.
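
kfree_bulk() itself is a thin wrapper that frees an array of kmalloc'ed pointers in one call. A minimal sketch, with the buffer size and count invented for illustration:

    /* Hedged sketch: freeing a batch of kmalloc'ed buffers with
     * kfree_bulk(), as a qdisc or RCU deferred-free path might batch
     * its work.
     */
    #include <linux/slab.h>

    #define NBUFS 8

    static void kfree_bulk_demo(void)
    {
            void *bufs[NBUFS];
            size_t i, n = 0;

            for (i = 0; i < NBUFS; i++) {
                    void *p = kmalloc(256, GFP_KERNEL);
                    if (!p)
                            break;
                    bufs[n++] = p;
            }

            /* One call frees the whole batch; objects belonging to the
             * same slab can be detected and freed together. */
            if (n)
                    kfree_bulk(n, bufs);
    }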

SKB overhead sources
Sources of overhead for SKBs (struct sk_buff):
- Memory alloc+free: addressed by the kmem_cache bulk API
- Clearing the SKB: need to clear 4 cache-lines!
- Read-only RX pages: cause more expensive construction of the SKB

SKB allocations: with read-only pages
Most drivers have read-only RX pages, which cause more expensive SKB setup:
- Alloc a separate writable mem area
- memcpy over the RX packet headers
- Store skb_shared_info in the writable area
- Set up pointers and offsets into the RX page-"frag"
Reason: a performance trade-off
- The page allocator is too slow
- The DMA-API is expensive on some platforms (with IOMMU)
- Hack: alloc and DMA-map larger pages, and "chop up" the page
- Side-effect: read-only RX page-frames, due to unpredictable DMA unmap time
A sketch of this construction follows below.
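
A rough sketch of the copy-headers construction, loosely modeled on what drivers such as ixgbe do; HDR_LEN, the truesize value and the function framing are illustrative, and page refcounting is omitted:

    /* Hedged sketch: build an SKB when the RX page is read-only.
     * Headers are copied into a small writable allocation; the
     * payload stays in the DMA page as a page-"frag".
     */
    #include <linux/skbuff.h>

    #define HDR_LEN 128     /* bytes of headers copied to writable area */

    static struct sk_buff *rx_copy_build_skb(struct napi_struct *napi,
                                             struct page *page,
                                             unsigned int offset,
                                             unsigned int pkt_len)
    {
            unsigned int copy = min_t(unsigned int, pkt_len, HDR_LEN);
            void *data = page_address(page) + offset;
            struct sk_buff *skb;

            /* Separate writable area: holds headers + skb_shared_info */
            skb = napi_alloc_skb(napi, HDR_LEN);
            if (!skb)
                    return NULL;

            /* memcpy the packet headers out of the read-only DMA page */
            memcpy(skb_put(skb, copy), data, copy);

            /* Point at the rest of the payload inside the RX page-frag */
            if (pkt_len > copy)
                    skb_add_rx_frag(skb, 0, page, offset + copy,
                                    pkt_len - copy, PAGE_SIZE / 2);
            return skb;
    }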

Benchmark: Page allocator (optimal case, 1 CPU, no congestion)
- A single page (order-0) is too slow for the 10Gbit/s budget (see the budget calculation below)
- Cycle cost increases with page order size, but partitioning the page into 4K fragments amortizes the cost
Mel Gorman's kernel tree: http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git
Tested branch: mm-vmscan-node-lru-v5r9
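
For reference, the 10Gbit/s cycle budget can be derived with simple arithmetic; a minimal userspace calculation, assuming minimum-size 64B Ethernet frames (plus 20B of preamble/inter-frame-gap wire overhead) and a 3 GHz CPU:

    /* Cycle budget at 10 Gbit/s line rate, minimum-size frames. */
    #include <stdio.h>

    int main(void)
    {
            double link_bps   = 10e9;                  /* link speed, bits/sec */
            double frame_bits = (64 + 20) * 8;         /* frame + wire overhead */
            double pps        = link_bps / frame_bits; /* ~14.88 Mpps */
            double ns_per_pkt = 1e9 / pps;             /* ~67.2 ns/packet */
            double cycles     = ns_per_pkt * 3.0;      /* ~201 cycles @ 3GHz */

            printf("%.2f Mpps, %.1f ns/packet, %.0f cycles @ 3GHz\n",
                   pps / 1e6, ns_per_pkt, cycles);
            return 0;
    }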

Issues with: Higher order pages
Performance workaround: alloc a larger order page and hand out fragments, amortizing the alloc cost over several packets (sketch below). Troublesome because:
1. Sometimes fast, but other times it requires reclaim/compaction, which can stall for prolonged periods of time
2. A clever attacker can pin down memory; especially relevant for the end-host TCP/IP use-case
3. Does not scale as well under concurrent workloads
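
A sketch of the chop-up trick under discussion; the frag_cache structure is invented for illustration and refcount handling is simplified:

    /* Hedged sketch: allocate one order-3 (32K) page and hand out 4K
     * fragments, hitting the page allocator once per 8 packets.
     * Real drivers add recycling and more careful refcounting.
     */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    #define RX_ORDER 3      /* 2^3 * 4K = 32K backing allocation */

    struct frag_cache {
            struct page *page;
            unsigned int offset;
    };

    static void *frag_alloc_4k(struct frag_cache *fc)
    {
            void *buf;

            if (!fc->page) {
                    fc->page = alloc_pages(GFP_ATOMIC | __GFP_COMP, RX_ORDER);
                    if (!fc->page)
                            return NULL;
                    fc->offset = 0;
            }

            buf = page_address(fc->page) + fc->offset;
            get_page(fc->page);     /* each fragment holds a page reference */

            fc->offset += 4096;
            if (fc->offset >= (PAGE_SIZE << RX_ORDER)) {
                    put_page(fc->page);     /* drop the cache's own reference */
                    fc->page = NULL;
            }
            return buf;
    }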

Concurrent CPUs scaling micro-benchmark
Danger of higher order pages with parallel workloads:
- Order-0 pages scale well
- Order-3 pages scale badly, even counting per 4K fragment; the advantage is already lost with 2 concurrent CPUs

RX-path: Make RX pages writable
Need to make RX pages writable. Why is the page (considered) read-only?
- Due to dma_unmap() time: several page fragments (packets) are in-flight, and dma_unmap() is only called for the last fragment in the RX ring queue
- DMA engine unmap semantics allow overwriting memory (not a problem on Intel)
Simple solution: use one packet per page, and call dma_unmap() before using the page (sketched below). My solution is the page_pool.
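
A minimal sketch of that simple solution, assuming an illustrative descriptor layout:

    /* Hedged sketch: "one packet per page". The page is DMA-unmapped
     * before the stack touches it, so the CPU can safely modify it.
     */
    #include <linux/dma-mapping.h>
    #include <linux/mm.h>

    struct rx_desc {
            struct page *page;
            dma_addr_t dma;
    };

    static struct page *rx_get_writable_page(struct device *dev,
                                             struct rx_desc *desc)
    {
            /* After this the DMA engine may no longer write to the page,
             * so headers can be rewritten in place. */
            dma_unmap_page(dev, desc->dma, PAGE_SIZE, DMA_FROM_DEVICE);
            return desc->page;
    }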

Page pool: Generic solution, many advantages
5 features of a recycling page pool (per device):
1. Faster than page-allocator speed: as a specialized allocator it requires fewer checks
2. DMA IOMMU mapping cost removed, by keeping the page mapped (credit to Alexei)
3. Makes the page writable, via a predictable DMA unmap point
4. OOM protection at device level: the feedback loop knows the number of outstanding pages
5. Zero-copy RX, solving memory early demux; depends on HW filters into RX queues

Page pool: Design
Idea presented at MM-summit, April 2016. Basic concept for the page_pool:
- Pages are recycled back into the originating pool, creating a feedback loop that helps limit the pages in the pool
- Drivers still need to handle the dma_sync part
- Page-pool handles dma_map/unmap, essentially as constructor and destructor calls
- Page free/return to the page-pool: either the SKB free path knows about it and calls page pool free, or put_page() handles it via a page flag
An API sketch follows below.
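
For reference, the page_pool API that later landed upstream (around kernel 4.18, <net/page_pool.h>) looks roughly like the sketch below; exact signatures have shifted between kernel versions, so treat this as illustrative:

    /* Hedged sketch of driver-side page_pool usage. */
    #include <net/page_pool.h>

    static struct page_pool *rxq_pool_setup(struct device *dev)
    {
            struct page_pool_params pp = {
                    .flags     = PP_FLAG_DMA_MAP,  /* pool does dma_map/unmap */
                    .order     = 0,                /* one 4K page per packet */
                    .pool_size = 1024,             /* roughly the RX ring size */
                    .nid       = NUMA_NO_NODE,
                    .dev       = dev,
                    .dma_dir   = DMA_FROM_DEVICE,
            };

            return page_pool_create(&pp);   /* ERR_PTR() on failure */
    }

    /* In the RX path:
     *   struct page *page = page_pool_dev_alloc_pages(pool);
     * On free/recycle (e.g. from NAPI context):
     *   page_pool_put_page(pool, page, true);
     */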

Page-pool: opportunity – feedback loop
Today: unbounded RX page allocations by drivers can cause OOM (Out-of-Memory) situations, handled via skb->truesize and queue limits.
The page pool provides a feedback loop (given pages are recycled back to the originating pool), allowing the pages/memory allowed per RXq to be bounded:
- Simple solution: configure a fixed memory limit (toy sketch below)
- Advanced solution: track steady-state; can function as a "Circuit Breaker" (see RFC draft link)
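
A toy sketch of the simple fixed-limit idea; all names are invented for illustration, and this is not the upstream implementation:

    /* Hedged sketch: refuse new pages once too many are outstanding
     * for this RX queue; returning pages closes the feedback loop.
     */
    #include <linux/atomic.h>
    #include <linux/mm.h>

    struct rxq_pool {
            atomic_t outstanding;   /* pages currently held outside the pool */
            int limit;              /* configured per-RXq page budget */
    };

    static struct page *rxq_alloc_bounded(struct rxq_pool *q)
    {
            if (atomic_inc_return(&q->outstanding) > q->limit) {
                    atomic_dec(&q->outstanding);
                    return NULL;    /* feedback: drop early, avoid OOM */
            }
            return alloc_page(GFP_ATOMIC);
    }

    static void rxq_return_page(struct rxq_pool *q, struct page *page)
    {
            put_page(page);
            atomic_dec(&q->outstanding);    /* recycle closes the loop */
    }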

SKB clearing cost is high
Options for addressing the clearing cost:
- Smaller/diet SKB (currently 4 cache-lines): too hard!
- Faster clearing: hand-optimized clearing only saves 10 cycles; clear larger contiguous mem (during the bulk alloc API)
- Delay clearing: don't clear on alloc (inside the driver); issue is knowing which fields the driver updated. Clear sections later, inside netstack RX, allowing prefetchw to have effect
The clearing in question is shown below.
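
For context, the clearing in question is the memset that SKB constructors perform over the first part of struct sk_buff, roughly as the kernel's __build_skb() does:

    #include <linux/skbuff.h>

    static void skb_clear_demo(struct sk_buff *skb)
    {
            /* Zero everything up to the tail field: about 4 cache-lines
             * of struct sk_buff. This memset is the cost the slide
             * targets; simplified from the kernel's __build_skb(). */
            memset(skb, 0, offsetof(struct sk_buff, tail));
    }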

Topic: RX-MM-allocator – Alternative
Prerequisite: the page is writable.
Idea: no SKB alloc calls during RX! Don't alloc the SKB; create it inside the head- or tail-room of the data-page, with skb_shared_info placed at the end of the data-page.
Issues / pitfalls:
- Clearing the SKB section is likely expensive
- SKB truesize increase(?)
- Needs a full page per packet (ixgbe does a page recycle trick)
For comparison, see the build_skb() sketch below.
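
The existing build_skb() helper already realizes half of this idea: skb_shared_info lives at the end of the data buffer, and only struct sk_buff itself still comes from a kmem_cache. A hedged sketch of driver usage, with the headroom handling simplified:

    /* Hedged sketch: build an SKB around an existing writable data
     * page. build_skb() places skb_shared_info at the end of the
     * buffer; the slide's proposal goes one step further and embeds
     * struct sk_buff in the page too. "data" points into a writable
     * RX page.
     */
    #include <linux/skbuff.h>

    static struct sk_buff *rx_wrap_page(void *data, unsigned int pkt_len)
    {
            /* frag_size covers the whole buffer incl. tailroom + shinfo */
            struct sk_buff *skb = build_skb(data, PAGE_SIZE);

            if (!skb)
                    return NULL;
            skb_reserve(skb, NET_SKB_PAD);  /* headroom for the stack */
            skb_put(skb, pkt_len);          /* mark payload length */
            return skb;
    }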

The end
kfree_bulk(16, slides);