Kernel Bypass Sujay Jayakar (dsj36) 11/17/2016.

Outline. Background: why networking? Status quo: Linux. Papers: Arrakis: The Operating System is the Control Plane. Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. OSDI 2014. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. OSDI 2014.

Why networking? Bigger pipes: the Intel XL710 chipset does 40 Gbps. We want networking to have higher throughput and lower latency, and both matter: with bounded queue depth, latency affects bandwidth. I actually programmed against this card for Diskotech. 100 Gbps is coming (Mellanox has a QSFP+ card; Intel isn't ready for market yet, IIRC). http://www.intel.com/content/www/us/en/ethernet-products/converged-network-adapters/ethernet-xl710.html

Why networking? Faster storage: the Intel P3700 (20 Gbps, 100 µs reads), and apparently the Intel P3608 can do 40 Gbps. It's also pretty common to have many (~100?) disks per node in storage tiers, and ~100 MB/s sequential reads add up quickly. http://www.anandtech.com/show/8104/intel-ssd-dc-p3700-review-the-pcie-ssd-transition-begins-with-nvme/3

Why networking? The current storage architecture in the kernel assumes a spinning disk: high latency, effectively single-threaded, so scheduling matters. With NVMe, the I/O scheduler, SCSI drivers, ATA translation, and disk controller layers get replaced. https://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram

Why networking? Software requirements: memcached, Redis, and friends serve tiny packets (~1M qps/node, keys < 50 bytes, values < 500 bytes), while big flows (analytics, object storage) need large bandwidth.

Status Quo: Linux. The NIC is owned by the kernel; the kernel multiplexes userspace processes onto a singleton network stack and uses SMP to share that stack.

Status Quo: Linux. Per-packet processing time, from the Arrakis paper: 1.26 µs in the receive-side network stack; 0.34 µs to cross into the application (0.24 µs packet copy + 0.10 µs syscall return); 0.54 µs to cross back into the kernel (0.44 µs packet copy + 0.10 µs syscall); 1.05 µs in the transmit-side network stack; the remaining ~0.17 µs is scheduler time. Total: 3.36 µs per packet, plus 2.23 µs to 6.19 µs more if the process is off-core and needs to be context-switched in. Previously, packet inter-arrival times were much larger than interrupt/syscall latency, so the focus was on fine-grained resource scheduling.

Status Quo: Linux. An Ethernet frame is 84 bytes to 1538 bytes on the wire (1500 MTU). 10 Gb/s = 1.25 GB/s = 800K to 15M packets/sec, so the service-time budget is only 67 ns to 1.25 µs per packet. Even in the best setting we've blown way through our budget. We understand the copies and syscall boundaries, so why is the network stack itself slow? http://netoptimizer.blogspot.com/2014/05/the-calculations-10gbits-wirespeed.html

Status Quo: Linux. At 40 Gb/s = 5 GB/s, the same frame sizes give 3.25M to 60M packets/sec, a budget of only 17 ns to 307 ns per packet.

Status Quo: Linux. At 100 Gb/s = 12.5 GB/s, that's 8M to 150M packets/sec, a budget of only 6.7 ns to 125 ns per packet.
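
To make the budget concrete, here's a minimal C sketch that redoes the arithmetic from the last three slides (the 84-to-1538-byte range is the on-wire frame size for a 1500-byte MTU):

```c
/* Back-of-the-envelope per-packet budget at various line rates,
 * reproducing the numbers from the slides above. */
#include <stdio.h>

int main(void) {
    const double min_frame = 84.0, max_frame = 1538.0; /* bytes on the wire */
    const double gbps[] = {10.0, 40.0, 100.0};

    for (int i = 0; i < 3; i++) {
        double bytes_per_sec = gbps[i] * 1e9 / 8.0;
        double max_pps = bytes_per_sec / min_frame;  /* smallest frames */
        double min_pps = bytes_per_sec / max_frame;  /* largest frames */
        printf("%5.0f Gb/s: %6.2fM to %7.2fM pkts/s, budget %7.1f ns to %8.1f ns\n",
               gbps[i], min_pps / 1e6, max_pps / 1e6,
               1e9 / max_pps, 1e9 / min_pps);
    }
    return 0;
}
```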

Status Quo: Linux. Where does the time go? The network stack itself does demultiplexing, checks, and scheduling, and needs synchronization for multi-core. POSIX adds its own limitations: a copy is required for send(2) and recv(2) (the network API can't DMA directly into user buffers, though the filesystem API can), file descriptors can migrate among processes (so sockets are hard to affinitize in general), and there is context-switch overhead: upon receiving an interrupt, the scheduler has to context switch into the waiting process. A blocking API does not help here, since the process will get descheduled often. IX makes an interesting point: the scheduler is very fine-grained in that it receives control whenever a syscall blocks, subsequently requiring an interrupt and a context switch back in. That design makes sense when packet inter-arrival times are much larger than the interrupt/syscall latency.
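
For contrast with the kernel-bypass designs that follow, here's a minimal sketch of the conventional path those costs come from: every packet pays at least one syscall and one kernel-to-user copy.

```c
/* Conventional kernel-mediated UDP receive loop: one syscall and one
 * kernel-to-user copy per packet, exactly the per-packet costs itemized
 * above. */
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(11211);               /* memcached-style port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[1500];
    for (;;) {
        /* Blocks in the kernel; on wakeup the payload is copied into buf. */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0)
            break;
        /* ... handle the request, then another syscall + copy to reply ... */
    }
    close(fd);
    return 0;
}
```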

Arrakis: The Operating System is the Control Plane. Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. OSDI 2014. Won best paper. Mostly a UW crew: Simon Peter is a postdoc; Jialin Li, Irene Zhang, and Doug Woos are PhD students; Dan Ports, Arvind Krishnamurthy, and Tom Anderson are professors at UW; Timothy Roscoe is a professor at ETH Zurich. Main takeaway: moving multiplexing into the NIC (using virtualization) is much more efficient than doing it in the kernel.

Network virtualization. VNICs are managed by a "control plane" in the kernel; the NIC itself is responsible for bandwidth allocation and demultiplexing. The Intel 82599 chipset they use only supports filtering by MAC address (you'd want arbitrary header fields), does weighted round-robin bandwidth allocation, and supports at most 64 VNICs. Pretty weak part of the paper, if you ask me. SR-IOV is the standard for making a single physical PCI Express device look like multiple devices to a single root; Intel's VT-d supplies the implementation plus an IOMMU and interrupt virtualization.
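
As a rough mental model only (not any real driver interface), a VNIC slot on an 82599-class card boils down to a MAC demux filter plus a weighted round-robin share; the struct below is purely illustrative, with hypothetical field names:

```c
/* Illustrative only: a rough model of what an 82599-class NIC lets the
 * control plane configure per VNIC. Field names are hypothetical. */
#include <stdint.h>

#define MAX_VNICS 64                 /* hardware limit mentioned on the slide */

struct vnic_config {
    uint8_t  mac_filter[6];          /* demultiplex by destination MAC only */
    uint16_t wrr_weight;             /* weighted round-robin bandwidth share */
    uint16_t rx_queue;               /* hardware receive queue backing this VNIC */
    uint16_t tx_queue;               /* hardware transmit queue */
};

struct nic_state {
    struct vnic_config vnics[MAX_VNICS];
    uint8_t            num_vnics;
};
```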

Storage virtualization. A VSIC is a SCSI/ATA queue with a rate limiter; a Virtual Storage Area (VSA) is a persistent disk segment. Each VSIC has many VSAs, and each VSA can be mapped into many VSICs (for interprocess sharing). They don't have a device like this, so it's emulated in software with a dedicated core. Is this really as bogus as I read it to be? They compare their custom log data structure with ext4/btrfs, which is apples to Ferraris; Redis could do pretty much the same thing with direct I/O if it wanted to.
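
The VSIC/VSA relationship is a many-to-many mapping; here's a minimal sketch of the bookkeeping, with hypothetical names, since no such device exists and Arrakis emulates it in software:

```c
/* Illustrative data model for the emulated storage virtualization:
 * a VSA (persistent disk segment) can be mapped into several VSICs,
 * and a VSIC (per-application command queue) can see several VSAs. */
#include <stddef.h>
#include <stdint.h>

struct vsa {
    uint64_t disk_offset;               /* start of the persistent segment */
    uint64_t length;                    /* segment size in bytes */
};

struct vsa_mapping {
    struct vsa *vsa;                    /* shared segment */
    uint64_t    virtual_offset;         /* where it appears in this VSIC */
};

struct vsic {
    struct vsa_mapping *mappings;       /* VSAs visible to this application */
    size_t              num_mappings;
    uint64_t            rate_limit_bps; /* enforced by the (emulated) device */
};
```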

Arrakis differences. The driver and the network stack now live in a libOS in userspace; the network stack no longer needs to demultiplex; there are fewer security checks (less worry about data leaking?); and the kernel is no longer on the hot path.

Arrakis: Control Plane. VIC management: queue pair creation and deletion. Doorbells: an IPC endpoint for VIC notifications. Hardware control: demultiplexing filters (just layer 2 for now) and rate specifiers for rate limiting.
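
Roughly, the control plane is a slow-path API the application hits once at setup time, after which the kernel steps out of the data path. The sketch below uses hypothetical names, not the paper's actual signatures, just to show the shape of the interface:

```c
/* Hypothetical control-plane interface, named for illustration only:
 * the application asks the kernel to carve out hardware resources once,
 * then the kernel is off the hot path. */
#include <stdint.h>

typedef int vic_t;        /* handle to a virtual interface card */
typedef int doorbell_t;   /* POSIX fd used for notifications */

/* Create/destroy a hardware queue pair bound to the calling process. */
vic_t      ctl_create_queue_pair(void);
void       ctl_delete_queue_pair(vic_t vic);

/* Doorbell: IPC endpoint the kernel/NIC uses to notify the VIC owner. */
doorbell_t ctl_create_doorbell(vic_t vic);

/* Install a layer-2 demultiplexing filter steering a MAC address
 * (and, someday, arbitrary header fields) to this VIC. */
int        ctl_install_filter(vic_t vic, const uint8_t mac[6]);

/* Attach a rate specifier so the NIC enforces bandwidth limits. */
int        ctl_set_rate_limit(vic_t vic, uint64_t bits_per_second);
```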

Arrakis: Doorbells. SR-IOV virtualizes the MSI-X interrupt registers; the libOS driver handles interrupts in user space if the application is on-core, otherwise they go through the control plane. The API is just a regular POSIX file descriptor. Open questions: how do interrupts on a descheduled VM work? Is redirection to the hypervisor part of the VMENTER/VMEXIT setup/teardown? Is ARP shared among guests and the hypervisor? Maybe not: the more modern XL710 assigns a new MAC address to each VF and routes through "the virtual machine switch."
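
Since the doorbell is exposed as a regular POSIX file descriptor, waiting on it should look like waiting on any other fd. A minimal sketch (how the fd is obtained, ctl_create_doorbell, is a hypothetical name carried over from the previous sketch):

```c
/* Waiting on a doorbell: because it is an ordinary file descriptor,
 * standard readiness APIs like epoll apply. */
#include <sys/epoll.h>
#include <unistd.h>

extern int ctl_create_doorbell(int vic);   /* hypothetical, see above */

void wait_for_packets(int vic)
{
    int doorbell = ctl_create_doorbell(vic);
    int ep = epoll_create1(0);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = doorbell };
    epoll_ctl(ep, EPOLL_CTL_ADD, doorbell, &ev);

    for (;;) {
        struct epoll_event ready;
        if (epoll_wait(ep, &ready, 1, -1) == 1) {
            /* Doorbell fired: packets are waiting on the VIC's receive
             * queue; go drain them from user space. */
            break;
        }
    }
    close(ep);
}
```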

Arrakis: User APIs. Two interfaces: Extaris, a full POSIX sockets layer (Extaris: another planet in Dune), and Arrakis/N, zero-copy rings at the Ethernet level with send_packet(queue, array), receive_packet(queue) → packet, packet_done(packet), and doorbells on completion. Full POSIX support means they run memcached and Redis unmodified; porting to Arrakis/N needs pretty slight changes (only 130 LOC for memcached).
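
Here's a sketch of what a data-path loop over those calls might look like; only the three function names come from the slide, and the types and signatures are assumptions:

```c
/* Event loop against the Arrakis/N-style zero-copy ring API named on the
 * slide. Types, signatures, and build_reply are assumed for illustration. */
#include <stddef.h>

struct packet;                                        /* opaque, NIC-owned buffer */
typedef int queue_t;

extern struct packet *receive_packet(queue_t q);                     /* assumed */
extern int            send_packet(queue_t q, struct packet *pkts[]); /* assumed */
extern void           packet_done(struct packet *pkt);               /* assumed */

extern struct packet *build_reply(struct packet *req);  /* app logic, hypothetical */

void serve(queue_t rx, queue_t tx)
{
    for (;;) {
        /* Packets arrive already demultiplexed by the NIC into this
         * application's ring: no copy, no kernel crossing. */
        struct packet *req = receive_packet(rx);
        if (!req)
            continue;                  /* or block on the doorbell fd */

        struct packet *out[] = { build_reply(req), NULL };
        send_packet(tx, out);          /* zero-copy transmit; completion is
                                          signalled via a doorbell */

        packet_done(req);              /* hand the receive buffer back to
                                          the NIC ring */
    }
}
```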

IX: A Protected Dataplane Operating System for High Throughput and Low Latency. Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. OSDI 2014. Another paper from OSDI '14 (and yet another planet from Dune), and also a best-paper winner. Adam Belay, Ana Klimovic, and Samuel Grossman are PhD students at Stanford; Christos Kozyrakis is a professor at Stanford; George Prekas is a PhD student at EPFL; Edouard Bugnion is a professor at EPFL. Takeaway: use CPU virtualization as well, to get protection between the user and the network stack.

IX: More protection. To IX, a network stack in user space is unacceptable: firewall and ACL management belong in the data plane, sending raw packets requires root, and does zero-copy send require memory protection? Approach: use CPU virtualization to get three-way protection between the kernel, the network stack, and user code.

IX: Protection can be cheap. FlexSC (OSDI '10) asks: why are syscalls expensive? The direct effects (user-mode pipeline flush, mode switch, stashing registers) are only ~150 cycles.

IX: Protection can be cheap. FlexSC (OSDI '10), continued: beyond the ~150-cycle direct effects, the indirect effects are a TLB flush (if there is no tagged TLB) and cache pollution with kernel instructions and data. Syscalls are slow on x86_64 because they blow out the cache (back-end stalls) and slow down subsequent user code; L2 and L3 suffer more than L1 since they prefetch. Measured impact of a single pwrite(2) (cache columns are 64-byte cache-line evictions; roughly 24 KB of the 32 KB L1-d and about 200 KB of L3 get flushed):

              IPC   i-cache   d-cache     L2      L3   d-TLB
  pwrite(2)  0.18        50       373    985    3160      44
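
If you want to see the direct cost on your own machine, here's a quick-and-dirty measurement; it only captures the ~150-cycle direct effects, not the cache and TLB pollution:

```c
/* Time a tight loop of raw getpid() syscalls (bypassing any libc caching)
 * and report the average round trip. Indirect cache/TLB effects on
 * subsequent user code won't show up in this number. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const long iters = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);                    /* forces a real kernel entry */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                (end.tv_nsec - start.tv_nsec);
    printf("average syscall round trip: %.1f ns\n", ns / iters);
    return 0;
}
```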

IX: Protection can be cheap. Dune (OSDI '12) runs Linux processes in VMX non-root ring 0, with syscalls replaced by hypercalls. Untrusted code can run in ring 3 and cheaply mode-switch back into ring 0. Processes can then manipulate page tables, interrupts, hardware, privilege levels, and so on. (Memory safety has a bad rap.) Main idea: keep user space and kernel space similar. Requires VPID (a tagged TLB).

IX: More protection. Differences from Arrakis: libix is a lot simpler, just glue that exposes a POSIX-like async API, and all network stack code runs in (non-root) ring 0. Otherwise it's the same idea.

Why not both? IX's use of VMX non-root ring 0 is clever, and its embedding in Linux (with Dune) is nice. But if they had used SR-IOV, they wouldn't have needed the Toeplitz-hash hack or to monopolize a queue, and they would have gotten (fast!) virtualized interrupts so they wouldn't have to poll. Could the IX authors have used their fancy-pants Dune features to simulate a full blocking POSIX API? (Fun fact: Cornell grad Greg Hill, working on his PhD at Stanford, added SR-IOV support to IX in 2015.)

Arrakis: Evaluation. For reference, the old Linux per-packet processing time: 1.26 µs and 1.05 µs in the network stack, 0.34 µs and 0.54 µs at the kernel crossings, for a total of 3.36 µs (plus 2.23 µs to 6.19 µs more if off-core).

Arrakis: Evaluation. With Arrakis the packet service time is 0.21 µs + 0.17 µs = 0.38 µs, roughly 10x faster. All of the time is now in the network stack, which is itself about 6x faster (no demultiplexing, better cache utilization, no synchronization, fewer checks?). There is no scheduler in the path, interrupts are delivered to the right core, and there is no context switch, no copy, and no kernel crossing.

Arrakis: memcached. They modified memcached to use Arrakis/N in only 74 LOC. The top of the chart is 10 Gbps. The comparison is biased towards Linux (TCP/UDP segmentation offload, header splitting).

IX: Evaluation

IX: Give up on POSIX. IX says no to POSIX (and rewrites the network stack): a non-blocking API with batching, packets processed to completion, "elastic" vs. "background" threads, and no internal buffering, so slow consumers lead to delayed ACKs. Run to completion: slow processing means slow ACKs and a smaller sender window. Adaptive batching: keep going until you would block (never wait to fill a batch) or until you hit a max batch size; this helps cache locality and amortizes the mode switch. Elastic threads are pinned to a CPU and handle network events; background threads are timeshared on reserved CPUs. The elastic/background split isn't that strange: elastic threads are the event loop, background threads are the workers.
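
A minimal sketch of what an elastic thread's run-to-completion loop with adaptive batching might look like; all the helper names here are hypothetical, not IX's actual API:

```c
/* Run-to-completion elastic thread with adaptive batching, in the spirit
 * of the slide. poll_rx_queue, handle_request, and flush_tx_queue are
 * hypothetical names. */
#include <stddef.h>

#define MAX_BATCH 64                 /* cap so latency stays bounded */

struct pkt;

extern struct pkt *poll_rx_queue(void);            /* NULL if the queue is empty */
extern void        handle_request(struct pkt *p);  /* app logic, to completion */
extern void        flush_tx_queue(void);           /* push out queued responses */

void elastic_thread_loop(void)
{
    for (;;) {
        size_t n = 0;
        struct pkt *p;

        /* Adaptive batching: keep pulling packets until the queue is
         * empty (never wait to fill a batch) or we hit the batch cap. */
        while (n < MAX_BATCH && (p = poll_rx_queue()) != NULL) {
            /* Run to completion: fully process each packet, generating
             * its response, before touching the next one. */
            handle_request(p);
            n++;
        }

        /* One mode switch / doorbell write amortized over the batch. */
        if (n > 0)
            flush_tx_queue();
    }
}
```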

IX: Packet processing