ZUFS - Zero-copy User-mode FS

ZUFS - Zero-copy User-mode FS
A new interface for a new breed of user-mode filesystems that require:
- Extremely low latency
- Synchronous & DAX
- NUMA-aware access
Boaz Harrosh @ Linux Plumbers, Sep. 2017
© 2017 NetApp, Inc. All rights reserved. --- NETAPP CONFIDENTIAL ---

In theory
[Diagram: applications (APP) enter the kernel through ZUF, the Zu Feeder. In user space, ZUS, the Zu Server, runs one ZU thread (zt) per CPU and hosts the filesystem plugins Zufs-foo.so, Zufs-bar.so and Zufs-mem.so.]

In Theory
- ZT: a ZUFS thread per CPU, with affinity to a single CPU (thread_fifo/rr).
- A special ZUFS communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT).
- ZT-vma: a 4M vma, mmapped per ZT, that serves as the zero-copy communication area.
- IOCTL_ZU_WAIT_OPT: the thread sleeps in the kernel, waiting for an operation.
- On application I/O, the ZT of the current CPU is selected and the app pages are mapped into the ZT-vma. The server thread is released with an operation.
- After execution, the ZT returns to the kernel (IOCTL_ZU_WAIT_OPT), the app is released, and the server waits for a new operation.
- On exit (or server crash) the file is closed and the kernel cleans up all resources.
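The server side of this cycle can be sketched as a loop around the wait call. This is only a user-space model under stated assumptions: `zt_op`, `zt_serve`, the operation codes and the `fake_wait` function that stands in for `ioctl(fd, IOCTL_ZU_WAIT_OPT, ...)` are hypothetical names for illustration, not the real ZUFS ABI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation codes; the slides only name ZUS_OP_BREAK. */
enum { OP_READ = 1, OP_WRITE = 2, OP_BREAK = 100 };

/* Hypothetical, simplified stand-in for the per-ZT operation header. */
struct zt_op {
    int operation;  /* filled in by the kernel side */
    int err;        /* result of the previous operation, set by the server */
};

/*
 * Stand-in for the wait ioctl: it hands back op->err for the previous
 * operation, then "blocks" until the next operation is available.
 */
typedef int (*zt_wait_fn)(void *ctx, struct zt_op *op);
typedef int (*zt_handler_fn)(const struct zt_op *op);

/* Main loop of one ZT. Returns the number of operations served. */
int zt_serve(zt_wait_fn wait_opt, void *ctx, zt_handler_fn handle)
{
    struct zt_op op = { 0, 0 };
    int served = 0;

    for (;;) {
        if (wait_opt(ctx, &op) != 0)
            break;                  /* channel closed or failed */
        if (op.operation >= OP_BREAK)
            break;                  /* asked to stop */
        op.err = handle(&op);       /* app pages are mapped at this point */
        served++;
    }
    return served;
}

/* Fake "kernel" so the loop can be exercised without a ZUFS channel. */
struct fake_kernel {
    const int *script;  /* operations to hand out, terminated by OP_BREAK */
    size_t next;
    int results[8];     /* err values reported back by the server */
    size_t nresults;
};

int fake_wait(void *ctx, struct zt_op *op)
{
    struct fake_kernel *k = ctx;

    if (k->next > 0) {              /* a previous operation just completed */
        assert(k->nresults < 8);
        k->results[k->nresults++] = op->err;
    }
    op->operation = k->script[k->next++];
    return 0;
}

/* Demo handler: pretend every write fails with -5 (EIO on Linux). */
int demo_handler(const struct zt_op *op)
{
    return op->operation == OP_WRITE ? -5 : 0;
}
```

Note that a single call both reports the previous result and fetches the next operation, which is why the loop needs only one kernel entry per operation.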

In theory
[Diagram: the application's pages (P) are mapped by ZUF (Zu Feeder, kernel) into the zt-vma of the Zu Thread, inside the VM of ZUS (Zu Server, user space), and are unmapped on return.]

In Theory
- Async operation is also supported.
- The server must not sleep in a ZT. All locks are trylocks.
- If a lock cannot be taken, the operation is queued and the server returns EAGAIN.
- The server later completes the operation asynchronously, and the app is woken up.
- Do we need PAGE_CACHE support? Here too, write/read_pages() maps the page cache into the zt-vma.
- Application mmap goes in the opposite direction: ZUS exposes pages (opt_get_data_block) into the app VM.
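The trylock-then-queue rule can be sketched in a few lines. This is a minimal model, assuming a C11 `atomic_flag` as the trylock and a fixed-size array as the deferred queue; `inode_state`, `zt_apply` and `async_drain` are hypothetical names, and the real server would wake the application where the comment says so.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

struct inode_state {            /* hypothetical per-object state */
    atomic_flag busy;           /* set while someone holds the object */
    int value;
};

struct pending_op { struct inode_state *ino; int delta; };

struct async_queue {
    struct pending_op ops[MAX_PENDING];
    size_t count;
};

static bool try_lock(struct inode_state *ino)
{
    /* test_and_set returns the previous value: false means we got it */
    return !atomic_flag_test_and_set(&ino->busy);
}

static void unlock(struct inode_state *ino)
{
    atomic_flag_clear(&ino->busy);
}

/* Runs on a ZT, so it must never block. */
int zt_apply(struct async_queue *q, struct inode_state *ino, int delta)
{
    assert(q->count <= MAX_PENDING);

    if (try_lock(ino)) {
        ino->value += delta;
        unlock(ino);
        return 0;               /* completed synchronously */
    }
    if (q->count == MAX_PENDING)
        return -EBUSY;          /* no room to defer */
    q->ops[q->count++] = (struct pending_op){ ino, delta };
    return -EAGAIN;             /* will complete asynchronously */
}

/* Runs later, outside the ZT. Returns how many operations completed. */
size_t async_drain(struct async_queue *q)
{
    size_t done = 0, kept = 0;

    for (size_t i = 0; i < q->count; i++) {
        struct pending_op op = q->ops[i];

        if (try_lock(op.ino)) {
            op.ino->value += op.delta;
            unlock(op.ino);
            done++;             /* the real server would wake the app here */
        } else {
            q->ops[kept++] = op;    /* still contended, retry next pass */
        }
    }
    q->count = kept;
    return done;
}
```

Returning `-EAGAIN` right away keeps the per-CPU thread free for the next request on that CPU, at the cost of a second wakeup for the deferred operation.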

Raw Results: FUSE vs. ZUFS vs. In Kernel

Threads   FUSE Op/s   FUSE Lat [us]   ZUFS Op/s   ZUFS Lat [us]   In-kernel FS Op/s   In-kernel FS Lat [us]
1            71,820            13.5     200,799             4.6             388,361                2.271589
2           148,083            13.1     314,321             5.9             635,115                2.604376
4           212,133            18.3     565,574             6.6           1,260,307                2.626361
8           209,799            37.6   1,113,138               -           2,744,963                2.485292
12          201,689            58.7   1,598,451             6.8           2,126,945                5.020506
18          174,823           101.8   1,648,689             7.8           4,350,995                3.386433
24          149,413           159.0   1,702,285             8.0           4,211,180                4.784997
36          148,276           240.7   1,783,346            13.4           3,057,166                9.291997
48          145,296           327.3   1,741,873            17.4           3,148,972               10.382461

The first column group has no usable label in the slide text. It is taken to be the FUSE series, since ZUFS and In-kernel FS are labeled separately. The ZUFS latency at 8 threads is missing from the transcript.

Motivation for ZUFS (for near-memory speed PM media)
- Measured on a dual-socket Intel Xeon 2650v4 (48 HW threads)
- DRAM-backed PM type
- Random 4KB DirectIO write(ish) access

Why is the mm patch required?
- MMAP_LOCAL_CPU
- Own-core TLB invalidate
- Secure file system signing?

Raw Results w/ and wo/ mm patch (ZUFS penalty)

Threads   Patched Op/s   Patched Lat [us]   Unpatched Op/s   Unpatched Lat [us]
1              200,799                4.6          185,391                  4.9
2              314,321                5.9          197,993                  9.6
4              565,574                6.6          310,597                 12.1
8            1,113,138                  -          546,702                 13.8
12           1,598,451                6.8          641,728                 17.2
18           1,648,689                7.8          744,750                 22.2
24           1,702,285                8.0          790,805                 28.3
36           1,783,346               13.4          849,763                 38.9
48           1,741,873               17.4          792,000                 44.6

The patched latency at 8 threads is missing from the transcript.
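To read the effect of the mm patch off the numbers, one can divide patched by unpatched throughput at each thread count. The sketch below does this with the Op/s figures from the slide; the names `sample`, `results` and `speedup_at` are made up for the example. The ratio comes out at roughly 1.08x at 1 thread and about 2.2x at 48 threads.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the slide: throughput with and without the mm patch. */
struct sample {
    int threads;
    double patched_ops;
    double unpatched_ops;
};

static const struct sample results[] = {
    {  1,  200799.0, 185391.0 },
    {  2,  314321.0, 197993.0 },
    {  4,  565574.0, 310597.0 },
    {  8, 1113138.0, 546702.0 },
    { 12, 1598451.0, 641728.0 },
    { 18, 1648689.0, 744750.0 },
    { 24, 1702285.0, 790805.0 },
    { 36, 1783346.0, 849763.0 },
    { 48, 1741873.0, 792000.0 },
};

/*
 * Patched/unpatched throughput ratio for a given thread count,
 * or -1.0 if the slide has no row for that count.
 */
double speedup_at(int threads)
{
    for (size_t i = 0; i < sizeof(results) / sizeof(results[0]); i++) {
        if (results[i].threads == threads) {
            assert(results[i].unpatched_ops > 0.0);
            return results[i].patched_ops / results[i].unpatched_ops;
        }
    }
    return -1.0;
}
```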

Additional Design Considerations
- A single ZUS application server.
- ZUFS filesystems are .so libraries loaded into ZUS (preconfigured or at run time).
- Regular mount command. New super blocks are created.
- Devices are managed and owned by ZUF in the kernel.
- Bind mount also works, the regular way.
- ZUS-API with fs-plugins, very close to the VFS API.
- Support for compiling zus-plugins as kernel modules, also fed by ZUF?
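A plugin interface that stays close to the VFS API usually boils down to a table of function pointers that each filesystem library registers with the server. The sketch below illustrates that idea only; `zus_fs_ops`, `zus_register_fs`, `zus_find_fs` and the `zufs-mem` demo entry are hypothetical and do not come from the ZUFS sources.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical VFS-like operations table exported by one plugin. */
struct zus_fs_ops {
    const char *name;                   /* filesystem type, as given to mount */
    int (*mount)(void *sb_private);     /* called when a super block is created */
};

#define MAX_FS 8

static const struct zus_fs_ops *registry[MAX_FS];
static size_t nregistered;

/* Look up a registered filesystem by name; NULL if unknown. */
const struct zus_fs_ops *zus_find_fs(const char *name)
{
    for (size_t i = 0; i < nregistered; i++)
        if (strcmp(registry[i]->name, name) == 0)
            return registry[i];
    return NULL;
}

/* A plugin would call this when it is loaded into the server. */
int zus_register_fs(const struct zus_fs_ops *ops)
{
    assert(nregistered <= MAX_FS);

    if (ops == NULL || ops->name == NULL || ops->mount == NULL)
        return -EINVAL;
    if (zus_find_fs(ops->name) != NULL)
        return -EEXIST;
    if (nregistered == MAX_FS)
        return -ENOSPC;
    registry[nregistered++] = ops;
    return 0;
}

/* Demo plugin: a do-nothing memory filesystem. */
static int mem_mount(void *sb_private)
{
    (void)sb_private;
    return 0;
}

const struct zus_fs_ops zufs_mem_ops = { "zufs-mem", mem_mount };
```

Because the table holds only function pointers, the same plugin source could in principle be built either as a shared library or as a kernel module, which is the open question the last bullet raises.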

Thank you
Please talk to me about ZUFS
boazh@netapp.com

/* Wakeup/wait helpers: one wait queue per side (fss = server, app = application). */
static void _zu_wakeup_fss(struct zufs_thread *zt)
{
	zt->fss_wakeup = true;
	wake_up(&zt->fss_wq);
}

static void _zu_wakeup_app(struct zufs_thread *zt)
{
	zt->app_wakeup = true;
	wake_up(&zt->app_wq);
}

static int _zu_wait_fss(struct zufs_thread *zt)
{
	zt->fss_wakeup = false;
	return wait_event_interruptible(zt->fss_wq, zt->fss_wakeup);
}

static int _zu_wait_app(struct zufs_thread *zt)
{
	zt->app_wakeup = false;
	return wait_event_interruptible(zt->app_wq, zt->app_wakeup);
}

/* Server side: entered from the ZT through the wait ioctl. */
static int _zu_wait(struct file *file, void *parg)
{
	struct zufs_thread *zt;
	int cpu = smp_processor_id();
	int err;

	err = _zt_from_f(file, cpu, &zt);
	if (unlikely(err))
		goto err;

	zt->fss_waiting = true;
	if (zt->app_waiting) {
		/* Previous operation is done: unmap, pass the result on, release the app. */
		_unmap_pages(zt, zt->pages, zt->nump);
		zt->app_waiting = false;
		get_user(zt->next_opt.hdr.err, (int *)parg);
		_zu_wakeup_app(zt);
	}

	_zu_wait_fss(zt);
	zt->fss_waiting = false;

	/* call map here at the zuf thread so we need no locks */
	if (zt->next_opt.operation && zt->next_opt.operation < ZUS_OP_BREAK)
		_map_pages(zt, zt->pages, zt->nump, false);

	/* copy_to_user() returns the number of bytes left uncopied, not an errno. */
	if (copy_to_user(parg, &zt->next_opt, sizeof(zt->next_opt)))
		return -EFAULT;
	return 0;

err:
	put_user(err, (int *)parg);
	return err;
}

/* Application side: called from the filesystem on the app's own CPU. */
int zufs_dispatch(struct m1fs_sb_info *sbi, int operation, uint pgoffset,
		  struct page **pages, uint nump, u64 filepos, uint len)
{
	/* These declarations are not on the slide; assumed to mirror _zu_wait(). */
	struct zufs_thread *zt;
	int cpu = smp_processor_id();

	if ((cpu < 0) || (sbi->_max_zts <= cpu))
		return -ERANGE;

	zt = &sbi->_all_zt[cpu];
	if (unlikely(!zt->file))
		return -EIO;

	while (!zt->fss_waiting) {
		mb();
		m1fs_err("[%d] can this be\n", cpu);
		msleep(100);
	}

	zt->next_opt.operation = operation;
	zt->next_opt.offset = pgoffset;
	zt->next_opt.filepos = filepos;
	zt->next_opt.len = len;
	zt->pages = pages;
	zt->nump = nump;

	zt->app_waiting = true;
	_zu_wakeup_fss(zt);
	_zu_wait_app(zt);

	return zt->file ? zt->next_opt.hdr.err : -EIO;
}

Abstract

FUSE has enabled user-space file systems since kernel 2.6.14. It is a widely popular vehicle for rapid development, and tens of file systems have used it to date. FUSE is asynchronous in nature and relies heavily on the operating system's page cache. It was designed with hard-drive latency in mind and was measured to add a penalty of 12.5 to 1000 microseconds [us], depending on the load.

Emerging persistent memory technologies, such as NVDIMM-N and 3D XPoint / MRAM / ReRAM based NVDIMM, operate at near-memory speed and require a different user-space file system mechanism, one that is tuned for latency. The motivation of this work is to enable a new breed of user-mode work based on these technologies, which typically respond within a single microsecond, faster than any caching, redundant data copying or queuing.

ZUFS, pronounced Zoo-FS, stands for Zero-copy User-mode FS. It is a new kernel project designed to fill that gap. ZUFS is designed from the get-go to provide an example of a full, PMEM-based FS that demonstrates speed, behavior and correctness. But full-fledged filesystems are not the only intended users: any other user-mode service that wants to serve many applications very efficiently through modern direct-mmapped access, and that can live with a filesystem-like interface, can use this new ABI from the kernel.

The emphasis is on multi-threaded, low-latency, synchronous, zero-copy, direct-mapped access from the application to the server application and, vice versa, direct mapping of server resources into the application, all without sacrificing security or robustness.

How do we think we can do this? Please see the attached paper, and come to our talk. But mainly this is more of an open question than a ready-made proposal.