NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
J. Xu et al., FAST '16
Presenter: Rohit Sarathy
April 18, 2017

NVMM Challenges
New NVMM technologies (e.g., Intel's) promise to revolutionize I/O performance, but raise three challenges:
Performance: trade-offs between software and hardware latency; software costs can dominate memory latency
Write reordering: power failures are a challenge when writes are reordered on their way to NVMM
Atomicity: needed for POSIX compliance, yet disks and processors provide only small atomicity guarantees

NOVA's contributions
Extends existing log-structured file system techniques to NVMM
Atomic-mmap for stronger memory-mapped I/O consistency
Adaptable across many NVMM systems

Why LFS?
Log-structuring provides cheaper atomicity than journaling and shadow paging
NVMM supports fast, highly concurrent random accesses, so logs no longer need to be sequential on the device
https://www.usenix.org/sites/default/files/conference/protected-files/fast16_slides_xu.pdf

Comparison with other NVMM file systems
https://www.usenix.org/sites/default/files/conference/protected-files/fast16_slides_xu.pdf

NOVA Design
Maintain logs in NVMM and indexes in DRAM
Each inode has its own log
Use logging and lightweight journaling for complex atomic updates
Implement each log as a singly linked list
No logging of file data (file data is updated copy-on-write)
Images and header bullet points from https://www.usenix.org/sites/default/files/conference/protected-files/fast16_slides_xu.pdf

NVMM Data Structures
NOVA divides the NVMM into four regions: the superblock and recovery inode, the inode tables, the journals, and the log/data pages.
Figure 1: NOVA data structure layout

Inode Table
A 2 MB block containing an array of inodes, each aligned on a 128-byte boundary
If an inode table fills, NOVA builds a linked list of additional 2 MB sub-tables
Each inode contains a valid bit; NOVA reuses invalid inodes for new files and directories
Each inode's log is a linked list of 4 KB pages
Each inode holds pointers to the head and tail of its log (see the sketch below)
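
A minimal C sketch of these structures, with hypothetical field names (the real on-NVMM layout in NOVA differs):

#include <stdint.h>

#define LOG_PAGE_SIZE 4096u

struct nova_log_page {
    uint8_t  data[LOG_PAGE_SIZE - sizeof(uint64_t)]; /* packed log entries */
    uint64_t next;     /* NVMM offset of the next 4 KB log page; 0 if last */
};

struct nova_inode {    /* one slot in the 2 MB inode table */
    uint8_t  valid;    /* invalid inodes are reused for new files/dirs */
    uint64_t log_head; /* offset of the first log page */
    uint64_t log_tail; /* next free byte; updating it commits appended entries */
    /* mode, size, timestamps, link count, ... elided */
} __attribute__((aligned(128)));   /* 128-byte alignment, as on the slide */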

Journal
A 4 KB circular buffer (one per core)
Managed with an <enqueue, dequeue> pointer pair
Only one open transaction allowed per core at a time
The kernel VFS locks all affected inodes, protecting transactions from concurrent updates
(A sketch of the layout follows.)
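
A minimal sketch of such a journal, with hypothetical names and a simplified entry format:

#include <stdint.h>

#define JOURNAL_BYTES 4096u

struct journal_entry {
    uint64_t dest;   /* NVMM address being updated (e.g., an inode's log tail) */
    uint64_t value;  /* value to store there */
};

struct nova_journal {  /* one per CPU core */
    uint64_t enqueue;  /* slot where the next entry is appended */
    uint64_t dequeue;  /* oldest live entry; equals enqueue when idle */
    struct journal_entry slots[JOURNAL_BYTES / sizeof(struct journal_entry)];
};

/* Transaction outline: (1) append and persist all entries, (2) apply the
 * updates in place and persist them, (3) advance dequeue past the
 * entries to retire the transaction. */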

Space Management
NOVA divides NVMM into pools, one per CPU
If the current CPU's pool has no free pages, allocate from the largest pool
Each pool's free list is maintained in a red-black tree sorted by address
Allocator state is not stored in NVMM (it lives in DRAM)
Log space is allocated aggressively to avoid frequent resizes: when a log fills, NOVA allocates new pages to double the log's space
(A sketch of the pool fallback policy follows.)
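
A toy sketch of the per-CPU pool fallback, with hypothetical names; a sorted array stands in for the red-black tree NOVA uses:

#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 8

struct free_range { uint64_t start_page, num_pages; };

struct nvmm_pool {
    struct free_range *ranges; /* sorted by address; a red-black tree in NOVA */
    size_t   nr_ranges;
    uint64_t free_pages;
};

static struct nvmm_pool pools[NR_CPUS];

/* Allocate from the local pool; fall back to the pool with the most
 * free pages when the local one is empty. */
static struct nvmm_pool *pick_pool(int cpu)
{
    if (pools[cpu].free_pages > 0)
        return &pools[cpu];
    struct nvmm_pool *largest = &pools[0];
    for (int i = 1; i < NR_CPUS; i++)
        if (pools[i].free_pages > largest->free_pages)
            largest = &pools[i];
    return largest;
}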

Atomicity and write ordering
NOVA builds atomic updates from three mechanisms:
64-bit atomic updates
Logging
Lightweight journaling
It enforces write ordering in three cases:
Commit data and log entries to NVMM before updating the log tail
Commit journal entries to NVMM before propagating the journaled updates
Commit new versions of data pages to NVMM before recycling the old versions

new_tail = append_to_log(inode->tail, entry);
clwb(inode->tail, entry->length);  // write back the log entry's cache lines
sfence();                          // order the subsequent PCOMMIT
PCOMMIT();                         // commit the entry to NVMM
sfence();                          // order the subsequent store
inode->tail = new_tail;            // 64-bit atomic store publishes the entry

Pseudocode for enforcing write ordering. If NOVA runs on a system that supports the clflushopt, clwb, and PCOMMIT instructions, it uses the code above.
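
For reference, a compilable restatement of that sequence as a C function, assuming hypothetical wrappers (clwb_range, store_fence, pcommit) around the corresponding instructions; this is a sketch, not NOVA's actual code:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrappers; a real kernel maps these to inline assembly. */
extern void clwb_range(const void *addr, size_t len); /* CLWB each cache line */
extern void store_fence(void);                        /* SFENCE */
extern void pcommit(void);                            /* PCOMMIT (since-deprecated ISA) */

struct log_entry { uint32_t length; /* payload elided */ };

extern void *append_to_log(void *tail, const struct log_entry *entry);

void persist_append(void **tail_ptr, const struct log_entry *entry)
{
    void *new_tail = append_to_log(*tail_ptr, entry);
    clwb_range(*tail_ptr, entry->length); /* write back the entry's cache lines */
    store_fence();                        /* order PCOMMIT after the write-backs */
    pcommit();                            /* make the entry durable in NVMM */
    store_fence();                        /* order the tail store after PCOMMIT */
    *tail_ptr = new_tail;                 /* 64-bit store atomically commits */
}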

Directory Operations
Closely linked to application performance
A NOVA directory = an NVMM inode log + a DRAM radix tree
The log holds two entry types: dentry entries and inode update entries
Storing a timestamp in each dentry lets NOVA update the directory inode's mtime and ctime atomically with the operation

Directory Operations
Create:
Append the dentry to the directory's log
Update the log tail (committing the transaction)
Update the DRAM radix tree
Delete (Linux semantics):
Decrement the link count of the file's inode
Remove the file from the directory
Delete (NOVA):
Append a delete dentry entry to the directory's log
Append an inode update entry to the file's log
Use journaling to update both log tails atomically
Propagate the changes to the DRAM radix tree
(A sketch of the create path follows.)
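
A minimal sketch of the create path, with hypothetical names standing in for NOVA's internal functions:

#include <stdint.h>

struct inode;        /* directory inode backed by an NVMM log */
struct radix_tree;   /* DRAM index mapping names to dentries */

struct dentry_entry { const char *name; uint64_t ino; uint64_t timestamp; };

extern void *log_append_dentry(struct inode *dir, const struct dentry_entry *d);
extern void  log_commit_tail(struct inode *dir, void *new_tail); /* persist + atomic tail update */
extern void  radix_insert(struct radix_tree *t, const char *name, void *dentry);

void nova_create(struct inode *dir, struct radix_tree *index,
                 const struct dentry_entry *d)
{
    void *entry = log_append_dentry(dir, d); /* 1. append dentry to the log */
    log_commit_tail(dir, entry);             /* 2. tail update commits it */
    radix_insert(index, d->name, entry);     /* 3. update the DRAM radix tree */
}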

Atomic File Operations
The inode log records metadata changes
A radix tree in DRAM locates file data by file offset
Large writes are broken into multiple write entries, but all are appended and committed in a single log-tail update
Reads update the access time with a 64-bit atomic write, then copy the data to the user buffer

Atomic File Operations
Example: an 8 KB (2-page) write
Write a copy of the data to newly allocated pages
Append a file write entry to the inode's log
Update the log tail (committing the write)
Update the DRAM radix tree
Return the old version of the data to the allocator
(A sketch of this path follows.)
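
A compact sketch of this copy-on-write write path, again with hypothetical helper names:

#include <stdint.h>
#include <string.h>

struct file_state;   /* inode log in NVMM plus DRAM radix tree */

struct write_entry { uint64_t pgoff, num_pages, new_pages; };

extern uint64_t nvmm_alloc_pages(uint64_t n);
extern void    *nvmm_page(uint64_t page);
extern void     log_append_and_commit(struct file_state *f, struct write_entry *w);
extern uint64_t radix_replace(struct file_state *f, uint64_t pgoff,
                              uint64_t n, uint64_t new_pages);
extern void     nvmm_free_pages(uint64_t page, uint64_t n);

void nova_cow_write(struct file_state *f, uint64_t pgoff,
                    const void *buf, uint64_t n /* pages */)
{
    struct write_entry w = { pgoff, n, nvmm_alloc_pages(n) };
    memcpy(nvmm_page(w.new_pages), buf, n * 4096);          /* 1. copy to new pages */
    log_append_and_commit(f, &w);                           /* 2-3. append + commit tail */
    uint64_t old = radix_replace(f, pgoff, n, w.new_pages); /* 4. update DRAM index */
    nvmm_free_pages(old, n);                                /* 5. recycle old version */
}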

Atomic mmap
Allocate replica pages in NVMM
Copy the file data to the replica pages, and map the replicas into the application's address space
File data is updated copy-on-write, so stale pages are reclaimed immediately
Higher overhead than mapping file pages directly, but a stronger consistency guarantee
Open question: atomic mmap in DRAM?
(A sketch of the replica mapping follows.)
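
An illustrative sketch of the replica-page mapping (names hypothetical; NOVA does this inside the kernel's mmap path):

#include <stdint.h>
#include <string.h>

struct file_state;   /* file inode with data pages in NVMM */

extern void *nvmm_alloc_replica(uint64_t bytes);
extern void *file_data(struct file_state *f, uint64_t off);
extern void  map_into_user_space(void *pages, uint64_t bytes, uint64_t uaddr);

/* atomic-mmap: the process only ever sees replica pages, never the
 * file's own pages, so the file itself is updated atomically
 * (copy-on-write) when the replicas are written back. */
void nova_atomic_mmap(struct file_state *f, uint64_t off,
                      uint64_t bytes, uint64_t uaddr)
{
    void *replica = nvmm_alloc_replica(bytes);   /* replicas live in NVMM */
    memcpy(replica, file_data(f, off), bytes);   /* copy file data to replicas */
    map_into_user_space(replica, bytes, uaddr);  /* user stores hit the replica */
}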

Garbage Collection
NOVA collects stale log pages during write operations
A log entry is dead only if it is not the last entry in the log, and:
A file write entry is dead if it no longer refers to valid data pages
An inode update entry is dead if a later entry modifies the same piece of metadata
A dentry update is dead if it is marked invalid
(These rules are restated as a predicate in the sketch below.)
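
The rules above amount to a simple predicate; a sketch with hypothetical entry flags:

#include <stdbool.h>

enum entry_type { FILE_WRITE, INODE_UPDATE, DENTRY_UPDATE };

struct log_entry_hdr {
    enum entry_type type;
    bool is_last;     /* last entry in the inode's log */
    bool pages_valid; /* FILE_WRITE: still refers to live data pages */
    bool superseded;  /* INODE_UPDATE: a later entry changed the same field */
    bool invalid;     /* DENTRY_UPDATE: marked invalid (e.g., by unlink) */
};

static bool entry_is_dead(const struct log_entry_hdr *e)
{
    if (e->is_last)   /* the last entry is always treated as live */
        return false;
    switch (e->type) {
    case FILE_WRITE:    return !e->pages_valid;
    case INODE_UPDATE:  return e->superseded;
    case DENTRY_UPDATE: return e->invalid;
    }
    return false;
}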

Fast GC vs. Thorough GC
Fast GC:
Requires no copying
Reclaims log pages that contain no live entries by unlinking them from the log's linked list
Thorough GC:
Applied after fast GC if live entries account for < 50% of the log space
Copies the live entries into a new, compacted version of the log
Updates the DRAM data structure pointers
Replaces the old log with the new one and reclaims the old log's pages
NOVA log cleaning: (a) Fast GC, (b) Thorough GC

Shutdown and Recovery
“Lazy rebuild”: postpone rebuilding an inode's DRAM structures (radix tree and cached inode) until the file is first accessed
Clean unmount:
Store the NVMM page allocator state in the recovery inode's log
No inode logs need to be scanned, so recovery is fast (a 50 GB file system in 1.2 ms)

Shutdown and Recovery
Unclean unmount (e.g., after a power failure):
Rebuild the NVMM allocator state by scanning the inode logs
Check each journal and roll back any uncommitted transactions
Run a recovery thread on each CPU to scan the inode tables in parallel
Build a bitmap of occupied pages and use the result to rebuild the allocator
(A sketch of the parallel scan follows.)
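
A rough user-space sketch of the parallel scan, using POSIX threads in place of NOVA's per-CPU recovery threads (all names hypothetical):

#include <pthread.h>
#include <stdint.h>

#define NR_CPUS 8

/* Hypothetical helpers: scan one inode table, marking every NVMM page
 * its logs reference; then rebuild the free lists from the bitmap. */
extern void scan_inode_table(int cpu, uint8_t *occupied_bitmap);
extern void rebuild_allocator_from_bitmap(const uint8_t *occupied_bitmap);

static uint8_t occupied[1 << 20];  /* one bit per page; sized for the example */

static void *recovery_worker(void *arg)
{
    int cpu = (int)(intptr_t)arg;
    /* Real code must set bitmap bits atomically, since workers share it. */
    scan_inode_table(cpu, occupied);
    return NULL;
}

void nova_recover_unclean(void)
{
    pthread_t tid[NR_CPUS];
    for (int i = 0; i < NR_CPUS; i++)
        pthread_create(&tid[i], NULL, recovery_worker, (void *)(intptr_t)i);
    for (int i = 0; i < NR_CPUS; i++)
        pthread_join(tid[i], NULL);            /* wait for all table scans */
    rebuild_allocator_from_bitmap(occupied);   /* free = not occupied */
}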

Experimental setup

NVMM    | Read latency | Write bandwidth | clwb latency | PCOMMIT latency
STT-RAM | 100 ns       | Full DRAM       | 40 ns        | 200 ns
PCM     | 300 ns       | 1/8 DRAM        | 40 ns        | 500 ns

Workload   | Avg file size | I/O size (r/w) | Threads | R/W ratio | # of files (sm/lg)
Fileserver | 128 KB        | 16 KB / 16 KB  | 50      | 1:2       | 100K / 400K
Webproxy   | 32 KB         | 1 MB / 16 KB   | 50      | 5:1       | 100K / 1M
Webserver  | 64 KB         | 1 MB / 8 KB    | 50      | 10:1      | 100K / 500K
Varmail    |               |                | 50      | 1:1       |

Filesystem Latencies
NOVA provides the lowest latency for each operation
NOVA outperforms the other filesystems by between 35% and 17x
NOVA itself accounts for only 21-28% of the total latency of create and append operations
NOVA's delete latency increases by 76%

Filebench Throughput Figure 7: Filebench throughput with different filesystem patterns and dataset sizes on STT-RAM and PCM.

Full Filesystem Performance

Throughput (ops/s) | 10 s    | 30 s    | 120 s   | 600 s   | 3600 s
NILFS2             | FAIL    | FAIL    | FAIL    | FAIL    | FAIL
F2FS               | 37,979  | 23,193  | 18,240  |         |
NOVA               | 222,337 | 222,229 | 220,158 | 209,454 | 205,347

GC pages           | 10 s    | 30 s    | 120 s   | 600 s   | 3600 s
Fast               |         | 255     | 17,385  | 159,406 | 1,170,611
Thorough           | 102     | 2,120   | 9,633   | 27,292  | 72,727

30 GB fileserver workload at 95% NVMM utilization, run for different durations.

Recovery

Dataset     | File size | # of files | Dataset size | I/O size
Videoserver | 128 MB    | 400        | 50 GB        | 1 MB
Fileserver  | 1 MB      | 50,000     | 50 GB        | 64 KB
Mailserver  | 128 KB    | 400,000    | 50 GB        | 16 KB

Recovery time     | Videoserver | Fileserver | Mailserver
STT-RAM (normal)  | 156 µs      | 313 µs     | 918 µs
PCM (normal)      | 311 µs      | 660 µs     | 1197 µs
STT-RAM (failure) | 37 ms       | 39 ms      | 72 ms
PCM (failure)     | 43 ms       | 50 ms      | 116 ms

Conclusion
NOVA extends log-structured file system techniques to leverage NVMM
Provides fast garbage collection and fast recovery from system failures
Provides stronger consistency and atomicity guarantees

References
J. Xu and S. Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. FAST '16.
https://www.usenix.org/sites/default/files/conference/protected-files/fast16_slides_xu.pdf. Online; accessed 2017-04-12.
https://github.com/NVSL/NOVA. Online; accessed 2017-04-12.