Scaling a file system to many cores using an operation log

Presentation transcript:

Scaling a file system to many cores using an operation log
Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich (MIT CSAIL)

Motivation: Current file systems don't scale well. Baseline: Linux ext4 (kernel 4.9.21) running the dbench benchmark [https://dbench.samba.org]. Experimental setup: 80 cores, 256 GB RAM, with a "RAM" disk as the backing store.

Linux ext4 scales poorly on multicore machines

Concurrent file creation in Linux ext4: core 1 runs creat(dirA/file1) while core 2 runs creat(dirA/file2). Both creates go through ext4's in-memory journal and update dirA's directory block before it reaches disk.

Block contention limits the scalability of file creation: both creates must insert their entries (file1 : 100 and file2 : 200) into the same in-memory copy of dirA's directory block, so the two cores contend on that block. Contention on shared blocks limits scalability on 80 cores; even applications that are not limited by disk I/O fail to scale.

Goal: multicore scalability. Problem: contention limits scalability, and contention shows up as cache-line conflicts. So the goal becomes: multicore scalability = no cache-line conflicts. Even a single contended cache line can wreck scalability. Commutative operations can be implemented without cache-line conflicts [Scalable Commutativity Rule, Clements SOSP '13]. How do we scale all commutative operations in file systems?
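To make the rule concrete, here is a generic example (not ScaleFS code) of turning a pair of commutative operations, two counter increments, into a conflict-free implementation: instead of one shared atomic counter whose cache line bounces between cores, each core updates its own cache-line-aligned counter and reads sum them. The 80-core constant and the names are assumptions made only for this sketch.

```cpp
#include <atomic>
#include <cstdint>

// Two increments of a statistics counter commute (the total is the same in
// either order), so the rule says they *can* be made conflict-free.
// Per-core counters, each padded to its own cache line, make the write path
// conflict-free; only the (rarer) read has to touch every core's line.
constexpr int kCores = 80;                         // assumption for the sketch

struct alignas(64) PerCoreCounter { std::atomic<std::uint64_t> v{0}; };
PerCoreCounter counters[kCores];

void inc(int core) {
    counters[core].v.fetch_add(1, std::memory_order_relaxed);
}

std::uint64_t total() {
    std::uint64_t sum = 0;
    for (auto& c : counters) sum += c.v.load(std::memory_order_relaxed);
    return sum;
}
```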

ScaleFS approach: two separate file systems. MemFS lives in memory and is designed for multicore scalability: directories are hash tables mapping link names to inode numbers. DiskFS lives on disk and is designed for durability: it owns the journal and the block cache. fsync is the bridge that carries changes from MemFS to DiskFS.
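As a rough illustration of the MemFS side of this split, the sketch below shows an in-memory directory as a hash table from link names to in-memory file numbers. The class name MemDir and the mutex-protected std::unordered_map are assumptions made for brevity; the actual MemFS uses concurrent data structures with lock-free reads, as described later.

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

using mnode_t = std::uint64_t;   // in-memory file identifier (hypothetical type)

// Illustrative in-memory directory: maps link names to file numbers.
// A single mutex keeps the sketch short; the real MemFS uses a concurrent
// hash table so that creates of different names in the same directory
// touch no shared cache line.
class MemDir {
public:
    bool create(const std::string& name, mnode_t mnode) {
        std::lock_guard<std::mutex> g(mu_);
        return entries_.emplace(name, mnode).second;   // fails if name exists
    }
    std::optional<mnode_t> lookup(const std::string& name) const {
        std::lock_guard<std::mutex> g(mu_);
        auto it = entries_.find(name);
        if (it == entries_.end()) return std::nullopt;
        return it->second;
    }
private:
    mutable std::mutex mu_;
    std::unordered_map<std::string, mnode_t> entries_;
};
```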

Concurrent file creation scales in ScaleFS: core 1 runs creat(dirA/file1) and core 2 runs creat(dirA/file2). Each create inserts its entry (file1 : 100, file2 : 200) into dirA's in-memory hash table in MemFS; nothing touches DiskFS's journal or block cache. No contention, no cache-line conflicts, scalability!

Challenge: how to implement fsync? After the creates, MemFS's dirA hash table holds file1 : 100 and file2 : 200, but DiskFS has not seen them yet. fsync must make DiskFS's on-disk state (journal, block cache, dirA's block) consistent with MemFS, and it must preserve conflict-freedom for commutative operations while doing so.

Contributions: ScaleFS, a file system that achieves excellent multicore scalability. Two separate file systems, MemFS and DiskFS. A design for fsync that uses per-core operation logs to scalably defer updates to DiskFS, and that orders operations using time stamp counters. Evaluation: benchmarks on ScaleFS scale 35x-60x on 80 cores, and a workload- and machine-independent analysis of cache-line conflicts suggests ScaleFS is a good fit for workloads not limited by disk I/O.

ScaleFS design: two separate file systems. MemFS, designed for multicore scalability, uses hash tables, radix trees, and seqlocks for lock-free reads. DiskFS, designed for durability, uses blocks, transactions, and journaling. The two are connected by per-core operation logs, which fsync drains into DiskFS.
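The sketch below shows one plausible shape for a per-core operation log; the types OpLogEntry and PerCoreLog, the fixed core count, and the log_op helper are illustrative assumptions, not the actual ScaleFS interface.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

enum class OpType { Create, Unlink, Rename /* ... */ };

// One logged MemFS operation, to be replayed against DiskFS at fsync time.
struct OpLogEntry {
    OpType        type;
    std::uint64_t timestamp;   // synchronized TSC value, used for ordering
    std::uint64_t mnode;       // in-memory file the operation applies to
    std::string   name;        // directory entry name, if any
};

// Per-core logs: each core appends only to its own log, so logging an
// operation writes no cache line shared with other cores.
struct PerCoreLog {
    std::mutex              mu;   // taken by fsync, not by other cores' logging
    std::vector<OpLogEntry> ops;
};

constexpr int kMaxCores = 80;          // assumption for the sketch
PerCoreLog g_logs[kMaxCores];

void log_op(int core, OpLogEntry e) {
    std::lock_guard<std::mutex> g(g_logs[core].mu);
    g_logs[core].ops.push_back(std::move(e));
}
```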

Design challenges: How to order operations in the per-core operation logs? How to operate MemFS and DiskFS independently, for example, how to allocate inodes in a scalable manner in MemFS? . . .

Problem: preserve the ordering of non-commutative operations. Initially, dirA in MemFS maps file1 to mnode 100. Core 1 runs unlink(file1), removing the entry and appending op1: UNLINK to core 1's operation log. Core 2 then runs creat(file1), inserting a new file1 (mnode 200) into dirA and appending op2: CREATE to core 2's operation log. When core 3 later calls fsync, it must apply the unlink before the create, but the two operations sit in different per-core logs. Order: how??

Solution: use synchronized Time Stamp Counters. Each logged operation is tagged with the core's time stamp counter, read with RDTSCP, which does not incur cache-line conflicts. In the example, op1: UNLINK is tagged with ts1 and op2: CREATE with ts2; since ts1 < ts2, fsync applies the unlink before the create.
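Putting the two ideas together, a simplified sketch (reusing the hypothetical OpLogEntry/PerCoreLog/log_op definitions from the earlier sketch) might timestamp each operation with RDTSCP as it is logged, and have fsync merge the per-core logs by timestamp before replaying them against DiskFS. It assumes x86 and the GCC/Clang __rdtscp intrinsic, replays every logged operation rather than only those a given fsync actually needs, and stubs out the DiskFS update.

```cpp
#include <x86intrin.h>   // __rdtscp (GCC/Clang, x86)
#include <algorithm>

// Tag an operation with the core's time stamp counter at log time.
// RDTSCP reads a per-core register, so it causes no cache-line conflicts.
void log_create(int core, std::uint64_t mnode, std::string name) {
    unsigned int aux;                        // receives the core/node id
    std::uint64_t ts = __rdtscp(&aux);
    log_op(core, OpLogEntry{OpType::Create, ts, mnode, std::move(name)});
}

// Stub standing in for the real DiskFS update path (journaled transaction).
void apply_to_diskfs(const OpLogEntry& op) { (void)op; }

// At fsync: collect every per-core log, sort the union by timestamp,
// and replay the operations against DiskFS in that global order.
void fsync_flush_logs() {
    std::vector<OpLogEntry> merged;
    for (auto& log : g_logs) {
        std::lock_guard<std::mutex> g(log.mu);
        merged.insert(merged.end(), log.ops.begin(), log.ops.end());
        log.ops.clear();
    }
    std::sort(merged.begin(), merged.end(),
              [](const OpLogEntry& a, const OpLogEntry& b) {
                  return a.timestamp < b.timestamp;
              });
    for (const auto& op : merged)
        apply_to_diskfs(op);
}
```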

Problem: how to allocate inodes scalably in MemFS? When core 1 runs creat(dirA/file1), MemFS must pick a number for the new entry in dirA's hash table, but the inode allocator belongs to DiskFS, and going through a shared on-disk allocator on every create would reintroduce contention.

Solution (1): separate mnodes in MemFS from inodes in DiskFS. creat takes a number from a per-core mnode allocator in MemFS, so dirA's hash table maps file1 to mnode 100 without ever touching DiskFS's inode allocator.
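A minimal sketch of what a per-core mnode allocator might look like: numbers are striped by core so that two cores never hand out the same mnode number and never share a counter cache line. The striping scheme, the 80-core constant, and the names are assumptions for illustration, not ScaleFS's actual allocator.

```cpp
#include <atomic>
#include <cstdint>

// Per-core mnode allocator: core c hands out c, c + NCORES, c + 2*NCORES, ...
// Allocations on different cores never collide and never touch a shared
// counter cache line.
constexpr std::uint64_t NCORES = 80;          // assumption for the sketch

struct alignas(64) PerCoreMnodeAllocator {    // 64-byte aligned: own cache line
    std::atomic<std::uint64_t> next{0};
};

PerCoreMnodeAllocator g_mnode_alloc[NCORES];

std::uint64_t alloc_mnode(std::uint64_t core) {
    std::uint64_t n = g_mnode_alloc[core].next.fetch_add(
        1, std::memory_order_relaxed);
    return core + n * NCORES;
}
```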

Solution (2): defer allocating inodes in DiskFS until an fsync. MemFS keeps working purely with mnode numbers (file1 : mnode 100); when an fsync eventually needs on-disk state, DiskFS allocates an inode and records the mapping in an mnode-to-inode table (e.g., mnode 100 maps to inode 456).
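The deferred allocation can be pictured as a lazily populated mnode-to-inode table that fsync consults; the sketch below assumes a mutex-protected map and a stand-in alloc_disk_inode stub in place of DiskFS's real inode allocator.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Stand-in for DiskFS's real inode allocator (illustrative stub).
std::uint64_t alloc_disk_inode() {
    static std::uint64_t next_inode = 456;
    return next_inode++;
}

// mnode -> inode table: populated lazily, the first time an fsync needs an
// on-disk inode for a file that so far exists only in MemFS.
class MnodeInodeTable {
public:
    std::uint64_t inode_for(std::uint64_t mnode) {
        std::lock_guard<std::mutex> g(mu_);
        auto it = map_.find(mnode);
        if (it != map_.end()) return it->second;
        std::uint64_t ino = alloc_disk_inode();   // deferred until this fsync
        map_.emplace(mnode, ino);
        return ino;
    }
private:
    std::mutex mu_;
    std::unordered_map<std::uint64_t, std::uint64_t> map_;
};
```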

Other design challenges: How to scale concurrent fsyncs? How to order lock-free reads? How to resolve dependencies affecting multiple inodes? How to ensure internal consistency despite crashes?

Implementation: ScaleFS is implemented on the sv6 research operating system. Supported file-system-related system calls: creat, open, openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread, write, pwrite, chdir, readdir, pipe, pipe2, stat, fstat, link, unlink, rename, fsync, sync, close.
ScaleFS component | Lines of C++ code
MemFS (based on the FS from sv6) | 2,458
DiskFS (based on the FS from xv6) | 2,331
Operation Logs | 4,094

Evaluation: Does ScaleFS achieve good scalability? (We measure scalability on 80 cores and observe conflict-freedom for commutative operations.) Does ScaleFS achieve good disk throughput? What memory overheads are introduced by ScaleFS's split of MemFS and DiskFS?

Evaluation methodology. Machine configuration: 80 cores (Intel E7-8870, 2.4 GHz), 256 GB RAM, with a "RAM" disk as the backing store. Benchmarks: mailbench (mail server workload), dbench (file server workload), largefile (creates a file, writes 100 MB, fsyncs, and deletes it), and smallfile (creates, writes, fsyncs, and deletes lots of 1 KB files).

ScaleFS scales 35x-60x on a RAM disk. [Single-core performance of ScaleFS is on par with Linux ext4.]

Machine-independent methodology: use Commuter [Clements SOSP '13] to observe conflict-freedom for commutative operations. Commuter generates test cases for pairs of commutative operations and reports any observed cache-line conflicts.

Conflict-freedom for commutative ops on Linux ext4: 65%

Conflict-freedom for commutative ops on ScaleFS: 99.2%. Why not 100% conflict-free? In some cases ScaleFS deliberately trades scalability for performance, and the remaining conflicts are probabilistic.

Evaluation summary: ScaleFS scales well on an 80-core machine. Commuter reports 99.2% conflict-freedom on ScaleFS; because this analysis is workload- and machine-independent, it suggests scalability beyond our experimental setup and benchmarks.

Related work. Scalability studies: FxMark [USENIX '16], Linux Scalability [OSDI '10]. Scaling file systems using sharding: Hare [EuroSys '15], SpanFS [USENIX '15]. ScaleFS uses similar techniques to: operation logging (OpLog [CSAIL TR '14]); per-inode / per-core logs (NOVA [FAST '16], iJournaling [USENIX '17], Strata [SOSP '17]); decoupling in-memory and on-disk representations (Linux dcache, ReconFS [FAST '14]). ScaleFS's focus: achieve scalability by avoiding cache-line conflicts.

Conclusion: ScaleFS is a novel file system design for multicore scalability, built from two separate file systems (MemFS and DiskFS), per-core operation logs, and ordering using Time Stamp Counters. ScaleFS scales 35x-60x on an 80-core machine and is conflict-free for 99.2% of the test cases in Commuter. Code: https://github.com/mit-pdos/scalefs