1
Scaling a file system to many cores using an operation log
Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich MIT CSAIL
2
Motivation: Current file systems don’t scale well
Linux ext4 (4.9.21)
Benchmark: dbench
Experimental setup: 80 cores, 256 GB RAM
Backing store: "RAM" disk
3
Linux ext4 scales poorly on multicore machines
4
Concurrent file creation in Linux ext4
[Diagram: cores 1 and 2 issue creat(dirA/file1) and creat(dirA/file2); in memory, ext4 holds dirA's block; the journal writes to disk]
5
Block contention limits scalability of file creation
[Diagram: both creat calls insert their entries (file1: 100, file2: 200) into the same dirA block, contending on the directory block]
Contention on shared blocks limits scalability on 80 cores.
Even applications that are not limited by disk I/O fail to scale.
6
Goal: Multicore scalability
Problem: Contention limits scalability, and contention involves cache-line conflicts.
Goal: Multicore scalability = no cache-line conflicts. Even a single contended cache line can wreck scalability.
Commutative operations can be implemented without cache-line conflicts [Scalable Commutativity Rule, Clements SOSP '13].
How do we scale all commutative operations in file systems?
7
ScaleFS approach: Two separate file systems
[Diagram: MemFS in memory, with directories as hash tables mapping link name to inode number, designed for multicore scalability; DiskFS on disk, with a journal and block cache, designed for durability; fsync carries MemFS changes to DiskFS]
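To make the hash-table directory concrete, here is a minimal C++ sketch (hypothetical names, not ScaleFS's actual code). It uses per-bucket locks instead of MemFS's lock-free reads, but it preserves the property that matters here: operations on different names land in different, cache-line-aligned buckets, so they do not conflict.

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <utility>

// A directory as a chaining hash table from link name to inode number.
class DirHashTable {
  static constexpr size_t kBuckets = 1024;

  struct alignas(64) Bucket {   // cache-line aligned: no false sharing
    std::mutex lock;
    std::list<std::pair<std::string, uint64_t>> entries;
  };
  std::array<Bucket, kBuckets> buckets_;

  Bucket& bucket_for(const std::string& name) {
    return buckets_[std::hash<std::string>{}(name) % kBuckets];
  }

 public:
  // Insert a link; distinct names almost always hash to distinct
  // buckets, so concurrent creates proceed without cache-line conflicts.
  bool create(const std::string& name, uint64_t inum) {
    Bucket& b = bucket_for(name);
    std::lock_guard<std::mutex> g(b.lock);
    for (auto& e : b.entries)
      if (e.first == name) return false;   // name already exists
    b.entries.emplace_back(name, inum);
    return true;
  }

  // Look up a link name; returns false if absent.
  bool lookup(const std::string& name, uint64_t* inum) {
    Bucket& b = bucket_for(name);
    std::lock_guard<std::mutex> g(b.lock);
    for (auto& e : b.entries)
      if (e.first == name) { *inum = e.second; return true; }
    return false;
  }
};
```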
8
Concurrent file creation scales in ScaleFS
[Diagram: cores 1 and 2 issue creat(dirA/file1) and creat(dirA/file2); MemFS holds dirA as a hash table of link name to inode number; DiskFS, its journal, and the block cache are untouched]
9
Concurrent file creation scales in ScaleFS
[Diagram: the two creat calls insert file1 → 100 and file2 → 200 into different buckets of dirA's hash table]
No contention, no cache-line conflicts: scalability!
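A toy driver for the DirHashTable sketch above (again hypothetical), mirroring this slide: two threads play the role of the two cores and create different files in the same directory concurrently.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>

int main() {
  DirHashTable dirA;   // from the sketch above

  // Two "cores" create different files in dirA at the same time;
  // the creates hit different buckets, so neither blocks the other.
  std::thread core1([&] { dirA.create("file1", 100); });
  std::thread core2([&] { dirA.create("file2", 200); });
  core1.join();
  core2.join();

  uint64_t inum;
  assert(dirA.lookup("file1", &inum) && inum == 100);
  assert(dirA.lookup("file2", &inum) && inum == 200);
  return 0;
}
```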
10
Challenge: How to implement fsync?
[Diagram: dirA's in-memory hash table now holds file1 → 100 and file2 → 200, but nothing has reached DiskFS, its journal, or the block cache]
11
Challenge: How to implement fsync?
DiskFS updates must be consistent with MemFS.
fsync must preserve conflict-freedom for commutative ops.
[Diagram: fsync propagates dirA's in-memory entries (file1: 100, file2: 200) to dirA's on-disk representation through DiskFS's journal]
12
Contributions
ScaleFS, a file system that achieves excellent multicore scalability.
Two separate file systems: MemFS and DiskFS.
Design for fsync: per-core operation logs to scalably defer updates to DiskFS, with operations ordered using Time Stamp Counters.
Evaluation: benchmarks on ScaleFS scale 35x-60x on 80 cores; a workload- and machine-independent analysis of cache-line conflicts suggests ScaleFS is a good fit for workloads not limited by disk I/O.
13
ScaleFS design: Two separate file systems
[Diagram: MemFS in memory, designed for multicore scalability, using hash tables, radix trees, and seqlocks for lock-free reads; DiskFS on disk, designed for durability, using blocks, transactions, and journaling; fsync drains the per-core operation logs from MemFS to DiskFS]
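Here is one way the per-core operation logs might be laid out, as a hedged sketch: the entry format and all names are illustrative assumptions, not ScaleFS's actual structures (ScaleFS builds on OpLog and keeps finer-grained logs). The key property is that each core appends only to its own cache-line-aligned log, so logging never causes cross-core conflicts.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

enum class OpType { Create, Unlink, Rename };

// One logged directory operation. The timestamp field is filled from
// the core's synchronized TSC (see the RDTSCP sketch further below).
struct LogEntry {
  OpType op;
  uint64_t timestamp;
  uint64_t dir;        // inode number of the directory
  std::string name;    // link name the op applies to
  uint64_t inum;       // inode number of the file involved
};

// One log per core, padded to a cache line so appends on different
// cores never write to the same line.
struct alignas(64) PerCoreLog {
  std::vector<LogEntry> entries;
  void append(LogEntry e) { entries.push_back(std::move(e)); }
};

constexpr int kMaxCores = 80;
PerCoreLog oplogs[kMaxCores];

// Called by MemFS after it updates its in-memory state; the on-disk
// update is deferred until an fsync drains the logs into DiskFS.
void log_create(int core, uint64_t ts, uint64_t dir, std::string name,
                uint64_t inum) {
  oplogs[core].append({OpType::Create, ts, dir, std::move(name), inum});
}
```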
14
Design challenges
How to order operations in the per-core operation logs?
How to operate MemFS and DiskFS independently: how to allocate inodes in a scalable manner in MemFS? ...
15
Problem: Preserve ordering of non-commutative ops
[Diagram: core 1 issues unlink(file1); dirA's hash table in MemFS maps file1 → 100; per-core operation logs sit between MemFS and DiskFS]
16
Problem: Preserve ordering of non-commutative ops
[Diagram: the unlink removes file1 from dirA's hash table and appends op1: UNLINK to core 1's operation log]
17
Problem: Preserve ordering of non-commutative ops
[Diagram: core 2 now issues creat(file1) in the same directory]
18
Problem: Preserve ordering of non-commutative ops
[Diagram: the creat inserts file1 → 200 into dirA's hash table and appends op2: CREATE to core 2's operation log]
19
Problem: Preserve ordering of non-commutative ops
[Diagram: core 3 calls fsync, which must apply op1: UNLINK before op2: CREATE, but the two operations sit in different cores' logs. How does fsync recover the order?]
20
Solution: Use synchronized Time Stamp Counters
[RDTSCP does not incur cache-line conflicts]
[Diagram: the same setup as before: dirA's hash table maps file1 → 100 and the per-core operation logs are empty]
21
Solution: Use synchronized Time Stamp Counters
[RDTSCP does not incur cache-line conflicts]
[Diagram: the unlink is logged as op1: UNLINK with timestamp ts1 and the creat as op2: CREATE with timestamp ts2; since ts1 < ts2, fsync on core 3 knows to apply the unlink first]
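Continuing the per-core log sketch from the design slide, here is a hedged sketch of how the timestamps might be taken and used: RDTSCP reads only the local core's timestamp counter, touching no shared memory, and fsync can then recover the real-time order of non-commutative operations by sorting the merged logs on timestamp. This assumes synchronized TSCs across cores, as on the paper's test machine; the function names are hypothetical.

```cpp
#include <x86intrin.h>   // __rdtscp (x86-64, GCC/Clang)
#include <algorithm>
#include <cstdint>
#include <vector>

// Read the core-local timestamp counter. No loads or stores to shared
// memory, hence no cache-line conflicts, however many cores call it.
uint64_t read_timestamp() {
  unsigned aux;              // receives IA32_TSC_AUX (encodes the core id)
  return __rdtscp(&aux);     // waits for prior instructions, then reads TSC
}

// At fsync time: gather one directory's entries from every core's log
// (oplogs/LogEntry from the earlier sketch) and sort by timestamp.
// With synchronized TSCs this recovers ts1 < ts2, i.e. the UNLINK is
// replayed before the CREATE even though they sit in different logs.
std::vector<LogEntry> merge_for_fsync(uint64_t dir) {
  std::vector<LogEntry> merged;
  for (int c = 0; c < kMaxCores; c++)
    for (const LogEntry& e : oplogs[c].entries)
      if (e.dir == dir) merged.push_back(e);
  std::sort(merged.begin(), merged.end(),
            [](const LogEntry& a, const LogEntry& b) {
              return a.timestamp < b.timestamp;
            });
  return merged;   // replay in this order against DiskFS
}
```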
22
Problem: How to allocate inodes scalably in MemFS?
[Diagram: core 1 issues creat(dirA/file1); MemFS needs an inode number for file1 (file1 → ???), but the inode allocator belongs to DiskFS]
23
Solution (1): Separate mnodes in MemFS from inodes in DiskFS
[Diagram: dirA's hash table now maps link names to mnode numbers; a per-core mnode allocator in MemFS replaces DiskFS's inode allocator on the creat path; file1's mnode is still unassigned]
24
Solution (1): Separate mnodes in MemFS from inodes in DiskFS
[Diagram: the per-core mnode allocator assigns mnode 100 to file1 without involving DiskFS]
25
Solution (2): Defer allocating inodes in DiskFS until an fsync
[Diagram: an mnode-inode table maps mnode numbers to inode numbers (e.g., 100 → 456); DiskFS's inode allocator is consulted only when an fsync writes the file to disk]
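A minimal sketch of the two solutions together, reusing kMaxCores from the earlier log sketch; all names are hypothetical and DiskFS's allocator is stubbed out. Each core hands out mnode numbers from its own range, so creat never touches shared allocator state; the mnode-inode table is consulted, and an on-disk inode allocated, only when an fsync actually forces the file to DiskFS.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Per-core mnode counter, padded so cores never share its cache line.
struct alignas(64) MnodeAllocator {
  uint64_t next = 0;
};
MnodeAllocator mnode_alloc[kMaxCores];

// Core id in the high bits, per-core counter in the low bits: globally
// unique mnode numbers with zero cross-core communication.
uint64_t alloc_mnode(int core) {
  return (uint64_t(core) << 48) | mnode_alloc[core].next++;
}

// Stub standing in for DiskFS's real on-disk inode allocator (not shown).
uint64_t alloc_disk_inode() {
  static std::atomic<uint64_t> next_inode{456};
  return next_inode.fetch_add(1);
}

// The mnode-inode table, filled lazily: a file that is created and
// deleted without ever being fsynced never consumes an on-disk inode.
std::mutex table_lock;
std::unordered_map<uint64_t, uint64_t> mnode_to_inode;

uint64_t inode_for(uint64_t mnode) {
  std::lock_guard<std::mutex> g(table_lock);
  auto it = mnode_to_inode.find(mnode);
  if (it != mnode_to_inode.end()) return it->second;
  uint64_t inum = alloc_disk_inode();    // deferred until first fsync
  mnode_to_inode.emplace(mnode, inum);
  return inum;
}
```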
26
Other design challenges
How to scale concurrent fsyncs?
How to order lock-free reads?
How to resolve dependencies affecting multiple inodes?
How to ensure internal consistency despite crashes?
27
Implementation
ScaleFS is implemented on the sv6 research operating system.
Supported file-system system calls: creat, open, openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread, write, pwrite, chdir, readdir, pipe, pipe2, stat, fstat, link, unlink, rename, fsync, sync, close.

ScaleFS component               Lines of C++ code
MemFS (based on FS from sv6)    2,458
DiskFS (based on FS from xv6)   2,331
Operation Logs                  4,094
28
Evaluation
Does ScaleFS achieve good scalability?
Measure scalability on 80 cores.
Observe conflict-freedom for commutative operations.
Does ScaleFS achieve good disk throughput?
What memory overheads are introduced by ScaleFS's split of MemFS and DiskFS?
29
Evaluation methodology
Machine configuration: 80 cores (Intel Xeon), 256 GB RAM; backing store: "RAM" disk.
Benchmarks:
mailbench: mail server workload
dbench: file server workload
largefile: creates a file, writes 100 MB, fsyncs, and deletes it
smallfile: creates, writes, fsyncs, and deletes many 1 KB files
30
ScaleFS scales 35x-60x on a RAM disk
[Single-core performance of ScaleFS is on par with Linux ext4.]
31
Machine-independent methodology
Use Commuter [Clements SOSP '13] to observe conflict-freedom for commutative ops.
Commuter generates test cases for pairs of commutative operations and reports the cache-line conflicts it observes.
32
Conflict-freedom for commutative ops on Linux ext4: 65%
33
Conflict-freedom for commutative ops on ScaleFS: 99.2%
34
Conflict-freedom for commutative ops on ScaleFS: 99.2%
Why not 100% conflict-free?
In a few cases ScaleFS deliberately trades scalability for performance.
Other conflicts occur only probabilistically.
35
Evaluation summary
ScaleFS scales well on an 80-core machine.
Commuter reports 99.2% conflict-freedom on ScaleFS; this analysis is workload- and machine-independent, suggesting scalability beyond our experimental setup and benchmarks.
36
Related Work
Scalability studies: FxMark [USENIX ATC '16], Linux Scalability [OSDI '10]
Scaling file systems using sharding: Hare [EuroSys '15], SpanFS [USENIX ATC '15]
ScaleFS uses similar techniques:
Operation logging: OpLog [CSAIL TR '14]
Per-inode / per-core logs: NOVA [FAST '16], iJournaling [USENIX ATC '17], Strata [SOSP '17]
Decoupling in-memory and on-disk representations: Linux dcache, ReconFS [FAST '14]
ScaleFS focus: achieve scalability by avoiding cache-line conflicts
37
https://github.com/mit-pdos/scalefs
Conclusion
ScaleFS: a novel file system design for multicore scalability.
Two separate file systems: MemFS and DiskFS.
Per-core operation logs, ordered using Time Stamp Counters.
ScaleFS scales 35x-60x on an 80-core machine.
ScaleFS is conflict-free for 99.2% of test cases in Commuter.