1
Scaling a file system to many cores using an operation log
Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich MIT CSAIL
2
Motivation: Current file systems don’t scale well
Linux ext4 (4.9.21)
Benchmark: dbench
Experimental setup: 80 cores, 256 GB RAM
Backing store: "RAM" disk
3
Linux ext4 scales poorly on multicore machines
4
Concurrent file creation in Linux ext4
[Diagram: cores 1 and 2 issue creat(dirA/file1) and creat(dirA/file2); in memory, ext4 holds dirA's block; the journal writes to disk]
5
Block contention limits scalability of file creation
[Diagram: both creat calls insert their entries (file1: 100, file2: 200) into the same dirA block, contending on the directory block]
Contention on shared blocks limits scalability on 80 cores.
Even applications that are not limited by disk I/O fail to scale.
6
Goal: Multicore scalability
Problem: Contention limits scalability, and contention involves cache-line conflicts.
Goal: Multicore scalability = no cache-line conflicts. Even a single contended cache line can wreck scalability.
Commutative operations can be implemented without cache-line conflicts [Scalable Commutativity Rule, Clements SOSP '13].
How do we scale all commutative operations in file systems?
7
ScaleFS approach: Two separate file systems
[Diagram: MemFS in memory, with directories as hash tables mapping link name to inode number, designed for multicore scalability; DiskFS on disk, with a journal and block cache, designed for durability; fsync carries MemFS changes to DiskFS]
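To make the hash-table directory concrete, here is a minimal C++ sketch (hypothetical names, not ScaleFS's actual code). It uses per-bucket locks instead of MemFS's lock-free reads, but it preserves the property that matters here: operations on different names land in different, cache-line-aligned buckets, so they do not conflict.

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <utility>

// A directory as a chaining hash table from link name to inode number.
class DirHashTable {
  static constexpr size_t kBuckets = 1024;

  struct alignas(64) Bucket {   // cache-line aligned: no false sharing
    std::mutex lock;
    std::list<std::pair<std::string, uint64_t>> entries;
  };
  std::array<Bucket, kBuckets> buckets_;

  Bucket& bucket_for(const std::string& name) {
    return buckets_[std::hash<std::string>{}(name) % kBuckets];
  }

 public:
  // Insert a link; distinct names almost always hash to distinct
  // buckets, so concurrent creates proceed without cache-line conflicts.
  bool create(const std::string& name, uint64_t inum) {
    Bucket& b = bucket_for(name);
    std::lock_guard<std::mutex> g(b.lock);
    for (auto& e : b.entries)
      if (e.first == name) return false;   // name already exists
    b.entries.emplace_back(name, inum);
    return true;
  }

  // Look up a link name; returns false if absent.
  bool lookup(const std::string& name, uint64_t* inum) {
    Bucket& b = bucket_for(name);
    std::lock_guard<std::mutex> g(b.lock);
    for (auto& e : b.entries)
      if (e.first == name) { *inum = e.second; return true; }
    return false;
  }
};
```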
8
Concurrent file creation scales in ScaleFS
[Diagram: cores 1 and 2 issue creat(dirA/file1) and creat(dirA/file2); MemFS holds dirA as a hash table of link name to inode number; DiskFS, its journal, and the block cache are untouched]
9
Concurrent file creation scales in ScaleFS
[Diagram: the two creat calls insert file1 → 100 and file2 → 200 into different buckets of dirA's hash table]
No contention, no cache-line conflicts: scalability!
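A toy driver for the DirHashTable sketch above (again hypothetical), mirroring this slide: two threads play the role of the two cores and create different files in the same directory concurrently.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>

int main() {
  DirHashTable dirA;   // from the sketch above

  // Two "cores" create different files in dirA at the same time;
  // the creates hit different buckets, so neither blocks the other.
  std::thread core1([&] { dirA.create("file1", 100); });
  std::thread core2([&] { dirA.create("file2", 200); });
  core1.join();
  core2.join();

  uint64_t inum;
  assert(dirA.lookup("file1", &inum) && inum == 100);
  assert(dirA.lookup("file2", &inum) && inum == 200);
  return 0;
}
```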
10
Challenge: How to implement fsync?
[Diagram: dirA's in-memory hash table now holds file1 → 100 and file2 → 200, but nothing has reached DiskFS, its journal, or the block cache]
11
Challenge: How to implement fsync?
DiskFS updates must be consistent with MemFS.
fsync must preserve conflict-freedom for commutative ops.
[Diagram: fsync propagates dirA's in-memory entries (file1: 100, file2: 200) to dirA's on-disk representation through DiskFS's journal]
12
Contributions
ScaleFS, a file system that achieves excellent multicore scalability.
Two separate file systems: MemFS and DiskFS.
Design for fsync: per-core operation logs to scalably defer updates to DiskFS, with operations ordered using Time Stamp Counters.
Evaluation: benchmarks on ScaleFS scale 35x-60x on 80 cores; a workload- and machine-independent analysis of cache-line conflicts suggests ScaleFS is a good fit for workloads not limited by disk I/O.
13
ScaleFS design: Two separate file systems
[Diagram: MemFS in memory, designed for multicore scalability, using hash tables, radix trees, and seqlocks for lock-free reads; DiskFS on disk, designed for durability, using blocks, transactions, and journaling; fsync drains the per-core operation logs from MemFS to DiskFS]
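Here is one way the per-core operation logs might be laid out, as a hedged sketch: the entry format and all names are illustrative assumptions, not ScaleFS's actual structures (ScaleFS builds on OpLog and keeps finer-grained logs). The key property is that each core appends only to its own cache-line-aligned log, so logging never causes cross-core conflicts.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

enum class OpType { Create, Unlink, Rename };

// One logged directory operation. The timestamp field is filled from
// the core's synchronized TSC (see the RDTSCP sketch further below).
struct LogEntry {
  OpType op;
  uint64_t timestamp;
  uint64_t dir;        // inode number of the directory
  std::string name;    // link name the op applies to
  uint64_t inum;       // inode number of the file involved
};

// One log per core, padded to a cache line so appends on different
// cores never write to the same line.
struct alignas(64) PerCoreLog {
  std::vector<LogEntry> entries;
  void append(LogEntry e) { entries.push_back(std::move(e)); }
};

constexpr int kMaxCores = 80;
PerCoreLog oplogs[kMaxCores];

// Called by MemFS after it updates its in-memory state; the on-disk
// update is deferred until an fsync drains the logs into DiskFS.
void log_create(int core, uint64_t ts, uint64_t dir, std::string name,
                uint64_t inum) {
  oplogs[core].append({OpType::Create, ts, dir, std::move(name), inum});
}
```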
14
Design challenges
How to order operations in the per-core operation logs?
How to operate MemFS and DiskFS independently: how to allocate inodes in a scalable manner in MemFS? ...
15
Problem: Preserve ordering of non-commutative ops
[Diagram: core 1 issues unlink(file1); dirA's hash table in MemFS maps file1 → 100; per-core operation logs sit between MemFS and DiskFS]
16
Problem: Preserve ordering of non-commutative ops
[Diagram: the unlink removes file1 from dirA's hash table and appends op1: UNLINK to core 1's operation log]
17
Problem: Preserve ordering of non-commutative ops
[Diagram: core 2 now issues creat(file1) in the same directory]
18
Problem: Preserve ordering of non-commutative ops
[Diagram: the creat inserts file1 → 200 into dirA's hash table and appends op2: CREATE to core 2's operation log]
19
Problem: Preserve ordering of non-commutative ops
[Diagram: core 3 calls fsync, which must apply op1: UNLINK before op2: CREATE, but the two operations sit in different cores' logs. How does fsync recover the order?]
20
Solution: Use synchronized Time Stamp Counters
[RDTSCP does not incur cache-line conflicts]
[Diagram: the same setup as before: dirA's hash table maps file1 → 100 and the per-core operation logs are empty]
21
Solution: Use synchronized Time Stamp Counters
[RDTSCP does not incur cache-line conflicts]
[Diagram: the unlink is logged as op1: UNLINK with timestamp ts1 and the creat as op2: CREATE with timestamp ts2; since ts1 < ts2, fsync on core 3 knows to apply the unlink first]
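Continuing the per-core log sketch from the design slide, here is a hedged sketch of how the timestamps might be taken and used: RDTSCP reads only the local core's timestamp counter, touching no shared memory, and fsync can then recover the real-time order of non-commutative operations by sorting the merged logs on timestamp. This assumes synchronized TSCs across cores, as on the paper's test machine; the function names are hypothetical.

```cpp
#include <x86intrin.h>   // __rdtscp (x86-64, GCC/Clang)
#include <algorithm>
#include <cstdint>
#include <vector>

// Read the core-local timestamp counter. No loads or stores to shared
// memory, hence no cache-line conflicts, however many cores call it.
uint64_t read_timestamp() {
  unsigned aux;              // receives IA32_TSC_AUX (encodes the core id)
  return __rdtscp(&aux);     // waits for prior instructions, then reads TSC
}

// At fsync time: gather one directory's entries from every core's log
// (oplogs/LogEntry from the earlier sketch) and sort by timestamp.
// With synchronized TSCs this recovers ts1 < ts2, i.e. the UNLINK is
// replayed before the CREATE even though they sit in different logs.
std::vector<LogEntry> merge_for_fsync(uint64_t dir) {
  std::vector<LogEntry> merged;
  for (int c = 0; c < kMaxCores; c++)
    for (const LogEntry& e : oplogs[c].entries)
      if (e.dir == dir) merged.push_back(e);
  std::sort(merged.begin(), merged.end(),
            [](const LogEntry& a, const LogEntry& b) {
              return a.timestamp < b.timestamp;
            });
  return merged;   // replay in this order against DiskFS
}
```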
22
Problem: How to allocate inodes scalably in MemFS?
[Diagram: core 1 issues creat(dirA/file1); MemFS needs an inode number for file1 (file1 → ???), but the inode allocator belongs to DiskFS]
23
Solution (1): Separate mnodes in MemFS from inodes in DiskFS
[Diagram: dirA's hash table now maps link names to mnode numbers; a per-core mnode allocator in MemFS replaces DiskFS's inode allocator on the creat path; file1's mnode is still unassigned]
24
Solution (1): Separate mnodes in MemFS from inodes in DiskFS
[Diagram: the per-core mnode allocator assigns mnode 100 to file1 without involving DiskFS]
25
Solution (2): Defer allocating inodes in DiskFS until an fsync
[Diagram: an mnode-inode table maps mnode numbers to inode numbers (e.g., 100 → 456); DiskFS's inode allocator is consulted only when an fsync writes the file to disk]
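A minimal sketch of the two solutions together, reusing kMaxCores from the earlier log sketch; all names are hypothetical and DiskFS's allocator is stubbed out. Each core hands out mnode numbers from its own range, so creat never touches shared allocator state; the mnode-inode table is consulted, and an on-disk inode allocated, only when an fsync actually forces the file to DiskFS.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Per-core mnode counter, padded so cores never share its cache line.
struct alignas(64) MnodeAllocator {
  uint64_t next = 0;
};
MnodeAllocator mnode_alloc[kMaxCores];

// Core id in the high bits, per-core counter in the low bits: globally
// unique mnode numbers with zero cross-core communication.
uint64_t alloc_mnode(int core) {
  return (uint64_t(core) << 48) | mnode_alloc[core].next++;
}

// Stub standing in for DiskFS's real on-disk inode allocator (not shown).
uint64_t alloc_disk_inode() {
  static std::atomic<uint64_t> next_inode{456};
  return next_inode.fetch_add(1);
}

// The mnode-inode table, filled lazily: a file that is created and
// deleted without ever being fsynced never consumes an on-disk inode.
std::mutex table_lock;
std::unordered_map<uint64_t, uint64_t> mnode_to_inode;

uint64_t inode_for(uint64_t mnode) {
  std::lock_guard<std::mutex> g(table_lock);
  auto it = mnode_to_inode.find(mnode);
  if (it != mnode_to_inode.end()) return it->second;
  uint64_t inum = alloc_disk_inode();    // deferred until first fsync
  mnode_to_inode.emplace(mnode, inum);
  return inum;
}
```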
26
Other design challenges
How to scale concurrent fsyncs?
How to order lock-free reads?
How to resolve dependencies affecting multiple inodes?
How to ensure internal consistency despite crashes?
27
Implementation
ScaleFS is implemented on the sv6 research operating system.
Supported file-system system calls: creat, open, openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread, write, pwrite, chdir, readdir, pipe, pipe2, stat, fstat, link, unlink, rename, fsync, sync, close.

ScaleFS component               Lines of C++ code
MemFS (based on FS from sv6)    2,458
DiskFS (based on FS from xv6)   2,331
Operation Logs                  4,094
28
Evaluation
Does ScaleFS achieve good scalability?
Measure scalability on 80 cores.
Observe conflict-freedom for commutative operations.
Does ScaleFS achieve good disk throughput?
What memory overheads are introduced by ScaleFS's split of MemFS and DiskFS?
29
Evaluation methodology
Machine configuration: 80 cores (Intel Xeon), 256 GB RAM; backing store: "RAM" disk.
Benchmarks:
mailbench: mail server workload
dbench: file server workload
largefile: creates a file, writes 100 MB, fsyncs, and deletes it
smallfile: creates, writes, fsyncs, and deletes many 1 KB files
30
ScaleFS scales 35x-60x on a RAM disk
[Single-core performance of ScaleFS is on par with Linux ext4.]
31
Machine-independent methodology
Use Commuter [Clements SOSP '13] to observe conflict-freedom for commutative ops.
Commuter generates test cases for pairs of commutative operations and reports the cache-line conflicts it observes.
32
Conflict-freedom for commutative ops on Linux ext4: 65%
33
Conflict-freedom for commutative ops on ScaleFS: 99.2%
34
Conflict-freedom for commutative ops on ScaleFS: 99.2%
Why not 100% conflict-free?
In a few cases ScaleFS deliberately trades scalability for performance.
Other conflicts occur only probabilistically.
35
Evaluation summary
ScaleFS scales well on an 80-core machine.
Commuter reports 99.2% conflict-freedom on ScaleFS; this analysis is workload- and machine-independent, suggesting scalability beyond our experimental setup and benchmarks.
36
Related Work
Scalability studies: FxMark [USENIX ATC '16], Linux Scalability [OSDI '10]
Scaling file systems using sharding: Hare [EuroSys '15], SpanFS [USENIX ATC '15]
ScaleFS uses similar techniques:
Operation logging: OpLog [CSAIL TR '14]
Per-inode / per-core logs: NOVA [FAST '16], iJournaling [USENIX ATC '17], Strata [SOSP '17]
Decoupling in-memory and on-disk representations: Linux dcache, ReconFS [FAST '14]
ScaleFS focus: achieve scalability by avoiding cache-line conflicts
37
https://github.com/mit-pdos/scalefs
Conclusion
ScaleFS: a novel file system design for multicore scalability.
Two separate file systems: MemFS and DiskFS.
Per-core operation logs, ordered using Time Stamp Counters.
ScaleFS scales 35x-60x on an 80-core machine.
ScaleFS is conflict-free for 99.2% of test cases in Commuter.