Slide 1: SGI's XFS, or Cool Pet Tricks with B+ Trees
G22.3250-001, Robert Grimm, New York University
Slide 2: Altogether Now, the Three Questions
- What is the problem?
- What is new or different?
- What are the contributions and limitations?
Slide 3: Some Background
- Inode: on-disk data structure containing a file's metadata and pointers to its data
  - Defines the FS-internal namespace (inode numbers)
  - Originated in Unix FFS
- Vnode: kernel data structure for representing files
  - Provides a standard API for accessing different file systems
  - Originated in SunOS
- (Both structures are sketched below.)
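A minimal C sketch of the two structures, using illustrative field names rather than the actual FFS or SunOS definitions:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* On-disk inode: a file's metadata plus pointers to its data blocks.
 * The inode number is essentially this structure's index on disk,
 * which is what gives the file system its internal namespace. */
struct dinode_sketch {
    uint16_t mode;          /* file type and permission bits */
    uint32_t uid, gid;      /* owner and group */
    uint64_t size;          /* file length in bytes */
    uint32_t direct[12];    /* direct block pointers */
    uint32_t indirect;      /* single-indirect block pointer */
    uint32_t dindirect;     /* double-indirect block pointer */
};

/* In-kernel vnode: a file-system-independent handle; the ops table
 * dispatches each call to the concrete implementation (FFS, NFS, XFS). */
struct vnode_sketch;
struct vnode_ops_sketch {
    int (*read)(struct vnode_sketch *vp, void *buf, size_t len, off_t off);
    int (*write)(struct vnode_sketch *vp, const void *buf, size_t len, off_t off);
    int (*lookup)(struct vnode_sketch *dir, const char *name,
                  struct vnode_sketch **result);
};
struct vnode_sketch {
    const struct vnode_ops_sketch *ops; /* FS-specific operations */
    void *fs_private;                   /* e.g., the in-memory inode */
};
```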
Slide 4: Motivation
- On one side, the I/O bottleneck: it is becoming harder to utilize increasing I/O capacity and bandwidth
  - 112 nine-GB disk drives provide 1 TB of storage
  - High-end drive arrays provide 500 MB/sec of sustained disk bandwidth
- On the other side, I/O-intensive applications
  - Editing uncompressed video: 30 MB/sec per stream, 108 GB for one hour
  - Streaming compressed video on demand: 2.7 TB for 1,000 movies; serving 200 movies concurrently requires 100 MB/sec
Slide 5: Scalability Problems of Existing File Systems
- Slow crash recovery: fsck needs to scan the entire disk
- No support for large file systems
  - 32-bit block pointers address only 4 billion blocks; at 8 KB per block, that is 32 TB (2^32 blocks x 8 KB = 2^45 bytes)
- No support for large, sparse files
  - 64-bit block pointers require more levels of indirection and are also quite inefficient
  - Fixed-size extents are still too limiting
Slide 6: Scalability Problems of Existing File Systems (cont.)
- No support for large, contiguous files
  - Bitmap structures for tracking free and allocated blocks do not scale: it is hard to find large regions of contiguous space (see the sketch below)
  - But we need contiguous allocation for good utilization of bandwidth
- No support for large directories
  - Linear layout of (name, inode number) entries does not scale
  - In-memory hashing imposes high memory overheads
- No support for large numbers of files
  - Inodes are preallocated during file system creation
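A hedged C sketch of the bitmap problem (a hypothetical helper, not code from any real file system): finding n contiguous free blocks degenerates into a linear scan over every block in the file system.

```c
#include <stdint.h>

/* Return the first block number that starts a run of `need` free (0)
 * bits, or -1 if no such run exists.  The cost is O(total_blocks) no
 * matter how much free space there is; on a terabyte file system that
 * means scanning billions of bits. */
long find_free_run(const uint8_t *bitmap, long total_blocks, long need)
{
    long run = 0;
    for (long b = 0; b < total_blocks; b++) {
        int used = (bitmap[b >> 3] >> (b & 7)) & 1;
        run = used ? 0 : run + 1;
        if (run == need)
            return b - need + 1;
    }
    return -1;
}
```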
Slide 7: XFS in a Nutshell
- Use 64-bit block addresses: support for larger file systems
- Use B+ trees and extents: support for larger numbers of files, larger files (which may be sparse or contiguous), and larger directories; better utilization of I/O bandwidth
- Log metadata updates: faster crash recovery
Slide 8: XFS Architecture
- I/O manager: I/O requests
- Directory manager: file system name space
- Space manager: free space, inode and file allocation
- Transaction manager: atomic metadata updates
- Unified buffer cache
- Volume manager: striping, concatenation, mirroring
Slide 9: Storage Scalability
- Allocation groups
  - Are regions with their own free-space maps and inodes
  - Support AG-relative block and inode pointers, which reduce the size of data structures
  - Improve (thread) parallelism of metadata management by allowing concurrent accesses to different allocation groups
  - Unlike FFS cylinder groups, are motivated (mostly) by scalability and parallelism, not by locality
- Free space
  - Two B+ trees describe the free extents (what's a B+ tree?)
  - One is indexed by starting block (used when?), the other by extent length (used when?); see the sketch below
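A C sketch of how the two indexes divide the work, with sorted arrays standing in for the B+ trees (all names hypothetical; the real trees answer both queries in logarithmic time). The by-start index serves locality, for example extending an existing file in place, while the by-length index serves best-fit allocation of new space.

```c
#include <stddef.h>

struct extent { long start, len; }; /* a run of free blocks */

/* Index 1, sorted by starting block: first free extent at or after
 * `want`, used to allocate close to existing data. */
const struct extent *free_near(const struct extent *by_start, size_t n,
                               long want)
{
    for (size_t i = 0; i < n; i++)      /* a B+ tree does this in O(log n) */
        if (by_start[i].start >= want)
            return &by_start[i];
    return NULL;
}

/* Index 2, the same extents sorted by length: smallest free extent of
 * at least `need` blocks, used for best-fit allocation. */
const struct extent *best_fit(const struct extent *by_len, size_t n,
                              long need)
{
    for (size_t i = 0; i < n; i++)      /* a B+ tree does this in O(log n) */
        if (by_len[i].len >= need)
            return &by_len[i];
    return NULL;
}
```

Keeping the same extents in two indexes is the design choice: the allocator can pick whichever query matters for the request at hand.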
Slide 10: Storage Scalability (cont.)
- Large files
  - File storage is tracked by an extent map; each entry holds a block offset in the file, a length in blocks, and a starting block on disk
  - A small extent map is organized as a list in the inode; a large one becomes a B+ tree rooted in the inode, indexed by block offset in the file (lookup sketched below)
- Large numbers of files
  - Inodes are allocated dynamically, in chunks of 64
  - Inode locations are tracked by a B+ tree that points only to inode chunks
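A sketch of extent-map lookup, assuming only the three-field entry the slide describes; a sorted array with binary search stands in for the inode-rooted list or B+ tree, and 0 stands in for a hole:

```c
#include <stddef.h>
#include <stdint.h>

struct bmap_rec {
    uint64_t file_off;   /* block offset within the file */
    uint64_t len;        /* extent length in blocks */
    uint64_t disk_block; /* starting block on disk */
};

/* Map file block `fb` to its disk block; 0 denotes a hole, so a sparse
 * file costs one record per extent rather than one pointer per block. */
uint64_t bmap_lookup(const struct bmap_rec *map, size_t n, uint64_t fb)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {                   /* binary search on file_off */
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].file_off <= fb)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;                       /* before the first extent */
    const struct bmap_rec *r = &map[lo - 1];
    if (fb < r->file_off + r->len)
        return r->disk_block + (fb - r->file_off);
    return 0;                           /* gap between extents: a hole */
}
```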
Slide 11: Storage Scalability (cont.)
- Large directories
  - Directories are implemented as (surprisingly) B+ trees
  - They map 4-byte hashes to directory entries of (name, inode number); see the sketch below
- Fast crash recovery
  - Enabled by a write-ahead log for all structural updates to metadata
  - E.g., creating a file touches the directory block, the new inode, an inode allocation tree block, the allocation group header block, and the superblock
  - The log is independent of the actual data structures (it holds just binary data)
  - However, a disk scavenger is still needed for catastrophic failures; the customers spoke...
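A sketch of the hashed lookup. The hash below is 32-bit FNV-1a, a stand-in rather than XFS's actual function; the point is that the directory B+ tree is keyed by fixed-size 4-byte hashes instead of variable-length names.

```c
#include <stdint.h>

/* Hash a file name down to the 4-byte key used to index the tree. */
uint32_t name_hash(const char *name)
{
    uint32_t h = 2166136261u;           /* FNV-1a offset basis */
    for (; *name; name++) {
        h ^= (uint8_t)*name;
        h *= 16777619u;                 /* FNV-1a prime */
    }
    return h;
}

/* Lookup proceeds as: descend the B+ tree by name_hash(name), then
 * compare full names among the few entries sharing that hash. */
```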
Slide 12: Performance Scalability
- Allocating files contiguously
  - On-disk allocation is delayed until flush, using (cheap and plentiful) memory to improve I/O performance
  - Typically enables allocation in one extent, even for random writes (think memory-mapped files)
  - Avoids allocation entirely for short-lived files (sketched below)
- Extents have a large range: the 21-bit length field covers two million file system blocks
- Block size can vary by file system: small blocks for file systems with many small files, large blocks for file systems with mostly large files
- What prevents long-term fragmentation?
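A C sketch of the delayed-allocation idea, with hypothetical helper names: a write only grows an in-memory reservation, and disk blocks are chosen once, at flush time, when the whole dirty range is known.

```c
#include <stdint.h>

struct delalloc {
    uint64_t first_fb, last_fb; /* dirty file-block range, inclusive */
    int dirty;                  /* any reservation outstanding? */
};

/* Write path: remember the dirty range, but pick no disk blocks yet. */
void delalloc_write(struct delalloc *d, uint64_t fb)
{
    if (!d->dirty || fb < d->first_fb) d->first_fb = fb;
    if (!d->dirty || fb > d->last_fb)  d->last_fb = fb;
    d->dirty = 1;
}

/* Flush path: allocate once for the entire range, so even randomly
 * ordered writes usually land in a single contiguous extent.  A file
 * deleted before flush never triggers an allocation at all. */
void delalloc_flush(struct delalloc *d)
{
    if (!d->dirty)
        return;
    uint64_t need = d->last_fb - d->first_fb + 1;
    /* allocate_extent(d->first_fb, need);  hypothetical allocator call */
    (void)need;
    d->dirty = 0;
}
```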
Slide 13: Performance Scalability (cont.)
- Performing file I/O
  - Read requests are issued for large I/O buffers, followed by multiple read-ahead requests for sequential reads
  - Writes are clustered to form larger, asynchronous I/O requests; delayed allocation helps with buffering writes
  - Direct I/O lets applications bypass the cache and use DMA: applications gain control over their I/O while still accessing the file system, but they must align data on block boundaries and issue requests in multiples of the block size (see the sketch below)
  - Reader/writer locking supports more concurrency: several processes on different CPUs can access the same file, and direct I/O leaves serialization entirely to applications
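A sketch of the alignment rules that direct I/O imposes, using the Linux O_DIRECT interface as an analogue (the paper's interface is IRIX-specific): the buffer, the offset, and the length must all be multiples of the block size.

```c
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read 16 blocks from the start of `path`, bypassing the buffer cache
 * so the disk can DMA directly into the application's buffer. */
ssize_t read_direct(const char *path, void **out)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    const size_t blksz = 4096;  /* must match the FS block size */
    void *buf;                  /* must be block-aligned, or the read fails */
    if (posix_memalign(&buf, blksz, 16 * blksz) != 0) {
        close(fd);
        return -1;
    }

    ssize_t n = pread(fd, buf, 16 * blksz, 0); /* aligned offset and length */
    close(fd);
    *out = buf;
    return n;
}
```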
Slide 14: Performance Scalability (cont.)
- Accessing and updating metadata
  - Updates are recorded in an asynchronous write-ahead log; modified metadata is still flushed only after the corresponding log update has completed
  - But the metadata is not locked in the meantime, so multiple updates can be batched
  - The log may be placed on a different device from the file system, including non-volatile memory (NV-RAM)
  - Log operation is simple, but the log is centralized (think SMP): provide buffer space, copy, write out, notify; the copying can be done by the processors performing the transactions (sketched below)
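A sketch of the write-ahead discipline and its one ordering rule (all names hypothetical):

```c
#include <stdint.h>

/* A transaction's commit steps; the caller returns before the log
 * reaches disk, which is what allows many commits per log write. */
void metadata_update(void)
{
    /* 1. reserve space in the in-memory log buffer                  */
    /* 2. modify the in-memory metadata                              */
    /* 3. copy this transaction's log records into the buffer        */
    /*    (done by the committing CPU itself, so the single,         */
    /*    centralized log does not serialize the copying on an SMP)  */
    /* 4. return; the buffer is written out later, batching many     */
    /*    transactions into one log I/O                              */
}

/* Ordering rule: a metadata buffer may be written to its home
 * location only after the log covering it is on stable storage. */
int may_flush(uint64_t buffer_lsn, uint64_t log_stable_lsn)
{
    return buffer_lsn <= log_stable_lsn;
}
```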
Slide 15: Experiences
Slide 16: I/O Throughput
- What can we conclude? Consider the read speed, the difference between creates and writes, and the degree of parallelism.
Slide 17: Benchmark Results (The Marketing Dept. Speaketh)
- Datamation sort: 3.52 seconds (previous record: 7 seconds)
- Indy MinuteSort: 1.6 GB sorted in 56 seconds (previously 1.08 GB)
- SPEC SFS: 8,806 SPECnfs instead of 7,023 SPECnfs with EFS
  - 12% increase with mostly small and synchronous writes on similar hardware
Slide 18: Directory Lookups
- Why this noticeable break?
Slide 19: Are You Pondering What I Am Pondering?
- Are these systems/approaches contradictory?
  - Recoverable virtual memory: simpler is better, and better performing
  - XFS: way more complex is better, and better performing
Slide 20: What Do You Think?