1
IRON File Systems
Remzi Arpaci-Dusseau
University of Wisconsin, Madison
2
Understanding How Things Fail Is Important
3
How Disks Fail
4
Classic Failure Model: "Fail Stop"
- As defined [Schneider '90]:
  - Stop: Upon failure, halt
  - Make known: But first, switch to a state such that other components can detect that you have failed
- Very simple model of disk failure
- Used by all early file and storage systems (once controllers could detect failure)
- But is it realistic?
5
Assertion: Modern Disks Are Not Whole-Disk Fail Stop
6
Real Failures
- Latent sector errors [Kari '93, Bairavasundaram '07]
  - A block or set of blocks becomes inaccessible
- Data corruption [Weinberg '04, Greene '05, Bairavasundaram '08]
  - Controller bugs, not bit rot
- Transient errors too [Talagala '99]
  - Bus stuttering, etc.
- Result: Partial failures are a reality
7
So What Should We Do?
8
High-end Systems: Extra Measures
- Disk scrubbing [Kari '93]
  - Proactively scan drives in search of latent errors
  - When an error is detected, correct it from a redundant copy on another disk
- Extra redundancy [Corbett '04]
  - RAID systems with two parity disks
- Checksums [Bartlett '04, Weinberg '04]
  - Extra computation over data
  - Guard against corruption
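The scrubbing loop described above can be pictured with a small sketch. This is a minimal illustration under assumptions of my own (a mirrored pair of devices, a fixed 4 KB block size, and a made-up reconstruct_from_mirror() helper), not code from any real scrubber:

```c
/* Minimal disk-scrubbing sketch (illustrative only).
 * Assumes a mirrored pair of devices and a fixed block size;
 * device names are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

/* Hypothetical helper: rewrite a bad block from its mirror copy. */
static int reconstruct_from_mirror(int bad_fd, int mirror_fd, off_t off)
{
    char buf[BLOCK_SIZE];
    if (pread(mirror_fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE)
        return -1;                      /* both copies lost: report upward */
    return pwrite(bad_fd, buf, BLOCK_SIZE, off) == BLOCK_SIZE ? 0 : -1;
}

int main(void)
{
    int fd     = open("/dev/sdX", O_RDWR);    /* primary (placeholder) */
    int mirror = open("/dev/sdY", O_RDONLY);  /* mirror  (placeholder) */
    if (fd < 0 || mirror < 0) { perror("open"); return 1; }

    char buf[BLOCK_SIZE];
    off_t nblocks = lseek(fd, 0, SEEK_END) / BLOCK_SIZE;

    /* Proactively read every block, looking for latent sector errors. */
    for (off_t b = 0; b < nblocks; b++) {
        off_t off = b * BLOCK_SIZE;
        if (pread(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE) {
            fprintf(stderr, "scrub: block %lld unreadable, repairing\n",
                    (long long)b);
            if (reconstruct_from_mirror(fd, mirror, off) != 0)
                fprintf(stderr, "scrub: block %lld lost\n", (long long)b);
        }
    }
    close(fd);
    close(mirror);
    return 0;
}
```

The key point is proactivity: the scrub finds a latent error while the redundant copy is still good, rather than discovering it later when the copy may also have failed.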
9
But What About Desktop File Systems?
10
Desktop FS's: Lost In The Past?
- Desktop file systems are important
  - Home use: Photos, movies, tax returns, ...
  - Cluster use too: GoogleFS is built on local file systems
- Performance policies are well known
  - e.g., the FFS placement policy
- But what is their fault-handling policy?
  - Do they handle partial disk failures?
  - How can we tell?
11
Two Questions
12
Questions I Will Answer
- Question 1: How do local file systems react to the more realistic set of disk failures?
- Question 2: How can we change file systems to better handle these types of faults?
13
How Disks Fail: The Details
14
The Storage Stack
- Not just the file system on top of the disk
- Many layers, lots of software, even within the disk!
- Failures occur at all levels
[Figure: the storage stack. Host side: file system, generic I/O, device driver, device controller, transport. Disk side: cache, firmware, electrical, mechanical, media.]
15
Latent Sector Errors
- Disks experience partial failures: "a small portion of data on disk becomes temporarily or permanently unavailable" [Corbett '04]
- Root causes: scratched surface, inaccurate arm movement, interconnect problems
- Bottom line: A single read or write can fail
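Since a single read can fail, the minimal defense is to check the return value of every disk request. A small sketch (the block size and messages are illustrative, not taken from any particular file system):

```c
/* Sketch: a single block read can fail, so every return value must be
 * checked (the simplest IRON detection technique: Error Code).
 * Block size and messages are illustrative. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int read_block(int fd, void *buf, off_t blkno)
{
    ssize_t n = pread(fd, buf, BLOCK_SIZE, blkno * BLOCK_SIZE);
    if (n == BLOCK_SIZE)
        return 0;                       /* the common, successful case */
    if (n < 0 && errno == EIO)
        fprintf(stderr, "I/O error (possible latent sector error) at block %lld\n",
                (long long)blkno);
    else
        fprintf(stderr, "failed or short read at block %lld: %s\n",
                (long long)blkno, n < 0 ? strerror(errno) : "short read");
    return -1;   /* caller decides: propagate, retry, or use redundancy */
}
```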
16
Data Corruption
- Sun's ZFS [Weinberg '04]
  - Misdirected writes: Right data, wrong location
  - Phantom/lost writes: "Yes, I wrote the data!" (but didn't)
- EIDE interface on motherboards [Greene '05]
  - Read reported as "done" when it was not (a race)
  - Similar problem at Google [Ghemawat '03]
- Network Appliance [Lewis '99]
  - Disk occasionally returns byte-shifted data
17
Transient Errors
- 18-month study of a large disk farm [Talagala '99]
  - Most machines had SCSI timeout errors (loose cables, bad cables?)
  - SCSI parity errors were common too (data corrupted while moving across the bus)
- Failures can be transient too
  - Might work if just retried
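Because such faults are often transient, a bounded retry loop is a natural recovery step. A hedged sketch (the retry count and back-off delay are arbitrary choices, not values from the talk):

```c
/* Sketch of the Retry idea: transient faults (timeouts, bus parity
 * errors) may succeed on a second attempt, so retry a few times before
 * giving up. The retry count and delay are arbitrary. */
#include <unistd.h>

#define BLOCK_SIZE  4096
#define MAX_RETRIES 3

int read_block_retry(int fd, void *buf, off_t blkno)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (pread(fd, buf, BLOCK_SIZE, blkno * BLOCK_SIZE) == BLOCK_SIZE)
            return 0;              /* success: the fault was transient */
        usleep(100 * 1000);        /* brief back-off before retrying */
    }
    return -1;                     /* sticky fault: escalate to the caller */
}
```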
18
Even Worse With ATA (Not SCSI)
- ATA drives: Less reliable [Anderson '03, Hughes & Murray '05]
  - Few are returned for "failure analysis"
  - Some are "partially flaw marked during testing"
  - Test conditions not as harsh (power, temp.)
  - High-end reliability features missing (filters: remove particles; chemicals: humidity)
- Cheap disks -> less testing -> less reliability
  - But cost drives many purchasing decisions...
19
Trend: More Problems, Not Less
- Denser drives: Capacity sells drives
  - More logic -> more complexity
  - More complexity -> more bugs
- Cost per byte dominates: "Pennies matter"
  - Manufacturers will cut corners
  - Reliability features are the first to go
- Increasing amount of software
  - ~400K lines of code in a modern Seagate drive
  - Hard to write, hard to debug
20
The Fail-Partial Failure Model
21
- Disk failure: The entire disk may fail
- Block failure: Part of the disk may fail
- Block corruption: Part of the disk may get corrupted
- All can be either transient or sticky
22
Important Parameters
- Locality: Are partial faults independent of each other?
- Frequency: How often do partial faults occur?
23
Frequency of Failures
- Study of latent sector errors [Bairavasundaram et al. '07]
  - 1.53 million disks, 3+ years of data
  - ATA: 8.5%; SCSI: 1.9%
  - Latent sector errors are not independent
  - Spatial locality exists; disk capacity matters
- Study of block corruption [Bairavasundaram et al. '08]
  - Same data set
  - ATA: 0.6%; SCSI: 0.06%
  - Corruptions within a disk are not independent
  - Spatial locality exists
  - The "bad block number" problem
24
How Do File Systems React To Partial Failures?
25
How To Detect & Handle Failures?
- Need: A classification of techniques
  - Detection: Discovering that a failure took place
  - Recovery: Recovering from the failure
- Detection + Recovery = IRON
  - File systems with Internal RObustNess
- IRON taxonomy: Classify techniques
26
IRON Detection Taxonomy
How to detect block failure or corruption? Possible strategies:
- Zero: No detection technique used
- Error Code: Check return codes from the disk
- Sanity: Check data structures for consistency
- Redundancy: Add checksums or other forms of computed replication to detect problems
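As an illustration of the Redundancy level of detection, a file system can store a checksum alongside each block and verify it on every read. The sketch below uses a simple Fletcher-style sum purely for illustration; a real system would likely use a stronger function and keep the checksums in its metadata:

```c
/* Sketch of Redundancy-based detection: store a checksum with each
 * block and verify it on every read. The Fletcher-style sum here is
 * purely illustrative. */
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

static uint32_t block_checksum(const uint8_t *data, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65535;
        b = (b + a) % 65535;
    }
    return (b << 16) | a;
}

/* Compare a freshly read block against the checksum recorded when it
 * was written; a mismatch means the block was silently corrupted. */
int block_is_corrupt(const uint8_t block[BLOCK_SIZE], uint32_t stored_csum)
{
    return block_checksum(block, BLOCK_SIZE) != stored_csum;
}
```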
27
IRON Recovery Taxonomy
How to recover from a detected failure? Possible strategies:
- Zero: Don't do anything
- Propagate: Pass the error on to a higher level
- Stop: Halt activity ("fail stop")
- Guess: Manufacture data and return it to the user
- Retry: Assume the failure is transient
- Repair: If an inconsistency is detected
- Remap: Redirect to another block
- Redundancy: Use another copy of the block
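One way to picture how these levels compose is a read path that tries Retry first, falls back to Redundancy (a replica), and only then Propagates the error. This is a sketch under made-up assumptions (a replica area kept at a fixed disk offset), not the policy of any file system studied here:

```c
/* Sketch of how recovery levels can compose on a failed read:
 * Retry first (the fault may be transient), then Redundancy (read a
 * replica kept at a fixed, made-up offset), and only then Propagate
 * the error to the caller. */
#include <errno.h>
#include <unistd.h>

#define BLOCK_SIZE   4096
#define MAX_RETRIES  3
#define REPLICA_BASE (1024LL * 1024 * 1024)  /* hypothetical replica area */

static int try_read(int fd, void *buf, off_t off)
{
    return pread(fd, buf, BLOCK_SIZE, off) == BLOCK_SIZE ? 0 : -1;
}

int robust_read(int fd, void *buf, off_t blkno)
{
    off_t off = blkno * BLOCK_SIZE;

    for (int i = 0; i < MAX_RETRIES; i++)            /* Retry */
        if (try_read(fd, buf, off) == 0)
            return 0;

    if (try_read(fd, buf, REPLICA_BASE + off) == 0)  /* Redundancy */
        return 0;

    return -EIO;                                     /* Propagate */
}
```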
28
What IRON Techniques Do Modern File Systems Use?
29
Fault Injection
- Typical fault injection:
  - Insert failures at random disk locations/times
  - Watch the system to see what happens
- Not good enough:
  - May miss interesting behavior
  - May find problems without explaining them
- What we do: Space- and time-aware injection
  - A "gray box" approach to testing
30
Space Awareness
- File systems are composed of many on-disk structures
  - e.g., superblocks, inodes, etc.
- Idea: Make the fault-injection layer aware of file system structures
  - Inject faults across all block types
[Figure: disk layout showing superblock, inode, and data blocks]
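A space-aware injector can be pictured as a thin shim below the file system that classifies each request by block type and fails only the type under test. The sketch below is illustrative; the block-type classifier is a stub with a made-up layout, whereas the real injector understands the file system's actual on-disk format:

```c
/* Sketch of space-aware fault injection: a thin layer below the file
 * system classifies each request by on-disk block type and fails only
 * the type under test. The classifier is a stub with an invented layout. */
#include <errno.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

enum block_type { BT_SUPER, BT_INODE, BT_DATA };

static enum block_type target = BT_INODE;    /* block type being tested */

/* Stub classifier: map a block number to its type (invented layout). */
static enum block_type classify(off_t blkno)
{
    if (blkno == 0)  return BT_SUPER;
    if (blkno < 64)  return BT_INODE;
    return BT_DATA;
}

/* Wrapper the file system's reads are routed through during testing. */
ssize_t injected_pread(int fd, void *buf, size_t len, off_t off)
{
    if (classify(off / BLOCK_SIZE) == target) {
        errno = EIO;                         /* inject a read error */
        return -1;
    }
    return pread(fd, buf, len, off);         /* otherwise pass through */
}
```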
31
Time Awareness
- Time is key to testing as well
  - e.g., the update sequence
- Idea: Build a model of file system I/O activity
  - Use the model to induce faults at crucial times
  - Don't miss interesting behaviors
[Figure: write sequence for data journaling (simplified); J: journal, C: commit, K: checkpoint, S: superblock]
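Time awareness can similarly be sketched as a small state machine that watches the write stream and injects a fault at a crucial point, here at the commit-block write of a journaling update. The journal location and commit-block convention below are invented for illustration:

```c
/* Sketch of time-aware injection for data journaling: watch the write
 * stream, count journal writes, and fail the commit-block write, one
 * of the crucial points in the update sequence. The journal location
 * and commit-block convention are invented for illustration. */
#include <errno.h>
#include <unistd.h>

#define BLOCK_SIZE    4096
#define JOURNAL_START 1024          /* made-up journal location */
#define JOURNAL_END   2048
#define COMMIT_BLOCK  (JOURNAL_END - 1)

enum wtype { W_JOURNAL, W_COMMIT, W_SUPER, W_OTHER };

/* Stub classifier for outgoing writes (invented layout). */
static enum wtype classify_write(off_t off)
{
    off_t b = off / BLOCK_SIZE;
    if (b == COMMIT_BLOCK)                     return W_COMMIT;
    if (b >= JOURNAL_START && b < JOURNAL_END) return W_JOURNAL;
    if (b == 0)                                return W_SUPER;
    return W_OTHER;
}

static int journal_writes_seen = 0;

/* Wrapper the file system's writes are routed through during testing. */
ssize_t injected_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    enum wtype t = classify_write(off);

    if (t == W_JOURNAL)
        journal_writes_seen++;
    else if (t == W_COMMIT && journal_writes_seen > 0) {
        errno = EIO;               /* fail exactly at the commit write */
        return -1;
    }
    return pwrite(fd, buf, len, off);
}
```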
32
Making It Comprehensive
- Workloads: Exercise as much of the file system as possible
- Two types of workloads
  - Singlets: Stress a single system call (open, lstat, rename, symlink, write, etc.)
  - Generics: Stress common functionality (path traversal, recovery, log writes, etc.)
33
Injecting Faults
- Disk-level: Hard to do -> it's hardware
- Software approach: Easy and desirable
- Fail-partial faults injected:
  - Read and write errors
  - Read corruption
[Figure: the storage stack with a software fault injector inserted beneath the file system]
34
The File Systems We Tested
- Linux ext3: Popular, simple, compatible Linux file system
- Linux ReiserFS: Scalable, "database-like" file system
- Linux IBM JFS: Big Blue's classic journaling file system
- Windows NTFS: Yes, a non-Linux file system
35
Result Matrix
[Result matrix: workloads (rows) crossed with on-disk data structures (columns), e.g. read() against an inode; each cell records the observed technique: Zero, Stop, Propagate, Retry, Redundancy, or N/A]
36
Read Errors: Recovery
- Ext3: Stop and propagate (don't tolerate transience)
- ReiserFS: Mostly propagate
- JFS: Stop, propagate, retry
- All: Some cases missed
[Result matrix: recovery techniques (Zero, Stop, Propagate, Retry, Redundancy) observed for ext3, ReiserFS, and JFS]
37
Write Errors: Recovery
- Ext3/JFS: Ignore write faults
  - No detection -> no recovery
  - Can corrupt the entire volume
- ReiserFS: Always calls panic
  - Exception: indirect blocks
[Result matrix: recovery techniques (Zero, Stop, Propagate, Retry, Redundancy) observed for ext3, ReiserFS, and JFS]
38
Corruption: Recovery
- Ext3/ReiserFS/JFS: Some sanity checking used
- Stop/Propagate common
- Sanity checking is not enough
[Result matrix: recovery techniques (Zero, Stop, Propagate, Retry, Redundancy) observed for ext3, ReiserFS, and JFS]
39
File System Specific Results
- Ext3: Overall simplicity
  - Checks error codes, modest sanity checking, propagates errors, aborts the operation
  - Overreacts on read errors -> halts instead of propagating
  - But some write errors are ignored
- ReiserFS: First, do no harm
  - At the slightest sign of failure, panic() the file system
  - Preserves integrity; overreacts to transients
- IBM JFS: The kitchen sink
  - Uses the broadest range of techniques
- Windows NTFS: Persistence is a virtue
  - Liberal retry (understands disks can be flaky)
40
General Results (1 of 3)
- Illogical inconsistency is common
  - Similar faults -> different reactions (e.g., a JFS failed read of the superblock)
- Bugs are common
  - Code not stress-tested enough? (e.g., the ReiserFS indirect-block code paths)
- Error codes are sometimes ignored
  - Highly surprising: Easiest to detect (but sometimes hard to act upon)
41
General Results (2 of 3)
- Sanity checking is of limited utility
  - Doesn't help if you read the right type but the wrong block
  - Hard to do for some structures (e.g., bitmaps)
- Stop is useful (if used correctly)
  - ReiserFS halts on write errors
  - Ext3 tries to do this (but aborts too late)
- Stop should not be overused
  - Faults can be transient
  - Faults can be sticky, too!
42
General Results (3 of 3)
- Retry is underutilized
  - JFS does it some, NTFS quite a bit
  - But transient faults occur
- Automatic repair is rare
  - Almost all "stop" actions involve administrator intervention/repair (running fsck, reboot, etc.)
- Redundancy is rarely used
  - Only superblocks are replicated, and only sometimes
43
Towards an IRON File System
44
IRON ext3: ixt3
- Prototype of an IRON file system
  - First cut: Many other possibilities still exist
- Start with Linux ext3
  - Add checksums: To detect corruption
  - Add replication: For important structures (e.g., meta-data)
  - Add parity: For user data
- Result: IRON ext3 (ixt3)
45
Ixt3 Implementation
- Checksums: Initially written to the ext3 log, then checkpointed to their final location
- Meta-data replicas: Written to a replica log, checkpointed later to their final on-disk location
- Parity protection for data: One parity block per file, with an extra pointer in the inode
- Performance issues:
  - Space overhead: Low
  - Time overhead?
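The per-file parity idea can be illustrated with plain XOR parity: the parity block is the XOR of the file's data blocks, so any single lost block can be rebuilt from the parity block and the survivors. This is a sketch of the technique, not ixt3's actual code:

```c
/* Sketch of one-parity-block-per-file protection: the parity block is
 * the XOR of all the file's data blocks, so a single lost data block
 * can be rebuilt by XORing the parity block with the surviving blocks. */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Recompute a file's parity block from its data blocks. */
void compute_parity(uint8_t (*data)[BLOCK_SIZE], int nblocks,
                    uint8_t parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int b = 0; b < nblocks; b++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[b][i];
}

/* Rebuild one lost block: XOR the parity with every surviving block. */
void rebuild_block(uint8_t (*data)[BLOCK_SIZE], int nblocks, int lost,
                   const uint8_t parity[BLOCK_SIZE],
                   uint8_t out[BLOCK_SIZE])
{
    memcpy(out, parity, BLOCK_SIZE);
    for (int b = 0; b < nblocks; b++)
        if (b != lost)
            for (int i = 0; i < BLOCK_SIZE; i++)
                out[i] ^= data[b][i];
}
```

The space overhead is one block per file, which matches the "space overhead: low" claim above; the cost shows up in the time overhead, since the parity block must be updated on writes.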
46
Ixt3 Performance Evaluation
- For "home use" or read-mostly workloads: No overhead
- Has a cost for write-intensive workloads

Overhead (1.00 = no slowdown):

  Workload     Metadata   Data   Both
  SSH-Build      1.00     1.00   1.00
  Web Server     1.00     1.00   1.00
  PostMark       1.19     1.13   1.37
  TPC-B          1.20     1.10   1.42
47
Wrapping Up
48
Summary
- File systems are important
  - Used everywhere, in many different ways
- Disks fail in interesting ways
  - New model: the fail-partial failure model
- Local file systems: Not ready for local faults
  - Illogical inconsistencies, bugs, and little recovery
- Need: IRON file systems
  - Ixt3: Low-cost protection from partial failures
49
Challenges and Directions
- Need to rethink how we build file systems
  - Performance policy isn't the only policy
  - Fault-handling policy is critical
- Testing and beyond testing
  - Failure handling must be tested (continuously?)
  - Beyond testing: Code analysis too?
- Guiding principles
  - Lessons from networking
  - Put simply: Don't trust the disk
50
ADvanced Systems Lab (ADSL) www.cs.wisc.edu/adsl
51
ADvanced Systems Lab (ADSL)
Who did the real work: Nitin Agrawal, Lakshmi Bairavasundaram, Haryadi Gunawi, Vijayan Prabhakaran
52
Backup Slides
53
Read Errors: Detection Techniques
- Across all three file systems: Error codes are checked for read errors (rarely ignored)
[Result matrix: detection techniques (Zero, Error Code, Sanity) observed for ext3, ReiserFS, and JFS]
54
Write Errors: Detection Techniques
- Ext3 and JFS ignore write errors!
  - Either ignored altogether or not used meaningfully
- ReiserFS: Much more careful
[Result matrix: detection techniques (Zero, Error Code, Sanity) observed for ext3, ReiserFS, and JFS]
55
Corruption: Detection Techniques
- Sanity checking is used across all three file systems
- Sanity checking is not sufficient
  - e.g., when you read a block of a similar type
[Result matrix: detection techniques (Zero, Error Code, Sanity) observed for ext3, ReiserFS, and JFS]
56
File Systems: The Manager of Your Data
57
Why File Systems Are Important
- The file system: The manager of "most" data
  - Consists of named files: A linear array of bytes
  - Organized in directories: /this/is/my/file
  - Access methods: open(), read(), write(), close()
- Where we use them: Everywhere
  - Home use: Photos, tax returns, home movies
  - Servers: Network file servers, the Google search engine
- Why we use them: Simple, convenient
  - Good performance: Subject of much research
  - Reliable? Depends on how disks fail...
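For completeness, the access methods listed above in a minimal, self-contained example (the path is a placeholder):

```c
/* Minimal use of the file system API: create a file, write to it,
 * read it back, and close it. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/example.txt";   /* placeholder path */
    const char *msg  = "hello, file system\n";
    char buf[64];

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, msg, strlen(msg)) < 0) perror("write");
    close(fd);

    fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n < 0) { perror("read"); n = 0; }
    buf[n] = '\0';
    printf("%s", buf);
    close(fd);
    return 0;
}
```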
58
File System Background
- Meta-data: Structures the file system uses to track what it needs to track
  - Superblock: File-system wide parameters
  - Inodes: Information about a file
- Data: Blocks to hold user data
[Figure: disk layout showing the superblock, inode blocks, and data blocks]
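The terms above can be made concrete with illustrative (deliberately simplified, not ext3's real) on-disk structures:

```c
/* Illustrative on-disk structures matching the terms above: a
 * superblock with file-system-wide parameters and an inode describing
 * one file. Field choices are simplified for illustration only. */
#include <stdint.h>

struct superblock {
    uint32_t magic;          /* identifies the file system type */
    uint32_t block_size;     /* file-system-wide parameters */
    uint64_t total_blocks;
    uint64_t free_blocks;
    uint64_t inode_count;
};

struct inode {
    uint32_t mode;           /* file type and permissions */
    uint32_t uid, gid;       /* owner */
    uint64_t size;           /* length of the file in bytes */
    uint64_t mtime;          /* last modification time */
    uint64_t blocks[12];     /* direct pointers to data blocks */
    uint64_t indirect;       /* pointer to a block of more pointers */
};
```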