1
IRON File Systems
Remzi Arpaci-Dusseau
University of Wisconsin, Madison
2
Understanding How Things Fail Is Important
3
How Disks Fail
4
Classic Failure Model: "Fail Stop"
- As defined [Schneider '90]:
  - Stop: Upon failure, halt
  - Make known: But first, switch to a state such that other components can detect that you have failed
- Very simple model of disk failure
- Used by all early file and storage systems (once controllers could detect failure)
- But is it realistic?
5
Assertion: Modern Disks Are Not Whole-Disk Fail Stop
6
Real Failures
- Latent sector errors [Kari '93, Bairavasundaram '07]
  - A block or set of blocks becomes inaccessible
- Data corruption [Weinberg '04, Greene '05, Bairavasundaram '08]
  - Controller bugs, not bit rot
- Transient errors too [Talagala '99]
  - Bus stuttering, etc.
- Result: Partial failures are a reality
7
So What Should We Do?
8
High-end Systems: Extra Measures
- Disk scrubbing [Kari '93]
  - Proactively scan drives in search of latent errors
  - When an error is detected, correct it from a redundant copy on another disk
- Extra redundancy [Corbett '04]
  - RAID systems with two parity disks
- Checksums [Bartlett '04, Weinberg '04]
  - Extra computation over data
  - Guard against corruption
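The scrubbing loop described above can be pictured with a small sketch. This is a minimal illustration under assumptions of my own (a mirrored pair of devices, a fixed 4 KB block size, and a made-up reconstruct_from_mirror() helper), not code from any real scrubber:

```c
/* Minimal disk-scrubbing sketch (illustrative only).
 * Assumes a mirrored pair of devices and a fixed block size;
 * device names are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

/* Hypothetical helper: rewrite a bad block from its mirror copy. */
static int reconstruct_from_mirror(int bad_fd, int mirror_fd, off_t off)
{
    char buf[BLOCK_SIZE];
    if (pread(mirror_fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE)
        return -1;                      /* both copies lost: report upward */
    return pwrite(bad_fd, buf, BLOCK_SIZE, off) == BLOCK_SIZE ? 0 : -1;
}

int main(void)
{
    int fd     = open("/dev/sdX", O_RDWR);    /* primary (placeholder) */
    int mirror = open("/dev/sdY", O_RDONLY);  /* mirror  (placeholder) */
    if (fd < 0 || mirror < 0) { perror("open"); return 1; }

    char buf[BLOCK_SIZE];
    off_t nblocks = lseek(fd, 0, SEEK_END) / BLOCK_SIZE;

    /* Proactively read every block, looking for latent sector errors. */
    for (off_t b = 0; b < nblocks; b++) {
        off_t off = b * BLOCK_SIZE;
        if (pread(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE) {
            fprintf(stderr, "scrub: block %lld unreadable, repairing\n",
                    (long long)b);
            if (reconstruct_from_mirror(fd, mirror, off) != 0)
                fprintf(stderr, "scrub: block %lld lost\n", (long long)b);
        }
    }
    close(fd);
    close(mirror);
    return 0;
}
```

The key point is proactivity: the scrub finds a latent error while the redundant copy is still good, rather than discovering it later when the copy may also have failed.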
9
But What About Desktop File Systems?
10
Desktop FS's: Lost In The Past?
- Desktop file systems are important
  - Home use: Photos, movies, tax returns, ...
  - Cluster use too: GoogleFS is built on local file systems
- Performance policies are well known
  - e.g., the FFS placement policy
- But what is their fault-handling policy?
  - Do they handle partial disk failures?
  - How can we tell?
11
Two Questions
12
Questions I Will Answer
- Question 1: How do local file systems react to the more realistic set of disk failures?
- Question 2: How can we change file systems to better handle these types of faults?
13
How Disks Fail: The Details
14
The Storage Stack
- Not just the file system on top of the disk
- Many layers, lots of software, even within the disk!
- Failures occur at all levels
[Figure: the storage stack. Host side: file system, generic I/O, device driver, device controller, transport. Disk side: cache, firmware, electrical, mechanical, media.]
15
Latent Sector Errors
- Disks experience partial failures: "a small portion of data on disk becomes temporarily or permanently unavailable" [Corbett '04]
- Root causes: scratched surface, inaccurate arm movement, interconnect problems
- Bottom line: A single read or write can fail
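Since a single read can fail, the minimal defense is to check the return value of every disk request. A small sketch (the block size and messages are illustrative, not taken from any particular file system):

```c
/* Sketch: a single block read can fail, so every return value must be
 * checked (the simplest IRON detection technique: Error Code).
 * Block size and messages are illustrative. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int read_block(int fd, void *buf, off_t blkno)
{
    ssize_t n = pread(fd, buf, BLOCK_SIZE, blkno * BLOCK_SIZE);
    if (n == BLOCK_SIZE)
        return 0;                       /* the common, successful case */
    if (n < 0 && errno == EIO)
        fprintf(stderr, "I/O error (possible latent sector error) at block %lld\n",
                (long long)blkno);
    else
        fprintf(stderr, "failed or short read at block %lld: %s\n",
                (long long)blkno, n < 0 ? strerror(errno) : "short read");
    return -1;   /* caller decides: propagate, retry, or use redundancy */
}
```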
16
Data Corruption
- Sun's ZFS [Weinberg '04]
  - Misdirected writes: Right data, wrong location
  - Phantom/lost writes: "Yes, I wrote the data!" (but didn't)
- EIDE interface on motherboards [Greene '05]
  - Read reported as "done" when it was not (a race)
  - Similar problem at Google [Ghemawat '03]
- Network Appliance [Lewis '99]
  - Disk occasionally returns byte-shifted data
17
Transient Errors
- 18-month study of a large disk farm [Talagala '99]
  - Most machines had SCSI timeout errors (loose cables, bad cables?)
  - SCSI parity errors were common too (data corrupted while moving across the bus)
- Failures can be transient too
  - Might work if just retried
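Because such faults are often transient, a bounded retry loop is a natural recovery step. A hedged sketch (the retry count and back-off delay are arbitrary choices, not values from the talk):

```c
/* Sketch of the Retry idea: transient faults (timeouts, bus parity
 * errors) may succeed on a second attempt, so retry a few times before
 * giving up. The retry count and delay are arbitrary. */
#include <unistd.h>

#define BLOCK_SIZE  4096
#define MAX_RETRIES 3

int read_block_retry(int fd, void *buf, off_t blkno)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (pread(fd, buf, BLOCK_SIZE, blkno * BLOCK_SIZE) == BLOCK_SIZE)
            return 0;              /* success: the fault was transient */
        usleep(100 * 1000);        /* brief back-off before retrying */
    }
    return -1;                     /* sticky fault: escalate to the caller */
}
```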
18
Even Worse With ATA (Not SCSI)
- ATA drives: Less reliable [Anderson '03, Hughes & Murray '05]
  - Few are returned for "failure analysis"
  - Some are "partially flaw marked during testing"
  - Test conditions not as harsh (power, temp.)
  - High-end reliability features missing (filters: remove particles; chemicals: humidity)
- Cheap disks -> less testing -> less reliability
  - But cost drives many purchasing decisions...
19
Trend: More Problems, Not Less
- Denser drives: Capacity sells drives
  - More logic -> more complexity
  - More complexity -> more bugs
- Cost per byte dominates: "Pennies matter"
  - Manufacturers will cut corners
  - Reliability features are the first to go
- Increasing amount of software
  - ~400K lines of code in a modern Seagate drive
  - Hard to write, hard to debug
20
The Fail-Partial Failure Model
21
- Disk failure: The entire disk may fail
- Block failure: Part of the disk may fail
- Block corruption: Part of the disk may get corrupted
- All can be either transient or sticky
22
Important Parameters
- Locality: Are partial faults independent of each other?
- Frequency: How often do partial faults occur?
23
Frequency of Failures
- Study of latent sector errors [Bairavasundaram et al. '07]
  - 1.53 million disks, 3+ years of data
  - ATA: 8.5%; SCSI: 1.9%
  - Latent sector errors are not independent
  - Spatial locality exists; disk capacity matters
- Study of block corruption [Bairavasundaram et al. '08]
  - Same data set
  - ATA: 0.6%; SCSI: 0.06%
  - Corruptions within a disk are not independent
  - Spatial locality exists
  - The "bad block number" problem
24
How Do File Systems React To Partial Failures?
25
How To Detect & Handle Failures?
- Need: A classification of techniques
  - Detection: Discovering that a failure took place
  - Recovery: Recovering from the failure
- Detection + Recovery = IRON
  - File systems with Internal RObustNess
- IRON taxonomy: Classify techniques
26
IRON Detection Taxonomy
How to detect block failure or corruption? Possible strategies:
- Zero: No detection technique used
- Error Code: Check return codes from the disk
- Sanity: Check data structures for consistency
- Redundancy: Add checksums or other forms of computed replication to detect problems
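As an illustration of the Redundancy level of detection, a file system can store a checksum alongside each block and verify it on every read. The sketch below uses a simple Fletcher-style sum purely for illustration; a real system would likely use a stronger function and keep the checksums in its metadata:

```c
/* Sketch of Redundancy-based detection: store a checksum with each
 * block and verify it on every read. The Fletcher-style sum here is
 * purely illustrative. */
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

static uint32_t block_checksum(const uint8_t *data, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65535;
        b = (b + a) % 65535;
    }
    return (b << 16) | a;
}

/* Compare a freshly read block against the checksum recorded when it
 * was written; a mismatch means the block was silently corrupted. */
int block_is_corrupt(const uint8_t block[BLOCK_SIZE], uint32_t stored_csum)
{
    return block_checksum(block, BLOCK_SIZE) != stored_csum;
}
```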
27
IRON Recovery Taxonomy
How to recover from a detected failure? Possible strategies:
- Zero: Don't do anything
- Propagate: Pass the error on to a higher level
- Stop: Halt activity ("fail stop")
- Guess: Manufacture data and return it to the user
- Retry: Assume the failure is transient
- Repair: If an inconsistency is detected
- Remap: Redirect to another block
- Redundancy: Use another copy of the block
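One way to picture how these levels compose is a read path that tries Retry first, falls back to Redundancy (a replica), and only then Propagates the error. This is a sketch under made-up assumptions (a replica area kept at a fixed disk offset), not the policy of any file system studied here:

```c
/* Sketch of how recovery levels can compose on a failed read:
 * Retry first (the fault may be transient), then Redundancy (read a
 * replica kept at a fixed, made-up offset), and only then Propagate
 * the error to the caller. */
#include <errno.h>
#include <unistd.h>

#define BLOCK_SIZE   4096
#define MAX_RETRIES  3
#define REPLICA_BASE (1024LL * 1024 * 1024)  /* hypothetical replica area */

static int try_read(int fd, void *buf, off_t off)
{
    return pread(fd, buf, BLOCK_SIZE, off) == BLOCK_SIZE ? 0 : -1;
}

int robust_read(int fd, void *buf, off_t blkno)
{
    off_t off = blkno * BLOCK_SIZE;

    for (int i = 0; i < MAX_RETRIES; i++)            /* Retry */
        if (try_read(fd, buf, off) == 0)
            return 0;

    if (try_read(fd, buf, REPLICA_BASE + off) == 0)  /* Redundancy */
        return 0;

    return -EIO;                                     /* Propagate */
}
```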
28
What IRON Techniques Do Modern File Systems Use?
29
Fault Injection
- Typical fault injection:
  - Insert failures at random disk locations/times
  - Watch the system to see what happens
- Not good enough:
  - May miss interesting behavior
  - May find problems without explaining them
- What we do: Space- and time-aware injection
  - A "gray box" approach to testing
30
Space Awareness
- File systems are composed of many on-disk structures
  - e.g., superblocks, inodes, etc.
- Idea: Make the fault-injection layer aware of file system structures
  - Inject faults across all block types
[Figure: disk layout showing superblock, inode, and data blocks]
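A space-aware injector can be pictured as a thin shim below the file system that classifies each request by block type and fails only the type under test. The sketch below is illustrative; the block-type classifier is a stub with a made-up layout, whereas the real injector understands the file system's actual on-disk format:

```c
/* Sketch of space-aware fault injection: a thin layer below the file
 * system classifies each request by on-disk block type and fails only
 * the type under test. The classifier is a stub with an invented layout. */
#include <errno.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

enum block_type { BT_SUPER, BT_INODE, BT_DATA };

static enum block_type target = BT_INODE;    /* block type being tested */

/* Stub classifier: map a block number to its type (invented layout). */
static enum block_type classify(off_t blkno)
{
    if (blkno == 0)  return BT_SUPER;
    if (blkno < 64)  return BT_INODE;
    return BT_DATA;
}

/* Wrapper the file system's reads are routed through during testing. */
ssize_t injected_pread(int fd, void *buf, size_t len, off_t off)
{
    if (classify(off / BLOCK_SIZE) == target) {
        errno = EIO;                         /* inject a read error */
        return -1;
    }
    return pread(fd, buf, len, off);         /* otherwise pass through */
}
```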
31
Time Awareness
- Time is key to testing as well
  - e.g., the update sequence
- Idea: Build a model of file system I/O activity
  - Use the model to induce faults at crucial times
  - Don't miss interesting behaviors
[Figure: write sequence for data journaling (simplified); J: journal, C: commit, K: checkpoint, S: superblock]
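Time awareness can similarly be sketched as a small state machine that watches the write stream and injects a fault at a crucial point, here at the commit-block write of a journaling update. The journal location and commit-block convention below are invented for illustration:

```c
/* Sketch of time-aware injection for data journaling: watch the write
 * stream, count journal writes, and fail the commit-block write, one
 * of the crucial points in the update sequence. The journal location
 * and commit-block convention are invented for illustration. */
#include <errno.h>
#include <unistd.h>

#define BLOCK_SIZE    4096
#define JOURNAL_START 1024          /* made-up journal location */
#define JOURNAL_END   2048
#define COMMIT_BLOCK  (JOURNAL_END - 1)

enum wtype { W_JOURNAL, W_COMMIT, W_SUPER, W_OTHER };

/* Stub classifier for outgoing writes (invented layout). */
static enum wtype classify_write(off_t off)
{
    off_t b = off / BLOCK_SIZE;
    if (b == COMMIT_BLOCK)                     return W_COMMIT;
    if (b >= JOURNAL_START && b < JOURNAL_END) return W_JOURNAL;
    if (b == 0)                                return W_SUPER;
    return W_OTHER;
}

static int journal_writes_seen = 0;

/* Wrapper the file system's writes are routed through during testing. */
ssize_t injected_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    enum wtype t = classify_write(off);

    if (t == W_JOURNAL)
        journal_writes_seen++;
    else if (t == W_COMMIT && journal_writes_seen > 0) {
        errno = EIO;               /* fail exactly at the commit write */
        return -1;
    }
    return pwrite(fd, buf, len, off);
}
```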
32
Making It Comprehensive
- Workloads: Exercise as much of the file system as possible
- Two types of workloads
  - Singlets: Stress a single system call (open, lstat, rename, symlink, write, etc.)
  - Generics: Stress common functionality (path traversal, recovery, log writes, etc.)
33
Injecting Faults
- Disk-level: Hard to do -> it's hardware
- Software approach: Easy and desirable
- Fail-partial faults injected:
  - Read and write errors
  - Read corruption
[Figure: the storage stack with a software fault injector inserted beneath the file system]
34
The File Systems We Tested
- Linux ext3: Popular, simple, compatible Linux file system
- Linux ReiserFS: Scalable, "database-like" file system
- Linux IBM JFS: Big Blue's classic journaling file system
- Windows NTFS: Yes, a non-Linux file system
35
Result Matrix
[Result matrix: workloads (rows) crossed with on-disk data structures (columns), e.g. read() against an inode; each cell records the observed technique: Zero, Stop, Propagate, Retry, Redundancy, or N/A]
36
Read Errors: Recovery
- Ext3: Stop and propagate (don't tolerate transience)
- ReiserFS: Mostly propagate
- JFS: Stop, propagate, retry
- All: Some cases missed
[Result matrix: recovery techniques (Zero, Stop, Propagate, Retry, Redundancy) observed for ext3, ReiserFS, and JFS]
37
Write Errors: Recovery
- Ext3/JFS: Ignore write faults
  - No detection -> no recovery
  - Can corrupt the entire volume
- ReiserFS: Always calls panic
  - Exception: indirect blocks
[Result matrix: recovery techniques (Zero, Stop, Propagate, Retry, Redundancy) observed for ext3, ReiserFS, and JFS]
38
Corruption: Recovery
- Ext3/ReiserFS/JFS: Some sanity checking used
- Stop/Propagate common
- Sanity checking is not enough
[Result matrix: recovery techniques (Zero, Stop, Propagate, Retry, Redundancy) observed for ext3, ReiserFS, and JFS]
39
File System Specific Results
- Ext3: Overall simplicity
  - Checks error codes, modest sanity checking, propagates errors, aborts the operation
  - Overreacts on read errors -> halts instead of propagating
  - But some write errors are ignored
- ReiserFS: First, do no harm
  - At the slightest sign of failure, panic() the file system
  - Preserves integrity; overreacts to transients
- IBM JFS: The kitchen sink
  - Uses the broadest range of techniques
- Windows NTFS: Persistence is a virtue
  - Liberal retry (understands disks can be flaky)
40
General Results (1 of 3)
- Illogical inconsistency is common
  - Similar faults -> different reactions (e.g., a JFS failed read of the superblock)
- Bugs are common
  - Code not stress-tested enough? (e.g., the ReiserFS indirect-block code paths)
- Error codes are sometimes ignored
  - Highly surprising: Easiest to detect (but sometimes hard to act upon)
41
General Results (2 of 3)
- Sanity checking is of limited utility
  - Doesn't help if you read the right type but the wrong block
  - Hard to do for some structures (e.g., bitmaps)
- Stop is useful (if used correctly)
  - ReiserFS halts on write errors
  - Ext3 tries to do this (but aborts too late)
- Stop should not be overused
  - Faults can be transient
  - Faults can be sticky, too!
42
General Results (3 of 3)
- Retry is underutilized
  - JFS does it some, NTFS quite a bit
  - But transient faults occur
- Automatic repair is rare
  - Almost all "stop" actions involve administrator intervention/repair (running fsck, reboot, etc.)
- Redundancy is rarely used
  - Only superblocks are replicated, and only sometimes
43
Towards an IRON File System
44
IRON ext3: ixt3
- Prototype of an IRON file system
  - First cut: Many other possibilities still exist
- Start with Linux ext3
  - Add checksums: To detect corruption
  - Add replication: For important structures (e.g., meta-data)
  - Add parity: For user data
- Result: IRON ext3 (ixt3)
45
Ixt3 Implementation
- Checksums: Initially written to the ext3 log, then checkpointed to their final location
- Meta-data replicas: Written to a replica log, checkpointed later to their final on-disk location
- Parity protection for data: One parity block per file, with an extra pointer in the inode
- Performance issues:
  - Space overhead: Low
  - Time overhead?
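The per-file parity idea can be illustrated with plain XOR parity: the parity block is the XOR of the file's data blocks, so any single lost block can be rebuilt from the parity block and the survivors. This is a sketch of the technique, not ixt3's actual code:

```c
/* Sketch of one-parity-block-per-file protection: the parity block is
 * the XOR of all the file's data blocks, so a single lost data block
 * can be rebuilt by XORing the parity block with the surviving blocks. */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Recompute a file's parity block from its data blocks. */
void compute_parity(uint8_t (*data)[BLOCK_SIZE], int nblocks,
                    uint8_t parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int b = 0; b < nblocks; b++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[b][i];
}

/* Rebuild one lost block: XOR the parity with every surviving block. */
void rebuild_block(uint8_t (*data)[BLOCK_SIZE], int nblocks, int lost,
                   const uint8_t parity[BLOCK_SIZE],
                   uint8_t out[BLOCK_SIZE])
{
    memcpy(out, parity, BLOCK_SIZE);
    for (int b = 0; b < nblocks; b++)
        if (b != lost)
            for (int i = 0; i < BLOCK_SIZE; i++)
                out[i] ^= data[b][i];
}
```

The space overhead is one block per file, which matches the "space overhead: low" claim above; the cost shows up in the time overhead, since the parity block must be updated on writes.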
46
Ixt3 Performance Evaluation
- For "home use" or read-mostly workloads: No overhead
- Has a cost for write-intensive workloads

Overhead (1.00 = no slowdown):

  Workload     Metadata   Data   Both
  SSH-Build      1.00     1.00   1.00
  Web Server     1.00     1.00   1.00
  PostMark       1.19     1.13   1.37
  TPC-B          1.20     1.10   1.42
47
Wrapping Up
48
Summary
- File systems are important
  - Used everywhere, in many different ways
- Disks fail in interesting ways
  - New model: the fail-partial failure model
- Local file systems: Not ready for local faults
  - Illogical inconsistencies, bugs, and little recovery
- Need: IRON file systems
  - Ixt3: Low-cost protection from partial failures
49
Challenges and Directions
- Need to rethink how we build file systems
  - Performance policy isn't the only policy
  - Fault-handling policy is critical
- Testing and beyond testing
  - Failure handling must be tested (continuously?)
  - Beyond testing: Code analysis too?
- Guiding principles
  - Lessons from networking
  - Put simply: Don't trust the disk
50
ADvanced Systems Lab (ADSL) www.cs.wisc.edu/adsl
51
ADvanced Systems Lab (ADSL)
Who did the real work: Nitin Agrawal, Lakshmi Bairavasundaram, Haryadi Gunawi, Vijayan Prabhakaran
52
Backup Slides
53
Read Errors: Detection Techniques
- Across all three file systems: Error codes are checked for read errors (rarely ignored)
[Result matrix: detection techniques (Zero, Error Code, Sanity) observed for ext3, ReiserFS, and JFS]
54
Write Errors: Detection Techniques
- Ext3 and JFS ignore write errors!
  - Either ignored altogether or not used meaningfully
- ReiserFS: Much more careful
[Result matrix: detection techniques (Zero, Error Code, Sanity) observed for ext3, ReiserFS, and JFS]
55
Corruption: Detection Techniques
- Sanity checking is used across all three file systems
- Sanity checking is not sufficient
  - e.g., when you read a block of a similar type
[Result matrix: detection techniques (Zero, Error Code, Sanity) observed for ext3, ReiserFS, and JFS]
56
File Systems: The Manager of Your Data
57
Why File Systems Are Important
- The file system: The manager of "most" data
  - Consists of named files: A linear array of bytes
  - Organized in directories: /this/is/my/file
  - Access methods: open(), read(), write(), close()
- Where we use them: Everywhere
  - Home use: Photos, tax returns, home movies
  - Servers: Network file servers, the Google search engine
- Why we use them: Simple, convenient
  - Good performance: Subject of much research
  - Reliable? Depends on how disks fail...
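For completeness, the access methods listed above in a minimal, self-contained example (the path is a placeholder):

```c
/* Minimal use of the file system API: create a file, write to it,
 * read it back, and close it. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/example.txt";   /* placeholder path */
    const char *msg  = "hello, file system\n";
    char buf[64];

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, msg, strlen(msg)) < 0) perror("write");
    close(fd);

    fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n < 0) { perror("read"); n = 0; }
    buf[n] = '\0';
    printf("%s", buf);
    close(fd);
    return 0;
}
```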
58
File System Background
- Meta-data: Structures the file system uses to track what it needs to track
  - Superblock: File-system wide parameters
  - Inodes: Information about a file
- Data: Blocks to hold user data
[Figure: disk layout showing the superblock, inode blocks, and data blocks]
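The terms above can be made concrete with illustrative (deliberately simplified, not ext3's real) on-disk structures:

```c
/* Illustrative on-disk structures matching the terms above: a
 * superblock with file-system-wide parameters and an inode describing
 * one file. Field choices are simplified for illustration only. */
#include <stdint.h>

struct superblock {
    uint32_t magic;          /* identifies the file system type */
    uint32_t block_size;     /* file-system-wide parameters */
    uint64_t total_blocks;
    uint64_t free_blocks;
    uint64_t inode_count;
};

struct inode {
    uint32_t mode;           /* file type and permissions */
    uint32_t uid, gid;       /* owner */
    uint64_t size;           /* length of the file in bytes */
    uint64_t mtime;          /* last modification time */
    uint64_t blocks[12];     /* direct pointers to data blocks */
    uint64_t indirect;       /* pointer to a block of more pointers */
};
```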