
1 Swaminathan Sundararaman, Sriram Subramanian, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Michael M. Swift

2
- Applications require data
  - Use the FS to reliably store data
- Both hardware and software can fail
- Typical solution:
  - Large clusters for availability
  - Reliability through replication
[Figure: GFS-style cluster with a master node and slave nodes]

3
- Replication is infeasible for desktop environments
- Wouldn't RAID work?
  - It can only tolerate hardware failures
- FS crashes are more severe:
  - Services/applications are killed, requiring an OS reboot and recovery
- Need: better reliability in the event of file-system failures
[Figure: an application on an OS, with the FS sitting above a RAID controller and its disks]

4
- Motivation
- Background
- Restartable file systems
- Advantages and limitations
- Conclusions

5
- Exception paths are not tested thoroughly
  - Exceptions: failed I/O, bad arguments, null pointers
- On errors: call panic, BUG, or BUG_ON
- After a failure: data becomes inaccessible
- Reasons for missing recovery code:
  - Hard to apply corrective measures
  - Not straightforward to add recovery

6
Example from ReiserFS:

    int journal_mark_dirty(....)
    {
        struct reiserfs_journal_cnode *cn = NULL;
        if (!cn) {
            cn = get_cnode(p_s_sb);
            if (!cn) {
                reiserfs_panic(p_s_sb, "get_cnode failed!\n");
            }
        }
    }

    void reiserfs_panic(struct super_block *sb, ...)
    {
        BUG();
        /* this is not actually called, but makes reiserfs_panic() "noreturn" */
        panic("REISERFS: panic %s\n", error_buf);
    }

File systems already detect failures. Recovery: simplified by a generic recovery mechanism.

7
1. Write code to recover from all failures
   - Not feasible in reality
2. Restart on failure
   - Previous work has taken this approach

FSes need stateful & lightweight recovery:

                Heavyweight                           Lightweight
  Stateless     Nooks/Shadow, Xen, Minix, L4, Nexus   SafeDrive, Singularity
  Stateful      CuriOS, EROS                          (what FSes need)

8
Goal: build a lightweight & stateful solution to tolerate file-system failures
Solution: a single generic recovery mechanism for any file-system failure
1. Detect failures through assertions
2. Clean up resources used by the file system
3. Restore the file-system state before the crash
4. Continue to service new file-system requests
FS failures: completely transparent to applications
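To make step 1 above concrete, here is a minimal user-space sketch of turning a failed assertion into a recovery event rather than a kernel panic. The FS_ASSERT macro and the setjmp-based unwinding are illustrative stand-ins for the in-kernel mechanism, not the actual implementation:

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf recovery_point;

    /* Instead of panicking, a failed check unwinds to the recovery logic. */
    #define FS_ASSERT(cond)                                       \
        do {                                                      \
            if (!(cond)) {                                        \
                fprintf(stderr, "fault detected: %s\n", #cond);   \
                longjmp(recovery_point, 1);                       \
            }                                                     \
        } while (0)

    static void handle_request(const char *path)
    {
        FS_ASSERT(path != NULL);   /* fail-stop: a bug becomes a recovery event */
        /* ... normal request processing ... */
    }

    int main(void)
    {
        if (setjmp(recovery_point) == 0) {
            handle_request(NULL);  /* simulated fault: trips the assertion */
        } else {
            /* steps 2-4: clean up, restore last checkpoint, replay, resume */
            puts("recovering file system state");
        }
        return 0;
    }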

9
- Transparency
  - Multiple applications using the FS upon crash
  - Intertwined execution
- Fault-tolerance
  - Handle a gamut of failures
  - Transform them into fail-stop failures
- Consistency
  - The OS and FS could be left in an inconsistent state

10
FS consistency is required to prevent data loss
- Not all FSes support crash-consistency
- FS state is constantly modified by applications
- Periodically checkpoint FS state
  - Mark dirty blocks as copy-on-write
  - Ensure each checkpoint is atomically written
- On crash: revert back to the last checkpoint
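A rough user-space model of the checkpointing idea, assuming a simple per-block copy-on-write scheme; the struct layout and function names are invented for illustration, and the atomic commit of a whole checkpoint across blocks is elided:

    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    struct block {
        unsigned char *data;        /* current, mutable contents */
        unsigned char *checkpoint;  /* frozen image from the last checkpoint */
        int cow;                    /* set at checkpoint time: copy before write */
    };

    static void take_checkpoint(struct block *b)
    {
        b->cow = 1;                 /* next write must preserve old contents */
    }

    static void write_block(struct block *b, const void *buf)
    {
        if (b->cow) {               /* first write since the checkpoint */
            unsigned char *copy = malloc(BLOCK_SIZE);
            if (copy != NULL) {
                memcpy(copy, b->data, BLOCK_SIZE);
                free(b->checkpoint);
                b->checkpoint = copy;
                b->cow = 0;
            }
        }
        memcpy(b->data, buf, BLOCK_SIZE);
    }

    static void revert_to_checkpoint(struct block *b)
    {
        if (b->checkpoint)          /* on crash: roll back the epoch's writes */
            memcpy(b->data, b->checkpoint, BLOCK_SIZE);
    }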

11
[Figure: timeline of application requests (open, write, read, close) passing through the VFS to the file system, divided into Epoch 0 and Epoch 1 by a checkpoint; completed and in-progress operations are marked, with a crash in Epoch 1]
1. Periodically create checkpoints
2. File system crash
3. Unwind in-flight processes
4. Move to the most recent checkpoint
5. Replay completed operations
6. Re-execute unwound processes

12
- File systems are constantly modified
  - Hard to identify a consistent recovery point
- Naïve solution: prevent any new FS operation and call sync
  - Inefficient and unacceptable overhead

13
[Figure: an application above the VFS layer; below it, file systems (ext3, VFAT) and the page cache, with the disk at the bottom]
- All requests go through the VFS layer
- File systems write to disk through the page cache
- Control requests to the FS and dirty pages to the disk
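Because every request already crosses the VFS boundary, that is a natural place to gate traffic while a crashed file system recovers. A hedged pthreads sketch (fs_instance, vfs_enter, and recovery_complete are hypothetical names):

    #include <pthread.h>

    enum fs_state { FS_RUNNING, FS_RECOVERING };

    struct fs_instance {
        enum fs_state   state;
        pthread_mutex_t lock;
        pthread_cond_t  ready;      /* signaled when recovery finishes */
    };

    /* Called at the top of every VFS entry point for this file system. */
    static void vfs_enter(struct fs_instance *fs)
    {
        pthread_mutex_lock(&fs->lock);
        while (fs->state == FS_RECOVERING)   /* park callers during recovery */
            pthread_cond_wait(&fs->ready, &fs->lock);
        pthread_mutex_unlock(&fs->lock);
    }

    /* Called once the crashed file system has been restored. */
    static void recovery_complete(struct fs_instance *fs)
    {
        pthread_mutex_lock(&fs->lock);
        fs->state = FS_RUNNING;
        pthread_cond_broadcast(&fs->ready);  /* release all parked requests */
        pthread_mutex_unlock(&fs->lock);
    }

The same interposition point is what lets the system stop dirty-page traffic to disk during a checkpoint, as the next slide's figure shows.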

14
[Figure: three panels (Regular, At Checkpoint, After Checkpoint) showing Membrane interposed at the VFS layer; at a checkpoint it stops new requests to the file system and controls when dirty pages in the page cache reach the disk]

15
- Journaling or snapshotting file systems have a built-in crash-consistency mechanism
- Seamlessly integrate with these mechanisms
  - Need FSes to indicate the beginning and end of a transaction
- Works for data and ordered journaling modes
- Need to combine writeback mode with COW
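The only FS-side change this slide asks for is marking transaction boundaries. A sketch of what such hooks might look like; the membrane_txn_* names are invented here, not the paper's actual interface:

    struct super_block;                      /* opaque kernel type */

    static void membrane_txn_begin(struct super_block *sb)
    {
        (void)sb;   /* recovery layer: defer checkpoints until txn_end */
    }

    static void membrane_txn_end(struct super_block *sb)
    {
        (void)sb;   /* safe point: a checkpoint may now be taken */
    }

    /* A journaling FS would bracket its commit path with the hooks: */
    static void fs_commit_transaction(struct super_block *sb)
    {
        membrane_txn_begin(sb);
        /* ... write journaled blocks, then the commit record ... */
        membrane_txn_end(sb);
    }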

16
- Log operations at the VFS level
  - Need not modify existing file systems
- Operations: open, close, read, write, symlink, unlink, seek, etc.
- Read:
- Logs are thrown away after each checkpoint
- What about logging writes?
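One way to picture a VFS-level op log entry; this record layout is invented for illustration and is not the paper's actual data structure:

    #include <stdint.h>
    #include <sys/types.h>

    enum op_type { OP_OPEN, OP_CLOSE, OP_READ, OP_WRITE,
                   OP_SYMLINK, OP_UNLINK, OP_SEEK };

    struct op_record {
        enum op_type type;
        uint64_t     seq;       /* replay order within the current epoch */
        int          fd;        /* which open file the op applied to */
        off_t        offset;    /* for read/write/seek */
        size_t       count;     /* for writes: length; the payload itself can
                                 * be recovered from the page cache (next slide) */
    };

    /* Entries only need to live until the next checkpoint commits; the
     * whole log for a finished epoch is then discarded. */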

17
- Mainly used for replaying writes
- Goal: reduce the overhead of logging writes
- Solution: grab data from the page cache during recovery
[Figure: three panels (Before Crash, During Recovery, After Recovery) tracing write(fd, buf, offset, count) through the VFS, file system, and page cache]
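A user-space model of this "page stealing": instead of copying write payloads into the op log, recovery re-reads them from page-cache pages that survived the crash. The page_cache_lookup function is hypothetical:

    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    struct page { unsigned char data[PAGE_SIZE]; };

    /* Hypothetical lookup into the page cache, which survives the FS crash. */
    struct page *page_cache_lookup(unsigned long inode, unsigned long index);

    /* During replay, rebuild a logged write's payload from the cached page
     * instead of from a copy stored in the op log. */
    static int replay_write(unsigned long inode, unsigned long index,
                            unsigned char *buf, size_t count)
    {
        struct page *p = page_cache_lookup(inode, index);
        if (p == NULL)
            return -1;          /* page unavailable; cannot replay from cache */
        memcpy(buf, p->data, count < PAGE_SIZE ? count : PAGE_SIZE);
        return 0;
    }

The saving is that write data is never duplicated in the log; the page cache itself acts as the log's data store until the next checkpoint.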


20
- Setup


23
- Restart ext2 during a random-read microbenchmark

24
Data (MB)    Recovery time (ms)
10           12.9
20           13.2
40           16.1

25
- Improves tolerance to file-system failures
  - Builds trust in new file systems (e.g., ext4, btrfs)
- Quick-fix bug patching
  - Developers can transform corruptions into restarts
  - Restart instead of extensive code restructuring
- Encourages more integrity checks in FS code
  - Assertions can be seamlessly transformed into restarts
- File systems become more robust to failures/crashes

26
- Only tolerates fail-stop failures
- Not address-space based
  - Faults could corrupt other kernel components
- FS restart may be visible to applications
  - e.g., inode numbers could change after a restart
[Figure: before the crash, create("file1") returns inode# 15, which the application caches; after crash recovery the create is re-executed and "file1" gets inode# 12, so a later stat("file1") sees an inode# mismatch]

27
- Failures are inevitable in file systems
  - Learn to cope with them, not hope to avoid them
- A generic recovery mechanism for FS failures
  - Improves FS reliability and the availability of data
- Users: install new FSes with confidence
- Developers: ship FSes faster, as not all exception cases are now show-stoppers

28
Questions and Comments
Advanced Systems Lab (ADSL)
University of Wisconsin-Madison
http://www.cs.wisc.edu/adsl

