File System Consistency
Outline
Overview of file systems
File system design
Sharing files
Unix file system
Consistency and crash recovery
Journaling file systems
Log-structured file systems
Review
We saw that file systems use write-back caching to improve the performance of file writes. However, dirty blocks in memory are not saved to disk on a crash, i.e., they are lost. This can lead to file system inconsistency on disk. Let's look at this problem in more detail.
Understanding File System Inconsistency
File updates require several block write operations. E.g., deleting a Unix file involves at least two steps:
Step 1: Remove the file's directory entry in the directory data block.
Step 2: Mark the inode of the file as free in the inode bitmap, mark the file's blocks as free in the block bitmap, and update metadata in the inode of the directory (e.g., timestamp).
Imagine that a crash occurs after Step 1: the inode and file blocks remain allocated, hence a storage leak.
Understanding File System Inconsistency
What if we switch the order of the steps?
Step 2: Mark the inode of the file as free in the inode bitmap, mark the file's blocks as free in the block bitmap, and update metadata in the inode of the directory.
Step 1: Remove the file's directory entry in the directory data block.
Imagine that a crash occurs after Step 2: the directory entry is now a dangling pointer! It could point to freed blocks or to a new file, which is a very serious issue.
Reducing File System Inconsistency
Original Unix used two strategies to reduce inconsistency:
Write metadata blocks synchronously to disk, and write them in an order that avoids dangling pointers: in the previous example, Step 1 is performed before Step 2.
Write data blocks asynchronously (every 30 seconds). Most blocks are data blocks, so this improves file system performance. Data blocks can be lost, but that does not affect file system consistency!
[Figure: buffer cache (disk blocks in memory, indexed by device and block #); metadata blocks are written synchronously to disk, data blocks asynchronously.]
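The following is a minimal sketch, not real kernel code, of the ordering rule above: the directory entry is removed synchronously before the inode and block bitmaps are updated, so a crash in between only leaks storage and never leaves a dangling pointer. The toy "disk" dictionary, block names, and inode/block numbers are all assumptions made for illustration.

```python
# Hypothetical in-memory "disk": each key is a metadata block.
disk = {
    "dir_block":    {"file.txt": 7},     # directory entry -> inode number
    "inode_bitmap": {7: 1},              # 1 = allocated
    "block_bitmap": {100: 1, 101: 1},    # data blocks of inode 7
}

def sync_write(block_name, new_contents):
    """Pretend synchronous write: the block is on 'disk' when this returns."""
    disk[block_name] = new_contents

def delete_file(name, inode_no, data_blocks):
    # Step 1: remove the directory entry first (synchronously).
    d = dict(disk["dir_block"]); d.pop(name)
    sync_write("dir_block", d)
    # --- a crash here only leaks inode 7 and blocks 100/101 (no dangling pointer) ---
    # Step 2: free the inode and its data blocks.
    im = dict(disk["inode_bitmap"]); im[inode_no] = 0
    sync_write("inode_bitmap", im)
    bm = dict(disk["block_bitmap"])
    for b in data_blocks:
        bm[b] = 0
    sync_write("block_bitmap", bm)

delete_file("file.txt", 7, [100, 101])
print(disk)
```

Reversing the two steps in this sketch would reproduce the dangling-pointer problem from the previous slide: the directory entry would survive a crash while the inode and blocks were already marked free.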
Crash Recovery
After a crash, all the metadata blocks of the last file system operation may not have reached disk, so the file system can be inconsistent. On reboot, we need to restore file system consistency; this is called crash recovery. It requires a full scan of the file system, checking that all file system data structures are consistent, or else recovering them to some consistent state. This takes a long time, and the time keeps getting longer because disk capacities are increasing faster than disk throughput.
Avoiding Crash Recovery
We have seen that crash recovery takes a long time. Can we avoid it altogether? The reason we need crash recovery is that only some of the metadata block writes of a file system operation reached disk before the crash. We could avoid this problem by using battery-backed RAM to ensure that all writes are completed: after a crash, we need enough power to write all dirty blocks to disk. Problems: we need to ensure that the battery is in good condition, and most systems don't have battery-backed RAM.
Failure Atomicity
If we can't complete all writes, what if we could revert all the writes of a partially completed file system operation? We would like to ensure failure atomicity: a file system operation either doesn't happen at all or happens completely, even in the presence of failures. Problem: once a block has been written, we cannot revert it to its previous contents. Idea: to revert block contents, always make a copy before overwriting a block.
Two Options for Reverting Operations: Undo Recovery
1. Copy the old version of Block B in memory to a spare Block S on disk.
2. Copy the new version of Block B from memory to Block B on disk.
3. Wait until all block writes (Steps 1 and 2) have reached disk (done).
4. Remove Block S.
On failure:
1. Before Step 2 starts: Block B has the old version, nothing to do.
2. After Step 2 starts, until Step 4 starts: revert Block B using Block S.
3. After Step 4 starts: Block B has the new version, nothing to do.
[Figure: Block B in memory (old→new); Step 1 copies the old version to spare Block S on disk, Step 2 writes the new version to Block B on disk.]
Note that after crash recovery, the block has the old contents if the crash happened before "done" (before Step 4 starts), and the new contents if the crash happened after "done" (after Step 4 starts). One challenge not discussed on this slide is how, after a failure, we know whether we reached "done"; we will see a solution when we talk about journaling file systems. Note that after a failure, in all cases, we do want to remove the spare Block S if it exists.
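A hedged sketch of undo recovery over a toy "disk" (a dict of block contents). The block names, the crash_after parameter that simulates a crash, and the helper names are invented for illustration.

```python
disk = {"B": "old", "S": None}        # S is the spare block
mem  = {"B_old": "old", "B_new": "new"}

def undo_update(crash_after=None):
    disk["S"] = mem["B_old"]          # Step 1: save old version to spare block S
    if crash_after == 1: return
    disk["B"] = mem["B_new"]          # Step 2: write new version to B
    if crash_after == 2: return
    # Step 3: "done" -- both writes are on disk
    disk["S"] = None                  # Step 4: remove the spare block

def undo_recover():
    # If S still exists, the operation may be incomplete: revert B from S.
    if disk["S"] is not None:
        disk["B"] = disk["S"]
    disk["S"] = None                  # always discard the spare block

undo_update(crash_after=2)            # crash before "done"
undo_recover()
print(disk["B"])                      # -> "old": the operation appears never to have happened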
Two Options for Reverting Operations: Redo Recovery
1. Copy the new version of Block B in memory to a spare Block S on disk.
2. Wait until the block write of Step 1 has reached disk (done).
3. Copy the new version of Block B in memory to Block B on disk.
4. Remove Block S.
On failure:
1. Before Step 3 starts: Block B has the old version, nothing to do.
2. After Step 3 starts, until Step 4 starts: update Block B using Block S.
3. After Step 4 starts: Block B has the new version, nothing to do.
[Figure: Block B in memory (old→new); Step 1 copies the new version to spare Block S on disk, Step 3 writes the new version to Block B on disk.]
Note that after crash recovery, the block has the old contents if the crash happened before "done" (before Step 3 starts), and the new contents if the crash happened after "done" (after Step 3 starts).
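A matching sketch for redo recovery, with the same invented structures: the new version goes to the spare block first, so after a crash the update can be finished (redone) from S. As the previous slide noted, a real system also needs a durable record of "done" to know that S is complete; that is what the commit record of a journaling file system provides.

```python
disk = {"B": "old", "S": None}
mem  = {"B_new": "new"}

def redo_update(crash_after=None):
    disk["S"] = mem["B_new"]          # Step 1: write new version to spare block S
    if crash_after == 1: return
    # Step 2: "done" -- the new version is safely on disk in S
    disk["B"] = mem["B_new"]          # Step 3: write new version to B
    if crash_after == 3: return
    disk["S"] = None                  # Step 4: remove the spare block

def redo_recover():
    # If a complete S exists, the operation committed: finish (redo) the update from S.
    if disk["S"] is not None:
        disk["B"] = disk["S"]
    disk["S"] = None

redo_update(crash_after=3)            # crash after "done"
redo_recover()
print(disk["B"])                      # -> "new": the operation appears to have completed
```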
Journaling File Systems
Journaling file systems use a technique called write-ahead logging to enable fast file system recovery after a crash. The write-ahead log is used to perform undo or redo recovery.
Journaling file system with redo logging: write the new versions of the blocks associated with a file system operation to a journal (circular log), and add a commit record to the journal when the operation is done (why?). The file system itself is then updated asynchronously. On a crash failure, copy the journal blocks to the file system for every transaction whose commit record is present. If a crash occurs during recovery, copy the journal blocks again; this requires idempotent operations (why?).
Note that both for undo and redo recovery, whether the operation is "done" needs to be recorded durably (on disk). Otherwise, on failure, we would not know whether the crash happened before or after "done", so we wouldn't know whether to do nothing (keep the old version) or update the blocks (install the new version). The commit block records that the operation is considered done. If we don't see it, we want to preserve the old file system state; if we see it, we want to ensure that the file system operation has succeeded. So the commit block is what enables us to provide failure atomicity.
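The following is a minimal sketch of redo-style journal replay under the description above. The journal and file system are toy dicts, and the record format is invented for illustration. Replay installs only transactions that have a commit record, and it is idempotent: running it twice leaves the file system in the same state, which is why a crash during recovery is harmless.

```python
journal = [
    {"txn": 6, "blocks": {10: "A'"},                           "committed": True},
    {"txn": 7, "blocks": {1000: "B'", 2000: "I'", 3000: "D'"}, "committed": True},
    {"txn": 8, "blocks": {1000: "B''"},                        "committed": False},  # crash before commit
]
fs = {10: "A", 1000: "B", 2000: "I", 3000: "D"}

def replay(journal, fs):
    for txn in journal:
        if not txn["committed"]:
            continue                      # no commit record: preserve the old state
        for blk, new in txn["blocks"].items():
            fs[blk] = new                 # rewriting the same value is harmless (idempotent)

replay(journal, fs)
replay(journal, fs)                       # crash during recovery? just replay again
print(fs)                                 # txn 8 is never installed; txns 6 and 7 are
```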
Journaling Example
Suppose a transaction needs to update three blocks: bitmap (B), inode (I) and data (D).
Step 1: Write blocks to the journal: the transaction header block (T7) and the updated values of B, I and D (B', I', D').
Step 2: Write the commit block (C) to the journal.
T7 contents: [1] = 1000, [2] = 2000, [3] = 3000. T7 (the transaction header block for transaction 7) describes where B', I' and D' should be written in the file system, so [1] refers to B', which should be written to block # 1000.
The first two steps together constitute write-ahead logging, because we log all the updated values before updating the file system.
[Figure: file system with B, I, D at blocks 1000, 2000, 3000; journal containing T6, followed by T7, B', I', D', C.]
Journaling Example
Step 3: Install. After all journal records (T7 through C) are stored on disk, the updated B, I and D blocks can be copied to the file system (B' to block 1000, I' to block 2000, D' to block 3000).
Step 4: Free the transaction in the journal. The journal is a fixed-size circular buffer, so transactions must be periodically freed; free T7 through C after the file system is updated in Step 3.
Notice that the four steps shown are the same as with redo recovery.
[Figure: journal containing T6, followed by T7, B', I', D', C; file system updated with B', I', D' at blocks 1000, 2000, 3000.]
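A minimal sketch of the four journaling steps in this example, over toy in-memory structures. The transaction id, block numbers, and helper names are assumptions; a real journal is a fixed-size circular buffer of disk blocks, not a Python list.

```python
journal = []                 # stands in for the on-disk circular journal
fs      = {1000: "B", 2000: "I", 3000: "D"}

def journaled_write(txn_id, updates):
    # Step 1: write the transaction header and the updated blocks to the journal.
    journal.append({"type": "header", "txn": txn_id,
                    "targets": list(updates.keys())})
    for blk, new in updates.items():
        journal.append({"type": "data", "txn": txn_id, "blk": blk, "val": new})
    # Step 2: write the commit block (the operation is now durable).
    journal.append({"type": "commit", "txn": txn_id})
    # Step 3: install -- copy the updated blocks to their home locations.
    for blk, new in updates.items():
        fs[blk] = new
    # Step 4: free the transaction's records in the journal.
    journal[:] = [r for r in journal if r["txn"] != txn_id]

journaled_write(7, {1000: "B'", 2000: "I'", 3000: "D'"})
print(fs, journal)
```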
Observation Based on FFS
As memory gets larger, we need to optimize for writes.
Reads: the disk is accessed only the first time a block is read; after that, the data can be read from the buffer cache. As memory becomes bigger, reads become less of a performance problem.
Writes: I/O will eventually become write heavy, because of the synchronous operations performed for data integrity in the presence of crashes. Also, writes are not well clustered: directory, inode and data blocks are all written.
Log-Structured File Systems
The key idea in a log-structured file system (LFS) is to write all file system data and metadata in a contiguous log.
Issues: How is data read? What data is written? How is space freed in the log?
LFS Reads
When inodes are updated, they are written to the log, so they end up scattered in the log, unlike in FFS. LFS maintains an array in memory, called the inode-map, to locate inodes. The inode-map maps an inode number to the inode's location in the log, which helps locate the inode block. From the inode, indirect and data blocks are read as in FFS.
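A sketch of this read path, assuming toy structures: the log is a Python list of blocks and an inode is a dict of block pointers. These formats are invented for illustration only.

```python
log = [
    {"kind": "data",  "contents": "hello"},                 # log address 0
    {"kind": "inode", "ino": 42, "block_ptrs": {0: 0}},     # log address 1
]
inode_map = {42: 1}            # inode number -> location of the inode in the log

def read_block(ino, file_block_no):
    inode = log[inode_map[ino]]                   # 1) inode-map gives the inode's log address
    addr  = inode["block_ptrs"][file_block_no]    # 2) the inode points at data blocks, as in FFS
    return log[addr]["contents"]

print(read_block(42, 0))       # -> "hello"
```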
LFS Writes
When an inode is updated, the inode-map has to be updated and stored on disk; the inode-map itself is stored in the log. LFS uses a checkpoint region in a fixed area on disk, which helps locate the inode-map blocks in the log. The checkpoint region is updated when inode-map blocks are written.
[Figure: on-disk layout showing the superblock, checkpoint region, and log segments containing segment headers, inodes, inode-map blocks and data blocks.]
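A sketch of the write path just described: every update (data block, inode, inode-map block) is appended to the log, and the fixed-location checkpoint region is updated to point at the latest inode-map. All structures and names are hypothetical, and a real LFS writes whole segments and updates the checkpoint only periodically.

```python
log = []                               # grows by appending; grouped into segments in a real LFS
checkpoint = {"inode_map_addr": None}  # fixed location on disk
inode_map = {}                         # also cached in memory

def append(record):
    log.append(record)
    return len(log) - 1                # log address of the new block

def write_file_block(ino, file_block_no, contents):
    data_addr  = append({"kind": "data", "contents": contents})
    inode_addr = append({"kind": "inode", "ino": ino,
                         "block_ptrs": {file_block_no: data_addr}})
    inode_map[ino] = inode_addr
    imap_addr  = append({"kind": "inode_map", "map": dict(inode_map)})
    checkpoint["inode_map_addr"] = imap_addr    # done periodically in a real LFS

write_file_block(42, 0, "hello")
print(log, checkpoint)
```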
Log Space Reclamation
A cleaner process reclaims segments whose blocks have been overwritten by newer versions or have been deleted. Live blocks have to be copied out of segments before they can be reclaimed. In practice, the cleaner process is complicated and requires a lot of tuning.
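A deliberately simplified cleaner sketch: it scans an old segment, copies the blocks that are still live to the tail of the log, and frees the segment. The per-block "live" flag is a stand-in; a real cleaner must determine liveness via segment summaries and the inode-map, update the pointers to the moved blocks, and choose which segments to clean, which is where most of the tuning effort goes.

```python
segments = {
    0: [{"id": "D1", "live": False},   # overwritten by a newer version
        {"id": "D2", "live": True}],   # still referenced
    1: [],                             # tail of the log
}

def clean(seg_no, tail_seg):
    for blk in segments[seg_no]:
        if blk["live"]:
            segments[tail_seg].append(blk)   # copy live blocks out of the segment
    segments[seg_no] = []                    # the whole segment is now free for new writes

clean(0, 1)
print(segments)
```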
Summary
File systems use write-back caching for efficiency, but this can cause inconsistency on power failures or operating system crashes (e.g., due to bugs). We would like to provide failure atomicity, so that file system operations either complete successfully or do not happen at all. Undo and redo recovery are two failure atomicity techniques; they are implemented using write-ahead logging. Modern file systems that implement write-ahead logging are called journaling file systems. Log-structured file systems optimize write performance, and they provide failure atomicity as well.
Think Time
What is the motivation for a journaling file system? What are the benefits and drawbacks of a journaling file system over a non-journaling file system? Describe the sequence of operations needed to issue a file write operation in a journaling file system. Describe the sequence of operations needed to read a block in a journaling file system. What happens in a journaling file system if a block is updated twice in two consecutive file system operations? E.g., OP1 updates B1, I1, D->D'; OP2 updates B2, I2, D'->D''.
Motivation for journaling: speeding up crash recovery.
Benefits/drawbacks of a journaling file system: the benefit is faster crash recovery; the drawback is that every block write becomes two block writes.
Issuing a file write in a journaling file system: see the slides.
Reading a block in a journaling file system: when a block is read and it has a dirty/modified copy in the journal, the journal version should be used. Normally, the journal blocks are pinned in memory, so any updated block in the journal can be read quickly. This is especially beneficial when installing updated blocks, because they will already be in memory.
Block updated twice: the second update cannot simply overwrite the first update in the journal block (in memory, see above), or the journal may end up containing [S1, B1, I1, D'', C1] when these journal blocks in memory are written to the journal on disk. The reason this can happen is asynchronous DMA: when the OS writes D' to disk, it simply tells the disk that this buffer in memory needs to be written. The disk controller then issues a DMA from memory to disk, but this happens asynchronously (i.e., the OS doesn't wait for the DMA). So if the buffer associated with D is modified again, the modified contents (D'') may reach the disk. As a result, while the DMA to the journal is in progress, OP2 has to be delayed until the journal contains [S1, B1, I1, D', C1], after which D'' can be logged as part of another commit. Alternatively, the two updates can be combined and committed together as [S, B1, B2, I1, I2, D'', C]. This is called group commit, because a group of operations is committed together, and it improves journaling performance significantly (see the sketch below).
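A hedged sketch of the group-commit idea from the answer above: updates from several pending operations are merged into one journal transaction with a single commit block, so a block updated twice (D -> D' -> D'') is logged once with its final value. The record format and names are invented for illustration.

```python
pending = [
    {"op": "OP1", "updates": {"B1": "B1'", "I1": "I1'", "D": "D'"}},
    {"op": "OP2", "updates": {"B2": "B2'", "I2": "I2'", "D": "D''"}},
]

def group_commit(pending, txn_id):
    merged = {}
    for op in pending:
        merged.update(op["updates"])        # later updates win, so D ends up as D''
    txn = [{"type": "header", "txn": txn_id, "targets": list(merged)}]
    txn += [{"type": "data", "txn": txn_id, "blk": b, "val": v}
            for b, v in merged.items()]
    txn.append({"type": "commit", "txn": txn_id})  # one commit block for the whole group
    return txn

print(group_commit(pending, 1))
```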
Think Time
What is the motivation for a log-structured file system? If all block updates are written to the log, how do we locate a block in the log? How do log-structured file systems provide failure atomicity? What are the drawbacks of a log-structured file system? When a block is written, how many writes are performed in 1) a journaling file system and 2) a log-structured file system? Why are log-structured file systems used on an SSD?
Motivation for a log-structured file system: reads will be absorbed by the buffer cache, so let's optimize for writes by issuing all writes sequentially.
Locating a block: the checkpoint area is separate from the log and is not written in a log-like manner. The checkpoint area contains information about where the inode-map is stored in the log; the inode-map contains information about where inodes are stored in the log; and the inodes contain block pointers that lead to the data blocks.
Providing failure atomicity: each time the checkpoint is written, the file system essentially performs a commit operation. As the previous answer shows, the checkpoint area helps locate all the blocks in the file system. You can think of any blocks in the log that have been updated since the last checkpoint write as pending writes (similar to writes to the journal in a journaling file system). On a crash, all these writes can be discarded, because they have not been committed yet (i.e., the checkpoint does not have any pointers to these blocks). But what about the checkpoint itself: how can we write it failure-atomically? LFS uses two checkpoint areas (recall the principle that to revert block contents, you always make a copy before overwriting a block) and writes to them alternately. On a crash, it knows which area was last written completely (using a monotonically increasing sequence number) and picks that checkpoint; see the sketch after this slide. To ensure consistency, a flush should be issued before the checkpoint block is written, then the checkpoint block is written, and then another flush is issued. This ensures that a checkpoint block always points to consistent data.
Drawbacks of LFS: 1) Reads are no longer sequential. Say a file is initially created sequentially; when it is modified, the modified blocks are stored in the log, so they can be far from the original blocks. As more writes are performed, the file becomes fragmented. LFS argued that this is not a problem because blocks are cached once read; however, the initial file read will be slower than in regular file systems. 2) A second drawback is the need to perform garbage collection, because old versions of blocks and deleted blocks need to be reclaimed. This cleaning process can interfere with regular read and write operations.
How many writes: 1) Journaling: two, once to the journal and once to the final location in the file system. 2) Log-structured: one or more, once to the log, and then again during cleaning if the segment containing this live block is cleaned. Since a segment may be cleaned many times, a block may be copied to the log many times; this is why the cleaning process needs to be optimized, or else log-structured file systems can perform poorly.
Log-structured file systems on an SSD: flash blocks in an SSD have limited endurance; each block can be written only a certain number of times (e.g., 10,000-100,000) before it starts failing. Because writes are issued in a log fashion to all blocks on the SSD, they are spread across all blocks evenly. This reduces the chance that specific blocks on the flash device are written much more frequently than others, which would eventually cause them to fail. Writing to the SSD in a log fashion is also called wear leveling.
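A sketch of the two alternating checkpoint areas described above: each checkpoint write goes to the area not written last and carries a larger sequence number, and recovery picks the area with the highest completely written sequence number. The "valid" flag here is a stand-in for whatever a real LFS uses (e.g., flushes around the checkpoint write) to detect a torn checkpoint; the structures and names are assumptions for illustration.

```python
checkpoints = [
    {"seq": 0,  "inode_map_addr": None, "valid": True},
    {"seq": -1, "inode_map_addr": None, "valid": True},
]

def write_checkpoint(inode_map_addr):
    # Overwrite the older of the two areas, leaving the newer one intact.
    target = min(checkpoints, key=lambda c: c["seq"])
    target.update({"seq": max(c["seq"] for c in checkpoints) + 1,
                   "inode_map_addr": inode_map_addr,
                   "valid": True})          # set only once the write has fully completed

def recover():
    # Pick the most recent checkpoint that was written completely.
    good = [c for c in checkpoints if c["valid"]]
    return max(good, key=lambda c: c["seq"])

write_checkpoint(100)
write_checkpoint(200)
print(recover())                            # -> the checkpoint with seq 2 and inode_map_addr 200
```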