Chapter 11: File System Implementation

Chapter 11: File System Implementation
File-System Structure; File-System Implementation; Directory Implementation; Allocation Methods; Free-Space Management; Efficiency and Performance; Recovery; NFS; Example: WAFL File System

Objectives
To describe the details of implementing local file systems and directory structures. To describe the implementation of remote file systems. To discuss block allocation and free-block algorithms and their trade-offs.

File-System Structure
File systems reside on secondary storage (often disks). They provide a user interface to storage, mapping logical addresses to physical locations, and they give efficient and convenient access to the disk by allowing data to be stored, located, and retrieved easily. Two design questions follow: How will the file system appear to the user? What algorithms and data structures will be used to implement it? Disks are convenient because they provide random access and the ability to read information, modify it, and rewrite it in the same physical location (in-place rewrite). To improve efficiency, I/O transfers are performed in units of blocks, each made up of one or more sectors (a sector is usually 512 bytes).

File-System Structure
The file system is composed of several layers. I/O control – device drivers and interrupt handlers. Basic file system – issues generic commands to the appropriate device driver; blocks are identified by physical location (drive, cylinder, track, sector). File-organization module – blocks are grouped into files, logical file blocks are mapped to physical locations, and free space and disk allocation are managed. Logical file system – manages the metadata about the files: the directory structure, logical addresses, and file attributes (owner, group, protections, file type, etc.). Metadata about each file is stored in its file control block (FCB). In Unix these are called inodes.

Layered File System

File System Layers (Cont.)
Layering is useful for reducing complexity and redundancy. Multiple file systems could share the same I/O control layer, for example. The same logical file system could be implemented on different types of devices with different I/O control: hard drive, CD-ROM, flash drive. Layering can also add overhead and decrease performance, because each access passes through several translations. A file name must be translated into a file number. The file number is translated into a logical address using the FCB (inode). The logical address is translated into a physical address.

File System Layers (Cont.)
Many file systems are in use, and several are often used within a single operating system, each with its own format. CD-ROM uses ISO 9660; Unix has UFS and FFS; Windows has FAT, FAT32, and NTFS, as well as floppy, CD, DVD, and Blu-ray formats; Linux supports more than 40 types, with the extended file systems ext2 and ext3 leading; and there are distributed file systems as well. New ones are still arriving – ZFS, GoogleFS, Oracle ASM, FUSE.

File-System Implementation
We have system calls at the API level, but how do we implement their functions? On-disk and in-memory structures are used to implement the file system. On-disk structures include the boot control block, volume control block, directory structure, and file control blocks. In-memory structures include the mount table, directory cache, system-wide open-file table, per-process open-file table, and read/write buffers.

On-Disk File System Structures
Boot control block – contains the information needed by the system to boot an OS from that volume. It is needed only if the volume contains an OS, and it is typically the first block of the volume. Volume control block (UFS – superblock; NTFS – stored in the master file table) – contains volume/partition details: total number of blocks, number of free blocks, block size, and free-block pointers or array. Directory structure – organizes the files. In UFS it holds file names and inode numbers; in NTFS it is stored in the master file table.

On-Disk File System Structures
The per-file file control block (FCB) contains many details about the file. UFS – the inode holds the inode number, permissions, size, and dates. NTFS – this information is stored in the master file table using relational database structures.

In-Memory File System Structures
Mount table – stores file system mounts, mount points, and file system types. Directory-structure cache – holds information about recently accessed directories. System-wide open-file table – contains a copy of the FCB of each open file. Per-process open-file table – for each process, a table that contains, for every open file, a pointer to the appropriate entry in the system-wide open-file table. Buffers – hold file system blocks while they are being read from the disk or written to the disk.

Opening a File Opening a file involves the following steps:
The open call passes the file name to the logical file system. The system-wide open-file table is searched to determine if the file is already open. If so then a per-process open-file table entry is created pointing to the existing entry. Otherwise the directory is searched for the given name. When found the FCB is copied into the system-wide open-file table and an entry in the per-process file table is made to point to the FCB entry. Open returns a pointer to the per-process file table entry (file descriptor or file handle).

Reading From a File When a read (or write) call is made the following steps occur. The entry for the file is located in the per-process file table. This entry is a pointer to the appropriate entry in the system-wide open-file table. The system-wide open-file table entry contains a copy of the FCB for the file which contains information about its location on the disk. The information can be read from (or written to) the disk.
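The open and read steps above can be sketched as a toy model. This is only an illustration under stated assumptions, not any real kernel's code: the names `Fcb`, `OpenEntry`, `SystemOpenFileTable`, and `Process` are hypothetical, and a Python dict stands in for the on-disk directory.

```python
from dataclasses import dataclass


@dataclass
class Fcb:
    """Toy file control block: a name and the disk blocks the file occupies."""
    name: str
    blocks: list


@dataclass
class OpenEntry:
    """One system-wide open-file table entry: an FCB copy plus an open count."""
    fcb: Fcb
    open_count: int = 0


class SystemOpenFileTable:
    def __init__(self, directory):
        self.directory = directory      # name -> Fcb, stands in for the on-disk directory
        self.entries = {}               # name -> OpenEntry

    def open(self, name):
        entry = self.entries.get(name)
        if entry is None:               # not open yet: search the directory
            fcb = self.directory[name]  # raises KeyError if the file does not exist
            entry = OpenEntry(Fcb(fcb.name, list(fcb.blocks)))  # copy the FCB into memory
            self.entries[name] = entry
        entry.open_count += 1
        return entry


class Process:
    def __init__(self, system_table):
        self.system_table = system_table
        self.fd_table = []              # per-process table; the index is the file descriptor

    def open(self, name):
        self.fd_table.append(self.system_table.open(name))
        return len(self.fd_table) - 1   # descriptor handed back to the caller

    def blocks_for(self, fd):
        # Read path: descriptor -> per-process entry -> system-wide entry -> FCB
        return self.fd_table[fd].fcb.blocks
```

Two processes that open the same file get their own descriptors, but both per-process entries point at the same system-wide entry, which is the sharing the slides describe.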

Mounting
Before the files in a volume can be accessed, the volume must be mounted. At mount time, file system consistency is checked: is all the metadata correct? If not, fix it and check again. If so, add the volume to the mount table and allow access.

Virtual File Systems
How does a system allow multiple types of file systems to be used concurrently? Virtual File Systems (VFS) on Unix provide an object-oriented way of implementing file systems. VFS allows the same system call interface (the API) to be used for different types of file systems. It separates the file-system generic operations from the implementation details. The implementation can be one of many file system types, or it can be a network file system. VFS implements vnodes, which hold a copy of the inode or of the network file details. It dispatches operations to the appropriate file system implementation routines.

Virtual File Systems (Cont.)
The API is written to the VFS interface rather than to any specific type of file system.

Virtual File System Implementation
For example, Linux has four object types: inode, file, superblock, and dentry. VFS defines the set of operations on each of these objects that must be implemented. Every object has a pointer to a function table, which holds the addresses of the routines that implement each operation for that type of object. For example, the file object's operations include: int open(. . .), open a file; int close(. . .), close an already-open file; ssize_t read(. . .), read from a file; ssize_t write(. . .), write to a file; int mmap(. . .), memory-map a file.
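The function-table idea can be illustrated with a small sketch. It is not the Linux VFS API: `FileOps`, `vfs_read`, `vfs_write`, and the two toy "file systems" are hypothetical names chosen for this example, and plain dicts stand in for file objects.

```python
class FileOps:
    """Hypothetical function table: one slot per operation the generic layer defines."""
    def __init__(self, read, write):
        self.read = read
        self.write = write


# Toy file system 1: keeps file contents in memory.
def mem_read(obj, n):
    return obj["data"][:n]


def mem_write(obj, data):
    obj["data"] += data
    return len(data)


# Toy file system 2: discards writes and is always at end-of-file.
def null_read(obj, n):
    return b""


def null_write(obj, data):
    return len(data)


MEM_OPS = FileOps(mem_read, mem_write)
NULL_OPS = FileOps(null_read, null_write)


def vfs_read(obj, n):
    # Generic layer: find the routine in the object's table and call it.
    return obj["ops"].read(obj, n)


def vfs_write(obj, data):
    return obj["ops"].write(obj, data)
```

The caller uses only `vfs_read` and `vfs_write`; which routine actually runs depends on the table the object points to.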

Directory Implementation
There are several methods of implementing a directory structure, each with advantages and disadvantages. Linear list – a list of file names, each with a pointer to the file's FCB. It is simple to program but time-consuming to execute, because lookup is a linear search. To speed things up, the list could be kept in sorted order, for example in a sorted linked list or a red-black tree. Hash table – a linear list combined with a hash data structure. This decreases directory search time. Collisions, where two file names hash to the same location, must be handled, with the usual hash-table difficulties and solutions.
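A hash-table directory can be sketched as follows. Chaining is used here as one common way to handle collisions; the slides do not say which strategy a given file system uses, and the class name `HashDirectory` and the byte-sum hash are assumptions made for the example.

```python
class HashDirectory:
    """Toy directory: a fixed number of buckets, collisions chained in a list."""
    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]

    def _bucket(self, name):
        # Simple deterministic hash so the example behaves the same on every run.
        return self.buckets[sum(name.encode()) % len(self.buckets)]

    def add(self, name, fcb_number):
        bucket = self._bucket(name)
        for i, (existing, _) in enumerate(bucket):
            if existing == name:
                bucket[i] = (name, fcb_number)   # replace an existing entry
                return
        bucket.append((name, fcb_number))

    def lookup(self, name):
        for existing, fcb_number in self._bucket(name):  # scan only one chain
            if existing == name:
                return fcb_number
        return None
```

The names "ab" and "ba" have the same byte sum, so they land in the same bucket; the chain keeps both entries distinct.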

Allocation Methods - Contiguous
An allocation method refers to how disk blocks are allocated for files. There are several allocation methods, each with advantages and disadvantages: contiguous allocation, linked allocation, and indexed allocation. Two big problems that need to be managed are fragmentation and wasted space. External fragmentation – the tendency for free space to get chopped up into units that are too small to be useful. Internal fragmentation – the tendency for a file to not use all of the space allocated to it.

Allocation Methods - Contiguous
Contiguous allocation – each file occupies a set of contiguous blocks. This offers the best performance in most cases. It is simple to implement: only the starting location (block number) and length (number of blocks) of each file are required. Problems include finding enough contiguous space for the file; knowing the file size in advance (what if a file gets bigger?); and external fragmentation, because free space gets chopped up as files are created and deleted. This implementation would need a method for compacting files and collecting free space.
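A minimal sketch of contiguous allocation follows. First-fit is used as the placement policy purely as an assumption, since the slides do not name one; the function names and the `(start, count)` representation of free runs are also hypothetical.

```python
def first_fit(free_runs, length):
    """Allocate `length` contiguous blocks from a list of (start, count) free runs.

    Returns the starting block, or None when no single run is large enough
    (the external-fragmentation case). The list is updated in place.
    """
    for i, (start, count) in enumerate(free_runs):
        if count >= length:
            if count == length:
                del free_runs[i]                          # run used up entirely
            else:
                free_runs[i] = (start + length, count - length)
            return start
    return None


def physical_block(start, length, logical):
    """Map a logical block number to a physical one for a contiguous file."""
    if not 0 <= logical < length:
        raise IndexError("logical block outside the file")
    return start + logical
```

Note that a request can fail even when the total free space is sufficient, which is exactly the external fragmentation problem described above.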

Contiguous Allocation

Extent-Based Systems
Many newer file systems (e.g., Veritas File System) use a modified contiguous allocation scheme. A contiguous chunk of free space is allocated for each file, based on a guess about how much space the file will need. If the file needs more space, it gets another chunk of contiguous space, called an extent. Extents are stored as a linked list: each extent has a pointer to the next one in the list.

Allocation Methods - Linked
Linked allocation – each file consists of a linked list of blocks. The FCB contains the pointer to the first block. Each block contains a pointer to the next block. The file ends at a nil (null) pointer. There is no external fragmentation: any block on the disk could be used by any file. The free-space management system is called when a new block is needed, so files can grow and shrink. Efficiency can be improved by clustering blocks into groups and allocating clusters to files instead of individual blocks. This increases internal fragmentation, since some clusters will be only partially full. Reliability can be a problem: if one block is damaged, the entire file after that point may be lost. Locating a block can take many I/Os and disk seeks.

Linked Allocation

FAT Format
FAT (File Allocation Table) format uses a variation of linked allocation. The beginning of every volume has a table, indexed by block number. The directory entry contains the block number of the first block in the file. Looking up this block number in the table gives the number of the next block. This operates much like a linked list, but all of the links are stored in a single table, so following the chain is faster and the table can be cached. Allocating new blocks is simple.
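Following a chain through the table can be sketched like this. The end-of-chain marker `END_OF_FILE = -1` and the use of a dict keyed by block number are assumptions made for the example; a real FAT volume uses reserved table values and a fixed on-disk layout.

```python
END_OF_FILE = -1   # assumed end-of-chain marker for this sketch


def file_blocks(fat, first_block):
    """Follow the chain in the table and return the file's blocks in order."""
    blocks = []
    block = first_block
    while block != END_OF_FILE:
        if block in blocks:
            raise ValueError("cycle in allocation table")
        blocks.append(block)
        block = fat[block]      # the table entry holds the number of the next block
    return blocks
```

Every step is a table lookup rather than a read of the data block itself, which is why a cached table makes the traversal cheap.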

File-Allocation Table

Allocation Methods - Indexed
Indexed allocation – each file has its own index block(s) of pointers to its data blocks. These pointers show the blocks that make up the file.

Example of Indexed Allocation

Indexed Allocation (Cont.)
Using indexed allocation there are several advantages: Random access – it is easy to jump around in a file (unlike linked allocation). There is no external fragmentation – files can have their blocks scattered around the disk. The disadvantages are: There is overhead required to store the table. The maximum size of the file is limited by the size of the table. We want a small table so there won’t be wasted space. We want a big table so we can store large files.

Indexed Allocation – Mapping (Cont.)
There are many variations on the indexed scheme that attempt to avoid its problems. Linked scheme – each file has a linked list of index blocks, so there is no limit on size. Multilevel scheme – a first-level index block contains pointers to a set of second-level index blocks, which in turn contain pointers to file blocks. Suppose the block size is 4096 bytes, so each index block can contain 1024 pointers (32-bit addresses). A two-level scheme could then address files of 1024 × 1024 = 1,048,576 blocks, or 4 GB in size. One problem with the multilevel scheme is that it can take three I/O accesses to read or write a particular block: two index blocks and then the data block.
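The arithmetic in the multilevel example can be checked with a few lines. The constants come from the slide's example (4096-byte blocks, 32-bit pointers); the function names are hypothetical.

```python
BLOCK_SIZE = 4096
POINTER_SIZE = 4                                  # 32-bit block addresses
POINTERS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE   # 1024


def max_file_bytes(levels):
    """Largest file addressable with `levels` levels of index blocks."""
    return POINTERS_PER_BLOCK ** levels * BLOCK_SIZE


def two_level_slots(logical_block):
    """Split a logical block number into (first-level slot, second-level slot)."""
    return divmod(logical_block, POINTERS_PER_BLOCK)
```

With two levels, 1024 × 1024 data blocks of 4096 bytes each give 2^32 bytes, which matches the 4 GB figure.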

Indexed Allocation – Mapping (Cont.)

Combined Scheme
Combined scheme – this modification of the multilevel index is used by UNIX/Linux (UFS). Keep 15 pointers in the FCB. The first 12 point to the first 12 blocks (direct blocks) of the file. The next three pointers go to indirect blocks. The first is a single-level indirect block – it contains pointers to file blocks. The second is a double-level indirect block – it contains pointers to more indirect blocks, which in turn contain pointers to file blocks. The third is a triple-level indirect block.
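To see which pointer serves a given block, the scheme can be sketched as below. The default of 1024 pointers per indirect block is carried over from the earlier 4096-byte example and is an assumption here, as is the function name `locate`.

```python
def locate(logical_block, direct=12, per_block=1024):
    """Return (level, offset) telling which pointer chain reaches a block.

    level 0 = direct pointer, 1 = single, 2 = double, 3 = triple indirect.
    offset is the block's position within the range covered by that level.
    """
    if logical_block < 0:
        raise ValueError("block number must be non-negative")
    if logical_block < direct:
        return 0, logical_block
    logical_block -= direct
    for level in (1, 2, 3):
        capacity = per_block ** level     # data blocks reachable at this level
        if logical_block < capacity:
            return level, logical_block
        logical_block -= capacity
    raise ValueError("block number exceeds what the FCB can address")
```

Small files are served entirely by the direct pointers, so they pay no index-block reads; only large files reach the double and triple indirect levels.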

Combined Scheme: UNIX UFS

Performance
Which method produces the best performance depends on the file access type. Contiguous allocation is great for both sequential and random access. Linked allocation is good for sequential access but not for random access. One option would be to declare the access type of a file at its creation: contiguous allocation for random-access files, linked allocation for sequential files. Indexed allocation is more complex: a single block access could require two index block reads followed by the data block read. Clustering can help to improve throughput and reduce CPU overhead.

Free-Space Management
The file system maintains a free-space list to track available blocks/clusters (we will use the term “block” for simplicity). A bit vector, or bit map, can be used to keep track of n blocks, numbered 0, 1, 2, …, n-1: bit[i] = 1 means block[i] is free, and bit[i] = 0 means block[i] is occupied. It is relatively easy to find a free block by treating the vector as a series of words. CPUs have instructions to check for 0-value words and to return the offset within a word of the first “1” bit. The formula to calculate the position of the first free block is: (number of bits per word) * (number of 0-value words) + offset of first 1 bit.
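The first-free-block formula can be sketched as follows. The 32-bit word size and the convention that block 0 is the most significant bit of the first word are assumptions made for this example.

```python
WORD_BITS = 32   # assumed word size for this sketch


def first_free_block(words):
    """Return the number of the first free block, or None if none is free.

    Block 0 is the most significant bit of words[0]; a 1 bit means free.
    """
    for zero_words, word in enumerate(words):
        if word != 0:
            # Offset of the first 1 bit, counted from the most significant bit.
            offset = WORD_BITS - word.bit_length()
            return WORD_BITS * zero_words + offset
    return None
```

Because the loop stops at the first non-zero word, the loop index equals the number of 0-value words skipped, which is the second term of the formula.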

Free-Space Management (Cont.)
The bit map requires extra space. Example: block size = 4 KB = 2^12 bytes; disk size = 2^40 bytes (1 terabyte). The disk contains n = 2^40 / 2^12 = 2^28 blocks, so the bit map requires 2^28 bits (32 MB). If the disk is divided into clusters of 4 blocks, the bit map requires 8 MB of memory. One advantage of the bit map approach is that it is relatively easy to find contiguous free space, which helps in producing contiguous files.
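The sizes in this example can be verified with a short calculation. The function name `bitmap_bytes` is hypothetical, and the sketch assumes one bit per block (or per cluster).

```python
def bitmap_bytes(disk_bytes, block_bytes, blocks_per_cluster=1):
    """Size in bytes of a free-space bit map that uses one bit per cluster."""
    clusters = disk_bytes // (block_bytes * blocks_per_cluster)
    return clusters // 8            # 8 bits per byte
```

For a 2^40-byte disk with 2^12-byte blocks this gives 2^25 bytes (32 MB), and grouping blocks into clusters of 4 cuts it to 8 MB.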

Linked Free Space List on Disk
Another approach is to use the free space itself to maintain a linked list of free blocks. The first free block contains a pointer to the next free block, and so on. With this scheme there is no wasted space. A disadvantage is that the scheme cannot easily produce contiguous space. Recording the number of free blocks can save the effort of traversing the entire list.

Modified Linked Schemes
One modification to the linked scheme is grouping: store the addresses of the next n-1 free blocks in the first free block, plus a pointer to the next block that contains free-block pointers. Another modification is counting. Space is frequently used and freed contiguously, particularly with contiguous allocation, extents, or clustering. The first free block of a contiguous chunk keeps a count of the free blocks in the chunk and a pointer to the next free chunk. The free-space list then has entries containing addresses and counts.
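The counting representation can be sketched as a list of (address, count) entries. Keeping the list in a Python list and merging adjacent runs on every free is an assumption for illustration; the slides describe the entries being stored in the free blocks themselves.

```python
def free_run(runs, start, count):
    """Add a freed run to a list of (start, count) entries and merge neighbours."""
    runs.append((start, count))
    runs.sort()
    merged = [runs[0]]
    for s, c in runs[1:]:
        last_s, last_c = merged[-1]
        if last_s + last_c == s:              # adjacent: extend the previous run
            merged[-1] = (last_s, last_c + c)
        else:
            merged.append((s, c))
    runs[:] = merged                          # update the caller's list in place
    return runs
```

Freeing a run that sits exactly between two existing runs collapses all three into a single entry, which keeps the list short when space is freed contiguously.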

Space Maps
Oracle’s ZFS uses space maps. This scheme is designed for very large, high-performance file systems. At that scale, full data structures like bit maps cannot fit in memory, which would result in thousands of I/Os to update the structures when blocks are allocated or freed. This scheme divides device space into metaslab units and manages the metaslabs. A given volume could contain hundreds of metaslabs.

Space Maps
Each metaslab has an associated space map. This uses the counting algorithm but writes the data structure to a log file rather than using the free space. The log records all of the block activity, in time order, in the counting format. When ZFS needs to allocate or free disk space, it loads the space map for the metaslab into memory in a balanced-tree structure, indexed by offset, and replays the log into that structure. It also combines contiguous free blocks into a single entry.
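Replaying such a log can be sketched as follows. This is not ZFS code: the record format `(op, start, count)` is hypothetical, and a plain Python set stands in for the balanced tree described above.

```python
def replay(log):
    """Replay (op, start, count) records in time order.

    Returns the free space as (start, count) runs sorted by offset,
    with contiguous free blocks combined into single entries.
    """
    free = set()
    for op, start, count in log:
        blocks = range(start, start + count)
        if op == "free":
            free.update(blocks)
        elif op == "alloc":
            free.difference_update(blocks)
        else:
            raise ValueError(f"unknown record type: {op}")
    runs = []
    for block in sorted(free):
        if runs and runs[-1][0] + runs[-1][1] == block:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((block, 1))
    return runs
```

The log is append-only, so recording an allocation or a free is a sequential write; the in-memory structure is rebuilt only when the metaslab is actually needed.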

Improving Efficiency and Performance
The efficiency of a file system is dependent on: Disk allocation and directory algorithms Types of data kept in file’s directory entry Pre-allocation or as-needed allocation of metadata structures Fixed-size or varying-size of data structures

Efficiency and Performance (Cont.)
Once an algorithm is chosen, there are several considerations that can increase performance. Keep data and metadata close together. Buffer cache – use a separate section of main memory for frequently used blocks. The system may treat synchronous and asynchronous writes differently. Synchronous writes are sometimes requested by apps or needed by OS and these must occur in the order received. No buffering / caching – writes must hit disk before acknowledgement. Asynchronous writes more common, buffer-able, faster. The processing of a sequential file can be optimized. Free-behind – when we are done with a block there is no need to keep it in the cache. Read-ahead – we are very likely to need the next block.

Page Cache A page cache caches pages rather than disk blocks using virtual memory techniques and addresses. Memory-mapped I/O uses a page cache. Routine I/O through the file system uses the buffer (disk) cache.

Unified Buffer Cache A unified buffer cache uses the same page cache to cache both memory-mapped pages and ordinary file system I/O to avoid double caching. Which caches get priority, and what replacement algorithms to use?

Recovery
With all of the metadata being stored in memory and on disk, errors can occur. The ability to recover from errors is important. Consistency checking – compares data in the directory structure with data blocks on disk and tries to fix inconsistencies. It can be slow and sometimes fails. Backup – use system programs to back up data from disk to another storage device (magnetic tape, another magnetic disk, optical media). Restore – recover a lost file or disk by restoring data from backup.

Log Structured File Systems
Log structured (or journaling) file systems record each metadata update to the file system as a transaction. All transactions are written to a log file. A transaction is considered committed once it is written to the log (sequentially). However, the file system may not yet be updated. The log may be stored on a separate device or section of the disk. The transactions in the log are asynchronously written to the file system structures. When the file system structures are modified, the transaction is removed from the log. If the file system crashes, all remaining transactions in the log must still be performed. This provides faster recovery from crash and removes the chance of inconsistencies in the metadata.
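The commit, apply, and recover cycle can be sketched with a toy journal. The class `Journal` and its method names are hypothetical, and a dict stands in for the on-disk metadata structures.

```python
class Journal:
    """Toy metadata journal; `metadata` stands in for the on-disk structures."""
    def __init__(self):
        self.log = []          # committed transactions not yet applied
        self.metadata = {}

    def commit(self, updates):
        # The transaction counts as committed once it is in the log.
        self.log.append(dict(updates))

    def checkpoint(self):
        # Apply the oldest transaction, then remove it from the log.
        if self.log:
            self.metadata.update(self.log[0])
            self.log.pop(0)

    def recover(self):
        # After a crash, redo everything still in the log, in order.
        while self.log:
            self.checkpoint()
```

A crash between `commit` and `checkpoint` loses nothing: the transaction is still in the log, and `recover` applies it on restart.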

The Sun Network File System (NFS)
NFS is both an implementation and a specification of a software system for accessing remote files across LANs (or WANs). The implementation is part of the Solaris and SunOS operating systems and runs on Sun workstations using an unreliable datagram protocol (UDP/IP protocol and Ethernet).

NFS (Cont.)
Interconnected workstations are viewed as a set of independent machines with independent file systems, which allows sharing among these file systems in a transparent manner. A remote directory is mounted over a local file system directory. The mounted directory looks like an integral subtree of the local file system, replacing the subtree descending from the local directory. Specification of the remote directory for the mount operation is nontransparent; the host name of the remote directory has to be provided. Files in the remote directory can then be accessed in a transparent manner. Subject to access-rights accreditation, potentially any file system (or directory within a file system) can be mounted remotely on top of any local directory.

NFS (Cont.)
NFS is designed to operate in a heterogeneous environment of different machines, operating systems, and network architectures; the NFS specifications are independent of these media. This independence is achieved through the use of RPC primitives built on top of an External Data Representation (XDR) protocol used between two implementation-independent interfaces. The NFS specification distinguishes between the services provided by a mount mechanism and the actual remote-file-access services.

NFS Mount Protocol
The mount protocol establishes the initial logical connection between server and client. A mount operation includes the name of the remote directory to be mounted and the name of the server machine storing it. The mount request is mapped to the corresponding RPC and forwarded to the mount server running on the server machine. Export list – specifies the local file systems that the server exports for mounting, along with the names of machines that are permitted to mount them. Following a mount request that conforms to its export list, the server returns a file handle, a key for further accesses. File handle – a file-system identifier and an inode number that identify the mounted directory within the exported file system. The mount operation changes only the user’s view and does not affect the server side.

Three Independent File Systems
Suppose there are three independent file systems (U, S1 and S2). The directory shared on system S1 could be mounted over the local directory local on system U. Using cascading mounts, dir2 on system S2 could be mounted over dir1. The path /usr/local/dir1 on system U actually specifies directory dir2 on system S2.

NFS Protocol
NFS provides a set of remote procedure calls for remote file operations. The procedures support the following operations: searching for a file within a directory; reading a set of directory entries; manipulating links and directories; accessing file attributes; reading and writing files. NFS servers are stateless (there is no open() or close()), so each request has to provide a full set of arguments. (NFS V4 is just becoming available; it is very different and stateful.) The NFS protocol does not provide concurrency-control mechanisms.

Three Major Layers of NFS Architecture
NFS is built in three layers. UNIX file-system interface – based on the open, read, write, and close calls, and on file descriptors. Virtual File System (VFS) layer – distinguishes local files from remote ones; local files are further distinguished according to their file-system types. The VFS activates file-system-specific operations to handle local requests according to their file-system types, and calls the NFS protocol procedures for remote requests. NFS service layer – the bottom layer of the architecture; it implements the NFS protocol.

Schematic View of NFS Architecture

NFS Path-Name Translation
The translation from the local pathname to the remote path name is performed by breaking the path into component names and performing a separate NFS lookup call for every pair of component name and directory vnode. To make lookup faster, a directory name lookup cache on the client’s side holds the vnodes for remote directory names.
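Component-by-component translation with a client-side cache can be sketched as below. The function `translate`, the `lookup` callback (which stands in for the NFS lookup RPC), and the cache keyed by (directory, name) pairs are all assumptions made for this example.

```python
def translate(path, root, lookup, cache):
    """Resolve `path` one component at a time, starting from directory `root`.

    `lookup(directory, name)` stands in for the remote lookup call;
    `cache` maps (directory, name) pairs to results already seen.
    """
    node = root
    for name in (part for part in path.split("/") if part):
        key = (node, name)
        if key not in cache:
            cache[key] = lookup(node, name)   # one remote call per uncached component
        node = cache[key]
    return node
```

Resolving a second path that shares a prefix with an earlier one reuses the cached entries, so only the new components cost a remote call.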

NFS Remote Operations
There is nearly a one-to-one correspondence between regular UNIX system calls and the NFS protocol RPCs (except for opening and closing files). NFS adheres to the remote-service paradigm but employs buffering and caching techniques for the sake of performance. File-blocks cache – when a file is opened, the kernel checks with the remote server whether to fetch or revalidate the cached attributes. Cached file blocks are used only if the corresponding cached attributes are up to date. File-attribute cache – the attribute cache is updated whenever new attributes arrive from the server.

NFS Remote Operations (Cont.)
Caching can be done on the client side, but modified data must be committed to the server’s disk before results are returned to the client. Writes must therefore be done synchronously, losing the advantages of caching. Clients do not free delayed-write blocks until the server confirms that the data have been written to disk.

Example: WAFL File System
Network Appliance provides the write-anywhere file layout (WAFL), a distributed file system optimized for random access. It serves up NFS, CIFS, HTTP, and FTP. NVRAM is used for write caching. It is similar to the Berkeley Fast File System, with extensive modifications.

End of Chapter 11

Homework Chapter 11, page 555 1, 3, 4, 5, 8, 10, 13, 15, 16, 17, 21
