File Systems
Why Files? To store large amounts of data. To keep information persistent.
Files: File Naming, File Structure, File Types, File Access, File Attributes, File Operations
File Naming Files are an abstraction mechanism. They provide a way to store information on the disk and read it back later. How and where the information is stored, and how the disks actually work, is hidden from the user. An important characteristic of any abstraction mechanism is the way the objects being managed are named. When a process creates a file, it gives the file a name. When the process terminates, the file continues to exist and can be accessed by other processes using its name.
File Naming The exact rules of file naming vary somewhat from system to system, but some rules are common. Most systems accept strings of 1 to 8 letters as legal file names. Digits and special characters are often permitted as well. Some file systems distinguish between upper case and lower case letters. In UNIX, URGENT, urgent, Urgent, URgent, and UrGent are treated as different file names. MS-DOS does not distinguish them; it treats all of these names as the same file. Many OSs support two-part file names, separated by a period ‘.’. The part following the period is called the file extension. In MS-DOS, file names are 1 to 8 characters, plus an optional extension of 1 to 3 characters.
Some Typical Extensions
File Structure Three types of file structure: byte sequence (typically used), record sequence (no longer used), tree (still used on a few machines).
File Structure
Byte sequence The file is an unstructured sequence of bytes. The OS does not know what is in the file; all it sees are bytes. UNIX and MS-DOS both use this approach. User programs can put anything they want in files and name them in any way that is convenient.
Record sequence Here a file is a sequence of fixed-length records, each with some internal structure. A read operation returns one record, and a write operation overwrites or appends one record. This structure dates from 80-column punched cards and 132-column line printers. CP/M used fixed-length record files of this type.
Tree A file consists of a tree of records, not necessarily all of the same length, but each containing a key field in a fixed position in the record. The tree is sorted on the key field to allow rapid searching for a particular key. It is widely used on large mainframe computers in commercial data processing.
File Types Many OSs support several types of files. UNIX and MS-DOS support the following. Regular files: these are the ones that contain user information; all the files in the last example are regular files. Directories: these are system files for maintaining the structure of the file system. Character special files: these are related to input/output devices such as terminals, printers, and networks. Block special files: these are used for disks.
Regular Files Regular files are generally either ASCII files or binary files. ASCII files consist of lines of text. The great advantage of ASCII files is that they can be displayed and printed as they are, and they can be edited with an ordinary text editor. A binary file is any file that is not an ASCII file. Technically it is just a sequence of bytes, but the OS will only execute a file if it has the proper format.
a) Executable File b) Archive
Executable File It has five sections: header, text, data, relocation bits, and symbol table. The header starts with a magic number that identifies the file as an executable file (to prevent the accidental execution of a file not in this format). Then come 16-bit integers giving the sizes of the various pieces of the file, the address at which execution starts (the entry point), and some flag bits. Following the header are the text and data of the program itself. These are loaded into memory and relocated using the relocation bits. The symbol table is used for debugging.
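The magic-number check described above can be sketched in a few lines of Python. The header layout here is hypothetical (six little-endian 16-bit fields, with an illustrative magic value) and does not reproduce any real system's executable format; `parse_header` and `make_header` are names invented for this example.

```python
import struct

# Hypothetical header layout, not any real system's format:
# six little-endian 16-bit fields, beginning with the magic number.
HEADER = struct.Struct("<6H")
MAGIC = 0x0107  # illustrative value only


def parse_header(raw: bytes) -> dict:
    """Return the header fields, or raise ValueError if the magic number is wrong."""
    if len(raw) < HEADER.size:
        raise ValueError("file too short to hold a header")
    magic, text, data, symtab, entry, flags = HEADER.unpack_from(raw)
    if magic != MAGIC:
        # This check is what prevents accidental execution of other files.
        raise ValueError("bad magic number: not an executable file")
    return {
        "text_size": text,
        "data_size": data,
        "symtab_size": symtab,
        "entry_point": entry,
        "flags": flags,
    }


def make_header(text, data, symtab, entry, flags=0) -> bytes:
    """Build a header in the same hypothetical layout (used for testing)."""
    return HEADER.pack(MAGIC, text, data, symtab, entry, flags)
```

A loader built this way rejects an ordinary text file before it ever looks at the size fields.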
Archive Another type of binary file is the archive, from UNIX. It consists of a collection of library procedures (modules) that are compiled but not linked. Each one is prefaced by a header giving its name, creation date, owner, protection code, and size. Many PC-based OSs associate file types with the specific applications that create them. In Windows, for example, a file created by Notepad and a file created by an Adobe application have different icons and different extensions, reflecting the environment in which each was created.
File Access A file stores information. When it is used, this information must be accessed and read into computer memory. There are several ways in which the information in a file can be accessed: sequential access, direct (random) access, and index sequential access.
Sequential Access A process can read all the bytes or records in a file in order, starting at the beginning, but cannot skip around and read them out of order. The file can be rewound and read as often as needed. Example: magnetic tape. Records may be of variable or fixed length. Sequential access suits batch processing, for example editors, compilers, and LIC forms processing.
Direct Access/ Random Access Files whose bytes or records can be read in any order are called random access files. Disks make this type of access possible. It is generally applied in database applications, where it gives immediate access to a particular item within a large amount of information.
Index Sequential Access An index is constructed for the file. The index contains pointers to the various records: each entry holds a record key and the corresponding record number. The primary index is searched first using binary search, then the secondary index, again using binary search. Finally, the block that was found is read sequentially.
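The idea can be illustrated with a simplified, single-level index. This sketch assumes the records are kept sorted by key and grouped into blocks; the index holds the smallest key of each block. The names `build_index` and `lookup` and the sample data are made up for this example.

```python
from bisect import bisect_right


def build_index(blocks):
    """Index entry i holds the smallest key stored in block i."""
    return [block[0][0] for block in blocks]


def lookup(blocks, index, key):
    """Binary-search the index, then scan the chosen block sequentially."""
    i = bisect_right(index, key) - 1  # last block whose first key <= key
    if i < 0:
        return None  # key is smaller than every key in the file
    for k, value in blocks[i]:  # sequential read within one block
        if k == key:
            return value
    return None


# Sample file: three blocks of (key, record) pairs, sorted by key.
blocks = [
    [(3, "c"), (8, "h")],
    [(12, "l"), (17, "q")],
    [(21, "u"), (30, "z")],
]
index = build_index(blocks)
```

A two-level scheme, as the slide describes, would apply the same binary search once more to an index of the index.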
File Attributes Every file has a name and its data. In addition, all OSs associate other information with each file, such as the date and time of creation and the size. These extra items are called the file’s attributes. The attributes vary from OS to OS.
File Attributes The first four attributes relate to the file’s protection and tell who may access it and who may not. In some systems the user must give a password in order to access the file. Flags are bits or short fields that control or enable some specific property. Hidden files do not appear in listings of all the files. The archive flag is a bit that keeps track of whether the file has been backed up. The temporary flag allows a file to be marked for automatic deletion when the process that created it terminates.
File Attributes The record length, key position, and key length fields are present only in files whose records can be looked up using a key. The various times keep track of when the file was created, most recently accessed, and most recently modified. The current size tells how big the file is at present. Some mainframe OSs require the maximum size to be specified when the file is created.
File Operations Files exist to store information and allow it to be retrieved later. Different systems provide different operations for storage and retrieval. The most common system calls relating to files are: Create, Delete, Open, Close, Read, Write, Append, Seek, Get Attributes, Set Attributes, Rename.
File Operations Create: The file is created with no data. Delete: When the file is no longer needed, it has to be deleted to free up disk space. Open: Before using a file, a process must open it. The purpose of the open call is to allow the system to fetch the attributes and the list of disk addresses into main memory for rapid access on subsequent calls. Close: When all accesses are finished, the attributes and disk addresses are no longer needed, so the file should be closed to free up internal table space. Read: Data are read from the file, usually from the current position. Write: Data are written to the file, again usually at the current position. Append: This call is a restricted form of Write; it can add data only to the end of the file.
File Operations Seek: For random access files, a method is needed to specify where to take the data from. The Seek system call repositions the current-position pointer to a specific place in the file. After this call has completed, data can be read from, or written to, that position. Get Attributes: Processes often need to read file attributes to do their work. Set Attributes: Some of the attributes are user-settable and can be changed after the file has been created. Rename: It frequently happens that a user needs to change the name of an existing file. This system call does that job.
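As a small illustration of Seek, Read, and Append working together, the following Python sketch stores fixed-length records in a temporary file and reads one of them directly. The record size and the helper names `read_record` and `demo` are assumptions made for this example only.

```python
import os
import tempfile

RECORD_SIZE = 8  # hypothetical fixed record length, in bytes


def read_record(f, n):
    """Seek to record n and read it, without reading the records before it."""
    f.seek(n * RECORD_SIZE, os.SEEK_SET)
    return f.read(RECORD_SIZE)


def demo():
    with tempfile.TemporaryFile() as f:
        # Write: five records, each exactly RECORD_SIZE bytes long.
        for i in range(5):
            f.write(f"rec{i:05d}".encode())
        third = read_record(f, 3)   # random access to record 3
        f.seek(0, os.SEEK_END)      # append = seek to the end, then write
        f.write(b"rec00005")
        return third, read_record(f, 5)
```

Here the append is expressed as a seek to the end of the file followed by an ordinary write, which matches the description of Append as a restricted form of Write.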
Memory-Mapped Files Many people feel that conventional file access methods are cumbersome and inconvenient, especially compared with accessing ordinary main memory. Some systems therefore provide MAP and UNMAP system calls, which map a file into a process's address space and remove it again. File mapping works best in a system that supports segmentation: each file can be mapped onto its own segment, so that byte k in the file is also byte k in the segment. Before any file is mapped, the process has two segments: text and data. Suppose the process copies a file. It first maps the source file onto a segment. Then it creates an empty segment and maps it onto the destination file, xyz.
Memory Mapped Files The process can then copy the source segment into the destination segment using an ordinary copy loop. No READ or WRITE system calls are needed. When it is done, it calls UNMAP to remove the files from the address space, and then it exits. File mapping eliminates the need for explicit I/O calls, so programming is easier.
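A rough modern analogue of this copy can be written with Python's `mmap` module, which maps files into the address space on systems with paging rather than segmentation. This is only a sketch: the function names and file names are invented for the example, and the destination file has to be given its length explicitly before it can be mapped.

```python
import mmap
import os
import tempfile


def mmap_copy(src_path, dst_path):
    """Copy src to dst by mapping both files and copying memory to memory."""
    size = os.path.getsize(src_path)
    with open(src_path, "rb") as src, open(dst_path, "wb+") as dst:
        if size == 0:
            return  # an empty file cannot be mapped; dst stays empty
        dst.truncate(size)  # the output length must be set explicitly
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as s, \
                mmap.mmap(dst.fileno(), size, access=mmap.ACCESS_WRITE) as d:
            d[:] = s[:]  # ordinary memory copy, no read/write calls
            d.flush()    # push the modified pages back to the file


def demo():
    payload = b"memory-mapped copy\n" * 1000
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source")
        dst = os.path.join(tmp, "xyz")
        with open(src, "wb") as f:
            f.write(payload)
        mmap_copy(src, dst)
        with open(dst, "rb") as f:
            return f.read() == payload
```

Leaving the `with` blocks plays the role of UNMAP: the mappings are removed from the address space before the files are closed.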
Memory Mapped Files File mapping also introduces some problems. First, it is hard for the system to know the exact length of the output file, xyz in our example. It can easily tell the number of the highest page written, but it has no way of knowing how many bytes in that page were written. All the OS can do is set the file length to a whole number of pages. Second, a file may be mapped in by one process and opened for conventional reading by another. If the first process modifies a page, the change is not reflected in the file on disk until the page is evicted, so the system has to take care that the two processes do not see inconsistent versions of the file.
Memory Mapped Files The third problem with mapping is that a file may be larger than a segment, or even larger than the entire virtual address space. The only way out is to arrange for the MAP system call to be able to map a portion of a file, rather than the entire file.
Hierarchical Directory Systems
Types of Paths Path Names When the file system is organized as a directory tree, some way is needed for specifying file names. Two methods are commonly used: absolute path names and relative path names.
Path Types Absolute Path Name It consists of the path from the root directory to the file. In /usr/ast/mailbox, usr is a directory, ast is a directory, and mailbox is the file name. Absolute path names always start at the root directory.
Path Types Relative Path Name This is used in conjunction with the concept of the working directory (also called the current directory). A user can designate one directory as the current working directory, in which case all path names not beginning at the root directory are taken relative to the working directory. For example, if the current working directory is /usr/ast, then the file /usr/ast/mailbox can be referred to as just mailbox. A relative path name is often more convenient than an absolute path name. Most OSs that support a hierarchical directory system have two special entries in every directory, “.” and “..”: “.” means the current directory and “..” means the parent directory.
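The relationship between relative and absolute path names can be sketched with Python's `posixpath` module, which handles UNIX-style paths purely as strings. The helper name `resolve` is made up for this example.

```python
import posixpath


def resolve(path, cwd):
    """Turn a path into an absolute one, interpreting '.' and '..'."""
    if not path.startswith("/"):
        # A relative name is taken relative to the working directory.
        path = posixpath.join(cwd, path)
    # normpath collapses "." and ".." components textually.
    return posixpath.normpath(path)
```

With the working directory set to /usr/ast, `resolve("mailbox", "/usr/ast")` gives /usr/ast/mailbox, while an absolute name is returned unchanged.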
Unix Directory Structure
Directory Operations Create: A directory is created. It is empty except for . and .., which the system puts there automatically. Delete: A directory is deleted. Only an empty directory can be deleted; the entries . and .. cannot be deleted. Opendir: Directories can be read. For example, to list all the files in a directory, a listing program opens the directory to read out the names of all the files it contains. Closedir: When a directory has been read, it should be closed to free up internal table space.
Directory Operations Readdir: This call returns the next entry in an open directory. Rename: This call is used to rename a directory. Link: Linking is a technique that allows a file to appear in more than one directory. The file is shared, not copied. Unlink: This call removes a directory entry.
Implementing Files The key issue in implementing file storage is keeping track of which disk blocks go with which file. There are four methods of allocating disk blocks: contiguous allocation, linked list allocation, linked list allocation using an index, and i-nodes.
Contiguous allocation This is the simplest allocation scheme. Each file is stored as a contiguous run of blocks on the disk. With 1 KB blocks, a 50 KB file would be allocated 50 consecutive blocks. It has two advantages. First, it is simple to implement, because keeping track of where a file’s blocks are is reduced to remembering one number, the disk address of the first block. Second, the performance is excellent, because the entire file can be read from the disk in a single operation. No other allocation method comes close.
Contiguous allocation Disadvantages: First, it is not feasible unless the maximum file size is known at the time the file is created. Without this information the OS does not know how much disk space to reserve. Second, this allocation policy fragments the disk, wasting space that might otherwise have been used.
Contiguous allocation (continued)
Linked List Allocation The second method for storing files is to keep each one as a linked list of disk blocks. The first word of each block is used as a pointer to the next one; the rest of the block holds data.
Linked List Allocation
Linked List Allocation Advantages: Unlike contiguous allocation, every disk block can be used in this method. No space is lost to disk fragmentation. It is sufficient to store the disk address of the first block; the rest can be found by following the pointers. Disadvantages: Random access is slow, because reaching a given block means following the chain from the start. Part of each block is taken up by the pointer.
Linked List Allocation using a Table Index The disadvantages of linked list allocation can be eliminated by taking the pointer word from each disk block and putting it in a table, or index, in memory. For example, file A uses disk blocks 4, 7, 2, 10, 12 and file B uses disk blocks 6, 3, 11, 14. Starting from a file's first block, the chain can be followed through the table all the way to the end. The entire block is thus available for data, and because the chain is held entirely in the table, it can be followed without any disk accesses. The disadvantage is that the table occupies memory and has to be in memory all the time.
Linked List Allocation using Table Index
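The table for files A and B above can be modeled as a simple in-memory list, where entry i holds the number of the block that follows block i. This is a minimal sketch; the end-of-chain marker, the table size, and the function names are assumptions made for the example.

```python
END = -1  # hypothetical end-of-chain marker


def build_table(num_blocks, files):
    """Build the in-memory table: table[i] is the block that follows block i."""
    table = [None] * num_blocks  # None = block not used by these files
    for chain in files.values():
        for cur, nxt in zip(chain, chain[1:]):
            table[cur] = nxt
        table[chain[-1]] = END
    return table


def follow_chain(table, first_block):
    """Return all blocks of a file, given only its first block number."""
    blocks = []
    cur = first_block
    while cur != END:
        blocks.append(cur)
        cur = table[cur]  # a memory lookup, not a disk read
    return blocks


# The two files from the example above.
table = build_table(16, {"A": [4, 7, 2, 10, 12], "B": [6, 3, 11, 14]})
```

The directory only needs to record block 4 for file A and block 6 for file B; `follow_chain` recovers the rest.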
I-nodes The last method associates with each file a little table called an i-node (index node). The first few disk addresses are stored in the i-node itself, so for small files all the necessary information is right in the i-node. For somewhat larger files, one of the addresses in the i-node is the address of a disk block called a single indirect block, which contains additional disk addresses. If this is still not enough, another address in the i-node, called a double indirect block, contains the address of a block that holds a list of single indirect blocks. Each of these single indirect blocks points to a few hundred data blocks. If even this is not enough, a triple indirect block can also be used.
I-nodes
I-nodes
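To make the levels of indirection concrete, the following sketch works out which part of the i-node covers a given block of a file. The parameters are assumptions chosen for illustration (1 KB blocks, 4-byte disk addresses, 10 direct addresses); real systems differ.

```python
BLOCK_SIZE = 1024                    # assumed block size in bytes
ADDR_SIZE = 4                        # assumed size of one disk address
N_DIRECT = 10                        # assumed number of direct addresses
PER_BLOCK = BLOCK_SIZE // ADDR_SIZE  # addresses per indirect block (256)


def locate(block_no):
    """Return the i-node level covering file block block_no, plus the
    index to follow at each level of indirection."""
    if block_no < N_DIRECT:
        return ("direct", block_no)
    block_no -= N_DIRECT
    if block_no < PER_BLOCK:
        return ("single", block_no)
    block_no -= PER_BLOCK
    if block_no < PER_BLOCK ** 2:
        return ("double", block_no // PER_BLOCK, block_no % PER_BLOCK)
    block_no -= PER_BLOCK ** 2
    if block_no < PER_BLOCK ** 3:
        return ("triple",
                block_no // PER_BLOCK ** 2,
                (block_no // PER_BLOCK) % PER_BLOCK,
                block_no % PER_BLOCK)
    raise ValueError("block number beyond the maximum file size")


def max_file_blocks():
    """Largest number of data blocks one i-node can address."""
    return N_DIRECT + PER_BLOCK + PER_BLOCK ** 2 + PER_BLOCK ** 3
```

Under these assumptions, the first 10 blocks need no extra disk access, and each further level adds one more indirect block to read before the data block itself.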
Implementing Directories The directory entry provides the information needed to find the disk blocks. The main function of the directory system is to map the file name onto the information needed to locate the data. Directories in a few OSs: directories in CP/M, directories in MS-DOS, directories in UNIX.
CP/M (Memory Layout)
CP/M In CP/M there is only one directory. All the file system has to do to look up a file name is search the one and only directory. When it finds the entry, it also has the disk block numbers, since they are stored right in the directory entry, as are all the attributes. If a file uses more disk blocks than fit in one entry, the file is allocated additional directory entries.
CP/M
CP/M The user code field keeps track of which user owns the file. The next two fields give the name and extension of the file. The Extent field is needed because a file larger than 16 blocks occupies multiple directory entries. The Block Count field tells how many of the 16 disk block entries are in use. The final 16 fields contain the disk block numbers themselves. With 1 KB blocks, one directory entry covers at most 16 KB; a larger file needs additional entries. The last block may not be full, so the system has no way to determine the exact size of the file in bytes (file sizes are kept in blocks, not in bytes).
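The interplay between the Extent and Block Count fields can be sketched as a small calculation. This assumes 1 KB blocks and 16 block numbers per entry, as on the slide; the function name and the treatment of an empty file are assumptions made for the example.

```python
BLOCK_SIZE = 1024       # bytes per disk block (as on the slide)
BLOCKS_PER_ENTRY = 16   # block-number fields in one directory entry


def directory_entries(file_bytes):
    """Return a list of (extent, block_count) pairs for a file of this size."""
    blocks = -(-file_bytes // BLOCK_SIZE)  # ceiling division
    entries = []
    extent = 0
    while blocks > 0:
        used = min(blocks, BLOCKS_PER_ENTRY)
        entries.append((extent, used))
        blocks -= used
        extent += 1
    # Assumption: an empty file still occupies one entry with no blocks.
    return entries or [(0, 0)]
```

A file of exactly 16 KB fits in a single entry, while one byte more forces a second entry with extent number 1 and a block count of 1.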
Directories in MS-DOS The directory entry in MS-DOS is 32 bytes long and contains the file name, the attributes, and the number of the first disk block. The first block number is used as an index into a table of the kind described under linked list allocation using an index. By following the chain, all the blocks of the file can be found.
Directories in MS-DOS
Directories in UNIX A directory entry in UNIX is simple. It contains just the file name and an i-node number. All the information about the type, size, times, ownership, and disk blocks is contained in the i-node.
Looking up /usr/ast/mailbox
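The lookup of a path such as /usr/ast/mailbox can be sketched with an in-memory model in which each directory maps names to i-node numbers. All i-node numbers, sizes, and block numbers below are invented for illustration, and `find_inode` is a hypothetical helper name.

```python
ROOT_INODE = 1  # hypothetical i-node number of the root directory

# Toy i-node table: directories hold only (name -> i-node number) entries;
# everything else about a file lives in its i-node.
inodes = {
    1: {"type": "dir", "entries": {"usr": 6, "etc": 3}},
    3: {"type": "dir", "entries": {}},
    6: {"type": "dir", "entries": {"ast": 26}},
    26: {"type": "dir", "entries": {"mailbox": 60}},
    60: {"type": "file", "size": 1200, "blocks": [132, 406]},
}


def find_inode(path):
    """Walk an absolute path one component at a time; return the i-node number."""
    if not path.startswith("/"):
        raise ValueError("this sketch handles absolute paths only")
    ino = ROOT_INODE
    for name in path.strip("/").split("/"):
        if not name:
            continue  # the path "/" has no components
        node = inodes[ino]
        if node["type"] != "dir":
            raise NotADirectoryError(path)
        if name not in node["entries"]:
            raise FileNotFoundError(path)
        ino = node["entries"][name]
    return ino
```

Each step reads one directory and one i-node, so a three-component path costs three directory searches before the file's own i-node is reached.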
Shared Files When several users are working together on a project, they often need to share files. It is therefore convenient for a shared file to appear simultaneously in different directories belonging to different users. Example: one of C’s files is shared by B. The connection between B’s directory and the shared file is called a link. The file system itself is now a directed acyclic graph (DAG) rather than a tree. Sharing introduces a problem: if directories contain the disk block addresses, as in CP/M, then B’s directory must hold a copy of those addresses. If B or C later appends to the file, the new blocks are listed only in the directory of the user doing the append and are not visible to the other user.
Shared Files One solution is not to list disk blocks in directories, but in a small data structure associated with the file itself, such as the UNIX i-node; the directories then point to that structure. Alternatively, B can link to C’s file through a symbolic link, a special file that contains the path name of the file it refers to.
Shared Files
Disk Space Management Files are stored on disks, so management of disk space is a major concern to file system designers. There are two general strategies for storing an n-byte file: n consecutive bytes of disk space are allocated, or the file is split up into a number of blocks. Storing a file as a contiguous sequence of bytes has the obvious problem that if the file grows, it will probably have to be moved on the disk.
Disk Space Management The same problem arises with segments in memory, but moving a segment in memory is fast compared with moving a file from one disk position to another. For this reason nearly all file systems chop files up into fixed-size blocks that need not be adjacent.
Disk Space Management Two issues: block size and keeping track of free blocks. Block size: Once it has been decided to store files in fixed-size blocks, the question arises of how big the block should be. The usual compromise is to choose a block size of 512 bytes, 1 KB, or 2 KB. Keeping track of free blocks: Once the block size has been chosen, the next issue is how to keep track of free blocks. Two methods are used. The first uses a linked list of disk blocks, with each block holding as many free disk block numbers as will fit. The second is a bit map: a disk with n blocks requires a bit map with n bits. Free blocks are represented by 1s in the map and allocated blocks by 0s.
Free block management techniques 1. Linked List 2. Bit Map
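The bit map technique can be sketched as a small class. It follows the convention above (1 = free, 0 = allocated); the class and method names are invented for the example, and the first-free search is just the simplest possible allocation policy.

```python
class BlockBitmap:
    """Free-block bit map: bit b is 1 if block b is free, 0 if allocated."""

    def __init__(self, n_blocks):
        self.n = n_blocks
        # One bit per block, rounded up to whole bytes; all blocks start free.
        self.bits = bytearray(b"\xff" * ((n_blocks + 7) // 8))

    def is_free(self, b):
        return bool((self.bits[b // 8] >> (b % 8)) & 1)

    def allocate(self):
        """Claim and return the lowest-numbered free block."""
        for b in range(self.n):
            if self.is_free(b):
                self.bits[b // 8] &= ~(1 << (b % 8)) & 0xFF  # clear the bit
                return b
        raise OSError("disk full")

    def release(self, b):
        """Mark block b as free again."""
        self.bits[b // 8] |= 1 << (b % 8)
```

A disk with n blocks needs only n bits here, which is why the bit map is usually far more compact than a list of free block numbers.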
Disk Space Management Disk Quotas To prevent people from hogging too much disk space, multiuser operating systems such as UNIX often provide a mechanism for enforcing disk quotas. The idea is that the system administrator assigns each user a maximum allotment of files and blocks, and the OS makes sure that users do not exceed their quotas.
File System Reliability Destruction of a file system is often a far greater disaster than destruction of the computer itself. If a file system is irrevocably lost, whether through hardware, software, or any other problem, restoring all the information will be difficult and time-consuming, and in many cases impossible. For people whose programs, documents, customer files, tax records, databases, marketing plans, or other data are gone forever, the consequences can be catastrophic. The file system cannot protect against physical destruction, but it can help protect the information. The following slides cover some of the issues involved in safeguarding the file system.
File System Reliability Bad block management Disks often have bad blocks. Some disks start out perfect but develop bad blocks during use; hard disks often have bad blocks right from the start. There are two solutions to the bad block problem, one in hardware and one in software. The hardware solution is to dedicate a sector on the disk to the bad block list. When the controller is first initialized, it reads the bad block list and picks a spare block to replace each defective one, recording the mapping in the bad block list. From then on, all requests for a bad block use its spare.
File System Reliability The software solution requires the user or the file system to carefully construct a file containing all the bad blocks. This removes them from the free list, so they will never occur in data files. Care has to be taken to avoid reading this file while taking backups. Backups Even with a clever strategy for dealing with bad blocks, it is important to back up the files frequently.
File System Reliability Small disks can be backed up by copying them entirely onto another disk or CD. For hard disks, the entire drive can be copied onto another hard disk, which requires a computer with two drives. For larger disks, incremental dumps are used: a complete dump is made periodically (weekly or monthly), and a daily dump covers only the files modified since the last dump. Another area where reliability is an issue is file system consistency. Many file systems read blocks, modify them, and write them out later. If the system crashes before all the modified blocks have been written out, the file system can be left in an inconsistent state. To deal with this problem, most computers have a utility program that checks file system consistency. It can be run whenever the system is booted, particularly after a crash.
File System Performance Access to disk is much slower than access to memory. Reading a memory word takes on the order of nanoseconds, whereas reading a disk block takes tens of milliseconds, a factor of roughly 100,000 or more. Because of this difference in access time, many file systems have been designed to reduce the number of disk accesses needed. The most common technique is the block cache, or buffer cache, which keeps recently used disk blocks in memory. Another important technique is to reduce the amount of disk arm motion by putting blocks that are likely to be accessed in sequence close to each other, preferably in the same cylinder. A third technique is to place the i-nodes in the middle of the disk rather than at the start, which reduces the average seek between an i-node and its first block by a factor of two.
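The effect of a block cache can be sketched with a small class that keeps recently read blocks in memory. The slides do not specify a replacement policy; least recently used (LRU) is assumed here purely for illustration, and the class name, its parameters, and the `read_from_disk` callback are hypothetical.

```python
from collections import OrderedDict


class BlockCache:
    """A tiny read-only block cache with LRU replacement (an assumed policy)."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk  # callback: block number -> data
        self.blocks = OrderedDict()           # oldest entry first
        self.disk_reads = 0

    def read(self, block_no):
        if block_no in self.blocks:
            # Cache hit: no disk access; mark the block as recently used.
            self.blocks.move_to_end(block_no)
            return self.blocks[block_no]
        # Cache miss: go to the (slow) disk, then remember the block.
        data = self.read_from_disk(block_no)
        self.disk_reads += 1
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used
        return data
```

A real buffer cache must also handle writes, deciding when modified blocks go back to disk, which ties in with the consistency concerns discussed under reliability.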