
1 © 2006 Matt Welsh – Harvard University 1 CENG334 Introduction to Operating Systems. Erol Sahin, Dept. of Computer Eng., Middle East Technical University, Ankara, TURKEY. URL: http://kovan.ceng.metu.edu.tr/ceng334. Topic: Filesystems

2 © 2006 Matt Welsh – Harvard University 2 File System Caching Most filesystems cache significant amounts of disk data in memory e.g., Linux tries to use all “free” physical memory as a giant cache Avoids huge overhead of going to disk for every I/O Issues: When do you commit a write to the disk? What happens if you write only to the memory cache and then the system crashes? How do you keep the memory and disk views of a file consistent? What if the file metadata (inodes, etc.) is modified before the data blocks? Read-ahead Read a few extra blocks into memory when you do one read operation Amortize the cost of the seek Useful if the blocks of a file are laid out contiguously Take advantage of sequential access patterns on the file
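A minimal sketch of cooperating with read-ahead from user space on a POSIX system: posix_fadvise() lets a process declare a sequential access pattern so the OS may prefetch extra blocks. The file name data.log is made up for illustration.

    /* Sketch: hint sequential access so OS read-ahead can prefetch blocks. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.log", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        /* Declare sequential access: the kernel may read a few extra
         * blocks ahead on each read, amortizing the cost of the seek. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            /* process buf[0..n) */
        }
        close(fd);
        return 0;
    }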

3 © 2006 Matt Welsh – Harvard University 3 Berkeley FFS Motivated by performance problems with older UNIX filesystems: Small blocks (512 bytes) Free list was unordered; no notion of allocating chunks of space at a time Inodes and data blocks may be located far from each other (long seek time) Related files (in same directory) might be very far apart No symbolic links, no file locking, limited filenames (14 chars), no quotas Main goal of FFS was to improve performance: Use a larger block size – why does this help?? Allocate blocks of a file (and files in same directory) near each other on the disk Entire filesystem described by a superblock Contains free block bitmap, location of root directory inode, etc. Copies of superblock stored at different locations on disk (for safety)

4 © 2006 Matt Welsh – Harvard University 4 FFS Cylinder Groups Store related blocks on nearby tracks across the platters, that is, within a whole group of cylinders Allocate blocks in a rotationally optimal fashion: Try to estimate the rotation speed of the disk and allocate the next block where the disk head will happen to be when the next read is issued! [Diagram: a cylinder group holding the superblock, inode blocks, and data blocks]

5 © 2006 Matt Welsh – Harvard University 5 Does this stuff matter anymore? Modern disks have a lot of internal buffering and logic Batch multiple write requests into a single write Internally reorder multiple outstanding requests Internal remapping of bad blocks to different places on the physical disk The OS has little information on physical disk geometry anyway! Blocks with similar block #'s are usually close to each other, but that's about it... So, how useful are these fancy OS-driven block layout techniques? They clearly used to have a significant impact on disk performance These days, it's not clear that they are so useful Still, lots of debate in the FS community about this Modern filesystems still use the notion of block and cylinder grouping The argument is that the OS can know more about the workload, multiple users, different request priorities, and tradeoffs in terms of bandwidth vs. latency

6 © 2006 Matt Welsh – Harvard University 6 Recall: Multilevel Indexed Files Inode contains a list of 10-15 direct blocks First few blocks of file Also contains a pointer to a single indirect, a double indirect, and a triple indirect block Allows the file to grow to be incredibly large!!! [Diagram: an inode pointing to direct blocks, single-indirect blocks, and double-indirect blocks]

7 © 2006 Matt Welsh – Harvard University 7 Maximum File Size Assume 1 KB blocks. How large can a file be with... Single-level indirect table? How many block pointers can be stored in one block? Assume 4 bytes per block pointer, so (1KB / 4 bytes) = 256 pointers per block So... 256 * 1KB = 256 KB Double-level indirect table? 256 * 256 * 1KB = 65536 KB = 64 MB Triple-level indirect table? 256 * 256 * 256 * 1KB = 16 GB FFS-style: 13 direct blocks, 1 single indirect, 1 double indirect, 1 triple indirect? (13 * 1KB) + (256 * 1KB) + (256 * 256 * 1KB) + (256 * 256 * 256 * 1KB) = 16.06 GB So why use this wacko multi-level indirection scheme rather than just a triple-level indirect table???
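The arithmetic above is easy to check mechanically. A small sketch, assuming the slide's 1 KB blocks and 4-byte block pointers:

    /* Reproduce the slide's maximum-file-size arithmetic. */
    #include <stdio.h>

    int main(void) {
        const unsigned long long block = 1024;       /* 1 KB block size */
        const unsigned long long ptrs  = block / 4;  /* 256 pointers per block */

        unsigned long long single = ptrs * block;                /* 256 KB */
        unsigned long long dbl    = ptrs * ptrs * block;         /* 64 MB */
        unsigned long long triple = ptrs * ptrs * ptrs * block;  /* 16 GB */
        unsigned long long ffs    = 13 * block + single + dbl + triple;

        printf("single indirect: %llu KB\n", single / 1024);
        printf("double indirect: %llu MB\n", dbl / (1024 * 1024));
        printf("triple indirect: %llu GB\n", triple >> 30);
        printf("FFS-style total: %.2f GB\n", ffs / (double)(1ULL << 30));
        return 0;
    }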

8 © 2006 Matt Welsh – Harvard University 8 FFS Block Sizes Older UNIX filesystems used small blocks (512B or 1KB) Low disk bandwidth utilization Maximum file size is limited (by how many blocks an inode can keep track of) FFS introduced larger block sizes (4KB) Allows multiple sectors to be read/written at once Introduces internal fragmentation: a whole block may not be used Fix: Block “fragments” (1KB) The last block in a file may consist of 1, 2, or 3 fragments Fragments from different files can be stored in the same block

9 © 2006 Matt Welsh – Harvard University 9 Log-structured Filesystems (LFS) Around '91, two trends in disk technology were emerging: Disk bandwidth was increasing rapidly (over 40% a year) Seek latency was not improving much at all Machines had increasingly large main memories Large buffer caches absorb a large fraction of read I/Os Can use them for writes as well! Coalesce several small writes into one larger write Some lingering problems with FFS... Writes to file metadata (inodes) were required to be synchronous Couldn't buffer metadata writes in memory Lots of small writes to file metadata means lots of seeks! LFS takes advantage of both trends to increase FS performance Mendel Rosenblum and John Ousterhout Mendel is now a prof at Stanford Also lots of contributions by our own Margo Seltzer

11 © 2006 Matt Welsh – Harvard University 11 LFS: Basic Idea Treat the entire disk as one big append-only log for writes! Don't try to lay out blocks on disk in some predetermined order Whenever a file write occurs, append it to the end of the log Whenever file metadata changes, append it to the end of the log Collect pending writes in memory and stream them out in one big write Maximizes disk bandwidth No “extra” seeks required (only those to move the end of the log) When do writes to the actual disk happen? When a user calls sync() -- synchronize data on disk for the whole filesystem When a user calls fsync() -- synchronize data on disk for one file When the OS needs to reclaim dirty buffer cache pages Note that this can often be avoided, e.g., by preferring clean pages Sounds simple... But lots of hairy details to deal with!
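A minimal sketch of the fsync() path just mentioned, on a POSIX system; journal.dat is a made-up file name. write() only reaches the in-memory cache, and fsync() is what forces the data to disk:

    /* Sketch: write() fills the buffer cache; fsync() commits to disk. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("journal.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *rec = "important record\n";
        if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

        /* Until this returns, the data may exist only in memory. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }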

12 © 2006 Matt Welsh – Harvard University 12 LFS Example [Diagram: a log holding blocks of File 1 and File 2] Writing a block in the middle of the file just appends that block to the log

14 © 2006 Matt Welsh – Harvard University 14 LFS and inodes How do you locate file data? A sequential scan of the log is probably a bad idea... Solution: Use FFS-style inodes! Every update to a file writes a new copy of the inode! [Diagram: the log holding File 1 and File 2 blocks, inode 1 and inode 2, and a second copy of inode 1 appended after an update]

15 © 2006 Matt Welsh – Harvard University 15 inode map (this is getting fun) Well, now, how do you find the inodes?? They could also be anywhere in the log! Solution: inode maps Maps a “file number” to the location of its inode in the log Note that the inode map is also written to the log!!!! Cache inode maps in memory for performance A fixed checkpoint region tracks the location of the inode map blocks in the log [Diagram: the log holding file blocks, inodes, and a new inode map block, plus the fixed checkpoint area]
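A sketch of the resulting lookup chain (checkpoint region, then inode map block, then inode); all structure names and layouts below are invented for illustration and are not the real LFS on-disk format:

    /* Sketch: find a file's current inode via checkpoint and inode map. */
    #include <stdint.h>

    typedef uint32_t blkno_t;            /* disk address of a log block */

    struct checkpoint {                  /* fixed-location checkpoint region */
        blkno_t imap_block[64];          /* where inode-map blocks live */
    };

    struct imap_block {                  /* maps file number -> inode address */
        blkno_t inode_addr[256];
    };

    /* Stand-in for reading one block from the log into memory. */
    static struct imap_block fake_imap;
    static void *read_block(blkno_t addr) { (void)addr; return &fake_imap; }

    blkno_t locate_inode(const struct checkpoint *cp, uint32_t fileno) {
        blkno_t imap_addr = cp->imap_block[fileno / 256];
        struct imap_block *imap = read_block(imap_addr);
        return imap->inode_addr[fileno % 256];  /* latest inode copy in log */
    }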

16 © 2006 Matt Welsh – Harvard University 16 Reading from LFS But wait... now file data is scattered all over the disk! Seems to negate all of the benefits of grouping data on common cylinders Basic assumption: The buffer cache will handle most read traffic Or at least, reads will happen to data roughly in the order in which it was written Take advantage of huge system memories to cache the heck out of the FS!

17 © 2006 Matt Welsh – Harvard University 17 Log Cleaner With LFS, eventually the disk will fill up! Need some way to reclaim “dead space” What constitutes “dead space?” Deleted files File blocks that have been “overwritten” Solution: Periodic “log cleaning” Scan the log and look for deleted or overwritten blocks Effectively, clear out stale log entries Copy live data to the end of the log The rest of the log (at the beginning) can now be reused!
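A sketch of one cleaning pass over a single region of the log; the segment and liveness bookkeeping below are invented for illustration (segment-based cleaning is walked through in the example on the next slide):

    /* Sketch: copy live blocks forward, then the region can be reused. */
    #include <stdbool.h>
    #include <stddef.h>

    #define BLOCKS_PER_SEG 512

    struct segment {
        unsigned char blocks[BLOCKS_PER_SEG][4096];
        bool live[BLOCKS_PER_SEG];   /* still referenced by some inode? */
    };

    /* Stand-in for writing one block at the current tail of the log. */
    static void append_to_log(const unsigned char *block) { (void)block; }

    void clean_segment(struct segment *seg) {
        for (size_t i = 0; i < BLOCKS_PER_SEG; i++) {
            if (seg->live[i])
                append_to_log(seg->blocks[i]);  /* live data moves forward */
            seg->live[i] = false;               /* everything here is dead now */
        }
        /* The caller marks this segment empty and ready for reuse. */
    }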

18 © 2006 Matt Welsh – Harvard University 18 Log cleaning example (slides 18-21) The LFS cleaner breaks the log into segments Each segment is scanned by the cleaner Live blocks from a segment are copied into a new segment The entire scanned segment can then be reclaimed [Animation: the cleaner runs over two segments holding a mix of live and dead blocks, copies the live blocks into an empty segment, and leaves the two scanned segments empty and ready to store new data]

23 © 2006 Matt Welsh – Harvard University 23 Cleaning Issues When does the cleaner run? Generally when the system (or at least the disk) is otherwise idle Can cause problems on a busy system with little idle time Cleaning a segment requires reading the whole thing! Can reduce this cost if the data to be written is already in cache How does segment size affect performance? Large segments amortize the cost of access/seek time to read/write the entire segment during cleaning Small segments introduce more variance in segment utilizations More segments will contain only dead blocks, making cleaning trivial Could imagine dynamically changing segment sizes based on observed overhead for cleaning

24 © 2006 Matt Welsh – Harvard University 24 LFS Debate First LFS paper by Rosenblum and Ousterhout in '91 1992 port of LFS to BSD by Margo Seltzer and others... Seltzer et al. publish a paper at USENIX '93 pointing out some flaws Ousterhout publishes a critique of the '93 LFS paper Seltzer publishes a revised paper in '95 Ousterhout publishes a critique of the '95 paper Seltzer publishes a response to the critique Ousterhout publishes a response to the response to the critique... “Lies, damn lies, and benchmarks” It is very difficult to come up with definitive benchmarks proving that one system is better than another Can always find a scenario where one system design outperforms another Difficult to extrapolate based on benchmark tests

25 © 2006 Matt Welsh – Harvard University 25 Filesystem corruption What happens when you are making changes to a filesystem and the system crashes? Example: Modifying block 5 of a large directory, adding lots of new file entries System crashes while the block is being written The new files are “lost!” System runs fsck program on reboot Scans through the entire filesystem and locates corrupted inodes and directories Can typically find the bad directory, but may not be able to repair it! The directory could have been left in any state during the write fsck can take a very long time on large filesystems And, no guarantees that it fixes the problems anyway

26 © 2006 Matt Welsh – Harvard University 26 Example: removing a file requires Remove the file from its directory Release the i-node to the pool of free i-nodes Return all the disk blocks to the pool of free disk blocks In the absence of crashes, the order in which these steps are taken does not matter. In the presence of crashes, however, it does!

27 © 2006 Matt Welsh – Harvard University 27 Example: removing a file requires Remove the file from its directory Release the i-node to the pool of free i-nodes Return all the disk blocks to the pool of free disk blocks Crash after only the first step: the inode and the file blocks will not be accessible from any file, yet they will not be available for reassignment.

28 © 2006 Matt Welsh – Harvard University 28 Example: removing a file requires Remove the file from its directory Release the i-node to the pool of free i-nodes Return all the disk blocks to the pool of free disk blocks Crash after the i-node is released but before the directory entry is removed: the directory entry will point to an invalid inode or (if the inode is reassigned) to a different file. The blocks of the file will not be available for reassignment.

29 © 2006 Matt Welsh – Harvard University 29 Example: removing a file requires Remove the file from its directory Release the i-node to the pool of free i-nodes Return all the disk blocks to the pool of free disk blocks Crash after the disk blocks are returned but before the other steps: the file will point to empty blocks, or (after reassignment) it will share the blocks of other files to which these were reassigned.
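A sketch of the three steps in the order the slides list them, with comments noting what a crash between steps leaves behind; all types and helpers below are hypothetical:

    /* Sketch: file removal, ordered so a crash only leaks space. */
    struct dir;
    struct inode;
    static void dir_remove_entry(struct dir *d, struct inode *ino) { (void)d; (void)ino; }
    static void free_inode(struct inode *ino)  { (void)ino; }  /* mark on-disk i-node free */
    static void free_blocks(struct inode *ino) { (void)ino; }  /* return data blocks */

    void remove_file(struct dir *d, struct inode *ino) {
        /* 1. Remove the directory entry first. A crash after this step
         * leaks the inode and its blocks (unreachable, not free), but
         * nothing is corrupted. */
        dir_remove_entry(d, ino);

        /* 2. Release the i-node. Doing this BEFORE step 1 would risk a
         * directory entry pointing at a free or reassigned inode. */
        free_inode(ino);

        /* 3. Return the disk blocks last. Freeing them first would risk
         * another file being allocated blocks this file still points to. */
        free_blocks(ino);
    }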

30 © 2006 Matt Welsh – Harvard University 30 Journaling Filesystems Ensure that changes to the filesystem are made atomically That is, a group of changes are made all together, or not at all In the directory modification example, this means that after the system reboots: The directory either looks exactly as it did before the block was modified Or the directory looks exactly as it did after the block was modified Cannot leave an FS entity (data block, inode, directory, etc.) in an intermediate state! Idea: Maintain a log of all changes to the filesystem Log contains entries that indicate what was done e.g., “Directory 2841 had inodes 404, 407, and 408 added to it” To make a filesystem change: 1. Write an intent-to-commit record to the log 2. Write the appropriate changes to the log Do not modify the filesystem data directly!!! 3. Write a commit record to the log This is essentially the same as the notion of database transactions
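A sketch of the three-record protocol above; the record format and log helpers are invented for illustration and do not match any real journaling FS format:

    /* Sketch: intent record, change records, then a commit record. */
    #include <stddef.h>
    #include <stdint.h>

    enum rec_type { REC_INTENT = 1, REC_CHANGE = 2, REC_COMMIT = 3 };

    struct log_rec {
        uint32_t txn_id;   /* which transaction the record belongs to */
        uint32_t type;     /* intent, change, or commit */
        uint32_t len;      /* bytes of payload that follow */
    };

    void log_append(const struct log_rec *r, const void *payload);
    void log_flush(void);  /* force the log itself out to disk */

    void journal_txn(uint32_t txn, const void *changes, uint32_t len) {
        struct log_rec intent = { txn, REC_INTENT, 0 };
        struct log_rec change = { txn, REC_CHANGE, len };
        struct log_rec commit = { txn, REC_COMMIT, 0 };

        log_append(&intent, NULL);      /* 1. intent-to-commit */
        log_append(&change, changes);   /* 2. the changes themselves */
        log_flush();                    /*    changes durable before commit */
        log_append(&commit, NULL);      /* 3. commit record */
        log_flush();                    /*    transaction is now atomic */
    }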

31 © 2006 Matt Welsh – Harvard University 31 Journaling FS Recovery What happens when the system crashes? Filesystem data has not actually been modified, just the log! So, the FS itself reflects only what happened before the crash Periodically synchronize the log with the filesystem data Called a checkpoint Ensures that the FS data reflects all of the changes in the log No need to scan the entire filesystem after a crash... Only need to look at the log entries since the last checkpoint! For each log entry, see if the commit record is there If not, consider the changes incomplete, and don't try to make them
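A sketch of that recovery scan, reusing the invented record format from the previous sketch: replay a change only if its transaction's commit record made it to the log.

    /* Sketch: redo committed changes found after the last checkpoint. */
    #include <stdbool.h>
    #include <stdint.h>

    struct log_rec { uint32_t txn_id, type, len; };   /* as sketched above */
    enum { REC_CHANGE = 2 };

    bool log_next(struct log_rec *r, void *payload);  /* scan from checkpoint */
    bool txn_has_commit(uint32_t txn);                /* commit record seen? */
    void apply_change(const void *payload, uint32_t len);

    void recover(void) {
        struct log_rec r;
        unsigned char payload[4096];

        while (log_next(&r, payload)) {
            if (r.type == REC_CHANGE && txn_has_commit(r.txn_id))
                apply_change(payload, r.len);  /* redo the committed change */
            /* entries with no commit record are simply ignored */
        }
    }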

34 © 2006 Matt Welsh – Harvard University 34 Journaling FS Example (slides 32-34) [Diagram: File 1, File 2, the log, and the last checkpoint] The filesystem reflects changes up to the last checkpoint Fsck scans the change log from the last checkpoint forward It doesn't find a commit record... so the changes are simply ignored

35 © 2006 Matt Welsh – Harvard University 35 Virtual File Systems (1) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639 VFS: another layer of abstraction Upper interface: for processes, implementing the POSIX interface Lower interface: for concrete file systems The VFS translates the POSIX calls into calls to the filesystems under it. Developed by Sun to support the NFS (Network File System) protocol.

36 © 2006 Matt Welsh – Harvard University 36 At boot time, the root filesystem is registered with the VFS. When other filesystems are mounted, they must also register with the VFS. When a filesystem registers, it provides the list of addresses of the functions that the VFS demands, such as reading a block. After registration, when one opens a file: open(“/usr/include/unistd.h”, O_RDONLY) the VFS creates a v-node and makes a call to the concrete filesystem to return all the information needed. The created v-node also contains pointers to the table of functions for the concrete filesystem on which the file resides.
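A sketch of the idea behind that function table; the names below are invented for illustration and are not the real Linux VFS API:

    /* Sketch: a concrete FS registers its operations with the VFS. */
    #include <stddef.h>
    #include <sys/types.h>

    struct vnode;                        /* VFS handle for an open file */

    struct fs_ops {                      /* what a concrete FS must supply */
        int     (*open) (struct vnode *v, const char *path);
        ssize_t (*read) (struct vnode *v, void *buf, size_t len, off_t off);
        int     (*close)(struct vnode *v);
    };

    struct vnode {
        const struct fs_ops *ops;        /* the FS this file lives on */
        void *fs_private;                /* e.g., pointer to its inode */
    };

    /* Called at mount time; the VFS stores the table for later calls. */
    int vfs_register(const char *name, const struct fs_ops *ops);

    /* A POSIX read() on this file becomes a concrete-FS read. */
    ssize_t vfs_read(struct vnode *v, void *buf, size_t len, off_t off) {
        return v->ops->read(v, buf, len, off);
    }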

37 © 2006 Matt Welsh – Harvard University 37 Figure 4-19. A simplified view of the data structures and code used by the VFS and concrete file system to do a read. Virtual File Systems (2) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

38 © 2006 Matt Welsh – Harvard University 38 Keeping track of Free Blocks Two methods: Linked list of free blocks Bitmap structure

39 © 2006 Matt Welsh – Harvard University 39 Figure 4-22. (a) Storing the free list on a linked list. (b) A bitmap. Keeping Track of Free Blocks (1) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
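A sketch of the bitmap method from Figure 4-22(b), one bit per disk block; the sizes and layout below are illustrative only:

    /* Sketch: bitmap free-block tracking with first-fit allocation. */
    #include <stdint.h>

    #define NBLOCKS 8192
    static uint8_t freemap[NBLOCKS / 8];   /* 0 bit = free, 1 bit = used */

    static int  is_free(uint32_t b)   { return !(freemap[b / 8] & (1u << (b % 8))); }
    static void mark_used(uint32_t b) { freemap[b / 8] |=  (1u << (b % 8)); }
    static void mark_free(uint32_t b) { freemap[b / 8] &= ~(1u << (b % 8)); }

    /* Allocate the first free block, or return -1 if the disk is full. */
    long alloc_block(void) {
        for (uint32_t b = 0; b < NBLOCKS; b++) {
            if (is_free(b)) {
                mark_used(b);
                return (long)b;
            }
        }
        return -1;
    }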

40 © 2006 Matt Welsh – Harvard University 40 Figure 4-24. Quotas are kept track of on a per-user basis in a quota table. Disk Quotas Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

41 © 2006 Matt Welsh – Harvard University 41 Backups to tape are generally made to handle one of two potential problems: Recover from disaster. Recover from stupidity. File System Backups (1) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

42 © 2006 Matt Welsh – Harvard University 42 Figure 4-25. A file system to be dumped. Squares are directories, circles are files. Shaded items have been modified since last dump. Each directory and file is labeled by its i-node number. File System Backups (2) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

43 © 2006 Matt Welsh – Harvard University 43 Figure 4-26. Bitmaps used by the logical dumping algorithm. File System Backups (3) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

44 © 2006 Matt Welsh – Harvard University 44 Figure 4-27. File system states. (a) Consistent. (b) Missing block. (c) Duplicate block in free list. (d) Duplicate data block. File System Consistency Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

45 © 2006 Matt Welsh – Harvard University 45 Figure 4-28. The buffer cache data structures. Caching (1) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

46 © 2006 Matt Welsh – Harvard University 46 Some blocks, such as i-node blocks, are rarely referenced twice within a short interval. Consider a modified LRU scheme, taking two factors into account: Is the block likely to be needed again soon? Is the block essential to the consistency of the file system? Caching (2) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
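A sketch of how those two factors could modify plain LRU; the buffer and list structures below are invented for illustration:

    /* Sketch: modified LRU on block release, per the two questions above. */
    #include <stdbool.h>

    struct buf {
        struct buf *prev, *next;
        bool likely_reused;   /* e.g., a data block of an active file */
        bool critical;        /* e.g., an i-node block: FS consistency */
        bool dirty;
    };

    void list_put_front(struct buf *b);   /* evicted soonest */
    void list_put_rear(struct buf *b);    /* kept around longest */
    void write_block(struct buf *b);      /* synchronous write to disk */

    void release_block(struct buf *b) {
        /* Factor 1: blocks unlikely to be needed again soon go to the
         * front of the LRU list, so they are evicted first. */
        if (b->likely_reused)
            list_put_rear(b);
        else
            list_put_front(b);

        /* Factor 2: blocks essential to consistency are written out
         * immediately rather than sitting dirty in the cache. */
        if (b->critical && b->dirty) {
            write_block(b);
            b->dirty = false;
        }
    }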

47 © 2006 Matt Welsh – Harvard University 47 Figure 4-29. (a) I-nodes placed at the start of the disk. (b) Disk divided into cylinder groups, each with its own blocks and i-nodes. Reducing Disk Arm Motion Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

48 © 2006 Matt Welsh – Harvard University 48 Figure 4-30. The ISO 9660 directory entry. The ISO 9660 File System Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

49 © 2006 Matt Welsh – Harvard University 49 Rock Ridge extension fields: PX - POSIX attributes. PN - Major and minor device numbers. SL - Symbolic link. NM - Alternative name. CL - Child location. PL - Parent location. RE - Relocation. TF - Time stamps. Rock Ridge Extensions Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

50 © 2006 Matt Welsh – Harvard University 50 Joliet extension fields: Long file names. Unicode character set. Directory nesting deeper than eight levels. Directory names with extensions Joliet Extensions Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

51 © 2006 Matt Welsh – Harvard University 51 Figure 4-31. The MS-DOS directory entry. The MS-DOS File System (1) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

52 © 2006 Matt Welsh – Harvard University 52 Figure 4-32. Maximum partition size for different block sizes. The empty boxes represent forbidden combinations. The MS-DOS File System (2) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

53 © 2006 Matt Welsh – Harvard University 53 Figure 4-33. A UNIX V7 directory entry. The UNIX V7 File System (1) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

54 © 2006 Matt Welsh – Harvard University 54 Figure 4-34. A UNIX i-node. The UNIX V7 File System (2) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

55 © 2006 Matt Welsh – Harvard University 55 Figure 4-35. The steps in looking up /usr/ast/mbox. The UNIX V7 File System (3) Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
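A sketch of the component-by-component walk shown in Figure 4-35; the helper names below are invented for illustration:

    /* Sketch: resolve /usr/ast/mbox one path component at a time. */
    #include <string.h>

    struct inode;
    struct inode *root_inode(void);                    /* i-node of "/" */
    struct inode *dir_lookup(struct inode *dir, const char *name);

    struct inode *namei(const char *path) {
        struct inode *cur = root_inode();
        char buf[256];
        strncpy(buf, path, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';

        /* "usr" in "/", then "ast" in "/usr", then "mbox" in "/usr/ast" */
        for (char *comp = strtok(buf, "/"); comp && cur;
             comp = strtok(NULL, "/"))
            cur = dir_lookup(cur, comp);   /* read dir, find entry, get i-node */

        return cur;   /* NULL if some component was missing */
    }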

