IT 344: Operating Systems Module 14 File Systems Chia-Chi Teng CTB 265

1 IT 344: Operating Systems Module 14 File Systems Chia-Chi Teng ccteng@byu.edu CTB 265

2 Administrivia
Quiz 3: memory management – to be posted today, due next Monday in class
Project 2
HW 6–8: punted
HW 9: file system report
HW 10: ethics
Labs 6 & 7
Case study

3 File systems
The concept of a file system is simple:
–the implementation of the abstraction for secondary storage (abstraction = files)
–logical organization of files into directories (the directory hierarchy)
–sharing of data between processes, people and machines (access control, consistency, …)

4 Files
A file is a collection of data with some properties
–contents, size, owner, last read/write time, protection, …
Files may also have types
–understood by the file system: device, directory, symbolic link
–understood by other parts of the OS or by runtime libraries: executable, dll, source code, object code, text file, …
Type can be encoded in the file's name or contents
–Windows encodes type in the name: .com, .exe, .bat, .dll, .jpg, .mov, .mp3, …
–old Mac OS stored the name of the creating program along with the file
–Unix has a smattering of both: in content via magic numbers or initial characters

5 Basic operations
Windows:
CreateFile(name, CREATE)
CreateFile(name, OPEN)
ReadFile(handle, …)
WriteFile(handle, …)
FlushFileBuffers(handle, …)
SetFilePointer(handle, …)
CloseHandle(handle, …)
DeleteFile(name)
CopyFile(name)
MoveFile(name)
Unix:
create(name)
open(name, mode)
read(fd, buf, len)
write(fd, buf, len)
sync(fd)
seek(fd, pos)
close(fd)
unlink(name)
rename(old, new)
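The Unix calls on this slide map almost one-to-one onto Python's `os` module, which makes for an easy hands-on sketch. The file name below is made up for the example.

```python
import os, tempfile

# Exercise the Unix-style operations listed above via Python's os module.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

fd = os.open(path, os.O_CREAT | os.O_RDWR)   # create(name) / open(name, mode)
os.write(fd, b"hello, file system")          # write(fd, buf, len)
os.fsync(fd)                                 # sync: force buffers to disk
os.lseek(fd, 0, os.SEEK_SET)                 # seek(fd, pos): back to the start
data = os.read(fd, 18)                       # read(fd, buf, len)
os.close(fd)                                 # close(fd)

new_path = path + ".renamed"
os.rename(path, new_path)                    # rename(old, new)
os.unlink(new_path)                          # unlink(name)

print(data)  # b'hello, file system'
```

Note that `unlink` removes the directory entry; the blocks are reclaimed only when the link count drops to zero and the file is no longer open.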

6 File access methods
Some file systems provide different access methods that specify how the application will access data
–sequential access: read bytes one at a time, in order
–direct access: random access given a block/byte #
–record access: file is an array of fixed- or variable-size records
–indexed access: the FS contains an index to a particular field of each record in a file; apps can find a record based on the value in that field (similar to a DB)
Why do we care about distinguishing sequential from direct access?
–what might the FS do differently in these cases?
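The sequential vs. direct distinction can be seen with an ordinary file; the 4-byte "record" size and contents here are arbitrary choices for illustration.

```python
import os, tempfile

BLOCK = 4  # fixed "record" size, chosen for the example
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    f.write(b"AAAABBBBCCCCDDDD")        # four fixed-size records

# Sequential access: read the records one after another, in order.
with open(path, "rb") as f:
    sequential = [f.read(BLOCK) for _ in range(4)]

# Direct access: seek straight to record #2 without touching the rest.
with open(path, "rb") as f:
    f.seek(2 * BLOCK)
    record2 = f.read(BLOCK)

print(sequential, record2)
```

A file system that knows the access will be sequential can prefetch the next blocks; for direct access, prefetching would mostly waste I/O.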

7 Directories
Directories provide:
–a way for users to organize their files
–a convenient file name space for both users and FS's
Most file systems support multi-level directories
–naming hierarchies (/, /usr, /usr/local, /usr/local/bin, …)
Most file systems support the notion of a current directory
–absolute names: fully qualified, starting from the root of the FS
bash$ cd /usr/local
–relative names: specified with respect to the current directory
bash$ cd /usr/local (absolute)
bash$ cd bin (relative, equivalent to cd /usr/local/bin)

8 Directory internals
A directory is typically just a file that happens to contain special metadata
–directory = list of (name of file, file attributes)
–attributes include such things as: size, protection, location on disk, creation time, access time, …
–the directory list is usually unordered (effectively random)
when you type "ls", the "ls" command sorts the results for you

9 Path name translation
Let's say you want to open "/one/two/three":
fd = open("/one/two/three", O_RDWR);
What goes on inside the file system?
–open directory "/" (well known, can always find)
–search the directory for "one", get location of "one"
–open directory "one", search for "two", get location of "two"
–open directory "two", search for "three", get location of "three"
–open file "three"
–(of course, permissions are checked at each step)
The FS spends lots of time walking down directory paths
–this is why open is separate from read/write (session state)
–the OS will cache prefix lookups to enhance performance
/a/b, /a/bb, /a/bbb all share the "/a" prefix
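The component-by-component walk above can be sketched as a toy lookup. Here directories are modeled as nested dicts mapping names to children; a real file system maps names to i-node numbers and must read each directory off disk, which is exactly why the prefix cache pays off.

```python
# Toy model: directory = dict of name -> child (subdirectory or contents).
root = {"one": {"two": {"three": "file-contents"}}}

def lookup(path):
    node = root                        # start at "/" (well known)
    for component in path.strip("/").split("/"):
        node = node[component]         # search this directory for the name
    return node                        # the final object ("the file")

print(lookup("/one/two/three"))
```

Each loop iteration corresponds to one directory open + search on the slide; caching the result for "/one/two" lets a later lookup of "/one/two/four" skip two of them.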

10 File protection
The FS must implement some kind of protection system
–to control who can access a file (user)
–to control how they can access it (e.g., read, write, or exec)
More generally:
–generalize files to objects (the "what")
–generalize users to principals (the "who", user or program)
–generalize read/write to actions (the "how", or operations)
A protection system dictates whether a given action performed by a given principal on a given object should be allowed
–e.g., you can read or write your files, but others cannot
–e.g., you can read /etc/motd but you cannot write to it

11 Model for representing protection
Two different ways of thinking about it:
–access control lists (ACLs): for each object, keep a list of principals and each principal's allowed actions
–capabilities: for each principal, keep a list of objects and the principal's allowed actions
Both can be represented with the following matrix (a column is an object's ACL; a row is a principal's capability list):

principals \ objects   /etc/passwd   /home/gribble   /home/guest
root                   rw
gribble                r             rw              r
guest                                                r
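One plausible reading of the flattened matrix on this slide, expressed as a dict of dicts: slicing by row gives a principal's capability list, slicing by column gives an object's ACL. The exact rights placement is my reconstruction of the garbled table, not authoritative.

```python
# Protection matrix: matrix[principal][object] = allowed actions.
matrix = {
    "root":    {"/etc/passwd": "rw"},
    "gribble": {"/etc/passwd": "r", "/home/gribble": "rw", "/home/guest": "r"},
    "guest":   {"/home/guest": "r"},
}

def capabilities(principal):
    # One ROW: everything this principal may do, object by object.
    return matrix[principal]

def acl(obj):
    # One COLUMN: everyone who may touch this object, and how.
    return {p: rights[obj] for p, rights in matrix.items() if obj in rights}

print(acl("/home/guest"))
print(capabilities("gribble"))
```

The representation makes the tradeoff on the next slide concrete: revoking access via the ACL is one column edit, while revoking a handed-out capability means finding every row that holds it.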

12 ACLs vs. Capabilities
Capabilities are easy to transfer
–they are like keys: you can hand them off
–they make sharing easy
ACLs are easier to manage
–object-centric, easy to grant and revoke
–to revoke a capability, you need to keep track of the principals that have it; hard to do, given that principals can hand off capabilities
ACLs grow large when an object is heavily shared
–can simplify by using "groups": put users in groups, put groups in ACLs
you are all in the "VMware powerusers" group on Win2K
–additional benefit: change group membership, and it affects ALL objects that have this group in their ACL

13 The original Unix file system
Dennis Ritchie and Ken Thompson, Bell Labs, 1969
"UNIX rose from the ashes of a multi-organizational effort in the early 1960s to develop a dependable timesharing operating system" – Multics
Designed for a "workgroup" sharing a single system
Did its job exceedingly well
–although it has been stretched in many directions and made ugly in the process
A wonderful study in engineering tradeoffs

14 All disks are divided into five parts …
Boot block
–can boot the system by loading from this block
Superblock
–specifies the boundaries of the next 3 areas, and contains the heads of the freelists of i-nodes and file blocks
i-node area
–contains descriptors (i-nodes) for each file on the disk; all i-nodes are the same size; head of the freelist is in the superblock
File contents area
–fixed-size blocks; head of the freelist is in the superblock
Swap area
–holds processes that have been swapped out of memory

15 So …
You can attach a disk to a dead system …
Boot it up …
Find, create, and modify files …
–because the superblock is at a fixed place, and it tells you where the i-node area and file contents area are
–by convention, the second i-node is the root directory of the volume

16 i-node format
User number
Group number
Protection bits
Times (file last read, file last written, i-node last written)
File code: specifies whether the i-node represents a directory, an ordinary user file, or a "special file" (typically an I/O device)
Size: length of the file in bytes
Block list: locates the contents of the file (in the file contents area)
–more on this soon!
Link count: number of directories referencing this i-node

17 The flat (i-node) file system
Each file is known by a number, which is the number of its i-node
–seriously – 1, 2, 3, etc.!
–why is it called "flat"?
Files are created empty, and grow when extended through writes

18 The tree (directory, hierarchical) file system
A directory is a flat file of fixed-size entries
Each entry consists of an i-node number and a file name:

i-node number   file name
152             .
18              ..
216             my_file
4               another_file
93              oh_my_gosh
144             a_directory

It's as simple as that!

19 The "block list" portion of the i-node (Unix Version 7)
Points to blocks in the file contents area
Must be able to represent very small and very large files. How?
Each i-node contains 13 block pointers
–the first 10 are "direct pointers" (pointers to 512B blocks of file data)
–then single, double, and triple indirect pointers
[diagram: pointers 0–9 point directly at data blocks; pointers 10, 11, and 12 are the single, double, and triple indirect pointers]

20 So …
Only occupies 13 x 4B in the i-node
Can get to 10 x 512B = a 5120B file directly
–(10 direct pointers; blocks in the file contents area are 512B)
Can get to 128 x 512B = an additional 64KB with a single indirect reference
–(the 11th pointer in the i-node gets you to a 512B block in the file contents area that contains 128 4B pointers to blocks holding file data)
Can get to 128 x 128 x 512B = an additional 8MB with a double indirect reference
–(the 12th pointer in the i-node gets you to a 512B block in the file contents area that contains 128 4B pointers to 512B blocks in the file contents area that contain 128 4B pointers to 512B blocks holding file data)

21 Can get to 128 x 128 x 128 x 512B = an additional 1GB with a triple indirect reference
–(the 13th pointer in the i-node gets you to a 512B block in the file contents area that contains 128 4B pointers to 512B blocks in the file contents area that contain 128 4B pointers to 512B blocks in the file contents area that contain 128 4B pointers to 512B blocks holding file data)
Maximum file size is 1GB + a smidge
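The Version 7 arithmetic on the last two slides checks out; with 512B blocks and 4B pointers, each indirect block holds 128 pointers:

```python
# Verify the Version 7 block-list arithmetic.
BLOCK = 512
PTRS = BLOCK // 4                     # 128 pointers per indirect block

direct   = 10 * BLOCK                 # 5,120 B reachable directly
single   = PTRS * BLOCK               # + 64 KB via single indirect
double   = PTRS**2 * BLOCK            # + 8 MB via double indirect
triple   = PTRS**3 * BLOCK            # + 1 GB via triple indirect
max_size = direct + single + double + triple   # "1GB + a smidge"

print(max_size)  # 1082201088
```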

22 A later version of Bell Labs Unix used 12 direct pointers rather than 10
–Why?
Berkeley Unix went to 1KB block sizes
–What's the effect on the maximum file size? 256 x 256 x 256 x 1K = 16GB + a smidge
–What's the price?
Suppose you went to 4KB blocks?
–1K x 1K x 1K x 4K = 4TB + a smidge
Linux? How many i-nodes?
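The same computation generalizes to any block size, which answers the slide's questions directly (the function and its parameter names are mine, for illustration):

```python
def max_file_size(block, n_direct=10, ptr=4):
    """Max file size for a 13-pointer i-node: n_direct direct pointers,
    then single, double, and triple indirect blocks of block//ptr pointers."""
    p = block // ptr                  # pointers per indirect block
    return n_direct * block + (p + p**2 + p**3) * block

print(max_file_size(512))         # V7: ~1 GB + a smidge
print(max_file_size(1024))        # Berkeley 1KB blocks: ~16 GiB + a smidge
print(max_file_size(4096, 12))    # 4KB blocks, 12 direct: ~4 TB + a smidge
```

The price of bigger blocks is internal fragmentation: on average, half a block is wasted per file.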

23 File system consistency
Both i-nodes and file blocks are cached in memory
The "sync" command forces memory-resident disk information to be written to disk
–the system does a sync every few seconds
A crash or power failure between syncs can leave an inconsistent disk
You could reduce the frequency of problems by reducing caching, but performance would suffer big-time

24 i-check: consistency of the flat file system
Is each block on exactly one list?
–create a bit vector with as many entries as there are blocks
–follow the free list and each i-node's block list
–when a block is encountered, examine its bit: if the bit was 0, set it to 1
–if the bit was already 1:
if the block is both in a file and on the free list, remove it from the free list and cross your fingers
if the block is in two files, call support!
–if there are any 0's left at the end, put those blocks on the free list
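The i-check pass is short enough to sketch end to end; the tiny 8-block disk and its lists below are invented for the example.

```python
# i-check sketch: every block should be on exactly one list.
N_BLOCKS = 8
free_list   = [5, 6, 7]
file_blocks = {"inode-2": [0, 1], "inode-3": [2, 3]}   # block 4 is on no list

seen = [0] * N_BLOCKS                 # the bit vector
for b in free_list:
    seen[b] = 1
for blocks in file_blocks.values():
    for b in blocks:
        if seen[b]:                   # bit already 1: duplicated block
            raise RuntimeError(f"block {b} is on two lists - call support!")
        seen[b] = 1

# Any block still unmarked is lost: put it back on the free list.
lost = [b for b in range(N_BLOCKS) if not seen[b]]
free_list.extend(lost)
print(lost)  # [4]
```

This sketch treats a file/free-list duplicate the same as a file/file duplicate; a real i-check distinguishes the two, as the slide notes.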

25 d-check: consistency of the directory file system
Do the directories form a tree?
Does the link count of each file equal the number of directory links to it?
–I will spare you the details
–uses a zero-initialized vector of counters, one per i-node
–walk the tree, then visit every i-node
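Those spared details fit in a few lines. A sketch of the counting pass, with a made-up directory tree in which i-node 4's stored link count is deliberately wrong:

```python
# d-check sketch: compare stored link counts against observed directory links.
inode_link_counts = {2: 1, 3: 2, 4: 2}      # as recorded in the i-nodes
directory_tree = {                          # directory -> [(name, i-node #)]
    "/":    [("a", 3), ("dir", 5)],
    "/dir": [("b", 3), ("c", 2), ("d", 4)],
}

# Walk the tree, counting references to each i-node.
observed = {}
for entries in directory_tree.values():
    for _name, ino in entries:
        observed[ino] = observed.get(ino, 0) + 1

# Then visit every i-node and compare.
bad = {ino for ino, n in inode_link_counts.items() if observed.get(ino, 0) != n}
print(sorted(bad))  # [4]
```

I-node 3 is fine (linked twice as "a" and "b", stored count 2); i-node 4 claims two links but only one directory entry references it.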

26 File Allocation Table (FAT)
In March 1983, IBM launched the PC XT, which featured a "massive" 10 MB hard disk
–DOS 2.0 supported hierarchical directories
–FAT12
In 1984, IBM released the PC AT, which featured a 20 MB hard disk
–DOS 3.0
–FAT16
Windows 95: FAT32
–supports > 4GB partitions

27 All partitions are divided into …
Boot Sector
Reserved Sectors
FAT #1
FAT #2
Root Directory (FAT16)
Data Region (to the end of the partition)

28 Sector & Cluster
Sector size: 512B
Clusters: equal-size blocks in the data region, > 2KB
FAT16
–max 2^16 clusters (64K)
–max cluster size 32KB
–max partition size 2GB
FAT32
–max 2^28 clusters (256M)
–max cluster size 32KB
–max partition size 8TB

29 FAT Entries
Each file occupies one or more clusters
–chain of clusters: a singly linked list
The list of FAT entries maps one-to-one to clusters
–FAT16: up to 2^16
–FAT32: up to 2^28
Each entry records one of five things:
–cluster # of the next cluster in the chain
–end of cluster chain (EOC)
–free/unused (0)
–bad cluster
–reserved
No free-cluster list !!! (chkdsk)
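A minimal sketch of the FAT as an array indexed by cluster number, with the marker values chosen for the example (real FAT variants use specific reserved bit patterns for EOC and bad clusters):

```python
# Toy FAT: fat[c] holds the next cluster in c's chain, EOC, or FREE.
FREE, EOC = 0, -1
fat = [FREE] * 16
# A file occupying clusters 3 -> 7 -> 4, then end of chain:
fat[3], fat[7], fat[4] = 7, 4, EOC

def cluster_chain(start):
    """Follow a file's singly linked cluster chain from its starting cluster."""
    chain, c = [], start
    while c != EOC:
        chain.append(c)
        c = fat[c]
    return chain

# No free-cluster list: finding free space means scanning for FREE entries,
# which is also what a consistency check like chkdsk does.
free_clusters = [c for c, entry in enumerate(fat) if entry == FREE]

print(cluster_chain(3))  # [3, 7, 4]
```

The directory entry (next slide) supplies only the starting cluster; everything after that comes from walking the FAT.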

30 Directory
A special file with a table of file/subdirectory entries
Directory entries:
–32 bytes
–starting cluster #
–max file size 4GB
–undelete ???

31 To De-frag or Not to De-frag
DOS
Unix/Linux
Windows

32 Protection
Objects: individual files
Principals: owner/group/world
Actions: read/write/execute
This is pretty simple and rigid, but it has proven to be about what we can handle!

33 Review: Model for representing protection
Two different ways of thinking about it:
–access control lists (ACLs): for each object, keep a list of principals and each principal's allowed actions
–capabilities: for each principal, keep a list of objects and the principal's allowed actions
Both can be represented with the following matrix (a column is an object's ACL; a row is a principal's capability list):

principals \ objects   /etc/passwd   /home/gribble   /home/guest
root                   rw
gribble                r             rw              r
guest                                                r

34 Review: ACLs vs. Capabilities
Capabilities are easy to transfer
–they are like keys: you can hand them off
–they make sharing easy
ACLs are easier to manage
–object-centric, easy to grant and revoke
–to revoke a capability, you need to keep track of the principals that have it; hard to do, given that principals can hand off capabilities
ACLs grow large when an object is heavily shared
–can simplify by using "groups": put users in groups, put groups in ACLs
you are all in the "VMware powerusers" group on Win2K
–additional benefit: change group membership, and it affects ALL objects that have this group in their ACL

35 File sharing
Each user has a "channel table" (or "per-user open file table")
Each entry in the channel table is a pointer to an entry in the system-wide "open file table"
Each entry in the open file table contains a file offset (file pointer) and a pointer to an entry in the "memory-resident i-node table"
If a process opens an already-open file, a new open file table entry is created (with a new file offset), pointing to the same entry in the memory-resident i-node table
If a process forks, the child gets a copy of the channel table (and thus the same file offset)
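A toy model of the three-table structure makes the two sharing cases concrete. The class and variable names below are mine, not kernel structure names.

```python
inode_table = {42: {"size": 100}}               # memory-resident i-node table

class OpenFileEntry:
    """System-wide open file table entry: a file offset plus an i-node ref."""
    def __init__(self, ino):
        self.offset, self.ino = 0, ino

def open_file(channel_table, ino):
    """Opening always creates a NEW open file table entry (new offset)."""
    channel_table.append(OpenFileEntry(ino))
    return len(channel_table) - 1               # the "fd" is just an index

u1, u2 = [], []                                 # per-user channel tables
fd1 = open_file(u1, 42)                         # user 1 opens i-node 42
fd2 = open_file(u2, 42)                         # user 2 opens the same file
u1[fd1].offset = 50                             # user 1 reads ahead

# Separate opens -> separate offsets:
print(u1[fd1].offset, u2[fd2].offset)           # 50 0

# fork() copies the channel table (the POINTERS), so parent and child
# share the same open file table entry, and thus the same offset:
child = list(u1)
child[fd1].offset = 75
print(u1[fd1].offset)                           # 75
```

This is why two processes that each `open` a log file interleave from independent positions, while a forked child appending to an inherited descriptor advances its parent's offset too.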

36 [diagram: per-user channel tables (User 1, User 2, User 3) point into the system-wide open file table; each open file table entry holds a file offset and points into the memory-resident i-node table]

37 Administrivia
HW 7 & 8 solutions will be posted today
4 weeks till the end of the semester
No lab this week
Project 2 status check; turn in write-up
Case study: http://www.et.byu.edu/groups/it344/10wi/Class%20Notes/CaseStudies/CaseStudies.htm#_Assignments

38 IT 344: Operating Systems Module 15 BSD UNIX Fast File System Chia-Chi Teng ccteng@byu.edu CTB 265

39 Advanced file system implementations
We've looked at disks
We've looked at file systems generically
We've looked in detail at the implementation of the original Bell Labs UNIX file system
–a great, simple yet practical design
–exemplifies engineering tradeoffs that are pervasive in system design
Now we'll look at a more advanced file system
–Berkeley Software Distribution (BSD) UNIX Fast File System (FFS)
–enhanced performance for the UNIX file system
–at the heart of most UNIX file systems today

40 BSD UNIX FFS
The original (1970) UNIX file system was elegant but slow
–poor disk throughput: far too many seeks, on average
The Berkeley UNIX project did a redesign in the mid-'80s
–McKusick, Joy, Fabry, and Leffler
–improved disk throughput, decreased average request response time
–the principal idea is that FFS is aware of disk structure: it places related things on nearby cylinders to reduce seeks

41 Recall the UNIX disk layout
Boot block
–can boot the system by loading from this block
Superblock
–specifies the boundaries of the next 3 areas, and contains the heads of the freelists of i-nodes and file blocks
i-node area
–contains descriptors (i-nodes) for each file on the disk; all i-nodes are the same size; head of the freelist is in the superblock
File contents area
–fixed-size blocks; head of the freelist is in the superblock
Swap area
–holds processes that have been swapped out of memory

42 Recall the UNIX block list / file content structure
–directory entries point to i-nodes (file headers)
–each i-node contains a bunch of stuff, including 13 block pointers
–the first 10 point to file blocks (i.e., 512B blocks of file data)
–then single, double, and triple indirect indexes
[diagram: pointers 0–9 direct; pointers 10, 11, and 12 single/double/triple indirect]

43 UNIX FS data and i-node placement
The original UNIX FS had two major performance problems:
–data blocks are allocated randomly in aging file systems
blocks for the same file are allocated sequentially when the FS is new
as the FS "ages" and fills, it must allocate blocks freed up when other files are deleted
deleted files are essentially randomly placed
so, blocks for new files become scattered across the disk!
–i-nodes are allocated far from blocks
all i-nodes are at the beginning of the disk, far from the data
traversing file name paths and manipulating files and directories requires going back and forth between i-nodes and data blocks
BOTH of these generate many long seeks!

44 FFS: Cylinder groups
FFS addressed these problems using the notion of a cylinder group
–the disk is partitioned into groups of cylinders
–data blocks from a file are all placed in the same cylinder group
–files in the same directory are placed in the same cylinder group
–the i-node for a file is placed in the same cylinder group as the file's data
Introduces a free space requirement
–to be able to allocate according to cylinder group, the disk must have free space scattered across all cylinders
–in FFS, 10% of the disk is reserved just for this purpose!
good insight: keep the disk partially free at all times!
this is why it may be possible for df to report >100% full!

45 FFS: Increased block size, fragments
The original UNIX FS had 512B blocks
–even more seeking
–small maximum file size (~1GB)
Then a version had 1KB blocks
–still pretty puny
FFS uses a 4KB block size
–allows for very large files (4TB)
–but introduces internal fragmentation
on average, each file wastes 2KB! –why?
worse, the average file size is only about 1KB! –why?
–fix: introduce "fragments", 1KB pieces of a block

46 FFS: Awareness of hardware characteristics
The original UNIX FS was unaware of disk parameters
FFS parameterizes the FS according to disk and CPU characteristics
–e.g., account for CPU interrupt and processing time, plus disk characteristics, in deciding where to lay out sequential blocks of a file, to reduce rotational latency

47 FFS: Performance
This was a long time ago – look at the relative performance, not the absolute performance!
[performance chart: throughput by block size / fragment size; the original FS maxes out the CPU doing block allocation; 983KB/s is the theoretical disk throughput]

48 FFS: Faster, but less elegant (warts make it faster but ugly)
Multiple cylinder groups
–effectively, treat a single big disk as multiple small disks
–additional free space requirement (this is cheap, though)
Bigger blocks
–but fragments, to avoid excessive fragmentation
Aware of hardware characteristics
–ugh!

49 IT 344: Operating Systems Module 16 Journaling File Systems Chia-Chi Teng ccteng@byu.edu CTB 265

50 In our most recent exciting episodes …
The original Bell Labs UNIX file system
–a simple yet practical design
–exemplifies engineering tradeoffs that are pervasive in system design
–elegant but slow, and performance gets worse as disks get larger
BSD UNIX Fast File System (FFS)
–solves the throughput problem:
larger blocks
cylinder groups
awareness of disk performance details

51 Review: File system consistency
Both i-nodes and file blocks are cached in memory
The "sync" command forces memory-resident disk information to be written to disk
–the system does a sync every few seconds
A crash or power failure between syncs can leave an inconsistent disk
You could reduce the frequency of problems by reducing caching, but performance would suffer big-time

52 i-check: consistency of the flat file system
Is each block on exactly one list?
–create a bit vector with as many entries as there are blocks
–follow the free list and each i-node's block list
–when a block is encountered, examine its bit: if the bit was 0, set it to 1
–if the bit was already 1:
if the block is both in a file and on the free list, remove it from the free list and cross your fingers
if the block is in two files, call support!
–if there are any 0's left at the end, put those blocks on the free list

53 d-check: consistency of the directory file system
Do the directories form a tree?
Does the link count of each file equal the number of directory links to it?
–I will spare you the details
–uses a zero-initialized vector of counters, one per i-node
–walk the tree, then visit every i-node

54 Both are real dogs when a crash occurs
Buffering is necessary for performance
Suppose a crash occurs during a file creation:
1. Allocate a free i-node
2. Point a directory entry at the new i-node
In general, after a crash the disk data structures may be in an inconsistent state
–metadata updated but data not
–data updated but metadata not
–either or both partially updated
fsck (i-check, d-check) is very slow
–must touch every block
–worse as disks get larger!

55 Journaling file systems
Became popular ~2002
There are several options that differ in their details
–Ext3, ReiserFS, XFS, JFS, NTFS
Basic idea
–update metadata, or all data, transactionally ("all or nothing")
–if a crash occurs, you may lose a bit of work, but the disk will be in a consistent state
more precisely, you will be able to quickly get it to a consistent state by using the transaction log/journal – rather than scanning every disk block and checking sanity conditions

56 Where is the Data?
In the file systems we have seen already, the data is in two places:
–on disk
–in in-memory caches
The caches are crucial to performance, but also the source of the potential "corruption on crash" problem
The basic idea of the solution:
–always leave the "home copy" of the data in a consistent state
–make updates persistent by writing them to a sequential (chronological) journal partition/file
–at your leisure, push the updates (in order) to the home copies and reclaim the journal space

57 Redo log
Log: an append-only file containing log records
–<t, begin>: transaction t has begun
–<t, x, v>: transaction t has updated block x and its new value is v
(can log block "diffs" instead of full blocks)
–<t, commit>: transaction t has committed – its updates will survive a crash
Committing involves writing the redo records – the home data needn't be updated at this time

58 If a crash occurs
Recover the log
Redo committed transactions
–walk the log in order and re-execute updates from all committed transactions
–aside: note that update (write) is idempotent: it can be done any positive number of times with the same result
Uncommitted transactions
–ignore them; it's as though the crash occurred a tiny bit earlier …
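Redo recovery fits in a few lines. A sketch using the record shapes from the redo-log slide, with tuples standing in for on-disk log records (the transaction and block names are invented for the example):

```python
# A log with one committed transaction (t1) and one that never committed (t2).
log = [
    ("begin", "t1"), ("update", "t1", "blockA", "v1"), ("commit", "t1"),
    ("begin", "t2"), ("update", "t2", "blockB", "v2"),   # crash before commit
]

def recover(log):
    """Replay updates from committed transactions; ignore uncommitted ones."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    disk = {}
    for rec in log:                       # walk the log in order
        if rec[0] == "update" and rec[1] in committed:
            _, _, block, value = rec
            disk[block] = value           # idempotent: safe to redo twice
    return disk

print(recover(log))  # {'blockA': 'v1'} - t2's update is ignored
```

Idempotence is what makes this safe: running recovery twice, or crashing mid-recovery and starting over, yields the same home-block contents.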

59 Managing the Log Space
A cleaner thread walks the log in order, updating the home locations of the updates in each transaction
–note that idempotence is important here – we may crash while cleaning is going on
Once a transaction has been reflected to the home blocks, it can be deleted from the log

60 Impact on performance
The log is one big contiguous write
–very efficient
And you do fewer synchronous writes
–those are very costly in terms of performance
So journaling file systems can actually improve performance (immensely)
As well as making recovery very efficient

61 Want to know more?
This is a direct ripoff of database system techniques
–but it is not what Microsoft Windows Vista was supposed to be before they backed off – "the file system is a database"
–nor is it a "log-structured file system" – that's a file system in which there is nothing but a log ("the log is the file system")
"New-Value Logging in the Echo Replicated File System", Andy Hisgen, Andrew Birrell, Charles Jerian, Timothy Mann, Garret Swart
–http://citeseer.ist.psu.edu/hisgen93newvalue.html

62 NTFS
http://www.ntfs.com
http://technet.microsoft.com/en-us/library/cc758691(v=ws.10).aspx
Master File Table
Multiple Data Streams

63 S.M.A.R.T.
Self-Monitoring, Analysis and Reporting Technology
http://en.wikipedia.org/wiki/S.M.A.R.T
HDD Duty Cycle

