
1 Introduction to File Systems

2 File System Issues What is the role of files? What is the file abstraction? File naming. How to find the file we want? Sharing files. Controlling access to files. Performance issues - how to deal with the bottleneck of disks? What is the “right” way to optimize file access?

3 Role of Files Persistence: long-lived, data for posterity, non-volatile storage media; semantically meaningful (memorable) names.

4 Abstractions [Diagram: user view (address book, record for Duke CPS) → application (addrfile → fid, byte range*) → file system (fid → block#) → disk subsystem (device, block# → surface, cylinder, sector) → bytes]

5 *File Abstractions UNIX-like files –Sequence of bytes –Operations: open (create), close, read, write, seek Memory-mapped files –Sequence of bytes –Mapped into address space –Page fault mechanism does data transfer Named, possibly typed.

6 Unix File Syscalls
int fd, num, success, bufsize; char data[bufsize]; long offset, pos;
fd = open (filename, mode [, permissions]);  /* mode: O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_APPEND, ... */
success = close (fd);
pos = lseek (fd, offset, mode);              /* offset relative to beginning, current position, or end of file */
num = read (fd, data, bufsize);
num = write (fd, data, bufsize);
Permissions: rwx bits for user, group, others (e.g., 111 100 000 = rwx r-- ---).
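The same syscalls are exposed almost directly by Python's os module; a minimal sketch (the file name and contents are made up for illustration) exercising open, write, lseek, read, and close:

```python
import os
import tempfile

# os.open/os.write/os.lseek/os.read/os.close map directly onto the
# Unix syscalls on the slide.
path = os.path.join(tempfile.mkdtemp(), "zot")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)  # permission bits: rw- r-- ---
os.write(fd, b"hello, file system")
os.lseek(fd, 7, os.SEEK_SET)   # seek relative to the beginning of the file
data = os.read(fd, 4)          # reads the 4 bytes starting at offset 7
os.close(fd)
assert data == b"file"
```

Note how the kernel, not the program, tracks the read-write position: the read after the lseek picks up at offset 7 with no offset argument of its own.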

7 Memory Mapped Files
fd = open (somefile, consistent_mode);
pa = mmap(addr, len, prot, flags, fd, offset);  /* prot: R, W, X, none; flags: Shared, Private, Fixed, Noreserve */
[Diagram: len bytes of the file at fd + offset are mapped into the VAS at pa] Reading is performed by load instructions.
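Python's mmap module wraps this syscall; a small sketch (file name and size are arbitrary) showing that stores and loads on the mapping, not read()/write() calls, move the data:

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mapped")
with open(path, "wb") as f:
    f.write(b"\0" * 4096)            # the file must exist with a fixed length

fd = os.open(path, os.O_RDWR)
m = mmap.mmap(fd, 4096, access=mmap.ACCESS_WRITE)  # a shared, writable mapping
m[0:5] = b"hello"                    # a store into memory, not a write() syscall
first = bytes(m[0:5])                # a load from memory, not a read() syscall
m.flush()
m.close()
os.close(fd)

with open(path, "rb") as f:          # the store reached the file itself
    assert f.read(5) == b"hello"
```

The page fault mechanism does the actual transfer: the first touch of a mapped page faults it in from the file.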

8 Nachos File Syscalls/Operations Create(“zot”); OpenFileId fd; fd = Open(“zot”); Close(fd); char data[bufsize]; Write(data, count, fd); Read(data, count, fd); Limitations: 1. small, fixed-size files and directories 2. single disk with a single directory 3. stream files only: no seek syscall 4. file size is specified at creation time 5. no access control, etc.

9 Functions of File System Determine layout of files and metadata on disk in terms of blocks. Disk block allocation. Bad blocks. Handle read and write system calls Initiate I/O operations for movement of blocks to/from disk. Maintain buffer cache

10 File System Data Structures [Diagram: process descriptor → per-process file ptr array (fds 0-2: stdin, stdout, stderr) → system-wide open file table entries (r-w pos, mode) → system-wide file descriptor table (in-memory copy of inode, ptr to on-disk inode) → file data]

11 UNIX Inodes [Diagram: an inode holds the file attributes plus block addresses; direct entries point at data blocks, and single-, double-, and triple-indirect entries point at blocks of further addresses] Decoupling meta-data from directory entries.

12 File Sharing Between Parent/Child
main(int argc, char *argv[]) {
  char c;
  int fdrd, fdwt, fdpriv;
  if ((fdrd = open(argv[1], O_RDONLY)) == -1) exit(1);
  if ((fdwt = creat(argv[2], 0666)) == -1) exit(1);
  fork();
  if ((fdpriv = open(argv[3], O_RDONLY)) == -1) exit(1);
  while (TRUE) {
    if (read(fdrd, &c, 1) != 1) exit(0);
    write(fdwt, &c, 1);
  }
}

13 File System Data Structures [Diagram: after the fork, the parent's and the forked process's descriptors share per-process file ptr array entries pointing at the same system-wide open file table entries (r-w pos, mode) and file descriptor table entries (in-memory copy of inode, ptr to on-disk inode); a file opened after the fork is private to the process that opened it]

14 Sharing Open File Instances [Diagram: parent and child process objects (user ID, process ID, process group ID, parent PID, signal state, siblings, children) each have their own file descriptor arrays, but corresponding entries point to a single system open file table entry; that shared entry holds the shared seek offset and points to the shared file (inode or vnode)]

15 Directory Subsystem Map filenames to fileids: open (create) syscall. Create kernel data structures. Maintain naming structure (unlink, mkdir, rmdir)

16 Pathname Resolution "cps110/current/Proj/proj3" [Diagram: resolution starts at the index node of the working directory, whose directory node maps "cps110" to an inode#; each directory node in turn maps the next component (current, Proj, proj3) to the inode# of the next node, ending at the data file's inode with its file attributes]
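The component-at-a-time walk on this slide can be sketched with a toy in-memory directory tree (the names come from the slide; the dict-based "inodes" are purely illustrative):

```python
# Toy pathname resolution: each directory "inode" is a dict mapping a
# component name to the next inode (a dict for a directory, bytes for a file).
fs = {
    "cps110": {
        "current": {
            "Proj": {"proj3": b"proj3 data"},
        },
    },
}

def resolve(wd, path):
    """Walk the path one component at a time, as the kernel does."""
    node = wd                        # start at the working directory's inode
    for component in path.split("/"):
        if not isinstance(node, dict):
            raise NotADirectoryError(component)
        node = node[component]       # directory lookup: name -> inode
    return node

assert resolve(fs, "cps110/current/Proj/proj3") == b"proj3 data"
```

Each component costs a directory lookup, which is why long pathnames make resolution expensive (a point slide 94 returns to).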

17 Access Patterns Along the Way [Diagram: request flow Proc → Dir Subsys → File Sys → Device Subsys] open(/foo/bar/file) generates read(rootdir); read(inode); read(foo); read(inode); read(bar); read(inode) at the lower layers; each subsequent read(fd,buf,sizeof(buf)) generates read(filedatablock); then close(fd).

18 Functions of Device Subsystem In general, deal with device characteristics Translate block numbers (the abstraction of device shown to file system) to physical disk addresses. Device specific intelligent placement of blocks. (subject to change with upgrades in technology) Schedule (reorder?) disk operations

19 What to do about Disks? –Avoid them altogether! Caching –Disk scheduling: reorder outstanding requests to minimize seeks –Layout on disk: placement to minimize disk overhead –Build a better disk (or substitute), e.g., RAID

20 Disk Scheduling for Requests Minimize seek AND rotational delay. Maintain a queue of requests for EACH cylinder –Sort each queue by sector (rotational position) –Within one cylinder, process requests in rotational-position order to minimize rotational delay Move to the next cylinder, in the current direction, once all requests on the old cylinder are processed; continue in that direction until no requests remain there, then reverse. –"Elevator" algorithm
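The elevator ordering can be sketched in a few lines (a toy model, not a driver; the cylinder numbers are illustrative, ignoring the per-cylinder rotational sort):

```python
def elevator(requests, head, moving_up=True):
    """Serve all cylinder requests in the current direction, then reverse.
    requests: cylinder numbers of pending requests; head: current position."""
    pending = sorted(set(requests))
    higher = [c for c in pending if c >= head]   # ahead of the head going up
    lower = [c for c in pending if c < head]     # behind the head
    return higher + lower[::-1] if moving_up else lower[::-1] + higher

# Head at cylinder 53, sweeping toward higher cylinders:
order = elevator([98, 183, 37, 122, 14, 124, 65, 67], head=53)
assert order == [65, 67, 98, 122, 124, 183, 37, 14]
```

The sweep visits every pending cylinder in one direction before reversing, which bounds total arm movement and avoids starving requests at the edges.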

21 Redundant Array of Inexpensive Disks Parallel seeks and data transmission Each record is “smeared” across disks so that some bytes come from each disk –All disks positioned at same cylinder, track –Add extra redundant disk, for error check Reading one file is much faster, if no seek needed Error correction possible, so re-reading on error not needed
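The "extra redundant disk, for error check" in its simplest form is XOR parity; a sketch (disk contents are made up) showing how a lost disk's data is rebuilt from the survivors plus the parity block:

```python
def parity(blocks):
    """XOR parity across same-sized data blocks, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def reconstruct(surviving, parity_block):
    """Rebuild the lost disk's block: XOR of survivors and parity."""
    return parity(surviving + [parity_block])

d0, d1, d2 = b"abcd", b"efgh", b"ijkl"   # one stripe across three data disks
p = parity([d0, d1, d2])                 # stored on the redundant disk
assert reconstruct([d0, d2], p) == d1    # disk 1 lost, its block recovered
```

This works because XOR is its own inverse: d0 ^ d2 ^ (d0 ^ d1 ^ d2) = d1.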

22 File Naming

23 Goals of File Naming Foremost function - to find files (e.g., in open() ), Map file name to file object. To store meta-data about files. To allow users to choose their own file names without undue name conflict problems. To allow sharing. Convenience: short names, groupings. To avoid implementation complications

24 Possible Naming Structures Flat name space - 1 system-wide table, –Unique naming with multiple users is hard. Name conflicts. –Easy sharing, need for protection Per-user name space –Protection by isolation, no sharing –Easy to avoid name conflicts –Associate process with directory to use to resolve names, allow user to change this “current working directory” (cd)

25 Naming Structures Naming network Component names - pathnames –Absolute pathnames - from a designated root –Relative pathnames - from a working directory –Each name carries how to resolve it. Could allow defining short names for files anywhere in the network. This might produce cycles, but it makes naming things more convenient.

26 Full Naming Network* [Diagram: naming network rooted at root, with users Lynn, Jamie, Terry, directories project, proj1, jam, TA, grp1, and files A, B, C, D, E; files are reachable by multiple paths] Examples: /Jamie/lynn/project/D; /Jamie/d; /Jamie/lynn/jam/proj1/C (relative from Terry); A (relative from Jamie). * not Unix

27 Full Naming Network* [Same diagram and example pathnames as the previous slide] * Unix Why?

28 Meta-Data File size File type Protection - access control information History: creation time, last modification, last access. Location of file - which device Location of individual blocks of the file on disk. Owner of file Group(s) of users associated with file

29 Restricting to a Hierarchy Problems with full naming network –What does it mean to “delete” a file? –Meta-data interpretation

30 Operations on Directories ( UNIX ) link (oldpathname, newpathname) - make entry pointing to file unlink (filename) - remove entry pointing to file mknod (dirname, type, device) - used (e.g. by mkdir utility function) to create a directory (or named pipe, or special file) getdents(fd, buf, structsize) - reads dir entries

31 Reclaiming Storage [Diagram: naming network (root, Jo, Jamie, Terry; project, proj1, jam, TA, grp1; files A-E) with a series of unlinks marked X] What should be deallocated?

32 Reclaiming Storage [Same diagram, after the series of unlinks]

33 Reference Counting? [Same diagram, with reference counts (2, 3, 1, 2, 1, 2) annotated on the shared nodes after the series of unlinks]
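The reference-counting idea on this slide can be sketched directly (node names are illustrative; counting works only because Unix keeps the file-naming structure acyclic, as the later slides argue):

```python
# Reference-counted storage reclamation for hard links.
class Inode:
    def __init__(self, name):
        self.name = name
        self.links = 0       # how many directory entries name this inode

freed = []

def link(inode):
    inode.links += 1         # a new directory entry names the inode

def unlink(inode):
    inode.links -= 1         # a directory entry is removed
    if inode.links == 0:
        freed.append(inode.name)   # reclaim blocks and the on-disk inode

d = Inode("D")
link(d); link(d)             # two hard links name D
unlink(d)
assert freed == []           # D is still reachable via the other link
unlink(d)
assert freed == ["D"]        # count hit zero: storage reclaimed
```

With a cycle, two nodes naming each other would keep both counts nonzero forever, which is why the full naming network needs garbage collection instead (next slide).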

34 Garbage Collection [Same diagram: phase 1 marks reachable nodes (*); phase 2 collects the unmarked storage]

35 Restricting to a Hierarchy Problems with full naming network –What does it mean to “delete” a file? –Meta-data interpretation Eliminating cycles –allows use of reference counts for reclaiming file space –avoids garbage collection

36 Given: Naming Hierarchy (because of implementation issues) [Diagram: tree rooted at /, with tmp, usr, etc, bin, vmunix; under usr: ls, sh, project, users, packages; a mount point where a volume root (containing tex, emacs) is grafted; the leaves are files]

37 A Typical Unix File Tree File trees are built by grafting volumes from different devices or from network servers. Each volume is a set of directories and files; a host's file tree is the set of directories and files visible to processes on a given host. In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem. mount (coveredDir, volume) –coveredDir: directory pathname –volume: device –volume root contents become visible at pathname coveredDir [Diagram: / with tmp, usr, etc, bin, vmunix; mount point coveredDir beneath usr]

38 A Typical Unix File Tree [Same slide, continued: after the mount, the volume root's contents (tex, emacs) are visible under the mount point, e.g., /usr/project/packages/coveredDir/emacs]

39 A Typical Unix File Tree [Same slide, full build: mount (coveredDir, volume), where volume is a device specifier or network volume; the volume root's contents become visible at pathname coveredDir, e.g., /usr/project/packages/coveredDir/emacs]

40 Reclaiming Convenience Symbolic links - indirect files filename maps, not to file object, but to another pathname –allows short aliases –slightly different semantics Search path rules

41 Unix File Naming (Hard Links) A Unix file may have multiple names. Each directory entry naming the file is called a hard link, and each inode contains a reference count showing how many hard links name it. [Diagram: directory A (rain: 32, hail: 48) and directory B (wind: 18, sleet: 48) both point to inode 48, whose link count = 2] link system call: link (existing name, new name) creates a new name for an existing file and increments the inode link count. unlink system call ("remove"): unlink(name) destroys the directory entry and decrements the inode link count; if the count reaches 0 and the file is not in active use, its blocks and on-disk inode are freed (recursively).
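On a Unix system this can be observed directly through Python's os module (the file names echo the slide's rain/hail example; st_nlink is the inode's link count):

```python
import os
import tempfile

dirpath = tempfile.mkdtemp()
rain = os.path.join(dirpath, "rain")
hail = os.path.join(dirpath, "hail")

with open(rain, "w") as f:
    f.write("precipitation")

os.link(rain, hail)                   # link syscall: second entry, same inode
assert os.stat(rain).st_nlink == 2    # inode link count incremented
assert os.stat(rain).st_ino == os.stat(hail).st_ino   # one inode, two names

os.unlink(rain)                       # remove one name...
with open(hail) as f:
    assert f.read() == "precipitation"   # ...the data survives via the other
```

Only when the last link is unlinked (and no process holds the file open) does the storage go away.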

42 Unix Symbolic (Soft) Links Unix files may also be named by symbolic (soft) links. –A soft link is a file containing a pathname of some other file. [Diagram: directory A (rain: 32, hail: 48) points to inode 48 (link count = 1); directory B (wind: 18, sleet: 67) points to inode 67, a symlink whose contents are the pathname ../A/hail/0] symlink system call: symlink (existing name, new name) allocates a new file (inode) with type symlink, initializes the file contents with the existing name, and creates a directory entry for the new file with the new name. The target of the link may be removed at any time, leaving a dangling reference. How should the kernel handle recursive soft links? Convenience, but not performance!
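The dangling-reference behavior on this slide can be demonstrated on a Unix system (file names are illustrative):

```python
import os
import tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "hail")
alias = os.path.join(d, "sleet")

with open(target, "w") as f:
    f.write("data")

os.symlink(target, alias)             # alias is a file holding a pathname
assert os.readlink(alias) == target   # its contents are just the name
assert open(alias).read() == "data"   # following it reaches the target

os.unlink(target)                     # remove the target...
assert os.path.lexists(alias)         # ...the symlink file itself remains
dangling = False
try:
    open(alias)                       # ...but following it now fails
except FileNotFoundError:
    dangling = True
assert dangling
```

Contrast with the hard-link demo on the previous slide, where the data survived the unlink: a soft link names a path, not an inode.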

43 Soft vs. Hard Links What’s the difference in behavior?

44 Soft vs. Hard Links What's the difference in behavior? [Diagram: /, with Terry, Lynn, and Jamie all naming the same file]

45 Soft vs. Hard Links What's the difference in behavior? [Same diagram, with one of the links removed (X)]

46 Soft vs. Hard Links What's the difference in behavior? [Same diagram, with a different link removed (X)]

47 File Structure

48 After Resolving Long Pathnames OPEN(“/usr/faculty/carla/classes/cps110/spring02/lectures/lecture13.ppt”,…) Finally Arrive at File What do users seem to want from the file abstraction? What do these usage patterns mean for file structure and implementation decisions? –What operations should be optimized 1st? –How should files be structured? –Is there temporal locality in file usage? –How long do files really live?

49 Know your Workload! File usage patterns should influence design decisions. Do things differently depending on: –How large are most files? How long-lived? Read vs. write activity. Shared often? –Different levels "see" a different workload. [Diagram: feedback loop between usage patterns observed today and file system design and implementation]

50 Generalizations from UNIX Workloads Standard disclaimers that you can't generalize...but anyway... Most files are small (fit into one disk block), although most bytes are transferred from longer files. Most opens are for read mode; most bytes transferred are by read operations. Accesses tend to be sequential and to cover 100% of the file.

51 More on Access Patterns There is significant reuse (re-opens)  most opens go to files repeatedly opened & quickly. Directory nodes and executables also exhibit good temporal locality. –Looks good for caching! Use of temp files is significant part of file system activity in UNIX  very limited reuse, short lifetimes (less than a minute).

52 Access Patterns Along the Way [Diagram: request flow Proc → Dir subsys → F.S.] open(/foo/bar/file) generates read(rootdir); read(inode); read(foo); read(inode); read(bar); read(inode) at the lower layers; each subsequent read(fd,buf,sizeof(buf)) generates read(filedatablock); then close(fd).

53 File Structure Implementation: Mapping File  Block Contiguous –1 block pointer, causes fragmentation, growth is a problem. Linked –each block points to next block, directory points to first, OK for sequential access Indexed –index structure required, better for random access into file.

54 UNIX Inodes [Diagram: an inode holds the file attributes plus block addresses; direct entries point at data blocks, and single-, double-, and triple-indirect entries point at blocks of further addresses] Decoupling meta-data from directory entries.

55 File Allocation Table (FAT) [Diagram: directory entries for Lecture.ppt, Pic.jpg, and Notes.txt each point to the first FAT entry of the file; each FAT entry points to the next block in the chain, ending with an eof mark]
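The FAT's linked allocation can be sketched as a table of next-block pointers (the block numbers and file names are made up; a real FAT is an on-disk array indexed by block number):

```python
EOF = -1
# fat[i] holds the number of the block that follows block i in its file.
fat = {4: 7, 7: 2, 2: EOF,     # Lecture.ppt occupies blocks 4 -> 7 -> 2
       5: 9, 9: EOF}           # Notes.txt occupies blocks 5 -> 9
directory = {"Lecture.ppt": 4, "Notes.txt": 5}   # name -> first block

def blocks_of(name):
    """Follow the FAT chain from the directory entry to eof."""
    chain, b = [], directory[name]
    while b != EOF:
        chain.append(b)
        b = fat[b]             # linked allocation: next pointer per block
    return chain

assert blocks_of("Lecture.ppt") == [4, 7, 2]
```

Random access into a file requires walking the chain from the front, which is the weakness of linked allocation noted on slide 53; FAT mitigates this by keeping the whole table in memory.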

56 Meta-Data File size File type Protection - access control information History: creation time, last modification, last access. Location of file - which device Location of individual blocks of the file on disk. Owner of file Group(s) of users associated with file

57 Access Control for Files Access control lists - detailed list attached to file of users allowed (denied) access, including kind of access allowed/denied. UNIX RWX - owner, group, everyone

58 UNIX access control Each file carries its access control with it: owner UID, group GID, and rwx bits for owner, group, and everybody else, plus a setuid bit. When the setuid bit is set, a process executing the object temporarily assumes the UID of the owner, entering the owner's domain (rights amplification). The owner has chmod and chgrp rights (granting, revoking).

59 More on Access Control Later

60 Files to Blocks: Allocation Clustering Log Structure

61 What to do about Disks? –Avoid them altogether! Caching –Disk scheduling: reorder outstanding requests to minimize seeks. –Layout on disk: placement to minimize disk overhead –Build a better disk (or substitute), e.g., RAID

62 Goal: Good Layout on Disk Placement to minimize disk overhead: can address both seek and rotational latency. Cluster related things together (e.g., an inode and its data, inodes in the same directory (ls command), data blocks of a multi-block file, files in the same directory). Sub-block allocation to reduce fragmentation for small files. Log-Structured File Systems.

63 Effect of Clustering Access time = seek time + rotational delay + transfer time average seek time = 2 ms for an intra-cylinder group seek, let’s say rotational delay = 8 milliseconds for full rotation at 7200 RPM: average delay = 4 ms transfer time = 1 millisecond for an 8KB block at 8 MB/s 8 KB blocks deliver about 15% of disk bandwidth. 64KB blocks/clusters deliver about 50% of disk bandwidth. 128KB blocks/clusters deliver about 70% of disk bandwidth. Actual performance will likely be better with good disk layout, since most seek/rotate delays to read the next block/cluster will be “better than average”.
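The slide's bandwidth figures follow from its access-time formula; a short check (exact fractions come out near the slide's rounded 15%/50%/70%):

```python
SEEK_MS, ROT_MS, RATE_MB_S = 2.0, 4.0, 8.0   # figures from the slide

def bandwidth_fraction(block_kb):
    """Fraction of raw disk bandwidth delivered per access:
    transfer time over total access time (seek + rotation + transfer)."""
    transfer_ms = block_kb / RATE_MB_S        # 8 MB/s is 8 KB per ms
    access_ms = SEEK_MS + ROT_MS + transfer_ms
    return transfer_ms / access_ms

assert round(bandwidth_fraction(8), 2) == 0.14    # slide says about 15%
assert round(bandwidth_fraction(64), 2) == 0.57   # slide says about 50%
assert round(bandwidth_fraction(128), 2) == 0.73  # slide says about 70%
```

The fixed per-access overhead (6 ms of seek plus rotation) is amortized over more transfer time as the cluster grows, which is the whole argument for clustering.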

64 Block Allocation to Disk Layout The level of indirection in the file block maps allows flexibility in file layout. “File system design is 99% block allocation.” [McVoy] Competing goals for block allocation: –allocation cost –bandwidth for high-volume transfers –efficient directory operations Goal: reduce disk arm movement and seek overhead. metric of merit: bandwidth utilization

65 FFS and LFS Two different approaches to block allocation: –Cylinder groups in the Fast File System (FFS) [McKusick81] clustering enhancements [McVoy91], and improved cluster allocation [McKusick: Smith/Seltzer96] FFS can also be extended with metadata logging [e.g., Episode] –Log-Structured File System (LFS) proposed in [Douglis/Ousterhout90] implemented/studied in [Rosenblum91] BSD port, sort of maybe: [Seltzer93] extended with self-tuning methods [Neefe/Anderson97] –Other approach: extent-based file systems

66 FFS Cylinder Groups FFS defines cylinder groups as the unit of disk locality, and it factors locality into allocation choices. –typical: thousands of cylinders, dozens of groups –Strategy: place “related” data blocks in the same cylinder group whenever possible. seek latency is proportional to seek distance –Smear large files across groups: Place a run of contiguous blocks in each group. –Reserve inode blocks in each cylinder group. This allows inodes to be allocated close to their directory entries and close to their data blocks (for small files).

67 FFS Allocation Policies 1. Allocate file inodes close to their containing directories. For mkdir, select a cylinder group with a more-than-average number of free inodes. For creat, place inode in the same group as the parent. 2. Concentrate related file data blocks in cylinder groups. Most files are read and written sequentially. Place initial blocks of a file in the same group as its inode. How should we handle directory blocks? Place adjacent logical blocks in the same cylinder group. Logical block n+1 goes in the same group as block n. Switch to a different group for each indirect block.

68 Allocating a Block 1. Try to allocate the rotationally optimal physical block after the previous logical block in the file. Skip rotdelay physical blocks between each logical block. (rotdelay is 0 on track-caching disk controllers.) 2. If not available, find another block at a nearby rotational position in the same cylinder group We’ll need a short seek, but we won’t wait for the rotation. If not available, pick any other block in the cylinder group. 3. If the cylinder group is full, or we’re crossing to a new indirect block, go find a new cylinder group. Pick a block at the beginning of a run of free blocks.

69 Representing Small Files Internal fragmentation in the file system blocks can waste significant space for small files. E.g., 1KB files waste 87% of disk space (and bandwidth) in a naive file system with an 8KB block size. Most files are small: one study [Irlam93] shows a median of 22KB. FFS solution: optimize small files for space efficiency. –Subdivide blocks into 2/4/8 fragments (or just frags). –Free block maps contain one bit for each fragment. To determine if a block is free, examine bits for all its fragments. –The last block of a small file is stored on fragment(s). If multiple fragments they must be contiguous.

70 Small Files with Bigger Blocks Internal fragmentation in the file system blocks can waste significant space for small files. E.g., 1KB files waste 87% of disk space (and bandwidth) in a naive file system with an 8KB block size. Most files are small: one study [Irlam93] shows a median of 22KB. FFS solution: optimize small files for space efficiency. –Subdivide blocks into 2/4/8 fragments (or just frags). –Free block maps contain one bit for each fragment. To determine if a block is free, examine bits for all its fragments. –The last block of a small file is stored on fragment(s). If multiple fragments they must be contiguous.

71 Clustering in FFS Clustering improves bandwidth utilization for large files read and written sequentially. Allocate clumps/clusters/runs of blocks contiguously; read/write the entire clump in one operation with at most one seek. –Typical cluster sizes: 32KB to 128KB. FFS can allocate contiguous runs of blocks “most of the time” on disks with sufficient free space. –This (usually) occurs as a side effect of setting rotdelay = 0. Newer versions may relocate to clusters of contiguous storage if the initial allocation did not succeed in placing them well. –Must modify buffer cache to group buffers together and read/write in contiguous clusters.

72 Log-Structured File Systems Assumption: Cache is effectively filtering out reads so we should optimize for writes Basic Idea: manage disk as an append-only log (subsequent writes involve minimal head movement) Data and meta-data (mixed) accumulated in large segments and written contiguously Reads work as in UNIX - once inode is found, data blocks located via index. Cleaning an issue - to produce contiguous free space, correcting fragmentation developing over time. Claim: LFS can use 70% of disk bandwidth for writing while Unix FFS can use only 5-10% typically because of seeks.

73 LFS logs In LFS, all block and metadata allocation is log-based. –LFS views the disk as "one big log" (logically). –All writes are clustered and sequential/contiguous. Intermingles metadata and blocks from different files. –Data is laid out on disk in the order it is written. –No-overwrite allocation policy: if an old block or inode is modified, write it to a new location at the tail of the log. –LFS uses (mostly) the same metadata structures as FFS; only the allocation scheme is different. Cylinder group structures and free block maps are eliminated. Inodes are found by indirecting through a new map.

74 LFS Data Structures on Disk Inode – in log, same as FFS Inode map – in log, locates position of inode, version, time of last access Segment summary – in log, identifies contents of segment (file#, offset for each block in segment) Segment usage table – in log, counts live bytes in segment and last write time Checkpoint region – fixed location on disk, locates blocks of inode map, identifies last checkpoint in log. Directory change log – in log, records directory operations to maintain consistency of ref counts in inodes

75 Structure of the Log [Diagram: checkpoint region at a fixed disk location, followed by log segments containing segment summary/usage/dirlog blocks, inode map blocks, inodes, directory node D1, and data blocks (File 1 blocks 1 and 2, File 2); the rest of the disk is clean]

76 Writing the Log in LFS 1.LFS “saves up” dirty blocks and dirty inodes until it has a full segment (e.g., 1 MB). –Dirty inodes are grouped into block-sized clumps. –Dirty blocks are sorted by (file, logical block number). –Each log segment includes summary info and a checksum. 2. LFS writes each log segment in a single burst, with at most one seek. –Find a free segment “slot” on the disk, and write it. –Store a back pointer to the previous segment. Logically the log is sequential, but physically it consists of a chain of segments, each large enough to amortize seek overhead.
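The no-overwrite, append-at-the-tail policy can be sketched as a list standing in for the disk (the file names are illustrative; segment batching and summaries are omitted):

```python
log = []   # the "disk": one entry per block ever written, in write order
live = {}  # (file, blockno) -> index in log of the current (live) version

def lfs_write(file, blockno, data):
    """No-overwrite allocation: the new version goes to the log tail."""
    live[(file, blockno)] = len(log)    # the map now points at the tail
    log.append((file, blockno, data))   # the old version is left in place

lfs_write("file1", 1, b"v1")
lfs_write("file1", 1, b"v2")            # supersedes ("kills") the first copy
lfs_write("file3", 1, b"x")

assert log[0] == ("file1", 1, b"v1")    # the dead copy still occupies space
assert log[live[("file1", 1)]][2] == b"v2"   # reads indirect through the map
dead = [i for i in range(len(log)) if live.get((log[i][0], log[i][1])) != i]
assert dead == [0]                      # the cleaner's job: reclaim these
```

Every write is an append, so writes need at most one seek; the price is the dead space the cleaner must later reclaim (slides 80-85).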

77 Growth of the Log write (file1, block1); creat (D1/file3); write (file3, block1) [Diagram: the new versions of File 1 block 1, D1, File 3, and their inodes are appended at the tail of the log, after the previously written segments]

78 Death in the Log write (file1, block1); creat (D1/file3); write (file3, block1) [Same diagram: the superseded copies of File 1 block 1, D1, and their inodes earlier in the log are now dead]

79 Writing the Log: the Rest of the Story 1. LFS cannot always delay writes long enough to accumulate a full segment; sometimes it must push a partial segment. –fsync, update daemon, NFS server, etc. –Directory operations are synchronous in FFS, and some must be in LFS as well to preserve failure semantics and ordering. 2. LFS allocation and write policies affect the buffer cache, which is supposed to be filesystem-independent. –Pin (lock) dirty blocks until the segment is written; dirty blocks cannot be recycled off the free chain as before. –Endow *indirect blocks with permanent logical block numbers suitable for hashing in the buffer cache.

80 Cleaning in LFS What does LFS do when the disk fills up? 1. As the log is written, blocks and inodes written earlier in time are superseded (“killed”) by versions written later. –files are overwritten or modified; inodes are updated –when files are removed, blocks and inodes are deallocated 2. A cleaner daemon compacts remaining live data to free up large hunks of free space suitable for writing segments. –look for segments with little remaining live data benefit/cost analysis to choose segments –write remaining live data to the log tail –can consume a significant share of bandwidth, and there are lots of cost/benefit heuristics involved.

81 Cleaning the Log [Diagram: log with checkpoint region, D1, File 2, File 1 blocks 1 and 2, File 3, and a second copy of D1; clean space at the tail]

82 Cleaning the Log [Same diagram: the remaining live blocks of the first segment (File 1 block 2, File 2) are copied to the tail of the log]

83 Cleaning the Log [Same diagram: the first segment now holds no live data and is marked clean]

84 Cleaning Issues Must be able to identify which blocks are live Must be able to identify the file to which each block belongs in order to update inode to new location Segment Summary block contains this info –File contents associated with uid (version # and inode #) –Inode entries contain version # (incr. on truncate) –Compare to see if inode points to block under consideration

85 Policies When cleaner cleans – threshold based. How much – 10s at a time until threshold reached. Which segments: –Most fragmented segment is not best choice. –Value of free space in segment depends on stability of live data (approx. age). –Cost/benefit analysis: benefit = free space available (1-u) * age of youngest block; cost = cost to read segment + cost to move live data. –Segment usage table supports this. How to group live blocks.
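The cost/benefit ratio on this slide can be computed directly; a sketch using the form from the Sprite LFS work (cost modeled as 1+u: read the whole segment, write back the live fraction u; the example utilizations and ages are made up):

```python
def cleaning_priority(u, age):
    """Cost/benefit score for cleaning a segment.
    u: fraction of the segment still live; age: age of its youngest block.
    benefit = free space gained (1-u) weighted by stability (age);
    cost = 1 (read the segment) + u (rewrite the live data)."""
    return (1 - u) * age / (1 + u)

# Why "most fragmented is not best choice": a half-full but old, stable
# segment can score far above a nearly empty but recently written one.
old_stable = cleaning_priority(u=0.5, age=100)
young_empty = cleaning_priority(u=0.2, age=5)
assert old_stable > young_empty
```

Young segments are likely to lose more blocks on their own soon, so cleaning them early wastes copying work.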

86 Recovering Disk Contents Checkpoints – define consistent states –Position in log where all data structures are consistent –Checkpoint region (fixed location) – contains the addresses of all blocks of inode map and segment usage table, ptr to last segment written Actually 2 that alternate in case a crash occurs while writing checkpoint region data Roll-forward – to recover beyond last checkpoint –Uses Segment summary blocks at end of log – if we find new inodes, update inode map found from checkpoint –Adjust utilizations in segment usage table –Restore consistency in ref counts within inodes and directory entries pointing to those inodes using Directory operation log (like an intentions list)

87 Recovery of the Log [Diagram: the checkpoint region points at a consistent position in the log; the segments written since the last checkpoint are recovered by roll-forward]

88 Recovery in Unix fsck Traverses the directory structure checking ref counts of inodes Traverses inodes and freelist to check block usage of all disk blocks

89 File Caching What to cache? How to manage the file buffer cache?

90 File Buffer Cache [Diagram: both the file system path from Proc and the V.M. system share the file cache in memory] What do we want to cache? inodes? dir nodes? whole files? disk blocks? The answer will determine where in the request path the cache sits. Virtual memory and the file buffer cache will compete for physical memory. How?

91 Access Patterns Along the Way [Diagram: a cache interposed between Proc and F.S.] open(/foo/bar/file) generates read(rootdir); read(inode); read(foo); read(inode); read(bar); read(inode); each subsequent read(fd,buf,sizeof(buf)) generates read(filedatablock); then close(fd).

92 Why Are File Caches Effective? 1. Locality of reference: storage accesses come in clumps. spatial locality: If a process accesses data in block B, it is likely to reference other nearby data soon. (e.g., the remainder of block B) example: reading or writing a file one byte at a time temporal locality: Recently accessed data is likely to be used again. 2. Read-ahead: if we can predict what blocks will be needed soon, we can prefetch them into the cache. most files are accessed sequentially

93 What to Cache? Locality in File Access Patterns (UNIX Workloads) Most files are small (often fitting into one disk block), although most bytes are transferred from longer files. Accesses tend to be sequential and to cover 100% of the file –Spatial locality –What happens when we cache a huge file? Most opens are for read mode; most bytes transferred are by read operations.

94 There is significant reuse (re-opens)  most opens go to files repeatedly opened & quickly. Directory nodes and executables also exhibit good temporal locality. –Looks good for caching! Use of temp files is significant part of file system activity in UNIX  very limited reuse, short lifetimes (less than a minute). Long absolute pathnames are common in file opens –Name resolution can dominate performance – why? What to Cache? Locality in File Access Patterns (continued)

95 Caching File Blocks [Diagram: read(fd, buf, num) goes from the process descriptor through the per-process file ptr array (stdin, stdout, stderr), the system-wide open file table (r-w pos, mode), and the system-wide file descriptor table (in-memory copy of inode, ptr to on-disk inode); the byte# is translated to a block#, which is looked up in the file buffer cache (shared with V.M.) before going to disk]

96 Caching as "The Answer" Avoid the disk for as many file operations as possible. The cache acts as a filter for the requests seen by the disk. Reads served best – reuse. Delayed writeback will avoid going to disk at all for temp files. [Diagram: Proc accesses the file cache in memory]

97 How to Manage the I/O Cache? Goal: maintain K slots in memory as a cache over a collection of m items on secondary storage (K << m). 1. What happens on the first access to each item? Fetch it into some slot of the cache, use it, and leave it there to speed up access if it is needed again later. 2. How to determine if an item is resident in the cache? Maintain a directory of items in the cache: a hash table. Hash on a unique identifier (tag) for the item (fully associative). 3. How to find a slot for an item fetched into the cache? Choose an unused slot, or select an item to replace according to some policy, and evict it from the cache, freeing its slot.

98 File Block Buffer Cache Most systems use a pool of buffers in kernel memory as a staging area for memory-disk transfers. Buffers with valid data are retained in memory in a buffer cache or file cache, indexed by HASH(vnode, logical block). Each item in the cache is a buffer header pointing at a buffer; blocks from different files may be intermingled in the hash chains. System data structures hold pointers to buffers only when I/O is pending or imminent (a busy bit instead of a refcount; most buffers are "free"). [Diagram: hash chains of buffer headers, plus a free/inactive list from head to tail]

99 Handling Updates in the File Cache 1. Blocks may be modified in memory once they have been brought into the cache. Modified blocks are dirty and must (eventually) be written back. Write-back vs. write-through (104?). 2. Once a block is modified in memory, the write back to disk may not be immediate (synchronous). Delayed writes absorb many small updates with one disk write. How long should the system hold dirty data in memory? Asynchronous writes allow overlapping of computation and disk update activity (write-behind): do the write call for block n+1 while the transfer of block n is in progress. Thus file caches also can improve performance for writes. 3. Knowing data gets to disk: you can't trust a "write" syscall to force it; use fsync.

100 Mechanism for Cache Eviction/Replacement Typical approach: maintain an ordered free/inactive list of slots that are candidates for reuse. –Busy items in active use are not on the list. E.g., some in-memory data structure holds a pointer to the item. E.g., an I/O operation is in progress on the item. –The best candidates are slots that do not contain valid items. Initially all slots are free, and they may become free again as items are destroyed (e.g., as files are removed). –Other slots are listed in order of value of the items they contain. These slots contain items that are valid but inactive: they are held in memory only in the hope that they will be accessed again later.

101 Replacement Policy The effectiveness of a cache is determined largely by the policy for ordering slots/items on the free/inactive list; this defines the replacement policy. A typical cache replacement policy is LRU –Assume hot items used recently are likely to be used again. –Move the item to the tail of the free list on every release. –The item at the front of the list is the coldest inactive item. Other alternatives: –FIFO: replace the oldest item. –MRU/LIFO: replace the most recently used item. [Diagram: hash chains indexed by HASH(vnode, logical block); free/inactive list ordered from head (coldest) to tail]
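The LRU free-list ordering can be sketched with an ordered map standing in for the list (tags and block contents are illustrative; K slots over many on-disk items, as on slide 97):

```python
from collections import OrderedDict

class BufferCache:
    """LRU file cache: K slots; front of the order is coldest."""
    def __init__(self, k):
        self.k = k
        self.slots = OrderedDict()   # tag -> cached block

    def get(self, tag):
        if tag in self.slots:
            self.slots.move_to_end(tag)   # hot again: move to the tail
            return self.slots[tag]
        return None                       # miss: caller must fetch from disk

    def put(self, tag, block):
        self.slots[tag] = block
        self.slots.move_to_end(tag)       # newly filled slot is hottest
        if len(self.slots) > self.k:
            self.slots.popitem(last=False)   # evict the coldest (front)

c = BufferCache(2)
c.put(("f", 0), b"a"); c.put(("f", 1), b"b")
c.get(("f", 0))                  # touch block 0, so block 1 is now coldest
c.put(("f", 2), b"c")            # a third block forces an eviction
assert c.get(("f", 1)) is None   # block 1 was the LRU victim
assert c.get(("f", 0)) == b"a"
```

Swapping the `popitem(last=False)` for `popitem(last=True)` would give the MRU/LIFO alternative mentioned on the slide.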

102 Viewing Memory as a Unified I/O Cache A key role of the I/O system is to manage the page/block cache for performance and reliability. tracking cache contents and managing page/block sharing choreographing movement to/from external storage balancing competing uses of memory Modern systems attempt to balance memory usage between the VM system and the file cache. Grow the file cache for file-intensive workloads. Grow the VM page cache for memory-intensive workloads. Support a consistent view of files across different style of access. unified buffer cache

103 Prefetching To avoid the access latency of moving the data in for that first cache miss. Prediction! “Guessing” what data will be needed in the future. How? –It’s not for free: Consequences of guessing wrong Overhead – removal of useful stuff, disk bandwidth consumed

104 Intrafile prediction Sequential access suggests prefetching block n+1 when n is requested Upon seek (sequentiality is broken) –Stop prefetching –Detect a “stride” or pattern automatically –Depend on hints from program Compiler generated “prefetch” statements User supplied –How often is this issue relevant? Big files, nonsequential files, predictable accesses.
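The simplest intrafile predictor on this slide, "prefetch block n+1 on sequential access, stop on a seek", can be sketched as a tiny state machine (block numbers are illustrative; stride detection and program hints are omitted):

```python
class ReadAhead:
    """One-block read-ahead: prefetch n+1 after a sequential access."""
    def __init__(self):
        self.last = None             # block number of the previous read

    def on_read(self, blockno):
        """Return the block to prefetch, or None if no pattern is seen."""
        sequential = self.last is not None and blockno == self.last + 1
        self.last = blockno
        return blockno + 1 if sequential else None

ra = ReadAhead()
assert ra.on_read(7) is None     # first access: no pattern yet
assert ra.on_read(8) == 9        # sequential: prefetch the next block
assert ra.on_read(3) is None     # a seek broke sequentiality: stop
```

A wrong guess costs disk bandwidth and may evict useful blocks, which is why the predictor stops as soon as sequentiality breaks.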

105 Interfile prediction Related files – what does that mean? –Directory nodes that are ancestors of a cached object must be cached in order to resolve the pathname. –Detection of "file working sets": trees representing program executions are constructed. –Capture semantic relationships among files in a "semantic distance" measure – the SEER system.

