1
Outline Administrative: –Issues? Objective: –File System Design
2
Introducing User Programs into Nachos [diagram: user programs issue MIPS instructions to a MIPS simulator and syscalls to the Nachos kernel; Nachos itself runs as an internal user process, issuing machine instructions and syscalls to the host OS kernel (Solaris) on SPARC hardware] Conceptually: a Nachos thread encapsulates the user program and remains the schedulable entity.
3
Nachos System Calls (Process) userprog/syscall.h SpaceId Exec(char *name, int argc, char **argv, int pipectrl) - Creates a user process by –creating a new address space, –reading the executable file into it, and –creating a new internal thread (via Thread::Fork) to run it. –To start execution of the child process, the kernel sets up the CPU state for the new process and then calls Machine::Run to start the machine simulator executing the specified program's instructions in the context of the newly created child process.
4
Nachos System Calls (Process) userprog/syscall.h Exit(int status) - the user process quits with status returned. The kernel handles an Exit system call by –destroying the process data structures and thread(s), –reclaiming any memory assigned to the process, and –arranging to return the exit status value as the result of the Join on this process, if any. Join(SpaceId pid) - called by a process (the joiner) to wait for the termination of the process (the joinee) whose SpaceId is given by the pid argument. –If the joinee is still active, then Join blocks until the joinee exits. When the joinee has exited, Join returns the joinee's exit status to the joiner.
5
Exec:
StartProcess(char *filename) {
    OpenFile *executable = fileSystem->Open(filename);
    AddrSpace *space;
    if (executable == NULL) {
        printf("Unable to open file %s\n", filename);
        return;
    }
    space = new AddrSpace(executable);
    currentThread->space = space;
    delete executable;          // close file
    space->InitRegisters();     // set the initial register values
    space->RestoreState();      // load page table register
    machine->Run();             // jump to the user program
    ASSERT(FALSE);              // machine->Run never returns; the address
                                // space exits by doing the syscall "exit"
}
6
ExceptionHandler(ExceptionType which) {
    int type = machine->ReadRegister(2);
    if ((which == SyscallException) && (type == SC_Halt)) {
        DEBUG('a', "Shutdown, initiated by user program.\n");
        interrupt->Halt();
    } else {
        printf("Unexpected user mode exception %d %d\n", which, type);
        ASSERT(FALSE);
    }
}
[diagram: syscalls from user programs trap through the MIPS simulator into ExceptionHandler in the Nachos kernel]
Note: system call code must convert user-space addresses to Nachos machine addresses or kernel addresses before they can be dereferenced.
7
AddrSpace::AddrSpace(OpenFile *executable) { ...
    executable->ReadAt((char *)&noffH, sizeof(noffH), 0);
    if ((noffH.noffMagic != NOFFMAGIC) &&
        (WordToHost(noffH.noffMagic) == NOFFMAGIC))
        SwapHeader(&noffH);
    ASSERT(noffH.noffMagic == NOFFMAGIC);
    // how big is address space?
    size = noffH.code.size + noffH.initData.size + noffH.uninitData.size
           + UserStackSize;     // we need to increase the size
                                // to leave room for the stack
    numPages = divRoundUp(size, PageSize);
    size = numPages * PageSize;
    ASSERT(numPages <= NumPhysPages);   // check we're not trying
                                        // to run anything too big --
                                        // at least until we have virtual memory
8
    // first, set up the translation
    pageTable = new TranslationEntry[numPages];
    for (i = 0; i < numPages; i++) {
        pageTable[i].virtualPage = i;   // for now, virtual page # = phys page #
        pageTable[i].physicalPage = i;
        pageTable[i].valid = TRUE;
        pageTable[i].use = FALSE;
        pageTable[i].dirty = FALSE;
        pageTable[i].readOnly = FALSE;  // if the code segment was entirely on
                                        // a separate page, we could set its
                                        // pages to be read-only
    }
    // zero out the entire address space, to zero the uninitialized data
    // segment and the stack segment
    bzero(machine->mainMemory, size);
9
    // then, copy in the code and data segments into memory
    if (noffH.code.size > 0) {
        DEBUG('a', "Initializing code segment, at 0x%x, size %d\n",
              noffH.code.virtualAddr, noffH.code.size);
        executable->ReadAt(&(machine->mainMemory[noffH.code.virtualAddr]),
                           noffH.code.size, noffH.code.inFileAddr);
    }
    if (noffH.initData.size > 0) {
        DEBUG('a', "Initializing data segment, at 0x%x, size %d\n",
              noffH.initData.virtualAddr, noffH.initData.size);
        executable->ReadAt(&(machine->mainMemory[noffH.initData.virtualAddr]),
                           noffH.initData.size, noffH.initData.inFileAddr);
    }
10
Non-contiguous VM [diagram: pageTable entries map virtual pages into mainMemory frames 0 through NumPhysPages - 1] Now: need to know which frames are free, and need to allocate non-contiguously and not based at zero.
11
Introduction to File Systems
12
File System Issues What is the role of files? What is the file abstraction? File naming. How to find the file we want? Sharing files. Controlling access to files. Performance issues - how to deal with the bottleneck of disks? What is the “right” way to optimize file access?
13
Role of Files Persistence long-lived data for posterity non-volatile storage media semantically meaningful (memorable) names
14
Abstractions [diagram: an address book with a record for Duke CPS seen at each layer. User view: the address book record; Application: the file addrfile; File System: fid and byte range*; Disk Subsystem: device and block #; Disk: surface, cylinder, sector]
15
*File Abstractions UNIX-like files –Sequence of bytes –Operations: open (create), close, read, write, seek Memory mapped files –Sequence of bytes –Mapped into address space –Page fault mechanism does data transfer Named, Possibly typed
16
Unix File Syscalls
int fd, num, success, bufsize;
char data[bufsize];
long offset, pos;
fd = open(filename, mode [, permissions]);  // mode: O_RDONLY O_WRONLY O_RDWR O_CREAT O_APPEND...
                                            // permissions: user/grp/others rwx bits, e.g. 111 100 000
success = close(fd);
pos = lseek(fd, offset, mode);              // mode: relative to beginning, current position, or end of file
num = read(fd, data, bufsize);
num = write(fd, data, bufsize);
17
Memory Mapped Files
fd = open(somefile, consistent_mode);
pa = mmap(addr, len, prot, flags, fd, offset);  // prot: R, W, X, none
                                                // flags: Shared, Private, Fixed, Noreserve
[diagram: a region of length len at address pa in the VAS maps to fd + offset]
Reading is performed by load instructions.
18
Nachos File Syscalls/Operations Create(“zot”); OpenFileId fd; fd = Open(“zot”); Close(fd); char data[bufsize]; Write(data, count, fd); Read(data, count, fd); Limitations: 1. small, fixed-size files and directories 2. single disk with a single directory 3. stream files only: no seek syscall 4. file size is specified at creation time 5. no access control, etc.
19
Functions of File System Determine layout of files and metadata on disk in terms of blocks. Disk block allocation. Bad blocks. Handle read and write system calls Initiate I/O operations for movement of blocks to/from disk. Maintain buffer cache
20
File System Data Structures [diagram: a process descriptor points to a per-process file pointer array (entries 0-2: stdin, stdout, stderr); entries point into the system-wide open file table (r-w pos, mode), whose entries point into the system-wide file descriptor table (in-memory copy of inode, ptr to on-disk inode), which leads to the file data]
21
UNIX Inodes [diagram: an inode holds the file attributes and block addresses; direct entries point at data blocks, while singly (1), doubly (2), and triply (3) indirect entries point at blocks of further block addresses] Decoupling meta-data from directory entries
22
File Sharing Between Parent/Child
main(int argc, char *argv[]) {
    char c;
    int fdrd, fdwt, fdpriv;
    if ((fdrd = open(argv[1], O_RDONLY)) == -1)
        exit(1);
    if ((fdwt = creat(argv[2], 0666)) == -1)
        exit(1);
    fork();
    if ((fdpriv = open(argv[3], O_RDONLY)) == -1)
        exit(1);
    while (TRUE) {
        if (read(fdrd, &c, 1) != 1)
            exit(0);
        write(fdwt, &c, 1);
    }
}
23
File System Data Structures [diagram: after the fork, the forked process's descriptor shares the parent's open file table entries (and thus r-w pos and mode); a file opened after the fork gets its own open file table entry, still pointing into the system-wide file descriptor table]
24
Sharing Open File Instances [diagram: parent and child process objects (each with user ID, process ID, process group ID, parent PID, signal state, siblings, children) hold per-process file descriptors that reference a shared entry in the system open file table; that entry holds the shared seek offset and points to the shared file (inode or vnode)]
25
Directory Subsystem Map filenames to file ids: the open (create) syscall. Create kernel data structures. Maintain the naming structure (unlink, mkdir, rmdir)
26
Pathname Resolution [diagram: resolving “cps110/current/Proj/proj3” starting from the index node of the working directory; each directory node maps one component name to the inode# of the next (cps110, then current, then Proj, then proj3), ending at the data file and its file attributes]
27
Access Patterns Along the Way [diagram: a process issues open(/foo/bar/file), then read(fd,buf,sizeof(buf)) three times, then close(fd); the directory subsystem turns the open into read(rootdir), read(inode), read(foo), read(inode), read(bar), read(inode); each data read becomes read(filedatablock) in the file system]
28
Functions of Device Subsystem In general, deal with device characteristics Translate block numbers (the abstraction of device shown to file system) to physical disk addresses. Device specific (subject to change with upgrades in technology) intelligent placement of blocks. Schedule (reorder?) disk operations
29
What to do about Disks? Avoid them altogether! Caching Disk scheduling Idea is to reorder outstanding requests to minimize seeks. Layout on disk –Placement to minimize disk overhead Build a better disk (or substitute) –Example: RAID
30
File Naming
31
Goals of File Naming Foremost function - to find files (e.g., in open()): map file name to file object. To store meta-data about files. To allow users to choose their own file names without undue name conflict problems. To allow sharing. Convenience: short names, groupings. To avoid implementation complications
32
Naming Structures Flat name space - 1 system-wide table. –Unique naming with multiple users is hard; name conflicts. –Easy sharing, but need for protection Per-user name space –Protection by isolation, no sharing –Easy to avoid name conflicts –A register identifies the directory used to resolve names, possibly user-settable (cd)
33
Naming Structures Naming network Component names - pathnames –Absolute pathnames - from a designated root –Relative pathnames - from a working directory –Each name carries how to resolve it. Short names to files anywhere in the network produce cycles, but convenience in naming things.
34
Full Naming Network* [diagram: naming network rooted at root, with users Lynn, Jamie, Terry; files A, B, C, D, E are reachable by multiple paths, e.g. /Jamie/lynn/project/D, /Jamie/d, /Jamie/lynn/jam/proj1/C, TA/grp1 (relative from Terry), A (relative from Jamie)] * not Unix
35
Full Naming Network* [diagram: the same naming network as the previous slide] * not Unix. Why?
36
Meta-Data File size File type Protection - access control information History: creation time, last modification, last access. Location of file - which device Location of individual blocks of the file on disk. Owner of file Group(s) of users associated with file
37
Restricting to a Hierarchy Problems with full naming network –What does it mean to “delete” a file? –Meta-data interpretation
38
Operations on Directories ( UNIX ) link (oldpathname, newpathname) - make entry pointing to file unlink (filename) - remove entry pointing to file mknod (dirname, type, device) - used (e.g. by mkdir utility function) to create a directory (or named pipe, or special file) getdents(fd, buf, structsize) - reads dir entries
39
Reclaiming Storage [diagram: naming network rooted at root (Jo, Jamie, Terry subtrees over files A, B, C, D, E) after a series of unlinks, marked X] What should be deallocated?
40
Reclaiming Storage [diagram: the same network with the unlinked edges removed; some files are now unreachable but still allocated]
41
Reference Counting? [diagram: the same network after the series of unlinks, annotated with reference counts (2, 3, 1, 2, 1, 2) on the shared files]
42
Garbage Collection [diagram: the same network after the series of unlinks; Phase 1 marks reachable nodes (*), Phase 2 collects the unmarked ones]
43
Restricting to a Hierarchy Problems with full naming network –What does it mean to “delete” a file? –Meta-data interpretation Eliminating cycles –allows use of reference counts for reclaiming file space –avoids garbage collection
44
Given: Naming Hierarchy (because of implementation issues) [diagram: tree rooted at / with tmp, usr, etc, bin, vmunix; under usr: ls, sh, project, users, packages; packages is a mount point for a volume whose root contains tex and emacs; leaves are files]
45
A Typical Unix File Tree File trees are built by grafting volumes from different devices or from network servers. Each volume is a set of directories and files; a host’s file tree is the set of directories and files visible to processes on a given host. [diagram: tree rooted at / with tmp, usr, etc, bin, vmunix; under usr: ls, sh, project, users, packages; packages contains the mount point coveredDir] In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem. mount (coveredDir, volume) coveredDir: directory pathname volume: device The volume root contents become visible at pathname coveredDir.
46
A Typical Unix File Tree [diagram: the same tree after the mount; the mounted volume's root (containing tex and emacs) is visible at the mount point, so /usr/project/packages/coveredDir/emacs resolves into the mounted volume]
47
A Typical Unix File Tree [diagram: the same tree and mount as the previous slides] mount (coveredDir, volume) coveredDir: directory pathname volume: device specifier or network volume The volume root contents become visible at pathname coveredDir, e.g. /usr/project/packages/coveredDir/emacs
48
Reclaiming Convenience Symbolic links - indirect files: the filename maps not to a file object but to another pathname –allows short aliases –slightly different semantics Search path rules
49
Outline for Today’s Lecture Administrative Objective: –File system issues continued
50
Unix File Naming (Hard Links) [diagram: directory A holds entries rain: 32 and hail: 48; directory B holds wind: 18 and sleet: 48; inode 48 has link count = 2] A Unix file may have multiple names. link system call: link (existing name, new name) creates a new name for an existing file and increments the inode link count. unlink system call (“remove”): unlink(name) destroys the directory entry and decrements the inode link count; if the count reaches 0 and the file is not in active use, the blocks (recursively) and the on-disk inode are freed. Each directory entry naming the file is called a hard link. Each inode contains a reference count showing how many hard links name it.
51
Unix Symbolic (Soft) Links Unix files may also be named by symbolic (soft) links. –A soft link is a file containing a pathname of some other file. [diagram: directory A holds rain: 32 and hail: 48 (inode 48, link count = 1); directory B holds wind: 18 and sleet: 67; inode 67 is a symlink whose contents are ../A/hail] symlink system call: symlink (existing name, new name) allocates a new file (inode) with type symlink, initializes the file contents with the existing name, and creates a directory entry for the new file with the new name. The target of the link may be removed at any time, leaving a dangling reference. How should the kernel handle recursive soft links? Convenience, but not performance!
52
Soft vs. Hard Links What’s the difference in behavior?
53
Soft vs. Hard Links What’s the difference in behavior? [diagram: /, Terry, Lynn, and Jamie each naming the same file]
54
Soft vs. Hard Links What’s the difference in behavior? [diagram: the same names with one link deleted (X)]
55
Soft vs. Hard Links What’s the difference in behavior? [diagram: after the deletion (X), a hard link keeps the file reachable through the remaining names, while a soft link through the deleted name would dangle]
56
File Structure
57
After Resolving Long Pathnames OPEN(“/usr/faculty/carla/classes/cps110/spring02/lectures/lecture13.ppt”,…) Finally Arrive at File What do users seem to want from the file abstraction? What do these usage patterns mean for file structure and implementation decisions? –What operations should be optimized 1st? –How should files be structured? –Is there temporal locality in file usage? –How long do files really live?
58
Know your Workload! File usage patterns should influence design decisions. Do things differently depending on: –How large are most files? How long-lived? Read vs. write activity. Shared often? –Different levels “see” a different workload. [diagram: feedback loop between usage patterns observed today and file system design and implementation]
59
Generalizations from UNIX Workloads Standard disclaimers that you can’t generalize…but anyway… Most files are small (fit into one disk block), although most bytes are transferred from longer files. Most opens are for read mode, and most bytes transferred are by read operations. Accesses tend to be sequential and to cover 100% of the file.
60
More on Access Patterns There is significant reuse (re-opens): most opens go to files that are repeatedly opened, and re-opened quickly. Directory nodes and executables also exhibit good temporal locality. –Looks good for caching! Use of temp files is a significant part of file system activity in UNIX: very limited reuse, short lifetimes (less than a minute).
61
Access Patterns Along the Way [diagram: a process issues open(/foo/bar/file), then read(fd,buf,sizeof(buf)) three times, then close(fd); the directory subsystem turns the open into read(rootdir), read(inode), read(foo), read(inode), read(bar), read(inode); each data read becomes read(filedatablock) in the file system]
62
File Structure Implementation: Mapping File to Block Contiguous –1 block pointer, causes fragmentation, growth is a problem. Linked –each block points to the next block, the directory points to the first; OK for sequential access Indexed –index structure required; better for random access into the file.
63
UNIX Inodes [diagram: an inode holds the file attributes and block addresses; direct entries point at data blocks, while singly (1), doubly (2), and triply (3) indirect entries point at blocks of further block addresses] Decoupling meta-data from directory entries
64
File Allocation Table (FAT) [diagram: directory entries for Lecture.ppt, Pic.jpg, and Notes.txt point at the first FAT entry of each file; each FAT entry points to the next block in the chain, ending at eof]
65
Meta-Data File size File type Protection - access control information History: creation time, last modification, last access. Location of file - which device Location of individual blocks of the file on disk. Owner of file Group(s) of users associated with file
66
Access Control for Files Access control lists - detailed list attached to file of users allowed (denied) access, including kind of access allowed/denied. UNIX RWX - owner, group, everyone
67
UNIX access control Each file carries its access control with it: rwx bits for the Owner UID, the Group GID, and everybody else, plus a setuid bit. When the setuid bit is set, it allows a process executing the object to assume the UID of the owner temporarily - to enter the owner's domain (rights amplification). The owner has chmod, chgrp rights (granting, revoking).
68
More on Access Control Later
69
The Access Model Authorization problems can be represented abstractly by an access matrix. –each row represents a subject/principal/domain –each column represents an object –each cell: accesses permitted for the {subject, object} pair: read, write, delete, execute, search, control, or any other method In real systems, the access matrix is sparse and dynamic: need a flexible, efficient representation
70
Two Representations ACL - Access Control Lists –Columns of the previous matrix –Permissions attached to objects –ACL for file hotgossip: Terry, rw; Lynn, rw Capabilities –Rows of the previous matrix –Permissions associated with the subject –Tickets, namespace (what it is that one can name) –Capabilities held by Lynn: luvltr, rw; hotgossip, rw
71
Access Control Lists Approach: represent the access matrix by storing its columns with the objects. Tag each object with an access control list (ACL) of authorized subjects/principals. To authorize an access requested by S for O –search O’s ACL for an entry matching S –compare requested access with permitted access –access checks are often made only at bind time
72
Capabilities Approach: represent the access matrix by storing its rows with the subjects. Tag each subject with a list of capabilities for the objects it is permitted to access. –A capability is an unforgeable object reference, like a pointer. –It endows the holder with permission to operate on the object e.g., permission to invoke specific methods –Typically, capabilities may be passed from one subject to another. Rights propagation and confinement problems
73
Dynamics of Protection Schemes How to endow software modules with appropriate privilege? –What mechanism exists to bind principals with subjects? e.g., setuid syscall, setuid bit –What principals should a software module bind to? privilege of creator: but may not be sufficient to perform the service privilege of owner or system: dangerous
74
Dynamics of Protection Schemes How to revoke privileges? What about adding new subjects or new objects? How to dynamically change the set of objects accessible (or vulnerable) to different processes run by the same user? –Need-to-know principle / Principle of minimal privilege –How do subjects change identity to execute a more privileged module? protection domain, protection domain switch (enter)
75
Protection Domains Processes execute in a protection domain, initially inherited from the subject Goal: to be able to change protection domains Introduce a level of indirection Domains become protected objects with operations defined on them: owner, copy, control [access matrix: subjects Terry, Lynn, TA grp, and Domain0; objects gradefile, solutions, proj1, luvltr, hotgossip, and Domain0 itself; example entries: rwx, rw, r, c (copy), o (owner), ctl, enter]
76
Rights propagation: If a domain holds the copy right on some object, it can transfer its right to the object to another domain. If a domain is the owner of some object, it can grant rights to the object, with or without copy, to another domain. If a domain is the owner of, or has the ctl right to, another domain, it can remove a right to an object from that domain. [access matrix: the same subjects and objects as the previous slide, with entries such as rwo, rc, ctl, enter showing propagated rights]
77
Files to Blocks: Allocation Clustering Log Structure
78
What to do about Disks? Avoid them altogether! Caching Disk scheduling –Idea is to reorder outstanding requests to minimize seeks. Layout on disk –Placement to minimize disk overhead Build a better disk (or substitute) –Example: RAID
79
Goal: Good Layout on Disk Placement to minimize disk overhead: can address both seek and rotational latency Cluster related things together (e.g. an inode and its data, inodes in same directory (ls command), data blocks of multi- block file, files in same directory) Sub-block allocation to reduce fragmentation for small files Log-Structured File Systems
80
Effect of Clustering Access time = seek time + rotational delay + transfer time average seek time = 2 ms for an intra-cylinder group seek, let’s say rotational delay = 8 milliseconds for full rotation at 7200 RPM: average delay = 4 ms transfer time = 1 millisecond for an 8KB block at 8 MB/s 8 KB blocks deliver about 15% of disk bandwidth. 64KB blocks/clusters deliver about 50% of disk bandwidth. 128KB blocks/clusters deliver about 70% of disk bandwidth. Actual performance will likely be better with good disk layout, since most seek/rotate delays to read the next block/cluster will be “better than average”.
81
Block Allocation to Disk Layout The level of indirection in the file block maps allows flexibility in file layout. “File system design is 99% block allocation.” [McVoy] Competing goals for block allocation: –allocation cost –bandwidth for high-volume transfers –efficient directory operations Goal: reduce disk arm movement and seek overhead. metric of merit: bandwidth utilization
82
FFS and LFS Two different approaches to block allocation: –Cylinder groups in the Fast File System (FFS) [McKusick81] clustering enhancements [McVoy91], and improved cluster allocation [McKusick: Smith/Seltzer96] FFS can also be extended with metadata logging [e.g., Episode] –Log-Structured File System (LFS) proposed in [Douglis/Ousterhout90] implemented/studied in [Rosenblum91] BSD port, sort of maybe: [Seltzer93] extended with self-tuning methods [Neefe/Anderson97] –Other approach: extent-based file systems
83
FFS Cylinder Groups FFS defines cylinder groups as the unit of disk locality, and it factors locality into allocation choices. –typical: thousands of cylinders, dozens of groups –Strategy: place “related” data blocks in the same cylinder group whenever possible. seek latency is proportional to seek distance –Smear large files across groups: Place a run of contiguous blocks in each group. –Reserve inode blocks in each cylinder group. This allows inodes to be allocated close to their directory entries and close to their data blocks (for small files).
84
FFS Allocation Policies 1. Allocate file inodes close to their containing directories. For mkdir, select a cylinder group with a more-than-average number of free inodes. For creat, place inode in the same group as the parent. 2. Concentrate related file data blocks in cylinder groups. Most files are read and written sequentially. Place initial blocks of a file in the same group as its inode. How should we handle directory blocks? Place adjacent logical blocks in the same cylinder group. Logical block n+1 goes in the same group as block n. Switch to a different group for each indirect block.
85
Allocating a Block 1. Try to allocate the rotationally optimal physical block after the previous logical block in the file. Skip rotdelay physical blocks between each logical block. (rotdelay is 0 on track-caching disk controllers.) 2. If not available, find another block at a nearby rotational position in the same cylinder group. We’ll need a short seek, but we won’t wait for the rotation. If not available, pick any other block in the cylinder group. 3. If the cylinder group is full, or we’re crossing to a new indirect block, go find a new cylinder group. Pick a block at the beginning of a run of free blocks.
86
Representing Small Files Internal fragmentation in the file system blocks can waste significant space for small files. E.g., 1KB files waste 87% of disk space (and bandwidth) in a naive file system with an 8KB block size. Most files are small: one study [Irlam93] shows a median of 22KB. FFS solution: optimize small files for space efficiency. –Subdivide blocks into 2/4/8 fragments (or just frags). –Free block maps contain one bit for each fragment. To determine if a block is free, examine bits for all its fragments. –The last block of a small file is stored on fragment(s). If multiple fragments they must be contiguous.
88
Clustering in FFS Clustering improves bandwidth utilization for large files read and written sequentially. Allocate clumps/clusters/runs of blocks contiguously; read/write the entire clump in one operation with at most one seek. –Typical cluster sizes: 32KB to 128KB. FFS can allocate contiguous runs of blocks “most of the time” on disks with sufficient free space. –This (usually) occurs as a side effect of setting rotdelay = 0. Newer versions may relocate to clusters of contiguous storage if the initial allocation did not succeed in placing them well. –Must modify buffer cache to group buffers together and read/write in contiguous clusters.
89
Log-Structured File Systems Assumption: the cache is effectively filtering out reads, so we should optimize for writes Basic idea: manage the disk as an append-only log (subsequent writes involve minimal head movement) Data and meta-data (mixed) are accumulated in large segments and written contiguously Reads work as in UNIX - once the inode is found, data blocks are located via the index. Cleaning is an issue - to produce contiguous free space, correcting the fragmentation that develops over time. Claim: LFS can use 70% of disk bandwidth for writing, while Unix FFS can typically use only 5-10% because of seeks.
90
LFS logs In LFS, all block and metadata allocation is log- based. –LFS views the disk as “one big log” (logically). –All writes are clustered and sequential/contiguous. Intermingles metadata and blocks from different files. –Data is laid out on disk in the order it is written. –No-overwrite allocation policy: if an old block or inode is modified, write it to a new location at the tail of the log. –LFS uses (mostly) the same metadata structures as FFS; only the allocation scheme is different. Cylinder group structures and free block maps are eliminated. Inodes are found by indirecting through a new map
91
LFS Data Structures on Disk Inode – in log, same as FFS Inode map – in log, locates position of inode, version, time of last access Segment summary – in log, identifies contents of segment (file#, offset for each block in segment) Segment usage table – in log, counts live bytes in segment and last write time Checkpoint region – fixed location on disk, locates blocks of inode map, identifies last checkpoint in log. Directory change log – in log, records directory operations to maintain consistency of ref counts in inodes
92
Structure of the Log [diagram: the log begins at the checkpoint region and contains segments holding segment summary/usage/dirlog blocks, data blocks (File 1 block 1, File 1 block 2, File 2), inodes, inode map blocks, and directory node D1; the remainder of the disk is clean]
93
Writing the Log in LFS 1.LFS “saves up” dirty blocks and dirty inodes until it has a full segment (e.g., 1 MB). –Dirty inodes are grouped into block-sized clumps. –Dirty blocks are sorted by (file, logical block number). –Each log segment includes summary info and a checksum. 2. LFS writes each log segment in a single burst, with at most one seek. –Find a free segment “slot” on the disk, and write it. –Store a back pointer to the previous segment. Logically the log is sequential, but physically it consists of a chain of segments, each large enough to amortize seek overhead.
94
Growth of the Log [diagram: after write(file1, block1), creat(D1/file3), and write(file3, block1), new copies of File 1 block 1, D1, and File 3 are appended at the log tail]
95
Death in the Log [diagram: the same operations as the previous slide; the superseded copies of File 1 block 1 and D1 earlier in the log are now dead]
96
Writing the Log: the Rest of the Story 1. LFS cannot always delay writes long enough to accumulate a full segment; sometimes it must push a partial segment. –fsync, update daemon, NFS server, etc. –Directory operations are synchronous in FFS, and some must be in LFS as well to preserve failure semantics and ordering. 2. LFS allocation and write policies affect the buffer cache, which is supposed to be filesystem- independent. –Pin (lock) dirty blocks until the segment is written; dirty blocks cannot be recycled off the free chain as before. –Endow *indirect blocks with permanent logical block numbers suitable for hashing in the buffer cache.
97
Cleaning in LFS What does LFS do when the disk fills up? 1. As the log is written, blocks and inodes written earlier in time are superseded (“killed”) by versions written later. –files are overwritten or modified; inodes are updated –when files are removed, blocks and inodes are deallocated 2. A cleaner daemon compacts remaining live data to free up large hunks of free space suitable for writing segments. –look for segments with little remaining live data benefit/cost analysis to choose segments –write remaining live data to the log tail –can consume a significant share of bandwidth, and there are lots of cost/benefit heuristics involved.
98
Cleaning the Log
[Diagram, step 1: the cleaner examines a log segment holding D1, File 2, File 1 block 2, File 1 block 1, and File 3, while part of the log is already clean]
99
Cleaning the Log
[Diagram, step 2: the surviving live blocks (File 1 block 2 and File 2) are copied from the old segment to the log tail]
100
Cleaning the Log
[Diagram, step 3: the old segment no longer holds any live data and is marked clean, ready to be reused for a future segment write]
101
Cleaning Issues
Must be able to identify which blocks are live.
Must be able to identify the file to which each block belongs, in order to update its inode to point to the new location.
The segment summary block contains this info:
–File contents are associated with a uid (version # and inode #).
–Inode entries contain a version # (incremented on truncate).
–Compare the two to see if the inode still points to the block under consideration.
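The liveness test described above can be sketched like this. The structures (`SummaryEntry`, `inodeVersion`, `blockMap`) are illustrative stand-ins for the segment summary block and the inode map, not LFS's actual data layout:

```cpp
#include <cassert>
#include <map>
#include <utility>

// One entry of a segment summary block: which file and logical block
// this disk block held when the segment was written.
struct SummaryEntry { int inum; int version; int lblock; int diskAddr; };

// Stand-ins for the inode map and the inodes' block maps.
std::map<int, int> inodeVersion;                   // inum -> current version #
std::map<std::pair<int, int>, int> blockMap;       // (inum, lblock) -> disk addr

// A block is live only if the file still exists at the same version AND
// the inode's block map still points at this disk address.
bool IsLive(const SummaryEntry& e) {
    auto v = inodeVersion.find(e.inum);
    if (v == inodeVersion.end() || v->second != e.version)
        return false;                              // file deleted or truncated
    auto b = blockMap.find({e.inum, e.lblock});
    return b != blockMap.end() && b->second == e.diskAddr;  // else superseded
}
```

The version-number comparison lets the cleaner discard every block of a deleted or truncated file without consulting each block individually.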
102
Policies
When does the cleaner clean? – threshold based.
How much? – a few tens of segments at a time, until a threshold is reached.
Which segments?
–The most fragmented segment is not the best choice.
–The value of free space in a segment depends on the stability of its live data (approximated by age).
–Cost/benefit analysis: benefit = free space reclaimed (1 - u) * age of youngest live block; cost = cost to read the segment + cost to move the live data.
–A segment usage table supports this.
How to group live blocks?
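The cost/benefit heuristic above reduces to a single score per segment. In the LFS formulation, reading a candidate segment costs 1 (the whole segment) and rewriting its live data costs u, so:

```cpp
#include <cassert>

// LFS cleaner cost/benefit score for a candidate segment:
//   benefit/cost = ((1 - u) * age) / (1 + u)
// where u is the fraction of the segment still live and age approximates
// the stability of that live data. Higher score = better to clean.
double CleaningScore(double u, double age) {
    return ((1.0 - u) * age) / (1.0 + u);  // cost = read segment (1) + rewrite live data (u)
}
```

This is why the most fragmented segment is not always the best choice: an old, stable segment with some free space can score higher than a young, heavily fragmented one, because its live data is unlikely to die on its own soon.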
103
Recovering Disk Contents
Checkpoints – define consistent states.
–A position in the log where all data structures are consistent.
–The checkpoint region (at a fixed location) contains the addresses of all blocks of the inode map and segment usage table, plus a pointer to the last segment written.
–There are actually 2 checkpoint regions that alternate, in case a crash occurs while writing checkpoint region data.
Roll-forward – to recover data written beyond the last checkpoint.
–Uses the segment summary blocks at the end of the log: if we find new inodes, update the inode map recovered from the checkpoint.
–Adjust utilizations in the segment usage table.
–Restore consistency between inode ref counts and the directory entries pointing to those inodes, using the directory operation log (like an intentions list).
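The reason for the two alternating checkpoint regions can be shown in a few lines. This is a sketch under assumed names (`Checkpoint`, `ChooseCheckpoint`): each region carries a timestamp and a completeness marker written last, and recovery simply picks the newer region that was fully written:

```cpp
#include <cassert>

// A checkpoint region: timestamp plus a marker written last, so an
// interrupted checkpoint write leaves complete == false.
// (Real checkpoints also hold inode map and segment usage table addresses.)
struct Checkpoint { long timestamp; bool complete; };

// Recovery rule: never trust a half-written checkpoint; otherwise
// take the more recent of the two regions.
const Checkpoint* ChooseCheckpoint(const Checkpoint& a, const Checkpoint& b) {
    if (!a.complete) return &b;        // crash occurred while writing a
    if (!b.complete) return &a;        // crash occurred while writing b
    return a.timestamp > b.timestamp ? &a : &b;
}
```

Because the two regions alternate, at most one of them can be damaged by a crash, so a valid (if slightly stale) checkpoint always survives; roll-forward then recovers whatever was logged after it.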
104
Recovery of the Log
[Diagram: the log from the checkpoint region through D1, File 2, and File 1 blocks 1 and 2; the segments holding File 3 and the new D1 were written since the checkpoint and are recovered by roll-forward]
105
Recovery in Unix
fsck:
–Traverses the directory structure, checking the ref counts of inodes.
–Traverses the inodes and freelist to check the block usage of all disk blocks.
106
File Caching What to cache? How to manage the file buffer cache?
107
File Buffer Cache
What do we want to cache? inodes? directory nodes? whole files? disk blocks?
The answer will determine where in the request path the cache sits.
Virtual memory and the file buffer cache will compete for physical memory. How?
[Diagram: physical memory shared between the file cache and process virtual memory]
108
Access Patterns Along the Way
A process issues: open(/foo/bar/file); read(fd,buf,sizeof(buf)); read(fd,buf,sizeof(buf)); close(fd);
The file system turns these into: read(rootdir); read(inode); read(foo); read(inode); read(bar); read(inode); then read(filedatablock) for the data.
[Diagram: these requests pass through the cache on the way from the process to the file system, so repeated accesses can be absorbed before reaching the disk]
109
Why Are File Caches Effective? 1. Locality of reference: storage accesses come in clumps. spatial locality: If a process accesses data in block B, it is likely to reference other nearby data soon. (e.g., the remainder of block B) example: reading or writing a file one byte at a time temporal locality: Recently accessed data is likely to be used again. 2. Read-ahead: if we can predict what blocks will be needed soon, we can prefetch them into the cache. most files are accessed sequentially
110
What to Cache? Locality in File Access Patterns (UNIX Workloads)
Most files are small (often fitting into one disk block), although most bytes are transferred from longer files.
Accesses tend to be sequential and whole-file (100% of the file is read).
–Spatial locality.
–What happens when we cache a huge file?
Most opens are for read mode, and most bytes transferred are by read operations.
111
What to Cache? Locality in File Access Patterns (continued)
There is significant reuse (re-opens): most opens go to files that have been opened recently and repeatedly.
Directory nodes and executables also exhibit good temporal locality.
–Looks good for caching!
Use of temp files is a significant part of file system activity in UNIX: very limited reuse, short lifetimes (less than a minute).
Long absolute pathnames are common in file opens.
–Name resolution can dominate performance – why?
112
Caching File Blocks
[Diagram: read(fd, buf, num) travels from the process descriptor's per-process file pointer array (slots for stdin, stdout, stderr, ...), through the system-wide open file table (r/w position, mode), to the system-wide file descriptor table, which holds an in-memory copy of the inode with a pointer to the on-disk inode; the requested byte # is translated to a block # and looked up in the file buffer cache, which competes with virtual memory for physical memory, before the file data is read from disk]
113
Caching as “The Answer”
Avoid the disk for as many file operations as possible.
The cache acts as a filter for the requests seen by the disk.
Reads are served best – reuse.
Delayed writeback will avoid going to disk at all for temp files.
[Diagram: the file cache sits in memory between the process and the disk]
114
How to Manage the I/O Cache? Goal: maintain K slots in memory as a cache over a collection of m items on secondary storage (K << m). 1. What happens on the first access to each item? Fetch it into some slot of the cache, use it, and leave it there to speed up access if it is needed again later. 2. How to determine if an item is resident in the cache? Maintain a directory of items in the cache: a hash table. Hash on a unique identifier (tag) for the item (fully associative). 3. How to find a slot for an item fetched into the cache? Choose an unused slot, or select an item to replace according to some policy, and evict it from the cache, freeing its slot.
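The three steps above (fetch on first access, directory lookup, slot selection) can be sketched in one small class. `BlockCache` and its members are hypothetical names, and the eviction order here is plain FIFO to keep the sketch short; the replacement policy itself is discussed later:

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>

// Sketch: K cache slots over m items on "disk" (K << m), with a hash
// directory for residence checks and a FIFO inactive list for eviction.
class BlockCache {
public:
    explicit BlockCache(std::size_t k) : k_(k) {}

    std::string Get(int tag) {
        auto it = dir_.find(tag);
        if (it != dir_.end()) return it->second;   // 2. hit: item is resident
        if (dir_.size() == k_) {                   // 3. full: evict a victim
            dir_.erase(inactive_.front());
            inactive_.pop_front();
        }
        std::string data = Fetch(tag);             // 1. miss: fetch from disk
        dir_[tag] = data;                          //    and leave it cached
        inactive_.push_back(tag);
        return data;
    }

    bool Resident(int tag) const { return dir_.count(tag) > 0; }

private:
    // Stand-in for a disk read keyed by the unique tag.
    std::string Fetch(int tag) { return "block" + std::to_string(tag); }

    std::size_t k_;
    std::unordered_map<int, std::string> dir_;     // the cache directory
    std::list<int> inactive_;                      // eviction candidates, oldest first
};
```

The directory is fully associative: any item can occupy any slot, and the hash on the tag makes the residence check O(1) regardless of K.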
115
File Block Buffer Cache
Most systems use a pool of buffers in kernel memory as a staging area for memory/disk transfers.
Buffers with valid data are retained in memory in a buffer cache or file cache.
Each item in the cache is a buffer header pointing at a buffer; lookup is by HASH(vnode, logical block).
Blocks from different files may be intermingled in the hash chains.
System data structures hold pointers to buffers only when I/O is pending or imminent.
–busy bit instead of refcount
–most buffers are “free”, kept on a free/inactive list from head to tail
116
Handling Updates in the File Cache
1. Blocks may be modified in memory once they have been brought into the cache.
–Modified blocks are dirty and must (eventually) be written back.
–Write-back vs. write-through (104?).
2. Once a block is modified in memory, the write back to disk may not be immediate (synchronous).
–Delayed writes absorb many small updates with one disk write. How long should the system hold dirty data in memory?
–Asynchronous writes allow overlapping of computation and disk update activity (write-behind): do the write call for block n+1 while the transfer of block n is in progress.
–Thus file caches can also improve performance for writes.
3. Knowing the data gets to disk: you must force it, since you can’t trust a “write” syscall alone – fsync.
117
Mechanism for Cache Eviction/Replacement Typical approach: maintain an ordered free/inactive list of slots that are candidates for reuse. –Busy items in active use are not on the list. E.g., some in-memory data structure holds a pointer to the item. E.g., an I/O operation is in progress on the item. –The best candidates are slots that do not contain valid items. Initially all slots are free, and they may become free again as items are destroyed (e.g., as files are removed). –Other slots are listed in order of value of the items they contain. These slots contain items that are valid but inactive: they are held in memory only in the hope that they will be accessed again later.
118
Replacement Policy
The effectiveness of a cache is determined largely by the policy for ordering slots/items on the free/inactive list; this ordering defines the replacement policy.
A typical cache replacement policy is LRU:
–Assume hot items used recently are likely to be used again.
–Move the item to the tail of the free list on every release.
–The item at the front of the list is the coldest inactive item.
Other alternatives:
–FIFO: replace the oldest item.
–MRU/LIFO: replace the most recently used item.
[Diagram: buffers indexed by HASH(vnode, logical block) and chained on the free/inactive list from head to tail]
119
Viewing Memory as a Unified I/O Cache
A key role of the I/O system is to manage the page/block cache for performance and reliability:
–tracking cache contents and managing page/block sharing
–choreographing movement to/from external storage
–balancing competing uses of memory
Modern systems attempt to balance memory usage between the VM system and the file cache:
–Grow the file cache for file-intensive workloads.
–Grow the VM page cache for memory-intensive workloads.
–Support a consistent view of files across different styles of access: a unified buffer cache.
120
Prefetching
Goal: avoid the access latency of moving the data in for that first cache miss.
Prediction! “Guessing” what data will be needed in the future. How?
It’s not for free – consequences of guessing wrong:
–Overhead: removal of useful data from the cache, disk bandwidth consumed.
121
Intrafile prediction
Sequential access suggests prefetching block n+1 when block n is requested.
Upon seek (sequentiality is broken):
–Stop prefetching.
–Detect a “stride” or pattern automatically.
–Depend on hints from the program: compiler-generated “prefetch” statements, or user supplied.
How often is this issue relevant? Big files, nonsequential files, predictable accesses.
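The simplest form of this, sequential read-ahead with prefetching stopped on a seek, can be sketched per open file (names here are illustrative, not from any particular kernel):

```cpp
#include <cassert>

// Per-open-file read-ahead state: remember the last logical block read.
struct OpenFileState { int lastBlock = -1; };

// Called on each logical block read. If the access continues the
// sequential run, return the block to prefetch (n+1); a seek breaks
// the run and returns -1 (no prefetch).
int OnRead(OpenFileState& f, int block) {
    bool sequential = (block == f.lastBlock + 1);
    f.lastBlock = block;
    return sequential ? block + 1 : -1;
}
```

Real systems extend this with stride detection and ramped-up read-ahead windows, but the core test is the same: compare the requested block against the previous one.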
122
Interfile prediction
Related files – what does that mean?
–Directory nodes that are ancestors of a cached object must be cached in order to resolve its pathname.
–Detection of “file working sets”: trees representing program executions are constructed, capturing semantic relationships among files in a “semantic distance” measure – the SEER system.
123
Nachos File System
124
Nachos File Syscalls/Operations
Create(“zot”);
OpenFileId fd;
fd = Open(“zot”);
Close(fd);
char data[bufsize];
Write(data, count, fd);
Read(data, count, fd);
Limitations:
1. small, fixed-size files and directories
2. single disk with a single directory
3. stream files only: no seek syscall
4. file size is specified at creation time
5. no access control, etc.
FileSystem class internal methods:
Create(name, size)
OpenFile = Open(name)
Remove(name)
List()
A single 10-entry directory stores names and disk locations for all currently existing files. A bitmap indicates whether each disk block is in-use or free. The FileSystem data structures (BitMap and Directory) reside on-disk, with a copy in memory.
125
Representing A File in Nachos FileHdr Allocate(...,filesize) length = FileLength() sector = ByteToSector(offset) A file header describes an on-disk file as an ordered sequence of sectors with a length, mapped by a logical-to-physical block map. OpenFile(sector) Read(char* data, bytes) Write(char* data, bytes) OpenFile An OpenFile represents a file in active use, with a seek pointer and read/write primitives for arbitrary byte ranges. once upo n a time /nin a l and far far away,/nlived t he wise and sage wizard. logical block 0 logical block 1 logical block 2 OpenFile* ofd = filesys->Open(“tale”); ofd->Read(data, 10) gives ‘once upon ‘ ofd->Read(data, 10) gives ‘a time/nin ‘ bytes sectors
126
File Metadata
On disk, each file is represented by a FileHdr structure. The FileHdr object is an in-memory copy of this structure.
File attributes may include owner, access control, and times of create/modify/access.
The logical-physical block map (like a translation table) maps file bytes to sectors; the physical block pointers in the block map are sector IDs.
FileHdr* hdr = new FileHdr();
hdr->FetchFrom(sector)
hdr->WriteBack(sector)
The FileHdr is a file system “bookkeeping” structure that supplements the file data itself: these kinds of structures are called filesystem metadata.
A Nachos FileHdr occupies exactly one disk sector. To operate on the file (e.g., to open it), the FileHdr must be read into memory. Any changes to the attributes or block map must be written back to the disk to make them permanent.
127
Representing Large Files
The Nachos FileHdr occupies exactly one disk sector, limiting the maximum file size: with a 128-byte sector, 120 bytes of block map = 30 entries, each mapping a 128-byte sector, so max file size = 3840 bytes.
In Unix, the FileHdr (called an index-node or inode) represents large files using a hierarchical block map: a direct block map (12 entries), an indirect block, and a double indirect block.
Each file system block is a clump of sectors (4KB, 8KB, 16KB). Inodes are 128 bytes, packed into blocks; each inode has 68 bytes of attributes and 15 block map entries.
Suppose block size = 8KB:
–The 12 direct block map entries in the inode can map 96KB of data.
–One indirect block (referenced by the inode) can map 16MB of data.
–One double indirect block pointer in the inode maps 2K indirect blocks.
–Maximum file size is 96KB + 16MB + (2K*16MB) +...
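The arithmetic above can be checked directly. A sketch assuming 4-byte block pointers (so an 8 KB block holds 2K pointers, matching the figures in the slide):

```cpp
#include <cassert>

// Maximum file size for a Unix-style inode with direct entries, one
// indirect block, and one double indirect block. Assumes 4-byte block
// pointers, so a block holds blockSize/4 pointers.
long long MaxFileSize(long long blockSize, int directEntries) {
    long long ptrsPerBlock = blockSize / 4;
    long long direct      = directEntries * blockSize;   // e.g., 12 * 8KB = 96KB
    long long indirect    = ptrsPerBlock * blockSize;    // e.g., 2K * 8KB = 16MB
    long long dblIndirect = ptrsPerBlock * indirect;     // e.g., 2K * 16MB = 32GB
    return direct + indirect + dblIndirect;
}
```

Note how each level of the hierarchy multiplies capacity by the pointers-per-block factor, which is why one extra level of indirection dwarfs everything below it.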
128
Nachos Directories
A directory is a set of file names, supporting lookup by symbolic name. Each directory is a file containing a set of mappings from name to FileHdr.
Directory(entries)
sector = Find(name)
Add(name, sector)
Remove(name)
In Nachos, each directory entry is a fixed-size slot with space for a FileNameMaxLen-byte name. Entries or slots are found by a linear scan.
A directory entry may hold a pointer to another directory, forming a hierarchical name space.
[Diagram: a directory file with entries rain: 32, hail: 48, wind: 18, snow: 62; e.g., “rain” maps to the FileHdr stored in sector 32]
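The linear-scan lookup can be sketched as follows. This mirrors the shape of Nachos's Directory::Find but is a simplified stand-alone version, with the entry layout and `FileNameMaxLen` value assumed for illustration:

```cpp
#include <cassert>
#include <cstring>

const int FileNameMaxLen = 9;   // assumed fixed name-slot size

// One fixed-size directory slot: in-use flag, FileHdr sector, name.
struct DirEntry {
    bool inUse;
    int sector;
    char name[FileNameMaxLen + 1];
};

// Linear scan over the slots; returns the sector holding the file's
// FileHdr, or -1 if the name is not present.
int Find(const DirEntry* table, int size, const char* name) {
    for (int i = 0; i < size; i++)
        if (table[i].inUse &&
            std::strncmp(table[i].name, name, FileNameMaxLen) == 0)
            return table[i].sector;
    return -1;
}
```

With only 10 fixed-size slots, the O(n) scan is perfectly adequate; larger systems replace it with hashing or B-trees.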
129
A Nachos Filesystem On Disk
Every box in this diagram represents a disk sector.
An allocation bitmap file maintains the free/allocated state of each physical block; its FileHdr is always stored in sector 0.
A directory maintains the name->FileHdr mappings for all existing files; its FileHdr is always stored in sector 1.
[Diagram: sector 0 holds the allocation bitmap file’s FileHdr, sector 1 holds the directory file’s FileHdr; the directory file contains the entries rain: 32, hail: 48, wind: 18, snow: 62, and other sectors hold file data such as “once upon a time...”]
130
Nachos File System Classes
FileSystem: Create(name, size); OpenFile = Open(name); Remove(name); List()
Directory: Directory(entries); sector = Find(name); Add(name, sector); Remove(name)
BitMap
FileHdr: Allocate(..., filesize); length = FileLength(); sector = ByteToSector(offset)
OpenFile: OpenFile(sector); Seek(offset); Read(char* data, bytes); Write(char* data, bytes)
SynchDisk
Disk