File System Extensibility and Non-Disk File Systems Andy Wang COP 5611 Advanced Operating Systems
Outline File system extensibility Non-disk file systems
File System Extensibility Any existing file system can be improved No file system is perfect for all purposes So the OS should make multiple file systems available And should allow for future improvements to file systems
Approaches to File System Extensibility Modify an existing file system Virtual file systems Layered and stackable file system layers
Modifying Existing File Systems Make the changes you want to an already operating file system Reuses code But changes everyone’s file system Requires access to source code Hard to distribute
Virtual File Systems Permit a single OS installation to run multiple file systems, using the same high-level interface to each The OS keeps track of which files are handled by which file system Introduced by Sun
(Diagram) A 4.2 BSD file system providing the root / and directory A
(Diagram) The same directory tree with an NFS file system also mounted, providing directory B
Goals of Virtual File Systems Split FS implementation-dependent and -independent functionality Support semantics of important existing file systems Usable by both clients and servers of remote file systems Atomicity of operation Good performance, re-entrant, no centralized resources, “OO” approach
Basic VFS Architecture Split the existing common Unix file system architecture Normal user file-related system calls above the split File system dependent implementation details below I_nodes fall below; open() and read() calls above
VFS Architecture Block Diagram
Virtual File Systems Each VFS is linked into an OS-maintained list of VFS’s First in list is the root VFS Each VFS has a pointer to its data Which describes how to find its files Generic operations used to access VFS’s
V_nodes The per-file data structure made available to applications Has both public and private data areas Public area is static or maintained only at VFS level No locking done by the v_node layer
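A minimal sketch of these two structures in C, loosely following the published Sun VFS/vnode design; field names are abbreviated and the operation vectors are only hypothetical placeholders, not the exact kernel definitions:

```c
/* Sketch of the VFS and v_node structures, loosely following the Sun
 * VFS/vnode design; fields are simplified for illustration. */
struct vfsops;   /* generic FS-level operations: mount, unmount, root, ... */
struct vnodeops; /* generic per-file operations: open, read, write, ... */

struct vfs {
    struct vfs     *vfs_next;          /* next VFS on the OS-maintained list */
    struct vfsops  *vfs_op;            /* FS-dependent implementation of generic ops */
    struct vnode   *vfs_vnodecovered;  /* v_node this file system is mounted over */
    void           *vfs_data;          /* private, FS-dependent data (e.g., mount info) */
};

struct vnode {
    struct vnodeops *v_op;             /* FS-dependent implementation of per-file ops */
    struct vfs      *v_vfsp;           /* the VFS this file belongs to */
    struct vfs      *v_vfsmountedhere; /* non-NULL if another VFS is mounted on this v_node */
    void            *v_data;           /* private area, e.g., pointer to the underlying i_node */
};
```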
(Diagrams) A worked example steps through the VFS data structures: mounting the 4.2 BSD file system creates the root vfs entry (on the rootvfs list, with vfs_next, vfs_vnodecovered, and vfs_data pointing to the BSD mount data); creating root / and directory A allocates v_nodes whose v_vfsp points back to the BSD vfs and whose v_data points to the underlying i_nodes; mounting NFS adds a second vfs entry whose vfs_data points to its mntinfo; creating directory B adds a v_node backed by NFS; finally, reads of / and B go through the same v_node interface regardless of which file system holds them.
Does the VFS Model Give Sufficient Extensibility? The VFS approach allows us to add new file systems But it isn’t as helpful for improving existing file systems What can be done to add functionality to existing file systems?
Layered and Stackable File System Layers Increase functionality of file systems by permitting some form of composition One file system calls another, giving advantages of both Requires strong common interfaces, for full generality
Layered File Systems Windows NT provides one example of layered file systems File systems in NT are the same as device drivers Device drivers can call other device drivers Using the same interface
Windows NT Layered Drivers Example (diagram) A user-level process calls System Services; crossing from user mode into kernel mode, the I/O Manager passes the request down through the file system driver, a multivolume disk driver, and finally the disk driver
Another Approach - UCLA Stackable Layers More explicitly built to handle file system extensibility Layered drivers in Windows NT allow extensibility Stackable layers support extensibility
Stackable Layers Example (diagram) In one configuration, file system calls go through the VFS layer straight to LFS; in the other, a compression layer is stacked between the VFS layer and LFS
How Do You Create a Stackable Layer? Write just the code that the new functionality requires Pass all other operations to lower levels (bypass operations) Reconfigure the system so the new layer is on top
(Diagram) Example stacks: one user file system composed of a directory layer over an encryption layer over an LFS layer, another of a directory layer over a compression layer over a UFS layer
What Changes Does Stackable Layers Require? Changes to v_node interface For full value, must allow expansion to the interface Changes to mount commands Serious attention to performance issues
Extending the Interface New file layers provide new functionality Possibly requiring new v_node operations Each layer must be prepared to deal with arbitrary unknown operations Bypass v_node operation
Handling a Vnode Operation A layer can do three things with a v_node operation: 1. Do the operation and return 2. Pass it down to the next layer 3. Do some work, then pass it down The same choices are available as the result is returned up the stack
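A minimal sketch of these choices for a hypothetical compression layer; VOP_READ, VCALL, and the helper functions are illustrative stand-ins, not the actual UCLA or BSD interface:

```c
#include <stddef.h>
#include <sys/types.h>

struct vnode { void *v_data; };
struct layer_node { struct vnode *lower_vnode; };  /* private data: the v_node one layer down */

/* Illustrative stand-ins for the real v_node call machinery and helpers. */
int  VOP_READ(struct vnode *vp, void *buf, size_t len, off_t off);
int  VCALL(struct vnode *vp, int op, void *args);
void decompress_in_place(void *buf, size_t len);

/* Choice 3: pass the operation down, then do this layer's work as the result returns. */
int layer_read(struct vnode *vp, void *buf, size_t len, off_t off)
{
    struct layer_node *ln = vp->v_data;
    int err = VOP_READ(ln->lower_vnode, buf, len, off);  /* hand off to the next layer */
    if (err == 0)
        decompress_in_place(buf, len);                   /* this layer's added functionality */
    return err;
}

/* Bypass routine (choice 2): forward any operation this layer does not implement. */
int layer_bypass(struct vnode *vp, int op, void *args)
{
    struct layer_node *ln = vp->v_data;
    return VCALL(ln->lower_vnode, op, args);
}
```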
Mounting Stackable Layers Each layer is mounted with a separate command Essentially pushing new layer on stack Can be performed at any normal mount time Not just on system build or boot
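As an illustration only: pushing one more layer typically looks like one more ordinary mount over the same mount point. The sketch below uses the Linux mount(2) signature; the file system type "compressfs" and its "lowerdir=" option are hypothetical, and the actual commands depend on the particular stacking implementation.

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Hypothetical: stack a compression layer on top of whatever is
     * already mounted at /data by mounting the new layer over it. */
    if (mount("none", "/data", "compressfs", 0, "lowerdir=/data") != 0) {
        perror("mount compressfs layer");
        return 1;
    }
    return 0;
}
```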
What Can You Do With Stackable Layers? Leverage off existing file system technology, adding Compression Encryption Object-oriented operations File replication All without altering any existing code
Performance of Stackable Layers To be a reasonable solution, per-layer overhead must be low In the UCLA implementation, overhead is ~1-2% per layer in system time, not elapsed time Elapsed-time overhead is ~0.25% per layer Highly application dependent, of course
File Systems Using Other Storage Devices All file systems discussed so far have been disk-based The physics of disks has a strong effect on the design of the file systems Different devices with different properties lead to different file systems
Other Types of File Systems RAM-based Disk/RAM-hybrid Flash-memory-based MEMS-based Network/distributed (discussion of these deferred)
Fitting Various File Systems Into the OS Something like VFS is very handy Otherwise, need multiple file access interfaces for different file systems With VFS, interface is the same and storage method is transparent Stackable layers makes it even easier Simply replace the lowest layer
In-Core File Systems Store files in main memory, not on disk Fast access and high bandwidth Usually simple to implement Hard to make persistent Often of limited size May compete with other memory needs
Where Are In-Core File Systems Useful? When brain-dead OS can’t use all main memory for other purposes For temporary files For files requiring very high throughput
In-Core File System Architectures Dedicated memory architectures Pageable in-core file system architectures
Dedicated Memory Architectures Set aside some segment of physical memory to hold the file system Usable only by the file system Either it’s small, or the file system must handle swapping to disk RAM disks are typical examples
Pageable Architectures Set aside some segment of virtual memory to hold the file system Share physical memory system Can be much larger and simpler More efficient use of resources UNIX /tmp file systems are typical examples
Basic Architecture of Pageable Memory FS Uses the VFS interface Inherits most of its code from a standard disk-based file system Including caching code Uses a separate process as a “wrapper” for the virtual memory consumed by FS data
How Well Does This Perform? Not as well as you might think Around 2 times the speed of a disk-based FS Why? Because any access requires two memory copies 1. From the FS area to a kernel buffer 2. From the kernel buffer to user space Fixable if the VM system can swap buffers around
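A toy user-space sketch of that double copy, one from the FS area into a kernel buffer and one from the kernel buffer into user space; all names and sizes are illustrative.

```c
#include <string.h>
#include <stddef.h>

#define KBUF_SIZE 4096

static char fs_area[1 << 20];          /* pageable memory holding file data (illustrative) */
static char kernel_buffer[KBUF_SIZE];  /* ordinary buffer-cache block */

/* Read len bytes at offset off into the caller's buffer: two copies per request. */
size_t incore_read(char *user_buf, size_t off, size_t len)
{
    if (len > KBUF_SIZE)
        len = KBUF_SIZE;
    memcpy(kernel_buffer, fs_area + off, len);  /* copy 1: FS area -> kernel buffer */
    memcpy(user_buf, kernel_buffer, len);       /* copy 2: kernel buffer -> user space */
    return len;                                 /* avoiding copy 1 needs VM buffer swapping */
}
```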
Other Reasons Performance Isn’t Better The disk file system makes substantial use of caching, which is already just as fast But the speedup for file creation/deletion is larger, since those operations require multiple trips to disk
Disk/RAM Hybrid FS Conquest File System http://www.cs.fsu.edu/~awang/conquest
Hardware Evolution (graph: accesses per second over time, log scale) CPU and memory performance improve at roughly 50% per year, while disk improves at only about 15% per year, so the gap has widened from five to six orders of magnitude. In human scale: in 1990, if a CPU access took 1 second, a disk access took 6 days; by 2000 the ratio was 1 second to 3 months. It takes about a second to grab a sheet of paper and write something down, about 6 days to have Santa Claus physically mail you a sheet, and about a month to make your own paper from papyrus, mostly waiting for it to dry. 3 months is a long time!
Price Trend of Persistent RAM (graph: $/MB over time, log scale, with curves for 3.5”, 2.5”, and 1” HDDs, persistent RAM, and paper/film) The cost of paper and film is the critical barrier any storage technology must cross to achieve economy of scale; once below it, the technology is cheap enough to be a storage alternative. Successive disk geometries were introduced roughly at the top boundary of that barrier. The boom in digital photography around 1998 changed the slope of the persistent RAM cost curve, and by 2005 we expect 4 to 10 GB of persistent RAM on personal desktops.
Conquest Design and build a disk/persistent-RAM hybrid file system that delivers all file system services from memory, with the single exception of high-capacity storage The two major benefits are simplicity and performance
User Access Patterns Small files take little space (10%) but represent most accesses (90%) Large files take most of the space and are mostly accessed sequentially Database applications are the exception, and Conquest currently does not handle database workloads
Files Stored in Persistent RAM Small files (< 1 MB): no seek time or rotational delays (which dominate access to small objects), fast byte-level rather than block-level access, contiguous allocation Metadata: fast synchronous updates (most disk-based file systems propagate directory changes to disk synchronously, a serious performance cost) and a single representation instead of separate runtime and storage forms Executables and shared libraries: stored in core for in-place execution, which reduces program startup time significantly
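A minimal sketch of the placement policy these bullets describe, assuming the 1 MB threshold from the slide; the function and its arguments are hypothetical, not Conquest's actual interface.

```c
#include <stdbool.h>
#include <stddef.h>

#define SMALL_FILE_LIMIT (1u << 20)   /* 1 MB threshold from the slide */

/* Hypothetical placement policy: small files, metadata, executables, and
 * shared libraries live in persistent RAM; large files go to disk. */
bool store_in_persistent_ram(size_t file_size, bool is_metadata, bool is_executable)
{
    if (is_metadata)
        return true;   /* fast synchronous updates, single representation */
    if (is_executable)
        return true;   /* in-place execution */
    return file_size < SMALL_FILE_LIMIT;
}
```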
Memory Data Path of Conquest (diagram) In a conventional file system, a storage request passes through IO buffer management (caching), persistence support (translating between the runtime and storage forms of metadata), and disk management (layout, disk arm scheduling, and so on) before reaching the disk On Conquest’s memory data path, updates to data and metadata are made in place in battery-backed RAM, which holds small files and metadata: there is no IO buffer management or disk management, and persistence support reduces to metadata allocation (described later)
Large-File-Only Disk Storage With small files and metadata in memory, the disk stores only large files, so space can be allocated in big chunks Lower access overhead and reduced management overhead No fragmentation management No tricks for small files, such as storing data in the metadata No elaborate data structures, such as wrapping a balanced tree onto disk cylinders
Sequential-Access Large Files Sequential disk accesses deliver near-raw bandwidth (about 100 MB/s, roughly 200 times faster than random disk access) Well-defined readahead semantics Read-mostly, so there is little synchronization overhead between memory and disk
Disk Data Path of Conquest (diagram) Compared with a conventional file system’s data path, Conquest’s disk data path bypasses the mechanisms needed for persistence support IO buffer management is greatly simplified because the behavior of large-file accesses is known Disk management is greatly simplified by the absence of small files and fragmentation management; the disk holds a large-file-only file system
Random-Access Large Files Random access is commonly defined as nonsequential access, but a typical movie has about 150 scene changes to jump to, and an MP3 stores its title at the end of the file, so the usual pattern is to jump somewhere and then read sequentially Treating these as near-sequential accesses lets Conquest simplify its large-file metadata representation significantly; even simple data structures are fast in memory
PostMark Benchmark (graph: number of files vs. transaction rate) PostMark models an ISP workload (emails, web-based transactions); the working set here is 250 MB with 2 GB of physical RAM, so every file system runs entirely within the LRU disk cache Conquest is comparable to ramfs, which provides no persistence but serves as a base case for the quality of the Conquest implementation Conquest is at least 24% faster than ext2fs, reiserfs, and SGI XFS, because file systems optimized for disk do not take full advantage of memory speed
PostMark Benchmark (graph) With 10,000 files and a 3.5 GB working set on 2 GB of physical RAM, varying the percentage of large files from 0 to 10 percent, both the memory and disk components are exercised; ramfs is omitted because the working set no longer fits in memory In this range Conquest can be several times faster than ext2fs, reiserfs, and SGI XFS
PostMark Benchmark (graph) When the working set is larger than RAM, Conquest is still 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS, a very significant improvement
Flash Memory File Systems What is flash memory? Why is it useful for file systems? A sample design of a flash memory file system
Flash Memory A form of solid-state memory similar to ROM Holds data without power supply Reads are fast Can be written once, more slowly Can be erased, but very slowly Limited number of erase cycles before degradation
Writing In Flash Memory If writing to empty location, just write If writing to previously written location, erase it, then write Typically, flash memories allow erasure only of an entire sector Can read (sometimes write) other sectors during an erase
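A sketch of that write rule under stated assumptions: a location that still holds its erased value can be programmed directly, otherwise the whole sector must be erased first. All functions and constants are illustrative, not a real flash driver API.

```c
#include <stdint.h>

#define SECTOR_SIZE (64 * 1024)   /* erase granularity from the slides */
#define ERASED_BYTE 0xFF          /* flash typically erases to all ones */

/* Hypothetical device primitives. */
void    flash_program(uint32_t addr, const uint8_t *data, uint32_t len); /* slow */
void    flash_erase_sector(uint32_t sector);                             /* very slow */
uint8_t flash_read_byte(uint32_t addr);                                  /* fast */

/* Write one byte: program directly if the location is still erased,
 * otherwise erase the enclosing sector first (a real driver would save
 * and restore the rest of the sector's contents). */
void flash_write_byte(uint32_t addr, uint8_t value)
{
    if (flash_read_byte(addr) != ERASED_BYTE)
        flash_erase_sector(addr / SECTOR_SIZE);
    flash_program(addr, &value, 1);
}
```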
Typical Flash Memory Characteristics Read cycle: 80-150 ns Write cycle: 10 ms/byte Erase cycle: 500 ms/block Cycle limit: 100,000 times Sector size: 64 Kbytes Power consumption: 15-45 mA active, 5-20 mA standby Price: ~$300/Gbyte
Pros/Cons of Flash Memory Small and light Uses less power than disk Read time comparable to DRAM No rotation/seek complexities No moving parts (shock resistant) Expensive (compared to disk) Erase cycle very slow Limited number of erase cycles
Flash Memory File System Architectures One basic decision to make Is flash memory disk-like? Or memory-like? Should flash memory be treated as a separate device, or as a special part of addressable memory?
Hitachi Flash Memory File System Treats flash memory as device As opposed to directly addressable memory Basic architecture similar to log file system
Basic Flash Memory FS Architecture Writes are appended to tail of sequential data structure Translation tables to find blocks later Cleaning process to repair fragmentation This architecture does no wear-leveling
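A minimal sketch of that write path under stated assumptions: new data is appended at the log tail and a translation table records where each logical block currently lives. Names, sizes, and the flash_program primitive are illustrative, and bank switching and cleaning are omitted here.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096
#define NUM_BLOCKS 4096                   /* logical blocks in the file system (illustrative) */

static uint32_t translation[NUM_BLOCKS];  /* logical block -> current flash address */
static uint32_t log_tail;                 /* next free location in the active segment */

void flash_program(uint32_t addr, const void *data, uint32_t len);  /* hypothetical primitive */

/* Append a new version of a logical block at the tail and remember its location.
 * The old copy becomes garbage for the cleaner to reclaim later. */
void fs_write_block(uint32_t logical_block, const void *data)
{
    flash_program(log_tail, data, BLOCK_SIZE);
    translation[logical_block] = log_tail;
    log_tail += BLOCK_SIZE;               /* bank switch / cleaning when the segment fills (omitted) */
}

/* Reads consult the translation table to find the latest copy of a block. */
uint32_t fs_block_addr(uint32_t logical_block)
{
    return translation[logical_block];
}
```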
Flash Memory Banks and Segments Architecture divides entire flash memory into banks (8, in current implementation) Banks are subdivided into segments 8 segments per bank, currently 256 Kbytes per segment 16 Mbytes total capacity
Writing Data in Flash Memory File System One bank is currently active New data is written to block in active bank When this bank is full, move on to bank with most free segments Various data structures maintain illusion of “contiguous” memory
Cleaning Up Data Cleaning is done on a segment basis When a segment is to be cleaned, its entire bank is put on a cleaning list No more writes to bank till cleaning is done Segments chosen in manner similar to LFS
Cleaning a Segment Copy live data to another segment Erase the entire segment (the segment is the erasure granularity) Return the bank to the active bank list
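A sketch of cleaning one segment along the lines described above; the structures and helper functions are hypothetical, not the Hitachi implementation's interface.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 4096

struct segment { uint32_t base; uint32_t nblocks; };
struct bank    { struct segment *segments; int nsegments; };

/* Hypothetical helpers. */
bool block_is_live(uint32_t addr);                   /* still referenced by the translation table? */
void copy_block_to_active_segment(uint32_t addr);    /* appends at the current log tail */
void flash_erase_segment(uint32_t base);             /* erase = segment granularity */
void add_to_active_bank_list(struct bank *b);

/* Clean one segment, then make its bank writable again. */
void clean_segment(struct bank *b, struct segment *seg)
{
    for (uint32_t i = 0; i < seg->nblocks; i++) {
        uint32_t addr = seg->base + i * BLOCK_SIZE;
        if (block_is_live(addr))
            copy_block_to_active_segment(addr);      /* live data moves out first */
    }
    flash_erase_segment(seg->base);                  /* whole segment erased at once */
    add_to_active_bank_list(b);                      /* bank can accept writes again */
}
```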
Performance of the Prototype System No seek time, so sequential/random access should be equally fast Around 650-700 Kbytes per second Read performance goes at this speed Write performance slowed by cleaning How much depends on how full the file system is Also, writing is simply slower in flash
More Flash Memory File System Performance Data On Andrew Benchmark, performs comparably to pageable memory FS Even when flash memory nearly full This benchmark does lots of reads, few writes Allowing flash file system to perform lots of cleaning without delaying writes