FFS, LFS, and RAID Andy Wang COP 5611 Advanced Operating Systems

UNIX Fast File System Designed to improve performance of UNIX file I/O Two major areas of performance improvement  Bigger block sizes  Better on-disk layout for files

Block Size Improvement Quadrupling the block size (4x) quadrupled the amount of data fetched per disk access But could lead to fragmentation problems So fragments were introduced  Small files stored in fragments  Fragments are addressable But not independently fetchable

Disk Layout Improvements Aimed at avoiding disk seeks Bad if finding related files takes many seeks Very bad if finding all the blocks of a single file requires seeks Spatial locality: keep related things close together on disk

Cylinder Groups A cylinder group: a set of consecutive disk cylinders in the FFS Files in the same directory stored in the same cylinder group Within a cylinder group, tries to keep things contiguous But must not let a cylinder group fill up

Locations for New Directories Put new directory in relatively empty cylinder group What is “empty”?  Many free i_nodes  Few directories already there

The Importance of Free Space FFS must not run too close to capacity No room for new files Layout policies ineffective when too few free blocks Typically, FFS needs 10% of the total blocks free to perform well

Performance of FFS 4x to 15x the bandwidth of old UNIX file system Depending on size of disk blocks Performance on original file system  Limited by CPU speed  Due to memory-to-memory buffer copies

FFS Not the Ultimate Solution Based on technology of the early 80s And file usage patterns of those times In modern systems, FFS achieves only ~5% of raw disk bandwidth

The Log-Structured File System Large caches can catch almost all reads But most writes have to go to disk So FS performance can be limited by writes So, produce a FS that writes quickly Like an append-only log

Basic LFS Architecture Buffer writes, send them sequentially to disk  Data blocks  Attributes  Directories  And almost everything else Converts small sync writes to large async writes
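To make the buffering idea concrete, here is a minimal C sketch of appending blocks to an in-memory segment and flushing the whole segment with one large sequential write. The structure, the 512 KB segment size, and the helper names are assumptions for illustration, not Sprite LFS code.

```c
/* Sketch of LFS-style write buffering: hypothetical names, not Sprite LFS code. */
#include <string.h>
#include <unistd.h>

#define SEG_SIZE   (512 * 1024)   /* assumed segment size: 512 KB */
#define BLOCK_SIZE 4096

struct segment_buf {
    char   data[SEG_SIZE];
    size_t used;
};

/* Append one block (data, i_node, directory, ...) to the in-memory segment. */
static void log_write(struct segment_buf *seg, int disk_fd, off_t *log_head,
                      const void *block)
{
    if (seg->used + BLOCK_SIZE > SEG_SIZE) {
        /* One large sequential write instead of many small random ones. */
        pwrite(disk_fd, seg->data, seg->used, *log_head);
        *log_head += seg->used;
        seg->used = 0;
    }
    memcpy(seg->data + seg->used, block, BLOCK_SIZE);
    seg->used += BLOCK_SIZE;
}
```

Many small synchronous writes thus become one large asynchronous write at the head of the log.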

A Simple Log Disk Structure [Diagram: blocks of several files (A, Z, M, F, L) appended to the log in write order; the head of the log is at the most recently written block]

Key Issues in Log-Based Architecture 1. Retrieving information from the log No matter how well you cache, sooner or later you have to read 2. Managing free space on the disk You need contiguous space to write - in the long run, how do you get more?

Finding Data in the Log Give me block 25 of file L Or, give me block 1 of file F [Diagram: the same log layout as before; the requested blocks are scattered among blocks of other files]

Retrieving Information From the Log Must avoid sequential scans of disk to read files Solution - store index structures in log Index is essentially the most recent version of the i_node

Finding Data in the Log How do you find all blocks of file Foo? [Diagram: the log contains Foo blocks 1, 2, and 3, plus an old, superseded copy of Foo block 1]

Finding Data in the Log with an I_node [Diagram: an i_node in the log points to the current copies of Foo blocks 1, 2, and 3; the old copy of Foo block 1 is no longer referenced]

How Do You Find a File’s I_node? You could search sequentially LFS optimizes by writing i_node maps to the log The i_node map points to the most recent version of each i_node A file system’s i_node map spans multiple blocks in the log
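As a sketch of this read path, the lookup below goes i_node map, then i_node, then data block. All names (imap, read_inode_at, read_block_at) and the use of only direct block pointers are hypothetical simplifications, not the actual LFS structures.

```c
/* Sketch of an LFS-style lookup path: all structure names are illustrative. */
#include <stdint.h>

struct inode {
    uint64_t block_addr[12];   /* direct pointers only, for brevity */
};

/* i_node map: for each i_node number, the disk address of its latest copy.
 * Cached in memory; backed by i_node map blocks written to the log. */
extern uint64_t imap[];

extern struct inode *read_inode_at(uint64_t disk_addr);
extern void         *read_block_at(uint64_t disk_addr);

/* Find block `blkno` of the file with i_node number `inum`. */
void *lfs_read_block(uint32_t inum, uint32_t blkno)
{
    uint64_t inode_addr = imap[inum];               /* 1. i_node number -> log address */
    struct inode *ip = read_inode_at(inode_addr);   /* 2. read the latest i_node */
    return read_block_at(ip->block_addr[blkno]);    /* 3. follow its block pointer */
}
```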

How Do You Find the I_node? [Diagram: the i_node map, which points to the most recent copy of each i_node in the log]

How Do You Find Inode Maps? Use a fixed region on disk that always points to the most recent i_node map blocks But cache i_node maps in main memory Small enough that few disk accesses required to find i_node maps

Finding I_node Maps [Diagram: the fixed region on disk points to the new i_node map blocks in the log; an old i_node map remains in the log but is no longer referenced]

Reclaiming Space in the Log Eventually, the log reaches the end of the disk partition So LFS must reclaim disk space, such as that occupied by superseded data blocks Space can be reclaimed in the background or when needed Goal is to maintain large free extents on disk

Example of Need for Reuse [Diagram: the head of the log has reached the end of the partition, but new data still needs to be logged]

Major Alternatives for Reusing Log Threading + Fast - Fragmentation - Slower reads [Diagram: new data threaded into free holes left earlier in the log]

Major Alternatives for Reusing Log Copying + Simple + Avoids fragmentation - Expensive [Diagram: live data copied forward so new data can be logged into the freed region]

LFS Space Reclamation Strategy Combination of copying and threading Copy to free large fixed-size segments Thread free segments together Try to collect long-lived data permanently into segments

A Threaded, Segmented Log [Diagram: the log threaded through fixed-size segments; the head of the log advances segment by segment]

Cleaning a Segment 1. Read several segments into memory 2. Identify the live blocks 3. Write live data back (hopefully) into a smaller number of segments

Identifying Live Blocks Hard to track down live blocks of all files Instead, each segment maintains a segment summary block  Identifying what is in each block Crosscheck blocks with owning i_node’s block pointers Written at end of log write, for low overhead
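A rough C sketch of the liveness check and the cleaning loop, under the assumption that each segment summary entry records the owning i_node number and block number; current_addr_of and append_to_log stand in for the real i_node-map lookup and log-append machinery.

```c
/* Sketch of segment cleaning: names and layouts are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

struct summary_entry {          /* one entry per block in the segment */
    uint32_t inum;              /* owning i_node number */
    uint32_t blkno;             /* which block of that file this is */
};

/* Hypothetical helpers: current address via i_node map + i_node; log append. */
extern uint64_t current_addr_of(uint32_t inum, uint32_t blkno);
extern void     append_to_log(uint64_t block_addr, const struct summary_entry *e);

/* A block is live iff the owning i_node still points at this copy. */
static bool block_is_live(uint64_t block_addr, const struct summary_entry *e)
{
    return current_addr_of(e->inum, e->blkno) == block_addr;
}

void clean_segment(uint64_t seg_base, const struct summary_entry *summary,
                   int nblocks, uint32_t block_size)
{
    for (int i = 0; i < nblocks; i++) {
        uint64_t addr = seg_base + (uint64_t)i * block_size;
        if (block_is_live(addr, &summary[i]))
            append_to_log(addr, &summary[i]);  /* rewrite live block; i_node updated */
        /* Dead blocks are simply dropped; the whole segment becomes free. */
    }
}
```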

Segment Cleaning Policies What are some important questions?  When do you clean segments?  How many segments to clean?  Which segments to clean?  How to group blocks in their new segments?

When to Clean Periodically Continuously During off-hours When disk is nearly full On-demand LFS uses a threshold system

How Many Segments to Clean The more cleaned at once, the better the reorganization of the disk  But the higher the cost of cleaning LFS cleans a few tens at a time  Till disk drops below threshold value Empirically, LFS not very sensitive to this factor

Which Segments to Clean? Cleaning segments with lots of dead data gives great benefit Some segments are hot, some segments are cold But “cold” free space is more valuable than “hot” free space Since cold blocks tend to stay cold

Cost-Benefit Analysis u = utilization, A = age of the data Benefit-to-cost ratio = (1 - u) * A / (1 + u) Clean cold segments with some free space, hot segments only when they have a lot of free space
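A small sketch of how a cleaner might rank segments by this ratio; the seg_info fields and the sorting approach are illustrative, not the Sprite LFS implementation.

```c
/* Sketch of cost-benefit segment selection; code is illustrative. */
#include <stdlib.h>

struct seg_info {
    double u;      /* fraction of the segment still live (0.0 - 1.0) */
    double age;    /* age of the youngest live block, e.g., in seconds */
};

/* benefit/cost = (free space generated * how long it stays free) / cleaning cost
 *              = (1 - u) * age / (1 + u)
 * (read the whole segment: cost 1; write back live data: cost u) */
static double score(const struct seg_info *s)
{
    return (1.0 - s->u) * s->age / (1.0 + s->u);
}

static int by_score_desc(const void *a, const void *b)
{
    double sa = score(a), sb = score(b);
    return (sa < sb) - (sa > sb);   /* larger scores sort first */
}

/* Clean segments in order of decreasing benefit-to-cost ratio. */
void pick_segments_to_clean(struct seg_info *segs, int n)
{
    qsort(segs, n, sizeof(segs[0]), by_score_desc);
    /* clean segs[0], segs[1], ... until the free-space threshold is reached */
}
```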

What to Put Where? Given a set of live blocks and some cleaned segments, which goes where?  Order blocks by age  Write them to segments oldest first Goal is very cold, highly utilized segments

Goal of LFS Cleaning [Diagram: histogram of number of segments vs. segment utilization (empty to 100% full); cleaning aims to push segments toward the two extremes, mostly empty or mostly full]

Performance of LFS On modified Andrew benchmark, 20% faster than FFS LFS can create and delete 8 times as many files per second as FFS LFS can read 1 ½ times as many small files LFS slower than FFS at sequential reads of randomly written files

Logical Locality vs. Temporal Locality Logical locality (spatial locality): Normal file systems keep a file’s data blocks close together Temporal locality: LFS keeps data written at the same time close together When temporal locality = logical locality  Systems perform the same

Major Innovations of LFS Abstraction: everything is a log Temporal locality Use of caching to shape disk access patterns  Cache most reads  Optimized writes Separating full and empty segments

Where Did LFS Look For Performance Improvements? Minimized disk access  Only write when segments filled up Increased size of data transfers  Write whole segments at a time Improving locality  Assuming temporal locality, a file’s blocks are all adjacent on disk  And temporally related files are nearby

Parallel Disk Access and RAID One disk can only deliver data at its maximum rate So to get more data faster, get it from multiple disks simultaneously Saving on rotational latency and seek time

Utilizing Disk Access Parallelism Some parallelism available just from having several disks But not much Instead of satisfying each access from one disk, use multiple disks for each access Store part of each data block on several disks

Disk Parallelism Example [Diagram: a file system servicing open(foo), read(bar), and write(zoo) by spreading the work across several disks]

Data Striping Transparently distributing data over multiple disks Benefits  Increases disk parallelism  Faster response for big requests Major parameters  Number of disks  Size of the data interleaving unit
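The striping arithmetic can be written down directly. The sketch below maps a logical block number to a (disk, offset) pair for a given number of disks and stripe unit; the structure and field names are illustrative.

```c
/* Sketch of RAID-0-style striping arithmetic: names are illustrative. */
#include <stdint.h>

struct stripe_map {
    uint32_t ndisks;        /* number of disks in the array */
    uint32_t stripe_unit;   /* interleaving unit, in blocks */
};

struct location {
    uint32_t disk;          /* which disk holds the block */
    uint64_t offset;        /* block offset within that disk */
};

/* Map a logical block number to a physical (disk, offset) pair. */
struct location map_block(const struct stripe_map *m, uint64_t lbn)
{
    uint64_t stripe_no   = lbn / m->stripe_unit;   /* which stripe */
    uint64_t within_unit = lbn % m->stripe_unit;   /* offset inside the unit */
    struct location loc;
    loc.disk   = stripe_no % m->ndisks;
    loc.offset = (stripe_no / m->ndisks) * m->stripe_unit + within_unit;
    return loc;
}
```

A large stripe unit gives coarse-grained interleaving (small requests hit one disk); a small unit gives fine-grained interleaving (every request spans many disks).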

Fine- vs. Coarse-grained Data Interleaving Fine-grained data interleaving + High data rate for all requests - Only one request served by the array at a time - Lots of time spent positioning Coarse-grained data interleaving + Large requests access many disks + Many small requests handled at once - Small I/O requests access only a few disks

Reliability of Disk Arrays Without disk arrays, failure of one disk among N loses 1/Nth of the data With disk arrays (fine-grained striping across all N disks), failure of one disk loses all data An array of N disks is roughly 1/Nth as reliable as one disk
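A back-of-the-envelope illustration of that scaling, assuming independent failures and a made-up per-disk MTTF:

```c
/* Rough reliability arithmetic: assumes independent disk failures (illustrative). */
#include <stdio.h>

int main(void)
{
    double disk_mttf_hours = 1000000.0;  /* assumed per-disk MTTF (~114 years) */
    int    ndisks          = 100;

    /* With N independent disks and no redundancy, the array fails when any
     * one disk fails, so its MTTF is roughly 1/Nth of a single disk's. */
    double array_mttf_hours = disk_mttf_hours / ndisks;

    printf("array MTTF ~= %.0f hours (~%.1f years)\n",
           array_mttf_hours, array_mttf_hours / (24 * 365));
    return 0;
}
```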

Adding Reliability to Disk Arrays Buy more reliable disks Build redundancy into the disk array  Multiple levels of disk array redundancy possible  Most organizations can prevent any data loss from single disk failure

Basic Reliability Mechanisms Duplicate data Parity for error detection Error Correcting Code for detection and correction

Parity Methods Can use parity to detect multiple errors  But typically used to detect single error If hardware errors are self-identifying, parity can also correct errors When data is written, parity must be written, too
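A minimal sketch of XOR parity: computing the parity block across the data disks, and rebuilding a lost block when the failed disk is known (self-identifying). The block size and function names are assumptions for illustration.

```c
/* Sketch of XOR parity as used in RAID: names and sizes are illustrative. */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* parity = data[0] XOR data[1] XOR ... XOR data[ndisks-1], byte by byte. */
void compute_parity(uint8_t parity[BLOCK_SIZE],
                    uint8_t data[][BLOCK_SIZE], int ndisks)
{
    memset(parity, 0, BLOCK_SIZE);
    for (int d = 0; d < ndisks; d++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[d][i];
}

/* If the failing disk is self-identifying, XORing the surviving blocks with
 * the parity block reconstructs the lost block. */
void reconstruct(uint8_t lost[BLOCK_SIZE], uint8_t survivors[][BLOCK_SIZE],
                 int nsurvivors, const uint8_t parity[BLOCK_SIZE])
{
    memcpy(lost, parity, BLOCK_SIZE);
    for (int d = 0; d < nsurvivors; d++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            lost[i] ^= survivors[d][i];
}
```

Whenever data is written, the corresponding parity block must be updated as well, which is what makes small writes expensive in the parity-based RAID levels below.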

Error-Correcting Code Based on Hamming codes, mostly Not only detect error, but identify which bit is wrong

RAID Architectures Redundant Arrays of Independent Disks Basic architectures for organizing disks into arrays Assuming independent control of each disk Standard classification scheme divides architectures into levels

Non-Redundant Disk Arrays (RAID Level 0) No redundancy at all So, what we just talked about Any failure causes data loss

Non-Redundant Disk Array Diagram (RAID Level 0) [Diagram: file system issuing open(foo), read(bar), and write(zoo); data striped across the disks with no redundancy]

Mirrored Disks (RAID Level 1) Each disk has second disk that mirrors its contents  Writes go to both disks  No data striping + Reliability is doubled + Read access faster - Write access slower - Expensive and inefficient

Mirrored Disk Diagram (RAID Level 1) [Diagram: file system issuing open(foo), read(bar), and write(zoo); each disk is paired with a mirror holding an identical copy]

Memory-Style ECC (RAID Level 2) Some disks in array are used to hold ECC E.g., 4 data disks require 3 ECC disks + More efficient than mirroring + Can correct, not just detect, errors - Still fairly inefficient

Memory-Style ECC Diagram (RAID Level 2) [Diagram: file system issuing open(foo), read(bar), and write(zoo); data disks plus dedicated ECC disks]

Bit-Interleaved Parity (RAID Level 3) Each disk stores one bit of each data block One disk in array stores parity for other disks + More efficient than Levels 1 and 2 - Parity disk doesn’t add bandwidth

Bit-Interleaved RAID Diagram (Level 3) [Diagram: file system issuing open(foo), read(bar), and write(zoo); each block spread bit by bit across the data disks, with one dedicated parity disk]

Block-Interleaved Parity (RAID Level 4) Like bit-interleaved, but data is interleaved in blocks of arbitrary size  Size is called striping unit  Small read requests use 1 disk + More efficient data access than level 3 + Satisfies many small requests at once - Parity disk can be a bottleneck - Small writes require 4 I/Os
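The four I/Os for a small write come from a read-modify-write of the parity. A sketch, with hypothetical disk_read/disk_write helpers, of how the new parity is computed from the old data and old parity:

```c
/* Sketch of the RAID-4/5 small-write parity update (the "4 I/Os"): illustrative names. */
#include <stdint.h>

#define BLOCK_SIZE 4096

extern void disk_read(int disk, uint64_t off, uint8_t buf[BLOCK_SIZE]);
extern void disk_write(int disk, uint64_t off, const uint8_t buf[BLOCK_SIZE]);

/* new_parity = old_parity XOR old_data XOR new_data */
void small_write(int data_disk, int parity_disk, uint64_t off,
                 const uint8_t new_data[BLOCK_SIZE])
{
    uint8_t old_data[BLOCK_SIZE], parity[BLOCK_SIZE];

    disk_read(data_disk, off, old_data);         /* I/O 1: read old data */
    disk_read(parity_disk, off, parity);         /* I/O 2: read old parity */

    for (int i = 0; i < BLOCK_SIZE; i++)
        parity[i] ^= old_data[i] ^ new_data[i];  /* update parity in place */

    disk_write(data_disk, off, new_data);        /* I/O 3: write new data */
    disk_write(parity_disk, off, parity);        /* I/O 4: write new parity */
}
```

In Level 4 every one of these parity I/Os hits the same dedicated parity disk, which is why it becomes a bottleneck.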

Block-Interleaved Parity Diagram (RAID Level 4) [Diagram: file system issuing open(foo), read(bar), and write(zoo); data striped in blocks across the data disks, with all parity on one dedicated disk]

Block-Interleaved Distributed-Parity (RAID Level 5) Spread the parity out over all disks +No parity disk bottleneck +All disks contribute read bandwidth –Requires 4 I/Os for small writes
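One way to spread the parity is to rotate it across stripes. The sketch below shows a simple rotating layout; it is illustrative only, not a specific controller's or driver's mapping.

```c
/* Sketch of a simple rotating-parity layout for RAID 5: illustrative only. */
#include <stdint.h>

/* For stripe `stripe` in an array of `ndisks` disks, parity lives on a
 * different disk each stripe, so no single disk is the parity bottleneck. */
static inline uint32_t parity_disk(uint64_t stripe, uint32_t ndisks)
{
    return (uint32_t)((ndisks - 1) - (stripe % ndisks));
}

/* Map the k-th data block of a stripe (k = 0 .. ndisks-2) to a physical disk,
 * skipping over whichever disk holds that stripe's parity. */
static inline uint32_t data_disk(uint64_t stripe, uint32_t k, uint32_t ndisks)
{
    uint32_t p = parity_disk(stripe, ndisks);
    return (k < p) ? k : k + 1;
}
```

Small writes still pay the four-I/O parity update shown above, but successive writes update parity on different disks.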

Block-Interleaved Distributed-Parity Diagram (RAID Level 5) [Diagram: file system issuing open(foo), read(bar), and write(zoo); data and parity blocks rotated across all disks]

Other RAID Configurations RAID 6  Can survive two disk failures RAID 10 (RAID 1+0)  Data striped across mirrored pairs RAID 01 (RAID 0+1)  Mirroring two RAID 0 arrays RAID 15, RAID 51

Where Did RAID Look For Performance Improvements? Parallel use of disks  Improve overall delivered bandwidth by getting data from multiple disks Biggest problem is small write performance But we know how to deal with small writes...

Bonus Given N disks in RAID 1/10/01/15/51, what is the expected number of disk failures before data loss? (1/2 critique) Given 1-TB disks and probability p for a bit to fail silently, what is the probability of irrecoverable data loss for RAID 1/5/6/10/01/15/51 after a single disk failure? (1/2 critique)