CS 3204 Operating Systems Godmar Back Lecture 26.

10/10/2015 CS 3204 Spring
Announcements
Office hours moved to one of McB 133, 116, or 124; last office hours today.
Please see the revised grading policy posted on the website.
–Project 3 has been graded (and grades should have been posted)
–Will provide a standing grade by tomorrow
–Project 4 accepted until May 7, 23:59
You must notify your advisor by Wednesday, 5pm, if you want to switch to P/F!
Reading assignment: Ch. 10, 11, 12

Filesystems: Consistency & Logging

FFS’s Consistency
Berkeley FFS (the Fast File System) formalized rules for filesystem consistency.
Failures FFS considers acceptable:
–May lose some data on a crash
–May see someone else’s previously deleted data (applications must zero out the data, plus fsync(), if they wish to avoid this)
–May have to spend time reconstructing the free list
–May find unattached inodes → moved to lost+found
Unacceptable failures:
–After a crash, gaining active access to someone else’s data, either by pointing at a reused inode or at reused blocks
FFS performs 2 synchronous writes for each metadata operation that creates or destroys inodes or directory entries, e.g., creat(), unlink(), mkdir(), rmdir().
–Consequence: such updates proceed at disk speed rather than CPU/memory speed

Write Ordering & Logging
Problem: as disk sizes grew, fsck became infeasible.
–Its complexity is proportional to the used portion of the disk
–It takes several hours to check modern GB-sized disks
In the early 90s, approaches were developed that:
–Avoided the need for fsck after a crash
–Reduced the need for synchronous writes
Two classes of approaches:
–Write ordering (aka Soft Updates), used in BSD – the elegant approach
–Journaling (aka logging), used in VxFS, NTFS, JFS, HFS+, ext3, reiserfs

Write Ordering
Instead of writing synchronously, record the dependency in the buffer cache.
–On eviction, write out dependent blocks before the evicted block: the disk will then always hold a consistent or repairable image
–Repairs can be done in parallel and don’t delay system reboot
Example:
–Must write the block containing a new inode before the block containing the changed directory entry pointing at that inode
Can completely eliminate the need for synchronous writes.
–Deletes can be done in the background after zeroing out the directory entry & noting the dependency
Can provide additional consistency guarantees: e.g., make data blocks dependent on metadata blocks.
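The eviction rule above can be sketched in C. The buffer and dependency types here are hypothetical stand-ins, not FFS’s or Soft Updates’ actual structures, and the sketch does not handle the cyclic dependencies discussed next:

```c
#include <stddef.h>

#define MAX_DEPS 4

/* A cached disk block that remembers which blocks must reach disk first. */
struct buf {
    int blockno;
    int dirty;
    int written;
    struct buf *deps[MAX_DEPS];  /* blocks that must be written before this one */
    int ndeps;
};

/* Record the order in which blocks would hit the disk. */
static int write_log[16];
static int nwrites;

/* On eviction, flush dependencies first, then the block itself, so the
 * on-disk image stays consistent (or at least repairable). */
void evict(struct buf *b)
{
    if (b->written || !b->dirty)
        return;
    for (int i = 0; i < b->ndeps; i++)
        evict(b->deps[i]);
    write_log[nwrites++] = b->blockno;   /* stand-in for the actual disk write */
    b->written = 1;
}
```

For example, a directory block that points at a new inode would list the inode block as a dependency, so the inode block reaches disk first.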

Write Ordering: Cyclic Dependencies
Tricky case: what if A should be written before B, but B should also be written before A? Such cycles must be unraveled – e.g., by temporarily rolling back one of the pending changes so that a consistent version of the block can be written, then reapplying it.

Logging Filesystems
Idea from databases: keep track of changes.
–“Write-ahead logging” (aka “journaling”): modifications are first written to a log before they are written to the actual changed locations
–Reads bypass the log
After a crash, scan through the log and:
–redo completed metadata changes (e.g., created an inode & updated the directory)
–undo partially completed metadata changes (e.g., created an inode, but didn’t update the directory)
The log must be written to persistent storage.
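A redo-style recovery pass can be sketched as follows. The record format and the 16-transaction limit are illustrative assumptions, not the journal layout of any real filesystem:

```c
enum rectype { REC_UPDATE, REC_COMMIT };

/* One log record: either an update to a block or a transaction commit. */
struct logrec {
    enum rectype type;
    int tx;        /* transaction id */
    int block;     /* for REC_UPDATE: block being changed */
    int newval;    /* for REC_UPDATE: new contents (one int for brevity) */
};

/* Recovery: redo updates of committed transactions; updates belonging to
 * transactions with no COMMIT record in the log are simply skipped, which
 * "undoes" the partially completed change. */
void replay(const struct logrec *log, int n, int *disk)
{
    int committed[16] = {0};   /* assume tx ids < 16 in this sketch */
    for (int i = 0; i < n; i++)
        if (log[i].type == REC_COMMIT)
            committed[log[i].tx] = 1;
    for (int i = 0; i < n; i++)
        if (log[i].type == REC_UPDATE && committed[log[i].tx])
            disk[log[i].block] = log[i].newval;
}
```

A crash between a transaction’s updates and its commit record leaves the “disk” untouched by that transaction, matching the undo case above.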

Logging Issues
How much does logging slow down normal operation? Log writes are sequential:
–They can be fast, especially if a separate disk is used for the log
–Subtlety: the log does not actually have to be written synchronously, just in order & before the data to which it refers!
–Can trade performance for consistency – write the log synchronously if strong consistency is desired
Need to recycle the log:
–After sync(), the log can be restarted since the disk is known to be consistent

Physical vs Logical Logging
What should be logged, and how?
Physical logging:
–Store the affected physical state – the block contents before the change, after it, or both
–The choice determines recovery: an after-image makes redo easy, a before-image makes undo easy
Logical logging:
–Store the operation itself as the log entry, e.g., rename("a", "b")
–More space-efficient, but can be tricky to implement
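The space trade-off shows up directly in the shape of the log records. These layouts are purely illustrative, not any real on-disk format:

```c
/* Physical logging: before/after images of the affected block.
 * (512-byte "blocks" assumed for the sketch.) */
struct phys_rec {
    int block;
    unsigned char before[512];   /* enables undo */
    unsigned char after[512];    /* enables redo */
};

/* Logical logging: record the operation itself and replay it on recovery. */
struct logical_rec {
    enum { OP_RENAME, OP_UNLINK } op;
    char from[256];
    char to[256];
};
```

A logical record for rename("a", "b") is far smaller than the block images it would otherwise take to log the two affected directory blocks physically.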

Summary
Filesystem consistency is important. Any filesystem design implies metadata dependency rules. The designer needs to reason about the state of the filesystem after a crash & avoid unacceptable failures.
–Needs to take the worst-case scenario into account – a crash after every single sector write
Most current filesystems use logging.
–They offer various degrees of data/metadata consistency guarantees

Filesystems: Volume Managers & the Linux VFS

Example: Linux VFS
Reality: a system must support more than one filesystem at a time.
–Users should not notice a difference unless it is unavoidable
Most systems, Linux included, use an object-oriented approach:
–VFS – the Virtual Filesystem interface

Example: Linux VFS Interface

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
    int (*readdir) (struct file *, void *, filldir_t);
    unsigned int (*poll) (struct file *, struct poll_table_struct *);
    int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, struct dentry *, int datasync);
    int (*aio_fsync) (struct kiocb *, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*dir_notify)(struct file *filp, unsigned long arg);
    int (*flock) (struct file *, int, struct file_lock *);
};

Volume Management
Traditionally, a disk is exposed as a block device (a linear-array-of-blocks abstraction).
–Refinement: disk partitions = subarrays within the block array
A filesystem sits on a partition. Problems:
–Filesystem size is limited by disk size
–Partitions are hard to grow & shrink
Solution: introduce another layer – the Volume Manager (aka “Logical Volume Manager”).

Volume Manager
The Volume Manager separates the physical composition of storage devices from the logical storage exposed to filesystems:
–filesystems: ext3 /home, ext3 /usr, jfs /opt
–logical volumes: LV1, LV2, LV3, grouped into a VolumeGroup
–physical volumes: PV1, PV2, PV3, PV4

RAID – Redundant Arrays of Inexpensive Disks
Idea born around 1988. Original observation: it’s cheaper to buy multiple small disks than a single large expensive disk (SLED).
–SLEDs don’t exist anymore, but multiple disks arranged as a single disk are still useful
Can reduce latency by writing/reading in parallel. Can increase reliability by exploiting redundancy.
–The “I” in RAID now stands for “independent” disks
Several arrangements are known; 7 have “standard numbers”. RAID can be implemented in hardware or software. A RAID array appears as a single physical volume to the LVM.

RAID 0
RAID 0: striping data across disks.
Advantage: if disk accesses go to different disks, they can proceed in parallel, decreasing latency.
Disadvantage: decreased reliability (MTTF(array) = MTTF(disk) / #disks).
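The striping address arithmetic can be sketched as follows, assuming (for the sketch) a stripe unit of one block:

```c
/* Map a logical block number to (disk, offset) under RAID-0 striping. */
struct stripe_loc { int disk; int offset; };

struct stripe_loc raid0_map(int logical_block, int ndisks)
{
    struct stripe_loc loc = {
        .disk   = logical_block % ndisks,  /* round-robin across the disks */
        .offset = logical_block / ndisks,  /* position within that disk */
    };
    return loc;
}
```

Consecutive logical blocks thus land on different disks, which is what lets a large sequential access keep all spindles busy at once.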

RAID 1
RAID 1: mirroring (all writes go to both disks).
Advantages:
–Redundancy & reliability – have a backup of the data
–Can have better read performance than a single disk – why? (either copy can serve a read, so reads can be balanced across both disks)
–About the same write performance as a single disk
Disadvantage:
–Inefficient use of storage (half the raw capacity)

Using XOR for Parity
Recall:
–X^X = 0
–X^1 = !X
–X^0 = X
Let’s set the parity W = X^Y^Z. Then:
–X^W = X^(X^Y^Z) = (X^X)^Y^Z = 0^(Y^Z) = Y^Z
–Y^(X^W) = Y^(Y^Z) = (Y^Y)^Z = 0^Z = Z
We obtain Z = X^Y^W (and analogously for X and Y): any one value can be reconstructed from the other two plus the parity.
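The derivation above translates directly into code. Single-byte “blocks” stand in for full disk blocks here; a real array would XOR byte-by-byte over whole blocks:

```c
/* Parity over a 3-disk stripe. */
unsigned char parity3(unsigned char x, unsigned char y, unsigned char z)
{
    return x ^ y ^ z;                      /* W = X ^ Y ^ Z */
}

/* Rebuild a lost block from the two survivors and the parity:
 * Z = X ^ Y ^ W (and analogously for X or Y). */
unsigned char reconstruct(unsigned char a, unsigned char b, unsigned char w)
{
    return a ^ b ^ w;
}
```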

RAID 4
RAID 4: striping + a block-level parity disk.
Advantage: needs only N+1 disks for N disks’ worth of capacity plus 1 disk of redundancy.
Disadvantage: small writes (less than one stripe) require 2 reads & 2 writes.
–Read the old data, read the old parity, write the new data, compute & write the new parity
–The parity disk can become a bottleneck
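The reason 2 reads suffice is that the new parity follows from the old data, the new data, and the old parity alone – the rest of the stripe need not be read. Again with single-byte “blocks” for brevity:

```c
/* RAID-4/5 small-write parity update: every bit flipped in the data
 * must also be flipped in the parity, and (old ^ new) is exactly the
 * set of flipped bits. */
unsigned char update_parity(unsigned char old_data, unsigned char new_data,
                            unsigned char old_parity)
{
    return old_parity ^ old_data ^ new_data;
}
```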

RAID 5
RAID 5: striping + block-level distributed parity.
Like RAID 4, but avoids the parity-disk bottleneck. Gets the read-latency advantage of RAID 0, and the best large-read & large-write performance. The only remaining disadvantage is small writes:
–the “small write penalty”
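Distributing the parity just means rotating its location from stripe to stripe. The placement below is one simple rotation for illustration, not the layout of any particular implementation (real arrays offer several rotation schemes):

```c
/* For stripe s on an array of ndisks disks, place the parity block on a
 * different disk each stripe, so parity writes are spread evenly and no
 * single disk becomes the bottleneck. */
int parity_disk(int stripe, int ndisks)
{
    return stripe % ndisks;
}
```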

Other RAID Issues
RAID 6: dual parity, code-based; provides additional redundancy (2 disks may fail before data is lost).
RAID 0+1 and RAID 1+0:
–mirroring + striping
Interaction with the filesystem:
–WAFL (Write Anywhere File Layout) avoids in-place updates to avoid the small-write penalty
–Based on the LFS (log-structured filesystem) idea in which all writes go to new locations