The HP AutoRAID Hierarchical Storage System
John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan
“virtualized disk gets smart…”

HP AutoRAID 2
o File System Recap
  o the OS manages the storage of files on storage media using a File System
  o storage media are comprised of an array of data units called sectors
  o the File System:
    o organizes sectors into addressable storage units
    o establishes a directory structure for accessing files
  o FFS and LFS were both developed as improvements over previous file systems
    o both improved performance by optimizing access
  o FFS:
    o increased the block size to reduce the number of block addresses managed in the directory
    o logically grouped cylinders to help ensure locality for the blocks of a file
  o LFS:
    o eliminated seek times by always writing at the end of the log
    o introduced a new addressable structure called the extent: a large contiguous set of blocks
    o extents are needed so there is plenty of room at the end of the log for writing new entries
    o requires garbage collection of old log entries: live blocks of partially filled extents are migrated to other extents to free up space (see the sketch below)
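
A minimal sketch of those two LFS ideas, assuming a toy in-memory log (SEG_SIZE, log, and block_map are illustrative names, not structures from the LFS paper): writes always append at the tail, and cleaning migrates the live blocks of a partially dead segment so the whole segment can be reused.

```python
SEG_SIZE = 2       # blocks per segment (tiny, to keep the example short)

log = []           # the log: a list of (block_id, data) records
block_map = {}     # block_id -> log position of the live (newest) copy

def write(block_id, data):
    log.append((block_id, data))          # always append at the tail
    block_map[block_id] = len(log) - 1    # newer copy supersedes older

def clean_segment(seg_no):
    """Re-append any live blocks of one segment so it can be freed."""
    for pos in range(seg_no * SEG_SIZE, (seg_no + 1) * SEG_SIZE):
        if pos < len(log):
            block_id, data = log[pos]
            if block_map.get(block_id) == pos:   # still the live copy?
                write(block_id, data)            # migrate it to the tail

write("a", "v1"); write("b", "v1")   # fill segment 0
write("a", "v2"); write("b", "v2")   # segment 1 now holds the live copies
clean_segment(0)                     # nothing live in segment 0: it is free
```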

HP AutoRAID 3
o Crash Recovery
  o the issue is the consistency of directory data after a crash or power failure
  o directory information is typically written after the file data is written
  o FFS:
    o after a crash you have no way of knowing what you were last doing
    o requires a consistency check: all inode information must be verified against the data it maps to
    o inconsistencies cannot always be repaired, so data can be lost
  o LFS:
    o drastically reduces recovery time because of checkpointing
    o a checkpoint is a noted recent time when the files and the inode map were consistent
    o verify by rolling forward through the log from the last checkpoint (sketched below)
    o LFS keeps lots of other metadata and stores some of it with the file, increasing the odds of restoring consistency
  o but neither can recover from a hardware failure…
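
The roll-forward step can be sketched in a few lines; the inode-map-as-dict and the log record format here are assumptions for illustration only:

```python
# Recover by starting from the last checkpointed (consistent) inode map
# and replaying only the log records written after the checkpoint.
def recover(checkpoint_map, checkpoint_pos, log):
    inode_map = dict(checkpoint_map)           # last consistent state
    for inode_no, disk_addr in log[checkpoint_pos:]:
        inode_map[inode_no] = disk_addr        # reapply newer updates
    return inode_map

log = [(1, "seg0:0"), (2, "seg0:1"), (1, "seg1:0")]   # (inode, address)
ckpt = {1: "seg0:0", 2: "seg0:1"}                     # taken at position 2
print(recover(ckpt, 2, log))                 # {1: 'seg1:0', 2: 'seg0:1'}
```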

HP AutoRAID 4
o RAID! (around the late 1980s)
  o Redundant Array of Inexpensive (or Independent) Disks
  o connect multiple cheap disks into an ARRAY of disks and spread data across them!
  o a single disk is less reliable than an array of smaller drives with redundancy
o Virtualization!
  o multiple disks, but the File System sees only one virtual unit (and doesn’t know it’s virtual!)
  o requires an ARRAY CONTROLLER, a combination of hardware and software
  o the controller handles the mapping between where the FS thinks data is and where it actually is
o Redundancy!
  o partial, like parity
  o full, like an extra copy
  o if a single drive in the array is lost, its data can be automatically regenerated
  o no longer have to worry too much about drives failing!

HP AutoRAID 5
o RAID Levels
o RAID 1 - Mirroring
  o full redundancy!
  o zero recovery time in case of disk failure: just use the copy
  o storage capacity = 50% of the total size of the array
  o writes are serialized at some level between the two disks
    o so a crash or power failure cannot leave both disks in an inconsistent state
    o this makes writes slower than just writing to one disk
    o a write request does not return until both copies have been updated (see the sketch below)
  o transfer rate = same as one disk
  o parallel reads! each copy can service a read request
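
The write-both-then-return rule is the whole trick; a minimal sketch (the class and names are assumptions, not AutoRAID code):

```python
# RAID 1 semantics: a write returns only after both copies are updated;
# a read can be served by either copy, so two reads can run in parallel.
class Mirror:
    def __init__(self, nblocks):
        self.copies = [[None] * nblocks, [None] * nblocks]

    def write(self, block, data):
        for copy in self.copies:        # both updates complete...
            copy[block] = data          # ...before the request returns

    def read(self, block, copy=0):      # either copy can serve a read
        return self.copies[copy][block]

m = Mirror(8)
m.write(3, "hello")
assert m.read(3, copy=0) == m.read(3, copy=1)   # the copies agree
```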

HP AutoRAID 6
o RAID Levels
o RAID 3 - Byte-level striping, parity on a check disk
  o spread data by striping: byte1 -> disk1, byte2 -> disk2, byte3 -> disk3
  o reads and writes of a stripe’s bytes happen at the same time!
  o transfer rate = (N - 1) * the transfer rate of one disk
  o only partial redundancy: the check disk stores parity information
    o parity overhead amounts to one bit per group of corresponding bits in a stripe
    o redundancy overhead = 1/N of the array’s capacity
  o Oops! Byte striping means every disk is involved in every request!
    o no parallel reads or writes
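
A sketch of the byte-level layout (the function name and sizes are made up): with N disks, one is the check disk and byte i of the data lands on data disk i mod (N - 1), which is why a transfer engages all N - 1 data disks at once.

```python
# Round-robin byte striping across the data disks of an N-disk RAID 3
# group; the remaining disk (not shown) holds the parity bytes.
def byte_stripe(data, n_disks):
    n_data = n_disks - 1                    # one disk is the check disk
    return [data[i::n_data] for i in range(n_data)]

disks = byte_stripe(b"abcdef", 4)
assert disks == [b"ad", b"be", b"cf"]       # 3 data disks, 2 bytes each
```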

HP AutoRAID 7
o Parity
  o parity is computed using XOR ( ^ ), as in the example below
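
The example that accompanied this slide was a figure; a small stand-in with made-up values shows the same arithmetic:

```python
# Parity is the XOR of the corresponding data values. Because
# x ^ x == 0, any one lost value can be rebuilt from the survivors.
d1, d2, d3 = 0b1010, 0b0110, 0b1100
parity = d1 ^ d2 ^ d3             # stored on the check disk
assert d2 == d1 ^ d3 ^ parity     # recover d2 after "losing" disk 2
```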

HP AutoRAID 8
o RAID Levels
o RAID 5 - Block-level striping, parity interleaved
  o the striping unit is 1 block: block1 -> disk1, block2 -> disk2, block3 -> disk3, etc.
  o the blocks of a stripe are written at the same time!
  o transfer rate = (N - 1) * the transfer rate of one disk
  o only partial redundancy!
    o parity information is dispersed round-robin among all the disks
    o same redundancy overhead as level 3: 1/N of the array’s capacity
  o Hey! Block striping can mean that every disk is NOT involved in a (small) request
    o parallel reads and writes can occur, depending on which disks store the involved blocks
  o BUT writes get slower! (this happened in RAID 3 too)
    o read - modify - write (sketched below):
      o read the old data and the old parity
      o recompute the parity from the old data, old parity, and new data
      o write the new data and the new parity
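
A hedged sketch of that small-write path (the layout and names are illustrative, not AutoRAID's): the new parity folds the old data out of the old parity and folds the new data in, so only two disks are touched.

```python
# RAID 5 read-modify-write for a single-block ("small") update.
def small_write(disks, stripe, data_disk, parity_disk, new_data):
    old_data = disks[data_disk][stripe]         # read old data
    old_parity = disks[parity_disk][stripe]     # read old parity
    new_parity = old_parity ^ old_data ^ new_data
    disks[data_disk][stripe] = new_data         # write new data
    disks[parity_disk][stripe] = new_parity     # write new parity

disks = [[1], [2], [3], [1 ^ 2 ^ 3]]            # stripe 0; parity on disk 3
small_write(disks, 0, 1, 3, 7)                  # overwrite disk 1's block
assert disks[3][0] == disks[0][0] ^ disks[1][0] ^ disks[2][0]
```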

HP AutoRAID 9
o RAID 1 vs RAID 5
o Reads:
  o RAID 1 (mirroring) always offers parallel reads
  o RAID 5 can only sometimes offer parallel reads
    o depends on where the needed blocks are
    o two read requests that require blocks on the same disk must be serialized
o Writes:
  o RAID 1 (mirroring) must complete two writes before the request returns
    o the granularity of serialization can be smaller than a file
    o can’t do parallel writes
  o RAID 5 typically does read-modify-write to recompute parity
    o (HP AutoRAID uses a combination of read-modify-write and LFS!)
    o can’t do parallel writes either
o Redundancy Overhead:
  o RAID 1 = full redundancy; storage capacity reduced by 50%
  o RAID 5 = partial redundancy; storage capacity reduced by 1/N

HP AutoRAID 10
o Storage Hierarchy = HP AutoRAID
  o RAID 1 = fast reads and writes, but 50% redundancy overhead
  o RAID 5 = strong reads, slow writes, 1/N storage overhead
  o RAID 1 is fast but expensive, like a cache!
  o RAID 5 is slower but cheaper, like main memory!
  o neither is optimal under all circumstances…
o SO create a hierarchy:
  o use mirroring for active blocks
    o active set = blocks of regularly read and written files
  o use RAID 5 for inactive blocks
    o inactive set = blocks of read-only and rarely accessed files
o Sounds hard!
  o who pushes the data back and forth between the sets?
  o how often do you have to do it?
  o if the sets change too often, there is no time for anything else!

HP AutoRAID 11
o Who Minds the Storage Hierarchy?
o The System Administrator?
  o as long as you don’t have to pay them much
  o and if they get it right all the time and never make a mistake
o The File System?
  o big plus: the File System knows better than anything who is using which files
    o it can best determine the active and inactive sets by tracking access patterns
  o BUT there are a lot of different OSes with different File System options
    o that makes deployment hard: each File System must be modified to manage a storage hierarchy
o An Array Controller?
  o embed the software that manages the hierarchy in the hardware of a controller
  o no deployment issues: just add the hardware to the system
  o but it operates below the existing File System…
    o we lose the ability to track access patterns…
    o we need a reliable and usually-correct policy for determining the active/inactive sets…
  o sounds like virtualization…

HP AutoRAID 12
o HP AutoRAID (the local hard drive gets smart!)
  o the array controller’s embedded software manages the active/inactive sets
  o an application-level user interface exposes configuration parameters
    o used to set up LUNs (virtual logical units)
  o virtualization: the File System is out of the loop!
o Consider mapping:
  o the File System thinks it is addressing the blocks of a particular file
  o it doesn’t know the file is actually in a storage hierarchy
    o is the requested file in the active set? or the inactive set? which disk is it on?
  o we need a mapping between what the file system sees and where the data actually resides on disk

HP AutoRAID 13
o Virtual-to-Physical Mapping
o Physically, the array is structured by an address hierarchy:
  o PEGs contain 3 or more PEXes (on different disks)
  o PEXes are typically 1MB of contiguous disk space, divided into segments
  o segments are 128KB of contiguous sectors; each holds 2 Relocation Blocks (RBs, 64KB each)
  o Relocation Blocks serve as:
    o the striping unit in RAID 5, the mirroring unit in RAID 1,
    o and the unit of migration between the active and inactive sets
o Virtually, the File System sees LUNs (Logical Units):
  o purely virtual: no superblock, no directory, not a partition
  o rather, a LUN is a set of RBs that get mapped to physical segments when actually used
  o the user can create as many LUNs as they want
o Each LUN has a virtual device table that holds the list of RBs assigned to it
  o RBs in the virtual device table are mapped to RBs in PEG tables
  o PEG tables map RBs to PEX tables, in which RBs are assigned to actual segments

HP AutoRAID 14
o Mapping
  o if RB3 migrates from inactive to active, simply update the PEX mapping in the PEG table entry that maps RB3 (a toy model follows below)
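
A toy model of the table chain (all structures here are simplifications and assumptions, not the paper's exact layouts). Migrating RB3 mostly amounts to pointer updates; in this simplified version the LUN's entry is touched too, because the RB moves to a different PEG:

```python
# LUN virtual device table -> PEG table -> PEX segments, as dicts/lists.
vdt = {"RB3": ("peg_raid5", 3)}              # the LUN's view of RB3
peg_tables = {"peg_raid5": {3: ("pex7", 0)}, # RB3 -> PEX 7, segment 0
              "peg_mirrored": {}}
pex_segments = {"pex7": ["RB3"], "pex2": [None]}

def migrate(rb, dst_peg, dst_pex, dst_seg):
    """Move an RB between storage classes by rewriting table entries."""
    src_peg, idx = vdt[rb]
    old_pex, old_seg = peg_tables[src_peg].pop(idx)
    pex_segments[old_pex][old_seg] = None    # leaves a hole behind
    pex_segments[dst_pex][dst_seg] = rb
    peg_tables[dst_peg][idx] = (dst_pex, dst_seg)
    vdt[rb] = (dst_peg, idx)

migrate("RB3", "peg_mirrored", "pex2", 0)    # promote RB3 to mirrored
assert pex_segments == {"pex7": [None], "pex2": ["RB3"]}
```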

HP AutoRAID 15
o How cool is that… what you can do when you’re not in control anymore…
o Hot-pluggable disks
  o take one out, and the RAID immediately begins regenerating the missing data
  o or, if one fails, activate a spare, if available
  o the array still functions: no downtime
  o requests for missing data are given top priority for regeneration
o Create a larger array on the fly
  o the size of the array is limited by the size of the smallest disk
  o so take a small disk out and put a larger disk in
  o systematically replace all the disks, one by one, letting each regenerate
  o when the last bigger disk goes in, the array is automatically larger
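
Regeneration itself is just the parity identity applied block by block; a sketch assuming single-parity redundancy (names are illustrative):

```python
# Rebuild a lost member: each of its blocks is the XOR of the
# corresponding blocks on every surviving disk in the group.
from functools import reduce

def regenerate(surviving_disks):
    return [reduce(lambda a, b: a ^ b, blocks)
            for blocks in zip(*surviving_disks)]

survivors = [[1, 4], [2, 5], [1 ^ 2 ^ 9, 4 ^ 5 ^ 6]]   # data + parity
assert regenerate(survivors) == [9, 6]                  # the lost disk
```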

HP AutoRAID 16
o HP AutoRAID Read and Write Operations
o RAID 1 Mirrored Storage Class
  o normal RAID level 1 reads and writes
  o 2 reads can happen in parallel
  o a write is serialized (at the segment level) between the two disks
  o both updates must complete before the request returns (remember the overhead!)
o RAID 5 Storage Class
  o reads are processed as normal RAID 5 read operations, in parallel when possible
  o writes are log-structured, and when they happen is more complicated
  o RAID 5 writes happen for 1 of 3 reasons (sketched below):
    o a File System request tries to write data held at RAID 5:
      o this promotes the requested data to the active set
      o (no actual write happens at RAID 5 in this case)
    o the mirrored storage class runs out of space:
      o data is demoted from active to inactive; RBs are copied from the mirrored class to RAID 5
    o garbage collection and cleaning take place within RAID 5
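
A hedged sketch of the first two cases of that decision (the list-as-LRU, the capacity check, and all names are assumptions layered on the slide's description):

```python
# Write path: promote on a write to RAID 5 data, demote when the
# mirrored class runs out of space, then mirror the new data.
def handle_write(rb, mirrored, raid5, mirrored_capacity):
    if rb in raid5:
        raid5.discard(rb)          # promotion: no RAID 5 write happens
    if rb in mirrored:
        mirrored.remove(rb)        # will re-append as most recent
    if len(mirrored) >= mirrored_capacity:
        victim = mirrored.pop(0)   # full: demote the oldest RB,
        raid5.add(victim)          # copying it down to RAID 5
    mirrored.append(rb)            # the new data is mirrored

mirrored, raid5 = [], {"rb1", "rb2"}
handle_write("rb1", mirrored, raid5, mirrored_capacity=4)
assert mirrored == ["rb1"] and raid5 == {"rb2"}
```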

HP AutoRAID 17
o Holes, Cleaning, and Garbage Collection
o Holes come from:
  o demotion of RBs from active to inactive, which leaves “holes” in the PEXes of the mirrored class
    o these holes are managed as a free list
  o promotion of RBs from inactive to active, which leaves holes in the PEXes of RAID 5
    o by the way, RAID 5 in HP AutoRAID uses LFS… so these holes must be garbage collected
o Cleaning: plug the holes
  o RBs are migrated between PEGs to fill some PEGs and empty others
  o cleaning the mirrored class frees up PEGs to accommodate bursts or to give to RAID 5
  o cleaning RAID 5 is an alternative to garbage collection
o Garbage Collection:
  o normal LFS garbage collection
  o or hole-plugging garbage collection to fill and free PEGs (sketched below)
    o this performs much better, reducing garbage collection work by up to 90%!
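
A toy sketch of hole plugging (PEGs here are just lists of RB slots; that representation is an assumption): live RBs from a mostly empty PEG are copied into the holes of fuller PEGs, freeing the source PEG without a full log-structured cleaning pass.

```python
def plug_holes(source_peg, target_pegs):
    """Move the source PEG's live RBs into holes elsewhere, then free it."""
    live = [rb for rb in source_peg if rb is not None]
    for peg in target_pegs:
        for i, slot in enumerate(peg):
            if slot is None and live:
                peg[i] = live.pop()        # plug a hole with a live RB
    if not live:
        source_peg[:] = [None] * len(source_peg)   # whole PEG is now free
    return target_pegs

src = ["rb9", None, "rb4", None]
targets = [["rb1", None, "rb2", "rb3"], [None, "rb5", "rb6", "rb7"]]
plug_holes(src, targets)
assert src == [None] * 4 and targets[0][1] == "rb4"
```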

HP AutoRAID 18
o Performance
  o depends most on how much of the active set fits into the mirrored class
  o if it all fits, then RAID 5 goes unused, and performance is that of a RAID 1 array
o Tested OLTP against a weaker RAID array and JBOD
  o JBOD = just a bunch of disks, striped, no redundancy (so it performs the best!)
  o tested with all of the active set fitting in the mirrored storage class, so there was no migration overhead
  o AutoRAID lags JBOD due to redundancy overhead
o Tested performance for different percentages of the active set at the mirrored level
  o more disks = a higher percentage in the mirrored storage class
  o obviously performance rises with a higher percentage because there is less migration
  o interesting to note: at 8 drives, when all of the active set fits, performance still rises because the transfer rate increases with more disks to write to
[Figure: transaction rate of OLTP for a slow RAID array, HP AutoRAID, and JBOD]
[Figure: transaction rate as the number of disks in the AutoRAID increases]

HP AutoRAID 19
o Can the File System help?
  o the File System sees a virtual disk
  o it probably has its own ideas about how best to lay data out on blocks to optimize access
    o perhaps by assigning the RBs of a LUN to a linear set of contiguous blocks…
  o BUT are they really going to be contiguous?
    o in the array, RBs can be mapped anywhere and most likely are not stored linearly
    o so does this make seek times really bad?
  o they ran tests where the array was initially set up:
    o with all RBs laid out completely linearly
    o with all RBs laid out completely randomly
  o the result: only modest performance gains for the initial linear layout
    o note there is no way to migrate data between sets and maintain a linear layout…
o Conclusion:
  o the 64KB RB allocation block may sound big, but it works just fine
  o remember, large block sizes amortize seek times
  o RB size is subject to the same considerations as the block size on a normal hard drive

HP AutoRAID 20
o Mirrored Storage Class Read Selection Algorithm
  o which copy should be read? possibilities:
    o strict alternation
    o keep one disk head on the outer tracks, the other on the inner tracks
    o read from the disk with the shortest queue
    o read from the disk with the shortest seek time
  o strict alternation and inner/outer can give big benefits under certain workloads…
    o AND can really punish performance under other workloads
  o shortest queue and shortest seek time yield the same modest gain
    o but shortest seek time is hard to track, so shortest queue wins (sketched below)
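
The winning policy fits in a couple of lines; a sketch with an assumed queue representation:

```python
# Send a mirrored read to whichever copy has the shorter request queue.
def pick_copy(queues):
    """Index of the disk with the shortest pending-request queue."""
    return min(range(len(queues)), key=lambda i: len(queues[i]))

queues = [["r1", "r2"], ["r3"]]      # pending requests per mirror copy
assert pick_copy(queues) == 1        # copy 1's queue is shorter
```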

HP AutoRAID 21
o Conclusion
  o redundancy protects against data loss due to hardware failure
  o different striping units and levels of redundancy result in different performance
    o performance depends on the type of workload
  o redundancy also introduces overhead: 50% for mirroring
  o reduce the redundancy overhead by using a storage hierarchy
    o implement different RAID levels for active and inactive data
  o the storage hierarchy is managed by an array controller
    o the management software is embedded in the hardware controller
    o a special mapping virtualizes the array: the File System sees one (or more) virtual logical units