HP AutoRAID (Lecture 5, cs262a)


HP AutoRAID (Lecture 5, cs262a). Ali Ghodsi and Ion Stoica, UC Berkeley, January 31, 2018 (based on slides from John Kubiatowicz, UC Berkeley).

Array Reliability. Reliability of N disks = reliability of one disk ÷ N, so 50,000 hours ÷ 70 disks ≈ 700 hours: the disk-system MTTF drops from about 6 years to about 1 month! Arrays without redundancy are too unreliable to be useful. Hot spares support reconstruction in parallel with access, so very high media availability can be achieved.
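
A quick back-of-the-envelope check of the arithmetic above, using the slide's figures (50,000-hour single-disk MTTF, 70 disks); a minimal sketch in Python:

    # With no redundancy, any single-disk failure loses data, so the array MTTF
    # is roughly the single-disk MTTF divided by the number of disks.
    disk_mttf_hours = 50_000          # single-disk MTTF from the slide
    num_disks = 70

    array_mttf_hours = disk_mttf_hours / num_disks
    print(array_mttf_hours)                   # ~714 hours
    print(disk_mttf_hours / (24 * 365))       # one disk: ~5.7 years
    print(array_mttf_hours / (24 * 30))       # the array: ~1 month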

RAID: Redundant Array of Inexpensive Disks. Invented at Berkeley in 1988 by Dave Patterson, Garth Gibson (now at CMU; CEO of the Vector Institute, Toronto), and Randy Katz. Patterson, David; Gibson, Garth A.; Katz, Randy (1988). "A Case for Redundant Arrays of Inexpensive Disks (RAID)," SIGMOD '88.

RAID Basics: Levels of RAID. RAID 0: striping with no parity. High read/write throughput, theoretically Nx where N is the number of disks, but no data redundancy. (See https://en.wikipedia.org/wiki/Standard_RAID_levels for the standard level definitions.)
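
For concreteness, a minimal sketch (not from the paper) of how block-level striping maps a logical block number onto a disk and an offset; the 4-disk example is made up:

    def raid0_map(logical_block: int, num_disks: int) -> tuple[int, int]:
        """Map a logical block number to (disk index, block offset on that disk)."""
        disk = logical_block % num_disks       # consecutive blocks rotate across the disks
        offset = logical_block // num_disks    # row within each disk
        return disk, offset

    # With 4 disks, blocks 0-3 land on disks 0-3 at offset 0, blocks 4-7 at offset 1, ...
    print([raid0_map(b, 4) for b in range(8)])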

RAID Basics: Levels of RAID. RAID 1: mirroring. Simple and fast, but 2x storage: each disk is fully duplicated onto its "shadow", giving high availability. Reads can go to either copy, roughly doubling read throughput; writes are 1x, because every logical write becomes two physical writes.

RAID Basics: Levels of RAID. RAID 2: bit-level interleaving with Hamming error-correcting codes. A code with Hamming distance d can detect d-1 bit errors and correct (d-1)/2 of them. Needs synchronized disks, i.e., all spindles at the same angular rotation.

RAID Basics: Levels of RAID. RAID 3: byte-level striping with a dedicated parity disk. The dedicated parity disk is a write bottleneck: every write must also update parity.

RAID Basics: Levels of RAID. RAID 4: block-level striping with a dedicated parity disk. Same parity-disk bottleneck as RAID 3.
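
The parity in RAID 3/4 (and RAID 5) is just the bytewise XOR of the data blocks in a stripe, so any single lost block can be rebuilt by XOR-ing the survivors. A minimal sketch, with made-up block contents:

    def xor_blocks(blocks):
        """Bytewise XOR of equal-sized blocks: computes parity, and also rebuilds a lost block."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    data = [b"AAAA", b"BBBB", b"CCCC"]             # three data blocks in one stripe
    parity = xor_blocks(data)

    # If the disk holding data[1] fails, its block is recovered from the rest plus parity.
    assert xor_blocks([data[0], data[2], parity]) == data[1]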

RAID Basics: Levels of RAID. RAID 5: block-level striping with rotating (distributed) parity. The most popular level: it spreads the parity load across all disks; usable space is 1-1/N of raw capacity, and read/write throughput is roughly (N-1)x.
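
A minimal sketch of how RAID 5 rotates the parity block from stripe to stripe (the exact rotation order varies by implementation; this uses a simple left-rotating layout purely as an illustration):

    def raid5_layout(stripe: int, num_disks: int):
        """Return (parity disk, data disks in order) for a given stripe index."""
        parity_disk = (num_disks - 1 - stripe) % num_disks   # parity moves one disk per stripe
        data_disks = [d for d in range(num_disks) if d != parity_disk]
        return parity_disk, data_disks

    # With 4 disks: stripe 0 keeps parity on disk 3, stripe 1 on disk 2, and so on.
    for s in range(4):
        print(s, raid5_layout(s, 4))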

RAID Basics: Levels of RAID. RAID 6: RAID 5 with two parity blocks per stripe, tolerating two drive failures. Use it with today's drive sizes. Why? Correlated drive failures (about 2x the expected rate within a 10-hour recovery window) [Schroeder and Gibson, FAST07], and failures during the multi-hour or multi-day rebuilds seen in high-stress environments.

Redundant Arrays of Disks: RAID 5+ high-I/O-rate parity. Parity is interleaved across the disk columns and rotates as logical disk addresses increase, so independent writes to different stripes can proceed in parallel. Reed-Solomon codes ("Q") can be added for protection during reconstruction. A logical write becomes four physical I/Os. Targeted for mixed applications. [Figure: stripes of data blocks with one parity block P per stripe, the parity position rotating across the disk columns; a stripe unit is one block, a stripe is one row.]

Problems of Disk Arrays: Small Writes. RAID-5 small-write algorithm: 1 logical write = 2 physical reads + 2 physical writes. To replace data block D0 with D0' in a stripe D0 D1 D2 D3 P: (1) read old data D0, (2) read old parity P, (3) write new data D0', (4) write new parity P' = D0' XOR D0 XOR P. The other data blocks D1, D2, D3 are not touched.
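
A minimal sketch of that read-modify-write sequence: the new parity is computed from just the old data, the old parity, and the new data, which is why a single-block update costs two reads plus two writes. The read_block/write_block callbacks are hypothetical placeholders for the disk I/O:

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def raid5_small_write(read_block, write_block, data_addr, parity_addr, new_data):
        """RAID-5 small write: update one data block and its parity block only."""
        old_data = read_block(data_addr)        # 1. read old data
        old_parity = read_block(parity_addr)    # 2. read old parity
        new_parity = xor(xor(new_data, old_data), old_parity)
        write_block(data_addr, new_data)        # 3. write new data
        write_block(parity_addr, new_parity)    # 4. write new parity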

System Availability: Orthogonal RAIDs. Data recovery group: the unit of data redundancy, arranged orthogonally to the hardware strings so that a failure of any one string affects at most one disk per group. Redundant support components: fans, power supplies, controllers, cables. End-to-end data integrity: internal parity-protected data paths. [Figure: an array controller connected to several string controllers, each driving a column of disks.]

System-Level Availability. Goal: no single points of failure. A fully dual-redundant configuration: the host has duplicated I/O controllers and paths down to duplicated array controllers and the disk recovery groups; when there are no failures, the duplicated paths can also be used for higher performance. [Figure: host, dual I/O controllers, dual array controllers, and recovery groups of disks with redundant interconnect.]

HP AutoRAID – Motivation. Goal: automate the efficient replication of data in a RAID. RAIDs are hard to set up and optimize, and different RAID levels provide different tradeoffs, so automate the migration between levels. RAID 1 vs RAID 5: what are the tradeoffs? RAID 1: excellent read performance, good write performance, performs well under failures, but expensive in capacity. RAID 5: cost-effective and good read performance, but poor small-write performance and expensive recovery.

HP AutoRAID – Motivation. Each kind of replication has a narrow range of workloads for which it is best; a mistake means (1) poor performance and (2) an expensive, error-prone layout change. It is also difficult to add storage: a new disk forces a layout change and data rearrangement, and disks of different sizes are hard to accommodate.

HP AutoRAID – Key Ideas. Key idea: mirror active (hot) data and keep cold data in RAID 5. This assumes that only part of the data is in active use at any one time and that the working set changes slowly enough to allow migration.

HP AutoRAID – Key Ideas. How to implement this idea? Have the sysadmin move files around by hand: bad, painful, and error-prone. In the file system: hard to implement and deploy, and it can't work with existing systems. In a smart array controller (a "magic disk" behind the block-level device interface): easy to deploy because the block interface is a well-defined abstraction, and it enables easy use of NVRAM (why?).

AutoRAID Details
PEX (Physical Extent): 1 MB chunk of disk space.
PEG (Physical Extent Group): a group of PEXes assigned to one storage class; size depends on the number of disks.
Stripe: one row of parity and data segments in a RAID 5 storage class; size depends on the number of disks.
Segment: 128 KB; the stripe unit (RAID 5) or half of a mirroring unit.
Relocation Block (RB): 64 KB; the client-visible space unit.
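
The fixed sizes above can be captured in a few constants; a minimal sketch (PEG and stripe sizes depend on the number of disks and are omitted):

    PEX_SIZE     = 1 * 1024 * 1024   # Physical Extent: 1 MB chunk of disk space
    SEGMENT_SIZE = 128 * 1024        # stripe unit (RAID 5), or half of a mirroring unit
    RB_SIZE      = 64 * 1024         # Relocation Block: the client-visible space unit

    # For example, how many relocation blocks fit in one physical extent:
    print(PEX_SIZE // RB_SIZE)       # 16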

Closer Look

Putting Everything Together
Virtual device tables: map RBs to PEGs; an RB is allocated to a PEG only when it is first written.
PEG tables: map PEGs to physical disk addresses; unused slots in a PEG are marked free until an RB is allocated to them.
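
A minimal sketch of that two-level lookup: the virtual device table binds an RB to a PEG slot on first write, and the PEG table turns the slot into a physical disk address. The structures and names here are illustrative, not the paper's actual layout:

    class MappingTables:
        def __init__(self):
            self.virtual_device_table = {}   # RB number -> (PEG id, slot within the PEG)
            self.peg_table = {}              # (PEG id, slot) -> (disk, physical offset)

        def lookup(self, rb: int):
            """Return the physical location of an RB, or None if it has never been written."""
            peg_slot = self.virtual_device_table.get(rb)
            if peg_slot is None:
                return None                  # RB not yet allocated to any PEG
            return self.peg_table[peg_slot]

        def allocate_on_write(self, rb: int, peg_slot, disk_addr):
            """Bind an RB to a free PEG slot the first time it is written."""
            self.virtual_device_table[rb] = peg_slot
            self.peg_table[peg_slot] = disk_addr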

HP AutoRAID – Features
Promote/demote in 64 KB chunks (8-16 blocks).
Hot-swap disks, etc. (a hot swap is just a controlled failure).
Add storage easily (it goes into the mirrored pool); this also makes it useful to allow disks of different sizes (why?).
No need for an active hot spare per se; just keep enough working space around.
Log-structured RAID 5 writes: nice big streams, with no need to read old parity for partial writes.
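
A minimal sketch of why log-structured RAID 5 writes avoid the small-write penalty: if demoted RBs are appended until a whole stripe is buffered, parity is computed once over new data only, with no reads of old data or old parity (the paper's actual log layout differs in detail; write_segment is a hypothetical I/O callback):

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    class Raid5Log:
        """Append-only RAID 5 writer: buffer segments until a full stripe exists."""
        def __init__(self, data_disks: int, write_segment):
            self.data_disks = data_disks
            self.write_segment = write_segment   # hypothetical disk-I/O callback
            self.pending = []

        def append(self, segment: bytes):
            self.pending.append(segment)
            if len(self.pending) == self.data_disks:
                parity = self.pending[0]
                for seg in self.pending[1:]:
                    parity = xor(parity, seg)
                for seg in self.pending:
                    self.write_segment(seg)      # write the new data segments
                self.write_segment(parity)       # one parity write per full stripe
                self.pending = []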

Questions
When to demote? When there is too much mirrored storage (>10%).
Demotion leaves a 64 KB hole. What happens to it? It is moved to a free list and reused.
Demoted RBs are written to the RAID 5 log: one write for the data, a second for the parity.
Why is the RAID 5 log better than update-in-place? An in-place update requires reading the old data to recalculate parity; the log ignores the old data (which simply becomes garbage) and writes only new data/parity stripes.
How to promote? When a RAID 5 block is written, just write it to the mirrored class; the old version becomes garbage.
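
A minimal sketch of the promotion/demotion flow just described. The structure names, the least-recently-written placeholder, and the way the >10% threshold is applied are all illustrative assumptions, not the paper's exact policy:

    MIRROR_SLACK = 0.10   # demote once the mirrored class exceeds its target by >10%

    class AutoRaidPolicy:
        def __init__(self, mirror_capacity_rbs: int, raid5_log):
            self.mirror_capacity_rbs = mirror_capacity_rbs
            self.raid5_log = raid5_log
            self.mirrored = {}          # RB -> data held in the mirrored storage class
            self.free_slots = []        # 64 KB holes left behind by demotions

        def on_write(self, rb: int, data: bytes):
            # Promotion: a written RB goes to the mirrored class; if an old copy
            # exists in RAID 5, it simply becomes garbage for the cleaner.
            self.mirrored[rb] = data
            self.maybe_demote()

        def maybe_demote(self):
            limit = self.mirror_capacity_rbs * (1 + MIRROR_SLACK)
            while len(self.mirrored) > limit:
                victim = self.pick_least_recently_written()
                self.raid5_log.append(self.mirrored.pop(victim))  # data write + parity write
                self.free_slots.append(victim)   # the RB's old mirrored slot becomes a reusable hole

        def pick_least_recently_written(self):
            # Placeholder: the real system tracks write recency per RB.
            return next(iter(self.mirrored))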

Questions
How big should an RB be? Bigger RBs mean less mapping information and fewer seeks; smaller RBs give finer-grained mapping.
How do you find where an RB is? Convert the host address to a (LUN, offset) pair and look the RB up in a table keyed by that pair. The map has one entry per RB, so its size is proportional to the total storage.
How to handle thrashing (too much active write data)? Automatically revert to writing RBs directly to RAID 5.
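
A minimal sketch of the address translation and of how the map size scales with total storage; names are illustrative and RB_SIZE is the 64 KB unit from earlier:

    RB_SIZE = 64 * 1024

    def host_address_to_rb(lun: int, byte_offset: int):
        """Convert a host (LUN, byte offset) into the RB holding it, plus the offset inside that RB."""
        rb_key = (lun, byte_offset // RB_SIZE)
        return rb_key, byte_offset % RB_SIZE

    def map_entries_needed(total_bytes: int) -> int:
        # One map entry per RB: bigger RBs mean fewer entries (less mapping state)
        # but coarser-grained placement; either way the map grows with total storage.
        return total_bytes // RB_SIZE

    print(host_address_to_rb(lun=2, byte_offset=200 * 1024))   # ((2, 3), 8192)
    print(map_entries_needed(100 * 2**30))                     # ~1.6 million entries for 100 GB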

Issues
Disk writes go to two disks (since newly written data is "hot"). Must the array wait for both to complete? Yes: the data is not redundant until both copies are on disk. Does the host have to wait for both? No, just for the NVRAM commit.
The controller uses its cache for reads, and NVRAM for fast commit; data is then moved from NVRAM to the disks.
What if NVRAM is full? Block until NVRAM has been flushed to disk, then write to NVRAM.
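
A minimal sketch of the front-end write path implied above: the host write is acknowledged as soon as the data is in NVRAM, and the controller later writes both mirror copies in the background; if NVRAM is full, the incoming write blocks until it has been drained. Everything here is an illustrative assumption about structure, not the controller's real code:

    class WriteFrontEnd:
        def __init__(self, nvram_capacity: int, write_both_mirrors):
            self.nvram = {}                                # pending writes: RB -> data
            self.nvram_capacity = nvram_capacity
            self.write_both_mirrors = write_both_mirrors   # hypothetical 2-copy disk write

        def host_write(self, rb: int, data: bytes) -> str:
            if len(self.nvram) >= self.nvram_capacity:
                self.flush()                         # block until NVRAM is drained to disk
            self.nvram[rb] = data
            return "ack"                             # host sees completion at NVRAM commit time

        def flush(self):
            for rb, data in list(self.nvram.items()):
                self.write_both_mirrors(rb, data)    # stable only after both copies land on disk
                del self.nvram[rb]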

Issues
What happens in the background? (1) Compaction, (2) migration, (3) balancing.
Compaction: clean RAID 5 and plug holes in the mirrored disks.
Do the mirrored disks get cleaned? Yes, when a PEG is needed for RAID 5: pick a disk with lots of holes and move its used RBs to other disks; the resulting empty PEG is then usable by RAID 5. What if there aren't enough holes? Write the excess RBs to RAID 5, then reclaim the PEG.
Migration: which RBs to demote? The least-recently-written (note: not LRU).
Balancing: make sure data is spread evenly across the disks (most important when a new disk is added).
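
A minimal sketch of the mirrored-class cleaning step: pick the PEG with the most holes, move its live RBs into holes elsewhere, and hand the now-empty PEG over to RAID 5; if there are not enough holes, demote the excess RBs to the RAID 5 log. The data structures are illustrative:

    def clean_one_mirrored_peg(pegs, raid5_log):
        """Free one mirrored PEG so it can be reassigned to the RAID 5 storage class.

        `pegs` maps peg_id -> {"live": [RB data], "holes": int}; illustrative only.
        """
        # The PEG with the most holes has the fewest live RBs to move.
        victim_id = max(pegs, key=lambda p: pegs[p]["holes"])
        victim = pegs.pop(victim_id)

        for rb in victim["live"]:
            target = next((p for p in pegs.values() if p["holes"] > 0), None)
            if target is not None:
                target["live"].append(rb)    # plug a hole in another mirrored PEG
                target["holes"] -= 1
            else:
                raid5_log.append(rb)         # not enough holes: demote the excess RB
        return victim_id                     # the now-empty PEG, usable by RAID 5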