Flexible Disk Use In OpenZFS Matt Ahrens mahrens@delphix.com

13 Years Ago...

ZFS original goals ⛳: End the suffering of system administrators. Everything works well together. No sharp edges. Resilient configuration.

Scorecard: ✅ Filesystem as the administrative control point. Inheritable properties. Quotas & reservations (x3). Add storage. Add and remove special devices (log & cache). Storage portability between releases & OSes.

Scorecard: ❌ Some properties only apply to new data (compression, recordsize, dedup), but at least you can send/recv. Can't reconfigure primary storage: stuck if you add the wrong disks; stuck with the existing RAID-Z config.

Scorecard: 📈 Stuck if you add the wrong disks? 🆕 Device Removal. Stuck with the existing RAID-Z config? 🔜 RAIDZ Expansion.

🆕 Device Removal

🆕 Problems solved: Over-provisioning, e.g. pool only 50% full. Added the wrong disk / wrong vdev type, e.g. meant to add it as a log device. Migrate to a different device size/count, e.g. 10x 1TB drives ➠ 4x 6TB drives.

🆕 Device Removal [Diagram: StoragePool containing mirror-0, mirror-1, mirror-2]

🆕 How it works: Find allocated space to copy by spacemap, not by block pointer. Fast discovery of data to copy. Sequential reads. No checksum verification.
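As a rough, hypothetical sketch (none of these names are real OpenZFS APIs) of why spacemap-driven discovery is fast: the allocated space on the removing vdev is already summarized as sorted <offset, length> ranges, so the copy pass can simply walk those ranges and issue large sequential reads, rather than traversing every block pointer in the pool.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical allocated-range record, as a spacemap would summarize it. */
typedef struct alloc_range {
    uint64_t ar_offset; /* byte offset on the removing vdev */
    uint64_t ar_length; /* contiguous allocated bytes */
} alloc_range_t;

/*
 * Walk the allocated ranges in offset order, splitting each into large
 * sequential I/Os.  The real removal code copies each segment to another
 * vdev and records a mapping entry; here we only print the copy plan.
 */
static void
plan_copy(const alloc_range_t *ranges, int nranges, uint64_t max_io)
{
    for (int i = 0; i < nranges; i++) {
        uint64_t off = ranges[i].ar_offset;
        uint64_t left = ranges[i].ar_length;

        while (left > 0) {
            uint64_t len = left < max_io ? left : max_io;
            printf("sequential read: off=%llu len=%llu\n",
                (unsigned long long)off, (unsigned long long)len);
            off += len;
            left -= len;
        }
    }
}

int
main(void)
{
    alloc_range_t ranges[] = {
        { 0, 3 << 20 }, { 16 << 20, 1 << 20 }, { 64 << 20, 2 << 20 },
    };

    plan_copy(ranges, 3, 1 << 20);
    return (0);
}
```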

🆕 How it works: mapping. Old block pointers still point to the removed vdev. Map <old offset, size> ➠ <new vdev, offset>. Always in memory; read when the pool is opened.
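A minimal sketch of such a mapping table, with hypothetical types (the real structure is the vdev indirect mapping, which differs in detail): entries are sorted by old offset, so translating an offset on the removed vdev to its new location is a binary search plus an offset adjustment, the O(log n) lookup mentioned on a later slide.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mapping entry: <old offset, size> -> <new vdev, new offset>. */
typedef struct map_entry {
    uint64_t me_old_offset; /* offset on the removed vdev */
    uint64_t me_size;       /* length of this mapped segment */
    uint64_t me_new_vdev;   /* vdev the data was copied to */
    uint64_t me_new_offset; /* offset on that vdev */
} map_entry_t;

/*
 * Translate an offset on the removed vdev into its new location.
 * Entries are sorted by me_old_offset and non-overlapping, so a binary
 * search finds the covering entry in O(log n).  Returns 0 on success.
 */
static int
map_lookup(const map_entry_t *map, size_t n, uint64_t old_off,
    uint64_t *new_vdev, uint64_t *new_off)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const map_entry_t *e = &map[mid];

        if (old_off < e->me_old_offset) {
            hi = mid;
        } else if (old_off >= e->me_old_offset + e->me_size) {
            lo = mid + 1;
        } else {
            *new_vdev = e->me_new_vdev;
            *new_off = e->me_new_offset + (old_off - e->me_old_offset);
            return (0);
        }
    }
    return (-1); /* nothing was ever allocated at this offset */
}

int
main(void)
{
    map_entry_t map[] = {
        { .me_old_offset = 0,    .me_size = 4096, .me_new_vdev = 2, .me_new_offset = 8192 },
        { .me_old_offset = 8192, .me_size = 8192, .me_new_vdev = 1, .me_new_offset = 0 },
    };
    uint64_t vdev, off;

    if (map_lookup(map, 2, 12288, &vdev, &off) == 0)
        printf("old 12288 -> vdev %llu offset %llu\n",
            (unsigned long long)vdev, (unsigned long long)off);
    return (0);
}
```

Fewer, larger entries ("map big chunks" on the status slide) shrink this table, which is where the memory improvements mentioned on the next slide would come from.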

🆕 How it works: mapping. Always in memory? Can be as bad as 1GB RAM per 1TB of disk! Track / remove obsolete entries over time. >10x improvement coming: 100MB RAM per 1TB of disk, or better.

🆕 How it works: frees. Frees from the removing vdev: Yet to copy? Free from the old location. Already copied? Free from the old and new locations. In the middle of copying? Free from the old location now, and from the new one at the end of the txg.
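A hedged sketch of that decision logic, using hypothetical names and a deliberately simplified notion of progress (the real code tracks this per mapping segment): here the removal copies the vdev roughly in offset order, everything below `rs_copied` has landed at its new location, and `[rs_copied, rs_copying_end)` is in flight in the current txg.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified removal progress. */
typedef struct removal_state {
    uint64_t rs_copied;      /* offsets below this are fully copied */
    uint64_t rs_copying_end; /* [rs_copied, rs_copying_end) is in flight */
} removal_state_t;

/* Stub actions so the sketch compiles and runs; in reality the "new"
 * offset would first be translated through the removal mapping. */
static void
free_old(uint64_t off, uint64_t sz)
{
    printf("free old location      %llu+%llu\n",
        (unsigned long long)off, (unsigned long long)sz);
}
static void
free_new(uint64_t off, uint64_t sz)
{
    printf("free new location      %llu+%llu\n",
        (unsigned long long)off, (unsigned long long)sz);
}
static void
free_new_end_of_txg(uint64_t off, uint64_t sz)
{
    printf("free new at end of txg %llu+%llu\n",
        (unsigned long long)off, (unsigned long long)sz);
}

/*
 * Handle a free of a segment on the removing vdev.  For simplicity the
 * segment is assumed to fall entirely within one of the three zones.
 */
static void
free_from_removing_vdev(const removal_state_t *rs, uint64_t off, uint64_t sz)
{
    if (off + sz <= rs->rs_copied) {
        /* Already copied: the data exists in both places. */
        free_old(off, sz);
        free_new(off, sz);
    } else if (off >= rs->rs_copying_end) {
        /* Yet to copy: only the old location exists (the real code
         * also drops the segment from the set left to copy). */
        free_old(off, sz);
    } else {
        /* Copy in flight: free old now, new once this txg syncs. */
        free_old(off, sz);
        free_new_end_of_txg(off, sz);
    }
}

int
main(void)
{
    removal_state_t rs = { .rs_copied = 1 << 20, .rs_copying_end = 2 << 20 };

    free_from_removing_vdev(&rs, 4096, 4096);              /* already copied */
    free_from_removing_vdev(&rs, 3 << 20, 4096);           /* yet to copy */
    free_from_removing_vdev(&rs, (1 << 20) + 4096, 4096);  /* in flight */
    return (0);
}
```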

🆕 How it works: split blocks

🆕 One big caveat: Doesn't work with RAID-Z (works with plain disks and mirrors). Devices must also have the same ashift.

🆕 Operating Dev Removal: Check memory usage before initiating removal.
$ zpool list -v
NAME       SIZE  ALLOC   FREE  FRAG  CAP
test      2.25T   884G  1.37T   62%  39%
  c2t1d0   750G   299G   451G   63%  39%
  c2t2d0   750G   309G   441G   66%  41%
  c2t3d0   750G   276G   474G   59%  36%
$ zpool remove -n tank c2t1d0
Memory that will be used after removing c2t1d0: 57.5M
$ zpool remove tank c2t1d0

🆕 Operating Dev Removal: Check status.
$ zpool status tank
  remove: Evacuation of c2t1d0 in progress since Apr 16 19:21:17
          103G copied out of 299G at 94.8M/s, 34.36% done, 0h30m to go
  config:
          NAME        STATE   READ WRITE CKSUM
          test        ONLINE     0     0     0
            c2t1d0    ONLINE     0     0     0
            c2t2d0    ONLINE     0     0     0
            c2t3d0    ONLINE     0     0     0

🆕 Operating Dev Removal: Removal can be canceled.
$ zpool remove -s tank
$ zpool status
  remove: Removal of c2t1d0 canceled on Apr 16 19:25:15 2018
  config:
          NAME        STATE   READ WRITE CKSUM
          test        ONLINE     0     0     0
            c2t1d0    ONLINE     0     0     0
            c2t2d0    ONLINE     0     0     0
            c2t3d0    ONLINE     0     0     0

🆕 Operating Dev Removal: Power loss / reboot during removal? Picks up where it left off. Other operations during removal? Everything* works (* except zpool checkpoint).

🆕 Operating Dev Removal: Status after removal.
$ zpool status tank
  remove: Removal of vdev 0 copied 299G in 1h20m, completed on Mon Apr 16 20:36:40 2018
          25.6M memory used for removed device mappings
  config:
          NAME        STATE   READ WRITE CKSUM
          test        ONLINE     0     0     0
            c2t2d0    ONLINE     0     0     0
            c2t3d0    ONLINE     0     0     0

🆕 Operating Dev Removal: Performance impact during removal. Copies all data; reads are sequential; zfs_vdev_removal_max_active=2. Can take a while, depending on fragmentation.

🆕 Operating Dev Removal: Performance impact after removal. Memory use (check with zpool status). Negligible other impact: read the mapping when the pool is opened; look up the mapping on each read, O(log n); track obsolete mappings.

🆕 Operating Dev Removal: Redundancy impact. Preserves all copies of data (mirror); silent damage is detected/corrected, even for split blocks. However, checksums are not verified during the copy, so transient errors become permanent.

🆕 Dev Removal Status: Thanks to additional developers: Alex Reece, Serapheim Dimitropoulos, Prashanth Sreenivasa, Brian Behlendorf, Tim Chase. Used in production at Delphix since 2015. In illumos since Jan 2018. In ZoL since April 2018 (will be in 0.8). "Map big chunks" out for review.

🔜 RAIDZ Expansion

🔜 Problem Solved: You have a RAID-Z pool with 4 disks in it, and you want to add a 5th disk. [Diagram: StoragePool with one raidz1 vdev]

🔜 How it works: Rewrite everything with the new physical stripe width. Increases total usable space. The parity/data relationship is not changed.

How could traditional RAID 4/5/6 do it? Color indicates parity group (stripe); here, each row is one stripe.
Before (4 disks):          After (5 disks):
   1   2   3   P1-3          1   2   3   4   P1-4
   4   5   6   P4-6          5   6   7   8   P5-8
   7   8   9   P7-9          9  10  11  12   P9-12
  10  11  12   P10-12       13  14  15  16   P13-16
  13  14  15   P13-15       17  18  19  20   P17-20

RAID-Z Expansion: Reflow. [Diagram: the same sequence of data and parity (p) sectors laid out across 4 disks, then re-laid sequentially across 5 disks; color indicates parity group (logical stripe).]

RAID-Z Expansion: Reflow copies allocated data. [Diagram: only the allocated sectors are copied from the 4-disk layout into the 5-disk layout; color indicates parity group (logical stripe).]
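The reflow shown in the two diagrams above is essentially a re-lay of the same sector sequence row-major across more disks. A minimal sketch of that index arithmetic (a hypothetical helper, not the actual reflow code): sector i of the vdev lives at column i % width, row i / width, so moving to the wider layout is a pure index transformation, applied to data and parity sectors alike, which is why the parity/data relationship of existing blocks is unchanged.

```c
#include <stdint.h>
#include <stdio.h>

/* Physical position of one sector within a RAID-Z vdev. */
typedef struct sector_loc {
    uint64_t sl_disk; /* child disk index (column) */
    uint64_t sl_row;  /* offset on that disk (row)  */
} sector_loc_t;

/*
 * Reflow remapping: sectors are laid out row-major across the children,
 * so the i-th sector of the vdev keeps its linear index i but lands at
 * a new (disk, row) once the vdev is new_width children wide.
 */
static sector_loc_t
reflow_remap(sector_loc_t old, uint64_t old_width, uint64_t new_width)
{
    uint64_t i = old.sl_row * old_width + old.sl_disk;
    sector_loc_t nloc = { .sl_disk = i % new_width, .sl_row = i / new_width };

    return (nloc);
}

int
main(void)
{
    /* Expand a 4-wide raidz1 to 5-wide: remap the first three rows. */
    for (uint64_t row = 0; row < 3; row++) {
        for (uint64_t disk = 0; disk < 4; disk++) {
            sector_loc_t o = { .sl_disk = disk, .sl_row = row };
            sector_loc_t n = reflow_remap(o, 4, 5);

            printf("(disk %llu, row %llu) -> (disk %llu, row %llu)\n",
                (unsigned long long)o.sl_disk, (unsigned long long)o.sl_row,
                (unsigned long long)n.sl_disk, (unsigned long long)n.sl_row);
        }
    }
    return (0);
}
```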

🔜 How it works: Find allocated space to reflow by spacemap, not by block pointer. Fast discovery of data to reflow. Sequential reads and writes. No checksum verification.

🔜 How it works: Each logical stripe is independent; we don't need to know where the parity is. Segments are still on different disks, so redundancy is preserved. (Contraction couldn't work this way.)

🔜 Operating RAIDZ Expand: Initiate expansion.
$ zpool attach tank raidz1-0 c2t1d0
Works with RAIDZ1/2/3. Can expand multiple times. Will work in the background; check progress with zpool status.

🔜 Operating RAIDZ Expand: RAIDZ must be healthy during the reflow. Silent damage is OK (up to the parity limit), but it must be able to write to all drives: no missing devices. If a disk dies, the reflow will pause.

🔜 Operating RAIDZ Expand: Performance impact during the reflow. Copies all data; reads and writes are sequential.

🔜 Operating RAIDZ Expand: Performance impact after the reflow. Small CPU overhead when reading: one additional bcopy() to sequentialize. Combinatorial reconstruction, (old + new) choose P.

🔜 Operating RAIDZ Expand: Redundancy impact. Preserves data and parity. However, checksums are not verified during the reflow, so transient errors become permanent.

RAID-Z Expansion: new writes use the new stripe width. [Diagram: sectors written after the expansion are laid out across all 5 disks; color indicates parity group (logical stripe).]

🔜 RAIDZ Expand Status: Sponsored by the FreeBSD Foundation. Design complete. Pre-alpha preview code (you will lose data!): reflow in one txg, https://reviews.freebsd.org/D15124. Upcoming talk at BSDCan (June 2018).

Future Work: Map big chunks (1/10th the RAM for device removal); review out. Queue up multiple removals: don't allocate from queued removals; prototype needs work. Remove a RAIDZ group? Requires the other vdevs to be RAIDZ of equal or greater width.

OpenZFS Developer Summit: 6th annual OZDS! Week of September 10, 2018. Talk proposals due July 16.

🆕 Device Removal 🔜 RAIDZ Expansion Matt Ahrens mahrens@delphix.com

Reflow end state. [Diagram: all sectors now occupy the 5-wide layout.]

Reflow initial state. [Diagram: sectors 1-32 still in their original 4-wide positions; the added disk is empty.]

Possible intermediary state. [Diagram: some sectors already moved to the 5-wide layout, the rest still at their original 4-wide positions.]

Possible intermediary state: disk failure. [Diagram: the same intermediary state with one disk failed.]

Reflow progress = 30. What if we lose a disk? Read the stripe @29-32. [Diagram: sectors 1-32 with the reflow boundary at sector 30.] If we read 29,30 from the new location and 31,32 from the original location: data loss :-( Instead, read the "split stripe" from its original location, the same as if we hadn't started moving this stripe. Need to ensure the original location of the split stripe hasn't been overwritten. Each stripe of a block is considered independently; only the split stripe is read from the original location (not the whole block).

Read at least 5*5 = 25 sectors into the scratch object. [Diagram: old and new sector layouts during this step.]

Reflow progress = 25; separation = 6; chunk size = 1. [Diagram: old and new sector layouts at this point.]

Reflow progress = 26; separation = 6; chunk size = 1. [Diagram]

Reflow progress = 27; separation = 6; chunk size = 1. [Diagram]

Reflow progress = 28; separation = 7; chunk size = 2. [Diagram]

Reflow progress = 30; separation = 7; chunk size = 2. [Diagram]

Reflow progress = 30. What if we lose a disk? Read the stripe @29-32. [Diagram: sectors 1-32 with the reflow boundary at sector 30.] If we read 29,30 from the new location and 31,32 from the original location: data loss :-( Instead, read the "split stripe" from its original location, the same as if we hadn't started moving this stripe. Need to ensure the original location of the split stripe hasn't been overwritten. Each stripe of a block is considered independently; only the split stripe is read from the original location (not the whole block).
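A sketch of the read-side rule just described, with hypothetical names: `reflow_progress` is the linear sector index up to which data has been moved into the wider layout; a logical stripe entirely below it is read from the new layout, while a stripe the reflow has not fully passed (including a split stripe) is read from its original location, which the copy scheme above guarantees hasn't been overwritten yet.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { READ_OLD_LAYOUT, READ_NEW_LAYOUT } read_src_t;

/*
 * Decide where to read one logical stripe of a block from, given the
 * reflow progress (linear sector index already copied to the new,
 * wider layout).
 *
 *  - stripe entirely below progress: fully reflowed, read new layout
 *  - stripe at/above progress, or split by it: read the whole stripe
 *    from the old layout, exactly as if the reflow had not reached it
 */
static read_src_t
stripe_read_source(uint64_t first_sector, uint64_t nsectors,
    uint64_t reflow_progress)
{
    uint64_t end = first_sector + nsectors; /* exclusive */

    if (end <= reflow_progress)
        return (READ_NEW_LAYOUT);
    /* Not yet fully copied, or split: the old layout is authoritative. */
    return (READ_OLD_LAYOUT);
}

int
main(void)
{
    /* Example from the slide: progress = 30, stripe covers sectors 29-32. */
    uint64_t progress = 30;

    printf("stripe 29-32: read %s layout\n",
        stripe_read_source(29, 4, progress) == READ_NEW_LAYOUT ? "new" : "old");
    printf("stripe 17-20: read %s layout\n",
        stripe_read_source(17, 4, progress) == READ_NEW_LAYOUT ? "new" : "old");
    return (0);
}
```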

Reflow process: Need to track progress to know the physical stripe width. Each TXG can only overwrite what's previously unused, so there is an exponential increase in the amount that can be copied per TXG. E.g. with an initial 1MB scratch object: 72 TXGs for 1GB, 145 TXGs for 1TB, 217 TXGs for 1PB.
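A back-of-the-envelope illustration of why the TXG count grows only logarithmically with pool size. The ~10% growth factor below is an assumption inferred from the slide's figures, not taken from the design: if the total amount reflowed grows by a roughly constant factor per TXG, starting from the 1MB scratch object, then the TXG count is about log_1.1(size / 1MB).

```c
#include <stdio.h>

/*
 * Assumed model (not the actual implementation): the reflow can only
 * overwrite space vacated by earlier TXGs, so the total copied grows
 * geometrically -- here by ~10% per TXG -- from a 1MB scratch object.
 */
#define INITIAL_BYTES   (1024.0 * 1024.0)   /* 1MB scratch object */
#define GROWTH          1.10                /* assumed growth per TXG */

static unsigned
txgs_to_reflow(double total_bytes)
{
    double copied = INITIAL_BYTES;
    unsigned txgs = 0;

    while (copied < total_bytes) {
        copied *= GROWTH;
        txgs++;
    }
    return (txgs);
}

int
main(void)
{
    printf("1GB: %u TXGs\n", txgs_to_reflow(1e9));  /* ~72  */
    printf("1TB: %u TXGs\n", txgs_to_reflow(1e12)); /* ~145 */
    printf("1PB: %u TXGs\n", txgs_to_reflow(1e15)); /* ~217 */
    return (0);
}
```

Under this assumed growth rate the sketch reproduces the 72 / 145 / 217 figures on the slide; the point is only that geometric growth in per-TXG copy capacity makes the TXG count logarithmic in the amount of data.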

Design implications: Works with RAIDZ-1/2/3. Can expand multiple times (4-wide -> 5-wide -> 6-wide). Old data keeps the old data:parity ratio; new data gets the new data:parity ratio. RAIDZ must be healthy (no missing devices) during the reflow; if a disk dies, the reflow will pause and wait for it to be reconstructed. The reflow works in the background.

Thank you!

Status: High-level design complete. Detailed design 80% complete. Zero lines of code written. Expect significant updates at BSDCan (June 2018) and next year's DevSummit (~October 2018).