Flexible Disk Use In OpenZFS Matt Ahrens mahrens@delphix.com
13 Years Ago...
ZFS original goals ⛳ End the suffering of system administrators Everything works well together No sharp edges Resilient configuration
Scorecard: ✅ Filesystem as the administrative control point Inheritable properties Quotas & Reservations (x3) Add storage Add and remove special devices (log & cache) Storage portability Between releases & OSes
Scorecard: ❌ Some properties only apply to new data Compression, recordsize, dedup But at least you can send/recv Can’t reconfigure primary storage Stuck if you add the wrong disks Stuck with existing RAID-Z config
Scorecard: 📈 🆕 Device Removal 🔜 RAIDZ Expansion Stuck if you add the wrong disks? 🆕 Device Removal Stuck with existing RAID-Z config? 🔜 RAIDZ Expansion
🆕 Device Removal
🆕 Problems Solved Over-provisioned, e.g. pool only 50% full Added wrong disk / wrong vdev type, e.g. meant to add it as a log Migrate to a different device size/count, e.g. 10x 1TB drives ➠ 4x 6TB drives
🆕 Device Removal StoragePool mirror-0 mirror-1 mirror-2
🆕 How it works Find allocated space to copy By spacemap, not by block pointer Fast discovery of data to copy Sequential reads No checksum verification
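A minimal sketch of that idea (not the actual OpenZFS code; the segment type and copy_segment() helper are hypothetical): the copier walks the removing vdev's allocated segments as recorded by its spacemaps, so it never traverses block pointers, and it reads each segment sequentially without verifying checksums.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical allocated segment, as a spacemap would describe it. */
struct seg { uint64_t offset, size; };

/* Stand-in for the sequential read of the old location and write to
 * a fresh allocation; no checksum verification happens here. */
static void copy_segment(uint64_t old_offset, uint64_t size) {
	printf("copy %llu bytes at offset %llu\n",
	    (unsigned long long)size, (unsigned long long)old_offset);
}

/* Evacuate everything the spacemaps say is allocated on the removing
 * vdev; segments arrive in offset order, so reads are sequential. */
static void evacuate(const struct seg *segs, int nsegs) {
	for (int i = 0; i < nsegs; i++)
		copy_segment(segs[i].offset, segs[i].size);
}

int main(void) {
	struct seg segs[] = { { 0, 1 << 20 }, { 5 << 20, 3 << 20 } };
	evacuate(segs, 2);
	return 0;
}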
🆕 How it works: mapping Old block pointers ➠ removed vdev Map <old offset, size> ➠ <new vdev, offset> Always in memory Read when pool opened
🆕 How it works: mapping Always in memory? Can be as bad as 1GB RAM / 1TB disk! Track / remove obsolete entries over time >10x improvement coming: 100MB RAM / 1TB disk or better
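To make the memory math concrete, here is a hypothetical in-core layout for the mapping (not the real vdev indirect-mapping structures); the 32-byte entry and the 32KB average segment size are assumptions chosen to reproduce the worst case quoted above.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-core entry: <old offset, size> -> <new vdev, new offset>. */
struct indirect_entry {
	uint64_t old_offset;
	uint64_t size;
	uint64_t new_vdev;
	uint64_t new_offset;
};	/* 32 bytes per entry */

int main(void) {
	/*
	 * If evacuating 1TB produces one entry per 32KB copied (a
	 * pessimistic, made-up average), the table needs about
	 * (1TB / 32KB) * 32 bytes = 1GB of RAM, i.e. the
	 * "1GB RAM / 1TB disk" worst case.  Copying bigger chunks
	 * ("map big chunks") or dropping obsolete entries shrinks it.
	 */
	uint64_t copied = 1ULL << 40;		/* 1TB evacuated */
	uint64_t avg_segment = 32 * 1024;	/* assumed average segment */
	uint64_t entries = copied / avg_segment;

	printf("%llu entries, ~%llu MB of mappings\n",
	    (unsigned long long)entries,
	    (unsigned long long)((entries * sizeof (struct indirect_entry)) >> 20));
	return 0;
}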
🆕 How it works: frees Frees from the removing vdev Yet to copy? Free from the old location Already copied? Free from the old and new locations In the middle of copying? Free from the old now, and the new at the end of the txg
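A rough illustration of those three cases; the helpers and progress markers below are hypothetical, not the real frees path.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for returning space to the allocator. */
static void free_old(uint64_t off, uint64_t sz) {
	printf("free old location %llu+%llu\n",
	    (unsigned long long)off, (unsigned long long)sz);
}
static void free_new(uint64_t off, uint64_t sz) {
	printf("free new copy of %llu+%llu\n",
	    (unsigned long long)off, (unsigned long long)sz);
}
static void free_new_at_txg_end(uint64_t off, uint64_t sz) {
	printf("free new copy of %llu+%llu at end of txg\n",
	    (unsigned long long)off, (unsigned long long)sz);
}

/*
 * Free a segment on the vdev being removed, relative to copy progress:
 *   not copied yet        -> free only the old location (copier skips it)
 *   already copied        -> free both old and new locations
 *   being copied this txg -> free old now, new once the copy lands
 */
static void free_on_removing_vdev(uint64_t off, uint64_t sz,
    uint64_t copied_up_to, uint64_t copying_up_to) {
	if (off >= copying_up_to) {
		free_old(off, sz);
	} else if (off + sz <= copied_up_to) {
		free_old(off, sz);
		free_new(off, sz);
	} else {
		free_old(off, sz);
		free_new_at_txg_end(off, sz);
	}
}

int main(void) {
	uint64_t copied_up_to = 100 << 20;	/* copied so far */
	uint64_t copying_up_to = 110 << 20;	/* in flight this txg */

	free_on_removing_vdev(50 << 20, 1 << 20, copied_up_to, copying_up_to);
	free_on_removing_vdev(105 << 20, 1 << 20, copied_up_to, copying_up_to);
	free_on_removing_vdev(200 << 20, 1 << 20, copied_up_to, copying_up_to);
	return 0;
}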
🆕 How it works: split blocks
🆕 One big caveat Doesn't work with RAID-Z (works with plain disks and mirrors) Devices must also have the same ashift
🆕 Operating Dev Removal
Check memory usage before initiating removal

  $ zpool list -v
  NAME       SIZE  ALLOC   FREE  FRAG  CAP
  tank      2.25T   884G  1.37T   62%  39%
    c2t1d0   750G   299G   451G   63%  39%
    c2t2d0   750G   309G   441G   66%  41%
    c2t3d0   750G   276G   474G   59%  36%

  $ zpool remove -n tank c2t1d0
  Memory that will be used after removing c2t1d0: 57.5M

  $ zpool remove tank c2t1d0
🆕 Operating Dev Removal
Check status

  $ zpool status tank
  remove: Evacuation of c2t1d0 in progress since Apr 16 19:21:17
          103G copied out of 299G at 94.8M/s, 34.36% done, 0h30m to go
  config:

          NAME        STATE     READ WRITE CKSUM
          tank        ONLINE       0     0     0
            c2t1d0    ONLINE       0     0     0
            c2t2d0    ONLINE       0     0     0
            c2t3d0    ONLINE       0     0     0
🆕 Operating Dev Removal
Removal can be canceled

  $ zpool remove -s tank
  $ zpool status
  remove: Removal of c2t1d0 canceled on Apr 16 19:25:15 2018
  config:

          NAME        STATE     READ WRITE CKSUM
          tank        ONLINE       0     0     0
            c2t1d0    ONLINE       0     0     0
            c2t2d0    ONLINE       0     0     0
            c2t3d0    ONLINE       0     0     0
🆕 Operating Dev Removal Power loss / reboot during removal? Picks up where it left off Other operations during removal? Everything* works (* except zpool checkpoint)
🆕 Operating Dev Removal
Status after removal

  $ zpool status tank
  remove: Removal of vdev 0 copied 299G in 1h20m, completed on Mon Apr 16 20:36:40 2018
          25.6M memory used for removed device mappings
  config:

          NAME        STATE     READ WRITE CKSUM
          tank        ONLINE       0     0     0
            c2t2d0    ONLINE       0     0     0
            c2t3d0    ONLINE       0     0     0
🆕 Operating Dev Removal Performance impact during removal Copy all data Reads are sequential zfs_vdev_removal_max_active=2 Can take a while; duration scales with fragmentation
🆕 Operating Dev Removal Performance impact after removal Memory use Check with zpool status Negligible other impact Read mapping when pool opened Look up mapping when read - O(log(n)) Track obsolete mappings
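The O(log n) lookup can be pictured as a binary search over entries kept sorted by old offset; a sketch only, reusing the hypothetical entry layout from the earlier example rather than the real indirect-mapping code.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry, sorted by old_offset. */
struct indirect_entry {
	uint64_t old_offset, size;
	uint64_t new_vdev, new_offset;
};

/*
 * Find the entry covering 'offset'.  With n entries this costs
 * O(log n) per read that targets the removed vdev; reads of other
 * vdevs never consult the mapping at all.
 */
const struct indirect_entry *
lookup_mapping(const struct indirect_entry *map, size_t n, uint64_t offset)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (offset < map[mid].old_offset)
			hi = mid;
		else if (offset >= map[mid].old_offset + map[mid].size)
			lo = mid + 1;
		else
			return (&map[mid]);
	}
	return (NULL);	/* never allocated on the removed vdev */
}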
🆕 Operating Dev Removal Redundancy impact Preserves all copies of data (mirror) Silent damage detected/corrected Even for split blocks However: checksums not verified Transient errors become permanent
🆕 Dev Removal Status Thanks to additional developers Alex Reece, Serapheim Dimitropoulos, Prashanth Sreenivasa Brian Behlendorf, Tim Chase Used in production at Delphix since 2015 In illumos since Jan 2018 In ZoL since April 2018 (will be in 0.8) “Map big chunks” out for review
🔜 RAIDZ Expansion
🔜 Problem Solved You have a RAID-Z pool with 4 disks in it You want to add a 5th disk StoragePool raidz1
🔜 How it works Rewrite everything with new physical stripe width Increases total usable space Parity/data relationship not changed
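One way to picture the reflow, as a toy model rather than the actual on-disk math: treat the RAID-Z vdev as a linear array of sectors in which sector i lives on disk i % width at row i / width. Expansion re-packs the same linear order at the larger width, which is why block pointers do not change and the data/parity bytes inside each block are untouched.

#include <stdint.h>
#include <stdio.h>

/* Toy placement model: where does logical sector i live on a
 * width-disk raidz?  (Ignores labels, byte offsets, etc.) */
struct loc { uint64_t disk, row; };

static struct loc place(uint64_t i, uint64_t width) {
	struct loc l = { i % width, i / width };
	return l;
}

int main(void) {
	const uint64_t old_width = 4, new_width = 5;

	/* Reflow: each allocated sector keeps its logical position but
	 * moves from its old physical location to its new one. */
	for (uint64_t i = 0; i < 12; i++) {
		struct loc from = place(i, old_width);
		struct loc to = place(i, new_width);
		printf("sector %2llu: disk %llu row %llu -> disk %llu row %llu\n",
		    (unsigned long long)i,
		    (unsigned long long)from.disk, (unsigned long long)from.row,
		    (unsigned long long)to.disk, (unsigned long long)to.row);
	}
	return 0;
}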
How could traditional RAID 4/5/6 do it? [Diagram: rewrite every stripe in place at the new width: 4-wide stripes (1 2 3 P1-3, 4 5 6 P4-6, ...) become 5-wide stripes (1 2 3 4 P1-4, 5 6 7 8 P5-8, ...); color indicates parity group (stripe)]
RAID-Z Expansion: Reflow [Diagram: the same logical stripes, including their parity sectors (1p, 5p, 7p, 11p, 13p, 17p), re-packed from the 4-disk layout onto the 5-disk layout; color indicates parity group (logical stripe)]
RAID-Z Expansion: Reflow copies allocated data [Diagram: only the allocated sectors are copied onto the 5-disk layout; color indicates parity group (logical stripe)]
🔜 How it works Find allocated space to reflow By spacemap, not by block pointer Fast discovery of data to reflow Sequential reads and writes No checksum verification
🔜 How it works Each logical stripe is independent Don’t need to know where parity is Segments still on different disks, so redundancy is preserved (contraction couldn’t work this way)
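Continuing the toy model above, a quick check of the redundancy claim: any parity group written before the expansion is a run of at most old_width consecutive logical sectors, and such a run still lands on distinct disks after re-packing at the larger width, so losing one disk never costs more than one sector per group. Again an illustration under assumed placement rules, not the real code.

#include <stdint.h>
#include <stdio.h>

int main(void) {
	const uint64_t old_width = 4, new_width = 5, nsectors = 1000;

	/* For every run of old_width consecutive logical sectors (one
	 * pre-expansion parity group), confirm the sectors map to
	 * old_width distinct disks under the new width. */
	for (uint64_t start = 0; start + old_width <= nsectors; start++) {
		int seen[8] = { 0 };

		for (uint64_t i = start; i < start + old_width; i++) {
			uint64_t disk = i % new_width;	/* toy placement */
			if (seen[disk]++) {
				printf("collision in group at %llu\n",
				    (unsigned long long)start);
				return 1;
			}
		}
	}
	printf("every %llu-sector group stays on distinct disks\n",
	    (unsigned long long)old_width);
	return 0;
}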
🔜 Operating RAIDZ Expand Initiate expansion: $ zpool attach tank raidz1-0 c2t1d0 Works with RAIDZ1/2/3 Can expand multiple times Will work in the background; monitor with zpool status
🔜 Operating RAIDZ Expand RAIDZ must be healthy during reflow Silent damage OK (up to parity limit) Must be able to write to all drives No missing devices If disk dies, reflow will pause
🔜 Operating RAIDZ Expand Performance impact during reflow Copy all data Reads and writes are sequential
🔜 Operating RAIDZ Expand Performance impact after reflow Small CPU overhead when reading One additional bcopy() to sequentialize Combinatorial reconstruction (old + new) choose P
🔜 Operating RAIDZ Expand Redundancy impact Preserves data and parity However: checksums not verified Transient errors become permanent
RAID-Z Expansion: new writes, new stripe width [Diagram: blocks written after the expansion use the full 5-wide stripe width, while previously written blocks keep their original layout; color indicates parity group (logical stripe)]
🔜 RAIDZ Expand Status Sponsored by the FreeBSD Foundation Design complete Pre-alpha preview code (you will lose data!): reflow in one txg, https://reviews.freebsd.org/D15124 Upcoming talk at BSDCan (June 2018)
Future Work Map big chunks: 1/10th the RAM for DevRm (review out) Queue up multiple removals: don't allocate from queued removals (prototype needs work) Remove a RAIDZ group? Requires the other vdevs to be RAIDZ of >= width
OpenZFS Developer Summit 6th annual OZDS! Week of September 10, 2018 Talk proposals due July 16
🆕 Device Removal 🔜 RAIDZ Expansion Matt Ahrens mahrens@delphix.com
Reflow end state [Diagram: all sectors re-packed onto the wider layout, with the reclaimed capacity free at the end]
Reflow initial state [Diagram: sectors laid out across the original disks; the newly added disk is still empty]
Possible intermediary state [Diagram: some sectors already re-packed onto the wider layout, the rest still at their original locations]
Possible intermediary state: disk failure [Diagram: the same intermediary state with one disk failed]
Reflow progress = 30: what if we lose a disk? Read stripe @29-32
[Diagram: sectors 1-32, with everything up to sector 30 already copied to the new layout]
If we read 29,30 from the new location and 31,32 from the original location: data loss :-(
Instead, read the "split stripe" entirely from its original location
Same as if we hadn't started moving this stripe
Need to ensure the original location of the split stripe hasn't been overwritten
Each stripe of a block is considered independently
Only the split stripe is read from the original location (not the whole block)
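A sketch of the read-side decision just described, in the same toy terms (hypothetical, not the real reconstruction code): a parity group entirely behind the reflow point reads from the new layout, and any group that is untouched or split by the reflow point is read entirely from its original location, which the reflow keeps intact until it is no longer needed.

#include <stdint.h>

enum source { READ_NEW_LAYOUT, READ_OLD_LAYOUT };

/*
 * Decide where to read one parity group - a run of consecutive
 * logical sectors [start, start+len) - while 'progress' sectors have
 * been copied to the new layout.  Each group of a block is decided
 * independently, so only the split group falls back to the original
 * location, not the whole block.
 */
enum source
choose_read_source(uint64_t start, uint64_t len, uint64_t progress)
{
	if (start + len <= progress)
		return (READ_NEW_LAYOUT);	/* fully re-packed */

	/*
	 * Untouched or split by the reflow point: read it all from the
	 * original location, exactly as if we had not started moving
	 * this group.  The reflow only overwrites old-layout space
	 * whose contents are no longer needed, so these sectors are
	 * still valid.
	 */
	return (READ_OLD_LAYOUT);
}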
Read at least 5*5=25 sectors into scratch object [Diagram: the first 25 sectors copied into a scratch object before the in-place reflow begins]
Reflow progress = 25; separation = 6; chunk size = 1 [Diagram: old and new layouts at this point in the reflow]
Reflow progress = 26; separation = 6; chunk size = 1 [Diagram]
Reflow progress = 27; separation = 6; chunk size = 1 [Diagram]
Reflow progress = 28; separation = 7; chunk size = 2 [Diagram]
Reflow progress = 30; separation = 7; chunk size = 2 [Diagram]
Reflow process Need to track progress to know the physical stripe width Each TXG can only overwrite what's previously unused Exponential increase in the amount that can be copied per TXG E.g. initial scratch object of 1MB: 72 TXGs for 1GB, 145 TXGs for 1TB, 217 TXGs for 1PB
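A toy simulation of that pacing (an assumption-laden sketch, not the real algorithm): each TXG may only overwrite old-layout space that earlier TXGs have already vacated, so the amount copyable per TXG grows by a roughly constant fraction of the running total. The 10% growth rate below is chosen simply because it lands in the same range as the 72/145/217 figures above; it is not taken from the design.

#include <stdio.h>

/*
 * Toy model of reflow pacing: start with a 1MB scratch object and let
 * the copied total compound by 'growth' per TXG, since every TXG's
 * copying vacates more old-layout space for the next one to overwrite.
 */
static unsigned txgs_to_copy(double target_bytes, double growth) {
	double copied = 1024.0 * 1024.0;	/* initial scratch object */
	unsigned txgs = 1;

	while (copied < target_bytes) {
		copied += copied * growth;
		txgs++;
	}
	return txgs;
}

int main(void) {
	const double GB = 1024.0 * 1024.0 * 1024.0;

	printf("1GB: %u TXGs\n", txgs_to_copy(GB, 0.10));
	printf("1TB: %u TXGs\n", txgs_to_copy(1024.0 * GB, 0.10));
	printf("1PB: %u TXGs\n", txgs_to_copy(1024.0 * 1024.0 * GB, 0.10));
	return 0;
}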
Design implications Works with RAIDZ-1/2/3 Can expand multiple times (4-wide -> 5-wide -> 6-wide) Old data keeps the old data:parity ratio; new data gets the new data:parity ratio RAIDZ must be healthy (no missing devices) during reflow If a disk dies, reflow will pause and wait for it to be reconstructed Reflow works in the background
Thank you!
Status High-level design complete Detailed design 80% complete Zero lines of code written Expect significant updates at BSDCan (June 2018) and next year's DevSummit (~October 2018)