1
Flexible Disk Use In OpenZFS Matt Ahrens
2
13 Years Ago...
3
ZFS original goals ⛳ End the suffering of system administrators
Everything works well together
No sharp edges
Resilient configuration
4
Scorecard: ✅ Filesystem as the administrative control point
Inheritable properties
Quotas & Reservations (x3)
Add storage
Add and remove special devices (log & cache)
Storage portability between releases & OSes
5
Scorecard: ❌ Some properties only apply to new data
Compression, recordsize, dedup (but at least you can send/recv)
Can’t reconfigure primary storage:
Stuck if you add the wrong disks
Stuck with existing RAID-Z config
6
Scorecard: 📈 🆕 Device Removal 🔜 RAIDZ Expansion
Stuck if you add the wrong disks? 🆕 Device Removal
Stuck with existing RAID-Z config? 🔜 RAIDZ Expansion
7
🆕 Device Removal
8
🆕 Problems Solved
Over-provisioned pool, e.g. only 50% full
Added wrong disk / wrong vdev type, e.g. meant to add as log
Migrate to different device size/number, e.g. 10x 1TB drives ➠ 4x 6TB drives
9
🆕 Device Removal
[Diagram: StoragePool with mirror-0, mirror-1, mirror-2 vdevs]
10
🆕 Device Removal
[Diagram: StoragePool with mirror-0, mirror-1, mirror-2 vdevs]
11
🆕 How it works
Find allocated space to copy by spacemap, not by block pointer (sketch below)
Fast discovery of data to copy
Sequential reads
No checksum verification
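A rough sketch of what “by spacemap, not by block pointer” buys, using hypothetical names and types rather than the real OpenZFS structures: the removing vdev’s space maps directly yield its allocated ranges, so removal just walks and copies them in offset order instead of traversing every block pointer in the pool.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical type; the real code works in terms of space maps / range trees. */
typedef struct alloc_range {
	uint64_t ar_offset;	/* start of an allocated segment on the vdev */
	uint64_t ar_size;	/* length of that segment */
} alloc_range_t;

/*
 * Copy everything allocated on the removing vdev.  The ranges come straight
 * from the vdev's space maps, so discovery is fast, reads are sequential,
 * and (as the slide notes) checksums are not verified along the way.
 */
static void
copy_removing_vdev(const alloc_range_t *ranges, size_t nranges,
    void (*copy_segment)(uint64_t offset, uint64_t size))
{
	for (size_t i = 0; i < nranges; i++)
		copy_segment(ranges[i].ar_offset, ranges[i].ar_size);
}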
12
🆕 How it works: mapping
Old block pointers ➠ removed vdev
Map <old offset, size> ➠ <new vdev, offset> (sketch below)
Always in memory
Read when pool opened
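A minimal sketch of such a mapping table, with hypothetical names (the real structure is the indirect vdev’s mapping, which also supports the obsolete-entry tracking mentioned below): entries are sorted by old offset, so translating a read of an old block pointer is a binary search, which is the O(log(n)) lookup cost quoted on the later performance slide.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout: <old offset, size> ➠ <new vdev, new offset>. */
typedef struct indirect_entry {
	uint64_t ie_old_offset;	/* offset on the removed vdev */
	uint64_t ie_size;	/* length of the mapped segment */
	uint64_t ie_new_vdev;	/* vdev the data was copied to */
	uint64_t ie_new_offset;	/* offset on that vdev */
} indirect_entry_t;

/*
 * Entries are kept sorted by ie_old_offset and are all in memory from the
 * time the pool is opened, so a lookup is a plain binary search.
 */
static const indirect_entry_t *
indirect_lookup(const indirect_entry_t *map, size_t n, uint64_t old_offset)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const indirect_entry_t *e = &map[mid];

		if (old_offset < e->ie_old_offset)
			hi = mid;
		else if (old_offset >= e->ie_old_offset + e->ie_size)
			lo = mid + 1;
		else
			return (e);	/* old_offset falls inside this entry */
	}
	return (NULL);	/* not mapped (e.g. the entry has become obsolete) */
}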
13
🆕 How it works: mapping
Always in memory?
Can be as bad as 1GB RAM / 1TB disk!
Track / remove obsolete entries over time
>10x improvement coming: 100MB RAM / 1TB disk or better
14
🆕 How it works: frees
Frees from the removing vdev (sketch below):
Yet to copy? Free from old location
Already copied? Free from old and new locations
In the middle of copying? Free from old, and new at end of txg
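The three cases can be sketched with a single copy cursor and hypothetical helper functions; the real logic tracks the copy segment by segment and translates old ➠ new offsets through the mapping, which is elided here.

#include <stdint.h>

/* Hypothetical removal state: everything below rs_copying_start is already
 * copied; [rs_copying_start, rs_copying_end) is being copied in this txg;
 * everything at or above rs_copying_end is yet to copy. */
typedef struct removal_state {
	uint64_t rs_copying_start;
	uint64_t rs_copying_end;
} removal_state_t;

/* Hypothetical helpers; offset translation via the mapping is elided. */
extern void free_old_location(uint64_t offset, uint64_t size);
extern void free_new_location(uint64_t offset, uint64_t size);
extern void free_new_location_at_txg_end(uint64_t offset, uint64_t size);

static void
free_during_removal(const removal_state_t *rs, uint64_t offset, uint64_t size)
{
	if (offset >= rs->rs_copying_end) {
		/* Yet to copy: only the old location exists. */
		free_old_location(offset, size);
	} else if (offset + size <= rs->rs_copying_start) {
		/* Already copied: free both the old and the new copy. */
		free_old_location(offset, size);
		free_new_location(offset, size);
	} else {
		/* In the middle of copying: free the old location now and
		 * the new location once this txg has synced. */
		free_old_location(offset, size);
		free_new_location_at_txg_end(offset, size);
	}
}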
15
🆕 How it works: split blocks
16
🆕 One big caveat
Doesn’t work with RAID-Z (works with plain disks and mirrors)
Devices also must be same ashift
17
🆕 Operating Dev Removal
Check memory usage before initiating removal:
$ zpool list -v
NAME       SIZE  ALLOC   FREE  FRAG   CAP
test         …T   884G  1.37T   62%   39%
c2t1d0       …G   299G   451G   63%   39%
c2t2d0       …G   309G   441G   66%   41%
c2t3d0       …G   276G   474G   59%   36%
$ zpool remove -n tank c2t1d0
Memory that will be used after removing c2t1d0: 57.5M
$ zpool remove tank c2t1d0
18
🆕 Operating Dev Removal
Check status:
$ zpool status tank
remove: Evacuation of c2t1d0 in progress since Apr 16 19:21:17
        103G copied out of 299G at 94.8M/s, 34.36% done, 0h30m to go
config:
        NAME        STATE     READ WRITE CKSUM
        test        ONLINE
          c2t1d0    ONLINE
          c2t2d0    ONLINE
          c2t3d0    ONLINE
19
🆕 Operating Dev Removal
Removal can be canceled:
$ zpool remove -s tank
$ zpool status
remove: Removal of c2t1d0 canceled on Apr 16 19:25:
config:
        NAME        STATE     READ WRITE CKSUM
        test        ONLINE
          c2t1d0    ONLINE
          c2t2d0    ONLINE
          c2t3d0    ONLINE
20
🆕 Operating Dev Removal
Power loss / reboot during removal? Picks up where it left off
Other operations during removal? Everything* works
(* except zpool checkpoint)
21
🆕 Operating Dev Removal
Status after removal:
$ zpool status tank
remove: Removal of vdev 0 copied 299G in 1h20m, completed on Mon Apr 16 20:36:
        25.6M memory used for removed device mappings
config:
        NAME        STATE     READ WRITE CKSUM
        test        ONLINE
          c2t2d0    ONLINE
          c2t3d0    ONLINE
22
🆕 Operating Dev Removal
Performance impact during removal:
Copies all data
Reads are sequential
zfs_vdev_removal_max_active=2
Can take a while, depending on fragmentation
23
🆕 Operating Dev Removal
Performance impact after removal:
Memory use (check with zpool status)
Negligible other impact:
Read mapping when pool opened
Look up mapping when read: O(log(n))
Track obsolete mappings
24
🆕 Operating Dev Removal
Redundancy impact:
Preserves all copies of data (mirror)
Silent damage detected/corrected, even for split blocks
However: checksums not verified during the copy
Transient errors become permanent
25
🆕 Dev Removal Status
Thanks to additional developers: Alex Reece, Serapheim Dimitropoulos, Prashanth Sreenivasa, Brian Behlendorf, Tim Chase
Used in production at Delphix since 2015
In illumos since Jan 2018
In ZoL since April 2018 (will be in 0.8)
“Map big chunks” out for review
26
🔜 RAIDZ Expansion
27
🔜 Problem Solved
You have a RAID-Z pool with 4 disks in it; you want to add a 5th disk
[Diagram: StoragePool with a single raidz1 vdev]
28
🔜 How it works
Rewrite everything with the new physical stripe width
Increases total usable space
Parity/data relationship not changed
29
How could traditional RAID 4/5/6 do it?
Before (4 disks):          After (5 disks):
 1  2  3  P1-3              1  2  3  4  P1-4
 4  5  6  P4-6              5  6  7  8  P5-8
 7  8  9  P7-9              9 10 11 12  P9-12
10 11 12  P10-12           13 14 15 16  P13-16
13 14 15  P13-15           17 18 19 20  P17-20
Color indicates parity group (stripe)
30
RAID-Z Expansion: Reflow
[Diagram: sector layout before and after reflow; color indicates parity group (logical stripe)]
31
RAID-Z Expansion: Reflow copies allocated data (sketch below)
[Diagram: allocated sectors rewritten across the wider stripe; color indicates parity group (logical stripe)]
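Ignoring parity rotation, allocation gaps, and the scratch-object machinery, the placement arithmetic is just a change of stripe width: the k-th packed sector moves from row k/old_width, disk k%old_width to row k/new_width, disk k%new_width. A simplified, illustration-only sketch:

#include <stdint.h>

typedef struct disk_loc {
	uint64_t dl_row;	/* stripe row on the child disks */
	uint64_t dl_col;	/* which child disk */
} disk_loc_t;

/* Location of the k-th packed sector for a given physical stripe width. */
static disk_loc_t
sector_location(uint64_t k, uint64_t width)
{
	disk_loc_t loc = { .dl_row = k / width, .dl_col = k % width };
	return (loc);
}

/*
 * Expanding a 4-wide raidz1 to 5 disks: sector k is read from
 * sector_location(k, 4) and rewritten at sector_location(k, 5).  The
 * data/parity relationship of existing blocks is unchanged; only the
 * physical placement changes, which is what frees the extra space.
 */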
32
🔜 How it works
Find allocated space to reflow by spacemap, not by block pointer
Fast discovery of data to reflow
Sequential reads and writes
No checksum verification
33
🔜 How it works
Each logical stripe is independent
Don’t need to know where parity is
Segments still on different disks, so redundancy is preserved
(Contraction couldn’t work this way)
34
🔜 Operating RAIDZ Expand
Initiate expansion:
$ zpool attach tank raidz1-0 c2t1d0
Works with RAIDZ1/2/3
Can expand multiple times
Works in the background; check progress with zpool status
35
🔜 Operating RAIDZ Expand
RAIDZ must be healthy during reflow:
Silent damage OK (up to parity limit)
Must be able to write to all drives; no missing devices
If a disk dies, reflow will pause
36
🔜 Operating RAIDZ Expand
Performance impact during reflow:
Copies all data
Reads and writes are sequential
37
🔜 Operating RAIDZ Expand
Performance impact after reflow:
Small CPU overhead when reading: one additional bcopy() to sequentialize
Combinatorial reconstruction: (old + new) choose P (sketch below)
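One plausible reading of the “(old + new) choose P” note, assuming “old” and “new” refer to the pre- and post-expansion stripe widths: once a damaged block’s segments may sit at either old or new column positions, combinatorial reconstruction has up to that many column combinations to try. A tiny helper just to illustrate the count (nothing here is OpenZFS code, and the example numbers are assumed):

#include <stdint.h>
#include <stdio.h>

/* Binomial coefficient "n choose k" via the multiplicative formula; each
 * intermediate value is itself a binomial coefficient, so the integer
 * division is exact. */
static uint64_t
choose(uint64_t n, uint64_t k)
{
	uint64_t result = 1;

	if (k > n)
		return (0);
	for (uint64_t i = 1; i <= k; i++)
		result = result * (n - k + i) / i;
	return (result);
}

int
main(void)
{
	/* Assumed example: a 4-wide raidz1 expanded to 5 disks with P = 1
	 * gives at most choose(4 + 5, 1) = 9 combinations to try. */
	printf("%llu\n", (unsigned long long)choose(4 + 5, 1));
	return (0);
}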
38
🔜 Operating RAIDZ Expand
Redundancy impact:
Preserves data and parity
However: checksums not verified during the reflow
Transient errors become permanent
39
RAID-Z Expansion: new writes, new stripe width
[Diagram: newly written blocks span the full, wider stripe; color indicates parity group (logical stripe)]
40
🔜 RAIDZ Expand Status
Sponsored by the FreeBSD Foundation
Design complete
Pre-alpha preview code (you will lose data!)
Reflow in one txg
Upcoming talk at BSDCan (June 2018)
41
Future Work
Map big chunks: 1/10th the RAM for device removal (review out)
Queue up multiple removals (don’t allocate from queued removals; prototype needs work)
Remove a RAIDZ group? (other vdevs must be RAIDZ of >= width)
42
OpenZFS Developer Summit
6th annual OZDS!
Week of September 10, 2018
Talk proposals due July 16
43
🆕 Device Removal 🔜 RAIDZ Expansion Matt Ahrens
44
Reflow end state
[Diagram: all sectors rewritten at the new, wider stripe width]
45
Reflow initial state
[Diagram: sectors laid out at the original stripe width]
46
Possible intermediary state
[Diagram: partially reflowed layout]
47
Possible intermediary state: disk failure
[Diagram: partially reflowed layout with a failed disk]
48
Reflow progress = 30 What if we lose a disk? Read stripe @29-32
[Diagram: reflow progress = 30, stripe at sectors 29-32]
If we read sectors 29,30 from the new location and 31,32 from the original location: data loss :-(
Instead, read the “split stripe” entirely from its original location, the same as if we hadn’t started moving this stripe
Need to ensure the original location of a split stripe hasn’t been overwritten
Each stripe of a block is considered independently; only the split stripe is read from the original location, not the whole block
(sketch below)
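The decision the slide walks through can be sketched as a per-stripe check against the reflow cursor, with hypothetical helpers standing in for the real I/O paths:

#include <stdint.h>

/* Hypothetical read paths; the real code issues I/O to child vdevs. */
extern void read_from_new_layout(uint64_t start, uint64_t len);
extern void read_from_original_layout(uint64_t start, uint64_t len);

/*
 * 'start'/'len' describe one logical stripe of a block (each stripe of a
 * block is considered independently); 'reflow_progress' is how far the
 * rewrite to the wider layout has advanced.
 */
static void
read_stripe(uint64_t start, uint64_t len, uint64_t reflow_progress)
{
	if (start + len <= reflow_progress) {
		/* Entirely reflowed: read from the new locations. */
		read_from_new_layout(start, len);
	} else {
		/*
		 * Split (or untouched) stripe: read all of it from its
		 * original location, exactly as if the reflow hadn't reached
		 * it.  In the slide's example, mixing new sectors 29,30 with
		 * original sectors 31,32 would lose data if a disk also
		 * failed, so the original copy of a split stripe must not be
		 * overwritten until the whole stripe has been reflowed.
		 */
		read_from_original_layout(start, len);
	}
}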
49
Read at least 5*5=25 sectors into scratch object
50
Reflow progress = 25; separation=6; chunk size = 1
51
Reflow progress = 26; separation=6; chunk size = 1
52
Reflow progress = 27; separation=6; chunk size = 1
53
Reflow progress = 28; separation=7; chunk size = 2
54
Reflow progress = 30; separation=7; chunk size = 2
55
Reflow progress = 30 What if we lose a disk? Read stripe @29-32
[Diagram: reflow progress = 30, stripe at sectors 29-32]
If we read sectors 29,30 from the new location and 31,32 from the original location: data loss :-(
Instead, read the “split stripe” entirely from its original location, the same as if we hadn’t started moving this stripe
Need to ensure the original location of a split stripe hasn’t been overwritten
Each stripe of a block is considered independently; only the split stripe is read from the original location, not the whole block
56
Reflow process
Need to track progress to know the physical stripe width
Each TXG can only overwrite what’s previously unused
Exponential increase in the amount that can be copied per TXG (sketch below)
E.g. with an initial scratch object of 1MB:
72 TXGs for 1GB
145 TXGs for 1TB
217 TXGs for 1PB
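The 72 / 145 / 217 figures imply a roughly geometric ramp-up: each TXG can overwrite only space freed by earlier copying, so the amount reflowed grows by a roughly constant factor per TXG. The sketch below just counts TXGs under that assumption; the ~10% growth factor and the model itself are assumptions chosen to land near the slide’s numbers, not values taken from the design.

#include <stdio.h>

/*
 * Assumed model: the first txg can only fill the initial scratch object;
 * after that, each txg can rewrite about 'growth' times what has already
 * been reflowed, because that is what bounds the newly overwritable space.
 */
static unsigned
txgs_to_reflow(double scratch_bytes, double target_bytes, double growth)
{
	double reflowed = scratch_bytes;
	unsigned txgs = 1;

	while (reflowed < target_bytes) {
		reflowed *= growth;
		txgs++;
	}
	return (txgs);
}

int
main(void)
{
	const double MiB = 1024.0 * 1024.0;
	const double GiB = 1024.0 * MiB;
	const double TiB = 1024.0 * GiB;
	const double PiB = 1024.0 * TiB;

	/* With a 1 MiB scratch object and ~10% growth this prints
	 * 74 / 147 / 220 txgs -- the same ballpark as the slide's
	 * 72 / 145 / 217, which come from the actual design's geometry. */
	printf("1GiB: %u txgs\n", txgs_to_reflow(MiB, GiB, 1.10));
	printf("1TiB: %u txgs\n", txgs_to_reflow(MiB, TiB, 1.10));
	printf("1PiB: %u txgs\n", txgs_to_reflow(MiB, PiB, 1.10));
	return (0);
}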
57
Design implications
Works with RAIDZ-1/2/3
Can expand multiple times (4-wide -> 5-wide -> 6-wide)
Old data keeps the old data:parity ratio; new data has the new data:parity ratio
RAIDZ must be healthy (no missing devices) during reflow
If a disk dies, reflow will pause and wait for it to be reconstructed
Reflow works in the background
58
Thank you!
59
Status
High-level design complete
Detailed design 80% complete
Zero lines of code written
Expect significant updates at BSDCan (June 2018) and next year’s DevSummit (~October 2018)