Fast File Clone in ZFS: Design Proposal
Pavel Zakharov, 11/14/2018

Introduction
Description: copy files almost instantly by copying by reference.
Motivations: VMware VAAI support (NAS Full File Clone and Fast File Clone); saving memory and disk space.
Existing alternatives: dataset clones and deduplication.

Regular Copy
[Diagram: dnode 1 is copied to dnode 2. The indirect blocks L2 and L1 and the data blocks A, B, C are all duplicated as L2', L1', A', B', C'.]
File copy is currently a very costly operation that has to duplicate every data block of the original file.

Direct References File Clone: Method 1

Clone: fast file copy
[Diagram: dnode 1 is cloned to dnode 2, which initially references the same L2, L1 and data blocks A, B, C. Modifying data block 3 of dnode 2 writes a new block C', then propagates the change by updating the blkptrs in new indirect blocks L1' and L2'.]
A clone is similar to a snapshot: it references the same blocks as the original file, and only modified blocks are written out.
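The copy-on-write behaviour described on this slide can be illustrated with a toy model. This is only a sketch: `Pool`, `File`, `alloc`, `write` and `clone` are hypothetical names invented for the example, and a flat dictionary stands in for the real tree of indirect blocks.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    # Hypothetical model: block index -> block id in the pool.
    blkptrs: dict = field(default_factory=dict)


class Pool:
    def __init__(self):
        self.blocks = {}   # block id -> data
        self.next_id = 0

    def alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        return bid

    def write(self, f, index, data):
        # Copy-on-write: a write always lands in a newly allocated block.
        f.blkptrs[index] = self.alloc(data)

    def clone(self, src):
        # Only the block pointers are copied; no data block is duplicated.
        return File(dict(src.blkptrs))


pool = Pool()
original = File()
for i, payload in enumerate([b"A", b"B", b"C"]):
    pool.write(original, i, payload)

clone = pool.clone(original)
blocks_before = len(pool.blocks)
pool.write(clone, 2, b"C'")   # modify data block 3 of the clone
```

After the write, the pool holds exactly one more block than before, and the original file still reads `b"C"` for its third block.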

Diagrams
[Diagram: dnode 1 and its clone dnode 2 both point to the same data A (shared data). Modifying data gives a file its own private data B.]
A cleaner way to represent shared and private data.

Reference node: reference original data
[Diagram: cloning dnode 1 produces dnode 2 (clone) and a hidden reference dnode that holds data A.]
To keep the original file writable, we use an approach similar to a dataset clone. A dataset clone is created on top of a read-only snapshot. Likewise, when a file is cloned, a hidden read-only dnode is created that references all the original blocks.

Reference node: avoid refcount
[Diagram: dnode 1 (file 1) and dnode 2 (clone, file 2) each modify data, writing private blocks B and C. The reference dnode still holds the original data A, which becomes garbage data once neither file uses it.]
The extra dnode keeps references to the original data even when that data is no longer used. As long as clones exist, the original data blocks are not freed. This avoids having to implement a complicated refcount algorithm. Original data is freed only when the refnode is destroyed.

System Attributes
[Diagram: file 1 (dnode 34) and file 2 (clone, dnode 56) each have pflags: clone and refnode: 55. The reference dnode (dnode 55) has birth: txg 200, pflags: clone_ref and clones: 34, 56.]
For cloned files, a flag is set indicating that the file is a clone, and a new attribute, refnode, points to the reference dnode.
For the reference dnode, a flag is set indicating that it is the reference dnode for other clones. A new attribute, clones, is an array of dnode numbers representing all the clones. Another new attribute, birth, records the txg in which the reference was created.

System Attributes: New Attributes
[Diagram: file 1 (dnode 34) and file 2 (clone, dnode 56) have pflags: clone and refnode: 55. The reference dnode (dnode 55, birth: txg 200, pflags: clone_ref) stores either clones: 34, 56 inline, or clones: 57, where dnode 57 is a ZAP object listing 34 and 56.]
refnode: object number of the reference dnode.
birth: txg in which the reference node was created.
clones: array of all dnodes that are clones of the refnode. Alternatively, clones could point to a ZAP object storing the clone list.
pflags: two new flags for the pflags attribute, clone and clone_ref.
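As a rough illustration of how these attributes relate to each other, the sketch below models them as a Python dataclass and checks that clones and their refnode point at each other. The flag bit values, the `SystemAttrs` class and the `check_links` helper are assumptions made for this example; the proposal does not specify an encoding.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical bit values for the two new pflags.
FLAG_CLONE = 0x1
FLAG_CLONE_REF = 0x2


@dataclass
class SystemAttrs:
    pflags: int = 0
    refnode: Optional[int] = None    # object number of the reference dnode
    birth: Optional[int] = None      # txg in which the refnode was created
    clones: list = field(default_factory=list)


# The example from the slide: files 34 and 56 share refnode 55.
dnodes = {
    34: SystemAttrs(pflags=FLAG_CLONE, refnode=55),
    56: SystemAttrs(pflags=FLAG_CLONE, refnode=55),
    55: SystemAttrs(pflags=FLAG_CLONE_REF, birth=200, clones=[34, 56]),
}


def check_links(table):
    """Return True if every clone and its refnode reference each other."""
    for num, sa in table.items():
        if sa.pflags & FLAG_CLONE:
            ref = table.get(sa.refnode)
            if ref is None or not (ref.pflags & FLAG_CLONE_REF):
                return False
            if num not in ref.clones:
                return False
        if sa.pflags & FLAG_CLONE_REF:
            for child in sa.clones:
                if child not in table or table[child].refnode != num:
                    return False
    return True


consistent = check_links(dnodes)
```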

Freeing Blocks
When overwriting or destroying data, only free blocks that were born after the reference node.
[Diagram: reference dnode 55 has birth txg 200. In file 1 (dnode 34), block A is original data born in txg 177, block B was written in txg 202, and block C replaces B in txg 205. When replaced, A is kept (txg <= 200) and B is freed (txg > 200).]

Freeing Blocks
Blocks born after the reference node are treated the same way as regular blocks. Blocks born before the reference node are freed only when the reference node is destroyed. Writes sent to a file right after it has been cloned must not be assigned the same txg as the reference node.
The reference node is destroyed when either all clones are destroyed (option 1) or all but one clone is destroyed (option 2, harder).
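The freeing rule reduces to a single txg comparison. The sketch below uses the txg values from the slides (refnode born in txg 200, original data in txg 177, later writes in txgs 202 and 205); the function names and the list-of-birth-txgs representation are hypothetical.

```python
def can_free(block_birth_txg, refnode_birth_txg):
    # Blocks born after the refnode are private to the clone and can be freed.
    # Older blocks are still referenced by the refnode and must be kept.
    return block_birth_txg > refnode_birth_txg


def overwrite(file_blocks, index, txg, refnode_birth_txg):
    """Replace one block; return True if the replaced block may be freed.

    file_blocks is a list holding the birth txg of each block.
    """
    if txg == refnode_birth_txg:
        raise ValueError("a write must not share the refnode's txg")
    old_birth = file_blocks[index]
    file_blocks[index] = txg
    return can_free(old_birth, refnode_birth_txg)


file1 = [177, 177]                        # two blocks of original data
first = overwrite(file1, 1, 202, 200)     # replaces a block born in txg 177
second = overwrite(file1, 1, 205, 200)    # replaces the block born in txg 202
```

The first overwrite must keep the old block, because the refnode still references it. The second overwrite can free the block written in txg 202.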

Multiple Clones
[Diagram: before the clone, reference dnode 55 (birth: txg 200, pflags: clone_ref) has clones: 34, 56. File 1 (dnode 34) and file 2 (clone, dnode 56) both have pflags: clone and refnode: 55. File 2 has been modified and holds private data B; file 1 still holds only shared data A. Cloning file 1 creates file 3 (clone, dnode 63) with pflags: clone and refnode: 55, and updates the clone list to 34, 56, 63.]
File 1 was not modified after the cloning operation, so file 3 can link to the same reference node.

Nested Clones
If a clone is modified and then cloned again, a new reference node is created.
[Diagram: before, reference dnode 55 (birth: txg 200, pflags: clone_ref) has clones: 34, 56; file 1 (dnode 34) and file 2 (clone, dnode 56) both have pflags: clone and refnode: 55. File 2 is modified and then cloned. After, reference dnode 2 (dnode 63, birth: txg 240, pflags: clone_ref, refnode: 55) has clones: 56, 64; file 2 (dnode 56) and file 3 (clone, dnode 64) have refnode: 63; the clone list of dnode 55 is updated to 34, 63.]
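The choice between linking to an existing refnode and nesting a new one can be sketched as follows. This is a toy model under stated assumptions: `Dataset`, `Dnode` and `last_write_txg` are hypothetical names, "modified" is approximated by comparing the file's last write txg with the refnode's birth txg, and `next_obj` is set by hand only so that the object numbers match the slides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dnode:
    is_ref: bool = False
    refnode: Optional[int] = None
    birth: Optional[int] = None          # only meaningful for refnodes
    clones: list = field(default_factory=list)
    last_write_txg: int = 0


class Dataset:
    def __init__(self, next_obj):
        self.objs = {}
        self.next_obj = next_obj

    def _alloc(self, dn):
        num = self.next_obj
        self.next_obj += 1
        self.objs[num] = dn
        return num

    def create(self, txg):
        return self._alloc(Dnode(last_write_txg=txg))

    def write(self, num, txg):
        self.objs[num].last_write_txg = txg

    def clone(self, src, txg):
        s = self.objs[src]
        parent = s.refnode
        # Reuse the refnode only if the source is unmodified since its birth.
        if parent is not None and s.last_write_txg <= self.objs[parent].birth:
            ref = parent
        else:
            ref = self._alloc(Dnode(is_ref=True, refnode=parent,
                                    birth=txg, clones=[src]))
            if parent is not None:
                # The new refnode takes the source's place in the parent list.
                plist = self.objs[parent].clones
                plist[plist.index(src)] = ref
            s.refnode = ref
        new = self._alloc(Dnode(refnode=ref, last_write_txg=s.last_write_txg))
        self.objs[ref].clones.append(new)
        return new


ds = Dataset(next_obj=34)
f1 = ds.create(txg=177)          # file 1, dnode 34
ds.next_obj = 55                 # skip ahead to mirror the slide numbering
f2 = ds.clone(f1, txg=200)       # creates refnode 55 and clone 56
ds.write(f2, txg=205)            # modify file 2
ds.next_obj = 63
f3 = ds.clone(f2, txg=240)       # nested: creates refnode 63 and clone 64
f4 = ds.clone(f1, txg=250)       # file 1 is unmodified: reuses refnode 55
```

Cloning the modified file 2 creates a second refnode whose own refnode attribute is 55, while cloning the unmodified file 1 only appends to the clone list of refnode 55.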

Integration with other ZFS features: ZFS Send/Receive
Problem: traversing blocks in order without going over the same block twice now becomes problematic. ZFS send/receive loses all information about block pointers and txgs. Traversal must therefore be done in multiple passes: first send the reference nodes, then send the clones. Nested clones require several passes, one extra pass for each clone depth.
[Diagram: a dataset containing dnodes 1, 4, 7, 9, 12 and 13, where dnode 1 is a regular file, dnode 4 is a clone of dnode 9, and dnode 9 is a refnode. First pass: dnodes 1, 7, 9, 12. Second pass: dnodes 4, 13.]
Solution: implicit references.
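The multi-pass ordering can be sketched by grouping dnodes by clone depth, so that every refnode is sent before anything that references it. The `send_passes` function is a hypothetical helper. The slide only states that dnode 4 is a clone of dnode 9; treating dnode 13 as another clone of dnode 9 is an assumption made so that the example reproduces the two passes shown.

```python
def send_passes(refnode_of):
    """Group dnodes into passes by clone depth.

    refnode_of maps a dnode number to its refnode number, or to None when
    the dnode does not reference a refnode.
    """
    def depth(num):
        d = 0
        while refnode_of[num] is not None:
            num = refnode_of[num]
            d += 1
        return d

    passes = {}
    for num in sorted(refnode_of):
        passes.setdefault(depth(num), []).append(num)
    return [passes[d] for d in sorted(passes)]


# dnode 13 -> 9 is an assumption; the slide does not name its refnode.
dataset = {1: None, 4: 9, 7: None, 9: None, 12: None, 13: 9}
order = send_passes(dataset)
```

A clone whose refnode is itself nested under another refnode would land in a third pass, which matches the "one extra pass for each clone depth" cost noted on the slide.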

Implicit References File Clone: Method 2

Implicit Block Pointers
Shared data is only accessible (linked) from the reference node. The clones have embedded block pointers that indicate the data should be looked up in the reference node.
[Diagram: with direct references, the refnode and each clone point at the shared blocks A and B directly. With implicit references, the clone's references to shared blocks in L1 are replaced by a special hole. Overwriting data block 1 gives the clone a private block; reading data block 2 returns block 2 of the refnode.]

Nested Clones: performance issue
When data is not found, the refnodes have to be traversed one by one. This can degrade performance when multiple refnodes are nested.
[Diagram: a clone holds private block C, refnode 2 holds B and refnode 1 holds A. Reading data block 1 of the clone walks through refnode 2 to refnode 1, which returns block 1.]
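The cost of nesting shows up as the number of refnodes visited per read. In the sketch below, `HOLE` is a hypothetical marker standing in for the special embedded block pointer, and `Node` and `read` are names invented for the example.

```python
HOLE = object()   # hypothetical marker: "look for this block in the refnode"


class Node:
    def __init__(self, blocks, refnode=None):
        self.blocks = blocks
        self.refnode = refnode


def read(node, index):
    """Return (data, hops), where hops counts the refnodes visited."""
    hops = 0
    while node.blocks[index] is HOLE:
        node = node.refnode
        hops += 1
    return node.blocks[index], hops


refnode1 = Node(["a1", "a2", "a3"])
refnode2 = Node([HOLE, "b2", HOLE], refnode=refnode1)
clone = Node([HOLE, HOLE, "c3"], refnode=refnode2)

block1 = read(clone, 0)   # served by refnode 1 after two hops
block2 = read(clone, 1)   # served by refnode 2 after one hop
block3 = read(clone, 2)   # private block, no hop
```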

Nested Clones: improvements
One way to improve performance is to store, in the embedded blkptr, a reference to the dnode of the appropriate refnode.
[Diagram: the clone (dnode 13) and refnode 2 (dnode 9) both carry an embedded pointer to dnode 7 for block 1. Reading data block 1 of the clone goes directly to refnode 1 (dnode 7), which returns block 1.]
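With the owning refnode's object number stored in the embedded pointer, a read needs at most one extra lookup regardless of nesting depth. The sketch below reuses the dnode numbers from the slide (clone 13, refnode 2 at dnode 9, refnode 1 at dnode 7); the `Implicit` class and the `objects` table are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implicit:
    # Hypothetical embedded blkptr: object number that really owns the block.
    refnode: int


objects = {
    7: ["a1", "a2", "a3"],                    # refnode 1
    9: [Implicit(7), "b2", Implicit(7)],      # refnode 2
    13: [Implicit(7), Implicit(9), "c3"],     # clone
}


def read(obj, index):
    entry = objects[obj][index]
    if isinstance(entry, Implicit):
        # Jump straight to the owner instead of walking the refnode chain.
        return objects[entry.refnode][index]
    return entry


block1 = read(13, 0)
block2 = read(13, 1)
block3 = read(13, 2)
```

In this model, nesting a new refnode would copy the existing embedded pointers unchanged, so they keep naming the original owner.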

Analysis Integration and comparison

Compare referencing methods
Direct References: clones are accessed at the same speed as regular files. ZFS send/receive becomes non-trivial and potentially slower. Requires more changes in the DMU layer.
Implicit References: accessing clones is potentially slower for each nesting level. Higher ARC footprint. Requires more changes in the ZPL layer.

Integration with other ZFS features
ZIL: a new record type must be implemented.
Snapshots: file clone should not interfere with the current snapshot logic. Special care has to be taken so that unreferenced clone-related data is destroyed when a snapshot is destroyed.
Scrubbing: do not scrub cloned data multiple times. Easy with implicit references.
Send/Receive: do not send cloned data multiple times. Easy with implicit references.
ZFS features: the clone feature should be downgraded from active to enabled when all clones are deleted.

Space Quotas
Space quotas can be tricky. The situation is similar to Linux reflinks. If we treat a clone as a copy at the user level:
The full ZPL size of each clone (shared + private data) is accounted to the owner’s userquota.
The full ZPL size of each clone is accounted to the dataset’s refquota and refreservation.
The shared data of the refnode plus the private data of each clone is accounted to the dataset’s quota and reservation.

Accessing Reference Nodes
[Diagram: a $refnodes folder containing the directories refnode 7 and refnode 13, each with a contents file and a clones file.]
A ZAP object in the dataset links to all refnodes. The ZPL layer can expose this ZAP object as a special read-only folder. Inside this folder, each refnode is displayed as a directory. Each refnode directory contains one file for viewing the refnode’s contents and another file that lists the relative paths to all its clones.

Integration with OS
NFS: NFS Fast File Clone support.
libc (userland): fclone.
System call (kernel): fclone. A new system call is required.
ZFS vnode ops: vop_fclone. File clone support within the same dataset.
ZFS znode ops: zfs_clone_node. File clone workhorse.

Other thoughts
Ditto blocks for highly cloned files.
Ability to unlink a clone: obtain a hard copy.
When cloning a clone, avoid nesting the existing refnode if the changes between the clone and its refnode are minor.
Alternative clone designs: use deduplication; link to dataset clones.
Work to do.

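Compare clone alternatives
Cloning speed: instant cloning with direct/implicit references and with linked dataset clones; slow cloning with deduplication, which needs to traverse the data.
Scalability: direct/implicit references are scalable; linked dataset clones are not scalable as of now and affect pool import times.
Wasted space: with direct/implicit references, space is wasted by the refnode if shared data is no longer referenced; with linked dataset clones, space is wasted by the snapshot if shared data is no longer referenced; with deduplication, no space is wasted.
Space quotas: must be implemented for direct/implicit references; must be modified for dataset clones that represent files; already implemented for deduplication.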

THANK YOU
