1
Leveraging btrfs transactions
Sage Weil
new dream network / DreamHost
2011 Storage Developer Conference
2
Overview
Btrfs background
Ceph basics, storage requirements
Transactions
Snapshots
Ceph journaling
Other btrfs features
3
btrfs
Featureful
  Extent based
  Space efficient packing for small files
  Integrity checksumming
  Writable snapshots
  Efficient incremental backups
  Multi-device support (striping, mirroring, RAID)
  Online resize, defrag, scrub/repair
  Transparent compression
4
btrfs trees
Generic copy-on-write tree implementation
  Never overwrite data in place; on-disk image always consistent
  Reference counting
(Almost) everything is a key/value pair
  Large data blobs stored outside of the tree
Data and metadata segregated on disk
  Transparent defrag before writeback
5
btrfs trees (diagram)
6
btrfs copy-on-write trees (diagram)
7
btrfs transaction commit (diagram)
8
btrfs tree layout
Keys
  Types: inode, xattrs, dir item, data extent, csum, backrefs
Nice properties
  Xattrs close to inode
  Small amounts of file data stored inline in tree
  Checksums near data references
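These properties follow from how items are sorted: everything in the tree is addressed by an (objectid, type, offset) key, so items that share an objectid (an inode and its xattrs or inline data, for instance) land next to each other on disk. A minimal sketch of that key, assuming item-type values along the lines of the btrfs on-disk format headers (the exact constant names and numbers here are illustrative):

/* Sketch of a btrfs item key (after struct btrfs_disk_key in the btrfs
 * headers). Items are sorted by (objectid, type, offset), which is why
 * an inode's xattrs and inline file data end up adjacent on disk. */
#include <stdint.h>

struct btrfs_key_sketch {
    uint64_t objectid;  /* e.g. the inode number that owns the item */
    uint8_t  type;      /* item type: inode item, xattr, dir item, ... */
    uint64_t offset;    /* type-specific: file offset, name hash, etc. */
};

/* A few representative item types, in sort order (values taken from the
 * on-disk format; treat them as illustrative here). */
enum {
    ITEM_INODE       = 1,    /* BTRFS_INODE_ITEM_KEY */
    ITEM_XATTR       = 24,   /* BTRFS_XATTR_ITEM_KEY */
    ITEM_DIR_ITEM    = 84,   /* BTRFS_DIR_ITEM_KEY */
    ITEM_EXTENT_DATA = 108,  /* BTRFS_EXTENT_DATA_KEY: inline or extent-backed data */
    ITEM_EXTENT_CSUM = 128,  /* BTRFS_EXTENT_CSUM_KEY: checksums for data extents */
};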
9
Ceph basics
Scalable file system, block device, object store
  Objects, not blocks
  100s to 1000s of storage bricks (OSDs)
  Self-healing, self-managing, replicated, no SPOFs, etc.
OSDs are smart
  Peer to peer, loosely coordinated, strong consistency
  Manage replication, recovery, data migration
  Carefully manage consistency of locally stored objects
10
Consistency and safety
All objects are replicated
  Writes/updates apply to all replicas before ack
Nodes may fail at any time
  They frequently recover
We keep our local store in a consistent state
  Know what we have
  Know how that compares to what others have
  So we can re-sync quickly
Versioning and logs!
11
Atomicity
Objects have version metadata
“Placement groups” have update logs
And writes are ordered
Want atomic data+metadata commits
  Object content
  Version metadata
  Log entry
fsync(2) on extN isn't good enough
12
Transaction hooks
Btrfs groups many operations into a single commit/transaction
We added ioctl(2)s to start/end transactions
  START pins the current transaction; END releases it
  User process can bracket sets of operations and know they will commit atomically
  Protects against node failures

#define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
#define BTRFS_IOC_TRANS_END   _IO(BTRFS_IOCTL_MAGIC, 7)
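A rough sketch of how an OSD-like process would bracket a set of updates with these hooks; the mount point, file names, and payload below are made up for illustration, error handling is trimmed, and the START ioctl may require CAP_SYS_ADMIN:

/* Bracket a data update and its metadata inside one btrfs transaction
 * using the START/END ioctls above. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ioctl.h>

#define BTRFS_IOCTL_MAGIC     0x94
#define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
#define BTRFS_IOC_TRANS_END   _IO(BTRFS_IOCTL_MAGIC, 7)

int update_object_atomically(void)
{
    int fs = open("/mnt/osd0", O_RDONLY);            /* any fd on the btrfs fs */
    if (fs < 0 || ioctl(fs, BTRFS_IOC_TRANS_START) < 0)
        return -1;                                   /* pin the open transaction */

    /* Everything between START and END lands in the same commit, so a
     * node failure cannot expose a partially applied update. */
    int data = open("/mnt/osd0/objects/foo", O_WRONLY | O_CREAT, 0644);
    write(data, "object bytes", 12);
    close(data);

    int meta = open("/mnt/osd0/meta/foo.version", O_WRONLY | O_CREAT, 0644);
    write(meta, "v42\n", 4);
    close(meta);

    ioctl(fs, BTRFS_IOC_TRANS_END);                  /* release the transaction */
    close(fs);
    return 0;
}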
13
Transaction hooks
What about software errors?
  By default, END is implied when the fd is closed
  A software crash means a partial transaction can reach disk
  A mount option would disable the implied END and intentionally wedge the machine
  No rollback
14
Compound operations
Various interfaces proposed for compound kernel operations
  Syslets – Ingo Molnar, ~2007
  Btrfs usertrans ioctl – me, ~2009
Describe multiple operations via a single syscall
  Varying degrees of generality, flexibility
  No need to worry about the process failing to complete the transaction
  Need to ensure the whole operation will succeed: ENOSPC, EIO, EFAULT, bad transaction, etc.
15
Snapshots
Granularity of durability is a btrfs transaction, i.e. a snapshot
Explicitly manage btrfs commits from userspace:
  Do whatever operations we'd like
  Quiesce writes
  Take a snapshot of the btrfs subvolume
  Repeat
On failure/restart, roll back to last snapshot
16
Commit process
Normal commit sequence:
  Block start of new transactions
  Flush/perform delayed allocations, writeback
  Make btree state consistent
  Allow new transactions
  Flush new trees
  Update superblock(s) to point to new tree roots
Want to minimize idle window
  Ceph OSD needs ordering, not safety
17
Async snapshot interface
New async snapshot ioctls:

#define BTRFS_SUBVOL_CREATE_ASYNC (1ULL << 0)

struct btrfs_ioctl_vol_args_v2 {
    __s64 fd;
    __u64 transid;
    __u64 flags;
    __u64 unused[4];
    char name[BTRFS_SUBVOL_NAME_MAX + 1];
};

#define BTRFS_IOC_SNAP_CREATE_V2 _IOW(BTRFS_IOCTL_MAGIC, 23, \
    struct btrfs_ioctl_vol_args_v2)
#define BTRFS_IOC_WAIT_SYNC _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
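A rough sketch of how these fit together, assuming the definitions above are available (e.g. via <linux/btrfs.h> on later kernels); the subvolume paths and snapshot name are placeholders and error handling is trimmed. The idea: queue the snapshot asynchronously, keep servicing writes, and only wait on the returned transid when durability is actually needed.

/* Sketch: take an async snapshot of the OSD's current subvolume, then
 * wait for the containing btrfs transaction to reach disk. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>   /* vol_args_v2, SNAP_CREATE_V2, WAIT_SYNC */

int snapshot_and_wait(void)
{
    int src = open("/mnt/osd0/current", O_RDONLY);   /* subvolume to snapshot */
    int dst = open("/mnt/osd0", O_RDONLY);           /* parent dir for the snap */

    struct btrfs_ioctl_vol_args_v2 args;
    memset(&args, 0, sizeof(args));
    args.fd = src;
    args.flags = BTRFS_SUBVOL_CREATE_ASYNC;
    strncpy(args.name, "snap_1234", BTRFS_SUBVOL_NAME_MAX);

    /* Returns as soon as the snapshot is queued; args.transid reports
     * which transaction it will commit in. */
    if (ioctl(dst, BTRFS_IOC_SNAP_CREATE_V2, &args) < 0)
        return -1;

    /* ... keep servicing new writes here ... */

    /* Block until that transaction (and the snapshot) is durable. */
    __u64 transid = args.transid;
    if (ioctl(dst, BTRFS_IOC_WAIT_SYNC, &transid) < 0)
        return -1;

    close(src);
    close(dst);
    return 0;
}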
18
Ceph journal
Full btrfs commits have high latency
  Tree flush, delayed allocation and writeback, superblock updates
  Even fsync(2) via the tree log has much of that
  Poor IO pattern
Ceph OSDs have an independent journal
  Separate device or file
  Keeps write latency low
  Exploits SSDs, NVRAM, etc.
  Optional
19
Journal mode
Write-ahead
  Any fs
  Operations must be idempotent
Parallel
  Journal relative to a consistency point
  Btrfs only
Mask commit latency
Atomicity (w/ non-btrfs backing fs)
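A schematic sketch of the two modes (not the actual Ceph FileStore code; journal_append, journal_flush, and fs_apply are hypothetical helpers):

struct op;                              /* an encoded OSD transaction */
void journal_append(struct op *op);     /* hypothetical journal helpers */
void journal_flush(void);
void fs_apply(struct op *op);           /* apply op to the backing fs */

/* Write-ahead: journal first, apply second. Works on any fs, but replay
 * after a crash may re-apply ops, so they must be idempotent. */
void submit_writeahead(struct op *op)
{
    journal_append(op);
    journal_flush();
    fs_apply(op);
}

/* Parallel: journal and apply concurrently. Journal entries are kept
 * relative to the last btrfs snapshot (the consistency point); on
 * restart, roll back to that snapshot and replay everything after it. */
void submit_parallel(struct op *op)
{
    journal_append(op);
    fs_apply(op);
}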
20
Journal performance
HDD only
  Low load: journal improves write latency
  Full load: halves throughput
  We should avoid the journal under heavy load
HDD + NVRAM/SSD
  Low load: low latency
  Full load: full throughput, low latency
SSD only
  Journal offers minimal benefit
Eventually btrfs can probably do better
21
CLONE RANGE
Clone (range of) bytes from one file to another
  No data is copied; only extent refs and csums
Exposed by the Ceph object storage API as a building block for snapshots
  Ceph snapshots do not rely on btrfs snapshots
Now also used by cp --reflink
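The ioctl behind this looks roughly like the following sketch; the file names are placeholders, offsets/lengths generally need to be block-aligned, and a length of 0 clones to the end of the source file.

/* Sketch: share a range of extents between two files with
 * BTRFS_IOC_CLONE_RANGE (no data copy, only new extent references). */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>   /* struct btrfs_ioctl_clone_range_args */

int clone_range_example(void)
{
    int src = open("/mnt/btrfs/object_a", O_RDONLY);
    int dst = open("/mnt/btrfs/object_b", O_WRONLY | O_CREAT, 0644);

    struct btrfs_ioctl_clone_range_args args = {
        .src_fd      = src,
        .src_offset  = 0,
        .src_length  = 1 << 20,   /* clone 1 MB of the source file */
        .dest_offset = 0,
    };

    int ret = ioctl(dst, BTRFS_IOC_CLONE_RANGE, &args);

    close(src);
    close(dst);
    return ret;
}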
22
Other useful bits
Btrfs checksums
  Planned ioctl to extract csum metadata
  Improve Ceph intra-node scrub
  Can also read(2) data for deep scrub
Transparent compression
  zlib (decent compression; slowish)
  lzo (mediocre compression; very fast)
23
Multi-device support
Devices added to a pool of available storage
Multiple pool modes
  raid0, raid1, raid10, single spindle dup
  raid5/6 coming
Space allocated in large chunks
btrfs will mask many media errors
  Read from alternate replicas
  No intervening block interface to make life difficult
24
Stability
ENOSPC
Ceph takes slightly different commit path
  Every commit is a snapshot commit
  Async
Ceph replication masks some of it
  When failures are independent
Improving test coverage
fsck.btrfs coming Real Soon Now
25
Questions
http://btrfs.wiki.kernel.org/
http://ceph.newdream.net/