1
Leveraging btrfs transactions
Sage Weil
new dream network / DreamHost
2011 Storage Developer Conference
2
Overview
Btrfs background
Ceph basics, storage requirements
Transactions
Snapshots
Ceph journaling
Other btrfs features
3
btrfs
Featureful
  Extent based
  Space efficient packing for small files
  Integrity checksumming
  Writable snapshots
  Efficient incremental backups
  Multi-device support (striping, mirroring, RAID)
  Online resize, defrag, scrub/repair
  Transparent compression
4
btrfs trees
Generic copy-on-write tree implementation
  Never overwrite data in place; on-disk image always consistent
  Reference counting
(Almost) everything is a key/value pair
  Large data blobs stored outside of the tree
Data and metadata segregated on disk
  Transparent defrag before writeback
5
btrfs trees (diagram)
6
btrfs copy-on-write trees (diagram)
7
btrfs transaction commit (diagram)
8
btrfs tree layout
Keys
  Types: inode, xattrs, dir item, data extent, csum, backrefs
Nice properties
  Xattrs close to inode
  Small amounts of file data stored inline in tree
  Checksums near data references
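These properties follow from how items are sorted: everything in the tree is addressed by an (objectid, type, offset) key, so items that share an objectid (an inode and its xattrs or inline data, for instance) land next to each other on disk. A minimal sketch of that key, assuming item-type values along the lines of the btrfs on-disk format headers (the exact constant names and numbers here are illustrative):

/* Sketch of a btrfs item key (after struct btrfs_disk_key in the btrfs
 * headers). Items are sorted by (objectid, type, offset), which is why
 * an inode's xattrs and inline file data end up adjacent on disk. */
#include <stdint.h>

struct btrfs_key_sketch {
    uint64_t objectid;  /* e.g. the inode number that owns the item */
    uint8_t  type;      /* item type: inode item, xattr, dir item, ... */
    uint64_t offset;    /* type-specific: file offset, name hash, etc. */
};

/* A few representative item types, in sort order (values taken from the
 * on-disk format; treat them as illustrative here). */
enum {
    ITEM_INODE       = 1,    /* BTRFS_INODE_ITEM_KEY */
    ITEM_XATTR       = 24,   /* BTRFS_XATTR_ITEM_KEY */
    ITEM_DIR_ITEM    = 84,   /* BTRFS_DIR_ITEM_KEY */
    ITEM_EXTENT_DATA = 108,  /* BTRFS_EXTENT_DATA_KEY: inline or extent-backed data */
    ITEM_EXTENT_CSUM = 128,  /* BTRFS_EXTENT_CSUM_KEY: checksums for data extents */
};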
9
Ceph basics
Scalable file system, block device, object store
  Objects, not blocks
  100s to 1000s of storage bricks (OSDs)
  Self-healing, self-managing, replicated, no SPOFs, etc.
OSDs are smart
  Peer to peer, loosely coordinated, strong consistency
  Manage replication, recovery, data migration
  Carefully manage consistency of locally stored objects
10
Consistency and safety
All objects are replicated
  Writes/updates apply to all replicas before ack
Nodes may fail at any time
  They frequently recover
We keep our local store in a consistent state
  Know what we have
  Know how that compares to what others have
  So we can re-sync quickly
Versioning and logs!
11
Atomicity
Objects have version metadata
“Placement groups” have update logs
And writes are ordered
Want atomic data+metadata commits
  Object content
  Version metadata
  Log entry
fsync(2) on extN isn't good enough
12
Transaction hooks
Btrfs groups many operations into a single commit/transaction
We added ioctl(2)s to start/end transactions
  START pins the current transaction; END releases it
  User process can bracket sets of operations and know they will commit atomically
  Protects against node failures

#define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
#define BTRFS_IOC_TRANS_END   _IO(BTRFS_IOCTL_MAGIC, 7)
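A rough sketch of how an OSD-like process would bracket a set of updates with these hooks; the mount point, file names, and payload below are made up for illustration, error handling is trimmed, and the START ioctl may require CAP_SYS_ADMIN:

/* Bracket a data update and its metadata inside one btrfs transaction
 * using the START/END ioctls above. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ioctl.h>

#define BTRFS_IOCTL_MAGIC     0x94
#define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
#define BTRFS_IOC_TRANS_END   _IO(BTRFS_IOCTL_MAGIC, 7)

int update_object_atomically(void)
{
    int fs = open("/mnt/osd0", O_RDONLY);            /* any fd on the btrfs fs */
    if (fs < 0 || ioctl(fs, BTRFS_IOC_TRANS_START) < 0)
        return -1;                                   /* pin the open transaction */

    /* Everything between START and END lands in the same commit, so a
     * node failure cannot expose a partially applied update. */
    int data = open("/mnt/osd0/objects/foo", O_WRONLY | O_CREAT, 0644);
    write(data, "object bytes", 12);
    close(data);

    int meta = open("/mnt/osd0/meta/foo.version", O_WRONLY | O_CREAT, 0644);
    write(meta, "v42\n", 4);
    close(meta);

    ioctl(fs, BTRFS_IOC_TRANS_END);                  /* release the transaction */
    close(fs);
    return 0;
}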
13
Transaction hooks
What about software errors?
  By default, END is implied when the fd is closed
  A software crash means a partial transaction can reach disk
  A mount option would disable the implied END and intentionally wedge the machine
  No rollback
14
Compound operations
Various interfaces proposed for compound kernel operations
  Syslets – Ingo Molnar, ~2007
  Btrfs usertrans ioctl – me, ~2009
Describe multiple operations via a single syscall
  Varying degrees of generality, flexibility
  No need to worry about the process failing to complete the transaction
  Need to ensure the whole operation will succeed: ENOSPC, EIO, EFAULT, bad transaction, etc.
15
Snapshots
Granularity of durability is a btrfs transaction, i.e. a snapshot
Explicitly manage btrfs commits from userspace:
  Do whatever operations we'd like
  Quiesce writes
  Take a snapshot of the btrfs subvolume
  Repeat
On failure/restart, roll back to last snapshot
16
Commit process
Normal commit sequence:
  Block start of new transactions
  Flush/perform delayed allocations, writeback
  Make btree state consistent
  Allow new transactions
  Flush new trees
  Update superblock(s) to point to new tree roots
Want to minimize idle window
  Ceph OSD needs ordering, not safety
17
Async snapshot interface
New async snapshot ioctls:

#define BTRFS_SUBVOL_CREATE_ASYNC (1ULL << 0)

struct btrfs_ioctl_vol_args_v2 {
    __s64 fd;
    __u64 transid;
    __u64 flags;
    __u64 unused[4];
    char name[BTRFS_SUBVOL_NAME_MAX + 1];
};

#define BTRFS_IOC_SNAP_CREATE_V2 _IOW(BTRFS_IOCTL_MAGIC, 23, \
    struct btrfs_ioctl_vol_args_v2)
#define BTRFS_IOC_WAIT_SYNC _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
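A rough sketch of how these fit together, assuming the definitions above are available (e.g. via <linux/btrfs.h> on later kernels); the subvolume paths and snapshot name are placeholders and error handling is trimmed. The idea: queue the snapshot asynchronously, keep servicing writes, and only wait on the returned transid when durability is actually needed.

/* Sketch: take an async snapshot of the OSD's current subvolume, then
 * wait for the containing btrfs transaction to reach disk. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>   /* vol_args_v2, SNAP_CREATE_V2, WAIT_SYNC */

int snapshot_and_wait(void)
{
    int src = open("/mnt/osd0/current", O_RDONLY);   /* subvolume to snapshot */
    int dst = open("/mnt/osd0", O_RDONLY);           /* parent dir for the snap */

    struct btrfs_ioctl_vol_args_v2 args;
    memset(&args, 0, sizeof(args));
    args.fd = src;
    args.flags = BTRFS_SUBVOL_CREATE_ASYNC;
    strncpy(args.name, "snap_1234", BTRFS_SUBVOL_NAME_MAX);

    /* Returns as soon as the snapshot is queued; args.transid reports
     * which transaction it will commit in. */
    if (ioctl(dst, BTRFS_IOC_SNAP_CREATE_V2, &args) < 0)
        return -1;

    /* ... keep servicing new writes here ... */

    /* Block until that transaction (and the snapshot) is durable. */
    __u64 transid = args.transid;
    if (ioctl(dst, BTRFS_IOC_WAIT_SYNC, &transid) < 0)
        return -1;

    close(src);
    close(dst);
    return 0;
}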
18
Ceph journal
Full btrfs commits have high latency
  Tree flush, delayed allocation and writeback, superblock updates
  Even fsync(2) via the tree log has much of that
  Poor IO pattern
Ceph OSDs have an independent journal
  Separate device or file
  Keeps write latency low
  Exploits SSDs, NVRAM, etc.
  Optional
19
Journal mode
Write-ahead
  Any fs
  Operations must be idempotent
Parallel
  Journal relative to a consistency point
  Btrfs only
Mask commit latency
Atomicity (w/ non-btrfs backing fs)
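A schematic sketch of the two modes (not the actual Ceph FileStore code; journal_append, journal_flush, and fs_apply are hypothetical helpers):

struct op;                              /* an encoded OSD transaction */
void journal_append(struct op *op);     /* hypothetical journal helpers */
void journal_flush(void);
void fs_apply(struct op *op);           /* apply op to the backing fs */

/* Write-ahead: journal first, apply second. Works on any fs, but replay
 * after a crash may re-apply ops, so they must be idempotent. */
void submit_writeahead(struct op *op)
{
    journal_append(op);
    journal_flush();
    fs_apply(op);
}

/* Parallel: journal and apply concurrently. Journal entries are kept
 * relative to the last btrfs snapshot (the consistency point); on
 * restart, roll back to that snapshot and replay everything after it. */
void submit_parallel(struct op *op)
{
    journal_append(op);
    fs_apply(op);
}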
20
Journal performance
HDD only
  Low load: journal improves write latency
  Full load: halves throughput
  We should avoid the journal under heavy load
HDD + NVRAM/SSD
  Low load: low latency
  Full load: full throughput, low latency
SSD only
  Journal offers minimal benefit
Eventually btrfs can probably do better
21
CLONE RANGE
Clone (range of) bytes from one file to another
  No data is copied; only extent refs and csums
Exposed by the Ceph object storage API as a building block for snapshots
  Ceph snapshots do not rely on btrfs snapshots
Now also used by cp --reflink
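The ioctl behind this looks roughly like the following sketch; the file names are placeholders, offsets/lengths generally need to be block-aligned, and a length of 0 clones to the end of the source file.

/* Sketch: share a range of extents between two files with
 * BTRFS_IOC_CLONE_RANGE (no data copy, only new extent references). */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>   /* struct btrfs_ioctl_clone_range_args */

int clone_range_example(void)
{
    int src = open("/mnt/btrfs/object_a", O_RDONLY);
    int dst = open("/mnt/btrfs/object_b", O_WRONLY | O_CREAT, 0644);

    struct btrfs_ioctl_clone_range_args args = {
        .src_fd      = src,
        .src_offset  = 0,
        .src_length  = 1 << 20,   /* clone 1 MB of the source file */
        .dest_offset = 0,
    };

    int ret = ioctl(dst, BTRFS_IOC_CLONE_RANGE, &args);

    close(src);
    close(dst);
    return ret;
}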
22
Other useful bits
Btrfs checksums
  Planned ioctl to extract csum metadata
  Improve Ceph intra-node scrub
  Can also read(2) data for deep scrub
Transparent compression
  zlib (decent compression; slowish)
  lzo (mediocre compression; very fast)
23
Multi-device support
Devices added to a pool of available storage
Multiple pool modes
  raid0, raid1, raid10, single spindle dup
  raid5/6 coming
Space allocated in large chunks
btrfs will mask many media errors
  Read from alternate replicas
  No intervening block interface to make life difficult
24
Stability
ENOSPC
Ceph takes slightly different commit path
  Every commit is a snapshot commit
  Async
Ceph replication masks some of it
  When failures are independent
Improving test coverage
fsck.btrfs coming Real Soon Now
25
Questions
http://btrfs.wiki.kernel.org/
http://ceph.newdream.net/