Section 7 Erasure Coding Overview
Objectives
- What is Erasure Coding?
- Erasure Coding in Ceph
- Configure Erasure Coding
What is Erasure Coding?
What is an Erasure Code?
In information theory, an erasure code is a forward error correction (FEC) code for the binary erasure channel, which transforms a message of k symbols into a longer message (code word) with n symbols such that the original message can be recovered from a subset of the n symbols. The fraction r = k/n is called the code rate. The fraction k'/k, where k' denotes the number of symbols required for recovery, is called the reception efficiency.
Thanks, Wikipedia. That really helps!
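To make the definition a little more concrete, here is a minimal Python sketch of the two ratios. The function names and the sample values (k=3, n=5, k'=3) are illustrative assumptions, not part of the quoted definition.

```python
from fractions import Fraction


def code_rate(k: int, n: int) -> Fraction:
    """Code rate r = k/n: the share of the code word that is original message."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return Fraction(k, n)


def reception_efficiency(k_prime: int, k: int) -> Fraction:
    """Reception efficiency k'/k, where k' symbols are required for recovery."""
    if k <= 0 or k_prime < k:
        raise ValueError("need k > 0 and k' >= k")
    return Fraction(k_prime, k)


# Assumed sample values: a 3-symbol message encoded into a 5-symbol code word.
rate = code_rate(3, 5)
# Assumption: an optimal code, where any k of the n symbols suffice (k' = k).
efficiency = reception_efficiency(3, 3)
print(f"code rate = {rate}, reception efficiency = {efficiency}")
```

In plain terms: three fifths of what is stored is the original message, and the rest is there so the message survives lost symbols.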
Why do we have Erasure Coding?
- The default strategy in SES is simple replication
  - Defined by the size parameter of a pool
  - Each object is replicated a number of times to provide resilience
- This approach is simple and effective but comes at a price
  - For a replication size of n, the raw storage requirement is n times the amount of data being stored
  - Data replication overhead is high, especially as replication size increases
- Erasure coding provides an alternative
  - Trades some resilience and performance for lower raw disk requirements
Replication vs Erasure Coding
- Replication (default)
  - Use for active data
  - Simple and fast
  - Uses more disk space
- Erasure coding
  - Use for archive data
  - Calculates recovery data
  - Definable redundancy level
  - Needs a cache layer for use with RBD
    - Data is accessed via a replicated pool and then migrated to the erasure coded pool
- Both can be used at the same time
  - But in different pools
Erasure Coding in Ceph
A quick video overview...
Normal Ceph Read/Write
Erasure Coded Write
- Encoding takes place on the primary OSD host node
- The example uses k=3, m=2, so 5 OSDs are required
- With k=3, data is split into 3 shards, each written to a different OSD (via CRUSH calculation)
- With m=2, 2 recovery shards are calculated and written to different OSDs
Erasure Coding (just the basics)
- Calculates parity blocks to recover data
- Configurable K+M parameters (example at 10+6)
  - All data is now stored on 16 disks
  - Requires 10 disks to recover
  - Data is safe with up to 6 failures
  - Only requires 60% extra raw capacity
- Performance
  - All disks need to acknowledge writes
  - Slower recovery (think RAID 6)
  - More chance of failure during rebuild
- Do not use a K+M of 10+1: you need sufficient failure cover
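The 10+6 figures above follow directly from k and m. The short Python sketch below reproduces them; the function name `ec_summary` and the dictionary keys are hypothetical, chosen only for this illustration.

```python
def ec_summary(k: int, m: int) -> dict:
    """Summarise a K+M erasure-coded layout (illustrative helper, not a Ceph API)."""
    if k < 1 or m < 1:
        raise ValueError("k and m must both be at least 1")
    return {
        "disks_used": k + m,                    # every shard goes to its own disk
        "disks_needed_to_recover": k,           # k shards rebuild the data
        "failures_tolerated": m,                # up to m shards may be lost
        "extra_raw_capacity_pct": 100 * m / k,  # recovery data relative to user data
    }


good = ec_summary(10, 6)
risky = ec_summary(10, 1)
print(good)
print(risky)
```

With 10+6 the result is 16 disks, 10 needed for recovery, 6 tolerated failures and 60% extra raw capacity. With 10+1 the extra capacity drops to 10%, but a single further failure during a rebuild loses data, which is why the slide warns against it.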
Erasure Coding overview
- Makes storage much cheaper
  - With no reduction in reliability (if properly configured)
- Great for archival storage
  - Power consumption advantages
- Trades running disks for CPU load during writes and recovery
  - Makes reads, writes and recovery slower
- You will probably want to add a cache tier
  - To maximize performance
- Can be accessed via RADOS
  - RADOS is the Ceph native API
- Requires a cache tier for RBD access
  - But you probably want one anyway
Erasure Coding Plugins
- The EC algorithm and implementation are pluggable
  - jerasure/gf-complete (free, open, and very fast) (www.jerasure.org)
  - ISA-L (Intel library; optimized for modern Intel processors)
  - LRC (local recovery code; layers over existing plugins)
  - SHEC (trades extra storage for recovery efficiency; new from Fujitsu)
- Parameterized
  - Pick "k" and "m", stripe size
- The OSD handles data path, placement, rollback, etc.
- The erasure plugin handles
  - Encode and decode maths
  - Given these available shards, which ones should I fetch to satisfy a read?
  - Given these available shards and these missing shards, which ones should I fetch to recover?
Erasure Coding Parameters
Two key parameters in Erasure Coding configuration:
- K: Erasure coding works by splitting data into shards, which are then written to separate OSDs. K determines the number of shards into which data is split. The default is k=2.
- M: Erasure coding calculates additional data, which is used to reconstruct missing shards (for example, after an OSD failure). M determines how many additional shards are calculated, and this is also the number of OSD failures that an erasure coded pool can withstand. The default is m=1.
The default values stripe data across two OSDs and store calculated data on a third. The loss of any one OSD is tolerable, similar to a replication size of 2.
For 1GB of data, a replication size of 2 needs 2GB of raw storage; erasure coded data with k=2, m=1 requires only 1.5GB.
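The default k=2, m=1 layout can be illustrated with a toy example. The Python sketch below uses plain XOR parity as a stand-in for the recovery calculation. It is an assumption made for illustration and is not Ceph's jerasure implementation; the function names `encode` and `recover` are hypothetical.

```python
from typing import List, Optional


def encode(data: bytes) -> List[bytes]:
    """Split data into k=2 equal shards and add m=1 XOR parity shard."""
    if len(data) % 2:
        # Pad so both data shards have equal length. A real system would
        # also record the original size; this toy example does not.
        data += b"\x00"
    half = len(data) // 2
    d0, d1 = data[:half], data[half:]
    parity = bytes(a ^ b for a, b in zip(d0, d1))
    return [d0, d1, parity]


def recover(shards: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing shard (marked None) by XOR-ing the other two."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("k=2, m=1 tolerates only one lost shard")
    result = list(shards)
    if missing:
        a, b = (s for s in shards if s is not None)
        result[missing[0]] = bytes(x ^ y for x, y in zip(a, b))
    return result


original = b"erasure!"                   # 8 bytes, splits evenly into 2 shards
shards = encode(original)                # each shard would go to a different OSD
damaged = [shards[0], None, shards[2]]   # simulate losing the OSD holding shard 1
rebuilt = recover(damaged)
restored = rebuilt[0] + rebuilt[1]
raw_bytes = sum(len(s) for s in shards)  # total bytes stored across all shards
print(restored == original, raw_bytes / len(original))
```

The 8 bytes of data occupy 12 bytes across three shards, the same 1.5x ratio as the 1GB to 1.5GB figure above, and any single lost shard can be rebuilt from the remaining two.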
Configure Erasure Coding
Erasure Code Profiles
- Profiles define the parameters for erasure coding
- A profile contains
  - k and m values
  - CRUSH ruleset
  - Plugin (default: jerasure)
  - Technique (default: reed_sol_van)
- When a pool is created as an EC pool, the profile determines the setup for erasure coding
  - This cannot be changed later
Command: ceph osd erasure-code-profile
Syntax: ceph osd erasure-code-profile OPTIONS
Option   Description
get      View details of an existing EC profile
set      Set a profile; requires k and m values, with optional values such as ruleset and plugin
ls       List profiles
Default EC Profile
- The default profile provides a basic erasure coding configuration which will function in almost any cluster
- Uses the minimum practical values k=2, m=1
- Is written over 3 OSDs
  - Two data shards
  - One recovery shard
Setting a custom EC Profile
Use the ceph osd erasure-code-profile set command with the following options:
- Profile name
- K: number of data shards the data is split into
- M: number of recovery shards, and so the number of failed units tolerated
- ruleset-failure-domain: CRUSH bucket level used as the failure domain
  - OSD, host, rack, etc., as defined in the CRUSH map
Example with k=8, m=2 and failure domain at host level:
    ceph osd erasure-code-profile set example-profile k=8 m=2 ruleset-failure-domain=host
Command: ceph osd pool create
Syntax: ceph osd pool create <name> <pg> erasure <profile>
- The same command is used to create standard replication pools, but with the addition of the erasure option and a profile name (if no profile is named, the default profile is used)
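For scripted setups, the two commands from this section can be assembled in Python before being run on an admin node. This is only a sketch: the helper names, the pool name `ecpool` and the placement group count of 128 are hypothetical, and nothing here contacts a cluster. The option name `ruleset-failure-domain` is taken from this section; later Ceph releases may use a different option name, so check the documentation for your version.

```python
from typing import List


def profile_set_cmd(name: str, k: int, m: int, failure_domain: str) -> List[str]:
    """Build the argument list for 'ceph osd erasure-code-profile set'."""
    return [
        "ceph", "osd", "erasure-code-profile", "set", name,
        f"k={k}", f"m={m}",
        # Option name as used in this section; verify it for your release.
        f"ruleset-failure-domain={failure_domain}",
    ]


def pool_create_cmd(pool: str, pg: int, profile: str) -> List[str]:
    """Build the argument list for 'ceph osd pool create <name> <pg> erasure <profile>'."""
    return ["ceph", "osd", "pool", "create", pool, str(pg), "erasure", profile]


# The profile values match the example in this section.
profile_cmd = profile_set_cmd("example-profile", k=8, m=2, failure_domain="host")
# Hypothetical pool name and placement group count, for illustration only.
pool_cmd = pool_create_cmd("ecpool", 128, "example-profile")

print(" ".join(profile_cmd))
print(" ".join(pool_cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` on a node with the ceph CLI configured. Because a pool's profile cannot be changed later, printing and reviewing the commands first is a cheap safeguard.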
Section 7 Exercises