A Big Data Tour – HDFS, Ceph and MapReduce
These slides draw on the following sources: Jonathan Dursi – SciNet Toronto – Hadoop Tutorial; Amir Payberah – Course in Data Intensive Computing – SICS; Yahoo! Developer Network MapReduce Tutorial
EXTRA MATERIAL
CEPH – An HDFS replacement
What is Ceph? Ceph is a distributed, highly available, unified object, block, and file storage system that runs on commodity hardware with no single point of failure (SPOF)
Ceph Architecture – Host Level
At the host level…
– We have Object Storage Devices (OSDs) and Monitors
– Monitors keep track of the components of the Ceph cluster (i.e. where the OSDs are)
– The device, host, rack, row, and room of each OSD are stored by the Monitors and used to compute failure domains
– OSDs store the Ceph data objects
– A host can run multiple OSDs, but it must be appropriately provisioned
Ceph Architecture – Block Level
At the block device level…
– An Object Storage Device (OSD) can be an entire drive, a partition, or a folder
– OSDs must be formatted with ext4, XFS, or btrfs (experimental)
Ceph Architecture – Data Organization Level
At the data organization level…
– Data are partitioned into pools
– Pools contain a number of Placement Groups (PGs)
– Ceph data objects map to PGs via a hash of the object name modulo the number of PGs in the pool (see the sketch below)
– PGs in turn map to multiple OSDs
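As an illustration of that object-to-PG mapping, here is a minimal sketch; the hash function, pool size, and object names are assumptions for this example, not Ceph's actual implementation:

```python
import zlib

def object_to_pg(object_name: str, pg_num: int) -> int:
    """Map an object name to a placement group: hash the name, then take the
    result modulo the number of PGs in the pool (illustrative hash, not Ceph's)."""
    return zlib.crc32(object_name.encode()) % pg_num

# A hypothetical pool with 128 placement groups
for name in ("NYAN", "foo", "bar"):
    print(f"{name} -> PG {object_to_pg(name, 128)}")
```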
Ceph Placement Groups
– Ceph shards a pool into placement groups that are distributed evenly and pseudo-randomly across the cluster
– The CRUSH algorithm dynamically assigns each object to a placement group, and each placement group to a set of OSDs, creating a layer of indirection between the Ceph client and the OSDs storing the copies of an object
– This layer of indirection allows the Ceph storage cluster to re-balance dynamically when new Ceph OSDs come online or when Ceph OSDs fail (see the sketch below)
Red Hat Ceph Architecture v1.2.3
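A toy continuation of the previous sketch, showing why the indirection matters: when an OSD fails, only the small PG-to-OSD map changes, while the object-to-PG calculation stays stable. The OSD numbers and map here are made up, and real Ceph recomputes the PG-to-OSD mapping with CRUSH rather than editing a table by hand:

```python
import zlib

# Toy PG -> OSD map for a 3-way replicated pool (illustrative values).
pg_to_osds = {
    0: [1, 4, 7],
    1: [2, 5, 8],
    2: [3, 6, 9],
}

def object_to_pg(name: str, pg_num: int) -> int:
    # Same idea as the earlier sketch: hash the name, mod the PG count.
    return zlib.crc32(name.encode()) % pg_num

def locate(name: str) -> list[int]:
    """Object -> PG -> OSDs; the client only needs the small PG map."""
    return pg_to_osds[object_to_pg(name, len(pg_to_osds))]

print("NYAN lives on OSDs", locate("NYAN"))

# If OSD 4 fails, only the affected PG entries are rewritten; the
# object -> PG mapping, and hence the client-side calculation, is unchanged.
pg_to_osds[0] = [1, 10, 7]
print("NYAN now lives on OSDs", locate("NYAN"))
```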
Ceph Architecture – Overall View
Source: vities/tf-storage/ws16/slides/low_cost_storage_ceph-openstack_swift.pdf
Ceph Architecture – RADOS
– An application interacts with a RADOS cluster
– RADOS (Reliable Autonomic Distributed Object Store) is a distributed object service that manages the distribution, replication, and migration of objects
– On top of this reliable storage abstraction, Ceph builds a range of services, including a block storage abstraction (RBD, the RADOS Block Device) and a cache-coherent distributed file system (CephFS)
Ceph Architecture – RADOS Components
Ceph Architecture – Where Do Objects Live?
Ceph Architecture – Where Do Objects Live? Contact a Metadata server?
Ceph Architecture – Where Do Objects Live? Or calculate the placement via static mapping?
Ceph Architecture – CRUSH Maps
Ceph Architecture – CRUSH Maps
– Data objects are distributed across Object Storage Devices (OSDs), which refer to either physical or logical storage units, using CRUSH (Controlled Replication Under Scalable Hashing)
– CRUSH is a deterministic hashing function that allows administrators to define flexible placement policies over a hierarchical cluster structure (e.g. disks, hosts, racks, rows, datacenters)
– The location of objects can be calculated from the object identifier and the cluster layout (similar to consistent hashing), so there is no need for a metadata index or server in the RADOS object store (see the sketch below)
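The sketch below conveys the "placement by calculation" idea using rendezvous (highest-random-weight) hashing over a made-up cluster map; it is not the CRUSH algorithm itself, but like CRUSH it is deterministic, needs no per-object metadata lookup, and respects a host-level failure domain:

```python
import hashlib

# A toy cluster map: hosts and the OSDs they contain (illustrative values).
cluster_map = {
    "host-a": [0, 1],
    "host-b": [2, 3],
    "host-c": [4, 5],
    "host-d": [6, 7],
}

def weight(key: str, item: str) -> int:
    """Deterministic pseudo-random score for a (key, item) pair."""
    return int.from_bytes(hashlib.sha1(f"{key}:{item}".encode()).digest()[:8], "big")

def place(object_id: str, replicas: int = 3) -> list[int]:
    """Pick one OSD on each of `replicas` distinct hosts, purely by calculation,
    so every client holding the cluster map computes the same answer."""
    hosts = sorted(cluster_map, key=lambda h: weight(object_id, h), reverse=True)[:replicas]
    return [max(cluster_map[h], key=lambda osd: weight(object_id, str(osd))) for h in hosts]

print("object 'NYAN' ->", place("NYAN"))   # same result on every client, no lookup needed
```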
Ceph Architecture – CRUSH – 1/2
Ceph Architecture – CRUSH – 2/2
Ceph Architecture – librados
Source: nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
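Applications reach RADOS directly through librados; the sketch below uses the python-rados binding (the conffile path, pool name, object name, and xattr are assumptions for illustration):

```python
import rados

# Connect to the cluster using a ceph.conf (path assumed for this sketch).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

try:
    # Open an I/O context on an existing pool (pool name assumed).
    ioctx = cluster.open_ioctx("data")
    try:
        # Write, read, and tag a RADOS object directly.
        ioctx.write_full("hello_object", b"Hello, RADOS!")
        print(ioctx.read("hello_object"))
        ioctx.set_xattr("hello_object", "owner", b"demo")
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```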
Ceph Architecture – RADOS Gateway
Source: nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – RADOS Block Device (RBD) – 1/3
Ceph Architecture – RADOS Block Device (RBD) – 2/3
– Virtual machine storage using RBD
– Live migration using RBD
Ceph Architecture – RADOS Block Device (RBD) – 3/3 Direct host access from Linux
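Besides mapping an image through the Linux kernel RBD driver, a host application can access images in user space via librbd; a minimal sketch with the python-rbd binding follows (pool name, image name, and size are assumptions):

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")                     # pool name assumed

try:
    rbd_inst = rbd.RBD()
    rbd_inst.create(ioctx, "demo-image", 1024 ** 3)   # 1 GiB image (name and size assumed)

    # Open the image and do block-level I/O at arbitrary offsets.
    with rbd.Image(ioctx, "demo-image") as image:
        image.write(b"hello block device", 0)
        print(image.read(0, 18))
finally:
    ioctx.close()
    cluster.shutdown()
```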
Ceph Architecture – CephFS – POSIX F/S
Ceph – Read/Write Flows
Source: m/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction
Ceph Replicated I/O
Red Hat Ceph Architecture v1.2.3
Ceph – Erasure Coding – 1/5
– Erasure coding is a technique dating back to the 1960s; the most famous algorithm is Reed-Solomon
– Many variations have since appeared, such as Fountain Codes, Pyramid Codes, and Locally Repairable Codes
– An erasure code is usually defined by the total number of disks (N) and the number of data disks (K); it can tolerate N – K failures with a storage overhead of N/K (see the sketch below)
– E.g. a typical Reed-Solomon scheme is RS(8, 5), where 8 is the total number of disks and 5 the number of data disks; RS(8, 5) can tolerate 3 arbitrary failures
– If some data chunks are missing, the remaining available chunks can be used to restore the original content
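The arithmetic behind these figures, for RS(8, 5) and for the (K = 3, M = 2) example used on the following slides (a parameter check only, not an encoder):

```python
def erasure_params(n_total: int, k_data: int) -> dict:
    """Tolerated failures and storage overhead for an (N, K) erasure code."""
    return {
        "coding_disks": n_total - k_data,       # M = N - K
        "tolerated_failures": n_total - k_data,
        "storage_overhead": n_total / k_data,   # raw bytes stored per user byte
    }

print(erasure_params(8, 5))   # 3 coding disks, 3 tolerated failures, 1.6x overhead
print(erasure_params(5, 3))   # the K = 3, M = 2 example on the next slides
```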
Ceph – Erasure Coding – 2/5
– As in replicated pools, in an erasure-coded pool the primary OSD in the up set receives all write operations
– In replicated pools, Ceph makes a deep copy of each object in the placement group on the secondary OSD(s) in the set
– For erasure coding, the process is slightly different: an erasure-coded pool stores each object as K+M chunks – K data chunks and M coding chunks
– The pool is configured with a size of K+M, so that each chunk is stored on an OSD in the acting set; the rank of the chunk is stored as an attribute of the object
– The primary OSD is responsible for encoding the payload into K+M chunks and sending them to the other OSDs; it also maintains an authoritative version of the placement group logs
Ceph – Erasure Coding – 3/5
– 5 OSDs (K+M = 5); can sustain the loss of 2 (M = 2)
– The object NYAN with data “ABCDEFGHI” is split into 3 data chunks (K = 3); the data is padded if its length is not a multiple of K
– The two coding chunks are YXY and QGC (see the sketch below)
Red Hat Ceph Architecture v1.2.3
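A toy version of the encode step: split the payload into K data chunks, padding if needed, and append a coding chunk. For brevity a single XOR parity chunk stands in for the M Reed-Solomon coding chunks that Ceph actually computes, so this sketch can repair only one erasure:

```python
from functools import reduce

def xor_chunks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(payload: bytes, k: int) -> list[bytes]:
    """Split payload into k equal data chunks (zero-padded), then append one
    XOR parity chunk as a stand-in for Ceph's Reed-Solomon coding chunks."""
    chunk_len = -(-len(payload) // k)                   # ceiling division
    payload = payload.ljust(k * chunk_len, b"\0")       # pad if not a multiple of k
    data = [payload[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_chunks, data)
    return data + [parity]                              # k data chunks + 1 coding chunk

chunks = encode(b"ABCDEFGHI", k=3)
print(chunks[:3], chunks[3])   # [b'ABC', b'DEF', b'GHI'] plus the parity chunk
```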
Ceph – Erasure Coding – 4/5
– When reading the object NYAN from the erasure-coded pool, the decoding function reads three chunks, e.g. chunk 1 (ABC), chunk 3 (GHI) and chunk 4 (YXY)
– Even if two chunks are missing (i.e. there are erasures), the decoding function can reconstruct the original content from the remaining chunks (see the sketch below)
Red Hat Ceph Architecture v1.2.3
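And the matching decode step for the toy XOR scheme above (again, a single parity chunk can repair only one erasure, whereas the real RS pool with K = 3, M = 2 tolerates two):

```python
from functools import reduce

def xor_chunks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def decode(chunks: list, original_len: int) -> bytes:
    """Rebuild the payload when at most one chunk is None (erased):
    the missing chunk is the XOR of all the remaining ones."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if missing:
        chunks[missing[0]] = reduce(xor_chunks, (c for c in chunks if c is not None))
    data_chunks = chunks[:-1]                       # drop the parity chunk
    return b"".join(data_chunks)[:original_len]     # strip any padding

# Chunks from the encode sketch, with chunk 2 (b'DEF') lost;
# b'BOL' is the XOR parity of the three data chunks.
stored = [b"ABC", None, b"GHI", b"BOL"]
print(decode(stored, original_len=9))               # b'ABCDEFGHI'
```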