1
an intro to ceph and big data
patrick mcgarry – inktank
Big Data Workshop – 27 JUN 2013
2
what is ceph? distributed storage system
reliable system built with unreliable components
fault tolerant, no SPoF
commodity hardware: expensive arrays, controllers, and specialized networks not required
large scale (10s to 10,000s of nodes)
heterogeneous hardware (no fork-lift upgrades)
incremental expansion (or contraction)
dynamic cluster
3
what is ceph? unified storage platform
scalable object + compute storage platform
RESTful object storage (e.g., S3, Swift)
block storage
distributed file system
open source (LGPL) server-side
client support in mainline Linux kernel
4
RADOS – the Ceph object store
RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
RBD – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
RADOS is a distributed object store, and it's the foundation for Ceph. On top of RADOS, the Ceph team has built three applications that allow you to store data and do fantastic things. But before we get into all of that, let's start at the beginning of the story.
5
[diagram: five disks, each with a filesystem (btrfs, xfs, ext4, zfs?) and a Ceph OSD on top; M = monitor]
Let's start with RADOS, Reliable Autonomic Distributed Object Storage. In this example, you've got five disks in a computer. You have initialized each disk with a filesystem (btrfs is the right filesystem to use someday, but until it's stable we recommend XFS). On each filesystem, you deploy a Ceph OSD (Object Storage Daemon). That computer, with its five disks and five object storage daemons, becomes a single node in a RADOS cluster. Alongside these nodes are monitor nodes, which keep track of the current state of the cluster and provide users with an entry point into the cluster (although they do not serve any data themselves).
6
hash(object name) % num pg
hash(object name) % num pg → CRUSH(pg, cluster state, policy)
With CRUSH, the data is first split into a certain number of sections. These are called "placement groups". The number of placement groups is configurable. Then the CRUSH algorithm runs, having received the latest cluster map and a set of placement rules, and it determines where the placement group belongs in the cluster. This is a pseudo-random calculation, but it's also repeatable; given the same cluster state and rule set, it will always return the same results.
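To make that two-step mapping concrete, here is a minimal Python sketch. It is not the real CRUSH algorithm; the placement-group count, OSD names, and replica count are illustrative assumptions. It only shows the key property: hashing an object name onto a placement group and then mapping that group onto OSDs is deterministic and repeatable, so any client with the same cluster map computes the same answer.

import hashlib

NUM_PGS = 64                                  # number of placement groups (configurable)
OSDS = ["osd.%d" % i for i in range(10)]      # stand-in for the cluster map
REPLICAS = 3                                  # placement policy: keep 3 copies

def object_to_pg(name):
    # step 1: hash(object name) % num pg
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % NUM_PGS

def pg_to_osds(pg, osds, replicas=REPLICAS):
    # step 2: stand-in for CRUSH(pg, cluster state, policy);
    # pseudo-random but repeatable -- unlike real CRUSH it ignores failure domains
    h = int(hashlib.md5(str(pg).encode()).hexdigest(), 16)
    start = h % len(osds)
    return [osds[(start + i) % len(osds)] for i in range(replicas)]

pg = object_to_pg("my-object")
print(pg, pg_to_osds(pg, OSDS))               # same answer every run, on any client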
7
Each placement group is run through CRUSH and stored in the cluster. Notice how no node has received more than one copy of any given placement group, and no two nodes contain the same set of placement groups? That's important.
8
When it comes time to store an object in the cluster (or retrieve one), the client calculates where it belongs.
9
What happens, though, when a node goes down?
What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth nodes on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
10
The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
11
Because placement is calculated rather than centrally controlled, node failures are transparent to clients.
12
So what about big data? CephFS, s/HDFS/CephFS/g, object storage, key-value store
Most people will default to discussions about CephFS when confronted with either Big Data or HPC applications. This can mean using CephFS by itself, or perhaps as a drop-in replacement for HDFS. [NOT READY ARGUMENT] There are a couple of other options, however. You can use librados to talk directly to the object store; one user I know actually plugged Hadoop in at this level instead of using CephFS. Ceph also has a pretty decent key-value store proof-of-concept, done by an intern last year. It's based on a B-tree structure but uses a fixed height of two levels instead of a true tree, drawing from both a normal B-tree and Google BigTable. Would love to see someone do more with it.
13
librados – direct access to RADOS from applications
C, C++, Python, PHP, Java, Erlang
direct access to storage nodes, no HTTP overhead
I mentioned librados; this is the low-level library that allows you to directly access a RADOS cluster from your application. It has native language bindings for C, C++, Python, and so on. This is obviously the fastest way to get at your data and comes with no inherent overhead or translation layer.
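As a rough illustration, here is a minimal sketch using the python-rados bindings; the ceph.conf path and the pool name 'data' are assumptions about the cluster, not something from the talk.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # assumed config path
cluster.connect()
ioctx = cluster.open_ioctx('data')                      # assumed pool name

ioctx.write_full('hello-object', b'hello ceph')         # store an object in RADOS
print(ioctx.read('hello-object'))                       # read it straight back

ioctx.close()
cluster.shutdown()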
14
rich librados API
efficient key/value storage inside an object
atomic single-object transactions: update data, attrs, and keys together; atomic compare-and-swap
object-granularity snapshot infrastructure
inter-client communication via objects
embed code in the ceph-osd daemon via a plugin API: arbitrary atomic object mutations and processing
For most object systems an object is just a bunch of bytes, maybe with some extended attributes. With Ceph you can store a lot more than that. You can store key/value pairs inside an object: think BerkeleyDB or SQL, where each object is a logical container. It supports atomic transactions, so you can do things like atomic compare-and-swap, or update the bytes and the keys/values in one atomic step, and it will be consistently distributed and replicated across the cluster in a safe way. There is object-granularity snapshot infrastructure, and inter-client communication for locking and whatnot. The really exciting part is the ability to implement your own functionality on the OSD...
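Here is a hedged sketch of part of that richer API as exposed by python-rados (the Python bindings cover a subset of what the C/C++ API can do): key/value pairs stored inside an object via a write operation, then read back. The pool and object names are made up for the example.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')                      # assumed pool name

# store key/value pairs inside an object (omap), applied as a single operation
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('owner', 'state'), ('patrick', 'active'))
    ioctx.operate_write_op(op, 'account-object')

# read the keys back from the same object
with rados.ReadOpCtx() as op:
    vals, ret = ioctx.get_omap_vals(op, "", "", -1)     # all keys
    ioctx.operate_read_op(op, 'account-object')
    print(dict(vals))

ioctx.close()
cluster.shutdown()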
15
Data and compute: RADOS embedded object classes
Moves compute directly adjacent to data
C++ by default; Lua bindings available
These embedded object classes allow you to send an object method call to the cluster, and it will actually perform that computation without having to pull the data over the network. The downside to using these object classes is the injection of new functionality into the system: a compiled C++ plugin has to be delivered and dynamically loaded into each OSD process. This becomes more complicated if a cluster is composed of multiple architecture targets, and it makes it difficult to update functionality on the fly. One approach to addressing these problems is to embed a language runtime within the OSD. Noah Watkins, one of our engineers, tackled this with some Lua bindings, which are available.
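For flavour, here is a hedged sketch of what invoking an object class from a client can look like. It assumes a python-rados build that exposes Ioctx.execute() and OSDs built with the example 'hello' object class (cls_hello) from the Ceph source tree; neither is guaranteed on a given cluster, and the pool and object names are made up.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')                      # assumed pool name

# ask the OSD holding 'greeting-obj' to run the object class method server-side,
# instead of pulling the object over the network and computing locally
result = ioctx.execute('greeting-obj', 'hello', 'say_hello', b'')
print(result)

ioctx.close()
cluster.shutdown()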
16
die, POSIX, die
successful exascale architectures will replace or transcend POSIX
the hierarchical model does not distribute
the line between compute and storage will blur: some processing is data-local, some is not
fault tolerance will be a first-class property of the architecture, for both computation and storage
One of the more contentious assertions that Sage likes to make is that as we move towards exascale computing and beyond, we'll need to transcend or replace POSIX. The hierarchical model just doesn't scale well beyond a certain level. Future models are going to have to start blurring the line between compute and storage, recognizing when data is local to an operation versus when it needs to be gathered from multiple sources. And finally, fault tolerance needs to become a first-class property of these architectures; as we push the scale of our existing architectures, building things like burst buffers to deal with huge checkpoints across millions of cores just doesn't make a whole lot of sense.
17
POSIX – I'm not dead yet!
CephFS builds a POSIX namespace on top of RADOS
metadata managed by ceph-mds daemons, stored in objects
strong consistency, stateful client protocol
heavy prefetching, embedded inodes
architected for HPC workloads
distribute the namespace across a cluster of MDSs
mitigate bursty workloads
adapt the distribution as workloads shift over time
Having said all that, there are too many things (both people and code) built with a POSIX mentality to ditch it any time soon. CephFS is designed to provide that POSIX layer on top of RADOS. [read slide] Now, as we've said, there is certainly some work to be done on CephFS, but I want to share a bit about how it works, since it (and similar thinking) will play a big part in Ceph's HPC and Big Data applications going forward.
18
[diagram: client talks to the MDS for metadata, and to the OSDs for data]
CephFS adds a metadata server (or MDS) to the list of node types in your Ceph cluster. Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Clients accessing CephFS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
19
There are multiple MDSs!
20
one tree, three metadata servers
So how do you have one tree and multiple servers?
21
If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
22
When the second one comes along, it will intelligently partition the work by taking a subtree.
23
When the third MDS arrives, it will attempt to split the tree again.
24
Same with the fourth.
25
DYNAMIC SUBTREE PARTITIONING
An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically, based on load and the structure of the data, and it's called "dynamic subtree partitioning". This is done as a periodic load-balance exchange: the transfer just ships the cache contents between MDSs and lets the clients continue transparently.
26
recursive accounting
ceph-mds tracks recursive directory stats
file sizes
file and directory counts
modification time
efficient
$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root  …T … .
drwxr-xr-x 1 root       root  …T … ..
drwxr-xr-x 1 pomceph    pg    …T … pomceph
drwxr-xr-x 1 mcg_test1  pg    …G … mcg_test1
drwx--x--- 1 luko       adm   …G … luko
drwx--x--- 1 eest       adm   …G … eest
drwxr-xr-x 1 mcg_test2  pg    …G … mcg_test2
drwx--x--- 1 fuzyceph   adm   …G … fuzyceph
drwxr-xr-x 1 dallasceph pg    …M … dallasceph
CephFS has some neat features that you don't find in most file systems. Because the team built the filesystem namespace from the ground up, they were able to build these features into the infrastructure. One of them is recursive accounting: the MDSs keep track of directory stats for every directory in the file system. For instance, when you do an 'ls -al', the size shown for a directory is actually the total number of bytes stored under that directory, recursively. It's the same thing you can get from a 'du', but in real time.
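Those recursive statistics are also exposed to clients as virtual extended attributes on CephFS directories (ceph.dir.rbytes, ceph.dir.rfiles, ceph.dir.rsubdirs). A small Python sketch, assuming a CephFS mount at /mnt/cephfs and the pomceph directory from the listing above:

import os

# virtual xattrs maintained by the MDS; values come back as byte strings
path = '/mnt/cephfs/pomceph'   # assumed mount point and directory
for attr in ('ceph.dir.rbytes', 'ceph.dir.rfiles', 'ceph.dir.rsubdirs'):
    print(attr, os.getxattr(path, attr).decode())   # needs Linux and Python 3.3+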
27
snapshots
snapshot arbitrary subdirectories
simple interface
hidden '.snap' directory
no special tools
$ mkdir foo/.snap/one          # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_                          # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one          # remove snapshot
[Provides snapshots] The motivation here is that once you start talking about petabytes and exabytes, it doesn't make much sense to try to snapshot the entire tree. You need to be able to snapshot different directories and different data sets. You can add and remove snapshots for any directory with standard bash-type commands.
28
how can you help? try ceph and tell us what you think
ask if you need help
ask your organization to start dedicating resources to the project
find a bug and fix it
participate in our ceph developer summit
Also, the next Ceph Developer Summit is coming soon, to plan for the Emperor release. Would love to see some blueprints submitted for CephFS work.
29
questions?
30
thanks
patrick mcgarry
patrick@inktank.com
http://github.com/ceph
@scuttlemonkey