Published by Anastasia Temple. Modified over 9 years ago.
1
Ceph: A Scalable, High-Performance Distributed File System
Derek Weitzel
2
In the Before… Let's go back through some of the notable distributed file systems used in HPC
3
In the Before… There were distributed filesystems like:
Lustre – RAID over storage boxes
  Recovery time after a node failure was MASSIVE! (Entire server’s contents had to be copied, one to one)
  When functional, reading/writing was EXTREMELY fast
  Used heavily in HPC
4
In the Before… There were distributed filesystems like:
NFS – Network File System
  Does this really count as distributed?
  Single large server
  Full POSIX support, in kernel since…forever
  Slow with even a moderate number of clients
  Dead simple
5
In the Current… There are distributed filesystems like:
Hadoop – Apache Project inspired by Google
  Massive throughput
  Throughput scales with attached HDs
  Have seen VERY LARGE production clusters
    Facebook, Yahoo… Nebraska
  Doesn’t even pretend to be POSIX
6
In the Current… There are distributed filesystems like:
GPFS (IBM) / Panasas – Proprietary file systems
  Requires closed source kernel driver
  Not flexible with newest kernels / OSes
  Good: Good support and large communities
  Can be treated as black box for administrators
  HUGE installations (Panasas at LANL is HUGE!!!!)
7
Motivation Ceph is an emerging technology in the production clustered environment
Designed for:
  Performance – Striped data over data servers.
  Reliability – No single point of failure
  Scalability – Adaptable metadata cluster
8
Timeline 2006 – Ceph Paper written
2007 – Sage Weil earned his PhD, largely for his work on Ceph
2007 – 2010 – Development continued, primarily for DreamHost
March 2010 – Linus merged the Ceph client into the mainline kernel
  No more patches needed for clients
9
Adding Ceph to Mainline Kernel
Huge development!
Significantly lowered the cost to deploy Ceph
For production environments, it was a little too late: the stable kernel used in RHEL 6 (CentOS 6, SL 6, Oracle 6) predates the merge.
10
Let's talk about the paper. Then I’ll show a quick demo.
11
Ceph Overview Decoupled data and metadata
IO directly with object servers
Dynamic distributed metadata management
  Multiple metadata servers handling different directories (subtrees)
Reliable autonomic distributed storage
  OSDs manage themselves by replicating and monitoring
12
Decoupled Data and Metadata
Increases performance by limiting interaction between clients and servers
Decoupling is common in distributed filesystems: Hadoop, Lustre, Panasas…
In contrast to other filesystems, Ceph uses a function to calculate the block locations
13
Dynamic Distributed Metadata Management
Metadata is split among a cluster of servers
Distribution of metadata changes with the number of requests, to even out the load among metadata servers
Metadata servers can also quickly recover from failures by taking over a neighbor’s data
Improves performance by leveling metadata load
14
Reliable Autonomic Distributed Storage
Data storage servers act on events by themselves Initiates replication and Improves performance by offloading decision making to the many data servers Improves reliability by removing central control of the cluster (single point of failure)
15
Ceph Components Some quick definitions before getting into the paper
MDS – Metadata Server
OSD – Object Storage Device
MON – Monitor (Now fully implemented)
16
Ceph Components Ordered: Clients, Metadata, Object Storage 1 2 3
18
Client Overview Can be a FUSE mount
  File system in user space
  Introduced so file systems can use a better interface than the Linux Kernel VFS (Virtual file system)
Can link directly to the Ceph Library
Built into newest OSes.
19
Client Overview – File IO
1. Client asks the MDS for the inode information
20
Client Overview – File IO
2. MDS responds with the inode information
21
Client Overview – File IO
3. Client Calculates data location with CRUSH
22
Client Overview – File IO
4. Client reads directly off storage nodes
23
Client Overview – File IO
Client asks MDS for a small amount of information
  Performance: Small bandwidth between client and MDS
  Performance: Small cache (memory) due to small data
Client calculates file location using function
  Reliability: Saves the MDS from keeping block locations
  Function described in data storage section
24
Ceph Components Ordered: Clients, Metadata, Object Storage 1 2 3
25
Client Overview – Namespace
Optimized for the common case, 'ls -l'
  Directory listing immediately followed by a stat of each file
  Reading directory gives all inodes in the directory
Namespace covered in detail next!
$ ls -l
total 0
drwxr-xr-x 4 dweitzel swanson 63 Aug apache
drwxr-xr-x 5 dweitzel swanson 42 Jan 18 11:15 argus-pep-api-java
drwxr-xr-x 5 dweitzel swanson 42 Jan 18 11:15 argus-pep-common
drwxr-xr-x 7 dweitzel swanson 103 Jan 18 16:37 bestman2
drwxr-xr-x 6 dweitzel swanson 75 Jan 18 12:25 buildsys-macros
26
Metadata Overview Metadata servers (MDS) serve out the file system attributes and directory structure
Metadata is stored in the distributed filesystem beside the data
  Compare this to Hadoop, where metadata is stored only on the head nodes
Updates are staged in a journal, flushed occasionally to the distributed file system
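The journal idea on this slide can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not Ceph's MDS code: the class name `JournaledMetadataStore`, the `flush_threshold` knob, and the plain dict standing in for the distributed file system are all hypothetical.

```python
class JournaledMetadataStore:
    """Toy model: stage metadata updates in a journal, flush them in batches."""

    def __init__(self, flush_threshold=3):
        self.journal = []          # ordered list of pending (path, attrs) updates
        self.backing_store = {}    # stands in for the distributed file system
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def update(self, path, attrs):
        # Appending to the journal is cheap and sequential.
        self.journal.append((path, dict(attrs)))
        if len(self.journal) >= self.flush_threshold:
            self.flush()

    def lookup(self, path):
        # The newest journal entry wins over the (possibly stale) backing store.
        for entry_path, attrs in reversed(self.journal):
            if entry_path == path:
                return attrs
        return self.backing_store.get(path)

    def flush(self):
        # Replay the journal in order so later updates overwrite earlier ones.
        for path, attrs in self.journal:
            self.backing_store[path] = attrs
        self.journal.clear()


store = JournaledMetadataStore(flush_threshold=3)
store.update("/home/a", {"size": 1})
store.update("/home/a", {"size": 2})
pending = len(store.journal)            # two updates staged, nothing flushed yet
store.update("/home/b", {"size": 7})    # third update triggers a flush
```

Repeated updates to the same entry are absorbed by the journal, so only the final state of each path reaches the backing store.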
27
MDS Subtree Partitioning
In HPC applications, it is common to have ‘hot’ metadata that is needed by many clients
In order to be scalable, Ceph needs to distribute metadata requests among many servers
MDS will monitor frequency of queries using special counters
MDS will compare the counters with each other and split the directory tree to evenly split the load
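A rough sketch of counter-driven load splitting is below. It is a simplification for illustration, not Ceph's actual balancer: the function name `partition_subtrees`, the greedy "hottest subtree to the least-loaded server" rule, and the sample request counts are all assumptions.

```python
from collections import Counter

def partition_subtrees(request_counts, num_mds):
    """Greedily assign subtrees to MDS ranks so request load is roughly even.

    request_counts: mapping of subtree path -> observed request count.
    Returns (assignment, load) where assignment maps path -> MDS rank.
    """
    load = [0] * num_mds
    assignment = {}
    # Place the hottest subtrees first; each goes to the least-loaded MDS.
    ordered = sorted(request_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for path, count in ordered:
        target = min(range(num_mds), key=lambda rank: load[rank])
        assignment[path] = target
        load[target] += count
    return assignment, load


# Hypothetical per-directory request counters gathered by the MDS cluster.
counters = Counter()
for path, hits in [("/home", 900), ("/scratch", 500), ("/proj", 400), ("/tmp", 100)]:
    counters[path] += hits

assignment, load = partition_subtrees(counters, num_mds=2)
```

In this example the single hot directory `/home` lands on one server while the two cooler directories share the other, which keeps the two totals close.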
28
MDS Subtree Partitioning
Multiple MDS split the metadata Clients will receive metadata partition data from the MDS during a request
29
MDS Subtree Partitioning
Busy directories (multiple creates or opens) will be hashed across multiple MDS’s
30
MDS Subtree Partitioning
Clients will read from a random replica
Updates go to the primary MDS for the subtree
31
Ceph Components Ordered: Clients, Metadata, Object Storage 1 2 3
32
Data Placement Need a way to evenly distribute data among storage devices (OSD)
  Increased performance from even data distribution
  Increased resiliency: Losing any node minimally affects the status of the cluster if data is evenly distributed
Problem: Don’t want to keep data locations in the metadata servers
  Requires lots of memory if there are lots of data blocks
33
CRUSH CRUSH is a pseudo-random function to find the location of data in a distributed filesystem
Summary: Take a little information, plug into globally known function (hashing?) to find where the data is stored
Input data is:
  inode number – From MDS
  OSD Cluster Map (CRUSH map) – From OSD/Monitors
34
CRUSH CRUSH maps a file to a list of servers that have the data
35
CRUSH File to Object: Takes the inode (from MDS)
36
CRUSH Object to Placement Group (PG): Object ID and number of PGs
37
Placement Group Sets of OSDs that manage a subset of the objects
OSDs will have many Placement Groups
Placement Groups will have R OSDs, where R is the number of replicas
For a given Placement Group, an OSD will be either the Primary or a Replica
  Primary is in charge of accepting modification requests for the Placement Group
Clients will write to the Primary, read from a random member of the Placement Group
38
CRUSH PG to OSD: PG ID and Cluster Map (from OSD)
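The three mapping steps (file to object, object to PG, PG to OSDs) can be sketched as below. This is not the real CRUSH algorithm, which also handles device weights and a failure-domain hierarchy. The last step is replaced here by plain rendezvous (highest-random-weight) hashing, and the object naming scheme, function names, and 6-OSD cluster map are hypothetical.

```python
import hashlib

def stable_hash(*parts):
    """Deterministic integer hash of the given parts (same on every client)."""
    digest = hashlib.sha256(":".join(str(p) for p in parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def object_id(inode, stripe_index):
    # File -> object: name each object from the inode and stripe index.
    return f"{inode:x}.{stripe_index:08x}"

def placement_group(obj_id, num_pgs):
    # Object -> PG: hash the object ID into one of num_pgs groups.
    return stable_hash(obj_id) % num_pgs

def pg_to_osds(pg_id, cluster_map, replicas):
    # PG -> OSDs: rank the "up" OSDs by a per-(PG, OSD) hash, take the top R.
    # The first entry acts as the primary.
    up = [osd for osd, state in cluster_map.items() if state == "up"]
    ranked = sorted(up, key=lambda osd: stable_hash(pg_id, osd), reverse=True)
    return ranked[:replicas]

def locate(inode, stripe_index, num_pgs, cluster_map, replicas=2):
    obj = object_id(inode, stripe_index)
    pg = placement_group(obj, num_pgs)
    return obj, pg, pg_to_osds(pg, cluster_map, replicas)


cluster_map = {f"osd.{i}": "up" for i in range(6)}   # hypothetical 6-OSD cluster
obj, pg, osds = locate(inode=0x1234, stripe_index=0, num_pgs=128,
                       cluster_map=cluster_map)
```

Every client holding the same cluster map computes the same answer, so no server has to store a table of block locations. With rendezvous hashing, marking an unrelated OSD as down leaves this placement unchanged.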
39
CRUSH Now we know where to write the data / read the data
Now how do we safely handle replication and node failures?
40
Replication Replicates to nodes also in the Placement Group
41
Replication Write to the placement group primary (from CRUSH function).
42
Replication Primary OSD replicates to the other OSDs in the Placement Group
43
Replication Commit the update only after the slowest replica has finished its update
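The effect of this commit rule on write latency can be modeled with one line of arithmetic. The sketch below assumes the primary forwards to all replicas in parallel and acknowledges the client once the slowest one is done. The function name and the millisecond figures are made up for illustration.

```python
def replicated_write_latency(primary_latency_ms, replica_latencies_ms):
    """Latency seen by the client for one replicated write.

    The client writes to the primary, the primary forwards to the replicas
    in parallel, and the ack is sent only once the slowest replica is done.
    """
    slowest_replica = max(replica_latencies_ms, default=0.0)
    return primary_latency_ms + slowest_replica


no_replica = replicated_write_latency(1.0, [])            # primary only
two_replicas = replicated_write_latency(1.0, [2.5, 4.0])  # waits for the 4.0 ms replica
```

Adding replicas can only raise the maximum, so under this model the acknowledgement never gets faster as the replication factor grows.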
44
Failure Detection Each autonomic OSD looks after the nodes in its Placement Groups (possibly many!).
Monitors keep a cluster map (used in CRUSH)
Multiple monitors keep an eye on the cluster configuration and dole out cluster maps.
45
Recovery & Updates Recovery is entirely between OSDs
OSDs have two off modes, Down and Out.
Down is when the node could come back; the Primary for a PG is handed off
Out is when a node will not come back; data is re-replicated.
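The difference between the two modes can be sketched as a function that computes which OSDs serve a Placement Group. This is an illustrative model only. The name `acting_set` and the explicit `spare_osds` argument are assumptions, and in Ceph the replacement would come from recomputing placement against the new cluster map.

```python
def acting_set(pg_osds, osd_state, spare_osds):
    """Return (OSDs serving the PG, OSDs that need a fresh copy of the data).

    pg_osds: OSDs originally assigned to the PG, primary first.
    osd_state: mapping OSD -> "up", "down", or "out".
    spare_osds: hypothetical candidates that may replace an "out" OSD.
    """
    # Down or out OSDs stop serving; the first survivor acts as primary.
    survivors = [osd for osd in pg_osds if osd_state.get(osd) == "up"]
    # Only "out" OSDs are replaced, because they are not expected back.
    num_out = sum(1 for osd in pg_osds if osd_state.get(osd) == "out")
    spares = [osd for osd in spare_osds
              if osd_state.get(osd) == "up" and osd not in pg_osds]
    replacements = spares[:num_out]
    return survivors + replacements, replacements


pg = ["osd.0", "osd.1", "osd.2"]
state = {"osd.0": "down", "osd.1": "up", "osd.2": "up", "osd.3": "up"}
acting_down, copy_down = acting_set(pg, state, spare_osds=["osd.3"])

state["osd.0"] = "out"
acting_out, copy_out = acting_set(pg, state, spare_osds=["osd.3"])
```

While `osd.0` is merely down, `osd.1` takes over as primary and no data moves. Once it is marked out, `osd.3` joins the group and receives a new copy.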
46
Recovery & Updates Each object has a version number
When an OSD is brought back up, check the version numbers of its Placement Groups to see if they are current
Check the version numbers of objects to see which need an update
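A minimal sketch of the two-level version check follows, assuming versions are plain integers and each OSD can list per-object versions for a Placement Group. The function names and sample data are hypothetical.

```python
def objects_needing_update(local_versions, primary_versions):
    """List objects whose local copy is missing or older than the primary's."""
    return sorted(
        name
        for name, version in primary_versions.items()
        if local_versions.get(name, -1) < version
    )

def recover(local_pg_version, primary_pg_version, local_versions, primary_versions):
    # Cheap check first: if the PG version matches, nothing changed while down.
    if local_pg_version == primary_pg_version:
        return []
    # Otherwise compare object by object to find what has to be copied.
    return objects_needing_update(local_versions, primary_versions)


local = {"obj-a": 3, "obj-b": 5}
primary = {"obj-a": 3, "obj-b": 7, "obj-c": 1}
stale = recover(local_pg_version=10, primary_pg_version=12,
                local_versions=local, primary_versions=primary)
```

Only the changed object and the one created during the outage are copied. The unchanged object is left alone.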
47
Ceph Components Ordered: Clients, Metadata, Object Storage (Physical) 1 2 4
48
Object Storage The underlying filesystem can make or break a distributed one
Filesystems have different characteristics
  Example: ReiserFS is good at small files
  XFS is good at REALLY big files
Ceph keeps a lot of attributes on the inodes, so it needs a filesystem that can handle attrs.
49
Object Storage Ceph can run on normal file systems, but slow
  XFS, ext3/4, …
Created own filesystem in order to handle the special object requirements of Ceph
  EBOFS – Extent and B-Tree based Object File System.
50
Object Storage Important to note that development of EBOFS has ceased
Though Ceph can run on any normal filesystem (I have it running on ext4)
Highly recommended to run on BTRFS
51
Object Storage - BTRFS
Fast Writes: Copy on write file system for Linux
Great Performance: Supports small files with fast lookup using B-Tree algorithm
Ceph Requirement: Supports unlimited chaining of attributes
Integrated into mainline kernel
Considered next generation file system
  Peer of ZFS from Sun
  Child of ext3/4
52
Performance and Scalability
Let's look at some graphs!
53
Performance & Scalability
Write latency with different replication factors
Remember, Ceph has to write to all replicas before ACKing the write to the client
54
Performance & Scalability
X-Axis is size of the write to Ceph Y-Axis is the Latency when writing X KB
55
Performance & Scalability
Notice, this is still small writes, < 1MB As you can see, the more replicas Ceph has to write, the slower the ACK to the client
56
Performance & Scalability
Obviously, async write is faster Latency for async is from flushing buffers to Ceph
57
Performance and Scalability
2 lines for each file system Writes are bunched at top, reads at bottom
58
Performance and Scalability
X-Axis is the KBs written to or read from Y-Axis is the throughput per OSD (node)
59
Performance and Scalability
The custom ebofs does much better on both writes and reads
60
Performance and Scalability
Writes for ebofs max the throughput of the underlying HD
61
Performance and Scalability
X-Axis is size of the cluster Y-Axis is the per OSD throughput
62
Performance and Scalability
Most configurations hover around HD speed
63
Performance and Scalability
32k PGs will distribute data more evenly over the cluster than the 4k PGs
64
Performance and Scalability
Evenly splitting the data will lead to a balanced load across the OSDs
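The claim that more PGs give a more even spread can be checked with a small simulation. This is a sketch under simplifying assumptions, not a reproduction of the paper's experiment: each PG is placed on a single OSD by a plain hash, every PG is assumed to hold the same amount of data, and the 64-OSD cluster size is made up.

```python
import hashlib
from statistics import mean, pstdev

def pgs_per_osd(num_pgs, num_osds):
    """Count how many PGs a hash-based placement puts on each OSD."""
    counts = [0] * num_osds
    for pg in range(num_pgs):
        digest = hashlib.sha256(f"pg-{pg}".encode()).digest()
        counts[int.from_bytes(digest[:8], "big") % num_osds] += 1
    return counts

def imbalance(counts):
    # Coefficient of variation: spread of per-OSD load relative to the mean.
    return pstdev(counts) / mean(counts)


NUM_OSDS = 64   # hypothetical cluster size
spread_4k = imbalance(pgs_per_osd(4096, NUM_OSDS))
spread_32k = imbalance(pgs_per_osd(32768, NUM_OSDS))
```

With random placement the relative spread shrinks roughly with the square root of the number of PGs per OSD, which is why `spread_32k` comes out smaller than `spread_4k`.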
65
Conclusions Very fast POSIX compliant file system
General enough for many applications
No single point of failure – Important for large data centers
Can handle HPC-like applications (lots of metadata, small files)
66
Demonstration Started 3 Fedora 16 instances on HCC’s private cloud
67
Demonstration Some quick things if the demo doesn’t work
MDS log of a MDS handing off a directory to another for load balancing :15: f964654b700 mds.0.migrator nicely exporting to mds.1 [dir /hadoop-grid/ [2,head] auth{1=1} pv=2574 v=2572 cv=0/0 ap=1+2+3 state= |complete f(v2 m :14: =0+1) n(v86 rc :15: b =213+79) hs=1+8,ss=0+0 dirty=9 | child replicated dirty authpin 0x29a0fe0]
68
Demonstration Election after a Monitor was overloaded
Lost another election (peon ): :23: fcf log [INF] : mon.gamma calling new monitor election :23: fcf log [INF] : mon.gamma calling new monitor election :23: fcf log [INF] : won leader election with quorum 1,2 :15: f50b360e700 e26 e26: 3 osds: 2 up, 3 in
69
GUI Interface
70
Where to Find More Info New company sponsoring development
Instructions on setting up Ceph can be found on the Ceph wiki, or on my blog