BUILDING A CLOUD FILESYSTEM
Jeff Darcy / Mark Wagner, Principal Software Engineers, Red Hat
4 May, 2011
What's It For?
● “Filesystem as a Service”
● Managed by one provider, used by many tenants
[Diagram: the four goals – Familiarity, Scalability, Flexibility, Privacy]
What About Existing Filesystems?
● GlusterFS, PVFS2, Ceph, ...
● Not all the same (distributed vs. cluster)
● Even the best don't cover all the bases
Privacy Part 1: Separate Namespace

    tenantX# ls /mnt/shared_fs/tenantY
    a.txt  b.txt  my_secret_file.txt

● Tenant X's files should be completely invisible to any other tenant
● Ditto for space usage
● Solvable with subvolume mounts and directory permissions, but watch out for symlinks etc. (see the path-check sketch below)
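The "watch out for symlinks" caveat is worth making concrete. Below is a minimal Python sketch (not CloudFS code; the paths and names are hypothetical) of the kind of check a per-tenant namespace needs: resolve the requested path and refuse anything that escapes the tenant's subdirectory.

    # Illustrative only: why symlinks matter for per-tenant subdirectory
    # isolation. Paths and tenant names are hypothetical, not CloudFS code.
    import os

    def resolve_tenant_path(tenant_root, user_path):
        """Resolve a tenant-supplied path and refuse anything that escapes
        the tenant's subdirectory (e.g. via '..' or a symlink)."""
        candidate = os.path.realpath(os.path.join(tenant_root, user_path.lstrip("/")))
        root = os.path.realpath(tenant_root)
        if candidate != root and not candidate.startswith(root + os.sep):
            raise PermissionError(f"{user_path!r} escapes {tenant_root!r}")
        return candidate

    # A symlink such as /bricks/tenantX/evil -> /bricks/tenantY is caught here:
    # resolve_tenant_path("/bricks/tenantX", "evil/secret.txt")  -> PermissionError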
Privacy Part 2: Separate ID Space
● Tenant X's “joe” has the same UID as tenant Y's “fred”
● Two tenants should not end up sharing UIDs...
● ...but the server only has one UID space
● Must map between per-server and per-tenant ID spaces (see the mapping sketch below)

    server# ls /shared/tenantX/joe/foo
    -rw-r--r-- 1 joe joe 9262 Jan 20 12:00 foo
    server# ls /shared/tenantY/fred/bar
    -rw-r--r-- 1 joe joe 6481 Mar 09 13:47 bar
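A minimal sketch of the mapping idea, assuming a simple per-tenant UID offset scheme (the ranges and the scheme itself are illustrative, not the actual CloudFS mapping):

    # Hypothetical per-tenant UID mapping: each tenant gets a disjoint block
    # of server UIDs, so tenant X's uid 500 and tenant Y's uid 500 land on
    # different server-side UIDs.
    TENANT_UID_BASE = {"tenantX": 100000, "tenantY": 200000}
    RANGE = 100000

    def to_server_uid(tenant, tenant_uid):
        if not 0 <= tenant_uid < RANGE:
            raise ValueError("tenant uid out of range")
        return TENANT_UID_BASE[tenant] + tenant_uid

    def to_tenant_uid(tenant, server_uid):
        base = TENANT_UID_BASE[tenant]
        if not base <= server_uid < base + RANGE:
            raise ValueError("server uid not owned by this tenant")
        return server_uid - base

    assert to_server_uid("tenantX", 500) != to_server_uid("tenantY", 500)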
“Register your users with our ID service”?
Tenant reactions:
● “Add another step? I create thousands of users every day!”
● “I already run my own ID service, to sync across the company.”
● “Amazon doesn't require that!”
● It was nice knowing you.
Privacy Part 3: At-Rest Encryption
● Where did it come from? Whose data is on it?
● Moral: encrypt, and store the key separately (see the sketch below)
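As a toy illustration of the moral (not CloudFS's own encryption path), here is a Python sketch using the third-party cryptography package: the key never touches the storage node, so a stray disk only yields ciphertext.

    # Sketch of "encrypt, keep the key elsewhere" using the third-party
    # 'cryptography' package; not part of CloudFS itself.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()            # kept by the tenant / a key service,
                                           # never written to the storage node
    ciphertext = Fernet(key).encrypt(b"tenant data")

    # Only ciphertext lands on the (possibly recycled or resold) disk:
    with open("/tmp/blob.enc", "wb") as f:
        f.write(ciphertext)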
Privacy Part 4: Wire Encryption + Authentication
● Know who you're talking to
● Make sure nobody else can listen (or spoof)
CloudFS
● Builds on established technology
● Adds specific functionality for cloud deployment
[Diagram: the four goals – Familiarity, Scalability, Flexibility, Privacy – covered by GlusterFS, SSL, and CloudFS]
GlusterFS Core Concept: Translators
● So named because they translate upper-level I/O requests into lower-level I/O requests using the same interface
● Stackable in any order
● Can be deployed on either client or server
● Lowest-level “bricks” are just directories on servers
● GlusterFS is an engine that routes filesystem requests through translators to bricks
Translator Patterns
● Caching (read XXXX): Brick 1 (do nothing), Brick 2 (do nothing)
● Splitting (read XXYY): Brick 1 (read XX), Brick 2 (read YY)
● Replicating (write XXYY): Brick 1 (write XXYY), Brick 2 (write XXYY)
● Routing (write XXYY): Brick 1 (do nothing), Brick 2 (write XXYY)
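To make the translator idea concrete, here is a toy Python sketch (GlusterFS translators are C modules; this only mimics the pattern): every layer exposes the same interface and forwards requests to its children, so layers can be stacked in any order on top of bricks.

    # Toy illustration of the translator idea, not GlusterFS's C API:
    # every layer exposes the same write() call and forwards to its children.
    class Brick:
        def __init__(self, name):
            self.name, self.data = name, {}
        def write(self, path, payload):
            self.data[path] = payload
            print(f"{self.name}: write {payload!r} to {path}")

    class Replicate:
        """Replicating translator: send every write to all children."""
        def __init__(self, *children):
            self.children = children
        def write(self, path, payload):
            for child in self.children:
                child.write(path, payload)

    class Route:
        """Routing translator: pick exactly one child, e.g. by hashing the path."""
        def __init__(self, *children):
            self.children = children
        def write(self, path, payload):
            self.children[hash(path) % len(self.children)].write(path, payload)

    # Stack them in any order; the top of the stack looks like a single brick.
    volume = Route(Replicate(Brick("brick1"), Brick("brick2")),
                   Replicate(Brick("brick3"), Brick("brick4")))
    volume.write("/a/file", b"XXYY")

The Route-over-Replicate stack above mirrors the distribute-over-replicate layout shown in the "Typical Translator Structure" slide.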
Translator Types
● Protocols: client, (native) server, NFS server
● Core: distribute (DHT), replicate (AFR), stripe
● Features: locks, access control, quota
● Performance: prefetching, caching, write-behind
● Debugging: trace, latency measurement
● CloudFS: ID mapping, authentication/encryption, future “Dynamo” and async replication
Typical Translator Structure
[Diagram, client side: Mount (FUSE) → Cache → Distribute → two Replicate translators → Client A/B/C/D; server side: Server A /export, Server B /foo, Server C /bar, Server D /x/y/z]
Let's Discuss Performance
Test Hardware
● Testing on Westmere EP server-class machines
● Two-socket, HT on
● 12 boxes total
● 48 GB fast memory
● 15K drives
● 10 Gbit networking – 9K jumbo frames enabled
● 4 servers with fully populated internal SAS drives (7)
● 8 boxes used as clients / VM hosts
Hardware
[Diagram: servers and clients connected by a 10 Gbit network switch]
First Performance Question We Get Asked
● How does it stack up to NFS?
● One of the first tests we ran, before we tuned
● Tests conducted on the same box / storage
GlusterFS vs. NFS (Writes)
GlusterFS vs. NFS (Reads)
[Chart annotation: I/O bound]
Second Question
● How does it scale?
● Tests run on combinations of servers, clients, and VMs
● Representative sample shown here
● Scale up across servers, hosts, and threads
Read Scalability – Bare Metal
Write Scalability – Bare Metal
Tuning Fun
● Now that we have the basics, let's play
● Initial tests on RAID 0
● Let's try JBOD
Tuning Tips – Storage Layout
Virtualized Performance
● All this bare-metal stuff is interesting, but this is CloudFS – let's see some virt data
● Using KVM guests running RHEL 6.1
Virtualized Performance – RHEL 6.1 KVM Guests
Virtualized Performance
● Guest was CPU bound in the previous slides
● Bump the guest from 2 to 4 VCPUs
Tuning Tips – Sizing the Guest
CloudFS Implementation
CloudFS Namespace Isolation
● Clients mount subdirectories on each brick
● Subdirectories are combined into per-tenant volumes

    tenantC# mount server1:brick /mnt/xxx
CloudFS ID Isolation

    tenantC# stat -c '%A %u %n' blah
    -rw-r--r-- 92 blah
    tenantC# stat -c '%A %u %n' /shared/blah
    -rw-r--r-- 92 /shared/blah
    provider# stat -c '%A %u %n' /bricks/C/blah
    -rw-r--r-- <mapped server-side UID> /bricks/C/blah
CloudFS Authentication
● OpenSSL with provider-signed certificates
● Identity used by other CloudFS functions
[Diagram – one time: the tenant sends a client certificate request (ID=x) to the provider, which returns a signed certificate; every time: a client owned by the tenant opens an SSL connection using that certificate, providing authentication and encryption]
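For readers unfamiliar with certificate-based mutual authentication, here is a hedged sketch using Python's ssl module. CloudFS itself does this with OpenSSL inside the GlusterFS transport, not with this code; the host name, port, and file names below are placeholders.

    # Sketch of mutual TLS with a provider-signed client certificate.
    # Host, port, and certificate file names are placeholders.
    import socket, ssl

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.load_verify_locations("provider_ca.pem")                    # trust the provider's CA
    ctx.load_cert_chain("tenant_client.pem", "tenant_client.key")   # provider-signed cert

    with socket.create_connection(("server.example", 24007)) as sock:
        with ctx.wrap_socket(sock, server_hostname="server.example") as tls:
            # The server learns which tenant is connecting from the verified
            # certificate; the channel itself is encrypted.
            tls.sendall(b"hello")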
CloudFS Encryption
● Purely client side, not even key escrow on the server
● Provides privacy and indemnity
● Problem: partial-block writes – when only part of a cipher block is written, the remainder must be fetched from the server, because all input bytes of a block affect all output bytes
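A sketch of why the partial-block write path needs a round trip. The fetch_block/encrypt_block/decrypt_block helpers are hypothetical stand-ins and this is not the actual CloudFS cipher mode; it only shows the read-modify-write forced by "all input bytes affect all output bytes".

    # Sketch of why partial-block writes are awkward with block ciphers.
    BLOCK = 16  # AES block size in bytes

    def write_partial(offset, new_bytes, fetch_block, encrypt_block, decrypt_block):
        """Overwrite part of one cipher block: because every plaintext byte
        affects every ciphertext byte in the block, the untouched remainder
        must first be fetched and decrypted, then re-encrypted together."""
        block_no = offset // BLOCK
        start = offset % BLOCK
        assert start + len(new_bytes) <= BLOCK, "single-block example only"

        old_cipher = fetch_block(block_no)                 # round trip to the server
        plain = bytearray(decrypt_block(block_no, old_cipher))
        plain[start:start + len(new_bytes)] = new_bytes    # splice in the new bytes
        return block_no, encrypt_block(block_no, bytes(plain))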
Gluster to CloudFS
● So far we have been talking about Gluster performance
● Now let's look at the overhead of the CloudFS-specific components
CloudFS Encryption Overhead
CloudFS Multi-Tenancy Overhead
For More Information
● CloudFS blog:
● Mailing lists:
● Code:
● More to come (wikis, bug tracker, etc.)
Backup: CloudFS “Dynamo” Translator (future)
● Greater scalability
● Faster replication
● Faster replica repair
● Faster rebalancing
● Variable # of replicas
[Diagram: “Dynamo” consistent hashing ring with servers S1–S3 and keys A–D]
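A minimal consistent-hashing sketch (illustrative only, not the planned “Dynamo” translator): each server owns many points on a ring, a key is placed on the first server clockwise from its hash, and extra replicas go to the next distinct servers – which is what makes a variable replica count cheap.

    # Minimal consistent-hashing ring; server names mirror the diagram above.
    import bisect, hashlib

    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, servers, vnodes=64):
            # Each server gets many virtual points for an even key spread.
            self.points = sorted((_h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
            self.hashes = [h for h, _ in self.points]

        def replicas(self, key, n=2):
            """First n distinct servers clockwise from the key's hash."""
            n = min(n, len({s for _, s in self.points}))
            idx = bisect.bisect(self.hashes, _h(key))
            out = []
            while len(out) < n:
                server = self.points[idx % len(self.points)][1]
                if server not in out:
                    out.append(server)
                idx += 1
            return out

    ring = Ring(["S1", "S2", "S3"])
    print(ring.replicas("A"), ring.replicas("B"))   # per-key replica placement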
Backup: CloudFS Async Replication (future)
● Multiple masters
● Partition tolerant
● Writes accepted everywhere
● Eventually consistent
● Version vectors etc.
● Preserves client-side encryption security
● Unrelated to Gluster geosync
[Diagram: Site A (S1–S3), Site B (S4–S5), Site C (S6–S7)]
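A small sketch of the version-vector bookkeeping that eventual consistency relies on (illustrative; not the CloudFS implementation): each site counts the updates it has accepted, and two versions conflict when neither vector dominates the other.

    # Version-vector conflict detection for multi-master, eventually
    # consistent replication; site names are illustrative.
    def dominates(a, b):
        """True if version vector a has seen everything b has."""
        return all(a.get(site, 0) >= n for site, n in b.items())

    def merge(a, b):
        """Element-wise max: the vector after the two histories reconcile."""
        return {site: max(a.get(site, 0), b.get(site, 0)) for site in a.keys() | b.keys()}

    site_a = {"A": 3, "B": 1}          # file version as seen at site A
    site_b = {"A": 2, "B": 2}          # concurrent update accepted at site B

    concurrent = not dominates(site_a, site_b) and not dominates(site_b, site_a)
    print("conflict:", concurrent)      # True: neither history contains the other
    print("merged:", merge(site_a, site_b))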