GlusterFS as a Development Platform

(Writing GlusterFS Translators) Kaleb KEITHLEY Red Hat 23 February, 2013

Extending GlusterFS with Translators
  What is a translator?
  An example: the HekaFS uidmap translator
  Translator basics
  Translator file-ops (fops) methods
  Inside fop methods: the nuts and bolts
  Building in-tree or out-of-tree
  GluPy
  Resources

Notes: This is an outline of what I'm going to talk about. Translators, per se, aren't hard to write. Yes, the devil (or God) is in the details. Hard things will always be hard, and doing them in the I/O path will not make them any easier.

What is a GlusterFS translator
  Pluggable software component
  Provision storage == create a directed graph of translators
    Linear, e.g. trivial one-brick volume, NFS volume
    Trees, e.g. distribution, replication, stripe
  Translators are on both the server and on the client
  Translators may be moved from client to server and vice versa
  Every translator implements the same API/ABI

Notes: Pluggable: I will say more about that later. When you use the CLI, the graph is created. It's a file, and you can edit the file to insert your own translator. The client-server split is not set in stone, but use caution: it doesn't always make sense to move translators from client to server or server to client. As we'll see, the API is simple: just two methods. The important part is the table of function pointers for the fops. A given fop method has the same API in every translator, i.e. the same function signature.

Simple Translator Stack — Linear
[Diagram: a single-brick volume. Translators shown on the client: debug/io-stats, performance/stat-prefetch, performance/quick-read, performance/io-cache, performance/read-ahead, performance/write-behind, protocol/client. Translators shown on the server: protocol/server, debug/io-stats, features/marker, performance/io-threads, features/locks, features/access-control, storage/posix.]

Notes: Here's a simple single-brick volume. You can see which translators run on the client and which ones run on the server. In the lower left is the brick, the real disk drive where the data is stored. In the upper left is the “virtual” disk on the client. A write to the virtual disk on the client traverses all the translators down to the protocol/client translator, where both the data and metadata are marshaled and sent to the server, and then down through the server's stack to the disk. The results then come back along the opposite path.

Complex Translator Stack — Distribute (DHT)
[Diagram: a two-brick distribute volume. Translators shown on the client: debug/io-stats, performance/stat-prefetch, ..., performance/write-behind, cluster/distribute, and two protocol/client translators. Translators shown on each server: protocol/server, debug/io-stats, ..., features/access-control, storage/posix.]

Notes: Here's a similar setup, but now we have added another brick and are using DHT. See how the I/O is split by the cluster/distribute xlator and sent to both servers.

Complex Translator Stack — DHT + NFS
[Diagram: the two-brick distribute volume with NFS clients attached. Translators shown in the NFS server stack: nfs/server, debug/io-stats, performance/stat-prefetch, ..., performance/write-behind, cluster/distribute, and two protocol/client translators. Translators shown on each server: protocol/server, debug/io-stats, ..., features/access-control, storage/posix.]

Notes: A similar setup, but now we've added NFS by adding an NFS xlator. Color here is important. On the previous slides I used different colors to indicate which pieces ran on different machines. That's still true here, but notice that the NFS server “client” stack runs on one of the servers. Thus you can see how GlusterFS uses NFS with DHT (or AFR).

An example: HekaFS uidmap translator
  Consider a service provider with several customers
  Each customer has thousands of users
  Collisions in the uid and gid space
  Uidmap xlator maps uids and gids to discrete sets of uids and gids per tenant

[Diagram: tenants red, blue, and green pass through the uidmap xlator to /export/bricks/vol, with per-tenant directories /export/bricks/vol/green, /export/bricks/vol/red, and /export/bricks/vol/blue.]

Notes: I want to talk a little bit about what we did with HekaFS. Define tenant: a set of related users. We wanted truly discrete storage for every user from every tenant. Not only do their files live in an entirely separate part of the namespace, but we further ensure that their uids and gids don't collide by remapping them with the uidmap translator.

Translator basics
  Translators are shared objects (shlibs)
  Methods:
      int32_t init (xlator_t *this);
      void fini (xlator_t *this);
  Data:
      struct xlator_fops fops = { ... };
      struct xlator_cbks cbks = { };
      struct volume_options options[] = { ... };
  Client, Server, Client/Server
  Threads: write MT-SAFE
  Portability: GlusterFS != Linux only
  License: server GPLv3+, client GPLv2 or LGPLv3+

Notes: fops is the function pointer table, or vtable, with its own “sub-API”. cbks is never used in the translator; it is legacy, but the run-time will dlsym it, so you have to have it. Think carefully about where your translator should run. Client? Server? Either one? GlusterFS is threaded, so your xlator must be MT-SAFE. This is LinuxCon, yes, and GlusterFS is developed on Linux, but it is not Linux-only. Say some words about the pieces and their license(s).

Volume Options
  struct volume_options options[] = {
          { .key  = {"uidmap-plugin", "plugin"},
            .type = GF_OPTION_TYPE_STR,
          },
          { .key   = {"root-squash"},
            .value = {"yes", "no"},
          },
          { .key = {"uid-range"} },
          { .key = {"gid-range"} },
          { .key = {NULL} },
  };

  GF_OPTION_TYPE_{ANY,STR,INT,SIZET,PERCENT,BOOL,...}

Notes: Here's an example of the options table. There's a rich set of other types not shown here.
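
The table above ends with an entry whose first key is NULL, and one option can carry several names. As a rough sketch of how such a table can be walked, here is a lookup over a simplified stand-in. The names `opt_sketch`, `sketch_options`, and `find_option` are hypothetical and are not the libglusterfs definitions.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_KEYS 4

/* Simplified stand-in for struct volume_options (hypothetical). */
struct opt_sketch {
    const char *key[SKETCH_MAX_KEYS];   /* option name plus aliases */
    const char *value[SKETCH_MAX_KEYS]; /* allowed values, if restricted */
};

static const struct opt_sketch sketch_options[] = {
    { .key = {"uidmap-plugin", "plugin"} },
    { .key = {"root-squash"}, .value = {"yes", "no"} },
    { .key = {"uid-range"} },
    { .key = {"gid-range"} },
    { .key = {NULL} },                  /* sentinel: end of table */
};

/* Return the index of the entry that owns `name`, or -1 if none does. */
static int find_option(const struct opt_sketch *opts, const char *name)
{
    for (int i = 0; opts[i].key[0] != NULL; i++) {
        for (int k = 0; k < SKETCH_MAX_KEYS && opts[i].key[k] != NULL; k++) {
            if (strcmp(opts[i].key[k], name) == 0)
                return i;
        }
    }
    return -1;
}
```

With this table, looking up either "uidmap-plugin" or its alias "plugin" lands on the same entry, and an unknown name falls off the end at the sentinel.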

Translator fops
  Here there be dragons (and they're not dragons of good fortune)
    Every fop method has a different signature
    Signatures change from release to release
    Documentation? :-(
  The nuts and bolts
    fop methods and fop callbacks
    STACK_WIND(), STACK_UNWIND(), and friends
    Calling multiple “children”
    Dealing with errors
      Indicating an I/O error
      Indicating a method error

Notes: Every fop has its own signature (API), so be careful. Changes in fop signatures are slowing down as GlusterFS matures, but... Documentation is getting better, but... Like fop signatures, the corresponding cbk, STACK_WIND, and STACK_UNWIND signatures are unique per fop, and they change from release to release too. We'll talk about fops and cbks and STACK_WIND and STACK_UNWIND, and much more.

Every method has a different signature
  Open fop method and callback:
      typedef int32_t (*fop_open_t) (call_frame_t *, xlator_t *, loc_t *,
                                     int32_t, fd_t *, dict_t *);
      typedef int32_t (*fop_open_cbk_t) (call_frame_t *, void *, xlator_t *,
                                         int32_t, int32_t, fd_t *, dict_t *);

  Rename fop method and callback:
      typedef int32_t (*fop_rename_t) (call_frame_t *, xlator_t *, loc_t *,
                                       loc_t *, dict_t *);
      typedef int32_t (*fop_rename_cbk_t) (call_frame_t *, void *, xlator_t *,
                                           int32_t, int32_t, struct iatt *,
                                           struct iatt *, struct iatt *,
                                           struct iatt *, struct iatt *);

Notes: Here we compare the fop and cbk signatures for open(), and also for rename(), just to emphasize the point that fops and cbks have different signatures. Perhaps that's obvious.

Method signatures change from release to release
  3.2 open fop:
      typedef int32_t (*fop_open_t) (call_frame_t *, xlator_t *, loc_t *,
                                     int32_t, fd_t *, int32_t);

  3.3 open fop:
      typedef int32_t (*fop_open_t) (call_frame_t *, xlator_t *, loc_t *,
                                     int32_t, fd_t *, dict_t *);

Notes: If you're developing for different releases of GlusterFS, be aware that the fop and cbk signatures have changed. Someday we'll be happy with them and they won't change any more. Someday.

Translator Data Types
  call_frame_t: context for this I/O
  xlator_t: translator context
  inode_t: represents a file on disk; ref-counted
  fd_t: represents an open file; ref-counted
  struct iatt: ~= struct stat
  dict_t: ~= Python dict (or C++ std::map)

Notes: call_frame_t is the context for the I/O as it traverses down and back up the stack of translators. It is valid for the duration of this particular I/O. xlator_t is the context about the particular translator the I/O is in at this moment in time. It is valid for as long as the translator is loaded, i.e. effectively indefinitely. inode_t and fd_t can go away. If you need one to stay around, bump its ref count, and don't forget to decrement the ref count when you're done with it.
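
The ref-count discipline described above can be sketched with a tiny stand-in object. Everything here (`obj_sketch`, `obj_new`, `obj_ref`, `obj_unref`) is hypothetical. In a real translator you would use the ref and unref helpers that libglusterfs provides for fd_t and inode_t, and those have to be safe across threads, which this sketch is not.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a ref-counted object such as fd_t or inode_t. */
struct obj_sketch {
    int refcount;
};

static struct obj_sketch *obj_new(void)
{
    struct obj_sketch *o = calloc(1, sizeof(*o));
    if (o != NULL)
        o->refcount = 1;   /* the creator holds the first reference */
    return o;
}

/* Take a reference so the object outlives the current call. */
static struct obj_sketch *obj_ref(struct obj_sketch *o)
{
    o->refcount++;
    return o;
}

/* Drop one reference; returns 1 if the object was freed, else 0. */
static int obj_unref(struct obj_sketch *o)
{
    if (--o->refcount > 0)
        return 0;
    free(o);
    return 1;
}
```

The pattern is symmetric: every `obj_ref` taken before stashing a pointer needs exactly one matching `obj_unref` once the pointer is no longer used.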

Utility Functions
  Memory management: GF_MALLOC, GF_CALLOC, GF_FREE
  Logging: gf_log, gf_print_trace
  Red-black trees, hashes, etc.

fop methods and fop callbacks
  uidmap_writev (...)
  {
          ...
          STACK_WIND (frame, uidmap_writev_cbk,
                      FIRST_CHILD (this),
                      FIRST_CHILD (this)->fops->writev,
                      fd, vector, count, offset, iobref);
          /* DANGER ZONE */
          return 0;
  }

  Effectively lose control after STACK_WIND
    Callback might have already happened
    Or might be running right now
    Or maybe it's not going to run 'til later

Notes: Here's a fop method for writev(). Do all your work before passing the I/O on to the next translator in the chain, i.e. before calling STACK_WIND. Note the cbk, in this case uidmap_writev_cbk(). No I/O occurs here. It all happens sometime between the STACK_WIND and the invocation of the cbk. N.B. In this case we're only passing the I/O to a single child. In the danger zone, only do clean-up, e.g. release local stuff, and make sure it's not in use further down the call tree.

fop methods and fop callback methods, cont.
  uidmap_writev_cbk (call_frame_t *frame, void *cookie, ...)
  {
          ...
          STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno,
                               prebuf, postbuf);
          return 0;
  }

  The I/O is complete when the callback is called

Notes: Here's the cbk. The I/O is complete at this point, so return back up the stack with STACK_UNWIND_STRICT. Notice the cookie parameter. More about this in the next slide.
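
To make the wind/unwind flow concrete, here is a self-contained sketch with stand-in types. The names `frame_sketch`, `child_writev`, `parent_writev`, and `parent_writev_cbk` are all hypothetical, and no GlusterFS headers are involved. In this sketch the child completes immediately, so the callback runs before the wind call even returns. That is one of the cases that makes the code after a wind a danger zone.

```c
#include <stddef.h>

struct frame_sketch;

/* Callback type: invoked when the child has finished the I/O. */
typedef int (*cbk_sketch_t)(struct frame_sketch *frame, int op_ret, int op_errno);

/* Hypothetical stand-in for call_frame_t. */
struct frame_sketch {
    cbk_sketch_t ret;    /* where to go when the child unwinds */
    int          result; /* filled in by the callback */
};

/* "Child" fop: this one completes immediately and unwinds at once. */
static int child_writev(struct frame_sketch *frame, size_t len)
{
    return frame->ret(frame, (int)len, 0);
}

/* Callback: the I/O is complete here, so record the outcome. */
static int parent_writev_cbk(struct frame_sketch *frame, int op_ret, int op_errno)
{
    frame->result = (op_ret < 0) ? -op_errno : op_ret;
    return 0;
}

/* Parent fop: do all the work first, then wind; afterwards only return. */
static int parent_writev(struct frame_sketch *frame, size_t len)
{
    frame->ret = parent_writev_cbk; /* remember the callback, as a wind does */
    child_writev(frame, len);       /* the callback may run inside this call */
    return 0;                       /* danger zone: do not touch the frame */
}
```

A child that queued the request for another thread would call `frame->ret` later instead, which is why the parent cannot assume either ordering.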

STACK_WIND versus STACK_WIND_COOKIE
  Pass extra data to the cbk with STACK_WIND_COOKIE

  quota_statfs (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
  {
          inode_t *root_inode = loc->inode->table->root;

          STACK_WIND_COOKIE (frame, quota_statfs_cbk, root_inode,
                             FIRST_CHILD (this),
                             FIRST_CHILD (this)->fops->statfs,
                             loc, xdata);
          return 0;
  }

  There is also frame->local, shared by all STACK_WIND callbacks

Notes: Compare STACK_WIND to STACK_WIND_COOKIE. Notice root_inode, and that it's being passed as an extra param. The cookie is only shared between the fop and its cbk. There's also frame->local, which is shared by all fops and cbks, but be careful you don't overwrite something that's already there.

STACK_WIND, STACK_WIND_COOKIE, cont.
  Pass extra data to the cbk with STACK_WIND_COOKIE

  quota_statfs_cbk (call_frame_t *frame, void *cookie, ...)
  {
          inode_t *root_inode = cookie;
          ...
  }

Notes: Here we see that cookie != NULL when the fop used STACK_WIND_COOKIE.

STACK_UNWIND versus STACK_UNWIND_STRICT
  STACK_UNWIND_STRICT uses the correct type

  /* return from function in a type-safe way */
  #define STACK_UNWIND(frame, params ...)
          do {
                  ret_fn_t fn = frame->ret;
                  ...

  versus

  #define STACK_UNWIND_STRICT(op, frame, params ...)
          fop_##op##_cbk_t fn = (fop_##op##_cbk_t)frame->ret;

  And why wouldn't you want strong typing?

Notes: This slide speaks for itself. Is there a reason to use STACK_UNWIND? Ever?
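
The token-pasting trick can be tried in isolation. The following is a minimal sketch with stand-in names (`uframe_sketch`, `ret_fn_sketch_t`, `UNWIND_STRICT_SKETCH`, and the `demo` fop are all hypothetical), not the real macro. The callback is stored in an untyped slot, and the macro casts it back to the per-fop type so the compiler can check the argument list.

```c
struct uframe_sketch;

/* Generic, untyped slot for the saved callback. */
typedef void (*ret_fn_sketch_t)(void);

/* Hypothetical stand-in for call_frame_t. */
struct uframe_sketch {
    ret_fn_sketch_t ret;
    int             seen_ret;
};

/* Per-fop callback type, following the fop_<op>_cbk_t naming scheme. */
typedef int (*fop_demo_cbk_t)(struct uframe_sketch *frame, int op_ret, int op_errno);

/* Token pasting selects the callback type from the fop name, so a call
 * with the wrong number or type of arguments fails to compile. */
#define UNWIND_STRICT_SKETCH(op, frame, ...)                    \
    do {                                                        \
        fop_##op##_cbk_t fn = (fop_##op##_cbk_t)(frame)->ret;   \
        fn((frame), __VA_ARGS__);                               \
    } while (0)

static int demo_cbk(struct uframe_sketch *frame, int op_ret, int op_errno)
{
    (void)op_errno;
    frame->seen_ret = op_ret;
    return 0;
}

static void demo_unwind(struct uframe_sketch *frame, int op_ret, int op_errno)
{
    frame->ret = (ret_fn_sketch_t)demo_cbk; /* what a wind would have saved */
    UNWIND_STRICT_SKETCH(demo, frame, op_ret, op_errno);
}
```

Dropping an argument from the `UNWIND_STRICT_SKETCH` call in `demo_unwind` turns a silent run-time bug into a compile error, which is the whole argument for the strict form.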

Calling multiple children (fan out)
  afr_writev_wind (...)
  {
          ...
          for (i = 0; i < priv->child_count; i++) {
                  if (local->transaction.pre_op[i]) {
                          STACK_WIND_COOKIE (frame, afr_writev_wind_cbk,
                                             (void *) (long) i,
                                             priv->children[i],
                                             priv->children[i]->fops->writev,
                                             local->fd, ...);
                  }
          }
          return 0;
  }

Notes: Here's an example from AFR where the I/O is sent to all the replicas. Remember the danger zone.

Calling multiple children, cont. (fan in)
  afr_writev_wind_cbk (...)
  {
          LOCK (&frame->lock);
          callcnt = --local->call_count;
          UNLOCK (&frame->lock);

          if (callcnt == 0) /* we're done */
                  ...
  }

  Failure by any one child means the whole transaction failed and needs to be handled accordingly

Notes: Not much else to say.
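
The fan-in countdown can be sketched without any GlusterFS types. To stay dependency-free, this version swaps the frame lock for C11 atomics, which is a different mechanism from the LOCK/UNLOCK pair above. The names `fanin_sketch`, `fanin_init`, and `fanin_reply` are hypothetical. It remembers the first failure so that the last reply can fail the whole transaction.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-request state kept in frame->local. */
struct fanin_sketch {
    atomic_int call_count; /* replies still outstanding */
    atomic_int op_errno;   /* first error seen, 0 if none */
};

static void fanin_init(struct fanin_sketch *l, int children)
{
    atomic_init(&l->call_count, children);
    atomic_init(&l->op_errno, 0);
}

/* Called once per child reply. Returns 1 for the last reply, else 0. */
static int fanin_reply(struct fanin_sketch *l, int op_ret, int op_errno)
{
    if (op_ret < 0) {
        int expected = 0;
        /* Keep only the first failure; later ones leave it unchanged. */
        atomic_compare_exchange_strong(&l->op_errno, &expected, op_errno);
    }
    /* fetch_sub returns the old value, so 1 means this was the last reply. */
    return atomic_fetch_sub(&l->call_count, 1) == 1;
}
```

Only the caller that gets 1 back should unwind to the parent, and it should report failure if `op_errno` is non-zero at that point.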

Dealing With Errors: I/O errors
  uidmap_writev_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                     int32_t op_ret, int32_t op_errno, ...)
  {
          ...
          STACK_UNWIND_STRICT (writev, frame, -1, EIO, ...);
          return 0;
  }

  op_ret: 0 or -1, success or failure
  op_errno: from <errno.h>
  Use an op_errno that's valid and/or relevant for the fop

Notes: Note that this is in the cbk. Xlators further down the call tree may return an error in op_ret and op_errno. E.g. the storage/posix xlator that actually writes to disk may have a real error, and this propagates back up. You could decide it's not a real error. Conversely, the lower xlators may return success and you could decide it is an error anyway. Either pass the parameters on, or change them accordingly. Recommendation: use an appropriate errno, or you risk confusing people.

Dealing With Errors: method errors
  uidmap_writev (call_frame_t *frame, xlator_t *this, ...)
  {
          ...
          if (horrible_logic_error_must_abort) {
                  goto error; /* glusterfs idiom */
          }
          STACK_WIND (frame, uidmap_writev_cbk, ...);
          return 0;

  error:
          STACK_UNWIND_STRICT (writev, frame, -1, EIO, NULL, NULL);
          return 0;
  }

Notes: And here's how to handle some unrecoverable error, logic or otherwise, in the fop. Note the glusterfs idiom (retch).
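
The same control flow can be exercised on its own with stand-ins. In this sketch `eframe_sketch`, `unwind_sketch`, `wind_sketch`, and `writev_sketch` are hypothetical, and the argument check is only an example of a condition that aborts before winding. The point it illustrates is that the early-failure path still reports a result to the caller instead of just returning.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical slot standing in for what the caller sees on unwind. */
struct eframe_sketch {
    int op_ret;
    int op_errno;
};

/* Stand-in for an unwind: report the outcome to the caller. */
static void unwind_sketch(struct eframe_sketch *frame, int op_ret, int op_errno)
{
    frame->op_ret = op_ret;
    frame->op_errno = op_errno;
}

/* Stand-in for the child translator: pretend the write succeeded. */
static void wind_sketch(struct eframe_sketch *frame, size_t len)
{
    unwind_sketch(frame, (int)len, 0);
}

static int writev_sketch(struct eframe_sketch *frame, const void *buf, size_t len)
{
    int op_errno = EINVAL;

    if (buf == NULL || len == 0)
        goto error;             /* validate before winding */

    wind_sketch(frame, len);    /* normal path: hand off to the child */
    return 0;

error:
    /* The caller is waiting for a reply, so unwind even on early failure. */
    unwind_sketch(frame, -1, op_errno);
    return 0;
}
```

Both paths return 0 from the fop itself. The success or failure of the I/O travels in `op_ret` and `op_errno`, not in the function's return value.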

Building: in-tree or out-of-tree
  In-tree:
    Gluster.org source hasn't had a -devel package
    Good news: starting with [release missing in transcript] there is now a -devel package if you build the RPM from the glusterfs.spec(.in) file included in the source
    Bad news: unclear what .deb packagers are doing
    Fedora RPMs have had a -devel package since 3.2.x
  Out-of-tree:
    Use HekaFS sources as a model for building out-of-tree

Notes: Here's an example of community at work. Gluster never provided a -devel rpm until we did it for Fedora. The HekaFS source tree is a good starting point for adding your own translators and building them out-of-tree. Or build in-tree. On modern hardware the whole gluster build takes only a couple of minutes. Need example of files to change to add new xlator.

Writing a translator in Python with GluPy
[Diagram: GluPy in a translator stack between io-cache and dht. Labels: fop, cbk, STACK_WIND, STACK_UNWIND, GluPy, python, gluster.py, ctypes (ffi), libglusterfs.]

Resources
  Jeff Darcy's HekaFS.org translator tutorials
    context/ translator/
  Jeff Darcy's Glupy articles
  GlusterFS documentation
  GlusterFS Git repos
    ssh://git.gluster.com/glusterfs.git
  This presentation
  My

Call to action
  Go forth and write GlusterFS translators!
  A basic translator isn't hard. Trust me. ;-)
