GlusterFS as a Development Platform

GlusterFS as a Development Platform (Writing GlusterFS Translators)
Kaleb Keithley, Red Hat
23 February 2013

Extending GlusterFS with Translators
- What is a translator?
- An example: the HekaFS uidmap translator
- Translator basics
- Translator file-ops (fops) methods
- Inside fop methods: the nuts and bolts
- Building in-tree or out-of-tree
- GluPy
- Resources

An outline of what I'm going to talk about. Translators, per se, aren't hard to write. Yes, the devil (or God) is in the details. Hard things will always be hard, and doing them in the I/O path will not make them any easier.

What is a GlusterFS translator?
- A pluggable software component
- Provisioning storage == creating a directed graph of translators
  - Linear, e.g. a trivial one-brick volume, an NFS volume
  - Trees, e.g. distribution, replication, stripe
- Translators run on both the server and the client
- Translators may be moved from client to server and vice versa
- Every translator implements the same API/ABI

Pluggable: I'll say more about that later. When you use the CLI, the graph is created for you; it's a file, and you can edit that file to insert your own translator. The client/server split is not set in stone, but use caution: it doesn't always make sense to move translators from client to server or server to client. As we'll see, the API is simple: just two methods, plus the important part, the table of function pointers for the fops. Every particular fop method has the same API, i.e. the same function signature.

Simple Translator Stack: Linear
Client side (top to bottom): debug/io-stats, performance/stat-prefetch, performance/quick-read, performance/io-cache, performance/read-ahead, performance/write-behind, protocol/client
Server side (top to bottom): protocol/server, debug/io-stats, features/marker, performance/io-threads, features/locks, features/access-control, storage/posix

Here's a simple single-brick volume. You can see which translators run on the client and which ones run on the server. In the lower left is the brick, the real disk drive where the data is stored; in the upper left is the "virtual" disk on the client. A write to the virtual disk on the client traverses all the translators down to the protocol/client translator, where both the data and metadata are marshaled and sent to the server, and then down through the server's stack to the disk. The results come back along the opposite path. A sketch of what such a graph looks like as a volfile follows below.
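As a rough illustration of what the graph file looks like, here is a hand-written sketch of a fragment of a client-side volfile for such a linear stack; the volume names, hostname, and brick path are made up, and only a few of the translators are shown. Each translator is declared as a volume block and the blocks are wired together with "subvolumes"; inserting your own translator means adding one more block and re-pointing the block above it.

    volume myvol-client-0
        type protocol/client
        option remote-host server1
        option remote-subvolume /export/brick1
    end-volume

    volume myvol-write-behind
        type performance/write-behind
        subvolumes myvol-client-0
    end-volume

    volume myvol-io-cache
        type performance/io-cache
        subvolumes myvol-write-behind
    end-volume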

Complex Translator Stack: Distribute (DHT)
[Diagram: two bricks, each with its own server-side stack (protocol/server, debug/io-stats, ..., features/access-control, storage/posix); on the client, cluster/distribute sits above two protocol/client subvolumes, one per brick, beneath the usual performance and debug translators.]

Here's a similar setup, but now we have added another brick and are using DHT. See how I/O is spread across both servers by the cluster/distribute xlator.

Complex Translator Stack: DHT + NFS
[Diagram: the same two-brick DHT setup, but with an nfs/server xlator and its "client" stack (cluster/distribute over two protocol/client xlators) running on one of the servers; NFS clients connect to that server.]

A similar setup, but now we've added NFS by adding an NFS xlator. Color is important here: on the previous slides I used different colors to indicate which pieces ran on different machines. That's still true here, but notice that the NFS server's "client" stack runs on one of the servers. Thus you can see how GlusterFS serves NFS with DHT (or AFR).

An example: the HekaFS uidmap translator
- Consider a service provider with several customers (tenants)
- Each customer has thousands of users
- Collisions in the uid and gid space
- The uidmap xlator maps uids and gids to discrete sets of uids and gids per tenant
[Diagram: tenants red (ids 512, 513, 516), blue (513, 515, 516), and green (511, 512, 515) have overlapping ids; the uidmap xlator remaps them into disjoint per-tenant ranges (10001-10003, 11001-11003, 12001-12003), and each tenant's files live under its own directory on the brick: /export/bricks/vol/red, /export/bricks/vol/blue, /export/bricks/vol/green.]

I want to talk a little bit about what we did with HekaFS. Define tenant: a set of related users. We wanted truly discrete storage for every user of every tenant. Not only do their files live in an entirely separate part of the namespace, we further ensure that their uids and gids don't collide by remapping them with the uidmap translator.

Translator basics
- Translators are shared objects (shlibs)
- Methods:
      int32_t init (xlator_t *this);
      void fini (xlator_t *this);
- Data:
      struct xlator_fops fops = { ... };
      struct xlator_cbks cbks = { ... };
      struct volume_options options[] = { ... };
- Client, server, or client/server
- Threads: write MT-SAFE code
- Portability: GlusterFS != Linux only
- License: server GPLv3+, client GPLv2 or LGPLv3+

fops is a function-pointer table (a vtable) with its own "sub-API". cbks is never used in the translator; it's legacy, but the run-time will dlsym() it, so you have to have it. Think carefully about where your translator should run: client? server? either one? GlusterFS is threaded, so your xlator must be MT-SAFE. This is LinuxCon, yes, and GlusterFS is developed on Linux, but it isn't Linux-only. Say some words about the pieces and their license(s). A minimal skeleton showing these pieces follows below.
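To make that concrete, here is a minimal, hypothetical translator skeleton. This is a sketch only, assuming a 3.3-era tree: the include list and the one-subvolume check are common conventions rather than requirements, and all fops are left to the defaults.

    #include "glusterfs.h"
    #include "xlator.h"
    #include "logging.h"

    int32_t
    init (xlator_t *this)
    {
            /* a common sanity check: this sketch expects exactly one subvolume */
            if (!this->children || this->children->next) {
                    gf_log (this->name, GF_LOG_ERROR,
                            "translator needs exactly one subvolume");
                    return -1;
            }
            return 0;
    }

    void
    fini (xlator_t *this)
    {
            return;
    }

    /* only the fops you implement need entries; anything missing is
       filled in with the default pass-through fops at load time */
    struct xlator_fops fops = {
            /* .writev = my_writev, */
    };

    struct xlator_cbks cbks = {
    };

    struct volume_options options[] = {
            { .key = {NULL} },
    };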

Volume Options
Here's an example of the options table:

    struct volume_options options[] = {
            { .key   = {"uidmap-plugin", "plugin"},
              .type  = GF_OPTION_TYPE_STR,
            },
            { .key   = {"root-squash"},
              .value = { "yes", "no" },
            },
            { .key   = {"uid-range"}, },
            { .key   = {"gid-range"}, },
            { .key   = {NULL} },
    };

    GF_OPTION_TYPE_{ANY,STR,INT,SIZET,PERCENT,BOOL,...}

There's a rich set of other option types not shown here.
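Options declared this way are available via this->options when init() runs. A minimal sketch of consuming one (the option key matches the table above; everything else is illustrative and error handling is pared down):

    int32_t
    init (xlator_t *this)
    {
            char *plugin = NULL;

            /* dict_get_str() returns 0 on success */
            if (dict_get_str (this->options, "uidmap-plugin", &plugin) == 0)
                    gf_log (this->name, GF_LOG_INFO,
                            "using uidmap plugin %s", plugin);

            return 0;
    }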

Translator fops
- Here there be dragons (and they're not dragons of good fortune)
- Every fop method has a different signature
- Signatures change from release to release
- Documentation? :-(
- The nuts and bolts:
  - fop methods and fop callbacks
  - STACK_WIND(), STACK_UNWIND(), and friends
  - Calling multiple "children"
  - Dealing with errors
  - Indicating an I/O error
  - Indicating a method error

Every fop has its own signature (API), so be careful. The churn in fop signatures is slowing down as GlusterFS matures, and the documentation is getting better, but... Like the fop signatures, the corresponding cbk, STACK_WIND, and STACK_UNWIND signatures are unique per fop, and they change from release to release too. We'll talk about fops and cbks, STACK_WIND and STACK_UNWIND, and much more.

Every method has a different signature
Open fop method and callback:

    typedef int32_t (*fop_open_t) (call_frame_t *, xlator_t *, loc_t *,
                                   int32_t, fd_t *, dict_t *);
    typedef int32_t (*fop_open_cbk_t) (call_frame_t *, void *, xlator_t *,
                                       int32_t, int32_t, fd_t *, dict_t *);

Rename fop method and callback:

    typedef int32_t (*fop_rename_t) (call_frame_t *, xlator_t *, loc_t *,
                                     loc_t *, dict_t *);
    typedef int32_t (*fop_rename_cbk_t) (call_frame_t *, void *, xlator_t *,
                                         int32_t, int32_t, struct iatt *,
                                         struct iatt *, struct iatt *,
                                         struct iatt *, struct iatt *);

Here we compare the fop and cbk signatures for open() and for rename(), just to emphasize the point that fops and cbks have different signatures. Perhaps that's obvious.

Method signatures change from release to release
3.2 open fop:

    typedef int32_t (*fop_open_t) (call_frame_t *, xlator_t *, loc_t *,
                                   int32_t, fd_t *, int32_t);

3.3 open fop:

    typedef int32_t (*fop_open_t) (call_frame_t *, xlator_t *, loc_t *,
                                   int32_t, fd_t *, dict_t *);

If you're developing for different releases of GlusterFS, be aware that the fop and cbk signatures have changed. Someday we'll be happy with them and they won't change any more. Someday.

Translator Data Types
- call_frame_t: per-I/O context as the call traverses the translator stack
- xlator_t: translator context
- inode_t: represents a file on disk; ref-counted
- fd_t: represents an open file; ref-counted
- iatt_t: ~= struct stat
- dict_t: ~= a Python dict (or C++ std::map)

call_frame_t is the context for the I/O as it traverses down and back up the stack of translators; it's valid for the duration of this particular I/O. xlator_t is context about the particular translator the I/O is in at this moment in time; it's valid for as long as the translator is loaded, i.e. effectively indefinitely. inode_t and fd_t can go away; if you need one to stay around, bump its ref count, and don't forget to decrement the ref count when you're done with it.
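The ref-counting idiom looks roughly like this. A sketch only: my_local_t and its fields are invented for illustration.

    /* hypothetical per-call context holding ref-counted objects */
    typedef struct {
            fd_t    *fd;
            inode_t *inode;
    } my_local_t;

    static void
    my_local_hold (my_local_t *local, fd_t *fd, inode_t *inode)
    {
            local->fd    = fd_ref (fd);        /* take references so they   */
            local->inode = inode_ref (inode);  /* outlive the current call  */
    }

    static void
    my_local_release (my_local_t *local)
    {
            if (local->fd)
                    fd_unref (local->fd);      /* drop our references */
            if (local->inode)
                    inode_unref (local->inode);
            GF_FREE (local);
    }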

Utility Functions
- Memory management: GF_MALLOC, GF_CALLOC, GF_FREE
- Logging: gf_log, gf_print_trace
- Red-black trees, hashes, etc.
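For example, a small sketch of the memory and logging helpers. gf_common_mt_char is one of the generic memory-accounting types; a real translator usually defines its own mem-types, and the exact include list depends on the tree.

    #include <string.h>
    #include "xlator.h"      /* pulls in the mem-pool and logging headers */

    static char *
    dup_string (xlator_t *this, const char *name)
    {
            /* GF_CALLOC takes a count, a size, and a memory-accounting type */
            char *copy = GF_CALLOC (1, strlen (name) + 1, gf_common_mt_char);

            if (!copy) {
                    gf_log (this->name, GF_LOG_ERROR, "out of memory");
                    return NULL;
            }
            strcpy (copy, name);
            return copy;     /* caller releases it with GF_FREE() */
    }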

fop methods and fop callbacks

    uidmap_writev (...)
    {
            ...
            STACK_WIND (frame, uidmap_writev_cbk,
                        FIRST_CHILD (this),
                        FIRST_CHILD (this)->fops->writev,
                        fd, vector, count, offset, iobref);
            /* DANGER ZONE */
            return 0;
    }

- You effectively lose control after STACK_WIND
- The callback might have already happened
- Or might be running right now
- Or maybe it won't run until later

Here's a fop method for writev(). Do all your work before passing the I/O on to the next translator in the chain, i.e. before calling STACK_WIND. Note the cbk, in this case uidmap_writev_cbk(). No I/O occurs here; it all happens sometime between the STACK_WIND and the invocation of the cbk. N.B. in this case we're only passing the I/O to a single child. In the danger zone? Only do clean-up, e.g. release local stuff, and make sure it's not in use further down the call tree.

fop methods and fop callback methods, cont.

    uidmap_writev_cbk (call_frame_t *frame, void *cookie, ...)
    {
            ...
            STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno,
                                 prebuf, postbuf);
            return 0;
    }

- The I/O is complete when the callback is called

Here's the cbk. The I/O is complete at this point; return back up the stack with STACK_UNWIND. Notice the cookie parameter; more about that on the next slide.

STACK_WIND versus STACK_WIND_COOKIE
- Pass extra data to the cbk with STACK_WIND_COOKIE

    quota_statfs (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
    {
            inode_t *root_inode = loc->inode->table->root;

            STACK_WIND_COOKIE (frame, quota_statfs_cbk, root_inode,
                               FIRST_CHILD (this),
                               FIRST_CHILD (this)->fops->statfs,
                               loc, xdata);
            return 0;
    }

- There is also frame->local, shared by all STACK_WIND callbacks

Compare STACK_WIND to STACK_WIND_COOKIE. Notice root_inode and that it's being passed as an extra parameter. The cookie is only shared between the fop and its cbk. There's also frame->local, which is shared by all fops and cbks on the frame, but be careful you don't overwrite something that's already there.

STACK_WIND, STACK_WIND_COOKIE, cont.
- Pass extra data to the cbk with STACK_WIND_COOKIE

    quota_statfs_cbk (call_frame_t *frame, void *cookie, ...)
    {
            inode_t *root_inode = cookie;
            ...
    }

Here we see that cookie != NULL when the fop used STACK_WIND_COOKIE.
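A common idiom for frame->local, sketched with the same older writev parameter lists used on the earlier slides (my_writev, my_local_t, and their fields are invented, and the exact parameters differ by release): allocate the local in the fop before winding, detach it in the cbk, and free it after unwinding.

    #include <errno.h>
    #include "xlator.h"

    typedef struct {
            off_t offset;    /* whatever this translator needs to remember */
    } my_local_t;

    int32_t my_writev_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                           int32_t op_ret, int32_t op_errno,
                           struct iatt *prebuf, struct iatt *postbuf);

    int32_t
    my_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
               struct iovec *vector, int32_t count, off_t offset,
               struct iobref *iobref)
    {
            my_local_t *local = GF_CALLOC (1, sizeof (*local), gf_common_mt_char);

            if (!local) {
                    STACK_UNWIND_STRICT (writev, frame, -1, ENOMEM, NULL, NULL);
                    return 0;
            }
            local->offset = offset;
            frame->local  = local;

            STACK_WIND (frame, my_writev_cbk,
                        FIRST_CHILD (this), FIRST_CHILD (this)->fops->writev,
                        fd, vector, count, offset, iobref);
            return 0;
    }

    int32_t
    my_writev_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                   int32_t op_ret, int32_t op_errno,
                   struct iatt *prebuf, struct iatt *postbuf)
    {
            my_local_t *local = frame->local;

            frame->local = NULL;          /* detach before unwinding */
            STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, prebuf, postbuf);
            GF_FREE (local);
            return 0;
    }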

STACK_UNWIND versus STACK_UNWIND_STRICT
- STACK_UNWIND_STRICT uses the correct type

    /* return from function in a type-safe way */
    #define STACK_UNWIND(frame, params ...)                              \
            do {                                                         \
                    ret_fn_t fn = frame->ret;                            \
                    ...

versus

    #define STACK_UNWIND_STRICT(op, frame, params ...)                   \
            do {                                                         \
                    fop_##op##_cbk_t fn = (fop_##op##_cbk_t) frame->ret; \
                    ...

- And why wouldn't you want strong typing?

This slide speaks for itself. Is there a reason to use STACK_UNWIND? Ever?

Calling multiple children (fan out)

    afr_writev_wind (...)
    {
            ...
            for (i = 0; i < priv->child_count; i++) {
                    if (local->transaction.pre_op[i]) {
                            STACK_WIND_COOKIE (frame, afr_writev_wind_cbk,
                                               (void *) (long) i,
                                               priv->children[i],
                                               priv->children[i]->fops->writev,
                                               local->fd, ...);
                    }
            }
            return 0;
    }

Here's an example from AFR where the I/O is sent to all the replicas. Remember the danger zone.

Calling multiple children, cont. (fan in)

    afr_writev_wind_cbk (...)
    {
            ...
            LOCK (&frame->lock);
            callcnt = --local->call_count;
            UNLOCK (&frame->lock);

            if (callcnt == 0) {
                    /* we're done */
                    ...
            }
    }

- Failure by any one child means the whole transaction failed
- And it needs to be handled accordingly

Not much else to say.
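One way to do that bookkeeping, sketched with the same older writev callback form used above and an invented my_fanin_local_t (real translators such as AFR do far more, including deciding which iatts to return): record the first failure under the frame lock and unwind only when the last child has answered.

    #include "xlator.h"

    typedef struct {
            int32_t op_ret;      /* starts at 0 */
            int32_t op_errno;
            int     call_count;  /* initialized to the number of children wound to */
    } my_fanin_local_t;

    int32_t
    my_fanin_writev_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         int32_t op_ret, int32_t op_errno,
                         struct iatt *prebuf, struct iatt *postbuf)
    {
            my_fanin_local_t *local   = frame->local;
            int               callcnt = 0;

            LOCK (&frame->lock);
            {
                    if (op_ret < 0) {
                            if (local->op_ret >= 0) {
                                    local->op_ret   = op_ret;   /* first failure wins */
                                    local->op_errno = op_errno;
                            }
                    } else if (local->op_ret >= 0) {
                            local->op_ret = op_ret;             /* e.g. bytes written */
                    }
                    callcnt = --local->call_count;
            }
            UNLOCK (&frame->lock);

            if (callcnt == 0) {
                    frame->local = NULL;
                    STACK_UNWIND_STRICT (writev, frame,
                                         local->op_ret, local->op_errno,
                                         prebuf, postbuf);
                    GF_FREE (local);
            }
            return 0;
    }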

Dealing With Errors: I/O errors

    uidmap_writev_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                       int32_t op_ret, int32_t op_errno, ...)
    {
            ...
            STACK_UNWIND_STRICT (writev, frame, -1, EIO, ...);
            return 0;
    }

- op_ret: 0 on success (for read/write fops, the byte count), -1 on failure
- op_errno: from <errno.h>
- Use an op_errno that's valid and/or relevant for the fop

Note that this is in the cbk. Xlators further down the call tree may return an error in op_ret and op_errno, e.g. the storage/posix xlator that actually writes to disk may hit a real error, and that propagates back up. You could decide it's not a real error; conversely, the lower xlators may return success and you could decide it's an error anyway. Either pass the parameters on or change them accordingly. Recommendation: use an appropriate errno, or you risk confusing people.

Dealing With Errors: method errors

    uidmap_writev (call_frame_t *frame, xlator_t *this, ...)
    {
            ...
            if (horrible_logic_error_must_abort) {
                    goto error;     /* glusterfs idiom */
            }

            STACK_WIND (frame, uidmap_writev_cbk, ...);
            return 0;

    error:
            STACK_UNWIND_STRICT (writev, frame, -1, EIO, NULL, NULL);
            return 0;
    }

And here's how to handle an unrecoverable error, logic or otherwise, in the fop itself. Note the glusterfs goto-error idiom (retch).

Building: in-tree or out-of-tree
- In-tree: the gluster.org source hasn't had a -devel package
  - Good news: starting with 3.3.0 there is now a -devel package if you build the RPMs from the glusterfs.spec(.in) file included in the source
  - Bad news: unclear what the .deb packagers are doing
  - Fedora RPMs have had a -devel package since 3.2.x
- Out-of-tree: use the HekaFS sources as a model for building out-of-tree

Here's an example of community at work: Gluster never provided a -devel RPM until we did it for Fedora. The HekaFS source tree is a good starting point for adding your own translators built out-of-tree. Or build in-tree; on modern hardware the whole gluster build takes only a couple of minutes. Need an example of the files to change to add a new xlator.
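For the out-of-tree case, the build can boil down to something like the following sketch. The header location, the extra preprocessor defines, and the installed xlator directory are assumptions that vary by distribution and GlusterFS version; check the HekaFS makefiles for the real details.

    # compile the translator as a shared object against the -devel headers
    gcc -Wall -fPIC -shared -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE \
        -I/usr/include/glusterfs -o uidmap.so uidmap.c

    # drop it where glusterd looks for translators (version-specific path)
    cp uidmap.so /usr/lib64/glusterfs/3.3.1/xlator/features/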

Writing a translator in Python with GluPy
[Diagram: the GluPy translator sits in the C translator stack like any other xlator (e.g. between io-cache and dht); fops and cbks are dispatched through ctypes (ffi) into Python code (gluster.py), which calls back into libglusterfs to STACK_WIND and STACK_UNWIND.]

Resources
Jeff Darcy's HekaFS.org translator tutorials:
  http://hekafs.org/index.php/2011/11/translator-101-class-1-setting-the-stage/
  http://hekafs.org/index.php/2011/11/translator-101-lesson-2-init-fini-and-private-context/
  http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/
  http://hekafs.org/index.php/2011/11/translator-101-lesson-4-debugging-a-translator/
Jeff Darcy's GluPy articles:
  http://hekafs.org/index.php/2012/08/glupy-writing-glusterfs-translators-in-python/
  http://www.linuxjournal.com/content/extending-glusterfs-python
GlusterFS documentation:
  http://www.gluster.org/community/documentation/index.php/Main_Page
GlusterFS Git repos:
  https://github.com/gluster/glusterfs
  ssh://git.gluster.com/glusterfs.git
This presentation:
  http://www.fedorapeople.org/kkeithle/DevCon-Gluster.odp
My email:
  mailto:kkeithle@redhat.com

Call to action
Go forth and write GlusterFS translators! A basic translator isn't hard. Trust me. ;-)