A New Distributed Processing Framework By Colin Pronovost
Introduction Parallel processing can solve some problems faster than sequential processing More systems = more RAM and CPUs = better I set out to create a framework that uses the concept of “tasks”: subprocesses that can be run in parallel Specifically, parallel processing can be faster than sequential processing when a given process is composed of multiple independent parts. Some terminology: The “controller” is the system that is coordinating the whole process. The “clients” are the workhorses that are running the “tasks.”
Prior Work Message Passing Interface (MPI) OpenMP Apache Spark BOINC MPI allows for interhost communication in a distributed processing environment, similar to the system of communication I have described. It even allows for programs to be run on multiple systems over a network (Message Passing Interface, 2014) (mpiexec(1) Man Page, 2014). However, the complexity of performing some operations in MPI increases the workload of production significantly (Message Passing Interface, 2014). The system I propose will have a simple, easy-to-use, and non-intrusive interface in order to ease production and experimentation. OpenMP OpenMP provides support for parallelizing loops and sections of code through preprocessor directives, allow- ing for code to be written to run in parallel with very little effort on the part of the programmer. However, since the platform is based on sharing memory across threads, it has virtually no support for any sort of network-based parallelization, unless shared memory across the network is implemented, for instance, in the kernel, so that the distribution over the network is transparent to user-space programs (OpenMP, 2014). Apache Spark Spark has an extensive library and claims to be easy to use (Apache Spark, 2014). However, it lacks a native interface and does not support dynamic loading, requiring the user to SSH into all cluster systems to update programs (Apache Spark, 2014) (Running Spark on EC2, 2014). This makes it not well suited to applications that require running several different programs or rapidly changing programs. BOINC BOINC is a framework that provides a lot of the same features that I have proposed. It allows for a form of dynamic program loading, remote function calls, and has a task-based framework (How BOINC Works, 2014). The key difference is that my project is focused on a system for doing research with the capabilities of distributed computing while BOINC is designed more for using volunteer CPU cycles to do big data calculations (How BOINC Works, 2014).
Goals Create a library that contains functions for easy communication Create a task queue that will keep track of what has been done and what needs to be done in what order Create a network API that will simplify the process of transferring data over a network connection Create a dynamic loading mechanism that will allow code to be sent over the network connection and run
The Library (client functions) int getControllerSockFD() returns the socket for communication with the controller. void sendMessage(struct message* msg) sends the controller a message. message contains 4 members: uint8 message type; uint16 message length; uint8 status code; void* data. struct message* recvMessage(struct message* msg) receives a message from the controller. If msg is non-null, the message is stored in that address, which is returned. Otherwise, new memory is allocated and returned. The length of the data is checked to ensure that it is equal to the message length field, otherwise the message is dropped. void setMessageMask(int mask) specifies which messages the client is interested in receiving. void clearMessageMask(int mask) clears the message mask so that all messages are received.
The Library (controller functions) int getNumClients() returns the number of clients. struct client data* getClientData(int client, struct client data* cldata) retrieves data on the specified client, such as OS, architecture, clock speed, amount of RAM, and other useful information. If cldata is non-null, the data is stored in that address, which is returned. Otherwise, new memory is allocated and returned. void sendMessage(struct controller message* msg) sends a client a message. controller message is the same as for the client, except that an int client field is added. void sendAllMessage(struct message* msg) sends all clients a message. message is the same as for the client. struct controller message* recvMessage(struct controller message *msg) receives the next message. If msg is non-null, the message is stored in that address, which is returned. Otherwise, new memory is allocated and returned. The length of the data is checked to ensure that it is equal to the message length field, otherwise the message is dropped.
The Task Queue Needs to be able to handle task dependencies and non-completion Represent task dependencies as a directed graph, complete tasks in topological order If a task has been “checked out” but not completed, it can be assigned to another client (in case something went wrong with the first one) Should be able to handle a client going down mid-process, network failure, etc. Topological order ensures that a task's dependencies are completed before the task itself Assigning the same task to multiple clients (if no other tasks are available to assign) ensures that a client can not bring the whole process to a halt by going down after it agreed to do a task
The Network API Messages are enqueued on a send queue and dequeued from a receive queue Sending of messages from the send queue and receiving messages onto the receive queue are each handled by a separate thread Each connection has its own queue/thread pair The queues are thread-safe. This setup insulates the main program thread from having to wait for network operations or deal with network failure. An optional callback function can be specified to be called upon completion or failure of a send operation.
Dynamic Loading Simple case: do the same thing that ld.so does to load shared libraries If a required library is not present, the controller can be queried for it Full programs can be written to /tmp and run with execve()
Accomplishments The network API works until one side disconnects A basic proof-of-concept using matrix multiplication The library, job queue, and dynamic loading were not implemented due to time constraints. It turns out that networking is more complicated and difficult than I originally thought that it would be. However, with most of the network back end mostly in place and functional, the rest should be fairly easy to accomplish (shared library loading code can be borrowed from glibc's ld.so).
Computation Time for Multiplying an NxN matrix with M CPUs Results Computation Time for Multiplying an NxN matrix with M CPUs CPUs Matrix Size 2 4 6 8 10 32 0.012363022 0.030523271 0.010454084 0.008981629 0.007652233 64 0.058882492 0.019197998 0.019730628 0.020511414 0.211174583 96 0.084705421 0.069359721 0.077731055 0.077675274 0.078329731 128 0.135407848 0.127247027 0.128140875 0.126661721 0.250957224 160 0.181245299 0.208812409 0.211340652 0.175471061 0.181304835 192 0.307811235 0.339090654 0.336397926 0.302383864 0.414096991 224 0.484918176 0.495227919 0.479458751 0.479369095 0.480608001 This table shows the amount of time that the framework took to multiply an NxN matrix with varying numbers of clients (CPUs). The data is fairly inconsistent, which I believe is due to uncontrollable events in the network itself. With such small matrix sizes, the network overhead is hard to filter out.
Results (continued) Average Computation Time per CPU for Multiplying an NxN matrix with M CPUs CPUs Matrix Size 2 4 6 8 10 32 0.000166459 0.000082589 0.000054977 0.000049028 0.00003304 64 0.001552289 0.000712410 0.000502733 0.000339796 0.000288078 96 0.004982714 0.002399564 0.001663380 0.001167981 0.000996738 128 0.011975964 0.005768813 0.003909034 0.002980716 0.002323040 160 0.022497528 0.010967518 0.007415074 0.005558941 0.004402479 192 0.039112715 0.019328565 0.012831588 0.009577841 0.007709864 224 0.060638093 0.030214250 0.020178754 0.014845676 0.012002253 This table shows the average time per CPU it took to multiply an NxN matrix for varying numbers of clients (CPUs). The average time decreases as the number of CPUs increases, which coincides with the idea that parallelism speeds up the operation.
Future Work Increase maximum message size Utilize multiple CPU cores on systems with multi- core CPUs Add more functions to the library I discovered during testing that the way in which the protocol is defined severely limits the maximum message size. A simple implementation for multiple core utilization could be running a task on each core. I would like to add more higher-level library functions (something along the lines of a loadLibrary() or a sendData()) which abstract the process of message creation.
Special Thanks David Costello and James Roche, for giving me some computers to use Professor Kai Shen, for taking the time to meet with me and answer my questions Professor Chris Brown, for directing me to where I could find resources
References Apache spark. (2014). web. Retrieved from http://spark.apache.org/ How BOINC works. (2014). web. Retrieved from http://boinc.berkeley.edu/wiki/How BOINC works Message passing interface. (2014). web. Retrieved from http://en.wikipedia.org/wiki/Messag e Passing Interface mpiexec(1) man page. (2014). web. Retrieved from https://www.open- mpi.org/doc/v1.8/man1/mpiexec.1.p hp OpenMP. (2014). web. Retrieved from http://en.wikipedia.org/wiki/OpenM P Running spark on EC2. (2014). web. Retrieved from http://spark.apache.org/docs/latest/e c2-scripts.html