A New Distributed Processing Framework


A New Distributed Processing Framework
By Colin Pronovost

Introduction
- Parallel processing can solve some problems faster than sequential processing.
- More systems mean more RAM and more CPUs, and therefore more computing capacity.
- I set out to create a framework built around the concept of "tasks": subprocesses that can be run in parallel.
- Specifically, parallel processing can be faster than sequential processing when a given process is composed of multiple independent parts.
- Some terminology: the "controller" is the system that coordinates the whole process; the "clients" are the workhorses that run the "tasks."

Prior Work
Prior systems surveyed: Message Passing Interface (MPI), OpenMP, Apache Spark, and BOINC.

MPI
MPI allows for interhost communication in a distributed processing environment, similar to the system of communication I have described. It even allows programs to be run on multiple systems over a network (Message Passing Interface, 2014) (mpiexec(1) Man Page, 2014). However, the complexity of performing some operations in MPI significantly increases the development effort (Message Passing Interface, 2014). The system I propose will have a simple, easy-to-use, and non-intrusive interface in order to ease development and experimentation.

OpenMP
OpenMP provides support for parallelizing loops and sections of code through preprocessor directives, allowing code to be written to run in parallel with very little effort on the part of the programmer. However, since the platform is based on sharing memory across threads, it has virtually no support for any sort of network-based parallelization, unless shared memory across the network is implemented, for instance in the kernel, so that the distribution over the network is transparent to user-space programs (OpenMP, 2014).

Apache Spark
Spark has an extensive library and claims to be easy to use (Apache Spark, 2014). However, it lacks a native interface and does not support dynamic loading, requiring the user to SSH into all cluster systems to update programs (Apache Spark, 2014) (Running Spark on EC2, 2014). This makes it poorly suited to applications that require running several different programs or rapidly changing programs.

BOINC
BOINC is a framework that provides many of the same features I have proposed. It allows for a form of dynamic program loading and remote function calls, and it has a task-based framework (How BOINC Works, 2014). The key difference is that my project is focused on a system for doing research with the capabilities of distributed computing, while BOINC is designed more for using volunteer CPU cycles to do big-data calculations (How BOINC Works, 2014).

Goals
- Create a library that contains functions for easy communication.
- Create a task queue that keeps track of what has been done, what needs to be done, and in what order.
- Create a network API that simplifies the process of transferring data over a network connection.
- Create a dynamic loading mechanism that allows code to be sent over the network connection and run.

The Library (client functions)
- int getControllerSockFD() returns the socket for communication with the controller.
- void sendMessage(struct message *msg) sends the controller a message. struct message contains four members: a uint8 message type, a uint16 message length, a uint8 status code, and a void* data pointer.
- struct message *recvMessage(struct message *msg) receives a message from the controller. If msg is non-null, the message is stored at that address, which is returned; otherwise, new memory is allocated and returned. The length of the data is checked against the message length field; if they do not match, the message is dropped.
- void setMessageMask(int mask) specifies which messages the client is interested in receiving.
- void clearMessageMask(int mask) clears the message mask so that all messages are received.
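As a rough illustration, the client-side interface described above might be declared in a C header like the following. The member names (message_type, message_length, status_code, data) and the use of <stdint.h> fixed-width types are my assumptions based on the slide text, not the framework's actual declarations.

    /* Hypothetical client-side header reconstructed from the slide text. */
    #include <stdint.h>

    struct message {
        uint8_t  message_type;    /* what kind of message this is */
        uint16_t message_length;  /* number of bytes pointed to by data */
        uint8_t  status_code;     /* protocol/status information */
        void    *data;            /* message payload */
    };

    int getControllerSockFD(void);                     /* socket to the controller */
    void sendMessage(struct message *msg);             /* send a message to the controller */
    struct message *recvMessage(struct message *msg);  /* receive; allocates if msg is NULL */
    void setMessageMask(int mask);                     /* limit which message types are delivered */
    void clearMessageMask(int mask);                   /* deliver all message types again */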

The Library (controller functions)
- int getNumClients() returns the number of clients.
- struct client_data *getClientData(int client, struct client_data *cldata) retrieves data on the specified client, such as OS, architecture, clock speed, amount of RAM, and other useful information. If cldata is non-null, the data is stored at that address, which is returned; otherwise, new memory is allocated and returned.
- void sendMessage(struct controller_message *msg) sends a client a message. struct controller_message is the same as the client-side struct message, except that an int client field is added.
- void sendAllMessage(struct message *msg) sends all clients a message. struct message is the same as on the client side.
- struct controller_message *recvMessage(struct controller_message *msg) receives the next message. If msg is non-null, the message is stored at that address, which is returned; otherwise, new memory is allocated and returned. The length of the data is checked against the message length field; if they do not match, the message is dropped.
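A matching sketch of the controller-side declarations, again with struct layouts (the client_data fields in particular) and names assumed from the slide rather than taken from the real code:

    /* Hypothetical controller-side header reconstructed from the slide text. */
    #include <stdint.h>

    struct message;                 /* as declared in the client-side sketch above */

    struct client_data {
        /* OS, architecture, clock speed, amount of RAM, ...; exact fields unknown */
        char     os[32];
        char     arch[32];
        uint32_t clock_mhz;
        uint64_t ram_bytes;
    };

    struct controller_message {     /* struct message plus a client field */
        int      client;            /* which client to send to / which client sent it */
        uint8_t  message_type;
        uint16_t message_length;
        uint8_t  status_code;
        void    *data;
    };

    int getNumClients(void);
    struct client_data *getClientData(int client, struct client_data *cldata);
    void sendMessage(struct controller_message *msg);   /* send to one client */
    void sendAllMessage(struct message *msg);           /* broadcast to all clients */
    struct controller_message *recvMessage(struct controller_message *msg);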

The Task Queue
- Needs to be able to handle task dependencies and non-completion.
- Task dependencies are represented as a directed graph, and tasks are completed in topological order.
- If a task has been "checked out" but not completed, it can be assigned to another client (in case something went wrong with the first one).
- Should be able to handle a client going down mid-process, network failure, etc.
- Topological order ensures that a task's dependencies are completed before the task itself.
- Assigning the same task to multiple clients (if no other tasks are available to assign) ensures that a client cannot bring the whole process to a halt by going down after it agreed to do a task.
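The selection policy above can be sketched as follows. This is a minimal, assumed implementation (the struct layout, state names, and function names are mine, not the framework's): each task tracks how many of its dependencies are still unfinished, and a checked-out task is only re-assigned when nothing else is available.

    /* Hypothetical sketch of topological-order task selection with re-assignment.
     * A task is ready when all of its dependencies are done; the caller is
     * expected to mark a returned task TASK_CHECKED_OUT when it assigns it. */
    #include <stddef.h>

    enum task_state { TASK_WAITING, TASK_CHECKED_OUT, TASK_DONE };

    struct task {
        int             id;
        enum task_state state;
        int             unfinished_deps;  /* dependencies not yet TASK_DONE */
        struct task   **dependents;       /* tasks that depend on this one */
        size_t          num_dependents;
    };

    /* Pick the next task to hand to a client, or NULL if nothing can be assigned. */
    struct task *next_task(struct task *tasks, size_t n)
    {
        struct task *fallback = NULL;
        for (size_t i = 0; i < n; i++) {
            if (tasks[i].state == TASK_DONE || tasks[i].unfinished_deps != 0)
                continue;
            if (tasks[i].state == TASK_WAITING)
                return &tasks[i];         /* prefer a task nobody is running yet */
            fallback = &tasks[i];         /* checked out but not yet completed */
        }
        return fallback;                  /* may re-assign a checked-out task */
    }

    /* Mark a task finished and release the tasks that were waiting on it. */
    void complete_task(struct task *t)
    {
        t->state = TASK_DONE;
        for (size_t i = 0; i < t->num_dependents; i++)
            t->dependents[i]->unfinished_deps--;
    }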

The Network API Messages are enqueued on a send queue and dequeued from a receive queue Sending of messages from the send queue and receiving messages onto the receive queue are each handled by a separate thread Each connection has its own queue/thread pair The queues are thread-safe. This setup insulates the main program thread from having to wait for network operations or deal with network failure. An optional callback function can be specified to be called upon completion or failure of a send operation.
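As an illustration, the send side of one connection might look like the sketch below, using POSIX threads: the main thread calls enqueue_send() and returns immediately, while a per-connection sender thread drains the queue and invokes the optional callback. All names and types here are illustrative assumptions, not the framework's actual API.

    /* Hypothetical sketch of one connection's send side. The main thread never
     * blocks on the network; a dedicated sender thread drains the queue and
     * reports completion or failure through an optional callback. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    struct queued_msg {
        void              *buf;               /* serialized message bytes */
        size_t             len;
        void             (*on_done)(int ok);  /* optional completion/failure callback */
        struct queued_msg *next;
    };

    struct send_queue {
        int                sockfd;            /* this connection's socket */
        struct queued_msg *head, *tail;
        pthread_mutex_t    lock;              /* makes the queue thread-safe */
        pthread_cond_t     nonempty;
    };

    /* Called from the main thread: append a message and wake the sender. */
    void enqueue_send(struct send_queue *q, struct queued_msg *m)
    {
        m->next = NULL;
        pthread_mutex_lock(&q->lock);
        if (q->tail) q->tail->next = m; else q->head = m;
        q->tail = m;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    /* One sender thread per connection; the receive side would mirror this. */
    void *sender_thread(void *arg)
    {
        struct send_queue *q = arg;
        for (;;) {
            pthread_mutex_lock(&q->lock);
            while (q->head == NULL)
                pthread_cond_wait(&q->nonempty, &q->lock);
            struct queued_msg *m = q->head;
            q->head = m->next;
            if (q->head == NULL) q->tail = NULL;
            pthread_mutex_unlock(&q->lock);

            ssize_t sent = send(q->sockfd, m->buf, m->len, 0);
            if (m->on_done)
                m->on_done(sent == (ssize_t)m->len);
            free(m->buf);                     /* queue owns the message once enqueued */
            free(m);
        }
        return NULL;
    }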

Dynamic Loading
- Simple case: do the same thing that ld.so does to load shared libraries.
- If a required library is not present, the controller can be queried for it.
- Full programs can be written to /tmp and run with execve().
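For the full-program case, the flow might look like the sketch below: write the received code to a file under /tmp, mark it executable, then fork and execve() it. The function name, the /tmp path, the 0700 mode, and the minimal error handling are illustrative assumptions.

    /* Hypothetical sketch: receive a full program, write it to /tmp, run it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int run_received_program(const char *path, const void *code, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0700);
        if (fd < 0 || write(fd, code, len) != (ssize_t)len) {
            perror("writing received program");
            return -1;
        }
        close(fd);

        pid_t pid = fork();
        if (pid == 0) {                       /* child: replace ourselves with the program */
            char *argv[] = { (char *)path, NULL };
            char *envp[] = { NULL };
            execve(path, argv, envp);
            _exit(127);                       /* execve only returns on failure */
        }

        int status;                           /* parent: wait for the task to finish */
        waitpid(pid, &status, 0);
        return status;
    }

A client would call something like run_received_program("/tmp/task_prog", buf, buflen) after receiving the program bytes from the controller.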

Accomplishments
- The network API works until one side disconnects.
- A basic proof-of-concept using matrix multiplication.
- The library, job queue, and dynamic loading were not implemented due to time constraints.
It turns out that networking is more complicated and difficult than I originally thought it would be. However, with the network back end mostly in place and functional, the rest should be fairly easy to accomplish (shared library loading code can be borrowed from glibc's ld.so).

Results
Computation Time for Multiplying an NxN Matrix with M CPUs

    Matrix Size (N) |    2 CPUs      4 CPUs      6 CPUs      8 CPUs     10 CPUs
    ----------------+--------------------------------------------------------------
                 32 | 0.012363022 0.030523271 0.010454084 0.008981629 0.007652233
                 64 | 0.058882492 0.019197998 0.019730628 0.020511414 0.211174583
                 96 | 0.084705421 0.069359721 0.077731055 0.077675274 0.078329731
                128 | 0.135407848 0.127247027 0.128140875 0.126661721 0.250957224
                160 | 0.181245299 0.208812409 0.211340652 0.175471061 0.181304835
                192 | 0.307811235 0.339090654 0.336397926 0.302383864 0.414096991
                224 | 0.484918176 0.495227919 0.479458751 0.479369095 0.480608001

This table shows the amount of time that the framework took to multiply an NxN matrix with varying numbers of clients (CPUs). The data is fairly inconsistent, which I believe is due to uncontrollable events in the network itself. With such small matrix sizes, the network overhead is hard to filter out.

Results (continued)
Average Computation Time per CPU for Multiplying an NxN Matrix with M CPUs

    Matrix Size (N) |    2 CPUs      4 CPUs      6 CPUs      8 CPUs     10 CPUs
    ----------------+--------------------------------------------------------------
                 32 | 0.000166459 0.000082589 0.000054977 0.000049028 0.00003304
                 64 | 0.001552289 0.000712410 0.000502733 0.000339796 0.000288078
                 96 | 0.004982714 0.002399564 0.001663380 0.001167981 0.000996738
                128 | 0.011975964 0.005768813 0.003909034 0.002980716 0.002323040
                160 | 0.022497528 0.010967518 0.007415074 0.005558941 0.004402479
                192 | 0.039112715 0.019328565 0.012831588 0.009577841 0.007709864
                224 | 0.060638093 0.030214250 0.020178754 0.014845676 0.012002253

This table shows the average time per CPU it took to multiply an NxN matrix for varying numbers of clients (CPUs). The average time decreases as the number of CPUs increases, which coincides with the idea that parallelism speeds up the operation.

Future Work
- Increase the maximum message size.
- Utilize multiple CPU cores on systems with multi-core CPUs.
- Add more functions to the library.
I discovered during testing that the way in which the protocol is defined severely limits the maximum message size. A simple approach to multiple-core utilization could be to run one task on each core. I would also like to add more high-level library functions (something along the lines of a loadLibrary() or a sendData()) that abstract the process of message creation.

Special Thanks
- David Costello and James Roche, for giving me some computers to use
- Professor Kai Shen, for taking the time to meet with me and answer my questions
- Professor Chris Brown, for directing me to where I could find resources

References
Apache Spark. (2014). Web. Retrieved from http://spark.apache.org/
How BOINC works. (2014). Web. Retrieved from http://boinc.berkeley.edu/wiki/How_BOINC_works
Message Passing Interface. (2014). Web. Retrieved from http://en.wikipedia.org/wiki/Message_Passing_Interface
mpiexec(1) man page. (2014). Web. Retrieved from https://www.open-mpi.org/doc/v1.8/man1/mpiexec.1.php
OpenMP. (2014). Web. Retrieved from http://en.wikipedia.org/wiki/OpenMP
Running Spark on EC2. (2014). Web. Retrieved from http://spark.apache.org/docs/latest/ec2-scripts.html