Using the Parallel Universe beyond MPI

Presentation on theme: "Using the Parallel Universe beyond MPI"— Presentation transcript:

1 Using the Parallel Universe beyond MPI

2 Parallel Universe applications using Metronome
Metronome’s support for running parallel jobs builds on Condor’s Parallel Universe. It is possible to run coordinated Metronome jobs on multiple machines at the same time, with communication available between them. This provides advanced testing opportunities. Some examples: client/server, cross-platform, compatibility, and stress/scalability testing.

Metronome leverages Condor’s Parallel Universe to run parallel jobs, so we now have the ability to run Metronome jobs on multiple machines at the same time. You can see how this expands our testing capabilities, especially for service-type testing and continuous integration.

3 Service testing challenges
Starting multiple services on the same machine does not allow for testing across a network or on different platforms. Deciding when to start the services and when to start the tests requires human intervention. Setting up the services is usually a manual process (or the testing simply doesn’t happen), and the same goes for tearing down the services to return the machines to their original state.

Here are some of the challenges we face when running service-type testing. Running scalability or stress tests is possible using one machine, but this doesn’t allow for testing across a network or on different platforms. The timing and synchronization of the services is a manual process, and even with two or more machines the setup and teardown of services often requires human intervention.

4 Benefits of using Metronome
Condor manages dynamic claiming of resources, communication between job nodes, and cleaning up after the jobs run. Metronome publishes basic information about each task to the job ad, where it is accessible by any node, acting as a “scratch space” for the job. The hostnames of all job nodes and the start time, return code, and end time for each task on each node are published to this shared job ad. This information is useful for communication between nodes and for synchronization in the user’s glue scripts.

Using Metronome to run your tests provides some management: Condor handles the underlying details of running the parallel jobs, and basic information is published to the job ad so that any node of the job may access it. This includes useful information such as the hostnames of all of the nodes in the parallel job, and the start time, return code, and end time for each task on each node. The user may then use this information to assist with synchronization in their glue scripts or for inter-node communication.
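To make this concrete, here is a rough sketch (not from the slides) of how a glue script might read the shared job ad using HTCondor’s condor_chirp tool. The attribute names below are hypothetical placeholders, since the slides do not show the names Metronome actually publishes, and condor_chirp has to be enabled for the job.

# Hypothetical sketch: reading shared job-ad information from a glue script.
# Assumes condor_chirp is available on the execute node; the attribute names
# (ParallelHosts, Task0ReturnCode) are placeholders, not Metronome's real names.
import subprocess

def get_job_attr(name):
    # Ask for a job-ad attribute; return None if it is not set.
    result = subprocess.run(["condor_chirp", "get_job_attr", name],
                            capture_output=True, text=True)
    value = result.stdout.strip()
    return value if result.returncode == 0 and value else None

hosts = get_job_attr("ParallelHosts")    # hostnames of all job nodes
rc = get_job_attr("Task0ReturnCode")     # return code of a task on node 0
print("job nodes:", hosts, "- node 0 task return code:", rc)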

5 Client/server test example
[Diagram: a parallel job is submitted from the submit node and runs on two execute nodes, one acting as the server and one as the client.]
SERVER (Execute Node 0): start the server, send the port to the client, handle client requests, poll for the ALLDONE message from the client, then exit.
CLIENT (Execute Node 1): discover the server hostname and port, start the client, run queries against the server, send the ALLDONE message to the server, then exit.
As an example I’m going to describe a client/server test and walk through the steps. The port and ALLDONE messages use Metronome/Condor to send the information via the job ad.
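Here is a rough sketch of how the two glue scripts might pass the port and the ALLDONE message through the shared job ad. The use of condor_chirp and the attribute names (ServerPort, ClientAllDone) are assumptions made for the example; the slides only say that the information travels through the job ad.

# Hypothetical helpers for the port/ALLDONE handshake via the shared job ad.
# condor_chirp usage and the attribute names are assumptions, not from the slides.
import subprocess
import time

def set_job_attr(name, value):
    # Publish an attribute to the job ad so the other node can see it.
    subprocess.run(["condor_chirp", "set_job_attr", name, str(value)], check=True)

def get_job_attr(name):
    result = subprocess.run(["condor_chirp", "get_job_attr", name],
                            capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else ""

def wait_for_attr(name, timeout=600, interval=5):
    # Poll the job ad until the attribute shows up or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        value = get_job_attr(name)
        if value:
            return value
        time.sleep(interval)
    raise RuntimeError("timed out waiting for " + name)

# Server glue script (node 0): set_job_attr("ServerPort", 9000), handle requests,
# then wait_for_attr("ClientAllDone") before shutting down and exiting.
# Client glue script (node 1): port = wait_for_attr("ServerPort"), run queries
# against the server on that port, then set_job_attr("ClientAllDone", 1).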

6 How to submit a parallel job in Metronome
Several minor modifications to the Metronome submit file are necessary to submit a parallel job. The list of platforms is the normal comma-separated list, but with parentheses around the outside:
Platforms = (x86_rhas_3, x86_rhas_4)
As you can see, this example uses two platforms.

7 Parallel job submit files continued
Add a glue script for each task/node combination to be executed remotely:
platform_pre_0 = client/platform_pre
platform_pre_1 = server/platform_pre
remote_declare_0 = client/remote_declare
remote_declare_1 = server/remote_declare
remote_task_0 = client/remote_task
remote_task_1 = server/remote_task
remote_task_args_0 = 9000
remote_task_args_1 = 9001
… and so forth for all glue scripts.
You can see that for each task hook, you must specify a glue script to execute on each node. It’s OK to specify a no-op script in any of these cases. For example, if you have a list of client tasks that are created in the remote_declare step, but the server operations are all done in remote_task so the server needs no remote_declare step, you would add a no-op remote_declare script for the server node.
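Since a no-op script can stand in for any hook, a trivial one might look like the sketch below. This assumes a glue script is simply an executable whose exit status reports success; the server/remote_declare path is just the example from the listing above.

#!/usr/bin/env python
# Hypothetical no-op glue script (e.g. server/remote_declare): the server does
# all of its work in remote_task, so this hook simply reports success and exits.
import sys
sys.exit(0)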

8 Other parallel job use cases
Cross-platform testing (Linux to Solaris), scalability/stress testing (one server, many clients), and compatibility testing (cross-version, stable vs. development series).
Other types of testing are possible with parallel jobs. Testing a Linux client against a Solaris server is one possibility. Stress or scalability testing is also easily accomplished, as is compatibility testing across versions, for example the stable series against the development series.

9 For more information
Documentation is available on the NMI site. See the documentation for information on running parallel jobs using Metronome; it also describes how to set up your own Metronome installation for running parallel jobs.
For more information on how to run parallel jobs using Metronome, see the documentation link listed here. This documentation also includes notes on setting up your own Metronome pool for running parallel jobs. Any questions? Please feel free to come find me and ask at any point.

