Using the Parallel Universe beyond MPI

Parallel Universe applications using Metronome
Metronome's support for running parallel jobs builds on Condor's Parallel Universe. This makes it possible to run coordinated Metronome jobs on multiple machines at the same time, with communication available between them, which opens up advanced testing opportunities, especially for service-type testing and continuous integration. Some examples: client/server, cross-platform, compatibility, and stress/scalability testing.

Service testing challenges
- Starting multiple services on the same machine makes stress or scalability tests possible, but does not allow testing across a network or across different platforms.
- Deciding when to start the services and when to start the tests requires human intervention; the timing and synchronization of the services is a manual process.
- Setting up the services is also usually manual (or the testing simply isn't done), and the same goes for tearing the services down to return the machines to their original state. Even with two or more machines, setup and teardown often require a human in the loop.

Benefits of using Metronome
- Condor handles the underlying details of running the parallel job: dynamic claiming of resources, communication between job nodes, and cleaning up after the jobs run.
- Metronome publishes basic information about each task to the job ad, where it is accessible to any node; the job ad thus acts as a "scratch space" for the job.
- The hostnames of all job nodes, along with the start time, return code, and end time of each task on each node, are published to this shared job ad.
- Users can draw on this information for inter-node communication and for synchronization in their glue scripts, as in the sketch below.
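The sketch below shows how a glue script might read and write these shared job-ad attributes. It is illustrative only: it assumes condor_chirp is available on the PATH inside the job (Metronome's actual helper for job-ad access may differ), and the attribute name ServerPort is hypothetical.

    # Hypothetical helpers for job-ad "scratch space" access from a glue script.
    import subprocess
    import time

    def set_job_attr(name, value):
        # Publish an attribute to the shared job ad via condor_chirp.
        subprocess.check_call(["condor_chirp", "set_job_attr", name, str(value)])

    def get_job_attr(name):
        # Read an attribute back from the shared job ad; None if not yet set.
        try:
            out = subprocess.check_output(["condor_chirp", "get_job_attr", name])
            return out.decode().strip()
        except subprocess.CalledProcessError:
            return None

    def wait_for_attr(name, interval=5):
        # Poll until another node of the parallel job publishes the attribute.
        while True:
            value = get_job_attr(name)
            if value not in (None, "", "UNDEFINED"):
                return value
            time.sleep(interval)

    # e.g. on the server node:  set_job_attr("ServerPort", 9000)
    # e.g. on the client node:  port = int(wait_for_attr("ServerPort"))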

Client/server test example
The slide's diagram walks through the steps of a client/server test across a two-node parallel job:
- Submit node: submits the parallel job.
- Execute node 0 (server): start the server, send its port to the client, handle client requests, poll for ALLDONE from the client, exit.
- Execute node 1 (client): discover the server's hostname and port, start the client, run queries against the server, send the ALLDONE message to the server, exit.
Both the port number and the ALLDONE message are passed through the job ad, using the Metronome/Condor mechanism described above.
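Here is a minimal sketch of the server node's side of this handshake, coordinating through the job ad. As before, condor_chirp on the PATH is an assumption, the attribute names ServerHost, ServerPort, and AllDone are hypothetical, and the one-line echo protocol stands in for a real service.

    # Sketch of the server glue logic (execute node 0).
    import socket
    import subprocess

    def chirp(*args):
        # Thin wrapper around condor_chirp for job-ad access (assumed available).
        return subprocess.check_output(["condor_chirp"] + list(args))

    def client_is_done():
        # True once the client node has published AllDone to the job ad.
        try:
            return chirp("get_job_attr", "AllDone").strip() == b"true"
        except subprocess.CalledProcessError:
            return False  # attribute not published yet

    # Bind an ephemeral port and publish the contact info to the shared job ad.
    sock = socket.socket()
    sock.bind(("", 0))
    sock.listen(1)
    sock.settimeout(5)
    chirp("set_job_attr", "ServerHost", '"%s"' % socket.gethostname())
    chirp("set_job_attr", "ServerPort", str(sock.getsockname()[1]))

    # Serve requests until the client signals it is finished.
    while not client_is_done():
        try:
            conn, _ = sock.accept()
            conn.sendall(conn.recv(1024))  # toy protocol: echo one request back
            conn.close()
        except socket.timeout:
            continue

The client node mirrors this: it polls get_job_attr for ServerHost and ServerPort, runs its queries against that address, and finally publishes AllDone with set_job_attr before exiting.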

How to submit a parallel job in Metronome
Several minor modifications to the Metronome submit file are necessary for parallel jobs. The list of platforms is the normal comma-separated list, but with parentheses around the outside. This example uses two platforms:
Platforms = (x86_rhas_3, x86_rhas_4)

Parallel job submit files continued
Add a glue script for each task/node combination to be executed remotely:
platform_pre_0 = client/platform_pre
platform_pre_1 = server/platform_pre
remote_declare_0 = client/remote_declare
remote_declare_1 = server/remote_declare
remote_task_0 = client/remote_task
remote_task_1 = server/remote_task
remote_task_args_0 = 9000
remote_task_args_1 = 9001
... and so forth for all glue scripts. For each task hook, you must specify a glue script to execute on each node, and it is fine for any of these to be a no-op script. For example, if the client's list of tasks is created in its remote_declare step, but the server does all of its work in remote_task and needs no remote_declare step, you would point the server's remote_declare entry at a no-op script. A consolidated sketch of such a submit file follows.
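Putting the pieces together, a two-node parallel submit file might look like the sketch below. It is a sketch only: the project and component entries, the '#' comment syntax, and the no-op script path are assumptions about a typical Metronome submit file, not verbatim from the documentation.

    # Hypothetical submit file for a two-node client/server test
    project = my-service-test
    component = client-server-demo
    Platforms = (x86_rhas_3, x86_rhas_4)

    # Node 0 runs the client, node 1 runs the server.
    platform_pre_0 = client/platform_pre
    platform_pre_1 = server/platform_pre
    remote_declare_0 = client/remote_declare
    remote_declare_1 = server/noop_remote_declare
    remote_task_0 = client/remote_task
    remote_task_1 = server/remote_task
    remote_task_args_0 = 9000
    remote_task_args_1 = 9001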

Other parallel job use cases
- Cross-platform testing (e.g., a Linux client against a Solaris server)
- Scalability/stress testing (one server, many clients)
- Compatibility testing (cross-version, e.g., stable vs. development series)
For cross-platform runs, the platform list simply names the platforms involved; see the snippet below.
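For instance, a Linux-versus-Solaris test would name both platforms in the submit file; the Solaris platform name here is a guess at the NMI naming convention:

    Platforms = (x86_rhas_3, sun4u_sol_5.9)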

For more information
Documentation is available on the NMI site:
http://nmi.cs.wisc.edu/node/1001 describes how to run parallel jobs using Metronome.
http://nmi.cs.wisc.edu/node/282 describes how to set up your own Metronome installation for running parallel jobs.
Any questions? Please feel free to come find me and ask at any point.