Master Control Program Subha Sivagnanam SDSC

Master Control Program
Provides automatic resource selection for running a single parallel job on HPC resources. MCP uses directives in batch submission scripts to submit the job to the queues of multiple resources, e.g.:
#MCP submit_host
#MCP username
#MCP scratch_dir
As soon as the job starts to run on one of the resources, MCP removes the job from all other resources' queues.
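A minimal sketch of how the #MCP directives might be read from a job script. The parse_mcp_directives helper is an assumption made for illustration; only the directive names shown in the slides are taken from MCP itself.

# Hypothetical sketch: read "#MCP <key> <value>" lines from a job script.
def parse_mcp_directives(script_path):
    directives = {}
    with open(script_path) as f:
        for line in f:
            if line.startswith("#MCP"):
                parts = line.split(None, 2)   # "#MCP", key, value
                if len(parts) == 3:
                    directives[parts[1]] = parts[2].strip()
    return directives

# Example: parse_mcp_directives("ncsa.job") might yield
# {'qtype': 'pbs', 'submit_host': 'tg-login.ncsa.teragrid.org', ...}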

Assumptions
– The user compiles the application on each desired machine.
– Input is staged on the remote clusters (see the staging sketch after this list).
– Submission is initiated from only one machine.
MCP can be initiated by:
– using mcp.py, with manually created job scripts
– using fullauto.py, which automates job script creation based on desired attributes
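A minimal sketch, assuming scp access, of staging an input file to each cluster's scratch directory before submission. The file name, hostnames, and paths below are placeholders, not values prescribed by MCP.

import subprocess

# Hypothetical staging step: copy the input file to each cluster's scratch
# directory. Usernames, hosts, and directories are placeholders.
clusters = [
    ("your_username", "tg-login.ncsa.teragrid.org", "/home/ncsa/your_username/info/mcp/test/mcp"),
    ("your_username", "another.cluster.example.org", "/scratch/your_username/mcp"),
]
for user, host, scratch_dir in clusters:
    subprocess.run(["scp", "input.dat", f"{user}@{host}:{scratch_dir}/"], check=True)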

MCP flow
Establish a grid credential (grid-proxy-init or myproxy-get-delegation).
Write a job script for each resource.
Example – NCSA job script:
#!/bin/ksh
#MCP qtype pbs
#MCP submit_host tg-login.ncsa.teragrid.org
#MCP username your_username
#MCP scratch_dir /home/ncsa/your_username/info/mcp/test/mcp
#PBS -l walltime=00:05:00,nodes=4:ppn=2:compute
#PBS -d /home/ncsa/your_username/info/mcp/test/run
NPROCS=`wc -l < $PBS_NODEFILE`
/usr/local/mpich/mpich-gm intel-r2/bin/mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS /home/ncsa/your_username/testprog/ring26 -t 10 -n 2 -l 10 -i
#/bin/sleep 900

The user runs mcp.py with the job files as input:
./mcp.py [--debug] <job files>
MCP submits the job to all clusters and monitors them for job start.
Once one job starts, MCP cancels all the other jobs (see the sketch of this race below).
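A minimal sketch of the submit-everywhere, cancel-the-rest behaviour described above. The submit_job, job_started, and cancel_job helpers are assumptions standing in for whatever queue commands MCP actually wraps (e.g. qsub/qstat/qdel for PBS).

import time

# Hypothetical sketch of MCP's core race: submit to every cluster, wait for
# the first job to start running, then cancel the rest.
def run_first_to_start(job_scripts, submit_job, job_started, cancel_job, poll_seconds=30):
    # job_scripts maps host -> job script path; helpers are placeholders.
    job_ids = {host: submit_job(host, script) for host, script in job_scripts.items()}
    while True:
        for host, job_id in job_ids.items():
            if job_started(host, job_id):
                # One job is running; remove the job from every other queue.
                for other_host, other_id in job_ids.items():
                    if other_host != host:
                        cancel_job(other_host, other_id)
                return host, job_id
        time.sleep(poll_seconds)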

Fullauto flow
The user runs grid-proxy-init or myproxy-get-delegation to establish a grid credential.
autojob.py is created with personalized settings, e.g.:
match_attributes = {
    'CPU_MODEL'     : ['==', 'ia64'],
    'CPU_MEMORY_GB' : ['>=', 2],
    'CPU_MHZ'       : ['>=', 1300],
    'CPU_SMP'       : ['>=', 2],
    'NODECOUNT'     : ['>=', 128],
}
machine_dict_list = [
    {
        'HOSTNAME' : 'tg-login.ncsa.teragrid.org',
        'substitutes_dict' : {
            'arguments' : ['-t', '100', '-n', '10', '-l', '4000', '-i', ' ', '-c', '0', '-s', '0'],
            'wallclock_seconds' : '300',
            '__MCP_SHELL__' : '/bin/ksh',
            '__MCP_PARALLEL_RUN__' : '/usr/local/mpich/mpich-gm b-intel-r2/bin/mpirun',
            '__MCP_SERIAL_RUN__' : '#',
            '__MCP_NODES__' : '4',
            '__MCP_CPUS_PER_NODE__' : '2',
            '__MCP_USERNAME__' : 'your_username',
            '__MCP_SCRATCH_DIR__' : '/home/ncsa/your_username/info/mcp/test/mcpdata',
            '__MCP_JOB_DIR__' : '/home/ncsa/your_username/info/mcp/test/run',
            '__MCP_EXECUTABLE__' : '/home/ncsa/your_username/testprog/ring26',
        },
    },
]
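A minimal sketch, under the assumption that the __MCP_* keys are placeholders substituted into a job-script template. The template text and the fill_template helper are illustrative only, not fullauto's actual implementation.

# Hypothetical illustration of how the __MCP_* entries could be substituted
# into a job-script template; the template below is invented for this example.
template = """#!__MCP_SHELL__
#MCP username __MCP_USERNAME__
#MCP scratch_dir __MCP_SCRATCH_DIR__
#PBS -l nodes=__MCP_NODES__:ppn=__MCP_CPUS_PER_NODE__
#PBS -d __MCP_JOB_DIR__
NPROCS=`wc -l < $PBS_NODEFILE`
__MCP_PARALLEL_RUN__ -machinefile $PBS_NODEFILE -np $NPROCS __MCP_EXECUTABLE__
"""

def fill_template(template, substitutes_dict):
    # Only the string-valued __MCP_* keys are template tokens; entries such as
    # 'arguments' and 'wallclock_seconds' are left to other parts of fullauto.
    script = template
    for key, value in substitutes_dict.items():
        if key.startswith('__MCP_'):
            script = script.replace(key, value)
    return script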

The user runs fullauto.py with autojob.py as the input:
fullauto.py --autojobfile=
Fullauto finds matching clusters from the allowable list of resources (automachine.py) and creates a job script for each selected cluster (see the matching sketch below).
Fullauto then uses MCP to run the scripts.
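A minimal sketch of how the match_attributes comparisons could be applied to per-machine attributes such as those listed in automachine.py. The machine attribute values and the machine_matches helper are assumptions for illustration, not fullauto's actual code.

import operator

# Hypothetical matcher: keep only machines whose advertised attributes satisfy
# every ['op', value] requirement in match_attributes. The operator strings
# mirror those shown in the autojob.py example above.
OPS = {'==': operator.eq, '>=': operator.ge, '<=': operator.le, '>': operator.gt, '<': operator.lt}

def machine_matches(machine_attrs, match_attributes):
    for attr, (op, required) in match_attributes.items():
        if attr not in machine_attrs or not OPS[op](machine_attrs[attr], required):
            return False
    return True

# Example with invented attribute values:
# machine_attrs = {'CPU_MODEL': 'ia64', 'CPU_MEMORY_GB': 4, 'CPU_MHZ': 1500,
#                  'CPU_SMP': 2, 'NODECOUNT': 631}
# machine_matches(machine_attrs, match_attributes)  -> True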

Resources available
Via fullauto.py --attributes or from automachine.py:

Resource Name                    Location
Queen Bee (Dell IA64 cluster)    LONI
Mercury (Intel IA64 cluster)     NCSA
Abe (Dell Intel IA64 cluster)    NCSA
Lonestar (Dell 1955 cluster)     TACC
Steele (Dell 1950 cluster)       Purdue