Job Submission
Andrew Pangborn & Myles Maxfield
Rochester Institute of Technology, Service Oriented Cyberinfrastructure Lab
01/19/09
The Grid
Virtual organizations spanning multiple administrative domains:
– Different organizations and administrators
– Different hardware
– Different queuing systems
How do we make sense of it all?
The Problem
At one end are computing resources (the grid fabric) managed by batch queuing systems and middleware.
At the other end are end users and their jobs/applications.
We need software and protocols for submitting jobs to the computing resources.
We also want to monitor jobs after submission and schedule them efficiently to achieve high throughput.
Grid Architecture
[Figure: grid architecture diagram with job submission indicated; image from Ian Foster's paper "The Anatomy of the Grid"]
Batch Queuing Systems
Submitting a job directly to the batch queuing system
One or more queues
– Priorities
Two common architectures
– Client/server
– Dynamic offloading
User credential (delegation)
Jobs have states (e.g. Pending, Running)
Batch Queuing Systems
Important examples:
– Portable Batch System (PBS)
– TORQUE
– Xgrid
– Sun Grid Engine
– Load Sharing Facility (LSF)
– Condor
Portable Batch System (PBS)
Originally developed for NASA
Client/server architecture
– Server daemon: pbs_server
– Client (execution daemon on each compute node): pbs_mom
Works with MPI through environment variables made available to the job's shell script (a sketch follows)
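As a concrete illustration of the MPI integration mentioned above, a minimal PBS job script might look like the following. The resource values, queue name, and mpirun invocation are illustrative assumptions, not taken from the slides.

#!/bin/sh
#PBS -N mpi_test               # job name
#PBS -l nodes=2:ppn=4          # request 2 nodes with 4 processors each (illustrative)
#PBS -l walltime=00:10:00      # wall-clock limit
#PBS -q batch                  # target queue

cd $PBS_O_WORKDIR                           # directory the job was submitted from
NP=$(wc -l < $PBS_NODEFILE)                 # PBS writes the allocated node list to $PBS_NODEFILE
mpirun -np $NP -machinefile $PBS_NODEFILE ./my_mpi_program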
PBS Example
cat test.sh
#!/bin/sh
#testpbs
echo This is a test
echo today is `date`
echo This is `hostname`
echo The current working directory is `pwd`
ls -alF /home
uptime
PBS Example
qsub test.sh
6.gras.carrion.rit.edu

qstat
Job id   Name      User      Time Use  S  Queue
gras     test.sh   litherum  00:00:00  C  batch

cat test.sh.o6
This is a test
today is Sat Jan 17 18:20:20 EST 2009
This is carrion02
The current working directory is /home/litherum
total 20
drwxr-xr-x 31 litherum litherum 4096 Jan 17 18:19 litherum/
18:20:20 up 131 days, 21:20, 0 users, load average: 0.00, 0.00,
TORQUE
Built on top of PBS (derived from the original OpenPBS code base)
Supports reservations, where specific resources can be reserved for specific times
Supports partitions, where a cluster can be partitioned into smaller sub-clusters
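For reference, the kinds of resource requests TORQUE schedules against are usually expressed on the qsub command line. The node counts, walltime, and script name below are illustrative assumptions.

qsub -l nodes=2:ppn=4,walltime=00:30:00 test.sh   # request 2 nodes, 4 processors each, for 30 minutes
qstat -q                                          # list the configured queues and their limits
pbsnodes -a                                       # list all compute nodes and their states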
TORQUE
showq

ACTIVE JOBS
JOBNAME  USERNAME  STATE  PROC  REMAINING  STARTTIME
0 Active Jobs    0 of 4 Processors Active (0.00%)    0 of 2 Nodes Active (0.00%)

IDLE JOBS
JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME
0 Idle Jobs

BLOCKED JOBS
JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME

Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0
Xgrid
Apple's distributed computing system
Conceptually similar to Condor
Has a GUI! =)
Client/server model
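For a feel of the command-line side (alongside the GUI), a submission through Apple's xgrid client tool looks roughly like this. The controller hostname and password are placeholders, and the exact flags should be checked against the xgrid man page rather than taken from these slides.

xgrid -h controller.example.org -p secret -job submit /usr/bin/cal 2009   # run "cal 2009" on the grid
xgrid -h controller.example.org -p secret -job results -id 1              # fetch the output of job 1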
Sun Grid Engine
Open source, like everything new Sun puts out
Supports:
– Reservations
– Job dependencies
– Checkpointing
– Multiple scheduling algorithms
– Web interface
Professional!
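As a sketch of how the job-dependency support looks in practice (the job names and script files are assumptions, not from the slides), SGE's qsub can chain jobs with -hold_jid:

qsub -N prep preprocess.sh                    # submit a preprocessing job under the name "prep"
qsub -N analyze -hold_jid prep analyze.sh     # hold the analysis job until "prep" has completed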
Middleware
These queuing systems are hard to use, and a given grid may employ many different ones.
Wouldn't it be nice if all of this were unified in a single implementation?
Middleware that handles job submission in a virtual organization, across resources spread throughout multiple administrative domains, would be very useful!
Condor
A tool for pooling and "scavenging" computing resources and distributing jobs
Similar to a batch queuing system [2]:
– job management
– scheduling policy
– priority scheme
– resource monitoring
– resource management
Also focuses on high throughput and "opportunistic computing" [2]:
– utilize computing resources whenever they are available
Condor Universes [1]
Standard
– Checkpointing, fault tolerance
– Jobs must be linked against the Condor libraries
Vanilla
– Simpler; runs ordinary binaries (they do not need to be "condor compiled")
– No support for partial execution (checkpointing) or job relocation
Others
– PVM
– MPI
– Java
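To use the standard universe, an existing program is relinked with condor_compile; a vanilla-universe job skips this step. The source and binary names below are just examples.

condor_compile gcc -o hello hello.c   # relink against the Condor libraries for the standard universe
gcc -o hello hello.c                  # an ordinary build is enough for the vanilla universe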
Condor Submission File Example [1]
# hello.sub
# condor job file example
Universe   = Vanilla
Executable = hello
Output     = hello.out
Input      = hello.in
Error      = hello.err
Log        = hello.log
Queue
Some Condor Commands [5]
condor_submit – Submit a Condor job
condor_q – View the Condor job queue
condor_status – View the status of the machines in the Condor pool
condor_compile – Re-links jobs for use in the standard universe
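Putting these commands together with the hello.sub file from the previous slide, a typical session might look like the following (output omitted):

condor_submit hello.sub   # submit the job described in hello.sub
condor_q                  # watch the job move through the queue
condor_status             # see which machines in the pool can run it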
Condor Job Structures
Programming models for larger-scale jobs using Condor:
Master-Worker
– A single master process coordinates all the independent tasks
– Collects results as workers finish and distributes new tasks to them
DAG (Directed Acyclic Graph)
– Jobs with ordering dependencies, managed by Condor's DAGMan (see the sketch after this slide)
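As a small sketch of the DAG structure (the node names and submit files are assumptions, not from the slides), a DAGMan input file chains Condor jobs like this:

# diamond.dag – each JOB line names a DAG node and its Condor submit file
JOB A prepare.sub
JOB B analyze_left.sub
JOB C analyze_right.sub
JOB D collect.sub
# B and C run after A finishes; D runs after both B and C finish
PARENT A CHILD B C
PARENT B C CHILD D

It would be submitted with condor_submit_dag diamond.dag, which runs DAGMan itself as a Condor job to enforce the ordering.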
GRAM [4]
Globus Resource Allocation Manager (GRAM)
– Resource allocation
– Process creation
– Monitoring
– Management
Maps requests expressed in a Resource Specification Language (RSL) into commands to local schedulers and computers.
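For flavor, a request in the classic (pre-web-services) RSL syntax looks roughly like the following; the attribute values are illustrative assumptions rather than a tested job.

& (executable = /bin/hostname)
  (count = 2)
  (maxWallTime = 10)
  (queue = batch)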
GRAM
Pluggable! (they can't seem to make up their mind how to describe jobs)
Will submit jobs to:
– Condor
– LSF
– PBS/TORQUE
– and others
Unified interface, plus an identifier for which cluster/service to use
Jobs are described in a job submission file
GRAM Example
globusrun-ws -submit -factory 44/wsrf/services/ManagedJobFactoryService -factory-type PBS -streaming -job-command /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid: e4f2-11dd-81df bb4e6
Termination time: 01/18/ :57 GMT
Current job state: Pending
Current job state: Active
tg-c15
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
GRAM Input Example
Executable: /bin/echo
Arguments: "this is an example string", "Globus was here"
Standard output: ${GLOBUS_USER_HOME}/stdout
Standard error: ${GLOBUS_USER_HOME}/stderr
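The slide shows the contents of a WS-GRAM job description; written out in the usual GT4 XML job description format it would look roughly like this (treat the exact element layout as an assumption):

<job>
    <executable>/bin/echo</executable>
    <argument>this is an example string</argument>
    <argument>Globus was here</argument>
    <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
    <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
</job>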
Condor-G [4]
Condor-G is a Globus-enabled version of the Condor scheduler.
It uses Globus to handle inter-organizational problems like:
– Security
– Resource management for supercomputers
– Executable staging
The same Condor tools that access local resources are now able to use the Globus protocols to access resources at multiple sites.
It communicates with these resources, and transfers files to and from them, using Globus mechanisms such as:
– GSI for security
– the GRAM protocol for job submission
– GASS for file transfer
Condor-G can be used to submit jobs to systems managed by Globus, and Globus tools can be used to submit jobs to systems managed by Condor.
Condor-G
[Figure omitted]
Using Condor-G
Set universe = globus in the submit file.
Also specify the Globus scheduler hostname, for example:
globusscheduler = example.org/jobmanager
Jobs are still submitted with the usual condor_submit command (see the sketch below).
A TeraGrid Condor-G example is available online.
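Combining these settings with the earlier hello.sub example gives a sketch like the following; the hostname and jobmanager name are placeholders, not real endpoints.

# hello-g.sub – submit "hello" through Condor-G to a Globus-managed PBS cluster
universe        = globus
globusscheduler = example.org/jobmanager-pbs
executable      = hello
output          = hello.out
error           = hello.err
log             = hello.log
queue

As before, it would be submitted with condor_submit hello-g.sub.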
UNICORE
Alternative to Globus, used primarily in Europe
Uses web services, similar to GT4
Has a GUI
Jobs are described as Abstract Job Objects
User -> Server -> Virtual Site
Security via X.509 certificates and SSL
UNICORE GUI
[Screenshot omitted]
Upperware
Abstract Job Objects? Workflows? What is all this nonsense?!
The scientist (the primary user) doesn't care about any of this.
They shouldn't have to write XML description files or create a complicated workflow.
Simply let them run their program.
GridShell
Unified command-line interface
Defer to the resident experts
References
1. Getting Started with Condor.
2. Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed Computing in Practice: The Condor Experience.
3. ...ubmission.ppt
4. ...usagescenarios-jdd
5. Wikipedia.
Image: ...ce_Shell_by_Shukhov_in_Vyksa_1897_shell.jpg (Wikipedia)