Job Submission
Andrew Pangborn & Myles Maxfield
Rochester Institute of Technology, Service Oriented Cyberinfrastructure Lab
01/19/09
The Grid
Virtual organizations spanning multiple administrative domains:
– Different organizations and administrators
– Different hardware
– Different queuing systems
How do we make sense of it all?
The Problem
At one end are computing resources (the grid fabric) managed by batch queuing systems and middleware.
At the other end are end users and their jobs/applications.
We need software and protocols for submitting jobs to the computing resources.
We also want to monitor jobs after submission and schedule them efficiently to achieve high throughput.
Grid Architecture
[Figure: grid architecture diagram with job submission indicated; image from Ian Foster's paper "The Anatomy of the Grid"]
Batch Queuing Systems
Submitting a job directly to the batch queuing system
One or more queues
– Priorities
Two common architectures
– Client/server
– Dynamic offloading
User credential (delegation)
Jobs have states (e.g. Pending, Running)
Batch Queuing Systems
Important examples:
– Portable Batch System (PBS)
– TORQUE
– Xgrid
– Sun Grid Engine
– Load Sharing Facility (LSF)
– Condor
Portable Batch System (PBS)
Originally developed for NASA
Client/server architecture
– Server daemon: pbs_server
– Client (execution daemon on each compute node): pbs_mom
Works with MPI through environment variables made available to the job's shell script (a sketch follows)
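As a concrete illustration of the MPI integration mentioned above, a minimal PBS job script might look like the following. The resource values, queue name, and mpirun invocation are illustrative assumptions, not taken from the slides.

#!/bin/sh
#PBS -N mpi_test               # job name
#PBS -l nodes=2:ppn=4          # request 2 nodes with 4 processors each (illustrative)
#PBS -l walltime=00:10:00      # wall-clock limit
#PBS -q batch                  # target queue

cd $PBS_O_WORKDIR                           # directory the job was submitted from
NP=$(wc -l < $PBS_NODEFILE)                 # PBS writes the allocated node list to $PBS_NODEFILE
mpirun -np $NP -machinefile $PBS_NODEFILE ./my_mpi_program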
PBS Example
cat test.sh
#!/bin/sh
#testpbs
echo This is a test
echo today is `date`
echo This is `hostname`
echo The current working directory is `pwd`
ls -alF /home
uptime
PBS Example
qsub test.sh
6.gras.carrion.rit.edu

qstat
Job id   Name      User      Time Use  S  Queue
gras     test.sh   litherum  00:00:00  C  batch

cat test.sh.o6
This is a test
today is Sat Jan 17 18:20:20 EST 2009
This is carrion02
The current working directory is /home/litherum
total 20
drwxr-xr-x 31 litherum litherum 4096 Jan 17 18:19 litherum/
18:20:20 up 131 days, 21:20, 0 users, load average: 0.00, 0.00,
TORQUE
Built on top of PBS (derived from the original OpenPBS code base)
Supports reservations, where specific resources can be reserved for specific times
Supports partitions, where a cluster can be partitioned into smaller sub-clusters
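For reference, the kinds of resource requests TORQUE schedules against are usually expressed on the qsub command line. The node counts, walltime, and script name below are illustrative assumptions.

qsub -l nodes=2:ppn=4,walltime=00:30:00 test.sh   # request 2 nodes, 4 processors each, for 30 minutes
qstat -q                                          # list the configured queues and their limits
pbsnodes -a                                       # list all compute nodes and their states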
TORQUE
showq

ACTIVE JOBS
JOBNAME  USERNAME  STATE  PROC  REMAINING  STARTTIME
0 Active Jobs    0 of 4 Processors Active (0.00%)    0 of 2 Nodes Active (0.00%)

IDLE JOBS
JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME
0 Idle Jobs

BLOCKED JOBS
JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME

Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0
Xgrid
Apple's distributed computing system
Conceptually similar to Condor
Has a GUI! =)
Client/server model
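For a feel of the command-line side (alongside the GUI), a submission through Apple's xgrid client tool looks roughly like this. The controller hostname and password are placeholders, and the exact flags should be checked against the xgrid man page rather than taken from these slides.

xgrid -h controller.example.org -p secret -job submit /usr/bin/cal 2009   # run "cal 2009" on the grid
xgrid -h controller.example.org -p secret -job results -id 1              # fetch the output of job 1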
Sun Grid Engine
Open source, like everything new Sun puts out
Supports:
– Reservations
– Job dependencies
– Checkpointing
– Multiple scheduling algorithms
– Web interface
Professional!
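As a sketch of how the job-dependency support looks in practice (the job names and script files are assumptions, not from the slides), SGE's qsub can chain jobs with -hold_jid:

qsub -N prep preprocess.sh                    # submit a preprocessing job under the name "prep"
qsub -N analyze -hold_jid prep analyze.sh     # hold the analysis job until "prep" has completed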
Middleware
These queuing systems are hard to use, and a given grid may employ many different ones.
Wouldn't it be nice if all of this were unified in a single implementation?
Middleware that handles job submission in a virtual organization, across resources spread throughout multiple administrative domains, would be very useful!
Condor
A tool for pooling and "scavenging" computing resources and distributing jobs
Similar to a batch queuing system [2]:
– job management
– scheduling policy
– priority scheme
– resource monitoring
– resource management
Also focuses on high throughput and "opportunistic computing" [2]:
– utilize computing resources whenever they are available
Condor Universes [1]
Standard
– Checkpointing, fault tolerance
– Jobs must be linked against the Condor libraries
Vanilla
– Simpler; runs ordinary binaries (they do not need to be "condor compiled")
– No support for partial execution (checkpointing) or job relocation
Others
– PVM
– MPI
– Java
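To use the standard universe, an existing program is relinked with condor_compile; a vanilla-universe job skips this step. The source and binary names below are just examples.

condor_compile gcc -o hello hello.c   # relink against the Condor libraries for the standard universe
gcc -o hello hello.c                  # an ordinary build is enough for the vanilla universe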
Condor Submission File Example [1]
# hello.sub
# condor job file example
Universe   = Vanilla
Executable = hello
Output     = hello.out
Input      = hello.in
Error      = hello.err
Log        = hello.log
Queue
Some Condor Commands [5]
condor_submit – Submit a Condor job
condor_q – View the Condor job queue
condor_status – View the status of the machines in the Condor pool
condor_compile – Re-links jobs for use in the standard universe
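Putting these commands together with the hello.sub file from the previous slide, a typical session might look like the following (output omitted):

condor_submit hello.sub   # submit the job described in hello.sub
condor_q                  # watch the job move through the queue
condor_status             # see which machines in the pool can run it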
Condor Job Structures
Programming models for larger-scale jobs using Condor:
Master-Worker
– A single master process coordinates all the independent tasks
– Collects results as workers finish and distributes new tasks to them
DAG (Directed Acyclic Graph)
– Jobs with ordering dependencies, managed by Condor's DAGMan (see the sketch after this slide)
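As a small sketch of the DAG structure (the node names and submit files are assumptions, not from the slides), a DAGMan input file chains Condor jobs like this:

# diamond.dag – each JOB line names a DAG node and its Condor submit file
JOB A prepare.sub
JOB B analyze_left.sub
JOB C analyze_right.sub
JOB D collect.sub
# B and C run after A finishes; D runs after both B and C finish
PARENT A CHILD B C
PARENT B C CHILD D

It would be submitted with condor_submit_dag diamond.dag, which runs DAGMan itself as a Condor job to enforce the ordering.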
GRAM [4]
Globus Resource Allocation Manager (GRAM)
– Resource allocation
– Process creation
– Monitoring
– Management
Maps requests expressed in a Resource Specification Language (RSL) into commands to local schedulers and computers.
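For flavor, a request in the classic (pre-web-services) RSL syntax looks roughly like the following; the attribute values are illustrative assumptions rather than a tested job.

& (executable = /bin/hostname)
  (count = 2)
  (maxWallTime = 10)
  (queue = batch)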
GRAM
Pluggable! (they can't seem to make up their mind how to describe jobs)
Will submit jobs to:
– Condor
– LSF
– PBS/TORQUE
– and others
Unified interface, plus an identifier for which cluster/service to use
Jobs are described in a job submission file
GRAM Example
globusrun-ws -submit -factory 44/wsrf/services/ManagedJobFactoryService -factory-type PBS -streaming -job-command /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid: e4f2-11dd-81df bb4e6
Termination time: 01/18/ :57 GMT
Current job state: Pending
Current job state: Active
tg-c15
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
GRAM Input Example
Executable: /bin/echo
Arguments: "this is an example string", "Globus was here"
Standard output: ${GLOBUS_USER_HOME}/stdout
Standard error: ${GLOBUS_USER_HOME}/stderr
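The slide shows the contents of a WS-GRAM job description; written out in the usual GT4 XML job description format it would look roughly like this (treat the exact element layout as an assumption):

<job>
    <executable>/bin/echo</executable>
    <argument>this is an example string</argument>
    <argument>Globus was here</argument>
    <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
    <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
</job>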
Condor-G [4]
Condor-G is a Globus-enabled version of the Condor scheduler.
It uses Globus to handle inter-organizational problems like:
– Security
– Resource management for supercomputers
– Executable staging
The same Condor tools that access local resources are now able to use the Globus protocols to access resources at multiple sites.
It communicates with these resources, and transfers files to and from them, using Globus mechanisms such as:
– GSI for security
– the GRAM protocol for job submission
– GASS for file transfer
Condor-G can be used to submit jobs to systems managed by Globus, and Globus tools can be used to submit jobs to systems managed by Condor.
Condor-G
[Figure omitted]
Using Condor-G
Set universe = globus in the submit file.
Also specify the Globus scheduler hostname, for example:
globusscheduler = example.org/jobmanager
Jobs are still submitted with the usual condor_submit command (see the sketch below).
A TeraGrid Condor-G example is available online.
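Combining these settings with the earlier hello.sub example gives a sketch like the following; the hostname and jobmanager name are placeholders, not real endpoints.

# hello-g.sub – submit "hello" through Condor-G to a Globus-managed PBS cluster
universe        = globus
globusscheduler = example.org/jobmanager-pbs
executable      = hello
output          = hello.out
error           = hello.err
log             = hello.log
queue

As before, it would be submitted with condor_submit hello-g.sub.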
UNICORE
Alternative to Globus, used primarily in Europe
Uses web services, similar to GT4
Has a GUI
Jobs are described as Abstract Job Objects
User -> Server -> Virtual Site
Security via X.509 certificates and SSL
UNICORE GUI
[Screenshot omitted]
Upperware
Abstract Job Objects? Workflows? What is all this nonsense?!
The scientist (the primary user) doesn't care about any of this.
They shouldn't have to write XML description files or create a complicated workflow.
Simply let them run their program.
GridShell
Unified command-line interface
Defer to the resident experts
References
1. Getting Started with Condor.
2. Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed Computing in Practice: The Condor Experience.
3. ...ubmission.ppt
4. ...usagescenarios-jdd
5. Wikipedia.
Image: ...ce_Shell_by_Shukhov_in_Vyksa_1897_shell.jpg (Wikipedia)