NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER: Using the Batch System at NERSC. Mark Durst, NERSC/USG, ERSUG Training


1 NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Using the Batch System at NERSC
Mark Durst, NERSC/USG
ERSUG Training, Argonne, IL, 28 April 1999

2 Outline
Quick example
How batch processing works
Batch and pipe queues
How to submit jobs
Monitoring jobs
Reminders and Pointers

3
    #!/bin/csh
    #
    # file: simple1
    #
    #QSUB -q serial
    #QSUB -J y        # keep job log
    set myname=`whoami`
    set now=`date`
    set mylocn=`pwd`
    echo ""
    echo "Hello $myname, this is your shell script $0,"
    echo "running at $now."
    echo ""
    echo "Your current directory is $mylocn, which should"
    echo "be the same as $HOME."
    echo ""
    echo "I'm going to sleep now."
    echo ""
    sleep 90
    exit

4
    % cqsub simple1
    Task id t51847 inserted into database nqedb.
    % cqstatl t51847
    -----------------------------
    NQE 3.3.0.9 Database Task Summary
    -----------------------------
    IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
    --------------------  ------- ---------------- -------- ------------------- ----
    t51847                simple1 scheduler.main   mjdurst  NQE Database        NNew
    % cqstatl t51847
    -----------------------------
    NQE 3.3.0.9 Database Task Summary
    -----------------------------
    IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
    --------------------  ------- ---------------- -------- ------------------- ----
    t51847                simple1 scheduler.main   mjdurst  NQE Database        NPend
    % cqstatl t51847
    -----------------------------
    NQE 3.3.0.9 Database Task Summary
    -----------------------------
    IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
    --------------------  ------- ---------------- -------- ------------------- ----
    t51847                simple1 lws.mcurie       mjdurst  NQE Database        NSche
    % cqstatl t51847
    -----------------------------
    NQE 3.3.0.9 Database Task Summary
    -----------------------------
    IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
    --------------------  ------- ---------------- -------- ------------------- ----
    t51847 (49939.mcurie) simple1 lws.mcurie       mjdurst  nqs@mcurie          NSubm

5
    % qstat 49939
    ---------------------------------
    NQS 3.3.0.9 BATCH REQUEST SUMMARY
    ---------------------------------
    IDENTIFIER    NAME    USER     LOCATION/QUEUE        JID  PRTY REQMEM REQTIM ST
    ------------- ------- -------- --------------------- ---- ---- ------ ------ ---
    49939.mcurie  simple1 mjdurst  serial_short@mcurie   3753   25    364   1800 R03
    % qstat 49939
    nqs-100 qstat: CAUTION
    Request : not found.
    % cqstatl t51847
    -----------------------------
    NQE 3.3.0.9 Database Task Summary
    -----------------------------
    IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
    --------------------  ------- ---------------- -------- ------------------- ----
    t51847 (49939.mcurie) simple1 monitor.main     mjdurst  NQE Database        NComp
    % ls -l
    total 12
    -rwxrw-r-- 1 mjdurst mpccc  365 Jan 15 10:47 simple1*
    -rw-r--r-- 1 mjdurst mpccc    0 Jan 15 10:50 simple1.e51847
    -rw-r--r-- 1 mjdurst mpccc 1285 Jan 15 10:50 simple1.l51847
    -rw-r--r-- 1 mjdurst mpccc 2638 Jan 15 10:50 simple1.o51847

6
    % cat simple1.l51847
    01/15 10:48:13 Submitting to queue by
    01/15 10:48:13 Command line options: <-e /u1/mjdurst/tests/bat.simple/simple1.e51847 -J y -j /u1/mjdurst/tests/bat.simple/simple1.l51847 -lM 28mw 28mw -lT 1800 1800 -mu mjdurst@mcurie -o /u1/mjdurst/tests/bat.simple/simple1.o51847 -r simple1 -x -q serial>.
    01/15 10:48:13 Script file options:.
    01/15 10:48:15 Arrived in from.
    01/15 10:48:15 Request-id is, Request name=.
    01/15 10:48:15 NQE Task ID is.
    01/15 10:48:15 Origin uid=, Target username=.
    01/15 10:48:15 Account/Project name=, Account/Project ID=.
    01/15 10:48:15 Submission security level=, compartments=.
    01/15 10:48:17 Account/Project name=, Account/Project ID=.
    01/15 10:48:17 Arrived in from.
    01/15 10:48:20 Submission security level=, compartments=.
    01/15 10:48:20 Execution security level=, compartments=.
    01/15 10:48:23 Started, pid=, jid=, shell=, umask=.
    01/15 10:48:23 Running in queue.
    01/15 10:50:02 Finished.
    01/15 10:50:02 Returning stderr output file.
    01/15 10:50:03 Returning stdout output file.

7
    % cat simple1.o51847
    mcurie.nersc.gov, a Cray T3E-900 running UNICOS/mk 2.0.3.32
    ------------------------------Contact Information------------------------------
    NERSC Web   http://www.nersc.gov/
    ESnet Web   http://www.es.net/
    ESCHER Web  http://www.nersc.gov/hardware/servers/vis-server.html

    CFS CONVERSION
    CFS to HPSS conversion was successfully completed on January 7, 1999.
    Users can access all of their CFS files on the new HPSS system, "archive".
    The cfs command on the NERSC Crays now points to the new HPSS interface, hsi.
    For more info on using hsi reference this URL:
    http://www.nersc.gov/hardware/storage/hsi.ch1.html.
    If your HPSS password fails or you don't have an HPSS account, contact the
    Account Support group at 1-800-66NERSC, option 2, or (510) 486-8612
    ------------------------------------------------------------------------------
    Your current working directory is /u/mpccc/mjdurst.

    Hello mjdurst, this is your shell script /usr/spool/nqe/spool/scripts/++BBI+++++0+++,
    running at Fri Jan 15 10:48:31 PST 1999.

    Your current directory is /u1/mjdurst, which should
    be the same as /u/mpccc/mjdurst.

    I'm going to sleep now.

    logout

8 Why Batch Processing?
Batch queues are necessary:
  – On systems with many jobs
  – When scheduling is difficult
  – To assure greater throughput
Interactive jobs are limited
  – J90: 10 hrs.
  – T3E: < 64 PEs, < 30 minutes parallel (1 hr serial)
Some machines/processors batch-only
  – J90: all batch machines
  – T3E: many APP PEs (at night, almost all)

9 The Batch Process
User creates shell script myscript
Submits to NQE with cqsub myscript
  – Returns NQE task id (e.g., t4913)
NQE forwards to NQS
  – J90: selects a machine (J90 wait time here)
NQS runs the job
  – Assigns NQS job id (e.g., 6859.mcurie)
  – Selects a batch queue
  – Places the job there (T3E wait time here)
  – Runs it when appropriate
NQS/NQE returns job logs at completion
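The first step of the process hands back an NQE task id that every later command needs. A minimal Bourne-shell sketch of pulling that id out of the cqsub response follows. It assumes the response always has the form shown in this talk ("Task id t51847 inserted into database nqedb."); the function name `extract_task_id` is hypothetical, and the demo runs on the sample text only, so no NQE is required.

```shell
#!/bin/sh
# Sketch only: extract the NQE task id from one line of cqsub output.
# Assumption: the response looks like the one shown in the slides,
#   "Task id t51847 inserted into database nqedb."
# extract_task_id is a hypothetical helper, not part of NQE.
extract_task_id() {
    # Print the tNNNN token if the line matches; print nothing otherwise.
    printf '%s\n' "$1" |
        sed -n 's/^Task id \(t[0-9][0-9]*\) inserted into database .*$/\1/p'
}

# Demo on the sample response from the slides.
extract_task_id "Task id t51847 inserted into database nqedb."
```

In a real session the argument would be the captured output of `cqsub myscript`, and the resulting id could be passed to `cqstatl`.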

10 Pipe Queues
Groups of batch queues
  – Direct a job to a pipe queue with #QSUB -q serial
  – Default is production
To see them: qstat -p
T3E:
  – serial, debug, production, long
J90:
  – production
  – batchk (for evening, weekend killeen queues)
  – batch{b,f,s,c,j} (not recommended)

11 Preparing for Batch Submission
Write your shell script
  – C shell or Bourne/Korn shell
  – Starts in user’s home directory
Debug interactively (if possible)
Decide on needed resources
  – J90: CPU time, memory
  – T3E: amount of parallel, serial time; number of PEs
Select other #QSUB options
Check for appropriate queue and submit

12 Essential options to cqsub (#QSUB directives)
J90:
  – -lM
  – -lT
T3E:
  – -l mpp_p
  – -l mpp_t
  – -lT
  – don’t use -lM

13 Other cqsub options
-J y : save job log (recommended)
-j : save it in file
-mb : send mail when job starts (-me : ends)
-a : hold job until after time
-o : put standard output in file (default name: .o )
-eo : combine standard error and output
  – makes output look like terminal record
-x : exports user’s environment to job
-s : specify shell

14 Job Submission
cqsub
Can give options at submission time
  – Override file options
  – Less dependable
If no file name, expects commands from terminal
  – Useful behavior in automated script generation & submission
Response: Task id t16839 inserted into database nqedb.
  – Task id useful for tracking with cqstatl.
Don’t break (Ctrl-C) out of cqsub!
  – Instead, allow to finish, then use cqdel
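The "automated script generation & submission" point can be illustrated with a small Bourne-shell generator that writes a job script to standard output. This is a sketch under stated assumptions: the `name=value` form of the `-l mpp_p` and `-l mpp_t` directives is assumed rather than taken from the slides, and `make_t3e_job` and `./my_program` are hypothetical names. The block only prints the script; it does not call cqsub.

```shell
#!/bin/sh
# Sketch only: print a T3E batch script on standard output.
# Usage: make_t3e_job PES MPP_SECONDS SERIAL_SECONDS COMMAND...
# Assumptions (not confirmed by the slides):
#   - "-l mpp_p=N" and "-l mpp_t=N" are the directive value forms
#   - the production pipe queue is the right target
make_t3e_job() {
    pes=$1
    mpp_secs=$2
    serial_secs=$3
    shift 3
    cat <<EOF
#!/bin/sh
#QSUB -q production
#QSUB -l mpp_p=$pes
#QSUB -l mpp_t=$mpp_secs
#QSUB -lT $serial_secs
#QSUB -J y
$*
EOF
}

# Demo: 16 PEs, 1800 s parallel time, 600 s serial time.
# ./my_program is a hypothetical executable name.
make_t3e_job 16 1800 600 ./my_program
```

If cqsub accepts piped input the same way it accepts commands typed at the terminal, the generated text could be fed to it directly, for example `make_t3e_job 16 1800 600 ./my_program | cqsub`.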

15 Monitoring Jobs
cqstatl
  – cqstatl -a | grep (if no )
ST column (“status”) indicates progress
  – NNew, NPend, NSche : still in NQE
  – NSubm : submitted to NQS
  – NComp : done
  – NTerm : killed
  – NFail : job failed (user or system error)
IDENTIFIER column holds NQS job id (once submitted)
cqstatl -f : details for your job
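The ST codes listed above can be turned into a small lookup, which is handy inside a wrapper script. The following Bourne-shell sketch encodes only the meanings given on this slide; the function name `nqe_status_meaning` is hypothetical, and any code not listed here is reported as unknown rather than guessed.

```shell
#!/bin/sh
# Sketch only: map a cqstatl ST code to the meaning given in the talk.
# nqe_status_meaning is a hypothetical helper, not an NQE command.
nqe_status_meaning() {
    case $1 in
        NNew|NPend|NSche) echo "still in NQE" ;;
        NSubm)            echo "submitted to NQS" ;;
        NComp)            echo "done" ;;
        NTerm)            echo "killed" ;;
        NFail)            echo "job failed (user or system error)" ;;
        *)                echo "unknown" ;;
    esac
}

# Demo: walk through the states seen in the quick example.
for st in NNew NPend NSche NSubm NComp; do
    echo "$st: $(nqe_status_meaning "$st")"
done
```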

16 Monitoring Jobs (cont’d)
T3E: qstat once your job reaches NQS
  – cqstatl -d nqs = qstat
  – qstat -au (if no )
J90: qstat -h
  – Find hostname from NQS id (from cqstatl)
  – e.g., 2861.seymour
ST column (“status”) now indicates
  – RNN : Running (with NN processes)
  – Qxy : waiting in the queue (xy encodes reason)
      man qstat to decode
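Two of the steps above are simple string manipulations: taking the hostname from an NQS id such as 2861.seymour, and splitting a qstat ST value into its state letter and suffix. A Bourne-shell sketch follows; `nqs_host` and `qstat_state` are hypothetical helper names, and the reason codes after Q are passed through untouched because the slide defers their meaning to `man qstat`.

```shell
#!/bin/sh
# Sketch only: helpers for NQS ids and qstat ST values.
# Both function names are hypothetical.

# nqs_host ID: print the part after the first dot (2861.seymour -> seymour).
# An id with no dot is printed unchanged.
nqs_host() {
    printf '%s\n' "${1#*.}"
}

# qstat_state ST: describe an ST value using only what the talk states.
qstat_state() {
    case $1 in
        R[0-9][0-9]) echo "running with ${1#R} processes" ;;
        Q??)         echo "queued (reason code ${1#Q})" ;;
        *)           echo "other ($1)" ;;
    esac
}

# Demo on values that appear in the slides.
nqs_host 2861.seymour
qstat_state R03
qstat_state Qge
```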

17
    % cqstatl -a
    -----------------------------
    NQE 3.3.0.9 Database Task Summary
    -----------------------------
    IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
    --------------------  ------- ---------------- -------- ------------------- ----
    t48217 (46356.mcurie) PCM     lws.mcurie       alewife  nqs@mcurie          NSubm
    t48713 (46848.mcurie) third   lws.mcurie       u6670    nqs@mcurie          NSubm
    t49200 (47518.mcurie) int566A lws.mcurie       u61176   nqs@mcurie          NSubm
    t49245 (47368.mcurie) xqcd_ho lws.mcurie       snm      nqs@mcurie          NSubm
    t50349 (48480.mcurie) int650  lws.mcurie       u61176   nqs@mcurie          NSubm
    t50881 (49338.mcurie) lte34-0 lws.mcurie       lungfish nqs@mcurie          NSubm
    t51870                case17c scheduler.main   salmon   NQE Database        NTerm
    t51871                case1c9 scheduler.main   salmon   NQE Database        NFail
    t51872                case16c scheduler.main   salmon   NQE Database        NPend
    t51873 (49967.mcurie) q_lsms  lws.mcurie       marlin   nqs@mcurie          NSubm
    t51875                case11c scheduler.main   salmon   NQE Database        NPend
    t51877 (49970.mcurie) G08     lws.mcurie       u66870   nqs@mcurie          NSubm
    t51878 (49971.mcurie) qHsig.3 lws.mcurie       bass     nqs@mcurie          NSubm
    t51881 (49975.mcurie) Jobge_b lws.mcurie       carp     nqs@mcurie          NSubm
    t51884 (49979.mcurie) job16.a lws.mcurie       adt      nqs@mcurie          NSubm
    t51885 (49980.mcurie) run_dyn lws.mcurie       flounder nqs@mcurie          NSubm
    t51886 (49981.mcurie) jupiter lws.mcurie       grouper  nqs@mcurie          NSubm
    t51887 (49983.mcurie) JobCZ.b lws.mcurie       tarpon   nqs@mcurie          NComp
    (output greatly abridged)
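A listing like the one above gets long quickly, so a per-status tally can be useful. The sketch below counts tasks by the last column with awk. It assumes every task row starts with a tNNNN identifier and ends with the ST code, as in this sample; `count_by_status` is a hypothetical name, and the demo reads a few rows copied from the slide instead of calling cqstatl.

```shell
#!/bin/sh
# Sketch only: tally cqstatl -a rows by their ST (last) column.
# Assumption: task rows start with "t<digits>" and end with the ST code.
# count_by_status is a hypothetical helper.
count_by_status() {
    awk '/^t[0-9]+/ { n[$NF]++ } END { for (s in n) print s, n[s] }' | sort
}

# Demo on rows copied from the sample listing (no NQE needed).
count_by_status <<'EOF'
t51870 case17c scheduler.main salmon NQE Database NTerm
t51871 case1c9 scheduler.main salmon NQE Database NFail
t51872 case16c scheduler.main salmon NQE Database NPend
t51873 (49967.mcurie) q_lsms lws.mcurie marlin nqs@mcurie NSubm
t51875 case11c scheduler.main salmon NQE Database NPend
EOF
```

On a live system the function would sit at the end of a pipeline such as `cqstatl -a | count_by_status`.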

18
    % qstat -a
    ---------------------------------
    NQS 3.3.0.9 BATCH REQUEST SUMMARY
    ---------------------------------
    IDENTIFIER    NAME     USER     LOCATION/QUEUE        JID  PRTY REQMEM REQTIM ST
    ------------- -------  -------- --------------------- ---- ---- ------ ------ ---
    49979.mcurie  job16.ag adt      pe32@mcurie           4164   25    255   1520 R03
    49936.mcurie  akr520   u6677    pe32@mcurie           3732   25    323   1800 R03
    49964.mcurie  case14c9 salmon   pe32@mcurie           3944   25    255   1795 R03
    49967.mcurie  q_lsms   marlin   pe32@mcurie                 999  28672   1800 Cge
    49983.mcurie  JobCZ.bb tarpon   pe32@mcurie                 317  28672   1800 Qge
    49984.mcurie  bitgc11  u62098   pe32@mcurie                 244  28672   1800 Qge
    49985.mcurie  bitgc11  u62098   pe32@mcurie                 242  28672   1800 Qge
    49362.mcurie  Job_a2   carp     pe128@mcurie          5308   25    323   1800 R03
    49335.mcurie  script.2 sturgeon pe256@mcurie                999  28672   1800 Qqs
    49033.mcurie  uo2_3h2o dorado   gc128@mcurie                ---  28672   7200 Hop
    49255.mcurie  run010_A bluegill long128@mcurie        4617   25    255   1800 R03
    49276.mcurie  sg3D10   aku      long128@mcurie              999   4096   1800 Qce
    49277.mcurie  sg3D10   aku      long128@mcurie              999   4096   1800 Qqu
    49867.mcurie  run_t4   flounder long128@mcurie               70  28672   1800 Cgg
    no pipe queue entries
    (output greatly abridged)

19
    % qstat -f pe32
    ------------------------------------
    NQS 3.3.0.9 BATCH QUEUE: pe32@mcurie    Status: ENABLED/RUNNING
    ------------------------------------
    Priority: 15
    Total: 17  Running: 5  Queued: 12  Waiting: 0  Holding: 0  Arriving: 0  Exiting: 0
    Queue: 13  User: 2  Group: 20
    regular
    Miser Queue: unspecified  Scheduling Window: 0:0.0

    LIMIT  ALLOCATED
    Memory Size               unlimited  143360kw
    Quick File Space          0b  0kw
    MPP Processor Elements    416  60

    PER-PROCESS  PER-REQUEST
    type a Tape Drives        unspecified (0)
    type b Tape Drives        unspecified (0)
    type c Tape Drives        unspecified (0)
    type d Tape Drives        unspecified (0)
    (cont’d)

20
    type e Tape Drives        unspecified (0)
    type f Tape Drives        unspecified (0)
    type g Tape Drives        unspecified (0)
    type h Tape Drives        unspecified (0)
    Core File Size            unspecified (256mw)
    Data Size                 unspecified (256mw)
    Permanent File Space      20gb  25gb
    Memory Size               28mw  29mw
    Nice Increment            5
    Quick File Space          unspecified (0b)  0b
    Stack Size                unspecified (256mw)
    CPU Time Limit            3600sec  7200sec
    Temporary File Space      unspecified (0b)  unspecified (0b)
    Working Set Limit         unspecified (256mw)
    MPP Processor Elements    32
    MPP Time Limit            15000sec  15000sec
    Shared Memory Limit       unspecified (0mw)
    Shared Memory Segments    unspecified (0)
    MPP Memory Size           unspecified (256mw)  unlimited
    Route: Pipe Only
    Users: Unrestricted
    System Time: 3563114615067464.00 secs
    User Time: 281421545294442428.00 secs
    (qstat -f output, cont’d from previous slide)

21 Troubleshooting
No task id returned
  – Typically means NQE down
  – message like “Can’t connect”
Job doesn’t make it to NQS: try cqstatl
  – NFail usually indicates submission error
  – Nabort could be a system problem
  – No listing if many days old (NQE database is purged frequently)
Stuck in NPend status
  – J90: Many jobs ahead of you?
  – T3E: over pipe queue limit?

22 Troubleshooting (cont’d)
Stuck in NSubm : use qstat
  – Q : normal on T3E, rare on J90
  – T3E: Hop can be allocation problem
      C (“checkpointed”) may be daily shuffling
      May need both pslist and qstat -m to sort it all out
Job crashes
  – Read job log, stdout, stderr
  – ...limit exceeded: ran out of time (or memory, or…)
Job vanishes
  – Did machine(s) crash?
      If not, collect info and contact Consultants

23 Pointers
Batch job is like a login session
  – Starts in your home directory
  – Uses your startup files
  – But doesn’t inherit environment (unless you use -x)
Environment variable ENVIRONMENT
  – Not set in interactive work, set to BATCH in batch jobs
  – Can exclude parts of startup files
/usr/tmp faster than home directory
  – $TMPDIR vanishes (avoids littering)
  – Just one quota for $TMPDIR, rest of /usr/tmp/
  – Can’t monitor batch J90 temp file systems
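The ENVIRONMENT pointer can be sketched as a guard in a startup file. The example below uses Bourne-shell syntax (a csh startup file such as the one implied by the quick example would need the csh equivalent), relies only on the behavior stated above (unset interactively, BATCH in batch jobs), and uses a hypothetical helper name, `is_batch`. The echoed messages stand in for whatever setup a real startup file would do.

```shell
#!/bin/sh
# Sketch only: skip interactive-only setup when running under batch.
# Relies on the talk's statement that ENVIRONMENT is unset in
# interactive work and set to BATCH in batch jobs.
# is_batch is a hypothetical helper.
is_batch() {
    [ "${ENVIRONMENT:-}" = "BATCH" ]
}

if is_batch; then
    # Placeholder for batch-safe setup only.
    echo "batch job: skipping interactive setup"
else
    # Placeholder for terminal settings, prompts, aliases, and so on.
    echo "interactive session: doing terminal setup"
fi
```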

24 Pointers (cont’d)
Don’t submit blindly
  – Debug executables, scripts first
  – Don’t trust inherited shell scripts
  – Spend time with man pages
J90: large memory jobs should/must multitask
T3E: reduce serial time in parallel jobs
  – “Stage” HPSS retrievals (dmget)
  – Submit follow-on serial jobs within your job

