NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER: Using the Batch System at NERSC. Mark Durst, NERSC/USG, ERSUG Training

Presentation transcript:

NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Using the Batch System at NERSC
Mark Durst, NERSC/USG
ERSUG Training, Argonne, IL, 28 April 1999

Outline
Quick example
How batch processing works
Batch and pipe queues
How to submit jobs
Monitoring jobs
Reminders and Pointers

#!/bin/csh
#
# file: simple1
#
#QSUB -q serial
#QSUB -J y # keep job log
set myname=`whoami`
set now=`date`
set mylocn=`pwd`
echo ""
echo "Hello $myname, this is your shell script $0,"
echo "running at $now."
echo ""
echo "Your current directory is $mylocn, which should"
echo "be the same as $HOME."
echo ""
echo "I'm going to sleep now."
echo ""
sleep 90
exit

% cqsub simple1
Task id t51847 inserted into database nqedb.
% cqstatl t
NQE Database Task Summary
IDENTIFIER NAME SYSTEM-OWNER OWNER LOCATION ST
t51847 simple1 scheduler.main mjdurst NQE Database NNew
% cqstatl t
NQE Database Task Summary
IDENTIFIER NAME SYSTEM-OWNER OWNER LOCATION ST
t51847 simple1 scheduler.main mjdurst NQE Database NPend
% cqstatl t
NQE Database Task Summary
IDENTIFIER NAME SYSTEM-OWNER OWNER LOCATION ST
t51847 simple1 lws.mcurie mjdurst NQE Database NSche
% cqstatl t
NQE Database Task Summary
IDENTIFIER NAME SYSTEM-OWNER OWNER LOCATION ST
t51847 (49939.mcurie) simple1 lws.mcurie mjdurst NSubm

% qstat
NQS BATCH REQUEST SUMMARY
IDENTIFIER NAME USER LOCATION/QUEUE JID PRTY REQMEM REQTIM ST
mcurie simple1 mjdurst R03
% qstat
nqs-100 qstat: CAUTION
Request : not found.
% cqstatl t
NQE Database Task Summary
IDENTIFIER NAME SYSTEM-OWNER OWNER LOCATION ST
t51847 (49939.mcurie) simple1 monitor.main mjdurst NQE Database NComp
% ls -l
total 12
-rwxrw-r-- 1 mjdurst mpccc 365 Jan 15 10:47 simple1*
-rw-r--r-- 1 mjdurst mpccc 0 Jan 15 10:50 simple1.e
-rw-r--r-- 1 mjdurst mpccc 1285 Jan 15 10:50 simple1.l
-rw-r--r-- 1 mjdurst mpccc 2638 Jan 15 10:50 simple1.o51847

% cat simple1.l
01/15 10:48:13 Submitting to queue by
01/15 10:48:13 Command line options: <-e /u1/mjdurst/tests/bat.simple/simple1.e J y -j /u1/mjdurst/tests/bat.simple/simple1.l lM 28mw 28mw -lT mu -o /u1/mjdurst/tests/bat.simple/simple1.o r simple1 -x -q serial>.
01/15 10:48:13 Script file options:.
01/15 10:48:15 Arrived in from.
01/15 10:48:15 Request-id is, Request name=.
01/15 10:48:15 NQE Task ID is.
01/15 10:48:15 Origin uid=, Target username=.
01/15 10:48:15 Account/Project name=, Account/Project ID=.
01/15 10:48:15 Submission security level=, compartments=.
01/15 10:48:17 Account/Project name=, Account/Project ID=.
01/15 10:48:17 Arrived in from.
01/15 10:48:20 Submission security level=, compartments=.
01/15 10:48:20 Execution security level=, compartments=.
01/15 10:48:23 Started, pid=, jid=, shell=, umask=.
01/15 10:48:23 Running in queue.
01/15 10:50:02 Finished.
01/15 10:50:02 Returning stderr output file.
01/15 10:50:03 Returning stdout output file.

% cat simple1.o51847
mcurie.nersc.gov, a Cray T3E-900 running UNICOS/mk
Contact Information NERSC Web ESnet Web ESCHER Web
CFS CONVERSION
CFS to HPSS conversion was successfully completed on January 7, Users can access all of their CFS files on the new HPSS system, "archive". The cfs command on the NERSC Crays now points to the new HPSS interface, hsi. For more info on using hsi reference this URL: If your HPSS password fails or you don't have an HPSS account, contact the Account Support group at NERSC, option 2, or (510)
Your current working directory is /u/mpccc/mjdurst.
Hello mjdurst, this is your shell script /usr/spool/nqe/spool/scripts/++BBI ,
running at Fri Jan 15 10:48:31 PST
Your current directory is /u1/mjdurst, which should
be the same as /u/mpccc/mjdurst.
I'm going to sleep now.
logout

Why Batch Processing?
Batch queues are necessary:
  – On systems with many jobs
  – When scheduling is difficult
  – To assure greater throughput
Interactive jobs are limited
  – J90: 10 hrs.
  – T3E: < 64 PEs, < 30 minutes parallel (1 hr serial)
Some machines/processors are batch-only
  – J90: all batch machines
  – T3E: many APP PEs (at night, almost all)

The Batch Process
User creates shell script myscript
Submits to NQE with cqsub myscript
  – Returns NQE task id (e.g., t4913)
NQE forwards to NQS
  – J90: selects a machine (J90 wait time here)
NQS runs the job
  – Assigns NQS job id (e.g., 6859.mcurie)
  – Selects a batch queue
  – Places the job there (T3E wait time here)
  – Runs it when appropriate
NQS/NQE returns job logs at completion
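The first step of this process, capturing the task id that cqsub returns, can be scripted. What follows is a minimal Bourne-shell sketch, assuming the response line has exactly the form shown in the quick example ("Task id t51847 inserted into database nqedb."). The function name task_id_from_response is hypothetical and not part of NQE.

```shell
#!/bin/sh
# Hypothetical helper: pull the NQE task id out of a cqsub response line.
# Assumes the response format shown in the slides:
#   Task id t51847 inserted into database nqedb.
# Prints nothing if the line does not match (e.g., an error message).
task_id_from_response() {
    printf '%s\n' "$1" |
        sed -n 's/^Task id \(t[0-9][0-9]*\) inserted into database .*$/\1/p'
}

# Possible use on a system where cqsub exists (not run here):
#   response=`cqsub myscript`
#   taskid=`task_id_from_response "$response"`
#   [ -n "$taskid" ] || echo "no task id returned" >&2
```

An empty result is a useful signal in its own right, since the slides note that a missing task id typically means NQE is down.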

Pipe Queues
Groups of batch queues
  – Direct a job to a pipe queue with #QSUB -q serial
  – Default is production
To see them: qstat -p
T3E:
  – serial, debug, production, long
J90:
  – production
  – batchk (for evening, weekend killeen queues)
  – batch{b,f,s,c,j} (not recommended)

Preparing for Batch Submission
Write your shell script
  – C shell or Bourne/Korn shell
  – Starts in user’s home directory
Debug interactively (if possible)
Decide on needed resources
  – J90: CPU time, memory
  – T3E: amount of parallel, serial time; number of PEs
Select other #QSUB options
Check for appropriate queue and submit

Essential options to cqsub (#QSUB directives)
J90:
  – -lM (memory limit)
  – -lT (CPU time limit)
T3E:
  – -l mpp_p (number of PEs)
  – -l mpp_t (parallel time limit)
  – -lT (serial CPU time limit)
  – don’t use -lM

Other cqsub options
-J y : save job log (recommended)
-j : save it in file
-mb : send mail when job starts (-me : ends)
-a : hold job until after time
-o : put standard output in file (default name ends in .o plus the task number, e.g., simple1.o51847)
-eo : combine standard error and output
  – makes output look like terminal record
-x : exports user’s environment to job
-s : specify shell

Job Submission
cqsub
Can give options at submission time
  – Override file options
  – Less dependable
If no file name, expects commands from terminal
  – Useful behavior in automated script generation & submission
Response: Task id t16839 inserted into database nqedb.
  – Task id useful for tracking with cqstatl
Don’t break (Ctrl-C) out of cqsub!
  – Instead, allow to finish, then use cqdel
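Automated script generation, mentioned above, might look like the following Bourne-shell sketch. The function emit_job is hypothetical, the resource values are placeholders, and the "=" form used for -l mpp_p and -l mpp_t is an assumption to check against the cqsub man page. The generated script only echoes a message, standing in for a real workload.

```shell
#!/bin/sh
# Hypothetical generator for a T3E job script; prints the script on stdout.
#   $1 = number of PEs
#   $2 = parallel time limit in seconds
#   $3 = serial CPU time limit in seconds
# Assumption: -l mpp_p and -l mpp_t take their values as name=value.
emit_job() {
    cat <<EOF
#!/bin/sh
#QSUB -q production
#QSUB -J y
#QSUB -l mpp_p=$1
#QSUB -l mpp_t=$2
#QSUB -lT $3
echo "running on $1 PEs"
EOF
}

# On a system with NQE, the output could be fed to cqsub with no file
# name, so that it reads the script from standard input (not run here):
#   emit_job 16 1800 600 | cqsub
```

Keeping the limits in the generated #QSUB directives, rather than on the cqsub command line, follows the slide's remark that command-line options are less dependable.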

Monitoring Jobs
cqstatl
  – cqstatl -a | grep (if no )
ST column (“status”) indicates progress
  – NNew, NPend, NSche : still in NQE
  – NSubm : submitted to NQS
  – NComp : done
  – NTerm : killed
  – NFail : job failed (user or system error)
IDENTIFIER column holds NQS job id (once submitted)
cqstatl -f : details for your job
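The status codes above lend themselves to a small lookup. This is a sketch in Bourne shell with hypothetical function names. It assumes the ST value is the last whitespace-separated field of a cqstatl summary row, as in the sample output, and it reports any code not listed on the slide as unknown.

```shell
#!/bin/sh
# Hypothetical helper: translate the cqstatl ST column into the meanings
# listed on the slide. Codes not on the slide fall through to "unknown".
describe_nqe_status() {
    case "$1" in
        NNew|NPend|NSche) echo "still in NQE" ;;
        NSubm)            echo "submitted to NQS" ;;
        NComp)            echo "done" ;;
        NTerm)            echo "killed" ;;
        NFail)            echo "failed (user or system error)" ;;
        *)                echo "unknown" ;;
    esac
}

# Hypothetical helper: the ST value is taken to be the last field of a row.
status_of_row() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}
```

For example, passing the row "t51847 simple1 scheduler.main mjdurst NQE Database NPend" through status_of_row and then describe_nqe_status reports that the task is still in NQE.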

Monitoring Jobs (cont’d)
T3E: qstat once your job reaches NQS
  – cqstatl -d nqs = qstat
  – qstat -au (if no )
J90: qstat -h
  – Find hostname from NQS id (from cqstatl)
  – e.g., 2861.seymour
ST column (“status”) now indicates
  – RNN : Running (with NN processes)
  – Qxy : waiting in the queue (xy encodes reason)
man qstat to decode
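Two pieces of the above can be sketched in Bourne shell: taking the hostname from an NQS job id, and reading the qstat ST column. Both function names are hypothetical. The decoder covers only the two patterns the slide describes (RNN and Qxy) and defers everything else to man qstat.

```shell
#!/bin/sh
# Hypothetical helper: host part of an NQS job id such as 2861.seymour.
nqs_host() {
    printf '%s\n' "${1#*.}"
}

# Hypothetical helper: decode the qstat ST column as described on the slide.
#   RNN -> running with NN processes
#   Qxy -> waiting in the queue, xy encodes the reason
describe_nqs_status() {
    case "$1" in
        R[0-9]*) echo "running with ${1#R} processes" ;;
        Q??)     echo "waiting in queue (reason code ${1#Q})" ;;
        *)       echo "other; see man qstat" ;;
    esac
}

# Possible J90 use, assuming -h takes the host name (not run here):
#   qstat -h `nqs_host 2861.seymour`
```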

% cqstatl -a
NQE Database Task Summary
IDENTIFIER NAME SYSTEM-OWNER OWNER LOCATION ST
t48217 (46356.mcurie) PCM lws.mcurie alewife NSubm
t48713 (46848.mcurie) third lws.mcurie u6670 NSubm
t49200 (47518.mcurie) int566A lws.mcurie u61176 NSubm
t49245 (47368.mcurie) xqcd_ho lws.mcurie snm NSubm
t50349 (48480.mcurie) int650 lws.mcurie u61176 NSubm
t50881 (49338.mcurie) lte34-0 lws.mcurie lungfish NSubm
t51870 case17c scheduler.main salmon NQE Database NTerm
t51871 case1c9 scheduler.main salmon NQE Database NFail
t51872 case16c scheduler.main salmon NQE Database NPend
t51873 (49967.mcurie) q_lsms lws.mcurie marlin NSubm
t51875 case11c scheduler.main salmon NQE Database NPend
t51877 (49970.mcurie) G08 lws.mcurie u66870 NSubm
t51878 (49971.mcurie) qHsig.3 lws.mcurie bass NSubm
t51881 (49975.mcurie) Jobge_b lws.mcurie carp NSubm
t51884 (49979.mcurie) job16.a lws.mcurie adt NSubm
t51885 (49980.mcurie) run_dyn lws.mcurie flounder NSubm
t51886 (49981.mcurie) jupiter lws.mcurie grouper NSubm
t51887 (49983.mcurie) JobCZ.b lws.mcurie tarpon NComp
(output greatly abridged)

% qstat -a
NQS BATCH REQUEST SUMMARY
IDENTIFIER NAME USER LOCATION/QUEUE JID PRTY REQMEM REQTIM ST
mcurie job16.ag adt R
mcurie akr520 u R
mcurie case14c9 salmon R
mcurie q_lsms marlin Cge
mcurie JobCZ.bb tarpon Qge
mcurie bitgc11 u Qge
mcurie bitgc11 u Qge
mcurie Job_a2 carp R
mcurie script.2 sturgeon Qqs
mcurie uo2_3h2o dorado Hop
mcurie run010_A bluegill R
mcurie sg3D10 aku Qce
mcurie sg3D10 aku Qqu
mcurie run_t4 flounder Cgg
no pipe queue entries
(output greatly abridged)

% qstat -f pe
NQS BATCH QUEUE: Status: ENABLED/RUNNING
Priority: 15
Total: 17 Running: 5 Queued: 12 Waiting: 0 Holding: 0 Arriving: 0 Exiting: 0
Queue: 13 User: 2 Group: 20 regular
Miser Queue: unspecified
Scheduling Window: 0:0.0
LIMIT ALLOCATED
Memory Size unlimited kw
Quick File Space 0b 0kw
MPP Processor Elements
PER-PROCESS PER-REQUEST
type a Tape Drives unspecified (0)
type b Tape Drives unspecified (0)
type c Tape Drives unspecified (0)
type d Tape Drives unspecified (0)
(cont’d)

type e Tape Drives unspecified (0)
type f Tape Drives unspecified (0)
type g Tape Drives unspecified (0)
type h Tape Drives unspecified (0)
Core File Size unspecified (256mw)
Data Size unspecified (256mw)
Permanent File Space 20gb 25gb
Memory Size 28mw 29mw
Nice Increment 5
Quick File Space unspecified (0b) 0b
Stack Size unspecified (256mw)
CPU Time Limit 3600sec 7200sec
Temporary File Space unspecified (0b) unspecified (0b)
Working Set Limit unspecified (256mw)
MPP Processor Elements 32
MPP Time Limit 15000sec 15000sec
Shared Memory Limit unspecified (0mw)
Shared Memory Segments unspecified (0)
MPP Memory Size unspecified (256mw) unlimited
Route: Pipe Only
Users: Unrestricted
System Time: secs
User Time: secs
(qstat -f output, cont’d from previous slide)

Troubleshooting
No task id returned
  – Typically means NQE down
  – Message like “Can’t connect”
Job doesn’t make it to NQS: try cqstatl
  – NFail usually indicates submission error
  – Nabort could be a system problem
  – No listing if many days old (NQE database is purged frequently)
Stuck in NPend status
  – J90: Many jobs ahead of you?
  – T3E: over pipe queue limit?

Troubleshooting (cont’d)
Stuck in NSubm : use qstat
  – Q : normal on T3E, rare on J90
  – T3E: Hop can be allocation problem
  – C (“checkpointed”) may be daily shuffling
  – May need both pslist and qstat -m to sort it all out
Job crashes
  – Read job log, stdout, stderr
  – ...limit exceeded: ran out of time (or memory, or…)
Job vanishes
  – Did machine(s) crash?
  – If not, collect info and contact Consultants

Pointers
Batch job is like a login session
  – Starts in your home directory
  – Uses your startup files
  – But doesn’t inherit environment (unless you use -x)
Environment variable ENVIRONMENT
  – Not set in interactive work, set to BATCH in batch jobs
  – Can exclude parts of startup files
/usr/tmp faster than home directory
  – $TMPDIR vanishes (avoids littering)
  – Just one quota for $TMPDIR, rest of /usr/tmp/
  – Can’t monitor batch J90 temp file systems
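The ENVIRONMENT variable makes it straightforward to skip interactive-only parts of a startup file. Below is a minimal sketch for a Bourne/Korn startup file. The function name is_batch is hypothetical, and the body of the if is left as a placeholder.

```shell
#!/bin/sh
# Succeeds only inside a batch job, where (per the slide) ENVIRONMENT is
# set to BATCH; in interactive work the variable is not set.
is_batch() {
    [ "${ENVIRONMENT:-}" = "BATCH" ]
}

# Skip interactive-only setup when running under the batch system.
if ! is_batch; then
    : # interactive-only commands (terminal setup, prompts) would go here
fi
```

A C shell startup file would need the equivalent test written in csh syntax.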

Pointers (cont’d)
Don’t submit blindly
  – Debug executables, scripts first
  – Don’t trust inherited shell scripts
  – Spend time with man pages
J90: large memory jobs should/must multitask
T3E: reduce serial time in parallel jobs
  – “Stage” HPSS retrievals (dmget)
  – Submit follow-on serial jobs within your job