NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER

Slide 1: LoadLeveler vs. NQE/NQS: Clash of The Titans
NERSC User Services
Oak Ridge National Lab
6/6/00

Slide 2: NERSC Batch Systems
- LoadLeveler - IBM SP
- NQS/NQE - Cray T3E/J90’s
- This talk will focus on the MPP systems
- Using the batch system on the J90’s is similar to the T3E
- The IBM batch system:
- The Cray batch system:
- Batch differences between IBM and Cray:

Slide 3: About the T3E
- 644 application processors (PEs)
- 33 command PEs
- Additional PEs for the OS
- NQE/NQS jobs run on application PEs
- Interactive jobs (“mpprun” jobs) run on command PEs
- Single system image
- A single parallel job must run on a contiguous set of PEs
- A job will not be scheduled if there are enough idle PEs but they are fragmented throughout the torus

Slide 4: About the SP
- 256 compute nodes
- 8 login nodes
- Additional nodes for file system, network, etc.
- Each node has 2 processors that share memory
- Each node can have either 1 or 2 MPI tasks
- Each node runs a full copy of the AIX OS
- LoadLeveler jobs can run only on the compute nodes
- Interactive jobs (“poe” jobs) can run on either compute or login nodes (see the example below)
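For example, a short interactive run launched from a login node might look like this (a sketch; a.out and the task count are placeholders):

% poe ./a.out -procs 4

The -procs flag corresponds to the MP_PROCS environment variable, so the task count can also be set in the environment instead.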

Slide 5: How To Use a Batch System
- Write a batch script
  – must use keywords specific to the scheduler
  – default values will be different for each site
- Submit your job
  – commands are specific to the scheduler
- Monitor your job
  – commands are specific to the scheduler
  – run limits are specific to the site
- Check results when complete
- Call NERSC consultants when your job disappears :o)

Slide 6: T3E Batch Terminology
- PE - processor element (a single CPU)
- Torus - the high-speed connection between PEs. All communication between PEs must go through the torus.
- Swapping - when a job is stopped by the system to allow a higher-priority job to run on that PE. The job may stay in memory. Also called “gang-scheduling”.
- Migrating - when a job is moved to a different set of PEs to better pack the torus
- Checkpoint - when a job is stopped by the system and an image is saved so that the job can be restarted at a later time

Slide 7: More T3E Batch Terminology
- Pipe Queue - a queue in the NQE portion of the scheduler. It determines which batch queues the job may be submitted to. The user must name the pipe queue on the cqsub command line if it is anything other than “regular” (see the example below).
- Batch Queue - a queue in the NQS portion of the scheduler. The batch queues are served in a first-fit manner. The user should not specify any batch queue on the command line or in the script.
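For example (a sketch; the script names are hypothetical), a debug job must name its pipe queue on the command line, while a job bound for the default “regular” pipe needs no -la option:

% cqsub -la debug quicktest.csh
% cqsub myjob.csh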

Slide 8: NQS/NQE
- Developed by Cray
- Very complex set of scheduling parameters
- Complicated to understand
- Fragile
- Powerful and flexible
- Allows checkpoint/restart

Slide 9: What NQE Does
- Users submit jobs to NQE
- NQE assigns each job a unique identifier called the task id and stores it in a database
- The status of the job is “NPend”
- NQE examines various parameters and decides when to pass the job to the LWS
- The LWS then submits the job to an NQS batch queue (see the next slide for NQS details)
- After the job completes, NQE stores the job information for about 4 hours

Slide 10: What NQS Does
- NQS receives a job from the LWS
- The job is placed in a batch queue determined by the number of requested PEs and the time limit
- The status of the job is now “NSubm”
- NQS batch queues are served in a first-fit manner
- When the job is ready to be scheduled, it is sent to the GRM (global resource manager)
- At this point the status of the job is “R03”
- The job may be stopped for checkpointing or swapping but still show a “running” status in NQS

Slide 11: NQS/NQE Commands
- cqsub - submit your job
    % cqsub -la regular script_file
    Task id t7225 inserted into database nqedb.
- cqstatl - monitor your NQE job
- qstat - monitor your NQS job
- cqdel - delete your queued or running job
    % cqdel t7225

Slide 12: Sample T3E Batch Script

#QSUB -s /bin/csh                #Specify C shell for 'set echo'
#QSUB -A abc                     #Charge account abc for this job
#QSUB -r sample                  #Job name
#QSUB -eo -o batch_log.out       #Write error and output to a single file
#QSUB -l mpp_t=00:30:00          #Wallclock time
#QSUB -l mpp_p=8                 #PEs to be used (required)
ja                               #Turn on job accounting
mpprun -n 8 ./a.out < data.in    #Execute on 8 PEs reading data.in
ja -s                            #Print job accounting summary
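Assuming the script above is saved as sample.csh (a hypothetical name), submitting it takes one command; at 8 PEs it would presumably be routed first-fit into the pe16 batch queue listed on slide 17:

% cqsub sample.csh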

Slide 13: Monitoring Your Job on the T3E

% cqstatl -a | grep jimbob
t4417                l441h4  scheduler.main  jimbob  NQE Database  NPend
t4605 (1259.mcurie)  l513v8  lws.mcurie      jimbob                NSubm
t4777 (1082.mcurie)  l541l2  monitor.main    jimbob  NQE Database  NComp
t4884 (1092.mcurie)  l543l1  lws.mcurie      jimbob                NSubm
t4885 (1093.mcurie)  l545l1  lws.mcurie      jimbob                NSubm
t4960                l546    scheduler.main  jimbob  NQE Database  NPend

% qstat -a | grep jimbob
1259.mcurie  l513v8  jimbob  R
1092.mcurie  l543l1  jimbob  R
1093.mcurie  l545l1  jimbob  Qge

Slide 14: Monitoring Your Job on the T3E (cont’d)
- Use the commands pslist (see the next slide) and tstat to check running jobs
- Using ps on a command PE will list all instances of a parallel job, because the T3E has a single system image

% mpprun -n 4 ./a.out
% ps -u jimbob
  PID  TTY  TIME   CMD
 7523  ?    0:01   csh
 7568  ?    12:13  a.out
 7569  ?    12:13  a.out
 7570  ?    12:13  a.out
 7571  ?    12:13  a.out

Slide 15: Monitoring Your Job on the T3E (cont’d)

% pslist
S  USER      RK  APID  JID  PE_RANG  NPE  TTY  TIME      CMD        STATUS
a  user      …   …     …    …        …    ?    02:50:32  sander
b  buffysum  …   …     …    …        …    ?    02:57:45  osiris.e   ACTIVE
q  buffysum  …   …     …    …        …    ?    00:42:28  osiris.e   Swapped 1 of 16
r  miyoung   …   …     …    …        …    ?    03:52:11  vasp
s  buffysum  …   …     …    …        …    ?    00:18:16  osiris.e   Swapped 1 of 40
t  willow    …   …     …    …        …    ?    00:53:03  MicroMag.
u  hal       …   …     …    …        …    ?    00:26:09  alknemd    ACTIVE

PEs = 266  BATCH = 770  INTERACTIVE = 12

WAIT QUEUE:
user   uid  gid  acid  Label  Size  ApId  Command   Reason     Flags
giles  …    …    …     …      …     …     xlatqcdp  Ap. limit  a----
bobg   …    …    …     …      …     …     Cmdft     Ap. limit  a----
jimbo  …    …    …     …      …     …     pop.8x4   Ap. limit  af---

Slide 16: Possible Job States on the T3E

ST     Job State   Description
R03    Running     The job is currently running.
NSubm  Submitted   The job has been submitted to the NQS scheduler and is being considered to run.
NPend  Pending     The job is still residing in the NQE database and is not being considered to run. This is probably because you already have 3 jobs in the queue.
NComp  Completed   The job has completed.
NTerm  Terminated  The job was terminated, probably due to an error in the batch script.

Slide 17: Current Queue Limits on the T3E

Pipe Q      Batch Q       MAX PE  Time
debug       debug_small   32      30 min
            debug_medium  128     10 min
production  pe16          16      4 hr
            pe32          32      4 hr
            pe64          64      4 hr
            pe128         128     4 hr
            pe256         256     4 hr
            pe512         512     4 hr
long        long128       128     … hr
            long256       256     … hr
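As an illustration of routing by PE count (a sketch; the script name is hypothetical), a 100-PE request submitted to the production pipe would presumably land in pe128, the smallest batch queue that can hold it:

#QSUB -l mpp_p=100           #100 PEs -> pe128 by first fit
#QSUB -l mpp_t=02:00:00      #within the 4 hr production limit

% cqsub my100pe.csh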

Slide 18: Queue Configuration on the T3E

Time (PDT)  Action
7:00 am     long256, pe256 stopped
10:00 pm    pe512 started; long128, pe128 stopped and checkpointed; pe64, pe32, pe16 run as backfill
1:00 am     pe512 stopped and checkpointed; long256, pe256, long128, pe128 started

Slide 19: LoadLeveler
- Product of IBM
- Conceptually very simple
- Few commands and options available
- Packs the system well with its backfilling algorithm
- Allows MIMD jobs
- Does not have checkpoint/restart, so it cannot stop running jobs to favor others

Slide 20: SP/LoadLeveler Terminology
- Keyword - used to specify your job parameters (e.g. number of nodes and wallclock time) to the LoadLeveler scheduler
- Node - a set of 2 processors that share memory and a switch adapter. NERSC users are charged for exclusive use of a node.
- Job ID - the identifier for a LoadLeveler job, e.g. gs01007.nersc.gov.101
- Switch - the high-speed connection between the nodes. All communication between nodes goes through the switch.
- Class - a user submits a batch job to a particular class. Each class has a different priority and different limits.

Slide 21: What LoadLeveler Does
- Jobs are submitted directly to LoadLeveler
- The following keywords are set by default:
  – node_usage = not_shared
  – tasks_per_node = 2
- The user can override tasks_per_node but not node_usage (see the sketch below)
- Incorrect keywords and parameters are passed silently to the scheduler! NERSC only checks for valid repo and class names
- A prolog script creates the $SCRATCH and $TMPDIR directories and environment variables
  – $SCRATCH is a global (GPFS) filesystem and $TMPDIR is local
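For example (a sketch; whether this helps is application-dependent), a code that needs a full node’s memory per MPI task can override the default and run one task per node:

# @ node           = 32
# @ tasks_per_node = 1

This yields 32 MPI tasks, one per node, each with both processors’ memory to itself; node_usage remains not_shared either way.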

Slide 22: LoadLeveler Commands
- llsubmit - submit your job
    % llsubmit script_file
    llsubmit: The job "gs01007.nersc.gov.101" has been submitted.
- llqs - monitor your job
- llq - get details about one of your queued or running jobs
- llcancel - delete your queued or running job
    % llcancel gs01007.nersc.gov.101
    llcancel: Cancel command has been sent to the central manager.

Slide 23: Sample SP Batch Script

#!/usr/bin/csh
# @ job_name         = myjob
# @ account_no       = repo_name
# @ output           = myjob.out
# @ error            = myjob.err
# @ job_type         = parallel
# @ environment      = COPY_ALL
# @ notification     = complete
# @ network.MPI      = css0,not_shared,us
# @ node_usage       = not_shared
# @ class            = regular
# @ tasks_per_node   = 2
# @ node             = 32
# @ wall_clock_limit = 01:00:00
# @ queue
./a.out < input
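A typical session around this script might look like the following (a sketch; the file name and the returned job ID are hypothetical):

% llsubmit myjob.csh
llsubmit: The job "gs01007.nersc.gov.102" has been submitted.
% llqs | grep myjob                # compact view of your job in the queue
% llq -l gs01007.nersc.gov.102     # full details for this one job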

Slide 24: Monitoring Your Job on the SP

gseaborg% llqs
Step Id  JobName          UserName  Class    ST  NDS  WallClck  Submit Time
gs…      a240             buffy     regular  R   32   00:31:44  3/13 04:30
gs…      s1.x             willow    regular  R   64   00:28:17  3/12 21:45
gs…      xdnull           xander    debug    R   5    00:05:19  3/14 12:44
gs…      gs01009.nersc.g  spike     regular  R   …    …:57:27   3/13 05:17
gs…      s2.x             willow    regular  I   64   04:00:00  3/12 21:48
gs…      s3.x             willow    regular  I   64   04:00:00  3/12 21:50
gs…      y1.x             willow    regular  I   64   04:00:00  3/12 22:17
gs…      y2.x             willow    regular  I   64   04:00:00  3/12 22:17
gs…      y3.x             willow    regular  I   64   04:00:00  3/12 22:17
gs…      gs01001.nersc.g  spike     regular  I   …    …:30:00   3/13 06:10
gs…      gs01009.nersc.g  spike     regular  I   …    …:30:00   3/13 07:17

Slide 25: Monitoring Your Job on the SP (cont’d)
- Issuing a ps command will show only what is running on that login node, not any instances of your parallel job
- If you could issue a ps command on a compute node running 2 MPI tasks of your parallel job, you would see:

gseaborg% ps -u jimbob
UID  PID  TTY  TIME  CMD
…    …    …    …:37  a.out
…    …    …    0:00  pmdv
…    …    …    0:00  LoadL_starter
…    …    …    …:28  a.out
…    …    …    0:00  pmdv
…    …    …    0:02  poe

Slide 26: Possible Job States on the SP

ST  Job State    Description
R   Running      The job is currently running.
I   Idle         The job is being considered to run.
NQ  Not Queued   The job is not being considered to run. This is probably because you have submitted more than 10 jobs.
ST  Starting     The job is starting to run.
HU  User Hold    The user put the job on hold. You must issue the llhold -r command in order for it to be considered for scheduling.
HS  System Hold  The job was put on hold by the system. This is probably because you are over disk quota in $HOME.
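For example, releasing a job you had placed in user hold might look like this (a sketch; the job ID is hypothetical):

% llhold -r gs01007.nersc.gov.102

After release, the job returns to the Idle (I) state and is again considered for scheduling.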

Slide 27: Current Class Limits on the SP

CLASS        NODE  TIME    PRIORITY
debug        16    30 min  20000
premium      256   4 hr    10000
regular      256   4 hr    5000
low          256   4 hr    1
interactive  8     20 min  15000

The same configuration runs all the time.

Slide 28: More Information
Please see the NERSC Web documentation:
- The IBM batch system:
- The Cray batch system:
- Batch differences between IBM and Cray: