© 2014 IBM Corporation SLURM for Yorktown Bluegene/Q

SLURM on Wat2q – Goals
 Set up a scheduler for the Yorktown Bluegene system to increase research utilization of the system.
 Become familiar with the Bluegene/Q SRM (system resource manager) interfaces, as it is a model for future HPC control APIs.
 Divide the Yorktown system into multiple submidplane blocks.
 Develop scripts to allow users (optionally) to land on a specific submidplane block.
 Get slurm to run the bgas.pl script automatically based on information in the SLURM sbatch command used to queue a job. This requires that jobs be limited to running on complete partitions; SLURM by default will attempt to run a job on part of a submidplane partition if that partition is already booted. This is accomplished with prolog scripts.

SLURM Scheduling Jobs

SLURM Allocation vs. Task Placement
 Allocation is the selection of the resources needed for the job.
 – Each job includes zero or more job steps (srun).
 – Each job step is comprised of one to multiple tasks.
 – This is done by the "sbatch" command.
 Task placement is the process of assigning a subset of the job's allocated resources (cpus) to each task.
 – This is handled by the SLURM "srun" command invoked from within the script scheduled by "sbatch".
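A minimal sketch of the split (script and step names are hypothetical; the "prod" partition name is taken from a later slide): the sbatch options request the allocation, while each srun inside the script is a job step that places tasks on the allocated resources.

   #!/bin/bash
   #SBATCH --nodes=64          # allocation: sbatch reserves 64 nodes
   #SBATCH --partition=prod    # partition name assumed from the "Partitions" slide

   # task placement: each srun below is a separate job step within the allocation
   srun --ntasks-per-node=16 ./step1.elf   # hypothetical executable
   srun --ntasks-per-node=32 ./step2.elf   # a second step reusing the same allocation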

Effectively this becomes a game of Tetris.

Slurm documentation
 Slurm docs can be found online. Typical commands:
 – sacct – displays accounting data for all jobs and job steps in the SLURM job accounting log.
 – sbatch – submit a batch job to SLURM.
 – scancel – used to signal jobs or job steps that are under the control of Slurm.
 – scontrol – used to view and modify Slurm configuration and state.
 – sinfo – view information about SLURM nodes and partitions.
 – smap – graphically view information about SLURM jobs, partitions, and set configuration parameters.
 – squeue – view information about jobs located in the SLURM scheduling queue.
 – srun – run parallel jobs.
 – sstat – display various status information of a running job/step.
 – sview – graphical user interface to view and modify SLURM state.
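For example, a few typical invocations (a sketch; the job id 1234 is hypothetical):

   sinfo                    # list partitions and node states
   squeue -u $USER          # show your queued and running jobs
   scontrol show job 1234   # detailed state of one job
   sacct -j 1234            # accounting records for the same job
   scancel 1234             # cancel the job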

SLURM functions
 SLURMD carries out five key tasks and has five corresponding subsystems:
 – Machine Status: responds to SLURMCTLD requests for machine state information and sends asynchronous reports of state changes to help with queue control.
 – Job Status: responds to SLURMCTLD requests for job state information and sends asynchronous reports of state changes to help with queue control.
 – Remote Execution: starts, monitors, and cleans up after a set of processes (usually shared by a parallel job), as decided by SLURMCTLD (or by direct user intervention).
 – Stream Copy Service: handles all STDERR, STDIN, and STDOUT for remote tasks. This may involve redirection, and it always involves locally buffering job output to avoid blocking local tasks.
 – Job Control: propagates signals and job-termination requests to any SLURM-managed processes (often interacting with the Remote Execution subsystem).

Slurm software
 SLURM daemons don't execute directly on the compute nodes.
 SLURM gets system state, resource allocations, and other state from the Bluegene/Q control system.
 This interface is entirely contained in a SLURM plugin (src/plugins/select/bluegene).
 The user interacts with Bluegene through the following slurm commands:
 – sbatch
 – srun
 – scontrol
 – squeue

Slurm Architecture for Bluegene/Q

Job Launch Process

Sview of BlueGene system

Slurm naming conventions
 Slurm names things with torus coordinates.
 – Top-level names use 4-dimension midplane coordinates.
 – Submidplane partitions use 5-dimension torus coordinates.

 Slurm name / Bgq name (midplane level):
 – R00-M0 / bgq0000
 – R00-M1 / bgq0001
 – R01-M0 / bgq0010
 – R01-M1 / bgq0011
 – R00 / bgq[0000x0001]
 – R01 / bgq[0010x0011]
 – R00R01 / bgq[0000x0011]

 Slurm name / Bgq name (node card level):
 – R00-M0-N00 / bgq0000[00000x11111]
 – R00-M0-N01 / bgq0000[00200x11311]
 – R00-M0-N02 / bgq0000[00020x11131]
 – R00-M0-N03 / bgq0000[00220x11331]
 – R00-M0-N04 / bgq0000[20000x31111]
 – R00-M0-N05 / bgq0000[20200x31311]
 – R00-M0-N06 / bgq0000[20020x31131]
 – R00-M0-N07 / bgq0000[20220x31331]
 – R00-M0-N08 / bgq0000[02000x12111]
 – R00-M0-N09 / bgq0000[02200x13311]
 – R00-M0-N10 / bgq0000[02020x13131]
 – R00-M0-N11 / bgq0000[03330x13331]
 – R00-M0-N12 / bgq0000[22000x33111]
 – R00-M0-N13 / bgq0000[22200x33311]
 – R00-M0-N14 / bgq0000[22020x33131]
 – R00-M0-N15 / bgq0000[22220x33331]

 Example larger block:
 – R01-M0-N00-128 / bgq0010[00000x11331]

Slurm queuing a job
 Use the sbatch command to queue a script that will run one or more jobs.
 Within the script presented to the sbatch command, do one or more "srun" commands.
 – The srun command will eventually cause a runjob command to be created.
 For example, this schedules the script rj01.sh to be run when a 64-node block on the partition "prod" is booted:

   sbatch --nodes=64 --partition=prod rj01.sh

 – Inside rj01.sh we have:

   #!/bin/bash
   srun --chdir=/bgusr/home1/bvt_scratch /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf

 – The srun will call runjob as follows:

   runjob --exe /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf --block RMP28Ap --cwd /bgusr/home1/bvt_scratch

Queuing a job with only one script
 Using sbatch/srun to queue a job typically requires two scripts: one to queue the job (sbatch), and one to run one or more jobs (srun) once the block is allocated.
 One can do this with a single script with this simple boilerplate:

   #!/bin/bash
   if [ -z "$SLURM_JOBID" ]; then
     sbatch --gid=bqluan --time=5:00 --nodes=128 --ntasks-per-node=32 -O --qos=umax-128 $0
   else
     srun --chdir=/gpfs/DDNgpfs2/bqluan/mushroomP \
       --output=equilibrate-4V-21-new.out --error=equilibrate-4V-22-new.err \
       /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 equilibrate-4V-22-new.namd
   fi

 The above script is a re-expression of the following (original) runjob script:

   runjob --block R01-M0-N --ranks-per-node 32 --cwd /gpfs/DDNgpfs2/bqluan/mushroomP \
     --exe /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 \
     --args equilibrate-4V-21-new.namd > equilibrate-4V-21-new.out 2> equilibrate-4V-21-new.err &
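With this boilerplate the same file serves both roles: executed directly, $SLURM_JOBID is unset and the script queues itself via sbatch; when SLURM later runs the queued copy, $SLURM_JOBID is set and the srun branch executes inside the allocation. Usage would look like this (the file name myjob.sh is hypothetical):

   chmod +x myjob.sh
   ./myjob.sh    # queues itself; the srun branch runs later inside the allocation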

Srun/runjob decoder
 Runjob option / Srun option:
 – --cwd / --chdir
 – --exe / (first field without an option)
 – --label xx / --label=xx
 – --verbose / --verbose
 – --ranks-per-node / --ntasks-per-node
 – All other options / --launcher_opts=
 The launcher options field is a catch-all for all other runjob options. For example:

   --launcher-opts="--timeout 300 --strace"
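As a worked example of the mapping (a sketch; the path, executable, and rank count are hypothetical):

   # runjob form:
   runjob --ranks-per-node 32 --cwd /tmp/scratch --exe ./a.out

   # equivalent srun form, invoked from inside an sbatch script:
   srun --ntasks-per-node=32 --chdir=/tmp/scratch ./a.out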

Partitions (SLURM queue names)
 We have set up multiple basic slurm queues (partitions):
 – prod – regular production nodes (R00-M0, R00-M1, R01-M0, R01-M1).
 – bgas – full system bgas allocation (R00-M0, R00-M1, R01-M0, R01-M1).
 There are a couple of midplane-level reservations set up to run each day:
 – bgas_daily – active 3am to 3:30pm.
 – bgas_full – 3:30pm to 6pm.
 The default queue/partition is the "prod" queue.
 The queue/partition name is used by the prolog script to determine if it is necessary to switch the IO nodes to either BGAS or production.
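For example, selecting the queue at submission time (a sketch; script names are hypothetical):

   sbatch --nodes=64 rj_prod.sh                    # lands on the default "prod" queue
   sbatch --partition=bgas --nodes=64 rj_bgas.sh   # explicitly targets the bgas partition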

SLURM small block divisions
 Block divisions as of May 2014:
 – bgq0000 (R00-M0) – divided into 32-way blocks.
 – bgq0001 (R00-M1) – divided into 32, 64, 128, 256 way (overlapping) blocks.
 – bgq0010 (R01-M0) – divided into 32, 64, 128, 256 way (overlapping) blocks.
 – bgq0011 (R01-M1) – divided into 32, 64, 128, 256 way (overlapping) blocks.
 The sbatch option "--nodes=xx", where xx is either 32, 64, 128, or 256, will cause a job to land on one of the small block partitions. Slurm will pick which small block to run it on.
 Prolog scripts ensure that partial blocks are not used (i.e., two 32-way jobs running on the same 64-way block at the same time).
 You can restrict which midplane slurm will try to select its blocks from with --nodelist=xxxx, where xxxx is bgq0000, bgq0001, bgq0010, or bgq0011 (see the sketch below).
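For example (a sketch; the script name is hypothetical):

   sbatch --nodes=64 rj.sh                      # any free 64-way block; slurm picks the midplane
   sbatch --nodes=64 --nodelist=bgq0010 rj.sh   # restrict the 64-way block to R01-M0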

Getting SLURM to run on a specific node card/block
 To get slurm to land on a specific block we use the prolog script and the "nodelist" and "constraint" options for sbatch.
 For example:

   sbatch --partition=prod --nodelist=bgq0000 --nodes=32 --constraint=N00-32

 NOTE:
 – The --nodes option and the constraint must agree as to the size.
 – A sub-block of that size MUST exist on the nodelist requested.
 Valid constraints are:
 – Nxx-32, where xx is 00-15.
 – Nxx-64, where xx is 00, 02, 04, 06, 08, 10, 12, 14.
 – Nxx-128, where xx is 00, 04, 08, 12.
 – Nxx-256, where xx is 00, 08.
 If the block is not capable of being scheduled, the job will be canceled and a message will appear in the stdout file (slurm-$jobid.out).
 Trying to use the higher-numbered Nxx cards for 64 and 32 ways is discouraged, because the system will try to run jobs on the lower-numbered cards first, working down each node card in turn until it lands on the card it needs to run on.

SLURM job order
 If the user uses the --constraint parameter to select a specific node card, the order that jobs are submitted in may not be respected.
 This is because the prolog scripts can reject the node SLURM first selects, either due to it trying to run on a block larger than requested, or due to a constraint.
 – When the job is rejected on a specific node, it gets re-queued, and this will cause some reordering.
 If job order is required, one can use the singleton dependency and the --job-name option as follows:

   sbatch --job-name=a --dependency=singleton -N32 --constraint=N01-32 rj01.sh

 Another way to do this is with other forms of "--dependency" (see the sketch after this list):
 – after:job_id[:jobid...] – this job can begin execution after the specified jobs have begun execution.
 – afterany:job_id[:jobid...] – this job can begin execution after the specified jobs have terminated.
 – afternotok:job_id[:jobid...] – this job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc.).
 – afterok:job_id[:jobid...] – this job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
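A sketch of an explicit ordering chain with --dependency (script names are hypothetical; sbatch prints "Submitted batch job <id>", so awk pulls out the id):

   jobid=$(sbatch -N32 --constraint=N01-32 first.sh | awk '{print $4}')
   sbatch --dependency=afterok:$jobid -N32 --constraint=N01-32 second.sh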

SLURM reservations
 Slurm can reserve an entire midplane for jobs by a specific reservation id.
 The current version can only reserve entire midplane blocks (not sub-midplane).
 – The September release of SLURM is supposed to have better sub-midplane capabilities for both node selection and reservations.
 Creating a reservation:

   scontrol create reservation user=myid starttime=now duration=120 \
     nodes=bgq0001

 – This will reply with a reservation id as follows: Reservation created: myid_5
 Using the reservation:

   sbatch --reservation=myid_5 --nodes=64 my.script

 The SLURM web page on reservations outlines them in more detail.
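To inspect or clean up reservations (a sketch; the reservation name reuses the example above):

   scontrol show reservation                # list active reservations
   scontrol delete ReservationName=myid_5   # remove the reservation when done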

Reservation / time-limit interaction
 For each job in the queue there is an execution time limit imposed on it.
 The default for this normally comes from the queue name.
 – It can be overridden at various levels, such as on the sbatch command line.
 – The initial default for the SLURM queues is 1 hour, so to override it use the --time parameter on the sbatch as follows:

   sbatch --time=xxx nameofscript.sh

   The xxx value is in minutes; other forms of date/times can be found in the sbatch man page ("man sbatch").
 The job will not run if the time limit overlaps a node reservation.
 – So, for example, if there is a reservation every day at 3:30 for the entire machine and the time limit associated with the job would overlap that full-system reservation, the job won't run until after the reservation is over.
 – If the time limit exceeds the queue/partition time limit, the job will be left in the pending state indefinitely.
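For example (a sketch; the script name is hypothetical):

   sbatch --time=30 my.sh        # 30 minutes
   sbatch --time=2:00:00 my.sh   # 2 hours, using the HH:MM:SS form from the sbatch man page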

QOS settings
 QOS (quality of service) settings are used by SLURM to control limits on the amount of resources a given user/group/account/job can consume at any one time.
 Our initial deployment of SLURM will associate a default QOS setting limiting each user to the total number of compute nodes that they previously had as a static allocation.
 This will be used to keep users from consuming all of the machine by submitting multiple sbatch commands, but still allow a user to run three 32-way jobs if their normal allocation was 128 nodes.
 Each user will have a "default QOS" setting associated with their ID, as well as a list of QOS settings they are allowed to use:
 – umax-32 == user max nodes = 32
 – umax-64 == user max nodes = 64
 – umax-128 == user max nodes = 128
 – ...
 One can select one of the authorized QOS settings on the sbatch command line as follows:

   sbatch --qos=umax-128 --nodes=32 xx.sh

 – The above command would allow the user to run four 32-way jobs in parallel before the queue would back up their jobs behind other work.
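To see which QOS settings an ID is authorized to use (a sketch, assuming the SLURM accounting database is configured; the script name xx.sh reuses the example above):

   sacctmgr show assoc user=$USER format=user,account,defaultqos,qos
   sbatch --qos=umax-64 --nodes=64 xx.sh   # explicitly select one of the authorized QOS values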