Swarm on the Biowulf2 Cluster Dr. David Hoover, SCB, CIT, NIH September 24, 2015

What is swarm? Wrapper script that simplifies running individual commands on the Biowulf cluster
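As a concrete sketch (the swarmfile name follows the later slides, but the two gzip commands are hypothetical): each line of the file is an independent command that swarm runs as its own subjob.
$ cat file.swarm
gzip /data/user/sample1.fastq
gzip /data/user/sample2.fastq
$ swarm -f file.swarm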

Documentation 'man swarm' 'swarm --help'

Differences in Biowulf 1 -> 2
Jobs are allocated cores, rather than nodes
All resources must be allocated
All swarms are job arrays
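Because every swarm is submitted as a Slurm job array, the standard array environment variables should be visible inside each subjob. A minimal sketch (the swarmfile content is illustrative, not from the slides):
$ cat file.swarm
echo "subjob $SLURM_ARRAY_TASK_ID of array $SLURM_ARRAY_JOB_ID ran on $(hostname)"
echo "subjob $SLURM_ARRAY_TASK_ID of array $SLURM_ARRAY_JOB_ID ran on $(hostname)"
$ swarm -f file.swarm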

Architectural Digest: a node contains sockets, each socket contains cores, and each core presents cpus (hyperthreads)

Simple swarm: single-threaded swarm (diagram showing how subjobs of 'hostname ; uptime' commands map onto cpus for swarm and swarm -t 2)

Simple swarm: multi-threaded swarm (diagram showing how subjobs of 'hostname ; uptime' commands map onto cpus for swarm -t 3 and swarm -t 4)

Basic swarm usage
Create a swarmfile (file.swarm), submit, and wait for output
cd /home/user/dir1 ; ls -l
cd /home/user/dir2 ; ls -l
cd /home/user/dir3 ; ls -l
cd /home/user/dir4 ; ls -l
$ swarm -f file.swarm
$ ls
swarm_<jobid>_0.o swarm_<jobid>_1.o swarm_<jobid>_2.o swarm_<jobid>_0.e swarm_<jobid>_1.e swarm_<jobid>_2.e

Standard swarm options
-f : swarmfile, list of commands to run
-g : GB per process (NOT PER NODE OR JOB!)
-t : threads/cpus per process (DITTO!)
$ swarm -f file.swarm -g 4 -t 4
command -p 4 -i /path/to/input1 -o /path/to/output1
command -p 4 -i /path/to/input2 -o /path/to/output2
command -p 4 -i /path/to/input3 -o /path/to/output3
command -p 4 -i /path/to/input4 -o /path/to/output4

-t auto
Autothreading is still enabled; this allocates an entire node to each process
java -Xmx8g -jar /path/to/jarfile -opt1 -opt2 -opt3
$ swarm -f file.swarm -g 4 -t auto

-b, --bundle
Bundling is slightly different: swarms of > 1000 commands are autobundled, and --autobundle is deprecated
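To preview how a bundle factor will group an existing swarmfile without actually submitting it, -b can be combined with the --devel flag shown on its own slide below. A sketch, assuming file.swarm already exists:
$ swarm -f file.swarm -b 4 --devel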

--singleout
Concatenates all .o and .e output into single files
Not entirely reliable! CANCELLED and TIMEOUT jobs will lose all output
Better to use --logdir (described later)

Miscellaneous --usecsh --no-comment, --comment-char --no-scripts, --keep-scripts --debug, --devel, --verbose, --silent

--devel
$ swarm --devel -f file.swarm -b 4 -g 8 -v
SWARM
├── subjob 0: 4 commands (1 cpu, 8.00 gb)
├── subjob 1: 4 commands (1 cpu, 8.00 gb)
├── subjob 2: 4 commands (1 cpu, 8.00 gb)
├── subjob 3: 4 commands (1 cpu, 8.00 gb)
4 subjobs, 16 commands, 0 output files
16 commands run in 4 subjobs, each command requiring 8 gb and 1 thread, running 4 processes serially per subjob
sbatch --array=0-3 --job-name=swarm --output=/home/user/test/swarm_%A_%a.o --error=/home/user/test/swarm_%A_%a.e --cpus-per-task=1 --mem= --partition=norm --time=16:00:00 /spin1/swarm/user/I8DQDX4O.batch

No more .swarm directories
swarm scripts are written to a central, shared area; each user has their own subdirectory
$ tree /spin1/swarm/user
/spin1/swarm/user
├── <subdir>
│   ├── cmd.0
│   ├── cmd.1
│   ├── cmd.2
│   └── cmd.3
└── <subdir>.batch

--license
--license replaces -R or --resource
$ swarm -f file.swarm --license=matlab

--module
--module now requires a comma-delimited list, rather than a space-delimited list:
$ swarm -f file.swarm --module python,samtools,bwa,tophat
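Alternatively, if different lines need different tools, modules can be loaded per command inside the swarmfile itself. A sketch, assuming the module command is initialized in the batch environment as usual on Biowulf (the file paths are hypothetical; samtools and bwa are just example modules):
module load samtools; samtools flagstat /data/user/sample1.bam
module load bwa; bwa index /data/user/ref.fa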

--gres
--gres stands for "generic resources" and is used for allocating local disk space; it replaces --disk
$ swarm -f file.swarm --gres=lscratch:50
The above example allocates 50 GB of scratch space, available in the job as /lscratch/$SLURM_JOBID

-p, --processes-per-subjob
For single-threaded commands; -p can only be 1 or 2
(diagram: with swarm -p 2, two single-threaded commands such as 'hostname ; uptime' run per subjob)

--logdir
Redirects .o and .e files to a directory; the directory must already exist
$ mkdir /data/user/trashbin
$ swarm -f file.swarm --logdir /data/user/trashbin
$ ls
file.swarm
$ ls /data/user/trashbin
swarm_<jobid>_0.o swarm_<jobid>_1.o swarm_<jobid>_2.o swarm_<jobid>_0.e swarm_<jobid>_1.e swarm_<jobid>_2.e

--time
All jobs must have a walltime now
--time for swarm is per command, not per swarm or per subjob
--time is multiplied by the bundle factor
$ swarm -f file.swarm --devel --time=01:00:00
32 commands run in 32 subjobs, each command requiring 1.5 gb and 1 thread, allocating 32 cores and 64 cpus
sbatch --array= --time=01:00:00 /spin1/swarm/hooverdm/iMdaW6dO.batch
$ swarm -f file.swarm --devel --time=01:00:00 -b 4
32 commands run in 8 subjobs, each command requiring 1.5 gb and 1 thread, running 4 processes serially per subjob
sbatch --array= --time=04:00:00 /spin1/swarm/hooverdm/zYrUbkiO.batch

Primary sbatch options --job-name --dependency --time, --gres --partition --qos

ALL sbatch options
--sbatch
Type 'man sbatch' for more information
$ swarm -f file.swarm --sbatch "--mail-type=FAIL --export=var=100,nctype=12 --workdir=/data/user/test"

--prologue and --epilogue
Way too difficult to implement; conflicts with the --prolog and --epilog options to srun

-W block=true
sbatch does not allow blocking; must use srun instead

-R, --resource
gpfs is available on all nodes
Replaced by a combination of --license, --gres, and --constraint

Examples
single-threaded commands: $ swarm -f file.swarm
multi-threaded commands: $ swarm -f file.swarm -t 4
large memory, single-threaded: $ swarm -f file.swarm -g 32

Examples
>10K single-threaded commands; wait for it and deal with the output...
$ mkdir /data/user/bigswarm
$ swarm -f file.swarm --job-name bigswarm --logdir /data/user/bigswarm
$ cd /data/user/bigswarm
$ cat *.e > bigswarm.err
$ cat *.o > bigswarm.out
$ rm *.{e,o}

Examples
Large temp files
$ swarm -f file.swarm --gres=lscratch:200
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 1
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 2
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 3
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 4

Examples
Dependencies
$ sbatch script_1.sh
$ swarm -f file.swarm --dependency=afterany:
$ swarm -f file2.swarm --dependency=afterany:
$ sbatch sbatch_2.sh --dependency=afterany:
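The job IDs after afterany: were lost in transcription. One way to chain them, assuming sbatch --parsable returns the job ID and that swarm likewise prints the ID of the job array it submits (script and swarmfile names are from the slide):
$ job1=$(sbatch --parsable script_1.sh)
$ job2=$(swarm -f file.swarm --dependency=afterany:$job1)
$ job3=$(swarm -f file2.swarm --dependency=afterany:$job2)
$ sbatch --dependency=afterany:$job3 sbatch_2.sh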

Examples Long-running processes $ swarm -f file.swarm --time=4-00:00:00

Defaults and Limits
1,000 subjobs per swarm
4,000 jobs per user max
30,000 jobs in slurm max
1.5 GB/process default, 1 TB max
0 GB disk default, 800 GB max
4 hours walltime default, 10 days max
batchlim and freen
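The two Biowulf utilities named in the last line can be run with no arguments to check current batch limits and free resources before sizing a swarm (their exact output is not shown here):
$ freen
$ batchlim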

Monitoring Swarms
Current and running jobs: squeue, sjobs, jobload
Historical: sacct, jobhist
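A few example invocations; squeue and sacct flags are standard Slurm, while jobload and jobhist are Biowulf wrappers whose arguments are assumed here (the job ID is a placeholder):
$ squeue -u $USER
$ jobload -u $USER
$ sacct -j 12345678
$ jobhist 12345678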

Stopping Swarms scancel
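For example (the job ID is a placeholder; --user and --name are standard scancel options, and swarm jobs default to the job name 'swarm', as seen in the --devel output earlier):
$ scancel 12345678
$ scancel --user=$USER --name=swarm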

Complex Examples
Rerunnable swarm; -e tests if the file exists
[[ -e file1.flag ]] || ( command1 && touch file1.flag )
[[ -e file2.flag ]] || ( command2 && touch file2.flag )
[[ -e file3.flag ]] || ( command3 && touch file3.flag )
[[ -e file4.flag ]] || ( command4 && touch file4.flag )

Complex Examples
Very long command lines
cd /data/user/project; KMER="CCCTAACCCTAACCCTAA"; \
jellyfish count -C -m ${#KMER} \
  -t 32 \
  -c 7 \
  -s \
  -o /lscratch/$SLURM_JOBID/39sHMC_Tumor_genomic \
  <(samtools bam2fq /data/user/bam/0A4HMC/DNA/genomic/39sHMC_genomic.md.bam ); \
echo ${KMER} | jellyfish query /lscratch/$SLURM_JOBID/39sHMC_Tumor_genomic_0 \
  > 39sHMC_Tumor_genomic.telrpt.count

Complex Examples
Comments: --comment-char, --no-comment
# This is for the first file
command -i infile1 -o outfile1 -p 2 -r /path/to/file
# This is for the next file
command -i infile2 -o outfile2 -p 2 -r /path/to/another/file

Complex Examples
Environment variables
Defined BEFORE the job and passed to the swarm:
$ swarm -f file.swarm --sbatch "--export=FILE=/path/to/file,DIR=/path/to/dir,VAL=12345"
Defined WITHIN the job, as part of the swarmfile:
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 1
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 2
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 3
export TMPDIR=/lscratch/$SLURM_JOBID ; command -opt 4

Blank Space for More Examples

Questions? Comments?