Using the BYU Supercomputers

Resources

Basic Usage
After your account is activated:
– ssh ssh.fsl.byu.edu
  You will be logged in to an interactive node
– Jobs that run on the supercomputer are submitted to the batch queuing system
– You can develop code on the interactive nodes
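For example, a first session might look like the following. This is a minimal sketch; the login name mynetid, the directory hello, and the file hello.c are placeholders, not from the original slides.

  # log in to an interactive node (replace mynetid with your own account name)
  ssh mynetid@ssh.fsl.byu.edu
  # copy source code from your workstation to your home directory on the cluster
  scp hello.c mynetid@ssh.fsl.byu.edu:hello/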

Running Jobs
The process:
– User creates a shell script that will:
    tell the scheduler what is needed
    run the user's job
– User submits the shell script to the batch scheduler queue
– Machines register with the scheduler, offering to run jobs
– Scheduler allocates jobs to machines and tracks the jobs
– The shell script is run on the first node of the group of nodes assigned to the job
– When finished, all stdout and stderr are collected and returned to the user in files

Scheduling Jobs
Basic commands:
– qsub scheduling_shell_script
    qsub -q anynode scheduling_shell_script
    qsub -q test scheduling_shell_script
– showq [-u username]
– qdel jobnumber
– checkjob [-v] jobnumber
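A typical monitoring session, as a minimal sketch (hello.pbs is the script from the next slide; the job ID is a placeholder printed by qsub):

  qsub -q test hello.pbs    # submit to the test queue; qsub prints the job ID
  showq -u $USER            # list only your own jobs
  checkjob -v <jobid>       # detailed status of one job, using the ID printed by qsub
  qdel <jobid>              # cancel the job if it is no longer needed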

Job Submission Scripts

#!/bin/bash
#PBS -l nodes=1:ppn=4,walltime=00:05:00
#PBS -l pmem=256mb
#PBS -M
#PBS -m ae
#PBS -N Hello
echo "$USER: Please change the -M to your address before submitting. Then remove this line."; exit 1
cd hello
echo "The root node is `hostname`"
echo "Here are all the nodes being used"
cat $PBS_NODEFILE
echo "From here on is the output from the program"
mpirun hello

Directive notes:
– -M address: where to send notification email
– -m: when to send mail (a=abort, b=begin, e=end)
– -l: define resources
– -N: job name
– -l procs=4: request any 4 processors
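The -l procs=4 form noted above requests any four processors rather than four cores on a single node. A minimal sketch of such a script, assuming the same hello program; the job name and the use of $PBS_O_WORKDIR (the directory the job was submitted from) are illustrative, not from the original slides:

  #!/bin/bash
  #PBS -l procs=4,walltime=00:05:00   # any 4 processors, 5-minute limit
  #PBS -l pmem=256mb                  # 256 MB of memory per process
  #PBS -N Hello_procs                 # job name (placeholder)
  cd $PBS_O_WORKDIR                   # start where the job was submitted from
  mpirun hello                        # run the MPI program on the allocated processors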

Viewing Your Jobs

bash-2.05a$ showq
ACTIVE JOBS
JOBNAME   USERNAME   STATE     PROC   REMAINING            STARTTIME
m1015i    taskman    Running      1    18:39:00   Wed Aug 14 08:06:24
m1015i    taskman    Running      1    18:39:00   Wed Aug 14 08:06:24
m1015i    taskman    Running      1    18:39:00   Wed Aug 14 08:06:24
…
m1015i    taskman    Running      1    21:33:42   Wed Aug 14 11:01:06
m1015i    taskman    Running      1    23:43:05   Wed Aug 14 13:10:29
m1015i    dvd        Running      4  2:15:10:38   Wed Aug 14 04:38:02
m1015i    mdt36      Running      8  2:23:14:21   Wed Aug  7 12:41:45
…
m1015i    jar65      Running      4  9:04:07:44   Tue Aug 13 17:35:08
m1015i    jar65      Running      4  9:08:28:16   Tue Aug 13 21:55:40
m1015i    to5        Running      8  9:21:11:49   Wed Aug 14 10:39:13
m1015i    to5        Running      8  9:21:11:49   Wed Aug 14 10:39:13

35 Active Jobs   150 of 184 Processors Active (81.52%)
                  26 of  34 Nodes Active (76.47%)

IDLE JOBS
JOBNAME   USERNAME   STATE     PROC      WCLIMIT            QUEUETIME
m1015i    jl447      Idle         2   5:00:00:00   Tue Aug 13 07:08:09
m1015i    dvd        Idle         8   3:00:00:00   Tue Aug 13 10:45:18
…
23 Idle Jobs

NON-QUEUED JOBS
JOBNAME   USERNAME   STATE     PROC      WCLIMIT            QUEUETIME

Total Jobs: 58   Active Jobs: 35   Idle Jobs: 23   Non-Queued Jobs: 0

The process

-bash-3.2$ qsub hello.pbs
fslsched.fsl.byu.edu
-bash-3.2$ showq -u qos
active jobs
JOBID   USERNAME   STATE   PROCS   REMAINING   STARTTIME

0 active jobs
0 of 2968 processors in use by local jobs (0.00%)
343 of 371 nodes active (92.45%)

eligible jobs
JOBID   USERNAME   STATE   PROCS    WCLIMIT            QUEUETIME
        qos        Idle        4   00:05:00   Wed Jan  6 10:27:52

1 eligible job

blocked jobs
JOBID   USERNAME   STATE   PROCS    WCLIMIT   QUEUETIME

0 blocked jobs

Total jobs: 1

The process

-bash-3.2$ checkjob -v
job  (RM job 'fslsched.fsl.byu.edu')
AName: Hello
State: Idle
Creds:  user:qos  group:qos  account:qos  class:batch
WallTime:   00:00:00 of 00:05:00
SubmitTime: Wed Jan  6 10:27:52
  (Time Queued  Total: 00:03:08  Eligible: 00:02:47)
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 4
Total Requested Nodes: 4
Req[0]  TaskCount: 4  Partition: ALL
TasksPerNode: 1  NodeCount: 4
UMask: 0000
OutputFile: m5int02:/fslhome/qos/hello/Hello.o
ErrorFile:  m5int02:/fslhome/qos/hello/Hello.e
Partition List: ALL,base,SHARED
SrcRM: base  DstRM: base  DstRMJID: fslsched.fsl.byu.edu
Submit Args: hello.pbs
Flags:       RESTARTABLE,FSVIOLATION
Attr:        FSVIOLATION,checkpoint
StartPriority: 1644
PE: 4.00
Node Availability for Partition base
available for 2 tasks  - m5-8-[5]:m5-18-[15]
rejected for Class     - m5-20-[5-16]:m5f-1-[1-2]:mgpu-1-[1]:m5-21-[1-16]:mgpu-1-[2]
rejected for State     - m5-1-[1-16]:m5-2-[1-16]:m5-3-[1-16]:m5-4-[1-16]:m5-5-[1-16]:m5-6-[1-16]:m5-7-[1-16]:m5-8-[1-16]:m5-9-[1-16]:m5-10-[1-16]:m5-11-[1-16]:m5-12-[1-16]:m5-13-[1-16]:m5-14-[1-16]:m5-15-[1-16]:m5-16-[1-16]:m5-17-[1-16]:m5-18-[1-16]:m5-19-[1-16]:m5-20-[1-12]:m5q-2-[1-16]:m5q-1-[1-16]
NOTE: job cannot run in partition base (insufficient idle nodes available: 2 < 4)

Developing Code
Normal Linux code development tools:
– gcc, g++, gdb, etc.
Intel compiler:
– icc, ifort
Editing:
– vi
– emacs
– edit on your own machine and transfer
Parallel code development:
– icc -openmp
– gcc -fopenmp
– mpicc
You may need to run:
– mpi-selector --list
– mpi-selector --set fsl_openmpi_intel (check the name)
– mpi-selector --unset (go back to the default)
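For example, building serial, OpenMP, and MPI versions of a program might look like the following. This is a minimal sketch: hello.c, hello_omp.c, and hello_mpi.c are placeholder file names, and the mpi-selector stack name should be confirmed with mpi-selector --list.

  gcc -O2 hello.c -o hello                 # serial build with the GNU compiler
  gcc -fopenmp hello_omp.c -o hello_omp    # OpenMP build with the GNU compiler
  icc -openmp hello_omp.c -o hello_omp     # OpenMP build with the Intel compiler
  mpi-selector --set fsl_openmpi_intel     # choose an MPI stack (check the exact name with --list)
  mpicc hello_mpi.c -o hello               # MPI build with the selected wrapper compiler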

Output
stderr and stdout from each node are collected into files:
– Jobname.oJOBNUM
– Jobname.eJOBNUM

-bash-3.2$ cat Hello.o
The root node is m local
Here are all the nodes being used
m
m
m5-5-1
m5-5-5
From here on is the output from the program
I am running on m local
I am running on m5-5-1.local
I am running on m local
I am running on m5-5-5.local
I am proc 0 of 4 running on m local
I am proc 2 of 4 running on m5-5-1.local
I am proc 1 of 4 running on m local
I am proc 3 of 4 running on m5-5-5.local
14:01:33 up 78 days, 4:54, 0 users, load average: 6.00, 5.97,
:01:33 up 78 days, 4:54, 0 users, load average: 6.06, 5.91, 4.98
USER TTY FROM IDLE JCPU PCPU WHAT
Sending messages
Receiving messages
14:01:33 up 78 days, 4:55, 0 users, load average: 7.11, 6.45, 6.41
USER TTY FROM IDLE JCPU PCPU WHAT
14:01:33 up 78 days, 4:55, 0 users, load average: 4.07, 6.21, 7.51
USER TTY FROM IDLE JCPU PCPU WHAT
Sending messages
Receiving messages
2: 0: Hello
2: 1: Hello
2: 3: Hello
1: 0: Hello
1: 2: Hello
1: 3: Hello
0: 1: Hello
0: 2: Hello
3: 0: Hello
3: 1: Hello
3: 2: Hello
0: 3: Hello
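If a single output file is preferred, Torque/PBS can merge the two streams. A brief sketch, assuming the standard -j and -o directives are available on this installation (they are not shown in the original slides):

  #PBS -j oe          # join stderr into stdout, producing only Jobname.oJOBNUM
  #PBS -o hello.out   # optionally name the combined output file yourself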

Policies

Per User Policies
– Max Jobs Running: soft limit of 400, hard limit of 525
– Max Processors Running: soft limit of 440, hard limit of 550
– Max Jobs Eligible: hard limit of 768
– Max Processors Eligible: hard limit of 1600

Per Research Group Policies
– Max Processors Running: soft limit of 512, hard limit of 630

Per Job Policies
In addition to the other policies, each job is subject to the following limitations:
– Max Total Running Time Requested: no job will be allowed to request more than 16 days of total running time. NOTE: most high-performance computing facilities limit this to between 24 and 72 hours.
– Max CPU Time Requested: CPU time is the product of CPU count and total running time requested. Currently, this is the equivalent of 128 processors for 14 days, or 1792 processor-days. For example, a job could use 256 processors for 7 days, or 384 processors for 112 hours.
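As a quick check before submitting, the CPU-time product can be computed in the shell. A minimal sketch; the 256-processor, 7-day request is the example from the slide above:

  # CPU time = processor count x requested walltime (in days); must not exceed 1792 processor-days
  procs=256
  days=7
  echo $((procs * days))   # prints 1792, exactly at the limit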

Backfill Scheduling
[Diagram: jobs A, B, C, and D scheduled over time on a 10-node system, illustrating backfill.]

Backfill Scheduling
– Requires a real time limit to be set
– A more accurate (shorter) estimate gives a better chance of starting earlier
– Short jobs can move through the system more quickly
– Uses the system better by avoiding wasted cycles while jobs wait
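In practice this means requesting a walltime close to what the job actually needs, then asking the scheduler when it expects to start the job. A brief sketch; showstart is a standard Maui/Moab command, but its availability to ordinary users here is an assumption, and the job ID is a placeholder:

  qsub -l walltime=02:00:00 hello.pbs   # request a tight 2-hour limit instead of the 16-day maximum
  showstart <jobid>                     # estimated start time based on current reservations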