HPCC New User Training Getting Started on HPCC Resources


1 HPCC New User Training: Getting Started on HPCC Resources
Eric Rees & Hui (Amy) Wang
High Performance Computing Center
November 2017

2 HPCC User Training Agenda
- Resources and Policy Overview
- Logging In and Using the Clusters
- Transferring Data
- Installing Software
- Getting Help

3 Resources and Policy Overview

4 Current HPCC Resources
Quanah Cluster
- Commissioned in 2017, running CentOS 7.3
- 467 nodes, 16,812 cores (36 cores/node)
- 87.56 TB total RAM (192 GB/node)
- Xeon E5-2695v4 Broadwell processors
- Omni-Path (100 Gbps) fabric
- Benchmarked at 485 teraflops
Lustre Storage
- High-speed parallel file system with 2.4 PB of storage space
- /home - 57 TB
- /lustre/work - TB
- /lustre/scratch - PB
- /lustre/hep - 850 TB (restricted access)
Note: A teraflop (Tflop) is a unit of computing speed equal to one million million (10^12) floating-point operations per second. 1 PB = 1024 TB, so 2.4 PB is about 2,457 TB.

5 Current HPCC Resources
Hrothgar Cluster (West)
- Commissioned in 2011, running CentOS 6.9
- 301 nodes, 3,612 cores (12 cores/node)
- 7.05 TB total RAM (24 GB/node)
- Xeon X5660 Westmere processors
- DDR (20 Gbps) InfiniBand fabric
Hrothgar Cluster (Ivy)
- Commissioned in 2014, running CentOS 6.9
- 122 nodes, 1,920 cores (20 cores/node)
- 7.62 TB total RAM (64 GB/node)
- Xeon E5-2670v2 Ivy Bridge processors
- QDR (40 Gbps) InfiniBand fabric

6 File System and Quota Limits
Users have access to three storage areas:
- Home (/home/<eraider>): 150 GB user quota, backed up nightly, never purged
- Work (/lustre/work/<eraider>): 700 GB user quota, never backed up nor purged
- Scratch (/lustre/scratch/<eraider>): no user quota, never backed up, purged monthly

Area                        Quota     Backup    Purged
/home/<eraider>             150 GB    Yes       No
/lustre/work/<eraider>      700 GB    No        No
/lustre/scratch/<eraider>   None      No        Monthly

Out of quota: Account locked!
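A quick way to check how close you are to these limits from a login node is sketched below. This is a minimal sketch: it assumes the standard Lustre client tool (lfs) is available and that the paths match the areas above; the exact quota-reporting commands on HPCC systems may differ.
lfs quota -u $USER /lustre/work      # your usage and quota on the work area
lfs quota -u $USER /lustre/scratch   # scratch usage (no quota, but purged monthly)
du -sh $HOME                         # approximate size of your home directory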

7 HPCC Policies
Login nodes (Quanah, Hrothgar and Ivy)
- No jobs are allowed to run on the login nodes.
SSH Access
- No direct SSH access is allowed to the compute nodes.
Software Installation Requests
- Software requests are handled on a case-by-case basis.
- Requesting software does not guarantee it will be installed "cluster-wide".
- Requests may take two or more weeks to complete.

8 HPCC Policies: Scratch Purge Policy
Scratch will be purged monthly.
- All files not accessed within the past year are removed automatically.
- Each purge aims to bring scratch usage below 65%; reaching that target may require removing files that were accessed within the past year.
- Users may not run "touch" or similar commands for the purpose of circumventing this policy. Doing so can result in a ban.

9 Logging In and Using the Clusters

10 Getting Started
Account Request
- Faculty/Staff account
- Student account
- Research Partner account
User Guides
- The best place to get detailed information about using the clusters.
Queue Status
- A current view of the pending and running jobs on each cluster.

11 Logging into HPCC Resources
User Guide:
On or Off Campus?
- On campus: the wired TTU network and the TTUnet wireless network
- Off campus: any other network connection, including TTUHSC networks and the TTUguest wireless network
Logging in from off campus:
- Log in via the SSH gateway, or
- Establish a VPN
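Once you are on an allowed network, a basic login session looks like the sketch below. The quanah.hpcc.ttu.edu hostname comes from the X Windows example later in this training; <eraider> stands for your eRaider username.
ssh <eraider>@quanah.hpcc.ttu.edu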

12 Environment Settings: User Environments (Post-Upgrade)
.bashrc
- A bash script that runs whenever you start an interactive login; often used to set up user environments.
- Your old .bashrc file has been replaced with a new one. The original can be found at ~/.bashrc-old.
- You may add to your .bashrc file, but do not remove any default settings!
SoftENV (Hrothgar)
- SoftENV has been replaced with Modules.
- Your .soft file has been kept so you can reference it (~/.soft).
Modules
- The primary way to change your user environment; running on both Quanah and Hrothgar.
Note: When editing your .bashrc (especially when copying from ~/.bashrc-old), make sure every command works before adding it. A bad .bashrc can prevent you from logging in or running jobs!
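A minimal way to test a .bashrc change before relying on it (a sketch using only standard bash; the backup file name is your own choice):
cp ~/.bashrc ~/.bashrc.backup        # keep a known-good copy before editing
bash -i -c 'echo bashrc OK'          # start a fresh interactive shell; errors printed here come from your edits
# If something breaks, restore the backup:
# cp ~/.bashrc.backup ~/.bashrc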

13 Environment Settings: Using Modules
User Guide:
Modules commands:
- module avail : list available modules (this list grows as more modules are added)
- module list : list the modules you currently have loaded
- module load <module_name> : load a module
- module unload <module_name> : unload a module
- module spider <keyword> : find a module and the prerequisites it needs
- module purge : unload all modules you have loaded
- module swap <loaded_module> <new_module> : swap a module and its dependencies
- module help : print a help menu
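A short example session using these commands (the NWChem module name comes from the example on the next slide; the modules actually installed will differ, so check module avail first):
module purge                     # start from a clean environment
module avail                     # see what is available
module spider nwchem             # find the module and any prerequisites
module load nwchem/6.6-intel     # load it with the version pinned, as recommended below
module list                      # confirm what is loaded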

14 Environment Settings: Modules Tips and Recommendations
Keep all of your module load commands as part of your job submission scripts.
- Makes debugging and changing between experiments easier.
- Prevents collisions or accidentally running jobs with the wrong environment.
- Gives you and your collaborators a way of tracking the exact software and versions used in any given experiment.
Always include the version number of a module in the module load command.
- Makes version tracking easier and prevents unanticipated version changes during an experiment. We may add a new version of your software while you are running, so one job could run on one version and the next job on a newer version.
- Knowing which experiments and jobs used which software versions also makes writing your research paper's "methods" section easier.
Example: use module load nwchem/6.6-intel instead of just module load nwchem

15 Submitting Jobs: Submission Script Layout
User Guide:
A submission script passes the following options to the scheduler:
- -V : keep the current environment settings
- -cwd : start the commands from the current directory
- -S /bin/bash : use /bin/bash as the shell for the batch session
- -N <job name> : set the name of the job; it can be referenced in the script as $JOB_NAME
- -o $JOB_NAME.o$JOB_ID : name of the standard output file
- -e $JOB_NAME.e$JOB_ID : name of the standard error file
- -q <queue name> : use the queue defined by <queue name>
- -pe <parallel environment> <#> : use the parallel environment defined by <parallel environment> with the given number of slots
- -P <cluster name> : use the project/cluster defined by <cluster name>
A minimal serial sketch follows below.
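A minimal single-core sketch using the options above (the omni queue, sm parallel environment, and quanah project come from the examples elsewhere in this training; the job name and ./my_program executable are hypothetical):
#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N Serial_Test_Job
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q omni
#$ -pe sm 1
#$ -P quanah
./my_program
Submit it with qsub <script name>; full MPI examples for Quanah and Hrothgar appear on slide 17.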

16 Submitting Jobs: Parallel Environments
Parallel Environments on Quanah (omni queue)
- -pe mpi : must request slots in increments of 36. Reserves entire nodes for the job, which is why increments of 36 are required. All MPI jobs should use this PE.
- -pe fill : can request any number of job slots. Slots are reserved in a fill pattern: one node is filled, then the request "spills over" onto the next node, so your cores are NOT guaranteed to be on the same node. This PE should rarely be used, or only for MPI jobs with small memory requirements.
- -pe sm : can request between 1 and 36 slots. All requested slots are guaranteed to be on the same node.
Parallel Environments on Hrothgar
- West queue: -pe west : must request in increments of 12. Works only on the "west" queue and reserves entire nodes.
- Ivy queue: -pe ivy : must request in increments of 20. Works only on the "ivy" queue and reserves entire nodes.
- Community clusters: -pe mpi and -pe fill can request any number of job slots; -pe sm can request between 1 and the maximum number of cores per node.

17 Submitting Jobs: Example Scripts
User Guide:
Quanah example (omni queue, mpi parallel environment, 36 cores):
#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N MPI_Test_Job
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q omni
#$ -pe mpi 36
#$ -P quanah
module load intel impi
mpirun --machinefile machinefile.$JOB_ID -np $NSLOTS ./mpi_hello_world

Hrothgar example (west queue, west parallel environment, 12 cores):
#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N MPI_Test_Job
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q west
#$ -pe west 12
#$ -P hrothgar
module load intel impi
mpirun --machinefile machinefile.$JOB_ID -np $NSLOTS ./mpi_hello_world

The Quanah script sets the name of the job (-N) to "MPI_Test_Job", requests the "omni" queue (-q), requests 36 cores using the "mpi" parallel environment (-pe), and requests that the job be scheduled on the "quanah" cluster (-P). The "module load intel impi" line executes before the MPI program runs, so the correct compiler and MPI implementation are loaded prior to the mpirun command. For more information on how modules work, please review our "Software Environment Setup" guide.
Submit the job with: qsub <job submission script>

18 Submitting Jobs: Submitting and Checking Jobs
User Guide:
Submit a job:
qsub <job submission script>
Check job status:
qstat
Output columns: job-ID, prior, name, user, state, submit/start at, queue, jclass, slots, ja-task-ID
Common job states: "r" = running, "qw" = waiting in the queue, "E" = error
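A few related commands for checking and managing jobs (a brief sketch; the job ID 123456 is hypothetical):
qstat                    # list pending and running jobs (slide 18 above lists the output columns)
qstat -u $USER           # limit the listing to your own jobs explicitly
qstat -j 123456          # detailed information about a single job
qdel 123456              # delete (cancel) a job you no longer want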

19 Interactive Login to a Node
Start an interactive job using QLOGIN:
qlogin -P <cluster> -q <queue> -pe <pe> <#>
Examples:
qlogin -P quanah -q omni -pe sm 1
qlogin -P hrothgar -q west -pe west 12
qlogin -P hrothgar -q ivy -pe ivy 20
While using QLOGIN:
- Make sure the prompt changes to compute-#-#. If it does not change, check your qlogin command and try again.
- Run "exit" when you are finished.
- Keep in mind: runtime limits apply to qlogin!

20 Check Cluster Usage/Status
Viewing the current queue status and running/pending jobs:
- For a queue overview, run the command: qstat -g c
- Visit the queue status webpage (updates every 2 minutes)

Job runtime limits:
Cluster    Queue                       Runtime Limit   Cores per Node   Memory per Node
Quanah     omni                        48 hours        36               192 GB
Hrothgar   ivy                         48 hours        20               64 GB
Hrothgar   west                        48 hours        12               24 GB
Hrothgar   serial                      120 hours
Hrothgar   community cluster queues    No limit        vary             vary
           (Chewie, R2D2, Yoda, ancellcc, blawzcc, caocc, dahlcc, phillipscc, tangcc, tang256cc)

21 Debugging Failed Jobs
Job output files:
- Standard output: $JOB_NAME.o$JOB_ID
- Standard error: $JOB_NAME.e$JOB_ID
When debugging:
1. Check the output files for errors. If anything is clearly wrong, start there; if the error message is vague, continue on.
2. Check the output of qacct -j <job_ID>:
   - failed : will likely be 1 if the job crashed
   - exit_status : if it is not 0, look up what that exit status means
   - maxvmem : if this is approaching the maximum memory per node, the job probably ran out of memory
   - start_time & end_time : make sure the job was not killed by the scheduler for exceeding the runtime limit
   - iow : if this number is comparatively high, the problem may have been with our network
3. Contact Support.
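For step 2, the accounting record can be pulled and filtered like this (the job ID 123456 is hypothetical):
qacct -j 123456                                                               # full accounting record for the finished job
qacct -j 123456 | egrep 'failed|exit_status|maxvmem|start_time|end_time|iow'  # just the fields listed above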

22 X Windows
Interactive GUI using Linux/Mac:
- Install an X server (XQuartz).
- Add the following options to your normal ssh command: -nNT -L 2222:<cluster>.hpcc.ttu.edu:22
  Example: ssh -nNT -L 2222:quanah.hpcc.ttu.edu:22
- Run a test command like xclock.
Interactive GUI using Windows:
- Install an X server (Xming).
- Start the X server application.
- Turn on X11 forwarding in your SSH client of choice.
- Log into the cluster.
- Run a test command like xclock.
MATLAB interactive job example (from Quanah):
qlogin -q omni -P quanah
module load matlab
matlab

23 Transferring Data

24 Transferring Data: Transfer Files Using Globus Connect
User Guide:
Whenever possible, refrain from using scp, sftp, rsync, or any other data transfer tool.
Why use Globus?
- The Globus Connect service is well connected to the campus network.
- The data transfer nodes are better positioned for transferring user data.
- The Globus Connect service takes the data transfer load off the cluster login nodes.
- Globus Connect works with Linux, Mac, and Windows and is controlled through a web GUI.
- Numerous other sites (including TACC) support Globus Connect data transfers.

25 Software Installations

26 Software Installations (Examples)
Local R package installation (into your home folder):
module load intel R
R
install.packages('<package name>')
Example: install.packages('readr')
Select a mirror, and you are done. The R application will ask if you want to install packages locally the first time you do this.
Local Python package installation (into your home folder):
module load intel python
pip install --user <package name>
Example: pip install --user matplotlib
Done!
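A quick way to confirm that a local install worked (a small sketch; the package names follow the examples above):
module load intel python
python -c "import matplotlib; print(matplotlib.__version__)"    # succeeds only if the --user install is visible
module load intel R
Rscript -e 'packageVersion("readr")'                             # prints the version if the package installed into your home library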

27 Software Installations
Frustrated with compiling?
- For software that you know does not need to be system-wide but that you want help getting installed, contact us.
Need your software accessible to other users, or believe the software would help the general community?
- Submit a software request!
Software Installation Requests
- Software requests are handled on a case-by-case basis.
- Requesting software does not guarantee it will be installed "cluster-wide".
- Requests may take two or more weeks to complete.

28 Getting Help

29 Getting Help: Best Ways to Get Help
Visit our website: www.hpcc.ttu.edu
- Most user guides have been updated.
- New user guides are being added.
- User guides are being updated with the current changes to Hrothgar.
Submit a support ticket
- Send an email.
Visit the HPCC Help Desk
- Located in ESB room 141
- Open from 9 am to 4 pm daily
