
Saguaro User Environment Fall 2013

Outline
– Get an account
– Linux Clusters
– Initial login
– Modules
– Compiling
– Batch System
– Job Monitoring
– System & Memory Information
– Interactive mode
– Improving performance

Anti-outline (not covered today)
– Shell scripting
– Available software
– Installing software
– Debugging (parallel or otherwise)
For any assistance with these, please contact us at any time:

Get an account
– a2c2.asu.edu – “Getting started”
– Faculty and academic staff get 10K hours every year.
– Students link to a faculty account.
– For assistance installing/using software on the cluster:
– Login via ssh or PuTTY (Windows): ssh saguaro.fulton.asu.edu

Generic Cluster Architecture (diagram): PCs and servers reach the cluster over the internet; a file server and the compute nodes are connected through switches. Interconnects include Ethernet/GigE, InfiniBand, Myrinet, and Quadrics; storage is attached via FCAL or SCSI. Example system: lonestar.tacc.utexas.edu (Adv. HPC System).

Initial Login
– Login with SSH: ssh saguaro.fulton.asu.edu (this connects you to a login node)
– Do not run your programs on the login nodes
– Don’t overwrite ~/.ssh/authorized_keys
   – Feel free to add to it if you know how to use it
   – SSH is used for job start-up on the compute nodes; mistakes can prevent your jobs from running
– For X forwarding, use ssh -X

Packages (saguaro)
Modules are used to set up your PATH and other environment variables, and to set up environments for packages & compilers.
saguaro% module                    {lists options}
saguaro% module avail              {lists available packages}
saguaro% module load <package>     {add one or more packages}
saguaro% module unload <package>   {unload a package}
saguaro% module list               {lists loaded packages}
saguaro% module purge              {unloads all packages}
Multiple compiler families are available, so make sure you are consistent between libraries, source, and run script!
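
A typical session might look like the following (a minimal sketch; the module names intel and gcc/4.4.4 are the ones used on the compile slide below and may differ on your system):

saguaro% module avail          # see what is installed
saguaro% module load intel     # pick a compiler family
saguaro% module list           # confirm what is loaded
saguaro% module unload intel   # switch families cleanly
saguaro% module load gcc/4.4.4
saguaro% module purge          # start again from a clean environment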

MPP Platforms (diagram: processors, each with its own local memory, connected by an interconnect)
Clusters are distributed-memory platforms: each processor has its own memory. Use MPI on these systems.

SMP Platforms (diagram: processors with equal access to shared memory banks through a memory interface)
The Saguaro nodes are shared-memory platforms: each processor has equal access to a common pool of shared memory. Saguaro queues have 8, 12, and 16 cores per node; one node has 32 cores. A loop such as
for (i = 0; i < N; i++) a[i] = b[i] + s*c[i];
can be shared among the cores of a single node.

Compile (saguaro)
After setting up the environment (via the module command), compile; different compilers unfortunately mean different flags:
saguaro% module load intel
saguaro% icc -openmp -o program program.c
saguaro% module purge
saguaro% module load gcc/4.4.4
saguaro% gcc -fopenmp -o program program.c
saguaro% gfortran -fopenmp -o program program.f
Multiple compiler families are available, so make sure you are consistent between libraries, source, and run script!
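
To keep the toolchain consistent end to end, one possible pattern (a sketch: hello_mpi.c is a hypothetical source file, and mpicc is assumed to be the MPI compiler wrapper provided by the openmpi module) is:

saguaro% module load openmpi/1.7.2-intel-13.0   # MPI library built against a specific compiler
saguaro% mpicc -o hello_mpi hello_mpi.c         # the wrapper picks up that same compiler
# The job script that runs the binary should load the same module:
#   module load openmpi/1.7.2-intel-13.0
#   mpiexec ./hello_mpi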

Batch Submission Process (diagram)
– Submission: qsub job (from a login node, over the network to the server)
– Queue: the job script waits for resources on the server
– Master: the compute node that executes the job script and launches ALL MPI processes (mpirun)
– Launch: each compute node is contacted to start the executable (e.g. mpiexec a.out)

Torque, Moab, and Gold
– Torque – the resource manager (qstat)
– Moab – the scheduler (checkjob)
– Gold – the accounting system (mybalance)
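
In day-to-day use that means one command per layer (a minimal sketch; 123456 is a hypothetical job id):

saguaro% qstat -u $USER       # Torque: list my jobs and their states
saguaro% checkjob -v 123456   # Moab: why is job 123456 still waiting?
saguaro% mybalance            # Gold: how many hours are left on my account?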

Batch Systems
– Saguaro systems use Torque for batch queuing and Moab for scheduling.
– Batch jobs are submitted on the front end and are subsequently executed on compute nodes as resources become available: “qsub scriptname”. You are charged only for the time your job runs.
– Interactive jobs are for keyboard input/GUIs; you will be logged into a compute node: “qsub -I”. You are charged for every second connected.
– Order of job execution depends on a variety of parameters:
   – Submission time
   – Backfill opportunities: small jobs may be back-filled while waiting for bigger jobs to complete
   – Fairshare priority: users who have recently used a lot of compute resources will have a lower priority than those who are submitting new jobs
   – Advanced reservations: jobs may be blocked in order to accommodate advanced reservations (for example, during maintenance windows)
   – Number of actively scheduled jobs: there are limits on the maximum number of concurrent processors used by each user
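
As a sketch of the two submission styles (job.pbs and the resource request are hypothetical; the options follow the TORQUE forms shown on these slides):

# Batch: submit a script and get the prompt back immediately
saguaro% qsub job.pbs

# Interactive: request one 8-core node for an hour and wait for a shell on it
saguaro% qsub -I -X -l nodes=1:ppn=8,walltime=01:00:00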

Commonly Used TORQUE Commands
qsub     Submit a job
qstat    Check on the status of jobs
qdel     Delete running and queued jobs
qhold    Suspend jobs
qrls     Resume jobs
qalter   Change job information
man pages are available for all of these commands.
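
Typical usage of these commands (a sketch; 123456 is a hypothetical job id and job.pbs a hypothetical script):

saguaro% qsub job.pbs                         # prints the new job id
saguaro% qstat 123456                         # Q = queued, R = running
saguaro% qhold 123456                         # hold a queued job
saguaro% qrls 123456                          # release the hold
saguaro% qalter -l walltime=02:00:00 123456   # change the requested walltime
saguaro% qdel 123456                          # delete it altogether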

TORQUE Batch System Environment Variables
Variable         Purpose
PBS_JOBID        Batch job id
PBS_JOBNAME      User-assigned (-N) name of the job
PBS_TASKNUM      Number of slots/processes for a parallel job
PBS_QUEUE        Name of the queue the job is running in
PBS_NODEFILE     Filename containing the list of nodes
PBS_O_WORKDIR    Job working directory
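
A quick way to see these variables is a trivial job script that just prints them (a sketch; the job name and resource request are arbitrary):

#!/bin/bash
#PBS -N probe
#PBS -l nodes=1:ppn=1,walltime=00:01:00
#PBS -j oe
echo "Job id:         $PBS_JOBID"
echo "Job name:       $PBS_JOBNAME"
echo "Queue:          $PBS_QUEUE"
echo "Submitted from: $PBS_O_WORKDIR"
echo "Assigned nodes:"
cat "$PBS_NODEFILE"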

Batch System Concerns
– Submission (need to know):
   – Required Resources
   – Run-time Environment
   – Directory of Submission
   – Directory of Execution
   – Files for stdout/stderr Return
   – Notification
– Job Monitoring
– Job Deletion:
   – Queued Jobs
   – Running Jobs

TORQUE: OpenMP Job Script
#!/bin/bash
#PBS -l nodes=1:ppn=8        # number of nodes and cores per node
#PBS -N hello                # job name
#PBS -j oe                   # join stdout and stderr
#PBS -o $PBS_JOBID           # output file name
#PBS -l walltime=00:15:00    # max run time (15 minutes)

# Execution commands
module load openmpi/1.7.2-intel-13.0
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
./hello
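
To use it, save the script under some name (hello_omp.pbs here is hypothetical) and submit it with qsub; the job id it prints is what you pass to qstat and friends:

saguaro% qsub hello_omp.pbs   # prints the job id
saguaro% qstat <job_id>       # watch it move from Q (queued) to R (running)
saguaro% ls                   # the output file appears here once the job ends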

Batch Script Debugging Suggestions
– Echo issued commands: “set -x” (bash) or “set echo” (csh).
– Avoid absolute pathnames: use relative path names or environment variables ($HOME, $PBS_O_WORKDIR).
– Abort the job when a critical command fails.
– Print the environment: include the "env" command if your batch job doesn't execute the same as in an interactive session.
– Use the “./” prefix for executing commands in the current directory: the dot means to look for commands in the present working directory, and not all systems include "." in your $PATH variable (usage: ./a.out).
– Track your CPU time.
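
Putting those suggestions together, a defensively written script might look like this (a sketch; myprog and input.dat are hypothetical):

#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=00:10:00
#PBS -j oe
set -x                              # echo each command as it runs
env                                 # record the batch environment
cd $PBS_O_WORKDIR || exit 1         # abort if the working directory is missing
time ./myprog input.dat || exit 1   # "./" prefix, CPU-time tracking, abort on failure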

Job Monitoring (showq utility)
saguaro% showq
ACTIVE JOBS
JOBID  JOBNAME      USERNAME  STATE    PROC  REMAINING  STARTTIME
…      _90_96x6     vmcalo    Running  64    18:09:19   Fri Jan 9 10:43:…
…      naf          phaa406   Running  16    17:51:15   Fri Jan 9 10:25:…
…      N            phaa406   Running  16    18:19:12   Fri Jan 9 10:53:46
23 Active jobs   504 of 556 Processors Active (90.65%)
IDLE JOBS
JOBID  JOBNAME      USERNAME  STATE  PROC  WCLIMIT   QUEUETIME
…      poroe8       xgai      Idle   …     …:00:00   Thu Jan 8 10:17:…
…      meshconv019  bbarth    Idle   16    24:00:00  Fri Jan 9 16:24:18
3 Idle jobs
BLOCKED JOBS
JOBID  JOBNAME      USERNAME  STATE     PROC  WCLIMIT   QUEUETIME
…      _90_96x6     vmcalo    Deferred  64    24:00:00  Thu Jan 8 18:09:…
…      _90_96x6     vmcalo    Deferred  64    24:00:00  Thu Jan 8 18:09:11
17 Blocked jobs
Total Jobs: 43   Active Jobs: 23   Idle Jobs: 3   Blocked Jobs: 17

Job Monitoring (qstat command)
saguaro$ qstat -s a
job-ID  prior  name    user      state  submit/start at  queue  slots
…       …      NAMD    xxxxxxxx  r      01/09/… …:13:…   …      …
…       …      tf7M.8  xxxxxxxx  r      01/09/… …:36:…   …      …
…       …      f7aM.7  xxxxxxxx  r      01/09/… …:33:…   …      …
…       …      ch.r32  xxxxxxxx  r      01/09/… …:56:…   …      …
…       …      NAMD    xxxxxxxx  qw     01/09/… …:23:…   …      …
…       …      f7aM.8  xxxxxxxx  hqw    01/09/… …:03:…   …      …
…       …      tf7M.9  xxxxxxxx  hqw    01/09/… …:06:…   …      …
Basic qstat options:
– -u username   Display jobs belonging to the specified user (‘*’ for all)
– -r / -f       Display extended job information
Also check out the qmem and qpeek utilities.

TORQUE Job Manipulation/Monitoring
– To kill a running or queued job (takes ~30 seconds to complete): qdel <job_id>
   – qdel -p <job_id>   (use when qdel alone won’t delete the job)
– To suspend a queued job: qhold <job_id>
– To resume a suspended job: qrls <job_id>
– To alter job information in the queue: qalter <job_id>
– To see more information on why a job is pending: checkjob -v <job_id>   (Moab)
– To see a historical summary of a job: qstat -f <job_id>   (TORQUE)
– To see available resources: showbf   (Moab)

System Information
– Saguaro is a ~5K-processor Dell Linux cluster:
   – >45 trillion floating point operations per second
   – >9 TB aggregate RAM
   – >400 TB aggregate disk
   – InfiniBand interconnect
   – Located in GWC 167
– Saguaro is a true *cluster* architecture:
   – Most nodes have 8 processors and 16 GB of RAM.
   – Programs needing more resources *must* use parallel programming.
   – Normal, single-processor applications *do not go faster* on Saguaro.

Saguaro Summary
– Compute Nodes: Dell M600, M610, M620, M710HD, R900, and other Dell R-series servers; “Harpertown” nodes, 64 “Nehalem” nodes, 32 “Westmere” nodes, 32 “WestmereEP” nodes, 11 “Tigerton” nodes, and 1 SMP node with 32 procs and 1 TB RAM. Characteristics: 2.00/2.93 GHz, 8 MB/20 MB cache, 16/96 GB memory per node*, 8/16 cores per node*.
– SCRATCH File System / I/O Nodes: Dell filers, DataDirect storage, 6 I/O nodes. Characteristics: Lustre file system, 415 TB.
– Login and Service Nodes: 3 login & 2 datamover nodes plus assorted service nodes. Characteristics: 2.0 GHz with 8 GB memory; 2.67 GHz with 148 GB.
– Interconnect (MPI): InfiniBand, 24-port leaf switches, FDR and DDR core switches. Characteristics: DDR to FDR fat-tree topology.
– Ethernet (1GE/10GE): OpenFlow and SDN switching. Characteristics: 10G/E to campus research ring.

Available File Systems (diagram): Scratch, Home, Archive, and Data file systems, served over NFS and Lustre from NetApp, DDN, and Compellent storage, and mounted variously on the compute nodes, all nodes, or the login nodes and desktop systems (see the table on the next slide).

File System Access & Lifetime Table
Mount point          User Access Limit   Lifetime
/home (NetApp)       50 GB quota         Project
/scratch (DDN)       no quota            30 days
/archive (NetApp)    Extendable          5 years
/data (Compellent)   Extendable
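
A common pattern given these limits (a sketch: the per-user directories /scratch/$USER and /archive/$USER and the file names are assumptions, not documented paths) is to keep code in /home, run in /scratch, and copy keepers to /archive:

saguaro% cd /scratch/$USER                # no quota, but purged after 30 days
saguaro% mkdir run1 && cd run1
saguaro% cp ~/project/bigjob.pbs .        # sources and scripts live in /home (50 GB quota)
saguaro% qsub bigjob.pbs
saguaro% cp results.dat /archive/$USER/   # archive anything worth keeping for 5 years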

Interactive Mode
Useful commands:
– qsub -I -l nodes=2   Interactive mode, reserving 2 processors
– qsub -I -X           Interactive mode with X forwarding
– screen               Detach interactive jobs (reattach with -r)
– watch qstat          Watch job status
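
A typical interactive session combining these (a sketch; the resource request and program name are hypothetical):

saguaro% screen                           # detachable session, reattach later with screen -r
saguaro% qsub -I -X -l nodes=1:ppn=8,walltime=02:00:00
# ...once the job starts you get a shell on a compute node:
$ module load intel
$ cd $PBS_O_WORKDIR
$ ./hello                                 # run and interact with your program
$ exit                                    # ends the interactive job and the charging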

Improving Performance
Saguaro is a heterogeneous system – know what hardware you are using:
– qstat -f
– more /proc/cpuinfo /proc/meminfo   (on a compute node)
– cat $PBS_NODEFILE
CPU          Part #    IB data rate   CPUs per node   GB per node   Node ids
WestmereEP   5670      QDR            12              96            s63
SandyBridge  E5-2650   FDR            16              64            s61-s62
Harpertown   5440      DDR            8               16            s23-s42
Tigerton     7340      DDR            16              64            fn1-fn12*
Nehalem      5570      DDR            8               24            s54-s57
Westmere     5650      Serial         12              48            s58-s59

TORQUE: OpenMP Job Script II
#!/bin/bash
#PBS -l nodes=1:ppn=8:nehalem   # number of cores and node architecture
#PBS -l walltime=00:15:00       # walltime
#PBS -m abe                     # send mail when the job aborts/begins/ends
#PBS -M <email address>         # address to send the mail to

module load intel
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
export OMP_DYNAMIC=TRUE
time ./hello