Introduction to TAMNUN server and basics of PBS usage
Yulia Halupovich, CIS, Core Systems Group

TAMNUN LINKS
- Registration:
- Documentation and Manuals:
- Help Pages & Important Documents (accessible from the external network)

Tamnun Cluster inventory – system
- Login node (Intel 2 E GB): user login, PBS, compilations, YP master
- Admin node (Intel 2 E GB): SMC
- NAS node (NFS, CIFS) (Intel 2 E5620):
  - 1st enclosure: 60 slots, 60 x 1 TB drives
  - 2nd enclosure: 60 slots, 10 x 3 TB drives
- Network solution:
  - 14 QDR InfiniBand switches with 2:1 blocking topology
  - 4 GigE switches for the management network

Tamnun Cluster inventory – compute nodes (1)
Tamnun consists of a public cluster, available to general Technion users, and private sub-clusters purchased by Technion researchers.
Public cluster specifications:
- 80 compute nodes, each with two 2.40 GHz six-core Intel Xeon processors: 960 cores with 8 GB DDR3 memory per core
- 4 Graphical Processing Units (GPU): 4 servers with NVIDIA Tesla M2090 GPU Computing Modules, 512 CUDA cores
- Storage: 36 nodes with 500 GB and 52 nodes with 1 TB SATA drives, 4 nodes with fast 1200 GB SAS drives; raw NAS storage capacity is 50 TB

Tamnun Cluster inventory – compute nodes (2)
- Nodes n001 – n028: RBNI (public)
- Nodes n029 – n080: Minerva (public)
- Nodes n097 – n100: “Gaussian” nodes with a large, fast drive (public)
- Nodes gn001 – gn004: GPU (public)
- Nodes gn005 – gn007: GPU (private nodes of Hagai Perets)
- Nodes n081 – n096, sn001: private cluster (Dan Mordehai)
- Nodes n101 – n108: private cluster (Oded Amir)
- Nodes n109 – n172, n217 – n232: private cluster (Steven Frankel)
- Nodes n173 – n180: private cluster (Omri Barak)
- Nodes n181 – n184: private cluster (Rimon Arieli)
- Nodes n185 – n192: private cluster (Shmuel Osovski)
- Nodes n193 – n216: private cluster (Maytal Caspary)
- Nodes n233 – n240: private cluster (Joan Adler)
- Nodes n241 – n244: private cluster (Ronen Talmon)
- Node sn002: private node (Fabian Glaser)

TAMNUN connection - general guidelines
1. Connection via server TX and GoGlobal (also from abroad): ndows.txt
2. Interactive usage: compiling, debugging and tuning only!
3. Interactive CPU time limit = 1 hour
4. Tamnun login node: use it to submit jobs via the PBS batch queues to the compute nodes (see the next pages on PBS)
5. Default quota = 50 GB; check it with: quota -vs username
6. Secure file transfer: outside the Technion, scp to TX and WinSCP to/from a PC; inside the Technion, use WinSCP (see the sketch after this list)
7. Dropbox usage is not allowed on Tamnun!
8. No automatic data backup is available on Tamnun, see
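As a sketch of items 5 and 6 above, copying a file from a machine outside the Technion to the TX server with scp, and checking the quota on the Tamnun login node, might look like this (the host name tx.technion.ac.il and the file names are assumptions for illustration, not taken from the slide):

> scp results.tar.gz username@tx.technion.ac.il:~/     (run on the machine outside the Technion)
> quota -vs username                                   (run on the Tamnun login node)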

Portable Batch System – Definition and 3 Primary Roles
Definition: PBS is a distributed workload management system. It handles the management and monitoring of the computational workload on a set of computers.
Queuing: users submit tasks or “jobs” to the resource management system, where they are queued until the system is ready to run them.
Scheduling: the process of selecting which jobs to run, when, and where, according to a predetermined policy. It aims to balance competing needs and goals on the system(s) so as to maximize efficient use of resources.
Monitoring: tracking and reserving system resources and enforcing usage policy. This includes both software enforcement of usage limits and user or administrator monitoring of scheduling policies.

Important PBS Links on Tamnun
- PBS User Guide
- Basic PBS Usage Instructions: help/TAMNUN_PBS_Usage.pdf
- Detailed Description of the Tamnun PBS Queues: help/TAMNUN_PBS_Queues_Description.pdf
- PBS script examples

Current Public Queues on TAMNUN

Access           | Definition                                                  | Queue Name  | Priority | Description
nano training    | ROUTING queue, max wall time 168 h                          | nano_h_p    | High     | RBNI
minerva training | ROUTING queue, max wall time 168 h                          | minerva_h_p | High     | Minerva
All users        | ROUTING queue, max wall time 24 h                           | all_l_p     | Low      | General
All users        | Thu 17:00 – Sun 08:00, max wall time 63 h, max user CPU 192 | np_weekend  | Low      | Non-prime time
All users        | Max wall time 72 h                                          | gpu_l_p     | High     | GPU
gaussian         | Max wall time 168 h, max user CPU 48, max job CPU 12        | gaussian_ld | High     | Gaussian LD
All users        | Max wall time 24 h, max user CPU 48                         | general_ld  | Low      | General LD
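A job is directed to one of these queues either on the qsub command line (introduced on the next slide) or with a #PBS -q directive inside the job script (see the script example below); a minimal sketch with an illustrative script name:

> qsub -q np_weekend myjob.pbs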

Submitting jobs to PBS: the qsub command
The qsub command is used to submit a batch job to PBS. Submitting a PBS job specifies a task, requests resources and sets job attributes, which can be defined in an executable script file.
The syntax of qsub recommended on TAMNUN:
> qsub [options] scriptfile
PBS script files (PBS shell scripts, see the next page) should be created in the user’s directory.
To obtain detailed information about the qsub options, please use the command:
> man qsub
Job identifier (JOB_ID): upon successful submission of a batch job, PBS returns a job identifier in the format
> sequence_number.server_name
where the server name on Tamnun is tamnun.
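For example, a submission and the identifier PBS prints in return might look like this sketch (the script name and sequence number are illustrative):

> qsub myjob.pbs
12345.tamnun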

The PBS shell script sections
- Shell specification: #!/bin/sh
- PBS directives: used to request resources or set attributes. A directive begins with the default string “#PBS”.
- Tasks (programs or commands):
  - environment definitions
  - I/O specifications
  - executable specifications
NB! Other lines starting with # are comments.

PBS script example for multicore user code

#!/bin/sh
#PBS -N job_name
#PBS -q queue_name
#PBS -M user_email_address
#PBS -l select=1:ncpus=N:mem=Pgb
#PBS -l walltime=24:00:00
PBS_O_WORKDIR=$HOME/mydir
cd $PBS_O_WORKDIR
./program.exe > output.file

Other examples: see the PBS script examples link above (Important PBS Links on Tamnun).
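The same requests can also be given directly on the qsub command line; in PBS, command-line options take precedence over the corresponding #PBS directives. A sketch with illustrative values for N and P (12 cores, 24 GB) and an illustrative script name:

> qsub -q queue_name -l select=1:ncpus=12:mem=24gb -l walltime=24:00:00 run_multicore.pbs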

Checking the job/queue status: the qstat command
The qstat command is used to request the status of batch jobs, queues, or servers.
Detailed information: > man qstat
qstat output structure: see on Tamnun.
Useful commands:
> qstat -a                     all users in all queues (default)
> qstat -1n                    all jobs in the system, with node names
> qstat -1nu username          all of the user's jobs, with node names
> qstat -f JOB_ID              extended output for the job
> qstat -Q                     list of all queues in the system
> qstat -Qf queue_name         extended queue details
> qstat -1Gn queue_name        all jobs in the queue, with node names
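To follow your own jobs continuously, the qstat call above can be refreshed periodically with the standard watch utility (a sketch; username is a placeholder):

> watch -n 30 "qstat -1nu username"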

Removing a job from a queue: the qdel command
qdel is used to delete queued or running jobs. The job's running processes are killed. A PBS job may be deleted by its owner or by the administrator.
Detailed information: > man qdel
Useful commands:
> qdel JOB_ID                  delete a job from a queue
> qdel -W force JOB_ID         force-delete a job
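Several jobs can be removed at once by collecting their IDs with the PBS qselect utility and passing them to qdel; a sketch, assuming qselect is installed together with the other PBS client commands:

> qdel $(qselect -u username)          delete all of the user's jobs
> qdel $(qselect -u username -s Q)     delete only the user's queued (not yet running) jobs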

Checking job results and troubleshooting
- Save the JOB_ID for further inspection.
- Check the error and output files: job_name.eJOB_ID and job_name.oJOB_ID.
- Inspect the job's details (also after N days):
  > tracejob [-n N] JOB_ID
- A job in the E state still occupies resources and will be deleted.
- Running an interactive batch job:
  > qsub -I pbs_script
  The job is sent to an execution node, the PBS directives are executed, and the job then awaits the user's commands.
- Checking the job on an execution node:
  > ssh node_name
  > hostname
  > top                 (inside top, press u and enter a user name to show only that user's processes; press 1 to show per-CPU usage)
  > kill -9 PID         remove a job process from the node
  > ls -rtl /gtmp       check error, output and other files under the user's ownership
  Output can be copied from the node to the home directory.
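An interactive session along these lines, with the resources requested on the command line instead of in a script, might look like the following sketch (the queue and limits are illustrative):

> qsub -I -q all_l_p -l select=1:ncpus=1 -l walltime=01:00:00
(a shell opens on the allocated compute node once the job starts)
> hostname
> top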

Monitoring the system
pbsnodes: used to query the status of hosts.
Syntax: > pbsnodes node_name/node_list
Shows extended information on a node: resources available, resources used, queue list, busy/free status, and the list of jobs on the node.
xpbsmon &: provides a way to graphically display the various nodes that run jobs. With this utility you can see which job is running on which node, who owns the job, how many nodes are assigned to a job, the status of each node (color-coded; the colors are user-modifiable), and how many nodes are available, free, down, reserved, offline, of unknown status, in use running multiple jobs, or executing only one job.
Detailed information and tutorials: > man xpbsmon
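Two quick checks with standard options of the pbsnodes utility:

> pbsnodes -a      show the status and resources of all nodes
> pbsnodes -l      list only nodes that are down, offline or otherwise unavailable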