LSF Configuration and Basic Troubleshooting
Bruce Pullig, Solution Architect (bpullig@slb.com)

LSF Installation:

LSF 7.0.6 is installed on the head node (hpecml001) and is available from there to all compute nodes. The path to the install is:

/usr/local/lsftop

A startup script (lsf) on the individual nodes is located in /etc/init.d and is installed as a service, which allows automatic startup upon reboot. It can be controlled by running the following as root:

service lsf { start | stop | restart | status | force_reload }

Example:

[root@hpecml001 ~]# cd /etc/init.d
[root@hpecml001 init.d]# service lsf status
Show status of the LSF subsystem
lim (pid 5300) is running...
res (pid 5302) is running...
sbatchd (pid 5304) is running...
[root@hpecml001 init.d]#
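If LSF does not come back after a node reboots, a quick check (a rough sketch; this assumes the standard RHEL-style chkconfig/service tools on these nodes) is to confirm the init script is registered for automatic startup and then restart it as root on the affected node:

# confirm the lsf init script is enabled at the default runlevels
chkconfig --list lsf
# restart the LSF daemons (lim, res, sbatchd) on this node, then re-check
service lsf restart
service lsf status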

LSF Queues:

Qatar:
  normal                  Parallel jobs
  priority-normal         Parallel jobs, higher priority
  serial                  Serial jobs
  priority-serial         Serial jobs, higher priority

EMDC (emdc_group):
  emdc_normal             Parallel jobs
  emdc_priority-normal    Parallel jobs, higher priority
  emdc_serial             Serial jobs
  emdc_priority-serial    Serial jobs, higher priority

URC (urc_group):
  urc_normal              Parallel jobs
  urc_priority-normal     Parallel jobs, higher priority
  urc_serial              Serial jobs
  urc_priority-serial     Serial jobs, higher priority
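For reference, jobs are directed to one of these queues with the -q option of bsub; the script name and slot count below are placeholders, not site defaults:

# submit an 8-slot parallel job to the higher-priority Qatar queue (my_run.sh is a placeholder)
bsub -q priority-normal -n 8 -o run.%J.out ./my_run.sh
# submit a serial job to the EMDC serial queue
bsub -q emdc_serial -o run.%J.out ./my_run.sh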

LSF Queue Configuration File: /usr/local/lsftop/conf/lsbatch/hpecml001/configdir/lsb.queues
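Each queue is defined in that file as a Begin Queue / End Queue stanza. The values below are illustrative only, not the site's actual settings:

Begin Queue
QUEUE_NAME   = priority-normal
PRIORITY     = 43                  # higher numbers are scheduled ahead of lower-priority queues
USERS        = all                 # or a user group such as emdc_group / urc_group
DESCRIPTION  = Higher-priority parallel jobs
End Queue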

LSF Queue Configuration File:

To make queue file changes take effect, run the following on the head node (hpecml001) as lsfadmin, ecl, or root:

badmin reconfig
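It is also worth validating the edited file before applying it. As a sketch (run as lsfadmin or root on hpecml001):

# check the lsbatch configuration files for syntax and consistency errors
badmin ckconfig
# apply the new queue configuration; running jobs are not affected
badmin reconfig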

LSF

Monitoring compute nodes: bhosts

HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
hpecml001  closed  -     8    0      0    0      0      0
hpecnl001  ok      -     8    0      0    0      0      0
hpecnl002  ok      -     8    0      0    0      0      0
hpecnl003  ok      -     8    0      0    0      0      0
hpecnl004  ok      -     8    0      0    0      0      0
hpecnl005  ok      -     8    0      0    0      0      0
hpecnl006  ok      -     8    0      0    0      0      0
*********************SNIP************************
hpecnl024  ok      -     8    0      0    0      0      0

Monitoring queues/jobs: bjobs -a -u all

1290  msaxena  DONE  Eclipse  hpecml001  hpecnl001    *e_THP5020  Mar 22 17:38
1291  dzajac   DONE  Eclipse  hpecml001  hpecnl004    *PERM_HM_2  Mar 22 17:46
      msaxena  DONE  Eclipse  hpecml001  hpecnl008    *PERM_HM_2  Mar 22 17:52
      pulligb  RUN   Eclipse  hpecml001  8*hpecnl004  PULLIG9     Mar 23 08:43
                                         8*hpecnl005
                                         8*hpecnl006
                                         8*hpecnl007
                                         8*hpecnl008
                                         8*hpecnl009
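Queue-level load can be checked in the same way. The commands below are standard LSF and the queue name is one of the queues listed earlier, though the output will of course differ on this cluster:

# summary of every queue: priority, status, and pending/running job counts
bqueues
# detailed limits and policies for a single queue
bqueues -l priority-normal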

LSF

Jobs completed over 60 minutes ago: bhist -a -u <username>

Killing jobs: bkill <jobid>

-bash-3.2$ bkill 1293
Job <1293> is being terminated
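A couple of related commands are often useful here (an illustrative sketch; <username> and the job ID are placeholders):

# full event history (submit, dispatch, done) for one job
bhist -l 1293
# kill all jobs belonging to one user; job ID 0 means all of that user's jobs
bkill -u <username> 0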

LSF

To obtain more info for troubleshooting: bjobs -l <jobID>

Job <1292>, Job Name <TIMEDEP_E300_2>, User <dzajac>, Status <DONE>, Queue <Eclipse>,
           Command <cd /data/PTCI/Eclipse/dzajac ; ./TIMEDEP_E300_2.175224>
Thu Mar 22 17:52:24: Submitted from host <hpecml001>, CWD </data/PTCI/Eclipse/dzajac>,
                     Requested Resources <select[type==any] rusage[eclipse=1:compositional=1]>;
Thu Mar 22 17:52:28: Started on <ho10>, Execution Home </home/dzajac>,
                     Execution CWD </data/PTCI/Eclipse/dzajac>;
Thu Mar 22 17:53:57: Done successfully. The CPU time used is 61.5 seconds.

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -

EXTERNAL MESSAGES:
MSG_ID  FROM    POST_TIME     MESSAGE         ATTACHMENT
0       dzajac  Mar 22 17:52  EF_SPOOLER_URI  Y

This output shows the datafile name and the location of the datafile and relevant files for debugging.
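Two other commands help while a job is still running or stuck pending (standard LSF, shown here as an illustrative sketch):

# view the stdout/stderr a running job has produced so far
bpeek <jobID>
# list pending jobs together with the reason they have not been dispatched
bjobs -p -u all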

LSF Log Files

Log files for each system are found under:

/usr/local/lsftop/log

There is a log file for each daemon that controls an LSF function. The system name is appended to the daemon name. Examples:

[root@hpecml001 log]# ls -lahrt | grep hpecml001
-rw-r--r-- 1 lsfadmin root         0 Jan 14  2011 melim.log.hpecml001
-rw-r--r-- 1 lsfadmin root       354 Oct 19  2011 pim.log.hpecml001
-rw-r--r-- 1 lsfadmin root       41K Oct 18 12:01 mbatchd.log.hpecml001
-rw-rw-rw- 1 lsfadmin root      7.0K Nov 27 12:21 res.log.hpecml001
-rw-r--r-- 1 lsfadmin root      3.1K Nov 27 12:21 sbatchd.log.hpecml001
-rw-r--r-- 1 lsfadmin root      775M Nov 27 12:21 lim.log.hpecml001
-rw-r--r-- 1 lsfadmin lsfadmin   47K Nov 27 16:46 mbschd.log.hpecml001
[root@hpecml001 log]#

[root@hpecml001 log]# ls -lahrt | grep hpecnl009
-rw-r--r-- 1 lsfadmin root    0 Jan 14  2011 pim.log.hpecnl009
-rw-rw-rw- 1 lsfadmin root 2.1K Feb 18  2011 res.log.hpecnl009
-rw-r--r-- 1 lsfadmin root  264 Feb 18  2011 sbatchd.log.hpecnl009
-rw-r--r-- 1 lsfadmin root 2.2K Feb 18  2011 lim.log.hpecnl009
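When chasing a specific problem, it is usually enough to follow the relevant daemon's log while reproducing it. A sketch, using the file names listed above:

# follow the master batch daemon log on the head node
tail -f /usr/local/lsftop/log/mbatchd.log.hpecml001
# scan a compute node's sbatchd log for errors
grep -i error /usr/local/lsftop/log/sbatchd.log.hpecnl009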

LSF Infiniband Issues

If you suspect Infiniband issues, ping another node on the ib0 device. The IP range is: x.x.x.x

If the other system doesn't respond, check the OpenSM service on the head node. Restart it if necessary by running:

service opensmd restart
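As a concrete sketch (the target address is a placeholder; substitute another node's ib0 address from the cluster's IB range):

# from one node, ping a peer over the InfiniBand interface
ping -c 3 -I ib0 <ib0-address-of-another-node>
# on the head node, check the subnet manager and restart it only if it is not running
service opensmd status
service opensmd restart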