LSF Configuration and Basic Troubleshooting
Bruce Pullig, Solution Architect
bpullig@slb.com
LSF Installation:

LSF 7.0.6 is installed on the head node (hpecml001) and is available from there to all compute nodes. The install path is:

    /usr/local/lsftop

A startup script (lsf) on each node is located in /etc/init.d and is installed as a service, so LSF starts automatically on reboot. Control it by running the following as root:

    service lsf { start | stop | restart | status | force_reload }

Example:

    [root@hpecml001 ~]# cd /etc/init.d
    [root@hpecml001 init.d]# service lsf status
    Show status of the LSF subsystem
    lim (pid 5300) is running...
    res (pid 5302) is running...
    sbatchd (pid 5304) is running...
    [root@hpecml001 init.d]#
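The status check above can be scripted. The following is a minimal sketch, assuming the status lines keep the "<daemon> (pid N) is running..." form shown in the example; the function name check_lsf_daemons is made up for illustration and is not part of LSF.

```shell
# Reads `service lsf status` output on stdin and warns about any of the
# three daemons from the example (lim, res, sbatchd) that is not reported
# as running. Assumes lines of the form "lim (pid 5300) is running...".
check_lsf_daemons() {
    local status_text daemon missing=0
    status_text=$(cat)
    for daemon in lim res sbatchd; do
        if ! printf '%s\n' "$status_text" |
            grep -Eq "^[[:space:]]*${daemon} \(pid [0-9]+\) is running"; then
            echo "WARNING: ${daemon} does not appear to be running"
            missing=1
        fi
    done
    return "$missing"    # 0 if all three were found, 1 otherwise
}

# Usage on a node, as root:
#   service lsf status | check_lsf_daemons
```

A non-zero return code makes the function easy to call from cron or a monitoring wrapper.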
LSF Queues:

    Qatar
        normal                  Parallel Jobs
        priority-normal         Parallel Jobs - Higher Priority
        serial                  Serial Jobs
        priority-serial         Serial Jobs - Higher Priority

    EMDC (emdc_group)
        emdc_normal             Parallel Jobs
        emdc_priority-normal    Parallel Jobs - Higher Priority
        emdc_serial             Serial Jobs
        emdc_priority-serial    Serial Jobs - Higher Priority

    URC (urc_group)
        urc_normal              Parallel Jobs
        urc_priority-normal     Parallel Jobs - Higher Priority
        urc_serial              Serial Jobs
        urc_priority-serial     Serial Jobs - Higher Priority
LSF Queue Configuration File: /usr/local/lsftop/conf/lsbatch/hpecml001/configdir/lsb.queues
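To see which queues the file currently defines without opening it in an editor, a small helper can print each QUEUE_NAME entry. This is a sketch only: it assumes the standard lsb.queues layout of "Begin Queue ... End Queue" sections, and the function name list_lsf_queues is hypothetical.

```shell
# Prints the QUEUE_NAME of every "Begin Queue ... End Queue" section in an
# lsb.queues file. Defaults to the path given on this slide; pass another
# path as the first argument to inspect a copy.
list_lsf_queues() {
    local queues_file="${1:-/usr/local/lsftop/conf/lsbatch/hpecml001/configdir/lsb.queues}"
    awk '
        /^[ \t]*#/ { next }                                  # skip comment lines
        tolower($1) == "begin" && tolower($2) == "queue" { in_queue = 1; next }
        tolower($1) == "end"   && tolower($2) == "queue" { in_queue = 0; next }
        in_queue && $1 ~ /^QUEUE_NAME/ {
            line = $0
            sub(/^[^=]*=[ \t]*/, "", line)                   # drop "QUEUE_NAME ="
            sub(/[ \t]*(#.*)?$/, "", line)                   # drop trailing comment
            print line
        }
    ' "$queues_file"
}

# Usage:
#   list_lsf_queues
```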
LSF Queue Configuration File:

For queue file changes to take effect, run the following on the head node (hpecml001) as lsfadmin, ecl, or root:

    badmin reconfig
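It can be worth checking the edited configuration before reconfiguring. The sketch below runs badmin ckconfig (the standard LSF configuration check) first and only calls badmin reconfig if the check succeeds. The function name apply_queue_changes and the DRY_RUN variable are illustrative assumptions, not part of LSF.

```shell
# Checks the LSF batch configuration and reconfigures only if the check
# passes. Set DRY_RUN=1 to print the commands instead of running them.
apply_queue_changes() {
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    $run badmin ckconfig || return 1    # stop here if LSF reports errors
    $run badmin reconfig
}

# Usage on the head node, as lsfadmin, ecl, or root:
#   DRY_RUN=1 apply_queue_changes    # show what would be run
#   apply_queue_changes              # check, then reconfigure
```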
LSF

Monitoring compute nodes:

    bhosts

    HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
    hpecml001          closed          -      8      0      0      0      0      0
    hpecnl001          ok              -      8      0      0      0      0      0
    hpecnl002          ok              -      8      0      0      0      0      0
    hpecnl003          ok              -      8      0      0      0      0      0
    hpecnl004          ok              -      8      0      0      0      0      0
    hpecnl005          ok              -      8      0      0      0      0      0
    hpecnl006          ok              -      8      0      0      0      0      0
    *********************SNIP************************
    hpecnl024          ok              -      8      0      0      0      0      0

Monitoring queues/jobs:

    bjobs -a -u all

    1290   msaxena  DONE  Eclipse  hpecml001  hpecnl001    *e_THP5020  Mar 22 17:38
    1291   dzajac   DONE  Eclipse  hpecml001  hpecnl004    *PERM_HM_2  Mar 22 17:46
           msaxena  DONE  Eclipse  hpecml001  hpecnl008    *PERM_HM_2  Mar 22 17:52
           pulligb  RUN   Eclipse  hpecml001  8*hpecnl004  PULLIG9     Mar 23 08:43
                                              8*hpecnl005
                                              8*hpecnl006
                                              8*hpecnl007
                                              8*hpecnl008
                                              8*hpecnl009
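On a cluster with many nodes it is quicker to list only the hosts that are not in the ok state. The following is a small sketch that filters bhosts output on the STATUS column; the function name hosts_not_ok is made up for illustration.

```shell
# Reads `bhosts` output on stdin and prints "<host> <status>" for every
# host whose STATUS column is not "ok". The header line and any line with
# fewer than two fields are skipped.
hosts_not_ok() {
    awk '$1 != "HOST_NAME" && NF >= 2 && $2 != "ok" { print $1, $2 }'
}

# Usage:
#   bhosts | hosts_not_ok
```

With the sample output on this slide, the only host reported would be hpecml001 with status closed.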
LSF

Jobs completed over 60 minutes ago:

    bhist -a -u <username>

Killing jobs:

    bkill <jobid>

    -bash-3.2$ bkill 1293
    Job <1293> is being terminated
LSF

To obtain more info for troubleshooting:

    bjobs -l <jobID>

    Job <1292>, Job Name <TIMEDEP_E300_2>, User <dzajac>, Status <DONE>,
                         Queue <Eclipse>, Command <cd /data/PTCI/Eclipse/dzajac ;
                         ./TIMEDEP_E300_2.175224>
    Thu Mar 22 17:52:24: Submitted from host <hpecml001>, CWD </data/PTCI/Eclipse/dzajac>,
                         Requested Resources <select[type == any]
                         rusage[eclipse=1:compositional=1]>;
    Thu Mar 22 17:52:28: Started on <ho10>, Execution Home </home/dzajac>,
                         Execution CWD </data/PTCI/Eclipse/dzajac>;
    Thu Mar 22 17:53:57: Done successfully. The CPU time used is 61.5 seconds.

    SCHEDULING PARAMETERS:
               r15s   r1m  r15m   ut   pg   io   ls   it   tmp   swp   mem
    loadSched    -     -     -    -    -    -    -    -     -     -     -
    loadStop     -     -     -    -    -    -    -    -     -     -     -

    EXTERNAL MESSAGES:
    MSG_ID  FROM    POST_TIME     MESSAGE         ATTACHMENT
    0       dzajac  Mar 22 17:52  EF_SPOOLER_URI  Y

Look in this output for:

    Datafile name
    Location of datafile and relevant files for debugging
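Because bjobs -l wraps long lines, sometimes in the middle of a word, simple grep calls can miss fields. The sketch below first joins the wrapped lines and then pulls out the Status field. It assumes continuation lines begin with whitespace and that the field appears as "Status <...>", as in the output above; the function name job_status is hypothetical.

```shell
# Reads `bjobs -l <jobID>` output on stdin and prints the job status.
# Leading whitespace is stripped from every line and the lines are joined
# without a separator, so a field split across a wrap is made whole again.
job_status() {
    awk '{ sub(/^[ \t]+/, ""); printf "%s", $0 } END { print "" }' |
        sed -n 's/.*Status *<\([^>]*\)>.*/\1/p'
}

# Usage:
#   bjobs -l 1292 | job_status
```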
LSF Log Files

Log files for each system are found under:

    /usr/local/lsftop/log

There is a log file for each daemon that controls an LSF function. The system name is appended to the daemon name.

Examples:

    [root@hpecml001 log]# ls -lahrt | grep hpecml001
    -rw-r--r-- 1 lsfadmin root        0 Jan 14  2011 melim.log.hpecml001
    -rw-r--r-- 1 lsfadmin root      354 Oct 19  2011 pim.log.hpecml001
    -rw-r--r-- 1 lsfadmin root      41K Oct 18 12:01 mbatchd.log.hpecml001
    -rw-rw-rw- 1 lsfadmin root     7.0K Nov 27 12:21 res.log.hpecml001
    -rw-r--r-- 1 lsfadmin root     3.1K Nov 27 12:21 sbatchd.log.hpecml001
    -rw-r--r-- 1 lsfadmin root     775M Nov 27 12:21 lim.log.hpecml001
    -rw-r--r-- 1 lsfadmin lsfadmin  47K Nov 27 16:46 mbschd.log.hpecml001
    [root@hpecml001 log]#

    [root@hpecml001 log]# ls -lahrt | grep hpecnl009
    -rw-r--r-- 1 lsfadmin root        0 Jan 14  2011 pim.log.hpecnl009
    -rw-rw-rw- 1 lsfadmin root     2.1K Feb 18  2011 res.log.hpecnl009
    -rw-r--r-- 1 lsfadmin root      264 Feb 18  2011 sbatchd.log.hpecnl009
    -rw-r--r-- 1 lsfadmin root     2.2K Feb 18  2011 lim.log.hpecnl009
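Daemon logs can grow large (the lim log in the listing above is 775M), so it may help to list oversized logs periodically. This is a minimal sketch: the function name large_lsf_logs and the 100 MB default threshold are arbitrary choices for illustration, and the -maxdepth and k-suffixed -size options assume GNU find.

```shell
# Lists daemon log files in the LSF log directory that are larger than a
# given size in kilobytes. Defaults: the log path from this slide and
# 102400 KB (100 MB), which is only an example threshold.
large_lsf_logs() {
    local log_dir="${1:-/usr/local/lsftop/log}"
    local min_kb="${2:-102400}"
    find "$log_dir" -maxdepth 1 -type f -name '*.log.*' -size +"${min_kb}"k -print
}

# Usage:
#   large_lsf_logs                              # logs over 100 MB
#   large_lsf_logs /usr/local/lsftop/log 10240  # logs over 10 MB
```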
LSF Infiniband Issues

If you suspect Infiniband issues, ping another node over the ib0 device. The IP range is: x.x.x.x

If the other system doesn't respond, check the OpenSM service on the head node. Restart it if necessary by running:

    service opensmd restart
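The ping test can be wrapped in a small helper that forces traffic out of ib0 and prints the next step when there is no answer. This is a sketch under stated assumptions: it uses the Linux ping options -c (count), -W (timeout in seconds) and -I (interface), the function name check_ib_peer is hypothetical, and the peer address must be supplied by the caller because the IP range is not given here.

```shell
# Pings another node's ib0 address through the local ib0 device and
# reports the result. It deliberately does not restart anything itself.
check_ib_peer() {
    local peer_ip="$1"
    if [ -z "$peer_ip" ]; then
        echo "usage: check_ib_peer <ib0-address-of-another-node>" >&2
        return 2
    fi
    if ping -c 3 -W 2 -I ib0 "$peer_ip" >/dev/null 2>&1; then
        echo "OK: ${peer_ip} answers over ib0"
    else
        echo "FAIL: ${peer_ip} does not answer over ib0"
        echo "Check OpenSM on the head node; restart with: service opensmd restart"
        return 1
    fi
}

# Usage (replace the placeholder with a real ib0 address):
#   check_ib_peer <peer-ib0-address>
```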