1
LSF Configuration and Basic Troubleshooting
Bruce Pullig, Solution Architect
bpullig@slb
2
LSF Installation:
LSF is installed on the head node (hpecml001) and is then available to all compute nodes. The path to the install is:
    /usr/local/lsftop
A startup script (lsf) is located in /etc/init.d on each node and is installed as a service, which allows automatic startup upon reboot. It can be controlled by running the following as root:
    service lsf { start | stop | restart | status | force_reload }
Example:
    ~]# cd /etc/init.d
    init.d]# service lsf status
    Show status of the LSF subsystem
    lim (pid 5300) is running...
    res (pid 5302) is running...
    sbatchd (pid 5304) is running...
    init.d]#
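The status output above lends itself to a quick scripted check. The following is a minimal sketch, assuming the status lines have the "name (pid N) is running" form shown in the example; the function name lsf_daemons_down is hypothetical and not part of LSF.

```shell
# Sketch: report LSF daemons that the status output does not list as running.
# The daemon names (lim, res, sbatchd) are taken from the example output;
# lsf_daemons_down is a hypothetical helper name.
lsf_daemons_down() {
    # Reads `service lsf status` output on stdin and prints one line
    # for each expected daemon that is not reported as running.
    local status d
    status=$(cat)
    for d in lim res sbatchd; do
        if ! printf '%s\n' "$status" | grep -Eq "^${d} \(pid [0-9]+\) is running"; then
            printf '%s\n' "$d"
        fi
    done
}
```

Possible usage as root on a node: service lsf status | lsf_daemons_down. Empty output would mean all three daemons were reported as running.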
3
LSF Queues:
    Group               Queue                   Job type
    Qatar               normal                  Parallel Jobs
                        priority-normal         Parallel Jobs - Higher Priority
                        serial                  Serial Jobs
                        priority-serial         Serial Jobs - Higher Priority
    EMDC (emdc_group)   emdc_normal             Parallel Jobs
                        emdc_priority-normal    Parallel Jobs - Higher Priority
                        emdc_serial             Serial Jobs
                        emdc_priority-serial    Serial Jobs - Higher Priority
    URC (urc_group)     urc_normal              Parallel Jobs
                        urc_priority-normal     Parallel Jobs - Higher Priority
                        urc_serial              Serial Jobs
                        urc_priority-serial     Serial Jobs - Higher Priority
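The queue names follow a regular pattern (optional group prefix, optional "priority-", then "normal" or "serial"), so a job script can derive the queue instead of hard-coding it. This is a sketch only: pick_queue is a hypothetical helper, and it assumes the unprefixed queues are the ones used by the Qatar group.

```shell
# Sketch: derive a queue name from group, job type, and priority.
# pick_queue is a hypothetical helper; the mapping mirrors the queue list.
# Assumption: the unprefixed queues belong to the "qatar" group.
pick_queue() {
    local group kind priority prefix base
    group=$1
    kind=$2
    priority=${3:-standard}
    case $group in
        qatar)    prefix= ;;
        emdc|urc) prefix="${group}_" ;;
        *) echo "unknown group: $group" >&2; return 1 ;;
    esac
    case $kind in
        parallel) base=normal ;;
        serial)   base=serial ;;
        *) echo "unknown job type: $kind" >&2; return 1 ;;
    esac
    case $priority in
        standard) ;;
        high)     base="priority-$base" ;;
        *) echo "unknown priority: $priority" >&2; return 1 ;;
    esac
    printf '%s%s\n' "$prefix" "$base"
}
```

A submission might then look like: bsub -q "$(pick_queue emdc parallel high)" -n 8 ./run_job.sh, where run_job.sh is a placeholder for the actual job script.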
4
LSF Queue Configuration File:
/usr/local/lsftop/conf/lsbatch/hpecml001/configdir/lsb.queues
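Each queue in lsb.queues is defined in a "Begin Queue" ... "End Queue" section that contains a QUEUE_NAME line. As a rough sketch, the defined queues can be listed straight from the file to confirm an edit; list_queues is a hypothetical helper name, and the default path is the one given above.

```shell
# Sketch: list the queue names defined in an lsb.queues file.
# list_queues is a hypothetical helper; it looks for lines of the form
#   QUEUE_NAME = <name>
# and ignores comment lines starting with "#".
list_queues() {
    local file
    file=${1:-/usr/local/lsftop/conf/lsbatch/hpecml001/configdir/lsb.queues}
    awk -F= '
        /^[ \t]*#/ { next }                    # skip comment lines
        $1 ~ /^[ \t]*QUEUE_NAME[ \t]*$/ {
            name = $2
            sub(/#.*/, "", name)               # drop a trailing comment
            gsub(/[ \t]/, "", name)            # trim surrounding whitespace
            print name
        }
    ' "$file"
}
```

Running list_queues with no argument on the head node reads the file at the path above; pass another path to check a working copy before installing it.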
5
LSF Queue Configuration File:
For queue file changes to take effect, run the following on the head node (hpecml001) as lsfadmin, ecl, or root:
    badmin reconfig
6
LSF
Monitoring compute nodes:
    bhosts
    HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
    hpecml closed
    hpecnl ok
    hpecnl ok
    hpecnl ok
    hpecnl ok
    hpecnl ok
    hpecnl ok
    *********************SNIP************************
    hpecnl ok
Monitoring queues/jobs:
    bjobs -a -u all
    msaxena DONE Eclipse hpecml001 hpecnl001 *e_THP5020 Mar 22 17:38
    dzajac DONE Eclipse hpecml001 hpecnl004 *PERM_HM_2 Mar 22 17:46
    msaxena DONE Eclipse hpecml001 hpecnl008 *PERM_HM_2 Mar 22 17:52
    pulligb RUN Eclipse hpecml001 8* hpecnl004 PULLIG9 Mar 23 08:43
                                  8* hpecnl005
                                  8* hpecnl006
                                  8* hpecnl007
                                  8* hpecnl008
                                  8* hpecnl009
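On a larger cluster it can help to filter this output rather than read it by eye. The sketch below assumes the default column layouts (HOST_NAME and STATUS as the first two bhosts columns; JOBID, USER, STAT as the first three bjobs columns); hosts_not_ok and jobs_by_status are hypothetical helper names.

```shell
# Sketch: summarize bhosts and bjobs output read from stdin.
# Both helper names are hypothetical.

# Print "host: status" for every host whose STATUS is not "ok".
hosts_not_ok() {
    awk 'NR > 1 && NF >= 2 && $2 != "ok" { print $1 ": " $2 }'
}

# Count jobs per STAT value. Only lines starting with a numeric JOBID
# are counted, which skips the header and the continuation lines that
# list additional execution hosts of a parallel job.
jobs_by_status() {
    awk '$1 ~ /^[0-9]+$/ { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}
```

Possible usage: bhosts | hosts_not_ok and bjobs -a -u all | jobs_by_status.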
7
LSF
Jobs completed over 60 minutes ago:
    bhist -a -u <username>
Killing jobs:
    bkill <jobid>
    -bash-3.2$ bkill 1293
    Job <1293> is being terminated
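When several jobs of one user need to be killed, the job IDs can be pulled from bjobs output first. The following is a dry-run sketch: it only prints the bkill commands so they can be reviewed before anything is terminated. bkill_cmds is a hypothetical helper, and it assumes the default bjobs columns (JOBID, USER, STAT first).

```shell
# Sketch: print (not run) a bkill command for each running or pending
# job of one user, based on bjobs output read from stdin.
# bkill_cmds is a hypothetical helper name.
bkill_cmds() {
    local user
    user=$1
    awk -v user="$user" '
        $1 ~ /^[0-9]+$/ && $2 == user && ($3 == "RUN" || $3 == "PEND") {
            print "bkill " $1
        }'
}
```

Possible usage: bjobs -w -u all | bkill_cmds pulligb, then run the printed commands once they look right. The -w (wide) option is suggested because the default bjobs format can truncate long user names.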
8
LSF
To obtain more info for troubleshooting:
    bjobs -l <jobID>
    Job <1292>, Job Name <TIMEDEP_E300_2>, User <dzajac>, Status <DONE>, Queue <Eclipse>, Command <cd /data/PTCI/Eclipse/dzajac ; ./TIMEDEP_E300_ >
    Thu Mar 22 17:52:24: Submitted from host <hpecml001>, CWD </data/PTCI/Eclipse/dzajac>, Requested Resources <select[type == any] rusage[eclipse=1:compositional=1]>;
    Thu Mar 22 17:52:28: Started on <ho10>, Execution Home </home/dzajac>, Execution CWD </data/PTCI/Eclipse/dzajac>;
    Thu Mar 22 17:53:57: Done successfully. The CPU time used is 61.5 seconds.
    SCHEDULING PARAMETERS:
    r15s r1m r15m ut pg io ls it tmp swp mem
    loadSched
    loadStop
    EXTERNAL MESSAGES:
    MSG_ID FROM POST_TIME MESSAGE ATTACHMENT
    dzajac Mar 22 17:52 EF_SPOOLER_URI Y
Slide callouts: Datafile name; Location of datafile and relevant files for debugging
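The long-format output wraps at a fixed width, so values such as the CWD can be split across lines, which makes grepping awkward. The sketch below joins wrapped lines and then extracts one "Label <value>" field. Both function names are hypothetical, and the continuation indent of 21 spaces is an assumption that should be checked against the output of your LSF version.

```shell
# Sketch: pull a single "Label <value>" field out of `bjobs -l` output.
# Function names are hypothetical. Assumption: wrapped lines are
# continued on lines indented by at least 21 spaces.

# Join continuation lines onto the line they continue.
unwrap_bjobs_l() {
    local indent
    indent=${1:-21}
    awk -v n="$indent" '
        match($0, /^ +/) && RLENGTH >= n {
            buf = buf substr($0, RLENGTH + 1)   # append without the indent
            next
        }
        { if (NR > 1) print buf; buf = $0 }
        END { if (NR > 0) print buf }
    '
}

# Print the first value found for a label such as "Status" or "CWD".
bjobs_l_field() {
    local label
    label=$1
    unwrap_bjobs_l | sed -n "s/.*[ ,]${label} <\([^>]*\)>.*/\1/p" | head -n 1
}
```

Possible usage: bjobs -l 1292 | bjobs_l_field CWD to print the directory holding the datafile and related files.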
9
LSF Log Files
Log files for each system are found under:
    /usr/local/lsftop/log
There is a log file for each daemon that controls an LSF function. The system name is appended to the daemon name. Examples:
    log]# ls -lahrt | grep hpecml001
    -rw-r--r-- 1 lsfadmin root Jan melim.log.hpecml001
    -rw-r--r-- 1 lsfadmin root Oct pim.log.hpecml001
    -rw-r--r-- 1 lsfadmin root K Oct 18 12:01 mbatchd.log.hpecml001
    -rw-rw-rw- 1 lsfadmin root K Nov 27 12:21 res.log.hpecml001
    -rw-r--r-- 1 lsfadmin root K Nov 27 12:21 sbatchd.log.hpecml001
    -rw-r--r-- 1 lsfadmin root M Nov 27 12:21 lim.log.hpecml001
    -rw-r--r-- 1 lsfadmin lsfadmin 47K Nov 27 16:46 mbschd.log.hpecml001
    log]#
    log]# ls -lahrt | grep hpecnl009
    -rw-r--r-- 1 lsfadmin root Jan pim.log.hpecnl009
    -rw-rw-rw- 1 lsfadmin root K Feb res.log.hpecnl009
    -rw-r--r-- 1 lsfadmin root Feb sbatchd.log.hpecnl009
    -rw-r--r-- 1 lsfadmin root K Feb lim.log.hpecnl009
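Because the file names follow the pattern daemon.log.host, the log for a given daemon and node can be located mechanically. This is a minimal sketch; LOG_DIR, lsf_log_path, and lsf_log_tail are hypothetical names, and the default directory is the one given above.

```shell
# Sketch: locate and tail the log of one LSF daemon on one host.
# LOG_DIR and both function names are hypothetical; the default
# directory is the log path from this section.
LOG_DIR=${LOG_DIR:-/usr/local/lsftop/log}

# Print the expected log path. $1 = daemon (e.g. lim, res, sbatchd), $2 = host.
lsf_log_path() {
    printf '%s/%s.log.%s\n' "$LOG_DIR" "$1" "$2"
}

# Print the last N lines (default 20) of that log. $3 = N (optional).
lsf_log_tail() {
    local file
    file=$(lsf_log_path "$1" "$2")
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    tail -n "${3:-20}" "$file"
}
```

Possible usage: lsf_log_tail sbatchd hpecnl009 50 to see the most recent sbatchd messages for that compute node.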
10
LSF Infiniband Issues
If you suspect Infiniband issues, ping another node on the ib0 device. IP range is: x.x.x.x
If the other system doesn’t respond, check the OpenSM service on the head node. Restart it if necessary by running:
    service opensmd restart
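The ping check can be repeated across several nodes in one pass. The sketch below assumes the Linux iputils ping, whose -I option selects the outgoing interface; check_ib_peers is a hypothetical helper, and the addresses are passed as arguments because the actual IP range is not given here.

```shell
# Sketch: ping a list of addresses over the ib0 device and report
# which ones do not answer. check_ib_peers is a hypothetical helper.
# Assumes Linux iputils ping (-c count, -W timeout in seconds, -I interface).
check_ib_peers() {
    local ip failed
    failed=0
    for ip in "$@"; do
        if ping -c 1 -W 2 -I ib0 "$ip" >/dev/null 2>&1; then
            echo "$ip: reachable over ib0"
        else
            echo "$ip: NO RESPONSE over ib0"
            failed=1
        fi
    done
    return "$failed"
}
```

Possible usage: check_ib_peers followed by the ib0 addresses of the nodes to test. A non-zero exit status means at least one address did not respond, which would be the cue to check the OpenSM service on the head node.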