EnginFrame, LSF, and ECLIPSE Overview
Bruce Pullig, Solution Architect, bpullig@slb.com
Why use LSF?
Without LSF:
Difficult to determine which cores in the cluster are in use.
-This can change before you can submit your job.
-Choosing incorrectly can cause inefficiency or job failure.
No queuing.
-You must wait until ALL resources are available before submitting your job.
-Low-importance jobs can’t be scheduled for non-working hours.
With LSF:
Jobs run when resources (including memory, cores, and licenses) are available.
Jobs can be scheduled for off-hours.
Jobs can be run efficiently.
4-core parallel job: efficient use of hardware
8-core parallel job: efficient use of hardware
4 serial jobs: efficient use of hardware… but
4 serial jobs and 1 very inefficient 8-core parallel job
Dividing the cluster “virtually” helps prevent inefficiency. The divisions can overlap if configured carefully.
Typical LSF Installation:
Installed on shared storage, which is then available to all nodes.
A script under /etc/rc3.d on each node is installed as a service and starts automatically on reboot.
Default queues:
normal (For normal low priority jobs, running only if hosts are lightly loaded.)
owners (For owners of some machines.)
priority (Jobs submitted to this queue are scheduled as urgent jobs. Jobs in this queue can preempt jobs in lower priority queues. Preemption is incompatible with ECLIPSE.)
short (For short jobs that would not take much CPU time. Killed if they run more than 15 minutes.)
idle (Run only if the machine is idle and very lightly loaded.)
license (For licensed packages. Scheduled to run with moderate priority.)
night (For large heavy duty jobs, running during off hours and weekends. Scheduled with higher priority.)
chkpnt_rerun_queue (Incompatible with ECLIPSE.)
Recommended queues:
ParallelWork (Normal priority queue, limited to Parallel nodes)
SerialWork (Normal priority queue, limited to Serial nodes)
MR_Serial (Low priority queue for MR jobs, limited to Serial nodes)
MR_Parallel (Low priority queue for MR jobs, limited to Parallel nodes)
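The queue choice above can be sketched as a small lookup. This is only an illustration: the helper name `pick_queue` and its parameters are hypothetical, and it assumes that any job using more than one core counts as a parallel job.

```python
def pick_queue(cores: int, mr_job: bool = False) -> str:
    """Return one of the recommended queue names for a job.

    Assumption: a job with more than one core is "parallel".
    MR jobs go to the low priority MR_* queues.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel = cores > 1
    if mr_job:
        return "MR_Parallel" if parallel else "MR_Serial"
    return "ParallelWork" if parallel else "SerialWork"


if __name__ == "__main__":
    print(pick_queue(8))                # ParallelWork
    print(pick_queue(1, mr_job=True))   # MR_Serial
```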
License Requirements
Simulations require one Black Oil (eclipse) license per job.
Parallel jobs also require one Parallel (parallel) license per core.
Other feature licenses may be required, depending on the features used.
If there are insufficient licenses available, LSF will queue the job until ALL resources, including licenses, are available.
ECLIPSE and Parallel licenses are checked by default, but all other features must be identified in the user's .DATA file.
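As a rough sketch of the license arithmetic above: one eclipse license per job, plus one parallel license per core for parallel jobs. The function names are hypothetical, and counting one license per extra feature is an assumption, since the slide does not give counts for other features. The `rusage[...]` text only mirrors the form that appears in `bjobs -l` output; it is not meant as a definitive submission string.

```python
def licenses_needed(cores, extra_features=()):
    """Return a dict of license feature -> count for one simulation job."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    needed = {"eclipse": 1}           # one Black Oil license per job
    if cores > 1:
        needed["parallel"] = cores    # one Parallel license per core
    for feature in extra_features:
        needed[feature] = 1           # assumption: one license per extra feature
    return needed


def rusage_string(needed):
    """Format the counts in the rusage[...] style shown by bjobs -l."""
    return "rusage[" + ":".join(f"{k}={v}" for k, v in needed.items()) + "]"


if __name__ == "__main__":
    print(licenses_needed(8))                                   # {'eclipse': 1, 'parallel': 8}
    print(rusage_string(licenses_needed(1, ["compositional"])))  # rusage[eclipse=1:compositional=1]
```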
Data Storage
LSF
Monitoring compute nodes:
bhosts
HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
ho02       ok      -     6    0      0    0      0      0
ho03       ok      -     6    0      0    0      0      0
*********************SNIP************************
ho16       ok      -     6    0      0    0      0      0
Monitoring queues/jobs:
bjobs -a -u all
1290  msaxena  DONE  Eclipse  holsf01  ho11    *e_THP5020  Mar 22 17:38
1291  dzajac   DONE  Eclipse  holsf01  ho10    *PERM_HM_2  Mar 22 17:46
      msaxena  DONE  Eclipse  holsf01  ho10    *PERM_HM_2  Mar 22 17:52
      pulligb  RUN   Eclipse  holsf01  6*ho45  PULLIG9     Mar 23 08:43
                                       6*ho12
                                       6*ho13
                                       6*ho14
                                       6*ho15
                                       2*ho02
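One way to act on the bhosts listing is to turn it into a count of free job slots per host. The sketch below works on captured text rather than calling LSF, assumes the column layout shown on the slide, and treats free slots as MAX minus NJOBS on hosts whose status is ok. The function name `free_slots` is hypothetical.

```python
BHOSTS_SAMPLE = """\
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
ho02 ok - 6 0 0 0 0 0
ho03 ok - 6 0 0 0 0 0
ho16 ok - 6 0 0 0 0 0
"""


def free_slots(bhosts_text):
    """Return {host: free job slots} for hosts reported as ok."""
    lines = [ln.split() for ln in bhosts_text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    col = {name: i for i, name in enumerate(header)}
    free = {}
    for row in rows:
        if len(row) != len(header):
            continue                      # skip separators such as SNIP lines
        if row[col["STATUS"]] != "ok":
            continue                      # closed or unavailable hosts
        max_slots = row[col["MAX"]]
        if not max_slots.isdigit():
            continue                      # "-" means no slot limit is set
        free[row[col["HOST_NAME"]]] = int(max_slots) - int(row[col["NJOBS"]])
    return free


if __name__ == "__main__":
    slots = free_slots(BHOSTS_SAMPLE)
    print(slots)
    print("total free slots:", sum(slots.values()))
```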
LSF
To obtain more info for troubleshooting:
bjobs -l <jobID>

Job <1292>, Job Name <TIMEDEP_E300_2>, User <dzajac>, Status <DONE>, Queue <Eclipse>, Command <cd /data/PTCI/Eclipse/dzajac ; ./TIMEDEP_E300_2.175224>
Thu Mar 22 17:52:24: Submitted from host <holsf01>, CWD </data/PTCI/Eclipse/dzajac>, Requested Resources <select[type == any] rusage[eclipse=1:compositional=1]>;
Thu Mar 22 17:52:28: Started on <ho10>, Execution Home </home/dzajac>, Execution CWD </data/PTCI/Eclipse/dzajac>;
Thu Mar 22 17:53:57: Done successfully. The CPU time used is 61.5 seconds.

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -

EXTERNAL MESSAGES:
MSG_ID  FROM    POST_TIME     MESSAGE         ATTACHMENT
0       dzajac  Mar 22 17:52  EF_SPOOLER_URI  Y

Job Name: datafile name
CWD: location of datafile and relevant files for debugging
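The fields worth pulling out of the long listing (job name, submission CWD, requested licenses) can be extracted with a few regular expressions. This is a minimal sketch over captured text: the sample is modeled on the slide output, the wrapped-line layout (continuation lines indented with spaces) is an assumption, and `summarize_job` is a hypothetical name.

```python
import re

# Sample modeled on the slide; the wrap inside the CWD path is an assumption
# about how the long line was broken in the terminal.
BJOBS_L_SAMPLE = (
    "Job <1292>, Job Name <TIMEDEP_E300_2>, User <dzajac>, Status <DONE>\n"
    "Thu Mar 22 17:52:24: Submitted from host <holsf01>, CWD </data/PTCI/Eclip\n"
    "                     se/dzajac>, Requested Resources <select[type == any] "
    "rusage[eclipse=1:compositional=1]>;\n"
    "Thu Mar 22 17:52:28: Started on <ho10>, Execution CWD </data/PTCI/Eclipse/dzajac>;\n"
)


def summarize_job(bjobs_l_text):
    """Extract job id, job name, submission CWD and rusage items."""
    # Join wrapped lines: a newline followed by indentation continues a line.
    text = re.sub(r"\n[ \t]+", "", bjobs_l_text)

    def field(pattern):
        match = re.search(pattern, text)
        return match.group(1) if match else None

    rusage = field(r"rusage\[([^\]]*)\]")
    licenses = {}
    if rusage:
        for item in rusage.split(":"):
            if "=" in item:
                name, count = item.split("=", 1)
                licenses[name] = count
    return {
        "job_id": field(r"Job <(\d+)>"),
        "job_name": field(r"Job Name <([^>]*)>"),
        "cwd": field(r"(?<!Execution )CWD <([^>]*)>"),
        "licenses": licenses,
    }


if __name__ == "__main__":
    print(summarize_job(BJOBS_L_SAMPLE))
```

The job name and CWD together point at the datafile and its output files, which is usually the first place to look when a run fails.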
LSF
Jobs completed over 60 minutes ago:
bhist -a -u <username>
Killing jobs:
bkill <jobid>
-bash-3.2$ bkill 1293
Job <1293> is being terminated
Queue setup:
vi /lsftop/conf/lsbatch/CLUSTER_NAME/configdir/lsb.queues
badmin reconfig
LSF InfiniBand Issues
If you suspect InfiniBand issues, ping another node on the ib0 device.
IP range is: x.x.x.x
If the other system doesn’t respond, check the OpenSM service on the head node. Restart it if necessary by running:
service opensmd restart
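The ping check can be scripted along these lines. This is a sketch under assumptions: it relies on the Linux `ping` options `-I` (interface) and `-c` (packet count), the function names are hypothetical, and the address 192.0.2.10 is only a placeholder because the slide does not give the real IP range.

```python
import subprocess


def ib_ping_command(peer_ip, device="ib0", count=3):
    """Build the ping command that sends packets out of the given device."""
    return ["ping", "-I", device, "-c", str(count), peer_ip]


def ib_peer_responds(peer_ip, device="ib0", count=3):
    """Return True if the peer answers over the given device."""
    result = subprocess.run(
        ib_ping_command(peer_ip, device, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder address; substitute a node address from the ib0 range.
    print(" ".join(ib_ping_command("192.0.2.10")))
```

If `ib_peer_responds` returns False for a node that is otherwise up, that is the point at which checking the OpenSM service on the head node makes sense.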
EnginFrame Access via: http://HEADNODE:8080
Submitting Simulation Job
Monitoring Simulation Job
Cluster Info
My Jobs
All Jobs
ECLIPSE
Launching Simulation from Command Line:
eclrun -s holsf01 -q <QUEUENAME> <application> <DATASET>
Example:
eclrun -s holsf01 -q ParallelWork eclipse BIG_RESERVOIR
Application will be: eclipse, e300, or frontsim
Modifying benchmarks to run on x cores:
Change the PARALLEL section in the .DATA file.
eclipse:
PARALLEL
32 'DISTRIBUTED' /
E300:
PARALLEL
36 /
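Changing the core count by hand is error-prone when several benchmark files are involved, so the edit can be sketched as a text rewrite. The sketch assumes the core count is the first item on the first data line after the PARALLEL keyword and that comment lines begin with `--`; `set_parallel_cores` is a hypothetical helper, not part of ECLIPSE.

```python
import re


def set_parallel_cores(data_text, cores):
    """Return data_text with the core count after PARALLEL set to cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    lines = data_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().upper() != "PARALLEL":
            continue
        # The record is on the next line that is neither blank nor a comment.
        for j in range(i + 1, len(lines)):
            stripped = lines[j].strip()
            if not stripped or stripped.startswith("--"):
                continue
            new_line, changed = re.subn(
                r"^(\s*)\d+", rf"\g<1>{cores}", lines[j], count=1
            )
            if changed == 0:
                raise ValueError("no core count found after PARALLEL")
            lines[j] = new_line
            return "\n".join(lines) + "\n"
        break
    raise ValueError("PARALLEL keyword not found")


if __name__ == "__main__":
    sample = "PARALLEL\n 32 'DISTRIBUTED' /\n"
    print(set_parallel_cores(sample, 16), end="")
```

The same function covers the E300 form, since only the leading number on the record line is replaced.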
ECLIPSE Troubleshooting
Check the following for errors:
<SIMULATION_NAME>.OUT
<SIMULATION_NAME>.PRT
<SIMULATION_NAME>.LOG
<SIMULATION_NAME>.ECLRUN
The simulation name is the datafile name without .DATA.
Run either a benchmark or the sample datafiles to rule out dataset issues. (/ecl/benchmarks/)
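A small helper can derive the files to check from the datafile path and report which ones exist. This is a sketch: the function names are hypothetical, and it assumes the output files sit next to the .DATA file with the upper-case extensions listed above.

```python
from pathlib import Path

SUFFIXES = (".OUT", ".PRT", ".LOG", ".ECLRUN")


def troubleshooting_files(datafile):
    """Map each suffix to the expected output path for the given .DATA file."""
    path = Path(datafile)
    if path.suffix.upper() != ".DATA":
        raise ValueError("expected a .DATA file")
    return {suffix: path.with_suffix(suffix) for suffix in SUFFIXES}


def existing_files(datafile):
    """Return {path: True/False} showing which output files are present."""
    return {str(p): p.exists() for p in troubleshooting_files(datafile).values()}


if __name__ == "__main__":
    for suffix, path in troubleshooting_files("BIG_RESERVOIR.DATA").items():
        print(suffix, path)
```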
ECLIPSE
Run the benchmarks for various configurations from the command line.
Files will be run from the following structure:
ECLIPSE E100 Parallel: /ecl/benchmarks/parallel/data
ECLIPSE E100 Serial: /ecl/benchmarks/serial/e100
ECLIPSE E300 Serial: /ecl/benchmarks/serial/e300
ECLIPSE E300 Parallel: /ecl/benchmarks/2MMbenchmark/E300
Schlumberger sample datasets are also included in the file structure:
ECLIPSE Sample: /ecl/benchmarks/sample
The commands for running the benchmarks are below:
ECLIPSE E100 Parallel benchmarks:
eclrun -s Houston -q eclipse -u <user_name> eclipse ONEM
ECLIPSE E100 Serial benchmarks:
eclrun -s Houston -q eclipse -u <user_name> eclipse E100
ECLIPSE E300 Parallel benchmarks:
eclrun -s Houston -q eclipse -u <user_name> e300 MMx
ECLIPSE E300 Serial benchmarks:
eclrun -s Houston -q eclipse -u <user_name> e300 E300
If you submit a ticket to Schlumberger, they will ask for the *.OUT, *.DATA, *.LOG, and *.PRT files.
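The four benchmark commands differ only in application and dataset, so they can be generated from one table. The sketch below only builds the argument lists; it does not run anything. The table and the function name `benchmark_command` are hypothetical, and pairing each command with its directory assumes the command is launched from that directory.

```python
import shlex

# name -> (directory, application, dataset), taken from the lists above
BENCHMARKS = {
    "E100 Parallel": ("/ecl/benchmarks/parallel/data", "eclipse", "ONEM"),
    "E100 Serial": ("/ecl/benchmarks/serial/e100", "eclipse", "E100"),
    "E300 Parallel": ("/ecl/benchmarks/2MMbenchmark/E300", "e300", "MMx"),
    "E300 Serial": ("/ecl/benchmarks/serial/e300", "e300", "E300"),
}


def benchmark_command(name, user, server="Houston", queue="eclipse"):
    """Return (directory, argv) for the named benchmark."""
    directory, application, dataset = BENCHMARKS[name]
    argv = ["eclrun", "-s", server, "-q", queue, "-u", user, application, dataset]
    return directory, argv


if __name__ == "__main__":
    for name in BENCHMARKS:
        directory, argv = benchmark_command(name, "<user_name>")
        print(f"{name}: cd {directory} && {shlex.join(argv)}")
```

Passing `argv` to `subprocess.run(argv, cwd=directory)` would be one way to launch a benchmark, again assuming the benchmark directory is the right working directory.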