Research Computing Environment at the University of Alberta
Diego Novillo
Research Computing Support Group
University of Alberta
April 1999

29 April
Computing Environment
SGI Origin 2000, 42 CPUs, 10 GB RAM
Mix of interactive and batch jobs
2 CPUs for interactive activity
40 CPUs used by batch jobs
Batch jobs managed by LSF (Platform)

How is the system being used?
Monthly System Utilization (CPU days)
[Chart: monthly system utilization in CPU days, compared with the theoretical maximum]

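A rough way to read the "theoretical max" line is as the number of batch CPUs times the number of days in the month. That reading is an assumption, not something the slide states. The sketch below uses the 40 batch CPUs from the Computing Environment slide; the function name `max_cpu_days` is made up for illustration.

```shell
# Theoretical maximum CPU days for one month, ASSUMING every batch CPU
# is busy around the clock. max_cpu_days is a hypothetical helper.
max_cpu_days() {
    batch_cpus=$1
    days_in_month=$2
    echo $((batch_cpus * days_in_month))
}

max_cpu_days 40 30   # April: 40 batch CPUs x 30 days
```

Monthly totals well below that figure would suggest idle capacity; totals near it would be consistent with long queue waits.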
Average wait time in queue (hours)
[Chart: average wait time in queue, in hours]
Started using load thresholds
Need to balance parallel jobs

System usage by job type
[Chart: system usage by job type]

Some thoughts on usage
Scalar use is predominant (so far)
We are starting to push the system
Jobs are waiting too long in the queue
Need to modify queue policies
  – Lower runtime limits
  – Checkpoint/restart
  – Limit on number of jobs submitted

Using LSF
Job queues
Parallel queue par
  – High priority
  – Slot-based: up to 32 processors
  – Jobs are never suspended
Sequential queue nic
  – Low priority
  – Threshold-based: up to 95% system utilization
  – Jobs can be preempted by parallel jobs

Job queues II
Two special queues
  – npseq
      For sequential jobs that do not wish to be preempted
      Very low priority
      Only 4 slots available
  – special
      Jobs that need to run longer than system limit
      Only 1 slot available
      Must be approved by committee

Fairshare system
Jobs are scheduled according to priorities
Priorities are dynamic and based on
  – Number of shares
  – Past usage (currently 2 weeks of history)
  – Type of job (parallel jobs get higher priority)
Resource availability is also important

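To make the idea of a dynamic priority concrete, here is a toy calculation in Bourne shell. It is not the formula LSF uses. The function name, the `shares / (1 + usage)` form, and the factor of 2 for parallel jobs are all invented for this sketch. It only shows the direction of each effect: more shares raise the priority, recent usage lowers it, and parallel jobs get a boost.

```shell
# Toy dynamic priority. NOT LSF's real formula; all weights are invented.
toy_priority() {
    shares=$1        # shares assigned to the group
    cpu_hours=$2     # CPU hours used inside the history window
    is_parallel=$3   # 1 for a parallel job, 0 for a sequential one
    awk -v s="$shares" -v u="$cpu_hours" -v p="$is_parallel" 'BEGIN {
        boost = (p == 1) ? 2 : 1            # hypothetical parallel boost
        printf "%.3f\n", boost * s / (1 + u)
    }'
}

toy_priority 10 0 0     # group with no recent usage, sequential job
toy_priority 10 99 0    # same shares after heavy recent usage
toy_priority 10 99 1    # same group submitting a parallel job
```

Because usage only counts inside the history window, a group that stops submitting sees its priority recover as old usage ages out.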
Getting started
Complete LSF documentation online
Man pages also available
Add one line to your login files
  source /usr/local/lsf/cshrc.lsf    (C shell)
or
  . /usr/local/lsf/profile.lsf       (Bourne shell)

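If the same login file is shared with machines that do not have LSF installed, the line can be guarded. This is a minimal sketch for the Bourne shell case. The variable name `LSF_PROFILE` and the warning text are illustrative; the path is the one given above.

```shell
# Bourne-shell login file fragment (for example ~/.profile).
# Source the LSF setup only if it is readable on this host.
LSF_PROFILE=/usr/local/lsf/profile.lsf
if [ -r "$LSF_PROFILE" ]; then
    . "$LSF_PROFILE"
else
    echo "LSF setup not found at $LSF_PROFILE" >&2
fi
```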
Submitting jobs
% bsub [options] pgm args
  -q name    Which queue to use
  -n num     How many processors
  -o file    Output file
Queue defaults to ‘nic’. If no output file is given, results are mailed to you.

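As an illustration of how the three options combine, the sketch below assembles a bsub command line and prints it instead of running it, so it can be tried on a machine without LSF. The helper `build_bsub`, the program `./my_program`, and the file names are hypothetical. Only `-q`, `-n`, `-o`, and the queue name `par` come from the slides.

```shell
# Build and print a bsub command line from the options listed above.
# build_bsub is a hypothetical helper; it does not submit anything.
build_bsub() {
    queue=$1; nproc=$2; outfile=$3
    shift 3
    echo "bsub -q $queue -n $nproc -o $outfile $*"
}

# An 8-processor job for the parallel queue, with output kept in a file
build_bsub par 8 myjob.out ./my_program input.dat
```

Running the printed command on the LSF host would submit the job. Leaving out `-q` would send it to the default queue, and leaving out `-o` would mail the results.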
Watching jobs
% bjobs [options]
  -l        All the details
  -p        Only pending jobs (and why)
  -a        All jobs (even finished ones)
  -u all    All the jobs in the system
  jobid     Just the job with this id

Manipulating jobs
% bkill jobid      Kills the job (can also send signal)
% bstop jobid      Suspends the job (even if not running)
% bresume jobid    Resumes the job

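All of these commands need a job ID, which bsub reports when the job is submitted, typically in a message of the form `Job <1234> is submitted to queue <par>.` The Bourne shell sketch below pulls the number out of such a message. The helper name `job_id_from_bsub` and the job number 1234 are made up, and the message format is assumed rather than taken from the slides.

```shell
# Extract the numeric job ID from the message bsub prints on submission.
# Assumes the message starts with "Job <number>".
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Sample message with an invented job number, standing in for real output
echo 'Job <1234> is submitted to queue <par>.' | job_id_from_bsub

# On the LSF host this could be combined as:
#   jobid=$(bsub -q nic ./my_program | job_id_from_bsub)
#   bjobs -l "$jobid"     # inspect the job
#   bstop "$jobid"        # suspend it
#   bresume "$jobid"      # let it continue
#   bkill "$jobid"        # or remove it
```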
Getting usage statistics
We keep monthly stats on our web page
For current information
% bacct [opts]                 Total usage for your jobs. Can specify dates and jobs
% priorities (or bhpart -r)    Lists all the priorities for different groups

Monitoring load on the system
% bqueues    Shows queues and how loaded they are
% lsload     Quick glance at the load on the system
Also GUI tools (xlsbatch, xlsmon)
Please use sparingly as they add to interactive load on the system.

Contact Information
Visit our home page
Questions and comments