Sun GridEngine
BIMSB Bioinformatics Coordination, 18-01-2010
TOC
- Why use a scheduler
- Concepts
- Use cases
- NFS bottleneck
- Getting info
- Q/A
Why
- BIMSB cluster consisting of 100 4-core nodes and 5 “bignodes” (16 cores, 256 GB RAM)
- Use of computing resources should be as optimal as possible
- “Fair” sharing of resources is wanted
- Manual management of “who can run what on which node” is not feasible
Concepts
[Diagram: the scheduler takes jobs from the list of pending jobs and distributes them over the queues (queue1, queue2, queue3, ...), which are mapped onto the nodes (node1, node2, node3, node4, ..., nodeX).]
Concepts (cont.)
Queues defined:
- standard, standard-big (slots = cores)
- high, high-big (slots = cores/4)
- longrun (1 slot per host)
- interactive (1 slot per host)
Resources:
- Measured resources (load, free memory)
- Administrator defined:
  - Consumables: licenses
  - Boolean: “has scratch disk”, “special software installed”
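As a sketch of how queues and resources combine in a request (the memory value is arbitrary; the scratch resource follows the "-l scratch" example later in this talk), a memory-hungry job can be sent to the big-node queue and restricted to hosts with a local scratch disk:

# qsub -q standard-big -l scratch -l virtual_free=100G -b y <my command>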
Job prioritization
scheduling priority =
  1.0  * priority (set by the user or an administrator)
+ 0.1  * urgency (waiting time, resource consumption)
+ 0.01 * ticket (past cluster usage)
A lot of fine tuning is possible.
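For illustration (the value -100 is arbitrary; regular users can only lower their own jobs' priority), the user-controlled priority can be set at submission or changed afterwards:

# qsub -p -100 -b y <my command>
# qalter -p -100 <job_id>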
Run a simple command:
qsub [options] -b y <my command>
Common options:
- -v <ENV_VAR>
- -l <resource>[=limit], e.g. -l virtual_free=80G
- -cwd
- -e <filename>
- -o <filename>
- -N <job_name>
- -q <queue_name>[@<node>]
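Putting a few of these options together, a minimal sketch (the command, file names and memory request are illustrative, not defaults):

# qsub -b y -cwd -N sort_reads -l virtual_free=8G -o sort_reads.out -e sort_reads.err sort -k1,1 reads.txt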
Run a cluster script:
qsub [options] <my script>
Cluster script:
#!/bin/sh
#$ some option
#$ another option
my_command
another_command
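A minimal sketch of a filled-in job script (job name, memory request and the command itself are illustrative; my_aligner is a hypothetical program):

#!/bin/sh
#$ -N align_sample
#$ -cwd
#$ -l virtual_free=4G
#$ -o align_sample.out
#$ -e align_sample.err
# the actual work of the job
my_aligner --input sample.fq --output sample.bam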
Interactive usage:
# qrsh [options]
# qrsh [options] <command>
Can also start an X11 application:
Add “ForwardX11” to your ~/.ssh/config file
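A sketch of the X11 setup: the ~/.ssh/config entry (the Host pattern is an assumption, adjust it to your setup) followed by an interactive session starting a graphical program:

Host *
    ForwardX11 yes

# qrsh xterm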
Parallel applications
Different ways to run jobs in parallel:
- Multithreaded (single node)
- Multiprocess (single node)
- make based (single/multi node)
- Array jobs (multi node)
- MPI jobs (multi node)
Single node parallelism
Program is multi-threaded / multi-process:
# qsub -pe smp <slots> <my_program> <program options>
Make sure <my_program> only runs <slots> number of processes or threads; programs usually have an option for this.
Make based:
# qsub -pe smp <slots> make -j <slots> <my_make_target>
If a variable number of slots is requested, use $NSLOTS.
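The same idea in job-script form, as a sketch (the --threads flag stands in for whatever option your program actually provides; the slot range is illustrative):

#!/bin/sh
#$ -pe smp 4-8
#$ -cwd
# $NSLOTS holds the number of slots the scheduler actually granted
my_program --threads $NSLOTS input.dat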
Multi node parallelism
Make based:
# qmake -pe make <slots> <make_target>
Requests a fixed number of slots for the whole runtime of qmake
# qmake <make_target> -j <slots>
Variable number of slots depending on the currently runnable subjobs
Multi node parallelism (cont.)
MPI (Message Passing Interface):
- Programming interface for inter-process communication
- Installed implementation: OpenMPI
- Application needs to be programmed to use this
# qsub -pe orte <slots> -b y "mpiexec -n <slots> <mpiprogram>"
mpiblast is installed, see http://bbc.mdc-berlin.de/howto/
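The same submission in job-script form, as a sketch (the program name is a placeholder; 16 slots is an arbitrary request):

#!/bin/sh
#$ -pe orte 16
#$ -cwd
# $NSLOTS matches the number of slots granted by the orte parallel environment
mpiexec -n $NSLOTS ./my_mpi_program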
Multi node parallelism (cont.)
Array jobs:
- Ideal if you can split up your input or task into multiple similar parts
# qsub -t 1-<slots> <my job>
- Starts <slots> instances of <my job>, differing only in $SGE_TASK_ID
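A sketch of an array job script (process_chunk and the chunked file naming are illustrative; the task range is arbitrary):

#!/bin/sh
#$ -t 1-20
#$ -cwd
# each task works on the input chunk matching its task id
process_chunk input.part.$SGE_TASK_ID > output.part.$SGE_TASK_ID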
NFS bottleneck
- All important data is on fileservers
- Fileservers have a bandwidth of ~200 MB/s
- Node bandwidth is ~100 MB/s
- If many nodes read from the same fileserver at the same time, the bandwidth is exhausted
=> Worst case: all jobs stall
NFS bottleneck (cont.)
Two strategies against it:
- Delayed job start
- File staging
Delayed start, simple method:
- Estimate how long reading the input takes
- Delay the start of jobs accordingly
- Unreliable
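A sketch of the simple method inside an array job script (the 60-second estimate per input and process_chunk are assumptions):

#!/bin/sh
#$ -t 1-20
#$ -cwd
# stagger the tasks so they do not all hit the fileserver at once
sleep $(( ($SGE_TASK_ID - 1) * 60 ))
process_chunk input.part.$SGE_TASK_ID > output.part.$SGE_TASK_ID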
NFS bottleneck (cont.)
Delayed start, more difficult method:
- Put jobs on hold:
# qsub -h u <job>
- When a job/task has finished reading its input, release the next one:
qrls <job_id.task_id>
- Needs some scripting and tuning; good for recurring tasks and pipelines
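A sketch with an array job (the job id and task range are placeholders, following the job_id.task_id form shown above). Submit all tasks on user hold, then release the next task once the previous one has read its input:

# qsub -h u -t 1-20 my_job.sh
# qrls <job_id>.2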
NFS bottleneck (cont.)
File staging:
- Before the start of a job, copy its input files to a local directory:
  - /tmp/<my_dir> on all nodes
  - /scratch on some bignodes (-l scratch)
- Make sure there is no NFS bottleneck in the staging phase!
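A sketch of staging within a job script ($JOB_ID is set by GridEngine; the NFS path and my_program are illustrative):

#!/bin/sh
#$ -cwd
STAGE_DIR=/tmp/$USER.$JOB_ID
mkdir -p $STAGE_DIR
# copy the input from NFS to the local disk once, then work locally
cp /path/on/nfs/input.dat $STAGE_DIR/
my_program $STAGE_DIR/input.dat > result.out
rm -rf $STAGE_DIR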
“NoNo’s”
- You can ssh into all nodes to check on your jobs, but not to run computations!
- If you qrsh/qlogin into a node, start a job in the background and log out, the job keeps running but you give back the slots -> Don't!
- Don't run multiprocess / multithreaded programs without requesting slots
Getting info Ganglia: http://141.80.186.22/ganglia/?c=BIMSB
Getting info (cont.)
- About the cluster queues in general:
# qstat -g c
- About scheduler decisions:
# qstat -j [<job_id>]
- About a finished job:
# qacct -j <job_id>
- About jobs on a specific host:
# qhost -h <hostname> -j
- About job prioritization:
# qstat -pri
In case of an error
Contact me, best by email: andreas.kuntzagk@mdc-berlin.de
Include the following information:
- What did you do
- Full command line
- Which node
- Which directory
- Expected behaviour
- Observed behaviour
- Full error messages
Q/A