Installing Galaxy on a cluster: issues around the DB server, queue system, external authentication, etc.
Nikolay Vazov
University Center for Information Technologies (USIT), University of Oslo, Norway
Swiss Galaxy Workshop, Wednesday, October 3rd, Bern
The new UiO hpc-cluster: Abel
http://www.uio.no/english/services/it/research/hpc/abel/index.html (in operation since October 1st, 2012)

Some facts about the Abel cluster:
- Ranked 96 in the world Top500 list with 178.6 TFlops (178.6 TFlops correspond roughly to ca. 2,700 PCs)
- The fastest in Norway, 3rd in Scandinavia
- Ranked 68 in the Green500 list
- 652 compute nodes and 20 administration nodes
- All compute nodes have a minimum of 64 GB RAM and 16 physical CPU cores, and are connected by FDR (56 Gbps) InfiniBand
- 10,432 cores used for computing, corresponding to ca. 2,600 quad-core PCs
- 400 TB shared disk
- Compute nodes with 350 TB local disks
- Compute nodes have a total of 48 TB RAM
- Power consumption 230 kW (full load)
- Trivia: all the nodes were mounted in 14 hours (approx. 1'15" per node!)
The existing service – the Bioportal
Bioportal features - jobs
Bioportal features - files
Galaxy in Abel - configuration
[Architecture diagram: Galaxy running under Paster (WSGI) behind an Apache proxy; external authentication (FEIDE) plus locally registered users; interface between Galaxy and the SLURM job scheduler via DRMAA; cluster compute nodes; PostgreSQL DB server located on a different host, reached over an SSL connection]
Job scheduling with Galaxy
- Galaxy – specifies the job runners
- DRMAA library – generic interface to various scheduling systems (see the sketch below)
- SLURM – schedules the jobs (client/server)
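A minimal sketch of what the DRMAA layer provides (not taken from the talk): submitting a single job through the Python drmaa bindings. The account, partition and memory options are placeholder values, not the actual Abel settings.

    # Minimal sketch: one job submitted through the generic DRMAA interface.
    # Assumes the drmaa bindings can locate libdrmaa (see the environment
    # variables shown on a later slide); the native specification below is
    # only an example, not the real Abel configuration.
    import drmaa

    def submit_sleep_job():
        s = drmaa.Session()
        s.initialize()
        jt = s.createJobTemplate()
        jt.remoteCommand = '/bin/sleep'
        jt.args = ['60']
        # Scheduler-specific options go into the native specification string
        jt.nativeSpecification = '-A staff -p normal --mem-per-cpu=1000'
        job_id = s.runJob(jt)
        print('Submitted job %s' % job_id)
        s.deleteJobTemplate(jt)
        s.exit()
        return job_id

    if __name__ == '__main__':
        submit_sleep_job()

The same script would work unchanged against PBS or SGE, which is exactly why Galaxy talks to the scheduler through DRMAA rather than to SLURM directly.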
Job scheduling: Galaxy -> DRMAA -> SLURM
- The Galaxy server is outside the cluster; we prefer this to having the Galaxy server be part of the cluster.
- Galaxy, DRMAA and SLURM are located on an NFS-mounted partition.

Galaxy: universe_wsgi.ini

# -- Job Execution
# Comma-separated list of job runners to start. local is always started. If
# ... The runners currently available are 'pbs' and 'drmaa'.
start_job_runners = drmaa
# The URL for the default runner to use when a tool doesn't explicitly define a
# runner below.
default_cluster_job_runner = drmaa:///
Job scheduling: Galaxy -> DRMAA -> SLURM

export DRMAA_PATH=/kevlar/projects/drmaa/lib/libdrmaa.so.1.0.2
export SLURM_DRMAA_CONF=/etc/slurm_drmaa.conf

hpc-dev01 etc# cat slurm_drmaa.conf
Job_categories: {
  default: "-A staff -p normal --mem-per-cpu=1000 --comment=hello",
}
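As a sanity check (not from the slides), something like the following hypothetical snippet could verify that those two environment variables are visible to the process that starts Galaxy and that the DRMAA shared library actually loads; the paths are simply the ones exported above.

    # Hypothetical sanity check for the DRMAA <-> SLURM glue; standard library only.
    import ctypes
    import os
    import sys

    def check_drmaa_env():
        lib_path = os.environ.get('DRMAA_PATH')          # exported above
        conf_path = os.environ.get('SLURM_DRMAA_CONF')   # exported above
        if not lib_path or not os.path.exists(lib_path):
            sys.exit('DRMAA_PATH is unset or does not point to libdrmaa')
        if not conf_path or not os.path.exists(conf_path):
            sys.exit('SLURM_DRMAA_CONF is unset or does not point to slurm_drmaa.conf')
        # Loading the shared library catches architecture/dependency problems early
        ctypes.CDLL(lib_path)
        print('DRMAA library and configuration look usable')

    if __name__ == '__main__':
        check_drmaa_env()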
Job scheduling: Galaxy -> DRMAA -> SLURM
Plus a couple of changes (http://mdahlo.blogspot.no/2011/06/galaxy-on-uppmax.html) in the DRMAA egg (drmaa-0.4b3-py2.6.egg):
- Find munge
- Display the web form to specify node, cores, memory, partition, etc.
- Parse the data from the web form and write a settings string into <path-to-galaxy>/database/pbs/slurm_settings.tmp
- Create a real sbatch file, add missing parameters, module loads, etc., and send the job to the cluster (sketched below)
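The sketch below only illustrates the last two steps, under the assumption that the web-form values end up as simple key=value lines in slurm_settings.tmp; the file format, helper names and module name are hypothetical, not the actual egg modifications described in the blog post.

    # Hypothetical sketch: turn user-selected settings into an sbatch script.
    # Assumes slurm_settings.tmp holds lines such as 'partition=normal' and
    # 'cores=4'; the real modifications to drmaa-0.4b3-py2.6.egg may differ.
    import subprocess

    def read_settings(path):
        settings = {}
        with open(path) as handle:
            for line in handle:
                if '=' in line:
                    key, value = line.strip().split('=', 1)
                    settings[key] = value
        return settings

    def write_sbatch(settings, script_path, command):
        lines = [
            '#!/bin/bash',
            '#SBATCH --partition=%s' % settings.get('partition', 'normal'),
            '#SBATCH --cpus-per-task=%s' % settings.get('cores', '1'),
            '#SBATCH --mem-per-cpu=%s' % settings.get('memory', '1000'),
            'module load galaxy_tools   # hypothetical module name',
            command,
        ]
        with open(script_path, 'w') as handle:
            handle.write('\n'.join(lines) + '\n')

    def submit(script_path):
        # sbatch is the standard SLURM submission command
        subprocess.check_call(['sbatch', script_path])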
Job scheduling: Galaxy -> DRMAA -> SLURM (thanks to Katerina Michalickova)
SLURM (the client has to be installed on the mounted partition): /etc/slurm/slurm.conf

hpc-dev01 slurm# cat slurm.conf
## slurm.conf: main configuration file for SLURM
## $Id: slurm_2.2.conf,v 1.30 2011/09/20 15:13:58 root Exp $
## FIXME: check GroupUpdate*, TopologyPlugin,
## UnkillableStepProgram, UsePAM
###
### Cluster
ClusterName=titan # NOW abel
SlurmctldPort=6817
SlurmdPort=6818
TmpFs=/work
TreeWidth=5
## Timers:
#default: MessageTimeout=10
## FIXME: should be reduced when/if we see that slurmd is behaving:
#SlurmdTimeout=36000
WaitTime=0
### Slurmctld
ControlMachine=blaster.teflon.uio.no
SlurmUser=slurm
StateSaveLocation=/tmp
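A small, assumption-laden check that the SLURM client installed on the mounted partition can actually reach the controller named in slurm.conf; it only shells out to the standard scontrol client.

    # Hypothetical check: can this host talk to slurmctld at all?
    import subprocess

    def slurm_client_reachable():
        # 'scontrol show config' contacts the controller and dumps the active config
        proc = subprocess.Popen(['scontrol', 'show', 'config'],
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, _ = proc.communicate()
        return proc.returncode == 0 and b'ClusterName' in out

    if __name__ == '__main__':
        print('SLURM controller reachable: %s' % slurm_client_reachable())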
SSL to the PostgreSQL server (thanks to Nate Coraor)
- Downloaded and recompiled a psycopg2-2.0.13 egg
- In universe_wsgi.ini:
  database_connection = postgres://<dbuser>:<password>@<dbhost>:5432/<dbname>?sslmode=require
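A minimal connection test matching the sslmode=require setting above; host, database, user and password are placeholders, and the call uses the standard libpq-style DSN accepted by psycopg2 rather than anything Galaxy-specific.

    # Placeholder credentials; sslmode=require refuses any non-SSL connection.
    import psycopg2

    def test_ssl_connection():
        conn = psycopg2.connect(
            'host=dbhost.example.org port=5432 dbname=galaxy '
            'user=galaxy password=secret sslmode=require')
        cur = conn.cursor()
        cur.execute('SELECT version()')
        print(cur.fetchone()[0])
        cur.close()
        conn.close()

    if __name__ == '__main__':
        test_ssl_connection()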
Authentication (thanks to Roland Hedberg)
- pysaml2-0.4.0/
- Modify lib/galaxy/web/controllers/user.py
- Authentication is working, but we cannot capture the POST from the IdP – any help is appreciated :)
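For context, a purely illustrative sketch (standard library only, not the actual user.py change) of what capturing the POST from the IdP involves with the SAML HTTP-POST binding: the browser POSTs a base64-encoded SAMLResponse form field to the service provider, which must decode it and hand it to pysaml2 for signature validation and attribute extraction.

    # Illustrative WSGI-level handler for the assertion consumer endpoint.
    # Field names follow the SAML 2.0 HTTP-POST binding; the hand-off to
    # pysaml2 is deliberately left out because the 0.4.0 API is not shown here.
    import base64
    from urllib.parse import parse_qs

    def capture_saml_post(environ):
        length = int(environ.get('CONTENT_LENGTH') or 0)
        body = environ['wsgi.input'].read(length).decode('utf-8')
        fields = parse_qs(body)
        saml_response_b64 = fields['SAMLResponse'][0]
        relay_state = fields.get('RelayState', [''])[0]
        saml_xml = base64.b64decode(saml_response_b64)
        # saml_xml would now be passed to pysaml2 to verify the signature
        # and extract the authenticated user's attributes.
        return saml_xml, relay_state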
Thank you

http://www.uio.no/english/services/it/research/hpc/abel/index.html
n.a.vazov@usit.uio.no