Upgrade D0 farm
Reasons for upgrade
RedHat 7 needed for D0 software
New versions of
–ups/upd v4_6
–fbsng v1_3f+p2_1
–sam
Use of farm for MC and analysis
Integration in farm network

MC production on farm
Input: requests
Request translated into mc_runjob macro
Stages:
1. mc_runjob on batch server (hoeve)
2. MC job on node
3. SAM store on file server (schuur)
[Diagram] Components: farm server, file server, node, SAM DB, datastore. Labels: fbs(rcp,sam), fbs(mcc), mcc request, mcc input, mcc output, 1.2 TB, 40 GB, 100 CPUs, FNAL, SARA. Legend: control, data, metadata. fbs job: 1 mcc, 2 rcp, 3 sam.
[Diagram] Components: farm server, file server, node, SAM DB, datastore. Labels: fbs(rcp[,sam]), fbs(mcc), mcc request, mcc input, mcc output, 1.2 TB, 40 GB, 100 CPUs, FNAL, SARA. Legend: control, data, metadata. fbs job: 1 mcc, 2 rcp. cron: sam.
[Diagram] Hosts: hoeve, node, schuur. Processes: fbsuser: mc_runjob, fbsuser: cp, fbsuser: mcc, fbsuser: rcp, willem: sam. Arrows: fbs submit, data, control, cron.
SECTION mcc
EXEC=/d0gstar/curr/minbias/batch
NUMPROC=1
QUEUE=FastQ
STDOUT=/d0gstar/curr/minbias/stdout
STDERR=/d0gstar/curr/minbias/stdout

SECTION rcp
EXEC=/d0gstar/curr/minbias/batch_rcp
NUMPROC=1
QUEUE=IOQ
DEPEND=done(mcc)
STDOUT=/d0gstar/curr/minbias/stdout_rcp
STDERR=/d0gstar/curr/minbias/stdout_rcp
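The job description above chains the rcp section behind the mcc section through DEPEND=done(mcc). As a rough sketch of how that ordering can be read off such a file, the Python below parses the SECTION / KEY=VALUE layout and lists the sections in dependency order. The helpers parse_jdf and run_order are hypothetical and are not part of fbsng; the real JDF grammar may accept more than this sketch handles.

```python
import re

# Abbreviated copy of the two-section job description shown above.
JDF_TEXT = """\
SECTION mcc
EXEC=/d0gstar/curr/minbias/batch
NUMPROC=1
QUEUE=FastQ
SECTION rcp
EXEC=/d0gstar/curr/minbias/batch_rcp
NUMPROC=1
QUEUE=IOQ
DEPEND=done(mcc)
"""


def parse_jdf(text):
    """Parse SECTION / KEY=VALUE lines into {section: {key: value}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("SECTION "):
            current = line.split(None, 1)[1]
            sections[current] = {}
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            sections[current][key.strip()] = value.strip()
    return sections


def run_order(sections):
    """Order sections so each follows the one named in DEPEND=done(...)."""
    order = []
    pending = dict(sections)
    while pending:
        progressed = False
        for name in list(pending):
            match = re.match(r"done\((\w+)\)", pending[name].get("DEPEND", ""))
            if match is None or match.group(1) in order:
                order.append(name)
                del pending[name]
                progressed = True
        if not progressed:
            # Hypothetical error handling: a cycle or an unknown section name.
            raise ValueError("unresolvable DEPEND among: %s" % sorted(pending))
    return order
```

Calling run_order(parse_jdf(JDF_TEXT)) places mcc before rcp, which matches the intent that the copy step only starts once the MC step is done.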
batch (runs on node):
#!/bin/sh
. /usr/products/etc/setups.sh
cd /d0gstar/mcc/mcc-dist
. mcc_dist_setup.sh
mkdir -p /data/curr/minbias
cd /data/curr/minbias
cp -r /d0gstar/curr/minbias/* .
touch /d0gstar/curr/minbias/.`uname -n`
sh minbias sh `pwd` > log
touch /d0gstar/curr/minbias/`uname -n`
/d0gstar/bin/check minbias

batch_rcp (runs on schuur):
#!/bin/sh
i=minbias
if [ -f /d0gstar/curr/$i/OK ];then
  mkdir -p /data/disk2/sam_cache/$i
  cd /data/disk2/sam_cache/$i
  node=`ls /d0gstar/curr/$i/node*`
  node=`basename $node`
  job=`echo $i | awk '{print substr($0,length-8,9)}'`
  rcp -pr $node:/data/dest/d0reco/reco*${job}* .
  rcp -pr $node:/data/dest/reco_analyze/rAtpl*${job}* .
  rcp -pr $node:/data/curr/$i/Metadata/*.params .
  rcp -pr $node:/data/curr/$i/Metadata/*.py .
  rsh -n $node rm -rf /data/curr/$i
  rsh -n $node rm -rf /data/dest/*/*${job}*
  touch /d0gstar/curr/$i/RCP
fi
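batch_rcp selects the output files of one job by taking the last nine characters of the job directory name (awk substr($0,length-8,9)) and using them in wildcard patterns such as reco*${job}*. A small Python sketch of that selection follows. The job directory name and file names used to exercise it are made up for illustration, and job_suffix and output_pattern are hypothetical helper names.

```python
import fnmatch


def job_suffix(job_dir, width=9):
    """Return the last `width` characters of the job directory name."""
    return job_dir[-width:]


def output_pattern(prefix, job_dir):
    """Build the wildcard used to pick up one job's output files."""
    return "%s*%s*" % (prefix, job_suffix(job_dir))


def select_outputs(prefix, job_dir, filenames):
    """Keep only the file names that belong to the given job."""
    pattern = output_pattern(prefix, job_dir)
    return [name for name in filenames if fnmatch.fnmatch(name, pattern)]
```

Only files whose names contain the nine-character job suffix after the given prefix are kept, so outputs from other jobs left in the same destination directory are ignored.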
#!/bin/sh
locate(){
  file=`grep "import =" import_${1}_${job}.py | awk -F \" '{print $2}'`
  sam locate $file | fgrep -q [
  return $?
}
. /usr/products/etc/setups.sh
setup sam
SAM_STATION=hoeve
export SAM_STATION
tosam=$1
LIST=`cat $tosam`
for job in $LIST
do
  cd /data/disk2/sam_cache/${job}
  list='gen d0g sim'
  for i in $list
  do
    until locate $i || (sam declare import_${i}_${job}.py && locate ${i})
    do sleep 60; done
  done
  list='reco recoanalyze'
  for i in $list
  do
    sam store --descrip=import_${i}_${job}.py --source=`pwd`
    return=$?
    echo Return code sam store $return
  done
  echo Job finished...

declare gen, d0g, sim
store reco, recoanalyze
runs on schuur
called by fbs or cron
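The declare step in the script above keeps retrying until the file can be located in SAM: it first checks whether the file is already known, otherwise declares it and checks again, and sleeps 60 seconds between attempts. The Python sketch below mirrors that control flow only. The locate and declare callables stand in for the sam commands and are supplied by the caller; ensure_declared and the max_tries limit are hypothetical additions that the shell script does not have.

```python
import time


def ensure_declared(tier, job, locate, declare, delay=60,
                    sleep=time.sleep, max_tries=None):
    """Retry until the file for (tier, job) can be located.

    Mirrors: until locate || (declare && locate); do sleep 60; done
    Returns the number of failed attempts before success.
    """
    failures = 0
    while True:
        if locate(tier, job):
            return failures
        if declare(tier, job) and locate(tier, job):
            return failures
        failures += 1
        # Assumption: an optional upper bound, absent from the original loop.
        if max_tries is not None and failures >= max_tries:
            raise RuntimeError("could not declare %s for %s" % (tier, job))
        sleep(delay)
```

Passing sleep as a parameter keeps the sketch testable without waiting. The default behaves like the shell loop and retries indefinitely.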
Filestream
Fetch input from sam
Read input file from schuur
Process data on node
Copy output to schuur
[Diagram] Hosts: hoeve, node, schuur. Processes: mc_runjob, rcp, d0exe, rcp, sam. Arrows: fbs submit, data, control, cron, attach filestream.
Analysis on farm
Stages:
–Read files from sam
–Copy files to node(s)
–Perform analysis on node
–Copy files to file server
–Store files in sam
[Diagram] Components: farm server, file server, node, SAM DB, datastore. Labels: 1.2 TB, 40 GB, 100 CPUs, FNAL, SARA. Legend: control (fbs), data, metadata. Steps: 1. sam + rcp, 2. analyze, 3. rcp + sam. fbs(1), fbs(3) and fbs(2).
[Diagram] Hosts: triviaal, node-2. Processes: fbsuser: rcp, fbsuser: analysis program, willem: sam. Labels: input, output.
batch.jdf:
SECTION sam
EXEC=/home/willem/batch_sam
NUMPROC=1
QUEUE=IOQ
STDOUT=/home/willem/stdout
STDERR=/home/willem/stdout

batch_sam:
#!/bin/sh
. /usr/products/etc/setups.sh
setup sam
SAM_STATION=triviaal
export SAM_STATION
sam run project get_file.py --interactive > log
/usr/bin/rsh -n -l fbsuser triviaal rcp -r /stage/triviaal/sam_cache/boo node-2:/data/test >> log
[Diagram] Components: farm server, file server, node, SAM DB, datastore. Labels: 1.2 TB, 40 GB, 100 CPUs, FNAL, SARA. Legend: control (fbs), data, metadata. Steps: 1. sam, 2. rcp + analyze + rcp, 3. rcp + sam. fbs(1), fbs(3) and fbs(2).
[Diagram] Hosts: triviaal, node-2. Processes: fbsuser: rcp, analysis program, rcp; willem: sam; fbsuser: fbs submit. Labels: input, output.
SECTION sam
EXEC=/d0gstar/batch_node
NUMPROC=1
QUEUE=FastQ
STDOUT=/d0gstar/stdout
STDERR=/d0gstar/stdout

#!/bin/sh
uname -a
date
rsh -l fbsuser triviaal fbs submit ~willem/batch_node.jdf
#!/bin/sh
. /usr/products/etc/setups.sh
setup fbsng
setup sam
SAM_STATION=triviaal
export SAM_STATION
sam run project get_file.py --interactive > log
/usr/bin/rsh -n -l fbsuser triviaal fbs submit /home/willem/batch_node.jdf

SECTION sam
EXEC=/home/willem/batch
NUMPROC=1
QUEUE=IOQ
STDOUT=/home/willem/stdout
STDERR=/home/willem/stdout

SECTION ana
EXEC=/d0gstar/batch_node
NUMPROC=1
QUEUE=FastQ
STDOUT=/d0gstar/stdout
STDERR=/d0gstar/stdout

#!/bin/sh
rcp -pr server:/stage/triviaal/sam_cache/boo /data/test
. /d0/fnal/ups/etc/setups.sh
setup root -q KCC_4_0:exception:opt:thread
setup kailib
root -b -q /d0gstar/test.C

{
  gSystem->cd("/data/test/boo");
  gSystem->Exec("pwd");
  gSystem->Exec("ls -l");
}
get_file.py:
#
# This file sets up and runs a SAM project.
#
import os, sys, string, time, signal
from re import *
from globals import *
import run_project
from commands import *
#########################################
#
# Set the following variables to appropriate values
# Consult database for valid choices
sam_station = "triviaal"
# Consult Database for valid choices
project_definition = "op_moriond_p1014"
# A particular snapshot version, last or new
snapshot_version = 'new'
# Consult database for valid choices
appname = "test"
version = "1"
group = "test"
# The maximum number of files to get from sam
max_file_amt = 5
# for additional debug info use "--verbose"
#verbosity = "--verbose"
verbosity = ""
# Give up on all exceptions
give_up = 1
def file_ready(filename):
    # Replace this python subroutine with whatever
    # you want to do
    # to process the file that was retrieved.
    # This function will only be called in the event of
    # a successful delivery.
    print "File ",filename," has been delivered!"
    # os.system('cp '+filename+' /stage/triviaal/sam')
    return
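The file_ready hook in get_file.py is the place where each delivered file is handed to user code, and the commented-out cp line hints at copying it to a staging area. One possible replacement is sketched below in Python 3 (the script above is Python 2, so it would need adapting). The factory make_file_ready, the staging directory argument and the returned target path are assumptions made for illustration; SAM only requires a callable that takes the file name.

```python
import os
import shutil


def make_file_ready(stage_dir):
    """Build a file_ready callback that copies delivered files to stage_dir."""
    os.makedirs(stage_dir, exist_ok=True)
    delivered = []

    def file_ready(filename):
        # Copy instead of os.system('cp ...') so failures raise an exception.
        target = os.path.join(stage_dir, os.path.basename(filename))
        shutil.copy(filename, target)
        delivered.append(target)
        return target

    # Expose the list of copied files for later bookkeeping (hypothetical).
    file_ready.delivered = delivered
    return file_ready
```

Using shutil.copy rather than a shell cp avoids quoting problems with unusual file names and makes a failed copy visible as an exception.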
Disk partitioning
hoeve
[Directory tree] /d0 /fnal /d0dist /d0usr /mcc /mcc-dist/mc_runjob /curr /ups /db/etc/prd /fbsng
Links:
/fnal -> /d0/fnal
/d0usr -> /fnal/d0usr
/d0dist -> /fnal/d0dist
/usr/products -> /fnal/ups
ana_runjob
Is analogous to mc_runjob
Creates and submits analysis jobs
Input
–get_file.py with SAM project name
 Project defines files to be processed
–analysis script

Integration with grid (1)
At present separate clusters:
–D0, LHCb, Alice, DAS cluster
hoeve and schuur in farm network
Present network layout
[Diagram] Labels: hoeve, schuur, switch, node, router, hefnet, surfnet, ajax, NFS.
New network layout
[Diagram] Labels: farmrouter, switch, D0, LHCb, hefnet, lambda, hoeve, schuur, alice, ajax, NFS, booder.
New network layout
[Diagram] Labels: farmrouter, switch, D0, LHCb, hefnet, lambda, hoeve, schuur, alice, ajax, NFS, booder, das-2.
Server tasks
hoeve
–software server
–farm server
schuur
–fileserver
–sam node
booder
–home directory server
–in backup scheme
Integration with grid (2)
Replace fbs with pbs or condor
–pbs on Alice and LHCb nodes
–condor on das cluster
Use EDG installation tool LCFG
–Install d0 software with rpm
Problem with sam (uses ups/upd)

Integration with grid (3)
Package mcc in rpm
Separate programs from working space
Use cfg commands to steer mc_runjob
Find better place for card files
Input structure now created on node
Grid job

submit:
#!/bin/sh
macro=$1
pwd=`pwd`
cd /opt/fnal/d0/mcc/mcc-dist
. mcc_dist_setup.sh
cd $pwd
dir=/opt/fnal/d0/mcc/mc_runjob/py_script
python $dir/Linker.py script=$macro

PBS job:
willem]$ cat test.pbs
# PBS batch job script
#PBS -o /home/willem/out
#PBS -e /home/willem/err
#PBS -l nodes=1
# Changing to directory as requested by user
cd /home/willem
# Executing job as requested by user
./submit minbias.macro
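The PBS script above has a fixed shape: output and error paths, a node request, a cd into the working directory and the command to run. As an illustration only, a script of that shape can be generated per macro with a few lines of Python. render_pbs is a hypothetical helper, not part of mc_runjob or PBS; the directives it writes are the ones that appear in test.pbs.

```python
def render_pbs(workdir, command, stdout_path, stderr_path, nodes=1):
    """Return the text of a PBS job script shaped like test.pbs."""
    lines = [
        "# PBS batch job script",
        "#PBS -o %s" % stdout_path,
        "#PBS -e %s" % stderr_path,
        "#PBS -l nodes=%d" % nodes,
        "# Changing to directory as requested by user",
        "cd %s" % workdir,
        "# Executing job as requested by user",
        command,
    ]
    return "\n".join(lines) + "\n"
```

For example, render_pbs("/home/willem", "./submit minbias.macro", "/home/willem/out", "/home/willem/err") reproduces the script shown above, and only the command argument changes from one macro to the next.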
RunJob class for grid

class RunJob_farm(RunJob_batch) :
    def __init__(self,name=None) :
        RunJob_batch.__init__(self,name)
        self.myType="runjob_farm"
    def Run(self) :
        self.jobname = self.linker.CurrentJob()
        self.jobnaam = string.splitfields(self.jobname,'/')[-1]
        comm = 'chmod +x ' + self.jobname
        commands.getoutput(comm)
        if self.tdconf['RunOption'] == 'RunInBackground' :
            RunJob_batch.Run(self)
        else :
            bq = self.tdconf['BatchQueue']
            dirn = os.path.dirname(self.jobname)
            print dirn
            comm = 'cd ' + dirn + '; sh ' + self.jobnaam + ' `pwd` >& stdout'
            print comm
            runcommand(comm)
To be decided
Location of minimum bias files
Location of MC output

Job status
Job status is recorded in
–fbs
–/d0/mcc/curr/
–/data/mcc/curr/

SAM servers
On master node:
–station
–fss
On master and worker nodes:
–stager
–bbftp
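The batch and batch_rcp scripts shown earlier record progress as marker files in the job directory: a hidden file named after the node when the job starts, a file named after the node when it ends, an OK file that batch_rcp waits for, and an RCP file once the output has been copied. A sketch of reading a job's state from those markers follows. It assumes node host names start with "node" (as the ls .../node* in batch_rcp suggests) and that OK is written by the check step; job_state and its state labels are hypothetical.

```python
def job_state(marker_names):
    """Derive a coarse job state from the marker files in a job directory."""
    names = set(marker_names)
    if "RCP" in names:
        return "copied to file server"
    if "OK" in names:
        return "checked"
    # Assumption: node host names start with "node", e.g. node-2.
    if any(name.startswith("node") for name in names):
        return "finished on node"
    if any(name.startswith(".node") for name in names):
        return "running on node"
    return "submitted"
```

In practice the argument would be the directory listing of the job directory, for instance os.listdir(job_dir) with a job_dir of the caller's choosing.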