1 Charles Maguire et (beaucoup d’) al.
Operational Scripts for Reconstructing Run7 Min Bias PRDFs at Vanderbilt's ACCRE Farm
Charles Maguire et (beaucoup d') al.
(web site updated last week for main control script documentation, cautoRun7)
May 30, 2007 Local Group Meeting

2 Overview of the Five Script Functions
- FDT transfer of PRDFs from firebird to ACCRE
  - all firebird PRDF scripts are in the /mnt/eon0/bnl/control directory (or eon1)
  - all ACCRE PRDF scripts are in the /home/phnxreco/prdfTransfering directory
- Database updating (new as of May 30)
- Submission of PBS jobs for reconstructing PRDFs
- FDT/gridFTP transfer of outputs to RCF
  - all firebird nanoDST scripts are in the /mnt/eon0/bnl/control directory (or eon1)
  - all ACCRE nanoDST scripts are in /home/phnxreco/nanoDstTransfering
- Scripts for monitoring the above functions
- All scripts are committed in CVS: 113 entries (4 added since last week for gridFTP monitoring)

3 FDT Transfer of PRDFs to ACCRE
- Computer nodes involved
  - vpac15 does hourly checks of the firebird eon0 and eon1 disks (postgres account)
  - firebird acts as the FDT server node (bnl account)
  - vmps01 acts as the FDT client (phnxreco account)
- CRON scripts on vpac15 (vpac15 chosen because vupac nodes can send e-mails)
  - 15 * * * * /var/lib/pgsql/monitoring/inputFireBirdOccupancyEon0.pl >& /var/lib/pgsql/monitoring/inputFireBirdOccupancyPerlEon0.log &
    - checks the percent occupancy of the eon0 disk every hour
    - a corresponding cron job at 20 minutes after the hour checks eon1
  - 25 * * * * /var/lib/pgsql/monitoring/inputFireBirdNewFilesEon0.csh >& /var/lib/pgsql/monitoring/inputFireBirdNewFilesEon0.log &
    - checks whether there are new PRDF files to be transferred from eon0
    - the script calls /var/lib/pgsql/monitoring/inputFireBirdNewFilesEon0.pl on vpac15
    - a corresponding cron job at 30 minutes after the hour checks eon1
  - all four of these scripts issue e-mails reporting what they have found
  - exact details of the scripts will be posted on the WWW, just as for cautoRun7
- Scripts on the firebird node (a sketch of the occupancy check follows this list)
  - inputFireBirdOccupancyEon0.csh, called by /var/lib/pgsql/monitoring/inputFireBirdOccupancyEon0.pl
    - returns the % occupancy of the eon0 buffer disk, used by the perl script to send an e-mail to me
  - inputStatusCheckEon0.csh, called by the inputFireBirdNewFilesEon0.pl script on vpac15
    - this script calls the inputStatusCheckFilesEon0.pl script in the /mnt/eon0/bnl/control directory
    - inputStatusCheckEon0.pl determines if there are PRDFs to be transferred
    - if there are PRDFs to be transferred, an fdtPRDFServerEon0.csh script starts on firebird, and an fdtPRDFClientEon0.csh client starts on vmps01 at ACCRE
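A minimal sketch of the occupancy check, assuming it simply reports the df figure for the eon0 buffer disk; the real inputFireBirdOccupancyEon0.csh lives in /mnt/eon0/bnl/control and may differ in detail.

    #!/bin/csh
    # Sketch of a disk-occupancy check in the spirit of inputFireBirdOccupancyEon0.csh.
    # Print the percent occupancy of the eon0 buffer disk; the perl wrapper on
    # vpac15 then e-mails this figure.
    df -P /mnt/eon0 | awk 'NR==2 {print $5}'

The eon1 variant would differ only in the mount point it inspects.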

4 FDT Transfer of PRDFs to ACCRE
- Computer nodes involved (repeated from previous slide)
  - vpac15 does hourly checks of the firebird eon0 and eon1 disks (postgres account)
  - firebird acts as the FDT server node (bnl account)
  - vmps01 acts as the FDT client (phnxreco account)
- Scripts on the firebird node (repeated from previous slide)
  - inputFireBirdOccupancyEon0.csh, called by /var/lib/pgsql/monitoring/inputFireBirdOccupancyEon0.pl
    - returns the % occupancy of the eon0 buffer disk; the perl script sends an e-mail to me
  - inputStatusCheckEon0.csh, called by the inputFireBirdNewFilesEon0.pl script on vpac15
    - this script calls the inputStatusCheckFilesEon0.pl script in the /mnt/eon0/bnl/control directory
    - inputStatusCheckEon0.pl determines if there are PRDFs to be transferred
    - if there are PRDFs to be transferred, an fdtPrdfServerEon0.csh script starts on firebird, and this script calls the fdtStartPrdfClientEon0.csh script with a parameter on vmps01 at ACCRE
- Scripts on the vmps01 node (see the sketch below)
  - /home/phnxreco/prdfTransfering/fdtPrdfClient.csh actually copies the PRDF files to the /gpfs3 area
    - this script is started by the fdtStartPrdfClientEon0.csh script after a 15 second delay
  - after fdtPrdfClientEon0.csh finishes copying, it exits by calling the confirmTransferAndThenEraseEon0.pl script
    - confirmTransferAndThenEraseEon0.pl verifies that all the PRDF files have been copied correctly to the /gpfs3 area; if so, the PRDF files are deleted from the eon0 area on firebird
    - the script calls the /mnt/eon0/bnl/control/haveBeenTransferredList.ch script on the firebird node to get a list of the files which were supposed to have been transferred
    - the haveBeenTransferredEraseEon0.csh script is then called to do the actual file deletion on firebird
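The confirm-then-erase step is the safety interlock of the transfer chain, so a sketch of the idea may help. This is not the actual confirmTransferAndThenEraseEon0.pl (a perl script in CVS); the existence check stands in for whatever verification the real script performs, and the use of ssh to reach the bnl account on firebird is an assumption.

    #!/bin/csh
    # Sketch only: confirm that every file firebird reports as sent now exists
    # under /gpfs3, and only then let firebird erase its copies.
    set dest = /gpfs3/RUN7PRDF/auauMinBias200GeV
    set nmissing = 0
    # haveBeenTransferredList.ch on firebird is expected to print one file name per line
    foreach f ( `ssh bnl@firebird /mnt/eon0/bnl/control/haveBeenTransferredList.ch` )
        if ( ! -e $dest/$f:t ) then
            echo "MISSING: $f"
            @ nmissing++
        endif
    end
    if ( $nmissing == 0 ) then
        # every file arrived, so the PRDFs may be deleted from eon0 on firebird
        ssh bnl@firebird /mnt/eon0/bnl/control/haveBeenTransferredEraseEon0.csh
    endif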

5 FDT Transfer of PRDFs to ACCRE: Inventory and Location of PRDF Files at ACCRE
- Computer nodes involved (repeated from previous slides)
  - vpac15 does hourly checks of the firebird eon0 and eon1 disks (postgres account)
  - firebird acts as the FDT server node (bnl account)
  - vmps01 acts as the FDT client (phnxreco account)
- Three locations at ACCRE for files copied from firebird (important fact)
  - 17 TBytes at /blue/phenix/RUN7PRDF/auauMinBias200GeV (ITS Blue-Arc platform)
    - now 95% full, no more to be added
    - the /blue/phenix area is the current input file source area for the cautoRun7 scripts
  - 20 TBytes at /gpfs3/RUN7PRDF/auauMinBias200GeV (ACCRE owned disks)
    - now 70% full, current destination area firebird -> ACCRE
    - the /gpfs3/RUN7PRDF area is also the top directory for the output files
  - 8 TBytes at /gpfs2/RUN7PRDF/auauMinBias200GeV (ACCRE owned disks)
    - now 80% full, the original destination area firebird -> ACCRE before /gpfs3
    - all the RUN7 PRDF files on /gpfs2 have been copied to /blue/phenix
    - the /gpfs2 area is used for simulation project output as well
- Major action item to be done (see the sketch below)
  - we do not yet monitor the /gpfs3 occupancy
  - at some point we will have to delete PRDFs to make room for new PRDFs, or ask for more disk space
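Since the /gpfs3 occupancy is not yet monitored, here is a hypothetical sketch of what such a check could look like, modeled on the hourly eon0/eon1 checks; the script name, the 90% threshold, and the mail recipient are all placeholders, not existing pieces of the system.

    #!/bin/csh
    # checkGpfs3Occupancy.csh -- placeholder name, not yet written or in CVS.
    # Intended to run hourly from cron, like the vpac15 occupancy checks.
    set pct = `df -P /gpfs3 | awk 'NR==2 {print $5}' | tr -d '%'`
    if ( $pct > 90 ) then
        echo "/gpfs3 is ${pct}% full on `date`" | mail -s "gpfs3 occupancy warning" phnxreco
    endif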

6 Database Updating
- Sequence of database updating
  - a cron job on rftpexp01 (maguire account) runs at 00:05 EDT
    - uses gridFTP to deliver 3 restore files to firebird /mnt/eon0/rhic/databaseUpdating
    - a second cron job at 01:05 confirms that the transfers were successful
  - a cron job on firebird (rhic account) acts as the FDT server process to vpac04 at 11:35 CDT
  - a cron job on vpac04 (postgres account) acts as the FDT client process 30 seconds later (see the signal-file sketch below)
    - the client job first removes any restoreFilesAlreadyUsed signal file from /rhic2/pgsql/dbRun7
    - the client job makes a /rhic2/pgsql/dbRun7/restoreFilesTransferInProgress signal file
    - FDT copies the 3 restore files to /rhic2/pgsql/dbRun7
    - after the copy is completed a restoreFilesTransferCompleted signal file is created
    - the client job deletes the /rhic2/pgsql/dbRun7/restoreFilesTransferInProgress signal file
  - a vpac15 cron job runs startRestoreFromDumpsABC.csh every hour (except 6 - midnight)
    - startRestoreFromDumpsABC checks to see if it is OK to start a restore job
    - if it is OK to start a restore job, then the RestoreFromDumpsABC.csh script is run
- Operation of the RestoreFromDumpsABC.csh script
  - checks which cycle among A, B, C is to be updated (e.g., daq_A or daq_B or daq_C)
  - after the update is complete, a new .odbc.ini is copied to the phnxreco account on ACCRE
  - similarly, new versions of checkcalib and checkrun are also copied for cautoRun7
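The signal files are what let the different cron jobs coordinate without talking to each other directly. Below is a minimal sketch of the vpac04 client-side handshake described above, assuming the signal files are simple empty marker files created and removed with touch and rm; the real client job is part of the FDT cron machinery and may differ.

    #!/bin/csh
    # Sketch of the vpac04 client-side signal-file handshake (names from this slide).
    set db = /rhic2/pgsql/dbRun7
    rm -f  $db/restoreFilesAlreadyUsed           # clear the "already used" marker
    touch  $db/restoreFilesTransferInProgress    # announce that a transfer has started
    # ... FDT copies the 3 restore files into $db at this point ...
    touch  $db/restoreFilesTransferCompleted     # announce that the copy finished
    rm -f  $db/restoreFilesTransferInProgress    # the transfer is no longer in progress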

7 Submission of PBS Jobs to do Reconstruction
- Submission of PBS jobs for reconstructing PRDFs
  - controlled by the cautoRun7 master script, e.g. submit 200 jobs (see the WWW documentation)
  - the script is launched every 30 minutes, at 5 and 35 minutes after the hour, by a phnxreco cron job on vmps18 (nothing important about vmps18)
- General outline of the operations for cautoRun7 (a schematic sketch follows this list)
  - check if it is OK to launch 200 new jobs: will not run if jobs are already running
  - will not run if the output from the previous cycle is not yet at RCF
  - checks that the database is accessible
  - harvests the current production output into a "newstore" area
  - checks which jobs succeeded and which failed, removes temporary work areas
  - makes a list of run numbers for the next production cycle (complex logic)
  - submits a new set of 200 jobs (the number 200 is set in the submit.pl script)
  - after all new jobs are submitted, the transfer process to RCF is begun
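A schematic sketch of the guard checks at the top of a cautoRun7-style cycle; the real logic is in CVS and documented on the WWW, so the commands and file names below are illustrative except where this talk names them (qstat/showq for PBS, the gridFtpInProgress signal file, submit.pl).

    #!/bin/csh
    # Schematic only -- mirrors the outline above, details are guesses.
    # 1. do not start if jobs from the previous cycle are still queued or running
    set njobs = `qstat | grep -c phnxreco`
    if ( $njobs > 0 ) exit
    # 2. do not start if the previous cycle's output has not yet reached RCF
    #    (gridFtpInProgress is one of the signal files discussed later in this talk)
    if ( -e /home/phnxreco/nanoDstTransfering/gridFtpInProgress ) exit
    # 3. otherwise: check database access, harvest output into "newstore",
    #    classify successes and failures, build the next run-number list, and
    #    submit 200 new jobs (the count is set in submit.pl)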

8 Submission of PBS Jobs to do Reconstruction
- Submission of PBS jobs for reconstructing PRDFs (previous slide)
  - controlled by the cautoRun7 master script, e.g. submit 200 jobs
- Scripts for manually monitoring the reconstruction jobs (last week)
  - look in the .cshrc file of the /home/phnxreco account for the definitions (reconstructed below)
  - special alias commands (always uppercase letters)
    - STATPBS (means qstat | grep phnxreco): lists queued jobs (running and waiting)
    - SHOWPBS (means showq | grep phnxreco): lists queued jobs in a different format
    - WAITING (means /home/phnxreco/prdfTransfering/checkPBSWaiting.csh)
      - shows jobs which are actually running and those which are waiting
      - jobs can be waiting as idle or deferred (lowered priority, complex scheduling priorities)
    - RUNNINGPBS (means showq | grep phnxreco | grep -c Running ; showq | grep phnxreco | grep -c phnxreco)
      - produces three lines of output: completed, running, and total jobs in the queue
    - JOBSPBS (means perl -w /gpfs3/RUN7PRDF/prod/run7/jobStatisticsPBS.pl)
      - produces a detailed summary of the last major job submission
    - SCANDBFAIL (means perl -w $CVSRUN7/scanForDBFailures.pl)
      - used immediately after a 200 job submission
      - determines which jobs had initial DB access failures (this problem should be fixed now)
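For reference, the same aliases written out as they might appear in the phnxreco .cshrc; the commands are copied from the definitions above, but the exact quoting and spacing in the real file may differ.

    # Monitoring aliases as they might appear in /home/phnxreco/.cshrc
    alias STATPBS    'qstat | grep phnxreco'
    alias SHOWPBS    'showq | grep phnxreco'
    alias WAITING    '/home/phnxreco/prdfTransfering/checkPBSWaiting.csh'
    alias RUNNINGPBS 'showq | grep phnxreco | grep -c Running ; showq | grep phnxreco | grep -c phnxreco'
    alias JOBSPBS    'perl -w /gpfs3/RUN7PRDF/prod/run7/jobStatisticsPBS.pl'
    alias SCANDBFAIL 'perl -w $CVSRUN7/scanForDBFailures.pl'

The single quotes defer expansion of $CVSRUN7 until the alias is used.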

9 FDT/gridFTP Transfer of Output to RCF
- FDT/gridFTP transfer of outputs to RCF
  - the transfer of the output to RCF proceeds in two stages
    - FDT transfer from ACCRE to the firebird disks (eon0 or eon1)
    - gridFTP transfer from firebird to RCF (using the maguire gridFTP certificate)
  - the process is started at the end of the cautoRun7 script
  - the gridFTP transfer to RCF of all the output files must succeed before any new production jobs are submitted by the next cautoRun7 cycle
- FDT transfer ACCRE -> firebird
  - the vmps02 node is used as the server, the firebird node is used as the client
  - the FDT can go to either eon0 or eon1, whichever is less busy/full
  - ~770 GBytes of output for 200 jobs, FDT at 45 MBytes/second ===> ~5 hours
- gridFTP transfer firebird -> RCF
  - slower transfer rate of ~20 MBytes/second ===> ~11 hours ===> ~16 hours total transfer
  - 16 hours is well matched to the ~20 hour cycle of the compute jobs
  - a maguire cron job on rftpexp01 monitors for the successful transfer of all files
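A quick back-of-the-envelope check of the transfer-time estimates quoted above (taking 1 GByte as 1000 MBytes):

    # 770 GBytes of output at the two measured rates
    echo "FDT     45 MBytes/s:" `echo "770*1000/45/3600" | bc -l` "hours"   # ~4.8, quoted as ~5
    echo "gridFTP 20 MBytes/s:" `echo "770*1000/20/3600" | bc -l` "hours"   # ~10.7, quoted as ~11

The sum of the two stages gives the ~16 hour total, which is what makes the comparison with the ~20 hour compute cycle meaningful.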

10 Issues for gridFTP Transfer of Output to RCF
- FDT/gridFTP transfer of outputs to RCF (previous slide)
  - the transfer of the output to RCF proceeds in two stages: FDT and gridFTP
  - the gridFTP transfer to RCF of all the output files must succeed before any new production jobs are submitted by the next cautoRun7 cycle
- gridFTP transfer firebird -> RCF (previous slide)
  - slower transfer rate of ~20 MBytes/second ===> ~11 hours ===> ~16 hours total transfer
  - 16 hours is well matched to the ~20 hour cycle of the compute jobs
  - a maguire cron job on rftpexp01 monitors for the successful transfer of all files
- Unsettled issues for gridFTP transfer to RCF (see the sketch below)
  - no fixed, large-volume disk area is available at RCF for this Run7 reco output
  - the output and monitoring scripts have to be manually edited for each new RCF disk area
  - we need to automate the process of selecting an available disk area at RCF
  - a disastrous slowdown (~1 MByte/second) to data59 was luckily caught on Saturday morning
    - the slowdown was not present for other RCF disks, so we were able to switch and recover the 20 MBytes/second rate
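A hypothetical sketch of what automatic selection of an RCF destination disk could look like; nothing like this exists yet, and the candidate disk names (other than data59, mentioned above), the /phenix path prefix, the free-space threshold, and the idea of probing with df over ssh are all assumptions.

    #!/bin/csh
    # Hypothetical disk-selection sketch (to be written).
    # Walk a list of candidate RCF disk areas and pick the first with enough free space.
    set candidates = ( data59 data60 data61 )   # placeholder names
    set neededGB = 800                          # roughly one cycle of output (~770 GBytes)
    foreach d ( $candidates )
        set freeGB = `ssh rftpexp01 df -P /phenix/$d | awk 'NR==2 {print int($4/1048576)}'`
        if ( $freeGB > $neededGB ) then
            echo "use /phenix/$d for the next cycle ($freeGB GBytes free)"
            break
        endif
    end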

11 Scripts for Monitoring Other Scripts (to be written)
- Need scripts to check if the signal files have become too old (a sketch of one such check follows)
  - fdtInProgress in either the eon0 or eon1 control area (part of PRDF transferring)
  - fdtInProgress on /home/phnxreco/nanoDstTransfering (< 8 hours)
  - gridFtpInProgress on /home/phnxreco/nanoDstTransfering (< 15 hours)
  - gridFtpInProgress in the eon0 or eon1 control areas (< 15 hours)
  - cautoRun7InProgress on /gpfs3/RUN7PRDF/prod/run7 (< 1 hour)
- Need a script to check if PBS jobs have crashed
  - if the crash was early, possibly resubmit
  - identify the node where the job crashed, and notify the ACCRE people
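A minimal sketch of one of the still-to-be-written age checks, using the 15 hour limit for gridFtpInProgress as the example; the use of find -mmin (GNU find) and the mail recipient are assumptions.

    #!/bin/csh
    # Sketch of a "signal file too old" check (these monitors are still to be written).
    set sig = /home/phnxreco/nanoDstTransfering/gridFtpInProgress
    if ( -e $sig ) then
        # -mmin +900 matches the file only if it is older than 900 minutes (15 hours)
        set stale = `find $sig -mmin +900`
        if ( "$stale" != "" ) then
            echo "$sig is more than 15 hours old" | mail -s "stale gridFTP signal file" phnxreco
        endif
    endif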

12 Major Action Items to be Done by VU Crew
- Write adaptive software for knowing which RCF disk to use for output
  - catalog output locations at RCF into the FROG database (Irina)
  - also catalog locally what we have already done
- Prepare to switch to /gpfs3 as the new source input area
  - /gpfs3 is 65% full now with PRDFs and reconstructed output
  - we should delete 90% of the reconstructed output from /gpfs3 and save 10%
  - must be careful to maintain empty files for the makelist script used by cautoRun7
  - must write a new "safeDelete" script for this purpose (a sketch follows this list)
- Delete already-reconstructed PRDFs from /blue/phenix
  - can we write these PRDFs to tape/backup (ITS contact)? How much would that cost?
  - use /blue/phenix as the destination area for new PRDFs
- Develop a WWW site which provides a snapshot of the project status
  - disk space used on eon0, eon1, /gpfs3, /blue/phenix
  - date and size of the last PRDF transfer from 1008
  - number of reco jobs already done, status of current jobs, gridFTP transfer rate
  - disk situation at RCF, current output destination, next output destination
  - the WWW site will be looked at by SA2 to determine if there is a problem
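A hypothetical sketch of the "safeDelete" idea: reclaim the space used by a reconstructed output file while leaving a zero-length file of the same name behind, so the makelist bookkeeping used by cautoRun7 still sees the expected file names. The script does not exist yet; its name and argument handling are placeholders.

    #!/bin/csh
    # safeDelete.csh -- hypothetical, still to be written.
    # For each file given on the command line, remove the data but keep an
    # empty placeholder with the same name for the makelist script.
    foreach f ( $argv )
        if ( -f $f ) then
            rm -f $f    # delete the data
            touch $f    # recreate the name as a zero-length placeholder
        endif
    end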

