ATLAS DC2 Pile-up Jobs on LCG
ATLAS DC Meeting, February 2005
Pile-up tasks
Jobs defined in 3 tasks:
– 210 dc2.lumi10.A2_z_mumu.task
– 307 dc2.lumi10.A0_top.task
– 308 dc2.lumi10.A3_z_tautau.task
– Input files with min. bias were distributed to selected sites using DQ, 700 GB
– Each job used 8 min. bias input files (~250 MB each), downloaded from the close SE, and 1 signal input file
– 1 GB RAM required per job
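Purely as an illustration of what the stage-in step amounts to (in production this was done by the executor's stage-in phase), a minimal sketch using the lcg-cp client; the LFNs and the local path are hypothetical, only the number of min. bias files per job comes from the slide:

    # Sketch: copy the 8 min. bias input files to the worker node before the
    # pile-up transformation runs.  LFNs below are placeholders, not real names.
    import subprocess

    VO = "atlas"
    minbias_lfns = ["lfn:dc2.minbias._%05d.pool.root" % i for i in range(1, 9)]

    for lfn in minbias_lfns:
        local = "file:/tmp/" + lfn.replace("lfn:", "")
        # lcg-cp resolves the LFN in the replica catalogue and copies a replica,
        # preferably from the close SE, to the local file
        subprocess.run(["lcg-cp", "--vo", VO, lfn, local], check=True)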
5 sites involved
Number of jobs per site:
– golias25.farm.particle.cz:2119/jobmanager-lcgpbs-lcgatlasprod
– lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-infinite
– lcgce01.triumf.ca:2119/jobmanager-lcgpbs-atlas
– lcgce02.ifae.es:2119/jobmanager-lcgpbs-atlas
– t2-ce-01.roma1.infn.it:2119/jobmanager-lcgpbs-infinite
Status

    JOBSTATUS   NJOBS
    failed       3702
    finished     5703
    pending       323
    running        64

(table: per-task DONE / %DONE / ALL)
– Jobs with JOBSTATUS finished but CURRENTSTATE ABORTED are probably initial tests; ENDTIME = 23-SEP-04, 30-SEP-04 and 07-OCT-04
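These counts were taken from the production database; a sketch of the kind of query used, where the table name and connection details are assumptions and only the column names JOBSTATUS/NJOBS appear in the slide:

    # Sketch: count jobs per status in the proddb snapshot on atlassg.
    # Credentials, DSN and the table name JOBEXECUTION are placeholders.
    import cx_Oracle

    conn = cx_Oracle.connect("reader", "secret", "atlassg/proddb")
    cur = conn.cursor()
    cur.execute("""
        SELECT jobstatus, COUNT(*) AS njobs
          FROM jobexecution
         GROUP BY jobstatus
         ORDER BY jobstatus
    """)
    for status, njobs in cur:
        print(status, njobs)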
Why such big differences in efficiency?
– Prague: 48 %, TW: 70 %
– (tables: NJOBS per ATTEMPT for TW and Prague; jobs with Attempt = 1 broken down into All / Good / Failed / Eff % per site)
– Other differences:
  – RB on TW
  – lexor running on a UI on TW
  – many signal files stored on the SE on TW
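The first-attempt efficiency per site can be recomputed from the job records; a sketch of that comparison, assuming a simple (jobid, attempt, site, status) record layout that is not taken from the proddb schema:

    # Sketch: first-attempt efficiency per site from a dump of job attempts.
    from collections import defaultdict

    def first_attempt_efficiency(records):
        """records: iterable of (jobdefinitionid, attempt, site, jobstatus)."""
        good = defaultdict(int)
        total = defaultdict(int)
        for jobid, attempt, site, status in records:
            if attempt != 1:
                continue
            total[site] += 1
            if status == "finished":
                good[site] += 1
        return {site: 100.0 * good[site] / total[site] for site in total}

    # e.g. first_attempt_efficiency(rows) -> {"Prague": 48.0, "TW": 70.0} (illustrative)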
Failures
– Not easy to get the cause of a failure from proddb
– VALIDATIONDIAGNOSTIC is quite difficult to parse by script, e.g.:

    t2-wn-36.roma1.infn.it 1 0m2.360s
    STAGE-IN failed:
    WARNING: No FILE or RFIO access for existing replicas
    WARNING: Replication of sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.simul.A2_z_mumu/dc2.simul.A2_z_mumu._ pool.root.1 to close SE failed:
    Error in replicating PFN sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.simul.A2_z_mumu/dc2.simul.A2_z_mumu._ pool.root.1 to t2-se-01.roma1.infn.it: lcg_aa: File exists
    lcg_aa: File exists
    Giving up after attempting replication TWICE.
    WARNING: Could not stage input file sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.simul.A2_z_mumu/dc2.simul.A2_z_mumu._ pool.root.1:
    Gridftp copy failed from gsiftp://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.simul.A2_z_mumu/dc2.simul.A2_z_mumu._ pool.root.1 to file:/home/atlassgm/globus-tmp.t2-wn/WMS_t2-wn-36_018404_https_3a_2f_2flcg00124.grid.sinica.edu.tw_3a9000_2fKv9HpVIUkMLTBBe-Ia3xLA/dc2.simul.A2_z_mumu._01477.pool.root: the server sent an error response: /castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.simul.A2_z_mumu/dc2.simul.A2_z_mumu._01477.pool.root.1: Invalid argument.
    EDGFileCatalog: level[Always] Disconnected
    No log for stageout phase

– mw failures: Job RetryCount (0) hit
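A rough classifier already helps to group such diagnostics; a sketch whose patterns cover only the failure signatures quoted in these slides (the exact layout of VALIDATIONDIAGNOSTIC is an assumption):

    # Sketch: rough classification of VALIDATIONDIAGNOSTIC strings into failure
    # categories.  The patterns cover only the cases quoted in this talk;
    # anything else falls through to "other".
    import re

    PATTERNS = [
        ("stage-in",      re.compile(r"STAGE-IN failed")),
        ("athena-crash",  re.compile(r"\[SOFTWARE\]AthenaCrash")),
        ("missing-input", re.compile(r"FID is not existing in the catalog")),
        ("mw-retrycount", re.compile(r"Job RetryCount \(0\) hit")),
    ]

    def classify(diagnostic):
        for name, pattern in PATTERNS:
            if pattern.search(diagnostic or ""):
                return name
        return "other"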
Some jobs with many attempts
JOBDEFINITIONID=
– Attempt 1: 09-NOV-04, t2-wn-42.roma1.infn.it 1 0m43.250s

    Transformation error:
    Problem report [Unknown Problem]AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog
    ================================
    Problem report [Unknown Problem]PileUpEventLoopMgr WARNING Original event selector has no events
    ================================
    No log for stageout phase

– ...
– Attempt 11: 15-DEC-04, goliasx76.farm.particle.cz 1 0m41.460s

    Transformation error: identical diagnostic to attempt 1 (FID is not existing in the catalog, original event selector has no events)
    No log for stageout phase
JOBDEFINITIONID=
– Attempt 1: t2-wn-37.roma1.infn.it 1 0m2.830s

    STAGE-IN failed:
    WARNING: No FILE or RFIO access for existing replicas
    WARNING: Replication of srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.simul.A2_z_mumu/dc2.simul.A2_z_mumu._02629.pool.root.6 to close SE failed:
    Error in replicating PFN srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.simul.A2_z_mumu/dc2.simul.A2_z_mumu._02629.pool.root.6 to t2-se-01.roma1.infn.it: lcg_aa: File exists
    lcg_aa: File exists
    Giving up after attempting replication TWICE.
    WARNING: Could not stage input file srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.simul.A2_z_mumu/dc2.simul.A2_z_mumu._02629.pool.root.6: Get TURL failed: lcg_gt: Communication error on send
    EDGFileCatalog: level[Always] Disconnected
    No log for stageout phase

– Attempt 2: lcg00172.grid.sinica.edu.tw 2 0m23.660s

    Transformation error: Problem report [SOFTWARE]AthenaCrash
    ================================
    No log for stageout phase

– ...
– Attempt 9: goliasx44.farm.particle.cz 2 0m23.340s

    Transformation error: Problem report [SOFTWARE]AthenaCrash
    ================================
    No log for stageout phase
JOBDEFINITIONID=
– Attempt 1: t2-wn-48.roma1.infn.it 2 66m58.650s

    Transformation error: Problem report [SOFTWARE]AthenaCrash
    ================================
    No log for stageout phase

– Attempt 2: lcg00144.grid.sinica.edu.tw 2 66m56.800s

    Transformation error: Problem report [SOFTWARE]AthenaCrash
    ================================
    No log for stageout phase

– the same up to attempt 5
– Attempt 6: mw failure
– Attempt 7: goliasx60.farm.particle.cz 0 152m53.780s ???
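Attempt histories like these are the argument for studying the logs before raising MAXATTEMPT; a sketch of grouping attempts per job definition, with an assumed record layout and categories such as those produced by the classifier sketch earlier:

    # Sketch: find job definitions where every attempt failed the same way;
    # simply retrying such jobs (raising MAXATTEMPT) will not help.
    from collections import defaultdict

    def stuck_jobs(attempts, min_attempts=3):
        """attempts: iterable of (jobdefinitionid, attempt, category) tuples,
        where category is e.g. the output of classify() above."""
        by_job = defaultdict(list)
        for jobid, _attempt, category in attempts:
            by_job[jobid].append(category)
        return [jobid for jobid, cats in by_job.items()
                if len(cats) >= min_attempts and len(set(cats)) == 1]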
Job properties
– no exact relation between a job in the Oracle DB and an entry in the PBS log file; STARTTIME and ENDTIME are just hints
– Some jobs on golias:
  – 1232 finished jobs in December registered in proddb
  – 1299 jobs selected from the PBS logs for December, with cuts on CPU time and virtual memory values
– Nodes: 3.06 GHz Xeon, 2 GB RAM
– Histograms based on information from the PBS log files
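Without a job name in the batch system, the association can only be approximate; a sketch of the end-time window matching implied here, where the field layouts of both sources are assumptions:

    # Sketch: approximate matching of proddb jobs to PBS accounting records by
    # end-time proximity, since there is no shared job identifier.
    from datetime import timedelta

    def match_by_endtime(proddb_jobs, pbs_records, window_minutes=30):
        """proddb_jobs: [(jobid, endtime)], pbs_records: [(pbsid, endtime, cput, vmem)]."""
        window = timedelta(minutes=window_minutes)
        matches = []
        for jobid, end in proddb_jobs:
            candidates = [r for r in pbs_records if abs(r[1] - end) <= window]
            matches.append((jobid, candidates))  # may be 0 or several candidates
        return matches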
– some jobs (6) successfully ran on a machine with only 1 GB RAM, but the wall time was 20 h – probably a lot of swapping
WN -> SE -> NFS server
– the WN has the same NFS mount – could it be used directly?
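One way to test this would be to prefer the local mount whenever the file is visible there and fall back to a grid copy otherwise; a sketch only, with a hypothetical mount point:

    # Sketch: read the file straight from the NFS-mounted storage area when it
    # is visible on the WN, otherwise fall back to a grid copy with lcg-cp.
    import os
    import shutil
    import subprocess

    NFS_BASE = "/storage/atlas"   # hypothetical mount point shared by SE and WN

    def fetch(relative_path, surl, dest):
        nfs_path = os.path.join(NFS_BASE, relative_path)
        if os.path.exists(nfs_path):
            shutil.copy(nfs_path, dest)   # direct read over NFS, no gridftp involved
        else:
            subprocess.run(["lcg-cp", "--vo", "atlas", surl, "file:" + dest], check=True)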
Conclusions
– no job name in the local batch system – difficult to identify jobs
– the version of the lexor executor should be recorded in proddb
– proddb: very slow response; these queries were done on atlassg (which has a snapshot of proddb from Feb 8)
– a study of the log files should be done before increasing MAXATTEMPT
– proddb should be cleaned