Update on gLite WMS tests


1 Update on gLite WMS tests
Andrea Sciabà
WLCG-OSG-EGEE Operations meeting
September 21, 2006

2 Testing the gLite WMS
RB installed with gLite 3.0.2 + various patches
Dedicated machine at CERN (rb102.cern.ch)
  2 × Xeon 3.0 GHz
  4 GB of RAM
  3 RAID1 partitions for better I/O performance
Closely monitored by GD, FIO and JRA1 people
Tests run by CMS, GD, ATLAS

3 CMS Test description
Application
  Fake analysis jobs (~30 min of CPU time)
  Run on CMS Tier-1s and Tier-2s
Different submission methods
  Network server
  WMProxy
  Bulk submission
Submission from 1-3 UIs in parallel
VOMS proxies
MyProxy renewal on
Deep resubmission off
Shallow resubmission ≤ 3
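The submission settings listed on this slide could be expressed in a collection JDL roughly as follows. This is a minimal sketch, not the JDL actually used in the tests: the helper `make_collection_jdl`, the executable name and the MyProxy host are hypothetical, and the attribute names (`Type`, `Nodes`, `MyProxyServer`, `RetryCount`, `ShallowRetryCount`) are given as commonly documented for the gLite WMS JDL and should be checked against the JDL specification for the installed release.

```python
def make_collection_jdl(executable: str, n_jobs: int, myproxy_server: str) -> str:
    """Render a JDL text for a collection of identical jobs (illustrative only)."""
    # One node per job; a real analysis job would also carry sandboxes,
    # arguments and requirements, which are omitted here.
    node = '    [ Executable = "{}"; ]'.format(executable)
    nodes = ",\n".join([node] * n_jobs)
    return (
        "[\n"
        '  Type = "Collection";\n'
        '  MyProxyServer = "{}";\n'   # proxy renewal on
        "  RetryCount = 0;\n"          # deep resubmission off
        "  ShallowRetryCount = 3;\n"   # shallow resubmission <= 3
        "  Nodes = {{\n{}\n  }};\n"
        "]\n"
    ).format(myproxy_server, nodes)


if __name__ == "__main__":
    # Hypothetical values: 200 jobs per collection, as in the test description.
    print(make_collection_jdl("fake_analysis.sh", 200, "myproxy.example.org"))
```

Shared attributes are placed at the top level of the collection so that they apply to every node, which keeps the per-job entries short.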

4 Latest results (I)
No. of jobs = 3 UI × 33 CEs × 200 jobs/collection = 19800 jobs
~2.5 hours to submit all jobs
  ~0.5 sec/job
  Submission failed for 6 collections
~17 hours to dispatch all jobs
  Equivalent to ~26000 jobs/day
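The figures on this slide can be cross-checked with a few lines of arithmetic. The slide does not say how the ~26000 jobs/day figure was derived; the sketch below assumes it counts only jobs from the collections that were submitted successfully, which reproduces the quoted value approximately. The variable names are illustrative.

```python
# Figures quoted on the slide
n_ui = 3
n_ce = 33
jobs_per_collection = 200
submit_hours = 2.5
dispatch_hours = 17
failed_collections = 6

# Total number of jobs in the test
total_jobs = n_ui * n_ce * jobs_per_collection

# Average submission time per job, in seconds
submit_sec_per_job = submit_hours * 3600 / total_jobs

# Assumption: jobs in the 6 failed collections were never dispatched
dispatched_jobs = total_jobs - failed_collections * jobs_per_collection

# Dispatch rate extrapolated to 24 hours
jobs_per_day = dispatched_jobs / dispatch_hours * 24

print(total_jobs, round(submit_sec_per_job, 2), round(jobs_per_day))
```

Under this assumption the submission time comes out at about 0.45 s/job and the dispatch rate at about 26300 jobs/day; counting all jobs instead would give roughly 28000 jobs/day.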

5 Latest results (II)
Per-site job counts by status
Columns: Site, Submit, Wait, Ready, Sched, Run, Done(S), Done(F), Abo, Clear, Canc
Sites:
  cclcgceli02.in2p3.fr
  ce01-lcg.cr.cnaf.infn.it
  ce01-lcg.projects.cscs.ch
  ce03-lcg.cr.cnaf.infn.it
  ce04-lcg.cr.cnaf.infn.it
  ce04.pic.es
  ce101.cern.ch
  ce102.cern.ch
  ce103.cern.ch
  ce104.cern.ch
  ce105.cern.ch
  ce106.cern.ch
  ceitep.itep.ru
  cmslcgce.fnal.gov
  cmsrm-ce01.roma1.infn.it
  dgc-grid-40.brunel.ac.uk
  egeece.ifca.org.es
  grid-ce1.desy.de
  grid-ce2.desy.de
  grid10.lal.in2p3.fr
  grid109.kfki.hu
  gridba2.ba.infn.it
  gridce.iihe.ac.be
  gridce.pi.infn.it
  gw39.hep.ph.ic.ac.uk
  lcg00125.grid.sinica.edu.tw
  lcg02.ciemat.es
  lcg06.sinp.msu.ru
  lcgce01.gridpp.rl.ac.uk
  lcgce01.jinr.ru
  polgrid1.in2p3.fr
  t2-ce-02.lnl.infn.it

6 Failure reasons
Application errors
Maradona errors
“Got a job held event, reason: "The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE"” errors
  The WMS could not submit the job to a gLite CE
Jobs remaining in Waiting status while Pending events are generated every 5 minutes with error “Mkfifo /tmp/…: File Exists”
Unspecified gridmanager error
  Normally a batch system problem
  Shallow resubmission often recovers, but if the error happens again, the job is aborted (but sometimes appears as Cancelled)
Authentication failed with Belgian CE (CRL expired)
Negligible fractions of other errors
  Could not upload a sandbox file
  Got a job held event, reason: Globus error 124: old job manager is still alive
  Gatekeeper unreachable
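The PeriodicHold expression quoted above is a Condor ClassAd expression. Read literally, it puts a job on hold when it has not been matched and more than 900 seconds (15 minutes) have passed since it was queued. The Python sketch below mirrors that logic for illustration only; the function name and the use of `None` for an undefined `Matched` attribute are assumptions, not part of the WMS or Condor.

```python
from typing import Optional

HOLD_TIMEOUT_S = 900  # from the expression: QDate + 900


def periodic_hold(matched: Optional[bool], current_time: int, qdate: int) -> bool:
    """Python rendering of: Matched =!= TRUE && CurrentTime > QDate + 900.

    `matched` is None when the Matched attribute is undefined. The ClassAd
    operator =!= ("is not identical to") yields TRUE in that case as well,
    which `matched is not True` reproduces.
    """
    not_matched = matched is not True
    return not_matched and current_time > qdate + HOLD_TIMEOUT_S


if __name__ == "__main__":
    # Unmatched job, 901 s after queueing: held
    print(periodic_hold(None, 901, 0))
    # Matched job, long after queueing: not held
    print(periodic_hold(True, 5000, 0))
```

The comparison is strict, so a job is not held at exactly 900 seconds after its queue time.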

7 Efficiency table (I)
CE  Efficiency  Main failure reason
cclcgceli02.in2p3.fr  1
ce01-lcg.cr.cnaf.infn.it  0.61  Application error
ce01-lcg.projects.cscs.ch
ce03-lcg.cr.cnaf.infn.it  0.98
ce04.pic.es
ce101.cern.ch
ce102.cern.ch
ce105.cern.ch
ce106.cern.ch
ceitep.itep.ru
cmslcgce.fnal.gov
cmsrm-ce01.roma1.infn.it
dgc-grid-40.brunel.ac.uk
egeece.ifca.org.es  0.95  Gatekeeper down
grid-ce0.desy.de
grid-ce1.desy.de
grid-ce2.desy.de

8 Efficiency table (II)
CE  Efficiency  Main failure reason
grid10.lal.in2p3.fr  Application error
grid109.kfki.hu  0.94
gridba2.ba.infn.it
gridce.iihe.ac.be  CRL expired
gridce.pi.infn.it  1
gw39.hep.ph.ic.ac.uk
lcg00125.grid.sinica.edu.tw
lcg02.ciemat.es  0.82  Unspecified gridmanager error
lcg06.sinp.msu.ru  0.99  Waiting (mkfifo error)
lcgce01.gridpp.rl.ac.uk
lcgce01.jinr.ru
polgrid1.in2p3.fr
t2-ce-02.lnl.infn.it

9 Conclusions
Very small fraction of failed jobs due to the WMS
  Only those remaining in Waiting status (O(100))
All other failures are due either to the application, to the CE or to authentication problems (expired CRL)
Performance seems to indicate a maximum rate of ~26000 jobs/day
  Measured with “Job Robot” jobs; it may be different for other kinds of jobs
The WMS looks reasonably fine now

