Update on gLite WMS tests


Update on gLite WMS tests
Andrea Sciabà
WLCG-OSG-EGEE Operations meeting
September 21, 2006

Testing the gLite WMS
- RB installed with gLite 3.0.2 + various patches
- Dedicated machine at CERN (rb102.cern.ch)
  - 2 × Xeon 3.0 GHz
  - 4 GB of RAM
  - 3 RAID1 partitions for better I/O performance
- Closely monitored by GD, FIO and JRA1 people
- Tests run by CMS, GD, ATLAS

CMS test description
- Application
  - Fake analysis jobs (~30 min of CPU time)
  - Run on CMS Tier-1s and Tier-2s
- Different submission methods
  - Network server
  - WMProxy
  - Bulk submission
- Submission from 1-3 UIs in parallel
- VOMS proxies
- MyProxy renewal on
- Deep resubmission off
- Shallow resubmission ≤ 3
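The resubmission settings listed above correspond, as far as the gLite JDL is concerned, to the attributes RetryCount (deep resubmission), ShallowRetryCount (shallow resubmission) and MyProxyServer (proxy renewal). The following is only a minimal sketch of how such a job description could be generated. The helper name make_jdl, the executable path and the MyProxy host name are hypothetical and do not come from the slides, and a real CMS test job would need further attributes (sandboxes, requirements) that are not shown.

```python
def make_jdl(executable, shallow_retries=3, deep_retries=0, myproxy_server=None):
    """Build JDL text for a single job (illustrative helper, not from the slides)."""
    attrs = [
        ("Executable", f'"{executable}"'),
        # Deep resubmission: 0 switches it off, as in the test setup.
        ("RetryCount", str(deep_retries)),
        # Shallow resubmission: at most 3 attempts, as in the test setup.
        ("ShallowRetryCount", str(shallow_retries)),
    ]
    if myproxy_server is not None:
        # Server used for proxy renewal; the host name passed in is an assumption.
        attrs.append(("MyProxyServer", f'"{myproxy_server}"'))
    body = "\n".join(f"  {name} = {value};" for name, value in attrs)
    return "[\n" + body + "\n]"


# Hypothetical executable and MyProxy host, for illustration only.
print(make_jdl("/bin/hostname", myproxy_server="myproxy.example.org"))
```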

Latest results (I)
- No. of jobs = 3 UIs × 33 CEs × 200 jobs/collection → 20000 jobs
- ~2.5 hours to submit all jobs
  - ~0.5 sec/job
  - Submission failed for 6 collections
- ~17 hours to dispatch all jobs
  - Equivalent to ~26000 jobs/day
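The rates quoted on this slide can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures from the slide (3 UIs, 33 CEs, 200 jobs per collection, ~2.5 hours to submit, ~17 hours to dispatch). With these inputs the exact job count is 19800, the submission time is about 0.45 s/job, and the dispatch rate is about 28000 jobs/day. That is the same order as the ~0.5 s/job and ~26000 jobs/day quoted; the remaining difference presumably comes from the rounding of the durations, which is an assumption and not something the slides state.

```python
def total_jobs(uis, ces, jobs_per_collection):
    """Number of submitted jobs: one collection per (UI, CE) pair."""
    return uis * ces * jobs_per_collection


def seconds_per_job(hours, jobs):
    """Average wall-clock time per job over a period of the given length."""
    return hours * 3600 / jobs


def jobs_per_day(jobs, hours):
    """Rate extrapolated to 24 hours."""
    return jobs / hours * 24


n = total_jobs(uis=3, ces=33, jobs_per_collection=200)
print(n)                                 # exact count behind the "20000 jobs"
print(round(seconds_per_job(2.5, n), 2)) # submission time per job, in seconds
print(round(jobs_per_day(n, 17)))        # dispatch rate extrapolated to one day
```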

Latest results (II)

Site                         Submit  Wait  Ready  Sched  Run  Done(S)  Done(F)  Abo  Clear  Canc
cclcgceli02.in2p3.fr              0     0      0      0    0      200        0    0      0     0
ce01-lcg.cr.cnaf.infn.it          0     0      0      2  122        0        0   76      0     0
ce01-lcg.projects.cscs.ch         0     0      0    195    5        0        0    0      0     0
ce03-lcg.cr.cnaf.infn.it          0     0      0    200    0        0        0    0      0     0
ce04-lcg.cr.cnaf.infn.it          0    10      0      0    0        0       23    0      0   167
ce04.pic.es                       0     0      0      0    0      200        0    0      0     0
ce101.cern.ch                     0     0      0      0    0        0        0  200      0     0
ce102.cern.ch                     0     0      0      0    0        0        0  200      0     0
ce103.cern.ch                     0     9      0      0    0        0        1   16      0   174
ce104.cern.ch                     0    10      0      0    0        0       66   28      0    96
ce105.cern.ch                     0     0      0      0    0        0        0  200      0     0
ce106.cern.ch                     0     0      0      0    0        0        0  200      0     0
ceitep.itep.ru                    0     0      0    150    3       47        0    0      0     0
cmslcgce.fnal.gov                 0     0      0      0    0      200        0    0      0     0
cmsrm-ce01.roma1.infn.it          0     0      0    200    0        0        0    0      0     0
dgc-grid-40.brunel.ac.uk          0     0      0      0    0        0        0  200      0     0
egeece.ifca.org.es                0     0      0      0    0      190       10    0      0     0
grid-ce1.desy.de                  0     0      0      1    0      199        0    0      0     0
grid-ce2.desy.de                  0     0      0    200    0        0        0    0      0     0
grid10.lal.in2p3.fr               0     0      0      0    0        0        0  200      0     0
grid109.kfki.hu                   0     0      0      0    0      189        0   11      0     0
gridba2.ba.infn.it                0     0      0      0    1        0        0  199      0     0
gridce.iihe.ac.be                 0     9      0      0    0        0        3   15      0   173
gridce.pi.infn.it                 0     0      0    180   20        0        0    0      0     0
gw39.hep.ph.ic.ac.uk              0     0      0     86   11      103        0    0      0     0
lcg00125.grid.sinica.edu.tw       0     0      0    200    0        0        0    0      0     0
lcg02.ciemat.es                   0    10      0     12    2      150        2    0      0    24
lcg06.sinp.msu.ru                 0     1      0     34   11      154        0    0      0     0
lcgce01.gridpp.rl.ac.uk           0    10      0      0    0        0      158    0      0    32
lcgce01.jinr.ru                   0     1      0    199    0        0        0    0      0     0
polgrid1.in2p3.fr                 0     0      0      0    0        0        3  197      0     0
t2-ce-02.lnl.infn.it              0     0      0      0    0      200        0    0      0     0

Failure reasons
- Application errors
- Maradona errors
- “Got a job held event, reason: "The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE"” errors
  - The WMS could not submit the job to a gLite CE
- Jobs remaining in Waiting status while Pending events are generated every 5 minutes with error "Mkfifo /tmp/…: File Exists"
- Unspecified gridmanager error
  - Normally a batch system problem
  - Shallow resubmission often recovers, but if the error happens again, the job is aborted (but sometimes appears as Cancelled)
- Authentication failed with Belgian CE (CRL expired)
- Negligible fractions of other errors
  - Could not upload a sandbox file
  - Got a job held event, reason: Globus error 124: old job manager is still alive
  - Gatekeeper unreachable
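The PeriodicHold expression quoted in the error message is written in the Condor ClassAd language, where `=!=` means "is not identical to" and treats an undefined attribute as an ordinary value instead of propagating it. Read that way, the expression puts a job on hold when it has still not been matched more than 900 seconds (15 minutes) after it was queued. The sketch below restates that logic in Python. Using None for an undefined Matched attribute and plain integers for the timestamps are modelling assumptions made for illustration.

```python
from typing import Optional

HOLD_TIMEOUT_S = 900  # the constant from the quoted expression


def periodic_hold(matched: Optional[bool], current_time: int, qdate: int) -> bool:
    """Python reading of: Matched =!= TRUE && CurrentTime > QDate + 900.

    matched is None when the attribute is undefined (modelling assumption);
    '=!= TRUE' is then true for both undefined and False.
    """
    not_identical_to_true = matched is not True
    return not_identical_to_true and current_time > qdate + HOLD_TIMEOUT_S


# A job queued at t=0 that is still unmatched at t=1000 s would be held.
print(periodic_hold(None, current_time=1000, qdate=0))
# A matched job is never held by this expression.
print(periodic_hold(True, current_time=1000, qdate=0))
```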

Efficiency table (I)

CE                           Efficiency  Main failure reason
cclcgceli02.in2p3.fr         1
ce01-lcg.cr.cnaf.infn.it     0.61        Application error
ce01-lcg.projects.cscs.ch
ce03-lcg.cr.cnaf.infn.it     0.98
ce04.pic.es
ce101.cern.ch
ce102.cern.ch
ce105.cern.ch
ce106.cern.ch
ceitep.itep.ru
cmslcgce.fnal.gov
cmsrm-ce01.roma1.infn.it
dgc-grid-40.brunel.ac.uk
egeece.ifca.org.es           0.95        Gatekeeper down
grid-ce0.desy.de
grid-ce1.desy.de
grid-ce2.desy.de

Efficiency table (II)

CE                           Efficiency  Main failure reason
grid10.lal.in2p3.fr                      Application error
grid109.kfki.hu              0.94
gridba2.ba.infn.it
gridce.iihe.ac.be                        CRL expired
gridce.pi.infn.it            1
gw39.hep.ph.ic.ac.uk
lcg00125.grid.sinica.edu.tw
lcg02.ciemat.es              0.82        Unspecified gridmanager error
lcg06.sinp.msu.ru            0.99        Waiting (mkfifo error)
lcgce01.gridpp.rl.ac.uk
lcgce01.jinr.ru
polgrid1.in2p3.fr
t2-ce-02.lnl.infn.it
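The slides do not say how the efficiency is computed. One plausible reading, given here purely as an assumption, is the fraction of jobs per CE that are Scheduled, Running or Done(S) in the status table of "Latest results (II)". The sketch below applies that guess to two rows copied from that table. For lcg02.ciemat.es it gives 164/200 = 0.82 and for egeece.ifca.org.es 190/200 = 0.95, which agree with the values listed above. Not every listed value is reproduced by this formula, so it should be treated as an illustration and not as the author's definition.

```python
COLUMNS = ("Submit", "Wait", "Ready", "Sched", "Run",
           "Done(S)", "Done(F)", "Abo", "Clear", "Canc")

# Assumption: these states count as "not failed"; the slides do not define this.
OK_STATES = ("Sched", "Run", "Done(S)")


def parse_row(line):
    """Split one flattened table row into (site, {status: count})."""
    site, *counts = line.split()
    return site, dict(zip(COLUMNS, map(int, counts)))


def efficiency(counts):
    """Fraction of jobs in an OK state under the assumption above."""
    total = sum(counts.values())
    ok = sum(counts[state] for state in OK_STATES)
    return ok / total if total else 0.0


# Rows copied from the "Latest results (II)" table.
rows = [
    "lcg02.ciemat.es 0 10 0 12 2 150 2 0 0 24",
    "egeece.ifca.org.es 0 0 0 0 0 190 10 0 0 0",
]
for line in rows:
    site, counts = parse_row(line)
    print(f"{site}: {efficiency(counts):.2f}")
```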

Conclusions
- Very small fraction of failed jobs due to the WMS
  - Only those remaining in Waiting status (O(100))
  - All other failures are due either to the application, to the CE or to authentication problems (expired CRL)
- Performance seems to indicate a maximum rate of ~26000 jobs/day
  - These were “Job Robot” jobs; the rate may be different for other kinds of jobs
- The WMS looks reasonably fine now