The ALICE Christmas Production L. Betev, S. Lemaitre, M. Litmaath, P. Mendez, E. Roche WLCG LCG Meeting 14th January 2009.


Outlook  The last ALICE production run began the 21st of December  This new run includes the latest AliEnv2.16 previously deployed at all sites  This is the 1st AliEn version which fully deprecates the RB usage  HOWEVER: ALICE has been using the WMS submission mode in most of the sites since months in previous AliEn versions  It implements also a CREAM-CE module for CREAM submisions  3500 jobs daily run in average  The following slides collects the experiences of the experiment during the Christmas period and also the conclusions of the post- mortem meeting between IT/GS members and WMS experts

General summary of the production
- 36 sites participated in the Christmas production (50%); all T1 sites OK
- WMS reconfiguration was needed during the vacation period
- This will be the focus of the ALICE report in the following slides

The Services: WMS (I)
- ALICE currently counts on 3 dedicated WMS nodes at CERN
  - wms103 and wms109 (gLite 3.0)
  - wms204 (8-core machine, gLite 3.1)
- These nodes play a central role in the whole ALICE production
- On the 29th of December Maarten announced to ALICE a huge backlog on wms204
  - A large number of jobs were being submitted through this node
  - As a result, further job submission processing became slower and slower

Some results seen during Christmas
- i.fl: the number of unprocessed entries in the input.fl file, the input queue for the Workload Manager service (a monitoring sketch follows below)
- q.fl: the number of unprocessed entries in the queue.fl file, the input queue for the Job Controller service
[Plots of the i.fl and q.fl backlogs over the period]
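A minimal sketch of how such a backlog could be watched on a WMS node, counting lines in the two queue files as a rough proxy for pending requests; the file paths below are assumptions for illustration, not the actual WMS configuration:

```python
#!/usr/bin/env python3
# Sketch: report the number of pending entries in the Workload Manager
# (input.fl) and Job Controller (queue.fl) queue files.
# NOTE: the paths are hypothetical; real locations depend on the WMS install.

import os

QUEUE_FILES = {
    "i.fl (Workload Manager)": "/var/glite/workload_manager/input.fl",
    "q.fl (Job Controller)": "/var/glite/jobcontrol/queue.fl",
}

def count_entries(path):
    """Count non-empty lines as a rough estimate of unprocessed requests."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return sum(1 for line in f if line.strip())

for label, path in QUEUE_FILES.items():
    n = count_entries(path)
    if n is None:
        print(f"{label}: file not found ({path})")
    else:
        print(f"{label}: {n} pending entries")
```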

Example: wms204 backlog

The Services: WMS (II)
- Where was this happening?
  - Basically 2 T2 sites were attracting a huge number of jobs: MEPHI in Russia and the T2 in Prague
- Why was this happening? Normally several reasons can lead to this situation:
  - The destination queue is not available
    - The submitted jobs are then kept for a further retry (up to 2 retries; unmatched requests are discarded after 2 hours)
    - But ALICE has set the shallow resubmission to zero and explicitly asked the WMS experts to configure the nodes to avoid any possible resubmission (see the sketch below)
  - A configuration problem at a site keeps it pulling in jobs
    - Since these jobs are visible nowhere, they do not exist for ALICE, and therefore the system keeps submitting and submitting
- In any case the ALICE submission rate is not high enough to provoke such huge backlogs on nodes like wms204
  - The previous reasons can be ingredients of the problem, but cannot be the only cause of such a load
  - On wms204 the matchmaking became very slow for unknown reasons; the developers have been involved
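A minimal sketch of the JDL side of this setting, assuming the standard gLite WMS attributes RetryCount and ShallowRetryCount; the executable, sandbox and requirements below are placeholders, not the actual ALICE agent JDL:

```python
#!/usr/bin/env python3
# Sketch: write an agent JDL with deep and shallow resubmission disabled,
# as the slide describes (shallow resubmission set to zero).
# The executable and requirements are illustrative placeholders.

AGENT_JDL = """\
Executable        = "agent-wrapper.sh";
StdOutput         = "agent.out";
StdError          = "agent.err";
OutputSandbox     = { "agent.out", "agent.err" };
RetryCount        = 0;
ShallowRetryCount = 0;
Requirements      = other.GlueCEStateStatus == "Production";
"""

with open("agent.jdl", "w") as f:
    f.write(AGENT_JDL)
print("Wrote agent.jdl with resubmission disabled")
```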

Effects of this high load
- ALICE was seeing jobs in status READY and WAITING for a long time
  - The experiment does not consider READY and WAITING as problematic states, so it keeps on submitting and submitting... SNOWBALL: creating huge backlogs
- Request: could the WMS be configured to refuse new submissions once it gets into such a state?
  - Proposed during the post-mortem meeting with the WMS experts; it could be in place by the end of February 2009 (earliest)
- Why these job states?
  - The Workload Manager has to deal with all preceding requests in the "i.fl" queue at least once (a submission that does not match will be retried later if it has not expired by that time)
  - The Job Controller has to deal with all preceding requests in the "q.fl" queue

The WMS: ALICE Procedures
- ALICE immediately stopped submission through wms204 at all sites, putting the highest weight on wms103 and wms109 (see the sketch below)
- The situation was solved on wms204 but then appeared on wms103 and wms109
  - wms103 and wms109 (gLite 3.0) had a different problem that could not be explained satisfactorily either
- In addition, access to wms117 was also granted to ALICE for this period
  - This node developed the same symptoms as wms204
- As a result, the WMS have been watched continuously during this period, changing the WMS in production when needed
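A minimal sketch of the weighting idea, assuming each VOBOX picks a WMS endpoint at random with a configurable weight per node; the endpoint URLs and weight values are illustrative, not the actual ALICE configuration:

```python
#!/usr/bin/env python3
# Sketch: weighted selection of a WMS endpoint for submission.
# Setting a node's weight to 0 drains it; the others absorb the load.

import random

WMS_WEIGHTS = {
    "https://wms103.cern.ch:7443/glite_wms_wmproxy_server": 10,
    "https://wms109.cern.ch:7443/glite_wms_wmproxy_server": 10,
    "https://wms204.cern.ch:7443/glite_wms_wmproxy_server": 0,   # backlogged: stop using it
}

def pick_wms(weights):
    """Pick one WMS endpoint with probability proportional to its weight."""
    endpoints = [e for e, w in weights.items() if w > 0]
    return random.choices(endpoints, weights=[weights[e] for e in endpoints], k=1)[0]

print(pick_wms(WMS_WEIGHTS))
```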

Possible sources of the problem
- ALICE JDL construction?
  - The experiment has always defined simple JDL files for its agents
- BDII overloaded?
  - It should then affect all VOs performing matchmaking
  - In addition, several tests querying the BDII gave positive results
- Network problems?
  - Over several days?... and affecting ALICE only?
- Overloaded myproxy server?
  - Indeed a high load on myproxy caused by ALICE was found
  - However this seems to be uncorrelated with the WMS issue
  - Although an overload of the myproxy server can slow down WMS processing, this should then be visible on all WMS of all VOs

How to solve the myproxy server issue
- Faster machines have already been requested to replace the current myproxy server nodes
  - Proposed during the Christmas period; the request has already been made
- In addition ALICE is currently changing the submission procedure to issue a proxy delegation request once per hour (a sketch follows)
  - In case of any problem at a VOBOX, this procedure ensures a 'frugal' usage of the myproxy server
- The new submission procedure will have a beta version this week at Subatech (France)
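A minimal sketch of the intended behaviour, assuming the gLite command glite-wms-job-delegate-proxy is used to refresh a named delegation that subsequent submissions reuse; the delegation ID and interval are illustrative, not the actual AliEn implementation:

```python
#!/usr/bin/env python3
# Sketch: refresh the proxy delegation on the WMS once per hour instead of
# delegating on every submission; submissions then reuse the delegation ID.

import subprocess
import time

DELEGATION_ID = "alice-agent"   # illustrative delegation ID reused by submissions
DELEGATION_INTERVAL = 3600      # one delegation request per hour

def delegate_proxy():
    """Refresh the named delegation on the WMS."""
    rc = subprocess.call(["glite-wms-job-delegate-proxy", "-d", DELEGATION_ID])
    if rc != 0:
        print("delegation failed, will retry at the next interval")

while True:
    delegate_proxy()
    time.sleep(DELEGATION_INTERVAL)
```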

Conclusions  Still pending the issue with the WMS: We still cannot conclude why such a big backlogs have been created during this vacation period  Two new have been already announced: wms214 and wms215 in addition to wms204  All of them with independent LB  8 core machines  Glite3.1  wms103 and wms109 will be fully deprecated end of February  At this moment and due to an AliRoot upate ALICE is not in full production  As soon as the experiment restarts production we will follow carefully the evolution of the 3 nodes reporting any further issue to the developers

Final remarks  ALICE has a lack of WMS  France still is not providing any WMS which can be put in production  WMS provided at RDIG, Italy, NL-T1, FZK and RAL  CERN WMS play a central role for many ALICE sites and are always a failover for the sites, even if a local WMS is available  ALICE wishes to thank the IT/GS (Maarten and Patricia in particular) for the efficient support during the Christmas running