CREAM: ALICE Experience WLCG GDB Meeting, CERN 11th November 2009 Stefano Bagnasco (INFN-Torino), Jean-Michel Barbet (Subatech), Latchezar Betev (ALICE),

CREAM: ALICE Experience WLCG GDB Meeting, CERN 11th November 2009 Stefano Bagnasco (INFN-Torino), Jean-Michel Barbet (Subatech), Latchezar Betev (ALICE), Catalin Condurache (RAL), Sergio Fantinel (INFN-Legnaro), Stefano Lusso (INFN-Torino), Patricia Méndez Lorenzo (CERN, IT/GS), Francesco Noferini (INFN-CNAF), Derek Ross (RAL) and Massimo Sgaravatto (CREAM development team, INFN-Padova)

Thanks
This talk includes feedback and contributions from:
- Subatech: Jean-Michel Barbet
- INFN-Torino: Stefano Lusso and Stefano Bagnasco
- INFN-CNAF: Francesco Noferini
- RAL-LCG2: Catalin Condurache and Derek Ross
- INFN-Legnaro: Sergio Fantinel
- INFN-Padova and CREAM CE developers team: Massimo Sgaravatto
11/11/09 CREAM: ALICE Experience

CREAM-CE: Deployment status
Current CREAM-CE service
- Production version: CREAM1.5 (glite-CREAM )
- Deployed in production by the 6th of October (patch #3259 for SLC4/i386)
- Features: important bug and security fixes (pointed out by the GSVG)
- Migration of sites to CREAM1.5 was strongly encouraged at that time, and ALICE fully supports it for all sites providing this service to the experiment
Outlook of this talk
- At the last GDB (14/10/09) we listed all the CREAM-CE issues reported by the site admins, based on the experience gained with the ALICE production
- Now, one month of operations later, we have collected feedback from several sites already using CREAM1.5

CREAM-CE: Future version
Future CREAM-CE service
- Production version: CREAM1.6
- Status: ready for certification/certified (expected) by December 2009
- TASK #9734: PATCHES #3179 and #
  - Release 1.6 of CREAM CE for sl5_x86_64
  - YAIM-CREAM-CE for release 1.6 of CREAM CE
- Features: many of the issues reported at the last GDB (and not fixed in CREAM1.5) will now be solved

CREAM-CE: site admins and developers reports (I)
Purge issues
- ALICE REPORT: Wrong reporting of job status. CREAM's view of the running jobs becomes desynchronized
- ALICE REQUIREMENT: A method to purge jobs in a non-terminal status
- CREAM STATUS: CREAM job status can be wrongly reported because of misconfigurations or because of these two bugs in the BLAH BLParser, both candidates for CREAM1.6
  - BUG #55078: « Possible final state not considered in BLParserPBS and BUpdaterPBS ». CURRENT STATUS: Integration candidate included in patch #3179
  - BUG #54949: « Some job can remain in running state when BLParser is restarted for both lsf and pbs ». CURRENT STATUS: Integration candidate included in patch #3179
- A specific bug covers the ALICE requirement
  - BUG #55420: « Allow admin to purge CREAM jobs in a non terminal status » (Solution Status: in progress). CURRENT STATUS: Integration candidate included in patch #3179
- CURRENT RISK FOR ALICE: Low now that the developers have provided site admins with the corresponding purge script (very high before)
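
The desynchronization described above amounts to a set difference between the jobs CREAM believes are running and the jobs the batch system actually knows about. A minimal Python sketch of that comparison follows. The function name and the job identifiers are hypothetical, and how each list is obtained (CREAM DB query, batch system command) is left as an assumption outside the sketch.

```python
def find_desynchronized(cream_running, batch_running):
    """Return job IDs that CREAM reports as running but the batch system does not.

    Both arguments are iterables of job ID strings; how they are collected
    is site-specific and not covered here (assumption).
    """
    return sorted(set(cream_running) - set(batch_running))


# Hypothetical job IDs, for illustration only.
cream_view = ["job-001", "job-002", "job-003"]
batch_view = ["job-002"]

stale = find_desynchronized(cream_view, batch_view)
print(stale)
```

Jobs returned by such a check would be the candidates for a purge of non-terminal jobs, once a site admin has confirmed they are really gone from the batch system.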

CREAM-CE: site admins and developers feedback (I)
Purge issues: Site admin reports
- Desynchronization issues have not been observed recently at sites running CREAM1.5
- Several sites have used the script created by the CREAM developers to purge the CREAM DB manually
- Very good feedback regarding this toolkit
- It does, however, require a manual operation, and the purge criteria vary from site to site
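
Since the purge criteria vary from site to site, one way to picture them is as a small per-site policy (which statuses to consider, and how old a job must be) applied to a list of job records. The sketch below is not the developers' purge script. `JobRecord`, `PurgeCriteria`, the status labels and the 7-day threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class JobRecord:
    job_id: str            # hypothetical identifier
    status: str            # illustrative status label
    last_update: datetime  # time of the last status change


@dataclass(frozen=True)
class PurgeCriteria:
    statuses: frozenset    # statuses a site is willing to purge
    max_age: timedelta     # purge only jobs unchanged for longer than this


def select_for_purge(jobs, criteria, now):
    """Return the IDs of jobs matching the site's purge criteria."""
    return [
        job.job_id
        for job in jobs
        if job.status in criteria.statuses
        and now - job.last_update > criteria.max_age
    ]


now = datetime(2009, 11, 11, 12, 0)
jobs = [
    JobRecord("job-a", "RUNNING", now - timedelta(days=9)),
    JobRecord("job-b", "RUNNING", now - timedelta(hours=2)),
    JobRecord("job-c", "DONE-OK", now - timedelta(days=30)),
    JobRecord("job-d", "IDLE", now - timedelta(days=8)),
]

# Assumed example policy: non-terminal jobs untouched for more than a week.
site_policy = PurgeCriteria(
    statuses=frozenset({"RUNNING", "IDLE"}),
    max_age=timedelta(days=7),
)
to_purge = select_for_purge(jobs, site_policy, now)
print(to_purge)
```

Keeping the policy separate from the selection logic means a different site would only change the `PurgeCriteria` values, not the code.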

CREAM-CE: site admins and developers report (II)
DISK SPACE issues: Areas to monitor and purge or clean
- ALICE REPORT: The local mysql DB grew to 2.5 GB
  - CREAM STATUS: Issue associated with the mysql engine. When entries are deleted from the DB, the corresponding disk space is not released (so the CREAM DB does not shrink), but the space is reused when new data is added to the DB
  - CURRENT RISK FOR ALICE: low
- ALICE REPORT: purge of the input Sandboxes in /opt/glite/var/cream_sandbox
  - CREAM STATUS: Solved in CREAM1.5. #48144: « Problems with purge in CREAM when the mapped group name is different than the VO name »
  - RISK FOR ALICE: none once sites upgrade to CREAM1.5

CREAM-CE: site admins and developers feedback (II)
Disk space issues: Site admins report
- Growth of the local mysql DB
  - Some tables in the DB are still growing
- Purge of the input Sandboxes in the /opt/glite/var/cream_sandbox area
  - The Sandbox auto-purge procedure included in CREAM1.5 is working fine now (outputs are purged after 10 days)
  - No further issues observed by the site admins regarding the purge of the Sandbox after the migration to CREAM1.5

CREAM-CE: site admins and developers report (III)
DISK SPACE issues (cont.)
- ALICE REPORT: issues regarding /opt/glite/var/log and /var/log
- ALICE REQUIREMENT: A cleaning policy is required for these files, otherwise they can grow forever
- CREAM STATUS: policies exist for all these files and can be customized file by file:
  - Only the blah accounting log files are outside the CREAM developers' control (files cannot be deleted before they have been processed by the accounting system)
  - For /opt/glite/var/log/glite-ce-cream.log and /opt/glite/var/log/glite-ce-monitor.log, the policy is defined in /var/lib/tomcat5/webapps/ce-cream/WEB-INF/classes/log4j.properties and the default values can be changed. Relevant info under:
  - For /opt/glite/var/log/glite-xxxparser.log, the policy is available under /opt/logrotate.d/glite-xxxparser
  - /etc/logrotate.d/globus-gridftp manages the gridftp log files under /var/log
- RISK FOR ALICE: low since the size is manageable by site admins
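
Alongside the rotation policies listed above, a site could watch how much space the log areas actually take. The following is a minimal Python sketch under stated assumptions. The helper names and the size limit are hypothetical, and the demo runs on a throwaway temporary directory rather than on /opt/glite/var/log or /var/log.

```python
import os
import tempfile


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root (0 if root does not exist)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link to a large file is not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def directories_over_limit(roots, limit_bytes):
    """Map each directory whose total size exceeds limit_bytes to its size."""
    report = {}
    for root in roots:
        size = directory_size_bytes(root)
        if size > limit_bytes:
            report[root] = size
    return report


# Demo on a temporary directory holding one 2 KiB file.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "example.log"), "wb") as handle:
        handle.write(b"x" * 2048)
    demo_size = directory_size_bytes(demo_dir)
    demo_flagged = demo_dir in directories_over_limit([demo_dir], limit_bytes=1024)

print(demo_size, demo_flagged)
```

On a real CE the list of roots would presumably contain the log directories named on this slide, with a limit chosen by the site admin.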

CREAM-CE: site admins and developers report (IV)
DISK SPACE issues (cont.)
- ALICE REPORT: issues regarding /opt/glite/var/cream/user_proxy
  - CREAM STATUS: bug reported and accepted; fix not available in CREAM1.5. #49497: « User proxies on CREAM do not get cleaned up »
  - CURRENT STATUS: Already solved, it will be included in CREAM1.6 (bug fix implementation still pending)
  - CREAM developers could increase the priority of this bug if needed
DISK SPACE issues: site admins report
- No issues observed by the sites in the last month

CREAM-CE: site admins and developers report (V)
LOAD issues (reported by Subatech)
- ALICE REPORT: UNIX load going up to 5 (during start-up or at high submission rates)
  - CREAM STATUS: problem reported by GRNET; its origin was a missing index in the CREAM DB. #52876: « The extra attribute table in the CREAM DB has no key/indexes defined »
  - CURRENT STATUS: solved in CREAM1.5
  - RISK FOR ALICE: low once the CREAM version is upgraded
- ALICE REPORT: When tomcat is restarted, the system can take up to 15 min before submitting new jobs
  - CREAM STATUS: The slow start of CREAM is also due to the problems caused by jobs reported in a wrong status. #51978: « CREAM can be slow to start » (bug in progress)
  - CURRENT STATUS: not included in CREAM1.5 but will be released in CREAM1.6
  - RISK FOR ALICE: Purge actions should speed up this start-up and therefore decrease the risk for the experiment
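
To show why a missing index can drive the load up, the sketch below compares the query plan for a lookup before and after an index is created. It uses Python's built-in SQLite as a stand-in for the MySQL engine used by CREAM, so the planner output differs from MySQL's. The table layout, column names and index name are hypothetical and are not the real CREAM schema.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one string (last column is the detail)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")

# Hypothetical table loosely modelled on an "extra attribute" table.
conn.execute("CREATE TABLE extra_attribute (job_id TEXT, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO extra_attribute VALUES (?, ?, ?)",
    [(f"job-{i}", "attr", str(i)) for i in range(1000)],
)

lookup = "SELECT value FROM extra_attribute WHERE job_id = ?"

# Without an index, every lookup has to scan the whole table.
plan_before = query_plan(conn, lookup, ("job-42",))

# With an index on the lookup column, the engine can search it directly.
conn.execute("CREATE INDEX idx_extra_job ON extra_attribute (job_id)")
plan_after = query_plan(conn, lookup, ("job-42",))

print(plan_before)
print(plan_after)
```

A full scan per lookup is harmless on a small table but grows with the number of stored jobs, which is consistent with the load appearing at start-up and at high submission rates.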

CREAM-CE: site admins and developers feedback (V)
Load issues: Site admins report
- Growth of the UNIX load
  - Reported by Subatech, still visible at the site
  - Load increases during automatic purge operations. Also visible during high job submission rates
  - Site admin report: At this site CREAM runs in a VMware VM, and the load might be due to poor MySQL performance in such an environment. A MySQL slowdown could increase the Unix load
  - CREAM-CE developers report: Issue tracked in bug # the GRNET report “CREAM performance report”: very heavy queries are performed during purge operations
  - CURRENT STATUS: Fix already committed to CVS; it will be released with the next CREAM1.6. The developers have not yet assessed how much this fix reduces the load
  - Report from Legnaro: After closing the queues, the load increased without saturating the CPU (60% CPU load) for about 12h. The issue seems to come from the ALICE submissions, which continued although the queues were closed
- Tomcat restart slows down the submission of jobs
  - Solved in CREAM1.6
  - No further reports from the site admins

Some other interesting feedback
We asked the site admins about: Requirements for the system maintenance
- Since the last update, sites spend much less time monitoring CREAM-CE
  - Mainly keeping the disk space under control and checking the consistency between the jobs reported by CREAM and the local batch system (Subatech)
  - In some cases, the baby-sitting of the site is almost negligible (Legnaro and CNAF)
  - Issues observed at RAL before the upgrade of the system seem to be gone after the deployment of CREAM1.5, i.e., Tomcat-related issues are already solved in this new version

Some other interesting feedback (II)
We also asked the site admins about: Monitoring applied to the system at the sites
- In some cases (Subatech and RAL) the site uses Nagios, including specific probes:
  - gLite-LB-logd and tomcat daemons
  - User_nbfiles: number of files used for ALICE production
  - Inactive_jobs: jobs not consuming CPU
  - Open_file_desc: number of file descriptors used
- Standard fabric monitoring (Ganglia) for Legnaro
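
Probes like the ones listed above typically follow the Nagios plugin convention: a one-line status message, optional performance data after a `|`, and an exit code of 0 (OK), 1 (WARNING) or 2 (CRITICAL). The Python sketch below shows only that threshold logic. It is not the probe used at Subatech or RAL. The probe name, the measured value and the thresholds are hypothetical, and obtaining the measured value is left to the caller.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(value, warn, crit):
    """Map a measured value to a Nagios state using simple upper thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def run_probe(name, value, warn, crit):
    """Return (exit_code, output_line) for one measurement.

    The output line carries a human-readable message followed by
    performance data in the form label=value;warn;crit.
    """
    state = classify(value, warn, crit)
    line = f"{LABELS[state]} - {name} = {value} | {name}={value};{warn};{crit}"
    return state, line


# Hypothetical measurement: 850 open file descriptors, thresholds 800/1000.
state, line = run_probe("open_file_desc", 850, warn=800, crit=1000)
print(line)
```

A wrapper script would print the line and exit with the returned state so that Nagios can pick up both.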

Some other interesting feedback (III)
In addition, CNAF reported:
- blparser is not automatically restarted at boot time (only tomcat is). It has to be restarted by hand in order to recover the queue info
- Developers feedback: issue included in bug #56518
- CURRENT STATUS: Fix already committed; it will be provided in CREAM1.6
Finally, INFN-Torino feedback
- Running CREAM-CE for one day
- Stefano Lusso has reported the added value of the script CheckCreamConf.pl, used at the site to set variables: ation

Summary
- ALICE reiterates its strong interest in the general deployment of CREAM-CE
- A vibrant and very involved user community provides helpful feedback
- Fantastic quality of developer support and advice
- ALICE and the sites involved want a fast version certification and deployment cycle
- Time is very short, data is coming