
1 WLCG-GDB Meeting. CERN, 12 May 2010 Patricia Méndez Lorenzo (CERN, IT-ES)

2
- 2010 Data Taking: Results
- The new AliEn 2.18 version
- WLCG services news
- Failover mechanism for the VOBOXES
- Deprecation of the LCG-CE
- Raw data transfers and monitoring
- Operational procedures
- Summary and Conclusions

3
- From February until the end of March: cosmic-ray data taking, ~10^5 events
- pp run since March 30th
  - 7 TeV: 40x10^6 events
  - 0.9 TeV: 7x10^6 events
- Raw data registration: ~77 TB
- Raw data processing
  - Run processing starts immediately after the RAW data are transferred to the CERN MSS
  - Average ~5 h per job; at 10 h, 95% of the runs are processed; at 15 h, 99% of the runs are processed
  - Pass 1-6 completed for the 0.9 and 2.36 TeV data
  - Pass1@T0 for the 7 TeV data follows the data taking
  - Analysis train running weekly: QA and physics-working-group organized analysis
- MC production
  - Several production cycles for 0.9, 2.36 and 7 TeV pp: 17x10^6 events with various generators and conditions from real data taking

4 Remarkable stability at all sites during the data taking

5
- Peak ~125 MB/s, average ~30 MB/s
- Total transferred: 28.26 TB (only "good" runs are being transferred)
- Full runs are transferred to each T1 site
- SE choice is based on the ML tests at transfer time
  - Under equal conditions the SE is taken randomly
  - This will change to choose the SE based on the number of resources provided by the site (see the sketch below)
  - Distribution already defined in SC3
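
The planned change can be illustrated with a small sketch: instead of picking uniformly at random among the SEs that pass the MonALISA transfer tests, the choice would be weighted by the resources each site provides. This is only an illustrative sketch, not AliEn code; the function name, the resource figures and the SE names are hypothetical.

```python
import random

def choose_se(candidates):
    """Pick a destination SE for a raw-data transfer.

    `candidates` is a list of (se_name, resources) pairs for the SEs
    that currently pass the MonALISA transfer tests.  The current
    behaviour is a uniform random choice; the planned behaviour
    weights the choice by the resources provided by each site.
    """
    names = [name for name, _ in candidates]
    weights = [res for _, res in candidates]
    # Planned: probability proportional to the site's resources.
    return random.choices(names, weights=weights, k=1)[0]

# Hypothetical numbers, for illustration only.
t1_ses = [("CCIN2P3::TAPE", 400), ("CNAF::TAPE", 300), ("FZK::TAPE", 300)]
print(choose_se(t1_ses))
```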

6
- Many new features are included in AliEn v2.18, solving quite a lot of previous issues
- Deployment of this version was done transparently from the central services
  - Simultaneously with the start-up of data taking
- Two important improvements are mentioned here:
  - Implementation of Job and File Quotas (see the sketch below)
    - Limits on the resources available per user
    - Jobs: number of jobs, cpuCost, running time
    - Files: number of files, total size (including replicas)
  - Improved SE discovery
    - Finding the closest working SEs of a given QoS once the file has been registered in the catalogue
    - For reading and writing, taking the ML tests into account
    - Simplifies the selection of the SE
    - Gives more options in case of special needs
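
As an illustration of the quota idea, the sketch below checks a user's running totals against configured limits before accepting a new job. The field names and the numbers are hypothetical, not the actual AliEn schema.

```python
from dataclasses import dataclass

@dataclass
class JobQuota:
    max_jobs: int            # maximum number of jobs
    max_cpu_cost: float      # accumulated cpuCost limit
    max_running_time: float  # accumulated running-time limit

@dataclass
class JobUsage:
    jobs: int
    cpu_cost: float
    running_time: float

def can_submit(usage: JobUsage, quota: JobQuota) -> bool:
    """Return True if the user is still below all job-quota limits."""
    return (usage.jobs < quota.max_jobs
            and usage.cpu_cost < quota.max_cpu_cost
            and usage.running_time < quota.max_running_time)

# Hypothetical numbers, for illustration only.
print(can_submit(JobUsage(jobs=950, cpu_cost=1.2e6, running_time=4.0e4),
                 JobQuota(max_jobs=1000, max_cpu_cost=2.0e6, max_running_time=5.0e4)))
```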

7 Example: SE discovery for writing
- The client asks the central services for SEs ("I am in Madrid, give me SEs")
- The request passes through Authen and the File Catalogue to the SERank Optimizer, which uses the MonaLisa test results
- The answer is a ranked list of SEs to try, e.g. "Try: CCIN2P3, CNAF and Kosice"
- A similar process is used for reading
- The number of SEs, the QoS, SEs to avoid, etc. can be selected
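
The flow above can be summarized in a short sketch: filter the SEs of the requested QoS by the MonALISA test results, drop the ones the client asked to avoid, rank the rest by closeness to the client and return the first N. The ranking metric and the helper names are hypothetical; the real SERank Optimizer logic lives in the AliEn central services.

```python
def discover_ses(ses, client_site, qos, n=3, avoid=()):
    """Return up to `n` SE names of the requested QoS for a client.

    `ses` is a list of dicts such as
      {"name": "CCIN2P3::SE", "qos": "disk",
       "ml_ok": True, "distance": {"Madrid": 2}}
    where `ml_ok` stands for the MonALISA test result and `distance`
    is some network-closeness metric per client site (hypothetical).
    """
    candidates = [se for se in ses
                  if se["qos"] == qos
                  and se["ml_ok"]                 # only working SEs
                  and se["name"] not in avoid]    # honour "avoid SE"
    # Closest SEs first, as seen from the client's site.
    candidates.sort(key=lambda se: se["distance"].get(client_site, 999))
    return [se["name"] for se in candidates[:n]]
```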

8
- 2009 approach: CREAM-CE implementation in AliEn and its distribution
  - System available at the T0, all T1 sites (except NIKHEF at that time) and several T2 sites
  - Dual submission (LCG-CE and CREAM-CE) at all sites providing CREAM
  - A second VOBOX was required at the sites providing CREAM to ensure the dual LCG-CE vs. CREAM-CE approach
- 2010 approach: deprecation of the gLite-WMS
- Latest news in terms of sites:
  - A 3rd CREAM-CE at CERN on SL5 (ce203), announced on Monday night, entered production immediately
  - NIKHEF announced a local CREAM-CE yesterday afternoon; the system was successfully tested by ALICE and included in production
- ALICE is actively involved in the operation of the service at all sites, together with the site admins and the CREAM-CE developers

9
- ALICE has established the 31st of May as the deadline to have a CREAM-CE at all sites
  - After that date, and based on the status of the pending sites, those sites might be blacklisted
- Based on the current status, ALICE is running in CREAM mode at all sites
- The T0 is still running in dual mode and the deprecation of the LCG-CE there is not expected for the moment

10
- CREAM-CE 1.6 has been released in production for gLite 3.2/sl5_x86_64
  - https://savannah.cern.ch/patch/?3959
- The corresponding version for gLite 3.1/sl4_i386 has already been released in the staged rollout
- ALICE sites are encouraged to migrate to CREAM 1.6 as soon as possible:
  1. A large number of bugs reported by ALICE site admins have been solved in this version
  2. It will allow a lighter distribution of the current gLite 3.2 VOBOX

11
- Purge issues:
  - ALICE report: wrong reporting of the job status; CREAM's view of the running jobs gets de-synchronized
  - The CREAM job status can be wrongly reported because of some misconfigurations or because of these two bugs in the BLAH BLparser:
    - #55078: Possible final state not considered in BLParserPBS and BUpdaterPBS (ready for review)
    - #54949: Some jobs can remain in running state when the BLParser is restarted, for both LSF and PBS (ready for review)
  - #55420: Allow admin to purge CREAM jobs in a non-terminal status (verified)
- Disk space issues:
  - ALICE report: issues with the cleanup of the /opt/glite/var/cream/user_proxy area
  - #49497: user proxies on CREAM do not get cleaned up (ready for review)
- Load issues:
  - ALICE report: when Tomcat is restarted, the system can take up to 15 min before submitting new jobs
  - The slow start of CREAM is also due to the problems coming from jobs reported with a wrong status
  - #51978: CREAM can be slow to start (verified)

12
- Load issues (cont.):
  - ALICE report: growth of the UNIX load; the load increases during automatic purge operations and is also visible during high job submission rates
  - #58103: CREAM database query performance (ready for review). The GRNET "CREAM performance report": very heavy queries are performed during purge operations
- Other issues:
  - ALICE report: the BLparser is not automatically restarted at boot time (only Tomcat is); the BLparser has to be restarted by hand in order to recover the queue info
    - #56518: BLAH blparser doesn't start after boot of the machine (verified)
  - ALICE report: wrong SGM mapping; job submission fails when the jdl contains an InputSandbox
    - The origin of the problem is a wrong user mapping between CREAM and gridftp
    - #58941: lcmaps confs for glexec and gridftp are not fully synchronized (TM) (verified)

13
- Direct submission of jobs via the CREAM-CE requires the specification of a gridftp server to save the OSB
  - The server is specified at the level of the jdl file (see the sketch below)
  - ALICE solved this by requiring a gridftp server at the local VOBOX (distributed with the gLite 3.2 VOBOX)
- The OSB cannot be retrieved from the CREAM disk via any client command
  - Well… not fully true: the functionality is possible but not exposed
  - The lack of a space-management mechanism discourages such a procedure
- Requirements to expose this feature:
  - Automatic purge procedures
  - Limiters blocking new submissions in case of low free disk space
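
As an illustration of "specifying the gridftp server at the level of the jdl file", the sketch below builds a minimal JDL whose output sandbox is sent to a gridftp server on the local VOBOX. The attribute names follow the usual CREAM JDL conventions but are given here as assumptions, and the host name is hypothetical; this is not a verified ALICE configuration.

```python
def build_jdl(executable, vobox_gridftp):
    """Build a minimal JDL for direct CREAM-CE submission that sends
    the output sandbox (OSB) to a gridftp server on the local VOBOX.

    `vobox_gridftp` is a hypothetical host name; check the attribute
    names against the CREAM documentation for the deployed version.
    """
    return "\n".join([
        "[",
        f'  Executable = "{executable}";',
        '  StdOutput = "job.out";',
        '  StdError = "job.err";',
        '  OutputSandbox = {"job.out", "job.err"};',
        # Destination of the OSB: gridftp server on the local VOBOX.
        f'  OutputSandboxBaseDestURI = "gsiftp://{vobox_gridftp}/data/osb";',
        "]",
    ])

print(build_jdl("run-agent.sh", "voalice-example.cern.ch"))
```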

14
- Automatic purge procedures
  - Already included in CREAM 1.5, following a configurable policy
  - The sandbox area of a job is deleted when the job is purged
  - http://grid.pd.infn.it/cream/field.php?n=Main.HowToPurgeJobsFromTheCREAMDB
- Limiters
  - New feature included in CREAM 1.6
- Several users have asked for the possibility to save the OSB on the CREAM-CE
  - CREAM 1.6 exposes for the 1st time the possibility to leave the OSB on the CREAM-CE
  - If the OSB can be left on the CREAM-CE, a gridftp server at the VOBOX is no longer needed
  - The feature was successfully tested by ALICE in Torino (CREAM 1.5), trusting the available purge procedure
  - The implementation in AliEn is very simple but not backward compatible

15
- For the 2009 approach, ALICE required a 2nd VOBOX at those sites providing both submission backends (LCG-CE and CREAM-CE)
  - The motivation was extensively explained during previous GDB meetings
- The 2010 approach foresees a single backend: the CREAM-CE
  - In principle a single VOBOX is needed
- What to do with the 2nd VOBOX?
  - Rescue it: FAILOVER MECHANISM
- This approach has been included in AliEn v2.18 to take advantage of the 2nd VOBOX deployed at several ALICE sites
  - ~25 sites currently provide >=2 VOBOXES

16
- Same configuration for both local VOBOXES
  - Neither of them has a special role
  - They run exactly the same services and share the same software area
- AliEn v2.18 implementation (see the sketch below)
  - Simple approach
  - All services (except MonaLisa) try to connect, one after the other, to the 1st available host of a list included in LDAP
  - The list contains the names of the local VOBOXES
  - The connection is established with the first available VOBOX, which will take the whole load in case of failures
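
A minimal sketch of the failover idea, assuming the ordered VOBOX list has already been read from LDAP; the function and host names are hypothetical stand-ins for the AliEn internals.

```python
import socket

def first_available_vobox(vobox_hosts, port, timeout=5.0):
    """Return the first VOBOX host from the LDAP-ordered list that
    accepts a TCP connection on `port`; raise if none is reachable.

    If one VOBOX fails, the next call simply lands on the next
    reachable host in the list, which then takes the whole load.
    """
    for host in vobox_hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue  # this VOBOX is down, try the next one in the list
    raise RuntimeError("no VOBOX reachable: %s" % ", ".join(vobox_hosts))

# Hypothetical host names, as they would come from the LDAP list.
# host = first_available_vobox(["vobox1.example.org", "vobox2.example.org"], 8084)
```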

17
- Dashboard monitoring
  - Already available

18
- Issues are reported daily at the ops meeting
- The weekly ALICE TF meeting now also includes analysis items (TF & AF meeting)
  - Moved to 16:30 to be able to contact the American sites
- Latest issues at the T0:
  - CAF nodes
    - Instabilities in some nodes have been observed in the last weeks
    - Thanks to the experts in IT for the prompt answers and actions
  - AFS space
    - Replication of the AFS ALICE volumes
    - Separation into readable and writable volumes
    - Thanks to Harry and Rainer for their help

19
- Very smooth operation of all sites and services during the 2010 data taking
- Very good response from site admins and experts in case of problems
- The new AliEn v2.18 has been deployed and will be responsible for the ALICE data taking infrastructure in the coming months
  - Transparent deployment of the new version in parallel with the start-up of the data taking
- In terms of services, sites are encouraged to provide the latest CREAM 1.6 version as soon as possible

