Status of the ALICE Grid, Patricia Méndez Lorenzo (IT), ALICE Offline Week, CERN, 18 October 2010.



Outlook
 General results in the last three months
 List of general issues
 News about services
 HI CMS+ALICE exercise
 Nagios and monitoring
 Summary and conclusions

18/10/10 ALICE OFFLINE WEEK -- ALICE GRID STATUS

General results in the last three months

List of general issues
 T0 site
   Instabilities this summer with the local CREAM-CE
   Instabilities with the AFS software area
   CAF nodes quite stable
   Security patches applied to all ALICE VOBOXes at CERN
   Migration of out-of-warranty VOBOXes (voalice07 to voalice15 and voalice09 to voalice16)
   HI combined exercise
 T1 sites
   CREAM-CE issues, including instabilities observed in the resource BDII
   SE problems found at CNAF and CC-IN2P3 related to lack of disk space
 T2 sites
   Usual operations; in general quite stable behaviour
   Challenge: new sites entering production, and upgrades of T2 to T1 sites (from the ALICE perspective)

T2 sites → T1 sites
 Korean and US sites willing to become ALICE T1 sites
   In terms of both service provision and management
 Challenge: bandwidth
   Poor network found between these sites and CERN
   A show-stopper for these sites and also for newcomers
   1st approach: bottleneck entering CERN? (firewall stops)
   It has been found this is not the issue
   Current situation: not fully clear (Jeff in contact this week with Edoardo Martelli to report the Supercomputing results)
 "Proposal for Next Generation Architecture interconnecting LHC computing sites" (Nov 2010)
   Moving towards more dynamically configured links between sites, with a few static connections

CREAM and AliEn v2.19
1. Easier management of the OSB (OutputSandbox)
2. Removal of any reference to the CREAM DB
3. Check of the CREAM-CE status in the BDII

CREAM and AliEn v2.19: easier management of the OSB
 The OSB is required by ALICE for debugging purposes only
 Direct submission of jobs via the CREAM-CE requires the specification of a gridftp server to save the OSB
   The server is specified at the level of the JDL file
   ALICE solved this by requiring a gridftp server at the local VOBOX
 The OSB cannot be retrieved from the CREAM disk via any client command
   Well... not fully true: the functionality was possible but not exposed before CREAM 1.6
 Requirements to expose this feature:
   Automatic purge procedures (from CREAM 1.5)
   Limiters blocking new submissions in case of low free disk space (from CREAM 1.6)
 CREAM 1.6 exposes the possibility to leave the OSB on the CREAM-CE:
   outputsandboxbasedesturi="gsiftp://localhost"; (agent JDL level)
   A gridftp server at the VOBOX is no longer needed
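To make the mechanism concrete, here is a minimal sketch (not ALICE's actual agent code) of a JDL for direct CREAM submission that keeps the OutputSandbox on the CREAM-CE itself, using the attribute quoted on the slide; the executable and sandbox file names are invented for the example.

```python
def build_jdl(executable, osb_files):
    """Render a minimal JDL string for direct CREAM submission.

    The outputsandboxbasedesturi attribute set to gsiftp://localhost tells
    CREAM to keep the OSB on its own disk (retrievable on demand from
    CREAM 1.6 onwards), so no gridftp server is needed on the VOBOX.
    """
    quoted = ", ".join(f'"{f}"' for f in osb_files)
    lines = [
        "[",
        f'Executable = "{executable}";',
        f"OutputSandbox = {{{quoted}}};",
        'outputsandboxbasedesturi = "gsiftp://localhost";',
        "]",
    ]
    return "\n".join(lines)

# Hypothetical agent job with two log files in its sandbox
jdl = build_jdl("agent.sh", ["stdout.log", "stderr.log"])
```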

CREAM and AliEn v2.19: removal of any reference to the CREAM DB
 The CREAM DB was used to report the numbers of running/waiting jobs, in parallel to the BDII information
 AliEn v2.18 enabled both information sources
   Selectable on a site-by-site basis through an environment variable (CE_USE_BDII) included in LDAP
 AliEn v2.19 keeps the environment variable but removes the CREAM DB as an information source
   Too heavy a query, and not always reliable
   If unreliable, we could saturate the sites, or the opposite: simply not run
 The CREAM-CE developers have proposed the creation of a tool able to provide the numbers of waiting/running jobs by querying the batch system
   Hence the maintenance of the CE_USE_BDII environment variable
 WARNING: THE ONLY INFORMATION SYSTEM WE HAVE NOW IS THE RESOURCE BDII
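As an illustration of what "using the resource BDII as the only information source" amounts to, here is a sketch of parsing the Glue 1.3 job-state attributes a CE publishes. The LDIF sample and host name are invented; GlueCEStateRunningJobs and GlueCEStateWaitingJobs are the standard attribute names, and a real query would look something like `ldapsearch -x -LLL -h ce.example.org -p 2170 -b mds-vo-name=resource,o=grid '(objectClass=GlueCE)'`.

```python
def parse_glue_ce_states(ldif_text):
    """Return {GlueCEUniqueID: (running, waiting)} from LDIF-style output."""
    states = {}
    ce_id = running = waiting = None
    # Append a blank line so the final entry is flushed like the others
    for line in ldif_text.splitlines() + [""]:
        line = line.strip()
        if not line:  # blank line terminates an LDIF entry
            if ce_id is not None:
                states[ce_id] = (running or 0, waiting or 0)
            ce_id = running = waiting = None
        elif line.startswith("GlueCEUniqueID:"):
            ce_id = line.split(":", 1)[1].strip()
        elif line.startswith("GlueCEStateRunningJobs:"):
            running = int(line.split(":", 1)[1])
        elif line.startswith("GlueCEStateWaitingJobs:"):
            waiting = int(line.split(":", 1)[1])
    return states

# Invented sample of what a resource BDII might return for one queue
sample = """\
GlueCEUniqueID: ce.example.org:8443/cream-pbs-alice
GlueCEStateRunningJobs: 120
GlueCEStateWaitingJobs: 35
"""
states = parse_glue_ce_states(sample)
```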

CREAM and AliEn v2.19: check of the CREAM-CE status
 "Economic" reasons: why keep submitting to CREAM-CEs in draining or maintenance mode?
 Until AliEn v2.19: manual approach
   Non-operational CEs were manually removed from LDAP
 With AliEn v2.19: automatic approach
   Before any CREAM-CE operation, the status of the CREAM-CE is queried from the resource BDII
   If the CE is in maintenance or draining mode, no operation is performed with this CE
   If there is a list of CREAM-CEs, only those in production will be used
   No need to restart services when the CE comes back into production
 Procedure implemented and tested at Subatech with good results
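The automatic approach above can be sketched as a simple filter on the status each CE publishes in the resource BDII. The CE names are invented; "Production", "Draining" and "Closed" are standard GlueCEStateStatus values (this is an illustration of the idea, not the AliEn implementation).

```python
def usable_ces(ce_status):
    """Keep only the CEs whose published status allows new submissions."""
    return [ce for ce, status in ce_status.items() if status == "Production"]

# Invented snapshot of the statuses a site's CREAM-CEs might publish
published = {
    "cream1.example.org:8443/cream-pbs-alice": "Production",
    "cream2.example.org:8443/cream-pbs-alice": "Draining",  # skipped: being drained
    "cream3.example.org:8443/cream-pbs-alice": "Closed",    # skipped: maintenance
}
targets = usable_ces(published)
```

When a drained CE comes back to "Production" in the BDII it reappears in the filtered list on the next query, which is why no service restart is needed.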

CREAM status
 Current CREAM-CE production version:
   CREAM 1.6.3 (gLite 3.2/sl5_x86_64), patch #4415
   A gLite 3.1 version is arriving (patch #4387 in staged rollout), BUT this will be the last CREAM-CE deployment in gLite 3.1
 Next CREAM-CE version:
   CREAM 1.7 (gLite 3.2/sl5_x86_64 ONLY!)
   Foreseen for the end of the year/beginning of 2011
 A brief parenthesis... since the last offline week, I have submitted 27 GGUS tickets:
   17 associated with CREAM
   4 associated with wrong information provided by the BDII
   6 associated with SE issues
 Let's look at the issues associated with CREAM (and observed by ALICE) in these last three months

CREAM issues
 Last offline week's advice for sites: migrate to CREAM 1.6 as soon as possible
   Lots of bug fixes reported by ALICE and new features were included in this version
 However, several instabilities were observed after the migration to CREAM 1.6:
   Connection timeout messages observed at submission time
   Error messages reporting problems with the blparser service (blparser service not alive)
 Issues reported to the CREAM-CE developers
 We created a page for site admins describing the problems and the solutions: 46&Itemid=103

CREAM issues
 Connection timeout error message observed at submission time
   The CREAM service is down
   Bug #69554: memory leak in util-java if CA loading throws an exception. SOLVED IN CREAM 1.6.2
   Workaround provided by the developers, very easy to apply
 blparser service is not alive (glite-ce-job-status)
   Well-documented issue associated with the status of the BLAH blparser service
 Further problem(s):
   Bug #69545: CREAM digests asynchronous commands very slowly. SOLVED IN CREAM
   Workaround provided by the developers, very easy to apply

Other issues
 Reported by GridKa: user proxy delegation problems
   At delegation time the user gets "not authorized for operation" messages
   Documentation available in: ToClient
 Reported by LPSC: /tmp area of CREAM full of glexec "proxy files" (Bug #73961)
   Not directly a CREAM issue, although the service was affected
   With CREAM 1.6.3 the problem is solved
   No workaround will have to be applied once sites migrate to this version
   Migration to CREAM 1.6.3 is highly recommended

Other issues
 Found at CERN: lots of timeouts while querying the CREAM DB during the summer
  1. Increase of the timeout window to 3 min
  2. Deprecation of the CREAM DB usage
 Reported by Subatech: glite-ce-job-status fails with the message "EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred"
   Issue associated with insufficient memory in the CREAM-CE (~2 GB when the issue was found)
   Advice: CREAM-CE nodes should have a minimum of 8 GB of memory

More about CREAM
 CREAM 1.6.3 includes important bug fixes
   See Massimo Sgaravatto's presentation during the latest GDB meeting
 The CREAM 1.7 client will include glite-ce-job-output
   This does not require changes in our CREAM.pm module
   And the possibility to leave the OSB on the CREAM-CE (and retrieve it on demand) is of course available

gLite-VOBOX
 Current production version:
   VOBOX (gLite 3.2/sl5_x86_64), patch #4257 (5 Oct 2010)
 New features:
   New GLUE 2.0 service publisher
   New version of the LB clients
 A gridftp server is still included in this version
   ...included, but not configured via YAIM
   The startup of the service has to be handled outside YAIM
   The removal of this server will be requested

HI CMS+ALICE exercise
 Combined ALICE+CMS exercise (21 October 14:00 to 22 October 14:00) to check the ability of the IT infrastructure (network and tapes) to cope with the expected rates
 P2 → CASTOR (2.5 GB/s max) and transfer to tape
   2.3 PB available on t0alice and 2.3 PB available on alicedisk
   Reconstruct ~10% of the data
 Simultaneous copy of RAW data to the disk pool (via xrd3cp, 2.5 GB/s max)
   2100 TB of extra space on the disk pools provided beforehand by IT
 Asynchronous start-up of the test
   ALICE exported directly to CASTOR while CMS was performing a prior repacking before the export

P2 → CASTOR transfers
 Average rate: 2 GB/s, with a max rate of 2.5 GB/s
 160 TB transferred (10% of the expected HI volume), files of 2.7 GB/file
 Several interruptions for detector reconfiguration and follow-up on data transfer to tapes (a realistic scenario)
(Plot provided by L. Betev)
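A quick arithmetic check of the quoted figures (the derived duration below is not from the slides, just computed from the numbers above): 160 TB at an average of 2 GB/s corresponds to roughly 22 hours of effective transfer, consistent with a 24-hour test window that included several interruptions.

```python
# Units
TB = 1e12
GB = 1e9

volume = 160 * TB    # total volume transferred P2 -> CASTOR
avg_rate = 2 * GB    # average transfer rate, bytes/s

effective_seconds = volume / avg_rate        # 80,000 s of pure transfer
effective_hours = effective_seconds / 3600   # ~22.2 h, within the 24 h window
```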

CASTOR disk buffer → tape transfers
 Average rate: 2.4 GB/s
 Data makes it to tape about 1 hour after being written to the CASTOR buffer (Δt = 1 h between "data in from P2" and "to tape" on the plot)
 3rd-party copy delayed by 1 h
(Plot provided by L. Betev)

Copy from t0alice → alicedisk + reco
 Copy from t0alice to alicedisk: average 2.6 GB/s on the plot; average copy rate 2.5 GB/s overall
 RAW data reconstruction reading and writing
 Average reco "in" rate: 200 MB/s
 Average reco "out" rate: 20 MB/s
(Plot provided by L. Betev)

Monitoring: Nagios
 Nagios monitoring of the ALICE VOBOXes in production since summer 2010
 Visualization of the results via SAM is obsolete
   Nagios implementation in ML still pending
 Site availability calculation: the calculations via SAM and via Nagios are currently being compared
   The next MB meeting will show these results
 Pending developments:
   Implementation of the CREAM-CE standard test suite
   Redefinition of the site availability algorithm based on CREAM (currently based on the LCG-CE)

Monitoring: Gridview
 The transfer rate reported by Gridview is smaller than the real rate
 The issue was found in August 2010 but is still pending
 Tracked in a GGUS ticket: #
 Plot comparison: average transfer for the day of 20 MB/s vs average transfer for the day of 32 MB/s

Summary and conclusions
 Very smooth production in these last three months
   Raw data transfer to CASTOR, registration in the AliEn file catalogue, and transfers to T1 sites are already routine
   Site inefficiencies are immediately handled together with the site admins
 Some changes concerning the CREAM-CE service have been included in AliEn v2.19
   Based on the experience gained this summer with the first version of CREAM 1.6
 Some new improvements can be expected with the next CREAM 1.7 version
 The agile approach foreseen by ALICE, with emphasis on the use of T2 sites (even becoming ALICE T1 sites), will be one of the topics to work on in the coming months