WLCG Service Report ~~~ WLCG Management Board, 23rd November 2010


Introduction
- Generally smooth operation on experiment and service side
- Coped well with higher data rates during the HI run (CMS from CASTOR: 5 GB/s)
- One Service Incident Report received:
    - IN2P3 shared area problems for LHCb (interim SIR – GGUS:59880)
- Two more SIRs are pending:
    - CASTOR/xrootd problems for LHCb at CERN (GGUS:64166)
    - GGUS unavailability on Tuesday November 17th
- Three GGUS ALARMS:
    - CASTOR/xrootd problems for LHCb at CERN (GGUS:64166)
    - ATLAS transfers to/from RAL (GGUS:64228)
    - CNAF network problems affecting ATLAS DDM (GGUS:64459)
- Other notable issues reported at the daily meetings:
    - Slow transfers to Lyon for ATLAS (dCache problems)
    - Many top-priority tickets are open (e.g. GGUS:63631, GGUS:64151, GGUS:64202)
    - Security updates in progress (CVE )
    - BDII timeouts for ATLAS at BNL due to OSG network problems (GGUS:64039)
    - Database problems for ATLAS Panda and PVSS at CERN (no GGUS ticket)

IN2P3 (intermediate) SIR
- For some months LHCb has been suffering from problems related to the IN2P3 shared area on AFS:
    - up to 6% of all jobs fail due to a timeout during software setup (GGUS:59880)
    - software installation jobs fail (GGUS:62800)
- ATLAS had seen a similar problem with software setup timeouts
    - the workaround used for ATLAS (increasing the timeout) is also recommended for LHCb
- Follow-up is still in progress
    - Separate WNs for LHCb (where ATLAS jobs are excluded) are being deployed
    - Tuning and tests are ongoing

GGUS summary (2 weeks)
VO      User  Team  Alarm  Total
ALICE   2     0     0      2
ATLAS
CMS     5     4     0      9
LHCb
Totals

LHCB ALARM->CERN CASTOR ACCESS FAILURE
What time UTC     What happened
2010/11/11 8:43   GGUS ALARM ticket opened, automatic notification to AND automatic assignment to
2010/11/11 8:58   Site acknowledges ticket. CASTOR piquet called.
2010/11/11 10:32  ‘Solved’ by explaining that incoming proxies with VOMS Role=NULL are not recognised as valid members of the root group /lhcb.
2010/11/11 10:54  LHCb supporter requests to add this FQAN in the mapping of the xrootd redirector.

ATLAS ALARM->CERN-RAL TRANSFER FAILURES
What time UTC             What happened
2010/11/14 7:11 SUNDAY    GGUS ALARM ticket opened, automatic notification to AND automatic assignment to
2010/11/14 7:49           Site acknowledges ticket. CASTOR expert investigating.
2010/11/14 9:19           SRM & LSF services restarted at RAL. Submitter reports persistent problems for the rest of the day. Links to TEAM ticket GGUS:64224 for details.
2010/11/15 11:19          ‘Solved’ by reducing the allowed number of jobs on the batch farms. Submitter sets status ‘verified’.

ATLAS ALARM->CERN CNAF TRANSFER FAILURES
What time UTC               What happened
2010/11/20 20:32 SATURDAY   GGUS ALARM ticket opened, automatic notification to AND automatic assignment to
2010/11/20 20:53            Site acknowledges ticket. Investigation until midnight did not reveal the cause of the problem.
2010/11/21 10:45            Downtime recorded in GOCDB. ATLAS blacklists the site.
2010/11/22 14:17            SRM restarted, CERN jobs accepted again. Site contacts suggest closing the ticket, even though the problem is not explained.
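The time between ticket opening and site acknowledgement for the three ALARM tickets can be read off the timelines above. The following is a minimal Python sketch of that arithmetic, using only the "opened" and "acknowledged" timestamps quoted in the slides; the ticket labels and the helper name minutes_between are illustrative assumptions, not part of any GGUS tooling.

```python
from datetime import datetime

# Timestamp layout used in the slides, e.g. "2010/11/11 8:43" (UTC).
FMT = "%Y/%m/%d %H:%M"

# (label, ticket opened, site acknowledged), copied from the three timelines.
# The labels are illustrative; only the timestamps come from the slides.
ALARMS = [
    ("LHCb, CERN CASTOR access (GGUS:64166)", "2010/11/11 8:43", "2010/11/11 8:58"),
    ("ATLAS, CERN-RAL transfers (GGUS:64228)", "2010/11/14 7:11", "2010/11/14 7:49"),
    ("ATLAS, CERN-CNAF transfers (GGUS:64459)", "2010/11/20 20:32", "2010/11/20 20:53"),
]


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two slide timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Acknowledgement delay per ticket, in minutes.
ack_minutes = {
    label: minutes_between(opened, acked) for label, opened, acked in ALARMS
}

for label, minutes in ack_minutes.items():
    print(f"{label}: acknowledged after {minutes} min")
```

With the timestamps above this gives 15, 38 and 21 minutes respectively.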

Support-related events since last MB
- GGUS TEAM and ALARM tickets will be equipped with a ‘Problem Type’ field, like USER tickets, so that periodic reporting better shows weak areas in support.
- Individual tickets were opened against each middleware-related GGUS Support Unit to inform them about assignment ONLY by the DMSU. Some middleware experts refuse to be hidden behind the 2nd level EGI body DMSU in GGUS. This affects implementation of the new workflow in ticket assignment at the next release (Dec. 1st).
- The suggestion to create TEAM instead of USER tickets via the CMS-to-GGUS bridge was rejected.
- The request to be able to convert TEAM tickets into ALARMS was accepted.
- There were 3 true ALARM tickets since the Nov. 9th MB (2 weeks). Details are in the previous slides.

Analysis of the availability plots
Tuesday 9th of November: We are still experiencing problems with missing data for one day of the availability report (week ), as reported at the daily WLCG operations meeting on the 10th of November.

ATLAS
Green box for BNL: Problems with SAM BDII. The availability info is false, as tests were failing due to issues unrelated to the site.
1.1 IN2P3: SRM tests were timing out after 600 seconds due to a disk space issue and the SRM being highly loaded by SRMPUT - GGUS:64164 and GGUS:
RAL: SRM tests were failing: lcg-cp (timing out after 600 seconds), lcg-cr (timing out after 600 seconds) and lcg-del (unary operator expected) – GGUS:

ALICE
2.1 Green box for SARA: CE tests were failing. SAM CE tests should be ignored as ALICE is using CREAM CE. There are no CREAM CE direct submission tests for ALICE yet (work in progress).

CMS
3.1 FNAL: SE type SRMv2 not published. After a SAM-SRM error at T1_US_FNAL, no further SAM tests were running - GGUS:

LHCb
4.1 IN2P3: Issues with IN2P3 shared area – GGUS:59880 (performance issues) and GGUS:62800 (SW installation). CE tests were failing: install, job-Boole, job-DaVinci, job-Brunel, job-Gauss, sft-vo-swdir (‘took more than 150 seconds while 60 seconds are expected’ OR ‘software check failed’), condDB (problems opening the database – error found in the output job) and sftp-job (globus error 126: it is unknown if the job was submitted). SRM tests were failing: lcg-cr (connection time out, no accessible BDII), diracUnitTest (assertion error).
4.2 RAL: Unscheduled outage: service at risk while SRMs were upgraded by rolling change, 11th Nov 10h until 11th Nov 14h. SRM tests were failing: lcg-cr, diracUnitTest and lhcb-fileaccess (could not open connection to srm-lchb).

Analysis of the availability plots

ATLAS
1.1 INFN: SRM tests were failing as reported on Friday the 19th: lcg-cp (no files have been lcg-cp on this endpoint), lcg-cr (error reading token data header: connection closed), lcg-del (unary operator expected), possibly due to INFN experiencing network problems (GGUS:64459).

ALICE
2.1 Green box for SARA: CE tests were failing. SAM CE tests should be ignored as ALICE is using CREAM CE. There are no CREAM CE direct submission tests for ALICE yet (work in progress).

CMS
3.1 KIT: SRM tests were failing as reported on the 16th: lcg-cp (timeout after 1800 seconds), lcg-ls, lcg-gt and lcg-gt-rm-gt (globus_ftp_client: server responded with an error). Probably an indication that CMS skimming was overloading the system.
3.2 CNAF: CE and SRM tests were failing. CE: cms-analysis (open failed with system error ‘Stale NFS file handle’), sft-job (Got a job held event, reason: Globus error 25: the job manager detected an invalid script status). SRM: lcg-cp, lcg-gt, lcg-ls & lcg-gt-rm-gt (Client transport failed to execute the RPC) and lcg-ls-dir (time out after 1800 seconds). Possibly due to INFN experiencing network problems (GGUS: & 64462).
3.3 RAL: Scheduled outage of the SRM from the 16th until the 18th: Upgrade of the CMS Castor instance to version
FNAL: SRM tests were failing/timing out due to the heavy usage of the storage element; some of the components responded within larger time intervals. GGUS:

LHCb
4.1 CNAF: SRM tests were failing: lcg-cr (communication error on send), diracUnitTest (failed to put file to storage) and fileAccess (invalid argument). Possibly due to INFN experiencing network problems (GGUS:64459).
4.2 IN2P3: Issues with IN2P3 shared area – GGUS:59880 (performance issues) and GGUS:62800 (SW installation). CE tests were failing: install, job-Boole, job-DaVinci, job-Brunel, job-Gauss, sft-vo-swdir (‘took more than 150 seconds while 60 seconds are expected’ OR ‘software check failed’), condDB (problems opening the database – error found in the output job) and sftp-job (globus error 126: it is unknown if the job was submitted). SRM tests were failing: lcg-cr (connection time out, no accessible BDII), diracUnitTest (assertion error).
4.3 Green box for NIKHEF: The lhcb-availability test was failing. The lhcb-availability test is not considered a ‘critical’ test and we are investigating why this test is incorrectly included in the results of the critical tests.
4.4 RAL: SRM tests were failing (open/create error: read only file system): lcg-cr (the server responded with an error) and diracUnitTest (failed to put file in storage).
4.5 SARA: SRM tests were failing: lcg-cr (connection time out), diracUnitTest (failed to create directory on storage), fileAccess (no accessible BDII).

Conclusions
- Business as usual – busy but successful
- Some issues need follow-up (e.g. ATLAS transfers to Lyon)
- WLCG is meeting the challenges of HI data taking