WLCG Service Report ~~~ WLCG Management Board, 14th February 2012



Introduction
The service is running rather smoothly, the “metrics” are working relatively well
At least one significant change in the pipeline: EMI FTS deployment in production at Tier0 and Tier1s (well) prior to 2012 pp data taking
At last T1SCM the relevant m/w had not been released (due Feb 16) nor was roadmap clear to all (being prepared)
SIRs: one requested covering Oracle 11g upgrades; others due for the 2 alarm tickets of

WLCG Operations Report – Structure

KPI | Status | Comment
GGUS tickets | No alarms; normal # team and user tickets | No issues to report
Site Usability | Fully green | No issues to report
SIRs & Change assessments | None | No issues to report

KPI | Status | Comment
GGUS tickets | Few alarms; normal # team and user tickets and/or | Drill-down
Site Usability | Some issues and/or | Drill-down
SIRs & Change assessments | Some | Drill-down

KPI | Status | Comment
GGUS tickets | Alarms, many other tickets | Drill-down
Site Usability | Poor | Drill-down
SIRs & Change assessments | Several | Drill-down

GGUS summary (5 weeks)

VO | User | Team | Alarm | Total
ALICE | 4 | 0 | 2 (1) | 6
ATLAS | | | |
CMS | 15 | 5 | 2 (1) | 22
LHCb | | | |
Totals | | | (2) | 295
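The two rows of the summary that survive in full can be sanity-checked by confirming that the User, Team and Alarm counts add up to the stated Total. The sketch below assumes the ALICE row reads 4 / 0 / 2 / 6 and the CMS row 15 / 5 / 2 / 22 (the bracketed figure after the alarm count is not part of the sum); the function name `row_is_consistent` is illustrative only.

```python
# Rows as read from the GGUS summary: (user, team, alarm, total).
# ATLAS and LHCb are left out because their figures are not in the transcript.
rows = {
    "ALICE": (4, 0, 2, 6),
    "CMS": (15, 5, 2, 22),
}


def row_is_consistent(user, team, alarm, total):
    """Return True when the three ticket categories add up to the stated total."""
    return user + team + alarm == total


for vo, counts in rows.items():
    status = "consistent" if row_is_consistent(*counts) else "inconsistent"
    print(vo, status)
```

Both rows pass the check, which supports reading the fused digits in the flattened table this way.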

Support-related events since last MB
There were 2 real ALARM tickets since the 2012/01/10 MB (5 weeks), 1 submitted by ALICE and 1 by CMS.
Both ALARM tickets concerned CERN Databases.
Both of them are in status ‘verified’.
The GGUS monthly release took place on 2012/01/ test ALARMs were issued and analysed in Savannah:125144
Details follow…

ALICE ALARM -> voms-proxy-init hangs GGUS:78739

What time UTC | What happened
2012/01/29 23:23 (Sunday) | GGUS ALARM ticket, automatic notification to alice- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: Databases.
2012/01/29 23:34 | Service expert comments that the incident is related to LCGR db hanging. Investigation in progress.
2012/01/29 23:34 | Operator records in the ticket that db piquet was called.
2012/01/30 00:02 | Submitter confirms that after the db hang was by-passed, VOMS and SAM services became available again.
2012/01/30 00:08 | Today we have experienced some problems with the archiver processes on LCGR database, instance number 1. We do not know yet if the problem is related to some disk failures or an Oracle bug, this is still under investigation. The database hung completely around 00:40. I had to kill instance number 1 manually in order to get the database back. I have also disabled the archive logs backups as this seems to be the cause for the archiver processes hangs. …
2012/01/30 08:46 | solved (SAM/Nagios) Host certificate regenerated. System works fine.

This is what is recorded in the ticket. However, it is neither a complete nor an accurate summary, due to some confusion between multiple incidents and human error in updating (closing) the wrong ticket. IMHO a SIR would be useful in clarifying this.

CMS ALARM -> no connect to CMSR db from remote PhEDEx agents GGUS:78843

What time UTC | What happened
2012/02/01 17:31 | GGUS ALARM ticket, automatic notification to cms- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: Databases.
2012/02/01 17:54 | Operator records in the ticket that the CMS piquet was called.
2012/02/01 18:07 | DB expert records in the ticket that the problem should be gone now. Waiting for submitter’s confirmation.
2012/02/01 18:55 | Submitter agrees and puts the ticket in status ‘solved’. He records that this is a temporary solution and that a detailed explanation and a permanent solution are pending. However, as he ‘verified’ the ticket the next day, no further details were ever recorded about the reasons for this.

More info in IT C5 report (see slide notes)
Firewall misconfiguration immediately after Oracle 11g upgrade
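The response times implied by the CMS timeline can be worked out directly from the UTC timestamps recorded in the ticket. The following is a minimal sketch rather than any tool used by the SCOD team: the event labels are paraphrased from the timeline, and `minutes_since_open` is a hypothetical helper name.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# Timestamps (UTC) copied from the CMS ALARM timeline (GGUS:78843).
events = [
    ("2012/02/01 17:31", "ALARM ticket opened"),
    ("2012/02/01 17:54", "CMS piquet called"),
    ("2012/02/01 18:07", "DB expert reports the problem should be gone"),
    ("2012/02/01 18:55", "submitter sets the ticket to 'solved'"),
]


def minutes_since_open(events):
    """Return (label, whole minutes elapsed since the first event) per event."""
    opened = datetime.strptime(events[0][0], FMT)
    result = []
    for stamp, label in events:
        elapsed = datetime.strptime(stamp, FMT) - opened
        result.append((label, int(elapsed.total_seconds() // 60)))
    return result


for label, minutes in minutes_since_open(events):
    print(f"{minutes:4d} min  {label}")
```

On these timestamps the piquet was called 23 minutes after the alarm, the expert reported a fix after 36 minutes, and the ticket was marked ‘solved’ after 84 minutes.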

Analysis of the availability plots: Week 09/01/2012
Common to experiments
  0.1 RAL - UNSCHEDULED OUTAGE
    Site in downtime during network/DNS problems
ATLAS
  2.1 CERN - SCHEDULED WARNING
    CASTORPUBLIC intervention & CASTOR Name Server upgrade
    CE-JobSubmit test failing with: Globus error 25: the job manager detected an invalid script status
    SRM-VOPut test was failing - now fixed and understood, no more tests against ATLAS castor instance
  2.2 TAIWAN
    CREAMCE-JobSubmit test failing with: BLAH error: submission command failed and hit job shallow retry count (1)
CMS
  3.1 CERN - SCHEDULED WARNING
    CASTORPUBLIC intervention & CASTOR Name Server upgrade
    SRM-VOPut test was failing
LHCb
  4.1 CERN - SCHEDULED WARNING
    CASTORPUBLIC intervention & CASTOR Name Server upgrade
    Problem with access to castor after downtime of 11th, problem fixed by castor team (GGUS:78103)
    SAM jobs failing when accessing CEs (GGUS:78185)
  4.2 IN2P3
    SAM issues for testing the shared software area (GGUS:78054)

Analysis of the availability plots: Week 16/01/2012
ATLAS
  2.1 CERN - UNSCHEDULED WARNING
    Draining nodes before reinstallation with EMI1 Update 11
    SAM tests were running against ATLAS castor instance - tests against the SRM EOSATLAS endpoint started running on 19th (GGUS:78367)
  2.2 All Sites
    CERN-PROD network issue, as reported in the ITSSB, generated some problems to many ATLAS services. Problems observed till 10:45. All SAM tests failed because of the network problem
  2.3 IN2P3
    SRM VOGet test was failing.

Analysis of the availability plots: Week 23/01/2012
ALICE
  1.1 SARA
    CREAMCE Job submit and Direct Job submit tests were failing with the following error: ‘BLAH error: submission command failed /Permission denied-qsub: illegal -W value-’.
CMS
  3.1 TAIWAN
    SRM VO Put and VO Get tests were failing with the following error: ERROR: zero number of replicas.
LHCb
  4.1 RAL
    CREAMCE Job submit test was timing out with the following message: 'BrokerHelper: no compatible resources'.

Analysis of the availability plots: Week of 30/01/2012 - 05/02/2012
ATLAS
  1.1 All sites 31/01. (Green boxes). LCGR (SAM) database down for Oracle upgrade; daily average availability not available.
  1.2 RAL-LCG2 30/01. SRM-VOPUT tests failing. Site had registered downtime for update of ATLAS SRM, but between GOC:103720
ALICE
  2.1 All sites 31/01. (Green boxes). LCGR (SAM) database down for Oracle upgrade; daily average availability not available.
CMS
  3.1 T1_TW_ASGC 04/02 & 05/02 (Green boxes). A problem with the CMS software installation procedure resulted in incorrect software tags being published. Not a site fault. GGUS:126091
LHCb
  4.1 All sites 31/01. (Green boxes). LCGR (SAM) database down for Oracle upgrade; daily average availability not available.
  4.2 LCG.GridKa.de 01/02-05/02. (Green boxes). SAM tests apparently unavailable for SRM-VOLsDir, VOLs and VODEL tests, all other tests for SRM endpoint passing without problem; VO confirms no problems observed with site.
  4.3 LCG.RAL.UK 31/01. SAM test jobs timing out with reports of ‘no compatible resources’ as a result of a batch-system misconfiguration. GGUS:78760

Analysis of the availability plots: Week of 06/02/2012 - 12/02/2012
ATLAS
  1.1 IN2P3-CC. 07/02 (Green box). Scheduled downtime (GOCDB).
  1.2 IN2P3-CC. 08/02. Unscheduled downtime due to CREAMCE malfunction after batch upgrade.
  1.3 RAL. 08/02 (Green box). Scheduled downtime (GOCDB). Outage for intervention on core network within the RAL Tier1. Affects all services.
  1.4 RAL-LCG2. 11/02. SRMV2 endpoint test failures (e.g. ERROR: [SE][GetSpaceTokens][] httpg://srm-atlas.gridpp.rl.ac.uk:8443/srm/managerv2: CGSI-gSOAP running on samnag013.cern.ch reports Error reading token data header: Connection closed.) No downtime registered.
  1.5 SARA-MATRIX 07/02 (Green box). Scheduled downtime (GOCDB). All services down.
ALICE
  NB: Test results for 06/02 not shown on this plot, but verified as ‘Green’ for all sites.
  2.1 CCIN2P3. 07/02 (Green box). Scheduled downtime (GOCDB).
  2.2 RAL. 08/02 (Green box). Scheduled downtime (GOCDB). Outage for intervention on core network within the RAL Tier1. Affects all services.
  2.3 SARA 07/02 (Green box). Scheduled downtime (GOCDB). All services down.
CMS
  NB: Test results for 06/02 not shown on this plot, but verified as ‘Green’ for all sites.
  3.1 T1_FR_CCIN2P3. 07/02 (Green box). Scheduled downtime (GOCDB).
  3.2 T1_UK_RAL. 08/02 (Green box). Scheduled downtime (GOCDB). Outage for intervention on core network within the RAL Tier1. Affects all services.
LHCb
  NB: Test results for 06/02 not shown on this plot, but verified as ‘Green’ for all sites.
  4.1 LCG.IN2P3.fr. 07/02 (Green box). Scheduled downtime (GOCDB).
  4.2 LCG.IN2P3.fr. 08/02. Unscheduled downtime due to CREAMCE malfunction after batch upgrade.
  4.3 LCG.NIKHEF.nl 07/02 (Green box). Scheduled downtime (GOCDB). SRM endpoint down at SARA.
  4.4 LCG.RAL.uk. 08/02 (Green box). Scheduled downtime (GOCDB). Outage for intervention on core network within the RAL Tier1. Affects all services.
  4.5 SARA-MATRIX 07/02 (Green box). Scheduled downtime (GOCDB). All services down.

SIR by Area (Q4 2011)

Time to Resolution

“Serious” SIRs in Q4 2011

Conclusions
The service is (chartreuse, pistachio, olive…)
SIRs and alarms: details regarding any problem should preferably be entered into / attached to the corresponding GGUS ticket
New rule: if there is an alarm ticket (justified) and the resolution / follow-up are not in the ticket they should be documented in a SIR
Quite probable that further investigation is required
Usability of SUM: few or no exceptions - there are currently too many “patches” on the reports for them to be useful
Change management: at least one “iceberg” ahead (EMI FTS deployment at Tier0 and Tier1s prior to 2012 data taking)
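The new rule on alarm tickets and SIRs can be written down as a simple predicate. This is only a sketch under stated assumptions: the `AlarmTicket` fields and the `needs_sir` name are hypothetical and are not GGUS fields, and the rule is read here as "a justified alarm needs a SIR when either the resolution or the follow-up is missing from the ticket".

```python
from dataclasses import dataclass


@dataclass
class AlarmTicket:
    """Hypothetical record of an alarm ticket; not an actual GGUS schema."""
    ticket_id: str
    justified: bool             # a real alarm, not a test or a mistaken one
    resolution_in_ticket: bool  # the resolution is documented in the ticket
    follow_up_in_ticket: bool   # explanation / permanent fix is documented


def needs_sir(ticket):
    """Assumed reading of the rule: justified alarm with missing details."""
    if not ticket.justified:
        return False
    return not (ticket.resolution_in_ticket and ticket.follow_up_in_ticket)


# Illustrative case modelled on the CMS alarm: a temporary solution was
# recorded, but the detailed explanation never made it into the ticket.
cms_alarm = AlarmTicket("GGUS:78843", justified=True,
                        resolution_in_ticket=True, follow_up_in_ticket=False)
print(cms_alarm.ticket_id, "needs SIR:", needs_sir(cms_alarm))
```

Under this reading both alarm tickets from the reporting period would call for a SIR, which matches the request made in the introduction.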