WLCG Service Report ~~~ WLCG Management Board, 14th February 2012


1 WLCG Service Report Jamie.Shiers@cern.ch ~~~ WLCG Management Board, 14th February 2012

2 Introduction
- The service is running rather smoothly; the “metrics” are working relatively well.
- At least one significant change is in the pipeline: EMI FTS deployment in production at Tier0 and Tier1s (well) prior to 2012 pp data taking.
  - At the last T1SCM the relevant middleware had not been released (due Feb 16), nor was the roadmap clear to all (being prepared).
- SIRs: one requested covering the Oracle 11g upgrades; others are due for the 2 alarm tickets of 2012.

3 WLCG Operations Report – Structure
KPI                       | Status                                      | Comment
GGUS tickets              | No alarms; normal # team and user tickets   | No issues to report
Site Usability            | Fully green                                 | No issues to report
SIRs & Change assessments | None                                        | No issues to report

KPI                       | Status                                      | Comment
GGUS tickets              | Few alarms; normal # team and user tickets  | and/or Drill-down
Site Usability            | Some issues                                 | and/or Drill-down
SIRs & Change assessments | Some                                        | Drill-down

KPI                       | Status                                      | Comment
GGUS tickets              | Alarms, many other tickets                  | Drill-down
Site Usability            | Poor                                        | Drill-down
SIRs & Change assessments | Several                                     | Drill-down

4 GGUS summary (5 weeks)
VO     | User | Team | Alarm | Total
ALICE  |    4 |    0 | 2 (1) |     6
ATLAS  |   29 |  189 | 1     |   219
CMS    |   15 |    5 | 2 (1) |    22
LHCb   |    5 |   42 | 1     |    48
Totals |   53 |  236 | 6 (2) |   295
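As a quick consistency check on the GGUS summary, the per-VO rows can be summed and compared with the Totals row. The following is a minimal sketch only: the dictionary layout and the function names are illustrative assumptions, the figures are those on the slide, and the parenthesised values in the Alarm column are left out because the slide does not define them.

```python
# Ticket counts per VO as (user, team, alarm, total), copied from the slide.
# The parenthesised alarm values, e.g. "2 (1)", are deliberately omitted.
tickets = {
    "ALICE": (4, 0, 2, 6),
    "ATLAS": (29, 189, 1, 219),
    "CMS": (15, 5, 2, 22),
    "LHCb": (5, 42, 1, 48),
}

# The "Totals" row as reported on the slide.
reported_totals = (53, 236, 6, 295)


def column_sums(rows):
    """Sum each column (user, team, alarm, total) over all VOs."""
    return tuple(sum(col) for col in zip(*rows.values()))


def rows_consistent(rows):
    """Check that user + team + alarm equals the total in every row."""
    return all(u + t + a == total for u, t, a, total in rows.values())


if __name__ == "__main__":
    print("column sums match Totals row:", column_sums(tickets) == reported_totals)
    print("each row adds up:", rows_consistent(tickets))
```

Both checks pass for the figures shown, which supports the column boundaries used when reading the flattened table.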

5 Support-related events since last MB
- There were 2 real ALARM tickets since the 2012/01/10 MB (5 weeks), 1 submitted by ALICE and 1 by CMS.
- Both ALARM tickets concerned CERN Databases. Both of them are in status ‘verified’.
- The GGUS monthly release took place on 2012/01/25. 18 test ALARMs were issued and analysed in Savannah:125144.
- Details follow…

6 ALICE ALARM -> voms-proxy-init hangs (GGUS:78739)
What time UTC             | What happened
2012/01/29 23:23 (Sunday) | GGUS ALARM ticket, automatic email notification to alice-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: Databases.
2012/01/29 23:34          | Service expert comments that the incident is related to the LCGR db hanging. Investigation in progress.
2012/01/29 23:34          | Operator records in the ticket that the db piquet was called.
2012/01/30 00:02          | Submitter confirms that, after the db hang was bypassed, VOMS and SAM services became available again.
2012/01/30 00:08          | “Today we have experienced some problems with the archiver processes on LCGR database, instance number 1. We do not know yet if the problem is related to some disk failures or an Oracle bug, this is still under investigation. The database hung completely around 00:40. I had to kill instance number 1 manually in order to get the database back. I have also disabled the archive logs backups as this seems to be the cause for the archiver processes hangs. …”
2012/01/30 08:46          | Solved (SAM/Nagios): host certificate regenerated. System works fine.
This is what is recorded in the ticket. However, it is neither a complete nor an accurate summary, owing to some confusion between multiple incidents and human error in updating (closing) the wrong ticket. IMHO a SIR would be useful in clarifying this.

7 CMS ALARM -> no connect to CMSR db from remote PhEDEx agents (GGUS:78843)
What time UTC    | What happened
2012/02/01 17:31 | GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: Databases.
2012/02/01 17:54 | Operator records in the ticket that the CMS piquet was called.
2012/02/01 18:07 | DB expert records in the ticket that the problem should be gone now. Waiting for submitter’s confirmation.
2012/02/01 18:55 | Submitter agrees and puts the ticket in status ‘solved’. He records that this is a temporary solution and that a detailed explanation and a permanent solution are pending. However, as he ‘verified’ the ticket the next day, no further details were ever recorded about the cause.
More info in IT C5 report (see slide notes): firewall misconfiguration immediately after Oracle 11g upgrade.
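The response times for the CMS alarm (GGUS:78843) can be read off the ticket timeline by subtracting the opening timestamp from each later entry. The sketch below is illustrative only: the event labels and the function name are assumptions made for this example, and the timestamps are the UTC values recorded on the slide.

```python
from datetime import datetime

# Timestamp format used on the slide (UTC).
FMT = "%Y/%m/%d %H:%M"

# Timeline of GGUS:78843 as recorded on the slide; labels are shortened
# paraphrases chosen for this example.
events = [
    ("2012/02/01 17:31", "ALARM ticket opened"),
    ("2012/02/01 17:54", "CMS piquet called"),
    ("2012/02/01 18:07", "DB expert reports problem should be gone"),
    ("2012/02/01 18:55", "ticket set to 'solved'"),
]


def minutes_since_open(timeline):
    """Return (label, minutes elapsed since the first event) for each event."""
    opened = datetime.strptime(timeline[0][0], FMT)
    result = []
    for stamp, label in timeline:
        elapsed = datetime.strptime(stamp, FMT) - opened
        result.append((label, int(elapsed.total_seconds() // 60)))
    return result


if __name__ == "__main__":
    for label, minutes in minutes_since_open(events):
        print(f"{minutes:3d} min  {label}")
```

On these timestamps the expert response came 36 minutes after the alarm and the ticket was set to ‘solved’ after 84 minutes.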

8 [Availability plots; annotated points: 0.1, 2.1, 2.2, 3.1, 4.1, 4.2]

9 Analysis of the availability plots: Week 09/01/2012
Common to experiments
  0.1 RAL - UNSCHEDULED OUTAGE
      Site in downtime during network/DNS problems
ATLAS
  2.1 CERN - SCHEDULED WARNING
      CASTORPUBLIC intervention & CASTOR Name Server upgrade
      CE-JobSubmit test failing with: Globus error 25: the job manager detected an invalid script status
      SRM-VOPut test was failing - now fixed and understood, no more tests against ATLAS castor instance
  2.2 TAIWAN
      CREAMCE-JobSubmit test failing with: BLAH error: submission command failed and hit job shallow retry count (1)
CMS
  3.1 CERN - SCHEDULED WARNING
      CASTORPUBLIC intervention & CASTOR Name Server upgrade
      SRM-VOPut test was failing
LHCb
  4.1 CERN - SCHEDULED WARNING
      CASTORPUBLIC intervention & CASTOR Name Server upgrade
      Problem with access to castor after downtime of 11th, problem fixed by castor team (GGUS:78103)
      SAM jobs failing when accessing CEs (GGUS:78185)
  4.2 IN2P3
      SAM issues for testing the shared software area (GGUS:78054)

10 [Availability plots; annotated points: 2.1, 2.2, 2.3]

11 Analysis of the availability plots: Week 16/01/2012
ATLAS
  2.1 CERN - UNSCHEDULED WARNING
      Draining nodes before reinstallation with EMI1 Update 11
      SAM tests were running against ATLAS castor instance; tests against the SRM EOSATLAS endpoint started running on 19th (GGUS:78367)
  2.2 All Sites
      CERN-PROD network issue, as reported in the ITSSB, caused problems for many ATLAS services. Problems observed until 10:45.
      All SAM tests failed because of the network problem
  2.3 IN2P3
      SRM VOGet test was failing.

12 [Availability plots; annotated points: 1.1, 3.1, 4.1]

13 Analysis of the availability plots: Week 23/01/2012
ALICE
  1.1 SARA
      CREAMCE Job submit and Direct Job submit tests were failing with the following error: ‘BLAH error: submission command failed /Permission denied-qsub: illegal -W value-’.
CMS
  3.1 TAIWAN
      SRM VO Put and VO Get tests were failing with the following error: ERROR: zero number of replicas.
LHCb
  4.1 RAL
      CREAMCE Job submit test was timing out with the following message: 'BrokerHelper: no compatible resources'.

14 [Availability plots; annotated points: 1.1, 1.2, 2.1, 3.1, 4.1, 4.2, 4.3]

15 Analysis of the availability plots: Week of 30/01/2012 – 05/02/2012
ATLAS
  1.1 All sites 31/01. (Green boxes). LCGR (SAM) database down for Oracle upgrade; daily average availability not available.
  1.2 RAL-LCG2 30/01. SRM-VOPUT tests failing 14.00-20.00. Site had registered downtime for update of ATLAS SRM, but between 10.00-12.00 (GOC:103720)
ALICE
  2.1 All sites 31/01. (Green boxes). LCGR (SAM) database down for Oracle upgrade; daily average availability not available.
CMS
  3.1 T1_TW_ASGC 04/02 & 05/02 (Green boxes). A problem with the CMS software installation procedure resulted in incorrect software tags being published. Not a site fault. (GGUS:126091)
LHCb
  4.1 All sites 31/01. (Green boxes). LCGR (SAM) database down for Oracle upgrade; daily average availability not available.
  4.2 LCG.GridKa.de 01/02-05/02. (Green boxes). SAM tests apparently unavailable for SRM-VOLsDir, VOLs and VODEL tests, all other tests for SRM endpoint passing without problem; VO confirms no problems observed with site.
  4.3 LCG.RAL.UK 31/01. SAM test jobs timing out with reports of ‘no compatible resources’ as a result of a batch-system misconfiguration (GGUS:78760)

16 [Availability plots; annotated points: 1.1, 1.2, 1.3, 1.4, 1.5, 2.1, 2.2, 2.3, 3.1, 3.2, 4.1, 4.2, 4.3, 4.4, 4.5]

17 Analysis of the availability plots: Week of 06/02/2012 – 12/02/2012
ATLAS
  1.1 IN2P3-CC. 07/02 (Green box). Scheduled downtime (GOCDB).
  1.2 IN2P3-CC. 08/02. Unscheduled downtime due to CREAMCE malfunction after batch upgrade.
  1.3 RAL. 08/02 (Green box). Scheduled downtime (GOCDB). Outage for intervention on core network within the RAL Tier1. Affects all services.
  1.4 RAL-LCG2. 11/02. SRMV2 endpoint test failures (e.g. ERROR: [SE][GetSpaceTokens][] httpg://srm-atlas.gridpp.rl.ac.uk:8443/srm/managerv2: CGSI-gSOAP running on samnag013.cern.ch reports Error reading token data header: Connection closed.) No downtime registered.
  1.5 SARA-MATRIX 07/02 (Green box). Scheduled downtime (GOCDB). All services down.
ALICE (NB: Test results for 06/02 not shown on this plot, but verified as ‘Green’ for all sites.)
  2.1 CCIN2P3. 07/02 (Green box). Scheduled downtime (GOCDB).
  2.2 RAL. 08/02 (Green box). Scheduled downtime (GOCDB). Outage for intervention on core network within the RAL Tier1. Affects all services.
  2.3 SARA 07/02 (Green box). Scheduled downtime (GOCDB). All services down.
CMS (NB: Test results for 06/02 not shown on this plot, but verified as ‘Green’ for all sites.)
  3.1 T1_FR_CCIN2P3. 07/02 (Green box). Scheduled downtime (GOCDB).
  3.2 T1_UK_RAL. 08/02 (Green box). Scheduled downtime (GOCDB). Outage for intervention on core network within the RAL Tier1. Affects all services.
LHCb (NB: Test results for 06/02 not shown on this plot, but verified as ‘Green’ for all sites.)
  4.1 LCG.IN2P3.fr. 07/02 (Green box). Scheduled downtime (GOCDB).
  4.2 LCG.IN2P3.fr. 08/02. Unscheduled downtime due to CREAMCE malfunction after batch upgrade.
  4.3 LCG.NIKHEF.nl 07/02 (Green box). Scheduled downtime (GOCDB). SRM endpoint down at SARA.
  4.4 LCG.RAL.uk. 08/02 (Green box). Scheduled downtime (GOCDB). Outage for intervention on core network within the RAL Tier1. Affects all services.
  4.5 SARA-MATRIX 07/02 (Green box). Scheduled downtime (GOCDB). All services down.

18 SIR by Area (Q4 2011)

19 Time to Resolution

20 “Serious” SIRs in Q4 2011

21 Conclusions
- The service is (chartreuse, pistachio, olive…)
- SIRs and alarms: details regarding any problem should preferably be entered into / attached to the corresponding GGUS ticket
  - New rule: if there is an alarm ticket (justified) and the resolution / follow-up are not in the ticket, they should be documented in a SIR
  - Quite probable that further investigation is required
- Usability of SUM: few or no exceptions; there are currently too many “patches” on the reports for them to be useful
- Change management: at least one “iceberg” ahead (EMI FTS deployment at Tier0 and Tier1s prior to 2012 data taking)
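The new rule for alarm tickets amounts to a simple predicate: a SIR is needed when the alarm was justified and the resolution or follow-up is not recorded in the ticket. The sketch below is one possible reading, not an official procedure; the class, field, and function names are hypothetical, and the flag values for the two 2012 alarm tickets are an interpretation of slides 6 and 7 rather than data taken from GGUS.

```python
from dataclasses import dataclass


@dataclass
class AlarmTicket:
    """Hypothetical minimal record of an alarm ticket for this example."""
    ticket_id: str
    justified: bool             # was the alarm a real (justified) one?
    resolution_in_ticket: bool  # are resolution / follow-up recorded in it?


def sir_required(ticket: AlarmTicket) -> bool:
    """Apply the rule: justified alarm without documented resolution -> SIR."""
    return ticket.justified and not ticket.resolution_in_ticket


# Assumed flag values, based on the notes on slides 6 and 7: neither ticket
# is described as containing a complete account of the resolution.
alarms_2012 = [
    AlarmTicket("GGUS:78739", justified=True, resolution_in_ticket=False),
    AlarmTicket("GGUS:78843", justified=True, resolution_in_ticket=False),
]

if __name__ == "__main__":
    for ticket in alarms_2012:
        print(ticket.ticket_id, "SIR required:", sir_required(ticket))
```

Under these assumptions both tickets would call for a SIR, in line with the introduction's note that SIRs are due for the 2 alarm tickets of 2012.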

