WLCG Service Report ~~~ WLCG Management Board, 18th September 2012


1 WLCG Service Report Andrea.Valassi@cern.ch ~~~ WLCG Management Board, 18th September 2012

2 Introduction
- 3 relatively quiet weeks since the last MB report on August 28th
  - Smooth LHC operations, including proton-ion test run. Now in technical stop.
- No Service Incident Reports received.
  - One SIR expected: Accidental deletion on EOSCMS of 1.6M files (1PB) by an (unprivileged) CMS user. Several group-writeable areas deleted, only a minor fraction could be recovered. Permissions tightened, other preventive measures being reviewed.
- 3 real GGUS ALARMS, all at CERN
  - 1 for CMS (SRM down), 2 for ATLAS (slow LSF; slow migration to tape)
- Many other issues reported at the daily meetings, most notably:
  - Ongoing issues with Alcatel audioconf system. On average one remote user per day has been unable to connect to the meeting for the last two weeks. Under investigation (INC:158097), seems (at least partly) browser-related.
  - Oracle security patches installations. Also Castor upgrades and NAS migration.
  - Constant rate of aborted LHCb pilots for one week due to CERN batch issues.
  - SRM overload for ATLAS at PIC, related to ATLAS deletion policy.
  - Bug in CVMFS stratum ones at CERN, affecting mainly LHCb.
  - Storage issues in Denmark due to power supply problems.

3 GGUS summary (3 weeks)

VO      User  Team  Alarm  Total
ALICE      5     0      0      5
ATLAS     14   103      2    119
CMS        7     4      1     12
LHCb       5    31      0     36
Totals    31   138      3    172

4 Support-related events since last MB
- There have been 3 real ALARMs since the 2012/08/28 MB. They were submitted by ATLAS (2) and CMS (1). The site for all tickets was CERN.
- There have been no GGUS releases since the last MB due to summer holidays. The next one is planned for 2012/09/26.

5 CMS ALARM->CERN SRM DOWN GGUS:85530

What time UTC     What happened
2012/08/27 14:07  GGUS ALARM ticket opened, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/08/27 14:12  Operator records that the CASTOR piquet was called.
2012/08/27 14:16  Expert records in the ticket that there is an overload of 400 pending transfers in c2cms/t1transfer queue.
2012/08/30 12:38  10 comments exchanged between shifters, service experts and IT/ES CMS supporters. Excessive PhEDEx activity to Caltech was temporarily thought to be the problem cause but this was not the case.
2012/08/31 08:38  Ticket set to ‘solved’ with conclusion that test transfers had caused the overload. Automatic ticket closing took place after 3 working days.

6 ATLAS ALARM->CERN SLOW LSF GGUS:85556

What time UTC     What happened
2012/08/28 11:37  GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/08/28 11:49  The operator records in the ticket that the it-dep-pes-pes-sms e-group was informed.
2012/08/28 11:51  Service expert starts investigating. Heavy queries were found that slow down the job submission.
2012/08/30 06:56  Ticket set to ‘solved’ & ‘verified’ after exchange of 7 comments between supporters and shifters. The monitoring plots in a period of 1.5 days of supervision showed occasional spikes from bursts of job submission without any real problem.

7 ATLAS ALARM->CERN TAPE MIGRATION PROBLEM GGUS:85704

What time UTC     What happened
2012/08/31 23:12  (SATURDAY here already.) GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/08/31 23:26  The operator records in the ticket that the Castor piquet was informed.
2012/09/04 09:32  The service mgr and multiple shifters exchanged 40 comments throughout the night, the whole of Saturday and Monday. The problem was that inaccessible files were on a broken server that required vendor intervention, followed by a long time required to mount the disk.
2012/09/04 09:36  Service expert sets the ticket to ‘solved’ after reconfiguration of the tape system. Ten hours later, the shifter of the day set it to status ‘verified’.

8 [Reliability plots; annotated points: 2.1, 2.2, 3.1]

9 Analysis of the reliability plots: Week 27/08/2012
ATLAS:
- 2.1 BNL: Some transfer errors from T0 to BNL. A few percent of the transfers timed out.
- 2.2 SARA-MATRIX: Some inconsistency between dCache and BDII was noticed, [SE][StatusOfPutRequest][SRM_NO_FREE_SPACE]
CMS:
- 3.1 IN2P3: The site was busy and the MC tests didn’t run due to their lower priority compared to the production activity.

10 [Reliability plots; annotated points: 1.1, 3.1, 3.2]

11 Analysis of the reliability plots: Week of 03/09/2012
ALICE:
- 1.1 RAL 04/09 [Green]: Not a site problem – bug detected in SAM test. See GGUS #85794.
CMS:
- 3.1 IN2P3 05/09-08/09: CREAMCE-JobSubmit tests intermittently failing against cccreamceli05 & 06 with timeouts. No downtime registered; no relevant Savannah tickets found.
- 3.2 CNAF 09/09: Site problem with STORM storage element. See GGUS #85953 and Savannah #131937.

12 [Reliability plots; annotated points: 0.1, 3.1, 3.2, 3.3]

13 Analysis of the reliability plots: Week 10/09/2012
Common:
- 0.1 SARA: CREAMCE - JobSubmit test was failing due to timeouts and errors while loading glite libraries
CMS:
- 3.1 ASGC: CREAMCE - failures of Software-Installed test
- 3.2 ASGC: SRM - failures of the VOPut test
- 3.3 CNAF: SRM - intermittent failures of VOPut test due to moving of data from disk which causes delays, see GGUS & SAV tickets

14 Conclusions
- Business as usual – relatively quiet period
- Ongoing Alcatel issue preventing users from connecting

