
1 WLCG Service Report
Dirk.Duellmann@cern.ch
~~~
WLCG Management Board, 7th June 2011

2 Introduction
This report covers the four weeks since the last MB report on 10 May 2011.
Four Service Incident Reports were received during this period:
- Slow transfers from/to ASGC (8 days)
- LFC outage after database update at RAL (1h)
- Power cut at ASGC (36h)
- Batch system instabilities at PIC (12h)

3 GGUS summary (4 weeks)

VO       User  Team  Alarm  Total
ALICE       8     0      2     10
ATLAS      11   148     14    173
CMS        18     4      3     25
LHCb        5    30      2     37
Totals     42   182     21    245
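The Total column is simply User + Team + Alarm for each VO, and the Totals row is the column-wise sum. A minimal Python sanity check of the reconstructed table, using only the numbers shown above:

```python
# Per-VO GGUS ticket counts for the 4-week period: (User, Team, Alarm, Total).
counts = {
    "ALICE": (8, 0, 2, 10),
    "ATLAS": (11, 148, 14, 173),
    "CMS":   (18, 4, 3, 25),
    "LHCb":  (5, 30, 2, 37),
}

# Each VO's Total should equal User + Team + Alarm.
for vo, (user, team, alarm, total) in counts.items():
    assert user + team + alarm == total, vo

# The Totals row should be the column-wise sum over the four VOs.
col_sums = [sum(row[i] for row in counts.values()) for i in range(4)]
assert col_sums == [42, 182, 21, 245]
print("GGUS summary table is internally consistent")
```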

4 Support-related events since last MB
A large number of test ALARMs were sent by the GGUS developers following the major release of 2011/05/25. Many things changed simultaneously at KIT around the GGUS network, the Remedy server and the mail infrastructure, which led to many unexpected problems in the workflow.
There were 7 real ALARM tickets since the 2011/05/10 MB (4 weeks): 6 submitted by ATLAS, 1 by CMS, all 'solved' and 'verified'. Notified sites were: CERN, RAL, PIC, SARA. Details follow…

5 ATLAS ALARM -> RAL: LFC down (GGUS:70435)

What time (UTC)            What happened
2011/05/10 12:11           GGUS ALARM ticket; automatic email notification to lcg-alarm@gridpp.rl.ac.uk AND automatic assignment to ROC_UK/Ireland.
2011/05/10 12:12           Automatic RAL operator acknowledgement registered in the ticket diary.
2011/05/10 12:23           Service manager puts the ticket in status 'solved'. The reason was an ACL problem following an Oracle patch installation while the site was declared 'at risk'.
2011/05/15 14:35 (SUNDAY)  Submitter puts the ticket to status 'verified'.

6 ATLAS ALARM -> CERN: lost AFS token on LSF batch nodes (GGUS:70450)

What time (UTC)   What happened
2011/05/10 15:40  GGUS ALARM ticket; automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/05/10 15:57  Operator acknowledges and records in the GGUS ticket that 'a responsible' was contacted. Recording who specifically is called is part of our procedures.
2011/05/10 16:00  Service mgr confirms in the ticket that the investigation started.
2011/05/10 16:38  Debug exchanges between service mgrs and the submitter, who says the problem is gone.
2011/05/10 16:48  Service mgr puts the ticket to status 'solved'. No diagnosis recorded.
2011/05/11 10:58  Submitter sets it to 'verified'.

7 ATLAS ALARM -> PIC: transfer errors (GGUS:70470)

What time (UTC)   What happened
2011/05/11 11:30  GGUS TEAM ticket; automatic email notification to lcg.support@pic.es AND automatic assignment to NGI_IBERGrid.
2011/05/11 12:17  Site sets the ticket to status 'solved' with the description 'restarted dCache srm'.
2011/05/11 16:57  Ticket 'solved' and 're-opened' by the site 3 times. The problem persists despite an additional SRM memory increase. Ticket upgraded to an ALARM! Email sent to tier1-alarms@pic.es. No automatic acknowledgement of the ALARM registration.
2011/05/11 20:45  Multiple exchanges between site mgr and shifter. PIC temporarily banned atlddm29.cern.ch. The experiment considered it wiser to exclude PIC from DDM for the day.
2011/05/12 13:15  Ticket set to 'solved' after fixing SRM timeout configuration parameters on atlddm29.cern.ch.

8 CMS ALARM -> CERN: can't open files via xrootd, jobs fail (GGUS:70434)

What time (UTC)   What happened
2011/05/10 12:05  GGUS ALARM ticket; automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/05/10 12:10  Submitter provides more evidence and the error message.
2011/05/10 12:15  Castor service expert starts the investigation.
2011/05/10 12:27  Operator acknowledges and records in the GGUS ticket that the Castor piquet was called. Submitter/expert exchanges already on-going.
2011/05/10 20:53  After many exchanges between 3 experts and the submitter, and a workaround suggested around 17:00, the ticket is set to status 'solved' while 'the bug is being escalated'. No reference in the 'Related issue' field!
2011/05/17 07:53  The submitter sets it to 'verified'. Will there be a follow-up? Where?

9 ATLAS ALARM -> CERN: acrontab service down (GGUS:70735)

What time (UTC)   What happened
2011/05/19 18:40  GGUS ALARM ticket; automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/05/19 18:46  Service mgr records in the ticket that the problem has been observed for some minutes but seems to have gone away.
2011/05/19 18:51  Operator records that email was sent to the AFS admins, but it is known that the service mgr is already working on this.
2011/05/19 20:19  AFS service responsible reports that the AFS server hosting acron crashed. An expert is trying to bring it back up, or else to move the acron service to another host.
2011/05/20 01:35  AFS service expert partially restores the service.
2011/05/20 12:53  AFS and acron services fully back in production. Ticket is set to 'solved' and shortly afterwards to 'verified'.

10 ATLAS ALARM -> CERN: Castor data retrieval fails (GGUS:70774)

What time (UTC)            What happened
2011/05/22 11:04 (SUNDAY)  GGUS TEAM ticket; automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/05/22 11:22           Operator records in the ticket that the Castor piquet was called.
2011/05/22 11:45           Service mgr finds that the relevant diskserver does not respond even though monitoring shows it as OK. Because he cannot log in to the host, he reboots it and sets the ticket to 'solved'.
2011/05/22 12:33           Submitter sets the ticket to 'verified'.

11 ATLAS ALARM -> SARA: LFC down (GGUS:71028)

What time (UTC)            What happened
2011/05/29 09:49 (SUNDAY)  GGUS TEAM ticket; automatic email notification to grid.support@sara.nl AND automatic assignment to NGI_NL.
2011/05/30 06:53           Site mgr reports that a network problem prevents the server from reaching its database.
2011/05/30 08:53           Site sets the ticket to 'solved'.
2011/06/04 16:18           Submitter sets the ticket to status 'verified'.

12 SIR - Link down between Chicago and Amsterdam affecting ASGC

Analysis
The Yellow/AC-2 submarine cable system was seriously broken and all of its services were down. Since there was not sufficient capacity to provide a backup service for such an amount of users, ASGC had to wait until the submarine cable system was fixed, or partially fixed.

Timeline
May 1, 13:33 UTC: Amsterdam router down.
May 1, 17:34 UTC: Carrier reported that the event was caused by a submarine cable cut.
[…]
May 3, 08:15 UTC: Vendor committed to recover the link on May 10.
May 8, 12:45 UTC: ASGC 10G link was lit up. All LHCOPN traffic is routed back to the 10G link.
May 8, 12:48 UTC: Carrier confirmed that the link is up again.

Future plan
Another dedicated 2.5Gb link from Taiwan to CERN is expected to be online around August 15, 2011, serving as an online backup link for the ASGC LHCOPN (12.5Gb bandwidth in total when there is nothing wrong).

13 SIR - LFC Outage After DB Update at RAL (1h)

Description
Following a planned update of the Oracle databases behind the LFC and FTS services, the client applications were unable to connect to the database. This was traced to a problem with Oracle ACLs. The particular ACLs were removed and the services resumed after an outage of about 1 hour.

Timeline
10th May 12:00  Start of scheduled Warning/At Risk in the GOC DB.
10th May 12:05  Start of updating the Oracle database.
10th May 12:30  First Nagios failure (indicating a problem with FTS).
10th May 13:00  End of updating the Oracle database.
10th May 13:12  Notification received of the arrival of a GGUS Alarm ticket from ATLAS.
10th May 13:12  Oracle ACL list removed.
10th May 13:23  Nagios tests clearing.
10th May 13:23  GGUS Alarm ticket responded to and marked as solved.
10th May 14:00  End of scheduled Warning/At Risk in the GOC DB.

Follow Up
Root cause understood. The ACLs using DNS names have been removed. Although the databases are protected by unique passwords, the ACLs should be restored in this case, but using IP addresses rather than DNS names. There are no ACLs applied on other databases. Update procedures so that relevant changes made by the Database Team are added to the Tier1 ELOGger and thus become visible to the wider team.
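The SIR does not name the exact Oracle ACL mechanism, so the sketch below only illustrates the follow-up action in general terms: keep the same access list but express it as IP addresses, resolved once offline, instead of DNS names that have to be resolved at connection time. The host names are hypothetical placeholders, not the real RAL hosts.

```python
import socket

# Hypothetical DNS-based ACL entries (placeholders only).
dns_acl = ["lfc-frontend.example.org", "fts-frontend.example.org"]

def to_ip_acl(hostnames):
    """Resolve each ACL host name once and return fixed IP entries."""
    ips = set()
    for name in hostnames:
        try:
            # getaddrinfo may return several addresses (IPv4/IPv6, multiple A records).
            for info in socket.getaddrinfo(name, None):
                ips.add(info[4][0])
        except socket.gaierror:
            # An unresolvable name is exactly the run-time dependency
            # that an IP-based ACL entry avoids.
            print(f"could not resolve {name}; entry must be fixed by hand")
    return sorted(ips)

print(to_ip_acl(dns_acl))  # entries to use in the IP-based ACL
```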

14 SIR - Power Outage at ASGC

Description
On 21 May the ASGC Data Center experienced a power failure due to a circuit breaker trip at the power substation in Academia Sinica. The power cut happened at 21:16:00; the DC power generator was up and running at 21:16:26 and DC power was recovered at 21:50:00. Apart from networking, database and critical systems, all ASGC services were affected.

Timeline
21-May 21:16 UTC - Power lost in the data center. Many institutes of Academia Sinica were affected by this event, including the ASGC Tier1 center.
21-May 21:17 UTC - Power generator operating; services powered on; started to check and recover services.
21-May 21:50 UTC - Power was restored.
22-May 02:59 UTC - All WLCG services were recovered except LFC and FTS, because of a problem in the SAN switch.
22-May 08:40 UTC - Fixed the SAN switch problem; started to recover LFC and FTS.
22-May 09:50 UTC - LFC was back, but the FTS service was still not working because of one corrupted data file.
23-May 01:51 UTC - FTS DB problem was solved; the FTS service was reviewed and checked.
23-May 07:32 UTC - FTS was back. Transfers started.
23-May 09:00 UTC - Downtime ended; all ASGC services back online.

Follow Up
Both the SAN switch and the FTS DB should be protected by UPS, but the UPS also failed during the power outage. The faulty UPS has been sent for repair and all critical systems have now been relocated to other UPS units.

15 SIR - Batch System instabilities at PIC

Description
The incident was caused by collateral effects of the security challenge SSC5 (email from the security officers received on Wed 25/May around 15:00 UTC). As a result of tracking the malicious jobs and finding compromised DNs, the stipulated protocol recommended unplugging the network of the affected nodes. About 55 Worker Nodes (WNs) were affected, all of them embedded in blade centers. The computing problem arose when setting the network interfaces of the affected WNs down, and it caused computing service instabilities until Thu 26/May 06:55 UTC.

Timeline
2011/05/25 15:00  Mail received from the NGI security officers alerting about SSC5 (formal start of the exercise).
2011/05/25 15:05  Start of the investigation of the involved DNs.
2011/05/25 15:15  DN ban process started on all grid services (LFC, FTS, SRM, CE).
2011/05/25 16:40  Distributed command issued to isolate the affected nodes (55 WNs); unexpected disruptions start to be noticed on unaffected WNs.
2011/05/25 18:00  Engineers start working on the computing services to recover the nodes. The batch system (Torque) was affected by the instabilities and was in a non-responsive state.
2011/05/25 22:00  Situation seemed to reach a stable state. The majority of WNs recovered (except the ones affected by SSC5; the protocol recommendation is to reinstall them).
2011/05/25 23:15  The batch system became unstable again as a result of the many WNs that were down. The non-responsive state caused SAM probes to fail during the next 7 hours.
2011/05/26 06:55  The jobs that caused the batch system instabilities were cleaned manually and the computing services were recovered.

Follow Up
A new procedure to take down nodes has been implemented and tested.

16 SIR - Data Loss at KIT

Impact
Loss of 4356 files of the ALICE VO. The list of file names was reported to ALICE. All data has been recovered by copying the files from other sites.

Timeline
06-04-2011 05:58  Broken HDD reported by the monitoring software.
06-04-2011 06:02  Second broken HDD reported by the monitoring software.
06-04-2011 06:04  Rebuild is not started because of a faulty ADP unit.
06-04-2011 09:25  Operator contacts the vendor and receives instructions to recover.
[…]
07-04-2011 09:00  Notified ALICE contacts.
[…]
19-04-2011 14:00  Replacement of the ADP unit; recovery of the content started.
20-04-2011 12:00  Storage gradually placed online.
31-05-2011 16:31  After several mails between vendor and manufacturer it is finally concluded that a third disk failed.

Analysis
The cause of the problem was several broken disks in the same disk array. For still unknown reasons, 2 disks failed within minutes. The controller started using a third parity disk, but it was flagged as faulty with read errors. Surface/media errors on disks may go undetected until the content is actually read during the rebuild of a degraded RAID array. The problem was amplified by the fact that circuitry in one of the disk enclosures, an ADP unit, was also reporting errors. The ADP unit had to be replaced before further analysis and recovery could be performed.
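The failure mode described above, a latent media error that only surfaces when a degraded array is rebuilt, can be illustrated with a small, purely illustrative Python sketch of single-parity (XOR) reconstruction. This is not KIT's actual controller logic; it only shows why every surviving disk must be fully readable for the rebuild to succeed.

```python
import functools

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks, as used for single-parity RAID."""
    return bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Toy stripe across 3 data disks plus 1 parity disk (contents are made up).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# If one data disk is lost, its block is rebuilt from the survivors plus parity...
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# ...but only if every surviving block can actually be read. A latent media
# error on one survivor (modelled here as None) leaves the stripe unrecoverable.
survivors = [data[0], None, parity]
if any(block is None for block in survivors):
    print("stripe lost: read error on a surviving disk during the rebuild")
```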

17 [Availability plots for the week of 09/05/2011; annotation 4.1 is discussed on the next slide]

18 Analysis of the availability plots: Week of 09/05/2011
LHCb
4.1 LCG2.GRIDKA.de, LCG2.SARA.nl: Space token seems not to be accessible; problems due to an internal cleaning campaign of "dark storage" (should be OK after the cleaning).

19 [Availability plots for the week of 16/05/2011; annotations 3.1, 4.1 and 4.2 are discussed on the next slide]

20 Analysis of the availability plots: Week of 16/05/2011
CMS
3.1 T1_TW_ASGC: Unscheduled downtime for the Taiwan T1 and T2 due to an unexpected power surge.
LHCb
4.1 LCG.IN2P3.fr: Most jobs failed with "jobs stalled; pilot aborted" (GGUS:70788).
4.2 LCG.SARA.nl: Problems caused by changes in the space token definitions which were not done in sync between the site and LHCb.

21 [Availability plots for the week of 23/05/2011; annotations 1.1-1.2, 2.1, 3.1-3.2 and 4.1-4.6 are discussed on the next slide]

22 Analysis of the availability plots: Week of 23/05/2011
ATLAS
1.1 IN2P3: Scheduled downtime affecting SRM, CEs, FTS and LFC.
1.2 SARA MATRIX: LFC down due to a spanning tree problem in part of SARA's network (GGUS:71028).
ALICE
2.1 All sites: Availability of the T0 and T1s red in the dashboard after changes in the SAM tests (one test removed, another one added). It looks like the new test was not executed; still investigating why. In the meantime the tests were rolled back to the previous state.
CMS
3.1 IN2P3: Scheduled downtime affecting SRM, CEs, FTS and LFC.
3.2 ASGC: Job submissions with the production role failing with the Maradona error; the underlying cause is still under investigation.
LHCb
4.1 LCG.IN2P3.fr: Most jobs failed with "jobs stalled; pilot aborted" (GGUS:70788).
4.2 LCG.IN2P3.fr: A transparent intervention turned out to be non-transparent due to a problem with a draining script.
4.3 LCG.IN2P3.fr: LFC RO mirror is down; as a result most MC jobs from French sites failed to upload data files.
4.4 LCG.RAL.uk: Small glitch in availability; it seems that for some short period the running Nagios jobs were failing. Under investigation.
4.5 LCG.PIC.es: Scheduled downtime to drain and decommission an lcg-CE.
4.6 LCG.SARA.nl: SARA LFC server not reachable (GGUS:71042).

23 [Availability plots for the week of 30/05/2011; annotations 1.1-1.3, 2.1-2.2, 3.1 and 4.1-4.2 are discussed on the next slide]

24 Analysis of the availability plots: Week of 30/05/2011
ATLAS
1.1 IN2P3: High job failure rate related to the setup time, due to AFS slowness (GGUS:71032).
1.2 TAIWAN-LCG2: Faulty tape; impossible to recover the files (GGUS:70763).
1.3 TAIWAN-LCG2: Unknown; under investigation.
ALICE
2.1 All sites: Availability of the T0 and T1s red in the dashboard after changes in the SAM tests (one test removed, another one added). It looks like the new test was not executed; still investigating why. In the meantime the tests were rolled back to the previous state.
2.2 NIKHEF: VOBox SAM tests failing at the site (GGUS:71155). The DN of the host changed and had to be registered again.
CMS
3.1 IN2P3: CMSSW release not properly installed (GGUS:71244).
LHCb
4.1 LCG.IN2P3.fr: LFC RO mirror is down; as a result most MC jobs from French sites failed to upload data files.
4.2 LCG.PIC.es: Small glitch for a short period of time; still under investigation.

25 Additional Topics
- A discussion on the continued use of the S2 SRM integration tests took place between the storage element providers and will be reported at the next T1 service coordination meeting. There is general agreement on the way forward and on the procedure for maintenance and test evolution.
- High load on the CERN KDC has been observed repeatedly; a few individual ATLAS users have been alerted. Analysis of the root cause is still ongoing. The long chain of parties involved (analysis user, experiment framework, POOL, ROOT, XROOT plugin, batch system, KDC service) makes problem analysis and resolution time consuming.
- We may need to streamline the communication chain for problems induced by client software; the daily operations meeting may not yet reach all involved parties efficiently.
- We may need a procedure for bug fixing (by now) "older" software releases which only now enter production at increasing scale.

