WLCG Service Report ~~~ WLCG Management Board, 7th June 2011

Introduction

This report covers the four weeks since the last MB report on 10 May 2011. Four Service Incident Reports were received during this period:
- 8 days of slow transfers from/to ASGC
- 1h LFC outage after a database update at RAL
- 36h power cut at ASGC
- 12h batch system instabilities at PIC

GGUS summary (4 weeks)

VO       User   Team   Alarm   Total
ALICE       8      0       2      10
ATLAS
CMS
LHCb
Totals

Support-related events since last MB

A large number of test ALARMs were sent by the GGUS developers following the major release of 2011/05/25. Many things changed simultaneously at KIT around the GGUS network, the Remedy server and the mail infrastructure, which led to many unexpected problems in the workflow.

There were 7 real ALARM tickets since the 2011/05/10 MB (4 weeks): 6 submitted by ATLAS and 1 by CMS, all 'solved' and 'verified'. Notified sites were CERN, RAL, PIC and SARA. Details follow, after the short lifecycle sketch below…
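Each drill table below walks a ticket through the same few GGUS states. As a reading aid only, here is a minimal Python sketch of that lifecycle; the state names are taken from the tables, and the transition map is an assumption inferred from these drills, not GGUS's actual implementation.

    from enum import Enum

    class Status(Enum):
        SUBMITTED = "submitted"      # ALARM/TEAM ticket created, site notified
        IN_PROGRESS = "in progress"  # operator or service manager acknowledged
        SOLVED = "solved"            # a fix (or workaround) is recorded
        REOPENED = "re-opened"       # the problem persisted (see the PIC ticket)
        VERIFIED = "verified"        # submitter confirmed; terminal state

    # Transitions observed in the drills below (assumed, not official).
    ALLOWED = {
        Status.SUBMITTED: {Status.IN_PROGRESS, Status.SOLVED},
        Status.IN_PROGRESS: {Status.SOLVED},
        Status.SOLVED: {Status.VERIFIED, Status.REOPENED},
        Status.REOPENED: {Status.IN_PROGRESS, Status.SOLVED},
        Status.VERIFIED: set(),
    }

    def transition(current: Status, new: Status) -> Status:
        """Move a ticket to a new status, rejecting illegal jumps."""
        if new not in ALLOWED[current]:
            raise ValueError(f"illegal transition: {current.value} -> {new.value}")
        return new

The PIC ticket below, for instance, cycles 'solved' -> 're-opened' three times before finally ending 'verified'.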

ATLAS ALARM -> RAL: LFC down (GGUS:70435)

What time (UTC)    What happened
2011/05/10 12:11   GGUS ALARM ticket, automatic e-mail notification to the site AND automatic assignment to ROC_UK/Ireland.
2011/05/10 12:12   Automatic RAL operator acknowledgement registered in the ticket diary.
2011/05/10 12:23   Service manager puts the ticket in status 'solved'. The reason was an ACL problem following an Oracle patch installation while the site was declared 'at risk'.
2011/05/15 14:35 (SUNDAY)   Submitter puts the ticket to status 'verified'.

ATLAS ALARM -> CERN: lost AFS token on LSF batch nodes (GGUS:70450)

What time (UTC)    What happened
2011/05/10 15:40   GGUS ALARM ticket, automatic e-mail notification AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/05/10 15:57   Operator acknowledges and records in the GGUS ticket that 'a responsible' was contacted. Recording who specifically was called is part of our procedures.
2011/05/10 16:00   Service manager confirms in the ticket that the investigation has started.
2011/05/10 16:38   Debugging exchanges between the service managers and the submitter, who says the problem is gone.
2011/05/10 16:48   Service manager puts the ticket to status 'solved'. No diagnosis recorded.
2011/05/11 10:58   Submitter sets it to 'verified'.

ATLAS ALARM -> PIC: transfer errors (GGUS:70470)

What time (UTC)    What happened
2011/05/11 11:30   GGUS TEAM ticket, automatic e-mail notification AND automatic assignment to NGI_IBERGrid.
2011/05/11 12:17   Site sets the ticket into status 'solved' with the description 'restarted dCache srm'.
2011/05/11 16:57   Ticket 'solved' and 're-opened' by the site 3 times. The problem persists despite the additional SRM memory increase. Ticket upgraded to an ALARM, with notification sent to the site's tier1- list. No automatic acknowledgement of the ALARM registration.
2011/05/11 20:45   Multiple exchanges between the site manager and the shifter. PIC temporarily banned atlddm29.cern.ch. The experiment considered it wiser to exclude PIC from DDM for the day.
2011/05/12 13:15   Ticket set to 'solved' after fixing the SRM timeout configuration parameters on atlddm29.cern.ch.

CMS ALARM -> CERN: can't open files via xrootd, jobs fail (GGUS:70434)

What time (UTC)    What happened
2011/05/10 12:05   GGUS ALARM ticket, automatic e-mail notification AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/05/10 12:10   Submitter provides more evidence and the error message.
2011/05/10 12:15   Castor service expert starts the investigation.
2011/05/10 12:27   Operator acknowledges and records in the GGUS ticket that the Castor piquet was called. Submitter/expert exchanges already ongoing.
2011/05/10 20:53   After many exchanges between 3 experts and the submitter, and a workaround suggested around 17:00, the ticket is set to status 'solved' while 'the bug is being escalated'. No reference in the 'Related issue' field!
2011/05/17 07:53   The submitter sets it to 'verified'. Will there be a follow-up? Where?

ATLAS ALARM -> CERN: acrontab service down (GGUS:70735)

What time (UTC)    What happened
2011/05/19 18:40   GGUS ALARM ticket, automatic e-mail notification AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/05/19 18:46   Service manager records in the ticket that the problem has been observed for some minutes but seems to have gone away.
2011/05/19 18:51   Operator records that a notification was sent to the AFS admins, but it is known that the service manager is already working on this.
2011/05/19 20:19   AFS service responsible reports that the AFS server hosting acron crashed. An expert is trying to bring it back up, or else will move the acron service to another host.
2011/05/20 01:35   AFS service expert partially restores the service.
2011/05/20 12:53   AFS and acron services fully back in production. Ticket is set to 'solved' and shortly afterwards to 'verified'.

ATLAS ALARM -> CERN: Castor data retrieval fails (GGUS:70774)

What time (UTC)    What happened
2011/05/22 11:04 (SUNDAY)   GGUS TEAM ticket, automatic e-mail notification AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/05/22 11:22   Operator records in the ticket that the Castor piquet was called.
2011/05/22 11:45   Service manager finds that the relevant disk server doesn't respond even though monitoring shows it as OK. Unable to log in to the host, he reboots it and sets the ticket to 'solved'.
2011/05/22 12:33   Submitter sets the ticket to 'verified'.

ATLAS ALARM -> SARA: LFC down (GGUS:71028)

What time (UTC)    What happened
2011/05/29 09:49 (SUNDAY)   GGUS TEAM ticket, automatic e-mail notification AND automatic assignment to NGI_NL.
2011/05/30 06:53   Site manager reports that a network problem prevents the server from reaching its database.
2011/05/30 08:53   Site sets the ticket to 'solved'.
2011/06/04 16:18   Submitter sets the ticket to status 'verified'.

SIR - Link down between Chicago and Amsterdam affecting ASGC

Analysis
The Yellow/AC-2 submarine cable system was seriously broken and all services running over it were down. Since there was not sufficient capacity to provide a backup service for that number of users, ASGC had to wait until the submarine cable system was fixed, or partially fixed.

Timeline
May 1, 13:33 UTC: Amsterdam router down.
May 1, 17:34 UTC: Carrier reported that the event was caused by a submarine cable cut.
[…]
May 3, 08:15 UTC: Vendor committed to recovering the link by May 10.
May 8, 12:45 UTC: ASGC 10G link was lit up. All LHCOPN traffic routed back to the 10G link.
May 8, 12:48 UTC: Carrier confirmed that the link was up again.

Future plan
Another dedicated 2.5Gb link from Taiwan to CERN should come online around August 15, 2011, serving as an online backup link for the ASGC LHCOPN (10Gb + 2.5Gb = 12.5Gb total bandwidth when both links are healthy).

SIR - LFC Outage After DB Update at RAL (1h)

Description
Following a planned update of the Oracle databases behind the LFC and FTS services, the client applications were unable to connect to the database. This was traced to a problem with Oracle ACLs. The particular ACLs were removed and the services resumed after an outage of about 1 hour.

Timeline
10th May 12:00   Start of scheduled Warning/At Risk in the GOC DB.
10th May 12:05   Start of updating the Oracle database.
10th May 12:30   First Nagios failure (indicating a problem with FTS).
10th May 13:00   End of updating the Oracle database.
10th May 13:12   Notification received of the arrival of a GGUS ALARM ticket from ATLAS.
10th May 13:12   Oracle ACL list removed.
10th May 13:23   Nagios tests clearing.
10th May 13:23   GGUS ALARM ticket responded to and marked as solved.
10th May 14:00   End of scheduled Warning/At Risk in the GOC DB.

Follow up
Root cause understood. The ACLs using DNS names have been removed. Although the databases are protected by unique passwords, the ACLs should be restored in this case, but using IP addresses rather than DNS names. There are no ACLs applied on other databases. Update procedures so that relevant changes made by the Database Team are added to the Tier1 ELOGger and become visible to the wider team.
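The SIR does not name the exact Oracle mechanism, but "ACLs using DNS names" that block client connections behave like listener-side valid-node checking, where each invited node is resolved by name. A hypothetical sqlnet.ora sketch of that kind of rule (host names and IP addresses invented for illustration):

    # sqlnet.ora -- hypothetical example of a DNS-based client ACL
    TCP.VALIDNODE_CHECKING = YES
    # Fragile: every entry depends on a DNS lookup succeeding
    TCP.INVITED_NODES = (lfc-front.example.ac.uk, fts-front.example.ac.uk)
    # What the follow-up recommends instead: plain IP addresses
    # TCP.INVITED_NODES = (192.0.2.21, 192.0.2.22)

With name-based entries, a DNS hiccup or an entry that no longer resolves after a change can lock out legitimate clients exactly as described above; IP-based entries remove that dependency.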

SIR - Power Outage at ASGC

Description
On 21 May, the ASGC data centre experienced a power failure due to a circuit-breaker trip at the power substation in Academia Sinica. The power cut happened at 21:16:00; the DC power generator was up and running at 21:16:26, and DC power was recovered at 21:50:00. Apart from networking, database and other critical systems, all ASGC services were affected.

Timeline
21-May 21:16 UTC   Power lost in the data centre. Many institutes of Academia Sinica were affected by this event, including the ASGC Tier-1 centre.
21-May 21:17 UTC   Power generator operating; services powered on; started to check and recover services.
21-May 21:50 UTC   Power was restored.
22-May 02:59 UTC   All WLCG services were recovered except LFC and FTS, because of a problem in the SAN switch.
22-May 08:40 UTC   Fixed the SAN switch problem; started to recover LFC and FTS.
22-May 09:50 UTC   LFC was back, but the FTS service was still not working because of one corrupted data file.
23-May 01:51 UTC   FTS DB problem was solved; the FTS service was reviewed and checked.
23-May 07:32 UTC   FTS was back. Transfers started.
23-May 09:00 UTC   Downtime ended; all ASGC services back online.

Follow up
Both the SAN switch and the FTS DB should be protected by UPS, but the UPS also failed during the power outage. The faulty UPS was sent for repair and all critical systems have been relocated to other UPS units.

SIR - Batch System Instabilities at PIC

Description
The incident was caused by collateral effects of the security challenge SSC5 (announced by the NGI security officers on Wed 25 May around 15:00 UTC). As a result of the tracking of malicious jobs and the compromised DNs found, the stipulated protocol recommends unplugging the network of the affected nodes. About 55 worker nodes (WNs) were affected, all of them embedded in blade centres. The computing problem arose when trying to take the network interfaces of the affected WNs down, and it caused computing service instabilities until Thu 26 May 06:55 UTC. (A hypothetical sketch of such a distributed isolation command is shown after this SIR.)

Timeline
2011/05/25 15:00   Mail received from the NGI security officers alerting about SSC5 (formal start of the exercise).
2011/05/25 15:05   Start of the investigation of the involved DNs.
2011/05/25 15:15   DN ban process started on all grid services (LFC, FTS, SRM, CE).
2011/05/25 16:40   Distributed command issued to isolate the affected nodes (55 WNs); started noticing unexpected disruptions on unaffected WNs.
2011/05/25 18:00   Engineers start working on the computing services to recover the nodes. The batch system (Torque) was affected by the instabilities and was in a non-responsive state.
2011/05/25 22:00   Situation seemed to reach a stable state. The majority of WNs were recovered (except the ones affected by SSC5; the protocol recommendation is to reinstall them).
2011/05/25 23:15   The batch system became unstable as a result of the many WNs that were down. The non-responsive state caused SAM probes to fail during the next 7 hours.
2011/05/26 06:55   The jobs that caused the batch system instabilities were cleaned up manually and the computing services were recovered.

Follow up
New procedure to take down nodes implemented and tested.
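As referenced in the description above, this is a minimal sketch of the kind of "distributed command" the timeline mentions, taking the network interface of each affected WN down over ssh. The host names and interface name are invented for illustration; this is not PIC's actual procedure.

    # Isolate a list of compromised worker nodes by downing their NIC.
    # Blade-centre side effects like the ones PIC saw are exactly why
    # such a command needs a dry run and careful scoping.
    import subprocess

    affected = [f"wn{i:03d}.example.es" for i in range(1, 56)]  # ~55 WNs

    for host in affected:
        subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "ip link set dev eth0 down"],
            check=False,  # carry on even if a node is already unreachable
        )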

SIR - Data Loss at KIT

Impact
Loss of 4356 files of the ALICE VO. The list of file names was reported to ALICE. All data has been recovered by copying the files from other sites.

Timeline
:58   Broken HDD reported by the monitoring software.
:02   Second broken HDD reported by the monitoring software.
:04   Rebuild not started because of a faulty ADP unit.
:25   Operator contacts the vendor and receives instructions to recover.
[…]
:00   ALICE contacts notified.
[…]
:00   Replacement of the ADP unit; recovery of the content started.
:00   Storage gradually placed online.
:31   After several mails between the vendor and the manufacturer it is finally concluded that a third disk had failed.

Analysis
The cause of the problem was several broken disks in the same disk array. For still unknown reasons, 2 disks failed within minutes of each other. The controller started using a third parity disk, but it was flagged faulty with read errors. Surface/media errors on disks may go undetected until the content is actually read during the rebuild of a degraded RAID array. The problem was amplified by the fact that circuitry in one of the disk enclosures, an ADP unit, was also reporting errors. The ADP unit had to be replaced before further analysis and recovery could be performed.
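The analysis point about undetected surface errors is general: latent bad sectors only show up when every sector is actually read, which is precisely what a rebuild does. Periodic scrubbing reads the whole array in advance. KIT's array was vendor hardware RAID (which typically offers a "patrol read" feature for the same purpose); purely as an illustration of the idea, on Linux software RAID (md) a scrub can be triggered through sysfs:

    # Trigger a "check" pass on every Linux md array so latent read
    # errors are found (and repaired from redundancy) before a rebuild
    # depends on every remaining sector being readable. Needs root.
    from pathlib import Path

    for md in Path("/sys/block").glob("md*"):
        action = md / "md" / "sync_action"
        if action.exists():
            action.write_text("check\n")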


Analysis of the availability plots: Week of 09/05/2011

LHCb
4.1  LCG2.GRIDKA.de, LCG2.SARA.nl: space token seems not to be accessible; problems due to an internal cleaning campaign of "dark storage" (should be OK after the cleaning).

Analysis of the availability plots: Week of 16/05/2011

CMS
3.1  T1_TW_ASGC: unscheduled downtime for the Taiwan T1 and T2 due to an unexpected power surge.

LHCb
4.1  LCG.IN2P3.fr: most jobs failed with "jobs stalled; pilot aborted" (GGUS:70788).
4.2  LCG.SARA.nl: problems caused by changes in the space token definitions, which were not done in sync between the site and LHCb.

Analysis of the availability plots: Week of 23/05/2011

ATLAS
1.1  IN2P3: scheduled downtime affecting SRM, CEs, FTS and LFC.
1.2  SARA MATRIX: LFC down due to a spanning-tree problem in part of SARA's network (GGUS:71028).

ALICE
2.1  All sites: availability of the T0 and T1s red in the dashboard after changes in the SAM tests (one test removed, another one added). It looks like the new test was not executed; still investigating why. In the meantime the tests were rolled back to the previous state.

CMS
3.1  IN2P3: scheduled downtime affecting SRM, CEs, FTS and LFC.
3.2  ASGC: job submissions with the production role failing with the Maradona error. The underlying cause is still under investigation.

LHCb
4.1  LCG.IN2P3.fr: most jobs failed with "jobs stalled; pilot aborted" (GGUS:70788).
4.2  LCG.IN2P3.fr: a transparent intervention turned out to be non-transparent due to a problem with a draining script.
4.3  LCG.IN2P3.fr: the LFC read-only mirror was down; as a result most MC jobs from French sites failed to upload data files.
4.4  LCG.RAL.uk: small glitch in availability; it seems that for some short period the running Nagios jobs were failing. Under investigation.
4.5  LCG.PIC.es: scheduled downtime to drain and decommission an lcg-CE.
4.6  LCG.SARA.nl: SARA LFC server not reachable (GGUS:71042).

Analysis of the availability plots: Week of 30/05/2011

ATLAS
1.1  IN2P3: high job failure rate related to setup time, due to AFS slowness (GGUS:71032).
1.2  TAIWAN-LCG2: faulty tape; impossible to recover the files (GGUS:70763).
1.3  TAIWAN-LCG2: unknown; under investigation.

ALICE
2.1  All sites: availability of the T0 and T1s red in the dashboard after changes in the SAM tests (one test removed, another one added). It looks like the new test was not executed; still investigating why. In the meantime the tests were rolled back to the previous state.
2.2  NIKHEF: VOBox SAM tests failing at the site (GGUS:71155). The DN of the host changed and had to be registered again.

CMS
3.1  IN2P3: CMSSW release not properly installed (GGUS:71244).

LHCb
4.1  LCG.IN2P3.fr: the LFC read-only mirror was down; as a result most MC jobs from French sites failed to upload data files.
4.2  LCG.PIC.es: small glitch for a short period of time; still under investigation.

Additional Topics

- A discussion on the continued use of the S2 SRM integration tests took place between the storage element providers and will be reported at the next T1 service coordination meeting. There is general agreement about the way forward and the procedure for maintenance and test evolution.
- High load on the CERN KDC has been observed repeatedly; a few individual ATLAS users have been alerted. Analysis of the root cause is still ongoing. The long chain of parties involved (analysis user -> experiment framework -> POOL -> ROOT -> XROOT plugin -> batch system -> KDC service) makes problem analysis and resolution time-consuming.
- We may need to streamline the communication chain for client-software-induced problems; the daily operations meeting may not yet reach all involved parties efficiently.
- We may need a procedure for fixing bugs in (by now) "older" software releases, which are only now entering production at increasing scale.