
1  GGUS summary (4 weeks)

VO       User   Team   Alarm   Total
ALICE       4      0       1       5
ATLAS      41    164       6     211
CMS        11      4       6      21
LHCb        7     48       3      58
Totals     63    216      16     295
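
As a quick consistency check, the row and column totals above add up; plain arithmetic, with the figures copied from the table:

```python
# GGUS tickets opened per VO over the 4-week period, copied from the table above.
tickets = {
    "ALICE": {"User": 4,  "Team": 0,   "Alarm": 1},
    "ATLAS": {"User": 41, "Team": 164, "Alarm": 6},
    "CMS":   {"User": 11, "Team": 4,   "Alarm": 6},
    "LHCb":  {"User": 7,  "Team": 48,  "Alarm": 3},
}

# Each VO's total is the sum across ticket types ...
row_totals = {vo: sum(counts.values()) for vo, counts in tickets.items()}
assert row_totals == {"ALICE": 5, "ATLAS": 211, "CMS": 21, "LHCb": 58}

# ... and the per-type and grand totals match the 'Totals' row.
col_totals = {t: sum(c[t] for c in tickets.values()) for t in ("User", "Team", "Alarm")}
assert col_totals == {"User": 63, "Team": 216, "Alarm": 16}
assert sum(row_totals.values()) == 295
```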

2  Support-related events since last MB

There were 12 real ALARM tickets since the 2011/10/11 MB (2 weeks): 5 submitted by ATLAS, 5 by CMS and 2 by LHCb. 5 ALARM tickets concerned CERN, 2 were for RAL, 1 for CNAF, 1 for FNAL (via OSG), 1 for GridKa and 2 for IN2P3.

17 test ALARM tickets were submitted by the GGUS developers on release day 2011/10/19, as part of the regular procedure. All parts of the process chain were tested successfully. Following this release, 16 additional GGUS Support Units (SUs), mostly 3rd level middleware support, were interfaced with the T0 local ticketing system Service Now (SNOW). This completed the GGUS-SNOW transition.

On 2011/10/22 'top priority' tickets started being switched automatically down to 'less urgent'; the cause was a bug in a SOAP web service. Case tracked in GGUS:75575.

On 2011/11/04 the GGUS-SNOW interface broke for ~2.5 days. Case tracked in GGUS:76052.

Details follow…
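
The GGUS:75575 symptom (top-priority tickets silently becoming 'less urgent') is the kind of thing a simple consistency check over the ticket change history can flag. A minimal sketch, assuming hypothetical history records with old/new priority and a 'changed_by' field; the real GGUS/SOAP data model is not described in this report:

```python
from dataclasses import dataclass

# Hypothetical shape of a ticket priority-change record, for illustration only.
@dataclass
class PriorityChange:
    ticket_id: int
    old_priority: str
    new_priority: str
    changed_by: str   # "system" would indicate an automatic change


def silent_downgrades(changes):
    """Return changes where a 'top priority' ticket was automatically
    downgraded to 'less urgent' with no human actor involved."""
    return [c for c in changes
            if c.old_priority == "top priority"
            and c.new_priority == "less urgent"
            and c.changed_by == "system"]


# Hypothetical example: one legitimate manual change, one suspicious automatic one.
history = [
    PriorityChange(111, "top priority", "less urgent", "supporter"),
    PriorityChange(222, "top priority", "less urgent", "system"),
]
assert [c.ticket_id for c in silent_downgrades(history)] == [222]
```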

3  CMS ALARM -> CERN: errors in read from Castor (GGUS:75218)

What time (UTC)     What happened
2011/10/11 13:01    GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem (ToP): File Access.
2011/10/11 13:02    Service expert assigns the ticket to himself.
2011/10/11 13:22    Submitter and expert contribute log entries in the ticket to demonstrate the error, which only appears in batch. Interactive reads work fine.
2011/10/11 13:26    The operator records in the ticket that "EOS support was called".
2011/10/11 14:15    Several exchanges conclude there is an authentication problem in batch mode.
2011/10/11 17:16    Indeed there was a problem with the Kerberos credentials' handling. Once fixed, the ticket was set to 'solved' and 'verified'.
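
The root cause here was Kerberos credential handling in batch jobs, while interactive reads worked. As a general illustration (not taken from the ticket), a batch job can fail fast when no valid ticket is present; a minimal sketch assuming an MIT Kerberos client, where 'klist -s' exits non-zero without valid credentials:

```python
import subprocess
import sys


def have_kerberos_ticket() -> bool:
    """Return True if the environment holds a valid Kerberos ticket.
    'klist -s' exits 0 only when a valid, unexpired credential cache exists;
    batch jobs often lack one even when interactive logins have it."""
    try:
        return subprocess.run(["klist", "-s"], check=False).returncode == 0
    except FileNotFoundError:
        return False  # no Kerberos client installed at all


if __name__ == "__main__":
    if not have_kerberos_ticket():
        # Fail fast with a clear message instead of an opaque read error later.
        sys.exit("No valid Kerberos credentials in this (batch) session")
    print("Kerberos credentials present; authenticated file access should work")
```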

4  ATLAS ALARM -> CERN: no T0 jobs due to Oracle COOL DB problems (GGUS:75234)

What time (UTC)     What happened
2011/10/12 01:30    GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: Databases.
2011/10/12 01:48    Operator records in the ticket that the Oracle piquet for the ATLAS RAC was contacted. Unfortunately, they called the data management piquet first, who pointed to the right service.
2011/10/12 07:53    5 comments contributed by the submitter and the shifter at P1 throughout the night, recording a lock of the ATLAS_COOL_READER_TZ account and other observations on load from sms.cern.ch.
2011/10/12 08:14    Grid services' expert assigns the ticket to the Physics DB group in SNOW.
2011/10/12 09:54    Expert sets the ticket to status 'solved', explaining the reason: "high load and fail-over to a system unable to cope".

5  LHCb ALARM -> GridKa: dCache not responding (GGUS:75261)

What time (UTC)     What happened
2011/10/12 16:12    GGUS ALARM ticket, automatic email notification to de-kit-alarm@scc.kit.edu AND automatic assignment to NGI_DE. ToP: Storage Systems.
2011/10/12 16:19    NGI_DE supporter sets the ticket in progress and records that the technician on call was contacted.
2011/10/12 19:26    FZK WLCG contact records in the ticket that dCache experts are investigating.
2011/10/12 20:01    Service expert sets the ticket to status 'solved', recording that there was a problem with the dCache pnfs service, which was restarted.
2011/10/12 20:46    Submitter sets the ticket to status 'verified'.

6  CMS ALARM -> OSG: FNAL facilities unreachable (GGUS:75267)

What time (UTC)     What happened
2011/10/13 06:17    GGUS ALARM ticket, automatic email notification to various paging addresses defined by the site AND automatic assignment to OSG(Prod). Automatic OSG-GOC ticket creation successful. ToP: Network problem.
2011/10/13 08:00    Site contact records in the ticket a known network problem at FNAL and people working on it.
2011/10/13 12:06    Site contact sets the ticket to status 'solved', recording that the CMS network switch was rebooted to clear the numerous errors recorded in the buffer log.

7  CMS ALARM -> CERN: vocms15 unreachable (GGUS:75510)

What time (UTC)     What happened
2011/10/20 06:52    GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: Local Batch System.
2011/10/20 07:03    Operator records in the ticket that the problem with vocms15 is known and the sys admin is working on it.
2011/10/20 08:47    Grid services' expert sets the ticket to status 'solved', recording that the sys admin fixed the problem (how?).

8  ATLAS ALARM -> RAL: Transfers fail – SRM down (GGUS:75597)

What time (UTC)     What happened
2011/10/22 04:12 (SATURDAY)  GGUS TEAM ticket, automatic email notification to lcg-support@gridpp.rl.ac.uk AND automatic assignment to NGI_UK. ToP: File Transfer.
2011/10/22 05:49    Ticket upgrade from TEAM to ALARM. Email notification sent to lcg-alarm@gridpp.rl.ac.uk.
2011/10/22 05:50    Automatic system reply for the alarm registration.
2011/10/22 05:53    New shifter provides additional information about SRM being down at RAL.
2011/10/22 until 13:42    The transfers restarted a few minutes after the ALARM was raised, but 6 comments were exchanged to explain that an Oracle node in the Castor DB rack had rebooted for an unknown reason.
2011/10/23 05:05 (SUNDAY)    Another Castor Oracle DB node crashed and did NOT reboot. GOCDB publishing was not possible. ATLAS blacklists RAL temporarily.
2011/10/24 11:57    All RAL Castor instances restored. Ticket set to status 'solved' with the debug info known so far.

9  ATLAS ALARM -> CNAF: LFC down (GGUS:75601)

What time (UTC)     What happened
2011/10/22 12:02 (SATURDAY)  GGUS TEAM ticket, automatic email notification to t1-alarms@cnaf.infn.it AND automatic assignment to NGI_IT. ToP: NONE. This is not an agreed value; see Savannah:124239.
2011/10/22 12:28    Site supporter takes the ticket and starts investigating.
2011/10/22 13:20    Site mgr records in the ticket that one frontend was found stuck with a kernel panic and another, apparently, out of memory. The service was restarted. The ticket was set to status 'solved'.
2011/10/22 19:13    Submitter sets the ticket to status 'verified'.

10  ATLAS ALARM -> IN2P3: exports from T0 fail (GGUS:75609)

What time (UTC)     What happened
2011/10/23 07:16 (SUNDAY)    GGUS TEAM ticket, automatic email notification to grid.admin@cc.in2p3.fr AND automatic assignment to NGI_FRANCE. ToP: File Transfer.
2011/10/23 08:09    Ticket upgrade to ALARM. GGUS automatic email notification sent to lhc-alarm@cc.in2p3.fr. Automatic acknowledgment by the 'LHC Alert' CC-IN2P3 response team recorded in the ticket.
2011/10/23 09:16    Site mgr records in the ticket that SRM problems were discovered and are being investigated.
2011/10/23 14:16    Following several comments exchanged with the site, the submitter sets the site 'offline' for data exports, in agreement with the supporter at the site, as the problem is taking long to fix.
2011/10/24 09:51    Ticket set to 'solved' as the SRM was back and had been observed for 15+ hrs. No explanation for the failure recorded in the ticket.

11  LHCb ALARM -> IN2P3: SRM not responding (GGUS:75610)

What time (UTC)     What happened
2011/10/23 11:31 (SUNDAY)    GGUS ALARM ticket, automatic email notification to lhc-alarm@cc.in2p3.fr AND automatic assignment to NGI_DE. ToP: Storage Systems.
2011/10/23 11:35    Automatic email acknowledgement of the ALARM by the site.
2011/10/23 11:37    Submitter pastes error msgs in the ticket (from voboxes and lxplus) for debugging.
2011/10/23 11:50    Site contact records that the problem is known and being investigated. Meanwhile the SRM will be declared in downtime.
2011/10/23 16:43    IN2P3 SRM operational again. After 1.5 hrs of checks, the ticket is set to 'solved'.

12  CMS ALARM -> CERN: Castor CMS T1 transfer pool (GGUS:75743)

What time (UTC)     What happened
2011/10/26 10:50    GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: File Transfer.
2011/10/26 11:04    Operator records in the ticket that the Castor piquet was contacted.
2011/10/26 11:07    Expert sets the ticket 'in progress'. Requests examples as problem evidence.
2011/10/26 11:33    Expert sets the ticket to status 'solved' after making sure that a simple overload and an FTS timeout were the reasons for the failing transfers.
2011/10/26 13:24    Submitter confirms and sets the ticket to status 'verified'.

13  ATLAS ALARM -> RAL: Transfers fail – SRM down (GGUS:75823)

What time (UTC)     What happened
2011/10/29 20:52 (SATURDAY)  GGUS TEAM ticket, automatic email notification to lcg-support@gridpp.rl.ac.uk AND automatic assignment to NGI_UK. ToP: File Transfer.
2011/10/29 21:01    Ticket upgrade from TEAM to ALARM. Email notification sent to lcg-alarm@gridpp.rl.ac.uk.
2011/10/29 21:02    Automatic system reply for the alarm registration.
2011/10/29 22:00    1st estimate gives high load as the problem cause. The site reduced the FTS file limits. This did NOT help.
2011/10/30 05:31 (SUNDAY)    The situation did not improve, so the site also limited the number of allowed ATLAS jobs.
2011/10/31 08:37    RAL ran at reduced capacity all Sunday and set the ATLAS instances in downtime in the GOCDB. Starting Monday the VO set the UK cloud to brokeroff to continue debugging with low load.
2011/11/02 16:06    The FTS channels' capacity was gradually set back to normal. Ticket 'solved' and soon afterwards 'verified'. SIR promised.
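
The last two entries describe throttling the FTS file limits down and later restoring the channels' capacity gradually. A minimal sketch of such a staged ramp-back, with hypothetical numbers and a hypothetical apply_limit callback; the actual FTS channel settings are not given in the report:

```python
def ramp_back(current: int, normal: int, step: int, apply_limit):
    """Restore a channel's concurrent-file limit from a reduced value back to
    its normal value in fixed-size steps, applying each intermediate setting
    before moving on."""
    while current < normal:
        current = min(current + step, normal)
        apply_limit(current)
    return current


# Hypothetical example: limit was cut to 10 files, normal is 50, raise by 10 per step.
stages = []
ramp_back(10, 50, 10, stages.append)
assert stages == [20, 30, 40, 50]
```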

14  CMS ALARM -> CERN: CEs failing SAM tests (GGUS:75833)

What time (UTC)     What happened
2011/10/30 14:14 (SUNDAY)    GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: Storage Systems.
2011/10/30 14:18    Storage Services' expert records in the ticket the beginning of the investigation.
2011/10/30 14:28    Operator records in the ticket that email was sent to it-dep-pes-ps-sms. This is not the right experts' list!
2011/10/30 until 15:00    A couple of comments exchanged with the operators, to indicate the right piquet that should have been called, and with the submitters, to get more info on the failing tests.
2011/10/30 15:06    Expert sets the ticket to status 'solved'. The file that the SAM tests need to find in a given directory in order to run successfully had been accidentally removed by CMS.
2011/10/30 18:41    Following a few comments of apologies, the submitter set the ticket to status 'verified'.

