1
GGUS summary (5 weeks)
VO      User  Team  Alarm  Total
ALICE      4     1      0      5
ATLAS     12   175      4    191
CMS       13     1      2     16
LHCb       8    33      1     42
Totals    37   210      7    254
2
Support-related events since last MB
There have been 6 real and 1 test ALARMs since the 2012/07/24 MB. All were submitted by ATLAS, CMS & LHCb. The site for all these tickets was CERN.
There have been no GGUS releases since the last MB due to summer holidays. The next one is planned for 2012/09/26.
6/15/2018 WLCG MB Report WLCG Service Report
3
LHCb ALARM->CERN->GridKA FTS Problems GGUS:84778
What time UTC  What happened
2012/08/03 07:31  GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Transfer.
2012/08/03 07:39  Operator records that it-dep-pes-ps-sms was ed.
2012/08/03 08:16  Grid service expert records in the ticket that the problem dates from 2012/07/27 and is handled via ticket GGUS:84550, assigned to NGI_DE.
2012/08/03 09:40  Ticket assigned to Castor supporters in case there is a dependency on the many “PrepareToGet” timeouts seen on srm-lhcb.
2012/08/17 03:00  Ticket set to ‘solved’ after an exchange of 9 comments. Transfers turned out to be slow because, with the service performing poorly, files started to be moved to tape and had to be fetched back from there. Increasing the LHCbTAPE pool size helped.
4
ATLAS ALARM->CERN SLOW LSF GGUS:84928
What time UTC  What happened
2012/08/06 20:57  GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/08/06 21:06  The operator records in the ticket that the service was informed (the ticket doesn’t mention which service).
2012/08/06 21:15  The expert starts working on the ticket but sees no problem, as bsub response time is around 100ms.
2012/08/08 09:50  Ticket set to ‘solved’ and very soon afterwards to ‘verified’ after an exchange of 8 comments, in which ATLAS was asked to apply some configuration changes while the service asked Platform to find the root cause (we don’t know what the outcome was).
5
CMS ALARM->CERN EOSCMS DOWN GGUS:84966
What time UTC  What happened
2012/08/08 07:38  GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/08/08 07:49  Service expert notes in the ticket’s internal diary a known problem from the day before with a looping nscd (restarted).
2012/08/08 08:02  Operator records that the sysadmin piquet was called.
2012/08/08 08:04  Expert adds a number of comments listing about 10 userIDs, each holding hundreds of sessions.
2012/08/08 09:18  Ticket set to ‘solved’ after EOSCMS MGM restart.
SIR
6
ATLAS ALARM->CERN SLOW LSF GGUS:84998
What time UTC  What happened
2012/08/08 17:25  GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/08/08 17:32  The operator records in the ticket that the service was informed (the ticket doesn’t mention which service).
2012/08/08 17:57  The submitter attaches various service plots. The expert starts working on the ticket but sees no problem, as bsub response time is < 100ms.
2012/08/10 15:04  Ticket set to ‘solved’ after an exchange of 19 comments between supporters and shifters, caused by discrepancies between the performance results measured by the service and by the experiment. It turned out that the units were different (1/100s vs 1ms).
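The unit mismatch in this ticket is easy to see in a small sketch: the same raw number differs by a factor of 10 depending on whether it is read in units of 1/100s or 1ms. The function name and the raw reading of 100 are hypothetical, chosen only for illustration; the ticket summary does not say which side used which unit.

```python
from fractions import Fraction

# Size of each unit in seconds; the two units are the ones named in the ticket.
UNIT_SECONDS = {
    "1/100s": Fraction(1, 100),
    "1ms": Fraction(1, 1000),
}

def to_milliseconds(reading, unit):
    """Convert a raw reading expressed in `unit` to milliseconds."""
    return reading * UNIT_SECONDS[unit] * 1000

raw = 100  # hypothetical raw reading, not a value from the ticket
print(to_milliseconds(raw, "1ms"))     # 100
print(to_milliseconds(raw, "1/100s"))  # 1000
```

Read in milliseconds the value looks healthy; read in hundredths of a second it is a full second, which is the kind of gap that can keep two parties disagreeing until the units are compared.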
7
ATLAS ALARM->CERN SLOW LSF GGUS:85058
What time UTC  What happened
2012/08/11 22:33 SATURDAY  GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/08/11 22:42  The operator records in the ticket that the service was informed (the ticket doesn’t mention which service).
2012/08/12 07:57 SUNDAY  The submitter attaches various service plots. The next shifter records in the ticket that things got worse during the night.
2012/08/12 12:05  Operator asks for patience while trying to reach the expert.
2012/08/12 15:23  Service expert found user jobs that blocked the system, killed them and set the ticket to ‘solved’.
8
CMS ALARM->CERN CASTOR DOWN GGUS:85398
What time UTC  What happened
2012/08/21 20:04  GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/08/21 20:18  Operator records that the Castor piquet was called.
2012/08/21 20:31  Service expert restarted the Castor transfermanager on one of the CASTORSRM headnodes.
2012/08/21 20:56  Ticket set to ‘solved’ as the service went back to normal.
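To compare the six ALARM timelines above, the time from "ticket opened" to "solved" can be computed from the first and last timestamps of each slide. This is a minimal sketch: the dictionary and function names are hypothetical, and the timestamps are copied from the slides (all UTC).

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# (opened, solved) timestamps per ticket, copied from the timelines above.
ALARMS = {
    "GGUS:84778": ("2012/08/03 07:31", "2012/08/17 03:00"),
    "GGUS:84928": ("2012/08/06 20:57", "2012/08/08 09:50"),
    "GGUS:84966": ("2012/08/08 07:38", "2012/08/08 09:18"),
    "GGUS:84998": ("2012/08/08 17:25", "2012/08/10 15:04"),
    "GGUS:85058": ("2012/08/11 22:33", "2012/08/12 15:23"),
    "GGUS:85398": ("2012/08/21 20:04", "2012/08/21 20:56"),
}

def time_to_solved(opened, solved):
    """Return the elapsed time between two timestamps as a timedelta."""
    return datetime.strptime(solved, FMT) - datetime.strptime(opened, FMT)

for ticket, (opened, solved) in ALARMS.items():
    print(ticket, time_to_solved(opened, solved))
```

The spread is wide: the two storage outages were closed in under two hours, while the FTS ticket stayed open for almost two weeks.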