1 VO User Team Alarm Total ALICE ATLAS CMS


1 GGUS summary (5 weeks)
VO      User  Team  Alarm  Total
ALICE      4     2      0      6
ATLAS     18   164      5    187
CMS       18     8      3     29
LHCb      13    39      0     52
Totals    53   213      8    274
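The row and column sums of the summary can be cross-checked with a short script. This is a minimal sketch, not part of the original report. It assumes that blank Alarm cells mean 0 and that the CMS row reads 18 User, 8 Team and 3 Alarm tickets, the only reading consistent with the printed totals of 53, 213 and 274 and with the 8 ALARMs (5 ATLAS, 3 CMS) reported for the period.

```python
# Ticket counts per VO as (User, Team, Alarm), read from the summary table.
# Assumptions: blank Alarm cells are taken as 0; the CMS row is read as 18/8/3.
counts = {
    "ALICE": (4, 2, 0),
    "ATLAS": (18, 164, 5),
    "CMS": (18, 8, 3),
    "LHCb": (13, 39, 0),
}

# Total per VO (sum across a row).
row_totals = {vo: sum(c) for vo, c in counts.items()}

# Total per ticket type (sum down each column): (User, Team, Alarm).
column_totals = tuple(sum(col) for col in zip(*counts.values()))

grand_total = sum(column_totals)

print(row_totals)
print(column_totals, grand_total)
```

Under these assumptions the per-VO totals come out as 6, 187, 29 and 52, and the grand total as 274, matching the table.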

2 Support-related events since last MB
There have been 8 real ALARMs since the 2012/10/16 MB. 3 were submitted by CMS and 5 by ATLAS. 1 concerned IN2P3, the rest the CERN site. A GGUS Release took place since the last MB (on 2012/10/28). All ALARM tests were successful (operators received notification, reacted within minutes, interfaces worked, experts closed promptly). Detailed analysis in 8/3/2018 WLCG MB Report WLCG Service Report

3 CMS ALARM->CERN job failure due to castor GGUS:88507
Time (UTC)        What happened
2012/10/16 07:50  GGUS TEAM ticket opened, automatic notification sent and ticket automatically assigned to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/10/16 08:11  Monitoring shows all is normal but files are NOT transferred from P5. Ticket upgraded to ALARM and notification sent.
2012/10/16 08:23  The service manager started investigating immediately, while operators recorded in the ticket that “the CASTOR piquet was called”.
2012/10/16 13:17  Ticket set to ‘solved’ after an exchange of 8 comments, mainly names of files stuck at P5 to help debugging. The solution contained no explanation. Possible cause: diskserver draining (many small files) combined with an unstable DB instance. Set to ‘verified’ the next day.

4 ATLAS ALARM->in2p3 srm unreachable GGUS:87872
Time (UTC)        What happened
2012/10/29 04:40  GGUS TEAM ticket opened, automatic notification sent and ticket automatically assigned to NGI_FRANCE. Type of Problem: Storage Systems.
2012/10/29 07:49  After 2 comments by the shifter noting that all T0 exports fail, the ticket is upgraded to an ALARM and notification sent.
2012/10/29 13:27  Ticket set to ‘solved’ by the service manager after an exchange of 4 comments and a restart of the SRM server at 8:35 CEST.
2012/10/29 13:45  Ticket set to ‘verified’ by an ATLAS supporter.

5 ATLAS ALARM->CERN castor data inaccessible GGUS:88064
Time (UTC)        What happened
2012/11/02 09:40  GGUS ALARM ticket opened, automatic notification sent and ticket automatically assigned to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Access.
2012/11/02 09:42  Service expert takes ownership of the ticket and comments that all examples of inaccessible files are on a diskserver with a controller problem.
2012/11/02 09:45  Operator records in the ticket that the CASTOR piquet was called.
2012/11/05 10:46  Ticket set to ‘solved’, all files accessible again, after 7 comments exchanged. The vendor needed a number of days to run tests, while the experiment required a solution within hours. The service expert explained what can realistically be expected given the length of the tests and the vendor’s working-hours agreement.

Time (UTC)        What happened
2012/11/07 11:40  GGUS TEAM ticket opened, automatic notification sent and ticket automatically assigned to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/11/07 12:18  Service expert confirms LSF is slow at times but the reason is not yet understood.
2012/11/07 13:52  Submitter attaches multiple plots showing the large number of pending jobs and converts the ticket into an ALARM; notification sent.
2012/11/07 14:12  Operator confirms ALARM reception.
2012/11/14 15:26  Ticket set to status ‘solved’ after an exchange of 20 comments, from which no real cause of the problem was identified. While doing these drills, service and experiment agreed with our suggestion to close this ALARM and open a TEAM ticket for long-term investigation.

Time (UTC)        What happened
2012/11/14 15:38  GGUS TEAM ticket opened as agreed in the previous slide, automatic notification sent and ticket automatically assigned to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/11/14 20:19  Submitter sees a severe service degradation, attaches plots as evidence and copies the operators in this comment, who suggest contacting the service directly.
2012/11/14 21:30  Submitter attaches multiple plots showing the large number of pending jobs and converts the ticket into an ALARM; notification sent.
2012/11/14 21:36  Operator confirms ALARM reception and forwards it to it-dep-pes-ps.
2012/11/15 07:32  Expert confirms the problem is back and investigation is ongoing. 19 comments were exchanged throughout Thursday and Friday. No weekend activity. There is more than one cause, including the submission of 100k jobs at once by a single user. Still ‘in progress’ on Monday 2012/11/19.

8 CMS ALARM->CERN job failure due to castor GGUS:88507
Time (UTC)        What happened
2012/11/15 00:26  GGUS TEAM ticket opened, automatic notification sent and ticket automatically assigned to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Access.
2012/11/15 00:33  Ticket upgraded to ALARM and notification sent. 13 mins later the operator confirms that the CASTOR piquet was called.
2012/11/15 00:51  The service manager (mgr) started investigating immediately (in 3 mins).
2012/11/15 01:23  Ticket set to ‘solved’ after an exchange of 6 comments. The service mgr found a lot of load on the headnodes; the xrootd daemons were restarted.
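The reaction times in a timeline like this one can be derived directly from the UTC timestamps. The sketch below is illustrative only: the event labels and the helper `minutes_since_open` are hypothetical names introduced here, and the four timestamps are copied from the GGUS:88507 timeline above.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# Timestamps (UTC) copied from the GGUS:88507 timeline; labels are illustrative.
events = [
    ("opened as TEAM ticket", "2012/11/15 00:26"),
    ("upgraded to ALARM", "2012/11/15 00:33"),
    ("investigation started", "2012/11/15 00:51"),
    ("set to solved", "2012/11/15 01:23"),
]


def minutes_since_open(events):
    """Return {label: whole minutes elapsed since the first event}."""
    start = datetime.strptime(events[0][1], FMT)
    return {
        label: int((datetime.strptime(stamp, FMT) - start).total_seconds() // 60)
        for label, stamp in events
    }


elapsed = minutes_since_open(events)
for label, minutes in elapsed.items():
    print(f"{minutes:3d} min  {label}")
```

For this ticket the upgrade to ALARM comes 7 minutes after opening and the ‘solved’ status 57 minutes after opening.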

9 ATLAS ALARM->CERN castor diskserver GGUS:88528
Time (UTC)        What happened
2012/11/15 16:59  GGUS ALARM ticket opened, automatic notification sent and ticket automatically assigned to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Access.
2012/11/15 17:08  Service expert checks the server: it is up but has a RAID controller problem.
2012/11/15 17:21  Operator confirms ALARM reception and forwards it to the CASTOR piquet. Meanwhile the expert set the host out of production.
2012/11/16 05:28  Expert observes an interruption in the file draining. Submitter confirms all necessary files are accessible for ATLAS users.
2012/11/16 21:56  The service manager confirms the file drain completed correctly and sets the ticket to status ‘solved’.

10 CMS ALARM->CERN jobs queueing GGUS:88530
Time (UTC)        What happened
2012/11/15 20:40  GGUS TEAM ticket opened, automatic notification sent and ticket automatically assigned to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch System.
2012/11/15 21:13  Ticket upgraded to ALARM and notification sent. 13 mins later the operator confirms the ticket was sent to it-dep-pes-ps-sms.
2012/11/15 00:51  The service manager (mgr) started investigating immediately (in 3 mins).
2012/11/19 08:02  Ticket set to ‘solved’ after an exchange of 13 comments during the night of Nov. Solution was an LSF reconfiguration.
