Published by May Wiggins. Modified over 9 years ago.
1
GGUS summary (7 weeks)

Table columns: VO | User | Team | Alarm | Total
Table rows: ALICE, ATLAS, CMS, LHCb, Totals

To calculate the totals for this slide and copy/paste the usual graph, please:
1. Take the summary from the table on these pages:
   https://gus.fzk.de/download/wlcg_metrics/html/20110718_escalationreport_wlcg.html
   https://gus.fzk.de/download/wlcg_metrics/html/20110725_escalationreport_wlcg.html
2. Copy the file https://twiki.cern.ch/twiki/pub/LCG/WLCGOperationsMeetings/ggus-tickets.xls locally and add the 2 lines for 18-Jul and 25-Jul. Re-upload the .xls to https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
3. Add up the last 7 weeks, starting 13-Jun (included), and put them in this table.
4. Copy/paste the graph from the .xls file of point 2 above.
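Step 3 above (adding up seven weekly rows per VO and ticket type) could be scripted rather than done by hand. The following is only a sketch: the row layout `(week_start, vo, counts)`, the function name `summarise`, and every number in `sample` are assumptions made for illustration and are not taken from the GGUS spreadsheet.

```python
from collections import defaultdict
from datetime import date, timedelta

TICKET_TYPES = ("User", "Team", "Alarm")  # column names from the summary table


def summarise(weekly_rows, start, weeks=7):
    """Sum weekly ticket counts per VO over `weeks` weeks, first week = `start`."""
    end = start + timedelta(weeks=weeks)  # exclusive upper bound of the window
    totals = defaultdict(lambda: dict.fromkeys(TICKET_TYPES, 0))
    for week_start, vo, counts in weekly_rows:
        if start <= week_start < end:
            for kind in TICKET_TYPES:
                totals[vo][kind] += counts.get(kind, 0)
    # Add the per-VO "Total" column, then the bottom "Totals" row.
    table = {vo: {**c, "Total": sum(c.values())} for vo, c in totals.items()}
    columns = (*TICKET_TYPES, "Total")
    table["Totals"] = {col: sum(row[col] for row in table.values()) for col in columns}
    return table


# Placeholder numbers only, NOT real GGUS data.
sample = [
    (date(2011, 6, 6), "ATLAS", {"User": 10}),  # before the window, ignored
    (date(2011, 6, 13), "ATLAS", {"User": 1, "Team": 2, "Alarm": 1}),
    (date(2011, 6, 20), "CMS", {"Alarm": 1}),
    (date(2011, 7, 25), "ATLAS", {"Team": 3}),  # last week inside the window
]
summary = summarise(sample, start=date(2011, 6, 13))
```

With `start` set to 13-Jun and `weeks=7`, the window covers the weeks of 13-Jun through 25-Jul, which matches the range named in step 3.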
2
10/13/2015 WLCG MB Report: WLCG Service Report

Support-related events since last MB

NB: CHECK IF THERE ARE MORE ALARMS BETWEEN 13-24 July & adjust totals below!

There were 11 real ALARM tickets since the 2011/06/07 MB (7 weeks), 9 submitted by ATLAS and 2 by CMS, all ‘solved’, some even ‘verified’; 10 of them were for CERN and 1 for CNAF.

The first 5 ALARM tickets for CERN did not generate the required email notification to the CERN operators and experts on call! This was due to a switch of the sender’s email address from helpdesk@ggus.org to apache@ggus.org, which happened with the 2011/05/25 GGUS Release because of the new exim mailer at KIT.

This was solved in the week of 2011/06/27 by including the new email address in the admins of the CERN [VO]-operator-alarm@cern.ch e-groups. All test ALARMs following the 2011/07/06 release were successful.

Details follow…
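The failure mode described above can be pictured as a sender allow-list check. This is a deliberately simplified model, not how the CERN e-groups are actually implemented: the function `is_authorised_sender` and the idea that the list silently drops mail from unlisted senders are assumptions. Only the two addresses come from the report.

```python
def is_authorised_sender(sender, allowed_senders):
    """Return True if `sender` may post to the list (case-insensitive match)."""
    normalised = {address.strip().lower() for address in allowed_senders}
    return sender.strip().lower() in normalised


# Addresses quoted in the report; the allow-list model itself is an assumption.
before_fix = {"helpdesk@ggus.org"}
after_fix = before_fix | {"apache@ggus.org"}

# After the 2011/05/25 release, notifications came from apache@ggus.org.
dropped = not is_authorised_sender("apache@ggus.org", before_fix)
delivered = is_authorised_sender("apache@ggus.org", after_fix)
```

Under this model, GGUS can record a notification as sent while the list rejects it, which is consistent with the "sent but not received" entries in the ticket timelines.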
3
ATLAS ALARM->CERN SRM connections fail GGUS:71471

What time (UTC) | What happened
2011/06/12 16:54 (SUNDAY & WHIT MONDAY) | GGUS TEAM ticket, automatic email notification to grid-cern-prod-admins@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/06/12 16:58 | Submitter adds reference to dashboard link with details.
2011/06/12 17:28 | Submitter escalates ticket to ALARM. Email notification recorded as ‘Sent to atlas-operator-alarm@cern.ch’ but no email received by the e-group members (operators & service mgrs)!
2011/06/12 19:50 | Service mgr starts investigation (GGUS-SNOW mapping takes care of bypassing the helpdesk outside working hours and notifying the service mgrs’ list).
2011/06/12 20:27 to 2011/06/13 17:21 (8 comments exchanged) | Service mgr records a load-related lack of available frontend threads. Submitter and other ATLAS members acknowledge that the FTS config. may need to be restored to pre-CASTOR/EOS migration values. Related: GGUS:71328.
2011/06/14 07:30 | Submitter records ‘FTS settings reviewed’ & sets to ‘solved’ and ‘verified’.
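The timestamps in these timelines make it easy to quantify the effect of the missing notification. The sketch below computes the delay between the escalation to ALARM and the start of the investigation for GGUS:71471; the helper name `elapsed` is hypothetical, and the two timestamps are copied from the timeline above.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y/%m/%d %H:%M"  # timestamp layout used in the timeline tables


def elapsed(start, end):
    """Return the time between two timeline timestamps as a timedelta."""
    return datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)


escalated = "2011/06/12 17:28"      # submitter escalates ticket to ALARM
investigation = "2011/06/12 19:50"  # service mgr starts investigation
delay = elapsed(escalated, investigation)
print(delay)  # 2:22:00
```

The same helper can be applied to any pair of rows in the other ticket timelines, since they all use the same timestamp layout.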
4
ATLAS ALARM->CERN SRM many errors GGUS:71715

What time (UTC) | What happened
2011/06/20 15:25 | GGUS TEAM ticket, automatic email notification to grid-cern-prod-admins@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/06/20 16:16 | Service mgr starts investigation.
2011/06/20 16:50 | Problem appears to be due to FTS killing transfers.
2011/06/20 17:12 | A supporter escalates ticket to ALARM. Email notification recorded as ‘Sent to atlas-operator-alarm@cern.ch’ but no email received by the e-group members (operators & service mgrs)!
2011/06/20 17:19 to 2011/06/21 10:19 (23 comments exchanged) | Service mgrs, submitter and other shifters & supporters supply dashboard data and more debug info. Internal SNOW escalation records misleading entries in the GGUS diary. This is followed up via a SNOW development Request.
2011/06/21 11:07 | Service mgr records that the FTS DB clean-up, running at a time of high transfer load, caused the timeouts. Configured a lower rate of clean-up and set the ticket to ‘solved’.
2011/06/21 13:07 | Supporter sets the ticket to status ‘verified’.
5
ATLAS ALARM->CERN Castor timeouts GGUS:71904

What time (UTC) | What happened
2011/06/24 11:34 | GGUS ALARM ticket, automatic email notification sent to atlas-operator-alarm@cern.ch, but no email received by the e-group members (operators & service mgrs)! Automatic GGUS assignment to ROC_CERN successful. Automatic SNOW ticket creation successful.
2011/06/24 11:58 | Service mgr starts investigation (due to the assignment of the relevant SNOW ticket to Castor).
2011/06/24 12:28 | A stuck job was found to have locked the whole ATLAS stager DB. This job was removed and the service was restored.
2011/06/24 15:51 | Service mgr sets the ticket to ‘solved’.
2011/06/24 16:24 | Submitter sets the ticket to ‘verified’.
6
CMS ALARM->CERN job stageout errors GGUS:71934

What time (UTC) | What happened
2011/06/26 14:31 (SUNDAY) | GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/06/26 14:43 | Submitter adds a long list of email addresses in Cc.
2011/06/26 14:49 | Submitter emails computer.operations@cern.ch. Email notification recorded as ‘Sent to cms-operator-alarm@cern.ch’ but no email received by the e-group members (operators & service mgrs)!
2011/06/26 15:20 | Service mgrs start investigation.
2011/06/26 15:45 to 2011/06/26 21:07 (8 comments exchanged) | Service mgr records that 2 out of 3 Castor headnodes are in trouble (read-only FS, stuck rsyslog daemon, files appearing to have zero size). Moved tape functionality to another machine. Service restarted very slowly.
2011/06/26 21:33 | Submitter records ‘unclear if the stuck rsyslog was the reason’ & sets to ‘solved’ and ‘verified’.
7
CMS ALARM->CERN CASTOR POOL FULL GGUS:71969

What time (UTC) | What happened
2011/06/27 13:48 | GGUS ALARM ticket opened, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/06/27 14:10 | Service mgr records in the ticket that investigation has started. The email notification to operators was not yet fixed at that point.
2011/06/27 17:35 | Submitter records in the ticket info received by phone: Castor can’t perform with intense pool use and a high rate of file deletions.
2011/06/28 07:47 | Service mgr puts ticket to status ‘solved’. Work ongoing for garbage collection optimisation.
8
ATLAS ALARM->CERN EXPORT FAILS WITH FTS ERRORS GGUS:71985

What time (UTC) | What happened
2011/06/27 21:04 | GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Email notification to operators was not yet fixed at that point.
2011/06/27 21:40 | Service mgr records in the ticket that operators should be asked to call the FTS expert on call.
2011/06/28 01:22 | Night shifter escalates ticket in GGUS. This sends a reminder email to the relevant Support Unit (ROC_CERN).
2011/06/28 03:21 | Expert acknowledges ticket reception.
2011/06/28 04:15 | Expert records that agents were stopped to clean up.
2011/06/28 08:05 | Submitter reports jobs are stuck in FTS for hours.
2011/06/28 09:19 to 13:00 (3 comments) | Expert checks again and puts the ticket into status ‘solved’ with diagnostic: /var/tmp was filling up too quickly. The reason was the clean-up job failing since the FTS uid had recently become global by mistake.
9
ATLAS ALARM->CERN T0MERGE WRITING ERRORS GGUS:72132

What time (UTC) | What happened
2011/07/01 06:57 | GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Email notification to operators was working again!
2011/07/01 07:11 | Operator records in the ticket that the Castor piquet was called.
2011/07/01 07:35 | First diagnostic recorded in the ticket: filesystem full, no space left on device. Castor expert on piquet implements a work-around by which each retry will try a different filesystem.
2011/07/01 21:13 | Service mgr puts the ticket into status ‘solved’.
2011/07/02 05:02 (SATURDAY) | Submitter puts the ticket into status ‘verified’.
10
ATLAS ALARM->CERN NO SPACE LEFT ON DEVICE IN POOLS GGUS:72218

What time (UTC) | What happened
2011/07/04 13:42 | GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Email notification to operators worked as expected!
2011/07/04 13:48 | Castor expert records that investigation started on atlt3 filesystems appearing DISABLED!
2011/07/04 13:49 | Operator records in the ticket that the Castor piquet was called.
2011/07/04 14:51 | Service mgr puts the ticket into status ‘solved’. Explanations include Oracle errors (then under investigation) and configuration problems (repaired).
2011/07/04 15:00 | Submitter puts the ticket into status ‘verified’.
11
ATLAS ALARM->CERN CASTOR POOLS’ WRITING HANGS GGUS:72262

What time (UTC) | What happened
2011/07/05 13:10 | GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/07/05 13:22 | Operator records in the ticket that the Castor piquet was called.
2011/07/05 14:06 | Service expert puts the ticket into status ‘solved’. There was a hardware issue with the load-balanced head-nodes. An xroot-castor plugin was also upgraded on this occasion.
2011/07/05 14:11 | Submitter puts the ticket into status ‘verified’.
12
ATLAS ALARM->CNAF MONITORING SHOWS ZERO SPACE ON DATATAPE GGUS:72473

What time (UTC) | What happened
2011/07/08 21:55 | GGUS TEAM ticket opened, automatic email notification to t1-admin@lists.cnaf.infn.it AND automatic assignment to NGI_IT.
2011/07/09 06:56 (SATURDAY) | Site mgr records in the ticket that a problem with the info provider should by now be fixed.
2011/07/09 14:13 | Shifter records errors from the DDM dashboard.
2011/07/09 14:35 | Shifter upgrades TEAM ticket into an ALARM. Email sent to t1-alarms@cnaf.infn.it.
2011/07/09 18:02 | Site admin (?) records they are checking.
2011/07/11 06:07 | Automatic (?) warning about non-authorised ALARM raising (?)
2011/07/11 14:02 | Site admin puts the ticket in status ‘solved’ with explanation ‘storm misconfiguration fixed’.
13
ATLAS ALARM->CERN CASTOR NO ACCESS TO FILE GGUS:72528

What time (UTC) | What happened
2011/07/11 16:01 | GGUS TEAM ticket opened, automatic email notification to grid-cern-prod-admins@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/07/11 16:46 | Shifter upgrades TEAM ticket into an ALARM. Email sent to atlas-operator-alarm@cern.ch.
2011/07/11 17:05 | Expert records in the ticket that the file is on an unavailable server and the incident doesn’t qualify for an ALARM.
2011/07/11 17:22 | Operator records in the ticket that the Castor piquet is called.
2011/07/11 17:51 | Service mgr puts the ticket in status ‘solved’ with explanation ‘the file server is a faulty box, discussed at the WLCG daily meeting already, which is given to the vendor for repair’.
14
VONAME ALARM->SITE SERVICE GGUS:XXXXX

What time (UTC) | What happened
2011/xx/yy xx:yy | GGUS ALARM ticket opened, automatic email notification to Mailing_list_name AND automatic assignment to ROC_or_NGI_name.
2011/xx/yy xx:yy | Comment on acknowledgment. There may be several rows on operator-to-service mgr notification. Investigation.
2011/xx/yy xx:yy | Problem traced down to [put the Diagnosis here]. Service mgr puts ticket to status ‘solved’.
2011/xx/yy xx:yy | Submitter puts ticket to status ‘verified’.