

GGUS Slides for the 2012/07/24 MB
Drills cover the period of 2012/06/18 (Monday) until 2012/07/12, given my holiday starting the following weekend. Remove this slide and let me know if drills are missing and should be prepared for a future MB. Thank You! MariaDZ
12/23/2015 WLCG MB Report WLCG Service Report

GGUS summary (5 weeks)
VO | User | Team | Alarm | Total
ALICE
ATLAS
CMS
LHCb
Totals
To calculate the totals for this slide and copy/paste the usual graph for the 2012/07/24 MB please:
1. Take the summary from the table on … and …
2. Copy locally file … Include 2 more lines from the escalation reports above. Add up the last 5 weeks, i.e. starting from the 25-Jun line, and put the totals in this table.
4. Copy/paste here, instead of these instructions, the updated graph from the point 2 .xls file.
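The "add up the last 5 weeks" step can be sketched as a small script. This is only an illustration: it assumes the weekly summary rows have already been copied by hand into a list of tuples, and the names `WEEKLY_ROWS` and `summarise` as well as all counts are hypothetical placeholders, not figures from the report.

```python
from collections import defaultdict

# Hypothetical weekly rows: (week label, VO, user, team, alarm).
# The counts are placeholders for illustration, not figures from the report.
WEEKLY_ROWS = [
    ("25-Jun", "ALICE", 1, 0, 0),
    ("25-Jun", "ATLAS", 3, 20, 2),
    ("02-Jul", "ALICE", 0, 1, 0),
    ("02-Jul", "ATLAS", 2, 15, 1),
    ("02-Jul", "CMS", 1, 0, 1),
    ("02-Jul", "LHCb", 0, 4, 1),
]

VOS = ["ALICE", "ATLAS", "CMS", "LHCb"]


def summarise(rows):
    """Add up weekly ticket counts per VO and append a Totals row."""
    sums = defaultdict(lambda: [0, 0, 0])
    for _week, vo, user, team, alarm in rows:
        sums[vo][0] += user
        sums[vo][1] += team
        sums[vo][2] += alarm
    table = {}
    for vo in VOS:
        user, team, alarm = sums[vo]
        table[vo] = (user, team, alarm, user + team + alarm)
    # Column-wise totals over all VOs, including the Total column.
    table["Totals"] = tuple(sum(table[vo][i] for vo in VOS) for i in range(4))
    return table


if __name__ == "__main__":
    print(f"{'VO':<8}{'User':>6}{'Team':>6}{'Alarm':>7}{'Total':>7}")
    for name, (user, team, alarm, total) in summarise(WEEKLY_ROWS).items():
        print(f"{name:<8}{user:>6}{team:>6}{alarm:>7}{total:>7}")
```

The rows of the returned table follow the VO / User / Team / Alarm / Total layout of the slide, so the output could be pasted in place of the instructions.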

Support-related events since last MB
There have been 12+ real ALARMs since the 2012/06/19 MB. All were submitted by ATLAS, CMS & LHCb. Sites for all these tickets were CERN, IN2P3, FZK, PIC, SARA.
There have been 2 GGUS Releases since the last MB:
On 2012/06/25: specifically on new Reporting Tools.
On 2012/07/09: all other dev. items.

ATLAS ALARM->CERN CASTOR PROBLEM GGUS:83360
What time UTC | What happened
2012/06/18 15:42 | GGUS ALARM ticket opened, automatic notification to … AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/06/18 15:42 | Expert records work started.
2012/06/18 15:48 | Operator records that expert is working already.
2012/06/18 15:50 | Expert records there was a configuration error. ITSSB is updated and fixing started.
2012/06/18 16:27 | Ticket set to ‘solved’ after configuration change and propagation. 4 more comments were exchanged because the problem persisted for some nodes that appeared to be under maintenance in the CASTOR monitor and had not received the new config. Problem really solved at 18:05 hrs.
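The gap between the ticket being set to ‘solved’ and the problem really being solved can be made explicit with a few lines of Python. This is a minimal sketch using the timestamps from the GGUS:83360 timeline; it assumes the final "18:05 hrs" refers to the same day (2012/06/18), and the names `EVENTS` and `minutes_since_open` are illustrative only.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# Timestamps (UTC) as recorded on the slide for GGUS:83360.
# Assumption: "18:05 hrs" is on 2012/06/18, the day the ticket was opened.
EVENTS = [
    ("2012/06/18 15:42", "ALARM ticket opened"),
    ("2012/06/18 15:48", "operator records that expert is working"),
    ("2012/06/18 15:50", "configuration error identified"),
    ("2012/06/18 16:27", "ticket set to 'solved'"),
    ("2012/06/18 18:05", "problem really solved"),
]


def minutes_since_open(events):
    """Return (minutes elapsed since the first event, label) for each event."""
    opened = datetime.strptime(events[0][0], FMT)
    result = []
    for stamp, label in events:
        elapsed = datetime.strptime(stamp, FMT) - opened
        result.append((int(elapsed.total_seconds() // 60), label))
    return result


if __name__ == "__main__":
    for minutes, label in minutes_since_open(EVENTS):
        print(f"+{minutes:4d} min  {label}")
```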

ATLAS ALARM->CERN LSF SCHEDULING GGUS:83362
What time UTC | What happened
2012/06/18 15:50 | GGUS ALARM ticket opened, automatic notification to … AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/06/18 16:02 | Operator’s acknowledgment and … to …pes-sms…
2012/06/18 16:19 | Service mgr starts work.
2012/06/18 16:38 | The ticket is ‘solved’ because the LSF problem was a side-effect of the CASTOR problem of the previous slide.

ATLAS ALARM->FZK FTS TRANSFER ERRORS GGUS:83367
What time UTC | What happened
2012/06/18 23:16 | GGUS TEAM ticket opened, automatic notification to … AND automatic assignment to NGI_DE. Type of Problem: File Transfer.
2012/06/19 01:18 | Increased to “Top Priority”, followed by ticket conversion to ALARM 10 mins later as the transfer failure rate increases.
2012/06/19 05:46 | A CMS comment! They have the same problem!
2012/06/19 09:02 | The ticket is ‘solved’ after finding a disk issue that needed a log partition cleanup on an FTS host. Both experiments agree the problem is gone.

ATLAS ALARM->CERN LSF SLOW RESPONSE GGUS:83375
What time UTC | What happened
2012/06/19 07:43 | GGUS ALARM ticket opened, automatic notification to … AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/06/19 07:51 | Operator’s acknowledgment and … to …pes-sms…
2012/06/19 07:55 | Service mgr starts work.
2012/06/20 15:42 | The ticket is ‘solved’ because the problem went away. Although Platform was supposed to get back with a diagnostic, after the ticket was set to ‘verified’ no further update is possible, hence we never knew what the cause of the problem was.

ATLAS ALARM->IN2P3 SW SRC PROBLEM VIA CVMFS GGUS:83517
What time UTC | What happened
2012/06/24 08:30 SUNDAY | GGUS TEAM ticket opened, automatic notification to … AND automatic assignment to NGI_FRANCE. Type of Problem: Middleware.
2012/06/25 07:37 | Ticket upgraded to ALARM after 2 comments listing all WNs where 100% of the jobs failed. … sent to … Automatic acknowledgment recorded immediately afterwards.
2012/06/25 08:21 | Sys. admins investigate (cvmfs cache problem).
2012/06/25 11:16 | The ticket is ‘solved’ after changing the logrotate policy to reduce the logs, but as the ticket was set to ‘verified’ no further update is possible, hence we never knew why the high increase of connections led to this fast growth of logfiles.

ATLAS ALARM->SARA SRM CONTACT PROBLEM GGUS:83523
What time UTC | What happened
2012/06/24 19:57 SUNDAY | GGUS TEAM ticket opened, automatic notification to … AND automatic assignment to NGI_NL. Type of Problem: Storage Systems.
2012/06/24 20:21 | Ticket upgraded to ALARM as the SRM layer appeared broken. … sent to … Automatic acknowledgment recorded immediately afterwards.
2012/06/25 05:54 | Service mgr restarted srm.
2012/06/27 14:47 | The ticket is ‘solved’ after exchanging 16 comments to understand the cause, which seemed to be the recent dcache upgrade to v… Moving the srm to new hardware didn’t help but re-indexing the DB did.

ATLAS ALARM->CERN VOATLAS SERVERS DOWN GGUS:83705
What time UTC | What happened
2012/06/29 06:33 | GGUS ALARM ticket opened, automatic notification to … AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Other.
2012/06/29 06:34 | Grid services’ expert informs the submitter that there is a power cut in the CC, published on the ITSSB.
2012/06/29 06:40 | Operator also records that there are many problems due to the power cut.
2012/06/29 12:01 | The ticket is set to ‘verified’ after the services got back at 08:26 and the solution was recorded at 11:55.

LHCB ALARM->CERN MISSING DATA ON DISK GGUS:83713
What time UTC | What happened
2012/06/29 11:37 | GGUS ALARM ticket opened, automatic notification to … AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/06/29 11:38 | Storage expert informs the submitter that after the power cut in the CC earlier on that day, not all servers have yet recovered.
2012/06/29 11:45 | Operator also records that CASTOR piquet was called.
2012/06/29 16:20 | Ticket set to ‘solved’ at 16:20 when all servers came back to production. SLS was showing all was fine even though this was only partially true. The reason was that the monitoring process checks the availability of a necessary and sufficient subset of nodes only.
2012/07/04 07:45 | The ticket was ‘re-opened’ and eventually re-‘solved’ & ‘verified’ following experiment complaints when files were found missing. The reason was that a machine was still unreachable. It came back after a vendor call.

CMS ALARM->CERN VOCMS203 WEB SERVICE PROBLEM GGUS:83726
What time UTC | What happened
2012/06/30 07:41 SATURDAY | GGUS TEAM ticket opened, automatic notification to … AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Other.
2012/06/30 10:37 | Escalated and soon afterwards upgraded to ALARM.
2012/06/30 10:52 | Operator records that the problem is known and the piquet has already sent mail suggesting copying the data because the disk is scheduled for replacement.
2012/07/02 10:04 | Various CMS ALARMers submitted 6 comments in the ticket trying to get any news on progress of this.
2012/07/03 09:09 | Ticket set to ‘solved’ after fixing the hardware problem.

ATLAS ALARM->PIC TRANSFERS FROM CERN FAIL GGUS:83923
What time UTC | What happened
2012/07/06 09:31 | GGUS TEAM ticket opened, automatic notification to … AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Transfer.
2012/07/06 09:36 | Site mgrs record in the ticket a known network problem in the dCache pools. LHCb opened a similar ticket on the matter.
2012/07/06 11:25 | Transfer failure rate keeps increasing. Ticket upgraded to ALARM. … sent to tier…
2012/07/06 16:26 | The ticket is set to ‘solved’ after reducing the timeout and increasing the queue size. Supporters and submitters observed the service recovering for 2 days before ‘verify’ing the ticket.

ATLAS ALARM->CERN SLOW LSF GGUS:83947
What time UTC | What happened
2012/07/07 07:27 SATURDAY | GGUS ALARM ticket opened, automatic notification to … AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/07/07 11:22 | The same problem was reported by CMS via ALARM GGUS:… No operator acknowledgment was recorded in these 2 tickets, due to the invalid addresses used. Submitters provided debug info about jobs appearing to ‘run’ on lost-and-found machines. Service mgr applied recently received hot fixes. 7 comments exchanged.
2012/07/07 20:12 | The ticket is set to ‘solved’. ‘Verified’ the next day. Similar process for the CMS ALARM on this issue.
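To compare the drills with each other, the time from ticket opening to ‘solved’ can be tabulated. The sketch below uses only the opening and ‘solved’ timestamps quoted in the drills; GGUS:83705 (only a ‘verified’ time is given), GGUS:83713 (re-opened later) and GGUS:83923 (timestamp damaged in the transcript) are deliberately left out. For tickets that started as TEAM tickets the clock starts at the TEAM opening, which is a choice made for this illustration; the names `TICKETS` and `time_to_solved` are hypothetical.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# (opened, set to 'solved') in UTC, as quoted in the drills.
TICKETS = {
    "GGUS:83360": ("2012/06/18 15:42", "2012/06/18 16:27"),
    "GGUS:83362": ("2012/06/18 15:50", "2012/06/18 16:38"),
    "GGUS:83367": ("2012/06/18 23:16", "2012/06/19 09:02"),
    "GGUS:83375": ("2012/06/19 07:43", "2012/06/20 15:42"),
    "GGUS:83517": ("2012/06/24 08:30", "2012/06/25 11:16"),
    "GGUS:83523": ("2012/06/24 19:57", "2012/06/27 14:47"),
    "GGUS:83726": ("2012/06/30 07:41", "2012/07/03 09:09"),
    "GGUS:83947": ("2012/07/07 07:27", "2012/07/07 20:12"),
}


def time_to_solved(tickets):
    """Map each ticket to the timedelta between opening and 'solved'."""
    return {
        ticket: datetime.strptime(solved, FMT) - datetime.strptime(opened, FMT)
        for ticket, (opened, solved) in tickets.items()
    }


if __name__ == "__main__":
    durations = time_to_solved(TICKETS)
    # List tickets from quickest to slowest resolution.
    for ticket, delta in sorted(durations.items(), key=lambda item: item[1]):
        hours = delta.total_seconds() / 3600
        print(f"{ticket}  {hours:6.1f} h")
```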