GGUS summary (5 weeks)

Presentation transcript:

GGUS summary (5 weeks)

VO       User   Team   Alarm   Total
ALICE       4      1       0       5
ATLAS      12    175       4     191
CMS        13      1       2      16
LHCb        8     33       1      42
Totals     37    210       7     254

Support-related events since last MB

There have been 6 real ALARMs and 1 test ALARM since the 2012/07/24 MB. All were submitted by ATLAS, CMS & LHCb. The site for all these tickets was CERN. There have been no GGUS releases since the last MB because of the summer holidays. The next one is planned for 2012/09/26.

6/15/2018 WLCG MB Report WLCG Service Report

LHCb ALARM->CERN->GridKA FTS problems GGUS:84778

What time UTC     What happened
2012/08/03 07:31  GGUS ALARM ticket opened, automatic email notification to lhcb-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Transfer.
2012/08/03 07:39  Operator records that it-dep-pes-ps-sms was emailed.
2012/08/03 08:16  Grid service expert records in the ticket that the problem dates from 2012/07/27 and is handled via ticket GGUS:84550, assigned to NGI_DE.
2012/08/03 09:40  Ticket assigned to Castor supporters in case there is a dependency, given the many “PrepareToGet” timeouts seen on srm-lhcb.
2012/08/17 03:00  Ticket set to ‘solved’ following an exchange of 9 comments. Transfers turned out to be slow because, with the service performing poorly, files had started to be moved to tape and had to be fetched from there. Increasing the LHCbTAPE pool size helped.

ATLAS ALARM->CERN SLOW LSF GGUS:84928

What time UTC     What happened
2012/08/06 20:57  GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/08/06 21:06  The operator records in the ticket that the service was informed (it doesn’t mention which service).
2012/08/06 21:15  The expert starts working on the problem but sees no problem, as bsub response time is around 100ms.
2012/08/08 09:50  Ticket set to ‘solved’ and very soon afterwards to ‘verified’ after an exchange of 8 comments in which ATLAS is asked to apply some configuration changes while the service asks Platform to find the root cause (we don’t know what the outcome was!!).

CMS ALARM->CERN EOSCMS DOWN GGUS:84966

What time UTC     What happened
2012/08/08 07:38  GGUS ALARM ticket opened, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/08/08 07:49  Service expert notes in the ticket’s internal diary a known problem from the day before with a looping nscd (restarted).
2012/08/08 08:02  Operator records that the sysadmin piquet was called.
2012/08/08 08:04  Expert adds a number of comments listing about 10 user IDs, each holding hundreds of sessions.
2012/08/08 09:18  Ticket set to ‘solved’ after EOSCMS MGM restart. SIR: https://twiki.cern.ch/twiki/bin/view/EOS/IncidentsEOSCMSLdap20120807

ATLAS ALARM->CERN SLOW LSF GGUS:84998

What time UTC     What happened
2012/08/08 17:25  GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/08/08 17:32  The operator records in the ticket that the service was informed (it doesn’t mention which service).
2012/08/08 17:57  The submitter attaches various service plots. The expert starts working on the problem but sees no problem, as bsub response time is < 100ms.
2012/08/10 15:04  Ticket set to ‘solved’ after an exchange of 19 comments between supporters and shifters, caused by discrepancies between the performance results measured by the service and by the experiment. It turned out that the unit was different (1/100s vs 1ms).
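The unit mismatch behind this ticket (1/100s vs 1ms) is the kind of discrepancy that disappears once both sides convert to a common unit before comparing. A minimal sketch follows; the function names, the unit labels and the sample readings are hypothetical and are not taken from the ticket or from any LSF tool.

```python
# Hypothetical conversion table: factor from each unit to milliseconds.
# "cs" stands for centiseconds (1/100 s).
TO_MS = {"s": 1000.0, "cs": 10.0, "ms": 1.0}


def to_ms(value, unit):
    """Convert a duration expressed in `unit` to milliseconds."""
    return value * TO_MS[unit]


def same_latency(a, a_unit, b, b_unit, tolerance_ms=1.0):
    """Return True if two readings agree once both are expressed in ms."""
    return abs(to_ms(a, a_unit) - to_ms(b, b_unit)) <= tolerance_ms


# Illustrative readings only: the same number means very different
# durations depending on the unit it was recorded in.
print(to_ms(10, "cs"))                     # 10 centiseconds in ms
print(same_latency(10, "cs", 100, "ms"))   # same duration, different units
print(same_latency(100, "cs", 100, "ms"))  # same number, different duration
```

Carrying the unit alongside every measurement, as in the calls above, makes a disagreement of this kind visible immediately instead of after a long comment exchange.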

ATLAS ALARM->CERN SLOW LSF GGUS:85058

What time UTC     What happened
2012/08/11 22:33 SATURDAY  GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/08/11 22:42  The operator records in the ticket that the service was informed (it doesn’t mention which service).
2012/08/12 07:57 SUNDAY  The submitter attaches various service plots. The next shifter records in the ticket that things got worse during the night.
2012/08/12 12:05  Operator asks for patience while trying to reach the expert.
2012/08/12 15:23  Service expert found user jobs that blocked the system, killed them and set the ticket to ‘solved’.

CMS ALARM->CERN CASTOR DOWN GGUS:85398

What time UTC     What happened
2012/08/21 20:04  GGUS ALARM ticket opened, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/08/21 20:18  Operator records that the Castor piquet was called.
2012/08/21 20:31  Service expert restarted the Castor transfermanager on one of the CASTORSRM headnodes.
2012/08/21 20:56  Ticket set to ‘solved’ as the service went back to normal.
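Timelines like the one above can be turned into elapsed times since ticket opening, which makes response and resolution delays easier to compare across tickets. The sketch below is only an illustration: the helper name and the short event labels are hypothetical, and the timestamps are copied from the GGUS:85398 timeline.

```python
from datetime import datetime

# Timestamp layout used in the ticket timelines (UTC).
FMT = "%Y/%m/%d %H:%M"


def minutes_since_open(events):
    """Return (label, minutes elapsed since the first event) for each event."""
    opened = datetime.strptime(events[0][0], FMT)
    result = []
    for stamp, label in events:
        delta = datetime.strptime(stamp, FMT) - opened
        result.append((label, int(delta.total_seconds() // 60)))
    return result


# Timestamps from the timeline above; labels shortened for illustration.
timeline = [
    ("2012/08/21 20:04", "ALARM ticket opened"),
    ("2012/08/21 20:18", "Castor piquet called"),
    ("2012/08/21 20:31", "transfermanager restarted"),
    ("2012/08/21 20:56", "ticket solved"),
]

for label, minutes in minutes_since_open(timeline):
    print(f"+{minutes:3d} min  {label}")
```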