GGUS summary (4 weeks)

VO       User   Team   Alarm   Total
ALICE       4      0       1       5
ATLAS      41    164       6     211
CMS        11      4       6      21
LHCb        7     48       3      58
Totals     63    216      16     295



1/31/2016 WLCG MB Report WLCG Service Report

Support-related events since last MB

There were 12 real ALARM tickets since the 2011/10/11 MB (2 weeks), 5 submitted by ATLAS, 5 by CMS, 2 by LHCb. 5 ALARM tickets concerned CERN, 2 were for RAL, 1 for CNAF, 1 for FNAL (via OSG), 1 for GridKa and 2 for IN2P3.

17 test ALARM tickets were submitted by the GGUS developers on Release day 2011/10/19, as a part of the regular procedure. All parts of the process chain were tested and were successful.

Following this release, 16 additional GGUS Support Units (SUs), mostly 3rd-level middleware support, interfaced with the T0 local ticketing system Service Now (SNOW). This completed the GGUS-SNOW transition.

On 2011/10/22 ‘top priority’ tickets started getting switched automatically down to ‘less urgent’. A SOAP web service bug was found that caused this. Case tracked in GGUS:75575.

On 2011/11/04 the GGUS-SNOW interface broke for ~2.5 days. Case tracked in GGUS:76052.

Details follow…

CMS ALARM->CERN errors in read from Castor GGUS:75218

What time UTC     What happened
2011/10/11 13:01  GGUS ALARM ticket, automatic notification to cms- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: File Access.
2011/10/11 13:02  Service expert assigns the ticket to himself.
2011/10/11 13:22  Submitter and expert contribute log entries in the ticket to demonstrate the error which only appears in batch. Interactive reads work fine.
2011/10/11 13:26  The operator records in the ticket that “EOS support was called”.
2011/10/11 14:15  Several exchanges conclude there is an authentication problem in batch mode.
2011/10/11 17:16  Indeed there was a problem with the Kerberos credentials’ handling. When fixed the ticket was set to ‘solved’ and ‘verified’.

ATLAS ALARM->CERN No T0 jobs due to Oracle COOL DB problems GGUS:75234

What time UTC     What happened
2011/10/12 01:30  GGUS ALARM ticket, automatic notification to atlas- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: Databases.
2011/10/12 01:48  Operator records in the ticket that the Oracle piquet for the Atlas RAC was contacted. Unfortunately, they called the data mgnt piquet first who indicated the right service.
2011/10/12 07:53  5 comments contributed by the submitter and the shifter at P1 throughout the night recording a lock of the ATLAS_COOL_READER_TZ account and other observations on load from sms.cern.ch.
2011/10/12 08:14  Grid services’ expert assigns the ticket to the Physics DB group in SNOW.
2011/10/12 09:54  Expert sets the ticket to status ‘solved’ explaining the reason “high load and fail-over to a system unable to cope”.

LHCb ALARM->GridKa dCache not responding GGUS:75261

What time UTC     What happened
2011/10/12 16:12  GGUS ALARM ticket, automatic notification to de-kit- AND automatic assignment to NGI_DE. ToP: Storage Systems.
2011/10/12 16:19  NGI_DE supporter sets the ticket in progress and records the technician on-call was contacted.
2011/10/12 19:26  FZK WLCG contact records in the ticket that dCache experts are investigating.
2011/10/12 20:01  Service expert sets the ticket to status ‘solved’ recording there was a problem with the dCache pnfs service which was restarted.
2011/10/12 20:46  Submitter sets the ticket to status ‘verified’.

CMS ALARM->OSG FNAL facilities unreachable GGUS:75267

What time UTC     What happened
2011/10/13 06:17  GGUS ALARM ticket, automatic notification to various paging addresses defined by the site AND automatic assignment to OSG(Prod). Automatic OSG-GOC ticket creation successful. ToP: Network problem.
2011/10/13 08:00  Site contact records in the ticket a known network problem at FNAL and people working on it.
2011/10/13 12:06  Site contact sets the ticket in status ‘solved’ recording that the CMS network switch was rebooted to clear the numerous errors recorded in the buffer log.

CMS ALARM->CERN vocms15 unreachable GGUS:75510

What time UTC     What happened
2011/10/20 06:52  GGUS ALARM ticket, automatic notification to cms- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: Local Batch System.
2011/10/20 07:03  Operator records in the ticket the problem with vocms15 is known and the sys admin is working on it.
2011/10/20 08:47  Grid services’ expert sets the ticket in status ‘solved’ recording that the sys admin fixed the problem (how?).

ATLAS ALARM->RAL Transfers fail – SRM down GGUS:75597

What time UTC              What happened
2011/10/22 04:12 SATURDAY  GGUS TEAM ticket, automatic notification to lcg- AND automatic assignment to NGI_UK. ToP: File Transfer.
2011/10/22 05:49           Ticket upgrade from TEAM to ALARM. Notification sent to
2011/10/22 05:50           Automatic system reply for the alarm registration.
2011/10/22 05:53           New shifter provides additional information about SRM being down at RAL.
2011/10/22 until 13:42     The transfers restarted a few minutes after the ALARM was raised but 6 comments exchanged to explain that an Oracle node on the Castor DB rack rebooted for reason unknown.
2011/10/23 05:05 SUNDAY    Another Castor Oracle DB node crashed and NOT rebooted. GOCDB publishing was not possible. ATLAS blacklists RAL temporarily.
2011/10/24 11:57           All RAL Castor instances restored. Ticket in status ‘solved’ with the debug info known so far.

ATLAS ALARM->CNAF LFC down GGUS:75601

What time UTC              What happened
2011/10/22 12:02 SATURDAY  GGUS TEAM ticket, automatic notification to t1- AND automatic assignment to NGI_IT. ToP: NONE. This is not an agreed value. See Savannah:124239.
2011/10/22 12:28           Site supporter takes the ticket and starts investigating.
2011/10/22 13:20           Site mgr records in the ticket that one frontend was found stuck with kernel panic and another, apparently, out of memory. The service was restarted. The ticket was set to status ‘solved’.
2011/10/22 19:13           Submitter sets the ticket to status ‘verified’.

ATLAS ALARM->IN2P3 exports from T0 fail GGUS:75609

What time UTC            What happened
2011/10/23 07:16 SUNDAY  GGUS TEAM ticket, automatic notification to AND automatic assignment to NGI_FRANCE. ToP: File Transfer.
2011/10/23 08:09         Ticket upgrade to ALARM. GGUS automatic notification sent to Automatic acknowledgment by the ‘LHC Alert’ CC-IN2P3 response team recorded in the
2011/10/23 09:16         Site mgr records in the ticket that SRM problems were discovered and being investigated.
2011/10/23 14:16         Following several comments exchanged with the site, submitter sets the site ‘offline’ for data exports in agreement with the supporter at the site, as the problem takes long to fix.
2011/10/24 09:51         Ticket set to ‘solved’ as SRM was back and observed 15+ hrs. No explanation for the failure recorded in the ticket.

LHCb ALARM->IN2P3 SRM not responding GGUS:75610

What time UTC            What happened
2011/10/23 11:31 SUNDAY  GGUS ALARM ticket, automatic notification to lhc- AND automatic assignment to NGI_DE. ToP: Storage Systems.
2011/10/23 11:35         Automatic acknowledgement of the ALARM by the site.
2011/10/23 11:37         Submitter pastes error msgs in the ticket (from voboxes and lxplus) for debugging.
2011/10/23 11:50         Site contact records the problem is known and being investigated. Meanwhile the SRM will be declared in downtime.
2011/10/23 16:43         IN2P3 SRM operational again. After 1.5 hrs of checks, the ticket is set to ‘solved’.

CMS ALARM->CERN Castor CMS T1 transfer pool GGUS:75743

What time UTC     What happened
2011/10/26 10:50  GGUS ALARM ticket, automatic notification to cms- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: File Transfer.
2011/10/26 11:04  Operator records in the ticket the Castor piquet was contacted.
2011/10/26 11:07  Expert sets the ticket ‘in progress’. Requests examples as problem evidence.
2011/10/26 11:33  Expert sets the ticket to status ‘solved’ after making sure that a simple overload and FTS timeout were the reasons for the failing transfers.
2011/10/26 13:24  Submitter confirms and sets the ticket to status ‘verified’.

ATLAS ALARM->RAL Transfers fail – SRM down GGUS:75823

What time UTC              What happened
2011/10/29 20:52 SATURDAY  GGUS TEAM ticket, automatic notification to lcg- AND automatic assignment to NGI_UK. ToP: File Transfer.
2011/10/29 21:01           Ticket upgrade from TEAM to ALARM. Notification sent to
2011/10/29 21:02           Automatic system reply for the alarm registration.
2011/10/29 22:00           1st estimate gives high load as problem cause. Site reduced FTS file limits. This did NOT help.
2011/10/30 05:31 SUNDAY    The situation did not improve so the site limited also the number of allowed Atlas jobs.
2011/10/31 08:37           RAL ran in reduced capacity all Sunday and set Atlas instances in downtime in GOCDB. Starting Monday the VO set the UK cloud in brokeroff to continue debugging with low load.
2011/11/02 16:06           FTS channels’ capacity was gradually set back to normal. Ticket ‘solved’ and soon afterwards ‘verified’. SIR promised.

CMS ALARM->CERN CEs failing SAM tests GGUS:75833

What time UTC              What happened
2011/10/30 14:14 SUNDAY    GGUS ALARM ticket, automatic notification to cms- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: Storage Systems.
2011/10/30 14:18           Storage Services’ expert records in the ticket the beginning of the investigation.
2011/10/30 14:28           Operator records in the ticket that a notification was sent to it-dep-pes-ps-sms. This is not the right experts’ list!
2011/10/30 until 15:00hrs  A couple of comments with the operators to indicate the right piquet that should have been called and with the submitters for getting more info on the failing tests.
2011/10/30 15:06           Expert sets the ticket into status ‘solved’. A file that the SAM tests need to find in a given directory in order to run successfully was accidentally removed by CMS.
2011/10/30 18:41           Following a few comments of apologies the submitter set the ticket in status ‘verified’.
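
As a cross-check, the twelve ALARM tickets detailed in the slides above can be tallied by VO and by site and compared with the counts quoted in the summary slide (5 ATLAS, 5 CMS, 2 LHCb; 5 CERN, 2 RAL, 1 CNAF, 1 FNAL, 1 GridKa, 2 IN2P3). The following is a minimal Python sketch and not part of the original report: the names `ALARM_TICKETS` and `tally` are hypothetical, and the (ticket, VO, site) triples are simply transcribed from the slide titles.

```python
from collections import Counter

# (ticket id, VO, site) transcribed from the per-ticket slide titles.
# The list name and layout are illustrative assumptions, not a GGUS export format.
ALARM_TICKETS = [
    ("GGUS:75218", "CMS", "CERN"),
    ("GGUS:75234", "ATLAS", "CERN"),
    ("GGUS:75261", "LHCb", "GridKa"),
    ("GGUS:75267", "CMS", "FNAL"),
    ("GGUS:75510", "CMS", "CERN"),
    ("GGUS:75597", "ATLAS", "RAL"),
    ("GGUS:75601", "ATLAS", "CNAF"),
    ("GGUS:75609", "ATLAS", "IN2P3"),
    ("GGUS:75610", "LHCb", "IN2P3"),
    ("GGUS:75743", "CMS", "CERN"),
    ("GGUS:75823", "ATLAS", "RAL"),
    ("GGUS:75833", "CMS", "CERN"),
]


def tally(tickets):
    """Return (count per VO, count per site) for a list of ticket triples."""
    by_vo = Counter(vo for _, vo, _ in tickets)
    by_site = Counter(site for _, _, site in tickets)
    return by_vo, by_site


by_vo, by_site = tally(ALARM_TICKETS)
print("Total ALARM tickets:", len(ALARM_TICKETS))
print("Per VO:  ", dict(by_vo))
print("Per site:", dict(by_site))
```

The per-VO and per-site counts produced this way agree with the figures in the summary slide, which suggests the per-ticket slides cover all 12 real ALARMs.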