WLCG Service Report ~~~ WLCG Management Board, 9th August 2011



Introduction

- 3 busy weeks since the last MB report on July 19th
- Good data taking with LHC record fills (passed the 2 fb⁻¹ mark on August 5!)
- Three Service Incident Reports received:
  - IN2P3 outage of 13 DBs due to disk failures on July 19th-21st (SIR). Affected ATLAS (COOL, LFC, AMI), CMS (FTS), LHCb (COOL, LFC) for >1 week
  - GGUS ALARM submission affected by KIT mail interface, July 22nd-26th (SIR)
  - Loss of 11k ATLAS files at KIT due to dirty GPFS, July 12th-26th (SIR)
- One more Service Incident Report is expected: CERN KDC flood from ATLAS users in May-June (reported at last MB)
- 4 real GGUS ALARMS (3 for ATLAS and 1 for CMS), all about storage, at CERN (CASTOR) and CNAF (StoRM)
- Other notable issues reported at the daily meetings:
  - Major power outage at FNAL due to thunderstorm on July 29
  - StoRM issues at many ATLAS sites after upgrade, applied workarounds
  - Low CPU efficiency of ALICE jobs finally solved (new hw, xrootd, svc config)
  - ADCR DB performance slow (after move to standby hw, but not correlated?)

GGUS summary (3 weeks)

VO     | User | Team | Alarm | Total
ALICE  | 4    | 0    | 1     | 5
ATLAS  | [...]
CMS    | 4    | 0    | 2     | 6
LHCb   | [...]
Totals | [...]

Support-related events since last MB

- There were 4 real ALARM tickets since the 2011/07/18 MB (3 weeks): 3 submitted by ATLAS, 1 by CMS, all ‘solved’ and ‘verified’; 2 of them for CERN CASTOR, 2 for CNAF StoRM.
- Ongoing GGUS problems in ALARM submission and/or escalation:
  - Problems in June, already reported at the last MB, were due to the new KIT exim mailer and supposedly solved during the week of June 27.
  - For the ATLAS ticket on July 24, GGUS did not allow ALARM submission and also failed to notify operators on TEAM-to-ALARM escalation.
  - For the CMS ALARM submitted on July 26, the piquet was not called.
  - These issues were solved last week at KIT (see SIR) and validated with test alarms.
  - This weekend, an ALARM submitted by ATLAS for INFN on August 6 again did not reach the SMS system of the site. This had already been reported on July 17 (GGUS:72717). CNAF reported this morning that a fix has been applied and validated (tests have confirmed that ALARMS correctly trigger SMS messages).

ATLAS ALARM -> CERN: CASTORATLAS DOWN (GGUS:72890)

What time (UTC) | What happened
2011/07/24 03:16 (Sunday) | GGUS TEAM ticket (as GGUS did not allow direct ALARM submission!), automatic notification to grid-cern-prod- AND automatic assignment to ROC_CERN.
2011/07/24 03:17 | Submitter immediately escalates ticket to ALARM. Notification recorded as ‘Sent to [...]’ (but not received by operators & service mgrs?). Automatic SNOW ticket creation.
2011/07/24 06:34 | Supporter records that data export from CERN is also affected.
2011/07/24 06:43-07:57 | Supporter calls [...]. Operator had received no alarm! Supporter [...] and later also [...] and [...].
2011/07/24 08:03 | Castor developer confirms TEAM-to-ALARM did not work and observes that no problem can be seen at this time.
2011/07/24 08:20-08:44 | Supporter confirms problem was real. ATLAS data export still suffering due to backlog accumulated when CASTOR was down.
2011/07/26 10:16 | Castor mgr puts ticket on hold, discussion ongoing with ATLAS.
2011/07/29 16:35-20:56 | Castor expert sets ticket ‘solved’, applying workarounds and hotfixes. Submitter sets ticket ‘verified’.

CMS ALARM -> CERN: CASTOR XROOTD REDIRECTOR NOT WORKING (GGUS:72944)

What time (UTC) | What happened
2011/07/26 08:56 | GGUS ALARM ticket, automatic notification to cms-operator- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/07/26 09:56 | Castor admin restarts redirector and asks if all ok. “Redirector threads were busy with CASTOR (stuck in synchronous Puts), so new requests were stuck (and would get eventually run into Kerberos Clock skew detection). The number of threads can be increased, but this might point to some overload issue. We might also have hit some issue with locking on the Kerberos replay cache, a core dump was taken and is being looked at.”
2011/07/26 09:58 | Castor admin adds “For the record, ALARM seems not to have reached CERN via the usual channels (i.e. no parallel routing to CERN operator or SMS alert list, hence no piquet call)”.
2011/07/26 10:15-21:12 | Submitter replies and Castor admin sets ticket ‘solved’ and later ‘verified’.

ATLAS ALARM -> INFN: SRM DOWN (GGUS:73054)

What time (UTC) | What happened
2011/07/29 15:13 | GGUS TEAM ticket, automatic notification to t1- AND automatic assignment to NGI_IT.
2011/07/29 16:02 | Transfers from T0 are also failing. Supporter escalates ticket to ALARM. Notification sent to address [...].
2011/07/29 16:02 | Automatic reply “You are not allowed to trigger an SMS alarm for INFN Tier1. Anyway your message has been forwarded to the operations mailing list.”
2011/07/29 16:45 | Site admin restarts GPFS process in StoRM BE, asks if ok now.
2011/07/29 17:21 | Supporter confirms all is ok, ticket can be closed.
2011/07/31 03:23 | Shifter reopens ticket because SRM is down again.
2011/07/31 05:04 | Supporter sets ticket as closed and moves new SRM issue to new TEAM ticket GGUS:73068 (to be escalated if not solved promptly, but the issue is fixed at 05:53).
2011/07/31 17:22 | Supporter sets ticket as ‘verified’.

ATLAS ALARM -> INFN: PUT GRIDFTP_COPY_WAIT: CONNECTION TIMED OUT (GGUS:73236)

What time (UTC) | What happened
2011/08/06 14:30 (Saturday) | GGUS TEAM ticket, automatic notification to t1- AND automatic assignment to NGI_IT.
2011/08/06 17:43 | SRM seems to be down. Supporter escalates ticket to ALARM. Notification sent to address [...].
2011/08/06 17:43 | Automatic reply “You are not allowed to trigger an SMS alarm for INFN Tier1. Anyway your message has been forwarded to the operations mailing list.”
2011/08/06 19:53 | Site admin resets StoRM BE via power cycle, asks if ok now. Problem with SMS will be investigated during the week.
2011/08/06 22:30 | Supporter confirms all is ok, sets ticket as closed and verified.
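The four ALARM timelines above can be compared by the delay between ticket creation and the first recorded site action. The following is a minimal Python sketch of that arithmetic, not part of the original slides. The timestamps are copied from the timelines; the choice of which row counts as the "first site action" (Castor developer reply, redirector restart, GPFS restart, StoRM power cycle) is an assumption of this sketch, and the names `TICKETS`, `elapsed` and `first_action_delays` are hypothetical.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the slides (all times UTC).
FMT = "%Y/%m/%d %H:%M"

# (ticket, ticket opened, first recorded site action), copied from the timelines.
# Which row counts as the "first site action" is an assumption of this sketch.
TICKETS = [
    ("GGUS:72890", "2011/07/24 03:16", "2011/07/24 08:03"),
    ("GGUS:72944", "2011/07/26 08:56", "2011/07/26 09:56"),
    ("GGUS:73054", "2011/07/29 15:13", "2011/07/29 16:45"),
    ("GGUS:73236", "2011/08/06 14:30", "2011/08/06 19:53"),
]


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two slide-style timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def first_action_delays(tickets):
    """Map each ticket ID to the delay before the first site action."""
    return {tid: elapsed(opened, acted) for tid, opened, acted in tickets}


if __name__ == "__main__":
    for tid, delay in first_action_delays(TICKETS).items():
        minutes = int(delay.total_seconds() // 60)
        print(f"{tid}: {minutes // 60}h{minutes % 60:02d}m")
```

Under these assumptions, the two tickets whose ALARM notification failed to reach the operators or the site SMS system on a weekend (GGUS:72890 and GGUS:73236) show the longest delays.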

Analysis of the availability plots: Week of 18/07/2011

ATLAS
2.1 IN2P3-CC - UNSCHEDULED - problem with disk on the Oracle cluster - DB service was unstable

LHCb
4.1 LCG.IN2P3.fr - UNSCHEDULED - problem with disk on the Oracle cluster - DB service was unstable

Analysis of the availability plots: Week of 25/07/2011

ATLAS
2.1 Taiwan-LCG2 - SCHEDULED - Network Maintenance

CMS
3.1 T1_TW_ASGC - SCHEDULED - Network Maintenance and PhEDEx agent upgrade

Analysis of the availability plots: Week of 01/08/2011 All sites were operating above 50% threshold during the entire week. Nothing to report.

Conclusions

- Business as usual: successful record data taking
- Serious issue with databases at IN2P3 affecting ATLAS, CMS, LHCb
- Experienced many GGUS problems with ALARM submission and escalation (operators and piquet not always contacted)