WLCG Service Report ~~~ WLCG Management Board, 17 th March 2009.

Slides:



Advertisements
Similar presentations
WLCG ‘Weekly’ Service Report ~~~ WLCG Management Board, 22 th July 2008.
Advertisements

LHCC Comprehensive Review – September WLCG Commissioning Schedule Still an ambitious programme ahead Still an ambitious programme ahead Timely testing.
Status of WLCG Tier-0 Maite Barroso, CERN-IT With input from T0 service managers Grid Deployment Board 9 April Apr-2014 Maite Barroso Lopez (at)
WLCG Service Report ~~~ WLCG Management Board, 27 th January 2009.
WLCG Service Report ~~~ WLCG Management Board, 27 th October
CERN IT Department CH-1211 Genève 23 Switzerland t EIS section review of recent activities Harry Renshall Andrea Sciabà IT-GS group meeting.
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE ATLAS CMS LHCb Totals
GGUS summary (7 weeks) VOUserTeamAlarmTotal ALICE ATLAS CMS LHCb Totals 1 To calculate the totals for this slide and copy/paste the usual graph please:
GGUS summary ( 4 weeks ) VOUserTeamAlarmTotal ALICE ATLAS CMS LHCb Totals 1.
WLCG Service Report ~~~ WLCG Management Board, 24 th November
1 24x7 support status and plans at PIC Gonzalo Merino WLCG MB
WLCG Service Report ~~~ WLCG Management Board, 1 st September
CCRC’08 Weekly Update Jamie Shiers ~~~ LCG MB, 1 st April 2008.
Database Administrator RAL Proposed Workshop Goals Dirk Duellmann, CERN.
Enabling Grids for E-sciencE System Analysis Working Group and Experiment Dashboard Julia Andreeva CERN Grid Operations Workshop – June, Stockholm.
WLCG Collaboration Workshop 7 – 9 July, Imperial College, London In Collaboration With GridPP Workshop Outline, Registration, Accommodation, Social Events.
WLCG Service Report ~~~ WLCG Management Board, 9 th August
CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle Real Application Clusters (RAC) Techniques for implementing & running robust.
Alberto Aimar CERN – LCG1 Reliability Reports – May 2007
1 LHCb on the Grid Raja Nandakumar (with contributions from Greig Cowan) ‏ GridPP21 3 rd September 2008.
Feedback from the Tier1s GDB, September CNAF 24x7 support On-call person for all critical infrastructural services (cooling, power etc..) Manager.
WLCG Service Report ~~~ WLCG Management Board, 16 th December 2008.
Monitoring for CCRC08, status and plans Julia Andreeva, CERN , F2F meeting, CERN.
GGUS Slides for the 2012/07/24 MB Drills cover the period of 2012/06/18 (Monday) until 2012/07/12 given my holiday starting the following weekend. Remove.
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE1102 ATLAS CMS LHCb Totals
Busy Storage Services Flavia Donno CERN/IT-GS WLCG Management Board, CERN 10 March 2009.
CCRC’08 Monthly Update ~~~ WLCG Grid Deployment Board, 14 th May 2008 Are we having fun yet?
WLCG Service Report ~~~ WLCG Management Board, 7 th September 2010 Updated 8 th September
CERN IT Department CH-1211 Genève 23 Switzerland t Streams Service Review Distributed Database Workshop CERN, 27 th November 2009 Eva Dafonte.
WLCG Service Report ~~~ WLCG Management Board, 7 th July 2009.
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE4015 ATLAS CMS LHCb Totals
4 March 2008CCRC'08 Feb run - preliminary WLCG report 1 CCRC’08 Feb Run Preliminary WLCG Report.
WLCG Service Report ~~~ WLCG Management Board, 16 th September 2008 Minutes from daily meetings.
WLCG Service Report ~~~ WLCG Management Board, 31 st March 2009.
Report from GSSD Storage Workshop Flavia Donno CERN WLCG GDB 4 July 2007.
WLCG Service Report ~~~ WLCG Management Board, 18 th September
WLCG Service Report ~~~ WLCG Management Board, 23 rd November
GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE4004 ATLAS CMS LHCb Totals
Operation Issues (Initiation for the discussion) Julia Andreeva, CERN WLCG workshop, Prague, March 2009.
CERN - IT Department CH-1211 Genève 23 Switzerland Operations procedures CERN Site Report Grid operations workshop Stockholm 13 June 2007.
WLCG ‘Weekly’ Service Report ~~~ WLCG Management Board, 5 th August 2008.
Criteria for Deploying gLite WMS and CE Ian Bird CERN IT LCG MB 6 th March 2007.
Enabling Grids for E-sciencE INFSO-RI Enabling Grids for E-sciencE Gavin McCance GDB – 6 June 2007 FTS 2.0 deployment and testing.
WLCG critical services update Andrea Sciabà WLCG operations coordination meeting December 18, 2014.
Operations model Maite Barroso, CERN On behalf of EGEE operations WLCG Service Workshop 11/02/2006.
CERN IT Department CH-1211 Geneva 23 Switzerland t Distributed Database Operations Workshop CERN, 17th November 2010 Dawid Wójcik Streams.
8 August 2006MB Report on Status and Progress of SC4 activities 1 MB (Snapshot) Report on Status and Progress of SC4 activities A weekly report is gathered.
WLCG Service Report ~~~ WLCG Management Board, 20 th January 2009.
WLCG Service Report ~~~ WLCG Management Board, 14 th February
WLCG Service Report Jean-Philippe Baud ~~~ WLCG Management Board, 24 th August
WLCG Operations Coordination report Maria Alandes, Andrea Sciabà IT-SDC On behalf of the WLCG Operations Coordination team GDB 9 th April 2014.
WLCG Service Report ~~~ WLCG Management Board, 17 th February 2009.
WLCG Service Report ~~~ WLCG Management Board, 10 th November
Analysis of Service Incident Reports Maria Girone WLCG Overview Board 3 rd December 2010, CERN.
Operation team at Ccin2p3 Suzanne Poulat –
GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE7029 ATLAS CMS LHCb Totals
ASGC incident report ASGC/OPS Jason Shih Nov 26 th 2009 Distributed Database Operations Workshop.
WLCG Accounting Task Force Update Julia Andreeva CERN GDB, 8 th of June,
WLCG Service Report ~~~ WLCG Management Board, 9 th December 2008.
GGUS summary ( 9 weeks ) VOUserTeamAlarmTotal ALICE2608 ATLAS CMS LHCb Totals
WLCG Management Board, 30th September 2008
1 VO User Team Alarm Total ALICE ATLAS CMS
~~~ WLCG Management Board, 10th March 2009
WLCG Management Board, 16th July 2013
1 VO User Team Alarm Total ALICE ATLAS CMS
WLCG Service Report 5th – 18th July
Take the summary from the table on
WLCG Collaboration Workshop: Outlook for 2009 – 2010
Dirk Duellmann ~~~ WLCG Management Board, 27th July 2010
WLCG Workshop Introduction
Presentation transcript:

WLCG Service Report ~~~ WLCG Management Board, 17 th March 2009

Introduction This report covers the week since the last WLCG MB  Another significant service incident this week – CASTOR ATLAS 12 hour outage on Saturday (CERN) The CASTORATLAS instance was unavailable for ~12hrs between 01:00 and 13:00 this morning. The outage was due to a database corruption. More details from the database team are available herehere A post-mortem analysis is available here.here. A series of (largely) “transparent” interventions on the CERN CASTOR DBs has been scheduled for this week: moremore 2

CASTOR ATLAS (CERN) ATLAS CASTOR and SRM production instances were down for approximately 12 hours: Timeline: 01:48 - operator called the SMoD following the alarm: "castor_degraded on C2atlassrv101":SMoD noticed some load on dbrac servers for atlast0 and heavy network traffic Found the following errors on the castor logs: Error caught in subRequestToDo. ORA-25408: can not safely replay call ORA-12514: TNS:listener does not currently know of service requested in connect descriptor  02:40 - SMoD tried to contact Nilo and Eric without success (didn't find the oracle support contact even after calling the operator)SMoD 04:36 - Alerted by their own monitoring Eric started to investigate the problem. The analysis and sequence of events on the DB is below. 12:20 - The database was been re-opened 12:24 - Miguel started the clean up and restart of CASTOR ATLAS 13:25 - CASTOR ATLAS is back in production 3 Root cause: we have been hit by a bug ( ), the root cause being that a block has been written (in one of the castor_stager.id2type table extents) in a way that can not be recovered later (transaction issue) Oracle have been requested to “publish an "alert" about such a critical problem (they publish a list of critical known issues), we review this list on a regular basis and we would have included the patch several months ago”

GGUS Summaries 4  1 left-over alarm ticket from the week-before-last’s scheduled test – ATLAS to ASGC (not opened during the agreed period due to understood problems at ASGC – responded to well within targets)  LHCb alarms: we also should be testing the ability of the VO’s alarm team to open tickets – i.e. ALL VOs should have people able to raise alarms at any time!  Reminder: we will repeat such a test the week after CHEP: the goal is to have completed the analysis prior to the following week’s F2F meetings (7 – 8 April 2009) VO concernedUSERTEAMALARMTOTAL ALICE3003 ATLAS CMS2002 LHCb71816 Totals

Service Summary – Experiment Tests 5

WMS Issues Growing concern on WMS stability issues – several sites (CERN, RAL, GRIF) that have installed the megapatch report stability problems – not always load related… Maarten Litmaath has done some investigations – the results of which are included on the following slides. 6

WMS investigations (Maarten) 1.There are at least 2 new problems besides the known bug + workaround: WMS-known-issues.asphttp://cern.ch/glite/packages/R3.1/deployment/glite-WMS/glite- WMS-known-issues.asp 2.A cron job was implemented to automate the workaround, but it changed the wrong parameter in the configuration and therefore failed; this should be fixed today. 3.On wms216 (LHCb) I temporarily disabled the cron job (it is enabled now), fixed the correct parameter and restarted the WM. It went fine until it crashed with a different segfault, details here: I had to move one unprocessed job out of the way to allow the recovery to proceed, after which yet a different segfault occurred, this time for a cancellation request; the good news here is that a simple restart dealt with that, so we probably can live with it. 4.… 7

WMS investigations cont. 4.Prior to investigating the WM troubles I looked into why the WMProxy had become autistic, details here: 5.Whatever caused those processes to hang, at least a restart must get rid of them: Those 3 bugs are all ranked major at the moment, but we may want to bump some or all of them to critical. We now need to check what is happening on both WMS nodes for ATLAS and on one WMS for SAM; probably more of the same. 8

CERN Network Intervention – March 19 2 part network intervention 1.The intervention will start with the upgrade of the routers of the General Purpose Network which will happen at 06:00 a.m. This will entail a ~15min interruption, which will affect access to AFS, NICE, MAIL and all Databases which are hosted in the GPN among other services. Next, the switches in the General Purpose Network that have not been previously upgraded will be upgraded resulting to a ~10min interruption. All services requiring access to services hosted in the Computer Center will see interruptions. 2.The routers of the LCG network will be upgraded at 08:00 a.m., mainly affecting the Batch system and CASTOR services, including Grid related services.The switches in the LCG network that have not been previously upgraded will be upgraded next. This network intervention is planned to have finished by noon (12:00 pm). After the network intervention is finished (13h00) there will be a 3h intervention affecting the CASTOR nameserver (all instances affected). CASTOR will be inaccessible during that intervention. More details here.here.  No doubt this and recent incidents will give us plenty of material to discuss at WLCG Collaboration workshop 9

CNAF downtime: 30/03 – 03(06?)/04 In the period 30/03/ /04/2009 the INFN-T1 at CNAF in Bologna will be completely down (power supply and air conditioning) due to the interconnection of the existing services to the new infrastructure system. Batch queues will be closed on Friday 27/03/2009. The electrical down will last from 30/03/2009 to 02/04/2009. The reactivation of all software systems, farming and storage services will start as soon as the intervention on the new infrastructure is completed; all services are planned to be completely up and running by 06/04/2009.  Please note that the dates of this scheduled downtime are dependent on proper commissioning of the new infrastructure, which has to happen before the actual downtime: in case of major problems with the commissioning, this downtime will be rescheduled. The only services planned to be maintained active during the downtime week (with "at risk" status) are: the wide area network Point of Presence (GARR) the INFN-T1 central core router Basic INFN national services Basic grid services (trouble ticketing, LFC) LSF license servers 10

Agenda Overview – Day 1 What?Who? IntroductionMilos, Jamie LHC Status ReportSee CHEP talk from Sergio Bertolucci WLCG Roadmap for 2009/2010Ian Experiment RoadmapsPeter, Kors, Matthias, Philippe Site ReviewsIan DB Service ReviewsMaria DM Service Reviews & OutlookJos, Barbara, Andrea, Brian, … Panel discussion on DM futures (including but not limited to short term priorities): experts, site & experiment representatives etc. WLCG Beer Tasting Session! 11

Agenda Overview – Day 2 What?Who? WLCG Analysis Services WGMarkus Supporting Analysis ServicesMassimo + discussion(s) Operations procedures & toolsNick et al Service Incident ReportsOlof et al Site SuspensionGraeme, Jamie, Nick, … Site Response[ sites, experiments ] MonitoringElisa, Julia, David, Rob, Brian, … Resource & Usage MonitoringSteve Downtime handingPablo Support issues: team & alarm tickets etc.Julia “Meet the SCOD”WLCG Service Coordination Rota WLCG Collaboration board (closed) followed by CHEP reception! 12

“Summer Time” Starts in Europe and many other countries on Sunday 29 th March – one week later in US and others See Fortunately does not coincide with WLCG workshop (and an early start due to security!)  “Spring forward, fall back” 13

Summary In addition to continuing Service Incidents – such as CASTOR + SRM at CERN – on-going issues at many sites reported at daily meetings by the experiments  Some of these can persist for long periods of time: need to follow-up on the most important of these and understand root cause(s) and possible resolution  We have plenty of material for the workshop this weekend! 14