WLCG Service Report ~~~ WLCG Management Board, 1 st September 2009 1.

Slides:



Advertisements
Similar presentations
Storage Issues: the experiments’ perspective Flavia Donno CERN/IT WLCG Grid Deployment Board, CERN 9 September 2008.
Advertisements

CASTOR Upgrade, Testing and Issues Shaun de Witt GRIDPP August 2010.
AMOD Report Doug Benjamin Duke University. Hourly Jobs Running during last week 140 K Blue – MC simulation Yellow Data processing Red – user Analysis.
CERN IT Department CH-1211 Genève 23 Switzerland t Some Hints for “Best Practice” Regarding VO Boxes Running Critical Services and Real Use-cases.
WLCG ‘Weekly’ Service Report ~~~ WLCG Management Board, 22 th July 2008.
SSC2 and Update on Multi-user Pilot Jobs Framework Mingchao Ma, STFC – RAL HEPSysMan Meeting 20/06/2008.
LHCC Comprehensive Review – September WLCG Commissioning Schedule Still an ambitious programme ahead Still an ambitious programme ahead Timely testing.
Status of WLCG Tier-0 Maite Barroso, CERN-IT With input from T0 service managers Grid Deployment Board 9 April Apr-2014 Maite Barroso Lopez (at)
WLCG Service Report ~~~ WLCG Management Board, 18 th August
WLCG Service Report ~~~ WLCG Management Board, 27 th January 2009.
SC4 Workshop Outline (Strong overlap with POW!) 1.Get data rates at all Tier1s up to MoU Values Recent re-run shows the way! (More on next slides…) 2.Re-deploy.
WLCG Service Report ~~~ WLCG Management Board, 27 th October
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE ATLAS CMS LHCb Totals
GGUS summary (7 weeks) VOUserTeamAlarmTotal ALICE ATLAS CMS LHCb Totals 1 To calculate the totals for this slide and copy/paste the usual graph please:
GGUS summary ( 4 weeks ) VOUserTeamAlarmTotal ALICE ATLAS CMS LHCb Totals 1.
WLCG Service Report ~~~ WLCG Management Board, 24 th November
1 24x7 support status and plans at PIC Gonzalo Merino WLCG MB
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI AMOD report – Fernando H. Barreiro Megino CERN-IT-ES-VOS.
Status of the production and news about Nagios ALICE TF Meeting 22/07/2010.
CCRC’08 Weekly Update Jamie Shiers ~~~ LCG MB, 1 st April 2008.
Enabling Grids for E-sciencE System Analysis Working Group and Experiment Dashboard Julia Andreeva CERN Grid Operations Workshop – June, Stockholm.
WLCG Collaboration Workshop 7 – 9 July, Imperial College, London In Collaboration With GridPP Workshop Outline, Registration, Accommodation, Social Events.
WLCG Service Report ~~~ WLCG Management Board, 9 th August
1 LHCb on the Grid Raja Nandakumar (with contributions from Greig Cowan) ‏ GridPP21 3 rd September 2008.
INFSO-RI Enabling Grids for E-sciencE Enabling Grids for E-sciencE Pre-GDB Storage Classes summary of discussions Flavia Donno Pre-GDB.
WLCG Service Report ~~~ WLCG Management Board, 16 th December 2008.
Handling ALARMs for Critical Services Maria Girone, IT-ES Maite Barroso IT-PES, Maria Dimou, IT-ES WLCG MB, 19 February 2013.
GGUS Slides for the 2012/07/24 MB Drills cover the period of 2012/06/18 (Monday) until 2012/07/12 given my holiday starting the following weekend. Remove.
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE1102 ATLAS CMS LHCb Totals
WLCG Service Report ~~~ WLCG Management Board, 7 th September 2010 Updated 8 th September
WLCG Service Report ~~~ WLCG Management Board, 7 th July 2009.
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE4015 ATLAS CMS LHCb Totals
4 March 2008CCRC'08 Feb run - preliminary WLCG report 1 CCRC’08 Feb Run Preliminary WLCG Report.
Julia Andreeva on behalf of the MND section MND review.
WLCG Service Report ~~~ WLCG Management Board, 16 th September 2008 Minutes from daily meetings.
WLCG Service Report ~~~ WLCG Management Board, 31 st March 2009.
WLCG Service Report ~~~ WLCG Management Board, 7 th June
Maria Girone CERN - IT Tier0 plans and security and backup policy proposals Maria Girone, CERN IT-PSS.
WLCG Service Report ~~~ WLCG Management Board, 18 th September
WLCG Service Report ~~~ WLCG Management Board, 23 rd November
FTS monitoring work WLCG service reliability workshop November 2007 Alexander Uzhinskiy Andrey Nechaevskiy.
GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE4004 ATLAS CMS LHCb Totals
Operation Issues (Initiation for the discussion) Julia Andreeva, CERN WLCG workshop, Prague, March 2009.
CERN - IT Department CH-1211 Genève 23 Switzerland Operations procedures CERN Site Report Grid operations workshop Stockholm 13 June 2007.
WLCG ‘Weekly’ Service Report ~~~ WLCG Management Board, 5 th August 2008.
1 Update at RAL and in the Quattor community Ian Collier - RAL Tier1 HEPiX FAll 2010, Cornell.
SL5 Site Status GDB, September 2009 John Gordon. LCG SL5 Site Status ASGC T1 - will be finished before mid September. Actually the OS migration process.
WLCG critical services update Andrea Sciabà WLCG operations coordination meeting December 18, 2014.
LCG Issues from GDB John Gordon, STFC WLCG MB meeting September 28 th 2010.
8 August 2006MB Report on Status and Progress of SC4 activities 1 MB (Snapshot) Report on Status and Progress of SC4 activities A weekly report is gathered.
WLCG Service Report ~~~ WLCG Management Board, 9 th February
WLCG Service Report ~~~ WLCG Management Board, 14 th February
The Grid Storage System Deployment Working Group 6 th February 2007 Flavia Donno IT/GD, CERN.
WLCG Service Report Jean-Philippe Baud ~~~ WLCG Management Board, 24 th August
WLCG Operations Coordination report Maria Alandes, Andrea Sciabà IT-SDC On behalf of the WLCG Operations Coordination team GDB 9 th April 2014.
WLCG Service Report ~~~ WLCG Management Board, 17 th February 2009.
Summary of SC4 Disk-Disk Transfers LCG MB, April Jamie Shiers, CERN.
WLCG Service Report ~~~ WLCG Management Board, 10 th November
Analysis of Service Incident Reports Maria Girone WLCG Overview Board 3 rd December 2010, CERN.
GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE7029 ATLAS CMS LHCb Totals
GGUS summary (4 weeks) VOUserTeamAlarmTotal ALICE5016 ATLAS CMS6118 LHCb Totals
Site notifications with SAM and Dashboards Marian Babik SDC/MI Team IT/SDC/MI 12 th June 2013 GDB.
GGUS summary ( 9 weeks ) VOUserTeamAlarmTotal ALICE2608 ATLAS CMS LHCb Totals
WLCG ‘Weekly’ Service Report ~~~ WLCG Management Board, 19 th August 2008.
Cross-site problem resolution Focus on reliable file transfer service
WLCG Management Board, 16th July 2013
1 VO User Team Alarm Total ALICE ATLAS CMS
WLCG Service Report 5th – 18th July
Dirk Duellmann ~~~ WLCG Management Board, 27th July 2010
The LHCb Computing Data Challenge DC06
Presentation transcript:

WLCG Service Report ~~~ WLCG Management Board, 1 st September

Introduction Covers the two weeks 17 th to 29 th August Attendance First week affected by holiday period Significant ramp up with high attendance in second week Main events: BNL site name change: BNL-LCG2  BNL-ATLAS Kernel patch for published local exploits Two alarm tickets: ATLAS  BNL: TEST ALARM Ticket after name change CMS  CERN: All dedicated CMS LSF queues at CERN disabled with no clear pre-warning51172 Incidents leading to service incident reports 26/8/2009 CERN: Closing of public and production queues for Emergency batch reboot to apply kernel upgrade 2

GGUS summary (2 weeks) VOUserTeamAlarmTotal ALICE8008 ATLAS CMS2013 LHCb0120 Totals

4

Experiment Availabilities Reports NIKHEF completed move ~24/8 SARA: the ALICE ongoing downtime is due to a failing ALICE VObox there. GGUS ticket submitted on the 31 st. RAL was back in full prod on Tuesday 18 th after the air conditioning stoppages due to water chiller failures that had started on 12 th (see Harry’s report at the last MB) 5

Incidents: closing batch queues at CERN On the advice of the CERN security team and Grid security team, and due to two severe kernel vulnerabilities (one with no workaround), the batch services had to be reinstalled with a new kernel and rebooted. The system was partially drained first. Wednsday Long running public batch queues set Inactive to Drain servers Mail sent to info-experiments describing this and the plan to reboot at next day All batch queues set Inactive Mail sent for SB to Computer Operations describing current situation (posted 18.55)Mail MOD sees original info-expeiments mail and moves the update [saying it is all batch queus] to the usual place on SSB, but links from it the original mail sent at saying it is only the public batch queuesmail CMS "T0 operations list" mail describing the problem of the Inactive T0 queue and complaining about the confused messages ("public" queus vs. "all" queues) Update to computer operations on SSB saying that public and grid queues are now open with limited capacity.Update GGUS alarm ticket from CMS arrives detailing the issue with Inactive CMS T0 queueGGUS alarm ticket Operator responds to alarm ticket, asking for clarification CMS T0 hosts rebooted and queues reopened. Thursday Remaining batch jobs killed. Service rebooted Reboots ongoing (30% already available). All queues reactivated All machines rebooted and in production (except for a few outliers). 6 Wide announcement: ‘CERN is vulnerable’ Communication race condition where the subtle change ‘public’  ‘all’ wasn’t noticed by MoD

Closing batch queues at CERN cont… To drain or not? Only ~650 jobs out of 17,000 were lost But, closed queues == lost processing time Some VOs prefer us to keep queues open and not drain the nodes while rolling out the new kernel but rather kill the running jobs Potentially a large number of failed jobs The CERN batch configuration is highly customized with >70 dedicated queues For a simplified configuration with only a (few) large resources it would have been possible to handled the rollout almost transparently without closing the queue During the emergency rollout there were no time for negotiating how to handle each dedicated queues We are considering different options for how to better handle similar situations in the future Will make a proposal to the VOs later this month 7

BNL site name change Site name change in BDII on 25/8 at 15:00 CEST (09:00 EDS): BNL-LCG2  BNL-ATLAS A few minutes (~15’) later ATLAS Tier-1 sites could start updating their FTS channel definitions following a procedure provided by Gavin The update of the channel definitions had to be synchronized with the BDII updates. RAL and TRUIMF have disabled daily BDII updates. The update is still pending… Some sites were not actively confirming the update though successful. BNL planned the operation and executed it across the grid with the assistance of the Tier-0 FTS team The operation was transparent for ATLAS and transfers did not suffer. The site name was also done for GGUS and a TEST ALARM ticket was injected by ATLAS in order to confirm that the workflow was working. 8

Miscellaneous Reports ALICE VOBox issue at CERN Problem had been reported from several sites where ALICE VOBoxes couldn't connect to the Alien DB at CERN. The port, 8084, had been closed recently and must now be re-opened. RAL lost one disk server when recovering from the cooling problems 99k files out of 4M in MCDISK space token CMS tape migrations stalled at ASGC Ongoing for two weeks. Local CASTOR staff are investigating Help from CASTOR CERN may be required. If so, the mailing list should be used OPN problems at PIC connectivity between PIC and CNAF didn’t work over backup route Early announcement for downtime registration not possible anymore? Reported by PIC who wanted to register a several hours downtime a week ahead The option to send an advance announcement seems to have disappeared? 9

Summary BNL site name change successful NIKHEF move completed. Back in prod CMS alarm ticket for lost processing time when closing dedicated queues at CERN for emergency kernel rollout Procedure and communication need to be improved 10