WLCG Service Report ~~~ WLCG Management Board, 27th October 2009


WLCG Service Report ~~~ WLCG Management Board, 27th October

Introduction
- Covers the weeks 12th to 25th October.
- Mixture of problems.
- Incidents leading to (eventual) service incident reports:
  - RAL disk subsystem failures took down FTS, LFC and CASTOR from 4th to 9th October and eventually led to the loss of 10 days of data (… files) at RAL.
  - ASGC ATLAS conditions database still not synchronized.
  - ASGC CASTOR DB corrupted (21st October, not recovered yet).

Meeting Attendance Summary
Site     Attendance (M T W T F)
CERN     YYYYY
ASGC     YYYYY
BNL      YYYYY
CNAF     Y
FNAL
FZK      YYYY
IN2P3    Y
NDGF
NL-T1    YYYY
PIC      YYY
RAL      YYYYY
TRIUMF

GGUS summary (2 weeks)
VO       User  Team  Alarm  Total
ALICE    0     0     0      0
ATLAS
CMS      8     0     1      9
LHCb
Totals

Alarm tickets
There were 3 alarm tickets in the week starting 12th October:
- CERN CASTOR stager stuck, reported by LHCb
- CERN CASTOR Name Server problem, reported by CMS
- CERN CASTOR Name Server problem, reported by LHCb

GGUS tickets and OSG
The OIM view provided to GGUS should list only the ‘resource group’ name (BNL_ATLAS), with valid contact and emergency addresses.


RAL Disk failures 1/2
- ATLAS, CMS and LHCb SAM tests saw the RAL LFC, FTS and CASTOR downtimes (4 to 7 October for LFC and FTS, and up to 9 October for CASTOR) caused by failing disk subsystems. ALICE tests only its VOboxes and saw an interruption from an SL4 to SL5 migration.
- RAL CASTOR runs on a SAN with disk systems containing primary and mirrored databases. Hardware faults, present on the mirror since 10 September, also hit the primary on 4 October and CASTOR went down.
- The decision was to revert to older hardware and then revalidate the failing systems. Temperature problems were suspected early on.
- 8 October: CASTOR being restored without loss for ALICE and CMS, and losing a few hours of transactions for ATLAS and LHCb, estimated at … files. List of lost files being prepared for the experiments' decision.
- 9 October: CASTOR restored. Experiments to recover lost files or to clean catalogues. Vendor working with RAL to understand the root cause of the failures.
- 14 October: Discovered a problem with the database used following the restore, which resulted in the loss of around the last ten days of data added to CASTOR. The database restore itself had been OK; the problem arose when Oracle opened the database and picked up the ‘wrong’ disk array.
- 21 October: List of lost files (… for ATLAS) produced and LFC cleanup started. In fact only one dataset did not have a copy available at another site.

RAL Disk failures 2/2
SIRs:
- Hardware failures and loss of service
- Loss of data following restoration of services

ASGC DB Problems
2 major DB problems:
- ATLAS Conditions DB: has not been available for more than 4 weeks now.
  - CERN DM group recommends performing a complete re-instantiation using transportable tablespaces. BNL will be the source.
  - Synchronization should happen tomorrow, 28th October (09:00 CET).
- CASTOR DB: has not been available for almost a week.
  - All recovery attempts failed.
  - Should the DB be reset?
  - A phone conference will take place tomorrow, 28th October (09:15 CET).

Miscellaneous Reports
- CASTOR Name Server problem at CERN due to the new CASTOR release (2.1.9).
- 180 files lost at NL-T1 (tape destroyed).
- Problem at CNAF installing a CMS SW release (not understood).
- Instability of the ATLAS Conditions Database at BNL due to high load from Tier-2s; solved by increasing memory.
- Problems with the new SRM release at CERN (needed a rollback to the previous version).
- Problems with the new BDII release at CERN (needed a rollback to the previous version).
- SRM problem at BNL last Friday due to a Java exception.

Summary/Conclusions
- Very long-standing problems at ASGC (CASTOR and Conditions Database).
- Serious disk hardware failures at RAL: files lost.
- A number of sites, including ASGC, have been unable to recover production databases from backups / recovery areas, with major downtimes occurring as a result.
- A coordinated DB recovery validation exercise will take place on 26th November: RAL and ASGC are especially encouraged to participate.