Castor incident (and follow up). Alberto Pace, Data & Storage Services (DSS), CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

Presentation transcript:

Castor incident (and follow up)
Alberto Pace

The incident
On 26th April, after the upgrade to Castor, the restart and the config file change took place in the wrong sequence.
This activated an old “feature” (now a bug) that fell back to default policies instead of the configured ones.
In the absence of a policy, all tape pools were used in round robin.
Many experiments had a recycle tape pool. Files sent to a recycle pool are deleted after some days, when the files expire.
The CMS and ATLAS daemons were restarted on April 30th following an (unrelated) incident involving slow streaming to tape.
The problem went undetected until May 15th, when ALICE reported missing files.
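
The failure mode on this slide can be illustrated with a minimal Python sketch. It is not Castor code: the pool names, the `policy` callable and both function names are hypothetical, chosen only to show why a round-robin fallback is dangerous when one of the pools is a recycle pool, and what a "fail closed" alternative could look like.

```python
from itertools import cycle


class PolicyMissingError(RuntimeError):
    """Raised when no migration policy has been loaded."""


def make_round_robin_selector(pools):
    """Failure mode: with no policy, rotate over every pool, recycle pools included."""
    rotation = cycle(pools)

    def select(file_name):
        # The file name is ignored: the next pool in the rotation is used.
        return next(rotation)

    return select


def select_pool_fail_closed(file_name, policy, pools):
    """Alternative (an assumption, not the Castor fix): refuse to pick a pool without a policy."""
    if policy is None:
        raise PolicyMissingError(f"no policy loaded, refusing to migrate {file_name}")
    pool = policy(file_name)
    if pool not in pools:
        raise ValueError(f"policy returned unknown pool {pool!r}")
    return pool


# Hypothetical pool names, for illustration only.
POOLS = ["raw_data", "recycle", "user_data"]

selector = make_round_robin_selector(POOLS)
assigned = {f"file{i}": selector(f"file{i}") for i in range(6)}
misrouted = [name for name, pool in assigned.items() if pool == "recycle"]
print(misrouted)
```

With three pools, every third file lands in the recycle pool, where it would later expire and be deleted. The fail-closed variant raises an error instead, so a missing policy stops migration rather than silently misrouting data.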

Impact on experiments
LHCb
– No recycle pools, no data loss; files were only written to the incorrect pool
CMS
– 490 files were deleted
– We were able to restore the deleted files in the name server, as the tapes were not yet recycled
– In parallel, CMS was able to copy these files back from existing Tier1 copies
ATLAS
– 9689 files were deleted
– We were able to restore the deleted files in the name server, as the tapes were not yet recycled
– In parallel, ATLAS was able to copy these files back from existing Tier1 copies
ALICE
– 10268 files were deleted
– Tapes were (partially) overwritten before we were aware of the incident
– Data were not yet replicated to Tier1s

Actions already taken
All recycle pools were stopped.
– This disables unwanted “delete” actions on tape
All effort focused on recovering data:
– Recovery is possible at CERN only when tape cartridges have not been overwritten
– For the ALICE tapes, we have identified the 10 cartridges that contained all the lost files. These tapes have been sent to the IBM and SUN data recovery labs.

Additional planned actions
Review any possible source of file deletion in Castor in order to disable all unwanted “delete” actions on tape
– Considering disabling all possibility of “delete on tape” and cartridge recycling
– Implement “logical” instead of “physical” deletion
– Physical “write switch” on the cartridge
Identify possibilities to create an additional copy of the raw files, to be kept until the data is replicated at the Tier1s
Internal review of operational procedures and creation of a user-oriented dashboard
Run an external review of Castor in September
Continue the process of addressing the longer-term evolution of data archiving and management (Amsterdam Jamboree)
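
The idea of “logical” instead of “physical” deletion can be sketched as follows. This is a toy model under stated assumptions, not the Castor name server: the `Catalogue` class, its methods, the tape label and the grace period are all hypothetical. A delete only marks the entry, so it can be undone, and a tape becomes eligible for reuse only once every file on it has been marked for longer than a grace period.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileEntry:
    tape: str
    deleted_at: Optional[float] = None  # None means the file is live


class Catalogue:
    """Toy name server that marks files as deleted instead of removing them."""

    def __init__(self) -> None:
        self._entries: Dict[str, FileEntry] = {}

    def add(self, path: str, tape: str) -> None:
        self._entries[path] = FileEntry(tape=tape)

    def delete(self, path: str, now: float) -> None:
        # Logical deletion: record the time, keep the entry and the tape data.
        self._entries[path].deleted_at = now

    def restore(self, path: str) -> None:
        self._entries[path].deleted_at = None

    def visible(self) -> List[str]:
        return sorted(p for p, e in self._entries.items() if e.deleted_at is None)

    def reclaimable_tapes(self, now: float, grace_seconds: float) -> List[str]:
        """Tapes on which every file has been marked deleted for at least the grace period."""
        by_tape: Dict[str, List[FileEntry]] = {}
        for entry in self._entries.values():
            by_tape.setdefault(entry.tape, []).append(entry)
        return sorted(
            tape
            for tape, entries in by_tape.items()
            if all(
                e.deleted_at is not None and now - e.deleted_at >= grace_seconds
                for e in entries
            )
        )


catalogue = Catalogue()
catalogue.add("/raw/run1", tape="T00001")  # hypothetical path and tape label
catalogue.add("/raw/run2", tape="T00001")
catalogue.delete("/raw/run1", now=0.0)
print(catalogue.visible())
print(catalogue.reclaimable_tapes(now=100.0, grace_seconds=50.0))
catalogue.restore("/raw/run1")
print(catalogue.visible())
```

In this sketch the tape is never offered for reuse while a live file remains on it, and an accidental delete is reversible until the grace period has passed for every file on the cartridge.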

Links
Incident report
– https://twiki.cern.ch/twiki/bin/viewauth/CASTORService/IncidentsAliceRecycled14May2010
Incident response on data recovery
– https://twiki.cern.ch/twiki/bin/viewauth/DSSGroup/RecoveryAfterDataLossRecyclePools
Additional T0 copy
– https://twiki.cern.ch/twiki/bin/view/DSSGroup/CastorT0PoolsBackup