Summary of CASTOR incident, April 2010
Germán Cancio, Leader, DSS-TAB section
CERN IT Department – Data & Storage Services (DSS)

The incident in a nutshell

On April 26th, following a CASTOR software upgrade, a fraction of production data started to flow from disk to “recycle” tape pools (where data is deleted after some time). The incident was detected on May 14th, when ALICE reported missing files. ALICE, CMS and ATLAS were affected by this incident.

In total, 10258 files were unintentionally deleted, out of which 4823 were recovered. All 5435 lost files were owned by ALICE. It took 5 weeks from incident detection until full resolution on June 18th.

Agenda
– Background information: recycle pools, tape pool selection
– Incident trigger and root cause
– Incident detection, recovery and communication
– Follow-up actions

Background information: recycle pools and tape pool selection

CASTOR tape recycle pools

“Tape recycle pools” are tape pools configured at CERN for storing temporary data. The recycling consists of a cron job which deletes tapes that have been marked as FULL for over 2 weeks; the volume can then be re-used for new data (a sketch of this recycling logic follows below).
– Deletion == removal from the CASTOR name server

Tape pool recycling was created at CERN during early data challenge (DC) validation of CASTOR-2 in September 2005.
– Recycle tape pools have existed since then, configured and used in production on the ALICE, CMS and ATLAS stagers.

Tape recycling is not part of the CASTOR software bundle. From the CASTOR software perspective, a recycle pool looks like any other (persistent) tape pool.
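For illustration only, a minimal sketch of what such a recycling cron job could look like. The pool names, the database query helper and the nsrm invocation are all assumptions: the actual CERN script is not part of the CASTOR software bundle and its code is not shown in these slides.

```python
#!/usr/bin/env python
# Hypothetical sketch of a tape recycle-pool cron job, NOT the actual CERN
# script. Assumptions: query_full_tapes() returns (vid, marked_full_since)
# pairs for a pool, and "nsrm --tape VID" is a stand-in for the name-server
# removal step that makes the volume re-usable.

import subprocess
import time

RECYCLE_POOLS = ["alice_recycle", "cms_recycle", "atlas_recycle"]  # invented
GRACE_PERIOD = 14 * 24 * 3600  # tape must be FULL for over 2 weeks

def query_full_tapes(pool):
    """Return [(vid, marked_full_since_epoch), ...] for FULL tapes in pool."""
    raise NotImplementedError("site-specific database query")

def delete_from_name_server(vid):
    """Remove all files on tape vid from the CASTOR name server (== deletion)."""
    subprocess.check_call(["nsrm", "--tape", vid])  # hypothetical CLI

def main():
    now = time.time()
    for pool in RECYCLE_POOLS:
        for vid, full_since in query_full_tapes(pool):
            if now - full_since > GRACE_PERIOD:
                delete_from_name_server(vid)

if __name__ == "__main__":
    main()
```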

Tape pool selection

The selection of tape pools in CASTOR is configured on every CASTOR stager, and defined at two levels:
1) A tape-migration-enabled (HSM) disk pool is “associated” with one or more tape pools.
2) For every file written to a tape-enabled disk pool, a policy script is called.
– Depending on file metadata passed as parameters (such as mode, owner, time, “class”, etc.), the policy will select a tape pool from the list of associated tape pools.

In case no policy is defined, the file will be migrated to any of the “associated” tape pools. This hardcoded default policy was intended as a safety measure to avoid data pile-ups, and has been in CASTOR since May 2005. A sketch of this selection logic follows below.
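A minimal sketch of this two-level selection, including the hardcoded fallback that plays a role in the incident; the function and parameter names are assumptions for illustration, not the actual CASTOR code.

```python
import random

# Hypothetical sketch of CASTOR tape-pool selection, not the actual code.

def select_tape_pool(file_meta, associated_pools, policy=None):
    """Pick the tape pool a file migrates to.

    associated_pools: tape pools associated with the file's disk pool (level 1).
    policy:           configured policy script choosing among them (level 2).
    """
    if policy is not None:
        # Normal case: the policy inspects file metadata (mode, owner, time,
        # "class", ...) and selects one of the associated tape pools.
        return policy(file_meta, associated_pools)
    # Hardcoded default: with no policy defined, migrate to ANY associated
    # pool. Intended to avoid data pile-ups, but if a recycle pool is among
    # the associated pools, production data can end up there.
    return random.choice(associated_pools)
```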

Incident trigger and root cause

Incident trigger

A naming convention change for the migration policy was introduced in the CASTOR release of April 2010, which was marked as a transparent upgrade. However, this naming convention change required first updating the configuration files on the CASTOR stager and only afterwards restarting the associated CASTOR migration daemon.

On April 26th, the new release was deployed on the CASTOR production stagers. The daemons were restarted before, and not after, the configuration file changes. The new policy configuration was not found by the new daemon version, so it applied the default hardcoded policy (sketched below). This caused new files to migrate towards any associated tape pool, including recycle pools.

On the CMS and ATLAS instances, the incorrect policy settings were detected on April 30th, and the migration daemons were restarted. This stopped the hardcoded policy after 4 days. For the ALICE stager, there was no daemon restart before the incident was reported; the hardcoded policy ran for 20 days.
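For illustration, the failure mode can be sketched as follows; the configuration key names are invented, and the point is the silent fallback when the lookup under the new naming convention fails.

```python
# Hypothetical sketch of why the restart order mattered; the configuration
# key names are invented for illustration, not the actual CASTOR ones.

POLICIES = {"prodPolicy": lambda meta, pools: pools[0]}  # toy policy registry

def load_migration_policy(config: dict):
    """Return the policy callable, or None to trigger the hardcoded default."""
    name = config.get("MigrationPolicy")  # NEW naming convention (hypothetical)
    if name is None:
        return None  # silent fallback: migrate to ANY associated tape pool
    return POLICIES[name]

# Daemon restarted BEFORE the config change: the file still uses the old key,
# so the new daemon finds nothing and falls back to the hardcoded default.
old_config = {"MigratorPolicy": "prodPolicy"}  # OLD naming convention
assert load_migration_policy(old_config) is None

# Correct order: update the configuration first, THEN restart the daemon.
new_config = {"MigrationPolicy": "prodPolicy"}
assert load_migration_policy(new_config) is not None
```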

Incident root cause analysis

The combination of recycle tape pools and the hardcoded migration policy was a dormant bomb, existing since 2005. Both the recycle tape pools and the hardcoded migration policy were undocumented.

Any similar policy misconfiguration (e.g. empty configuration file, configuration typo, Quattor problem) between 2005 and 2010 could have triggered a similar incident.
– We have no evidence that this happened, but cannot rule it out completely either.

Incident detection, recovery and communication

Incident detection time line

Fri May 14th: ALICE creates a GGUS ticket reporting catalogue inconsistencies – missing files on CASTOR.
– 10258 files missing, out of which 1773 high-priority (900 GeV calibration run)

Sat May 15th: Preliminary analysis shows that the missing files have been sent to recycle pools and deleted. Tape recycling is stopped.

Mon May 17th: Analysis completed; incident trigger understood. Verification on the CMS and ATLAS stagers shows missing files for the same reason. Disk garbage collection stopped.

Incident recovery (1/3)

The first step was to obtain the list of files incorrectly migrated to the recycle tape pools and then deleted.
– Reconstructed using: tape pool logs, name server unlink logs, monitoring logs, and the target migration policy.
– The list was verified with the affected experiments.

None of the deleted files were found on disk (all garbage collected). Tape residency information (VID, blockID, size, checksum) was reconstructed using the name server logs (a sketch follows below).

For ATLAS and CMS, all deleted files were located on tapes which had been recycled but not yet overwritten (i.e. the data was physically still on tape). For ALICE, all deleted files were on tapes which had been recycled and overwritten: 10 tapes were partially, and 2 tapes completely, overwritten.
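A sketch of the kind of reconstruction involved; the log line format below is invented, since the real name server log layout is not given in these slides.

```python
import csv

# Hypothetical sketch: rebuild tape residency (VID, blockID, size, checksum)
# for deleted files from name-server unlink logs. The CSV format is invented.

def rebuild_residency(unlink_log_path):
    """Map fileid -> residency info, from assumed log lines of the form
    timestamp,fileid,path,vid,block_id,size,adler32
    """
    residency = {}
    with open(unlink_log_path, newline="") as f:
        for _ts, fileid, path, vid, block_id, size, adler32 in csv.reader(f):
            residency[fileid] = {
                "path": path,
                "vid": vid,                 # tape volume id
                "block_id": int(block_id),  # position of the file on tape
                "size": int(size),
                "checksum": adler32,
            }
    return residency
```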

Incident recovery (2/3)

ATLAS and CMS: Defined a procedure for restoring the deleted name server information, using a mixture of name server commands and generated back-door SQL statements (sketched below), and applied it.
– Not all metadata could be retrieved from the logs: the original ownership, permissions and ACLs were lost.

Recalled all files, verified size and checksum, and informed the users of the successful restore. The CMS files were restored by Tue May 18th; the ATLAS files by Wed May 19th.
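A sketch of what generating such back-door SQL could look like; the table and column names are invented, since the actual name server schema is not shown in these slides.

```python
# Hypothetical sketch of generating back-door SQL to re-insert deleted
# name-server entries; table/column names are invented for illustration.

def restore_statements(residency):
    """Yield SQL INSERTs recreating entries from reconstructed residency data
    (fileid -> {"path", "vid", "block_id", "size", "checksum"}).

    Ownership, permissions and ACLs were not recoverable from the logs,
    so such fields would have to be filled with defaults.
    """
    for fileid, m in residency.items():
        yield ("INSERT INTO ns_file_metadata (fileid, name, filesize, checksum) "
               f"VALUES ({fileid}, '{m['path']}', {m['size']}, '{m['checksum']}');")
        yield ("INSERT INTO ns_tape_segment (fileid, vid, block_id, seg_size) "
               f"VALUES ({fileid}, '{m['vid']}', {m['block_id']}, {m['size']});")
```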

Incident recovery (3/3)

ALICE: For the 2 completely overwritten tapes (end-of-data at the physical end-of-tape), all files written before the recycling were lost. Files written after the recycling were restored by Thu May 20th.

For the 10 partially overwritten tapes, the files between the end-of-data and end-of-tape marks were potentially recoverable by the vendors. The tapes were physically sent to IBM and Sun on Thu May 20th. Tapes received back with recovered data were imported into CASTOR and copied. The IBM tapes were received on June 7th, the Sun tapes on June 17th.

Out of 10258 files, 4823 were recovered and 5435 persistently lost.
– From the 1773 high-priority 900 GeV files, 1717 were recovered and 56 lost.

Incident communication

Continuously provided information to stakeholders (experiment contacts, IT/LCG management) regarding incident impact, recovery expectations and progress:
– Incident post-mortem, incl. time line: Recycled14May2010
– Recovery progress tracking page
– LCG Management Board presentation (Indico)
– Documented technical description of the recovery steps: LossRecyclePools

Follow-up actions

Follow-up actions (1/5)

All tape recycle pools were stopped on Sat May 15th, and have since been decommissioned.
– Replaced by “test data” pools without automated data deletion.

Other tape media reuse has been stopped.
– Background data defragmentation stopped.
– Deleted data is still on tape.
– Additional cost: ~1 tape / day (to be reviewed end 2010)

Implemented/deployed background tape media verification, scanning all new data plus the complete CASTOR tape archive.
– Verify data integrity and consistency (see the sketch after this list)
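A minimal sketch of such a background verification pass, assuming a site-specific recall helper and Adler-32 checksums in the catalogue (both assumptions; the deployed verification engine is only referenced, not shown, in these slides).

```python
import zlib

# Hypothetical sketch of background tape media verification, not the
# deployed CASTOR verification engine.

def read_segment(vid, block_id, size):
    """Recall one file segment (size bytes) from tape vid; site-specific."""
    raise NotImplementedError("tape recall")

def verify_tape(vid, catalogue_entries):
    """Re-read every segment on tape vid and compare with the catalogue.

    catalogue_entries: [(block_id, size, adler32_hex), ...] from the name
    server. Returns the entries whose size or checksum mismatch.
    """
    bad = []
    for block_id, size, adler32_hex in catalogue_entries:
        data = read_segment(vid, block_id, size)
        checksum = format(zlib.adler32(data) & 0xFFFFFFFF, "08x")
        if len(data) != size or checksum != adler32_hex:
            bad.append((block_id, size, adler32_hex))
    return bad
```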

Follow-up actions (2/5)

Removed the hardcoded default migration policy from the migration daemon, and added safety nets for configuration file handling (see the sketch after this list):
– bug #67945: RFE: The mighunter should not start if the migration and stream policy modules are not configured properly
– bug #68020: RFE: The mighunter should gracefully stop when it encounters an invalid migration-policy function-name

Review if and how logical file deletion in CASTOR can be implemented.
– Assess impact in terms of code, databases, operations
– Work in progress

Request to improve the name server logs to allow a full metadata restore.
– bug #67763: RFE: provide missing file metadata attributes during file deletion
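A sketch of the fail-fast behaviour these two RFEs describe, in contrast to the silent fallback sketched earlier; the configuration key names are invented.

```python
# Hypothetical sketch of the configuration safety nets; key names invented.

class ConfigError(Exception):
    """Raised instead of silently falling back to a hardcoded policy."""

def load_policies_or_die(config: dict):
    required = ("MigrationPolicy", "StreamPolicy")
    missing = [k for k in required if k not in config]
    if missing:
        # bug #67945 behaviour: refuse to start if the policy modules are
        # not configured properly.
        raise ConfigError(f"refusing to start, unconfigured: {missing}")
    for key in required:
        if not config[key].isidentifier():
            # bug #68020 behaviour: stop gracefully on an invalid
            # migration-policy function name.
            raise ConfigError(f"invalid policy function name in {key!r}")
    return config["MigrationPolicy"], config["StreamPolicy"]
```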

Follow-up actions (3/5)

Improvements for operations change management:
– CM events tracked via Savannah
– Notifications sent to all involved parties (stager operations, tape operations, development/3rd level)
– Talks by Miguel and Vlado

Software change management procedures are being reviewed and tightened.
– Sebastien’s talk

Review monitoring & dashboard for user-oriented information.
– Talks by Miguel, Vlado, Dirk

Follow-up actions (4/5)

Physical protection for tapes containing RAW data:
– Flip the read-only tab on full tapes belonging to RAW data pools; analysis under sReadOnly
– Manpower intensive: ~1.5h to eject, flip over and re-enter 100 cartridges – when the libraries are idle
– Would lead to backlogs during heavy load periods (heavy-ion run, repacking media to higher density)
– Requires the logical deletion functionality, as not all metadata is stored on tape

Follow-up actions (5/5)

Investigate adding additional independent safeguards for raw data disk pools:
– Analysis using TSM (incremental backup, archiving): lsBackup
– Significant cost in terms of infrastructure investments, licensing and media
– Requires logical deletion, as metadata is not stored on the disk pools
– It would be more useful to review, together with the experiments, data replication within CASTOR and off-site to the Tier-1s

Questions / comments?

Summary of links:
– Incident post-mortem page
– Recovery status tracking page
– LCG Management Board presentation
– Technical description of recovery steps
– Tape background verification
– Backup of CASTOR Tier-0 pools with TSM
– Read-only tab protection for CASTOR
– CASTOR release notes (cf. “configuration changes” and “upgrade instructions”)