Summary of CASTOR incident, April 2010
Germán Cancio, Leader, DSS-TAB section
Data & Storage Services (DSS), CERN IT Department, CH-1211 Genève 23, Switzerland
The incident in a nutshell
On April 26th, following a CASTOR software upgrade, a fraction of production data started to flow from disk to “recycle” tape pools (where data is deleted after some time).
The incident was detected on May 14th, when ALICE reported missing files.
ALICE, CMS, and ATLAS were affected by this incident.
In total, files were unintentionally deleted, out of which were recovered. All 5435 lost files were owned by ALICE.
It took 5 weeks from incident detection until full resolution on June 18th.
Agenda
Background information:
  – Recycle pools
  – Tape pool selection
Incident trigger and root cause
Incident detection, recovery and communication
Follow-up actions
CASTOR tape recycle pools
“Tape recycle pools” are tape pools configured at CERN for storing temporary data.
Recycling consists of a cron job which deletes the data on tapes that have been marked as FULL for over 2 weeks. The volume can then be re-used for new data.
  – deletion == removal from the CASTOR name server
Tape pool recycling was created at CERN during early DC validation of CASTOR-2 in September 2005.
  – Recycle tape pools have existed ever since, and are configured and used in production on the ALICE, CMS and ATLAS stagers.
Tape recycling is not part of the CASTOR software bundle. From the CASTOR software perspective, a recycle pool looks like any other (persistent) tape pool.
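The recycling rule described above can be sketched in a few lines. This is an illustration only, not the actual CERN cron job: the `Tape` record, its field names and the `tapes_to_recycle` helper are hypothetical, and the only fact taken from the slide is the "FULL for over 2 weeks" threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set

# Threshold taken from the slide: tapes marked FULL for over 2 weeks.
RECYCLE_AFTER = timedelta(weeks=2)


@dataclass
class Tape:
    """Hypothetical tape record; field names are assumptions."""
    vid: str                         # volume identifier
    pool: str                        # name of the tape pool the tape belongs to
    full_since: Optional[datetime]   # when the tape was marked FULL, None if not full


def tapes_to_recycle(tapes: Iterable[Tape], recycle_pools: Set[str],
                     now: datetime) -> List[str]:
    """Return the VIDs of recycle-pool tapes that have been FULL for over 2 weeks."""
    return [
        t.vid
        for t in tapes
        if t.pool in recycle_pools
        and t.full_since is not None
        and now - t.full_since > RECYCLE_AFTER
    ]
```

Note that the selection only looks at the pool a tape belongs to, not at what was written to it. Any file that lands in a recycle pool by mistake is therefore treated like temporary data.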
Tape pool selection
The selection of tape pools in CASTOR is configured on every CASTOR stager and defined at two levels:
1) A tape-migration-enabled (HSM) disk pool is “associated” with one or more tape pools.
2) For every file written to a tape-enabled disk pool, a policy script is called.
  – Depending on the file metadata passed as parameters (such as mode, owner, time, “class”, etc.), the policy selects a tape pool from the list of associated tape pools.
If no policy is defined, the file is migrated to any of the “associated” tape pools.
This hardcoded default policy was intended as a safety measure to avoid data pile-ups, and has been in CASTOR since May
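The two-level selection and its fallback can be illustrated with a minimal sketch. It is not CASTOR code: `select_tape_pool`, `example_policy` and the pool names are hypothetical, and modelling "any of the associated tape pools" as a random choice is an assumption made for illustration.

```python
import random
from typing import Callable, Dict, List, Optional

# A policy maps file metadata and the associated pools to one pool name.
Policy = Callable[[Dict[str, str], List[str]], str]


def select_tape_pool(file_meta: Dict[str, str], associated_pools: List[str],
                     policy: Optional[Policy], rng=random) -> str:
    """Pick the tape pool for a file written to a tape-enabled disk pool."""
    if policy is not None:
        chosen = policy(file_meta, associated_pools)
        if chosen not in associated_pools:
            raise ValueError(f"policy chose unassociated pool {chosen!r}")
        return chosen
    # Fallback when no policy is defined: any associated pool may be used.
    # Modelled here as a random pick (an assumption).
    return rng.choice(associated_pools)


def example_policy(file_meta: Dict[str, str], pools: List[str]) -> str:
    """Hypothetical policy: only files of class 'temp' go to the recycle pool."""
    return "recycle" if file_meta.get("class") == "temp" else "raw"
```

With a policy in place, a production file is always sent to the persistent pool. Without one, the same file can end up in the recycle pool whenever that pool is among the associated pools.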
Incident trigger
A naming convention change for the migration policy was introduced in CASTOR , released April was marked as a transparent upgrade.
However, this naming convention change required first updating the configuration files on the CASTOR stager and only afterwards restarting the associated CASTOR migration daemon.
On April 26th, was deployed on CASTOR production stagers.
Daemons were restarted before, and not after, the configuration file changes. The new daemon version did not find the new policy configuration, so it applied the default hardcoded policy.
This caused new files to migrate towards any associated tape pool, including recycle pools.
On the CMS and ATLAS instances, incorrect policy settings were detected on April 30th, and the migration daemons were restarted. This stopped the hardcoded policy after 4 days.
For the ALICE stager, there was no daemon restart before the incident was reported; the hardcoded policy ran for 20 days.
Incident root cause analysis
The combination of recycle tape pools and the hardcoded migration policy was a dormant bomb, existing since
Both the recycle tape pools and the hardcoded migration policy were undocumented.
Any similar policy misconfiguration (e.g. empty configuration file, configuration typo, Quattor problem) between 2005 and 2010 could have triggered a similar incident.
  – We have no evidence that this has happened, but cannot rule it out with 100% certainty either.
Incident detection time line
Fri May 14th: ALICE creates a GGUS ticket reporting catalogue inconsistencies (missing files on CASTOR).
  – 10258 files missing, out of which 1773 high-priority (900 GeV calibration run)
Sat May 15th: Preliminary analysis shows that the missing files have been sent to recycling pools and deleted. Tape recycling is stopped.
Mon May 17th: Analysis completed; incident trigger understood. Verification on the CMS and ATLAS stagers shows missing files for the same reason. Disk garbage collection stopped.
Incident recovery (1/3)
The first step was to obtain the list of files incorrectly migrated to the recycle tape pools and then deleted.
  – Reconstructed using: tape pool logs, name server unlink logs, monitoring logs, target migration policy.
  – Verified the list with the affected experiments.
None of the deleted files were found on disk (all had been garbage collected).
Reconstructed tape residency information (VID, blockID, size, checksum) using name server logs.
For ATLAS and CMS, all deleted files were located on tapes which had been recycled, but not yet overwritten (i.e. the data was physically still on tape).
For ALICE, all deleted files were on tapes which had been recycled and overwritten. 10 tapes were partially overwritten, and 2 tapes completely.
Incident recovery (2/3)
ATLAS and CMS:
Defined a procedure for restoring the deleted name server information, using a mixture of name server commands and generated back-door SQL statements. Applied the procedure.
  – Not all metadata could be retrieved from the logs: original ownership, permissions and ACLs were lost.
Recalled all files, verified size and checksum, and informed the users of the successful restore.
CMS files were restored by Tue May 18th. ATLAS files were restored by Wed May 19th.
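The "verified size and checksum" step can be sketched as follows. The slides do not say which checksum algorithm was used; Adler-32 is assumed here purely for illustration, and the function names are hypothetical. The expected values would come from the residency information reconstructed from the name server logs.

```python
import io
import zlib
from typing import BinaryIO, Tuple


def size_and_adler32(stream: BinaryIO, chunk_size: int = 1 << 20) -> Tuple[int, int]:
    """Read a stream in chunks and return (size in bytes, Adler-32 checksum)."""
    checksum = 1  # initial value of an Adler-32 running checksum
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        checksum = zlib.adler32(chunk, checksum)
        size += len(chunk)
    return size, checksum & 0xFFFFFFFF


def verify_recalled_file(stream: BinaryIO, expected_size: int,
                         expected_checksum: int) -> bool:
    """True only if both size and checksum match the values recorded before deletion."""
    size, checksum = size_and_adler32(stream)
    return size == expected_size and checksum == expected_checksum


if __name__ == "__main__":
    data = b"example payload"
    print(verify_recalled_file(io.BytesIO(data), len(data), zlib.adler32(data)))
```

Checking the size as well as the checksum is cheap and catches truncated recalls before the checksum comparison is even considered.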
Incident recovery (3/3)
ALICE:
For the 2 completely overwritten tapes (EOD at physical EOT), all files written before recycling were lost. Files written after recycling were restored by Thu May 20th.
For the 10 partially overwritten tapes, the files between EOD and EOT were potentially recoverable by the vendors. The tapes were physically sent to IBM and Sun on Thu May 20th.
Tapes received back with recovered data were imported into CASTOR and copied. IBM tapes were received on June 7th, Sun tapes on June 17th.
Out of the 10258 missing files, 4823 were recovered and 5435 permanently lost.
  – Of the 1773 high-priority 900 GeV files, 1717 were recovered and 56 lost.
Incident communication
Continuously provided information to stakeholders (experiment contacts, IT/LCG management) regarding incident impact, recovery expectations and progress:
  – Incident post-mortem incl. time line: Recycled14May2010
  – Recovery progress: ery
  – LCG Management Board presentation: (Indico)
  – Documented technical description of recovery steps: LossRecyclePools
Follow-up actions (1/5)
All tape recycle pools were stopped on Sat May 15th, and have since been decommissioned.
  – Replaced by ‘test data’ pools without automated data deletion.
Other tape media reuse has been stopped.
  – Background data defragmentation stopped.
  – Deleted data is still on tape.
  – Additional cost: ~1 tape / day (to be reviewed end 2010)
Implemented and deployed background tape media verification, scanning all new data plus the complete CASTOR tape archive.
  – Verifies data integrity and consistency
Follow-up actions (2/5)
Removed the hardcoded default migration policy from the migration daemon, and added safety nets for configuration file handling.
  – bug #67945: RFE: The mighunter should not start if the migration and stream policy modules are not configured properly
  – bug #68020: RFE: The mighunter should gracefully stop when it encounters an invalid migration-policy function-name
Review if and how logical file deletion in CASTOR can be implemented.
  – Assess impact in terms of code, databases, operations
  – Work in progress
Request to improve name server logs for full metadata restore.
  – bug #67763: RFE: provide missing file metadata attributes during file deletion
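The safety net requested in bugs #67945 and #68020 amounts to failing fast rather than falling back to a default. A minimal sketch of that idea follows; it is not the mighunter implementation, and the `migration_policy` configuration key, the `PolicyConfigError` class and the `load_migration_policy` helper are all hypothetical names.

```python
from typing import Callable, Mapping


class PolicyConfigError(RuntimeError):
    """Raised when the migration policy is missing or names an unknown function."""


def load_migration_policy(config: Mapping[str, str],
                          available: Mapping[str, Callable],
                          key: str = "migration_policy") -> Callable:
    """Return the configured policy function, or refuse to continue.

    There is deliberately no default: a missing, empty or mistyped entry
    raises instead of silently selecting a fallback policy.
    """
    name = config.get(key, "").strip()
    if not name:
        raise PolicyConfigError(f"no value configured for {key!r}")
    if name not in available:
        raise PolicyConfigError(f"unknown migration policy function {name!r}")
    return available[name]
```

A daemon built this way would refuse to start after a restart with stale configuration, turning a silent misrouting of data into an immediately visible startup failure.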
Follow-up actions (3/5)
Improvements for operations Change Management
  – CM events tracked via Savannah
  – Notifications sent to all involved parties (stager operations, tape operations, development 3rd level)
  – Talks by Miguel and Vlado
Software Change Management procedures being reviewed and tightened
  – Sebastien’s talk
Review monitoring & dashboard for user-oriented information
  – Talks by Miguel, Vlado, Dirk
Follow-up actions (4/5)
Physical protection for tapes containing RAW data
  – Flip over the read-only tab on full tapes belonging to RAW data pools. Analysis under: sReadOnly
  – Manpower intensive: 1.5 h to eject, flip over and re-enter 100 cartridges, when libraries are idle.
  – Would lead to backlogs during heavy load periods (HI run, repacking media to higher density)
  – Requires logical deletion functionality, as not all metadata is stored on tape
Follow-up actions (5/5)
Investigate adding independent safeguards for raw data disk pools
  – Analysis using TSM (incremental backup, archiving): lsBackup
  – Significant cost in terms of infrastructure investments, licensing and media
  – Requires logical deletion, as metadata is not stored on disk pools
  – More useful to review with the experiments data replication within CASTOR and off-site to T1s
Questions / comments?
Summary of links:
  – Incident post-mortem page:
  – Recovery status tracking page:
  – LCG Management Board presentation:
  – Technical description of recovery steps:
  – Tape background verification:
  – Backup of CASTOR Tier-0 pools with TSM:
  – Read-only tab protection for CASTOR:
  – CASTOR release notes: (cf. ‘configuration changes’ and ‘upgrade instructions’)