Data & Storage Services, CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it. DSS: Castor incident (and follow up). Alberto Pace.


1 Castor incident (and follow up)
Alberto Pace

2 The incident
On April 26th, after the upgrade to Castor 2.1.9-5, the restart and the config file change took place in the wrong sequence.
This activated an old “feature” (now a bug) that used a default policy instead.
In the absence of a policy, all tape pools were used in round robin.
Many experiments had a recycle tape pool. Files sent to a recycle pool are deleted after some days, when the files expire.
The CMS and ATLAS daemons were restarted on April 30th following an (unrelated) incident on slow streaming to tape.
The problem went undetected until May 15th, when ALICE reported missing files.
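The failure mode described on this slide can be illustrated with a minimal sketch. This is not Castor code: the pool names, the make_selector helper and the raw_policy function are all hypothetical, and the only behaviour taken from the slide is that a missing policy makes writes rotate over every tape pool, recycle pools included.

```python
from itertools import cycle
from typing import Callable, Optional, Sequence

# Hypothetical pool names; the slides do not give the real ones.
POOLS = ["alice_raw", "alice_recycle", "atlas_raw"]


def make_selector(
    pools: Sequence[str],
    policy: Optional[Callable[[str], str]] = None,
) -> Callable[[str], str]:
    """Return a function that picks a tape pool for a file."""
    round_robin = cycle(pools)

    def select(filename: str) -> str:
        if policy is not None:
            # Normal case: the configured policy decides.
            return policy(filename)
        # Fallback described on the slide: no policy loaded,
        # so every pool is used in turn.
        return next(round_robin)

    return select


def raw_policy(filename: str) -> str:
    # Hypothetical policy: raw data always goes to the raw pool.
    return "alice_raw"


files = [f"run{i}.raw" for i in range(6)]

with_policy = make_selector(POOLS, raw_policy)
without_policy = make_selector(POOLS)

placed_ok = [with_policy(f) for f in files]
placed_fallback = [without_policy(f) for f in files]

# Files that landed in a recycle pool will expire and be deleted.
misplaced = [f for f, p in zip(files, placed_fallback) if p.endswith("_recycle")]
print(misplaced)
```

With three pools, one file in three ends up in the recycle pool under the fallback, while the same files all go to the raw pool once a policy is in place.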

3 Impact on experiments
LHCb
–No recycle pools and no data loss; files were only written to the incorrect pool
CMS
–490 files were deleted
–We were able to restore the deleted files in the name server as the tapes were not yet recycled
–In parallel, CMS was able to copy back these files from existing Tier1 copies
ATLAS
–9689 files were deleted
–We were able to restore the deleted files in the name server as the tapes were not yet recycled
–In parallel, ATLAS was able to copy back these files from existing Tier1 copies
ALICE
–10268 files were deleted
–Tapes were (partially) overwritten before we were aware of the incident
–Data were not yet replicated to Tier1s

4 Actions already taken
All recycle pools were stopped.
–Disable unwanted “delete” actions on tape
All effort focussed on trying to recover data:
–Possible at CERN only when tape cartridges have not been overwritten
–For the ALICE tapes, we have identified the 10 cartridges that contained all the lost files. These tapes have been sent to IBM and SUN data recovery labs.

5 Additional planned actions
Review any possible source of file deletion in Castor in order to disable all unwanted “delete” actions on tape
–Considering disabling all possibility of “delete on tape” and cartridge recycling
–Implement “logical” instead of “physical” deletion
–Physical “write switch” on the cartridge
Identify possibilities to create an additional copy of the raw files, to be kept until the data is replicated at the Tier1s
Internal review of operational procedures and creation of a user-oriented dashboard
Run an external review of Castor in September
Continue the process of addressing the longer-term evolution of data archiving and management (Amsterdam Jamboree).
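The “logical” instead of “physical” deletion item can be sketched as follows. This is an illustration only, not the Castor design: the Catalogue and FileEntry classes, the tape label, the file paths and the 30-day retention period are all assumptions. The idea shown is that a delete only marks the catalogue entry, so the data stays on tape and the entry can be restored until a retention period has passed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class FileEntry:
    path: str
    tape: str
    deleted_at: Optional[datetime] = None  # None means the file is live


class Catalogue:
    """Toy name-server catalogue with logical (soft) deletion."""

    def __init__(self, retention: timedelta) -> None:
        self.retention = retention
        self.entries: Dict[str, FileEntry] = {}

    def add(self, path: str, tape: str) -> None:
        self.entries[path] = FileEntry(path, tape)

    def delete(self, path: str, now: datetime) -> None:
        # Logical deletion: mark the entry, leave the tape data untouched.
        self.entries[path].deleted_at = now

    def restore(self, path: str) -> None:
        # Undo a logical deletion while the data is still on tape.
        self.entries[path].deleted_at = None

    def visible(self) -> List[str]:
        return sorted(p for p, e in self.entries.items() if e.deleted_at is None)

    def purgeable(self, now: datetime) -> List[str]:
        # Only entries deleted at least `retention` ago are candidates
        # for physical removal.
        return sorted(
            p for p, e in self.entries.items()
            if e.deleted_at is not None and now - e.deleted_at >= self.retention
        )


t0 = datetime(2010, 4, 26)
cat = Catalogue(retention=timedelta(days=30))  # hypothetical retention period
cat.add("/alice/run1.raw", "T00001")
cat.add("/alice/run2.raw", "T00001")
cat.delete("/alice/run1.raw", now=t0)

hidden_view = cat.visible()                          # run1 no longer listed
early_purge = cat.purgeable(t0 + timedelta(days=19)) # nothing purgeable yet
cat.restore("/alice/run1.raw")
restored_view = cat.visible()                        # both files listed again
print(hidden_view, early_purge, restored_view)
```

In this sketch an unwanted delete noticed 19 days later is still fully reversible, because physical removal is a separate step that only considers entries older than the retention period.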

6 Links
Incident report
–https://twiki.cern.ch/twiki/bin/viewauth/CASTORService/IncidentsAliceRecycled14May2010
Incident response on data recovery
–https://twiki.cern.ch/twiki/bin/viewauth/DSSGroup/RecoveryAfterDataLossRecyclePools
Additional T0 copy
–https://twiki.cern.ch/twiki/bin/view/DSSGroup/CastorT0PoolsBackup

