1
INFN Tier-1 status
Luca dell’Agnello, Daniele Cesini
GDB - 13/12/2017
2
11/9
- The flood happened early in the morning of November 9
- Cause: breaking of one of the Bologna water main pipelines
- The road near CNAF was also seriously damaged (see fire truck)
4
First inspection
- The water damaged nearly all the electrical equipment in the electrical room and the lower two units in the racks of the IT halls
- The lower row of tapes in the library was also submerged
5
Damage to IT equipment* (1)
- Computing farm
  - ~34 kHS06 are now lost (~14% of the total capacity)
- Network infrastructure
  - Ethernet core switches unaffected
  - IB/FC SAN: 3 Fibre Channel switches damaged
- Library
  - 4 TSM-HSM servers lost (CMS, LHCb)
  - 1 tape drive, 136 tapes
  - Data mostly recoverable according to tests in lab
* In the hypothesis that only submerged equipment was damaged
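As a quick sanity check on the farm figures above, the two quoted numbers (~34 kHS06 lost, ~14% of the total) imply a rough total capacity. The sketch below only does that arithmetic; the derived totals are estimates from rounded slide values, not figures stated in the presentation.

```python
# Figures quoted on the slide (both are approximate).
lost_khs06 = 34.0      # computing power lost, in kHS06
lost_fraction = 0.14   # share of the total farm capacity

# Implied values: derived estimates, not numbers taken from the slides.
total_khs06 = lost_khs06 / lost_fraction
remaining_khs06 = total_khs06 - lost_khs06

print(f"implied total:     ~{total_khs06:.0f} kHS06")      # ~243 kHS06
print(f"implied remaining: ~{remaining_khs06:.0f} kHS06")  # ~209 kHS06
```

Because both inputs are rounded, the results should be read as order-of-magnitude estimates only.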
6
Damage to IT equipment* (2)
- Nearly all storage disk systems involved
  - 2 Huawei JBODs (all non-LHC experiments except AMS, Darkside, Virgo)
  - 11 DDN JBODs (LHC, AMS)
    - RAID parity affected
  - 2 Dell JBODs including controllers (astro-particle experiments)
    - Most critical: 2 trays out of 5 went underwater. High probability of losing the data
  - 4 disk servers (all 4 ALICE)
* Assuming that only submerged equipment was damaged
7
What has been done (1)
- Data center dried over the first weekend
- IT services (non-scientific computing) moved outside CNAF
- General IP connectivity restored a few days after the flood
- Activated a temporary power line (60 kW)
  - Essential for GARR POP equipment
- Cleaning of remaining dust and mud completed during the first week of December
  - Hall 1, Hall 2, library area and core switch network area
8
What has been done (2)
- Analyzed a subset of “wet tapes”
  - Decision on recovery strategy this week
- Inspection of the library
  - Cleaning and recertification scheduled for the beginning of January
- Core switches tested
  - Schedule for the foreseen upgrade to 100 Gbit confirmed (December 15)
- Tender 2017 storage delivered (today)
  - Installation should start this week and take the rest of the month
- Started recovery of the first electrical line (1.4 MW)
  - It should be available before Christmas
  - No UPS
9
Recovery roadmap (1)
- Recovery of the 1st UPS foreseen for the end of January
  - Lease of a 500 kW UPS under consideration to start recovery of storage
  - Recovery of the second line + 2nd UPS still to be approved
- Waiting for the replacement of damaged components of storage systems
  - Wet disks dried
  - Oldest systems repaired by ourselves using spare parts
  - Data on systems being phased out will be moved to new storage
  - Not all storage will be back in operation at the same time
10
Recovery roadmap (2)
- In the meantime, test and (possibly) recover the disks
- Replacement of the enclosures and disks
- Depending on the available power, test of the file systems (one by one)
- After the replacement of the electrical equipment, power on the storage and farm systems
  - If possible, prioritizing experiments without other resources
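The idea of bringing systems back one by one within the available power, giving priority to experiments without other resources, can be illustrated with a small sketch. Everything in it is hypothetical: the system names, the power draws, the power budget and the greedy selection rule are assumptions made for illustration, not the procedure actually used at CNAF.

```python
from dataclasses import dataclass


@dataclass
class StorageSystem:
    name: str                  # hypothetical label
    power_kw: float            # hypothetical power draw in kW
    has_other_resources: bool  # True if the experiment can also work elsewhere


def power_on_order(systems, available_kw):
    """Return the names of the systems to power on, in order, within the budget."""
    # Experiments without other resources come first (False sorts before True);
    # sorted() is stable, so ties keep their input order.
    ranked = sorted(systems, key=lambda s: s.has_other_resources)
    selected, used_kw = [], 0.0
    for system in ranked:
        # Greedy rule (an assumption): take a system only if it still fits.
        if used_kw + system.power_kw <= available_kw:
            selected.append(system.name)
            used_kw += system.power_kw
    return selected


# Hypothetical inventory and power budget, for illustration only.
systems = [
    StorageSystem("fs-a", 30.0, True),
    StorageSystem("fs-b", 25.0, False),
    StorageSystem("fs-c", 20.0, False),
]
print(power_on_order(systems, available_kw=50.0))  # ['fs-b', 'fs-c']
```

With a larger budget the remaining system is simply appended at the end, so the priority order is preserved as more power becomes available.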
11
Recovery roadmap
- Recovery of old DDN completed
- Farm will be switched on in January if the power line recovery schedule is respected
- The 34 kHS06 loss will be replaced by resources installed at CINECA (timeline to be defined)