
INFN Tier-1 status
Luca dell'Agnello, Daniele Cesini
GDB - 13/12/2017

11/9
The flood happened early in the morning of November 9
Breaking of one of the Bologna water main pipelines
The road near CNAF was also seriously damaged (see the fire truck in the photo)


First inspection
The water damaged nearly all the electrical equipment in the electrical room and the lower two units in the racks of the IT halls
The lower row of tapes in the library was also submerged

Damage to IT equipment* (1)
Computing farm
  ~34 kHS06 are now lost (~14% of the total capacity; see the note after this slide)
Network infrastructure
  Ethernet core switches unaffected
  IB/FC SAN: 3 Fibre Channel switches damaged
Library
  4 TSM-HSM servers lost (CMS, LHCb)
  1 tape drive, 136 tapes
  Data mostly recoverable according to tests in the lab
* Assuming that only submerged equipment was damaged
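A rough back-of-the-envelope check (the total capacity is not stated explicitly on the slide, so this is only an implied estimate): losing ~34 kHS06 at ~14% of the total suggests a pre-flood farm of roughly 240 kHS06.

```latex
% Implied total capacity, derived only from the two figures quoted on the slide:
% ~34 kHS06 lost, corresponding to ~14% of the total.
\[
  \text{total farm capacity} \approx \frac{34\ \text{kHS06}}{0.14} \approx 243\ \text{kHS06}
\]
```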

Damage to IT equipment* (2)
Nearly all storage disk systems involved
  2 Huawei JBODs (all non-LHC experiments except AMS, Darkside, Virgo)
  11 DDN JBODs (LHC, AMS)
    RAID parity affected
  2 Dell JBODs including controllers (astro-particle experiments)
    Most critical: 2 trays out of 5 went underwater, with a high probability of losing the data
  4 disk servers (ALICE)
* Assuming that only submerged equipment was damaged

What has been done (1)
Data center dried over the first weekend
IT services (non scientific computing) moved outside CNAF
General IP connectivity restored a few days after the flood
  A temporary power line (60 kW) was activated, essential for the GARR PoP equipment
Cleaning of the remaining dust and mud completed during the first week of December
  Hall 1, Hall 2, library area and core switch network area

What has been done (2)
Analyzed a subset of "wet" tapes
  Decision on the recovery strategy this week
Library inspection
  Cleaning and recertification scheduled for the beginning of January
Core switches tested
  Schedule for the foreseen upgrade to 100 Gbit/s confirmed (December 15)
2017 storage tender delivered (today)
  Installation should start this week and take the rest of the month
Recovery of the first electrical line (1.4 MW) started
  It should be available before Christmas (no UPS)

Recovery roadmap (1)
Recovery of the 1st UPS foreseen for the end of January
  Lease of a 500 kW UPS under consideration to start the recovery of the storage
Recovery of the second line + 2nd UPS still to be approved
Waiting for the replacement of the damaged components of the storage systems
  Wet disks dried
  Oldest systems repaired by ourselves using spare parts
  Data on systems being phased out will be moved to the new storage
Not all storage will be back in operation at the same time

Recovery roadmap (2)
In the meantime, test and (possibly) recover the disks
  Replacement of the enclosures and disks
  Depending on the available power, test of the file systems (one by one)
After the replacement of the electrical equipment, power on the storage and farm systems
  If possible, prioritizing experiments without other resources

Recovery roadmap (3)
Recovery of the old DDN systems completed
The farm will be switched on in January if the power line recovery schedule is respected
The 34 kHS06 loss will be replaced by resources installed at CINECA (timeline to be defined)