Lesson learned after our recent cooling problem Michele Onofri, Stefano Zani, Andrea Chierici HEPiX Spring 2014.


Outline
– INFN-T1 on-call procedure
– Incident
– Recovery procedure
– What we learned
– Conclusions
21/05/2013 Andrea Chierici

INFN-T1 on-call procedure

On-call service
CNAF staff are on-call on a weekly basis
– each person serves 2-3 times per year
– must live within 30 min of CNAF
– carries a service phone receiving alarm SMSes
– periodic training on security and intervention procedures
3 incidents in the last three years
– only this last one required the site to be totally powered off

Service Dashboard

Incident

What happened on the 9th of March
1.08am: fire alarm
– on-call person intervenes and calls the firefighters
2.45am: fire extinguished
3.18am: high-temperature warning
– air conditioning blocked
– on-call person calls for help
4.40am: decision is taken to shut down the center
12.00pm: chiller under maintenance
5.00pm: chiller fixed, center can be turned back on
9.00pm: farm back on-line, waiting for storage

10th of March
9.00am: support call to switch storage back on
6.00pm: center open again for the LHC experiments
Next day: center fully open again

Chiller power supply

Incident representation (diagram: chillers 1-6 with two control-system heads and two control-system power supplies; five chillers hang off the same supply)

Incident examination
6 chillers serve the computing room; 5 share the same power supply for the control logic (we did not know that!)
A fire in one of the control-logic units cut power to 5 chillers out of 6
– 1 chiller was still working and we weren't aware of that!
– Could we have avoided turning the whole center off? Probably not, but a controlled shutdown could have been done.
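The weak point described above is easy to spot once the chiller-to-power-supply mapping is written down. The sketch below (hypothetical host and supply names, not the actual plant inventory) counts how many chillers would survive the loss of each control-logic power supply, reproducing the pre-incident situation:

```python
# Hypothetical sketch: audit the chiller -> control-logic power mapping
# for single points of failure. Names are illustrative, not the real plant.

# Before the incident: chillers 1-5 shared one control-logic supply.
control_power = {
    "chiller1": "ctrl-pow-1",
    "chiller2": "ctrl-pow-1",
    "chiller3": "ctrl-pow-1",
    "chiller4": "ctrl-pow-1",
    "chiller5": "ctrl-pow-1",
    "chiller6": "ctrl-pow-2",
}

def survivors_per_failure(mapping):
    """For each power supply, count chillers still running if it fails."""
    total = len(mapping)
    return {supply: total - sum(1 for p in mapping.values() if p == supply)
            for supply in set(mapping.values())}

print(survivors_per_failure(control_power))
# Losing ctrl-pow-1 leaves only 1 chiller of 6 running.
```

Running this kind of check against the facility documentation would have flagged the shared supply before the fire did.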

Facility monitoring app

Chiller n.4
BLACK: electric power in (kW)
BLUE: water temp. in (°C)
YELLOW: water temp. out (°C)
CYAN: chiller room temp. (°C)

Incident seen from inside

Incident seen from outside

Recovery Procedure

Recovery procedure
Facility: support call for an emergency intervention on the chillers
– recovered the burned bus and control logic n.4
Storage: support call
Farming: took the chance to apply all security patches and the latest kernel to the nodes
– switch-on order: LSF server, CEs, UIs
– for a moment we considered upgrading to LSF 9

Failures (1)
Old WNs
– BIOS battery exhausted, configuration reset: PXE boot, hyper-threading, disk configuration (AHCI)
– lost IPMI configuration (30% broken)
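Silent BIOS resets like these only surface at the next reboot, so an inventory audit can catch them earlier. A minimal sketch, assuming a hypothetical per-node settings dump (the node names, setting keys and values are illustrative; in practice the data would come from a tool such as ipmitool or a vendor utility):

```python
# Hypothetical sketch: compare each WN's firmware settings against the
# expected baseline, to spot nodes whose exhausted BIOS battery reset them.
EXPECTED = {"boot": "pxe", "hyperthreading": "on", "sata_mode": "ahci"}

def audit(nodes):
    """Return {hostname: [settings that differ from EXPECTED]}."""
    bad = {}
    for host, settings in nodes.items():
        diffs = [k for k, v in EXPECTED.items() if settings.get(k) != v]
        if diffs:
            bad[host] = diffs
    return bad

# Illustrative inventory: wn002 lost its PXE and hyper-threading settings.
nodes = {
    "wn001": {"boot": "pxe", "hyperthreading": "on", "sata_mode": "ahci"},
    "wn002": {"boot": "disk", "hyperthreading": "off", "sata_mode": "ahci"},
}
print(audit(nodes))  # → {'wn002': ['boot', 'hyperthreading']}
```

Run periodically, such a check turns a surprise at power-on into a routine maintenance ticket.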

Failures (2)
Some storage controllers were replaced
1% of PCI cards (mainly 10 Gbit network) replaced
Disks, power supplies and network switches were almost undamaged

What we learned

We fixed our weak point (diagram: each of the six chillers now has its own control-system power supply, Ctrl sys Pow 1-6)

We lack an emergency button
Shutting the center down is not easy: a real "emergency shutdown" procedure is missing
– we could have avoided switching off the whole center if we had had more control
– depending on the incident, some services may be left on-line
The person on-call can't know all the site details

Hosted services
Our computing room hosts services and nodes outside our direct supervision, over which it is difficult to gain full control
– we need an emergency procedure for those too
– we need a better understanding of the SLAs

Conclusions

We benchmarked ourselves
It took 2 days to get the center back on-line
– less than one to reopen to the LHC experiments
– everyone was aware of what to do
– all working nodes rebooted with a solid configuration
– a few nodes were reinstalled and put back on-line in a few minutes

Lesson learned
We must have clearer evidence of which chillers are working at any given moment (the on-call person does not have it right now)
– the new dashboard appears to be the right place
We created a task force to implement a controlled shutdown procedure
– establish a shutdown order: WNs should be switched off first, then disk servers, grid and non-grid services, bastions and finally network switches
In case of emergency, the on-call person is required to take a difficult decision
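The shutdown order above can be made explicit as ordered batches, so the on-call person follows a fixed sequence instead of improvising. A minimal sketch, with hypothetical host names standing in for the real inventory:

```python
# Hypothetical sketch of the controlled shutdown order described above:
# worker nodes first, then disk servers, grid and non-grid services,
# bastions, and finally the network switches.
SHUTDOWN_ORDER = [
    ("worker-nodes",  ["wn001", "wn002"]),
    ("disk-servers",  ["ds01"]),
    ("grid-services", ["ce01", "ui01"]),
    ("bastions",      ["bastion1"]),
    ("network",       ["sw-core1"]),
]

def shutdown_plan(order):
    """Flatten the batches into the exact host sequence to power off."""
    return [host for _, hosts in order for host in hosts]

print(shutdown_plan(SHUTDOWN_ORDER))
# → ['wn001', 'wn002', 'ds01', 'ce01', 'ui01', 'bastion1', 'sw-core1']
```

Keeping the order as data rather than in people's heads also lets the dashboard display it and lets partial shutdowns reuse the same list.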

Testing the shutdown procedure
The shutdown procedure we are implementing can't be easily tested
How to perform a "simulation"?
– it doesn't sound right to switch the center off just to prove we can do it safely
How do other sites address this?
Should periodic BIOS battery replacements be scheduled?
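One partial answer to the simulation question is a dry-run mode: walk the full shutdown sequence, log every step, but execute nothing. A minimal sketch (hypothetical hosts; the real power-off action, e.g. an IPMI call, is passed in as a function):

```python
# Hypothetical dry-run wrapper: rehearse the shutdown sequence safely,
# or execute it for real by flipping a single flag.
def run_shutdown(hosts, power_off, dry_run=True):
    """Return the log of actions taken (or that would have been taken)."""
    log = []
    for host in hosts:
        if dry_run:
            log.append(f"DRY-RUN: would power off {host}")
        else:
            power_off(host)  # real mode: e.g. an IPMI power-off (assumption)
            log.append(f"powered off {host}")
    return log

# Rehearsal: nothing is actually switched off.
for line in run_shutdown(["wn001", "ds01"], power_off=lambda h: None):
    print(line)
```

This does not prove the hardware side works, but it exercises ordering, inventory and access paths, which is where most procedure errors hide.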