Lessons learned after our recent cooling problem
Michele Onofri, Stefano Zani, Andrea Chierici
HEPiX Spring 2014
Outline
INFN-T1 on-call procedure
Incident
Recovery procedure
What we learned
Conclusions
21/05/2013 Andrea Chierici
INFN-T1 on-call procedure
On-call service
CNAF staff are on call on a weekly basis
– 2–3 times per year each
– Must live within 30 min of CNAF
– Service phone receives alarm SMSes
– Periodic training on security and intervention procedures
3 incidents in the last three years
– only this last one required the site to be totally powered off
Service Dashboard
Incident
What happened on the 9th of March
1.08am: fire alarm
– On-call person intervenes and calls the firefighters
2.45am: fire extinguished
3.18am: high-temperature warning
– Air conditioning blocked
– On-call person calls for help
4.40am: decision taken to shut down the center
12.00pm: chiller under maintenance
5.00pm: chiller fixed, center can be turned back on
9.00pm: farm back on-line, waiting for storage
10th of March
9.00am: support call to switch storage back on
6.00pm: center open again for LHC experiments
Next day: center fully open again
Chiller power supply
Incident representation
(Diagram: chillers 1–6 connected to a control-system head, which is fed by control-system power supplies 1 and 2.)
Incident examination
6 chillers serve the computing room
5 of them share the same power supply for the control logic (we did not know that!)
Fire in one of the control logic units cut power to 5 chillers out of 6
– 1 chiller was still working and we weren’t aware of it!
– Could we have avoided turning the whole center off? Probably not, but a controlled shutdown could have been done.
Facility monitoring app
Chiller n. 4
(Plot legend: black – electric power in (kW); blue – water temp. in (°C); yellow – water temp. out (°C); cyan – chiller room temp. (°C).)
Incident seen from inside
Incident seen from outside
Recovery procedure
Recovery procedure
Facility: support call for an emergency intervention on the chiller
– recovered the burned bus and control logic n. 4
Storage: support call
Farming: took the chance to apply all security patches and the latest kernel to the nodes
– Switch-on order: LSF server, CEs, UIs
– For a moment we considered upgrading to LSF 9
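The switch-on order used by the farming group can be sketched as a staged bring-up, where each group of hosts is started before the next. This is a minimal illustration only: the group names, hostnames and the `power_on` callback are hypothetical, not CNAF's actual inventory or tooling.

```python
# Staged bring-up sketch: each group is started before the next begins.
# Groups and hostnames are illustrative placeholders, not real CNAF hosts.
BRING_UP_ORDER = [
    ("lsf-server", ["lsf-master"]),
    ("computing-elements", ["ce01", "ce02"]),
    ("user-interfaces", ["ui01", "ui02"]),
]

def bring_up(power_on, order=BRING_UP_ORDER):
    """Call power_on(host) group by group; return hosts in start order."""
    started = []
    for group, hosts in order:
        for host in hosts:
            power_on(host)  # in practice: IPMI power-on, then wait for ssh
            started.append(host)
    return started
```

In a real deployment `power_on` would also block until the host passes a health check, so a failure in an early group stops the later ones from starting.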
Failures (1)
Old WNs
– BIOS battery exhausted, configuration reset (PXE boot, hyper-threading, disk configuration (AHCI))
– lost IPMI configuration (30% broken)
Failures (2)
Some storage controllers were replaced
1% of PCI cards (mainly 10 Gbit network) were replaced
Disks, power supplies and network switches were almost undamaged
What we learned
We fixed our weak point
(Diagram: each of the six chillers now has its own control-system power supply, Ctrl sys Pow 1–6, behind the control-system head.)
We lack an emergency button
Shutting the center down is not easy: a real “emergency shutdown” procedure is missing
– We could have avoided switching off the whole center if we had had more control
– Depending on the incident, some services may be left on-line
The person on call can’t know all the site details
Hosted services
Our computing room hosts services and nodes outside our direct supervision, over which it’s difficult to gain full control
– We need an emergency procedure for those too
– We need a better understanding of the SLAs
Conclusions
We benchmarked ourselves
It took 2 days to get the center back on-line
– less than one day to reopen for LHC experiments
– everyone was aware of what to do
– all working nodes rebooted with a solid configuration
– a few nodes were reinstalled and put back on-line in a few minutes
Lessons learned
We must have clearer evidence of which chillers are working at any given moment (the on-call person does not have it right now)
– The new dashboard appears to be the right place
We created a task force to implement a controlled shutdown procedure
– Establish a shutdown order: WNs should be switched off first, then disk-servers, grid and non-grid services, bastions, and finally network switches
In case of emergency, the on-call person is required to take a difficult decision
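The shutdown order above can be captured as an ordered list of service groups with a dry-run mode, so the on-call person can review the plan without powering anything off. This is a sketch under assumptions: the group names and the `stop_group` callback are placeholders, not an actual CNAF procedure.

```python
# Controlled-shutdown sketch: groups are stopped strictly in this order
# (worker nodes first, network switches last).
SHUTDOWN_ORDER = [
    "worker-nodes",
    "disk-servers",
    "grid-services",
    "non-grid-services",
    "bastions",
    "network-switches",
]

def controlled_shutdown(stop_group, dry_run=True):
    """Apply stop_group to each group in order; in dry-run only build the plan."""
    plan = []
    for group in SHUTDOWN_ORDER:
        plan.append(group)
        if not dry_run:
            stop_group(group)  # e.g. broadcast 'poweroff' to the group's hosts
    return plan
```

Calling `controlled_shutdown(None)` returns the ordered plan without touching any host, which is one way to review the procedure without the full-site test discussed on the next slide.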
Testing the shutdown procedure
The shutdown procedure we are implementing can’t be easily tested
How to perform a “simulation”?
– It doesn’t seem right to switch the center off just to prove we can do it safely
How do other sites address this?
Should periodic BIOS battery replacements be scheduled?