Lesson learned after our recent cooling problem Michele Onofri, Stefano Zani, Andrea Chierici HEPiX Spring 2014.


1 Lesson learned after our recent cooling problem Michele Onofri, Stefano Zani, Andrea Chierici HEPiX Spring 2014

2 Outline
INFN-T1 on-call procedure
Incident
Recovery procedure
What we learned
Conclusions
21/05/2013 Andrea Chierici

3 INFN-T1 on-call procedure

4 On-call service
CNAF staff on-call on a weekly basis
– 2/3 times per year
– Must live within 30 min from CNAF
– Service phone receiving alarm SMSes
– Periodic training on security and intervention procedures
3 incidents in last three years
– only this last one required the site to be totally powered off

5 Service Dashboard

6 Incident

7 What happened on the 9th of March
1.08am: fire alarm
– On-call person intervenes and calls firefighters
2.45am: fire extinguished
3.18am: high temp warning
– Air conditioning blocked
– On-call person calls for help
4.40am: decision is taken to shut down the center
12.00pm: chiller under maintenance
5.00pm: chiller fixed, center can be turned back on
9.00pm: farm back on-line, waiting for storage

8 10th of March
9.00am: support call to switch storage back on
6.00pm: center open again for LHC experiments
Next day: center fully open again

9 Chiller power supply

10 Incident representation
[Diagram: Chillers 1 to 6, two control system heads, control system power supplies 1 and 2]

11 Incident examination
6 chillers for the computing room
5 share the same power supply for the control logic (we did not know that!)
Fire in one of the control logic units cut power to 5 chillers out of 6
– 1 chiller was still working and we weren’t aware of that!
– Could we have avoided turning the whole center off? Probably not! But a controlled shutdown could have been done.
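
The hidden shared power supply described on this slide is the kind of dependency a simple inventory check can surface. The following is a minimal sketch, not the site's actual tooling: the chiller and supply names are hypothetical, and which chiller sat on the second supply is an assumption made only for illustration.

```python
from collections import defaultdict

# Hypothetical mapping of chiller -> control-logic power supply.
# It mirrors the pre-incident situation (5 of 6 chillers on one supply);
# the choice of chiller6 as the independent unit is an assumption.
CONTROL_POWER = {
    "chiller1": "pow1",
    "chiller2": "pow1",
    "chiller3": "pow1",
    "chiller4": "pow1",
    "chiller5": "pow1",
    "chiller6": "pow2",
}


def shared_supplies(mapping, max_share=1):
    """Return the supplies that feed more than max_share units."""
    groups = defaultdict(list)
    for unit, supply in mapping.items():
        groups[supply].append(unit)
    return {s: sorted(units) for s, units in groups.items() if len(units) > max_share}


def surviving_units(mapping, failed_supply):
    """Return the units still powered when failed_supply goes down."""
    return sorted(unit for unit, supply in mapping.items() if supply != failed_supply)


if __name__ == "__main__":
    print(shared_supplies(CONTROL_POWER))
    print(surviving_units(CONTROL_POWER, "pow1"))
```

Running such a check against an up-to-date inventory would have flagged the shared supply before the fire and told the on-call person that one chiller was still available.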

12 Facility monitoring app

13 Chiller n.4
[Plot legend: BLACK: Electric Power in (kW); BLUE: Water temp. IN (°C); YELLOW: Water temp. OUT (°C); CYAN: Ch. Room temp. (°C)]

14 Incident seen from inside

15 Incident seen from outside

16 Recovery procedure

17 Recovery procedure
Facility: support call for an emergency intervention on chiller
– recovered the burned bus and the control logic n.4
Storage: support call
Farming: took the chance to apply all security patches and latest kernel to nodes
– Switch-on order: LSF server, CEs, UIs
– For a moment we were thinking about upgrading to LSF 9

18 Failures (1)
Old WNs
– BIOS battery exhausted, configuration reset: PXE boot, hyper-threading, disk configuration (AHCI)
– lost IPMI configuration (30% broken)
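
Nodes that lost their BIOS settings can be spotted by comparing what each node reports against an expected baseline. This is a hedged sketch only: the setting names and values below are illustrative assumptions (the talk lists the affected areas, not the required values), and how the reported settings are collected is left out.

```python
# Assumed baseline for a worker node; names and values are illustrative,
# not taken from the talk.
EXPECTED_BIOS = {
    "boot_mode": "pxe",
    "hyper_threading": "enabled",
    "disk_mode": "ahci",
}


def config_drift(expected, reported):
    """Return {setting: (expected, reported)} for each mismatching or missing setting."""
    drift = {}
    for setting, wanted in expected.items():
        actual = reported.get(setting)  # None when the node did not report it
        if actual != wanted:
            drift[setting] = (wanted, actual)
    return drift


if __name__ == "__main__":
    # Hypothetical report from a node whose BIOS fell back to defaults.
    reset_node = {"boot_mode": "disk", "hyper_threading": "enabled", "disk_mode": "ide"}
    for setting, (wanted, actual) in config_drift(EXPECTED_BIOS, reset_node).items():
        print(f"{setting}: expected {wanted}, found {actual}")
```

An empty result means the node matches the baseline; anything else is a candidate for manual reconfiguration before it rejoins the farm.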

19 Failures (2)
Some storage controllers were replaced
1% of PCI cards (mainly 10Gbit network) replaced
Disks, power supplies and network switches were almost undamaged

20 What we learned

21 We fixed our weak point
[Diagram: Chillers 1 to 6, two control system heads, control system power supplies 1 to 6]

22 We miss an emergency button
Shutting the center down is not easy: a real “emergency shutdown” procedure is missing
– We could have avoided switching off the whole center if we had had more control
– Depending on the incident, some services may be left on-line
Person on-call can’t know all the site details

23 Hosted services
Our computing room hosts services and nodes outside our direct supervision, over which it’s difficult to gain full control
– We need an emergency procedure for those too
– We need a better understanding of the SLAs

24 Conclusions

25 We benchmarked ourselves
It took 2 days to get the center back on-line
– less than one to reopen to LHC experiments
– everyone was aware of what to do
– All working nodes rebooted with a solid configuration
– A few nodes were reinstalled and put back on line in a few minutes

26 Lesson learned
We must have clearer evidence of which chiller is working at every moment (on-call person does not have it right now)
– The new dashboard appears to be the right place
We created a task-force to implement a controlled shutdown procedure
– Establish a shutdown order
– WNs should be switched off first, then disk-servers, grid and non-grid services, bastions and finally network switches
In case of emergency, on-call person is required to take a difficult decision
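
The shutdown order on this slide can be written down as data, so that a plan is generated from the inventory instead of being decided under pressure. The sketch below assumes a hypothetical inventory that maps host names to roles; the role labels and host names are invented for illustration, and grid and non-grid services are placed in the same stage because the slide lists them together.

```python
# Stage rank per role, following the order given on the slide:
# WNs, disk-servers, grid and non-grid services, bastions, network switches.
STAGE_RANK = {
    "wn": 0,
    "disk-server": 1,
    "grid-service": 2,
    "non-grid-service": 2,
    "bastion": 3,
    "network-switch": 4,
}


def shutdown_plan(inventory):
    """Group hosts into ordered shutdown stages (first stage is switched off first)."""
    unknown = sorted(h for h, role in inventory.items() if role not in STAGE_RANK)
    if unknown:
        # Refuse to plan around hosts whose place in the order is undefined.
        raise ValueError(f"hosts with unknown role: {unknown}")
    stages = {}
    for host, role in inventory.items():
        stages.setdefault(STAGE_RANK[role], []).append(host)
    return [sorted(stages[rank]) for rank in sorted(stages)]


if __name__ == "__main__":
    # Hypothetical inventory; host names are made up.
    inventory = {
        "sw-core-1": "network-switch",
        "wn-001": "wn",
        "ce-01": "grid-service",
        "ds-01": "disk-server",
        "bastion-1": "bastion",
        "wn-002": "wn",
        "mail-1": "non-grid-service",
    }
    for number, hosts in enumerate(shutdown_plan(inventory), start=1):
        print(f"stage {number}: {', '.join(hosts)}")
```

Rejecting unknown roles is a deliberate choice here: hosted services outside direct supervision are exactly the ones most likely to be missing from the order.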

27 Testing shutdown procedure
The shutdown procedure we are implementing can’t be easily tested
How to perform a “simulation”?
– Doesn’t sound right to switch the center off just to prove we can do it safely
How do other sites address this?
Should periodic BIOS battery replacements be scheduled?
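
One partial answer to the "simulation" question is a dry-run mode: the procedure walks the real plan and logs every action it would take, without touching any machine. This is only a sketch under stated assumptions. The `power_off` callable is a hypothetical placeholder for whatever mechanism a site uses, and a dry run checks ordering and coverage, not whether hosts actually shut down cleanly.

```python
def run_shutdown(stages, power_off, dry_run=True):
    """Walk the stages in order; call power_off(host) only when dry_run is False."""
    log = []
    for number, hosts in enumerate(stages, start=1):
        for host in hosts:
            if dry_run:
                log.append(f"stage {number}: would power off {host}")
            else:
                power_off(host)
                log.append(f"stage {number}: powered off {host}")
    return log


if __name__ == "__main__":
    # Hypothetical plan and a stand-in power_off that only records calls.
    plan = [["wn-001", "wn-002"], ["ds-01"], ["sw-core-1"]]
    switched_off = []
    for line in run_shutdown(plan, switched_off.append, dry_run=True):
        print(line)
    print("hosts actually switched off:", switched_off)
```

Because the same function drives both modes, the dry run exercises the same ordering logic that a real emergency would use.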

