Lessons learned after our recent cooling problem
Michele Onofri, Stefano Zani, Andrea Chierici
HEPiX Spring 2014
Outline
INFN-T1 on-call procedure
Incident
Recovery procedure
What we learned
Conclusions
21/05/2013 Andrea Chierici
INFN-T1 on-call procedure
On-call service
CNAF staff are on call on a weekly basis
– 2–3 times per year each
– Must live within 30 min of CNAF
– Service phone receives alarm SMSes
– Periodic training on security and intervention procedures
3 incidents in the last three years
– only this last one required the site to be totally powered off
Service Dashboard
Incident
What happened on the 9th of March
1.08am: fire alarm
– On-call person intervenes and calls the firefighters
2.45am: fire extinguished
3.18am: high-temperature warning
– Air conditioning blocked
– On-call person calls for help
4.40am: decision taken to shut down the center
12.00pm: chiller under maintenance
5.00pm: chiller fixed, center can be turned back on
9.00pm: farm back on-line, waiting for storage
10th of March
9.00am: support call to switch storage back on
6.00pm: center open again for LHC experiments
Next day: center fully open again
Chiller power supply
Incident representation
(Diagram: chillers 1–6 connected to a control-system head, which is fed by control-system power supplies 1 and 2.)
Incident examination
6 chillers serve the computing room
5 of them share the same power supply for the control logic (we did not know that!)
Fire in one of the control logic units cut power to 5 chillers out of 6
– 1 chiller was still working and we weren’t aware of it!
– Could we have avoided turning the whole center off? Probably not, but a controlled shutdown could have been done.
Facility monitoring app
Chiller n. 4
(Plot legend: black – electric power in (kW); blue – water temp. in (°C); yellow – water temp. out (°C); cyan – chiller room temp. (°C).)
Incident seen from inside
Incident seen from outside
Recovery procedure
Recovery procedure
Facility: support call for an emergency intervention on the chiller
– recovered the burned bus and control logic n. 4
Storage: support call
Farming: took the chance to apply all security patches and the latest kernel to the nodes
– Switch-on order: LSF server, CEs, UIs
– For a moment we considered upgrading to LSF 9
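The switch-on order used by the farming group can be sketched as a staged bring-up, where each group of hosts is started before the next. This is a minimal illustration only: the group names, hostnames and the `power_on` callback are hypothetical, not CNAF's actual inventory or tooling.

```python
# Staged bring-up sketch: each group is started before the next begins.
# Groups and hostnames are illustrative placeholders, not real CNAF hosts.
BRING_UP_ORDER = [
    ("lsf-server", ["lsf-master"]),
    ("computing-elements", ["ce01", "ce02"]),
    ("user-interfaces", ["ui01", "ui02"]),
]

def bring_up(power_on, order=BRING_UP_ORDER):
    """Call power_on(host) group by group; return hosts in start order."""
    started = []
    for group, hosts in order:
        for host in hosts:
            power_on(host)  # in practice: IPMI power-on, then wait for ssh
            started.append(host)
    return started
```

In a real deployment `power_on` would also block until the host passes a health check, so a failure in an early group stops the later ones from starting.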
Failures (1)
Old WNs
– BIOS battery exhausted, configuration reset (PXE boot, hyper-threading, disk configuration (AHCI))
– lost IPMI configuration (30% broken)
Failures (2)
Some storage controllers were replaced
1% of PCI cards (mainly 10 Gbit network) were replaced
Disks, power supplies and network switches were almost undamaged
What we learned
We fixed our weak point
(Diagram: each of the six chillers now has its own control-system power supply, Ctrl sys Pow 1–6, behind the control-system head.)
We lack an emergency button
Shutting the center down is not easy: a real “emergency shutdown” procedure is missing
– We could have avoided switching off the whole center if we had had more control
– Depending on the incident, some services may be left on-line
The person on call can’t know all the site details
Hosted services
Our computing room hosts services and nodes outside our direct supervision, over which it’s difficult to gain full control
– We need an emergency procedure for those too
– We need a better understanding of the SLAs
Conclusions
We benchmarked ourselves
It took 2 days to get the center back on-line
– less than one day to reopen for LHC experiments
– everyone was aware of what to do
– all working nodes rebooted with a solid configuration
– a few nodes were reinstalled and put back on-line in a few minutes
Lessons learned
We must have clearer evidence of which chillers are working at any given moment (the on-call person does not have it right now)
– The new dashboard appears to be the right place
We created a task force to implement a controlled shutdown procedure
– Establish a shutdown order: WNs should be switched off first, then disk-servers, grid and non-grid services, bastions, and finally network switches
In case of emergency, the on-call person is required to take a difficult decision
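The shutdown order above can be captured as an ordered list of service groups with a dry-run mode, so the on-call person can review the plan without powering anything off. This is a sketch under assumptions: the group names and the `stop_group` callback are placeholders, not an actual CNAF procedure.

```python
# Controlled-shutdown sketch: groups are stopped strictly in this order
# (worker nodes first, network switches last).
SHUTDOWN_ORDER = [
    "worker-nodes",
    "disk-servers",
    "grid-services",
    "non-grid-services",
    "bastions",
    "network-switches",
]

def controlled_shutdown(stop_group, dry_run=True):
    """Apply stop_group to each group in order; in dry-run only build the plan."""
    plan = []
    for group in SHUTDOWN_ORDER:
        plan.append(group)
        if not dry_run:
            stop_group(group)  # e.g. broadcast 'poweroff' to the group's hosts
    return plan
```

Calling `controlled_shutdown(None)` returns the ordered plan without touching any host, which is one way to review the procedure without the full-site test discussed on the next slide.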
Testing the shutdown procedure
The shutdown procedure we are implementing can’t be easily tested
How to perform a “simulation”?
– It doesn’t seem right to switch the center off just to prove we can do it safely
How do other sites address this?
Should periodic BIOS battery replacements be scheduled?