B. Todd, A. Apollonio, S. Gabourin, S. Uznanski
Principles and Experience in the Design & Operation of Dependable Systems (1v2)
1. CERN and the LHC
2. Dependable Design Principles
3. Experiences to date

- Dependable systems are the result of good engineering practices
- Good engineers = good systems
- System specifications need dependability requirements
- Failure modes are just as important as rates
- Watch out for the dependencies
CERN: European Centre for Nuclear Research (Conseil Européen pour la Recherche Nucléaire)
- Founded in 1954
- Funded by the European Union
- 20 Member States (…most of the EU…)
- 8 Observer States and Organisations (…Japan, Russia, USA…)
- 35 Non-Member States (…Australia, Canada, New Zealand…)
- 580 Institutes World Wide
- 2500 Staff
- 8000 Visiting Scientists

Pure Science – Particle Physics
1. Pushing the boundaries of research, physics beyond the standard model.
2. Advancing frontiers of technology.
3. Forming collaborations through science.
4. Educating the scientists and engineers of tomorrow.
CERN uses particle accelerators and detectors to study the basic constituents of matter. Accelerators boost beams of particles to high energies before they are made to collide with each other or with stationary targets. Detectors observe and record the results of these collisions. Our flagship project is the Large Hadron Collider…
[Figure: map of the CERN Accelerator Complex, showing Lake Geneva, Geneva Airport, CERN LAB 1 (Switzerland), CERN LAB 2 (France), the Proton Synchrotron (PS), the Super Proton Synchrotron (SPS) and the Large Hadron Collider (LHC): 27 km long, 150 m underground]
[Figure: CERN Accelerator Complex, showing the Large Hadron Collider (LHC), the Super Proton Synchrotron (SPS), the Beam-1 Transfer Line (TI2), the Beam-2 Transfer Line (TI8) and the Beam Dumping Systems; 100 µs for one turn]
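The one-turn time quoted for the LHC can be sanity-checked from the ring length. A minimal sketch, assuming the protons travel at essentially the speed of light and using the rounded 27 km length quoted in these slides; the function name is illustrative only.

```python
# Speed of light in vacuum, exact by SI definition [m/s].
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def turn_time_us(circumference_m: float) -> float:
    """Time for one turn of the ring in microseconds, assuming v ≈ c."""
    return circumference_m / SPEED_OF_LIGHT_M_PER_S * 1e6


if __name__ == "__main__":
    # 27 km is the rounded ring length quoted in the slides.
    print(f"{turn_time_us(27_000.0):.0f} us per turn")
```

Under these assumptions the result is roughly 90 µs, which agrees with the quoted figure at the order-of-magnitude level.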
[Figure: CERN Accelerator Complex, showing the LHC experiments CMS, ALICE, ATLAS and LHC-b]
ATLAS – A Toroidal LHC ApparatuS
LHC Parameters
- …to see the rarest events… LHC needs high luminosity of … [cm^-2 s^-1]: 3 x … p per beam at 7 TeV
- … to get 7 TeV operation… LHC needs 8.3 Tesla dipole fields with circumference of 27 km (16.5 miles)
- … to get 8.3 Tesla … LHC needs super-conducting magnets <2 K (-271°C) with an operational current of ≈13 kA, cooled in superfluid helium, maintained in a vacuum [11]
- … two orders of magnitude higher than others
- Stored energy per beam is 360 MJ
- Stored energy in the magnet circuits is 9 GJ
- A magnet will QUENCH with milliJoule deposited energy

[Table: Year, Peak Energy [TeV], Peak Intensity [p], Peak Luminosity [cm^-2 s^-1], Total Physics [yr^-1] [1,2,3,4]. Surviving fragments: LS 1-2, ≈6.5, ≈3 x …, ≈1 x …, 45 pb…, … fb^-1, >20 fb^-1]
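The stored beam energy and the beam intensity are linked by simple arithmetic: energy per beam = number of protons x energy per proton. A minimal sketch that works backwards from the 360 MJ and 7 TeV figures above; the function name is illustrative, and the only physical constant used is the exact SI value of the elementary charge.

```python
# Joules per electronvolt (numerically the elementary charge, exact in SI).
JOULES_PER_EV = 1.602176634e-19


def protons_per_beam(stored_energy_j: float, proton_energy_ev: float) -> float:
    """Number of protons needed to store `stored_energy_j` at `proton_energy_ev` each."""
    return stored_energy_j / (proton_energy_ev * JOULES_PER_EV)


if __name__ == "__main__":
    n = protons_per_beam(stored_energy_j=360e6, proton_energy_ev=7e12)
    print(f"{n:.2e} protons per beam")
```

The result is a few times 10^14 protons, which is consistent with the "3 x … p per beam" entry.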
Dependable Design Principles - a design flow
Systems… (lines of critical code)
- Beam Interlock System (BIS): a non-complex system… with many components… <1k lines
- Safe Machine Parameters (SMP): a complex system… with few components… >80k lines
- Function Generator Controller Lite (FGClite): a complex system… with many components… >>80k lines
Power Converter Types [4,5,6]
- Function Generator Controller (FGC): ≈1000 replaced with FGClite
[Figure: Power Converter]
Reliability Requirements
For > 1000 units…
- acceptable failure rate < 40 per year…
  - electrical: Mean Time Between Failures > … hours
  - radiation: SEE cross-section < 1 x …, giving > … hours
- equipment lifetime > 25 years…
  - electrical: design for 25 years
  - radiation: DD / TID > 200 Grays

Techniques
- Application of MIL-217 = predict electrical reliability
- Scientific testing and analysis = predict radiation cross-section and lifetime
- Working on a model to integrate radiation effects with electrical in ISOGRAPH
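The step from a fleet-level failure budget to a per-unit MTBF can be sketched as below. This is an illustration under stated assumptions, not the authors' calculation: it assumes continuous operation (8760 hours per year), a constant failure rate, and that the whole budget of 40 failures per year is spent on a single failure mechanism. The function name is hypothetical.

```python
HOURS_PER_YEAR = 8760.0  # 365 days, assuming continuous operation


def required_unit_mtbf_hours(units: int, max_failures_per_year: float) -> float:
    """Per-unit MTBF needed so that `units` devices stay within the failure budget."""
    fleet_hours_per_year = units * HOURS_PER_YEAR
    return fleet_hours_per_year / max_failures_per_year


if __name__ == "__main__":
    mtbf = required_unit_mtbf_hours(units=1000, max_failures_per_year=40)
    print(f"MTBF per unit > {mtbf:,.0f} hours")
```

Splitting the budget between electrical and radiation-induced failures would raise the MTBF demanded of each mechanism accordingly.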
FGClite Design Flow [7]
- Class 0 (C0): components known to be resistant, or easily replaced; the conceptual design is not influenced by these components. Resistors, capacitors, diodes, transistors…
- Class 1 (C1): components potentially susceptible to radiation, in less-critical parts of the system. Substitution of parts or mitigation of issues is possible with a re-design. Regulators, memory, level translators…
- Class 2 (C2): components potentially susceptible to radiation, in more-critical parts of the system. The conceptual design is compromised if these components do not perform well. Substitution of parts or mitigation of issues would be difficult. Precision ADC, FPGA…
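The three component classes can be read as a small decision rule. The sketch below is one possible reading of the class definitions, not the FGClite team's actual procedure; the three boolean inputs and all names are assumptions introduced for illustration.

```python
from enum import Enum


class RadClass(Enum):
    C0 = 0  # resistant or easily replaced; does not drive the conceptual design
    C1 = 1  # potentially susceptible, less-critical part; re-design is possible
    C2 = 2  # potentially susceptible, more-critical part; hard to substitute


def classify(known_resistant: bool, easily_replaced: bool, design_critical: bool) -> RadClass:
    """Assign a component class from three yes/no judgements (hypothetical inputs)."""
    if known_resistant or easily_replaced:
        return RadClass.C0
    return RadClass.C2 if design_critical else RadClass.C1


if __name__ == "__main__":
    # Example judgements only; real classification needs radiation test data.
    print(classify(known_resistant=True, easily_replaced=False, design_critical=False))
    print(classify(known_resistant=False, easily_replaced=False, design_critical=True))
```

Ordering the checks this way means testing effort concentrates on C2 parts first, since those are the ones that can compromise the conceptual design.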
Example HW Reliability Optimisation
Experiences Running LHC to Date
Availability Working Group
Physics Fill Abort Root Cause: … physics fills [9]

Lost Physics and Fault Time [9]
- … hours = 34 days = lost physics
- 1524 hours = 64 days = fault time

Machine Protection Faults [10]
- … systems, >250 faults, ≈36 failure modes, >360h repair time [figure labels: BLM, QPS]
- Failure modes very important for fault evolution
- Unrealistic to draw proper conclusions: raw data is not recorded consistently
2005 Predictions… [9]
false dumps: failure of system which leads to "fail-safe" premature abort

System   Predicted 2005   Observed 2010   Observed 2011   Observed 2012
LBDS     6.8 ± …          …               …               …
BIS      0.5 ± …          …               …               …
BLM      17.0 ± …         …               …               …
PIC      1.5 ± …          …               …               …
QPS      15.8 ± …         …               …               …
SIS      -                4               2               4

Reliability in line with expectations… (!!) despite the almost-witchcraft used to create the numbers…
But the failure modes are not the same.
Proposal - An LHC Fault Tracker
Visualisation of Events of 15th – 16th August 2012
- Sources: LHC "e-logbook" + TE-EPC Log + TE-MPE-COMS
- Views: TE-EPC view, TE-MPE view, OP view
- Impact on machine easier to infer
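The core of such a tracker is combining separately kept logs into one timeline. A minimal sketch of that step, assuming each log is already sorted by time; the `LogEntry` type, the `merge_logs` function and the example entries are all invented for illustration and do not describe the real logbook formats.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    time: datetime
    source: str  # which log the entry came from
    text: str


def merge_logs(*logs):
    """Merge several time-ordered logs into one time-ordered timeline."""
    return list(heapq.merge(*logs, key=lambda entry: entry.time))


if __name__ == "__main__":
    # Placeholder entries only: times and texts are invented, not real events.
    elogbook = [
        LogEntry(datetime(2012, 8, 15, 22, 0), "e-logbook", "placeholder A"),
        LogEntry(datetime(2012, 8, 16, 3, 0), "e-logbook", "placeholder C"),
    ]
    epc_log = [LogEntry(datetime(2012, 8, 16, 1, 0), "TE-EPC", "placeholder B")]
    for entry in merge_logs(elogbook, epc_log):
        print(entry.time.isoformat(), entry.source, entry.text)
```

Keeping the `source` field on every entry is what allows per-group views to be filtered back out of the shared timeline.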
Personal experience with the Beam Interlock System…
Blurred Lines at System Boundaries
Identify and account for dependencies:
- Services
- Infrastructure
- Controls
Not part of analysis… …failures attributed to?
- CERN Controls Standard Power PC: 8 out of 33 failed to date. Outside the analysis scope.
- Redundancy is more effective when it goes beyond the system boundary.
Consider dependability during installation:
- Connections between systems influence reliability
- Maintenance directly influences availability
- Reliability-Centred-Maintenance? Preventive Maintenance?
Examples:
- A.N. Other User System… Beam Interlock Controller: where do we start debugging?
- open racks… mystery of the missing 220V cable…
- 100kg of batteries in front of the spares cupboard… and no pallet lifter in sight…
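The point that redundancy is more effective beyond the system boundary can be illustrated with basic availability arithmetic. A minimal sketch, assuming independent failures; the availability values in the example are hypothetical, and the function names are illustrative.

```python
def parallel(a1: float, a2: float) -> float:
    """Availability of two independent redundant branches (either one suffices)."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)


def redundant_inside_boundary(a_unit: float, a_dependency: float) -> float:
    """Two redundant units that share ONE external dependency (e.g. one power feed)."""
    return a_dependency * parallel(a_unit, a_unit)


def redundant_beyond_boundary(a_unit: float, a_dependency: float) -> float:
    """Two redundant units, each with its OWN independent copy of the dependency."""
    branch = a_unit * a_dependency
    return parallel(branch, branch)


if __name__ == "__main__":
    # Hypothetical availabilities, chosen only to show the effect.
    inside = redundant_inside_boundary(a_unit=0.99, a_dependency=0.99)
    beyond = redundant_beyond_boundary(a_unit=0.99, a_dependency=0.99)
    print(f"shared dependency:     {inside:.6f}")
    print(f"duplicated dependency: {beyond:.6f}")
```

With a shared dependency, the overall availability can never exceed that of the dependency itself, however many redundant units sit behind it.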
Fin. Thank you!
References
[1] From the Chamonix Performance Workshop
[2] Extracted from …
[3] Extrapolated from W. Herr's talk: "Luminosity Performance Reach After LS1"
[4] Total Physics is from ATLAS
[5] Derived from …
[6] Photographs courtesy Y. Thurel et al, from: "LHC Power Converters the Proposed Approach"
[7] Figures and flow derived from work by Y. Thurel and S. Uznanski
[8] From M. Kwiatkowski's talk during SMP review at MPP
[9] B. Todd et al, "Review 2012 – Operational Availability & Efficiency"
[10] B. Todd et al, "Performance & Availability of MPS 2008 – 2012"
Hidden Faults
A worked example of potential dormant failure…
275 hardware inputs, 4 software inputs
- … (48%) never triggered
- 53 (19%) triggered once
- 564 (>50%) beam aborts from 12 inputs
- 564 (>50%) beam aborts from 7 systems:
  - 165 x Operator Buttons
  - 148 x Programmable Dump
  - 93 x BPM (IR6)
  - 49 x SIS
  - 45 x BLM (SR7)
  - 43 x RF
  - 21 x PIC (US15)
Testing & maintenance plan needed: periodically ensure function.
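An input that never triggers is never exercised, so a failure in it stays dormant until a test or a real demand reveals it. The effect of periodic testing can be sketched with the standard result for a dormant failure: with constant failure rate λ and test interval T, the mean fraction of time the function is unavailable is 1 - (1 - e^(-λT))/(λT), which is approximately λT/2 when λT is small. The sketch assumes perfect tests and immediate repair; the failure rate used in the example is hypothetical.

```python
import math


def mean_dormant_unavailability(failure_rate_per_hour: float, test_interval_hours: float) -> float:
    """Mean unavailability of a dormant function that is tested every `test_interval_hours`."""
    x = failure_rate_per_hour * test_interval_hours
    if x == 0.0:
        return 0.0
    # 1 - (1 - exp(-x)) / x, written with expm1 for accuracy at small x.
    return 1.0 + math.expm1(-x) / x


if __name__ == "__main__":
    rate = 1e-6  # hypothetical failure rate [1/h], not a measured BIS value
    for interval in (24.0, 720.0, 8760.0):
        u = mean_dormant_unavailability(rate, interval)
        print(f"test every {interval:>6.0f} h -> mean unavailability {u:.2e}")
```

Halving the test interval roughly halves the time a hidden fault can sit undetected, which is the quantitative case for a testing and maintenance plan.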
Software versus Programmable Logic
… – 2012 BIS reliability: not enough data yet