B. Todd, A. Apollonio, S. Gabourin, S. Uznanski. Principles and Experience in the Design & Operation of Dependable Systems (1v2).

1 B. Todd, A. Apollonio, S. Gabourin, S. Uznanski. Principles and Experience in the Design & Operation of Dependable Systems (1v2)

2 CERN benjamin.todd@cern.ch
1. CERN and the LHC
2. Dependable Design Principles
3. Experiences to date
- Dependable systems are the result of good engineering practices
- Good engineers = good systems
- Failure modes are just as important as rates
- Watch out for the dependencies
- System specifications need dependability requirements

3 CERN: European Centre for Nuclear Research (Conseil Européen pour la Recherche Nucléaire)
Founded in 1954. Funded by the European Union.
- 20 Member States (…most of the EU…)
- 8 Observer States and Organisations (…Japan, Russia, USA…)
- 35 Non-Member States (…Australia, Canada, New Zealand…)
- 580 Institutes World Wide
- 2500 Staff
- 8000 Visiting Scientists
Pure Science – Particle Physics:
1. Pushing the boundaries of research, physics beyond the Standard Model.
2. Advancing frontiers of technology.
3. Forming collaborations through science.
4. Educating the scientists and engineers of tomorrow.

4 CERN uses particle accelerators and detectors to study the basic constituents of matter. Accelerators boost beams of particles to high energies before they are made to collide with each other or with stationary targets. Detectors observe and record the results of these collisions. Our flagship project is the Large Hadron Collider…

5 CERN Accelerator Complex: Lake Geneva, Geneva Airport, CERN LAB 1 (Switzerland), CERN LAB 2 (France)

6 CERN Accelerator Complex: Lake Geneva, Geneva Airport, CERN LAB 1 (Switzerland), CERN LAB 2 (France), Proton Synchrotron (PS), Super Proton Synchrotron (SPS), Large Hadron Collider (LHC): 27 km long, 150 m underground

7 CERN Accelerator Complex: Lake Geneva, Geneva Airport, CERN LAB 1 (Switzerland), CERN LAB 2 (France), Proton Synchrotron (PS), Super Proton Synchrotron (SPS), Large Hadron Collider (LHC)

8 CERN, the LHC and Machine Protection. CERN Accelerator Complex: Large Hadron Collider (LHC), Beam-1 Transfer Line (TI2), Beam-2 Transfer Line (TI8), Beam Dumping Systems, Super Proton Synchrotron (SPS). 100 µs for one turn.

9 CERN Accelerator Complex: CMS, ALICE, ATLAS, LHC-b

10 ATLAS – A Toroidal LHC ApparatuS

11 ATLAS – A Toroidal LHC ApparatuS

12 LHC Parameters
- …to see the rarest events… LHC needs high luminosity of 10^34 [cm^-2 s^-1]: 3 x 10^14 p per beam at 7 TeV
- …to get 7 TeV operation… 8.3 Tesla dipole fields with circumference of 27 km (16.5 miles)
- …to get 8.3 Tesla… LHC needs super-conducting magnets <2 K (-271°C) with an operational current of ≈13 kA, cooled in superfluid helium, maintained in a vacuum [11]
- Stored energy in the magnet circuits is 9 GJ
- Stored energy per beam is 360 MJ …two orders of magnitude higher than others
- A magnet will QUENCH with milliJoule deposited energy

Year   | Peak Energy [TeV] | Peak Intensity [p] | Peak Luminosity [cm^-2 s^-1] | Total Physics [yr^-1]
2010   | 3.5               | 4 x 10^13          | 2.0 x 10^32                  | 45 pb^-1
2011   | 3.5               | 2.0 x 10^14        | 3.6 x 10^33                  | 5.3 fb^-1
2012   | 4                 | 2.2 x 10^14        | 7.7 x 10^33                  | 21.7 fb^-1
LS 1-2 | ≈6.5              | ≈3 x 10^14         | ≈1 x 10^34                   | >20 fb^-1
[1,2,3,4]

13 Dependable Design Principles - a design flow

14 Systems… a non-complex system… with many components… <1k lines… a complex system… with few components… Safe Machine Parameters (SMP)… >80k lines… a complex system… with many components… Beam Interlock System (BIS)… Function Generator Controller Lite (FGClite)… >>80k lines… Critical code

15 Power Converter Types [4,5]

16 Power Converter Types: Function Generator Controller (FGC), ≈1000 replaced with FGClite [5,6]

17 Power Converter

18 Reliability Requirements
For > 1000 units…
- acceptable failure rate < 40 per year… Mean Time Between Failures > 200000 hours
  - electrical: > 300000 hours
  - SEE radiation: cross-section < 1 x 10^-12
- equipment lifetime > 25 years…
  - electrical: design for 25 years
  - DD / TID radiation: > 200 Grays
Techniques such as application of MIL-217 = predict electrical reliability.
Scientific testing and analysis = predict radiation cross-section and lifetime.
Working on a model to integrate radiation effects with electrical in ISOGRAPH.
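
The fleet-level figures on the Reliability Requirements slide are linked by simple arithmetic: more than 1000 units with fewer than 40 failures per year implies a per-unit MTBF a little above 200000 hours. A minimal sketch of that calculation, assuming continuous operation (8760 hours per year) and a constant failure rate; the function name is illustrative and does not come from the slides:

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h, assumes units run continuously


def required_mtbf_hours(units: int, max_failures_per_year: float) -> float:
    """Minimum per-unit MTBF that keeps the whole fleet within its failure budget."""
    fleet_hours_per_year = units * HOURS_PER_YEAR
    return fleet_hours_per_year / max_failures_per_year


mtbf = required_mtbf_hours(units=1000, max_failures_per_year=40)
print(f"required MTBF > {mtbf:.0f} hours")  # 219000 hours, consistent with "> 200000 hours"
```

Under these assumptions the budget of 40 failures per year is the binding number; the 200000-hour figure on the slide is the same requirement expressed per unit and rounded down.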

19 FGClite Design Flow

20 FGClite Design Flow
- Class 0 (C0): components known to be resistant, or easily replaced; conceptual design not influenced by these components. Resistors, capacitors, diodes, transistors…
- Class 1 (C1): components potentially susceptible to radiation, in less-critical parts of the system. Substitution of parts or mitigation of issues is possible with a re-design. Regulators, memory, level translators…
- Class 2 (C2): components potentially susceptible to radiation, in more-critical parts of the system. The conceptual design is compromised if these components do not perform well. Substitution of parts or mitigation of issues would be difficult. Precision ADC, FPGA…
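
The three component classes on this slide can be read as a small decision rule. The sketch below is one possible encoding, not the FGClite team's tooling: the names `RadClass`, `classify` and `EXAMPLES` are hypothetical, and the assignment of example parts to classes assumes the slide lists them in class order.

```python
from enum import Enum


class RadClass(Enum):
    C0 = "Class 0"  # known resistant, or easily replaced
    C1 = "Class 1"  # potentially susceptible, less-critical part of the system
    C2 = "Class 2"  # potentially susceptible, more-critical part of the system


def classify(resistant_or_easily_replaced: bool, in_critical_part: bool) -> RadClass:
    """Apply the slide's wording as a two-question decision rule."""
    if resistant_or_easily_replaced:
        return RadClass.C0
    return RadClass.C2 if in_critical_part else RadClass.C1


# Example part types named on the slide (class assignment is an assumption)
EXAMPLES = {
    RadClass.C0: ["resistor", "capacitor", "diode", "transistor"],
    RadClass.C1: ["regulator", "memory", "level translator"],
    RadClass.C2: ["precision ADC", "FPGA"],
}

print(classify(resistant_or_easily_replaced=False, in_critical_part=True).value)
```

The point of the classification is effort allocation: Class 2 parts are the ones whose poor performance would compromise the conceptual design.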

21 FGClite Design Flow

22 FGClite Design Flow

23 FGClite Design Flow

24 FGClite Design Flow [7]

25 FGClite Design Flow [7]

26 Example HW Reliability Optimisation

27 Experiences Running LHC to Date: Availability Working Group

28 Physics Fill Abort Root Cause 2012: 585 physics fills [9]

29 Lost Physics and Fault Time 2012: 812 hours = 34 days = lost physics; 1524 hours = 64 days = fault time [9]

30 Machine Protection Faults 2012: 7 systems, >250 faults, ≈36 failure modes, >360 h repair time. BLM, QPS. Failure modes are very important for fault evolution. Unrealistic to draw proper conclusions: raw data is not recorded consistently [10]

31 2005 Predictions…
False dumps: failure of system which leads to “fail-safe” premature abort

System | Predicted 2005 | Observed 2010 | Observed 2011 | Observed 2012
LBDS   | 6.8 ± 3.6      | 9             | 11            | 4
BIS    | 0.5 ± 0.5      | 2             | 1             | 0
BLM    | 17.0 ± 4.0     | 0             | 4             | 15
PIC    | 1.5 ± 1.2      | 2             | 5             | 0
QPS    | 15.8 ± 3.9     | 24            | 48            | 56
SIS    | -              | 4             | 2             | 4
[9]

Reliability in line with expectations… (!!) despite the almost-witchcraft used to create the numbers… But the failure modes are not the same.
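
One simple way to compare the 2005 predictions with the observed false dumps is to check whether each yearly count falls inside the predicted mean ± uncertainty. This is an illustrative check, not the method used in [9]. The observed digits are fused together in the transcript, so the split into 2010, 2011 and 2012 values used below is an assumption.

```python
# Predicted false dumps per year from the 2005 study: (mean, uncertainty)
PREDICTED = {
    "LBDS": (6.8, 3.6),
    "BIS": (0.5, 0.5),
    "BLM": (17.0, 4.0),
    "PIC": (1.5, 1.2),
    "QPS": (15.8, 3.9),
}

# Observed false dumps for 2010, 2011, 2012 (digit split is an assumption)
OBSERVED = {
    "LBDS": [9, 11, 4],
    "BIS": [2, 1, 0],
    "BLM": [0, 4, 15],
    "PIC": [2, 5, 0],
    "QPS": [24, 48, 56],
}


def within_band(system: str) -> list[bool]:
    """For each year, is the observed count inside mean +/- uncertainty?"""
    mean, sigma = PREDICTED[system]
    return [mean - sigma <= n <= mean + sigma for n in OBSERVED[system]]


for system in PREDICTED:
    print(system, within_band(system))
```

SIS is left out because the slide gives no prediction for it. A count-only comparison like this says nothing about failure modes, which is the slide's closing caveat.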

33 Proposal - An LHC Fault Tracker: Visualisation of Events of 15th – 16th August 2012

34 Proposal - An LHC Fault Tracker: Visualisation of Events of 15th – 16th August 2012

35 Proposal - An LHC Fault Tracker: Visualisation of Events of 15th – 16th August 2012

36 Proposal - An LHC Fault Tracker: Visualisation of Events of 15th – 16th August 2012

37 Proposal - An LHC Fault Tracker: Visualisation of Events of 15th – 16th August 2012. LHC “e-logbook” + TE-EPC Log + TE-MPE-COMS. TE-EPC view, TE-MPE view, OP view. Impact on machine easier to infer.

38 Personal experience with the Beam Interlock System…

39 Blurred Lines at System Boundaries. Identify and account for dependencies: Services, Infrastructure, Controls. Not part of analysis… …failures attributed to?

40 Blurred Lines at System Boundaries. Identify and account for dependencies: Services, Infrastructure, Controls. Not part of analysis… …failures attributed to? CERN Controls Standard Power PC: 8 out of 33 failed to date. Outside the analysis scope.

41 Blurred Lines at System Boundaries. Identify and account for dependencies: Services, Infrastructure, Controls. Not part of analysis… …failures attributed to? Redundancy is more effective when it goes beyond the system boundary.
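
The remark about redundancy and system boundaries can be illustrated with a textbook series/parallel reliability model. This is a sketch under stated assumptions, not a calculation from the slides: failures are taken as independent, and the reliabilities `r` (one channel) and `d` (one external dependency such as a power supply or controls crate) are hypothetical numbers.

```python
def shared_dependency(r: float, d: float) -> float:
    """Two redundant channels that both rely on ONE external dependency."""
    redundant_pair = 1 - (1 - r) ** 2
    return d * redundant_pair


def independent_dependencies(r: float, d: float) -> float:
    """Two redundant channels, each with ITS OWN copy of the dependency."""
    one_branch = r * d
    return 1 - (1 - one_branch) ** 2


r, d = 0.99, 0.95  # hypothetical reliabilities, for illustration only
print(f"shared dependency:        {shared_dependency(r, d):.4f}")
print(f"independent dependencies: {independent_dependencies(r, d):.4f}")
```

With a shared dependency the overall figure can never exceed `d`, however good the redundant channels are; duplicating the dependency as well removes that ceiling.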

42 Blurred Lines at System Boundaries. Identify and account for dependencies: Services, Infrastructure, Controls. Not part of analysis… …failures attributed to? Consider dependability during installation: connections between systems influence reliability; maintenance directly influences availability. Reliability-Centred-Maintenance? Preventive Maintenance?

43 Blurred Lines at System Boundaries. Identify and account for dependencies: Services, Infrastructure, Controls. Not part of analysis… …failures attributed to? Consider dependability during installation: connections between systems influence reliability; maintenance directly influences availability. Reliability-Centred-Maintenance? Preventive Maintenance? A.N. Other User System… Where do we start debugging? Beam Interlock Controller

44 Blurred Lines at System Boundaries. Identify and account for dependencies: Services, Infrastructure, Controls. Not part of analysis… …failures attributed to? Consider dependability during installation: connections between systems influence reliability; maintenance directly influences availability. Reliability-Centred-Maintenance? Preventive Maintenance? Open racks… mystery of the missing 220 V cable…

45 Blurred Lines at System Boundaries. Identify and account for dependencies: Services, Infrastructure, Controls. Not part of analysis… …failures attributed to? Consider dependability during installation: connections between systems influence reliability; maintenance directly influences availability. Reliability-Centred-Maintenance? Preventive Maintenance? 100 kg of batteries in front of the spares cupboard… and no pallet lifter in sight…

46 Blurred Lines at System Boundaries. Identify and account for dependencies: Services, Infrastructure, Controls. Not part of analysis… …failures attributed to? Consider dependability during installation: connections between systems influence reliability; maintenance directly influences availability. Reliability-Centred-Maintenance? Preventive Maintenance?

47
1. CERN and the LHC
2. Dependable Design Principles
3. Experiences to date
- Dependable systems are the result of good engineering practices
- Good engineers = good systems
- Failure modes are just as important as rates
- Watch out for the dependencies
- System specifications need dependability requirements

48 Fin. Thank you!

49 References
[1] From the Chamonix Performance Workshop 2011 http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=103957
[2] Extracted from http://lhc-statistics.web.cern.ch/LHC-Statistics/index.php
[3] Extrapolated from W. Herr’s talk: “Luminosity Performance Reach After LS1”
[4] Total Physics is from ATLAS https://twiki.cern.ch/twiki/bin/view/AtlasPublic/LuminosityPublicResults
[5] Derived from http://cdsweb.cern.ch/record/1123729/files/LHC-PROJECT-REPORT-1133.pdf?version=1
[6] Photographs courtesy Y. Thurel et al, from: “LHC Power Converters the Proposed Approach”
[7] Figures and flow derived from work by Y. Thurel and S. Uznanski
[8] From M. Kwiatkowski’s talk during SMP review at MPP http://lhc-mpwg.web.cern.ch/lhc-mpwg/MPP-Meetings/No40-04-03-2011/SMP_at_MPP_1v1.pptx
[9] B. Todd et al, “Review 2012 – Operational Availability & Efficiency” https://indico.cern.ch/internalPage.py?pageId=0&confId=211614
[10] B. Todd et al, “Performance & Availability of MPS 2008 – 2012” https://indico.cern.ch/conferenceOtherViews.py?confId=227895

50 Hidden Faults: A worked example of potential dormant failure…

51 Hidden Faults: 275 hardware inputs, 4 software inputs

52 Hidden Faults
275 hardware inputs, 4 software inputs
- 136 (48%) never triggered
- 53 (19%) triggered once
- 564 (>50%) beam aborts from 12 inputs
564 (>50%) beam aborts from 7 systems:
- 165 x Operator Buttons
- 148 x Programmable Dump
- 93 x BPM (IR6)
- 49 x SIS
- 45 x BLM (SR7)
- 43 x RF
- 21 x PIC (US15)
Testing & maintenance plan needed - periodically ensure function.
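
The counts on the Hidden Faults slide can be cross-checked with a few lines of arithmetic. The sketch below assumes the percentages are taken relative to all 279 inputs (275 hardware plus 4 software); the slide does not state the denominator, and the variable names are illustrative.

```python
# Beam aborts per source, as listed on the slide
ABORTS = {
    "Operator Buttons": 165,
    "Programmable Dump": 148,
    "BPM (IR6)": 93,
    "SIS": 49,
    "BLM (SR7)": 45,
    "RF": 43,
    "PIC (US15)": 21,
}

TOTAL_INPUTS = 275 + 4  # hardware + software inputs (assumed denominator)
NEVER_TRIGGERED = 136
TRIGGERED_ONCE = 53

total_aborts = sum(ABORTS.values())
print(f"aborts from the listed sources: {total_aborts}")  # 564, matching the slide
print(f"never triggered: {NEVER_TRIGGERED / TOTAL_INPUTS:.1%} of inputs")
print(f"triggered once:  {TRIGGERED_ONCE / TOTAL_INPUTS:.1%} of inputs")
```

Roughly two thirds of the inputs were exercised at most once in operation, which is why the slide calls for a periodic testing and maintenance plan: an input that never fires cannot reveal a dormant failure on its own.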

53 Software versus Programmable Logic

54 2006 – 2012 BIS reliability: not enough data yet

