TRACKING OF FAULTS AND FOLLOW-UP
Accelerator Fault Tracking project
Jakub Janczyk (TE-MPE-PE / BE-CO-DS)
with input from: Andrea Apollonio, Chris Roderick, Rudiger Schmidt, Benjamin Todd, Daniel Wollmann
Agenda
- Purpose of fault tracking
- What has been done in the past
- Accelerator Fault Tracking project - plans & status
- Summary
R2E/Availability Workshop, 10/14/2014
Purpose of fault tracking
Complete and consistent tracking makes it possible to:
- Identify problems as early as possible, allowing timely mitigation
- Identify key issues that will limit the performance of accelerators or equipment in the future (Run2, Run3, HL-LHC)
- Increase availability, in both the short and long term, by dealing with issues as soon as possible
Track faults in two areas:
1. Faults directly affecting accelerator operation - identify root causes (e.g. R2E effects, glitches in the electrical network, etc.)
2. Equipment (electronic) faults, independently of their immediate impact on accelerator operation
What has been done in the past
- Many different tools for logging faults, used by different teams: eLogbook, Post-Mortem, RadWG page, tools in equipment groups (JIRA, Excel, OneNote, eLogbook)
- Individual teams and working groups had to invest a lot of effort to gather and exploit fault data
- Nevertheless, it was difficult to get a consistent picture
Credit M. Brugger
Cardiogram - "life" of the LHC from an operational point of view
- Graphical analytic tool for combining data from different sources
- Initially created by members of the Availability WG: B. Todd, L. Ponce, A. Apollonio
- Gathering and preparing all the necessary data is tedious work: several months for a cardiogram
Cardiogram - example
[Figure: example cardiogram. Labelled rows: Accelerator Mode (Proton Physics, Ion Physics, etc.), Access, Fill Number, Particle Momentum, Beam Intensities, Stable Beams, PM Beam Dump, Beam Dump Classification, Fault, Fault Lines (Systems / Fault Classifications). Credit: AWG]
Cardiogram - data preparation
[Figure. Credit: Benjamin Todd]
Accelerator Fault Tracking project
Project launched in February 2014 (BE/CO, BE/OP, TE/MPE collaboration)
Based on initial inputs from:
- Evian Workshops
- Availability Working Group
- Workshop on Machine Availability & Dependability for Post-LS1 LHC
- BE/OP
Goals:
- Capture consistent and complete fault data
- Facilitate fault tracking from the perspective of all interested parties (OP, equipment groups, working groups)
- Single source of data: easier to complete, clean and analyse
- Provide consistent, standardized statistics, analyses and reports for different users (8:30 meetings, weekly reports / summaries)
- Interactive overview of faults (cardiogram on demand)
- Proactively identify incomplete data
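The goal of proactively identifying incomplete data can be illustrated with a minimal sketch. It is not the AFT implementation: the record layout, the field names in REQUIRED_FIELDS and the sample values are all assumptions made for illustration.

```python
# Hypothetical list of fields a fault record needs before it counts as complete.
REQUIRED_FIELDS = ("system", "start", "end", "root_cause")


def missing_fields(record):
    """Return the names of required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


# Invented sample records, for illustration only.
records = [
    {"id": 1, "system": "QPS", "start": "2012-05-01T08:00",
     "end": "2012-05-01T12:00", "root_cause": "R2E"},
    {"id": 2, "system": "Cryogenics", "start": "2012-05-02T14:00",
     "end": None, "root_cause": ""},
]

# Map each incomplete record id to the fields that still need to be filled in.
incomplete = {r["id"]: missing_fields(r) for r in records if missing_fields(r)}
print(incomplete)
```

A report of this kind could be sent to the people responsible for completing the records, rather than leaving gaps to be discovered during analysis.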
Plans (as presented by Chris, LMC)
Provide infrastructure to consistently and coherently capture, persist and make available accelerator fault data for further analysis.
Foreseen project stages:
1. Put in place a fault tracking infrastructure to capture LHC fault data from an operational perspective
   - Enable data exploitation by others (e.g. AWG and OP) to identify areas to improve accelerator availability for physics
   - Ready before LHC beam commissioning
   - Infrastructure should already support capture of equipment group fault data, but this is not the primary focus
2. Focus on equipment group fault data capture
3. Explore integration with other CERN data management systems (e.g. Infor EAM)
   - Potential to perform deeper analyses of system and equipment availability
   - In turn, start predicting and improving dependability
To support data analysis, the AFT data extraction infrastructure should also provide data complementary to the actual fault data, such as accelerator operational modes and states.
Scope: initial focus on LHC, but the aim is to provide a generic infrastructure capable of handling fault data of any CERN accelerator.
[Timeline figure: "We are here..." marker on a time axis]
Status
- AFT is under development: a Web application, available to different users, with eLogbook integration for LHC operators
- Functionality available from day 1 will be as planned for the first stage of the project
- AFT test version available
- We are open to starting discussions with equipment groups
Turnaround Time
Summary
- Consistent and complete tracking of faults is the key to identifying and efficiently mitigating issues
- The AFT will ease the recording of faults and their root causes in a complete and consistent way
- Run2 data will be essential to identify future performance and availability limitations towards HL-LHC
- The quality and completeness of the data require effort from all involved parties
- Open to discussing the integration of equipment group data
Questions
Extra Slides
Roles and simplified workflow
Multiple failures
- It is easy to see when multiple failures occur at the same time, but it is not obvious whether they are related.
- One of the goals of the AFT project is to capture data that makes it possible to show the relations between faults.
[Figure: two cardiogram excerpts. Related faults: a water leak and the problems it caused. Unrelated faults: QPS failed, and the rest are accesses in its shadow.]
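One way to capture relations between faults is to let each fault record point to the fault that caused it. The sketch below illustrates that idea only and does not describe the AFT data model: the Fault class, the caused_by field and the sample systems are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fault:
    """Hypothetical fault record; caused_by holds the id of the parent fault."""
    fault_id: int
    system: str
    caused_by: Optional[int] = None


def root_of(fault_id, faults_by_id):
    """Follow caused_by links up to the fault that has no parent."""
    seen = set()  # guards against accidental cycles in the links
    current = faults_by_id[fault_id]
    while current.caused_by is not None and current.fault_id not in seen:
        seen.add(current.fault_id)
        current = faults_by_id[current.caused_by]
    return current.fault_id


def group_by_root(faults):
    """Group fault systems under the id of their root-cause fault."""
    by_id = {f.fault_id: f for f in faults}
    groups = {}
    for f in faults:
        groups.setdefault(root_of(f.fault_id, by_id), []).append(f.system)
    return groups


# Invented sample data: a water leak with two consequences, plus an unrelated QPS fault.
faults = [
    Fault(1, "Water leak"),
    Fault(2, "Power converter", caused_by=1),
    Fault(3, "Cryogenics", caused_by=2),
    Fault(4, "QPS"),
]
groups = group_by_root(faults)
print(groups)
```

With explicit links, faults that merely coincide in time (the QPS fault here) stay separate from faults that share a root cause.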
Access without faults
- In 2012, there were around 40 accesses without any fault
- The reasons for these accesses are not classified, but often something is repaired
- Inconsistent data: the cardiogram makes this easy to spot
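The inconsistency the cardiogram reveals visually, accesses with no concurrent fault, can also be checked with a simple interval-overlap test. This is a sketch under assumptions: the Interval type, its field names and the sample timestamps are invented for illustration and do not come from the AFT or the cardiogram tooling.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Interval:
    """A labelled time span; all field names here are hypothetical."""
    label: str
    start: datetime
    end: datetime


def overlaps(a, b):
    """True if the two half-open intervals [start, end) share any time."""
    return a.start < b.end and b.start < a.end


def accesses_without_fault(accesses, faults):
    """Return the accesses that do not overlap any recorded fault."""
    return [a for a in accesses if not any(overlaps(a, f) for f in faults)]


# Invented sample data, for illustration only.
faults = [Interval("QPS", datetime(2012, 5, 1, 8), datetime(2012, 5, 1, 12))]
accesses = [
    Interval("access in shadow of QPS fault",
             datetime(2012, 5, 1, 9), datetime(2012, 5, 1, 10)),
    Interval("access with no fault recorded",
             datetime(2012, 5, 2, 14), datetime(2012, 5, 2, 16)),
]
unexplained = accesses_without_fault(accesses, faults)
print([a.label for a in unexplained])
```

Each access returned by such a check is a candidate for follow-up: either a fault was not recorded, or the reason for the access needs its own classification.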
Access without faults - examples
[Figure: cardiogram excerpts with annotations]
- A few accesses: ATLAS; change of PC; repair of QPS; intervention on the crates of the BPMD; LHCb - fixing muon detectors
- Accesses in the shadow of a QPS failure: QPS - reset cards; ALICE and CMS; Cryogenics - valve regulation; RF - replacing a broken attenuator
- ATLAS access