Data quality, or how to keep afloat in the growing data flood P. Sollander, CERN 16/4/2013 ARW2013
Outline: control system architecture and data flows/floods; data quality problems and consequences (false negatives, unknown state, false positives); system and software strategies; processes and procedures; summary.
Control system architecture. [Diagram labels: logging, middleware (~1M / day), application server, DAQ (100+).] What could possibly go wrong?
Control system architecture: what could possibly go wrong? [Diagram: the same architecture (logging, middleware ~1M / day, DAQ 100+) with possible failure points marked ✗.]
False negatives: no alarm ≠ no problem. 11/1/11: big power cut at LHC. No network, so no alarms; no problem? Broken PLC-SCADA connection: monitoring looks OK, the operator is confident, and hours are spent looking elsewhere. April 3 2013: inundation alarm on LHC P5. Pumps stopped, but no alarm; the PLC-to-SCADA connection was not monitored. False negatives must be minimized; zero is impossible?
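The point that silence on a link is not the same as a healthy system can be illustrated with a simple heartbeat watchdog. This is only a sketch under stated assumptions: the class name `ConnectionWatchdog`, the link name, and the timeout value are hypothetical and are not taken from the CERN system.

```python
class ConnectionWatchdog:
    """Hypothetical supervisor that treats prolonged silence on a link as a fault."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # link name -> time of last heartbeat

    def register(self, link, now):
        # Start supervising a link; silence is counted from registration.
        self.last_seen[link] = now

    def heartbeat(self, link, now):
        # Called whenever any message (data or keep-alive) arrives on the link.
        self.last_seen[link] = now

    def stale_links(self, now):
        # Links that have been silent for longer than the timeout.
        return sorted(
            link for link, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


watchdog = ConnectionWatchdog(timeout_s=10.0)
watchdog.register("plc-to-scada", now=0.0)   # assumed link name
watchdog.heartbeat("plc-to-scada", now=5.0)
print(watchdog.stale_links(now=12.0))        # link still considered alive
print(watchdog.stale_links(now=16.0))        # silent for 11 s: reported as stale
```

A stale link would then raise its own alarm, so that "no alarms" from the equipment behind it is not read as "no problem".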
Monitoring the system. Each data tag carries a value, a timestamp and a quality. [Diagram labels: middleware (~1M / day), DAQ (100+).]
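The data tag described on this slide (value, timestamp, quality) could be modelled roughly as follows. The field names follow the slide; the `Quality` enum, the `name` field, and the `invalidated` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Quality(Enum):
    OK = "OK"
    NOK = "NOK"


@dataclass(frozen=True)
class DataTag:
    name: str            # assumed identifier, not shown on the slide
    value: object
    timestamp: float
    quality: Quality = Quality.OK

    def invalidated(self, timestamp):
        # Keep the last known value but flag it as doubtful.
        return replace(self, timestamp=timestamp, quality=Quality.NOK)


tag = DataTag(name="valve.state", value="closed", timestamp=100.0)
doubtful = tag.invalidated(timestamp=105.0)
print(doubtful)
```

Keeping the last value while changing only the quality lets every consumer decide for itself how much to trust the reading.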
Indicating quality on alarms: active alarms get a [?] prefix; a new alarm is raised on the faulty controls component; Help Alarm.
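A minimal sketch of the alarm-list behaviour described here: alarms whose source is doubtful get a `[?]` prefix, and the faulty controls component gets an alarm of its own. The function name, the data layout, and the wording of the controls alarm are hypothetical.

```python
def render_alarm_list(active_alarms, faulty_sources):
    """active_alarms: list of (source, text) pairs.
    faulty_sources: set of sources whose data quality is currently bad."""
    lines = []
    for source, text in active_alarms:
        # Mark alarms that can no longer be trusted instead of hiding them.
        prefix = "[?] " if source in faulty_sources else ""
        lines.append(prefix + text)
    for source in sorted(faulty_sources):
        # Separate alarm pointing at the controls component itself.
        lines.append("Controls fault: no valid data from " + source)
    return lines


alarms = [("plc1", "Pump stopped"), ("plc2", "Door open")]
for line in render_alarm_list(alarms, faulty_sources={"plc1"}):
    print(line)
```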
Indicating quality on synoptics
Indicating quality on applications
Acting on bad quality data: indicate it to the operator. What about other applications using the data, software interlocks for example?
Panicky software interlocks: LHC beam dump. On reboot of an element, a data tag (value: closed, quality: OK) reaches the Software Interlock System through the middleware as (value: closed?, quality: NOK). Software Interlock System tolerance for doubtful data: reduce false positives by waiting a reasonable amount of time before taking action.
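The tolerance for doubtful data can be sketched as a grace period: bad quality alone does not trigger the interlock unless it persists. This is an illustrative sketch, not the actual Software Interlock System logic; the class name, method signature, and grace period are assumptions.

```python
class TolerantInterlock:
    """Hypothetical interlock that waits before acting on doubtful data."""

    def __init__(self, grace_s):
        self.grace_s = grace_s
        self.bad_since = None  # time at which quality first went bad

    def should_trip(self, value_ok, quality_ok, now):
        if quality_ok:
            # Trustworthy data: forget any earlier doubt, act on the value.
            self.bad_since = None
            return not value_ok
        if self.bad_since is None:
            self.bad_since = now
        # Doubtful data: trip only once it has stayed bad for the grace period.
        return now - self.bad_since >= self.grace_s


interlock = TolerantInterlock(grace_s=5.0)
print(interlock.should_trip(value_ok=True, quality_ok=False, now=1.0))  # reboot starts
print(interlock.should_trip(value_ok=True, quality_ok=True, now=4.0))   # element is back
```

The length of the grace period is the design trade-off: too short and a routine reboot dumps the beam, too long and a real loss of supervision goes unprotected.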
False positives: false alarms. Only 1% of technical infrastructure alarms are real! It is easy to miss an important one. 24/1/2007: constant false alarms mask one real alarm; a 400 kV breaker trips, and it takes 7 hours to switch everything back.
Software strategies: Software Interlock System tolerance for doubtful data (reduce false positives by waiting a reasonable amount of time before taking action); add indications of bad quality, [?] and color.
Operator strategies: wait to see if the alarm stays? Check the trend: a poor reading gives a brief 0 reading. Diagnose with good tools; it is worth investing in good tools. 1% real alarms for CERN’s technical infrastructure.
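Checking the trend for a brief 0 reading is something a diagnostic tool could automate. The sketch below flags short runs of zeros that are surrounded by non-zero samples; the function name, the `max_len` parameter, and the sample values are assumptions for illustration.

```python
def brief_zero_dropouts(samples, max_len=1):
    """Return indices of zero readings that form a run of at most max_len
    samples with non-zero readings on both sides."""
    dropouts = []
    i, n = 0, len(samples)
    while i < n:
        if samples[i] == 0:
            j = i
            while j < n and samples[j] == 0:
                j += 1
            # A run touching either end of the trend is not judged.
            bounded = i > 0 and j < n
            if bounded and (j - i) <= max_len:
                dropouts.extend(range(i, j))
            i = j
        else:
            i += 1
    return dropouts


trend = [20.1, 20.3, 0, 20.2, 20.4]   # made-up readings
print(brief_zero_dropouts(trend))
```

A sustained run of zeros is deliberately not flagged, since that pattern is more likely to be a real event than a poor reading.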
Processes to improve quality. Alarm and data configuration process: every alarm is checked by operation; long and tedious, but we cannot work without it. Test procedures; correction procedures; operating instructions; HelpAlarm; diagnostic tools in the system.
Data integration process: create request (equipment group); system check (computerized); data check (operators); data validation and tests (equipment group and operators).
Summary: CERN technical infrastructure system is huge, a million alarms per year! The control system is event based. False negatives: reduced by thorough monitoring of the system itself and by diagnostic tools. False positives: reduced mainly by procedure (strict integration rules, testing, correction, etc.).