Hannes Sakulin, CERN/EP on behalf of the CMS DAQ group

Slides:



Advertisements
Similar presentations
CWG10 Control, Configuration and Monitoring Status and plans for Control, Configuration and Monitoring 16 December 2014 ALICE O 2 Asian Workshop
Advertisements

Clara Gaspar on behalf of the LHCb Collaboration, “Physics at the LHC and Beyond”, Quy Nhon, Vietnam, August 2014 Challenges and lessons learnt LHCb Operations.
André Augustinus ALICE Detector Control System  ALICE DCS is responsible for safe, stable and efficient operation of the experiment  Central monitoring.
Clara Gaspar, May 2010 The LHCb Run Control System An Integrated and Homogeneous Control System.
First operational experience with the CMS Run Control System Hannes Sakulin, CERN/PH on behalf of the CMS DAQ group 17 th IEEE Real Time Conference,
Data Quality Monitoring of the CMS Tracker
Test Of Distributed Data Quality Monitoring Of CMS Tracker Dataset H->ZZ->2e2mu with PileUp - 10,000 events ( ~ 50,000 hits for events) The monitoring.
Clara Gaspar, October 2011 The LHCb Experiment Control System: On the path to full automation.
1 Alice DAQ Configuration DB
André Augustinus 10 September 2001 DCS Architecture Issues Food for thoughts and discussion.
The huge amount of resources available in the Grids, and the necessity to have the most up-to-date experimental software deployed in all the sites within.
Control in ATLAS TDAQ Dietrich Liko on behalf of the ATLAS TDAQ Group.
ALICE, ATLAS, CMS & LHCb joint workshop on
André Augustinus 21 June 2004 DCS Workshop Detector DCS overview Status and Progress.
A.Golunov, “Remote operational center for CMS in JINR ”, XXIII International Symposium on Nuclear Electronics and Computing, BULGARIA, VARNA, September,
Dynamic configuration of the CMS Data Acquisition Cluster Hannes Sakulin, CERN/PH On behalf of the CMS DAQ group Part 1: Configuring the CMS DAQ Cluster.
Clara Gaspar, March 2005 LHCb Online & the Conditions DB.
CMS pixel data quality monitoring Petra Merkel, Purdue University For the CMS Pixel DQM Group Vertex 2008, Sweden.
Introduction CMS database workshop 23 rd to 25 th of February 2004 Frank Glege.
Databases in CMS Conditions DB workshop 8 th /9 th December 2003 Frank Glege.
L0 DAQ S.Brisbane. ECS DAQ Basics The ECS is the top level under which sits the DCS and DAQ DCS must be in READY state before trying to use the DAQ system.
ALICE Pixel Operational Experience R. Santoro On behalf of the ITS collaboration in the ALICE experiment at LHC.
© FPT SOFTWARE – TRAINING MATERIAL – Internal use 04e-BM/NS/HDCV/FSOFT v2/3 JSP Application Models.
MICE CM28 Oct 2010Jean-Sebastien GraulichSlide 1 Detector DAQ o Achievements Since CM27 o DAQ Upgrade o CAM/DAQ integration o Online Software o Trigger.
Pixel DQM Status R.Casagrande, P.Merkel, J.Zablocki (Purdue University) D.Duggan, D.Hidas, K.Rose (Rutgers University) L.Wehrli (ETH Zuerich) A.York (University.
Online Monitoring System at KLOE Alessandra Doria INFN - Napoli for the KLOE collaboration CHEP 2000 Padova, 7-11 February 2000 NAPOLI.
Clara Gaspar, April 2006 LHCb Experiment Control System Scope, Status & Worries.
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association Marco Haag - Institute of Experimental Nuclear.
Automating the CMS DAQ, CHEP, Oct 17, 2013, AmsterdamH. Sakulin / CERN PH2 Overview Automation Features added to CMS DAQ over Run 1 of the LHC added.
André Augustinus 18 March 2002 ALICE Detector Controls Requirements.
The ATLAS Run 2 Expert System Giuseppe Avolio - CERN.
M. Caprini IFIN-HH Bucharest DAQ Control and Monitoring - A Software Component Model.
CHEP 2010 – TAIPEI Robert Gomez-Reino on behalf of CMS DAQ group.
ATLAS Detector Resources & Lumi Blocks Enrico & Nicoletta.
THIS MORNING (Start an) informal discussion to -Clearly identify all open issues, categorize them and build an action plan -Possibly identify (new) contributing.
Central DQM Shift Tutorial Online/Offline
Web Programming Language
BaBar Transition: Computing/Monitoring
CMS DCS: WinCC OA Installation Strategy
2007 IEEE Nuclear Science Symposium (NSS)
Srećko Morović Institute Ruđer Bošković
CMS High Level Trigger Configuration Management
NUUO Tools Welcome to NUUO general education service. This session allows users to have the overview of NUUO tools for system design. (Click)
Working with Client-Side Scripting
ATLAS MDT HV – LV Detector Control System (DCS)
CMS – The Detector Control System
Shift instructions August 16, 2017 Antoni Aduszkiewicz
Chapter 2: System Structures
Controlling a large CPU farm using industrial tools
WinCC-OA Upgrades in LHCb.
Pixel DQM Status & Plans
ProtoDUNE SP DAQ assumptions, interfaces & constraints
Data Quality Monitoring of the CMS Silicon Strip Tracker Detector
RCMS Internet - Intranet UI 9-1.
Valeria Bartsch UCL David Decotigny LLR Tao Wu RHUL
The LHCb Run Control System
The IFR Online Detector Control at the BaBar experiment at SLAC
LHCCWG Meeting R. Alemany, M. Lamont, S. Page
CMS Pixel Data Quality Monitoring
The IFR Online Detector Control at the BaBar experiment at SLAC
The Online Detector Control at the BaBar experiment at SLAC
Design Principles of the CMS Level-1 Trigger Control and Hardware Monitoring System Ildefons Magrans de Abril Institute for High Energy Physics, Vienna.
DQM for the RPC subdetector
AIMS Equipment & Automation monitoring solution
LHC BLM Software audit June 2008.
Pierluigi Paolucci & Giovanni Polese
Pierluigi Paolucci & Giovanni Polese
Tools for the Automation of large distributed control systems
Banafsheh Hajinasab Based on presentation by K. Strnisa, Cosylab
Chapter 13: I/O Systems.
Presentation transcript:

Hannes Sakulin, CERN/EP on behalf of the CMS DAQ group New operator assistance features in the CMS Run Control System 22nd International Conference on Computing in High Energy and Nuclear Physics (CHEP) San Francisco, USA, 10th Oct 2016 Hannes Sakulin, CERN/EP on behalf of the CMS DAQ group

System Overview

Front-end Electronics Control and Monitoring Systems in CMS errors alarms monitor clients DAQ Operator DCS Operator Level-0 Function Manager SM Conf Ind. Config DBs … Run Control System Subs 1 … Subs n Detector Control System … Monitoring Services XDAQ Online Software HLT, merge & transfer control data data control Low voltage High voltage Gas, Magnet Trigger Control and Distribution System Front-end Electronics L1 trigger electronics Sub-det DAQ electronics Central DAQ electronics Central DAQ Event Builder File based Filter Farm & Storage

Level-0 Function Manager – the top-level control node State machine Configuration Handling Indiv. Subs. Control GCM Configuration DBs Resource Service Equipment HLT

Level-0 Function Manager GUI sdf Top level Run Control GUI Initially flexibility needed – many manual settings

At the beginning of Run-1 … Full control possible But everything had to be done manually Only experts able to operate run control manual operation very error prone

Configuration Handling

Configuration handling Operator initially had to select Compatible first and high level trigger configurations Compatible set of RUN_KEYs for each sub-system We grouped these First into a combined trigger key Then into a combined CMS run mode Combines all subsystem and trigger configuration into a single configuration item Run modes for Collisions Cosmics Various special runs Run mode may be automatically selected based on LHC beam mode

The top-level Run Control web GUI auto-select run mode based on LHC run mode in turn determines most other settings

Configuration handling (II) Initially shifter needed to know what subsystems need to be reconfigured / recycled after a certain configuration change When to change / recover the clock Now a guidance system constantly compares the applied configuration with the selected configuration for each sub-system Indicators are displayed prompting operators to do the correct action Checks for updates to the selected configuration ( configuration databases ) Selected configuration can be tied to LHC mode through run mode Configuration DBs Resource Service L1 Trg Equipment HLT GCM State machine Guidance system Configuration Handling Indiv. Subs. Control Level-0 Function Manager Run Mode

Guidance system Ensures that all settings are applied … in the correct order.

Following the cycle of the LHC

Manual actions throughout an LHC cycle … LHC dipole current … STABLE BEAMS ADJUST STABLE BEAMS section 1 DCS ramps up high voltages DCS ramps down high voltages … Initially, new run needed when LHC start/stops ramping when high voltages are ramped Subsystem operators needed to change settings: ramp start Mask sensitive trigger channels ramp done Unmask sensitive trigger channels Tracker HV on Enable payload (Tk) raise gains (Pixel) Tracker HV off Disable payload (Tk) reduce gains (Pixel)

Front-end Electronics Control and Monitoring Systems in CMS errors alarms monitor clients DAQ Operator DCS Operator Level-0 Function Manager LHC HV status, LHC state DCS Gd SM Conf Ind. Ind. SM Conf Config DBs … Run Control System Subs 1 … Subs n Detector Control System … Monitoring Services XDAQ Online Software HLT, merge & transfer control data data control Low voltage High voltage Gas, Magnet Trigger Control and Distribution System Front-end Electronics L1 trigger electronics Sub-det DAQ electronics Central DAQ electronics Central DAQ Event Builder File based Filter Farm & Storage

Run control automatically handles run section changes LHC dipole current … STABLE BEAMS ADJUST STABLE BEAMS section 1 Start run at FLAT TOP. The rest is automatic DCS ramps up tracker HV DCS ramps down tracker HV … section 1 section 2 section 1 Section 2 Special run with circulating beam Automatic actions in DAQ : ramp start Mask sensitive trigger channels ramp done Unmask sensitive trigger channels Tracker HV on Enable payload (Tk) raise gains (Pixel) Tracker HV off Disable payload (Tk) reduce gains (Pixel)

Soft Error Recovery

Automatic soft error recovery With higher instantaneous luminosity in 2011 more and more frequent “soft errors” causing the run to get stuck Proportional to integrated luminosity Believed to be due to single event upsets Recovery procedure Stop run (30 sec) Re-configure a sub- detector (2-3 min) Start new run (20 sec) One Single Event Upset (needing recovery) every 73 pb-1 Single-event upsets in the electronics of the Si-Pixel detector. Proportional to integrated luminosity. 3-10 min down-time

Automatic soft error recovery From 2012, new automatic recovery procedure in top-level control node Sub-system detects soft error and signals by changing its state Top-level control node invokes recovery procedure Pause Triggers Invoke newly defined selective recovery transition on requesting detector In parallel perform preventive recovery of other detectors Resynchronize Resume DCS SE Gd Ind. SM Conf … 12 seconds down-time At least 46 hours of down-time avoided in 2012

DAQ Doctor

Front-end Electronics DAQ Doctor DAQ Doctor Run-1 errors alarms monitor clients DAQ Operator DCS Operator Level-0 Function Manager LHC HV status, LHC state DCS SE Gd Ind. SM Conf Config DBs … Run Control System Subs 1 … Subs n Detector Control System … Monitoring Services XDAQ Online Software HLT, merge & transfer control data data control Low voltage High voltage Gas, Magnet Trigger Control and Distribution System Front-end Electronics L1 trigger electronics Sub-det DAQ electronics Central DAQ electronics Central DAQ Event Builder File based Filter Farm & Storage

Towards the end of Run-1 Improved configuration handling Guidance Automatic actions following DCS / LHC state changes Soft Error recovery DAQ Doctor

New operator assistance for Run-2

Some errors still need to be recovered manually How to improve further Guidance indicates all necessary steps to the operator … … but operator still needs to follow them manually Click, wait, click, wait a few minutes, click … not always efficient Some errors still need to be recovered manually Rare / new / not well understood errors Want to speed up the typical recovery Stop the run Reconfigure / recycle a sub-system Start a new run Potentially recover secondary errors Prepare to trigger typical recovery by expert system

Front-end Electronics Automator errors alarms monitor clients DCS Operator DAQ Operator DAQ Doctor Run-1 Automator Function Manager Level-0 Function Manager LHC HV status, LHC state DCS SE Gd Ind. SM Conf Config DBs … Run Control System Subs 1 … Subs n Detector Control System … Monitoring Services XDAQ Online Software HLT, merge & transfer control data data control Low voltage High voltage Gas, Magnet Trigger Control and Distribution System Front-end Electronics L1 trigger electronics Sub-det DAQ electronics Central DAQ electronics Central DAQ Event Builder File based Filter Farm & Storage

Level-0 Automator

The top-level Run Control web GUI with Automator Recover run with 2 clicks (1) (2) Full Level-0 functionality still accessible

Timeline history of all manual or automatic actions Recovery triggered by operator schedule: start/stop only Configurable number of attempts to recover transitions going to Error

One-click start of run Can do this from any state of the system all indications of the guidance system are followed

One-click start of run - timeline Automator follows Guidance System in Level-0 Timeline shows history of all manual or automatic actions

Offline timeline – for post mortem analysis

New DAQ Expert

Front-end Electronics New DAQ Expert DAQ Expert Run-2 errors alarms monitor clients DAQ Operator DCS Operator Automator Function Manager Level-0 Function Manager LHC HV status, LHC state DCS SE Gd Ind. SM Conf Config DBs … Run Control System Subs 1 … Subs n Detector Control System … Monitoring Services XDAQ Online Software HLT, merge & transfer control data data control Low voltage High voltage Gas, Magnet Trigger Control and Distribution System Front-end Electronics L1 trigger electronics Sub-det DAQ electronics Central DAQ electronics Central DAQ Event Builder File based Filter Farm & Storage

New DAQ Expert New tool based on Java / Web Technologies New rules for DAQ-2 system Gives detailed recovery instructions for known error situations Simplified model of monitoring data Reasoning encapsulated in logic modules Easy to extend

Today Automatic actions following DCS / LHC state changes Improved configuration handling Guidance Soft Error recovery Automator : two-click recovery / 1-click start of run Recovery instructions by the DAQ Expert

Summary Expt System controlled by DAQ Expert Future ? Expt DCS SM Gd SE Conf Ind. Expt Automator: 2-click recovery DAQ Expert advice Run 2 DCS SM Gd SE Conf Ind. Automatic actions, Guidance, Soft Error recovery DAQ Doctor (Run-1) End of Run 1 DCS SM Gd SE Conf Ind. Manual Operations Start of Run 1 SM Conf Ind.

Thank You

Run-2 DAQ Expert

DAQ Expert

New DAQ expert

Run-1 DAQ Doctor

Expert System Sound alerts Spoken alerts Advice Automatic actions DAQ Doctor DAQ Operator Spoken alerts Advice Monitoring Run Control System Level-0 TRG … DAQ Automatic actions Live Access Servers … DCS Trigger Tracker ECAL DAQ DQM Trigger Supervisor … Slice Slice Monitor collectors XDAQ Online Software … CMSSW CMSSW XDAQ monitoring & alarming Computing farm monitoring data L1 trigger electronics Sub-det DAQ electronics Central DAQ electronics Central DAQ farm, High Level Trigger & storage

The DAQ Doctor Expert tool based on the same technology as High level scripting language (Perl) Generic framework & pluggable modules Detection of high level anomaly triggers further investigation Archive (web based) All Notes Sub-system errors CRC errors Dumps (of all monitoring data) for expert analysis in case of anomalies

The DAQ Doctor Diagnoses Anomalies in Automatic actions L1 rate HLT physics stream rate Dead time Backpressure Resynchronization rate Farm health Event builder and HLT farm data flow HLT farm CPU utilization … Automatic actions Triggers computation of a new central DAQ configuration in case of PC hardware failure(great help for on-call experts since 2012)

Technology

Front-end Electronics Control and Monitoring Systems in CMS DAQ Operator DCS Operator … Function Manager Node in the Run Control Tree defines a State Machine & parameters Specific actions, automation etc. implemented in Java User function managers dynamically loaded into the web application Run Control System – Java, Web Technologies Defines the control structure HTML, CSS, JavaScript, AJAX GUI in a web browser Run Control Web Application Apache Tomcat Servlet Container JSP, Tag Libraries Production system: 40 tomcats on 20 virtual machines Axis Web Service SOAP Servlet Communication Detector Control System Run Control System Level-0 Tracker … ECAL L1 Trigger TCDS … Tracker ECAL DAQ DQM … … data data Low voltage High voltage Gas, Magnet Trigger Control and Distribution System Front-end Electronics L1 trigger electronics Sub-det DAQ electronics Central DAQ electronics Central DAQ Event Builder File based Filter Farm & Storage

Front-end Electronics Control and Monitoring Systems in CMS DAQ Operator DCS Operator XDAQ Application XDAQ Framework – C++ XDAQ applications control hardware and handle data flow Hardware Access, Transport Protocols, XML configuration, SOAP communication, HyperDAQ web server Several 1000 applications to control data SOAP Detector Control System Run Control System Level-0 Tracker … ECAL L1 Trigger TCDS … Tracker ECAL DAQ DQM … … … XDAQ Online Software data data Low voltage High voltage Gas, Magnet Trigger Control and Distribution System Front-end Electronics L1 trigger electronics Sub-det DAQ electronics Central DAQ electronics Central DAQ Event Builder File based Filter Farm & Storage