First operational experience with the CMS Run Control System
Hannes Sakulin, CERN/PH, on behalf of the CMS DAQ group
17th IEEE Real Time Conference, 24-28 May 2010, Lisbon, Portugal
The Compact Muon Solenoid Experiment
[Detector layout: Drift-Tube chambers, Cathode Strip Chambers, Resistive Plate Chambers, iron yoke, 4 T superconducting coil, Silicon Strip and Silicon Pixel trackers, Electromagnetic Calorimeter, Hadronic Calorimeter]
- LHC: p-p collisions at E_CM = 14 TeV (2010: 7 TeV), heavy ions
- Bunch crossing frequency: 40 MHz
- CMS: multi-purpose detector with a broad physics programme
- 55 million readout channels
CMS Trigger and DAQ design
- First Level Trigger (hardware): up to 100 kHz
- Central DAQ builds events at 100 kHz, 100 GB/s
  - 2 stages
  - 8 independent event builder / filter slices
- High Level Trigger running on a filter farm: ~700 PCs, ~6000 cores
- In total around 10000 applications to control
[Diagram: frontend readout links feeding the event builder and the filter farm]
CMS Control Systems
[Diagram: the CMS control domains. The Run Control System (Java, web technologies) controls the front-end drivers, the First Level Trigger and the central DAQ & High Level Trigger farm through a tree of nodes (DAQ, Trigger, ECAL, Tracker, slices, ...); the First Level Trigger is controlled via the Trigger Supervisor (XDAQ, C++). The Detector Control System (DCS, based on PVSS by Siemens ETM and SMI, the State Management Interface) controls low voltage, high voltage, gas, the magnet and the front-end electronics, with per-detector sub-systems such as the ECAL Detector Control System.]
CMS Run Control System
- GUI in a web browser: HTML, CSS, JavaScript, AJAX
- Run Control World (Java, web technologies): defines the control structure
  - The Run Control web application runs in an Apache Tomcat servlet container (Java Server Pages, tag libraries, web services: WSDL, Axis, SOAP)
  - Function Manager: a node in the Run Control tree; defines a state machine and parameters
  - User function managers are dynamically loaded into the web application
- XDAQ World (C++, XML, SOAP): XDAQ applications control hardware and data flow
  - XDAQ is the framework of the CMS online software; it provides hardware access, transport protocols, services, etc.
  - ~10000 applications to control
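To make the function-manager concept concrete, the sketch below shows, in plain Java, what a node of the Run Control tree conceptually does: it holds a parameter set, forwards state-transition commands to its child resources (child function managers or XDAQ applications), and reports its state back to its parent. All class and method names here are illustrative assumptions; this is not the actual Run Control framework API.

```java
// Illustrative sketch of a Function Manager node (NOT the real RCMS framework API):
// it keeps a parameter set, forwards commands to its child resources and
// notifies its parent of the resulting state.

import java.util.List;
import java.util.Map;

public class SketchFunctionManager {

    /** A child resource: either a child Function Manager or a XDAQ application. */
    public interface ChildResourceProxy {
        void sendCommand(String command, Map<String, String> parameters); // e.g. via SOAP
    }

    private final List<ChildResourceProxy> children;
    private final Map<String, String> parameters;
    private String state = "Created";

    public SketchFunctionManager(List<ChildResourceProxy> children,
                                 Map<String, String> parameters) {
        this.children = children;
        this.parameters = parameters;
    }

    /** Handles a state-transition command received from the parent node or the GUI. */
    public synchronized void handleCommand(String command, String targetState) {
        for (ChildResourceProxy child : children) {
            child.sendCommand(command, parameters);   // propagate down the tree
        }
        state = targetState;                          // in reality set by the state machine engine
        notifyParent();
    }

    private void notifyParent() {
        // The real framework sends an asynchronous notification to the parent FM / GUI.
        System.out.println("state -> " + state);
    }
}
```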
[Diagram: anatomy of a Function Manager. Framework code provides a web service / monitor servlet towards the parent Function Manager and the GUI (lifecycle, commands, parameters, monitoring), the Run Control state machine engine and parameter set, and child resource proxies: towards child Function Managers (web service plus asynchronous notifications), towards XDAQ applications (lifecycle and configuration, commands, parameters, state and errors, with processes started through Job Control), and towards the Detector Control System (PSX servlet). Custom code supplied by the user consists of the state machine definition and the event handlers. The Function Manager and the XDAQ applications are connected to the Resource Service DB and DAQ Structure DB for configuration, to the Run Info DB for conditions, to a Log Collector for logs, and to the XDAQ Monitoring & Alarming System for monitoring and errors.]
Entire DAQ System Structure is Configurable
- The Resource Service database stores:
  - the control structure: Function Managers to load (URL), parameters, child nodes
  - the configuration of the XDAQ executives (XML): libraries to be loaded, applications (e.g. builder unit, filter unit) and their parameters, network connections, collaborating applications
- Configuration data flows through the Resource Service API and is delivered over SOAP to the XDAQ executives, which are started via Job Control
- High-level tools generate configurations; configurations are versioned
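As a rough illustration of what the slide calls the control structure, the following sketch models one node of the configuration as it might be read from the Resource Service database: which function manager implementation to load, its parameters, its children, and the XML configurations of the XDAQ executives it controls. The record layout and field names are assumptions made for illustration; the actual database schema is not shown in the talk.

```java
// Hypothetical representation of the configurable control structure described on the slide.
// Field and class names are illustrative; the real Resource Service schema is not shown here.

import java.net.URI;
import java.util.List;
import java.util.Map;

/** One node of the control tree, as it might be stored in the Resource Service DB. */
record FunctionManagerConfig(
        String name,                           // e.g. "ECAL" or "Slice0" (made-up examples)
        URI implementationUrl,                 // URL of the Function Manager code to load dynamically
        Map<String, String> parameters,        // node-specific parameters
        List<FunctionManagerConfig> children,  // child Function Managers
        List<String> xdaqExecutiveXml) {       // XML configurations of the XDAQ executives it controls

    /** Counts all XDAQ executive configurations in this subtree. */
    int countExecutives() {
        int n = xdaqExecutiveXml.size();
        for (FunctionManagerConfig child : children) {
            n += child.countExecutives();
        }
        return n;
    }
}
```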
CMS Control Tree
[Diagram: control tree with the GUI (web browser) on top, the Level-0 node below it, Level-1 nodes for DAQ, Trigger, ECAL, Tracker, DT, RPC, ..., and sub-system specific lower levels (e.g. FEC, FED, TTS, FB, RB, HLT, Slice 0 ... Slice 7)]
- Level-0: control and parameterization of the run
- Level-1: common state machine and parameters
- Level-2 ... Level-n: sub-system specific
- Sub-system Run Control is developed by the sub-system teams; the framework and the top-level Run Control are developed by the central team
Abbreviations: FEC = Frontend controller, FED = Frontend driver, TTS = Trigger Throttling System, FB = FED Builder, RB = Readout Builder, HLT = High Level Trigger
RCMS Level-1 State Machine (simplified)
[State diagram: Created, Halted, Pre-Configured, Configured, Running, Paused, Error]
- Creation: load and start the Level-1 Function Managers
- Initialization: start further levels of function managers; start all XDAQ processes on the cluster
- New: Pre-Configuration (trigger only, a few seconds): sets up the clock and the periodic timing signals
- Configuration: load the configuration from the database; configure hardware and applications
- Start run
- Pause / Resume: pauses / resumes the trigger (and trackers, which may need to change settings)
- Stop run
- Halt
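The state model listed above can be summarized as a transition table. The sketch below encodes it in Java; the command names and the ERROR fallback are assumptions, and the framework's actual state machine engine is of course richer than this.

```java
// Sketch of the simplified Level-1 state model from the slide as an explicit transition table.
// Command names and the table layout are illustrative assumptions.

import java.util.Map;

enum Level1State { CREATED, HALTED, PRE_CONFIGURED, CONFIGURED, RUNNING, PAUSED, ERROR }

final class Level1StateModel {
    // command -> (fromState -> toState)
    private static final Map<String, Map<Level1State, Level1State>> TRANSITIONS = Map.of(
            "initialize",   Map.of(Level1State.CREATED,        Level1State.HALTED),
            "preConfigure", Map.of(Level1State.HALTED,         Level1State.PRE_CONFIGURED),
            "configure",    Map.of(Level1State.HALTED,         Level1State.CONFIGURED,
                                   Level1State.PRE_CONFIGURED, Level1State.CONFIGURED),
            "start",        Map.of(Level1State.CONFIGURED,     Level1State.RUNNING),
            "pause",        Map.of(Level1State.RUNNING,        Level1State.PAUSED),
            "resume",       Map.of(Level1State.PAUSED,         Level1State.RUNNING),
            "stop",         Map.of(Level1State.RUNNING,        Level1State.CONFIGURED,
                                   Level1State.PAUSED,         Level1State.CONFIGURED),
            "halt",         Map.of(Level1State.CONFIGURED,     Level1State.HALTED,
                                   Level1State.RUNNING,        Level1State.HALTED,
                                   Level1State.PAUSED,         Level1State.HALTED));

    /** Returns the next state, or ERROR if the command is not allowed in the current state. */
    static Level1State next(Level1State current, String command) {
        return TRANSITIONS.getOrDefault(command, Map.of()).getOrDefault(current, Level1State.ERROR);
    }
}
```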
Top-Level Run Control (Level-0)
- Central point of control
- Global state machine
- The Level-0 allows the configuration to be parameterized:
  - sub-system run key (e.g. level of zero suppression)
  - First Level Trigger key / High Level Trigger key
  - clock source (LHC / local)
Masking of components
- The Level-0 allows components to be masked out:
  - remove/add sub-systems from/to control and readout
  - remove/add detector partitions
  - remove/add individual frontend drivers (masking): connection to the readout (SLINK), connection to the Trigger Throttling System
  - mask out DAQ slices (one slice = 1/8 of the central DAQ)
Commissioning and First Operation with the LHC
Commissioning and First Operation
- Independent parallel commissioning of sub-detectors
- MiniDAQ setups allow for standalone operation
MiniDAQ (“partitioning”)
- Dedicated small DAQ setups for most sub-systems
- Low bandwidth, but sufficient for most tests
- MiniDAQ may be used in parallel to the global runs
[Diagram: a global run (Level-0, global DAQ, global trigger, slices 0-7, Tracker, ...) running in parallel with a MiniDAQ run (own Level-0, ECAL, DT, local trigger controller (LTC) or global trigger); MiniDAQ was heavily used in the commissioning phase]
Commissioning and First Operation
- Independent parallel commissioning of sub-detectors
- MiniDAQ setups allow for standalone operation
- Run start time at the end of 2008: globally 8.5 minutes, central DAQ 5 minutes (cold start)
Optimization of run startup time
- Globally:
  - optimized the global state model (pre-configuration)
  - provided tools for parallelization of user code (parameter handling; see the sketch after this list)
  - sub-system specific performance improvements
- Central DAQ:
  - developed a tool to analyze log files and plot timelines of all operations
  - distributed central DAQ control over 5 Apache Tomcat servers (previously 1)
  - reduced message traffic between Run Control and XDAQ applications: commands and parameters are combined into a single message
  - new startup method for High Level Trigger processes on multi-core machines: initialize and configure a mother process, then fork the child processes; reduced memory footprint thanks to copy-on-write
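One of the optimizations mentioned above is parallelizing the handling of many child resources while sending command and parameters in one combined message. The sketch below shows a generic way such a parallel dispatch could look in Java; the proxy interface, the thread-pool size and the timeout are illustrative assumptions, not the framework's actual tools.

```java
// Illustrative sketch of dispatching one combined command+parameters message to many
// child resources in parallel. Interface names, pool size and timeout are assumptions.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class ParallelDispatch {

    public interface ChildProxy {
        /** One message carrying both the command and its parameters. */
        void send(String command, Map<String, String> parameters);
    }

    public static void dispatch(List<ChildProxy> children, String command,
                                Map<String, String> parameters) throws InterruptedException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, Math.min(children.size(), 16)));
        for (ChildProxy child : children) {
            pool.submit(() -> child.send(command, parameters));   // commands go out in parallel
        }
        pool.shutdown();
        pool.awaitTermination(60, TimeUnit.SECONDS);              // arbitrary timeout for the sketch
    }
}
```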
Run start timing (May 2010)
- Globally 4 ¼ minutes; central DAQ 1 ¼ minutes (Initialize, Configure, Start)
- The configuration time is now dominated by the frontend configuration (Tracker)
- Pause/Resume is 7x faster than Stop/Start
[Bar chart: configuration time per sub-system, in seconds]
Commissioning and First Operation
- Independent parallel commissioning of sub-detectors; MiniDAQ setups allow for standalone operation
- Run start time: end of 2008: globally 8.5 minutes, central DAQ 5 minutes (cold start); now: globally < 4 ¼ minutes, central DAQ 1 ¼ minutes
- Initially some stability issues; problems solved by debugging user code (thread leaks)
- Recovery from sub-system faults: control of individual sub-systems from the top-level control node; fast masking / unmasking of components (partial re-configuration only)
- Operator efficiency: operation is complex
  - sub-system inter-dependencies when configuring partially
  - dependencies on internal and external parameters
  - procedures to follow (clock change)
  - operators are no longer DAQ experts but colleagues from the entire collaboration
  - built-in cross-checks guide the operator
Built-in cross-checks
- Built-in cross-checks guide the shifter
- They indicate the sub-systems that need to be re-configured when (see the sketch below):
  - a parameter is changed in the GUI
  - a sub-system / FED is added or removed
  - external parameters change
- They enforce the correct order of re-configuration
- They enforce re-configuration of CMS if the clock source has changed or the LHC has been unstable
- Result: improved operator efficiency
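A minimal sketch of the bookkeeping such cross-checks need is shown below: a map from parameters to the sub-systems that depend on them, plus a set of sub-systems currently flagged for re-configuration. The dependency content and all names are made up for illustration; the real cross-checks also encode ordering constraints and the LHC/clock conditions.

```java
// Hedged sketch of the kind of bookkeeping behind the built-in cross-checks:
// which sub-systems must be re-configured after a change. Not the real implementation.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ReconfigurationChecker {
    // Which sub-systems are sensitive to which parameters - illustrative content only.
    private static final Map<String, Set<String>> DEPENDENTS = Map.of(
            "HLT_KEY",      Set.of("DAQ"),
            "L1_KEY",       Set.of("TRIGGER"),
            "RUN_KEY",      Set.of("ECAL", "TRACKER", "DT", "RPC"),
            "CLOCK_SOURCE", Set.of("ALL"));   // a clock change forces a full re-configuration

    private final Set<String> needsReconfiguration = new HashSet<>();

    void onParameterChanged(String parameter) {
        needsReconfiguration.addAll(DEPENDENTS.getOrDefault(parameter, Set.of()));
    }

    void onMaskChanged(String subsystem) {
        needsReconfiguration.add(subsystem);   // adding/removing a sub-system or FED flags it
    }

    /** Used by the GUI to warn the shifter before a new run is allowed to start. */
    Set<String> pendingReconfigurations() {
        return Set.copyOf(needsReconfiguration);
    }

    void onConfigured(String subsystem) {
        needsReconfiguration.remove(subsystem);
    }
}
```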
Operation with the LHC
- Cosmic run:
  1. Bring the detector into the desired state (Detector Control System)
  2. Start data acquisition (Run Control System)
- With the LHC: detector state and DAQ state depend on the LHC
  - We want to keep the DAQ going before beams are stable, to ensure that we are ready
[Diagram: LHC dipole current vs. time; the LHC clock is stable only outside the ramp, and during the ramp clock variations may unlock some links in the trigger; the tracking detector high voltage is only ramped up when beams are stable (detector safety)]
Integration with DCS & automatic actions
- To keep the DAQ going, Run Control needs to be aware of the LHC and detector states
- The top-level control node is notified about changes and propagates them to the concerned systems (trigger and trackers):
  - the trigger masks channels while the LHC is ramping
  - the Silicon-Strip Tracker masks its payload when running with HV off (noise)
  - the Silicon-Pixel Tracker reduces its gains when running with HV off (high currents)
- The top-level control node triggers an automatic pause/resume when relevant DCS / LHC states change during a run (see the sketch below)
[Diagram: LHC and DCS states reach the Level-0 node through PSX (PVSS SOAP eXchange), an XDAQ service bridging the PVSS-based Detector Control System and Run Control]
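The automatic pause/resume pattern can be summarized as: pause the run, reconfigure or mask the affected components, resume. The sketch below illustrates that flow for the LHC-ramp and tracker-HV cases mentioned above; the interfaces, sub-system names and commands are assumptions made for the example, not the actual CMS code.

```java
// Illustrative sketch of the automatic actions triggered by LHC / DCS state changes
// during a run: pause, apply the change, resume. All names are assumptions for the example.

public final class AutomaticActions {

    /** Minimal view of the top-level (Level-0) control node used by this sketch. */
    public interface RunController {
        void pause();
        void resume();
        void sendToSubsystem(String subsystem, String command);
    }

    private final RunController level0;

    public AutomaticActions(RunController level0) {
        this.level0 = level0;
    }

    /** LHC starts ramping: the clock may vary, so sensitive trigger channels are masked. */
    public void onLhcRampStart() {
        applyDuringPause(() -> level0.sendToSubsystem("TRIGGER", "MaskSensitiveChannels"));
    }

    /** Ramp done, clock stable again: unmask the sensitive trigger channels. */
    public void onLhcRampDone() {
        applyDuringPause(() -> level0.sendToSubsystem("TRIGGER", "UnmaskSensitiveChannels"));
    }

    /** Tracker high voltage switched on or off (reported by DCS through PSX). */
    public void onTrackerHvChange(boolean hvOn) {
        applyDuringPause(() -> {
            level0.sendToSubsystem("TRACKER", hvOn ? "EnablePayload" : "DisablePayload");
            level0.sendToSubsystem("PIXEL", hvOn ? "NominalGains" : "ReducedGains");
        });
    }

    /** The pause / apply / resume pattern described on the slide. */
    private void applyDuringPause(Runnable action) {
        level0.pause();
        action.run();
        level0.resume();
    }
}
```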
Automatic actions
[Timeline: LHC dipole current and clock stability during a CMS run, with the automatic actions taken:
- at ramp start: mask sensitive trigger channels; at ramp done: unmask sensitive trigger channels
- while the tracker HV is off: disable payload, raise thresholds, log the HV state in the data
- after the tracker HV has been ramped up: enable payload, lower thresholds, log the HV state in the data
- at stop: ramp down the tracker HV]
Observations
- Standardizing the experiment's software is important for long-term maintenance
  - Largely successful, considering the size of the collaboration
  - The Run Control framework was available early in the development of the experiment's software (2003)
  - It was adopted by all sub-systems, but some sub-systems built their own framework underneath
- Ease of use becomes more and more important
  - Run Control / DAQ is now operated by members of the entire CMS collaboration
- Running with high live-time: > 95% so far for stable-beam periods in 2010
Observations – Web Technology
- Operations:
  - Typical advantages of a web application: multiple clients, remote login
  - Stability of the server (Apache Tomcat + Run Control web application) is very good: it runs for weeks
  - Stability of the GUI depends on third-party products (the browser); behavior changes from one release to the next
  - Not a big problem: the GUI can be restarted without affecting the run
- Development:
  - Knowledge of Java and of the Run Control framework is sufficient for basic function managers; the web-based GUI and the web technologies are handled by the framework
  - Development of complex GUIs such as the top-level control node is more difficult: many technologies need to be mastered
  - Modern web toolkits are not yet used by Run Control
Summary & Outlook
- The CMS Run Control System is based on Java and web technologies
  - Good stability
- The top-level control node is optimized for efficiency:
  - flexible operation of individual sub-systems
  - built-in cross-checks to guide the operator
  - automatic actions triggered by detector and LHC states
- High CMS data-taking efficiency: live-time > 95%
- Next developments:
  - further improve fault tolerance
  - automatic recovery procedures
  - auto pilot
[Event display: candidate event]