Published by Elian Mellis. Modified over 9 years ago.
1
CWG10 Control, Configuration and Monitoring
Status and plans for Control, Configuration and Monitoring
16 December 2014
ALICE O2 Asian Workshop 2014 @ Pusan
2
Outline
▶ Motivation
▶ A brief overview of data taking operations
▶ Lessons learned from Run 1
▶ CCM Overview
▶ Performance tests
▶ Next steps
ALICE O2 CWG10 Control, Configuration and Monitoring | ALICE O2 Asian Workshop 2014
3
Motivation
▶ Why do we need a Control System?
  ▶ Start and stop processes
  ▶ Sequence of operations, synchronization
  ▶ External systems
  ▶ Automation
▶ Why do we need a Configuration System?
  ▶ Configure processes
▶ Why do we need a Monitoring System?
  ▶ Detect abnormal conditions
  ▶ Automation
4
Team
▶ CERN
▶ KMUTT, Thailand
  ▶ See next presentation by Khanasin for an update
5
A brief overview of data taking operations
▶ A typical LHC year
[Figure: timeline from January to December showing three periods: shutdown for maintenance, proton-proton collisions, heavy-ion collisions]
Disclaimer: current system, not O2
6
A brief overview of data taking operations
▶ A typical LHC Fill (up to 30 hours)
[Figure: fill timeline. LHC labels: Beam Injection, Stable beams, Beam dump. ALICE labels: ALICE safe, Prepare trigger configuration, Detector calibration, Partial ALICE READY, Full ALICE READY, Data taking (ideally a single run), Detector calibration]
Disclaimer: current system, not O2
7
A brief overview of data taking operations
▶ A typical ALICE run
  ▶ Start-of-Run
    ▶ Configure detector electronics
    ▶ Start online systems
    ▶ Store data taking conditions
  ▶ Data taking
    ▶ Readout
    ▶ Event building
    ▶ Online data monitoring
    ▶ Online calibration data
  ▶ End-of-Run
    ▶ Export data taking conditions and calibration data to Offline
    ▶ Stop online systems
Disclaimer: current system, not O2
8
A brief overview of data taking operations
▶ Run 1 Start-of-Run (SOR) sequence (high level)
Disclaimer: current system, not O2
9
Lessons learned from Run 1 (2010-2013)
▶ Must be fast when changing run (Run 2: Fast SOR/EOR)
  ▶ More runs than expected
  ▶ Not everything needs to be restarted
▶ Must be flexible (Run 2: Pause and Recover)
  ▶ Not every problem needs to stop a run
▶ Must monitor everything (Run 2: MAD)
  ▶ Data flow monitoring
10
Control in O2 - Overview
▶ Process Management
  ▶ Start/stop processes
  ▶ Send commands to processes (CONFIGURE, PAUSE/RESUME, etc.)
  ▶ Estimated: O(100k) processes
▶ Task Management
  ▶ Ensure that actions are executed in the correct order
▶ Automation
  ▶ Automatically recover from errors
  ▶ Automatically react to internal events (e.g. need more EPNs) and external events (e.g. start of LHC collisions)
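The command handling listed under Process Management can be pictured as a small state machine. The following is only a minimal sketch: CONFIGURE, PAUSE and RESUME come from the slide, while the START and STOP commands, all state names, and the `ControlledProcess` class are hypothetical and not taken from the O2 design.

```python
from enum import Enum


class State(Enum):
    # State names are hypothetical, chosen for illustration only.
    STANDBY = "STANDBY"
    CONFIGURED = "CONFIGURED"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"


# Allowed transitions: (current state, command) -> next state.
# CONFIGURE, PAUSE and RESUME appear on the slide; START and STOP are assumed.
TRANSITIONS = {
    (State.STANDBY, "CONFIGURE"): State.CONFIGURED,
    (State.CONFIGURED, "CONFIGURE"): State.CONFIGURED,  # reconfigure without restart
    (State.CONFIGURED, "START"): State.RUNNING,
    (State.RUNNING, "PAUSE"): State.PAUSED,
    (State.PAUSED, "RESUME"): State.RUNNING,
    (State.RUNNING, "STOP"): State.CONFIGURED,
    (State.PAUSED, "STOP"): State.CONFIGURED,
}


class ControlledProcess:
    """A process that accepts commands only when they are legal in its state."""

    def __init__(self, name):
        self.name = name
        self.state = State.STANDBY

    def handle(self, command):
        key = (self.state, command)
        if key not in TRANSITIONS:
            raise ValueError(
                f"{self.name}: {command} not allowed in {self.state.value}"
            )
        self.state = TRANSITIONS[key]
        return self.state


def run_sequence(process, commands):
    """Apply commands in order; an illegal command raises and stops the sequence."""
    return [process.handle(command) for command in commands]
```

Rejecting out-of-order commands at the process level is one simple way to enforce the "correct order" requirement; a real control system would also need timeouts and error states, which this sketch leaves out.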
11
Control in O2 - Notes
▶ Includes processes from online and offline
▶ Must control both synchronous and asynchronous tasks
▶ Cannot be seen as a batch system
  ▶ Bound to external events (e.g. start of collisions)
  ▶ Sequence of operations, synchronization points
  ▶ Low latency very important
12
Configuration in O2 - Overview
▶ Configuration distribution
  ▶ Provide processes with needed configuration parameters
▶ Dynamic process (re)configuration
  ▶ Essential to achieve fast run transition
▶ O(1 GB) of configuration data
13
Monitoring in O2 - Overview
▶ Data collection and archival
  ▶ System monitoring (CPU, memory, I/O, etc.)
  ▶ Application monitoring (data rates, link backpressure, internal buffer status, etc.)
  ▶ O(600 kHz) of monitoring data
▶ Alarms and action triggering
  ▶ Support shift crew, experts
  ▶ Feedback to Control system
14
Monitoring in O2 - Notes
▶ Includes metrics from online and offline
▶ Includes both low and high frequency metrics
  ▶ Low: every 30 seconds, system metrics
  ▶ High: every second, link status
▶ Permanent storage will be the limiting factor
  ▶ No need to store everything, can filter "interesting" values
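One possible reading of filtering "interesting" values is a deadband filter: a sample is stored only if it differs from the last stored value by more than a threshold. The slide does not define "interesting", so this criterion, the function name `filter_interesting`, and the threshold parameter are all assumptions made for illustration.

```python
def filter_interesting(samples, threshold):
    """Keep the first sample, then only samples whose value differs from the
    last kept value by more than `threshold` (a hypothetical criterion).

    `samples` is an iterable of (timestamp, value) pairs in time order.
    """
    kept = []
    last_value = None
    for timestamp, value in samples:
        if last_value is None or abs(value - last_value) > threshold:
            kept.append((timestamp, value))
            last_value = value
    return kept


# Example: a metric sampled once per second; only significant changes are stored.
samples = [(0, 10.0), (1, 10.2), (2, 10.4), (3, 12.0), (4, 12.1), (5, 9.0)]
stored = filter_interesting(samples, threshold=1.0)
```

With a filter of this kind, the storage load depends on how often a metric changes rather than on how often it is sampled, which is the effect the slide is after.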
15
Performance Tests: Control
▶ Tool: SMI (State Machine Interface)
▶ Setup:
  ▶ Level 0 SMI domain: Partition CCM
  ▶ Level 1 SMI domain: Detector CCMs, EPN Cluster CCM
  ▶ Level 2 SMI domain: FLP CCMs, EPN CCMs
  ▶ Level 2 SMI proxy: local process
16
Performance Tests: Control
▶ Setup:
  ▶ 46 hosts
  ▶ 1 Level 0 domain
  ▶ 20 Level 1 domains
  ▶ 1350 Level 2 domains
  ▶ 67500 proxies
▶ Increase due to initial lookup in DIM DNS
▶ Conclusion: cannot use in current version
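To give a feel for the scale of this test setup, the counts above can be turned into average fan-out figures. The numbers are the ones on the slide; the assumption that domains and proxies are spread evenly across the hierarchy and the hosts is not stated there and is made only for this sketch.

```python
# Counts taken from the slide.
hosts = 46
level0_domains = 1
level1_domains = 20
level2_domains = 1350
proxies = 67500

# Averages below assume an even spread, which the slide does not state.
total_domains = level0_domains + level1_domains + level2_domains
proxies_per_level2_domain = proxies / level2_domains
level2_per_level1_domain = level2_domains / level1_domains
proxies_per_host = proxies / hosts

print(f"SMI domains in total:        {total_domains}")
print(f"Proxies per Level 2 domain:  {proxies_per_level2_domain:.0f}")
print(f"Level 2 domains per Level 1: {level2_per_level1_domain:.1f}")
print(f"Proxies per host:            {proxies_per_host:.0f}")
```

Under that assumption, each Level 2 domain manages 50 proxies and each host runs roughly 1470 of them, well below the O(100k) processes estimated for the full system.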
17
Performance Tests: Monitoring
▶ MonALISA + ApMon
▶ Setup:
  ▶ 10 sender nodes, up to 1000 threads per host (ApMon)
  ▶ 1 MonALISA service, all historical record disabled
▶ Result: 52 kHz without data loss
▶ Conclusion: could use 12+ collectors to reach 600 kHz
By Costin Grigoras
18
Performance Tests: Monitoring
▶ Zabbix
▶ Setup:
  ▶ 10 sender nodes, up to 10 processes per host
  ▶ 1 Zabbix Server node, 200 threads, permanent storage disabled (in-memory history enabled)
▶ Result: 30 kHz without data loss
▶ Conclusion: could use 20+ collectors to reach 600 kHz
By Andres Gomez Ramirez
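The collector counts quoted in the two monitoring tests follow from dividing the 600 kHz target by the measured single-collector rate and rounding up. The sketch below reproduces that arithmetic; it assumes throughput scales linearly with the number of collectors, which the tests as described did not verify, and the function name `collectors_needed` is hypothetical.

```python
import math


def collectors_needed(target_khz, per_collector_khz):
    """Minimum number of collectors, assuming perfectly linear scaling."""
    return math.ceil(target_khz / per_collector_khz)


TARGET_KHZ = 600  # estimated monitoring data rate from the overview slide

monalisa = collectors_needed(TARGET_KHZ, 52)  # MonALISA: 52 kHz per service
zabbix = collectors_needed(TARGET_KHZ, 30)    # Zabbix: 30 kHz per server

print(f"MonALISA collectors: {monalisa}")  # 12
print(f"Zabbix collectors:   {zabbix}")    # 20
```

These are lower bounds, which matches the "12+" and "20+" wording: both tests ran with permanent storage disabled, so the real per-collector rate may be lower.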
19
Next steps
▶ Finalise TDR
▶ Perform more tests:
  ▶ Control: boost library + ZeroMQ
  ▶ Configuration: ZooKeeper
  ▶ Monitoring: MonALISA, Zabbix with permanent storage
▶ Provide CCM systems for ALFA prototype (CWG13)
▶ Refine design