1
CWG10 Control, Configuration and Monitoring
Status Report 13 July 2016
2
Outline
- CCM milestones
- Control Status
- Configuration Status
- Monitoring Status
- Services running in O2 development cluster
- Next steps
- Conclusion
ALICE O2 CWG10 Control, Configuration and Monitoring
3
CCM milestones
JIRA Issue   Summary                                             Estimated due date
AOGM-1       CWG10 - CCM Control Library
AOGM-4       CWG10 - CCM Monitoring Library
AOGM-7       CWG10 - CCM Configuration Library
AOGM-2       CWG10 - CCM Control Prototype
AOGM-5       CWG10 - CCM Monitoring Prototype
AOGM-8       CWG10 - CCM Configuration System Prototype
AOGM-3       CWG10 - CCM Control Release V1.0
AOGM-6       CWG10 - CCM Monitoring System Release V1.0
AOGM-9       CWG10 - CCM Configuration System Release V1.0
4
Control Status
DDS v1.2 released on 06-07-2016 (Anar & Co)
- dds_intercom_lib
- SLURM plugin (+ ssh and localhost)
Ongoing tests with DDS + FairMQ State Machine + Zookeeper (Vasco, Barth & Sylvain)
- Goal is to
  - Provide feedback to developers
  - Identify possible limitations
- Use QC as test bench
  - Multiple processes
  - Multiple devices in single process
5
Control Status
Apache Mesos in ALICE – Can it be used in O2?
- Apache Mesos successfully used in production by Offline release building and validation cluster (24 nodes, cores, mixed bare-metal / OpenStack setup).
- Mesos DDS plugin being worked on by Kevin Napoli (Openlab summer student) under Giulio's supervision.
  - As part of the work, Kevin is also investigating how to integrate a network-topology-aware scheduler for Mesos.
- Evaluation of Mesos-based "solutions" in progress:
  - Current "homegrown" solution (talk at CHEP2016 accepted)
  - Mesosphere DC/OS (Dario & Giulio)
  - CISCO Mantl (Kevin)
Giulio, Dario & Kevin
6
Configuration Status
Configuration library (Sylvain & Pascal)
- Simple put/get interface
- Allows processes to read/write configuration from repository
- Supports multiple backends
  - From file
  - From etcd
Etcd (Pascal)
- Distributed key-value store
- Raft consensus algorithm
- RESTful HTTP API
- Watch values for changes
- Claims to be focused on being Simple, Secure, Fast and Reliable
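As an illustration of the put/get interface with interchangeable backends described on this slide, here is a minimal Python sketch. It is not the CWG10 Configuration library: the class names, the `open_configuration` helper, the URI schemes and the key `/daq/flp1/rate` are all hypothetical, and an in-memory dictionary stands in for etcd so that the example runs without a server.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical backend interface: a key-value store behind put/get."""

    @abstractmethod
    def put(self, key, value):
        ...

    @abstractmethod
    def get(self, key):
        ...


class MemoryBackend(Backend):
    """Stand-in for a remote store such as etcd (no network access here)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class FileBackend(Backend):
    """Keeps all keys in a single JSON file (the file layout is an assumption)."""

    def __init__(self, path):
        self._path = path

    def _load(self):
        if not os.path.exists(self._path):
            return {}
        with open(self._path) as f:
            return json.load(f)

    def put(self, key, value):
        data = self._load()
        data[key] = value
        with open(self._path, "w") as f:
            json.dump(data, f)

    def get(self, key):
        return self._load()[key]


def open_configuration(uri):
    """Select a backend from the URI scheme (scheme names are assumptions)."""
    scheme, _, rest = uri.partition("://")
    if scheme == "file":
        return FileBackend(rest)
    if scheme == "memory":
        return MemoryBackend()
    raise ValueError(f"unsupported backend: {scheme}")


# Usage: the calling code is identical whichever backend is selected.
path = os.path.join(tempfile.mkdtemp(), "conf.json")
conf = open_configuration(f"file://{path}")
conf.put("/daq/flp1/rate", "100")
print(conf.get("/daq/flp1/rate"))
```

Keeping the backend choice in a single URI means a process can switch from a local file to a shared store without code changes, which is one plausible reading of "supports multiple backends".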
7
Configuration Status – etcd benchmarking
Pascal
More details here
8
Monitoring Status
Monitoring library (Adam)
- Send application-specific values to monitoring system
- Perform self-monitoring of processes (CPU, mem)
- Generate derived metrics (rate, average)
- Support multiple backends (cumulatively)
Monitoring backends (Adam, Vasco & Costin)
- Logging
- MonALISA
- InfluxDB
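The two ideas of derived metrics (rate, average) and cumulative backends can be sketched as follows. This is a hedged Python illustration, not the CWG10 Monitoring library: `Collector`, `ListBackend`, the `.rate` and `.average` suffixes and the metric name `events` are all assumptions made for the example.

```python
import time


class ListBackend:
    """Hypothetical backend that simply records what it is sent."""

    def __init__(self):
        self.sent = []

    def send(self, name, value, timestamp):
        self.sent.append((name, value, timestamp))


class Collector:
    """Forwards every metric to all backends and adds derived metrics."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._last = {}   # name -> (value, timestamp) of the previous sample
        self._sums = {}   # name -> (running sum, sample count)

    def _dispatch(self, name, value, timestamp):
        # "Cumulative" backends: each one receives every value.
        for backend in self._backends:
            backend.send(name, value, timestamp)

    def send(self, name, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._dispatch(name, value, ts)

        # Derived metric 1: running average over all samples seen so far.
        total, count = self._sums.get(name, (0.0, 0))
        total, count = total + value, count + 1
        self._sums[name] = (total, count)
        self._dispatch(name + ".average", total / count, ts)

        # Derived metric 2: rate of change relative to the previous sample.
        if name in self._last:
            prev_value, prev_ts = self._last[name]
            if ts > prev_ts:
                rate = (value - prev_value) / (ts - prev_ts)
                self._dispatch(name + ".rate", rate, ts)
        self._last[name] = (value, ts)


# Usage: two backends receive the same stream of raw and derived values.
log, db = ListBackend(), ListBackend()
collector = Collector([log, db])
collector.send("events", 100, timestamp=0.0)
collector.send("events", 300, timestamp=2.0)
for entry in log.sent:
    print(entry)
```

With explicit timestamps of 0 s and 2 s, the second sample yields a rate of (300 - 100) / 2 = 100 per second and a running average of 200.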
9
Monitoring Status – MonALISA
Sensors for system monitoring ApMon for application/process monitoring MonALISA Service for transport/aggregation/ processing MonALISA Repository for historical record Extensive experience in ALICE Running in ALICE Offline since 10 years 131 Services: 8M active parameters from 70K running jobs 130KHz of collected data Central Repository instance 2M actively tracked parameters, ~100K are persistently stored 200K dynamic pages per day served to users Running in ALICE DAQ since 2 years (MAD) 1 Service, 2 kHz of monitoring data, near-real-time display to shift crew ALICE O2 CWG10 Control, Configuration and Monitoring
10
Monitoring Status – IT setup
Collect Transport Process Visualize Elastic Search Kibana Future: influxdb ? Future: grafana ? Lemon sensors (legacy) Apache Flume Apache Spark Future: collectd ? Hadoop ALICE O2 CWG10 Control, Configuration and Monitoring
11
Monitoring Status: IT setup
collectd Collects system and services metrics More than 100 available plugins (sensors) cpu, mem, apache, mysql, oracle, ipmi, nfs, … Can write to many backends carbon, csv, rrd, graphite, http, kafka, mongodb, redis, network, … Apache Flume A bit like MonALISA Service (also Java-based) Allows to process, aggregate, transport data Source: consumes events from external source Channel: data store (Memory, File, JDBC, … , custom) Sink: writes to external target (HDFS, elasticSearch,…) ALICE O2 CWG10 Control, Configuration and Monitoring
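The source / channel / sink split described for Apache Flume can be mimicked in a few lines. The sketch below is a toy Python model of that pattern, not Flume's Java API: `source`, `MemoryChannel`, `ListSink` and the sample events are all hypothetical names chosen for the example.

```python
import queue


def source(lines):
    """Source: turns external input (here a list of strings) into events."""
    for line in lines:
        yield {"body": line}


class MemoryChannel:
    """Channel: bounded in-memory buffer between source and sink."""

    def __init__(self, capacity=100):
        self._queue = queue.Queue(maxsize=capacity)

    def put(self, event):
        self._queue.put(event)

    def take(self):
        # Returns None when the channel is empty instead of blocking.
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


class ListSink:
    """Sink: writes events to an external target (here a plain list)."""

    def __init__(self):
        self.written = []

    def drain(self, channel):
        while (event := channel.take()) is not None:
            self.written.append(event["body"])


# Usage: the source fills the channel, the sink empties it.
channel = MemoryChannel()
for event in source(["cpu 0.42", "mem 0.77"]):
    channel.put(event)

sink = ListSink()
sink.drain(channel)
print(sink.written)
```

The channel decouples the two ends, so the sink can be slower than the source or be swapped out without touching the source.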
12
Monitoring Status: IT setup
influxdb Time series database Input data from UDP, HTTP, collectd, graphite, … Dashboards via Chronograf (same company), grafana Grafana Dashboard generation Supports multiple data sources graphite, elasticsearch, influxdb, opentsdb, … FLP Prototype Meeting
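For the UDP and HTTP inputs, InfluxDB accepts points in its text-based line protocol: measurement name, optional comma-separated tags, a space, the fields, a space, and a timestamp. The Python sketch below formats one such point and shows how it could be sent over UDP. It is a simplified illustration: it handles only numeric (float) fields, omits the escaping of special characters, and the measurement `cpu`, the tag `host=flp1` and the `send_udp` helper are assumptions made for the example.

```python
import socket


def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point in InfluxDB line protocol (float fields only)."""
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(
        f"{key}={float(value)}" for key, value in sorted(fields.items())
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def send_udp(line, host, port):
    """Send one line as a UDP datagram; host and port depend on the server setup."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Usage: build a point; sending is left to the caller, since it needs a server.
line = to_line("cpu", {"host": "flp1"}, {"usage": 12.5}, 1468400000000000000)
print(line)  # cpu,host=flp1 usage=12.5 1468400000000000000
```

The timestamp is given in nanoseconds since the Unix epoch, which is the default precision of the line protocol.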
13
Services running in O2 development cluster
Located in basement of building 4 (DAQ lab) 2 web servers, 1 DB server, 1 Monitoring server, 4 10G servers, 4 40G servers, 2 FLP servers Coming soon: 1 Control server, 1 Configuration server, GPN connectivity ~ 45 older machines can be used for larger scale tests Monitoring: MonALISA sensors installed on all nodes service and repository running on mon server Monitoring: collectd + influxdb + grafana collectd running on all nodes (cpu + mem + network) influxdb running on mon server grafana running on web server, available here Configuration: etcd etcd server running on 10G server (will be moved to conf server) etcd-browser running on web server, available here Uli Costin Vasco Pascal ALICE O2 CWG10 Control, Configuration and Monitoring
14
Next steps - Control
- Continue tests with FairMQ, DDS, Zookeeper
- Continue tests with Mesos & Co
15
Next steps - Configuration
Move Configuration library to AliceO2 repo Set up community service reachable from outside CERN Key Summary OCONF-3 Benchmark etcd backend OCONF-6 OCONF-3 Benchmark with "multiple tree" data structure OCONF-7 OCONF-3 Benchmark multiple etcd servers running on same physical node OCONF-8 OCONF-3 Benchmark with authentication ON OCONF-9 OCONF-3 Benchmark etcd proxy OCONF-13 OCONF-3 Benchmark etcd in linearized read mode OCONF-5 Explore Consul as potential backend for Configuration system OCONF-12 Benchmark file based backend with shared file system OCONF-14 Create class to interface with MySQL backend OCONF-15 Benchmark MySQL backend OCONF-17 Benchmark with TPCC configuration data ALICE O2 CWG10 Control, Configuration and Monitoring
16
Next steps - Monitoring
Move Monitoring library to AliceO2 repo Set up community service reachable from outside CERN Key Summary OMON-22 Process's heartbeat OMON-8 Expore Grafana as dashboard OMON-7 Expore CollectD as potential tool for metrics collection OMON-2 Explore Apache Flume as potential tool for data aggregation OMON-6 Explore statsD as potential tool for metrics collection OMON-3 Explore InfluxDB as potential repository for monitoring data ALICE O2 CWG10 Control, Configuration and Monitoring
17
Conclusion
- Work is ramping up
- Configuration and Monitoring libraries ready to be moved to AliceO2 repo
- Control a bit behind but some progress made
JIRA Issue   Summary                                             Estimated due date
AOGM-1       CWG10 - CCM Control Library
AOGM-4       CWG10 - CCM Monitoring Library
AOGM-7       CWG10 - CCM Configuration Library
AOGM-2       CWG10 - CCM Control Prototype
AOGM-5       CWG10 - CCM Monitoring Prototype
AOGM-8       CWG10 - CCM Configuration System Prototype
AOGM-3       CWG10 - CCM Control Release V1.0
AOGM-6       CWG10 - CCM Monitoring System Release V1.0
AOGM-9       CWG10 - CCM Configuration System Release V1.0