CWG10 Control, Configuration and Monitoring Status Report 13 July 2016
Outline
- CCM milestones
- Control Status
- Configuration Status
- Monitoring Status
- Services running in O2 development cluster
- Next steps
- Conclusion

ALICE O2 CWG10 Control, Configuration and Monitoring

CCM milestones

JIRA Issue  Summary                                          Estimated due date
AOGM-1      CWG10 - CCM Control Library                      2016-03-31
AOGM-4      CWG10 - CCM Monitoring Library                   2016-03-31
AOGM-7      CWG10 - CCM Configuration Library                2016-03-31
AOGM-2      CWG10 - CCM Control Prototype                    2016-06-30
AOGM-5      CWG10 - CCM Monitoring Prototype                 2016-06-30
AOGM-8      CWG10 - CCM Configuration System Prototype       2016-06-30
AOGM-3      CWG10 - CCM Control Release V1.0                 2016-12-21
AOGM-6      CWG10 - CCM Monitoring System Release V1.0       2016-12-21
AOGM-9      CWG10 - CCM Configuration System Release V1.0    2016-12-21

Control Status
- DDS v1.2 released on 06-07-2016 (Anar & Co)
  - dds_intercom_lib
  - SLURM plugin (+ ssh and localhost)
- Ongoing tests with DDS + FairMQ State Machine + Zookeeper (Vasco, Barth & Sylvain)
  - Goals:
    - Provide feedback to developers
    - Identify possible limitations
  - Use QC as test bench
    - Multiple processes
    - Multiple devices in single process

Control Status – Apache Mesos in ALICE: can it be used in O2? (Giulio, Dario & Kevin)
- Apache Mesos is successfully used in production by the Offline release building and validation cluster (24 nodes, 316 cores, mixed bare-metal / OpenStack setup).
- A Mesos DDS plugin is being worked on by Kevin Napoli (Openlab summer student) under Giulio's supervision. As part of this work, Kevin is also investigating how to integrate a network-topology-aware scheduler for Mesos.
- Evaluation of Mesos-based "solutions" in progress:
  - Current "homegrown" solution (talk at CHEP2016 accepted)
  - Mesosphere DC/OS (Dario & Giulio)
  - CISCO Mantl (Kevin)

Configuration Status
- Configuration library (Sylvain & Pascal)
  - Simple put/get interface
  - Allows processes to read/write configuration from a repository
  - Supports multiple backends
    - From file
    - From etcd
- Etcd (Pascal)
  - Distributed key-value store
  - Raft consensus algorithm
  - RESTful HTTP API
  - Watch values for changes
  - Claims to be focused on being Simple, Secure, Fast and Reliable

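The put/get interface with interchangeable backends can be illustrated with a minimal Python sketch. This is not the actual Configuration library: the class names, the `make_backend` helper, the URI schemes and the example key are all hypothetical, and `MemoryBackend` only stands in for a remote store such as etcd so that the example runs without a server.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical backend interface: store and fetch string values by key."""

    @abstractmethod
    def put(self, key: str, value: str) -> None:
        ...

    @abstractmethod
    def get(self, key: str) -> str:
        ...


class FileBackend(Backend):
    """Keeps all key/value pairs in one JSON file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as handle:
            return json.load(handle)

    def put(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        with open(self.path, "w") as handle:
            json.dump(data, handle)

    def get(self, key: str) -> str:
        return self._load()[key]  # raises KeyError if the key is absent


class MemoryBackend(Backend):
    """In-memory stand-in for a remote key-value store such as etcd."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> str:
        return self._data[key]


def make_backend(uri: str) -> Backend:
    """Select a backend from the URI scheme (the schemes are assumptions)."""
    scheme, _, rest = uri.partition("://")
    if scheme == "file":
        return FileBackend(rest)
    if scheme == "memory":
        return MemoryBackend()
    raise ValueError(f"unsupported backend scheme: {scheme!r}")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "config.json")
    conf = make_backend("file://" + path)
    conf.put("/flp/readout/buffer_size", "4096")  # made-up key for illustration
    print(conf.get("/flp/readout/buffer_size"))
```

Because callers only see `put` and `get`, switching from a file to a key-value store would change the URI passed to `make_backend` and nothing else in the calling process.
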
Configuration Status – etcd benchmarking (Pascal)
More details here

Monitoring Status
- Monitoring library (Adam)
  - Send application specific values to monitoring system
  - Perform self-monitoring of processes (CPU, mem)
  - Generate derived metrics (rate, average)
  - Support multiple backends (cumulatively)
- Monitoring backends (Adam, Vasco & Costin)
  - Logging
  - MonALISA
  - InfluxDB

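The library features listed above (sending values, a derived rate metric, and several backends fed at the same time) can be sketched in a few lines of Python. This is an illustration only, not the Monitoring library's API: `Collector`, `ListBackend`, `PrintBackend`, the `.rate` suffix and the metric names are hypothetical, and the two backends merely stand in for Logging, MonALISA or InfluxDB.

```python
import time


class ListBackend:
    """Keeps every metric in a list; stands in for a real backend."""

    def __init__(self):
        self.sent = []

    def send(self, name, value, timestamp):
        self.sent.append((name, value, timestamp))


class PrintBackend:
    """Writes metrics to stdout, similar in spirit to a logging backend."""

    def send(self, name, value, timestamp):
        print(f"{timestamp:.3f} {name}={value}")


class Collector:
    """Forwards each metric to all configured backends (cumulatively)."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._last = {}  # metric name -> (value, timestamp) of previous sample

    def send(self, name, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        for backend in self.backends:
            backend.send(name, value, ts)

    def send_with_rate(self, name, value, timestamp=None):
        """Send the value and, from the second sample on, its rate per second."""
        ts = time.time() if timestamp is None else timestamp
        self.send(name, value, ts)
        previous = self._last.get(name)
        self._last[name] = (value, ts)
        if previous is not None and ts > previous[1]:
            rate = (value - previous[0]) / (ts - previous[1])
            self.send(name + ".rate", rate, ts)

    def send_process_cpu(self):
        """Self-monitoring: CPU seconds consumed so far by this process."""
        self.send("process.cpu_seconds", time.process_time())


if __name__ == "__main__":
    collector = Collector([PrintBackend(), ListBackend()])
    collector.send_with_rate("bytes_read", 100, timestamp=10.0)
    collector.send_with_rate("bytes_read", 300, timestamp=12.0)
    collector.send_process_cpu()
```

In this sketch the second `bytes_read` sample produces a `bytes_read.rate` of 100.0 per second, and every metric reaches both backends.
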
Monitoring Status – MonALISA
- Sensors for system monitoring
- ApMon for application/process monitoring
- MonALISA Service for transport/aggregation/processing
- MonALISA Repository for historical record
- Extensive experience in ALICE
  - Running in ALICE Offline for 10 years
    - 131 Services: 8M active parameters from 70K running jobs, 130 kHz of collected data
    - Central Repository instance: 2M actively tracked parameters, of which ~100K are persistently stored; 200K dynamic pages per day served to users
  - Running in ALICE DAQ for 2 years (MAD)
    - 1 Service, 2 kHz of monitoring data, near-real-time display to shift crew

Monitoring Status – IT setup
[Diagram: monitoring pipeline with four stages: Collect, Transport, Process, Visualize]
Components shown: Lemon sensors (legacy), Apache Flume, Apache Spark, Hadoop, Elastic Search, Kibana
Possible future components: collectd?, influxdb?, grafana?

Monitoring Status – IT setup
- collectd
  - Collects system and service metrics
  - More than 100 available plugins (sensors): cpu, mem, apache, mysql, oracle, ipmi, nfs, …
  - Can write to many backends: carbon, csv, rrd, graphite, http, kafka, mongodb, redis, network, …
- Apache Flume
  - A bit like MonALISA Service (also Java-based)
  - Processes, aggregates and transports data
  - Source: consumes events from an external source
  - Channel: data store (Memory, File, JDBC, …, custom)
  - Sink: writes to an external target (HDFS, Elasticsearch, …)

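The source, channel and sink roles described for Apache Flume can be illustrated with a small Python sketch. It is not Flume's (Java) API or configuration: `MemoryChannel`, `run_source`, `drain_to_sink`, the input line format and the capacity are all hypothetical, chosen only to show how a channel decouples the component that consumes events from the one that writes them out.

```python
import queue


class MemoryChannel:
    """Bounded in-memory buffer sitting between a source and a sink."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def put(self, event):
        self._queue.put_nowait(event)  # raises queue.Full when the channel is full

    def take(self):
        """Return the oldest buffered event, or None if the channel is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


def run_source(lines, channel):
    """Source: turn external input ('host metric value' lines) into events."""
    for line in lines:
        host, metric, value = line.split()
        channel.put({"host": host, "metric": metric, "value": float(value)})


def drain_to_sink(channel, sink):
    """Sink side: deliver buffered events to an external target, in order."""
    delivered = 0
    while True:
        event = channel.take()
        if event is None:
            return delivered
        sink(event)
        delivered += 1


if __name__ == "__main__":
    channel = MemoryChannel(capacity=100)
    run_source(["flp1 cpu 0.5", "flp2 cpu 0.7"], channel)
    stored = []  # stands in for an external target such as a database
    print(drain_to_sink(channel, stored.append), stored)
```

The bounded channel is the point of the pattern: if the sink stalls, the source sees a full channel instead of silently losing or unboundedly queueing events.
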
Monitoring Status – IT setup
- influxdb
  - Time series database
  - Input data from UDP, HTTP, collectd, graphite, …
  - Dashboards via Chronograf (same company), grafana
- Grafana
  - Dashboard generation
  - Supports multiple data sources: graphite, elasticsearch, influxdb, opentsdb, …

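Data sent to InfluxDB over HTTP or UDP is written in its text-based line protocol (`measurement,tag=value field=value timestamp`). The Python sketch below formats one point under that syntax; the helper names and the example measurement, tag and field names are made up for illustration, and the escaping covers only the common cases (commas, spaces, equals signs, quotes), so treat it as a sketch rather than a complete encoder.

```python
def _escape(text, chars):
    """Backslash-escape each character of `chars` found in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + text + '"'


def to_line(measurement, tags, fields, timestamp_ns=None):
    """Format one point as a line-protocol string (timestamp in nanoseconds)."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # tags are emitted sorted by key
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    line = head + " " + body
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


if __name__ == "__main__":
    # prints: cpu,host=flp1 usage=0.5 1468400000000000000
    print(to_line("cpu", {"host": "flp1"}, {"usage": 0.5}, 1468400000000000000))
```

A line produced this way could then be sent as the body of an HTTP write request or as a UDP datagram, depending on which input the server has enabled.
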
Services running in O2 development cluster
- Hardware (Uli)
  - Located in basement of building 4 (DAQ lab)
  - 2 web servers, 1 DB server, 1 Monitoring server, 4 10G servers, 4 40G servers, 2 FLP servers
  - Coming soon: 1 Control server, 1 Configuration server, GPN connectivity
  - ~45 older machines can be used for larger scale tests
- Monitoring: MonALISA (Costin)
  - sensors installed on all nodes
  - service and repository running on mon server
- Monitoring: collectd + influxdb + grafana (Vasco)
  - collectd running on all nodes (cpu + mem + network)
  - influxdb running on mon server
  - grafana running on web server, available here
- Configuration: etcd (Pascal)
  - etcd server running on 10G server (will be moved to conf server)
  - etcd-browser running on web server, available here

Next steps - Control
- Continue tests with FairMQ, DDS, Zookeeper
- Continue tests with Mesos & Co

Next steps - Configuration
- Move Configuration library to AliceO2 repo
- Set up community service reachable from outside CERN

Key       Parent    Summary
OCONF-3             Benchmark etcd backend
OCONF-6   OCONF-3   Benchmark with "multiple tree" data structure
OCONF-7   OCONF-3   Benchmark multiple etcd servers running on same physical node
OCONF-8   OCONF-3   Benchmark with authentication ON
OCONF-9   OCONF-3   Benchmark etcd proxy
OCONF-13  OCONF-3   Benchmark etcd in linearized read mode
OCONF-5             Explore Consul as potential backend for Configuration system
OCONF-12            Benchmark file based backend with shared file system
OCONF-14            Create class to interface with MySQL backend
OCONF-15            Benchmark MySQL backend
OCONF-17            Benchmark with TPCC configuration data

Next steps - Monitoring
- Move Monitoring library to AliceO2 repo
- Set up community service reachable from outside CERN

Key      Summary
OMON-22  Process's heartbeat
OMON-8   Explore Grafana as dashboard
OMON-7   Explore CollectD as potential tool for metrics collection
OMON-2   Explore Apache Flume as potential tool for data aggregation
OMON-6   Explore statsD as potential tool for metrics collection
OMON-3   Explore InfluxDB as potential repository for monitoring data

Conclusion
- Work is ramping up
- Configuration and Monitoring libraries ready to be moved to AliceO2 repo
- Control a bit behind but some progress made

JIRA Issue  Summary                                          Estimated due date
AOGM-1      CWG10 - CCM Control Library                      2016-03-31
AOGM-4      CWG10 - CCM Monitoring Library                   2016-03-31
AOGM-7      CWG10 - CCM Configuration Library                2016-03-31
AOGM-2      CWG10 - CCM Control Prototype                    2016-06-30
AOGM-5      CWG10 - CCM Monitoring Prototype                 2016-06-30
AOGM-8      CWG10 - CCM Configuration System Prototype       2016-06-30
AOGM-3      CWG10 - CCM Control Release V1.0                 2016-12-21
AOGM-6      CWG10 - CCM Monitoring System Release V1.0       2016-12-21
AOGM-9      CWG10 - CCM Configuration System Release V1.0    2016-12-21