CWG10 Control, Configuration and Monitoring


1 CWG10 Control, Configuration and Monitoring
Status Report 13 July 2016

2 Outline
- CCM milestones
- Control Status
- Configuration Status
- Monitoring Status
- Services running in O2 development cluster
- Next steps
- Conclusion
ALICE O2 CWG10 Control, Configuration and Monitoring

3 CCM milestones
JIRA Issue  Summary                                         Estimated due date
AOGM-1      CWG10 - CCM Control Library
AOGM-4      CWG10 - CCM Monitoring Library
AOGM-7      CWG10 - CCM Configuration Library
AOGM-2      CWG10 - CCM Control Prototype
AOGM-5      CWG10 - CCM Monitoring Prototype
AOGM-8      CWG10 - CCM Configuration System Prototype
AOGM-3      CWG10 - CCM Control Release V1.0
AOGM-6      CWG10 - CCM Monitoring System Release V1.0
AOGM-9      CWG10 - CCM Configuration System Release V1.0

4 Control Status
- DDS v1.2 released on 06-07-2016 (Anar & Co)
  - dds_intercom_lib
  - SLURM plugin (+ ssh and localhost)
- Ongoing tests with DDS + FairMQ State Machine + Zookeeper (Vasco, Barth & Sylvain)
  - Goal is to:
    - Provide feedback to developers
    - Identify possible limitations
  - Use QC as test bench
    - Multiple processes
    - Multiple devices in single process

5 Control Status
Apache Mesos in ALICE – Can it be used in O2?
- Apache Mesos successfully used in production by the Offline release building and validation cluster (24 nodes, cores, mixed bare-metal / OpenStack setup).
- Mesos DDS plugin being worked on by Kevin Napoli (Openlab summer student) under Giulio's supervision. As part of this work, Kevin is also investigating how to integrate a network-topology-aware scheduler for Mesos.
- Evaluation of Mesos-based "solutions" in progress:
  - Current "homegrown" solution (talk at CHEP2016 accepted)
  - Mesosphere DC/OS (Dario & Giulio)
  - CISCO Mantl (Kevin)
Giulio, Dario & Kevin

6 Configuration Status
- Configuration library (Sylvain & Pascal)
  - Simple put/get interface
  - Allows processes to read/write configuration from a repository
  - Supports multiple backends
    - From file
    - From etcd
- Etcd (Pascal)
  - Distributed key-value store
  - Raft consensus algorithm
  - RESTful HTTP API
  - Watch values for changes
  - Claims to be focused on being Simple, Secure, Fast and Reliable
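The put/get interface with interchangeable backends described on this slide could look roughly like the sketch below. It is written in Python for brevity and is not the CWG10 library's API: the names `Backend`, `MemoryBackend`, `FileBackend` and `Configuration` are hypothetical, and the in-memory backend merely stands in for a remote store such as etcd.

```python
import json
import os
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical backend interface: a flat key/value store."""

    @abstractmethod
    def put(self, key, value):
        """Store value under key."""

    @abstractmethod
    def get(self, key):
        """Return the value stored under key."""


class MemoryBackend(Backend):
    """Stand-in for a remote store such as etcd (no network involved)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]  # raises KeyError if the key is missing


class FileBackend(Backend):
    """Keeps all keys in one JSON file, rewritten on every put."""

    def __init__(self, path):
        self._path = path

    def _load(self):
        if not os.path.exists(self._path):
            return {}
        with open(self._path) as f:
            return json.load(f)

    def put(self, key, value):
        data = self._load()
        data[key] = value
        with open(self._path, "w") as f:
            json.dump(data, f)

    def get(self, key):
        return self._load()[key]


class Configuration:
    """Front end used by processes; the backend is chosen at construction."""

    def __init__(self, backend):
        self._backend = backend

    def put(self, key, value):
        self._backend.put(key, value)

    def get(self, key):
        return self._backend.get(key)
```

A process would then call, for example, `Configuration(FileBackend("conf.json")).get("/daq/flp1/buffer_size")`; the key layout shown here is an assumption. Keeping the backend behind one small interface is what lets the same process code read from a file during development and from a distributed store in production.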

7 Configuration Status – etcd benchmarking
Pascal
More details here

8 Monitoring Status
- Monitoring library (Adam)
  - Send application-specific values to monitoring system
  - Perform self-monitoring of processes (CPU, mem)
  - Generate derived metrics (rate, average)
  - Support multiple backends (cumulatively)
- Monitoring backends (Adam, Vasco & Costin)
  - Logging
  - MonALISA
  - InfluxDB
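As an illustration of the derived metrics and cumulative backends listed on this slide, here is a minimal Python sketch. It is not the Monitoring library itself: `Monitor` and `ListBackend` are hypothetical names, the `.rate` and `.average` suffixes are an assumed naming scheme, and self-monitoring of CPU and memory is left out.

```python
import time


class ListBackend:
    """Hypothetical backend that only records metrics; a real one would
    forward them to a log file, MonALISA or InfluxDB."""

    def __init__(self):
        self.received = []

    def send(self, name, value, timestamp):
        self.received.append((name, value, timestamp))


class Monitor:
    """Forwards every metric to all registered backends."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._last = {}    # name -> (value, timestamp) of the previous sample
        self._stats = {}   # name -> (sum, count) for the running average

    def _emit(self, name, value, timestamp):
        for backend in self._backends:
            backend.send(name, value, timestamp)

    def send(self, name, value, timestamp=None):
        """Send a raw value together with its derived rate and average."""
        if timestamp is None:
            timestamp = time.time()
        self._emit(name, value, timestamp)

        # Derived metric: rate of change since the previous sample.
        if name in self._last:
            prev_value, prev_time = self._last[name]
            if timestamp > prev_time:
                rate = (value - prev_value) / (timestamp - prev_time)
                self._emit(name + ".rate", rate, timestamp)
        self._last[name] = (value, timestamp)

        # Derived metric: running average over all samples seen so far.
        total, count = self._stats.get(name, (0.0, 0))
        total, count = total + value, count + 1
        self._stats[name] = (total, count)
        self._emit(name + ".average", total / count, timestamp)
```

Passing the timestamp explicitly, as in `mon.send("events", 300, timestamp=12.0)`, keeps the rate calculation reproducible in tests; in normal use it can be omitted.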

9 Monitoring Status – MonALISA
- Sensors for system monitoring
- ApMon for application/process monitoring
- MonALISA Service for transport/aggregation/processing
- MonALISA Repository for historical record
- Extensive experience in ALICE
  - Running in ALICE Offline for 10 years
    - 131 Services: 8M active parameters from 70K running jobs, 130 kHz of collected data
    - Central Repository instance: 2M actively tracked parameters, of which ~100K are persistently stored; 200K dynamic pages per day served to users
  - Running in ALICE DAQ for 2 years (MAD)
    - 1 Service, 2 kHz of monitoring data, near-real-time display to shift crew

10 Monitoring Status – IT setup
- Pipeline stages: Collect, Transport, Process, Visualize
- Current components: Lemon sensors (legacy), Apache Flume, Apache Spark, Hadoop, Elastic Search, Kibana
- Possible future components: collectd, influxdb, grafana

11 Monitoring Status: IT setup
- collectd
  - Collects system and service metrics
  - More than 100 available plugins (sensors): cpu, mem, apache, mysql, oracle, ipmi, nfs, …
  - Can write to many backends: carbon, csv, rrd, graphite, http, kafka, mongodb, redis, network, …
- Apache Flume
  - A bit like MonALISA Service (also Java-based)
  - Processes, aggregates and transports data
  - Source: consumes events from an external source
  - Channel: data store (Memory, File, JDBC, …, custom)
  - Sink: writes to an external target (HDFS, Elasticsearch, …)
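The source, channel and sink roles can be pictured with a small in-process analogy. The Python sketch below is not Flume's API (Flume is configured and extended in Java); `MemoryChannel`, `run_source` and `run_sink` are hypothetical names chosen only to mirror the three roles.

```python
from collections import deque


class MemoryChannel:
    """Buffers events between a source and a sink, in memory only."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        """Return the oldest buffered event, or None if the channel is empty."""
        if not self._events:
            return None
        return self._events.popleft()


def run_source(lines, channel):
    """Source: consumes events from an external input (here, any iterable)."""
    for line in lines:
        channel.put({"body": line.strip()})


def run_sink(channel, target):
    """Sink: drains the channel and writes to an external target (here, a list)."""
    while True:
        event = channel.take()
        if event is None:
            break
        target.append(event["body"])
```

The channel decouples the two ends: the source can keep accepting events while the sink is slow or temporarily unavailable, which is why the choice of channel store (memory, file, database) determines how much data survives a failure.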

12 Monitoring Status: IT setup
- influxdb
  - Time series database
  - Input data from UDP, HTTP, collectd, graphite, …
  - Dashboards via Chronograf (same company) or grafana
- Grafana
  - Dashboard generation
  - Supports multiple data sources: graphite, elasticsearch, influxdb, opentsdb, …
FLP Prototype Meeting
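Data sent to InfluxDB over UDP or HTTP is written in its text-based line protocol (`measurement,tags fields timestamp`). The Python sketch below formats one point and sends it over UDP. The helper names `to_line` and `send_udp` are hypothetical, the host and port are up to the caller, and the UDP listener has to be enabled in the server configuration.

```python
import socket


def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value using line-protocol type conventions."""
    if isinstance(value, bool):      # checked before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"           # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'            # strings are double-quoted


def to_line(measurement, fields, tags=None, timestamp_ns=None):
    """Format one point; tags are sorted by key, fields keep insertion order."""
    if not fields:
        raise ValueError("at least one field is required")
    line = _escape(measurement, ", ")
    for key, value in sorted((tags or {}).items()):
        line += "," + _escape(key, ",= ") + "=" + _escape(str(value), ",= ")
    line += " " + ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"   # nanoseconds since the epoch
    return line


def send_udp(line, host, port):
    """Send one formatted line to an InfluxDB UDP listener."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

For example, `to_line("cpu", {"usage": 0.25}, {"host": "flp1"})` yields `cpu,host=flp1 usage=0.25`; when the timestamp is omitted, the server assigns its own receive time.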

13 Services running in O2 development cluster
- Located in basement of building 4 (DAQ lab)
  - 2 web servers, 1 DB server, 1 Monitoring server, 4 10G servers, 4 40G servers, 2 FLP servers
  - Coming soon: 1 Control server, 1 Configuration server, GPN connectivity
  - ~45 older machines can be used for larger-scale tests
- Monitoring: MonALISA
  - sensors installed on all nodes
  - service and repository running on mon server
- Monitoring: collectd + influxdb + grafana
  - collectd running on all nodes (cpu + mem + network)
  - influxdb running on mon server
  - grafana running on web server, available here
- Configuration: etcd
  - etcd server running on 10G server (will be moved to conf server)
  - etcd-browser running on web server, available here
Uli, Costin, Vasco, Pascal
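A node in the cluster would talk to the etcd server through the RESTful HTTP API mentioned on slide 6. The Python sketch below only builds the requests and does not send them. It assumes the etcd v2 keys API (`/v2/keys/...`, with `wait=true` for watching a key); the helper names and the host `etcd.example:2379` used in the usage note are hypothetical and not the cluster's real address.

```python
import json
import urllib.parse
import urllib.request


def _url(base, key, **query):
    """Build the v2 keys URL for a key, with optional query parameters."""
    url = base.rstrip("/") + "/v2/keys/" + urllib.parse.quote(key.strip("/"))
    if query:
        url += "?" + urllib.parse.urlencode(query)
    return url


def put_request(base, key, value):
    """Request that sets a key; the value travels as a form-encoded body."""
    body = urllib.parse.urlencode({"value": value}).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return urllib.request.Request(
        _url(base, key), data=body, headers=headers, method="PUT"
    )


def get_request(base, key):
    """Request that reads the current value of a key."""
    return urllib.request.Request(_url(base, key), method="GET")


def watch_request(base, key):
    """Long-polling request that returns once the key changes."""
    return urllib.request.Request(_url(base, key, wait="true"), method="GET")


def node_value(response_body):
    """Extract the value from a v2 JSON response body."""
    return json.loads(response_body)["node"]["value"]
```

A request built this way would be sent with `urllib.request.urlopen(get_request("http://etcd.example:2379", "/o2/flp1/rate"))`, and the response body passed to `node_value`. Separating request construction from sending keeps the sketch checkable without a running server.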

14 Next steps - Control
- Continue tests with FairMQ, DDS, Zookeeper
- Continue tests with Mesos & Co

15 Next steps - Configuration
- Move Configuration library to AliceO2 repo
- Set up community service reachable from outside CERN
Key       Summary
OCONF-3   Benchmark etcd backend
OCONF-6   OCONF-3 Benchmark with "multiple tree" data structure
OCONF-7   OCONF-3 Benchmark multiple etcd servers running on same physical node
OCONF-8   OCONF-3 Benchmark with authentication ON
OCONF-9   OCONF-3 Benchmark etcd proxy
OCONF-13  OCONF-3 Benchmark etcd in linearized read mode
OCONF-5   Explore Consul as potential backend for Configuration system
OCONF-12  Benchmark file based backend with shared file system
OCONF-14  Create class to interface with MySQL backend
OCONF-15  Benchmark MySQL backend
OCONF-17  Benchmark with TPCC configuration data

16 Next steps - Monitoring
- Move Monitoring library to AliceO2 repo
- Set up community service reachable from outside CERN
Key      Summary
OMON-22  Process's heartbeat
OMON-8   Explore Grafana as dashboard
OMON-7   Explore CollectD as potential tool for metrics collection
OMON-2   Explore Apache Flume as potential tool for data aggregation
OMON-6   Explore statsD as potential tool for metrics collection
OMON-3   Explore InfluxDB as potential repository for monitoring data

17 Conclusion
- Work is ramping up
- Configuration and Monitoring libraries ready to be moved to AliceO2 repo
- Control a bit behind but some progress made
JIRA Issue  Summary                                         Estimated due date
AOGM-1      CWG10 - CCM Control Library
AOGM-4      CWG10 - CCM Monitoring Library
AOGM-7      CWG10 - CCM Configuration Library
AOGM-2      CWG10 - CCM Control Prototype
AOGM-5      CWG10 - CCM Monitoring Prototype
AOGM-8      CWG10 - CCM Configuration System Prototype
AOGM-3      CWG10 - CCM Control Release V1.0
AOGM-6      CWG10 - CCM Monitoring System Release V1.0
AOGM-9      CWG10 - CCM Configuration System Release V1.0

