CWG10 Control, Configuration and Monitoring

Presentation transcript:

CWG10 Control, Configuration and Monitoring Status Report 13 July 2016

Outline
- CCM milestones
- Control Status
- Configuration Status
- Monitoring Status
- Services running in O2 development cluster
- Next steps
- Conclusion
ALICE O2 CWG10 Control, Configuration and Monitoring

CCM milestones

JIRA Issue  Summary                                          Estimated due date
AOGM-1      CWG10 - CCM Control Library                      2016-03-31
AOGM-4      CWG10 - CCM Monitoring Library
AOGM-7      CWG10 - CCM Configuration Library
AOGM-2      CWG10 - CCM Control Prototype                    2016-06-30
AOGM-5      CWG10 - CCM Monitoring Prototype
AOGM-8      CWG10 - CCM Configuration System Prototype
AOGM-3      CWG10 - CCM Control Release V1.0                 2016-12-21
AOGM-6      CWG10 - CCM Monitoring System Release V1.0
AOGM-9      CWG10 - CCM Configuration System Release V1.0

Control Status
- DDS v1.2 released on 06-07-2016 (Anar & Co)
  - dds_intercom_lib
  - SLURM plugin (+ ssh and localhost)
- Ongoing tests with DDS + FairMQ State Machine + Zookeeper (Vasco, Barth & Sylvain)
  - Goal is to
    - Provide feedback to developers
    - Identify possible limitations
  - Use QC as test bench
    - Multiple processes
    - Multiple devices in single process

Control Status – Apache Mesos in ALICE: can it be used in O2? (Giulio, Dario & Kevin)
- Apache Mesos is successfully used in production by the Offline release building and validation cluster (24 nodes, 316 cores, mixed bare-metal / OpenStack setup).
- A Mesos DDS plugin is being worked on by Kevin Napoli (Openlab summer student) under Giulio's supervision. As part of this work, Kevin is also investigating how to integrate a network-topology-aware scheduler for Mesos.
- Evaluation of Mesos-based "solutions" is in progress:
  - Current "homegrown" solution (talk at CHEP2016 accepted)
  - Mesosphere DC/OS (Dario & Giulio)
  - CISCO Mantl (Kevin)

Configuration Status
- Configuration library (Sylvain & Pascal)
  - Simple put/get interface
  - Allow processes to read/write configuration from repository
  - Supports multiple backends
    - From file
    - From etcd
- Etcd (Pascal)
  - Distributed key-value store
  - Raft consensus algorithm
  - RESTful HTTP API
  - Watch values for changes
  - Claim to be focused on being Simple, Secure, Fast and Reliable
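The put/get interface with interchangeable backends described on this slide can be illustrated with a minimal sketch. This is not the CWG10 Configuration library: the class names, the `make_backend` helper, the `file://` and `mem://` URI schemes and the JSON file layout are all assumptions made for illustration, and the in-memory backend merely stands in for a networked key-value store such as etcd.

```python
import json
import os
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical storage backend behind a simple put/get interface."""

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> str: ...


class FileBackend(Backend):
    """Keeps all keys in one JSON file (file layout is an assumption)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path, encoding="utf-8") as f:
            return json.load(f)

    def put(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def get(self, key: str) -> str:
        return self._load()[key]


class InMemoryBackend(Backend):
    """Stand-in for a key-value store such as etcd; no network involved."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> str:
        return self._data[key]


def make_backend(uri: str) -> Backend:
    """Select a backend from a URI; the schemes are illustrative only."""
    scheme, _, rest = uri.partition("://")
    if scheme == "file":
        return FileBackend(rest)
    if scheme == "mem":
        return InMemoryBackend()
    raise ValueError("unknown backend scheme: " + scheme)
```

A process would then call, for example, `make_backend("file:///tmp/conf.json").put("/daq/buffer_size", "4096")` and read the value back with `get`, without knowing which backend is in use.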

Configuration Status – etcd benchmarking (Pascal)
More details here

Monitoring Status
- Monitoring library (Adam)
  - Send application specific values to monitoring system
  - Perform self-monitoring of processes (CPU, mem)
  - Generate derived metrics (rate, average)
  - Support multiple backends (cumulatively)
- Monitoring backends (Adam, Vasco & Costin)
  - Logging
  - MonALISA
  - InfluxDB
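The two library features that are easiest to misread here, derived metrics and cumulative backends, can be sketched as follows. This is an illustration under stated assumptions, not the actual Monitoring library: the `Collector` class, its method names and the `Rate` suffix for derived metrics are hypothetical. Every registered backend receives every value, and a rate is derived from the two most recent samples of a metric.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# A backend is any callable taking (name, value, timestamp in seconds).
BackendFn = Callable[[str, float, float], None]


class Collector:
    """Hypothetical collector that fans values out to all backends."""

    def __init__(self) -> None:
        self.backends: List[BackendFn] = []
        self._last: Dict[str, Tuple[float, float]] = {}

    def add_backend(self, backend: BackendFn) -> None:
        self.backends.append(backend)

    def send(self, name: str, value: float,
             timestamp: Optional[float] = None) -> None:
        """Send one value to every registered backend (cumulative)."""
        ts = time.time() if timestamp is None else timestamp
        for backend in self.backends:
            backend(name, value, ts)

    def send_with_rate(self, name: str, value: float,
                       timestamp: Optional[float] = None) -> None:
        """Send the value and, from the second sample on, its rate."""
        ts = time.time() if timestamp is None else timestamp
        self.send(name, value, ts)
        previous = self._last.get(name)
        self._last[name] = (value, ts)
        if previous is not None and ts > previous[1]:
            rate = (value - previous[0]) / (ts - previous[1])
            self.send(name + "Rate", rate, ts)
```

For instance, sending a counter value of 100 at t = 10 s and 300 at t = 12 s yields a derived rate of 100 per second, delivered to every backend alongside the raw values.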

Monitoring Status – MonALISA
- Sensors for system monitoring
- ApMon for application/process monitoring
- MonALISA Service for transport/aggregation/processing
- MonALISA Repository for historical record
- Extensive experience in ALICE
  - Running in ALICE Offline for 10 years
    - 131 Services: 8M active parameters from 70K running jobs, 130 kHz of collected data
    - Central Repository instance: 2M actively tracked parameters, ~100K are persistently stored
    - 200K dynamic pages per day served to users
  - Running in ALICE DAQ for 2 years (MAD)
    - 1 Service, 2 kHz of monitoring data, near-real-time display to shift crew

Monitoring Status – IT setup
- Pipeline stages: Collect, Transport, Process, Visualize
- Components shown: Lemon sensors (legacy), Apache Flume, Apache Spark, Hadoop, Elastic Search, Kibana
- Possible future components: collectd?, influxdb?, grafana?

Monitoring Status: IT setup
- collectd
  - Collects system and services metrics
  - More than 100 available plugins (sensors): cpu, mem, apache, mysql, oracle, ipmi, nfs, …
  - Can write to many backends: carbon, csv, rrd, graphite, http, kafka, mongodb, redis, network, …
- Apache Flume
  - A bit like MonALISA Service (also Java-based)
  - Allows data to be processed, aggregated and transported
  - Source: consumes events from external source
  - Channel: data store (Memory, File, JDBC, …, custom)
  - Sink: writes to external target (HDFS, elasticSearch, …)
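The Source / Channel / Sink split listed for Apache Flume can be pictured with a toy model. This is not Flume code or its API (Flume is Java-based and configured through properties files); the `run_agent` function and the event dictionaries are hypothetical and only show how a channel decouples the producer of events from their consumer.

```python
import queue
from typing import Callable, Iterable


def run_agent(source: Iterable[dict],
              channel: "queue.Queue[dict]",
              sink: Callable[[dict], None]) -> int:
    """Move events from a source to a sink through a buffering channel.

    The channel is assumed to be unbounded, so put() never blocks.
    Returns the number of events delivered to the sink.
    """
    # Source side: consume events from the external source into the channel.
    for event in source:
        channel.put(event)

    # Sink side: drain the channel and write each event to the target.
    delivered = 0
    while not channel.empty():
        sink(channel.get())
        delivered += 1
    return delivered


if __name__ == "__main__":
    events = [{"host": "node1", "cpu": 0.42}, {"host": "node2", "cpu": 0.17}]
    written = []
    count = run_agent(events, queue.Queue(), written.append)
    print(count, written)
```

In a real agent the source and sink run concurrently, and the choice of channel (memory, file, JDBC) determines how many events survive a crash; the sketch runs them one after the other for simplicity.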

Monitoring Status: IT setup
- influxdb
  - Time series database
  - Input data from UDP, HTTP, collectd, graphite, …
  - Dashboards via Chronograf (same company), grafana
- Grafana
  - Dashboard generation
  - Supports multiple data sources: graphite, elasticsearch, influxdb, opentsdb, …
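Data written to InfluxDB over UDP or HTTP is encoded in its text-based line protocol (`measurement,tags fields timestamp`). The sketch below formats one point in that protocol; the helper names `to_line` and `_format_value` and the sample measurement are assumptions for illustration, and transport (socket or HTTP client) is deliberately left out.

```python
from typing import Dict, Optional, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character of `chars` found in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_value(value: FieldValue) -> str:
    # bool must be tested before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value) + "i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted, with backslash and quote escaped.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_line(measurement: str,
            tags: Dict[str, str],
            fields: Dict[str, FieldValue],
            timestamp_ns: Optional[int] = None) -> str:
    """Format one point as a line-protocol string."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are recommended
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(_escape(k, ",= ") + "=" + _format_value(v)
                    for k, v in fields.items())
    line = head + " " + body
    if timestamp_ns is not None:
        line += " " + str(timestamp_ns)
    return line
```

For example, `to_line("cpu", {"host": "flp 1"}, {"load": 0.5, "procs": 12}, 1468368000000000000)` produces `cpu,host=flp\ 1 load=0.5,procs=12i 1468368000000000000`, which could then be sent as a UDP datagram or as the body of an HTTP write request.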

Services running in O2 development cluster
- Cluster (Uli)
  - Located in basement of building 4 (DAQ lab)
  - 2 web servers, 1 DB server, 1 Monitoring server, 4 10G servers, 4 40G servers, 2 FLP servers
  - Coming soon: 1 Control server, 1 Configuration server, GPN connectivity
  - ~45 older machines can be used for larger scale tests
- Monitoring: MonALISA (Costin)
  - sensors installed on all nodes
  - service and repository running on mon server
- Monitoring: collectd + influxdb + grafana (Vasco)
  - collectd running on all nodes (cpu + mem + network)
  - influxdb running on mon server
  - grafana running on web server, available here
- Configuration: etcd (Pascal)
  - etcd server running on 10G server (will be moved to conf server)
  - etcd-browser running on web server, available here

Next steps - Control
- Continue tests with FairMQ, DDS, Zookeeper
- Continue tests with Mesos & Co

Next steps - Configuration
- Move Configuration library to AliceO2 repo
- Set up community service reachable from outside CERN

Key       Parent   Summary
OCONF-3            Benchmark etcd backend
OCONF-6   OCONF-3  Benchmark with "multiple tree" data structure
OCONF-7   OCONF-3  Benchmark multiple etcd servers running on same physical node
OCONF-8   OCONF-3  Benchmark with authentication ON
OCONF-9   OCONF-3  Benchmark etcd proxy
OCONF-13  OCONF-3  Benchmark etcd in linearized read mode
OCONF-5            Explore Consul as potential backend for Configuration system
OCONF-12           Benchmark file based backend with shared file system
OCONF-14           Create class to interface with MySQL backend
OCONF-15           Benchmark MySQL backend
OCONF-17           Benchmark with TPCC configuration data

Next steps - Monitoring
- Move Monitoring library to AliceO2 repo
- Set up community service reachable from outside CERN

Key      Summary
OMON-22  Process's heartbeat
OMON-8   Explore Grafana as dashboard
OMON-7   Explore CollectD as potential tool for metrics collection
OMON-2   Explore Apache Flume as potential tool for data aggregation
OMON-6   Explore statsD as potential tool for metrics collection
OMON-3   Explore InfluxDB as potential repository for monitoring data

Conclusion
- Work is ramping up
- Configuration and Monitoring libraries ready to be moved to AliceO2 repo
- Control a bit behind but some progress made

JIRA Issue  Summary                                          Estimated due date
AOGM-1      CWG10 - CCM Control Library                      2016-03-31
AOGM-4      CWG10 - CCM Monitoring Library
AOGM-7      CWG10 - CCM Configuration Library
AOGM-2      CWG10 - CCM Control Prototype                    2016-06-30
AOGM-5      CWG10 - CCM Monitoring Prototype
AOGM-8      CWG10 - CCM Configuration System Prototype
AOGM-3      CWG10 - CCM Control Release V1.0                 2016-12-21
AOGM-6      CWG10 - CCM Monitoring System Release V1.0
AOGM-9      CWG10 - CCM Configuration System Release V1.0