1 WLCG Monitoring Consolidation
NEC`2013, Varna
Julia Andreeva, CERN IT-SDC

2 Outline
- Motivation and goals
- Organization of work
- Progress so far
- JINR contribution
- Conclusions

13-Sep-2013, NEC`2013, Varna, Julia Andreeva, CERN

3 Motivation
- A substantial part of the manpower involved in monitoring the OSG/WLCG infrastructure was covered by EGI/InSPIRE funding, which is coming to its end, implying a considerable reduction of manpower.
- The system put in place for monitoring the EGI infrastructure was designed for a fully distributed operational model with strong contributions from the regional teams, while the WLCG operational model is more centralized. Re-scoping the monitoring system to the WLCG infrastructure creates an opportunity for simplification.
- The WLCG operational model implies a major contribution from the LHC experiments. The WLCG monitoring system should therefore be flexible enough to be easily adapted to the needs of the LHC experiments and should be in line with their operational modes.

4 Goal of the project
- The goal of the project is the consolidation of WLCG monitoring. This includes a critical analysis of:
  - what is monitored
  - the technology used
  - the deployment and support model
- This should make it possible to:
  - reduce the complexity of the system
  - ensure simplified and more effective operations, support and service management
  - encourage an efficient deployment strategy, with a common development process
  - unify, where possible, the implementation of the monitoring components
- Wherever reasonable, the effort should be aligned with the activities of the Agile Infrastructure Monitoring team at CERN.
- As a result, by autumn 2014 the monitoring services should be operated and modified (when required) by a team at least half the size of the spring 2013 one.

5 Organization of work (1)
- The project is split into two stages.
- The first stage, to be accomplished by the end of September:
  - Review the current systems and metrics used for WLCG monitoring
  - Collect and summarize requirements from the LHC computing community to understand the changes needed in monitoring
  - Highlight areas that are expected to be problematic
  - Propose a revised strategy and architecture for WLCG monitoring
  - Suggest implementation technologies and approaches for the transition

6 Organization of work (2)
- Second stage: implementation of the new monitoring framework and the components needed for the transition to the new toolchain.
  - A first prototype should be ready by the end of 2013. All monitoring should have transitioned to the new approach by summer 2014.
- The working group, led by Pablo Saiz (CERN IT-SDC-MI), consists of representatives of:
  - the WLCG monitoring development teams (SAM and Dashboard)
  - the Agile monitoring infrastructure development team
  - the 4 LHC experiments
  - the WLCG operations team
- Sites are represented by the members of the WLCG operations team.
- Small task forces on dedicated subjects.
- Mailing list: wlcg-mon-consolidation@cern.ch
- Fortnightly meetings with summary reports: http://go.cern.ch/6XQQ

7 Review of the current systems used for WLCG monitoring
- The monitoring applications currently provided by the WLCG monitoring team can be split into four groups:
  - monitoring and accounting of the job processing activity
  - monitoring and accounting of data access and data transfers
  - remote testing and monitoring of the distributed infrastructure
  - tools for dissemination purposes
- Both experiments and sites were asked to provide input regarding the usefulness of the existing applications, possible improvements and missing functionality (if any). Input from sites was provided through the WLCG operations team, which sent a questionnaire to the sites. The review did not include VO-specific monitoring systems developed inside the experiments.
- According to the collected input, SAM and most of the Dashboard applications are considered useful or essential by the experiments (at least ATLAS and CMS) and by the WLCG operations team. No important gaps in the monitoring functionality were identified.
- A detailed list of applications and their usage, plus an architecture and technology review, is included in the backup slides of this presentation.

8 How to move forward
- Stop support of applications not used by the community
- Reduce scope
- Unify implementation
- Evaluate and apply new technologies

9 Reduce scope
- Re-scoping to WLCG will make it possible to:
  - reduce the complexity of our services
  - concentrate on things required by the experiments (for example, no need for OPS tests; availability reports are based on the experiment tests)
  - decrease the number of service instances we have to support
  - pass responsibility for the SAM central services used for the EGI infrastructure to Greece/Croatia
- To be followed up next week at the EGI Technical Forum

10 Unifying implementations (infrastructure testing and monitoring as an example)

11 Infrastructure testing and monitoring
- Most of the effort of the WLCG monitoring team is currently dedicated to infrastructure testing and monitoring.
- Historically there are multiple implementations. This group of applications includes most of the applications and services we currently support.
- This area was identified as the main one where we can save the most effort and do things in a more efficient way.

12 Where we are now
[Architecture diagram: three different colours, three different systems. (1) SSB: experiment-defined metrics, experiment-specific processing, SSB web service. (2) SAM: Nagios running SAM probes, a message queue, MRS, ACE, POEM, ATP, a web service, and the SUM and MyWLCG web interfaces. (3) Hammer Cloud: Hammer Cloud jobs, server and DB.]
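In the SAM chain above, test results travel from the Nagios probes to the metric repositories over a message queue as serialized records. A minimal sketch of what building and serializing such a test-result message could look like (the field names here are illustrative, not the actual SAM message schema):

```python
import json
import time

def make_test_result(site, service_flavour, metric, status, detail=""):
    """Build a hypothetical test-result record, roughly in the spirit of
    the messages SAM/Nagios publish to the message queue (field names
    are illustrative, not the real schema)."""
    return {
        "site": site,
        "service_flavour": service_flavour,
        "metric": metric,
        "status": status,          # e.g. OK / WARNING / CRITICAL
        "timestamp": int(time.time()),
        "detail": detail,
    }

def serialize(record):
    """Messages on the wire are plain text; JSON is used here."""
    return json.dumps(record, sort_keys=True)

record = make_test_result("CERN-PROD", "CREAM-CE", "org.sam.CE-JobSubmit", "OK")
message = serialize(record)
```

Decoupling producers (probes) from consumers (metric stores) through such a queue is what lets several repositories subscribe to the same test results.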

13 Drawbacks of the current state
- A variety of schemas for metric repositories
- A variety of processing engines
- Different implementations of data collectors
- Different implementations of APIs and visualization layers
- Not all components provide enough flexibility for introducing new metrics, combining metrics, or integrating custom processing algorithms
- While functional testing (Nagios) and stress testing (Hammer Cloud) may by their nature dictate different implementations of the test submission mechanism, the rest of the chain looks very similar: collecting metrics, processing them (with a generic or custom algorithm), and providing access to the data through APIs and UIs. Ideally, a common workflow should be applied to any kind of monitoring data.
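The common chain described above (collect metrics, process them with a generic or custom algorithm, expose the result) can be sketched in a few lines. All names here are illustrative, not an existing WLCG API:

```python
# Sketch of the common monitoring chain: collect, process, expose.

def collect(raw_records):
    """Collector: normalise incoming records into (site, status) metrics."""
    return [(r["site"], r["status"]) for r in raw_records]

def process(metrics, algorithm):
    """Processing engine: apply a generic or custom algorithm per site."""
    by_site = {}
    for site, status in metrics:
        by_site.setdefault(site, []).append(status)
    return {site: algorithm(statuses) for site, statuses in by_site.items()}

def worst_status(statuses):
    """A generic algorithm: the worst observed status wins."""
    order = {"OK": 0, "WARNING": 1, "CRITICAL": 2}
    return max(statuses, key=lambda s: order[s])

raw = [{"site": "SiteA", "status": "OK"},
       {"site": "SiteA", "status": "CRITICAL"},
       {"site": "SiteB", "status": "OK"}]
summary = process(collect(raw), worst_status)   # what an API/UI would serve
```

The point of the design is that `algorithm` is pluggable: a custom, experiment-specific function can replace the generic one without touching the collector or the API layer.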

14 Should follow the real experiment workflows
- Does our testing and monitoring check things that would provide a realistic view of whether the infrastructure can be used effectively by the experiments?
  - SAM tests are submitted through WMS, which the LHC experiments do not use (and do not intend to use in the near future) for their workflows
  - SAM tests are fairly static and in general check only the basic functionality of the services
  - Not all experiments intend to use SAM/Nagios in the future (ALICE)
- One possibility would be to perform tests inside the experiment pilots
- All experiments would like to be able to provide their own metrics to be taken into account when evaluating site usability and performance
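Running tests inside the experiment pilots, as suggested above, could look roughly like this: the pilot runs a few lightweight probes on the worker node before (or alongside) the payload and publishes their results. The probe names and report format below are hypothetical:

```python
import os
import shutil

def probe_scratch_space(path="/tmp", min_free_bytes=1 << 30):
    """Check that the worker node has some free scratch space (1 GB here)."""
    free = shutil.disk_usage(path).free
    return "OK" if free >= min_free_bytes else "CRITICAL"

def probe_environment(required_vars=("HOME",)):
    """Check that expected environment variables are set."""
    missing = [v for v in required_vars if v not in os.environ]
    return "OK" if not missing else "CRITICAL"

def run_pilot_probes():
    """Run all probes and return the report the pilot would publish
    to the monitoring system alongside its payload results."""
    return {
        "scratch_space": probe_scratch_space(),
        "environment": probe_environment(),
    }

report = run_pilot_probes()
```

Because the probes run in exactly the environment the experiment's jobs see (same submission path, same worker node), their results reflect real usability rather than the generic service checks WMS-submitted SAM tests provide.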

15 Site Status Board (SSB) as a generic solution for a metric store
- SSB looks generic enough to be used as a common metric store:
  - provides access to snapshot and historical data
  - provides data compaction and archiving
  - has a built-in mechanism for calculating the status of a given instance over a given time range (similar to availability calculation)
  - has the necessary concepts for combining metrics of interest for a particular task (SSB views)
  - was successfully used for operations by ATLAS and CMS
- Having a common implementation of the metric store would:
  - greatly simplify the current architecture
  - make it easy to evaluate and apply new technologies (for data storage and processing)
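The built-in status calculation mentioned above boils down to: given a time-ordered sequence of status changes for an instance, compute the fraction of a time window spent in an OK state. A simplified sketch of such an availability computation (not the actual SSB code):

```python
def availability(transitions, window_start, window_end, ok_states=("OK",)):
    """Fraction of [window_start, window_end) spent in an OK state.

    `transitions` is a time-ordered list of (timestamp, status) pairs;
    each status holds until the next transition. Simplified sketch,
    not the actual SSB algorithm.
    """
    total = window_end - window_start
    if total <= 0:
        return 0.0
    ok_time = 0
    for i, (t, status) in enumerate(transitions):
        t_next = transitions[i + 1][0] if i + 1 < len(transitions) else window_end
        # Clip each status interval to the requested window.
        start = max(t, window_start)
        end = min(t_next, window_end)
        if status in ok_states and end > start:
            ok_time += end - start
    return ok_time / total

# Instance was OK from t=0, DOWN from t=60, OK again from t=90:
# 90 of 120 time units OK, i.e. availability 0.75.
avail = availability([(0, "OK"), (60, "DOWN"), (90, "OK")], 0, 120)
```

Which statuses count as "OK" is a parameter, which is how the same mechanism can serve different experiment-defined notions of site usability.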

16 Where we would like to get
[Target architecture diagram: SAM Nagios tests, Hammer Cloud, metrics defined by the experiments, pledges and downtime information all feed, via a common transport layer, into an SSB-like metric store with a built-in processing engine; an external processing engine, topology and profile descriptions, and a web service with a UI sit on top.]

17 Current status
- The prototype is being deployed:
  - SSB instance deployed
  - being filled with SAM production test results
  - working on aggregation (availability calculation)
- SSB web interface ready:
  - summary page, current status, historical plots, ranking plots, 'readiness' plots

18 ... and the current metrics are already published

19 Evaluation of new technologies
- Following the recommendations of the AI monitoring team, Elasticsearch was evaluated for several WLCG monitoring applications, including SSB

20 WLCG Transfer Dashboard prototype using Elasticsearch as a storage solution
- For details on the other layers, see: http://cern.ch/go/pw7
- See backup slide # for the conclusions of the current evaluation

21 Packaging and deployment
- Using OpenStack, Puppet, Hiera, Foreman
- Quota of 100 nodes, 240 cores
- Multiple templates already created:
  - development machine (9 nodes)
  - web servers (SSB, SUM, WLCG transfers, Job: 20 nodes)
  - Elasticsearch (6 nodes), Hadoop (4 nodes)
- Currently working on the Nagios installation
- Migrating machines from Quattor to AI
- Koji and Bamboo for the build system and continuous integration

22 JINR contribution
- Russian institutions, in particular JINR, are actively participating in the WLCG monitoring work, including the WLCG monitoring consolidation project.
- Part of this work is performed in the framework of the CERN IT-Russia collaboration program (WLCG monitoring task).
- JINR team: Vladimir Korenkov (leading the activity from the Russian side), Sergey Belov, Ivan Kadochnikov, Sergey Mitsin, Elena Tikhonenko, and students from Dubna university.
- JINR colleagues contribute to the development of the WLCG Transfer Dashboard, the xrootd Dashboards for ATLAS and CMS, and the evaluation of NoSQL technologies for storage and processing of WLCG monitoring data.

23 Conclusions
- The first phase of the project, consisting of the assessment of the current monitoring systems, is accomplished.
- A revised strategy and architecture for WLCG monitoring is being defined.
- We are working in very close collaboration with the experiments, operations, and the AI monitoring team.
- By spring 2014 the transition to the new toolchain should be performed.
- As a result, by autumn 2014 the monitoring services should be operated and modified (when required) by a team at least half the size of the spring 2013 one.

24 Backup slides

25 Overview of the current monitoring applications (usage) (1)

26 Overview of the current monitoring applications (usage) (2)

27 Review of the current systems used for WLCG monitoring (architecture)
- Regarding architecture, the Dashboard applications are all very similar and are developed in a common framework.
- For the SAM system, one of the main limitations reported in the review was that the various components of the system do not always communicate through APIs but instead share tables in the DB, which creates hidden dependencies and prevents modifying any component independently of the complete chain.
- In general, all applications have common functional blocks such as data publishers, a transport layer, data collectors, a data repository, data processing engines, a web service and UIs. SAM in addition has a test submission framework based on Nagios.

28 Review of the current systems used for WLCG monitoring (technology)

29 Conclusions of the Elasticsearch evaluation for the WLCG Transfer Dashboard
- Elasticsearch 0.90.3 does not support grouping by the terms of multiple fields for statistical aggregations.
- Using Elasticsearch 0.90.3 for WLCG Transfer Monitoring, we could achieve performance similar to second-hit (i.e. cached) Oracle performance, but this requires diverse workarounds for multi-field grouping.
- Elasticsearch 1.0 includes a new Aggregation Module that will support grouping by the terms of multiple fields for statistical aggregations.
- Recommendation: re-evaluate Elasticsearch for WLCG Transfer Monitoring when 1.0 is available.
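A common workaround for the multi-field grouping limitation mentioned above is to index a precomputed composite field (e.g. "source|destination") and group on that single field. The idea is sketched here in plain Python, independent of any Elasticsearch client API; the document fields are illustrative:

```python
# Elasticsearch 0.90 could only facet on a single field, so a typical
# workaround was to precompute a composite key at indexing time and
# run the single-field aggregation over it.

def add_composite_key(doc, fields, sep="|"):
    """Precompute the composite field when the document is indexed."""
    doc = dict(doc)
    doc["composite"] = sep.join(str(doc[f]) for f in fields)
    return doc

def group_sum(docs, value_field):
    """Statistical aggregation (a sum here) grouped by the composite key."""
    totals = {}
    for doc in docs:
        totals[doc["composite"]] = totals.get(doc["composite"], 0) + doc[value_field]
    return totals

transfers = [
    {"src": "CERN", "dst": "JINR", "bytes": 100},
    {"src": "CERN", "dst": "JINR", "bytes": 50},
    {"src": "CERN", "dst": "BNL", "bytes": 70},
]
indexed = [add_composite_key(t, ("src", "dst")) for t in transfers]
totals = group_sum(indexed, "bytes")   # {"CERN|JINR": 150, "CERN|BNL": 70}
```

The cost of the workaround is index-time denormalisation: every grouping combination needed by the UI must be anticipated and stored, which is exactly what the Elasticsearch 1.0 aggregation module removes the need for.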

