Evolution of Tools for WLCG Operations
Julia Andreeva, CERN IT
NEC 2017, Budva, 26.09.2017
WLCG Operations. Introduction.
Over the last years we have accumulated a lot of experience in operating a huge heterogeneous infrastructure, yet new challenges are ahead of us.
The amount of required resources, in particular for the High Luminosity LHC, is growing, while the dedicated effort is rather decreasing.
Technology evolution brings new kinds of resources such as HPC, commercial clouds and volunteer computing. These resources are heterogeneous and more dynamic than classical Grid resources. How can they be integrated transparently?
WLCG relies on resources provided by several Grid infrastructures (OSG, EGI, NorduGrid) and should work transparently across infrastructure boundaries, in close collaboration with their operations teams.
Operational tasks and corresponding tools
Knowledge of infrastructure topology and configuration: Information System (GocDB, OIM, REBUS, BDII) and experiment-specific topology and configuration systems such as the ATLAS Grid Information System (AGIS)
Testing of the distributed sites and services: Experiments Test Framework (ETF) and Site Availability Monitor (SAM); perfSONAR for the network
Accounting of the performed work and understanding whether MoU obligations are met: accounting systems (APEL, Gratia, the EGI accounting system, the accounting systems of the LHC experiments, REBUS for pledges)
Monitoring of the computing activities: multiple monitoring systems covering all areas of LHC computing (data processing, data transfer, data access, network utilization, etc.)
Problem and request reporting and tracking: Global Grid User Support (GGUS)
Evolution of the WLCG operations tools. Common trends.
Identify areas where things can be simplified and where potential bottlenecks call for modifications, considering new requirements, growing scale and technology evolution.
Turn systems that have proved to work well within a single experiment into common solutions.
Where possible, use open source software.
Information System Evolution
WLCG IS Landscape
Primary sources: GocDB, OIM, REBUS, BDII, experiment-specific topology and configuration systems (AGIS, SiteDB, ...), and others.
Consumers: the experiment workload and data management systems and the WLCG central systems for operations, testing, monitoring and accounting.
Drawbacks and limitations
Multiple information sources, sometimes with contradictory information. There is no single place where data can be integrated, validated and fixed, and no single source providing information to all consumers in a consistent way.
No complete and correct description of storage resources (with protocols and storage shares). Every experiment has to perform this task on its own, so there is no global picture at the WLCG scope.
No clear and common way to integrate non-traditional/non-grid resources (commercial clouds, HPC, etc.). As a rule they are also described mainly in the experiment-specific systems and are again missing in the global WLCG resource description.
Clearly, there is potential to address all these problems in a common way. See the next presentation by Alexey Anisenkov on the Computing Resource Information Catalogue (CRIC).
Testing distributed sites and services
Thanks to Marian Babik for his contribution to this presentation
Experiments Test Framework (ETF)
Measurement middleware: actively checks the status of services, focusing on atomic tests (direct job submission, worker node testing, core storage operations). Generic, based on open source components (OMD, check_mk, Nagios, stompclt). Main source for the monthly WLCG availability/reliability reports. Actively used in WLCG deployment campaigns (HTTP, IPv6, CC7, RFC, HTCondor-CEs, etc.).
ETF core framework: covers configuration, scheduling, API, etc.
Plugins: a wide range of available plugins covering a broad range of services, contributed by experiments, PTs, TFs, etc. (a minimal probe sketch follows below).
Worker Node framework: a micro-framework to run tests on the worker node.
Numbers: test frequency ~10 Hz, 33k tests/hour; 8 clusters: 4 experiments + central instance, IPv6, T0, CC7, pS (@OSG).
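To illustrate the kind of atomic test ETF schedules, here is a minimal sketch of a Nagios-compatible probe in Python. It is not an actual ETF plugin: the endpoint is hypothetical and a real probe would exercise a concrete service operation (job submission, storage get/put/del) rather than a bare TCP connection; only the exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) matches what Nagios-based frameworks such as ETF/check_mk interpret.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Nagios-compatible probe (hypothetical, not an ETF plugin)."""
import socket
import sys
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]


def check_tcp(host, port, warn_s=2.0, timeout_s=10.0):
    """Return (status, message) depending on whether the endpoint accepts connections."""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            elapsed = time.time() - start
    except OSError as exc:
        return CRITICAL, "connection to %s:%d failed: %s" % (host, port, exc)
    if elapsed > warn_s:
        return WARNING, "slow connect to %s:%d (%.2fs)" % (host, port, elapsed)
    return OK, "connected to %s:%d in %.2fs" % (host, port, elapsed)


if __name__ == "__main__":
    # Hypothetical endpoint; a real probe would take host/port from arguments.
    status, message = check_tcp("ce01.example.org", 9619)
    print("%s - %s" % (LABELS[status], message))
    sys.exit(status)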
ETF Architecture
Types of tests supported by ETF:
Remote API tests: simple storage tests (get/put/del), job submission to multiple backends (ARC, HTCondor-CE, CREAM, etc.)
Agent-based testing: local testing on the worker nodes and on remote hosts
Results are published via the message bus to the Dashboards, ElasticSearch, etc. (a publishing sketch follows below).
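Since the slides mention results being published via the message bus (with stompclt among the building blocks), below is a minimal sketch of pushing one test result to an ActiveMQ broker with the stomp.py client. The broker address, topic name, credentials and the result schema are assumptions for illustration; the real ETF message format and destinations differ.

```python
"""Sketch of publishing an ETF-style test result over STOMP (hypothetical names)."""
import json
import time
import stomp

# Simplified, illustrative result record; not the real ETF schema.
result = {
    "host": "ce01.example.org",
    "metric": "org.example.CONDOR-JobSubmit",  # illustrative metric name
    "status": "OK",
    "summary": "job submitted and completed",
    "timestamp": int(time.time()),
    "vo": "atlas",
}

# Hypothetical broker endpoint, credentials and destination.
conn = stomp.Connection(host_and_ports=[("mq.example.org", 61613)])
conn.connect("etf-producer", "secret", wait=True)
conn.send(destination="/topic/etf.results", body=json.dumps(result))
conn.disconnect()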
ETF Roadmap
In the short term, focus on the IPv6 transition: establishing new instances and adding new tests.
Writing a native HTCondor job submission plugin with multi-backend support and putting the new WN micro-framework into production.
Adding new XRootD tests.
Integration with MONIT/Grafana and the new SAM dashboards.
Migration to a fully container-based (Docker) deployment model.
Testing network infrastructure
Thanks to Marian Babik for his contribution to this presentation
WLCG Network Throughput Working Group
The WLCG Network Throughput WG was formed in the fall of 2014 within the scope of WLCG operations, with the following objectives:
Ensure that sites and experiments can better understand and fix networking issues
Measure end-to-end network performance and use the measurements to pinpoint complex data transfer issues
Improve overall transfer efficiency and help determine the current status of our networks
Core activities include:
Deployment and operation of the perfSONAR infrastructure: gain visibility into how our networks operate and perform
Network analytics: improve our ability to fully utilize the existing network capacity
Network performance incidents response team: provide support to help debug and fix network performance issues
More information at https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics
WLCG perfSONAR Deployment
246 active perfSONAR instances, 160 running the latest version (4.0.1)
New experiment-based meshes: CMS, ATLAS, LHCb and BelleII latency/bandwidth meshes
In total: 93 latency services, 115 traceroute services, 102 bwctl/pscheduler services
The initial deployment was coordinated by the WLCG perfSONAR TF; the commissioning of the network is followed up by the WLCG Network and Transfer Metrics WG
A sketch of querying a perfSONAR measurement archive follows below.
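As an illustration of how the perfSONAR measurements can be consumed, the sketch below reads recent throughput results from a measurement archive through the esmond REST API that perfSONAR hosts expose. The host name is hypothetical, error handling is omitted, and the field names follow the esmond conventions as far as I know them, so treat this as a sketch rather than a reference client.

```python
"""Sketch of reading throughput results from a perfSONAR measurement archive."""
import requests

HOST = "https://perfsonar.example.org"          # hypothetical perfSONAR node
MA = HOST + "/esmond/perfsonar/archive/"

# Metadata for throughput measurements recorded in the last 24 hours.
meta = requests.get(MA, params={"event-type": "throughput",
                                "time-range": 86400,
                                "format": "json"}).json()

for entry in meta[:5]:
    for et in entry.get("event-types", []):
        if et["event-type"] != "throughput":
            continue
        # Each event type carries a base-uri pointing at its time series.
        points = requests.get(HOST + et["base-uri"],
                              params={"time-range": 86400,
                                      "format": "json"}).json()
        for p in points:
            # Throughput values are assumed to be in bits per second.
            print(entry["source"], "->", entry["destination"],
                  p["ts"], "%.2f Gbps" % (p["val"] / 1e9))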
The global deployment, the continuous measurements and the advanced processing platform run by OSG make it possible to create novel dashboards and alerts/alarms, as well as to perform complex network analytics.
Network Throughput WG Roadmap
Near term: focus on improving the efficiency of our current network links
Continuing developments in network analytics, integration of further flow/router information and FTS data, alerting/notifications, etc.
Validation and deployment campaign for perfSONAR 4.0.2 and 4.1. perfSONAR 4.1 is planned for mid next year and will drop SL6 support; this will require a major campaign since all sites will need to re-install.
Updates to the central services (configuration, monitoring, collectors, messaging, etc.)
Long term: plan for significant changes
Increased importance of overseeing network capacities; sharing the future capacity will require greater interaction with the networks
While it is unclear which technologies will become mainstream, we know that software will play a major role in the networks of the future
HEP is likely no longer the only domain that will need high throughput
Tracking the evolution of SDNs, where new pilot/demonstrator projects will be proposed
Accounting
CPU accounting
CPU consumption is one of the main metrics used to understand how sites fulfil their MoU obligations.
For CPU accounting WLCG uses the EGI accounting system based on APEL.
The WLCG Accounting Task Force, created in spring 2016, aims to fix multiple problems with CPU accounting (data correctness and the UI). It has:
Reviewed the set of accounting metrics and agreed on those that should be exposed in the accounting portal and included in the monthly accounting reports
Performed data validation and set up an automatic system for data validation (an illustrative cross-check sketch follows below)
Provided feedback on the EGI accounting portal and followed up on portal improvements
Improved the generation of the monthly accounting reports
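The automatic validation mentioned above boils down to cross-checking two independent views of the same consumption. The sketch below compares hypothetical monthly per-site wall-clock summaries from the accounting portal and from an experiment's own system and flags large deviations; the data, the names and the 10% threshold are illustrative assumptions, not the Task Force's actual implementation.

```python
"""Illustrative cross-check of CPU accounting data (hypothetical numbers)."""

# Hypothetical monthly summaries in HS06 wall-clock hours.
apel_summary = {"T1_EXAMPLE": 1.20e7, "T2_EXAMPLE_A": 3.4e6, "T2_EXAMPLE_B": 2.1e6}
experiment_summary = {"T1_EXAMPLE": 1.18e7, "T2_EXAMPLE_A": 4.1e6, "T2_EXAMPLE_B": 2.1e6}

THRESHOLD = 0.10  # flag sites deviating by more than 10% (assumed value)

for site in sorted(set(apel_summary) | set(experiment_summary)):
    apel = apel_summary.get(site, 0.0)
    exp = experiment_summary.get(site, 0.0)
    if max(apel, exp) == 0:
        continue
    deviation = abs(apel - exp) / max(apel, exp)
    if deviation > THRESHOLD:
        print("CHECK %-14s APEL=%.3g exp=%.3g (%.0f%% off)"
              % (site, apel, exp, 100 * deviation))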
Storage Space Accounting
Currently we do not have storage space accounting working at the global WLCG scope.
Existing solutions inside the experiments are mostly based on SRM, which is no longer a mandatory service and will therefore provide only a partial solution in the future.
In technical terms, the aim is to apply a solution that works both for accounting and for operations.
Requirements
After the WLCG accounting review in April 2016, work has been performed to understand the requirements for a global WLCG storage space accounting system. Input from the experiments has been discussed at the WLCG Accounting Task Force meetings and at the Grid Deployment Board, and is summarized in the storage resource reporting proposal.
The main goal of WLCG storage space accounting is to provide a high-level overview of the free and used storage space with the granularity of the VO storage shares at the sites, with a possible split of those shares by areas for dedicated usage.
High-level overview of the used and available space
Total used and total available space for the storage shares used by the experiments (the shares should be accessible by at least one non-SRM protocol)
Number of files, if possible, but not strictly required
Frequency not higher than once per 30 minutes
Accuracy of the order of tens of GB
CLI or API for querying (a hypothetical record and query sketch follows below)
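To make the requirements concrete, here is a hypothetical per-share usage record of the kind a storage system could publish, together with a trivial query over it. The field names, units and layout are assumptions for illustration; the actual storage resource reporting format was still being agreed in the proposal mentioned earlier.

```python
"""Sketch of a per-share storage usage report and a trivial query (hypothetical format)."""
import json

# Example report a storage system might publish (all sizes in bytes).
report = json.loads("""
{
  "storageservice": "se01.example.org",
  "timestamp": 1506412800,
  "shares": [
    {"vo": "atlas", "name": "ATLASDATADISK",
     "totalsize": 3000000000000000, "usedsize": 2250000000000000,
     "numberoffiles": 41000000},
    {"vo": "cms", "name": "CMSDISK",
     "totalsize": 2000000000000000, "usedsize": 900000000000000}
  ]
}
""")


def free_tb(share):
    """Free space of a share in TB (the number of files is optional)."""
    return (share["totalsize"] - share["usedsize"]) / 1e12


for share in report["shares"]:
    print("%s %-14s free=%.1f TB" % (share["vo"], share["name"], free_tb(share)))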
Implementation
The same way of querying storage for free/used space is used for accounting and for operations
CRIC for the description of the storage topology
MONIT infrastructure for the data flow (data transport, data repository, UI/API)
Architecture
Thanks to Dimitrios Christidis for his contribution to this presentation
The architecture diagram shows the WSSA tool taking the topology from CRIC and the usage from the storage systems, feeding the MONIT infrastructure (collection, transport layer, storage, backend), with visualisation in Grafana.
Current Status
Data from ATLAS Rucio, AGIS and SRM queries
Storage in ElasticSearch and InfluxDB
Prototype dashboard
User Interface
Possibility to group/filter by VO, site, country, federation and tier
Gradual transition and validation
Use what is available now and switch to new data sources as they become available; data from SRM is retained
Validation: comparison with the experiment accounting systems, similar to the CPU accounting approach
Monitoring
Thanks to Alberto Aimar for his contribution to this presentation
Monitoring use cases
Data Centres Monitoring: monitoring of the data centres at CERN and Wigner; hardware, operating systems and services; data centre equipment (PDUs, temperature sensors, etc.)
Experiment Dashboards / WLCG Monitoring: site availability, data transfers, job information, reports; used by WLCG, the experiments, sites and users
Both are hosted by CERN IT, in different teams.
Mandate: regroup the monitoring activities hosted by CERN IT (Data Centres, Experiment Dashboards and others), move to external open source technologies, and continue the existing services while conforming with CERN IT practices
Data Centres Monitoring
Experiment Dashboards
Job monitoring, site availability, data management and transfers
Used by the experiments' operations teams, sites, users and WLCG
Unified Monitoring architecture (diagram)
Data sources: data centres monitoring (Collectd/Lemon, XSLS, Metrics Manager), data management and transfers (ATLAS Rucio, FTS servers, DPM servers, XRootD servers), job monitoring (CRAB3, WMAgent, Farmout, Grid Control, CMS Connect, PanDA WMS, GNI, ProdSys), infrastructure monitoring (Nagios, VOFeed, OIM, GOCDB, REBUS)
Transport: Flume, AMQ, Kafka
Processing and aggregation: Spark, Hadoop jobs
Storage and search: Hadoop HDFS, ElasticSearch, InfluxDB
Data access: Jupyter (SWAN), Zeppelin, Grafana, Kibana
Monitoring strategy and implementation
Regroup the monitoring activities: Data Centres, WLCG and Experiment Dashboards, ETF, HammerCloud
Follow CERN IT practices for the management of services and tools (e.g. GGUS, SNOW)
Adopt only established open source technologies:
Input as JSON, sources based on Collectd, Flume, JDBC, ActiveMQ, HTTP
Transport and data buffering based on Kafka (500 GB / 72 h)
Processing of data streams based on Kafka and Spark (a minimal sketch follows below)
Storage based on ElasticSearch (30 days, raw), InfluxDB (5 years, aggregated), HDFS (forever, raw)
Dashboards based on Grafana and Kibana
Reports/notebooks on Zeppelin and analytics based on HDFS and Spark
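As a rough illustration of the "Kafka plus Spark" processing step, the sketch below uses Spark Structured Streaming to read JSON transfer records from a Kafka topic and aggregate transferred bytes per destination site per minute. The broker address, topic name and record schema are assumptions; the production MONIT jobs are of course more elaborate.

```python
"""Sketch of a Kafka + Spark Structured Streaming aggregation (hypothetical names)."""
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import LongType, StringType, StructField, StructType, TimestampType

# Requires the spark-sql-kafka package on the Spark classpath.
spark = SparkSession.builder.appName("monit-transfer-aggregation").getOrCreate()

# Assumed record schema for transfer events.
schema = StructType([
    StructField("src_site", StringType()),
    StructField("dst_site", StringType()),
    StructField("bytes", LongType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "monit-kafka.example.org:9092")
       .option("subscribe", "transfers")            # hypothetical topic name
       .load())

# Kafka values arrive as bytes; parse them as JSON into typed columns.
records = (raw.select(from_json(col("value").cast("string"), schema).alias("r"))
              .select("r.*"))

# Aggregate transferred bytes per destination site in one-minute windows.
per_site = (records
            .withWatermark("event_time", "10 minutes")
            .groupBy(window(col("event_time"), "1 minute"), col("dst_site"))
            .sum("bytes"))

# In production the result would go back to Kafka / ElasticSearch / InfluxDB;
# the console sink keeps the sketch self-contained.
query = per_site.writeStream.outputMode("update").format("console").start()
query.awaitTermination()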
Summary
The WLCG tools for operations are constantly evolving, following technology evolution and the computing requirements of the High Luminosity LHC.
We need to be flexible and agile in order to integrate new types of resources transparently.
Where possible, use open source solutions.
Analyze the accumulated experience, both at the global scope and inside the experiments, and decide which tools and concepts that proved effective inside a single experiment can be extended to the WLCG scope.
Backup slide. Unified Monitoring Architecture (diagram)
Data sources (FTS, Rucio DB, XRootD, jobs, user data, logs, Lemon/syslog metrics, application logs, ...) are collected by Flume gateways (AMQ, DB, HTTP, Log GW, Metric GW) and sent through Flume sinks with a Kafka sink into the Kafka cluster, which provides buffering.
Processing of the streams covers data enrichment, data aggregation and batch processing.
Storage and search: HDFS, ElasticSearch and others (InfluxDB).
Data access: user views, HTTP feed, CLI, API.
Today: > 500 GB/day, 72 h buffering.