Monitoring: Present and Future
Pedro Andrade (CERN IT)
31st August 2012, GridKa School, Karlsruhe, Germany

Outline
– What's / Why Monitoring?
– Terminology and Architecture
– Technologies and Tools
– WLCG and CERN Examples
– Current Problems & Future Solutions

What's Monitoring?
"Observe and check the progress or quality of something over a period of time." (Google definition)
"Capability to execute continuous observance and analysis of the operational state of systems and provide decision support regarding situational awareness and deviations from expectations." (US National Institute of Standards and Technology)

Why Monitoring?
– It's a core IT function: the value of a service to an organization is proportional to the availability of that service.
– It's powerful: understanding and using monitoring data correctly gives a competitive advantage.
– It's fun: many people enjoy looking at nice dashboards and graphs of service status and availability.

What's the challenge?
– Define the monitoring scope and coverage
– Understand the service dependencies
– Define the correct monitoring toolchain
– Define what to do with monitoring data
– Do not forget the evolving infrastructure

Terminology
Many tools on the market… vast terminology! Recurring terms include: metric, probe, correlation, analytics, collection, event, dashboards, resource, context.

Terminology
Terms must be categorized:
– Monitored Elements: Services, Nodes, Network, Data, …
– Monitoring Code: Probes, Tests, Sensors, Agents, …
– Monitoring Tasks: Aggregation, Collection, Reporting, Probation, …
– Monitoring Outputs: Notifications, Alarms, Dashboards, Reports, …

Terminology
– Probe: test code to check the status of a given attribute
– Metric: a probe execution result (numeric/boolean data point)
– Context: metadata about a metric
– Event: a metric combined with its context
– Notification: the result of processing an event
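To make the terminology concrete, here is a minimal sketch (not from the slides) of how these terms can map to simple data structures; all class names, fields, and the 0.9 threshold are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Metric:
    name: str           # e.g. "cpu_load"
    value: float        # probe execution result (numeric/boolean data point)
    timestamp: datetime

@dataclass
class Context:
    host: str           # metadata about the metric: where/how it was produced
    service: str
    probe: str

@dataclass
class Event:
    metric: Metric      # an event is a metric combined with its context
    context: Context

def notify(event: Event) -> str:
    """A notification is the result of processing an event (here a threshold check)."""
    if event.metric.value > 0.9:
        return f"ALARM: {event.context.service} on {event.context.host} high {event.metric.name}"
    return "OK"

event = Event(Metric("cpu_load", 0.95, datetime.now(timezone.utc)),
              Context("node01.example.org", "batch", "check_cpu"))
print(notify(event))
```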

Architecture
– Probation: Probes, Tests, Sensors, Active/Passive, Continuous
– Aggregation: Collection, Synchronization, Repository, Catalog
– Computation: Trending, Filtering, Grouping, Tagging, Correlation
– Presentation: Alarms, Notifications, Reports, Dashboards, APIs
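A minimal sketch of the four layers above as a processing pipeline, assuming an in-memory flow from probing to presentation; the function names and sample data are illustrative, not part of the original architecture.

```python
from statistics import mean

def probe_layer():
    """Probing: active/passive sensors produce raw metric samples."""
    return [("cpu_load", 0.42), ("cpu_load", 0.55), ("cpu_load", 0.91)]

def aggregation_layer(samples):
    """Aggregation: collect samples into a repository keyed by metric name."""
    repo = {}
    for name, value in samples:
        repo.setdefault(name, []).append(value)
    return repo

def computation_layer(repo):
    """Computation: trending/filtering/grouping, here a simple average per metric."""
    return {name: mean(values) for name, values in repo.items()}

def presentation_layer(results):
    """Presentation: dashboards, reports, and alarms built on the computed results."""
    for name, value in results.items():
        print(f"{name}: {value:.2f}")

presentation_layer(computation_layer(aggregation_layer(probe_layer())))
```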

– Monitoring data is the core: data transport, data storage, data format
– Probes should be kept as simple as possible: clear focus with simple computing logic
– Scalability can be addressed in different ways: horizontal scaling, or adding other layers (pre-aggregation, pre-processing)
– Different tools for different layers of the architecture: many tools available on the market; each individual tool must be easily replaceable; based on standard protocols
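As an illustration of keeping probes simple, here is a minimal sketch of a probe that performs one focused check and emits a single result; the host, port, and output format are assumptions.

```python
import json
import socket
import time

def check_tcp(host: str, port: int, timeout: float = 3.0) -> dict:
    """Return a single metric: can we open a TCP connection to the service?"""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            status, value = "OK", time.time() - start   # connect latency in seconds
    except OSError:
        status, value = "CRITICAL", None
    return {"metric": "tcp_connect", "host": host, "port": port,
            "status": status, "value": value, "timestamp": time.time()}

if __name__ == "__main__":
    # The probe only measures and reports; transport, storage, and correlation
    # are left to the other layers of the toolchain.
    print(json.dumps(check_tcp("www.example.org", 80)))
```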

Technologies and Tools
– Monitoring frameworks are the core technology: configure the environment you want to monitor, schedule the execution of probes, provide some degree of reporting/notification; several types of solutions are available
– More complex scenarios may require other technology:
  – Messaging for data transport and data aggregation: ActiveMQ, Apollo, RabbitMQ, etc.
  – NoSQL for data analysis and data storage: Hadoop, HBase, Cassandra, etc.
– Many tools are available for data analysis/presentation
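A hedged sketch of using messaging for data transport, here with the stomp.py client publishing a metric to an ActiveMQ-style broker over STOMP; the broker address, credentials, and destination name are assumptions.

```python
import json
import time
import stomp  # pip install stomp.py

metric = {"metric": "cpu_load", "host": "node01.example.org",
          "value": 0.42, "timestamp": time.time()}

conn = stomp.Connection([("broker.example.org", 61613)])
conn.connect("monitor", "secret", wait=True)
# Publishing to a topic decouples producers (probes) from consumers
# (aggregation, storage, dashboards), which can then scale independently.
conn.send(destination="/topic/monitoring.metrics", body=json.dumps(metric))
conn.disconnect()
```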

Technologies and Tools (slide content not transcribed)

Technologies and Tools
Examples of SaaS and open source tools: New Relic, Librato Metrics, Pingdom, PagerDuty, Splunk, Graphite, Statsd, Riemann, Logstash, …

WLCG
The Worldwide LHC Computing Grid (WLCG) is a distributed infrastructure composed of 152 sites and more than 320,000 logical CPUs serving the computing needs of the WLCG VOs. It brings together resources provided by the EGI and OSG infrastructures. It is a multi-organization infrastructure with few services.
Monitoring is based on the SAM system:
– SAM-Nagios instances testing sites and services (for VOs/NGIs)
– SAM-Gridmon aggregating and computing monitoring data
– WLCG Dashboards providing dedicated portals to the VOs

WLCG
[Data-flow diagram: NGI SAM-Nagios instances (e.g. Italy, France), VO SAM-Nagios instances (e.g. ATLAS), sites (Site A: Italy, CMS; Site B: France, CMS; Site C: France, ATLAS) and external sources (EGI, OSG) publish over ActiveMQ to SAM-Gridmon, which feeds the WLCG Dashboards and MYWLCG.]

WLCG
Some statistics:
– 236 probes for operations and VO checks
– more than 4,000 service endpoints being tested
– 41 NGI SAM-Nagios instances + 4 WLCG VO SAM-Nagios instances
– 500k metric results processed per day
Key features:
– Based on open source systems: Nagios, ActiveMQ
– Distributed responsibility for test execution: NGIs, VOs
– Centralized computation of status and availability
– Supports integration of 3rd party monitoring systems
– Provides a REST web API to consume monitoring data
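To illustrate what a centralized status/availability computation involves, here is a minimal sketch that derives an availability figure from status samples; it is an illustration only, not the actual SAM availability algorithm.

```python
def availability(samples):
    """samples: list of (timestamp, status) with status in {"OK", "WARNING", "CRITICAL"}.
    Availability here is simply the fraction of samples in which the service was OK."""
    if not samples:
        return None
    ok = sum(1 for _, status in samples if status == "OK")
    return ok / len(samples)

# 20 hours OK, 4 hours CRITICAL over one day of hourly checks
day = [(t, "OK") for t in range(0, 20)] + [(t, "CRITICAL") for t in range(20, 24)]
print(f"availability: {availability(day):.1%}")   # 83.3%
```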

CERN
The CERN Computer Centre houses servers and data storage systems for the WLCG Tier-0, physics analysis, and other CERN services. It currently provides 30 PB of disk storage and 65,000 cores (plus 20,000 cores and 5.5 PB of storage from the new Computer Centre in Budapest). It is a single-organization infrastructure with many services.
Monitoring is based on the Lemon system plus many application-specific tools. A new monitoring architecture is being defined under the Agile Infrastructure (AI) project.

CERN
[Architecture diagram: sensors (Sensor1, Sensor2) and applications (Application1, Application2) publish to an Apollo broker; an analysis/storage feed goes to Hadoop, an alarm feed goes to SNOW, and custom feeds are possible; Splunk and WebApp1 provide the dashboards.]

CERN
A prototype system for the new architecture is being implemented and tested (no statistics for now).
Key features:
– Based on open source systems: Hadoop, Apollo, etc.
– Distributed responsibility for probe execution
– High scalability based on messaging: Apollo
– Centralized storage and analysis cluster: Hadoop
– Powerful and highly configurable dashboard: Splunk
– Real-time feed for notifications and alarms
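A minimal sketch of the real-time alarm feed idea: messages arriving from the transport layer are filtered, and only events above a severity threshold are escalated; forward_to_snow() is a hypothetical stub, not the real SNOW integration.

```python
import json

def forward_to_snow(alarm: dict) -> None:
    """Hypothetical stub: in a real deployment this would call the ticketing system."""
    print(f"ticket: {alarm['entity']} {alarm['metric']} = {alarm['value']}")

def handle_message(raw: str, threshold: float = 0.9) -> None:
    """Process one message from the alarm feed and escalate if severe enough."""
    event = json.loads(raw)
    if event.get("value", 0) >= threshold:
        forward_to_snow(event)

# Example message as it might be delivered by the messaging layer:
handle_message('{"entity": "node01", "metric": "cpu_load", "value": 0.95}')
```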

Current Problems
– Tool complexity: web interfaces full of "traffic lights"! Can we actually get anything out of them? No automation of configuration: tools should be better integrated with configuration systems.
– Toolchain diversity: one single tool is not enough (most of the time!), and orchestration of different tools is difficult to implement.
– Low quality of processed monitoring data: incorrect notifications can easily be seen as spam.

Current Problems
– Monitoring granularity: most of the time based on hosts. Why can't we have other first-class monitoring elements?
– Sources of truth: different tools hold different databases of hosts/services. Is there a master? Are there copies? Cached data?
– Timing (depends on the tool!): long intervals between checks, high latency.

Future Solutions
Should aim at:
– Easy and reliable access to monitoring data
– Simple and reliable dashboards and APIs
– Correctly targeted real-time notifications
– Enabling complex queries across different data sets

Future Solutions
– A well designed monitoring toolchain, appropriate for the infrastructure being monitored
– Consider alternatives to monitoring frameworks: a messaging infrastructure as the transport layer
– Give top priority to the monitoring data! A single data format with a flexible schema; store all monitoring data (including historical) for analysis; feed the system with curated monitoring data.
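A sketch of what a single data format with a flexible schema could look like: a few mandatory fields plus a free-form payload, with the monitored entity not restricted to a host; all field names are assumptions, not an agreed WLCG/CERN format.

```python
import json
import time

def make_event(producer: str, entity: str, metric: str, value, **payload) -> str:
    """Serialize one monitoring event with a common envelope and a flexible payload."""
    event = {
        "timestamp": time.time(),   # mandatory fields, common to every producer
        "producer": producer,
        "entity": entity,           # not necessarily a host: any first-class element
        "metric": metric,
        "value": value,
        "payload": payload,         # flexible, producer-specific extra data
    }
    return json.dumps(event)

print(make_event("sam-nagios", "CE.site-a.example.org", "org.sam.CE-JobSubmit",
                 "OK", vo="atlas", latency_s=12.4))
```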

Thank You!