James Casey, CERN, IT-GT-TOM 1 st ROC LA Workshop, 6 th October 2010 Grid Infrastructure Monitoring.

Slides:

Advertisements

Similar presentations

CERN IT Department CH-1211 Genève 23 Switzerland t Messaging System for the Grid as a core component of the monitoring infrastructure for.

Advertisements

LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.

MyOSG: A user-centric information resource for OSG infrastructure data sources Arvind Gopu, Soichi Hayashi, Rob Quick Open Science Grid Operations Center.

CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.

ATLAS Off-Grid sites (Tier-3) monitoring A. Petrosyan on behalf of the ATLAS collaboration GRID’2012, , JINR, Dubna.

HPDC 2007 / Grid Infrastructure Monitoring System Based on Nagios Grid Infrastructure Monitoring System Based on Nagios E. Imamagic, D. Dobrenic SRCE HPDC.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Simply monitor a grid site with Nagios J.

Monitoring the Grid at local, national, and Global levels Pete Gronbech GridPP Project Manager ACAT - Brunel Sept 2011.

Monitoring in EGEE EGEE/SEEGRID Summer School 2006, Budapest Judit Novak, CERN Piotr Nyczyk, CERN Valentin Vidic, CERN/RBI.

1 1 Service Composition for LHC Computing Grid Monitoring Beob Kyun Kim e-Science Division, KISTI

CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Overlook of Messaging.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks GStat 2.0 Joanna Huang (ASGC) Laurence Field.

And Tier 3 monitoring Tier 3 Ivan Kadochnikov LIT JINR

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Grid Site Monitoring with Nagios E. Imamagic,

EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.

CERN IT Department CH-1211 Geneva 23 Switzerland t GDB CERN, 4 th March 2008 James Casey A Strategy for WLCG Monitoring.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Wojciech Lapka SAM Team CERN EGEE’09 Conference,

WLCG Monitoring Roadmap Julia Andreeva, CERN , WLCG workshop, CERN.

Monitoring for CCRC08, status and plans Julia Andreeva, CERN , F2F meeting, CERN.

EGEE-III INFSO-RI Enabling Grids for E-sciencE Ricardo Rocha CERN (IT/GS) EGEE’08, September 2008, Istanbul, TURKEY Experiment.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks MSG - A messaging system for efficient and.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Using GStat 2.0 for Information Validation.

INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.

XROOTD AND FEDERATED STORAGE MONITORING CURRENT STATUS AND ISSUES A.Petrosyan, D.Oleynik, J.Andreeva Creating federated data stores for the LHC CC-IN2P3,

Visualization Ideas for Management Dashboards

ATP Future Directions Availability of historical information for grid resources: It is necessary to store the history of grid resources as these resources.

Julia Andreeva on behalf of the MND section MND review.

Service Availability Monitor tests for ATLAS Current Status Tests in development To Do Alessandro Di Girolamo CERN IT/PSS-ED.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI How to integrate portals with the EGI monitoring system Dusan Vudragovic.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Monitoring of the LHC Computing Activities Key Results from the Services.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Grid Monitoring Tools E. Imamagic, SRCE CE.

CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.

MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.

Global ADC Job Monitoring Laura Sargsyan (YerPhI).

GridView - A Monitoring & Visualization tool for LCG Rajesh Kalmady, Phool Chand, Kislay Bhatt, D. D. Sonvane, Kumar Vaibhav B.A.R.C. BARC-CERN/LCG Meeting.

SAM Database and relation with GridView Piotr Nyczyk SAM Review CERN, 2007.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Regional Nagios Emir Imamagic /SRCE EGEE’09,

New solutions for large scale functional tests in the WLCG infrastructure with SAM/Nagios: The experiments experience ES IT Department CERN J. Andreeva.

Operations model Maite Barroso, CERN On behalf of EGEE operations WLCG Service Workshop 11/02/2006.

Computation of Service Availability Metrics in Gridview Digamber Sonvane, Rajesh Kalmady, Phool Chand, Kislay Bhatt, Kumar Vaibhav Computer Division, BARC,

ATLAS Off-Grid sites (Tier-3) monitoring A. Petrosyan on behalf of the ATLAS collaboration GRID’2012, , JINR, Dubna.

CERN - IT Department CH-1211 Genève 23 Switzerland t IT-GD-OPS attendance to EGEE’09 IT/GD Group Meeting, 09 October 2009.

Breaking the frontiers of the Grid R. Graciani EGI TF 2012.

Co-ordination & Harmonisation of Advanced e-Infrastructures for Research and Education Data Sharing Research Infrastructures Grant Agreement n

SAM architecture EGEE 07 Service Availability Monitor for the LHC experiments Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo,

TIFR, Mumbai, India, Feb 13-17, GridView - A Grid Monitoring and Visualization Tool Rajesh Kalmady, Digamber Sonvane, Kislay Bhatt, Phool Chand,

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The Dashboard for Operations Cyril L’Orphelin.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks CYFRONET site report Marcin Radecki CYFRONET.

CERN IT Department CH-1211 Genève 23 Switzerland t Monitoring: Present and Future Pedro Andrade (CERN IT) 31 st August.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Update on Service Availability Monitoring (SAM) Marian Babik, David Collados,

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI Services for Distributed e-Infrastructure Access Tiziana Ferrari on behalf.

Monitoring Working Group Update Grid Deployment Board 5 th December, CERN Ian Neilson.

Flexible Availability Computation Engine for WLCG Rajesh Kalmady, Phool Chand, Vaibhav Kumar, Digamber Sonvane, Pradyumna Joshi, Vibhuti Duggal, Kislay.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks New WLCG Grid Service Monitoring Displays.

Grid Colombia Workshop with OSG Week 2 Startup Rob Gardner University of Chicago October 26, 2009.

Transition to EGI PSC-06 Istanbul Ioannis Liabotis Greece GRNET

Daniele Bonacorsi Andrea Sciabà

James Casey, CERN IT-GD WLCG Workshop 1st September, 2007

NGI and Site Nagios Monitoring

Use of Nagios in Central European ROC

POW MND section.

Pedro Andrade ACE Status Update Pedro Andrade

Evolution of SAM in an enhanced model for monitoring the WLCG grid

Experiment Dashboard overviw of the applications

Advancements in Availability and Reliability computation Introduction and current status of the Comp Reports mini project C. Kanellopoulos GRNET.

A Messaging Infrastructure for WLCG

Monitoring of the infrastructure from the VO perspective

Kashif Mohammad Deputy Technical Co-ordinator (South Grid) Oxford

EGEE Operation Tools and Procedures

Site availability Dec. 19 th 2006

Presentation transcript:

James Casey, CERN, IT-GT-TOM 1 st ROC LA Workshop, 6 th October 2010 Grid Infrastructure Monitoring

Tools for WLCG Monitoring WLCG provides a set of tools for operational monitoring and management Aim is to – Enable sites to operate a reliable infrastructure – Report on the reliability and usage to WLCG users Many of the tools developed previously within EGEE/OSG – Now operated by EGI.eu, OSG, other NGIs

Tools GOCDB – Configuration management SAM/Nagios – Checking the operational status of resources Gstat – Information system monitoring and reporting Gridview – Availability/reliability calculation and reporting Gridmap – High level views of the infrastructure

Tools I will talk about most of the previous tools – SAM/Nagios, GStat, Gridview, Gridmap Other exist too – Accounting – APEL – VO Cards - CIC Portal – 1 st line support - Operations Dashboard And in OSG – OIM, MyOSG, Gratia, …

Open-source at the core – Avoid NIH ! All these tools depend on common low-level components – Nagios – an open-source monitoring system – Apache ActiveMQ – an open source messaging system We many other open-source components when developing these tools – Python, Django, Jquery, RRD, Google charts A short detour on what they are and why we use them

Nagios What is Nagios? – open source monitoring framework – highly flexible with advanced features – widely used & actively developed Why do we need it? – Many tests need to be scheduled for execution – avoid development & maintenance of home-grown tools – provide solution that site admins are familiar with Nagios is a standard monitoring component at many sites 6

Nagios Architecture Nagios Core – Scheduler: Runs checks at a predefined interval Plugins – Scripts used to check particular pieces of functionality Web interface Powerful notification system – , SMS, Pager, … All parts are pluggable and extensible

Nagios Web Interface

Site Nagios – CE Tests

Messaging What is a messaging system? – Method of communication between applications – Standardized, asynchronous and scalable communication between distributed entities – Reliable network of brokers that provides guaranteed delivery of messages – Messaging is for applications what IM is for people – Mainly acts as an integration framework between many separate applications 10

JMS messaging models 11

Why messaging ? Why do we need it? – Interaction between distributed monitoring components – Standard interfaces enables easy integration of monitoring software – Scalable Main use-cases are in finance for high message rate ( > 1M/sec)reliable multicast e.g. trading floor – Reliable – Distributed Messaging is pre-existing grid-scale technology

Implementation details FUSE Message Broker – based on Apache ActiveMQ Good performance characteristics – 1K – 20K messages per second depending on features used Distributed network of 4 brokers, hosted by EGI.eu – CERN, Croatia, Greece – Provides reliability and locality 13

Vendor tests 14 From “Optimizing FUSE Message Broker” -

CERN openlab tests 15

Messaging is a key technology for WLCG WLCG Experiments are buying into messaging – ATLAS DDM Moving to a production messaging service – Ganga – VO Job monitoring – Alice data transfers dCache can use it for distributed pools – Developments by NDGF CERN Beams use it for monitoring in the control room

SAM and Nagios Service Availability Monitoring (SAM) – A distributed monitoring system – Based on open-source components Nagios for test execution ActiveMQ for communication via messaging – With custom visualization MyEGEE/MyWGI/MyWLCG/... Aims to test all resources on the production grid – And provides data to other components for availability and reliability calculation

Nagios at a Site Simplest model – A site wants fabric monitoring for the grid services 1.Download ‘EGEE-Nagios’ meta-package 2.Configure it as a site Nagios – Point at your site BDII – Give it a certificate & of local administrator 3.Nagios now will test all resources in your site – Mail admin list on errors – Provides web interface for more details – Detailed low-level tests for all services

Nagios Web Interface

Site Nagios – CE Tests

Nagios at the region A NGI or ROC monitors all it’s sites – “Simulates users actions via the public interfaces” – At a higher level than the site monitoring Allows regional operations to help manage the site Feeds into availability calculations Feeds back into the site monitoring – You see the view the ROC has of you – And it can trigger local alerts into the operational process

Architecture - Regions 22

Architecture

Current Status 27 national level Nagios servers – Should grow out to full WLCG scale in next few months Clients distributed across 40 countries 315 sites 5K services 500,000 test results/day 5 consumers of full data stream to database for analysis and post processing

MyEGEE – Resource summary

MyEGEE – Status history

MyEGEE portal & iGoogle 27

Computation of Availability Metrics Gridview computes Service Availability Metrics per VO using SAM test results Computed Metrics include – Service Status, Availability, Reliability All Metrics are computed – per Service Instance, per Service (eg. CE) for a site – per Site, Aggregate of all Tier-1/0 sites Various periodicities like Hourly, Daily, Weekly and Monthly Also shows: – statistics of data transfers, FTS file transfers, jobs running

29 Metric Computation Example Consider Status for a day as – UP – 12 Hrs – Scheduled Down – 6 Hrs – Unknown – 6 Hrs Availability Graphs (1 st bar in Graph) would show – Availability (Green) – 50 % – Sch. Down (Yellow) – 25 % – Unknown (Grey) – 25 % Reliability Graph (1 st bar in Graph) would show 100% Reliability = Availability(Green) / (Availability(Green)+ Unscheduled Downtime(Red)) Reliability not affected by Scheduled Downtime or Unknown Interval Sample Reliability Graph Sample Availability Graph

Service & Site Service Status Calculation test status per (test, si, vo) Test Results Service Instance Status ServiceStatusSiteStatus aggregate test status per (si, vo) Service = a service type (e.g. CE, SE, sBDII,...) Serviceinstance (si) = (service, node) combination consider only critical tests for a vo ANDing Service marked as scheduled down (sd)  sd all test statuses are ok up at least one test status is down(failed)  down No test status down and at least one test status is unknown unknown aggregate service instance status for site services per (site, service, vo) ORing At least one service instance status up up No instance up and at least one is sd sd No instance up or sd and at least one instance is down  down All instances are unknown  unknown aggregate site service status per (site, vo) ANDing all service statuses up up at least one service status down down no service down and at least one is sd  sd no service down or sd and at least one is unknown  unknown

Gridview – Site availability details

Gridview – ROC report

Gridview – ROC Drilldown

Gridview – Data Transfers

Gstat – Information System visualization Information system contains the middleware view of the infrastructure Main usage: – Service Discovery – what is there? – Installed Capacity – how much is there ? – VO Views – what can a VO use ? Gstat provides visual representation of this – Management tool for NGI/WLCG managers – Debugging tool for site admins

Gstat – LDAP Browser

Gstat – ROC Summary

Gstat - Site View drilldown

WLCG Topology view

GridMap Visualization site regions Size of rectangle is e.g. - size of site (#CPUs) - #running jobs -... Idea – visualize the Grid by using Treemaps (Grid + Treemap = GridMap) Example GridMap

GridMap Visualization Idea – visualize the Grid by using Treemaps (Grid + Treemap = GridMap) Example GridMap Colour of rectangle is e.g. - SAM status of site / service - Availability of site / service -... okdegradeddown

Multiple Views GridMaps can be used for top-level, geographical and VO views VO Views cross-location Top-level View Geographical Views Federation, Partner, Site, etc. Next level of GridMaps Large-scale Federated Grid Services Infrastructure Global GridMap Application Domain GridMap Local GridMap Alert Corrective action effect

Trends Trends can be understood by looking at a sequence of GridMaps 25 Sep Sep Sep 2010 Site Availability over time: 22 Sep Sep Sep 2010

More Views Correlations of metrics can be discovered by switching between different views LHCbCMSAtlasAliceOPS Site Availability from different VO perspectives: site BDIISRMSECEOverall Site Status of different Site Services: sites without colour do not support the VO

MyEGEE…EGI…WLCG – The future An integration of the existing visualization tools – Gridmap – MyEGEE – Gridview Being developed for EGI right now – To satisfy the NGI and site manager requirements A single portal for operational information

MyEGI homepage

MyEGI service status drilldown

Summary Wide range of tools available for you Aim is to help you to manage your site Integrates well with the Glite middleware and WLCG operational processes The future leads towards better integrated portals for complete monitoring of your systems – All open source – Contributions always welcome !!! 48

Links – Nagios – MyEGEE – Gstat – Gridview Availability – Gridmap 49

SAM Demo Watch our demo: – (YouTube) –

Questions ?