Kashif Mohammad Deputy Technical Co-ordinator (South Grid) Oxford

Slides:



Advertisements
Similar presentations
Die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH) Tools used for operations at GridKa Angela Poschlad, SCC.
Advertisements

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Simply monitor a grid site with Nagios J.
Monitoring the Grid at local, national, and Global levels Pete Gronbech GridPP Project Manager ACAT - Brunel Sept 2011.
Monitoring in EGEE EGEE/SEEGRID Summer School 2006, Budapest Judit Novak, CERN Piotr Nyczyk, CERN Valentin Vidic, CERN/RBI.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The network monitoring in grid context Operations.
02/07/09 1 WLCG NAGIOS Kashif Mohammad Deputy Technical Co-ordinator (South Grid) University of Oxford.
WLCG Nagios and the NGS. We have a plan NGS is using a highly customised version of the (SDSC written) INCA monitoring framework. It was became too complicated.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks ROD model assessment ROC UKI John Walsh.
Enabling Grids for E-sciencE System Analysis Working Group and Experiment Dashboard Julia Andreeva CERN Grid Operations Workshop – June, Stockholm.
James Casey, CERN, IT-GT-TOM 1 st ROC LA Workshop, 6 th October 2010 Grid Infrastructure Monitoring.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Nagios for Grid Services E. Imamagic, SRCE.
FP6−2004−Infrastructures−6-SSA E-infrastructure shared between Europe and Latin America Grid Monitoring Tools Alexandre Duarte CERN.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operations Automation Team James Casey EGEE’08.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Service Availability Monitoring – Status.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Grid Site Monitoring with Nagios E. Imamagic,
CERN IT Department CH-1211 Geneva 23 Switzerland t GDB CERN, 4 th March 2008 James Casey A Strategy for WLCG Monitoring.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Wojciech Lapka SAM Team CERN EGEE’09 Conference,
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Regional Dashboard Cyril L’Orphelin - CNRS/IN2P3.
Site Validation Session Report Co-Chairs: Piotr Nyczyk, CERN IT/GD Leigh Grundhoefer, IU / OSG Notes from Judy Novak WLCG-OSG-EGEE Workshop CERN, June.
SAM Sensors & Tests Judit Novak CERN IT/GD SAM Review I. 21. May 2007, CERN.
Service Availability Monitor tests for ATLAS Current Status Tests in development To Do Alessandro Di Girolamo CERN IT/PSS-ED.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Andrea Sciabà Hammercloud and Nagios Dan Van Der Ster Nicolò Magini.
Update of SAM Implementation ALICE TF Meeting 18/10/07.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI How to integrate portals with the EGI monitoring system Dusan Vudragovic.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Ops Portal New Requirements.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Grid Monitoring Tools E. Imamagic, SRCE CE.
CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.
Global ADC Job Monitoring Laura Sargsyan (YerPhI).
SAM Database and relation with GridView Piotr Nyczyk SAM Review CERN, 2007.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Regional Nagios Emir Imamagic /SRCE EGEE’09,
03/09/2007http://pcalimonitor.cern.ch/1 Monitoring in ALICE Costin Grigoras 03/09/2007 WLCG Meeting, CHEP.
Operations model Maite Barroso, CERN On behalf of EGEE operations WLCG Service Workshop 11/02/2006.
SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks ROC model assessment AP ROC ShuTing Liao.
Co-ordination & Harmonisation of Advanced e-Infrastructures for Research and Education Data Sharing Research Infrastructures Grant Agreement n
SAM architecture EGEE 07 Service Availability Monitor for the LHC experiments Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo,
TIFR, Mumbai, India, Feb 13-17, GridView - A Grid Monitoring and Visualization Tool Rajesh Kalmady, Digamber Sonvane, Kislay Bhatt, Phool Chand,
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The Dashboard for Operations Cyril L’Orphelin.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI 2 nd level support training Marian Babik, David Collados, Wojciech Lapka,
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Update on Service Availability Monitoring (SAM) Marian Babik, David Collados,
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI Services for Distributed e-Infrastructure Access Tiziana Ferrari on behalf.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Regional tools use cases overview Peter Solagna – EGI.eu On behalf of the.
Site notifications with SAM and Dashboards Marian Babik SDC/MI Team IT/SDC/MI 12 th June 2013 GDB.
TSA1.4 Infrastructure for Grid Management Tiziana Ferrari, EGI.eu EGI-InSPIRE – SA1 Kickoff Meeting1.
Enabling Grids for E-sciencE EGEE-II INFSO-RI ROC managers meeting at EGEE 2007 conference, Budapest, October 1, 2007 Admin Matters Vera Hanser.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operational Tools M2 Update James Casey.
Nordic NE ROC Face 2 Face Meeting
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Status of the SAM/Nagios/GSTAT Components.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Nagios Grid Monitor E. Imamagic, SRCE OAT.
Daniele Bonacorsi Andrea Sciabà
James Casey, CERN IT-GD WLCG Workshop 1st September, 2007
EGI Operations Management Board
PPS All sites Meeting: - CODs and PPS - Monitoring Tools
NGI and Site Nagios Monitoring
Use of Nagios in Central European ROC
GOCDB New Requirements
POW MND section.
Pedro Andrade ACE Status Update Pedro Andrade
Evolution of SAM in an enhanced model for monitoring the WLCG grid
Patricia Méndez Lorenzo ALICE Offline Week CERN, 13th July 2007
Regional Grid Monitoring - timeline
Security Monitoring in a Nagios world
Advancements in Availability and Reliability computation Introduction and current status of the Comp Reports mini project C. Kanellopoulos GRNET.
March Availability Report for EGEE Sites based on Nagios
Operations & Coordination Tools
A Messaging Infrastructure for WLCG
Solutions for federated services management EGI
Monitoring of the infrastructure from the VO perspective
EGEE Operation Tools and Procedures
Site availability Dec. 19 th 2006
Presentation transcript:

Kashif Mohammad Deputy Technical Co-ordinator (South Grid) Oxford REGIONAL NAGIOS Kashif Mohammad Deputy Technical Co-ordinator (South Grid) Oxford

Regional Nagios Grid Monitoring has been switched from CERN based Nagios machine to Regional Nagios based in Oxford from 26th May Dashboard receives alarms from Regional Nagios and ROD create tickets based on those alarms Reliability and Availability of resources will also be based on metrics from these tests. OPS testing by SAM will be stopped sometime in July. Testing from experiment VO’s by SAM will continue for some time. Eventually all testing by Experiment VO’s would be switch to Nagios. HEPSysMAN June 2010 25/04/2019

Monitoring Architecture HEPSysMAN June 2010 25/04/2019

Nagios Configuration Generator Nagios Configuration Generator (NCG) It dynamically configure Nagios based on information from different sources. Aggregated Topology provider (ATP) What will be tested NCG How it will be tested Metric Description Database (MDDB) HEPSysMAN June 2010 25/04/2019

Aggregated Topology Provider Aggregated Topology Provider (ATP) is a service which aggregate information from different sources like Different grid infrastructure (EGEE,OSG) Projects (WLCG) Sites, Services and VOs Downtime A history of the above All the information is stored in a database. ATP database is part of regional nagios and configured by YAIM. But Regional Nagios is still using SAMPI for topology information. HEPSysMAN June 2010 25/04/2019

Metric Description Database Metric Description Database (MDDB) stores metrics which are used to test grid infrastructure It can store different profiles for different availabilities and configuration of Nagios installations. This is also part of regional nagios package and configured by YAIM Not fully functional yet Nagios uses text file for this information HEPSysMAN June 2010 25/04/2019

Aggregated Topology provider (ATP) Metric Result Store Metric Result Store Historic metric results for services It would be the replacement for SAM DB Aggregated Topology provider (ATP) Metric result Store (MRS) What will be tested NCG How it will be tested Metric Description Database (MDDB) HEPSysMAN June 2010 25/04/2019

CE Testing HEPSysMAN June 2010 25/04/2019

CE Testing Org.sam.CE-JobSubmit service submits job to CE every hour Org.sam.CE-JobMonit service monitor all jobs submitted from Nagios every 5 min and update the status to org.sam.CE-JobSubmit. If job completes successfully on CE, result is uploaded to message bus directly from WN Nagios subscribes to message bus and the results are displayed as passive test result. If job stays in waiting state for 45 min or not completed with in 55 min, job is killed and status change to critical. HEPSysMAN June 2010 25/04/2019

SE Testing Org.sam.SRM-All is the parent test wrapper which is executed every hour. Test results are published to message bus. HEPSysMAN June 2010 25/04/2019

Re-scheduling test through Nagios Nagios equivalent of SAMAP functionality with a twist. People registered as admin in GOCDB can reschedule test for their site, those registered as regional staff can submit job to any site in ROC. Test can only be rescheduled if org.sam.CE-jobSubmit status is in terminal state, i.e success, aborted or cancel It can not be rescheduled if it is in non terminal state like running, waiting, etc HEPSysMAN June 2010 25/04/2019

Re-schedule test HEPSysMAN June 2010 25/04/2019

If you want to see only your site https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?hostgroup=site-UKI-SOUTHGRID-BRIS-HEP&style=detail http://tinyurl.com/36zs562 Adding this with firefox plugin gives instant error notification for your site https://addons.mozilla.org/en-US/firefox/addon/3607/ HEPSysMAN June 2010 25/04/2019

From Nagios to Operation Dashboard If a critical test fails 2 times in a row, a notification is sent to a special notification queue in message bus Operation dashboard subscribe to Notification queue and after some complicated filtering display those result to dashboard. Regional Operator on Duty create a Ticket based on alarm. Site can also look at Operation Dashboard for alarms against their site. https://operations-portal.in2p3.fr/ HEPSysMAN June 2010 25/04/2019

MyEGEE or MyEGI ? HEPSysMAN June 2010 25/04/2019

MyEGEE MyEGEE is the visualization tool for operations to check metric results and resource statuses It is a separate component but installed with Regional Nagios https://gridppnagios.physics.ox.ac.uk/myegee It takes it input from Metric Resource Database HEPSysMAN June 2010 25/04/2019

Site Nagios Sites can install Nagios in two flavour Advantages PROBE_TYPE=Remote It subscribes to topic in Message Bus and display the result as passive test PROBE_TYPE=Remote,Local It also submit some local test. Advantages It is possible to configure email notification. It can be configured to add non production services for testing. Disadvantage Extra service to maintain Useless in case of network failure or machine room problem HEPSysMAN June 2010 25/04/2019

Thank You HEPSysMAN June 2010 25/04/2019