EGI 2nd level support training (EGI-InSPIRE RI-261323, www.egi.eu)


EGI 2nd level support training
Marian Babik, David Collados, Wojciech Lapka, Pedro Andrade, Paloma Fuente, Jacobo Tarragon (CERN)
Emir Imamagic (SRCE)
Christos Triantafyllidis (AUTH)

Introduction
Aim
– provide a detailed technical overview of SAM
  – improve understanding of how the system works
  – help you solve the most common issues
– get feedback from 2nd level
Approach
– overview of the architecture
– per component (3 slides)
  – configuration, debugging
  – the most common issues and how to resolve them

Introduction
GGUS 2nd level
– 69 tickets
GGUS 3rd level
– 249 tickets

Disclaimer
Many internal/development APIs will be shown
– they can change at any time and shouldn’t be considered public
The public API is documented at:
– AMDOC/Web+Services+Specification

Terminology
service – endpoint (hostname, port)
service flavour – service type (GOCDB)
profile – set of tuples (flavour, metric, vo, fqan)
status – discrete state (one of ok, critical, warning, unknown)
availability – time period for which status was ok (- downtime)
reliability – availability (+ downtime)
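The difference between availability and reliability can be illustrated with a small calculation. This is a minimal sketch, not the ACE implementation: it assumes the commonly used convention that unknown periods are excluded from both figures and that scheduled downtime is additionally excluded from reliability. The function name, the hourly granularity and the "downtime" label are hypothetical simplifications.

```python
def availability_and_reliability(hours):
    """Return (availability, reliability) for a list of hourly labels.

    Each label is one of "ok", "critical", "warning", "unknown" or
    "downtime" (assumption: an hour in scheduled downtime is labelled
    "downtime" instead of carrying a status).
    """
    total = len(hours)
    ok = hours.count("ok")
    unknown = hours.count("unknown")
    downtime = hours.count("downtime")

    # Availability: downtime counts against the service.
    known = total - unknown
    availability = ok / known if known else None

    # Reliability: downtime is taken out of the period considered.
    reliable_base = known - downtime
    reliability = ok / reliable_base if reliable_base else None
    return availability, reliability


# Hypothetical day: 18 h ok, 2 h critical, 3 h downtime, 1 h unknown.
day = ["ok"] * 18 + ["critical"] * 2 + ["downtime"] * 3 + ["unknown"]
availability, reliability = availability_and_reliability(day)
print(f"availability={availability:.3f} reliability={reliability:.3f}")
```

With these assumptions reliability is never lower than availability, which matches the "(- downtime)" and "(+ downtime)" notes above.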

SAM Architecture (diagram slide; figure not included in the transcript)

ATP - Configuration
atp_synchro.conf : main configuration file
– debug level
– external data sources location (GOCDB, CIC, VOMS, etc)
– location of vo feed and roc configuration files
– synchronizer selector
atp_db.conf : database connection configuration
atp_logging_files.conf : location of log configuration file
atp_logging_parameters_config.conf : log configuration
roc.conf : list of enabled regions
vo_feeds.conf : list of enabled vo feeds
All configuration files are based on key-value pairs
Default configuration structure distributed in ATP package

ATP - Debugging
Log of last execution: /var/log/atp/atp.log
Log of all executions: /var/log/atp/atp_full.log (with logrotate)
Errors are also sent to system logging
Six levels of debugging:
– CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET
– Default configuration is on INFO (20)
Standard log file line:
– “ :24:02,308 - ATP - INFO - CIC - Execution – Starting”
– CIC: synchronizer name (e.g. CIC, GOCDB Topology, VOFeeds, etc)
– Execution: task type (e.g. configuration, validation, execution)
– Starting: action description
ATP_sync probe
POEM/NCG calls (for all non-deleted VOs):
– localhost/atp/api/search/servicemap/json?vo= &ismonitored=on
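The log line layout described above can be split into its fields with a short script. This is a minimal sketch that assumes the fields are separated by " - " exactly as in the sample line (the last separator may be a hyphen or an en dash). The date in the sample is hypothetical, since the slide shows only part of the timestamp, and parse_line is an illustrative name, not an ATP tool.

```python
import logging
import re

# timestamp - logger - LEVEL - synchronizer - task - action
LINE_RE = re.compile(
    r"^(?P<timestamp>.*?) - (?P<logger>\S+) - (?P<level>[A-Z]+) - "
    r"(?P<synchronizer>.+?) - (?P<task>\w+) [-–] (?P<action>.*)$"
)


def parse_line(line):
    """Split one atp.log line into named fields, or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # The numeric levels on the slide (INFO = 20) match Python's logging module.
    entry["levelno"] = getattr(logging, entry["level"], None)
    return entry


# Hypothetical date; time, logger and fields follow the slide's sample.
sample = "2012-01-01 10:24:02,308 - ATP - INFO - CIC - Execution - Starting"
entry = parse_line(sample)
print(entry["synchronizer"], entry["task"], entry["action"], entry["levelno"])
```

Filtering on entry["task"] or entry["levelno"] then narrows a long atp_full.log down to the lines of interest.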

ATP – Common Issues
A line-by-line analysis of atp.log explains 99% of the problems with the ATP synchronizer
ATP synchronizes data from several distinct external data sources. Sometimes ATP execution fails due to “invalid” or “not available” input data
– Check for the “Validation” tag in the log to find out which data source was not reachable or was providing invalid data
ATP is based on several PL/SQL procedures/functions
– If you detect ORA-* error codes, please assign the ticket to 3rd level
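The two checks above (look for "Validation" entries, escalate on ORA-* codes) can be sketched as a small triage helper. The function name, the returned keys and the sample log lines are hypothetical; only the "Validation" tag and the ORA-* prefix come from the slide.

```python
import re

ORA_RE = re.compile(r"ORA-\d+")


def triage(lines):
    """Collect Validation entries and Oracle error codes from log lines."""
    lines = [line.rstrip("\n") for line in lines]
    validation = [line for line in lines if "Validation" in line]
    ora_codes = sorted({code for line in lines for code in ORA_RE.findall(line)})
    return {
        "validation_lines": validation,
        "ora_codes": ora_codes,
        # Per the slide, any ORA-* code means the ticket goes to 3rd level.
        "escalate_to_3rd_level": bool(ora_codes),
    }


# Hypothetical log excerpt, for illustration only.
excerpt = [
    "... - ATP - ERROR - GOCDB Topology - Validation - input data not available",
    "... - ATP - ERROR - CIC - Execution - ORA-00942: table or view does not exist",
    "... - ATP - INFO - VOFeeds - Execution - Starting",
]
print(triage(excerpt))
```

On a real host the lines would come from open("/var/log/atp/atp.log").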

POEM sync
/etc/poem/poem_sync.ini
– logging
– database details
– POEM_SYNC_NS_URLS – list of URLs from which to synchronize (NGI defaults to grid-monitoring, VO defaults to localhost)
– POEM_SYNC_NS_RESTRICT – space separated list of namespace!profile which should be synchronized for given namespace (ch.cern.sam!ROC ch.cern.sam
reasonable defaults are provided
debugging
– localhost/poem_sync/api/0.1/json/servicemetricinstances
– localhost/poem_sync/api/0.1/json/profiles
Poem_sync probe (dumps log information)

POEM Web
/etc/poem/poem.ini : main configuration file for poem web
– database details
– logging
– namespace
poem web instance, list of defined profiles, metrics
– localhost/poem/api/0.1/json/profiles/
– localhost/poem/api/0.1/json/namespace/
poem web (mod_wsgi), django admin
– DEBUG=True in /etc/poem/poem.ini
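A quick way to confirm that a POEM JSON endpoint answers is to fetch it and check that the body parses. This is a minimal sketch: the http:// scheme, the fetch_json helper and its parameters are assumptions, and nothing is assumed about the structure of the returned JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost"  # assumption: the slide gives the host without a scheme


def fetch_json(path, opener=urllib.request.urlopen, base_url=BASE_URL, timeout=10):
    """Fetch base_url + path and return the decoded JSON document.

    `opener` is injectable so the function can be exercised without a
    running POEM instance.
    """
    with opener(base_url + path, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

On a SAM host, fetch_json("/poem/api/0.1/json/profiles/") should return the defined profiles; an exception (connection refused, HTTP error, invalid JSON) points at the web application, in which case /var/log/httpd and DEBUG=True are the next stops.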

POEM known issues
no history
– changes take effect immediately (critical profiles need to be changed at beginning of a month – PROC10)
metric configuration is not integrated with poem
– poem web doesn’t filter metrics in any way
– no guidance in terms of dependencies, internal metrics, etc.
FQAN support
– if fqan is null this means results with any fqan will be accepted
– local profiles with custom fqans can overwrite results of the central profiles
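The "null fqan accepts any fqan" rule can be written down as a matching function over profile tuples (flavour, metric, vo, fqan). This is a sketch of the rule as stated on the slide, not POEM's code; the class, the function and the sample flavour, metric and FQAN values are illustrative.

```python
from typing import NamedTuple, Optional


class ProfileEntry(NamedTuple):
    flavour: str
    metric: str
    vo: str
    fqan: Optional[str]  # None stands for a null fqan


def accepts(entry, flavour, metric, vo, fqan):
    """True if a result with these attributes matches the profile entry."""
    same_key = (entry.flavour, entry.metric, entry.vo) == (flavour, metric, vo)
    # A null fqan in the profile accepts results with any fqan.
    return same_key and (entry.fqan is None or entry.fqan == fqan)


# Illustrative entries only.
any_fqan = ProfileEntry("CREAM-CE", "example.metric", "ops", None)
one_fqan = ProfileEntry("CREAM-CE", "example.metric", "ops", "/ops/Role=lcgadmin")

print(accepts(any_fqan, "CREAM-CE", "example.metric", "ops", "/ops/Role=pilot"))
print(accepts(one_fqan, "CREAM-CE", "example.metric", "ops", "/ops/Role=pilot"))
```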

NCG configuration
/etc/ncg/ncg.conf
– basic structure
/etc/ncg/ncg
outputs to /etc/nagios/wlcg.d/
log /var/log/ncg/ncg.log

NCG debugging
review /var/log/ncg/ncg.log
check metric configuration
– /etc/ncg-metric-config.conf
– /etc/ncg-metric-config.d
probes
– NCGPidFile (freshness)
– ncg_sync
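A freshness check of the kind the NCGPidFile probe is named after can be reproduced by hand by comparing a file's modification time with the current time. This is a sketch under assumptions: the slide does not say which file or threshold the probe uses, so the path and the one-day limit in the usage note are hypothetical.

```python
import os
import time


def file_age_seconds(path, now=None):
    """Seconds elapsed since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_fresh(path, max_age_seconds, now=None):
    """True if the file exists and was modified within max_age_seconds."""
    try:
        return file_age_seconds(path, now) <= max_age_seconds
    except OSError:
        # A missing or unreadable file is treated as not fresh.
        return False
```

For example, is_fresh("/path/to/ncg/pidfile", 24 * 3600) would report whether the (hypothetical) file was touched within the last day.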

NCG known issues

voms2htpasswd
Authorization for Nagios
Configuration files:
– /etc/voms2htpasswd.conf
  – Main configuration file
– /etc/voms2htpasswd-bans.conf
  – Banned DNs
– /etc/voms2htpasswd-static.d/
  – Files containing list of DNs
Sample entries for /etc/voms2htpasswd.conf:
– atps://grid-monitoring.cern.ch/atp/api/search/contactgroup/json?groupname=NGI_HU
– atps://grid-monitoring.cern.ch/atp/api/search/contactgroup/json?groupname=NGI_PL&role=Regional%20Manager
– atps://grid-monitoring.cern.ch/atp/api/search/contactsite/json?sitename=KR-KISTI-GSDC-01
Sample entries for /etc/voms2htpasswd-bans.conf and /etc/voms2htpasswd-static.d/
– /C=GR/O=HellasGrid/OU=auth.gr/CN=Christos Triantafyllidis
Debugging:
– Check existence of entries in: /etc/httpd/httpd.users
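When an expected user is missing from httpd.users, it helps to read back what each configured source actually selects. The sketch below only decodes the sample entries with Python's standard URL parser; describe_source is an illustrative helper, not part of voms2htpasswd.

```python
from urllib.parse import parse_qs, urlsplit


def describe_source(entry):
    """Break a voms2htpasswd.conf source entry into its parts."""
    parts = urlsplit(entry.strip())
    # parse_qs decodes percent-escapes, e.g. %20 becomes a space.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": query,
    }


source = (
    "atps://grid-monitoring.cern.ch/atp/api/search/contactgroup/json"
    "?groupname=NGI_PL&role=Regional%20Manager"
)
print(describe_source(source))
```

A stray space inside an entry (for example in the host name or in the role) changes the parsed host or query, which is a quick way to spot a mistyped line.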

Messaging config
brokers:
– /var/cache/msg/broker-cache-file/broker-list
msg-to-handler daemon:
– /etc/msg-to-handler.conf (/etc/msg-to-handler.d)
Nagios probes:
– org.egee.SendToMsg – publishes config and metrics
– org.egee.RecvFromQueue – imports results

MRS configuration
basic configuration
– mrs.conf is located at:
  – /etc/mrs.d/mysql-mrs.conf (MySQL)
  – /etc/mrs.d/oracle-mrs.conf (Oracle)
send_to_db.ini is located at
– /etc/nagios/plugins/send_to_db.ini
structure:
– [send_to_db]
– db_uri=mrs;host=localhost
– db_user=msuser
– db_pwd=mspass
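The send_to_db.ini structure shown above is plain INI, so it can be sanity-checked with Python's configparser. This is a minimal sketch: the helper name and the list of required keys are taken from the slide's example, not from the MRS code.

```python
import configparser

REQUIRED_KEYS = ("db_uri", "db_user", "db_pwd")

SAMPLE = """\
[send_to_db]
db_uri=mrs;host=localhost
db_user=msuser
db_pwd=mspass
"""


def load_send_to_db(text):
    """Parse the [send_to_db] section and check the expected keys exist."""
    # interpolation=None keeps '%' characters in passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = dict(parser["send_to_db"])
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise KeyError(f"missing keys in [send_to_db]: {missing}")
    return section


print(load_send_to_db(SAMPLE)["db_uri"])
```

On a SAM host the text would come from /etc/nagios/plugins/send_to_db.ini. The ';' inside db_uri is kept, because configparser only treats ';' as a comment marker at the start of a line by default.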

MRS debugging
select uts_to_w3ctime(max(check_time)) from metricdata_spool; (ORACLE)
select FROM_UNIXTIME(max(check_time)) from metricdata_spool; (MySQL)
– latest entry in metricdata_spool; it shouldn’t be old (if it is too old, metrics may not be arriving from messaging)
select uts_to_w3ctime(max(check_time)) from metricdata; (ORACLE)
select FROM_UNIXTIME(max(check_time)) from metricdata; (MySQL)
– latest entry in metricdata; it shouldn’t be old (if it is too old, metrics may not be arriving from metricdata_spool)
select uts_to_w3ctime(m.check_time), uts_to_w3ctime(m.insert_time), m.* from metricdata_rejected m; (ORACLE)
select FROM_UNIXTIME(m.check_time), from_unixtime(m.insert_time), m.* from metricdata_rejected m; (MySQL)
– see reason to understand why the metric was rejected
Nagios probes: SendToMetricStore, MrsDirSize, MrsCheckMissingProbes
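The "latest entry shouldn't be old" check boils down to comparing max(check_time) with the current time. The sketch below shows that logic against any DB-API connection; it is exercised here with an in-memory SQLite database purely as a stand-in for the real MySQL or Oracle instance, and the function name is hypothetical. What counts as "too old" is not stated on the slide, so no threshold is built in.

```python
import sqlite3
import time

SPOOL_TABLES = ("metricdata_spool", "metricdata")


def latest_check_age(conn, table, now=None):
    """Seconds since the newest check_time in `table`, or None if it is empty."""
    if table not in SPOOL_TABLES:
        raise ValueError(f"unexpected table: {table}")
    now = time.time() if now is None else now
    # check_time holds a Unix timestamp (hence FROM_UNIXTIME in the queries).
    (latest,) = conn.execute(f"select max(check_time) from {table}").fetchone()
    return None if latest is None else now - latest


# Stand-in database with two made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("create table metricdata_spool (check_time integer)")
conn.executemany("insert into metricdata_spool values (?)", [(1000,), (4000,)])
print(latest_check_age(conn, "metricdata_spool", now=4600))
```

A large age for metricdata_spool points at messaging, while a fresh spool with a stale metricdata points at the step between the two tables.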

MRS known issues
no known issues
Basic contracts
– metric is marked as REMOVED if status is MISSING and service is marked as deleted
– metric is marked as REMOVED if its tuple disappears from mrs bootstrapper
– metric is marked as MISSING after 24 hours
statuschange_service_profile table keeps data for 12 months
metricdata table keeps data for 6 months
metricdata_rejected table keeps data for 1 month
metricdata_latest table contains metric results newer than 7 days.
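The basic contracts can be read as a small decision function. This is a sketch of one interpretation, not MRS code: it assumes that "after 24 hours" means 24 hours without a new result, and the function and parameter names are hypothetical.

```python
MISSING_AFTER_HOURS = 24  # from the slide; read as "hours without a new result"


def metric_state(current_status, hours_since_last_result,
                 service_deleted, tuple_in_bootstrapper):
    """Apply the three contracts and return the resulting metric state."""
    # Contract: the tuple disappeared from the mrs bootstrapper.
    if not tuple_in_bootstrapper:
        return "REMOVED"
    status = current_status
    # Contract: no result for 24 hours (assumed reading).
    if hours_since_last_result >= MISSING_AFTER_HOURS:
        status = "MISSING"
    # Contract: MISSING on a deleted service.
    if status == "MISSING" and service_deleted:
        return "REMOVED"
    return status


print(metric_state("OK", 30, service_deleted=False, tuple_in_bootstrapper=True))
```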

SAM reloading
/etc/rc.d/init.d/sam-sync
/var/log/sam-sync.log
reloads SAM:
– suspends ATP, POEM
– ncg.reload.sh
– mrs bootstrapping
– resumes ATP, POEM

myEGI config and debug
/etc/mywlcg/mywlcg.ini
– database connection
/var/log/httpd/error.log
based on django (mod_wsgi)
– you can get more explicit errors if you set DEBUG=True
myegi tests
myegi web service tests

ACE - Configuration
ace.conf: main configuration file
– database configuration file path
– logging level and configuration file path
– computation_delay: limits how close to the current time computations are performed; the last period considered ends at the current time minus the delay. E.g.:
  – Current time:
  – Computation delay: 15 (minutes)
  – When calculations are performed, last period considered will end at
ace_db.conf : database connection configuration
atp_logging.conf : log path and logging configuration
All configuration files are based on key-value pairs
Default configuration structure distributed in ACE package
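The effect of computation_delay is simple date arithmetic. The times in the slide's example are not preserved in the transcript, so the values below are hypothetical; the sketch also ignores any rounding to period boundaries that ACE may apply.

```python
from datetime import datetime, timedelta


def last_period_end(current_time, computation_delay_minutes):
    """End of the last period considered: current time minus the delay."""
    return current_time - timedelta(minutes=computation_delay_minutes)


# Hypothetical current time, with the 15-minute delay from the slide.
now = datetime(2012, 1, 1, 10, 0)
print(last_period_end(now, 15))
```

Results that arrive later than the delay are therefore not reflected in the regular summarization for that period.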

ACE - Debugging
Log of last execution: /var/log/ace/ace.log
– Used for both ace_status and ace_availability
Five levels of logging:
– CRITICAL, ERROR, WARNING, INFO, DEBUG
– Default configuration is on ERROR (40)
Logging of performed actions
– Status auto-summarization (missing status calculations in the past 24h)
– Regular status summarization (from last summarization to current time – delay)
– Availability auto-summarization (missing availability calculations in the past 24h)
– Regular availability summarization (from last summarization to current time – delay)
Hourly, daily, weekly and monthly calculations for each hour, day, week and month within the period.

ACE – Common Issues
Availability recomputation requests
– Must follow the request policy:
  – caused by problems in the monitoring infrastructure
  – requested up to 10 days after the publication of the monthly report
– If coming from site admin, assign to regional operations staff
  – policy for EGI sites and regions:
– If coming from regional operations staff, assign to 3rd level
Apparently wrong values caused by external reasons
– topology issues
– MRS data

Documentation
https://tomtools.cern.ch/confluence/display/SAMDOC/Home
https://tomtools.cern.ch/confluence/display/SAMDOC/FAQs
https://tomtools.cern.ch/confluence/display/SAMDOC/Troubleshooting
y/SAMDOC/Released+Probes