Co-ordination & Harmonisation of Advanced e-Infrastructures for Research and Education Data Sharing Research Infrastructures Grant Agreement n. 306819.


AfricaGrid ROC Monitoring Tools: NAGIOS
Christos Kanellopoulos, GRNET
SAGrid All-Hands Meeting, 26 March 2013

What is SAM?
Service Availability Monitoring (SAM) is a fully distributed monitoring framework with the following features:
- Based on open-source systems (Nagios, ActiveMQ, etc.)
- High scalability based on open messaging
- Supports integration of third-party monitoring systems
- Advanced notification and reporting system
- Web interface to visualize service status and availability
- Web REST API to consume service status and availability

Architecture
SAM is made up of several components, some commodity and some specifically designed and developed for SAM. These include:
- Nagios, to execute tests
- Messaging, to transport test results between components
- Databases to store configuration information:
  - the Aggregated Topology Provider (ATP)
  - the Profile Management Database (POEM)
- A database to store the test results produced by Nagios:
  - the Metric Results Store (MRS)
- Other components, such as the Availability Computation Engine (ACE), which processes the raw test results to calculate metrics such as site and service availability and reliability
- A portal, MyWLCG or MyEGI, to visualize both test results and availability calculations

Architecture
- ACE (Availability Computation Engine) is the engine used for availability and reliability calculations. Starting from metric results stored in the central database, ACE computes statuses, availabilities, and reliabilities for services, service flavours, and sites.
- ATP (Aggregated Topology Provider) provides topology information by aggregating grid topology information and downtimes from different external sources (GOCDB, OIM, CIC, BDII, GSTAT, feeds).
- POEM (Profile Management Database) describes existing metrics and groups them into profiles in order to run tests. It replaces the former MDDB component.
- MRS (Metric Results Store) stores metric output and computes service statuses. It provides views of data such as site and service status, availability, and reliability.
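To make the ACE description more concrete, here is a minimal sketch of an availability and reliability calculation over equal-length time intervals. It assumes the commonly cited definitions (availability = up time / time with known status; reliability = up time / time with known status minus scheduled downtime). The status names and the function are hypothetical illustrations, not ACE's actual code or data model.

```python
from collections import Counter


def availability_and_reliability(statuses):
    """Compute (availability, reliability) from per-interval statuses.

    statuses: iterable of 'UP', 'DOWN', 'SCHEDULED_DOWNTIME' or 'UNKNOWN',
    one entry per interval of equal length (for example, one per hour).
    Returns None for a value whose denominator would be zero.
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    up = counts["UP"]
    # Intervals with unknown status are excluded from both calculations.
    known = total - counts["UNKNOWN"]
    # Reliability additionally excludes scheduled downtime.
    in_service = known - counts["SCHEDULED_DOWNTIME"]
    availability = up / known if known else None
    reliability = up / in_service if in_service else None
    return availability, reliability


# Hypothetical day: 20 h up, 1 h down, 2 h scheduled downtime, 1 h unknown.
day = ["UP"] * 20 + ["DOWN"] + ["SCHEDULED_DOWNTIME"] * 2 + ["UNKNOWN"]
availability, reliability = availability_and_reliability(day)
print(f"availability={availability:.3f} reliability={reliability:.3f}")
```

With these assumptions, scheduled downtime lowers availability but not reliability, which is why the two figures are reported separately.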

Architecture
[Architecture diagram; the figure is not included in the transcript.]

- MyWLCG/MyEGI is the main visualization tool, presenting a grid-aware view of the data collected by the Service Availability Monitoring framework.
- Nagios is the heart of SAM. It schedules tests and forwards the test results via messaging to the components that require them, such as the central Metric Results Store and the MyEGI portal.
- Probes are contributed by many developers and system managers and are used to test specific grid services.
- Messaging: Apache ActiveMQ is used as an integration framework, adding flexibility, reliability, and scalability to the distributed SAM monitoring system.
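As an illustration of what a probe hands back to Nagios, the sketch below follows the standard Nagios plugin convention: one line of status text plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The check itself (remaining certificate lifetime in days), the function name and the thresholds are hypothetical and are not taken from any SAM probe.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def lifetime_status(days_left, warn_days=30, crit_days=7):
    """Map a remaining lifetime in days to (exit_code, status_line).

    days_left may be None when the lifetime could not be determined.
    The thresholds are illustrative defaults, not values used by SAM.
    """
    if days_left is None:
        return UNKNOWN, "UNKNOWN - could not determine certificate lifetime"
    if days_left <= crit_days:
        return CRITICAL, f"CRITICAL - certificate expires in {days_left} days"
    if days_left <= warn_days:
        return WARNING, f"WARNING - certificate expires in {days_left} days"
    return OK, f"OK - certificate valid for {days_left} more days"


code, line = lifetime_status(12)
print(line)
# A real probe would now call sys.exit(code) so that Nagios sees the status.
```

Keeping the decision logic in a pure function like this makes a probe easy to test without a running Nagios instance.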

Nagios Installation
- Disable SELinux in /etc/selinux/config:
    SELINUX=disabled
- Install the host certificates:
    ls -l /etc/grid-security/host*
    -rw-r--r-- 1 root root 2286 Oct 28 19:26 /etc/grid-security/hostcert.pem
    -r-------- 1 root root 887 Oct 28 19:25 /etc/grid-security/hostkey.pem
- Install the YUM configuration and rpmforge packages:
    yum-conf-5X-8.slc5.noarch (or later)
    rpmforge-release el5.rf.x86_64.rpm (or later)
- Remove the old lcg-CA repository, if installed:
    rm -f /etc/yum.repos.d/lcg-CA.repo
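The host certificate step can be double-checked with a short script. The sketch below compares the permission bits of the two files against expected modes. Mode 0644 for hostcert.pem matches the listing above; mode 0400 for hostkey.pem is the usual grid convention and is an assumption here. The function name is hypothetical.

```python
import os
import stat

# Expected permission bits; 0o400 for the key is an assumed convention.
EXPECTED_MODES = {"hostcert.pem": 0o644, "hostkey.pem": 0o400}


def check_host_credentials(directory="/etc/grid-security"):
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    for name, expected in EXPECTED_MODES.items():
        path = os.path.join(directory, name)
        try:
            mode = stat.S_IMODE(os.stat(path).st_mode)
        except FileNotFoundError:
            problems.append(f"{path}: missing")
            continue
        if mode != expected:
            problems.append(f"{path}: mode {mode:o}, expected {expected:o}")
    return problems


for problem in check_host_credentials():
    print(problem)
```

The script only reports; it deliberately does not change permissions or ownership.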

Nagios Installation
- Configure the following repositories:
  - EGI CAs (egi-trustanchors.repo)
  - gLite BDII (glite-BDII.repo)
  - gLite UI (glite-UI.repo)
  - EPEL (epel.repo)
  - DAG (dag.repo)
  - SAM:
      [egi-sam]
      name=EGI SAM repo
      baseurl=
      enabled=1
      gpgcheck=0
      protect=1
      priority=10
- Install yum priorities:
    yum install yum-priorities
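Because repository priority decides which nagios package yum picks, it can help to list the priorities of the enabled repositories. The sketch below parses repo-file text with Python's configparser. The baseurl in the sample is a made-up placeholder (the slide leaves it empty), the function name is hypothetical, and the fallback of 99 is what I understand to be the yum-priorities default, so treat it as an assumption.

```python
import configparser

# Sample repo definition; the baseurl is a placeholder, not the real one.
SAMPLE_REPO = """\
[egi-sam]
name=EGI SAM repo
baseurl=http://repo.example.org/sam/
enabled=1
gpgcheck=0
protect=1
priority=10
"""


def repo_priorities(repo_text, default_priority=99):
    """Return {repo_id: priority} for every enabled repository section."""
    # Interpolation is disabled so that yum variables such as $basearch
    # or literal % characters in URLs are left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    return {
        section: parser.getint(section, "priority", fallback=default_priority)
        for section in parser.sections()
        if parser.getboolean(section, "enabled", fallback=True)
    }


print(repo_priorities(SAMPLE_REPO))
```

A lower number means a higher priority, so the SAM repository with priority=10 would take precedence over repositories left at the default.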

Nagios Installation
- Installation:
    yum install lcg-CA
    yum install httpd
    yum install nagios.x86_64  # make sure that nagios from the EGI SAM repository is installed
    yum --exclude=\*saga\* --exclude=\*SAGA\* groupinstall 'glite-UI (production - x86_64)'
    yum install sam-nagios

Nagios Configuration
A SAM-Nagios node can be configured in three different ways in order to monitor different sets of sites/services:
- NGI SAM-Nagios: monitors all sites/services belonging to a given region
- Site SAM-Nagios: monitors all sites/services made available by one site
- VO SAM-Nagios: monitors all sites/services that support a given VO (the list of services may or may not be based on a VO feed)

Nagios Configuration
NGI SAM-Nagios (site-info.def):
    # Generic
    SITE_NAME=
    SITE_BDII_HOST=
    PX_HOST=
    BDII_HOST=
    VOS="dteam ops"
    VO_OPS_VOMS_SERVERS="vomss://voms.cern.ch:8443/voms/ops?/ops/"
    VO_OPS_VOMSES="'ops lcg-voms.cern.ch /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch ops 24' 'ops voms.cern.ch /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch ops 24'"
    VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
    VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/'
    VO_DTEAM_VOMSES="'dteam voms.hellasgrid.gr /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' 'dteam voms2.hellasgrid.gr /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'"
    VO_DTEAM_VOMS_CA_DN="'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"
    VO_DTEAM_WMS_HOSTS=  # set to your NGI WMSes
    VO_OPS_WMS_HOSTS=  # set to your NGI WMSes

Nagios Configuration
NGI SAM-Nagios (site-info.def, continued):
    # Nagios
    NAGIOS_HOST=
    NAGIOS_ADMIN_DNS=
    NCG_NAGIOS_ADMIN=
    NAGIOS_ROLE=ngi
    NCG_PROBES_TYPE=local
    NCG_VO=ops
    NAGIOS_HTTPD_ENABLE_CONFIG=true
    NAGIOS_SUDO_ENABLE_CONFIG=true
    NAGIOS_NCG_ENABLE_CONFIG=true
    NAGIOS_NAGIOS_ENABLE_CONFIG=true
    NAGIOS_CGI_ENABLE_CONFIG=true
    NAGIOS_NSCA_PASS=MY_PASS
    # NGI/ROC Nagios
    COUNTRY_NAME=
    NCG_GOCDB_ROC_NAME=
    NAGIOS_SUDO_ENABLE_CONFIG=true

Nagios Configuration
NGI SAM-Nagios (site-info.def, continued):
    # DB data
    MYSQL_ADMIN="MY_MYSQL_PASS"
    DB_PASS="MY_MRS_PASS"
    MYEGI_ADMIN_NAME=
    MYEGI_ADMIN_EMAIL=
    MYEGI_DEFAULT_PROFILE="ROC"  # profile to be displayed by default in MyEGI
    MYEGI_REGION=
- Run yaim:
    /opt/glite/yaim/bin/yaim -s site-info.def -c -n glite-UI -n glite-NAGIOS
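Since the template leaves many variables empty, a quick pre-flight check before running yaim can catch values that were never filled in. The sketch below parses simple KEY=value lines of a site-info.def file and reports empty or missing variables. Which variables count as required is an assumption made for this example, and the function names are hypothetical; yaim itself performs its own validation.

```python
import shlex

# Variables treated as mandatory in this sketch (an assumption, not a
# list taken from the yaim documentation).
REQUIRED = ("SITE_NAME", "SITE_BDII_HOST", "PX_HOST", "BDII_HOST",
            "NAGIOS_HOST", "NAGIOS_ROLE")


def parse_site_info(text):
    """Parse single-line shell-style assignments into a dict.

    Handles quoting and trailing '#' comments via shlex. Multi-line
    quoted values and shell expansions are not supported.
    """
    values = {}
    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)
        if not tokens or "=" not in tokens[0]:
            continue
        key, _, value = tokens[0].partition("=")
        if key.isidentifier():
            values[key] = value
    return values


def missing_variables(text, required=REQUIRED):
    """Return the required variables that are absent or empty."""
    values = parse_site_info(text)
    return [name for name in required if not values.get(name)]


SAMPLE = """\
# Generic
SITE_NAME=EXAMPLE-SITE
SITE_BDII_HOST=
VOS="dteam ops"
NAGIOS_ROLE=ngi   # NGI-level instance
"""
print(missing_variables(SAMPLE, required=("SITE_NAME", "SITE_BDII_HOST", "VOS", "NAGIOS_ROLE")))
```

In the sample, only SITE_BDII_HOST is reported, because it is present but empty.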

Nagios Configuration
Site SAM-Nagios (site-info.def):
    # Generic
    SITE_NAME=
    SITE_BDII_HOST=
    PX_HOST=
    BDII_HOST=
    VOS="dteam ops"
    VO_OPS_VOMS_SERVERS="vomss://voms.cern.ch:8443/voms/ops?/ops/"
    VO_OPS_VOMSES="'ops lcg-voms.cern.ch /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch ops 24' 'ops voms.cern.ch /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch ops 24'"
    VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
    VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/'
    VO_DTEAM_VOMSES="'dteam voms.hellasgrid.gr /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' 'dteam voms2.hellasgrid.gr /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'"
    VO_DTEAM_VOMS_CA_DN="'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"
    VO_DTEAM_WMS_HOSTS=

Nagios Configuration
Site SAM-Nagios (site-info.def, continued):
    # Nagios
    NAGIOS_HOST=
    NAGIOS_ADMIN_DNS=
    NCG_NAGIOS_ADMIN=
    NAGIOS_ROLE=site
    NCG_PROBES_TYPE=remote,local
    NCG_VO=dteam
    NAGIOS_HTTPD_ENABLE_CONFIG=true
    NAGIOS_SUDO_ENABLE_CONFIG=true
    NAGIOS_NCG_ENABLE_CONFIG=true
    NAGIOS_NAGIOS_ENABLE_CONFIG=true
    NAGIOS_CGI_ENABLE_CONFIG=true
    NCG_REMOTE_USE_NAGIOS=true
    NAGIOS_NSCA_PASS=MY_PASS
- Run yaim:
    /opt/glite/yaim/bin/yaim -s site-info.def -c -n glite-UI -n glite-NAGIOS

Failover Nagios - configurable hot-standby mode
Starting from Update-13, SAM supports the deployment of hot-standby (active/active) instances. The system works in the following way:
- The backup instance is deployed using the same Yaim configuration, with the added BACKUP_INSTANCE variable described below.
- The SAM administrator opens a GGUS ticket to the SAM/Nagios Support Unit requesting that the backup host be added to the message consumer filter (file nagios-roles.conf).
- The backup instance constantly monitors resources, but with the following differences:
  - alarms are not sent to the Operations portal
  - notifications are disabled
  - results are not sent to the central MRS database
  - note: results are stored in the local MRS, so MyEGI shows the correct history on both instances
- If the main instance fails, the SAM administrator has to manually switch off the BACKUP_INSTANCE variable on the backup instance.

Failover Nagios - configurable hot-standby mode
A backup instance can be defined in several ways:
- Via the Yaim variable that sets BACKUP_INSTANCE in /etc/sysconfig/ncg (recommended mechanism):
    NCG_BACKUP_INSTANCE=true
- Fast backup configuration without running Yaim, using one of:
  - Setting the variable BACKUP_INSTANCE in /etc/sysconfig/ncg (this approach can be used for fast failover):
      BACKUP_INSTANCE=true
  - Setting the global variable in the ncg.conf file:
      BACKUP_INSTANCE=1
  - Using the ncg.pl argument:
      ncg.pl --backup-instance
- When the backup is configured without Yaim, the following additional step is needed:
    /sbin/chkconfig send-to-dashboard off
    /sbin/service send-to-dashboard stop
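When several mechanisms can mark a host as a backup, it is useful to confirm what the sysconfig file actually says before or after a failover. The sketch below scans the text of a file such as /etc/sysconfig/ncg for BACKUP_INSTANCE. Treating "true" and "1" as enabled mirrors the two values shown above; how NCG itself interprets other values is not stated in the slides, so that part is an assumption, and the function name is hypothetical.

```python
# Values treated as "enabled" in this sketch (assumption based on the
# two examples in the slides: BACKUP_INSTANCE=true and BACKUP_INSTANCE=1).
TRUE_VALUES = {"true", "1"}


def is_backup_instance(sysconfig_text):
    """Return True if the last BACKUP_INSTANCE assignment enables backup mode."""
    enabled = False
    for raw in sysconfig_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if line.startswith("export "):
            line = line[len("export "):].strip()
        key, sep, value = line.partition("=")
        if sep and key.strip() == "BACKUP_INSTANCE":
            # As in a shell, a later assignment overrides an earlier one.
            enabled = value.strip().strip("\"'").lower() in TRUE_VALUES
    return enabled


print(is_backup_instance("BACKUP_INSTANCE=true\n"))
print(is_backup_instance("# BACKUP_INSTANCE=true\n"))
```

A commented-out or absent assignment is reported as not being a backup instance, which corresponds to the state described for an active instance.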

Failover Nagios - configurable hot-standby mode
- To turn the backup instance into the active one, the SAM administrator has to remove the BACKUP_INSTANCE variable.
- If Yaim is not used, the following additional step is needed:
    /sbin/chkconfig send-to-dashboard on
    /sbin/service send-to-dashboard start

Validation
- After Yaim has run successfully, you should be able to access the Nagios web interface at the address:
- If you enabled local probes, first check that the MyProxy credential works by running the hr.srce.GridProxy-Get-VO metric on NAGIOS_SERVER. You can do this by forcing a scheduled check via the web interface or via the command line:
    nagios-run-check NAGIOS_SERVER hr.srce.GridProxy-Get-VO
- The MyEGI interface is at the address:
- Check the resource BDII (PORT stands for the resource BDII port, whose value is missing here):
    $ ldapsearch -x -LLL -h NAGIOS_SERVER -p PORT -b Mds-Vo-Name=resource,O=grid "(GlueServiceType=*-NAGIOS)" GlueServiceEndpoint
    dn: GlueServiceUniqueID=NAGIOS_SERVER_XXXXXX-NAGIOS_ ,Mds-Vo-name=resource,o=grid
    GlueServiceEndpoint:
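The output of ldapsearch is LDIF, in which long lines are wrapped and continued on the next line with a single leading space. The sketch below unfolds such output and extracts the values of one attribute, which is enough to confirm that a GlueServiceEndpoint is published. The sample DN and URL are invented placeholders, and the helper names are hypothetical. Base64-encoded values (written as "attr:: ...") are not decoded.

```python
def unfold_ldif(text):
    """Join LDIF continuation lines (those starting with one space)."""
    lines = []
    for raw in text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def attribute_values(text, attribute):
    """Return all values of the given attribute, matched case-insensitively."""
    prefix = attribute.lower() + ":"
    return [
        line.split(":", 1)[1].strip()
        for line in unfold_ldif(text)
        if line.lower().startswith(prefix)
    ]


# Hypothetical ldapsearch -LLL output with a wrapped DN line.
SAMPLE = (
    "dn: GlueServiceUniqueID=nagios.example.org_nagios,Mds-Vo-name=\n"
    " resource,o=grid\n"
    "GlueServiceEndpoint: https://nagios.example.org/nagios\n"
)
print(attribute_values(SAMPLE, "GlueServiceEndpoint"))
```

An empty list would indicate that the resource BDII is not publishing the Nagios service endpoint.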

Probes
SAM Probes
- grid-monitoring-probes-ch.cern.sam
  - Metrics: MrsCheckDBInserts, MrsCheckDBInsertsDetailed, MrsCheckSpool
  - NodeType: sam-opsnagios
  - Description: monitors whether MRS is receiving metric results
- nagios-gocdb-downtime
  - Metrics: org.egee.ImportGocdbDowntimes
  - NodeType: sam-nagios, sam-gridmon
  - Description:
- mrs
  - Metrics: org.egee.MrsCheckMissingProbes, org.egee.CentralMrsCheckMissingProbes, org.egee.SendToMetricStore
  - NodeType: sam-nagios, sam-gridmon
  - Description:

Probes
SAM Probes
- mddb-synchronizer
  - Metrics: org.egee.MDDBSync
  - NodeType: sam-nagios, sam-gridmon
  - Description:
- atp
  - Metrics: org.egee.ATPSync
  - NodeType: sam-nagios, sam-gridmon
  - Description: executes atp_synchronizer and verifies that the execution completed without errors
- msg-nagios-bridge
  - Metrics: org.egee.SendToMsg, org.egee.RecvFromQueue, org.egee.CheckConfig
  - NodeType: sam-nagios, sam-gridmon
  - Description:

Probes
EGI Probes
- grid-monitoring-org.ggus-probes
  - NodeType: sam-nagios, ops-monitor
  - Description: queries GGUS for open tickets, by Support Unit or by the name of the site that has been notified
- grid-monitoring-org.nagiosexchange-probes
  - NodeType: sam-nagios, ops-monitor
  - Description: checks from Nagios Exchange
- grid-monitoring-probes-hr.srce
  - NodeType: sam-nagios
  - Description: CAdist, CertLifetime, DPM, DPNS, GRAM, GridFTP, GridProxy, MyProxy and VOMS probes
- grid-monitoring-probes-org.sam.sec
  - NodeType: sam-nagios
  - Description: several security probes executed from the WNs

Probes
EMI Probes
- grid-monitoring-org.activemq-probes
  - NodeType: ops-monitor
  - Description: checks for ActiveMQ messaging brokers
- grid-monitoring-probes-ch.cern
  - NodeType: sam-nagios
  - Description: FTS, LFC and RGMA probes
- grid-monitoring-probes-org.bdii
  - NodeType: sam-nagios
  - Description: Nagios checks for a WLCG Information System instance (BDII)
- grid-monitoring-probes-org.ndgf
  - NodeType: sam-nagios
  - Description: ARC-CE, ARC-LFC, ARC-SRM and ARC-RLS probes

Probes
EMI Probes
- grid-monitoring-probes-org.sam
  - NodeType: sam-nagios
  - Description: CE, CREAMCE, LFC, SRM, WMS and WN probes
- gstat-validation
  - NodeType: sam-nagios
  - Description: GStat checks