Co-ordination & Harmonisation of Advanced e-Infrastructures for Research and Education Data Sharing Research Infrastructures Grant Agreement n. 306819.


1 Co-ordination & Harmonisation of Advanced e-Infrastructures for Research and Education Data Sharing
Research Infrastructures Grant Agreement n. 306819
AfricaGrid ROC Monitoring Tools: NAGIOS
Christos Kanellopoulos, GRNET
SAGrid All-Hands Meeting, 26 March 2013

2 What is SAM?
Service Availability Monitoring (SAM) is a fully distributed monitoring framework with the following features:
- Based on open-source systems (Nagios, ActiveMQ, etc.)
- High scalability based on open messaging
- Supports integration of third-party monitoring systems
- Advanced notification and reporting system
- Web interface to visualize service status and availability
- Web REST API to consume service status and availability

3 Architecture
SAM is a system made up of several components, some commodity and some specifically designed and developed for SAM. These include:
- Nagios to execute tests
- Messaging to transport test results between components
- Databases to store configuration information:
  - the Aggregated Topology Provider (ATP)
  - the Profile Management Database (POEM)
- A database to store the test results produced by Nagios:
  - the Metric Result Store (MRS)
- Other components, such as the Availability Computation Engine (ACE), which processes the raw test results to calculate metrics such as site and service availability and reliability
- A portal, MyWLCG or MyEGI, to visualize both test results and availability calculations

4 Architecture
- ACE (Availability Computation Engine)
  - is the engine used for availability and reliability calculations. Starting from metric results stored in the central database, ACE computes statuses, availabilities, and reliabilities for services, service flavours, and sites.
- ATP (Aggregated Topology Provider)
  - provides topology information by aggregating grid topology information and downtimes from different external sources (GOCDB, OIM, CIC, BDII, GSTAT, feeds).
- POEM (Profile Management Database)
  - describes existing metrics and groups them in profiles in order to run tests. It replaces the former MDDB component.
- MRS (Metric Results Store)
  - stores metric output and computes service statuses. It provides views of data such as site and service status, availability, and reliability.
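To make the availability and reliability calculation concrete, here is a minimal sketch. The slides do not give ACE's formulas, so the ones used here are an assumption: availability = OK / (total - UNKNOWN) and reliability = OK / (total - UNKNOWN - scheduled DOWNTIME). The function name `avail_rel`, the one-status-per-line input format, and the status labels are hypothetical and are not part of SAM.

```shell
# Sketch only: NOT the ACE implementation. Formulas are assumed, see lead-in.
# Reads one status sample per line on stdin (e.g. one per hour):
#   OK, CRITICAL, UNKNOWN, or DOWNTIME (scheduled downtime).
avail_rel() {
    LC_ALL=C awk '
        NF { n[$1]++; total++ }          # count samples per status, skip blank lines
        END {
            known = total - n["UNKNOWN"]           # samples with a known status
            a_den = known                          # availability denominator
            r_den = known - n["DOWNTIME"]          # reliability also excludes scheduled downtime
            a = (a_den > 0) ? n["OK"] / a_den : 0
            r = (r_den > 0) ? n["OK"] / r_den : 0
            printf "availability=%.2f reliability=%.2f\n", a, r
        }'
}
```

For ten samples made of 6 OK, 1 CRITICAL, 2 DOWNTIME and 1 UNKNOWN, this prints `availability=0.67 reliability=0.86` (6/9 and 6/7 respectively).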

5 Architecture
[Architecture diagram; image not reproduced in the transcript]

6
- MyWLCG/MyEGI
  - is the main visualization tool, presenting a grid-aware view of the data collected by the Service Availability Monitoring framework.
- Nagios
  - is the heart of SAM. It schedules tests and forwards the test results via messaging to the components that require them, such as the central Metric Store and the MyEGI portal.
- Probes
  - are contributed by many developers and system managers and are used to test specific grid services.
- Messaging
  - Apache ActiveMQ is used as an integration framework, adding flexibility, reliability, and scalability to the distributed SAM monitoring system.

7 Nagios Installation
- Disable SELinux in /etc/selinux/config:
    SELINUX=disabled
- Install host certificates:
    ls -l /etc/grid-security/host*
    -rw-r--r-- 1 root root 2286 Oct 28 19:26 /etc/grid-security/hostcert.pem
    -r-------- 1 root root  887 Oct 28 19:25 /etc/grid-security/hostkey.pem
- Install the YUM and rpmforge packages:
    yum-conf-5X-8.slc5.noarch (or later)
    rpmforge-release-0.5.2-2.el5.rf.x86_64.rpm (or later)
- Remove the old lcg-CA repository, if installed:
    rm -f /etc/yum.repos.d/lcg-CA.repo
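The certificate listing above shows hostcert.pem with mode 644 and hostkey.pem with mode 400. A small pre-flight check along the following lines could confirm this before continuing. This is a sketch: the function name `check_mode` is hypothetical, and it assumes GNU `stat` (as found on Scientific Linux).

```shell
# Print "ok" if FILE exists with exactly the octal MODE; otherwise describe the problem.
check_mode() {
    local file="$1" want="$2" have
    if [ ! -e "$file" ]; then
        echo "missing"
        return 1
    fi
    have=$(stat -c '%a' "$file")    # GNU stat: prints the mode in octal, e.g. 644
    if [ "$have" = "$want" ]; then
        echo "ok"
    else
        echo "mode $have, expected $want"
        return 1
    fi
}

# Example use on the SAM-Nagios host (paths and modes taken from the listing above):
#   check_mode /etc/grid-security/hostcert.pem 644
#   check_mode /etc/grid-security/hostkey.pem 400
```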

8 Nagios Installation
- Configure the following repositories:
  - EGI CAs (egi-trustanchors.repo)
  - gLite BDII (glite-BDII.repo)
  - gLite UI (glite-UI.repo)
  - EPEL (epel.repo)
  - DAG (dag.repo)
  - SAM:
      [egi-sam]
      name=EGI SAM repo
      baseurl=http://repository.egi.eu/sw/production/sam/1/$basearch
      enabled=1
      gpgcheck=0
      protect=1
      priority=10
- Install yum priorities:
    yum install yum-priorities

9 Nagios Installation
- Installation:
    yum install lcg-CA
    yum install httpd
    yum install nagios.x86_64    # make sure that nagios from the EGI SAM repository is installed
    yum --exclude=\*saga\* --exclude=\*SAGA\* groupinstall 'glite-UI (production - x86_64)'
    yum install sam-nagios

10 Nagios Configuration
A SAM-Nagios node can be configured in three different ways in order to monitor different sets of sites/services:
- NGI SAM-Nagios: monitors all sites/services belonging to a given region
- Site SAM-Nagios: monitors all sites/services made available by one site
- VO SAM-Nagios: monitors all sites/services that support a given VO (the list of services may or may not be based on a VO feed)

11 Nagios Configuration
- NGI SAM-Nagios
    # Generic
    SITE_NAME=
    SITE_BDII_HOST=
    PX_HOST=
    BDII_HOST=
    VOS="dteam ops"
    VO_OPS_VOMS_SERVERS="vomss://voms.cern.ch:8443/voms/ops?/ops/"
    VO_OPS_VOMSES="'ops lcg-voms.cern.ch 15009 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch ops 24' 'ops voms.cern.ch 15004 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch ops 24'"
    VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
    VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/'
    VO_DTEAM_VOMSES="'dteam voms.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' 'dteam voms2.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'"
    VO_DTEAM_VOMS_CA_DN="'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"
    VO_DTEAM_WMS_HOSTS=   # set to your NGI WMSes
    VO_OPS_WMS_HOSTS=     # set to your NGI WMSes

12 Nagios Configuration
    # Nagios
    NAGIOS_HOST=
    NAGIOS_ADMIN_DNS=
    NCG_NAGIOS_ADMIN=
    NAGIOS_ROLE=ngi
    NCG_PROBES_TYPE=local
    NCG_VO=ops
    NAGIOS_HTTPD_ENABLE_CONFIG=true
    NAGIOS_SUDO_ENABLE_CONFIG=true
    NAGIOS_NCG_ENABLE_CONFIG=true
    NAGIOS_NAGIOS_ENABLE_CONFIG=true
    NAGIOS_CGI_ENABLE_CONFIG=true
    NAGIOS_NSCA_PASS=MY_PASS

    # NGI/ROC Nagios
    COUNTRY_NAME=
    NCG_GOCDB_ROC_NAME=
    NAGIOS_SUDO_ENABLE_CONFIG=true

13 Nagios Configuration
    # DB data
    MYSQL_ADMIN="MY_MYSQL_PASS"
    DB_PASS="MY_MRS_PASS"
    MYEGI_ADMIN_NAME=
    MYEGI_ADMIN_EMAIL=
    MYEGI_DEFAULT_PROFILE="ROC"   # profile to be displayed by default in MyEGI
    MYEGI_REGION=
- Run yaim:
    /opt/glite/yaim/bin/yaim -s site-info.def -c -n glite-UI -n glite-NAGIOS
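Many variables in the site-info.def template above are shown with empty values that the administrator is expected to fill in. Before running yaim, a quick scan such as the following could list the ones still left blank. This is a sketch: the function name `empty_vars` is hypothetical, and whether a given variable may legitimately stay empty is something the slides do not say.

```shell
# List variables that are assigned an empty value in a KEY=value file.
# A trailing "# comment" after the "=" still counts as empty.
empty_vars() {
    grep -E '^[A-Za-z_][A-Za-z0-9_]*=[[:space:]]*(#.*)?$' "$1" | cut -d= -f1
}

# Example use, with the file name taken from the yaim command above:
#   empty_vars site-info.def
```

The check is purely textual and does not source the file, so it does not execute anything contained in it.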

14 Nagios Configuration
- Site SAM-Nagios
    # Generic
    SITE_NAME=
    SITE_BDII_HOST=
    PX_HOST=
    BDII_HOST=
    VOS="dteam ops"
    VO_OPS_VOMS_SERVERS="vomss://voms.cern.ch:8443/voms/ops?/ops/"
    VO_OPS_VOMSES="'ops lcg-voms.cern.ch 15009 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch ops 24' 'ops voms.cern.ch 15004 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch ops 24'"
    VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
    VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/'
    VO_DTEAM_VOMSES="'dteam voms.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' 'dteam voms2.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'"
    VO_DTEAM_VOMS_CA_DN="'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"
    VO_DTEAM_WMS_HOSTS=

15 Nagios Configuration
    # Nagios
    NAGIOS_HOST=
    NAGIOS_ADMIN_DNS=
    NCG_NAGIOS_ADMIN=
    NAGIOS_ROLE=site
    NCG_PROBES_TYPE=remote,local
    NCG_VO=dteam
    NAGIOS_HTTPD_ENABLE_CONFIG=true
    NAGIOS_SUDO_ENABLE_CONFIG=true
    NAGIOS_NCG_ENABLE_CONFIG=true
    NAGIOS_NAGIOS_ENABLE_CONFIG=true
    NAGIOS_CGI_ENABLE_CONFIG=true
    NCG_REMOTE_USE_NAGIOS=true
    NAGIOS_NSCA_PASS=MY_PASS
- Run yaim:
    /opt/glite/yaim/bin/yaim -s site-info.def -c -n glite-UI -n glite-NAGIOS

16 Failover Nagios - configurable hot-standby mode
Starting from Update-13, SAM supports deployment of hot-standby (active/active) instances. The system works in the following way:
- The backup instance is deployed using the same Yaim configuration, with the BACKUP_INSTANCE variable (described below) added.
- The SAM administrator opens a GGUS ticket to the SAM/Nagios Support Unit requesting that the backup host be added to the message consumer filter (file nagios-roles.conf).
- The backup instance constantly monitors resources, but with the following differences:
  - alarms are not sent to the Operations portal
  - email notifications are disabled
  - results are not sent to the central MRS database
  - note: results are stored in the local MRS, so MyEGI shows the correct history on both instances
- If the main instance fails, the SAM administrator has to manually switch off the BACKUP_INSTANCE variable on the backup instance.

17 Failover Nagios - configurable hot-standby mode
- A backup instance can be defined in several ways:
  - Via the Yaim variable, which sets BACKUP_INSTANCE in /etc/sysconfig/ncg (recommended mechanism):
      NCG_BACKUP_INSTANCE=true
  - Fast backup configuration without Yaim execution:
    - Setting BACKUP_INSTANCE in /etc/sysconfig/ncg (this approach can be used for fast failover):
        BACKUP_INSTANCE=true
    - Setting the global variable in the ncg.conf file:
        BACKUP_INSTANCE=1
    - Using the ncg.pl argument:
        ncg.pl --backup-instance
- If the backup is configured without Yaim, the following additional step is needed:
    /sbin/chkconfig send-to-dashboard off
    /sbin/service send-to-dashboard stop

18 Failover Nagios - configurable hot-standby mode
- To turn the backup instance into the active one, the SAM administrator has to remove the BACKUP_INSTANCE variable.
- If Yaim is not used, the following additional step is needed:
    /sbin/chkconfig send-to-dashboard on
    /sbin/service send-to-dashboard start
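For the fast-failover case where BACKUP_INSTANCE was set directly in /etc/sysconfig/ncg, removing the variable could be scripted roughly as follows. This is a sketch: the function name `clear_backup_flag` and the `.bak` backup copy are illustrative choices, not part of SAM, and any further step needed to regenerate the Nagios configuration is not covered by the slides and is therefore left out.

```shell
# Remove any BACKUP_INSTANCE assignment from an ncg sysconfig file.
# A copy of the original is kept next to it as FILE.bak.
clear_backup_flag() {
    local file="$1"
    cp -p "$file" "$file.bak" || return 1
    # grep -v exits non-zero when nothing is left to print, so do not rely on its status
    grep -v '^[[:space:]]*BACKUP_INSTANCE=' "$file.bak" > "$file"
    return 0
}

# Example use on the backup host (path from the slides), followed by the
# send-to-dashboard commands listed above when Yaim is not used:
#   clear_backup_flag /etc/sysconfig/ncg
#   /sbin/chkconfig send-to-dashboard on
#   /sbin/service send-to-dashboard start
```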

19 Validation
- After Yaim has run successfully, you should be able to access the Nagios web interface at https://NAGIOS_SERVER/nagios.
- If you enabled local probes, first check that the MyProxy credential works by running the hr.srce.GridProxy-Get-VO metric on NAGIOS_SERVER. You can do this by force-scheduling the check via the web interface or from the command line:
    nagios-run-check NAGIOS_SERVER hr.srce.GridProxy-Get-VO
- The MyEGI interface is at https://NAGIOS_SERVER/myegi.
- Check the resource BDII:
    $ ldapsearch -x -LLL -h NAGIOS_SERVER -p 2170 -b Mds-Vo-Name=resource,O=grid "(GlueServiceType=*-NAGIOS)" GlueServiceEndpoint
    dn: GlueServiceUniqueID=NAGIOS_SERVER_XXXXXX-NAGIOS_2937827985,Mds-Vo-name=resource,o=grid
    GlueServiceEndpoint: https://NAGIOS_SERVER:443/nagios
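When scripting the BDII check, the endpoint can be pulled out of the ldapsearch output with a small filter. This is a sketch: the function name `glue_endpoints` is hypothetical, and it assumes each GlueServiceEndpoint value fits on one LDIF line (long values that ldapsearch folds onto continuation lines are not rejoined).

```shell
# Read LDIF on stdin and print the value of every GlueServiceEndpoint attribute.
glue_endpoints() {
    sed -n 's/^GlueServiceEndpoint:[[:space:]]*//p'
}

# Example use, with the query taken from the slide above:
#   ldapsearch -x -LLL -h NAGIOS_SERVER -p 2170 -b Mds-Vo-Name=resource,O=grid \
#       "(GlueServiceType=*-NAGIOS)" GlueServiceEndpoint | glue_endpoints
```

An empty result would suggest that the resource BDII is not publishing the Nagios service.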

20 Probes
- SAM Probes
  - grid-monitoring-probes-ch.cern.sam
    - Metrics: MrsCheckDBInserts, MrsCheckDBInsertsDetailed, MrsCheckSpool
    - NodeType: sam-opsnagios
    - Description: monitors whether MRS is receiving metric results
  - nagios-gocdb-downtime
    - Metrics: org.egee.ImportGocdbDowntimes
    - NodeType: sam-nagios, sam-gridmon
    - Description:
  - mrs
    - Metrics: org.egee.MrsCheckMissingProbes, org.egee.CentralMrsCheckMissingProbes, org.egee.SendToMetricStore
    - NodeType: sam-nagios, sam-gridmon
    - Description:

21 Probes
- SAM Probes
  - mddb-synchronizer
    - Metrics: org.egee.MDDBSync
    - NodeType: sam-nagios, sam-gridmon
    - Description:
  - atp
    - Metrics: org.egee.ATPSync
    - NodeType: sam-nagios, sam-gridmon
    - Description: executes atp_synchronizer and verifies that execution completed without errors
  - msg-nagios-bridge
    - Metrics: org.egee.SendToMsg, org.egee.RecvFromQueue, org.egee.CheckConfig
    - NodeType: sam-nagios, sam-gridmon
    - Description:

22 Probes
- EGI Probes
  - grid-monitoring-org.ggus-probes
    - NodeType: sam-nagios, ops-monitor
    - Description: queries GGUS to check for open tickets for an SU, or by site name, that has been notified
  - grid-monitoring-org.nagiosexchange-probes
    - NodeType: sam-nagios, ops-monitor
    - Description: checks from Nagios Exchange
  - grid-monitoring-probes-hr.srce
    - NodeType: sam-nagios
    - Description: CAdist, CertLifetime, DPM, DPNS, GRAM, GridFTP, GridProxy, MyProxy and VOMS probes
  - grid-monitoring-probes-org.sam.sec
    - NodeType: sam-nagios
    - Description: several security probes executed from the WNs

23 Probes
- EMI Probes
  - grid-monitoring-org.activemq-probes
    - NodeType: ops-monitor
    - Description: checks for ActiveMQ Messaging Brokers
  - grid-monitoring-probes-ch.cern
    - NodeType: sam-nagios
    - Description: FTS, LFC and RGMA probes
  - grid-monitoring-probes-org.bdii
    - NodeType: sam-nagios
    - Description: Nagios checks for a WLCG Information System instance (BDII)
  - grid-monitoring-probes-org.ndgf
    - NodeType: sam-nagios
    - Description: ARC-CE, ARC-LFC, ARC-SRM and ARC-RLS probes

24 Probes
- EMI Probes
  - grid-monitoring-probes-org.sam
    - NodeType: sam-nagios
    - Description: CE, CREAMCE, LFC, SRM, WMS and WN probes
  - gstat-validation
    - NodeType: sam-nagios
    - Description: GStat checks

