Co-ordination & Harmonisation of Advanced e-Infrastructures for Research and Education Data Sharing
Research Infrastructures Grant Agreement n

AfricaGrid ROC Monitoring Tools: NAGIOS
Christos Kanellopoulos, GRNET
SAGrid All-Hands Meeting, 26 March 2013

What is SAM?

Service Availability Monitoring (SAM) is a fully distributed monitoring framework with the following features:
- Based on open source systems (Nagios, ActiveMQ, etc.)
- High scalability based on open messaging
- Supports integration of third-party monitoring systems
- Advanced notification and reporting system
- Web interface to visualize service status and availability
- Web REST API to consume service status and availability

Architecture

SAM is made up of several components, some commodity and some designed and developed specifically for SAM:
- Nagios, to execute tests.
- Messaging, to transport test results between components.
- Databases that store configuration information: the Aggregate Topology Provider (ATP) and the Profile Management Database (POEM).
- A database that stores the test results produced by Nagios: the Metric Result Store (MRS).
- The Availability Calculation Engine (ACE), which processes the raw test results to calculate metrics such as site and service availability and reliability.
- A portal, MyWLCG or MyEGI, to visualize both test results and availability calculations.

Architecture

- ACE (Availability Computation Engine) is the engine used for availability and reliability calculations. Starting from the metric results stored in the central database, ACE computes statuses, availabilities, and reliabilities for services, service flavours, and sites.
- ATP (Aggregated Topology Provider) provides topology information by aggregating grid topology information and downtimes from different external sources (GOCDB, OIM, CIC, BDII, GSTAT, feeds).
- POEM (Profile Management Database) describes existing metrics and groups them into profiles in order to run tests. It replaces the former MDDB component.
- MRS (Metric Results Store) stores metric output and computes service statuses. It provides views of data such as site and service status, availability and reliability.
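
To make the ACE description more concrete, here is a minimal sketch of an availability and reliability calculation over hourly statuses. The formulas are an assumption based on the commonly quoted definitions (availability = time up / known time, reliability = time up / known time outside scheduled downtime) and are not taken from these slides; the function name and status labels are hypothetical.

```python
from collections import Counter


def availability_reliability(hourly_statuses):
    """Return (availability, reliability) as fractions in [0, 1].

    hourly_statuses: iterable of status labels, one per hour. Only 'OK'
    counts as up; 'UNKNOWN' hours are excluded from both denominators and
    'SCHEDULED_DOWNTIME' hours are excluded from the reliability one.
    These labels are illustrative, not SAM's internal names.
    """
    counts = Counter(hourly_statuses)
    up = counts["OK"]
    known = sum(counts.values()) - counts["UNKNOWN"]
    unscheduled = known - counts["SCHEDULED_DOWNTIME"]
    availability = up / known if known else 0.0
    reliability = up / unscheduled if unscheduled else 0.0
    return availability, reliability


# One day: 18 h up, 2 h failing, 3 h scheduled downtime, 1 h unknown.
day = ["OK"] * 18 + ["CRITICAL"] * 2 + ["SCHEDULED_DOWNTIME"] * 3 + ["UNKNOWN"]
a, r = availability_reliability(day)
print(f"availability={a:.3f} reliability={r:.3f}")  # availability=0.783 reliability=0.900
```

Under these assumptions scheduled downtime lowers availability but not reliability, which is why the two figures differ for the same day.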

- MyWLCG/MyEGI is the main visualization tool. It presents a grid-aware view of the data collected by the Service Availability Monitoring framework.
- Nagios is the heart of SAM. It schedules tests and forwards the test results via messaging to the components that require them, such as the central Metric Store and the MyEGI portal.
- Probes are contributed by many developers and system managers and are used to test specific grid services.
- Messaging: Apache ActiveMQ is used as an integration framework, adding flexibility, reliability and scalability to the distributed SAM monitoring system.

Nagios Installation

Disable SELinux in /etc/selinux/config:

    SELINUX=disabled

Install the host certificates:

    ls -l /etc/grid-security/host*
    -rw-r--r-- 1 root root 2286 Oct 28 19:26 /etc/grid-security/hostcert.pem
    -r-------- 1 root root  887 Oct 28 19:25 /etc/grid-security/hostkey.pem

Install the YUM and rpmforge packages:

    yum-conf-5X-8.slc5.noarch (or later)
    rpmforge-release el5.rf.x86_64.rpm (or later)

Remove the old lcg-CA repository, if installed:

    rm -f /etc/yum.repos.d/lcg-CA.repo
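
The host certificate step can be verified with a short script. This is a sketch only: it assumes the usual grid convention that the certificate is not writable by group or others and that the private key is accessible by its owner alone. The function name is hypothetical.

```python
import os
import stat


def check_host_credentials(cert_path, key_path):
    """Return a list of problems found; an empty list means the check passed."""
    problems = [f"missing: {p}" for p in (cert_path, key_path) if not os.path.isfile(p)]
    if problems:
        return problems

    cert_mode = stat.S_IMODE(os.stat(cert_path).st_mode)
    key_mode = stat.S_IMODE(os.stat(key_path).st_mode)

    # Assumed convention: certificate may be world-readable but not writable
    # by group or others.
    if cert_mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append(f"{cert_path} is writable by group/others ({oct(cert_mode)})")
    # Assumed convention: private key has no group or other permissions at all.
    if key_mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"{key_path} is accessible by group/others ({oct(key_mode)})")
    return problems
```

On a SAM-Nagios host it would be called as `check_host_credentials("/etc/grid-security/hostcert.pem", "/etc/grid-security/hostkey.pem")`, printing whatever problems are returned.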

Nagios Installation

Configure the following repositories:
- EGI CAs (egi-trustanchors.repo)
- gLite BDII (glite-BDII.repo)
- gLite UI (glite-UI.repo)
- EPEL (epel.repo)
- DAG (dag.repo)
- SAM:

    [egi-sam]
    name=EGI SAM repo
    baseurl=
    enabled=1
    gpgcheck=0
    protect=1
    priority=10

Install yum priorities:

    yum install yum-priorities
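
Because the later steps depend on the EGI SAM repository taking precedence, it can help to check the repository stanza before installing. The sketch below parses a `.repo` stanza with Python's standard `configparser`. The fallback values (enabled by default, priority 99 when unset) reflect yum and yum-priorities defaults as I understand them and should be treated as assumptions; `repo_summary` is a hypothetical helper.

```python
import configparser

# The stanza from the slide; baseurl is left empty, as in the source.
REPO_TEXT = """\
[egi-sam]
name=EGI SAM repo
baseurl=
enabled=1
gpgcheck=0
protect=1
priority=10
"""


def repo_summary(text):
    """Summarise each repository section: enabled flag, priority, baseurl set."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    summary = {}
    for section in parser.sections():
        repo = parser[section]
        summary[section] = {
            "enabled": repo.get("enabled", "1") == "1",
            # Assumed default when no priority is configured.
            "priority": int(repo.get("priority", "99")),
            "has_baseurl": bool(repo.get("baseurl", "").strip()),
        }
    return summary


print(repo_summary(REPO_TEXT))
```

A missing `baseurl` shows up as `has_baseurl: False`, which is a useful reminder to fill it in before running yum.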

Nagios Installation

Installation:

    yum install lcg-CA
    yum install httpd
    yum install nagios.x86_64   # make sure that nagios from the EGI SAM repository is installed
    yum --exclude=\*saga\* --exclude=\*SAGA\* groupinstall 'glite-UI (production - x86_64)'
    yum install sam-nagios

Nagios Configuration

A SAM-Nagios node can be configured in three different ways in order to monitor different sets of sites/services:
- NGI SAM-Nagios: monitors all sites/services belonging to a given region
- Site SAM-Nagios: monitors all sites/services made available by one site
- VO SAM-Nagios: monitors all sites/services that support a given VO (the list of services may or may not be based on a VO feed)

Nagios Configuration

NGI SAM-Nagios

    # Generic
    SITE_NAME=
    SITE_BDII_HOST=
    PX_HOST=
    BDII_HOST=
    VOS="dteam ops"
    VO_OPS_VOMS_SERVERS="vomss://"
    VO_OPS_VOMSES="'ops /DC=ch/DC=cern/OU=computers/ ops 24' 'ops /DC=ch/DC=cern/OU=computers/ ops 24'"
    VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
    VO_DTEAM_VOMS_SERVERS='vomss://'
    VO_DTEAM_VOMSES="'dteam /C=GR/O=HellasGrid/ dteam 24' 'dteam /C=GR/O=HellasGrid/ dteam 24'"
    VO_DTEAM_VOMS_CA_DN="'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"
    VO_DTEAM_WMS_HOSTS=   # put to your NGI WMSes
    VO_OPS_WMS_HOSTS=     # put to your NGI WMSes

    # Nagios

Nagios Configuration

    # DB data
    MYSQL_ADMIN="MY_MYSQL_PASS"
    DB_PASS="MY_MRS_PASS"
    MYEGI_ADMIN_NAME=
    MYEGI_ADMIN_ =
    MYEGI_DEFAULT_PROFILE="ROC"   # profile to be displayed by default in MyEGI
    MYEGI_REGION=

Run YAIM:

    /opt/glite/yaim/bin/yaim -s site-info.def -c -n glite-UI -n glite-NAGIOS
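
Several variables in the site-info.def template are shipped empty and must be filled in before YAIM is run. A minimal sketch of a pre-flight check follows. It assumes the file contains only simple shell-style `NAME=value` assignments with optional trailing comments; the helper names and the sample text are illustrative, and whether a given empty variable is actually mandatory is for the administrator to judge.

```python
import shlex


def parse_site_info(text):
    """Parse simple NAME=value lines; quoted values and trailing # comments are handled."""
    values = {}
    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)
        if len(tokens) != 1 or "=" not in tokens[0]:
            continue  # blank line, comment-only line, or not a simple assignment
        name, _, value = tokens[0].partition("=")
        values[name] = value
    return values


def empty_variables(values):
    """Names of variables that are present but have no value yet."""
    return sorted(name for name, value in values.items() if not value)


SAMPLE = '''\
# Generic
SITE_NAME=
VOS="dteam ops"
VO_DTEAM_WMS_HOSTS= # put to your NGI WMSes
MYEGI_DEFAULT_PROFILE="ROC" # profile to be displayed by default in MyEGI
'''

print(empty_variables(parse_site_info(SAMPLE)))  # ['SITE_NAME', 'VO_DTEAM_WMS_HOSTS']
```

Running this against the real site-info.def before invoking YAIM gives a quick list of values that still need attention.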

Nagios Configuration

Site SAM-Nagios

    # Generic
    SITE_NAME=
    SITE_BDII_HOST=
    PX_HOST=
    BDII_HOST=r
    VOS="dteam ops"
    VO_OPS_VOMS_SERVERS="vomss://"
    VO_OPS_VOMSES="'ops /DC=ch/DC=cern/OU=computers/ ops 24' 'ops /DC=ch/DC=cern/OU=computers/ ops 24'"
    VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
    VO_DTEAM_VOMS_SERVERS='vomss://'
    VO_DTEAM_VOMSES="'dteam /C=GR/O=HellasGrid/ dteam 24' 'dteam /C=GR/O=HellasGrid/ dteam 24'"
    VO_DTEAM_VOMS_CA_DN="'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"
    VO_DTEAM_WMS_HOSTS=

Failover Nagios - configurable hot-standby mode

Starting from Update-13, SAM supports the deployment of hot-standby (active/active) instances. The system works in the following way:
- The backup instance is deployed using the same YAIM configuration, with the added BACKUP_INSTANCE variable described below.
- The SAM administrator opens a GGUS ticket to the SAM/Nagios Support Unit requesting that the backup host be added to the message consumer filter (file nagios-roles.conf).
- The backup instance constantly monitors resources, but with the following differences:
  - alarms are not sent to the Operations portal
  - notifications are disabled
  - results are not sent to the central MRS database
  Note: results are stored in the local MRS, so MyEGI shows the correct history on both instances.
- If the main instance fails, the SAM administrator has to manually switch off the BACKUP_INSTANCE variable on the backup instance.

Failover Nagios - configurable hot-standby mode

A backup instance can be defined in several ways:
- Via the YAIM variable that sets BACKUP_INSTANCE in /etc/sysconfig/ncg (recommended mechanism):

    NCG_BACKUP_INSTANCE=true

- Fast backup configuration without running YAIM:
  - Setting the variable BACKUP_INSTANCE in /etc/sysconfig/ncg (this approach can be used for fast failover):

    BACKUP_INSTANCE=true

  - Setting the global variable in the ncg.conf file:

    BACKUP_INSTANCE=1

  - Using the argument:

    --backup-instance

When the backup is configured without YAIM, the following additional step is needed:

    /sbin/chkconfig send-to-dashboard off
    /sbin/service send-to-dashboard stop

Failover Nagios - configurable hot-standby mode

To turn the backup instance into the active one, the SAM administrator has to remove the BACKUP_INSTANCE variable. If YAIM is not used, the following additional step is needed:

    /sbin/chkconfig send-to-dashboard on
    /sbin/service send-to-dashboard start
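
The fast failover path (editing BACKUP_INSTANCE in /etc/sysconfig/ncg without running YAIM) can be sketched as a small text transformation. The helper below is hypothetical and works on the file contents only. It does not restart anything and does not touch the send-to-dashboard service, which still has to be switched with chkconfig and service as described above.

```python
import re

# Matches any existing BACKUP_INSTANCE assignment, whatever its value.
_ASSIGNMENT = re.compile(r"^\s*BACKUP_INSTANCE\s*=.*$")


def set_backup_instance(sysconfig_text, backup):
    """Return the sysconfig text with BACKUP_INSTANCE set (backup=True) or removed."""
    lines = [line for line in sysconfig_text.splitlines() if not _ASSIGNMENT.match(line)]
    if backup:
        lines.append("BACKUP_INSTANCE=true")
    return "".join(line + "\n" for line in lines)


standby = set_backup_instance("# ncg settings\n", backup=True)
active = set_backup_instance(standby, backup=False)
print(standby, end="")
print(active, end="")
```

Removing the line rather than setting it to false mirrors the instruction to remove the variable when promoting the backup instance.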

Validation

After a successful YAIM run you should be able to access the Nagios web interface at the address

If you enabled local probes, first check that the MyProxy credential works by running the hr.srce.GridProxy-Get-VO metric on NAGIOS_SERVER. You can do this by forcing a scheduled check via the web interface or via the command line:

    nagios-run-check NAGIOS_SERVER hr.srce.GridProxy-Get-VO

The MyEGI interface is at the address:

Check the resource BDII:

    $ ldapsearch -x -LLL -h NAGIOS_SERVER -p b Mds-Vo-Name=resource,O=grid "(GlueServiceType=*- NAGIOS)" GlueServiceEndpoint
    dn: GlueServiceUniqueID=NAGIOS_SERVER_XXXXXX-NAGIOS_ ,Mds-Vo-name= resource,o=grid
    GlueServiceEndpoint:
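
The resource BDII check returns LDIF, which can be post-processed to confirm that a NAGIOS service endpoint is actually published. The sketch below parses `ldapsearch -LLL` style output. It assumes plain (non-base64) attribute values, and the host name and endpoint URL in the sample are hypothetical placeholders rather than values from the slides.

```python
def unfold_ldif(text):
    """Join LDIF continuation lines (lines starting with a single space)."""
    lines = []
    for raw in text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def service_endpoints(ldif_text):
    """Return one dict per entry: {'dn': ..., 'endpoints': [...]}."""
    entries = []
    current = None
    for line in unfold_ldif(ldif_text):
        if not line.strip():
            current = None  # a blank line ends the current entry
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if name.lower() == "dn":
            current = {"dn": value, "endpoints": []}
            entries.append(current)
        elif name == "GlueServiceEndpoint" and current is not None:
            current["endpoints"].append(value)
    return entries


# Hypothetical output for a host called nagios.example.org.
SAMPLE_LDIF = (
    "dn: GlueServiceUniqueID=nagios.example.org-NAGIOS,Mds-Vo-name=resource,\n"
    " o=grid\n"
    "GlueServiceEndpoint: https://nagios.example.org/nagios\n"
)

for entry in service_endpoints(SAMPLE_LDIF):
    print(entry["dn"], entry["endpoints"])
```

An entry with an empty `endpoints` list would indicate that the service is registered but no endpoint is published.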

Probes

SAM Probes

  Metrics: MrsCheckDBInserts, MrsCheckDBInsertsDetailed, MrsCheckSpool
  NodeType: sam-opsnagios
  Description: monitors if MRS is receiving metric results.

nagios-gocdb-downtime
  Metrics: org.egee.ImportGocdbDowntimes
  NodeType: sam-nagios, sam-gridmon
  Description:

mrs
  Metrics: org.egee.MrsCheckMissingProbes, org.egee.CentralMrsCheckMissingProbes, org.egee.SendToMetricStore
  NodeType: sam-nagios, sam-gridmon
  Description:

Probes

SAM Probes

mddb-synchronizer
  Metrics: org.egee.MDDBSync
  NodeType: sam-nagios, sam-gridmon
  Description:

atp
  Metrics: org.egee.ATPSync
  NodeType: sam-nagios, sam-gridmon
  Description: executes atp_synchronizer and verifies that execution completed without errors

msg-nagios-bridge
  Metrics: org.egee.SendToMsg, org.egee.RecvFromQueue, org.egee.CheckConfig
  NodeType: sam-nagios, sam-gridmon
  Description:

Probes

EGI Probes

grid-monitoring-org.ggus-probes
  NodeType: sam-nagios, ops-monitor
  Description: queries GGUS to check for open tickets for a SU or by site name that has been notified

grid-monitoring-org.nagiosexchange-probes
  NodeType: sam-nagios, ops-monitor
  Description: checks from Nagios Exchange

grid-monitoring-probes-hr.srce
  NodeType: sam-nagios
  Description: CAdist, CertLifetime, DPM, DPNS, GRAM, GridFTP, GridProxy, MyProxy and VOMS probes

grid-monitoring-probes-org.sam.sec
  NodeType: sam-nagios
  Description: several security probes executed from the WNs

Probes

EMI Probes

grid-monitoring-org.activemq-probes
  NodeType: ops-monitor
  Description: checks for ActiveMQ Messaging Brokers

  NodeType: sam-nagios
  Description: FTS, LFC and RGMA probes

grid-monitoring-probes-org.bdii
  NodeType: sam-nagios
  Description: Nagios checks for a WLCG Information System instance (BDII)

grid-monitoring-probes-org.ndgf
  NodeType: sam-nagios
  Description: ARC-CE, ARC-LFC, ARC-SRM and ARC-RLS probes

Probes

EMI Probes

grid-monitoring-probes-org.sam
  NodeType: sam-nagios
  Description: CE, CREAMCE, LFC, SRM, WMS and WN probes

gstat-validation
  NodeType: sam-nagios
  Description: GStat checks
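
The EGI and EMI probe listings amount to a mapping from probe package to node type. As a small illustration, the sketch below records the named packages from those listings and selects the ones that apply to a given node type. The dictionary and function names are hypothetical, and the one entry whose package name is missing in the listing is left out.

```python
# Probe package -> node types, transcribed from the EGI and EMI probe listings.
PROBE_PACKAGES = {
    "grid-monitoring-org.ggus-probes": {"sam-nagios", "ops-monitor"},
    "grid-monitoring-org.nagiosexchange-probes": {"sam-nagios", "ops-monitor"},
    "grid-monitoring-probes-hr.srce": {"sam-nagios"},
    "grid-monitoring-probes-org.sam.sec": {"sam-nagios"},
    "grid-monitoring-org.activemq-probes": {"ops-monitor"},
    "grid-monitoring-probes-org.bdii": {"sam-nagios"},
    "grid-monitoring-probes-org.ndgf": {"sam-nagios"},
    "grid-monitoring-probes-org.sam": {"sam-nagios"},
    "gstat-validation": {"sam-nagios"},
}


def packages_for(node_type):
    """Return the sorted probe packages listed for the given node type."""
    return sorted(name for name, types in PROBE_PACKAGES.items() if node_type in types)


for package in packages_for("ops-monitor"):
    print(package)
```

Keeping the mapping as data makes it easy to compare against what is actually installed on a node.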