Use of Nagios in Central European ROC

Emir Imamagic, University Computing Centre (SRCE), Croatia

Overview
- Motivation
- Nagios
- Grid monitoring with Nagios
  - Sensors
  - Configuration management
  - GOCDB integration
- Demo slides
- Future work

Motivation
- Achieve better availability
  - getting notifications as soon as a problem appears
- Simplify maintenance of grid resources
- Complex sensor dependencies
  - enable isolating the problem
  - only relevant notifications are issued
- Report generation
  - availability, problem history
- Visualization & management interface

Nagios

Nagios
- Open source monitoring system
- Widely used & actively developed
- Detection of and recovery from host and service problems
- Provides a set of basic plugins (sensors)
  - easy to develop custom sensors
- No components required on monitored entities
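To illustrate why custom sensors are easy to develop: a Nagios plugin only has to print one line of output and report its result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below is a minimal TCP connect test written under that convention. The script name, the thresholds and the function names are assumptions made for illustration; this is not one of the sensors from the talk.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp_port(host, port, warn_seconds=1.0, timeout=5.0):
    """Return (exit_code, message) for a plain TCP connect test."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Refused connections, timeouts and DNS failures all end up here.
        return CRITICAL, f"CRITICAL - cannot connect to {host}:{port} ({exc})"
    if elapsed > warn_seconds:
        return WARNING, f"WARNING - connect to {host}:{port} took {elapsed:.2f}s"
    return OK, f"OK - connect to {host}:{port} took {elapsed:.2f}s"


def main(argv):
    """Parse HOST and PORT, print the status line, return the exit code."""
    if len(argv) != 3 or not argv[2].isdigit():
        print("UNKNOWN - usage: check_port.py HOST PORT")
        return UNKNOWN
    code, message = check_tcp_port(argv[1], int(argv[2]))
    print(message)
    return code
```

A wrapper script would end with `sys.exit(main(sys.argv))` so that Nagios sees the exit code.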

Objects
- Host
  - physical server, workstation
  - network device (e.g. switch, router)
  - other devices connected to network
- Service
  - service running on host
  - metric associated with the host
- Service must be associated with host
- Objects can be aggregated in groups

Sensor Execution
- Per object sensor arguments
  - adaptive monitoring
  - e.g. timeout
- Per object checking interval
  - each sensor has individual check interval
  - normal vs. problem check interval
- Per object number of rechecks
  - determines state type
- Advanced check scheduling
  - avoiding server overload

Notifications
- Per object configuration
  - list of contacts
  - notification period, states & repeat interval
  - used for authorization
- Contact configuration
  - name and alias
  - host and service notification period, states & mechanism
  - address
  - pager number
- Notification escalations
  - if the problem doesn't get solved, notifications escalate to the next contact levels
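The escalation rule can be sketched as a lookup on the notification counter: the longer a problem stays unsolved, the higher the counter, and each escalation level covers a range of counter values. The field names follow Nagios' escalation definitions (`first_notification`, `last_notification`, with 0 meaning no upper limit). The contact names and ranges are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    first_notification: int  # first notification number this level covers
    last_notification: int   # last number covered; 0 means no upper limit
    contacts: tuple


def contacts_for(notification_number, default_contacts, escalations):
    """Pick the recipients of the n-th notification for an unsolved problem."""
    matched = []
    for esc in escalations:
        below_upper = esc.last_notification == 0 or notification_number <= esc.last_notification
        if notification_number >= esc.first_notification and below_upper:
            matched.extend(esc.contacts)
    # Fall back to the object's own contacts when no escalation applies.
    return tuple(matched) if matched else tuple(default_contacts)


# Hypothetical contact names and ranges, for illustration only.
levels = [
    Escalation(3, 5, ("site-admins", "regional-support")),
    Escalation(6, 0, ("regional-support", "roc-managers")),
]
```

With these levels, notifications 1 and 2 go to the object's own contacts, 3 to 5 also reach regional support, and from 6 onwards the next level is involved.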

States
- Host states
  - Up, Down, Unreachable
- Service states
  - Ok, Warning, Unknown, Critical
- State types
  - soft
    - object has not been rechecked specified number of times
  - hard
    - object has been rechecked specified number of times
    - object recovers from problem state
    - causes notifications & event handlers
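The soft/hard distinction can be sketched as a small state tracker. This is a simplified model, not Nagios' actual scheduler: it treats every recovery as a hard state, ignores repeat notifications, and only notifies on recovery if the problem had already become hard. The class and method names are made up for the example; `max_check_attempts` mirrors the per-object recheck count.

```python
from dataclasses import dataclass

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


@dataclass
class ServiceState:
    max_check_attempts: int = 3
    status: int = OK
    state_type: str = "HARD"
    attempt: int = 1

    def record(self, new_status):
        """Apply one check result; return True when a notification is due."""
        was_hard_problem = self.status != OK and self.state_type == "HARD"
        if new_status == OK:
            self.status, self.state_type, self.attempt = OK, "HARD", 1
            return was_hard_problem  # recovery is announced only after a hard problem
        if self.status == OK:
            self.attempt = 1         # first failed check
        elif not was_hard_problem:
            self.attempt += 1        # recheck of a soft problem
        self.status = new_status
        if was_hard_problem:
            return False             # already notified; repeat interval not modelled
        if self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            return True
        self.state_type = "SOFT"
        return False
```

With the default of three attempts, two failed checks leave the service in a soft state without notifying anyone. The third failure makes the state hard and triggers the notification.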

Object Hierarchy
- Implicit dependency
  - service depends on associated host
- Host hierarchy
  - if parent is not OK, don't send notifications for children (hosts and services)
  - Unreachable state
  - e.g. router is parent for all hosts on specific site
- Service dependency
  - defines in which cases checks & notifications are performed
  - one host/service can depend on multiple hosts/services
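The host hierarchy rule can be sketched as follows. A host whose check fails is Down if it has no parents or at least one parent is Up, and Unreachable if every parent is itself Down or Unreachable. Only Down hosts notify, so a failed router produces one notification instead of one per host behind it. The host names are hypothetical, the parent graph is assumed to have no cycles, and the notification rule is this sketch's simplification.

```python
UP, DOWN, UNREACHABLE = "UP", "DOWN", "UNREACHABLE"


def classify_hosts(parents, responding):
    """Map every host to UP, DOWN or UNREACHABLE.

    parents: host -> list of parent hosts (empty list for top-level hosts)
    responding: set of hosts whose own check succeeded
    """
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if host in responding:
            result = UP
        else:
            host_parents = parents.get(host, [])
            if not host_parents or any(state_of(p) == UP for p in host_parents):
                result = DOWN         # the host itself is the problem
            else:
                result = UNREACHABLE  # the problem is further up the tree
        states[host] = result
        return result

    for host in parents:
        state_of(host)
    return states


def should_notify(host, states):
    """Unreachable children stay quiet; only the real culprit notifies."""
    return states[host] == DOWN


# Hypothetical site layout: one router in front of two grid nodes.
site = {
    "router.site-a": [],
    "ce.site-a": ["router.site-a"],
    "se.site-a": ["router.site-a"],
}
```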

Dynamic Operations
- Modifying monitoring & notification behavior
  - acknowledging problems
  - enabling/disabling notifications
  - enabling/disabling active checks
- Executing sensors
  - individual service
  - all services on single host
- Scheduling downtimes
- Achieved via web interface or command pipe
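Outside the web interface, these operations are issued by writing one-line commands of the form `[timestamp] COMMAND;arg1;arg2;...` to Nagios' external command file. The sketch below formats two such commands. `DISABLE_SVC_NOTIFICATIONS` and `SCHEDULE_HOST_DOWNTIME` are standard Nagios external commands. The helper names are invented for the example, and the location of the command file is installation-specific, so it is passed in as a parameter.

```python
import time


def external_command(name, *args, now=None):
    """Format one line for the Nagios external command file."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] " + ";".join([name, *map(str, args)]) + "\n"


def disable_service_notifications(host, service, now=None):
    return external_command("DISABLE_SVC_NOTIFICATIONS", host, service, now=now)


def schedule_host_downtime(host, start, end, author, comment, now=None):
    # fixed=1: downtime starts and ends exactly at the given times.
    # trigger_id=0: not triggered by another downtime.
    # duration is only used for flexible downtime but must still be supplied.
    return external_command(
        "SCHEDULE_HOST_DOWNTIME", host, int(start), int(end), 1, 0,
        int(end - start), author, comment, now=now,
    )


def submit(line, command_file):
    """Write one command to the pipe configured as command_file in nagios.cfg."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```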

Web Interface
- Viewing current information, history and reports
- Performing dynamic operations
- Generating reports
  - availability, problem trends & history
- Supports authentication & authorization (AA)
  - per host/service authorization

Other Features
- Event handling
  - enables automatic failure recovery
- Active vs. passive checks
  - active: controlled by Nagios
  - passive: submitted by other systems or another Nagios instance
- Distributed deployment
  - multiple Nagios servers
  - individual instance submits results as passive checks to central
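A passive result reaches Nagios as a `PROCESS_SERVICE_CHECK_RESULT` external command carrying the host name, service description, return code and plugin output. The sketch below only builds that line. How it travels from another system or Nagios instance to the central server is left out, and the function name is an assumption.

```python
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def passive_service_result(host, service, return_code, output, now=None):
    """Build a PROCESS_SERVICE_CHECK_RESULT line for the external command file."""
    if return_code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError(f"invalid return code: {return_code}")
    timestamp = int(time.time() if now is None else now)
    # A newline ends the command, so multi-line output is folded into one line.
    one_line = " ".join(output.splitlines())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{one_line}\n"
    )
```

The service description has to match a service defined on the receiving server, with passive checks enabled for it.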

Grid monitoring with Nagios

History
- CRO-GRID Infrastructure
  - since mid 2005
  - covers several grid middleware (Globus Toolkit Pre-WS & WS, UNICORE, NWS, etc.)
  - event handlers for automatic recovery
- Monitoring Central European (CE) core services
  - since mid 2006
- Monitoring all CE sites for 1st line support
  - since September 2006
- Also used for certification with forced check

Deployment
- Centralized deployment
- Single Nagios server deployed @ SRCE
  - URL:
- Monitoring statistics
  - 65 hosts
  - 480 services
- Nagios server statistics (last month)

Supported Node Types
Node type | Number of services
BDII | 1
CE | 8
LFC | 2
MON | 3
PROX |
RB | 7
SE | 9
VOMS | 4
WMS |

Nagios Basic Sensors
Sensor | Description | Used interval
check_ftp | checks FTP server; used for GridFTP ping | 15 min
check_http | checks HTTP server; used for checking Tomcat on MON and VOMS |
check_ldap | checks LDAP server for defined base DN; used for checking BDII, Globus MDS and GridICE |
check_tcp | checks defined TCP port; used for DPNS ping |

Developed Sensors
Sensor | Description | Used interval
CA distribution | checks CA distribution version | 1 day
Certificate lifetime | uses GridFTP or HTTPS to fetch server certificate & verifies lifetime |
DPNS list | lists /dpm directory and looks for the remote server's domain | 1 hour
EDG Broker | submits a test job, waits for the job to finish, fetches and verifies the output |
Gatekeeper ping | performs authorization only | 15 min
Gatekeeper hostname | executes hostname and verifies the output |
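The verification half of the Certificate lifetime sensor can be sketched as below. It takes the certificate's `notAfter` value in the textual form Python's `ssl` module reports and maps the remaining lifetime to a Nagios status. Fetching the certificate over GridFTP or HTTPS is not shown. The 30-day and 7-day thresholds are assumptions, not values from the talk.

```python
import ssl
import time

OK, WARNING, CRITICAL = 0, 1, 2


def lifetime_status(not_after, now=None, warn_days=30, crit_days=7):
    """Map a certificate's notAfter string to (status, message).

    not_after uses the format found in ssl.SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    now = time.time() if now is None else now
    days_left = (ssl.cert_time_to_seconds(not_after) - now) / 86400
    if days_left < 0:
        return CRITICAL, f"CRITICAL - certificate expired {-days_left:.0f} days ago"
    if days_left < crit_days:
        return CRITICAL, f"CRITICAL - certificate expires in {days_left:.0f} days"
    if days_left < warn_days:
        return WARNING, f"WARNING - certificate expires in {days_left:.0f} days"
    return OK, f"OK - certificate valid for {days_left:.0f} more days"
```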

Developed Sensors
Sensor | Description | Used interval
Gatekeeper LRMS | executes command through LRMS | 2 hours
GridFTP transfer | transfers file to remote computer and back and compares copies | 1 hour
LFC list | lists /grid directory | 15 min
Match list | CE: matches CE against multiple RBs; RB: compares number of matches with data from BDII |
MyProxy | creates proxy certificate, gets the proxy info and destroys it |

Developed Sensors
Sensor | Description | Used interval
SRM ping | performs SRM ping with glite-srm-ping | 15 min
SRM transfer | transfers file to remote computer and back and compares copies | 1 hour
VOMS Proxy | creates VOMS proxy for given VO |
VOMS Gridmap | creates gridmap file for given VO and reports number of users |
WMS | same as EDG Broker, uses glite-job-* |
WMProxy delegation | delegates proxy to WMProxy |
WMProxy | same as EDG Broker, uses glite-job-wms-* |

Sensor Hierarchy
Parent Service | Child Service
DPNS ping | DPNS list
Gatekeeper ping | Gatekeeper hostname, Gatekeeper LRMS
GridFTP ping | GridFTP transfer, CA Distribution
Tomcat | Certificate lifetime
SRM ping | SRM transfer
VOMS Tomcat | VOMS Gridmap
WMProxy delegation | WMProxy
- Host hierarchy
  - SRCE is parent to all hosts
- Parent services
  - lightweight
  - more frequent (15 min)
- Child services
  - heavyweight & complex
  - less frequent (1 hour)
- Less overhead on monitored objects!
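The effect of the parent/child split can be sketched as a filter: the expensive child sensors are only worth running while their lightweight parent reports OK. The mapping below uses three of the parent/child pairs from the slide. The function itself is an illustration of the idea, not how Nagios evaluates service dependencies internally.

```python
OK = 0

# Parent service -> child services (a subset of the slide's table).
CHILDREN = {
    "DPNS ping": ["DPNS list"],
    "Gatekeeper ping": ["Gatekeeper hostname", "Gatekeeper LRMS"],
    "SRM ping": ["SRM transfer"],
}


def checks_to_run(parent_results):
    """Return the child services worth checking.

    parent_results: parent service name -> latest Nagios status code (0 = OK).
    Children of a failed or not yet checked parent are skipped, which
    spares the monitored node the heavyweight tests.
    """
    runnable = []
    for parent, children in CHILDREN.items():
        if parent_results.get(parent) == OK:
            runnable.extend(children)
    return runnable
```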

Complex Sensors
- Case when one service (target) depends on another service (mediator)
  - e.g. submitting job through grid scheduler to a specific CE, storing file through LFC to SE
- Sensor can use any available mediator service
- We developed Nagios interface for retrieving list of available mediators
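The mediator idea can be sketched as a loop over the available mediators. If a mediator itself fails, the sensor moves on to the next one instead of blaming the target. The `probe` callable and the `MediatorError` exception are hypothetical stand-ins for a real job submission or file transfer, and reporting UNKNOWN when no mediator works is a design choice of this sketch.

```python
UNKNOWN = 3


class MediatorError(Exception):
    """Raised by a probe when the mediator, not the target, is at fault."""


def check_via_mediators(target, mediators, probe):
    """Test `target` through the first mediator that works.

    probe(mediator, target) returns (status, output) for the target,
    or raises MediatorError if the mediator could not be used.
    """
    failures = []
    for mediator in mediators:
        try:
            status, output = probe(mediator, target)
        except MediatorError as exc:
            failures.append(f"{mediator}: {exc}")
            continue
        return status, f"{output} (via {mediator})"
    detail = "; ".join(failures) or "no mediators available"
    return UNKNOWN, f"UNKNOWN - could not test {target}: {detail}"
```

Keeping mediator failures separate from target failures means a broken scheduler does not make every CE behind it look critical.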

Configuration Management
- GOCDB
- Static configuration
  - e.g. nodes which are not in GOCDB, special contacts
- BDII
  - retrieving site-specific data (e.g. queue names, ports)
- Commands
  - more site-specific data
  - e.g. check_ping, check_ldap

GOCDB Integration
- Site information
  - nodes
  - node types
  - site BDII
  - site contact
  - Site Admins (for web interface authorization)
- Scheduled downtimes
  - data pulled 3 times a day
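One way to picture the integration is a generator that turns each node from the site information into a Nagios host object. The sketch below renders a `define host` block using standard host directives. The input dictionary keys, the `generic-host` template name and the use of node types as host group names are assumptions. The format of the GOCDB data itself is not shown in the slides.

```python
def host_definition(node, site, parent=None, template="generic-host"):
    """Render one Nagios host object for a node taken from site information.

    node: dict with the hypothetical keys "hostname" and "node_types".
    The host groups named after node types are assumed to be defined elsewhere.
    """
    lines = [
        "define host {",
        f"    use         {template}",
        f"    host_name   {node['hostname']}",
        f"    alias       {site} {'/'.join(node['node_types'])}",
        f"    address     {node['hostname']}",
        f"    hostgroups  {','.join(node['node_types'])}",
    ]
    if parent:
        # e.g. the monitoring server itself, as in "SRCE is parent to all hosts"
        lines.append(f"    parents     {parent}")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Regenerating these blocks on a schedule and reloading Nagios would keep the monitored set in step with the site database.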

Web Interface
- Authentication
  - we added certificate-based authentication
- Authorization
  - site admins can perform operations on their own sites only
  - region admin can perform operations on all sites
  - super admin can perform global Nagios operations
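The three authorization levels can be sketched as a single permission check. The role names and the shape of the user record are made up for the example. In the deployed setup the site admin lists come from GOCDB and the identity from the user's certificate.

```python
def may_operate(user, site=None):
    """Decide whether `user` may perform an operation.

    user: dict with the hypothetical keys "role" and "sites".
    site: the site the operation targets, or None for a global
          Nagios operation.
    """
    role = user["role"]
    if role == "super_admin":
        return True
    if site is None:
        return False  # global operations are reserved for the super admin
    if role == "region_admin":
        return True
    if role == "site_admin":
        return site in user.get("sites", ())
    return False
```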

Demo slides

Future Work
- Further sensor development
- Passive checks
  - using other monitoring systems (e.g. Ganglia, Gstat)
- Distributed deployment
  - Nagios per region/country
  - redundant servers
  - cluster for sensor execution

Thank You
Questions?

Links
- CE Nagios monitoring site: http://cs-egee.srce.hr/nagios
- CE Nagios documentation
- Nagios official web page
