Nagios Demonstration Tom Wlodek SLAC Tier2 workshop
BNL Nagios Hardware (current)
[Diagram: RT/AT machine (Rt-racf.bnl.gov); Nagios server (rnagios01); NRPE server inside the firewall (Gridmon…); NRPE server outside the firewall (Grid02.usatlas.org); several NRPE servers on monitored machines.]
RT, Nagios and AT
Problems reported to RT are reflected in the asset’s history.
Information about assets stored in AT is used by Nagios to monitor the BNL machines and services, and to keep an up-to-date list of the administrators who are to be notified in case of problems.
If a critical machine or service fails, Nagios notifies experts and/or opens an RT problem report to keep track of the problem resolution.
RT can exchange problem reports with external ticketing systems (OSG Footprints).
[Diagram: RT, Nagios, AT, OSG Footprints, and the machines and services monitored by Nagios.]
No AT support anymore!!!
Coming changes to Nagios
The Nagios server will be split into two:
  internal RACF server (BNL stuff)
  external server (Tier2/3, OSG services, USAtlas)
The Nagios split has been delayed by a lack of suitable hardware, but I hope that the problems have been fixed now.
Once the split is completed, the Tier2 admins will be given Nagios administrator rights.
Future Hardware
[Diagram: RT machine (Rt-racf.bnl.gov); Nagios internal server; Nagios external server; NRPE server inside the firewall (Gridmon…); NRPE server outside the firewall (Grid02…); NRPE servers on monitored machines.]
Current Nagios in a nutshell
Bookmark this page and visit it often.
We are currently monitoring ~500 services on ~260 hosts, and counting…
Service dependencies
[Diagram: parent services and a child service.]
Service dependencies
Currently some service dependencies are defined in Nagios.
More need to be defined/discovered.
Discovering and declaring service dependencies is a never-ending task.
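To illustrate why declared dependencies matter, here is a minimal sketch of how a parent/child map lets you reduce a burst of alerts to its likely root cause. It is not how Nagios itself evaluates dependencies; the function name, the service names, and the dependency map are all hypothetical.

```python
from typing import Dict, List, Set


def root_causes(parents: Dict[str, List[str]], failing: Set[str]) -> Set[str]:
    """Return the failing services that have no failing ancestor."""

    def has_failing_ancestor(service: str, seen: Set[str]) -> bool:
        for parent in parents.get(service, []):
            if parent in seen:
                continue  # already visited; also guards against cycles
            seen.add(parent)
            if parent in failing or has_failing_ancestor(parent, seen):
                return True
        return False

    return {s for s in failing if not has_failing_ancestor(s, {s})}


# Hypothetical dependency map: child service -> list of parent services.
parents = {
    "gridftp": ["network"],
    "storage-door": ["gridftp", "nfs"],
}
failing = {"network", "gridftp", "storage-door"}

print(sorted(root_causes(parents, failing)))  # ['network']
```

With the map in place only "network" needs an expert's attention; without it, all three services would raise separate alarms.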
False alarms in Nagios
Sometimes probes report false alarms.
Many of those false alarms were caused by problems in the BNL firewalls. We eliminated them by adding a second network interface to the Nagios server.
Some false alarms still persist, probably still caused by the firewalls. They are hard to eliminate.
I am working on making the probes smarter. A fix to the BNL firewalls should also bring relief.
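One possible way to make a probe "smarter" is to retry it a few times before reporting a failure, so that a single dropped connection does not raise an alarm. The sketch below is an assumption about what such a probe could look like, not the approach actually taken at BNL; the function name and its parameters are hypothetical. The exit codes 0 (OK) and 2 (CRITICAL) are the standard Nagios plugin return codes.

```python
import time
from typing import Callable

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_with_retries(probe: Callable[[], bool],
                       attempts: int = 3,
                       delay_s: float = 0.0) -> int:
    """Report CRITICAL only if every one of `attempts` tries fails."""
    for attempt in range(attempts):
        if probe():
            return OK
        if attempt < attempts - 1:
            time.sleep(delay_s)  # wait before the next try
    return CRITICAL


# Hypothetical flaky probe: fails twice, then succeeds.
outcomes = iter([False, False, True])
print(check_with_retries(lambda: next(outcomes)))  # 0
```

The trade-off is latency: a real outage is reported only after all attempts have failed, so the number of attempts and the delay should stay small.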
Nagios – “Tactical overview”
Visit this page daily, especially if you are a member of the management group or an operator.
We need to formalize the Nagios operations
1. Operators should monitor the “Tactical overview” page for new alerts and notify experts if they see one.
2. Upon receiving a Nagios alert (by e-mail and/or pager and/or operator call), the expert should visit the Nagios page and acknowledge the problem.
3. The expert should then take ownership of the corresponding RT ticket and check the status of the parent service (if applicable).
4. Close the RT ticket, if applicable.
5. Reschedule a new test of the Nagios service to clear the alert from the Nagios page.
6. Fix the problem and leave a record of the solution in RT.
7. Delete comments from the Nagios page (if applicable).
Useful things to know
How to schedule a shutdown of a service or group of services?
How to disable checks for a particular service or group of services?
How to stop notifications for a service/service group?
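Besides the web interface, these three actions can be driven through the Nagios external command file, assuming external commands are enabled on the server. The sketch below only formats the command lines; the helper names, host, service, and author are hypothetical, and where the command file lives depends on the installation. The command names SCHEDULE_SVC_DOWNTIME, DISABLE_SVC_CHECK and DISABLE_SVC_NOTIFICATIONS are standard Nagios external commands.

```python
import time


def external_command(name, *args, now=None):
    """Format one line for the Nagios external command file."""
    ts = int(time.time() if now is None else now)
    return "[{}] {};{}\n".format(ts, name, ";".join(str(a) for a in args))


def schedule_service_downtime(host, service, start, duration_s,
                              author, comment, now=None):
    """Fixed downtime (fixed=1, trigger_id=0) starting at `start` (epoch seconds)."""
    end = start + duration_s
    return external_command("SCHEDULE_SVC_DOWNTIME", host, service,
                            start, end, 1, 0, duration_s, author, comment,
                            now=now)


def disable_service_check(host, service, now=None):
    return external_command("DISABLE_SVC_CHECK", host, service, now=now)


def disable_service_notifications(host, service, now=None):
    return external_command("DISABLE_SVC_NOTIFICATIONS", host, service, now=now)


# Hypothetical host, service and author; times are epoch seconds.
print(schedule_service_downtime("host1", "gridftp", 2000, 3600,
                                "operator", "planned maintenance", now=1000),
      end="")
print(disable_service_check("host1", "gridftp", now=1000), end="")
print(disable_service_notifications("host1", "gridftp", now=1000), end="")
```

Each returned line would be appended to the command file configured on the Nagios server.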
We need to formalize the Nagios operations (cont'd)
We need to enforce two rules:
  No abandoned RT tickets (mostly works OK)
  No unacknowledged Nagios alarms
Acknowledged problems should remain acknowledged for at most a time T (one week???). After that they ought to be fixed or removed from Nagios. The length of the interval T is negotiable, but we should agree on some number.
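The time limit T could be enforced with a periodic check along these lines. This is a minimal sketch: the function name and the input mapping are hypothetical, how the acknowledgement times would be obtained from Nagios is left open, and one week is only the tentative value of T.

```python
from datetime import datetime, timedelta

MAX_ACK_AGE = timedelta(days=7)  # "T"; one week is only the tentative value


def stale_acknowledgements(acks, now, max_age=MAX_ACK_AGE):
    """Return the names of problems acknowledged longer ago than max_age.

    `acks` maps a problem name to the time it was acknowledged.
    """
    return sorted(name for name, acked_at in acks.items()
                  if now - acked_at > max_age)


# Hypothetical acknowledged problems and dates.
acks = {
    "host-a/ssh": datetime(2008, 1, 1),
    "host-b/gridftp": datetime(2008, 1, 9),
}
print(stale_acknowledgements(acks, now=datetime(2008, 1, 10)))  # ['host-a/ssh']
```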
RSV probes and Nagios
There are 3 ways to integrate RSV probes into Nagios:
1. Run the RSV probe directly from Nagios. This can be done (and is done) for simple probes; more complex ones will hit the Nagios timeout.
2. Make the RSV probes report their results to a central OSG database, and make Nagios read the database. The RSV authors do not seem to like it.
3. Make the RSV probes report directly to Nagios. The BNL security experts do not like it, since it would imply changing the current authentication methods.
So….
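Method 1 can be sketched as a thin wrapper that runs the probe as a subprocess and reports UNKNOWN if the probe exceeds a time budget. The wrapper assumes the probe already follows the Nagios plugin exit code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), which is an assumption about RSV probes; the function name and the stand-in probe command are hypothetical.

```python
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes


def run_probe(argv, timeout_s):
    """Run a probe command; return (nagios_code, first line of its output)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return UNKNOWN, "probe timed out after {}s".format(timeout_s)
    except OSError as exc:
        return UNKNOWN, "could not start probe: {}".format(exc)
    # Assumption: the probe uses Nagios-style exit codes; anything else is UNKNOWN.
    code = proc.returncode if proc.returncode in (OK, WARNING, CRITICAL) else UNKNOWN
    lines = proc.stdout.strip().splitlines()
    message = lines[0] if lines else "no output"
    return code, message


# Stand-in for a real probe: a tiny Python command that prints one line.
code, message = run_probe([sys.executable, "-c", "print('probe passed')"], 10)
print(code, message)
```

A complex probe that runs longer than `timeout_s` shows up as UNKNOWN rather than blocking the check.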
We will combine methods 2 and 3
[Diagram: RSV probes running in OSG land; Db; BNL firewall; Interface; Nagios.]
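One way the interface in this picture could work is to read probe results from the database and hand them to Nagios as passive service check results. The sketch below is an assumption about the design, not a description of the actual interface: an in-memory SQLite table stands in for the database, and the table layout, host name, and probe name are hypothetical. The PROCESS_SERVICE_CHECK_RESULT line format is the standard Nagios external command for passive results.

```python
import sqlite3


def passive_result_line(timestamp, host, service, return_code, output):
    """Format one passive service check result for Nagios."""
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}\n".format(
        timestamp, host, service, return_code, output)


def export_results(conn):
    """Turn every stored probe result into a passive check line, oldest first."""
    rows = conn.execute(
        "SELECT ts, host, service, status, detail "
        "FROM probe_results ORDER BY ts")
    return [passive_result_line(*row) for row in rows]


# Hypothetical stand-in for the results database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE probe_results "
             "(ts INTEGER, host TEXT, service TEXT, status INTEGER, detail TEXT)")
conn.execute("INSERT INTO probe_results VALUES (?, ?, ?, ?, ?)",
             (1200000000, "gridhost.example.org", "rsv-example-probe",
              0, "OK: probe passed"))

for line in export_results(conn):
    print(line, end="")
```

Because Nagios only consumes lines produced on the BNL side, the probes themselves never need to authenticate to the Nagios server.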
I need feedback from you!
1. What should be monitored?
2. Who should be on the call list?
3. What should the notification policy be? E-mail? Pager?
4. Should we define event handlers to correct common error conditions? Do you want/need them?
5. Etc… etc…