Distributed Monitoring with Nagios: Past, Present, Future
Mike Guthrie
2011

Distributed Monitoring Introduction
- Basic definition: splitting up your monitoring server over multiple machines
- Why use distributed monitoring?
  - Multiple sites with firewall restrictions
  - Large installations that exceed the CPU and memory resources that a single machine can offer
Understanding CPU Limitations
- The primary task of the Nagios Core engine is to schedule checks
- Example monitoring server: 1000 hosts, 4 services per host, 5-minute interval
  - Checks per interval = 1000 host checks + 4000 service checks = 5000
  - Check load = (5000 checks / 5 minutes) / 60 seconds per minute
  - About 16.7 checks per second
- In 1 second: about 16 scripts or binary processes are being launched, with about 16 sets of results coming in and being processed by Nagios and written to disk
- When the check schedule exceeds CPU limitations, you get “check latency”
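
The check-load arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function name and parameters are illustrative, and it assumes the 5000 checks are one host check per host plus one check per service in every interval.

```python
def checks_per_second(hosts, services_per_host, interval_minutes):
    """Average number of checks the scheduler must launch each second."""
    # Assumption: one host check per host plus one check per service, once per interval.
    checks_per_interval = hosts + hosts * services_per_host
    return checks_per_interval / (interval_minutes * 60)


load = checks_per_second(hosts=1000, services_per_host=4, interval_minutes=5)
print(f"{load:.1f} checks per second")
```

Changing the interval or the services-per-host figure shows how quickly the per-second load grows before any plugin has even started running.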
Picking the Right Distributed Model
- Pick the right model for your environment
- Think logistics: PLAN before implementation
  - Every hour spent planning logistics will save tens or even hundreds of man-hours later on
  - A 30-minute task on 1 server = 5 hours on 10 servers (30 minutes × 10 = 300 minutes)
- Consider how to effectively view information across multiple machines
  - As data quantity increases, discerning useful information from it becomes more important
  - Viewing 10,000 hosts and 50,000 services on a page is too much raw data to be effective information
The Classic Distributed Model
- Distributed servers running active checks, forwarding results to a central server
[Diagram: several distributed servers run active checks and forward results after every check to a central server (passive only)]
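
One way a distributed server might package a result for forwarding is sketched below. The slides do not name a transport; this assumes the NSCA add-on, whose send_nsca client reads one tab-delimited line per service result (host name, service description, return code, plugin output). The function name and the sample host and service are hypothetical.

```python
# Standard Nagios plugin return codes.
STATE_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_passive_result(host, service, state, output):
    """Build one tab-delimited service result line (assumed send_nsca input format)."""
    # Tabs and newlines separate fields and records, so flatten them in the plugin output.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{STATE_CODES[state]}\t{clean}\n"


line = format_passive_result("web01", "HTTP", "CRITICAL", "Connection refused")
print(repr(line))
```

The central server only ever sees these passive results, which is why host and service names have to match on both sides.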
The Classic Distributed Model
- Central monitoring vs. central viewing?
- OCSP (obsessive compulsive service processor) vs. event handlers
  - OCSP runs after every check
  - Event handlers run only on state changes
- Freshness checking ensures current data
- Child servers can also do local monitoring without forwarding results
- Distributed servers can also receive passive checks and forward them along, creating a multi-level tree structure
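
The difference between the two forwarding hooks, and the role of freshness checking, can be illustrated with a small simulation. This is a simplified sketch, not Nagios code: it ignores soft and hard state types, and the function names and sample history are hypothetical.

```python
def count_triggers(states):
    """Count how often each hook would fire for a sequence of check results."""
    ocsp_runs = 0
    handler_runs = 0
    previous = None
    for state in states:
        ocsp_runs += 1  # OCSP: runs after every check
        if previous is not None and state != previous:
            handler_runs += 1  # event handler: runs only when the state changes
        previous = state
    return ocsp_runs, handler_runs


def is_stale(last_result_time, freshness_threshold, now):
    """True when the newest passive result is older than the threshold (seconds)."""
    return now - last_result_time > freshness_threshold


history = ["OK", "OK", "CRITICAL", "CRITICAL", "OK", "OK"]
print(count_triggers(history))
print(is_stale(last_result_time=0, freshness_threshold=600, now=900))
```

With event handlers the central server hears from a child only on changes, so a silent child looks the same as a healthy one; a freshness test like `is_stale` is what lets the central server notice that results have stopped arriving.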
The Classic Distributed Model
- Strengths:
  - Well tested, well documented, proven solution
  - All built into the Nagios Core package
  - Extremely flexible for checks, performance graphing, notifications, etc.
  - Can be combined with other distributed models
- Challenges:
  - Maintaining configs on multiple machines
  - Which server issued the check?
  - Where to process/view performance data?
The Classic Distributed Model
- Workarounds:
  - Use SVN, rsync, or cron to automatically maintain host and service configs on both distributed and central servers
  - Use templating as much as possible
    - Read the Core docs on “Object Inheritance”
    - Keep template definitions separate
  - Use naming conventions to keep configs organized
- Nagios XI distributed tools:
  - Inbound and Outbound Checks
  - Unconfigured Objects
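
As an illustration of the templating advice, host definitions can be generated from a short inventory so that each object carries only what is unique to it and inherits the rest through the `use` directive. This is a sketch under assumptions: the template name `generic-host`, the host names, and the addresses are placeholders, and the template itself is expected to live in a separate file.

```python
def render_host(name, address, template="generic-host"):
    """Render one host object that inherits all other settings from a template."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


# Hypothetical inventory; the site prefix follows a naming convention.
inventory = [("nyc-web01", "192.0.2.10"), ("nyc-db01", "192.0.2.20")]
config = "\n".join(render_host(name, address) for name, address in inventory)
print(config)
```

Because the same generated file can be pushed to both the distributed and the central server, the two sides stay in agreement on host names without hand editing.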
The Cluster Model – Nagios Load Balancing
- Nagios checks are managed by a sub-process and distributed evenly across multiple servers
- Works like a load balancer
- Two popular examples:
  - DNX: Distributed Nagios eXecutor
  - Mod Gearman
- Check results and configs are all managed at the central server
The Cluster Model – DNX
- DNX: How it works
  - When a check is scheduled to execute, the job is passed to a worker node
  - The worker node executes the check and sends results directly to the results queue
  - Checks are not associated with any particular worker node
  - Bypasses the nagios.cmd pipe to eliminate a potential bottleneck
  - If a worker goes down, all checks continue
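
The property that checks are not tied to any worker can be sketched with a toy dispatcher. This is not DNX's actual protocol; a plain round-robin over the workers that are up is swapped in for illustration, and all names are hypothetical.

```python
from itertools import cycle


def dispatch(checks, workers, down=()):
    """Hand each check to the next worker that is up; no check is pinned to a node."""
    alive = [worker for worker in workers if worker not in down]
    if not alive:
        raise RuntimeError("no worker nodes available")
    rotation = cycle(alive)
    return {check: next(rotation) for check in checks}


checks = ["web01/HTTP", "web01/PING", "db01/MySQL", "db01/PING"]
print(dispatch(checks, ["worker-a", "worker-b"]))
print(dispatch(checks, ["worker-a", "worker-b"], down={"worker-b"}))
```

In the second call every check still gets an owner even though one worker is down, which is the behavior the slide describes.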
The Cluster Model – DNX
- DNX strengths:
  - Central configuration management
  - Checks redistributed if a worker is down
  - Worker nodes can be added at any time
- Challenges:
  - Performance data is still handled at the central server
  - If the master goes down, all checks cease
The Cluster Model – Mod Gearman
- Strengths:
  - Central configuration management
  - Checks can be split by hostgroups or servicegroups, which is useful if groups are located in different network segments
- Challenges:
  - Performance data is still handled at the central server
  - If the master goes down, all checks cease
  - Effectively viewing 10k+ services on a single machine
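
Splitting checks by hostgroup amounts to choosing a queue per check, so that only workers inside a given network segment pick up the checks for that segment. The sketch below shows the idea only; the queue names, the group names, and the first-match rule are illustrative assumptions, not Mod Gearman's documented behavior.

```python
def pick_queue(hostgroups, routed_groups, default_queue="service"):
    """Return the queue for a check: the first routed hostgroup wins, else the default."""
    for group in hostgroups:
        if group in routed_groups:
            return f"hostgroup_{group}"
    return default_queue


# Hypothetical groups that have dedicated workers in their own network segment.
routed = {"dmz", "branch-office"}
print(pick_queue(["linux", "dmz"], routed))
print(pick_queue(["linux"], routed))
```

A worker placed behind the DMZ firewall would then subscribe only to the DMZ queue, while general-purpose workers drain the default queue.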
The Central Dashboard Model
- Checks are executed and managed on multiple distributed servers
- Central viewer unifies all servers
- Central viewer polls data from each server and displays tactical data in the UI
- Examples:
  - Nagios Fusion
  - MNTOS
  - check_MK Multisite
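
The aggregation step of a central viewer can be sketched as summing per-server state counts into one tactical overview. How each product actually polls its servers is not described here, so the polled data below is a hypothetical stand-in, as are the server names.

```python
from collections import Counter


def tactical_overview(servers):
    """Sum per-server service state counts into one overview."""
    total = Counter()
    for counts in servers.values():
        total.update(counts)  # adds counts key by key
    return dict(total)


# Hypothetical results of polling two monitoring servers.
polled = {
    "nagios-east": {"ok": 950, "warning": 30, "critical": 20},
    "nagios-west": {"ok": 1200, "warning": 10, "critical": 5, "unknown": 2},
}
print(tactical_overview(polled))
```

Only these summary counts cross the network, which is why a dashboard of this kind needs very little CPU compared with a server that runs the checks itself.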
The Central Dashboard Model: Nagios Fusion
- Displays tactical overview for each server
- Monitoring and object configurations compartmentalized to each server
- Good for geographically distributed servers where local management is required
- Unified login for all XI servers (basic auth still required for Core machines)
The Central Dashboard Model: Nagios Fusion
- Strengths:
  - Easy to add new servers
  - User-level control of server views
  - High-level overview
  - Very little CPU usage
  - Commercial solution with support
- Challenges:
  - Not a monitoring solution by itself
  - Free 60-day trial, requires a license
The Central Dashboard Model: MNTOS
The Central Dashboard Model: Multisite
Single Server – Distributed Parts
- Not all environments require check distribution
- Offload NDOUtils (DB backend) to a different machine
- Offload performance data processing to a different machine
- Mount disk I/O-intensive files on a RAM disk
- A Nagios Core install can run between [...] k checks, depending on what is being checked and how it is configured
Where To Go From Here?
- Future of distributed monitoring?
  - Improved information viewing instead of just raw data
  - Aggregated reporting and statistics
  - Business process views and monitoring
- What do you, as admins, need to see in this area of software development?
Conclusion
- Pick the right setup for your environment
- Any of these models can be mixed and combined
- PLAN before implementation: plan for efficient maintenance
  - An environment with 250k services overseen by a single server took almost an entire year of planning and implementation to do right
  - Environments can scale even larger with the right logistics planning in place
Conference Resources
- Daniel Wittenberg: “Scaling Nagios At A Giant Insurance
  - Thursday
  - 35,000 hosts and 1.4 million services
- Mike Weber: “Reducing Server Load with Mod
  - Friday
- Dave Williams: Author of DNX