Distributed Monitoring with Nagios: Past, Present, Future
Mike Guthrie


Distributed Monitoring Introduction
Basic definition: splitting up your monitoring server over multiple machines.
Why use distributed monitoring?
- Multiple sites with firewall restrictions
- Large installations that exceed the CPU and memory resources that a single machine can offer

Understanding CPU Limitations
The primary task of the Nagios Core engine is to schedule checks.
Example monitoring server: 1000 hosts, 4 services per host, 5-minute check interval.
- Check load = (5000 checks / 5 min) / 60 seconds, or about 16.7 checks per second. The 5000 counts 1000 host checks plus 4000 service checks.
- In 1 second, about 16 scripts or binary processes are launched, and about 16 sets of results come in, are processed by Nagios, and are written to disk.
When the check schedule exceeds CPU limitations, you get “check latency”: checks run later than scheduled.
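The check-load arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name and parameters are illustrative, and it assumes one host check per host and that every check shares the same interval.

```python
def checks_per_second(hosts: int, services_per_host: int, interval_minutes: float) -> float:
    """Average number of checks the scheduler must launch each second."""
    # Assumption: one host check per host plus all service checks,
    # every one of them scheduled on the same interval.
    total_checks = hosts + hosts * services_per_host
    return total_checks / (interval_minutes * 60)


rate = checks_per_second(hosts=1000, services_per_host=4, interval_minutes=5)
print(f"{rate:.1f} checks per second")  # 5000 checks / 300 s
```

Changing the interval or the services-per-host figure shows quickly how the launch rate grows before any real tuning is attempted.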

Picking the Right Distributed Model
- Pick the right model for your environment
- Think logistics: PLAN before implementation
- Every hour spent planning logistics will save tens or even hundreds of man-hours later on
- A 30-minute task on 1 server = 5 hours on 10 servers
- Consider how to effectively view information across multiple machines
- As data quantity increases, discerning useful information from it becomes more important
- Viewing 10,000 hosts and 50,000 services on a page is too much raw data to be effective information
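The maintenance arithmetic above (a 30-minute task repeated on 10 servers) is simple enough to check in code. A minimal sketch, assuming the task costs the same on every server and is done by hand on each one; the function name is illustrative.

```python
def total_effort_hours(task_minutes: float, servers: int) -> float:
    """Hours spent when the same manual task is repeated on every server."""
    return task_minutes * servers / 60


print(total_effort_hours(task_minutes=30, servers=10))  # 300 minutes expressed in hours
```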

The Classic Distributed Model
Distributed servers running active checks, forwarding results to a central server.
[Diagram: a central server (passive only) surrounded by distributed servers running active checks; each forwards results after every check.]
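One way a distributed server might forward a result to the passive-only central server is sketched below. The slides do not name a transport, so this assumes NSCA: `send_nsca` reads tab-delimited lines of host, service, return code and output on standard input. The function names, the central host name and the config path are illustrative assumptions, not part of the talk.

```python
import subprocess


def format_passive_result(host: str, service: str, return_code: int, output: str) -> str:
    """Build one tab-delimited passive service check line."""
    if return_code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError(f"unexpected plugin return code: {return_code}")
    # A newline inside the plugin output would split the record in two.
    clean_output = output.replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean_output}\n"


def forward_result(line: str, central_host: str,
                   config_path: str = "/usr/local/nagios/etc/send_nsca.cfg") -> None:
    """Pipe one result line to send_nsca (assumed to be installed and on PATH)."""
    subprocess.run(["send_nsca", "-H", central_host, "-c", config_path],
                   input=line, text=True, check=True)


line = format_passive_result("web01", "HTTP", 0, "HTTP OK - 0.012s response time")
print(repr(line))
# forward_result(line, "central.example.com")  # hypothetical central server
```

The forwarding call is left commented out because it needs a reachable NSCA daemon; only the formatting step runs as written.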


The Classic Distributed Model
- Central monitoring vs. central viewing?
- OCSP vs. event handlers: OCSP (the obsessive compulsive service processor command) runs after every check; event handlers run only on state changes
- Freshness checking ensures current data
- Child servers can also do local monitoring without forwarding results
- Distributed servers can also receive passive checks and forward them along, creating a multi-level tree structure
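The difference between the two hooks, and the idea behind freshness checking, can be illustrated with a small simulation. This is a simplified model rather than Nagios code: it ignores soft and hard state types, and the function names and threshold are assumptions made for the example.

```python
def hooks_fired(states):
    """Count how often each hook would run over a sequence of check results."""
    ocsp_runs = 0
    event_handler_runs = 0
    previous = None
    for state in states:
        ocsp_runs += 1  # OCSP: once after every check result
        if previous is not None and state != previous:
            event_handler_runs += 1  # event handler: only when the state changes
        previous = state
    return ocsp_runs, event_handler_runs


def is_stale(result_age_seconds: float, freshness_threshold_seconds: float) -> bool:
    """A passive result older than the threshold is no longer considered fresh."""
    return result_age_seconds > freshness_threshold_seconds


print(hooks_fired(["OK", "OK", "CRITICAL", "CRITICAL", "OK"]))
print(is_stale(result_age_seconds=900, freshness_threshold_seconds=600))
```

Forwarding from the per-check hook keeps the central server current even when nothing changes, which is what lets freshness checking detect a silent child server.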

The Classic Distributed Model
Strengths:
- Well tested, well documented, proven solution
- All built into the Nagios Core package
- Extremely flexible for checks, performance graphing, notifications, etc.
- Can be combined with other distributed models
Challenges:
- Maintaining configs on multiple machines
- Which server issued the check?
- Where to process/view performance data?

The Classic Distributed Model
Workarounds:
- Use SVN, rsync, or cron to automatically maintain host and service configs on both distributed and central servers
- Use templating as much as possible
- Read the Core docs on “Object Inheritance”
- Keep template definitions separate
- Use naming conventions to keep configs organized
Nagios XI distributed tools:
- Inbound and Outbound Checks
- Unconfigured Objects
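To show why templating cuts down on config maintenance, here is a toy model of template inheritance in Python. It is a deliberately simplified sketch: it handles a single parent named by a `use` key and lets local directives override inherited ones, while real Nagios object inheritance has more rules (multiple parents, additive values). The object names and directive values are made up for the example.

```python
def resolve(name, objects):
    """Return the effective directives for an object by following its 'use' chain."""
    definition = objects[name]
    parent = definition.get("use")
    # Start from the parent's resolved directives, then apply local overrides.
    merged = resolve(parent, objects) if parent else {}
    merged.update({key: value for key, value in definition.items() if key != "use"})
    return merged


objects = {
    "generic-service": {"check_interval": "5", "max_check_attempts": "3"},
    "web-service": {"use": "generic-service", "check_interval": "1"},
}

print(resolve("web-service", objects))
```

Only the directive that differs is written on the child object, so a change to the shared template reaches every server that syncs the same template file.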

The Cluster Model – Nagios Load Balancing
- Nagios checks are managed by a sub-process and distributed evenly across multiple servers
- Works like a load balancer
- Two popular examples: DNX (Distributed Nagios eXecutor) and Mod Gearman
- Check results and configs are all managed at the central server

The Cluster Model – DNX

The Cluster Model – DNX
DNX: How it works
- When a check is scheduled to execute, the job is passed to a worker node
- The worker node executes the check and sends the results directly to the results queue
- Checks are not associated with any particular worker node
- Bypasses the nagios.cmd pipe to eliminate a potential bottleneck
- If a worker goes down, all checks continue
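The job-queue pattern described above can be sketched with threads standing in for worker nodes. This is not the DNX protocol, only an in-process illustration of the idea that any free worker takes the next job and reports to a shared results queue; all names are hypothetical.

```python
import queue
import threading


def run_checks(checks, worker_count):
    """Run (name, check_function) pairs on a pool of interchangeable workers."""
    jobs = queue.Queue()
    results = queue.Queue()
    for check in checks:
        jobs.put(check)

    def worker(worker_name):
        # Each worker pulls whatever job is next; no check belongs to a worker.
        while True:
            try:
                name, check_fn = jobs.get_nowait()
            except queue.Empty:
                return
            results.put((name, check_fn(), worker_name))

    threads = [threading.Thread(target=worker, args=(f"worker-{i}",))
               for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    collected = {}
    while not results.empty():
        name, status, worker_name = results.get()
        collected[name] = (status, worker_name)
    return collected


checks = [(f"host{i}/PING", lambda: 0) for i in range(6)]
print(len(run_checks(checks, worker_count=3)), "results with 3 workers")
print(len(run_checks(checks, worker_count=1)), "results with 1 worker")
```

Because no job is pinned to a worker, shrinking the pool only slows the run; every check still produces a result.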

The Cluster Model – DNX
Strengths:
- Central configuration management
- Checks redistributed if a worker is down
- Worker nodes can be added at any time
Challenges:
- Performance data is still handled at the central server
- If the master goes down, all checks cease

The Cluster Model – Mod Gearman

The Cluster Model – Mod Gearman
Strengths:
- Central configuration management
- Checks can be split by hostgroups or servicegroups, which is useful when groups are located in different network segments
Challenges:
- Performance data is still handled at the central server
- If the master goes down, all checks cease
- Effectively viewing more than 10k services on a single machine
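Splitting checks by group amounts to a routing decision made when a job is queued. The sketch below illustrates that decision only; the queue names, the precedence of servicegroups over hostgroups, and the data layout are assumptions for the example and should be checked against the Mod Gearman documentation.

```python
def pick_queue(check, hostgroup_queues, servicegroup_queues):
    """Choose a queue name for a check based on its group membership."""
    # Assumed precedence: a matching servicegroup wins over a matching hostgroup.
    for group in check.get("servicegroups", []):
        if group in servicegroup_queues:
            return f"servicegroup_{group}"
    for group in check.get("hostgroups", []):
        if group in hostgroup_queues:
            return f"hostgroup_{group}"
    # Anything unmatched goes to a general queue served by any worker.
    return "service" if "service" in check else "host"


dmz_check = {"host": "web01", "service": "HTTP", "hostgroups": ["dmz", "linux"]}
lan_check = {"host": "db01", "service": "Load", "hostgroups": ["linux"]}

print(pick_queue(dmz_check, hostgroup_queues={"dmz"}, servicegroup_queues=set()))
print(pick_queue(lan_check, hostgroup_queues={"dmz"}, servicegroup_queues=set()))
```

A worker placed inside the DMZ would then subscribe only to the DMZ queue, so checks for that segment never have to cross the firewall.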

The Central Dashboard Model
- Checks are executed and managed on multiple distributed servers
- Central viewer unifies all servers
- Central viewer polls data from each server and displays tactical data in the UI
Examples:
- Nagios Fusion
- MNTOS
- check_MK Multisite
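The polling-and-aggregation step of a central viewer can be sketched as follows. This does not reflect how Nagios Fusion, MNTOS or Multisite are implemented: each server is represented by a hypothetical fetch function returning state counts, and the server names are made up.

```python
from collections import Counter


def tactical_overview(fetchers):
    """Poll each server and return per-server counts plus combined totals."""
    per_server = {}
    totals = Counter()
    for server, fetch in fetchers.items():
        try:
            counts = Counter(fetch())
        except OSError:
            # An unreachable server is shown as unknown, not silently dropped.
            per_server[server] = None
            continue
        per_server[server] = dict(counts)
        totals.update(counts)  # Counter.update adds counts together
    return per_server, dict(totals)


fetchers = {
    "nagios-east": lambda: {"OK": 950, "WARNING": 30, "CRITICAL": 20},
    "nagios-west": lambda: {"OK": 480, "CRITICAL": 5},
}

per_server, totals = tactical_overview(fetchers)
print(totals)
```

The viewer holds only summaries, which is why this model adds little CPU load of its own: the checks and their configuration stay on the individual servers.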


The Central Dashboard Model: Nagios Fusion
- Displays a tactical overview for each server
- Monitoring and object configurations are compartmentalized to each server
- Good for geographically distributed servers where local management is required
- Unified login for all XI servers (basic auth still required for Core machines)

The Central Dashboard Model: Nagios Fusion
Strengths:
- Easy to add new servers
- User-level control of server views
- High-level overview
- Very little CPU usage
- Commercial solution with support
Challenges:
- Not a monitoring solution by itself
- Free 60-day trial; requires a license


The Central Dashboard Model: MNTOS

The Central Dashboard Model: Multisite

Single Server – Distributed Parts
- Not all environments require check distribution
- Offload NDOUtils (DB backend) to a different machine
- Offload performance data processing to a different machine
- Mount disk I/O-intensive files on a RAM disk
- A Nagios Core install can run between k checks depending on what is being checked and how it is configured

Where To Go From Here?
Future of distributed monitoring?
- Improved information viewing instead of just raw data
- Aggregated reporting and statistics
- Business process views and monitoring
What do you, as admins, need to see in this area of software development?

Conclusion
- Pick the right setup for your environment
- Any of these models can be mixed and combined
- PLAN before implementation: plan for efficient maintenance
- An environment that implemented 250k services overseen by a single server took almost an entire year of planning and implementation to do it right
- Environments can scale even larger with the right logistics planning in place

Conference Resources
- Daniel Wittenberg: “Scaling Nagios At A Giant Insurance Thursday 35,000 hosts and 1.4 million services
- Mike Weber: “Reducing Server Load with Mod Friday
- Dave Williams: Author of DNX