Nagios Is Down and Your Boss Wants to See You
Andrew Widdersheim
Nooooooooooo!!!
Breaking News!
Nagios High Availability Options
- Merlin by op5
- Classic method described in the Nagios Core documentation
- Some type of virtualized solution like VMware or…
Nagios High Availability
Pacemaker + DRBD = Win
DRBD magic
DRBD magic
- By Linbit, free
- Runs in the kernel, either as a module or in the mainline code if the kernel is new enough
- Each server gets its own independent storage
- Able to maintain the data's consistency between the nodes
- Resource-level fencing
DRBD considerations
- DRBD is as fast as the slowest node
- Network latency matters; replication over great distances can be done
- DRBD Proxy can increase performance over great distances, but it costs money
- Recommend using a dedicated crossover link for best performance
- Protocol choices:
  - Protocol A: write I/O is reported as completed once it has reached the local disk and the local TCP send buffer
  - Protocol B: write I/O is reported as completed once it has reached the local disk and the remote buffer cache
  - Protocol C: write I/O is reported as completed once it has reached both the local and the remote disk
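For reference, a minimal sketch of a DRBD resource definition using the fully synchronous protocol C; the host names, backing devices, and 10.0.0.x addresses are illustrative (the addresses would sit on the dedicated crossover link), and the resource name r1 matches the one referenced in the Pacemaker configuration later on:

    resource r1 {
      protocol C;                  # fully synchronous: completed only when on both disks
      on node1 {
        device    /dev/drbd1;
        disk      /dev/sdb1;       # local backing storage (illustrative)
        address   10.0.0.1:7789;   # replication over the crossover link
        meta-disk internal;
      }
      on node2 {
        device    /dev/drbd1;
        disk      /dev/sdb1;
        address   10.0.0.2:7789;
        meta-disk internal;
      }
    }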
Pacemaker
Pacemaker + DRBD + Nagios
[Architecture diagram, built up over three slides: Node1 and Node2 hardware run Corosync/Heartbeat for messaging and Pacemaker as the resource manager; DRBD keeps an ext4 filesystem in sync between the primary and secondary node; on the primary, that filesystem plus a VIP carry the Nagios stack: rrdcached, NSCA, NPCD, Apache, and Nagios itself.]
Pacemaker and Nagios
Pacemaker and Nagios

primitive p_nagios lsb:nagios \
  op start interval="0" timeout="180s" \
  op stop interval="0" timeout="40s" \
  op monitor interval="30s" \
  meta target-role="Started"

primitive p_fs_nagios ocf:heartbeat:Filesystem \
  params device="/dev/drbd/by-res/r1" directory="/drbd/r1" fstype="ext4" options="noatime" \
  op start interval="0" timeout="60s" \
  op stop interval="0" timeout="180s" \
  op monitor interval="30s" timeout="40s"

group g_nagios p_fs_nagios p_nagios_ip p_nagios_bacula p_nagios_mysql \
  p_nagios_rrdcached p_nagios_npcd p_nagios_nsca p_nagios_apache \
  p_nagios_syslog-ng p_nagios \
  meta target-role="Started"
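A sketch of how a configuration like this might be loaded and checked with the crm shell; the file name nagios-cluster.crm is illustrative:

    crm configure load update nagios-cluster.crm   # merge the definitions into the live CIB
    crm configure verify                           # sanity-check the resulting configuration
    crm_mon -1                                     # one-shot view of where resources are running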
Pacemaker considerations
- Redundant communication links are a must; recommend use of a crossover cable to help accomplish this
- Init scripts for Nagios must be LSB compliant… some are not (see the sanity check below)
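Pacemaker decides when a resource is running based on the init script's exit codes, so it is worth testing them by hand before handing the script to the cluster; a minimal check along these lines, with the expected codes per the LSB spec:

    /etc/init.d/nagios start;  echo "start:  $?"   # expect 0
    /etc/init.d/nagios status; echo "status: $?"   # expect 0 while running
    /etc/init.d/nagios start;  echo "start:  $?"   # expect 0; start must be idempotent
    /etc/init.d/nagios stop;   echo "stop:   $?"   # expect 0
    /etc/init.d/nagios status; echo "status: $?"   # expect 3 once stopped
    /etc/init.d/nagios stop;   echo "stop:   $?"   # expect 0; stop must be idempotent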
What to replicate?
- Configuration (host, service)
- Multi check command files
- WebInject command files
- PNP4Nagios RRDs
- Nagios log files
- retention.dat
- Mail queue (eh…)
Everything else?
- Binaries and main configuration files installed using packages, independently on each server
- Able to update one node at a time (see the sketch below)
- Easy to roll back should there be an issue
- Version/change management
- Consistent build process
- NDO and MySQL hosted on a separate HA cluster
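As an example of the one-node-at-a-time update, a rough sequence using the crm shell; the node names and package are illustrative, not the authors' exact procedure:

    crm node standby node2                  # stop anything running on the standby node
    ssh node2 yum -y update nagios          # update packages there
    crm node online node2                   # bring it back into the cluster
    crm resource migrate g_nagios node2     # fail over to the freshly updated node
    crm node standby node1                  # repeat on the other node
    ssh node1 yum -y update nagios
    crm node online node1
    crm resource unmigrate g_nagios         # drop the location constraint migrate created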
RPMs
- Build and maintain our own RPMs
- Lets us configure everything to our liking
- Lets us update at our own pace
- Controlled through SVN, with a post-commit hook that automatically updates our own Nagios repository with new packages/updates; then it is as simple as doing "yum update" on your servers (a sketch follows below)
- A lot of upfront work, but it was worth it
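A rough sketch of what such a post-commit hook might do; the repository layout, spec file name, and paths are all illustrative, not the authors' actual setup:

    #!/bin/sh
    # hooks/post-commit: rebuild the package and refresh the yum repo metadata
    REPOS="$1"
    REV="$2"
    BUILDDIR=$(mktemp -d)
    svn export -q -r "$REV" "file://$REPOS/trunk/nagios" "$BUILDDIR/nagios"
    rpmbuild -bb "$BUILDDIR/nagios/nagios.spec"
    cp "$HOME"/rpmbuild/RPMS/x86_64/nagios-*.rpm /var/www/html/nagios-repo/
    createrepo --update /var/www/html/nagios-repo/   # clients then just run "yum update"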
How has this helped?
- Have been able to repair, upgrade and move hardware with minimal downtime
- Updated the OS and restarted servers with minimal downtime
- Were able to update to, and promptly patch, an issue affecting Nagios downtimes that was not caught in QA
- CGI pages of death
What doesn't this solve?
- Having an HA cluster is great, but there are still things that can go wrong that a cluster does not solve
- Configuration issues are probably the most prevalent thing we run into that might bring down Nagios without there being a major hardware/DC issue
- We make use of NagiosQL, which takes a backup whenever a configuration is changed; this lets us roll back unwanted changes, but it isn't the best
Two is better than one
- Setting up another cluster for "development" with similar hardware and software is a great way to test things outside of production
- Lets you spot potential problems before they become real problems
Monitoring your cluster
- check_crm (…and-High-2DAvailability/Check-CRM/details)
- check_drbd (…Systems/Linux/check_drbd/details)
- check_heartbeat_link (…Systems/Linux/check_heartbeat_link/details)
Gotchas
- RPMs and symlinks in an HA solution are bad: symlink /usr/local/nagios/etc/ -> /drbd/r1/nagios/etc, and when the node is secondary and you update the RPM, your symlink will get blown away
- Restarting services controlled by Pacemaker should be done within Pacemaker: crm resource restart p_nagios
Quick Stats
- Thousands of host and service checks
- Average check latency: ~0.300 sec
- Average checks per second: ~70
- Mostly active checks polling every 5 minutes
- HP DL360 G-series servers: 10k SAS drives in RAID10, 2 quad-core 3.00GHz CPUs, 8GB memory
Tuning
- RAM disk for the check results queue, NPCD queue, objects.cache and status.dat (sketch below)
- NDOUtils with the async patch (built in since version 1.5)
- Limit what you send to NDOUtils
- Bulk mode with npcdmod
- rrdcached
- Restarting Nagios through the external command file eventually resulted in higher latencies for some reason
- Large installation tweaks
- Disable environment macros
- A lot of trial and error with scheduling and reaper frequencies
- Small amount of check optimization
- Measuring Nagios performance using PNP4Nagios is a must
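A sketch of what the RAM disk and the related settings could look like; the mount point and size are illustrative, while the directive names are standard Nagios Core 3.x nagios.cfg options:

    # /etc/fstab: a tmpfs mount to hold the hot files
    tmpfs  /var/nagios/ramdisk  tmpfs  size=256m,mode=0755  0  0

    # nagios.cfg excerpts
    check_result_path=/var/nagios/ramdisk/checkresults   # check results queue on the RAM disk
    object_cache_file=/var/nagios/ramdisk/objects.cache
    status_file=/var/nagios/ramdisk/status.dat
    use_large_installation_tweaks=1                      # "large installation tweaks"
    enable_environment_macros=0                          # "disable environment macros"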
RAM disk + ndo-async + rrdcached
non-external command file restarts
NSCA
One Year’s Progress
How we run today
Quick Stats
Questions?