Nagios and Mod-Gearman In a Large-Scale Environment Jason Cook 8/28/2012.

Presentation transcript:

Nagios and Mod-Gearman In a Large-Scale Environment Jason Cook 8/28/2012

2 Verisign Public A Brief History of Nagios at Verisign

3 Verisign Public Legacy Nagios Setup

- Whitepaper NSCA configuration
- Typical 3-tier setup
  - Remote system
  - Distributed Nagios servers
  - Central Nagios servers
- Architecture in place for several years
- Reasonably stable, though high-maintenance
- Very heterogeneous environment: many OS and Nagios versions
- All notifications sent to an Event Management System
- Offloaded graphing/trending to a custom solution

4 Verisign Public Simplified Passive Architecture Diagram

5 Verisign Public Challenges with our passive setup

- Scaling the Nagios server layers
  - Requires changes to all NSCA instances using the servers
  - Load-balancing solutions mostly require removing freshness checks…
- Freshness checking is a challenge
  - More freshness checking means more Nagios forking.
  - More Nagios forking is more operational sadness in a large environment.
  - With freshness, you end up having an active environment, even if it wasn't your intention.
- Freshness errors do not tell the whole story
  - Where is the problem?
  - Even if you know where the problem is, it can be difficult to track down what's causing it. Nagios? Plugin? System busy? NSCA? Network?
  - Many questions, few obvious answers.

6 Verisign Public Challenges with our passive setup (continued)

- Lack of centralized scheduling
  - Adjusting schedules can be difficult for those without in-depth knowledge of Nagios and how it all works.
  - A user cannot run a check immediately without even more in-depth knowledge of Nagios.
- Lots of Nagios builds for various platforms
  - Since we were using NSCA, we needed libmcrypt for encryption.
  - libmcrypt is not a standard library on many systems, so it is yet another package to maintain.
- All of this needed quite a bit of custom code for intelligent result queuing/sending, so as to gracefully handle network outages and minimize send_nsca forking (especially on the distributed servers).
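The custom queuing code itself is not shown in the slides. As a rough illustration of the idea only, the sketch below buffers check results and hands them to a sender in batches, keeping them queued when the send fails. It assumes the tab-delimited, one-result-per-line input that send_nsca reads on stdin; the `ResultQueue` class, the `send` callable, and the batch size are hypothetical names chosen for this example, not the presenter's implementation.

```python
from typing import Callable, List, Tuple

# (host, service description, return code, plugin output)
Result = Tuple[str, str, int, str]


def format_batch(results: List[Result]) -> str:
    """Render results as tab-delimited lines, one result per line."""
    return "".join(f"{h}\t{s}\t{rc}\t{out}\n" for h, s, rc, out in results)


class ResultQueue:
    """Hypothetical buffer: one send (one fork) per batch, retry after failures."""

    def __init__(self, send: Callable[[str], bool], batch_size: int = 50):
        self.send = send              # assumed to return True on success
        self.batch_size = batch_size  # illustrative value, not from the slides
        self.pending: List[Result] = []

    def add(self, result: Result) -> None:
        self.pending.append(result)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> bool:
        if not self.pending:
            return True
        if self.send(format_batch(self.pending)):
            self.pending.clear()
            return True
        # Send failed (for example, a network outage): keep everything queued.
        return False
```

In a real deployment `send` might pipe the payload to a single send_nsca process; here it is left as an injected callable so the batching and retry behaviour can be exercised without a network.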

7 Verisign Public A Move to Active Monitoring

8 Verisign Public An alternative arises…

- Gearman
  - Provides a generic application framework to farm out work to other machines or processes that are better suited to do the work.
  - Integrates with Nagios via the Mod-Gearman NEB module.
- NRPE
  - Nagios Remote Plugin Executor
- Merlin
  - Module for Effortless Redundancy and Loadbalancing In Nagios
  - Allows our Nagios instances to share scheduling (and therefore check results) between one another.
  - Great for load sharing and redundancy
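To make the "farm out work" pattern concrete, here is a minimal in-process sketch using only the Python standard library. It is a stand-in for the idea, not the Gearman protocol or the Mod-Gearman API: a queue plays the job server, threads play the workers, and the function name `check_dummy` is a hypothetical placeholder for a check plugin.

```python
import queue
import threading
from typing import Any, Callable, Dict, Iterable, Tuple

Job = Tuple[int, str, Any]  # (job id, registered function name, argument)


def run_jobs(
    jobs: Iterable[Job],
    handlers: Dict[str, Callable[[Any], Any]],
    worker_count: int = 4,
) -> Dict[int, Any]:
    """Distribute jobs to a pool of workers and collect results by job id."""
    job_q: "queue.Queue" = queue.Queue()
    results: Dict[int, Any] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = job_q.get()
            if item is None:  # sentinel: no more work for this worker
                return
            job_id, func_name, arg = item
            outcome = handlers[func_name](arg)
            with lock:
                results[job_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job in jobs:
        job_q.put(job)
    for _ in threads:
        job_q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

The submitter only needs to know the function name and the argument; which worker picks the job up is decided by the queue, which is the property that lets capacity grow by adding workers.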

9 Verisign Public Simplified Active Architecture Diagram

10 Verisign Public Some details about the setup

- All components run in VMs
  - Nagios (with nanosleep)
  - Merlin (1.1.15)
  - Mod-Gearman
  - MK Livestatus (perhaps the greatest NEB module of all time)
- Merlin setup is a simple peer-to-peer configuration
- Mod-Gearman NEB modules are configured to talk to multiple Gearman servers (Gearman server preference is alternated on each system, so that Gearman server failures are easily handled)
- One Mod-Gearman worker process for each Gearman server per worker.
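The slides do not show how the alternating server preference is produced. One simple way to get that effect, sketched here under the assumption that each system has a stable index, is to rotate the server list by that index. The host names are hypothetical; 4730 is the default Gearman port.

```python
from typing import List


def server_order(servers: List[str], system_index: int) -> List[str]:
    """Rotate the server list so consecutive systems prefer different servers."""
    if not servers:
        return []
    shift = system_index % len(servers)
    return servers[shift:] + servers[:shift]


# Hypothetical server names, for illustration only.
GEARMAN_SERVERS = ["gearman1:4730", "gearman2:4730"]

for index in range(2):
    # The first entry is the preferred server; the rest are fallbacks.
    print(index, server_order(GEARMAN_SERVERS, index))
```

With two servers, even-indexed systems prefer the first and odd-indexed systems prefer the second, so losing either server leaves every system with a configured fallback.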

11 Verisign Public VM Configuration & Performance

- VM configuration
  - 4 vCPUs
  - 2 GB RAM
  - Linux
- Performance considerations
  - Very CPU bound
  - RAM usage is very low
- VM usage
  - 2 Nagios servers
  - 2 Gearman servers
  - 2 Mod-Gearman workers

12 Verisign Public Application Configurations

- Nagios
  - minute interval
  - sleep_time = 0.01
  - host_inter_check_delay_method=n
  - service_inter_check_delay_method=0.01
  - max_concurrent_checks=0
  - 5 gearman collector threads
- Gearman
  - 10 I/O threads
- Mod-Gearman Workers
  - 1000 worker processes per system
  - 50 per second max spawn rate
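The worker figures above imply some simple arithmetic, sketched here as a back-of-the-envelope calculation rather than a measured result. The ramp-up time follows directly from the two worker settings, assuming spawning is sustained at the maximum rate. The throughput ceiling additionally depends on an average check duration, which the slides do not give, so it is left as a parameter.

```python
import math


def ramp_up_seconds(worker_processes: int, max_spawn_per_second: int) -> int:
    """Seconds to reach the full worker count at the maximum spawn rate."""
    return math.ceil(worker_processes / max_spawn_per_second)


def checks_per_minute_ceiling(
    worker_processes: int, systems: int, avg_check_seconds: float
) -> int:
    """Upper bound on checks per minute if every worker is always busy.

    avg_check_seconds is an assumption supplied by the caller; CPU limits
    and scheduling overhead will keep real throughput below this bound.
    """
    return int(worker_processes * systems * 60 / avg_check_seconds)


# 1000 worker processes at 50 spawns per second, as in the settings above.
print(ramp_up_seconds(1000, 50))  # → 20
```

Since the setup is described as very CPU bound, the ceiling function is best read as a sanity check on configuration rather than a capacity estimate.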

13 Verisign Public Performance Results

14 Verisign Public Observations

- These 6 VMs can easily handle active services per minute.
- Additional capacity can be had easily
  - Add Merlin peers
  - Add more workers
  - Scales up very well
- renice of critical processes makes sure they're getting the priority they need.
- The environment can be a bit fragile.
  - Less fragile than before, but still has several components which all must be working correctly.

15 Verisign Public Benefits

- Much less hardware
- Centralized view and control over all monitoring
- Opportunity to leverage the Gearman architecture for other services
- Higher confidence in monitoring accuracy
- More flexibility in scheduling logic
- Event handlers become very useful, since there is a broader view of the infrastructure via MK Livestatus.
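As an illustration of how an event handler might get that broader view, the sketch below builds a Livestatus query (LQL) and sends it over the Livestatus UNIX socket. The query format (`GET`, `Columns:`, `Filter:`, `OutputFormat:`) is standard Livestatus; the socket path and the choice of columns are assumptions for this example and will differ per installation.

```python
import json
import socket
from typing import Any, List, Sequence

# Hypothetical path: the real location is set when the NEB module is loaded.
LIVESTATUS_SOCKET = "/var/lib/nagios/rw/live"


def build_lql(table: str, columns: Sequence[str], filters: Sequence[str] = ()) -> str:
    """Build a Livestatus query; a blank line terminates the request."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {f}" for f in filters]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


def query_livestatus(lql: str, path: str = LIVESTATUS_SOCKET) -> List[Any]:
    """Send one query over the UNIX socket and decode the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(lql.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal end of request
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


# All services currently in CRITICAL state (state 2).
CRITICAL_SERVICES = build_lql(
    "services", ["host_name", "description", "state"], ["state = 2"]
)
```

An event handler could run a query like `CRITICAL_SERVICES` before acting, for example to skip a restart when many hosts are failing at once and the problem is likely upstream.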

16 Verisign Public Final Thoughts

- Tested several methodologies before arriving at the Nagios+Gearman conclusion.
  - Multisite
  - DNX
  - NRDP
- The current design is still a work in progress, but will be easier to change and grow (Nagios 4?).
- Move anything possible off of Nagios and to external processes.

17 Verisign Public Credits

- Verisign System Administrators, for helping me test
- Gearman
- ConSol Labs
  - Thruk
  - Mod-Gearman
- Mathias Kettner
  - MK Livestatus
- op5
  - Merlin
- Nagios
  - Nagios Core
  - NRPE

Thank You © 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners.