Nagios Demonstration Tom Wlodek SLAC Tier2 workshop 2007-11-29.


BNL Nagios Hardware (current)
Diagram labels: RT/AT machine (Rt-racf.bnl.gov); Nagios server (rnagios01); NRPE server inside the firewall (Gridmon…); NRPE server outside the firewall (Grid02.usatlas.org); firewall; several NRPE servers on monitored machines.

RT, Nagios, AT
Problems reported to RT are reflected in the asset's history.
Information about assets stored in AT is used by Nagios to monitor the BNL machines and services, and to keep an up-to-date list of the administrators who are to be notified in case of problems.
If a critical machine or service fails, Nagios notifies experts and/or opens an RT problem report to keep track of the problem resolution.
RT can exchange problem reports with external ticketing systems (diagram label: OSG Footprints).
Diagram label: machines and services monitored by Nagios.
No AT support anymore!!!

Coming changes to Nagios
The Nagios server will be split into two: an internal RACF server (BNL stuff) and an external server (Tier2/3, OSG services, USAtlas).
The Nagios split has been delayed (lack of suitable hardware), but I hope that the problems have been fixed now.
Once the split is completed, the Tier2 admins will be given Nagios administrator rights.

Future Hardware
Diagram labels: RT machine (Rt-racf.bnl.gov); Nagios internal server; Nagios external server; NRPE server inside the firewall (Gridmon…); NRPE server outside the firewall (Grid02…); firewall; NRPE servers on monitored machines.

Current Nagios in a nutshell
Bookmark this page and visit it often.
We are currently monitoring ~500 services on ~260 hosts and counting…

Service dependencies
Diagram labels: parent services; child service.

Service dependencies
Currently some service dependencies are defined in Nagios. More need to be defined or discovered. Discovering and declaring service dependencies is a never-ending task.
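To illustrate why declared dependencies matter, here is a minimal Python sketch of how a dependency map could be used to report only the root-cause failures and stay quiet about child services whose parent is already down. This is not how Nagios implements dependencies internally; the service names, the `DEPENDS_ON` map and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical dependency map: child service -> parent services it relies on.
DEPENDS_ON: Dict[str, List[str]] = {
    "storage-door": ["storage-core"],
    "storage-core": ["shared-filesystem"],
    "web-portal": ["shared-filesystem"],
}


def failed_ancestors(service: str, failed: Set[str],
                     depends_on: Dict[str, List[str]] = DEPENDS_ON) -> Set[str]:
    """Return every failed parent reachable from `service` (direct or indirect)."""
    seen: Set[str] = set()
    found: Set[str] = set()
    stack = list(depends_on.get(service, []))
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue  # already visited; also protects against cycles
        seen.add(parent)
        if parent in failed:
            found.add(parent)
        stack.extend(depends_on.get(parent, []))
    return found


def services_to_notify(failed: Iterable[str],
                       depends_on: Dict[str, List[str]] = DEPENDS_ON) -> List[str]:
    """Keep only the failures that are not explained by a failed parent."""
    failed_set = set(failed)
    return sorted(s for s in failed_set
                  if not failed_ancestors(s, failed_set, depends_on))
```

With this map, a simultaneous failure of all three storage-related services would be reported as a single problem on the shared filesystem, which is the effect that declaring parent and child services is meant to achieve.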

False alarms in Nagios
Sometimes probes report false alarms. Many of those false alarms were caused by problems in the BNL firewalls. We eliminated them by adding a second network interface to the Nagios server. Some level of false alarms still persists, probably still caused by the firewalls, and these are hard to eliminate. I am working on making the probes smarter. A fix to the BNL firewalls should also bring relief.

Nagios – “Tactical overview”
Visit this page daily, especially if you are a member of the management group or an operator.

We need to formalize the Nagios operations
1. Operators should monitor the “Tactical overview” page for new alerts and notify experts if they see one.
2. Upon receiving a Nagios alert (by e-mail and/or pager and/or operator call), the expert should visit the Nagios page and acknowledge the problem.
3. The expert should then take ownership of the corresponding RT ticket and check the status of the parent service (if applicable).
4. Close the RT ticket, if applicable.
5. Reschedule a new test of the Nagios service to clear the alert from the Nagios page.
6. Fix the problem and leave a record of the solution in RT.
7. Delete comments from the Nagios page (if applicable).

Useful things to know
How to schedule a shutdown of a service or group of services?
How to disable checks for a particular service or group of services?
How to stop notifications for a service/service group?
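Besides the web interface, these three actions can also be requested through the Nagios external command file. The Python sketch below only formats the command lines; the command names (`SCHEDULE_SVC_DOWNTIME`, `DISABLE_SVC_CHECK`, `DISABLE_SVC_NOTIFICATIONS`) are standard Nagios external commands, while the helper names are hypothetical and the path of the command file depends on the installation, so it is passed in by the caller.

```python
import time


def external_command(name, *args, now=None):
    """Format one line for the Nagios external command file: [time] NAME;arg;arg..."""
    timestamp = int(time.time() if now is None else now)
    return "[{}] {};{}\n".format(timestamp, name, ";".join(str(a) for a in args))


def schedule_service_downtime(host, service, start, end, author, comment, now=None):
    """Fixed downtime from `start` to `end` (Unix times), not triggered by another downtime."""
    fixed, trigger_id = 1, 0
    return external_command("SCHEDULE_SVC_DOWNTIME", host, service, start, end,
                            fixed, trigger_id, end - start, author, comment, now=now)


def disable_service_checks(host, service, now=None):
    """Stop active checks of one service."""
    return external_command("DISABLE_SVC_CHECK", host, service, now=now)


def disable_service_notifications(host, service, now=None):
    """Stop notifications for one service; checks keep running."""
    return external_command("DISABLE_SVC_NOTIFICATIONS", host, service, now=now)


def submit(command_line, command_file):
    """Write one command line to the command file (normally a named pipe).

    The path is installation-specific, so no default is assumed here.
    """
    with open(command_file, "w") as handle:
        handle.write(command_line)
```

For a whole service group, the same pattern would be repeated for each member service; whether to do this from a script or from the web pages is a matter of local policy.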

We need to formalize the Nagios operations (cntd)
We need to enforce two rules:
No abandoned RT tickets (mostly works OK).
No unacknowledged Nagios alarms.
Acknowledged problems should remain acknowledged for at most time T (one week???). After that they ought to be fixed or removed from Nagios. The length of the time interval T is negotiable, but we should agree on some number.
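The two Nagios-side rules could be checked mechanically. The following Python sketch assumes a list of current problems has already been extracted from Nagios by some means; the `Problem` record, its fields and the function name are hypothetical, and the one-week default is only the value floated above, not an agreed number.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Problem:
    """Hypothetical record of one service problem currently shown by Nagios."""
    host: str
    service: str
    acknowledged: bool
    acknowledged_at: datetime  # only meaningful when `acknowledged` is True


# The interval T; one week is a placeholder until a number is agreed on.
MAX_ACK_AGE = timedelta(days=7)


def policy_violations(problems: List[Problem], now: datetime,
                      max_ack_age: timedelta = MAX_ACK_AGE) -> List[Tuple[Problem, str]]:
    """Return the problems that break one of the two rules, with the reason."""
    violations = []
    for problem in problems:
        if not problem.acknowledged:
            violations.append((problem, "unacknowledged"))
        elif now - problem.acknowledged_at > max_ack_age:
            violations.append((problem, "acknowledged for longer than T"))
    return violations
```

A report built this way could be mailed to the operators periodically, which would make the agreed value of T visible in practice.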

RSV probes and Nagios
There are 3 ways to integrate RSV probes in Nagios:
1. Run the RSV probe directly from Nagios. This can be done (and is done) for simple probes; more complex ones will time out in Nagios.
2. Make RSV probes report results to the central OSG database and make Nagios read the database. RSV authors do not seem to like it.
3. Make RSV probes report directly to Nagios. BNL security experts do not like it, since it would imply changing current authentication methods.
So…

We will combine methods 2 and 3
Diagram labels: Nagios; interface Db; BNL firewall; RSV probes running in OSG land.
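One way to picture the combined approach: probes write their results into an interface database, and a process on the Nagios side reads the new rows and turns them into passive check results. The Python sketch below is an illustration under stated assumptions, not the actual BNL design. The table layout and function names are hypothetical, `sqlite3` merely stands in for whatever database is used, and only `PROCESS_SERVICE_CHECK_RESULT` is a standard Nagios external command.

```python
import sqlite3

# Hypothetical layout of the interface database table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS probe_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    host TEXT NOT NULL,
    service TEXT NOT NULL,
    status INTEGER NOT NULL,      -- 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
    output TEXT NOT NULL,
    reported_at INTEGER NOT NULL  -- Unix time of the probe run
)
"""


def store_result(db, host, service, status, output, reported_at):
    """Probe side: record one result in the interface database."""
    db.execute(
        "INSERT INTO probe_results (host, service, status, output, reported_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (host, service, status, output, reported_at),
    )
    db.commit()


def passive_results_since(db, last_id):
    """Nagios side: format rows newer than `last_id` as passive check results.

    Returns the command lines and the id of the last row that was read.
    """
    rows = db.execute(
        "SELECT id, host, service, status, output, reported_at "
        "FROM probe_results WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    lines = []
    for row_id, host, service, status, output, reported_at in rows:
        one_line = output.replace("\n", " ")  # a command must fit on one line
        lines.append("[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}\n".format(
            reported_at, host, service, status, one_line))
        last_id = row_id
    return lines, last_id
```

Because the Nagios side only reads from the database, the probes never need to authenticate to Nagios itself, which is the property that makes this combination attractive across the firewall.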

I need feedback from you!
1. What should be monitored?
2. Who should be on the call list?
3. What should the notification policy be? E-mail? Pager?
4. We can define event handlers to correct common error conditions. Do you want/need it?
5. Etc… etc…