Monitoring of Castor at RAL Castor F2F Rutherford Appleton Laboratory, February 18 th 2009 Cheney Ketley RAL.

Slides:



Advertisements
Similar presentations
Mercury Quality Center 9.0 Training Material
Advertisements

Nagios on Tier1 farm Jonathan Wheeler RAL Tier1 Fabric Team 20 th June 2008.
Calculating Quality Reporting Service
Micro Control Solutions Stability System II rev. 6.4
With Folder HelpDesk for Outlook, support centres and other helpdesks can work efficiently with support cases inside Microsoft Outlook. The support tickets.
AS ICT Finding your way round MS-Access The Home Ribbon This ribbon is automatically displayed when MS-Access is started and when existing tables.
Visualization of Monitoring Data at the NASA Advanced Supercomputing Facility Janice Singh
Week 6: Chapter 6 Agenda Automation of SQL Server tasks using: SQL Server Agent Scheduling Scripting Technologies.
1 CHEP 2000, Roberto Barbera Roberto Barbera (*) Grid monitoring with NAGIOS WP3-INFN Meeting, Naples, (*) Work in collaboration with.
Advanced Workgroup System. Printer Admin Utility Monitors printers over IP networks Views Sharp and non-Sharp SNMP Devices Provided Standard with Sharp.
WELCOME TO THE MCCLOUD SERVICES CUSTOMER WEB PORTAL TUTORIAL.
Calendar Browser is a groupware used for booking all kinds of resources within an organization. Calendar Browser is installed on a file server and in a.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment Chapter 11: Monitoring Server Performance.
Chapter 11 - Monitoring Server Performance1 Ch. 11 – Monitoring Server Performance MIS 431 – created Spring 2006.
CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle and Streams Diagnostics and Monitoring Eva Dafonte Pérez Florbela Tique Aires.
Check Disk. Disk Defragmenter Using Disk Defragmenter Effectively Run Disk Defragmenter when the computer will receive the least usage. Educate users.
Overview SAP Basis Functions. SAP Technical Overview Learning Objectives What the Basis system is How does SAP handle a transaction request Differentiating.
Professional Informatics & Quality Assurance Software Lifecycle Manager „Tools that are more a help than a hindrance”
Enterprise Reporting with Reporting Services SQL Server 2005 Donald Farmer Group Program Manager Microsoft Corporation.
1 Web Developer & Design Foundations with XHTML Chapter 6 Key Concepts.
Best Practices in Moodle Administration Best Practices in Moodle Administration A variety of topics from technical to practical Jonathan Moore Vice President.
DONE-10: Adminserver Survival Tips Brian Bowman Product Manager, Data Management Group.
Self Guided Tour for Query V8.4 Basic Features. 2 This Self Guided Tour is meant as a review only for Query V8.4 Basic Features and not as a substitute.
Josh Riggs Utilizing Open Source Network Monitoring.
DWINSA 2007 Website. Website Purpose Allow states to track status of questionnaires Allow systems >100K or states to upload project data.
TechEd /22/2017 5:40 AM © 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks.
Network Management Tool Amy Auburger. 2 Product Overview Made by Ipswitch Affordable alternative to expensive & complicated Network Management Systems.
VT SMS System User Manual
TEAM Basic TotalElectrostatic ManagementAwareness&
The VPO Operator. [vpo_operator] 2 The VPO Operator Section Overview The role of the VPO operator Starting and stopping the Motif GUI The VPO Operator.
Monitoring the Grid at local, national, and Global levels Pete Gronbech GridPP Project Manager ACAT - Brunel Sept 2011.
Drinking Water Infrastructure Needs Survey and Assessment 2007 Training.
Learningcomputer.com SQL Server 2008 – Administration, Maintenance and Job Automation.
INFN-GRID Testbed Monitoring System Roberto Barbera Paolo Lo Re Giuseppe Sava Gennaro Tortone.
Chapter 8 Collecting Data with Forms. Chapter 8 Lessons Introduction 1.Plan and create a form 2.Edit and format a form 3.Work with form objects 4.Test.
Drinking Water Infrastructure Needs Survey and Assessment 2007 Website.
Fermilab Distributed Monitoring System (NGOP) Progress Report J.Fromm K.Genser T.Levshina M.Mengel V.Podstavkov.
CERN IT Department CH-1211 Genève 23 Switzerland t MSG status update Messaging System for the Grid First experiences
Graphing and statistics with Cacti AfNOG 11, Kigali/Rwanda.
A Brief Documentation.  Provides basic information about connection, server, and client.
WSM Administrator Training. WSM Administrator Discussion of WSM Administrator responsibilities Discussion of WSM administrative interfaces Detailed discussion.
1 Implementing Monitoring and Reporting. 2 Why Should Implement Monitoring? One of the biggest complaints we hear about firewall products from almost.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 11: Monitoring Server Performance.
PearsonAccess April 14, PearsonAccess – Agenda Order Tracking Additional Orders Student Data Upload (SDU) files New Student Wizard Online Testing.
0 eCPIC Admin Training: OMB Submission Packages and Annual Submissions These training materials are owned by the Federal Government. They can be used or.
Local Monitoring at SARA Ron Trompert SARA. Ganglia Monitors nodes for Load Memory usage Network activity Disk usage Monitors running jobs.
PerfSONAR-PS Functionality February 11 th 2010, APAN 29 – perfSONAR Workshop Jeff Boote, Assistant Director R&D.
NovaBACKUP xSP Technical Training By: Nathan Fouarge
Master thesis Analysis and implementation of monitoring systems of active network equipment. Scientific advisor: Univ. Prof., Dr. Hab., Pavel TOPALA Master.
Nagios Fusion 2012 Mike Guthrie Twitter: mguthrie88 Projects:
30 Copyright © 2009, Oracle. All rights reserved. Using Oracle Business Intelligence Delivers.
FTS monitoring work WLCG service reliability workshop November 2007 Alexander Uzhinskiy Andrey Nechaevskiy.
Monitoring with InfluxDB & Grafana
A Service-Based SLA Model HEPIX -- CERN May 6, 2008 Tony Chan -- BNL.
MyEdge Overview Comprehensive and Easy Customer Portal.
SIGMA Requestor Training In this presentation we will cover : How to log a Sigma ticket How to update a ticket via the notification function How.
Overview of cluster management tools Marco Mambelli – August OSG Summer Workshop TTU - Lubbock, TX THE UNIVERSITY OF CHICAGO.
17 Copyright © 2006, Oracle. All rights reserved. Information Publisher.
Web Server Administration Chapter 11 Monitoring and Analyzing the Web Environment.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Update on Service Availability Monitoring (SAM) Marian Babik, David Collados,
Online | classroom| Corporate Training | certifications | placements| support Contact: USA : , India.
Ethan Galstad What Is Nagios? What Nagios Is IT Infrastructure Monitoring.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Nagios Grid Monitor E. Imamagic, SRCE OAT.
Stavroula Balopoulou , Angelo Lykiardopoulos, Sissy Iona HCMR-HNODC
SQL Database Management
Architecture Review 10/11/2004
NGI and Site Nagios Monitoring
Printer Admin Print Job Manager
Kashif Mohammad Deputy Technical Co-ordinator (South Grid) Oxford
Presentation transcript:

Monitoring of Castor at RAL Castor F2F Rutherford Appleton Laboratory, February 18 th 2009 Cheney Ketley RAL

Overview of current monitoring Nagios LSF monitoring – see Brian’s talk later Statistical – TSBN / ganglia and cacti Diagnostic – checkreplicas Of servers running castor

NAGIOS

Capabilities of Nagios Browser frontend API access – Mimic created by James Adams RAL alerting Structured around groups and services Scheduled downtimes for hosts / services / groups Acknowledgements of alerts Comments to alerts and scheduled downtimes Triggers to execute external scripts Dependencies between alerts User authentication Supports snmp, plugins to nrpe, nsca forwarding Supports pager and alerts

Overnight callout Triggers – Detected errors will execute a script that bleeps a pager and sends a message Support person – refers to callout documentation on how to fix problems – Fixes castor from home – logs actions in RT ticketing (helpdesk) system This works well !

Overview – grid format Note the easy-to-see errors. 325 servers monitored in this view

Scheduling of downtimes So alerts are not sent during downtime. Note comments. Multiple servers put into downtime via hostgroup.

Red alerts only Nagios can show only the errors if desired Useful for easy monitoring

Dealing with an alert Note that acknowledgement is not the only way Note a comment is also displayed but not shown here

Example of an alert Note that extra text can be added which is useful for explaining the problem.

Methods of detecting problems NRPE – Custom plugins can be written – Standard plugins cover very many situations SNMP NSCA – Send a message from a user-written script with an alert code – Useful where nrpe cannot be installed

Mimic – created by RAL Server room layout at a glance

Summary of Nagios experiences Overall – very positive Plus points – Supports overnight callout via pager alerts – A powerful tool – Scales well – into hundreds of servers so far – Supports many architectures / OS’s – besides castor Negative points – Re-configuration to add server requires restart – Dependency checks not fully functional

OTHER MONITORING TOOLS

TSBN Extract from tape server logs with analysis by MS Excel pivot tables Perl extract written by Tim Bell (Cern) MS Excel analysis written by Cheney Ketley (RAL)

TSBN – analysis by tape server Filters can be used to select particular data and ranges

TSBN - overview Note there are many tabs at page bottom for different views on data Provided by Excel pivot tables

Ganglia, Cacti, Gridview Ganglia monitors server performance, Cacti is for the network. Ganglia also shows feeds into Castor e.g. FTS Gridview gives a broad view

Ganglia plots This shows performance graphs by server or by service

Another ganglia plot Useful for analysis by feed e.g. FTS

Gridview This gives an overview of site performance

DIAGNOSTICS

Diagnostics Cern checkreplicas.pl (perl script) – Webmin wrapper to checkreplicas with hyperlinks to drill down into data – Broken at the moment ! – Contact is Other ad hoc tools exist – sql scripts written by Bonny – Contact is

SERVER-SIDE

Operating system and environment Logwatch – covers diskspace, logins, usage, cron usage, packages updated, and other messages not recognised by logwatch IPMI being rolled out (for server physical alerts and remote control) Nagios also monitors the operating system components Backups – messages from Amanda

Logwatch message Shows key information regarding operating system But causes problems with overload…

FUTURE

Future direction More Nagios plugins Performance metrics More diagnostic tools

Contacts Jonathan Wheeler is main Nagios administrator but not responsible for Castor – Cheney Ketley is secondary Nagios administrator and supports Castor – James Adams wrote Mimic –