Visualization of Monitoring Data at the NASA Advanced Supercomputing Facility Janice Singh

Slides:



Advertisements
Similar presentations
1 Copyright © 2002 Pearson Education, Inc.. 2 Chapter 1 Introduction to Perl and CGI.
Advertisements

Nagios: An introduction and Brief Tutorial
Nagios on Tier1 farm Jonathan Wheeler RAL Tier1 Fabric Team 20 th June 2008.
©2011 Quest Software, Inc. All rights reserved.. Andrei Polevoi, Tatiana Golubovich Program Management Group ActiveRoles Add-on Manager Overview.
Open Science Grid Discovering and understanding the site environment Or, yet another site test kit.
Chapter 20 Oracle Secure Backup.
Coursework 2: getting started (3) – hosting static web pages Chris Greenhalgh G54UBI /
Lucas Schill Brent Grover Ed Schilla Advisor: Danny Miller.
Real World Uses for Nagios APIs Janice Singh
Monitoring of Castor at RAL Castor F2F Rutherford Appleton Laboratory, February 18 th 2009 Cheney Ketley RAL.
1 OBJECTIVES To generate a web-based system enables to assemble model configurations. to submit these configurations on different.
1 CHEP 2000, Roberto Barbera Roberto Barbera (*) Grid monitoring with NAGIOS WP3-INFN Meeting, Naples, (*) Work in collaboration with.
Setting up of condor scheduler on computing cluster Raman Sehgal NPD-BARC.
Advanced Workgroup System. Printer Admin Utility Monitors printers over IP networks Views Sharp and non-Sharp SNMP Devices Provided Standard with Sharp.
Bookshelf.EXE - BX A dynamic version of Bookshelf –Automatic submission of algorithm implementations, data and benchmarks into database Distributed computing.
ALEPH version 21 Task Manager. New Task Manager Interface Admin tab 2 The Task Manager interface has been removed from the ALEPH menu, and is now found.
Nada Abdulla Ahmed.  SmoothWall Express is an open source firewall distribution based on the GNU/Linux operating system. Designed for ease of use, SmoothWall.
Extensible Scalable Monitoring for Clusters of Computers Eric Anderson U.C. Berkeley Summer 1997 NOW Retreat.
INTERNET DATABASE. Internet and E-commerce Internet – a worldwide collection of interconnected computer network Internet – a worldwide collection of interconnected.
1 Chapter Overview Introduction to Windows XP Professional Printing Setting Up Network Printers Connecting to Network Printers Configuring Network Printers.
Slide 1 of 9 Presenting 24x7 Scheduler The art of computer automation Press PageDown key or click to advance.
Printing Terminology. Requirements for Network Printing At least one computer to operate as the print server Sufficient RAM to process documents Sufficient.
Sharepoint Portal Server Basics. Introduction Sharepoint server belongs to Microsoft family of servers Integrated suite of server capabilities Hosted.
NovaBACKUP 10 xSP Technical Training By: Nathan Fouarge
SEEM4570: XAMPP, Eclipse, Summary of Html Kangfei Zhao Room 711,ERB
Monitoring and Troubleshooting Chapter 17. Review What role is required to share folders on Windows Server 2008 R2? What is the default permission listed.
Microsoft Windows 2003 Server. Client/Server Environment Many client computers connect to a server.
The SAM-Grid Fabric Services Gabriele Garzoglio (for the SAM-Grid team) Computing Division Fermilab.
Squiggle Lan Messenger.
Building service testbeds on FIRE D5.2.5 Virtual Cluster on Federated Cloud Demonstration Kit August 2012 Version 1.0 Copyright © 2012 CESGA. All rights.
CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.
Chapter 6 The World Wide Web. Web Pages Each page is an interactive multimedia publication It can include: text, graphics, music and videos Pages are.
Josh Riggs Utilizing Open Source Network Monitoring.
Inventory:OCSNG + GLPI Monitoring: Zenoss 3
ISG We build general capability Introduction to Olympus Shawn T. Brown, PhD ISG MISSION 2.0 Lead Director of Public Health Applications Pittsburgh Supercomputing.
Advanced Features of Nagios XI Sam Lansing -
MagicInfo Pro Scheduler Now that a template has been created from content imported into the Library, the user is ready to begin scheduling content to.
Network Management Tool Amy Auburger. 2 Product Overview Made by Ipswitch Affordable alternative to expensive & complicated Network Management Systems.
TELE 301 Lecture 10: Scheduled … 1 Overview Last Lecture –Post installation This Lecture –Scheduled tasks and log management Next Lecture –DNS –Readings:
Contents 1.Introduction, architecture 2.Live demonstration 3.Extensibility.
INFN-GRID Testbed Monitoring System Roberto Barbera Paolo Lo Re Giuseppe Sava Gennaro Tortone.
INDIANAUNIVERSITYINDIANAUNIVERSITY Grid Monitoring from a GOC perspective John Hicks HPCC Engineer Indiana University October 27, 2002 Internet2 Fall Members.
VPO Troubleshooting. [vpo_troubleshooting] 2 VPO Troubleshooting Section Overview Possible Trouble Areas Filesystem structures Management Server logfiles.
Support in setting up a non-grid Atlas Tier 3 Doug Benjamin Duke University.
A Brief Documentation.  Provides basic information about connection, server, and client.
Automated Scheduling and Operations for Legacy Applications.
1 Implementing Monitoring and Reporting. 2 Why Should Implement Monitoring? One of the biggest complaints we hear about firewall products from almost.
Lemon Monitoring Miroslav Siket, German Cancio, David Front, Maciej Stepniewski CERN-IT/FIO-FS LCG Operations Workshop Bologna, May 2005.
What is SAM-Grid? Job Handling Data Handling Monitoring and Information.
FTP File Transfer Protocol Graeme Strachan. Agenda  An Overview  A Demonstration  An Activity.
CSE 548 Advanced Computer Network Security Trust in MobiCloud using Hadoop Framework Updates Sayan Cole Jaya Chakladar Group No: 1.
ISG We build general capability Introduction to Olympus Shawn T. Brown, PhD ISG MISSION 2.0 Lead Director of Public Health Applications Pittsburgh Supercomputing.
Tier3 monitoring. Initial issues. Danila Oleynik. Artem Petrosyan. JINR.
LSF Universus By Robert Stober Systems Engineer Platform Computing, Inc.
SAN DIEGO SUPERCOMPUTER CENTER Welcome to the 2nd Inca Workshop Sponsored by the NSF September 4 & 5, 2008 Presenters: Shava Smallen
Monitoring with InfluxDB & Grafana
A Service-Based SLA Model HEPIX -- CERN May 6, 2008 Tony Chan -- BNL.
1 Chapter Overview Monitoring Access to Shared Folders Creating and Sharing Local and Remote Folders Monitoring Network Users Using Offline Folders and.
Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland t CF CC Monitoring I.Fedorko on behalf of CF/ASI 18/02/2011 Overview.
Overview of cluster management tools Marco Mambelli – August OSG Summer Workshop TTU - Lubbock, TX THE UNIVERSITY OF CHICAGO.
Maintaining and Updating Windows Server 2008 Lesson 8.
Distributed Monitoring with Nagios: Past, Present, Future Mike Guthrie
Advanced Computing Facility Introduction
Metrics data published Via different methods Monitoring Server
Monitoring with Nagios
TYPES OF SERVER. TYPES OF SERVER What is a server.
Lab 1 introduction, debrief
INSTALLING AND SETTING UP APACHE2 IN A LINUX ENVIRONMENT
Securing web applications Externally
Presentation transcript:

Visualization of Monitoring Data at the NASA Advanced Supercomputing Facility Janice Singh

NASA Advanced Supercomputing Facility Pleiades –11,136-node SGI ICE supercluster 162,496 cores total (32,768 additional GPU cores) Tape Storage –pDMF cluster NFS servers –home filesystems on computing systems Lustre Filesystems –each filesystem is multiple servers PBS (Portable Batch System) –job scheduler that runs on computing systems Ref: Janice Singh -

Why Visualization is Needed 24 x 7 Help Desk –need a quick overview of system status but still more specific than nagios visualization –not just single status per host, but sub-groups per host –they assess situations before calling next level of support automatic alerts from nagios are not as selective –interrelated issues allows us to see how many systems affected by one issue Janice Singh -

Heads Up Display – For Staff Janice Singh -

Heads Up Display – For Staff (details) Janice Singh -

Heads Up Display – for Users Janice Singh -

All the Parts nagios nrpe nsca datagg (in-house software) apache perl/cgi-bin Janice Singh -

Data flow network firewall (The Enclave) nsca nrpe ssh Cluster Compute Node Remote Node Web Server nagios nscanagios.cmd datagg nagios2.cmd nagios nagios web interface HUD orange - pipe file green - flat file purple - web site HUD buffer nrpe Dedicated Nagios Node Janice Singh -

Nagios (and add-ons) setup The Basics the webserver and the main nagios server are the same machine there is a network firewall called “the enclave” –most compute nodes are inside the enclave –the webserver can only receive data from the enclave, not send the servers within the enclave send nagios data to the webserver via nsca for the servers outside the enclave, nrpe is used plugins written in Perl using Nagios::Plugin Janice Singh -

Nagios (and add-ons) setup, cont. Versions when I inherited the systems, they were all using nagios 2.10 –most systems have been upgraded to nagios 3.4+ –all new systems have nagios 3.5 –webserver is still using 2.10 nsca across the board Janice Singh -

Nagios (and add-ons) setup, cont. Clusters within the enclave there is one host that is considered the Dedicated Nagios Server the rest of the hosts are monitored using nrpe exceptions on Pleiades: –there are many hosts monitored that get reimaged often difficult to administer nrpe use check_by_ssh –ssh is flaky under nagios 3, so it still uses nagios 2.10 will randomly give the error: Could not open pipe: /usr/bin/ssh –use 2 Dedicated Nagios Servers so many checks that there was unacceptable latency –tests that should run every 2 mins were running every 30 mins Janice Singh -

datagg (DATa AGGregator) why it was needed (i.e. what nagios couldn’t do for us) –error summaries of nagios problems the nagios webpage does tell number of alerts per service group, but that cannot be leveraged via API Janice Singh -

datagg (DATa AGGregator), cont. why it was needed (i.e. what nagios couldn’t do for us) –parsing data about Portable Batch System (PBS) –piecing together large outputs from NSCA current output from PBS: 404,937 characters Janice Singh -

datagg (DATa AGGregator), cont. why it was needed (i.e. what nagios couldn’t do for us) –mapping service nodes to the appropriate Lustre filesystem they are in a servicegroup (but not available via API) Janice Singh -

datagg (DATa AGGregator), cont. in-house written perl script –it reads the command file (pipe) that nsca creates on the webserver –the nagios configuration on the webserver also writes to the nsca command file –it aggregates the data that it reads in from the pipe and writes it out to a flat file referred to as the “HUD buffer” The data read in is in the format: [$timestamp] PROCESS_SERVICE_CHECK_RESULT; $hostname; $service_description; $state; $output Janice Singh -

HUD Buffer is a Windows-style.ini file –sections used to group together the boxes (or sub-boxes) ex: [pleiades daemons] –keys name=value –this is where we put the nagios state and output of plugin –every section also has the key Error Summary Janice Singh -

Data flow network firewall (The Enclave) nsca nrpe ssh Cluster Compute Node Remote Node Web Server nagios nscanagios.cmd datagg nagios2.cmd nagios nagios web interface HUD orange - pipe file green - flat file purple - web site HUD buffer nrpe Dedicated Nagios Node Janice Singh -

Displays Two versions –Internal (aka Staff HUD) on the internal network Staff HUD and nagios web interface clicking on Staff HUD goes to the nagios web page – for the service checks for the host or service group –the nagios web interface gives more details on the plugin output than is displayed on the HUD –the nagios web interface is used to suspend/restart notifications –External (aka miniHUD) –Permissions set in the Apache config file Both written in Perl using cgi-bin Janice Singh -

Future Plans use a database to collect data –this will also allow us to have historical data beyond what is in the logs, which can be used in graphing –will eliminate the need for a flat file Janice Singh -

Future Plans things that would be great to see in nagios 4 –an API –make a nagios check run “on demand” –no more random “could not open pipe” errors –lower latency (we’re having problems with less than 600 checks!) –error summaries –way to send large amounts of data via nsca –a way to see the exact command that nagios ran on the nagios webpage Janice Singh -

Questions? Janice Singh -