Performance Monitoring of SLAC Blackbox Nodes Using Perl, Nagios, and Ganglia Roxanne Martinez Mentor: Yemi Adesanya United States Department of Energy.

Presentation transcript:

Performance Monitoring of SLAC Blackbox Nodes Using Perl, Nagios, and Ganglia Roxanne Martinez Mentor: Yemi Adesanya United States Department of Energy Stanford, CA 94305

SCCS The Scientific Computing and Computing Services at SLAC: Provides computing power, technical support, communications capabilities. Core services include Unix systems, Windows, networking, network operations, telecommunications. Supplies dept. support, science applications, network security. Houses thousands of servers.

The High Performance Computing Group of SCCS To ensure optimal computing performance of all of these servers, they must be monitored. This is the responsibility of the HPC group. The group watches data storage, electrical service to servers, cooling system abilities. This is made possible through the use of monitoring software: Nagios and Ganglia.

SCCS Task Until last year, all computing capacity at SLAC was located within the SCCS computing building. By then the datacenter had reached its maximum electrical service and cooling system capacities. New experiments meant the need for more computing power. A new datacenter would take years and a lot of funding to complete.

The Solution: Blackboxes This is a Sun Modular Datacenter produced by Sun Microsystems. It is a portable computing center built into a standard 8 foot by 20 foot shipping container. It is painted white for energy efficiency and is tightly sealed, insulated, and cooled. Today, SLAC maintains 2 blackboxes.

Blackbox Contents Blackbox 1 –252 bali machines (Sun X2200 servers) Blackbox 2 –156 yili machines (Sun X4100 servers) –139 boer machines (Sun X2200 servers) The operating system on these machines is RedHat Enterprise Linux (RHEL) version 4.

Current Monitoring of the Blackboxes The High Performance Computing Group currently uses Nagios and Ganglia to monitor: Percentage of CPU in use, Amount of memory in use, and Input/output rates. The software periodically calls on utilities to extract monitoring data for the machines, displaying the info in graphs, storing the info in databases, and – in the case of Nagios – alerting administrators if machines reach warning or critical states.

Nagios User specifies items to be monitored by providing external plugins that return the status of machines to Nagios. If a warning or critical status is returned, Nagios can alert via e-mail, IM, text, etc. Admins and users can view current status and history using a web browser. –MySQL runs as a server to provide multi-user access to multiple databases. Interface: PerfParse. –Round robin database (RRD) provides useful graphs of broad historical data. Popular because the database files do not increase in size over time.
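The plugin contract described on this slide can be sketched as follows. This is a minimal illustration in Python (the project's own scripts were written in Perl), and the function name `plugin_line` and the sample readings are hypothetical. Under the Nagios plugin convention, a plugin prints one line of output and reports its status through the process exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. Text after the `|` character is performance data intended for graphing tools.

```python
# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def plugin_line(label, status, readings):
    """Build the single output line a plugin prints (hypothetical helper).

    The part before '|' is meant for people; the part after it is
    performance data as space-separated name=value pairs.
    """
    human = ", ".join(f"{name}={value:.3f}" for name, value in readings.items())
    perfdata = " ".join(f"{name}={value:.3f}" for name, value in readings.items())
    return f"{label} {STATUS_NAMES[status]} - {human} | {perfdata}"


# Made-up readings, for illustration only.
sample = {"CPU_0_Temp": 40.0, "CPU_1_Temp": 42.5}
print(plugin_line("Temperature", OK, sample))
```

A real plugin would finish by exiting with the status code (for example `sys.exit(OK)`), since the exit code is what Nagios uses to decide whether to raise an alert.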

Ganglia Robust, scalable distributed monitoring system designed for clusters and grids. Based on a hierarchical design: uses a tree of connections to representative nodes for each cluster, reducing overhead. Updates the RRD. Has a web frontend like Nagios but does not have an alerting feature.

Additional Monitoring Needed Temperature Fan speed Power supply voltage

“Materials” Baseboard management controller (BMC) –Service processor that monitors the physical state of the machine. –Located on the motherboard. –Performs monitoring through the machine's sensors. –Part of the Intelligent Platform Management Interface (IPMI), which provides a set of interfaces to manage and monitor a system. IPMI tool –Open source utility. –Can be used to extract physical parameters and parameter thresholds, which are important in determining the status: Lower Non-Recoverable, Lower Critical, Lower Non-Critical, Upper Non-Critical, Upper Critical, and Upper Non-Recoverable.
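One way the six thresholds could be turned into a monitoring status is sketched below, in Python for illustration (the project's scripts were Perl). The mapping is an assumption, not something the slides specify: crossing a non-critical threshold is treated as WARNING, and crossing a critical or non-recoverable threshold as CRITICAL. The function name `classify` and the short keys (`lnr`, `lcr`, `lnc`, `unc`, `ucr`, `unr`) are hypothetical.

```python
def classify(value, thresholds):
    """Map a sensor reading to OK / WARNING / CRITICAL / UNKNOWN.

    thresholds maps 'lnr', 'lcr', 'lnc', 'unc', 'ucr', 'unr' to floats.
    A missing key or a None value means the BMC defines no such threshold.
    ASSUMPTION: non-critical -> WARNING; critical/non-recoverable -> CRITICAL.
    """
    if value is None:
        return "UNKNOWN"  # no reading available

    def at_or_below(key):
        limit = thresholds.get(key)
        return limit is not None and value <= limit

    def at_or_above(key):
        limit = thresholds.get(key)
        return limit is not None and value >= limit

    if (at_or_below("lnr") or at_or_below("lcr")
            or at_or_above("ucr") or at_or_above("unr")):
        return "CRITICAL"
    if at_or_below("lnc") or at_or_above("unc"):
        return "WARNING"
    return "OK"


# Made-up thresholds for a temperature sensor (upper limits only).
temp_limits = {"unc": 90.0, "ucr": 95.0, "unr": 100.0}
print(classify(45.0, temp_limits), classify(92.0, temp_limits), classify(96.0, temp_limits))
```

Because undefined thresholds are skipped, the same function can serve sensors that only have upper limits (typically temperature), only lower limits (typically fan speed), or both (typically voltage), which fits the goal of working regardless of the specific BMC.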

“Materials” continued “sudo ipmitool -c sdr” “sudo ipmitool sensor list” Output for both commands is from the Sun X2200 server boer0113.
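The command output itself did not survive in this transcript, so the sketch below works from an assumption: that `ipmitool sensor list` prints one pipe-separated line per sensor, with the columns name, reading, unit, status, and then the six thresholds from lowest to highest. It is written in Python for illustration (the project's scripts were Perl); the sample line is made up, and `parse_sensor_line` is a hypothetical helper.

```python
# ASSUMED column order of `ipmitool sensor list` output.
FIELDS = ("name", "value", "unit", "status",
          "lnr", "lcr", "lnc", "unc", "ucr", "unr")
TEXT_FIELDS = ("name", "unit", "status")


def to_float(text):
    """Convert a field to float; 'na' and other non-numeric text become None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_sensor_line(line):
    """Split one pipe-separated sensor line into a dict, or None if malformed."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != len(FIELDS):
        return None
    row = dict(zip(FIELDS, parts))
    for key in FIELDS:
        if key not in TEXT_FIELDS:
            row[key] = to_float(row[key])
    return row


# Made-up sample line, not taken from boer0113.
sample = "CPU 0 Temp | 45.000 | degrees C | ok | na | na | na | 90.000 | 95.000 | 100.000"
print(parse_sensor_line(sample))
```

Converting `na` to `None` keeps "threshold not defined" distinct from a threshold of zero, which matters when the same script runs against different server models.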

“Materials” continued Cron (Chronograph) –Time-based scheduling service in Unix. –Used for security reasons since root user is needed to collect data. Perl –ideal Unix scripting language for the task. –Interpreted language; no compiler. –Efficient programming language that is powerful for file input and output because of its text manipulation capabilities and fast development cycle.

Task Create three Perl scripts (temperature, fan speed, voltage) that can be used on any machine regardless of the specific BMC. –Work first with yili0113, bali0113, and boer0113. –Cron will run as the root user to call the IPMI tool and will store the data every 15 minutes in a readable file. –The scripts will read the data from the file every 15 minutes to produce the current machine parameters and interpret the current status of the machine (OK, WARNING, CRITICAL, UNKNOWN). –For Nagios, the scripts will return the current status and parameters. –For Ganglia, the scripts will call the Ganglia command, passing in the parameters.
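Two pieces of this plan can be sketched in Python (for illustration only; the project's scripts were Perl). The first assumes that the Ganglia command in question is `gmetric`, which publishes a single metric and accepts `--name`, `--value`, `--type`, and `--units`. The second is an assumption about when UNKNOWN might apply: if the file written by cron has not been refreshed for more than two 15-minute cycles, the data is treated as stale. The function names and the 30-minute limit are hypothetical.

```python
import subprocess

CRON_INTERVAL_SECONDS = 15 * 60  # cron stores sensor data every 15 minutes


def gmetric_command(name, value, units, gmetric="gmetric"):
    """Build the argument list for publishing one metric (ASSUMES gmetric)."""
    return [gmetric, "--name", name, "--value", f"{value:.3f}",
            "--type", "float", "--units", units]


def publish(name, value, units):
    """Run gmetric; raises CalledProcessError if the command fails."""
    subprocess.run(gmetric_command(name, value, units), check=True)


def is_stale(file_mtime, now, max_age_seconds=2 * CRON_INTERVAL_SECONDS):
    """True when the data file is older than the allowed age (ASSUMED policy)."""
    return (now - file_mtime) > max_age_seconds


# Made-up value; shows the command that would be run without running it.
print(" ".join(gmetric_command("CPU_0_Temp", 45.0, "Celsius")))
```

In use, `file_mtime` would come from `os.path.getmtime()` on the data file and `now` from `time.time()`. Passing the command as a list rather than a shell string avoids quoting problems if a sensor name contains spaces.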

Results In a test of the check_cpu_temp.pl script on the bali0113 machine, the following results were produced using the Perl interpreter: “Temperature OK - CPU_0_Temp=49.000, CPU_1_Temp= | CPU_0_Temp= CPU_1_Temp=51.000”

The Scripts as Nagios Plugins

Ganglia work is still underway!

Conclusions Perl scripts, Nagios monitoring, and graphics tools work successfully. All three test machines are running with acceptable temperatures, fan speeds, and power supply voltages. This suggests that current cooling systems and electrical supplies in blackboxes are effective. The monitoring must be done on all servers, however, for a complete evaluation to be possible. The HPC group is much closer to ensuring optimal computing performance for the lab.

Future Work The scripts are portable. –3 test machines –KIPAC machines –All blackbox machines upon approval –Possibly more to come The scripts can also be edited to monitor different parameters.

Acknowledgements Thank you to the U.S. Department of Energy Office of Science and the Stanford Linear Accelerator Center for the opportunity to participate in the Science Undergraduate Laboratory Internships program. Thank you to Steve, Susan, and Farah. Thank you to my mentor, Yemi Adesanya, for his mentorship throughout the project.