Performance Monitoring of SLAC Blackbox Nodes Using Perl, Nagios, and Ganglia Roxanne Martinez Mentor: Yemi Adesanya United States Department of Energy.

1 Performance Monitoring of SLAC Blackbox Nodes Using Perl, Nagios, and Ganglia Roxanne Martinez Mentor: Yemi Adesanya United States Department of Energy Stanford, CA 94305

2 SCCS The Scientific Computing and Computing Services at SLAC: Provides computing power, technical support, communications capabilities. Core services include Unix systems, Windows, networking, network operations, telecommunications. Supplies dept. support, science applications, network security. Houses thousands of servers.

3 The High Performance Computing Group of SCCS All of these servers must be monitored to ensure optimal computing performance. This is the responsibility of the HPC group. The group watches data storage, electrical service to servers, and cooling system capacity. This is made possible through the use of monitoring software: Nagios and Ganglia.

4 SCCS Task Until last year, all computing capacity at SLAC was located within the SCCS computing building. By then the datacenter had reached its maximum electrical service and cooling system capacities. New experiments meant the need for more computing power. A new datacenter would take years and a lot of funding to complete.

5 The Solution: Blackboxes This is a Sun Modular Datacenter produced by Sun Microsystems. It is a portable computing center built into a standard 8 foot by 20 foot shipping container. It is painted white for energy efficiency and is tightly sealed, insulated, and cooled. Today, SLAC maintains 2 blackboxes.

6 Blackbox Contents Blackbox 1 –252 bali machines (Sun X2200 servers) Blackbox 2 –156 yili machines (Sun X4100 servers) –139 boer machines (Sun X2200 servers) The operating system on these machines is RedHat Enterprise Linux (RHEL) version 4.

7 Current Monitoring of the Blackboxes The High Performance Computing Group currently uses Nagios and Ganglia to monitor: Percentage of CPU in use, Amount of memory in use, and Input/output rates. The software periodically calls on utilities to extract monitoring data for the machines, displaying the info in graphs, storing the info in databases, and – in the case of Nagios – alerting administrators if machines reach warning or critical states.

8 Nagios User specifies items to be monitored by providing external plugins that return the status of machines to Nagios. If a warning or critical status is returned, Nagios can alert via email, IM, text, etc. Admins and users can view current status and history using a web browser. –MySQL runs as a server to provide multi-user access to multiple databases. Interface: PerfParse. –Round robin database (RRD) provides useful graphs of broad historical data. Popular because the database files do not increase in size over time.
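The plugin contract described on this slide can be sketched as follows. This is a minimal illustration in Python, not the Perl used in the project. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention; the function name and the readings are hypothetical.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def format_plugin_output(service, status, readings):
    """Build the one line a plugin prints: readable text, then '|', then perfdata."""
    # Human-readable part: comma-separated name=value pairs.
    text = ", ".join(f"{name}={value:.3f}" for name, value in readings)
    # Performance data part: space-separated pairs that Nagios can store and graph.
    perfdata = " ".join(f"{name}={value:.3f}" for name, value in readings)
    return f"{service} {STATUS_NAMES[status]} - {text} | {perfdata}"


# Hypothetical readings; a real plugin would obtain these from the machine.
example_readings = [("CPU_0_Temp", 49.0), ("CPU_1_Temp", 51.0)]
print(format_plugin_output("Temperature", OK, example_readings))
```

A real plugin would also end by exiting with the status code (for example `sys.exit(OK)`), since Nagios reads the state from the exit code and only displays the printed line.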

9 Ganglia Robust, scalable, distributed monitoring system designed for clusters and grids. Based on a hierarchical design: uses a tree of connections to representative nodes for each cluster, reducing overhead. Updates the RRD. Has a web frontend like Nagios but does not have an alerting feature.

10 Additional Monitoring Needed Temperature Fan speed Power supply voltage

11 “Materials” Baseboard management controller (BMC) –Service processor that monitors the physical state of the machine. –Located on the motherboard. –Performs monitoring using the machine's sensors. –Part of the Intelligent Platform Management Interface (IPMI), which provides a set of interfaces to manage and monitor a system. IPMI tool (ipmitool) –Open source utility. –Can be used to extract physical parameters and parameter thresholds, which are important in determining the status: Lower Non-Recoverable, Lower Critical, Lower Non-Critical, Upper Non-Critical, Upper Critical, and Upper Non-Recoverable.
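One way the six IPMI thresholds could be turned into a monitoring status is sketched below in Python (the project's scripts were written in Perl). The mapping is an assumption, not taken from the slides: crossing a non-critical threshold gives WARNING, crossing a critical or non-recoverable threshold gives CRITICAL, and a reading exactly at a threshold counts as crossing it. Thresholds a sensor does not define are passed as `None`.

```python
def classify(value, lnr=None, lcr=None, lnc=None, unc=None, ucr=None, unr=None):
    """Map a sensor reading to OK / WARNING / CRITICAL using IPMI thresholds.

    lnr/lcr/lnc are the lower non-recoverable, critical and non-critical
    limits; unc/ucr/unr are the upper ones. None means "not defined".
    """
    def at_or_below(limit):
        return limit is not None and value <= limit

    def at_or_above(limit):
        return limit is not None and value >= limit

    # Assumed policy: critical and non-recoverable limits both mean CRITICAL.
    if at_or_below(lnr) or at_or_below(lcr) or at_or_above(ucr) or at_or_above(unr):
        return "CRITICAL"
    # Assumed policy: non-critical limits mean WARNING.
    if at_or_below(lnc) or at_or_above(unc):
        return "WARNING"
    return "OK"


# Hypothetical temperature thresholds, in degrees Celsius.
print(classify(49.0, unc=90.0, ucr=93.0, unr=95.0))
```

Lower thresholds matter mostly for fan speed and voltage, where a reading that is too low is the failure case.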

12 “Materials” continued “sudo ipmitool -c sdr” “sudo ipmitool sensor list” The output for both commands is from the Sun X2200 server boer0113.
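The command output is not preserved in this transcript. As a rough sketch, `ipmitool sensor list` typically prints one pipe-separated row per sensor: name, reading, units, status, then the six thresholds from Lower Non-Recoverable to Upper Non-Recoverable, with `na` where a value is not defined. The Python parser below (not the project's Perl) assumes that ten-column layout; the sample rows and their values are invented for illustration, and real output may differ by BMC and ipmitool version.

```python
THRESHOLD_FIELDS = ("lnr", "lcr", "lnc", "unc", "ucr", "unr")

# Hypothetical sample in the assumed ten-column layout.
SAMPLE = """
CPU 0 Temp       | 49.000     | degrees C  | ok    | na        | na        | na        | 90.000    | 93.000    | 95.000
CPU 1 Temp       | 51.000     | degrees C  | ok    | na        | na        | na        | 90.000    | 93.000    | 95.000
CPU Fan          | 9000.000   | RPM        | ok    | na        | 2000.000  | na        | na        | na        | na
"""


def to_float(field):
    """Convert a numeric field to float; 'na' and non-numeric readings become None."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_sensor_list(text):
    """Return {sensor name: {"value", "units", "status", "thresholds"}}."""
    sensors = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 10:
            continue  # skip blank or unexpected lines
        name, value, units, status = parts[:4]
        sensors[name] = {
            "value": to_float(value),
            "units": units,
            "status": status,
            "thresholds": dict(zip(THRESHOLD_FIELDS, map(to_float, parts[4:]))),
        }
    return sensors


for sensor_name, info in parse_sensor_list(SAMPLE).items():
    print(sensor_name, info["value"], info["units"])
```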

13 “Materials” continued Cron (Chronograph) –Time-based scheduling service in Unix. –Used for security reasons, since root privileges are needed to collect the data. Perl –Ideal Unix scripting language for the task. –Interpreted language; no separate compile step. –Powerful for file input and output because of its text manipulation capabilities, with a fast development cycle.

14 Task Create three Perl scripts (temperature, fan speed, voltage) that can be used on any machine regardless of the specific BMC. –Work first with yili0113, bali0113, and boer0113. –Cron will run the IPMI tool as the root user every 15 minutes and store the data in a readable file. –The scripts will read the data from the file every 15 minutes to produce the current machine parameters and interpret the current status of the machine (OK, WARNING, CRITICAL, UNKNOWN). –For Nagios, the scripts will return the current status and parameters. –For Ganglia, the scripts will call the Ganglia command, passing in the parameters.
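Two pieces of this design can be sketched in Python (the actual scripts were Perl). The first reads the file that cron writes and treats a missing or stale file as "no data", which a script could report as UNKNOWN. The staleness limit of two missed 15-minute runs is an assumption. The second builds a command line for `gmetric`, Ganglia's metric-injection utility, which is assumed here to be the "Ganglia command" the slide refers to; the slides do not name it. Function names and the metric values are hypothetical.

```python
import os
import time

# Assumption: data older than two 15-minute cron intervals is considered stale.
MAX_AGE_SECONDS = 2 * 15 * 60


def read_cache(path, now=None, max_age=MAX_AGE_SECONDS):
    """Return the cached sensor text, or None if the file is missing or stale."""
    now = time.time() if now is None else now
    try:
        if now - os.path.getmtime(path) > max_age:
            return None
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


def gmetric_command(name, value, units, value_type="float"):
    """Build the argument list for one gmetric call (not executed here)."""
    return [
        "gmetric",
        "--name", name,
        "--value", f"{value:.3f}",
        "--type", value_type,
        "--units", units,
    ]


# Hypothetical metric; the list could be passed to subprocess.run() on a Ganglia node.
print(" ".join(gmetric_command("CPU_0_Temp", 49.0, "Celsius")))
```

Keeping the root-only collection in cron and the parsing in an unprivileged script means the monitoring software never needs root access itself.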

15 Results In a test of the check_cpu_temp.pl script on the bali0113 machine, the following results were produced using the Perl interpreter: “Temperature OK - CPU_0_Temp=49.000, CPU_1_Temp=51.000 | CPU_0_Temp=49.000 CPU_1_Temp=51.000”
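The result line has two parts separated by `|`: the status text that Nagios displays, and the performance data it can store and graph. A minimal Python sketch of splitting such a line is shown below; the function name is hypothetical, and it assumes values without unit suffixes, as in the result above.

```python
def parse_plugin_line(line):
    """Split a plugin output line into (status, {metric name: value})."""
    text, _, perfdata = line.partition("|")
    # The status is the last word before " - ", e.g. "Temperature OK" -> "OK".
    status = text.split(" - ", 1)[0].split()[-1]
    metrics = {}
    for token in perfdata.split():
        label, _, raw = token.partition("=")
        # Perfdata may carry ';warn;crit' after the value; keep only the value.
        metrics[label] = float(raw.split(";")[0])
    return status, metrics


result = ("Temperature OK - CPU_0_Temp=49.000, CPU_1_Temp=51.000 "
          "| CPU_0_Temp=49.000 CPU_1_Temp=51.000")
print(parse_plugin_line(result))
```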

16 The Scripts as Nagios Plugins

27 Ganglia work is still underway!

28 Conclusions Perl scripts, Nagios monitoring, and graphics tools work successfully. All three test machines are running with acceptable temperatures, fan speeds, and power supply voltages. This suggests that current cooling systems and electrical supplies in blackboxes are effective. The monitoring must be done on all servers, however, for a complete evaluation to be possible. The HPC group is much closer to ensuring optimal computing performance for the lab.

29 Future Work The scripts are portable. –3 test machines –KIPAC machines –All blackbox machines upon approval –Possibly more to come The scripts can also be edited to monitor different parameters.

30 Acknowledgements Thank you to the U.S. Department of Energy Office of Science and the Stanford Linear Accelerator Center for the opportunity to participate in the Science Undergraduate Laboratory Internships program. Thank you to Steve, Susan, and Farah. Thank you to my mentor, Yemi Adesanya, for his mentorship throughout the project.
