INDIANA UNIVERSITY
Grid Monitoring from a GOC Perspective
John Hicks, HPCC Engineer, Indiana University
October 27, 2002
Internet2 Fall Members Meeting, HENP Working Group – Los Angeles
Overview
This presentation is concerned with work being done for the iVDGL/iGOC demonstration at SuperComputing.
  Identifying the issues
  NOC vs. iGOC
  Getting information
  GOC tools
Web site:
Identifying the issues
What role should the GOC play in grid monitoring?
  Should the GOC just collect and publish general information about the grid status?
  Should the GOC collect information for troubleshooting problems?
  Should the GOC try to direct traffic and identify potential problems, analogous to an air traffic controller? (Suggested by Saul Youssef, Boston University.)
Identifying the issues (cont.)
What are some of the potential problems the GOC can help solve?
Resource status and availability.
  Computational node
  Storage node
  Network
  Services (MDS)
Resource availability can be determined with something as simple as a ping.
Resource status depends on the measurement criteria.
  What is the machine's current load?
  How much disk space is available?
  What is the measured network throughput between nodes?
  Are LDAP services available on this machine?
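The availability and status questions on this slide can be sketched in a few lines of Python. This is only an illustration, not the GOC's tooling: the function names are hypothetical, and a TCP connect stands in for an ICMP ping because a raw ping usually needs elevated privileges.

```python
import os
import shutil
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def local_status(path="/"):
    """Collect the load and disk figures for the local machine (Unix only)."""
    load1, _load5, _load15 = os.getloadavg()
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "load_1min": load1,
        "disk_free_bytes": usage.free,
        "disk_free_fraction": free_fraction,
    }
```

Availability is a yes/no answer, while status only becomes meaningful once thresholds are chosen for values such as `load_1min` or `disk_free_fraction`, which is the "measurement criteria" point made on the slide.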
Identifying the issues (cont.)
What information does the GOC need to help solve problems?
What data needs to be gathered?
  Grid centric (MDS).
  OS centric (Ganglia, Nagios).
  Network centric (SNMP, other network monitoring tools).
What is the data, and how frequently must it be acquired?
  Static (total number of nodes in a cluster).
  Dynamic but infrequent (number of available nodes).
  Dynamic and frequent (jobs running on a cluster).
  Realtime (available network bandwidth).
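One way to act on the four data classes above is to give each class its own polling interval. The sketch below assumes hypothetical metric names and interval values; none of them come from the presentation.

```python
# Polling interval in seconds for each data class (illustrative assumptions).
POLL_INTERVAL = {
    "static": 24 * 3600,    # e.g. total number of nodes in a cluster
    "infrequent": 15 * 60,  # e.g. number of available nodes
    "frequent": 60,         # e.g. jobs running on a cluster
    "realtime": 5,          # e.g. available network bandwidth
}

# Hypothetical metrics, each assigned to one data class.
METRICS = {
    "total_nodes": "static",
    "available_nodes": "infrequent",
    "running_jobs": "frequent",
    "available_bandwidth": "realtime",
}


def due_metrics(last_polled, now):
    """Return the names of metrics whose polling interval has elapsed.

    last_polled maps metric name to the time of its last poll in seconds;
    a metric that has never been polled is always due.
    """
    due = []
    for name, data_class in METRICS.items():
        last = last_polled.get(name)
        if last is None or now - last >= POLL_INTERVAL[data_class]:
            due.append(name)
    return sorted(due)
```

Separating the classes this way keeps static data from being re-fetched as often as realtime data, which matters once many sites report to one collector.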
NOC vs. iGOC
The Global NOC provides first-level support for network-related problems, typically over networks within its domain of control.
The iGOC should provide first-level support for network, facility, and infrastructure related problems, not necessarily within its domain of control.
The Global NOC has network engineers on staff. As far as I know, there is no such thing as a grid engineer.
NOC performance monitoring usually has a demarcation point (e.g. wall jack, edge device, etc.) within a homogeneous network.
GOC performance monitoring must measure end-to-end performance in a heterogeneous network and end-node environment.
The GOC must use the NOC as a resource for solving problems.
Getting information
A key component of a successful GOC is accurate contact information.
In order to solve problems or monitor resources, you have to know who to talk to.
We are currently collecting the following contact information from each site on the grid.
  High Performance Computing (HPC) contact.
  Principal Investigator (PI).
  Network person or local NOC contact.
  Security.
  Storage.
  System administrator.
GOC tools
We are using and developing the following tools to meet the GOC monitoring requirements.
  Nagios
  Ganglia
  LDAP tools
  GOC and other tools
Nagios
Nagios® is a host and service monitor designed to inform you of network problems and end system problems.
  Nagios provides simple ping availability of resources on the network.
  Nagios works with a set of “plugins” to provide local and remote host service status.
  Custom “plugins” are relatively easy to develop.
  Different methods are provided for remote resource discovery.
Nagios is freely available from
Nagios
Currently using the following built-in Nagios plugins:
  check_users
  check_load
  check_disk
  check_procs
  check_mem
Current Nagios plugin development:
  check_nagios (see if a remote Nagios is running).
  check_aggregate (summarize and collect the status of a group of services).
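Nagios plugins report through a simple convention: one line of text on standard output plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below shows how a summarizing check in the spirit of check_aggregate might reduce a group of service results to one status. It is not the GOC's plugin; the function name, the message format, and the choice to rank UNKNOWN below WARNING are all assumptions.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABEL = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Ranking used to pick the worst state. Placing UNKNOWN below WARNING is an
# assumption of this sketch, not a Nagios rule.
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def aggregate(results):
    """Reduce {service_name: return_code} to (return_code, one-line message)."""
    if not results:
        return UNKNOWN, "AGGREGATE UNKNOWN - no service results supplied"
    worst = max(results.values(), key=lambda code: SEVERITY[code])
    counts = {code: 0 for code in LABEL}
    for code in results.values():
        counts[code] += 1
    summary = ", ".join(
        f"{counts[code]} {LABEL[code]}" for code in (OK, WARNING, CRITICAL, UNKNOWN)
    )
    return worst, f"AGGREGATE {LABEL[worst]} - {summary}"
```

A plugin built on this would print the message and exit with the returned code, which is all Nagios needs to display and alert on the result.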
Nagios
There are different ways Nagios can get information from plugins.
  nrpep (Perl version of nrpe).
  check_by_ssh (passive).
  check_by_ssh (active).
Nagios remote plugin execution (Perl).
  Easy to use once set up.
  Uses MD5 and TripleDES.
  Scales reasonably well for a large number of hosts.
  Must have remote root access to set up.
Nagios
check_by_ssh (passive).
  Easy to use once set up.
  sshd is already running in most places.
  Requires a crontab entry to push data to the server.
  Scales reasonably well for a large number of hosts.
check_by_ssh (active).
  Easy to use once set up.
  sshd is already running in most places.
  Does not scale well for a large number of hosts.
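In the passive mode, the monitored host runs the check itself and pushes the result to the server. Nagios accepts such results as `PROCESS_SERVICE_CHECK_RESULT` external commands. The sketch below only formats that command line; the helper name is hypothetical, and how the line travels to the server (for example a cron job writing over ssh) is left out because the slides do not describe it.

```python
def passive_result_line(host, service, return_code, output, timestamp):
    """Format one passive service check result as a Nagios external command."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2, or 3")
    # Fields are separated by semicolons and a newline ends the command, so
    # both are conservatively replaced in the free-text plugin output.
    clean = output.replace(";", ",").replace("\n", " ")
    return (
        f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{clean}"
    )
```

For example, `passive_result_line("node01", "check_load", 0, "OK - load 0.10", 1035700000)` produces a single line that begins with the bracketed timestamp and ends with the plugin output.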
Nagios
The current iVDGL Nagios implementation for the SuperComputing demo uses a star topology.
  One Nagios server.
  Using check_by_ssh (passive).
  Does not scale well.
  Quick and dirty demo.
Proposed persistent GOC Nagios infrastructure.
  Run a Nagios server at the gatekeeper of each cluster.
  Gatekeeper Nagios is only responsible for the local site.
  Aggregate summary information and send it to a regional Nagios server.
  GOC maintains a Meta Nagios with grid health status.
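The proposed hierarchy amounts to summarizing status at each level before passing it upward. A rough sketch under stated assumptions: the nested dictionary layout, the state names as strings, and the severity ranking are all illustrative and not part of the proposal itself.

```python
# Ranking used to pick the worst state at each level (an assumption).
SEVERITY = {"OK": 0, "UNKNOWN": 1, "WARNING": 2, "CRITICAL": 3}


def worst(states):
    """Return the most severe state, or UNKNOWN when there is nothing to report."""
    return max(states, key=SEVERITY.__getitem__, default="UNKNOWN")


def roll_up(grid):
    """Summarize {region: {site: {service: state}}} level by level.

    Each site reduces its own services (the gatekeeper's job), each region
    reduces its sites, and the top level reduces the regions.
    """
    summary = {"sites": {}, "regions": {}}
    for region, sites in grid.items():
        site_states = []
        for site, services in sites.items():
            state = worst(services.values())
            summary["sites"][f"{region}/{site}"] = state
            site_states.append(state)
        summary["regions"][region] = worst(site_states)
    summary["grid"] = worst(summary["regions"].values())
    return summary
```

Only one state per site crosses each boundary, so the top-level server handles a number of results proportional to the number of regions rather than the number of services.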
Ganglia
Ganglia provides a complete pseudo real-time monitoring and execution environment.
  Ganglia provides a mechanism to link not only the nodes of a cluster but also entire clusters to one another.
  The Ganglia Monitoring Daemon (gmond) is a multithreaded daemon that runs on each node that you want to monitor.
  The Ganglia Meta Daemon (gmetad) allows you to monitor clusters.
  The Ganglia web front end uses PHP and RRDTool.
Ganglia is freely available at
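gmond publishes the cluster state as an XML document, which makes it straightforward to pull individual metrics out for other tools. The sketch below parses a small hand-written sample; the sample's host names and values are invented, and the element and attribute names reflect my understanding of the gmond format, so they should be checked against real output.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample in the style of gmond output (an assumption).
SAMPLE_XML = """<GANGLIA_XML VERSION="2.5.0" SOURCE="gmond">
  <CLUSTER NAME="example-cluster">
    <HOST NAME="node01" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
      <METRIC NAME="disk_free" VAL="12.5" TYPE="double" UNITS="GB"/>
    </HOST>
    <HOST NAME="node02" IP="10.0.0.2">
      <METRIC NAME="load_one" VAL="3.75" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def metric_by_host(xml_text, metric_name):
    """Return {host_name: value} for one numeric metric in a gmond-style dump."""
    root = ET.fromstring(xml_text)
    values = {}
    for host in root.iter("HOST"):
        for metric in host.findall("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = float(metric.get("VAL"))
    return values
```

Hosts that do not report the requested metric are simply absent from the result, which lets a caller tell "no data" apart from a zero reading.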
Ganglia
Ganglia has been modified to provide VO-centric reporting.
Standard Ganglia does not provide layered reporting.
VO-centric Ganglia has the following features:
  Monitoring of host resources (processor load, memory load, disk load, etc.)
  Simple plugin design that allows users to easily develop their own service checks (included from the standard version)
  Grid and VO related sensors
  Publishing/retrieving summary information to third parties
  Optional SSL-enabled communication (meta-daemons and web interface)
  MDS interface for collecting the list of reporting nodes
  Optional web interface for viewing current network status, notification and problem history, log file, etc.
  Interface with Nagios(TM)
Developed by Catalin Lucian, University of Chicago
LDAP tools
Grid centric information can be obtained from the MDS.
There are a couple of good LDAP web interfaces.
  LDAPExplorer,
  John’s LDAP Web interface,
There are a number of Perl modules for LDAP,
The key to extracting information is understanding the schema.
  Find out who is responsible for the schema and take an active role in its development.
  Always build dynamic search query tools.
  Learn to use ldapsearch and grid-info-search.
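Building search queries dynamically mostly means assembling an LDAP filter from attribute and value pairs while escaping the characters the filter syntax reserves (RFC 4515). The sketch below does that and assembles an `ldapsearch` argument list. The helper names are hypothetical, and any host, port, base DN, or attribute name passed in depends on the MDS deployment and its schema.

```python
def escape_filter_value(value):
    """Escape the characters RFC 4515 reserves inside an LDAP filter value."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\0": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


def build_filter(conditions):
    """AND together {attribute: value} pairs into one LDAP filter string."""
    parts = [f"({attr}={escape_filter_value(val)})" for attr, val in conditions.items()]
    if not parts:
        return "(objectClass=*)"  # match every entry when no condition is given
    if len(parts) == 1:
        return parts[0]
    return "(&" + "".join(parts) + ")"


def ldapsearch_args(host, port, base, conditions, attributes=()):
    """Assemble an argument list for ldapsearch with simple authentication."""
    return [
        "ldapsearch", "-x",
        "-H", f"ldap://{host}:{port}",
        "-b", base,
        build_filter(conditions),
        *attributes,
    ]
```

Returning an argument list rather than a shell string means the result can be handed to `subprocess.run` without any shell quoting, and the filter builder can be reused with a Perl or Python LDAP module instead of the command-line tool.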
GOC and other tools
GOC staff are being presented with a new set of challenges.
  New tools are being developed to meet these challenges.
  A combination of new and old tools is required to monitor and troubleshoot grid issues.
  Future GOC staff and “Grid Engineers” will need a broad skill set in order to be effective.
There are many other grid and cluster monitoring packages: MonaLisa, GOSSIP, Gridview, etc.
There are many network monitoring packages.
  MRTG, SNAPP, and other RRDTool collectors.
  Netflow tools.
  Weather Map software.
  OCxMON.
  Pinger.
Questions and discussion
John Hicks
Indiana University