Monitoring Systems Jim Fromm Marc Mengel Jack Schmidt Gary Stiehr.

1 Monitoring Systems Jim Fromm Marc Mengel Jack Schmidt Gary Stiehr

2 Introduction
- During 2005, CSI recognized a crisis with NGOP.
  - Performance and reliability issues plagued NGOP.
  - CMS grew to hate NGOP: it was unreliable, a maintenance nightmare, and very difficult to configure.
  - The Farms group was having problems as well, and was getting frustrated with hundreds of false alarms nightly.
  - Configurations were time consuming and error prone for CSI staff, especially early in the learning curve. Tanya Levshina has been unbelievably helpful, especially early in that learning curve.
- CSI formulated a two-phase plan to address these problems:
  - Throw hardware at the problem to buy time so that we could investigate a longer-term solution. CSI created separate instances of NGOP for CMS and Farms. This, along with a network parameter tweak, greatly improved performance and reliability.
  - Investigate and implement a longer-term solution.

3 Introduction
- CMS chose to re-evaluate their long-term monitoring solution. Gary Stiehr began looking at other monitoring packages.
- CMS and CSI began collaborating, and CMS presented a list of requirements they needed for a monitoring system.
- CSI responded to these requirements, and through a series of meetings a solution began to emerge…
  - CSI was able to address most of the concerns of CMS.
  - Both groups agreed that the best long-term solution (3 years?) was to make some changes to NGOP and keep it as the primary monitoring package. But only if:
    - FTEs can be found to do the work on NGOP.
    - FTEs are committed to the general maintenance of NGOP. All monitoring systems require serious effort to configure and maintain, and NGOP is no different.

4 Introduction
- All monitoring systems require a team to provide support on an ongoing basis:
  - Development
  - Maintenance
  - Monitoring
- NGOP was developed for monitoring facilities.
  - Not for Physics.
  - Not for Grid.
  - Originally designed for lights-out operation of the computer room.
- When things are right, NGOP does exactly what it was intended to do.

5 The rest of this talk…
- Go into more detail on the problems and solutions.
- Discuss why keeping NGOP, with a moderate amount of development work and a commitment to maintaining it, is the best solution.

6 History of NGOP
- 1999: The NGOP group was created to gather requirements for a Distributed Management System (DMS) capable of efficiently monitoring the Fermilab computing facility for Run II. The decision to develop a custom DMS was made.
- 2000: First prototype implementation was released.
- 2001:
  - Monitoring by operator was started.
  - The “Xfalive” service was replaced with NGOP for all nodes monitored by CSD.
  - Interfacing with the Remedy Help Desk System.
  - Implementation of Web Admin Tools.
- 2002:
  - Major re-design of the NGOP GUI and implementation of the Web interface.
  - Web Admin Tool was integrated with the CD System Status Page.
  - NGOP interface to telalert.

7 History of NGOP (cont)
- 2003: Development frozen, accepted known deficiencies.
- 2004: Complete responsibility transferred to CSI. The majority of the code-level expertise lies outside of CSI.
- 2005: Problems identified in maintenance and scalability. Design note (DocDB 905-v1) written to address short- and long-term issues:
  - Short-term issues of scalability and reliability were addressed by splitting the production instance into three separate instances. The last instance is currently being brought live.
  - Long-term issues of missing features and reducing maintenance costs are in progress.
- Note: from 2002 to 2005 the NGOP software was virtually untouched! Until the summer of 2005, NGOP was a very reliable product.

8 I can’t give an NGOP talk without showing this: System Status Page

9 NGOP Stats

10 NGOP Stats (cont)

11

12 Current State of NGOP
- Three distinct instances (and systems):
  - CMS
  - Farms (not yet in production)
  - General
- With the new configuration, it is the monitoring objects that are spread over several machines, not the NGOP daemons.
- Each system will run all of the NGOP daemons on one machine. This simplifies things, while at the same time reducing the load on each machine because the monitoring objects are split.
- Keeps servers from being overloaded with network traffic.

13 Current State of NGOP
- Configuration upgrades are much faster for two reasons:
  - Less stuff.
  - Less stuff to do, cubed. The original design allowed for various system views. To allow for this, the configuration management server performed a transitive closure algorithm (Warshall’s) that runs in O(n^3) time. With the split, Warshall’s algorithm can be removed, and this part of the configuration now runs in O(n) time.
- CMS configuration now takes < 15 minutes and should grow linearly.
- Bi-weekly configuration upgrades are required for any change in NGOP.
- Configuration upgrades are time consuming and error prone.
  - 140+ XML files with over 13,000 LOC.
  - Error checking and logging are not good. Syntax pre-checking is possible, but other errors are not detected until the end of the configuration process.
- CSI writes agents and rules on clients because it is too complicated for casual users to do this.
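For readers unfamiliar with the algorithm named on this slide, the following is a minimal Python sketch of Warshall's transitive closure on a boolean adjacency matrix. It is not NGOP's code (the slides do not show it); the function name `transitive_closure` and the three-level example hierarchy are assumptions chosen for illustration. The three nested loops over n elements are where the O(n^3) cost comes from.

```python
from typing import List


def transitive_closure(adj: List[List[bool]]) -> List[List[bool]]:
    """Warshall's algorithm: result[i][j] is True if j is reachable from i.

    adj[i][j] is True when element i directly contains (or points to) j.
    The input matrix is not modified.
    """
    n = len(adj)
    reach = [row[:] for row in adj]  # work on a copy
    for k in range(n):               # allow paths that pass through k
        for i in range(n):
            if not reach[i][k]:
                continue             # i cannot reach k, so k adds nothing for i
            for j in range(n):
                if reach[k][j]:
                    reach[i][j] = True
    return reach


# Hypothetical hierarchy: 0 = site, 1 = cluster, 2 = node.
# The site contains the cluster, and the cluster contains the node.
adj = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
]
closure = transitive_closure(adj)
print(closure[0][2])  # the site reaches the node indirectly
```

Because the work grows with the cube of the number of elements, halving the number of monitored elements in one instance cuts this step to roughly one eighth, which is one way to read "less stuff to do, cubed".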

14 CMS Product Survey
- Gary Stiehr looked at monitoring tools with general requirements in mind and chose 11 (including NGOP) to evaluate against CMS’s full list of requirements:
  - Nagios
  - Mercury SiteScope
  - Fidelia NetVigil
  - Big Brother
  - SysOrb
  - Lemon
  - Big Sister
  - ZABBIX
  - IBM Director
  - InterMapper
  - NGOP
- 6 were freeware; 5 had fees (either per monitored element, per node, or per site).
- https://plone4.fnal.gov/P1/Main/CD/org/css/Shared/ngopng/prodrev/uscms_package_evals_2005-November_update.txt

15 CMS Requirements
- CMS presented a list of key requirements of a monitoring system. https://plone4.fnal.gov/P1/Main/CD/org/css/Shared/ngopng
- CMS’s requirements are not unique to them; they represent a good list that is useful for others.
- CSI categorized the requirements:
  - Green: Already available in NGOP. 10 Greens.
  - Yellow: Available with a moderate amount of work. 7 Yellows.
  - Red: Major development required to implement. 5 Reds.
- So of the 22 items, 5 require significant development in NGOP.

16 “Red” Items
- [R16] Should not have to take down the entire monitoring system to update configuration parameters. Add monitored elements and hosts on the fly.
  - Long known to be a deficiency of NGOP. Shouldn’t have to wait 2 weeks to add a host.
- [R17] Capable of a complete restart in 15 minutes.
  - Not possible with the scale listed above.
- [F1] Provides redundancy.
  - Listed as desirable, but the solution we are proposing will address this.
- [R21] Ease of configuration.
  - A front-end editor for the XML could help, but configuration will probably never be easy.
- [R5] Accelerate alarms based on time.

17 Work-Around of “Red” Items
- Address four “Red” items with:
  - Two mirrored instances of NGOP, one “live”.
  - Re-configure the “non-live” instance, and then flip over when complete.
  - Requires changes to the API, or some networking trick, to allow dual reporting.
  - Checks and balances in place to make sure the two instances are identical.
- Major development effort to implement this: a 6-12 month FTE coding effort.
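The mirrored-instance idea on this slide can be sketched in a few lines of Python. This is only an illustration of the pattern under stated assumptions, not a design for NGOP: the class names `Instance` and `MirroredMonitor`, the dictionary-based configuration, and the in-memory event lists are all hypothetical stand-ins for real daemons, XML configuration, and network reporting.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    """Hypothetical stand-in for one full monitoring instance."""
    name: str
    config: Dict[str, str] = field(default_factory=dict)
    events: List[str] = field(default_factory=list)


class MirroredMonitor:
    """Two mirrored instances; exactly one is live at any time."""

    def __init__(self, config: Dict[str, str]) -> None:
        self.instances = [Instance("A", dict(config)), Instance("B", dict(config))]
        self.live_index = 0

    @property
    def live(self) -> Instance:
        return self.instances[self.live_index]

    @property
    def standby(self) -> Instance:
        return self.instances[1 - self.live_index]

    def report(self, event: str) -> None:
        # Dual reporting: every client event is delivered to both instances.
        for inst in self.instances:
            inst.events.append(event)

    def reconfigure(self, new_config: Dict[str, str]) -> None:
        # 1. Re-configure the non-live instance while the live one keeps serving.
        self.standby.config = dict(new_config)
        # 2. Flip over once the re-configuration is complete.
        self.live_index = 1 - self.live_index
        # 3. Bring the previously live instance up to date so the mirror matches.
        self.standby.config = dict(new_config)

    def in_sync(self) -> bool:
        # Check that the two instances are identical.
        a, b = self.instances
        return a.config == b.config and a.events == b.events


monitor = MirroredMonitor({"host1": "ping"})
monitor.report("host1 up")
monitor.reconfigure({"host1": "ping", "host2": "ping"})
print(monitor.live.name, monitor.in_sync())
```

In this sketch a host is added without any monitoring downtime, because the live instance is never the one being re-configured. The hard parts the slide calls out (API changes for dual reporting and keeping real instances identical) are exactly what the toy version glosses over.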

18 Summary
- NGOP has been the product in use for the last 5 years and is heavily relied on. It monitors about 60,000 objects.
- NGOP development was frozen 3 years ago; only showstoppers were fixed (last version released on January 21, 2004).
- CSI does not have all the code-level expertise in the product and does not have enough resources to acquire it. Tanya Levshina has been unbelievably helpful.
- We have spent some time investigating other products but could not find a suitable replacement.
- NGOP has some serious deficiencies, and CMS requires the addition of some new features.
- All of the above shows that we need to devote some effort to improving NGOP. We need to allocate resources to do some development (0.5 FTE) and to provide adequate support and maintenance (0.5 FTE ongoing).

