Alarm framework requirements
Andrea Sciabà, Tony Wildish
Experiment Support, CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it


1 Alarm framework requirements
Andrea Sciabà, Tony Wildish
Experiment Support, CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it
MTF meeting, 21/07/2011

2 Current status
Large parts of computing operations have very little in terms of alarms
– SLS notifications
  When availability is degraded
– Lemon alarms
  Via email, on Lemon web
Computing shifts largely rely on visual inspection of monitoring pages
The lack of a proper alarm framework is one of the most important issues to be addressed

3 Steps
Define detailed requirements
Look for a suitable existing framework
– Might develop a new one if nothing suitable exists, but that would be a long-term project
– As the use cases are rather universal, it is likely that something exists
As usual, start implementing from the most urgent and simple cases

4 Alarm classification
Alarms related to a node
– Full partitions, high load, etc.
– E.g. Lemon
Alarms related to service health
– Stuck or dead daemons, internal queue overload, errors in log files, etc.
High-level alarms on anomalous behaviours
– Too many failed jobs or transfers, too low a transfer rate, too few running jobs, too low CPU efficiency, etc.
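The three alarm classes above could be carried in a common record, so that later stages (filtering, history, aggregation) treat all sources uniformly. This is only a minimal sketch: the names AlarmClass and Alarm, the fields, and the example sources and targets are assumptions made for illustration, not part of any existing framework.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class AlarmClass(Enum):
    NODE = "node"            # full partitions, high load
    SERVICE = "service"      # stuck or dead daemons, queue overload, log errors
    BEHAVIOUR = "behaviour"  # too many failed jobs, low transfer rate, ...


@dataclass(frozen=True)
class Alarm:
    alarm_class: AlarmClass
    source: str    # hypothetical: name of the system that produced the alarm
    target: str    # hypothetical: node, service or site the alarm refers to
    message: str
    raised_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


# One illustrative alarm per class; all names are made up.
examples = [
    Alarm(AlarmClass.NODE, "Lemon", "node01", "partition /data is full"),
    Alarm(AlarmClass.SERVICE, "watchdog", "transfer-agent", "daemon not responding"),
    Alarm(AlarmClass.BEHAVIOUR, "job-monitor", "T2_Example", "too many failed jobs"),
]
```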

5 High level requirements: first iteration
1. Collect alarm data from other sources as well as generate its own alarms
2. Distribute alarm data, filtered and selected, via an API
3. Provide an “error logger”, email and SMS notifications
4. Have help pages associated with alarms to guide a human
5. Provide historical records of alarms and corresponding actions
6. Allow alarms to be correlated and aggregated to create higher-level alarms
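Requirement 6 (correlating and aggregating alarms into higher-level ones) can be illustrated with a small sketch. The rule used here, "raise a site-level alarm when at least two distinct services at the same site are in alarm", is an assumption chosen for the example, as are the names Alarm and aggregate_by_site and the site names.

```python
from collections import defaultdict
from typing import NamedTuple


class Alarm(NamedTuple):
    site: str
    service: str
    message: str


def aggregate_by_site(alarms, min_services=2):
    """Return one higher-level alarm per site where at least
    `min_services` distinct services currently report an alarm."""
    services_by_site = defaultdict(set)
    for alarm in alarms:
        services_by_site[alarm.site].add(alarm.service)

    higher_level = []
    for site, services in sorted(services_by_site.items()):
        if len(services) >= min_services:
            text = "multiple services degraded: " + ", ".join(sorted(services))
            higher_level.append(Alarm(site, "site", text))
    return higher_level


low_level = [
    Alarm("T1_A", "transfers", "too low transfer rate"),
    Alarm("T1_A", "jobs", "too many failed jobs"),
    Alarm("T1_A", "jobs", "too low CPU efficiency"),
    Alarm("T2_B", "jobs", "too few running jobs"),
]
site_alarms = aggregate_by_site(low_level)
```

Only T1_A produces a site-level alarm here, since T2_B has a single affected service.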

6 Detailed requirements (partial)
Number  Description
1.1     Must be able to collect input from WMAgent, PhEDEx, SAM, Hammercloud, HTTP servers
1.2     Must be able to collect input from Lemon, Nagios, FTS, …
1.3     Must provide an API to code custom sensors for specific situations for things that have no other form of monitoring
1.4     Must be able to collect information from remote clients, securely and robustly
2.1     Must show alarms in a web page, with browsable history and filtering criteria
2.2     Must support notifications via email, SMS, RSS
2.3     Must be able to export/publish information to other systems, e.g. MonaLisa
2.4     Must support notification filtered by severity, by parameters like site/service, by time of the day
2.5     Must support easy enabling/disabling of notifications
2.6     Must provide an API to retrieve alarms
2.7     Must not generate too many alarms for a single or repeating condition
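Requirements 2.4, 2.5 and 2.7 can be made concrete with a short sketch: a per-subscriber filter on severity, site and time of day with an on/off switch, and a suppressor that lets a repeating condition through at most once per quiet interval. Everything here is hypothetical (the class names, the integer severity scale, the 30-minute default); it is one possible reading of the requirements, not a design taken from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class Subscription:
    min_severity: int             # hypothetical scale: higher is more severe
    sites: frozenset              # empty set means "any site"
    start: time = time(0, 0)      # notify only between start and end
    end: time = time(23, 59, 59)  # (windows crossing midnight not handled)
    enabled: bool = True          # requirement 2.5: simple on/off switch

    def wants(self, severity, site, when):
        if not self.enabled or severity < self.min_severity:
            return False
        if self.sites and site not in self.sites:
            return False
        return self.start <= when.time() <= self.end


class RepeatSuppressor:
    """Let a given alarm key through at most once per `quiet` interval."""

    def __init__(self, quiet=timedelta(minutes=30)):
        self.quiet = quiet
        self._last_sent = {}

    def should_send(self, key, when):
        last = self._last_sent.get(key)
        if last is not None and when - last < self.quiet:
            return False
        self._last_sent[key] = when
        return True
```

A notifier would call should_send first with a key such as (site, service, condition), then wants for each subscriber, so that a repeating condition is damped once for everyone.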

7 Ideas
A “fractal” distributed system
– A ‘sensor’ for the global system is an instance of the system at a smaller scope
– The PhEDEx watchdog agent is a prototype of such a system, as it can kill/restart agents and send alerts to other systems
  Another example: the Cessy → T0 transfer system
‘Logger’
– A communication hub accepting input from various sources, to which one can subscribe to get notifications
– Could be used to deploy a network of information gatherers at CMS sites to channel alarms into a central system
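The two ideas combine naturally: a hub that accepts input and notifies subscribers, where a hub of smaller scope acts as a sensor of a hub of wider scope. The sketch below is a toy, in-process illustration of that shape only. The Hub class, its methods, and the site and agent names are all hypothetical, and nothing here reflects how the PhEDEx watchdog or the ‘Logger’ is actually implemented.

```python
class Hub:
    """Accepts messages from any source and fans them out to subscribers."""

    def __init__(self, scope):
        self.scope = scope
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, source, message):
        for callback in self._subscribers:
            callback(source, message)

    def attach_to(self, parent):
        """Make this hub act as a 'sensor' of a hub with a wider scope."""
        self.subscribe(lambda source, message:
                       parent.publish(f"{self.scope}/{source}", message))


central = Hub("central")
site = Hub("T2_Example")   # hypothetical site-level instance
site.attach_to(central)

received = []
central.subscribe(lambda source, message: received.append((source, message)))

# A local alarm at the site is channelled into the central hub.
site.publish("transfer-agent", "agent restarted")
```

A real deployment would replace the in-process callbacks with a secure remote transport, as requirement 1.4 asks.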

8 Conclusions
As usual, must start from requirements
Make a good choice of framework, discussing many options if possible
Find manpower for the development of sensors and their integration in CMS computing

