SAM Alarm Triggering and Masking


1 SAM Alarm Triggering and Masking
Domenico Vicinanza, CERN. COD 13, Stockholm, June 2007

2 Alarm triggering
Procedure to trigger an alarm. All of the following conditions must hold:
the test result is ERROR or CRIT;
the node belongs to a certified site;
the VO is 'OPS';
the test is critical for the OPS VO;
no alarm already exists for that test, VO and node;
the node is not in maintenance.
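The triggering conditions above can be sketched as a single predicate. This is a minimal illustration, not the SAM implementation: the `SamResult` class, the function name and the boolean parameters are hypothetical, and the lookups they stand for (site certification, criticality, existing alarms, maintenance) are assumed to have been done elsewhere.

```python
from dataclasses import dataclass


@dataclass
class SamResult:
    """Hypothetical container for one test result."""
    status: str  # e.g. "ERROR", "CRIT", "WARN"
    vo: str
    node: str
    test: str


def should_trigger_alarm(result, *, site_certified, critical_for_ops,
                         alarm_exists, node_in_maintenance):
    """Return True only if every condition listed on the slide holds.

    The four keyword flags are assumed to come from other lookups;
    how SAM obtains them is not described in the slides.
    """
    return (
        result.status in {"ERROR", "CRIT"}
        and site_certified
        and result.vo == "OPS"
        and critical_for_ops
        and not alarm_exists
        and not node_in_maintenance
    )
```

Keeping the lookups outside the predicate makes each condition easy to test in isolation.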

3 Alarms Info
Data stored for each alarm in the SAM DB:
alarm identifier
VO identifier
test identifier
node identifier
weight (see next slides on masking)
test exec time
alarm status (new, assigned, masked, off)
update time
ticket id (GGUS)
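As an illustration, the stored fields could be modelled as a record like the one below. The field names and Python types are assumptions chosen for readability; the slides do not give the actual SAM DB column names or types.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class AlarmStatus(Enum):
    """The four alarm statuses named on the slide."""
    NEW = "new"
    ASSIGNED = "assigned"
    MASKED = "masked"
    OFF = "off"


@dataclass
class Alarm:
    """Hypothetical in-memory view of one alarm row."""
    alarm_id: int
    vo_id: int
    test_id: int
    node_id: int
    weight: int                      # priority score
    test_exec_time: datetime
    status: AlarmStatus
    update_time: datetime
    ticket_id: Optional[str] = None  # GGUS ticket, if one was opened
```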

4 Alarms Masking
Automatic alarm masking:
Simple rule-based correlation engine.
If there are one or more alarms with status='new' for this VO, node and test, the new alarm is triggered as masked.
Rules defining test relationships among alarms (not restricted right now; to be restricted from now on).
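The masking rule can be sketched as follows. This is only an illustration under stated assumptions: alarms are plain dictionaries, and the optional `related_tests` mapping is a hypothetical stand-in for the "rules defining test relationships", whose real form is not given in the slides. With no mapping supplied, the function applies the rule literally (same VO, node and test).

```python
def initial_status(new_alarm, existing_alarms, related_tests=None):
    """Return "masked" if a matching alarm with status "new" exists, else "new".

    related_tests (hypothetical) maps a test name to the names of other
    tests whose alarms should mask it.
    """
    related_tests = related_tests or {}
    masking_tests = {new_alarm["test"]} | set(
        related_tests.get(new_alarm["test"], ())
    )
    for alarm in existing_alarms:
        if (
            alarm["status"] == "new"
            and alarm["vo"] == new_alarm["vo"]
            and alarm["node"] == new_alarm["node"]
            and alarm["test"] in masking_tests
        ):
            return "masked"
    return "new"
```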

5

6 Prioritisation of alarms
A prioritisation mechanism for the alarms is set up according to a scoring scheme. Depending on the service, a certain number of "points" is assigned to an alarm according to its relevance (i.e. its responsibility for causing failures in other services). For example, LFC has a larger score (40000) than SE (10000), since if LFC is failing, SE will consequently fail too.
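A small sketch of how such service scores could order alarms. Only the two scores quoted above (LFC and SE) are used; the dictionary layout, the function name and the example node names are hypothetical.

```python
# Scores quoted in the text; other services are omitted here.
SERVICE_SCORE = {"LFC": 40000, "SE": 10000}


def sort_by_service_score(alarms):
    """Return alarms ordered from highest to lowest service score."""
    return sorted(
        alarms,
        key=lambda alarm: SERVICE_SCORE[alarm["service"]],
        reverse=True,
    )


alarms = [
    {"service": "SE", "node": "se01.example.org"},
    {"service": "LFC", "node": "lfc01.example.org"},
]
ordered = sort_by_service_score(alarms)
print([a["service"] for a in ordered])
```

With these two alarms the LFC alarm is listed first, reflecting that an LFC failure is expected to cause SE failures.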

7 Example of scoring alarms
Example of scoring mechanism depending on the service:
points: VOBOX, BDII, VOMS, LFC, WMS, RB.
points: SRM, MyProxy, FTS.
points: RGMA, sBDII.
points: gCE, CE, SE.

8 Alarms responsible for other alarms
If an alarm masks another one (so the alarm is "important" because it causes other alarms), 1000 points are added to the alarm weight to show that it is causing other failures as well and should therefore be dealt with at high priority, up to a maximum of points.
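The bonus could be sketched as below. Two points are assumptions rather than facts from the slides: that the 1000 points are added once per masked alarm, and that the maximum applies to the bonus rather than to the total weight. The maximum itself is not given in the transcript, so it is left as a required parameter.

```python
MASKING_BONUS = 1000  # points added for masking another alarm


def add_masking_bonus(weight, masked_count, max_bonus):
    """Return the weight after adding the (capped) masking bonus.

    Assumes one bonus per masked alarm, capped at max_bonus;
    max_bonus must be supplied by the caller.
    """
    bonus = min(MASKING_BONUS * masked_count, max_bonus)
    return weight + bonus
```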

9 Prioritisation of alarms (cont.)
Depending on the test status:
100 points if 'INFO'
200 points if 'NOTE'
300 points if 'WARN'
400 points if 'ERROR'
500 points if 'CRIT'
Depending on the number of CPUs in the site:
value taken from the 'CE-totalcpu' test divided by 100. This gives a number in the range [0-50].
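These two components can be sketched as small functions. The status table is taken directly from the slide; clamping the CPU component to 50 is an assumption based on the stated [0-50] range, and the function names are hypothetical.

```python
STATUS_POINTS = {
    "INFO": 100,
    "NOTE": 200,
    "WARN": 300,
    "ERROR": 400,
    "CRIT": 500,
}


def status_points(status):
    """Points contributed by the test status."""
    return STATUS_POINTS[status]


def cpu_points(total_cpu):
    """Points contributed by site size: CE-totalcpu value divided by 100.

    Clamped to the range 0..50 (the clamp is an assumption).
    """
    return max(0.0, min(total_cpu / 100, 50.0))
```

For example, a site reporting 1200 CPUs would contribute 12 points, so between two otherwise equal alarms the one at the larger site ranks higher.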

10 Happy End... Thanks!!!

