SAM Alarm Triggering and Masking

Slides:

Advertisements

Similar presentations

New VOMS servers campaign GDB, 8 th Oct 2014 Maarten Litmaath IT/SDC.

Advertisements

Łukasz Skitał 2, Renata Słota 1, Maciej Janusz 1 and Jacek Kitowski 1,2 1 Institute of Computer Science AGH University of Science and Technology, Mickiewicza.

CERN IT Department CH-1211 Genève 23 Switzerland t EIS section review of recent activities Harry Renshall Andrea Sciabà IT-GS group meeting.

SEE-GRID-SCI SEE-GRID-SCI Operations Procedures and Tools Antun Balaz Institute of Physics Belgrade, Serbia The SEE-GRID-SCI.

Monitoring in EGEE EGEE/SEEGRID Summer School 2006, Budapest Judit Novak, CERN Piotr Nyczyk, CERN Valentin Vidic, CERN/RBI.

11/30/2007 Overview of operations at CC-IN2P3 Exploitation team Reported by Philippe Olivero.

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The network monitoring in grid context Operations.

FP6−2004−Infrastructures−6-SSA E-infrastructure shared between Europe and Latin America Grid Monitoring Tools Alexandre Duarte CERN.

CERN Using the SAM framework for the CMS specific tests Andrea Sciabà System Analysis WG Meeting 15 November, 2007.

SEE-GRID-2 The SEE-GRID-2 initiative is co-funded by the European Commission under the FP6 Research Infrastructures contract no

1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.

SAM Tests SAM Devel. & Support Team CERN IT/GD WLCG/EGEE/OSG Operations Workshop 25 Jan. 2007, CERN.

CERN IT Department CH-1211 Geneva 23 Switzerland t CCRC’08 Tools for measuring our progress CCRC’08 F2F 5 th February 2008 James Casey, IT-GS-MND.

Handling ALARMs for Critical Services Maria Girone, IT-ES Maite Barroso IT-PES, Maria Dimou, IT-ES WLCG MB, 19 February 2013.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE Site Architecture Resource Center Deployment Considerations MIMOS EGEE Tutorial.

Grid Monitoring and Operations SAM Development Team CERN IT/GD Tier2 Admin Workshop 03 Dec. 2006, Mumbai.

Site Validation Session Report Co-Chairs: Piotr Nyczyk, CERN IT/GD Leigh Grundhoefer, IU / OSG Notes from Judy Novak WLCG-OSG-EGEE Workshop CERN, June.

The ATLAS Cloud Model Simone Campana. LCG sites and ATLAS sites LCG counts almost 200 sites. –Almost all of them support the ATLAS VO. –The ATLAS production.

SAM Sensors & Tests Judit Novak CERN IT/GD SAM Review I. 21. May 2007, CERN.

WLCG Service Report ~~~ WLCG Management Board, 7 th September 2010 Updated 8 th September

Last update 31/01/ :41 LCG 1 Maria Dimou Procedures for introducing new Virtual Organisations to EGEE NA4 Open Meeting Catania.

Service Availability Monitor tests for ATLAS Current Status Tests in development To Do Alessandro Di Girolamo CERN IT/PSS-ED.

Update of SAM Implementation ALICE TF Meeting 18/10/07.

EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Deliverable DSA1.4 Jules Wolfrat ARM-9 –

Vendredi 19 février 2016 CIC portal development status and TODO list Gilles Mathieu, Osman Aidel, Cyril L’Orphelin IN2P3/CNRS Computing Centre, Lyon, France.

SAM Database and relation with GridView Piotr Nyczyk SAM Review CERN, 2007.

RI EGI-InSPIRE RI UMD 2 Decommissioning Status Cristina Aiftimiei EGI.eu.

WLCG critical services update Andrea Sciabà WLCG operations coordination meeting December 18, 2014.

INFSO-RI Enabling Grids for E-sciencE Operations Parallel Session Summary Markus Schulz CERN IT/GD Joint OSG and EGEE Operations.

WLCG Service Report Jean-Philippe Baud ~~~ WLCG Management Board, 24 th August

SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.

Feedback from joining and first COD shift M.Radecki on behalf of CE ROC COD-7, Lyon, France.

WLCG Service Report ~~~ WLCG Management Board, 10 th November

SEE-GRID-SCI Grid Operations Procedures Antun Balaz Institute of Physics Belgrade Serbia The SEE-GRID-SCI initiative.

GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE7029 ATLAS CMS LHCb Totals

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The Dashboard for Operations Cyril L’Orphelin.

CERN IT Department CH-1211 Genève 23 Switzerland t EIS Section input to GLM For GLM attended by Director for Computing.

Vendredi 27 avril 2007 Management of ATLAS CC-IN2P3 Specificities, issues and advice.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Operations Portal OTAG September, 21th 2011 Cyril L’Orphelin – CCIN2P3/CNRS.

Nordic NE ROC Face 2 Face Meeting

CERN IT Department CH-1211 Geneva 23 Switzerland t LHCOPN Meeting Madrid, 11 th March 2008 James Casey WLCG Monitoring – An overview.

Service Availability Monitoring

EGI Operations Management Board

NGI and Site Nagios Monitoring

Andreas Unterkircher CERN Grid Deployment

CRIC ・ Authentication & Authorization

Service Operations at the T0/T1 for the ALICE Experiment

GDB 8th March 2006 Flavia Donno IT/GD, CERN

Pedro Andrade ACE Status Update Pedro Andrade

Grid Operations Procedures

Evolution of SAM in an enhanced model for monitoring the WLCG grid

Patricia Méndez Lorenzo ALICE Offline Week CERN, 13th July 2007

WLCG Management Board, 16th July 2013

Testing for patch certification

WLCG Service Interventions

Cyril L’Orphelin (CC-IN2P3) COD-19, Bologna, March 30th 2009

Grid Deployment Board meeting, 8 November 2006, CERN

Short update on the latest gLite status

ALICE – FAIR Offline Meeting KVI (Groningen), 3-4 May 2010

Take the summary from the table on

Monitoring in EGEE Automatisierung & Regionalisierung im Hinblick auf EGI Torsten Antoni (KIT), James Casey (CERN), Sabine Reißer (KIT)

Unsupported middleware migration update

UMD 2 / EMI 2 Decommissioning Status

UMD 2 Decommissioning Status

Kashif Mohammad Deputy Technical Co-ordinator (South Grid) Oxford

EGEE Operation Tools and Procedures

Site availability Dec. 19 th 2006

Dirk Duellmann ~~~ WLCG Management Board, 27th July 2010

Retirement calendar of gLite 3.2 and EMI 1 middleware

The LHCb Computing Data Challenge DC06

Presentation transcript:

SAM Alarm Triggering and Masking Domenico Vicinanza CERN COD 13, Stockholm 11-12. June, 2007

Alarm triggering Procedure to trigger an alarm: The test result is ERROR or CRIT, The node belongs to a certified site, VO is 'OPS‘, The test is critical for OPS VO, No alarm already for that test, vo and node, The node is not in maintenance.

Alarms Info Data stored of each alarm in the SAM DB: alarm identifier vo identifier test identifier node identifier weight (see next slides on masking) test exec time alarm status (new, assigned, masked, off) update time ticket id (GGUS)

Alarms Masking Automatic Alarms Masking: Simple rule based correlation engine If there is one or more alarms with status='new' for this VO, node and test => new alarm triggered as masked. Rules defining test relationships among alarms: (Not restricted right now. Restricting it from now on) http://lcg-sam.cern.ch:8080/alarms/mask_alarm.xsql

Prioritisation of alarms A prioritisation mechanism for the alarms is set up according to a scoring schema. Depending on the service a certain amount of “points” are associated to an alarm according to its relevance (i.e. its responsibility in causing other services failure) As an example LFC has a larger score (40000) compared with SE one (10000) since if LFC is failing SE will fail consequently

Example of scoring alarms Example of scoring mechanism depending on the service: 40.000 points: VOBOX, BDII, VOMS, LFC, WMS, RB. 30.000 points: SRM, MyProxy, FTS. 20.000 points: RGMA, sBDII. 10.000 points: gCE, CE, SE.

Alarms responsible for other alarms If an alarm masks another one (so the alarm is "important" as it causes other alarms): 1000 points are added to the alarm weight to show that it's causing other failures as well, so should be dealt with a high priority. up to a maximum of 9.000 points.

Prioritisation of alarms (cont.) Depending on the test status: 100 points if ‘INFO’ 200 points if ‘NOTE’ 300 points if ‘WARN’ 400 points if ‘ERROR’ 500 points if ‘CRIT’ Depending on n° of CPUs in the site: Value taken from the 'CE-totalcpu' test divided by 100. This gives a [0-50] number.

Happy End... Thanks!!!