SAM Alarm Triggering and Masking

Slides:



Advertisements
Similar presentations
New VOMS servers campaign GDB, 8 th Oct 2014 Maarten Litmaath IT/SDC.
Advertisements

Łukasz Skitał 2, Renata Słota 1, Maciej Janusz 1 and Jacek Kitowski 1,2 1 Institute of Computer Science AGH University of Science and Technology, Mickiewicza.
CERN IT Department CH-1211 Genève 23 Switzerland t EIS section review of recent activities Harry Renshall Andrea Sciabà IT-GS group meeting.
SEE-GRID-SCI SEE-GRID-SCI Operations Procedures and Tools Antun Balaz Institute of Physics Belgrade, Serbia The SEE-GRID-SCI.
Monitoring in EGEE EGEE/SEEGRID Summer School 2006, Budapest Judit Novak, CERN Piotr Nyczyk, CERN Valentin Vidic, CERN/RBI.
11/30/2007 Overview of operations at CC-IN2P3 Exploitation team Reported by Philippe Olivero.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The network monitoring in grid context Operations.
FP6−2004−Infrastructures−6-SSA E-infrastructure shared between Europe and Latin America Grid Monitoring Tools Alexandre Duarte CERN.
CERN Using the SAM framework for the CMS specific tests Andrea Sciabà System Analysis WG Meeting 15 November, 2007.
SEE-GRID-2 The SEE-GRID-2 initiative is co-funded by the European Commission under the FP6 Research Infrastructures contract no
1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.
SAM Tests SAM Devel. & Support Team CERN IT/GD WLCG/EGEE/OSG Operations Workshop 25 Jan. 2007, CERN.
CERN IT Department CH-1211 Geneva 23 Switzerland t CCRC’08 Tools for measuring our progress CCRC’08 F2F 5 th February 2008 James Casey, IT-GS-MND.
Handling ALARMs for Critical Services Maria Girone, IT-ES Maite Barroso IT-PES, Maria Dimou, IT-ES WLCG MB, 19 February 2013.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE Site Architecture Resource Center Deployment Considerations MIMOS EGEE Tutorial.
Grid Monitoring and Operations SAM Development Team CERN IT/GD Tier2 Admin Workshop 03 Dec. 2006, Mumbai.
Site Validation Session Report Co-Chairs: Piotr Nyczyk, CERN IT/GD Leigh Grundhoefer, IU / OSG Notes from Judy Novak WLCG-OSG-EGEE Workshop CERN, June.
The ATLAS Cloud Model Simone Campana. LCG sites and ATLAS sites LCG counts almost 200 sites. –Almost all of them support the ATLAS VO. –The ATLAS production.
SAM Sensors & Tests Judit Novak CERN IT/GD SAM Review I. 21. May 2007, CERN.
WLCG Service Report ~~~ WLCG Management Board, 7 th September 2010 Updated 8 th September
Last update 31/01/ :41 LCG 1 Maria Dimou Procedures for introducing new Virtual Organisations to EGEE NA4 Open Meeting Catania.
Service Availability Monitor tests for ATLAS Current Status Tests in development To Do Alessandro Di Girolamo CERN IT/PSS-ED.
Update of SAM Implementation ALICE TF Meeting 18/10/07.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Deliverable DSA1.4 Jules Wolfrat ARM-9 –
Vendredi 19 février 2016 CIC portal development status and TODO list Gilles Mathieu, Osman Aidel, Cyril L’Orphelin IN2P3/CNRS Computing Centre, Lyon, France.
SAM Database and relation with GridView Piotr Nyczyk SAM Review CERN, 2007.
RI EGI-InSPIRE RI UMD 2 Decommissioning Status Cristina Aiftimiei EGI.eu.
WLCG critical services update Andrea Sciabà WLCG operations coordination meeting December 18, 2014.
INFSO-RI Enabling Grids for E-sciencE Operations Parallel Session Summary Markus Schulz CERN IT/GD Joint OSG and EGEE Operations.
WLCG Service Report Jean-Philippe Baud ~~~ WLCG Management Board, 24 th August
SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.
Feedback from joining and first COD shift M.Radecki on behalf of CE ROC COD-7, Lyon, France.
WLCG Service Report ~~~ WLCG Management Board, 10 th November
SEE-GRID-SCI Grid Operations Procedures Antun Balaz Institute of Physics Belgrade Serbia The SEE-GRID-SCI initiative.
GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE7029 ATLAS CMS LHCb Totals
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The Dashboard for Operations Cyril L’Orphelin.
CERN IT Department CH-1211 Genève 23 Switzerland t EIS Section input to GLM For GLM attended by Director for Computing.
Vendredi 27 avril 2007 Management of ATLAS CC-IN2P3 Specificities, issues and advice.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Operations Portal OTAG September, 21th 2011 Cyril L’Orphelin – CCIN2P3/CNRS.
Nordic NE ROC Face 2 Face Meeting
CERN IT Department CH-1211 Geneva 23 Switzerland t LHCOPN Meeting Madrid, 11 th March 2008 James Casey WLCG Monitoring – An overview.
Service Availability Monitoring
EGI Operations Management Board
NGI and Site Nagios Monitoring
Andreas Unterkircher CERN Grid Deployment
CRIC ・ Authentication & Authorization
Service Operations at the T0/T1 for the ALICE Experiment
GDB 8th March 2006 Flavia Donno IT/GD, CERN
Pedro Andrade ACE Status Update Pedro Andrade
Grid Operations Procedures
Evolution of SAM in an enhanced model for monitoring the WLCG grid
Patricia Méndez Lorenzo ALICE Offline Week CERN, 13th July 2007
WLCG Management Board, 16th July 2013
Testing for patch certification
WLCG Service Interventions
Cyril L’Orphelin (CC-IN2P3) COD-19, Bologna, March 30th 2009
Grid Deployment Board meeting, 8 November 2006, CERN
Short update on the latest gLite status
ALICE – FAIR Offline Meeting KVI (Groningen), 3-4 May 2010
Take the summary from the table on
Monitoring in EGEE Automatisierung & Regionalisierung im Hinblick auf EGI Torsten Antoni (KIT), James Casey (CERN), Sabine Reißer (KIT)
Unsupported middleware migration update
UMD 2 / EMI 2 Decommissioning Status
UMD 2 Decommissioning Status
Kashif Mohammad Deputy Technical Co-ordinator (South Grid) Oxford
EGEE Operation Tools and Procedures
Site availability Dec. 19 th 2006
Dirk Duellmann ~~~ WLCG Management Board, 27th July 2010
Retirement calendar of gLite 3.2 and EMI 1 middleware
The LHCb Computing Data Challenge DC06
Presentation transcript:

SAM Alarm Triggering and Masking Domenico Vicinanza CERN COD 13, Stockholm 11-12. June, 2007

Alarm triggering Procedure to trigger an alarm: The test result is ERROR or CRIT, The node belongs to a certified site, VO is 'OPS‘, The test is critical for OPS VO, No alarm already for that test, vo and node, The node is not in maintenance.

Alarms Info Data stored of each alarm in the SAM DB: alarm identifier vo identifier test identifier node identifier weight (see next slides on masking) test exec time alarm status (new, assigned, masked, off) update time ticket id (GGUS)

Alarms Masking Automatic Alarms Masking: Simple rule based correlation engine If there is one or more alarms with status='new' for this VO, node and test => new alarm triggered as masked. Rules defining test relationships among alarms: (Not restricted right now. Restricting it from now on) http://lcg-sam.cern.ch:8080/alarms/mask_alarm.xsql

Prioritisation of alarms A prioritisation mechanism for the alarms is set up according to a scoring schema. Depending on the service a certain amount of “points” are associated to an alarm according to its relevance (i.e. its responsibility in causing other services failure) As an example LFC has a larger score (40000) compared with SE one (10000) since if LFC is failing SE will fail consequently

Example of scoring alarms Example of scoring mechanism depending on the service: 40.000 points: VOBOX, BDII, VOMS, LFC, WMS, RB. 30.000 points: SRM, MyProxy, FTS. 20.000 points: RGMA, sBDII. 10.000 points: gCE, CE, SE.

Alarms responsible for other alarms If an alarm masks another one (so the alarm is "important" as it causes other alarms): 1000 points are added to the alarm weight to show that it's causing other failures as well, so should be dealt with a high priority. up to a maximum of 9.000 points.

Prioritisation of alarms (cont.) Depending on the test status: 100 points if ‘INFO’ 200 points if ‘NOTE’ 300 points if ‘WARN’ 400 points if ‘ERROR’ 500 points if ‘CRIT’ Depending on n° of CPUs in the site: Value taken from the 'CE-totalcpu' test divided by 100. This gives a [0-50] number.

Happy End... Thanks!!!