Validation of SAM3 monitoring data (availability & reliability of services) Ivan Dzhunov, Pablo Saiz (CERN), Elena Tikhonenko (JINR, Dubna) April 11, 2014.

Introduction
The new Site Usability Monitor SAM3, which derives availability from test results, is under development. The task is to validate its Availability/Reliability calculation; for that it is reasonable to compare monitoring data from the old (SUM) and new (SAM3) systems.
SUM:
SAM3:

Availability/reliability algorithms
- Definitions taken from the current SAM:
  - Availability = Up period / (Up period + Down period + SD period)
  - Reliability = Up period / (Up period + Down period)
  - Up period: OK or WARNING when no SD
  - Down period: CRITICAL when no SD
  - SD period: scheduled downtime with severity "outage"
- The A/R calculation is implemented in WLCG-MON (from where the SAM3 UI takes its data):
  - works at site/service level
  - needs a site/service SD metric
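The formulas above can be sketched as follows. This is a minimal illustration of the A/R definitions only; the period lengths in hours are hypothetical inputs, not SAM3's actual data model.

```python
# Sketch of the SAM A/R formulas; inputs are period lengths in hours.

def availability(up_h, down_h, sd_h):
    """Availability = Up / (Up + Down + SD); SD = scheduled downtime (outage)."""
    total = up_h + down_h + sd_h
    return up_h / total if total else None

def reliability(up_h, down_h):
    """Reliability = Up / (Up + Down); scheduled downtime is excluded."""
    total = up_h + down_h
    return up_h / total if total else None

# Example: 20h OK/WARNING, 2h CRITICAL, 2h scheduled downtime in one day
print(availability(20, 2, 2))  # 20/24
print(reliability(20, 2))      # 20/22
```

Note how scheduled downtime lowers availability but not reliability, which is exactly why the two numbers diverge for sites with long maintenance windows.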

Validation procedure
SUM and SAM3 offer the same interface and process the same messages. Comparison of monitoring data for different services:
- Done for ALICE, CMS, ATLAS and LHCb (SRMv2, OSG-SRMv2 and CREAM-CE); results can be obtained for time periods (for example, one week) specified when running the validation procedure, which is implemented as a number of shell scripts and a C++ program
- Data are considered not to coincide if the difference is more than 4%
- Investigation of the comparison results
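The actual procedure is implemented as shell scripts and a C++ program; the 4% criterion itself can be sketched as below, assuming the difference is measured in percentage points of the A/R value (the exact metric of "difference" is an assumption here).

```python
# Hypothetical sketch of the comparison criterion: a SUM/SAM3 record pair
# is flagged as "not coinciding" when the A/R values differ by more than
# the threshold (in percentage points).

def differs(sum_value, sam3_value, threshold=4.0):
    """Return True when the two A/R values disagree beyond the threshold."""
    return abs(sum_value - sam3_value) > threshold

print(differs(95.0, 92.0))  # False: 3 points apart, within tolerance
print(differs(95.0, 90.0))  # True: 5 points apart, flagged for investigation
```

Flagged records are then investigated by hand, as described in the following slide.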

Results of comparison
The reasons for the differences found:
- Downtime: the new system calculates A/R numbers correctly; in case of SD, SUM doesn't return A/R
- Different metric validity in the two systems (2h in SAM3, 24h in SUM): in SUM we had CRITICAL status for 5-6 hours, until an OK test arrived; in SAM3 we have 2 hours of CRITICAL status, then 3-4 hours of no results. This led to different A/R numbers
- In SAM3 an UNKNOWN status invalidated a CRITICAL status, while in the old system CRITICAL is the heaviest when the AND operation is performed; now fixed by Ivan
- It was also observed that SAM3 didn't handle SRM services that belong to more than one site; now fixed by Ivan
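The fixed AND combination can be sketched as a maximum over a severity ordering in which CRITICAL is heaviest and can no longer be overridden by UNKNOWN. The exact rank of UNKNOWN relative to WARNING is an assumption for illustration, not the actual SAM3 code.

```python
# Sketch of the fixed AND operation over test statuses: the combined
# status is the "heaviest" one, with CRITICAL ranked above UNKNOWN.
SEVERITY = {"OK": 0, "WARNING": 1, "UNKNOWN": 2, "CRITICAL": 3}

def and_combine(statuses):
    """Combine statuses by taking the most severe one."""
    return max(statuses, key=SEVERITY.__getitem__)

# Before the fix, an UNKNOWN could invalidate a CRITICAL; now it cannot:
print(and_combine(["OK", "UNKNOWN", "CRITICAL"]))  # CRITICAL
print(and_combine(["OK", "WARNING"]))              # WARNING
```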

Overview of comparison for one week ( )
Columns: name of service; number of services; number of records for comparison; Availability: total number of differences, due to downtime, due to different validity; Reliability: total number of differences
CMS SRMv2 (4.5%) (2.2%)
ATLAS SRMv2 (3.4%) 18813 (1.7%)
ATLAS OSG-SRMv2 (8%) 263 (3%)
LHCb SRMv2
CMS CREAM-CE (20%) (19%)

SD example (CMS): SRMv2 service availability: T2_BE_UCL, ingrid-se02.cism.ucl.ac.be
- downtime from 30-Mar-14 20:00:00 to 04-Apr
- Availability = -2 (maintenance)
- Availability =

"Validity" example (ALICE): CREAM-CE service availability: CERN, ce408.cern.ch
Unknown 0.85 Unknown Unknown

SRMv2 service reliability: T2_UA_KIPT, cms-se0.kipt.kharkov.ua
- downtime 11-Mar-14 18:01:00 to 14-Mar-14 18:00:00
- 14-Mar-14 18:01:00 to 19-Mar-14 07:45:34: only 1 minute "OK"!
Reliability = Up period / (Up period + Down period)
Up period: OK or WARNING when no SD
Down period: CRITICAL when no SD

TO DO
- Compare site A/R
- Complement the comparison of data for the CMS CREAM-CE services with information on what percentage of time, for each day, the state of the service was known, i.e. sum for each day the time spent in the "Critical", "OK" and "Warning" states and divide it by the length of the day
- 2 weeks ----
- Decouple tests from SUM
- Add SUM group filtering to SAM3
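The proposed per-day coverage metric can be sketched as below. This is a hypothetical illustration of the computation described in the second TO DO item; the state names and the hours-per-state input format are assumptions.

```python
# Sketch of the proposed per-day coverage metric for the CMS CREAM-CE
# comparison: the fraction of the day with a known service state.
DAY_H = 24.0
KNOWN_STATES = ("OK", "Warning", "Critical")

def daily_coverage(state_hours):
    """state_hours: mapping of state name -> hours spent in it that day."""
    known = sum(h for state, h in state_hours.items() if state in KNOWN_STATES)
    return known / DAY_H

# Example day: 18h OK, 2h Critical, 4h with no information
print(daily_coverage({"OK": 18, "Critical": 2, "Unknown": 4}))  # 20/24
```

A low coverage fraction would indicate that differences for that day are dominated by missing results rather than by genuinely disagreeing A/R values.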

Conclusions
3 issues solved:
1) UNKNOWN treated as heavier than CRITICAL in the AND operation
2) SRM services that belong to more than one site
3) calculation of flavour availability (ex. CREAM-CE * & SRMv2)
Now 2 reasons for differences remain:
- downtime: SAM3 calculates A/R numbers correctly during downtime, while SUM doesn't return A/R
- other differences: only those caused by different metric validity