SAM Database and relation with GridView Piotr Nyczyk SAM Review CERN, 2007.

Presentation transcript:

SAM Database and relation with GridView Piotr Nyczyk SAM Review CERN, 2007

SAM DB and relations with GridView, SAM Review, CERN, 21 May 2007
Architecture
[Figure: architecture diagram; labels: GridView, SAM]

Database layout
[Figure: database layout diagram; labels: BDII, GOC DB, Sensors, summarisation, Portal/Programmatic interface, GridView portal]

DB Business logic (SAM schema)
- Processing and merging of GOC DB data
- Service discovery (BDII2Oracle)
- Alarm triggering and masking
- On-line status calculation (views)
- Availability metrics calculation (moved to GridView)

GOCDB data processing
- PL/SQL code running inside Oracle (Oracle job, once per hour)
- Analyses GOC DB data structures (as replicated by GridView)
- Builds a new normalised representation: Nodes, ServiceInstance, ServiceVO, etc.
- Uses a somewhat heuristic approach:
  - tries to map the GOCDB nodetype to a Service
  - ambiguity in Node definition (multiple entries)
- Administrative topology (Sites, Countries, Regions) taken as provided (views)

Service discovery
- BDII2Oracle: external program run once per hour
- Discovers Sites, Nodes, ServiceInstances and supported VOs
- Additional services are read from a static file updated through HTTP (HEP VOs requirement)
- Updates the DB, merging new information with reprocessed GOCDB content (set union)
- In case of conflicts, GOCDB takes priority
- Ageing of nodes and service instances is based on a "last seen" timestamp (also from GOCDB)
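The merge and ageing steps on this slide can be sketched as follows. This is a minimal illustration only, not BDII2Oracle's actual code: the record layout (a dict keyed by node and service type), the function names and the `max_age` threshold are all assumptions, since the slides do not give them.

```python
from datetime import datetime, timedelta

def merge_sources(gocdb, bdii):
    """Set union of both sources, keyed by (node, service type).

    When the same key is present in both, the GOCDB record wins.
    """
    merged = dict(bdii)
    merged.update(gocdb)  # GOCDB entries override BDII entries
    return merged

def drop_aged(records, now, max_age):
    """Keep only records whose "last_seen" timestamp is recent enough.

    max_age is a hypothetical threshold; the slides do not state one.
    """
    return {key: rec for key, rec in records.items()
            if now - rec["last_seen"] <= max_age}

if __name__ == "__main__":
    now = datetime(2007, 5, 21, 12, 0)
    gocdb = {("ce01.example.org", "CE"): {"site": "SITE-A", "last_seen": now}}
    bdii = {
        ("ce01.example.org", "CE"): {"site": "SITE-B", "last_seen": now},
        ("se01.example.org", "SE"): {"site": "SITE-A",
                                     "last_seen": now - timedelta(days=10)},
    }
    merged = merge_sources(gocdb, bdii)
    print(drop_aged(merged, now, timedelta(days=7)))
```

A plain dict update is enough here because the conflict rule is a fixed source priority rather than a field-by-field comparison.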

Service discovery (cont.)
When does a service become monitored?
- the site appears in GOCDB or in BDII (or in both)
- the node appears in GOCDB or in BDII (or in both)
- the monitoring flag for the site in GOCDB is ON, or the site is not registered in GOCDB at all
- the monitoring flag for the node in GOCDB is ON, or the node is not registered in GOCDB at all
- the same logic applies to scheduled downtime
How to disable monitoring of a service?
- both the site and the node have to be registered in GOCDB
- the monitoring flag is switched off at either the site or the node level
- the same logic applies to scheduled downtime (but scheduled downtime does not switch monitoring off)
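The "becomes monitored" conditions can be read as a conjunction, which the sketch below encodes. That reading is an assumption, as are the function names and the use of `None` to mean "not registered in GOCDB". Scheduled downtime is not modelled.

```python
from typing import Optional

def flag_allows(gocdb_flag: Optional[bool]) -> bool:
    """True if the GOCDB monitoring flag does not block monitoring.

    None means the entity is not registered in GOCDB at all,
    in which case it is monitored by default.
    """
    return gocdb_flag is None or gocdb_flag

def is_monitored(site_seen: bool, node_seen: bool,
                 site_flag: Optional[bool],
                 node_flag: Optional[bool]) -> bool:
    """site_seen / node_seen: the entity appears in GOCDB or BDII (or both)."""
    return (site_seen and node_seen
            and flag_allows(site_flag) and flag_allows(node_flag))

if __name__ == "__main__":
    # Site and node known only from BDII: monitored by default.
    print(is_monitored(True, True, None, None))
    # Both registered in GOCDB, node flag switched off: not monitored.
    print(is_monitored(True, True, True, False))
```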

Alarm triggering
Procedure to trigger an alarm:
- The test result is ERROR or CRIT
- The node belongs to a certified site
- The VO is 'OPS'
- The test is critical for the OPS VO
- No alarm exists already for that test, VO and node
- The node is not in maintenance
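Taken together, the triggering conditions amount to a single boolean check. The sketch below assumes all six conditions must hold at once; the function and parameter names are hypothetical and do not come from the SAM schema.

```python
def should_trigger_alarm(result: str,
                         site_certified: bool,
                         vo: str,
                         test_critical_for_ops: bool,
                         alarm_exists: bool,
                         node_in_maintenance: bool) -> bool:
    """Return True only when every triggering condition is satisfied."""
    return (result in ("ERROR", "CRIT")
            and site_certified
            and vo == "OPS"
            and test_critical_for_ops
            and not alarm_exists
            and not node_in_maintenance)

if __name__ == "__main__":
    print(should_trigger_alarm("CRIT", True, "OPS", True, False, False))
    print(should_trigger_alarm("WARN", True, "OPS", True, False, False))
```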

Alarms Info
Data stored for each alarm:
- alarmid
- vo
- test
- node
- test exec time
- alarm status (new, assigned, masked, off)
- update time
- ticket id (GGUS)
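For illustration, the alarm fields listed on this slide could be modelled as a small record type. The field types and Python names are assumptions; the slide lists only the field names and the four status values.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class AlarmStatus(Enum):
    NEW = "new"
    ASSIGNED = "assigned"
    MASKED = "masked"
    OFF = "off"

@dataclass
class Alarm:
    alarm_id: int
    vo: str
    test: str
    node: str
    test_exec_time: datetime
    status: AlarmStatus
    update_time: datetime
    ticket_id: Optional[str] = None  # GGUS ticket, if one has been opened

if __name__ == "__main__":
    t = datetime(2007, 5, 21, 10, 0)
    # The test name and host below are placeholders, not real SAM data.
    print(Alarm(1, "OPS", "example-test", "ce01.example.org",
                t, AlarmStatus.NEW, t))
```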

Alarms Masking
Automatic alarm masking:
- Simple rule-based correlation engine
- If there are one or more alarms with status='new' for this VO, node and test, the new alarm is triggered as masked
[Figure: rules defining test relationships among alarms; not reproduced in the transcript]
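The masking rule as literally stated can be sketched as below. The rules that relate different tests to each other are not included in the transcript, so this sketch matches on the same VO, node and test only. The dict layout and function name are assumptions.

```python
def initial_status(new_alarm: dict, existing_alarms: list) -> str:
    """Decide the status a freshly triggered alarm starts with.

    It starts as "masked" if an alarm with status "new" already exists
    for the same VO, node and test; otherwise it starts as "new".
    """
    key = (new_alarm["vo"], new_alarm["node"], new_alarm["test"])
    for alarm in existing_alarms:
        if (alarm["status"] == "new"
                and (alarm["vo"], alarm["node"], alarm["test"]) == key):
            return "masked"
    return "new"

if __name__ == "__main__":
    existing = [{"vo": "OPS", "node": "ce01.example.org",
                 "test": "example-test", "status": "new"}]
    incoming = {"vo": "OPS", "node": "ce01.example.org",
                "test": "example-test"}
    print(initial_status(incoming, existing))
```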

Prioritisation of alarms
Depending on the service:
- [value missing] points: VOBOX, BDII, VOMS, LFC, WMS, RB
- [value missing] points: SRM, MyProxy, FTS
- [value missing] points: RGMA, sBDII
- [value missing] points: gCE, CE, SE
Depending on the number of alarms getting masked:
- [value missing] points per alarm masked by the new alarm
- up to a maximum of [value missing] points

Prioritisation of alarms (cont.)
Depending on the test status:
- 100 points if 'INFO'
- 200 points if 'NOTE'
- 300 points if 'WARN'
- 400 points if 'ERROR'
- 500 points if 'CRIT'
Depending on the number of CPUs in the site:
- Value taken from the 'CE-totalcpu' test, divided by 100. This gives a number in the range [0, 50].
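As a worked example, the two components whose values survive in the transcript (test status and site size) can be computed as below. Two things here are assumptions: that the components are simply added together, and that the CPU component is capped at 50 rather than being bounded naturally. The service and masking components are left out because their point values are missing.

```python
STATUS_POINTS = {
    "INFO": 100,
    "NOTE": 200,
    "WARN": 300,
    "ERROR": 400,
    "CRIT": 500,
}

def status_points(status: str) -> int:
    """Points contributed by the test status."""
    return STATUS_POINTS[status]

def cpu_points(total_cpus: int) -> float:
    """'CE-totalcpu' value divided by 100, capped at 50 (cap is assumed)."""
    return min(total_cpus / 100, 50)

def partial_priority(status: str, total_cpus: int) -> float:
    """Sum of the two known components only (summation is assumed)."""
    return status_points(status) + cpu_points(total_cpus)

if __name__ == "__main__":
    # A CRIT result at a site with 1200 CPUs: 500 + 12 points.
    print(partial_priority("CRIT", 1200))
```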

Open issues
- Irresolvable conflicts in GOCDB (2.0)
  - migrate to GOCDB3: short term (1 month)
  - redesign SAM/GV data schema: long term (6 months)
- BDII2Oracle needs refactoring (modules) and improvements in data processing
  - basic modularisation (Input/Output): short term (3 months)
  - new processing model (synchronised with redesign of data schema): long term (>6 months)
- Differentiation of test criticality level needed
  - Alarm level, for example WARN for host certificate
  - Unavailability level
  - timeline: ~3 months