Published by Shon Victor Fields. Modified over 8 years ago.
1
Update on Service Availability Monitoring (SAM)
EGI-InSPIRE RI-261323, www.egi.eu
Marian Babik, David Collados, Wojciech Lapka, Pedro Andrade, Paloma Fuente, Jacobo Tarragon (CERN)
Emir Imamagic (SRCE)
Christos Triantafyllidis (AUTH)
2
Overview
SAM overview / SAM architecture
Description and recent changes for all components
Operations and support
Messaging
3
SAM Scope
SAM grid monitoring (SAM-Gridmon)
  – Central services (Web, API, availability)
SAM-Nagios
  – Monitoring platform supporting multiple configurations:
      NGI-Nagios
      VO-Nagios [1]
      Site-Nagios
      Operations Tools-Nagios (ops-monitor)
[1] Initial guide by Gonçalo Borges (NGI_IBERGRID)
4
SAM Architecture
5
Topology aggregation (ATP)
Now the primary source of all external information
  – Synchronization of GOCDB service types
  – Support for operational tools
  – Provides contacts and user details (secured)
Improved sanity checking and logging
6
Profile Management
Defines metrics and groups them into profiles
Distributed system
  – based on the concept of namespaces
  – central (ch.cern.sam-ROC*, GLEXEC)
  – local (hr.srce.sam-MY_ROC)
  – synchronization from multiple namespaces
Based on the Django framework
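The namespace concept can be illustrated with a small sketch. It assumes, based only on the names shown on the slides, that a qualified profile name is a namespace and a profile name joined by the first hyphen, so that ch.cern.sam-ROC reads as namespace ch.cern.sam and profile ROC. The function names are hypothetical and are not part of SAM or POEM.

```python
from collections import defaultdict


def split_profile(qualified_name):
    """Split 'namespace-PROFILE' at the first hyphen (assumed convention)."""
    namespace, sep, profile = qualified_name.partition("-")
    if not sep or not namespace or not profile:
        raise ValueError(f"not a namespaced profile: {qualified_name!r}")
    return namespace, profile


def group_by_namespace(qualified_names):
    """Group profile names by namespace, e.g. central versus local."""
    grouped = defaultdict(list)
    for name in qualified_names:
        namespace, profile = split_profile(name)
        grouped[namespace].append(profile)
    return dict(grouped)


# Names taken from the slides; the split into namespace/profile is an assumption.
profiles = ["ch.cern.sam-ROC", "ch.cern.sam-ROC_CRITICAL", "hr.srce.sam-MY_ROC"]
print(group_by_namespace(profiles))
```

Splitting at the first hyphen would misread a namespace that itself contains a hyphen, so a real implementation would need to know the actual naming rule.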
7
Profile Management
8
Profile Management
determines how Nagios is configured
enables support for adding custom probes
enables support for multi-VO instances
enables support for FQANs in profiles
streamlines internal data flows and establishes well-defined APIs
9
Profile Management
Authentication
  – requires a valid certificate or login/password
Authorization
  – group-based for adding, removing, changing profiles
  – DN-based for changing a particular profile
  – superuser role
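The three authorization levels can be sketched as two checks. This is a minimal illustration and not the POEM implementation: the data classes, the group name profile-admins, and the rule that group members may also change any individual profile are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    dn: str                                   # certificate distinguished name
    groups: set = field(default_factory=set)
    is_superuser: bool = False


@dataclass
class Profile:
    name: str
    owner_dns: set = field(default_factory=set)  # DNs allowed to edit this profile


# Hypothetical group name; the slides do not name the actual groups.
PROFILE_ADMIN_GROUP = "profile-admins"


def can_manage_profiles(user):
    """Group-based check for adding, removing or changing profiles."""
    return user.is_superuser or PROFILE_ADMIN_GROUP in user.groups


def can_change_profile(user, profile):
    """DN-based check for changing one particular profile."""
    return can_manage_profiles(user) or user.dn in profile.owner_dns


owner = User(dn="/DC=org/DC=example/CN=Profile Owner")
profile = Profile(name="MY_ROC", owner_dns={owner.dn})
print(can_manage_profiles(owner), can_change_profile(owner, profile))
```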
10
Editing profiles
11
POEM migration of NGI-Nagios
database migrated automatically
no additional settings necessary
synchronizes:
  – ch.cern.sam-ROC, ROC_CRITICAL, ROC_OPERATORS and GLEXEC
local namespace can be enabled
  – POEM_WEB_ENABLE=True
12
POEM migration of VO-Nagios
database migrated automatically
no central profiles
local namespace needs to be configured
  – POEM_WEB_ENABLE=True
  – POEM_NAMESPACE=" "
  – POEM_SYNC_NS_URLS="localhost"
local namespace needs to list all metrics for which Nagios is to be configured
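A quick pre-flight check of these three settings could look like the sketch below. It assumes the settings live in a plain KEY=VALUE text file, which the slides do not confirm, and the function names are hypothetical. The namespace value is left blank on the slide, so the example deliberately reports it as still to be filled in.

```python
REQUIRED_KEYS = ("POEM_WEB_ENABLE", "POEM_NAMESPACE", "POEM_SYNC_NS_URLS")


def parse_settings(text):
    """Parse KEY=VALUE lines, ignoring blanks and '#' comments (assumed format)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_settings(settings):
    """Return the required keys that are absent or blank."""
    return [key for key in REQUIRED_KEYS if not settings.get(key, "").strip()]


example = (
    'POEM_WEB_ENABLE=True\n'
    'POEM_NAMESPACE=" "\n'            # placeholder, exactly as on the slide
    'POEM_SYNC_NS_URLS="localhost"\n'
)
print(missing_settings(parse_settings(example)))  # ['POEM_NAMESPACE']
```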
13
Nagios configuration
New bootstrapping via NCG POEM module:
  – bootstraps services from ATP and metrics from POEM
New synchronization (sam-sync service)
  – reloads all SAM services (NCG, MRS)
New metric configuration
  – replaces Hash.pm (Hash_local.pm)
  – JSON: /etc/ncg-metric-config.conf (/etc/ncg-metric-config.d/*.conf)
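The split between one main JSON file and a directory of drop-in files can be illustrated as follows. This is a sketch under stated assumptions, not NCG's actual loader: it assumes each file holds a JSON object keyed by metric name, that drop-in files are read in sorted order, and that later files override earlier ones. The function name is hypothetical.

```python
import glob
import json
import os


def load_metric_config(main_path, conf_dir):
    """Merge the main JSON file with every *.conf file in the drop-in directory."""
    config = {}
    paths = [main_path] + sorted(glob.glob(os.path.join(conf_dir, "*.conf")))
    for path in paths:
        if not os.path.exists(path):
            continue  # a missing main file is tolerated in this sketch
        with open(path) as handle:
            data = json.load(handle)
        if not isinstance(data, dict):
            raise ValueError(f"{path}: expected a JSON object at top level")
        config.update(data)  # assumed semantics: later files win
    return config
```

With the paths from the slide this would be called as load_metric_config("/etc/ncg-metric-config.conf", "/etc/ncg-metric-config.d"). Whether the real tool merges per metric or per attribute is not stated in the slides.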
14
Adding custom probes
ensure the probe package is already deployed
ensure the metric configuration is available
  – /etc/ncg-metric-config.d/*.conf
then it is enough to add the metric to a profile
for critical profiles, changes need to follow EGI PROC10
15
New Probes
Introduction of DesktopGrid probes
Introduction of the eu.egi.sec probes
ARC probes integrated in ROC profile
Improved UNICORE support
Improved uncertified sites setup for CREAMCE metrics
16
Site-Nagios
will require changes to continue working after SAM-Update 17
will become officially supported as a new nodetype
basic functionality will be to import metric results via messaging
final scope to be decided; your feedback is welcome
17
MyEGI improvements
improved layout and filters
18
History improvements
19
Availability improvements
20
Data transfers
21
Web API statistics
~13k visits a month (705k hits)
~25k hits/day
top hosts:
  – nagios-goegrid.gwdg.de (110k hits)
  – wwwcache4.rl.ac.uk (101k hits)
  – wwwcache5.rl.ac.uk (63k hits)
  – nagiosv01.i3m.upv.es (49k hits)
500 statuses (760 hits, 0.5%)
22
Operations and Support
grid-monitoring, grid-monitoring-preprod
central services upgraded to Oracle 11g
decommissioning of the old SAM
  – March
decommissioning of Gridview
  – pending SAM-Update 17
248 GGUS tickets in the past 12 months
23
Milestones and releases
3 releases (524 tickets) since Sept.
Gridmap myEGI, VO feeds implementation
  – SAM Update 15
Profile management system
  – SAM Update 16-17 (366 tickets)
Support for monitoring of the Operational Tools
24
Messaging: PROD network
A setup of 4 brokers
  – Administratively and geographically separated
      AUTH
      CERN
      SRCE
  – Fully connected mesh
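In a fully connected mesh every broker is linked to every other broker, so n brokers need n(n-1)/2 links. The sketch below enumerates them for four brokers. The broker names are placeholders, since the slides name only the hosting sites and not the hosts.

```python
from itertools import combinations

# Placeholder names; the real broker hostnames are not given in the slides.
brokers = ["broker-1", "broker-2", "broker-3", "broker-4"]


def mesh_links(nodes):
    """Return every unordered pair of nodes, i.e. the links of a full mesh."""
    return list(combinations(nodes, 2))


links = mesh_links(brokers)
print(len(links))  # 6 links for 4 brokers
for a, b in links:
    print(f"{a} <-> {b}")
```

Each broker ends up with three peers, so losing any single broker still leaves the remaining three fully connected.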
25
Messaging: Recent updates
November 2011
  – Major update
  – ActiveMQ 5.3 to ActiveMQ 5.5
  – All brokers at once (service outage)
  – Authenticated connections with wrong credentials dropped
      Side effect (not advertised): APEL results to SAM were affected
January 2012
  – Minor update
  – ActiveMQ 5.5.1-fuse-01-06 to ActiveMQ 5.5.1-fuse-01-20
  – Two brokers each time (service was available during update)
  – Camel support dropped (last single point of failure)
  – Idle connections dropped after 1 hour
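The difference between the two updates comes down to batching: upgrading all brokers at once causes an outage, while upgrading two of four at a time keeps the service available. A minimal sketch of that planning step, with placeholder broker names and a hypothetical function name, might look like this:

```python
def rolling_batches(brokers, batch_size):
    """Split brokers into update batches that never take all of them down together."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if batch_size >= len(brokers):
        raise ValueError("batch would take every broker down at once")
    return [brokers[i:i + batch_size] for i in range(0, len(brokers), batch_size)]


# Placeholder names; the slides do not list the broker hosts.
brokers = ["broker-1", "broker-2", "broker-3", "broker-4"]
for step, batch in enumerate(rolling_batches(brokers, 2), start=1):
    remaining = [b for b in brokers if b not in batch]
    print(f"step {step}: update {batch}, still serving {remaining}")
```

Asking for a batch of four is rejected here, which mirrors the all-at-once update that caused the November outage.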
26
Messaging: What's next
A new update from FUSE
Best practice code repository
  – Broker discovery
  – Sending messages
  – Receiving messages
First round of authentication campaign
  – SAM boxes
  – Operations portal
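As an illustration of what client-side best practice for a multi-broker network might involve, the sketch below tries the known brokers in random order and returns the first one that accepts a connection. It is library-agnostic and is not taken from the repository mentioned above: the broker list is passed in rather than discovered, and connect is a caller-supplied function (for example a wrapper around a STOMP client) that is assumed to raise OSError on failure.

```python
import random


def connect_to_any(brokers, connect, rng=random):
    """Try brokers in random order; return (broker, connection) for the first success."""
    candidates = list(brokers)
    rng.shuffle(candidates)  # spread clients across brokers
    errors = {}
    for broker in candidates:
        try:
            return broker, connect(broker)
        except OSError as exc:  # assumed failure type for the connect callable
            errors[broker] = exc
    raise ConnectionError(f"no broker reachable: {sorted(errors)}")
```

Because idle connections are dropped by the brokers, a long-lived client would call the same function again whenever its connection is lost.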
27
Summary
SAM/Nagios and SAM/Gridmon working stably
Substantial improvements in MyEGI, profile management, NCG configuration
Continuous support and bugfixing
Future plans (MS708, EGI milestones)
28
SAM sanity checking
sanity checking provided by internal probes:
  – atp-sync: reports topological errors
  – poem-sync: reports profile errors
  – ncg-sync: reports Nagios configuration failures
  – sam-sync: reports SAM reloading errors
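Probes of this kind report through the standard Nagios plugin convention: one status line plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below shows that convention only. It is not the code of the actual SAM probes, and the function name and the shape of its input are assumptions.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_sync(name, errors):
    """Map the outcome of a synchronization run to (exit_code, status_line).

    errors is a list of error strings from the last run, or None when no
    result is available (hypothetical input format for this sketch).
    """
    if errors is None:
        return UNKNOWN, f"{name} UNKNOWN - no synchronization result available"
    if errors:
        return CRITICAL, f"{name} CRITICAL - {len(errors)} error(s): {errors[0]}"
    return OK, f"{name} OK - last synchronization succeeded"


code, line = check_sync("poem-sync", ["profile ROC could not be fetched"])
print(code, line)
```

A real probe would print the status line and pass the code to sys.exit so that Nagios can pick it up.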