1 Grid Service Monitoring James Casey, CERN IT-GD WLCG/OSG Operations Meeting 14th June 2007.

Presentation transcript:

1 Grid Service Monitoring James Casey, CERN IT-GD WLCG/OSG Operations Meeting 14th June 2007

2 WLCG Monitoring Working Groups
- 3 groups proposed by Ian Bird to the LCG Management Board, Oct 06
- Goal: to improve the reliability of the WLCG grid
- System Management: Fabric management, Best Practices, Security, …
- Grid Services: Grid sensors, Transport, Repositories, Views, …
- System Analysis: Application monitoring, …

3 Grid Services Monitoring WG Mandate “….to help improve the reliability of the grid infrastructure….” “…. provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service. …” “… stakeholder are site administrators, grid service managers and operations, VOs, Grid Project management” WGMandate

4 Monitoring Grid Monitoring Control Presentation measurement instrumentation - active, passive, collection intervals, alarms appropriate metrics - directly relevant to user experience - clearly defined and understood manual decision making Sensors/Agents  Transport  Repositories automated decision making real-time  historical accuracy and credibility data collection points - system element  service Views  You can’t manage what you don’t measure... Slide by Max Böhm, EDS

5 WLCG Grid Monitoring Landscape
[Diagram: monitoring domains, the tools in use, and the 3 WLCG Monitoring Working Groups, drawn over local resources, site services, central services, Grid Middleware and Grid Applications]
- Local monitoring: Lemon/SLS, Nagios, Ganglia, ...
- Grid Services monitoring: GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
- Application monitoring: Experiment Dashboards, ...
Slide by Max Böhm, EDS

6 Aims of grid services WG
- Not to provide yet another technical solution
But,
- Improve reliability of WLCG
- Consolidate existing solutions
  - Improve communication
  - Reduce overlap
  - Increase sharing

7 How?
- Engage with stakeholders
  - Operations meetings
  - WLCG Workshops
  - Questionnaires to site managers
  - Grid Service providers (EGEE, OSG)
  - Grid Middleware providers (gLite, VDT)
  - Monitoring software providers (SAM, GridIce, MonAmi, GridView, LEMON, Nagios, …)
  - External experts (openlab EDS collaboration)
  - Other Working Groups

8 Tasks of grid services WG
- Collect descriptions of current grid services
  - So that probes can be written
  - Input from developers, deployment team, site admins
- Improve quality by providing technical guidance
  - Documenting best practices
  - Providing example components

9 Tasks of grid services WG
- Best practice notes
  - How to manage grid proxies for monitoring
  - Message-level security for monitoring
  - What information can/should be passed through the site boundary
  - …
- Create a set of ‘standard’ WLCG probes
  - And how to calculate availability based on the metrics produced
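The slides do not define how availability is calculated from metrics. As a rough sketch only, assume a service counts as "up" at a sample time when every metric reported at that time is OK, and that availability is the fraction of sample times at which it was up. Both rules, and the function name `availability`, are assumptions made for illustration, not the working group's definition.

```python
from collections import defaultdict


def availability(samples):
    """Return the fraction of sample times at which the service was up.

    samples: iterable of (timestamp, metric_name, status) tuples.
    Assumption: the service is up at a timestamp only if every metric
    reported at that timestamp has status "OK".
    """
    by_time = defaultdict(list)
    for timestamp, _metric_name, status in samples:
        by_time[timestamp].append(status)
    if not by_time:
        return None  # no data, so availability is undefined
    up = sum(
        1 for statuses in by_time.values()
        if all(status == "OK" for status in statuses)
    )
    return up / len(by_time)
```

With two sample times, one where every metric is OK and one where a single metric is CRITICAL, this returns 0.5. A real calculation would also have to decide how to treat missing samples and scheduled downtime, which this sketch ignores.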

10 Direction
- Focus on the interaction points between the different systems
  - Allow for diversity across different grid infrastructures
- “Specifications, not Standards”
  - Timescales mean we can’t get involved in long and heavyweight standards activities
  - Take best practices from existing systems, and document them
- Implement simple prototypes
  - And mature the bits that work!
- Get something out to the stakeholders
  - Close feedback loop is the key to adoption
  - Plan for a “standards based” solution in the future

11 High-level Model
See WLCG_Monitoring_for_Managers.pdf for details: https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702-WLCG_Monitoring_for_Managers.pdf

12 Clearing up some terminology
- Metric: a data value gathered that tells us something about a service
- Probe: the actual code which gathers the metric/metrics
- Check & Sensor: a ‘probe’ in Nagios and LEMON respectively

13 Locality of Probes
- ‘local’ can mean two things ;(
- ‘local’ and ‘remote’ with respect to probing the interface of the service
  - local means on the site
  - remote means external to the site
- (host-)local probes
  - Gathering information from the operating system level
  - Traditional fabric management probes

14 3 Categories of Probe Locality
- So we have
  - host-local: is the daemon running, how much free space is there in /tmp?
  - local: can I probe the service interface from inside the site?
  - remote: can I probe the service interface from outside the site?
- All give useful views of the service
  - And I believe all are required!

15 Service Descriptions
- We needed to understand what to monitor
  - Asked JRA1 developers to fill in a questionnaire
  - Then ran it past site admins to add their experience
- This should form the basis of the grid fabric monitoring
erviceMonitoringDescriptions

16 Node types gathered

17 Where should this go?
- This information needs to be curated in the long term
  - EGEE JRA1, SA1 and SA3 involvement is crucial
  - Along with providers of ‘externals’
- Proposal: simple web-based structured repository
  - With database backend
  - Can generate fabric monitoring configuration for a release directly from the information

18 Specifications
- Probe Specification
  - Defines how a fabric monitoring system can interact with probes that test grid services
  - Simple text-based protocol (lightweight)
  - Decouples grid probes from the specifics of the fabric monitoring system
  - Allows for currently existing probes to be re-used by any monitoring system
    - SAM Tests
    - EGEE CE ROC Nagios testing
    - OSG Tests

19 Example of Grid Probe
    $ ./LFC-probe -u lfc://lfc101.cern.ch/ -m ch.cern.LFC-Write -v dteam
    serviceType: glite-LFC
    gatheredAt: lxadm01.cern.ch
    metricStatus: OK
    timestamp: T15:01:39.86Z
    voName: dteam
    summaryData: OK
    serviceURI: lfc://lfc101.cern.ch/
    metricName: ch.cern.LFC-Write
    EOT
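To illustrate why a simple text-based protocol is easy for any fabric monitoring system to consume, here is a minimal parsing sketch. It assumes, from the example alone, that the output is one `key: value` pair per line terminated by a line reading `EOT`; the full probe specification may allow more than this. The function name `parse_probe_output` is hypothetical, and the sample text is copied from the slide, including its truncated timestamp.

```python
SAMPLE_OUTPUT = """\
serviceType: glite-LFC
gatheredAt: lxadm01.cern.ch
metricStatus: OK
timestamp: T15:01:39.86Z
voName: dteam
summaryData: OK
serviceURI: lfc://lfc101.cern.ch/
metricName: ch.cern.LFC-Write
EOT
"""


def parse_probe_output(text):
    """Parse 'key: value' lines up to the EOT marker into a dict."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "EOT":
            break  # end of this metric record
        if not line:
            continue
        # Split on the first colon only, so values such as URIs and
        # timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        record[key.strip()] = value.strip()
    return record
```

Calling `parse_probe_output(SAMPLE_OUTPUT)` yields a dictionary with the eight fields shown on the slide, which a wrapper for Nagios, LEMON or another system could then map onto its own result format.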

20 Grid Data Exchange Format
- Query interface for repositories to provide stored information to clients
- HTTP message based
  - Query parameters encoded in URL
  - Response is single XML message
- Based on SAM Query format
  - Either current status or history
  - Structured data is returned
  - E.g. all metrics gathered for a site, a VO, …

21 Example of exchange format Production Certified ok T00:00:00Z true false …
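The XML markup of the example did not survive in this transcript, so the following is only a sketch of the general pattern described on the previous slide: query parameters encoded in a URL, and a single XML message parsed on return. The endpoint, the parameter names (`site`, `vo`, `mode`), and the XML element and attribute names are all hypothetical and are not the SAM query format.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://repository.example.org/query"  # hypothetical endpoint


def build_query(site, vo, history=False):
    """Encode the query parameters in the URL (parameter names are made up)."""
    params = {
        "site": site,
        "vo": vo,
        "mode": "history" if history else "status",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Invented response for illustration; the real schema is not shown in the slides.
SAMPLE_RESPONSE = """<results>
  <metric name="ch.cern.LFC-Write" status="ok" timestamp="2007-01-01T00:00:00Z"/>
</results>"""


def parse_response(xml_text):
    """Return one dict of attributes per <metric> element."""
    root = ET.fromstring(xml_text)
    return [dict(metric.attrib) for metric in root.iter("metric")]
```

A client would fetch the URL from `build_query("EXAMPLE-SITE", "dteam")` over HTTP and pass the body to `parse_response`; the network call is left out here so that the sketch stays self-contained.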

22 Site monitoring
- We can’t/won’t impose a solution on sites
  - They might/should have something already
- Specification-based approach allows our probes to fit into any fabric monitoring system
- Data Exchange format allows higher-level services to consume the data regardless of fabric monitoring system

23 Component view

24 “The Nagios-based prototype”
- How to make this real?
- Initially implement using one fabric monitoring system: Nagios
  - … but architecture checked with LEMON developers
- Implement some of the site components
  - Configuration Generation
  - Certificate handling
  - Service Status Calculator
- Staged approach for integrating sensors

25 Prototype site implementation

26 Configuration generation
- Nagios-specific configuration generation script
  - Generates configuration files from SAM, BDII, live service checks
  - Generates checks for either ‘remote’ (SAM), or ‘local’ (or ‘both’)
- Verbose mode that dumps the view of your site
  - Next version will allow you to add/remove services
- Produces a single Nagios.cfg file which can be integrated into an existing Nagios configuration
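As an illustration of what such a generator might emit, the sketch below writes Nagios `define service` blocks for a list of hosts. It is not the actual script: the inputs from SAM and BDII are replaced by a plain Python list, the `generic-service` template is assumed to exist in the site's Nagios configuration, and the mapping of ‘remote’ to `gather_sam` and ‘local’ to `check_wlcg` is an assumption based on the Stage I and Stage II descriptions.

```python
def service_block(host, description, command):
    """Render one Nagios service object definition."""
    return (
        "define service {\n"
        "    use                 generic-service\n"  # assumed site template
        f"    host_name           {host}\n"
        f"    service_description {description}\n"
        f"    check_command       {command}\n"
        "}\n"
    )


def generate_config(services, mode="both"):
    """Build config text for (host, service_type) pairs.

    mode is 'remote', 'local' or 'both', mirroring the options on the slide.
    """
    if mode not in ("remote", "local", "both"):
        raise ValueError(f"unknown mode: {mode!r}")
    blocks = []
    for host, service_type in services:
        if mode in ("remote", "both"):
            blocks.append(service_block(host, f"{service_type}-remote", "gather_sam"))
        if mode in ("local", "both"):
            blocks.append(service_block(host, f"{service_type}-local", "check_wlcg"))
    return "\n".join(blocks)
```

For example, `generate_config([("lfc101.cern.ch", "glite-LFC")], mode="remote")` returns a single service definition that could be written to a file and included from an existing Nagios configuration.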

27 Stage I
- Feedback SAM info into site monitoring
- Single Nagios sensor – ‘gather_sam’
  - Connects to SAM Web Service and gets SAM results for all nodes at the site

28 Stage II
- Run probes locally as well
  - Allows for local verification of site availability
- Another Nagios sensor – ‘check_wlcg’
  - Heavily dependent on configuration
  - What probes are needed for what service?
- Start with a probe set consisting of:
  - Sample LFC probe
  - EGEE CE ROC probes (Emir Imamagic)
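A sensor that wraps grid probes for Nagios has to translate each probe result into the standard Nagios plugin convention: one line of status text plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below shows that translation only. It is not the `check_wlcg` implementation, and it assumes that probe statuses use the same names as Nagios states; the slides show only `OK`.

```python
# Standard Nagios plugin exit codes.
NAGIOS_EXIT = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def to_nagios(metric_status, summary):
    """Map a probe's metricStatus and summaryData to (exit_code, output_line).

    Any status not recognised is reported as UNKNOWN rather than guessed.
    """
    label = metric_status.upper()
    if label not in NAGIOS_EXIT:
        label = "UNKNOWN"
    return NAGIOS_EXIT[label], f"{label}: {summary}"
```

A wrapper script would print the returned line and exit with the returned code, for instance via `sys.exit(code)`, so that Nagios displays the probe's summary next to the service state.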

29 Stage III
- Run local fabric tests
  - gLite developers have described how to monitor logfiles, daemons etc. on the service nodes
- Integrate this information into the fabric monitoring
  - Where possible use existing sensors

30 Current status
- Prototype tested against
  - CERN PPS
  - egee.srce.hr site
- Installation and configuration instructions exist
- Packaging done
- Stages I and II are ready or in testing

31 Nagios display

32 Prototype delivery timescale
- Stage I – ‘gather_sam’
  - Operations Workshop, mid-June 2007
- Stage II – ‘check_wlcg’
  - End mid-July 2007
- Stage III – Local probes
  - CHEP, September 2007
- Expect to rapidly iterate, so perhaps only a few “early adopter” sites in June/July
  - Asking for Volunteers

33 RPM Packages
- grid-monitoring-fm-nagios
  - General nagios tools (including certificate handling for running local probes)
- grid-monitoring-config-gen-nagios
  - Configuration generator
- grid-monitoring-probes-ch.cern
  - Example probe for LFC
- grid-monitoring-probes-hr.srce
  - Full probe set for many services from EGEE CE Region

34 Who should try this?
- Site admins who already use Nagios and want to integrate SAM
  - Simply use ‘remote’ generation
- Site admins who have no monitoring yet and are thinking of trying Nagios
  - Use both ‘local’ and ‘remote’ generation
  - Not for the faint-hearted: you’ll be a very early adopter!
- RPMs will be available, linked from the twiki
- Mailing list set up

35 Futures and other work
- We focus here on the prototype
  - Since this is what we are delivering now
- Also working on
  - Specifications and example components
  - Security architecture
- Future work includes
  - Probe description database
  - Topology database
  - Messaging architecture for transport layer
- Closely involved with SAM team
  - Looking at how to use Nagios as a submission framework for SAM

36 Summary
- Effort invested to understand the current landscape
- Approach for improvement based on specifications of interfaces between components
- Prototype has been developed and tested on a small scale
- Now looking for early adopters to get feedback
  - Who wants to volunteer?

37 Links to tools
- SAM/GridView: Monitoring Portal, TWiki
- SAM (Service Availability Monitor): Test Page, TWiki
- GridICE: Monitoring Portal, Documentation
- Experiment Dashboard: Portal, TWiki
- GridPP Real Time Monitor: Homepage (2D map and 3D globe visualizations)
- GStat: Portal, TWiki
- Lemon: Portal (CERN Compute Center), Documentation
- Nagios: Homepage