EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks New WLCG Grid Service Monitoring Displays.


1 New WLCG Grid Service Monitoring Displays
James Casey, CERN IT-GD
HEPIX, November 2007
EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org
EGEE and gLite are registered trademarks

2 Enabling Grids for E-sciencE EGEE-II INFSO-RI-031688 Nov 8th 2007 / HEPIX / New WLCG Grid Service Monitoring Displays
Overview
Service Monitoring in WLCG
Site Service Monitoring
  – Nagios
Central Monitoring
  – GridMap
Future work

3 WLCG Monitoring Working Groups
3 groups created by Ian Bird, Oct’06
  – “….to help improve the reliability of the grid infrastructure….”
  – “…. provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service. …”
  – “… stakeholders are site administrators, grid service managers and operations, VOs, Grid Project management”
System Management: Fabric management, Best Practices, Security, …
Grid Services: Grid sensors, Transport, Metric Repositories, Views, …
System Analysis: Application monitoring, …

4 Monitoring
You can’t manage what you don’t measure...
[Diagram labels]
Grid; Monitoring; Control; Presentation
measurement
instrumentation - active, passive, collection intervals, alarms
appropriate metrics - directly relevant to user experience - clearly defined and understood
manual decision making
Sensors/Agents -> Transport -> Repositories
automated decision making
real-time -> historical
accuracy and credibility
data collection points - system element -> service
Views
Slide by Max Böhm, EDS

5 WLCG Grid Monitoring Landscape
[Diagram labels: local resources, Grid Middleware, Grid Applications, central services, site services, site]
Domain / Monitoring tools in use
Local monitoring: Lemon/SLS, Nagios, Ganglia, ...
Grid Services monitoring: GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
Application monitoring: Experiment Dashboards, ...
Slide by Max Böhm, EDS

6 Grid Monitoring Landscape View
[Diagram; recoverable labels only]
Layers: Fabric Resources (CPUs, TBs, batch); Site Services; Grid Services; Central Services; App Layer (Experiment/VO, e.g. ATLAS: Ganga/Panda apps, AtlasProdDB)
Services: BDII, CE, SE, RB (Resource Broker), LB, LFC (File Catalog), FTS, Info System, GOCDB (site registry)
Monitoring tools: GStat, GridICE, SAM, GridView, RTM, Experiment Dashboards (one per experiment), LEMON, MonALISA, RGMA, other monitoring tools
Data flows: HTTP/XML pull, HTTP/XML push agents, HTTP/SOAP push, LDAP, DB access, submit test jobs
Views: html site status + graphs; real time 3D job view (job state); VO jobs, data, site reliability; data transfer, job status, service availability; BDII + fabric/job infos; FTS results

7 High-level Model
See https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702-WLCG_Monitoring_for_Managers.pdf for details
[Diagram labels: LEMON, Nagios, SAM, R-GMA, SAME, GridView, Experiment Dashboard, GridICE, HTTP, LDAP, GOCDB, Dashboard, GridMap]

8 Grid Site Monitoring Principles
Provide an easily extensible site monitoring system
  – Or be able to plug grid features into existing site monitoring
Should be able to provide (or augment) alarms at the site for the grid services
Don’t force a solution on the site administrators
  – Should work with any fabric monitoring system that provides basic functionality
Provide the specific plugins to deal with the Grid
  – Probes that work for Grid Services
Enable export of the data from the site into standard grid monitoring systems, e.g. SAM, GridView, GridICE, …
  – Avoid duplicate running of probes

9 Purpose
Bring data from existing monitoring systems into the site monitoring tools
  – Service Availability Monitoring (SAM)
  – Network performance monitoring (NPM)
  – Experiment site blacklists (FCR tool)
  – Experiment dashboards, …
Decided to create a prototype based on Nagios
  – Due to existing take-up of Nagios in the community
Second stage will be to integrate with LEMON
  – As next most common solution
  – Based on questionnaire to community

10 Nagios
Open source monitoring system
Widely used & actively developed
Host and service problem detection and recovery
Provides a set of basic plugins (sensors)
  – easy to develop custom sensors
No components required on monitored entities
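
To show how small a custom sensor can be, here is a minimal sketch of a Nagios-style check in Python. It relies only on the standard Nagios plugin convention: one line of status text plus an exit code of 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN. The check itself (days left before a certificate expires), the name `check_cert_lifetime` and the thresholds are assumptions for illustration, not one of the probes described in this talk.

```python
"""Sketch of a custom Nagios-style sensor (hypothetical example)."""
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_cert_lifetime(days_left, warn_days=14, crit_days=3):
    """Map remaining certificate lifetime to (exit_code, status_line).

    days_left is supplied by the caller; how it is obtained (for example
    by parsing a host certificate) is outside this sketch.
    """
    if days_left is None:
        code, detail = UNKNOWN, "could not determine certificate lifetime"
    elif days_left <= crit_days:
        code, detail = CRITICAL, "certificate expires in %d days" % days_left
    elif days_left <= warn_days:
        code, detail = WARNING, "certificate expires in %d days" % days_left
    else:
        code, detail = OK, "certificate valid for %d more days" % days_left
    return code, "CERT %s - %s" % (LABELS[code], detail)


def main(argv):
    """Hypothetical command line: the number of days is the first argument."""
    try:
        days = int(argv[1])
    except (IndexError, ValueError):
        days = None
    exit_code, line = check_cert_lifetime(days)
    print(line)       # Nagios displays the first line of plugin output
    return exit_code  # Nagios derives the service state from the exit code

# When installed as a plugin, the script would end with: sys.exit(main(sys.argv))
```

Keeping the decision logic in a plain function, separate from argument parsing and process exit, makes the sensor easy to test outside Nagios.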

11 Architecture
[Architecture diagram; recoverable labels]
Components: Monitoring server; Site BDII; Site nodes (CE, SE, LFC, …); MyProxy; Probe descriptions; Site admins
Actions: Refresh proxy; Get VOMS proxy; Service checks; Live node checks; Get site’s & nodes information; Get nodes information; Get remote results; Get Nagios results; Get site status; Issue alarms

12 Grid Extensions
Standard probes
  – provided by SRCE, CERN, OSG
  – Security facilities & services
      - CA distribution, Certificate lifetime, MyProxy
  – Monitoring & information services
      - R-GMA, BDII, MDS, GridICE
  – Job management services
      - Globus Gatekeeper, RB, WMS, WMProxy, Job matching
  – Data management services
      - GridFTP, SRM, DPNS, LFC, FTS
Remote gatherers
  – SAM & NPM
Nagios Config Generator (NCG), Publisher, Credential management
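
To make the role of a configuration generator concrete, the following is a minimal sketch that turns a list of nodes and their grid service types into Nagios service definitions. It is not the actual NCG: the input structure, the `PROBES` mapping, the check command names and the `generic-service` template are hypothetical. Only the `define service { ... }` object syntax and the `use`, `host_name`, `service_description` and `check_command` directives are standard Nagios.

```python
"""Sketch of a Nagios config generator (hypothetical; not the real NCG)."""

# Hypothetical mapping from grid service type to Nagios check commands.
PROBES = {
    "CE": ["check_gatekeeper", "check_gridftp"],
    "SE": ["check_srm", "check_gridftp"],
    "LFC": ["check_lfc"],
    "BDII": ["check_bdii"],
}


def service_block(host, command):
    """Render one Nagios service object definition."""
    return (
        "define service {\n"
        "    use                  generic-service\n"
        "    host_name            %s\n"
        "    service_description  %s\n"
        "    check_command        %s\n"
        "}\n" % (host, command, command)
    )


def generate_config(nodes):
    """nodes: iterable of (hostname, service_type) pairs, e.g. read from a site BDII.

    Unknown service types are skipped rather than guessed.
    """
    blocks = []
    for host, service_type in sorted(nodes):
        for command in PROBES.get(service_type, []):
            blocks.append(service_block(host, command))
    return "\n".join(blocks)


# Hypothetical site description; real input would come from the information system.
example_nodes = [("ce01.example.org", "CE"), ("lfc01.example.org", "LFC")]
example_config = generate_config(example_nodes)
```

Generating the configuration from the information system means a site does not have to maintain host and service lists by hand when nodes are added or removed.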

13 Standard Components
Probe wrapper
  – enables integration of standardized probes
      - One probe can run in Nagios, LEMON, SAM, …
  – Grid Monitoring Probes Specification
  – https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringProbeSpecification
Publisher & remote gatherers
  – integration with other tools
      - Existing tools can just consume the data, e.g. SAM, GridView, Dashboards, …
  – Grid Monitoring Data Exchange Standard
  – https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringDataExchangeStandard
Comments, contributions & probes welcome!
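
The idea that one probe can run under several monitoring systems can be sketched as a neutral result type plus one small adapter per consumer. This is an illustration only: `ProbeResult`, its fields and the adapter functions are hypothetical and do not reproduce the formats defined in the probe specification or the data exchange standard linked above.

```python
"""Sketch of a probe wrapper (hypothetical; not the format of the specification)."""
from collections import namedtuple

# Neutral result produced by a probe, independent of any monitoring system.
ProbeResult = namedtuple("ProbeResult", "service host status summary")

NAGIOS_EXIT = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def run_probe(probe, host):
    """Run a probe callable and turn any exception into an UNKNOWN result."""
    try:
        return probe(host)
    except Exception as exc:
        name = getattr(probe, "__name__", "probe")
        return ProbeResult(name, host, "UNKNOWN", str(exc))


def to_nagios(result):
    """Adapter: (exit_code, status_line) as a Nagios plugin would report."""
    line = "%s %s - %s" % (result.service, result.status, result.summary)
    return NAGIOS_EXIT[result.status], line


def to_record(result):
    """Adapter: plain dictionary that a publisher could forward to other tools."""
    return dict(result._asdict())


def example_probe(host):
    """Hypothetical probe; a real one would contact the service on `host`."""
    return ProbeResult("example-service", host, "OK", "responded")
```

With this split, the probe is written once and each monitoring system only needs its own adapter, which is also what lets a site avoid running the same check twice.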


17 [Screenshot; labels: SAM, Standard probes, NPM]


19 Current Status
Three sets of standard probes integrated
  – SRCE, CERN, OSG
RPMs in apt and yum repository
  – http://www.sysadmin.hep.ac.uk
Installation documentation on twiki
  – https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNagiosInstall
Mailing list for community support of sites
  – wlcg-monitoring-discuss@cern.ch
Will appear in upcoming gLite releases as packaged software
Will be bundled with “follow-up” documentation to help site admins understand what went wrong on probe failure
New (early-access) volunteers welcome!

20 New visualizations for the Grid?
Grid monitoring data is complex!
  – And there are many sites…
Current tools visualize data by sorted tables, bar charts, etc.
Difficult to present an easy-to-understand top-level view which provides
  – quick, action-oriented oversight and insight
  – help in understanding job failures and availability patterns
Can new visualizations help?

21 GridMap Visualization
Idea
  – visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
Example GridMap
[Figure labels: site, regions]
Size of rectangle is e.g.
  - size of site (#CPUs)
  - #running jobs
  - ...

22 GridMap Visualization
Idea
  – visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
Example GridMap
[Figure colour key: ok, degraded, down]
Colour of rectangle is e.g.
  - SAM status of site / service
  - Availability of site / service
  - ...
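
The treemap idea (rectangle size from one metric, colour from another) can be sketched with the simplest treemap layout, slice-and-dice. This is not the GridMap Layout Algorithm, which the slides do not describe. The input structure, the colour names and the site data are hypothetical, and all sizes are assumed to be positive.

```python
"""Sketch of a slice-and-dice treemap layout (not the GridMap layout algorithm)."""

# Hypothetical colour key for the three states named in the colour key.
COLOURS = {"ok": "green", "degraded": "orange", "down": "red"}


def slice_rect(x, y, w, h, sizes, horizontal):
    """Split rectangle (x, y, w, h) into strips with areas proportional to sizes."""
    total = float(sum(sizes))  # assumed to be greater than zero
    rects, offset = [], 0.0
    for size in sizes:
        share = size / total
        if horizontal:
            rects.append((x + offset, y, w * share, h))
            offset += w * share
        else:
            rects.append((x, y + offset, w, h * share))
            offset += h * share
    return rects


def gridmap(regions, width=100.0, height=100.0):
    """regions: {region: [(site, cpus, status), ...]} -> list of drawable cells."""
    names = sorted(regions)
    region_sizes = [sum(cpus for _, cpus, _ in regions[n]) for n in names]
    region_rects = slice_rect(0.0, 0.0, width, height, region_sizes, True)
    cells = []
    for name, (rx, ry, rw, rh) in zip(names, region_rects):
        sites = regions[name]
        site_rects = slice_rect(rx, ry, rw, rh, [cpus for _, cpus, _ in sites], False)
        for (site, cpus, status), rect in zip(sites, site_rects):
            cells.append({"region": name, "site": site, "rect": rect,
                          "colour": COLOURS[status]})
    return cells


# Hypothetical input data: size metric is #CPUs, colour metric is status.
example = {
    "RegionA": [("site1", 300, "ok"), ("site2", 100, "down")],
    "RegionB": [("site3", 100, "degraded")],
}
cells = gridmap(example)
```

Regions are laid out side by side and sites are stacked inside their region, so each cell's area is proportional to its number of CPUs. Swapping the size or colour metric only changes the input, not the layout code.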

23 Multiple Views
GridMaps can be used for top-level, geographical and VO views
[Diagram labels: Large-scale Federated Grid Services Infrastructure; Top-level View; Geographical Views (Federation, Partner, Site, etc.); VO Views (cross-location); Next level of GridMaps; Global GridMap; Application Domain GridMap; Local GridMap; Alert; Corrective action; effect]

24 Trends
Trends can be understood by looking at a sequence of GridMaps
Site Availability over time:
[Figure panels: 25 Sep 2007, 24 Sep 2007, 23 Sep 2007, 22 Sep 2007, 21 Sep 2007, 20 Sep 2007]

25 More Views
Correlations of metrics can be discovered by switching between different views
Site Availability from different VO perspectives:
[Figure panels: LHCb, CMS, Atlas, Alice, OPS]
sites without colour do not support the VO
Status of different Site Services:
[Figure panels: site BDII, SRM, SE, CE, Overall Site]

26 GridMap Prototype Architecture
[Diagram labels: Grid sites; existing monitoring system(s); GridMap Server; Web Browser (Title, view1, view2, view3)]
GridMap View
  - Browser based Web 2.0 type client component
  - single interactive and responsive web page (no page reloads required, data is retrieved in the background)
  - fast switching between views possible
  - details of the site/service statuses are shown as a context sensitive Tooltip
  - POC implementation is based on HTML, lightweight JavaScript libraries, AJAX type communication pattern
GridMap Server
  - provides client side code and client supporting services
  - implements GridMap Layout Algorithm
  - retrieves and caches data from existing monitoring systems
  - POC implementation is based on Apache / Python
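
The "retrieves and caches data" role of the server can be sketched as a small time-based cache in front of a fetch function. This is a guess at the general pattern, not the prototype's code: the class name `CachedSource`, the 300 second lifetime and the injectable clock are assumptions made for illustration.

```python
"""Sketch of server-side caching of monitoring data (hypothetical; not the GridMap server code)."""
import time


class CachedSource:
    """Wrap a fetch function and reuse its result for ttl seconds."""

    def __init__(self, fetch, ttl=300.0, clock=time.time):
        self._fetch = fetch    # callable returning fresh monitoring data
        self._ttl = ttl        # cache lifetime in seconds (assumed value)
        self._clock = clock    # injectable so the cache can be tested
        self._value = None
        self._stamp = None     # time of the last successful fetch

    def get(self):
        """Return cached data, refreshing it once it is older than ttl."""
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = self._fetch()
            self._stamp = now
        return self._value
```

With such a cache, many browsers switching between views cause at most one query per lifetime to each existing monitoring system, which keeps view switching fast without adding load on the data sources.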

27 GridMap Prototype View Component
Link: http://gridmap.cern.ch
[Screenshot annotations]
Metric selection for colour of rectangles
Show SAM status
Show GridView availability data
Grid topology view (grouping)
Metric selection for size of rectangles
VO selection
Overall Site or Site Service selection
Drilldown into region by clicking on the title
Context sensitive information
Colour Key
Description of current view

28 GridMap Prototype: Link to Existing Tools
Clicking on a site opens a page with details in GridView/SAM
[Screenshot labels: Site Detail; Availability; SAM Test Results]

29 Conclusions
To improve reliability we need to:
1. Provide more information to site administrators
  – That relates to what users actually see when using their site
      - A lot of data already gathered, so if possible don’t do it again
  – Need to get it into the fabric monitoring system already used at a site
  – Nagios-based prototype validating the approach
      - Good feedback from early adopters
2. Improve the visualization
  – Too much data - especially for central monitoring (~250 sites)
  – New techniques help to compress information and bring useful information into view
http://gridmap.cern.ch
http://nagios-test.cern.ch/nagios (guest:guest)

