1 Models for Monitoring
James Casey, CERN
WLCG Service Reliability Workshop, 27th November 2007

2 What's our architecture?
- Conway's Law: "Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure."
- Fred Brooks, "The Mythical Man-Month": "Because the design that occurs first is almost never the best possible, the prevailing system concept may need to change. Therefore, flexibility of organization is important to effective design."
- What does this mean for us?

3 Who is involved?
- Four main stakeholders:
  - Site Administrators
  - Grid Operators ("CIC on Duty", Regional Operations Centres (ROCs))
  - WLCG Project Management
  - Virtual Organisations (WLCG Experiments)
- A 5th 'stakeholder': monitoring developers and operators - the focus of this talk

4 High-level Model
- See WLCG_Monitoring_for_Managers.pdf for details: https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702-WLCG_Monitoring_for_Managers.pdf
- [Diagram of the monitoring components and their interconnections: LEMON, Nagios, SAM, SAME, R-GMA, GridView, Experiment Dashboard, GridIce and GOCDB, linked over HTTP and LDAP]

5 Architectural Principles
- Initially we tried to express some architectural principles
  - Mostly coming from stakeholder requirements
  - Focused less on technology choices
- This was used to design the site-local monitoring prototype
- This is an attempt to extend them more globally, and to focus on some technologies

6 Improve reliability by reducing time to respond
- "Site administrators are closest to the problems, and need to know about them first"
- Our focus has been on site monitoring
  - And it seems to work
  - But it hasn't been deployed widely by sites
- Implications:
  - Improved understanding of how to monitor services ("Service Cards" being developed by EGEE SA3)
  - Need to deploy components to sites, sometimes an entire monitoring system
  - Needs active participation of site admins
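To make "deploying components to sites" concrete, here is a minimal sketch of a Nagios-style probe that a site could run locally against one of its own grid services. The host name, port and metric name are hypothetical placeholders, not the actual EGEE/WLCG probes.

```python
#!/usr/bin/env python
# Minimal sketch of a site-local, Nagios-style probe for a grid service.
# Host and port below are made-up examples; real WLCG probes test far more.
import socket
import sys

HOST = "ce01.example-site.org"   # hypothetical Computing Element
PORT = 2119                      # historical GRAM gatekeeper port

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main():
    try:
        with socket.create_connection((HOST, PORT), timeout=10):
            print(f"OK - {HOST}:{PORT} is accepting connections")
            return OK
    except OSError as exc:
        print(f"CRITICAL - cannot connect to {HOST}:{PORT}: {exc}")
        return CRITICAL

if __name__ == "__main__":
    sys.exit(main())
```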

7 Tell others what you know
- "If you're monitoring a site remotely, it's only polite to give the data to the site" (Chris Brew)
- Remote systems should feed back information to sites
- Implications:
  - Common publication mechanisms
  - Integration into fabric monitoring
  - Discovery of data?
  - Site trust of the data - is it a "backdoor" communications mechanism?
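One possible integration into fabric monitoring, assuming the site runs Nagios: remotely gathered results are injected as passive check results through the Nagios external command file. The command-file path, host and service names below are hypothetical examples.

```python
# Minimal sketch: feed a remotely gathered test result back into the site's
# fabric monitoring as a Nagios passive check result.
import time

CMD_FILE = "/var/nagios/rw/nagios.cmd"   # hypothetical Nagios external command file

def publish_remote_result(host, service, return_code, output):
    """Append a PROCESS_SERVICE_CHECK_RESULT external command for Nagios."""
    line = "[{0}] PROCESS_SERVICE_CHECK_RESULT;{1};{2};{3};{4}\n".format(
        int(time.time()), host, service, return_code, output)
    with open(CMD_FILE, "a") as cmd:
        cmd.write(line)

# e.g. a result received from central remote tests for this site's CE
publish_remote_result("ce01.example-site.org", "org.sam.CE-JobSubmit",
                      0, "OK - remote job submission test passed")
```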

8 Don't impose systems on sites
- We can't dictate a monitoring system
  - Many (big?) sites already have a deployed system
  - We have to be pluggable into them
- Implications:
  - Modular approach
  - Specifications to define interfaces between existing systems and new components

9 Common names
- Infrastructures have identifiers for things
  - The GOCDB name is really an ID, not a name - used as a primary key for linkage
- Implications:
  - The (GOCDB?) ID is used for mapping between names
  - A community should only see their own name, and never the ID
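A minimal sketch of this naming principle: a stable identifier is used internally as the join key, while each community only ever sees its own name for the site. All identifiers and names below are made-up examples.

```python
# Hypothetical mapping from a stable (e.g. GOCDB-style) ID to per-community names.
SITE_ID = "GOCDB-12345"  # made-up stable identifier, used only for linkage

community_names = {
    "WLCG": {"GOCDB-12345": "T2_XX_Example"},
    "EGEE": {"GOCDB-12345": "EXAMPLE-SITE-LCG2"},
}

def display_name(community, site_id):
    """Return the community-specific name; the raw ID is never shown to users."""
    return community_names[community].get(site_id, "unknown site")

print(display_name("WLCG", SITE_ID))   # -> T2_XX_Example
```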

10 Flexible topology
- Communities also have structure
  - Distributed Tier-2s; splits between CAF and production, disk and tape, ...
- We need to be able to map sites, services and metrics into the community structure
- Implications:
  - Common ontology/schemas
  - Flexible and distributed
  - Integrate many data sources
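As one illustration of such a mapping, a community-level entity (here a distributed Tier-2 split into production and analysis roles) can be resolved to the underlying sites and service endpoints. Every name in the sketch is hypothetical.

```python
# Minimal sketch of a flexible topology mapping: community structure on top of
# infrastructure sites and services. All names are invented examples.
topology = {
    "T2_XX_Example": {                                   # community-level entity
        "sites": ["EXAMPLE-SITE-A", "EXAMPLE-SITE-B"],   # infrastructure sites
        "groups": {
            "production": ["ce01.site-a.org", "srm.site-a.org"],
            "analysis":   ["ce01.site-b.org"],
        },
    },
}

def services_for(entity, group):
    """Resolve a community-level group to the underlying service endpoints."""
    return topology[entity]["groups"].get(group, [])

print(services_for("T2_XX_Example", "production"))
```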

11 Authority for data...
- Currently repositories have direct DB connections to (all) other repositories
  - E.g. SAM, GridView, Gstat, GOCDB, CIC
  - And they cache, merge and process the data
- Implications:
  - We have an "interlinked distributed schema"
  - And tools should take responsibility for the contents of their parts of it

12 Responsibility for data
- Certain systems clean up data
  - FTM for log files
  - Gstat for the BDII
- And they should re-publish the "validated" version
  - Not everyone should do the processing (differently), e.g. gridmap/SAM/Gstat for CPU counts
- Implications:
  - Tools are simpler if they use cleaned-up data, at the cost of latency

13 No monolithic systems
- Different systems should specialize in their areas of expertise
  - And not have to also invent all the common infrastructure
- Implications:
  - Less overlap and duplication of work
  - Someone needs to manage some common infrastructure
  - We need to agree on the common infrastructure

14 Don't have central bottlenecks
- "Local problems detected locally shouldn't require remote services to work out what the problem is"
- This is a big change
  - Lots of central processing is done now in SAM/GridView
- Implications:
  - Do as much processing as possible locally (or regionally)
  - Helps scaling, but harder to deploy
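A minimal sketch of the "process locally, publish summaries upstream" idea: per-service probe results are aggregated at the site, and only a compact site-level status is shipped to the regional or central repository. Field names and the publish step are hypothetical.

```python
# Aggregate local probe results into one site-level summary before publishing.
import time

def summarise(results):
    """Reduce per-service results to a single site status: worst state wins."""
    order = {"OK": 0, "WARNING": 1, "CRITICAL": 2}
    worst = max(results.values(), key=lambda status: order.get(status, 2))
    return {
        "site": "EXAMPLE-SITE-A",          # hypothetical site name
        "status": worst,
        "n_services": len(results),
        "timestamp": int(time.time()),
    }

local_results = {
    "ce01.site-a.org/JobSubmit": "OK",
    "srm.site-a.org/SRMPut":     "WARNING",
}

summary = summarise(local_results)
# A publish(summary) call would hand this to the transport layer (e.g. the message bus)
print(summary)
```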

15 Necessary (and complete) security
- Security has an overhead
  - Don't kill your system with "needless" security (e.g. R-GMA)
- Encryption of specific elements, e.g. the user DN
- Signing for provenance
- Implications:
  - Security is available when needed, depending on the application use-case
  - It's not used when not needed
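A minimal stand-in for these two ideas, protecting only the sensitive field and signing the record for provenance. A shared-key HMAC is used purely for illustration; a production grid infrastructure would rely on X.509-based mechanisms, and all field values below are invented.

```python
# Sketch: hash the sensitive element (user DN) and sign the payload for provenance.
import hashlib
import hmac
import json

SIGNING_KEY = b"hypothetical-shared-secret"   # illustration only, not X.509

def protect(record):
    record = dict(record)
    # Hide the sensitive element instead of encrypting the whole channel
    record["user_dn"] = hashlib.sha256(record["user_dn"].encode()).hexdigest()
    payload = json.dumps(record, sort_keys=True)
    signature = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

msg = protect({"metric": "JobSubmit", "status": "OK",
               "user_dn": "/DC=ch/DC=cern/CN=Some User"})
print(msg["signature"])
```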

16 Make it less brittle
- Remove SPOFs
  - Push reliability into the fabric
  - Fewer point-to-point connections over the WAN
- Asynchronous communication works for monitoring
  - Latency is OK (to a point)
- Implications:
  - Reliable transport fabric using messaging systems
  - Fewer central (unique) repositories improve availability

17 Re-use, don't re-invent
- What do we do? Collect some information, move it around, store it, view it, report on it
- This is pretty common, so we should look at existing systems
  - Already happening for the site fabric: Nagios, LEMON, ...
- Implications:
  - Less code to develop and maintain
  - Integration nightmare?

18 Visualization for each community
- Visualization is very specific to the stakeholder
  - "User-targeted" visualization (Gilles)
  - But all should use the same underlying data
- Extract information processing out of the visualization tools
  - Provide the same processed info to all visualizations
  - Interface with community-specific information, e.g. names
- Implications:
  - Many "similar" dashboards, but everyone sees the same data
  - Common frameworks/widgets needed here?

19 Locality issues
- Our operations model is changing
  - From central to regional/national/local
- The architecture should reflect this
  - Probably good, since scaling should be easier
- Still need to pass information upstream
  - For reporting, central debugging, ...
- Implications:
  - Guidelines needed to help developers decide what to pass back
  - Possibility of more info going across the site boundary than the site admin wants

20 A (technical) architecture?
- Scalable underlying fabric (messaging systems, HTTP REST) -> ActiveMQ
- Use (simple) standards (web technology) -> HTTP, SSL
- Publish metadata (common grid topology information, and community-specific metadata) -> RDF
- Republish "raw" bulk data (test results, job logs, ...) -> "standard" XML (?)
- Visualization "toolkits" -> Dashboard, Yahoo UI, ...
- Reporting -> JasperReports
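To illustrate the messaging fabric, a minimal sketch of publishing one monitoring record to an ActiveMQ broker over the STOMP wire protocol, using raw frames rather than any particular client library. The broker endpoint, credentials, destination name and payload fields are all hypothetical.

```python
# Sketch: publish one monitoring record to an ActiveMQ broker via STOMP 1.0 frames.
import socket
import time

BROKER = ("broker.example.org", 61613)        # hypothetical broker endpoint
DEST = "/topic/grid.monitoring.metrics"       # hypothetical destination name

def stomp_frame(command, headers, body=""):
    """Build a STOMP frame: command, header lines, blank line, body, NUL byte."""
    lines = [command] + [f"{k}:{v}" for k, v in headers.items()]
    return ("\n".join(lines) + "\n\n" + body + "\x00").encode("utf-8")

def publish(metric_name, value):
    payload = f"metric={metric_name}\nvalue={value}\ntimestamp={int(time.time())}"
    with socket.create_connection(BROKER, timeout=10) as sock:
        sock.sendall(stomp_frame("CONNECT", {"login": "guest", "passcode": "guest"}))
        sock.recv(4096)   # expect a CONNECTED frame back from the broker
        sock.sendall(stomp_frame("SEND", {"destination": DEST}, payload))
        sock.sendall(stomp_frame("DISCONNECT", {}))

if __name__ == "__main__":
    publish("org.sam.CE-JobSubmit", "OK")
```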

21 Summary
- We have some principles to guide us
  - And some solutions/techniques to implement them
- We need to move slowly so as not to break existing production systems
  - But improve data quality and coverage
- Remember that the goal is to improve the reliability of the monitored system
  - And every decision should reflect that