Monitoring BOF, 23rd Jan 2007
Grid Service Monitoring Working Group
Monitoring WG BOF, January 2007
James Casey/Ian Neilson



WLCG Monitoring Working Groups
3 groups proposed by Ian Bird (LCG-MB, Oct 06)
– Goal: to improve the reliability of the grid
System Management
– Fabric management
– Best Practices
– Security
– …
Grid Services
– Grid sensors
– Transport
– Repositories
– Views
– …
System Analysis
– Application monitoring
– …

Grid Services Monitoring WG Mandate
– “… to help improve the reliability of the grid infrastructure …”
– “… provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service. …”

Grid Services Monitoring WG Mandate: it is not
– to develop more monitoring tools unless a specific need is identified
– to replace existing fabric management systems

Current State

Monitoring Data Flow

Site Metrics Publication

Immediate Tasks
“What do you have and what is needed?”
– Questionnaire to site administrators (Dec 06)
Per-service sensor definition
– Plain English
– Sensor ‘architecture’
Characterise monitoring data traffic
– → transport requirements
Repository schema
– Understand relationship between multiple DBs
– Include security requirements
Describe stakeholder “views”
– Site, Service, VO, Management
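
To make the "per-service sensor definition" and "repository schema" tasks more concrete, the sketch below shows one record that a sensor could emit and a repository could store. It is purely illustrative: the `MetricRecord` class, its field names, the JSON encoding and the example values are all assumptions, not anything the working group had agreed. The status levels borrow the Nagios plugin return-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from enum import IntEnum
import json


class Status(IntEnum):
    """Status levels, following the Nagios plugin return-code convention."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


@dataclass
class MetricRecord:
    """Hypothetical record a grid service sensor might publish."""
    site: str
    service: str      # e.g. "CE" or "SE"
    metric: str       # plain-English name of what was measured
    status: Status
    timestamp: str    # ISO 8601, UTC
    detail: str = ""


def to_wire(record: MetricRecord) -> str:
    """Serialise a record as JSON for a (hypothetical) transport layer."""
    payload = asdict(record)
    payload["status"] = record.status.name  # send the readable name
    return json.dumps(payload, sort_keys=True)


# Example values are invented for illustration only.
example = MetricRecord(
    site="EXAMPLE-SITE",
    service="CE",
    metric="job submission test",
    status=Status.WARNING,
    timestamp=datetime(2007, 1, 23, 12, 0, tzinfo=timezone.utc).isoformat(),
    detail="submission slower than threshold",
)
wire = to_wire(example)
print(wire)
```

Keeping site, service and status as explicit fields is one way the same record could feed each of the stakeholder views listed above (Site, Service, VO, Management) without per-view sensors.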

WG Structure
2 coordinators
“Core” team of ~10 across domains
4 domain sub-groups
– Sensors
– Transport
– Repository
– Views

Timeline
Now (Dec 06)
– Background research
– Establish core group
Feb 07
– Establish sub-groups
– Agree interfaces and workplan
April/May 07
– Prototype instrumented services to local FM
– Remote metrics to local FM
End of summer 07
– Demonstrated improvement in reliability of grid

Grid Services Monitoring WG Site Survey
Results to 17 Jan 2007

Questionnaire
1) What local fabric monitoring system do you use?
   a) GridICE/Lemon
   b) Nagios
   c) Other (please specify)
   d) None
2) Which Grid-level sensors do you use?
   a) Which services are monitored
   b) What values/metrics are measured
3) Who provided the sensors?
4) Is your fabric monitoring part of any regional/off-site monitoring framework?
   a) Who are you linked with
   b) Generally, how is this implemented
5) When you learn that something is wrong with the services at your site, what is the most frequent way you are informed?
   a) Looking in the local fabric or Grid monitoring system
   b) Getting a trouble ticket
   c) Getting a mail/telephone call from VOs/users
   d) Other (please specify)
6) Briefly describe what you see as your top 3 monitoring priorities to help improve your service reliability/availability

Summary of Returns 1
34 responses analysed up to 17 Jan 2007
– Not always easy to summarise, so the numbers don’t always add up!
Local monitoring frameworks in use
– Sites using multiple frameworks
  a) Nagios: 22
  b) GridICE/Lemon: 10
  c) Other (majority use (a) or (b) plus Ganglia): 13
  d) None: 3
Grid Services Monitored
– 12 sites monitoring some Grid services
  Most commonly CE+SE
  Non-Grid default Nagios sensors in use
– Sensors provided by AP, CE, IT ROCs
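
As a quick sanity check on the framework counts, the sketch below converts them into shares of the 34 responses. The counts come from the slide; the variable names and the reading of "Other" as 13 sites are assumptions. Because sites may use several frameworks, the mentions exceed the number of responses and the percentages do not sum to 100.

```python
RESPONSES = 34  # responses analysed up to 17 Jan 2007

# Counts as reported on the slide; a site may appear under several entries.
framework_counts = {
    "Nagios": 22,
    "GridICE/Lemon": 10,
    "Other": 13,  # assumed reading of the slide's "Other ... : 13"
    "None": 3,
}


def share(count: int, total: int = RESPONSES) -> float:
    """Percentage of responding sites, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


shares = {name: share(n) for name, n in framework_counts.items()}
total_mentions = sum(framework_counts.values())

for name, pct in shares.items():
    print(f"{name}: {pct}% of responding sites")
print(f"{total_mentions} mentions from {RESPONSES} responses")
```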

Summary of Returns 2
How problems get reported
– Most common from local monitoring: 21
– Support ticket: 10
– Looking at SAM/GSTAT: 4
– Direct from user/VO: 3
Sites reported being in regional infrastructures: 10
– Not clear from the reports how these are implemented
– Regions (same as for sensors provided): AP, CE, IT ROCs
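
The reporting-channel counts can be ranked and checked against the number of responses in the same spirit. This sketch assumes the four counts all come from the same 34 responses analysed up to 17 Jan 2007; the variable names are illustrative. Under that assumption, the total of the mentions exceeds the number of responses, so some sites must have named more than one channel.

```python
RESPONSES = 34  # assumed to be the same 34 responses as in the first summary

# Counts as reported on the slide.
channel_counts = {
    "local monitoring": 21,
    "support ticket": 10,
    "looking at SAM/GSTAT": 4,
    "direct from user/VO": 3,
}

total_mentions = sum(channel_counts.values())
# Mentions beyond one per response imply multiple answers from some sites.
extra_mentions = total_mentions - RESPONSES

ranked = sorted(channel_counts.items(), key=lambda kv: kv[1], reverse=True)
for channel, n in ranked:
    pct = 100 * n / total_mentions
    print(f"{channel}: {n} mentions ({pct:.0f}% of mentions)")
print(f"{extra_mentions} mentions more than the number of responses")
```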

Priorities
– Quite difficult to summarise, but keywords are …
  single view - common interface - global view
  unified tools - repository
  more/deeper diagnostics
  more flexible – alarm levels
  improved/reliable/redundant SAM
  hardware/network monitoring
– Also non-monitoring replies
  Working/debugged middleware
  Reliable hardware
  Experience/knowledge transfer