A proposal for improving Job Reliability Monitoring. GDB, 2nd April 2008. CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it

Presentation transcript:

A proposal for improving Job Reliability Monitoring
GDB, 2nd April 2008

CERN IT Department, CH-1211 Genève 23, Switzerland, Internet Services

Problem Statement
We would like to be able to gather job state transitions from all jobs submitted to WLCG resources
  – EGEE
      RB + WMS submitted jobs
      Jobs submitted directly to a CE (via condor_g)
  – OSG
      Jobs on an OSG CE for WLCG VOs
  – Nordugrid
      ARC CE
Use this to calculate resource reliability
  – And use it for debugging, e.g. in the Dashboards
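As a minimal sketch of the calculation this slide has in mind, the Python below derives a per-resource reliability figure from a list of job state transitions. The record layout, the state names (`DONE`, `ABORTED`), and the definition of reliability as "succeeded over finished" are assumptions for illustration; the slides do not define them.

```python
from collections import defaultdict

# Hypothetical terminal states; the slides do not fix a state vocabulary.
SUCCESS_STATES = {"DONE"}
FAILURE_STATES = {"ABORTED"}
TERMINAL_STATES = SUCCESS_STATES | FAILURE_STATES


def resource_reliability(transitions):
    """Return {resource: fraction of finished jobs that succeeded}.

    transitions: iterable of (job_id, resource, state, timestamp) tuples.
    Only the latest terminal state of each job is counted; jobs that never
    reach a terminal state are ignored.
    """
    final = {}  # job_id -> (timestamp, resource, state)
    for job_id, resource, state, ts in transitions:
        if state not in TERMINAL_STATES:
            continue
        if job_id not in final or ts >= final[job_id][0]:
            final[job_id] = (ts, resource, state)

    counts = defaultdict(lambda: [0, 0])  # resource -> [succeeded, finished]
    for _, resource, state in final.values():
        counts[resource][1] += 1
        if state in SUCCESS_STATES:
            counts[resource][0] += 1
    return {r: ok / total for r, (ok, total) in counts.items()}


sample = [
    ("j1", "ce01", "SUBMITTED", 1), ("j1", "ce01", "RUNNING", 2), ("j1", "ce01", "DONE", 3),
    ("j2", "ce01", "SUBMITTED", 1), ("j2", "ce01", "ABORTED", 2),
    ("j3", "ce02", "SUBMITTED", 1), ("j3", "ce02", "DONE", 4),
    ("j4", "ce02", "RUNNING", 5),  # still running, so not counted
]
reliability = resource_reliability(sample)
```

Ignoring unfinished jobs is one possible choice; counting them as failures after a timeout gives a different figure.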

Principles
Only gather this information once
  – Propagate to interested parties
Use existing systems and expertise where possible
  – Don’t try to deploy components on every WMS/RB/L&B/CE/…
  – Get ‘cooked’ data from the systems
Hook up with Pilot Jobs
  – Linkage between pilot and experiment jobs as a ‘state change’
‘Job Wrapper tests’ fit in here too
  – They’re just another state change
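The "gather once, propagate to interested parties" principle can be sketched as a simple fan-out. The class below is a toy in-process stand-in, not the WLCG messaging system; `StateChangeBus`, the record fields, and the `PAYLOAD_STARTED` state are hypothetical names chosen for illustration.

```python
class StateChangeBus:
    """Toy fan-out: each record is published once and every subscriber receives it."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, record):
        for callback in self._subscribers:
            callback(record)


# Two hypothetical consumers, standing in for Gridview and a Dashboard.
gridview_view = []
dashboard_view = []

bus = StateChangeBus()
bus.subscribe(gridview_view.append)
bus.subscribe(dashboard_view.append)

bus.publish({"job_id": "pilot-7", "state": "RUNNING"})
# The pilot handing over to an experiment job is modelled as just another state change.
bus.publish({"job_id": "pilot-7", "state": "PAYLOAD_STARTED", "experiment_job_id": "exp-42"})
```

Both consumers end up with the same records although each state change was reported only once by the producer.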

Current situation - Gridview
Currently mines L&B log files, and sends them via R-GMA
Loses many records
Hacks to ‘finish’ unfinished jobs after 24 hours
  – Inaccurate results
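To illustrate why a 24-hour cut-off gives inaccurate results, here is a small sketch of such a rule. It is not Gridview's actual code; the function name, the job names, and the timestamps are invented for the example.

```python
TIMEOUT_SECONDS = 24 * 60 * 60


def jobs_to_force_finish(last_seen, now, timeout=TIMEOUT_SECONDS):
    """Return the job ids whose most recent record is older than the timeout."""
    return sorted(job for job, ts in last_seen.items() if now - ts > timeout)


# Timestamp (in seconds) of the last record received for each job.
last_seen = {
    "lost-final-record": 0,  # finished long ago, but its final record was lost
    "long-running": 0,       # genuinely still running after more than 24 hours
    "recent": 80_000,        # last heard from recently
}
forced = jobs_to_force_finish(last_seen, now=90_000)
```

The rule cannot tell the job whose final record was lost from the job that is still running, so both are closed with the same guessed outcome.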

Current situation - Dashboard
Jobs reported via experiment frameworks
  – Gathers from many sources: Imperial College XML files, job submission tools, MonALISA reporting from jobs, R-GMA
But some information is missing for condor_g jobs
  – Info between submission and user job starting on WN
  – Job aborted
Some work done (Sergey Belov/Dubna) on reporting state changes inside condor_g

Proposal
Use WLCG Monitoring infrastructure (MSG) for collecting and transporting the data
  – Messaging system
  – Standard message formats
Work with expert groups to instrument the job submission systems
Visualization by Gridview + Dashboards
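What a "standard message format" buys can be shown with a small sketch: every producer emits the same fields, so any consumer can decode any message. The field names, the JSON encoding, and the example job URL below are assumptions; the slides do not specify the actual MSG format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JobStateMessage:
    """Hypothetical job state change record shared by all producers."""
    job_id: str
    source: str      # e.g. "wms", "condor_g", "osg"
    state: str
    timestamp: str   # ISO 8601, UTC

    def to_wire(self) -> str:
        # Sorted keys give a stable encoding regardless of producer.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_wire(cls, payload: str) -> "JobStateMessage":
        return cls(**json.loads(payload))


msg = JobStateMessage(
    job_id="https://lb.example.org/job/abc",  # made-up identifier
    source="wms",
    state="RUNNING",
    timestamp="2008-04-02T10:00:00Z",
)
wire = msg.to_wire()
decoded = JobStateMessage.from_wire(wire)
```

A consumer such as a dashboard would only need `from_wire`, independent of which submission system produced the message.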

Effort
Provide some effort to do the instrumentation
  – Coordination: WLCG Monitoring (D. Rodrigues)
  – Messaging system integration: Gridview
  – EGEE WMS: L&B, Gridview
  – Condor-G: Condor team, Dashboard team
  – OSG CE: through OSG participation in the Joint Monitoring Group (OSG Operations (Rob Quick), measurements & metrics (Brian Bockelman))

EGEE
L&B Notifications mean we don’t have to run components mining L&B logfiles
  – Consumer of notifications can be remote
L&B is stated to scale for our needs
  – Tested at >1m records/day
  – Testing of integration with notifications underway by the Gridview team
Message formats already defined
  – Old log mining approach will be moved entirely to the messaging system to free Gridview from the R-GMA dependency

condor_g
condor_g submitter instrumented to create L&B messages
  – Done by a separate listener process that is started by condor_g
  – Limited subset of condor_g state changes will be sent
Listener/reporter can use different transport for reporting
  – Currently MonALISA as a transport layer
  – Will migrate to WLCG messaging system
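The listener design described here, a filter on state changes plus a swappable transport, can be sketched as follows. This is not the actual condor_g listener; the `Listener` class, the set of reported states, and the list-based transports are hypothetical stand-ins.

```python
from typing import Callable

# Hypothetical subset; the slide only says a "limited subset" is sent.
REPORTED_STATES = {"SUBMITTED", "RUNNING", "DONE", "ABORTED"}

# A transport is anything that accepts one event dict.
Transport = Callable[[dict], None]


class Listener:
    """Forwards selected state changes through whichever transport it was given."""

    def __init__(self, transport: Transport):
        self._transport = transport

    def handle(self, event: dict) -> bool:
        """Send the event if its state is reported; return whether it was sent."""
        if event["state"] not in REPORTED_STATES:
            return False
        self._transport(event)
        return True


events = [
    {"job_id": "c1", "state": "SUBMITTED"},
    {"job_id": "c1", "state": "IDLE"},      # filtered out
    {"job_id": "c1", "state": "RUNNING"},
]

first_transport = []   # stands in for the current transport layer
second_transport = []  # stands in for the messaging system

listener = Listener(first_transport.append)
sent = [listener.handle(e) for e in events]

# Migrating to another transport only changes the constructor argument.
migrated = Listener(second_transport.append)
for e in events:
    migrated.handle(e)
```

Keeping the transport behind a single callable is what makes the planned migration a configuration change rather than a rewrite of the listener.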

EGEE Architecture (diagram; not captured in the transcript)

OSG
Gratia is used to transport messages inside OSG
A Gratia-MSG Bridge could be implemented
  – Similar to the RSV bridge used for OSG availability
Plan to include discussion in the upcoming EGEE-OSG-WLCG design meeting in Madison at the end of May
  – Hope to further collaborate with OSG on the infrastructure for the analysis of the collected data, dashboards etc.

Nordugrid
  – Currently the only Nordugrid job info is via the ATLAS production DB
  – How do we get information from the CE?
Will look to implement a similar bridge if needed
  – Need to work through the technical details with the experts
  – Discussion yet to start…

Pilot Jobs
L&B client resides on every worker node
Can be used to submit additional messages to L&B for a job
  – Timestamps + environment for Job Wrapper start/end
  – Timestamp of handover to user job
  – Linkage of pilot job to experiment job ID
  – …
Benefit is that it’s all in one coherent data structure for a given job
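The "one coherent data structure for a given job" idea can be sketched by folding all events for a job into a single record. The event types, field names, and the `build_job_records` helper are assumptions for illustration; the slides do not define the L&B event schema.

```python
from collections import defaultdict


def build_job_records(events):
    """Group events by job_id into one record per job.

    Each record holds a time-ordered event list and, if reported,
    the experiment job ID that the pilot handed over to.
    """
    records = defaultdict(lambda: {"events": [], "experiment_job_id": None})
    for ev in sorted(events, key=lambda e: e["ts"]):
        rec = records[ev["job_id"]]
        rec["events"].append((ev["ts"], ev["type"]))
        if "experiment_job_id" in ev:
            rec["experiment_job_id"] = ev["experiment_job_id"]
    return dict(records)


# Events may arrive out of order; timestamps here are arbitrary integers.
events = [
    {"job_id": "pilot-1", "ts": 30, "type": "USER_JOB_HANDOVER", "experiment_job_id": "exp-9"},
    {"job_id": "pilot-1", "ts": 10, "type": "WRAPPER_START"},
    {"job_id": "pilot-1", "ts": 50, "type": "WRAPPER_END"},
]
records = build_job_records(events)
```

With everything keyed on the pilot's job ID, the wrapper timings and the link to the experiment job can be read from one place.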

Summary
Propose a more coherent approach to mining job state changes
  – Uses expert knowledge where possible to ‘cook’ the data into a useful structure
Fits the principles of the WLCG monitoring activity
  – Use common message system components
  – Split the work across the relevant teams
What have we missed?
  – Your feedback essential…