Real Time Monitor of Grid Job Executions Janusz Martyniak Imperial College London.



26th March 2009, Janusz Martyniak, Imperial College London

What is the RTM
- RTM is a Grid monitoring system which gives an overview of the current state of the infrastructure.
- The RTM's view (‘map’) of the Grid is based solely on information retrieved from BDII services.
- Job information is obtained by querying EGEE Logging and Bookkeeping (LB) servers.
- The information gathered is published in XML format and is available for clients to use.
- RTM supports two graphical client programs that use satellite imagery from NASA; these clients display the Grid as it is geographically spread over the world.
- There is a text version as well (formatted, ordered by country/site).
- Any site using EGEE-compatible technology (i.e. LB servers/RB/WMS) may be displayed on the map. If the LB is open for RTM access, this happens automatically.
- Webpage:

History
- Development started in summer 2002, initial version.
- 2002: Perl/CGI mix.
- 2003: Java application/applet with a 2D map and ‘stacked’ job display.
- Present graphics: 2D map, 3D globe, selectable VO and RB/WMS, RB/CE/jobs info on request (graphs and tables).
- DB restructuring, WMS support, revised threading, many bug fixes and improvements.

3D Client

3D client, zoomed-in view

Site Information

Components
- 2D Java application/applet client.
- 3D client; requires OpenGL support.
- Server: periodically queries the LB databases and stores the relevant information in the RTM DB.
- BDIIscan: keeps the RTM DB synchronised with BDII.
- Enquirer: reads the RTM DB and writes XML for the Java clients into the Apache file system (so clients query Apache, not the database).
- Resource scanner: checks once per day whether each LB DB accepts RTM requests.
- Data analysis (Java/ROOT): creates CE reports.
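
The Enquirer step (read the database, write XML that the web server then serves as a static file) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `jobs` table, its columns, the XML element names and the host names are hypothetical, and an in-memory SQLite database stands in for the PostgreSQL RTM DB so that the sketch is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET


def jobs_to_xml(conn):
    """Render the rows of a hypothetical `jobs` table as one XML document."""
    root = ET.Element("rtm")
    rows = conn.execute(
        "SELECT job_id, ce, vo, state FROM jobs ORDER BY job_id"
    )
    for job_id, ce, vo, state in rows:
        # One <job> element per row; all attribute names are assumptions.
        ET.SubElement(root, "job", id=job_id, ce=ce, vo=vo, state=state)
    return ET.tostring(root, encoding="unicode")


def build_demo_db():
    """Create a tiny in-memory database standing in for the RTM DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (job_id TEXT, ce TEXT, vo TEXT, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?)",
        [
            ("job-1", "ce01.example.org", "lhcb", "Running"),
            ("job-2", "ce02.example.org", "cms", "Done"),
        ],
    )
    return conn


if __name__ == "__main__":
    # In a deployment this text would be written to a file under the web
    # server's document root, so clients fetch the file, not the database.
    print(jobs_to_xml(build_demo_db()))
```

The design point this illustrates is the decoupling: however many clients poll for data, the database sees only the periodic queries of the one process that regenerates the XML.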

RTM Architecture (diagram). Elements shown: RTM Database (PostgreSQL) with site, rb/wms, ce and jobs; Apache Server; BDII Scan; Server; LB servers; BDII; Enquirer; summary files (html/xml); rrdtool (graphs).

RTM Data Streams
- XML-formatted data is made available to clients: 2D/3D Java clients, CMS dashboard, ???
- Job summaries are used by the Grid Observatory ( observatory.org/). The data is publicly available from their website.
- Data is also analysed at Imperial; some examples will follow.
- We have a poster as well [023] on Thu.

GRID Performance Analysis

Performance Metrics
We use the following metrics:
- Job submissions in a given day.
- Cumulative job submissions in a given day.
- Job Hours: total number of hours consumed by a VO on a given day.
- $E_{\mathrm{RAW}} = \dfrac{\text{Total number of successful jobs in a day}}{\text{Total number of jobs in that day}}$
- $E_{\mathrm{USER}} = \dfrac{\text{Time job is running}}{\text{Time job is in system}}$
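
As an illustration of the two efficiency metrics and of Job Hours, here is a minimal sketch. The `Job` record and its field names are hypothetical, the sample values are invented for the example, Job Hours is taken to mean running time, and E_USER is computed per job because the slide does not say how it is aggregated.

```python
from dataclasses import dataclass


@dataclass
class Job:
    vo: str                 # virtual organisation that owns the job
    succeeded: bool         # True if the job finished successfully
    hours_in_system: float  # from submission to final state
    hours_running: float    # time spent actually executing


def e_raw(jobs):
    """Successful jobs in the day divided by all jobs in that day."""
    return sum(j.succeeded for j in jobs) / len(jobs)


def e_user(job):
    """Time the job is running divided by the time it is in the system."""
    return job.hours_running / job.hours_in_system


def job_hours(jobs, vo):
    """Total running hours consumed by one VO (assumed meaning of Job Hours)."""
    return sum(j.hours_running for j in jobs if j.vo == vo)


# Invented sample data for one day.
day = [
    Job("lhcb", True, 4.0, 3.0),
    Job("lhcb", False, 1.0, 0.25),
    Job("cms", True, 2.0, 1.0),
    Job("cms", True, 5.0, 4.0),
]

if __name__ == "__main__":
    print(e_raw(day))              # 3 of 4 jobs succeeded
    print(e_user(day[0]))          # 3.0 h running out of 4.0 h in system
    print(job_hours(day, "lhcb"))  # 3.0 + 0.25 running hours
```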

Growth of the EGEE Grid
The cumulative job submissions graph shows a greater than linear increase since September

LHCb

Job submissions vs time
- Production runs mid 2006 and
- LHCb uses pilot jobs, which fail quickly if the resource is wrong or there is no waiting job.

E_RAW vs time
- Pilot jobs make the RAW efficiency low, especially during production runs.

LHCb

Job Hours vs time
- Although many pilot jobs fail, they take very little time, so very little time is spent running failed jobs.

E_USER vs time
- User efficiency is very good, as the jobs which do run start quickly, and in general there is a high job load from this VO.
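
A small worked calculation shows how a low RAW efficiency and a small amount of wasted time can coexist. The numbers below are purely hypothetical and are not taken from the LHCb data: 900 pilot jobs that fail after one minute each, and 100 jobs that succeed after ten hours each.

```python
# Hypothetical workload, chosen only to illustrate the effect.
failed_pilots = 900
failed_minutes_each = 1
successful_jobs = 100
successful_hours_each = 10

total_jobs = failed_pilots + successful_jobs

# Fraction of jobs that succeed (RAW efficiency).
raw_efficiency = successful_jobs / total_jobs

# Hours spent on failed versus successful jobs.
failed_hours = failed_pilots * failed_minutes_each / 60
successful_hours = successful_jobs * successful_hours_each
wasted_fraction = failed_hours / (failed_hours + successful_hours)

if __name__ == "__main__":
    print(f"RAW efficiency: {raw_efficiency:.2f}")
    print(f"Hours on failed jobs: {failed_hours:.1f}")
    print(f"Share of job hours spent on failed jobs: {wasted_fraction:.3f}")
```

With these assumed figures only one job in ten succeeds, yet failed jobs account for about 1.5% of the job hours.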

Future
We are considering changing the current RTM server-side architecture, which relies on direct LB DB access in pull mode:
- It uses one thread per LB server, which does not scale well.
- It suffers from internal changes in LB: the current code will not work with the next version of LB (2.0).
- The amount of information RTM retrieves is not always optimal.
- Privileged DB access makes RTM non-portable.

LB notification based architecture
- The LB and RTM teams have implemented a prototype which uses push rather than pull mode.
- A replacement RTM module, the LB harvester, was developed from scratch, based on LB notifications.
- The module subscribes to receive messages on job state changes from all registered LBs.
- The messages are processed and stored in the internal RTM database.
- Other RTM components remain unchanged.
- One harvester thread accepts messages from multiple LBs.
- The limiting factor is the total number of incoming messages, not the number of connections.
- A message on a specific job is generated only on a job state change.
- Access to LB uses only its stable public interface and does not depend on changes in LB internals.
- Uses GSI.
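
The "one harvester thread, many LBs" push pattern can be sketched in miniature as below. This is not the LB notification interface: an in-process `queue.Queue` stands in for the network messaging, the LB servers are simulated by threads, and all names (`lb_server`, `harvester`, the LB and job identifiers) are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel telling the harvester to shut down


def lb_server(name, events, inbox):
    """Simulated LB: pushes a message only when a job changes state."""
    for job_id, state in events:
        inbox.put((name, job_id, state))


def harvester(inbox, store):
    """Single consumer thread: records the latest state of every job."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        lb_name, job_id, state = msg
        store[(lb_name, job_id)] = state  # stands in for the RTM DB write


def run_demo():
    inbox = queue.Queue()
    store = {}
    consumer = threading.Thread(target=harvester, args=(inbox, store))
    consumer.start()

    # Invented job state changes for two simulated LB servers.
    feeds = {
        "lb-a": [("job-1", "Submitted"), ("job-1", "Running"), ("job-1", "Done")],
        "lb-b": [("job-7", "Submitted"), ("job-7", "Aborted")],
    }
    producers = [
        threading.Thread(target=lb_server, args=(name, events, inbox))
        for name, events in feeds.items()
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()

    inbox.put(STOP)  # all messages are queued before the sentinel
    consumer.join()
    return store


if __name__ == "__main__":
    print(run_demo())
```

The work done by the single consumer grows with the number of messages rather than with the number of sources, which mirrors the scaling argument made on the slide.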

Current and Proposed Systems Combined

We have successfully implemented a working prototype of the harvester:
- Job state fully synchronized between LB and RTM.
- Reduced LB load by information push.
- AuthN/Z with X509 instead of DB password.
- Promising performance.
- Better scaling due to the “1 thread : many LBs” architecture.
- Smooth migration from LB 1.9 to LB 2.0.

Conclusions
- RTM is a reliable monitoring system.
- It has been widely used as a dissemination tool for both specialists and non-specialists all over the world.
- It works well on multiple platforms; however, there are still issues with some graphics cards and OpenGL support.
- There are ongoing discussions with EGEE on how to improve the tool.