Julia Andreeva, CERN IT-ES GDB 08.02.2012. Every experiment does evaluation of the site status and experiment activities at the site As a rule the state.

Slides:



Advertisements
Similar presentations
WLCG Operations and Tools TEG Monitoring – Experiment Perspective Simone Campana and Pepe Flix Operations TEG Workshop, 23 January 2012.
Advertisements

WLCG Monitoring Consolidation NEC`2013, Varna Julia Andreeva CERN IT-SDC.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES News on monitoring for CMS distributed computing operations Andrea.
LHC Experiment Dashboard Main areas covered by the Experiment Dashboard: Data processing monitoring (job monitoring) Data transfer monitoring Site/service.
CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.
Input from CMS Nicolò Magini Andrea Sciabà IT/SDC 5 July 2013.
CERN IT Department CH-1211 Geneva 23 Switzerland t The Experiment Dashboard ISGC th April 2008 Pablo Saiz, Julia Andreeva, Benjamin.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services GS group meeting Monitoring and Dashboards section Activity.
SRM 2.2: status of the implementations and GSSD 6 th March 2007 Flavia Donno, Maarten Litmaath INFN and IT/GD, CERN.
Enabling Grids for E-sciencE Overview of System Analysis Working Group Julia Andreeva CERN, WLCG Collaboration Workshop, Monitoring BOF session 23 January.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks VO-specific systems for the monitoring of.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Julia Andreeva CERN (IT/GS) CHEP 2009, March 2009, Prague New job monitoring strategy.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Job Monitoring for the LHC experiments Irina Sidorova (CERN, JINR) on.
1 Andrea Sciabà CERN Towards a global monitoring system for CMS computing Lothar A. T. Bauerdick Andrea P. Sciabà Computing in High Energy and Nuclear.
Enabling Grids for E-sciencE System Analysis Working Group and Experiment Dashboard Julia Andreeva CERN Grid Operations Workshop – June, Stockholm.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Service Availability Monitoring – Status.
Towards a Global Service Registry for the World-Wide LHC Computing Grid Maria ALANDES, Laurence FIELD, Alessandro DI GIROLAMO CERN IT Department CHEP 2013.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.
Dashboard program of work Julia Andreeva GS Group meeting
MW Readiness WG Update Andrea Manzi Maria Dimou Lionel Cons 10/12/2014.
WLCG Monitoring Roadmap Julia Andreeva, CERN , WLCG workshop, CERN.
Monitoring for CCRC08, status and plans Julia Andreeva, CERN , F2F meeting, CERN.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Ricardo Rocha CERN (IT/GS) EGEE’08, September 2008, Istanbul, TURKEY Experiment.
PanDA Status Report Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014.
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
WLCG Service Report ~~~ WLCG Management Board, 7 th September 2010 Updated 8 th September
Testing and integrating the WLCG/EGEE middleware in the LHC computing Simone Campana, Alessandro Di Girolamo, Elisa Lanciotti, Nicolò Magini, Patricia.
ATP Future Directions Availability of historical information for grid resources: It is necessary to store the history of grid resources as these resources.
4 March 2008CCRC'08 Feb run - preliminary WLCG report 1 CCRC’08 Feb Run Preliminary WLCG Report.
Julia Andreeva on behalf of the MND section MND review.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI User-centric monitoring of the analysis and production activities within.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Monitoring of the LHC Computing Activities Key Results from the Services.
DDM FAX Dashboard status and future Luca Magnoni IT/SDC 2 nd June 2014.
MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.
Global ADC Job Monitoring Laura Sargsyan (YerPhI).
FTS monitoring work WLCG service reliability workshop November 2007 Alexander Uzhinskiy Andrey Nechaevskiy.
Operation Issues (Initiation for the discussion) Julia Andreeva, CERN WLCG workshop, Prague, March 2009.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Ian Bird All Activity Meeting, Sofia
WLCG Information System Use Cases Review WLCG Operations Coordination Meeting 18 th June 2015 Maria Alandes IT/SDC.
Enabling Grids for E-sciencE Grid monitoring from the VO/User perspective. Dashboard for the LHC experiments Julia Andreeva CERN, IT/PSS.
Enabling Grids for E-sciencE INFSO-RI Enabling Grids for E-sciencE Gavin McCance GDB – 6 June 2007 FTS 2.0 deployment and testing.
Enabling Grids for E-sciencE CMS/ARDA activity within the CMS distributed system Julia Andreeva, CERN On behalf of ARDA group CHEP06.
Next Steps after WLCG workshop Information System Task Force 11 th February
New solutions for large scale functional tests in the WLCG infrastructure with SAM/Nagios: The experiments experience ES IT Department CERN J. Andreeva.
CERN - IT Department CH-1211 Genève 23 Switzerland t Grid Reliability Pablo Saiz On behalf of the Dashboard team: J. Andreeva, C. Cirstoiu,
WLCG Operations Coordination report Maria Alandes, Andrea Sciabà IT-SDC On behalf of the WLCG Operations Coordination team GDB 9 th April 2014.
ATLAS Distributed Computing ATLAS session WLCG pre-CHEP Workshop New York May 19-20, 2012 Alexei Klimentov Stephane Jezequel Ikuo Ueda For ATLAS Distributed.
CERN - IT Department CH-1211 Genève 23 Switzerland t IT-GD-OPS attendance to EGEE’09 IT/GD Group Meeting, 09 October 2009.
WLCG Status Report Ian Bird Austrian Tier 2 Workshop 22 nd June, 2010.
Prototype of new Site Usability interface Amol Wakankar March, /1/20111.
SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.
MND section. Summary of activities Job monitoring In collaboration with GridView and LB teams enabled full chain from LB harvester via MSG to Dashboard.
Ian Bird LCG Project Leader Status of EGEE  EGI transition WLCG LHCC Referees’ meeting 21 st September 2009.
Accounting Review Summary from the pre-GDB related to CPU (wallclock) accounting Julia Andreeva CERN-IT GDB 13th April
Monitoring the Readiness and Utilization of the Distributed CMS Computing Facilities XVIII International Conference on Computing in High Energy and Nuclear.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES The Common Solutions Strategy of the Experiment Support group.
SAM architecture EGEE 07 Service Availability Monitor for the LHC experiments Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo,
WLCG Accounting Task Force Update Julia Andreeva CERN GDB, 8 th of June,
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Author etc Alarm framework requirements Andrea Sciabà Tony Wildish.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Monitoring Overview: status, issues and outlook Simone Campana.
Site notifications with SAM and Dashboards Marian Babik SDC/MI Team IT/SDC/MI 12 th June 2013 GDB.
Maria Alandes Pradillo, CERN Training on GLUE 2 information validation EGI Technical Forum September 2013.
Accounting Review Summary and action list from the (pre)GDB Julia Andreeva CERN-IT WLCG MB 19th April
WLCG Accounting Task Force Introduction Julia Andreeva CERN 9 th of June,
Daniele Bonacorsi Andrea Sciabà
Main objectives To have by the end of August :
Key Activities. MND sections
POW MND section.
Experiment Dashboard overviw of the applications
Monitoring of the infrastructure from the VO perspective
Presentation transcript:

Julia Andreeva, CERN IT-ES GDB

Every experiment does evaluation of the site status and experiment activities at the site As a rule the state of the site is exposed through experiment-specific monitoring systems Too many information sources, different entry points, different implementation The information published by the VO-specific monitoring systems should be integrated in a high level site-centric tool which offers a generic view of the computing activities of the LHC experiments at the site. 2Julia Andreeva, GDB

SiteView development started in 2009 First prototype ready by spring 2009 It suffered from the instability of the collectors and lack of the LHC VO topology descriptors In the end of summer 2009 it was completely redesigned. Common implementation with Site Status Board 3Julia Andreeva, GDB

VOTopologyStatusJob processingTransfer ALICEVO feedMonAlisa ATLASVO feedSSBJob monitoring Dashboard DDM Dashboard ( Soon WLCG Transfer Dashboard ) CMSVO feedSSBJob monitoring Dashboard Phedex (Soon WLCG Transfer Dashboard) LHCbVO feedSUM (Soon new Dirac site monitoring) Dirac (Soon WLCG Transfer Dashboard) 4 Over last years progress in decreasing of number of various information sources Julia Andreeva, GDB

SiteView collects: global site status, job monitoring information and transfer monitoring information From the implementation point of view SSB and SiteView is the same system (DB schema, collectors, UI) Recently many improvements done in SiteView : -Collectors became stable, alarms are raised in case any of collectors is stuck - Thanks to appearing of VO-feeds in most cases the problems with inconsistent site names (GOCDB-experiment site name mapping) are solved - ATLAS and CMS share many underlying monitoring systems : SSB, Dashboard job monitoring applications - The underlying monitoring systems like ATLAS DDM Dashboard, Phedex had improved over last two years (Phedex data service) - When new FTS version is deployed to all T1 sites there would be a common monitoring system for data transfers shared by CMS, ATLAS and LHCb which could be used as a source for SiteView So far is used mainly for dissemination purposes (runs behind WLCG GoogleEarth ). Not actively used for operations 5Julia Andreeva, GDB

Overall status -CMS and ATLAS: as defined in the SSB shifter view (realistic but not necessary all metrics are under site responsibility and not all of them are comprehensive to the sites) -ALICE: based on the status of the critical services as published in MonAlisa -LHCb: based on critical SAM tests ( does not provide complete picture, all tests are OK, but the site is banned, why?) LHCb is validating new site monitoring system which will define two statuses for the site, one takes into account only things which are under site responsibility, another one defines overall usability of the site from the LHCb perspective Should this strategy be applied for all experiments in order to disentangle site-related problems from the problems of other nature? 6Julia Andreeva, GDB

Job monitoring (where possible split by activity analysis, production, tests, etc…): - # of running jobs, average per hour - success rate (1,4,24 hours) as ratio of successful jobs to all accomplished jobs - CPU and wall clock consumption Data transfer (in and out) - average transfer rate over last hour - success rate as ratio of successfully transferred files to all transfer attempts over last 4 hours 7Julia Andreeva, GDB

8 Treemap visualization Per-site snapshot view Julia Andreeva, GDB

SiteView SSB-like all sites snapshot page Click on a given site to see 24 hours history 9Julia Andreeva, GDB

Click to be redirected to the underlying data source SiteView SSB-like per-site history page 10Julia Andreeva, GDB

CMS SSB Snapshot view Click to get history Navigation to the information sources (1) 11Julia Andreeva, GDB

CMS SSB history Click to See Site Readiness Navigation to the information sources (2) 12Julia Andreeva, GDB

CMS SSB Click to See Savannah tickets Navigation to the information sources (3) 13Julia Andreeva, GDB

In order to make the system useful for operations serious commitment from the sites and experiments is required Question to the sites: -Is information which is provided by SiteView is what you need? Is it comprehensive? Is it consistent with local site monitoring? Questions to the experiments: - Is it possible to define realistic status of the site usability based on set of metrics which are under site responsibility and are comprehensive to the site? In case when the cause of the problem is not under site control, how to communicate what is wrong in a clear way? These questions have to be addressed following the outcome of the Operations TEG. Set of current metrics and status definitions to be reviewed. When the new set of metrics and status definitions is agreed, validation of the SiteView data has to be performed both by sites and experiments. 14Julia Andreeva, GDB

The SiteView application is in place for few years It has common implementation with Site Status Board. Number of various information sources used by SiteView is gradually decreasing, though it is not realistic to hope that we can always enforce use of common solutions In order to make it useful for operations, commitment of the potential users (sites) and information owners (VOs) is required. Metrics and status definitions have to be reviewed. After changes exposed data needs validation In the process of defining new metrics the most complicated task is to disentangle site problems from the problems related to VOs, users, applications, external GRID services… This work will be followed up in the scope of Operations TEG GridMap UI SSB-like UI 15Julia Andreeva, GDB