Critical Services and Monitoring - CMS
Andrea Sciabà, CERN
WLCG Service Reliability Workshop, 26–30 November 2007


Outline
- Monitoring for the Monte Carlo production
- Monitoring for the user analysis
- Monitoring for data transfer
- Monitoring of central Grid services
- Conclusions

Monte Carlo production
What information is needed?
- Status of the computing resources
  - Are they working?
  - How busy are they?
- Status of the local storage resources
  - Are they working?
  - How full are they?
- Status of the central Grid services
  - LCG RB / gLite WMS
    - Is it working?
    - How busy is it?
    - How long does it take for a job to be assigned to a CE?
  - VOMS: is it working?
  - BDII: does it contain reliable information?

Status of the CE (I)
Is the CE working?
- The SAM tests can be used to answer the question
  - Basic sanity checks
  - More specific tests
    - E.g. is it possible to stage out a file from a WN to the local SRM?
    - Is FroNTier working at the site?
    - Are the required versions of CMSSW installed?
- The time granularity of SAM tests is too coarse
  - It is not reasonable to run them more often than once per hour (currently every two hours)
  - Is there a way to have more fine-grained information?
    - Yes, by running the tests from the site itself
    - How to make the information available?
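The site-local, fine-grained testing suggested above could be sketched as a small runner that executes the same sanity checks every few minutes from inside the site and keeps the latest results for publication. This is a minimal sketch; the check functions (stage-out, FroNTier, CMSSW versions) are placeholders, not the real SAM probes.

```python
import time

def run_site_tests(checks):
    """Run one round of site-local sanity checks.

    checks: {name: zero-argument callable returning True/False}.
    Returns {name: (passed, timestamp)} for the latest run, ready to be
    published to whatever channel the site chooses.
    """
    results = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a crashing test counts as a failure
        results[name] = (ok, time.time())
    return results
```

A cron-style loop calling this every few minutes would give the sub-hour granularity that central SAM submission cannot provide.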

Status of the CE (II)
How busy is the CE?
- This information is needed to make the right choice about how to distribute jobs
- Source #1: the information system
  - It needs to contain reliable information
  - Sites must be able to verify that they are publishing the correct information
- Source #2: fabric monitoring
  - Could be used to cross-check the information in the IS
  - Currently impossible to make it available to the experiment at all sites in a homogeneous way
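As a rough sketch of how an application could consume source #1, the fragment below parses Glue 1.x CE state attributes from already-fetched BDII output (e.g. `ldapsearch -x -H ldap://<bdii-host>:2170 -b o=grid`) and derives a simple load figure. The attribute names are from the Glue schema; the scoring heuristic is an invented illustration.

```python
def parse_ce_states(ldif_text):
    """Extract running/waiting/free job counts per CE from LDIF output."""
    ces, current = {}, None
    for line in ldif_text.splitlines():
        line = line.strip()
        if line.startswith("GlueCEUniqueID:"):
            current = line.split(":", 1)[1].strip()
            ces[current] = {"running": 0, "waiting": 0, "free": 0}
        elif current and line.startswith("GlueCEStateRunningJobs:"):
            ces[current]["running"] = int(line.split(":")[1])
        elif current and line.startswith("GlueCEStateWaitingJobs:"):
            ces[current]["waiting"] = int(line.split(":")[1])
        elif current and line.startswith("GlueCEStateFreeJobSlots:"):
            ces[current]["free"] = int(line.split(":")[1])
    return ces

def load_score(state):
    """Lower is better: waiting jobs per usable slot (invented heuristic)."""
    capacity = state["running"] + state["free"] + 1
    return state["waiting"] / capacity
```

Cross-checking these numbers against fabric monitoring, where available, is exactly the consistency check the slide asks for.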

Grid monitoring as used by the MC production system
Now
- Automatic exclusion of CEs from the BDII by FCR
  - "~ OK" means "can run a hello world job"
- The SAM test results are periodically checked by people to maintain a list of good/bad CEs
- The list of RB/WMS instances to be used is similarly maintained by hand, based on reports about malfunctions or downtimes
Ideally
- Automatic ranking of CEs based on SAM tests and accurate resource usage reporting
  - E.g. to submit jobs to a CE as a function of the CE load
  - To avoid black holes
- Automatic selection of good RB/WMS instances
  - Possibly also based on SAM
- Calculation of the used/free space on the local storage
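The "ideally" part could look roughly like the sketch below: exclude CEs failing critical SAM tests, then rank the survivors by published load, so that a CE that merely passes a "hello world" job but has a huge queue does not attract all the work (black-hole avoidance). The data formats and the ranking criterion are illustrative assumptions, not the real FCR logic.

```python
def rank_ces(sam_results, ce_load):
    """Rank CEs for submission.

    sam_results: {ce: [bool per critical SAM test]} - all must pass.
    ce_load: {ce: load figure, lower = less loaded}; a CE with no
    published load is ranked last rather than excluded.
    """
    good = [ce for ce, tests in sam_results.items() if tests and all(tests)]
    # Prefer lightly loaded CEs among those that pass all critical tests.
    return sorted(good, key=lambda ce: ce_load.get(ce, float("inf")))
```

The same pattern (critical-test filter plus a load-based ordering) would apply to the automatic selection of RB/WMS instances.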

Analysis
- Users want a clear and simple picture of which CEs are working and which are not
- They do not need to know the status of the WMS, but CRAB does
- It is useful for them to have an estimate of the time it will take for their jobs to start
- These requirements can be satisfied at the Dashboard level

FTS monitoring
The main monitoring for data movement is the PhEDEx monitoring
- It already collects quite a lot of logging information from FTS
PhEDEx is self-regulating in case of transfer failures
However, it would be useful to have this information:
- Channel configuration parameters
- Channel status (number of active and pending transfers, etc.)
- Load by VO
- "Estimated time of start" for a new job, by VO
- Current transfer rate for ongoing transfers
- A callback from the FTS API, to know as soon as possible that there was a failure
The FTS monitoring should also have:
- A unique entry point
  - The same information for all servers
- Easy remote access to transfer logs
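The "estimated time of start" requested above could, under a simple queueing assumption, be derived from the channel status alone: if a channel runs a fixed number of concurrent transfers, the pending queue drains in "waves". The parameter names below are assumptions for illustration, not fields of the real FTS API.

```python
def estimated_start(pending, concurrent_slots, mean_transfer_seconds):
    """Rough queueing delay for a newly submitted transfer.

    pending: transfers already queued ahead on the channel.
    concurrent_slots: how many transfers the channel runs at once.
    mean_transfer_seconds: measured average duration of one transfer.
    """
    if concurrent_slots <= 0:
        return float("inf")  # channel drained by no slots: never starts
    # Each "wave" of concurrent_slots transfers takes ~mean_transfer_seconds.
    waves_ahead = pending / concurrent_slots
    return waves_ahead * mean_transfer_seconds
```

Even such a crude estimate, published per channel and per VO, would let PhEDEx and users anticipate delays instead of discovering them.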

SRM monitoring
Some information is also available via dedicated SAM tests
- LFN → PFN conversion following the CMS rules
  - As published in the Trivial File Catalogue
- Copying a file back and forth between a UI and a remote SRM
SAM could be used to store information from the higher-level PhEDEx monitoring
Information that would be nice to have from SRM:
- Clearer error messages
- A reliable way to understand whether a transfer is really ongoing
- A better report on the transfer
  - X seconds to prepare, Y to move the file, Z to close out
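The LFN → PFN conversion that the SAM test exercises can be sketched with regex rewrite rules in the spirit of the Trivial File Catalogue: the first matching rule maps the logical name onto a site-specific physical name. The rule below is a made-up example, not any site's real storage configuration.

```python
import re

# (lfn pattern, pfn template) pairs - first match wins. Hypothetical site.
TFC_RULES = [
    (r"^/store/(.*)",
     r"srm://se.example.org:8443/srm/managerv2?SFN=/dpm/example.org/cms/store/\1"),
]

def lfn_to_pfn(lfn):
    """Map a logical file name to a physical file name via the TFC rules."""
    for pattern, template in TFC_RULES:
        m = re.match(pattern, lfn)
        if m:
            return m.expand(template)
    raise ValueError("no TFC rule matches %s" % lfn)
```

A SAM test needs only to apply the site's published rules and verify that the resulting PFN is actually reachable via SRM.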

WMS
Ideally, the middleware should be able to find out by itself which services to contact
- In the UI there is only a simple random choice from a list, with an automatic retry if WMProxy is not able to accept the submission request
- A better load balancing would be desirable
  - For example, to use the "least loaded" WMS
  - Currently a WMS refuses new jobs if its load is too high
If the middleware does not provide this functionality, the application must implement it
- It needs the right monitoring information:
  - WMS daemon status
  - Load
  - Number of jobs still in the task queue
  - Free disk space on the WMS
  - Job latency
    - Submitted → matched to a CE → submitted to the batch system → starts execution
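If the application has to implement the load balancing itself, the selection could be as simple as the sketch below: keep only the instances whose daemons are up and which have disk space left, then pick the one with the shortest task queue. The metric names are assumptions about what a WMS monitoring feed could publish, not an existing interface.

```python
def choose_wms(wms_status):
    """Pick the least-loaded usable WMS endpoint.

    wms_status: {endpoint: {"daemons_ok": bool, "queue": int,
                            "free_disk_mb": int}}
    """
    usable = {ep: s for ep, s in wms_status.items()
              if s["daemons_ok"] and s["free_disk_mb"] > 500}
    if not usable:
        raise RuntimeError("no usable WMS instance")
    # Shortest task queue is a proxy for "least loaded".
    return min(usable, key=lambda ep: usable[ep]["queue"])
```

Replacing the UI's random choice with such a selection would also sidestep the case where an overloaded WMS simply refuses the job.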

Example of WMS monitoring

VOMS, MyProxy, BDII
- No special monitoring needs
- Either they work, or they do not
- Problems with them have an immediate effect on all activities

Conclusions
Most of the monitoring information which is needed is already available in one way or another
What is needed is for it to be:
- Accurate
- Up to date
- Easy to retrieve by applications
  - Programmatic interfaces
  - A standard format (XML), uniform across the different sources
  - Well documented
- Fast to retrieve
  - The least possible use of authentication
  - Possibly using caching servers, so as not to flood the data sources with requests
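A client honouring these requirements might look like the sketch below: a programmatic interface returning XML, read through a small time-based cache so that repeated queries do not flood the data source. The XML layout and the `fetch` callable are invented for illustration.

```python
import time
import xml.etree.ElementTree as ET

class CachedMonitor:
    """Serve parsed monitoring data, re-fetching at most once per TTL."""

    def __init__(self, fetch, ttl=300):
        self.fetch = fetch      # callable returning an XML string
        self.ttl = ttl          # seconds to serve cached data
        self._cached = None
        self._stamp = 0.0

    def services(self):
        now = time.time()
        if self._cached is None or now - self._stamp > self.ttl:
            self._cached = self.fetch()   # hit the data source
            self._stamp = now
        root = ET.fromstring(self._cached)
        # Hypothetical layout: <monitoring><service name=... status=.../>...
        return {s.get("name"): s.get("status")
                for s in root.findall("service")}
```

The same caching layer could sit server-side, in front of SAM, the BDII, or the Dashboard, rather than in every client.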