Hammercloud and Nagios
Andrea Sciabà, Dan van der Ster, Nicolò Magini
Experiment Support, CERN IT Department, CH-1211 Geneva 23, Switzerland (www.cern.ch/it)



Hammercloud: quick summary (1/2)
- Hammercloud (HC) is a service to define and run Grid test jobs that simulate analysis workflows
- Similar to the Job Robot (JR), which inspired it, but much more powerful
  - The user can choose the dataset, CMSSW version, job splitting parameters, sites or regions, and throttling parameters
- Two basic modes of operation:
  - Functional tests: the user defines a template and HC instantiates tests continuously, exactly as the JR does
  - Stress tests: the user instantiates tests by hand; ideal for site stress testing, and could be applied to CMSSW and CRAB validation
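The user-selectable parameters listed above can be pictured as a small template object. The following is a minimal sketch under stated assumptions, not HammerCloud's actual data model: the class `TestTemplate`, its field names, the helper `jobs_to_submit`, and the placeholder dataset, release and site names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TestTemplate:
    """Hypothetical container for the parameters a user can choose."""
    dataset: str                      # dataset the test jobs read
    cmssw_version: str                # CMSSW release used by the jobs
    events_per_job: int               # job splitting parameter
    sites: List[str]                  # target sites (or a region expanded to sites)
    max_running_jobs_per_site: int    # throttling parameter
    mode: str = "functional"          # "functional" or "stress"

    def __post_init__(self) -> None:
        if self.mode not in ("functional", "stress"):
            raise ValueError(f"unknown mode: {self.mode}")
        if self.events_per_job <= 0 or self.max_running_jobs_per_site < 0:
            raise ValueError("splitting and throttling values must be positive")


def jobs_to_submit(template: TestTemplate,
                   running_by_site: Dict[str, int]) -> Dict[str, int]:
    """Return how many new jobs each site may receive under the throttle."""
    return {
        site: max(0, template.max_running_jobs_per_site
                  - running_by_site.get(site, 0))
        for site in template.sites
    }
```

In this sketch a functional test would re-evaluate `jobs_to_submit` periodically to keep each site topped up, while a stress test would be created once by hand with a higher throttle.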

Hammercloud: quick summary (2/2)
- Status of tests visible in real time via plots and tables
  - Any parameter in the FJR can be plotted as a histogram over the jobs in the test
- Statistics by site available
- Administrative interface to add or change templates, metrics to plot, CRAB parameters to use, etc.
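As an illustration of histogramming one job-report parameter over the jobs in a test, here is a minimal sketch. It assumes each job's report has already been parsed into a dictionary; the metric name `cpu_time` and both function names are hypothetical and do not reflect the real FJR schema or HC code.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping


def metric_values(jobs: Iterable[Mapping[str, float]], metric: str) -> List[float]:
    """Collect one metric from every job that reports it."""
    return [job[metric] for job in jobs if metric in job]


def histogram(values: Iterable[float], bin_width: float) -> Dict[float, int]:
    """Count values into fixed-width bins, keyed by the lower bin edge."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts: Counter = Counter()
    for value in values:
        lower_edge = (value // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))
```

Jobs that do not report the chosen metric are skipped rather than counted as zero, so a failed job cannot distort the distribution.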

Status of HC in CMS
- HC server running on vocms38
- HC web interface running on voatlas49 (common to ATLAS, CMS and LHCb)
  - New users must request a login
- Functional tests have been running for several months at all sites
  - Note: right now a Python error is preventing job submission; to be fixed
- New CMSSW releases need to be installed by hand
- Only one CRAB version selectable
  - Not using the CRAB server

Planned improvements
- Allow selecting the CRAB version
- Allow using the CRAB server
- Allow full access to standard output files via the web server
- Allow selecting the Dashboard activity name as a parameter
  - Possible with the latest CRAB version
  - Essential to replace the JobRobot, as we need to separate "JR/HC" jobs from other HC tests
- Use a non-ATLAS host for the web server
  - Is this a requirement?
- Develop SLS sensors

Possible improvements
- Enable the (already supported) option to report which sites fail freely defined criteria
  - For example, sites with a low success rate in the last N minutes
  - Easy to publish in the Site Status Board (done for ATLAS)
  - May be used to implement automatic site exclusion mechanisms
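The example criterion above, a low success rate in the last N minutes, could be evaluated along these lines. This is a sketch under stated assumptions rather than the HC implementation: the `JobResult` record, the `failing_sites` function and the threshold parameter are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple


class JobResult(NamedTuple):
    """Hypothetical record of one finished test job."""
    site: str
    finished_at: datetime
    succeeded: bool


def failing_sites(results: Iterable[JobResult], now: datetime,
                  window_minutes: int, min_success_rate: float) -> List[str]:
    """Return sites whose success rate in the last window is below threshold."""
    cutoff = now - timedelta(minutes=window_minutes)
    totals: Dict[str, int] = {}
    successes: Dict[str, int] = {}
    for result in results:
        # Ignore jobs outside the [cutoff, now] window.
        if result.finished_at < cutoff or result.finished_at > now:
            continue
        totals[result.site] = totals.get(result.site, 0) + 1
        successes[result.site] = successes.get(result.site, 0) + int(result.succeeded)
    return sorted(site for site, total in totals.items()
                  if successes[site] / total < min_success_rate)
```

A site with no finished jobs in the window is not reported by this sketch; whether such a site should count as failing is a policy choice that the slides leave open.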

Nagios: quick summary
- A new framework, based on the Nagios monitoring system and the WLCG MSG service (a messaging system based on ActiveMQ)
- It replaces the old SAM framework
- Experiments can use it to run their own functional tests
- Developed and supported by IT-GT for WLCG

Status of Nagios in CMS
- CMS tests were ported from SAM to Nagios several months ago
  - Nagios server very stable; preprod server available
- CMS tests and their configuration must be packaged as an RPM
  - Currently the RPM is not generated automatically when a test maintainer updates a test, so there is a risk of delays
- Tests are run on all CEs and SRMs at all CMS sites
- Test results and site availability are published by the Dashboard and taken from the old SAM database
- CMS Site Readiness will use Nagios availability as of today

Planned improvements
- Integrate a CMS glexec test
- Run tests only on specific services using a CMS topology feed (produced by the Dashboard)
- Enable automatic site exclusion from the BDII for CEs failing critical tests
- Have the Dashboard take test results from ACE (the replacement for the old SAM database)
- Proper procedure to generate new RPMs
- Proper alarming (SLS + Lemon)
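Two of the items above, selecting services from a topology feed and flagging CEs that fail critical tests, can be sketched as follows. The feed layout (site to flavour to hosts), the test names and both function names are assumptions made for illustration; only the OK/WARNING/CRITICAL/UNKNOWN status names follow the usual Nagios convention, and the real feed format is not described in the slides.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Status names as conventionally reported by Nagios checks.
OK, WARNING, CRITICAL, UNKNOWN = "OK", "WARNING", "CRITICAL", "UNKNOWN"


def services_to_test(topology: Dict[str, Dict[str, List[str]]],
                     flavours: Tuple[str, ...] = ("CE", "SRM")
                     ) -> List[Tuple[str, str, str]]:
    """Flatten a site -> flavour -> hosts mapping into (site, flavour, host)."""
    return [
        (site, flavour, host)
        for site, by_flavour in sorted(topology.items())
        for flavour in flavours
        for host in by_flavour.get(flavour, [])
    ]


def ces_to_exclude(results: Iterable[Tuple[str, str, str]],
                   critical_tests: Set[str]) -> List[str]:
    """Return CE hosts with at least one critical test in CRITICAL state.

    Each result is a (host, test_name, status) tuple.
    """
    return sorted({
        host
        for host, test_name, status in results
        if test_name in critical_tests and status == CRITICAL
    })
```

Failures of tests outside the critical set are deliberately ignored here, so a non-critical test in CRITICAL state does not trigger exclusion.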

Open issues
- IT-GT is maintaining the CMS Nagios production and preproduction servers for the time being, but not forever
- Must determine whether to run them in CMS or in IT

Conclusions
- Hammercloud
  - Still some integration work needed
  - Should define procedures for Facilities Operations ASAP
  - Promote its usage to get a better fit with CMS needs and a quicker development cycle; aim at decommissioning the Job Robot ASAP
- Nagios
  - Basically already production quality
  - Integrate new important tests
  - Converge on a production infrastructure

Acknowledgements
Thanks to the IT-GT group for their support on the usage of SAM/Nagios and to Mario Úbeda for the HC-CMS integration