D. van der Ster, CERN IT-ES J. Elmsheuser, LMU Munich

HammerCloud: An Automated Service for Stress and Functional Testing of Grid Sites
D. van der Ster, CERN IT-ES; J. Elmsheuser, LMU Munich; F. Legger, LMU Munich; A. Sciabà, CERN IT-ES; M. Úbeda García, CERN PH-LBC (formerly CERN IT-ES)
EGI User Forum 2011 (12 April 2011, Vilnius)
11/14/2018 HammerCloud: An Automated Service for Stress and Functional Testing of Grid Sites – Dan van der Ster

Overview
- Motivation
- Introduction to HammerCloud: a user-friendly yet powerful tool to stress test and/or continually validate grid sites
- Reliability and Performance
- HammerCloud in the LHC Experiments: CMS, LHCb and ATLAS deployments
- Future Plans

Motivation I
The ATLAS experiment at CERN surveyed its grid users about issues related to distributed analysis. One response was common among the 241 users who completed the survey:
“…I would like to mention that since I started using the GRID (in 2006), the tools became much more user-friendly... However, my colleagues and students do complain frequently because often about 10%-20% of the jobs do not succeed and they need to re-submit them several times and at certain point bookkeeping becomes a nightmare.”

Motivation II
- Physics analysis jobs are quite I/O intensive: a typical ATLAS analysis reads data at 6 megabytes per second per job slot
- When this work started, realistic LHC analysis loads had not been fully tested at the global scale
- ATLAS DA jobs run through the PanDA system, with up to 30k concurrent ATLAS DA jobs

Introduction
HammerCloud (HC) is a grid site testing system serving two use cases:
- Stress Testing: on-demand large-scale stress tests using real jobs to test one or many sites simultaneously. Used to help commission new sites, evaluate changes to site infrastructure, evaluate experiment software changes, and compare site performance
- Functional Testing: frequent “ping” jobs to all sites to perform end-to-end site validation (and fully test all required services)

Service Components
HammerCloud includes:
- A user-friendly web frontend to define tests and view results, developed using Django
- A job submission backend that uses Ganga to interface with the grid and to monitor and manage jobs
- The “HC Logic”, which contains the core algorithms of the HC tests, including building and delivering the target number of jobs per site

Testing Overview
HammerCloud offers both on-demand and automated testing:
- Experts define a test of type STRESS or FUNCTIONAL
- Stress tests are scheduled on demand as needed by central VO managers, cloud/regional managers, or site managers
- Functional tests are scheduled automatically
- Results are published on the HC website and can be pushed to other systems (e.g. the Site Status Board (SSB), Service Availability Monitoring (SAM), Nagios)
- For all tests, a detailed report summarizing job success rates and performance is produced

Test Workflow
An HC test is described by:
- The code to run (typically a real analysis from the user community)
- The dataset list or pattern appropriate for the code
- The list of sites to be tested, and the target number of jobs to run concurrently per site
- A start time and an end time
Test execution proceeds in four steps:
1. Generate: the test description is converted to a set of jobs (e.g. Ganga job objects, one for each site under test)
2. Submit: the job objects are submitted
3. Run: jobs are monitored, outputs are recorded to the HC database, and jobs are resubmitted to maintain the target number of running jobs per site
4. Exit: at the test end time, leftover jobs are killed
Throughout, the web frontend shows real-time test results
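The resubmission logic of the run step can be sketched in plain Python. This is a minimal sketch under stated assumptions: the class and method names are illustrative, not the actual HammerCloud code, and the submission call is a stand-in for a Ganga job submission.

```python
# Hypothetical sketch of HC's "run" step: keep the number of running
# jobs at a site close to the test's target, resubmitting as jobs
# finish, until the test end time is reached.
class SiteTestLoop:
    def __init__(self, site, target_running):
        self.site = site
        self.target_running = target_running
        self.running = []          # job ids currently running
        self.results = []          # (job_id, succeeded) tuples
        self._next_id = 0

    def _submit(self):
        """Stand-in for a Ganga job submission call."""
        self._next_id += 1
        self.running.append(self._next_id)

    def poll(self, finished_ids):
        """Record outcomes for finished jobs, then top back up to target."""
        for jid in finished_ids:
            if jid in self.running:
                self.running.remove(jid)
                self.results.append((jid, True))
        while len(self.running) < self.target_running:
            self._submit()

loop = SiteTestLoop("EXAMPLE-SITE", target_running=3)
loop.poll([])         # initial fill: submits 3 jobs
loop.poll([1, 2])     # two jobs finish; two more are submitted
print(len(loop.running))  # -> 3
```

The real service drives one such loop per site under test, recording each outcome to the HC database as it arrives.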

HammerCloud v4
HammerCloud version 4 has been in production since late 2010, and includes:
- A generic core with experiment plugins for the front-end (Django) and back-end (DB interactions and test running), which makes adding a new VO quite straightforward
- More powerful results presentation: plot arbitrary metric histograms, metric evolution over time, and site/metric rankings
- RSS feeds: subscribe to a site or cloud feed to be informed of test results

HC LHC Users
- HammerCloud is now used by three LHC experiments
- How they use HC differs from experiment to experiment
- Details in the next slides (apologies for the many screenshots!)


HammerCloud and CMS
- CMS has been using HC since mid-2010; continuous testing started in fall 2010
- GangaCMS was implemented to abstract CRAB job submission and monitoring
- During the HC test generate step, HC queries the CMS “DBS” discovery service to find input data
- While running, HC extracts CMS-specific job metrics from the Ganga jobs (sourced from the CRAB Full Job Report)
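The metric-extraction step can be sketched as a function over a parsed job report. The dictionary keys below are invented for illustration; the actual CRAB Full Job Report schema is not shown in the slides.

```python
# Hypothetical sketch of pulling CMS-specific metrics out of a parsed
# CRAB Full Job Report (keys are illustrative only).
def extract_cms_metrics(fjr: dict) -> dict:
    wallclock = fjr.get("wallclock", 0.0)
    cpu = fjr.get("cpu_time", 0.0)
    return {
        "wallclock_s": wallclock,
        "cpu_s": cpu,
        "events_read": fjr.get("events", 0),
        # CPU efficiency: guard against zero wallclock
        "cpu_efficiency": cpu / wallclock if wallclock else 0.0,
    }

m = extract_cms_metrics({"wallclock": 100.0, "cpu_time": 80.0, "events": 5000})
print(m["cpu_efficiency"])  # -> 0.8
```

Metrics extracted this way are what the HCv4 frontend histograms and ranks per site.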

HC-CMS Functional Testing
HC-CMS is currently running ~10k short analysis jobs per day to test the CMS grid sites

Example Test Results

CMS Job Robot
- Faulty sites can be quickly identified on the Robot summary page
- Sites with <80% efficiency are highlighted in red
- Details for other sites can be viewed by hovering the mouse
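The highlighting rule above is a simple threshold on job success rate. A minimal sketch, assuming efficiency = successful jobs / total jobs and treating a site with no completed jobs as faulty (that last choice is an assumption, not stated in the slides):

```python
def site_status(succeeded: int, total: int, threshold: float = 0.80) -> str:
    """Return 'red' for sites below the efficiency threshold, else 'ok'."""
    if total == 0:
        return "red"  # assumption: no completed jobs counts as faulty
    return "ok" if succeeded / total >= threshold else "red"

print(site_status(95, 100))  # -> ok
print(site_status(70, 100))  # -> red
```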

Robot History
Historical Robot results are summarized in a grid view; click a cell to view details


HammerCloud and LHCb
- HC-LHCb was deployed in test in fall 2010, and demonstrated its usefulness immediately by helping to commission Castor 2.1.9 at RAL
- The implementation of the HC plugin for LHCb was relatively simple (compared to CMS) because of the existing GangaLHCb plugin; Ganga is used extensively in the LHCb experiment
- The LHCb instance was recently upgraded to HCv4

Example test results
Results from an example test are shown at right. Metrics recorded:
- Wallclock
- NormCPUTime
- ScaledCPUTime
- MemoryUsed (kB)
- TotalCPUTime
- Wallclock / NormCPUTime
- Load Average

Integration with LHCbDIRAC
- The DIRAC ResourceStatusSystem (RSS) continually evaluates policies against the set of grid resources (sites, SEs, CEs) to detect problems (DIRAC is the LHCb workload management system)
- Resource statuses: Active, Bad, Probing, Banned
- When RSS bans a resource, LHCbDIRAC will use the HC API to schedule a test at the related site
- RSS will monitor the HC test results and reactivate the site once the resource is functional again
- This component is currently under development by LHCb
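The RSS/HC interplay described above can be sketched as a small state machine. The statuses come from the slide; the class and method names are illustrative assumptions, not the LHCbDIRAC API.

```python
# Hypothetical sketch of the ResourceStatusSystem reacting to
# HammerCloud results (statuses: Active, Bad, Probing, Banned).
class ResourceStatus:
    def __init__(self, name):
        self.name = name
        self.status = "Active"

    def policy_failed(self, schedule_hc_test):
        """A policy detected a problem: ban the resource, ask HC to test it,
        and move to Probing while awaiting the result."""
        self.status = "Banned"
        schedule_hc_test(self.name)   # would call the HC API
        self.status = "Probing"

    def hc_result(self, success: bool):
        """Reactivate only if the HC test shows the resource is functional."""
        self.status = "Active" if success else "Banned"

scheduled = []
se = ResourceStatus("EXAMPLE-SE")
se.policy_failed(scheduled.append)
print(se.status)   # -> Probing
se.hc_result(True)
print(se.status)   # -> Active
```

The key design point is that reactivation is driven by a real end-to-end HC test, not just by the original policy clearing.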


HammerCloud and ATLAS
ATLAS initiated HC and has made substantial use of it:
- More than 300,000 CPU-days of HC jobs have run on sites in EGEE, EGI, OSG, and NDGF via PanDA, gLite WMS, and ARC; e.g. STEP’09 was quite an intensive stress test over 11 days
- Now running many thousands of robot jobs per day, plus ongoing stress testing as needed by the sites
- Used to test new storage solutions: Xrootd/EOS at CERN; dCache & NFS 4.1
- Active development for new use cases: Tier 3 site testing, production queue testing, PanDA Pilot testing
STEP’09 Results:

ATLAS Functional Testing
- ATLAS has ~10 different functional test jobs running at all grid sites
- These are basic but realistic test jobs, e.g. testing the application software, data access, and remote database access
- ~5-10 jobs per site per hour per test

Efficiency Over Time
- Over the past 1.5 years the overall reliability of the ATLAS grid sites has noticeably improved
- The HC stress testing and continuous end-to-end testing aided this progress
Plots credit: S. Panitkin, BNL

Functional Test Errors
- Looking in detail at the functional test errors, ATLAS consistently observes a ~5% error rate across most sites
- 99% of errors are related to the storage (SE or LFC)
Plots credit: F. Legger, LMU Munich

Plot from Dashboard SSB via F. Legger, LMU Munich

ATLAS Automatic Site-Exclusion

Conclusions
- HammerCloud is a maturing testing system which has been adopted by three LHC experiments
- Feedback is positive:
  - Frequent full-chain testing is critical to validate the infrastructure (reliability ++)
  - Site admins feel empowered to test their facilities without experiment-specific knowledge (performance ++)
- We are excited about future challenges:
  - Core improvements, including improved metrics plotting and outlier detection
  - Further Robot testing with CMS
  - LHCb will start using the HC API to integrate testing with LHCbDIRAC
  - ATLAS is actively developing production queue testing
- HC could be adopted by other VOs having Ganga-enabled applications