D. van der Ster, CERN IT-ES J. Elmsheuser, LMU Munich


1 HammerCloud: An Automated Service for Stress and Functional Testing of Grid Sites
D. van der Ster, CERN IT-ES; J. Elmsheuser, LMU Munich; F. Legger, LMU Munich; A. Sciabà, CERN IT-ES; M. Úbeda García, CERN PH-LBC (formerly CERN IT-ES)
EGI User Forum 2011 (12 April 2011, Vilnius)
11/14/2018 HammerCloud: An Automated Service for Stress and Functional Testing of Grid Sites – Dan van der Ster

2 Overview
Motivation: Reliability and Performance
Introduction to HammerCloud: User friendly yet powerful tool to stress test and/or continually validate grid sites
HammerCloud in the LHC Experiments: CMS, LHCb and ATLAS deployments
Future Plans

3 Motivation I
The ATLAS experiment at CERN surveyed its Grid users about issues related to distributed analysis. There was one common response from many of the 241 users who completed the survey:
“…I would like to mention that since I started using the GRID (in 2006), the tools became much more user-friendly... However, my colleagues and students do complain frequently because often about 10%-20% of the jobs do not succeed and they need to re-submit them several times and at certain point bookkeeping becomes a nightmare.”

4 Motivation II
Physics analysis jobs are quite I/O intensive. Example: a typical ATLAS analysis reads data at 6 megabytes per second per job slot.
When this work started, realistic LHC analysis loads had not been fully tested at the global scale.
ATLAS DA jobs through the PanDA system: up to 30k concurrent ATLAS DA jobs

5 Introduction
HammerCloud (HC) is a grid site testing system serving two use-cases:
Stress Testing: on-demand large-scale stress tests using real jobs to test one or many sites simultaneously. These help commission new sites, evaluate changes to site infrastructure, evaluate experiment software changes, and compare site performances.
Functional Testing: frequent “ping” jobs to all sites to perform end-to-end site validation (and fully test all required services).

6 Service Components
HammerCloud includes:
A user-friendly web frontend to define tests and view results, developed using Django.
A job submission backend that uses Ganga to interface with the grid and monitor/manage the jobs.
“HC Logic”, which contains the core algorithms of the HC tests, including building and delivering the target number of jobs per site.

7 Testing Overview
HammerCloud offers both On-Demand and Automated Testing.
Experts define a test of type STRESS or FUNCTIONAL.
Stress tests are scheduled on demand as needed by central VO managers, cloud/regional managers, and site managers.
Functional tests are scheduled automatically.
Results are published on the HC website and can be pushed to other systems (e.g. Site Status Board (SSB), Service Availability Monitoring (SAM), Nagios).
For all tests, a detailed report summarizing the job success rates and performances is produced.

8 Test Workflow
An HC Test is described by:
The code to run (typically a real analysis from the user community)
The dataset list or pattern appropriate for the code
The list of sites to be tested, and the target number of jobs to run concurrently per site
A start time and an end time
Test execution proceeds in 4 steps:
Generate: the test description is converted to a set of jobs (e.g. Ganga job objects, one for each site under test)
Submit: the job objects are submitted
Run: jobs are monitored, outputs are recorded to the HC database, and jobs are resubmitted to achieve the target number of running jobs per site
Exit: at the test end time, leftover jobs are killed
At the same time, the web frontend shows real-time test results.
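The four steps of the test workflow can be illustrated with a small simulation. This is only a sketch under stated assumptions, not HammerCloud or Ganga code: the names `TestDescription`, `Job`, `generate`, `submit` and `run_test` are hypothetical, time is counted in abstract ticks, and a fixed number of jobs per site is assumed to finish on every tick.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TestDescription:
    """Hypothetical container for the items that describe an HC test."""
    code: str                     # the analysis code to run
    datasets: List[str]           # dataset list or pattern
    target_jobs: Dict[str, int]   # site name -> target number of concurrent jobs
    start_time: int               # abstract ticks, not real timestamps
    end_time: int


@dataclass
class Job:
    site: str
    status: str = "new"           # new -> running -> completed | killed


def generate(test: TestDescription) -> Dict[str, List[Job]]:
    """Generate: turn the description into one (initially empty) job pool per site."""
    return {site: [] for site in test.target_jobs}


def submit(pool: Dict[str, List[Job]], site: str, n: int) -> None:
    """Submit: add n jobs for a site; in this toy model they start running at once."""
    pool[site].extend(Job(site, "running") for _ in range(n))


def run_test(test: TestDescription, finish_per_tick: int = 1):
    pool = generate(test)
    results: List[Tuple[int, str, str]] = []  # stands in for the HC database

    for site, target in test.target_jobs.items():
        submit(pool, site, target)

    # Run: monitor jobs, record outputs, resubmit up to the per-site target.
    for tick in range(test.start_time, test.end_time):
        for site, target in test.target_jobs.items():
            running = [j for j in pool[site] if j.status == "running"]
            for job in running[:finish_per_tick]:   # assumed completion rate
                job.status = "completed"
                results.append((tick, site, job.status))
            still_running = sum(j.status == "running" for j in pool[site])
            submit(pool, site, target - still_running)

    # Exit: at the end time, kill whatever is still running.
    for jobs in pool.values():
        for job in jobs:
            if job.status == "running":
                job.status = "killed"
    return pool, results
```

Calling `run_test` with a description that targets several sites returns the final job pool and the recorded results; in a real deployment the resubmission step is what keeps the load on each site constant for the duration of the test.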

9 HammerCloud v4
HammerCloud version 4 has been in production since late 2010, and includes:
A generic core: experiment plugins for the front-end (Django) and back-end (database interactions and test running). This makes adding a new VO quite straightforward.
More powerful results presentation: plot arbitrary metric histograms, metric evolution over time, and site/metric rankings.
RSS feeds: subscribe to a site or cloud feed to be informed of test results.

10 HC LHC Users
Now used by three LHC experiments
How they use HC differs from experiment to experiment
Details in next slides (apologies for many screenshots!)


12 HammerCloud and CMS
CMS has been using HC since mid-2010; continuous testing started in fall 2010.
GangaCMS was implemented to abstract the CRAB job submission and monitoring.
During the HC test generate step, HC queries the CMS “DBS” discovery service to find input data.
While running, HC extracts CMS-specific job metrics from the Ganga jobs (sourced from the CRAB Full Job Report).

13 HC-CMS Functional Testing
HC-CMS is currently running ~10k short analysis jobs per day to test the CMS grid sites.

14 Example Test Results

15 CMS Job Robot
Faulty sites can be quickly identified in the Robot summary page.
Sites with <80% efficiency are highlighted in red.
Other sites can be viewed by hovering the mouse.
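The 80% rule on the Robot summary page can be sketched in a few lines. The slide does not define efficiency, so this example assumes it is the fraction of finished jobs that succeeded; the function names and the input layout are hypothetical and are not taken from HammerCloud.

```python
from typing import Dict, List, Tuple

EFFICIENCY_THRESHOLD = 0.80  # from the slide: sites below 80% are highlighted


def site_efficiency(succeeded: int, failed: int) -> float:
    """Assumed definition: succeeded jobs divided by all finished jobs."""
    total = succeeded + failed
    if total == 0:
        raise ValueError("no finished jobs for this site")
    return succeeded / total


def flag_sites(counts: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the sites whose efficiency is strictly below the threshold.

    counts maps a site name to (succeeded, failed); sites without any
    finished jobs are skipped rather than flagged.
    """
    return sorted(
        site
        for site, (ok, bad) in counts.items()
        if ok + bad > 0 and site_efficiency(ok, bad) < EFFICIENCY_THRESHOLD
    )
```

A site sitting exactly at 80% is not flagged here, matching the strict "<80%" wording on the slide.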

16 Robot History
Historical Robot results are summarized in a grid view. Click to view details.


18 HammerCloud and LHCb
HC-LHCb was deployed in test in fall 2010, and demonstrated its usefulness immediately by helping to commission Castor at RAL.
Implementing the HC plugin for LHCb was simple relative to CMS because of the existing GangaLHCb plugin; Ganga is used extensively in the LHCb experiment.
The LHCb instance was upgraded to HCv4 recently.

19 Example test results
Results from an example test are shown at right.
Metrics recorded: Wallclock, NormCPUTime, ScaledCPUTime, MemoryUsed(kb), TotalCPUTime, Wallclock / NormCPUTime, Load Average

20 Integration with LHCbDIRAC
The DIRAC ResourceStatusSystem (RSS) continually evaluates policies against the set of grid resources (site, SE, CE) to detect problems. (DIRAC is the LHCb workload management system.)
Resource statuses: Active, Bad, Probing, Ban
When RSS bans a resource, LHCbDIRAC will use the HC API to schedule a test at the related site. RSS will monitor the HC test results and activate the site again once the resource is functional again.
This component is under development now by LHCb.
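The ban, test and reactivate loop described for RSS and HammerCloud might look roughly like the sketch below. None of this is DIRAC or HammerCloud code: `ResourceMonitor`, `ban` and `on_test_result` are hypothetical names, the HC API is represented by a plain callable that returns a test id, and scheduling a fresh test after a failed one is an assumption the slide does not state.

```python
from enum import Enum
from typing import Callable, Dict


class Status(Enum):
    # The four resource statuses listed on the slide.
    ACTIVE = "Active"
    BAD = "Bad"
    PROBING = "Probing"
    BAN = "Ban"


class ResourceMonitor:
    """Toy model of the RSS side of the integration (hypothetical)."""

    def __init__(self, schedule_test: Callable[[str], int]):
        # schedule_test stands in for the HC API call; it returns a test id.
        self._schedule_test = schedule_test
        self.status: Dict[str, Status] = {}
        self.pending_test: Dict[str, int] = {}

    def ban(self, site: str) -> None:
        """Ban a site and ask HC to schedule a test there."""
        self.status[site] = Status.BAN
        self.pending_test[site] = self._schedule_test(site)

    def on_test_result(self, site: str, test_id: int, passed: bool) -> None:
        """React to an HC result for the test that is currently pending."""
        if self.pending_test.get(site) != test_id:
            return  # result of an older test, ignore it
        if passed:
            self.status[site] = Status.ACTIVE
            del self.pending_test[site]
        else:
            # Assumption: keep the site banned and try again with a new test.
            self.pending_test[site] = self._schedule_test(site)
```

Passing the scheduling function in as a parameter keeps the sketch independent of any real API and makes the loop easy to exercise with a fake scheduler.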


22 HammerCloud and ATLAS
ATLAS initiated HC, and has made substantial use of it:
More than CPU-days of HC jobs run on sites in EGEE, EGI, OSG, and NDGF via PanDA, gLite WMS, and ARC. E.g. STEP’09 was quite an intensive stress test over 11 days.
Now running many thousands of robot jobs per day, plus ongoing stress testing as needed by the sites.
Used to test new storage solutions: Xrootd/EOS at CERN; dCache & NFS 4.1.
Active development for new use cases: Tier 3 site testing, production queue testing, PanDA Pilot testing.
STEP’09 Results:

23 ATLAS Functional Testing
ATLAS has ~10 different functional test jobs running at all grid sites.
Basic but realistic test jobs, e.g. testing the application software, data access, and remote database access.
~5-10 jobs per site per hour per test.
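As a rough worked example, the quoted rates imply a daily functional-test load per site. The sketch below simply multiplies the approximate figures from the slide (about 10 tests, 5 to 10 jobs per site per hour per test) and assumes the tests run around the clock; the function name is hypothetical.

```python
def daily_jobs_per_site(n_tests: int, jobs_per_hour_per_test: int, hours: int = 24) -> int:
    """Jobs per site per day, assuming a constant rate over the whole day."""
    return n_tests * jobs_per_hour_per_test * hours


low = daily_jobs_per_site(10, 5)     # 10 * 5 * 24  = 1200
high = daily_jobs_per_site(10, 10)   # 10 * 10 * 24 = 2400
print(f"roughly {low} to {high} functional test jobs per site per day")
```

Under these assumptions each site would see on the order of one to two thousand short test jobs per day.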

24 Efficiency Over Time
Over the past 1.5 years the overall reliability of the ATLAS grid sites has noticeably improved.
The HC stress testing and continuous end-to-end testing aided this progress.
Plots credit: S. Panitkin, BNL

25 Functional Test Errors
Looking in detail at the functional test errors, ATLAS consistently observes a ~5% error rate across most sites.
99% of errors are related to the storage (SE or LFC).
Plots credit: F. Legger, LMU Munich

26 Plot from Dashboard SSB via F. Legger, LMU Munich

27 ATLAS Automatic Site-Exclusion

28 Conclusions
HammerCloud is a maturing testing system which has been adopted by three LHC experiments.
Feedback is positive:
Frequent full-chain testing is critical to validate the infrastructure (reliability ++)
Site admins feel empowered to test their facilities without experiment-specific knowledge (performance ++)
We are excited for future challenges:
Core improvements including improved metrics plotting and outlier-detection
Further Robot Testing with CMS
LHCb will start using the HC API to integrate testing with LHCbDIRAC
ATLAS actively developing Production Queue testing
Could be adopted by other VOs having Ganga-enabled applications

