Experiment Support
Introduction to HammerCloud for the LHCb Experiment
Dan van der Ster
CERN IT Experiment Support
3 June 2010

Outline
Introduction to HammerCloud
  – Motivation, History, Use-Cases
How HammerCloud works
  – Design and Implementation Details
Interface Tour for Users and Admins
Possibilities for an LHCb Plugin
HammerCloud Introduction for LHCb – 2

Introduction to HammerCloud
HammerCloud (HC) is a Distributed Analysis (DA) testing system serving two use-cases:
  – Robot-like Functional Testing: frequent “ping” jobs to all sites to perform basic site validation
  – DA Stress Testing: on-demand, large-scale stress tests using real analysis jobs to test one or many sites simultaneously, in order to:
      – Help commission new sites
      – Evaluate changes to site infrastructure
      – Evaluate software (SW) changes
      – Compare site performance…

HammerCloud and Job Robots
HammerCloud is part of an evolution of job robots:
  – The CMS Job Robot inspired the ATLAS GangaRobot (functional testing)
  – In ~Sept 2008, a form of the ATLAS GangaRobot was used to manually stress test the Italian ATLAS Tier2s:
      – 5 users manually submitting hundreds of instrumented jobs simultaneously (SIMD)
      – Manual results collection and summarization
Early results were shown to be very useful:
  – One early test showed a bimodal performance plot that was later traced to a faulty network switch, which degraded the performance of some worker nodes (WNs).
The need for an automated DA stress testing system was clear.
  – HammerCloud was born in November 2008 to deliver on-demand stress tests to ATLAS sites:
      – Since then HC has run >1300 “Tests” using more than 4 million jobs.
      – ATLAS has invested >200k CPU-days in HC tests
  – CMS has also agreed to use HC: a prototype was delivered in April, and scale tests are about to begin.

HC and ATLAS during STEP’09
[Figure with a “STEP’09” label]

HammerCloud Use-Cases
HammerCloud provides on-demand and automated testing
HC Operators define test templates: FUNCTIONAL and STRESS
Functional tests are automatically scheduled
  – Results are published on the HC website and can be pushed to other systems (e.g. SAM)
Stress tests are generally scheduled on demand, as needed by:
  – Central VO managers
  – Cloud/Regional managers
  – Site managers
For all tests, a detailed report summarizing the job success rates and performance is produced.

HammerCloud Components
The HC UI is implemented as a Django web app:
  – View test results
  – View cloud/site evolution
  – DB Admin
State is maintained in a MySQL DB
HC logic (job submission, monitoring, resubmission) is implemented on top of the Ganga Grid Programming Interface (GPI)

HammerCloud Logic
An HC Test is described by:
  – The analysis code to run (typically a real analysis from the user community)
  – The dataset pattern (which can be resolved to a set of datasets appropriate for the analysis code)
  – The list of sites to be tested, and the target number of jobs to run concurrently per site
  – A start time and an end time
Test execution proceeds in 4 steps:
  – Generate: the test description is converted to a set of submittable jobs (e.g. Ganga job objects, one for each site under test)
  – Submit: the job objects are submitted
  – Run: jobs are monitored, outputs are recorded to the HC DB, and jobs are resubmitted to maintain the target number of running jobs per site
  – Exit: at the test end time, leftover jobs are killed
Concurrently, the HC web interface shows real-time test results
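
The four steps above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not HC's actual code: the names HCTestSpec, Job, FakeBackend and run_test are hypothetical, the backend is a toy in which every running job finishes after one tick, and integer ticks stand in for the real start and end times. No Ganga calls are used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HCTestSpec:
    """The ingredients of an HC Test (field names are hypothetical)."""
    analysis_code: str
    dataset_pattern: str
    target_jobs_per_site: Dict[str, int]  # site name -> concurrent job target
    start_time: int  # abstract ticks stand in for real timestamps
    end_time: int


@dataclass
class Job:
    site: str
    status: str = "new"  # new -> running -> completed | killed


class FakeBackend:
    """Toy stand-in for the grid: a running job finishes after one tick."""

    def submit(self, job: Job) -> None:
        job.status = "running"

    def advance(self, jobs: List[Job]) -> None:
        for job in jobs:
            if job.status == "running":
                job.status = "completed"

    def kill(self, job: Job) -> None:
        job.status = "killed"


def run_test(spec: HCTestSpec, backend: FakeBackend) -> List[Job]:
    # Generate: create the initial job objects for each site under test.
    jobs = [Job(site)
            for site, target in spec.target_jobs_per_site.items()
            for _ in range(target)]
    # Submit: hand every generated job to the backend.
    for job in jobs:
        backend.submit(job)
    # Run: on each tick, let running jobs finish (a real system would record
    # their outputs here), then resubmit until each site is back at its target.
    for _tick in range(spec.start_time, spec.end_time):
        backend.advance(jobs)
        for site, target in spec.target_jobs_per_site.items():
            running = sum(1 for j in jobs
                          if j.site == site and j.status == "running")
            for _ in range(target - running):
                replacement = Job(site)
                backend.submit(replacement)
                jobs.append(replacement)
    # Exit: at the end time, kill whatever is still running.
    for job in jobs:
        if job.status == "running":
            backend.kill(job)
    return jobs
```

Calling run_test(HCTestSpec("analysis.py", "some.dataset.*", {"SITE_A": 2, "SITE_B": 1}, 0, 3), FakeBackend()) returns the full job history, including the jobs killed at exit. The point of the sketch is that the per-site target is held constant during the Run step by resubmission.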

An HC-LHCb Plugin
What customizations would be needed for an HC-LHCb plugin?
HC is built upon Ganga and exploits its job management features:
  – job repository, job configuration via Python, job submission, job monitoring in background thread(s)
Given the existing GangaLHCb plugins, modifications to HC itself would be relatively minor, e.g.
  – HC Test Generation: query a data discovery service to form a job that processes random input data
  – HC Test Running: changes to extract LHCb-specific job metrics from Ganga
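
One way to picture the two customization points is as a pair of hooks that an experiment plugin fills in. The sketch below is an assumption made for illustration: ExperimentPlugin, ToyLHCbPlugin, and their method names are hypothetical and do not describe HC's or GangaLHCb's real interfaces. The data discovery service is replaced by an in-memory dictionary with made-up entries, and pattern matching is simplified to a prefix check.

```python
import random
from abc import ABC, abstractmethod
from typing import Dict, List


class ExperimentPlugin(ABC):
    """Hypothetical hook points; HC's real plugin interface may differ."""

    @abstractmethod
    def discover_datasets(self, pattern: str, site: str) -> List[str]:
        """Ask a data discovery service which datasets match at a site."""

    @abstractmethod
    def extract_metrics(self, job_record: Dict[str, float]) -> Dict[str, float]:
        """Pull experiment-specific metrics out of a finished job record."""

    def pick_input(self, pattern: str, site: str, rng: random.Random) -> str:
        """Test generation: choose random input data for one job."""
        candidates = self.discover_datasets(pattern, site)
        if not candidates:
            raise LookupError(f"no data matching {pattern!r} at {site}")
        return rng.choice(candidates)


class ToyLHCbPlugin(ExperimentPlugin):
    """Toy implementation backed by an in-memory catalogue (made-up data)."""

    def __init__(self, catalogue: Dict[str, List[str]]) -> None:
        self.catalogue = catalogue

    def discover_datasets(self, pattern: str, site: str) -> List[str]:
        # Simplification: treat the pattern as a plain prefix.
        return [d for d in self.catalogue.get(site, []) if d.startswith(pattern)]

    def extract_metrics(self, job_record: Dict[str, float]) -> Dict[str, float]:
        # Assumes the record carries CPU and wall-clock seconds.
        return {"cpu_over_wall": job_record["cpu_s"] / job_record["wall_s"]}
```

With this split, the generic test logic only calls pick_input and extract_metrics, and everything experiment-specific stays inside the plugin class.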

Interface Tour
1. The Public User Interface

HC Home
The HC homepage lists the running and scheduled tests.

Viewing a Test
The test overview gives a quick summary of:
  – Overall job efficiency, CPU/Walltime, Events/WrapperTime
It also shows a summary of the jobs running at each site involved in the test.
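
As a rough illustration of how such summary numbers could be derived from per-job records, here is a minimal sketch. The definitions are assumptions, since the slides do not spell them out: efficiency is taken as successful jobs over all finished jobs, and the two ratios are aggregated over successful jobs only. JobRecord and summarize are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobRecord:
    succeeded: bool
    cpu_s: float      # CPU time in seconds
    wall_s: float     # wall-clock time in seconds
    events: int       # events processed
    wrapper_s: float  # time spent in the job wrapper, in seconds


def summarize(jobs: List[JobRecord]) -> Dict[str, float]:
    """Compute overview metrics for a list of finished jobs."""
    if not jobs:
        raise ValueError("no finished jobs to summarize")
    ok = [j for j in jobs if j.succeeded]
    summary = {"efficiency": len(ok) / len(jobs)}
    if ok:
        # Ratios of sums, so long jobs weigh more than short ones.
        summary["cpu_over_wall"] = (
            sum(j.cpu_s for j in ok) / sum(j.wall_s for j in ok))
        summary["events_per_wrapper_s"] = (
            sum(j.events for j in ok) / sum(j.wrapper_s for j in ok))
    return summary
```

Using a ratio of sums rather than a mean of per-job ratios is a design choice of this sketch; either could be reasonable depending on what the overview is meant to highlight.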

Viewing a Test: Summary Stats
The Test Overview page also gives summary statistics by site
Here you can see some example metrics (for CMS)

Viewing a Test: Per-Site Plots
View plots of the recorded metrics for each site

Viewing a Test: Metric Comparisons
View the plots for all sites for a specific metric
Useful for site-by-site comparison

Modify a Running Test
Authorized users can modify the parameters of a test at run time
  – E.g. change the end time, or the number of running jobs per site

Clone a Previous Test
Cloning a previous test is simple
  – Useful to repeat the test or to run an identical test at a different set of sites
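
Conceptually, cloning amounts to copying a test record and overriding a few fields. The following is a minimal sketch of that idea only; HCTestRecord and clone_test are hypothetical names, and the fields are assumptions rather than the actual HC database schema.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class HCTestRecord:
    test_id: int
    template: str
    sites: Tuple[str, ...]
    jobs_per_site: int
    start_time: str
    end_time: str


def clone_test(original: HCTestRecord, new_id: int, **overrides) -> HCTestRecord:
    """Copy a test, giving it a new id and optionally changing other fields."""
    return replace(original, test_id=new_id, **overrides)
```

For example, clone_test(old, 2, sites=("SITE_C",)) repeats a test at a different set of sites while leaving the original record untouched.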

Overall HC Plots
Historical plots show previous test statistics
They currently show the number of running jobs per site.
Plots showing the evolution of the performance metrics are in development.

HC Robot View
The “Robot” view shows the success rates of functional test jobs over the past 24 hrs (similar to SSB)
Clicking a site takes you to the list of Robot jobs executed at that site
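
The per-site number behind such a view can be sketched as a windowed success rate. This is an illustrative assumption about the calculation, not HC's implementation: the RobotJob tuple layout and the success_rates function are hypothetical, and jobs are filtered by finish time.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

# (site name, finish time, succeeded): a hypothetical record layout
RobotJob = Tuple[str, datetime, bool]


def success_rates(jobs: Iterable[RobotJob], now: datetime,
                  window: timedelta = timedelta(hours=24)) -> Dict[str, float]:
    """Fraction of successful jobs per site within the trailing window."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for site, finished, ok in jobs:
        # Only count jobs that finished inside the window ending at `now`.
        if now - window <= finished <= now:
            totals[site] = totals.get(site, 0) + 1
            passed[site] = passed.get(site, 0) + int(ok)
    return {site: passed[site] / totals[site] for site in totals}
```

Sites with no jobs in the window are simply absent from the result, which a view could render as "no data" rather than as a 0% success rate.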

Interface Tour
2. Admin Interface

HC Admin: Operator and User Views
HC Operators can administer all tables in the HC DB via a web interface
HC Users have more limited access

HC Admin: Tests and Templates
Above: List all Test Templates
Below: List all Tests

HC Admin: Edit a Test Template
Test templates are defined via the Admin UI
All of the parameters of a test are here, plus:
  – An active flag indicating that a template should be auto-scheduled
  – A default lifetime: auto-scheduled test instances of this template will run for this time period
Normally, functional test templates include the list of sites to be tested, whereas stress test templates do not include a list of sites.
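
The interplay of the active flag and the default lifetime can be sketched as follows. The scheduling rule used here is an assumption made for illustration (start a new instance of each active template whenever it has no unexpired instance); the slides do not describe the actual scheduler. Template, ScheduledTest, and auto_schedule are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Template:
    name: str
    category: str        # "FUNCTIONAL" or "STRESS"
    active: bool         # True means the template should be auto-scheduled
    lifetime: timedelta  # how long an auto-scheduled instance runs
    sites: Tuple[str, ...] = ()  # stress templates normally leave this empty


@dataclass
class ScheduledTest:
    template: str
    sites: Tuple[str, ...]
    start_time: datetime
    end_time: datetime


def auto_schedule(templates: List[Template], existing: List[ScheduledTest],
                  now: datetime) -> List[ScheduledTest]:
    """Create a test for each active template with no unexpired instance."""
    covered = {t.template for t in existing if t.end_time > now}
    new_tests = []
    for tpl in templates:
        if tpl.active and tpl.name not in covered:
            new_tests.append(
                ScheduledTest(tpl.name, tpl.sites, now, now + tpl.lifetime))
    return new_tests
```

Run periodically, a function like this would keep one instance of every active template alive, each lasting for the template's default lifetime, while inactive templates are only instantiated on demand.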

HC Admin: Adding a New Test
Adding a new test on demand is simple.
Select the test template of interest, a start time, and an end time.
If needed, tests can be further customized after the template is copied over.

Summary
HammerCloud is a DA functional and stress testing system used widely by ATLAS and coming soon for CMS
Two basic use-cases:
  – A continuous stream of test jobs to measure site availability
  – Enabling central managers to define standardized (stress) tests, and empowering site managers to invoke those tests on demand
An HC-LHCb plugin would leverage the existing GangaLHCb work
  – A prototype plugin would not take significant effort