Automated Grid Monitoring for LHCb Experiment through HammerCloud
Bradley Dice, Valentina Mancinelli
WLCG: sites in 42 countries, 30 PB/year 1
1 WLCG
HammerCloud: Distributed Analysis Testing System 2
2 HammerCloud v4,
Why Testing?
Between 5% and 10% of jobs fail 3
Intermittent failures? Systemic problems?
Need testing to diagnose
Purpose of HammerCloud:
  Validates grid health
  Helps test new sites
  Verifies correct operation of new software
  Allows performance comparisons
3 J. Elmsheuser, F. Legger, R. Medrano Llamas, G. Sciacca, and D. van der Ster, J. Phys. Conf. Ser. 396, (2012).
Project Overview
Use HammerCloud LHCb to…
  Test LHCb data storage access
  Test new releases of user analysis programs
  Report data to Resource Status System
Tools
  Django/Python (web interface)
  Ganga (job submission)
  OpenStack/Puppet (virtual machines, system management)
Levels of HammerCloud: Front End, Back End, Grid Tests
Front End
The user interface shows a list of current and past tests and offers management tools
Data visualizations categorize errors and the sites they affect (right)
Back End
The test manager interfaces between Ganga (to submit grid jobs) and Django (to display data)
The backend produces data visualizations, e.g. jobs by status: completed, running, scheduled, or failed (right)
HammerCloud sites automatically update to match the WLCG topology
Reports data via a REST API to the DIRAC Resource Status System
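As a rough illustration of the back-end flow described on this slide, the sketch below tallies jobs by status for each site and assembles a JSON report of the kind that could be sent to a REST endpoint. It is not HammerCloud or DIRAC code: the job record layout, the site names, and every field in the report are hypothetical, and no real Resource Status System API is called.

```python
import json
from collections import Counter

# Status names loosely follow the categories on the slide; the record
# layout and the report fields below are assumptions for illustration.
STATUSES = ("completed", "running", "scheduled", "failed")


def jobs_by_status(jobs):
    """Count jobs per status for each site."""
    counts = {}
    for job in jobs:
        site_counts = counts.setdefault(job["site"], Counter())
        site_counts[job["status"]] += 1
    return counts


def build_report(counts):
    """Build a JSON-serializable report, one entry per site (hypothetical schema)."""
    report = []
    for site in sorted(counts):
        c = counts[site]
        finished = c["completed"] + c["failed"]
        report.append({
            "site": site,
            "jobs": {status: c[status] for status in STATUSES},
            # Failure rate among finished jobs; None if nothing has finished yet.
            "failure_rate": c["failed"] / finished if finished else None,
        })
    return report


# Made-up job records standing in for test results.
jobs = [
    {"site": "SITE_A", "status": "completed"},
    {"site": "SITE_A", "status": "failed"},
    {"site": "SITE_A", "status": "running"},
    {"site": "SITE_B", "status": "scheduled"},
]
report = build_report(jobs_by_status(jobs))
print(json.dumps(report, indent=2))
```

Keeping the tally separate from the report builder means the same counts could feed both a status plot and an outgoing report.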
Grid Tests (Getting Results)
Detecting and classifying data access failures is the key purpose of HammerCloud
Grid metrics like Time to Start (right) give an indication of site load
Analyzing logs to determine reasons for failure / failover
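The two ideas on this slide, classifying failures from logs and using Time to Start as a load indicator, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis HammerCloud performs: the log patterns, category labels, record fields, and timestamps (plain seconds) are all hypothetical.

```python
import re
from statistics import median

# Hypothetical log patterns; real job logs would need their own rules.
FAILURE_PATTERNS = [
    ("data_access", re.compile(r"cannot open|file not found|no such file", re.I)),
    ("timeout", re.compile(r"timed out|timeout", re.I)),
]


def classify_failure(log_text):
    """Return the first matching failure category, or 'unknown'."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(log_text):
            return label
    return "unknown"


def median_time_to_start(jobs):
    """Median seconds between submission and start, grouped by site."""
    waits = {}
    for job in jobs:
        waits.setdefault(job["site"], []).append(job["started"] - job["submitted"])
    return {site: median(values) for site, values in waits.items()}


# Made-up job records; timestamps are seconds since an arbitrary origin.
jobs = [
    {"site": "SITE_A", "submitted": 0, "started": 120},
    {"site": "SITE_A", "submitted": 10, "started": 70},
    {"site": "SITE_B", "submitted": 0, "started": 600},
]
print(median_time_to_start(jobs))
print(classify_failure("ERROR: Cannot open input file"))
```

The median is used here rather than the mean so that a single job stuck in a queue does not dominate a site's figure; that choice is an assumption of this sketch.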
Future Work
New testing architecture: the LHCb “mesh”
More useful data visualizations and metrics
Provide grid site status information to RSS (Resource Status System) via REST API
Long-term plan: Testing as a Service 4
4 R. M. Llamas et al., J. Phys. Conf. Ser. 513, (2014).
At CERN, I…
Experienced global-scale computing
Learned the inner workings of the Grid
Improved understanding of Django framework
Engaged in a variety of cultural activities & scientific studies
Refined my career interests
Had an amazing summer!
Thank you for your time.