Automated Grid Monitoring for the LHCb Experiment through HammerCloud
Bradley Dice, Valentina Mancinelli

The Worldwide LHC Computing Grid (WLCG): sites in 42 countries, 30 PB of data per year.

HammerCloud: Distributed Analysis Testing System (HammerCloud v4)

Why Testing?
- Between 5% and 10% of grid jobs fail [3]. Are the failures intermittent or systemic? Testing is needed to diagnose them.
- Purpose of HammerCloud:
  - Validates grid health
  - Helps test new sites
  - Verifies correct operation of new software
  - Allows performance comparisons

[3] J. Elmsheuser, F. Legger, R. Medrano Llamas, G. Sciacca, and D. van der Ster, J. Phys. Conf. Ser. 396 (2012).

Project Overview
- Use HammerCloud LHCb to:
  - Test LHCb data storage access
  - Test new releases of user analysis programs
  - Report data to the Resource Status System
- Tools:
  - Django/Python (web interface)
  - Ganga (job submission; see the sketch below)
  - OpenStack/Puppet (virtual machines, system management)
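
A minimal sketch of how a probe job might be submitted to the grid through Ganga's DIRAC backend. This is an illustration only: the job name and echo command are hypothetical placeholders, not actual HammerCloud test templates, and it assumes an interactive Ganga session where Job, Executable, and Dirac are pre-loaded.

    # Inside a `ganga` interactive session, where Job, Executable and
    # Dirac are pre-loaded (no imports needed).
    j = Job(name='hc-lhcb-functional-probe')    # hypothetical test name
    j.application = Executable(exe='/bin/echo',
                               args=['HammerCloud probe'])
    j.backend = Dirac()                         # submit through the DIRAC WMS
    j.submit()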

Levels of HammerCloud: Front End, Back End, Grid Tests

Front End
- The user interface shows a list of current and past tests and offers management tools (a minimal Django sketch of such a view follows below)
- Data visualizations categorize errors and the sites they affect
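
Since the web interface is built on Django, a view backing the test list might look roughly like the sketch below. The model and field names (Test, state, starttime) are hypothetical; the real HammerCloud schema is richer than this.

    # Hypothetical Django model and view for a "list of tests" page.
    from django.db import models
    from django.shortcuts import render

    class Test(models.Model):
        STATES = [('scheduled', 'Scheduled'), ('running', 'Running'),
                  ('completed', 'Completed'), ('error', 'Error')]
        template = models.CharField(max_length=128)   # which test template ran
        state = models.CharField(max_length=16, choices=STATES)
        starttime = models.DateTimeField()
        endtime = models.DateTimeField(null=True, blank=True)

    def test_list(request):
        # Newest tests first, as on the HammerCloud front page.
        tests = Test.objects.order_by('-starttime')[:50]
        return render(request, 'tests/list.html', {'tests': tests})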

Back End
- The test manager interfaces between Ganga (to submit grid jobs) and Django (to display data)
- The back end produces data visualizations, e.g. jobs by status: completed, running, scheduled, or failed
- HammerCloud sites automatically update to match the WLCG topology
- Reports data via a REST API to the DIRAC Resource Status System (a hedged sketch of such a report follows below)
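
A sketch of what reporting a site status to the Resource Status System over REST could look like. The endpoint URL, payload fields, and status vocabulary here are assumptions for illustration; the actual RSS API defines its own schema.

    # Hypothetical REST report to the DIRAC Resource Status System.
    import requests

    def report_site_status(site, status, reason):
        payload = {'site': site, 'status': status, 'reason': reason}
        resp = requests.post(
            'https://lhcb-rss.example.cern.ch/api/status',  # placeholder URL
            json=payload,
            timeout=10,
        )
        resp.raise_for_status()

    report_site_status('LCG.CERN.cern', 'degraded',
                       'HammerCloud probe jobs failing at this site')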

Grid Tests (Getting Results)
- Detecting and classifying data access failures is the key purpose of HammerCloud
- Grid metrics such as Time to Start give an indication of site load
- Logs are analyzed to determine the reasons for failure or failover (a classification sketch follows below)
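
Log analysis of this kind amounts to matching error messages against known failure signatures. The patterns below are illustrative guesses, not HammerCloud's actual classification rules.

    # Sketch: classify a job log into a failure category by pattern matching.
    import re

    FAILURE_PATTERNS = [
        ('data_access', re.compile(r'could not open file|file not found', re.I)),
        ('application', re.compile(r'segmentation fault|application error', re.I)),
        ('site',        re.compile(r'batch system|worker node', re.I)),
    ]

    def classify_failure(log_text):
        """Return the first matching failure category, or 'unknown'."""
        for category, pattern in FAILURE_PATTERNS:
            if pattern.search(log_text):
                return category
        return 'unknown'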

Future Work
- New testing architecture: the LHCb "mesh"
- More useful data visualizations and metrics
- Provide grid site status information to the RSS (Resource Status System) via the REST API
- Long-term plan: Testing as a Service [4]

[4] R. Medrano Llamas et al., J. Phys. Conf. Ser. 513 (2014).

At CERN, I…
- Experienced global-scale computing
- Learned the inner workings of the Grid
- Improved my understanding of the Django framework
- Engaged in a variety of cultural activities and scientific studies
- Refined my career interests
- Had an amazing summer!

Thank you for your time.