Introduction to HammerCloud for the LHCb Experiment
Dan van der Ster, CERN IT Experiment Support
3 June 2010

Outline
Introduction to HammerCloud
– Motivation, history, use-cases
How HammerCloud works
– Design and implementation details
Interface tour for users and admins
Possibilities for an LHCb plugin

Introduction to HammerCloud
HammerCloud (HC) is a Distributed Analysis (DA) testing system serving two use-cases:
– Robot-like functional testing: frequent "ping" jobs sent to all sites to perform basic site validation
– DA stress testing: on-demand, large-scale stress tests using real analysis jobs against one or many sites simultaneously, in order to:
  Help commission new sites
  Evaluate changes to site infrastructure
  Evaluate software changes
  Compare performance across sites

HammerCloud and Job Robots
HammerCloud is part of an evolution of job robots:
– The CMS Job Robot inspired the ATLAS GangaRobot (functional testing)
– In ~September 2008, a form of the ATLAS GangaRobot was used to manually stress test the Italian ATLAS Tier-2s: five users simultaneously submitted hundreds of instrumented jobs by hand (SIMD), with manual results collection and summarization
The early results proved very useful:
– One early test showed a bimodal performance plot that was later traced to a faulty network switch degrading the performance of some worker nodes (WNs)
The need for an automated DA stress-testing system was clear:
– HammerCloud was born in November 2008 to deliver on-demand stress tests to ATLAS sites; since then HC has run >1300 tests comprising more than 4 million jobs, and ATLAS has invested >200k CPU-days in HC tests
– CMS has also agreed to use HC: a prototype was delivered in April, and scale tests are about to begin

HC and ATLAS during STEP'09
[Plot: HammerCloud activity during STEP'09]

HammerCloud Use-Cases
HammerCloud provides on-demand and automated testing. HC operators define test templates of two kinds: FUNCTIONAL and STRESS.
Functional tests are scheduled automatically:
– Results are published on the HC website and can be pushed to other systems (e.g. SAM)
Stress tests are generally scheduled on demand, as needed by:
– Central VO managers
– Cloud/regional managers
– Site managers
For every test, a detailed report is produced summarizing the job success rates and performance.

HammerCloud Components
The HC user interface is implemented as a Django web application:
– View test results
– View cloud/site evolution
– DB administration
State is maintained in a MySQL database.
The HC logic (job submission, monitoring, resubmission) is implemented on top of the Ganga Grid Programming Interface (GPI).
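To give a feel for the kind of state HC keeps in its database, here is a minimal sketch using plain dataclasses; the real schema is a set of Django models backed by MySQL, and all field names here are illustrative, not HC's actual tables.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Hypothetical record types mirroring the state HC persists;
# field names are illustrative only.

@dataclass
class JobRecord:
    site: str
    status: str = "submitted"   # e.g. submitted / running / completed / failed
    cpu_time: float = 0.0       # CPU seconds consumed
    wall_time: float = 0.0      # wallclock seconds

@dataclass
class TestRecord:
    template: str               # name of the test template used
    start_time: datetime
    end_time: datetime
    sites: List[str] = field(default_factory=list)
    jobs: List[JobRecord] = field(default_factory=list)

    def success_rate(self) -> float:
        """Fraction of finished jobs that completed successfully."""
        done = [j for j in self.jobs if j.status in ("completed", "failed")]
        if not done:
            return 0.0
        return sum(j.status == "completed" for j in done) / len(done)
```

In the real system the Django admin interface (shown later in the tour) is generated directly from these model definitions.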

HammerCloud Logic
An HC test is described by:
– The analysis code to run (typically a real analysis taken from the user community)
– The dataset pattern (which can be resolved to a set of datasets appropriate for the analysis code)
– The list of sites to be tested, and the target number of jobs to run concurrently per site
– A start time and an end time
Test execution proceeds in four steps:
– Generate: the test description is converted to a set of submittable jobs (e.g. Ganga job objects, one for each site under test)
– Submit: the job objects are submitted
– Run: jobs are monitored, outputs are recorded to the HC DB, and jobs are resubmitted to maintain the target number of running jobs per site
– Exit: at the test end time, any leftover jobs are killed
Concurrently, the HC web interface shows test results in real time.
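The four steps above can be sketched as a simple driver loop. The Ganga GPI calls are replaced here by a stub `GridJob` class so the snippet is self-contained; all names are illustrative stand-ins, not HammerCloud's actual code.

```python
import random
from datetime import datetime, timedelta

# Stub standing in for a Ganga job object; the real logic drives Ganga's GPI.
class GridJob:
    def __init__(self, site):
        self.site = site
        self.status = "new"

    def submit(self):
        self.status = "running"

    def poll(self):
        # Pretend some fraction of jobs finishes each monitoring cycle.
        if self.status == "running" and random.random() < 0.5:
            self.status = "completed"
        return self.status

    def kill(self):
        if self.status == "running":
            self.status = "killed"

def run_test(sites, target_per_site, end_time, now=datetime.utcnow, max_cycles=50):
    """Generate, submit, run, exit: the four HC steps for one test."""
    # Generate + submit: one batch of jobs per site under test.
    jobs = [GridJob(s) for s in sites for _ in range(target_per_site)]
    for j in jobs:
        j.submit()
    # Run: monitor, and resubmit to hold the target number of running jobs.
    for _ in range(max_cycles):
        if now() >= end_time:
            break
        for site in sites:
            running = [j for j in jobs if j.site == site and j.poll() == "running"]
            for _ in range(target_per_site - len(running)):
                replacement = GridJob(site)
                replacement.submit()
                jobs.append(replacement)
    # Exit: kill any leftover jobs at the test end time.
    for j in jobs:
        j.kill()
    return jobs
```

The resubmission step is what distinguishes a stress test from a one-shot submission: the load on each site is held constant for the whole test window.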

An HC-LHCb Plugin
What customizations would be needed for an HC-LHCb plugin?
HC is built upon Ganga and exploits its job management features:
– Job repository, job configuration via Python, job submission, and job monitoring in background thread(s)
Given the existing GangaLHCb plugins, the modifications to HC itself would be relatively minor, e.g.:
– HC test generation: query a data discovery service to build jobs processing randomly chosen input data
– HC test running: changes to extract LHCb-specific job metrics from Ganga
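For illustration, the kind of Ganga job an HC-LHCb test generator might configure could look like the following. In a real Ganga session, `Job`, `DaVinci` and `Dirac` are provided by Ganga and the GangaLHCb plugins; the simplified stand-in classes here (and the version, options file, and dataset names) are purely illustrative so the snippet runs on its own.

```python
# Simplified stand-ins for Ganga GPI objects; inside Ganga these come
# from the GangaLHCb plugins and need no definitions here.
class DaVinci:
    def __init__(self, version="vXrY", optsfile=None):
        self.version = version      # illustrative application version
        self.optsfile = optsfile    # job options steering the analysis

class Dirac:
    """Backend stand-in: submits to the LHCb DIRAC workload system."""

class Job:
    def __init__(self, application=None, backend=None, inputdata=None):
        self.application = application
        self.backend = backend
        self.inputdata = inputdata or []
        self.status = "new"

    def submit(self):
        self.status = "submitted"

# An HC-LHCb generator would fill inputdata from a data discovery query;
# these LFNs are invented for the example.
job = Job(
    application=DaVinci(optsfile="myAnalysis.opts"),
    backend=Dirac(),
    inputdata=["LFN:/lhcb/data/example1.dst", "LFN:/lhcb/data/example2.dst"],
)
job.submit()
```

Because Ganga already abstracts the application and backend, the plugin work reduces to generating objects like this and harvesting the metrics after the jobs run.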

Interface Tour, Part 1: The Public User Interface

HC Home
The HC homepage lists the running and scheduled tests.

Viewing a Test
The test overview gives a quick summary of the overall job efficiency, CPU/wallclock ratio, and events per wrapper time. It also shows a summary of the jobs running at each site involved in the test.

Viewing a Test: Summary Stats
The test overview page also gives summary statistics by site; here you can see some example metrics (for CMS).
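The per-site summary metrics named above are simple ratios over the timing numbers recorded for each job. A minimal sketch (with illustrative dictionary keys, not HC's actual schema) might be:

```python
def site_metrics(jobs):
    """Summarize a list of per-job records for one site.

    Each record is a dict with (illustrative) keys:
    'ok' (bool), 'cpu'/'wall' (seconds), 'events', 'wrapper' (seconds).
    """
    finished = [j for j in jobs if j["wall"] > 0]
    ok = [j for j in finished if j["ok"]]
    return {
        # Fraction of finished jobs that succeeded.
        "efficiency": len(ok) / len(finished) if finished else 0.0,
        # CPU/wallclock ratio: low values often point at I/O-bound jobs.
        "cpu_over_wall": (sum(j["cpu"] for j in ok) / sum(j["wall"] for j in ok))
                         if ok else 0.0,
        # Event throughput per second of total wrapper time.
        "events_per_wrapper_sec": (sum(j["events"] for j in ok)
                                   / sum(j["wrapper"] for j in ok))
                                  if ok else 0.0,
    }
```

Comparing these numbers across sites running the same analysis is precisely what the per-site plots and metric comparison views are for.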

Viewing a Test: Per-Site Plots
View plots of the recorded metrics for each site.

Viewing a Test: Metric Comparisons
View the plots of a specific metric for all sites; used for site-by-site comparisons.

Modify a Running Test
Authorized users can modify the parameters of a test at run time:
– E.g. change the end time, or the number of running jobs per site

Clone a Previous Test
Cloning a previous test is simple:
– Useful to repeat the test, or to run an identical test at a different set of sites

Overall HC Plots
Historical plots show statistics from previous tests. This view currently shows the number of running jobs per site; plots showing the evolution of the performance metrics are in development.

HC Robot View
The "Robot" view shows the success rates of functional test jobs over the past 24 hours (similar to the SSB). Clicking a site takes you to the list of robot jobs executed at that site.

Interface Tour, Part 2: The Admin Interface

HC Admin: Operator and User Views
HC operators have administrative access to all tables in the HC DB via a web interface; HC users have more limited access.

HC Admin: Tests and Templates
Above: the list of all test templates. Below: the list of all tests.

HC Admin: Edit a Test Template
Test templates are defined via the admin UI. All of the parameters of a test are here, plus:
– An active flag indicating that the template should be auto-scheduled
– A default lifetime: auto-scheduled test instances of this template will run for this time period
Normally, functional test templates include the list of sites to be tested, whereas stress test templates do not.
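The auto-scheduling behaviour described above can be sketched as follows; the template fields and scheduler function are illustrative stand-ins rather than HC's actual implementation.

```python
from datetime import datetime, timedelta

# Illustrative template records: 'active' marks templates the scheduler
# instantiates automatically, 'lifetime' is the default test duration.
TEMPLATES = [
    {"name": "lhcb-functional", "kind": "FUNCTIONAL", "active": True,
     "lifetime": timedelta(hours=24), "sites": ["SITE_A", "SITE_B"]},
    {"name": "lhcb-stress", "kind": "STRESS", "active": False,
     "lifetime": timedelta(hours=6), "sites": []},  # sites chosen at submit time
]

def schedule_tests(templates, now):
    """Create one test instance for every active template."""
    tests = []
    for t in templates:
        if t["active"]:
            tests.append({
                "template": t["name"],
                "sites": t["sites"],
                "start": now,
                "end": now + t["lifetime"],
            })
    return tests
```

This mirrors the split on the slide: functional templates carry their site list and are picked up automatically, while stress templates stay inactive until an operator instantiates them on demand.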

HC Admin: Adding a New Test
Adding a new test on demand is simple: select the test template of interest, a start time, and an end time. If needed, tests can be further customized after the template is copied over.

Summary
HammerCloud is a DA functional and stress testing system, used widely by ATLAS and coming soon for CMS.
It serves two basic use-cases:
– A continuous stream of test jobs to measure site availability
– Enabling central managers to define standardized (stress) tests, and empowering site managers to invoke those tests on demand
An HC-LHCb plugin would leverage the existing GangaLHCb work:
– A prototype plugin would not take significant effort