A Statistical Analysis of Job Performance on LCG Grid
David Colling, Olivier van der Aa, Mona Aggarwal, Gidon Moont (Imperial College, London)

Introduction

Introduction
We decided to keep the data that we gather and to perform some statistical analysis on it. In this talk I will briefly discuss:
– what it can tell us about the different usage of the system by the different VOs
– what it can tell us about the performance of the individual components and of the system as a whole
We now produce daily reports (available from the website). In general I will just describe what we see rather than trying to interpret it; interpretation is the next step. This is still very much work in progress.

Caveats
This view of the LCG is that of the RBs (well, actually the LBs):
– we do not see jobs submitted by local users
– we do not see grid jobs submitted via RBs to which we do not have access (a small effect)
– we do not see grid jobs submitted directly, without using an LCG RB; in particular, we do not see jobs submitted through Rod Walker's CondorG submission system
These statistics cover only the last quarter.

The System
The LCG is an operational Grid currently running over 200 sites in 36 countries, offering its users access to nearly 14,000 CPUs and approximately 8 PB of storage. Defining meaningful metrics and monitoring the performance of such a system is a challenging exercise, but one that is important for successful operation. The primary motivation for this work is to analyse LCG performance through a statistical analysis of the lifecycles of all jobs on the grid. We define metrics that describe typical job lifecycles; the statistical analysis of these metrics enables us to gain insight into the workload management characteristics of the LCG Grid [2]. Finally, we show how those metrics can be used to spot Grid failures by identifying statistical changes over time in the monitored metrics.

Analysis Dataset
The dataset is obtained from:
– the information published by the roughly 28 Resource Brokers (RBs) across the EGEE grid
– job lifecycles reconstructed from the RB log files
– data taken from Sept 2005 to Jan 2006: more than 3 million jobs
The performance metrics are measured for the four main LHC VOs: ALICE, ATLAS, LHCb and CMS. Metrics are defined to measure performance and effectiveness from three perspectives: user, resource and Grid. The sketch below illustrates the kind of per-VO aggregation involved.
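
The slides do not show the extraction code, but the aggregation behind these per-VO metrics can be sketched as follows. This is a minimal illustration, assuming each job lifecycle has already been reduced to a flat record; the record layout (vo, submitted, started, finished, success) is hypothetical, not the actual RB/LB log schema.

```python
# Minimal sketch of per-VO aggregation over job lifecycle records.
# JobRecord is a hypothetical flattened view of one job's lifecycle;
# the real input would be reconstructed from RB/LB log files.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class JobRecord:
    vo: str           # e.g. "alice", "atlas", "cms", "lhcb"
    submitted: float  # epoch seconds: job enters the system
    started: float    # epoch seconds: job starts running
    finished: float   # epoch seconds: job leaves the system
    success: bool     # True if the job completed successfully

def per_vo_summary(jobs):
    """Group jobs by VO and compute simple summary statistics."""
    by_vo = defaultdict(list)
    for job in jobs:
        by_vo[job.vo].append(job)
    return {
        vo: {
            "n_jobs": len(vo_jobs),
            "total_run_hours": sum((j.finished - j.started) / 3600.0
                                   for j in vo_jobs),
        }
        for vo, vo_jobs in by_vo.items()
    }
```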

So what can we see?

Number of active users in the system at a given time.

Distribution of Job Run Time (h) for each LHC VO.

Distribution of Job Run Time (h) weighted by Job Run Time (h): this shows where the CPU hours are used.

Distribution of Job Efficiency for each LHC VO (efficiency = time spent running successfully / total time in system).
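
In code, the per-job efficiency plotted here could be computed along the following lines. This is a sketch reusing the hypothetical JobRecord above; a failed job is taken to contribute no successful running time.

```python
def job_efficiency(job):
    """Efficiency = time spent running successfully / total time in system."""
    total_in_system = job.finished - job.submitted
    if total_in_system <= 0:
        return 0.0
    successful_run_time = (job.finished - job.started) if job.success else 0.0
    return successful_run_time / total_in_system
```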

Job Efficiency versus Job Run Time (h).

RB Load: Number of Jobs on a given RB.

CE Jobs Distribution: Number of Jobs.

CE Hours Distribution: Job Hours.

LCG Efficiency = N_success / N_total (Number of Jobs)

ALICE Efficiency = N_success / N_total (Number of Jobs)

ATLAS Efficiency = N_success / N_total (Number of Jobs)

CMS Efficiency = N_success / N_total (Number of Jobs)

LHCb Efficiency = N_success / N_total (Number of Jobs)
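
The per-VO efficiency in the preceding five plots is a simple success fraction over job counts. A minimal sketch, again in terms of the hypothetical JobRecord:

```python
def count_efficiency(jobs):
    """Efficiency = N_success / N_total over a set of jobs."""
    if not jobs:
        return 0.0
    return sum(1 for j in jobs if j.success) / len(jobs)
```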

Grid Load: Number of Job Hours submitted at a given time.

Efficiency for CMS and ATLAS: Efficiency = Total Successful Hours / Total Hours.

Efficiency for LHCb & ALICE: Efficiency = Total Successful Hours / Total Hours.
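
These two plots use a different, hour-weighted definition: total successful hours over total hours, so long jobs dominate and the result can differ sharply from the count-based figure above. A sketch under the same assumed record layout:

```python
def hour_weighted_efficiency(jobs):
    """Efficiency = total successful hours / total hours.
    CPU-time weighting: a single long failed job can outweigh
    many short successful ones."""
    total_hours = sum((j.finished - j.started) / 3600.0 for j in jobs)
    if total_hours <= 0:
        return 0.0
    successful_hours = sum((j.finished - j.started) / 3600.0
                           for j in jobs if j.success)
    return successful_hours / total_hours
```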

Grid Load: Number of jobs in the system at a given time.

Grid Load: Number of jobs in the system at a given time.

Grid Load: Number of jobs in the system at a given time.

RB Load: Job scheduling (Match Time) versus load (mean number of jobs/sec during the matching).

RB Load: Job scheduling (Match Time) versus load (mean number of jobs/sec during the matching).

RB Load: Job scheduling (Match Time) versus load (mean number of jobs/sec during the matching), for the RB gdrb04.
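
The load axis in these match-time plots is the mean number of jobs per second arriving at the RB during the matching window. One plausible reconstruction of that quantity, assuming a list of per-job submission timestamps (an illustration, not the actual monitoring code):

```python
def mean_match_load(submission_times, match_start, match_end):
    """Mean jobs/sec submitted to the RB while one job was being matched."""
    duration = match_end - match_start
    if duration <= 0:
        return 0.0
    n_arrivals = sum(1 for t in submission_times
                     if match_start <= t < match_end)
    return n_arrivals / duration
```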

Conclusions
We have started to analyse the distribution of jobs submitted to the LCG. Distinct usage patterns are beginning to emerge for each VO, and these usage patterns have different efficiencies. There are many more plots that I could have shown, and there is a lot more work to do to try to understand what we see.

References
[1] GridPP-UK Computing for Particle Physics.
[2] Crosby P., Colling D., Waters D., "Efficiency of resource brokering in grids for high-energy physics computing", IEEE Transactions on Nuclear Science, vol. 51, 2004.

Backup slides

Grid Load: Number of Job Hours submitted at a given time.

Grid Load: Number of Job Hours submitted at a given time.

Number of Job Hours submitted at a given time.

Grid Load: Number of Job Hours submitted at a given time.