ATLAS DC2 Pile-up Jobs on LCG
ATLAS DC Meeting, February 2005

Pile-up tasks
Jobs defined in 3 tasks:
- 210 dc lumi10.A2_z_mumu.task
- 307 dc lumi10.A0_top.task
- 308 dc lumi10.A3_z_tautau.task
Input files with min. bias were distributed to selected sites using DQ (700 GB in total).
Each job used 8 input files with min. bias (~250 MB each), downloaded from the close SE, and 1 input file with signal.
1 GB RAM per job required.
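For illustration, a minimal sketch of what the per-job stage-in amounts to (8 min. bias files from the close SE plus the signal file), assuming the LCG-2-era lcg-cp client; the LFNs and local paths below are hypothetical placeholders, not the real DC2 names.

# Sketch of the per-job stage-in (hypothetical LFNs; assumes the LCG-2 lcg-cp CLI).
import subprocess

MINBIAS_LFNS = ["lfn:/grid/atlas/dc2/minbias/minbias_%03d.pool.root" % i
                for i in range(8)]                          # hypothetical names
SIGNAL_LFN = "lfn:/grid/atlas/dc2/signal/signal.pool.root"  # hypothetical name

def stage_in(lfn, dest):
    """Copy one replica of `lfn` to the local disk of the WN."""
    # lcg-cp picks a replica (the close SE, when one holds the file)
    # and transfers it via GridFTP.
    subprocess.check_call(["lcg-cp", "--vo", "atlas", lfn, "file:" + dest])

for i, lfn in enumerate(MINBIAS_LFNS):
    stage_in(lfn, "/tmp/minbias_%d.pool.root" % i)   # 8 x ~250 MB
stage_in(SIGNAL_LFN, "/tmp/signal.pool.root")        # 1 signal file

With 8 x ~250 MB min. bias files per job, each attempt moves about 2 GB of input before the signal file, which is why pre-distributing the min. bias set to the sites mattered.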

5 sites involved
Number of jobs per site (the counts were shown in a table; the numbers are not in the transcript):
- golias25.farm.particle.cz:2119/jobmanager-lcgpbs-lcgatlasprod
- lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-infinite
- lcgce01.triumf.ca:2119/jobmanager-lcgpbs-atlas
- lcgce02.ifae.es:2119/jobmanager-lcgpbs-atlas
- t2-ce-01.roma1.infn.it:2119/jobmanager-lcgpbs-infinite

Status
JOBSTATUS   NJOBS
failed       3702
finished     5703
pending       323
running        64
(per-task summary table: TASK, DONE, %DONE, ALL; the numbers are not in the transcript)
Jobs with JOBSTATUS finished but CURRENTSTATE ABORTED are probably initial tests; ENDTIME = 23-SEP-04, 30-SEP-04 and 07-OCT-04.
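The counts above come straight from proddb; a minimal sketch of the kind of query involved, assuming Python with cx_Oracle and a hypothetical JOBEXECUTION table with a JOBSTATUS column (the real proddb schema, account and DSN differ):

# Sketch: count jobs per status in the production DB (hypothetical schema).
import cx_Oracle

conn = cx_Oracle.connect("reader/secret@atlassg")   # hypothetical account/DSN
cur = conn.cursor()
cur.execute("""
    SELECT jobstatus, COUNT(*)
      FROM jobexecution            -- hypothetical table name
     GROUP BY jobstatus
     ORDER BY jobstatus
""")
for status, njobs in cur:
    print("%-10s %6d" % (status, njobs))
conn.close()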

Why such big differences in efficiency?
PRAGUE: 48%, TW: 70%
(per-site tables of ATTEMPT vs NJOBS for TW and Prague; the numbers are not in the transcript)
Jobs with Attempt = 1: All / Good / Failed / Eff % compared for TW and Prague (numbers not in the transcript)
Other differences:
- RB on TW
- lexor running on a UI on TW
- many signal files stored on the SE on TW

Failures
Not easy to get the cause of failure from proddb:
- VALIDATIONDIAGNOSTIC is quite difficult to parse by script, e.g.:

t2-wn-36.roma1.infn.it 1 0m2.360s STAGE-IN failed:
WARNING: No FILE or RFIO access for existing replicas
WARNING: Replication of sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc simul.A2_z_mumu/dc simul.A2_z_mumu._ pool.root.1 to close SE failed: Error in replicating PFN sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc simul.A2_z_mumu/dc simul.A2_z_mumu._ pool.root.1 to t2-se-01.roma1.infn.it: lcg_aa: File exists
lcg_aa: File exists
Giving up after attempting replication TWICE.
WARNING: Could not stage input file sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc simul.A2_z_mumu/dc simul.A2_z_mumu._ pool.root.1: Gridftp copy failed from gsiftp://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc simul.A2_z_mumu/dc simul.A2_z_mumu._ pool.root.1 to file:/home/atlassgm/globus-tmp.t2-wn /WMS_t2-wn-36_018404_https_3a_2f_2flcg00124.grid.sinica.edu.tw_3a9000_2fKv9HpVIUkMLTBBe-Ia3xLA/dc simul.A2_z_mumu._01477.pool.root: the server sent an error response: /castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc simul.A2_z_mumu/dc simul.A2_z_mumu._01477.pool.root.1: Invalid argument.
EDGFileCatalog: level[Always] Disconnected
No log for stageout phase

- mw failures: Job RetryCount (0) hit
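Since VALIDATIONDIAGNOSTIC is free text, about the best a script can do is coarse pattern matching; a sketch in Python, with the categories guessed from the examples quoted on these slides (not an exhaustive list):

# Sketch: coarse classification of VALIDATIONDIAGNOSTIC strings by regex.
import re

# Patterns guessed from the diagnostics quoted above; not exhaustive.
PATTERNS = [
    ("stage-in failed",     re.compile(r"STAGE-IN failed")),
    ("replica exists",      re.compile(r"lcg_aa: File exists")),
    ("athena crash",        re.compile(r"AthenaCrash")),
    ("FID not in catalog",  re.compile(r"FID is not existing in the catalog")),
    ("mw retry count hit",  re.compile(r"Job RetryCount \(0\) hit")),
]

def classify(diagnostic):
    """Return the first matching failure category, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.search(diagnostic):
            return label
    return "unknown"

def summarize(diagnostics):
    """Count failures per category over an iterable of diagnostic strings."""
    counts = {}
    for diag in diagnostics:
        label = classify(diag)
        counts[label] = counts.get(label, 0) + 1
    return counts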

Some jobs with many attempts
JOBDEFINITIONID=
- Attempt 1: 09-NOV-04, t2-wn-42.roma1.infn.it 1 0m43.250s, Transformation error:
Problem report [Unknown Problem] AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog
Problem report [Unknown Problem] PileUpEventLoopMgr WARNING Original event selector has no events
No log for stageout phase
- ...
- Attempt 11: 15-DEC-04, goliasx76.farm.particle.cz 1 0m41.460s, Transformation error:
Problem report [Unknown Problem] AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog
Problem report [Unknown Problem] PileUpEventLoopMgr WARNING Original event selector has no events
No log for stageout phase

JOBDEFINITIONID=
- Attempt 1: t2-wn-37.roma1.infn.it 1 0m2.830s, STAGE-IN failed:
WARNING: No FILE or RFIO access for existing replicas
WARNING: Replication of srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc simul.A2_z_mumu/dc simul.A2_z_mumu._02629.pool.root.6 to close SE failed: Error in replicating PFN srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc simul.A2_z_mumu/dc simul.A2_z_mumu._02629.pool.root.6 to t2-se-01.roma1.infn.it: lcg_aa: File exists
lcg_aa: File exists
Giving up after attempting replication TWICE.
WARNING: Could not stage input file srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc simul.A2_z_mumu/dc simul.A2_z_mumu._02629.pool.root.6: Get TURL failed: lcg_gt: Communication error on send
EDGFileCatalog: level[Always] Disconnected
No log for stageout phase
- Attempt 2: lcg00172.grid.sinica.edu.tw 2 0m23.660s, Transformation error:
Problem report [SOFTWARE] AthenaCrash
No log for stageout phase
- ...
- Attempt 9: goliasx44.farm.particle.cz 2 0m23.340s, Transformation error:
Problem report [SOFTWARE] AthenaCrash
No log for stageout phase

JOBDEFINITIONID=
- Attempt 1: t2-wn-48.roma1.infn.it 2 66m58.650s, Transformation error:
Problem report [SOFTWARE] AthenaCrash
No log for stageout phase
- Attempt 2: lcg00144.grid.sinica.edu.tw 2 66m56.800s, Transformation error:
Problem report [SOFTWARE] AthenaCrash
No log for stageout phase
- The same up to attempt 5
- Attempt 6: mw failure
- Attempt 7: goliasx60.farm.particle.cz 0 152m53.780s ???

Job properties
- No exact relation between a job in the Oracle DB and an entry in the PBS log file; STARTTIME and ENDTIME are just hints.
- Some jobs on golias:
  - 1232 finished jobs in December registered in proddb
  - 1299 jobs selected from the PBS logs in December, with cuts on CPU time and virtual memory values
- Nodes: 3.06 GHz Xeon, 2 GB RAM
- Histograms based on information from the PBS log files
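For illustration, a minimal sketch of the PBS log selection, assuming Torque/OpenPBS-style accounting records ("E" end-of-job lines carrying resources_used.* fields); the cut values below are illustrative, not the ones actually applied for these plots.

# Sketch: select finished ("E") records from a PBS accounting log and apply
# cuts on CPU time and virtual memory. Assumes Torque/OpenPBS-style records:
#   date time;E;jobid;user=... resources_used.cput=HH:MM:SS resources_used.vmem=NNNkb ...
import re

CPUT_RE = re.compile(r"resources_used\.cput=(\d+):(\d+):(\d+)")
VMEM_RE = re.compile(r"resources_used\.vmem=(\d+)kb")

def end_records(path):
    """Yield (jobid, cpu_seconds, vmem_kb) for each finished-job record."""
    for line in open(path):
        fields = line.rstrip("\n").split(";")
        if len(fields) < 4 or fields[1] != "E":
            continue
        cput = CPUT_RE.search(fields[3])
        vmem = VMEM_RE.search(fields[3])
        if cput and vmem:
            h, m, s = map(int, cput.groups())
            yield fields[2], 3600 * h + 60 * m + s, int(vmem.group(1))

def select_jobs(path, min_cpu=3600, max_vmem=2 * 1024 * 1024):
    """Apply illustrative cuts: CPU time in seconds, virtual memory in kB."""
    return [rec for rec in end_records(path)
            if rec[1] >= min_cpu and rec[2] <= max_vmem]

Because the batch jobs carry no job name linking them to proddb, a selection like this over the accounting records is the only practical way to pick out the DC2 jobs, which is why the PBS count (1299) and the proddb count (1232) do not match exactly.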

Some jobs (6) successfully ran on a machine with only 1 GB RAM, but the wall time was 20 h; probably a lot of swapping.

WN -> SE -> NFS server
The WN has the same NFS mount; could it be used directly?

Conclusions
- No job name in the local batch system, which makes jobs difficult to identify.
- The version of the lexor executor should be recorded in proddb.
- proddb: very slow response; these queries were done on atlassg (which holds a snapshot of proddb from Feb 8).
- A study of the log files should be done before increasing MAXATTEMPT.
- proddb should be cleaned.