EGEE is a project funded by the European Union under contract IST-2003-508833. Large scale simulations on the EGEE Grid. Jiri Chudoba, FZU and CESNET, Prague.

Presentation transcript:

EGEE is a project funded by the European Union under contract IST-2003-508833. Large scale simulations on the EGEE Grid. Jiri Chudoba, FZU and CESNET, Prague. EGEE Summer School, Budapest.

Contents
- Single job vs. big production
- ATLAS Data Challenges
- Job rates, distributions
- Problems: global, local
- Outlook

From 1 Job to Big Productions
Basic job submission on the EGEE grid:
- edg-job-submit my.jdl
- edg-job-status -i myjobs
- edg-job-get-output -i myjobs
~100 jobs: still manageable with a couple of shell scripts looping around the basic commands (see the sketch below).
~1000 jobs: a small database is needed to be able to resubmit failed jobs.
LCG experiments require much more: complicated systems built around the basic commands.
ATLAS requirements for 2008: CPU 51 MSI2k (... current processors); storage: disk ... TB, tape ... TB.
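A minimal sketch of the "shell scripts with loops around basic commands" approach, assuming the LCG-2 edg-job-* user interface and a my.jdl job description; the file names, job count and the crude status parsing are illustrative.

```bash
#!/bin/bash
# Submit N copies of my.jdl and record their job IDs (-o appends the
# returned job ID to the given file).
N=100
for i in $(seq 1 "$N"); do
    edg-job-submit --noint -o joblist.txt my.jdl
done

# Check the status of every job recorded in joblist.txt.
edg-job-status --noint -i joblist.txt > status.txt

# Count aborted jobs by scanning the human-readable status output
# (a real production system keeps this bookkeeping in a database).
aborted=$(grep -c "Aborted" status.txt)
echo "$aborted of $N jobs aborted"

# Naive recovery: submit the same number of fresh copies.
for i in $(seq 1 "$aborted"); do
    edg-job-submit --noint -o joblist.txt my.jdl
done

# Retrieve the output sandboxes of the jobs that finished.
edg-job-get-output --noint -i joblist.txt --dir ./output
```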

Example: ATLAS Data Processing
The ATLAS (A Toroidal LHC ApparatuS) experiment at the Large Hadron Collider at CERN will start taking data in proton-proton collisions at a 14 TeV center-of-mass energy, with a collision rate of 10^9 Hz and a storage rate of 200 Hz.
- Total amount of "raw" data: 1 PB/year
- Analysis Object Data target size: 100 kB per event
- Each collaborator must have transparent access to the data
- ~2000 collaborators, ~150 institutes, 34 countries
(A rough check of these figures follows below.)
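A back-of-the-envelope check of the numbers above, assuming roughly 10^7 seconds of effective data taking per year (an assumption, not stated on the slide):

```latex
\[
  \frac{1~\mathrm{PB/year}}{200~\mathrm{Hz}\times 10^{7}~\mathrm{s/year}}
  \approx 0.5~\mathrm{MB\ per\ raw\ event},
  \qquad
  100~\mathrm{kB/event}\times 2\times 10^{9}~\mathrm{events/year}
  \approx 200~\mathrm{TB/year\ of\ AOD}.
\]
```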

ATLAS Productions
Large scale tests of readiness for many components:
- ATLAS Data Challenge 1 (2002): production sites independent, no grid tools. Drawbacks: long delays for some jobs even though other sites were already idle; at least one ATLAS person needed per site.
- ATLAS DC2 (2004): first big test of the ATLAS production system; upgrade of the LCG middleware during production, several versions of the ATLAS software.
- ATLAS Rome production: simulations for the ATLAS Physics Workshop in Rome (June 2005).

ATLAS Production System
ATLAS uses 3 grids:
- LCG (= EGEE)
- NorduGrid (evolved from EDG)
- Grid3 (US)
plus the possibility of local batch submission, i.e. 4 interfaces in total.
Input and output data must be accessible from all grids; jobs vary in their requirements on I/O, CPU time and RAM. ATLAS developed a custom system from several components (a dispatch sketch follows below).
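A toy illustration of what "one job, four submission interfaces" means in practice; this is not the actual ATLAS executor code, and the flavour variable, file names and exact option syntax (in particular for ngsub) are assumptions of the sketch.

```bash
#!/bin/bash
# Toy dispatcher: one abstract job, four back-ends.  The real ATLAS system
# (Windmill supervisor plus Lexor/Dulcinea/Capone/LSF executors) is far more
# elaborate; this only illustrates the fan-out over the four interfaces.
FLAVOUR="$1"        # lcg | nordugrid | grid3 | lsf
JDL="my.jdl"        # LCG/EDG job description
XRSL="my.xrsl"      # NorduGrid xRSL job description
RSL="my.rsl"        # Globus RSL used for a Grid3 gatekeeper
SCRIPT="my_job.sh"  # plain script for the local LSF batch system

case "$FLAVOUR" in
  lcg)       edg-job-submit --noint -o joblist.txt "$JDL" ;;
  nordugrid) ngsub -f "$XRSL" ;;                       # option syntax may differ
  grid3)     globusrun -b -r some-ce.example.org -f "$RSL" ;;
  lsf)       bsub < "$SCRIPT" ;;
  *)         echo "unknown grid flavour: $FLAVOUR" >&2; exit 1 ;;
esac
```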

ATLAS Production System Overview
[Diagram: the supervisor (Windmill) takes job definitions from the production database (prodDB) and hands them, via jabber and SOAP, to per-grid executors (Lexor for LCG, Dulcinea for NorduGrid/NG, Capone for Grid3, plus an LSF executor); the data management system (dms, Don Quijote) together with the RLS and AMI catalogues serves all grids.]

ATLAS Production Rates [plot]

DC2 on LCG [plot]

Production Rates
- July - September 2004: DC2 GEANT4 simulation (long jobs); LCG/EGEE : Grid3 : NorduGrid = 40 : 30 : 30
- October - December 2004: DC2 digitization and reconstruction (short jobs)
- February - May 2005: Rome production (mix of jobs); LCG/EGEE : Grid3 : NorduGrid = 65 : 24 : 11. CondorG improved the efficiency of LCG site usage.

ATLAS Rome Production: countries (sites)
Austria (1), Canada (3), CERN (1), Czech Republic (2), Denmark (3), France (4), Germany (1+2), Greece (1), Hungary (1), Italy (17), Netherlands (2), Norway (2), Poland (1), Portugal (1), Russia (2), Slovakia (1), Slovenia (1), Spain (3), Sweden (5), Switzerland (1+1), Taiwan (1), UK (8), USA (19).
17 countries, 51 sites; 7 countries, 14 sites; 22 countries, 84 sites in total.

Rome Production Statistics
- 73 data sets containing 6.1M events simulated and reconstructed (without pile-up)
- Total simulated data: 8.5M events
- Pile-up done later (1.3M events done, 50k of them reconstructed)

Rome Production: Number of Jobs (as of 17 June 2005) [plot]

Critical Servers
RB, BDII, RLS, SE, UI, DQ, MyProxy, DB servers. At the beginning of DC2 there was just one combo machine running RB, UI, BDII and DQ; this quickly evolved into a complex system of many services running on many machines:
- RB: several machines; one RB at CERN with a reliable disk array; submitters can switch if one machine has problems (but some jobs may be lost).
- UI: 1 or 2 machines per submitter; up to 1000 jobs handled by one instance of Lexor; big memory requirements.
- BDII: 2 machines behind a DNS alias (a check of such a setup is sketched below).
- RLS: single point of failure, cannot be replaced by another machine; problems at the beginning of May stopped production and data replication for several days. The missing features have been implemented in the new catalogues.
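A minimal sketch of checking the "two BDIIs behind a DNS alias" setup from a UI, assuming the standard BDII LDAP endpoint (port 2170, base o=grid); the alias name is illustrative.

```bash
#!/bin/bash
# Resolve the BDII alias to its individual hosts and query each instance.
ALIAS="bdii.example.org"          # hypothetical DNS alias
for host in $(host "$ALIAS" | awk '/has address/ {print $4}'); do
    echo "=== BDII instance $host ==="
    # Count the CEs each instance publishes; a large difference between
    # the two instances would indicate that one of them is stale.
    ldapsearch -x -H "ldap://$host:2170" -b "o=grid" \
        '(objectClass=GlueCE)' GlueCEUniqueID | grep -c "^GlueCEUniqueID:"
done
```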

Critical Servers (cont.)
- DQ server for data management.
- Production DB: an Oracle server shared with other clients; the service is guaranteed by the CERN DB group.
- Other DBs required by the ATLAS software (geometryDB, conditionsDB): MySQL servers. The hard limit of 1000 connections to a server was hit during the Rome production; replica servers were quickly introduced and the code was changed to select between them (a selection sketch follows below).
- SE: problems if input data are on an SE which is down.
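An illustrative sketch (not the actual ATLAS code) of client-side selection between database replicas; the host names and the exported variable are hypothetical, and mysqladmin ping is used as the liveness test.

```bash
#!/bin/bash
# Pick a reachable MySQL replica from a list of candidates and export it
# for the job to use.  Hosts and variable name are hypothetical.
REPLICAS="geomdb1.example.org geomdb2.example.org geomdb3.example.org"

# Shuffle the list so the load spreads across the replicas.
for host in $(echo $REPLICAS | tr ' ' '\n' | shuf); do
    if mysqladmin --connect_timeout=5 -h "$host" ping >/dev/null 2>&1; then
        export ATLAS_GEOMDB_HOST="$host"
        break
    fi
done

if [ -z "$ATLAS_GEOMDB_HOST" ]; then
    echo "no geometry DB replica reachable" >&2
    exit 1
fi
echo "using geometry DB replica $ATLAS_GEOMDB_HOST"
```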

Job Validation
- Output files are marked with a suffix carrying the attempt number.
- Only when the job is validated (by parsing its log file) are the output files renamed and the job marked as validated.
- Renaming the physical files would be difficult (they may already be on tape!), so only the entry in the catalogue is changed (see the sketch below).
- This pollutes disk and tape space with files from failed attempts; no systematic clean-up is done.
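A minimal sketch of a catalogue-level rename with the LCG-2 data management tools (lcg-lg to look up the GUID, lcg-aa/lcg-ra to add and remove LFN aliases); the success marker, LFNs and VO usage are illustrative, and this is not the actual ATLAS validation code.

```bash
#!/bin/bash
# Validate one job attempt by inspecting its log, then "rename" the output
# at catalogue level only: register the final LFN as a new alias for the
# file's GUID and drop the attempt-numbered alias.  The physical replica
# (possibly already on tape) is never touched.
ATTEMPT=3
LOG="job.${ATTEMPT}.log"
OLD_LFN="lfn:/grid/atlas/dc2/evgen.0001.root.${ATTEMPT}"   # illustrative
NEW_LFN="lfn:/grid/atlas/dc2/evgen.0001.root"

# Success marker is illustrative; the real system parses the log in detail.
if grep -q "Transformation finished successfully" "$LOG"; then
    guid=$(lcg-lg --vo atlas "$OLD_LFN")     # GUID behind the old LFN
    lcg-aa --vo atlas "$guid" "$NEW_LFN"     # add the validated name
    lcg-ra --vo atlas "$guid" "$OLD_LFN"     # drop the attempt-numbered name
    echo "attempt $ATTEMPT validated, output registered as $NEW_LFN"
else
    echo "attempt $ATTEMPT failed validation; output left as $OLD_LFN" >&2
fi
```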

Monitoring
Production overview:
- via prodDB (ATLAS specific)
Grid monitors:
- GOC monitor
- Site Functional Tests
- BDII monitors (several)

Monitoring (cont.)
- GridICE ATLAS VO view
- ATLAS production view: http://atlfarm003.mi.infn.it/~negri/cgi-bin/rome_jm.cgi

Black Holes
A wrongly configured site can attract many jobs, since it "processes" them in a very short time. Protection:
- ATLAS software not found: the job sends a mail and sleeps for 4 hours (often caused by NFS server problems); see the wrapper sketch below.
- Automatic exclusion of a site from the BDII if it does not pass the SFT. The SFT runs once a day, which is too long a delay for big production sites, and sometimes the error is caused by an external site. Sites can also be included/excluded by hand (2 persons in different time zones cover almost 24 hours/day). Since spring 2005 it is possible to select which tests are critical: a VO-dependent selection!
- Statistics from the ATLAS prodDB.
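A minimal sketch of the "send a mail and sleep" protection described above, as it might look at the start of a job wrapper; VO_ATLAS_SW_DIR is the standard per-VO software area pointer on LCG worker nodes, while the release path and mail address are illustrative.

```bash
#!/bin/bash
# Guard at the start of the job wrapper: if the ATLAS software release is
# not visible (often an NFS problem on the site), alert the operators and
# sleep so the node does not eat the whole task queue as a "black hole".
RELEASE="10.0.1"                              # illustrative release number
SW_DIR="${VO_ATLAS_SW_DIR:-/opt/atlas}"       # per-VO software area on the WN

if [ ! -d "$SW_DIR/software/$RELEASE" ]; then
    {
        echo "ATLAS release $RELEASE not found on $(hostname)"
        echo "time: $(date -u)"
    } | mail -s "black hole suspected" atlas-prod-ops@example.org   # illustrative address
    sleep 4h        # keep the slot busy instead of failing instantly
    exit 1
fi

# ... the normal job payload would follow here ...
```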

Monitor of Blocked CPUs
- Started by running qstat on the local farm.
- Later extended to more sites using globus-job-run (see the sketch below).
- Only some sites are scanned; LSF is not supported.
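A sketch of the remote-qstat idea, assuming GRAM gatekeepers reachable with globus-job-run and a PBS/Torque qstat on the far end; the CE host names are illustrative.

```bash
#!/bin/bash
# For each (hypothetical) CE, run qstat remotely through the gatekeeper and
# count running vs. queued jobs.  Only PBS/Torque-style batch systems are
# covered, matching the slide (LSF not supported).
CES="ce1.farm.example.org ce2.farm.example.org"

for ce in $CES; do
    echo "=== $ce ==="
    # globus-job-run executes a command on the remote gatekeeper host;
    # plain qstat prints one job per line with the state in column 5.
    globus-job-run "$ce" /usr/bin/qstat \
        | awk 'NR>2 && $5=="R"{r++} NR>2 && $5=="Q"{q++} END{printf "running=%d queued=%d\n", r+0, q+0}'
done
```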

Job Distribution
- Some sites had many jobs in their queues while others had free resources.
- Different ranking expressions were tried, based on the ERT (estimated response time), the number of waiting jobs, ... (see the JDL sketch below).
- The local sites of the submitters were better used.
- Problems arise if a site publishes incorrect information (local MDS or BDII stuck or dead).
- Information per VO is missing; the new Glue schema should help.
- CondorG bypassed the RB and gave a better distribution.
- LHCb approach: submit many jobs as placeholders; the actual content of a job is defined only when the job starts.
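A sketch of what "different ranking expressions" look like in an LCG-2 JDL, using the Glue attributes GlueCEStateEstimatedResponseTime, GlueCEStateWaitingJobs and GlueCEStateFreeCPUs; the payload and the particular weighting are illustrative.

```bash
#!/bin/bash
# Write a JDL whose Rank steers jobs away from sites with long queues,
# then submit it.  JDL comments use //.
cat > my.jdl <<'EOF'
Executable    = "run_atlas.sh";                 // illustrative payload
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"run_atlas.sh"};
OutputSandbox = {"stdout.log","stderr.log"};

// Default-style ranking: prefer the shortest estimated response time.
// Rank = -other.GlueCEStateEstimatedResponseTime;

// Alternative tried during the productions (weights are illustrative):
Rank = other.GlueCEStateFreeCPUs - 0.1 * other.GlueCEStateWaitingJobs;

Requirements = other.GlueCEStateStatus == "Production";
EOF

edg-job-submit --noint -o joblist.txt my.jdl
```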

Local Problems (experience from 2 CZ sites)
- Load on the local SE when many jobs start at once: no crash, but an enormous increase of wall-clock time per job; dCache installed on a new machine, now under test.
- Stability of the NFS server: solved by several kernel upgrades.
- Disk array crash: a backplane problem led to data loss; the RLS entries for the lost files were removed.
- HyperThreading: not used on the nodes for ATLAS production; some performance increase in simulation tests, but it introduces more dependence on the other jobs running on the same node.

Local Problems (cont.)
- Job distribution based on the default expression for the ERT: no new jobs for the bigger site while some jobs were already running, although some CPUs were still free; OK after a change in the evaluation of the ERT.
- Remote SE overloaded or down: if the input data are on an SE with problems, the lcg-cp command blocks (a timeout was introduced later in Lexor; see the sketch below); if the predefined SE for the output is not available, lcg-cr blocks.
- A misbehaving job had an infinite loop writing to a log file and took all available space on the local disk, crashing the other job on the WN too: the $EDG_JOB_SCRATCH definition was missing on a reinstalled WN, so the job filled the shared /home partition.
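A minimal sketch of a guarded copy around the lcg-cp/lcg-cr data management commands; the coreutils timeout wrapper, file names and time limits are illustrative additions, not the actual Lexor fix.

```bash
#!/bin/bash
# Copy an input file from the grid into the job's scratch area, but do not
# let a sick SE block the job forever; fail the attempt instead.
SCRATCH="${EDG_JOB_SCRATCH:-$PWD}"              # per-job scratch dir when defined
LFN="lfn:/grid/atlas/dc2/input.0001.root"       # illustrative input

if ! timeout 1800 lcg-cp --vo atlas "$LFN" "file:$SCRATCH/input.root"; then
    echo "input copy from the SE timed out or failed, aborting attempt" >&2
    exit 1
fi

# ... run the payload, producing output.root in $SCRATCH ...

# Register the output, again with a hard upper bound on the copy time.
timeout 1800 lcg-cr --vo atlas -l lfn:/grid/atlas/dc2/output.0001.root \
    "file:$SCRATCH/output.root" \
  || { echo "output registration blocked or failed" >&2; exit 1; }
```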

Special Jobs (DC2)
- Pile-up jobs combine a signal event with several background events.
- They required > 1 GB RAM per job and 700 GB of input data with background events.
- The data were copied to selected sites in advance using DQ; a special "InputHint" was set for these jobs.
- Each job used 8 min. bias input files (~250 MB each), downloaded from the close SE, and 1 input file with the signal (see the staging sketch below).
- The selected sites allowed ATLAS jobs only on machines with enough RAM.
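A sketch of the input staging for such a pile-up job, assuming the standard VO_ATLAS_DEFAULT_SE variable points to the close SE on the worker node; the LFN patterns and the replica-selection logic are illustrative, not the actual production code.

```bash
#!/bin/bash
# Stage the pile-up inputs: 8 minimum-bias files plus 1 signal file,
# preferring the replica on the site's close SE.
CLOSE_SE="$VO_ATLAS_DEFAULT_SE"
SCRATCH="${EDG_JOB_SCRATCH:-$PWD}"

fetch() {
    local lfn="$1" dest="$2" surl
    # Pick the replica on the close SE if one exists, otherwise let
    # lcg-cp choose a replica itself from the LFN.
    surl=$(lcg-lr --vo atlas "$lfn" | grep "$CLOSE_SE" | head -1)
    lcg-cp --vo atlas "${surl:-$lfn}" "file:$dest"
}

for i in $(seq 1 8); do
    fetch "lfn:/grid/atlas/dc2/minbias.${i}.root" "$SCRATCH/minbias.${i}.root"
done
fetch "lfn:/grid/atlas/dc2/signal.0001.root" "$SCRATCH/signal.root"
```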

Conclusions
- ATLAS DC2 and the Rome production were done on 3 grids; LCG/EGEE had the biggest share.
- A rate of several thousand jobs/day was achieved.
- The middleware is not yet mature enough; many problems were met, but all of them were solved.
- Productions require a lot of manpower: mostly covered by ATLAS, with good support from the EIS team; some services are managed by other CERN groups (DB, RLS, BDII, CASTOR, ...).

Pile-up Jobs [plot]

Outlook
- Only large scale tests such as the Data Challenges can find the problems and bottlenecks in the system; they must continue.
- Service Challenges: several phases, starting with tests of basic services and adding more; phase 3, during the 2nd half of 2005, will include the LHC experiments.
- gLite components should solve some problems (WMS, new data catalogues).
- ATLAS DC3 (Computing System Commissioning) in 2006.
- LHC data taking starts already in 2007; we must have a reliable, well tested system by then.

Thanks
- to my colleagues from the ATLAS production team for the good cooperation leading to such good results
- special thanks to Gilbert Poulard for the plots
- to the organizers of this EGEE school