INFN GRID Workshop Bari, 26th October 2004

Results of the LHCb DC04
Vincenzo Vagnoni, INFN Bologna
INFN GRID Workshop, Bari, 26th October 2004

Outline
Aims of the LHCb DC04
Production model
Performance of DC04
Lessons from DC04
Conclusions

Computing Goals
Main goal: gather information for the LHCb Computing TDR.
Robustness test of the LHCb software and production system.
Test of the LHCb distributed computing model, including distributed analyses; a realistic test of the analysis environment needs realistic analyses.
Incorporation of the LCG application area software into the LHCb production environment.
Use of LCG resources (at least 50% of the production capacity).
DC04 is split into 3 phases:
Production: MC simulation, digitization and reconstruction.
Stripping: event pre-selection with loose cuts to reduce the DST data set.
End user analysis.

Physics goals
HLT studies, consolidating efficiencies.
Background/signal studies, consolidating background estimates and background properties.
Validation of Gauss/Geant4 and of the generators.
This requires a quantitative increase in the number of signal and background events compared to DC03:
30 × 10^6 signal events
15 × 10^6 specific background events
125 × 10^6 background events (B events with inclusive decays + minimum bias, ratio 1:1.8)

Production model: DIRAC and LCG
Production started using mainly DIRAC, the LHCb distributed computing system:
Light implementation based on python scripts.
Easy to deploy on various platforms.
Non-intrusive (no root privileges, no dedicated machines at sites).
Easy to configure, maintain and operate.
During DC04 the production was moved to LCG:
Using LCG services to deploy the DIRAC infrastructure.
Sending a DIRAC agent as a regular LCG job.
Turning a WN into a virtual LHCb production site.

DIRAC Services and Resources
(Architecture diagram.) User interfaces (production manager, GANGA UI, user CLI, job monitor, bookkeeping query web page, FileCatalog browser) talk to the DIRAC services: Job Management, Job Monitoring, Job Accounting (with its AccountingDB), Information, File Catalog, Monitoring and Bookkeeping services. The Job Management Service dispatches work to the resources: DIRAC CEs, DIRAC sites running agents, and the LCG Resource Broker in front of the LCG CEs. DIRAC Storage is accessed via disk files, gridftp, bbftp and rfio.

LHCb way to LCG: Dynamically Deployed Agents
The Workload Management System:
Puts all jobs in its task queue.
Immediately submits, in push mode, an agent to all CEs which satisfy the initial matchmaking job requirements.
This agent runs a number of configuration checks on the WN; only once these are satisfied does it pull the real job onto the WN.
Born originally as a hack, this approach has shown several benefits:
It copes with misconfiguration problems, minimizing their effect.
When the grid is full and there are no free CEs, jobs are pulled by the queues which are progressing better.
Jobs are consumed and executed in the order of submission.
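
A minimal sketch of the pull model described above. All names (TaskQueue, submit_pilot, checks_pass, the set-based requirement matching) are hypothetical and only illustrate the logic, not the actual DIRAC interfaces:

from collections import deque

class TaskQueue:
    """Central queue: real jobs wait here until a healthy agent pulls them."""
    def __init__(self):
        self.jobs = deque()                  # FIFO: jobs served in submission order

    def add_job(self, job):
        self.jobs.append(job)

    def pull_job(self, wn_capabilities):
        # Hand out the oldest job whose requirements the worker node satisfies.
        # 'requirements' and 'wn_capabilities' are assumed to be sets.
        for job in list(self.jobs):
            if job["requirements"] <= wn_capabilities:
                self.jobs.remove(job)
                return job
        return None                          # nothing suitable: the agent exits quietly

def deploy_agents(task_queue, compute_elements, submit_pilot):
    """Push one pilot agent to every CE matching any queued job (hypothetical)."""
    for ce in compute_elements:
        if any(job["requirements"] <= ce["capabilities"] for job in task_queue.jobs):
            submit_pilot(ce)                 # e.g. an LCG job wrapping the DIRAC agent

def agent_main(task_queue, wn_capabilities, run_task, checks_pass):
    """What the agent does once it lands on a worker node (hypothetical)."""
    if not checks_pass(wn_capabilities):     # configuration checks first
        return                               # misconfigured WN: no real job is wasted
    job = task_queue.pull_job(wn_capabilities)
    if job is not None:
        run_task(job)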

LHCb job
On an LCG site:
Input SandBox: a small bash script (~50 lines).
Check the environment: site, hostname, CPU, memory, disk space...
Install DIRAC: download the DIRAC tarball (~1 MB) and deploy DIRAC on the WN.
Execute the job: request a DIRAC task (an LHCb simulation job), execute the task, check the steps, upload the results.
Retrieval and analysis of the Output SandBox.
On a DIRAC (non-LCG) site:
DIRAC deployment on the CE.
DIRAC JobAgent: check the CE status, request a DIRAC task, install the LHCb software if needed, submit the job to the local batch system.
Execute the task: check the steps, upload the results via the DIRAC TransferAgent.
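
The real sandbox script is a ~50-line bash script; the following is a hedged Python rendering of the same steps, with placeholder helper names (request_dirac_task, execute_task, upload_results) and a placeholder tarball URL:

import os
import platform
import shutil
import socket
import tarfile
import urllib.request

DIRAC_TARBALL_URL = "https://example.org/dirac.tar.gz"   # placeholder, not the real location

def check_environment():
    """Collect the checks the wrapper performs before pulling any work."""
    info = {
        "site": os.environ.get("SITE_NAME", "unknown"),
        "hostname": socket.gethostname(),
        "cpu": platform.processor(),
        "free_disk_gb": shutil.disk_usage(".").free / 1e9,
    }
    # A real agent would refuse to pull a job if these checks fail.
    return info["free_disk_gb"] > 1.0, info

def install_dirac(workdir="dirac_install"):
    """Download the small DIRAC tarball (~1 MB) and unpack it on the WN."""
    os.makedirs(workdir, exist_ok=True)
    local = os.path.join(workdir, "dirac.tar.gz")
    urllib.request.urlretrieve(DIRAC_TARBALL_URL, local)
    with tarfile.open(local) as tar:
        tar.extractall(workdir)
    return workdir

def run_wrapper(request_dirac_task, execute_task, upload_results):
    """Glue the steps together: check, install, pull a task, run, upload."""
    ok, info = check_environment()
    if not ok:
        return                      # unhealthy WN: give the slot back untouched
    install_dirac()
    task = request_dirac_task(info) # e.g. an LHCb simulation job
    if task is None:
        return                      # no work available
    status = execute_task(task)     # application steps, checked one by one
    upload_results(task, status)    # transfer and register the output data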

Strategy: DIRAC and LCG
Test sites: each site is tested with special and production-like jobs, then enabled in the DIRAC Workload Management System.
Always keep jobs in the queues.
DIRAC sites: run the Local Agent continuously on the CE, via cron jobs or via a daemon.
LCG: submit agent jobs continuously, via a cron job on a User Interface.
PS: LCG is considered as a single site from the DIRAC point of view.
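
A short sketch of the "keep submitting agent jobs" idea run from cron on the User Interface; count_queued_agents and submit_agent_job are placeholders for whatever monitoring and submission commands the UI actually provides:

TARGET_QUEUED_AGENTS = 50          # illustrative target, not a DC04 figure

def top_up_agents(count_queued_agents, submit_agent_job):
    """Keep a fixed number of agent jobs queued so any free slot gets a pilot."""
    queued = count_queued_agents()
    for _ in range(max(0, TARGET_QUEUED_AGENTS - queued)):
        submit_agent_job()          # one more pilot waiting in the LCG queues

# Called once per cron invocation, e.g. every 10 minutes:
# top_up_agents(count_queued_agents, submit_agent_job)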

Data Storage
All the output of the reconstruction (DSTs) is sent to CERN (as Tier0).
Intermediate files are not kept.
DSTs produced at a Tier1 (or at a Tier2 associated with a Tier1) are also kept in one of our 5 Tier1s:
CNAF (Italy), Karlsruhe (Germany), Lyon (France), PIC (Spain), RAL (United Kingdom).
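
A hedged encoding of the placement policy described above. The Tier1 names come from the slide; the Tier2-to-Tier1 mapping and the function itself are illustrative only:

TIER1S = {"CNAF", "Karlsruhe", "Lyon", "PIC", "RAL"}

# Hypothetical mapping from a producing Tier2 site to its associated Tier1.
ASSOCIATED_TIER1 = {
    "Legnaro": "CNAF",
    "Torino": "CNAF",
}

def dst_destinations(producing_site):
    """Return the sites a DST from this producing site should be copied to."""
    destinations = {"CERN"}                      # Tier0 always gets a copy
    if producing_site in TIER1S:
        destinations.add(producing_site)         # produced at a Tier1: keep it there
    elif producing_site in ASSOCIATED_TIER1:
        destinations.add(ASSOCIATED_TIER1[producing_site])
    return destinations

# Example: dst_destinations("Legnaro") -> {"CERN", "CNAF"}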

Integrated event yield
(Plot of the cumulative number of produced events.) With DIRAC alone the rate was about 1.8 × 10^6 events/day; with LCG in action it reached 3-5 × 10^6 events/day (including a pause and restart of LCG). Phase 1 was completed with 186 M produced events.

Daily performance
(Plot of events produced per day.) Peak rate of about 5 million events/day.

Production Share
LCG: 4 Resource Brokers in use (2 at CERN, 1 at RAL, 1 at CNAF); GRID-IT resources used as well.
20 DIRAC sites and 43 LCG sites contributed.
Italian shares: DIRAC CNAF 5.56%, LCG CNAF 4.10%, Legnaro 2.08%, TO 0.72%, MI 0.53%, PD 0.10%, FE 0.09%, NA 0.06%, Roma 0.05%, CA 0.05%, CT 0.03%, BA 0.01%.

Production Share (II)

Tier tape storage
Tier0: CERN — 185.5 M events, 62 TB.
Tier1s (events in 10^6 / size in TB):
CNAF 37.1 / 12.6
RAL 19.5 / 6.5
PIC 16.6 / 5.4
Karlsruhe 12.5 / 4
Lyon 4.4 / 1.5

DIRAC – LCG: CPU share
~370 (successful) CPU·years in total.
May: 88% DIRAC / 12% LCG (11% of DC'04)
June: 78% / 22% (25% of DC'04)
July: 75% / 25% (22% of DC'04)
August: 26% / 74% (42% of DC'04)

DC04 LCG Performance
Main failure categories:
Missing python, failed DIRAC installation, failed connection to the DIRAC servers, failed software installation...
Errors while running the applications (hardware, system, LHCb software...).
Errors while transferring or registering the output data (can be recovered with a retry).
LHCb accounting: 81k successful LCG jobs.

LHCb DC04 phases 2/3
Phase 2: stripping, starting in the next days.
Data set reduction is needed for efficient access to the data in user-driven random analysis.
An analysis job either executes a physics selection on signal + background events with loose cuts, or selects events passing the L0+L1 trigger on minimum bias events.
Need to run over 65 TB of data distributed over 5 Tier1 sites (CERN, CNAF, FZK, PIC, Lyon), with "small" CPU requirements.
The produced datasets (~1 TB) will be distributed to all Tier1s.
Phase 3: end user analysis will follow; GANGA tools are in preparation.
(Phase 1) Keep a continuous rate of production activity with programmed mini data challenges (i.e., a few days once a month).

Lessons learnt: LCG
Improve the OutputSandBox upload/retrieval mechanism: it should also be available for failed and aborted jobs.
Improve the reliability of the CE status collection methods.
Add intelligence on the CE or RB to detect and avoid large numbers of jobs aborting on start-up: prevent a misconfigured site from becoming a black hole.
Need to collect the LCG log information and a tool to navigate it.
Need a way to limit the CPU (and wall-clock) time: the LCG wrapper must issue appropriate signals to the user job to allow graceful termination.
Problems with site configurations (LCG configuration, firewalls, gridFTP servers...).
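
A minimal sketch of the graceful-termination point above: a wrapper that watches the wall-clock budget and sends the payload a warning signal before killing it. The choice of SIGUSR1 as the warning (Unix only) and the grace period are assumptions, not the actual LCG wrapper behaviour:

import signal
import subprocess
import time

GRACE_PERIOD = 300            # seconds the payload gets to save its state (assumed)

def run_with_limit(command, wallclock_limit):
    """Run the payload, warn it near the wall-clock limit, kill it only as a last resort."""
    start = time.time()
    proc = subprocess.Popen(command)
    while proc.poll() is None:
        if time.time() - start > wallclock_limit - GRACE_PERIOD:
            proc.send_signal(signal.SIGUSR1)     # "finish up" warning to the job
            try:
                proc.wait(timeout=GRACE_PERIOD)  # time to close and upload output files
            except subprocess.TimeoutExpired:
                proc.kill()                      # hard stop only if it ignores the warning
            break
        time.sleep(30)
    return proc.returncode

# Example with a hypothetical payload: run_with_limit(["./lhcb_job.sh"], 36 * 3600)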

Conclusions
LHCb DC04 Phase 1 is over, and the production target was achieved: 186 M events, ~50% on LCG resources (75-80% during the last weeks).
The LHCb strategy was successful: submitting "empty" DIRAC agents to LCG has proven to be very flexible, allowing a success rate above that of LCG alone.
There is big room for improvement, both on DIRAC and on LCG:
DIRAC needs to improve the reliability of its servers; a big step was already made during the DC.
LCG needs to improve the single-job efficiency: ~40% of jobs aborted, and ~10% did the work but failed from the LCG viewpoint.
In both cases extra protections against external failures (network, unexpected shutdowns...) must be built in.
Success was due to the dedicated support from the LCG team and the DIRAC site managers.