1 DIRAC – LHCb MC production system
A. Tsaregorodtsev, CPPM, Marseille
For the LHCb Data Management team
CHEP, La Jolla, 25 March 2003
2 Outline
Introduction
DIRAC architecture
Implementation details
Deploying DIRAC on the DataGRID
Conclusions
3 What is it all about?
Distributed MC production system for LHCb:
  Production task definition and steering;
  Software installation on production sites;
  Job scheduling and monitoring;
  Data transfers and bookkeeping.
Automates most of the production tasks, with minimum participation of local production managers.
PULL rather than PUSH concept for job scheduling.
DIRAC – Distributed Infrastructure with Remote Agent Control
4 DIRAC architecture
[Diagram: central Production, Monitoring and Bookkeeping services; SW agents at Sites A, B, C and D get jobs from the Production service and send back monitoring info and bookkeeping data.]
5 Advantages of the PULL approach
Better use of resources:
  no idle or forgotten CPU power;
  natural load balancing – a more powerful center automatically gets more work.
Less burden on the central production service:
  it deals only with production task definition and bookkeeping;
  it does not need to know about particular production sites.
No direct access to local disks from the central service.
Easy introduction of new sites into the production system:
  no information on local sites is necessary at the central site.
6 Job description
[Diagram: example workflow descriptions built from application steps (Pythia v2, Gauss v5, GenTag v7, Brunel v12) are combined with a production run description (event type, application options, number of events, execution mode, destination site, …) to give XML job descriptions; the production manager edits them through Web based editors and they are stored in the production DB.]
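The XML schema itself is not shown on the slide; the sketch below only illustrates how a production run description with the fields listed above might be turned into an XML job description in Python (present-day module names; all element names and values are placeholders, not the actual DIRAC schema).

# Hypothetical job-description builder; the element names and all values are
# placeholders, not the actual DIRAC XML schema.
import xml.etree.ElementTree as ET

def make_job_xml(event_type, n_events, mode, destination, steps):
    job = ET.Element("job")
    ET.SubElement(job, "eventType").text = event_type
    ET.SubElement(job, "numberOfEvents").text = str(n_events)
    ET.SubElement(job, "executionMode").text = mode
    ET.SubElement(job, "destinationSite").text = destination
    workflow = ET.SubElement(job, "workflow")
    for application, version, options in steps:
        step = ET.SubElement(workflow, "step",
                             application=application, version=version)
        ET.SubElement(step, "options").text = options
    return ET.tostring(job)

# Example run using the applications named on the slide (options invented).
xml_job = make_job_xml("min-bias", 500, "normal", "ANY",
                       [("Pythia", "v2", "Pythia.opts"),
                        ("Gauss", "v5", "Gauss.opts"),
                        ("Brunel", "v12", "Brunel.opts")])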
7 Agent operations
[Sequence diagram: the production agent interacts with the local batch system, the Production service, the SW distribution service, the Monitoring service, the Bookkeeping service, Mass Storage and the running job. Calls shown: isQueueAvailable(), requestJob(queue), installPackage(), submitJob(queue), setJobStatus(step 1) … setJobStatus(step n), sendBookkeeping(), sendFileToCastor(), addReplica().]
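Read as a pull cycle, the agent side of these calls might look roughly like the sketch below; the service URLs, the exact signatures and the pairing of calls to services are assumptions, only the call names come from the slide (the 2003 agent used the Python 2 xmlrpclib module, the sketch uses the present-day module name).

# Sketch of one agent pull cycle; the service URLs, the exact signatures and
# the 'batch' helper object are assumptions, only the call names come from
# the slide above.
import xmlrpc.client   # 'xmlrpclib' in the Python 2 used by the 2003 system

production = xmlrpc.client.ServerProxy("http://dirac-prod.example.org:8080")
monitoring = xmlrpc.client.ServerProxy("http://dirac-mon.example.org:8080")

def run_cycle(queue, batch):
    # 'batch' stands for the site-local batch-system interface (see slide 9).
    if not batch.isQueueAvailable(queue):
        return                                  # the local queue is full
    job_xml = production.requestJob(queue)      # pull an XML job description
    if not job_xml:
        return                                  # no work for this site right now
    # installPackage() calls to the SW distribution service would go here.
    job_id = batch.submitJob(queue, job_xml)    # hand the job to the batch system
    monitoring.setJobStatus(job_id, "submitted")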
8 Implementation details
Central web services:
  XML-RPC servers;
  Web based editing and visualization;
  ORACLE production and bookkeeping databases.
Agent - a set of collaborating Python classes:
  Python, to be sure it is compatible with all the sites;
  standard Python library XML-RPC client;
  the agent runs as a daemon process or as a cron job on a production site;
  easily extendable via plugins: for new applications; for new tools, e.g. file transport.
Data and log file transfer using bbftp.
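The plugin mechanism itself is not detailed on the slide; the sketch below shows one common Python pattern for it: a registry of transport classes, so that a new file-transport tool can be added without touching the agent core. The class and function names are illustrative, not taken from the DIRAC code.

# Illustrative plugin registry for file-transport tools; the names are
# assumptions, only the idea of per-tool plugins (e.g. bbftp) comes from
# the slide.
TRANSPORTS = {}

def register_transport(name):
    """Class decorator adding a transport implementation to the registry."""
    def decorator(cls):
        TRANSPORTS[name] = cls
        return cls
    return decorator

@register_transport("bbftp")
class BBFTPTransport:
    def put(self, local_path, remote_url):
        # The site's bbftp client would be invoked here; the exact command
        # line depends on the local installation, so it is left as a stub.
        raise NotImplementedError

def get_transport(name):
    # The agent core only looks things up in the registry, so a new tool
    # can be plugged in without changing the core.
    return TRANSPORTS[name]()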
9 Agent customization at a production site
Easy setting up of a production site is crucial to absorb all available resources.
One Python script holds all the local configuration:
  interface to the local batch system;
  interface to the local mass storage system.
The agent distribution comes with examples of typical cases.
A "standard" site, e.g. PBS + disk mass storage, can be configured in a few minutes.
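A local-configuration script for the "standard" PBS + disk-storage case mentioned above might look roughly like this; the class and attribute names are hypothetical, only the idea of a single site-specific Python script comes from the slide.

# Hypothetical local_config.py for a "standard" PBS + disk-storage site; the
# class and attribute names are illustrative, not the real DIRAC interfaces.
import os
import shutil
import subprocess

class PBSBatch:
    """Interface to the local PBS batch system."""
    def isQueueAvailable(self, queue):
        # A real check would parse `qstat -Q`; here we only verify qsub exists.
        return shutil.which("qsub") is not None

    def submitJob(self, queue, script_path):
        out = subprocess.check_output(["qsub", "-q", queue, script_path])
        return out.decode().strip()              # the PBS job identifier

class DiskStorage:
    """Interface to the local mass storage: here just a directory on disk."""
    base = "/storage/lhcb/prod"                  # placeholder path

    def store(self, local_path):
        dest = os.path.join(self.base, os.path.basename(local_path))
        shutil.copy(local_path, dest)
        return dest

# The agent imports these two objects; nothing else is site-specific.
BATCH = PBSBatch()
STORAGE = DiskStorage()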
10 Dealing with failures
A job is rescheduled if the local system fails to run it; other sites can then pick it up.
Journaling: all the sensitive files (logs, bookkeeping, job descriptions) are kept in caches at the production site.
A job can be restarted from where it failed; accomplished steps are not redone.
File transfers are automatically retried after a predefined pause in case of failures.
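The retry-after-a-pause behaviour can be sketched as a small wrapper around the transfer call; the retry count, the pause length and the transfer callable below are placeholders, not DIRAC values.

# Minimal retry-with-pause wrapper for file transfers; the retry count,
# the pause length and the 'transfer' callable are placeholders.
import time

def transfer_with_retries(transfer, retries=3, pause_seconds=600):
    """Call transfer() and, if it raises, retry it after a fixed pause."""
    for attempt in range(1, retries + 1):
        try:
            return transfer()
        except Exception:
            if attempt == retries:
                raise                    # give up; the job can be rescheduled
            time.sleep(pause_seconds)    # the predefined pause before retrying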
11 Working experience
The DIRAC production system was deployed on 17 LHCb production sites; customization took from 2 hours to 2 days of work per site.
Smooth running for MC production tasks.
Much less burden for local production managers:
  automatic data upload to CERN/Castor;
  log files automatically available through a Web page;
  automatic recovery from common failures (job submission, data transfers).
The current Data Challenge production using DIRAC is running ahead of schedule:
  ~1000 CPUs in total used;
  1M events produced per day.
12 DIRAC on the DataGRID
[Diagram: jobs (job.xml wrapped in JDL) are submitted through a DataGRID portal to the Resource Broker and run on worker nodes (WN); the agent on the WN talks to the DIRAC Production, Monitoring and Bookkeeping services, and uses the Replica manager, the Replica catalog and the CERN SE to store data in Castor.]
13 Deploying agents on the DataGRID
INPUT: the JDL InputSandbox contains:
  the job XML description;
  the agent launcher script:
    > wget ‘
    > dmsetup --local DataGRID
    > shoot_agent job.xml
OUTPUT:
  the EDG replica_manager is used for data transfer to CERN SE/Castor;
  log files are passed back via the OutputSandbox.
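For concreteness, the JDL wrapper for one such job might be generated as below; only the InputSandbox/OutputSandbox attributes and the two sandbox files are taken from the slide, the remaining attribute values and the launcher script name are assumptions about a typical EDG JDL.

# Sketch of generating the JDL wrapper for one job; attribute values and the
# launcher script name are assumptions, not taken from the DIRAC system.
def make_jdl(job_xml="job.xml", launcher="agent_launcher.sh"):
    return "\n".join([
        'Executable    = "%s";' % launcher,
        'Arguments     = "%s";' % job_xml,
        'StdOutput     = "std.out";',
        'StdError      = "std.err";',
        'InputSandbox  = {"%s", "%s"};' % (job_xml, launcher),
        'OutputSandbox = {"std.out", "std.err"};',
    ])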
14 Tests on the DataGRID testbed
Standard LHCb production jobs were used for the tests: jobs of different statistics with an 8-step workflow.
Jobs were submitted to 4 EDG testbed Resource Brokers, keeping ~50 jobs per broker; software was installed for each job.

Job type (hours)    Total    Success    Success rate (%)
Mini (0.2)
Short (6)
Medium (24)
Total

A total of ~300K events has been produced so far. This already makes the EDG testbed a competitive LHCb production site.
15 Main problems
EDG middleware instability:
  MDS information system failures – "no matching resources found";
  RB fails to get input files because of gridftp failures;
  jobs stuck in some unfinished state: "Done", "Resubmitted", etc.
Long jobs suffering from site misconfiguration:
  RB fails to find appropriate resources;
  jobs hit the limits of the local batch system;
  "Estimated Traversal Time" fails as a ranking criterion.
Software installation failures:
  disk quotas;
  forbidden outbound IP connections on WNs at some sites.
16 Some lessons learnt
An API for software installation is needed:
  for experiments, to install software independently from site managers, on a per-job basis if necessary;
  for site managers, to be sure the software is installed in an organized way.
Outbound IP connectivity should be available:
  needed for the software installation;
  needed for jobs exchanging messages with the production services.
Uniform site descriptions: an EDG uniform CPU unit?
17 Conclusions
The DIRAC production system is now routinely running in production at ~17 sites.
The PULL paradigm for job scheduling has proved very successful.
It is of great help to local production managers and a key to the success of the LHCb Data Challenge 2003.
The DataGRID testbed is integrated into the DIRAC production system; extensive tests are in progress.