1 DIRAC – LHCb MC production system
A. Tsaregorodtsev, CPPM, Marseille
For the LHCb Data Management team
CHEP, La Jolla, 25 March 2003
2 Outline
Introduction
DIRAC architecture
Implementation details
Deploying DIRAC on the DataGRID
Conclusions
3 What is it all about?
Distributed MC production system for LHCb:
  Production task definition and steering;
  Software installation on production sites;
  Job scheduling and monitoring;
  Data transfers and bookkeeping.
Automates most of the production tasks, with minimum participation of local production managers.
PULL rather than PUSH concept for job scheduling.
DIRAC – Distributed Infrastructure with Remote Agent Control
4 DIRAC architecture
[Diagram: central Production, Monitoring and Bookkeeping services; SW agents at Sites A, B, C and D get jobs from the Production service and send back monitoring info and bookkeeping data.]
5 Advantages of the PULL approach
Better use of resources:
  no idle or forgotten CPU power;
  natural load balancing – a more powerful center automatically gets more work.
Less burden on the central production service:
  it deals only with production task definition and bookkeeping;
  it does not need to know about particular production sites.
No direct access to local disks from the central service.
Easy introduction of new sites into the production system:
  no information on local sites is necessary at the central site.
6 Job description
[Diagram: example workflow descriptions built from application steps (Pythia v2, Gauss v5, GenTag v7, Brunel v12) are combined with a production run description (event type, application options, number of events, execution mode, destination site, …) to give XML job descriptions; the production manager edits them through Web based editors and they are stored in the production DB.]
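The XML schema itself is not shown on the slide; the sketch below only illustrates how a production run description with the fields listed above might be turned into an XML job description in Python (present-day module names; all element names and values are placeholders, not the actual DIRAC schema).

# Hypothetical job-description builder; the element names and all values are
# placeholders, not the actual DIRAC XML schema.
import xml.etree.ElementTree as ET

def make_job_xml(event_type, n_events, mode, destination, steps):
    job = ET.Element("job")
    ET.SubElement(job, "eventType").text = event_type
    ET.SubElement(job, "numberOfEvents").text = str(n_events)
    ET.SubElement(job, "executionMode").text = mode
    ET.SubElement(job, "destinationSite").text = destination
    workflow = ET.SubElement(job, "workflow")
    for application, version, options in steps:
        step = ET.SubElement(workflow, "step",
                             application=application, version=version)
        ET.SubElement(step, "options").text = options
    return ET.tostring(job)

# Example run using the applications named on the slide (options invented).
xml_job = make_job_xml("min-bias", 500, "normal", "ANY",
                       [("Pythia", "v2", "Pythia.opts"),
                        ("Gauss", "v5", "Gauss.opts"),
                        ("Brunel", "v12", "Brunel.opts")])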
7 Agent operations
[Sequence diagram: the production agent interacts with the local batch system, the Production service, the SW distribution service, the Monitoring service, the Bookkeeping service, Mass Storage and the running job. Calls shown: isQueueAvailable(), requestJob(queue), installPackage(), submitJob(queue), setJobStatus(step 1) … setJobStatus(step n), sendBookkeeping(), sendFileToCastor(), addReplica().]
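Read as a pull cycle, the agent side of these calls might look roughly like the sketch below; the service URLs, the exact signatures and the pairing of calls to services are assumptions, only the call names come from the slide (the 2003 agent used the Python 2 xmlrpclib module, the sketch uses the present-day module name).

# Sketch of one agent pull cycle; the service URLs, the exact signatures and
# the 'batch' helper object are assumptions, only the call names come from
# the slide above.
import xmlrpc.client   # 'xmlrpclib' in the Python 2 used by the 2003 system

production = xmlrpc.client.ServerProxy("http://dirac-prod.example.org:8080")
monitoring = xmlrpc.client.ServerProxy("http://dirac-mon.example.org:8080")

def run_cycle(queue, batch):
    # 'batch' stands for the site-local batch-system interface (see slide 9).
    if not batch.isQueueAvailable(queue):
        return                                  # the local queue is full
    job_xml = production.requestJob(queue)      # pull an XML job description
    if not job_xml:
        return                                  # no work for this site right now
    # installPackage() calls to the SW distribution service would go here.
    job_id = batch.submitJob(queue, job_xml)    # hand the job to the batch system
    monitoring.setJobStatus(job_id, "submitted")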
8 Implementation details
Central web services:
  XML-RPC servers;
  Web based editing and visualization;
  ORACLE production and bookkeeping databases.
Agent - a set of collaborating Python classes:
  Python, to be sure it is compatible with all the sites;
  standard Python library XML-RPC client;
  the agent runs as a daemon process or as a cron job on a production site;
  easily extendable via plugins: for new applications; for new tools, e.g. file transport.
Data and log file transfer using bbftp.
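The plugin mechanism itself is not detailed on the slide; the sketch below shows one common Python pattern for it: a registry of transport classes, so that a new file-transport tool can be added without touching the agent core. The class and function names are illustrative, not taken from the DIRAC code.

# Illustrative plugin registry for file-transport tools; the names are
# assumptions, only the idea of per-tool plugins (e.g. bbftp) comes from
# the slide.
TRANSPORTS = {}

def register_transport(name):
    """Class decorator adding a transport implementation to the registry."""
    def decorator(cls):
        TRANSPORTS[name] = cls
        return cls
    return decorator

@register_transport("bbftp")
class BBFTPTransport:
    def put(self, local_path, remote_url):
        # The site's bbftp client would be invoked here; the exact command
        # line depends on the local installation, so it is left as a stub.
        raise NotImplementedError

def get_transport(name):
    # The agent core only looks things up in the registry, so a new tool
    # can be plugged in without changing the core.
    return TRANSPORTS[name]()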
9 Agent customization at a production site
Easy setting up of a production site is crucial to absorb all available resources.
One Python script holds all the local configuration:
  interface to the local batch system;
  interface to the local mass storage system.
The agent distribution comes with examples of typical cases.
A "standard" site, e.g. PBS + disk mass storage, can be configured in a few minutes.
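A local-configuration script for the "standard" PBS + disk-storage case mentioned above might look roughly like this; the class and attribute names are hypothetical, only the idea of a single site-specific Python script comes from the slide.

# Hypothetical local_config.py for a "standard" PBS + disk-storage site; the
# class and attribute names are illustrative, not the real DIRAC interfaces.
import os
import shutil
import subprocess

class PBSBatch:
    """Interface to the local PBS batch system."""
    def isQueueAvailable(self, queue):
        # A real check would parse `qstat -Q`; here we only verify qsub exists.
        return shutil.which("qsub") is not None

    def submitJob(self, queue, script_path):
        out = subprocess.check_output(["qsub", "-q", queue, script_path])
        return out.decode().strip()              # the PBS job identifier

class DiskStorage:
    """Interface to the local mass storage: here just a directory on disk."""
    base = "/storage/lhcb/prod"                  # placeholder path

    def store(self, local_path):
        dest = os.path.join(self.base, os.path.basename(local_path))
        shutil.copy(local_path, dest)
        return dest

# The agent imports these two objects; nothing else is site-specific.
BATCH = PBSBatch()
STORAGE = DiskStorage()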
10 Dealing with failures
A job is rescheduled if the local system fails to run it; other sites can then pick it up.
Journaling: all the sensitive files (logs, bookkeeping, job descriptions) are kept in caches at the production site.
A job can be restarted from where it failed; accomplished steps are not redone.
File transfers are automatically retried after a predefined pause in case of failures.
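The retry-after-a-pause behaviour can be sketched as a small wrapper around the transfer call; the retry count, the pause length and the transfer callable below are placeholders, not DIRAC values.

# Minimal retry-with-pause wrapper for file transfers; the retry count,
# the pause length and the 'transfer' callable are placeholders.
import time

def transfer_with_retries(transfer, retries=3, pause_seconds=600):
    """Call transfer() and, if it raises, retry it after a fixed pause."""
    for attempt in range(1, retries + 1):
        try:
            return transfer()
        except Exception:
            if attempt == retries:
                raise                    # give up; the job can be rescheduled
            time.sleep(pause_seconds)    # the predefined pause before retrying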
11 Working experience
The DIRAC production system was deployed on 17 LHCb production sites; customization took from 2 hours to 2 days of work per site.
Smooth running for MC production tasks.
Much less burden for local production managers:
  automatic data upload to CERN/Castor;
  log files automatically available through a Web page;
  automatic recovery from common failures (job submission, data transfers).
The current Data Challenge production using DIRAC is running ahead of schedule:
  ~1000 CPUs in total used;
  1M events produced per day.
12 DIRAC on the DataGRID
[Diagram: jobs (job.xml wrapped in JDL) are submitted through a DataGRID portal to the Resource Broker and run on worker nodes (WN); the agent on the WN talks to the DIRAC Production, Monitoring and Bookkeeping services, and uses the Replica manager, the Replica catalog and the CERN SE to store data in Castor.]
13 Deploying agents on the DataGRID
INPUT: the JDL InputSandbox contains:
  the job XML description;
  the agent launcher script:
    > wget ‘
    > dmsetup --local DataGRID
    > shoot_agent job.xml
OUTPUT:
  the EDG replica_manager is used for data transfer to CERN SE/Castor;
  log files are passed back via the OutputSandbox.
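For concreteness, the JDL wrapper for one such job might be generated as below; only the InputSandbox/OutputSandbox attributes and the two sandbox files are taken from the slide, the remaining attribute values and the launcher script name are assumptions about a typical EDG JDL.

# Sketch of generating the JDL wrapper for one job; attribute values and the
# launcher script name are assumptions, not taken from the DIRAC system.
def make_jdl(job_xml="job.xml", launcher="agent_launcher.sh"):
    return "\n".join([
        'Executable    = "%s";' % launcher,
        'Arguments     = "%s";' % job_xml,
        'StdOutput     = "std.out";',
        'StdError      = "std.err";',
        'InputSandbox  = {"%s", "%s"};' % (job_xml, launcher),
        'OutputSandbox = {"std.out", "std.err"};',
    ])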
14 Tests on the DataGRID testbed
Standard LHCb production jobs were used for the tests: jobs of different statistics with an 8-step workflow.
Jobs were submitted to 4 EDG testbed Resource Brokers, keeping ~50 jobs per broker; software was installed for each job.

Job type (hours)    Total    Success    Success rate (%)
Mini (0.2)
Short (6)
Medium (24)
Total

A total of ~300K events has been produced so far. This already makes the EDG testbed a competitive LHCb production site.
15 Main problems
EDG middleware instability:
  MDS information system failures – "no matching resources found";
  RB fails to get input files because of gridftp failures;
  jobs stuck in some unfinished state: "Done", "Resubmitted", etc.
Long jobs suffering from site misconfiguration:
  RB fails to find appropriate resources;
  jobs hit the limits of the local batch system;
  "Estimated Traversal Time" fails as a ranking criterion.
Software installation failures:
  disk quotas;
  forbidden outbound IP connections on WNs at some sites.
16 Some lessons learnt
An API for software installation is needed:
  for experiments, to install software independently from site managers, on a per-job basis if necessary;
  for site managers, to be sure the software is installed in an organized way.
Outbound IP connectivity should be available:
  needed for the software installation;
  needed for jobs exchanging messages with the production services.
Uniform site descriptions: an EDG uniform CPU unit?
17 Conclusions
The DIRAC production system is now routinely running in production at ~17 sites.
The PULL paradigm for job scheduling has proved very successful.
It is of great help to local production managers and a key to the success of the LHCb Data Challenge 2003.
The DataGRID testbed is integrated into the DIRAC production system; extensive tests are in progress.