1 Results of the LHCb DC04
Vincenzo Vagnoni, INFN Bologna
INFN GRID Workshop, Bari, 26th October 2004

2 Outline
- Aims of the LHCb DC04
- Production model
- Performance of DC04
- Lessons from DC04
- Conclusions

3 Computing Goals
Main goal: gather information for the LHCb Computing TDR
- Robustness test of the LHCb software and production system
- Test of the LHCb distributed computing model, including distributed analysis
- Realistic test of the analysis environment (needs realistic analyses)
- Incorporation of the LCG application area software into the LHCb production environment
- Use of LCG resources (at least 50% of the production capacity)
DC04 split into 3 phases:
- Production: MC simulation, digitization and reconstruction
- Stripping: event pre-selection with loose cuts to reduce the DST data set
- End user analysis

4 Physics goals
- HLT studies, consolidating efficiencies
- Background/Signal studies, consolidating background estimates and background properties
- Validation of Gauss/Geant4 and the generators
Requires a quantitative increase in the number of signal and background events compared to DC03:
- signal events
- specific background events
- background events (B events with inclusive decays + minimum bias, ratio 1:1.8)

5 Production model: DIRAC and LCG
Production was started using mainly DIRAC, the LHCb distributed computing system:
- Light implementation with Python scripts
- Easy to deploy on various platforms
- Non-intrusive (no root privileges, no dedicated machines on sites)
- Easy to configure, maintain and operate
During DC04, production was moved to LCG:
- Using LCG services to deploy the DIRAC infrastructure
- Sending a DIRAC agent as a regular LCG job
- Turning a WN into a virtual LHCb production site

6 DIRAC Services and Resources
[Architecture diagram: user interfaces (production manager, GANGA UI, user CLI, job monitor, BK query web page, FileCatalog browser) talk to the DIRAC services (Job Management Service, JobMonitorSvc, JobAccountingSvc with its AccountingDB, InformationSvc, FileCatalogSvc, MonitoringSvc, BookkeepingSvc), which dispatch work to the resources: DIRAC CEs, DIRAC sites running Agents, the LCG Resource Broker with its CEs, and DIRAC Storage (disk files accessed via gridftp, bbftp, rfio).]

7 The LHCb way to LCG: Dynamically Deployed Agents
The Workload Management System:
- Puts all jobs in its task queue
- Immediately submits, in push mode, an agent to all CEs which satisfy the initial matchmaking job requirements:
  - This agent performs many configuration checks on the WN
  - Only once these are satisfied does it pull the real jobs onto the WN
Born originally as a hack, it has shown several benefits:
- It copes with misconfiguration problems, minimizing their effect
- When the grid is full and there are no free CEs, it pulls jobs to queues which are progressing better
- Jobs are consumed and executed in the order of submission
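A minimal Python sketch of this pull model follows. The service interface and all names (check_wn_configuration, TaskQueueClient, run_task) are hypothetical illustrations under the assumptions above, not the actual DIRAC code.

```python
# Sketch: an agent pushed to a CE validates the WN, then pulls a real job.
import sys

def check_wn_configuration():
    """Configuration checks the agent performs on the WN before pulling work."""
    checks = {
        "python_available": sys.version_info >= (2, 2),
        # further checks in practice: disk space, outbound connectivity, memory...
    }
    return all(checks.values())

class TaskQueueClient:
    """Stand-in for the central task queue of the Workload Management System."""
    def request_task(self):
        # Would contact the Job Management Service and return the oldest
        # matching LHCb production task, or None if the queue is empty.
        return None

def run_task(task):
    """Placeholder for executing the pulled LHCb job on the WN."""
    print("running", task)

def agent_main():
    # The agent itself was pushed to the CE as a regular LCG job.
    if not check_wn_configuration():
        return  # a misconfigured WN wastes only an agent, not a real production job
    task = TaskQueueClient().request_task()
    if task is not None:
        run_task(task)

if __name__ == "__main__":
    agent_main()
```

Because real jobs are pulled only after the checks pass, a misconfigured site consumes cheap agents instead of production jobs, which is what keeps it from becoming a black hole.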

8 LHCb job
LCG site:
- Input SandBox: small bash script (~50 lines)
- Check environment: site, hostname, CPU, memory, disk space...
- Install DIRAC: download the DIRAC tarball (~1 MB), deploy DIRAC on the WN
- Execute the job: request a DIRAC task (LHCb simulation job), execute the task, check the steps, upload the results
- Retrieval of the SandBox
- Analysis of the retrieved Output SandBox
DIRAC (non-LCG) site:
- DIRAC deployment (CE)
- DIRAC JobAgent: check CE status, request a DIRAC task, install the LHCb software if needed, submit the job to the local batch system
- Execute the task: check the steps, upload the results
- DIRAC TransferAgent
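The actual wrapper is a ~50-line bash script; the sketch below mirrors its steps in Python only for illustration. The tarball URL, paths, and function names are assumptions, not the real DC04 script.

```python
# Sketch of the wrapper steps on an LCG worker node: inspect the WN,
# fetch and unpack the small DIRAC tarball, then hand over to the agent.
import os
import platform
import shutil
import tarfile
import urllib.request

DIRAC_TARBALL_URL = "http://example.org/dirac.tar.gz"   # hypothetical location

def check_environment():
    """Report the basic WN properties the wrapper inspects."""
    return {
        "hostname": platform.node(),
        "cpu": platform.processor(),
        "free_disk_mb": shutil.disk_usage(".").free // 2**20,
    }

def install_dirac(workdir):
    """Download the ~1 MB DIRAC tarball and unpack it on the WN."""
    tarball = os.path.join(workdir, "dirac.tar.gz")
    urllib.request.urlretrieve(DIRAC_TARBALL_URL, tarball)
    tarfile.open(tarball).extractall(workdir)

def run_job(workdir):
    print("WN environment:", check_environment())
    install_dirac(workdir)
    # The deployed agent then requests a DIRAC task (an LHCb simulation job),
    # executes it step by step, and uploads the results.
```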

9 Strategy
Test sites: each site is tested with special and production-like jobs.
Enable the site in the DIRAC Workload Management System.
Always keep jobs in the queues:
- DIRAC: run the Local Agent continuously on the CE, via cron jobs or via a daemon
- LCG: submit agent jobs continuously, via a cron job on a User Interface (see the sketch below)
PS: LCG is considered as a single site from the DIRAC point of view
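A minimal sketch of the LCG side of this strategy: a cron-driven script that keeps a steady stream of agent jobs in the queues. The submission command and the batch size are placeholders, not the actual commands used in DC04.

```python
# Sketch: periodically submit agent jobs so there is always work queued on LCG.
import subprocess

SUBMIT_CMD = ["lcg-job-submit", "dirac-agent.jdl"]   # placeholder CLI invocation
AGENTS_PER_RUN = 10                                  # assumed cron batch size

def submit_agents(n=AGENTS_PER_RUN):
    """Submit n independent agent jobs; each one becomes a virtual LHCb production site."""
    for _ in range(n):
        subprocess.call(SUBMIT_CMD)

if __name__ == "__main__":
    # Intended to be invoked periodically from cron on a User Interface,
    # e.g.:  */30 * * * *  python submit_agents.py
    submit_agents()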

10 Data Storage
All the output of the reconstruction (DSTs) is sent to CERN (as Tier0).
The intermediate files are not kept.
DSTs produced at a Tier1 (or at a Tier2 associated with a Tier1) are also kept at one of our 5 Tier1s:
- CNAF (Italy)
- Karlsruhe (Germany)
- Lyon (France)
- PIC (Spain)
- RAL (United Kingdom)
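The placement policy can be summarized as a small routing function. The site-to-Tier1 association shown here is an illustrative assumption (only the five Tier1s and CERN are from the slide).

```python
# Sketch of the DST placement policy: every DST goes to CERN (Tier0),
# plus a copy at the Tier1 associated with the producing site, if any.
ASSOCIATED_TIER1 = {
    "CNAF": "CNAF", "Legnaro": "CNAF",        # example Tier2 -> Tier1 association
    "RAL": "RAL", "PIC": "PIC", "Lyon": "Lyon", "Karlsruhe": "Karlsruhe",
}

def dst_destinations(production_site):
    destinations = {"CERN"}                   # Tier0 always gets a copy
    tier1 = ASSOCIATED_TIER1.get(production_site)
    if tier1:
        destinations.add(tier1)
    return sorted(destinations)

# e.g. dst_destinations("Legnaro") -> ["CERN", "CNAF"]
```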

11 Integrated event yield
[Plot of the integrated event yield over time, with annotations: DIRAC alone, LCG in action (.../day), LCG paused, restarted (.../day), Phase 1 completed. Total: 186 M produced events.]

12 Daily performance
5 million events/day

13 Production Share
LCG: 4 RBs in use (2 at CERN, 1 at RAL, 1 at CNAF).
20 DIRAC sites and 43 LCG sites; GRID-IT resources used as well.
Italian shares: DIRAC CNAF 5.56%, LCG CNAF 4.10%, Legnaro 2.08%, TO 0.72%, MI 0.53%, PD 0.10%, FE 0.09%, NA 0.06%, Roma 0.05%, CA 0.05%, CT 0.03%, BA 0.01%.

14 Production Share (II)

15 Tier tape storage
Tier 0        Nb of events   Size (TB)
CERN          185.5 M        62

Tier 1        Nb of events (in 10^6)   Size (TB)
CNAF          37.1                     12.6
RAL           19.5                     6.5
PIC           16.6                     5.4
Karlsruhe     12.5                     4
Lyon          4.4                      1.5

16 DIRAC – LCG: CPU share
~370 (successful) CPU·years
- May: 88% : 12% (DIRAC : LCG), 11% of DC'04
- Jun: 78% : 22%, 25% of DC'04
- Jul: 75% : 25%, 22% of DC'04
- Aug: 26% : 74%, 42% of DC'04

17 DC04 LCG Performance
Failure categories:
- Missing Python, failed DIRAC installation, failed connection to the DIRAC servers, failed software installation...
- Errors while running the applications (hardware, system, LHCb software...)
- Errors while transferring or registering output data (can be recovered with a retry)
LHCb accounting: 81k successful LCG jobs

18 LHCb DC04 phases 2/3
Phase 2: Stripping, starting in the next days.
- Data set reduction is needed for efficient access to the data in a user-driven random analysis.
- Analysis job that either executes a physics selection on signal + background events with loose cuts, or selects events passing the L0+L1 trigger on minimum bias events.
- Needs to run over 65 TB of data distributed over 5 Tier1 sites (CERN, CNAF, FZK, PIC, Lyon), with "small" CPU requirements.
- The produced datasets (~1 TB) will be distributed to all Tier1s.
Phase 3: End user analysis will follow; GANGA tools in preparation.
(Phase 1): Keep a continuous rate of production activity with programmed mini DCs (i.e., a few days once a month).
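The two stripping paths can be illustrated with a small Python sketch; the event fields and the loose-cut threshold are purely hypothetical, only the either/or logic comes from the slide.

```python
# Sketch: an event is kept either because it passes a loose physics
# pre-selection (signal + background samples) or because it passes the
# L0+L1 trigger (minimum bias samples).
def keep_event(event, sample_type):
    if sample_type == "minimum_bias":
        return event.get("passed_L0", False) and event.get("passed_L1", False)
    # Loose physics pre-selection for signal and specific background events.
    return event.get("selection_score", 0.0) > 0.1   # hypothetical loose cut

# Stripping reduces the sample so analysis jobs run over ~1 TB of
# pre-selected DSTs instead of the full 65 TB.
events = [{"passed_L0": True, "passed_L1": True}, {"selection_score": 0.05}]
kept = [e for e in events if keep_event(e, "minimum_bias")]
```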

19 Lessons learnt: LCG
- Improve the OutputSandBox upload/retrieval mechanism: it should also be available for failed and aborted jobs.
- Improve the reliability of the CE status collection methods.
- Add intelligence on the CE or RB to detect and avoid large numbers of jobs aborting on start-up: avoid a misconfigured site becoming a black hole.
- Need to collect LCG log info and a tool to navigate it.
- Need a way to limit the CPU (and wall-clock) time: the LCG wrapper must issue appropriate signals to the user job to allow graceful termination (see the sketch below).
- Problems with site configurations (LCG config, firewalls, gridFTP servers...)
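A minimal sketch of the graceful-termination behaviour asked of the LCG wrapper: when the limit is reached, send the user job a catchable signal first so it can save its results, and hard-kill only after a grace period. The function name and the grace period are illustrative assumptions.

```python
# Sketch: run a job under a wall-clock limit with a warning signal before the kill.
import signal
import subprocess
import time

GRACE_PERIOD = 300   # seconds between the warning signal and the hard kill (assumed)

def run_with_limit(cmd, wall_clock_limit):
    job = subprocess.Popen(cmd)
    start = time.time()
    while job.poll() is None:
        if time.time() - start > wall_clock_limit:
            job.send_signal(signal.SIGTERM)   # give the job a chance to wrap up
            try:
                job.wait(timeout=GRACE_PERIOD)
            except subprocess.TimeoutExpired:
                job.kill()                    # hard kill only after the grace period
            break
        time.sleep(10)
    return job.returncode
```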

20 Conclusions
LHCb DC04 Phase 1 is over.
The production target was achieved: 186 M events, ~50% on LCG resources (75-80% during the last weeks).
The LHCb strategy was successful: submitting "empty" DIRAC agents to LCG has proven to be very flexible, allowing a success rate above that of LCG alone.
There is big room for improvement, both on DIRAC and LCG:
- DIRAC needs to improve the reliability of its servers: a big step was already made during the DC.
- LCG needs improvement on the single-job efficiency: ~40% aborted jobs, ~10% did the work but failed from the LCG viewpoint.
- In both cases extra protections against external failures (network, unexpected shutdowns...) must be built in.
Success was due to the dedicated support from the LCG team and the DIRAC site managers.

