Published by Eric Powell. Modified over 9 years ago.

Results of the LHCb experiment Data Challenge 2004
Joël Closier, CERN / LHCb
CHEP’04

Result of LHCb DC04
The LHCb DC04 team
DIRAC
– Andrei Tsaregorodtsev, Vincent Garonne, Ian Stokes-Rees
Production management
– Joel Closier, Ricardo Graciani (LCG), Johan Blouw, Andrew Pickford
– … and the LHCb site managers
LHCb Bookkeeping, Monitoring & Accounting
– Markus Frank, Carmine Cioffi, Manuel Sanchez, Ruben Vizcaya
LCG-LHCb liaison
– Flavia Donno, Roberto Santinelli
The LCG-GDA team
– Ian Bird, Laurence Field, Maarten Litmaath, Markus Schulz, David Smith, Zdenek Sekera, Marco Serra…

Outline
Aims of the LHCb Data Challenge 2004
Production model
Performance of DC’04
Lessons from DC’04
Conclusions

LHCb DC’04 aims
Main goal: gather information to be used for writing the LHCb computing Technical Design Report
– Robustness test of the LHCb software and production system
  Using software as realistic as possible in terms of performance
– Test of the LHCb distributed computing model
  Including distributed analyses
  A realistic test of the analysis environment needs realistic analyses
– Incorporation of the LCG application area software into the LHCb production environment
– Use of LCG resources (at least 50% of the production capacity)
– 3 phases
  Production: MC simulation and reconstruction
  Stripping: event pre-selection
  Analysis

LHCb DC04 aims (cont’d)
Physics goals
– HLT studies, consolidating efficiencies
– Background/signal studies, consolidating background estimates and background properties
Requires a quantitative increase in the number of signal and background events compared to DC03:
– 30 × 10^6 signal events
– 15 × 10^6 specific background events
– 125 × 10^6 background events (B inclusive + minimum bias, ratio 1:1.8)

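As a quick check on the requested statistics, the three samples can be added up and the background sample split by the quoted ratio. This is only a sketch: it assumes the 1:1.8 ratio reads as B inclusive to minimum bias, in that order, which the slide does not state explicitly. All variable names are illustrative.

```python
# Requested DC04 sample sizes in events, as quoted on the slide.
SIGNAL = 30_000_000
SPECIFIC_BACKGROUND = 15_000_000
BACKGROUND = 125_000_000

total_requested = SIGNAL + SPECIFIC_BACKGROUND + BACKGROUND

# Assumption: the 1:1.8 ratio is B inclusive : minimum bias, in that order.
RATIO_B_INCLUSIVE = 1.0
RATIO_MIN_BIAS = 1.8

b_inclusive = BACKGROUND * RATIO_B_INCLUSIVE / (RATIO_B_INCLUSIVE + RATIO_MIN_BIAS)
min_bias = BACKGROUND - b_inclusive

if __name__ == "__main__":
    print(f"total requested: {total_requested / 1e6:.0f} M events")
    print(f"B inclusive:     {b_inclusive / 1e6:.1f} M events")
    print(f"minimum bias:    {min_bias / 1e6:.1f} M events")
```
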
Production
Production done with the DIRAC system
– Track 4 - Distributed Computing Services: id 377
DIRAC is deployed at each site participating in DC’04
Central services supporting the Data Challenge
– Production database
– Workload Management System
– Monitoring, Accounting
– Bookkeeping, AliEn File Catalog
Technologies used by the production services
– C++, Python, XML-RPC
– Oracle and MySQL databases

LHCb job
Non-LCG site
1. DIRAC deployment (CE).
2. DIRAC JobAgent:
   – Check CE status.
   – Request a DIRAC task (jdl).
   – Install LHCb software if needed.
   – Submit the job to the local batch system.
   – Execute task.
   – Check steps.
   – Upload results.
3. DIRAC TransferAgent.
LCG site
1. Input SandBox:
   – Small bash script (~50 lines).
     1. Check environment: site, hostname, CPU, memory, disk space…
     2. Install DIRAC: download the DIRAC tarball (~1 MB), deploy DIRAC on the WN.
     3. Execute the job:
        A. Request a DIRAC task (LHCb simulation job)
        B. Execute task
        C. Check steps
        D. Upload results
2. Retrieval of SandBox
3. Analysis of retrieved Output SandBox

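The agent steps listed on this slide follow a pull model: the agent checks its resource, asks a central service for work, runs it step by step, and uploads the result. The sketch below illustrates that control flow only. It is not the DIRAC JobAgent; `Task`, `JobAgent`, `fetch_task`, `ce_available` and `upload` are hypothetical names, and batch-system submission and real software installation are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    """A unit of work handed out by a (hypothetical) central task queue."""
    task_id: int
    steps: List[Callable[[], bool]]  # each step returns True on success


@dataclass
class JobAgent:
    """Toy pull-model agent; an illustration, not the real DIRAC JobAgent."""
    fetch_task: Callable[[], Optional[Task]]  # "Request a DIRAC task"
    ce_available: Callable[[], bool]          # "Check CE status"
    upload: Callable[[int], None]             # "Upload results"
    software_installed: bool = False
    log: List[str] = field(default_factory=list)

    def run_once(self) -> str:
        """Run one agent cycle and return a short status string."""
        if not self.ce_available():
            return "ce-unavailable"
        task = self.fetch_task()
        if task is None:
            return "no-task"
        if not self.software_installed:
            # "Install LHCb software if needed" (only recorded here).
            self.log.append("install software")
            self.software_installed = True
        # "Execute task" and "Check steps": stop at the first failing step.
        for index, step in enumerate(task.steps):
            if not step():
                self.log.append(f"task {task.task_id}: step {index} failed")
                return "failed"
        self.upload(task.task_id)
        return "done"
```

On an LCG worker node the same cycle would be started by the small bash script from the input sandbox after it has downloaded and deployed DIRAC; on a non-LCG site it would be started by cron, runsv or a daemon.
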
Strategy
Test sites:
– Each site is tested with special and production-like jobs.
Enable site:
– DIRAC Workload Management System.
Always keep jobs in the queues
DIRAC: run the local agent continuously:
– Via cron jobs
– Via runsv
– Via daemon
LCG: submit jobs continuously:
– Via cron job on the User Interface
PS: LCG is considered as a site from the DIRAC point of view

Data Storage
All the output of the reconstruction phase (DST) is sent to CERN (as Tier 0).
Intermediate files are not kept.
DSTs are also stored at one of our 5 Tier 1 centres:
– CNAF (Italy)
– Karlsruhe (Germany)
– Lyon (France)
– PIC (Spain)
– RAL (United Kingdom)

DC’04 performance

Phase 1 results
[Figure; annotations: DIRAC alone, LCG in action, 1.8 × 10^6 /day, LCG paused, Phase 1 completed, 3-5 × 10^6 /day, LCG restarted, 186 M produced events]

Daily performance
[Figure; annotation: 5 million/day]

Sites involved
43 LCG sites (8 also DIRAC sites)
20 DIRAC sites
Used resources from non-LHCb countries, e.g. Hungary produced ~2M events

Simultaneous jobs (a snapshot)

TIER storage
Tier 1       Nb of events    Size (TB)
CNAF         37 129 350      12.6
RAL          19 462 850      6.5
PIC          16 505 010      5.4
Karlsruhe    12 486 300      4
Lyon         4 368 656       1.5

Tier 0       Nb of events    Size (TB)
CERN         187 557 231     62

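The storage figures on this slide can be cross-checked with a few lines of arithmetic: summing the Tier 1 rows and dividing size by event count gives an average event size. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 kB = 10^3 bytes), which the slide does not specify; the dictionary and function names are illustrative.

```python
# (events, size in TB) per centre, copied from the slide.
TIER1 = {
    "CNAF": (37_129_350, 12.6),
    "RAL": (19_462_850, 6.5),
    "PIC": (16_505_010, 5.4),
    "Karlsruhe": (12_486_300, 4.0),
    "Lyon": (4_368_656, 1.5),
}
TIER0 = {"CERN": (187_557_231, 62.0)}


def kb_per_event(events: int, size_tb: float) -> float:
    """Average event size in kB, assuming 1 TB = 1e12 bytes and 1 kB = 1e3 bytes."""
    return size_tb * 1e12 / events / 1e3


tier1_events = sum(events for events, _ in TIER1.values())
tier1_size_tb = sum(size for _, size in TIER1.values())

if __name__ == "__main__":
    print(f"Tier 1 total: {tier1_events} events, {tier1_size_tb:.1f} TB")
    print(f"Tier 1 average: {kb_per_event(tier1_events, tier1_size_tb):.0f} kB/event")
    print(f"CERN average:   {kb_per_event(*TIER0['CERN']):.0f} kB/event")
```

Under these assumptions the Tier 1 and Tier 0 averages come out within about 1% of each other, which is what one would expect if both hold the same DST format.
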
DIRAC-LCG: events share
50% of events were produced using LCG

DIRAC – LCG: CPU share
Month    DIRAC : LCG    Share of DC’04
May      88% : 12%      11%
Jun      78% : 22%      25%
Jul      75% : 25%      22%
Aug      26% : 74%      42%
Total: 376 CPU·years

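The monthly figures on this slide can be combined into one overall CPU share by weighting each month's split with that month's fraction of DC'04. The sketch below assumes each pair reads DIRAC:LCG, following the order in the slide title; that reading and all names in the code are assumptions.

```python
# month -> (DIRAC CPU share, LCG CPU share, month's fraction of DC'04)
# Assumption: each pair on the slide is ordered DIRAC:LCG, as in the title.
MONTHS = {
    "May": (0.88, 0.12, 0.11),
    "Jun": (0.78, 0.22, 0.25),
    "Jul": (0.75, 0.25, 0.22),
    "Aug": (0.26, 0.74, 0.42),
}

total_weight = sum(weight for _, _, weight in MONTHS.values())

# Weighted average of the monthly splits.
overall_lcg = sum(lcg * weight for _, lcg, weight in MONTHS.values()) / total_weight
overall_dirac = sum(dirac * weight for dirac, _, weight in MONTHS.values()) / total_weight

if __name__ == "__main__":
    print(f"overall DIRAC CPU share: {overall_dirac:.1%}")
    print(f"overall LCG CPU share:   {overall_lcg:.1%}")
```
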
LCG performance
211k jobs submitted to LCG
After running:
– 113k done (successful)
– 34k aborted
LCG efficiency: 61%

DC’04 lessons

Lessons learnt: DIRAC
The concept of light, customizable and simple-to-deploy agents proved to be very effective
Easy update procedure: bug fixes to DIRAC tools propagate quickly
Application software installation is triggered by a running job
Most of the central services were running on the same machine
– Too many processes, high loads
Improve server availability
Improve error handling and reporting

Lessons learnt: LCG
Improve the OutputSandBox upload/retrieval mechanism:
– Should also be available for failed and aborted jobs.
Improve the reliability of CE status collection methods (timestamps?).
Add intelligence on the CE or RB to detect and avoid large numbers of aborted jobs on start-up:
– Prevent a misconfigured site from becoming a black hole.
Need to collect LCG log info, and a tool to navigate it (including different JobIDs).
Need a way to limit the CPU (and wall-clock) time:
– The LCG wrapper must issue appropriate signals to the user job to allow graceful termination.
How-to manuals:
– Clear instructions to site managers on the procedure to shut down a site (for maintenance and/or upgrade).
– Problems with site configurations (LCG config, firewalls, gridFTP servers..)

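The request for signals that allow graceful termination can be illustrated from the job's side. In the sketch below the job installs a handler that only records the stop request, and the main loop checks the flag between steps so that the current step is never cut off. Which signal an LCG wrapper would send is not stated on the slide; `SIGTERM` is an assumption, and `GracefulJob` is a hypothetical class.

```python
import signal
from typing import Callable, Iterable, List, Tuple


class GracefulJob:
    """Runs named steps and stops cleanly between steps once a stop is requested."""

    def __init__(self) -> None:
        self.stop_requested = False
        self.completed: List[str] = []

    def request_stop(self, signum=None, frame=None) -> None:
        """Signal handler: only record the request; the main loop acts on it."""
        self.stop_requested = True

    def install(self, signals: Iterable[int] = (signal.SIGTERM,)) -> None:
        """Register the handler. Assumption: the wrapper would send SIGTERM."""
        for signum in signals:
            signal.signal(signum, self.request_stop)

    def run(self, steps: Iterable[Tuple[str, Callable[[], None]]]) -> str:
        for name, step in steps:
            if self.stop_requested:
                # A real job would save partial output and logs here.
                return "stopped"
            step()
            self.completed.append(name)
        return "finished"


if __name__ == "__main__":
    job = GracefulJob()
    job.install()
    print(job.run([("simulate", lambda: None), ("reconstruct", lambda: None)]))
```

Keeping the handler trivial and doing the clean-up in the main loop avoids running complex code in the middle of a step, at the cost of reacting only at step boundaries.
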
Conclusions
LHCb DC’04 Phase 1 is over.
The production target was achieved:
– 186 M events in 424 CPU years.
– ~50% on LCG resources (75-80% in the last weeks).
LHCb strategy successful:
– Submitting “empty” DIRAC agents to LCG has proven to be very flexible, allowing a success rate above that of LCG alone.
Large room for improvement, both in DIRAC and in LCG:
– DIRAC needs to improve the reliability of the servers: a big step was already made during the DC.
– LCG needs to improve the single-job efficiency: ~40% aborted jobs, ~10% did the work but failed from the LCG viewpoint.
– In both cases extra protection against external failures (network, unexpected shutdowns…) must be built in.
Success due to dedicated support from the LCG team and the DIRAC site managers

Other links
CHEP04 talks:
– File-Metadata Management System for the LHCb Experiment (Track 4 - Distributed Computing Services), id 392, 27-Sep-2004 17:30
– DIRAC Workload Management System (Track 5 - Distributed Computing Systems and Experiences), id 365, 29-Sep-2004 10:00
– Grid Information and Monitoring System using XML-RPC and Instant Messaging for DIRAC (Track 4 - Distributed Computing Services), id 368, 29-Sep-2004 10:00
– DIRAC - The Distributed MC Production and Analysis for LHCb (Track 4 - Distributed Computing Services), id 377, 30-Sep-2004 18:10
– A Lightweight Monitoring and Accounting System for LHCb DC04 Production (Track 4 - Distributed Computing Services), id 388, 30-Sep-2004 17:30