Phase 2 of the Physics Data Challenge ’04
Peter Hristov, for the ALICE DC team
Russia-CERN Joint Group on Computing
CERN, September 20, 2004

2 Outline
 Purpose and conditions of Phase 2
 Job structure
 Experiences and improvements: AliEn and LCG
 Statistics (up to today)
 Toward Phase 3
 Conclusions

3 Phase 2 purpose and tasks
 Merging of signal events with different physics content into the underlying Pb+Pb events (underlying events are reused several times)
 Test of:
   Standard production of signal events
   Stress test of network and file transfer tools
   Storage at remote SEs, stability (crucial for Phase 3)
 Conditions, jobs…:
   62 different conditions
   340K jobs, 15.2M events
   10 TB of produced data
   200 TB of data transferred from CERN
   500 MSI2k hours of CPU

4 Repartition of tasks (physics signals): details shown on the original slide

5 Structure of event production in Phase 2 (schematic on the original slide):
 Central AliEn servers: master job submission, Job Optimizer (splitting into N sub-jobs), RB, File Catalogue, process monitoring and control, SE…
 Sub-jobs are processed on the AliEn CEs and, through the AliEn-LCG interface, on the LCG RB and CEs
 Underlying event input files are read from CERN CASTOR; the zip archive of the output files is stored with its primary copy on the local SE and a backup copy at CERN CASTOR
 Output files are registered in the AliEn File Catalogue; for files on LCG SEs the LCG LFN is recorded as the AliEn PFN, using edg(lcg) copy&register (see the sketch below)
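The copy-and-register step of this chain can be pictured with a minimal Python sketch. It is an illustration under assumptions, not the actual AliEn-LCG interface code: the copy_and_register helper, the paths and the use of subprocess are invented for this example, and only the lcg-cr command with its --vo, -d and -l options comes from the standard LCG data-management client.

```python
import subprocess

def copy_and_register(local_file, lfn, dest_se, vo="alice"):
    """Copy one output archive from the worker node to the SE local to the CE
    and register it in the LCG catalogue; the resulting LCG LFN would then be
    stored in the AliEn File Catalogue as the file's PFN (that AliEn-side call
    is not shown here)."""
    cmd = [
        "lcg-cr",
        "--vo", vo,                  # virtual organisation
        "-d", dest_se,               # destination Storage Element
        "-l", "lfn:" + lfn,          # logical file name in the LCG catalogue
        "file:" + local_file,        # source file on the worker node
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()     # lcg-cr prints the GUID of the new entry

# Hypothetical usage on a worker node:
# guid = copy_and_register("/tmp/job123.zip",
#                          "/alice/sim/PDC04/jet-cent1/job123.zip",
#                          "se.local-site.example")
```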

6 Experience: AliEn
 AliEn system improvements:
   AliEn processes tables split into "running" (lightweight) and "done" (archive), allowing faster process tracking
   Implemented symbolic links and event groups (through sophisticated search algorithms): the underlying events are grouped (through symbolic links) in a directory for a specific signal event type; for example, 1660 underlying events are used for one jet signal condition, another 1660 for the next, and so on for all 12 conditions
   Implemented zip archiving, mainly to overcome the limitations of the tape systems (fewer, larger files); see the sketch below
   Fast resubmission of failed jobs (in this phase all jobs must finish)
   New job monitoring tools, including single-job trace logs from start to finish, with logical steps and timing
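As a rough illustration of the archiving step (the directory layout, file patterns and function name are assumptions for this sketch, not the actual AliEn implementation):

```python
import zipfile
from pathlib import Path

def archive_job_output(output_dir, archive_name="job_output.zip"):
    """Bundle the many small output files of one sub-job into a single zip
    archive, so that the tape-backed mass storage deals with fewer, larger
    files."""
    out = Path(output_dir)
    archive = out / archive_name
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Data and log files are stored flat inside the archive.
        for f in sorted(out.glob("*.root")) + sorted(out.glob("*.log")):
            zf.write(f, arcname=f.name)
    return archive
```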

7 AliEn problems
 Proxy server: ran out of memory due to a spiralling number of proxy connections. An attempt to introduce a scheme with a pre-forked, limited number of proxies was not successful, and the problem has to be studied further.
   Not a show-stopper: we know what to monitor and how to avoid it
 JobOptimizer: due to the very complex structure of the jobs (many files in the input box), the time needed to prepare one job for submission is large, and the service sometimes cannot supply enough jobs to fill the available resources.
   Not a show-stopper now: we are mixing jobs with different execution-time lengths, thus load-balancing the system (see the sketch below)
   Has to be fixed for Phase 3, where the input boxes of the jobs will be even larger and the processing time very short; clever ideas on how to speed up the system already exist
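The load-balancing remedy mentioned above, mixing long and short jobs so that prepared jobs are always available for submission, can be pictured with a small sketch; the queue representation and job names are invented for illustration:

```python
from itertools import chain, zip_longest

def mix_job_queues(long_jobs, short_jobs):
    """Interleave jobs with long and short execution times so that the
    optimizer always has ready-to-submit jobs of both kinds and the free
    CPUs are not left idle while the heavier jobs are being prepared."""
    mixed = chain.from_iterable(zip_longest(long_jobs, short_jobs))
    return [job for job in mixed if job is not None]

# Example:
# mix_job_queues(["cent1-1", "cent1-2"], ["per1-1", "per1-2", "per1-3"])
# -> ['cent1-1', 'per1-1', 'cent1-2', 'per1-2', 'per1-3']
```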

8 Experience: LCG
 A document on Data Challenges on LCG is being finalised by the GAG, with contributions from ALICE, ATLAS and LHCb
 LCG problems and solutions:
   General
     Problem -> reporting -> fixing -> green light, but no feedback
     The same problem then appears somewhere else…
     Direct contact with site managers can be useful
   Job Management
     On average, it works fairly well
     Maximum number of CPUs served by an RB: average job duration / submission time per job
     Max submission rate to LCG: 720 jobs/hour; for us it is less, since we do more than just submission
     One entry point does not scale to the whole system size…
     No tools for managing multiple jobs at once
     Ranking: [1 - (jobs waiting)/(total CPUs)] works well, but it is not the default… (see the sketch below)
     Jobs reported as "Running" by LCG sometimes fail to report to AliEn that they started, so they stay "queued" in AliEn forever
     Jobs stay "Running" forever, even if site managers report their completion
     Jobs reported as "cancelled by user" even when they were not
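The two numerical points above, the CE ranking expression and the submission-rate bound on how many CPUs one RB can serve, can be written out explicitly; the Python form and the example numbers are illustrative only:

```python
def ce_rank(jobs_waiting, total_cpus):
    """Ranking used for choosing a CE: 1 - waiting/total CPUs, so that
    sites with relatively empty queues are preferred."""
    return 1.0 - jobs_waiting / total_cpus

def max_cpus_served(avg_job_duration_hours, submission_rate_per_hour=720.0):
    """Upper bound on the number of CPUs a single Resource Broker can keep
    busy: jobs finish at a rate of (CPUs / duration) and must be replaced
    at no more than the submission rate, so CPUs <= duration * rate."""
    return avg_job_duration_hours * submission_rate_per_hour

# With the numbers quoted elsewhere in this talk: 2h30 cent1 jobs give at
# most 2.5 * 720 = 1800 CPUs per RB, and less in practice since submission
# is not the only operation the interface performs.
print(ce_rank(jobs_waiting=50, total_cpus=200))     # 0.75
print(max_cpus_served(avg_job_duration_hours=2.5))  # 1800.0
```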

9 LCG problems and solutions (cont’d)
 Data Management
   "Default SE" vs. "Close SE"
   edg-rm commands
   lcg-cr: lack of diagnostic information
   Possible fix for temporarily unavailable SEs
 Sites/Configuration
   "Black-hole" effect: jobs fail on a site and more and more jobs are attracted to it
   "alicesgm" not allowed to write in the SW installation area
   Environment variables: VO_ALICE_SW_DIR not set
   Misconfiguration: FZK with INFN certificates; Cambridge: bash not supported
   "Default SE" vs. "Close SE" (see above)
   Library configuration: CNAF (solved, how?), CERN (?)
   NFS not working: multiple job failures (see the "black-hole" effect)
 Stability
   Behaviour is anything but uniform in time, but the general picture is improving

10 Some results (19/09/04)
 Phase 2 statistics (start July 2004, end September 2004):
   Jet signals, unquenched and quenched, cent1: complete
   Jet signals, unquenched, per1: complete
   Jet signals, quenched, per1: 30% complete
   Special TRD production at CNAF: phase 1 running
   Number of jobs: 85K (the number of jobs done per day is accelerating)
   Number of output files: 422K data, 390K log
   Data volume: 3.4 TB at local SEs, 3.4 TB at CERN (backup)
   Job duration: 2h 30min for cent1, 1h 20min for per1
     Careful profiling of AliRoot and clean-up of the code have reduced the processing time by a factor of 2!

11 LCG contribution to Phase 2 (15/09)
 Mixing + reconstruction: the "more difficult" part, with a large input to be transferred to the CE and output going to an SE local to the CE that executes the job
 Jobs (last month, 15K jobs sent):
   DONE: 5990
   ERROR_IB: 1411 (error in staging the input)
   ERROR_V: 3090 (insufficient memory on the WN or AliRoot failure)
   ERROR_SV: 2195 (Data Management or Storage Element failure)
   ERROR_E: 1277 (typically NFS failures, so the executable is not found)
   KILLED: 219 (jobs that fail to contact the AliEn server when started and stay QUEUED forever in AliEn, while remaining "Running", also forever, in LCG)
   RESUB: 851
   FAILED: 330
 Test of:
   Data Management services
   Storage Elements
 Remarks:
   Up to 400 jobs running on LCG through a single interface
   No more use of Grid.it (to avoid managing too many sites for Phase 3)
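For orientation, the counts above sum up as follows; this is a small worked check added here, not part of the original slide:

```python
# Outcome counts quoted above (last month, ~15K jobs sent to LCG).
outcomes = {
    "DONE": 5990, "ERROR_IB": 1411, "ERROR_V": 3090, "ERROR_SV": 2195,
    "ERROR_E": 1277, "KILLED": 219, "RESUB": 851, "FAILED": 330,
}
total = sum(outcomes.values())               # 15363, consistent with ~15K sent
done_fraction = outcomes["DONE"] / total     # about 0.39
print(total, round(100 * done_fraction, 1))  # 15363 39.0
```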

12 Individual sites: CPU contribution (shown on the original slide)
 Under AliEn direct control: 17 CEs, each with an SE; CERN-LCG encompasses the LCG resources worldwide (also with local/close SEs)

13 Individual sites: jobs successfully done (shown on the original slide)

14 Toward Phase 3
 Purpose: distributed analysis of the data processed in Phase 2
 An AliEn analysis prototype already exists:
   Designated experts are trying to work with it, but this is difficult while the production is running…
   We want to use gLite during this phase as much as possible (and provide feedback)
 Service requirements:
   In both Phase 1 and Phase 2 the service quality of the computing centres has been excellent, with very short response times in case of problems
   Phase 3 will continue until the end of the year:
     The remote computing centres will have to continue providing a high level of service
     Since the data are stored locally, interruptions of service will make the analysis jobs fail (or run very slowly). The backup copy at CERN is on tape only and will take a considerable amount of time to stage back if the local copy is not accessible
   The above is valid both for the centres directly controlled through AliEn and for the LCG sites

15 Conclusions
 Phase 2 of the PDC’04 is about 50% finished and is progressing well, despite its complexity
 There is keen competition for resources at all sites (LHCb and ATLAS are also running massive DCs)
 We have not encountered any show-stoppers; all production problems that arise are fixed by the AliEn and LCG crews very quickly
 The response of the experts at the computing centres is very efficient
 We are also running a considerable number of jobs on LCG sites, and LCG is performing very well, with more and more resources being made available for ALICE thanks to the hard work of the LCG team
 In about three weeks’ time we will seamlessly enter the last phase of the PDC’04…
 It is not over yet, but we are getting close!

16 Acknowledgements
 Special thanks to the site experts for the computing and storage resources and for the excellent support:
   Francesco Minafra – Bari
   Haavard Helstrup – Bergen
   Roberto Barbera – Catania
   Giuseppe Lo Re – CNAF Bologna
   Kilian Schwarz – FZK Karlsruhe
   Jason Holland – TLC² Houston
   Galina Shabratova – IHEP, ITEP, JINR
   Eygene Ryabinkin – KIAE Moscow
   Doug Olson – LBL
   Yves Schutz – CC-IN2P3 Lyon
   Doug Johnson – OSC Ohio
   Jiri Chudoba – Golias Prague
   Andrey Zarochencev – SPbSU St. Petersburg
   Jean-Michel Barbet – SUBATECH Nantes
   Mario Sitta – Torino
 And to: Patricia Lorenzo – LCG contact person for ALICE