Production test on EDG-1.4

Goal 1: simulate and reconstruct 5000 Pb-Pb central events
- 1 job/event
- Output size: about 1.8 GB/event, so 9 TB in total
- Job duration: hrs

Goal 2: analyse these events
- 1 job to analyse all the events
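The 9 TB figure follows directly from the per-event size. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB); the function and constant names are illustrative and not taken from the slides:

```python
# Figures quoted on the slide.
N_EVENTS = 5000        # Pb-Pb central events, one job per event
GB_PER_EVENT = 1.8     # approximate output size per event, in GB


def total_output_tb(n_events: int, gb_per_event: float) -> float:
    """Total output in TB, assuming decimal units (1 TB = 1000 GB)."""
    return n_events * gb_per_event / 1000.0


print(f"Expected total output: {total_output_tb(N_EVENTS, GB_PER_EVENT):.1f} TB")
```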

Status
- Started on March 15th
- Stopped on May 31st
- About 450 central Pb-Pb events simulated (6 jobs/day) :-(
- Output registered in the EDG ALICE RC
- Output stored on:
  - EDG disk SEs (300)
  - EDG MSS SEs (150)
  - CASTOR at CNAF and CERN (all, registered in the AliEn Data Catalogue)
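The quoted rate of about 6 jobs/day can be checked against the start and stop dates. A minimal sketch; the slide gives no year, so 2003 is an assumption here (the day count between 15 March and 31 May is the same in any year), and the variable names are illustrative:

```python
from datetime import date

# Assumption: year 2003. The slide only gives day and month.
start = date(2003, 3, 15)
stop = date(2003, 5, 31)

events_done = 450            # central Pb-Pb events simulated
on_disk_se = 300             # events stored on EDG disk SEs
on_mss_se = 150              # events stored on EDG MSS SEs

days = (stop - start).days   # length of the production period in days
rate = events_done / days    # average successful jobs per day

print(f"{days} days, {rate:.1f} events/day")
# The disk and MSS counts together account for all simulated events.
print("SE counts match total:", on_disk_se + on_mss_se == events_done)
```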

Comments
- Average efficiency: 35%
- More jobs would mean lower efficiency
- The Application Testbed is unstable on the time scale of our job duration (24 h)
- Most of the jobs failed because of service failures
- It takes a long time to track down the errors and recover (i.e., clean up the RC by hand when needed)

Failure reasons: RB overloaded
- Service crash: jobs get lost even while executing on a WN, and they can no longer be tracked or monitored
- stdout/stderr can't be monitored during execution
- The job might complete correctly and store/register its output on/in the SE/RC
- No Output Sandbox available
- No change of job status

Failure reasons: WN disk space full
- ALICE jobs produce a 2 GB output
- Sometimes the available disk space on the executing WN fills up and the job crashes
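One way to guard against this failure mode is to check free space on the worker node before the job starts. The sketch below is a hypothetical pre-flight check, not something described in the slides; `has_room`, the 20% safety margin, and the scratch path are all assumptions:

```python
import shutil

# About 2 GB of output per job, as quoted on the slide (decimal units assumed).
REQUIRED_BYTES = 2 * 1000**3


def has_room(path: str, required: int = REQUIRED_BYTES, margin: float = 1.2) -> bool:
    """Return True if the filesystem holding `path` has enough free space.

    `margin` is a hypothetical safety factor on top of the expected output size.
    """
    free = shutil.disk_usage(path).free
    return free >= required * margin


if __name__ == "__main__":
    # "." stands in for the job's scratch directory on the worker node.
    if has_room("."):
        print("Enough scratch space, job can start")
    else:
        print("Not enough scratch space, abort before wasting CPU time")
```

Failing early in this way does not prevent another job from filling the disk later, so it only reduces the number of crashes rather than eliminating them.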

Failure reasons: the "Lyon" problem
- WNs publish the total available memory in the IS
- The JDL memory requirement is compared to the published values
- When more than one job is allowed on a WN, the memory is shared
- AliRoot jobs break because they need more memory than is actually available
- Workaround by F. Hernandez
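The mismatch can be illustrated with a small sketch. The numbers below (1024 MB node, 2 concurrent jobs, 600 MB requirement) are hypothetical and chosen only to show the effect, and the function names are illustrative. This is not the actual matchmaking code, nor the workaround mentioned on the slide:

```python
def passes_matchmaking(published_mb: float, required_mb: float) -> bool:
    """The check as described: requirement compared to the published value."""
    return published_mb >= required_mb


def memory_per_job(total_mb: float, concurrent_jobs: int) -> float:
    """Memory effectively available to each job when the node is shared."""
    return total_mb / concurrent_jobs


# Hypothetical worker node and job.
total_mb = 1024        # total memory published in the IS
concurrent_jobs = 2    # jobs allowed to run on the node at the same time
required_mb = 600      # memory requirement stated in the JDL

matched = passes_matchmaking(total_mb, required_mb)
actually_fits = memory_per_job(total_mb, concurrent_jobs) >= required_mb

print("Matched by the broker:", matched)
print("Fits in the per-job share:", actually_fits)
```

With these numbers the job is matched to the node but only half the published memory is available to it, which is the situation the slide describes.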

Behaviours not understood
- Some jobs go to "OutputReady" status after 6-8 days
- MSS jobs fail more frequently (and job information is only available for CNAF jobs)

MSS jobs

Outcome         Jobs
OK                74
LDAP failure      23
RC failure        35
Disk full         16
Lost              32
Wrapper           39
Running           36
Submit
Total            270
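A short tally of the MSS job counts shows what fraction of jobs succeeded and how many jobs the listed categories leave unaccounted for. This is a minimal sketch; the transcript gives no value for "Submit", so it is left out rather than guessed, and the dictionary name is illustrative:

```python
# Job counts as listed; "Submit" has no value in the transcript and is omitted.
mss_jobs = {
    "OK": 74,
    "LDAP failure": 23,
    "RC failure": 35,
    "Disk full": 16,
    "Lost": 32,
    "Wrapper": 39,
    "Running": 36,
}
TOTAL = 270

listed = sum(mss_jobs.values())              # jobs covered by the listed categories
unaccounted = TOTAL - listed                 # possibly the missing "Submit" entry
ok_share = 100 * mss_jobs["OK"] / TOTAL      # percentage of successful jobs

print(f"Listed: {listed}, unaccounted: {unaccounted}")
print(f"Successful MSS jobs: {ok_share:.1f}% of {TOTAL}")
for outcome, count in sorted(mss_jobs.items(), key=lambda kv: -kv[1]):
    print(f"{outcome:<14}{count:>4}  {100 * count / TOTAL:5.1f}%")
```

The unaccounted remainder may correspond to the "Submit" category, but the transcript does not confirm this.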

Conclusions
- The EDG Application Testbed is not suitable for large productions (lack of resources)
- Its use is very frustrating: instability, limited functionality, low efficiency
- At the present rate, it would take 18 months to complete the production :-(
- Functionality for data analysis is now missing
- The Application Testbed is being closed
- Use AliEn for data analysis and wait for LCG-1