M. Masera, ALICE CALCOLO 2004
Computing in 2004: Physics Data Challenge III
Massimo Masera
Commissione Scientifica Nazionale I
22 June 2004

Outline
The framework: status of AliRoot for the Physics Data Challenge III (PDCIII)
The production environment: AliEn
AliEn as a meta-Grid: the use of LCG and Grid.It in the PDCIII
Current status of the PDCIII
  – Phase I is finished
  – Phase II is starting
  – Phase III
Conclusions

ALICE Physics Data Challenges

Period (milestone) | Fraction of the final capacity (%) | Physics objective
06/01-12/01 | 1 | pp studies, reconstruction of TPC and ITS
06/02-12/02 | 5 | First test of the complete chain from simulation to reconstruction for the PPR. Simple analysis tools. Digits in ROOT format.
01/04-06/04 | 10 | Complete chain used for trigger studies. Prototype of the analysis tools. Comparison with parameterised Monte Carlo. Simulated raw data.
01/06-06/06 | 20 | Test of the final system for reconstruction and analysis.

ALICE Physics Data Challenge (2003)

AliRoot

Simulation and Reconstruction in AliRoot
In the present data challenge, simulation and reconstruction are steered by two classes: AliSimulation and AliReconstruction
  – Simple user interface:
      AliSimulation sim;
      sim.Run();
      AliReconstruction rec;
      rec.Run();
  – Goal: run the standard simulation and reconstruction for all detectors
This is the simplest example: the number of events and the config file can also be set
Merging and region of interest are also implemented

Event Summary Data (ESD)
The AliESD class is essentially a container for data
No functions for the analysis
It is the result of the reconstruction, carried out systematically via batch/Grid jobs
It aims to be the starting point for the analysis
At reconstruction time, it can be used to exchange information among different reconstruction steps

Event Summary Data (ESD)
[Diagram: ITS vertexers, ITS tracker, ITS stand-alone, TPC tracker, TRD tracker, TOF, PHOS and MUON around the ESD, which is written to file]
The following detectors currently contribute to the ESD: ITS, TPC, TRD, TOF, PHOS and MUON
The ESD structure is sufficient for the following “kinds of physics”: strangeness, charm, HBT, jets (those to be tried in DC2004)
All the objects stored in the ESD are accessed via abstract interfaces, i.e. they do not depend on sub-detector code

Example: primary vertex
[Class diagram]
STEER directory (interfaces and subdetector-independent code): AliVertexer, AliESDVertex
ITS directory: AliITSVertexer and the classes below
  – AliITSVertexerIons: Pb-Pb, 3-D info for central events
  – AliITSVertexerZ: p-p and peripheral events
  – AliITSVertexerFast: just a Gaussian smearing of the generated vertex
  – AliITSVertexerTracks: high-precision vertexer with reconstructed tracks (pp D0)
[Diagram label: NEW code]
Chain shown: AliReconstruction, AliReconstructor::CreateVertexer, AliITSReconstructor::CreateVertexer, AliESD

AliRoot: present situation
Major changes in the last year…
  – New multi-file I/O in full production
  – New coordinate system
  – New reconstruction and simulation “drivers” (AliSimulation and AliReconstruction classes)
  – First attempt at the ESD and analysis framework
  – Improvements in reconstruction and simulation
… However, the system is evolving
  – ESD: the philosophy is still evolving
  – Introduction of FLUKA and of the new geometrical modeller
  – Development of the analysis framework
  – Raw data for all the detectors (already available for ITS and TPC)
  – Introduction of the condition database infrastructure

AliEn and the Grid

The ALICE Production Environment: AliEn
Standards are now emerging for the basic building blocks of a Grid
  – There are millions of lines of code in the open-source (OS) domain dealing with these issues
Why not use these to build the minimal Grid that does the job?
  – Fast development of a prototype, no problem in exploring new roads, restarting from scratch, etc.
  – Hundreds of users and developers
  – Immediate adoption of emerging standards
An example: AliEn by ALICE (5% of the code developed, 95% imported)
(…)
[Diagram: AliEn architecture, from low level to high level. Legend: external software, AliEn core components & services, interfaces. Boxes: Perl core, Perl modules, external libraries, DBI, DBD, ADBI, RDBMS (MySQL), LDAP, file & metadata catalogue, SOAP/XML, CE, SE, logger, database proxy, authentication, RB, config manager, package manager, FS, V.O. packages & commands, user interface, user application, API (C/C++/Perl), CLI, GUI, web portal]

AliEn Timeline
[Timeline figure, from the start of the project. Development stages: functionality + simulation; interoperability + reconstruction; performance, scalability, standards + analysis. Milestones: first production (distributed simulation); Physics Performance Report (mixing & reconstruction); 10% Data Challenge (analysis)]

From AliEn to a Meta-Grid
Workload management follows a “pull model”: a server holds a master queue of jobs, and it is up to the CE that provides the CPU cycles to call it and ask for a job
The system is integrated with a large-scale job submission and bookkeeping system “tuned” for Data Challenge productions, with job splitting, statistics, pie charts, automatic resubmissions, etc.
The job monitoring model requires no “sensors” installed on the WN: it is the job wrapper itself that talks to the server
Several Grid infrastructures are (becoming) available: LCG, Grid.It, possibly others
  – Lots of resources but, in principle, different middleware
The pull model is well suited to implementing higher-level submission systems, since it does not require knowledge of the periphery, which may be very complex

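The pull model described on this slide can be illustrated with a minimal sketch. It is not AliEn code: the class names MasterQueue and ComputingElement, the site names and the job identifiers are all hypothetical, and the real system adds authentication, matching of job requirements and bookkeeping.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MasterQueue:
    """Central server holding the master queue of jobs (hypothetical class)."""
    jobs: deque = field(default_factory=deque)

    def submit(self, job_id: str) -> None:
        self.jobs.append(job_id)

    def request_job(self, ce_name: str):
        # The server never pushes work: it only answers a CE that asks.
        # ce_name is unused here; a real server would use it for bookkeeping.
        return self.jobs.popleft() if self.jobs else None


@dataclass
class ComputingElement:
    """A site offering CPU cycles; it calls the server while it has free slots."""
    name: str
    free_slots: int
    running: list = field(default_factory=list)

    def pull(self, queue: MasterQueue) -> None:
        while self.free_slots > 0:
            job = queue.request_job(self.name)
            if job is None:  # nothing left in the master queue
                break
            self.running.append(job)
            self.free_slots -= 1


def run_demo():
    """Submit five jobs and let two hypothetical sites pull them in turn."""
    queue = MasterQueue()
    for i in range(5):
        queue.submit(f"job-{i}")
    sites = [ComputingElement("site-A", 2), ComputingElement("site-B", 4)]
    for ce in sites:
        ce.pull(queue)
    return {ce.name: ce.running for ce in sites}, len(queue.jobs)
```

Note that the server needs no description of the sites: each CE decides how much work to request, which is why an entire external Grid can sit behind a single CE.
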
From AliEn to a Meta-Grid
Design strategy:
Use AliEn as a general front-end
  – Owned and shared resources are exploited transparently
Minimize points of contact between the systems
  – No need to reimplement services, etc.
  – No special services required to run on remote CE/WNs
Make full use of provided services: data catalogues, scheduling, monitoring…
  – Let the Grids do their jobs (they should know how)
Use high-level tools and APIs to access Grid resources
  – Developers put a lot of abstraction effort into hiding the complexity and shielding the user from implementation changes

Available resources for PDC III
Several AliEn “native” sites (some rather large)
  – Bari, CERN, CNAF, Catania, Cyfronet, FZK, JINR, LBL, Lyon, OSC, Prague, Torino
LCG-2 core sites
  – CERN, CNAF, FZK, NIKHEF, RAL, Taiwan (more than 1000 CPUs)
  – At CNAF and Catania, the same resources can be accessed either through LCG/Grid.It or through AliEn
GRID.IT sites
  – LNL.INFN, PD.INFN and several smaller ones (about 400 CPUs, not including CNAF)
Implementation: manage LCG resources through a “gateway”, an AliEn client (CE+SE) sitting on top of an LCG User Interface
  – The whole of LCG computing is seen as a single, large AliEn CE associated with a single, large SE

Software installation
Both AliEn and AliRoot are installed via LCG jobs
  – The jobs do some checks, download tarballs, uncompress them, build the environment script and publish the relevant tags
  – A single command is available to get the list of available sites, send the jobs everywhere and wait for completion. A full update on LCG-2 + GRID.IT (16 sites) takes ~30 minutes
  – Manual intervention is still needed at a few sites (e.g. CERN/LSF)
  – Ready for integration into the AliEn automatic installation system
Misconfiguration of the experiment software shared area caused most of the trouble in the beginning
[Diagram: LCG-UI with installAlice.sh, installAlice.jdl, installAliEn.sh and installAliEn.jdl; sites NIKHEF, Taiwan, RAL, CNAF, TO.INFN, …]

AliEn, Genius & EDG/LCG
[Diagram: user submits jobs; server; AliEn catalogue; AliEn CEs/SEs; AliEn CE on an LCG UI; LCG RB; LCG CEs/SEs; LCG catalogue (LCG LFN, LCG PFN); LCG LFN = AliEn PFN]
LCG-2 is one CE of AliEn, which integrates LCG and non-LCG resources
  – If LCG-2 can run a large number of jobs, it will be used heavily
  – If LCG-2 cannot do that, AliEn selects other resources, and LCG-2 will be used less

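The relation “LCG LFN = AliEn PFN” in the diagram amounts to a two-step catalogue lookup, which the sketch below tries to make concrete. It is only an illustration: both catalogues are modelled as plain dictionaries, and every path, host name and URL scheme in the example entries is invented.

```python
# Hypothetical catalogue contents: none of these paths or hosts are real.
ALIEN_CATALOGUE = {
    "/alice/sim/2004/event_001/galice.root":
        "lcg://example-host/alice/sim/2004/event_001/galice.root",
    "/alice/sim/2004/event_002/galice.root":
        "castor://example-castor/alice/event_002/galice.root",
}
LCG_CATALOGUE = {
    "lcg://example-host/alice/sim/2004/event_001/galice.root":
        "srm://example-se/data/alice/0001/galice.root",
}


def resolve_physical_location(alien_lfn, alien_catalogue, lcg_catalogue):
    """Follow AliEn LFN -> AliEn PFN and, for LCG files, LCG LFN -> LCG PFN."""
    alien_pfn = alien_catalogue[alien_lfn]
    if alien_pfn.startswith("lcg://"):
        # For files stored on LCG, the AliEn PFN is itself an LCG LFN,
        # so a second lookup in the LCG catalogue gives the physical file.
        return lcg_catalogue[alien_pfn]
    # Files on native AliEn storage need no second lookup.
    return alien_pfn
```

With this layering AliEn does not need to know where LCG physically keeps a file, in line with the design goal of minimizing the points of contact between the systems.
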
Physics Data Challenge III

PDC 3 schema
[Diagram: CERN, Tier-1 and Tier-2 centres. Steps: production of RAW; shipment of RAW to CERN; reconstruction of RAW in all T1s and T2s; analysis. Legend: AliEn job control, data transfer]

Phases of ALICE Physics Data Challenge 2004
Phase 1: production of underlying events using heavy-ion MC generators
  – Status: completed
Phase 2: mixing of signal events into the underlying events
  – Status: starting
Phase 3: analysis of signal + underlying events
  – Goal: to test the data analysis model of ALICE
  – Status: will begin in ~2 months

Merging
[Figure panels: signal-free event; mixed signal]

Statistics for Phase 1 of ALICE PDC 2004
Number of jobs:
  – Central 1 (long, 12 hours): 20 K
  – Peripheral 1 (medium, 6 hours): 20 K
  – Peripheral 2 to 5 (short, 1 to 3 hours): 16 K
Number of files:
  – AliEn file catalogue: 3.8 million (no degradation in performance observed)
  – CERN Castor: 1.3 million
File size:
  – Total: 26 TB
CPU work:
  – Total: 285 MSI-2K hours
  – LCG: 67 MSI-2K hours

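A few derived figures follow from the numbers on this slide. The short calculation below is only a sketch: it assumes decimal units (1 TB = 10^6 MB) and that the 26 TB total refers to the 3.8 million files in the AliEn catalogue, neither of which the slide states explicitly.

```python
# Input values copied from the Phase 1 statistics slide.
JOBS = {"central_1": 20_000, "peripheral_1": 20_000, "peripheral_2_to_5": 16_000}
FILES_IN_ALIEN_CATALOGUE = 3.8e6
TOTAL_SIZE_TB = 26.0
CPU_TOTAL_MSI2K_HOURS = 285.0
CPU_LCG_MSI2K_HOURS = 67.0


def summarize():
    """Return (total jobs, mean file size in MB, LCG fraction of CPU work)."""
    total_jobs = sum(JOBS.values())
    # Assumption: decimal units, 1 TB = 1e6 MB.
    mean_file_mb = TOTAL_SIZE_TB * 1e6 / FILES_IN_ALIEN_CATALOGUE
    lcg_cpu_fraction = CPU_LCG_MSI2K_HOURS / CPU_TOTAL_MSI2K_HOURS
    return total_jobs, mean_file_mb, lcg_cpu_fraction


if __name__ == "__main__":
    jobs, size_mb, fraction = summarize()
    print(f"jobs: {jobs}, mean file size: {size_mb:.1f} MB, LCG CPU share: {fraction:.1%}")
```

Under these assumptions the production comprised 56 K jobs, with a mean file size of just under 7 MB and slightly less than a quarter of the CPU work delivered through LCG.
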
Phase I: 1 Pb-Pb event
AliEn catalogue: 36 files

Phase 1 resource statistics:
  – 27 production centres, 12 major producers, no single site dominating the production
  – The individual contribution of each site not displayed is at the level of Bari
  – See slide 28 for a comparison between AliEn and LCG sites
  – Italian contribution > 40%

Phase 1 CPU profile:
  – Aiming for sustained running (as allowed by resource availability): average 450 CPUs, max 1450 CPUs (not visible in the plot due to binning)
[Plot annotation: “files Castor problem”]

Problems with Phase I
Two-month delay, mainly due to a delayed release of LCG-2
No SE in LCG-2 and poor storage availability at LCG sites
  – Natural solution in PDC Phase I: all files migrated to Castor
Castor-related problems
  – Initial lack of storage w.r.t. requests (30 TB… not yet available)
  – A limit on the number of files above which the system performance dropped
  – Server reinstallation in March
LCG: most of the problems are related to the configuration of the sites
  – Software management tools are still rudimentary
  – Large sites often have tighter security restrictions and other idiosyncrasies
  – Investigating and fixing problems is hard and time-consuming
The most difficult part of the management is monitoring LCG through a “keyhole”
  – Only integrated information available natively
  – MonALISA for AliEn, GridICE for LCG

LCG / AliEn
Statistics after round 1 (ended 4 April): job distribution (LCG 46%)
  – Alice::CERN::LCG is the interface to LCG-2
  – Alice::Torino::LCG is the interface to GRID.IT
In the 2nd round AliEn was used more because of the lack of storage
Continuous stop/start of the production
[Chart caption: situation at the end of round 1]

Phase 2 layout
[Diagram: user submits jobs; server; AliEn catalogue; AliEn CE/SE; AliEn CE on an LCG-UI; LCG RB; LCG CE/SE; LCG catalogue; CERN Castor; edg-copy-and-register; LCG LFN = AliEn PFN (lcg://host/)]
Phase 2: about to start
Mixing of the underlying events with signal events (jets, muons, J/ψ)
We plan to make full use of the LCG DM tools; we may have storage problems at the local SEs

Problems with Phase II
Phase II will generate lots (1M) of rather small (~7 MB) files
We would need an extra stager at CERN, but this is not available at the moment
We could use some TB of disk space, but this too is not available
  – We are testing a plug-in to AliEn that uses tar to bunch small files
The space available on the local LCG storage elements seems very low… we will see
Preparation of the LCG-2 JDL is more complicated, due to the use of the data management features
This has introduced a two-week delay; we hope to start soon!

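The idea of bunching small files with tar can be sketched as follows. This is not the AliEn plug-in mentioned on the slide, whose design is not described here: the function names, the greedy grouping rule and the archive naming are assumptions made for illustration, using Python's standard tarfile module.

```python
import tarfile
import tempfile
from pathlib import Path


def plan_bundles(files_with_sizes, max_bundle_bytes):
    """Greedily group (name, size) pairs, in order, into bundles below a size cap."""
    bundles, current, current_size = [], [], 0
    for name, size in files_with_sizes:
        # Close the current bundle if adding this file would exceed the cap.
        if current and current_size + size > max_bundle_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bundles.append(current)
    return bundles


def write_bundles(paths, out_dir, max_bundle_bytes):
    """Write one uncompressed tar archive per planned bundle; return archive paths."""
    sized = [(p, p.stat().st_size) for p in paths]
    archives = []
    for index, group in enumerate(plan_bundles(sized, max_bundle_bytes)):
        archive = Path(out_dir) / f"bundle_{index:04d}.tar"  # hypothetical naming
        with tarfile.open(archive, "w") as tar:
            for path in group:
                tar.add(path, arcname=path.name)
        archives.append(archive)
    return archives


def demo():
    """Bundle three 7-byte files with a 15-byte cap; return member names per archive."""
    with tempfile.TemporaryDirectory() as tmp:
        src, out = Path(tmp) / "src", Path(tmp) / "out"
        src.mkdir()
        out.mkdir()
        paths = []
        for i in range(3):
            path = src / f"file_{i}.dat"
            path.write_bytes(b"x" * 7)
            paths.append(path)
        contents = []
        for archive in write_bundles(paths, out, max_bundle_bytes=15):
            with tarfile.open(archive) as tar:
                contents.append(tar.getnames())
        return contents
```

With the figures quoted on the slide (about 1M files of ~7 MB, i.e. roughly 7 TB in total), a cap of 1 GB per archive would reduce the number of objects the mass storage system has to handle by about two orders of magnitude; the cap value is an arbitrary choice for the sake of the example.
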
Phase 3 layout
[Diagram: user query; server; AliEn catalogue; AliEn CE on an LCG-UI; LCG RB; LCG catalogue; AliEn CE/SE; LCG CE/SEs; files lfn 1 to lfn 8 distributed among the CE/SEs]
Phase 3: foreseen in two months
Analysis of signal + underlying events:
  – Test the data analysis model of ALICE
  – AliEn job splitting
  – Tests with ARDA in September …
… ARDA workshop today at CERN

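The job splitting suggested by the diagram, where a user query returns a set of LFNs stored at different CE/SEs, can be sketched as a grouping of files by location so that each sub-job runs where its data are. The splitting criterion, the function name and the storage element names below are assumptions; the slide does not specify how AliEn performs the split.

```python
from collections import defaultdict


def split_by_storage_element(lfn_locations):
    """Group LFNs by the storage element holding them: one sub-job per SE."""
    groups = defaultdict(list)
    for lfn, storage_element in lfn_locations.items():
        groups[storage_element].append(lfn)
    # Sort the file list of each sub-job to make the result reproducible.
    return {se: sorted(lfns) for se, lfns in groups.items()}


# Hypothetical result of a catalogue query: eight files spread over three SEs.
QUERY_RESULT = {
    "lfn1": "SE-alien", "lfn2": "SE-alien", "lfn3": "SE-alien",
    "lfn4": "SE-lcg-1", "lfn5": "SE-lcg-1", "lfn6": "SE-lcg-1",
    "lfn7": "SE-lcg-2", "lfn8": "SE-lcg-2",
}
```

Each entry of the returned dictionary would correspond to one sub-job sent to the CE close to that SE, so that the input files do not have to be moved.
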
ARDA, EGEE, gLite, LCG…
ARDA was an RTAG (Sep 2003) devoted to analysis:
  – It found AliEn “the most complete system among all considered”
  – It became an LCG project. Set-up meeting: January 2004
ARDA is interfaced to the EGEE middleware (gLite), disclosed on 18 May. Prototype with EGEE MW due by Sep 04
gLite is presently based on the AliEn shell, Wisconsin CE, Globus gatekeeper, VOMS, GAS, …
Next steps (F. Hemmer, PEB, 7 June): integration of R-GMA and EDG-WMS (developed by INFN)
Support of LCG-2 maintained until EGEE satisfies the requirements of the experiments (PEB, 7 June)
This picture now seems to have been reversed (EGEE all-activities meeting, 18 June):
  – LCG-2 will evolve to LCG-3, focused on production
  – gLite will evolve in parallel, focused on development and analysis. These parallel evolutions will occasionally converge: gLite components will be merged into the LCG-x MW as soon as they are completed
This major change within the LCG project occurred abruptly, without prior discussion at PEB and SC2 level and without the approval of the experiments
ALICE will examine the current situation in its next offline week (starting 28 June)

Conclusions

We will have an additional DC
The difficult start of the ongoing DC taught a lesson:
  – We cannot go 18 months without testing our “production capabilities”
In particular, we have to maintain the readiness of
  – Code (AliRoot + MW)
  – ALICE distributed computing facilities
  – LCG infrastructure
  – Human “production machinery”
Getting all the partners into “production mode” was a non-negligible effort
We have to plan carefully the size and physics objectives of this data challenge

ALICE Physics Data Challenges

Period (milestone) | Fraction of the final capacity (%) | Physics objective
06/01-12/01 | 1 | pp studies, reconstruction of TPC and ITS
06/02-12/02 | 5 | First test of the complete chain from simulation to reconstruction for the PPR. Simple analysis tools. Digits in ROOT format.
01/04-06/04 | 10 | Complete chain used for trigger studies. Prototype of the analysis tools. Comparison with parameterised Monte Carlo. Simulated raw data.
05/05-07/05 (NEW) | TBD | Refinement of jet studies. Test of new infrastructure and MW. TBD.
01/06-06/06 | 20 | Test of the final system for reconstruction and analysis.

ALICE Offline Timeline
[Timeline figure. Activities: PDC04; analysis of PDC04; design of new components; development of new components; pre-challenge ‘06; PDC06 preparation; PDC06; final development of AliRoot; first data taking preparation. Markers: PDC04, PDC06, AliRoot ready]

Conclusions
Several problems and difficulties… However, our DC is progressing and Phase I is concluded
The DC is carried out entirely on the Grid
AliEn
  – Tools OK for DC running and resource control
  – Feedback from the CE and WN proved to be essential for early spotting of problems
  – Centralized and compact master services allow for fast upgrades
  – DM worked just fine (provided that the underlying MSS systems work well)
  – The file catalogue works great: 4M entries and no noticeable performance degradation
AliEn as a meta-grid works well across three grids, and this is a success in itself
The INFN contribution to the DC and to the Grid activities of the experiment is significant
  – >40% of CPU cycles provided by INFN sites. The efficiency was very high and the cooperation of the site managers was prompt
  – The interface between AliEn and LCG/Grid.It has been developed in Italy
We are going to use the LCG SE for Phase II…
  – Possible bottleneck for Phase II: lack of local storage resources
Analysis:
  – AliEn job splitting
  – We hope to test the first ARDA prototype in the fall