1 ALICE Computing Model – F. Carminati, BNL Seminar, March 21, 2005

2 Offline framework
- AliRoot in development since 1998
  - Entirely based on ROOT
  - Used since the detector TDRs for all ALICE studies
- Two packages to install (ROOT and AliRoot)
  - Plus the MCs
- Ported to most common architectures
  - Linux IA32, IA64 and AMD, Mac OS X, Digital Tru64, SunOS…
- Distributed development
  - Over 50 developers and a single CVS repository
  - 2/3 of the code developed outside CERN
- Tight integration with DAQ (data recorder) and HLT (same code base)
- Wide use of abstract interfaces for modularity
- "Restricted" subset of C++ used for maximum portability

3 AliRoot layout
[Block diagram] STEER (Virtual MC, AliSimulation, AliReconstruction, AliAnalysis, ESD) sits on top of ROOT and AliEn/gLite; detector modules (ITS, TPC, TRD, TOF, PHOS, EMCAL, RICH, ZDC, PMD, MUON, FMD, CRT, START, RALICE, STRUCT), event generators (HIJING, MEVSIM, PYTHIA6, PDF, EVGEN, HBTP, HBTAN, ISAJET) and transport codes (G3, G4, FLUKA) plug into the framework.

4 Software management
- Regular release schedule
  - Major release every six months, minor release (tag) every month
- Emphasis on delivering production code
  - Corrections, protections, code cleaning, geometry
- Nightly produced UML diagrams, code listings, coding rule violations, builds and tests; single repository with all the code
  - No version management software (we have only two packages!)
- Advanced code tools under development (collaboration with IRST/Trento)
  - Smell detection (already under testing)
  - Aspect-oriented programming tools
  - Automated genetic testing

5 ALICE Detector Construction Database (DCDB)
- Specifically designed to aid detector construction in a distributed environment:
  - Sub-detector groups around the world work independently
  - All data are collected in a central repository and used to move components from one sub-detector group to another, and during the integration and operation phases at CERN
- Multitude of user interfaces:
  - Web-based for humans
  - LabVIEW, XML for laboratory equipment and other sources
  - ROOT for visualisation
- In production since 2002
- A very ambitious project with important spin-offs
  - Cable Database
  - Calibration Database

6 The Virtual MC
[Diagram] User code talks to the Virtual MC (VMC) interface and the geometrical modeller; the VMC dispatches particle transport to G3, G4 or FLUKA, while reconstruction, visualisation and the event generators plug into the same framework.
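A minimal configuration-macro sketch of this dispatch, in the style of an AliRoot Config.C: one concrete engine is constructed behind the abstract TVirtualMC interface; the process flag and cut value below are illustrative, not ALICE defaults.

    // Selecting a transport engine through the Virtual MC interface.
    #include "TVirtualMC.h"
    #include "TGeant3TGeo.h"   // Geant3 backend; TGeant4 / TFluka are drop-in alternatives

    void Config()
    {
       // Constructing a concrete engine registers it as the global gMC pointer;
       // swapping this single line changes the transport code without touching
       // detector or user code, which only sees the abstract TVirtualMC interface.
       new TGeant3TGeo("C++ Interface to Geant3");

       gMC->SetProcess("DRAY", 1);      // switch delta-ray production on (illustrative)
       gMC->SetCut("CUTGAM", 1.e-3);    // 1 MeV transport cut for gammas (illustrative)
    }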

7 TGeo modeller
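The slide itself is a figure; as a toy illustration of the TGeo modeller's API (standard ROOT classes, with arbitrary placeholder materials and dimensions), a geometry can be assembled roughly like this:

    // Toy geometry built with the ROOT TGeo modeller (all values are placeholders).
    #include "TGeoManager.h"
    #include "TGeoMaterial.h"
    #include "TGeoMedium.h"
    #include "TGeoVolume.h"

    void ToyGeometry()
    {
       new TGeoManager("toy", "toy detector geometry");

       TGeoMaterial *matVac = new TGeoMaterial("Vacuum", 0, 0, 0);
       TGeoMedium   *vac    = new TGeoMedium("Vacuum", 1, matVac);

       // World volume: a 400x400x400 cm box
       TGeoVolume *top = gGeoManager->MakeBox("TOP", vac, 200., 200., 200.);
       gGeoManager->SetTopVolume(top);

       // A simple tube standing in for a barrel detector layer
       TGeoVolume *layer = gGeoManager->MakeTube("LAYER", vac, 80., 90., 150.);
       top->AddNode(layer, 1);

       gGeoManager->CloseGeometry();
    }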

8 Results
[Plots] Geant3 vs FLUKA comparison: 5000 1 GeV/c protons in a 60 T field; HMPID with 5 GeV pions.

9 ITS – SPD: Cluster Size (PRELIMINARY!)

10 Reconstruction strategy
- Main challenge: reconstruction in the high-flux environment (occupancy in the TPC up to 40%) requires a new approach to tracking
- Basic principle: maximum information approach
  - Use everything you can, you will get the best
- Algorithms and data structures optimized for fast access and usage of all relevant information
  - Localize relevant information
  - Keep this information until it is needed

11 Tracking strategy – Primary tracks
- Incremental process (flow sketched below)
  - Forward propagation towards the vertex: TPC → ITS
  - Back propagation: ITS → TPC → TRD → TOF
  - Refit inward: TOF → TRD → TPC → ITS
- Continuous seeding
  - Track segment finding in all detectors
- Combinatorial tracking in the ITS
  - Weighted two-track χ² calculated
  - Effective probability of cluster sharing
  - Probability for secondary particles not to cross a given layer
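The ordering of the three passes, sketched with hypothetical function and type names (this is not the AliRoot tracker API, only the flow of the incremental process):

    // Hypothetical sketch of the three tracking passes; the real trackers are
    // detector-specific classes driven by the reconstruction steering code.
    #include <iostream>
    #include <string>
    #include <vector>

    struct TrackCandidate { /* state vector, covariance, attached clusters ... */ };

    void RunPass(const std::string& pass, const std::vector<std::string>& detectors,
                 std::vector<TrackCandidate>& tracks)
    {
       // In the real code each step propagates the Kalman state to the next
       // detector and attaches the best-matching clusters; here we only show
       // the ordering of the detectors in each pass.
       for (const auto& det : detectors)
          std::cout << pass << ": " << det << "\n";
       (void)tracks;
    }

    int main()
    {
       std::vector<TrackCandidate> tracks;   // seeds come from TPC segment finding
       RunPass("forward propagation (towards the vertex)", {"TPC", "ITS"}, tracks);
       RunPass("back propagation",                         {"ITS", "TPC", "TRD", "TOF"}, tracks);
       RunPass("inward refit",                             {"TOF", "TRD", "TPC", "ITS"}, tracks);
       return 0;
    }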

12 Tracking & PID
- Timing on a PIV 3 GHz machine (dN/dy = 6000):
  - TPC tracking ~ 40 s
  - TPC kink finder ~ 10 s
  - ITS tracking ~ 40 s
  - TRD tracking ~ 200 s
[Plots] PID efficiency and contamination: TPC alone vs ITS+TPC+TOF+TRD; combined ITS & TPC & TOF.

13 Condition and alignment
- Heterogeneous information sources are periodically polled
- ROOT files with condition information are created
- These files are published on the Grid and distributed as needed by the Grid DMS
- Files contain validity information and are identified via DMS metadata (see the sketch below)
- No need for a distributed DBMS
- Reuse of the existing Grid services
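A minimal sketch of this scheme using plain ROOT I/O; the file name, object name and catalogue metadata fields are assumed conventions for illustration, not the actual ALICE ones.

    // Condition data as ordinary ROOT objects in ordinary files; the validity
    // travels with the file and is mirrored as metadata in the Grid catalogue.
    #include "TFile.h"
    #include "TParameter.h"

    void WriteCondition()
    {
       // Illustrative naming: detector + validity run range in the file name.
       TFile f("TPC_pedestals_run001000_run002000.root", "RECREATE");

       TParameter<float> pedestal("TPC_mean_pedestal", 52.3f);
       pedestal.Write();                 // stored like any other ROOT object

       f.Close();
       // The file would then be registered in the Grid file catalogue with
       // metadata such as detector=TPC, first_run=1000, last_run=2000.
    }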

14 External relations and DB connectivity
[Diagram] DAQ, Trigger, DCS, ECS, HLT, physics data and the DCDB feed, via APIs, into AliEn → gLite (metadata, file store) and into the AliRoot calibration classes (calibration procedures and calibration files).
- From the URs: source, volume, granularity, update frequency, access pattern, runtime environment and dependencies
- A call for URs has been sent to the subdetectors
- API = Application Program Interface; relations between DBs are not final and not all are shown

15 Metadata
- Metadata are essential for the selection of events
- We hope to be able to use the Grid file catalogue for one part of the metadata
  - During the Data Challenge we used the AliEn file catalogue for storing part of the metadata
  - However, these are file-level metadata
- We will need additional event-level metadata
  - This can simply be the TAG catalogue with externalisable references (sketched below)
  - We are discussing this subject with STAR
- We will take a decision soon
  - We would prefer that the Grid scenario be clearer
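A sketch of what such an event-level tag could look like: a small ROOT tree of per-event selection variables plus an externalisable reference (here simply the LFN of the ESD file and the event number inside it). The branch names and the LFN are placeholders, since the actual tag schema was still under discussion at the time.

    #include "TFile.h"
    #include "TTree.h"

    void WriteTags()
    {
       TFile f("event_tags.root", "RECREATE");
       TTree tags("T", "event-level tags");

       Int_t  run = 0, event = 0, nTracks = 0;
       Char_t esdLFN[256] = "/alice/sim/.../AliESDs.root";   // illustrative reference

       tags.Branch("run",     &run,     "run/I");
       tags.Branch("event",   &event,   "event/I");
       tags.Branch("nTracks", &nTracks, "nTracks/I");
       tags.Branch("esdLFN",  esdLFN,   "esdLFN/C");

       // One entry would be filled per event during reconstruction.
       tags.Fill();
       tags.Write();
       f.Close();
    }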

16 ALICE CDCs

Date               | MBytes/s               | TBytes to MSS | Offline milestone
10/2002            | 200                    |               | Rootification of raw data; raw data for TPC and ITS
9/2003             | 300                    |               | Integration of single-detector HLT code; partial data replication to remote centres
5/2004 → 3/2005 (?)| 450                    |               | HLT prototype for more detectors; remote reconstruction of partial data streams; raw digits for barrel and MUON
5/2005             | 750                    |               | Prototype of the final remote data replication (raw digits for all detectors)
5/2006             | 750 (1250 if possible) |               | Final test (final system)

17 Use of HLT for monitoring in CDCs
[Diagram] AliRoot simulation produces digits and raw data, which flow through the LDCs and the GDC (event builder, alimdc) into ROOT files stored in CASTOR and registered in AliEn; HLT algorithms run on the same stream and produce ESDs and monitoring histograms.

18 ALICE Physics Data Challenges

Period (milestone) | Fraction of the final capacity (%) | Physics objective
06/01-12/01        | 1%   | pp studies, reconstruction of TPC and ITS
06/02-12/02        | 5%   | First test of the complete chain from simulation to reconstruction for the PPR; simple analysis tools; digits in ROOT format
01/04-06/04        | 10%  | Complete chain used for trigger studies; prototype of the analysis tools; comparison with parameterised Monte Carlo; simulated raw data
05/05-07/05        | TBD  | Test of the condition infrastructure and FLUKA; to be combined with SDC 3; speed test of distributing data from CERN
01/06-06/06        | 20%  | Test of the final system for reconstruction and analysis

19 PDC04 schema
[Diagram] CERN plus Tier1 and Tier2 centres: production of RAW, shipment of RAW to CERN, reconstruction of RAW in all T1s, and analysis; AliEn provides job control and data transfer throughout.

20 Phase 2 principle
[Diagram] Signal-free event + signal = mixed signal event.

21 Simplified view of the ALICE Grid with AliEn
[Diagram] ALICE VO central services: Central Task Queue, job submission, File Catalogue, configuration, accounting, user authentication, workload management, job monitoring, storage volume manager and data transfer. AliEn site services (Computing Element, Storage Element, Cluster Monitor) integrate with the existing site components: local scheduler, disk and MSS.

22 Site services
- Unobtrusive – entirely in user space:
  - Single user account
  - All authentication already assured by the central services
  - Tuned to the existing site configuration – supports various schedulers and storage solutions
  - Running on many Linux flavours and platforms (IA32, IA64, Opteron)
  - Automatic software installation and updates (both service and application)
- Scalable and modular – different services can be run on different nodes (in front of/behind firewalls) to preserve site security and integrity
[Diagram] CERN firewall solution for large-volume file transfers: load-balanced file transfer nodes (on HTAR); the firewall opens only high ports (50K-55K) for parallel file transport between the AliEn data transfer services and the CERN intranet, while the other AliEn services stay behind the firewall.

23
- HP ProLiant DL380 – AliEn Proxy Server: up to 2500 concurrent client connections
- HP ProLiant DL580 – AliEn File Catalogue: 9 million entries, 400K directories, 10 GB MySQL DB
- HP server rx2600 – AliEn Job Services: 500K archived jobs
- HP ProLiant DL360 – AliEn to CASTOR (MSS) interface
- HP ProLiant DL380 – AliEn Storage Elements Volume Manager: 4 million entries, 3 GB MySQL DB
- 1 TB SATA disk server – log files, application software storage

24 Phase 2 job structure
Task: simulate the event reconstruction and remote event storage. Completed September 2004.
[Diagram] Central servers provide master job submission, the Job Optimizer (N sub-jobs), the RB, the File Catalogue, process monitoring and control, and the SEs. Sub-jobs run on AliEn CEs and, via the AliEn-LCG interface and the LCG RB, on LCG CEs. Underlying-event input files are read from CERN CASTOR; output files (zip archives) get a primary copy on the local SEs and a backup copy in CERN CASTOR, and are registered in the AliEn File Catalogue (for LCG SEs via edg(lcg) copy&register, with LCG LFN = AliEn PFN).

25 Production history
- ALICE repository – history of the entire DC, ~1000 monitored parameters:
  - Running and completed processes
  - Job status and error conditions
  - Network traffic
  - Site status, central services monitoring
  - ...
- 7 GB of data, 24 million records with 1-minute granularity – analysed to improve Grid performance
- Statistics:
  - 400 000 jobs, 6 hours/job, 750 MSi2K hours
  - 9M entries in the AliEn file catalogue
  - 4M physical files at 20 AliEn SEs in centres world-wide
  - 30 TB stored at CERN CASTOR
  - 10 TB stored at remote AliEn SEs + 10 TB backup at CERN
  - 200 TB network transfer CERN → remote computing centres
  - AliEn observed efficiency >90%
  - LCG observed efficiency 60% (see GAG document)

26 Job repartition
- Jobs (AliEn/LCG): Phase 1 – 75%/25%, Phase 2 – 89%/11%
- More operational sites were added to the ALICE Grid as the PDC progressed
- 17 permanent sites (33 total) under direct AliEn control, plus additional resources through Grid federation (LCG)
[Charts] Job distribution by site for Phase 1 and Phase 2.

27 Summary of PDC'04
- Computing resources
  - It took some effort to 'tune' the resources at the remote computing centres
  - The centres' response was very positive – more CPU and storage capacity was made available during the PDC
- Middleware
  - AliEn proved to be fully capable of executing high-complexity jobs and controlling large amounts of resources
  - Functionality for Phase 3 has been demonstrated, but cannot be used
  - LCG MW proved adequate for Phase 1, but not for Phase 2 and in a competitive environment
  - It cannot provide the additional functionality needed for Phase 3
- ALICE computing model validation:
  - AliRoot – all parts of the code successfully tested
  - Computing elements configuration
  - Need for a high-functionality MSS shown
  - The Phase 2 distributed data storage schema proved robust and fast
  - Data analysis could not be tested

28 Development of Analysis
- Analysis Object Data designed for efficiency
  - Contain only the data needed for a particular analysis
- Analysis à la PAW
  - ROOT + at most a small library (see the sketch below)
- Work on the distributed infrastructure has been done by the ARDA project
- Batch analysis infrastructure
  - Prototype published at the end of 2004 with AliEn
- Interactive analysis infrastructure
  - Demonstration performed at the end of 2004 with AliEn → gLite
- Physics working groups are just starting now, so the timing is right to receive requirements and feedback
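A sketch of what "ROOT + at most a small library" looks like in practice: a short macro over a chain of files, filling a histogram. The tree and branch names are placeholders rather than the real ESD/AOD layout.

    #include "TChain.h"
    #include "TH1F.h"

    void SimpleAnalysis()
    {
       TChain chain("esdTree");              // tree name assumed
       chain.Add("AliESDs_*.root");          // local files or a Grid-resolved list

       Float_t pt = 0;
       chain.SetBranchAddress("pt", &pt);    // illustrative flat branch

       TH1F hPt("hPt", "p_{T} spectrum;p_{T} (GeV/c);entries", 100, 0., 10.);
       for (Long64_t i = 0; i < chain.GetEntries(); ++i) {
          chain.GetEntry(i);
          hPt.Fill(pt);
       }
       hPt.Draw();
    }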

29
[Diagram] Grid-enabled PROOF setup: the client retrieves the list of logical files (LFN + MSN) from the Grid file/metadata catalogue and sends a booking request with logical file names; a "standard" PROOF session is then started, with the PROOF master steering PROOF slave servers at sites A and B (slave ports mirrored on the master host, optional site gateway, only outgoing connectivity). New elements include the Grid service interfaces, the Grid access control service, Grid/ROOT authentication, the TGrid UI/queue, proofd/rootd and PROOF client startup, the slave registration/booking DB and the LCG PROOFSteer master setup; the PROOF setup itself is Grid-middleware independent.
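A minimal sketch of an interactive PROOF session using the ROOT PROOF classes; the master URL, dataset and selector are placeholders, and the Grid-aware elements in the diagram add authentication and catalogue lookups on top of this.

    #include "TProof.h"
    #include "TChain.h"

    void RunProof()
    {
       TProof::Open("proofmaster.example.org");   // connect to the PROOF master

       TChain chain("esdTree");
       chain.Add("AliESDs_*.root");               // dataset distributed to the slaves

       chain.SetProof();                          // route Process() through PROOF
       chain.Process("MySelector.C+");            // hypothetical TSelector
    }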

30 Grid situation
- History
  - Jan '04: AliEn developers are hired by EGEE and start working on the new MW
  - May '04: A prototype derived from AliEn is offered to pilot users (ARDA, Biomed...) under the gLite name
  - Dec '04: The four experiments ask for this prototype to be deployed on the larger preproduction service and to become part of the EGEE release
  - Jan '05: This is vetoed at management level – AliEn will not be common software
- Current situation
  - EGEE has vaguely promised to provide the same functionality as the AliEn-derived MW
    - But with a delay of at least 2-4 months on top of the one already accumulated
    - And even this will be just the beginning of the story: the different components will have to be field-tested in a real environment – it took four years for AliEn
  - All experiments have their own middleware
    - Ours is not maintained because our developers have been hired by EGEE
  - EGEE has formally vetoed any further work on AliEn or AliEn-derived software
  - LCG has allowed some support for ALICE, but the situation is far from clear
- gLite → gLate

31 ALICE computing model
- For pp, similar to the other experiments
  - Quasi-online data distribution and first reconstruction at T0
  - Further reconstruction passes at the T1s
- For AA, a different model
  - Calibration, alignment and pilot reconstructions during data taking
  - Data distribution and first reconstruction at T0 during the four months after the AA run (shutdown)
  - Second and third passes distributed at the T1s
- For safety, one copy of RAW at T0 and a second one distributed among all T1s
- T0: first-pass reconstruction, storage of one copy of RAW, calibration data and first-pass ESDs
- T1: subsequent reconstructions and scheduled analysis, storage of the second collective copy of RAW and of one copy of all data to be safely kept (including simulation), disk replicas of ESDs and AODs
- T2: simulation and end-user analysis, disk replicas of ESDs and AODs
- Very difficult to estimate the network load

32 ALICE requirements on middleware
- One of the main uncertainties of the ALICE computing model comes from the Grid component
  - ALICE developed its computing model assuming that MW with the same quality and functionality that AliEn would have had two years from now will be deployable on the LCG computing infrastructure
  - If not, we will still analyse the data (!), but
    - Less efficiency → more computers → more time and money
    - More people for production → more money
- To elaborate an alternative model we should know
  - The functionality of the MW developed by EGEE
  - The support we can count on from LCG
  - Our "political" margin of manoeuvre

33 Possible strategy
- If
  a) basic services from the LCG/EGEE MW can be trusted at some level, and
  b) we can get some support to port the "higher functionality" MW onto these services,
  we have a solution.
- If a) above is not true, but
  a) we have support for deploying the ARDA-tested AliEn-derived gLite, and
  b) we do not have a political "veto",
  we still have a solution.
- Otherwise we are in trouble.

34 ALICE Offline Timeline
[Timeline 2004-2006] PDC04 and its analysis; design and development of new components; CDC 04 (?) and CDC 05; PDC05 preparation, Computing TDR and PDC05; PDC06 preparation and PDC06; final development of AliRoot, AliRoot ready, and first data-taking preparation. "Nous sommes ici" (we are here) marks the current point, early 2005.

35 Main parameters

36 Processing pattern

37 Conclusions
- ALICE has made a number of technical choices for the computing framework since 1998 that have been validated by experience
  - The Offline development is on schedule, although contingency is scarce
- Collaboration between physicists and computer scientists is excellent
- Tight integration with ROOT allows a fast prototyping and development cycle
- AliEn goes a long way in providing a Grid solution adapted to HEP needs
  - However its evolution into a common project has been "stopped"
  - This is probably the largest single "risk factor" for ALICE computing
- Some ALICE-developed solutions have a high potential to be adopted by other experiments and are indeed becoming "common solutions"

