
1 ALICE computing – focus on STEP09 and analysis activities. Latchezar Betev, Réunion LCG-France, LAPP Annecy, May 19, 2009

2 Outline
- General resources / T2s
- User analysis and storage on the Grid (LAF is covered by Laurent’s presentation)
- WMS
- Software distribution
- STEP’09
- Operations support

3 French computing centres contribution for ALICE: T1 – CCIN2P3; 6 T2s plus the T2 federation (GRIF)

4 Relative CPU share over the last 2 months: ~1/2 comes from the T2s!

5 Relative contribution – T2s
- The T2 share of the resources is substantial (globally)
- T2s provide ~50% of the CPU capacity for ALICE; they should also provide ~50% of the disk capacity
- The T0/T1 disk is mostly MSS buffer and therefore has a completely different function
- T2 role in the ALICE computing model: MC production and user analysis
- Replicas of MC and RAW ESDs are kept on T2 disk storage

6 Focus on analysis
- Grid responsiveness for user analysis
- ALICE uses a common Task Queue for all Grid jobs, with internal prioritization
- Pilot jobs are an indispensable part of the schema:
  - They check the ‘sanity’ of the WN environment (and die if something is wrong)
  - They pull the ‘top priority’ jobs for execution first
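To make the pilot-job flow concrete, here is a minimal Python sketch; the sanity checks and the Task Queue call (wn_is_sane, pull_top_priority_job) are illustrative placeholders, not the actual AliEn pilot code.

```python
"""Illustrative pilot-job skeleton: check the WN, then pull the top-priority job.

The helper names and checks are hypothetical; the real AliEn pilot has its own API.
"""
import shutil
import subprocess
import sys


def wn_is_sane(min_free_gb=5.0):
    """Minimal worker-node 'sanity' checks a pilot might perform."""
    free_gb = shutil.disk_usage("/tmp").free / 1e9
    has_space = free_gb >= min_free_gb
    has_xrootd_client = shutil.which("xrdcp") is not None
    return has_space and has_xrootd_client


def pull_top_priority_job():
    """Placeholder for the Task Queue match request; returns a command to run."""
    # A real pilot would contact the central ALICE Task Queue here and receive
    # the highest-priority job matching this site.
    return ["echo", "running highest-priority payload"]


def main():
    if not wn_is_sane():
        sys.exit(1)          # the pilot 'dies' if the environment is broken
    payload = pull_top_priority_job()
    if payload:
        subprocess.run(payload, check=False)


if __name__ == "__main__":
    main()
```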

7 Grid response time – user jobs
- Type 1 – jobs with little input data (MC): average waiting time 11 minutes, average running time 65 minutes, 62% probability of a waiting time below 5 minutes
- Type 2 – large input data, ESDs/AODs (analysis): average waiting time 37 minutes, average running time 50 minutes
- Response time is proportional to the number of replicas

8 Grid response time – user jobs (2)
- Type 1 (MC) can be regarded as the ‘maximum Grid response efficiency’
- Type 2 (ESDs/AODs) can be improved:
  - Trivially, by more data replication (not an option – not enough storage capacity)
  - Analysis train – grouping many analysis tasks on a common data set improves task efficiency and resource utilization (CPU/Wall + storage load)
  - Non-local data access through the xrootd global redirector: inter-site SE cooperation, common file namespace
  - Off-site access to storage from a job – is that really ‘off limits’?
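As a sketch of what off-site access through an xrootd global redirector looks like from inside a job: the redirector hostname and file path below are placeholders, while xrdcp is the standard xrootd copy client.

```python
"""Sketch of off-site data access through an xrootd global redirector."""
import subprocess

# Hypothetical global-redirector URL: the redirector resolves which site's SE
# actually holds a replica and redirects the client there.
SOURCE = "root://alice-global-redirector.example.org//alice/data/2009/ESDs/example.root"

# Copy the file to the local scratch area of the job.
result = subprocess.run(["xrdcp", SOURCE, "/tmp/example.root"],
                        capture_output=True, text=True)
if result.returncode != 0:
    print("xrdcp failed:", result.stderr)
```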

9 Storage stability
- Critical for the analysis – nothing helps if the storage is down
- A site can have half of its WNs off, but not half of its storage servers...
- It is impossible to know before the client tries to access the data – unless we allow off-site access...
- The ALICE computing model foresees 3 active replicas of all ESDs/AODs

10 Storage stability (2): T2 storage stability test under load (user tasks + production)

11 Storage availability scores
- Storage type 1 – average 73.9%; probability of all three replicas being alive = 41%. This defines the job waiting time and success rate
- Native xrootd – average 92.8%; probability of all three replicas being alive = 87%
- The above underlines the importance of extremely reliable storage, in the absence of infinite storage resources as compensation
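The ‘all three replicas alive’ figure follows from treating replica availability as approximately independent per storage element; a quick check of the first case as a sketch (real per-site availabilities differ, so the quoted numbers are effectively weighted averages):

```python
# Probability that all three replicas are simultaneously available, under the
# simplifying assumption of independent, identically available SEs.
availability = 0.739            # average availability of "storage type 1"
all_three_alive = availability ** 3
print(f"{all_three_alive:.0%}")   # ~40%, in line with the quoted 41%
```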

12 Storage continued
- Storage availability/stability remains one of the top priorities for ALICE (for strategic directions see Fabrizio’s talk)
- All other parameters being equal (protocol access speed and security), ALICE recommends a pure xrootd installation wherever feasible
- Ancillary benefit from the site admin point of view – no databases to worry about, plus storage cooperation through the global redirector

13 Workload management: WMS and CREAM
- WMS + gLite CE:
  - Relatively long period of understanding the service parameters
  - Big effort by the GRIF experts to provide a French WMS, now with high stability and reliability
  - Similar installations at other T1s (several at CERN)
  - Still ‘inherits’ the gLite CE limitations
- CREAM CE:
  - The future – the gLite CE’s days are numbered
  - Strategic direction of the WLCG
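For illustration only, a minimal direct-submission sketch against a CREAM CE using the standard gLite client; the CE endpoint and JDL are placeholders, and ALICE production jobs are in reality submitted centrally through AliEn pilots rather than hand-written JDL.

```python
"""Illustrative direct submission to a CREAM CE with the gLite CLI.

The endpoint 'cream-ce.example.org:8443/cream-pbs-alice' is a placeholder.
"""
import pathlib
import subprocess

JDL = """\
Executable    = "/bin/hostname";
StdOutput     = "job.out";
StdError      = "job.err";
OutputSandbox = {"job.out", "job.err"};
"""

jdl_path = pathlib.Path("test.jdl")
jdl_path.write_text(JDL)

# -a: automatic proxy delegation, -r: target CREAM endpoint and batch queue
subprocess.run(
    ["glite-ce-job-submit", "-a", "-r",
     "cream-ce.example.org:8443/cream-pbs-alice", str(jdl_path)],
    check=False,
)
```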

14 Workload management (2)
- CREAM CE (cont’d):
  - ALICE requires a CREAM CE at every centre, to be deployed before the start of data taking
  - Much better scalability, shown by extensive tests
  - Hands-off operation after the initial (still time-consuming) installation
  - Excellent support by the CNAF developers

15 Software deployment
- General need for improvement of the software deployment tools
- Software distribution is a ‘Class 1 service’ – a shared software area for the WNs and the VO-box
- Always a point of (security-related) critique
- Heterogeneous queues – mixed 32- and 64-bit hosts, various Linux flavours, other system library differences – hence the need for multiple application software versions
- In addition, the shared area (typically NFS):
  - Is often overloaded
  - Is a single point of failure
  - One ‘bad installation’ is fatal for the entire site operation

16 Packaging & size
- All the required Grid packages are combined into distributions:
  - Full installation: 155 MB – mysql, ldap, perl, java...
  - VO-box: 122 MB – monitor, perl, interfaces
  - User: 55 MB – API client, gsoap, xrootd
  - Worker node: 34 MB – minimal perl, openssl, xrootd
- Experiment software:
  - AliRoot: 160 MB
  - ROOT: 60 MB
  - GEANT3: 25 MB

17 Use existing technology – more than 150 million users!!

18 Torrent technology – diagram: tracker/seeder at alitorrent.cern.ch serving Site A and Site B; no inter-site seeding

19 Application software path
- Torrent files are created from the build system
- One seeder at CERN (standard tracker and seeder)
- Get the torrent client (aria2c) from the ALICE web server
- Download the files and install them
- Seed the files while the job runs
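A sketch of this path, assuming a placeholder torrent name under alitorrent.cern.ch; aria2c and its --dir/--seed-time options are standard, but the actual ALICE installation scripts differ.

```python
"""Sketch of the torrent-based software installation path described above."""
import subprocess
import urllib.request

TORRENT_URL = "http://alitorrent.cern.ch/torrents/AliRoot-example.torrent"  # placeholder name
TORRENT_FILE = "/tmp/aliroot.torrent"
INSTALL_DIR = "/tmp/alice-software"

# 1. Fetch the torrent metadata published by the build system.
urllib.request.urlretrieve(TORRENT_URL, TORRENT_FILE)

# 2. Download the payload and keep seeding for a while so other WNs at the
#    site can fetch it peer-to-peer (here: seed for up to 60 minutes).
subprocess.run(
    ["aria2c", "--dir", INSTALL_DIR, "--seed-time=60", TORRENT_FILE],
    check=False,
)
```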

20 ALICE activities calendar – STEP’09: timeline from December 2008 to December 2009 covering RAW deletion, replication of RAW, reprocessing pass 2, analysis train, WMS and CREAM CE, cosmics data taking, data taking, and STEP’09

21 ALICE STEP’09 activities
- Replication T0 -> T1:
  - Planned together with cosmics data taking, so it must be moved forward, or
  - We can repeat last year’s exercise – same rates (~100 MB/s), same destinations
- Re-processing with data recalls from tape at the T1s:
  - Highly desirable exercise – the data is already in the T1 MSS
  - The CCIN2P3 MSS/xrootd setup is being organized; we can export fresh RAW data into the buffer

22 ALICE STEP’09 activities (2)
- Non-Grid activity – transfer rate tests (up to 1.25 GB/s) from DAQ@P2 to CASTOR:
  - Validation of the new CASTOR and of the xrootd transfer protocol for RAW
  - Will run just before, or overlap with, STEP’09
  - CASTOR v2.1.8 is already deployed
- The transfer rate test will be coupled with first-pass reco@T0 and second-pass reco@T1

23 Grid operation – site support
- We need more help from the regional experts and site administrators, proactively looking at local service problems
- With data taking around the corner, the pressure to identify and fix problems will be mounting
- STEP09 will hopefully demonstrate this (albeit for a short time)
- Data taking will mean 9 months of uninterrupted operation!

24 Grid operation – site support (2)
- Two-day training session on 26/27 May:
  - VO-box setup and operation (gLite and AliEn services)
  - Common problems and solutions
  - Monitoring
  - Storage
- The training will also be broadcast on EVO
- All regional experts and site administrators are strongly encouraged to participate
- More than 40 people have registered already

25 Summary
- Grid operation/middleware:
  - The main focus is on reliable storage – not there yet
  - After the initial ‘teething’ pains, the WMS is under control
  - CREAM CE must be deployed everywhere and operational before data taking
  - In general, everyone needs services which ‘just run’, with minimal intervention and debugging
- Grid operation/expert support:
  - STEP’09 is the last ‘large’ exercise before data taking; still, it will only show whether there are big holes
  - The long LHC run will put an extraordinary load on all the experts
  - Training is organized for all – current status of the software and procedures

