ALICE computing – Focus on STEP09 and analysis activities
Latchezar Betev
Réunion LCG-France, LAPP Annecy, May 19, 2009

2 Outline
- General resources / T2s
- User analysis and storage on the Grid (LAF is covered by Laurent's presentation)
- WMS
- Software distribution
- STEP'09
- Operations support

3 French computing centres' contribution to ALICE
- T1: CC-IN2P3
- 6 T2s + a T2 federation (GRIF)

4 Relative CPU share
- Last 2 months: ~1/2 from T2s!

5 Relative contribution – T2s
- The T2 share of the resources is substantial (globally)
- T2s provide ~50% of the CPU capacity for ALICE, so they should also provide ~50% of the disk capacity
- The T0/T1 disk is mostly MSS buffer and therefore serves a completely different function
- T2 role in the ALICE computing model: MC production and user analysis
- Replicas of MC and RAW ESDs are kept on T2 disk storage

6 Focus on analysis
- Grid responsiveness for user analysis
- ALICE uses a common Task Queue for all Grid jobs, with internal prioritization
- Pilot jobs are an indispensable part of the scheme (see the sketch below):
  - They check the 'sanity' of the WN environment (and die if something is wrong)
  - They pull the 'top priority' jobs for execution first
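A minimal Python sketch of the pilot flow just described, assuming a hypothetical Task Queue endpoint and reply format; it is not the real AliEn job-agent interface:

```python
# A minimal sketch of the pilot-job flow described above. The Task Queue URL,
# the JSON reply format and the sanity checks are illustrative assumptions,
# not the real AliEn job agent.
import json
import os
import shutil
import subprocess
import urllib.request

TASK_QUEUE_URL = "https://taskqueue.example.org/match"   # hypothetical endpoint

def worker_node_is_sane():
    """'Sanity' checks a pilot could run before pulling any payload."""
    enough_scratch = shutil.disk_usage("/tmp").free > 5 * 1024**3   # > 5 GB free
    has_proxy = os.path.exists(os.environ.get("X509_USER_PROXY", ""))
    return enough_scratch and has_proxy

def main():
    if not worker_node_is_sane():
        return                                   # die: a broken WN never pulls a job
    with urllib.request.urlopen(TASK_QUEUE_URL) as reply:
        job = json.load(reply)                   # queue returns the highest-priority
                                                 # job matching this worker node
    subprocess.run(job["command"], check=False)  # execute the real payload

if __name__ == "__main__":
    main()
```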

7 Grid response time – user jobs
- Type 1 – jobs with little input data (MC)
  - Average waiting time: 11 minutes
  - Average running time: 65 minutes
  - 62% probability of a waiting time < 5 minutes
- Type 2 – jobs with large input data, ESDs/AODs (analysis)
  - Average waiting time: 37 minutes
  - Average running time: 50 minutes
  - Response time improves with the number of data replicas (see next slide)

8 Grid response time – user jobs (2)
- Type 1 (MC) can be regarded as the 'maximum Grid response efficiency'
- Type 2 (ESDs/AODs) can be improved:
  - Trivial: more data replication (not an option – not enough storage capacity)
  - Analysis train – grouping many analysis tasks over a common data set improves task efficiency and resource utilization (CPU/wall time + storage load); see the sketch below
  - Non-local data access through the xrootd global redirector: inter-site SE cooperation, common file namespace
  - Off-site access to storage from a job – is that really 'off limits'?
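As a rough illustration of the train idea above, this Python sketch (with placeholder task classes and a dummy read_events()) shows how several tasks can share a single I/O pass over a common data set:

```python
# Sketch of the analysis-train pattern: many user tasks share a single pass
# over the data, so each ESD/AOD file is read once instead of once per task.
# The task classes and read_events() are placeholders, not ALICE code.

def read_events(path):
    """Placeholder: would open the ESD/AOD file and yield its events."""
    yield from ()

class PtSpectrumTask:
    def process(self, event): ...     # user analysis code goes here
    def finish(self): ...             # write this task's output

class MultiplicityTask:
    def process(self, event): ...
    def finish(self): ...

def run_train(tasks, files):
    for path in files:
        for event in read_events(path):   # single I/O pass over the input
            for task in tasks:            # every 'wagon' sees every event
                task.process(event)
    for task in tasks:
        task.finish()

run_train([PtSpectrumTask(), MultiplicityTask()], ["AliESDs.root"])
```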

9 Storage stability
- Critical for analysis – nothing helps if the storage is down
- A site can run with half of its WNs off, but not with half of its storage servers...
- Impossible to know before the client tries to access the data
  - Unless we allow off-site access...
- The ALICE computing model foresees 3 active replicas of all ESDs/AODs

10 Storage stability (2)
- T2 storage stability test under load (user tasks + production)

11 Storage availability scores
- Storage type 1: average 73.9%
  - Probability of all three replicas being alive = 41% (calculation below)
  - This defines the job waiting time and success rate
- Native xrootd: average 92.8%
  - Probability of all three replicas being alive = 87%
- The above underlines the importance of extremely reliable storage, in the absence of infinite storage resources as compensation
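The "all three alive" figure follows from multiplying the per-copy availabilities, assuming the three replicas fail independently (an assumption the slide implies but does not state); a minimal check for the first storage type:

```latex
P(\text{all 3 replicas alive}) = p^{3},
\qquad p = 0.739 \;\Rightarrow\; 0.739^{3} \approx 0.40,
\;\text{i.e. the quoted}\;\sim 41\%.
```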

12 Storage continued
- Storage availability/stability remains one of the top priorities for ALICE
- For strategic directions see Fabrizio's talk
- All other parameters being equal (protocol access speed and security), ALICE recommends a pure xrootd installation wherever feasible
- Ancillary benefit from the site admin point of view: no databases to worry about, plus storage cooperation through the global redirector (see the sketch below)
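For illustration, a hedged PyROOT sketch of what reading a replica through an xrootd redirector looks like from a user job; the redirector host and file path are invented, only the root:// URL scheme and TFile.Open are standard:

```python
# Hedged PyROOT sketch of reading an ESD replica through an xrootd redirector.
# The redirector host and file path below are made up for illustration.
import ROOT

url = "root://redirector.example.org//alice/data/2009/ESDs/AliESDs.root"
f = ROOT.TFile.Open(url)          # the redirector points the client to a live SE
if f and not f.IsZombie():
    tree = f.Get("esdTree")       # event tree stored in AliESDs.root
    print("events:", tree.GetEntries())
    f.Close()
```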

13 Workload management: WMS and CREAM
- WMS + gLite CE
  - Relatively long period of understanding the service parameters
  - Big effort by GRIF experts to provide a French WMS, now with high stability and reliability
  - Similar installations at other T1s (several at CERN)
  - Still 'inherits' the gLite CE limitations
- CREAM CE
  - The future – the gLite CE's days are numbered
  - Strategic direction of the WLCG

14 Workload management (2)
- CREAM CE (cont'd)
  - ALICE requires a CREAM CE at every centre, to be deployed before the start of data taking
  - Much better scalability, shown by extensive tests
  - Hands-off operation after the initial (still time-consuming) installation
  - Excellent support by the CNAF developers

15 Software deployment
- General need for improvement of the software deployment tools
- Software distribution is a 'Class 1 service' – shared software area on the WNs and the VO-box
  - Always a point of (security-related) critique
- Heterogeneous queues: mixed 32- and 64-bit hosts, various Linux flavours, other system library differences – hence the need for multiple application software versions
- In addition, the shared area (typically NFS):
  - Is often overloaded
  - Is a single point of failure
  - One 'bad installation' is fatal for the entire site operation

16 Packaging & size
- Combine all the required Grid packages into distributions:
  - Full installation: 155 MB – mysql, ldap, perl, java...
  - VO-box: 122 MB – monitor, perl, interfaces, ...
  - User: 55 MB – API client, gsoap, xrootd
  - Worker node: 34 MB – minimal perl, openssl, xrootd
- Experiment software:
  - AliRoot: 160 MB
  - ROOT: 60 MB
  - GEANT3: 25 MB

17 Use existing technology
- Torrent technology: more than 150 million users!

18 Torrent technology
- alitorrent.cern.ch – central tracker/seeder at CERN
- [Diagram: Site A and Site B download independently] No inter-site seeding

19 Application software path
- Torrent files are created from the build system
- One seeder at CERN (standard tracker and seeder)
- The torrent client (aria2c) is fetched from the ALICE web server
- The WN downloads the files and installs them
- It seeds the files while the job runs (see the sketch below)
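A hedged sketch of what the worker-node download/seed step could look like with aria2c; the torrent file name and install directory are examples, not the actual AliEn installer:

```python
# Hedged sketch of the worker-node step above, using aria2c as the torrent
# client. The torrent file name and install directory are made-up examples;
# --dir and --seed-time are standard aria2c options.
import subprocess

TORRENT = "AliRoot-package.torrent"          # hypothetical package torrent
INSTALL_DIR = "/tmp/alice-packages"

# Start the download; --seed-time keeps the client seeding after the download
# completes, so other worker nodes at the site can fetch from this node.
client = subprocess.Popen(
    ["aria2c", "--dir=" + INSTALL_DIR, "--seed-time=120", TORRENT]
)

# ... a real wrapper would wait for the package to land, install it and start
#     the job payload here, leaving aria2c seeding in the background ...

client.terminate()                           # stop seeding once the job is done
```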

20 ALICE activities calendar – STEP'09
[Calendar figure: RAW deletion; replication of RAW; reprocessing Pass 2; analysis train; WMS and CREAM CE; cosmics data taking; data taking; STEP'09]

21 ALICE STEP'09 activities
- Replication T0 -> T1
  - Planned together with cosmics data taking; must be moved forward, or
  - We can repeat last year's exercise: same rates (~100 MB/s), same destinations
- Re-processing with data recalls from tape at the T1s
  - Highly desirable exercise; the data is already in the T1 MSS
  - The CCIN2P3 MSS/xrootd setup is being organized; we can export fresh RAW data into the buffer

22 ALICE STEP'09 activities (2)
- Non-Grid activity: transfer rate tests (up to 1.25 GB/s) to CASTOR
  - Validation of the new CASTOR and of the xrootd transfer protocol for RAW
  - Will run just before, or overlap with, STEP'09
  - The new CASTOR version is already deployed
- The transfer rate test will be coupled with the first and second reconstruction passes

23 Grid operation – site support
- We need more help from the regional experts and site administrators, proactively looking at local service problems
- With data taking around the corner, the pressure to identify and fix problems will be mounting
  - STEP'09 will hopefully demonstrate this (albeit for a short time)
- The data taking will be 9 months of uninterrupted operation!

24 Grid operation – site support (2)
- Two-day training session on 26/27 May:
  - VO-box setup and operation (gLite and AliEn services)
  - Common problems and solutions
  - Monitoring
  - Storage
- The training will also be available on EVO
- All regional experts and site administrators are strongly encouraged to participate
- More than 40 people have registered already

25 Summary
- Grid operation / middleware
  - The main focus is on reliable storage – not yet there
  - After initial 'teething' pains, the WMS is under control
  - CREAM CE must be deployed everywhere and operational before data taking
  - In general, everyone needs services which 'just run', with minimal intervention and debugging
- Grid operation / expert support
  - STEP'09 is the last 'large' exercise before data taking; still, it will only show whether there are big holes
  - The long LHC run will put an extraordinary load on all experts
  - Training is organized for all, covering the current status of software and procedures