LCG GDB – Nov’05 1  Expt SC3 Status – Nick Brook. In chronological order: ALICE, CMS, LHCb, ATLAS

LCG GDB – Nov’05 2

3  ALICE Physics Data Challenge ’05 – goals

PDC’05: test and validation of the remaining parts of the ALICE Offline computing model:
– Quasi-online reconstruction of RAW data at CERN (T0), without calibration
– Synchronised data replication from CERN to the T1s
– Synchronised data replication from T2s to their ‘host’ T1
– Second-phase (delayed) reconstruction at T1s with calibration and remote storage
– Data analysis

Data production:
– List of physics signals defined by the ALICE Physics Working Groups
– Data used for detector and physics studies
– Approximately 500K Pb+Pb events with different physics content, 1M p+p events, 80 TB of production data and a few TB of user-generated data

Structure – divided in three phases:
– Phase 1 – production of events on the Grid, storage at CERN and at T2s
– Phase 2 (synchronised with SC3) – Pass 1 reconstruction at CERN, push data from CERN to the T1s, Pass 2 reconstruction at T1s with calibration and storage; this corresponds to the throughput phase of SC3 – how fast we can push data out
– Phase 3 – analysis of data (batch) and interactive analysis with PROOF

LCG GDB – Nov’05 4  Methods of operation

Use the LCG/EGEE SC3 baseline services:
– Workload management
– Reliable file transfer (FTS)
– LCG File Catalogue (LFC)
– Storage (SRM), CASTOR2

The production and data-replication phases are synchronised with LCG/EGEE SC3. Operation of PDC’05/SC3 is coordinated through the ALICE–LCG Task Force.

Run entirely on LCG resources:
– Use the framework of VO-boxes provided at the sites
– Require approximately 1400 CPUs (but would like to have as many as possible) and 80 TB of storage capacity

List of active SC3 sites for ALICE:
– T1s: CCIN2P3, CERN, CNAF, GridKa (up to a few hundred CPUs each)
– T2s: Bari, Catania, GSI, JINR, ITEP, Torino (up to a hundred CPUs each)
– US (OSG), Nordic (NDGF) and a number of other sites are joining the exercise at present
– SC3 + others – approximately 25 centres

LCG GDB – Nov’05 5  [Diagram: ALICE job submission via the VO-box. The user submits a job (a list of input LFNs) to the ALICE Job Catalogue / central Task Queue; the Optimizer splits it into sub-jobs according to where the input LFNs are stored (ALICE File Catalogue: lfn → guid → {SEs}). A CE agent on the site VO-box knows the close SEs and asks for workload; the RB matchmakes and sends job agents to the LCG CE/WN. Each job agent checks its environment (dies gracefully if not OK), retrieves a workload from the Task Queue, runs the user job, registers the output in the ALICE catalogues, sends the job result back and updates the Task Queue.]
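To make the pull model in the diagram concrete, here is a minimal, hypothetical Python sketch of the job-agent ("pilot") pattern it describes: a central task queue holds split sub-jobs keyed by the location of their input LFNs, and an agent landing on a site worker node checks its environment, pulls a matching workload, runs it and reports back. All class and function names are illustrative, not the actual AliEn/LCG interfaces.

```python
# Hypothetical sketch of the pull-based job-agent model shown in the diagram.
# None of these names are real AliEn/LCG APIs; they only illustrate the flow.
import random

class TaskQueue:
    """Central ALICE job catalogue: sub-jobs keyed by the SEs holding their input LFNs."""
    def __init__(self):
        self.pending = [
            {"id": "1.1", "lfns": ["lfn1"], "close_se": "CERN"},
            {"id": "1.2", "lfns": ["lfn2"], "close_se": "CNAF"},
        ]

    def pull(self, close_se):
        # Hand out a sub-job whose input data is close to the requesting site.
        for job in self.pending:
            if job["close_se"] == close_se:
                self.pending.remove(job)
                return job
        return None

    def update(self, job_id, status):
        print(f"TQ: job {job_id} -> {status}")

def environment_ok():
    # The real agent checks software installation, disk space, proxy lifetime, etc.
    return random.random() > 0.1

def run_job_agent(tq, close_se):
    """Agent sent to a worker node via the RB: check env, pull workload, run, report."""
    if not environment_ok():
        return          # "die with grace": no workload is lost, it stays in the TQ
    job = tq.pull(close_se)
    if job is None:
        return
    # ... run the payload, register output lfn -> guid -> {SEs} in the ALICE File Catalogue ...
    tq.update(job["id"], "done")

tq = TaskQueue()
run_job_agent(tq, "CERN")
```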

LCG GDB – Nov’05 6  Status of production

Setup and operational status of the VO-box framework:
– Gained very good experience during the installation and operation
– Interaction between the ALICE-specific agents and the LCG services is robust
– The VO-box model is scaling with the increasing load
– In production for almost 1½ months
– Many thanks to the IT/GD group for the help with the installation and operation, and to the site administrators for making the VO-boxes available

Setup and status of storage:
– ALICE is now completely migrated to CASTOR2
– Currently stored: 200K files (ROOT ZIP archives), 20 TB, adding ~4K files/day
– Operational issues are discussed regularly with the IT/FIO group; ALICE is providing feedback

LCG GDB – Nov’05 7  Status of production

Current job status:
– Production job duration: 8½ hours on a 1 kSI2k CPU; output archive size: 1 GB (consists of 20 files)
– Total CPU work: 80 MSI2k hours; total storage: 20 TB

[Plot: job activity over the last 24 hours of operation]

LCG GDB – Nov’05 8  ALICE plans

File replication with FTS (see the sketch after this slide):
– FTS endpoints tested at all ALICE SC3 sites
– Start data migration in about 10 days, initially T0→T1
– Test, if possible, Tx→Ty migration

Re-processing of data with calibration at T0/T1:
– The AliRoot framework is ready; calibration and alignment algorithms are currently being implemented by the ALICE detector experts
– Aiming for Grid tests at the end of 2005

Analysis of the produced data:
– Analysis framework developed by ARDA
– Aiming at first controlled tests at the beginning of 2006
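As a rough illustration of what "file replication with FTS" looks like from a script, here is a hedged Python sketch that drives the gLite FTS command-line client for a single T0→T1 copy. It assumes the glite-transfer-submit / glite-transfer-status tools of that era are on the PATH and a valid grid proxy exists; the endpoint URL, SURLs and the exact state strings are placeholders/assumptions, not values from the talk.

```python
# Sketch: driving a T0->T1 replication with the gLite FTS command-line client.
# The endpoint and SURLs are placeholders; option names and state strings are assumptions.
import subprocess
import time

FTS_ENDPOINT = "https://fts.example.cern.ch:8443/FileTransfer"                 # placeholder
SRC = "srm://castor.example.cern.ch/castor/cern.ch/alice/raw/run123/file.root"  # placeholder
DST = "srm://srm.example-t1.org/pnfs/example-t1.org/alice/raw/run123/file.root" # placeholder

def submit(src, dst):
    # Submit one source->destination copy; the client prints a job identifier on stdout.
    out = subprocess.run(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, src, dst],
        check=True, capture_output=True, text=True)
    return out.stdout.strip()

def wait_for(job_id, poll_seconds=60):
    # Poll the job until FTS reports a terminal state.
    while True:
        out = subprocess.run(
            ["glite-transfer-status", "-s", FTS_ENDPOINT, job_id],
            check=True, capture_output=True, text=True)
        state = out.stdout.strip()
        if state in ("Done", "Finished", "Failed", "Canceled"):
            return state
        time.sleep(poll_seconds)

job = submit(SRC, DST)
print(job, wait_for(job))
```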

LCG GDB – Nov’05 9

10  SC3 Aims

Phase 1 (data moving):
– Demonstrate data management to meet the requirements of the Computing Model
– Planned: October–November

Phase 2 (data processing):
– Demonstrate the full data-processing sequence in real time
– Demonstrate full integration of the data and workload management subsystems
– Planned: mid-November + December

Currently still in Phase 1 – Phase 2 to start soon

LCG GDB – Nov’05 11  LHCb architecture for using FTS

Central data-movement model based at CERN:
– FTS + TransferAgent + RequestDB
– TransferAgent + RequestDB were developed for this purpose
– The Transfer Agent runs on an LHCb-managed lxgate-class machine

LCG GDB – Nov’05 12  DIRAC transfer agent

– Gets transfer requests from the Transfer Manager
– Maintains the queue of pending transfers
– Validates transfer requests
– Submits transfers to the FTS
– Follows the execution of the transfers, resubmitting if necessary
– Sends progress reports to the monitoring system
– Updates the replica information in the File Catalogue
– Does accounting for the transfers

A minimal sketch of this agent loop is shown below.
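The sketch below strings the bullet points above into one loop: take pending requests, validate, submit to FTS, follow execution, resubmit on failure, and register replicas. It is not the real DIRAC code; RequestDB, FTSClient and FileCatalog are stand-ins for the actual components, and the LFN/site values are invented.

```python
# Illustrative sketch of a DIRAC-style transfer agent loop (not the real DIRAC code).
# RequestDB, FTSClient and FileCatalog are stand-ins for the actual components.

class RequestDB:
    def get_pending(self):
        # Transfer requests received from the Transfer Manager (placeholder data).
        return [{"lfn": "/lhcb/example.dst", "source": "CERN", "target": "RAL", "retries": 0}]
    def update(self, req, state):
        # Keep the pending-transfer queue and accounting up to date.
        print(req["lfn"], "->", state)

class FTSClient:
    def submit(self, req):
        return "fts-job-0001"     # hand one validated request to FTS, get a job id back
    def status(self, job_id):
        return "Done"             # or "Failed", in which case the agent resubmits

class FileCatalog:
    def register_replica(self, lfn, site):
        print("registered", lfn, "at", site)

def run_once(reqdb, fts, catalog, max_retries=3):
    for req in reqdb.get_pending():
        if not req["lfn"].startswith("/lhcb/"):       # validation step
            reqdb.update(req, "invalid")
            continue
        job = fts.submit(req)
        state = fts.status(job)
        if state == "Done":
            catalog.register_replica(req["lfn"], req["target"])
            reqdb.update(req, "done")                  # also feeds monitoring/accounting
        elif req["retries"] < max_retries:
            req["retries"] += 1                        # resubmit on the next pass
            reqdb.update(req, "retry")
        else:
            reqdb.update(req, "failed")

run_once(RequestDB(), FTSClient(), FileCatalog())
```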

LCG GDB – Nov’05 13  Phase 1

Distribute stripped data Tier-0 → Tier-1s (1 week, 1 TB):
– The goal is to demonstrate the basic tools
– Precursor activity to eventual distributed analysis

Distribute data Tier-0 → Tier-1s (2 weeks, 8 TB):
– The data are already accumulated at CERN
– The data are moved to the Tier-1 centres in parallel
– The goal is to demonstrate automatic tools for data moving and bookkeeping, and to achieve a reasonable performance of the transfer operations

Removal of replicas (via LFN) from all Tier-1s

Tier-1 centre(s) to Tier-0 and to other participating Tier-1 centres:
– Data are already accumulated
– Data are moved to the Tier-1 centres in parallel
– Goal: meet the transfer need during the stripping process

LCG GDB – Nov’05 14  Participating sites

– Tier-0–Tier-1 channels over dedicated network links
– Bi-directional FZK–CNAF channel on the open network
– Tier-1–Tier-1 channel matrix requested from all sites – still in the process of configuration – central or experiment coordination?
– Is a central service needed for managing the T1–T1 matrix?

LCG GDB – Nov’05 15  Overview of SC3 activity

– SARA shows almost no effective bandwidth from 25/10
– When the service was stable, LHCb’s SC3 needs were surpassed

LCG GDB – Nov’05 16  Problems…

The number of FTS files per channel dramatically affects performance (see the sketch after this slide):
– By default set to 30 concurrent files per channel, each file with 10 GridFTP streams
– 300 streams proved to be too much for some endpoints
– PIC and RAL bandwidth stalled with 30 files; 10 files gave good throughput

Before 19/10, many problems with the CASTOR2/FTS interaction:
– Files not staged cause FTS transfers to time out or fail
– It is currently not possible to transfer files directly from tape with FTS
– Files were pre-staged to disk: ~50k files for transfer (~75k in total: 10 TB)

CASTOR2 – too many problems to list…
– Reliability of the service increased markedly when the Oracle server machine was upgraded
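The stream arithmetic behind the tuning above is simple: the total concurrent GridFTP streams hitting an endpoint is the files-per-channel setting times the streams per file. The hypothetical helper below just turns that into a files-per-channel choice; the per-site "stream budget" numbers are illustrative, not measured limits from the talk.

```python
# Total concurrent GridFTP streams = concurrent files per channel x streams per file.
# With the FTS defaults of the time: 30 files x 10 streams = 300 streams, which
# stalled PIC and RAL; dropping to 10 files (100 streams) gave good throughput.

STREAMS_PER_FILE = 10

def files_per_channel(endpoint_stream_budget, streams_per_file=STREAMS_PER_FILE):
    """Largest files-per-channel setting that stays within the endpoint's stream budget."""
    return max(1, endpoint_stream_budget // streams_per_file)

# Illustrative endpoint budgets only (not measured site limits):
for site, budget in {"CERN": 300, "PIC": 100, "RAL": 100}.items():
    print(site, "->", files_per_channel(budget), "concurrent files per channel")
```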

LCG GDB – Nov’05 17  Problems…

srm_advisory_delete:
– Inconsistent behaviour of SRM depending on the back-end implementation
– The functionality is not well defined in SRM v1.1
– It is not possible at the moment to physically delete files in a consistent way on the Grid
– dCache can “advisory delete” and then re-write, but a file can’t be overwritten until an “advisory delete” has been issued
– CASTOR can simply overwrite!

FTS failure problems:
– A partial transfer can’t be re-transferred after a failure
– FTS failed to issue an “advisory delete” after a failed transfer
– Transfers to dCache sites can’t be rescheduled until an “advisory delete” is issued manually (a sketch of this work-around is given below)
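The manual work-around described above amounts to: if a transfer to a dCache-backed endpoint fails, issue an advisory delete on the destination SURL before rescheduling. A hedged sketch follows; the srm-advisory-delete command name, the list of dCache sites and the resubmit() helper are assumptions for illustration, not documented interfaces.

```python
# Sketch of the manual clean-up after a failed FTS transfer to a dCache endpoint.
# The 'srm-advisory-delete' invocation, the site list and resubmit() are assumptions.
import subprocess

DCACHE_SITES = {"PIC", "RAL", "GRIDKA"}   # illustrative list of dCache-backed endpoints

def advisory_delete(dest_surl):
    # dCache will not allow the destination to be overwritten until this is done;
    # CASTOR simply overwrites, so no clean-up is needed there.
    subprocess.run(["srm-advisory-delete", dest_surl], check=True)

def resubmit(src_surl, dest_surl):
    print("resubmitting", src_surl, "->", dest_surl)   # stand-in for a new FTS submission

def handle_failed_transfer(src_surl, dest_surl, dest_site):
    if dest_site in DCACHE_SITES:
        advisory_delete(dest_surl)
    resubmit(src_surl, dest_surl)

handle_failed_transfer("srm://castor.example.ch/lhcb/f1",
                       "srm://srm.example-pic.es/lhcb/f1", "PIC")
```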

LCG GDB – Nov’05 18  Problems…

LFC registration/query:
– This is currently the limiting factor in our system
– Moving to using “sessions”, which removes the authentication overhead for each operation
– Under evaluation (another approach: a read-only, insecure front-end for query operations)
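To show what the "sessions" approach buys, here is a minimal sketch assuming the LFC Python bindings shipped with the LCG client tools of the time (module lfc). The call names (lfc_startsess / lfc_access / lfc_endsess), the host name and the paths are from memory and should be treated as assumptions.

```python
# Sketch of using LFC "sessions" to amortise the per-call authentication cost.
# Assumes the LFC Python bindings (module 'lfc'); call names, host and paths are assumptions.
import lfc

LFC_HOST = "lfc.example.cern.ch"          # placeholder catalogue host
lfns = ["/grid/lhcb/example/file%04d" % i for i in range(1000)]

# Without a session every lfc_* call performs its own authentication handshake;
# wrapping the batch in startsess/endsess authenticates once for the whole loop.
lfc.lfc_startsess(LFC_HOST, "bulk replica query")
try:
    for lfn in lfns:
        if lfc.lfc_access(lfn, 0) == 0:   # 0 = existence check, as in POSIX access()
            pass                           # ...query/register replicas here...
finally:
    lfc.lfc_endsess()
```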

LCG GDB – Nov’05 19

LCG GDB – Nov’05 20  ATLAS SC3 goals

– Exercise the ATLAS data flow
– Integration of the data flow with the ATLAS Production System
– Tier-0 exercise
– Completion of a “Distributed Production” exercise (this has been delayed)

The following slides cover the Tier-0 dataflow exercise, which is running now.
More information: –

LCG GDB – Nov’05 21  ATLAS-SC3 Tier-0

– Quasi-RAW data is generated at CERN and reconstruction jobs are run at CERN (no data is transferred from the pit to the computer centre)
– The “raw data” and the reconstructed ESD and AOD data are replicated to Tier-1 sites using agents on the VO Boxes at each site
– Exercising the use of the CERN infrastructure (CASTOR 2, LSF), the LCG Grid middleware (FTS, LFC, VO Boxes) and the experiment’s Distributed Data Management (DDM) software

LCG GDB – Nov’05 22  ATLAS Tier-0

[Diagram: data flow between the Event Filter (EF), the Tier-0 CPU farm, CASTOR and the Tier-1s. Per-format figures from the diagram:]
– RAW: 1.6 GB/file, 0.2 Hz, 17K files/day, 320 MB/s, 27 TB/day
– ESD: 0.5 GB/file, 0.2 Hz, 17K files/day, 100 MB/s, 8 TB/day
– AOD: 10 MB/file, 2 Hz, 170K files/day, 20 MB/s, 1.6 TB/day
– AODm: 500 MB/file, 0.04 Hz, 3.4K files/day, 20 MB/s, 1.6 TB/day
[Aggregate flow labels on the diagram arrows: RAW+AOD; RAW+ESD (2x)+AODm (10x); RAW+ESD+AODm; 0.44 Hz, 37K files/day, 440 MB/s; 1 Hz, 85K files/day, 720 MB/s; 0.4 Hz, 190K files/day, 340 MB/s; 2.24 Hz, 170K files/day (temp), 20K files/day (perm), 140 MB/s]

LCG GDB – Nov’05 23  ATLAS-SC3 Tier-0

The main goal is a 10% exercise: reconstruct ~10% of the number of events ATLAS will get in 2007, using ~10% of the full resources that will be needed at that time.

Tier-0:
– ~300 kSI2k
– “EF” to CASTOR: 32 MB/s
– Disk to tape: 44 MB/s (32 for RAW and 12 for ESD+AOD)
– Disk to WN: 34 MB/s
– T0 to T1s: 72 MB/s
– 3.8 TB to “tape” per day

Each Tier-1 (on average):
– ~1000 files per day
– 0.6 TB per day
– at a rate of ~7.2 MB/s

A quick consistency check of these rates is sketched below.
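As a sanity check (not from the slide itself), the daily volumes follow directly from the sustained rates: 44 MB/s to tape is about 3.8 TB/day and 7.2 MB/s per Tier-1 is about 0.6 TB/day, in agreement with the figures quoted above.

```python
# Back-of-envelope check of the Tier-0 / Tier-1 figures quoted above.
SECONDS_PER_DAY = 86_400

def tb_per_day(rate_mb_s):
    """Daily volume in TB for a sustained rate in MB/s (decimal units, 1 TB = 1e6 MB)."""
    return rate_mb_s * SECONDS_PER_DAY / 1e6

print(tb_per_day(44))    # disk -> tape:  ~3.8 TB/day, matching "3.8 TB to tape per day"
print(tb_per_day(7.2))   # per Tier-1:    ~0.62 TB/day, matching "0.6 TB per day"
print(tb_per_day(72))    # T0 -> all T1s: ~6.2 TB/day, shared among the Tier-1s
```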

LCG GDB – Nov’05 24  SC3 pre-production testing

For 2–3 weeks, up to 1st November, the functionality of the SC3 services integrated with ATLAS DDM (FTS, LFC, etc.) was tested. It was very useful to have this testing phase with the pilot services, since there were many problems:
– On the experiment side, a good test of the deployment mechanism – bugs were fixed and it was optimised
– On the sites – mainly trivial issues, such as SRM paths
– A point on mailing lists: there are too many
– For official problem reporting, the ticketing system sometimes doesn’t work and the response is slow
– Better coordination is needed when deploying components, to avoid conflicts (for example LCG/POOL/CASTOR)

LCG GDB – Nov’05 25 Data transfer

LCG GDB – Nov’05 26  [Plot: transfer rate in the hours before the 4-day intervention of 29/10–1/11.] We achieved quite a good rate in the testing phase (sustained MB/s to one site, PIC).

LCG GDB – Nov’05 27  SC3 experience in the ‘production’ phase

– Started on Wed 2nd Nov and ran smoothly for ~24h (above the bandwidth target), until problems occurred with all 3 sites simultaneously:
– CERN: power cut and network problems, which then caused a CASTOR name-space problem
– PIC: a tape-library problem meant the FTS channel was switched off
– CNAF: the LFC client was upgraded and was not working properly
– It took about 1 day to solve all these problems
– No jobs were running during the last weekend

LCG GDB – Nov’05 28  Data distribution

Used a generated “dataset” containing 6035 files (3 TB) and tried to replicate it to CNAF and PIC.

PIC: 3600 files copied and registered
– 2195 ‘failed replication’ after 5 retries by us × 3 FTS retries; the problem is under investigation
– 205 ‘assigned’ – still waiting to be copied
– 31 ‘validation failed’ since the SE is down
– 4 ‘no replicas found’ – LFC connection error

CNAF: 5932 files copied and registered
– 89 ‘failed replication’
– 14 ‘no replicas found’

LCG GDB – Nov’05 29  General view of SC3

– When everything is running smoothly, ATLAS gets good results
– The middleware (FTS, LFC) is stable, but the sites’ infrastructure is still very unreliable
– ATLAS DDM software dependencies can also cause problems when sites upgrade middleware
– Good response from LCG and the sites when there are problems (with the proviso of the earlier mailing-list comment)
– But the sites are not running 24/7 support, which means a problem discovered at 6pm on Friday may not be answered until 9am on Monday, so we lose 2½ days of production
– Good cooperation with the CERN-IT CASTOR and LSF teams
– Not managed to exhaust anything (production s/w, LCG m/w)
– Still far from concluding the exercise, and not running stably in any way – a cause for concern
– The exercise will continue, adding new sites

LCG GDB – Nov’05 30  General summary of SC3 experiences

Reliability seems to be the major issue:
– CASTOR2 – still ironing out problems, but big improvements in the service
– Coordination issues
– Problems with sites and networks: MSS, security, network, services…

FTS:
– For well-defined sites/channels it performs well after tuning
– Timeout problems when accessing data from MSS
– Clean-up problems after transfer failures
– The ability for a centralised service to do 3rd-party transfers is wanted
– Plenty to discuss at next week’s workshop

SRM:
– Limitations/ambiguity (already flagged) in the functionality