LHCC meeting – Feb’06 1 SC3 - Experiments’ Experiences, Nick Brook. In chronological order: ALICE, CMS, LHCb, ATLAS

LHCC meeting – Feb’06 2

3 General running statistics
Event sample (last two months of running):
 – 22,500 jobs completed (Pb+Pb and p+p); average duration 8 hours; 67,500 cycles of jobs
 – Total CPU work: 540 kSI2k hours
 – Total output: 20 TB (90% CASTOR2, 10% site SEs)
Centres participating (22 in total):
 – 4 T1s: CERN, CNAF, GridKa, CCIN2P3
 – 18 T2s: Bari (I), Clermont (FR), GSI (D), Houston (USA), ITEP (RUS), JINR (RUS), KNU (UKR), Muenster (D), NIHAM (RO), OSC (USA), PNPI (RUS), SPbSU (RUS), Prague (CZ), RMKI (HU), SARA (NL), Sejong (SK), Torino (I), UiB (NO)
Jobs done per site:
 – T1s: CERN 19%, CNAF 17%, GridKa 31%, CCIN2P3 22% – a very even distribution among the T1s
 – T2s: 11% in total; good stability at Prague, Torino, NIHAM, Muenster, GSI and OSC
 – Some under-utilization of T2 resources: more centres were available but could not install the Grid software needed to use them fully

LHCC meeting – Feb’06 4 Methods of operation
Use the LCG/EGEE SC3 baseline services:
 – Workload management
 – Reliable file transfer (FTS)
 – LCG File Catalogue (LFC)
 – Storage (SRM), CASTOR2
Run entirely on LCG resources:
 – Use the framework of VO-boxes provided at the sites to host the ALICE-specific agents (a simplified sketch of the pull-agent pattern follows below)
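The pull model referred to above relies on lightweight agents running on the site VO-boxes. The sketch below is purely illustrative of that pattern – an agent asks a central task queue for work and hands it to the LCG workload management system – and is not the actual AliEn/ALICE code: the task-queue URL and the fetch_task and submit_to_lcg helpers are hypothetical, and the edg-job-submit call stands in for whichever submission client a site used at the time.

```python
# Illustrative only: a simplified VO-box "pull" agent. All service names and
# helpers here are hypothetical; the real ALICE agents live inside AliEn.
import subprocess
import time
import urllib.request

TASK_QUEUE_URL = "https://central-task-queue.example.org/next"  # placeholder


def fetch_task():
    """Ask the (hypothetical) central task queue for the next job description."""
    try:
        with urllib.request.urlopen(TASK_QUEUE_URL, timeout=30) as resp:
            jdl = resp.read().decode()
            return jdl or None
    except OSError:
        return None


def submit_to_lcg(jdl_text):
    """Hand the job to the LCG workload management system.
    edg-job-submit was the LCG-2-era client; treat the exact call as an assumption."""
    with open("job.jdl", "w") as handle:
        handle.write(jdl_text)
    return subprocess.call(["edg-job-submit", "job.jdl"]) == 0


def agent_loop():
    while True:
        task = fetch_task()
        if task is None:
            time.sleep(60)        # nothing queued centrally: idle
        elif not submit_to_lcg(task):
            time.sleep(300)       # submission failed: back off before retrying


if __name__ == "__main__":
    agent_loop()
```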

LHCC meeting – Feb’06 5 Status of production
Production job duration: 8½ hours on a 1 kSI2k CPU; output archive size: 1 GB (20 files per archive)
[Plot of running jobs; annotation: 2450 jobs]

LHCC meeting – Feb’06 6 Results
VO-box behaviour:
 – No problems with the services running; no interventions necessary
 – Load profile on the VO-boxes is on average proportional to the number of jobs running at the site – nothing special
[Load plots shown for CERN and GridKa]

LHCC meeting – Feb’06 7 ALICE VO-specific and LCG software
Positive:
 – Stable set of central services for production (user authentication, catalogue, single task queue, submission and retrieval of jobs)
 – Well-established (and simple) installation method for the ALICE-specific VO-box software
 – Good integration with the LCG VO-box stack
 – Demonstrated scalability and robustness of the VO-box model
 – Successful mass job submission through the LCG WMS
Issues:
 – Rapid updates of the middleware are problematic for including more computing centres on a stable basis, essentially due to the limited number of experts, who are currently focused on software development; however, all centres in the SC3 plan were kept up to date
 – Not all services thoroughly tested, in particular LFC and FTS

LHCC meeting – Feb’06 8

9 SC3 Operations
CMS central responsibilities:
 – Data transfers managed entirely through PhEDEx, driven by the central transfer-management database operated by PhEDEx operations
   Using the underlying grid protocols srmcp, globus-url-copy and FTS
   Placing files through SRM onto site storage based on Castor, dCache and DPM
 – CMS analysis jobs submitted by a job robot through the CMS CRAB system
   Using the LCG RB (gdrb06.cern.ch) and OSG Condor-G interfaces
 – Monitoring information centrally collected using MonALISA and the CMS Dashboard, fed from R-GMA, MonALISA and the site monitoring infrastructure
Site responsibilities (by CMS people at or “near” the site):
 – Ensuring that site mass storage and mass-storage interfaces are functional, grid interfaces are responding, and data-publishing steps are succeeding
   Data publishing and discovery: RefDB, PubDB, ValidationTools
   Site-local file catalogues: POOL XML, POOL MySQL (a parsing sketch of the POOL XML flavour follows below)
 – A lot of infrastructure tools are provided to the sites, but keeping the whole chain hanging together requires perseverance
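To make the site-local file catalogue item above concrete, here is a minimal sketch of resolving a logical file name to its physical replicas in a POOL XML catalogue. The XML layout is only an approximation of the POOL format of the time, and the GUID, LFN and PFN values are invented for illustration.

```python
# Minimal sketch: look up the PFN(s) for an LFN in a POOL XML site-local
# catalogue. The sample XML approximates the POOL layout of the era.
import xml.etree.ElementTree as ET

SAMPLE = """<POOLFILECATALOG>
  <File ID="A1B2C3D4-0000-0000-0000-000000000000">
    <physical>
      <pfn filetype="ROOT_All" name="rfio:/castor/cern.ch/cms/store/example.root"/>
    </physical>
    <logical>
      <lfn name="/store/example.root"/>
    </logical>
  </File>
</POOLFILECATALOG>"""


def lfn_to_pfns(catalogue_xml, wanted_lfn):
    """Return every physical file name registered for a logical file name."""
    root = ET.fromstring(catalogue_xml)
    pfns = []
    for entry in root.findall("File"):
        lfns = [l.get("name") for l in entry.findall("logical/lfn")]
        if wanted_lfn in lfns:
            pfns.extend(p.get("name") for p in entry.findall("physical/pfn"))
    return pfns


print(lfn_to_pfns(SAMPLE, "/store/example.root"))
# -> ['rfio:/castor/cern.ch/cms/store/example.root']
```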

LHCC meeting – Feb’06 10

LHCC meeting – Feb’06 11

LHCC meeting – Feb’06 12

LHCC meeting – Feb’06 13 Summary of Experiences
Months of intense debugging are beginning to bear fruit:
 – Promising results and an impressive effort by numerous sites, but...
 – Debugging and shaking out components overwhelmed the end-to-end goals
 – Many service inefficiencies became apparent during the challenge period
 – De-scoped to debugging the pieces that did not work as expected
Lessons learned and principal concerns:
 – Castor-2: innumerable problems – now hoped to run more smoothly
 – SRM: less standard than anticipated; lacking tuning at Castor/SRM sites
 – LFC: integration work was done for its use as the CMS/POOL file catalogue
 – DPM: RFIO incompatibilities make CMS applications fail to access files
 – FTS: integration ongoing; move to FTS 1.4
 – CMS data publishing: difficult to configure and very difficult to operate – looking forward to improvements with the new system
 – CMS software releases: improve the release/distribution process and validation

LHCC meeting – Feb’06 14

LHCC meeting – Feb’06 15 Phase 1
1. Distribute stripped data Tier0 → Tier1s (1 week, 1 TB)
 – Goal: demonstrate the basic tools
 – Precursor activity to eventual distributed analysis
2. Distribute data Tier0 → Tier1s (2 weeks, 8 TB)
 – The data are already accumulated at CERN and are moved to the Tier1 centres in parallel
 – Goal: demonstrate the automatic tools for data moving and bookkeeping, and achieve a reasonable performance of the transfer operations
3. Removal of replicas (via LFN) from all Tier1s
4. Tier1 centre(s) to Tier0 and to the other participating Tier1 centres
 – The data are already accumulated and are moved to the Tier1 centres in parallel
 – Goal: meet the transfer needs of the stripping process

LHCC meeting – Feb’06 16 Configuration of channel management
[Diagram: per-T1-site FTS server status and T1–T1 channel status]
 – IN2P3, CNAF, FZK, RAL and SARA each run an FTS server managing their incoming channels
 – PIC has no FTS server; its channels are managed by the source SE’s site
 – An FTS central service is used for managing the T1–T1 matrix
(The routing of a transfer to the managing FTS server is sketched below)
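A minimal sketch of the channel-management rule summarised above: pick the FTS server that manages a given T1–T1 channel (destination-managed, except for transfers into PIC, which are source-managed) and submit a single file pair to it. All endpoint URLs are placeholders, and the glite-transfer-submit invocation reflects the command-line client of the era as remembered, so treat the exact options as an assumption.

```python
# Illustrative routing of an LHCb T1-T1 transfer to the FTS server that
# manages the corresponding channel. Endpoint URLs are placeholders.
import subprocess

FTS_ENDPOINT = {  # one entry per T1 that runs an FTS server
    "IN2P3": "https://fts.in2p3.example/services/FileTransfer",
    "CNAF":  "https://fts.cnaf.example/services/FileTransfer",
    "FZK":   "https://fts.fzk.example/services/FileTransfer",
    "RAL":   "https://fts.ral.example/services/FileTransfer",
    "SARA":  "https://fts.sara.example/services/FileTransfer",
}


def managing_server(source_site, dest_site):
    """Pick the FTS server responsible for the source->destination channel."""
    if dest_site in FTS_ENDPOINT:
        return FTS_ENDPOINT[dest_site]   # destination manages its incoming channels
    return FTS_ENDPOINT[source_site]     # e.g. transfers into PIC: source-managed


def submit(source_site, dest_site, source_surl, dest_surl):
    service = managing_server(source_site, dest_site)
    # Single-file submission via the command-line client (assumed syntax).
    return subprocess.call(
        ["glite-transfer-submit", "-s", service, source_surl, dest_surl])


# Example: a CNAF -> PIC transfer is submitted to the CNAF (source) server.
submit("CNAF", "PIC",
       "srm://storm.cnaf.example/lhcb/data/file1",
       "srm://srm.pic.example/lhcb/data/file1")
```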

LHCC meeting – Feb’06 17 Overview of SC3 activity
When the service was stable, LHCb’s SC3 needs were surpassed

LHCC meeting – Feb’06 18 File removal
Four phases: RAL only; RAL, CNAF, GridKa, IN2P3, PIC; CNAF, GridKa, IN2P3, PIC; GridKa, IN2P3
[Plot: time per 100 files, in seconds]
50k replicas removed in ~28 hours (10k replicas at each site, each site served by its own agent)
Problems in the current middleware:
 – Removing remote physical files is slow; bulk removal operations are necessary (see the removal sketch below)
 – Different storage flavours show slightly different functionality: there is no generic way to remove data at all the sites
 – LFC file-catalogue (un)registration operations are slow; bulk operations within secure sessions are necessary
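The “time per 100 files” measurement above comes from exactly this kind of one-call-per-file removal loop. The sketch below uses the lcg-del client of the era; the options shown (-a to remove all replicas and the catalogue entry, --vo for the virtual organisation) are quoted from memory, and the LFNs are invented, so treat the details as assumptions rather than a verified recipe.

```python
# Rough sketch of a per-file replica-removal loop and its timing.
import subprocess
import time

LFNS = [f"lfn:/grid/lhcb/production/00001234/file_{i:05d}.dst" for i in range(100)]


def remove_all_replicas(lfn):
    """Delete every replica of one LFN and unregister it: one call per file,
    which is exactly the pattern that made bulk operations necessary."""
    return subprocess.call(["lcg-del", "--vo", "lhcb", "-a", lfn]) == 0


start = time.time()
failures = [lfn for lfn in LFNS if not remove_all_replicas(lfn)]
elapsed = time.time() - start
print(f"removed {len(LFNS) - len(failures)}/{len(LFNS)} files "
      f"in {elapsed:.0f} s ({elapsed / len(LFNS) * 100:.0f} s per 100 files)")
```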

LHCC meeting – Feb’06 19 Phase 1
1. Distribute stripped data Tier0 → Tier1s (1 week, 1 TB)
 – Succeeded, but not at the rate given in the metrics of success
2. Distribute data Tier0 → Tier1s (2 weeks, 8 TB)
 – Succeeded – (nearly) achieved the “acceptable” metric but not the “success” metric
 – Would like to repeat this as part of the SC3 re-run
3. Removal of replicas (via LFN) from all Tier1s
 – Failed to meet the 24-hour metric: inconsistent behaviour of SRM, bulk operations needed, …
4. Tier1 centre(s) to Tier0 and to other participating Tier1 centres
 – Failure: FTS did not support third-party transfers
 – A complicated T1–T1 matrix has been set up – beginning to test

LHCC meeting – Feb’06 20

LHCC meeting – Feb’06 21 ATLAS-SC3 Tier0
Quasi-RAW data generated at CERN, with reconstruction jobs run at CERN (no data transferred from the pit to the computer centre)
“Raw data” and the reconstructed ESD and AOD data are replicated to Tier1 sites using agents on the VO-boxes at each site (a sketch of the subscription-driven replication idea follows below)
Exercising the use of:
 – CERN infrastructure: Castor 2, LSF
 – LCG Grid middleware: FTS, LFC, VO-boxes
 – Experiment software: the Production System with the new Supervisor (Eowyn), the Tier0 Management System (TOM), the raw-data generator (Jerry), and the Distributed Data Management (DDM) software (DQ2)
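The replication agents mentioned above follow the DQ2 subscription idea: a site expresses interest in a dataset, and a service running on the site VO-box keeps checking for files the site is still missing and queues them for transfer. The sketch below only illustrates that control flow; the Subscription class and the central_missing_files and queue_fts_transfer helpers are hypothetical, not the real DQ2 interfaces.

```python
# Purely illustrative sketch of a subscription-driven site service.
import time
from dataclasses import dataclass


@dataclass
class Subscription:
    dataset: str        # e.g. a RAW or ESD dataset produced at the Tier-0
    destination: str    # Tier-1 storage this site should hold a copy on


def central_missing_files(sub: Subscription) -> list:
    """Ask the (hypothetical) central catalogues which files of the dataset
    are not yet replicated at the destination."""
    return []            # stub: nothing missing


def queue_fts_transfer(files: list, destination: str) -> None:
    """Hand the missing files to the transfer machinery (stub)."""
    print(f"would queue {len(files)} files for {destination}")


def site_service_loop(subscriptions, poll_seconds=300):
    while True:
        for sub in subscriptions:
            missing = central_missing_files(sub)
            if missing:
                queue_fts_transfer(missing, sub.destination)
        time.sleep(poll_seconds)
```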

LHCC meeting – Feb’06 22 Dataflow 2007
[Diagram: dataflow from the Event Filter (EF) through the Tier0 CPU farm and CASTOR to the Tier1s]

Stream  File size     Rate      Files/day    Throughput   Volume/day
RAW     1.6 GB/file   0.2 Hz    17K f/day    320 MB/s     27 TB/day
ESD     0.5 GB/file   0.2 Hz    17K f/day    100 MB/s     8 TB/day
AOD     10 MB/file    2 Hz      170K f/day   20 MB/s      1.6 TB/day
AODm    500 MB/file   0.04 Hz   3.4K f/day   20 MB/s      1.6 TB/day

Aggregate flows annotated on the diagram: 0.44 Hz / 37K f/day / 440 MB/s; 1 Hz / 85K f/day / 720 MB/s; 0.4 Hz / 190K f/day / 340 MB/s; 2.24 Hz / 170K f/day (temporary), 20K f/day (permanent) / 140 MB/s
SC3 is a 10% challenge of the 2007 rates (a quick consistency check of these numbers is sketched below)
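As a quick consistency check of the table above, the per-stream throughput follows from file size times file rate, and the daily volume from throughput times 86,400 seconds; the short calculation below reproduces the slide’s figures to within rounding.

```python
# Sanity check of the per-stream numbers: size (MB) x rate (files/s) gives
# MB/s, and MB/s x 86400 s gives the daily volume.
streams = {              # name: (file size in MB, files per second)
    "RAW":  (1600.0, 0.2),
    "ESD":  (500.0,  0.2),
    "AOD":  (10.0,   2.0),
    "AODm": (500.0,  0.04),
}

for name, (size_mb, rate_hz) in streams.items():
    throughput_mb_s = size_mb * rate_hz
    files_per_day = rate_hz * 86400
    tb_per_day = throughput_mb_s * 86400 / 1e6
    print(f"{name:4s} {throughput_mb_s:6.1f} MB/s  "
          f"{files_per_day / 1000:6.1f}K files/day  {tb_per_day:5.1f} TB/day")

# RAW comes out at 320 MB/s, ~17K files/day and ~27.6 TB/day, matching the
# 2007 figures of which SC3 targeted roughly 10%.
```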

LHCC meeting – Feb’06 23 Snapshot of activity
24-hour period, 1–2 December: achieved quite a good rate, sustaining >80 MB/s to the sites

LHCC meeting – Feb’06 24 Daily rates

LHCC meeting – Feb’06 25 …/10 of ATLAS Tier1s used – SC3 experience in the ‘production’ phase

LHCC meeting – Feb’06 26 General view of SC3
ATLAS software seems to work as required
 – Most problems were with the integration of “Grid” and “storage” middleware (SRM–dCache, SRM–Castor) at the sites
Met the throughput targets at various points, but not consistently sustained
Need to improve communication with the sites

LHCC meeting – Feb’06 27 General summary of SC3 experiences
Extremely useful for shaking down the sites, the experiment systems and WLCG
 – Many new components were used for the first time in anger
 – Need for additional functionality in services: FTS, LFC, SRM, …
Reliability seems to be the major issue
 – CASTOR2: still ironing out problems, but big improvements
 – Coordination issues
 – Problems with sites and networks: MSS, security, network, services, …
 – FTS: performs well for well-defined sites/channels after tuning; timeout problems when accessing data from MSS
 – SRM: limitations/ambiguity (already flagged) in the functionality