CMS-specific services and activities at CC-IN2P3, Farida Fassi, October 23rd

Presentation transcript:

CMS-specific services and activities at CC-IN2P3, Farida Fassi, October 23rd


3 VOMS roles and fair share
T1 and T2 are deployed within the same computing centre:
- they share the same computing farm and use the same LRMS (BQS)
- the production of each grid site can still be managed separately
T1 site policy (T1 job slots, based on the fair-share policy):
- VOMS role « lcgadmin »
- VOMS role « production »: reprocessing, skimming (~100%)
- VOMS role « t1access »
T2 site policy (T2 job slots, based on the fair-share policy):
- VOMS role « lcgadmin »
- VOMS role « production »: MC production, 50%
- VOMS role « priorityuser »: 25%
- ordinary users: 25%
Mapping strategy applied on our CEs:
- avoid account overlapping between the local sites
- split the grid accounts into 2 subsets and assign each subset to a CE
- the limited number of pool accounts restricts the number of real users (Pierre)
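As an illustration only, here is a minimal Python sketch of how such a role-based fair share could be expressed; the role names and share values are taken from the T2 policy above, while the function and the FQAN matching logic are hypothetical and not the actual BQS or CE configuration.

```python
# Minimal sketch, not the actual BQS/CE configuration: translate the VOMS role
# of an incoming grid job into the T2 fair-share bucket quoted on this slide
# (production 50%, priorityuser 25%, ordinary users 25%).

T2_SHARES = {
    "production": 0.50,    # MC production
    "priorityuser": 0.25,  # priority analysis users
    "ordinary": 0.25,      # everybody else
}

def fair_share_bucket(voms_fqan: str) -> str:
    """Map a VOMS FQAN such as '/cms/Role=production' to a share bucket."""
    if "Role=production" in voms_fqan:
        return "production"
    if "Role=priorityuser" in voms_fqan:
        return "priorityuser"
    return "ordinary"

if __name__ == "__main__":
    for fqan in ("/cms/Role=production", "/cms/Role=priorityuser", "/cms"):
        bucket = fair_share_bucket(fqan)
        print(f"{fqan:28} -> {bucket:12} share = {T2_SHARES[bucket]:.0%}")
```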

4 Grid services for CMS
Tier-1 site:
- 1 FTS: version 2.1 deployed
- 1 SRMv2 SE: dCache server
- LCG CEs: V (SL4/32 bits), for CMS and LHCb
- 3 LCG CEs: V (SL5/64 bits), for CMS and LHCb
- 2 VOboxes: cclcgcms04 (PhEDEx + Squid), cclcgcms05 (PhEDEx + Squid)
Tier-2 site:
- the T1/T2 separation process is in progress (more details later)
- SRM: same server as the Tier-1, but a separate namespace
- VObox: cclcgcms06 (PhEDEx)
- CRAB server: in2p3
- 2 LCG CEs: V (SL4/32 bits), for the LHC VOs
- an LCG CE for SL5 is still being set up

5 Grid services for CMS: FTS
Improvements 2008/:
- FTS server on SL4/64 bits, load balanced over 4 machines
- all channel agents distributed among the 4 machines, for all VOs:
  - all T1-IN2P3 channels
  - some T2/T3 IN2P3 channels, e.g. the Belgium and Beijing T2s, the IPNL T3
  - some STAR-T2/T3 channels
  - a T1_IN2P3-STAR channel, to fit the CMS data management requirements (transfers from anywhere to anywhere, which makes problems more difficult to solve)
  - T2 channels created: IN2P3CCT2-IN2P3 and STAR-IN2P3CCT2
- DB backend on an Oracle cluster
- standby virtual machine for service availability
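A rough sketch of the load-balancing idea behind this slide, assuming a simple round-robin placement of channel agents on the four machines; the host names and most channel names below are illustrative placeholders, not the real CC-IN2P3 deployment.

```python
# Hedged sketch: spread FTS channel agents evenly over the four load-balanced
# machines mentioned above. Host names and most channel names are hypothetical.

from itertools import cycle

HOSTS = ["ccfts01", "ccfts02", "ccfts03", "ccfts04"]   # placeholder node names
CHANNELS = [
    "CERN-IN2P3", "FNAL-IN2P3", "PIC-IN2P3",           # examples of T1-IN2P3 channels
    "IN2P3CCT2-IN2P3", "STAR-IN2P3CCT2",               # T2 channels from the slide
    "T1_IN2P3-STAR",                                   # the anywhere-to-anywhere channel
]

def assign_agents(hosts, channels):
    """Round-robin assignment of one channel agent per channel to the hosts."""
    assignment = {host: [] for host in hosts}
    for host, channel in zip(cycle(hosts), channels):
        assignment[host].append(channel)
    return assignment

if __name__ == "__main__":
    for host, agents in assign_agents(HOSTS, CHANNELS).items():
        print(f"{host}: {', '.join(agents)}")
```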

6 T2 Data Migration (1/2)
Goal: improve data access at the T2 for analysis (reduce the load on HPSS).
Full separation of the T1 and T2 namespaces (Tibor):
- moving from the old namespace /pnfs/in2p3.fr/data/cms/data (common with the T1) to the new one, /pnfs/in2p3.fr/data/cms/t2data
- the corresponding mapping was changed in the TFC (storage.xml)
Why is this migration a bit slow?
- about 38k files, ~40 TB
- rate achieved: about 50-60 MB/s
- failures: timeouts, proxy expirations
- not a direct disk-to-disk copy (it goes via a third disk)
- the reason: pre-staging is optimised for analysis jobs, so files are prestaged to the "analysis" pool, not to the "xfer" pool
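To make the mechanics concrete, here is a small sketch of the path remapping behind this migration, together with the back-of-envelope arithmetic explaining why it is slow; the function is illustrative and is not the actual TFC rule or migration script.

```python
# Sketch only: rewrite a PFN from the old namespace shared with the T1 to the
# new T2-only namespace, as the updated TFC (storage.xml) mapping does.

OLD_PREFIX = "/pnfs/in2p3.fr/data/cms/data"
NEW_PREFIX = "/pnfs/in2p3.fr/data/cms/t2data"

def to_t2_namespace(pfn: str) -> str:
    """Map an old-namespace PFN to the new T2 namespace (others unchanged)."""
    if pfn.startswith(OLD_PREFIX):
        return NEW_PREFIX + pfn[len(OLD_PREFIX):]
    return pfn

if __name__ == "__main__":
    print(to_t2_namespace("/pnfs/in2p3.fr/data/cms/data/store/file.root"))
    # Back-of-envelope from the slide's own numbers: ~40 TB at 50-60 MB/s
    seconds = 40e12 / 55e6
    print(f"~{seconds / 86400:.1f} days of continuous copying")  # roughly 8 days
```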

7 T2 Data Migration (2/2)
Status:
- finished on 22nd October
To be done:
- check for failures
- run the PhEDEx consistency checks
- check the data that are present in both the T1 and T2 namespaces at the same time:
  - keep the data associated to the T1 in the T1 namespace
  - remove from the old namespace the data that are declared only at the T2
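The last two points are essentially set logic; the sketch below is only an illustration, with hypothetical file sets standing in for the output of the consistency checks.

```python
# Illustrative cleanup rule for the old namespace: keep what belongs to the T1,
# remove what is declared only at the T2. Inputs are hypothetical file sets.

def cleanup_plan(t1_files: set, t2_files: set, old_namespace_files: set):
    """Return (files to keep for the T1, files to remove from the old namespace)."""
    keep_for_t1 = old_namespace_files & t1_files
    remove = (old_namespace_files & t2_files) - t1_files
    return keep_for_t1, remove

if __name__ == "__main__":
    t1 = {"/store/a.root", "/store/b.root"}
    t2 = {"/store/b.root", "/store/c.root"}
    old_ns = {"/store/a.root", "/store/b.root", "/store/c.root"}
    keep, remove = cleanup_plan(t1, t2, old_ns)
    print("keep:", sorted(keep))
    print("remove:", sorted(remove))
```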

8 T2_FR_CCIN2P3 now
- dCache: the same instance for T1 and T2
- disk pools: the same for T1 and T2
- namespace: different, /pnfs/in2p3.fr/data/cms/t2data for the T2 (/pnfs/in2p3.fr/data/cms/data for the T1)
Separate PhEDEx node for T2_FR_CCIN2P3:
- disk only
- separated from the T1 (although the same SRM endpoint)
- SE: ccsrmt2.in2p3.fr (an alias to ccsrm.in2p3.fr)
Consequences:
- duplication of a fraction of the data on the same resources
- T1 to T2 intra-CCIN2P3 transfers
- hacks at different levels are removed (in CRAB, the T1 wasn't blacklisted)
- T2 jobs cannot access data at the T1 any more (different namespace)

9 Current development and perspective
SL5:
- more nodes available on SL5 for production
- >50% available; from 29th October CMS would run only on SL5
FTS:
- more effort on improving the software used for load balancing
- improve the monitoring tools: a global view of the different components
- Nagios alarms on: the Oracle connection, the daemons, the channels
CREAM:
- specific developments done: a new interface for BQS/grid
- validation tests with ALICE (local contact) are ongoing
- a production CE will be set up soon
- the goal is to have a CREAM CE backed by BQS in production by mid-November
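For the Nagios alarms, a probe along the following lines could be used; this is only a sketch assuming the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL), with the actual Oracle, daemon and channel checks stubbed out.

```python
# Sketch of a Nagios-style probe for the FTS service, using the standard
# plugin exit codes. The check functions are stubs; a real probe would query
# the Oracle backend, the agent daemons and the channel states.

import sys

OK, WARNING, CRITICAL = 0, 1, 2

def oracle_connection_ok() -> bool:
    return True   # stub: would try to open a connection to the FTS DB backend

def stopped_channel_agents() -> int:
    return 0      # stub: would count channel agent daemons that are not running

def main() -> int:
    if not oracle_connection_ok():
        print("CRITICAL: cannot reach the Oracle backend")
        return CRITICAL
    stopped = stopped_channel_agents()
    if stopped > 0:
        print(f"WARNING: {stopped} channel agent(s) not running")
        return WARNING
    print("OK: Oracle connection and channel agents healthy")
    return OK

if __name__ == "__main__":
    sys.exit(main())
```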

10 TReqs: tape performance (1)
TReqs:
- an interface between dCache and HPSS
- it schedules the file staging requests according to the file location (which tape) and the file position within the tape
- the standalone version was tested for the first time during STEP09
- it was integrated with all dCache pools in August (more details in the dCache master's talk)
- maximum rate achieved: 600 MB/s
Pre-staging via PhEDEx:
- for tape recalls we aim to use the PhEDEx stager agent with TReqs
- pre-staging is centrally triggered via block/dataset transfer subscriptions, after approval by the site contact, for better control of the activity
- investigating an optimal customisation of the agent for interacting locally with the storage system via TReqs
- timescale: soon
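The scheduling idea can be summarised in a few lines: group the pending stage requests by tape, then serve each tape's files in the order of their position on the tape, so a mounted tape is read in a single pass. The sketch below only illustrates that ordering; the real TReqs talks to HPSS and dCache, and the request format shown here is hypothetical.

```python
# Minimal illustration of TReqs-style ordering: group stage requests by tape
# and sort each group by the file's position on the tape.

from collections import defaultdict

def order_stage_requests(requests):
    """requests: iterable of (filename, tape_id, position_on_tape) tuples."""
    by_tape = defaultdict(list)
    for name, tape, position in requests:
        by_tape[tape].append((position, name))
    return {tape: [name for _, name in sorted(files)]
            for tape, files in by_tape.items()}

if __name__ == "__main__":
    pending = [
        ("f1.root", "TAPE042", 310),
        ("f2.root", "TAPE007", 12),
        ("f3.root", "TAPE042", 15),
        ("f4.root", "TAPE007", 250),
    ]
    for tape, files in order_stage_requests(pending).items():
        print(tape, "->", files)
```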

11 TReqs: tape performance (2)
TReqs in production:
- tape read/write rate from HPSS
- high tape performance since the TReqs integration
[plot: HPSS reading rate and writing rate over time]


13 Primary responsibility of a Tier-1:
- re-reconstruction, skimming
- archival and served copy of RECO
- archival storage for simulation
Review of the CC-IN2P3 participation in the following, during both STEP09 and the October exercise:
- re-processing
- pre-staging
- transfer

14 Re-processing: STEP09
- reprocessing ran smoothly
- CMS got more than 1.2k job batch slots, the expected number of slots
- good efficiency when processing from disk

15 Pre-staging: STEP09 test
- HPSS v6.2 interfaced to TReqs was used for the first time
- we met the CMS goals for the tape pre-stage tests:
  - 52 MB/s required, ~110 MB/s achieved
  - 23 h required, 12.3 h achieved on average
- a very high load on HPSS (June 8-13th) due to all the CMS activities, in particular the CMS analysis, plus the other VOs' activities
[plot annotations: test begins; 35 drives in use; 83 drives waiting]
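As a sanity check on the numbers above (a sketch, not a measurement script): the 52 MB/s target over a 23 h window implies a pre-stage sample of roughly 4.3 TB, which at the ~110 MB/s actually achieved would take about 11 h, in the same ballpark as the 12.3 h reported.

```python
# Back-of-envelope arithmetic for the pre-stage goal quoted on this slide.

HOUR = 3600

required_rate = 52e6            # bytes/s, from the slide
time_window = 23 * HOUR         # seconds
implied_volume = required_rate * time_window
print(f"implied sample size: {implied_volume / 1e12:.1f} TB")                   # ~4.3 TB

achieved_rate = 110e6           # bytes/s, from the slide
print(f"time at achieved rate: {implied_volume / achieved_rate / HOUR:.1f} h")  # ~10.9 h
```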

16 Moving data: STEP09
STEP week 2, first 3 days: test of the synchronisation of 50 TB of AOD data between all 7 T1 sites.
- T1 to T1: average transfer traffic of 966 MB/s
- T0 to T1 (transfer of cosmics): transfers managed as part of cosmic data taking at the T0; no issues seen

17 Transfer tests (T1 ➞ T1)
STEP week 1: low import transfer quality for us, due to:
- an issue with adapting the local configuration file to the test conditions
- an issue with the FTS timeout
- a communication issue; a better understanding of the test goals was needed

18 Recent rich ongoing CMS activity
- backfill, fake reprocessing
- production
- October exercise

19 Pre-staging
- high tape reading rate on 9th October: staging of a 50 TB sample for the reprocessing test performed by the data operations team (Guillelmo)

20 Reprocessing
- more than 60k terminated jobs on 9th October
[plot: terminated jobs and submitted jobs]

21 Fairshare
[plots: all CMS grid running jobs vs all LHC grid running jobs, and their fraction; CMS grid running jobs vs all CMS running jobs, and their fraction]
- a fraction of >25% is sustained over the whole LHC repartition
- CMS could get its share of the resources in spite of the competition with the other VOs

22 T1 and T2 CMS jobs
[plots: CMS T1 production jobs, CMS grid jobs and their fraction; CMS T2 jobs, CMS grid jobs and their fraction]

23 Dashboard statistics: October exercise
Job failure details (from 18th September to 16th October):
- BQS misconfiguration on 18th September, when the "verylong" queue was created
- scheduled downtime (28th September to 1st October): the CEs were not closed
- ccsrm down twice, on 8th and 10th October; issue identified and fixed
- CE08 misconfiguration (12th October); issue identified and fixed
- BQS misconfigurations (12th and 14th-15th October); issues identified and fixed
~74% successful jobs.
Downtime: electrical intervention, dCache upgrade.

24 Job activities: October exercise
- BQS misconfiguration when the "priorityuser" role was created; issue identified and fixed
- issues in accessing data in the old T2 namespace (the T1 namespace)
- many slow jobs on 12th October: an issue with the WMS
- ccsrm was down twice: a bug in the SRM SpaceManager, fixed by deploying a new dCache version

25 A few concerns regarding:
- site availability
- site readiness
Some statistics on Savannah tickets.

26 Site availability
- the CEs were not closed during the dCache migration to Chimera; however, the site was in a scheduled downtime during this period. Why did the SAM tests not take the scheduled downtime into account?
- the CMS critical tests policy seems too restrictive: a CE/SE is blacklisted with FCR at the first test failure. Would it be possible to wait for 2 test failures?
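The suggested relaxation of the policy could look like the following sketch (an assumption about how it might work, not the actual FCR implementation): blacklist only after two consecutive critical-test failures, and ignore tests run during a scheduled downtime.

```python
# Sketch of the proposed blacklisting rule: require two consecutive failures
# and skip results collected while the site was in a scheduled downtime.

def should_blacklist(results, in_downtime, n_failures=2):
    """results: latest-first list of booleans (True = test passed);
    in_downtime: matching list, True if the site was in scheduled downtime."""
    consecutive = 0
    for passed, downtime in zip(results, in_downtime):
        if downtime:          # ignore tests run during a scheduled downtime
            continue
        if passed:
            break
        consecutive += 1
    return consecutive >= n_failures

if __name__ == "__main__":
    # one failure during a downtime plus one real failure: not yet blacklisted
    print(should_blacklist([False, False, True], [True, False, False]))   # False
```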

27 Site readiness: JobRobot failures
- issues in accessing data located in the T2 namespace
- the T1/T2 separation process is still going on (final phase)
- the CC-IN2P3 T1 readiness status should not be impacted by these failures
- a CRAB limitation in specifying the site name: issue fixed
- the T1 readiness status will be corrected

28 Savannah statistics
2 tickets still open on 21st October:
- consistency check: an issue with the script agent, fixed by upgrading PhEDEx
- commissioning of the 2 links between CC-IN2P3 and 2 Tier-3s, for MC transfers to custodial storage: still going on

29 Plan: short term
Pre-staging:
- test of the stager agent with PhEDEx
- internal multi-VO pre-stage test
Cleanup:
- data identification
- help from the data operations team is needed