CMS-specific services and activities at CC-IN2P3
Farida Fassi, October 23rd
VOMS roles and fair share

T1 and T2 are deployed in the same computing centre:
– sharing the same computing farm and the same LRMS (BQS)
– while remaining able to manage the production of each grid site separately

T1 site policy (T1 job slots, based on the fair-share policy):
– VOMS role « lcgadmin »
– VOMS role « production »: reprocessing, skimming (~100%)
– VOMS role « t1access »

T2 site policy (T2 job slots, based on the fair-share policy):
– VOMS role « lcgadmin »
– VOMS role « production »: MC production, 50%
– VOMS role « priorityuser »: 25%
– Ordinary users: 25%

Mapping strategy applied on our CEs (see the sketch after this list):
– avoid account overlap between local sites
– split the grid accounts into 2 subsets and assign each subset to a CE
– the limited number of pool accounts restricts the number of real users (Pierre)
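For illustration only, a minimal Python sketch of the T2 mapping idea above; the account names and exact shares are hypothetical and do not reproduce the real CE/BQS configuration.

```python
# Illustrative sketch only: hypothetical account names and shares, not the
# actual CE/BQS configuration. It shows the idea of mapping a proxy's VOMS
# FQAN to a local account group and a fair-share target (T2 policy).
T2_SHARES = {
    "/cms/Role=production":   ("cmsprd",  0.50),  # MC production
    "/cms/Role=priorityuser": ("cmsprio", 0.25),  # priority users
    "/cms":                   ("cmsusr",  0.25),  # ordinary users
}

def resolve(fqan: str):
    """Return (account group, fair-share target) for a VOMS FQAN.

    Longest key first, so a role-qualified FQAN wins over the bare VO.
    """
    for key in sorted(T2_SHARES, key=len, reverse=True):
        if fqan.startswith(key):
            return T2_SHARES[key]
    raise ValueError(f"unmapped FQAN: {fqan}")

print(resolve("/cms/Role=production"))   # ('cmsprd', 0.5)
print(resolve("/cms"))                   # ('cmsusr', 0.25)
```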
Grid services for CMS

Tier-1 site:
– 1 FTS: version 2.1 deployed
– 1 SRMv2 SE: dcache-server-1.9.4-3
– 1 LCG CE: v3.1.29 (SL4/32-bit), CMS and LHCb
– 3 LCG CEs: v3.1.35 (SL5/64-bit), CMS and LHCb
– 2 VOboxes: cclcgcms04 (PhEDEx + SQUID), cclcgcms05 (PhEDEx + SQUID)

Tier-2 site:
– T1 and T2 separation process is in progress (more details later)
– SRM: same server as Tier-1, but a separate namespace
– VObox: cclcgcms06 (PhEDEx)
– CRAB server: in2p3
– 2 LCG CEs: v3.1.29 (SL4/32-bit), LHC VOs
– LCG CE 3.1.35 for SL5: still in progress
Grid services for CMS: FTS

Improvements 2008/2009:
– 1 FTS server on SL4/64-bit, load-balanced over 4 machines (see the sketch after this list)
– All channel agents distributed among the 4 machines, for all VOs:
  – all T1-IN2P3 channels
  – some T2/T3 IN2P3 channels, e.g. the Belgium and Beijing T2s, the IPNL T3
  – some STAR-T2/T3 channels
  – the T1_IN2P3-STAR channel, to fit CMS Data Management requirements (transfers from anywhere to anywhere, which makes problems more difficult to solve)
  – T2 channels created: IN2P3CCT2-IN2P3 and STAR-IN2P3CCT2
– DB backend on an Oracle cluster
– Stand-by virtual machine for service availability
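As a rough illustration of the load-balancing idea only (host names and the channel list are hypothetical, not the actual deployment), channel agents could be spread round-robin over the four machines:

```python
# Rough illustration only: round-robin distribution of FTS channel agents
# over the load-balanced hosts. Host names and the channel list are made up.
from itertools import cycle

hosts = ["ccfts01", "ccfts02", "ccfts03", "ccfts04"]
channels = ["CERN-IN2P3", "FZK-IN2P3", "STAR-IN2P3",
            "IN2P3CCT2-IN2P3", "STAR-IN2P3CCT2"]

assignment = {host: [] for host in hosts}
for channel, host in zip(channels, cycle(hosts)):
    assignment[host].append(channel)

for host, agents in assignment.items():
    print(f"{host}: {agents}")
```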
T2 Data Migration (1/2)

– Goal: improve data access at T2 for analysis (reduce the load on HPSS)
– Full separation of the T1 and T2 namespaces (Tibor):
  – moving from the old namespace /pnfs/in2p3.fr/data/cms/data (shared with T1) to the new one, /pnfs/in2p3.fr/data/cms/t2data
  – changed the corresponding mapping in the TFC storage.xml

Why is this migration a bit slow? (See the sketch after this list.)
– about 38k files, ~40 TB
– rate achieved: about 50-60 MB/s
– failures: timeouts, proxy expirations
– not a direct disk1-to-disk2 copy (it goes via a third disk); the reason is that pre-staging is optimized for analysis jobs, so files were prestaged to the "analysis-pool" and not to the "xfer-pool"
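A minimal sketch (the example file path is hypothetical) of the namespace change and of why ~40 TB at 50-60 MB/s takes on the order of a week:

```python
# Sketch only: rewrite a PFN from the old namespace (shared with T1) into the
# new T2 namespace, and estimate the wall-clock time of the copy at the
# observed rate. The example path below is hypothetical.
OLD_PREFIX = "/pnfs/in2p3.fr/data/cms/data"
NEW_PREFIX = "/pnfs/in2p3.fr/data/cms/t2data"

def to_t2_namespace(pfn: str) -> str:
    """Map an old-namespace PFN onto the new T2 namespace."""
    if not pfn.startswith(OLD_PREFIX):
        raise ValueError(f"not in the old namespace: {pfn}")
    return NEW_PREFIX + pfn[len(OLD_PREFIX):]

print(to_t2_namespace("/pnfs/in2p3.fr/data/cms/data/store/mc/sample.root"))

# ~40 TB copied at 50-60 MB/s is roughly a week of continuous transfers
volume_bytes = 40e12
rate_bytes_per_s = 55e6
print(f"~{volume_bytes / rate_bytes_per_s / 86400:.1f} days")  # ~8.4 days
```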
T2 Data Migration (2/2)

Status:
– finished on 22nd October

To be done (see the sketch after this list):
– still to be checked for failures
– run PhEDEx consistency checks
– check the data present at both T1 and T2:
  – keep the data associated with T1 in the T1 namespace
  – remove from the old namespace the data that are declared only at T2
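A sketch of the cleanup logic above as set operations; the file lists are hypothetical placeholders, while in practice they would come from the PhEDEx and dCache catalogues.

```python
# Sketch only: the cleanup decisions expressed as set operations.
# File lists are hypothetical; real ones would come from PhEDEx/dCache.
t1_replicas = {"/store/data/a.root", "/store/data/b.root"}
t2_replicas = {"/store/data/b.root", "/store/data/c.root"}

keep_in_t1_namespace = t1_replicas & t2_replicas       # data at both T1 and T2
remove_from_old_namespace = t2_replicas - t1_replicas  # declared only at T2

print(sorted(keep_in_t1_namespace))        # ['/store/data/b.root']
print(sorted(remove_from_old_namespace))   # ['/store/data/c.root']
```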
T2_FR_CCIN2P3 now

– dCache: the same for T1/T2
– Disk pools: the same for T1/T2
– Namespace: different, /pnfs/in2p3.fr/data/cms/t2data (T1: /pnfs/in2p3.fr/data/cms/data)
– Separate PhEDEx node for T2_FR_CCIN2P3:
  – disk only
  – separated from T1 (although with the same SRM endpoints)
  – SE ccsrmt2.in2p3.fr (alias to ccsrm.in2p3.fr)

Consequences:
– duplication of a fraction of the data on the same resources
– T1-T2 transfers internal to CCIN2P3
– hacks at different levels can be removed (CRAB: the T1 was not blacklisted)
– T2 jobs can no longer access data at T1 (different namespace)
Current developments and perspectives

SL5:
– more nodes available on SL5 for production
– >50% available; from 29th October CMS would run only on SL5

FTS:
– more effort on improving the software used for load balancing
– improve the monitoring tools: global view of the different components
– Nagios alarms on: Oracle connection, daemons, channels (see the sketch after this list)

CREAM:
– specific developments done: new interface for BQS/grid
– validation tests with ALICE (local contact) ongoing
– a production CE will be set up soon
– goal: a CREAM CE backed by BQS in production by mid-November
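A minimal sketch of a Nagios-style probe, assuming only the standard plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL); the daemon name is a hypothetical placeholder, and the actual Oracle-connection and channel checks are not shown.

```python
# Sketch only, following the standard Nagios plugin exit-code convention
# (0=OK, 1=WARNING, 2=CRITICAL). The daemon name below is a placeholder;
# the real Oracle-connection and channel probes are not shown here.
import subprocess
import sys

def check_daemon(name: str) -> int:
    """Return a Nagios-style status code for a named daemon process."""
    found = subprocess.run(["pgrep", "-f", name],
                           stdout=subprocess.DEVNULL).returncode == 0
    if found:
        print(f"OK - {name} is running")
        return 0
    print(f"CRITICAL - {name} is not running")
    return 2

if __name__ == "__main__":
    sys.exit(check_daemon("fts-channel-agent"))  # hypothetical process name
```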
TReqS: tape performance (1)

TReqS:
– an interface between dCache and HPSS
– it schedules file staging requests according to file location (which tape) and file position within the tape (see the sketch after this list)
– the standalone version was tested for the first time during STEP09
– it was integrated with all dCache pools in August (more detail in the dCache master's talk)
– max rate achieved: 600 MB/s

Prestaging via PhEDEx:
– for tape recalls we aim to use the PhEDEx stager agent together with TReqS
– prestaging centrally triggered via block/dataset transfer subscriptions, after approval by the site contact: better control of the activity
– investigating an optimal agent customization for interacting locally with the storage system via TReqS
– timescale: soon
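A minimal sketch of the scheduling idea, not the real TReqS code: requests are grouped by tape and each tape's files are read in order of their position, so every tape is mounted once and read sequentially. The tape/position metadata below is made up; TReqS obtains the real values from HPSS.

```python
# Minimal sketch of the scheduling idea, not the real TReqS implementation:
# group staging requests by tape, then serve each tape's files in order of
# their position, so each tape is mounted once and read sequentially.
from collections import defaultdict

# (file, tape, position on tape) triples; values here are made up, the real
# metadata comes from HPSS.
requests = [
    ("f1", "T00123", 57),
    ("f2", "T00045", 12),
    ("f3", "T00123", 3),
    ("f4", "T00045", 88),
]

per_tape = defaultdict(list)
for name, tape, position in requests:
    per_tape[tape].append((position, name))

for tape, files in per_tape.items():
    ordered = [name for _, name in sorted(files)]
    print(tape, "->", ordered)   # one mount per tape, files in tape order
```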
TReqS: tape performance (2)

TReqS in production:
– tape read/write rates from HPSS (plot: reading rate and writing rate)
– high tape performance since the TReqS integration
Primary responsibility of a Tier-1:
– re-reconstruction, skimming
– archival and served copy of RECO
– archival storage for simulation

Review of CCIN2P3 participation in the following during both STEP09 and the October exercise:
– re-processing
– pre-staging
– transfers
Re-processing: STEP09
– reprocessing ran smoothly
– CMS got more than 1.2k batch job slots, the expected number of slots
– good efficiency when processing from disk
Pre-staging: STEP test
– HPSS v6.2 interfaced to TReqS, used for the first time
– We met the CMS goals for the tape pre-stage tests (see the estimate after this list):
  – 52 MB/s required, ~110 MB/s achieved
  – 23 h required, 12.3 h on average achieved
– A very high load on HPSS (June 8-13) due to all CMS activities, in particular CMS analysis, and other VOs' activities
(Plot annotations: #35 drives in use, test begins, #83 drive waits)
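A back-of-envelope check, pure arithmetic on the numbers quoted above: the required 52 MB/s over 23 h and the achieved ~110 MB/s over 12.3 h imply roughly the same staged volume, so the target was met with about a factor-two margin in rate.

```python
# Back-of-envelope check on the numbers quoted above.
required_rate_mb_s, required_hours = 52, 23
achieved_rate_mb_s, achieved_hours = 110, 12.3

required_volume_tb = required_rate_mb_s * required_hours * 3600 / 1e6
achieved_volume_tb = achieved_rate_mb_s * achieved_hours * 3600 / 1e6

print(f"implied by the target:   ~{required_volume_tb:.1f} TB")  # ~4.3 TB
print(f"implied by the achieved: ~{achieved_volume_tb:.1f} TB")  # ~4.9 TB
```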
Moving data: STEP09
– STEP week 2, first 3 days: test synchronization of 50 TB of AOD data between all 7 T1 sites
– T1→T1: average transfer traffic is 966 MB/s
– T0→T1 (transfer of cosmics): transfers managed as part of cosmic data taking at the T0; no issues seen
Transfer tests (T1 ➞ T1)
– STEP week 1: low import transfer quality for us, due to:
  – an issue with adapting the local configuration file to the test conditions
  – an issue with the FTS timeout
  – a communication issue and the need for a better understanding of the test goals
Recent rich ongoing CMS activity: backfill, fake reprocessing, production (October exercise)
Pre-staging
– high tape reading rate on 9th October: staging a 50 TB sample for the reprocessing test performed by the Data Operations team (Guillelmo)
Reprocessing
– more than 60k terminated jobs on 9th October (plot: terminated jobs vs. submitted jobs)
Fair share
– plot: all CMS grid running jobs vs. all LHC grid running jobs, and their fraction; a fraction >25% is sustained over the whole LHC repartition
– plot: CMS grid running jobs vs. all CMS running jobs, and their fraction
– CMS could get its share of the resources in spite of the competition with the other VOs
T1 and T2 CMS jobs
– plot: CMS T1 production jobs, CMS grid jobs, and their fraction
– plot: CMS T2 jobs, CMS grid jobs, and their fraction
Dashboard statistics: October exercise
Job failure details (from 18th September to 16th October):
– BQS misconfiguration on 18th September, when the "verylong" queue was created
– scheduled downtime (28th September to 1st October): CEs were not closed
– ccsrm down twice, on 8th and 10th October; issue identified and fixed
– CE08 misconfiguration (12th October); issue identified and fixed
– BQS misconfigurations (12th and 14th-15th October); issues identified and fixed
~74% successful jobs. Downtime for an electrical intervention and a dCache upgrade.
Job activities: October exercise
– BQS misconfiguration when the priorityuser role was created; issue identified and fixed
– issues in accessing data in the old T2 namespace (the T1 namespace)
– many slow jobs on 12th October: issue with the WMS
– ccsrm was down twice: bug in the SRM 'Space Manager', fixed by deploying a new dCache version
A few concerns regarding:
– site availability
– site readiness
Some statistics on Savannah tickets
Site availability
– CEs were not closed during the dCache migration to Chimera; however, the site was in downtime during that period
– why did the SAM tests not take the scheduled downtime into account?
– the CMS critical-tests policy seems too restrictive: the CE/SE is blacklisted via FCR at the first test failure; could it wait for 2 test failures?
Site readiness: JobRobot failures
– issues in accessing data located in the T2 namespace
– the T1 and T2 separation process is still ongoing, in its final phase
– the CCIN2P3 T1 readiness status should not be impacted by these failures
– CRAB limitation in specifying the site name: issue fixed
– the T1 readiness status will be corrected
Savannah statistics
– 2 tickets still open on 21st October:
  – consistency-check issue with the script agent, fixed by upgrading to PhEDEx 3.2.9
  – commissioning of the 2 links between CCIN2P3 and 2 Tier-3s, for MC transfers to custodial storage, still going on
Plan: short term

Pre-staging:
– test of the PhEDEx stager agent
– internal multi-VO pre-stage test

Cleanup:
– data identification
– Data Operations help is needed