Grid services for CMS at CC-IN2P3


1 Grid services for CMS at CC-IN2P3
David Bouvet, with Pierre-Emmanuel Brinette, Pierre Girard, Rolf Rumler
CMS visit – 30/11/2007

2 Content
- Deployment status
- Tier-1 consolidation + tier-2 creation
- Grid site infrastructure
- Tier-2 site
- Grid services for CMS
- Major concerns: global, operational, technical

3 Deployment status
Official versions: gLite 3.0 Update 36 (SL3) and gLite 3.1 Update 06 (SL4_32)
Tier-1 site:
- 1 load-balanced top BDII (2 SL4 machines), for local use only until we validate the official SL4 release
- 2 FTS: v2.0 and v1.5 (to be decommissioned)
- 1 central LFC (Biomed) + 2 local LFCs
- 1 VOMS server (Biomed, Auvergrid, Embrace, EGEODE, and local/regional VOs)
- 1 regional MonBox
- 1 site BDII
- 2 SRM SEs + 1 test SRMv2 SE
- 3 LCG CEs (3.0.5), configured for the SL4/32-bit farm
- partially updated UI/WN: SL3, and gLite 3.1 on SL4/32
- RLS/RMC and classic SEs phased out (June and September)
Tier-2 site:
- LCG CE for SL3, VOs ATLAS and CMS
- LCG CE for SL4, all VOs
- both publish the 2 SRM SEs of the tier-1
PPS site: test bed for local adaptations (BQS jobmanager, information provider)
- 2 LCG CEs with the latest production release, used for the local test VO (vo.rocfr.in2p3.fr)
- FTS v2.0 and LFC for LHC VOs

4 Tier-1 consolidation + tier-2 creation
Site changes this year:
- all CEs on the T1 (IN2P3-CC) configured for the SL4/32-bit farm
- a new tier-2 site (IN2P3-CC-T2) dedicated to analysis facilities; just one CE on the T2 is still configured for the SL3 farm (it will disappear)
- 2 CEs (1 per LHC VO)
- 1 FTS (channel distribution)
- 1 LFC (one for replication of the LHCb central LFC)
- 1 dCache SE (managed by the storage group)
- 2 "classic" SEs
- LDAP server (Auvergrid)
- RLS/RMC (Biomed)
- commitment to provide a load-balanced top BDII (SL4) for France
- machine upgrades to V20Z or better, as SL4 versions of the middleware components become available
Spare machines:
- 1 spare CE (in case of hardware problems)
- pre-installed VMs for top BDII / site BDII / VOMS / LFC / FTS (for updates or in case of hardware problems), also used during the major power outages needed for power-supply upgrades

5 Grid site infrastructure
[Architecture diagram. Global services: top BDII, VOMS (7 VOs), MonBox (7 sites), central LFC (Biomed). Regional/local services: local LFC and FTS for the 4 LHC VOs, site BDII, one VO box per LHC VO. Computing: computing elements in front of BQS and the Anastasie worker-node farm. Storage: SRM storage elements in front of dCache and HPSS.]

8 Tier-2 site
T1 and T2 are deployed in the same computing centre:
- sharing the same computing farm and the same LRMS
- while remaining able to manage each grid site's production separately
T1 site policy:
- T1 job slots = (CMS job slots × #CPU_T1) / (#CPU_T1 + #CPU_T2)
- VOMS role "lcgadmin", VOMS role "production", regular users
T2 site policy:
- T2 job slots = (CMS job slots × #CPU_T2) / (#CPU_T1 + #CPU_T2)
Mapping strategy revisited on our CEs (see the sketch below):
- by prohibiting account overlap between the local sites
- by splitting the grid accounts into 2 subsets
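The slot-sharing rule above is simple proportional arithmetic; here is a minimal sketch of it in Python, with made-up CPU counts (the real farm numbers are not in this talk):

    def slot_share(vo_slots, cpu_t1, cpu_t2):
        """Split a VO's job slots between T1 and T2 in proportion to CPU counts."""
        total = cpu_t1 + cpu_t2
        t1 = vo_slots * cpu_t1 // total
        return t1, vo_slots - t1  # any rounding remainder goes to the T2

    # Illustrative numbers: 400 CMS slots over a 600-CPU T1 share and a 200-CPU T2 share.
    print(slot_share(400, 600, 200))  # -> (300, 100)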

9 Tier-2 site: T1 only
[Diagram. With the T1 alone, CE01 publishes through the T1 site BDII. CMS mapping policy: role "production" → cmsgrid, role "lcgadmin" → cms050, all others → the cms[ ] pool accounts. CMS site policy: AFS read-write access, BQS priorities, max job slots.]

10 Tier-2 site: T1 + T2
[Diagram. CE01 and CE02 publish through the T1 site BDII; CE03, CE04 and CE05 publish through the T2 site BDII. Each site enforces its own policy, and the mapping policy keeps the account subsets disjoint: on the T1, role "production" → cmsgrid, role "lcgadmin" → cms050, all others → one cms[ ] subset; on the T2, cms049 and a separate cms[ ] subset.] A sketch of this split follows.
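A sketch of what the split mapping amounts to: fixed accounts for the VOMS roles and disjoint pool subsets per site, so T1 and T2 can never map two users to the same account. All account names and ranges here are illustrative, not the real configuration:

    # Hypothetical disjoint pool subsets for the two sites.
    T1_POOL = ["cms%03d" % i for i in range(1, 25)]
    T2_POOL = ["cms%03d" % i for i in range(25, 49)]

    def map_account(site, voms_role):
        """Pick the local account for a grid user at the given site."""
        if voms_role == "production":
            return "cmsgrid"                            # shared production account
        if voms_role == "lcgadmin":
            return "cms050" if site == "T1" else "cms049"
        pool = T1_POOL if site == "T1" else T2_POOL
        return pool[0]   # a real CE would lease a free account (see slide 16)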

11 Grid services for CMS
Tier-1 site:
- 2 FTS: v2.0 and v1.5 (to be decommissioned)
- 1 SRM SE + 1 test SRMv2 SE
- 1 LCG CE: configured for SL4/32-bit, VO CMS only, partially updated
- 2 VO boxes: cclcgcms (PhEDEx + SQUID), cclcgcms02 (will be used for SQUID)
Tier-2 site:
- LCG CE for SL3, VOs ATLAS and CMS
- LCG CE for SL4, all VOs
- both publish the SRM SE of the tier-1

12 Grid services for CMS: FTS
2 FTS servers: v2.0 and v1.5
- v1.5 will be decommissioned once CMS-France gives the green light
- v2.0: a single node hosts all agents (VO + channel) for all VOs; DB backend on the Oracle cluster
- T1-IN2P3 channels, plus some T2/T3-IN2P3 channels (e.g. the Belgian and Beijing T2s, the IPNL T3)
- IN2P3-STAR channels to meet the CMS data-management requirement of transfers from anywhere to anywhere (which makes problems harder to pin down); see the channel-matching sketch below
- one node will be added for channel distribution
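How a transfer finds its channel, sketched under the usual FTS matching rule: a dedicated point-to-point channel wins, otherwise a STAR (catch-all) channel such as IN2P3-STAR carries the "anywhere" endpoint. The channel table below is an example, not our actual configuration:

    DEDICATED = {("BEIJING", "IN2P3"), ("IPNL", "IN2P3"), ("IN2P3", "CERN")}

    def match_channel(src, dst):
        """Pick the FTS channel that would carry a src -> dst transfer."""
        if (src, dst) in DEDICATED:
            return "%s-%s" % (src, dst)   # dedicated, individually tunable channel
        if dst == "IN2P3":
            return "STAR-IN2P3"           # catch-all inbound channel
        if src == "IN2P3":
            return "IN2P3-STAR"           # catch-all outbound channel
        raise LookupError("no channel at this site for that pair")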

13 Grid services for CMS: CE
Tier-1 site: 1 LCG CE, configured for SL4/32-bit, VO CMS only
Tier-2 site: LCG CE for SL3 (VOs ATLAS, CMS); LCG CE for SL4 (all VOs)
Major concerns regarding CMS:
- we strongly encourage the use of requirements in job submission instead of hard-coded hostnames or IP addresses: less impact on CMS when we change a node (see the JDL sketch below)
- the CMS critical-test policy seems too restrictive: a CE/SE is blacklisted through FCR at the first test failure; could blacklisting wait for 2 failed tests?
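What "requirements instead of hostnames" means in practice: a JDL Requirements expression that matches any published IN2P3-CC CE through the information system, so a node swap does not break submission. The fragment below is a hedged sketch (the pattern and job name are illustrative), written out from Python only to stay in one language:

    # Illustrative JDL: match any CE whose GlueCEUniqueID is at in2p3.fr,
    # instead of pinning one hostname that may be retired.
    jdl = r'''
    Executable   = "myjob.sh";
    Requirements = RegExp(".*in2p3\\.fr.*", other.GlueCEUniqueID);
    '''
    with open("myjob.jdl", "w") as f:
        f.write(jdl)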

14 Grid services for CMS: CE deployment plans (1)
Now:
- fault tolerance as seen by a VO is a problem
- on-the-fly updates are too difficult, whether by using the spare CE or by temporarily migrating a VO to other CEs, because VOs address CEs by hostname and the configurations are not uniform
[Diagram: four computing elements in front of BQS and the Anastasie worker-node farm.]

15 Grid services for CMS: CE deployment plans (2)
From… to?
[Diagram: from today's set of individually addressed CEs in front of BQS and the Anastasie farm, to an interchangeable pool of CEs in front of the same farm.]

16 Grid services for CMS: CE deployment plans (3)
Identified problems:
- load balancing depends on the VOs' CE-selection strategy, with a risk of overloading a particular CE; one solution, based on the assumption that VOs use the information system: via the IS, show a different cluster per CE (logical split of the BQS cluster)
- account mapping: different users must not share the same account (the account pool is split between tier-1 and tier-2), and the same user should not be mapped to different accounts
Solutions (see the gridmapdir sketch below):
- share the gridmapdir between CEs: use GPFS?
- use a centralized LCAS/LCMAPS database
Future? VO-oriented load balancing. [Diagram: several computing elements sharing one mapping.]
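The shared-gridmapdir idea rests on how gridmapdir leasing works: each pool account is a file, and mapping a user hard-links their URL-encoded DN to a free account file, so every CE that mounts the same directory (e.g. over GPFS) sees the same lease. A sketch, with an illustrative path:

    import os
    from urllib.parse import quote

    GRIDMAPDIR = "/shared/gridmapdir"   # assumption: one directory visible from every CE

    def lease_account(dn, pool_prefix="cms"):
        """Map a DN to a pool account, gridmapdir-style; reuse an existing lease."""
        key = os.path.join(GRIDMAPDIR, quote(dn.lower(), safe=""))
        if os.path.exists(key):
            ino = os.stat(key).st_ino    # the lease shares an inode with its account file
            for name in os.listdir(GRIDMAPDIR):
                if name.startswith(pool_prefix) and \
                   os.stat(os.path.join(GRIDMAPDIR, name)).st_ino == ino:
                    return name
        for name in sorted(os.listdir(GRIDMAPDIR)):
            path = os.path.join(GRIDMAPDIR, name)
            if name.startswith(pool_prefix) and os.stat(path).st_nlink == 1:
                os.link(path, key)       # hard link = atomic claim of a free account
                return name
        raise RuntimeError("account pool exhausted")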

17 And in addition… (1)
Job priorities by VO:
- temporary solution by modifying jobs already in the queue
- to be implemented by a change to BQS (development ongoing)
- we need to better understand the VOMS organization of the VO
Job traceability (see the sketch below):
- increase the visibility of the grid jobs submitted to the centre
- information needed by operations: the grid job identifier and the submitter's mail address
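A sketch of what a worker-node wrapper could log to give operations that traceability, tying the grid identifier to the local batch job. EDG_WL_JOBID is the identifier carried by gLite WMS jobs; the BQS variable name is an assumption:

    import os

    def trace_record():
        """Collect what operations needs to tie a BQS job to its grid origin."""
        return {
            "grid_job_id":    os.environ.get("EDG_WL_JOBID", "unknown"),  # gLite WMS job id
            "local_job_id":   os.environ.get("QSUB_REQNAME", "unknown"),  # hypothetical BQS variable
            "mapped_account": os.environ.get("USER", "unknown"),
        }

    print(trace_record())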

18 And in addition… (2)
CE gLite / CREAM:
- specific developments ongoing (Sylvain Reynaud)
- deployment on our PPS?
SL4:
- more and more nodes available on SL4 for production: LCG-CE, gLite-BDII, UI, WN (installation planned on the PPS ASAP)
- still awaiting an official SL4/64-bit WN/UI release: the current SL4/32-bit build causes low-memory problems

19 Major concerns: global
Our first major concerns:
- many service nodes, of many different types
- few people to administer them, and little time before production starts
- a level of production quality to be maintained
As far as we can, we try to reuse:
- our practical abilities
- the existing infrastructure when possible (but we add nodes if that eases operations)
- the experience acquired in operational procedures
- the monitoring and administration tools already set up, with adaptation
…and to avoid introducing too many VO-specific setups.

20 Major concerns: global
Improving grid communication will be the challenge:
- multiple information sources: LCG, EGEE, VOs, regional and internal site communication
- too much information and knowledge still arrives by mail or through a mass of meetings, although a lot of progress has already been made
At CC, improving VO-site communication:
- one support contact is appointed per LHC VO; they speak the VO's language with the site and the site's language with the VO, which has proved a good way to improve communication
- for CMS matters, the grid site administrators systematically go through Nelli Pukhaeva, who knows the best CC interlocutor for any CMS request
- a CMS-specific support mailing list

21 Major concerns: operational
We set up operational procedures to suppress, or at least reduce, grid-service outages. For example, a CE update can be performed without any outage:
- set up a new CE and validate that it works, out of production
- close the old CE and replace it with the new one
- retire the old CE once its jobs have ended
But "bad" VO usage can interfere with that. For example, job submissions that explicitly refer to a CE by hostname:
- time-consuming, because the supported VOs must be informed before any action on the CE
- job submission will fail while the CE is out of production
- the grid middleware theoretically allows these operations; the problem is that the M/W gives no way to express a request like "please submit my job to any allowed CE of site IN2P3-CC"
Intensive access during jobs to a data file in the VO_CMS_SW_DIR directory, which lives on AFS, puts a heavy load on AFS:
- the solution we proposed to Peter Elmer: copy the file into the job's SCRATCH directory (see the sketch below)
- we hope this will be solved soon
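A minimal sketch of the proposed fix, assuming the hot file's name and the scratch variable (both hypothetical here): copy it off AFS once at job start, then read only the local copy:

    import os, shutil

    src = os.path.join(os.environ["VO_CMS_SW_DIR"], "conditions.db")  # hypothetical file name
    scratch = os.environ.get("TMPDIR", "/scratch")                    # assumed job scratch area
    local = shutil.copy(src, scratch)
    # ...the job now opens `local` instead of hammering AFS...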

22 Major concerns: technical
Dealing with VOMS information:
- if I were a VO, I would be very enthusiastic about the new possibilities offered by VOMS
- but as a site… I need to know what behaviour is expected behind a VOMS role or group, I must find a technical solution to translate it into site policy, and I may have to adapt the interface between the grid front end and the local services to implement that behaviour (CE jobmanager ↔ BQS, CE information provider ↔ BQS); see the sketch below
- this kind of solution raises scalability problems for big sites
- ASAP we must identify together the new requirements that VOMS will introduce
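What translating a VOMS role or group into site policy could look like, sketched with an invented policy table (the accounts and priorities are illustrative, not our BQS settings):

    # Hypothetical table from (VO, role) to local BQS policy.
    POLICY = {
        ("cms", "production"): {"account": "cmsgrid", "bqs_priority": 80},
        ("cms", "lcgadmin"):   {"account": "cms050",  "bqs_priority": 50},
        ("cms", None):         {"account": "pool",    "bqs_priority": 20},
    }

    def site_policy(fqan):
        """Map an FQAN such as '/cms/Role=production' to local site policy."""
        parts = fqan.strip("/").split("/")
        vo, role = parts[0], None
        for p in parts[1:]:
            if p.startswith("Role="):
                role = p.split("=", 1)[1]
        return POLICY.get((vo, role), POLICY[(vo, None)])

    print(site_policy("/cms/Role=production"))  # -> cmsgrid, priority 80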

23 Thanks for your attention
David Bouvet - CMS visit 30/11/2007

