Grid services for CMS at CC-IN2P3


Grid services for CMS at CC-IN2P3
David Bouvet, Pierre-Emmanuel Brinette, Pierre Girard, Rolf Rumler
CMS visit – 30/11/2007

Content
- Deployment status
- Tier-1 consolidation + tier-2 creation
- Grid site infrastructure
- Tier-2 site
- Grid services for CMS
- Major concerns: global, operational, technical

Deployment status
Official versions: 3.0.2 Update 36 (SL3) and 3.1 Update 06 (SL4_32)
Tier-1 site:
- 1 load-balanced Top BDII (2 SL4 machines), for local use only until the official SL4 release is validated
- 2 FTS: v2.0 and v1.5 (to be decommissioned)
- 1 central LFC (Biomed) + 2 local LFCs
- 1 VOMS server (Biomed, Auvergrid, Embrace, EGEODE, and local/regional VOs)
- 1 regional MonBox
- 1 site BDII
- 2 SRM SEs + 1 test SRMv2 SE
- 3 LCG CEs: 3.0.5, 3.0.13, 3.0.14 (instead of 3.0.19), configured for the SL4/32-bit farm
- partially updated UI/WN SL3: 3.0.22-2 (instead of 3.0.27)
- UI/WN gLite 3.1: SL4/32
- RLS/RMC and classic SEs phased out (June and September)
Tier-2 site:
- LCG CE 3.0.11 for SL3, VOs ATLAS and CMS
- LCG CE 3.0.14 for SL4, all VOs
- publishes the 2 SRM SEs of the tier-1
PPS site:
- test bed for local adaptations (BQS jobmanager, information provider)
- 2 LCG CEs with the latest production release: 3.0.19-0
- used for the local test VO (vo.rocfr.in2p3.fr)
- FTS v2.0 and LFC for LHC VOs

Tier-1 consolidation + tier-2 creation
Site changes this year:
- all CEs on the T1 (IN2P3-CC) configured for the SL4/32-bit farm
- a new Tier-2 site (IN2P3-CC-T2) dedicated to analysis facilities
- just one CE still configured on the T2 for the SL3 farm (will disappear)
- 2 CEs (1 per LHC VO)
- 1 FTS (channel distribution)
- 1 LFC (1 for replication of the LHCb central LFC)
- 1 dCache SE (managed by the Storage group)
- 2 "classic" SEs
- LDAP server (Auvergrid)
- RLS/RMC (Biomed)
- commitment to provide a load-balanced Top BDII (SL4) for France
- machine upgrades to V20Z or better, as SL4 versions of middleware components become available
Spare machines:
- 1 spare CE (in case of hardware problems)
- pre-installed VMs for Top BDII / site BDII / VOMS / LFC / FTS (for updates or in case of hardware problems); used during the major power outages necessary for power supply upgrades

Grid site infrastructure (diagram, shown in three incremental steps)
Services shown: VO boxes for the LHC VOs, Grid Information System (Top BDII, site BDII), VOMS (7 VOs), MonBox (7 sites), central LFC (Biomed), local LFC (4 LHC VOs), FTS (4 LHC VOs), Computing Elements and SRM Storage Elements, each flagged as a global, regional or local service, on top of BQS and the Anastasie worker nodes (computing) and HPSS/dCache (storage).

Tier-2 site
T1 and T2 are deployed within the same computing centre:
- sharing the same computing farm and using the same LRMS
- while being able to manage the production of each grid site separately
Site policies, per VOMS role (« lcgadmin », « production ») and regular users (see the sketch below):
- T1 job slots = (CMS job slots x #CPU_T1) / (#CPU_T1 + #CPU_T2)
- T2 job slots = (CMS job slots x #CPU_T2) / (#CPU_T1 + #CPU_T2)
Mapping strategy revisited on our CEs:
- prohibiting account overlapping between the local sites
- by splitting the grid accounts into 2 subsets
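A minimal sketch of the slot-sharing rule above; the CPU counts and the CMS slot total are illustrative placeholders, not the real CC-IN2P3 numbers.

```python
def split_job_slots(vo_slots, cpu_t1, cpu_t2):
    """Split a VO's job slots between the T1 and T2 grid sites
    in proportion to the CPUs each site contributes to the shared farm."""
    total = cpu_t1 + cpu_t2
    t1_slots = vo_slots * cpu_t1 // total
    t2_slots = vo_slots - t1_slots  # remainder goes to the T2
    return t1_slots, t2_slots

# Illustrative numbers only (not the actual CC-IN2P3 capacity).
print(split_job_slots(vo_slots=1000, cpu_t1=600, cpu_t2=400))  # (600, 400)
```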

Tier-2 site, T1 only (diagram)
- the T1 site BDII publishes CE01
- CMS mapping policy: dedicated accounts for the production and lcgadmin roles (cmsgrid, cms050), pool accounts cms[001-049] for all other users
- CMS site policy: AFS rw access, BQS priorities, maximum job slots

Tier-2 site, T1 + T2 (diagram)
- the T1 site BDII publishes CE01 and CE02; the T2 site BDII publishes CE03, CE04 and CE05
- each site has its own site policy and mapping policy (see the sketch below)
- the grid account pool is split between the sites: cms[001-024] on the T1, cms[025-048] on the T2, plus the dedicated role accounts (cmsgrid, cms050 on the T1, cms049 on the T2)
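A minimal sketch of a per-site mapping-policy lookup as described above; which dedicated account serves which VOMS role is an assumption made for illustration, not taken from the slide.

```python
# Illustrative per-site mapping policy: VOMS FQAN -> local account (or pool).
MAPPING_POLICY = {
    "IN2P3-CC": {                                   # T1
        "/cms/Role=production": "cmsgrid",
        "/cms/Role=lcgadmin":   "cms050",
        "/cms":                 "cms[001-024]",     # pool for regular users
    },
    "IN2P3-CC-T2": {                                # T2
        "/cms/Role=production": "cmsgrid",
        "/cms/Role=lcgadmin":   "cms049",
        "/cms":                 "cms[025-048]",     # disjoint pool, no overlap with the T1
    },
}

def account_for(site, fqan):
    """Return the local account (or pool) a VOMS FQAN maps to at a given site."""
    policy = MAPPING_POLICY[site]
    return policy.get(fqan, policy["/cms"])         # fall back to the regular-user pool

print(account_for("IN2P3-CC-T2", "/cms/Role=lcgadmin"))  # cms049
```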

Grid services for CMS
Tier-1 site:
- 2 FTS: v2.0 and v1.5 (to be decommissioned)
- 1 SRM SE + 1 test SRMv2 SE
- 1 LCG CE: 3.0.5, configured for SL4/32 bits, VO CMS only, partially updated
- 2 VO boxes: cclcgcms (PhEDEx + SQUID), cclcgcms02 (will be used for SQUID)
Tier-2 site:
- LCG CE 3.0.11 for SL3, VOs ATLAS and CMS
- LCG CE 3.0.14 for SL4, all VOs
- publishes the SRM SE of the tier-1

Grid services for CMS: FTS
2 FTS servers: v2.0 and v1.5
- v1.5 will be decommissioned after the CMS-France green light
- v2.0: a single node for all agents (VO + channel) and all VOs, with the DB backend on an Oracle cluster
Channels:
- T1-IN2P3 channels
- some T2/T3-IN2P3 channels (e.g. the Belgium and Beijing T2s, the IPNL T3)
- IN2P3-STAR channels to fit the CMS Data Management requirements (transfers from anywhere to anywhere ⇨ problems are harder to track down)
One node will be added for channel distribution (see the sketch below for how channel matching works).
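A minimal sketch of the channel-matching logic implied above: a transfer uses a dedicated point-to-point channel if one exists, otherwise a catch-all (STAR) channel. The channel names follow the usual FTS SOURCE-DEST convention but the list is illustrative, not the actual channel configuration.

```python
# Hypothetical channel list; "STAR" is the wildcard used for catch-all channels.
CHANNELS = ["CERN-IN2P3", "IN2P3-BEIJING", "IN2P3-IPNL", "STAR-IN2P3", "IN2P3-STAR"]

def match_channel(source_site, dest_site, channels=CHANNELS):
    """Pick the channel an FTS transfer would be routed on: a dedicated
    point-to-point channel if defined, otherwise a catch-all (STAR) channel."""
    for candidate in (f"{source_site}-{dest_site}",   # dedicated channel
                      f"STAR-{dest_site}",            # anything -> dest
                      f"{source_site}-STAR"):         # source -> anywhere
        if candidate in channels:
            return candidate
    return None  # no channel: this FTS cannot serve the transfer

print(match_channel("CERN", "IN2P3"))   # CERN-IN2P3
print(match_channel("FNAL", "IN2P3"))   # STAR-IN2P3 (catch-all)
```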

Grid services for CMS: CE
Tier-1 site:
- 1 LCG CE: 3.0.5, configured for SL4/32 bits, VO CMS only
Tier-2 site:
- LCG CE 3.0.11 for SL3, VOs ATLAS and CMS
- LCG CE 3.0.14 for SL4, all VOs
Major concerns regarding CMS:
- we strongly encourage the use of requirements in job submission instead of hard-coded hostnames or IP addresses ⇨ less impact on CMS when a node is changed (see the sketch below)
- the CMS critical-tests policy seems too restrictive: a CE/SE is blacklisted with FCR at the first test failure ⇨ would it be possible to wait for 2 test failures?
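A minimal sketch of the first point above, assuming the usual gLite WMS JDL attributes (Requirements, RegExp, GlueCEUniqueID, GlueCEInfoHostName); the hostname and the exact expression are illustrative, not a CMS recipe.

```python
def jdl_requirements(pin_hostname=None, site_domain="in2p3.fr"):
    """Build a JDL Requirements expression: either pin one CE hostname
    (discouraged) or match any CE of the site via the information system."""
    if pin_hostname:
        # Fragile: breaks as soon as the site replaces or renames this CE.
        return f'Requirements = other.GlueCEInfoHostName == "{pin_hostname}";'
    # Matches every CE whose GlueCEUniqueID contains the site domain.
    return f'Requirements = RegExp("{site_domain}", other.GlueCEUniqueID);'

print(jdl_requirements("ce.example.org"))  # hypothetical hard-coded hostname (discouraged)
print(jdl_requirements())                  # site-level requirement (preferred)
```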

Grid services for CMS: CE deployment plans (1)
Now: several independent Computing Elements in front of BQS and the Anastasie compute farm (diagram).
Problems:
- fault tolerance, as seen by a VO
- on-the-fly updates are too difficult, whether via the spare CE or via a temporary migration of a VO to other CEs, because the VOs address CEs by hostname
- configurations are not uniform

Grid services for CMS: CE deployment plans (2)
From… to? (diagram: the current set of individual CEs in front of BQS and the Anastasie compute farm, compared with the target layout detailed on the next slide)

Grid services for CMS: CE deployment plans (3)
Identified problems:
- load-balancing: it depends on the VOs' CE selection strategy ⇨ risk of overloading a particular CE; a solution based on the assumption that the VOs use the information system: via the IS, show a different cluster per CE (logical split of the BQS cluster)
- account mapping: different users must not share the same account; the account pool is split between tier-1 and tier-2; the same user should not be mapped to different accounts
Possible solutions (see the sketch below):
- share the gridmapdir between CEs (use GPFS?)
- use a centralized LCAS/LCMAPS database
Future? VO-oriented load-balancing, with a common mapping in front of the CEs (diagram).
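A minimal sketch of why a shared gridmapdir keeps the mapping consistent across CEs: every CE leases pool accounts from the same directory, so a given DN always lands on the same account and two DNs never share one. This mimics the gridmapdir lease mechanism (one file per pool account, hard-linked to the encoded DN) but is a simplification, not the real LCAS/LCMAPS code.

```python
import os
import urllib.parse

def lease_pool_account(gridmapdir, dn, pool_prefix="cms"):
    """Lease a pool account for a DN the way gridmapdir does it: the account
    file gets a second hard link named after the encoded DN, so every CE
    sharing this directory sees the same DN -> account lease."""
    encoded = urllib.parse.quote(dn.lower(), safe="")
    lease = os.path.join(gridmapdir, encoded)
    if os.path.exists(lease):                       # DN already mapped somewhere
        inode = os.stat(lease).st_ino
        for name in os.listdir(gridmapdir):
            if name.startswith(pool_prefix) and \
               os.stat(os.path.join(gridmapdir, name)).st_ino == inode:
                return name
    for name in sorted(os.listdir(gridmapdir)):     # claim a free pool account
        path = os.path.join(gridmapdir, name)
        if name.startswith(pool_prefix) and os.stat(path).st_nlink == 1:
            os.link(path, lease)                    # hard link marks the lease
            return name
    raise RuntimeError("pool exhausted")
```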

And in addition… (1)
Job priorities by VO:
- temporary solution by modifying jobs already in the queue
- to be implemented by a change to BQS (development ongoing)
- requires a better understanding of the VOMS organisation of the VO
Job traceability:
- increase the visibility of the grid jobs submitted to the Centre
- information needed by Operations: grid job identifier, mail address of the submitter

And in addition… (2)
CE gLite / CREAM:
- SL4-specific developments ongoing (Sylvain Reynaud)
- deployment on our PPS?
SL4:
- more and more nodes available on SL4 for production: LCG-CE, gLite-BDII, UI, WN (installation planned on the PPS ASAP)
- still awaiting an official SL4/64-bit WN/UI release; the current SL4/32-bit one leads to low-memory problems

Major concerns: global
Our first major concerns:
- many service nodes, of many different types
- few people to administer them, and little time before production starts
- a level of quality to be maintained
As far as we can, we have to:
- reuse our practical skills and the existing infrastructure when possible, but add nodes if that eases operations
- avoid introducing too many VO-specific setups
- build experience: acquiring operational procedures, setting up and adapting monitoring / administration tools

Major concerns: global
Improving grid communication will be the challenge:
- multiple information sources: LCG, EGEE, VOs, regional and internal site communication
- too much information or knowledge still comes from e-mails or from a mass of meetings ⇨ a lot of progress is yet to be achieved
At CC, improving VO-site communication:
- one VO support contact appointed per LHC VO, who speaks the VO language with the site and the site language with the VO; this has proved a good way to improve communication
- for CMS matters, the grid site administrators systematically discuss with Nelli Pukhaeva, who knows who the best CC interlocutor is for any CMS request
- CMS-specific support mailing list: cms-support@cc.in2p3.fr

Major concerns: operational
We set up operational procedures to suppress, or at least reduce, grid service outages
- e.g. a CE update can be performed without any outage:
  - set up a new CE and validate that it works well out of production
  - close the old CE and replace it with the new one
  - retire the old CE once its jobs have ended
But "bad" VO usage can interfere with that
- e.g. job submissions that explicitly refer to a CE by its hostname:
  - it is time-consuming, because the supported VOs must be informed before any action is taken on the CE
  - job submission will fail if the CE is out of production
- grid middleware theoretically enables those operations, but the problem certainly comes from the fact that the middleware does not allow requests such as: "please submit my job to any allowed CE of site IN2P3-CC"
Intensive access during jobs to a data file in the VO_CMS_SW_DIR directory, which is on AFS ⇨ heavy load on AFS
- the solution we proposed to Peter Elmer: copy the file to the job scratch directory (see the sketch below); we hope this will be solved soon
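A minimal sketch of the proposed workaround, assuming the job reads some file under $VO_CMS_SW_DIR (the file name below is a placeholder) and that a node-local scratch area is reachable via $TMPDIR; the job then works on the local copy instead of hammering AFS.

```python
import os
import shutil

def stage_to_scratch(relative_path):
    """Copy a file from the AFS-hosted VO_CMS_SW_DIR to the job's local
    scratch area and return the local path, so repeated reads hit local
    disk instead of AFS."""
    sw_dir = os.environ["VO_CMS_SW_DIR"]        # AFS-hosted software area
    scratch = os.environ.get("TMPDIR", ".")     # node-local job scratch
    src = os.path.join(sw_dir, relative_path)
    dst = os.path.join(scratch, os.path.basename(relative_path))
    shutil.copy(src, dst)
    return dst

# Placeholder file name, for illustration only.
local_copy = stage_to_scratch("some/config/data_file.db")
```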

Major concerns: technical
Dealing with VOMS information:
- if I were a VO, I would be very enthusiastic about the new possibilities offered by VOMS
- but as a site…
  - I need to know what behaviour is expected behind a VOMS role/group
  - I must find a technical solution to translate it into site policy (see the sketch below)
  - I may have to adapt the interfaces between the grid front-end and the local services to implement that behaviour: CE jobmanager ↔ BQS, CE information provider ↔ BQS
- this kind of solution raises scalability problems for big sites
- ASAP, we must identify together the new requirements that VOMS will introduce
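A minimal sketch of what "translating a VOMS role/group into site policy" can look like on the CE side: an FQAN is matched against a small table that decides the local account class and the batch parameters handed to the jobmanager. The account names, queue names and priorities are invented for illustration, not the actual BQS configuration.

```python
# Invented site-policy table: FQAN prefix -> (account class, BQS queue, priority).
# Most specific entries first: the first matching prefix wins.
SITE_POLICY = [
    ("/cms/Role=production", ("cms_prod_acct",  "prod_long",  100)),
    ("/cms/Role=lcgadmin",   ("cms_admin_acct", "short",       50)),
    ("/cms",                 ("cms_pool",       "grid_batch",  10)),
]

def policy_for(fqan):
    """Translate a VOMS FQAN into the local policy the CE jobmanager applies
    when it hands the job over to BQS."""
    for prefix, policy in SITE_POLICY:
        if fqan.startswith(prefix):
            return policy
    raise ValueError(f"no site policy for FQAN {fqan}")

print(policy_for("/cms/Role=production"))  # ('cms_prod_acct', 'prod_long', 100)
print(policy_for("/cms"))                  # pool account, default queue
```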

Thanks for your attention David Bouvet - CMS visit 30/11/2007