Experiment support at IN2P3 Artem Trunov CC-IN2P3


Experiment support at IN2P3
Outline: site introduction, system support, Grid support, experiment support.

Introduction: the Centre de Calcul de l'IN2P3
Serving 40+ experiments: HEP, astrophysics, biomedical.
20 years of history, 40+ employees.

CC-IN2P3 - Stats
Batch: ~900 batch workers (CPUs/cores), ~3400 jobs: dual-CPU P-IV 2.4 GHz, 2 GB RAM; dual-CPU Xeon 2.8 GHz, 2 GB RAM; dual-CPU dual-core Opteron 2.2 GHz, 8 GB RAM.
Mass storage: HPSS, 1.6 PB total volume stored; daily average transfer volume ~10 TB; rfio access to the disk cache, plus xrootd and dCache (a small access example is sketched below).
Network: 10 Gb/s link to CERN since January; core router (Catalyst 6500 series); 1 Gb/s uplink per 24 worker nodes (upgradable to 2x1 Gb/s).
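To make the access protocols above concrete, here is a minimal, purely illustrative sketch of pulling one file out of an xrootd disk cache with xrdcp; the redirector name and file path are placeholders, not real CC-IN2P3 endpoints.

```python
#!/usr/bin/env python
"""Illustrative only: copy one file out of a site's xrootd disk cache.
The server name and file path below are placeholders, not real CC-IN2P3 endpoints."""
import subprocess
import sys

XROOTD_SERVER = "ccxrootd.example.in2p3.fr"   # hypothetical redirector name
REMOTE_FILE = "/hpss/in2p3.fr/group/myexp/data/run1234/file.root"  # hypothetical path

def fetch(remote_path, local_path):
    """Run xrdcp and return True on success."""
    cmd = ["xrdcp", f"root://{XROOTD_SERVER}/{remote_path}", local_path]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"xrdcp failed: {result.stderr.strip()}", file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    ok = fetch(REMOTE_FILE, "/tmp/file.root")
    sys.exit(0 if ok else 1)
```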

LCG Tier 1 Center in France
[Table: pledged resources per the LCG MoU as of January - CPU (kSI2K), disk (TB), MSS (TB); the figures were not recovered from the slide.]

Systems support
A guard is on duty 24 hours a day; smoke and other alarm sensors trigger service-provider visits.
Telecom equipment sends an SMS on failure; the storage robotics signal the service provider, who acts according to the service agreement.
A Unix administrator is on shift during evening hours and can call an expert at home, but the expert is not obliged to work odd hours. Problems seen during shifts are reviewed weekly.
Nights and weekends are not covered and overtime is not paid, but in case of a major incident such as a power failure the systems group makes every effort to bring the site back as soon as possible.
During the day, routine monitoring covers slow or hung jobs, jobs that end too quickly and jobs exceeding requested resources (a message is sent to the submitter, see the sketch below), as well as storage services.
Some end-user services are delegated to experiment representatives: password resets, manipulation of job resources.
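As a rough illustration of the routine job monitoring described above, the following sketch scans a hypothetical batch accounting export, flags jobs that ended too quickly or exceeded their requested resources, and mails the submitter; the record format, thresholds and addresses are all assumptions, not the actual CC-IN2P3 tooling.

```python
#!/usr/bin/env python
"""Sketch of the routine job checks described above: flag jobs that end too
quickly or exceed requested resources and notify the submitter by mail.
The accounting-record format and thresholds are assumptions, not the real
batch-system format used at CC-IN2P3."""
import csv
import smtplib
from email.message import EmailMessage

TOO_SHORT_S = 60            # jobs ending in under a minute are suspicious
ACCOUNTING_FILE = "batch_accounting.csv"   # hypothetical export: user,jobid,walltime_s,req_walltime_s,maxmem_mb,req_mem_mb

def check_job(rec):
    """Return a list of human-readable anomalies for one accounting record."""
    problems = []
    if int(rec["walltime_s"]) < TOO_SHORT_S:
        problems.append(f"ended after only {rec['walltime_s']}s")
    if int(rec["maxmem_mb"]) > int(rec["req_mem_mb"]):
        problems.append(f"used {rec['maxmem_mb']}MB, requested {rec['req_mem_mb']}MB")
    if int(rec["walltime_s"]) > int(rec["req_walltime_s"]):
        problems.append("exceeded requested walltime")
    return problems

def notify(user, jobid, problems):
    """Send the submitter a short message about the flagged job."""
    msg = EmailMessage()
    msg["Subject"] = f"[batch] job {jobid} flagged"
    msg["From"] = "batch-ops@example.in2p3.fr"      # placeholder address
    msg["To"] = f"{user}@example.in2p3.fr"          # placeholder address
    msg.set_content("Your job was flagged:\n" + "\n".join(problems))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    with open(ACCOUNTING_FILE) as f:
        for rec in csv.DictReader(f):
            issues = check_job(rec)
            if issues:
                notify(rec["user"], rec["jobid"], issues)
```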

Grid and admin support at IN2P3
Dedicated Grid team; people have mixed responsibilities:
2.5 FTE – all middleware (CEs, LFC, etc.) plus operations
1.5 FTE – dCache + FTS
1.5 FTE – HPSS administration
1 FTE – Grid job support
A few people work on CIC/GOC development and support, plus many people not working on the grid: Unix, network, databases, web.
Not actually providing 24/7 support; a few people do too much work.
Weekly meetings with a "Tour of VOs" are very useful.
[Diagram: the Grid team sits between the storage and production-system groups.]

Grid and admin support at IN2P3
But Grid people alone are not enough to make the grid work.
Wasn't the Grid meant to help experiments? - yes
Are experiments happy? - no
Are Grid people happy with the experiments? - no
"ATLAS is the most 'grid-compatible', but it gives us a lot of problems."
"ALICE is the least 'grid-compatible', but it gives us a lot of problems."
One of the answers: it is necessary to work closely with the experiments.
At Lyon, part of the user support group is dedicated to LHC experiment support.

Experiment support at CC-IN2P3
Experiment support group - 7 people:
1 – BaBar + astrophysics
1 – all Biomed
1 – ALICE and CMS
1 – ATLAS
1 – D0
1 – CDF
1 – general support (software etc.)
All – general support, GGUS (in the near future).
The plan is to have one person for each LHC experiment.
All are former physicists; the director of the CC is also a physicist.

Having physicists on site helps
They usually have in-depth experience with at least one experiment, understand the experiments' computing models and requirements, are proactive in problem solving and have a broad view of experiment computing.
They work to make physicists happy and bring mutual benefit to the site and the experiment.
The current state of Grid and experiment middleware really requires a lot of effort, and the grid is not getting easier, only more complex.

CMS Tier1 people - who are they?
A little survey on CMS computing support at sites:
Spain (PIC) – has a dedicated person
Italy (CNAF) – has a dedicated person
France (IN2P3) – yes
Germany (FZK) – no; support comes from the physics community
SARA/NIKHEF – no
Nordic countries – terra incognita
US (FNAL) – yes
UK (RAL) – yes, but virtually no experiment support at UK Tier2 sites

What are they doing at their sites?
Making sure the site setup works: integration work, optimization. Site admins usually cannot check whether their setup works themselves and ask the VO to test it (see the sketch below).
Reducing the "round-trip time" between the experiment and the site admins: talking is much better for understanding than written exchanges; sometimes it is simply necessary to sit together to resolve problems.
Helping site admins understand the experiment's use of the site, especially "exotic" cases like ALICE (which asks for xrootd); this also requires a lot of iterations.
Testing new solutions, then deploying and supporting them.
Administration: at Lyon all xrootd servers are managed by user support; Grid expertise, managing VO Boxes and services (FTS).
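A minimal sketch of the kind of storage round-trip test a support person might run on behalf of the VO: write a small file to the SE over xrootd, read it back and compare checksums. The endpoint and path are placeholders, not real site configuration.

```python
#!/usr/bin/env python
"""Sketch of a storage round-trip test: write a small file to the SE over
xrootd, read it back and compare checksums. Endpoint and path are placeholders."""
import hashlib
import subprocess
import tempfile
import os

ENDPOINT = "root://se.example.in2p3.fr//test/vo/setup_check.dat"  # hypothetical SE path

def md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def xrdcp(src, dst):
    """Copy with xrdcp, overwriting any existing destination (-f)."""
    return subprocess.run(["xrdcp", "-f", src, dst]).returncode == 0

def round_trip():
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.dat")
        back = os.path.join(tmp, "probe_back.dat")
        with open(src, "wb") as f:
            f.write(os.urandom(1024 * 1024))       # 1 MB of random data
        if not xrdcp(src, ENDPOINT):
            return "write to SE failed"
        if not xrdcp(ENDPOINT, back):
            return "read back from SE failed"
        if md5(src) != md5(back):
            return "checksum mismatch after round trip"
        return "OK"

if __name__ == "__main__":
    print(round_trip())
```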

VO support scenarios
The "range" is better when the experiment expert is on site: they can interact directly with more people and systems. Integration is a keyword.
[Diagram: two scenarios, a) and b), each showing the expert's reach across Grid admins, site admins, Grid storage systems, databases and the experiment community.]

At Lyon
Some extra privileges that on-site support people use:
Installing and supporting xrootd servers (BaBar, ALICE, CMS, other experiments) – root
Managing VO boxes – root
Installing all VO software at the site – special
Co-managing FTS channels – special
Developing software (Biomed)
Debugging transfers and jobs – root
Transfers and storage really need attention!

Grid support problems
SFT results are not objective: a Grid service depends on external components, e.g. a remote RB or BDII may be at fault.
Too many components are unreliable; it is impossible to keep a site up to the level of the MoU.
GGUS support is not 24h either.
Select some support areas and focus on them.

Focus points
In short, what does an experiment want from a site?
Keep the data: ingest a data stream and archive it on tape.
Access the data: Grid jobs must be able to read the data back.
When are experiments not happy? When transfers are failing and jobs are failing.
What causes the failures? The network is stable and the local batch system is stable, so it comes down to the Grid middleware and the storage.

Debugging grid jobs
Difficult: log files are not available even to the superuser – they sit on a Resource Broker.
Whether it is the user application or the site's infrastructure that is failing is not clear without VO expertise (an illustrative heuristic is sketched below).
At Lyon, production people monitor slow jobs, jobs that end too quickly and killed jobs, sending messages to the users – this does not scale.
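As an illustration of the triage problem above, here is a hedged heuristic that guesses from a job's exit code and log tail whether the failure looks like a user-application problem or a site/infrastructure problem; the patterns are assumptions, not an official classification.

```python
#!/usr/bin/env python
"""Illustrative heuristic for the triage problem described above: guess from a
job's exit code and log tail whether the failure looks like a user-application
problem or a site/infrastructure problem. The patterns are assumptions."""
import re

SITE_PATTERNS = [
    r"No space left on device",
    r"Stale NFS file handle",
    r"cannot connect to .*:\d+",      # storage or service unreachable
    r"rfio.*error",                   # local disk-cache access error
]
USER_PATTERNS = [
    r"Segmentation fault",
    r"ImportError|ModuleNotFoundError",
    r"FileNotFoundError|No such file or directory",
]

def classify(exit_code, log_tail):
    """Return 'site', 'user' or 'unknown' for one finished job."""
    for pat in SITE_PATTERNS:
        if re.search(pat, log_tail, re.IGNORECASE):
            return "site"
    for pat in USER_PATTERNS:
        if re.search(pat, log_tail, re.IGNORECASE):
            return "user"
    return "unknown"                  # needs VO expertise to go further

if __name__ == "__main__":
    sample = "opening root://se.example//f.root\ncannot connect to se.example:1094"
    print(classify(1, sample))        # -> site
```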

Storage failures
Storage is another key component (transfers, data access).
Transfers have been made too complex: SRM, non-interoperable storage solutions, exotic use cases.
Site storage experts cannot always debug transfers themselves: they do not have VO certificates/roles and do not have access to the VO middleware that initiates the transfers.
The Storage Classes WG (T1 storage admins) is trying to reduce complexity, bring the terminology to common ground, make developers understand the real use cases, make experiments aware of the real-life solutions, and make site storage experts fully comfortable with the experiments' demands. This is in everyone's mutual interest.
File loss is inevitable. An exchange mechanism for site-VO interactions is being developed within the Storage Classes WG initiative, but detection of file loss is still the site's responsibility (a minimal consistency check is sketched below)!
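A minimal sketch, under assumed input formats, of the file-loss detection that remains the site's responsibility: diff the replicas the VO catalogue places at the site against a dump of the storage namespace and report anything missing. Both file names are placeholders.

```python
#!/usr/bin/env python
"""Sketch of site-side file-loss detection: diff the set of replicas the VO
catalogue places at this site against a dump of the storage namespace and
report anything missing. Both inputs are assumed to be plain-text dumps,
one path per line; the file names are placeholders."""

CATALOGUE_DUMP = "catalogue_replicas_at_site.txt"   # e.g. exported from the VO catalogue
NAMESPACE_DUMP = "storage_namespace.txt"            # e.g. dumped from the site's storage system

def load(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def find_lost_files():
    expected = load(CATALOGUE_DUMP)
    present = load(NAMESPACE_DUMP)
    return sorted(expected - present)

if __name__ == "__main__":
    lost = find_lost_files()
    if lost:
        print(f"{len(lost)} files in the catalogue but missing from storage:")
        for path in lost[:20]:        # show only the first few
            print("  " + path)
    else:
        print("no missing replicas found")
```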

Storage
Storage needs serious monitoring: pools down, staging failures, overloaded servers.
Make storage part of 24h support? Train operators to debug storage problems and localize the damage to an experiment (e.g. by notifying the experiments).
Is it possible to develop a monitoring and support scenario where problems are fixed before users complain? It is a shame when a user has to tell an admin "You have a problem"; it should be the other way around (a possible probe is sketched below).
Lyon will try to expand its storage expertise; the experiment support people are already involved.
The VOs will obviously have to train their data-taking shift crews to recognize storage problems at sites, since exporting data from CERN is part of the initial reconstruction workflow.
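One possible shape for such proactive monitoring, sketched with placeholder endpoints and alerting: periodically read a small, known test file through each access door and warn the operator on duty when a probe fails or is slow.

```python
#!/usr/bin/env python
"""Sketch of a proactive storage probe: try to read a small, known test file
through each access door and warn the operator on duty when a probe fails or
is too slow. The door list, test path and alert hook are placeholders."""
import subprocess
import time

DOORS = [                               # hypothetical access points
    "root://xrootd1.example.in2p3.fr",
    "root://xrootd2.example.in2p3.fr",
]
TEST_FILE = "/test/monitoring/probe.dat"   # small file known to be on disk
TIMEOUT_S = 120                            # longer than this suggests staging or overload problems

def probe(door):
    """Return (ok, seconds) for one door."""
    start = time.time()
    try:
        rc = subprocess.run(
            ["xrdcp", "-f", f"{door}/{TEST_FILE}", "/dev/null"],
            timeout=TIMEOUT_S,
        ).returncode
        return rc == 0, time.time() - start
    except subprocess.TimeoutExpired:
        return False, TIMEOUT_S

def alert(message):
    print("ALERT:", message)               # replace with mail/SMS to the operator on shift

if __name__ == "__main__":
    for door in DOORS:
        ok, elapsed = probe(door)
        if not ok:
            alert(f"{door}: probe failed or timed out after {elapsed:.0f}s")
        elif elapsed > TIMEOUT_S / 2:
            alert(f"{door}: probe slow ({elapsed:.0f}s), possible overload")
```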

A little summary of what should work in Grid support
A dedicated Grid team.
Deep interaction with the experiments, with regular meetings on progress and issues.
Storage and transfer monitoring.
The human factor.