Pierre Girard LCG France 2010 Lyon, November 22nd-23rd, 2010 Lyon Tier-1 infrastructure & services: status and issues

P.Girard 2 Contents
Human resources status
■ New T1 representative
■ Manpower turnover
Infrastructure status
■ Local network
■ Storage
Issues
Concerns and perspectives
Conclusions
Questions & Comments

P.Girard 3

Human resources status 4
New T1 representative
■ Good connection with the LCG France technical manager (Frédérique)
  Even though a 100 GB/s connection could not compete with the connection between Fabio's cerebral hemispheres
  Communication from the T1 representative to the technical manager must still improve
■ Big progress in understanding the experiments' computing/operating models
  Thanks to the problems to be dealt with
  Thanks to the experiment people
■ Big progress in understanding the T1 infrastructure
  Thanks to the problems to be dealt with
  Thanks to CCIN2P3 experts
  Mostly the storage system and its usage
■ Not yet available enough for LCG duties
  French NGI involvement
  Grid deployment tasks

P.Girard Human resources status (Grid) 5
Manpower turnover
■ dCache administrators
  Entire team renewed in about one year
  A smooth know-how handover was carried out
■ Grid administrators
  Most of the team renewed in one year
  CEs, WNs, UIs, VO Boxes, LFC: know-how handover still to be done
■ VO supporters
  3 persons will leave soon (for very good reasons): ATLAS, CMS, and LHCb
Is that a concern?
■ Different reason in each case
■ Probably inevitable in a project mode
■ Must live with it and adapt to it

P.Girard 6

Local network: before September downtime 7
In August, network saturation was observed with dCache, Xrootd, etc.
We were about to reach the limits of this topology.
(Topology diagram: 80 Gb/s and 40 Gb/s links.)

P.Girard Local network: after September downtime 8
A big router/switch (200 kg) was added to reduce the network depth, enhance manageability, and increase the bandwidth.

P.Girard Storage / dCache
(Capacity chart: 1209 TB and 728 TB.)
dCache disk includes the disks of all tiers; there is no easy way to distinguish one tier from another.
The dCache disk for ALICE was decommissioned in September and moved to Xrootd instead.

P.Girard Storage / Xrootd 10
ALICE is now using only Xrootd
■ Xrootd-T1
  HPSS backend
  319 TB of disk (SUN/Thor/Solaris)
■ Xrootd-T2
  111 TB of disk (SUN/Thor/Solaris)
ALICE authentication plugin no longer supported on Solaris
■ No migration to Linux planned in the short term for our storage
■ Requires a discussion with ALICE before going ahead

P.Girard 11

P.Girard Problems solved since June 12
Transfer between CCIN2P3 and BNL
■ Low rate (4 MB/s) for file transfers
■ Problem between Solaris and Linux
  Brought to light by introducing a Linux GridFTP server in dCache
  Solved by changing the TCP configuration on Solaris (see the sketch below)
Local LFC instabilities solved by load-balancing across several LFC servers
SL5 WN crashes due to an I/O problem
■ Various tuning steps to get something satisfactory enough
Local network saturation
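For context only: the Solaris-side fix is the classic bandwidth-delay-product situation, where a single TCP stream cannot exceed roughly its window size divided by the round-trip time, so a small default window throttles a long-distance transfer regardless of link capacity. A minimal sketch of that arithmetic, with assumed (not measured) values:

```python
# Illustrative sketch of the bandwidth-delay-product reasoning behind the
# TCP tuning mentioned above. The RTT and window sizes are assumed example
# values, not the measured CCIN2P3/BNL figures.

rtt_s = 0.090                                # assumed transatlantic round-trip time (s)
windows = {
    "small default window": 48 * 1024,       # bytes (assumed)
    "tuned larger window": 4 * 1024 * 1024,  # bytes (assumed)
}

for label, window_bytes in windows.items():
    # A single TCP stream averages at most window / RTT.
    max_throughput = window_bytes / rtt_s
    print(f"{label}: ~{max_throughput / 1e6:.1f} MB/s upper bound")
```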

P.Girard SW Area problems 13
Overload of the ATLAS SW area was solved by read-only AFS volume replication a long time ago
■ But release installation requires some CC-specific adaptations
  This makes it no longer grid-aware
  It works better, but human intervention is still needed sometimes
LHCb SW Area
■ Installation jobs have been failing since August
  We manually install missing LHCb modules on demand
■ Latencies in AFS read access can lead to 7% of failures
  Due to a timeout (600 s) in the job setup (CMT)
  No overload observed on the AFS server
  The AFS client is suspected and different tunings are being tested at the moment
  ATLAS suffered the same problem with their latest release, which also introduced a timeout on job setup (CMT)
  o Bypassed by deploying a new version with a longer timeout (6000 s), as sketched below
  That might also explain the low CMS job efficiency
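A minimal sketch of the timeout effect described above, assuming a hypothetical setup script; this is not the actual CMT mechanism, only an illustration of why raising the limit from 600 s to 6000 s turns slow-but-healthy setups from failures into successes:

```python
# Hypothetical illustration of the job-setup timeout discussed above;
# the script path and wrapper are assumptions, not the real CMT code.
import subprocess

def run_job_setup(script="./setup_release.sh", timeout_s=600):
    """Run a job-setup step; report failure if it exceeds timeout_s seconds."""
    try:
        subprocess.run(["bash", script], check=True, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Slow AFS client reads can push a healthy setup past 600 s;
        # with timeout_s=6000 the same job would simply take longer and succeed.
        return False
```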

P.Girard ATLAS problems with dCache 14
Very chaotic dCache restart
■ 3 disk servers with precious data were not working properly
  They needed to be reinstalled after long investigations
  Data had to be migrated
■ A huge number of files stayed blocked in the import buffer because of an unexplained dCache service crash
  Releasing them took a long time because of a bug in a dCache command
■ Low throughput on export pools was observed and left unexplained for many days
  The problem disappeared and came back later
■ Many other problems occurred during the October reprocessing
Context: the September downtime combined a security alert with many upgrades/updates at the same time (network, Solaris, dCache, etc.)

P.Girard ATLAS problems with dCache 15
ATLAS was very unhappy during the reprocessing, but CMS seems happy.
Is dCache guilty or victim?

P.Girard 16

P.Girard Problem solving vs. operations 17
We are facing problems on both the computing and storage sides
■ Each problem is particularly tricky and time consuming
  It monopolizes CCIN2P3 staff across teams
  o From VO support to system administration
■ In the meantime, we must still operate the T1, even in a degraded mode
  Be able to regulate the different activities to keep the T1 alive
  Communicate with the experiments
  o To define the priorities together with them, so as to make the best use of our available resources
  o To provide them a clear status of the site
We must organize ourselves better
■ To separate problem solving from operations as soon as possible
■ To gather the required experts to start solving the problem
  Creating a dedicated working group, as was done for the LHCb problem, seems a good way forward

P.Girard Monitoring/Communication 18
There is evidence that we lack well-adapted monitoring tools
■ To understand the load generated by a VO, a VO activity, etc. (see the sketch below)
■ To speed up problem detection and diagnosis
Too many human exchanges are needed between VO supporters and service experts
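As an illustration of the kind of tool this calls for, a per-VO load summary could be derived from service accounting data; the record format below is hypothetical, standing in for real sources such as dCache billing or batch accounting logs.

```python
# Sketch of a per-VO load summary; the (timestamp, vo, bytes) record format
# is hypothetical, standing in for real accounting/billing data.
from collections import defaultdict

def bytes_per_vo(records):
    """Sum transferred bytes per VO from 'timestamp vo nbytes' lines."""
    totals = defaultdict(int)
    for line in records:
        _, vo, nbytes = line.split()[:3]
        totals[vo] += int(nbytes)
    return dict(totals)

sample = [
    "2010-11-20T10:00 atlas 1048576",
    "2010-11-20T10:01 cms 524288",
    "2010-11-20T10:02 atlas 2097152",
]
print(bytes_per_vo(sample))  # {'atlas': 3145728, 'cms': 524288}
```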

P.Girard Resources/services sharing 19
We must check that resources/services sharing scales with
■ Simultaneous T1, T2 and T3 activities
■ 4 VOs using the same services at the same time
  Some configurations should be checked
CCIN2P3 is based on a mutualization (shared-resources) model
■ Manpower is sized accordingly
■ We cannot provide a dedicated infrastructure per VO

P.Girard Conclusions 20
A staff turnover to deal with
■ It seems we are good trainers
Problems we must learn to live with
■ Separate problem solving from operations
■ Staff worked hard without complaining (too much)
Reconsider our resources/services sharing configuration
■ To be sure that one experiment won't impact the others
■ To check that we will meet future requirements

P.Girard 21 Questions & Comments

P.Girard 22

P.Girard Computing trend 23

P.Girard Storage service 24
Storage chain composed of 3 main components
■ dCache: disk-based system exposing gridified interfaces
  One instance serving the 4 LHC experiments
■ HPSS: tape-based mass storage system, used as the permanent storage back-end for dCache
  One instance for all the experiments served by CCIN2P3
■ TReqS: mediator between dCache and HPSS that schedules access to tapes, for optimization purposes
  To minimize tape mounts/dismounts and to optimize the sequence of files read within a single tape (see the sketch below)
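To make that last point concrete, the optimization can be pictured as grouping pending requests by tape (one mount per tape) and reading files in increasing position within each tape. The sketch below is an illustration under assumed request fields, not the actual TReqS implementation.

```python
# Rough illustration of the optimization TReqS is described as doing:
# group pending staging requests per tape (so each tape is mounted once)
# and read files in increasing position within a tape. Illustrative only.
from collections import defaultdict

def schedule(requests):
    """requests: list of (tape_id, position_on_tape, file_path) tuples
    (hypothetical fields). Returns [(tape_id, ordered file list), ...]."""
    by_tape = defaultdict(list)
    for tape, pos, path in requests:
        by_tape[tape].append((pos, path))
    plan = []
    for tape, files in sorted(by_tape.items()):
        files.sort()                          # sequential reads within the tape
        plan.append((tape, [path for _, path in files]))
    return plan

print(schedule([("T001", 40, "/f3"), ("T002", 5, "/f2"), ("T001", 2, "/f1")]))
# -> [('T001', ['/f1', '/f3']), ('T002', ['/f2'])]
```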