INFN-T1 site report Andrea Chierici On behalf of INFN-T1 staff HEPiX Fall 2013

Outline
- Network
- Farming
- Storage
- Common services

Network

WAN Connectivity (diagram: Nexus and Cisco 7600 routers at CNAF, GARR Bo1 links to the other Tier-1s RAL, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3 and SARA over LHC OPN, LHC ONE and General IP)
- 20 Gb physical link (2x10 Gb) shared for LHCOPN and LHCONE
- 10 Gb/s for General IP connectivity
- Dedicated CNAF-FNAL link for CDF (data preservation)
- Planned upgrades (Q3-Q4 2013): General IP → 20 Gb/s, LHCOPN/ONE → 40 Gb/s

Farming and Storage current connection model (diagram: Internet and LHCOPN traffic through Cisco 7600, BD8810 and Nexus core switches; disk servers and farming switches uplinked to the core; 20 worker nodes per farming switch with 2x10 Gb/s uplinks, expandable to 4x10 Gb/s; old resources on 4x1 Gb/s)
- Core switches and routers are fully redundant (power, CPU, fabrics)
- Every switch is connected with load sharing across different port modules
- Core switches and routers are covered by a strict maintenance SLA (next solar day)

Farming

Computing resources
- 195K HS06, 17K job slots
- 2013 tender installed in summer: AMD CPUs, 16 job slots
- Whole farm upgraded to SL6
  - Per-VO and per-node approach
  - Some CEs upgraded and serving only some VOs
- Older Nehalem nodes got a significant boost switching to SL6 (and activating hyper-threading too)

New CPU tender
- 2014 tender delayed until the beginning of 2014
  - Will probably cover 2015 needs as well
- Takes TCO (energy consumption) into account, not only the purchase price
- 10 Gbit WN connectivity
  - 5 MB/s per job (minimum) required
  - A 1 Gbit link is not enough for the traffic generated by modern multi-core CPUs (see the sketch below)
  - Network bonding is hard to configure
- Blade servers are attractive
  - Cheaper 10 Gbit network infrastructure
  - Cooling optimization
  - OPEX reduction
  - BUT: higher street price
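A back-of-the-envelope check of why 1 Gbit/s per worker node is no longer sufficient, assuming the 5 MB/s-per-job minimum from the slide; the slot counts per node are illustrative assumptions, not the exact farm configuration.

```python
# Rough bandwidth demand per worker node vs. a 1 Gbit/s link.
MB_PER_JOB = 5                  # minimum required throughput per job, MB/s (from the slide)
GBIT_LINK_MBYTES = 1000 / 8     # 1 Gbit/s expressed in MB/s (= 125 MB/s)

for slots in (16, 24, 32):      # plausible job-slot counts per WN (illustrative)
    need = slots * MB_PER_JOB   # aggregate demand in MB/s
    print(f"{slots} slots -> {need} MB/s "
          f"({need / GBIT_LINK_MBYTES:.0%} of a 1 Gbit/s link)")

# 16 slots already need 80 MB/s (64% of the link); 32 slots need 160 MB/s,
# which exceeds 1 Gbit/s -- hence the move to 10 Gbit WN connectivity.
```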

Monitoring & Accounting (1)
- Rewrote our local resource accounting and monitoring portal
- The old system was completely home-made
  - Monitoring and accounting were separate things
  - Adding/removing queues on LSF meant editing lines in the monitoring system code
  - Hard to maintain: >4000 lines of Perl code

Monitoring & Accounting (2)
- New system: monitoring and accounting share the same database
- Scalable and based on open-source software (plus a few Python lines)
- Graphite (see the sketch below)
  - Time-series oriented database
  - Django webapp to plot on-demand graphs
  - lsfmonacct module released on GitHub
- Automatic queue management
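A minimal sketch of how a farm metric can be pushed into Graphite through Carbon's plaintext protocol; the Carbon host and the metric name are illustrative assumptions, not taken from the actual lsfmonacct implementation.

```python
import socket
import time

# Hypothetical Carbon endpoint -- Carbon's plaintext receiver listens on TCP 2003 by default.
CARBON_HOST, CARBON_PORT = "graphite.example.cnaf.infn.it", 2003

def send_metric(path, value, timestamp=None):
    """Push one sample using Carbon's plaintext protocol: '<path> <value> <timestamp>\n'."""
    timestamp = int(timestamp or time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. the number of running jobs in a (hypothetical) LSF queue, sampled elsewhere
send_metric("farm.lsf.queues.cms.running_jobs", 1234)
```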

Monitoring & Accounting (3) (screenshot slide)

Monitoring & Accounting (4) (screenshot slide)

Issues
- Grid accounting problems starting from April 2013
  - Subtle bugs affecting the log-parsing stage on the CEs (DGAS urcollector), causing it to skip data
- WNoDeS issue when upgrading to SL6
  - Code maturity problems: addressed quickly
  - Now ready for production
  - BaBar and CDF will be using it rather soon
  - Potentially the whole farm can be used with WNoDeS

New activities
- Investigation of Grid Engine as an alternative batch system is ongoing
- Testing Zabbix as a platform for monitoring computing resources
  - Possible alternative to Nagios + Lemon
- Dynamic update of WNs, mainly to deal with kernel/CVMFS/GPFS upgrades (see the sketch below)
- Evaluating APEL as an alternative to DGAS for the grid accounting system
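A rough sketch of what a dynamic WN update could look like with LSF: close the host so it stops accepting jobs, wait for the running jobs to drain, apply the upgrade, then reopen it. The host name, the update command and the drain check are illustrative assumptions, not the actual CNAF procedure; `badmin hclose`/`badmin hopen` and `bjobs` are standard LSF commands.

```python
import subprocess
import time

# Hypothetical worker node -- name and update command are illustrative only.
WN = "wn-001.cnaf.infn.it"

def lsf(*args):
    """Run an LSF command and return its stdout (exit code ignored: bjobs may
    report 'no unfinished job found' in ways that vary between LSF versions)."""
    return subprocess.run(args, capture_output=True, text=True).stdout

# 1) Close the host so LSF stops dispatching new jobs to it.
lsf("badmin", "hclose", WN)

# 2) Wait until no running jobs are reported on the host (simplistic check).
while lsf("bjobs", "-u", "all", "-r", "-m", WN).strip():
    time.sleep(300)

# 3) Apply the upgrade (kernel/CVMFS/GPFS), possibly reboot, then reopen the host.
subprocess.run(["ssh", WN, "yum", "-y", "update"], check=True)
lsf("badmin", "hopen", WN)
```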

Storage

Storage Resources
- Disk space: 15.3 PB-N (net) on-line
  - 7 EMC2 CX + EMC2 CX4-960 (~2 PB), servers with 2x1 Gb/s connections
  - 7 DDN S2A + DDN SFA 10K + 1 DDN SFA 12K (~11.3 PB), ~80 servers with 10 Gb/s connections
  - Installation of the latest system (DDN SFA 12K, 1.9 PB-N) was completed this summer
  - ~1.8 PB-N expansion foreseen before the Christmas break
  - Aggregate bandwidth: 70 GB/s
- Tape library SL8500: ~16 PB on-line with 20 T10KB drives and 13 T10KC drives (3 additional drives were added during summer 2013)
  - 8800 x 1 TB tapes, ~100 MB/s of bandwidth per drive
  - 1200 x 5 TB tapes, ~200 MB/s of bandwidth per drive
  - Drives interconnected to library and servers via a dedicated SAN (TAN); 13 Tivoli Storage Manager HSM nodes access the shared drives
  - 1 Tivoli Storage Manager (TSM) server common to all GEMSS instances
  - A tender for an additional 470 x 5 TB tapes is under way
- All storage systems and disk servers on SAN (4 Gb/s or 8 Gb/s)

Storage Configuration
- All disk space is partitioned into ~10 GPFS clusters served by ~170 servers
  - One cluster for each main (LHC) experiment
  - GPFS deployed on the SAN implements a full HA system
  - System scalable to tens of PBs, able to serve thousands of concurrent processes with an aggregate bandwidth of tens of GB/s
- GPFS coupled with TSM offers a complete HSM solution: GEMSS
- Access to storage granted through standard interfaces (POSIX, SRM, XRootD and soon WebDAV)
  - File systems directly mounted on the WNs (see the sketch below)
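Since the GPFS file systems are mounted directly on the WNs, a job can read its input with plain POSIX calls; the mount point and file name below are hypothetical, just to illustrate the access model.

```python
import os

# Hypothetical GPFS mount point on a worker node -- the path is illustrative only.
DATA = "/storage/gpfs_example/cms/store/run2013/events.dat"

# Plain POSIX access: no SRM or XRootD client needed when the FS is mounted locally.
print("size (bytes):", os.stat(DATA).st_size)
with open(DATA, "rb") as f:
    header = f.read(1024)   # read the first kilobyte like any local file
```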

Storage research activities
- Studies on more flexible and user-friendly methods for accessing storage over the WAN
  - Storage federation implementation
  - Cloud-like approach
- We developed an integration between the GEMSS storage system and XRootD to match the requirements of CMS and ALICE, using ad-hoc XRootD modifications (see the sketch below)
  - The CMS modification was validated by the official XRootD integration build
  - This integration is currently in production
- Another approach to storage federations, based on HTTP/WebDAV (ATLAS use case), is under investigation
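To give an idea of what federated access looks like from the client side, here is a sketch that copies a file through an XRootD redirector with `xrdcp`; the redirector hostname and the logical file name are made-up placeholders, not CNAF's actual endpoints.

```python
import subprocess

# Hypothetical XRootD redirector and logical file name -- placeholders only.
REDIRECTOR = "xrootd-redirector.example.org"
LFN = "/store/data/Run2012A/example/file.root"

# xrdcp follows the redirection to whichever federated site actually holds the file.
subprocess.run(
    ["xrdcp", f"root://{REDIRECTOR}/{LFN}", "/tmp/file.root"],
    check=True,
)
```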

LTDP
- Long Term Data Preservation (LTDP) for the CDF experiment
  - The FNAL-CNAF data copy mechanism is completed
- Copy of the data will follow this timetable:
  - early 2014 → all data and MC user-level n-tuples (2.1 PB)
  - mid 2014 → all raw data (1.9 PB) + databases
- Bandwidth of 10 Gb/s reserved on the transatlantic link CNAF ↔ FNAL (see the sketch below)
- The "code preservation" issue still has to be addressed
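A quick back-of-the-envelope estimate of the transfer times these volumes imply on the reserved link; the assumed average utilisation factor is an illustrative guess, not a figure from the slides.

```python
# Rough transfer-time estimate for the CDF LTDP copies over the reserved 10 Gb/s link.
LINK_GBPS = 10          # reserved bandwidth (from the slide)
EFFICIENCY = 0.5        # assumed average sustained utilisation -- illustrative guess

def days_to_copy(petabytes):
    gigabits = petabytes * 1e6 * 8               # 1 PB = 1e6 GB = 8e6 Gb
    seconds = gigabits / (LINK_GBPS * EFFICIENCY)
    return seconds / 86400

for label, pb in [("n-tuples", 2.1), ("raw data", 1.9)]:
    print(f"{label}: {pb} PB -> ~{days_to_copy(pb):.0f} days")

# 2.1 PB at a sustained 5 Gb/s takes roughly 39 days; at the full 10 Gb/s about 19 days.
```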

Common services

Installation and configuration tools
- Quattor is currently the tool used at INFN-T1
- An investigation of an alternative installation and management tool set was carried out by the storage group
- Integration of two tools:
  - Cobbler, for the installation phase
  - Puppet, for server provisioning and management operations
- The results show that Cobbler + Puppet is a viable and valid alternative
  - Currently used within CNAF OpenLAB

Grid Middleware status
- EMI-3 update status
  - Argus, BDII, CREAM CE, UI, WN, StoRM
  - Some UIs still on SL5 (will be upgraded soon)
- EMI-1 being phased out (only FTS remains)
- VOBOX updated to the WLCG release