
INFN-T1 site report Andrea Chierici On behalf of INFN-T1 staff HEPiX Spring 2014

Outline: Common services, Network, Farming, Storage

Common services

Cooling problem in March
A failure in the cooling system forced us to switch the whole center off – the problem, of course, happened on a Sunday at 1 am.
It took almost a week to fully recover and bring the center 100% back on-line – but the LHC experiments were reopened after 36 hours.
We learned a lot from this incident (see separate presentation).

New dashboard

Example: Facility

Installation and configuration
CNAF is seriously evaluating a move to Puppet + Foreman as the common installation and configuration infrastructure.
INFN-T1 has historically been a Quattor supporter; new manpower, a wider user base and new activities are pushing us to change.
Quattor will stay around as long as needed – at least 1 year to allow for the migration of some critical services.
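As an illustration of the kind of integration being evaluated (not CNAF's actual setup), the sketch below lists the hosts registered in a Foreman instance through its REST API; the URL, credentials and CA bundle path are placeholders.

```python
# Minimal sketch: query a Foreman server for its registered hosts.
# Hypothetical URL and credentials; assumes the standard /api/v2/hosts
# endpoint and the 'requests' package being available.
import requests

FOREMAN_URL = "https://foreman.example.cnaf.infn.it"   # placeholder
AUTH = ("api_user", "secret")                          # placeholder credentials

def list_hosts(per_page=100):
    """Return the names of hosts known to Foreman."""
    resp = requests.get(f"{FOREMAN_URL}/api/v2/hosts",
                        auth=AUTH, params={"per_page": per_page},
                        verify="/etc/pki/tls/certs/ca-bundle.crt")
    resp.raise_for_status()
    return [h["name"] for h in resp.json()["results"]]

if __name__ == "__main__":
    for name in list_hosts():
        print(name)
```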

Heartbleed
No evidence of compromised nodes.
OpenSSL was updated and certificates were re-issued on bastion hosts and critical services (grid nodes, Indico, wiki).
Some hosts were not exposed because they ran an older, unaffected OpenSSL version.
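A minimal sketch of how the certificate re-issuance could be cross-checked, assuming placeholder hostnames and the `cryptography` package (not the procedure actually used at CNAF):

```python
# Sketch: verify that a service certificate was (re)issued after the
# Heartbleed disclosure (2014-04-07). Hostnames are placeholders.
import ssl
import datetime
from cryptography import x509

HEARTBLEED_DISCLOSURE = datetime.datetime(2014, 4, 7)
HOSTS = ["bastion.example.cnaf.infn.it", "indico.example.cnaf.infn.it"]  # placeholders

def cert_issued_after_fix(host, port=443):
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())
    return cert.not_valid_before > HEARTBLEED_DISCLOSURE

for host in HOSTS:
    ok = cert_issued_after_fix(host)
    print(f"{host}: certificate {'re-issued' if ok else 'predates the fix'}")
```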

Grid middleware status
EMI-3 update status:
– All core services updated
– All WNs updated
– Some legacy services (mainly UIs) still at EMI-1/2, to be phased out as soon as possible

Network

WAN connectivity
(Slide shows the WAN diagram.) The core routers (Cisco Nexus and Cisco 7600) connect the T1 resources to GARR (Bo1) and, through it, to the other Tier-1s (RAL, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3, SARA) over LHCOPN, LHCONE and general IP.
– 40 Gb/s physical link (4x10 Gb/s) shared for LHCOPN and LHCONE
– 10 Gb/s for general IP connectivity
– 10 Gb/s CNAF-FNAL link for CDF (data preservation)

Current connection model
(Slide shows the LAN diagram: Internet and LHCOPN/ONE traffic enters through the core routers – Cisco 7600, BD8810, Nexus – which serve the disk servers at 10 Gb/s and the farming switches; each farming switch aggregates about 20 worker nodes with 2x10 Gb/s uplinks, up to 4x10 Gb/s, while switches for older resources are attached at 4x1 Gb/s.)
– Core switches and routers are fully redundant (power, CPU, fabrics)
– Every switch is connected with load sharing on different port modules
– Core switches and routers have a strict SLA (next solar day) for maintenance

Farming

Computing resources
150K HS06
– Reduced compared to the last workshop
– Old nodes (2008 and 2009 tenders) have been phased out
Whole farm running on SL6
– A few VOs that still require SL5 are supported via WNoDeS

New CPU tender
The 2014 tender has been delayed
– Funding issues
– We were running over-pledged resources
Trying to take TCO (energy consumption) into account, not only the sales price (see the sketch after this slide)
Support will cover 4 years
Trying to open the tender to as many bidders as possible
– Last tender attracted only 2 bidders
– "Relaxed" support constraints
We would like an easy way to share specs, experiences and hints about other sites' procurements
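A minimal sketch of the kind of TCO comparison implied above; the purchase prices, power draw, PUE and electricity cost are illustrative assumptions, not CNAF figures.

```python
# Sketch: compare two hypothetical CPU offers on total cost of ownership
# over the 4-year support period, not on purchase price alone.
# All numbers are illustrative assumptions.
YEARS = 4
HOURS_PER_YEAR = 24 * 365
PUE = 1.5                 # assumed datacenter power usage effectiveness
ENERGY_COST = 0.15        # assumed EUR per kWh

def tco(purchase_eur, node_watts, nodes):
    """Purchase price plus energy cost over the support period."""
    kwh = node_watts / 1000 * nodes * HOURS_PER_YEAR * YEARS * PUE
    return purchase_eur + kwh * ENERGY_COST

offer_a = tco(purchase_eur=400_000, node_watts=350, nodes=100)
offer_b = tco(purchase_eur=360_000, node_watts=450, nodes=100)
print(f"Offer A TCO: {offer_a:,.0f} EUR, Offer B TCO: {offer_b:,.0f} EUR")
```

With these assumed numbers the cheaper offer ends up costlier over 4 years, which is the point of weighting energy consumption in the tender.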

Monitoring & Accounting (1)

Monitoring & Accounting (2)

New activities (from last workshop)
We did not migrate to Grid Engine, we are sticking with LSF
– Mainly an INFN-wide decision
– Manpower
Testing Zabbix as a platform for monitoring computing resources
– More time required
Evaluating APEL as an alternative to DGAS for grid accounting: not done yet

New activities
Configure an oVirt cluster to manage service VMs: done
– A standard libvirt mini-cluster is kept for backup, with GPFS shared storage
Upgrade LSF to v9
Set up a new HPC cluster (Nvidia GPUs + Intel MIC)
Multicore task force
Implement a log analysis system (Logstash, Kibana); a log-shipping sketch follows this slide
Move some core grid services to the OpenStack infrastructure (the first one will be the site-BDII)
Evaluation of the Avoton CPU (see separate presentation)
Add more VOs to WNoDeS
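As an illustration of the log-analysis pipeline mentioned above (not the actual CNAF configuration), this sketch ships log lines as JSON to a Logstash TCP input configured with a JSON codec; hostname, port and field names are assumptions.

```python
# Sketch: forward a log line to a Logstash TCP input with a JSON codec.
# Hostname, port and field names are placeholders/assumptions.
import json
import socket
import time

LOGSTASH_HOST = "logstash.example.cnaf.infn.it"  # placeholder
LOGSTASH_PORT = 5000                             # placeholder TCP input port

def ship(message, source="farming"):
    event = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "message": message,
        "source": source,
    }
    with socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT)) as sock:
        sock.sendall((json.dumps(event) + "\n").encode())

ship("LSF job 12345 finished with exit code 0")
```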

Storage

Storage resources
Disk space: 15 PB (net) on-line
– 4 EMC² CX + 1 EMC² CX4-960 (~1.4 PB) + 80 servers (2x1 Gb/s connections)
– 7 DDN S2A + DDN SFA 10K + 1 DDN SFA 12K (~13.5 PB) + ~90 servers (10 Gb/s)
– Upgrade of the latest system (DDN SFA 12K) was completed in Q1
– Aggregate bandwidth: 70 GB/s
Tape library SL8500: ~16 PB on line with 20 T10KB drives, 13 T10KC drives and 2 T10KD drives
– 7500 x 1 TB tape capacity, ~100 MB/s of bandwidth for each drive
– 2000 x 5 TB tape capacity, ~200 MB/s of bandwidth for each drive; these 2000 tapes can be "re-used" with the T10KD technology at 8.5 TB per tape (capacity arithmetic sketched below)
– Drives are interconnected to the library and to the servers via a dedicated SAN (TAN); 13 Tivoli Storage Manager HSM nodes access the shared drives
– 1 Tivoli Storage Manager (TSM) server common to all GEMSS instances
A tender for an additional 3000 x 5 TB/8.5 TB tapes is ongoing
All storage systems and disk-servers are on SAN (4 Gb/s or 8 Gb/s)
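A back-of-the-envelope sketch of the tape figures quoted above; it simply multiplies the media counts by the per-tape capacities, so it gives nominal media capacity rather than data actually stored.

```python
# Tape capacity arithmetic for the media counts quoted above.
t10kb_tapes, t10kb_size_tb = 7500, 1.0   # ~100 MB/s per drive
t10kc_tapes, t10kc_size_tb = 2000, 5.0   # ~200 MB/s per drive
t10kd_size_tb = 8.5                      # same media re-written with T10KD drives

nominal_pb = (t10kb_tapes * t10kb_size_tb + t10kc_tapes * t10kc_size_tb) / 1000
reused_pb = (t10kb_tapes * t10kb_size_tb + t10kc_tapes * t10kd_size_tb) / 1000

print(f"Nominal media capacity: ~{nominal_pb:.1f} PB")   # ~17.5 PB nominal; the slide quotes ~16 PB on line
print(f"After re-using the 5 TB media at 8.5 TB: ~{reused_pb:.1f} PB")
```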

Storage configuration
All disk space is partitioned into ~10 GPFS clusters served by ~170 servers
– One cluster per main (LHC) experiment
– GPFS deployed on the SAN implements a full high-availability system
– The system is scalable to tens of PB and able to serve thousands of concurrent processes with an aggregate bandwidth of tens of GB/s
GPFS coupled with TSM offers a complete HSM solution: GEMSS
Access to storage is granted through standard interfaces (POSIX, SRM, XRootD and WebDAV)
– The file systems are directly mounted on the WNs (see the POSIX-access sketch below)
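To illustrate the POSIX access path mentioned above: since the GPFS file system is mounted on the worker node, a job reads experiment data with plain file calls. The path below is a placeholder, not the real CNAF namespace.

```python
# Sketch: plain POSIX access to data on the locally mounted GPFS file system.
import os

DATA_FILE = "/storage/gpfs_example/atlas/some/dataset/file.root"  # placeholder path

size = os.stat(DATA_FILE).st_size
with open(DATA_FILE, "rb") as f:
    header = f.read(4)
print(f"{DATA_FILE}: {size} bytes, first bytes {header!r}")
```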

Storage research activities
Studies on more flexible and user-friendly methods for accessing storage over the WAN
– Storage federations based on HTTP/WebDAV for ATLAS (production) and LHCb (testing); an access sketch follows this slide
– Evaluation of different file systems (Ceph) and storage solutions (EMC² Isilon over OneFS)
Integration between the GEMSS storage system and XRootD to match the requirements of CMS, ATLAS, ALICE and LHCb, using ad-hoc XRootD modifications
– This is currently in production
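A minimal sketch of reading a file from an HTTP/WebDAV storage endpoint with an X.509 grid proxy, as in the federation use case above; the endpoint URL and proxy path are placeholders and this is not the production client.

```python
# Sketch: download a file over HTTP/WebDAV, authenticating with a grid proxy.
# URL and proxy path are placeholders; assumes the 'requests' package.
import requests

PROXY = "/tmp/x509up_u1000"                                  # placeholder proxy file
URL = "https://webdav.example.cnaf.infn.it/atlas/some/file"  # placeholder endpoint

resp = requests.get(URL, cert=PROXY, verify="/etc/grid-security/certificates",
                    stream=True)
resp.raise_for_status()
with open("file.local", "wb") as out:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        out.write(chunk)
print("Downloaded", out.name)
```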

LTDP
Long Term Data Preservation (LTDP) for the CDF experiment
– The FNAL-CNAF data copy mechanism is completed
The copy of the data will follow this timetable:
– early 2014 → all data and MC user-level n-tuples (2.1 PB)
– mid 2014 → all raw data (1.9 PB) + databases
A bandwidth of 10 Gb/s is reserved on the transatlantic link CNAF ↔ FNAL; 940 TB are already at CNAF (a transfer-time estimate is sketched below)
Code preservation: the CDF legacy software release (SL6) is under test
Analysis framework: in the future, CDF services and analysis computing resources may be instantiated on demand on pre-packaged VMs in a controlled environment
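A rough transfer-time estimate for the full CDF copy over the reserved link, using the volumes from the timetable above and an assumed average link efficiency (the efficiency figure is an illustration, not from the slides):

```python
# Back-of-the-envelope estimate of the CDF copy over the reserved 10 Gb/s link.
TOTAL_PB = 2.1 + 1.9          # n-tuples + raw data from the timetable above
LINK_GBPS = 10
EFFICIENCY = 0.8              # assumed average link utilisation

total_bits = TOTAL_PB * 1e15 * 8
seconds = total_bits / (LINK_GBPS * 1e9 * EFFICIENCY)
print(f"~{seconds / 86400:.0f} days to copy {TOTAL_PB} PB")   # roughly 46 days
```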