INFN-T1 site report Andrea Chierici On behalf of INFN-T1 staff HEPiX Fall 2015

Outline
– Infrastructure
– Network
– Data Management and Storage
– Farming

Infrastructure

Incident (1)
– On August 27th, one of the two main power lines, upstream of the continuity system (the so-called “green line”), burnt
– The fire was immediately extinguished by the automatic system and power on the green line was secured by the diesel engine
  – Root cause of the problem: dripping water
  – More than half of the extinguishing gas was used, leaving the fire protection insufficient to cover another emergency
– We declared a “General Warning State” on GOCDB

Incident (2)
– On August 28th, to allow the repair work, we cut power on the green line (at that point provided by the diesel engine)
– One storage system (with redundant power) hosting LHCb data suffered multiple (65!) disk failures; to restore correct functionality and start the rebuild of the failed RAID6 arrays it was necessary to power-cycle it
  – Cause of the problem: still unknown
– Some in-line air-conditioning units lost power due to wrong cabling (both power cables connected to the “green line”)
  – As a consequence, temperatures rose in the storage zone, reaching the alarm threshold
  – We were able to restore cooling fast enough to avoid damage or having to switch off devices
  – Issue now resolved

Incident (3)
– Farm nodes were powered only by the red line in order to save fuel
  – On August 30th the diesel engine unexpectedly ran out of fuel (the level sensor was giving a wrong reading)
– On August 31st the green line was partially restored, without continuity
– On September 4th the fire extinguishing system was restored
  – At about the same time the recovery of the failed RAID6 arrays of the LHCb storage system completed (no more "Critical" or "no redundancy" state on any LUN); a few hours later all rebuilds were finished and the system went back to the "optimal" state
– On September 5th we reopened to LHCb submissions
– On September 29th the green line was completely restored
– No impact on global site reliability: everything kept working (with the exception of LHCb)

Network

CNAF WAN Links
– 4x10 Gb/s LHCOPN+LHCONE
  – One 40 Gb/s link aggregation is used for T0-T1 (LHCOPN) and T1-T2 (LHCONE) traffic
  – 20 Gb/s to CERN dedicated to T0-T1 and T1-T1 traffic (LHCOPN)
  – Last year CNAF, KIT and IN2P3 moved the traffic between their Tier-1s from LHCOPN to LHCONE (more bandwidth available through GEANT)
– 20 Gb/s General Purpose
  – General internet access for CNAF users
  – LHC sites not connected to LHCOPN/ONE (T3s and minor T2s)
  – Backup link in case LHCOPN is down
  – INFN IT national services
– 10 Gb/s CNAF-FNAL (LTDP)
  – Activity terminated (decommissioning)
  – The 10 Gb/s link on the CNAF side remains, but traffic has been rerouted through the GEANT-GARR 100 Gigabit general peering
– IPv6 has been configured on the CNAF side of the LHCONE and LHCOPN peerings, in order to make perfsonar-ps and perfsonar-ow reachable in dual stack (a minimal reachability check is sketched below)
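A quick way to verify that the dual-stack setup works is to check that the perfSONAR hosts resolve and answer over both protocols. A minimal sketch, with a placeholder hostname rather than the real CNAF perfSONAR address:

    # Check that both A and AAAA records are published (hostname is hypothetical)
    dig +short A    ps.example.cnaf.infn.it
    dig +short AAAA ps.example.cnaf.infn.it

    # Verify reachability over IPv4 and over IPv6
    ping  -c 3 ps.example.cnaf.infn.it
    ping6 -c 3 ps.example.cnaf.infn.it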

[Network diagram: the CNAF Tier-1 core (Nexus and Cisco 7600 routers) connects through the GARR Bo1 and Mi1 PoPs to LHCOPN, LHCONE and General IP. The 40 Gb physical link (4x10 Gb) is shared by LHCOPN and LHCONE, 20 Gb/s is used for general IP connectivity, and the 10 Gb/s CNAF-FNAL CDF (data preservation) traffic is no longer on a dedicated link. Peer Tier-1s shown include RAL, SARA, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3, KR-KISTI, RRC-KI and JINR, plus the main Tier-2s.]

Data Management and Storage

On-line storage: GPFS
– 17 PB of disk space (15 PB of data) in 15 file systems
  – Each major experiment has its own cluster
  – 70 I/O servers (10 Gbps Ethernet) for LAN access
  – 12 I/O servers (10 Gbps) for remote (WAN) access
– All worker nodes are in a single “diskless” cluster
  – They access the file systems via remote mount (see the sketch after this list)
– During 2015:
  – Retired ~1.4 PB (EMC2 CX3-80)
  – Installed the 2014 tender: ~2.2 PB (4x Dell MD3860f); the “Dynamic Disk Pools” feature had a very positive impact
– 2015 tender: exploring a «new» technology (InfiniBand)
  – 8 PB (2x DDN SFA12k)
  – Storage to servers: IB FDR (54 Gbps)
  – Servers to LAN: 4x10 Gbps
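To illustrate the “diskless” worker-node setup, the sketch below shows how a GPFS cluster with no local file systems can mount one owned by a separate storage cluster, using the standard GPFS remote-cluster commands. Cluster, device and mount-point names are invented and do not reflect the actual CNAF configuration.

    # On the worker-node cluster: register the remote storage cluster
    # and one of its file systems (all names are hypothetical)
    mmremotecluster add storage.example.cnaf.infn.it \
        -n gpfs-io01,gpfs-io02 -k storage_cluster_key.pub
    mmremotefs add atlas_data -f /dev/atlas_data \
        -C storage.example.cnaf.infn.it -T /storage/gpfs_atlas -A yes

    # Mount the remote file system on all nodes of the local cluster
    mmmount atlas_data -a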

Near-line storage
– GEMSS: HSM based on TSM and GPFS (a basic TSM status check is sketched after this list)
– 22 PB of data on tape
– Oracle STK SL8500 tape library
  – 17 Oracle T10kD tape drives
  – 4520 tapes loaded
  – ~5400 free tape slots (room for a further ~46 PB)
– 12 tape servers
  – SAN interconnection to the on-line (disk) storage
  – I/O rate up to 400 MB/s per server (current limit is the FC4 link to the SAN)
– TSM server v6.2 (single instance)
  – 1 active + 1 stand-by on shared storage
  – Planning an upgrade to a newer TSM release
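Routine checks of the tape back-end can be done through the TSM administrative command line; a minimal sketch in batch mode (the credentials are placeholders):

    # Query drive and library volume status on the TSM server
    dsmadmc -id=admin -password=secret query drive
    dsmadmc -id=admin -password=secret query libvolume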

Tape re-pack
– Data repack from T10kB (1 TB/tape) and T10kC (5 TB/tape) to T10kD (8.5 TB/tape)
– Campaign started at the end of November 2014 and finished in mid September 2015: the data was migrated in 10 months
– After the TSM server upgrade the migration rate reached 1.6 GB/s (limited by the FC16 HBA)
– Observed a heavy influence of the Intel power-saving feature (C-states) on I/O performance
  – Throughput was limited to ~900 MB/s on FC16
  – Resolved at the end of July by disabling C-states in the BIOS and in the kernel at boot time, reaching wire speed (see the sketch after this list)
[Plot: migration I/O rate with C-states on vs. C-states off]
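The kernel-side part of that fix on an SL6-style system with legacy GRUB would look roughly like the sketch below; the exact parameters used at CNAF were not given in the slides, so the values shown are just a common way of keeping the CPUs out of deep C-states.

    # /boot/grub/grub.conf: append the C-state options to the kernel line
    # (kernel version and root device are placeholders)
    kernel /vmlinuz-<version> ro root=<root-device> \
        intel_idle.max_cstate=0 processor.max_cstate=1

    # After a reboot, verify which idle states are still exposed
    cat /sys/module/intel_idle/parameters/max_cstate
    cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name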

Farming

Computing resources
– 190k HS06
  – 2014 and 2015 tenders fully operational
  – The 2015 tender consists of Lenovo blades (dual Xeon 2630 v3, 128 GB RAM); installing them was a real headache, apparently because of some incompatibilities with SL6.x (not well understood)
  – Still relying on a few unsupported resources, which will be decommissioned after the next tender

Extending the farm
– We are testing the possibility of extending our farm outside the CNAF premises
  – First pilot with an Italian cloud provider, sharing VMware resources
  – Contacts with one of the major Italian banks, in order to exploit nightly CPU cycles
  – 20k HS06 should be provided in 2016 as pledged resources by a computing centre in Bari (built thanks to a collaboration between INFN and the University of Bari)
– Many issues to face, but we are definitely interested in trying this approach

Other activities
– Updating the batch system to LSF 9
  – Some issues, mainly with accounting
– Provisioning: transition from Quattor to Puppet ongoing (a minimal dry-run sketch follows)
  – Farming resources will be Puppet-only by the end of the year
– Low-power solutions: further studies carried out with HP Moonshot and Supermicro MicroBlade
  – Will try to report the outcome at the next HEPiX
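The Quattor-to-Puppet migration can be staged node by node: before a worker node is handed over, the Puppet catalog can be compiled in dry-run mode to see what would change. A minimal sketch using standard Puppet agent options (this is an assumption about the workflow, not a description of the actual CNAF procedure):

    # On a node being migrated away from Quattor: compile the catalog and
    # report the changes without applying anything
    puppet agent --test --noop

    # Once the reported changes look sane, run the agent for real
    puppet agent --test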