1
INFN-T1 Site Report
Andrea Chierici, on behalf of INFN-T1 staff
HEPiX Fall 2015
2
Outline
– Infrastructure
– Network
– Data Management and Storage
– Farming
3
Infrastructure
4
Incident (1)
On August 27th, one of the two main power lines, upstream of the continuity system (the so-called "green line"), burnt. The fire was immediately extinguished by the automatic system, and power on the green line was secured by the diesel engine.
– Root cause of the problem: dripping water
– More than half of the extinguishing gas was used, leaving fire protection insufficient to cover another emergency
We declared a "General Warning State" on GOCDB.
5
Incident (2)
On August 28th, to allow the repair work, we interrupted power on the green line (then provided by the diesel engine).
One storage system (with redundant power) hosting LHCb data suffered multiple (65!) disk failures; to restore correct functionality and start the rebuild of the failed RAID6 arrays, it was necessary to switch it off and on again.
– Cause of the problem: still unknown
Some in-line conditioning systems lost power due to wrong cabling (both cables connected to the green line).
– As a consequence, temperatures rose in the storage zone, reaching the alarm threshold
– We restored cooling fast enough to avoid damage or having to switch off devices
– Issue now resolved
6
Incident (3)
Farm nodes were powered only by the red line in order to save fuel.
– On August 30th the diesel engine unexpectedly ran out of fuel (the level sensor was reporting an error)
On August 31st the green line was partially restored, without continuity.
On September 4th the fire-extinguishing system was restored.
– At about the same time, recovery of the failed RAID6 arrays on the LHCb storage system completed (no more "critical" or "no redundancy" state on any LUN); a few hours later all rebuilds were finished and the system returned to "optimal" state
On September 5th we reopened to LHCb submissions.
On September 29th the green line was completely restored.
No impact on global site reliability: everything kept working (with the exception of LHCb).
7
Network
8
CNAF WAN Links
4x10 Gb/s for LHCOPN + LHCONE
– One 40 Gb/s link aggregation is used for T0-T1 (LHCOPN) and T1-T2 (LHCONE) traffic
– 20 Gb/s to CERN dedicated to T0-T1 and T1-T1 traffic (LHCOPN)
– Last year CNAF, KIT and IN2P3 moved the traffic between their Tier-1s from LHCOPN to LHCONE (more bandwidth available through GÉANT)
20 Gb/s general purpose
– General Internet access for CNAF users
– LHC sites not connected to LHCOPN/LHCONE (T3s and minor T2s)
– Backup link in case LHCOPN is down
– INFN IT national services
10 Gb/s CNAF-FNAL (LTDP)
– Activity terminated (decommissioning)
– The 10 Gb/s link on the CNAF side remains, but the traffic has been routed through the GÉANT-GARR 100 Gb/s general peering
IPv6 has been configured on the CNAF side of the LHCOPN and LHCONE peerings, in order to make perfsonar-ps and perfsonar-ow reachable in dual stack.
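As an illustration, dual-stack reachability of a perfSONAR host can be checked from any client with standard tools; the hostname below is a hypothetical placeholder, not the actual CNAF endpoint:

    # Resolve both address families (placeholder hostname)
    host perfsonar-ps.example.cnaf.infn.it
    # Check the IPv4 and IPv6 paths separately
    ping -c 3 perfsonar-ps.example.cnaf.infn.it
    ping6 -c 3 perfsonar-ps.example.cnaf.infn.it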
9
[Network diagram: CNAF Tier-1 WAN connectivity. GARR Bo1/Mi1 routers (Nexus and Cisco 7600) connect the CNAF Tier-1 via a 40 Gb/s physical link (4x10 Gb/s) shared by LHCOPN and LHCONE, plus 20 Gb/s for general IP connectivity; peer sites include RAL, SARA, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3, KR-KISTI, RRC-KI, JINR and the main Tier-2s. The 10 Gb/s CNAF-FNAL link for CDF (data preservation) is no longer dedicated.]
10
Data Management and Storage
11
On-line storage: GPFS
17 PB of disk space (15 PB of data) in 15 file systems
– Each major experiment has its own cluster
– 70 I/O servers (10 Gb/s Ethernet) for LAN access
– 12 I/O servers (10 Gb/s) for remote (WAN) access
All worker nodes are in a single "diskless" cluster
– Accessing the file systems via remote mount (see the sketch below)
During 2015:
– Retired ~1.4 PB (EMC CX3-80)
– Installed the 2014 tender: ~2.2 PB (4x Dell MD3860f); the "Dynamic Disk Pools" feature had a very positive impact
2015 tender: exploring "new" technology (InfiniBand)
– 8 PB (2x DDN SFA12K)
– Storage to servers: IB FDR (54 Gb/s)
– Servers to LAN: 4x10 Gb/s
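For illustration, the remote mount used by a diskless worker-node cluster follows the standard GPFS multi-cluster mechanism; all cluster, device, node and path names below are hypothetical, not the actual CNAF configuration:

    # On the storage (owning) cluster: authorize the worker-node cluster
    mmauth add wn-cluster.example -k /tmp/wn-cluster.pub
    mmauth grant wn-cluster.example -f gpfs_atlas

    # On the diskless worker-node cluster: register the remote cluster
    # and its file system, then mount it on all nodes
    mmremotecluster add storage-cluster.example -n ionode01,ionode02 -k /tmp/storage-cluster.pub
    mmremotefs add gpfs_atlas -f gpfs_atlas -C storage-cluster.example -T /gpfs/atlas
    mmmount gpfs_atlas -a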
12
Near-line Storage
GEMSS: an HSM system based on TSM and GPFS
22 PB of data on tape
Oracle StorageTek SL8500 tape library
– 17 Oracle T10kD tape drives
– 4520 tapes loaded
– ~5400 free tape slots (room for a further ~46 PB)
12 tape servers
– SAN interconnection to the on-line (disk) storage
– I/O rate up to 400 MB/s per server (the current limit is the FC4 link to the SAN; see the calculation below)
TSM server v6.2 (single instance)
– 1 active + 1 stand-by on shared storage
– Planning an upgrade to TSM v7.2
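The ~400 MB/s per-server figure is consistent with the nominal throughput of a 4 Gb/s Fibre Channel link:

    FC4 line rate:           4.25 Gbit/s
    8b/10b encoding payload: 4.25 x 8/10 = 3.4 Gbit/s
    Usable bandwidth:        3.4 / 8 = 425 MB/s, i.e. ~400 MB/s in practice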
13
Tape re-pack
Data repack from T10kB (1 TB/tape) and T10kC (5 TB/tape) to T10kD (8.5 TB/tape)
– Campaign started at the end of November 2014 and finished in mid-September 2015
– 15.6 PB of data migrated in 10 months
After the TSM server upgrade, the migration rate reached 1.6 GB/s (limited by the FC16 HBA).
We observed a heavy influence of the Intel power-saving feature (C-states) on I/O performance:
– Throughput was limited to ~900 MB/s on FC16
– Resolved at the end of July by disabling C-states in the BIOS and in the kernel at boot time (see the sketch below), reaching wire speed
[Plot: I/O rate with C-states on vs. C-states off]
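A minimal sketch of the kernel-side part of that fix on an RHEL/SL6-style system, assuming GRUB is the boot loader; the exact parameter set used at CNAF is not stated in the slide:

    # Keep the CPUs out of deep C-states at boot time:
    # intel_idle.max_cstate=0 disables the intel_idle driver,
    # processor.max_cstate=1 caps ACPI C-states at C1.
    grubby --update-kernel=ALL \
           --args="intel_idle.max_cstate=0 processor.max_cstate=1"
    # A reboot is required for the new command line to take effect.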
14
Farming
15
Computing resources
190k HS06
– 2014 and 2015 tenders are fully operational
– The 2015 tender consists of Lenovo blades (dual Xeon E5-2630 v3, 128 GB RAM)
– Installing them was a strong headache: apparently some incompatibilities with SL6.x (not well understood)
– Still relying on a few unsupported resources, which will be decommissioned after the next tender
16
Extending the farm
We are testing the possibility of extending our farm outside the CNAF premises:
– A first pilot with an Italian cloud provider, sharing VMware resources
– Contacts with one of the major Italian banks, in order to exploit nightly CPU cycles
– 20k HS06 should be provided in 2016 as pledged resources by a computing centre in Bari (built thanks to a collaboration between INFN and the University of Bari)
There are many issues to face, but we are definitely interested in trying this approach.
17
Other activities
Updating the batch system to LSF9
– Some issues, mainly with accounting
Provisioning: transition from Quattor to Puppet ongoing
– Farming resources will be Puppet-only by the end of the year
Low-power solutions: more studies carried out with HP Moonshot and Supermicro MicroBlade
– We will try to report the outcome at the next HEPiX