INFN – Tier1 Site Status Report
Vladimir Sapunenko, on behalf of the Tier1 staff
HEPiX, CERN, 6 May 2008
Overview
- Introduction
- Infrastructural expansion
- Farming
- Network
- Storage and databases
- Conclusions
Introduction
- Location: INFN-CNAF, Bologna; ~1000 m2 hall in the basement (floor -2)
- Multi-experiment Tier-1 (~20 VOs, including the LHC experiments, CDF, BABAR and others)
- Participating in the LCG, EGEE and INFNGRID projects
- One of the main nodes of the GARR network
- In a nutshell:
  - about 3 MSI2K with ~2000 CPUs/cores, to be expanded to ~9 MSI2K by June
  - about 1 PB of disk space (tender for a further 1.6 PB)
  - 1 PB tape library (additional 10 PB tape library by Q2 2008)
  - Gigabit Ethernet network, with some 10 Gigabit links on the LAN and WAN
- Resources are assigned to experiments on a yearly basis
Infrastructural Expansion
- Expansion of the electrical power and cooling systems; work is in progress right now
- Main constraints:
  - Dimensional: limited room height (max 260 cm), so no floating floor
  - Environmental: noise and electromagnetic insulation required, due to the proximity of offices and classrooms
- Key aspects: reliability, redundancy, maintainability
  - 2 (+1) transformers, 2500 kVA each (~2000 kW each)
  - 2 rotating (no-break) electric generators (EG), i.e. UPS+EG in one machine to save room, 1700 kVA each (~1360 kW each), to be integrated with the 1250 kVA EG already installed
  - 2 independent electric lines feeding each row of racks: 2n redundancy
  - 7 chillers: ~2 MW at T_air = 40 °C; 2 (+2) pumps for chilled-water circulation
  - High-capacity precision air-conditioning units (50 kW each) inside the high-density islands (APC)
  - Air treatment and conditioning units (30 kW each) outside the high-density islands (UTA and UTL)
The Tier1 in 2009 (floor -2)
[Floor plan] Labels: room 1 (migration, then storage); room 2 (farming); electric panelboard room; sites for chilled-water piping from/to floor -1 (chillers); sites not involved in the expansion.
- Remote control of the systems' critical points
Farming
- The farming service maintains all computing resources and provides access to them. Main aspects:
  - Automatic unattended installation of all nodes via Quattor
  - Advanced job scheduling via the LSF scheduler
  - Customized monitoring and accounting via RedEye, developed in-house
  - Remote control of all computing resources via KVM and IPMI, plus customized scripts
- Resources can be accessed in two ways:
  - Grid: the preferred solution, through a so-called "User Interface" node. Requires setting up a VO. Secure: x509 certificates are used for authentication/authorization.
  - Direct access to LSF: discouraged. Faster; you simply need a UNIX account on a front-end machine (see the sketch below). Limited to the Tier1 only, and insecure.
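Purely as an illustration of the direct-LSF path (not a recommendation, since it is discouraged above), the sketch below wraps a plain bsub submission in Python. The queue name, payload script and log file are hypothetical placeholders, and it assumes the LSF client tools are available on the front-end machine.

```python
# Minimal sketch of the "direct LSF" access path described above.
# Assumes the LSF client (bsub) is in PATH on the front-end machine;
# queue name, script and log file below are placeholders.
import subprocess

def submit_direct(queue: str, script: str, logfile: str) -> None:
    """Submit a job straight to LSF, bypassing the Grid layer."""
    subprocess.run(
        ["bsub", "-q", queue, "-o", logfile, script],
        check=True,
    )

submit_direct("babar", "./run_analysis.sh", "job_%J.out")
```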
Node Installation and Configuration
- Quattor is a CERN-developed toolkit for automatic unattended installation:
  - Kickstart initial installation (Red Hat)
  - Quattor takes over after the first reboot
  - Nodes are configured according to administrator requirements
  - Very powerful: allows per-node customizations, but can also easily install 1000 nodes with the same configuration in 1 hour
- Currently we only support Linux
  - The HEP community chose Scientific Linux; version currently deployed at CNAF: 4.x
  - Identical to Red Hat AS, good hardware support, big software library available on-line
  - Supported Grid middleware is gLite 3.1
  - We install the SL CERN release, which adds some useful customizations
Job Scheduling
- Job scheduling is done via the LSF scheduler: 848 WNs, 2032 CPUs/cores, 3728 slots
- Queue abstraction: one job queue per VO, no time-oriented queues
  - Each experiment submits jobs only to its own queue
  - Resource utilization limits are set on a per-queue basis
- Hierarchical fairshare scheduling is used to calculate the priority in resource access (see the toy sketch below)
  - All slots are shared: no VO-dedicated resources, all nodes belong to a single big cluster
  - One group per VO, subgroups supported; a share (i.e. a resource quota) is assigned to each group in a hierarchical way
  - Priority is directly proportional to the share and inversely proportional to the historical resource usage
- MPI jobs are supported
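The toy sketch below only illustrates the fairshare principle stated above (priority proportional to the assigned share, inversely proportional to historical usage). It is not LSF's actual dynamic-priority formula; the group names, shares and usage figures are invented for the example.

```python
# Toy illustration of the fairshare idea: priority grows with the
# assigned share and shrinks with historical resource usage.
# NOT LSF's real formula; all numbers below are invented.
from dataclasses import dataclass

@dataclass
class VOGroup:
    name: str
    share: float            # quota assigned to the VO (arbitrary units)
    used_cpu_hours: float   # historical resource usage

def dynamic_priority(group: VOGroup) -> float:
    # +1 avoids division by zero for a VO that has not run anything yet
    return group.share / (1.0 + group.used_cpu_hours)

groups = [
    VOGroup("atlas", share=30, used_cpu_hours=12000),
    VOGroup("cms",   share=30, used_cpu_hours=4000),
    VOGroup("lhcb",  share=20, used_cpu_hours=500),
]
# The VO with the largest share and the smallest past usage goes first
for g in sorted(groups, key=dynamic_priority, reverse=True):
    print(f"{g.name}: priority {dynamic_priority(g):.5f}")
```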
CNAF Tier1 kSpecInt2000 History
- Before migration: 848 WNs, 2032 CPUs/cores, 3728 slots
- After migration: 452 WNs, 1380 CPUs/cores, 2230 slots
- 11 twin quad-core servers added: 476 WNs, 540 CPUs/cores, 2450 slots
- [Chart] Declared/available kSI2K over time, including the expected delivery (from the new tender); kSI2K monitoring
Network: General Layout
[Network layout diagram]
- WAN: 2x10 Gb/s to GARR (general purpose); dedicated 10 Gb/s LHC-OPN link; 10 Gb/s for T1-T1 traffic (except FZK) and for T1-T2 traffic; LHC-OPN CNAF-FZK and T0-T1 backup over a 10 Gb/s WAN path
- LAN: Extreme BlackDiamond 8810 core switches; Extreme Summit 400/450 rack switches with 2x1 Gb/s or 4x1 Gb/s uplinks for the worker nodes and 2x10 Gb/s towards the storage servers
- Storage: disk servers and CASTOR stagers attached over the SAN to Fibre Channel storage devices
- In case of network congestion: uplink upgrade from 4x1 Gb/s to 10 Gb/s or 2x10 Gb/s
Storage Classes at CNAF
- Implementation of the 3 storage classes needed for LHC (see the toy model below):
  - Disk0 Tape1 (D0T1): CASTOR (testing GPFS/TSM/StoRM). Space managed by the system; data are migrated to tape and deleted from disk when the staging area is full.
  - Disk1 Tape0 (D1T0): GPFS/StoRM. Space managed by the VO. Used by CMS, LHCb, ATLAS.
  - Disk1 Tape1 (D1T1): CASTOR (moving to GPFS/TSM/StoRM). Space managed by the VO (i.e. if the disk is full, the copy fails); a large disk buffer with a tape back end and no garbage collector.
- Deployment of an Oracle database infrastructure for the back ends of Grid applications
- Advanced backup service for both disk-based and database data: Legato, RMAN, TSM (in the near future)
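The following toy model, with invented capacities and file sizes, illustrates the difference between the D0T1 and D1T1 semantics described above: a system-managed cache that garbage-collects files already on tape versus a VO-managed disk copy whose writes simply fail when space runs out.

```python
# Toy model of the two tape-backed storage classes described above.
# D0T1: disk is a cache; when the staging area fills up, files already
#       migrated to tape are garbage-collected from disk.
# D1T1: the disk copy is permanent and VO-managed; if the disk is full
#       the copy fails (no garbage collector).
# Capacities and sizes are invented for the example.

class DiskPool:
    def __init__(self, capacity_gb, storage_class):
        self.capacity = capacity_gb
        self.used = 0
        self.storage_class = storage_class   # "D0T1" or "D1T1"
        self.files = {}                      # name -> [size_gb, on_tape]

    def write(self, name, size_gb):
        if self.used + size_gb > self.capacity:
            if self.storage_class == "D0T1":
                self._garbage_collect(size_gb)
            else:  # D1T1: space is managed by the VO, the copy fails
                raise IOError(f"{name}: disk full, copy fails ({self.storage_class})")
        if self.used + size_gb > self.capacity:
            raise IOError(f"{name}: no migrated files left to evict")
        self.files[name] = [size_gb, False]
        self.used += size_gb

    def migrate_to_tape(self, name):
        # a tape copy now exists; the disk copy becomes evictable (D0T1)
        self.files[name][1] = True

    def _garbage_collect(self, needed_gb):
        for name, (size, on_tape) in list(self.files.items()):
            if self.used + needed_gb <= self.capacity:
                break
            if on_tape:                      # only evict files safely on tape
                del self.files[name]
                self.used -= size
```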
CASTOR Deployment
- ~40 disk servers attached to a SAN, full redundancy: FC 2 Gb/s or 4 Gb/s connections (dual-controller hardware with Qlogic SANsurfer Path Failover software, or vendor-specific software)
- Disk hardware: STK FlexLine 600, IBM FAStT900
- Core services run on machines with SCSI disks, hardware RAID1 and redundant power supplies; tape servers and disk servers have lower-level hardware, like the WNs
- 15 tape servers; ACSLS 7.0 running on a Sun Blade V100 (2 internal IDE disks in software RAID1, OS Solaris 9.0)
- STK L5500 silo (5500 slots, 200 GB cartridges, capacity ~1.1 PB), 16 tape drives
- 3 Oracle databases (DLF, Stager, Nameserver), LSF plug-in for scheduling
- SRM v2 (2 front ends), SRM v1 (being phased out)
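As a quick sanity check, the quoted ~1.1 PB library capacity follows directly from 5500 slots of 200 GB cartridges:

```python
# Capacity check for the STK L5500 figures quoted above (decimal units).
slots, cartridge_gb = 5500, 200
print(slots * cartridge_gb / 1e6, "PB")   # -> 1.1 PB
```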
Storage Evolution
- Previous tests demonstrated weaknesses in CASTOR's behaviour; even though some issues are now solved, we want to investigate and deploy an alternative implementation of the D1T1 and D0T1 storage classes
- Great expectations come from the use of TSM together with GPFS and StoRM
- Ongoing GPFS/TSM/StoRM integration tests:
  - StoRM needs to be modified to support DxT1; some non-trivial modifications are required for D0T1
  - A short-term solution for D1T1, based on customized scripts, has been successfully tested by LHCb (an illustrative sketch follows below)
  - The solution for D0T1 is much more complicated and is currently under test
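For illustration only, here is a minimal sketch of what such a customized D1T1 script could look like: new files landing on a GPFS area receive an extra tape copy through the TSM client while the disk copy stays in place. It assumes the TSM backup-archive client (dsmc) is installed, uses a hypothetical path, and is not the actual script tested by LHCb.

```python
# Illustrative D1T1-style flow (not the actual CNAF scripts): archive new
# GPFS files to TSM without touching the disk copy. Assumes the TSM
# backup-archive client ("dsmc") is available; paths are placeholders.
import subprocess
from pathlib import Path

GPFS_AREA = Path("/gpfs/lhcb/d1t1")   # hypothetical D1T1 area
archived = set()                      # in a real setup this state lives in a DB

def archive_new_files():
    for f in GPFS_AREA.rglob("*"):
        if f.is_file() and f not in archived:
            # "dsmc archive" stores a copy in TSM, leaving the file on disk
            subprocess.run(["dsmc", "archive", str(f)], check=True)
            archived.add(f)

archive_new_files()
```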
Why StoRM and GPFS/TSM?
- StoRM is a Grid-enabled Storage Resource Manager (SRM v2.2); it allows Grid applications to interact with storage resources through standard POSIX calls
- GPFS 3.2 is the IBM high-performance cluster file system:
  - Greatly reduced administrative overhead
  - Redundancy at the level of I/O-server failure
  - HSM support and ILM features in both GPFS and TSM permit building a very efficient solution
  - GPFS has demonstrated robustness and high performance, and showed better performance in a SAN environment when compared to the CASTOR, dCache and Xrootd solutions
  - Long experience at CNAF (>3 years): ~27 GPFS file systems in production (~720 net TB), mounted on all farm WNs
- TSM is a high-performance backup/archiving solution from IBM:
  - TSM 5.5 implements HSM
  - Also used in the HEP world (e.g. FZK, NDGF, and CERN for backup)
GPFS Deployment Evolution
- Started from a single cluster: all WNs and I/O nodes in one cluster; some manageability problems were observed
- Then separated the server cluster from the WN cluster; access to a remote cluster's file system has proven to be as efficient as access to a local one
- Decided to also separate out the cluster with the HSM back end
Oracle Database Service
- Main goals: high availability, scalability, reliability
- Achieved through a modular architecture based on the following building blocks:
  - Oracle ASM for storage management: redundancy and striping implemented in an Oracle-oriented way
  - Oracle Real Application Cluster (RAC): the database is shared across several nodes, with failover and load-balancing capabilities
  - Oracle Streams: geographical data redundancy
- 32 servers, 19 of them configured in 7 clusters; 40 database instances
- Storage: 5 TB (20 TB raw)
- Availability rate: 98.7% in 2007, where Availability (%) = Uptime / (Uptime + Target Downtime + Agent Downtime) (worked example below)
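A worked example of the availability formula above, using assumed downtime figures (not the actual 2007 numbers) that reproduce an availability of about 98.7% over one year:

```python
# Availability = Uptime / (Uptime + Target Downtime + Agent Downtime)
# Downtime figures below are illustrative assumptions, not the 2007 data.
uptime_h = 8760 - 114          # hours in a year minus assumed total downtime
target_downtime_h = 90         # planned (target) downtime, assumed
agent_downtime_h = 24          # unplanned downtime seen by the agent, assumed

availability = uptime_h / (uptime_h + target_downtime_h + agent_downtime_h)
print(f"{availability:.1%}")   # -> 98.7%
```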
Backup
- At present, backup to tape is based on Legato Networker 3.3
- Database on-line backup through RMAN; one copy is also stored on tape via the Legato-RMAN plug-in
- Future migration to IBM TSM is foreseen:
  - Certified interoperability between GPFS and TSM
  - TSM provides not only backup and archiving but also migration capabilities
  - Possible to exploit TSM migration in order to implement the D1T1 and D0T1 storage classes (StoRM/GPFS/TSM integration)
Conclusions
- INFN Tier1 is undergoing a major infrastructural improvement which will allow it to fully meet the experiments' requirements for LHC
- The farming and network services are already well consolidated and can grow in computing capacity and network bandwidth without deep structural modifications
- The storage service has achieved a good degree of stability; the remaining issues are mainly due to the implementation of the D0T1 and D1T1 storage classes
- The integration of StoRM, GPFS and TSM is under development and promises to be a definitive solution for the outstanding problems