INFN – Tier1 Site Status Report
Vladimir Sapunenko, on behalf of the Tier1 staff
HEPiX, CERN, 6 May 2008
Overview
- Introduction
- Infrastructural expansion
- Farming
- Network
- Storage and databases
- Conclusions
Introduction
- Location: INFN-CNAF, Bologna; ~1000 m2 hall in the basement (floor -2)
- Multi-experiment Tier-1 (~20 VOs, including the LHC experiments, CDF, BABAR and others)
- Participating in the LCG, EGEE and INFNGRID projects
- One of the main nodes of the GARR network
- In a nutshell:
  - about 3 MSI2K with ~2000 CPUs/cores, to be expanded to ~9 MSI2K by June
  - about 1 PB of disk space (tender for a further 1.6 PB)
  - 1 PB tape library (additional 10 PB tape library by Q2 2008)
  - Gigabit Ethernet network, with some 10 Gigabit links on the LAN and WAN
- Resources are assigned to experiments on a yearly basis
Infrastructural Expansion
- Expansion of the electrical power and cooling systems; work is in progress right now
- Main constraints:
  - Dimensional: limited room height (max 260 cm), so no floating floor
  - Environmental: noise and electromagnetic insulation required, due to the proximity of offices and classrooms
- Key aspects: reliability, redundancy, maintainability
  - 2 (+1) transformers, 2500 kVA each (~2000 kW each)
  - 2 rotating (no-break) electric generators (EG), i.e. UPS+EG in one machine to save room, 1700 kVA each (~1360 kW each), to be integrated with the 1250 kVA EG already installed
  - 2 independent electric lines feeding each row of racks: 2n redundancy
  - 7 chillers: ~2 MW at T_air = 40 °C; 2 (+2) pumps for chilled-water circulation
  - High-capacity precision air-conditioning units (50 kW each) inside the high-density islands (APC)
  - Air treatment and conditioning units (30 kW each) outside the high-density islands (UTA and UTL)
The Tier1 in 2009 (floor -2)
[Floor plan] Labels: room 1 (migration, then storage); room 2 (farming); electric panelboard room; sites for chilled-water piping from/to floor -1 (chillers); sites not involved in the expansion.
- Remote control of the systems' critical points
Farming
- The farming service maintains all computing resources and provides access to them. Main aspects:
  - Automatic unattended installation of all nodes via Quattor
  - Advanced job scheduling via the LSF scheduler
  - Customized monitoring and accounting via RedEye, developed in-house
  - Remote control of all computing resources via KVM and IPMI, plus customized scripts
- Resources can be accessed in two ways:
  - Grid: the preferred solution, through a so-called "User Interface" node. Requires setting up a VO. Secure: x509 certificates are used for authentication/authorization.
  - Direct access to LSF: discouraged. Faster; you simply need a UNIX account on a front-end machine (see the sketch below). Limited to the Tier1 only, and insecure.
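Purely as an illustration of the direct-LSF path (not a recommendation, since it is discouraged above), the sketch below wraps a plain bsub submission in Python. The queue name, payload script and log file are hypothetical placeholders, and it assumes the LSF client tools are available on the front-end machine.

```python
# Minimal sketch of the "direct LSF" access path described above.
# Assumes the LSF client (bsub) is in PATH on the front-end machine;
# queue name, script and log file below are placeholders.
import subprocess

def submit_direct(queue: str, script: str, logfile: str) -> None:
    """Submit a job straight to LSF, bypassing the Grid layer."""
    subprocess.run(
        ["bsub", "-q", queue, "-o", logfile, script],
        check=True,
    )

submit_direct("babar", "./run_analysis.sh", "job_%J.out")
```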
Node Installation and Configuration
- Quattor is a CERN-developed toolkit for automatic unattended installation:
  - Kickstart initial installation (Red Hat)
  - Quattor takes over after the first reboot
  - Nodes are configured according to administrator requirements
  - Very powerful: allows per-node customizations, but can also easily install 1000 nodes with the same configuration in 1 hour
- Currently we only support Linux
  - The HEP community chose Scientific Linux; version currently deployed at CNAF: 4.x
  - Identical to Red Hat AS, good hardware support, big software library available on-line
  - Supported Grid middleware is gLite 3.1
  - We install the SL CERN release, which adds some useful customizations
Job Scheduling
- Job scheduling is done via the LSF scheduler: 848 WNs, 2032 CPUs/cores, 3728 slots
- Queue abstraction: one job queue per VO, no time-oriented queues
  - Each experiment submits jobs only to its own queue
  - Resource utilization limits are set on a per-queue basis
- Hierarchical fairshare scheduling is used to calculate the priority in resource access (see the toy sketch below)
  - All slots are shared: no VO-dedicated resources, all nodes belong to a single big cluster
  - One group per VO, subgroups supported; a share (i.e. a resource quota) is assigned to each group in a hierarchical way
  - Priority is directly proportional to the share and inversely proportional to the historical resource usage
- MPI jobs are supported
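The toy sketch below only illustrates the fairshare principle stated above (priority proportional to the assigned share, inversely proportional to historical usage). It is not LSF's actual dynamic-priority formula; the group names, shares and usage figures are invented for the example.

```python
# Toy illustration of the fairshare idea: priority grows with the
# assigned share and shrinks with historical resource usage.
# NOT LSF's real formula; all numbers below are invented.
from dataclasses import dataclass

@dataclass
class VOGroup:
    name: str
    share: float            # quota assigned to the VO (arbitrary units)
    used_cpu_hours: float   # historical resource usage

def dynamic_priority(group: VOGroup) -> float:
    # +1 avoids division by zero for a VO that has not run anything yet
    return group.share / (1.0 + group.used_cpu_hours)

groups = [
    VOGroup("atlas", share=30, used_cpu_hours=12000),
    VOGroup("cms",   share=30, used_cpu_hours=4000),
    VOGroup("lhcb",  share=20, used_cpu_hours=500),
]
# The VO with the largest share and the smallest past usage goes first
for g in sorted(groups, key=dynamic_priority, reverse=True):
    print(f"{g.name}: priority {dynamic_priority(g):.5f}")
```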
CNAF Tier1 kSpecInt2000 History
- Before migration: 848 WNs, 2032 CPUs/cores, 3728 slots
- After migration: 452 WNs, 1380 CPUs/cores, 2230 slots
- 11 twin quad-core servers added: 476 WNs, 540 CPUs/cores, 2450 slots
- [Chart] Declared/available kSI2K over time, including the expected delivery (from the new tender); kSI2K monitoring
Network: General Layout
[Network layout diagram]
- WAN: 2x10 Gb/s to GARR (general purpose); dedicated 10 Gb/s LHC-OPN link; 10 Gb/s for T1-T1 traffic (except FZK) and for T1-T2 traffic; LHC-OPN CNAF-FZK and T0-T1 backup over a 10 Gb/s WAN path
- LAN: Extreme BlackDiamond 8810 core switches; Extreme Summit 400/450 rack switches with 2x1 Gb/s or 4x1 Gb/s uplinks for the worker nodes and 2x10 Gb/s towards the storage servers
- Storage: disk servers and CASTOR stagers attached over the SAN to Fibre Channel storage devices
- In case of network congestion: uplink upgrade from 4x1 Gb/s to 10 Gb/s or 2x10 Gb/s
Storage Classes at CNAF
- Implementation of the 3 storage classes needed for LHC (see the toy model below):
  - Disk0 Tape1 (D0T1): CASTOR (testing GPFS/TSM/StoRM). Space managed by the system; data are migrated to tape and deleted from disk when the staging area is full.
  - Disk1 Tape0 (D1T0): GPFS/StoRM. Space managed by the VO. Used by CMS, LHCb, ATLAS.
  - Disk1 Tape1 (D1T1): CASTOR (moving to GPFS/TSM/StoRM). Space managed by the VO (i.e. if the disk is full, the copy fails); a large disk buffer with a tape back end and no garbage collector.
- Deployment of an Oracle database infrastructure for the back ends of Grid applications
- Advanced backup service for both disk-based and database data: Legato, RMAN, TSM (in the near future)
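The following toy model, with invented capacities and file sizes, illustrates the difference between the D0T1 and D1T1 semantics described above: a system-managed cache that garbage-collects files already on tape versus a VO-managed disk copy whose writes simply fail when space runs out.

```python
# Toy model of the two tape-backed storage classes described above.
# D0T1: disk is a cache; when the staging area fills up, files already
#       migrated to tape are garbage-collected from disk.
# D1T1: the disk copy is permanent and VO-managed; if the disk is full
#       the copy fails (no garbage collector).
# Capacities and sizes are invented for the example.

class DiskPool:
    def __init__(self, capacity_gb, storage_class):
        self.capacity = capacity_gb
        self.used = 0
        self.storage_class = storage_class   # "D0T1" or "D1T1"
        self.files = {}                      # name -> [size_gb, on_tape]

    def write(self, name, size_gb):
        if self.used + size_gb > self.capacity:
            if self.storage_class == "D0T1":
                self._garbage_collect(size_gb)
            else:  # D1T1: space is managed by the VO, the copy fails
                raise IOError(f"{name}: disk full, copy fails ({self.storage_class})")
        if self.used + size_gb > self.capacity:
            raise IOError(f"{name}: no migrated files left to evict")
        self.files[name] = [size_gb, False]
        self.used += size_gb

    def migrate_to_tape(self, name):
        # a tape copy now exists; the disk copy becomes evictable (D0T1)
        self.files[name][1] = True

    def _garbage_collect(self, needed_gb):
        for name, (size, on_tape) in list(self.files.items()):
            if self.used + needed_gb <= self.capacity:
                break
            if on_tape:                      # only evict files safely on tape
                del self.files[name]
                self.used -= size
```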
CASTOR Deployment
- ~40 disk servers attached to a SAN, full redundancy: FC 2 Gb/s or 4 Gb/s connections (dual-controller hardware with Qlogic SANsurfer Path Failover software, or vendor-specific software)
- Disk hardware: STK FlexLine 600, IBM FAStT900
- Core services run on machines with SCSI disks, hardware RAID1 and redundant power supplies; tape servers and disk servers have lower-level hardware, like the WNs
- 15 tape servers; ACSLS 7.0 running on a Sun Blade V100 (2 internal IDE disks in software RAID1, OS Solaris 9.0)
- STK L5500 silo (5500 slots, 200 GB cartridges, capacity ~1.1 PB), 16 tape drives
- 3 Oracle databases (DLF, Stager, Nameserver), LSF plug-in for scheduling
- SRM v2 (2 front ends), SRM v1 (being phased out)
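As a quick sanity check, the quoted ~1.1 PB library capacity follows directly from 5500 slots of 200 GB cartridges:

```python
# Capacity check for the STK L5500 figures quoted above (decimal units).
slots, cartridge_gb = 5500, 200
print(slots * cartridge_gb / 1e6, "PB")   # -> 1.1 PB
```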
Storage Evolution
- Previous tests demonstrated weaknesses in CASTOR's behaviour; even though some issues are now solved, we want to investigate and deploy an alternative implementation of the D1T1 and D0T1 storage classes
- Great expectations come from the use of TSM together with GPFS and StoRM
- Ongoing GPFS/TSM/StoRM integration tests:
  - StoRM needs to be modified to support DxT1; some non-trivial modifications are required for D0T1
  - A short-term solution for D1T1, based on customized scripts, has been successfully tested by LHCb (an illustrative sketch follows below)
  - The solution for D0T1 is much more complicated and is currently under test
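For illustration only, here is a minimal sketch of what such a customized D1T1 script could look like: new files landing on a GPFS area receive an extra tape copy through the TSM client while the disk copy stays in place. It assumes the TSM backup-archive client (dsmc) is installed, uses a hypothetical path, and is not the actual script tested by LHCb.

```python
# Illustrative D1T1-style flow (not the actual CNAF scripts): archive new
# GPFS files to TSM without touching the disk copy. Assumes the TSM
# backup-archive client ("dsmc") is available; paths are placeholders.
import subprocess
from pathlib import Path

GPFS_AREA = Path("/gpfs/lhcb/d1t1")   # hypothetical D1T1 area
archived = set()                      # in a real setup this state lives in a DB

def archive_new_files():
    for f in GPFS_AREA.rglob("*"):
        if f.is_file() and f not in archived:
            # "dsmc archive" stores a copy in TSM, leaving the file on disk
            subprocess.run(["dsmc", "archive", str(f)], check=True)
            archived.add(f)

archive_new_files()
```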
Why StoRM and GPFS/TSM?
- StoRM is a Grid-enabled Storage Resource Manager (SRM v2.2); it allows Grid applications to interact with storage resources through standard POSIX calls
- GPFS 3.2 is the IBM high-performance cluster file system:
  - Greatly reduced administrative overhead
  - Redundancy at the level of I/O-server failure
  - HSM support and ILM features in both GPFS and TSM permit building a very efficient solution
  - GPFS has demonstrated robustness and high performance, and showed better performance in a SAN environment when compared to the CASTOR, dCache and Xrootd solutions
  - Long experience at CNAF (>3 years): ~27 GPFS file systems in production (~720 net TB), mounted on all farm WNs
- TSM is a high-performance backup/archiving solution from IBM:
  - TSM 5.5 implements HSM
  - Also used in the HEP world (e.g. FZK, NDGF, and CERN for backup)
GPFS Deployment Evolution
- Started from a single cluster: all WNs and I/O nodes in one cluster; some manageability problems were observed
- Then separated the server cluster from the WN cluster; access to a remote cluster's file system has proven to be as efficient as access to a local one
- Decided to also separate out the cluster with the HSM back end
Oracle Database Service
- Main goals: high availability, scalability, reliability
- Achieved through a modular architecture based on the following building blocks:
  - Oracle ASM for storage management: redundancy and striping implemented in an Oracle-oriented way
  - Oracle Real Application Cluster (RAC): the database is shared across several nodes, with failover and load-balancing capabilities
  - Oracle Streams: geographical data redundancy
- 32 servers, 19 of them configured in 7 clusters; 40 database instances
- Storage: 5 TB (20 TB raw)
- Availability rate: 98.7% in 2007, where Availability (%) = Uptime / (Uptime + Target Downtime + Agent Downtime) (worked example below)
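A worked example of the availability formula above, using assumed downtime figures (not the actual 2007 numbers) that reproduce an availability of about 98.7% over one year:

```python
# Availability = Uptime / (Uptime + Target Downtime + Agent Downtime)
# Downtime figures below are illustrative assumptions, not the 2007 data.
uptime_h = 8760 - 114          # hours in a year minus assumed total downtime
target_downtime_h = 90         # planned (target) downtime, assumed
agent_downtime_h = 24          # unplanned downtime seen by the agent, assumed

availability = uptime_h / (uptime_h + target_downtime_h + agent_downtime_h)
print(f"{availability:.1%}")   # -> 98.7%
```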
Backup
- At present, backup to tape is based on Legato Networker 3.3
- Database on-line backup through RMAN; one copy is also stored on tape via the Legato-RMAN plug-in
- Future migration to IBM TSM is foreseen:
  - Certified interoperability between GPFS and TSM
  - TSM provides not only backup and archiving but also migration capabilities
  - Possible to exploit TSM migration in order to implement the D1T1 and D0T1 storage classes (StoRM/GPFS/TSM integration)
Conclusions
- INFN Tier1 is undergoing a major infrastructural improvement which will allow it to fully meet the experiments' requirements for LHC
- The farming and network services are already well consolidated and can grow in computing capacity and network bandwidth without deep structural modifications
- The storage service has achieved a good degree of stability; the remaining issues are mainly due to the implementation of the D0T1 and D1T1 storage classes
- The integration of StoRM, GPFS and TSM is under development and promises to be a definitive solution for the outstanding problems