1
The Italian Tier-1: INFN-CNAF
11 October 2005
Luca dell'Agnello, Davide Salomoni
2
Introduction
Location: INFN-CNAF, Bologna (Italy)
– one of the main nodes of the GARR network
Computing facility for the INFN HENP community
– Participating in the LCG, EGEE and INFNGRID projects
Multi-experiment Tier-1
– LHC experiments
– VIRGO
– CDF
– BABAR
– AMS, MAGIC, ARGO, PAMELA, …
Resources assigned to experiments on a yearly basis
3
Infrastructure (1)
Hall in the basement (2nd basement level): ~ 1000 m2 of total space
– Easily accessible by lorries from the road
– Not suitable for office use (remote control needed)
New control and alarm systems under installation (including cameras to monitor the hall)
– Hall temperature
– Cold-water circuit temperature
– Electric power transformer temperature
Electric power system (1250 kVA); kVA-to-kW conversion sketched below
– UPS: 800 kVA (~ 640 kW); needs a separate, conditioned and ventilated room
  Not used for the conditioning system
– Electric generator: 1250 kVA (~ 1000 kW); supports up to 160 racks (~ 100 with 3.0 GHz Xeon)
– 220 V single-phase for computers
  4 x 16 A PDUs needed for 3.0 GHz Xeon racks
– 380 V three-phase for other devices (tape libraries, air conditioning, etc.)
– Expansion under evaluation
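The kVA and kW figures quoted above are consistent with a power factor of about 0.8. A minimal sketch of that conversion; the 0.8 value is an assumption inferred from the quoted numbers, not stated on the slide:

    # Convert apparent power (kVA) to usable real power (kW),
    # assuming a power factor of ~0.8 as implied by the slide's figures.
    POWER_FACTOR = 0.8  # assumed

    def kva_to_kw(kva, power_factor=POWER_FACTOR):
        """Real power available from a source rated in kVA."""
        return kva * power_factor

    print(kva_to_kw(800))   # UPS: ~640 kW
    print(kva_to_kw(1250))  # generator: ~1000 kW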
4
Infrastructure (2)
Cooling
– RLS units (Airwell) on the roof
  ~ 530 kW cooling power
  Water cooling
  Needs a "booster pump" (~ 20 m from the Tier-1 hall to the roof)
  Noise insulation needed on the roof
– 1 air conditioning unit (uses 20% of the RLS cooling power and controls humidity)
– 14 local cooling systems (Hiross) in the computing room, ~ 30 kW each
Main challenge is the electric power needed in 2010 (estimate sketched below)
– Presently Intel Xeon: 110 W/kSpecInt, with a quasi-linear increase in W/SpecInt
– New Opterons consume ~ 10% less
– Opteron dual core: a factor ~ 1.5-2 less?
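A back-of-the-envelope sketch of the 2010 power question, using the per-kSpecInt figures above; the 5000 kSI2k target capacity and the direct SpecInt-to-SI2k equivalence are illustration assumptions, not figures from the slide:

    # Rough farm power estimate from the W/kSpecInt figures quoted above.
    XEON_W_PER_KSI = 110                         # from the slide
    OPTERON_W_PER_KSI = XEON_W_PER_KSI * 0.9     # "~10% less"
    DUALCORE_W_PER_KSI = XEON_W_PER_KSI / 1.75   # "factor ~1.5-2 less" (midpoint)

    def farm_power_kw(capacity_ksi2k, w_per_ksi):
        """Estimated electrical power (kW) for a farm of the given capacity."""
        return capacity_ksi2k * w_per_ksi / 1000.0

    target = 5000  # kSI2k, hypothetical 2010 capacity
    for label, w in [("Xeon", XEON_W_PER_KSI),
                     ("Opteron", OPTERON_W_PER_KSI),
                     ("Opteron dual core", DUALCORE_W_PER_KSI)]:
        print(f"{label}: ~{farm_power_kw(target, w):.0f} kW")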
5
WN typical rack composition
– Power controls (3U)
– 1 network switch (1-2U)
  48 FE copper interfaces
  2 GE fibre uplinks
– ~ 36 1U WNs
  Connected to the network switch via FE
  Connected to the KVM system
(uplink oversubscription sketched below)
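With ~36 worker nodes on 100 Mb/s Fast Ethernet sharing 2 x 1 Gb/s uplinks, each rack's uplinks are oversubscribed by roughly 1.8:1. A minimal sketch of that calculation, using only the figures in the rack layout above:

    # Rack-level network oversubscription: edge bandwidth vs. uplink bandwidth.
    WN_PER_RACK = 36
    FE_MBPS = 100        # Fast Ethernet per worker node
    UPLINKS = 2
    GE_MBPS = 1000       # Gigabit Ethernet per uplink

    edge = WN_PER_RACK * FE_MBPS   # 3600 Mb/s of worker-node capacity
    uplink = UPLINKS * GE_MBPS     # 2000 Mb/s towards the core switch
    print(f"oversubscription ~ {edge / uplink:.1f}:1")   # ~1.8:1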
6
Remote console control
Paragon UTM8 (Raritan)
– 8 analog (UTP/fibre) output connections
– Supports up to 32 daisy chains of 40 nodes (UKVMSPD modules needed)
– IP-Reach (expansion to support IP transport) evaluated but not used
– Used to control WNs
Autoview 2000R (Avocent)
– 1 analog + 2 digital (IP transport) output connections
– Supports connections to up to 16 nodes
  Optional expansion to 16x8 nodes
– Compatible with Paragon ("gateway" to IP)
– Used to control servers
7
Power switches
2 models used:
– "Old": APC MasterSwitch Control Unit AP9224, controlling 3 x 8 outlets (9222 PDUs) from 1 Ethernet port
– "New": APC PDU Control Unit AP7951, controlling 24 outlets from 1 Ethernet port
  "Zero" rack units (vertical mount)
Access to the configuration/control menu via serial/telnet/web/SNMP
Dedicated machine running the APC Infrastructure Manager software
Permits remote switching off of resources in case of serious problems (SNMP sketch below)
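As an illustration of the remote switch-off capability, a minimal sketch that drives a single PDU outlet over SNMP by shelling out to Net-SNMP's snmpset. The host name, community string and the APC PowerNet OID used here are assumptions to be checked against the actual PDU configuration and MIB:

    import subprocess

    # Hypothetical PDU address and SNMP write community.
    PDU_HOST = "pdu-rack01.example.cnaf.infn.it"
    COMMUNITY = "private"

    # sPDUOutletCtl from the APC PowerNet MIB (verify against the MIB shipped
    # with the unit); 1 = on, 2 = off, 3 = reboot.
    OUTLET_CTL_OID = "1.3.6.1.4.1.318.1.1.4.4.2.1.3"

    def set_outlet(outlet, state):
        """Switch one PDU outlet on/off/reboot via Net-SNMP's snmpset."""
        oid = f"{OUTLET_CTL_OID}.{outlet}"
        subprocess.run(["snmpset", "-v1", "-c", COMMUNITY,
                        PDU_HOST, oid, "i", str(state)], check=True)

    # Example: power-cycle outlet 5 of this PDU.
    set_outlet(5, 3)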
8
Networking
Main network infrastructure based on optical fibres (~ 20 km)
LAN has a "classical" star topology with 2 core switches/routers (ER16, Black Diamond)
– Migration to a Black Diamond 10808 with 120 GE and 12 x 10 GE ports (it can scale up to 480 GE or 48 x 10 GE)
– Each CPU rack equipped with an FE switch with 2 x Gb uplinks to the core switch
– Disk servers connected via GE to the core switch (mainly fibre)
  Some servers concentrated over copper onto a dedicated switch
– VLANs defined across switches (802.1q)
30 rack switches (14 of them 10 Gb ready): several brands, homogeneous characteristics
– 48 copper Ethernet ports
– Support for the main standards (e.g. 802.1q)
– 2 Gigabit uplinks (optical fibres) to the core switch
CNAF interconnected to the GARR-G backbone at 1 Gbps + 2 x 1 Gbps for SC
– Giga-PoP co-located
– Upgrading the SC link to 10 Gbps
– New access router (Cisco 7600 with 4 x 10 GE and 4 x GE interfaces) under installation
9
LAN & WAN T1 connectivity (SC layout diagram): CNAF production network 192.135.23/24 (gateway 192.135.23.254); GARR reached via a 2 x 1 Gbps link aggregation (L1/L2) for SC plus a 1 Gbps general Internet link (L0); n x 1 Gbps links inside the T1 LAN, a CNAF backdoor, and a 10 Gbps link expected in October.
10
HW Resources …
CPU:
– 700 dual-processor boxes, 2.4-3 GHz (+ 70 servers)
– 150 new dual-processor Opteron boxes, 2.6 GHz (~ 2200 Euro each + VAT)
  1700 kSI2k total
– Decommissioning: ~ 100 WNs (~ 150 kSI2k) moved to the test farm
– Tender for 800 kSI2k (Summer 2006)
Disk:
– FC, IDE, SCSI, NAS technologies
– 470 TB raw (~ 430 TB FC-SATA)
  2005 tender: 200 TB raw (~ 2260 Euro/TB net + VAT)
– Tender for 400 TB (Summer 2006)
Tapes:
– STK L180: 18 TB
– STK 5500:
  6 LTO-2 drives with 2000 tapes: 400 TB
  2 9940B drives with 800 tapes: 160 TB
  (1.5 kEuro/TB, 0.35 kEuro/TB)
11
… Human Resources
~ 14 FTE available
– Farming service: 4 FTE
– Storage service: 5 FTE
– Logistics service: 2 FTE
– Network & Security service: 3 FTE
12
Farm(s)
1 general-purpose farm (~ 750 WNs, 1550 kSI2k)
– SLC 3.0.5, LCG 2.6, LSF
– Accessible both from the Grid and locally
– Also 16 InfiniBand WNs for MPI on a special queue
1 farm dedicated to CDF (~ 80 WNs)
– RH 7.3, Condor
– Migration to SLC and inclusion in the general farm planned
– CDF can also run on the general farm with glide-ins
Test farm (~ 100 WNs, 150 kSI2k)
13
Access to the batch system (diagram): jobs enter either through "legacy" non-Grid access or through Grid access via a UI; both paths reach the LSF-based CE, which dispatches jobs to the worker nodes (WN1 … WNn); data is accessed through the SE.
14
Farming tasks
Installation & management of Tier-1 WNs and servers
– Using Quattor for deployment & configuration of the OS & LCG middleware
– HW maintenance management
Management of the batch scheduler (LSF)
– Migration from torque+maui to LSF (v6.1) last Spring
  torque+maui apparently not scalable
  LSF farm running successfully
– Fair-share model for resource access
  1 queue per experiment (at least)
  Queue policies under evaluation
– Progressive inclusion of the CDF farm into the general one
Access to resources centrally managed with Kerberos (authentication) and LDAP (authorization)
– Group-based authorization
15
Authorization with LDAP (directory tree diagram): under c=it, a public view rooted at o=infn (AFS cell infn.it, ou=afs) holds generic INFN users and groups, while a private view rooted at o=cnaf holds CNAF users, groups, roles, automount maps and no-login accounts under ou=people, ou=group, ou=role, ou=automount and ou=people-nologin.
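A minimal sketch of how group-based authorization against a tree like the one above might be queried with python-ldap; the server URI, base DN and membership attribute are assumptions chosen to match the branches in the diagram, not the actual CNAF schema:

    import ldap

    # Hypothetical LDAP server and base DN matching the tree above.
    LDAP_URI = "ldap://ldap.cnaf.infn.it"
    GROUP_BASE = "ou=group,o=cnaf,c=it"

    def user_groups(uid):
        """Return the names of the CNAF groups a user belongs to."""
        conn = ldap.initialize(LDAP_URI)
        conn.simple_bind_s()  # anonymous bind; a real setup may require auth
        # memberUid is the usual posixGroup membership attribute (assumed here).
        results = conn.search_s(GROUP_BASE, ldap.SCOPE_ONELEVEL,
                                "(memberUid=%s)" % uid, ["cn"])
        return [attrs["cn"][0].decode() for _dn, attrs in results]

    if __name__ == "__main__":
        print(user_groups("someuser"))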
16
The queues (LSF bqueues output)

    QUEUE_NAME   PRIO STATUS       MAX JL/U JL/P JL/H NJOBS PEND  RUN SUSP
    dteam         200 Open:Active    -    -    -    -     0    0    0    0
    babar_test     80 Open:Active   10    -    -    -     0    0    0    0
    babar_build    80 Open:Active    -    3    -    -     0    0    0    0
    alice          40 Open:Active    -    -    -    -    37    0   37    0
    cdf            40 Open:Active    -    -    -    -   529   40  489    0
    atlas          40 Open:Active    -    -    -    -    27   10   17    0
    cms            40 Open:Active    -    -    -    -   128    1  127    0
    cms_align      40 Open:Active    -    -    -    -     0    0    0    0
    lhcb           40 Open:Active    -    -    -    -   594    6  588    0
    babar_xxl      40 Open:Active   10    3    -    -     0    0    0    0
    babar_objy     40 Open:Active  175  120    -    -   228  108  120    0
    babar          40 Open:Active   60   40    -    -     0    0    0    0
    virgo          40 Open:Active    -    -    -    -     0    0    0    0
    argo           40 Open:Active    -    -    -    -   299    0  299    0
    magic          40 Open:Active    -   50    -    -     0    0    0    0
    ams            40 Open:Active   50    -    -    -     0    0    0    0
    infngrid       40 Open:Active    -    -    -    -     1    0    1    0
    pamela         40 Open:Active    -    -    -    -     0    0    0    0
    quarto         40 Open:Active    -    -    -    -    17    0   17    0
    guest          40 Open:Active    -    -    -    -     0    0    0    0
    test           40 Open:Active    -    -    -    -     0    0    0    0
    geant4         30 Open:Active    -    -    -    -     0    0    0    0
    biomed         10 Open:Active    -    -    -    -     0    0    0    0
    pps            10 Open:Active    -    -    -    -     0    0    0    0
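A minimal sketch of how output like the table above could be summarized per queue, e.g. for accounting or monitoring pages; it shells out to LSF's bqueues (assumed to be on the PATH) and simply splits the fixed columns:

    import subprocess

    def queue_summary():
        """Parse `bqueues` output into {queue: (njobs, pend, run)}."""
        out = subprocess.run(["bqueues"], capture_output=True,
                             text=True, check=True).stdout
        summary = {}
        for line in out.splitlines()[1:]:          # skip the header line
            fields = line.split()
            if len(fields) < 11:
                continue
            name = fields[0]
            njobs, pend, run = (int(x) for x in fields[7:10])
            summary[name] = (njobs, pend, run)
        return summary

    if __name__ == "__main__":
        for queue, (njobs, pend, run) in sorted(queue_summary().items()):
            print(f"{queue:15s} total={njobs:5d} pending={pend:5d} running={run:5d}")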
17
Farm usage (plots): CPU total time and total wall-clock time.
18
Farm usage
19
Tier1 Database (1)
Resource database and management interface
– PostgreSQL database as back end
– Web interface (apache + mod_ssl + php)
– Hardware characteristics of servers
– Software configuration of servers
– CPU allocation
Interoperability of other applications with the DB
– Monitoring/accounting system
– Nagios
Interface to configure switches and interoperate with Quattor
– VLAN tags
– Dynamic DNS
– DHCP
(configuration generation sketched below)
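To illustrate the kind of interoperation described above, a minimal sketch that reads host records from the resource database and emits dhcpd.conf host stanzas; the table and column names (hosts, hostname, mac, ip) and the connection parameters are assumptions, not the actual Tier-1 schema:

    import psycopg2  # any PostgreSQL driver would do

    # Hypothetical connection string and schema for the resource database.
    DSN = "dbname=tier1db host=dbserver user=readonly"

    def dhcp_host_stanzas():
        """Yield dhcpd.conf 'host' entries generated from the database."""
        conn = psycopg2.connect(DSN)
        cur = conn.cursor()
        cur.execute("SELECT hostname, mac, ip FROM hosts ORDER BY hostname")
        for hostname, mac, ip in cur.fetchall():
            yield ("host %s {\n"
                   "  hardware ethernet %s;\n"
                   "  fixed-address %s;\n"
                   "}\n" % (hostname, mac, ip))
        conn.close()

    if __name__ == "__main__":
        with open("dhcpd.hosts.conf", "w") as f:
            f.writelines(dhcp_host_stanzas())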
20
Tier1 Database (2)
21
Storage & Database
Tasks
– DISK (SAN, NAS): HW/SW installation and maintenance, remote (Grid SE) and local (rfiod/NFS/GPFS) access services, clustered/parallel filesystem tests, participation in SC
  2 SAN systems (~ 225 TB)
  4 NAS systems (~ 60 TB)
– CASTOR HSM system: HW/SW installation and maintenance, GridFTP and SRM access services
  STK library with 6 LTO-2 and 2 9940B drives (+ 4 to be installed)
  2000 LTO-2 (200 GB) tapes
  800 9940B (200 GB) tapes
– DB (Oracle for CASTOR & RLS tests, Tier-1 "global" hardware DB)
22
Storage status
Physical access to the main storage (FAStT900) via SAN
– Level-1 disk servers connected via FC
  Usually also in a GPFS cluster
  Ease of administration
  Load balancing and redundancy
  Lustre under evaluation
– There can also be level-2 disk servers, connected to the storage only via GPFS
  LCG and FC dependencies on the OS are decoupled
– WNs are not members of the GPFS cluster (no scalability to a large number of WNs)
– Storage available to WNs via rfio, xrootd (BABAR only), GridFTP/SRM or NFS (software distribution only); rfio sketch below
CASTOR HSM system (SRM interface)
– STK library with 6 LTO-2 and 2 9940B drives (+ 4 to be installed)
  1200 LTO-2 (200 GB) tapes
  680 9940B (200 GB) tapes
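As an illustration of the rfio access path mentioned above, a minimal sketch that copies a worker-node output file into CASTOR by shelling out to rfcp; the CASTOR directory shown is a hypothetical placeholder, and the real experiment areas and staging configuration differ:

    import subprocess

    # Hypothetical CASTOR name-space path; real experiment areas differ.
    CASTOR_DIR = "/castor/cnaf.infn.it/user/someuser"

    def rfcp_to_castor(local_path, remote_name):
        """Copy a local file into CASTOR using the rfio copy command rfcp."""
        subprocess.run(["rfcp", local_path,
                        "%s/%s" % (CASTOR_DIR, remote_name)], check=True)

    if __name__ == "__main__":
        rfcp_to_castor("output.root", "output.root")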
23
Summary & Conclusions
INFN Tier-1 startup was in 2001; the start-up phase ended in 2004
During 2004 the INFN Tier-1 began to ramp up towards LHC
– Some experiments (e.g. BABAR, CDF) are already in the data-taking phase
A lot of work is still needed…
– Infrastructural issues
– Consolidation
– Technological uncertainties
– Management
– Customer requests!