1
INFN Tier1 Status Report – Spring HEPiX 2005 – Andrea Chierici, INFN-CNAF
2
Introduction
Location: INFN-CNAF, Bologna (Italy)
– one of the main nodes of the GARR network
Computing facility for the INFN HENP community
– participating in the LCG, EGEE and INFNGRID projects
Multi-experiment Tier1
– LHC experiments
– VIRGO
– CDF
– BABAR
– AMS, MAGIC, ARGO, ...
Resources assigned to experiments on a yearly plan
3
Services
Computing servers (CPU farms)
Access to on-line data (disks)
Mass storage / tapes
Broad-band network access
System administration
Database administration
Experiment-specific library software
Coordination with Tier0, other Tier1s and Tier2s
4
Infrastructure
Hall in the basement (floor -2): ~1000 m² of total space
– easily accessible by lorries from the road
– not suitable for office use (remote control)
Electric power
– 220 V single-phase for computers: 4 x 16 A PDUs needed for a rack of 3.0 GHz Xeons
– 380 V three-phase for other devices (tape libraries, air conditioning, etc.)
– UPS: 800 kVA (~640 kW), needs a separate, conditioned and ventilated room
– Electric generator: 1250 kVA (~1000 kW), enough for up to 160 racks (~100 with 3.0 GHz Xeons)
5
HW Resources
CPU:
– 320 old dual-processor boxes, 0.8-2.4 GHz
– 350 new dual-processor boxes, 3 GHz (+70 servers, +55 BaBar, +48 CDF, +30 LHCb)
– 1300 kSI2k in total
Disk:
– FC, IDE, SCSI, NAS
Tapes:
– STK L180: 18 TB
– STK 5500: 6 LTO-2 drives with 2000 tapes (400 TB); 2 9940B drives with 800 tapes (160 TB)
Networking:
– 30 rack switches, 46 FE UTP + 2 GE FO each
– 2 core switches, 96 GE FO + 120 GE FO + 4x10 GE
– 2x1 Gbps links to the WAN
6
Networking
GARR-G backbone with 2.5 Gbps fibre optics; will be upgraded to dark fibre (Q3 2005)
INFN Tier1 access is now 1 Gbps (+1 Gbps for the Service Challenge) and will be 10 Gbps soon (September 2005)
– the GARR GigaPoP is co-located with INFN-CNAF
International connectivity via GÉANT: 10 Gbps access in Milan already in place
7
CNAF Network Setup
Diagram: CNAF network layout – GARR Italian research network; core devices ER16, SSR 8860 and Black Diamond (BD); Summit 400 dedicated to the Service Challenge (1 Gb/s dedicated link, 10 Gb/s and n x 10 Gb/s internal links); farm and rack switches (FarmSW1-4, FarmSWG1/G3, LHCBSW1, ServSW2, Catalyst 3550s, SW-xx-yy); CNAF internal services at 1 Gb/s; 1 Gb/s production link and 1 Gb/s back door.
8
Tier1 LAN
Each CPU rack (36 WNs) is equipped with a FE switch with 2x1 Gb uplinks to the core switch (see the estimate below)
Disk servers connected via GE to the core switch
Foreseen upgrade to Gb rack switches
– 10 Gb core switch already installed
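As a rough, illustrative estimate (not part of the slides), the rack numbers above translate into an uplink oversubscription ratio; the figures assume the 2x1 Gb/s uplinks are kept when the rack switches move to Gigabit Ethernet.

```python
# Illustrative oversubscription estimate: worst-case ratio of aggregate
# worker-node bandwidth in a rack to the rack switch uplink capacity.
def oversubscription(n_nodes, nic_mbps, uplink_mbps):
    return n_nodes * nic_mbps / uplink_mbps

# Current setup: 36 WNs on Fast Ethernet, 2 x 1 Gb/s uplinks per rack switch.
print(oversubscription(36, 100, 2000))    # 1.8
# Hypothetical GE rack switch keeping the same 2 x 1 Gb/s uplinks.
print(oversubscription(36, 1000, 2000))   # 18.0
```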
9
LAN model layout
Diagram: SAN, STK tape library, NAS and farm racks attached to the Tier1 LAN.
10
Networking resources
30 switches (14 of them 10 Gb ready)
3 core switches/routers (SSR8600, ER16, BD)
– the SSR8600 is also the WAN access router, with firewalling functions
– a new Black Diamond 10808 is already installed with 120 GE and 12x10 GE ports (it can scale up to 480 GE or 48x10 GE)
– a new access router (with 4x10 GE and 4xGE interfaces) will replace the SSR8600 as WAN access; the ER16 and the Black Diamond will aggregate all the Tier1 resources
3 L2/L3 switches (48xGE and 2x10 GE) to be used during the Service Challenge
11
Farming
Team composition
– 2 FTE for the general-purpose farm
– 1 FTE for the hosted farms (CDF and Babar)
– ~3 FTE is clearly not enough to manage ~800 WNs: more people needed
Tasks
– Installation & management of Tier1 WNs and servers, using Quattor (still some legacy LCFGng nodes around)
– Deployment & configuration of the OS & LCG middleware
– HW maintenance management
– Management of the batch scheduler (LSF, Torque)
12
Access to the Batch System
Diagram: both "legacy" non-Grid access and Grid access (Grid → UI → CE) reach the LSF batch system, which dispatches jobs to the worker nodes (WN1 ... WNn); an SE provides storage access.
13
Farming: where we were last year
Computing resources not fully used
– the batch system (Torque+Maui) proved not to be scalable
– the present policy assignment does not allow optimal use of resources
Interoperability issues
– full experiment integration still to be achieved
– difficult to deal with 3 different farms
14
Farming: evolution (1)
Migration of the whole farm to SLC 3.0.4 (CERN version!) almost complete
– Quattor deployment successful (more than 500 WNs)
– standard configuration of WNs for all experiments
– to be completed in a couple of weeks
Migration from Torque+Maui to LSF (v6.0)
– LSF farm running successfully
– process to be completed together with the OS migration
– fair-share model for resource access
– progressive inclusion of BABAR & CDF WNs into the general farm
15
Farming: evolution (2)
LCG upgrade to the 2.4.0 release
– installation via Quattor (project-lcg-gdb-quattor-wg@cern.ch)
– dropped LCFGng
– different packaging from YAIM
– upgrade deployed to 500 nodes in one day
– still some problems with VOMS integration to be investigated
– 1 legacy LCG 2.2.0 CE still running
Access to resources centrally managed with Kerberos (authentication) and LDAP (authorization)
16
Authorization with LDAP
Diagram: LDAP authorization tree rooted at c=it / o=infn / ou=cnaf, with branches ou=cr, ou=grid (ou=wn, ou=ui), ou=local, ou=private, ou=public and ou=afs (AFS cell infn.it); entries cover generic CNAF users, local users, grid users (pool accounts) on the worker nodes and user interfaces, AFS/INFN users, and external access through a bastion host.
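A minimal sketch of what an authorization lookup against a tree like this could look like, using the Python ldap3 library; the LDAP host, base DN and group schema below are illustrative assumptions, not the actual CNAF configuration.

```python
from ldap3 import Server, Connection, ALL

# Hypothetical names: server address, base DN and group attributes are
# assumptions for illustration, not the real CNAF setup.
server = Server("ldap.cnaf.infn.it", get_info=ALL)
conn = Connection(server, auto_bind=True)  # anonymous bind, for the example only

def authorized_on_wn(uid):
    """Return True if `uid` belongs to a group granting access to the
    worker nodes (the ou=wn branch sketched on this slide)."""
    conn.search(
        search_base="ou=wn,ou=grid,ou=cr,ou=cnaf,o=infn,c=it",
        search_filter=f"(&(objectClass=posixGroup)(memberUid={uid}))",
        attributes=["cn"],
    )
    return bool(conn.entries)

print(authorized_on_wn("atlas001"))  # e.g. a pool account
```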
17
Some numbers (LSF output)
QUEUE_NAME  PRIO STATUS       MAX  JL/U JL/P JL/H NJOBS PEND RUN  SUSP
alice       40   Open:Active  -    -    -    -    50    0    50   0
cdf         40   Open:Active  -    -    -    -    840   152  688  0
dteam       40   Open:Active  -    -    -    -    0     0    0    0
atlas       40   Open:Active  -    -    -    -    26    12   14   0
cms         40   Open:Active  -    -    -    -    0     0    0    0
lhcb        40   Open:Active  -    -    -    -    7     7    0    0
babar_test  40   Open:Active  300  300  -    -    302   2    300  0
babar       40   Open:Active  20   20   -    -    2073  2060 13   0
virgo       40   Open:Active  -    -    -    -    0     0    0    0
argo        40   Open:Active  -    -    -    -    0     0    0    0
magic       40   Open:Active  -    -    -    -    0     0    0    0
ams         40   Open:Active  -    -    -    -    136   0    136  0
infngrid    40   Open:Active  -    -    -    -    0     0    0    0
guest       40   Open:Active  -    -    -    -    0     0    0    0
test        40   Open:Active  -    -    -    -    0     0    0    0
1200 jobs running!
18
Storage & Database
Team composition
– ~1.5 FTE for general storage
– ~1 FTE for CASTOR
– ~1 FTE for databases
Tasks
– Disk (SAN, NAS): HW/SW installation and maintenance, remote (grid SE) and local (rfiod/NFS/GPFS) access services, clustered/parallel file system tests, participation in the Service Challenge
  2 SAN systems (~225 TB), 4 NAS systems (~60 TB)
– CASTOR HSM system: HW/SW installation and maintenance, GridFTP and SRM access services
  STK library with 6 LTO-2 and 2 9940B drives (+4 drives to install)
  1200 LTO-2 (200 GB) tapes
  680 9940B (200 GB) tapes
– DB (Oracle for CASTOR & RLS tests, Tier1 "global" hardware DB)
19
Storage setup
Physical access to the main storage (FAStT900) via SAN
– level-1 disk servers connected via FC, usually also in a GPFS cluster
  ease of administration
  load balancing and redundancy
  Lustre under evaluation
– level-2 disk servers can be connected to the storage via GPFS only
  LCG and FC dependencies on the OS are decoupled
WNs are not members of the GPFS cluster (it does not scale to a large number of WNs)
– storage available to WNs via rfio, xrootd (BABAR) or NFS (few cases, see next slide)
– NFS used mainly to share experiment software on the WNs, but not suitable for data access
20
NFS stress test
Setup: NFS server on the NAS (fibre channel back-end), NFS clients on the WNs, test jobs submitted through the LSF job scheduler
Tests: 1) Connectathon (NFS/RPC test); 2) iozone (NFS I/O test, see the sketch below)
Parameters varied: client kernel, server kernel, rsize, wsize, protocol; measured: test execution time, disk write I/O, I/O wait, number of server threads
Results:
– problems with the 2.4.21-27.0.2 kernel
– kernel 2.6 is better than 2.4 (even with the sdr patch)
– the UDP protocol performs better than TCP
– best rsize is 32768, best wsize is 8192
– the NFS protocol may scale beyond 200 clients without aggregate performance degradation
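A minimal sketch of the kind of sequential I/O timing performed with iozone here, written in Python; the mount point, file size and record size are placeholders, not the values used in the test.

```python
import os
import time

MOUNT = "/nfs/testarea"     # placeholder NFS mount point
SIZE_MB = 1024              # placeholder file size
BLOCK = 1024 * 1024         # 1 MiB records

def time_write(path, size_mb):
    """Sequentially write size_mb records and return MB/s."""
    buf = os.urandom(BLOCK)
    t0 = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # make sure the data really reached the server
    return size_mb / (time.time() - t0)

def time_read(path):
    """Sequentially read the file back and return MB/s."""
    t0 = time.time()
    n = 0
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK):
            n += len(chunk)
    return (n / BLOCK) / (time.time() - t0)

testfile = os.path.join(MOUNT, "iotest.dat")
print("write MB/s:", time_write(testfile, SIZE_MB))
print("read  MB/s:", time_read(testfile))
```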
21
Disk storage tests
Data processing for the LHC experiments at Tier1 facilities requires access to petabytes of data from thousands of nodes simultaneously, at high rate
None of the traditional storage systems can handle these requirements
We are in the process of defining the required hardware and software components of the system
– it is emerging that a Storage Area Network approach with a parallel file system on top can do the job
V. Vagnoni – LHCb Bologna
22
Hardware testbed
Disk storage
– 3 controllers of an IBM DS4500
– each controller serves 2 RAID5 arrays of 4 TB each (17 x 250 GB disks + 1 hot spare)
– each RAID5 array is further subdivided into two LUNs of 2 TB each
– 12 LUNs and 24 TB of disk space in total (102 x 250 GB disks + 8 hot spares)
File system servers
– 6 IBM xSeries 346, dual Xeon, 2 GB RAM, Gigabit NIC
– a QLogic Fibre Channel PCI card on each server, connected to the DS4500 via a Brocade switch
– 6 Gb/s of bandwidth available to/from the clients
Clients
– 36 dual Xeon, 2 GB RAM, Gigabit NIC
V. Vagnoni – LHCb Bologna
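A quick, illustrative consistency check of the numbers above (RAID5 usable space per array, LUN count and total capacity):

```python
# Consistency check of the testbed figures on this slide (illustration only).
disk_gb = 250
data_disks_per_array = 17       # plus one hot spare per array
arrays = 3 * 2                  # 3 controllers, 2 RAID5 arrays each

# RAID5 keeps one disk's worth of parity, so usable space is 16 x 250 GB per array.
usable_tb_per_array = (data_disks_per_array - 1) * disk_gb / 1000
luns = arrays * 2               # each array split into two 2 TB LUNs
total_tb = arrays * usable_tb_per_array

print(usable_tb_per_array, "TB per array,", luns, "LUNs,", total_tb, "TB total")
# -> 4.0 TB per array, 12 LUNs, 24.0 TB total
```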
23
Parallel File Systems
We evaluated the two main-stream products on the market
– GPFS (version 2.3) by IBM
– Lustre (version 1.4.1) by CFS
Both come with advanced management and configuration features, SAN-oriented failover mechanisms and data recovery
– but they can be used as well on top of standard disks and arrays
With GPFS and Lustre, the 12 DS4500 LUNs in use were aggregated into one 24 TB file system by the servers and mounted by the clients through the Gigabit network
– both file systems are 100% POSIX-compliant from the client side
– the file systems appear to the clients as ordinary local mount points
V. Vagnoni – LHCb Bologna
24
Performance (1)
A home-made benchmarking tool oriented to HEP applications has been written
– it allows simultaneous sequential reads/writes from an arbitrary number of clients and processes per client (see the sketch below)
Plot: raw Ethernet throughput vs time (20 x 1 GB files read simultaneously)
V. Vagnoni – LHCb Bologna
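The benchmark code itself is not shown in the slides; the following is a rough Python sketch of a tool in the same spirit: several processes read large files sequentially at the same time and the aggregate rate is reported. Mount point, file names and sizes are assumptions.

```python
import os
import time
from multiprocessing import Pool

MOUNT = "/gpfs/testfs"          # placeholder mount point on the parallel FS
FILE_SIZE_MB = 1024             # assumes 1 GB test files written beforehand
BLOCK = 1024 * 1024

def read_one(index):
    """Sequentially read one pre-existing test file; return MB/s for this stream."""
    path = os.path.join(MOUNT, f"bench_{index}.dat")
    t0 = time.time()
    with open(path, "rb") as f:
        while f.read(BLOCK):
            pass
    return FILE_SIZE_MB / (time.time() - t0)

if __name__ == "__main__":
    n_streams = 20              # e.g. 20 x 1 GB simultaneous reads, as in the plot
    with Pool(n_streams) as pool:
        rates = pool.map(read_one, range(n_streams))
    print("aggregate MB/s:", sum(rates))
```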
25
Performance (2)
Plot: net throughput (Gb/s) vs number of simultaneous reads/writes (1 GB files, each stream on a different file)
V. Vagnoni – LHCb Bologna
26
CASTOR issues
At present, an STK library with 6 LTO-2 and 2 9940B drives
– 2000 x 200 GB LTO-2 tapes (400 TB)
– 800 x 200 GB 9940B tapes (160 TB, free)
– tender for an upgrade with 2500 x 200 GB tapes (500 TB)
In general, CASTOR performance (as with other HSM software) increases with clever pre-staging of files (ideally ~90%, see the estimate below)
LTO-2 drives are not usable in a real production environment with the present CASTOR release
– hangs on locate/fskip every 50-100 non-sequential read operations, or checksum errors and non-terminated tapes (set RDONLY) every 50-100 GB of data written (STK assistance is also needed)
– usable only with a mean file size of 20 MB or more
– good reliability for optimized (sequential or pre-staged) operations
– fixes with CASTOR v2 (Q2 2005)?
CERN and PIC never reported HW problems with the 9940B drives during last year's data challenges
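A back-of-the-envelope model (not from the slides) of why a high pre-stage fraction matters: reads served from the disk pool go at disk speed, while tape recalls pay mount and positioning overhead. The rates used below are illustrative placeholders.

```python
# Effective read rate as a function of the pre-stage (disk cache hit) fraction.
def effective_rate_mb_s(prestage_fraction,
                        disk_rate=60.0,         # MB/s from the disk pool (placeholder)
                        tape_recall_rate=5.0):  # MB/s including mount/seek overhead (placeholder)
    # time per MB is the hit/miss-weighted average of the two access times
    t = prestage_fraction / disk_rate + (1 - prestage_fraction) / tape_recall_rate
    return 1.0 / t

for f in (0.5, 0.9, 0.99):
    print(f, round(effective_rate_mb_s(f), 1))   # roughly 9, 29 and 54 MB/s
```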
27
Service challenge (1)
WAN: dedicated 1 Gb/s link connected via GARR+GÉANT; a 10 Gb/s link available in September 2005
LAN: an Extreme Summit 400 (48xGE + 2x10 GE) dedicated to the Service Challenge
Diagram: CERN (1 Gb/s now, 10 Gb/s in September 2005) – GARR Italian research network – Summit 400 (48x1 Gb/s + 2x10 Gb/s) – 11 servers with internal HDs
28
Service challenge (2)
11 Sun Fire V20 dual Opteron (2.2 GHz)
– 2 x 73 GB U320 SCSI HDs
– 2 x Gbit Ethernet interfaces
– 2 x PCI-X slots
OS: SLC 3.0.4 (x86_64), kernel 2.4.21-27
Tests with bonnie++/IOzone on the local disks: ~60 MB/s read and write
Tests with Netperf/Iperf on the LAN: ~950 Mb/s
Globus (GridFTP) v2.4.3 installed on all cluster nodes
CASTOR SRM v1.2.12, CASTOR stager v1.7.1.5
29
SC2: Transfer cluster
10 machines used as GridFTP/SRM servers, 1 as the CASTOR stager/SRM repository
– internal disks used (70 GB x 10 = 700 GB)
– for SC3, CASTOR tape servers with IBM LTO-2 or STK 9940B drives will also be used
Load balancing implemented by assigning to a single DNS name the IP addresses of all 10 servers with a round-robin algorithm (see the sketch below)
SC2 goal reached: 100 MB/s disk-to-disk sustained for 2 weeks
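A minimal sketch of the client side of this DNS-based balancing, assuming a hypothetical alias name; the resolver returns the addresses behind the alias and the rotation is done by the name server, so the client just uses the first answer.

```python
import socket

# Hypothetical alias: the transfer host name resolves to the addresses of
# all ten GridFTP servers; 2811 is the standard GridFTP control port.
ALIAS = "sc-transfer.cr.cnaf.infn.it"

def pick_server():
    # getaddrinfo returns the address records behind the alias; the name
    # server rotates their order, so taking the first entry spreads the load.
    infos = socket.getaddrinfo(ALIAS, 2811, proto=socket.IPPROTO_TCP)
    return infos[0][4][0]

print(pick_server())
```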
30
Sample of production: CMS (1)
CMS activity at the T1: grand summary
– data transfer: >1 TB per day from T0 to T1 in 2004, via PhEDEx
– local MC production: >10 M events for >40 physics datasets
– Grid activities: official production on LCG; analysis on DSTs via Grid tools
31
Sample of production: CMS (2)
32
Sample of production: CMS (3)
33
Summary & Conclusions
During 2004 the INFN Tier1 was deeply involved in LHC activities
– some experiments (e.g. BABAR, CDF) are already in the data-taking phase
Main issue is the shortage of human resources
– ~10 more FTEs are required for the Tier1, nearly doubling the staff (for both HW and SW maintenance)