INFN Computing infrastructure - Workload management at the Tier-1

1 INFN Computing infrastructure - Workload management at the Tier-1

2 Introduction to INFN-T1

3 The Tier-1 at INFN-CNAF
INFN-T1 started as a computing center for the LHC experiments and nowadays provides services and resources to ~30 other scientific collaborations.
~1,000 WNs, ~ computing slots, ~200 kHS06. Small HPC cluster available with IBA (~33 TFlops).
22 PB of SAN disk (GPFS) and 43 PB on tape (TSM), integrated as an HSM. Supporting LTDP for the CDF experiment.
Dedicated network channel (60 Gb/s) for LHC OPN + LHC ONE, with 20 Gb/s reserved for LHC ONE; upgrade to a 100 Gb/s connection in 2017.

4 Computing resource usage
Computing resources at INFN-T1 are always completely used, with a large amount of waiting jobs (about 50% of the running jobs).
A huge resource request (and increase) is expected in the next years, mostly coming from the LHC experiments.
Figures: INFN Tier-1 farm usage in 2016; INFN Tier-1 resource increase.

5 WAN@CNAF (OPN+ONE upgrade 40→60 Gb/s)
Diagram: CNAF Tier-1 WAN connectivity through a Cisco 7600 router. LHC OPN peers: RAL, SARA, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, KR-KISTI, RRC-KI, JINR and the main Tier-2s; LHC ONE and General IP connectivity via GARR (Mi1/Bo1), with 20 Gb/s for General IP. The 40 Gb physical link (4x10 Gb) shared by LHCOPN and LHCONE has been upgraded to a 60 Gb physical link (6x10 Gb).

6 Physical Tier-1 network upgrade to 60 Gb/s
26 Sep: upgraded the T1 physical WAN bandwidth (LHC-OPN/ONE) from 4x10 Gb/s to 6x10 Gb/s. 3 Oct: reached 60 Gb/s.
Plots: OPN+ONE (daily and weekly statistics), OPN (daily), ONE (daily).
OPN: 90% of the peak traffic was Xrootd towards the WNs from CERN machines. ONE: traffic on ONE generated mainly by the T2s.

7 Storage Resources
23 PB (net) of disk under GPFS 4.1, with 3-4 PB per file system.
32 PB on tape under TSM 7.2, with 17 tape drives (T10KD).
High-density installations: the 2015 tender provided 10 PB (net) in 3 racks and decreased the number of servers from 150 to 100; the 2016 tender adds 3 PB on Huawei OceanStor 6800 V3.

8 Computing resources
Farm power: 200 kHS06.
The 2016 tender is still to be delivered: Huawei X6800, 255 nodes with dual Xeon E5-2618L v4 and 128 GB RAM, expected to provide ~400 HS06 each (roughly 255 × 400 ≈ 102 kHS06 in total).
We continuously get requests for extra pledges, so it is difficult to switch off old nodes, but we decommissioned many racks this year.
New hardware for the virtualization infrastructure, managed by oVirt; new VMware contract with a special discount price.

9 Monitoring
Complete refactoring of the monitoring and alarm tools across all CNAF functional units.
Past: Nagios, Lemon, home-made probes and sensors, legacy UIs.
Future: a central infrastructure based on Sensu + Uchiwa, InfluxDB and Grafana, with community probes and sensors (home-made ones only when necessary).
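To give a flavour of the new stack, a custom probe can push a sample straight to InfluxDB over its HTTP line-protocol write endpoint. The sketch below is only illustrative; the host, database, measurement and tag names are made up, not the real CNAF setup:

```python
#!/usr/bin/env python
# Illustrative probe pushing a single sample to InfluxDB via the HTTP
# line-protocol write endpoint.  Host, database, measurement and tag
# names are made-up examples.
import time
import requests

INFLUX_URL = "http://monitoring.example.cnaf.infn.it:8086/write"  # hypothetical host
DATABASE = "farm_metrics"                                          # hypothetical database

def push_metric(measurement, tags, value):
    """Write one point using the InfluxDB line protocol."""
    tag_str = ",".join("%s=%s" % (k, v) for k, v in sorted(tags.items()))
    # line protocol: <measurement>,<tags> value=<field> <timestamp in ns>
    line = "%s,%s value=%s %d" % (measurement, tag_str, value, int(time.time() * 1e9))
    resp = requests.post(INFLUX_URL, params={"db": DATABASE}, data=line, timeout=5)
    resp.raise_for_status()

if __name__ == "__main__":
    # e.g. number of running job slots reported by a worker-node probe
    push_metric("running_slots", {"host": "wn-001", "cluster": "t1"}, 42)
```

Grafana dashboards can then be built on top of the same series, replacing the legacy UIs.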

10 Provisioning
All nodes are registered in a common database (home-made solution) holding warranty information, purchase data, and OS and network configuration.
Nagios can be interfaced with the database to automatically build the node inventory (see the sketch below).
Recently moved from Quattor to Puppet + Foreman: installation and configuration are fully automated, several manifests and configurations are shared among different departments, and better support is obtained from the community.
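As an illustration of driving the monitoring inventory from the node database, the following sketch dumps Nagios host definitions from it. The sqlite backend and the "nodes" table schema are assumptions; the real home-made database may be organized differently:

```python
#!/usr/bin/env python
# Illustrative export of Nagios host definitions from the node database.
# The sqlite file and the "nodes" table (hostname, ip, os) are assumptions.
import sqlite3

HOST_TEMPLATE = """\
define host {{
    use        linux-server
    host_name  {hostname}
    alias      {hostname} ({os})
    address    {ip}
}}
"""

def export_nagios_hosts(db_path="inventory.db", out_path="hosts.cfg"):
    """Generate one Nagios host definition per node in the inventory."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT hostname, ip, os FROM nodes ORDER BY hostname")
    with open(out_path, "w") as out:
        for hostname, ip, os_name in rows:
            out.write(HOST_TEMPLATE.format(hostname=hostname, ip=ip, os=os_name))
    conn.close()

if __name__ == "__main__":
    export_nagios_hosts()
```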

11 Toward a (semi-)elastic Data Center?
Planning to upgrade the Data Center to host resources at least until the end of LHC Run 3 (2023).
Extending the farm (dynamically) to remote hosts as a complementary solution: cloud bursting on commercial providers, with tests of "opportunistic computing" on cloud providers, and static allocation of remote resources.
First production use case: part of the 2016 pledged resources for the WLCG experiments at INFN-T1 are hosted at Bari-ReCaS.
Participating in the HNSciCloud PCP project.

12 Workload management

13 Current situation
T1 for the LHC experiments (ATLAS, ALICE, CMS, LHCb), for other “heavy users” (AMS, Virgo) and for other VOs (Auger, biomed, Borexino, JUNO, …).
Two approaches to submit jobs to INFN-T1: via the Grid (preferred) or locally.

14 Grid submission
Users reach our site via the CREAM CEs (6 configured so far) or through one of the many UIs configured.
One generic UI is available; many experiment-specific UIs are present to fulfill particular requirements in terms of software or configuration (a hedged submission sketch follows below).
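For illustration, a direct submission to one of the CREAM CEs from a UI could look like the sketch below; the CE host name, queue and file names are placeholders, not the real INFN-T1 endpoints:

```python
#!/usr/bin/env python
# Sketch of a direct Grid submission to a CREAM CE from a UI, wrapping
# the glite-ce-job-submit CLI.  CE endpoint and file names are placeholders.
import subprocess

JDL = """\
[
  Executable = "run_payload.sh";
  StdOutput  = "payload.out";
  StdError   = "payload.err";
  InputSandbox  = {"run_payload.sh"};
  OutputSandbox = {"payload.out", "payload.err"};
  OutputSandboxBaseDestURI = "gsiftp://localhost";
]
"""

CE_ENDPOINT = "ce-example.cr.cnaf.infn.it:8443/cream-lsf-somequeue"  # placeholder

if __name__ == "__main__":
    with open("payload.jdl", "w") as f:
        f.write(JDL)
    # -a: automatic proxy delegation; -r: target CREAM CE endpoint
    out = subprocess.check_output(
        ["glite-ce-job-submit", "-a", "-r", CE_ENDPOINT, "payload.jdl"]
    ).decode()
    print(out.strip())   # prints the CREAM job ID on success
```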

15 Local submission
The generic UI is available for local submission too.
To be able to log in, users must sign the CNAF AUP. Access is given through bastion hosts, from which users can ssh to their experiment UI.

16 Batch system
LSF (currently v9.x) is the production batch system at INFN-T1, adopted many years ago to solve the issues we had with Torque + Maui.
We never had a problem related to bugs, scalability or performance, and we always found a configuration to fulfill every request coming from the experiments; we recently adopted a multi-core job configuration (a submission sketch follows below).
The installation follows a “best-practices” configuration: a shared cNFS file system among all client nodes and 3 redundant masters.
A single LSF instance serves the whole farm, managing 1,000 WNs, 50 UIs, 6 CEs and the LHCb T2 resources.
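A minimal sketch of what a multi-core submission looks like from a UI; the queue name and payload script are illustrative, while -n, -q, -R and -o/-e are standard bsub options:

```python
#!/usr/bin/env python
# Minimal sketch of submitting a multi-core job to LSF with bsub.
# Queue name and payload script are illustrative.
import subprocess

def submit_multicore(script, slots=8, queue="mcore"):
    """Submit <script> as a multi-core job keeping all slots on one WN."""
    cmd = [
        "bsub",
        "-q", queue,              # hypothetical multi-core queue
        "-n", str(slots),         # request <slots> job slots
        "-R", "span[hosts=1]",    # keep all slots on a single host
        "-o", "job_%J.out",
        "-e", "job_%J.err",
        script,
    ]
    out = subprocess.check_output(cmd).decode()
    print(out.strip())            # e.g. "Job <1234> is submitted to queue <mcore>."

if __name__ == "__main__":
    submit_multicore("./run_payload.sh")
```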

17 Batch system
Two more, separate, LSF batch systems:
the T3 in Bologna, kept separate because its resources are not part of the INFN-T1 pledges;
the HPC cluster, which has some special settings and is kept separate for now.

18 LSF problems
Mainly one: the cost of the license.
We have an agreement with IBM and acquired it at a special price, but it is still very expensive; the main concern is the future growth of the computing resource requirements.
Support is over by the end of 2018: in 2019 we will still be able to use our purchased licenses, but with no right to upgrade and no support.

19 Planning the switch
HTCondor seems to be the best alternative for us.
We have software relying on LSF that needs to be adapted; this may be more time-consuming than the batch system switch itself.
We also need to instruct users: are command wrappers for basic submission and query available? (A possible wrapper sketch follows below.)
We are not in a hurry, since the number of licenses is large enough to fulfill our requirements for the near future.
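If ready-made wrappers turn out not to be available, a thin bsub/bjobs-like layer over the HTCondor CLI could be provided locally. The sketch below is only an assumption about how such a wrapper might look; file names and defaults are illustrative:

```python
#!/usr/bin/env python
# Sketch of bsub/bjobs-like wrappers around condor_submit and condor_q,
# to ease the transition for LSF users.  Defaults are illustrative only.
import subprocess
import tempfile

SUBMIT_TEMPLATE = """\
executable   = {exe}
output       = {exe}.$(ClusterId).out
error        = {exe}.$(ClusterId).err
log          = {exe}.$(ClusterId).log
request_cpus = {cpus}
queue
"""

def bsub_like(executable, cpus=1):
    """Build a submit description from bsub-style arguments and submit it."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT_TEMPLATE.format(exe=executable, cpus=cpus))
        subfile = f.name
    out = subprocess.check_output(["condor_submit", subfile]).decode()
    print(out.strip())   # e.g. "1 job(s) submitted to cluster 1234."

def bjobs_like(user):
    """Query the queue for a user's jobs, roughly like bjobs -u <user>."""
    print(subprocess.check_output(["condor_q", user]).decode())

if __name__ == "__main__":
    bsub_like("./run_payload.sh", cpus=8)
    bjobs_like("someuser")
```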

