Elastic CNAF Datacenter extension via opportunistic resources INFN-CNAF
INFN
● The National Institute for Nuclear Physics (INFN) is a research institute funded by the Italian government
● Composed of several units
– 20 units located in the main Italian university physics departments
– 4 laboratories
– 3 national centers dedicated to specific tasks
● CNAF is a national center dedicated to computing applications
ISGC 2016
The Tier-1 at INFN-CNAF
● WLCG Grid site dedicated to HEP computing for the LHC experiments (ATLAS, CMS, LHCb, ALICE); also works with ~30 other scientific groups
● WNs, computing slots, 200k HS06 and counting
● LSF as current batch system, migration to Condor foreseen
● 22PB SAN disk (GPFS), 27PB on tape (TSM), integrated as an HSM
● Also supporting LTDP for the CDF experiment
● Dedicated network channel (LHCOPN, 20Gb/s) with the CERN Tier-0 and the T1s, plus 20Gb/s (LHCONE) with most of the T2s
● 100Gb/s connection in 2017
● Member of the HNSciCloud European project for testing hybrid clouds for scientific computing
[Figure: WAN connectivity of the CNAF Tier-1. A 40Gb/s physical link (4x10Gb) is shared by LHCOPN and LHCONE; a 20Gb/s link is for general IP connectivity. Diagram labels: Nexus, Cisco 7600, GARR Bo1, GARR Mi1, RAL, SARA, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3, RRC-KI, JINR, KR-KISTI, main Tier-2s. Link labels: 40Gb/s, 20Gb/s, 10Gb/s.]
Extension use-cases
● Aruba: elastic opportunistic computing with transient resources
– CMS selected for test and setup
● ReCaS/Bari: extension and management of remote resources
– These will become pledged resources for CNAF
[Map: Bologna, Bari, Arezzo]
Use-case 1: Aruba
Pros of opportunistic computing
● CMS
– Takes advantage of (much) more computing resources
– Cons: transient availability
● Aruba
– Case study in providing unused resources to an “always hungry” customer
● INFN-T1
– Tests transparent utilization of remote resources for HEP (proprietary or opportunistic)
Aruba
● One of the main Italian resource providers
– Web, hosting, mail, cloud...
● Main datacenter in Arezzo (near Florence)
The CMS experiment at INFN-T1
● 48k HS06 of CPU power, 4PB of online disk storage and 12PB of tape
● All major computing activities are implemented
– Monte Carlo simulations
– Reconstruction
– End-user analysis
● The 4 LHC experiments are close enough in requests and workflows
– Extension to the other 3 is under development
The use-case
● Early agreement between CNAF and Aruba
– Aruba provides an amount of virtual resources (CPU cycles, RAM, disk) to deploy a remote testbed
– Managed through a VMware dashboard
– When Aruba customers require more resources, the CPU frequency of the VMs provided to the testbed is lowered to a few MHz (the VMs are not destroyed!)
● Goal
– Transparently join these external resources to the local cluster “as if they were” part of it, and have LSF dispatch jobs there when they are available
– Tied to CMS-only specifications for the moment
– Once fully tested and verified, the extension is trivial for the other LHC experiments and still to be studied for non-LHC VOs
VM management via VMware
● Proved to be rock solid and extremely versatile
● Seamlessly imported a WN image from our WN-on-demand system (WNoDeS)
● Adapted and contextualized
[Figure: resources allocated to our data center]
The CMS workflow at CNAF
● Grid pilot jobs are submitted to CREAM CEs
– Late binding: we cannot know in advance what kind of activity a pilot is going to perform
● Multicore only
– 8-core (or 8-slot) jobs: CNAF dedicates a dynamic partition of WNs to such jobs
● SQUID proxy for software and condition DB
● Input files on local GPFS disk, fallback via Xrootd, O(GB) file size
● Output files staged through SRM (StoRM) at CNAF
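The input-file access order on this slide (local GPFS first, Xrootd as fallback) can be sketched as a generic "try each access method in turn" helper. This is only an illustration: `open_with_fallback` and the opener callables are hypothetical names, not part of the CMS software or of any Xrootd client API.

```python
def open_with_fallback(path, openers):
    """Try each (name, opener) pair in order and return the first success.

    `openers` is a list of (name, callable) pairs supplied by the caller,
    for example a local-disk opener followed by a remote one. Both the
    helper and the opener names are illustrative assumptions.
    """
    errors = []
    for name, opener in openers:
        try:
            return name, opener(path)
        except OSError as exc:
            # Remember why this access method failed and try the next one.
            errors.append(f"{name}: {exc}")
    raise OSError("all access methods failed: " + "; ".join(errors))
```

The order of the list encodes the preference, so a job only reaches the remote method when the local one raises an error.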
The dynamic multicore partition
● CMS jobs run in a dynamic subset of hosts dedicated to multicore-only jobs
● Elastic resources shall be members of this subset
Adapting CMS for Aruba
● Main idea: transparent extension
– Remote WNs join the LSF cluster at boot “as if” they were local to the cluster
● Problems
– Remote virtual WNs need read-only access to the cluster shared fs (/usr/share/lsf)
– VMs have private IPs and sit behind NAT and a firewall with outbound connectivity only, yet they must be reachable by LSF
– LSF needs host resolution (IP ↔ hostname), but no DNS is available for such hosts
Adapting CMS for Aruba
● Solutions
– Read-only access to the cluster shared fs: provided through GPFS/AFM
– Host resolution: LSF has its own version of /etc/hosts; this requires declaring a fixed set of virtual nodes
– Networking problems: solved using dynfarm, a service developed at CNAF to integrate LSF with virtualized computing resources
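To illustrate the host-resolution point, the sketch below generates /etc/hosts-style lines for a fixed set of virtual nodes, the kind of static mapping needed when no DNS is available. The hostname prefix, the private network and the node count are assumptions made up for the example; the slides do not give the real values.

```python
import ipaddress

def hosts_entries(prefix, network, count):
    """Return /etc/hosts-style lines for a fixed set of virtual nodes.

    prefix, network and count are illustrative assumptions, not the
    values used at CNAF.
    """
    available = list(ipaddress.ip_network(network).hosts())
    if count > len(available):
        raise ValueError("network too small for the requested node count")
    # One "IP<TAB>hostname" line per node, numbered from 01.
    return [
        f"{address}\t{prefix}-{index:02d}"
        for index, address in enumerate(available[:count], start=1)
    ]

if __name__ == "__main__":
    for line in hosts_entries("vwn-aruba", "10.10.0.0/28", 10):
        print(line)
```

Because the set is fixed, the same file can be distributed once to every server that needs to resolve the remote nodes.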
Remote data access via GPFS AFM
● GPFS AFM
– A cache providing a geographic replica of a file system
– Manages RW access to the cache
● Two sides
– Home: where the information lives
– Cache: data written to the cache is copied back to home as quickly as possible; data is copied to the cache when requested
● Configured as read-only for site extension
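The read-only behaviour described here can be pictured with a toy model: data is copied from home to the cache on first request, later reads are served locally, and writes are refused. This is a conceptual sketch only, not the GPFS AFM interface; the class name and the dictionary standing in for the home file system are assumptions, and details such as revalidation or eviction are left out.

```python
class ReadOnlyCache:
    """Toy model of a read-only cache in front of a 'home' file system."""

    def __init__(self, home):
        self._home = home          # dict path -> bytes, stands in for the home site
        self._cached = {}          # data already copied to the cache site
        self.remote_fetches = 0    # how many times data crossed the WAN

    def read(self, path):
        if path not in self._cached:
            # Cache miss: copy the data from home once.
            self._cached[path] = self._home[path]
            self.remote_fetches += 1
        return self._cached[path]

    def write(self, path, data):
        # Read-only mode: nothing is ever pushed back to home.
        raise PermissionError("cache is configured read-only")
```

In this model only the first read of a file pays the wide-area cost, which is why jobs whose input is already cached behave like local ones.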
Dynfarm concepts
● At boot, the VM connects to an OpenVPN-based service at CNAF
● The service authenticates the connection (X.509)
● It delivers the parameters to set up a tunnel with (only) the required services at CNAF (LSF, CEs, Argus)
● Routes to the private IPs of the VMs are defined on each server (GRE tunnels)
● Other traffic flows through the general network
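The split between tunnelled and general traffic can be sketched as a simple routing decision on the VM side. The addresses below come from a documentation-only range and the service names are placeholders; neither reflects the real CNAF configuration or the dynfarm code.

```python
import ipaddress

# Hypothetical addresses standing in for the CNAF services (LSF, CEs, Argus).
TUNNELLED_SERVICES = {
    "lsf-master": ipaddress.ip_address("192.0.2.10"),
    "cream-ce": ipaddress.ip_address("192.0.2.20"),
    "argus": ipaddress.ip_address("192.0.2.30"),
}

def next_hop(destination):
    """Return 'tunnel' for the required CNAF services, 'default' otherwise."""
    address = ipaddress.ip_address(destination)
    if address in TUNNELLED_SERVICES.values():
        return "tunnel"
    return "default"
```

Keeping the tunnelled set small means that bulk traffic, such as remote data access, never goes through the VPN server.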
Dynfarm deployment
● VPN server side, two RPMs: dynfarm-server, dynfarm-client-server
– Installed on the VPN server at CNAF
– The first install creates one dynfarm_cred.rpm, which must be present in the VMs
● VM side, two RPMs: dynfarm_client, dynfarm_cred (contains the CA certificate used by the VPN server and a key used by dynfarm-server)
● Management: remote_control
Dynfarm workflow
Results
● Early successful attempts since June 2015
● Different configurations (tuning) have followed
Results
● 160GHz total amount of CPU (Intel 2697-v3)
– Assuming 2GHz/core → 10 x 8-core VMs (possible overbooking)
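The sizing on this slide is a two-step division, spelled out below with the slide's own numbers (160GHz in total, 2GHz per core assumed, 8 cores per VM). The function name is just a label for the example.

```python
def vm_count(total_ghz, ghz_per_core, cores_per_vm):
    """Number of whole VMs that fit in a CPU allocation expressed in GHz."""
    cores = total_ghz / ghz_per_core       # 160 / 2 = 80 cores
    return int(cores // cores_per_vm)      # 80 // 8 = 10 VMs

print(vm_count(160, 2, 8))  # prints 10
```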
Results
● Currently the remote VMs run the very same jobs delivered to CNAF by GlideinWMS
● Job efficiency on elastic resources can be very good for certain types of jobs (MC)
● A special configuration on the GlideinWMS side can specialize job delivery for these resources
[Table: columns Queue, Site, Njobs, Avg_eff, Max_eff, Avg_wct, Avg_cpt; rows CMS_mc (site AR) and CMS_mc (site T)]
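As a reading aid for the table columns, the sketch below assumes that "wct" is wall-clock time, "cpt" is CPU time, and efficiency is CPU time divided by wall-clock time times the number of cores. That definition is a common convention, not something stated on the slide, and the job tuples used in the example are made up.

```python
def job_efficiency(cpu_time_s, wall_time_s, cores=1):
    """CPU efficiency of one job: CPU time over (wall-clock time x cores)."""
    if wall_time_s <= 0 or cores <= 0:
        raise ValueError("wall time and core count must be positive")
    return cpu_time_s / (wall_time_s * cores)

def summarize(jobs):
    """Aggregate (cpu_time_s, wall_time_s, cores) tuples into table-like fields."""
    efficiencies = [job_efficiency(*job) for job in jobs]
    return {
        "Njobs": len(efficiencies),
        "Avg_eff": sum(efficiencies) / len(efficiencies),
        "Max_eff": max(efficiencies),
    }

# Two hypothetical 8-core jobs, one fully CPU bound and one half idle.
print(summarize([(28800.0, 3600.0, 8), (14400.0, 3600.0, 8)]))
```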
Use-case 2: ReCaS/Bari
Remote extension to ReCaS/Bari
● ~17.5k HS06, ~30 WNs, each with 64 cores and 256GB RAM
– 1 core per slot, 4GB/slot, 8.53 HS06/slot (546 HS06/WN)
● Dedicated network connection with CNAF: layer-3 VPN, 20Gb/s
– Routing through CNAF, IPs of remote hosts in the same network range (plus x.y for IPMI access)
– Similar to the CERN/Wigner extension
● Direct and transparent access from CNAF
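The per-node figures on this slide follow from the per-slot ones. The short check below recomputes them from the quoted values (64 slots per WN, 8.53 HS06 per slot, 256GB per WN, ~17.5k HS06 in total); the constant and function names are only labels for the example.

```python
SLOTS_PER_WN = 64        # 1 core per slot, 64 cores per WN
HS06_PER_SLOT = 8.53
RAM_PER_WN_GB = 256
TOTAL_HS06 = 17_500      # "~17.5k HS06" on the slide

def farm_figures():
    """Derive per-WN power, per-slot RAM and an estimated WN count."""
    hs06_per_wn = HS06_PER_SLOT * SLOTS_PER_WN      # about 546 HS06 per WN
    ram_per_slot_gb = RAM_PER_WN_GB / SLOTS_PER_WN  # 4 GB per slot
    wn_estimate = TOTAL_HS06 / hs06_per_wn          # about 32, i.e. "~30 WNs"
    return hs06_per_wn, ram_per_slot_gb, wn_estimate

print(farm_figures())
```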
Deployment
● Two infrastructure VMs to offload the network link: CVMFS and Frontier SQUID (used by ATLAS and CMS)
– SQUID requests are redirected to the local VMs
● Cache storage: GPFS/AFM
– 2 servers, 10 Gbit
– 330TB (ATLAS, CMS, LHCb)
● LSF shared file system also replicated
Network traffic (4 weeks)
Current issues and tuning
● Latencies in the shared fs can cause trouble
– Intense I/O can lead to timeouts:
  ba-3-x-y: Feb 8 22:56:51 ba kernel: nfs: server nfs-ba.cr.cnaf.infn.it not responding, timed out
● CMS: fallback to Xrootd (excessive load on the AFM cache)
Comparative results
[Table: columns Queue, Nodetype, Njobs, Avg_eff, Max_eff, Avg_wct, Avg_cpt. Rows (queue, node type): Cms_mc (AR); Alice (T); Atlas_sc (T); Cms_mc (T); Lhcb (T); Atlas_mc (T); Alice (BA); Atlas (BA); Cms_mcore (BA); Lhcb (BA); Atlas_sc (BA)]
Conclusions
Aruba
● Got the opportunity to test our setup on a purely commercial cloud provider
● Developed dynfarm to extend our network setup
– The core dynfarm concept should be adaptable to other batch systems
● Gained experience with yet another cloud infrastructure: VMware
● Job efficiency is encouraging
– Even better once we are able to forward only non-I/O-intensive jobs to Aruba
● The scale of the test was quite small and did not reach any bottleneck
● Tested with CMS; other LHC experiments may join in the future
● Accounting is problematic due to the possible GHz reduction
● Good exercise for HNSciCloud too
ReCaS/Bari
● The T1-Bari farm extension is “similar” to CERN-Wigner
● Job efficiency (compared to native T1) depends strongly on storage usage
– Better efficiency means the job on the WN is mainly CPU bound (or its input file was already in the cache before the start)
● General scalability is limited by the bandwidth of the dedicated T1→BA link (20Gb/s)
● Assistance on faulty nodes is somewhat problematic