Dynamic Extension of the INFN Tier-1 on external resources

Presentation transcript:

Dynamic Extension of the INFN Tier-1 on external resources
Vincenzo Ciaschini, 25/10/2016, CCR Workshop on HTCondor

Tier-1 is always saturated
- Computing resources at INFN-T1 are always completely used, with a large amount of waiting jobs (about 50% of the running jobs).
- A huge resource request (and increase) is expected in the next years, mostly coming from the LHC experiments.
[Figure: INFN Tier-1 farm usage in 2016]

Cloud bursting on a commercial provider

Bursting on Aruba
- Aruba is one of the main Italian commercial resource providers (web, hosting, mail, cloud, ...), with its main data center in Arezzo (near Florence).
- Small-scale test: 10 VMs with 8 cores each (160 GHz in total), managed by VMware.
- Use of idle CPU cycles: when a customer requires a resource we are using, the CPU clock speed of "our" VMs is reduced to a few MHz (the VMs are not destroyed).
- Only CMS multi-core jobs have been tested.
- No storage on site: remote data access via Xrootd (see the sketch below).
- Use of the general-purpose network (no dedicated NREN infrastructure).
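
Since there is no storage at the Aruba site, input data has to be read remotely over Xrootd. As a hedged illustration (not the actual CMS setup), a worker node might fetch a file through an Xrootd redirector as shown here; the redirector hostname and the /store path are placeholders:

    # Copy an input file through an Xrootd redirector to local scratch
    # (redirector hostname and /store path are illustrative placeholders)
    xrdcp root://xrootd-redirector.example.org//store/mc/sample/file.root /tmp/file.root

    # Or just check that the file is visible through the redirector
    xrdfs xrootd-redirector.example.org stat /store/mc/sample/file.root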

Implementation
- Uses an in-house tool, dynfarm, which:
  - authenticates connections coming from remote hosts and creates a VPN split tunnel connecting them to the Tier-1 (only the CEs, LSF and Argus are visible to the remote WNs);
  - configures the remote machines to make them compatible with the rest of the farm and with the requirements to run jobs.
- Only outbound connectivity is required (a hedged sketch of the split-tunnel idea follows).
- Shared file system access is provided through a R/O GPFS cache (AFM, see later for details).
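
dynfarm is an in-house tool, so no real configuration is reproduced here. The lines below are only a minimal sketch of the split-tunnel idea, in which the remote WN reaches a few Tier-1 services (CE, LSF master, Argus) through the VPN interface while all other traffic keeps its normal route; addresses and the device name are hypothetical.

    # Minimal sketch of split-tunnel routing on a remote WN (all values hypothetical):
    # only the CE, the LSF master and Argus are routed through the VPN device,
    # so the WN needs nothing more than outbound connectivity.
    ip route add 192.0.2.10/32 dev tun0    # CE
    ip route add 192.0.2.11/32 dev tun0    # LSF master
    ip route add 192.0.2.12/32 dev tun0    # Argus
    ip route show dev tun0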

Usage
- Capped at 160 GHz by Aruba.

Results
- The jobs on the remote machines are the same that are running at the Tier-1.
- Efficiency depends on the job type: very good for Monte Carlo, low for the rest, ranging from 0.49 to 0.9.
- It could be increased with better network bandwidth or by caching data; tests are foreseen.

Static expansion to remote farms

Remote extension to Bari-ReCaS
- 48 WNs (~26 kHS06) and ~330 TB of disk in the Bari-ReCaS data center are allocated to the INFN-T1 farm for the WLCG experiments.
  - Bari-ReCaS also hosts a Tier-2 for CMS and ALICE.
  - This corresponds to 10% of the CNAF total resources and 13% of the resources pledged to the WLCG experiments.
- Goal: direct and transparent access from INFN-T1.
- Our model is the CERN-Wigner extension; distance: ~600 km, RTT: ~10 ms.

BARI – INFN-T1 connectivity and commissioning tests
- Bari-ReCaS WNs are to be considered as if they were on the CNAF LAN.
- An L3 VPN has been configured: 2x10 Gb/s, MTU=9000.
- A CNAF /22 subnet is allocated to the Bari WNs.
- The service network (e.g. for IPMI) is accessible.
- Bari WNs route through CNAF, including LHCONE, LHCOPN and GPN traffic.
- Distance: ~600 km, RTT: ~10 ms (example commissioning checks are sketched below).
[Figure: layout of the CNAF-BARI VPN]
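
The commissioning tests are not detailed on the slide; the following are hedged examples of the kind of basic checks that could be run from a Bari-ReCaS WN, with a placeholder target address:

    # Check that 9000-byte jumbo frames cross the L3 VPN unfragmented
    # (8972 = 9000 - 20 bytes IP header - 8 bytes ICMP header); address is a placeholder
    ping -M do -s 8972 -c 5 192.0.2.1

    # Confirm the expected ~10 ms CNAF-Bari round-trip time
    ping -c 20 192.0.2.1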

Farm extension setup
- The INFN-T1 LSF master dispatches jobs also to the Bari-ReCaS WNs.
- Remote WNs are considered (and really appear) as local resources.
- CEs and the other service nodes are only at INFN-T1.
- Auxiliary services are configured in Bari-ReCaS (a possible client configuration is sketched below):
  - CVMFS Squid servers (for software distribution);
  - Frontier Squid servers (used by ATLAS and CMS for the conditions DB).
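
The squid hostnames are not given in the slides; this is a hedged sketch of what the CVMFS client configuration on a Bari-ReCaS WN could look like, with placeholder proxy names:

    # /etc/cvmfs/default.local on a remote WN (proxy hostnames are placeholders)
    CVMFS_REPOSITORIES=atlas.cern.ch,cms.cern.ch,alice.cern.ch,lhcb.cern.ch
    CVMFS_HTTP_PROXY="http://squid01.ba.example.org:3128|http://squid02.ba.example.org:3128"

    # Apply and verify on the node
    cvmfs_config reload
    cvmfs_config probe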

Data access
- Data at CNAF are organized in GPFS file systems, with local access through POSIX, GridFTP, Xrootd and HTTP.
- A remote file-system mount from CNAF is unfeasible (the RTT is ~100 times higher than on the LAN).
- Jobs expect to access data the same way as if they were running at INFN-T1, and not all experiments can use a fallback protocol.
- A POSIX cache is therefore required for the data needed in Bari-ReCaS.
  - ALICE uses Xrootd only, so no cache is needed.
- The cache is implemented with AFM, a GPFS native feature (see the illustration below).
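
As a hedged illustration of the two access modes (all paths and hostnames are placeholders): most jobs read through the POSIX namespace, which at Bari-ReCaS is backed by the read-only AFM cache, while an Xrootd-only experiment such as ALICE streams data remotely without any cache:

    # POSIX access: the same GPFS path is seen at CNAF and, via the AFM cache, at Bari-ReCaS
    # (path is a placeholder)
    stat /storage/gpfs_data/experiment/dataset/file.root

    # Xrootd fallback for experiments that support it (redirector is a placeholder)
    xrdcp root://redirector.example.org//experiment/dataset/file.root /tmp/file.root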

Remote data access via GPFS AFM
- AFM is a cache providing a geographic replica of a file system; it manages R/W access to the cache.
- Two sides:
  - home: where the information lives;
  - remote: implemented as a cache.
- Data written to the cache is copied back to home as quickly as possible; data is copied to the cache when requested.
- AFM is configured R/O for Bari-ReCaS: ~400 TB of cache vs. ~11 PB of data (a hedged configuration sketch follows).
- Several rounds of tuning and reconfiguration were required; in any case, we decided to avoid submitting high-throughput jobs to the remote site.
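
A hedged sketch of how a read-only AFM cache fileset can be created on the remote (Bari) cluster with standard GPFS/Spectrum Scale commands; the device, fileset name, junction path and home target are placeholders, not the actual CNAF configuration:

    # Create a read-only AFM cache fileset pointing at the home file system
    # (device, fileset name, junction path and afmTarget are placeholders)
    mmcrfileset gpfs_bari t1cache --inode-space new -p afmMode=ro,afmTarget=gpfs:///gpfs_t1/data
    mmlinkfileset gpfs_bari t1cache -J /gpfs_bari/t1cache

    # Inspect the AFM state of the cache fileset
    mmafmctl gpfs_bari getstate -j t1cache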

Results: Bari-ReCaS
- Job efficiency at Bari-ReCaS is equivalent to CNAF (or even better!).
  - In general, jobs at CNAF use WNs shared among several VOs; during the summer one of these (non-WLCG) submitted misconfigured jobs, affecting the efficiency of all the other VOs.
  - ATLAS submits only low-I/O jobs to Bari-ReCaS.
  - ALICE uses only Xrootd, no cache.
  - There is also "intense" WAN usage from CNAF jobs.
- The network appears not to be an issue.
- We can work without a cache if data comes via Xrootd; for some experiments the cache is mandatory.
- In production since June 2016: ~550 k jobs (~8% of CNAF).

Job efficiency @ Bari-ReCaS (Efficiency = CPT/WCT, i.e. CPU time over wall-clock time):
  Experiment   NJobs     Efficiency
  ALICE        105109    0.87
  ATLAS        366999    0.94
  CMS          34626     0.80
  LHCb         39310     0.92

Job efficiency @ CNAF:
  Experiment   NJobs     Efficiency
  ALICE        536361    0.86
  ATLAS        4956628   0.87
  CMS          326891    0.76
  LHCb         263376    0.88
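
As a worked example of the figure of merit: a job that consumes 9.4 hours of CPU time over 10 hours of wall-clock time has Efficiency = 9.4 / 10 = 0.94, which is the average reported for ATLAS at Bari-ReCaS.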

Conclusions
- An extension of the INFN Tier-1 center with remote resources is possible; two different implementations have been tested:
  - INFN remote resources (Bari-ReCaS);
  - a commercial cloud provider (Aruba).
- Even though we are planning to upgrade the data center to host enough resources to guarantee at least the local execution of LHC Run 3 (2023), testing the (elastic) extension is important:
  - the infrastructure will not scale indefinitely;
  - external extensions could address temporary peak requests.