INFN Computing infrastructure - Workload management at the Tier-1

Introduction to INFN-T1

The Tier-1 at INFN-CNAF
- INFN-T1 started as a computing center for the LHC experiments
- Nowadays it provides services and resources to ~30 other scientific collaborations
- ~1,000 WNs, ~21,500 computing slots, ~200 kHS06
- Small HPC cluster available with IBA (~33 TFlops)
- 22 PB SAN disk (GPFS), 43 PB on tape (TSM) integrated as an HSM
- Supporting LTDP for the CDF experiment
- Dedicated network channel (60 Gb/s) for LHC OPN + LHC ONE
  - 20 Gb/s reserved for LHC ONE
  - Upgrade to a 100 Gb/s connection in 2017

Computing resource usage
- Computing resources are always completely used at INFN-T1, with a large amount of waiting jobs (50% of the running jobs)
- A huge resource request (and increase) is expected in the next years, mostly coming from the LHC experiments
[Figures: INFN Tier-1 farm usage in 2016; INFN Tier-1 resource increase]

WAN@CNAF (OPN+ONE upgrade 40 → 60 Gb/s)
[Network diagram: LHC OPN peers (RAL, SARA, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, KR-KISTI, RRC-KI, JINR) and the main Tier-2s on LHC ONE, reached via GARR Mi1/Bo1; IN2P3 and general IP via the NEXUS router; 20 Gb/s reserved for general IP connectivity; physical link shared by LHCOPN and LHCONE upgraded from 40 Gb/s (4x10 Gb) to 60 Gb/s (6x10 Gb); CNAF Tier-1 border router: Cisco 7600]

Physical Tier-1 network upgrade to 60 Gb/s
- 26 Sep: upgraded the T1 physical WAN bandwidth (LHC-OPN/ONE) from 4x10 Gb/s to 6x10 Gb/s
- 3 Oct: reached 60 Gb/s
- OPN: 90% of the peak traffic was XRootD towards WNs from CERN machines
- ONE: traffic on LHC ONE generated mainly by the Tier-2s
[Plots: OPN+ONE daily and weekly traffic; OPN daily; ONE daily]

Storage Resources
- 23 PB (net) of disk
  - GPFS 4.1, 3-4 PB for each file system
- 32 PB on tape
  - TSM 7.2, 17 tape drives (T10KD)
- High-density installations
  - 2015 tender: 10 PB (net) in 3 racks, number of servers decreased from 150 to 100
  - 2016 tender: 3 PB, Huawei OceanStor 6800 V3

Computing resources
- Farm power: 200 kHS06 (2016 tender still to be delivered)
- 2016 tender: Huawei X6800, 255 x dual Xeon E5-2618L v4 with 128 GB RAM
  - Should provide ~400 HS06 each
- We continuously get requests for extra pledges, so it is difficult to switch off old nodes; still, we decommissioned many racks this year
- New hardware for the virtualization infrastructure
  - Managed by oVirt
  - New VMware contract with a special discount price

Monitoring
- Complete refactoring of the monitoring and alarm tools across all CNAF functional units
- Past: Nagios, Lemon, home-made probes and sensors, legacy UIs
- Future: central infrastructure, Sensu + Uchiwa, InfluxDB and Grafana, community probes and sensors (home-made when necessary); see the metric sketch below
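To give a concrete flavour of the new chain, here is a minimal sketch of a probe pushing one farm metric to InfluxDB over the 1.x HTTP line-protocol endpoint; the URL, database and measurement names are placeholders, not the actual CNAF configuration.

    import time
    import requests

    # Hypothetical endpoint and database name, for illustration only.
    INFLUX_WRITE_URL = "http://influxdb.example.local:8086/write"
    DATABASE = "farm"

    def push_metric(measurement, value, tags):
        """Write one point to InfluxDB using the line protocol."""
        tag_str = ",".join("%s=%s" % (k, v) for k, v in sorted(tags.items()))
        line = "%s,%s value=%s %d" % (measurement, tag_str, value,
                                      int(time.time() * 1e9))  # ns timestamp
        resp = requests.post(INFLUX_WRITE_URL, params={"db": DATABASE}, data=line)
        resp.raise_for_status()

    # e.g. called by a Sensu handler or a home-made sensor:
    push_metric("running_jobs", 21500, {"cluster": "tier1"})

A Grafana dashboard can then read the same measurement back from InfluxDB.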

Provisioning
- All nodes in a common database (home-made solution)
  - Warranty information
  - Purchase data
  - OS and network configuration
- Nagios can be interfaced with the database to automatically build the node inventory (see the sketch below)
- Recently moved from Quattor to Puppet + Foreman
  - Installation and configuration are fully automated
  - Several manifests and configurations are shared among different departments
  - Better support thanks to the community
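As an illustration of the database-to-Nagios step, the sketch below generates Nagios host definitions from a node inventory; the SQLite file and the nodes(hostname, ip, rack) schema are assumptions, since the home-made database layout is not described here.

    import sqlite3

    # Illustrative schema; the real home-made inventory is not public.
    def nagios_hosts_from_inventory(db_path):
        """Build Nagios host stanzas from the node inventory database."""
        conn = sqlite3.connect(db_path)
        stanzas = []
        for hostname, ip, rack in conn.execute(
                "SELECT hostname, ip, rack FROM nodes"):
            stanzas.append(
                "define host {\n"
                "    use        generic-host\n"
                "    host_name  %s\n"
                "    address    %s\n"
                "    notes      rack %s\n"
                "}\n" % (hostname, ip, rack))
        conn.close()
        return "\n".join(stanzas)

    # The resulting text can be written into the Nagios objects directory
    # and picked up at the next configuration reload.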

Toward a (semi-)elastic Data Center?
- Planning to upgrade the data center to host resources at least until the end of LHC Run 3 (2023)
- Extending (dynamically) the farm to remote hosts as a complementary solution
  - Cloud bursting on a commercial provider: tests of "opportunistic computing" on cloud providers
  - Static allocation of remote resources: first production use case, part of the 2016 pledged resources for the WLCG experiments at INFN-T1 are in Bari-ReCaS
- Participating in the HNSciCloud PCP project

Workload management

Current situation
- T1 for the LHC experiments (ATLAS, ALICE, CMS, LHCb)
- Other "heavy users" (AMS, VIRGO)
- Other VOs (auger, biomed, borexino, juno, …)
- Two approaches to submit jobs to INFN-T1:
  - Grid (preferred)
  - Locally

Grid submission
- Users reach our site via CREAM CEs (6 configured so far) or through one of the many configured UIs (a minimal submission example is sketched below)
  - One generic UI available
  - Many experiment-specific UIs, to fulfill specific software or configuration requirements
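For reference, a minimal sketch of a Grid submission to a CREAM CE from a UI, assuming a valid VOMS proxy already exists; the CE endpoint and queue in the example are placeholders, not the real INFN-T1 CEs.

    import subprocess
    import tempfile

    # A trivial JDL describing the job; attribute values are illustrative.
    JDL = """[
      Executable    = "/bin/hostname";
      StdOutput     = "job.out";
      StdError      = "job.err";
      OutputSandbox = {"job.out", "job.err"};
    ]"""

    def submit_to_cream(ce_id="ce-example.cr.cnaf.infn.it:8443/cream-lsf-somequeue"):
        """Submit the JDL above to a CREAM CE and return the CREAM job id."""
        with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
            f.write(JDL)
            jdl_path = f.name
        # -a: automatic proxy delegation, -r: target CE id (host:port/cream-<lrms>-<queue>)
        out = subprocess.check_output(
            ["glite-ce-job-submit", "-a", "-r", ce_id, jdl_path])
        return out.decode().strip()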

Local submission
- The generic UI is available for local submission too
- To be able to log in, users must sign the CNAF AUP
- Access is given through bastion hosts, from which users can ssh to their experiment UI

Batch system
- LSF (currently v9.x) is the production batch system at INFN-T1
  - Adopted many years ago to solve the issues we had with Torque + Maui
  - Never a problem related to bugs, scalability or performance
  - Always found a configuration to fulfill every request coming from the experiments
  - Recently adopted a multi-core job configuration (see the submission sketch below)
- Installation with a "best-practices" configuration
  - Shared cNFS file system among all client nodes
  - 3 redundant masters
- Single LSF instance for the whole farm
  - Managing 1,000 WNs, 50 UIs, 6 CEs and the LHCb T2 resources
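From the user side, a multi-core submission is just a matter of asking for several slots on one host; below is a minimal Python wrapper around bsub, where the queue name and the span[] requirement are assumptions rather than the actual INFN-T1 settings.

    import subprocess

    def submit_multicore(command, ncores=8, queue="multicore"):
        """Submit a multi-core LSF job keeping all slots on a single WN."""
        bsub_cmd = [
            "bsub",
            "-q", queue,              # target queue (placeholder name)
            "-n", str(ncores),        # number of slots requested
            "-R", "span[hosts=1]",    # all slots on the same host
            command,
        ]
        out = subprocess.check_output(bsub_cmd)
        return out.decode().strip()   # e.g. 'Job <12345> is submitted to queue <multicore>.'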

Batch system
- 2 more, separate, LSF instances:
  - T3 Bologna: kept separate because those resources are not part of the INFN-T1 pledges
  - HPC cluster: has some special settings, kept separate for now

LSF problems
- Mainly one: the cost of the license
  - Agreement with IBM, acquired at a special price, but still very expensive!
  - The main concern is the future requirement of computing resources
  - Support ends by the end of 2018
  - In 2019 we will still be able to use the licenses we purchased, but with no right to upgrade and no support

Planning the switch
- HTCondor seems to be the best alternative for us
- We have software relying on LSF that needs to be adapted
  - This may be more time consuming than the batch system switch itself
- We need to instruct users
  - Are command wrappers for basic submission and query available? (a minimal sketch using the HTCondor Python bindings is shown below)
- We are not in a hurry, since the number of licenses is big enough to fulfill our requirements for the near future
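To show what basic submission and query look like on the HTCondor side, here is a minimal sketch using the HTCondor Python bindings; the executable, resource request and attribute values are arbitrary examples, not a proposed INFN-T1 configuration.

    import htcondor

    schedd = htcondor.Schedd()

    # Describe a simple 8-core job; all attributes below are illustrative.
    sub = htcondor.Submit({
        "executable":   "/bin/sleep",
        "arguments":    "60",
        "request_cpus": "8",
        "output":       "job.out",
        "error":        "job.err",
        "log":          "job.log",
    })

    # Submit one job and remember its cluster id (rough analogue of bsub).
    with schedd.transaction() as txn:
        cluster_id = sub.queue(txn)

    # Query the queue for that job (rough analogue of bjobs).
    for ad in schedd.query("ClusterId == %d" % cluster_id,
                           ["ProcId", "JobStatus", "RequestCpus"]):
        print(ad.get("ProcId"), ad.get("JobStatus"), ad.get("RequestCpus"))

Such thin wrappers could also be exposed to users as bsub/bjobs-like commands during the transition.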