Daniele Cesini – INFN-CNAF - 19/09/2017

Presentation transcript:

CNAF and CSN3
Daniele Cesini – INFN-CNAF
CSN3@Arenzano - 19/09/2017

CNAF Mission: 4 pillars
- Technology Transfer & External Funds: support for innovation and development projects and the recruitment of temporary staff; technology transfer towards industry, public administration and society at large.
- Scientific Computing: support for the 4 WLCG experiments, 30+ astro-particle and GW experiments, theoretical physics, beam simulations.
- Research and Innovation: distributed systems (Cloud and Grid), external projects, software development for experiments and external projects, tracking of new hardware technologies.
- INFN National IT Services: administrative services, networking services, document repositories, code repositories, web sites, ...

Current Organization Chart
[Organization chart: Scientific Computing, Research and Innovation, Technology Transfer & External Projects]

Experiments @CNAF
CNAF-Tier1 is officially supporting about 40 experiments:
- 4 LHC
- 34 non-LHC: 22 GR2 + VIRGO, 5 GR3 (AGATA/GAMMA, FAMU, NEWCHIM/FARCOS, NUCLEX, FAZIA), 7 GR1 non-LHC
Ten Virtual Organizations in opportunistic usage via Grid services (on both the Tier1 and the IGI-Bologna site).

User Support @ Tier1
- 6 group members (post-docs):
  - 3 group members, one per experiment, dedicated to ATLAS, CMS, LHCb
  - 3 group members dedicated to all the other experiments
- 1 close external collaboration for ALICE
- 1 group coordinator from the Tier1 staff
- Technical support for non-trivial problems provided by ALL the CNAF sysadmins

User Support activities
The group acts as a first level of support for the users:
- Incident initial analysis and escalation if needed
- Provides information to access and use the data center
- Takes care of communications between users and CNAF operations
- Tracks middleware bugs if needed
- Reproduces problematic situations
- Can create proxies for all VOs or belong to local account groups (see the sketch below)
- Provides consultancy to users for the creation of computing models
- Collects and tracks user requirements towards the data center, including tracking of extra-pledge requests
- Represents INFN-Tier1 at the WLCG coordination daily meeting (Run Coordinator)
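For illustration only, a minimal sketch of how a VOMS proxy is typically created with the standard VOMS client, here wrapped in Python; the VO name and the chosen lifetime are hypothetical placeholders, not CNAF-specific values.

```python
# Minimal sketch, assuming the standard VOMS clients are installed on the
# User Interface. The VO name "myexperiment.example" is a hypothetical placeholder.
import subprocess

def create_voms_proxy(vo: str, hours: int = 12) -> None:
    """Create a VOMS proxy valid for the given number of hours."""
    subprocess.run(
        ["voms-proxy-init",
         "--voms", vo,               # request VOMS attributes for this VO
         "--valid", f"{hours}:00"],  # proxy lifetime, HH:MM
        check=True,                  # raise if the client returns a non-zero exit code
    )

if __name__ == "__main__":
    create_voms_proxy("myexperiment.example")
```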

The Tier1@CNAF – numbers in 2017
- 25,000 CPU cores
- 27 PB disk storage
- 70 PB tape storage (pledged), 45 PB used
- Two small HPC farms:
  - 30 TFlops DP, InfiniBand interconnect (with GPUs and MICs)
  - 20 TFlops DP, CPU only, Omni-Path interconnect
- 1.2 MW electrical power (max available for IT), 0.7 MW used, PUE = 1.6

CNAF Future
The data transfer rate to CNAF tapes will increase in the next years:
- 250+ PB of tape by 2023
- 100 PB of disk by 2023
- 100k CPU cores by 2023
The CNAF power and cooling infrastructure already fits these requirements and is adequate up to LHC Run 3 (2023).

Access to the farm (CPU)
- Grid services with digital certificate authentication: gLite-WMS, gLite-CE, DIRAC
- Local access via LSF "bsub":
  - Login to bastion.cnaf
  - Login into a User Interface
  - Batch job submission to the farm with LSF bsub (see the sketch below)
- Interactive computing via Cloud@CNAF:
  - New cloud infrastructure, not officially funded (yet)
  - Best fits the needs of small collaborations without a highly distributed computing model
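As a sketch of the local submission path, the snippet below wraps LSF's bsub from Python on a User Interface; the queue name and the job script are hypothetical placeholders, since the real queue names are assigned per experiment.

```python
# Minimal sketch, assuming the LSF command-line tools are available on the
# User Interface. The queue name and the script path are hypothetical placeholders.
import subprocess

def submit_lsf_job(script_path: str, queue: str = "generic_queue") -> str:
    """Submit a job with 'bsub' and return the submission message."""
    result = subprocess.run(
        ["bsub",
         "-q", queue,          # target batch queue
         "-o", "job_%J.out",   # stdout file; %J expands to the LSF job ID
         "-e", "job_%J.err",   # stderr file
         script_path],         # script executed on a worker node
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # e.g. "Job <12345> is submitted to queue <generic_queue>."

if __name__ == "__main__":
    print(submit_lsf_job("./my_analysis.sh"))
```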

“Cloud” infrastructure
- 1392 VCPUs, 4.75 TB RAM, 50 TB disk storage
- Web interface to manage VMs and storage
- SSH access to the VMs (see the sketch below)
- FAZIA is the first use case
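A minimal sketch of the SSH access path to a VM provisioned on this infrastructure, using the paramiko library; the hostname, user name and key path are hypothetical placeholders, not actual CNAF endpoints.

```python
# Minimal sketch, assuming SSH key access to an already provisioned VM.
# Hostname, user name and key path are hypothetical placeholders.
import os
import paramiko

def run_remote(hostname: str, user: str, key_file: str, command: str) -> str:
    """Open an SSH connection, run one command and return its standard output."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # demo only: accept unknown host keys
    client.connect(hostname=hostname, username=user,
                   key_filename=os.path.expanduser(key_file))
    try:
        _stdin, stdout, _stderr = client.exec_command(command)
        return stdout.read().decode()
    finally:
        client.close()

if __name__ == "__main__":
    print(run_remote("vm.example.org", "analysis", "~/.ssh/id_rsa", "uname -a"))
```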

Resource distribution @ T1
GR3 non-LHC share of the Tier1 resources: CPU 0.1%, disk 0.4%, tape 1.7%.

CSN3 Resources at CNAF
[Plots: FAMU CPU usage in 2016, FAMU disk usage (5 TB), AGATA tape usage over the last 90 days]

Extra-pledge management
- We try to accommodate extra-pledge requests, in particular for CPU; handled manually ("RobinHood"):
  - Identification of temporarily inactive VOs
  - Old resources in offline racks that can be turned on if needed
- Much more difficult for storage
- Working on automatic solutions to offload to external resources:
  - Cloud: ARUBA, Microsoft, HNSciCloud project
  - Other INFN sites: extension to Bari

Resource Usage – number of jobs
[Plots: number of Grid jobs and local jobs]
- LSF handles the batch queues
- Pledges are respected by dynamically changing the priority of jobs in the queues within pre-defined time windows
- Short jobs are highly penalized; short jobs that fail immediately are even more penalized

LHC Tier1 Availability and Reliability

WAN@CNAF (Status)
[Network diagram, courtesy of Stefano Zani]
- 60 Gb/s physical link (6x10 Gb/s) between the CNAF Tier1 and GARR, shared by LHCOPN and LHCONE
- LHCOPN: 40 Gb/s towards CERN
- LHCONE: up to 60 Gb/s (GEANT peering)
- 20 Gb/s for General IP connectivity
- +20 Gbit/s towards INFN-Bari
- Sites reachable over LHCOPN/LHCONE include RAL, SARA (NL), PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, KR-KISTI, RRC-KI, JINR, KIT T1, IN2P3 T1 and the main Tier-2s

Data Management Services
- On-line disk storage: distributed filesystem based on IBM GPFS (now Spectrum Scale)
- Tape storage: disk buffer for recalled files, long-term data preservation
- POSIX local access to the disk storage from the UIs and WNs
- Remote access through:
  - Standard Grid services (SRM/GridFTP); AuthN/AuthZ based on digital certificates, but simplified wrappers can be provided if needed
  - XRootD service
  - Standard web protocols (HTTP/WebDAV); in progress: Federated AAI (INFN single sign-on)
- Custom user services
A sketch of remote access over WebDAV/HTTPS is shown below.
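To give a flavour of the remote access options, the sketch below downloads a file from a WebDAV/HTTPS endpoint authenticating with an X.509 proxy; the endpoint URL, file path and CA directory are hypothetical or site-dependent values, not official CNAF endpoints.

```python
# Minimal sketch, assuming a WebDAV/HTTPS storage endpoint that accepts X.509
# proxy authentication. URL, file path and proxy location are placeholders.
import os
from typing import Optional

import requests

def webdav_download(url: str, local_path: str, proxy: Optional[str] = None) -> None:
    """Stream a remote file to disk, authenticating with a Grid proxy certificate."""
    # A proxy created with voms-proxy-init usually lives in /tmp/x509up_u<uid>.
    proxy = proxy or f"/tmp/x509up_u{os.getuid()}"
    with requests.get(url,
                      cert=proxy,                                # proxy file holds both cert and key
                      verify="/etc/grid-security/certificates",  # CA directory on Grid hosts
                      stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(local_path, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                out.write(chunk)

if __name__ == "__main__":
    webdav_download("https://storage.example.org/webdav/myexp/run001.root", "run001.root")
```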

2014 HPC Cluster
- 27 Worker Nodes, 904 HT cores:
  - 640 HT cores E5-2640
  - 48 HT cores X5650
  - 48 HT cores E5-2620
  - 168 HT cores E5-2683v3
- 15 GPUs: 8 Tesla K40, 7 Tesla K20, 2x(4 GRID K1)
- 2 MICs: 2x Xeon Phi 5100
- Dedicated storage: 2 disk servers, 60 TB shared disk space, 4 TB shared home
- InfiniBand interconnect (QDR)
- Ethernet interconnect: 48x1 Gb/s + 8x10 Gb/s
- Peak performance (DP): CPU 6.5 TFlops, GPU 19.2 TFlops, MIC 2.0 TFlops, total 27.7 TFlops

2017 HPC Cluster
- 12 Worker Nodes, CPU: 768 HT cores, dual E5-2683v4 @ 2.1 GHz (16 cores)
- 1 KNL node: 64 cores (256 HT cores)
- Dedicated storage: 2 disk servers + 2 JBODs, 300 TB shared disk space (150 TB with replica 2)
- 18 TB SSD-based file system using 1 SSD on each WN, used for home directories
- Omni-Path interconnect (100 Gbit/s)
- Ethernet interconnect: 48x1 Gb/s + 4x10 Gb/s
- Peak performance (DP): CPU 6.5 TFlops, MIC 2.6 TFlops, total 8.8 TFlops
- Status: WNs installed, storage under testing; will be used by CERN only; can be expanded

Contacts and Coordination Meetings
- GGUS ticketing system: https://ggus.eu
- Mailing lists: user-support<at>cnaf.infn.it, hpc-support<at>cnaf.infn.it
- Monitoring system: https://mon-tier1.cr.cnaf.infn.it/
- FAQs and User Guide: https://www.cnaf.infn.it/utenti-faq/ and https://www.cnaf.infn.it/wp-content/uploads/2016/12/tier1-user-guide-v7.pdf
- Monthly CdG meetings: https://agenda.cnaf.infn.it/categoryDisplay.py?categId=5
- "No meeting" does not mean "no problems": this is not enough, so we will organize ad-hoc meetings to define the desiderata and collect complaints

Conclusion
We foster the use of the CNAF data center for:
- Massive batch computation
- Interactive computing on the Cloud@CNAF infrastructure
- Experiment data preservation
Simplified data management services are available, not only Grid services.
Small HPC clusters are available for development, tests and small productions; if needed (and funded) they can be expanded.
The User Support team is fully committed to providing assistance in porting computing models to the CNAF infrastructures.