CNAF and CSN3
Daniele Cesini – INFN-CNAF
CSN3 @ Arenzano – 19/09/2017
CNAF Mission: 4 pillars

- Technology Transfer & External Funds: supports innovation and development projects and the recruitment of temporary staff; technology transfer towards industry, public administration and society at large.
- Scientific Computing: support for the 4 WLCG experiments, 30+ astroparticle and GW experiments, theoretical physics, beam simulations.
- Research and Innovation: distributed systems (Cloud and Grid), external projects, software development for experiments and external projects, tracking of new hardware technologies.
- INFN National IT Services: administrative services, networking services, document repositories, code repositories, web sites, ...
Current Organization Chart

[Organization chart: Scientific Computing; Research and Innovation; Technology Transfer & External Projects]
Experiments @ CNAF

CNAF-Tier1 officially supports about 40 experiments:
- 4 LHC
- 34 non-LHC:
  - 22 GR2 + VIRGO
  - 5 GR3: AGATA/GAMMA, FAMU, NEWCHIM/FARCOS, NUCLEX, FAZIA
  - 7 GR1 non-LHC
Ten Virtual Organizations in opportunistic usage via Grid services (on both the Tier1 and the IGI-Bologna site).
User Support @ Tier1

6 group members (post-docs):
- 3 group members, one per experiment, dedicated to ATLAS, CMS and LHCb
- 3 group members dedicated to all the other experiments
- 1 close external collaboration for ALICE
- 1 group coordinator from the Tier1 staff
Technical support for non-trivial problems is provided by ALL the CNAF sysadmins.
User Support activities

The group acts as a first level of support for the users:
- Initial incident analysis and escalation if needed
- Provides information on how to access and use the data center
- Takes care of communications between users and CNAF operations
- Tracks middleware bugs if needed
- Reproduces problematic situations
- Can create proxies for all VOs or belong to local account groups
- Provides consultancy to users for the creation of computing models
- Collects and tracks user requirements towards the data center, including tracking of extra-pledge requests
- Represents INFN-Tier1 in the daily WLCG coordination meeting (Run Coordinator)
The Tier1 @ CNAF – numbers in 2017

- CPU cores
- 27 PB disk storage
- 70 PB tape storage (pledged), 45 PB used
- Two small HPC farms:
  - 30 TFlops (DP), InfiniBand interconnect (with GPUs and MICs)
  - 20 TFlops (DP), CPU only, OmniPath interconnect
- 1.2 MW electrical power (max available for IT), 0.7 MW used
- PUE = 1.6
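As a back-of-the-envelope illustration (assuming the 0.7 MW figure above refers to the IT load), the PUE value implies roughly:

```latex
\mathrm{PUE} = \frac{P_{\text{facility}}}{P_{\text{IT}}} = 1.6
\quad\Rightarrow\quad
P_{\text{facility}} \approx 1.6 \times 0.7\,\mathrm{MW} \approx 1.1\,\mathrm{MW}
```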
CNAF Future

The data transfer rate to CNAF tapes will increase in the next years:
- 250+ PB of tape by 2023
- 100 PB of disk by 2023
- 100k CPU cores by 2023
The CNAF power and cooling infrastructure already fits these requirements and is adequate up to LHC Run 3 (2023).
Access to the farm (CPU)

- Grid services with digital certificate authentication: gLite-WMS, gLite-CE, DIRAC
- Local access via "bsub" (a minimal sketch follows below):
  - Login to bastion.cnaf
  - Login into a User Interface
  - Batch job submission to the farm with LSF bsub
- Interactive access via the new Cloud infrastructure:
  - Not officially funded (yet)
  - Best fits the needs of small collaborations without a highly distributed computing model
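The following is a minimal sketch of local batch submission, assuming LSF is available on the User Interface; the queue name, job script and wrapper function are hypothetical placeholders, not an official CNAF tool.

```python
#!/usr/bin/env python3
"""Minimal sketch: submit a batch job to the farm with LSF's bsub.

Assumes LSF is configured on the User Interface; the queue name and the
job script are hypothetical placeholders.
"""
import subprocess


def submit_job(script="run_analysis.sh", queue="hypothetical_queue",
               slots=1, logfile="job_%J.out"):
    """Wrap 'bsub' and return its textual response (e.g. the job ID line)."""
    cmd = [
        "bsub",
        "-q", queue,        # target queue (placeholder name)
        "-n", str(slots),   # number of job slots
        "-o", logfile,      # stdout/stderr log; %J expands to the LSF job ID
        f"./{script}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(submit_job())
```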
"Cloud" infrastructure

- 1392 VCPUs
- 4.75 TB RAM
- 50 TB disk storage
- Web interface to manage VMs and storage (an illustrative API sketch follows below)
- SSH access
- FAZIA is the first use case
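A sketch of programmatic access to a cloud tenant, assuming an OpenStack-based deployment (an assumption, not stated on the slide) and a hypothetical clouds.yaml entry named "cnaf-cloud"; the web dashboard remains the primary way to manage VMs and storage.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: list a project's VMs and volumes.

Assumes an OpenStack-based cloud (an assumption, not stated on the slide)
and a clouds.yaml entry named 'cnaf-cloud' (hypothetical).
"""
import openstack

# Credentials are read from clouds.yaml or OS_* environment variables.
conn = openstack.connect(cloud="cnaf-cloud")

# Virtual machines visible to the current project.
for server in conn.compute.servers():
    print(server.name, server.status)

# Block-storage volumes of the current project.
for volume in conn.block_storage.volumes():
    print(volume.name, volume.size, "GB")
```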
Resource distribution @ T1

GR3 (non-LHC) share of the Tier1 resources: CPU 0.1%, disk 0.4%, tape 1.7%.
CSN3 Resources at CNAF

[Plots: AGATA tape usage over the last 90 days; FAMU disk usage (5 TB); FAMU CPU usage during 2016]
Extra-pledge management

We try to accommodate extra-pledge requests, in particular for CPU:
- Handled manually ("RobinHood"): identification of temporarily inactive VOs
- Old resources in offline racks that can be turned on if needed
- Much more difficult for storage
Working on automatic solutions to offload to external resources:
- Cloud: ARUBA, Microsoft, HNSciCloud project
- Other INFN sites: extension to Bari
Resource Usage – number of jobs

[Plots: Grid jobs and local jobs]
- LSF handles the batch queues
- Pledges are respected by dynamically changing the priority of the jobs in the queues within pre-defined time windows (a toy illustration follows below)
- Short jobs are highly penalized
- Short jobs that fail immediately are penalized even more
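A toy sketch of the idea behind pledge-aware dynamic priorities; it is purely illustrative and is not the actual LSF fairshare configuration used at the Tier1 (formula and numbers are invented for the example).

```python
#!/usr/bin/env python3
"""Toy illustration of pledge-aware dynamic priorities.

Purely illustrative: the real farm relies on LSF's own fairshare
scheduling; the formula and the usage numbers below are invented.
"""


def dynamic_priority(pledged_share, used_share, base_priority=100.0):
    """Boost VOs below their pledge in the current window, demote VOs above it."""
    if used_share <= 0:
        return 2.0 * base_priority  # a completely idle VO gets a strong boost
    # Ratio > 1 means the VO used less than its pledge so far in the window.
    return base_priority * (pledged_share / used_share)


# Hypothetical usage snapshot over one scheduling window.
vos = {
    "atlas": {"pledged": 0.30, "used": 0.35},   # above pledge -> lower priority
    "virgo": {"pledged": 0.10, "used": 0.04},   # below pledge -> higher priority
    "gr3":   {"pledged": 0.01, "used": 0.001},
}

for name, share in vos.items():
    print(f"{name:6s} priority = {dynamic_priority(share['pledged'], share['used']):7.1f}")
```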
LHC Tier1 Availability and Reliability
WAN @ CNAF (Status)

[Network diagram, courtesy of Stefano Zani: CNAF Tier1 connectivity towards CERN, the other Tier1s (RAL, SARA (NL), PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, KR-KISTI, RRC-KI, JINR, KIT, IN2P3) and the main Tier2s over LHCOPN/LHCONE via GARR]
- 60 Gb physical link (6x10 Gb) to GARR-BO, shared by LHCOPN and LHCONE
- LHCOPN: 40 Gb/s to CERN
- LHCONE: up to 60 Gb/s (GEANT peering)
- 20 Gb/s for general IP connectivity
- + 20 Gbit/s towards INFN-Bari
Data Management Services

- On-line disk storage: distributed filesystem based on IBM GPFS (now Spectrum Scale)
- Tape storage: disk buffer for recalled files, long-term data preservation
- POSIX local access to the disk storage from the UIs and WNs
- Remote access through:
  - Standard Grid services (SRM/GridFTP): AuthN/AuthZ based on digital certificates, but simplified wrappers can be provided if needed
  - XRootD service
  - Standard web protocols (HTTP/WebDAV) (see the sketch below); in progress: federated AAI (INFN single sign-on)
  - Custom user services
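A minimal sketch of remote read access over the standard web protocols listed above, assuming X.509 certificate authentication; the endpoint URL, file path and certificate locations are hypothetical placeholders that depend on the experiment setup.

```python
#!/usr/bin/env python3
"""Minimal sketch: read a file over HTTPS/WebDAV with X.509 authentication.

The endpoint, path and certificate locations are hypothetical placeholders;
the real ones depend on the experiment's storage area and host setup.
"""
import requests

# Hypothetical WebDAV endpoint and file path.
URL = "https://webdav.example.org/myexperiment/data/run001.root"

response = requests.get(
    URL,
    cert=("usercert.pem", "userkey.pem"),      # user certificate and (unencrypted) key
    verify="/etc/grid-security/certificates",  # CA directory as found on Grid hosts
    stream=True,
)
response.raise_for_status()

with open("run001.root", "wb") as out:
    for chunk in response.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
        out.write(chunk)

print("downloaded run001.root")
```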
2014 HPC Cluster

- 27 worker nodes, CPU: 904 HT cores
  - 640 HT cores E5-2640
  - 48 HT cores X5650
  - 48 HT cores E5-2620
  - 168 HT cores E5-2683v3
- 15 GPUs: 8 Tesla K40, 7 Tesla K20, plus 2x(4 GRID K1)
- 2 MICs: 2x Xeon Phi 5100
- Dedicated storage: 2 disk servers, 60 TB shared disk space, 4 TB shared home
- InfiniBand interconnect (QDR)
- Ethernet interconnect: 48x1 Gb/s + 8x10 Gb/s

Peak performance (TFlops, DP): CPU 6.5, GPU 19.2, MIC 2.0, total 27.7

[Diagram: worker nodes connected via IB QDR and Ethernet (1 Gb/s, 10 Gb/s, 2x10 Gb/s) to the disk servers (60+4 TB)]
2017 HPC Cluster

- 12 worker nodes, CPU: 768 HT cores (dual 16-core CPUs per node)
- 1 KNL node: 64 cores (256 HT cores)
- Dedicated storage: 2 disk servers + 2 JBODs, 300 TB shared disk space (150 TB with replica 2)
- 18 TB SSD-based file system using 1 SSD on each WN, used for home directories
- OmniPath interconnect (100 Gbit/s)
- Ethernet interconnect: 48x1 Gb/s + 4x10 Gb/s

Peak performance (TFlops, DP): CPU 6.5, MIC 2.6, total 8.8

[Diagram: worker nodes connected via OPA 100 Gb/s and Ethernet (1 Gb/s, 2x10 Gb/s) to the disk servers (2x150 TB over 12 Gb/s SAS) and to the 18 TB SSD file system]

Status:
- WNs installed, storage under testing
- Will be used by CERN only
- Can be expanded
Contacts and Coordination Meetings

- GGUS ticketing system
- Mailing lists: user-support<at>cnaf.infn.it, hpc-support<at>cnaf.infn.it
- Monitoring system
- FAQs and User Guide (v7.pdf)
- Monthly CDG meetings
- "No meeting, no problems" is not enough: we will organize ad-hoc meetings to define the desiderata and collect complaints
Conclusion

We foster the usage of the CNAF data center for:
- Massive batch computation
- Interactive computing on the infrastructure
- Experiment data preservation
Simplified data management services are available, not only Grid services.
Small HPC clusters are available for development, tests and small productions; if needed (and funded) they can be expanded.
The User Support team is fully committed to providing assistance in porting computing models to the CNAF infrastructures.