1
COMPUTING FOR ALICE IN THE CZECH REPUBLIC in 2016/2017
D. Adamova, with the help and support of: J. Chudoba, J. Svec, V. Rikal, A. Mikula, M. Adam and J. Hampl. Strasbourg, 2017
2
Outline
Status of the WLCG Tier-2 site in Prague
ALICE operations in 2016/2017
Report on the delivery of mandatory resources
A few assorted issues
Summary and Outlook
3
The geographical layout
Nuclear Physics Institute AS CR (NPI): a large ALICE group, no ATLAS involvement; a (smaller) part of the ALICE storage resources.
Institute of Physics AS CR (FZU): the regional computing center and WLCG Tier-2 site praguelcg2; all the CPU resources for ALICE and a (larger) part of the ALICE storage resources; a quite small ALICE group, but a much larger ATLAS community.
The two institutes are linked by a 10 Gb/s dedicated fiber link.
Originally, the xrootd storage servers for ALICE were only at NPI. Subsequently, xrootd servers were also put into operation at FZU, and now even the redirector is there.
4
HEP Computing in Prague: site praguelcg2 (a.k.a. the farm GOLIAS)
A national computing center for processing data from various HEP experiments, located at the Institute of Physics (FZU) in Prague. The basic infrastructure existed already in 2002, but the center officially started in 2004. It is certified as a Tier-2 center of the LHC Computing Grid (praguelcg2) and collaborates with several Grid projects; in April 2008 the WLCG MoU was signed by the Czech Republic (ALICE + ATLAS). Very good network connectivity: multiple dedicated 1-10 Gb/s connections to collaborating institutions and a 2x10 Gb/s connection to LHCONE (upgraded from 10 Gb/s). Provides computing services for ATLAS, ALICE, D0, solid-state physics, Auger, NOvA, CTA, ...
5
Some pictures: servers xrootd1 (125 TB) and xrootd2 (89 TB)
Server xrootd5 (422 TB)
6
Current numbers
1 batch system (Torque + Maui); 2 main WLCG VOs: ALICE and ATLAS; other users: Auger, NOvA, CTA, FNAL's D0 user group.
220 WNs with 4556 CPU cores published in the Grid; ~4 PB of disk storage (DPM, XRootD, NFS) and a tape library.
Regular yearly scale-up of resources on the basis of various financial support, mainly academic grants.
The WLCG services include: APEL publisher, Argus authorization service, BDII, several UIs, the ALICE vobox, CREAM CEs, Storage Elements, lcg-CA.
Monitoring: Nagios, Munin, Ganglia, NetFlow, Observium. Log processing and visualization: Elasticsearch, Kibana + Logstash, i.e. the ELK stack. Service management by Puppet. The use of virtualization at the site is quite extensive.
ALICE disk XRootD Storage Element ALICE::Prague::SE: ~1.6 PB of disk space in total, a distributed storage cluster with the redirector plus 5 servers at FZU and 4 at NPI Rez, and a 10 Gb/s dedicated connection from FZU to the NPI ALICE storage cluster.
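Since ALICE::Prague::SE is a distributed XRootD cluster behind a single redirector, clients never need to know which data server actually stores a given file. Below is a minimal sketch of how this can be checked from a UI using the standard xrdfs/xrdcp command-line clients; the redirector hostname and the sample file path are placeholders, not the real site values.

```python
#!/usr/bin/env python3
"""Sketch: ask an XRootD redirector where a file really lives, then read it.
Hostname and file path are hypothetical placeholders."""
import subprocess

REDIRECTOR = "root://alice-redirector.example.cz:1094"  # placeholder redirector
SAMPLE_FILE = "/alice/data/some_file.root"              # placeholder file path

# 'xrdfs locate' asks the redirector which data server(s) hold the file;
# for ALICE::Prague::SE the answer would be one of the FZU or NPI Rez servers.
locate = subprocess.run(
    ["xrdfs", REDIRECTOR, "locate", SAMPLE_FILE],
    capture_output=True, text=True, check=True,
)
print("Served by:", locate.stdout.strip())

# A read goes through the same redirector; the client is redirected
# transparently to the data server reported above.
subprocess.run(
    ["xrdcp", "-f", f"{REDIRECTOR}/{SAMPLE_FILE}", "/tmp/test_copy.root"],
    check=True,
)
```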
7
Annual growth of total on-site resources since 2012
8
All disk servers are connected via 10 Gb/s to the central Cisco Nexus 3172PQ switch. Worker nodes are connected at 1 Gb/s to top-of-rack switches, which have 10 Gb/s uplinks to the central switch.
9
External connectivity, provided by CESNET
The core load on the external connections comes from the LHC experiments ATLAS and ALICE. Before the upgrade to 2x10 Gb/s, the LHCONE link was often saturated up to its limit.
10
Prague LHCONE connectivity: 2 x 10 Gb/s
Data traffic on the FZU LHCONE connection.
11
IPv6 deployment: all worker and storage nodes have both IPv4 and IPv6 addresses, and local transfers via the GridFTP protocol preferably use IPv6.
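A quick way to see this dual-stack setup from any node is to resolve a storage host and inspect the returned address families; a minimal Python sketch, where the hostname is a placeholder and port 2811 is the standard GridFTP control port.

```python
#!/usr/bin/env python3
"""Sketch: check that a dual-stack node resolves both AAAA and A records.
With the usual RFC 6724 address ordering, IPv6 addresses typically come
first, which is why local GridFTP transfers end up preferring IPv6."""
import socket

HOST = "storage01.example.cz"  # placeholder; substitute a real storage node

for family, _, _, _, sockaddr in socket.getaddrinfo(HOST, 2811,
                                                    type=socket.SOCK_STREAM):
    label = "IPv6" if family == socket.AF_INET6 else "IPv4"
    print(f"{label}: {sockaddr[0]}")
```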
12
The ELK Stack is used for log processing and visualization.
An example of Kibana filtering.
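For illustration, the kind of filtering done interactively in Kibana can also be expressed as a direct query against the Elasticsearch REST API. This is only a hedged sketch: the endpoint, the index pattern and the field names are assumptions, not the site's actual configuration.

```python
#!/usr/bin/env python3
"""Sketch: a Kibana-style filter expressed as an Elasticsearch query.
Endpoint, index pattern and field names are illustrative assumptions."""
import json
import requests

ES_URL = "http://elasticsearch.example.cz:9200"  # placeholder endpoint
INDEX = "logstash-*"                             # typical Logstash index pattern

query = {
    "query": {
        "bool": {
            "must": [{"match": {"program": "gridftp"}}],               # assumed field
            "filter": [{"range": {"@timestamp": {"gte": "now-24h"}}}],
        }
    },
    "size": 5,
}

resp = requests.get(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
for hit in resp.json()["hits"]["hits"]:
    print(json.dumps(hit["_source"], indent=2))
```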
13
ALICE operations in 2016/2017
Running jobs profile at the Prague Tier-2 over the November-to-November period (chart legend: ATLAS, ATLAS multicore, ALICE). The ALICE local share in 2016 is lower than in 2015: opportunistic resources were not so easy to use because ATLAS and NOvA are very active.
Running jobs profile in 2016/2017 as shown in ALICE MonALISA: the average number of jobs is 1192, with peaks up to 3070.
14
CPU resources share & site availability in 2016/2017
In 2016, the Czech Republic share in the Tier-2 CPU resources delivery was ~3.1%; in 2017, the Prague share is ~2.7%. Services availability in 2016.
15
Partial upgrade of the Prague SE cluster
On March 2, a new switch/router, a 16-port Cisco Catalyst 4500-X, was installed at NPI and the lids* storage machines were moved to 10 Gb/s connectivity. Traffic peaks on the individual lids* servers went up from close to 100 MB/s to over 500 MB/s.
16
Tuning/upgrade of the xrootd cluster
January 2016 – April 2017: the outgoing traffic at the Prague/Rez ALICE storage cluster peaked above 3.8 GB/s, with almost 32.1 PB of data read out. Traffic on the dedicated NPI <-> FZU line reached maximum rates over 5 Gb/s, so the new 10 Gb/s Ethernet connectivity is not yet fully exploited and some additional tuning is necessary. Still, data traffic on the servers went up after the new Cisco switch was installed and the lids* servers were put on 10 Gb/s.
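A simple way to spot whether a single server (or the NPI <-> FZU line) is the bottleneck is to time a single-stream read from one data server. Below is a rough sketch only, with the server name and test file as hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Sketch: rough single-stream read-throughput check against one xrootd
data server, e.g. to see the effect of the 10 Gb/s upgrade.
Server name and test file are hypothetical placeholders."""
import subprocess
import time

SERVER = "root://lids1.example.cz:1094"         # placeholder data server
TEST_FILE = "/alice/data/large_test_file.root"  # placeholder, ideally a few GB

# Get the file size from the server metadata ('xrdfs stat' prints a 'Size:' line).
stat_out = subprocess.run(
    ["xrdfs", SERVER, "stat", TEST_FILE],
    capture_output=True, text=True, check=True,
).stdout
size_bytes = int(stat_out.split("Size:")[1].split()[0])

# Time a copy to /dev/null so only the network and server read speed matter.
start = time.monotonic()
subprocess.run(["xrdcp", "-f", f"{SERVER}/{TEST_FILE}", "/dev/null"], check=True)
elapsed = time.monotonic() - start

print(f"{size_bytes / elapsed / 1e6:.0f} MB/s single-stream read")
```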
17
Czech Republic resources delivery
MANDATORY ACCORDING TO THE ALICE CONSTITUTION: 1.96% of the total required T2 resources.
CPU (kHEPSPEC06):
2016: ALICE requirement 7,2; REBUS pledged ,0; REBUS delivered ,85
2017: ALICE requirement ,22; REBUS pledged ,0; REBUS delivered ,4
Disk (PB):
2016: ALICE requirement ,92; REBUS pledged ,4; REBUS delivered 1,79
2017: ALICE requirement ,12; REBUS pledged ,4; REBUS delivered 1,79
For the ALICE Czech group it is easier to scale up the storage capacity than to pile up CPUs. The 2016 ALICE requests for computing and storage resources were satisfied, even though the CPU pledges were low: by stable operations with almost no outages, by keeping servers that are out of warranty in production, and by extensive use of other projects' resources.
18
ARC-CE test operations
For the last couple of months, ARC-CE has been in operation in Prague for NOvA and ATLAS. In March, a second (ARC) vobox was set up and a small part of the ALICE jobs has been submitted via ARC-CE. The final goal is to replace CREAM + Torque by ARC + HTCondor; currently, both systems co-exist. Prague ARC-CE submission, 1.3.2017 – 22.4.2017: on average 70, at peak 425 running jobs.
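As an aside, a hand-test submission to an ARC-CE with the standard ARC client tools looks roughly like the sketch below; the CE hostname is a placeholder, and production ALICE jobs of course go through the vobox rather than manual arcsub calls.

```python
#!/usr/bin/env python3
"""Sketch: submit a trivial test job to an ARC-CE with the standard ARC
client tools (arcsub/arcstat). Requires a valid grid proxy; the CE
hostname is a hypothetical placeholder."""
import pathlib
import subprocess

CE = "arc-ce.example.cz"  # placeholder ARC-CE endpoint

# Minimal xRSL job description: run /bin/hostname and keep stdout/stderr.
xrsl = pathlib.Path("test_job.xrsl")
xrsl.write_text(
    '&(executable="/bin/hostname")'
    '(stdout="out.txt")(stderr="err.txt")'
    '(jobname="alice-arc-test")\n'
)

subprocess.run(["arcsub", "-c", CE, str(xrsl)], check=True)  # submit
subprocess.run(["arcstat", "-a"], check=True)                # poll job status
```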
19
Planned compute/storage capacity upgrades
In May 2017, a new grant project is starting (running until November). It is complementary to the traditional big CERN – CR collaboration grant and is aimed at investments into computing for CERN (in our case, ALICE and ATLAS); investments into computing resources are absent from the "big" CERN – CR grant. The total capacity is shared in the ratio ALICE : ATLAS = 1 : 3, according to the ratio of scientists participating in the projects. Unfortunately, the limits are low: approx. 450 kEUR in total for the hardware purchases for both experiments. The schedule of purchases:
2017: a compute cluster with a capacity of 20 – 25 kHS06, officially shared between ALICE and ATLAS in the ratio 1 : 3.
2018: storage servers/cluster with a capacity of 300 – 400 TB, exclusively for ALICE and operated at NPI.
2019: a compute cluster with a capacity of ~13 kHS06, again shared ALICE : ATLAS = 1 : 3.
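Under the stated 1 : 3 split, ALICE's slice of the shared compute purchases works out as a simple back-of-the-envelope calculation:

```python
#!/usr/bin/env python3
"""Back-of-the-envelope: ALICE's slice of the planned shared clusters,
using the ALICE : ATLAS = 1 : 3 ratio from the grant plan."""
ALICE_FRACTION = 1 / (1 + 3)  # ALICE gets 1 part out of 4

print(f"2017 cluster (20-25 kHS06): ALICE ~{20 * ALICE_FRACTION:.1f}"
      f"-{25 * ALICE_FRACTION:.2f} kHS06")
print(f"2019 cluster (~13 kHS06):   ALICE ~{13 * ALICE_FRACTION:.2f} kHS06")
# The 2018 storage purchase (300-400 TB at NPI) is exclusively for ALICE,
# so no split applies there.
```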
20
Aspects of further future scale-up
The future LHC runs will require a huge scale-up of computing, storage and network resources, which will be impossible for us to achieve with the expected budget. Activities towards getting access to external resources, HPC centers or computational clouds are ongoing at all LHC experiments. On the local level, there are lengthy discussions with the HPC center IT4Innovations (IT4I) in Ostrava. The showstopper is the absence of tools for submitting ALICE jobs to the site: we would need an IT student with some experience to work on the submission interface, and it is proving difficult to find one.
21
Summary and Outlook
The Czech Republic performed well during 2016 in the delivery of mandatory resources: high availability, reliable delivery of services, fast response to problems.
In 2017, we provide enough storage but are under-delivering in compute resources.
A network upgrade was done at NPI, putting the local part of ALICE::Prague::SE on 10 Gb/s connectivity.
A new grant is starting in May, with planned investments into compute and storage resources allowing for a scale-up of the current capacities.
In the upcoming years, we will do our best to keep up the reliability and performance of the services and to deliver the pledges.
22
Backups
23
Czech Republic CPU delivery in 2017
The Prague share in the Tier-2 CPU resources delivery in 2017 was ~2.7%.
24
ALICE operations in 2016
Running jobs profile at the Prague Tier-2 in March 2016 (chart legend: ATLAS, ALICE). The ALICE share is lower than in 2015. As shown in ALICE MonALISA, the number of concurrently running ALICE jobs at FZU in the first months of 2016 was on average ~1300, with peaks of ~3000.