COMPUTING FOR ALICE IN THE CZECH REPUBLIC in 2015/2016

Presentation transcript:

COMPUTING FOR ALICE IN THE CZECH REPUBLIC in 2015/2016 D. Adamova, with the help and support of: J. Chudoba, J. Svec, V. Rikal, A. Mikula, M. Adam and J. Hampl

Outline
- Status of the WLCG Tier-2 site in Prague
- ALICE operations in 2015/2016
- Report on the delivery of mandatory resources: what is really mandatory?
- Various issues
- Summary and Outlook

The geographical layout
Nuclear Physics Institute AS CR (NPI), Rez: a large ALICE group, no ATLAS involvement; hosts a (smaller) part of the ALICE storage resources.
Institute of Physics AS CR (FZU), Prague: regional computing center, WLCG Tier-2 site praguelcg2; provides all the CPU resources for ALICE and a (larger) part of the storage resources for ALICE; quite a small ALICE group, much larger ATLAS community.
The two institutes are connected by a 10 Gbps dedicated fiber link.
Originally, the xrootd storage servers for ALICE were only at NPI. Subsequently, xrootd servers were also put into operation at FZU, and now even the redirector is there.
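As a minimal sketch of how this distributed setup behaves from a client's point of view (the hostname and file path below are hypothetical placeholders, not the real site configuration): a client always talks to the redirector, which replies with the data server, at FZU or at NPI, that actually holds the requested file.

```python
# Minimal sketch: ask the XRootD redirector which data server holds a file.
# The redirector hostname and the sample path are hypothetical placeholders.
import subprocess

REDIRECTOR = "alice-redirector.example.cz:1094"      # hypothetical host, default xrootd port
SAMPLE_FILE = "/alice/example/path/file.root"        # hypothetical path

def locate_replica(redirector: str, path: str) -> str:
    """Return the location(s) reported by the redirector for 'path'."""
    result = subprocess.run(
        ["xrdfs", redirector, "locate", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # The reported server sits either at FZU or at NPI Rez; the 10 Gbps
    # dedicated fiber link makes the two halves behave as a single SE.
    print(locate_replica(REDIRECTOR, SAMPLE_FILE))
```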

HEP Computing in Prague: site praguelcg2 (a.k.a. the farm GOLIAS)
A national computing center for processing data from various HEP experiments, located at the Institute of Physics (FZU) in Prague.
Basic infrastructure already in 2002, officially started in 2004; certified as a Tier-2 center of the LHC Computing Grid (praguelcg2); collaboration with several Grid projects. In April 2008, the WLCG MoU was signed by the Czech Republic (ALICE+ATLAS).
Very good network connectivity: multiple dedicated 1 – 10 Gb/s connections to collaborating institutions; 10 Gb/s connection to LHCONE.
Provides computing services for ATLAS + ALICE, D0, solid state physics, Auger, Nova, CTA ...

Some pictures: servers xrootd1 (125 TB) and xrootd2 (89 TB); server xrootd5 (422 TB)

Current numbers
- 1 batch system (Torque + Maui); 2 main WLCG VOs: ALICE and ATLAS; FNAL's D0 (dzero) user group; other VOs: Auger, Nova, CTA
- 258 WNs with 4756 CPU cores published in the Grid; 4.25 PB of disk storage (DPM, XRootD, NFS) and a tape library
- Regular yearly upscale of resources, funded from various sources, mainly academic grants
- The WLCG services include: APEL publisher, Argus authorization service, BDII, several UIs, ALICE VOBOX, CREAM CEs, Storage Elements, lcg-CA
- Monitoring: Munin, Ganglia, Nagios w/ multisite, MRTG, Netflow; the use of virtualization at the site is quite extensive
- ALICE disk XRootD Storage Element ALICE::Prague::SE: ~1.6 PB of disk space in total; redirector + 5 clients @ FZU and 4 clients @ NPI Rez form a distributed storage cluster; 10 Gb/s dedicated connection from FZU to the NPI ALICE storage cluster
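As a rough illustration of what the site's service monitoring (e.g. the Nagios multisite checks mentioned above) boils down to at its most basic level, the sketch below probes TCP reachability of a few of the listed grid services. It is not the actual Nagios configuration; the hostnames are hypothetical placeholders and the ports are the customary defaults for these services.

```python
# Sketch of a basic TCP reachability probe for some of the WLCG services
# named above. Hostnames are hypothetical placeholders; ports are the
# customary defaults (BDII 2170, CREAM CE 8443, XRootD 1094).
import socket

SERVICES = {
    "site BDII":         ("site-bdii.example.farm.cz", 2170),
    "CREAM CE":          ("cream1.example.farm.cz",    8443),
    "XRootD redirector": ("alice-se.example.farm.cz",  1094),
}

def is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        state = "up" if is_open(host, port) else "DOWN"
        print(f"{name:18s} {host}:{port}  {state}")
```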

ALICE operations in 2015
The number of concurrently running ALICE jobs at FZU in 2015 was on average ~1370, with peaks over 3300. Due to steady operation and active usage of idle resources, ALICE was the largest CPU consumer in 2015 (running jobs profile, ATLAS vs. ALICE, as shown in ALICE MonALISA). Almost 5.5 million jobs were processed in Prague during 2015.

ALICE operations in 2016
Running jobs profile at the Prague Tier-2 in March 2016 (ATLAS vs. ALICE, as shown in ALICE MonALISA); the ALICE share is lower than in 2015. The number of concurrently running ALICE jobs at FZU in March 2016 was on average ~1200, with peaks of ~2100. This will very likely drop later in 2016 due to the problems discussed on the next slides.

CPU resources share & site availability in 2015
During 2015, there was continuous, uninterrupted ALICE data processing at the computing center at FZU. The Czech Republic's share of the Tier-2 CPU resources delivered in 2015 was ~5%. 100% availability in November 2015.

CPU resources share & site availability in 2016
During the first months of 2016, the Czech Republic's share of the Tier-2 CPU resources delivered dropped to ~3.2%. Almost 100% availability in March 2016.

Upgrade of the xrootd cluster
March 2015 – April 2016: outgoing traffic at the Prague/Rez ALICE storage cluster reached more than 3 GBytes/s in peaks; almost 19 PBytes of data were read out.
Traffic on the dedicated line NPI <-> FZU: maximum rate close to 4 Gbit/s, which corresponds to 4 servers with 1 Gb/s Ethernet cards.
This year we plan to upgrade the performance of the NPI cluster: the servers will be equipped with 10 Gb/s Ethernet cards and the corresponding switch will be replaced with a more capable one (Cisco Catalyst 4500-X 16 Port 10G).
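The arithmetic behind the planned upgrade can be checked on the back of an envelope, using only the numbers quoted above:

```python
# Back-of-the-envelope check of the traffic numbers quoted above.

# Peak outgoing traffic of the ALICE storage cluster, in GByte/s:
peak_out_gbytes = 3.0
print(f"Storage cluster peak: {peak_out_gbytes * 8:.0f} Gbit/s")        # ~24 Gbit/s

# NPI <-> FZU line: 4 NPI servers, each with a 1 Gb/s NIC, so the
# observed maximum of ~4 Gbit/s means the NICs are the bottleneck.
npi_servers = 4
print(f"NPI NIC cap today: {npi_servers * 1} Gbit/s")

# With 10 Gb/s NICs the servers could in principle push 40 Gbit/s,
# at which point the 10 Gb/s dedicated fiber link becomes the next limit.
print(f"NPI NIC cap after upgrade: {npi_servers * 10} Gbit/s (link: 10 Gbit/s)")
```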

swL014 - traffic to/from the worker node clusters and storage clusters; traffic of the WN cluster ibis

swL014 - traffic to/from XRootD storage nodes xrootd5 and xrootd4

Czech Republic resource share
MANDATORY ACCORDING TO THE ALICE CONSTITUTION: 2.1% of the total required T2 (+T1?) resources.
CPU (kHepSpec06):
- 2015 delivery: ALICE requirement 7.2; REBUS pledged 3.5; REBUS delivered 8.83
- 2016 planning: ALICE requirement 7.2 (11.4); REBUS pledged 5.0
Disk (PB):
- 2015 delivery: ALICE requirement 0.6; REBUS pledged 1.030; REBUS delivered 1.591
- 2016 planning: ALICE requirement 0.92 (1.3); REBUS pledged 1.4
For the Czech ALICE group it is easier to scale up the storage capacity than to pile up CPUs. The ALICE pledges of computing and storage resources for 2015 were satisfied (for storage to > 200%), although our CPU pledges were low:
- by sharing the hardware between various projects
- by stable operations with almost no outages
- by using servers that are out of warranty
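The fulfillment ratios behind these statements follow directly from the REBUS figures above:

```python
# Recompute the 2015 fulfillment ratios from the REBUS figures above.

cpu_required  = 7.2    # kHS06, ALICE requirement
cpu_pledged   = 3.5    # kHS06, REBUS pledged
cpu_delivered = 8.83   # kHS06, REBUS delivered

disk_required  = 0.6     # PB, ALICE requirement
disk_pledged   = 1.030   # PB, REBUS pledged
disk_delivered = 1.591   # PB, REBUS delivered

def pct(delivered: float, reference: float) -> str:
    return f"{100.0 * delivered / reference:.0f}%"

print("CPU  delivered / required:", pct(cpu_delivered, cpu_required))    # ~123%
print("CPU  delivered / pledged :", pct(cpu_delivered, cpu_pledged))     # ~252%
print("Disk delivered / required:", pct(disk_delivered, disk_required))  # ~265%, i.e. the "> 200%" above
print("Disk delivered / pledged :", pct(disk_delivered, disk_pledged))   # ~154%
```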

Various issues
- In December 2014, new hardware for ALICE was installed, which scaled up the resources for ALICE going into 2015 (a computing cluster at FZU, one storage server at FZU with ~420 TB of disk capacity and one storage server at NPI with ~120 TB of disk capacity).
- Most of the nodes of the Prague/Rez storage cluster for ALICE have been upgraded to XRootD version 4.3.0.
- A new redirector machine was put into operation, without the client functionality, due to overload of the original redirector+client. We were the first users of the new installation method released in July 2015.
- In 2016, there was a dispute about financing the operation of (old) computing servers for ALICE.
- One six-year-old cluster was switched off because of high power consumption costs: minus 320 cores, 2.7 kHS06.
- Another cluster, from 2009, will be switched off during 2016 (144 cores, 1.7 kHS06).
- It might happen that we will strongly underperform in 2016.
- But will we really be criticized, given that there is no MoU on computing resource delivery inside the ALICE collaboration?

Various issues (contd)
- Fortunately, we are preparing a grant application to obtain funding for new hardware for ALICE and ATLAS. If successful, new capacities should arrive in the period 2017 – 2019.
- A change for the better: the ALICE job exit codes in Torque improved after increasing the wall time limit from 60 to 120 hours (the job agents' TTL is 48 hours ...).
- A couple of HW interventions were needed during 2015/2016, but the interruption of operations was negligible.
- Problems finding students: we need IT students with some experience; they are interested but expect better financing.
- At the moment there is only one Master's student, who works as a sysadmin at the Prague Tier-2 focusing on ALICE support and collaborates with C. Cordeiro from CERN IT on the development of tools for monitoring VM performance in commercial clouds (a continuation of a Summer Student project).

Aspects of future upscale
The future LHC runs will require a huge upscale of computing, storage and network resources, which will be impossible for us to achieve with the expected flat or even declining budget.
Activities aimed at getting access to external resources, HPC centers or computational clouds, are ongoing in all LHC experiments.
On the local level, there are lengthy discussions with the HPC center IT for Innovations (IT4I) in Ostrava.
Another possibility: using the services of the OpenNebula IaaS cluster/cloud of CESNET (MetaCloud, FIWARE lab).
The showstopper is the absence of tools for submission of ALICE jobs to these sites.

Summary and Outlook
- The Czech Republic performed well during 2015 concerning the delivery of computing and storage resources: mandatory requests satisfied to > 100%; high site availability, reliable delivery of services, fast response to problems.
- In the upcoming years, we will do our best to keep up the reliability and performance level of the services and to deliver the pledges.
- The high-capacity network infrastructure is crucial: we will endeavour to upgrade the NPI storage nodes for higher traffic performance; a possible upgrade of the network system at the Tier-2 site at FZU.
- Work on the development of local ALICE monitoring tools is planned.
- Further negotiations with the IT4I HPC center and possible development of job submission tools (the help of the central ALICE Offline team will be needed ...).

Backups

Network architecture at the Prague site

ALICE operations in 2016
Running jobs profile at the Prague Tier-2 in March 2016 (ATLAS vs. ALICE, as shown in ALICE MonALISA); the ALICE share is lower than in 2015. The number of concurrently running ALICE jobs at FZU in the first months of 2016 was on average ~1300, with peaks of ~3000. This will very likely drop later in 2016 due to the problems discussed on the preceding slides.