1 Virgo computing, Michele Punturo

2 Computing is a "hot topic"?
Virgo computing has been a hot topic in the last weeks:
- 15/06/2018: presentation of ET computing issues and activities to the INFN "Commissione Calcolo e Reti"
- 18/06/2018: computing issues discussed at the VSC
- 25/06/2018: discussion on future developments of astroparticle computing in INFN (Virgo invited to the INFN presidency together with CTA, KM3 and Euclid)
- 02/07/2018: External Computing Committee (ECC) meeting at EGO (the ECC is appointed by the EGO Council)
- 04/07/2018: discussion on 2019 funding for all the INFN experiments at Tier-1
- 08/07/2018: this talk at the Virgo Week
- 19/07/2018: C3S meeting at the INFN presidency on computing challenges

3 Slides Recycling

4 Temporary storage and DBs
[Data-flow diagram, Advanced Virgo / Advanced LIGO]
EGO/Virgo site: DAQ (50 MB/s, up to 84 MB/s full raw data), temporary storage and DBs, detector characterisation, low-latency analysis and detection.
Outgoing transfers:
- h(t) data transfer
- Reduced Data Set (RDS) transfer via LDR: 0.87 MB/s
- Virgo raw data transfer: 45-50 MB/s, via GridFTP and via iRODS
Destinations: CCIN2P3 (CC and Tier0-1, offline data analysis), CNAF (CC and Tier0-1, offline data analysis), Nikhef (offline DA), SurfSARA (offline DA, LIGO via LDR), PolGRID (offline DA), GRID and local access.

5 Temporary storage and DBs
EGO/Virgo site: DAQ (50-84 MB/s full raw data), detector characterisation, low-latency analysis and detection.
The storage capacity at EGO is currently about 1 PB:
- 50% devoted to home directories, archiving of special data, and output of the local "low-latency analysis"
- 50% devoted to the circular buffer used to store the raw data locally
At 50 MB/s the circular buffer holds less than 4 months of data before overwriting:
- too short a period for commissioning purposes
- unable to keep O3 on disk
This situation is due to a rapid evolution of the requirements, which is making the previous computing-model specifications obsolete:
- increase of the DAQ data-writing rate: nominal ~22 MB/s, O2 ~37 MB/s, current ~50 MB/s
- requests by commissioners to keep data corresponding to "special periods" stored locally for commissioning and noise hunting
- requests by low-latency analysts for disk space for the outputs of their analyses
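The "less than 4 months" figure follows directly from the numbers above; a minimal back-of-the-envelope check, assuming the circular buffer is half of the ~1 PB quoted on this slide:

```python
# Hedged sketch: lifetime of the raw-data circular buffer at the current DAQ rate.
# The 500 TB buffer size is an assumption (50% of the ~1 PB EGO storage).
BUFFER_TB = 500              # assumed circular-buffer size
RATE_MB_S = 50               # current DAQ writing rate (MB/s)

seconds = BUFFER_TB * 1e6 / RATE_MB_S      # 1 TB = 1e6 MB
days = seconds / 86400
print(f"Buffer lifetime: {days:.0f} days (~{days / 30:.1f} months)")
# -> roughly 116 days, i.e. just under 4 months, consistent with the slide
```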

6 Storage
The shortage of disk space at the site also raises the risk of losing scientific data.
Nota bene: "If the incident at CNAF had occurred during O3, Virgo would have lost science data."
We had to pass through:
- a Council meeting
- a STAC meeting
- an ECC meeting
Finally we got the green light to purchase the storage, and the order has been submitted by EGO.

7 O2 experience on computing and DT
A sequence of internal and external issues affected the data transfer toward the CCs during O2 (from Computing STAC, 23/05/2017).

8 O2 issues highlighted by DT
- Problem (1): iRODS hanging toward CCIN2P3
- Problem (2): unidentified midnight slowdown
- Problem (3): Grid certificate expiration
- Problem (4): saturation of the disk-to-tape transfer at CCIN2P3
- Problem (7): similar issue at CNAF
- Problem (5): freezing of the storage due to lack of disk space
- Problem (6): firewall freezing

9 Discussion with CCs We had two meetings with CCs:
16/01/2018: first meeting; presentation of the problems, discussion, hypotheses, some solutions suggested for:
- data management (keyword: DIRAC)
- workload management (keyword: DIRAC)
- Virgo software replication (keyword: CVMFS)
From the CCs to Virgo: request for requirements.
From Virgo to the CCs: request for common solutions.
07/05/2018: second meeting; solutions for data transfer discussed, and a possible common solution proposed by the CCs.

10 Strategy toward O3
- Radically reduce the data-loss risk by purchasing a large storage system at EGO: almost solved.
- Improve the reliability and availability of the computing architecture at EGO:
  - bandwidth increase from 1 Gb/s to 10 Gb/s; bandwidth tested toward the GARR PoP, with some issues toward the CCs (see later)
  - a high-availability firewall has been installed
- New data-transfer engine? Virgo request: use the same protocol for CNAF and CCIN2P3. CCs' answer: WebDAV.
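Since WebDAV is the protocol proposed by the CCs, here is a minimal sketch of what a WebDAV upload from the Virgo side could look like; the endpoint URL and proxy path are hypothetical placeholders, not the actual CNAF/CCIN2P3 configuration.

```python
# Hedged sketch: pushing one raw-data file to a WebDAV endpoint with HTTP PUT.
# Endpoint and proxy path are illustrative placeholders.
import requests

WEBDAV_URL = "https://storage.example-cc.it/virgo/raw/"   # hypothetical endpoint
PROXY_CERT = "/tmp/x509up_u1000"                          # hypothetical VOMS proxy file

def put_file(local_path: str, remote_name: str) -> None:
    """Upload a single file with an HTTP PUT, as WebDAV expects."""
    with open(local_path, "rb") as f:
        resp = requests.put(WEBDAV_URL + remote_name,
                            data=f,
                            cert=PROXY_CERT,      # client-certificate authentication
                            verify="/etc/grid-security/certificates",
                            timeout=300)
    resp.raise_for_status()

if __name__ == "__main__":
    put_file("V-raw-1234567890-100.gwf", "V-raw-1234567890-100.gwf")
```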

11 New Firewall: clean solutions to improve the security of critical domains
- separate the ITF control and operation network from the others
- separate the on-line analysis network
- introduce a separate VTF (Virgo Test Facility) network (needed also for reliability)
- introduce a separate R&D network
- introduce two-factor authentication in the most critical domains
- introduce internal intrusion/anomaly detection probes (reactive approach)
- reorganize the remote-access procedures in the new topology (VPN, proxies, ...)

12 New Firewall (Stefano's talk)
- Fast path (for selected hosts), access control only: 9.32/9.55 Gbps
- Normal path: deep inspection
- Upgrade started in 2016
Note: some failures (freezes) could still happen, so applications must be able to sense the lack of communication and open a new TCP connection.
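A minimal sketch of the reconnect-on-stall behaviour the note asks for, assuming a simple client that re-opens its TCP connection when a send times out; the hostname and port are placeholders, not a real EGO service.

```python
# Hedged sketch: a client that detects a stalled connection (e.g. a firewall
# freeze) via a socket timeout and re-establishes the TCP connection.
import socket
import time

HOST, PORT = "dataserver.example.ego-gw.it", 5000   # hypothetical endpoint
SEND_TIMEOUT_S = 30

def connect() -> socket.socket:
    return socket.create_connection((HOST, PORT), timeout=SEND_TIMEOUT_S)

def send_with_retry(payload: bytes, max_retries: int = 5) -> None:
    sock = connect()
    for attempt in range(max_retries):
        try:
            sock.sendall(payload)          # a stall raises socket.timeout
            return
        except (socket.timeout, OSError):
            sock.close()                   # drop the dead connection
            time.sleep(2 ** attempt)       # back off before reconnecting
            sock = connect()               # open a fresh TCP connection
    raise RuntimeError("connection kept stalling, giving up")
```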

13 Federated login and Identity Management: latest news
- We recently requested to join the IDEM federation as the EGO IdP.
- Quick solution: working with GARR to provide the ego-gw.it IdP as a service in the GARR cloud, connected to our AD database, and enter IDEM/EduGAIN in a few days.
- Federate the Virgo web applications, starting from the most critical one for collaborative detection-event management, the VIM (Virgo Interferometer Monitor) web site:
  - need to split the application into an internal instance and an external federated one (ongoing)
  - federated authentication for ligo.org or ego-gw.it users; direct Virgo user authentication only when defined in the LV Virtual Organization common database
- For the next web applications: discussing with GARR an SP/IdP proxy pilot for a more flexible setup.
- LSC plans for authorization and IdM: gradually provide federated services to the LV federated identities via COmanage (as in the gw-astronomy instance).
Caveats:
- ligo.org identities (accounts) are still needed to access LDG computing resources
- in addition, Virgo users still need to complement their ligo.org account with their personal certificate subject

14 Federated login scheme
[Diagram: EGO federation in IDEM] The interferometer network (DAQ, controls, electronics, monitoring, ...) sits behind the firewall; EGO web services (WWW, VIM, TDS, SIR, VIM-replica) are exposed through an EGO SP/IdP with single sign-on and federated-login access; AA/COmanage IdMS on the LIGO side; LIGO lab and LSC universities offer access and services through EduGAIN; other federations mentioned: Renater, SURFnet, Wigner, Poland (PIONIER?).

15 Bulk data transfer
O3 requirements:
- data writing 50 MB/s → 100 MB/s sustained (parallel) data transfer per remote site
- ... MB/s peak data transfer per remote site
- same protocol/solution for the two sites
- reliable login procedure
O2: iRODS + username/password at CCIN2P3 and CNAF.
Solution proposed (previously) by the CCs: WebDAV.
Tests performed:
- CNAF: login issues (certificate); throughput always > 100 MB/s with peaks of about 200 MB/s: good performance
- CCIN2P3: easy to log in; throughput of about 12 MB/s, up to 30 MB/s, using WebDAV (100 MB/s using iRODS)
Long discussion at the ECC meeting: waiting for feedback from Lyon; a test with FTS proposed; 180 MB/s since Friday, thanks to the migration of the iRODS server serving Virgo at CCIN2P3 to a 10 Gb/s link.
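The 100 MB/s sustained requirement is meant to be reached with parallel streams; below is a minimal self-contained sketch of that pattern with a hypothetical endpoint and a simple aggregate-throughput printout. It is not FTS or the production transfer engine.

```python
# Hedged sketch: parallel WebDAV uploads to approach a sustained aggregate rate.
# Endpoint and file list are illustrative placeholders.
import os
import time
from concurrent.futures import ThreadPoolExecutor
import requests

WEBDAV_URL = "https://storage.example-cc.it/virgo/raw/"   # hypothetical endpoint
N_STREAMS = 4                                             # parallel streams

def upload(path: str) -> int:
    """PUT one file and return the number of bytes sent."""
    with open(path, "rb") as f:
        requests.put(WEBDAV_URL + os.path.basename(path), data=f,
                     timeout=600).raise_for_status()
    return os.path.getsize(path)

def upload_batch(paths: list[str]) -> None:
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=N_STREAMS) as pool:
        total_bytes = sum(pool.map(upload, paths))
    rate = total_bytes / (time.monotonic() - start) / 1e6
    print(f"aggregate throughput: {rate:.1f} MB/s")
```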

16 Low latency analysis machines
The number of machines devoted to MBTA has been doubled: 160 → 320 cores.
An additional ~180 cores have been devoted to detchar and cWB (Condor farm).
The quick investment and installation were made possible by the fact that a virtual-machine architecture had already been tested and approved for low-latency analysis.
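Since the detchar/cWB cores are exposed through a Condor farm, a minimal job-submission sketch via the HTCondor Python bindings is shown below; the executable, arguments and resource requests are hypothetical placeholders, not an actual Virgo pipeline.

```python
# Hedged sketch: submitting one job to a Condor farm through the HTCondor
# Python bindings. Executable and arguments are placeholders.
import htcondor

submit_description = htcondor.Submit({
    "executable": "/home/virgo/bin/run_detchar.sh",   # hypothetical script
    "arguments": "--gps-start 1234567890 --duration 4096",
    "request_cpus": "1",
    "request_memory": "2GB",
    "output": "detchar.$(ClusterId).out",
    "error": "detchar.$(ClusterId).err",
    "log": "detchar.$(ClusterId).log",
})

schedd = htcondor.Schedd()                  # local scheduler daemon
result = schedd.submit(submit_description)  # queue one job
print("submitted cluster", result.cluster())
```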

17 F. Carbognani slides

18 F. Carbognani slides

19 F. Carbognani slides

20 F. Carbognani slides

21 Offline analysis
There is an unresolved, long-standing issue with the under-use of Virgo computing resources at the CCs:
- only the CW analysis makes substantial use of CNAF, and less regularly of Nikhef
- the other pipelines (mainly at CCIN2P3) have a negligible CPU impact

22 Network of Computing Centres
Sept. 2016 - Sept. 2017: Virgo ~6-8%.
- LIGO Scientific Collaboration: 1263 collaborators (including GEO), 20 countries, 9 computing centres, ~1.5 G$ of total investment
- Virgo Collaboration: 343 collaborators, 6 countries, 5 computing centres, ~0.42 G€ of total investment
- KAGRA Collaboration: 260 collaborators, 12 countries, 5 computing centres, ~16.4 G¥ of construction costs

23 Computing load distribution
The use of CNAF is almost mono-analysis; diversification is given by OSG access.
[Chart: CNAF usage by pipeline, dominated by CW]

24 Future: Increase of CPU power needs
- The O3 run will start in February and will last for 1 year.
- We are signing a new agreement with LIGO that requires us to provide about 20-25% of the whole computing power.
- 3 detectors: non-linear increase for the coherent pipelines (HL, HV, LV, HLV).
- 4 detectors: at the end of O3 KAGRA will probably join the science run.
- Some of the pipelines will be tested (based) on GPUs.

25 How to fill up the resources
OK, let us suppose we find the money to provide the 25% Virgo quota: are we able to use it?
Within a parallel investigation made for INFN, I asked the (Italian) DA people to project their needs/intentions over the next years.
With a series of caveats, the projection is shown in the plot (growth from O3 to O4).

26 HPC resources
Numerical relativity is now a key element for the production of BNS templates.
In Virgo there are important activities thanks to the groups in Torino and Milano Bicocca.
They make intensive use of the CINECA resources within the CSN4 framework and through some grants.
Since they participate "structurally" in the LV data analysis, we need to provide computing resources. Requests:
- 2M GPU hours (e.g. on CINECA), 6M CPU hours on Intel BDW/SKL (Marconi, CINECA), 50 TB of disk space
- in the following years: 6M GPU hours per year, 6M CPU hours per year, 150 TB of disk space
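To give a feel for the scale, a small back-of-the-envelope conversion of the yearly requests into continuously occupied devices; simple arithmetic, not an official sizing.

```python
# Hedged sketch: converting yearly GPU/CPU-hour requests into the number of
# devices that would have to run continuously for a whole year.
HOURS_PER_YEAR = 365 * 24          # ~8760 h

requests_per_year = {
    "GPU hours": 6e6,              # later-years request quoted on the slide
    "CPU core hours": 6e6,
}

for kind, hours in requests_per_year.items():
    devices = hours / HOURS_PER_YEAR
    print(f"{kind}: {hours:.0f} h/yr -> ~{devices:.0f} devices busy year-round")
# 6M hours/year corresponds to roughly 685 GPUs or CPU cores in continuous use
```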

27 Hence ... We need to contribute new resources (MOU constraint)
- But we are unable to use them or to allow access to "LIGO-like" pipelines; we need a solution to this situation.
- Replicate the LIGO environment: local installation of Condor as a wrapper of the local batch system.
- New workload manager + data manager (DIRAC): positive WM tests, data management still unclear to me, data transfer still to be tested. We are progressing too slowly.
- LIGO is going in the direction of using Rucio as DM and staying with Condor as WM.
- Future development: in LIGO, tests of Singularity + CVMFS for virtualisation and software distribution. This technology is pursued at the LHC and therefore supported by our CCs; we need to invest in it. A post-doc at INFN-Torino is to be engaged in this activity.
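A minimal sketch of the Singularity + CVMFS pattern mentioned above: running a pipeline step inside a container image distributed via CVMFS. The image path and command are hypothetical placeholders, not an actual LIGO/Virgo deployment.

```python
# Hedged sketch: launching an analysis step inside a Singularity container
# whose image is published on CVMFS. Image path and command are placeholders.
import subprocess

IMAGE = "/cvmfs/singularity.example.org/virgo/pipeline:latest"  # hypothetical image
CMD = ["virgo_pipeline", "--config", "/data/run.cfg"]           # hypothetical command

def run_in_container(image: str, cmd: list[str]) -> int:
    """Execute cmd inside the container, bind-mounting a data directory."""
    full = ["singularity", "exec", "--bind", "/data:/data", image] + cmd
    return subprocess.run(full, check=True).returncode

if __name__ == "__main__":
    run_in_container(IMAGE, CMD)
```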

28 Cost model
In the current cost model EGO reimburses the costs at the French and Italian CCs, while Nikhef and Polgrav contribute in kind.
This model balances the costs between Italy and France, but it puts the largest fraction of the costs on EGO's shoulders.
Moving the bill back to the funding agencies does not work either, because INFN would be by far the largest contributor, without further balancing.
We need a cost model that shares the computing costs within Virgo in a smarter way:
- it should take into account the number of authors
- it should take into account the global investment in computing (DAQ, low-latency analysis, storage, human resources)
- it should force the Virgo members to account for their resources

29 Cost model
This is not a definitive proposal, but we need to open a discussion, also in front of the EGO Council and the institutions.
We have to define our standards:
- needs and standard cost of storage
- needs and standard cost of CPU
- accessibility requirements (LIGO compliant, Virgo compliant, ...)
- accountability requirements (resources need to be accountable)
- human-resource requirements (a collaboration member MUST be the interface)
Then compute a "Virgo standard cost" per author: each institution in Virgo has to provide, in kind and/or in money, resources proportional to its number of authors according to the standard figures we defined. Ghosts are expensive!
Over-contributions can be considered a net contribution to the experiment by the institution; obviously we must also take into account the direct contributions to EGO.
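A toy illustration of the per-author cost-sharing rule described above; the unit cost, institution names and author counts are invented for the example and are not a proposal.

```python
# Hedged sketch: the "standard cost per author" sharing rule.
# All numbers below are made up for illustration.
STANDARD_COST_PER_AUTHOR_EUR = 2000.0       # hypothetical yearly figure

institutions = {                            # hypothetical author counts
    "Institution A": 40,
    "Institution B": 12,
    "Institution C": 3,
}

in_kind_contributions_eur = {               # value of resources already provided
    "Institution A": 50000.0,
    "Institution B": 0.0,
    "Institution C": 0.0,
}

for name, authors in institutions.items():
    due = authors * STANDARD_COST_PER_AUTHOR_EUR
    balance = in_kind_contributions_eur[name] - due
    status = "over-contribution" if balance >= 0 else "still owed"
    print(f"{name}: due {due:.0f} EUR, {status}: {abs(balance):.0f} EUR")
```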

30 2019: who pays?
INFN is currently defining the investments at Tier-1 (CNAF); the decision will be taken in September.
We have 30 kHS06; in the original plan we were expected to jump to 70-80 kHS06, which is difficult in terms of cost and efficiency.
What do we ask for 2019?
- Tape: defined (1 PB)
- Disk: we have 656 TB (578 TB occupied); O3 adds 1 MB/s of Virgo RDS + 2 MB/s of LIGO data = 90 TB, plus disk for DA; the suggestion is to request a pledge of 780 TB (+124 TB)
- CPU: 30 kHS06, 40 kHS06, ?
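The disk figures above follow from simple arithmetic; a small sketch of the check, assuming roughly one year of O3 data taking (the exact duration is an assumption):

```python
# Hedged sketch: checking the O3 disk estimate on this slide.
# Assumes ~11.5 months of data taking; the slide quotes ~90 TB.
SECONDS = 11.5 * 30 * 86400            # assumed O3 duration in seconds
RDS_RATE_MB_S = 1 + 2                  # 1 MB/s Virgo RDS + 2 MB/s LIGO data

o3_disk_tb = RDS_RATE_MB_S * SECONDS / 1e6
current_tb = 656                       # current pledge
pledge_tb = 780                        # suggested 2019 pledge

print(f"O3 RDS volume: ~{o3_disk_tb:.0f} TB")           # ~89 TB
print(f"Pledge increase: {pledge_tb - current_tb} TB")   # 124 TB
```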

31 Organisation (ECC meeting)
- I was appointed as VDAS coordinator on an emergency basis in Sept. 2015.
- Since the beginning of my mandate, which ended in Sept. 2017, I have highlighted the need to improve the organisation of VDAS (changing also the name!).
- Recently I proposed a new structure for VDAS, dividing it into work packages and identifying the main projects.
- This organisation was shown at the Virgo Week in April and proposed to the VSC.
- The decision is pending, and a suggestion from this committee is more than welcome.

32 Proposed organisation (ECC meeting)
[Organisation chart] Virgo Computing Coordination, reporting to the Spokesperson / EGO Director, interfacing with the DA coordinator, the commissioning coordinator and LIGO, and split into a «local area» and a «wide area». Elements of the chart: computing centres; local computing and storage infrastructure management & strategy; low-latency architecture; analysis pipelines; GRID compatibility; software management; offline computing management; offline computation needs evaluation; networking infrastructure; local computing infrastructure; online computing subsystem; dedicated hardware management; cyber-security; Data Management System; local data allocation and management & strategy; local services (federated login, web servers, ...); bulk data transfer.

33 Reference persons at CCs
In addition, it is crucial to have a MEMBER OF THE VIRGO COLLABORATION, fully devoted to computing issues and physically or virtually located at each computing centre, acting as the reference person:
- post-doc level
- he/she participates in the collaboration life (computing meetings, DA meetings, ...), but has the duty to solve (or facilitate the solution of) all the issues related to the use of that CC by the collaboration

34 Conclusions Computing is a crucial part of the detector
- Computing is a key element of the LIGO-Virgo agreement.
- It is time for the collaboration (and for all the funding agencies) to take it seriously.
- As stated at the last VSC and reported in the minutes, today I consider concluded the extra time I devoted to my appointment as VDAS coordinator (which officially ended in Sept. 2017).
- I hope that a VDAS coordinator will continue to exist, and I wish all the best to my successor.

