EGI-InSPIRE RI
Feedback to sites from the VO auger
Jiří Chudoba (Institute of Physics and CESNET), with input from the Auger production team (J. Lozano Bahilo, G. Rubio, M. D. Serrano, UGR) and Jean-Noel Albert (LAL)
The Observatory
- The Pierre Auger Observatory (PAO) is an astroparticle project to measure ultra-high energy cosmic rays
- See my talk on Friday for more details about the project
VO auger
- Mostly used for organized production of simulations of cosmic ray showers and detector response
- CORSIKA with different models: FORTRAN
- Offline code: C++, but many packages included (GEANT4, ROOT)
Sites supporting VO auger
- Sites in 10 countries
- How shall we acknowledge the sites' contribution?
Some issues
- feedback from sites
- change of VOMS server certificate
- too many jobs in queue
- hanging lcg-cp
- SE occupancy, data movement
- slow LFC response
- efficiency evaluation
Feedback from sites
- Production, VO management, data management and bulk transfers to SRB are done by a geographically distributed team
- Sites should preferably handle all issues via GGUS
- We may not know about some problems; sometimes we learn about them only from the sites we manage
Change of the VOMS certificate
- Change of the DN: sites must download the new certificate from the CIC portal and reconfigure services
- Broadcast message; shall we create a GGUS ticket for each site?
- We did not succeed with the right configuration on our own site at the first attempt
- Can production continue? Running jobs have a proxy signed by the “old” VOMS server
- Could using two VOMS servers be a solution?
Too many waiting jobs
- Some sites reported too many (thousands of) waiting jobs in the auger queue
- The distribution is done by WMS servers; we do not send jobs directly to sites
- Wrong values in the BDII? Slow update? Bug in WMS?
- We decreased the submitted/running parameter
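The effect of lowering a submitted/running limit can be illustrated with a small sketch. It assumes that "submitted" counts jobs sent but not yet running and that the production system caps their ratio to the running jobs. The function name, the `max_ratio` parameter and the `seed_batch` value are hypothetical and do not come from the Auger production tools.

```python
def jobs_to_submit(submitted, running, max_ratio, seed_batch=10):
    """Return how many new jobs may be sent to a site.

    Assumption: the limit is submitted/running <= max_ratio, where
    'submitted' means jobs that are waiting and not yet running.
    """
    if running == 0:
        # Nothing runs yet: allow only a small seed batch (hypothetical choice).
        return max(0, seed_batch - submitted)
    allowed_waiting = int(max_ratio * running)
    return max(0, allowed_waiting - submitted)


# With 200 running jobs and a ratio of 1.0, at most 200 jobs may wait.
print(jobs_to_submit(submitted=100, running=200, max_ratio=1.0))
# A site that already has thousands of waiting jobs receives nothing new.
print(jobs_to_submit(submitted=3000, running=200, max_ratio=2.0))
```

A smaller `max_ratio` shortens the queue at each site, at the price of refilling free slots more slowly.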
Hanging jobs
- CORSIKA in an infinite loop
- Only a small fraction of jobs
- Difficult to debug
- CPU is used, but there is no update of output files
- Fixed by CORSIKA developers
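The symptom described above (CPU in use, output files not updated) suggests a simple watchdog check. The following is only a sketch of that idea and not the fix applied by the CORSIKA developers; the function name and the idle threshold are assumptions.

```python
import os
import time


def output_is_stale(paths, max_idle_seconds, now=None):
    """Return True if no existing output file was modified recently.

    'paths' lists the expected output files of one job; files that do
    not exist yet are ignored. 'now' can be passed in for testing.
    """
    now = time.time() if now is None else now
    mtimes = [os.path.getmtime(p) for p in paths if os.path.exists(p)]
    if not mtimes:
        # Nothing has been written yet, so staleness cannot be judged.
        return False
    return now - max(mtimes) > max_idle_seconds
```

A job wrapper could call this periodically and stop the simulation once the outputs have been idle for longer than any healthy run would allow.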
Hanging jobs II
- lcg-cp is used to download software if it is not locally available
- It hung in some cases
- A very “expensive” error: the job slot is blocked until the job is killed at the walltime limit
- GGUS #90936
- Jiri Horky debugged it, Michail Salichos provided a patch
- A lot of work; it took more than 2 months
- Should be fixed in the next release
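Until a patched release is available, a job wrapper can limit the cost of a hanging copy by running the command under a timeout. This is a generic sketch and not the patch referenced in the ticket; the helper name is hypothetical, and the copy command with its arguments is left to the caller.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd (a list of arguments) and report whether it succeeded.

    Returns True on exit code 0, False on a non-zero exit code or when
    the command exceeds timeout_seconds. subprocess.run kills the child
    process before raising TimeoutExpired.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Hypothetical usage, with the source and destination supplied by the job:
# ok = run_with_timeout(["lcg-cp", source_url, dest_url], 600)
```

With such a wrapper a stuck download fails after minutes instead of blocking the job slot until the walltime limit.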
SE Occupancy
- Production stores results on available SEs (some sites excluded)
- It can fill all available space
- Space tokens should be used to set quotas: AUGERPROD, with write access limited to the production role
- We are unable to respond quickly to requests to move TBs of data from a site
- There is not enough space on other sites
Data transfers to SRB
- Decommissioning of an SE with many auger files
- FTS transfers from Lille to Lyon
  - 2 months, 1.9 M files, 38.7 TB
  - less than 1% of lost files
  - operations/day, 1300 ops/hour
  - 650 GB/day, 27 GB/hour, 8 MB/s
- FTS transfers from Bordeaux to Lyon
  - 1 month, 700 K files, 7.1 TB
  - .6% of lost files
  - operations/day, 500 ops/hour
  - 160 GB/day, 7 GB/hour, 2 MB/s
- Many more small files in Bordeaux
- Large files stored to tapes in Lyon
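The quoted rates can be cross-checked with a short calculation. The sketch below converts the daily volumes to hourly and per-second rates and derives the mean file size from the totals on the slide. Decimal units (1 GB = 1000 MB) are an assumption, and the function names are illustrative only.

```python
def daily_to_rates(gb_per_day):
    """Convert GB/day to (GB/hour, MB/s), taking 1 GB = 1000 MB."""
    gb_per_hour = gb_per_day / 24
    mb_per_second = gb_per_day * 1000 / 86400
    return gb_per_hour, mb_per_second


def mean_file_size_mb(total_tb, n_files):
    """Average file size in MB, taking 1 TB = 1,000,000 MB."""
    return total_tb * 1_000_000 / n_files


print(daily_to_rates(650))   # Lille: about 27 GB/hour and 8 MB/s
print(daily_to_rates(160))   # Bordeaux: about 7 GB/hour and 2 MB/s
print(mean_file_size_mb(38.7, 1_900_000))  # Lille: roughly 20 MB per file
print(mean_file_size_mb(7.1, 700_000))     # Bordeaux: roughly 10 MB per file
```

The mean file size in Bordeaux comes out at about half of that in Lille, which is consistent with the remark about many more small files there.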
Effectiveness evaluation
- Efficiency: cputime/walltime
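As a minimal illustration of this definition, the ratio can be computed per job or over summed times. The function name and the use of seconds are assumptions; the accounting portal may aggregate differently.

```python
def efficiency(cputime_seconds, walltime_seconds):
    """Return cputime/walltime for one job or for summed totals."""
    if walltime_seconds <= 0:
        raise ValueError("walltime must be positive")
    return cputime_seconds / walltime_seconds


# A job that used 54 minutes of CPU during 60 minutes of walltime:
print(efficiency(54 * 60, 60 * 60))
```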
Top ten VOs efficiency
[Plot: efficiency of the biggest VOs over the selected period]
VO auger efficiency
[Plot: VO auger efficiency over the selected period; the efficiency improves]
Effectiveness evaluation
- Effectiveness = (cputime of jobs with good output) / (total walltime)
- Difficult to estimate
- No information about cancelled or lost jobs
- Some jobs without a job log file stored correct results
- Production maximizes throughput
- Each job processes 1 shower 5 times
- Jobs are resent if there are not enough (<3) output files
- A more detailed view from the accounting portal could help
- Just one of many possible definitions
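The effectiveness definition and the resend rule can be written down as a short sketch. The `Job` record, its field names and the way "good output" is flagged are assumptions for illustration; as noted above, real accounting data lacks information on cancelled or lost jobs.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cputime: float      # seconds of CPU used
    walltime: float     # seconds of walltime used
    good_output: bool   # hypothetical flag: the job stored usable results


def effectiveness(jobs):
    """cputime of jobs with good output divided by total walltime."""
    total_walltime = sum(j.walltime for j in jobs)
    if total_walltime == 0:
        return 0.0
    good_cputime = sum(j.cputime for j in jobs if j.good_output)
    return good_cputime / total_walltime


def needs_resend(n_output_files, min_files=3):
    """A job is resent when it produced fewer than min_files output files."""
    return n_output_files < min_files


jobs = [Job(90.0, 100.0, True), Job(80.0, 100.0, False)]
print(effectiveness(jobs))  # only the first job's cputime counts
print(needs_resend(2), needs_resend(3))
```

Walltime of failed jobs stays in the denominator, so effectiveness is always at or below the plain cputime/walltime efficiency.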
Instead of conclusions
- We thank all sites supporting the VO auger for their hardware resources and manpower support