EGEE is a project funded by the European Union under contract IST
LCG open issues
Massimo Sgaravatto
INFN Padova
JRA1 IT-CZ cluster meeting, November 4-5,
Problems hopefully already addressed
  The bugs below are still open in the LCG Savannah, but they have already been addressed
    Patches provided (by us, or by LCG)
    Still open because patches under test/still to be tested
  #3252, #3546, #3807, #3848, #3883, #3884, #3895, #3896, #3900, #3916, #4009, #4047, #4070, #4098, #4109, #4127, #4144, #4378, #4836, #4891, #4909, #5237, #5238, #5244, #5261, #5269, #5427
Issues not addressed yet
  #3302: On a RB+SE node there is a GridFTP problem
    Asked LCG for clarifications: no answer
    Not considered a high-priority problem
  #3671: To drain an RB
    They would like to be able to disallow new submissions while still allowing the other commands
    Not addressed yet: only suggested, as a trick, to set MaxInputSandboxSize=0
    Doesn’t work for jobs without an ISB
  #3724: LogMonitor should be resilient to a full file system
    Still to be understood why irepository.dat could not be recovered
  #3808: NetworkServer must log from which UI the job was submitted
    A patch was provided, but it logs the UI address and the user DN in *separate* messages (and it is not possible to unambiguously connect them)
    Asked whether they could use the LB info instead: no answer
Issues not addressed yet
  #3871: edg-wl-bkserverd: Terminating after 500 connections
    'event_store_recover': likely an inter-thread locking bug, which must be investigated
    MarcoP agreed with D. Smith to provide a patch for all these bugs
  #4319: Suggestion for change of policy for resubmitted jobs
    Basically they (D. Smith) think that if the job doesn’t even start its execution on a WN, this should not be counted as a (re)submission
    They want to be confident that the user payload of the previous attempts really never started; however, they don't require the same level of certainty in the opposite case
    The “shallow resubmissions” should be limited by a configurable maximum number of attempts in the broker configuration OR by virtue of the fact that the shallow resubmission would need to target a previously tried CEid
    They would like a fix in the near future (~1 month)
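
The counting rule proposed in #4319 can be sketched roughly as follows. This is only an illustration under stated assumptions: `RetryState`, `max_deep`, `max_shallow`, and `payload_started` are hypothetical names, not taken from the WMS code or the broker configuration, and only the "configurable maximum" limit is modelled (not the alternative based on previously tried CEids).

```python
from dataclasses import dataclass


@dataclass
class RetryState:
    """Hypothetical per-job retry bookkeeping (not a real WMS structure)."""
    max_deep: int      # assumed limit on resubmissions after the payload started
    max_shallow: int   # assumed configurable limit on shallow resubmissions
    deep: int = 0
    shallow: int = 0

    def record_failure(self, payload_started: bool) -> bool:
        """Record a failed attempt; return True if the job may be resubmitted.

        An attempt counts as shallow only when the caller is confident that
        the user payload never started on a WN. Any other failure counts as
        a normal (deep) resubmission.
        """
        if payload_started:
            self.deep += 1
            return self.deep <= self.max_deep
        self.shallow += 1
        return self.shallow <= self.max_shallow
```

With `RetryState(max_deep=1, max_shallow=2)`, two failures before the payload starts leave the deep counter untouched, so the user still has the one ordinary resubmission available.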
Issues not addressed yet
  #2716, #4126, #4894
    Problems with NS affecting the same portion of code
  #4570: Multiple cancel requests can crash WM (and possibly PR)
    Discussed at last meeting
  #4665: GlueCEPolicyMaxTotalJobs isn’t considered during matchmaking
    Jobs shouldn’t be sent to CEs publishing jobs >= GlueCEPolicyMaxTotalJobs
    Add this default requirement at WMS level (not UI)?
    Same for the other default requirements & rank
  #5347: FD limit for LM
    Being discussed between Alessio and David Smith
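
The default requirement suggested for #4665 amounts to a filter applied before ranking. A minimal sketch, assuming each CE is represented as a plain dict of published attributes: the dict representation and the function names are hypothetical, and reading the current job count from `GlueCEStateTotalJobs` is an assumption about what "publishing jobs" refers to.

```python
def accepts_more_jobs(ce):
    """Return True if the CE is below its published job limit.

    `ce` is a hypothetical dict of published attributes. Using
    GlueCEStateTotalJobs as the current job count is an assumption;
    a CE that publishes no limit is treated as unlimited in this sketch.
    """
    limit = ce.get("GlueCEPolicyMaxTotalJobs")
    if limit is None:
        return True
    return ce.get("GlueCEStateTotalJobs", 0) < limit


def apply_default_requirement(ces):
    """Drop CEs whose job count has reached GlueCEPolicyMaxTotalJobs."""
    return [ce for ce in ces if accepts_more_jobs(ce)]
```

Applying the filter on the WMS side rather than in the UI would mean that every submission gets the same default, whatever client produced the job description.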
Issues addressed by LCG that we didn’t integrate yet
  #3931: Suggest a local proxy expiration check for WMS jobs
    Proxy expiry check in the jobwrapper
  #4318: Matchmaking policy for resubmitted jobs
    Remove previously matched sites in resubmission
    Now we remove only previously matched CEs
  #4365: WL libraries/daemons must retry BDII queries
    When the first query fails, it sleeps 5 seconds and retries; when the second attempt fails, it sleeps another 5 seconds and tries a third, final time
  #4388: WP1 on IA64: correct pointer casts in sources
    Changes in interactive and LB to support IA64
    Changes integrated for interactive but not for LB (as far as I know)
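
The retry behaviour described for #4365 (three attempts, 5 seconds apart) can be sketched as below. This is an illustration, not the LCG patch: `query_with_retry` and its parameters are hypothetical names, and `query` stands for whatever callable performs the BDII lookup.

```python
import time


def query_with_retry(query, attempts=3, delay=5.0, sleep=time.sleep):
    """Call `query()` up to `attempts` times, sleeping `delay` seconds after each failure.

    The exception from the final attempt is re-raised to the caller.
    Catching every Exception is a simplification for this sketch; real
    code would catch only the errors raised by the directory query.
    """
    for attempt in range(1, attempts + 1):
        try:
            return query()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
```

The `sleep` parameter exists only so that the delay can be replaced in tests; by default the function really waits 5 seconds between attempts.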
Issues addressed by LCG that we didn’t integrate yet
  #4892: NS can (partially) crash with ‘unable to receive’ uncaught exception
  #5109: WMS daemon memory leaks
    Memory leaks in JC, ldif2classad, LM, LB, NS
    Fixes integrated only for JC and LM (as far as I know)
  #5274: Interface Resource Broker to Dataset catalogue (use the DataLocationInterface)
    Heinz’s stuff