LCG and Glite open issues Massimo Sgaravatto INFN Padova


1 LCG and Glite open issues Massimo Sgaravatto INFN Padova
JRA1 IT-CZ cluster meeting, December 14-15, 2004
EGEE is a project funded by the European Union under contract IST

2 How to manage LCG and GLITE bugs
- LCG and GLITE bugs are “managed” in different ways
  - See next slides
- LCG and GLITE both use Savannah
  - Easy to get confused
  - Check whether the bug web page title starts with “LCG” or “JRA1 middleware” to tell them apart

3 How to deal with LCG bugs
- I (and usually also Pacio) receive LCG bug notifications
- Then I CC (via Savannah) the relevant person(s)
- The relevant persons are expected:
  - To attach the fix, or to provide feedback on the patch already attached by LCG
  - To commit the same patch to our CVS(s), if applicable
- Don’t change the bug status in Savannah

4 How to deal with GLITE bugs
- receives GLITE bug notifications for CE, WMS, Accounting, LB
- As far as I understand, security bugs are assigned to JRA3; J. Hahkala then assigns the VOMS server ones to Valerio/Vincenzo
- At present *all* bugs are also notified to the Iteam ML
  - This is going to change: instead all bugs will be notified to:
  - See:
- You are supposed to change the status in Savannah for “your” bugs
  - None → Accepted → In progress
  - In progress → Ready for integration (when fixed in CVS)
  - Don’t close the bug (this is up to the testing team or to the person who opened the bug)!
- Let D. Smith know about the problem (and the proper fix) if it also applies to LCG
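The status changes a developer is expected to make can be written down as a small transition table. This is only an illustrative sketch: the status names come from the slide, while the `NEXT_STATUS` table and the `advance` helper are hypothetical and say nothing about how Savannah itself is configured.

```python
# Status names are taken from the slide; the table and helper are
# hypothetical and do not reflect Savannah's real configuration.
NEXT_STATUS = {
    "None": "Accepted",
    "Accepted": "In progress",
    "In progress": "Ready for integration",  # only once the fix is in CVS
}


def advance(status, fixed_in_cvs=False):
    """Return the next status a developer may set, or raise ValueError."""
    if status not in NEXT_STATUS:
        # Anything past "Ready for integration" (e.g. closing the bug) is
        # left to the testing team or to the person who opened the bug.
        raise ValueError(f"developers do not change status {status!r}")
    if status == "In progress" and not fixed_in_cvs:
        raise ValueError("the fix must be committed to CVS first")
    return NEXT_STATUS[status]
```

Calling `advance("Accepted")` yields `"In progress"`, while calling it on `"Ready for integration"` raises, which mirrors the rule that developers do not close bugs themselves.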

5 LCG problems hopefully already addressed
- The bugs below are still open in the LCG Savannah, but they have already been addressed
  - Patches provided (by us, or by LCG)
  - Still open because the patches are under test or still to be tested
- #2716, #3252, #3546, #3807, #3848, #3883, #3884, #3895, #3896, #3900, #3916, #4009, #4047, #4070, #4098, #4109, #4126, #4127, #4144, #4378, #4836, #4891, #4909, #5237, #5238, #5244, #5261, #5269, #5274, #5427, #5471, #5488, #5575

6 LCG issues not addressed yet
- #3302: On a RB+SE node there is a GridFTP problem
  - Asked LCG for clarifications: no answer
  - Not considered a high-priority problem
- #3671: To drain an RB
  - They would like to be able to disallow new submissions while still allowing the other commands
  - Not addressed yet: only suggested, as a trick, to set MaxInputSandboxSize=0
  - Doesn’t work for jobs without ISB
- #3724: LogMonitor should be resilient to full file system
  - Still to be understood why irepository.dat could not be recovered
  - Not investigated further
- #3808: NetworkServer must log from which UI the job was submitted
  - A patch was provided, but it logs the UI address and the user DN in *separate* messages (and it is not possible to connect them unambiguously)
  - Asked whether they could use the LB info instead: no answer

7 LCG issues not addressed yet
- #3871: edg-wl-bkserverd: Terminating after 500 connections
  - 'event_store_recover': likely an inter-thread locking bug, which must be investigated
- #4319: Suggestion for change of policy for resubmitted jobs
  - Basically they (D. Smith) think that if the job doesn’t even start its execution on a WN, this should not be counted as a (re)submission
  - Fix applied by David Smith, under test:
    - Logging of “running” by the LRMS as a priority event, with the return code checked
    - If the logging doesn't appear successful, the job script returns an indication of the error in the output/maradona and exits without starting the job
    - No events logged by the LRMS → the job didn’t start → shallow resubmission can be performed
    - The maximum number of resubmissions of this new type per job is a broker-side configuration option
    - The new resubmissions won't be done if doing so would send the job back to a previously tried destination
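The three conditions listed for #4319 can be read as a single decision function. The sketch below is one possible reading of the slide, not the code of the patch under test; the function name and all parameter names are hypothetical.

```python
def may_shallow_resubmit(lrms_logged_running, shallow_count, max_shallow,
                         candidate_ce, tried_ces):
    """Decide whether a shallow resubmission is allowed (illustrative only).

    lrms_logged_running: True if the "running" event was logged by the LRMS
    shallow_count:       shallow resubmissions already done for this job
    max_shallow:         broker-side limit (a configuration option)
    candidate_ce:        destination the job would be sent to next
    tried_ces:           destinations already tried for this job
    """
    if lrms_logged_running:
        return False  # the job started, so this is not the shallow case
    if shallow_count >= max_shallow:
        return False  # broker-side limit reached
    if candidate_ce in tried_ces:
        return False  # never send the job back to a tried destination
    return True
```

For example, a job with no LRMS events, one earlier shallow resubmission, a limit of three and a fresh destination would be eligible; the same job offered a destination already in `tried_ces` would not.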

8 LCG issues not addressed yet
- #4894: NS can become unresponsive during dialogue with client
  - Marco agreed with D. Smith to review that part of the code
- #4570: Multiple cancel requests can crash WM (and possibly PR)
  - Addressed for PR
  - For WM already discussed (it would require major modifications)
- #5347: FD limit for LM
  - D. Smith changed the system hard limit on file descriptors for the LM (to 16384) because of the large number of condorG logfiles (and associated state files)
  - This was not sufficient: at some points in the code (e.g. in dgssl.c) select()s are done on fd sets of type 'fd_set', which are only large enough for 1024 descriptors
- #5351: WMS uninitialised variable
  - Noticed possible use of an uninitialised variable in JobControllerReal::cancel, but there is no indication that it was causing problems
  - Waiting for further information
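The limit described for #5347 is a property of select() and its fixed-size fd_set, so raising the process limit on descriptors does not help by itself. One common way around this class of limit is poll(), which takes a list of descriptors instead of a fixed-size set. The sketch below illustrates that idea in Python on a Unix system; it is not the fix adopted for the LM, and `wait_readable` is a hypothetical helper.

```python
import os
import select


def wait_readable(fds, timeout_s):
    """Return the descriptors in fds that become readable within timeout_s.

    Uses poll(), which is not bound by the fixed size of fd_set.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    events = poller.poll(int(timeout_s * 1000))  # poll() takes milliseconds
    return [fd for fd, mask in events if mask & select.POLLIN]


# Small demonstration on a pipe: after one byte is written,
# the read end is reported as readable.
read_end, write_end = os.pipe()
os.write(write_end, b"x")
ready = wait_readable([read_end], timeout_s=1)
os.close(read_end)
os.close(write_end)
```

The same helper works unchanged for descriptor numbers above 1023, which is the case where a select() call on an fd_set breaks down.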

9 LCG issues not addressed yet
- #5404: JC/LM id repository
  - Inconsistency between the JC (memory resident) id repository and the LM (disk resident) version
  - To be investigated
- #5442: Setting output path for LCG GUI Job Monitor
  - Actually the problem was that the user didn’t read the doc
  - The only problem that needs to be fixed is that the GUI always tries to use the home directory for the retrieval of the OSB (it doesn’t remember the previous choice)
- #5549: NS cannot handle being addressed through RB host alias
    Selected Virtual Organisation name (from --config-vo option): dteam
    **** Error: API_NATIVE_ERROR ****
    Error while calling the "NSClient::multi" native api
    AuthenticationException: Failed to establish security context...
    **** Error: UI_NO_NS_CONTACT ****
    Unable to contact any Network Server
  - In the NS log file:
    10 Nov, 23:03:29 -F- "Manager::run": Manager: Failed to acquire credentials
    Nov, 23:05:01 -F- "Manager::run": Manager: Failed to acquire credentials
    Nov, 23:05:30 -F- "Manager::run": Manager: Failed to acquire credentials...

10 Issues addressed by LCG that we didn’t integrate yet
- #3931: Suggest a local proxy expiration check for WMS jobs
  - Proxy expiry check in the jobwrapper
- #4318: Matchmaking policy for resubmitted jobs
  - Remove previously matched sites in resubmission
  - Now we remove only previously matched CEs
- #4365: WL libraries/daemons must retry BDII queries
  - When the first query fails, it sleeps 5 seconds and retries; when the second attempt fails, it sleeps another 5 seconds and tries a third, final time
- #4892: NS can (partially) crash with ‘unable to receive’ uncaught exception
- #5109: WMS daemon memory leaks
  - Memory leaks in JC, ldif2classad, LM, LB, NS
  - Fixes integrated only for JC and LM (as far as I know)
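The retry behaviour described for #4365 (three attempts, with a 5 second sleep between them) can be sketched as follows. This is only an illustration of the policy stated on the slide, not the LCG patch; `query_with_retries` and its parameters are hypothetical, and the `sleep` argument is there so the pause can be replaced in tests.

```python
import time


def query_with_retries(run_query, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call run_query() up to `attempts` times, pausing between failures.

    The last error is re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_query()
        except Exception as exc:  # any failure counts as a failed query here
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)  # no sleep after the final attempt
    raise last_error
```

With the defaults, a query that fails twice and then succeeds returns its result after two 5 second pauses; a query that fails three times raises the third error without a further pause.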

11 GLITE problems hopefully already addressed
- The bugs below are still open in the Glite Savannah, but they have already been addressed
  - Still open because the patches are under test or still to be tested
- #4588, #4630, #4631, #4893, #5071, #5089, #5094, #5115, #5202, #5248, #5325, #5361, #5406, #5792, #5832, #5869, #5903, #5904, #5926, #5932, #5934, #5977

12 GLITE issues not addressed yet
- #5029: On /opt/glite/libexec/voms/voms_install_db strange error message for typo in parameter
- #5125: glite-lb-bkserverd start/stop/status displays usage options
  - Still waiting for clarifications from the user who submitted this bug
- #5378: voms-proxy-info crashes
  - voms-proxy-info taken from UI in AFS (Datamat)
  - I don’t see anything in Savannah for this bug
- #5383: wms client commands crash when used with a VOMS proxy
  - Doesn’t appear anymore …
  - Status to be changed into “Ready for integration”?
- #5278: lack of logging information for the workload_manager daemon
  - Discussed between Mario and FrancescoG
- #5494: Can't generate voms-proxies
  - “Can't interpret old format!” error message

13 GLITE issues not addressed yet
- #5582: Unable to get voms proxy info from a voms proxy
  - Submitted by PeppeGrid
- #5802: hardcoded GLITE_LOCATION in voms-proxy-init
- #5804: unnecessary C++ statement std::flush after std::endl
  - VOMS related bug
- #5833: all jobs in SUBMITTED after a job storm
  - SUBMITTED status for approx. 3 hours because most LB events did not arrive at the bookkeeping server in a timely fashion
  - Being investigated by CESNET
- #5938: Error using VOMS_Retrieve from voms C api

14 GLITE issues not addressed yet
- #5965: incorrect chown in the glite-wms-parse-configuration.sh file
  - The script does: chown $GLITE_WMS_USER.$GLITE_WMS_USER $location
  - This doesn’t work if GLITE_WMS_USER belongs to a group with a different name and no group named after GLITE_WMS_USER exists
  - Should we also define a GLITE_WMS_GROUP?
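One way to answer the question raised for #5965 is to use an explicitly configured group when there is one and otherwise fall back to the user's primary group, instead of assuming a group named after the user. The sketch below shows that lookup in Python on a Unix system; it is not the fix chosen for the script, `chown_owner_spec` is a hypothetical helper, and GLITE_WMS_GROUP is only the variable proposed on the slide.

```python
import grp
import pwd


def chown_owner_spec(user, group=None):
    """Build a 'user:group' argument for chown (illustrative only).

    group is the value of a possible GLITE_WMS_GROUP setting; when it is
    empty, the user's primary group is looked up instead of assuming that
    a group with the same name as the user exists.
    """
    if not group:
        gid = pwd.getpwnam(user).pw_gid  # primary group id of the user
        try:
            group = grp.getgrgid(gid).gr_name
        except KeyError:
            group = str(gid)  # chown also accepts a numeric group id
    return f"{user}:{group}"
```

The result uses the colon separator, which is the POSIX form of the owner argument; the dot used in the script is an older form that some chown implementations still accept.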

