LCG and Glite open issues Massimo Sgaravatto INFN Padova


1 LCG and Glite open issues Massimo Sgaravatto INFN Padova

2 LCG problems hopefully addressed
The bugs below are still open in the LCG Savannah, but they have already been addressed
- Patches provided (by us, or by LCG)
- Still open because patches under test/still to be tested
#3546, #3808, #3848, #4144, #4319, #6134, #6295, #6653, #7372, #7582, #7875, #8011, #8034, #9268
Massimo Sgaravatto - INFN Padova

3 LCG issues not addressed yet
#3671: To drain an RB
- They would like to be able to disallow new submissions while still allowing the other commands
- Asked LCG about the idea discussed here last time: look for a given file, created by the admin on the Broker (if the file exists the NS will drain all submissions)
- No feedback so far
#3724: LogMonitor should be resilient to full file system
- Still to be understood why irepository.dat could not be recovered
- Priority lowered: it has not happened again
#4570: Multiple cancel requests can crash WM (and possibly PR)
- Addressed for PR
- For WM already discussed (it would require major modifications)
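The drain-file idea for #3671 can be illustrated with a minimal Python sketch. It is not the NS code: the file path and the command names are hypothetical, since the slide does not name them.

```python
import os

# Hypothetical path: the slide does not say which file the admin would create.
DRAIN_FILE = "/tmp/rb_drain_example"

# Hypothetical set of commands that count as "new submissions".
SUBMISSION_COMMANDS = {"submit"}


def is_draining(drain_file=DRAIN_FILE):
    """The broker is draining whenever the admin-created file exists."""
    return os.path.exists(drain_file)


def accept_command(command, drain_file=DRAIN_FILE):
    """Reject new submissions while draining; let every other command through."""
    if command in SUBMISSION_COMMANDS and is_draining(drain_file):
        return False
    return True
```

Checking for the file on every request means the admin can start and stop draining without restarting any daemon.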

4 LCG issues not addressed yet
#5404: JC/LM id repository
- Inconsistency between the JC (memory resident) id repository and the LM (disk resident) version
- This happened when a daemon was down for a while
- Each daemon needs to know whether its partner is alive or dead
- Proposal submitted to LCG for feedback: each daemon writes a file with an epoch and updates it every m seconds; if the date in the partner's file is older than a threshold, the partner is considered dead and a more or less drastic solution can be taken
- No feedback
#10666: edg-wl-purgeStorage does not appear to work
- Salvo provided a patch, but there is still a problem:
  21 Sep, 13:23:10 -F- purgeStorageEx: edg_wll_JobStatus: SSL Error: error:140BA0C3:lib(20):func(186):reason(195)
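The heartbeat-file proposal for #5404 could look roughly like the following Python sketch. The function names, the file format (one epoch value per file) and the atomic-rename detail are assumptions made for illustration; the proposal itself only specifies the epoch file, the update period and the staleness threshold.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Write the current epoch to `path`; called every m seconds by a daemon.

    The write goes through a temporary file and os.replace() so the partner
    never reads a half-written value (an assumption, not part of the proposal).
    """
    now = time.time() if now is None else now
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(f"{now:.0f}\n")
    os.replace(tmp, path)


def partner_is_alive(path, threshold_s, now=None):
    """Return True if the partner's heartbeat is at most `threshold_s` old.

    A missing or unreadable heartbeat file is treated as a dead partner.
    """
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            last = float(f.read().strip())
    except (OSError, ValueError):
        return False
    return (now - last) <= threshold_s
```

The threshold would normally be a few multiples of m, so that a single delayed update does not make the partner look dead.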

5 LCG issues not addressed yet
#10061: bkserver and --rgmaexport
- RGMA export functionality of the bkserverd in conjunction with a process to read the export file and publish events on RGMA
- Possibility to nominate a unix domain socket that will be sent a 1 byte 'signal' just after a line is appended to the export file
- Problem if the monitoring process fails and stops reading from the socket: once the send buffer on the socket is full, the bkserver blocks when attempting to sendto() and the LB service effectively hangs
- Discussed last time: no updates
#10696: trailing && causes edg-job-submit command to seg. fault
- Actually a Condor classad problem
#12766: edg-wl-bkpurge fails
  Initializing Globus common module...yes.
  Purge request: - flags: timeouts: Submitted: -1 … - list of jobs: Not specified.
  Error running the edg_wll_Purge(). Transport endpoint is not connected ((null))
  Running the edg_wll_Purge...no.End.
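One way to avoid the hang described in #10061 is to make the notification socket non-blocking, so that a stalled reader costs a dropped signal instead of a blocked server. The Python sketch below only illustrates that technique; it is not the bkserver code (which is written in C), and the function names are hypothetical.

```python
import socket


def make_notifier():
    """Create a non-blocking Unix datagram socket used only for sending."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.setblocking(False)
    return sock


def notify(sock, path):
    """Send the 1-byte 'signal' to the socket bound at `path`.

    Returns False instead of blocking or raising when the signal cannot be
    delivered, for example because the reader has stopped draining its queue
    or is not running at all.
    """
    try:
        sock.sendto(b"\x01", path)
        return True
    except OSError:
        # Covers a full queue (would block) as well as a missing or refused peer.
        return False
```

Dropping a signal is harmless here under the assumption that the reader re-scans the export file on the next signal it does receive, since the file itself remains the source of truth.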

6 Issues addressed by LCG that we didn’t integrate yet
#4318: Matchmaking policy for resubmitted jobs
- Remove previously matched sites in resubmission
- Now we remove only previously matched CEs
#11084: Gang-matching takes a very long time
Several changes in the job wrapper
- See David Smith’s mail

7 GLite bugs
- Please change the bug status to “Ready for integration” only when a tag with the relevant fix has been given to the Iteam
- Write as a comment the tag name or the GLite version where the bug is supposed to have been fixed

8 Glite problems hopefully already addressed
The bugs below are still open in the Glite Savannah, but they have already been addressed
- Still open because patches under test/still to be tested
#4588, #6439, #6665, #6682, #6760, #7097, #7203, #7227, #7490, #7808, #7910, #8003, #8499, #8500, #8630, #8637, #8759, #8786, #8899, #9030, #9031, #9040, #9087, #9125, #9135, #9136, #9137, #9139, #9140, #9183, #9256, #9518, #9522, #9541, #9545, #9628, #9634, #9700, #9753, #9757, #9759, #9760, #9761, #9762, #9768, #9805, #9822, #9823, #9957, #9960, #9963, #9964, #9973, #9985, #9996, #10047, #10072, #10525, #10527, #10536, #10546, #10607, #10692, #10693, #10772, #10896, #10954, #11050, #11093, #11194, #11250, #11292, #11293, #11387, #11463, #11631, #12648, #13048
- Some bugs in “Remind” (waiting for more info)
#7231, #7312, #7324, #7718, #8540, #8600, #8998, #9194, #9391, #9777

9 Glite issues not addressed yet
#5278: lack of logging information for the workload_manager daemon
- Discussed between Mario and FrancescoG
#7512: Glite Python modules can be overloaded by the user
- Glite commands use the Python modules that are found first on PYTHONPATH, and this value can be modified by users
#7977: count not correctly supported for a DAG node
- If planning fails for a DAG node (i.e. the pre script fails, in DAGMan terms), the job is aborted, without considering that the retry count could allow further attempts
#8327: Final job status error when job fails
- “Cancelled” even if the user didn’t issue a job-cancel
- Cancel was triggered by JC, because of a Condor problem
- Alessio investigated the problem, but could not fully understand its cause, since the logs got rotated
#9127: WMS could not exit from strange status
- WM kept crashing because of a syntax error in the input.fl file
- input.fl provided as requested by FrancescoP

10 Glite issues not addressed yet
#9609: Jobs should be rejected if proxy renewal can't succeed
- “if a job requests proxy renewal it seems a good idea to me to check immediately whether it will be possible to renew the proxy, and reject it if not, rather than waiting for a failure hours later when the proxy expires”
- “We agreed we will specify a new JDL attribute that can be used by the user to explicitly ask the WMS to check if the proxy can be renewed (by trying to renew the proxy immediately after submission). In order to handle this attribute we will also extend the API for the proxy registration so the proxy renewal daemon could perform such a check and report problems back to the caller (i.e. WMS).” posted by Daniel Kouril
#9701: erroneous rpath in several shared objects
- Some of the shared objects in the LB have a built-in rpath set to a non-existent and non-standard directory 'home/glbuild/...'
- This path is preferred over any other paths provided by /etc/ld.so.conf or LD_LIBRARY_PATH
- On all systems, this will trigger an open(2) call that will always fail with ENOENT (causing minor overhead on each load)
- On a system using automount for /home, this will cause mount attempts for "/home/glbuild" (that will fail), and a load on the naming service serving the automount maps
- If the naming service is not file-based, but uses NIS/YP or LDAP, this triggers a directory call for each load attempt, and a significant slowdown of all services on that node
- Discussed yesterday

11 Glite issues not addressed yet
#9900: InputSandbox Parameter not inherited by nodes in a DAG
- The InputSandbox Parameter does not seem to be inherited by nodes in a DAG, and causes a Job Wrapper error
- Didn’t we already release the relevant fix?
#10058: After upgrading the WMS from 1.2 to 1.3 an error occurs when starting up the service
- Dependency on mod_ssl to be added in the WMS subsystem, even though this dependency appears to have been added in GridSite
#10487: LSF CE failed to submit job to CERN lxbatch
- BLAH submissions to LSF fail at CERN
- Discussed last time, when it was said that it was necessary to contact the relevant persons to get more info: no updates
#10781: Missing the timestamps of 'Scheduled' and 'Running' status
- Because Condor no longer logs the "Globus submitted" event (017) and sometimes the "executed" event (001)

12 Glite issues not addressed yet
#10800: WMS UI error reporting needs to be improved
- “Error while calling the "AdWrapper::toDagAd" native api. Invalid DAG” not very meaningful
- The API used by the UI to parse the JDL of a DAG only returns the generic error "Invalid DAG" when there are problems in the JDL
#10803: WMS C++ API crashes when using getStatus() method
- Under investigation by Datamat
#10917: CurrentStep attribute has no effect for checkpointable jobs
- Under investigation by Alessio
#11535: Job submission extremely slow
- Because of the synchronous registration of the job in the LB?
- Under investigation by CESNET
#11704: WM crashes if ism_dump contains entry purchased from rgma and the rgma purchaser is disabled
- Submitted by Salvo

13 Glite issues not addressed yet
#11765: 1.4 WMS can either work with the bdii or with the rgma purchaser, but NOT with both
#11788: glite-job-list-match is very slow in a WMS 1.4 using a bdii
- Unwanted consequence of fix for bug #9628
- Brokerinfo created for each match
#12225: Enhancement requested for CREAM-BLAH integration
- Submitted in Savannah (as requested by Milano’s developers)
- Modifications done and committed
#12456: strange behaviour of Job::listMatchingCE (using voms)
- Didn’t we already release the relevant fix?
#12458: Job::listMatchingCE gives incorrect answer (WMS C++ API)

14 Glite issues not addressed yet
#12699: LogMonitor fails on job submission in RC1.5
- Didn’t we already release the relevant fix?
#12702: boost::filesystem::path: invalid name ...
#12721: JW failed (in RC1.5)
#12742: There seems to be a memory leak in the 1.4 WMS
- Memory leak in NS
- Memory leak in external software?
#13342: Authorization does not work with fqans not containing the 'Role' tag
- User authZ does not work in WMProxy when the proxies issued by VOMS contain FQANs not encompassing the Role tag

15 Glite issues not addressed yet
#13418: Problems in computing status (of resubmitted jobs)
- Job aborted by WMS (resubmission failed) but job state is waiting
#13451: DLI - LFC interface for InputData not working (and #10607)
- Didn’t the right tag of service-discovery get into the release?
#13455: JP index server does not check authorization properly
- Missing constraint in DB between users.userid and jobs.ownerid, and missing comparison of users.userid with client DN in code
#13457: Proxy renewal daemon CPU usage on the gL WMS node
- Observed proxy renewal using up to 50% of the CPU. Is this normal?
- “When doing the renewal, the daemon quite often connects to the MyProxy server(s), which requires a full TLS handshake. It also generates a new public/private key pair (when doing delegation from the MyProxy service and renewing VOMS attributes). All this operations require pretty much of CPU power which grows with increasing number of the proxies handled by the service. (Last but least, there certainly can be an error in the code)” posted by Daniel Kouril

16 Glite issues not addressed yet
#13458: wms configuration file corrupted
- In org.glite.wms.configuration/config/glite_wms.conf a semicolon is missing after the ShallowRetryCount attribute, which leads to a classad parsing error
#13477: WMS components should use only one logging time format
- In the WM log, entries were found followed by entries that have jumped one hour back in time
#13492: Job State Information Log File
- “The old EDG RB had a feature to publish the Job State information into R-GMA. We tried this code about 9 months ago but it wasn't reliable so we re-wrote that section of the code. The LB Server now writes the job state information to a log file and a separate daemon script reads this file and publishes the information into R-GMA. Please could you ensure that this functionality is included in the glite LB Server”
#13494: ARC job submitter
- “We would like the glite RB to be able to submit to ARC CEs.”
#13495: Job wrapper
- “Two important changes were to run a Job Monitoring Script on the WN and to enable the Job to find where the client tools have been installed on the WN. Please can you ensure that these two modifications are in the Job Wrapper on the glite RB.”

17 Glite issues not addressed yet
#10630: glite_dgas_hlrUserInfo: -u and -g options are not working
#10631: glite_dgas_hlrUserInfo: inconsistent behaviour when only using -s option
- If the HLR database contains only one user, and the following command is used:
  /opt/glite/sbin > glite_dgas_hlrUserInfoClient -s "lxb0769.cern.ch:56568:"
  the command returns the only record in the user database
- If the database contains more than one user, the same command returns nothing and exits without giving any message
#10949: glite_dgas_hlrAdvancedQueryClient > inconsistent result when querying a user who didn't submit any jobs
- Got Error from server:Info not Found,Exit status:3

18 Glite issues not addressed yet
#10950: glite_dgas_hlrAdvancedQueryClient > -g option doesn't work
- It doesn't work either with resourceAggregate, userJobList and resourceJobList
#11155: glite_dgas_hlrAdvancedQueryClient > -v option doesn't work
- It also happens with resourceJobList
#11460: DGAS on CE ignores GLITE settings
- “Even though I have GLITE_LOCATION_VAR pointing to /var/glite and GLITE_LOCATION_LOG pointing to /var/log/glite in my config file and in the /opt/glite/etc/dgas_gianduia.conf file, DGAS server writes to /opt/glite/var and /opt/glite/var/log”
#13408: The aclFile used by the dgas_ce_getAcctLogd daemon should contain ip addresses
- Instead of host names

19 Glite issues not addressed yet
#7395: Unhelpful error message with missing certificate info
- “WARNING: Unable to verify signature!” is not very clear when the certificate of the VOMS server is not installed
- “Error: Cannot find AC issuer” is not considered very clear either
#7662: references to EDG license in voms
- Instead of the EGEE one
#8021: VOMS test script doesn't tell you the progress when in the mass voms-proxy-* phase
- Partially addressed in VOMS v
#10431: VOMS admin and VOMS need to be harmonized
- The same parameters for voms and voms-admin should use the same flag in the configuration, as the scripts are closely related
- E.g. --vo on voms-admin vs --voms_vo on voms_install_db
#10729: VOMS API should return clean FQAN
- The VOMS API returns the following strings when the FQAN is queried:
  /EGEE/dm/Role=catalog-admin/Capability=NULL
  /EGEE/dm/Role=NULL/Capability=NULL
- Should instead return:
  /EGEE/dm/Role=catalog-admin
  /EGEE/dm
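The "clean FQAN" form requested in #10729 amounts to dropping the Role and Capability fields whose value is NULL. The Python sketch below shows that transformation on the strings from the slide; `clean_fqan` is a hypothetical helper written for illustration and is not part of the VOMS API.

```python
def clean_fqan(fqan):
    """Drop 'Role=NULL' and 'Capability=NULL' components from an FQAN string."""
    kept = []
    for part in fqan.split("/"):
        if not part:
            continue  # skip the empty piece produced by the leading '/'
        key, sep, value = part.partition("=")
        if sep and key in ("Role", "Capability") and value == "NULL":
            continue  # placeholder field, carries no information
        kept.append(part)
    return "/" + "/".join(kept)


print(clean_fqan("/EGEE/dm/Role=catalog-admin/Capability=NULL"))
print(clean_fqan("/EGEE/dm/Role=NULL/Capability=NULL"))
```

For the two inputs from the slide this prints `/EGEE/dm/Role=catalog-admin` and `/EGEE/dm`.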

20 Glite issues not addressed yet
#11092: KCA-signed certificates not handled properly
- Waiting for FNAL to give krb principals to test this case
#11227: VOMS fails to build on RedHat 7.2
- Posted by VDT
#12513: voms client man pages still refer to edg
#12613: VOMS reconnects to the Oracle database for each SQL query
- The database services consider this bug a show-stopper for the move of the VOMRS/VOMS databases from the pre-production to the production servers
#13356: init script uses bad ps command
- “ps --cols 1000 x -f” should be replaced with “ps -efww”

