1 EGEE is a project funded by the European Union under contract IST-2003-508833
The Workload Management System: an example
Simone Campana, LCG Experiment Integration and Support, CERN IT
www.eu-egee.org

2 Example of failure: the ".maradona" case
Problem description:
  I submit a job (https://lxb0704.cern.ch:9000/qv0Qb_iuxA1wwWHIcszaCQ)
  The job fails, and in the logging info I get:
    Event: Done
    - exit_code = 1
    - host = lxb0704.cern.ch
    - reason = Cannot read JobWrapper output, both from Condor and from Maradona.
    - source = LogMonitor
    - src_instance = unique
    - status_code = FAILED
    - timestamp = Fri Feb 18 10:29:47 2005
    - user = /C=CH/O=CERN/OU=GRID/CN=host/lxb0704.cern.ch
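A minimal sketch of how this event history is retrieved from the User Interface, assuming the standard EDG user commands are installed (the job ID is the one from this example):

  # Check the current job status
  edg-job-status https://lxb0704.cern.ch:9000/qv0Qb_iuxA1wwWHIcszaCQ

  # Dump the full event history; -v 2 selects the most verbose level
  edg-job-get-logging-info -v 2 https://lxb0704.cern.ch:9000/qv0Qb_iuxA1wwWHIcszaCQ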

3 Example of failure: the ".maradona" case
Problems with Maradona? Maradona can make the shot … or it can miss the shot …

4 Example of failure: the ".maradona" case
This error means that the exit status of the user job could not be delivered to the RB, even though two independent methods are tried:
  The job wrapper script writes the user job exit status to stdout, which Globus is supposed to send back to the RB.
  The exit status is also written into an extra "Maradona" file, which is copied to the RB with globus-url-copy.
Such failures are really "expensive": the job might have finished correctly, but you just cannot retrieve its exit status.
  Consequently, the job is considered FAILED and you cannot retrieve the output.
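To make the two channels concrete, here is a hedged sketch of the double-reporting idea in shell; the file name, variable names and DEST_URL are illustrative, not the actual EDG job wrapper code:

  # Sketch only: the user payload has just finished inside the job wrapper
  status=$?

  # Method 1: the exit status goes to stdout, which Globus should stream back to the RB
  echo "job exit status = $status"

  # Method 2: the status is also written to a ".maradona" file and copied
  # to the RB with globus-url-copy (DEST_URL is a placeholder)
  echo "job exit status = $status" > .maradona
  globus-url-copy "file://$PWD/.maradona" "$DEST_URL"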

5 Example of failure: the ".maradona" case
Identify the CondorG ID:
  Change to the directory /var/edgwl/logmonitor/CondorG.log
  grep the CondorG*.log files for the EDG ID (in this case https://lxb0704.cern.ch:9000/qv0Qb_iuxA1wwWHIcszaCQ); if none is found, also check the subdirectory recycle/
  BE CAREFUL not to create files in these directories by accident: they would probably confuse the log monitor.
  Probably only one file will match, although if the job was resubmitted it is possible that more than one will match.
  Open the matching file and look for the first occurrence of the EDG ID. It should be a message of the form 'Job submitted from host...'
  Look for the CondorG ID, the string in parentheses (here 744):
    (744.000.000) 02/18 11:29:49 Job submitted from host: \
        ( https://lxb0704.cern.ch:9000/qv0Qb_iuxA1wwWHIcszaCQ )
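A hedged sketch of this lookup on the RB, using the paths and IDs from this example (read-only inspection; do not create files here):

  # Search the Condor-G logs for the EDG job ID
  cd /var/edgwl/logmonitor/CondorG.log
  grep -l 'qv0Qb_iuxA1wwWHIcszaCQ' CondorG*.log

  # If nothing matches, try the recycled logs as well
  grep -l 'qv0Qb_iuxA1wwWHIcszaCQ' recycle/CondorG*.log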

6 Example of failure: the ".maradona" case
Look for further occurrences of the CondorG ID in the same log file:
  The next should be 'Job submitted to Globus'.
  Take the 'JM-Contact' string.
At this point you can start to check on the CE:
  The hostname of the relevant CE is the first part of the JM-Contact string.
  However, before moving to the CE it is useful to check the final state of the job, so that we know what to look for.
    (744.000.000) 02/18 11:30:04 Job submitted to Globus
        RM-Contact: lxb0701.cern.ch:2119/jobmanager-torque
        JM-Contact: https://lxb0701.cern.ch:20002/5924/1108722119/
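A short sketch of pulling these events out, assuming the log file found in the previous step (the file name is illustrative):

  # All events for CondorG cluster 744 in the matching log
  grep '(744\.' CondorG0.log

  # In the 'Job submitted to Globus' event:
  #   RM-Contact -> CE host and jobmanager (lxb0701.cern.ch:2119/jobmanager-torque)
  #   JM-Contact -> per-job manager URL; its trailing fields identify the job on the CE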

7 Example of failure: the ".maradona" case
Look in the file /var/edgwl/logmonitor/log/events.log for the CondorG ID:
  In particular, look for the 'Got job terminated event'.
This shows that:
  the WMS parsed the terminated event from CondorG,
  tried to check the output of the job to verify that it ran correctly,
  but was not able to find any output that it recognized.
On the CE we therefore expect to see that the job was at least scheduled to run in the batch system, but that something went wrong after that point.
    18 Feb, 11:29:47 […] Got job terminated event.
    18 Feb, 11:29:47 […] For cluster 744; fake return value 0
    18 Feb, 11:29:47 […] EDG id = https://lxb0704.cern.ch:9000/qv0Qb_iuxA1wwWHIcszaCQ
    18 Feb, 11:29:47 […] Going to parse standard output file.
    18 Feb, 11:29:47 […] Standard output does not contain useful data.
    18 Feb, 11:29:47 […] Standard output was not useful, passing ball to Maradona...
    18 Feb, 11:29:47 […] Cannot read JobWrapper output, both from Condor and from Maradona.   (Maradona fails the shot!)
    18 Feb, 11:29:47 […] Last job terminated (744) aborted.
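A short sketch of this step and the sequence of messages to look for (same paths as above):

  # On the RB: extract the log monitor events for CondorG cluster 744
  grep '744' /var/edgwl/logmonitor/log/events.log

  # The sequence to look for, in order:
  #   "Got job terminated event"        -> CondorG reported the job as finished
  #   "Going to parse standard output"  -> method 1 (stdout via Globus) attempted
  #   "passing ball to Maradona"        -> method 2 (.maradona file) attempted
  #   "Cannot read JobWrapper output"   -> both failed, so the job is aborted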

8 Example of failure: the ".maradona" case
How you proceed on the Computing Element really depends on the type of job manager / batch system.
For jobmanager-torque (as in this case):
  Check the /var/log/messages file for the last part of the JM-Contact string, 1108722119 in https://lxb0701.cern.ch:20002/5924/1108722119/
  This leads to the local batch system job ID: 91.lxb0701.cern.ch
  The next step is to figure out which WN executed the job:
    change directory to /var/spool/pbs/server_priv/accounting and grep for that job ID;
    look for the exec_host field:
  Feb 18 11:23:11 lxb0701 gridinfo: [6052-6186] Submitted job 1108722121:lcgpbs:internal_369096027:5924.1108722119 to batch system lcgpbs with ID 91.lxb0701.cern.ch
  20050218: 02/18/2005 11:23:15;S;91.lxb0701.cern.ch;user=atlas001 group=atlas jobname=STDIN queue=atlas ctime=1108722191 qtime=1108722191 etime=1108722191 start=1108722195 exec_host=lxb0702.cern.ch Resource_List.cput=48:00:00 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=72:00:00
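A hedged sketch of these two lookups on the CE, with the IDs from this example:

  # Map the JM contact to the local batch job ID
  grep '1108722119' /var/log/messages
  # -> "... Submitted job ... with ID 91.lxb0701.cern.ch"

  # Find the worker node in the Torque/PBS accounting records
  cd /var/spool/pbs/server_priv/accounting
  grep '91.lxb0701.cern.ch' *
  # -> the 'S' (start) record carries exec_host=lxb0702.cern.ch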

9 Example of failure: the ".maradona" case
Now you know the WN name (lxb0702.cern.ch) and you can have a look at what happened there:
  You can check whether the node is up or in a strange state, whether it has free memory or disk space, and you can look at the system log …
  One way is: cd to /var/spool/pbs/mom_logs, grep the files for the PBS job ID and look at the one that matches. In this case I see:
    Job;91.lxb0701.cern.ch;Started, pid = 30555
    Job;91.lxb0701.cern.ch;kill_task: killing pid 30555 task 1 with sig 15
    Job;91.lxb0701.cern.ch;kill_task: killing pid 30673 task 1 with sig 15
    Job;91.lxb0701.cern.ch;kill_task: killing pid 30677 task 1 with sig 15
    Job;91.lxb0701.cern.ch;kill_task: killing pid 31229 task 1 with sig 15
    91.lxb0701.cern.ch;scan_for_terminated: task 1 terminated, sid 30555
    91.lxb0701.cern.ch;Terminated
  It looks like someone cleaned up the job, including all its subprocesses:
    someone must have killed the job at the PBS level; the cleanup was "too clean".
  OK, I admit it … it was me. I just ran "qdel" on the job from the CE.
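A minimal sketch of the same check on the worker node, assuming the usual date-named pbs_mom log files:

  # On lxb0702.cern.ch: search the pbs_mom logs for the PBS job ID
  cd /var/spool/pbs/mom_logs
  grep '91.lxb0701.cern.ch' *

  # "kill_task: killing pid ... with sig 15" for every process of the job,
  # with no preceding error, points to an external kill (here: qdel on the CE)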

10 The ".maradona" case … other causes
The job exceeds the queue time threshold and gets killed by the batch system.
  Very similar to the case in the example.
The problem is connected to a problematic WN whose ssh keys are badly synchronized with the CE: it cannot upload the output to the CE.
A file system server (like NFS) serving the WNs is down.
  Example: at FZK, network mounts were hanging because the mount points on the file server had changed.
  Example: at UCL, the network file system had been remounted read-only.
The job never started running on the WN because of some misconfiguration.
  Example: the home directory cannot be accessed, or it is full (Toronto, July 2004).
Jobs are aborted because the "/tmp" partition is full at job startup. Submitting a job directly via Globus to such a node produces:
    /var/spool/PBS/mom_priv/jobs/333595.pbs-.SC: cannot create temp file for here document: No space left on device
A few quick WN health checks covering these cases are sketched below.
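A minimal sketch, using only standard commands (run on the suspect WN):

  # Full /tmp or home partition?
  df -h /tmp /home

  # Stale, hanging, or read-only network mounts?
  mount | grep -i nfs

  # General state: load, free memory, recent system log entries
  uptime
  free -m
  tail -n 50 /var/log/messages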

11 The ".maradona" case … other causes
A very interesting case at Toronto during the summer of 2004:
  The CE had been under a large load for some days.
  When that happens, ssh or scp processes tend to time out.
  On that occasion the solution was to increase the maximum number of unauthenticated ssh connections from the default of 10 to 30 (see the sketch after this list). This allowed scp processes to succeed more often.
There can be a mismatch between the callback ports that a CE sends to the RB and the ports actually open for incoming calls in the firewall, so the RB's return calls are blocked (RAL experience).
At Budapest the underlying cause was a problem with Condor:
  under heavy load the Condor jobmanager somehow gets confused and thinks it should kill some ghost jobs.
The site administrator has cleaned up undelivered stdout and stderr.
The clock of the WN is skewed.
Who knows what …
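The ssh limit in the Toronto case corresponds to the MaxStartups directive of OpenSSH, which caps concurrent unauthenticated connections (default 10). A hedged sketch of the change on the CE:

  # /etc/ssh/sshd_config
  MaxStartups 30

  # then reload the daemon, e.g.
  service sshd reload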

