gLite Job Management
Amina KHEDIMI (a.khedimi@dtri.cerist.dz), CERIST
Africa 6 - Joint CHAIN/EPIKH/EUMEDGRID Support event in School on Application Porting
Rabat, June 6, 2011

Outline

- Key Elements
- Job Description Language
- Job Life Cycle
- Submission and Management of a Job
- Hands-on

Key Elements

- To submit a job to the gLite infrastructure, users contact the Workload Management System (WMS).
- The WMS is the gLite component that allows users to submit jobs and performs all the tasks required to execute them, without exposing the user to the complexity of the Grid.
- WMProxy is the service providing access to the gLite WMS through Web Services based interfaces:
  - it can be accessed through the published WSDL (Web Services Description Language);
  - it implements an SOA (Service-Oriented Architecture).

Key Elements

- The Job Description Language (JDL) is the language used to describe a job. Users describe their jobs and the jobs' requirements in JDL, and retrieve the output when the jobs have finished.
- The Command Line Interface is a suite of gLite commands used to interact with the WMS.

gLite Architecture

Job Flow

- Job management requests (submission, cancellation) are expressed via a Job Description Language (JDL).
- The WMS finds an appropriate CE for each submission request, taking into account job requests and preferences, Grid status, and utilization policies on resources.
- Submission requests are kept for a while if no resources are immediately available.
- A repository of resource information is available to the matchmaker; it is updated via notifications and/or active polling on resources.
- The Logging and Bookkeeping (LB) service stores the events generated by the various components of the WMS. By querying the LB, the user can retrieve information about the status of the job.

Job Description Language

The Job Description Language (JDL) is a high-level language based on the Classified Advertisement (ClassAd) language, used to describe jobs and aggregates of jobs with arbitrary dependency relations.

A job description is a file (called a JDL file) consisting of lines having the format:

    attribute = expression;

- Expressions can span several lines, but only the last one must be terminated by a semicolon.
- Literal strings are enclosed in double quotes.
- If a string itself contains double quotes, they must be escaped with a backslash, e.g.:

    Arguments = "\"hello\" 10";
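
As a rough illustration of the `attribute = expression;` format and the quote-escaping rule, the following Python sketch renders a dictionary as JDL text. The helper names (`jdl_string`, `jdl_value`, `render_jdl`) are hypothetical and not part of gLite; the sketch only handles strings, integers and lists of strings, and escapes nothing other than double quotes.

```python
def jdl_string(value):
    """Quote a Python string as a JDL string literal (hypothetical helper)."""
    # Only double quotes are escaped here, following the rule stated on the slide.
    return '"' + value.replace('"', '\\"') + '"'


def jdl_value(value):
    """Render a Python value as a JDL expression (strings, ints, lists of strings)."""
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(jdl_string(v) for v in value) + "}"
    if isinstance(value, int):
        return str(value)
    return jdl_string(value)


def render_jdl(attributes):
    """Render a dict as JDL lines of the form 'attribute = expression;'."""
    # No blank or tab is emitted after the closing semicolon.
    return "\n".join(f"{name} = {jdl_value(value)};" for name, value in attributes.items())


job = {
    "Executable": "my_exe",
    "Arguments": '"hello" 10',
    "InputSandbox": ["/home/amina/my_exe"],
}
print(render_jdl(job))
```

The generated text could then be saved to a `.jdl` file and passed to the submission commands.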

Simple example

    Type = "Job";
    JobType = "Normal";
    Executable = "my_exe";
    StdInput = "myinput.txt";
    StdOutput = "message.txt";
    StdError = "error.txt";
    InputSandbox = {"myinput.txt","/home/user/example/my_exe"};
    OutputSandbox = {"message.txt", "error.txt"};

Type

The JDL allows the description of the following types of requests (the ones supported by the WMS):
- Job: a simple job (default)
- DAG: a Directed Acyclic Graph of dependent jobs
- Collection: a set of independent jobs

Although DAGs and collections represent sets of jobs, they are described through a single JDL file and submitted in one shot to the WMS.

JobType

This attribute is a string representing the type of the job described by the JDL. Possible values are:
- Normal (default)
- MPICH (deprecated)
- Parametric

This attribute only makes sense when the Type attribute equals "Job".

Executable

The Executable attribute specifies the command to be run by the job.
- If the executable already resides on the remote CE worker nodes, it must be expressed as an absolute path:
    Executable = "/usr/local/java/bin/java";
  or
    Executable = "$JAVA_HOME/bin/java";
- If it has to be copied from the UI, only the file name must be specified, and the path of the command on the UI should be given in the InputSandbox attribute:
    Executable = "my_exe";
    InputSandbox = {"/home/amina/my_exe"};

Arguments

The Arguments attribute contains a string value, which is taken as the argument list for the executable. Special characters, such as &, |, \, <, >, should be preceded by a triple \ or specified inside quoted strings.

    Executable = "my_exe";
    Arguments = "-i args_input -o file_out";
    InputSandbox = {"/home/of/UI/my_exe","args_input"};

StdInput

This attribute is a string representing the standard input of the executable. This means that the job is run using bash input redirection:

    job < standard_input

    StdInput = "my_job_input";

StdOutput and StdError

The attributes StdOutput and StdError define the names of the files containing the standard output and standard error of the executable, once the job output is retrieved.

    StdOutput = "std.out";
    StdError = "std.err";

InputSandbox

This attribute specifies the list of files on the UI local file system (or on an accessible GridFTP server) needed by the job for running.
- These files are transferred from the UI to the WMS, and then downloaded to the WN.
- The InputSandbox cannot contain two files with the same name, even if they have different absolute paths, because when transferred they would overwrite each other.

    InputSandbox = {"file_1",...,"file_N"};
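
The same-name restriction can be checked on the UI before submission. The sketch below is only an illustration: `find_name_clashes` is a hypothetical helper, and the second path in the example is made up to trigger a clash.

```python
import posixpath
from collections import defaultdict


def find_name_clashes(sandbox):
    """Return {file name: [paths]} for sandbox entries that share a file name."""
    by_name = defaultdict(list)
    for path in sandbox:
        # Only the last path component matters once files land in one directory.
        by_name[posixpath.basename(path)].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


# "/home/amina/test/my_exe" is a hypothetical path used only for this example.
sandbox = ["/home/amina/my_exe", "/home/amina/test/my_exe", "args_input"]
for name, paths in find_name_clashes(sandbox).items():
    print(f"{name} appears {len(paths)} times: {paths}")
```

An empty result means the list is safe with respect to this rule; the same check applies to OutputSandbox entries.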

InputSandboxBaseURI

A new feature introduced by the gLite WMS is the possibility to indicate input sandbox files stored not on the UI but on a GridFTP server and, similarly, to specify that output files should be transferred to a GridFTP server when the job finishes.

    InputSandbox = {"file_1",...,"file_N"};
    InputSandboxBaseURI = "gsiftp://ipaddress.of.gsiFT.server:432/tmp";

InputSandboxBaseURI represents the URI on a GridFTP server where the InputSandbox files (given as absolute or relative paths) are available for transfer to the WNs.

OutputSandbox

Represents the list of output files, generated at runtime by the executable, to be retrieved to the UI after the job has finished.
- File names can be provided as simple file names or as paths relative to the current working directory on the WN.
- The list should NOT contain two or more files with the same name, because when they are transferred to the WMS machine they would overwrite each other.

    OutputSandbox = {"out_1",...,"out_N"};

OutputSandboxDestURI

To store the output sandbox files on a GridFTP server at job completion, the OutputSandboxDestURI attribute must be used together with the usual OutputSandbox attribute:

    OutputSandbox = {"fileA", "data/fileB", "fileC"};
    OutputSandboxDestURI = {"gsiftp://lxb0707.cern.ch/cms/doe/fileA", "gsiftp://lxb0707.cern.ch/cms/doe/fileB", "fileC"};

Here the first two files are copied to a GridFTP server, while the third file is copied back to the WMS with the usual mechanism. Consequently, glite-wms-job-output will retrieve only the third file.

OutputSandboxBaseDestURI

Another possibility is to use the OutputSandboxBaseDestURI attribute to specify a base URI on a GridFTP server where the files listed in OutputSandbox will be copied:

    OutputSandbox = {"fileA", "fileB"};
    OutputSandboxBaseDestURI = "gsiftp://ipaddress.of.the.server:5432/tmp";

This will copy both files under the specified GridFTP URI.

Note: the directory on the GridFTP server where the files have to be copied must already exist.

OutputSandbox notes

- The OutputSandboxDestURI and OutputSandboxBaseDestURI attributes cannot be specified together in the same JDL.
- The OutputSandboxDestURI list must have the same cardinality as the OutputSandbox list.
- If neither OutputSandboxDestURI nor OutputSandboxBaseDestURI is specified, then all the files listed in the OutputSandbox will be available on the WMS node for retrieval.
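
The first two rules lend themselves to a quick pre-submission check. The following is a minimal sketch, assuming the JDL attributes are already available as a Python dictionary; `check_output_sandbox` is a hypothetical helper, not a gLite API.

```python
def check_output_sandbox(jdl):
    """Return a list of rule violations for the OutputSandbox-related attributes."""
    problems = []
    files = jdl.get("OutputSandbox", [])
    dest = jdl.get("OutputSandboxDestURI")
    base = jdl.get("OutputSandboxBaseDestURI")
    # Rule 1: the two destination attributes are mutually exclusive.
    if dest is not None and base is not None:
        problems.append("OutputSandboxDestURI and OutputSandboxBaseDestURI cannot be specified together")
    # Rule 2: one destination entry per OutputSandbox entry.
    if dest is not None and len(dest) != len(files):
        problems.append(
            f"OutputSandboxDestURI has {len(dest)} entries but OutputSandbox has {len(files)}"
        )
    return problems


good = {
    "OutputSandbox": ["fileA", "data/fileB", "fileC"],
    "OutputSandboxDestURI": [
        "gsiftp://lxb0707.cern.ch/cms/doe/fileA",
        "gsiftp://lxb0707.cern.ch/cms/doe/fileB",
        "fileC",
    ],
}
bad = {
    "OutputSandbox": ["fileA", "fileB"],
    "OutputSandboxDestURI": ["gsiftp://lxb0707.cern.ch/cms/doe/fileA"],
    "OutputSandboxBaseDestURI": "gsiftp://ipaddress.of.the.server:5432/tmp",
}
print(check_output_sandbox(good))
print(check_output_sandbox(bad))
```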

Environment and Virtual Organisation

The Environment attribute is a list of strings representing the environment settings on the WN needed by the job to run:

    Environment = {"JOB_LOG_FILE=/tmp/job.log", "JAVABIN=/usr/local/bin/java"};

The VirtualOrganisation attribute can be used to explicitly specify the VO of the user:

    VirtualOrganisation = "gilda";

RetryCount

It is possible to have the WMS automatically resubmit jobs which, for some reason, are aborted by the Grid. The user can limit the number of times the WMS should resubmit a job by using the JDL attribute RetryCount:

    RetryCount = 5;

Requirements

The Requirements attribute can be used to express constraints on the resources where the job should run. Its value is a Boolean expression that must evaluate to true for a job to run on that specific CE.

Note: only one Requirements attribute can be specified (if there is more than one, only the last one is considered). If several conditions must be applied to the job, they must all be combined in a single Requirements attribute.

For example, suppose that the user wants to run on a CE using PBS as its batch system, and whose WNs have at least two CPUs. The job description file would then contain:

    Requirements = other.GlueCEInfoLRMSType == "PBS" && other.GlueCEInfoTotalCPUs > 1;
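
Because only the last Requirements attribute is considered, scripts that assemble a JDL from several independent conditions need to join them into one expression. A minimal sketch (the helper name `combine_requirements` is hypothetical; the extra parentheses are added only to keep each condition grouped):

```python
def combine_requirements(conditions):
    """Join several Boolean conditions into a single Requirements attribute."""
    if not conditions:
        raise ValueError("at least one condition is needed")
    # Each condition is parenthesised so that '&&' cannot change its meaning.
    expression = " && ".join(f"({condition})" for condition in conditions)
    return f"Requirements = {expression};"


print(combine_requirements([
    'other.GlueCEInfoLRMSType == "PBS"',
    "other.GlueCEInfoTotalCPUs > 1",
]))
```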

Requirements

The WMS can also be asked to send a job to a particular queue in a CE with the following expression:

    Requirements = other.GlueCEUniqueID == "lxshare0286.cern.ch:2119/jobmanager-pbs-short";

It is also possible to use regular expressions when expressing a requirement. Suppose, for example, that the user wants all jobs to run on any CE in the domain cern.ch. This can be achieved by putting the following expression in the JDL file:

    Requirements = RegExp("cern.ch",other.GlueCEUniqueID);

The opposite can be required by using:

    Requirements = (!RegExp("cern.ch", other.GlueCEUniqueID));
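
To get a feel for which CE identifiers such a pattern would select, the matching can be imitated locally. This is only an approximation: it assumes that RegExp behaves like an unanchored search, which Python's `re.search` mimics here, and the real evaluation is done by the WMS matchmaker. The CE identifiers are taken from examples in these slides.

```python
import re


def matching_ces(pattern, ce_ids, negate=False):
    """Return the CE ids that match (or, with negate=True, do not match) the pattern."""
    return [ce for ce in ce_ids if bool(re.search(pattern, ce)) != negate]


ce_ids = [
    "lxshare0286.cern.ch:2119/jobmanager-pbs-short",
    "cccreamceli09.in2p3.fr:8443/cream-sge-long",
    "cream.sns.it:8443/cream-pbs-grid",
]
print(matching_ces("cern.ch", ce_ids))               # CEs in the cern.ch domain
print(matching_ces("cern.ch", ce_ids, negate=True))  # all the other CEs
```

Note that in a regular expression an unescaped dot matches any character, so a stricter pattern would escape it (`cern\.ch`).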

Job Description Language

- The single-quote character (') cannot be used in the JDL.
- Comments must be preceded by a sharp character (#) or a double slash (//) at the beginning of each line.
- Multi-line comments must be enclosed between "/*" and "*/".
- Attention! The JDL is sensitive to blank characters and tabs: no blank characters or tabs should follow the semicolon at the end of a line.

Job Life Cycle

Job submission

Initialisation of the proxy: before using the glite-wms-job-* commands, the user must set up a short-term proxy.

    [amina@ui01 ~]$ voms-proxy-init --voms eumed
    Enter GRID pass phrase:
    Your identity: /C=IT/O=INFN/OU=Personal Certificate/L=DZ-eScience/CN=Khedimi Amina
    Creating temporary proxy .................................................. Done
    Contacting voms2.cnaf.infn.it:15016 [/C=IT/O=INFN/OU=Host/L=CNAF/CN=voms2.cnaf.infn.it] "eumed" Done
    Creating proxy .................................................. Done
    Your proxy is valid until Mon Jun 6 00:16:13 2011

Delegating a proxy to WMProxy

Each job submitted to WMProxy must be associated with a proxy credential previously delegated by the owner of the job to the WMProxy server. This proxy is then used any time WMProxy needs to interact with other services for job-related operations (e.g. submission to the CE, a GridFTP file transfer, etc.).

There are two possible mechanisms to ask for a delegation of the user credentials:
- asking for the "automatic" delegation of the credentials during the submission operation;
- asking for an "explicit" delegation.

Delegating a proxy to WMProxy

To automatically delegate a user proxy to WMProxy, the command to use is:

    glite-wms-job-delegate-proxy -a

To explicitly delegate a user proxy to WMProxy, the command to use is:

    glite-wms-job-delegate-proxy -d <delegID>

where <delegID> is a string chosen by the user. For example, to delegate a proxy:

    [amina@ui01 ~]$ glite-wms-job-delegate-proxy -d amina
    Connecting to the service https://wms.ulakbim.gov.tr:7443/glite_wms_wmproxy_server
    ================== glite-wms-job-delegate-proxy Success ==================
    Your proxy has been successfully delegated to the WMProxy(s):
    https://wms.ulakbim.gov.tr:7443/glite_wms_wmproxy_server
    with the delegation identifier: amina
    ==========================================================================

Instead of creating a delegation ID with -d, the -a option can be used. This causes a delegated proxy to be established automatically, so there is no delegation identifier to remember. However, repeated use of this option is not recommended, since it delegates a new proxy each time the commands are issued. Delegation is a time-consuming operation, so it is better to run glite-wms-job-delegate-proxy once and reuse the delegation ID when submitting your jobs.

Matching computing elements

It is possible to see which CEs are suitable for running a job:

    glite-wms-job-list-match <JDL file>

Options:
    -a         Automatic delegation, or
    -d <dID>   Use a previously, explicitly delegated proxy.
    -o <file>  Store the CE list in a file.

    [amina@ui01 ~]$ glite-wms-job-list-match -a hostname.jdl
    Connecting to the service https://wms.ulakbim.gov.tr:7443/glite_wms_wmproxy_server
    ==========================================================================
    COMPUTING ELEMENT IDs LIST
    The following CE(s) matching your job requirements have been found:
    *CEId*
    - cccreamceli09.in2p3.fr:8443/cream-sge-long
    - cccreamceli09.in2p3.fr:8443/cream-sge-medium
    - cccreamceli09.in2p3.fr:8443/cream-sge-short
    - ce0.m3pec.u-bordeaux1.fr:2119/jobmanager-pbs-eumed
    - ce01.isabella.grnet.gr:2119/jobmanager-pbs-eumed
    - cream.sns.it:8443/cream-pbs-grid
    ==========================================================================

Submitting a simple job

Starting from a simple JDL file, we can submit it via WMProxy with:

    $ glite-wms-job-submit -d mydelegID -r <CEId> test.jdl

Options (either -a or -d must be used):
    -a         Automatic delegation.
    -d <dID>   Use a previously, explicitly delegated proxy.
    -o <file>  Append the jobId to the specified file (creating it if necessary).
    -r <CE>    Send the job directly to a particular CE, without checking the CE for suitability or creating a BrokerInfo file.

    [amina@ui01 ~]$ glite-wms-job-submit -d amina hostname.jdl
    Connecting to the service https://wms.ulakbim.gov.tr:7443/glite_wms_wmproxy_server
    ====================== glite-wms-job-submit Success ======================
    The job has been successfully submitted to the WMProxy
    Your job identifier is:
    https://wms.ulakbim.gov.tr:9000/P1wT0EB-b-6opI0BHuOWkQ
    ==========================================================================

With automatic delegation:

    glite-wms-job-submit -a test.jdl

The command returns the job identifier (jobID), which uniquely defines the job and can be used to perform further operations on it, such as querying the system about its status or cancelling it. The format of the jobID is:

    https://<LB_hostname>[:<port>]/<unique_string>
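
Since the jobID has the shape of an HTTPS URL, its parts can be pulled apart with a standard URL parser. The sketch below uses Python's `urllib.parse`; `parse_job_id` and the dictionary keys are hypothetical names chosen for this example.

```python
from urllib.parse import urlparse


def parse_job_id(job_id):
    """Split a jobID of the form https://<LB_hostname>[:<port>]/<unique_string>."""
    parts = urlparse(job_id)
    unique = parts.path.lstrip("/")
    if parts.scheme != "https" or not parts.hostname or not unique:
        raise ValueError(f"not a valid jobID: {job_id!r}")
    # The port is optional in the jobID format, so it may be None.
    return {"lb_host": parts.hostname, "port": parts.port, "unique": unique}


print(parse_job_id("https://wms.ulakbim.gov.tr:9000/P1wT0EB-b-6opI0BHuOWkQ"))
```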

Retrieving the status of a job

    [amina@ui01 ~]$ glite-wms-job-status https://wms.ulakbim.gov.tr:9000/P1wT0EB-b-6opI0BHuOWkQ
    ======================= glite-wms-job-status Success =====================
    BOOKKEEPING INFORMATION:
    Status info for the Job : https://wms.ulakbim.gov.tr:9000/P1wT0EB-b-6opI0BHuOWkQ
    Current Status: Done (Success)
    Exit code: 0
    Status Reason: Job terminated successfully
    Destination: lpsc-ce.in2p3.fr:2119/jobmanager-pbs-eumed
    Submitted: Sun Jun 5 14:14:19 2011 CET
    ==========================================================================

The verbosity level controls the amount of information provided; the value of the -v option ranges from 0 to 3. The command can take several jobIDs as arguments, or the -i <file path> option can be used:

    glite-wms-job-status -i jobid
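
When the status is checked from a script, the interesting fields can be extracted from the command's text output. The following is a minimal sketch, assuming the output keeps the "Field: value" layout shown on this slide (the exact spacing of the real output may differ); `parse_status` is a hypothetical helper and the sample text is copied from the slide.

```python
SAMPLE_OUTPUT = """\
BOOKKEEPING INFORMATION:
Status info for the Job : https://wms.ulakbim.gov.tr:9000/P1wT0EB-b-6opI0BHuOWkQ
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: lpsc-ce.in2p3.fr:2119/jobmanager-pbs-eumed
Submitted: Sun Jun 5 14:14:19 2011 CET
"""

WANTED = {"Current Status", "Exit code", "Status Reason", "Destination", "Submitted"}


def parse_status(text):
    """Extract selected 'Field: value' pairs from glite-wms-job-status output."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only: values may contain colons (ports, times).
        key, sep, value = line.partition(":")
        if sep and key.strip() in WANTED:
            fields[key.strip()] = value.strip()
    return fields


status = parse_status(SAMPLE_OUTPUT)
print(status["Current Status"], "on", status["Destination"])
```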

Cancelling a job

    [amina@ui01 ~]$ glite-wms-job-cancel https://lb.grid.arn.dz:9000/XR5Mbu2uqhEr0z0amfNWgA
    Are you sure you want to remove specified job(s) [y/n]y : y
    Connecting to the service https://wms.grid.arn.dz:7443/glite_wms_wmproxy_server
    ====================== glite-wms-job-cancel Success ======================
    The cancellation request has been successfully submitted for the following job(s):
    - https://lb.grid.arn.dz:9000/XR5Mbu2uqhEr0z0amfNWgA
    ==========================================================================

If the cancellation is successful, the job will terminate in status CANCELLED.

Retrieving the output(s)

    [amina@ui01 ~]$ glite-wms-job-output https://wms.ulakbim.gov.tr:9000/P1wT0EB-b-6opI0BHuOWkQ
    Connecting to the service https://wms.ulakbim.gov.tr:7443/glite_wms_wmproxy_server
    ================================================================================
    JOB GET OUTPUT OUTCOME
    Output sandbox files for the job:
    https://wms.ulakbim.gov.tr:9000/P1wT0EB-b-6opI0BHuOWkQ
    have been successfully retrieved and stored in the directory:
    /tmp/jobOutput/amina_P1wT0EB-b-6opI0BHuOWkQ

The default location for storing the outputs (normally /tmp) is defined in the UI configuration, but the directory in which to save the output can be specified with the --dir <path name> option:

    glite-wms-job-output -i jobId --dir /path

Jobs State Machine

- Submitted: the job has been entered by the user through the User Interface.
- Waiting: the job has been accepted and is waiting for Workload Manager processing.
- Ready: the job has been processed by the WM but not yet transferred to the CE (local batch system queue).
- Scheduled: the job is waiting in the queue on the CE.
- Running: the job is running.
- Done: the job has exited or is considered to be in a terminal state by CondorC (e.g., submission to the CE has failed in an unrecoverable way).
- Aborted: job processing was aborted by the WMS (waiting in the WM queue or on the CE for too long, expiration of user credentials).
- Cancelled: the job has been successfully cancelled on user request.
- Cleared: the output sandbox was transferred to the user or removed due to the timeout.
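
For scripts that poll the job status, the nine states can be captured in a small enumeration. The sketch below is an illustration only: the `FINISHED` set reflects an assumption made for this example (that a job in Done, Aborted, Cancelled or Cleared no longer needs polling) and is not a definition taken from the WMS documentation.

```python
from enum import Enum


class JobState(Enum):
    SUBMITTED = "Submitted"
    WAITING = "Waiting"
    READY = "Ready"
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    DONE = "Done"
    ABORTED = "Aborted"
    CANCELLED = "Cancelled"
    CLEARED = "Cleared"


# Assumption for this sketch: states after which polling can stop.
FINISHED = {JobState.DONE, JobState.ABORTED, JobState.CANCELLED, JobState.CLEARED}


def is_finished(status_text):
    """Map a status string such as 'Done (Success)' to a finished/not-finished flag."""
    # Keep only the first word, dropping qualifiers such as '(Success)'.
    state = JobState(status_text.split()[0])
    return state in FINISHED


print(is_finished("Done (Success)"))
print(is_finished("Running"))
```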

A useful reminder

Summary

Jobs run in batch mode on traditional gLite grids. Steps in running a job on a gLite grid with the WMS:
- Create a text file in the Job Description Language.
- Create a proxy.
- Optional check: list the computing elements that match your requirements ("glite-wms-job-list-match -a myfile.jdl").
- Submit the job ("glite-wms-job-submit -a myfile.jdl", or with -d <delegID>). Submission is non-blocking; each job is given an ID.
- Occasionally check the status of your job ("glite-wms-job-status").
- When the status is "Done", retrieve the output ("glite-wms-job-output").
- Or just cancel the job ("glite-wms-job-cancel").
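
The whole sequence can be laid out as a list of command lines. The sketch below only builds and prints the commands, without running them, so it can be read away from a UI machine. It uses only the options shown in these slides; `workflow_commands` is a hypothetical helper and the default file and directory names are placeholders.

```python
def workflow_commands(jdl_file, vo, deleg_id, jobid_file="jobid", out_dir="/tmp/jobOutput"):
    """Return the command lines for a typical submit / status / output cycle."""
    return [
        ["voms-proxy-init", "--voms", vo],                    # create a proxy
        ["glite-wms-job-delegate-proxy", "-d", deleg_id],     # explicit delegation
        ["glite-wms-job-list-match", "-d", deleg_id, jdl_file],
        ["glite-wms-job-submit", "-d", deleg_id, "-o", jobid_file, jdl_file],
        ["glite-wms-job-status", "-i", jobid_file],
        ["glite-wms-job-output", "-i", jobid_file, "--dir", out_dir],
    ]


commands = workflow_commands("hostname.jdl", vo="eumed", deleg_id="amina")
for argv in commands:
    print(" ".join(argv))
```

Delegating once with -d and reusing the identifier follows the recommendation given on the delegation slide; each list could be handed to `subprocess.run` on a machine where the gLite UI is installed.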

References

- WMProxy User's Guide: https://edms.cern.ch/file/674643/1/EGEE-JRA1-TEC-674643-WMPROXY-guide-v0-3.pdf
- JDL Attributes Specification: https://edms.cern.ch/file/590869/1/EGEE-JRA1-TEC-590869-JDL-Attributes-v0-9.pdf
- gLite User's Guide: https://edms.cern.ch/file/722398/1.2/gLite-3-UserGuide.pdf

Questions …

Hands-on

- https://grid.ct.infn.it/twiki/bin/view/GILDA/SimpleJobSubmission
- https://grid.ct.infn.it/twiki/bin/view/GILDA/MoreOnJDL
- https://grid.ct.infn.it/twiki/bin/view/GILDA/WmProxyUse