1
The EU DataGrid Job Submission Services
The European DataGrid Project Team
2
Contents
- The EDG Workload Management System (WMS)
- Job Submission to the EDG Testbed
  - Job Preparation
  - Job Description Language (JDL)
  - Job Submission & Monitoring
- A simple program example: the job lifecycle
3
The EDG WMS
The user interacts with the Grid via a Workload Management System (WMS). The goal of the WMS is distributed scheduling and resource management in a Grid environment.
What does it allow Grid users to do?
- To submit their jobs
- To execute them
- To get information about their status
- To retrieve their output
The WMS tries to optimize the usage of resources.
It provides the whole community of Grid users with a set of tools to submit their jobs, have them executed, get information about their status, retrieve their output, and access Grid resources in an optimal way.
4
WMS Components
The WMS is currently composed of the following parts:
- User Interface (UI): the user's access point to the WMS
- Resource Broker (RB): the broker of Grid resources, responsible for finding the “best” resources to which jobs are submitted
- Job Submission Service (JSS): provides a reliable submission system
- Information Index (II): an LDAP server used by the Resource Broker as a filter on the Information Service (IS) to select resources
- Logging and Bookkeeping services (LB): store job information that users can query
The EDG WMS comprises five major parts. The User Interface hosts all of the user-level commands; this is where the user interacts with the WMS (and thus with the Grid). The Resource Broker tries to schedule jobs on the Grid so that Grid resources are used efficiently. The Job Submission Service actually delivers jobs to the computing elements chosen by the Resource Broker. The Information Index, a “cache” of the Information Service (populated by the EDG Information Service providers), allows the Resource Broker to retrieve information about the status of the Grid. Finally, the Logging and Bookkeeping services store job information and are used for job monitoring.
5
Job Preparation: Let’s think the way the Grid thinks!
Information to be specified:
- Job characteristics
- Level of integration: self-contained, medium Grid integration, or strong Grid integration
- Requirements and preferences of the computing system
- Software dependencies
- Job data requirements
All of this is specified using a Job Description Language (JDL).
Before actually running a job on the Grid, some questions about the nature of the job need to be answered. What are the characteristics of the job (executable, stdin, etc.)? What are the computational requirements (CPU speed, multi-processor machines, ...)? What are the data requirements (input data, output storage element, etc.)? Are there any software dependencies (software that must already be installed on the machines that will eventually execute the job)?
6
Job Description Language (JDL) 1/5
- Based upon Condor’s CLASSified ADvertisement language (ClassAd)
  - ClassAd is a fully extensible language
- A ClassAd is constructed with the classad construction operator []
  - It is a sequence of attributes separated by semicolons
  - An attribute is a pair (key, value), where the value can be a Boolean, an integer, a list of strings, ...
  - <attribute> = <value>;
- So the JDL allows the user to define a set of attributes that the WMS takes into account when making its scheduling decision
In order to run a job on the EDG Testbed, it has to be described by means of the EDG Job Description Language (JDL). The JDL is the means by which the job to be run is defined. It specifies job characteristics and requirements, such as the application to be run, the input data set, the required and preferred resources, and so forth. JDL is based upon Condor’s Classified Advertisement language (ClassAd), a simple expression-based language for specifying both resources and requests. ClassAd facilitates matching between resources and customers, and describes properties by means of <attribute> = <value> pairs.
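The <attribute> = <value>; form described above can be illustrated with a small Python sketch that renders a dictionary as JDL-style text. This is only an illustration under simplifying assumptions: the helper names to_jdl_value and to_jdl are hypothetical, and only Booleans, integers, strings and lists of strings are handled (expressions such as Requirements or Rank are not).

```python
def to_jdl_value(value):
    """Render a Python value as a simplified JDL/ClassAd-style literal."""
    if isinstance(value, bool):            # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, (list, tuple)):   # lists are written inside braces
        return "{" + ", ".join(to_jdl_value(v) for v in value) + "}"
    return '"' + str(value) + '"'          # everything else is treated as a string


def to_jdl(attributes):
    """Render a dict as one '<attribute> = <value>;' line per entry."""
    return "\n".join(
        f"{name} = {to_jdl_value(value)};" for name, value in attributes.items()
    )


job = {
    "Executable": "/bin/echo",
    "Arguments": "Hello World",
    "OutputSandbox": ["Message.txt", "stderr.log"],
}
print(to_jdl(job))
```

Keeping the job description as plain data and rendering it at the last moment makes it easy to generate many similar JDL files from one template.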
7
Job Description Language (JDL) 2/5
The supported attributes are grouped in two categories:
- Job attributes
  - Define the job itself
- Resource attributes
  - Taken into account by the RB when carrying out the matchmaking algorithm
  - Computing resource attributes
    - Used by the user to build the expressions of the Requirements and/or Rank attributes
    - Have to be prefixed with “other.”
  - Data and storage resource attributes
    - Input data to process, the SE where output data is to be stored, and the protocols spoken by the application when accessing SEs
8
Job Description Language (JDL): relevant attributes 3/5
- Executable (mandatory)
  - The command name
- Arguments (optional)
  - Job command-line arguments
- StdInput, StdOutput, StdError (optional)
  - Standard input/output/error of the job
- Environment (optional)
  - List of environment settings
- InputSandbox (optional)
  - List of files on the UI local disk needed by the job for running
  - The listed files will automatically be staged to the remote resource
- OutputSandbox (optional)
  - List of files, generated by the job, which have to be retrieved
9
Job Description Language (JDL): relevant attributes 4/5
- Requirements
  - Job requirements on computing resources
  - Specified using attributes of resources published in the Information Service
  - If not specified, the default value defined in the UI configuration file is used
    - Default: other.Active (the resource has to be able to accept jobs)
- Rank
  - Expresses preference (how to rank the resources that have already met the Requirements expression)
  - If not specified, the default value defined in the UI configuration file is used
    - Default: -other.EstimatedTraversalTime (the lowest estimated traversal time)
10
Job Description Language (JDL): “data” attributes 5/5
- InputData (optional)
  - Refers to data used as input by the job; these data are published in the Replica Catalog and stored in the SEs
  - PFNs and/or LFNs
- ReplicaCatalog (mandatory if InputData has been specified with at least one LFN)
  - The Replica Catalog identifier
- DataAccessProtocol (mandatory if InputData has been specified)
  - The protocol or list of protocols that the application is able to use for accessing InputData on a given SE
- OutputSE (optional)
  - The Uniform Resource Identifier of the output SE
  - The RB uses it to choose a CE that is compatible with the job and is close to the SE
11
Example JDL File
Executable = "gridTest";
StdError = "stderr.log";
StdOutput = "stdout.log";
InputSandbox = {"home/joda/test/gridTest"};
OutputSandbox = {"stderr.log", "stdout.log"};
InputData = "LF:testbed ";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/ \
    lc=test, rc=WP2 INFN Test, dc=infn, dc=it";
DataAccessProtocol = "gridftp";
Requirements = other.Architecture=="INTEL" && \
    other.OpSys=="LINUX" && other.FreeCpus >=4;
Rank = other.MaxCpuTime;
This example JDL specifies a job that should run the executable gridTest on a computing element running on INTEL Linux boxes with at least 4 free CPUs, and it allows the user to retrieve the standard output and error files. The job requires input data that is registered with the logical file name LF:testbed , and a specific replica catalog should be used to map this logical file name to physical data. If the input data is only available remotely, gridftp should be used to make the data locally available. Finally, if multiple CEs match the job requirements, the CE with the maximum allowed CPU time should be chosen.
12
Job Submission
dg-job-submit [-r <res_id>] [-n <user address>] [-c <config file>] [-o <output file>] <job.jdl>
  -r: the job is submitted by the RB directly to the computing element identified by <res_id>
  -n: a message containing basic information about the job (status and identification) is sent to the specified <address> when the job enters one of the following states: DONE or ABORTED, READY, RUNNING
  -c: the configuration file <config file> is used by the UI instead of the standard configuration file
  -o: the generated dg_jobId is written to <output file>
      Useful for other commands, e.g.: dg-job-status -i <input file> (or dg_jobId)
      -i: the status information about the dg_jobId contained in <input file> is displayed
The dg-job-status command, used without options and followed directly by a dg_jobId, displays only the bookkeeping information for that particular jobId. The -c option can be used not only with dg-job-submit but also with the other commands when they are invoked with the -all option, because they need to know which LBs to contact.
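As an illustration of the synopsis above, the following Python sketch assembles a dg-job-submit argument list from optional settings. The function name build_submit_command and its parameter names are hypothetical; only the flags -r, -n, -c and -o come from the slide, and the command is built but not executed.

```python
def build_submit_command(jdl_file, res_id=None, notify_address=None,
                         config_file=None, output_file=None):
    """Return the dg-job-submit argument list for the given options."""
    argv = ["dg-job-submit"]
    # An option is only added when the caller supplied a value for it.
    for flag, value in (("-r", res_id), ("-n", notify_address),
                        ("-c", config_file), ("-o", output_file)):
        if value is not None:
            argv += [flag, value]
    argv.append(jdl_file)  # the JDL file name goes last on the command line
    return argv


print(build_submit_command("HelloWorld.jdl", output_file="jobid.txt"))
```

Returning a list rather than a single string avoids quoting problems if the command is later passed to something like subprocess.run on a UI machine.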
13
Job Submission Scenario
[Diagram: UI (with JDL), Resource Broker (RB), Job Submission Service (JSS), Compute Element (CE), Storage Element (SE), Replica Catalogue (RC), Information Service (IS), Logging & Bookkeeping (LB).]
The next slides show how the job is handled by the WMS.
14
A Job Submission Example
[Diagram: UI, RB, JSS, CE, SE, RC, IS, LB. Labels: JDL, Input Sandbox, Job Submit Event. Job status: submitted.]
Once the user has created a JDL file describing his/her job, he/she submits the JDL through the UI, which is configured to contact the appropriate Resource Broker and logs the submission event to the Logging & Bookkeeping service. The InputSandbox, containing anything needed for the job to run (e.g. command line arguments, small input data sets), is transferred to the Resource Broker. The job is now in the SUBMITTED state.
15
A Job Submission Example
[Diagram: UI, RB, JSS, CE, SE, RC, IS, LB. Job status: submitted, waiting.]
The RB has received the job. Based on the information given in the JDL, the RB queries the RC (to get the PFNs for the given LFNs, and therefore to know where the required data are available in the Grid) and the Information Service (to get information on the current status of data and hardware resources), and matches the request to a suitable CE. During this phase, the job is in the WAITING state.
16
A Job Submission Example
[Diagram: UI, RB, JSS, CE, SE, RC, IS, LB. Job status: submitted, waiting, ready.]
The RB has found a suitable CE. Eventually, it informs the Logging & Bookkeeping service of its decision and hands the job over to the Job Submission Service (JSS). The job is now in the READY state.
17
A Job Submission Example
[Diagram: UI, RB, JSS, CE, SE, RC, IS, LB. Label: BrokerInfo. Job status: submitted, waiting, ready, scheduled.]
The JSS translates the request (into Globus RSL), passes it to the chosen CE, and also informs the Logging service. The .BrokerInfo file, in which all RB decisions are written, is also copied to the CE via a wrapper. The job enters the SCHEDULED state.
18
A Job Submission Example
[Diagram: UI, RB, JSS, CE, SE, RC, IS, LB. Label: Input Sandbox. Job status: submitted, waiting, ready, scheduled, running.]
The CE gets the InputSandbox from the RB and any necessary data from the SEs. The job is eventually executed on the CE (on a WN); it is then in the RUNNING state.
19
A Job Submission Example
[Diagram: UI, RB, JSS, CE, SE, RC, IS, LB. Label: Job Status. Job status: submitted, waiting, ready, scheduled, running.]
The computing element informs the logging service about the progress of the job. During this whole process the user may monitor the status of his job by contacting the Logging & Bookkeeping service.
20
A Job Submission Example
[Diagram: UI, RB, JSS, CE, SE, RC, IS, LB. Label: Job Status. Job status: submitted, waiting, ready, scheduled, running, done.]
The execution of the job has completed on the CE. The job is in the DONE state.
21
A Job Submission Example
[Diagram: UI, RB, JSS, CE, SE, RC, IS, LB. Labels: Job Status, Output Sandbox. Job status: submitted, waiting, ready, scheduled, running, done, outputready.]
The CE transfers the output sandbox (basically the stdout, stderr, etc.) to the RB once the job is OutputReady. At this point the user can collect his/her output sandbox from the RB.
22
A Job Submission Example
[Diagram: UI, RB, JSS, CE, SE, RC, IS, LB. Label: Output Sandbox. Job status: submitted, waiting, ready, scheduled, running, done, outputready, cleared.]
The user has retrieved all output files successfully. Bookkeeping information is purged some time after the job enters the CLEARED state. Remember that during this whole process the user may monitor the status of his job by contacting the Logging & Bookkeeping service.
23
Possible Job States
[Diagram: transitions among the states SUBMITTED, WAITING, READY, SCHEDULED, RUNNING, DONE(ok), DONE(failed), DONE(cancelled), OUTPUTREADY, CLEARED and ABORTED.]
This slide shows all possible job states:
- SUBMITTED: the user has submitted the job via the UI
- WAITING: the RB has received the job
- READY: a CE that matches the job requirements has been selected, and the job is transferred to the JSS
- SCHEDULED: the JSS has sent the job to the CE
- RUNNING: the job is running on the CE
- DONE: this state has different meanings:
  - DONE(ok): the execution has terminated on the CE (WN) with success
  - DONE(failure): the execution has terminated on the CE (WN) with some problems
  - DONE(cancelled): the job has been cancelled with success
- OUTPUTREADY: the output sandbox is ready to be retrieved by the user. The state OUTPUTREADY reflects the time difference between the end of computation on the CE and the moment the RB got the necessary notification via Condor and JSS (the job is terminated).
- CLEARED: the user has retrieved all output files successfully, and the job bookkeeping information is purged some time after the job enters this state
- ABORTED: the job has failed. A job may fail for several reasons, one of which is external to its execution (no resource found).
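The normal progression through these states can be written down as a small transition table. The Python sketch below is an illustration only: the order of the main path follows the job submission example above, while the rule that a job may move to ABORTED from any state up to and including RUNNING is an assumption, since the transition arrows of the original diagram are not recoverable from this transcript. Resubmission, which would move a job backwards, is not modelled.

```python
# Main path, in the order used by the job submission example.
HAPPY_PATH = ["SUBMITTED", "WAITING", "READY", "SCHEDULED",
              "RUNNING", "DONE", "OUTPUTREADY", "CLEARED"]
TERMINAL = {"CLEARED", "ABORTED"}
# Assumption: a job can be aborted at any point before it is DONE.
ABORTABLE = {"SUBMITTED", "WAITING", "READY", "SCHEDULED", "RUNNING"}


def allowed_next(state):
    """Return the set of states that may directly follow `state`."""
    if state in TERMINAL:
        return set()
    successors = {HAPPY_PATH[HAPPY_PATH.index(state) + 1]}
    if state in ABORTABLE:
        successors.add("ABORTED")
    return successors


def is_valid_history(states):
    """Check that every consecutive pair of states is an allowed transition."""
    return all(b in allowed_next(a) for a, b in zip(states, states[1:]))


print(is_valid_history(HAPPY_PATH))
print(is_valid_history(["SUBMITTED", "RUNNING"]))
```

A table like this is handy when polling job status from a script, for example to detect that a job has reached a terminal state and polling can stop.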
24
Job resubmission
- If something goes wrong, the RB tries to reschedule and resubmit the job (possibly to a different resource)
- Maximum number of resubmissions (considering all the resources matching the requirements): min(RetryCount, RB_submission_retries)
  - RetryCount: JDL attribute
  - RB_submission_retries: attribute in the RB configuration file
- E.g., to disable job resubmission for a particular job, set RetryCount=0; in the JDL file
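The resubmission limit above is a simple minimum of two settings. The following Python sketch shows how such a budget might bound a retry loop; the function names and the try_once callback are hypothetical, and the sketch does not reproduce the RB's actual implementation.

```python
def max_resubmissions(retry_count, rb_submission_retries):
    """Resubmission budget: min(RetryCount, RB_submission_retries)."""
    return min(retry_count, rb_submission_retries)


def submit_with_resubmission(try_once, retry_count, rb_submission_retries):
    """Call try_once() until it succeeds or the budget is exhausted.

    Returns (succeeded, submissions_made). The first submission is always
    made; the budget only limits the resubmissions that follow it.
    """
    budget = max_resubmissions(retry_count, rb_submission_retries)
    submissions = 0
    while True:
        submissions += 1
        if try_once():
            return True, submissions
        if submissions > budget:
            return False, submissions


# RetryCount = 0 in the JDL disables resubmission: one submission only.
print(submit_with_resubmission(lambda: False, 0, 3))
```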
25
Other WMS UI Commands
- dg-job-list-match
  - Lists resources matching a job description
  - Performs the matchmaking without submitting the job
- dg-job-cancel
  - Cancels a given job
- dg-job-status
  - Displays the status of the job
- dg-job-get-output
  - Returns the job output (the OutputSandbox files) to the user
- dg-job-get-logging-info
  - Displays logging information about submitted jobs (all the events “pushed” by the various components of the WMS)
  - Very useful for debugging purposes
- dg-job-id-info
  - A utility for the user to display job info in a formatted style
Apart from the job submission command, the WMS provides a set of command-line tools for managing jobs that have already been submitted. The dg-job-list-match command returns the list of resources fulfilling the job requirements. The dg-job-cancel command cancels one or more submitted jobs. The dg-job-status command displays bookkeeping information about submitted jobs. The dg-job-get-output command asks the RB for the job output files (specified by the OutputSandbox attribute of the JDL) and stores them on the local disk of the submitting machine. The dg-job-get-logging-info command displays logging information about submitted jobs (all the events “pushed” by the various components of the WMS). The dg-job-id-info command just parses the dg_jobId string and displays the information contained in the job identifier in a formatted way.
26
UI configuration file
Can be set if the user is not happy with the default one.
Most relevant attributes:
- RB(s)
  - When submitting a job, the first specified RB is tried; if the operation fails, the second one is tried, and so on
- LBserver(s)
  - The LB to be used for a job is chosen by the RB
  - So when a dg-job-status <dg-jobid> is issued, the LB to contact is the one specified in the dg-jobid
  - This list specifies the LB(s) that must be contacted when issuing dg-job-status -all / dg-job-get-logging-info -all (to get information for all the jobs belonging to that user)
- Default JDL Requirements
  - other.Active
- Default JDL Rank
  - -other.EstimatedTraversalTime
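The "try the first RB, then the second" behaviour described for the RB list can be sketched in Python as follows. This is a hypothetical illustration, not the UI's real code: submit_to stands in for whatever call contacts a Resource Broker, and it is assumed to raise ConnectionError when the broker cannot be reached.

```python
def submit_via_first_available(rb_addresses, submit_to):
    """Try each Resource Broker in configuration order.

    Returns (address, result) for the first broker that accepts the job.
    Raises ConnectionError if every broker in the list fails.
    """
    errors = []
    for address in rb_addresses:
        try:
            return address, submit_to(address)
        except ConnectionError as exc:
            # Remember why this broker failed and move on to the next one.
            errors.append(f"{address}: {exc}")
    raise ConnectionError("no Resource Broker available: " + "; ".join(errors))
```

Collecting the individual failures makes the final error message more useful than reporting only the last broker tried.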
27
UI configuration file Example
# LB addresses
# [<prot>://]<full host name>[:<port number>]
%%beginLB%%
%%endLB%%
# RB addresses
# <full host name>[:<port number>]
%%beginRB%%
ccedgli01.in2p3.fr:7771
%%endRB%%
# UI environment settings and corresponding default values
#
# Format is always <var name> = <var value>.
## Stage IN/OUT Storage Paths ##
DEFAULT_STORAGE_AREA_IN = /tmp
## Default values for Mandatory Attributes ##
requirements = other.Active
rank = - other.EstimatedTraversalTime
## Job Submission User Interface Version
Version =
## Default location for storage of log files
ErrorStorage= /tmp
## Default values for number of Retries on Fatal Error:
RetryCountLB = 1
RetryCountJobId = 1
## Number of timeout seconds for dg-log-transfer API call
LoggingTimeout = 10
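To make the layout of this file concrete, here is a minimal Python sketch of a reader for it. The parsing rules are inferred from the example alone (lines starting with # are comments, %%beginX%% ... %%endX%% delimit address lists, everything else is <var name> = <var value>); the function name parse_ui_config and the returned dictionary layout are hypothetical, and the real UI may parse the file differently.

```python
SAMPLE = """\
# RB addresses
%%beginLB%%
%%endLB%%
%%beginRB%%
ccedgli01.in2p3.fr:7771
%%endRB%%
requirements = other.Active
rank = - other.EstimatedTraversalTime
RetryCountLB = 1
"""


def parse_ui_config(text):
    """Split a UI configuration file into address blocks and settings."""
    config = {"LB": [], "RB": [], "settings": {}}
    block = None  # name of the %%begin...%% block currently open, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("%%begin") and line.endswith("%%"):
            block = line[len("%%begin"):-2]
            config.setdefault(block, [])
        elif line.startswith("%%end") and line.endswith("%%"):
            block = None
        elif block is not None:
            config[block].append(line)  # one address per line
        elif "=" in line:
            name, _, value = line.partition("=")
            config["settings"][name.strip()] = value.strip()
    return config


parsed = parse_ui_config(SAMPLE)
print(parsed["RB"], parsed["settings"]["rank"])
```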
28
WMS Match Making 1/4
The RB is the core component of the WMS.
- It has to find the most suitable computing resource (CE) on which the job will be executed
- It interacts with the Data Management service and the Information Service
  - They supply the RB with all the information required to resolve the matches
- The CE chosen by the RB has to match the job requirements (e.g. runtime environment, data access requirements, and so on)
- If two or more CEs satisfy all the requirements, the one with the best Rank is chosen
29
WMS Match Making 2/4
The RB has to deal with three possible scenarios.
Scenario 1: Direct Job Submission
- The job is scheduled on a given CE (specified in the dg-job-submit command via the -r option)
- The RB does not perform any matchmaking algorithm
The main task of the WMS is to match the requirements of a job with the available resources in an efficient way. In doing so, the WMS has to deal with three scenarios:
1. A CE was specified in JDL. This is the easiest case, since the WMS only has to schedule the job on the given CE. If the CE is not able to execute the job (e.g. the batch system is not correct), the user receives an error message.
30
WMS Match Making 3/4
Scenario 2: Job Submission without data-access Requirements
- Neither a CE nor input data are specified
- The RB starts the matchmaking algorithm, which consists of two phases:
  - Requirements check (the RB contacts the IS to check which CEs satisfy all the requirements)
  - Rank computation: if more than one CE satisfies the job requirements, the CE with the best rank is chosen by the RB
2. A job that specifies neither a CE nor input data requirements. In this case the RB contacts the information services to retrieve information on the current Grid status. Based on this information it matches the requirements against the current testbed situation and ranks potential CEs according to the rank information given in the JDL. During the first phase, the RB uses the information cached in the II, while in the second phase the RB contacts each suitable CE directly, rather than using the IS as the source of information, since the rank attributes represent quantities that vary over time.
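The two phases of the matchmaking algorithm can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RB's implementation: CEs are modelled as plain dictionaries with made-up identifiers and values, and the Requirements and Rank expressions from the example JDL file earlier in this presentation are written as Python functions. As in that example, the CE with the highest rank value is taken to be the best one.

```python
def match(ces, requirements, rank):
    """Two-phase matchmaking: filter by requirements, then pick the best rank."""
    # Phase 1: requirements check.
    candidates = [ce for ce in ces if requirements(ce)]
    if not candidates:
        return None  # no resource found
    # Phase 2: rank computation; the highest rank wins.
    return max(candidates, key=rank)


# Hypothetical CE descriptions, as they might be published for matchmaking.
ces = [
    {"Id": "ce-a", "Architecture": "INTEL", "OpSys": "LINUX", "FreeCpus": 8, "MaxCpuTime": 120},
    {"Id": "ce-b", "Architecture": "INTEL", "OpSys": "LINUX", "FreeCpus": 2, "MaxCpuTime": 600},
    {"Id": "ce-c", "Architecture": "INTEL", "OpSys": "LINUX", "FreeCpus": 4, "MaxCpuTime": 300},
]


def requirements(other):
    # other.Architecture=="INTEL" && other.OpSys=="LINUX" && other.FreeCpus >=4
    return (other["Architecture"] == "INTEL"
            and other["OpSys"] == "LINUX"
            and other["FreeCpus"] >= 4)


def rank(other):
    # Rank = other.MaxCpuTime
    return other["MaxCpuTime"]


best = match(ces, requirements, rank)
print(best["Id"])  # ce-c: ce-b fails the FreeCpus check, ce-c outranks ce-a
```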
31
WMS Match Making 4/4
Scenario 3: Job Submission with data-access Requirements as well
- The CE is not specified in the JDL
- The RB interacts with the Data Management service to find the most suitable CE, also taking into account the SEs where the input data sets are physically stored and where the output data sets should be staged on completion of job execution
- The RB strategy consists of submitting jobs close to data
- The two main phases of the matchmaking algorithm remain unchanged:
  - Requirements check
  - Rank computation
- What changes with respect to the second scenario? Now the RB executes the two phases for each class of CEs that satisfy the data-access requirements (i.e. which are close to data)
3. A job that does not specify a CE but has input data requirements. This case is similar to case 2; however, the RB also has to take data location into consideration. Hence, apart from contacting the Information Services, it also needs to contact the Replica Catalog in order to retrieve the physical locations of the needed data. The RB policy is to submit jobs close to data.
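The "submit jobs close to data" policy can be sketched by adding a data-location filter in front of the two matchmaking phases. The Python example below is a simplification under stated assumptions: each CE is a dictionary with a hypothetical CloseSEs field listing the SEs it is close to, data_ses stands for the SEs that a Replica Catalog lookup would report as holding the input data, and only input data (not the output SE) is considered.

```python
def match_close_to_data(ces, data_ses, requirements, rank):
    """Restrict matchmaking to CEs that are close to an SE holding the data."""
    wanted = set(data_ses)
    # Data-access filter: keep CEs close to at least one SE with the input data.
    close = [ce for ce in ces if wanted & set(ce["CloseSEs"])]
    # The usual two phases, applied to the data-close CEs only.
    candidates = [ce for ce in close if requirements(ce)]
    return max(candidates, key=rank) if candidates else None


# Hypothetical CEs and SE names, for illustration only.
ces = [
    {"Id": "ce-a", "FreeCpus": 8, "CloseSEs": ["se-1"]},
    {"Id": "ce-b", "FreeCpus": 16, "CloseSEs": ["se-2"]},
]

chosen = match_close_to_data(
    ces,
    data_ses=["se-1"],                          # where the input data lives
    requirements=lambda other: other["FreeCpus"] >= 4,
    rank=lambda other: other["FreeCpus"],
)
print(chosen["Id"])  # ce-a: ce-b has more free CPUs but is not close to the data
```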
32
Example of Job Submission Sequence
- User logs in on the UI
- User issues a grid-proxy-init and enters his certificate’s password, getting a valid Globus proxy
- User sets up his or her JDL file
Example of Hello World JDL file:
Executable = "/bin/echo";
Arguments = "Hello World";
StdOutput = "Message.txt";
StdError = "stderr.log";
OutputSandbox = {"Message.txt","stderr.log"};
We now go through a detailed “hello world” job submission example.
33
Example of Job Submission Sequence
- User issues a: dg-job-submit HelloWorld.jdl and gets back from the system a unique Job Identifier (JobId)
- User issues a: dg-job-status JobId to get logging information about the current status of his job
- When the “OutputReady” status is reached, the user can issue a dg-job-get-output JobId, and the system returns the name of the temporary directory where the job output can be found on the UI machine
The path of the temporary directory is composed of two parts. The first can be specified in the UI configuration file, or in the dg-job-get-output command with the --dir option. The second is the unique number of the dg_jobId identifier.
34
Job Submission Example
EliJDL]$ dg-job-submit HelloWorld.jdl
Connecting to host lxshare0381.cern.ch, port 7771
Logging to host lxshare0381.cern.ch, port 15830
**************************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Resource Broker.
Use dg-job-status command to check job current status.
Your job identifier (dg_jobId) is:
- JobId
35
Job Submission Example Cont’d
EliJDL]$ dg-job-status \
Retrieving Information from LB server
Please wait: this operation could take some seconds.
BOOKKEEPING INFORMATION:
Printing status info for the Job :
dg_JobId =
Status = OutputReady
Last Update Time (UTC) = Wed Aug 21 12:19:
Job Destination = testbed008.cnaf.infn.it:2119/jobmanager-pbs-short
Status Reason = terminated
Job Owner = /C=IT/O=INFN/OU=Personal Certificate/L=CNAF/CN=Mario
Status Enter Time (UTC) = Wed Aug 21 12:19:
36
Job Submission Example Cont’d
EliJDL]$ dg-job-get-output --dir result
**************************************************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
-
have been successfully retrieved and stored in the directory:
/shift/lxshare072d/data01/UIhome/reale/EliJDL/result/
EliJDL]$ more result/ /Message.txt
Hello World
EliJDL]$ more result/ /stderr.log
The result directory name is specified with the --dir option. The path component that follows result/ represents the unique number of the dg_jobId.
37
Common Error Messages 1/2
The UI commands accept some arguments as input. If the user makes a mistake on the command line, the following messages can appear:
- Argument * is not allowed (the argument is not known)
- Argument * must be specified at the end of the command (both the jobId and the JDL file name must be put at the end of the command line)
- Argument * is missing for the “--output” option (the user forgot to add the parameter required by the argument)
- Argument “-all” cannot be specified with argument “--input” (some arguments are mutually exclusive)
- CEId format is: <full hostname>;<port number>/jobmanager-<service>. The provided CEID: “ has a wrong format. (the user has misspelled the CE identifier after -resource)
While the RB API is being called, the following can happen:
- Resource Broker “grid013g.cnaf.infn.it:7771” not available (a connection cannot be opened with the RB specified in the UI configuration file)
- Unable to get LB address from RB “grid013g.cnaf.infn.it” (the function get_lb_contact returned an error)
38
Common Error Messages 2/2
While the UI commands are checking the JDL file, the following errors may occur:
- Mandatory Attribute default error in the configuration file “/opt/edg/etc/UI_ConfigENV.cfg” (there aren’t any default values)
- Mandatory Attribute missing in JDL file “Executable” (Executable is one of the mandatory attributes)
- Multiple “InputSandbox” attribute found in JDL file (the InputSandbox attribute is repeated twice)
- Wrong function call for list attribute *. Function usage is: “Member/IsMember(List, Value)” (e.g. in the requirements attribute the function Member/IsMember is used with a wrong syntax)
Proxy (this refers to the security grid proxy and not to a proxy machine):
- If the user specifies a duration for the proxy that he wants to provide, using the option -h of dg-job-submit, a possible message is: Proxy certificate will expire in less then X hours. Creating a new X-hours-duration certificate (this is to make sure that at least the required proxy validity is granted)
39
WMS Proxy Renewal
- Why? To avoid job failure because the job outlived the validity of the initial proxy
- The WMS supports an automatic proxy renewal mechanism as long as the user credentials are handled by a proxy server
- Short-term proxies can then be used to start jobs, using the grid-proxy-init -hours <hours> command
- Register this proxy with the MyProxy server using myproxy-init -s <server> [-t <cred> -c <proxy>]
  - server is the server address (e.g. lxshare0375.cern.ch)
  - cred is the number of hours the proxy should be valid on the server
  - proxy is the number of hours renewed proxies should be valid
- The MyProxyServer is specified in the JDL file
- The proxy is automatically renewed by the WMS, without user intervention, for the whole life of the job
40
Further Information
- The EDG User’s Guide
- EDG WP1 Web site (in particular the WMS User & Admin Guide and the JDL docs)
- ClassAd
42
dg-job-submit myjob.jdl
Executable = "$(CMS)/exe/sum.exe";
InputData = "LF:testbed ";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
DataAccessProtocol = "gridftp";
InputSandbox = {"/home/user/WP1testC","/home/file*", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.Architecture == "INTEL" && other.OpSys== "LINUX Red Hat 6.2";
Rank = other.FreeCPUs;