Padova site report
JRA1 IT-CZ cluster meeting, Milano, May 3-4, 2004
People
- Paolo Andreetto
- Stefano Borgia (part-time)
- Alvise Dorigo (from mid-May 2004)
- Alessio Gianelle
- Matteo Mordacchini (PhD student)
- Massimo Sgaravatto
- Luigi Zangrando
- Mister (hopefully Miss) X (on-going recruitment)
On-going activities
- Getting to grips with EDG WMS code (in particular JC & LM): Paolo
- Rel. 3 (DAGMan) testing activities: Alessio
- Support of existing LCG-2 code, if/when needed: Alessio (shortly also Paolo)
- Working on the resource access problem (CE): Luigi, Massimo, Matteo, Stefano
Other activities for the near future
- "Porting" the existing Job Submission components to EGEE CVS according to the new SCM
- JS reengineering
  - Improvements on submission to LCG-2 CE
    - Some of the improvements were already identified at the last meeting
    - Improvements on MPI support
    - See if the outbound connectivity requirement with the LCG-2 CE can be removed
    - …
  - Move to submission to EGEE CE
- Guardian Angels of NS, WM, I-S reengineering
Activities concerning the CE problem
- Analysis of existing technologies and tools
  - Web service & WSRF specifications
  - DRMAA: GGF proposal for a common API to different LRMSs
  - Globus GRAM
    - Globus 3.2 GRAM doc and code analyzed
    - Globus team already contacted to learn the plans and timelines for the new GRAM, which will be in Globus v. 4 (in order to plan its evaluation)
    - Already got a document explaining the architecture, changes wrt the previous GRAM, etc.
    - Phone conference planned to discuss these items
  - AliEn
    - Available (very limited) documentation studied
    - Some studying of the code
    - Plan to install and evaluate it
Activities concerning the CE
- Specifying requirements and expected functionality for the planned CE
- Specifying APIs
- Designing the architecture for the planned CE
- All these ideas are collected in a work-in-progress document
  - /workload/papers/CE @ infnforge CVS
  - http://www.pd.infn.it/grid/jra1/CE/ce.ps
  - This could be our contribution to the EGEE mw architecture document (CE section)
  - The ideas are preliminary, but we are ready to receive feedback
Req.s & expected functionality: Environment
- In EDG a CE was an LRMS queue, which had to encompass only homogeneous WNs
  - Sysadmins in particular don't like this
- For EGEE we propose that a CE be a site cluster, managed by an LRMS, encompassing heterogeneous resources, where multiple LRMS queues (which usually define policies on resource usage) can exist
Req.s & expected functionality: Interface with the LRMS
- Very well specified interface with the underlying LRMS
- The interface with a specific LRMS is implemented as a pluggable module, easily replaced by another one supporting a different LRMS
- We plan to implement the interface with PBS, LSF, Condor (?)
- It should be possible and easy to implement the interface with other LRMSs
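The pluggable-module idea can be sketched as an abstract interface plus a registry of implementations. This is only an illustration: the names `LRMSAdapter`, `register`, `load_adapter` and `InMemoryLRMS` are assumptions made up for this sketch, and the stand-in adapter does not talk to PBS, LSF or Condor.

```python
from abc import ABC, abstractmethod


class LRMSAdapter(ABC):
    """Hypothetical interface that every LRMS plug-in would implement."""

    @abstractmethod
    def submit(self, script: str) -> str:
        """Hand a job script to the LRMS and return its local job id."""

    @abstractmethod
    def cancel(self, local_id: str) -> None:
        """Remove the job from the LRMS."""

    @abstractmethod
    def status(self, local_id: str) -> str:
        """Return the LRMS-level status of the job."""


_ADAPTERS = {}  # plug-in name -> adapter class


def register(name):
    """Class decorator that makes an adapter selectable by name."""
    def wrap(cls):
        _ADAPTERS[name] = cls
        return cls
    return wrap


def load_adapter(name) -> LRMSAdapter:
    """Instantiate the plug-in registered under `name`."""
    try:
        return _ADAPTERS[name]()
    except KeyError:
        raise ValueError(f"no LRMS plug-in registered for {name!r}") from None


@register("fake")
class InMemoryLRMS(LRMSAdapter):
    """Stand-in for illustration only; it keeps jobs in a dictionary."""

    def __init__(self):
        self._jobs = {}

    def submit(self, script):
        local_id = f"fake.{len(self._jobs) + 1}"
        self._jobs[local_id] = "QUEUED"
        return local_id

    def cancel(self, local_id):
        self._jobs[local_id] = "CANCELLED"

    def status(self, local_id):
        return self._jobs[local_id]
```

With this layout, supporting another LRMS would mean adding one more registered class, while the rest of the CE only calls `load_adapter(name)`.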
Req.s & expected functionality: Network connectivity
- SA1 wants neither inbound nor outbound connectivity on WNs
- In the ARDA middleware document a Site Proxy service is planned to route messages from/to WNs
- Who is responsible for designing and implementing such a service?
Req.s & expected functionality: Main functionality
- Job management
  - Available to end-users and other Grid services (e.g. the "RB")
  - As a Web Service
  - Push and pull model
    - The architecture for the pull model must be discussed and agreed in a wider context (it is not a problem restricted to the CE)
    - Some of the issues to be clarified
      - When should a CE notify that it is willing to receive jobs? It could be "available" only for some kinds of jobs, with some specific requirements, belonging to some specific users
      - Who should be notified?
      - …
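The difference between the two models can be illustrated with a toy example. It is a sketch under strong assumptions: the `TaskQueue` class, the `accepts` predicate and the job dictionaries are hypothetical, and the open questions listed above (when the CE announces itself, and to whom) are deliberately left out.

```python
from collections import deque


class TaskQueue:
    """Hypothetical Grid-level queue, used only by the pull model."""

    def __init__(self):
        self._jobs = deque()

    def put(self, job):
        self._jobs.append(job)

    def take(self, accepts):
        """Return the first queued job the CE is willing to accept, or None."""
        for job in self._jobs:
            if accepts(job):
                self._jobs.remove(job)
                return job
        return None


class CE:
    def __init__(self, name):
        self.name = name
        self.jobs = []

    def job_submit(self, job):
        """Push model: the client (or the RB) decides and sends the job."""
        self.jobs.append(job)

    def pull_from(self, queue, accepts):
        """Pull model: the CE asks for a job matching what it is willing to run."""
        job = queue.take(accepts)
        if job is not None:
            self.jobs.append(job)
        return job
```

The `accepts` predicate stands in for the idea that a CE could be available only for some kinds of jobs or some users.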
Req.s & expected functionality: Job management operations
- Submit jobs
- Evaluate job execution
  - Are there matching resources for this JDL?
  - If so, what is the expected quality of service (e.g. the Estimated Traversal Time)?
- Remove jobs
- Suspend/resume jobs
- Get job status
- Get job outputs
- Get notifications
  - E.g. when a job changes status, when a job reaches a certain status, etc.
Req.s & expected functionality: Job types
- Sequential, batch jobs (as in EDG)
- Parallel (MPI) jobs (as in EDG)
- Checkpointable jobs (as in EDG)
- Interactive jobs (as in EDG)
- DAG jobs (as in EDG)
  - DAGs whose nodes have to be planned and executed within the CE
- Partitionable jobs (as in EDG)?
  - Jobs to be partitioned within the CE
Req.s & expected functionality: Other functionality
- Provision of CE characteristics and status
  - E.g. how many and which resources are there in the CE? How many active jobs are there? …
  - Still to be decided: which information to provide and which interface to use
    - APIs and/or information published to an Information Service
- Grid accounting sensors
  - To report on job resource usage
  - To be integrated with the EGEE (DGAS?) accounting system
- …
Req.s & expected functionality: Security (Authentication & Authorization)
- It is not clear yet what JRA3 is going to provide
  - Recommendations?
  - Software?
  - …
Req.s & expected functionality: Need to talk with other "Grid systems"
- Which ones?
  - GRAM, Condor-G, …
- At which level should these interfaces be implemented?
  - Should these Grid systems be considered as LRMSs?
  - It has been suggested that we consider the interface at a higher level
    - An EGEE CE able to "understand" GRAM SOAP messages, Condor-G SOAP messages, etc., and able to speak these protocols
    - We need to understand whether this is feasible (not only from a technical point of view)
CE Architecture
[Figure: CE architecture diagram. Labels: Client, JDL, jobAssess, jobSubmit, QoS, WEB (Web service interfaces), CE, JC, JM, UC, WN, DRMAA??, LSF, PBS, ?, WNs; operations getWN, insertWN, deleteWN, updateWN, getUC, createUC, deleteUC, updateUC.]
A client could:
1) ask the CE whether a job could be executed and what the expected QoS is (e.g. ETT)
2) submit a job
3) query the CE to get its characteristics and status (and/or should this info be published to an IS?)
The CE matches the job requirements against the available resources and computes the expected QoS.
CE Architecture
[Figure: CE architecture diagram, job submission and control. Labels: Client, JDL, job, submit, notify, JC URL, Job status, WEB (Web service interfaces), CE, JC, JM, UC, WN, DRMAA??, LSF, PBS, ?, WNs; client operations jobKill, jobSuspend, jobResume, jobGetStatus, jobGetOutput, jobSignal, jobMonitorSub, jobAssess, jobSubmit; internal operations getWN, insertWN, deleteWN, updateWN, getUC, createUC, deleteUC, updateUC.]
The CE checks whether the client already has a UserContext (UC), and creates or updates the UC accordingly.
API specification
- jobAssess
- jobSubmit
- jobSuspend / jobResume
- jobList
- jobKill
- jobGetStatus / jobGetAllStatus
- jobGetOutput
- jobMonitorSub
- jobSignal
API specification
jobAssess
Description: Checks whether the job specified in the JDL could run on the CE, by matching the job requirements against the available resources. If the job can actually run on the worker nodes of the CE, it returns an estimate of the expected QoS (e.g. the waiting time in the local queue before the job can run).
jobSubmit
Description: Submits the job specified in the JDL to the CE.
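A minimal sketch of what a jobAssess-style check could look like: filter the worker nodes that satisfy the job requirements, then report the best estimated wait. The `WorkerNode` and `Requirements` fields and the `queued_seconds` estimate are assumptions for illustration; the slides do not define how requirements are expressed in the JDL or how the QoS is computed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkerNode:
    name: str
    os: str
    memory_mb: int
    queued_seconds: int  # assumed estimate of the work already queued for this node


@dataclass
class Requirements:
    os: str
    min_memory_mb: int


def job_assess(req: Requirements, nodes: List[WorkerNode]) -> Optional[int]:
    """Return the smallest estimated wait (seconds) among matching nodes.

    None means that no worker node of the CE satisfies the requirements.
    """
    waits = [
        n.queued_seconds
        for n in nodes
        if n.os == req.os and n.memory_mb >= req.min_memory_mb
    ]
    return min(waits) if waits else None
```

Returning `None` keeps "cannot run here" distinct from "can run with zero wait", which a client would need in order to choose between CEs.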
API specification
jobSuspend
Description: Suspends the execution of the specified job(s), or holds the job(s) in the local queue.
jobResume
Description: Resumes the execution of the specified job(s), or releases the job(s) in the local queue.
jobKill
Description: Kills one or more jobs.
jobList
Description: Retrieves the list of the jobIDs of the jobs submitted by the user.
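The suspend/hold and resume/release pairs can be read as transitions on a per-job state. The sketch below assumes a state set (QUEUED, HELD, RUNNING, SUSPENDED, KILLED) and a `JobTable` class that are not part of the specification; they only illustrate how the four calls relate.

```python
class JobError(Exception):
    """Raised when an operation is not allowed in the job's current state."""


# Assumed transitions: queued jobs are held/released, running ones suspended/resumed.
_TRANSITIONS = {
    "suspend": {"QUEUED": "HELD", "RUNNING": "SUSPENDED"},
    "resume": {"HELD": "QUEUED", "SUSPENDED": "RUNNING"},
    "kill": {s: "KILLED" for s in ("QUEUED", "HELD", "RUNNING", "SUSPENDED")},
}


class JobTable:
    def __init__(self):
        self._jobs = {}  # job id -> (owner, state)

    def add(self, job_id, owner, state="QUEUED"):
        self._jobs[job_id] = (owner, state)

    def state(self, job_id):
        return self._jobs[job_id][1]

    def _apply(self, op, job_ids):
        # Jobs are processed in order; an invalid one stops the loop with JobError.
        for job_id in job_ids:
            owner, state = self._jobs[job_id]
            if state not in _TRANSITIONS[op]:
                raise JobError(f"cannot {op} job {job_id} in state {state}")
            self._jobs[job_id] = (owner, _TRANSITIONS[op][state])

    def job_suspend(self, *job_ids):
        self._apply("suspend", job_ids)

    def job_resume(self, *job_ids):
        self._apply("resume", job_ids)

    def job_kill(self, *job_ids):
        self._apply("kill", job_ids)

    def job_list(self, owner):
        """Return the ids of all jobs submitted by `owner`."""
        return [j for j, (o, _) in self._jobs.items() if o == owner]
```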
API specification
jobGetOutput
Description: Allows the user to retrieve the final results of the execution of the specified job(s).
jobGetStatus
Description: Retrieves the status of the specified job(s).
jobSignal
Description: Sends a signal to the specified job(s).
jobMonitorSub
Description: Allows the user to subscribe to the asynchronous notification system (JM) of the CE (e.g. to be notified about job status changes).
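The subscription call could follow a simple publish/subscribe pattern, covering both "any status change" and "a certain status reached". This is a sketch only: the `JobMonitor` class, the callback signature and the `on_status` parameter are assumptions, and a real CE would deliver notifications over the network rather than through in-process callbacks.

```python
from collections import defaultdict


class JobMonitor:
    def __init__(self):
        # job id -> list of (wanted status or None, callback)
        self._subs = defaultdict(list)
        self._status = {}

    def job_monitor_sub(self, job_id, callback, on_status=None):
        """Subscribe to every change (on_status=None) or to one given status."""
        self._subs[job_id].append((on_status, callback))

    def job_get_status(self, job_id):
        return self._status[job_id]

    def set_status(self, job_id, status):
        """Record a status and notify subscribers only if it actually changed."""
        if self._status.get(job_id) == status:
            return
        self._status[job_id] = status
        for wanted, callback in self._subs[job_id]:
            if wanted is None or wanted == status:
                callback(job_id, status)
```

Polling with `job_get_status` and subscribing with `job_monitor_sub` read the same stored status, so the two views cannot disagree.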