BQS integration in gLite-CE
TCG meeting, CERN, 01/11/2006
Sylvain Reynaud, Fabio Hernandez
Context
• We have been running a BQS-backed computing element since the early days of DataGrid
  – BQS Information Provider
    • Maps BQS information data to the Glue Schema (LDIF)
  – bqs-jobmanager
    • Maps Globus commands to BQS commands
    • Maps job queues to “BQS classes”, requests AFS tokens for jobs needing them, archives job information, logs job information for accounting purposes, creates the BQS job wrapper, caches job status information…
• Currently trying to integrate BQS with gLite-CE
  – STEP 1: develop a “BLAH-to-Globus jobmanager” adapter
    • So that we can reuse the bqs-jobmanager currently in production with LCG-CE
  – STEP 2: develop a grid-neutral front-end to BQS and use it with several CEs (e.g. gLite-CE, CREAM, GT4 WS-GRAM)
• “We are here” (position marker on the slide)
BQS integration in LCG-CE
[Diagram. Components: UI, RB, Gatekeeper, BQS job-manager, BDII, BQS Information Provider, BQS, Local batch system, CE. Arrow label: Submit job. Legend: Provided, CC-IN2P3, To be done.]
BQS integration in gLite-CE (STEP 1)
[Diagram. Components: UI, WMS, Gatekeeper, fork job-manager, Condor-C, Blahpd, BLAH to Globus, BQS job-manager, BDII, BQS Information Provider, BQS, Local batch system, CE. Arrow labels: Submit job, Launch Condor-C. Legend: Provided, CC-IN2P3, To be done.]
Purpose of this presentation
• Provide feedback about the difficulties of integrating a new LRMS with gLite-CE
  – These difficulties are not specific to BQS
  – It is not impossible
  – …but it cannot be done efficiently!
Overview
• Difficulties
  – gLite-CE installation
  – Plug-in development
  – Plug-in testing
• BQS integration in CREAM
• Discussion
gLite-CE installation
• On a standard Scientific Linux
  – gLite and 3.0.1: solutions to most bugs were found in mailing-list archives
  – gLite update 6: almost no installation bugs left
• On our site-customized Scientific Linux
  – Customization related to: different releases of language interpreters (perl, python); modified environment variables
  – Sensitive to modifications of the execution environment
    • About 2/3 of the problems found were specific to this customization
  – Such problems were not observed with other software packages (e.g. GT4)
  – Some problems were hard to resolve (e.g. Globus fork-jobmanager script modified to set a specific and non-trivial order of directories in $PATH)
• It seems to work now (with PBS), but there may be remaining problems with untested features
  – Not yet re-tested with gLite update 6
Plug-in development
• BLAH expects 5 commands for interacting with the underlying LRMS
  – One per action (submit, status, cancel, hold, resume)
  – In the case of PBS and LSF, these commands are implemented as shell scripts
• Lack of complete documentation is not a big issue
  – The plug-ins provided for PBS and LSF are a good starting point
  – Following the job lifecycle through testing is also instructive for understanding the system
• But testing is the hard part (more on next slides)
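The five-command plug-in contract described on this slide can be sketched as a small check that every action has an executable script for a given LRMS. This is only an illustration: the `<lrms>_<action>.sh` naming, the `bin_dir` argument and both helper names are assumptions made for the example and are not taken from the slides or from BLAH documentation.

```python
"""Sketch: check that an LRMS provides one executable script per BLAH action."""
import os
from pathlib import Path
from typing import List

# The five actions named on the slide.
ACTIONS = ("submit", "status", "cancel", "hold", "resume")


def plugin_script(lrms: str, action: str, bin_dir: str) -> Path:
    """Return the expected script path for one action.

    ASSUMPTION: the "<lrms>_<action>.sh" naming and bin_dir are illustrative.
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return Path(bin_dir) / f"{lrms}_{action}.sh"


def missing_scripts(lrms: str, bin_dir: str) -> List[str]:
    """Return the actions for which no executable script is found."""
    return [
        action
        for action in ACTIONS
        if not os.access(plugin_script(lrms, action, bin_dir), os.X_OK)
    ]
```

Calling `missing_scripts("bqs", some_dir)` before starting the services would report an incomplete plug-in early, instead of leaving it to be discovered through a failed job.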
Plug-in testing (1/4)
CANNOT TEST EFFICIENTLY BECAUSE…
• The CE cannot be tested in standalone mode (without a WMS)
  – This adds complexity and many opportunities for job failures
  – We had to deploy a WMS locally
    • The WMS instances deployed on PPS were not stable enough (before summer)
    • We needed to understand where and why jobs fail
• Each job submission test takes too long to complete
  – Around 4’30” to execute a “hello-world” job on unloaded machines connected to the same LAN
  – 15’ for an abnormally ended job
=> No test can be done in less than 5 minutes!
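To put the timings on this slide in perspective, the sketch below estimates an upper bound on the number of sequential tests that fit in a working session. The two durations come from the slide. The 8-hour session, the failure ratio and the assumption that tests run strictly one after another are hypothetical inputs chosen for illustration.

```python
"""Sketch: upper bound on sequential submission tests per session."""

HELLO_WORLD_S = 4 * 60 + 30  # 4'30" for a successful "hello-world" job (slide)
ABNORMAL_END_S = 15 * 60     # 15' for an abnormally ended job (slide)


def max_tests(session_hours: float, failure_ratio: float) -> int:
    """Return how many tests fit in a session, run one after another.

    failure_ratio is the assumed fraction of tests that end abnormally.
    """
    if not 0.0 <= failure_ratio <= 1.0:
        raise ValueError("failure_ratio must be between 0 and 1")
    average_s = (1 - failure_ratio) * HELLO_WORLD_S + failure_ratio * ABNORMAL_END_S
    return int(session_hours * 3600 // average_s)
```

Under these assumptions an 8-hour session allows at most 106 tests if every job succeeds and only 32 if every job ends abnormally, before counting any time spent reading logs or restarting services.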
Plug-in testing (2/4)
• Some services (WMS, CE) sometimes fail to start, start incorrectly or stop working
  – (NOT security-related problems such as time synchronization, CRL & gridmap file updates)
  – This occurs after a configuration change or a simple service restart => restart the relevant services several times in different orders
  – Sometimes we are unable to get back to a working configuration (even by restoring the original values) => reinstalling is the fastest solution
• We haven’t been able to deactivate the automatic retry of jobs
  – (setting RetryCount/ShallowRetryCount to 0 in the JDL does not do it)
  – The lifecycle of failed jobs takes longer to complete
  – Previously failed jobs continue to pollute the CE log files
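For reference, the retry setting mentioned on this slide corresponds to a JDL like the one rendered by the sketch below. The executable, arguments and file names are placeholders chosen for illustration, and the helper name is hypothetical. As reported on the slide, setting both counters to 0 did not deactivate retries in practice.

```python
"""Sketch: render a minimal hello-world JDL with both retry counters set."""


def hello_world_jdl(retry_count: int = 0, shallow_retry_count: int = 0) -> str:
    """Return the JDL text; the job attributes are illustrative placeholders."""
    lines = [
        'Executable = "/bin/echo";',
        'Arguments = "hello world";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
        # The two attributes named on the slide.
        f"RetryCount = {retry_count};",
        f"ShallowRetryCount = {shallow_retry_count};",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(hello_world_jdl(), end="")
```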
Plug-in testing (3/4)
• Job cancellation often does not work
  – The glite-job-cancel command always returns “request has been successfully submitted”, but often has no effect on the job
  – We don’t know how to get the WMS & CE back to a “clean” state
• The first submitted job almost always fails
  – No longer systematic with the latest release, but still very frequent
  – We often face this situation because the development phase implies frequent configuration changes, which often require restarting the gLite services
Plug-in testing (4/4)
• Hard to find the cause of failures
  – Many silent failures or useless messages
    • "The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE".
  – The command “glite-job-logging-info -v 2” often does not help to understand why the job has been retried for 900 seconds
  – We need to follow the job's life through the log files, but they are dispersed, and some are ephemeral (they disappear too quickly)
    • Several log files per component: Globus gatekeeper, Globus job-manager, Condor-C (ephemeral logs), BLAH (ephemeral logs), GridManager, …
    • Several directories contain logs: /var/log, $HOME, /tmp, …
  – No error detection when the LRMS-specific BLAH scripts return unexpected output
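One possible way to cope with dispersed and ephemeral logs is to copy recently modified files from every log directory into a single place while a test is running. The sketch below is a hypothetical helper, not a gLite tool: the function name, the 900-second age window and the destination layout are all assumptions, and the destination should live outside the scanned directories.

```python
"""Sketch: snapshot recently modified log files before they disappear."""
import shutil
import time
from pathlib import Path


def snapshot_logs(log_dirs, dest, max_age_s=900, now=None):
    """Copy files modified within max_age_s seconds into dest.

    Files are stored under dest/<directory name>/<relative path>, so two
    scanned directories with the same base name would share a subfolder.
    Returns the list of copied paths.
    """
    now = time.time() if now is None else now
    dest = Path(dest)
    copied = []
    for log_dir in map(Path, log_dirs):
        if not log_dir.is_dir():
            continue
        for path in log_dir.rglob("*"):
            try:
                if path.is_file() and now - path.stat().st_mtime <= max_age_s:
                    target = dest / log_dir.name / path.relative_to(log_dir)
                    target.parent.mkdir(parents=True, exist_ok=True)
                    shutil.copy2(path, target)
                    copied.append(target)
            except OSError:
                # The file vanished or is unreadable: skip it and keep going.
                continue
    return copied
```

Run in a loop during a submission test (for example against /var/log and a per-test destination directory), such a snapshot keeps a copy of logs that would otherwise be gone by the time the failure is noticed.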
BQS integration in CREAM
• Currently exploring the integration of BQS with CREAM
  – Have just started installing CREAM with PBS (27/10/2006)
• CREAM installation (ongoing)
  – Not yet automated, but not sensitive to modifications of the execution environment
• Plug-in development (not started yet)
  – STEP 1: Implementing a “BLAH Log Parser” is required => reusing code developed for LCG-CE may require modifications
  – STEP 2: Develop a CREAM connector for BQS
• Plug-in testing (not started yet)
  – Seems to have none of the previously mentioned difficulties
• Thanks to Massimo Sgaravatto for providing early access to CREAM for gLite 3.1
BQS integration in CREAM (STEP 1)
[Diagram. Components: ICE, CREAM, CEMon, BLAH connector, Blahpd, BLAH to Globus, BQS job-manager, BLAH Log Parser, ???, BQS Information Provider, BQS, Local batch system, CE. Arrow label: Submit job. Legend: Provided, CC-IN2P3, To be done.]
BQS integration in CREAM (STEP 2)
[Diagram. Components: ICE, CREAM, CEMon, BLAH connector, BQS connector, BQS grid-neutral front-end, BQS Information Provider, BQS, Local batch system, CE. Arrow label: Submit job. Legend: Provided, CC-IN2P3, To be done.]
Discussion
• Are there tips for working more efficiently with the WMS and gLite-CE components?
  – How to configure WMS/gLite-CE to reduce the time to complete?
  – How to deactivate the automatic retry of jobs?
• What is the recommended way to proceed?
  – Will the next releases of gLite-CE provide some answers to the problems reported in this talk?
  – Should we instead concentrate on the BQS integration with CREAM? (our preferred way)
• Will the WMS support CREAM before support for LCG-CE is dropped?
  – As a site, will we have to support both gLite-CE and CREAM?
• Is there any plan to drop support for LCG-CE in the near future?