BQS integration in gLite-CE
TCG meeting, CERN, 01/11/2006
Sylvain Reynaud, Fabio Hernandez

Context

We have been running a BQS-backed computing element since the early days of DataGrid
  – BQS Information Provider
      Maps BQS information to the GLUE Schema (LDIF)
  – bqs-jobmanager
      Maps Globus commands to BQS commands
      Maps job queues to “BQS classes”, requests AFS tokens for jobs that need them, archives job information, logs job information for accounting purposes, creates the BQS job wrapper, caches job status information…
Currently trying to integrate BQS into gLite-CE
  – STEP 1: develop a “BLAH-to-Globus jobmanager” adapter
      So that we can reuse the bqs-jobmanager currently in production with LCG-CE
  – STEP 2: develop a grid-neutral front-end to BQS and use it with several CEs (e.g. gLite-CE, CREAM, GT4 WS-GRAM)
[Slide callout: “We are here”]

BQS integration in LCG-CE

[Diagram. Labels: UI, RB, Gatekeeper, BQS job-manager, BQS Information Provider, BDII, CE, Local batch system, BQS. Arrow label: Submit job. Legend: Provided / CC-IN2P3 / To be done.]

BQS integration in gLite-CE (STEP 1)

[Diagram. Labels: UI, WMS, Gatekeeper, Condor-C, Blahpd, fork job-manager, BLAH to Globus, BQS job-manager, BQS Information Provider, BDII, CE, Local batch system, BQS. Arrow labels: Submit job, Launch Condor-C. Legend: Provided / CC-IN2P3 / To be done.]

Purpose of this presentation

Provide feedback about the difficulties of integrating a new LRMS into gLite-CE
  – These difficulties are not specific to BQS
  – The integration is not impossible
  – …but it cannot be done efficiently!

Overview

Difficulties
  – gLite-CE installation
  – Plug-in development
  – Plug-in testing
BQS integration in CREAM
Discussion

gLite-CE installation

On a standard Scientific Linux
  – gLite and 3.0.1: solutions to most bugs were found in mailing-list archives
  – gLite update 6: almost no installation bugs left
On our site-customized Scientific Linux
  – Customization consists of
      different releases of language interpreters (perl, python)
      modified environment variables
  – Installation is sensitive to modifications of the execution environment
      About 2/3 of the problems found were specific to this customization
  – Such problems were not observed with other software packages (e.g. GT4)
  – Some problems were hard to resolve (e.g. Globus fork-jobmanager script modified to set a specific and non-trivial order of directories in $PATH)
It seems to work now (with PBS), but there may be some remaining problems with untested features
  – Not yet re-tested with gLite update 6

Plug-in development

BLAH expects 5 commands for interacting with the underlying LRMS
  – One per action (submit, status, cancel, hold, resume)
  – In the case of PBS and LSF, these commands are implemented as shell scripts
Lack of complete documentation is not a big issue
  – The provided plug-ins for PBS and LSF are a good starting point
  – Following the job lifecycle through testing is also instructive for understanding the system
But testing is the hard part (more on next slides)
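The "one command per action" contract can be illustrated with a minimal Python sketch that checks whether a plug-in for a new LRMS covers all five actions. The `<lrms>_<action>.sh` naming pattern and the helper names are assumptions made for illustration only; the slides do not state how BLAH locates its scripts.

```python
from typing import Dict, List

# The five actions named on the slide, one command per action.
BLAH_ACTIONS = ("submit", "status", "cancel", "hold", "resume")


def plugin_script_names(lrms: str) -> Dict[str, str]:
    """Return one script name per action for the given LRMS prefix.

    The "<lrms>_<action>.sh" pattern is an assumption of this sketch.
    """
    return {action: f"{lrms}_{action}.sh" for action in BLAH_ACTIONS}


def missing_actions(lrms: str, available: List[str]) -> List[str]:
    """List the actions whose script is absent from `available`."""
    expected = plugin_script_names(lrms)
    present = set(available)
    return [action for action in BLAH_ACTIONS if expected[action] not in present]
```

For example, `missing_actions("bqs", ["bqs_submit.sh", "bqs_status.sh"])` reports that the cancel, hold and resume commands still have to be written.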

Plug-in testing (1/4)

CANNOT TEST EFFICIENTLY BECAUSE…

The CE cannot be tested in standalone mode (without a WMS)
  – This adds complexity and many opportunities for job failures
  – We had to deploy a WMS locally
      The WMS instances deployed on the PPS were not stable enough (before summer)
      We needed to understand where and why jobs fail
Each job submission test takes too long to complete
  – Around 4’30” to execute a “hello-world” job on unloaded machines connected to the same LAN
  – 15’ for an abnormally ended job
=> No test can be done in less than 5 minutes!

Plug-in testing (2/4)

Some services (WMS, CE) sometimes fail to start, start incorrectly, or stop working
  – (NOT security-related problems: time synchronization, CRL & gridmap file updates)
  – They occur after a configuration change or a simple service restart
      => restart the relevant services several times in different orders
  – Sometimes we are unable to get back to a working configuration (even by restoring the original values)
      => reinstalling is the fastest solution
We have not been able to deactivate the automatic retry of jobs
  – (setting RetryCount/ShallowRetryCount to 0 in the JDL does not do it)
  – The lifecycle of failed jobs takes longer to complete
  – Previously failed jobs continue to pollute the CE log files

Plug-in testing (3/4)

Job cancellation often does not work
  – The glite-job-cancel command always returns “request has been successfully submitted”, but often has no effect on the job
  – We don’t know how to get the WMS & CE back to a “clean” state
The first submitted job almost always fails
  – No longer systematic with the latest release, but still very frequent
  – We often face this situation because the development phase implies frequent configuration changes, which often require restarting the gLite services

Plug-in testing (4/4)

Hard to find the cause of failures
  – Many silent failures or useless messages, e.g.
      "The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE".
  – The command “glite-job-logging-info -v 2” often does not help to understand why the job has been retried for 900 seconds
  – Need to follow the life of the job by looking at the log files, but they are dispersed, and some are ephemeral (they disappear too quickly)
      Several log files per component: Globus gatekeeper, Globus job-manager, Condor-C (ephemeral logs), BLAH (ephemeral logs), GridManager, …
      Several directories contain logs: /var/log, $HOME, /tmp, …
  – No error detection when the LRMS-specific BLAH scripts return unexpected output
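One possible workaround for dispersed and ephemeral logs is to copy them into a single timestamped folder right after each test. The Python sketch below shows the idea; it is not a tool used by the authors. The directories to scan would be the ones listed on the slide (/var/log, $HOME, /tmp), while the function name and the file patterns are hypothetical and have to be adapted to the actual components.

```python
import shutil
import time
from pathlib import Path
from typing import Iterable, List


def snapshot_logs(log_dirs: Iterable[Path], patterns: Iterable[str], dest: Path) -> List[Path]:
    """Copy files matching `patterns` in each of `log_dirs` into dest/<timestamp>/.

    Returns the list of copies that were made. Directories that do not
    exist and files that vanish during the copy are skipped silently.
    """
    target = dest / time.strftime("%Y%m%d-%H%M%S")
    wanted = list(patterns)
    copied: List[Path] = []
    seen = set()
    for index, log_dir in enumerate(log_dirs):
        if not log_dir.is_dir():
            continue
        # One sub-folder per source directory, so equal file names do not clash.
        out_dir = target / f"{index}_{log_dir.name}"
        for pattern in wanted:
            for src in sorted(log_dir.glob(pattern)):
                if src in seen or not src.is_file():
                    continue
                seen.add(src)
                out_dir.mkdir(parents=True, exist_ok=True)
                try:
                    shutil.copy2(src, out_dir / src.name)
                except OSError:
                    continue  # the ephemeral file disappeared or is unreadable
                copied.append(out_dir / src.name)
    return copied
```

A call such as `snapshot_logs([Path("/var/log"), Path.home(), Path("/tmp")], ["*.log"], Path("snapshots"))` would be run immediately after a job submission test; the `"*.log"` pattern is only a placeholder.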

BQS integration in CREAM

Currently exploring the integration of BQS into CREAM
  – Have just started installing CREAM with PBS (27/10/2006)
CREAM installation (ongoing)
  – Not yet automated, but not sensitive to modifications of the execution environment
Plug-in development (not started yet)
  – STEP 1: Implementing a “BLAH Log Parser” is required
      => reusing code developed for LCG-CE may require modifications
  – STEP 2: Develop a CREAM connector for BQS
Plug-in testing (not started yet)
  – Seems to have none of the previously mentioned difficulties
Thanks to Massimo Sgaravatto for providing early access to CREAM for gLite 3.1

BQS integration in CREAM (STEP 1)

[Diagram. Labels: ICE, CREAM, CEMon, Blahpd, BLAH connector, BLAH to Globus, BQS job-manager, BQS BLAH Log Parser (marked “???”), BQS Information Provider, CE, Local batch system, BQS. Arrow label: Submit job. Legend: Provided / CC-IN2P3 / To be done.]

BQS integration in CREAM (STEP 2)

[Diagram. Labels: ICE, CREAM, CEMon, BLAH connector, BQS connector, BQS grid-neutral front-end, BQS Information Provider, CE, Local batch system, BQS. Arrow label: Submit job. Legend: Provided / CC-IN2P3 / To be done.]

Discussion

Are there tips for working more efficiently with the WMS and gLite-CE components?
  – How to configure the WMS/gLite-CE to reduce the time jobs take to complete?
  – How to deactivate the automatic retry of jobs?
What is the recommended way to proceed?
  – Will the next releases of gLite-CE provide some answers to the problems reported in this talk?
  – Should we instead concentrate on the BQS integration into CREAM? (our preferred way)
Will the WMS support CREAM before support for LCG-CE is dropped?
  – As a site, will we have to support both gLite-CE and CREAM?
Is there any plan to drop support for LCG-CE in the near future?