LCG and gLite open issues, Massimo Sgaravatto, INFN Padova

Presentation transcript:

LCG and gLite open issues
Massimo Sgaravatto, INFN Padova
JRA1 IT-CZ cluster meeting, December 14-15, 2004
www.eu-egee.org
EGEE is a project funded by the European Union under contract IST-2003-508833

How to manage LCG and gLite bugs
- LCG and gLite bugs are “managed” in different ways (see the next slides)
- LCG and gLite both use Savannah, so it is easy to get confused
- Check whether the bug web page title starts with “LCG” or “JRA1 middleware” to tell them apart

How to deal with LCG bugs
- I (and usually also Pacio) receive the LCG bug notifications
- I then CC (via Savannah) the relevant person(s)
- The relevant persons are expected:
  - to attach the fix, or to provide feedback on the patch already attached by LCG
  - to commit the same patch to our CVS(s), if applicable
- Don’t change the bug status in Savannah

How to deal with gLite bugs
- wp1-help@infn.it receives the gLite bug notifications for CE, WMS, Accounting and LB
- As far as I understand, security bugs are assigned to JRA3; J. Hahkala then assigns the VOMS server ones to Valerio/Vincenzo
- Now *all* bugs are also notified to the Iteam mailing list
  - This is going to change: all bugs will instead be notified to project-eu-egee-glite-bugs@cern.ch
  - See: http://listboxservices.web.cern.ch/listboxservices/Help/?kbid=010012
- You are expected to change the status in Savannah for “your” bugs
  - None → Accepted → In progress
  - In progress → Ready for integration (when fixed in CVS)
  - Don’t close the bug! (this is up to the testing team or to the person who opened the bug)
- Let D. Smith know about the problem (and the proper fix) if it also applies to LCG
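
The status rules on this slide amount to a small set of allowed transitions. The following is a minimal Python sketch of them, not a Savannah feature: the names `ALLOWED` and `can_change_status` are hypothetical, and the status label "Closed" is an assumption used only to represent "closing the bug".

```python
# Hypothetical helper: encodes the status changes a developer is expected
# to make for "their" gLite bugs. It is not part of Savannah.
ALLOWED = {
    "None": {"Accepted"},
    "Accepted": {"In progress"},
    "In progress": {"Ready for integration"},  # only once the fix is in CVS
}


def can_change_status(current: str, new: str, fixed_in_cvs: bool = False) -> bool:
    """Return True if a developer may move a bug from `current` to `new`."""
    if new == "Closed":
        # Assumed label. Closing is left to the testing team or to the
        # person who opened the bug, never to the developer.
        return False
    if new == "Ready for integration" and not fixed_in_cvs:
        return False
    return new in ALLOWED.get(current, set())
```

For example, `can_change_status("In progress", "Ready for integration", fixed_in_cvs=True)` is allowed, while any attempt to move a bug to "Closed" is refused.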

LCG problems hopefully already addressed
- The bugs below are still open in the LCG Savannah, but they have already been addressed
  - Patches provided (by us, or by LCG)
  - Still open because the patches are under test or still to be tested
- #2716, #3252, #3546, #3807, #3848, #3883, #3884, #3895, #3896, #3900, #3916, #4009, #4047, #4070, #4098, #4109, #4126, #4127, #4144, #4378, #4836, #4891, #4909, #5237, #5238, #5244, #5261, #5269, #5274, #5427, #5471, #5488, #5575

LCG issues not addressed yet
- #3302: On a RB+SE node there is a GridFTP problem
  - Asked LCG for clarifications: no answer
  - Not considered a high-priority problem
- #3671: To drain an RB
  - They would like to be able to disallow new submissions while still allowing the other commands
  - Not addressed yet: we only suggested, as a trick, setting MaxInputSandboxSize=0
  - This doesn’t work for jobs without an ISB
- #3724: LogMonitor should be resilient to full file system
  - Still to be understood why irepository.dat could not be recovered
  - Not investigated further so far
- #3808: NetworkServer must log from which UI the job was submitted
  - A patch was provided, but it logs the UI address and the user DN in *separate* messages (and it is not possible to connect them unambiguously)
  - Asked whether they could use the LB info instead: no answer
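
The limitation of the MaxInputSandboxSize=0 trick for #3671 can be illustrated with a short Python sketch. It assumes the limit is enforced as a plain size comparison applied only when an input sandbox exists; the slides do not say how the NS implements the check. Both function names and the explicit `draining` flag are hypothetical.

```python
from typing import Optional


def accepts_with_size_trick(isb_size: Optional[int], max_isb_size: int) -> bool:
    """Assumed size check: reject only sandboxes larger than the limit."""
    if isb_size is None:
        # The job has no input sandbox, so there is nothing to compare:
        # the submission goes through even when the limit is 0.
        return True
    return isb_size <= max_isb_size


def accepts_with_drain_flag(command: str, draining: bool) -> bool:
    """Hypothetical explicit drain mode: refuse submissions, allow the rest."""
    if draining and command == "submit":
        return False
    return True
```

Under these assumptions, a limit of 0 blocks a job with a 1 KB sandbox but not a job without one, whereas a dedicated drain flag refuses every submission and leaves the other commands (status, cancel, output retrieval) untouched.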

LCG issues not addressed yet
- #3871: edg-wl-bkserverd: Terminating after 500 connections
  - 'event_store_recover': likely an inter-thread locking bug, which must be investigated
- #4319: Suggestion for change of policy for resubmitted jobs
  - Basically they (D. Smith) think that if the job doesn’t even start its execution on a WN, this should not be counted as a (re)submission
  - Fix applied by David Smith, under test:
    - Logging of “running” by the LRMS as a priority event, with the return code checked: if the logging doesn't appear successful, the job script returns an indication of the error in the output/maradona and exits without starting the job
    - No events logged by the LRMS → the job didn’t start → a shallow resubmission can be performed
    - The maximum number of resubmissions of this new type per job is a broker-side configuration option
    - The new resubmissions won't be done if doing so would send the job back to a previously tried destination
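
The shallow-resubmission rules described for #4319 can be summarised as one decision function. This is a Python sketch of the three conditions listed on the slide, not the code of the fix under test; `JobState`, its fields and `can_shallow_resubmit` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class JobState:
    """Hypothetical per-job record kept on the broker side."""
    lrms_events: List[str] = field(default_factory=list)  # events logged by the LRMS
    shallow_count: int = 0                                 # shallow resubmissions so far
    tried_destinations: Set[str] = field(default_factory=set)


def can_shallow_resubmit(job: JobState, candidate: str, max_shallow: int) -> bool:
    """Apply the three conditions from the slide, in order."""
    if job.lrms_events:
        # Something was logged by the LRMS, so the job may have started.
        return False
    if job.shallow_count >= max_shallow:
        # Broker-side limit on this type of resubmission reached.
        return False
    if candidate in job.tried_destinations:
        # Never send the job back to a destination that was already tried.
        return False
    return True
```

Here `max_shallow` stands in for the broker-side configuration option mentioned on the slide; its real name is not given there.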

LCG issues not addressed yet
- #4894: NS can become unresponsive during dialogue with client
  - Marco agreed with D. Smith to review that part of the code
- #4570: Multiple cancel requests can crash WM (and possibly PR)
  - Addressed for PR
  - For WM already discussed (it would require major modifications)
- #5347: FD limit for LM
  - D. Smith raised the system hard limit on file descriptors for the LM (to 16384) because of the large number of condorG log files (and associated state files)
  - This was not sufficient: at some points in the code (e.g. in dgssl.c) select()s are done on fd sets of type 'fd_set', which are only large enough for 1024 descriptors
- #5351: WMS uninitialised variable
  - Noticed possible use of an uninitialised variable in JobControllerReal::cancel, but there is no indication that it was causing problems
  - Waiting for further information
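
The problem behind #5347 is that select() works on fixed-size fd_set bitmaps (FD_SETSIZE, commonly 1024), so raising the process limit alone does not help. poll() takes a list of descriptors instead and has no such fixed bound. The sketch below illustrates the poll()-based pattern in Python on a Unix system; it is not the change made to dgssl.c, and `wait_readable` is a hypothetical helper.

```python
import os
import select


def wait_readable(fds, timeout_ms=1000):
    """Return the descriptors in `fds` that are readable, using poll().

    Unlike select(), poll() is not limited to descriptor numbers below
    FD_SETSIZE, so it keeps working when many files are open.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    return [fd for fd, event in poller.poll(timeout_ms) if event & select.POLLIN]


if __name__ == "__main__":
    # Small demonstration with a pipe: readable only after data is written.
    read_end, write_end = os.pipe()
    os.write(write_end, b"x")
    print(wait_readable([read_end], timeout_ms=100) == [read_end])
    os.close(read_end)
    os.close(write_end)
```

In C the equivalent change would be to replace select() calls on fd_set with poll() on an array of struct pollfd.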

LCG issues not addressed yet
- #5404: JC/LM id repository
  - Inconsistency between the JC (memory-resident) id repository and the LM (disk-resident) version
  - To be investigated
- #5442: Setting output path for LCG GUI Job Monitor
  - Actually the problem was that the user didn’t read the doc
  - The only problem that needs to be fixed is that the GUI always tries to use the home directory for the retrieval of the OSB (it doesn’t remember the previous choice)
- #5549: NS cannot handle being addressed through RB host alias
    Selected Virtual Organisation name (from --config-vo option): dteam
    **** Error: API_NATIVE_ERROR ****
    Error while calling the "NSClient::multi" native api
    AuthenticationException: Failed to establish security context...
    **** Error: UI_NO_NS_CONTACT ****
    Unable to contact any Network Server
  - In the NS log file:
    10 Nov, 23:03:29 -F- "Manager::run": Manager: Failed to acquire credentials...
    10 Nov, 23:05:01 -F- "Manager::run": Manager: Failed to acquire credentials...
    10 Nov, 23:05:30 -F- "Manager::run": Manager: Failed to acquire credentials...

Issues addressed by LCG that we didn’t integrate yet
- #3931: Suggest a local proxy expiration check for WMS jobs
  - Proxy expiry check in the jobwrapper
- #4318: Matchmaking policy for resubmitted jobs
  - Remove previously matched sites in resubmission
  - Now we remove only previously matched CEs
- #4365: WL libraries/daemons must retry BDII queries
  - When the first query fails, it sleeps 5 seconds and retries; when the second attempt fails, it sleeps another 5 seconds and tries a third, final time
- #4892: NS can (partially) crash with ‘unable to receive’ uncaught exception
- #5109: WMS daemon memory leaks
  - Memory leaks in JC, ldif2classad, LM, LB, NS
  - Fixes integrated only for JC and LM (as far as I know)
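
The retry policy described for #4365 (three attempts in total, with a 5-second pause between them) can be sketched in Python as follows. This is an illustration of the policy, not the LCG patch; `query_with_retries` and its parameters are hypothetical, and `query` stands for whatever function performs the BDII lookup.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def query_with_retries(
    query: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `query` up to `attempts` times, sleeping `delay_s` between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return query()
        except Exception:
            if attempt == attempts:
                raise  # third (final) attempt failed: give up
            sleep(delay_s)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so that the pause can be replaced in tests; with the defaults, a query that fails twice and then succeeds returns after roughly 10 seconds of waiting.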

gLite problems hopefully already addressed
- The bugs below are still open in the gLite Savannah, but they have already been addressed
  - Still open because the patches are under test or still to be tested
- #4588, #4630, #4631, #4893, #5071, #5089, #5094, #5115, #5202, #5248, #5325, #5361, #5406, #5792, #5832, #5869, #5903, #5904, #5926, #5932, #5934, #5977

gLite issues not addressed yet
- #5029: On /opt/glite/libexec/voms/voms_install_db strange error message for typo in parameter
- #5125: glite-lb-bkserverd start/stop/status displays usage options
  - Still waiting for clarifications from the user who submitted this bug
- #5378: voms-proxy-info crashes
  - voms-proxy-info taken from UI in AFS (Datamat)
  - I don’t see anything in Savannah for this bug
- #5383: wms client commands crash when used with a VOMS proxy
  - Doesn’t appear anymore … Status to be changed to “Ready for integration”?
- #5278: lack of logging information for the workload_manager daemon
  - Discussed between Mario and FrancescoG
- #5494: Can't generate voms-proxies
  - “Can't interpret old format!” error message

gLite issues not addressed yet
- #5582: Unable to get voms proxy info from a voms proxy
  - Submitted by PeppeGrid
- #5802: hardcoded GLITE_LOCATION in voms-proxy-init
- #5804: unnecessary C++ statement std::flush after std::endl
  - VOMS-related bug
- #5833: all jobs in SUBMITTED after a job storm
  - SUBMITTED status for approx. 3 hours because most LB events did not arrive at the bookkeeping server in a timely fashion
  - Being investigated by CESNET
- #5938: Error using VOMS_Retrieve from voms C api

gLite issues not addressed yet
- #5965: incorrect chown in the glite-wms-parse-configuration.sh file
  - The script does: chown $GLITE_WMS_USER.$GLITE_WMS_USER $location
  - This doesn’t work if GLITE_WMS_USER belongs to a group with a different name and no group named after GLITE_WMS_USER exists
  - Should we also define a GLITE_WMS_GROUP?
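
One way to avoid assuming that a group named after the user exists (#5965) is to look up the user's primary group, and to use an explicit group only when one is configured. The Python sketch below illustrates that lookup on a Unix system; it is not the fix to glite-wms-parse-configuration.sh. The functions `owner_ids` and `chown_to_wms_user` are hypothetical, and the optional `group` argument plays the role of the proposed GLITE_WMS_GROUP.

```python
import grp
import os
import pwd
from typing import Optional, Tuple


def owner_ids(user: str, group: Optional[str] = None) -> Tuple[int, int]:
    """Resolve the (uid, gid) to use for files owned by `user`.

    If `group` is given it is looked up by name; otherwise the user's
    primary group from the password database is used, whatever its name.
    """
    entry = pwd.getpwnam(user)
    gid = grp.getgrnam(group).gr_gid if group else entry.pw_gid
    return entry.pw_uid, gid


def chown_to_wms_user(location: str, user: str, group: Optional[str] = None) -> None:
    """Change ownership of `location` without assuming group name == user name."""
    uid, gid = owner_ids(user, group)
    os.chown(location, uid, gid)
```

With this approach the ownership change still works when GLITE_WMS_USER belongs to a differently named group, and defining a separate group variable becomes optional rather than required.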