EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks CREAM Massimo Sgaravatto – INFN Padova On behalf of the CREAM cluster of competence
Enabling Grids for E-sciencE EGEE-III INFSO-RI EGEE JRA1-SA3 All-hands meeting - Cyprus, May 6-8,
Current status
CREAM CE released for production in EGEE in Oct 2008
  Since then, regular updates with bug fixes and improvements
As of May 4: 22 CREAM CEs (~200 CEIds) published in the EGEE production BDII
  – Used in particular by Alice
      They report good results in terms of reliability and performance
ICE (enabling submissions to CREAM through the WMS) has also been released, more recently than CREAM, although it still has some scalability issues
  A version of ICE much better than the one in production is in certification
  – The one tested in the PPS pilot testbed
  – But some other scalability problems remain (bug #47911)
      Being addressed
      Several problems with testing
  – CMS is starting some tests submitting to CREAM via the WMS
Current status: more details
Some patches recently released in production
  – The ones tested in the CREAM PPS pilot
  – Patch #2748: CREAM, CEMon, BLAH (glite-CREAM)
      Bug fixes
  – Patch #2845: CREAM & CEMon client for SL4 (UI, WMS)
      Bug fixes & IPv6 compliance
  – Patch #2750: yaim-cream-ce (glite-CREAM)
      Bug fixes
In certification
  – Patch #2875: UI for sl5_x86_64
      Includes CREAM and CEMon client
      Same software tag as patch #2845
  – Patch #2966: CREAM & CEMon client for VOBox
      Same content as patch #2845
  – Patch #2597: WMS
      Includes new ICE (the one tested in the CREAM PPS pilot)
      As agreed, the new ICE was to be released quickly in production together with the other CREAM patches, but this didn’t happen
LCG-CE → CREAM-CE
“Sites are encouraged to deploy a CREAM CE in parallel to their LCG CE”
Criteria have been defined that must be met to start the transition from LCG-CE to CREAM
  – http://twiki.cern.ch/twiki/bin/view/LCG/LCGCEtoCREAMCETransition
  – Functionality and performance criteria
Details of how/when/where to run (some of) these formal tests are being finalized
  – Activity b of Phase 3 of the CREAM PPS pilot
  – Joint SA1/SA3 effort
First tests (“at least 5K simultaneous jobs per CE node”) are being started
Submission to CREAM from CondorG
One requirement to be fulfilled is submission to CREAM via CondorG
At CHEP, Sanjay Padhi (CMS US) reported that they have done it, but they see a high failure rate in their tests
  – Not reported before
  – Problems with proxy delegation
  – We are not aware of such problems
We installed a CREAM CE and made it available to them for debugging such problems
  – Still waiting for Sanjay’s feedback
Workplan
Premise
  – CREAM and the related software components are pretty new
  – Most of the time will very likely have to be spent on support
      Also support for OSG, as far as CEMon is concerned
  – Not too much feedback so far
      We expect more in the future
  – The current plan (specified in Savannah) can change heavily in the future
Note
  – We are still reasoning in terms of the current model (the one proposed for EGEE-III year II is not fully clear and will be discussed tomorrow)
  – E.g. as “expected date” we are considering the date on which the patch is released for certification
Release 1.5
Task #9732
Expected date: May 2009
Patch #2666 (Fourth update of CREAM CE for slc4/i386 platform)
Release notes
  – Several bug fixes
  – Porting to voms-api-java (task #7744)
      This also means that VOMS server certificates won’t be needed anymore on the CREAM CE node (.lsc files will be enough)
      voms-api-java (patch #2771) must be released in production first
  – First release of new BLAH parsers for LSF and PBS
      Use of the batch system status/history commands instead of parsing the log files
      Use of old/new parser decided at configuration (yaim) time
  – IPv6 compliance for BLAH (task #8825)
Release 1.6
Task #9734
Expected date: July 2009
Release notes
  – Bug fixes
  – Proper management of error codes and error messages (task #9295)
      Task recently added as requested by the management
      Proper project-wide guidelines must be defined first
  – glexec → sudo (task #9557)
      Replace glexec calls with sudo calls
      Long discussions on whether this is acceptable from a security point of view
      Eventually discussed and approved by MWSG and SCG
      Glexec will be used only once per job submission, just to get the local user to be used in the sudo calls
      Eventually this local user will be given by the new AuthZ service
      Besides improving performance and reducing dependencies, this should facilitate the migration to the new AuthZ service
      The same needs to be done in BLAH
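The "glexec once per job, sudo afterwards" idea can be illustrated with a minimal Python sketch. It is not the CREAM implementation: the class name, the `resolve_local_user` callable (a stand-in for the single glexec call, or later for the new AuthZ service) and the job identifiers are all hypothetical. Only the `sudo -n -u <user> --` command-line form is standard sudo usage.

```python
class JobSudoRunner:
    """Resolve the local user once per job, then build sudo command lines.

    `resolve_local_user` is a hypothetical callable standing in for the
    single glexec invocation (or, later, a query to the AuthZ service).
    """

    def __init__(self, resolve_local_user):
        self._resolve = resolve_local_user
        self._user_by_job = {}
        self.resolver_calls = 0  # counts how often the expensive lookup ran

    def local_user(self, job_id, proxy_path):
        # The expensive mapping is performed only on the first request per job.
        if job_id not in self._user_by_job:
            self.resolver_calls += 1
            self._user_by_job[job_id] = self._resolve(proxy_path)
        return self._user_by_job[job_id]

    def sudo_command(self, job_id, proxy_path, argv):
        # -n: never prompt for a password; -u: run as the mapped local user.
        user = self.local_user(job_id, proxy_path)
        return ["sudo", "-n", "-u", user, "--", *argv]


if __name__ == "__main__":
    runner = JobSudoRunner(lambda proxy_path: "alice001")  # fake mapping
    print(runner.sudo_command("job-1", "/tmp/proxy.pem", ["/bin/true"]))
    print(runner.sudo_command("job-1", "/tmp/proxy.pem", ["/bin/ls", "/tmp"]))
    print("resolver calls:", runner.resolver_calls)
```

Keeping the resolver behind a callable is what would make the later switch to the new AuthZ service a local change: only the function passed to the constructor has to be replaced.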
Release 1.7
Task #9735
Expected date: October 2009
Release notes
  – Bug fixes
  – Move to new AuthZ Service (task #7746)
      Depends on:
        Availability of the new AuthZ service (task #7718) and its “maturity” (to be verified for the glexec on WN use case)
        Integration of gridftpd with the new AuthZ Service (Chad)
        glexec → sudo (task #9557)
  – BES and JSDL v. 1.0 support (task #7739)
      Since BES and JSDL are not really usable for production activities, we are not really active on this task
      Much more effort goes into following PGI activities (task #9290)
        Goal: definition of the appropriate profiles needed for production use
Release 1.8
Task #9736
Expected date: January 2010
Release notes
  – Bug fixes
  – Support for bulk job submissions (task #7740)
      Submission of multiple jobs to a CREAM CE via a single call
      Also (in particular) for submission through WMS/ICE
Release 1.9
Task #9738
Expected date: April 2010
Release notes
  – Bug fixes
  – CEMon backend refactoring (task #7747)
      Problems with the JNDI-based backend
        Performance problems
        Difficult to maintain
      Already discussed and agreed with the OSG people
        RDBMS (MySQL) for CEMon in CREAM CE
        Light embedded DB (e.g. Derby) for CEMon in OSG
  – Some support for high availability/scalability CE (task #7742)
      Requested in particular by CERN people
      To support a pool of CREAM CE machines seen as a single CREAM
      Preparing a proposal describing the different possible options
IPv6 compliance
Bugs opened by Mario Reale
  – CREAM and CEMon clients: fixed and already released in production (task #7801)
  – CREAM and CEMon server: no bugs opened
  – BLAH: fixed in CVS. Will be released with release 1.5 (patch #2666) (task #8825)
As agreed, we haven’t done any tests on IPv6
  – To be done by SA2
  – As far as I can understand, support for IPv6 is still missing in several packages that we depend on (e.g. gridsite, gsoap-plugin, voms)
  – They can’t test much right now (for both CREAM client and CREAM server)
Porting to new platforms
Our understanding of the requirements
  – CREAM and CEMon client on sl5_x86_64 (task #9289)
      Hopefully done (patch #2875 in certification)
  – CREAM CE on sl5_x86_64 (task #9288)
      org.glite.ce.* more or less already builds (at least if you just build org.glite.ce)
      No tests performed yet, also because not all the software components needed for the CREAM-CE node build on SL5_x86_64
  – CREAM and CEMon client on MacOS X (task #9293)
  – CREAM and CEMon client on sl5_ia32 (task #9292)
  – CREAM and CEMon client on deb4_x86_64 (task #9291)
  – WMS (and therefore ICE) on sl5_x86_64 (task #9429)
Issues
  – Not clear by when this is required
      No clear deadlines given, apart from the UI on SL5_x86_64
  – Not completely up to us (the CREAM CE doesn’t contain only our software)
  – Are these all (and the only) platforms we’ll have to support?
      Can other platforms be supported if some customers ask for them?
Documentation
Everything available on the CREAM web site
Doc for users
  – CREAM CLI documentation
  – CREAM JDL documentation
  – CREAM C++ API documentation and tutorial
  – …
Doc for admins
  – Installation and configuration guides
  – Description of CREAM control mechanisms
  – Info for troubleshooting
  – …
Trying to keep it updated
  – Recently added: howto on forwarding requirements to the batch system
Other tasks
Task #7743: Better integration between CREAM and LB (LB events logged also by CREAM)
  – Depends on task #7638 ([LB] Support native CREAM jobs)
      But its expected date is 30/04/2010
  – Not by the end of EGEE-III …
Better support for MPI jobs
  – See MPI WG activities
  – Still to be checked whether the mechanism to forward requirements to the batch system (via BLAH) is enough
Recent requests to use CEMon for the Alice dashboard
  – They would like CEMon to notify the dashboard about CREAM job status changes
  – Still discussing with the relevant persons
Current modus operandi
We are responsible for development and maintenance of
  – CREAM: INFN Padova
  – CEMon: INFN Padova
  – BLAH: INFN Milano
  – ICE: INFN Padova
  – yaim-cream-ce: INFN Padova
Software is usually released in the form of:
  – Patches for CREAM and CEMon client
      To be installed on the UI, WMS and VOBOX nodes
  – Patches for BLAH, CREAM and CEMon server
      To be installed on the glite-CREAM node
  – Patches for yaim-cream-ce
      To be installed on the glite-CREAM node
  – Patches for WMS (or only the ICE component)
      To be installed on the WMS node
Used procedures: precertification
When it’s time to finalize a patch
  – After developers’ tests
  – Software is tagged and ETICS confs are locked
      A specific script (which increments the version numbers in the ini files, performs the CVS tags, creates the ETICS confs) is used
  – RPMs (taken from the ETICS permanent repository) are installed for testing in the testbed
      Small testbed (testbedA)
      Larger testbed (testbedB)
        7 CREAM CEs with INFN-Padova, 7 CREAM CEs with INFN-Padova, 7 CREAM CEs with INFN-CNAF
        Used in particular for testing submission via the WMS
  – Performed tests
      Functionality tests
        Implemented a testsuite for CREAM and CREAM-CLI
      Tests to check whether the bugs specified in the patch are really fixed
  – Precertification report attached to the patch
  – Still missing
      Real regression tests
      Performance tests (GRNET is working on that)
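The tagging step of such a script might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the `major.minor.revision-age` version layout, the `<module>_R_<version>` tag pattern and the module name `glite-ce-cream` are assumed for the example and are not taken from the actual script. The sketch only plans the work; it returns the `cvs tag` command instead of running it.

```python
import re


def bump_age(version):
    """Increment the trailing 'age' field of a 'major.minor.revision-age' string."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version)
    if match is None:
        raise ValueError(f"unexpected version format: {version!r}")
    major, minor, revision, age = (int(part) for part in match.groups())
    return f"{major}.{minor}.{revision}-{age + 1}"


def tag_name(module, version):
    """Build a tag name; CVS tags cannot contain dots, so use underscores."""
    return f"{module}_R_" + re.sub(r"[.-]", "_", version)


def plan_release(module, current_version):
    """Return the new version, the tag, and the command that would apply it."""
    new_version = bump_age(current_version)
    tag = tag_name(module, new_version)
    # The command is to be run inside a checked-out working copy of the module.
    return {"version": new_version, "tag": tag, "commands": [["cvs", "tag", tag]]}


if __name__ == "__main__":
    # Hypothetical module name and version, for illustration only.
    print(plan_release("glite-ce-cream", "1.5.0-1"))
```

Returning the planned commands rather than executing them makes it easy to add a dry-run mode, which is useful when the same script also has to create and lock the ETICS configurations.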
Feedback on tools and procedures
CVS
  – No major problems with it
  – We don’t like the CVS notification mechanisms at all
      Simba mailing lists for the existing (but not all) subsystems
      Not flexible
        You have to ask someone to create a new mailing list for the subsystem you are interested in
        It doesn’t allow being notified about commits for just a specific component or a specific directory tree
          o E.g. I am interested in just org.glite.yaim.cream-ce and not the whole org.glite.yaim subsystem
      The approach used at IN2P3 for EDG was much better
        .cvsnotify files containing addresses
        All listed people received notifications for commits done under that directory
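The per-directory notification idea can be sketched in a few lines of Python. This is an illustration of the principle only, not the IN2P3 implementation: the dictionary standing in for the `.cvsnotify` files, the rule that parent directories also apply, and all addresses are assumptions made for the example.

```python
from pathlib import PurePosixPath


def recipients_for_commit(committed_path, notify_files):
    """Return the addresses to notify for a commit to `committed_path`.

    `notify_files` maps a directory (POSIX path relative to the repository
    root, "." for the root itself) to the addresses that would be listed in
    that directory's .cvsnotify file.
    """
    recipients = []
    # Walk from the file's own directory up to the repository root.
    for directory in PurePosixPath(committed_path).parents:
        for address in notify_files.get(str(directory), []):
            if address not in recipients:
                recipients.append(address)
    return recipients


if __name__ == "__main__":
    # Hypothetical addresses; the directory layout is for illustration only.
    notify_files = {
        "org.glite.yaim.cream-ce": ["cream-dev@example.org"],
        "org.glite.yaim.core": ["yaim-dev@example.org"],
        ".": ["release-manager@example.org"],
    }
    print(recipients_for_commit("org.glite.yaim.cream-ce/config/config_cream", notify_files))
    print(recipients_for_commit("org.glite.yaim.core/bin/yaim", notify_files))
```

With this scheme, someone interested only in org.glite.yaim.cream-ce adds an address in that one directory, and nobody has to be asked to create a mailing list.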
Feedback on tools and procedures
ETICS
  – Powerful tool but too complex
  – The average user does not have (and doesn’t want to have) a deep knowledge of the system
      He just wants to be able to manage his use cases
  – Very often only very few people are able to understand the reasons for some problems/behaviors
      E.g. GGUS #45622 (ETICS client problem)
      E.g. why does configuration xyz build against a certain project config, and no longer build after locking?
      E.g. why does org.glite.ce.common-java build if I build just the org.glite.ce subsystem, while it doesn’t build if I try to build the whole org.glite?
  – If even the release manager complains about that, there is a problem …
Feedback on tools and procedures
ETICS
  – Testing of new ETICS versions should probably be improved
      It happened more than once that major problems were introduced with new versions
  – Support not always very effective
      E.g. GGUS #45622
        Opened on Jan 27, 2009 (high priority)
        Solution found on March 3, 2009 (several “pings” were needed)
        Still open (i.e. while waiting for the new client, we have to do some hacks by hand if we have to run the client on a CREAM CE machine)
  – Not too clear if/when some of the requested features will be provided
      E.g. is there an ETICS workplan available somewhere?
Feedback on tools and procedures
ETICS
  – The main problem was that no clear directives were given about how ETICS should be used in gLite
      E.g. specification of dependencies
        Static dependencies or properties
  – Clear, well documented and “bomb-proof” guidelines, recipes and tools should be given to developers and “internal” integrators to manage their use cases
  – Should someone check whether the configurations released for certification are compliant with these guidelines?
      Also via some automatic tool?
Feedback on tools and procedures
Savannah
  – It is not used in a “homogeneous” way in the project
      Some track everything via Savannah
        This is what we do
        Basically each commit refers to a Savannah “bug”
      For some other components it is used only for bugs submitted by users
  – The procedure for closing bugs should be improved, otherwise many bugs stay open even if they have been fixed
      When a patch goes to production, the bugs should go to status “Ready for Review”
        Foreseen, but this doesn’t always happen
      When a bug goes to “Ready for Review”, it should be assigned to the person who submitted it
        Otherwise it is difficult to understand which bugs you are supposed to verify
        Not foreseen (bugs stay assigned to “egeetest”)
        Not even technically possible, if that person is not part of the JRA1 MW Savannah group
Feedback on tools and procedures
GGUS
  – Saying that it is a “best effort” or a “voluntary basis” activity doesn’t make much sense
  – Right now there is just a “workload management” support unit, which includes only the WMS developers
  – Single “job management” support unit or multiple support units (one for WMS, one for CREAM, one for BLAH, etc.)?
  – Some problems with the procedure explained by Diana at the last AH meeting
      E.g. interactions between GGUS and Savannah
        We should just put the Savannah bug number in the GGUS ticket
        As far as I can see, GGUS is not taking care of the rest, as it is supposed to
          o E.g. filling of the GGUS field in the Savannah bug
          o E.g. updates of the Savannah bug logged in the GGUS ticket
Feedback on tools and procedures
Current certification process
  – A patch is released for certification
      The ETICS conf. has been built and locked against the glite_branch_3_1_0 project config
      The RPMs are available in the permanent ETICS repository
  – The new RPMs are installed on the relevant node types, where the certification tests are performed
  – When the patch is released for production, glite_branch_3_1_0 is updated with the new ETICS conf. of that patch
Not suitable for all scenarios
  – Just testing the new RPMs doesn’t always mean testing the new software
  – E.g. consider the recent trustmanager and util-java patch
      Some of the jars used are the ones installed via the RPMs
      But some other jars are included in the webapp wars (e.g. CREAM, CEMon, FTS), so they are picked up at build time
      The new trustmanager and util-java are really used everywhere only when the involved RPMs are deployed AND the relevant components are built against the new versions
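The gap described above (a node with updated RPMs that still runs old jars bundled inside a war) could be detected with a check along the lines of the following Python sketch. It is not an existing gLite tool: it assumes the bundled jars live under `WEB-INF/lib`, that the installed and bundled copies share a file name, and that differing SHA-256 digests mean the war was built against another version. The jar names and the demo war are made up.

```python
import hashlib
import io
import zipfile


def bundled_jar_digests(war_file):
    """Map each jar under WEB-INF/lib in the war to the SHA-256 of its content."""
    digests = {}
    with zipfile.ZipFile(war_file) as war:
        for name in war.namelist():
            if name.startswith("WEB-INF/lib/") and name.endswith(".jar"):
                jar_name = name.rsplit("/", 1)[-1]
                digests[jar_name] = hashlib.sha256(war.read(name)).hexdigest()
    return digests


def outdated_bundled_jars(war_file, installed_digests):
    """Return the bundled jars whose content differs from the installed copy."""
    bundled = bundled_jar_digests(war_file)
    return sorted(
        jar for jar, digest in bundled.items()
        if jar in installed_digests and installed_digests[jar] != digest
    )


def make_demo_war(jars):
    """Build an in-memory war from {jar name: bytes}; for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as war:
        for jar_name, content in jars.items():
            war.writestr("WEB-INF/lib/" + jar_name, content)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Hypothetical jar names and contents.
    war = make_demo_war({"trustmanager.jar": b"old build", "util-java.jar": b"same"})
    installed = {
        "trustmanager.jar": hashlib.sha256(b"new build").hexdigest(),
        "util-java.jar": hashlib.sha256(b"same").hexdigest(),
    }
    print(outdated_bundled_jars(war, installed))
```

A non-empty result would indicate that the webapp has to be rebuilt against the new jars before the certification tests really exercise them.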
Feedback on tools and procedures
org.glite.ce subsystem
  – Includes CREAM, CEMon and BLAH
  – Includes both server (to be installed on the CREAM CE) and client (to be installed on the UI and on the WMS)
Doesn’t fit well with the current software organization
  – Specifying a whole org.glite.ce subsystem configuration for e.g. a CREAM server patch doesn’t make much sense
      Which confs should be specified for the CREAM client components?
Willing to consider node type (metapackage) configurations instead
  – How is it possible to keep these metapackage configurations synchronized with the versions of the software used in production?
Feedback on tools and procedures
The time for a patch to go into production is very long
  – 1-3 months
  – Most of the time was not spent in the certification itself, but in waiting for the certification of the patch to start and in waiting for the patch to be deployed in the PPS after having been certified
      Wasn’t the precertification supposed to address this issue ???
The procedure is not very flexible
  – E.g. I simply forgot to add “VOBOX” in the “affected metapackage” field of a CREAM client patch, and a new patch had to be created !!!
      And it has to follow the usual (long !!!) procedure
At any rate, things should be better with the new organization
Other feedback
Coordination and communication should definitely improve
  – E.g. management of non-backward-compatible changes
      E.g. new jobid in gLite 3.2
  – E.g. porting to SL5/gLite 3.2
      It feels like JRA1 and SA3 management have different views (and priorities)
  – E.g. the “dependency challenge” done some time ago
      Different opinions and different outcomes from different reviewers
      We heavily modified all our dependencies based on that review, and now it turns out that we have to modify them again
  – E.g. release in production of the CREAM and ICE software used in the PPS pilot
      The agreement was to release it in production in a short time, after a quick certification, but this didn’t happen
Other feedback
Rules and guidelines
  – The few defined rules and guidelines are not always enforced
  – E.g. update of the RPMs of a patch during its certification process
      Wasn’t it decided that this should not happen, and that instead the patch has to be rejected/obsoleted and a new one created (via the Savannah cloning tool)?
      Not always done
      Up to developers and/or certifiers
Dependencies on other gLite components
  – We have several dependencies on other gLite components
  – It feels like in some cases people don’t feel committed to supporting these components, if the issues raised are not relevant for them
      We are afraid that it will be even worse in the future
New organization for EGEE-III year II
To be discussed tomorrow, but it is not clear how the proposed “one product team per node” model can fit with our model
  – In a CREAM CE there are a lot of other software components which we don’t implement and maintain
      E.g. yaim-core, voms, lcas, lcmaps, glexec, etc.
  – Saying that everything in the CREAM CE node will be under our full control is not really true
  – What about CREAM-related software components not installed on the CREAM node?
      CREAM client (installed on the UI, VOBOX and WMS)
      ICE (installed on the WMS)