Presentation is loading. Please wait.

Presentation is loading. Please wait.

Massimo Sgaravatto INFN Padova On behalf of the CREAM product team

Similar presentations


Presentation on theme: "Massimo Sgaravatto INFN Padova On behalf of the CREAM product team"— Presentation transcript:

1 Massimo Sgaravatto INFN Padova On behalf of the CREAM product team
CREAM-CE Massimo Sgaravatto INFN Padova On behalf of the CREAM product team

2 What is CREAM CREAM service: Computing Resource Execution And Management service Service for job management operations at the Computing Element (CE) level Allows to submit, cancel, monitor, … jobs Web service interface Supported batch systems (through the BLAH interface): LSF, PBS/Torque, Condor, SGE, BQS Supposed to replace the LCG-CE Implemented and maintained by the Padova middleware team in the context of the EGEE project Milano for the BLAH component Support, maintenance and new developments will continue in the context of EMI / IGI A CREAM CE includes several other software components implemented by other teams Security components (trustmanager, glexec, voms, …), … INFN-GRID & CCR Workshop - May 17-21, 2010

3 CREAM: job submission scenarios
Direct submission We provide and maintain a CREAM CLI installed in the User Interface But users can also implement their own clients using a Web Service framework CREAM CREAM CREAM INFN-GRID & CCR Workshop - May 17-21, 2010

4 CREAM: job submission scenarios
Submission through a higher level job management service gLite WMS ARC (Nordugrid) Condor ICE CREAM CREAM CREAM INFN-GRID & CCR Workshop - May 17-21, 2010

5 Some history Jun 08 Dec 08 Jun 09 Dec 09 Jun 10 October 2008
First version of the CREAM CE and CREAM-CLI for gLite 3.1/sl4_i386 released in production Sites then invited to deploy CREAM CEs in parallel to their LCG CEs INFN-GRID & CCR Workshop - May 17-21, 2010

6 Some history Jun 08 Dec 08 Jun 09 Dec 09 Jun 10 March 2009
Availability in production of a first WMS able to submit to CREAM CEs But CREAM CEs not matched by default by gLite WMS, since publishing Special as GlueCeStateStatus INFN-GRID & CCR Workshop - May 17-21, 2010

7 Some history Jun 08 Dec 08 Jun 09 Dec 09 Jun 10 October 2009
WLCG/SA1 invited sites to publish Production instead of Special as GlueCeStateStatus CREAM CEs matched by default by gLite WMS INFN-GRID & CCR Workshop - May 17-21, 2010

8 Some history Jun 08 Dec 08 Jun 09 Dec 09 Jun 10 January 2010
Availability in production of CREAM CE for gLite 3.2/sl5_x86_64 INFN-GRID & CCR Workshop - May 17-21, 2010

9 Some history Jun 08 Dec 08 Jun 09 Dec 09 Jun 10 May 2010
Release in production of CREAM CE 1.6 INFN-GRID & CCR Workshop - May 17-21, 2010

10 CREAM CEs in the BDII INFN-GRID & CCR Workshop - May 17-21, 2010

11 LCG-CE  CREAM-CE WLCG defined a set of criteria to be met before the phase out of the LCG-CE can start Most of them already satisfied Some of them not really well defined E.g. performance criteria Some of them are not under our control In particular support for submission to CREAM from Condor The known issues that we are aware of and for which we are responsible, addressed in CREAM CE 1.6 (already available in production for glite 3.2) and new WMS (coming soon) That list of criteria is not really updated and followed now Feedback from experiments considered instead INFN-GRID & CCR Workshop - May 17-21, 2010

12 ALICE Alice was the first experiment using CREAM
Tests started even before the official release of CREAM in production Pilot activities done at FZK Early feedbacks from users and admins Using direct submissions to CREAM using the CREAM CLI 2009 activities CREAM deployed in all their T1s (all but NIKHEF) and in many T2s Dual submission CREAM-CE (directly via CREAM CLI) LCG-CE (via gLite WMS) INFN-GRID & CCR Workshop - May 17-21, 2010

13 Alice (cont.ed) Status now
CREAM deployed in all T1s and in most T2s Only 2 sites not having CREAM deployed Going to deprecate the submission to the LCC-CE ALICE has established the 31st of May as the deadline to have CREAM-CE at all sites After that date and based on the status of the pending sites, these sites might be blacklisted Only at T0 the deprecation of the LCG-CE is not expected for the moment Because there are only 3 CREAM CEs instances wrt ~ 20 LCG-CEs About 1 million jobs executed since March 30th (start of data taking) About 70 % executed on CREAM CEs INFN-GRID & CCR Workshop - May 17-21, 2010

14 ATLAS CREAM used to directly submit pilots (via Condor/via WMS) that then pick up jobs from PanDA Issues found for submissions from Condor Lease not renewed Problem to be fixed in Condor Overload of the gridftp server on the Condor host when many jobs start/finishes together Condor is going to use a different approach for sandbox management (see next slide) See G. Stewart talk at March GDB No much progress since that R. Walker: ”In summary: CE looks good but we cannot use it” Atlas is asking to keep having a LCG-CE in their sites INFN-GRID & CCR Workshop - May 17-21, 2010

15 Condor  CREAM sandbox management
Condor now uses 1 File transfer 1 done when job starts/finishes running Gridftpd on Condor host can get overloaded when many job starts/finishes running together Going to move to 2 2a done when job is submitted/when output is retrieved 2b done when job starts running/finishes Hopefully going to address the issue since one Condor submitting host using several CREAM CEs Condor submitting host 1 2a Job 2b CREAM CE WN INFN-GRID & CCR Workshop - May 17-21, 2010

16 ATLAS (cont.ed) Submission of pilots to CREAM from WMS
Tests done by David Rebatto Submissions of pilot jobs through gLiteWMS to CREAM CEs ~ 320K pilot jobs submitted since March 23rd Peaks of ~ 12Kjobs/day No gridftp overload problems Only several stuck gridftp connections from a site but not causing problems Is the high load seen with Condor  CREAM really because of gridftp ? Arranging a debugging session with R. Walker Issues ICE crashes in the first tests Now addressed Additional workflows: Ganga  WMS  CREAM For analysis No feedbacks INFN-GRID & CCR Workshop - May 17-21, 2010

17 CMS First test round on T1s in Oct-Nov 2009
Only production Submitted to CREAM using the gLite WMS Results: “Inefficiencies due to ICE/CREAM are negligible” CREAM CEs have been considered in production after this test Use by CMS production activities A bug found in the ICE component of the gLite WMS In some cases jobs are reported in a non final state even if finished Major problem  CREAM CEs removed from production They will be reintroduced when the problem is fixed Use by CMS analysis activities CREAM CEs available at the CMS T2 sites are used transparently The above bug is not considered critical INFN-GRID & CCR Workshop - May 17-21, 2010

18 CMS (cont.ed) The ICE bug addressed in the new WMS 3.2.14
Coming soon Big delay also because of the new certification and release procedure Basically not possible to release anything for gLite 3.1 for ~ 4 months Many other fixes and improvements in ICE The new WMS/ICE along with the new CREAM CE 1.6 (already in production) will also allow a much prompt job status changes detection by the WMS INFN-GRID & CCR Workshop - May 17-21, 2010

19 LHCB CREAM CEs used transparently
They submit to the WMS, which can match LCG-CEs or CREAM-CEs No particular issues reported Going to move towards direct submission model (through the CREAM CLI) A proof of concept Dirac agent using it already in place at CNAF A concern raised on the outputsandbox management already addressed Even if it was not considered a showstopper for them Solution very appreciated also by Alice INFN-GRID & CCR Workshop - May 17-21, 2010

20 Example: LHCB pilots @ FZK
INFN-GRID & CCR Workshop - May 17-21, 2010

21 OSG Open Science Grid is also evaluating CREAM
Among other solutions (e.g. GT5) Tests done at UCSD by Igor Sfiligoi They focused on submissions to CREAM from Condor, using Condor as batch system Condor support as batch system for PIC was responsibility For configuration and info providers Now they are not going to continue this work anymore First feedbacks Gridftp overload on the Condor submission host Same problem seen by ATLAS Waiting for Condor changing the sandbox management Going through the gridftp server of the CREAM CE They would also like to be able to use batch system staging capabilities for file movements between CE and WN instead of gridftp It will be available in next CREAM CE release INFN-GRID & CCR Workshop - May 17-21, 2010

22 Summary Direct submission to CREAM seems in good shape
E.g. good feedback from Alice Still some issues in the submission through WMS Hopefully addressed with the new WMS being released Issues also in the submission through Condor The submission chain here is not fully under our control Number of CREAM CE installations increasing, but still not too many INFN-GRID & CCR Workshop - May 17-21, 2010

23 cream-support [at] lists [dot] infn [dot] it
Other Info cream-support [at] lists [dot] infn [dot] it INFN-GRID & CCR Workshop - May 17-21, 2010

24 Backup slides

25 Latest release: CREAM CE 1.6
New operation QueryEvent To be used by ICE To improve ICE’s job status changes detection Scalability problems in current (CEMon + polling) approach Glexec  sudo Glexec used only once per job submission (just to find the uid to be used in sudo calls) Better performance Will facilitate the integration with Argus BLAH release 1.14 Support for sudo New BLparser model for LSF and Torque/PBS Using the status/history commands instead of parsing the log files Also easier yaim configuration Old parser still supported Support of SGE (as contributed by CESGA) INFN-GRID & CCR Workshop - May 17-21, 2010

26 “Simplified”view of CREAM-CE v. 1.6 architecture
Job submit (job register + job start), job cancel, polling (jobStatus), etc Polling with QueryEvent instead WMS Queue of async requests AuthN / AuthZ Sudo instead CREAM Interface Async. commands (start, cancel, etc.) ... Command Manager Pool of threads Sync. commands (register, status, etc.) BLAH LRMS (PBS, LSF, ...)‏ Job Executor Notifications about CREAM job status changes CEMon Glexec CREAM-JOB plugin Not used anymore INFN-GRID & CCR Workshop - May 17-21, 2010

27 Latest release: CREAM CE 1.6 (cont.ed)
Limiter to protect CREAM when the machine is overloaded New job submissions are blocked when this happens Taking into account load, memory usage, # of file descriptors, etc. Very similar to the limiter used in the WMS Support of file transfers from/to gridftp servers started with user credentials Asked by Condor for Condor glidein Several bug fixes INFN-GRID & CCR Workshop - May 17-21, 2010

28 CREAM CE v. 1.7 Integration with Argus Bug fixes
Argus used to decide if a certain operation on a CREAM CE is authorized Also used to get the local user id Single AuthZ system in the CREAM CE Now there is gJAF, LCAS/LCMAPS for glexec, LCAS/LCMAPS for gridftpd Because of bugs/misconfigurations inconsistent authZ decisions could be taken Also gridftpd integrated with Argus Not using gJAF and glexec (and dependencies) anymore Initally the old code will be maintained At configuration possible to decide if Argus or “old system” has to be used Bug fixes INFN-GRID & CCR Workshop - May 17-21, 2010

29 CREAM CE v. 1.8 Initial integration with LB
CREAM (and not only JobWrapper) logging to LB Allows to monitor via LB all CREAM jobs Not only the ones submitted through the WMS Allows integration with dashboards (Alice request) Depends on availability of LB 2.1 Support for bulk job submissions Submission of parametric jobs, job collections In CREAM and CREAM CLI To be used also by ICE Bug fixes INFN-GRID & CCR Workshop - May 17-21, 2010

30 Standardization BES (Basic Execution Service) was supposed to be the standard interface for job management services Actually, even if the BES and JSDL specifications are in the final state, they are not suited for production use because they lack significant capabilities Used only in demos Standardization activities continuing in 2 ways: Within the EMI project Agreement between ARC, gLite, Unicore to be finalized asap Within the OGF PGI (Production Grid Infrastructure ) working group Longer term goal INFN-GRID & CCR Workshop - May 17-21, 2010

31 Other tasks Support for high availability/scalability
Pool of CREAM CE machines seen as a single CREAM CE Some work already done Common clients and APIs Better support for MPI jobs Some support for interactive jobs In order to monitor and/or steer the job while it is running Hopefully possible integrating existing solutions INFN-GRID & CCR Workshop - May 17-21, 2010


Download ppt "Massimo Sgaravatto INFN Padova On behalf of the CREAM product team"

Similar presentations


Ads by Google