1 ALICE Workload Model – WMS and CREAM
Patricia Méndez Lorenzo (CERN, IT/GS-EIS) ALICE – GridKa operations meeting, 3rd of July 2009

2 Outline
- ALICE job submission in 7 points and the workload management system
- ALICE and the gLite-VOBOX
- ALICE and the gLite-WMS
- ALICE and the CREAM-CE
- Summary and conclusions
03/07/09 ALICE - GridKa Operations Meeting

3 ALICE Job Submission in 7 points
- Job (agent) submission is performed using the WLCG workload management system where available; otherwise jobs are submitted directly to the CEs
- Real jobs are held in a central queue handling priorities and quotas
- Job agents are submitted to provide a standard environment (job wrapper) across different systems
- Real jobs are pulled by the sites
- Automatic operations
- Extensive monitoring
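The pull model above can be sketched in a few lines of Python. This is an illustration only: the names and the in-memory queue are made up, while in the real system the central task queue lives in AliEn and the agents run on Grid worker nodes.

```python
import queue

# Central task queue holding the real jobs, ordered by priority
# (lower number = higher priority). Only lightweight agents are sent
# to the Grid; the payloads stay in this central queue.
central_queue = queue.PriorityQueue()

def submit_real_job(priority, job):
    """Real jobs go into the central queue, never to the Grid directly."""
    central_queue.put((priority, job))

def job_agent():
    """A job agent lands on a worker node, sets up a standard environment
    (the job wrapper), then pulls real jobs until the queue is empty."""
    executed = []
    while True:
        try:
            _, job = central_queue.get_nowait()
        except queue.Empty:
            break
        executed.append(job)  # stand-in for actually running the payload
    return executed

submit_real_job(2, "analysis-task")
submit_real_job(1, "reco-task")
print(job_agent())  # highest-priority job is pulled first
```

Because priorities and quotas are enforced centrally at pull time, the sites never need to know anything about ALICE's internal job ordering.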

4 ALICE Workload Management System
Standard ALICE Job Submission Model

5 ALICE and the gLite-VOBOX
WLCG and the gLite-VOBOX
- The VOBOX is a WLCG service developed in 2006 to let the experiments run their own services
- In addition, it provides file-system access to the experiment software area
ALICE and the gLite-VOBOX
- VOBOXes are deployed at all T0, T1 and T2 sites providing resources for ALICE
- A mandatory requirement to enter production, on top of all standard LCG services
- Entry door to the LCG environment: runs standard LCG components and ALICE-specific services
- Uniform deployment: same behaviour at T1 and T2 in terms of production; differences between T1 and T2 are a matter of QoS only
- Site-related problems are expected to be handled by the site administrators

6 ALICE and the gLite-WMS
- ALICE deprecated the LCG-RB in November 2008; ALL ALICE SITES ARE CURRENTLY USING THE WMS FOR JOB-AGENT SUBMISSION (since last year already)
- For each site, up to 3 WMS nodes are defined
- Configured via the AliEn LDAP, based on a regional distribution
- At most T2 sites they are also used as backup
- The local WMS configuration (available on each local VOBOX) is NOT used: ALICE creates the WMS configuration files (based on the LDAP list of local WMS) on the fly
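A minimal sketch of that on-the-fly generation, assuming the LDAP lookup has already returned the per-site WMS host list. The function name, hostnames and the exact configuration syntax are illustrative (loosely modelled on the gLite UI configuration), not the actual AliEn code.

```python
def build_wms_config(wms_hosts, vo="alice"):
    """Render a WMS client configuration from the LDAP-derived host list.
    Attribute names roughly follow the gLite UI config; treat the exact
    syntax as illustrative."""
    endpoints = ",\n".join(
        f'        "https://{host}:7443/glite_wms_wmproxy_server"'
        for host in wms_hosts
    )
    return (
        "[\n"
        f'    VirtualOrganisation = "{vo}";\n'
        "    WMProxyEndPoints = {\n"
        f"{endpoints}\n"
        "    };\n"
        "]\n"
    )

# Up to 3 WMS per site, regionally distributed (hostnames are made up).
print(build_wms_config(["wms1.example.org", "wms2.example.org"]))
```

Regenerating the file from LDAP at submission time is what makes the locally installed WMS configuration on the VOBOX irrelevant.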

7 ALICE Experiences with the gLite-WMS
- ALICE began suffering important instabilities with the gLite-WMS during the Christmas period of 2008/2009, when ALICE began to use the system massively
- Most of the issues were observed with the most heavily used WMS of the whole production
- At that moment there were 3 ALICE-dedicated WMS nodes at CERN:
- 2 nodes configured with gLite 3.0 on old hardware
- 1 node already configured with gLite 3.1 on an 8-core machine, used for CERN submissions and as failover for almost all ALICE sites
- The 8-core node began to show too large a backlog from the 21st of December; in addition, the gLite 3.0 nodes were also unstable and were replaced by two new nodes just after Christmas

8 ALICE Experiences with the gLite-WMS
Several possible problem sources were discussed during the GDB meetings of January, February and March 2009:
- The destination queue is not available due to a configuration problem at the destination site; by WMS design, submitted jobs can then be kept in queues for several hours
- The ALICE JDL construction: a complicated JDL slows down the matchmaking process, so the workload manager cannot keep up with all the requests sent by the WMProxy service - DISCARDED
- BDII overloaded
- Network problems
- MyProxy server overloaded

9 ALICE Experiences with the gLite-WMS
- Whatever was creating the instability, the ALICE perception was jobs staying in status WAITING or READY forever
- Suicide mode: new requests still coming in and being accepted, thus worsening the situation
- The above hypotheses were considered ingredients of a possible high load, but not the unique reason
- We carefully followed the evolution of the 3 WMS nodes (gLite 3.1, 8 cores) and reported further issues to the developers
- In March 2009 we still could not determine the origin of the problem

10 Some of the hypotheses in detail
Overload of the MyProxy server
- The ALICE submission procedure was changed in January to avoid this: proxy delegation is requested once per hour - a 'frugal' usage of the MyProxy server
- Conclusion: the WMS overload is uncorrelated with the use of MyProxy
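The once-per-hour delegation amounts to a simple throttle around the expensive call. A hypothetical sketch, not the actual ALICE submission code:

```python
DELEGATION_INTERVAL = 3600  # re-delegate the proxy at most once per hour

_last_delegation = None  # timestamp of the last delegation; None = never

def maybe_delegate(now, delegate):
    """Run the expensive delegation (a MyProxy/WMProxy round trip) only
    if the cached delegation is older than one hour; otherwise reuse it."""
    global _last_delegation
    if _last_delegation is None or now - _last_delegation >= DELEGATION_INTERVAL:
        delegate()           # contact the proxy service
        _last_delegation = now
        return True
    return False             # reuse the existing delegation
```

Every submission would call `maybe_delegate(time.time(), do_delegation)`, but at most one call per hour actually reaches the server, which is what decoupled the WMS load from MyProxy usage.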

11 Some of the hypotheses in detail
The destination queue is not available
- If the queue(s) declared in the JDL are not available, the request stays in a standby status for several hours (configurable on each WMS)
- These requests cannot be tracked: jobs stay in status WAITING forever while submissions continue until the WMS is overloaded
- ALICE has implemented a 15-minute expiry time at the JDL level at all sites; this possible problem is no longer an issue since March 2009
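A sketch of how such a short-lived request could be built. The gLite JDL `ExpiryTime` attribute takes an absolute Unix timestamp after which the WMS discards the request; the helper name, executable and CE identifier below are made up for illustration.

```python
import time

def agent_jdl(executable, ce_id, lifetime=15 * 60):
    """Build a job-agent JDL whose request expires `lifetime` seconds
    from now, so the WMS drops it instead of parking it for hours.
    ExpiryTime semantics as in the gLite JDL (absolute Unix time)."""
    expiry = int(time.time()) + lifetime
    return "\n".join([
        "[",
        f'    Executable = "{executable}";',
        f'    Requirements = other.GlueCEUniqueID == "{ce_id}";',
        f"    ExpiryTime = {expiry};",
        "]",
    ])

print(agent_jdl("alien-job-agent.sh",
                "ce.example.org:2119/jobmanager-pbs-alice"))
```

With a 15-minute expiry, an unmatched agent request simply disappears instead of accumulating untrackable WAITING jobs on the WMS.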

12 Current status of the gLite-WMS
- In collaboration with IT-FIO and IT-GS, up to 4 WMS nodes have been set up at CERN for exclusive ALICE use since March 2009
- France has deployed gLite 3.2 on all WMS used by ALICE; the CNAF T1 has also upgraded the version of the WMS used by ALICE
- A sufficient number of WMS deployed for ALICE, together with correct monitoring of the service, can ensure the good behaviour of the ALICE WMS in production
- Even today we still see occasional instabilities

13 gLite-WMS and ALICE: final remarks
Some legends to forget:
- "This issue is affecting ALICE only"... FALSE
- "This issue is affecting ALICE because of their submission model"... FALSE
- "ALICE should change to bulk submission"... THIS IS NOT THE QUESTION
IMHO:
- Any service should cope with the computing models of all LHC experiments; if a new service requires deep changes in a computing model, then the service should be redefined
- Any new service should ensure AT LEAST the same functionality covered by the previous service; any gap in the service features is not the experiment's fault

14 ALICE and the CREAM-CE
- ALICE is interested in the deployment of the CREAM-CE service at all sites supporting the experiment
- Focus: direct CREAM-CE submission; WMS submission is not required
- The experiment has been testing the CREAM-CE in a full production environment since the beginning of Summer 2008
- ALICE requires the CREAM-CE service in parallel to the LCG-CE at all sites BEFORE THE REAL DATA TAKING

15 The WLCG Statement Nov 2008 (WLCG GDB Meeting). CREAM Status:
- WMS submission via ICE to CREAM was still not available
- The proxy-renewal mechanism of the system was not optimized (CURRENTLY SOLVED)
- In addition, Condor-G submission to CREAM was not enabled (ALREADY AVAILABLE AND BEING TESTED BY CMS)
- At that point, these factors were preventing the setup of the system in production (and even in testing) for ATLAS and CMS, but not for ALICE and LHCb
- The highest priorities for LHCb were (and are) glexec and SLC5; the highest priority for ALICE was (and is) CREAM
- One experiment was ready to stress the system: a good opportunity for the experiment, the developers and the site admins to gain experience with the new system
- WLCG encouraged sites to provide a CREAM-CE system in parallel to the current LCG-CE

16 ALICE requirements
- Setup of a 2nd VOBOX at each site providing a CREAM-CE
- Each ALICE VOBOX submits to a specific backend
- One VOBOX → LCG-CE OR CREAM-CE submission: replacement approach
- Two VOBOXes → LCG-CE AND CREAM-CE submission: parallel approach
- The two-VOBOX approach is required at sites providing an important storage environment for ALICE
- Setup of the ALICE production queue behind the CREAM-CE
- This procedure puts the CREAM-CE directly into production
- Setup of a GridFTP server
- Required to retrieve the job (agent) outputs
- No specific wish for the placement of this service, although we suggest the 2nd VOBOX as a good location
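The parallel approach boils down to binding each VOBOX to exactly one submission backend. A toy dispatch to make that concrete: the hostnames are illustrative, while `glite-wms-job-submit` and `glite-ce-job-submit` are the real gLite CLIs (their options are simplified here).

```python
# Each VOBOX submits to exactly one backend (parallel approach).
VOBOX_BACKEND = {                    # hostnames are illustrative
    "voalice01.example.org": "LCG-CE",
    "voalice02.example.org": "CREAM-CE",
}

def submit_command(vobox, jdl_file):
    """Choose the submission tool for this VOBOX's backend.
    Command names are the real gLite CLIs; options are simplified."""
    if VOBOX_BACKEND[vobox] == "CREAM-CE":
        return ["glite-ce-job-submit", "-a", jdl_file]   # direct CREAM submission
    return ["glite-wms-job-submit", "-a", jdl_file]      # via the gLite-WMS

print(submit_command("voalice02.example.org", "agent.jdl"))
```

Keeping the two backends on separate VOBOXes means either channel can be stopped or debugged without touching the other, which is exactly why the parallel approach is required at sites with an important ALICE storage environment.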

17 1st phase of the service tests by ALICE
- A test CREAM-CE instance was provided by GridKa
- Tests operated through a second VOBOX, parallel to the already existing service at the T1
- Access to the CREAM-CE: initially 30 CPUs (in PPS) available for the testing, moved within a few weeks to the ALICE production queue (production setup)
- Intensive functionality and stability tests from July to September 2008
- Production stopped at the end of September 2008 for CREAM-CE upgrades at the site
- Deployment of the new version at GridKa confirmed last week (tested and working perfectly)
- Excellent support from the CREAM-CE developers and the site admins
- Thanks to Massimo Sgaravatto (Padova-INFN) and Angela Poschlad (GridKa) for the continuous support

18 2nd phase of the service tests by ALICE
- After a debug phase of the CREAM module in January 2009, the new module went into production on the 19th of February (the 2nd testing phase started)
- Stability and performance are currently the most important test issues at the sites providing a CREAM-CE
- The deployment of a 2nd VOBOX ensures that production continues in parallel through the WMS; a single VOBOX would require full, dedicated babysitting of the system (not realistic)
- Feedback on all issues is provided directly to the CREAM developers

19 CREAM-CE current status
- The CREAM-CE has demonstrated that it copes well with the stability and performance requirements of ALICE
- It is strongly required at all sites before the real data taking
- Very good feedback also from the site admins
- Although the LCG-CE is still the WLCG CE, the CREAM-CE is a production service
- Sites providing a CREAM-CE for the ALICE production: CERN, RAL, IN2P3-Subatech, GSI, GridKa, INFN-CNAF, INFN-Torino, INFN-Legnaro, SARA, Prague, IHEP, Kolkata, KISTI

20 Summary and Conclusions
- The ALICE workload management system has achieved a good level of maturity and is ready for the real data taking
- Developers and support are always ready to implement in the ALICE environment any new feature that can be useful for the whole system
- It is based on 3 major WLCG services:
- gLite-VOBOX: a very stable service across all ALICE sites
- gLite 3.1 WMS: a lot of headaches until we managed to ensure a quasi-stable system; we still need to monitor it continuously, and this babysitting regime cannot be expected during the real data taking
- CREAM-CE: the most adequate service for the ALICE data taking; it is highly desirable in order to fully deprecate the current gLite-WMS

