ALICE Workload Model – WMS and CREAM


ALICE Workload Model – WMS and CREAM
Patricia Méndez Lorenzo (CERN, IT/GS-EIS)
ALICE – GridKa Operations Meeting, 3 July 2009

Outlook
- ALICE job submission in 7 points and the workload management system
- ALICE and the gLite-VOBOX
- ALICE and the gLite-WMS
- ALICE and the CREAM-CE
- Summary and conclusions

03/07/09 ALICE - GridKa Operations Meeting

ALICE Job Submission in 7 Points
- Job (agent) submission is performed via the WLCG workload management system where available
- Otherwise, agents are submitted directly to the CEs
- Real jobs are held in a central queue that handles priorities and quotas
- Job agents are submitted to provide a standard environment (job wrapper) across the different systems
- Real jobs are pulled by the sites
- Automatic operations
- Extensive monitoring
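The pull model described above can be sketched as follows. This is a minimal illustration, not the AliEn implementation: all class and function names (`CentralTaskQueue`, `job_agent`, the job IDs) are invented for this example.

```python
# Hypothetical sketch of the ALICE pull model: real jobs wait in a central
# task queue, while generic job agents started on the sites pull the work.

class CentralTaskQueue:
    """Holds real jobs, ordered by priority (higher priority first)."""
    def __init__(self):
        self.jobs = []

    def add(self, job_id, priority):
        self.jobs.append((priority, job_id))
        self.jobs.sort(reverse=True)           # simple stand-in for quota/priority handling

    def pull(self):
        """Called by a job agent running on a worker node."""
        if not self.jobs:
            return None                        # no work left for this agent
        _, job_id = self.jobs.pop(0)
        return job_id

def job_agent(queue, results):
    """A job agent provides a standard environment, then pulls real jobs."""
    while True:
        job = queue.pull()
        if job is None:
            break                              # queue empty: agent terminates
        results.append(job)                    # stand-in for "run the job"

queue = CentralTaskQueue()
for jid, prio in [("sim-001", 1), ("ana-042", 5), ("sim-002", 1)]:
    queue.add(jid, prio)

executed = []
job_agent(queue, executed)
print(executed)   # highest-priority job is pulled first
```

The point of the design is visible even in the sketch: the site never sees the real workload in advance; it only runs generic agents, and priorities are enforced centrally.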

ALICE Workload Management System
(Diagram: standard ALICE job submission model)

ALICE and the gLite-VOBOX
WLCG and the gLite-VOBOX:
- The VOBOX is a WLCG service developed in 2006 to provide the experiments with a place to run their own services
- In addition, it provides file system access to the experiment software area
ALICE and the gLite-VOBOX:
- VO-boxes are deployed at all T0, T1 and T2 sites providing resources for ALICE
  - A mandatory requirement to enter production, required in addition to all standard LCG services
  - The entry door to the LCG environment
- Runs standard LCG components and ALICE-specific services
- Uniform deployment: same behaviour for T1 and T2 in terms of production; differences between T1 and T2 are a matter of QoS only
- Site-related problems are expected to be handled by the site administrators

ALICE and the gLite-WMS
- ALICE deprecated the LCG-RB in November 2008
- ALL ALICE SITES ARE CURRENTLY USING THE WMS FOR JOB-AGENT SUBMISSION (since last year already)
- Up to 3 WMS are defined for each site
  - Configured via the AliEn LDAP
  - Based on a regional distribution
- In addition, the WMS at CERN are used as backup at most of the T2 sites
- The local WMS configuration (available on each local VOBOX) is NOT used
  - ALICE creates the WMS configuration files on the fly, based on the LDAP list of local WMS
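The on-the-fly configuration step can be sketched like this. The LDAP lookup is replaced by a plain list, and the host names, configuration key and layout are assumptions modelled loosely on the gLite WMProxy endpoint format, not the actual AliEn output.

```python
# Sketch: build a WMS configuration for a site's VOBOX from the list of
# regional WMS kept in the AliEn LDAP (here a hard-coded list).
# All host names are hypothetical; the key name and port are assumptions.

def wms_config(site, regional_wms, cern_backup=("wms-backup.cern.ch",)):
    """Select up to 3 regional WMS, then append the CERN backup WMS,
    and render a simple endpoint list for the VOBOX."""
    endpoints = list(regional_wms[:3]) + list(cern_backup)
    lines = [f"# WMS configuration for {site}, generated on the fly"]
    for host in endpoints:
        lines.append(f"WMProxyEndpoint = https://{host}:7443/glite_wms_wmproxy_server")
    return "\n".join(lines)

cfg = wms_config("GridKa", ["wms-1.gridka.de", "wms-2.gridka.de"])
print(cfg)
```

Regenerating the file from LDAP on every cycle means a WMS can be added or drained centrally, without touching the configuration shipped on each VOBOX.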

ALICE Experiences with the gLite-WMS
- ALICE began to suffer significant instabilities with the gLite-WMS during the 2008/2009 Christmas period, when ALICE started to use the system massively
- Most of the issues were observed with the WMS at CERN, the most heavily used WMS of the whole production
- At that moment there were 3 ALICE-dedicated WMS nodes at CERN:
  - 2 nodes configured with gLite 3.0 on old hardware
  - 1 node already configured with gLite 3.1 on an 8-core machine, used for CERN submissions and as failover for almost all ALICE sites
- The 8-core node began to show too large a backlog from the 21st of December
- In addition, the gLite 3.0 nodes were also unstable; they were replaced by two new nodes just after Christmas

ALICE Experiences with the gLite-WMS
Several possible problem sources were discussed during the GDB meetings of January, February and March 2009:
- Destination queue not available due to a configuration problem at the destination site; by WMS design, submitted jobs can be kept in the queues for several hours
- ALICE JDL construction: a complicated JDL slows down the matchmaking, and the workload manager cannot keep up with all the requests sent by the WMProxy service (DISCARDED)
- BDII overloaded
- Network problems
- MyProxy server overloaded

ALICE Experiences with the gLite-WMS
- Whatever was creating the instability, the ALICE perception was jobs stuck in status WAITING or READY forever
  - "Suicide mode": new requests kept arriving and being accepted, thus worsening the situation
- The above hypotheses were considered ingredients of a possible high load, but not the unique reason
- We carefully followed the evolution of the 3 WMS nodes (gLite 3.1, 8-core) and reported further issues to the developers
- In March 2009 we still could not determine the origin of the problem

Some of the hypotheses in detail
Overload of the MyProxy server:
- The ALICE submission procedure was changed in January to avoid this
- Proxy delegation is requested once per hour: "frugal" usage of the MyProxy server
- Conclusion: the WMS overload is uncorrelated with the use of MyProxy
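The once-per-hour "frugal" delegation can be sketched as a simple throttle. The delegation call itself is a stub and all names are hypothetical; only the rate-limiting logic is the point.

```python
import time

DELEGATION_INTERVAL = 3600   # re-delegate the proxy at most once per hour

class ProxyDelegator:
    """Throttle proxy delegation so the MyProxy server is contacted at most
    once per DELEGATION_INTERVAL seconds, however often jobs are submitted."""
    def __init__(self, clock=time.time):
        self.clock = clock
        self.last_delegation = None
        self.calls = 0                       # counts real delegation requests

    def _delegate(self):
        self.calls += 1                      # stand-in for the real MyProxy call

    def ensure_delegation(self):
        now = self.clock()
        if self.last_delegation is None or now - self.last_delegation >= DELEGATION_INTERVAL:
            self._delegate()
            self.last_delegation = now

# Simulated clock: six submissions within ~66 minutes trigger only two delegations.
fake_now = [0]
d = ProxyDelegator(clock=lambda: fake_now[0])
for t in (0, 600, 1800, 3599, 3600, 4000):
    fake_now[0] = t
    d.ensure_delegation()
print(d.calls)   # 2: once at t=0, once at t=3600
```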

Some of the hypotheses in detail
The destination queue is not available:
- If the queue(s) declared in the JDL are not available, the request stays in a standby status for several hours (configurable at each WMS)
- These requests cannot be tracked: jobs stay in status WAITING forever
- Submission continues and continues until the WMS is overloaded
- ALICE has implemented a 15-minute expiry time at the JDL level at all sites
- This possible problem is no longer an issue since March 2009
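The 15-minute expiry can be sketched as below. To our understanding, the gLite JDL `ExpiryTime` attribute is an absolute Unix timestamp after which the WMS drops the request; treat the exact attribute semantics and the rest of this JDL as an illustration, not the production AliEn JDL.

```python
import time

def agent_jdl(executable="agent.sh", expiry_seconds=900):
    """Build a minimal job-agent JDL whose request expires after
    expiry_seconds (15 minutes by default), so an unavailable destination
    queue cannot leave the request in WAITING forever inside the WMS."""
    expiry = int(time.time()) + expiry_seconds   # absolute Unix timestamp
    return "\n".join([
        "[",
        f'  Executable = "{executable}";',
        f"  ExpiryTime = {expiry};   // request dropped after 15 minutes",
        "]",
    ])

jdl = agent_jdl()
print(jdl)
```

Because the job agents are generic, an expired request costs nothing: the central queue still holds the real job, and another agent will pick it up.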

Current status of the gLite-WMS
- In collaboration with IT-FIO and IT-GS, up to 4 WMS have been set up at CERN for exclusive ALICE use since March 2009
- France has deployed gLite 3.2 on all WMS used by ALICE
- The CNAF T1 has also upgraded the version of the WMS used by ALICE
- A sufficient number of WMS deployed for ALICE, together with proper monitoring of the service, can ensure the good behaviour of the ALICE WMS in production
- Even today we are seeing instabilities in the WMS at CERN

gLite-WMS and ALICE: final remarks
Some legends to forget:
- "This issue is affecting ALICE only"... FALSE
- "This issue is affecting ALICE because of their submission model"... FALSE
- "ALICE should change to bulk submission"... THIS IS NOT THE QUESTION
IMHO:
- Any service should cope with the computing models of any LHC experiment
- If a new service requires deep changes in a computing model, then the service should be redefined
- Any new service should ensure AT LEAST the same functionality covered by the previous service
- Any gap in the service features is not the experiment's fault

ALICE and the CREAM-CE
- ALICE is interested in the deployment of the CREAM-CE service at all sites which provide support to the experiment
- Focus: direct CREAM-CE submission; WMS submission is not required
- The experiment has been testing the CREAM-CE since the beginning of summer 2008, in a full production environment
- ALICE requires the CREAM-CE service in parallel to the LCG-CE at all sites BEFORE THE REAL DATA TAKING

The WLCG Statement (Nov 2008, WLCG GDB Meeting)
CREAM status at that time:
- WMS submission via ICE to CREAM was still not available
- The proxy renewal mechanism of the system was not optimized (SOLVED BY NOW)
- In addition, Condor-G submission to CREAM was not enabled (NOW AVAILABLE AND BEING TESTED BY CMS)
At that point, these factors were preventing the setup of the system in production (and in testing) for ATLAS and CMS, but not for ALICE and LHCb:
- The highest priorities for LHCb were (and are) glexec and SLC5
- The highest priority for ALICE was (and is) CREAM
One experiment was ready to stress the system:
- A good opportunity for the experiment, the developers and the site admins to gain experience with the new system
- WLCG encouraged sites to provide a CREAM-CE system in parallel to the current LCG-CE

ALICE requirements
- Setup of a 2nd VOBOX at each site providing a CREAM-CE
  - Each ALICE VOBOX submits to a specific backend
  - One VOBOX: LCG-CE OR CREAM-CE submission (replacement approach)
  - Two VOBOXes: LCG-CE AND CREAM-CE submission (parallel approach)
  - The two-VOBOX approach is required at sites providing an important storage environment for ALICE
- Setup of the ALICE production queue behind the CREAM-CE
  - This procedure puts the CREAM-CE directly into production
- Setup of a GridFTP server
  - Required to retrieve the job (agent) outputs
  - No specific wish for the placement of this service, although we suggest the 2nd VOBOX as a good location

1st phase of the service tests by ALICE
- Test CREAM-CE instance provided by GridKa
  - Tests operated through a second VOBOX, in parallel to the already existing service at the T1
- Access to the CREAM-CE
  - Initially 30 CPUs (in PPS) available for the testing
  - Moved within a few weeks to the ALICE production queue (production setup)
- Intensive functionality and stability tests from July to September 2008
  - Production stopped at the end of September 2008 for CREAM-CE upgrades at the site
  - The deployment of the new version at GridKa was confirmed last week (tested and working perfectly)
- Excellent support from the CREAM-CE developers and the site admins
  - Thanks to Massimo Sgaravatto (INFN-Padova) and Angela Poschlad (GridKa) for the continuous support

2nd phase of the service tests by ALICE
- After a debug phase of the CREAM module in January 2009, the new module entered production on the 19th of February (start of the 2nd testing phase)
- Stability and performance are currently the most important test issues at the sites providing a CREAM-CE
- The deployment of a 2nd VOBOX ensures that production continues in parallel through the WMS
  - A single VOBOX would require full, dedicated babysitting of the system (not realistic)
- Feedback on all issues is provided directly to the CREAM developers

CREAM-CE current status
- The CREAM-CE has demonstrated that it copes perfectly with the stability and performance requirements of ALICE
  - It is strongly required at all sites before the real data taking
- Very good feedback also from the site admins
- Although the LCG-CE is still the WLCG CE, the CREAM-CE is a production service
- Sites providing a CREAM-CE for the ALICE production: CERN, RAL, IN2P3-Subatech, GSI, GridKa, INFN-CNAF, INFN-Torino, INFN-Legnaro, SARA, Prague, IHEP, Kolkata, KISTI

Summary and Conclusions
- The ALICE workload management system has achieved a good level of maturity and is ready for the real data taking
  - The developers and the support teams are always ready to implement any new feature in the ALICE environment which can be useful for the whole system
- It is based on 3 major WLCG services:
  - gLite-VOBOX: a very stable service across all ALICE sites
  - gLite 3.1 WMS: a lot of headaches until we managed to ensure a quasi-stable system; we still need to monitor it continuously, and this babysitting regime cannot be sustained during the real data taking
  - CREAM-CE: the most adequate service for the ALICE data taking; it is strongly desired in order to fully deprecate the current gLite-WMS