The usage of the gLite Workload Management System by the LHC experiments
EGEE-II INFSO-RI-031688 · Enabling Grids for E-sciencE · EGEE and gLite are registered trademarks


Presentation transcript:

The usage of the gLite Workload Management System by the LHC experiments
Simone Campana, CERN
Enzo Miccio, CERN-INFN
Andrea Sciabà, CERN
EGEE User Forum 2007, Manchester (UK)

Outline
• Monte Carlo production and data analysis in ATLAS and CMS
  – The ATLAS production system
  – The CMS Remote Analysis Builder
• The gLite Workload Management System
  – Architecture
  – Main functionalities
  – Requirements for 2007 and 2008
  – WLCG acceptance criteria
• Testing the gLite WMS
  – Results of the acceptance tests
  – Single job submission tests
• Usage of the gLite WMS in the experiments
  – ATLAS
  – CMS
  – ALICE and LHCb
• Conclusions

Monte Carlo production
• The LHC experiments need to generate huge amounts of simulated data to validate the reconstruction software, test the computing model and develop physics data analysis
  – ~50 million events/month in 2007
  – ~100 million events/month in 2008
  – MC production done at Tier-2 sites → distributed activity
• Specific tools have been developed by each experiment to manage the production workflow
  – ATLAS production system
  – CMS Production Agent
  – These tools need to be interfaced to one or more Grid workload management systems
    → To use different Grids
    → To use the same Grid in different ways

Example: the ATLAS Production System
• A central database of jobs to be run
• A “supervisor” for each Grid that takes jobs from the central database, submits them to the Grid, monitors them and checks their outcome
• An “executor” acting as interface to the Grid middleware
  – EGEE/WLCG
    → Lexor using the gLite WMS
    → Condor-G direct submission
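The supervisor/executor split described on this slide can be illustrated with a small sketch. This is not the ATLAS production system code: the class names, the job states and the toy success rule are all assumptions made for illustration, and the "central database" is just a Python list.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    status: str = "defined"  # hypothetical states: defined, submitted, done, failed


class Executor:
    """Hypothetical interface to one Grid middleware flavour."""

    def __init__(self, name):
        self.name = name

    def submit(self, job):
        job.status = "submitted"

    def poll(self, job):
        # Toy outcome rule (assumption): even ids succeed, odd ids fail.
        job.status = "done" if job.job_id % 2 == 0 else "failed"


class Supervisor:
    """Takes jobs from the central database, submits them, checks the outcome."""

    def __init__(self, database, executor):
        self.database = database
        self.executor = executor

    def run_once(self):
        taken = [job for job in self.database if job.status == "defined"]
        for job in taken:
            self.executor.submit(job)
        for job in taken:
            self.executor.poll(job)
        return {job.job_id: job.status for job in taken}


database = [Job(i) for i in range(4)]
outcome = Supervisor(database, Executor("Lexor")).run_once()
print(outcome)
```

Swapping the executor object is the only change needed to target another submission path, which is the point of keeping the supervisor independent of the middleware.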

Data analysis
• Data analysis is done either by individuals or as an organized activity (event reconstruction, data reprocessing)
  – Reconstruction and reprocessing done at Tier-1
  – End-user analysis done at Tier-2
  – Datasets can be distributed, or replicated at several sites
• Tools exist specifically to submit and manage analysis jobs
  – To shield the user from the differences between Grids or job management systems
  – To integrate the analysis job workflow with the experiment data management system
  – To implement higher level job management functionalities

Example: data analysis with CRAB
• The user develops and compiles their code on the UI
  – Common libraries are pre-installed at all CMS sites
• Given the datasets to analyze, CRAB splits the task into many jobs and submits them near the data
• Jobs are submitted via
  – LCG RB or gLite WMS (EGEE, Open Science Grid)
  – Condor-G (OSG only)
• The user retrieves the output once the task has finished
• Used for two years by physicists and in data challenges
• It is being complemented by an analysis server to automate the management of analysis tasks

The gLite WMS architecture
• The service to submit and manage jobs
  – Task queue: holds jobs not yet dispatched
  – Information SuperMarket: caches all information about Grid resources
  – Match Maker: selects the best resource for each job
  – Job Submission & Monitoring
  – Interacts with Data Management, Logging & Bookkeeping, etc.
• The WMProxy service optimizes job management and stands between the user and the real WMS
  – Service Oriented Architecture (SOA) compliant
    → Implemented as a SOAP Web service
  – Validates, converts and prepares jobs and sends them to the WM
  – Interacts with the L&B via LBProxy (a state storage of active jobs)
  – Implements most new features
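To make the roles of the task queue, the Information SuperMarket and the Match Maker concrete, here is a toy sketch. It is not the gLite implementation: the resource fields (`free_slots`, `vos`) and the ranking rule (most free slots wins) are assumptions chosen only to show how the three components interact.

```python
from collections import deque


def match(job, supermarket):
    """Return the name of the best cached resource for the job, or None."""
    candidates = [
        res for res in supermarket
        if res["free_slots"] > 0 and job["vo"] in res["vos"]
    ]
    if not candidates:
        return None
    # Assumed ranking: prefer the resource with the most free slots.
    return max(candidates, key=lambda res: res["free_slots"])["name"]


def dispatch(task_queue, supermarket):
    """Dispatch every matching job; unmatched jobs stay in the task queue."""
    dispatched = {}
    unmatched = deque()
    while task_queue:
        job = task_queue.popleft()
        name = match(job, supermarket)
        if name is None:
            unmatched.append(job)  # kept until a later matchmaking pass
            continue
        dispatched[job["id"]] = name
        for res in supermarket:
            if res["name"] == name:
                res["free_slots"] -= 1
    task_queue.extend(unmatched)
    return dispatched


supermarket = [
    {"name": "ce-a", "free_slots": 1, "vos": {"cms"}},
    {"name": "ce-b", "free_slots": 2, "vos": {"cms", "atlas"}},
]
queue = deque([
    {"id": 1, "vo": "cms"},
    {"id": 2, "vo": "atlas"},
    {"id": 3, "vo": "lhcb"},
])
print(dispatch(queue, supermarket), len(queue))
```

The matchmaker reads only the cached `supermarket` list, never the sites themselves, which mirrors the caching role the slide gives to the Information SuperMarket.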

Main functionalities
• The gLite WMS offers several advantages over the old LCG WMS
  – Bulk submission
    → Collections: sets of independent jobs
    → New, much more reliable implementation as a compound job submission
  – Job sandboxes
    → Shared input sandboxes for a collection
    → Download/upload of sandboxes via GridFTP, http, https
  – Faster match-making
    → "Bulk" matchmaking and ranking for collections
  – Internal task queue
    → If a job cannot match right away it is kept for some time until it matches
  – Resubmission of failed jobs
    → A job is resubmitted right away after a middleware/infrastructure-related failure
    → Greatly improves the job success rates
  – A limiter mechanism which prevents submission of new jobs if the load exceeds a certain threshold
    → Leads to "artificial", but desired, limitations of the job submission rate
    → Improves the stability of the system
  – Last but not least, the gLite WMS is actively developed and maintained, while the LCG RB is "frozen"
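The limiter idea can be sketched in a few lines. This is only an illustration of "refuse new submissions while the load is above a threshold"; the function name and the list of load samples are hypothetical, and the default threshold of 12 is taken from the load value quoted in the test slides later in this talk.

```python
def submit_with_limiter(load_samples, threshold=12.0):
    """Count accepted and refused submissions, one attempt per load sample.

    A submission is refused when the machine load at that moment exceeds
    the threshold; otherwise it is accepted.
    """
    accepted = 0
    refused = 0
    for load in load_samples:
        if load > threshold:
            refused += 1
        else:
            accepted += 1
    return accepted, refused


# Hypothetical load readings taken at five submission attempts.
print(submit_with_limiter([3.0, 11.9, 12.0, 12.5, 20.0]))
```

A client that receives a refusal would normally wait and retry, which is what turns the limiter into a cap on the effective submission rate.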

Requirements for the gLite WMS

                     CMS                                   ATLAS
Performance (2007)   […]K jobs/day                         20K production jobs/day + analysis load
Performance (2008)   […]K jobs/day (120K to EGEE, 80K      100K jobs/day through the WMS,
                     to OSG), using <10 WMS entry points   using <10 WMS entry points
Stability            <1 restart of WMS or LB every month under load (both experiments)

WLCG acceptance criteria
• Based on the experiment requirements, some criteria have been defined to decide if the gLite WMS satisfies the requirements
  – At least […] jobs/day submitted for at least five days
  – No service restart required for any WMS component
  – The WMS performance should not show any degradation during this period
  – The number of jobs "stuck" should be less than 1% of the total
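The four criteria can be read as a single pass/fail check. The sketch below is one possible encoding, not an official WLCG tool: the function and parameter names are invented, and because the required daily job count is not legible in this transcript it is left as the parameter `min_jobs_per_day`.

```python
def passes_acceptance(daily_jobs, min_jobs_per_day, restarts, stuck_jobs,
                      degraded=False):
    """Return True if a test run meets the acceptance criteria on the slide.

    daily_jobs       -- jobs submitted on each day of the run
    min_jobs_per_day -- required daily rate (value missing from the transcript)
    restarts         -- number of WMS component restarts during the run
    stuck_jobs       -- number of jobs that got stuck
    degraded         -- True if performance degraded during the run
    """
    total = sum(daily_jobs)
    if len(daily_jobs) < 5 or total == 0:
        return False
    if any(n < min_jobs_per_day for n in daily_jobs):
        return False
    if restarts > 0 or degraded:
        return False
    return stuck_jobs / total < 0.01


# Hypothetical run: five days at exactly the required rate, 4 stuck jobs.
print(passes_acceptance([100] * 5, 100, restarts=0, stuck_jobs=4))
```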

Testing the gLite WMS
• The testing of the gLite WMS is mainly done by the Experiment Integration and Support team of WLCG
  – Collaboration between EIS, JRA1 (EGEE developers), SA1 (EGEE), SA3 (EGEE)
  – Bugs discovered, fixed and patched bypassing normal certification procedures
  – Huge improvements in stability and performance
• The gLite WMS is not yet really in production, but is an "experimental" service
  – Few instances at CERN, CNAF and Milan used for tests, CMS 2006 data challenge (CSA06) and ATLAS MC production

Test setup
• The latest version of the gLite WMS is installed on dedicated machines at CNAF
  – Dual Opteron 2.4 GHz, 4 GB RAM
• The WMS is stressed by submitting a large number of jobs
  – Collections of a few hundred jobs
  – Single jobs
• The behaviour of the WMS is closely monitored
  – Job status: to check that jobs do not become "stuck", or abort due to the WMS
  – WMS internal status: WMS components running fine, check for bottlenecks, etc.
  – System status: high load, excessive I/O or memory consumption, etc.

Results of the acceptance test
• […] jobs submitted in 7 days
  – ~16000 jobs/day, well exceeding the acceptance criteria
  – The "limiter" prevented submission when the load was very high (>12)
• All but 320 jobs were processed normally
  – ~0.3% of jobs with problems, well below the required threshold
  – Recoverable by the user with a proper command
• No stale jobs
• The WMS dispatched jobs to computing elements with no noticeable delay
• Acceptance tests were passed
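The quoted failure fraction can be cross-checked with simple arithmetic. The total number of submitted jobs is not legible in this transcript, so the sketch below estimates it from the approximate rate on the slide (7 days at about 16000 jobs/day); that estimate is an assumption, not the measured total.

```python
days = 7
jobs_per_day = 16000   # approximate rate quoted on the slide
problem_jobs = 320     # jobs not processed normally

# Assumed total: rate times duration (the exact figure is missing here).
estimated_total = days * jobs_per_day
fraction = problem_jobs / estimated_total

print(f"{estimated_total} jobs, {fraction:.2%} with problems")
# prints "112000 jobs, 0.29% with problems"
```

The result agrees with the ~0.3% on the slide and sits well under the 1% limit set by the acceptance criteria.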

Single job submission
• Submission of single jobs from different parallel processes has also been studied
  – Useful for applications that do not need to submit very large numbers of jobs per user but have many users
• Results
  – The time needed for the job submission becomes a limiting factor
    → Max. submission rate per thread: 7000 jobs/day
  – The limiter refuses about 30% of jobs because the load is always near the threshold (12)
    → Effective rate is ~4500 jobs/day per thread
  – The total submission rate is proportional to the number of threads
[Figure: failures due to limiter; the value 12 is marked on the plot]
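A back-of-the-envelope model relates the numbers on this slide. It assumes the only loss is the fraction of submissions refused by the limiter and that threads do not interfere with each other; the function name is hypothetical and the model is a simplification, not the authors' analysis.

```python
def effective_rate(max_rate_per_thread, refused_fraction, threads=1):
    """Jobs/day that get through when a fixed fraction is refused.

    Assumes the total rate scales linearly with the number of threads,
    as stated on the slide.
    """
    return max_rate_per_thread * (1.0 - refused_fraction) * threads


one_thread = effective_rate(7000, 0.30)
four_threads = effective_rate(7000, 0.30, threads=4)
print(round(one_thread), round(four_threads))
```

With the slide's inputs this simple model gives roughly 4900 jobs/day per thread, the same order as the ~4500 jobs/day reported; the measured figure is somewhat lower than the model predicts.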

Single job submission (cont.)
[Figure: panels for Thread #1 and Thread #2]
• When the load reaches 12 the limiter decreases the submission rate

The WMS in the ATLAS MC production
• Big ramp-up in the last months
• Reached […] jobs/day
• Wallclock time lost in failures is usually low
  – Validation periods or occasional incidents increase it from time to time

ATLAS MC error analysis
• 30% of errors due to WMS + infrastructure (Computing Element and batch farm problems)
• The wasted wallclock time comes from problems in the data management (75% of the total)

The WMS in CMS data analysis
• CMS supports submission of job collections via WMS in CRAB (analysis jobs)
  – Tested during the computing/software/analysis challenge (CSA06) in 2006 for ~1 month
  – The submission rate of "fake" analysis jobs reached about […] jobs/day (2 WMS instances used)
  – Globally the gLite + Condor-G submission systems achieved the goal of […] jobs/day
[Figure: submitted jobs and successful jobs]

ALICE and LHCb experience
• ALICE is using the LCG Resource Broker to submit "pilot" jobs
  – A pilot job "pulls" real MC production jobs from a central queue
  – Pilot jobs are far fewer (by one order of magnitude) than real jobs
    → Much less stringent requirements on the WMS than ATLAS and CMS
  – First tests with the WMS have been very successful
    → Huge reduction of the time needed to dispatch pilot jobs
• LHCb also plans to abandon the LCG RB for the gLite WMS
  – As for ALICE, the WMS is used to send pilot jobs
    → Less stringent requirements
  – The gLite WMS is already integrated in DIRAC, the LHCb submission system
  – Already used in production for a short time; usage should increase significantly very soon
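The pilot-job pattern explains why the WMS sees far fewer submissions than there are real jobs. The sketch below is a toy model, not ALICE or DIRAC code: the function name, the `max_jobs` cap per pilot and the job names are all hypothetical.

```python
from collections import deque


def run_pilot(central_queue, max_jobs):
    """One pilot pulls up to max_jobs real jobs from the central queue."""
    executed = []
    while central_queue and len(executed) < max_jobs:
        executed.append(central_queue.popleft())
    return executed


# 25 real production jobs wait in the experiment's central queue.
central_queue = deque(f"mc-job-{i}" for i in range(25))

# Only 3 pilots are sent through the WMS; each pulls real jobs once it runs.
pilots = [run_pilot(central_queue, max_jobs=10) for _ in range(3)]
print([len(batch) for batch in pilots], len(central_queue))
```

Here the WMS handles 3 submissions for 25 real jobs, which is the order-of-magnitude reduction the slide refers to.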

Conclusions
• Most reliability problems are understood
  – A few minor issues still being investigated
• The WMS is not yet officially in production in EGEE
  – The current version used for the tests is in certification and will be available for deployment in a couple of weeks
• The advantages compared to the LCG Resource Broker are very significant
  – Performance with single job submission still needs to be tuned
• The improvements achieved during these months have already had a big impact on the amount of effort required to run the gLite WMS in production activities
  – e.g. for the ATLAS Monte Carlo production
• All the LHC experiments are ready to use it
  – Either they are already using it, or they have finished the testing phase

Acknowledgements
• Thanks to Julia Andreeva, Gianluca Castellani, Gerhild Maier, Patricia Méndez and Roberto Santinelli for their contributions to this presentation
• Thanks to JRA1, SA1 and SA3 for their continuing support