1 The Capone Workflow Manager
M. Mambelli, University of Chicago
R. Gardner, University of Chicago
J. Gieraltowsky, Argonne National Laboratory
14th February 2006, CHEP06, Mumbai, India
2 Capone: Workflow manager for Grid3 and OSG
- Designed for ATLAS (managed and user) production
- Used for testbed (Grid3 and OSG sites) troubleshooting and testing
- Used as a platform to test/integrate experimental technologies (PXE) in the Grid environment
- Uses the GriPhyN VDT (Globus, Condor, VDS) as grid middleware
- Easy to install and support (released as a Pacman package)
- No longer used for official ATLAS production
3 Past ATLAS Production with Capone
- DC2 Phase I: simulation (Jul-Sep 04)
  - generation, simulation & pile-up
  - produced datasets stored on Tier1 centers, then CERN (Tier0)
  - scale: ~10M events, 30 TB
- DC2 Phase II: "Tier0" (1/10 scale)
  - produce ESD, AOD (reconstruction)
  - stream to Tier1 centers
- DC2 Phase III: distributed analysis (Oct-Dec 04)
  - access to event and non-event data from anywhere in the world, both in organized and chaotic ways
- Rome production (Jan-May 2005)
  - full-chain Monte Carlo production
  - user production
- User production and testing
4 ATLAS Global Architecture
[Architecture diagram: the production supervisor (Windmill or Eowyn) pulls jobs from prodDB at CERN and talks over Jabber/pysoap and Jabber/py to the grid executors: Lexor and Lexor-CG (LCG exe), Dulcinea (NorduGrid exe), Capone (Grid3 exe, the subject of this talk) and a legacy LSF executor. Data management is handled by Don Quijote ("DQ") with RLS catalogs; AMI holds the metadata.]
5 Capone and Grid Requirements
- Interface to Grid3/OSG (GriPhyN VDT based)
- Manage all steps in the job life cycle: prepare, submit, monitor, output & register (sketched below)
- Manage workload and data placement
- Process messages from the Windmill supervisor
- Provide useful logging information to the user
- Communicate executor and job state information to Windmill (and hence to ProdDB)
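The life-cycle bullet above can be pictured as a small per-job state machine on the submit host. The sketch below is illustrative only: the class, step and method names are assumptions rather than Capone's actual API, and it mainly shows how each step is logged so the user gets useful diagnostics.

    import logging

    # Illustrative life-cycle steps; names are assumptions, not Capone's API.
    STEPS = ("prepare", "submit", "monitor", "stage_out", "register")

    class JobLifeCycle:
        """Tracks one production job through the steps listed above."""

        def __init__(self, job_id):
            self.job_id = job_id
            self.completed = []

        def run_step(self, step, action):
            # Log every transition so the submit host keeps useful
            # troubleshooting information for the user.
            logging.info("job %s: starting %s", self.job_id, step)
            try:
                action()
            except Exception as err:
                logging.error("job %s: %s failed: %s", self.job_id, step, err)
                return False
            self.completed.append(step)
            return True

A driver would call run_step once per entry in STEPS, in order, and report the resulting job state back to Windmill.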
6 Capone Architecture
- Message interface: Web Service, Jabber
- Translation layer: Windmill schema
- CPE (Process Engine) processes:
  - Grid3/OSG: GCE interface
  - Stub: local shell testing
  - DonQuijote (future)
- Server side: GCE Server
  - ATLAS releases and TRFs
  - execution sandbox (kickstart)
[Layer diagram: Windmill and the user reach Capone through the message protocols (Web Service, Jabber); the translation layer feeds the CPE, whose processes drive the Stub, the Grid (GCE Server) and Don Quijote.]
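A minimal sketch of this layering, under the assumption of hypothetical class and verb names: a supervisor message arrives over the Web Service or Jabber interface, the translation layer maps the Windmill schema onto an internal request, and the CPE dispatches it to a pluggable backend (the Grid3/OSG GCE interface in production, or the local-shell stub for testing).

    # All names below are illustrative, not Capone's real classes or the
    # real Windmill message schema.

    class StubBackend:
        """Local shell backend used for testing instead of the grid."""
        def submit(self, job):
            print("would run %s with a local shell" % job)

    class GridBackend:
        """Grid3/OSG backend through the GCE interface (placeholder)."""
        def submit(self, job):
            raise NotImplementedError("plan and submit via Condor-G here")

    def translate(windmill_message):
        """Translation layer: Windmill-schema message -> (verb, payload)."""
        return windmill_message["verb"], windmill_message.get("jobs", [])

    class ProcessEngine:
        """CPE: routes translated supervisor requests to the chosen backend."""
        def __init__(self, backend):
            self.backend = backend

        def handle(self, verb, payload):
            if verb == "executeJobs":
                for job in payload:
                    self.backend.submit(job)
            # status queries, job kills, etc. would be dispatched here too

    # usage: ProcessEngine(StubBackend()).handle(*translate(message))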
7 Capone Grid Interactions
[Interaction diagram: Windmill hands work from ProdDB to Capone; Capone plans jobs with Pegasus/Chimera against the VDC, submits them through the Condor-G schedd and GridManager to the site CE gatekeeper, with worker nodes (WN), gsiftp and the SE handling execution and data; outputs are registered in RLS and picked up by DonQuijote; monitoring information comes from MDS, GridCat and MonALISA.]
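For illustration only, a sketch of the kind of Condor-G submit description that sits behind the Capone -> Condor-G schedd -> GridManager -> gatekeeper path in this picture. The gatekeeper host, wrapper script and jobmanager are placeholders, and in Capone such descriptions come out of Pegasus/Chimera planning rather than being written by hand like this.

    # Hypothetical helper: write a Condor-G submit description for one job.
    # Host names, the wrapper script and the jobmanager are placeholders.

    CONDOR_G_TEMPLATE = """\
    universe        = grid
    grid_resource   = gt2 {gatekeeper}/jobmanager-condor
    executable      = {wrapper}
    arguments       = {args}
    output          = {job_id}.out
    error           = {job_id}.err
    log             = {job_id}.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue
    """

    def write_submit_file(job_id, gatekeeper, wrapper, args):
        path = "%s.sub" % job_id
        with open(path, "w") as f:
            f.write(CONDOR_G_TEMPLATE.format(job_id=job_id, gatekeeper=gatekeeper,
                                             wrapper=wrapper, args=args))
        return path  # hand this file to condor_submit on the submit host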
8 Performance Summary of DC2 (Dec 04)
- Several physics and calibration samples produced
- 91K job attempts at Windmill level
  - 9K of these aborted before grid submission: mostly RLS down or the selected CE down
- "Full" success rate: 63%
- Average success rate after submission: 70%
  - includes subsequent problems at the submit host
  - includes errors from development

  Job status    Capone total
  failed        34065
  finished      91634
9 Performance Summary of Rome (5/2005)
- Several physics and calibration samples produced
- 253K job attempts at Windmill level
- "Full" success rate: 73%
  - includes subsequent problems at the submit host
  - includes errors from development
- Scalability is a problem for short jobs
  - submission rate
  - handling of many small jobs
- Data movement is also problematic
10 Capone Failure Statistics
  Submission                           2.4%
  Execution                            2.0%
  Post-job check                       5.9%
  Stage out                           41.6%
  RLS registration                     5.1%
  Capone host interruptions           14.1%
  Capone succeeded, Windmill failed    0.3%
  Other                               26.6%
11 Production lessons
- Single points of failure
  - Prodsys or grid components
  - system expertise (people)
- Fragmented production software
- Client (Capone submit) hosts
  - load and memory requirements for job management
  - load caused by job state checking (interaction with Condor-G)
  - many processes (VDT DAGMan processes)
- No client host persistency
  - need a local database for job recovery (sketched below)
- Not enough tools for testing
- Certificate problems (expiration, CRL expiration)
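The "local database for job recovery" lesson could be addressed along the lines of the sketch below: persist every job state change on the submit host (SQLite is used here purely as an example) so that a restarted Capone can re-attach to jobs that were in flight. Schema, file and function names are illustrative.

    import sqlite3

    # Illustrative job store for submit-host recovery; not Capone's real schema.

    def open_job_store(path="capone_jobs.db"):
        conn = sqlite3.connect(path)
        conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
                            job_id    TEXT PRIMARY KEY,
                            state     TEXT NOT NULL,
                            condor_id TEXT)""")
        return conn

    def checkpoint(conn, job_id, state, condor_id=None):
        # Record every state change so nothing is lost if the host dies.
        conn.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
                     (job_id, state, condor_id))
        conn.commit()

    def recover(conn):
        """Return the unfinished jobs to re-attach to after a restart."""
        rows = conn.execute("SELECT job_id, state, condor_id FROM jobs "
                            "WHERE state NOT IN ('finished', 'failed')")
        return list(rows)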
12 Improvements
- DAG batching in Condor-G
  - scales better by reducing the load on the submit host
- Multiple-stage, (persistent) servers
  - multithreaded, to overcome the Python thread limitation
  - maintain server redundancy
- Recoverability
  - checkpointing, to recover from Capone or submit-host failures
  - rollback
- Recovery procedures
  - workarounds (retries, …) for Grid problems (see the sketch below)
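A sketch of the retry-style workaround mentioned in the last bullet, for instance wrapped around stage-out or RLS registration, the dominant entries in the failure statistics two slides back. The retry count and delay are placeholders.

    import time

    def with_retries(operation, attempts=3, delay=60):
        """Run a flaky grid operation, retrying before giving up."""
        for attempt in range(1, attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == attempts:
                    raise  # out of retries: let the recovery procedures roll back
                time.sleep(delay * attempt)  # simple, growing back-off

    # usage: with_retries(lambda: register_in_rls(lfn, pfn))  # hypothetical call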
13 Performance and Scalability Tests
- Submit host: dual-CPU 1.3 GHz Xeon, 1 GB RAM
- Job mix
  - event generation
  - generic CPU usage (900 sec, 30 min)
  - file I/O
- Testbed: 9 OSG sites (UTA_dpcc, UC_Teraport_OSG_ITB, BU_ATLAS_Tier2, BNL_ATLAS, UC_ATLAS_Tier2, PSU_Grid3, IU_ATLAS_Tier2, OUHEP, SMU_Physics_Cluster)
- Tests: multiple tests, repetition and sustained rate
  - job submission
  - job recovery (system crash, DNS problem)
  - sustained submission, overload (driver sketched below)
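The sustained-submission test can be pictured as a driver like the one below, which keeps feeding the job mix to Capone at a fixed rate. The rate, duration and the submit callable are placeholders, not the parameters actually used in these tests.

    import itertools
    import time

    # Placeholder job mix matching the categories above.
    JOB_MIX = ("event_generation", "cpu_900s", "cpu_30min", "file_io")

    def sustained_submission(submit, jobs_per_minute=10, duration_s=3600):
        """Submit jobs from the mix at a steady rate for duration_s seconds."""
        interval = 60.0 / jobs_per_minute
        deadline = time.time() + duration_s
        submitted = 0
        for job_type in itertools.cycle(JOB_MIX):
            if time.time() >= deadline:
                break
            submit(job_type, submitted)  # hand one job to Capone
            submitted += 1
            time.sleep(interval)         # hold the sustained rate
        return submitted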
14 Test results
- Results (avg/min/max)
  - submission rate to Capone (jobs/min): 541/281/1132
  - submission rate to the Grid (jobs/min): 18/15/48
- Number of jobs handled by Capone, as visible by issuing ./capone status summary
  - running jobs: 4019/0/6746
  - total jobs: 7176/0/8100
- Number of jobs as visible in Condor-G is lower
  - it covers only part of the execution: jobs remotely running or queued
[Table: Condor-G job counts (running, pending, unsubmitted, total), avg/min/max.]
15 Development and support practices
- 2-developer team
- Pacman packaging and easy updates (1-line installation or update)
- 2 releases/branches starting with Capone 1.0.x/1.1.x
  - a stable one for production (only bug fixes)
  - a development one (new features)
- iGOC
  - redirection of Capone problems
  - collaboration in site troubleshooting; problems resolved at weekly iVDGL operations meetings
- Use of community tools
  - Savannah portal (CVS, bugzilla, file repository)
  - TWiki (documentation)
  - mailing lists and IM for communications and troubleshooting
16 Conclusions
- More flexible execution model
  - possibility to execute TRFs using shared or local disk areas
  - no need for preinstalled transformations (possibility to stage them in with the job)
- Improved performance
  - job checkpointing and recoverability from submit-host failures
  - maximum number of jobs no longer limited by the maximum number of Python threads
  - recovery actions for some Grid errors
  - higher submission rate for clients (the submission rate to the Grid could have been higher still, but there were always queued jobs)
- Feasibility of development and support with a small team
  - production and development versions
  - extended documentation
  - production and user support, and troubleshooting
17 Acknowledgements
- Windmill team (Kaushik De)
- Don Quijote team (Miguel Branco)
- ATLAS production group; Luc Goossens, CERN IT (prodDB)
- ATLAS software distribution team (Alessandro de Salvo, Fred Luehring)
- US ATLAS testbed sites and Grid3/OSG site administrators
- iGOC operations group
- ATLAS Database group (ProdDB Capone-view displays)
- Physics Validation group: UC Berkeley, Brookhaven Lab
More info: TWiki, Savannah portal