1 The Capone Workflow Manager
M. Mambelli, University of Chicago
R. Gardner, University of Chicago
J. Gieraltowsky, Argonne National Laboratory
14th February 2006, CHEP06, Mumbai, India
2 Capone: Workflow manager for Grid3 and OSG
- Designed for ATLAS (managed and user) production
- Used for testbed (Grid3 and OSG sites) troubleshooting and testing
- Used as a platform to test/integrate experimental technologies (PXE) in the Grid environment
- Uses the GriPhyN VDT (Globus, Condor, VDS) as grid middleware
- Easy to install and support (released as a Pacman package)
- No longer used for official ATLAS production
3 Past ATLAS Production with Capone
- DC2 Phase I: simulation (Jul-Sep 04)
  - generation, simulation & pile-up
  - produced datasets stored on Tier1 centers, then CERN (Tier0)
  - scale: ~10M events, 30 TB
- DC2 Phase II: "Tier0" (1/10 scale)
  - produce ESD, AOD (reconstruction)
  - stream to Tier1 centers
- DC2 Phase III: distributed analysis (Oct-Dec 04)
  - access to event and non-event data from anywhere in the world, both in organized and chaotic ways
- Rome production (Jan-May 2005)
  - full-chain Monte Carlo production
  - user production
- User production and testing
4 ATLAS Global Architecture
[Architecture diagram: the production supervisor (Windmill or Eowyn) pulls jobs from prodDB at CERN and talks over Jabber/pysoap and Jabber/py to the grid executors: Lexor and Lexor-CG (LCG exe), Dulcinea (NorduGrid exe), Capone (Grid3 exe, the subject of this talk) and a legacy LSF executor. Data management is handled by Don Quijote ("DQ") with RLS catalogs; AMI holds the metadata.]
5 Capone and Grid Requirements
- Interface to Grid3/OSG (GriPhyN VDT based)
- Manage all steps in the job life cycle: prepare, submit, monitor, output & register (sketched below)
- Manage workload and data placement
- Process messages from the Windmill supervisor
- Provide useful logging information to the user
- Communicate executor and job state information to Windmill (and hence to ProdDB)
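The life-cycle bullet above can be pictured as a small per-job state machine on the submit host. The sketch below is illustrative only: the class, step and method names are assumptions rather than Capone's actual API, and it mainly shows how each step is logged so the user gets useful diagnostics.

    import logging

    # Illustrative life-cycle steps; names are assumptions, not Capone's API.
    STEPS = ("prepare", "submit", "monitor", "stage_out", "register")

    class JobLifeCycle:
        """Tracks one production job through the steps listed above."""

        def __init__(self, job_id):
            self.job_id = job_id
            self.completed = []

        def run_step(self, step, action):
            # Log every transition so the submit host keeps useful
            # troubleshooting information for the user.
            logging.info("job %s: starting %s", self.job_id, step)
            try:
                action()
            except Exception as err:
                logging.error("job %s: %s failed: %s", self.job_id, step, err)
                return False
            self.completed.append(step)
            return True

A driver would call run_step once per entry in STEPS, in order, and report the resulting job state back to Windmill.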
6 Capone Architecture
- Message interface: Web Service, Jabber
- Translation layer: Windmill schema
- CPE (Process Engine) processes:
  - Grid3/OSG: GCE interface
  - Stub: local shell testing
  - DonQuijote (future)
- Server side: GCE Server
  - ATLAS releases and TRFs
  - execution sandbox (kickstart)
[Layer diagram: Windmill and the user reach Capone through the message protocols (Web Service, Jabber); the translation layer feeds the CPE, whose processes drive the Stub, the Grid (GCE Server) and Don Quijote.]
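A minimal sketch of this layering, under the assumption of hypothetical class and verb names: a supervisor message arrives over the Web Service or Jabber interface, the translation layer maps the Windmill schema onto an internal request, and the CPE dispatches it to a pluggable backend (the Grid3/OSG GCE interface in production, or the local-shell stub for testing).

    # All names below are illustrative, not Capone's real classes or the
    # real Windmill message schema.

    class StubBackend:
        """Local shell backend used for testing instead of the grid."""
        def submit(self, job):
            print("would run %s with a local shell" % job)

    class GridBackend:
        """Grid3/OSG backend through the GCE interface (placeholder)."""
        def submit(self, job):
            raise NotImplementedError("plan and submit via Condor-G here")

    def translate(windmill_message):
        """Translation layer: Windmill-schema message -> (verb, payload)."""
        return windmill_message["verb"], windmill_message.get("jobs", [])

    class ProcessEngine:
        """CPE: routes translated supervisor requests to the chosen backend."""
        def __init__(self, backend):
            self.backend = backend

        def handle(self, verb, payload):
            if verb == "executeJobs":
                for job in payload:
                    self.backend.submit(job)
            # status queries, job kills, etc. would be dispatched here too

    # usage: ProcessEngine(StubBackend()).handle(*translate(message))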
7 Capone Grid Interactions
[Interaction diagram: Windmill hands work from ProdDB to Capone; Capone plans jobs with Pegasus/Chimera against the VDC, submits them through the Condor-G schedd and GridManager to the site CE gatekeeper, with worker nodes (WN), gsiftp and the SE handling execution and data; outputs are registered in RLS and picked up by DonQuijote; monitoring information comes from MDS, GridCat and MonALISA.]
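For illustration only, a sketch of the kind of Condor-G submit description that sits behind the Capone -> Condor-G schedd -> GridManager -> gatekeeper path in this picture. The gatekeeper host, wrapper script and jobmanager are placeholders, and in Capone such descriptions come out of Pegasus/Chimera planning rather than being written by hand like this.

    # Hypothetical helper: write a Condor-G submit description for one job.
    # Host names, the wrapper script and the jobmanager are placeholders.

    CONDOR_G_TEMPLATE = """\
    universe        = grid
    grid_resource   = gt2 {gatekeeper}/jobmanager-condor
    executable      = {wrapper}
    arguments       = {args}
    output          = {job_id}.out
    error           = {job_id}.err
    log             = {job_id}.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue
    """

    def write_submit_file(job_id, gatekeeper, wrapper, args):
        path = "%s.sub" % job_id
        with open(path, "w") as f:
            f.write(CONDOR_G_TEMPLATE.format(job_id=job_id, gatekeeper=gatekeeper,
                                             wrapper=wrapper, args=args))
        return path  # hand this file to condor_submit on the submit host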
8 Performance Summary of DC2 (Dec 04)
- Several physics and calibration samples produced
- 91K job attempts at Windmill level
  - 9K of these aborted before grid submission: mostly RLS down or the selected CE down
- "Full" success rate: 63%
- Average success rate after submission: 70%
  - includes subsequent problems at the submit host
  - includes errors from development

  Job status    Capone total
  failed        34065
  finished      91634
9 Performance Summary of Rome (5/2005)
- Several physics and calibration samples produced
- 253K job attempts at Windmill level
- "Full" success rate: 73%
  - includes subsequent problems at the submit host
  - includes errors from development
- Scalability is a problem for short jobs
  - submission rate
  - handling of many small jobs
- Data movement is also problematic
10 Capone Failure Statistics
  Submission                           2.4%
  Execution                            2.0%
  Post-job check                       5.9%
  Stage out                           41.6%
  RLS registration                     5.1%
  Capone host interruptions           14.1%
  Capone succeeded, Windmill failed    0.3%
  Other                               26.6%
11 Production lessons
- Single points of failure
  - Prodsys or grid components
  - system expertise (people)
- Fragmented production software
- Client (Capone submit) hosts
  - load and memory requirements for job management
  - load caused by job state checking (interaction with Condor-G)
  - many processes (VDT DAGMan processes)
- No client host persistency
  - need a local database for job recovery (sketched below)
- Not enough tools for testing
- Certificate problems (expiration, CRL expiration)
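The "local database for job recovery" lesson could be addressed along the lines of the sketch below: persist every job state change on the submit host (SQLite is used here purely as an example) so that a restarted Capone can re-attach to jobs that were in flight. Schema, file and function names are illustrative.

    import sqlite3

    # Illustrative job store for submit-host recovery; not Capone's real schema.

    def open_job_store(path="capone_jobs.db"):
        conn = sqlite3.connect(path)
        conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
                            job_id    TEXT PRIMARY KEY,
                            state     TEXT NOT NULL,
                            condor_id TEXT)""")
        return conn

    def checkpoint(conn, job_id, state, condor_id=None):
        # Record every state change so nothing is lost if the host dies.
        conn.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
                     (job_id, state, condor_id))
        conn.commit()

    def recover(conn):
        """Return the unfinished jobs to re-attach to after a restart."""
        rows = conn.execute("SELECT job_id, state, condor_id FROM jobs "
                            "WHERE state NOT IN ('finished', 'failed')")
        return list(rows)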
12 Improvements
- DAG batching in Condor-G
  - scales better by reducing the load on the submit host
- Multiple-stage, (persistent) servers
  - multithreaded, to overcome the Python thread limitation
  - maintain server redundancy
- Recoverability
  - checkpointing, to recover from Capone or submit-host failures
  - rollback
- Recovery procedures
  - workarounds (retries, …) for Grid problems (see the sketch below)
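A sketch of the retry-style workaround mentioned in the last bullet, for instance wrapped around stage-out or RLS registration, the dominant entries in the failure statistics two slides back. The retry count and delay are placeholders.

    import time

    def with_retries(operation, attempts=3, delay=60):
        """Run a flaky grid operation, retrying before giving up."""
        for attempt in range(1, attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == attempts:
                    raise  # out of retries: let the recovery procedures roll back
                time.sleep(delay * attempt)  # simple, growing back-off

    # usage: with_retries(lambda: register_in_rls(lfn, pfn))  # hypothetical call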
13 Performance and Scalability Tests
- Submit host: dual-CPU 1.3 GHz Xeon, 1 GB RAM
- Job mix
  - event generation
  - generic CPU usage (900 sec, 30 min)
  - file I/O
- Testbed: 9 OSG sites (UTA_dpcc, UC_Teraport_OSG_ITB, BU_ATLAS_Tier2, BNL_ATLAS, UC_ATLAS_Tier2, PSU_Grid3, IU_ATLAS_Tier2, OUHEP, SMU_Physics_Cluster)
- Tests: multiple tests, repetition and sustained rate
  - job submission
  - job recovery (system crash, DNS problem)
  - sustained submission, overload (driver sketched below)
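The sustained-submission test can be pictured as a driver like the one below, which keeps feeding the job mix to Capone at a fixed rate. The rate, duration and the submit callable are placeholders, not the parameters actually used in these tests.

    import itertools
    import time

    # Placeholder job mix matching the categories above.
    JOB_MIX = ("event_generation", "cpu_900s", "cpu_30min", "file_io")

    def sustained_submission(submit, jobs_per_minute=10, duration_s=3600):
        """Submit jobs from the mix at a steady rate for duration_s seconds."""
        interval = 60.0 / jobs_per_minute
        deadline = time.time() + duration_s
        submitted = 0
        for job_type in itertools.cycle(JOB_MIX):
            if time.time() >= deadline:
                break
            submit(job_type, submitted)  # hand one job to Capone
            submitted += 1
            time.sleep(interval)         # hold the sustained rate
        return submitted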
14 Test results
- Results (avg/min/max)
  - submission rate to Capone (jobs/min): 541/281/1132
  - submission rate to the Grid (jobs/min): 18/15/48
- Number of jobs handled by Capone, as visible by issuing ./capone status summary
  - running jobs: 4019/0/6746
  - total jobs: 7176/0/8100
- Number of jobs as visible in Condor-G is lower
  - it covers only part of the execution: jobs remotely running or queued
[Table: Condor-G job counts (running, pending, unsubmitted, total), avg/min/max.]
15 Development and support practices
- 2-developer team
- Pacman packaging and easy updates (1-line installation or update)
- 2 releases/branches starting with Capone 1.0.x/1.1.x
  - a stable one for production (only bug fixes)
  - a development one (new features)
- iGOC
  - redirection of Capone problems
  - collaboration in site troubleshooting; problems resolved at weekly iVDGL operations meetings
- Use of community tools
  - Savannah portal (CVS, bugzilla, file repository)
  - TWiki (documentation)
  - mailing lists and IM for communications and troubleshooting
16 Conclusions
- More flexible execution model
  - possibility to execute TRFs using shared or local disk areas
  - no need for preinstalled transformations (possibility to stage them in with the job)
- Improved performance
  - job checkpointing and recoverability from submit-host failures
  - maximum number of jobs no longer limited by the maximum number of Python threads
  - recovery actions for some Grid errors
  - higher submission rate for clients (the submission rate to the Grid could have been higher still, but there were always queued jobs)
- Feasibility of development and support with a small team
  - production and development versions
  - extended documentation
  - production and user support, and troubleshooting
17 Acknowledgements
- Windmill team (Kaushik De)
- Don Quijote team (Miguel Branco)
- ATLAS production group; Luc Goossens, CERN IT (prodDB)
- ATLAS software distribution team (Alessandro de Salvo, Fred Luehring)
- US ATLAS testbed sites and Grid3/OSG site administrators
- iGOC operations group
- ATLAS Database group (ProdDB Capone-view displays)
- Physics Validation group: UC Berkeley, Brookhaven Lab
More info: TWiki, Savannah portal