1 ATLAS DC2 Production on Grid3
M. Mambelli (University of Chicago), for the US ATLAS DC2 team
September 28, 2004, CHEP04
2 ATLAS Data Challenges
Purpose
- Validate the LHC computing model
- Develop distributed production & analysis tools
- Provide large datasets for physics working groups
Schedule
- DC1 (2002-2003): full software chain
- DC2 (2004): automatic grid production system
- DC3 (2006): drive final deployments for startup
3 ATLAS DC2 Production
Phase I: Simulation (Jul-Sep 04)
- generation, simulation & pileup
- produced datasets stored on Tier1 centers, then CERN (Tier0)
- scale: ~10M events, 30 TB
Phase II: Tier0 (1/10 scale)
- produce ESD, AOD (reconstruction)
- stream to Tier1 centers
Phase III: Distributed analysis (Oct-Dec 04)
- access to event and non-event data from anywhere in the world
- both in organized and chaotic ways
- cf. D. Adams, #115
4 ATLAS Production System Components
- Production database: ATLAS job definition and status
- Supervisor (all grids): Windmill (L. Goossens, #501), the job distribution and verification system
- Data Management: Don Quijote (M. Branco, #142), provides the ATLAS layer above the Grid replica systems
- Grid executors:
  - LCG: Lexor (D. Rebatto, #364)
  - NorduGrid: Dulcinea (O. Smirnova, #499)
  - Grid3: Capone (this talk)
5 ATLAS Global Architecture
[Diagram: the Windmill supervisor talks to the production database (prodDB at CERN) and, over Jabber/SOAP, to the grid executors: Lexor (LCG), Dulcinea (NorduGrid), Capone (Grid3, this talk) and a legacy LSF executor. Data management is handled by Don Quijote ("DQ") on top of RLS, with AMI holding metadata.]
6 Capone and Grid3 Requirements
- Interface to Grid3 (GriPhyN VDT based)
- Manage all steps in the job life cycle: prepare, submit, monitor, output & register (see the sketch below)
- Manage workload and data placement
- Process messages from the Windmill supervisor
- Provide useful logging information to the user
- Communicate executor and job state information to Windmill (ProdDB)
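As a rough illustration of the job life cycle Capone has to manage, the sketch below models the prepare/submit/monitor/output/register steps as a small state machine. The state names, transition table and `advance` helper are illustrative assumptions, not Capone's actual code.

```python
# Hypothetical sketch of the job life cycle described above; names are
# assumptions for illustration, not Capone's API.
from enum import Enum

class JobState(Enum):
    RECEIVED = "received"      # job definition accepted from Windmill
    PREPARED = "prepared"      # transformation resolved, inputs located
    SUBMITTED = "submitted"    # handed to the grid (Condor-G/DAGMan)
    RUNNING = "running"        # executing on a Grid3 worker node
    OUTPUT = "output"          # outputs staged out and checked
    REGISTERED = "registered"  # outputs registered in the replica catalog
    FAILED = "failed"

# Allowed forward transitions for the prepare/submit/monitor/output/register cycle
TRANSITIONS = {
    JobState.RECEIVED:  {JobState.PREPARED, JobState.FAILED},
    JobState.PREPARED:  {JobState.SUBMITTED, JobState.FAILED},
    JobState.SUBMITTED: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING:   {JobState.OUTPUT, JobState.FAILED},
    JobState.OUTPUT:    {JobState.REGISTERED, JobState.FAILED},
}

def advance(current: JobState, new: JobState) -> JobState:
    """Move a job to a new state, logging the change for the supervisor."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    print(f"job state: {current.name} -> {new.name}")  # reported back to Windmill
    return new
```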
7 Capone Execution Environment ("GCE" = Grid Component Environment)
GCE server side
- ATLAS releases and transformations: Pacman installation, performed dynamically by grid-based jobs
- Execution sandbox: Chimera kickstart executable, transformation wrapper scripts
- MDS info providers (required site-specific attributes)
GCE client side
- Capone (web service)
- Chimera/Pegasus, Condor-G (from VDT)
- Globus RLS and DQ clients
8 Capone Architecture
- Message interface: Web Service and Jabber
- Translation layer: Windmill schema
- CPE (Process Engine)
- Processes: Grid3 (GCE interface), Stub (local shell testing), DonQuijote (future)
(A sketch of this message flow follows.)
[Diagram: message protocols (Web Service, Jabber) feed the translation layer (Windmill, ADA), which drives the CPE and its Grid, Stub and DonQuijote processes]
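A minimal sketch of how such a layered design could be wired: a transport front end (web service or Jabber) delivers supervisor messages, a translation layer maps the Windmill schema onto engine calls, and the process engine dispatches each verb to a handler. The class names, verb set and dict-based messages are assumptions for illustration, not the actual Capone code.

```python
# Toy model of the message path: transport -> translation -> process engine.
class ProcessEngine:
    """Stand-in for the CPE: one handler per supervisor verb."""
    def execute(self, verb, payload):
        handlers = {
            "numJobsWanted": lambda p: 25,        # capacity estimate (illustrative)
            "executeJobs":   self._execute_jobs,
            "getStatus":     self._get_status,
        }
        return handlers[verb](payload)

    def _execute_jobs(self, payload):
        return {"accepted": len(payload.get("jobs", []))}

    def _get_status(self, payload):
        return {"state": "running"}

class TranslationLayer:
    """Maps Windmill-schema messages to engine calls and back."""
    def __init__(self, engine):
        self.engine = engine

    def handle(self, message):
        # In the real system the message is XML over SOAP or Jabber;
        # here it is already parsed into a dict for brevity.
        return {"verb": message["verb"],
                "result": self.engine.execute(message["verb"], message.get("body", {}))}

translator = TranslationLayer(ProcessEngine())
print(translator.handle({"verb": "numJobsWanted"}))
```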
9 Capone System Elements
GriPhyN Virtual Data System (VDS)
- Transformation: a workflow accepting input data (datasets) and parameters and producing output data (datasets); simple (executable) or complex (DAG)
- Derivation: a transformation whose parameters have been bound to actual parameters
- Directed Acyclic Graph (DAG):
  - Abstract DAG (DAX), created by Chimera, with no reference to concrete elements in the Grid
  - Concrete DAG (cDAG), created by Pegasus, where CE, SE and PFNs have been assigned (see the sketch below)
- Globus, RLS, Condor
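To make the DAX vs. cDAG distinction concrete, here is a toy model of the planning step: the abstract node carries only logical file names, and binding it to a CE, SE and physical file names produces a concrete node, which is the role Pegasus plays. The data classes, catalog contents and resource names are assumptions for illustration.

```python
# Abstract (DAX) vs. concrete (cDAG) workflow nodes, in a toy data model.
from dataclasses import dataclass

@dataclass
class AbstractNode:              # one step of the DAX
    transformation: str          # e.g. an ATLAS simulation transformation
    inputs: list                 # logical file names (LFNs) only
    outputs: list

@dataclass
class ConcreteNode:              # one step of the concrete DAG (cDAG)
    transformation: str
    ce: str                      # chosen compute element
    se: str                      # chosen storage element
    input_pfns: list             # physical file names resolved via RLS
    outputs: list

def plan(node: AbstractNode, rls: dict, ce: str, se: str) -> ConcreteNode:
    """Bind an abstract node to concrete grid resources (Pegasus's job)."""
    return ConcreteNode(
        transformation=node.transformation,
        ce=ce,
        se=se,
        input_pfns=[rls[lfn] for lfn in node.inputs],   # LFN -> PFN lookup
        outputs=node.outputs,
    )

rls_catalog = {"evgen.0001.pool.root": "gsiftp://se.example.edu/data/evgen.0001.pool.root"}
dax_node = AbstractNode("atlas.simul", ["evgen.0001.pool.root"], ["simul.0001.pool.root"])
cdag_node = plan(dax_node, rls_catalog, ce="uc-atlas-gatekeeper", se="uc-atlas-se")
print(cdag_node.ce, cdag_node.input_pfns)
```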
10 Capone Grid Interactions
[Diagram: Capone drives Chimera/Pegasus (VDC), Condor-G (schedd, GridManager) and DonQuijote; jobs reach a Grid3 compute element through the gatekeeper and run on worker nodes, with gsiftp transfers to the storage element; RLS holds replica information; Windmill and ProdDB sit above; monitoring via MDS, GridCat and MonALISA.]
11 A job in Capone (1: submission)
- Reception: job received from Windmill
- Translation: un-marshalling, ATLAS transformation
- DAX generation: Chimera generates the abstract DAG
- Input file retrieval from the RLS catalog: check RLS for input LFNs (retrieval of GUID, PFN)
- Scheduling: CE and SE are chosen
- Concrete DAG generation and submission: Pegasus creates the Condor submit files; DAGMan is invoked to manage the remote steps
(A condensed sketch of this path follows.)
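The sketch below walks through the same submission stages, one function per stage. All helper bodies are placeholders (assumptions) standing in for the real calls to Chimera, Globus RLS, Pegasus and Condor-G/DAGMan.

```python
# Condensed, hypothetical view of the Capone submission path.
def translate(job_definition):
    """Un-marshal the Windmill message into an ATLAS transformation + inputs."""
    return {"transformation": job_definition["transformation"],
            "inputs": job_definition["inputs"]}

def generate_dax(transformation):
    """Chimera step: produce an abstract DAG (DAX) with logical names only."""
    return {"steps": [transformation]}

def resolve_inputs(dax, rls_catalog):
    """RLS step: look up the PFN (and GUID) for each input LFN."""
    return {lfn: rls_catalog[lfn] for step in dax["steps"] for lfn in step["inputs"]}

def schedule(resolved_inputs):
    """Choose a compute element and a storage element for the job."""
    return "ce.example.edu/jobmanager-condor", "gsiftp://se.example.edu"

def make_concrete(dax, ce, se):
    """Pegasus step: write Condor submit files for the concrete DAG."""
    return "run0001/atlas.dag"

def submit(job_definition, rls_catalog):
    transformation = translate(job_definition)
    dax = generate_dax(transformation)
    inputs = resolve_inputs(dax, rls_catalog)
    ce, se = schedule(inputs)
    dag_file = make_concrete(dax, ce, se)
    # In production DAGMan is invoked here (condor_submit_dag <dag_file>)
    # and manages the remote steps through Condor-G.
    print(f"would submit {dag_file} to {ce}, outputs to {se}")

submit({"transformation": "atlas.simul", "inputs": ["evgen.0001.pool.root"]},
       {"evgen.0001.pool.root": "gsiftp://se.example.edu/evgen.0001.pool.root"})
```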
12 A job in Capone (2: execution)
- Remote job running / status checking: stage-in of input files, creation of the POOL FileCatalog, Athena (ATLAS code) execution
- Remote execution check: verification of output files and exit codes; recovery of metadata (GUID, MD5sum, executable attributes)
- Stage out: transfer from the CE site to the destination SE
- Output registration: registration of the output LFN/PFN and metadata in RLS
- Finish: job completed successfully; Capone tells Windmill the job is ready for validation
- Job status is sent to Windmill throughout execution; Windmill/DQ validate & register the output in ProdDB
(A sketch of the post-execution checks follows.)
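A sketch of the verification and registration steps above, under simplified assumptions: the exit code and output files are checked, an MD5 checksum and a GUID are recorded, and the LFN/PFN pair is entered into a catalog dictionary standing in for RLS. Function and variable names are illustrative, not Capone's own.

```python
# Hypothetical post-execution check: verify, collect metadata, register outputs.
import hashlib
import os
import uuid

def verify_and_register(exit_code, output_files, destination_se, catalog):
    """output_files maps each output LFN to its local path on the CE site."""
    if exit_code != 0:
        raise RuntimeError(f"Athena exited with code {exit_code}")
    for lfn, local_path in output_files.items():
        if not os.path.exists(local_path) or os.path.getsize(local_path) == 0:
            raise RuntimeError(f"missing or empty output {lfn}")
        with open(local_path, "rb") as f:
            md5 = hashlib.md5(f.read()).hexdigest()
        guid = str(uuid.uuid4())
        pfn = f"{destination_se}/{lfn}"   # stage-out target (gsiftp transfer)
        # In the real system the file is copied to the SE and the
        # LFN/PFN/GUID/MD5 are registered in RLS; here we only record them.
        catalog[lfn] = {"pfn": pfn, "guid": guid, "md5": md5}
    return catalog
```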
13 Performance Summary (9/20/04)
- Several physics and calibration samples produced
- 56K job attempts at the Windmill level
  - 9K of these aborted before grid submission, mostly because RLS or the selected CE was down
- "Full" success rate: 66%
- Average success rate after submission: 70%
  - includes subsequent problems at the submit host
  - includes errors from development
- 60 CPU-years consumed since July
- 8 TB produced

Job status (Capone totals): finished 37371, failed 18812
14 ATLAS DC2 CPU usage (G. Poulard, 9/21/04)
[Chart of DC2 CPU usage. Total ATLAS DC2: ~1470 kSI2k-months, ~7.94 million events, ~30 TB]
15 Ramp-up of ATLAS DC2
[Plot of CPU-days per day, ramping up from mid July to Sep 10]
16 Job Distribution on Grid3 (J. Shank, 9/21/04)
[Chart of DC2 job distribution on Grid3]
17 Site Statistics (9/20/04)
[Table with columns: #, CE gatekeeper, total jobs, finished jobs, failed jobs, success rate (%); the per-site counts did not survive extraction. Sites listed: UTA_dpcc, UC_ATLAS_Tier2, BU_ATLAS_Tier2, IU_ATLAS_Tier2, BNL_ATLAS_BAK, BNL_ATLAS, UM_ATLAS, UCSanDiego_PG, UBuffalo_CCR, FNAL_CMS, PDSF, CalTech_PG, SMU_Physics_Cluster, Rice_Grid, UWMadison, FNAL_CMS, UFlorida_PG]
Average success rate by site: 70%
18 Capone & Grid3 Failure Statistics (9/20/04)
Total jobs (validated): 37713
Jobs failed: 19303
- Submission: 472
- Execution: 392
- Post-job check: 1147
- Stage out: 8037
- RLS registration: 989
- Capone host interruptions: 2725
- Capone succeeded, Windmill failed: 57
- Other: 5139
19 Production Lessons
- Single points of failure: production database; RLS, DQ, VDC and Jabber servers; one local network domain; distributed RLS; system expertise (people)
- Fragmented production software
- Fragmented operations (defining/fixing jobs in the production database)
- Client (Capone submit) hosts
  - load and memory requirements for job management
  - load caused by job state checking (interaction with Condor-G)
  - many processes
  - no client host persistency; a local database is needed for job recovery (next phase of development, see the sketch below)
- DOEGrids certificate or certificate revocation list expiration
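The "local database for job recovery" item could look something like the minimal sqlite sketch below; it is an assumption about the planned development, not existing Capone code. The idea is simply to persist each job's state so a restarted submit host can resume where it left off.

```python
# Hypothetical local job-state store for recovery after a submit-host restart.
import sqlite3

def open_store(path="capone_jobs.db"):
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS jobs (
                       job_id  TEXT PRIMARY KEY,
                       state   TEXT NOT NULL,
                       updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")
    return con

def save_state(con, job_id, state):
    """Record the latest known state of a job (called on every transition)."""
    con.execute("INSERT OR REPLACE INTO jobs (job_id, state) VALUES (?, ?)",
                (job_id, state))
    con.commit()

def recover(con):
    """Jobs that were still in flight when the client host went down."""
    return con.execute("SELECT job_id, state FROM jobs "
                       "WHERE state NOT IN ('finished', 'failed')").fetchall()
```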
20 Production Lessons (II)
- Site infrastructure problems
  - hardware problems
  - software distribution, transformation upgrades
  - file systems (NFS the major culprit); various solutions by site administrators
- Errors in stage-out caused by poor network connections and gatekeeper load; fixed by adding I/O throttling and checking the number of TCP connections (see the sketch below)
- Lack of storage management (e.g. SRM) on sites means submitters do some cleanup remotely; not a major problem so far, but we have not had much competition
- Load on gatekeepers; improved by moving md5sum off the gatekeeper
- Post-job processing
  - remote execution (mostly in pre/post job) is error prone
  - the reason for a failure is difficult to understand
  - no automated tools for validation
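As an illustration of the I/O throttling fix, the sketch below caps the number of concurrent stage-out transfers from a site with a worker pool. The limit, the use of globus-url-copy and the transfer list format are assumptions for illustration, not the actual patch.

```python
# Hypothetical throttled stage-out: bound concurrent GridFTP transfers so the
# gatekeeper and the WAN link are not overloaded.
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_TRANSFERS = 5          # tuned per site in practice (assumed value)

def stage_out(src_pfn, dst_pfn):
    # globus-url-copy is the usual GridFTP client on Grid3 hosts
    return subprocess.run(["globus-url-copy", src_pfn, dst_pfn]).returncode

def throttled_stage_out(transfers):
    """transfers: list of (source PFN, destination PFN) pairs."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TRANSFERS) as pool:
        return list(pool.map(lambda t: stage_out(*t), transfers))
```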
21 Operations Lessons
- Grid3 iGOC and the US Tier1 developed an operations response model
  - Tier1 center: core services, an "on-call" person always available, response protocol developed
  - iGOC: coordinates problem resolution for the Tier1 "off hours"; trouble handling for non-ATLAS Grid3 sites; problems resolved at weekly iVDGL operations meetings
- Shift schedule (8-midnight since July 23)
  - 7 trained DC2 submitters
  - keeps queues saturated, reports site and system problems, cleans working directories
- Extensive use of mailing lists; partial use of alternatives like web portals and IM
22 Conclusions
- A completely new system
- Grid3's simplicity requires more functionality and state management on the executor submit host
  - all functions of job planning, job state tracking, and data management (stage-in, stage-out) are managed by Capone rather than by grid services
  - clients are exposed to all manner of grid failures: good for experience, but a client-heavy system
- Major areas for upgrades to the Capone system
  - job state management and controls, state persistency
  - generic transformation handling for user-level production
23 Authors GIERALTOWSKI, Gerald (Argonne National Laboratory) MAY, Edward (Argonne National Laboratory) VANIACHINE, Alexandre (Argonne National Laboratory) SHANK, Jim (Boston University) YOUSSEF, Saul (Boston University) BAKER, Richard (Brookhaven National Laboratory) DENG, Wensheng (Brookhaven National Laboratory) NEVSKI, Pavel (Brookhaven National Laboratory) MAMBELLI, Marco (University of Chicago) GARDNER, Robert (University of Chicago) SMIRNOV, Yuri (University of Chicago) ZHAO, Xin (University of Chicago) LUEHRING, Frederick (Indiana University) SEVERINI, Horst (Oklahoma University) DE, Kaushik (University of Texas at Arlington) MCGUIGAN, Patrick (University of Texas at Arlington) OZTURK, Nurcan (University of Texas at Arlington) SOSEBEE, Mark (University of Texas at Arlington)
24 Acknowledgements
- Windmill team (Kaushik De)
- Don Quijote team (Miguel Branco)
- ATLAS production group; Luc Goossens, CERN IT (prodDB)
- ATLAS software distribution team (Alessandro de Salvo, Fred Luehring)
- US ATLAS testbed sites and Grid3 site administrators
- iGOC operations group
- ATLAS Database group (ProdDB Capone-view displays)
- Physics Validation group: UC Berkeley, Brookhaven Lab
More info: US ATLAS Grid DC2 shift procedures; US ATLAS Grid Tools & Services