1 ATLAS DC2 Production …on Grid3
M. Mambelli, University of Chicago, for the US ATLAS DC2 team
September 28, 2004, CHEP04

2 ATLAS Data Challenges
Purpose
- Validate the LHC computing model
- Develop distributed production & analysis tools
- Provide large datasets for physics working groups
Schedule
- DC1 (2002-2003): full software chain
- DC2 (2004): automatic grid production system
- DC3 (2006): drive final deployments for startup

3 ATLAS DC2 Production
Phase I: Simulation (Jul-Sep 04)
- generation, simulation & pileup
- produced datasets stored at Tier1 centers, then CERN (Tier0)
- scale: ~10M events, 30 TB
Phase II: “Tier0 Test” @CERN (1/10 scale)
- produce ESD, AOD (reconstruction)
- stream to Tier1 centers
Phase III: Distributed analysis (Oct-Dec 04)
- access to event and non-event data from anywhere in the world, both in organized and chaotic ways (cf. D. Adams, #115)

4 ATLAS Production System Components
- Production database: ATLAS job definition and status
- Supervisor (all Grids): Windmill (L. Goossens, #501); job distribution and verification system
- Data Management: Don Quijote (M. Branco, #142); provides the ATLAS layer above the Grid replica systems
- Grid Executors: LCG: Lexor (D. Rebatto, #364); NorduGrid: Dulcinea (O. Smirnova, #499); Grid3: Capone (this talk)

5 ATLAS Global Architecture [diagram]
The Windmill supervisor (“super”) talks to the production database prodDB (CERN), the AMI metadata catalog, and the Don Quijote (“DQ”) data management layer built on RLS. Over Jabber/SOAP it drives the grid executors: Lexor (LCG exe), Dulcinea (NorduGrid exe), Capone (Grid3 exe, this talk) and a legacy LSF executor.

6 Capone and Grid3 Requirements
Interface to Grid3 (GriPhyN VDT based)
Manage all steps in the job life cycle
- prepare, submit, monitor, output & register
Manage workload and data placement
Process messages from the Windmill supervisor
Provide useful logging information to the user
Communicate executor and job state information to Windmill (→ ProdDB)
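As a rough illustration of the job life cycle above, the sketch below models an executor loop that walks each job through the prepare/submit/monitor/output/register steps and reports failures back to the supervisor. It is a minimal sketch, not the actual Capone code; the class and method names (CaponeJob, advance, and the step list) are hypothetical.

```python
# Minimal sketch of the per-job life cycle an executor like Capone has to manage.
# All names here (CaponeJob, STEPS, advance) are illustrative assumptions,
# not the real Capone API.
from dataclasses import dataclass, field

STEPS = ["prepare", "submit", "monitor", "output", "register", "done"]

@dataclass
class CaponeJob:
    job_id: str                      # identifier assigned by the Windmill supervisor
    step: str = "prepare"            # current position in the life cycle
    history: list = field(default_factory=list)

    def advance(self, ok: bool, detail: str = "") -> str:
        """Move to the next step on success, or flag the job as failed."""
        self.history.append((self.step, ok, detail))
        if not ok:
            self.step = "failed"     # state reported back to Windmill -> ProdDB
        else:
            self.step = STEPS[STEPS.index(self.step) + 1]
        return self.step

# Example: a job that prepares and submits fine, then fails while monitoring.
job = CaponeJob("dc2.simul.0001")
job.advance(True)                    # prepare -> submit
job.advance(True)                    # submit  -> monitor
job.advance(False, "CE gatekeeper unreachable")
print(job.step, job.history)
```

Keeping this state only in memory on the submit host, with no persistency, is one of the production lessons noted later in the talk.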

7 Capone Execution Environment
GCE Server side
- ATLAS releases and transformations (Pacman installation, done dynamically by grid-based jobs)
- Execution sandbox: Chimera kickstart executable, transformation wrapper scripts
- MDS info providers (required site-specific attributes)
GCE Client side (web service)
- Capone
- Chimera/Pegasus, Condor-G (from VDT)
- Globus, RLS and DQ clients
“GCE” = Grid Component Environment

8 Capone Architecture
Message interface
- Web Service
- Jabber
Translation layer
- Windmill schema
CPE (Process Engine)
Processes
- Grid3: GCE interface
- Stub: local shell testing
- DonQuijote (future)
[diagram: message protocols from Windmill and ADA arrive through the Web Service / Jabber interface, pass the translation layer into the CPE, which drives the Grid, Stub and DonQuijote processes]
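To make the layering concrete, here is a hedged sketch of how a supervisor message might flow through a message interface, a translation layer and a process engine that picks the Grid3 or local-stub process. The message fields, JSON encoding and handler names are invented for illustration; the real Windmill schema and Capone interfaces are not reproduced here.

```python
# Hedged sketch of the Capone-style layering: message interface -> translation -> CPE.
# The message format and class names are illustrative assumptions only.
import json

class TranslationLayer:
    """Turns a supervisor (Windmill-schema) message into an internal request."""
    def translate(self, raw: str) -> dict:
        msg = json.loads(raw)              # the real system spoke Web Services / Jabber, not JSON
        return {"verb": msg["verb"], "job_id": msg.get("job_id")}

class Grid3Process:
    def handle(self, req: dict) -> str:
        return f"submitting {req['job_id']} to Grid3 via the GCE interface"

class StubProcess:
    def handle(self, req: dict) -> str:
        return f"running {req['job_id']} in a local shell for testing"

class ProcessEngine:
    """CPE: routes translated requests to the configured process."""
    def __init__(self, use_stub: bool = False):
        self.process = StubProcess() if use_stub else Grid3Process()

    def dispatch(self, raw: str) -> str:
        req = TranslationLayer().translate(raw)
        return self.process.handle(req)

print(ProcessEngine(use_stub=True).dispatch('{"verb": "executejob", "job_id": "dc2.0001"}'))
```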

9 Capone System Elements
GriPhyN Virtual Data System (VDS)
Transformation
- a workflow accepting input data (datasets) and parameters and producing output data (datasets)
- simple (executable) / complex (DAG)
Derivation
- a transformation whose formal parameters have been bound to actual values
Directed Acyclic Graph (DAG)
- abstract DAG (DAX) created by Chimera, with no reference to concrete elements in the Grid
- concrete DAG (cDAG) created by Pegasus, where CE, SE and PFN have been assigned
Globus, RLS, Condor
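The distinction between abstract and concrete workflows can be summarised with a small data-model sketch. This is not the VDS schema; the dataclasses below are hypothetical stand-ins that only mirror the definitions on this slide (a derivation binds a transformation's parameters, Pegasus binds CE/SE/PFN to make the DAG concrete). Host names and LFNs are made up.

```python
# Illustrative data model for the VDS concepts on this slide.
# These dataclasses are assumptions for explanation, not the VDS/Chimera schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Transformation:
    name: str                          # e.g. an ATLAS simulation step
    formal_params: List[str]           # placeholders for datasets/options

@dataclass
class Derivation:
    transformation: Transformation
    actual_params: Dict[str, str]      # formal parameters bound to actual values

@dataclass
class AbstractNode:                    # one node of the DAX produced by Chimera
    derivation: Derivation
    inputs: List[str]                  # logical file names only, no grid specifics
    outputs: List[str]

@dataclass
class ConcreteNode:                    # one node of the cDAG produced by Pegasus
    abstract: AbstractNode
    ce: str                            # compute element chosen for execution
    se: str                            # storage element for the outputs
    pfns: Dict[str, str] = field(default_factory=dict)  # LFN -> physical file name

simul = Transformation("atlas.simul", ["events_in", "events_out", "random_seed"])
deriv = Derivation(simul, {"events_in": "lfn:gen.0001",
                           "events_out": "lfn:simul.0001",
                           "random_seed": "12345"})
dax_node = AbstractNode(deriv, ["lfn:gen.0001"], ["lfn:simul.0001"])
cdag_node = ConcreteNode(dax_node,
                         ce="uc-atlas-ce.example.org",
                         se="uc-atlas-se.example.org",
                         pfns={"lfn:simul.0001":
                               "gsiftp://uc-atlas-se.example.org/dc2/simul.0001"})
print(cdag_node.ce, cdag_node.pfns)
```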

10 Capone Grid Interactions [diagram]
Capone plans jobs with Chimera/Pegasus (backed by the VDC), submits them through Condor-G (schedd, GridManager) to the site CE gatekeeper and worker nodes, moves files to the SE via gsiftp, looks up and registers files in RLS, reports to Windmill/ProdDB and DonQuijote, and relies on MDS, GridCat and MonALISA for monitoring.

11 A job in Capone (1, submission)
Reception
- job received from Windmill
Translation
- un-marshalling, ATLAS transformation
DAX generation
- Chimera generates the abstract DAG
Input file retrieval from the RLS catalog
- check RLS for input LFNs (retrieval of GUID, PFN)
Scheduling
- CE and SE are chosen
Concrete DAG generation and submission
- Pegasus creates the Condor submit files
- DAGMan is invoked to manage the remote steps
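A compressed sketch of that submission path is shown below. It strings the steps together with placeholder functions; the function names, return values and the way CE/SE are chosen are all assumptions for illustration, not the actual Capone, Chimera or Pegasus interfaces.

```python
# Hedged, end-to-end sketch of the submission path on this slide.
# Every helper here is a stand-in: the real work is done by Windmill messages,
# Chimera (DAX generation), RLS lookups, Pegasus planning and Condor-G/DAGMan.
from typing import Dict, List

def receive_from_windmill() -> Dict:
    return {"job_id": "dc2.simul.0001", "transformation": "atlas.simul",
            "inputs": ["lfn:gen.0001"]}

def generate_dax(jobdef: Dict) -> str:
    return f"dax-for-{jobdef['job_id']}.xml"          # Chimera would build the abstract DAG

def lookup_rls(lfns: List[str]) -> Dict[str, str]:
    # Placeholder for the RLS query that resolves each input LFN to GUID/PFN.
    return {lfn: f"gsiftp://some-se.example.org/{lfn.split(':')[1]}" for lfn in lfns}

def choose_site(candidates: List[str]) -> str:
    return candidates[0]                              # real scheduling used site and monitoring info

def plan_and_submit(dax: str, ce: str, pfns: Dict[str, str]) -> str:
    # Pegasus would write the Condor submit files; DAGMan then manages the remote steps.
    return f"dagman-cluster-for-{dax}-on-{ce}"

jobdef = receive_from_windmill()                      # Reception + translation
dax = generate_dax(jobdef)                            # DAX generation
pfns = lookup_rls(jobdef["inputs"])                   # Input file retrieval from RLS
ce = choose_site(["uc-atlas-ce.example.org",
                  "uta-dpcc-ce.example.org"])         # Scheduling
handle = plan_and_submit(dax, ce, pfns)               # Concrete DAG generation and submission
print(handle)
```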

12 A job in Capone (2, execution)
Remote job running / status checking
- stage-in of input files, creation of the POOL FileCatalog
- Athena (ATLAS code) execution
Remote execution check
- verification of output files and exit codes
- recovery of metadata (GUID, MD5sum, exe attributes)
Stage out
- transfer from the CE site to the destination SE
Output registration
- registration of the output LFN/PFN and metadata in RLS
Finish
- job completed successfully; Capone tells Windmill that the job is ready for validation
Job status is sent to Windmill throughout the execution; Windmill/DQ validate & register the output in ProdDB
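The remote-execution check and stage-out are where most failures were seen (see the failure statistics later in the talk), so a sketch of that validation step may help. The checks below (exit code, expected outputs, MD5 checksum, size) follow this slide; the file layout, catalog call and helper names are invented for the example.

```python
# Hedged sketch of the post-job check and output registration step.
# File names, the "registration" call and the metadata layout are illustrative assumptions.
import hashlib
import os
from typing import Dict, List

def check_outputs(exit_code: int, expected: List[str], workdir: str) -> Dict[str, Dict]:
    """Verify the exit code and expected output files, recovering per-file metadata."""
    if exit_code != 0:
        raise RuntimeError(f"Athena exited with code {exit_code}")
    metadata = {}
    for lfn in expected:
        path = os.path.join(workdir, lfn.split(":")[-1])
        if not os.path.exists(path):
            raise RuntimeError(f"missing output file for {lfn}")
        with open(path, "rb") as f:
            md5 = hashlib.md5(f.read()).hexdigest()
        metadata[lfn] = {"md5": md5, "size": os.path.getsize(path)}
    return metadata

def register_outputs(metadata: Dict[str, Dict], se: str) -> None:
    """Stand-in for the stage-out to the SE and the LFN/PFN registration in RLS."""
    for lfn, meta in metadata.items():
        pfn = f"gsiftp://{se}/dc2/{lfn.split(':')[-1]}"
        print(f"register {lfn} -> {pfn} (md5 {meta['md5']}, {meta['size']} bytes)")

# Example run against a scratch directory standing in for the worker-node sandbox.
os.makedirs("/tmp/capone-demo", exist_ok=True)
with open("/tmp/capone-demo/simul.0001", "wb") as f:
    f.write(b"fake event data")
meta = check_outputs(0, ["lfn:simul.0001"], "/tmp/capone-demo")
register_outputs(meta, "uc-atlas-se.example.org")
```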

13 Performance Summary (9/20/04)
Several physics and calibration samples produced
56K job attempts at the Windmill level
- 9K of these aborted before grid submission, mostly because RLS or the selected CE was down
“Full” success rate: 66%
Average success rate after submission: 70%
- includes subsequent problems at the submit host
- includes errors from development
60 CPU-years consumed since July
8 TB produced
Job status (Capone totals): failed 18812, finished 37371

14 ATLAS DC2 CPU usage (G. Poulard, 9/21/04)
Total ATLAS DC2: ~1470 kSI2k.months, ~100000 jobs, ~7.94 million events, ~30 TB

15 Ramp up ATLAS DC2 [chart: CPU-days per day, mid-July through Sep 10]

16 Job Distribution on Grid3 (J. Shank, 9/21/04) [chart]

17 Site Statistics (9/20/04)
#   CE Gatekeeper          Total Jobs   Finished Jobs   Failed   Success Rate (%)
1   UTA_dpcc                     8817            6703     2114              76.02
2   UC_ATLAS_Tier2               6132            4980     1152              81.21
3   BU_ATLAS_Tier2               6336            4890     1446              77.18
4   IU_ATLAS_Tier2               4836            3625     1211              74.96
5   BNL_ATLAS_BAK                4579            3591      988              78.42
6   BNL_ATLAS                    3116            2548      568              81.77
7   UM_ATLAS                     3583            1998     1585              55.76
8   UCSanDiego_PG                2097            1712      385              81.64
9   UBuffalo_CCR                 1925            1594      331              82.81
10  FNAL_CMS                     2649            1456     1193              54.96
11  PDSF                         2328            1430      898              61.43
12  CalTech_PG                   1834            1350      484              73.61
13  SMU_Physics_Cluster           660             438      222              66.36
14  Rice_Grid3                    493             363      130              73.63
15  UWMadison                     516             258      258              50.00
16  FNAL_CMS2                     343             228      115              66.47
17  UFlorida_PG                   394             182      212              46.19
Average success rate by site: 70%
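As a quick sanity check on this table, the snippet below recomputes the per-site success rates and the unweighted average quoted on the slide (about 70%), plus the job-weighted overall rate. The numbers are copied from the table above; nothing else is assumed.

```python
# Recompute the per-site success rates and their unweighted average
# from the table above (finished jobs, total jobs per gatekeeper).
sites = {
    "UTA_dpcc": (6703, 8817),          "UC_ATLAS_Tier2": (4980, 6132),
    "BU_ATLAS_Tier2": (4890, 6336),    "IU_ATLAS_Tier2": (3625, 4836),
    "BNL_ATLAS_BAK": (3591, 4579),     "BNL_ATLAS": (2548, 3116),
    "UM_ATLAS": (1998, 3583),          "UCSanDiego_PG": (1712, 2097),
    "UBuffalo_CCR": (1594, 1925),      "FNAL_CMS": (1456, 2649),
    "PDSF": (1430, 2328),              "CalTech_PG": (1350, 1834),
    "SMU_Physics_Cluster": (438, 660), "Rice_Grid3": (363, 493),
    "UWMadison": (258, 516),           "FNAL_CMS2": (228, 343),
    "UFlorida_PG": (182, 394),
}

rates = {name: 100.0 * done / total for name, (done, total) in sites.items()}
average = sum(rates.values()) / len(rates)   # unweighted average, as quoted on the slide (~70%)
overall = 100.0 * sum(d for d, _ in sites.values()) / sum(t for _, t in sites.values())

print(f"unweighted average success rate: {average:.1f}%")
print(f"job-weighted success rate:       {overall:.1f}%")
```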

18 Capone & Grid3 Failure Statistics (9/20/04)
Total jobs (validated): 37713
Jobs failed: 19303
- Submission: 472
- Execution: 392
- Post-job check: 1147
- Stage out: 8037
- RLS registration: 989
- Capone host interruptions: 2725
- Capone succeeded, Windmill failed: 57
- Other: 5139

19 Production lessons
Single points of failure
- production database
- RLS, DQ, VDC and Jabber servers: one local network domain → distributed RLS
- system expertise (people)
Fragmented production software
Fragmented operations (defining/fixing jobs in the production database)
Client (Capone submit) hosts
- load and memory requirements for job management
- load caused by job state checking (interaction with Condor-G)
- many processes
- no client-host persistency: a local database is needed for job recovery (next phase of development)
DOEGrids certificate or certificate revocation list expiration

20 Production lessons (II)
Site infrastructure problems
- hardware problems
- software distribution, transformation upgrades
- file systems (NFS the major culprit); various solutions by site administrators
- errors in stage-out caused by poor network connections and gatekeeper load; fixed by adding I/O throttling and checking the number of TCP connections
- lack of storage management (e.g. SRM) on sites means submitters do some cleanup remotely; not a major problem so far, but we have not had much competition
Load on gatekeepers
- improved by moving md5sum off the gatekeeper
Post-job processing
- remote execution (mostly in pre/post job) is error prone
- the reasons for failures are difficult to understand
- no automated tools for validation

21 Operations Lessons
Grid3 iGOC and the US Tier1 developed an operations response model
Tier1 center
- core services
- an “on-call” person available at all times
- response protocol developed
iGOC
- coordinates problem resolution for the Tier1 “off hours”
- trouble handling for non-ATLAS Grid3 sites; problems resolved at weekly iVDGL operations meetings
Shift schedule (8-midnight since July 23)
- 7 trained DC2 submitters
- keep the queues saturated, report site and system problems, clean working directories
Extensive use of email lists
- partial use of alternatives like Web portals and IM

22 Conclusions
Completely new system
- Grid3 simplicity requires more functionality and state management on the executor submit host
- all functions of job planning, job state tracking and data management (stage-in, stage-out) are handled by Capone rather than by the grid systems
- clients are exposed to all manner of grid failures: good for experience, but a client-heavy system
Major areas for upgrade to the Capone system
- job state management and controls, state persistency
- generic transformation handling for user-level production

23 Authors
GIERALTOWSKI, Gerald (Argonne National Laboratory)
MAY, Edward (Argonne National Laboratory)
VANIACHINE, Alexandre (Argonne National Laboratory)
SHANK, Jim (Boston University)
YOUSSEF, Saul (Boston University)
BAKER, Richard (Brookhaven National Laboratory)
DENG, Wensheng (Brookhaven National Laboratory)
NEVSKI, Pavel (Brookhaven National Laboratory)
MAMBELLI, Marco (University of Chicago)
GARDNER, Robert (University of Chicago)
SMIRNOV, Yuri (University of Chicago)
ZHAO, Xin (University of Chicago)
LUEHRING, Frederick (Indiana University)
SEVERINI, Horst (Oklahoma University)
DE, Kaushik (University of Texas at Arlington)
MCGUIGAN, Patrick (University of Texas at Arlington)
OZTURK, Nurcan (University of Texas at Arlington)
SOSEBEE, Mark (University of Texas at Arlington)

24 Acknowledgements
Windmill team (Kaushik De)
Don Quijote team (Miguel Branco)
ATLAS production group, Luc Goossens, CERN IT (prodDB)
ATLAS software distribution team (Alessandro de Salvo, Fred Luehring)
US ATLAS testbed sites and Grid3 site administrators
iGOC operations group
ATLAS Database group (ProdDB Capone-view displays)
Physics Validation group: UC Berkeley, Brookhaven Lab
More info
- US ATLAS Grid: http://www.usatlas.bnl.gov/computing/grid/
- DC2 shift procedures: http://grid.uchicago.edu/dc2shift
- US ATLAS Grid Tools & Services: http://grid.uchicago.edu/gts/

