1 ATLAS DC2 Production on Grid3
M. Mambelli (University of Chicago), for the US ATLAS DC2 team
September 28, 2004, CHEP04
2 ATLAS Data Challenges
Purpose
- Validate the LHC computing model
- Develop distributed production & analysis tools
- Provide large datasets for physics working groups
Schedule
- DC1 (2002-2003): full software chain
- DC2 (2004): automatic grid production system
- DC3 (2006): drive final deployments for startup
3 ATLAS DC2 Production
Phase I: Simulation (Jul-Sep 04)
- generation, simulation & pileup
- produced datasets stored on Tier1 centers, then CERN (Tier0)
- scale: ~10M events, 30 TB
Phase II: Tier0 (1/10 scale)
- produce ESD, AOD (reconstruction)
- stream to Tier1 centers
Phase III: Distributed analysis (Oct-Dec 04)
- access to event and non-event data from anywhere in the world
- both in organized and chaotic ways
- cf. D. Adams, #115
4 ATLAS Production System Components
- Production database: ATLAS job definition and status
- Supervisor (all grids): Windmill (L. Goossens, #501), the job distribution and verification system
- Data Management: Don Quijote (M. Branco, #142), provides the ATLAS layer above the Grid replica systems
- Grid executors:
  - LCG: Lexor (D. Rebatto, #364)
  - NorduGrid: Dulcinea (O. Smirnova, #499)
  - Grid3: Capone (this talk)
5 ATLAS Global Architecture
[Diagram: the Windmill supervisor talks to the production database (prodDB at CERN) and, over Jabber/SOAP, to the grid executors: Lexor (LCG), Dulcinea (NorduGrid), Capone (Grid3, this talk) and a legacy LSF executor. Data management is handled by Don Quijote ("DQ") on top of RLS, with AMI holding metadata.]
6 Capone and Grid3 Requirements
- Interface to Grid3 (GriPhyN VDT based)
- Manage all steps in the job life cycle: prepare, submit, monitor, output & register (see the sketch below)
- Manage workload and data placement
- Process messages from the Windmill supervisor
- Provide useful logging information to the user
- Communicate executor and job state information to Windmill (ProdDB)
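As a rough illustration of the job life cycle Capone has to manage, the sketch below models the prepare/submit/monitor/output/register steps as a small state machine. The state names, transition table and `advance` helper are illustrative assumptions, not Capone's actual code.

```python
# Hypothetical sketch of the job life cycle described above; names are
# assumptions for illustration, not Capone's API.
from enum import Enum

class JobState(Enum):
    RECEIVED = "received"      # job definition accepted from Windmill
    PREPARED = "prepared"      # transformation resolved, inputs located
    SUBMITTED = "submitted"    # handed to the grid (Condor-G/DAGMan)
    RUNNING = "running"        # executing on a Grid3 worker node
    OUTPUT = "output"          # outputs staged out and checked
    REGISTERED = "registered"  # outputs registered in the replica catalog
    FAILED = "failed"

# Allowed forward transitions for the prepare/submit/monitor/output/register cycle
TRANSITIONS = {
    JobState.RECEIVED:  {JobState.PREPARED, JobState.FAILED},
    JobState.PREPARED:  {JobState.SUBMITTED, JobState.FAILED},
    JobState.SUBMITTED: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING:   {JobState.OUTPUT, JobState.FAILED},
    JobState.OUTPUT:    {JobState.REGISTERED, JobState.FAILED},
}

def advance(current: JobState, new: JobState) -> JobState:
    """Move a job to a new state, logging the change for the supervisor."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    print(f"job state: {current.name} -> {new.name}")  # reported back to Windmill
    return new
```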
7 Capone Execution Environment ("GCE" = Grid Component Environment)
GCE server side
- ATLAS releases and transformations: Pacman installation, performed dynamically by grid-based jobs
- Execution sandbox: Chimera kickstart executable, transformation wrapper scripts
- MDS info providers (required site-specific attributes)
GCE client side
- Capone (web service)
- Chimera/Pegasus, Condor-G (from VDT)
- Globus RLS and DQ clients
8 Capone Architecture
- Message interface: Web Service and Jabber
- Translation layer: Windmill schema
- CPE (Process Engine)
- Processes: Grid3 (GCE interface), Stub (local shell testing), DonQuijote (future)
(A sketch of this message flow follows.)
[Diagram: message protocols (Web Service, Jabber) feed the translation layer (Windmill, ADA), which drives the CPE and its Grid, Stub and DonQuijote processes]
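A minimal sketch of how such a layered design could be wired: a transport front end (web service or Jabber) delivers supervisor messages, a translation layer maps the Windmill schema onto engine calls, and the process engine dispatches each verb to a handler. The class names, verb set and dict-based messages are assumptions for illustration, not the actual Capone code.

```python
# Toy model of the message path: transport -> translation -> process engine.
class ProcessEngine:
    """Stand-in for the CPE: one handler per supervisor verb."""
    def execute(self, verb, payload):
        handlers = {
            "numJobsWanted": lambda p: 25,        # capacity estimate (illustrative)
            "executeJobs":   self._execute_jobs,
            "getStatus":     self._get_status,
        }
        return handlers[verb](payload)

    def _execute_jobs(self, payload):
        return {"accepted": len(payload.get("jobs", []))}

    def _get_status(self, payload):
        return {"state": "running"}

class TranslationLayer:
    """Maps Windmill-schema messages to engine calls and back."""
    def __init__(self, engine):
        self.engine = engine

    def handle(self, message):
        # In the real system the message is XML over SOAP or Jabber;
        # here it is already parsed into a dict for brevity.
        return {"verb": message["verb"],
                "result": self.engine.execute(message["verb"], message.get("body", {}))}

translator = TranslationLayer(ProcessEngine())
print(translator.handle({"verb": "numJobsWanted"}))
```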
9 Capone System Elements
GriPhyN Virtual Data System (VDS)
- Transformation: a workflow accepting input data (datasets) and parameters and producing output data (datasets); simple (executable) or complex (DAG)
- Derivation: a transformation whose parameters have been bound to actual parameters
- Directed Acyclic Graph (DAG):
  - Abstract DAG (DAX), created by Chimera, with no reference to concrete elements in the Grid
  - Concrete DAG (cDAG), created by Pegasus, where CE, SE and PFNs have been assigned (see the sketch below)
- Globus, RLS, Condor
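To make the DAX vs. cDAG distinction concrete, here is a toy model of the planning step: the abstract node carries only logical file names, and binding it to a CE, SE and physical file names produces a concrete node, which is the role Pegasus plays. The data classes, catalog contents and resource names are assumptions for illustration.

```python
# Abstract (DAX) vs. concrete (cDAG) workflow nodes, in a toy data model.
from dataclasses import dataclass

@dataclass
class AbstractNode:              # one step of the DAX
    transformation: str          # e.g. an ATLAS simulation transformation
    inputs: list                 # logical file names (LFNs) only
    outputs: list

@dataclass
class ConcreteNode:              # one step of the concrete DAG (cDAG)
    transformation: str
    ce: str                      # chosen compute element
    se: str                      # chosen storage element
    input_pfns: list             # physical file names resolved via RLS
    outputs: list

def plan(node: AbstractNode, rls: dict, ce: str, se: str) -> ConcreteNode:
    """Bind an abstract node to concrete grid resources (Pegasus's job)."""
    return ConcreteNode(
        transformation=node.transformation,
        ce=ce,
        se=se,
        input_pfns=[rls[lfn] for lfn in node.inputs],   # LFN -> PFN lookup
        outputs=node.outputs,
    )

rls_catalog = {"evgen.0001.pool.root": "gsiftp://se.example.edu/data/evgen.0001.pool.root"}
dax_node = AbstractNode("atlas.simul", ["evgen.0001.pool.root"], ["simul.0001.pool.root"])
cdag_node = plan(dax_node, rls_catalog, ce="uc-atlas-gatekeeper", se="uc-atlas-se")
print(cdag_node.ce, cdag_node.input_pfns)
```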
10 Capone Grid Interactions
[Diagram: Capone drives Chimera/Pegasus (VDC), Condor-G (schedd, GridManager) and DonQuijote; jobs reach a Grid3 compute element through the gatekeeper and run on worker nodes, with gsiftp transfers to the storage element; RLS holds replica information; Windmill and ProdDB sit above; monitoring via MDS, GridCat and MonALISA.]
11 A job in Capone (1: submission)
- Reception: job received from Windmill
- Translation: un-marshalling, ATLAS transformation
- DAX generation: Chimera generates the abstract DAG
- Input file retrieval from the RLS catalog: check RLS for input LFNs (retrieval of GUID, PFN)
- Scheduling: CE and SE are chosen
- Concrete DAG generation and submission: Pegasus creates the Condor submit files; DAGMan is invoked to manage the remote steps
(A condensed sketch of this path follows.)
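The sketch below walks through the same submission stages, one function per stage. All helper bodies are placeholders (assumptions) standing in for the real calls to Chimera, Globus RLS, Pegasus and Condor-G/DAGMan.

```python
# Condensed, hypothetical view of the Capone submission path.
def translate(job_definition):
    """Un-marshal the Windmill message into an ATLAS transformation + inputs."""
    return {"transformation": job_definition["transformation"],
            "inputs": job_definition["inputs"]}

def generate_dax(transformation):
    """Chimera step: produce an abstract DAG (DAX) with logical names only."""
    return {"steps": [transformation]}

def resolve_inputs(dax, rls_catalog):
    """RLS step: look up the PFN (and GUID) for each input LFN."""
    return {lfn: rls_catalog[lfn] for step in dax["steps"] for lfn in step["inputs"]}

def schedule(resolved_inputs):
    """Choose a compute element and a storage element for the job."""
    return "ce.example.edu/jobmanager-condor", "gsiftp://se.example.edu"

def make_concrete(dax, ce, se):
    """Pegasus step: write Condor submit files for the concrete DAG."""
    return "run0001/atlas.dag"

def submit(job_definition, rls_catalog):
    transformation = translate(job_definition)
    dax = generate_dax(transformation)
    inputs = resolve_inputs(dax, rls_catalog)
    ce, se = schedule(inputs)
    dag_file = make_concrete(dax, ce, se)
    # In production DAGMan is invoked here (condor_submit_dag <dag_file>)
    # and manages the remote steps through Condor-G.
    print(f"would submit {dag_file} to {ce}, outputs to {se}")

submit({"transformation": "atlas.simul", "inputs": ["evgen.0001.pool.root"]},
       {"evgen.0001.pool.root": "gsiftp://se.example.edu/evgen.0001.pool.root"})
```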
12 A job in Capone (2: execution)
- Remote job running / status checking: stage-in of input files, creation of the POOL FileCatalog, Athena (ATLAS code) execution
- Remote execution check: verification of output files and exit codes; recovery of metadata (GUID, MD5sum, executable attributes)
- Stage out: transfer from the CE site to the destination SE
- Output registration: registration of the output LFN/PFN and metadata in RLS
- Finish: job completed successfully; Capone tells Windmill the job is ready for validation
- Job status is sent to Windmill throughout execution; Windmill/DQ validate & register the output in ProdDB
(A sketch of the post-execution checks follows.)
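A sketch of the verification and registration steps above, under simplified assumptions: the exit code and output files are checked, an MD5 checksum and a GUID are recorded, and the LFN/PFN pair is entered into a catalog dictionary standing in for RLS. Function and variable names are illustrative, not Capone's own.

```python
# Hypothetical post-execution check: verify, collect metadata, register outputs.
import hashlib
import os
import uuid

def verify_and_register(exit_code, output_files, destination_se, catalog):
    """output_files maps each output LFN to its local path on the CE site."""
    if exit_code != 0:
        raise RuntimeError(f"Athena exited with code {exit_code}")
    for lfn, local_path in output_files.items():
        if not os.path.exists(local_path) or os.path.getsize(local_path) == 0:
            raise RuntimeError(f"missing or empty output {lfn}")
        with open(local_path, "rb") as f:
            md5 = hashlib.md5(f.read()).hexdigest()
        guid = str(uuid.uuid4())
        pfn = f"{destination_se}/{lfn}"   # stage-out target (gsiftp transfer)
        # In the real system the file is copied to the SE and the
        # LFN/PFN/GUID/MD5 are registered in RLS; here we only record them.
        catalog[lfn] = {"pfn": pfn, "guid": guid, "md5": md5}
    return catalog
```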
13 Performance Summary (9/20/04)
- Several physics and calibration samples produced
- 56K job attempts at the Windmill level
  - 9K of these aborted before grid submission, mostly because RLS or the selected CE was down
- "Full" success rate: 66%
- Average success rate after submission: 70%
  - includes subsequent problems at the submit host
  - includes errors from development
- 60 CPU-years consumed since July
- 8 TB produced

Job status (Capone totals): finished 37371, failed 18812
14 ATLAS DC2 CPU usage (G. Poulard, 9/21/04)
[Chart of DC2 CPU usage. Total ATLAS DC2: ~1470 kSI2k-months, ~7.94 million events, ~30 TB]
15 Ramp-up of ATLAS DC2
[Plot of CPU-days per day, ramping up from mid July to Sep 10]
16 Job Distribution on Grid3 (J. Shank, 9/21/04)
[Chart of DC2 job distribution on Grid3]
17 Site Statistics (9/20/04)
[Table with columns: #, CE gatekeeper, total jobs, finished jobs, failed jobs, success rate (%); the per-site counts did not survive extraction. Sites listed: UTA_dpcc, UC_ATLAS_Tier2, BU_ATLAS_Tier2, IU_ATLAS_Tier2, BNL_ATLAS_BAK, BNL_ATLAS, UM_ATLAS, UCSanDiego_PG, UBuffalo_CCR, FNAL_CMS, PDSF, CalTech_PG, SMU_Physics_Cluster, Rice_Grid, UWMadison, FNAL_CMS, UFlorida_PG]
Average success rate by site: 70%
18 Capone & Grid3 Failure Statistics (9/20/04)
Total jobs (validated): 37713
Jobs failed: 19303
- Submission: 472
- Execution: 392
- Post-job check: 1147
- Stage out: 8037
- RLS registration: 989
- Capone host interruptions: 2725
- Capone succeeded, Windmill failed: 57
- Other: 5139
19 Production Lessons
- Single points of failure: production database; RLS, DQ, VDC and Jabber servers; one local network domain; distributed RLS; system expertise (people)
- Fragmented production software
- Fragmented operations (defining/fixing jobs in the production database)
- Client (Capone submit) hosts
  - load and memory requirements for job management
  - load caused by job state checking (interaction with Condor-G)
  - many processes
  - no client host persistency; a local database is needed for job recovery (next phase of development, see the sketch below)
- DOEGrids certificate or certificate revocation list expiration
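The "local database for job recovery" item could look something like the minimal sqlite sketch below; it is an assumption about the planned development, not existing Capone code. The idea is simply to persist each job's state so a restarted submit host can resume where it left off.

```python
# Hypothetical local job-state store for recovery after a submit-host restart.
import sqlite3

def open_store(path="capone_jobs.db"):
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS jobs (
                       job_id  TEXT PRIMARY KEY,
                       state   TEXT NOT NULL,
                       updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")
    return con

def save_state(con, job_id, state):
    """Record the latest known state of a job (called on every transition)."""
    con.execute("INSERT OR REPLACE INTO jobs (job_id, state) VALUES (?, ?)",
                (job_id, state))
    con.commit()

def recover(con):
    """Jobs that were still in flight when the client host went down."""
    return con.execute("SELECT job_id, state FROM jobs "
                       "WHERE state NOT IN ('finished', 'failed')").fetchall()
```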
20 Production Lessons (II)
- Site infrastructure problems
  - hardware problems
  - software distribution, transformation upgrades
  - file systems (NFS the major culprit); various solutions by site administrators
- Errors in stage-out caused by poor network connections and gatekeeper load; fixed by adding I/O throttling and checking the number of TCP connections (see the sketch below)
- Lack of storage management (e.g. SRM) on sites means submitters do some cleanup remotely; not a major problem so far, but we have not had much competition
- Load on gatekeepers; improved by moving md5sum off the gatekeeper
- Post-job processing
  - remote execution (mostly in pre/post job) is error prone
  - the reason for a failure is difficult to understand
  - no automated tools for validation
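As an illustration of the I/O throttling fix, the sketch below caps the number of concurrent stage-out transfers from a site with a worker pool. The limit, the use of globus-url-copy and the transfer list format are assumptions for illustration, not the actual patch.

```python
# Hypothetical throttled stage-out: bound concurrent GridFTP transfers so the
# gatekeeper and the WAN link are not overloaded.
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_TRANSFERS = 5          # tuned per site in practice (assumed value)

def stage_out(src_pfn, dst_pfn):
    # globus-url-copy is the usual GridFTP client on Grid3 hosts
    return subprocess.run(["globus-url-copy", src_pfn, dst_pfn]).returncode

def throttled_stage_out(transfers):
    """transfers: list of (source PFN, destination PFN) pairs."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TRANSFERS) as pool:
        return list(pool.map(lambda t: stage_out(*t), transfers))
```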
21 Operations Lessons
- Grid3 iGOC and the US Tier1 developed an operations response model
  - Tier1 center: core services, an "on-call" person always available, response protocol developed
  - iGOC: coordinates problem resolution for the Tier1 "off hours"; trouble handling for non-ATLAS Grid3 sites; problems resolved at weekly iVDGL operations meetings
- Shift schedule (8-midnight since July 23)
  - 7 trained DC2 submitters
  - keeps queues saturated, reports site and system problems, cleans working directories
- Extensive use of mailing lists; partial use of alternatives like web portals and IM
22 Conclusions
- A completely new system
- Grid3's simplicity requires more functionality and state management on the executor submit host
  - all functions of job planning, job state tracking, and data management (stage-in, stage-out) are managed by Capone rather than by grid services
  - clients are exposed to all manner of grid failures: good for experience, but a client-heavy system
- Major areas for upgrades to the Capone system
  - job state management and controls, state persistency
  - generic transformation handling for user-level production
23 Authors GIERALTOWSKI, Gerald (Argonne National Laboratory) MAY, Edward (Argonne National Laboratory) VANIACHINE, Alexandre (Argonne National Laboratory) SHANK, Jim (Boston University) YOUSSEF, Saul (Boston University) BAKER, Richard (Brookhaven National Laboratory) DENG, Wensheng (Brookhaven National Laboratory) NEVSKI, Pavel (Brookhaven National Laboratory) MAMBELLI, Marco (University of Chicago) GARDNER, Robert (University of Chicago) SMIRNOV, Yuri (University of Chicago) ZHAO, Xin (University of Chicago) LUEHRING, Frederick (Indiana University) SEVERINI, Horst (Oklahoma University) DE, Kaushik (University of Texas at Arlington) MCGUIGAN, Patrick (University of Texas at Arlington) OZTURK, Nurcan (University of Texas at Arlington) SOSEBEE, Mark (University of Texas at Arlington)
24 Acknowledgements
- Windmill team (Kaushik De)
- Don Quijote team (Miguel Branco)
- ATLAS production group; Luc Goossens, CERN IT (prodDB)
- ATLAS software distribution team (Alessandro de Salvo, Fred Luehring)
- US ATLAS testbed sites and Grid3 site administrators
- iGOC operations group
- ATLAS Database group (ProdDB Capone-view displays)
- Physics Validation group: UC Berkeley, Brookhaven Lab
More info: US ATLAS Grid DC2 shift procedures; US ATLAS Grid Tools & Services