1
Grid Production Experience in the ATLAS Experiment
Horst Severini, University of Oklahoma
Kaushik De, University of Texas at Arlington
D0-SAR Workshop, LaTech, April 7, 2004
2
ATLAS Data Challenges

Original goals (Nov 15, 2001):
- Test the computing model, its software, and its data model, and ensure the correctness of the technical choices to be made
- Data Challenges should be executed at the prototype Tier centres
- Data Challenges will be used as input for a Computing Technical Design Report (due by the end of 2003?) and for preparing a MoU

Current status:
- Goals are evolving as we gain experience
- Computing TDR ~end of 2004
- DCs are a ~yearly sequence of increasing scale and complexity
- DC0 and DC1 (completed); DC2 (2004), DC3, and DC4 planned
- Grid deployment and testing is a major part of the DCs
3
ATLAS DC1: July 2002 - April 2003

Goals:
- Produce the data needed for the HLT TDR
- Get as many ATLAS institutes involved as possible
- Worldwide collaborative activity

Participation: 56 institutes
Australia, Austria, Canada, CERN, China, Czech Republic, Denmark*, France, Germany, Greece, Israel, Italy, Japan, Norway*, Poland, Russia, Spain, Sweden*, Taiwan, UK, USA*
(* using Grid)
4
March 29, 2004 LaTech D0SAR Meeting 4 (6300)(84) 2.5x10 6 Reconstruction + Lvl1/2 14165022 4x10 6 Lumi02 Pile-up 29600125 3x10 7 Simulation Single part. 60 21 23 TB Volume of data 51000 (+6300) 3750 6000 30000CPU-days (400 SI2k) kSI2k.months 4x10 6 2.8x10 6 10 7 No. of events CPU Time Process 690 (+84) Total 50Reconstruction 78 Lumi10 Pile-up 415Simulation Physics evt. DC1 Statistics (G. Poulard, July 2003)
5
U.S. ATLAS DC1 Data Production

- Year-long process, Summer 2002 - 2003
- Played the 2nd largest role in ATLAS DC1
- Exercised both farm- and grid-based production
- 10 U.S. sites participating:
  Tier 1: BNL; prototype Tier 2s: BU, IU/UC; Grid Testbed sites: ANL, LBNL, UM, OU, SMU, UTA (UNM & UTPA will join for DC2)
- Generated ~2 million fully simulated, piled-up and reconstructed events
- U.S. was the largest grid-based DC1 data producer in ATLAS
- Data used for the HLT TDR, the Athens physics workshop, reconstruction software tests...
6
U.S. ATLAS Grid Testbed

- BNL - U.S. Tier 1, 2000 nodes, 5% for ATLAS, 10 TB, HPSS through Magda
- LBNL - pdsf cluster, 400 nodes, 5% for ATLAS (more if idle, ~10-15% used), 1 TB
- Boston U. - prototype Tier 2, 64 nodes
- Indiana U. - prototype Tier 2, 64 nodes
- UT Arlington - new, 200 CPUs, 50 TB
- Oklahoma U. - OSCER facility
- U. Michigan - test nodes
- ANL - test nodes, JAZZ cluster
- SMU - 6 production nodes
- UNM - Los Lobos cluster
- U. Chicago - test nodes
7
U.S. Production Summary

* Total of ~30 CPU-years delivered to DC1 from the U.S.
* Total produced file size: ~20 TB on the HPSS tape system, ~10 TB on disk
* (Per-site summary table not reproduced here; black entries = majority grid-produced, blue entries = majority farm-produced)

Exercised both farm- and grid-based production
Valuable large-scale grid-based production experience
8
DC1 Production Systems

- Local batch systems - bulk of production
- GRAT - grid scripts, ~50k files produced in the U.S.
- NorduGrid - grid system, ~10k files in Nordic countries
- AtCom - GUI, ~10k files at CERN (mostly batch)
- GCE - Chimera based, ~1k files produced
- GRAPPA - interactive GUI for individual users
- EDG/LCG - test files only
- ...plus systems I forgot

More systems coming for DC2: Windmill, GANGA, DIAL
9
GRAT Software

GRid Applications Toolkit, developed by Kaushik De, Horst Severini, Mark Sosebee, and students
- Based on Globus, Magda & MySQL
- Shell & Python scripts, modular design
- Rapid development platform: quickly develop packages as needed by the DC
  - Physics simulation (GEANT/ATLSIM)
  - Pile-up production & data management
  - Reconstruction
- Test grid middleware, test grid performance
- Modules can be easily enhanced or replaced, e.g. EDG resource broker, Chimera, replica catalogue... (in progress)
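As a hedged illustration of the "shell & Python scripts, modular design" point (this is not the actual GRAT code): a minimal wrapper that pre-stages one partition's input, submits the simulation through a Globus gatekeeper, and hands the job contact back to the caller. globus-job-submit is the standard Globus 2 client command; magda_getfile is a hypothetical placeholder for whatever Magda staging call the real scripts use.

# Illustrative GRAT-style wrapper (not the real toolkit). Assumes the Globus 2
# client tools are on the PATH; "magda_getfile" is a made-up stand-in for the
# actual Magda staging command.
import subprocess

def run_partition(gatekeeper, partition, executable="atlsim.sh"):
    """Pre-stage input for one partition and submit it to a remote gatekeeper."""
    # pre-staging: fetch the input file registered for this partition
    subprocess.check_call(["magda_getfile", "dc1.%06d.input" % partition])
    # batch submission: hand the job to the site's GRAM gatekeeper
    contact = subprocess.check_output(
        ["globus-job-submit", gatekeeper, executable, str(partition)])
    return contact.strip()   # job contact string, used later for status polling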
10
GRAT Execution Model

1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-staging
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-staging
9. Cataloging
10. Monitoring

(Diagram: the steps flow between the DC1 production database at UTA, the parameter database at CERN, MAGDA at BNL, a local replica catalog, and the remote gatekeeper's batch execution scratch area.)
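For concreteness, a sketch of what step 2 (partition selection) against a MySQL production database could look like. MySQLdb is the standard Python MySQL client, but the database name, table, and columns below are invented for this example and are not GRAT's actual schema.

# Hypothetical partition selection against a MySQL production database.
# The schema (dc1prod.partitions with id/status columns) is invented.
import MySQLdb

def select_partition(host, user, passwd):
    db = MySQLdb.connect(host=host, user=user, passwd=passwd, db="dc1prod")
    cur = db.cursor()
    # lock one unprocessed partition so submit scripts at different sites
    # cannot grab the same work unit
    cur.execute("SELECT id FROM partitions WHERE status='new' LIMIT 1 FOR UPDATE")
    row = cur.fetchone()
    if row is None:
        db.rollback()
        return None
    cur.execute("UPDATE partitions SET status='assigned' WHERE id=%s", (row[0],))
    db.commit()
    return row[0]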
11
U.S. Middleware Evolution

(Diagram comparing the middleware used in the U.S.; the annotations from the figure were:)
- Used for 95% of DC1 production
- Used successfully for simulation
- Tested for simulation, used for all grid-based reconstruction
- Used successfully for simulation (complex pile-up workflow not yet)
12
DC1 Production Experience

Grid paradigm works, using Globus
- Opportunistic use of existing resources, run anywhere, from anywhere, by anyone...

Successfully exercised grid middleware with increasingly complex tasks:
- Simulation: create physics data from pre-defined parameters and input files; CPU intensive
- Pile-up: mix ~2500 min-bias data files into physics simulation files; data intensive
- Reconstruction: data intensive, multiple passes
- Data tracking: multiple steps, one -> many -> many more mappings (see the sketch below)
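A minimal sketch of what the one -> many -> many more bookkeeping means in practice; the file names and the two dictionaries are invented for illustration and are not the actual catalog layout.

# Toy provenance structure: one generator file fans out to many simulated
# partitions, and each of those fans out again after pile-up. Names invented.
simulated = {
    "dc1.002000.evgen.root": ["dc1.002000.simul.0001.root",
                              "dc1.002000.simul.0002.root"],
}
piled_up = {
    "dc1.002000.simul.0001.root": ["dc1.002000.lumi02.0001.root",
                                   "dc1.002000.lumi10.0001.root"],
    "dc1.002000.simul.0002.root": ["dc1.002000.lumi02.0002.root"],
}

def downstream(evgen_file):
    """All pile-up outputs ultimately derived from one generator input."""
    return [out for sim in simulated.get(evgen_file, [])
                for out in piled_up.get(sim, [])]

print(downstream("dc1.002000.evgen.root"))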
13
New Production System for DC2

Goals:
- Automated data production system for all ATLAS facilities
- Common database for all production - Oracle currently
- Common supervisor run by all facilities/managers - Windmill
- Common data management system - Don Quichote
- Executors developed by middleware experts (Capone, LCG, NorduGrid, batch systems, CanadaGrid...)
- Final verification of data done by the supervisor
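To make the supervisor/executor split concrete, a conceptual sketch follows; the class and method names are invented here and are not Windmill's actual vocabulary - the real contract is the XML message library described on the next slides.

# Conceptual executor interface: one implementation per facility
# (Capone, LCG/Lexor, NorduGrid, legacy batch, ...). Method names are
# illustrative only.
class Executor(object):
    def capacity(self):
        """Tell the supervisor how many new jobs this facility can accept."""
        raise NotImplementedError
    def submit(self, job_definitions):
        """Translate common job definitions into facility-specific submissions."""
        raise NotImplementedError
    def status(self, job_ids):
        """Report per-job state so the supervisor can verify and record output."""
        raise NotImplementedError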
14
Windmill - Supervisor

Supervisor development / U.S. DC production team:
- UTA: Kaushik De, Mark Sosebee, Nurcan Ozturk + students
- BNL: Wensheng Deng, Rich Baker
- OU: Horst Severini
- ANL: Ed May

Windmill web page: http://www-hep.uta.edu/windmill

Windmill status:
- version 0.5 released February 23
- includes complete library of XML messages between agents
- includes sample executors for local, PBS and web services
- can run on any Linux machine with Python 2.2
- development continuing - Oracle production DB, DMS, new schema
15
Windmill Messaging

(Diagram: the supervisor agent and executor agent exchange XMPP (XML) messages through a Jabber server acting as an XML switch; the executor can also talk SOAP to a web server.)

- All messaging is XML based
- Agents communicate using the Jabber (open chat) protocol
- Agents have the same command line interface - GUI in future
- Agents & web server can run at the same or different locations
- Executor accesses the grid directly and/or through web services
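A small sketch of the kind of XML payload involved; the element names are invented for illustration (the real schemas live in Windmill's message library), and in production the string would travel inside an XMPP message stanza routed by the Jabber server, or over SOAP to a web-service executor.

# Build an illustrative supervisor-to-executor XML message. Element names are
# hypothetical; the transport (XMPP via the Jabber server, or SOAP) is omitted.
import xml.etree.ElementTree as ET

def build_request(num_jobs):
    root = ET.Element("supervisorRequest")
    ET.SubElement(root, "verb").text = "requestJobs"
    ET.SubElement(root, "count").text = str(num_jobs)
    return ET.tostring(root)

print(build_request(10))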
16
Intelligent Agents

Supervisor/executor are intelligent communication agents
- uses the Jabber open source instant messaging framework
- the Jabber server routes XMPP messages - acts as an XML data switch
- reliable p2p asynchronous message delivery through firewalls
- built-in support for dynamic 'directory', 'discovery', 'presence'
- extensible - we can add monitoring, debugging agents easily
- provides 'chat' capability for free - collaboration among operators
- Jabber grid proxy under development (LBNL - Agarwal)

(Diagram: Jabber clients connected to the Jabber server via XMPP.)
17
Core Windmill Libraries

- interact.py - command line interface library
- agents.py - common intelligent agent library
- xmlkit.py - XML creation (generic) and parsing library
- messages.py - XML message creation (specific)
- proddb.py - production database methods for Oracle, MySQL, local, dummy, and possibly other options
- supervise.py - supervisor methods to drive production
- execute.py - executor methods to run facilities
18
Capone Executor

Various executors are being developed:
- Capone - U.S. VDT executor, by U. of Chicago and Argonne
- Lexor - LCG executor, mostly by Italian groups
- NorduGrid, batch (Munich), Canadian, Australian(?)

Capone is based on GCE (Grid Computing Environment)
(VDT Client/Server, Chimera, Pegasus, Condor, Globus)

Status:
- Python module
- Process "thread" for each job
- Archive of managed jobs
- Job management
- Grid monitoring
- Aware of key parameters (e.g. available CPUs, jobs running)
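The "process thread for each job" and "archive of managed jobs" items can be pictured as below; this is an illustration with toy job states, not Capone's implementation.

# Toy version of a thread-per-job executor with an archive of managed jobs.
import threading
import time

class JobThread(threading.Thread):
    def __init__(self, job_id):
        threading.Thread.__init__(self)
        self.job_id = job_id
        self.state = "submitted"
    def run(self):
        # a real executor would step the job through its grid states here
        time.sleep(0.1)
        self.state = "finished"

archive = {}                     # job id -> thread; finished jobs stay queryable
for jid in ("job-001", "job-002"):
    archive[jid] = JobThread(jid)
    archive[jid].start()
for t in archive.values():
    t.join()
print(dict((jid, t.state) for jid, t in archive.items()))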
19
Capone Architecture (diagram from Marco Mambelli)

Layers shown in the diagram:
- Message interface: Web Service and Jabber protocols, talking to Windmill (and ADA)
- Translation level
- CPE (Capone Process Engine)
- Processes: Grid, Stub, DonQuixote
20
Windmill Screenshots
21
22
Web Services Example
23
Conclusion

- Data Challenges are important for ATLAS software and computing infrastructure readiness
- Grids will be the default testbed for DC2
- U.S. playing a major role in DC2 planning & production
- 12 U.S. sites ready to participate in DC2
- Major U.S. role in production software development
- Test of the new grid production system imminent
- Physics analysis will be the emphasis of DC2 - new experience
- Stay tuned