
The first year of LHC physics analysis using the GRID: Prospects from ATLAS. Davide Costanzo, University of Sheffield


1 The first year of LHC physics analysis using the GRID: Prospects from ATLAS. Davide Costanzo, University of Sheffield (davide.costanzo@cern.ch)

2 Giant detector, giant computing

3 ATLAS Computing
Grid-based, multi-tier computing model:
– Tier-0 at CERN. First-step processing (within 24 hours), storage of raw data, first-pass calibration
– Tier-1. About 10 worldwide. Reprocessing, data storage (real data and simulation), …
– Tier-2. Regional facilities. Storage of Analysis Object Data, simulation, …
– Tier-3. Small clusters, users' desktops
Three different "flavors" of grid middleware:
– LCG in Europe, Canada and the Far East
– OSG in the US
– NorduGrid in Scandinavia and a few other countries
A sketch of this tier layout follows below.
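As an illustration of the tier hierarchy above, here is a minimal Python sketch. It is not ATLAS software, and the site names are hypothetical; it only models the division of responsibilities between tiers and the middleware flavors.

    # Minimal sketch of the ATLAS multi-tier computing model (illustrative only).
    # Site names below are hypothetical examples, not an official site list.

    TIER_ROLES = {
        "Tier-0": ["first-pass processing", "raw data storage", "first-pass calibration"],
        "Tier-1": ["reprocessing", "real and simulated data storage"],
        "Tier-2": ["AOD storage", "simulation"],
        "Tier-3": ["end-user analysis"],
    }

    SITES = [
        {"name": "CERN",          "tier": "Tier-0", "middleware": "LCG"},
        {"name": "Example-T1",    "tier": "Tier-1", "middleware": "LCG"},
        {"name": "Example-T2-US", "tier": "Tier-2", "middleware": "OSG"},
        {"name": "Example-T2-NG", "tier": "Tier-2", "middleware": "NorduGrid"},
    ]

    def sites_with_role(role):
        """Return the names of sites whose tier is responsible for a given role."""
        return [s["name"] for s in SITES if role in TIER_ROLES[s["tier"]]]

    if __name__ == "__main__":
        print(sites_with_role("AOD storage"))   # -> the hypothetical Tier-2 sites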

4 Event Processing Data Flow
Raw Data Objects (RDO): detector output (bytestream object view) or simulation output. Size/event: ~3 MBytes
→ Detector Reconstruction →
Event Summary Data (ESD): Tracks, Segments, Calorimeter Towers, … Size/event: 500 kBytes (target for stable data taking)
→ Combined Reconstruction →
Analysis Object Data (AOD): analysis objects such as Electron, Photon, Muon, TrackParticle, … Size/event: 100 kBytes
→ User Analysis
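To make the per-event sizes concrete, the following Python sketch (purely illustrative; only the sizes quoted on the slide are taken from the source) estimates the storage each format needs for a sample of a given size.

    # Back-of-the-envelope storage estimate for the event formats on this slide.
    # Sizes per event are the ones quoted above; everything else is illustrative.

    SIZE_PER_EVENT_MB = {
        "RDO": 3.0,    # detector/simulation output, ~3 MB/event
        "ESD": 0.5,    # 500 kB/event (target for stable data taking)
        "AOD": 0.1,    # 100 kB/event
    }

    def storage_tb(n_events, fmt):
        """Storage needed in terabytes for n_events in the given format."""
        return n_events * SIZE_PER_EVENT_MB[fmt] / 1e6

    if __name__ == "__main__":
        n = 10_000_000  # e.g. a 10M-event sample
        for fmt in ("RDO", "ESD", "AOD"):
            print(f"{fmt}: {storage_tb(n, fmt):.1f} TB for {n:,} events")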

5 Simplified ATLAS Analysis
Ideal Scenario:
– Read the AOD and create an ntuple
– Loop over the ntuple and make histograms
– Use ROOT, make plots
– Go to ICHEP (or another conference)
Realistic Scenario:
– Customization in the AOD-building stage
– Different analyses have different needs
Start-up Scenario:
– Iterations needed on some data samples to improve Detector Reconstruction
Distributed event processing (on the Grid):
– Data sets "scattered" across several grid systems
– Need distributed analysis
(Indicative frequencies given on the slide: several times/day, a few times/week, once a month?)
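The "ideal scenario" above amounts to a short ROOT macro. Here is a hedged PyROOT sketch of the "loop over ntuple and make histograms" step; the file name, tree name and branch name are placeholders, not the actual ATLAS ntuple schema.

    # Sketch of the "loop over ntuple and make histograms" step (PyROOT).
    # "my_ntuple.root", "analysis" and "el_pt" are placeholder names.
    import ROOT

    f = ROOT.TFile.Open("my_ntuple.root")       # ntuple produced from AODs
    tree = f.Get("analysis")                    # placeholder tree name

    h_pt = ROOT.TH1F("h_pt", "Electron p_{T};p_{T} [GeV];Events", 50, 0.0, 200.0)

    for event in tree:                          # event loop
        h_pt.Fill(event.el_pt)                  # placeholder branch name

    c = ROOT.TCanvas("c", "c", 800, 600)
    h_pt.Draw()
    c.SaveAs("el_pt.png")                       # the plot to take to ICHEP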

6 ATLAS and the Grid: Past experience
2002-03 Data Challenge 1:
– Contributions from about 50 sites. First use of the grid
– Prototype distributed data management system
2004 Data Challenge 2:
– Full use of the grid
– ATLAS middleware not fully ready
– Long delays, simulation data not accessible
– Physics validation not possible; events not used for physics analysis
2005 "Rome Physics Workshop" and combined test beam:
– Centralized job definition
– First users' exposure to the Grid (delivered ~10M validated events)
– Learned the pros and cons of Distributed Data Management (DDM)

7 ATLAS and the Grid: Present (and Future)
2006 Computing System Commissioning (CSC) and Calibration Data Challenge:
– Use successive bug-fix software releases to ramp up the system (validation)
– Access (distributed) database data (e.g. calibration data)
– Decentralize job definition
– Test the distributed analysis system
2006-07 Collection of about 25 physics notes:
– Use events produced for the CSC
– Concentrate on techniques to estimate Standard Model backgrounds
– Prepare physicists for the LHC challenge
2006 and beyond: data taking
– ATLAS is already taking cosmic-ray data
– Collider data taking is about to start
– Exciting physics is just around the corner

8 ATLAS Distributed Data Management
ATLAS reviewed all of its Grid systems during the first half of 2005. A new Distributed Data Management system (DDM) was designed, with:
– A hierarchical definition of datasets
– Central dataset catalogues
– Data-blocks as units of file storage and replication
– Distributed file catalogues
– Automatic data transfer mechanisms using distributed services (dataset subscription system)
The DDM system allows the implementation of the basic ATLAS Computing Model concepts, as described in the Computing Technical Design Report (June 2005):
– Distribution of raw and reconstructed data from CERN to the Tier-1s
– Distribution of AODs (Analysis Object Data) to Tier-2 centres for analysis
– Storage of simulated data (produced by Tier-2s) at Tier-1 centres for further distribution and/or processing
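The sketch below is not the DQ2 code or its API; it only illustrates, under stated assumptions, the concepts listed above: a central catalogue mapping dataset names to file lists and replica locations, and a subscription that asks for a dataset to be replicated to a site.

    # Toy illustration of the DDM concepts above (NOT the real DQ2 API).
    # Dataset and site names are invented.

    central_catalogue = {
        # dataset name -> constituent files (the real system is hierarchical,
        # with data-blocks as the unit of storage and replication)
        "mc.example.AOD": ["file_001.pool.root", "file_002.pool.root"],
    }

    replica_locations = {
        # dataset name -> sites currently holding a complete replica
        "mc.example.AOD": {"TIER1-EXAMPLE"},
    }

    subscriptions = []   # (dataset, destination site) pairs waiting to be fulfilled

    def subscribe(dataset, site):
        """Register interest: a distributed transfer service replicates it later."""
        subscriptions.append((dataset, site))

    def fulfil_subscriptions():
        """Stand-in for the transfer agents: copy files and register new replicas."""
        while subscriptions:
            dataset, site = subscriptions.pop(0)
            # ... trigger file transfers for central_catalogue[dataset] here ...
            replica_locations.setdefault(dataset, set()).add(site)

    subscribe("mc.example.AOD", "TIER2-EXAMPLE")
    fulfil_subscriptions()
    print(replica_locations["mc.example.AOD"])   # now includes TIER2-EXAMPLE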

9 ATLAS DDM organization

10 ATLAS Data Management Model
Tier-1s send AOD data to Tier-2s.
Tier-2s produce simulated data and send them to Tier-1s.
In an ideal world (perfect network communication hardware and software) we would not need to define default Tier-1 to Tier-2 associations.
In practice, it turns out to be convenient (and more robust) to partition the Grid so that there are default (not compulsory) data paths between Tier-1s and Tier-2s.
In this model, a number of data management services are installed only at Tier-1s and also act on their "associated" Tier-2s (see the sketch below).
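A minimal sketch of the default Tier-1 to Tier-2 association described above (site names are hypothetical): each Tier-2 is attached to one Tier-1, so the default data path for simulated data produced at a Tier-2 leads to "its" Tier-1.

    # Toy model of default Tier-1/Tier-2 associations; site names are made up.

    ASSOCIATED_TIER1 = {
        "Example-T2-A": "Example-T1-X",
        "Example-T2-B": "Example-T1-X",
        "Example-T2-C": "Example-T1-Y",
    }

    def default_destination(tier2_site):
        """Where simulated data produced at a Tier-2 is stored by default."""
        return ASSOCIATED_TIER1[tier2_site]

    def served_tier2s(tier1_site):
        """Tier-2s whose data-management services run at a given Tier-1."""
        return [t2 for t2, t1 in ASSOCIATED_TIER1.items() if t1 == tier1_site]

    print(default_destination("Example-T2-A"))   # -> Example-T1-X
    print(served_tier2s("Example-T1-X"))         # -> ['Example-T2-A', 'Example-T2-B']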

11 Job Management: Productions
Once data are distributed in the correct way, we can rework the distributed production system to optimise job distribution by sending jobs to the data (or as close to them as possible):
– This was not the case previously: jobs were sent to free CPUs and had to copy the input file(s) to the local worker node from wherever in the world the data happened to be
Next: make better use of the task and dataset concepts:
– A "task" acts on a dataset and produces more datasets
– Use bulk-submission functionality to send all jobs of a given task to the location of their input datasets
– Minimize the dependence on file transfers and the waiting time before execution
– Collect output files belonging to the same dataset at the same SE (storage element) and transfer them asynchronously to their final locations
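A hedged sketch of the "send jobs to the data" idea (not the actual production-system broker): given the sites holding replicas of a task's input dataset, jobs are routed to one of those sites instead of to an arbitrary free CPU.

    # Toy job brokering: send jobs of a task to a site that already holds the
    # input dataset (illustrative only; not the real ProdSys logic).

    DATASET_REPLICAS = {
        "mc.example.RDO": ["SITE-A", "SITE-C"],      # hypothetical replica sites
    }

    FREE_CPUS = {"SITE-A": 120, "SITE-B": 900, "SITE-C": 40}

    def broker(task_dataset):
        """Pick the replica site with the most free CPUs; avoids remote file copies."""
        candidates = DATASET_REPLICAS.get(task_dataset, [])
        if not candidates:
            raise RuntimeError("no replica found; data would have to be transferred first")
        return max(candidates, key=lambda site: FREE_CPUS.get(site, 0))

    print(broker("mc.example.RDO"))   # -> SITE-A, even though SITE-B has more free CPUs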

12 ATLAS Production System (2006)
[Diagram: a central production database, prodDB (jobs), and the Eowyn supervisor dispatch tasks to Grid-specific executors (Lexor and Lexor-CG for EGEE, Dulcinea for NorduGrid, PanDA for OSG, plus an LSF executor and T0MS for the Tier-0); data handling goes through the Data Management System (DMS), DQ2; the components are implemented in Python.]

13 Job Management: Analysis
A system based on a central database (job queue) is good for scheduled productions (as it allows proper priority settings), but too heavy for user tasks such as analysis.
Lacking a global way to submit jobs, a few tools have been developed in the meantime to submit Grid jobs:
– LJSF (Lightweight Job Submission Framework) can submit ATLAS jobs to the LCG/EGEE Grid
– pathena (a parallel version of the ATLAS software framework, Athena) can generate ATLAS jobs that act on a dataset and submit them to PanDA on the OSG Grid
The ATLAS baseline tool to help users submit Grid jobs is Ganga:
– First Ganga tutorial given to ATLAS three weeks ago
– Ganga and pathena are integrated to submit jobs to different grids
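Conceptually, what tools such as pathena and Ganga do for the user is take a job configuration, split the input dataset into sub-jobs, and submit each sub-job close to the data. The Python sketch below mimics only that splitting step; it is not the pathena or Ganga interface, and all names in it are placeholders.

    # Sketch of the dataset-splitting step performed by analysis submission tools
    # (illustrative; NOT the Ganga or pathena API).

    def split_dataset(files, files_per_job):
        """Group the dataset's files into chunks, one chunk per grid sub-job."""
        return [files[i:i + files_per_job] for i in range(0, len(files), files_per_job)]

    def make_subjobs(job_options, files, files_per_job=5):
        """Return a list of sub-job descriptions ready to be submitted."""
        return [
            {"job_options": job_options, "input_files": chunk}
            for chunk in split_dataset(files, files_per_job)
        ]

    dataset_files = [f"AOD._{i:05d}.pool.root" for i in range(1, 23)]  # placeholder names
    subjobs = make_subjobs("MyAnalysis_jobOptions.py", dataset_files)
    print(len(subjobs), "sub-jobs,", len(subjobs[0]["input_files"]), "files in the first")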

14 ATLAS Analysis Work Model
1. Job Preparation:
– Local system (shell): prepare JobOptions → run Athena (interactive or batch) → get output
2. Medium-scale testing:
– Local system (Ganga): prepare JobOptions, find dataset from DDM, generate & submit jobs
– Grid: run Athena
– Local system (Ganga): job book-keeping, access output from Grid, merge results
3. Large-scale running:
– Local system (Ganga): prepare JobOptions, find dataset from DDM, generate & submit jobs
– ProdSys: run Athena on the Grid, store output on the Grid
– Local system (Ganga): job book-keeping, get output

15 Distributed analysis use cases
Statistics analyses (e.g. the W mass) on datasets of several million events:
– Not all data files can be kept on a local disk
– Jobs are sent to the AODs on the grid to make ntuples for analysis
– Parallel processing is required (see the sketch below)
Selection of a few interesting candidate events to analyze (e.g. H→4ℓ):
– The information on the AODs may not be enough
– ESD files are accessed to make a loose selection and copy candidate events to a local disk
These use cases are to be exercised in the coming Computing System Commissioning tests.
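As an illustration of the "parallel processing" point, here is a hedged sketch in which file names and the selection are placeholders: AOD files are processed in parallel and the per-file results merged. In a real analysis each worker would run on the Grid and write an ntuple rather than return a count.

    # Toy parallel processing of AOD files (illustrative; real jobs run on the Grid,
    # not on local cores, and would write ntuples instead of returning counts).
    from multiprocessing import Pool

    def process_file(path):
        """Stand-in for running the analysis on one AOD file; returns selected events."""
        # ... open the file, apply the selection, fill an ntuple ...
        return 42   # placeholder result

    if __name__ == "__main__":
        aod_files = [f"AOD._{i:05d}.pool.root" for i in range(1, 9)]  # placeholder names
        with Pool(processes=4) as pool:
            selected = pool.map(process_file, aod_files)
        print("total selected events:", sum(selected))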

16 From managed production to Distributed Analysis
Centrally managed production is now "routine work":
– Submit a dataset request to a physics group convener
– Physics groups collect requests
– Physics coordination keeps track of all requests and passes them to the computing operations team
– Pros: well organized, uniform software used, well documented
– Cons: bureaucratic! It takes time to get what you need…
Delegate the definition of jobs to physics and combined-performance working groups:
– Removes a management layer
– Still requires central organization to avoid duplication of effort
– Accounting and priorities?
Job definition/submission for every ATLAS user:
– Pros: you get what you want
– Cons: no uniformity, some duplication of effort

17 Resource Management
In order to provide a usable global system, a few more pieces must work as well:
– Accounting at user and group level
– Fair shares (job priorities) for workload management
– Storage quotas for data management
Define ~25 groups and ~3 roles in VOMS:
– They are perhaps not trivial to implement
– They may force a re-thinking of some of the current implementations
In any case we cannot advertise a system that is a "free for all" (no job priorities, no storage quotas):
– Therefore we need these features "now"
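To show what a fair share might mean operationally, here is a minimal illustrative sketch; the group names and share values are invented and are not the real VOMS configuration. A group's job priority drops as its recent usage exceeds its nominal share.

    # Toy fair-share priority: groups that used more than their nominal share
    # recently get a lower priority (purely illustrative numbers and groups).

    NOMINAL_SHARE = {"higgs": 0.30, "susy": 0.30, "standard-model": 0.40}
    RECENT_USAGE  = {"higgs": 0.50, "susy": 0.10, "standard-model": 0.40}

    def priority(group):
        """>1 means the group is owed CPU; <1 means it has over-used its share."""
        used = max(RECENT_USAGE.get(group, 0.0), 1e-6)
        return NOMINAL_SHARE[group] / used

    for g in NOMINAL_SHARE:
        print(f"{g:>15}: priority {priority(g):.2f}")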

18 Conclusions
ATLAS is currently using Grid resources for MC-based studies and for real data from the combined test beam and cosmic rays:
– A user community is emerging
– Continue to review critical components to make sure we have everything we need
Now we need stability and reliability more than new functionality:
– New components may be welcome in production, if they are shown to provide better performance than existing ones, but only after thorough testing in pre-production service instances
The challenge of data taking is still in front of us!
– Simulation exercises can teach us several lessons, but they are just the beginning of the story…

