Published by Henry Wilcox. Modified over 8 years ago.
1
2007-11-22 Joe Foster
Experiment Metadata
Two questions about datasets:
–How do you find datasets with the processes, cuts, and conditions you need for your analysis?
–How do you find out what is in a particular dataset?
General answers:
–Go for coffee and ask.
–Web pages.
–Databases.
Different solutions for each experiment. Examples:
–Babar
–D0
–ATLAS
2
Babar Bookkeeping
Dave Bailey
3
Why?
The bookkeeping system keeps track of data that have passed a chain of checks and have been declared good for use.
The information is organized in datasets.
The idea of a dataset is that users don't need to know about production details such as good and bad runs or releases.
Production systems insert data directly into the bookkeeping.
The information in the tables is self-consistent, so users shouldn't need to look for information in other systems.
The history of each dataset is maintained.
There is support for merged collections (produced from more than one run).
4
How?
Information is held in dedicated databases:
–Oracle at SLAC
–MySQL at sites around the world
The database keeps track of the data that are available and also of what is on disk at each site:
–The “on disk” information is local to each site.
–Consistent user experience everywhere, using perl scripts to query the database (this hides the SQL queries).
–Structure is held in the database schema (table relationships).
All databases are “open access”, so users at any site can query the database at another site to check the status of files and see what’s available locally.
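As a rough illustration of how a script can hide the SQL from the user, here is a minimal Python sketch. It is not the Babar tooling: the slides mention perl scripts against Oracle and MySQL, while this uses Python's built-in sqlite3 as a stand-in, and the table name `files`, its columns, and the dataset and file names are all hypothetical.

```python
import sqlite3


def open_bookkeeping():
    """Create an in-memory stand-in for a site-local bookkeeping database.

    The schema and rows are hypothetical, for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (dataset TEXT, name TEXT, on_disk INTEGER)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [
            ("dataset-A", "a-001.root", 1),
            ("dataset-A", "a-002.root", 0),
            ("dataset-B", "b-001.root", 1),
        ],
    )
    return conn


def files_on_disk(conn, dataset):
    """Return the names of the files of `dataset` that are on disk locally.

    The caller passes a dataset name and never sees the SQL.
    """
    rows = conn.execute(
        "SELECT name FROM files WHERE dataset = ? AND on_disk = 1 ORDER BY name",
        (dataset,),
    )
    return [name for (name,) in rows]


conn = open_bookkeeping()
print(files_on_disk(conn, "dataset-A"))
```

Because every site would expose the same function over its own database, a user gets the same view whichever site is queried.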
5
Using the Database
The important point is that ALL tools are database driven.
E.g. copying data from SLAC to Manchester:
Mark the data in the local database for import.
Import the data:
–The import process queries the database to find out what to get.
–It updates the status of files when they are successfully copied.
Make the data available to users:
–Once imported, data is uploaded to, in our case, xrootd on the Tier2.
–The status is updated in the database to reflect this.
Users can now query the local database to see what is available.
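The mark, import, publish sequence above can be sketched as a small status machine driven entirely by the database. This is an illustrative sketch only: the status names (`remote`, `marked`, `copied`, `available`), the `files` table, and the `copy_file` callback are assumptions, and sqlite3 stands in for the real site database.

```python
import sqlite3


def make_db(file_names):
    """Create a hypothetical local bookkeeping table; every file starts as 'remote'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO files VALUES (?, 'remote')", [(n,) for n in file_names]
    )
    return conn


def set_status(conn, names, status):
    """Record a new status for the given files."""
    conn.executemany(
        "UPDATE files SET status = ? WHERE name = ?", [(status, n) for n in names]
    )


def with_status(conn, status):
    """Return the sorted names of all files currently in `status`."""
    rows = conn.execute(
        "SELECT name FROM files WHERE status = ? ORDER BY name", (status,)
    )
    return [name for (name,) in rows]


def import_marked(conn, copy_file):
    """Ask the database what to fetch, copy it, and record each success."""
    for name in with_status(conn, "marked"):
        if copy_file(name):  # copy_file is a caller-supplied transfer function
            set_status(conn, [name], "copied")


def publish(conn):
    """Flag every copied file as available to users."""
    set_status(conn, with_status(conn, "copied"), "available")


conn = make_db(["f1", "f2", "f3"])
set_status(conn, ["f1", "f2"], "marked")        # step 1: mark for import
import_marked(conn, lambda name: name != "f2")  # step 2: pretend the f2 copy fails
publish(conn)                                   # step 3: make data available
print(with_status(conn, "available"))
```

A failed copy simply stays in the `marked` state, so rerunning the import picks it up again without any extra bookkeeping.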
6
Details...
7
Experiment Metadata in ATLAS
The computing model is Grid based.
Scope of this section:
–Still only MC data.
–Conditions and calibration databases not covered.
–Finding datasets for a given process and cuts.
–Finding the contents, software version, and provenance of a dataset.
Sources of information:
–Colleagues.
–Dataset names: trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000604
–Grid tools: DQ2
–Web pages: DC3 Requests, Panda Monitor, Atlas Metadata Interface, CSC reco datasets
8
Info in Dataset Names
Example: trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000604
Logical File Name of the dataset. Convention not really standard, but usually helps.
Labels on the example name, in order:
–trig1: Trigger info
–misal1: Misalignments applied
–mc12: MC release 12
–005200: Run number
–T1: Trigger?
–McAtNlo_Jimmy: Generators
–recon: Reconstructed
–AOD: Analysis Object Data
–v12000604: Reconstruction version
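Since the name is dot-separated, its parts can be pulled apart mechanically. The sketch below is a minimal Python example that assumes exactly six dot-separated parts, as in the example name; the field names (`project`, `physics`, `step`, and so on) are labels chosen here for illustration, not official ATLAS terms, and names that do not follow the convention are rejected.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    project: str      # e.g. "trig1_misal1_mc12"
    run_number: int   # e.g. 5200
    physics: str      # e.g. "T1_McAtNlo_Jimmy"
    step: str         # e.g. "recon"
    data_format: str  # e.g. "AOD"
    version: str      # e.g. "v12000604"


def parse_dataset_name(name):
    """Split a dataset name into its six dot-separated fields."""
    parts = name.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}: {name!r}")
    project, run, physics, step, data_format, version = parts
    return DatasetName(project, int(run), physics, step, data_format, version)


info = parse_dataset_name(
    "trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000604"
)
print(info.run_number, info.data_format, info.project.split("_"))
```

Because the convention is "not really standard", a real script would probably want to catch the `ValueError` and fall back to treating the name as opaque.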
9
Don Quijote 2 (DQ2)
Line mode interface to ATLAS Grid tasks and datasets.
–Search and list by logical file name of dataset.
Example: dq2_ls valid1_misal1_mc12.005200.*
Lists the dataset name, and optionally the names and sizes of its files.
Doesn’t tell you what is in the files.
–Download datasets for local processing.
dq2_get
Some sites don’t recognise ATLAS credentials yet!
https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedDataManagement
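To show what a wildcard search on dataset names selects, here is a small Python sketch using shell-style pattern matching from the standard library. It does not call DQ2 or reproduce its matching rules; it only assumes that `*` behaves like a shell wildcard. Apart from the `trig1_...` name taken from the slides, the catalogue entries are made up.

```python
from fnmatch import fnmatchcase

# Hypothetical local list of dataset names (only the trig1 entry is from the slides).
catalogue = [
    "trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000604",
    "valid1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000604",
    "valid1_misal1_mc12.005001.example_sample.recon.AOD.v12000604",
]


def match_datasets(names, pattern):
    """Return the names that match a shell-style wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


for name in match_datasets(catalogue, "valid1_misal1_mc12.005200.*"):
    print(name)
```

The pattern fixes the project prefix and the run number and leaves everything after them free, which is why only the second catalogue entry is selected.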
10
WWW: ATLAS Computing Commissioning Requests
http://jarguin.home.cern.ch/jarguin/dc3requests.html
Pages for:
–High priority samples
–Standard model + calibration
–Beyond standard model + Higgs
–Single particles
For each request for MC production:
–Process
–Category + subcategory
–Cuts
–Filter efficiency
–Cross section
–Number of events
–Simulated luminosity
–Generator
–Data set number (= ‘run number’), which can be used to search dataset names
–Requester
–Link to documentation, etc.
No info on individual datasets.
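Several of the listed quantities are related. Assuming the usual relation, the simulated luminosity is the number of events divided by the effective cross section (cross section times filter efficiency); whether the requests page computes its column exactly this way is an assumption. A minimal Python sketch with invented numbers:

```python
def simulated_luminosity(n_events, cross_section_pb, filter_efficiency):
    """Return the equivalent integrated luminosity in pb^-1.

    Uses L = N / (sigma * efficiency), with sigma in pb.
    """
    if cross_section_pb <= 0 or not 0 < filter_efficiency <= 1:
        raise ValueError("cross section must be > 0 and efficiency in (0, 1]")
    return n_events / (cross_section_pb * filter_efficiency)


# Invented example: 500000 events, 100 pb cross section, 50% filter efficiency.
print(simulated_luminosity(500_000, 100.0, 0.5), "pb^-1")
```

A lower filter efficiency means the same number of stored events represents more luminosity, because more generated events were thrown away to obtain them.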
11
PANDA Monitor
PANDA: Production ANd Distributed Analysis system
–System for running jobs and data access over the Grid.
–Designed by US ATLAS for OSG, but now has links to many sites on LCG, NorduGrid, etc.
Panda monitor is a web interface to a big database.
http://panda.atlascomp.org
–Task and job monitoring: subtasks at different sites, input task (provenance), datasets produced, configuration data, status.
–Dataset catalog: list of Logical File Names, replicas.
–Dataset replication (‘subscriptions’).
–Fairly flexible search options.
12
Screenshot labels: LFN of dataset, Task name, Provenance, Subtask ID, Configuration
13
Part of Panda Dataset Listing
Screenshot label: Subtask ID
14
Atlas Metadata Interface (AMI)
‘Official’ ATLAS metadata interface.
https://atlastagcollector.in2p3.fr:8443/AMI/servlet/net.hep.atlas.Database.Bookkeeping.AMI.Servlet.Command
–Links many data sources in a flexible way.
–Dataset search on properties of the data, or just on the dataset name.
–Links from search results to, e.g., provenance.
15
AMI Advanced Search
16
AMI Search Result
17
AMI Provenance Search Result
18
CSC reco datasets
http://www-f9.ijs.si/cgi-bin/csc/csc_reco_datasets.cgi
Lists of current and recent Computing System Commissioning (CSC) tasks for many physics processes.
Some provenance information is listed too.
Links to Panda monitor for subtasks.
Quick and easy way to find recent data.
19
CSC reco datasets pages
Screenshot label: Run numbers
20
CSC reco Results for Run 5200
Screenshot labels: Links to Panda monitor for subtasks, Task names