
1 AMI S.A. Datasets… Solveig Albrand

2 AMI S.A. A set is… A number of things grouped together according to a system of classification, or conceived as forming a whole. A number of things connected in temporal or spatial succession, or by natural production or formation. A collection of instruments, tools, or machines used together in a particular operation. (Just a few of the definitions of “set” in the Shorter Oxford Dictionary.)

3 AMI S.A. Applied to ATLAS. Our production is TDAQ or Monte-Carlo. Our operation is moving files from one ATLAS site to another. An ATLAS dataset is a number of files which have been produced together, or which are usefully grouped together for transport.

4 AMI S.A. The things we put in the sets. Our things are in general files (usually of binary data, but not always). What we really want out of the datasets is not the files themselves but the events in the files. It’s just that we have to transport files. The connection between files and events is quite “natural” in Monte Carlo production.

5 AMI S.A. Dataset Definition Document. “A set of data produced under the same logical conditions and is a minimal portion of data movable across GRID by ATLAS Distributed Data Management system, and is expected to consist of uniform files suitable for processing with the same application in the transformation chain.” (ATLAS Dataset Definition Document)

6 AMI S.A. Monte-Carlo Production. [Diagram: TASK (EVGEN) and TASK (SIMUL), with datasets labelled EVNTS, HITS and LOG. Task = « a set of jobs ».]

7 AMI S.A. Notion of “Task”. A “Task” is a transformation of the events in one or more datasets of a given type into one or more datasets of other types, which are usually (but not necessarily) different from the input type. Note that if more than one type is produced by a task, then we define an output dataset for each type, because the input of a succeeding step will be defined as a unique type.
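The "one output dataset per type" rule can be pictured with a small sketch. This is only an illustration of the idea on the slide, not AMI or production-system code: the `Dataset` and `Task` classes, their fields, and the example names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Dataset:
    name: str        # hypothetical name, for illustration only
    data_type: str   # e.g. "EVNTS", "HITS", "LOG"


@dataclass(frozen=True)
class Task:
    name: str
    input_type: str                 # every input dataset must have this type
    output_types: Tuple[str, ...]   # one output dataset is defined per type

    def run(self, inputs: List[Dataset]) -> List[Dataset]:
        # The input of a step is defined as a unique type.
        for ds in inputs:
            if ds.data_type != self.input_type:
                raise ValueError(
                    f"{ds.name}: expected {self.input_type}, got {ds.data_type}"
                )
        # Define one output dataset for each type the task produces.
        return [Dataset(f"{self.name}.{t}", t) for t in self.output_types]


simul = Task("simul_example", input_type="EVNTS", output_types=("HITS", "LOG"))
outputs = simul.run([Dataset("evgen_example.EVNTS", "EVNTS")])
```

Because each output has exactly one type, any of them can be handed directly to a succeeding task that expects that type.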

8 AMI S.A. AMI Provenance Diagram

9 AMI S.A. What about real data? Discussions are ongoing about how datasets will be formed for real data, and even for commissioning. For the CTB in 2004: 1 run = 1 dataset of “RAW” type; then, from each “RAW” dataset, several “recon” tasks produced ESD. This was in pre-DDM days.

10 AMI S.A. DDM requires: “A set of data produced under the same logical conditions and is a minimal portion of data movable across GRID by ATLAS Distributed Data Management system.” It seems that one CSC run is too small to be moved across the grid, so several runs are grouped together, according to the metadata. New VERSIONS of the dataset must be defined as runs become validated, or not.
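One way to picture the versioning requirement is the sketch below. It is an assumption-laden illustration, not the DDM or AMI mechanism: the `next_version` helper, the idea of a version being the set of currently validated runs, and the run numbers are all hypothetical. The `v000001` label style is borrowed from the dataset names shown on the next slide.

```python
def next_version(versions, validated_runs):
    """Return the version history, extended if the validated run set changed.

    `versions` is a list of (label, frozenset_of_runs) pairs, oldest first.
    """
    runs = frozenset(validated_runs)
    if versions and versions[-1][1] == runs:
        return versions  # same content, so no new version is needed
    label = f"v{len(versions) + 1:06d}"  # v000001, v000002, ...
    return versions + [(label, runs)]


history = []
history = next_version(history, {38, 39})       # first grouping of runs
history = next_version(history, {38, 39})       # nothing changed
history = next_version(history, {38, 39, 50})   # another run was validated
labels = [label for label, _ in history]
```

Earlier versions stay in the history unchanged, so a dataset version that has already been moved never silently changes its content.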

11 AMI S.A. Tiles & Larg. [Figure: green runs, blue runs, red runs.] Example dataset names: larg.000038.BarrelP3C.Pedestal.high.v000001 and larg.000050.EC_Installation.Trigger.high.v000001
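The two example names are dot-separated and end in a `v`-prefixed version number, so they can be split mechanically. The sketch below does only that. The slide does not define the meaning of the fields, so the labels `prefix`, `number` and `middle` are assumptions, and `parse_name` is a hypothetical helper rather than part of AMI.

```python
import re
from typing import NamedTuple


class ParsedName(NamedTuple):
    prefix: str    # first field, e.g. "larg" (meaning assumed)
    number: int    # second field, zero-padded in the name (meaning assumed)
    middle: tuple  # remaining descriptive fields, left uninterpreted
    version: int   # trailing "vNNNNNN" field


def parse_name(name: str) -> ParsedName:
    parts = name.split(".")
    if len(parts) < 3:
        raise ValueError(f"too few fields in {name!r}")
    match = re.fullmatch(r"v(\d+)", parts[-1])
    if match is None:
        raise ValueError(f"last field of {name!r} is not a version")
    return ParsedName(
        prefix=parts[0],
        number=int(parts[1]),
        middle=tuple(parts[2:-1]),
        version=int(match.group(1)),
    )


parsed = parse_name("larg.000038.BarrelP3C.Pedestal.high.v000001")
```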

12 AMI S.A. How will the datasets be formed? TDAQ will write a certain amount of metadata into the header of each file. Probably this should also be written into a database: surely we should not have to open each file to decide which dataset it belongs to?

13 AMI S.A. Event Collections. Cf. Caitrina’s talk yesterday. We are interested in the events, but we can only transport events in files. The files should be transparent to the user. Note that the SAME set of files can be required for several DIFFERENT event collections. (How will we tell DDM this? Perhaps we don’t need to.)

14 AMI S.A. 2 Collections, same set of files
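A toy model of this situation is sketched below. It assumes, purely for illustration, that an event collection is a list of (file name, event index) references. The collection names, file names and the `files_needed` helper are hypothetical and do not describe the real collection format.

```python
# Two collections that select different events from the same two files.
collections = {
    "collection_a": [("file_001", 3), ("file_002", 7)],
    "collection_b": [("file_001", 12), ("file_002", 1), ("file_002", 9)],
}


def files_needed(collection):
    """Return the set of files that must be present to read a collection."""
    return {file_name for file_name, _ in collection}


needed_a = files_needed(collections["collection_a"])
needed_b = files_needed(collections["collection_b"])
same_files = needed_a == needed_b
```

In this model the transport layer only ever sees the set of files, which is identical for both collections even though the selected events differ.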

15 AMI S.A. Other datasets. For Monte-Carlo production, to get the “cross-section” calculated by EVGEN we need to parse the log files (done by AMI). We need only look at one log file per task. Either get ALL evgen logs for all evgen tasks, OR make a “secondary” dataset, consisting of the first two evgen log files of each task’s primary evgen log dataset, and open a subscription to this dataset on some site accessible to AMI. Actually, even doing this we end up transporting rather more than we need to, because in fact the “log” datasets contain the whole sandbox, and we only need the “log” file output by the job.
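The "secondary" dataset idea amounts to a simple selection, sketched below. This is an illustration only: `secondary_log_dataset`, the task names and the file names are hypothetical, and taking "first" to mean first in sorted name order is an assumption the slide does not state.

```python
def secondary_log_dataset(log_files_by_task, per_task=2):
    """Pick the first `per_task` log files from each task's primary log dataset."""
    selected = []
    for task in sorted(log_files_by_task):
        # Assumption: "first" means first when the file names are sorted.
        selected.extend(sorted(log_files_by_task[task])[:per_task])
    return selected


primary = {
    "evgen_task_1": ["log._00003", "log._00001", "log._00002"],
    "evgen_task_2": ["log._00001"],
}
secondary = secondary_log_dataset(primary)
```

Two files per task gives a spare in case one log is unusable, while still moving far fewer files than subscribing to every primary log dataset.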

16 AMI S.A. Conditions DB. Some trials have been made of transporting snapshots of the conditions DB to ATLAS sites, using DDM.

17 AMI S.A. How many datasets are we expecting? Used the computing model in 2 ways: (1) Raw data + analysis model → 128 million. (2) Storage estimate → N files (22350000000), then nFiles/dataset → 22 million. But 42 million is just as good an answer as any…
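The second estimate can be checked with a line of arithmetic. The slide gives the file count and the resulting number of datasets but not the files-per-dataset figure used, so the value of 1000 below is an assumption inferred from those two numbers.

```python
N_FILES = 22_350_000_000       # storage estimate quoted on the slide
DATASETS_QUOTED = 22_000_000   # the "22 million" quoted on the slide

# Files per dataset implied by the two quoted numbers (about 1016).
implied_files_per_dataset = N_FILES / DATASETS_QUOTED

# Assumption: a round figure of 1000 files per dataset was used.
ASSUMED_FILES_PER_DATASET = 1000
estimated_datasets = N_FILES // ASSUMED_FILES_PER_DATASET  # 22_350_000
```

With 1000 files per dataset the estimate comes out at about 22 million, matching the slide. It is roughly a factor of six below the 128 million obtained from the raw data and analysis model, which is the spread behind the slide's closing remark.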

