
1 A Framework-oriented Bridge to Metadata Discussions David Malon, Jack Cranshaw, Peter van Gemmeren, Marcin Nowak, Alexandre Vaniachine US ATLAS Technical Planning Meeting 28 June 2015

2 Reproducing here two I/O requirements for the future framework:
• Input and output infrastructure must be capable of respecting semantic constraints on data organization, such as not interleaving events from different runs or run segments (luminosity blocks).
• The framework needs to provide sufficient bookkeeping to ensure that all events in semantically meaningful units have been processed, and may be required to provide more detailed bookkeeping in jobs that filter events. The I/O layer should facilitate such accounting, and should provide a means to associate metadata with event samples.
28 June 2015 David Malon, US ATLAS Technical Planning Meeting
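The two requirements above can be made concrete with a small sketch. It is only an illustration under stated assumptions: events are represented as hypothetical (run, luminosity block, event number) tuples, and the function names are invented here rather than taken from any ATLAS or Gaudi API.

```python
from collections import defaultdict


def check_completeness(expected, processed_events):
    """Report (run, lumi_block) units whose processed event count is off.

    expected: dict mapping (run, lumi_block) -> expected number of events
              (a hypothetical bookkeeping record, not a real ATLAS format).
    processed_events: iterable of (run, lumi_block, event_number) tuples.
    Returns a dict mapping each incomplete unit to (n_seen, n_expected).
    """
    seen = defaultdict(set)
    for run, lumi_block, event_number in processed_events:
        seen[(run, lumi_block)].add(event_number)

    incomplete = {}
    for unit, n_expected in expected.items():
        n_seen = len(seen.get(unit, ()))
        if n_seen != n_expected:
            incomplete[unit] = (n_seen, n_expected)
    return incomplete


def is_interleaved(events):
    """Return True if a (run, lumi_block) unit reappears after another began."""
    closed = set()   # units we have already moved past
    current = None   # unit of the most recent event
    for run, lumi_block, _ in events:
        unit = (run, lumi_block)
        if unit != current:
            if unit in closed:
                return True
            if current is not None:
                closed.add(current)
            current = unit
    return False
```

A job that filters events would need more than this, for example separate counts of events read and events accepted per unit, which is the "more detailed bookkeeping" the second requirement mentions.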

3 A bit of background
• ATLAS I/O infrastructure has from its inception been designed to support processing (semantically meaningful) collections of events
  - The events in a given run and stream and processing, say, or
  - The events that pass a particular trigger in the good-for-physics luminosity blocks of a given run and stream and processing
  - And so on
• Its design has supported the view that files are not fundamental, but artifacts of storage
  - Physics information/metadata about a collection of events does not depend upon whether they are stored in one file or N files or whether the events are individually scattered across a distributed object store, and
  - Algorithmic processing should be insensitive to whether the next event comes from the same file or a different one
• Old-timers and people who know the code well will realize that this has been true for well over a decade
• That is why one sees the “Implicit Collection” terminology when processing a file: the events that happen to be in that file constitute, implicitly, an event collection
  - But whether they constitute a semantically meaningful set (in the physics sense) is another, separate matter
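The idea that files are artifacts of storage can be sketched in a few lines. This is a toy illustration, assuming files are plain Python lists of events; the function names are hypothetical and do not correspond to Athena interfaces.

```python
from itertools import chain


def implicit_collection(file_contents):
    """View the events that happen to be in one file as an event collection."""
    return iter(file_contents)


def event_collection(files):
    """Present events from any number of files as one continuous stream.

    The consumer never sees a file boundary, so its result cannot depend
    on how the events were distributed across files.
    """
    return chain.from_iterable(implicit_collection(f) for f in files)


def count_passing(events, predicate):
    """A stand-in for algorithmic processing: count events passing a cut."""
    return sum(1 for event in events if predicate(event))
```

Feeding the same six events as one file or as three files gives the same count, which is the insensitivity to file boundaries described above.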

4 But history is not ours alone
• The ATLAS data management infrastructure took a different view
  - Datasets are collections of files (not collections of events that happen to reside in a certain list of files); files are fundamental; one must ensure that well-organized production results in sensible file organization; physics information is encapsulated in names of and metadata about files and file sets; …
• Many other components and people took the same view
  - The “event collection” view is fine for mathematicians and others who care about theory, perhaps, but let’s be practical
• And of course the grid and its supporting infrastructure were just emerging, and it might well have been risky to set one’s sights too high
BUT
• Fast forward to distributed object stores and pending US ASCR proposals about science-aware data delivery
• Perhaps what was too ambitious for the collaboration when we introduced and supported such notions is no longer so.

5 Okay; but what’s the connection with metadata?
• All of this is reviewed here because the metadata infrastructure was designed to follow the same principles as the I/O infrastructure: physics metadata are properly associated with collections of events rather than the (incidental) list of files that host them
• This is why, when people get into the nitty-gritty of file-based processing, they can see both a BeginInputFile incident (opening a physical file) and a BeginEventCollection (aka BeginTagFile) (and they are generally baffled by this seeming redundancy):
  - The file is, in standard ATLAS use, only implicitly (not of necessity!) the event collection being used as input, and when this happens, the collection’s metadata and the file’s metadata may (but not must) coincide
• The difference is important if you care about physics: In a cross-section calculation, the metadata you need corresponds to the set of luminosity blocks from which your events were selected, not the union of all the luminosity blocks in all files that contain some portion of your selection.
  - The infrastructure has correctly supported the distinction
    - That’s why TAG-based selections used as input to an Athena job get the luminosity bookkeeping and other metadata right.
  - That capability may need to be reinvented someday.
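The cross-section point can be illustrated with a toy calculation. Everything here is assumed for illustration: the luminosity table, the file layout, and the event count are invented numbers, and the helper names are hypothetical.

```python
# Hypothetical integrated luminosity per (run, lumi_block), arbitrary units.
LUMI_TABLE = {(1, 1): 2.0, (1, 2): 3.0, (1, 3): 5.0}

# Hypothetical file layout: each file hosts events from several lumi blocks.
FILES = {
    "file_A": [(1, 1), (1, 2)],
    "file_B": [(1, 2), (1, 3)],
}


def integrated_luminosity(blocks, lumi_table):
    """Sum the luminosity over a set of lumi blocks, counting each once."""
    return sum(lumi_table[block] for block in set(blocks))


def blocks_in_touched_files(selected_blocks, files):
    """Union of all lumi blocks in every file holding part of the selection."""
    union = set()
    for blocks in files.values():
        if selected_blocks & set(blocks):
            union |= set(blocks)
    return union


# The event collection was selected from lumi blocks (1, 1) and (1, 2) only.
selected_blocks = {(1, 1), (1, 2)}
n_selected_events = 50  # invented number of selected events

collection_lumi = integrated_luminosity(selected_blocks, LUMI_TABLE)
file_lumi = integrated_luminosity(
    blocks_in_touched_files(selected_blocks, FILES), LUMI_TABLE
)

# Events per unit luminosity, a stand-in for the cross-section normalization.
rate_from_collection = n_selected_events / collection_lumi
rate_from_files = n_selected_events / file_lumi
```

With these numbers the collection-based luminosity is 5.0 while the file-based one is 10.0, because file_B drags in lumi block (1, 3). Normalizing to the file-level union would halve the result.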

6 Metadata is propagated, too
• Output of metadata was designed to be similarly general
  - The placement (caching) of output metadata in the/an output event file is a job configuration choice, not a restriction of the (core) infrastructure
• Metadata cached in input files are made available on transitions across file boundaries via the use of incidents
  - An entirely appropriate strategy in (serial) Athena, as these file transitions are asynchronous to transitions of the Gaudi state machine
    - And Gaudi/Athena has a reasonably well-defined incident-handling infrastructure
    - Albeit a weak state machine model … but I digress
• Client tools (type-specific metadata handlers) are provided with sufficient information (incidents AND state transitions??) to make their own decisions* about how and when and where to propagate metadata
* Give them enough rope and …?
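The incident-driven pattern described above can be sketched as a toy publish/subscribe loop. This is not the Gaudi incident service API: the class names, method names, and metadata keys below are all hypothetical, and only the incident name BeginInputFile comes from the talk.

```python
from collections import defaultdict


class ToyIncidentService:
    """Minimal stand-in for an incident service (hypothetical, not Gaudi's)."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def add_listener(self, incident_type, callback):
        self._listeners[incident_type].append(callback)

    def fire(self, incident_type, payload):
        for callback in self._listeners[incident_type]:
            callback(payload)


class EventCountTool:
    """Hypothetical type-specific metadata handler.

    It decides for itself what to do with each input file's metadata:
    here it simply accumulates the number of input events.
    """

    def __init__(self, incident_service):
        self.total_input_events = 0
        incident_service.add_listener("BeginInputFile", self.on_begin_input_file)

    def on_begin_input_file(self, file_metadata):
        self.total_input_events += file_metadata.get("n_events", 0)


def run_job(files):
    """Serial job: fire an incident at each file boundary, then loop on events."""
    service = ToyIncidentService()
    tool = EventCountTool(service)
    for input_file in files:
        service.fire("BeginInputFile", input_file["metadata"])
        for _event in input_file["events"]:
            pass  # event processing would happen here
    # Output metadata assembled from what the tool chose to propagate.
    return {"n_input_events": tool.total_input_events}
```

In a serial job the incident always arrives before the first event of the new file, so the tool's state is well defined. That ordering guarantee is what becomes harder to state when several events from different files are in flight at once.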

7 So where does that put us?
• All of this should stand us in good stead for getting the metadata and bookkeeping right in a multithreaded framework, and provide a good start in distributed event processing (e.g., event server and its successors), but much work remains
• Incident-driven handling made great sense for serial Athena, but
  - Not so much when people want to reuse metadata and metadata tools downstream, in non-Athena analysis
  - Use of incidents needs to be rethought in a multithreaded framework, too, and not just for metadata
    - This is already underway (see the future framework requirements document)
• However nostalgically one (okay; I) might recall the relatively clean conceptual foundation, the reality is that
  - ATLAS has sometimes used in-file metadata as a big open bag
  - Type-specific metadata tools have taken shortcuts and have built in dependencies and assumptions about files and their use and processing that are not inherent in the underlying core infrastructure
  - Whatever one may think of their design, these metadata and these tools accomplish genuinely useful things. Legacy code is not useless code. It matters.

8 Furthermore …
• While it’s well and good to say that output event files are not the only possible place to store metadata about an event collection, if not there, then where?
  - An auxiliary file, for example, might be easy at the job level (and straightforward at the dataset level by association), but data management and delivery infrastructure to date has not supported such associations, or (alternatively) inhomogeneous datasets
    - Within a dataset, what if one of these files is not like the others? N event files, 1 (or M) metadata file(s)?
    - Tricky for all components involved (and a production system may care more than the DDM itself) … but not conceptually impossible
  - Are we reaching a point where an updated approach to dataset-level metadata cataloging might be plausible?
    - And while we’re on the subject, are extensible {name, value} pair catalogs good enough to support physics use cases?
  - … and there are alternative strategies, too (store/associate with each event sufficient metadata to reconstruct the collection’s metadata over the union of events?)
• There is a very rich program of R&D ahead here
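The last alternative above, per-event metadata from which collection metadata is rebuilt, might look roughly like the following. It is a sketch under assumptions: the per-event keys (run, lumi_block, stream) are a hypothetical choice of what "sufficient metadata" could mean, not an ATLAS event format.

```python
def reconstruct_collection_metadata(events):
    """Rebuild collection-level metadata as a union over per-event records.

    events: iterable of dicts with hypothetical keys
            "run", "lumi_block", and "stream".
    """
    runs = set()
    lumi_blocks = set()
    streams = set()
    n_events = 0
    for event in events:
        runs.add(event["run"])
        lumi_blocks.add((event["run"], event["lumi_block"]))
        streams.add(event["stream"])
        n_events += 1
    return {
        "runs": sorted(runs),
        "lumi_blocks": sorted(lumi_blocks),
        "streams": sorted(streams),
        "n_events": n_events,
    }
```

One limitation of this particular choice of per-event record: a luminosity block that was part of the selection but contributed no surviving events leaves no trace, so the reconstructed block list can understate the set needed for luminosity bookkeeping. Deciding what each event must carry to avoid that is part of the open question.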

9 Enough of theory for today: in-file metadata in practice
• Early on there were several modest conceptual principles articulated regarding in-file and out-file metadata, and a taxonomy of sorts (not just an enumeration), and a bit more
  - That was then. This is now, but we should not lose sight of this, and of ensuring that we have a reasonably solid conceptual foundation
• In-file metadata have been the boon and the bane of robust and efficient transform and job configuration and initialization
  - And the bane sometimes of robust and efficient file merging
    - Where newer approaches are being investigated
• Demands for additional in-file metadata are increasing significantly.
• Next talks will deal with reality:
  - Metadata, metadata representation, and file peeking, and putting these things on a firm foundation both conceptually and in practice
  - The sometimes-harsh realities of dealing with real metadata and real metadata content, propagation, access, extensibility, and supporting tools

