A Framework-oriented Bridge to Metadata Discussions
David Malon, Jack Cranshaw, Peter van Gemmeren, Marcin Nowak, Alexandre Vaniachine
US ATLAS Technical Planning Meeting, 28 June 2015

Reproducing here two I/O requirements for the future framework:
• Input and output infrastructure must be capable of respecting semantic constraints on data organization, such as not interleaving events from different runs or run segments (luminosity blocks).
• The framework needs to provide sufficient bookkeeping to ensure that all events in semantically meaningful units have been processed, and may be required to provide more detailed bookkeeping in jobs that filter events. The I/O layer should facilitate such accounting, and should provide a means to associate metadata with event samples.
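
A minimal sketch of what these two requirements ask for, assuming each event can be labelled with a (run, luminosity block) pair. The function names and the data layout are illustrative assumptions, not part of any ATLAS framework.

```python
from collections import Counter


def bookkeeping_problems(expected, processed):
    """Compare per-unit event counts against what was expected.

    expected:  dict mapping (run, lumi_block) -> number of events that should be seen
    processed: iterable of (run, lumi_block) pairs, one per event actually processed
    Returns a dict mapping each problematic unit to (expected_count, seen_count).
    """
    seen = Counter(processed)
    problems = {}
    for unit, n_expected in expected.items():
        if seen[unit] != n_expected:
            problems[unit] = (n_expected, seen[unit])
    # Events from units that were never expected are also flagged.
    for unit in seen.keys() - expected.keys():
        problems[unit] = (0, seen[unit])
    return problems


def is_interleaved(units):
    """True if a (run, lumi_block) unit reappears after another unit has started."""
    closed = set()
    current = None
    for unit in units:
        if unit != current:
            if unit in closed:
                return True
            if current is not None:
                closed.add(current)
            current = unit
    return False


expected = {(1, 1): 2, (1, 2): 1}
processed = [(1, 1), (1, 1)]  # the single event of block (1, 2) never arrived
print(bookkeeping_problems(expected, processed))  # {(1, 2): (1, 0)}
print(is_interleaved([(1, 1), (1, 2), (1, 1)]))   # True
```

In a filtering job the same check would apply to the events read, not to the events written.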

A bit of background
• ATLAS I/O infrastructure has from its inception been designed to support processing (semantically meaningful) collections of events
  – The events in a given run and stream and processing, say, or
  – The events that pass a particular trigger in the good-for-physics luminosity blocks of a given run and stream and processing
  – And so on
• Its design has supported the view that files are not fundamental, but artifacts of storage
  – Physics information/metadata about a collection of events does not depend upon whether they are stored in one file or N files or whether the events are individually scattered across a distributed object store, and
  – Algorithmic processing should be insensitive to whether the next event comes from the same file or a different one
• Old-timers and people who know the code well will realize that this has been true for well over a decade
• That is why one sees the “Implicit Collection” terminology when processing a file: the events that happen to be in that file constitute, implicitly, an event collection
  – But whether they constitute a semantically meaningful set (in the physics sense) is another, separate matter
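
The "files are artifacts of storage" view can be sketched as follows. This is a toy model under stated assumptions: `EventCollection`, `implicit_collection`, `count_passing`, and the `pt` field are hypothetical names chosen for illustration, not the ATLAS classes.

```python
from itertools import chain


class EventCollection:
    """A named collection of events; where the events are stored is a detail."""

    def __init__(self, name, sources):
        self.name = name
        self._sources = list(sources)  # each source is an iterable of events

    def __iter__(self):
        # Clients see one stream of events and never learn where a source ends.
        return chain.from_iterable(self._sources)


def implicit_collection(file_name, events_in_file):
    """The events that happen to be in one file, viewed as a collection."""
    return EventCollection(f"implicit:{file_name}", [events_in_file])


def count_passing(collection, threshold):
    """A stand-in for algorithmic processing: count events above a threshold."""
    return sum(1 for event in collection if event["pt"] > threshold)


events = [{"pt": pt} for pt in (5.0, 25.0, 40.0, 12.0)]
one_file = implicit_collection("f1", events)
two_files = EventCollection("same events, two files", [events[:1], events[1:]])
print(count_passing(one_file, 20.0), count_passing(two_files, 20.0))  # 2 2
```

The algorithm gives the same answer for both layouts because it only ever iterates over the collection.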

But history is not ours alone
• The ATLAS data management infrastructure took a different view
  – Datasets are collections of files (not collections of events that happen to reside in a certain list of files); files are fundamental; one must ensure that well-organized production results in sensible file organization; physics information is encapsulated in names of and metadata about files and file sets; …
• Many other components and people took the same view
  – The “event collection” view is fine for mathematicians and others who care about theory, perhaps, but let’s be practical
• And of course the grid and its supporting infrastructure were just emerging, and it might well have been risky to set one’s sights too high
BUT
Fast forward to distributed object stores and pending US ASCR proposals about science-aware data delivery
• Perhaps what was too ambitious for the collaboration when we introduced and supported such notions is no longer so.

Okay; but what’s the connection with metadata?
• All of this is reviewed here because the metadata infrastructure was designed to follow the same principles as the I/O infrastructure: physics metadata are properly associated with collections of events rather than the (incidental) list of files that host them
• This is why, when people get into the nitty-gritty of file-based processing, they can see both a BeginInputFile incident (opening a physical file) and a BeginEventCollection (aka BeginTagFile) incident, and they are generally baffled by this seeming redundancy:
  – The file is, in standard ATLAS use, only implicitly (not of necessity!) the event collection being used as input, and when this happens, the collection’s metadata and the file’s metadata may, but need not, coincide
• The difference is important if you care about physics: in a cross-section calculation, the metadata you need correspond to the set of luminosity blocks from which your events were selected, not the union of all the luminosity blocks in all files that contain some portion of your selection.
  – The infrastructure has correctly supported the distinction
    · That’s why TAG-based selections used as input to an Athena job get the luminosity bookkeeping and other metadata right.
  – That capability may need to be reinvented someday.
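
A small numerical illustration of why the distinction matters for luminosity bookkeeping. All numbers, file names, and dictionary keys are made up for the example. The only assumption carried over from the slides is that a collection records the luminosity blocks its selection drew from, independently of which files host the events.

```python
# Integrated luminosity per luminosity block (arbitrary units, invented values).
lumi_per_block = {1: 10.0, 2: 12.0, 3: 11.0, 4: 9.0}

# File metadata: the luminosity blocks whose events each file happens to hold.
file_blocks = {"fileA": {1, 2}, "fileB": {3, 4}}

# Collection metadata: the blocks the selection actually drew from,
# plus the (incidental) list of files hosting the selected events.
collection = {"lumi_blocks": {1, 3}, "files": ["fileA", "fileB"]}


def integrated_lumi(blocks):
    """Sum the luminosity of the given blocks."""
    return sum(lumi_per_block[b] for b in blocks)


# Correct: use the collection's own metadata.
from_collection = integrated_lumi(collection["lumi_blocks"])

# Wrong for this selection: union of all blocks in every file that was touched.
file_union = set().union(*(file_blocks[f] for f in collection["files"]))
from_files = integrated_lumi(file_union)

print(from_collection, from_files)  # 21.0 42.0
```

With these invented numbers, a cross section normalized to the file-level luminosity would come out a factor of two too small.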

Metadata is propagated, too
• Output of metadata was designed to be similarly general
  – The placement (caching) of output metadata in the/an output event file is a job configuration choice, not a restriction of the (core) infrastructure
• Metadata cached in input files are made available on transitions across file boundaries via the use of incidents
  – An entirely appropriate strategy in (serial) Athena, as these file transitions are asynchronous to transitions of the Gaudi state machine
    · And Gaudi/Athena has a reasonably well-defined incident-handling infrastructure
    · Albeit a weak state machine model … but I digress
• Client tools (type-specific metadata handlers) are provided with sufficient information (incidents AND state transitions??) to make their own decisions* about how and when and where to propagate metadata
* Give them enough rope and …?
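
The incident-driven pattern can be sketched with a toy dispatcher. Only the incident name BeginInputFile comes from the slides. `IncidentDispatcher`, `LumiBlockTool`, and the metadata layout are hypothetical, and this is not the Gaudi incident service API.

```python
from collections import defaultdict


class IncidentDispatcher:
    """Toy stand-in for an incident service (hypothetical, not the Gaudi API)."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def add_listener(self, incident_type, callback):
        self._listeners[incident_type].append(callback)

    def fire(self, incident_type, **context):
        for callback in self._listeners[incident_type]:
            callback(**context)


class LumiBlockTool:
    """Type-specific handler: decides for itself what to carry to the output."""

    def __init__(self):
        self.output_lumi_blocks = set()

    def on_begin_input_file(self, file_name, metadata):
        # Pick up the metadata cached in the file that has just been opened.
        self.output_lumi_blocks |= set(metadata.get("lumi_blocks", ()))


dispatcher = IncidentDispatcher()
tool = LumiBlockTool()
dispatcher.add_listener("BeginInputFile", tool.on_begin_input_file)

# File transitions occur whenever the event loop exhausts a file,
# independently of any framework state transition.
dispatcher.fire("BeginInputFile", file_name="f1", metadata={"lumi_blocks": [1, 2]})
dispatcher.fire("BeginInputFile", file_name="f2", metadata={"lumi_blocks": [2, 3]})
print(sorted(tool.output_lumi_blocks))  # [1, 2, 3]
```

The handler here simply accumulates everything it sees, which is one of the shortcuts a real tool might need to avoid when the input is a selection, not a whole file.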

So where does that put us?
• All of this should stand us in good stead for getting the metadata and bookkeeping right in a multithreaded framework, and provide a good start in distributed event processing (e.g., event server and its successors), but much work remains
• Incident-driven handling made great sense for serial Athena, but
  – Not so much when people want to reuse metadata and metadata tools downstream, in non-Athena analysis
  – Use of incidents needs to be rethought in a multithreaded framework, too, and not just for metadata
    · This is already underway (see the future framework requirements document)
• However nostalgically one (okay; I) might recall the relatively clean conceptual foundation, the reality is that
  – ATLAS has sometimes used in-file metadata as a big open bag
  – Type-specific metadata tools have taken shortcuts and have built in dependencies and assumptions about files and their use and processing that are not inherent in the underlying core infrastructure
  – Whatever one may think of their design, these metadata and these tools accomplish genuinely useful things. Legacy code is not useless code. It matters.

Furthermore …
• While it’s well and good to say that output event files are not the only possible place to store metadata about an event collection, if not there, then where?
  – An auxiliary file, for example, might be easy at the job level (and straightforward at the dataset level by association), but data management and delivery infrastructure to date has not supported such associations, or (alternatively) inhomogeneous datasets
    · Within a dataset, what if one of these files is not like the others? N event files, 1 (or M) metadata file(s)?
    · Tricky for all components involved (and a production system may care more than the DDM itself) … but not conceptually impossible
  – Are we reaching a point where an updated approach to dataset-level metadata cataloging might be plausible?
    · And while we’re on the subject, are extensible {name, value} pair catalogs good enough to support physics use cases?
  – … and there are alternative strategies, too (store/associate with each event sufficient metadata to reconstruct the collection’s metadata over the union of events?)
• There is a very rich program of R&D ahead here
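
The last alternative, reconstructing a collection's metadata from metadata attached to each event, might look like the sketch below. The event layout and the `summarize` and `merge` helpers are assumptions made for illustration. Note one limitation of this naive version: a luminosity block from which no event survived leaves no trace, so the per-event record would have to carry more than its own block for luminosity bookkeeping to come out right.

```python
def summarize(events):
    """Build a partial summary from the metadata attached to each event."""
    summary = {"n_events": 0, "lumi_blocks": set()}
    for event in events:
        summary["n_events"] += 1
        summary["lumi_blocks"].add((event["run"], event["lumi_block"]))
    return summary


def merge(a, b):
    """Combine two partial summaries; set union makes the order irrelevant."""
    return {
        "n_events": a["n_events"] + b["n_events"],
        "lumi_blocks": a["lumi_blocks"] | b["lumi_blocks"],
    }


# Two portions of one collection, stored or processed separately.
events_a = [{"run": 1, "lumi_block": 1}, {"run": 1, "lumi_block": 2}]
events_b = [{"run": 1, "lumi_block": 2}, {"run": 2, "lumi_block": 1}]

merged = merge(summarize(events_a), summarize(events_b))
print(merged["n_events"], sorted(merged["lumi_blocks"]))  # 4 [(1, 1), (1, 2), (2, 1)]
```

Because the merge is associative and commutative, the same summary results however the events are partitioned across files or workers.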

Enough of theory for today: in-file metadata in practice
• Early on there were several modest conceptual principles articulated regarding in-file and out-file metadata, and a taxonomy of sorts (not just an enumeration), and a bit more
  – That was then. This is now, but we should not lose sight of this, and of ensuring that we have a reasonably solid conceptual foundation
• In-file metadata have been the boon and the bane of robust and efficient transform and job configuration and initialization
  – And the bane sometimes of robust and efficient file merging
    · Where newer approaches are being investigated
• Demands for additional in-file metadata are increasing significantly.
• Next talks will deal with reality:
  – Metadata, metadata representation, and file peeking, and putting these things on a firm foundation both conceptually and in practice
  – The sometimes-harsh realities of dealing with real metadata and real metadata content, propagation, access, extensibility, and supporting tools