Emergent Semantics: Towards Self-Organizing Scientific Metadata
Bill Howe, David Maier
Oregon Health and Science University
“The file ‘anim-sal_estuary_7.gif’ is a data product derived from the output of the ELCIRC simulation program run for the period January 8-15 2002. The image shows salinity (practical salinity units) in the estuary region of the domain. It’s actually an animation, where each frame is a horizontal slice 7 meters below the mean sea level. There are 96 frames, each representing 15 minutes.”
  program = ELCIRC
  simStart = 1/8/02
  simEnd = 1/15/02
  region = estuary
  variable = salinity
  timesteps = 96
  plottype = animation
These descriptors are not standard. Some users may record different descriptors for different purposes.
How can we extract direct, immediate benefit from the knowledge domain experts have about the file?
9/20/2018
Environmental Observation and Forecasting System
- Daily forecasts and 1000s of ad hoc hindcasts
- One simulation involves ~20k files: inputs, parameters, outputs, derived data products
- This scale mandates:
  - query access rather than simple filesystem browsing
  - automation everywhere
Tasks
- Collect metadata.
- Organize collected metadata.
- Publish organized metadata for querying.
Challenges
- Metadata is scattered:
  - in file paths
  - within file headers
  - in “nearby” files
- Metadata requirements change frequently:
  - new simulation codes
  - new data product types
  - new users, internal and external
Example: the path …/anim-sal_estuary_7.gif encodes Type = “Animation”, Variable = “Salinity”, Region = “Estuary”, Depth = “7”.
“Obvious” Solution
Data managers work with domain experts: design a relational schema, load data, test, repeat.
But:
- Large up-front cost to DB design
- Slow return on investment
- Use cases unknown
- Significant change is anticipated
- DB languages/APIs not necessarily within scientists’ skill set
[Diagram labels: file, data product, region]
Alternative Solution: Steps 1-3
1. Harvest metadata via simple collection scripts written by the domain experts
2. Use RDF as a schema-independent metadata representation
3. Use RDBMS technology for storage and management
[Diagram: filesystem -> (1) collection scripts -> (2) rdf -> (3) db]
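Step 1 might look something like the sketch below: a small collection script that turns a file path into (subject, property, object) triples. The naming convention `<plottype>-<variable>_<region>_<depth>.<ext>`, the `EXPANSIONS` table, the `property:depth` descriptor name, and the `collect` function are all assumptions inferred from the single example file name in these slides, not the authors' actual scripts.

```python
import re
from pathlib import PurePosixPath

# Hypothetical naming convention, inferred from 'anim-sal_estuary_7.gif':
#   <plottype>-<variable>_<region>_<depth>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<plottype>[a-z]+)-(?P<variable>[a-z]+)_(?P<region>[a-z]+)_(?P<depth>\d+)\.\w+$"
)

# Hypothetical expansions of abbreviations used in file names.
EXPANSIONS = {"anim": "animation", "sal": "salinity"}


def collect(path):
    """Return (subject, property, object) triples harvested from a file path."""
    name = PurePosixPath(path).name
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return []  # unknown convention: contribute no descriptors
    return [
        (path, "property:" + prop, EXPANSIONS.get(raw, raw))
        for prop, raw in match.groupdict().items()
    ]


triples = collect("file://forecasts/2003-184/images/anim-sal_estuary_7.gif")
for triple in triples:
    print(triple)
```

The script knows nothing about a database schema; it only emits triples, which is what keeps the interface between domain experts and data managers narrow.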
A Narrower Interface
[Diagram, two pipelines:
  filesystem -> SQL statements, database APIs, load strategies, data formats/models -> rich schema
  filesystem -> collection scripts -> RDF triples -> generic schema]
Generic RDF Schema
Columns: subject, property, object
Subject: file://forecasts/2003-184/images/anim-sal_estuary_7.gif
  property            object
  property:region     estuary
  property:variable   salt
  property:plottype   animation
  property:source     file://forecasts/2003-184/run/1_salt.63
Variations to improve performance exist:
- Use integer keys for subjects, properties and objects.
- Apply efficient integer processing routines.
Scalability?
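The integer-key variation can be sketched with Python's standard `sqlite3` module. The `terms` dictionary table, the index, and the helper names `term_id` and `assert_triple` are illustrative assumptions; the slides only say that subjects, properties and objects are replaced by integer keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dictionary table: each distinct string is stored once (assumed design).
    CREATE TABLE terms (id INTEGER PRIMARY KEY, value TEXT UNIQUE NOT NULL);
    -- Generic three-column schema holding only integer keys.
    CREATE TABLE statements (
        subject  INTEGER NOT NULL REFERENCES terms(id),
        property INTEGER NOT NULL REFERENCES terms(id),
        object   INTEGER NOT NULL REFERENCES terms(id)
    );
    CREATE INDEX statements_by_property ON statements(property, subject);
""")


def term_id(value):
    """Return the integer key for a string, creating it on first use."""
    conn.execute("INSERT OR IGNORE INTO terms(value) VALUES (?)", (value,))
    return conn.execute(
        "SELECT id FROM terms WHERE value = ?", (value,)
    ).fetchone()[0]


def assert_triple(subject, prop, obj):
    """Store one descriptor as a row of integer keys."""
    conn.execute(
        "INSERT INTO statements VALUES (?, ?, ?)",
        (term_id(subject), term_id(prop), term_id(obj)),
    )


GIF = "file://forecasts/2003-184/images/anim-sal_estuary_7.gif"
assert_triple(GIF, "property:region", "estuary")
assert_triple(GIF, "property:variable", "salt")
assert_triple(GIF, "property:plottype", "animation")
assert_triple(GIF, "property:source", "file://forecasts/2003-184/run/1_salt.63")

# Decode the integer keys back into strings for display.
decoded = conn.execute("""
    SELECT s.value, p.value, o.value
    FROM statements st
    JOIN terms s ON s.id = st.subject
    JOIN terms p ON p.id = st.property
    JOIN terms o ON o.id = st.object
    ORDER BY p.value
""").fetchall()
for row in decoded:
    print(row)
```

Joins and comparisons on `statements` then operate on integers only; strings are touched once on load and once on display.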
Is Generic RDF Good Enough?
Query: “Find files with region, plottype, and variable descriptors”
    SELECT r.subject AS file, r.object AS region, p.object AS plottype, v.object AS variable
    FROM statements r, statements p, statements v
    WHERE r.subject = p.subject
      AND p.subject = v.subject
      AND r.property = 'property:region'
      AND p.property = 'property:plottype'
      AND v.property = 'property:variable'
A three-way self-join! With 60 million descriptors, these joins are unacceptable.
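To see the shape of this query in action, the sketch below runs it against a tiny in-memory SQLite table. The table layout follows the generic subject/property/object schema; the second file, which carries only a `property:variable` descriptor, is made-up sample data added to show that files lacking any of the three descriptors drop out of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statements (subject TEXT, property TEXT, object TEXT)")

GIF = "file://forecasts/2003-184/images/anim-sal_estuary_7.gif"
RUN = "file://forecasts/2003-184/run/1_salt.63"
conn.executemany(
    "INSERT INTO statements VALUES (?, ?, ?)",
    [
        (GIF, "property:region", "estuary"),
        (GIF, "property:variable", "salt"),
        (GIF, "property:plottype", "animation"),
        # Made-up descriptor: this file has no region or plottype.
        (RUN, "property:variable", "salt"),
    ],
)

# One alias of the statements table per requested descriptor.
QUERY = """
    SELECT r.subject AS file, r.object AS region,
           p.object AS plottype, v.object AS variable
    FROM statements r, statements p, statements v
    WHERE r.subject = p.subject
      AND p.subject = v.subject
      AND r.property = 'property:region'
      AND p.property = 'property:plottype'
      AND v.property = 'property:variable'
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each additional descriptor in the result adds another alias of `statements` to the join, which is why the cost grows quickly on a table with tens of millions of rows.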
Decomposed Data
So we can query the RDF directly, but there are no grouping structures to aid query formulation and processing.
- Automatically infer groupings from the RDF data, observing that related files often share signatures.
- Let users impose groupings using a web interface (like views).
Example triples: <isofar.gif, type, isoline>, <isofar.gif, region, far>, <animsal.gif, timesteps, 10>, <animsal.gif, var, salt>, ...
[Diagram labels: filesystem, db, plot, animation]
Alternative Solution: Steps 4-6
4. Partition descriptors into equivalence classes based on file signatures
5. Expose signatures via the web to facilitate browsing and querying
6. Recompute signature extents as new metadata is integrated
[Diagram: db -> (4) partition data -> (5) publish to the web -> website -> (6) query and browse via profiles]
Signature: the set of properties defined for a particular file.
Signatures
- A file’s signature is just the set of properties used to describe it.
- If signatures were fixed, we might derive a relational schema from them.
- Instead, we need to respond to changes.
[Diagram: step 4, partition data, inside the db: find signatures, compute signature extents]
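A minimal sketch of "find signatures" and "compute signature extents", assuming triples are available as plain Python tuples. The function names are illustrative, and `isonear.gif` is a made-up file added so that one extent contains more than one member; the other triples are the examples from the Decomposed Data slide.

```python
from collections import defaultdict


def find_signatures(triples):
    """Map each file to its signature: the set of properties describing it."""
    properties = defaultdict(set)
    for subject, prop, _obj in triples:
        properties[subject].add(prop)
    return {subject: frozenset(props) for subject, props in properties.items()}


def signature_extents(triples):
    """Map each signature to its extent: the set of files that share it."""
    extents = defaultdict(set)
    for subject, signature in find_signatures(triples).items():
        extents[signature].add(subject)
    return dict(extents)


triples = [
    ("isofar.gif", "type", "isoline"),
    ("isofar.gif", "region", "far"),
    ("animsal.gif", "timesteps", "10"),
    ("animsal.gif", "var", "salt"),
    # Made-up file that shares isofar.gif's signature.
    ("isonear.gif", "type", "isoline"),
    ("isonear.gif", "region", "near"),
]

extents = signature_extents(triples)
for signature, files in extents.items():
    print(sorted(signature), "->", sorted(files))
```

Because the extents are derived purely from the triples, they can be recomputed from scratch whenever new metadata is loaded, with no schema to migrate.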
Example: Consolidate Files with Similar Signatures
(DM = data manager, DE = domain expert)
- Modify schema (DM)
- Transfer tuples from A to B (DM)
- Modify collection programs
  - Modify extraction routines (DE)
  - Modify internal organization (DE)
- Modify SQL statements (DM)
Alternative
- Change two lines in a collection script (DE)
    Assert(fileA, "animation", "")
    Assert(fileA, "plottype", "animation")
    Assert(fileB, "plottype", "animation")
- Reload data (Automatic)
- Recompute signatures (Automatic)
- Republish data (Automatic)
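The effect of such a change can be sketched as follows. This assumes one reading of the slide, namely that `Assert(fileA, "animation", "")` is the old line and `Assert(fileA, "plottype", "animation")` is its replacement; the slide text does not state this explicitly. The `signature_extents` helper is an illustrative stand-in for the automatic recomputation step.

```python
def signature_extents(triples):
    """Group files by signature (the set of properties describing each file)."""
    properties = {}
    for subject, prop, _obj in triples:
        properties.setdefault(subject, set()).add(prop)
    extents = {}
    for subject, props in properties.items():
        extents.setdefault(frozenset(props), set()).add(subject)
    return extents


# Assumed "before" state: fileA uses a nonstandard descriptor.
before = [
    ("fileA", "animation", ""),
    ("fileB", "plottype", "animation"),
]

# Assumed "after" state: the collection script now asserts plottype for fileA.
after = [
    ("fileA", "plottype", "animation"),
    ("fileB", "plottype", "animation"),
]

print(len(signature_extents(before)), "signatures before the change")
print(len(signature_extents(after)), "signature after the change")
```

Under that reading, the two files fall into a single extent as soon as the data is reloaded, without any schema change, tuple transfer, or SQL rewrite.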
Benefits
- Narrow interface between data creators and data managers
- Metadata exploitable prior to finalizing a thorough schema
- Derived schema can adapt to changing requirements automatically
- Profiles constitute emergent semantics: meaning is assigned after data is collected.