The Metadata Groups Unpacking the Elements- Keith G Jeffery


1 The Metadata Groups Unpacking the Elements- Keith G Jeffery

2 RDA Metadata Principles
The only difference between metadata and data is mode of use
Metadata is not just for data, it is also for users, software services, computing resources
Metadata is not just for description and discovery; it is also for contextualisation (relevance, quality, restrictions (rights, costs)) and for coupling users, software and computing resources to data (to provide a Virtual Research Environment)
Metadata must be machine-understandable as well as human understandable for autonomicity (formalism)
Management (meta)data is also relevant (research proposal, funding, project information, research outputs, outcomes, impact…)

3 FAIR Principles
To be Findable:
  F1. (meta)data are assigned a globally unique and eternally persistent identifier.
  F2. data are described with rich metadata.
  F3. (meta)data are registered or indexed in a searchable resource.
  F4. metadata specify the data identifier.
To be Accessible:
  A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
  A1.1. the protocol is open, free, and universally implementable.
  A1.2. the protocol allows for an authentication and authorization procedure, where necessary.
  A2. metadata are accessible, even when the data are no longer available.

4 FAIR Principles
To be Interoperable:
  I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  I2. (meta)data use vocabularies that follow FAIR principles.
  I3. (meta)data include qualified references to other (meta)data.
To be Re-usable:
  R1. (meta)data have a plurality of accurate and relevant attributes.
  R1.1. (meta)data are released with a clear and accessible data usage license.
  R1.2. (meta)data are associated with their provenance.
  R1.3. (meta)data meet domain-relevant community standards.

5 Metadata Element Set - and URL of ‘unpacking’ (1)
Unique Identifier (for later use including citation)
Location (URL)
Description
Keywords (terms)
Temporal coordinates
Spatial coordinates
Originator (organisation(s) / person(s))
Project

6 Metadata Element Set - and URL of ‘unpacking’ (2)
Facility / equipment
Quality
Availability (licence, persistence)
Provenance
Citations
Related publications (white or grey)
Related software
Schema
Medium / format

7 Unique Identifier
Identifies the digital object of interest
It might be useful to constrain representations of identifiers of well-known types
Should it be PUUID: permanent/persistent universal unique ID?
Managed system or generated?
How to handle PUUIDs of versions, fragments of digital object?
All metadata elements should be referentially and functionally related to the PUUID of the digital object
  Allows for elements to have formal structure (syntax) and terms to have declared meaning (semantics)
  Ensures elements have a relationship to the digital object represented by the PUUID
  E.g. an organisation or person exists independently of the dataset of which they are the owner/creator/manager
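The "managed or generated" and "versions, fragments" questions can be made concrete with a small sketch. It assumes the generated option, using Python's standard uuid module; the function names and the "kind:label" naming scheme are hypothetical and are not proposed on the slide. A UUID by itself gives uniqueness only; persistence still depends on a managed registry.

```python
import uuid


def new_object_id() -> uuid.UUID:
    """Generated (not managed) identifier: a random version-4 UUID."""
    return uuid.uuid4()


def derived_id(parent: uuid.UUID, kind: str, label: str) -> uuid.UUID:
    """Deterministic identifier for a version or fragment of `parent`.

    Uses a name-based version-5 UUID with the parent identifier as the
    namespace, so the same (parent, kind, label) always yields the same ID.
    The "kind:label" convention is an assumption made for this sketch.
    """
    return uuid.uuid5(parent, f"{kind}:{label}")


dataset_id = new_object_id()
version_2 = derived_id(dataset_id, "version", "2")
fragment_a = derived_id(dataset_id, "fragment", "rows-0-999")
```

Deriving the identifier from the parent keeps a computable link from a version or fragment back to the digital object, which is one possible reading of "referentially related"; an explicit relationship record would be another.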

8 Location
URL (locator, not identifier URI)
Atomic or unpacked semantically
  Protocol (http, ftp, mailto, jdbc, etc.)
  Subdomain (www or other)
  Domain name (name.com, name.de, etc.)
  Port number (80 or other)
  Directory (path to the page; if none is provided, server uses root web directory)
  Page (if no page is provided, server uses default page)
How to handle locations of versions, fragments?
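The semantic unpacking listed above can be sketched with Python's standard urllib.parse. The function name, the returned keys and the example URL are assumptions for illustration. The subdomain/domain split is deliberately naive (first label versus the rest); a correct split needs a public suffix list, and schemes without a host part, such as mailto or jdbc, are not handled.

```python
import posixpath
from urllib.parse import urlsplit

# Well-known default ports, used when the URL does not state one.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def unpack_location(url: str) -> dict:
    """Split a URL into the sub-elements named on the slide."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    subdomain, _, domain = host.partition(".")
    if "." not in domain:
        # Hosts such as "example.org" or "localhost" have no subdomain.
        subdomain, domain = "", host
    directory, page = posixpath.split(parts.path)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "subdomain": subdomain,
        "domain": domain,
        "port": port,
        "directory": directory,
        "page": page,
    }


location = unpack_location("http://www.example.org:8080/data/set1/index.html")
```

An empty "directory" or "page" in the result corresponds to the server falling back to its root directory or default page.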

9 Description
Is dataset name or title part of description?
Abstract: Text that needs a qualifier of the language.
Keywords (terms) → see that sheet
Classification: → is this keywords?
  Schema or controlled vocabulary needs to be declared
  Examples:
    Dewey
    Library of Congress
    National Library of Medicine
    Universal Decimal Classification
Multilingual versions?

10 Keywords
Keywords may come from controlled or uncontrolled vocabularies.
→ Are controlled keywords classification? (note there may be other classification systems used for other elements, e.g. for quality, media…)
For this the meaning of a “term” must be machine understandable.
  This means a dereferenceable ID is needed
The keyword should have a relationship identifier or role of the term to the object, e.g. “aboutness”
  This means that a set of standard relations should be defined
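One way to picture a keyword that carries both a dereferenceable term ID and a role is the record below. This is a sketch only: the Relation members, the field names and the example.org URIs are hypothetical, since the slide asks for a set of standard relations without defining one.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """Hypothetical stand-in for a set of standard term-to-object relations."""
    ABOUT = "about"
    USES_METHOD = "usesMethod"


@dataclass(frozen=True)
class KeywordLink:
    object_id: str      # identifier of the digital object being described
    term_uri: str       # dereferenceable ID of the term in its vocabulary
    relation: Relation  # role of the term with respect to the object
    label: str = ""     # optional human-readable form of the term


def terms_with_relation(links, relation):
    """Return the term URIs attached to objects through the given relation."""
    return [link.term_uri for link in links if link.relation is relation]


links = [
    KeywordLink("ex:dataset1", "https://vocab.example.org/term/ocean-temperature",
                Relation.ABOUT, "ocean temperature"),
    KeywordLink("ex:dataset1", "https://vocab.example.org/term/ctd-profiling",
                Relation.USES_METHOD, "CTD profiling"),
]
about_terms = terms_with_relation(links, Relation.ABOUT)
```

Keeping the relation as a separate field, not folded into the term, lets the same vocabulary term serve different roles for different objects.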

11 Temporal Coordinates
Time scales (depend on research fields)
  < nanoseconds (Particle Physics), Seconds / minutes, Hours (climate/weather), Days, Years (history), Millennia (geoscience)..
  See:
Forms
  Date
  Timestamp
  Time interval (start/end time)
  Historical period
  Period
Format (UTC?)
Units
Errors on time stamp and period
→ Can a date/time interval represent all the forms?
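The closing question, whether one date/time interval can represent all the forms, can be explored with a minimal sketch. The class and its field names are assumptions, not part of the slides. It treats a timestamp as a zero-length interval and a date as a one-day interval in UTC, with an optional uncertainty for the "errors" item.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimeInterval:
    start: datetime
    end: datetime
    uncertainty: timedelta = timedelta(0)  # error on start and end

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("end precedes start")

    @classmethod
    def instant(cls, ts: datetime, uncertainty: timedelta = timedelta(0)):
        """A timestamp expressed as a zero-length interval."""
        return cls(ts, ts, uncertainty)

    @classmethod
    def day(cls, year: int, month: int, day: int):
        """A calendar date expressed as a one-day interval in UTC."""
        start = datetime(year, month, day, tzinfo=timezone.utc)
        return cls(start, start + timedelta(days=1))

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


survey_day = TimeInterval.day(2017, 3, 1)
reading = TimeInterval.instant(datetime(2017, 3, 1, 12, 0, tzinfo=timezone.utc),
                               uncertainty=timedelta(seconds=5))
```

The sketch also shows the limits of a single representation: Python's datetime covers only years 1 to 9999 at microsecond resolution, so the sub-nanosecond and geological time scales listed above would need a different numeric encoding with explicit units.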

12 Spatial Coordinates
In the context of Astronomy
  IVOA recommendation: Space-Time Coordinate Metadata for the Virtual Observatory (Version 1.33)
Geospatial
  ISO 19115 (ISO 19139 for XML linearization)
  Note it has itself complex elements with structured attributes

13 Originator (organisation, person)
Originator is a role of agents (people, corporate bodies and computational agents) in the data creation process
  role: creator, publisher, author, ?
What is the difference between creator and author in the research data context?
Who is the publisher: the researcher who deposits data, or the organization that maintains the repository? If it’s the latter, we suggest the information be generated automatically.
See PRO, the Publishing Roles Ontology, as an example of how such roles are described for publication.
Does a roles ontology exist for research data? If not, should we create one?
Research roles include data collector, analyst, etc., and creator does not sufficiently reflect those functions.
Need to define relationship (role and temporal duration) between dataset and originator with defined role terms

14 Project
Project name (full)
Project name (abbreviated)
Grant number
Program?
Funder (see Originator?)
  Name
  ID (e.g. FundRef)
Grant beginning date
Grant end date
Investigators (see Originator)
  Principal
  Co
Project URI: of limited value since ephemeral?
(CERIF has detail here that might be useful)
When the title of the data set is different from the project name, where should the title information be recorded? → see comments under Description
Note many of these unpacked sub-elements require relationships, multilinguality

15 Facility/Equipment
Facility
A Facility provides a capability via the provision of services to serve a specific function.
Facilities can be physical or virtual.
Facilities, like equipment, are artifacts designed, built, operating or installed to serve a specific function affording a convenience or service.
Facilities are owned or run by organisations.
E.g. a research vessel, an analytical facility, a space or ground-based telescope
A virtual facility could be a data sharing network.
Relationships facility/equipment and each to organisation, person, publications…

16 Facility/Equipment
Equipment
A physical item used within a research process for a specific purpose, say for preparation of a sample, or taking of measurements. Or a computing system.
Does this include the instruments installed on a facility? Yes. So if you separate the concepts you need to have ‘facility’ as a possible metadata element attached to a piece of equipment. So a piece of equipment would normally (but not always) be contained within a facility (e.g. maybe not for chemical or biology equipment which provides measures).
Consider substituting “Instrument” for “Equipment”. This would allow including items such as surveys in the metadata.

17 Data Quality
From TeD-T: rda.esc.rzg.mpg.de/index.php/Data_Quality
  Data quality (DQ) is a multi-dimensional construct: a perception and/or a judgment of data's fitness or trustworthiness to serve intended research uses in a given context
From DUL ity → Data Quality could be categorized by pre-defined values or classification from a particular vocabulary or scheme. RDA-DQV (Data Quality Vocabulary)?
The problem here is how to categorize a perception!! :-O
  Perception is a process of recognizing and interpreting sensory stimuli (G)

18 Data Quality
Data quality review often includes matching (statistics on) the data to the metadata. For instance, are there missing values? Are the missing values accounted for? This is a stage in data curation.
However, researchers have a different concept of what data quality entails, and this element should be renamed to avoid confusion and improve user-friendliness.
Quality may include
  availability (persistence, access) → see next slide
  contextualisation
  Provenance → see later slide

19 Availability (licence, persistence)
Persistence includes backup, recovery, mirroring, fragmentation, media migration
  Versioning, provenance
Licence includes right to read, copy, write
  Usually (in open science) with acknowledgement/citation

20 Provenance
The PROV-O ontology is a good starting point for Provenance concepts
  Entity, Activity, Agent → note relationships
Maybe in specific areas the provenance description can be simplified.
Relationship provenance to metadata catalogs?
  Is provenance within the catalog or separate?
Relationship provenance to logs?
  Does provenance rely on logs or replicate them (with more contextual information)?
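The Entity, Activity, Agent relationships can be sketched as plain triples. The property names wasGeneratedBy, used and wasAssociatedWith come from PROV-O; the "ex:" resources, the set-of-tuples storage and the helper functions are assumptions for illustration, and an RDF library would be the more usual choice in practice.

```python
PROV = "http://www.w3.org/ns/prov#"  # PROV-O namespace

# (subject, predicate, object) statements about one hypothetical dataset.
triples = {
    ("ex:dataset1", PROV + "wasGeneratedBy", "ex:simulationRun7"),     # Entity -> Activity
    ("ex:simulationRun7", PROV + "used", "ex:inputGrid"),              # Activity -> Entity
    ("ex:simulationRun7", PROV + "wasAssociatedWith", "ex:alice"),     # Activity -> Agent
}


def objects(statements, subject, predicate):
    """All objects o such that (subject, predicate, o) is stated."""
    return {o for s, p, o in statements if s == subject and p == predicate}


def agents_behind(statements, entity):
    """Agents associated with any activity that generated `entity`."""
    agents = set()
    for activity in objects(statements, entity, PROV + "wasGeneratedBy"):
        agents |= objects(statements, activity, PROV + "wasAssociatedWith")
    return agents


responsible = agents_behind(triples, "ex:dataset1")
```

Answering "who is behind this dataset" requires walking two relationships, which illustrates why the slide flags the relationships and not only the three classes.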

21 Citation
An open question that came up is whether we should support a defined way to communicate the reason for the citation → i.e. the role between the object being cited and the object citing, with attendant relationship (e.g. person).
There is substantial overlap with the “related publications” and “related software” documents/considerations.
Wondering about how citations in different styles (MLA, ALA...) should be referred to? → classification
Classical: There are well established ways to cite scientific publications, classically by specifying attributes like authors, title, publisher, journal name, volume & issue number, pages, … and respective variants for books, book chapters, thesis types, …
→ How to cite datasets from other objects? The ID is probably OK but what about the role? Fragments of datasets? Versions?
Identifier based: More recently it became common to refer to (sometimes in the context of citation also)

22 Related Publications (white or grey)
Suggestion to use essentially DC or DCAT metadata
Problem of referential and functional integrity (experience in EPOS with DataCite)
  The related publication is not referentially or functionally dependent on the dataset or other digital object; it exists independently
  It is all about relationships
    Role, temporal duration
RDF version of DC or DCAT could be used

23 Related Software
There may be many kinds of software related to a
Dataset
  Software generating it (especially simulation)
  Software processing it (especially analytics)
  Software validating it
Software
  Software on which the software object is dependent, e.g. libraries
  Software dependent on the software object
  Software of the infrastructure to execute the software object (operating system)

24 Schema
Used for validation of the dataset - constraints
  Equivalent for software object validating source code
Used to connect the dataset to executing software
  Data structure constraints
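Validation of a dataset against schema constraints might look like the sketch below. The schema layout (field name mapped to an expected type and a required flag), the function name and the example record are all assumptions; real schemas would normally be expressed in an established schema language.

```python
def validate(record: dict, schema: dict) -> list:
    """Return a list of constraint violations; an empty list means valid.

    `schema` maps each field name to a tuple (expected_type, required).
    """
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in record:
            if required:
                errors.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields present in the record but not declared in the schema.
    for field in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


schema = {
    "station": (str, True),
    "temp_c": (float, True),
    "note": (str, False),
}
problems = validate({"station": "A1", "temp_c": "3.5", "depth": 10}, schema)
```

The same declared structure could also be handed to executing software so that it knows which fields and types to expect, which is the second use named on the slide.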

25 Medium/Format
Medium
  Versions of a digital object may be on different media
  Kinds of medium: classification system / enumerated list of terms
Format
  The structure and encoding of the digital object
  May be implicit in schema (but not all digital objects have schema)

26 Overall Remarks
Still lacking detail despite a lot of work
More work to do
Some characteristics emerging:
  Relationships between and within elements
  Need for classification on many elements
    For properties of the entity (e.g. medium)
    For roles in relationships of the entity with others, e.g. dataset <-> person ‘owner’

