1
Responsible Citizenship of the World of Science
Using Persistent, Unique Identifiers for Samples
Wim Hugo (SAEON, ICSU-WDS)
4
Too Large and Complex to be Useful to Science…
The Complete Web: every piece of information at a physical network node is potentially in multiple relationships with every other. This enormous graph is many times larger than the physical internet (1) and is not practically useful for science.
Formal Meta-Data: very few links are formally specified, eliminating almost all of the potential links between pieces of information to favour only a very rigid collection.
(1) Fensel, D. and van Harmelen, F. (2007). Unifying Reasoning and Search to Web Scale, IEEE Computer Society, /07.
6
The LOD “Cloud” 2011
7
The LOD “Cloud” April 2014
9
Credibility of Science
Access to original and complete data sets for reproducibility
Re-usability declines with time
Availability declines with age
10
Rationale
Reduction in Complexity of the Semantic Web
Citability and Incentivisation
Re-usability (Interoperability)
Reproducibility
Discoverability
11
Solution: Considerations
12
Charles Babbage (1791–1871)
13
Sir Robert Peel (1788–1850)
The British Parliament, after investing £20,000 in the Difference Engine project (about £2,400,000 in 2017 terms), was treated to a demo.
“Can you set the machine to calculate the time at which it will be of some use???”
14
Information Technology …
15
Systems Engineering
Technology Drivers
Patterns
Use Cases
Design Considerations and Architecture
Other (Science) Drivers
Solution(s)
Implementation
16
Persistent Identifiers and Science
The Fabric of Science
The Process of Science
The Language of Science
17
Persistent Identifiers and Science
The Fabric of Science: Mostly Captured in Metadata
The Process of Science: Mostly Captured in Metadata
The Language of Science: Mostly Captured in Data
18
The Fabric of Science
ICSU-WDS Knowledge Network
Scholarly Publications (CrossRef?)
TDRs (WDS, DSA, DataCite*)
Samples and Events
People (ORCID, …)
RDI Outputs/ Online Resources
Coverage (Temporal, Spatial, Topic)
Data Citations (DataCite)
Institutions (GRID, ISRI)
Projects
Use, Caveats, Lineage, Provenance, Methods
Initiatives
Licenses (Creative Commons)
Networks
Platforms, Instruments, Deployments, Sites, …
Funders (?)
* Including re3data, DataBib
Legend: Exists, Started, Not Now
WDS
19
The Process of Science
ICSU-WDS Knowledge Network
Sample, Specimen, Member
“Standard Variables”
Standard Transformation
Real World/ Events
Processing
Observations, Media
“Analysis Ready Data”
Analysis and Workflows
“Publication Ready Output”
20
The Language of Scientific Data
Unstructured Normalised Graph Ontology (Resolution uncertainty) Time Vocabularies (“stable”) Spatial/ Location Coverage (Temporal, Spatial, Topic) Registries (“unstable”) Dimensions (External Entities) Topic Variable Non-Standardised “Controlled” Variables
21
Framework: Elements of a Solution
Domain-Specific or Community-Specific PIDs
Governance
Process Framework
Technical Guidance
Conceptual Framework and Metamodels
End Users and Systems
Certification
22
Design Consideration #1
Precision, Vocabularies, PIDs, and LOD
Precision is Critical in Formal, Structured Data if used for dimensions
Precision is desirable for other data
Graph edges have a weight distribution … assertion
Creativity is impossible if precision is perfect
Guidance is required on best practices
Value judgment on usability/ trust
Some progress in this symposium
Technical Guidance
23
Design Consideration #2
Single Point of Resolution
Option 1 – voluntary/ community indexing
Option 2 – dedicated resolver
Option 3 – ‘publication’ metadata
Agreement on minimum index metadata
Governance and sustainability
No progress thus far
Governance | Certification
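Option 2, a dedicated resolver, can be pictured as a thin dispatch layer over resolvers that already exist. The sketch below is only an illustration under stated assumptions, not a design from the talk: it assumes identifiers are written with a scheme prefix such as `doi:`, and the names `RESOLVERS` and `resolution_url` are hypothetical. The base URLs are the public DOI, Handle, and ORCID resolvers; the identifiers in the usage lines are placeholders.

```python
# Hypothetical single point of resolution: map a prefixed PID to the URL
# of the resolver that already serves that identifier scheme.
RESOLVERS = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
    "orcid": "https://orcid.org/",
}


def resolution_url(pid: str) -> str:
    """Return the resolver URL for an identifier written as 'scheme:value'."""
    scheme, sep, value = pid.partition(":")
    scheme = scheme.strip().lower()
    value = value.strip()
    if not sep or not value:
        raise ValueError(f"expected 'scheme:value', got {pid!r}")
    if scheme not in RESOLVERS:
        raise KeyError(f"no resolver registered for scheme {scheme!r}")
    return RESOLVERS[scheme] + value


if __name__ == "__main__":
    # Placeholder identifiers, not real records.
    print(resolution_url("doi:10.1234/example"))
    print(resolution_url("orcid:0000-0000-0000-0000"))
```

Keeping the scheme table as data rather than code means a new identifier scheme is a configuration change, which is where the governance and sustainability questions on the slide would then concentrate.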
24
Option 1 - Hybrid Graph Solution:
“Scholix”
Scholarly Publications (CrossRef?)
TDRs (WDS, DSA, DataCite*)
Samples and Events
People (ORCID, …)
RDI Outputs/ Online Resources
Coverage (Temporal, Spatial, Topic)
Data Citations (DataCite)
Institutions (GRID, ISRI…)
Projects
Use, Caveats, Lineage, Methods
Initiatives
Licenses (Creative Commons)
Networks
Platforms, Instruments, Deployments, Sites, …
Funders (?)
* Including re3data, DataBib
Legend: Exists, Started, Not Now
WDS
25
The Fabric of Science
ICSU-WDS Knowledge Network
Scholarly Publications (CrossRef?)
TDRs (WDS, DSA, DataCite*)
Samples and Events
People (ORCID, …)
RDI Outputs/ Online Resources
Coverage (Temporal, Spatial, Topic)
Data Citations (DataCite)
Institutions (GRID, ISRI, …)
Projects
Use, Caveats, Lineage, Methods
Initiatives
Licenses (Creative Commons)
Networks
Platforms, Instruments, Deployments, Sites, …
Funders (?)
* Including re3data, DataBib
Legend: Exists, Started, Not Now
WDS
26
Design Consideration #3
Single Point(s) of Failure
Persistent Identifiers are now fundamental to research data infrastructure and cannot be allowed to fail.
Yet many of these services are community-driven, poorly funded, and sometimes rely on voluntary contributions to function.
Identify critical elements of RDI and fund these reliably and into the long term
Certify services and manage to maturity
Some progress in this symposium
Governance | Certification
27
Design Consideration #4
Conceptual Models and Semantics
Conceptual Model – “Because it is a sample”
Conceptual Model – “Because of the domain”
Conceptual Model – “Because of the owner”
Perspective from Statistics
Develop a Conceptual Model that allows Protocol-specific, Organisation-Specific, and Domain-specific metadata
Influences resolution and higher-level sample metadata
Hopefully constructed from existing metadata standards without major modification
Good progress in this symposium
Conceptual/ Metamodel
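One way to read the conceptual model described above is as a small shared core plus optional extension blocks for protocol-, organisation-, and domain-specific metadata. The following is a minimal sketch under that assumption only: the class name `SampleRecord`, its fields, and the extension keys are hypothetical and are not taken from any existing metadata standard.

```python
from dataclasses import dataclass, field
from typing import Dict

# The three kinds of specific metadata named on the slide, used as keys.
EXTENSION_KINDS = ("protocol", "organisation", "domain")


@dataclass
class SampleRecord:
    """Hypothetical sample description: a common core plus extensions."""
    pid: str                      # persistent identifier of the sample
    core: Dict[str, str]          # metadata every sample shares
    extensions: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def extend(self, kind: str, metadata: Dict[str, str]) -> None:
        """Attach protocol-, organisation-, or domain-specific metadata."""
        if kind not in EXTENSION_KINDS:
            raise ValueError(f"unknown extension kind: {kind!r}")
        self.extensions.setdefault(kind, {}).update(metadata)

    def flatten(self) -> Dict[str, str]:
        """Merge core and extensions, prefixing extension keys by kind."""
        merged = dict(self.core)
        for kind, metadata in self.extensions.items():
            for key, value in metadata.items():
                merged[f"{kind}.{key}"] = value
        return merged


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = SampleRecord(pid="example:0001", core={"material": "rock"})
    sample.extend("domain", {"lithology": "basalt"})
    print(sample.flatten())
```

Because the core stays fixed while extensions vary, a resolver could index only the core fields, which is one reading of how the model "influences resolution and higher-level sample metadata".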
28
A Bit of DataCite Schema
RelatedIdentifier: Suggestion to add sample-related vocabulary
Supplementary Materials: Suggestion to add ‘Sample Metadata’ as a link
Keyword/ Subject Element: Infinitely extensible
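As an illustration of the RelatedIdentifier route, the sketch below builds a dataset record that points at the sample it was derived from. It uses the existing DataCite values `IGSN` (for `relatedIdentifierType`) and `IsDerivedFrom` (for `relationType`) rather than any new sample-specific vocabulary, which the slide only proposes. The helper name `link_dataset_to_sample`, the dictionary layout (loosely following DataCite's JSON property names), and both identifiers are assumptions made for this example.

```python
import json


def link_dataset_to_sample(dataset_doi: str, sample_igsn: str) -> dict:
    """Build a minimal, DataCite-style record linking a dataset to a sample."""
    return {
        "doi": dataset_doi,
        "relatedIdentifiers": [
            {
                "relatedIdentifier": sample_igsn,
                "relatedIdentifierType": "IGSN",
                "relationType": "IsDerivedFrom",
            }
        ],
        # Free-text subject entry; the slide notes this element is extensible.
        "subjects": [{"subject": "physical sample"}],
    }


if __name__ == "__main__":
    # Placeholder identifiers, not real records.
    record = link_dataset_to_sample("10.1234/example-dataset", "EXAMPLE0001")
    print(json.dumps(record, indent=2))
```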
29
Design Consideration #5
Definition of Critical Dimensions for Data Families
Sample Identifier is Often Required and Must Become Common Practice in Datasets
Vocabulary or PID for each Dimension
Develop a Conceptual Model for each Data Family and Discipline/ Standard Variable
Influences metadata and data content standards
No progress thus far
Best Practice/ Technical Guidance
30
Generic Dimensions of Data
Sample: S
Spatial Coverage: XYZ
Temporal Coverage: T
Topic or Semantic/ Ontological Coverage:
  D: Demographic
  P: Phenomenon (mostly physical, chemical, or other contextual data)
  B: Biological
  Tx: Species and Taxonomy (with some extensions)
  Al: Allele/ Genome/ Phylogenetic
  Ch: Characteristics, Traits, and Life Stages
Each unique combination of these, supported by vocabularies or an ontology, is a generic data family
Continuous or Near-Continuous: Uppercase
Discrete or dispersed: Lowercase
Best Practice/ Technical Guidance
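The uppercase/lowercase convention above can be read mechanically. The sketch below parses a family signature such as `XYz, t, Tx, S` into labelled dimensions. It is a minimal illustration with several assumptions: the function names are hypothetical, the case rule is applied only to the spatial axes and the temporal dimension, and topic codes and the sample dimension are passed through without a density label because the slide does not say how case applies to them.

```python
SPATIAL_AXES = set("XYZ")


def _density(letter: str) -> str:
    """Uppercase marks continuous or near-continuous, lowercase discrete."""
    return "continuous" if letter.isupper() else "discrete"


def parse_family(signature: str) -> list:
    """Split a signature such as 'XYz, t, Tx, S' into (dimension, density)."""
    dimensions = []
    for token in (part.strip() for part in signature.split(",")):
        if not token:
            continue
        if set(token.upper()) <= SPATIAL_AXES:
            # One entry per spatial axis, so 'XYz' yields three dimensions.
            for axis in token:
                dimensions.append((f"spatial:{axis.upper()}", _density(axis)))
        elif token.upper() == "T":
            dimensions.append(("temporal", _density(token)))
        elif token == "S":
            dimensions.append(("sample", None))
        else:
            # Topic codes (D, P, B, Tx, Al, Ch) are passed through unchanged.
            dimensions.append((f"topic:{token}", None))
    return dimensions


if __name__ == "__main__":
    for dimension, density in parse_family("XYz, t, Tx, S"):
        print(dimension, density)
```

Two signatures that parse to the same list would belong to the same generic data family in this reading, which is the comparison a crosswalk between families needs.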
31
Some Generic Data Families and Crosswalk Requirements
Typical Dimensions/ Content Typical Infrastructure Typical Syntax/ Schema Object xyz, t, P/C, S DDI “Sparse” NetCDF XYZ, T, P, S OPeNDAP Multi-dimensional S-DB Traditional Spatial XYz, t, P, S WxS O&M Signals XYZ, T, P/ B, S SensorThings General Structured, Media, Objects Object xyz, t, P/ B, S CSV, PDF, ZIP GBIF Index XYz, t, Tx, S Species Occurrence, … DwC GenBank XYZ, T, Al, S Genetic FTP/ ASN.1 Now Implementing: ✪ Array Databases/ Virtual Cubes for Everything WCS
32
Simple or Core Information Model
Genes and Alleles
Species and Taxons
Sampling Event
Spatial and Temporal Coverage
Life Stages, Traits and Characters
Physical Phenomena
Best Practice/ Technical Guidance
33
Example: Taxon Abundance, Presence and Absence
Genes and Alleles
Relationship
Species and Taxons
Sampling Event
Spatial and Temporal Coverage
Life Stages, Traits and Characters
Physical Phenomena
Best Practice/ Technical Guidance
34
Example: Phylogenetic Data
Genes and Alleles
Relationship
Species and Taxons
Sampling Event
Spatial and Temporal Coverage
Life Stages, Traits and Characters
Physical Phenomena
Best Practice/ Technical Guidance
35
Example: Morphology
Genes and Alleles
Relationship
Species and Taxons
Sampling Event
Spatial and Temporal Coverage
Life Stages, Traits and Characters
Physical Phenomena
Best Practice/ Technical Guidance
36
Example: Biome Definition, Ecosystem Services
Genes and Alleles
Relationship
Species and Taxons
Sampling Event
Spatial and Temporal Coverage
Life Stages, Traits and Characters
Physical Phenomena
Best Practice/ Technical Guidance
37
Design Consideration #6
Identifiers for the Fabric of Science
Certification of the Process of Science
PIDs for All Important Agents/ Objects
Repositories for Objects and Artifacts
Registries and Name Services
World Data System/ DSA considering extension beyond data
Community Assessments – Rating-driven (TripAdvisor, …), Comment-driven (GitHub, ...)
Good progress in this symposium
Certification
38
Design Consideration #7
Solution Granularity
Too specialised: reduced utility and much duplication
Too generalised: all communities miss critical aspects
Two-tiered solution?
Some progress in this symposium
Certification
39
“Granularity” of a Solution
Universalise? Generalise Specialise “change the code” “change the configuration” “change the reference model”
40
“Granularity” of a Solution
41
Actions
Who takes this forward?
RDA Interest Group
WG: Sample – Conceptual Model/ Metadata
WG: Develop Two-tiered Solution?
WG: Criteria for Trusted Name Services and Registries
WG: Best Practices in Respect of Sample Management
ICSU-WDS/ DSA/ Institutions (IGSN, Pangaea, DataCite, …)
Develop criteria for trusted sample repositories
Implement certification infrastructure
Implement Scholix Generalisation – institution required
42
“Metadata is a tax in the Data World. They may not like it, but responsible citizens pay their taxes. Citizens cannot expect services and infrastructure to be built on their behalf if they do not pay tax.”
43
?