Responsible Citizenship of the World of Science Using Persistent, Unique Identifiers for Samples Wim Hugo SAEON, ICSU-WDS
Too Large and Complex to be Useful to Science… The Complete Web: every piece of information at a physical network node is potentially in multiple relationships with every other. This enormous graph is many times larger than the physical internet (1) and is not practically useful for science. Formal Meta-Data: very few links are formally specified, eliminating almost all of the potential links between pieces of information to favour only a very rigid collection. (1) Fensel, D. and van Harmelen, F. (2007). Unifying Reasoning and Search to Web Scale, IEEE Computer Society, 1089-7801/07. http://www.cs.vu.nl/~frankh/postscript/IEEE-IC07.pdf
The LOD “Cloud” 2011
The LOD “Cloud” April 2014
Credibility of Science Access to original and complete data sets for reproducibility Re-usability declines with time Availability declines with age http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0000308#pone-0000308-g002 http://www.sciencedirect.com/science/article/pii/S0960982213014000
Rationale Reduction in Complexity of the Semantic Web Citability and Incentivisation Re-usability (Interoperability) Reproducibility Discoverability
Solution: Considerations
Charles Babbage (1791-1871)
Sir Robert Peel (1788-1850) The British Parliament, after investing £20,000 in the Difference Engine project (about £2,400,000 in 2017), was treated to a demo. “Can you set the machine to calculate the time at which it will be of some use?”
Information Technology …
Systems Engineering Technology Drivers Patterns Use Cases Design Considerations and Architecture Other (Science) Drivers Solution(s) Implementation
Persistent Identifiers and Science The Fabric of Science The Process of Science The Language of Science
Persistent Identifiers and Science The Fabric of Science: Mostly Captured in Metadata The Process of Science: Mostly Captured in Metadata The Language of Science: Mostly Captured in Data
The Fabric of Science ICSU-WDS Knowledge Network Scholarly Publications (CrossRef?) TDRs (WDS, DSA, DataCite*) Samples and Events People (ORCID, …) RDI Outputs/ Online Resources Coverage (Temporal, Spatial, Topic) Data Citations (DataCite) Institutions (GRID, ISRI) Projects Use, Caveats, Lineage, Provenance, Methods Initiatives Licenses (Creative Commons) Networks Platforms, Instruments, Deployments, Sites, … * Including re3data, DataBib Funders (?) Exists Started Not Now WDS
The Process of Science ICSU-WDS Knowledge Network Sample, Specimen, Member “Standard Variables” Standard Transformation Real World/ Events Processing Observations, Media “Analysis Ready Data” Analysis and Workflows “Publication Ready Output”
The Language of Scientific Data Unstructured Normalised Graph Ontology (Resolution uncertainty) Time Vocabularies (“stable”) Spatial/ Location Coverage (Temporal, Spatial, Topic) Registries (“unstable”) Dimensions (External Entities) Topic Variable Non-Standardised “Controlled” Variables
Framework: Elements of a Solution Domain-Specific or Community-Specific PIDs Governance Process Framework Technical Guidance Conceptual Framework and Metamodels End Users and Systems Certification
Design Consideration #1 Precision, Vocabularies, PIDs, and LOD Precision is Critical in Formal, Structured Data if used for dimensions Precision is desirable for other data Graph edges have a weight distribution … assertion Creativity is impossible if precision is perfect Guidance is required on best practices Value judgment on usability/ trust Some progress in this symposium Technical Guidance
Design Consideration #2 Single Point of Resolution Option 1 – voluntary/ community indexing Option 2 – dedicated resolver Option 3 – ‘publication’ metadata Agreement on minimum index metadata Governance and sustainability No progress thus far Governance | Certification
Option 1 - Hybrid Graph Solution: “Scholix” Scholarly Publications (CrossRef?) TDRs (WDS, DSA, DataCite*) Samples and Events People (ORCID, …) RDI Outputs/ Online Resources Coverage (Temporal, Spatial, Topic) Data Citations (DataCite) Institutions (GRID, ISRI…) Projects Use, Caveats, Lineage, Methods Initiatives Licenses (Creative Commons) Networks Platforms, Instruments, Deployments, Sites, … * Including re3data, DataBib Funders (?) Exists Started Not Now WDS http://www.scholix.org/
The Fabric of Science ICSU-WDS Knowledge Network Scholarly Publications (CrossRef?) TDRs (WDS, DSA, DataCite*) Samples and Events People (ORCID, …) RDI Outputs/ Online Resources Coverage (Temporal, Spatial, Topic) Data Citations (DataCite) Institutions (GRID, ISRI, …) Projects Use, Caveats, Lineage, Methods Initiatives Licenses (Creative Commons) Networks Platforms, Instruments, Deployments, Sites, … * Including re3data, DataBib Funders (?) Exists Started Not Now WDS
Design Consideration #3 Single Point(s) of Failure Persistent Identifiers are now fundamental to research data infrastructure and cannot be allowed to fail. Yet many of these services are community-driven, poorly funded, and sometimes rely on voluntary contributions to function. Identify critical elements of RDI and fund these reliably and into the long term Certify services and manage to maturity Some progress in this symposium Governance | Certification
Design Consideration #4 Conceptual Models and Semantics Conceptual Model – “Because it is a sample” Conceptual Model – “Because of the domain” Conceptual Model – “Because of the owner” Perspective from Statistics Develop a Conceptual Model that allows Protocol-specific, Organisation-Specific, and Domain-specific metadata Influences resolution and higher-level sample metadata Hopefully constructed from existing metadata standards without major modification Good progress in this symposium Conceptual/ Metamodel
A Bit of DataCite Schema RelatedIdentifier Suggestion to add sample-related vocabulary Supplementary Materials Suggestion to add ‘Sample Metadata’ as a link Keyword/ Subject Element Infinitely extensible
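As an illustration of the RelatedIdentifier suggestion, the following is a minimal sketch of how a dataset record might point at the sample it was derived from. It uses the `relatedIdentifier` element with the existing `relatedIdentifierType` value `IGSN` and the existing `relationType` value `IsDerivedFrom`; a sample-specific relation vocabulary, as proposed on the slide, does not appear here because it is only a suggestion. The function name and the identifier value `EXAMPLE0001` are hypothetical, and the fragment omits the namespace and the rest of a full DataCite record.

```python
import xml.etree.ElementTree as ET


def related_sample_element(sample_id: str,
                           id_type: str = "IGSN",
                           relation: str = "IsDerivedFrom") -> ET.Element:
    """Build a <relatedIdentifiers> fragment linking a dataset to a sample.

    sample_id is a placeholder value; id_type and relation are passed
    through unchanged, so a future sample-related vocabulary term could
    be substituted without changing the code.
    """
    wrapper = ET.Element("relatedIdentifiers")
    item = ET.SubElement(
        wrapper,
        "relatedIdentifier",
        {"relatedIdentifierType": id_type, "relationType": relation},
    )
    item.text = sample_id
    return wrapper


# Hypothetical sample identifier, for illustration only.
xml_text = ET.tostring(related_sample_element("EXAMPLE0001"), encoding="unicode")
print(xml_text)
```

Keeping the relation type as a parameter reflects the slide's point: the link mechanism already exists, and only the vocabulary would need extending.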
Design Consideration #5 Definition of Critical Dimensions for Data Families Sample Identifier is Often Required and Must Become Common Practice in Datasets Vocabulary or PID for each Dimension Develop a Conceptual Model for each Data Family and Discipline/ Standard Variable Influences metadata and data content standards No progress thus far Best Practice/ Technical Guidance
Generic Dimensions of Data Sample S Spatial Coverage XYZ Temporal Coverage: T Topic or Semantic/ Ontological Coverage D: Demographic P: Phenomenon mostly physical, chemical, or other contextual data B: Biological Tx: Species and Taxonomy (with some extensions) Al: Allele/ Genome/ Phylogenetic Ch: Characteristics, Traits, and Life Stages Each unique combination of these, supported by a vocabulary/ ontology, is a generic data family Continuous or Near-Continuous: Uppercase Discrete or dispersed: Lowercase Best Practice/ Technical Guidance
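The notation above can be read mechanically. The sketch below parses a family signature such as `XYz, t, Tx, S` into its dimensions, treating uppercase spatial and temporal letters as continuous or near-continuous and lowercase letters as discrete or dispersed. It assumes the case convention applies per axis letter for space and time, and that the two-letter topic codes are labels whose case carries no meaning; both are interpretations, not something the slide states. The function and dictionary names are hypothetical.

```python
# Topic codes as listed on the slide; treated here as plain labels.
TOPIC_CODES = {
    "D": "Demographic",
    "P": "Phenomenon",
    "B": "Biological",
    "Tx": "Species and Taxonomy",
    "Al": "Allele/ Genome/ Phylogenetic",
    "Ch": "Characteristics, Traits, and Life Stages",
}


def coverage_kind(letter: str) -> str:
    """Uppercase means continuous or near-continuous, lowercase discrete."""
    return "continuous" if letter.isupper() else "discrete"


def parse_family(signature: str) -> dict:
    """Parse a signature like 'XYz, t, Tx, S' into its generic dimensions."""
    result = {"spatial": {}, "temporal": None, "topics": [], "sample": False}
    for token in (part.strip() for part in signature.split(",")):
        if not token:
            continue
        if token.lower() == "xyz":
            # One entry per spatial axis, each with its own coverage kind.
            for letter in token:
                result["spatial"][letter.lower()] = coverage_kind(letter)
        elif token.lower() == "t":
            result["temporal"] = coverage_kind(token)
        elif token == "S":
            result["sample"] = True
        elif token in TOPIC_CODES:
            result["topics"].append(TOPIC_CODES[token])
        else:
            raise ValueError(f"Unrecognised dimension code: {token!r}")
    return result


family = parse_family("XYz, t, Tx, S")
print(family)
```

Two datasets whose parsed signatures are equal would, under this reading, belong to the same generic data family and be candidates for a shared crosswalk.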
Some Generic Data Families and Crosswalk Requirements Typical Dimensions/ Content Typical Infrastructure Typical Syntax/ Schema Object xyz, t, P/C, S DDI “Sparse” NetCDF XYZ, T, P, S OPeNDAP Multi-dimensional S-DB Traditional Spatial XYz, t, P, S WxS O&M Signals XYZ, T, P/ B, S SensorThings General Structured, Media, Objects Object xyz, t, P/ B, S CSV, PDF, ZIP GBIF Index XYz, t, Tx, S Species Occurrence, … DwC GenBank XYZ, T, Al, S Genetic FTP/ ASN.1 Now Implementing: ✪ Array Databases/ Virtual Cubes for Everything WCS
Simple or Core Information Model Genes and Alleles Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena Best Practice/ Technical Guidance
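One way to picture the core information model is as a sampling event that carries its spatial and temporal coverage and points, by identifier, at the other entities. The sketch below is an assumption about how the boxes on the slide relate; the class names, field names, identifier strings, and coordinates are all hypothetical and are not taken from any existing standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Coverage:
    """Spatial and temporal coverage of a sampling event (hypothetical fields)."""
    latitude: float
    longitude: float
    start: str  # ISO 8601 date
    end: str    # ISO 8601 date


@dataclass
class SamplingEvent:
    """A sampling event linked by identifier to the other core entities."""
    sample_pid: str
    coverage: Coverage
    allele_ids: List[str] = field(default_factory=list)
    taxon_ids: List[str] = field(default_factory=list)
    trait_ids: List[str] = field(default_factory=list)
    phenomenon_ids: List[str] = field(default_factory=list)


def populated_entities(event: SamplingEvent) -> List[str]:
    """Return the names of the core entities this event actually links to."""
    links = {
        "Genes and Alleles": event.allele_ids,
        "Species and Taxons": event.taxon_ids,
        "Life Stages, Traits and Characters": event.trait_ids,
        "Physical Phenomena": event.phenomenon_ids,
    }
    return [name for name, ids in links.items() if ids]


# Illustrative values only; none of these identifiers are real.
example = SamplingEvent(
    sample_pid="sample:example-1",
    coverage=Coverage(-33.9, 18.4, "2017-01-01", "2017-01-02"),
    taxon_ids=["taxon:example-a"],
)
print(populated_entities(example))
```

An event that links only to taxa, as here, corresponds to a presence or abundance record; adding allele or trait identifiers would move it towards the phylogenetic or morphological cases.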
Example: Taxon Abundance, Presence and Absence Genes and Alleles Relationship Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena Best Practice/ Technical Guidance
Example: Phylogenetic Data Genes and Alleles Relationship Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena Best Practice/ Technical Guidance
Example: Morphology Genes and Alleles Relationship Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena Best Practice/ Technical Guidance
Example: Biome Definition, Ecosystem Services Genes and Alleles Relationship Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena Best Practice/ Technical Guidance
Design Consideration #6 Identifiers for the Fabric of Science Certification of the Process of Science PIDs for All Important Agents/ Objects Repositories for Objects and Artifacts Registries and Name Services World Data System/ DSA considering extension beyond data Community Assessments – Rating-driven (TripAdvisor, …) Comment-driven (GitHub, ...) Good progress in this symposium Certification
Design Consideration #7 Solution Granularity Too specialised: reduced utility and much duplication Too generalised: all communities miss critical aspects Two-tiered solution? Some progress in this symposium Certification
“Granularity” of a Solution Universalise? Generalise Specialise “change the code” “change the configuration” “change the reference model”
Actions Who takes this forward? RDA Interest Group WG: Sample – Conceptual Model/ Metadata WG: Develop Two-tiered Solution? WG: Criteria for Trusted Name Services and Registries WG: Best Practices in Respect of Sample Management ICSU-WDS/ DSA/ Institutions (IGSN, Pangaea, DataCite, …) Develop criteria for trusted sample repositories Implement certification infrastructure Implement Scholix Generalisation – institution required
“Metadata is a tax in the Data World. They may not like it, but responsible citizens pay their taxes. Citizens cannot expect services and infrastructure to be built on their behalf if they do not pay tax.”
?