GB22 TRAINING EVENT FOR NODES – 4 OCTOBER 2015 Session 02: 2015 Data Publishing Landscape Laura Russell
INDEX
- Data publishing landscape
- Biodiversity data publishing
- Data types
- Data standards
- Data normalization and data quality
- Data publishing methods
- Promotion of data publishing
- Use cases
DATA PUBLISHING LANDSCAPE
- DiGIR/TAPIR in high use to publish biodiversity data
- Idea for a simple, compressed text-based file for publishing introduced at TDWG
- GBIF introduces IPT 1.0
- GBIF redevelops IPT
- GBIF introduces IPT 2.0
- Data publishing taught at Nodes training
- Nodes and aggregators begin to install and use IPTs
- Occurrence and checklist type datasets, along with IPT installations, show continued growth
2011
DATA PUBLISHING LANDSCAPE - STATISTICS
DATA PUBLISHING LANDSCAPE 2015
- The continued GBIF commitment to improving access to biodiversity data
- Refinement and expansion of standards and publishing software
- Evolving social norms
- Most data still published with the simple occurrence core
- Portals do not contain the features to support richer data
- Many institutions still need convincing to publish biodiversity data
WHAT IS BIODIVERSITY DATA?
A digital text or multimedia record detailing facts about an instance of occurrence of an organism, i.e. the what, where, when, how, and by whom of the occurrence and its recording.
WHAT IS DATA PUBLISHING?
“Publishing” refers to making biodiversity datasets publicly accessible and discoverable, in a standardized form, via an access point, typically a web address (a URL).
BIODIVERSITY DATA TYPES
- Checklists
- Occurrences
- Metadata
BIODIVERSITY DATA TYPES – SAMPLE DATA Samples
DATA STANDARDS
- ABCD: Access to Biological Collection Data (2005)
- DwC: Darwin Core (2009)
- AC: Audubon Core Multimedia Resources Metadata Schema (2013)
- NCD: Natural Collection Descriptions (Draft)
DARWIN CORE
recordedBy: A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence. The primary collector or observer, especially one who applies a personal identifier (recordNumber), should be listed first.
Examples: "José E. Crespo", "Oliver P. Pearson | Anita K. Pearson"
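The recordedBy definition above can be illustrated with a minimal Python sketch. It assumes the pipe character is the separator, as in the slide's second example; the helper names split_recorded_by and join_recorded_by are hypothetical and not part of any Darwin Core tooling.

```python
def split_recorded_by(value: str) -> list[str]:
    """Split a concatenated recordedBy value into individual names."""
    # Assumption: names are separated by a pipe, as in the slide's example.
    return [name.strip() for name in value.split("|") if name.strip()]


def join_recorded_by(names: list[str]) -> str:
    """Concatenate names into one recordedBy value, primary collector first."""
    return " | ".join(names)


collectors = split_recorded_by("Oliver P. Pearson | Anita K. Pearson")
# By the convention in the definition, the first name is the primary collector.
primary_collector = collectors[0]
```

Keeping the order of the list intact matters here, since the definition asks for the primary collector or observer to be listed first.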
SIMPLE DARWIN CORE
Simple Darwin Core is a specification for one particular way to use the Darwin Core terms, namely to share data about taxa and their occurrences in a simply structured way. It is probably what is meant if someone suggests that you "format your data according to the Darwin Core".
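As a rough illustration of the "simply structured" idea, the sketch below writes a flat table with one row per occurrence and one column per Darwin Core term. The column names are real Darwin Core terms, but the selection of columns, the record values, and the function name to_simple_dwc_csv are assumptions made for this example.

```python
import csv
import io

# A small, hypothetical selection of Darwin Core terms used as column headers.
FIELDS = [
    "occurrenceID",
    "basisOfRecord",
    "scientificName",
    "eventDate",
    "recordedBy",
    "decimalLatitude",
    "decimalLongitude",
]

# Invented sample records: one flat row per occurrence, no nested structure.
records = [
    {
        "occurrenceID": "occ-1",
        "basisOfRecord": "HumanObservation",
        "scientificName": "Puma concolor",
        "eventDate": "2015-06-01",
        "recordedBy": "José E. Crespo",
        "decimalLatitude": "-41.13",
        "decimalLongitude": "-71.31",
    },
    {
        "occurrenceID": "occ-2",
        "basisOfRecord": "PreservedSpecimen",
        "scientificName": "Lynx rufus",
        "eventDate": "2014-09-12",
        "recordedBy": "Oliver P. Pearson | Anita K. Pearson",
        "decimalLatitude": "37.87",
        "decimalLongitude": "-122.26",
    },
]


def to_simple_dwc_csv(rows):
    """Serialize rows as CSV text with Darwin Core terms as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_simple_dwc_csv(records)
# Reading the text back shows that the flat table round-trips cleanly.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```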
DARWIN CORE ARCHIVE A Darwin Core Archive (DwCA) is the text representation of data formatted to Darwin Core. A DwCA is a compressed file containing a minimum of three files.
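The three-file layout can be sketched in Python with the standard zipfile module. This is only an illustration under assumptions: the file names occurrence.txt, meta.xml, and eml.xml follow a commonly seen convention, the data row is invented, and the EML content is a placeholder rather than schema-valid EML.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Core data file: tab-separated text with a header row (invented sample row).
OCCURRENCE_TXT = "occurrenceID\tscientificName\nocc-1\tPuma concolor\n"

# Descriptor mapping the columns of the core file to Darwin Core terms.
# A raw string keeps "\t" and "\n" as literal escape sequences in the XML.
META_XML = r"""<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>
"""

# Placeholder dataset metadata; a real archive would carry a full EML document.
EML_XML = "<eml><dataset><title>Hypothetical example dataset</title></dataset></eml>"


def build_archive() -> bytes:
    """Build a minimal three-file archive in memory and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", OCCURRENCE_TXT)
        archive.writestr("meta.xml", META_XML)
        archive.writestr("eml.xml", EML_XML)
    return buffer.getvalue()


def core_data_file(data: bytes) -> str:
    """Read meta.xml from the archive and return the core data file name."""
    ns = {"dwca": "http://rs.tdwg.org/dwc/text/"}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    return root.find("dwca:core/dwca:files/dwca:location", ns).text


archive_bytes = build_archive()
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    names = sorted(archive.namelist())
```

In practice a publishing tool such as the IPT generates these files; the sketch only shows how the pieces fit together.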
STAR SCHEMA
Diagram: a Core surrounded by extensions (Ext 1 to Ext 5). Core and extensions + meta.xml + EML.xml = DwC Archive.
MAPPING CORES
Taxon Core: The category of information pertaining to taxonomic names, taxon name usages, or taxon concepts. Released April 2015; this version removes the terms dcterms:source and dcterms:rights, and adds dcterms:license. 43 terms.
Occurrence Core: The category of information pertaining to evidence of an occurrence in nature, in a collection, or in a dataset (specimen, observation, etc.). Released July 2015; this version removes the terms dcterms:source, dcterms:rights, dwc:individualID, dwc:occurrenceDetails, and adds dcterms:license, dwc:organismQuantity, dwc:organismQuantityType, dwc:organismID, dwc:organismName, dwc:organismScope, dwc:associatedOrganisms, dwc:organismRemarks, dwc:parentEventID, dwc:sampleSizeValue, dwc:sampleSizeUnit. 169 terms.
Event: The category of information pertaining to a sampling event. Issued 29 May terms
EXTENSIONS
Darwin Core does not provide terms for every possible type of data.
- 22 registered
- 25 under development
Examples:
- Audubon Media Description (aka Audubon Core)
- Darwin Core Identification History
- Darwin Core Measurement or Facts
STAR SCHEMA EXAMPLE - OCCURRENCE
Diagram: Occurrence Core with extensions Media, Geographical, Determination, Germplasm. Core and extensions + meta.xml + EML.xml = DwC Archive (Occurrence).
STAR SCHEMA EXAMPLE - CHECKLIST
Diagram: Taxon Core with extensions Literature, Description, Occurrences, Vernacular, Distribution, Types. Core and extensions + meta.xml + EML.xml = DwC Archive (Checklist).
STAR SCHEMA EXAMPLE - SAMPLE
Diagram: Event Core with extensions Occurrences, Measurement/Fact, Relevé. Core and extensions + meta.xml + EML.xml = DwC Archive (Samples).
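In a star schema, each extension row points back to one core record through the core identifier. The Python sketch below shows that linkage for the sample example, with an Event core plus Occurrence and Measurement/Fact extension rows. All records are invented, and attach_extensions is a hypothetical helper, not part of any GBIF library.

```python
from collections import defaultdict

# Core rows: one per sampling event (invented data).
events = [
    {"eventID": "ev-1", "eventDate": "2015-06-01", "samplingProtocol": "pitfall trap"},
    {"eventID": "ev-2", "eventDate": "2015-06-02", "samplingProtocol": "pitfall trap"},
]

# Extension rows: each carries the eventID of the core row it belongs to.
occurrences = [
    {"eventID": "ev-1", "occurrenceID": "occ-1", "scientificName": "Carabus nemoralis"},
    {"eventID": "ev-1", "occurrenceID": "occ-2", "scientificName": "Carabus nemoralis"},
    {"eventID": "ev-2", "occurrenceID": "occ-3", "scientificName": "Carabus nemoralis"},
]
measurements = [
    {"eventID": "ev-1", "measurementType": "air temperature",
     "measurementValue": "18.5", "measurementUnit": "°C"},
]


def attach_extensions(core_rows, core_id, **extensions):
    """Return core rows, each with its matching extension rows attached."""
    # Index every extension by the core identifier for fast lookup.
    index = {name: defaultdict(list) for name in extensions}
    for name, rows in extensions.items():
        for row in rows:
            index[name][row[core_id]].append(row)
    # Copy each core row and add one list per extension (empty if no match).
    return [
        {**core, **{name: index[name].get(core[core_id], []) for name in extensions}}
        for core in core_rows
    ]


star = attach_extensions(
    events, "eventID", occurrences=occurrences, measurements=measurements
)
```

A core row may have zero, one, or many rows in each extension, which is why every attached value is a list.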
DATA NORMALIZATION
- What is data normalization?
- Reasons to normalize a database
- Normal forms
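One way to picture normalization is to take a flat table in which taxon details repeat on every row and split it into a taxon table and an occurrence table linked by an identifier. The Python sketch below does this for invented data. The normalize function and the generated taxonID values are assumptions for illustration, and the example does not aim at any particular normal form.

```python
# Flat input: the taxon name and family repeat on every occurrence row.
flat_rows = [
    {"occurrenceID": "occ-1", "scientificName": "Puma concolor",
     "family": "Felidae", "locality": "Site A"},
    {"occurrenceID": "occ-2", "scientificName": "Puma concolor",
     "family": "Felidae", "locality": "Site B"},
    {"occurrenceID": "occ-3", "scientificName": "Lynx rufus",
     "family": "Felidae", "locality": "Site A"},
]


def normalize(rows):
    """Split flat rows into a taxon table and an occurrence table."""
    taxa = {}  # scientificName -> taxon row, stored once per distinct name
    occurrences = []
    for row in rows:
        name = row["scientificName"]
        if name not in taxa:
            taxa[name] = {
                "taxonID": f"taxon-{len(taxa) + 1}",
                "scientificName": name,
                "family": row["family"],
            }
        # The occurrence row keeps only a reference to the taxon.
        occurrences.append({
            "occurrenceID": row["occurrenceID"],
            "taxonID": taxa[name]["taxonID"],
            "locality": row["locality"],
        })
    return list(taxa.values()), occurrences


taxa, occurrences = normalize(flat_rows)
```

After the split, correcting a family name means editing one taxon row instead of every occurrence row that mentions it.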
DATA QUALITY
- Tools
- Should you work on improving the data?
- Importance of feedback
DATA PUBLISHING METHODS
DATA PUBLISHING METHODS – POLLS To be explained in the live session…
PROMOTION OF DATA PUBLISHING
- Topic of discussion at the Nodes Training in Berlin
- Core element in the day-to-day work of Node Managers
PROMOTION OF DATA PUBLISHING - BARRIERS
Barrier categories: psychological & cultural, institutional, capacity, practical
1. Lack of knowledge
2. Lack of understanding
3. Lack of will
4. Perceived data value
5. Privacy concerns
6. Lack of authorization
7. Lack of time / planning
8. Lack of capacity
9. Lack of funding
10. Lack of infrastructure
PROMOTION OF DATA PUBLISHING - RESTRICTIONS
1. Refuse to share.
2. Refuse to share until they have exhausted the planned use of the data.
3. Will only share their data for a fee.
4. Will only share data under specific restrictions.
5. Agree to share data openly.
PROMOTION OF DATA PUBLISHING - STRATEGIES
1. Facilitate access to financial support.
2. Call upon commitments or legal mandates.
3. Call upon open access / moral principles.
4. Show the benefits of better data management.
5. Show the benefit for their scientific careers.
6. Peer pressure.
7. Start / support big digitization programmes.
8. Start / support data repatriation efforts.
PROMOTION OF DATA PUBLISHING – DISCUSSION
Challenges:
- Not wanting to publish and/or not wanting to publish all the data
- Technical threshold of an IPT
- Restrictive licensing of data
Strategies:
- Start smaller: metadata only
- Promote one-off publishing with multiple exposures
- Provide hosted IPTs to eliminate the technical threshold
- Illustrate licensing with telling examples
- Promote and organize trainings to bring reluctant publishers in with an easier “sell” like data papers
USE CASES - INTRODUCTION
Explore four use cases based on current publishing practices:
- Literature
- Observation data
- Natural history collections
- Checklists
Complete two exercises:
- Definition of publishing strategies
- Publish datasets
USE CASE 1: DATA FROM LITERATURE Blue Group
USE CASE 2: OBSERVATIONAL DATA Green Group Red Group
USE CASE 3: NATURAL HISTORY COLLECTION DATA Yellow Group
USE CASE 4: TAXONOMIC CHECKLISTS Purple Group