Online tools and standards for Biodiversity data in the Semantic Web Dr Dimitris Koureas Biodiversity Informatics Group | Department of Life Sciences The Natural History Museum London
What is the semantic web? Slide adjusted from Page R. presentation in pro-iBiosphere
What is the semantic web? Example statement: Fred (a person) is the author of a book.
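The statement on this slide can be written as subject, predicate, object triples, which is the basic data model of the semantic web. The following is only a minimal sketch in plain Python, not an RDF library; the `ex:` identifiers and the predicate name `ex:isAuthorOf` are made up for illustration.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers below are hypothetical and only illustrate the idea.
triples = [
    ("ex:Fred", "rdf:type", "ex:Person"),
    ("ex:Book1", "rdf:type", "ex:Book"),
    ("ex:Fred", "ex:isAuthorOf", "ex:Book1"),
]


def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """Return every subject linked to `obj` by `predicate`."""
    return [s for s, p, o in triples if p == predicate and o == obj]


print(objects("ex:Fred", "ex:isAuthorOf"))
print(subjects("rdf:type", "ex:Person"))
```

Because every statement has the same three-part shape, statements from different sources can be pooled into one list and queried together.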
The Semantic web: “The future of the web …and always will be” – Peter Norvig (Google)
Biodiversity informatics: the study of the transformation and communication of information in Life and Earth sciences. It provides the means (generating and enhancing the necessary infrastructure).
Research vs Infrastructure
Research: Discovery; Ephemeral; Individualistic; Massive redundancy; Optional; Risk taking
Infrastructure: Implementation; Communal / agreed; Essential; Persistent; Robust & reliable; Adaptable
Slide adapted from Patterson D. 2013, Tempe, Arizona
What are the current challenges in Biodiversity informatics?
Current taxonomic data production: publications based on countless specimens, images, maps, keys and datasets, typically generated by small communities for “local” research projects. Figure from Costello M.J et al, 2013 doi: /science
Our current taxonomic data production
15-20k new spp. described annually (2M total) [1]
30k nomenclatural acts (12M total) [1]
20k phylogenies (750k total) [2]
31k taxa sequenced (360k taxa total) [3]
800k BioMed papers (40M total pp. of taxonomy) [4]
Countless specimens, images, maps, keys and datasets
1.8 M described spp. (17M names)
300M pages (over last 250 years)
1.5-3B specimens
Figures from 1) Zhang, Zootaxa , 1-4; 2) Web-of-Science; 3) Genbank and 4) PubMed.
Now imagine that…
Estimates of 7.5 million species still undescribed [1]
[1] Mora C et al. How Many Species Are There on Earth and in the Ocean? doi: /journal.pbio
Biodiversity informatics landscape
Key problems:
Landscape is complex, fragmented & hard to navigate
Many audiences (policy makers, scientists, amateurs, citizen scientists)
Many scales (global solutions to local problems)
Figure adapted from Peterson et al, Syst. & Biodiv doi: /
Science is carried out “locally”: by local scientists, being part of local infrastructures, having local funders.
BUT
Science is global: it needs global standards, global workflows, cooperation of global players.
Expected volume of taxonomic and biodiversity data: the need for extracting, aggregating and linking data on a global level
To achieve this…
“Link together evolutionary data … by developing analytical tools and proper documentation and then use this framework to conduct comparative analyses, studies of evolutionary process and biodiversity analyses”
Cyndy Parr, Rob Guralnick, Nico Cellinese and Rod Page. TREE doi: /j.tree
This requires data, information & knowledge to be…
Digital: not printed paper
Openly accessible: not behind barriers (e.g. paywalls)
Linked-up: not in silos
Hour-glass motif for big data infrastructure Data re-use Data generation Data pool Slide adapted from Patterson D. 2013, Tempe, Arizona
Big data world with re-use data
Data re-use: Aggregation, Visualization, Analysis, Manipulation
Data pool
Data generation: Models, Observations, Experiments, Processed
Nodes interconnected Slide adapted from Patterson D. 2013, Tempe, Arizona
But how many biodiversity informatics projects are out there?
At least 679!
Sources: EDIT, TDWG & ViBRANT 2013
Categories:
Data Aggregator - a web site that collates data from a variety of sources (digital and hardcopy) and presents it in one form
Data Indexer - a web site that provides lists or indexes of other sites that provide data
Data Provider - a web site that provides data directly from research or other studies
Data Standards - a web site that contributes to formulating or developing standards for data
Facilitator - a web site that facilitates the provision of data by other projects or web sites
Aggregators: GBIF, our global leader in occurrence data
Aggregators: EU-NOMEN - PESI
Aggregators: Making taxonomy digital, open & linked
Scratchpads are an integrated system to enter, curate, mark up, link and publish data: the taxonomic workflow in a single virtual environment
The Scratchpads concept: a Scratchpad is a website that holds data for you and your community. Your data; External data & services
580 Scratchpads communities by 8,185 active registered users, covering 55,607 taxa in 653,274 pages.
Unique visitors to Scratchpads sites: 65,000 per month; in total more than 1,300,000 visitors.
Facilitators: BOLD (Barcode of Life Data Systems). Researchers can assemble, test, and analyse their data records in BOLD before uploading them to the International Nucleotide Sequence Database Collaboration (DDBJ, ENA, GenBank).
Providers: Biodiversity Heritage Library (BHL). Biodiversity literature openly available to the world as part of a global biodiversity community. > 40 M pages of legacy literature
Standard Exchange formats
Darwin Core (DwC): primarily used as a metadata standard for specimen records
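To make the idea of a specimen record in Darwin Core concrete, here is a minimal sketch that stores one record under Darwin Core term names and writes it out as CSV text. The term names (`occurrenceID`, `basisOfRecord`, `scientificName` and so on) are Darwin Core terms; every value, including the identifier and institution code, is invented for illustration, and the helper `to_csv` is a hypothetical name.

```python
import csv
import io

# One occurrence record keyed by Darwin Core term names.
# All values are invented for illustration.
record = {
    "occurrenceID": "urn:example:occurrence:0001",
    "basisOfRecord": "PreservedSpecimen",
    "institutionCode": "EXAMPLE",
    "catalogNumber": "0001",
    "scientificName": "Quercus robur L.",
    "eventDate": "2013-05-21",
    "decimalLatitude": "51.4967",
    "decimalLongitude": "-0.1764",
}


def to_csv(records):
    """Serialise records to CSV text, using the term names as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv([record])
print(csv_text)
```

Because the column names come from a shared standard rather than from a local database, a second institution can read the file without a bespoke mapping.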
Access to Biological Collection Data (ABCD): highly detailed; aims to provide a complete set of data elements for natural history collection items
Audubon Core Multimedia Resources Metadata Schema: the Audubon Core metadata schema ("AC") is a representation-neutral metadata vocabulary for describing biodiversity-related multimedia resources and collections.
Taxonomic Concept Transfer Schema (TCS): mechanism to exchange data concerning the names of organisms
Standards facilitate systems interoperability
We need Unique Identifiers
Identifiers: a key to find something in a database.
UPIDs to identify content
Can a taxonomic name be used as a UPID?
Is it Unique? Is it Persistent? Is it an Identifier?
Are taxonomic names enough for communication between Scientists? YES
Are taxonomic names enough for communication between machines? CAN BE IF
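One reason a bare name string is a weak key for machines is that the same string can refer to different taxa under different nomenclatural codes. The sketch below checks which names in a small list map to more than one identifier. Ficus and Morus are real cross-kingdom homonyms; the `ex:taxon/...` identifiers are hypothetical, and the function name `ambiguous_names` is made up for this example.

```python
from collections import defaultdict

# (name string, kingdom, identifier). The identifiers are hypothetical.
taxa = [
    ("Ficus", "Plantae", "ex:taxon/1"),    # figs
    ("Ficus", "Animalia", "ex:taxon/2"),   # fig shells (sea snails)
    ("Morus", "Plantae", "ex:taxon/3"),    # mulberries
    ("Morus", "Animalia", "ex:taxon/4"),   # gannets
    ("Quercus", "Plantae", "ex:taxon/5"),  # oaks
]


def ambiguous_names(rows):
    """Return the name strings that point to more than one identifier."""
    by_name = defaultdict(set)
    for name, _kingdom, identifier in rows:
        by_name[name].add(identifier)
    return sorted(name for name, ids in by_name.items() if len(ids) > 1)


print(ambiguous_names(taxa))
```

A person reading "Ficus" in a botany paper resolves the ambiguity from context; a machine joining two datasets on the name column does not.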
For example: Page R., Brief Bioinform (2008) 9 (5): doi: /bib/bbn022
ONLY IF: Name reconciliation
Patterson, D. J. et al. Names are key to the big new biology. TREE 25: doi: /j.tree
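Name reconciliation means mapping the many spellings and forms of a name onto one agreed record. The following is a rough sketch of the idea only, assuming a tiny made-up reference list and simple string similarity from Python's standard `difflib`; it is not the method of any particular name service, it ignores infraspecific ranks and hybrids, and the identifiers and the helper names `canonical` and `reconcile` are hypothetical.

```python
import difflib

# Hypothetical reference list: canonical name -> made-up identifier.
reference = {
    "Quercus robur": "ex:name/1",
    "Fagus sylvatica": "ex:name/2",
}


def canonical(name):
    """Keep the first two words (genus, epithet) with standard capitalisation."""
    parts = name.split()[:2]
    if len(parts) < 2:
        return name.strip().capitalize()
    return parts[0].capitalize() + " " + parts[1].lower()


def reconcile(name, cutoff=0.85):
    """Return (matched name, identifier), or None when nothing is close enough."""
    candidate = canonical(name)
    match = difflib.get_close_matches(candidate, list(reference), n=1, cutoff=cutoff)
    if not match:
        return None
    return match[0], reference[match[0]]


for raw in ["Quercus robur L.", "quercus  ROBUR", "Fagus silvatica", "Homo sapiens"]:
    print(raw, "->", reconcile(raw))
```

The `cutoff` value is an arbitrary choice for this sketch; a real service would tune it and would normally ask a curator to confirm fuzzy matches.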
The need for Controlled Vocabularies and Ontologies
Knowledge Organisation Systems
Google has done it
Ontologies: Plant anatomical and structural development Ontology
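A controlled vocabulary is, at its simplest, a fixed list of allowed values that free-text entries are checked against. The sketch below assumes a small subset of the values recommended for the Darwin Core term `basisOfRecord`; the input values and the helper name `check_vocabulary` are invented for illustration.

```python
# A small subset of the values recommended for Darwin Core basisOfRecord.
BASIS_OF_RECORD = {
    "PreservedSpecimen",
    "FossilSpecimen",
    "LivingSpecimen",
    "HumanObservation",
    "MachineObservation",
}


def check_vocabulary(values, vocabulary):
    """Split values into those in the vocabulary and those outside it."""
    accepted = [v for v in values if v in vocabulary]
    rejected = [v for v in values if v not in vocabulary]
    return accepted, rejected


accepted, rejected = check_vocabulary(
    ["PreservedSpecimen", "specimen", "HumanObservation", "herbarium sheet"],
    BASIS_OF_RECORD,
)
print("accepted:", accepted)
print("rejected:", rejected)
```

An ontology goes further than such a list by also recording how the terms relate to each other, which this sketch does not attempt.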
Example of ontology usage: Deans A. et al. Time to change how we describe biodiversity, Trends in Ecology & Evolution 2012 doi: /j.tree
Examples of integrated projects
How is all this relevant to my work? What should I take home?
Repositories #bigdata Providers Data silos Community
The four nodes of data workflow
1. We collect and generate data
2. We curate, link and structure data
3. We analyse data
4. We publish data
The four nodes of data workflow: Data collection & generation; Data curation; Data analysis; Data publishing. What are the bottlenecks in the workflow?
What we need is… a seamless workflow: Data collection & generation; Data curation; Data analysis; Data publishing
Old Joke
A drunk is crawling around a lamp post on his hands and knees. A cop comes along…
Cop: What are you doing?
Drunk: Looking for my car keys.
Cop: Are you sure you dropped them here?
Drunk: No, I dropped them in the alley.
Cop: So why are you looking here?
Drunk: Because the light’s better.
Science is a ‘light’s better’ endeavor in that research effort is not directed at areas where the work is technically infeasible. Research is directed where real, interpretable results may be obtained. We do, in fact, conduct research where the light’s better. But, when the light changes, so does science. With better illumination, we look in new areas. We find new things…
Addressing the challenges of biodiversity informatics “…the field [of biodiversity informatics] appears to be growing in a void of overarching, motivating questions, effectively making it a set of technologies in search of questions to address.” Peterson et al, Syst. & Biodiv doi: /