GLOBAL BIODIVERSITY INFORMATION FACILITY ECAT Programme Update David Remsen & Markus Döring.



ECAT Goals
- GBIF provides a simple and extensible solution for publishing taxonomic checklists
- Published data is used to improve access and data interoperability within the portal
- Published data supports taxonomic name services
- Name services support development of tools that meet national and regional needs

Scope of ECAT publishing
- Taxonomic Catalogues
- Monographs/Flora/Fauna
- Annotated Species Checklists
- Regional
- Thematic
- Nomenclators
- Name Dictionaries
- No taxonomy

Darwin Core Archive Format

Vocabularies.gbif.org
- Community-driven
- Internationalised
- Vocabularies
- Extensions
- Tested
- Ready for release
See Spanish Page

Extensions
- Extend the DwC
- For Occurrence-level
- For Species-level
- Draft
- Add relevant vocabs.
- Review
- Publish!

Terms of Bionomenclature
- Taxonomic Std Reference
- Print Publication
- Online Reference
- Semantic
- Supports vocabulary building
- April
Go to website

Publishing Checklists to GBIF
- Integrated Publishing Toolkit (next version)
- Full & “lite”
- Direct DWC Output from Sources
- HIT Adapters for existing sources
- Spreadsheets
- Desktop Applications
- Refactoring existing online Tools (ITIS, EDIT)

HIT Adapters

Database | Classification | Synonyms | Vernacular | Distrib.
- Catalogue of Life 2009: Yes
- ITIS: Yes
- Tree of Life: Yes, -, -, -
- USDA Plants: Yes
- GRIN GermPlasm Taxonomy: Yes, -
- NCBI Taxonomy: Yes
- Palaeobiology Database: Yes, -, -
See Example DWC Archive Output
View the Project Wiki page with links to all source scripts

Publishing by Spreadsheet
- Simple
- Validated
- Developing countries
- Conforms to existing workflow

Publishing by Spreadsheet
- Forms and auto-complete
- Metadata and data
- Occurrence data
- Species Checklists
- Embedded vocabularies

Desktop Application
- Desktop Application
- Publishes DwCA
- Currently used at GBIFS
- ~100 sources
- 600,000 records
- 90 languages
- Could be deployed

DwCA Validating Tool
View the DwCA Validator

Published DWC Archive files
- Current Status
- Manually Curated
- 82 ECAT sources
- 14 Taxonomic authority files
- 64 Vernacular Name Lists
- 2 Nomenclatural Lists
- 2 Thematic Lists
- 5,800 occurrence classifications
- 15M different usages
- 11,454,896 unique names assigned to 4.8M name groups
- 4,612,444 canonical names

Importing Data

ChecklistBank Command Line Tool
- Bundles many tasks into 1 executable jar
- Adding/deleting/exporting resources, (pre)importing, lexical grouping, nub build
  * To be used by HIT module
  * Importing in 3 steps:
    1) Preimport terms
    2) Import into isolated db schema
    3) Accept import into public schema
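The three import steps can be pictured with a small staging-then-accept sketch. This is only one possible reading of the slide, not the ChecklistBank tool itself: the table names `staging_usage` and `public_usage`, the `KNOWN_RANKS` vocabulary, and the `import_checklist` function are all hypothetical, and SQLite tables stand in for separate database schemas.

```python
import sqlite3

# Hypothetical controlled vocabulary of ranks (an assumption for this sketch).
KNOWN_RANKS = {"kingdom", "family", "genus", "species"}


def import_checklist(conn, rows, known_ranks=KNOWN_RANKS):
    """Import (id, name, rank) rows; return True if they reach the public table."""
    # Step 1: pre-import the terms used by the source and compare them
    # with the controlled vocabulary.
    unknown_ranks = {rank for _, _, rank in rows} - set(known_ranks)

    # Step 2: load everything into an isolated staging table.
    conn.execute("DROP TABLE IF EXISTS staging_usage")
    conn.execute(
        "CREATE TABLE staging_usage (id TEXT PRIMARY KEY, name TEXT, rank TEXT)"
    )
    conn.executemany("INSERT INTO staging_usage VALUES (?, ?, ?)", rows)
    conn.commit()

    # Step 3: accept the import into the public table only if it is clean.
    if unknown_ranks:
        return False  # staging is kept for inspection, public data is untouched
    conn.execute(
        "CREATE TABLE IF NOT EXISTS public_usage "
        "(id TEXT PRIMARY KEY, name TEXT, rank TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO public_usage SELECT id, name, rank FROM staging_usage"
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
accepted = import_checklist(
    conn, [("1", "Plantae", "kingdom"), ("2", "Fabaceae", "family")]
)
rejected = import_checklist(conn, [("3", "Gerardia paupercula", "art")])
public_count = conn.execute("SELECT COUNT(*) FROM public_usage").fetchone()[0]
```

Keeping the load and the acceptance separate means a damaged source never reaches the public data: here the second import, which uses the unrecognised rank «art», stops in staging.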

Checklist Data Qualities
1. Highly relational taxonomic data: almost all records are linked in a tree hierarchy, plus basionym links
2. Wrong or missing records destroy the integrity of the whole dataset, not just a single record!
3. Different from flat, unrelated occurrence records

Syntactically damaged sources
- Wrong mappings
- Wrong character encodings
- End-of-line breaks or tabs within data

Data Quality
- Broken referential integrity
- Bad names (e.g. «Unallocated Family»)
- Missing or unused controlled vocabularies, e.g. «art» for rank species

Names can be published in several ways
- ScientificName
- ScientificName + Authorship
- Genus + Authorship
- Genus + SpeciesEpitheton (+ Rank + InfraspecificEpitheton) + Authorship

Classifications can be published in several ways
- Normalised via parentNameUsageID
- Normalised via parentNameUsage
- Denormalised via Kingdom, Phylum, Class, Order, Family, Genus
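Broken referential integrity in a classification normalised via parentNameUsageID can be detected with a simple pass over the records. The sketch below assumes each record is a dictionary keyed by the Darwin Core terms `taxonID` and `parentNameUsageID`; the function name `find_broken_parents` and the sample records are made up for illustration.

```python
def find_broken_parents(records):
    """Return the taxonIDs whose parentNameUsageID points at no record."""
    known_ids = {record["taxonID"] for record in records}
    return [
        record["taxonID"]
        for record in records
        # Root records have no parent and are not counted as broken.
        if record.get("parentNameUsageID")
        and record["parentNameUsageID"] not in known_ids
    ]


# Illustrative records: the third one points at a parent that was never published.
sample_records = [
    {"taxonID": "1", "parentNameUsageID": None, "scientificName": "Plantae"},
    {"taxonID": "2", "parentNameUsageID": "1", "scientificName": "Fabaceae"},
    {"taxonID": "3", "parentNameUsageID": "9", "scientificName": "Faboideae"},
]
broken = find_broken_parents(sample_records)
```

A single missing parent detaches the record and every descendant below it from the tree, which is why one bad row damages the whole dataset rather than a single record.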

Checklist Bank Model
- Lexical Group
  - Gerardia paupercula var. borealis (Pennell) Deam
  - Gerardia paupercula (Gray) Britt. var. borealis (Pennell) Deam
  - Gerardia paupercula (A.Gray) Britton var. borealis (Pennell) Deam
  - Gerardia paupercula borealis
  - Gerardia paupercula borealis (Pennell) Deam
- Nomenclatural Group
  - Gerardia paupercula var. borealis (Pennell) Deam
  - Agalinis paupercula var. borealis Pennell
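A rough sketch of how the lexical group above might be computed, assuming that dropping parenthesised authorship, rank markers, and capitalised author tokens is enough to reach a canonical form. The `canonical_name` helper and the `RANK_MARKERS` set are illustrative assumptions, not the ChecklistBank implementation, and the rule would not cope with hyphenated epithets or hybrid formulas.

```python
import re
from collections import defaultdict

# Assumed list of rank markers to ignore; real data needs a fuller vocabulary.
RANK_MARKERS = {"var.", "subsp.", "ssp.", "f.", "forma"}


def canonical_name(name):
    """Reduce a name string to genus plus lowercase epithets."""
    without_parens = re.sub(r"\([^)]*\)", " ", name)  # drop "(Pennell)" etc.
    tokens = without_parens.split()
    genus, rest = tokens[0], tokens[1:]
    epithets = [
        token
        for token in rest
        if token not in RANK_MARKERS and token.isalpha() and token.islower()
    ]
    return " ".join([genus] + epithets)


names = [
    "Gerardia paupercula var. borealis (Pennell) Deam",
    "Gerardia paupercula (Gray) Britt. var. borealis (Pennell) Deam",
    "Gerardia paupercula (A.Gray) Britton var. borealis (Pennell) Deam",
    "Gerardia paupercula borealis",
    "Gerardia paupercula borealis (Pennell) Deam",
    "Agalinis paupercula var. borealis Pennell",
]

lexical_groups = defaultdict(list)
for name in names:
    lexical_groups[canonical_name(name)].append(name)
```

The five Gerardia strings collapse into one lexical group, while the Agalinis name lands in its own. Placing it in the same nomenclatural group as the Gerardia name requires basionym information, which string comparison alone cannot supply.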

Taxonomic Backbone (Nub)
- What it is
- How it is built

Composite Taxonomic Backbone
- Largest integrated taxonomy in the world
- 200 million occurrences
- One taxonomic hierarchy

Nub Relevance
- Nub Management Classification is used to
  - provide a hierarchy of names
  - crosswalk between taxonomies
- All biodiversity data is aligned via names
- Considerable variation in higher taxa
  - => Maps & Statistics
- External linkages, e.g. EOL maps
- More details:

Cronquist classification
- Mimosaceae: 3,200 species
- Caesalpiniaceae: 2,000 species
- Fabaceae: 14,000 species

“Modern” classification
- Fabaceae: 19,200 species
  - Mimosoideae: 3,200 species
  - Cæsalpinioideae: 2,000 species
  - Faboideae: 14,000 species
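The legume example shows why a crosswalk matters for maps and statistics: the same species are counted under three families in one classification and under one family in the other. A minimal sketch of that roll-up follows, using the species counts from the slide; the `CROSSWALK` mapping and the `roll_up` function are assumptions made for illustration.

```python
from collections import Counter

# Species counts per family under the Cronquist classification (from the slide).
CRONQUIST_COUNTS = {
    "Mimosaceae": 3200,
    "Caesalpiniaceae": 2000,
    "Fabaceae": 14000,
}

# Assumed crosswalk: Cronquist family -> (modern family, modern subfamily).
CROSSWALK = {
    "Mimosaceae": ("Fabaceae", "Mimosoideae"),
    "Caesalpiniaceae": ("Fabaceae", "Cæsalpinioideae"),
    "Fabaceae": ("Fabaceae", "Faboideae"),
}


def roll_up(counts, crosswalk):
    """Re-aggregate per-family counts under the target classification."""
    families, subfamilies = Counter(), Counter()
    for old_family, species in counts.items():
        family, subfamily = crosswalk[old_family]
        families[family] += species
        subfamilies[subfamily] += species
    return dict(families), dict(subfamilies)


modern_families, modern_subfamilies = roll_up(CRONQUIST_COUNTS, CROSSWALK)
```

The totals agree with the slide: 3,200 + 2,000 + 14,000 = 19,200 species in the single modern family Fabaceae.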

Nub Components

Nub Building
- Regular Checklist Resource
- Lexical Grouping
  - Canonical homonyms
  - Authorship matching difficult
  - => canonical names + kingdom
  - Ignore noisy occurrence derived only names?
- Nub Assembling
  - 8 CoL kingdoms
  - Each LexGroup becomes a nub usage
  - Contradicting classifications
  - Intermediate rank synonyms
  - Select preferred, wellformed name
  - Stable IDs
  - Rated sources, nomenclatural resources for names, taxonomic for classification

Subphylum in ANIMALIA
- Vertebrata
- Vertebrate
- Vertebrata Cuvier, 1812

Algae genus in PLANTAE
- Vertebrata
- Vertebrata Gray
- Vertebrata S.F. Gray, 1821
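The "canonical names + kingdom" rule can be sketched as a grouping key. Because authorship is hard to match reliably, the key below ignores it and relies on the kingdom to keep the animal subphylum and the algal genus apart. The `canonical_uninomial` helper is a deliberately naive assumption (it takes the first token of the name) and is not the parser used for the nub.

```python
from collections import defaultdict


def canonical_uninomial(name):
    # Assumption: for genus-or-higher names the first token is the canonical name.
    return name.split()[0]


def group_by_canonical_and_kingdom(usages):
    """Group (name, kingdom) pairs under the key (canonical name, kingdom)."""
    groups = defaultdict(list)
    for name, kingdom in usages:
        groups[(canonical_uninomial(name), kingdom)].append(name)
    return dict(groups)


usages = [
    ("Vertebrata", "Animalia"),
    ("Vertebrata Cuvier, 1812", "Animalia"),
    ("Vertebrata Gray", "Plantae"),
    ("Vertebrata S.F. Gray, 1821", "Plantae"),
]
nub_groups = group_by_canonical_and_kingdom(usages)
```

Grouping on the canonical name alone would merge all four strings into one usage; adding the kingdom yields two.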

Nub Building

Admin Console
View the Admin Console

Discovery: Portal and Services

Checklist Bank Portal
- 82 ECAT Resources
- 14 Taxonomic Catalogues
- 64 Vernacular Name lists
- 2 Thematic Lists
- 2 Nomenclators
Go to Portal

Checklist Bank Web Services
- Checklist Service
- Name Usage Resolver
- Name Usage Service
- Name Usage Navigation Service
- Name String Service
- Image Service
Go to API Page

Name Parser
- Uses
- Comparing
- Matching
- GBIF Backbone
- “Did you mean”
Try GBIF Name Parser

Name Recognition Services
View GBIF Name Recognition Tools
- Updated Service
- March 2010
- DWC API
- Uses
- IAIA parsing
- Adding names to metadata
- Checklists from documents

“TaxonTagger” tools
View TaxonTagger
Sample document (Butterfly list)

Using Name Services: Data Entry
Google Docs: Live Example

Taxonomic Indexing
- Mining names from publishing RSS feeds
- IAIA reports
- KNB Knowledge network
- Mapping to Species lists
- “Any red-listed species in this set of IAIA reports.”
- Name Parser API
- TaxonFinder API
- Checklist Bank API
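The red-list question can be illustrated with a toy version of the indexing pipeline: find candidate names in a document, then intersect them with a species list. The regular expression below is a crude stand-in for a name-finding service and makes no call to the TaxonFinder or Checklist Bank APIs; the function name, the sample text, and the one-entry species list are all assumptions made for this sketch.

```python
import re

# Crude pattern for a Latin binomial: capitalised word followed by a lowercase word.
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")


def listed_species_in(text, species_list):
    """Return the names from species_list that occur in the text, sorted."""
    candidates = set(BINOMIAL.findall(text))
    return sorted(candidates & set(species_list))


# Illustrative inputs only.
report_text = "Surveys recorded Panthera tigris and Sus scrofa near the site."
red_list = {"Panthera tigris"}
matches = listed_species_in(report_text, red_list)
```

The pattern also picks up ordinary phrases such as "Surveys recorded", so matching against a curated species list is what filters the candidates down to real names.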

Other 2010/11
- Mapping Services
- Linking a data collection to a specific taxonomic authority
- Taxonomic Validation and Annotation of Occurrence data
- Linking to Community Species Pages