Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions Christian Gendreau, David Shorthouse & Peter Desmet.

Presentation transcript:

Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions Christian Gendreau, David Shorthouse & Peter Desmet

Game plan: Introduction to Canadensys; Data Canadensys; Canadensys processing solutions; Numbers from Canadensys; Hopes and expectations

A network of people and collections

Canadensys headquarters: Université de Montréal Biodiversity Centre

data.canadensys.net/vascan

data.canadensys.net/ipt

data.canadensys.net/explorer

Data quality related activities: from an aggregator perspective

During data entry (actor: data entry person): help to avoid typographical errors; help to convert verbatim data
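As an illustration of converting verbatim data at entry time, here is a minimal Python sketch that turns a verbatim degrees/minutes/seconds string into decimal degrees. It is not the code behind the Canadensys tools; the function name `dms_to_decimal` and the accepted input pattern are assumptions made for this example.

```python
import re

# Hypothetical helper for illustration; not part of the Canadensys tools.
# Accepts strings such as 45°30'00"N or 73°34'W (minutes and seconds optional).
_DMS = re.compile(
    r"""^\s*(?P<deg>\d{1,3})\s*°\s*
        (?:(?P<min>\d{1,2}(?:\.\d+)?)\s*['′]\s*)?
        (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]\s*)?
        (?P<hem>[NSEW])\s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def dms_to_decimal(verbatim):
    """Return decimal degrees for a verbatim DMS string, or None if unusable."""
    match = _DMS.match(verbatim)
    if match is None:
        return None
    degrees = float(match.group("deg"))
    minutes = float(match.group("min") or 0)
    seconds = float(match.group("sec") or 0)
    if minutes >= 60 or seconds >= 60:
        return None
    hemisphere = match.group("hem").upper()
    value = degrees + minutes / 60 + seconds / 3600
    # Latitudes are limited to 90 degrees, longitudes to 180.
    limit = 90 if hemisphere in "NS" else 180
    if value > limit:
        return None
    # South and west are negative by convention.
    return -value if hemisphere in "SW" else value
```

Returning `None` rather than raising keeps the verbatim value untouched, so the data entry person can be prompted to review it.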

Before publication (actor: data publisher; previous activity: data entry): detect file character encoding issues; detect duplicate or missing IDs
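The two pre-publication checks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file is expected to be UTF-8, the identifier column is assumed to be the Darwin Core term `occurrenceID`, and the function names `check_utf8` and `check_ids` are hypothetical.

```python
from collections import Counter


def check_utf8(raw):
    """Return None if the bytes decode as UTF-8, else the offset of the first bad byte."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None


def check_ids(rows, id_field="occurrenceID"):
    """Return (row numbers with a missing ID, sorted list of duplicated IDs).

    rows is any iterable of dict-like records, e.g. a csv.DictReader.
    Row numbers start at 1 and count data rows, not the header.
    """
    missing = []
    counts = Counter()
    for number, row in enumerate(rows, start=1):
        value = (row.get(id_field) or "").strip()
        if value:
            counts[value] += 1
        else:
            missing.append(number)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates
```

A publisher could run both checks on the exported file and fix the source database before the dataset reaches the aggregator.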

During aggregation (actor: data aggregator; previous activity: before publication): process data (validation, cleaning); produce structured reports (quality control)

After aggregation (actor: users and community; previous activity: aggregation): allow and facilitate community feedback; help data publishers to integrate corrections

Canadensys tools during data entry: data.canadensys.net/tools

Why do we process data? To enrich our Explorer; to provide structured reports to data providers; to help identify records that need re-examination; to help improve data entry procedures

Data processing

Processing solutions: narwhals to the rescue (narwhal image: public domain)

The narwhal-processor approach: single-field processing to allow complex processing (combined fields); processors with a common interface ease integration and usage; collaboration
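The idea of single-field processors behind a common interface can be sketched as follows. This is a Python illustration of the pattern only, not the narwhal-processor API: `Result`, `Processor`, `CountryProcessor`, `YearProcessor`, `process_record` and the tiny country lookup table are all hypothetical names and data.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Result:
    """Outcome of processing one field: a cleaned value plus any error messages."""
    value: Optional[object]
    errors: list = field(default_factory=list)


class Processor(Protocol):
    """Common interface: every processor handles exactly one raw field value."""
    def process(self, raw: str) -> Result: ...


class CountryProcessor:
    # Tiny illustrative lookup; a real list would come from a shared resource.
    _KNOWN = {"canada": "CA", "ca": "CA", "états-unis": "US", "united states": "US"}

    def process(self, raw: str) -> Result:
        code = self._KNOWN.get(raw.strip().lower())
        if code is None:
            return Result(None, [f"unrecognized country: {raw!r}"])
        return Result(code)


class YearProcessor:
    def process(self, raw: str) -> Result:
        text = raw.strip()
        if len(text) == 4 and text.isascii() and text.isdigit():
            return Result(int(text))
        return Result(None, [f"unparsable year: {raw!r}"])


def process_record(record, processors):
    """Apply each single-field processor to its field and collect the results."""
    return {name: p.process(record.get(name, "")) for name, p in processors.items()}
```

Because every processor exposes the same `process` method, a caller can combine them freely, and the per-field results can later feed checks that span several fields.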

Data usability before processing

Data usability after processing: 7% of provided country text; 16% of provided state/province text; 4% of provided coordinates; 42% of provided dates

Data usability including processed data

Projects with data quality tools: Atlas of Living Australia; GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL, …; GBIF libraries. Most nodes have their own data quality routine.

Hopes and expectations

We do not want to: maintain taxonomic authority files; maintain country, province and city lists

We prefer to: efficiently use specialized resources/services; provide reports and quality indices

Help from the Semantic Web: data in other languages (French, Spanish, …) should not be flagged as errors; misspellings should be shared as a common resource (e.g. SKOS); understand historical data (e.g. collected in the USSR in 1980)

Reporting and log: Darwin Core annotations for processed data; shared vocabulary for structured reports and quality indices

Summary: tools available for sharing (use, review, contribute); opportunity for broad coordination and increased efficiencies

Thanks: Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal

Contact Gulo gulo, Larry Master (

Multi-field processing

1. Get information on coordinates 45.5, … 2. Compare with processed data 3. Assert that these coordinates are in Montréal
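The three steps can be sketched as a cross-field check in Python. Everything here is an assumption for illustration: the longitude is missing from the transcript, so -73.6 is a stand-in value, the bounding box for Montréal is a rough approximation rather than an authoritative boundary, and the names `BoundingBox` and `coordinates_match_municipality` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat, lon):
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Rough, illustrative box only; a real check would query a gazetteer or GIS service.
BOXES = {"Montréal": BoundingBox(45.40, 45.71, -73.98, -73.47)}


def coordinates_match_municipality(lat, lon, municipality, boxes=BOXES):
    """Return True/False if the point can be checked, None if the place is unknown."""
    box = boxes.get(municipality)
    if box is None:
        return None
    return box.contains(lat, lon)


# Step 1: coordinates from the record (longitude assumed, see above).
# Step 2: municipality taken from the already processed text fields.
# Step 3: assert that the two agree.
agrees = coordinates_match_municipality(45.5, -73.6, "Montréal")
```

A `False` result would flag the record for re-examination, for example when the sign of the longitude was dropped during data entry.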