Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions Christian Gendreau, David Shorthouse & Peter Desmet
Game plan Introduction to Canadensys Data Canadensys Canadensys processing solutions Numbers from Canadensys Hopes and expectations
A Network Of people and collections
Canadensys Headquarters Université de Montréal Biodiversity Centre
data.canadensys.net/vascan
data.canadensys.net/ipt
data.canadensys.net/explorer
Data quality related activities From an aggregator perspective
During data entry (actor: data entry person): help avoid typographical errors; help convert verbatim data
Before publication (actor: data publisher; previous activity: data entry): detect file character encoding issues; detect duplicate or missing IDs
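The two pre-publication checks named on this slide can be sketched in a few lines. This is only an illustration, not the Canadensys tooling: the function names are hypothetical, the file is assumed to be a CSV with a header row, and the identifier column is assumed to be the Darwin Core term `occurrenceID`.

```python
import csv
import io
from collections import Counter


def check_encoding(raw: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_ids(text: str, id_column: str = "occurrenceID"):
    """Return (duplicate_ids, rows_missing_id) for CSV text with a header row.

    The column name is an assumption; pass another one if the file differs.
    """
    reader = csv.DictReader(io.StringIO(text))
    counts = Counter()
    missing_rows = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        value = (row.get(id_column) or "").strip()
        if value:
            counts[value] += 1
        else:
            missing_rows.append(row_number)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return duplicates, missing_rows
```

A publisher could run `check_encoding` on the raw file bytes first, then `check_ids` on the decoded text, and fix any reported rows before handing the file to the aggregator.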
During aggregation (actor: data aggregator; previous activity: before publication): process data (validation, cleaning); produce structured reports (quality control)
After aggregation (actor: users and community; previous activity: aggregation): allow and facilitate community feedback; help data publishers integrate corrections
Canadensys tools during data entry data.canadensys.net/tools
Why do we process data? Enrich our Explorer; provide structured reports to data providers; help identify records that need re-examination; help improve data entry procedures
Data processing
Processing solutions Narwhals to the rescue Narwhal image Public Domain
The narwhal-processor approach
● Single-field processing to allow complex processing (combined fields)
● Processors with a common interface ease integration and usage
● Collaboration
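The idea of single-field processors behind a common interface can be sketched as follows. This is not the narwhal-processor API; the class names, the result structure, and the tiny country lookup are all hypothetical and exist only to show the shape of the approach.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessingResult:
    """Outcome of processing one field: raw input, cleaned value, optional error."""
    raw: Optional[str]
    value: Optional[str]
    error: Optional[str] = None


class FieldProcessor(ABC):
    """Common interface (hypothetical): every processor handles a single field."""

    @abstractmethod
    def process(self, raw: Optional[str]) -> ProcessingResult:
        ...


class CountryProcessor(FieldProcessor):
    # Illustrative lookup only; a real processor would use a maintained resource.
    _KNOWN = {
        "canada": "Canada",
        "ca": "Canada",
        "united states": "United States",
        "usa": "United States",
    }

    def process(self, raw: Optional[str]) -> ProcessingResult:
        if raw is None or not raw.strip():
            return ProcessingResult(raw, None, "empty value")
        match = self._KNOWN.get(raw.strip().lower())
        if match is None:
            return ProcessingResult(raw, None, "unknown country")
        return ProcessingResult(raw, match)
```

Because every processor exposes the same `process` method and returns the same result type, a caller can chain several of them over a record without knowing which field each one handles.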
Data usability before processing
Data usability after processing: 7% of provided country text; 16% of provided state/province text; 4% of provided coordinates; 42% of provided dates
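The slides do not describe how dates were processed. As one hedged illustration of the kind of step that can turn verbatim date text into usable values, the sketch below tries a short list of formats; the function name and the format list are assumptions, not the Canadensys implementation.

```python
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical set of accepted layouts; real data would need many more.
_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d %B %Y", "%B %d, %Y")


def parse_verbatim_date(raw: str) -> Optional[Tuple[int, int, int]]:
    """Return (year, month, day) if the text matches a known layout, else None."""
    text = raw.strip()
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next layout
        return parsed.year, parsed.month, parsed.day
    return None
```

Values that return `None` are the ones that would stay unusable and could be listed in a report for the data provider.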
Data usability including processed data
Projects with data quality tools: Atlas of Living Australia, GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL, … GBIF libraries. Most nodes have their own data quality routine.
Hopes and expectations
We do not want to: maintain taxonomic authority files; maintain country, province and city lists
We prefer to: efficiently use specialized resources/services; provide reports and quality indices
Help from the Semantic Web: data in other languages (French, Spanish, …) should not be flagged as errors; misspellings should be shared as a common resource (e.g. SKOS); understand historical data (e.g. collected in the USSR in 1980)
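SKOS distinguishes preferred labels, alternate labels (which can carry other languages), and hidden labels (commonly used for misspellings). A minimal sketch of how a shared resource of that shape could be used for lookup is given below; the concept entries, the dictionary layout, and the function names are hypothetical, and no real vocabulary is being quoted.

```python
# Hypothetical concepts shaped after SKOS label types.
CONCEPTS = [
    {
        "prefLabel": "Canada",
        "altLabel": {"fr": "Canada", "es": "Canadá"},
        "hiddenLabel": ["Cananda", "Canda"],
    },
    {
        "prefLabel": "Germany",
        "altLabel": {"fr": "Allemagne", "es": "Alemania", "de": "Deutschland"},
        "hiddenLabel": ["Germnay"],
    },
]


def build_index(concepts):
    """Map every known label (case-insensitive) to (preferred label, label kind)."""
    index = {}
    for concept in concepts:
        pref = concept["prefLabel"]
        index[pref.casefold()] = (pref, "preferred")
        for label in concept["altLabel"].values():
            index.setdefault(label.casefold(), (pref, "alternate"))
        for label in concept["hiddenLabel"]:
            index.setdefault(label.casefold(), (pref, "misspelling"))
    return index


def resolve(text, index):
    """Return (preferred label, label kind), or None if the text is unknown."""
    return index.get(text.strip().casefold())
```

With such an index, "Allemagne" resolves as a valid alternate label rather than an error, while "Cananda" resolves to "Canada" and can be reported as a known misspelling.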
Reporting and logs: Darwin Core annotations for processed data; shared vocabulary for structured reports and quality indices
Summary: tools available for sharing; use, review, contribute; opportunity for broad coordination and increased efficiencies
Thanks Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
Contact Gulo gulo, Larry Master (
Multi-field processing
1. Get information on coordinates 45.5,
2. Compare with processed data
3. Assert that these coordinates are in Montréal
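The three steps above can be sketched as a cross-check between the coordinates and an already processed municipality value. Everything specific here is an assumption: the bounding box is a rough illustrative rectangle, the longitude used in the usage note is not given in the slides, and a real check would use a proper gazetteer or geocoding service.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Rough, hypothetical box around Montréal for illustration only.
BOXES: Dict[str, BoundingBox] = {
    "Montréal": BoundingBox(45.40, 45.71, -73.98, -73.47),
}


def check_municipality(lat: float, lon: float, municipality: str,
                       boxes: Dict[str, BoundingBox] = BOXES) -> str:
    """Compare coordinates with the processed municipality value."""
    box = boxes.get(municipality)
    if box is None:
        return "unknown municipality"
    return "consistent" if box.contains(lat, lon) else "mismatch"
```

For example, `check_municipality(45.5, -73.57, "Montréal")` (with an assumed longitude) reports the record as consistent, while the same municipality paired with coordinates far outside the box would be flagged for re-examination.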