Published by Silvia Hunter. Modified over 9 years ago.
1
Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions
Christian Gendreau, David Shorthouse & Peter Desmet
2
Game plan
● Introduction to Canadensys
● Data quality @ Canadensys
● Canadensys processing solutions
● Numbers from Canadensys
● Hopes and expectations
3
A Network Of people and collections
4
Canadensys Headquarters: Université de Montréal Biodiversity Centre
5
data.canadensys.net/vascan
6
data.canadensys.net/ipt
7
data.canadensys.net/explorer
8
Data quality-related activities: from an aggregator perspective
9
During data entry
● Actor: data entry person
● Help to avoid typographical errors
● Help to convert verbatim data
10
Before publication
● Actor: data publisher
● Previous activity: data entry
● Detect file character encoding issues
● Detect duplicate or missing IDs
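The two pre-publication checks on this slide can be sketched in a few lines. This is only an illustration, not the Canadensys tooling: it assumes the data is a CSV file with a Darwin Core `occurrenceID` column, and the function names `check_encoding` and `check_ids` are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict, List


def check_encoding(raw: bytes, encoding: str = "utf-8") -> List[str]:
    """Return a list of problems found when decoding the raw file bytes."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        return [f"not valid {encoding} at byte {err.start}"]
    return []


def check_ids(text: str, id_field: str = "occurrenceID") -> Dict[str, list]:
    """Report data rows with a missing ID and ID values used more than once."""
    reader = csv.DictReader(io.StringIO(text))
    missing_rows = []
    counts = Counter()
    # Row numbers count data rows only; the header row is not included.
    for row_number, row in enumerate(reader, start=1):
        value = (row.get(id_field) or "").strip()
        if not value:
            missing_rows.append(row_number)
        else:
            counts[value] += 1
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return {"missing_rows": missing_rows, "duplicate_ids": duplicates}
```

A publisher could run both checks on a file before uploading it and fix the reported rows at the source.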
11
During aggregation
● Actor: data aggregator
● Previous activity: before publication
● Process data: validation, cleaning
● Produce structured reports: quality control
12
After aggregation
● Actor: users and community
● Previous activity: aggregation
● Allow and facilitate community feedback
● Help data publishers to integrate corrections
13
Canadensys tools during data entry data.canadensys.net/tools
14
Why do we process data?
● Enrich our Explorer, http://data.canadensys.net
● Provide structured reports to data providers
● Help identify records that need re-examination
● Help to improve data entry procedures
15
Data processing
16
Processing solutions: narwhals to the rescue (narwhal image: public domain)
17
The narwhal-processor approach
● Single field processing to allow complex processing (combined fields)
● Processors with common interface ease integration and usage
● Collaboration
https://github.com/Canadensys/narwhal-processor
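To make the "common interface" idea concrete, here is a minimal sketch of single-field processors that share one method. It is not the narwhal-processor API (that library is a separate project, linked above); every name below (`Result`, `Processor`, `CountryProcessor`, `LatitudeProcessor`, `process_record`) and the tiny country lookup are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass
class Result:
    raw: str                        # the value as provided
    value: Optional[object] = None  # the processed value, if any
    error: Optional[str] = None     # why processing failed, if it did


class Processor(Protocol):
    """Common interface: every processor handles one field the same way."""
    def process(self, raw: str) -> Result: ...


class CountryProcessor:
    # Hypothetical lookup; a real one would use a shared resource.
    _KNOWN = {"canada": "CA", "ca": "CA", "united states": "US", "usa": "US"}

    def process(self, raw: str) -> Result:
        code = self._KNOWN.get(raw.strip().lower())
        if code is None:
            return Result(raw, error="unknown country")
        return Result(raw, value=code)


class LatitudeProcessor:
    def process(self, raw: str) -> Result:
        try:
            latitude = float(raw.strip())
        except ValueError:
            return Result(raw, error="not a number")
        if not -90 <= latitude <= 90:
            return Result(raw, error="out of range")
        return Result(raw, value=latitude)


def process_record(record: Dict[str, str],
                   processors: Dict[str, Processor]) -> Dict[str, Result]:
    """Run each field through its processor; fields stay independent."""
    return {name: proc.process(record.get(name) or "")
            for name, proc in processors.items()}
```

Because each processor only sees one field and returns the same `Result` shape, combined-field checks can be layered on top of the per-field results without changing the processors themselves.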
18
Data usability before processing
19
Data usability after processing 7% of provided country text
20
Data usability after processing
● 7% of provided country text
● 16% of provided state/province text
21
Data usability after processing
● 7% of provided country text
● 16% of provided state/province text
● 4% of provided coordinates
22
Data usability after processing
● 7% of provided country text
● 16% of provided state/province text
● 4% of provided coordinates
● 42% of provided dates
23
Data usability including processed data
24
Projects with data quality tools
● Atlas of Living Australia
● GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
● GBIF libraries
● Most nodes have their own data quality routine
25
Hopes and expectations
26
We do not want to
● Maintain taxonomic authority files
● Maintain country, province and city lists
27
We prefer to
● Efficiently use specialized resources/services
● Provide reports and quality indices
28
Help from the Semantic Web
● Data in other languages (French, Spanish, …) should not be flagged as errors
● Misspellings should be shared as a common resource (e.g. SKOS)
● Understand historical data (e.g. collected in the USSR in 1980)
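One way to picture the first two points: SKOS distinguishes a preferred label (`skos:prefLabel`), alternative labels (`skos:altLabel`) and hidden labels (`skos:hiddenLabel`, commonly used for misspellings). The sketch below mimics that structure with plain Python objects rather than real RDF; the `Concept` class, the `match_label` function and the sample labels are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Concept:
    pref_label: str                      # plays the role of skos:prefLabel
    alt_labels: Tuple[str, ...] = ()     # skos:altLabel, e.g. other languages
    hidden_labels: Tuple[str, ...] = ()  # skos:hiddenLabel, e.g. misspellings


# Illustrative sample data, not a shared vocabulary.
UNITED_STATES = Concept(
    pref_label="United States",
    alt_labels=("États-Unis", "Estados Unidos"),
    hidden_labels=("Untied States", "Unted States"),
)


def match_label(text: str,
                concepts: Iterable[Concept]) -> Optional[Tuple[str, str]]:
    """Return (preferred label, kind of label matched), or None if unknown."""
    needle = text.strip().casefold()
    for concept in concepts:
        groups = (
            ("prefLabel", (concept.pref_label,)),
            ("altLabel", concept.alt_labels),
            ("hiddenLabel", concept.hidden_labels),
        )
        for kind, labels in groups:
            if any(needle == label.casefold() for label in labels):
                return concept.pref_label, kind
    return None
```

With this shape, a value matched through `altLabel` is valid data in another language and needs no error flag, while a `hiddenLabel` match is a known misspelling that can be corrected and reported.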
29
Reporting and logging
● Darwin Core annotations for processed data
● Shared vocabulary for structured reports and quality indices
30
Summary
● Tools available for sharing
● Use, review, contribute
● Opportunity for broad coordination and increased efficiencies
31
Thanks Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
32
Contact http://www.canadensys.net http://github.com/Canadensys @Canadensys Gulo gulo, Larry Master (www.masterimages.org)
33
Multi-field processing
34
1. Get information on coordinates 45.5, -73.5666667
2. Compare with processed data
3. Assert that these coordinates are in Montréal
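The three steps above can be sketched as a small multi-field check. This is a simplified illustration only: it assumes a hypothetical in-memory gazetteer of bounding boxes, and the box given for Montréal is a rough approximation chosen for the example, not an authoritative boundary. The names `BoundingBox`, `GAZETTEER`, `places_at` and `coordinates_match_place` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Rough, illustrative box only; a real check would query a gazetteer service.
GAZETTEER: Dict[str, BoundingBox] = {
    "Montréal": BoundingBox(45.40, 45.71, -73.98, -73.47),
}


def places_at(lat: float, lon: float,
              gazetteer: Dict[str, BoundingBox]) -> List[str]:
    """Step 1: get information on the coordinates."""
    return [name for name, box in gazetteer.items() if box.contains(lat, lon)]


def coordinates_match_place(lat: float, lon: float, place: str,
                            gazetteer: Dict[str, BoundingBox]) -> bool:
    """Steps 2 and 3: compare with the processed place name and assert."""
    return place in places_at(lat, lon, gazetteer)
```

For the coordinates on the slide, `coordinates_match_place(45.5, -73.5666667, "Montréal", GAZETTEER)` evaluates to `True` under this assumed box; a record whose coordinates fall elsewhere would be a candidate for re-examination.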