Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions Christian Gendreau, David Shorthouse & Peter Desmet.


1 Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions Christian Gendreau, David Shorthouse & Peter Desmet

2 Game plan Introduction to Canadensys Data quality @ Canadensys Canadensys processing solutions Numbers from Canadensys Hopes and expectations

3 A network of people and collections

4 Canadensys Headquarters Université de Montréal Biodiversity Centre

5 data.canadensys.net/vascan

6 data.canadensys.net/ipt

7 data.canadensys.net/explorer

8 Data quality related activities From an aggregator perspective

9 During data entry (actor: data entry person): help avoid typographical errors; help convert verbatim data
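
As an illustration of "converting verbatim data", here is a minimal Python sketch that turns a verbatim degrees/minutes/seconds coordinate into decimal degrees. The function name and the accepted input pattern are assumptions made for this example; it does not reproduce the Canadensys tools.

```python
import re

# Assumed input pattern for this sketch: degrees, optional minutes,
# optional seconds, then a hemisphere letter, e.g. 45°30'0"N.
_DMS = re.compile(
    r"""^\s*(\d{1,3})\s*°\s*(?:(\d{1,2})\s*['′]\s*)?(?:(\d{1,2}(?:\.\d+)?)\s*["″]\s*)?([NSEW])\s*$""",
    re.IGNORECASE,
)


def dms_to_decimal(text):
    """Return decimal degrees for a verbatim DMS string, or None if unparseable."""
    match = _DMS.match(text)
    if match is None:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    minutes = int(minutes or 0)
    seconds = float(seconds or 0)
    if minutes >= 60 or seconds >= 60:
        return None
    value = int(degrees) + minutes / 60 + seconds / 3600
    hemisphere = hemisphere.upper()
    # Latitudes are limited to 90 degrees, longitudes to 180.
    if value > (90 if hemisphere in "NS" else 180):
        return None
    # South and west are negative in decimal notation.
    return -value if hemisphere in "SW" else value
```

Returning None instead of raising keeps the verbatim value available for a later report, so a data entry person can see which strings could not be converted.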

10 Before publication (actor: data publisher; previous activity: data entry): detect file character encoding issues; detect duplicate or missing IDs
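
The two pre-publication checks can be sketched in a few lines of Python. This is an assumption-laden example rather than the Canadensys implementation: the function names are hypothetical, records are assumed to be dictionaries keyed by Darwin Core terms, and "encoding issue" is reduced to "the file is not valid UTF-8".

```python
from collections import Counter


def check_ids(records, id_field="occurrenceID"):
    """Report row indexes with a missing ID and the IDs used more than once."""
    missing, seen = [], Counter()
    for row, record in enumerate(records):
        value = (record.get(id_field) or "").strip()
        if value:
            seen[value] += 1
        else:
            missing.append(row)
    duplicates = sorted(value for value, count in seen.items() if count > 1)
    return {"missing_rows": missing, "duplicate_ids": duplicates}


def is_valid_utf8(raw_bytes):
    """Return True if the raw file content decodes cleanly as UTF-8."""
    try:
        raw_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A publisher could run both checks on an export file before uploading it, and fix the source database rather than the export.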

11 During aggregation (actor: data aggregator; previous activity: before publication): process data (validation, cleaning); produce structured reports (quality control)

12 After aggregation (actor: users and community; previous activity: aggregation): allow and facilitate community feedback; help data publishers integrate corrections

13 Canadensys tools during data entry data.canadensys.net/tools

14 Why do we process data? Enrich our Explorer, http://data.canadensys.net; provide structured reports to data providers; help identify records that need re-examination; help improve data entry procedures

15 Data processing

16 Processing solutions: narwhals to the rescue (narwhal image: public domain)

17 The narwhal-processor approach ● Single field processing to allow complex processing (combined fields) ● Processors with common interface ease integration and usage ● Collaboration https://github.com/Canadensys/narwhal-processor
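
The idea of single-field processors sharing a common interface can be sketched as follows. This is a Python illustration only: the class names, the `process` method, and the `Result` structure are hypothetical and do not reproduce the narwhal-processor API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Outcome of processing one field: the raw text plus a value or an error."""
    raw: str
    value: Optional[float] = None
    error: Optional[str] = None


class FieldProcessor(ABC):
    """Common interface (hypothetical): every processor handles one raw field."""

    @abstractmethod
    def process(self, raw: str) -> Result:
        ...


class RangeProcessor(FieldProcessor):
    """Parses a number and checks that it lies within [low, high]."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def process(self, raw):
        try:
            number = float(raw.strip())
        except ValueError:
            return Result(raw, error="not a number")
        if not self.low <= number <= self.high:
            return Result(raw, error="out of range")
        return Result(raw, value=number)


# Field names are Darwin Core terms; the mapping itself is an assumption.
PROCESSORS = {
    "decimalLatitude": RangeProcessor(-90, 90),
    "decimalLongitude": RangeProcessor(-180, 180),
}


def process_record(record, processors=PROCESSORS):
    """Apply each single-field processor to its field of the record."""
    return {field: p.process(record.get(field, "")) for field, p in processors.items()}
```

Because every processor returns the same `Result` shape, a combined (multi-field) check can consume the outputs of several single-field processors without knowing how each one works.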

18 Data usability before processing

19 Data usability after processing 7% of provided country text

20 Data usability after processing: 7% of provided country text; 16% of provided state/province text

21 Data usability after processing: 7% of provided country text; 16% of provided state/province text; 4% of provided coordinates

22 Data usability after processing: 7% of provided country text; 16% of provided state/province text; 4% of provided coordinates; 42% of provided dates

23 Data usability including processed data

24 Projects with data quality tools: Atlas of Living Australia; GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …; GBIF libraries. Most nodes have their own data quality routine

25 Hopes and expectations

26 We do not want to: maintain taxonomic authority files; maintain country, province and city lists

27 We prefer to: efficiently use specialized resources/services; provide reports and quality indices

28 Help from Semantic Web: data in other languages (French, Spanish, …) should not be flagged as errors; misspellings should be shared as a common resource (e.g. SKOS); understand historical data (e.g. collected in USSR in 1980)
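
One way to picture a shared label resource is a small lookup modelled loosely on SKOS preferred, alternative, and hidden labels. The Python sketch below is an assumption for illustration: the vocabulary is made-up sample data, the function names are hypothetical, and a real service would load labels from a shared source instead of a hard-coded dictionary.

```python
import unicodedata

# Made-up sample vocabulary keyed by ISO 3166-1 alpha-2 code.
# "alternative" holds labels in other languages, "misspelling" known typos.
VOCABULARY = {
    "US": {
        "preferred": ["United States"],
        "alternative": ["États-Unis", "Estados Unidos"],
        "misspelling": ["Untied States"],
    },
    "CA": {
        "preferred": ["Canada"],
        "alternative": ["Canadá"],
        "misspelling": ["Cnada"],
    },
}


def normalize(text):
    """Strip accents, fold case, and collapse whitespace for matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.casefold().split())


def build_index(vocabulary):
    """Map each normalized label to (code, kind); preferred labels win on ties."""
    index = {}
    for kind in ("misspelling", "alternative", "preferred"):
        for code, labels in vocabulary.items():
            for label in labels.get(kind, []):
                index[normalize(label)] = (code, kind)
    return index


INDEX = build_index(VOCABULARY)


def lookup(text):
    """Return (code, kind) for a verbatim country string, or None if unknown."""
    return INDEX.get(normalize(text))
```

Keeping the label kind in the result lets a report distinguish a valid French or Spanish name (not an error) from a known misspelling (worth correcting at the source).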

29 Reporting and log: Darwin Core annotations for processed data; shared vocabulary for structured reports and quality indices

30 Summary: tools available for sharing; use, review, contribute; opportunity for broad coordination and increased efficiencies

31 Thanks Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal

32 Contact http://www.canadensys.net http://github.com/Canadensys @Canadensys Gulo gulo, Larry Master (www.masterimages.org)

33 Multi-field processing

34 1. Get information on coordinates 45.5, -73.5666667 2. Compare with processed data 3. Assert that these coordinates are in Montréal
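
The three steps on this slide can be sketched as a multi-field check in Python. The slide does not say how the coordinates are resolved to a place, so this example assumes a simple bounding-box lookup; the box values are rough illustrative numbers, not an authoritative boundary, and the function name is hypothetical.

```python
# Approximate bounding boxes as (min_lat, max_lat, min_lon, max_lon).
# Assumed, illustrative values only.
BOUNDING_BOXES = {
    "montréal": (45.40, 45.71, -73.98, -73.47),
}


def coordinates_match_place(latitude, longitude, place):
    """Return True/False if the point is inside the place's box, None if unknown."""
    box = BOUNDING_BOXES.get(place.strip().casefold())
    if box is None:
        return None  # no reference data: cannot assert either way
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= latitude <= max_lat and min_lon <= longitude <= max_lon
```

For the slide's example, `coordinates_match_place(45.5, -73.5666667, "Montréal")` returns True under these assumed bounds. A None result means the check could not be made, which is different from a mismatch and should be reported separately.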

