Using Kurator Tools for Data Quality and Cleaning Biodiversity Data

Presentation transcript:

Using Kurator Tools for Data Quality and Cleaning Biodiversity Data. Tracy Barbaro (tbarbaro@oeb.harvard.edu), Harvard Museum of Comparative Zoology, Kurator/Encyclopedia of Life. HTTP://BIODIVERSITYLITERACY.COM

Purpose: Kurator tools and validators help you check, clean and standardize biodiversity data before doing statistical analysis. Kurator operates on Darwin Core terms, the standard used for sharing and reusing Natural History Collections data and other biodiversity data such as observations.

How Kurator Works: Kurator checks data sets for internal consistency and validity; checks data sets against external authority resources (e.g. Global Biodiversity Information Facility or GBIF); and identifies potential problems and proposes corrections that you may apply to your data.

What is Darwin Core? Darwin Core is a set of data standards and terms that allow for easier sharing, publication and use of occurrence data and specimen data. Darwin Core data are typically used in a spreadsheet format and are often packaged as a “Darwin Core Archive” (DwC-A). The standard defines “fields” that help identify data. Examples of Darwin Core fields: scientificName, basisOfRecord, stateProvince, islandGroup, locality, eventDate, year, month, day, decimalLatitude, decimalLongitude.
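As a small illustration of what working "in Darwin Core terms" means, the sketch below checks a CSV header against the example fields listed on this slide. It is not part of Kurator; the function name `unrecognized_headers` and the sample data are hypothetical, and the term list is only the slide's subset, not the full standard.

```python
import csv
import io

# Only the example terms from the slide, spelled as in the Darwin Core standard.
SLIDE_TERMS = {
    "scientificName", "basisOfRecord", "stateProvince", "islandGroup",
    "locality", "eventDate", "year", "month", "day",
    "decimalLatitude", "decimalLongitude",
}

def unrecognized_headers(csv_text):
    """Return header names that are not among the slide's Darwin Core terms."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name for name in header if name not in SLIDE_TERMS]

# Hypothetical sample file: "Locality" has the wrong case, "long" is not a term.
sample = (
    "scientificName,Locality,decimalLatitude,long\n"
    "Passerina ciris,Texas,30.1,-97.7\n"
)
print(unrecognized_headers(sample))
```

Darwin Core term names are case-sensitive, so the check above treats "Locality" as different from "locality".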

Uses of Collections Data - Examples: Examine trends in infectious diseases such as West Nile virus and malaria using mosquito specimens. Measure mercury levels in fish specimens to determine ecosystem health. Examine bird specimens for clues about evolution, food webs and ecosystem health for conservation purposes.
References: Peterson, A. Townsend, Adolfo G. Navarro-Sigüenza, and Hesiquio Benítez-Díaz. "The need for continued scientific collecting; a geographic analysis of Mexican bird specimens." Ibis 140.2 (1998): 288-294. Suarez, Andrew V., and Neil D. Tsutsui. "The value of museum collections for research and society." BioScience 54.1 (2004): 66-74.
Image credits: Bird collections: By Peter J. Park - Losos JB, Arnold SJ, Bejerano G, Brodie ED III, Hibbett D, et al. (2013) Evolutionary Biology for the 21st Century. PLoS Biol 11(1): e1001466. doi:10.1371/journal.pbio.1001466, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=23745509 Mosquito: Natural History Museum London http://eol.org/data_objects/12486754 Fishes: CC BY https://commons.wikimedia.org/wiki/File:CSIRO_ScienceImage_6049_Alastair_Graham_Fish_Collection_Manager_Australian_National_Fish_Collection_examining_a_specimen_in_one_of_the_15000_or_so_jars_that_are_stored_in_the_collection.jpg

Common Errors in Data: Collections data can be “all over the map”. Typical problems are misspelled names; inconsistent dates; incorrect or transposed geographic information; duplicate records; and the same name used for unrelated geographic locations. Image Source: Kurator: Towards Data Curation Workflows for Mere Mortals, B. Ludäscher, J. Hanken, D. Lowery, J.A. Macklin, T. McPhillips, P.J. Morris, R.A. Morris, T. Song
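To make one of these error types concrete, here is a minimal sketch of duplicate-record detection. It is not how Kurator works internally; the function name `find_duplicates`, the choice of key fields, and the sample records are all assumptions made for illustration.

```python
def find_duplicates(records, keys):
    """Return indices of records whose normalized key values repeat an earlier record."""
    seen = set()
    duplicates = []
    for index, record in enumerate(records):
        # Normalize by collapsing whitespace and ignoring case.
        signature = tuple(
            " ".join(str(record.get(key, "")).split()).lower() for key in keys
        )
        if signature in seen:
            duplicates.append(index)
        else:
            seen.add(signature)
    return duplicates

# Hypothetical records: the second differs from the first only in case and spacing.
records = [
    {"scientificName": "Passerina ciris", "eventDate": "1998-05-02", "locality": "Austin"},
    {"scientificName": "passerina  ciris", "eventDate": "1998-05-02", "locality": "Austin "},
    {"scientificName": "Passerina ciris", "eventDate": "1998-06-11", "locality": "Austin"},
]
print(find_duplicates(records, ["scientificName", "eventDate", "locality"]))
```

Normalizing before comparing matters because duplicates in aggregated collections data often differ only in capitalization or stray spaces.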

Finding Collections and Biodiversity Data: You can access Natural History Collections and biodiversity data through iDigBio, Dryad, VertNet, Global Biodiversity Information Facility (GBIF), Encyclopedia of Life’s (EOL) TraitBank and other open data portals. These data include specimens, human observations, machine observations, and others.

LOTS of data to sort through

Data Fit for (re)Use. Our task: assess the quality of specimen data in datasets and “clean” the data before analyzing it. How? Data standards, plus data quality checks and cleaning. Figure: diagram of chi-square test predictions of seasonal changes in Painted Bunting specimen record densities. (Source: Linck, E., Bridge, E. S., Duckles, J. M., Navarro-Sigüenza, A. G., & Rohwer, S. (2016). Assessing migration patterns in Passerina ciris using the world’s bird collections as an aggregated resource. PeerJ, 4, e1871. doi:10.7717/peerj.1871. CC BY.)

Kurator Data Quality Tools: Standardization (Darwinizer Tool), Georeference Validator, Date Validator, Field Value Counter, Data File Aggregator.

Kurator Tools or “Workflows” http://kurator.acis.ufl.edu/kurator-web/

Standardization -- Darwinizer Tool: This tool checks your dataset for column headings that are not Darwin Core terms and replaces them with Darwin Core terms. Run all of your data files (.csv) through the Darwinizer Tool first to standardize data fields before using the other Kurator tools. Save the new file as “filename_darwinized”.
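The idea behind this step can be sketched as a simple header rename. This is not the Darwinizer's actual implementation; the mapping `HEADER_MAP`, the function `darwinize_headers`, and the sample data are hypothetical, and a real tool would use a much larger vocabulary of known synonyms.

```python
import csv
import io

# Hypothetical mapping from local column names to Darwin Core terms.
HEADER_MAP = {
    "species": "dwc:scientificName",
    "lat": "dwc:decimalLatitude",
    "lon": "dwc:decimalLongitude",
    "state": "dwc:stateProvince",
    "collection date": "dwc:eventDate",
}

def darwinize_headers(csv_text, header_map=HEADER_MAP):
    """Rename known column headings; leave unknown headings and all data rows as they are."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    rows[0] = [header_map.get(name.strip().lower(), name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

# Hypothetical input: "Collector" has no mapping, so it is kept unchanged.
raw = "Species,Lat,Lon,Collector\nSorex monticolus,57.05,-135.33,J. Doe\n"
print(darwinize_headers(raw))
```

Only the header row changes; the records themselves are untouched, which is why this step can safely come before any validation.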

Download Results

Data before “Darwinizing”. Data after “Darwinizing”: headings are replaced with Darwin Core terms (e.g. “dwc:day”). Now your file is ready to be used in any of the Kurator Tools.

Can you Darwinize your own data, and then run Data Quality tests with Kurator? YES. Example: plant data from Dryad, before. Example: plant data from Dryad, after being Darwinized and ready for the Kurator tools.

Example: Island Biogeography Module. Determine the number of mammal species collected on each of the selected islands in the Alexander Archipelago in Alaska to examine species richness. Data Quality and Cleaning: (1) use the Kurator “Darwinizer” to standardize data fields to Darwin Core terms; (2) use the Kurator Georeference Validator to validate georeferenced data; (3) use the Kurator Field Value Counter to weed out any records that are not mammals. Image Credit: USGS Geosciences and Environmental Change Science Center, https://gec.cr.usgs.gov/archive/alaska/alexArchipelago.html

Georeference Validation: The Georeference Validator checks georeferenced data to confirm that latitude and longitude coordinates fall within the stated country; coordinates are within range; the state or province is consistent with the coordinates; and the water body is valid. This validator creates a data quality report listing the errors found and the amendments made.
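Two of these checks, range and location consistency, can be sketched in a few lines. This is an illustration only, not the Georeference Validator's code: the function `check_coordinates` is hypothetical, and a rectangular bounding box stands in for a real country or state boundary, which is a strong simplification.

```python
def check_coordinates(lat, lon, bbox=None):
    """Return a list of problem descriptions; an empty list means none were found.

    bbox is an optional (min_lat, max_lat, min_lon, max_lon) tuple standing in
    for a country or state boundary (a hypothetical simplification).
    """
    problems = []
    if not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")
    if problems or bbox is None:
        return problems
    min_lat, max_lat, min_lon, max_lon = bbox
    inside = min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
    if not inside:
        # If swapping latitude and longitude lands inside the box,
        # the two values were probably transposed.
        swapped_inside = min_lat <= lon <= max_lat and min_lon <= lat <= max_lon
        problems.append("possibly transposed" if swapped_inside else "outside expected area")
    return problems

BOX = (40.0, 45.0, -80.0, -70.0)  # arbitrary illustrative box, not a real boundary

print(check_coordinates(42.0, -75.0, BOX))   # inside the box
print(check_coordinates(-75.0, 42.0, BOX))   # latitude and longitude swapped
print(check_coordinates(95.0, 10.0))         # latitude cannot exceed 90
```

The sketch only reports problems; proposing an amendment (for example, swapping the values back) would be a separate step that a person should review.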

Georeference Validator Results Example

Field Value Counter Tool: The Field Value Counter creates a report of how many times each value occurs in each of the Darwin Core fields you list. For example, you can search for counts of locations or taxonomic ranks in your data set to find data records that are not relevant to your research. Enter Darwin Core terms in the Field List. The auto search will provide suggested terms.

Field Counter: Review Results. Review the count report and then make manual changes in your data. You can do this sorting in Excel instead, but the benefit of using the Field Counter is that you can automate “error finding” and get accurate counts of records. Example report heading: “Taxon|class count”.
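The kind of report described here can be approximated with a value count over one field. This is a sketch, not Kurator's implementation; the function `count_field_values`, the "(empty)" label, and the sample records are assumptions.

```python
from collections import Counter

def count_field_values(records, field):
    """Count how often each value of `field` occurs; blank values are grouped as '(empty)'."""
    return Counter(
        (record.get(field) or "").strip() or "(empty)" for record in records
    )

# Hypothetical records using the Darwin Core terms scientificName and class.
records = [
    {"scientificName": "Sorex monticolus", "class": "Mammalia"},
    {"scientificName": "Ursus arctos", "class": "Mammalia"},
    {"scientificName": "Passerina ciris", "class": "Aves"},
    {"scientificName": "Unknown sp.", "class": ""},
]

counts = count_field_values(records, "class")
for value, count in counts.most_common():
    print(value, count)

# After reviewing the counts, keep only the records relevant to the study.
mammals = [record for record in records if record.get("class") == "Mammalia"]
print(len(mammals), "mammal records kept")
```

Counting blanks explicitly is useful because records with a missing class would otherwise silently drop out of a filter on "Mammalia".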

Data File Aggregator: The Aggregator tool combines two data set files into one file with Darwin Core headings. Example: specimen data from an iDigBio data file combined with observation data from a Global Biodiversity Information Facility (GBIF) data file.

Once aggregated, you can sort by type of record, date, family, geographic fields, etc., and can input the .csv file into other Kurator Tools (e.g. the Date Validator).
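Combining two files that share some, but not all, Darwin Core headings can be sketched as follows. This is not the Aggregator's actual code; the function names `read_csv` and `aggregate` and the two sample files are hypothetical, and both inputs are assumed to be already Darwinized.

```python
import csv
import io

def read_csv(csv_text):
    """Return (header names, list of row dicts) for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return list(reader.fieldnames or []), rows

def aggregate(csv_a, csv_b):
    """Combine two CSV files into one, using the union of their headings."""
    fields_a, rows_a = read_csv(csv_a)
    fields_b, rows_b = read_csv(csv_b)
    # Keep the first file's column order, then append headings only the second file has.
    fieldnames = fields_a + [name for name in fields_b if name not in fields_a]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows_a + rows_b)
    return out.getvalue()

# Hypothetical inputs: a specimen file and an observation file.
specimens = (
    "scientificName,basisOfRecord,eventDate\n"
    "Sorex monticolus,PreservedSpecimen,1985-07-14\n"
)
observations = (
    "scientificName,basisOfRecord,decimalLatitude\n"
    "Ursus arctos,HumanObservation,57.3\n"
)
print(aggregate(specimens, observations))
```

Fields missing from one source are left blank rather than dropped, so the combined file can still be sorted by basisOfRecord to separate specimens from observations.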

Date Validator: Tests whether the collecting event date is a valid date and whether it is consistent with the start and end dates, day of year, etc., and produces a data quality report.

The Date Validator will test the following internally and make changes, or point out missing values: EventDate precision is Julian year or better; EventDate precision is calendar year or better; Day is consistent with Month/Year; Day is in range.
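The "Day In Range" and "Day Consistent With Month/Year" tests can be sketched with the standard library. This is an illustration under stated assumptions, not the Date Validator's code: the function `check_date_fields` is hypothetical, and eventDate is assumed to be a single ISO 8601 day (YYYY-MM-DD), whereas real eventDate values may also be ranges or partial dates.

```python
import calendar
from datetime import date

def check_date_fields(record):
    """Return a list of problems found in a record's year, month, day and eventDate."""
    try:
        year, month, day = (int(record[key]) for key in ("year", "month", "day"))
    except (KeyError, ValueError, TypeError):
        return ["year, month, or day missing or not a number"]
    problems = []
    if not 1 <= year <= 9999:
        problems.append("year out of range")
    if not 1 <= month <= 12:
        problems.append("month out of range")
    if not 1 <= day <= 31:
        problems.append("day out of range")
    if problems:
        return problems
    # monthrange returns (weekday of first day, number of days in the month).
    if day > calendar.monthrange(year, month)[1]:
        return ["day inconsistent with month/year"]
    event_date = record.get("eventDate", "")
    if event_date and event_date != date(year, month, day).isoformat():
        problems.append("eventDate does not match year/month/day")
    return problems

# 2001 is not a leap year, so 29 February does not exist.
print(check_date_fields({"year": "2001", "month": "2", "day": "29"}))
# 2000 is a leap year, and eventDate agrees with the parts.
print(check_date_fields({"year": "2000", "month": "2", "day": "29", "eventDate": "2000-02-29"}))
```

Checking the simple ranges first keeps the month-length lookup from being called with an invalid month.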

Questions?