1
Nothing Is Perfect: Error Detection and Data Cleaning
A. Townsend Peterson STOLEN SHAMELESSLY FROM Arthur Chapman …
6
Types of Errors in Biodiversity Data
Taxonomic data
7
Detection of Taxonomic Errors
Sine qua non – an expert checks specimens and associated data
Check names against authority lists
Check names and authorities against authority lists
N.B.: Check out new capabilities for automated detection and extraction of scientific names …
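A minimal sketch of the authority-list check described above, assuming the authority list has already been obtained as a simple CSV of accepted names and authorities; the file layout and column names here are illustrative assumptions, not from the talk:

```python
import csv
import difflib

def load_authority_list(path):
    """Load accepted scientific names and authorities from a CSV file.
    Assumes columns 'scientific_name' and 'authority'; adjust to your source."""
    accepted = {}
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            accepted[row["scientific_name"].strip()] = row["authority"].strip()
    return accepted

def check_name(name, authority, accepted):
    """Return a list of problems found for one record's name and authority."""
    problems = []
    if name not in accepted:
        # Suggest close matches so a specialist can judge whether this is a typo.
        suggestions = difflib.get_close_matches(name, accepted, n=3, cutoff=0.85)
        problems.append(f"name not in authority list (close matches: {suggestions})")
    elif authority and authority != accepted[name]:
        problems.append(f"authority '{authority}' differs from accepted '{accepted[name]}'")
    return problems
```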
8
Spatial Error
Geographic references are invaluable in enabling analysis of biodiversity data, but they are also extremely prone to problems
9
Georeferencing Errors
10
Georeferencing Error
11
Collector Itineraries
12
(Map of a collector itinerary; scale bar: 100 km)
13
Using Ecological Information
14
Data Cleaning Procedures
Assemble occurrence points for each species
Eliminate occurrence points one at a time (jackknife), and build models without each of the available points
Identify points that are included in models only when they are included in the input data set, or that are not included in models even when they are included in the input data set
Flag these points as suspect for further checking
Here is the basic procedure, summed up in a manuscript presently submitted for publication to Diversity and Distributions; a sketch of the procedure is given below.
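A minimal sketch of the jackknife flagging step described above. The niche-model building and prediction functions are placeholders for whatever modeling tool is actually used (for example, a GARP or Maxent wrapper); they are assumptions for illustration, not part of the original talk:

```python
def flag_suspect_points(points, build_model, predicts_presence):
    """Jackknife-style flagging of suspect occurrence points.

    points            -- list of occurrence points for one species
    build_model       -- callable taking a list of points, returning a fitted niche model
    predicts_presence -- callable (model, point) -> bool

    A point is flagged when its predicted presence depends entirely on its own
    inclusion in the training data, or when it is not predicted even with itself
    included: the two suspect conditions described above.
    """
    suspect = []
    full_model = build_model(points)
    for i, pt in enumerate(points):
        reduced_model = build_model(points[:i] + points[i + 1:])  # leave this point out
        with_self = predicts_presence(full_model, pt)
        without_self = predicts_presence(reduced_model, pt)
        if (with_self and not without_self) or not with_self:
            suspect.append(pt)
    return suspect
```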
15
Data Cleaning Test
Distributional data from the Atlas of Mexican Bird Distributions for various species
Select 18 points at random from those available
Add two random points (simulates a 10% error rate)
Use the data-cleaning procedure to see whether the random points can be identified as ‘erroneous’
In that paper, we constructed a test of the method’s ability to detect points that we KNEW were erroneous. We took good, clean data from the Mexican Atlas (these could just as well be taken from the distributed Species Analyst facility, once that facility is richer in avian data) and added 10% random points (2 out of 20 points). These are our ‘error’ points that we wish to recover; a simulation of this test is sketched below.
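A rough sketch of how such a test could be simulated, assuming the cleaning procedure is available as a function that returns the points it flags; the function names and region bounds are illustrative assumptions:

```python
import random

def simulate_error_test(clean_points, region_bounds, cleaning_fn,
                        n_true=18, n_error=2, seed=0):
    """Plant known-bad points among clean ones and check whether the cleaning
    procedure recovers them (roughly the test described above).

    clean_points  -- list of (lon, lat) records known to be correct
    region_bounds -- (min_lon, min_lat, max_lon, max_lat) of the study region
    cleaning_fn   -- callable returning the subset of points it flags as suspect
    """
    rng = random.Random(seed)
    sample = rng.sample(clean_points, n_true)
    min_lon, min_lat, max_lon, max_lat = region_bounds
    planted = [(rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat))
               for _ in range(n_error)]
    flagged = set(cleaning_fn(sample + planted))
    recovered = sum(1 for pt in planted if pt in flagged)
    return recovered, n_error  # e.g., (2, 2) means both planted errors were caught
```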
16
Example – Crax rubra
Successfully identified the 2 random points included in the model
This map shows 18 known points for Crax rubra in Mexico, overlaid on the results of the predictive analysis; darker shades of red indicate greater model agreement on prediction of presence. Note that the points indicated by blue arrows are either NOT predicted, or are predicted only at low confidence levels; these are precisely the two random points that were introduced into the analysis as a test. The approach successfully identified 8 out of 10 such random points across the 5 species for which tests were developed.
17
Example – Rauvolfia paraensis
Identified one point as an outlier; it proved to be an undescribed species
Here is another test, based on a rainforest tree’s distribution in the Amazon region of South America (a collaboration with Ingrid Koch, of UNICAMP, Campinas, Brazil). The point identified as an outlier (blue arrow) is now under study as likely representing a species new to science.
18
Error Flagging
Never possible to clean completely: what matters is the signal-to-noise ratio
No substitute for inspection and detailed study by specialists
HOWEVER, we can:
Detect records with internal inconsistencies that clearly represent error in some field
Detect records with a high probability of including errors owing to unusual characteristics
Flag those records for later checking and correction (a sketch of such checks follows below)
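A minimal sketch of the kind of internal-consistency checks described above; the field names and thresholds are illustrative assumptions, not from the talk:

```python
import datetime

def flag_record(record):
    """Return a list of reasons an occurrence record looks suspect.

    `record` is assumed to be a dict with keys such as 'lat', 'lon', 'year',
    and 'scientific_name'; these field names are illustrative only.
    """
    reasons = []
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        reasons.append("missing coordinates")
    else:
        if not -90 <= lat <= 90 or not -180 <= lon <= 180:
            reasons.append("coordinates out of valid range")
        if lat == 0 and lon == 0:
            reasons.append("0,0 coordinates (a common data-entry default)")
    if not record.get("scientific_name"):
        reasons.append("missing scientific name")
    year = record.get("year")
    if year is not None and not 1700 <= year <= datetime.date.today().year:
        reasons.append("implausible collection year")
    # A non-empty list means: flag the record for later checking and correction, not deletion.
    return reasons
```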