Nothing Is Perfect: Error Detection and Data Cleaning
A. Townsend Peterson
STOLEN SHAMELESSLY FROM Arthur Chapman …
www.gbif.org/prog/digit/data_quality/URL1124374342
Types of Errors in Biodiversity Data
Taxonomic data
Detection of Taxonomic Errors
Sine qua non: an expert checks the specimens and associated data
Check names against authority lists
Check names and authorities against authority lists
N.B.: Check out new capabilities for automated detection and extraction of scientific names … http://jbi.nhm.ku.edu
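The authority-list check above can be sketched in a few lines of Python. This is only an illustration: the function name check_names, the 0.85 similarity cutoff, and the two-name authority list are assumptions made for the example, and a close match is treated as a probable misspelling to be confirmed by a specialist.

```python
import difflib

def check_names(names, authority_list, cutoff=0.85):
    """Classify each name as 'ok', 'suspect' (near match), or 'unknown'.

    check_names and the cutoff value are hypothetical, chosen for illustration.
    """
    # Normalize whitespace and case, but remember the canonical spelling.
    canonical = {" ".join(a.split()).lower(): a for a in authority_list}
    report = {}
    for name in names:
        key = " ".join(name.split()).lower()
        if key in canonical:
            report[name] = ("ok", canonical[key])
            continue
        # A near match suggests a misspelling of an accepted name.
        close = difflib.get_close_matches(key, list(canonical), n=1, cutoff=cutoff)
        if close:
            report[name] = ("suspect", canonical[close[0]])
        else:
            report[name] = ("unknown", None)
    return report

# Toy authority list built from names that appear in this presentation.
AUTHORITY = ["Crax rubra", "Rauvolfia paraensis"]
result = check_names(["Crax rubra", "Crax rubrra", "Homo sapiens"], AUTHORITY)
for name, (status, match) in result.items():
    print(name, status, match)
```

Names reported as 'suspect' or 'unknown' are candidates for checking, not confirmed errors.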
Spatial Error
Geographic references are invaluable in enabling analysis of biodiversity data, but are also extremely prone to problems.
Georeferencing Errors
Georeferencing Error
Collector Itineraries
100 km
Using Ecological Information
Data Cleaning Procedures
Assemble occurrence points for each species
Eliminate occurrence points one at a time (jackknife), and build a model without each point
Identify points that are
  predicted by the models only when they are included in the input data set
  not predicted by the models even when they are included in the input data set
Flag these points as suspect for further checking
Here is the basic procedure … summed up in a manuscript presently submitted for publication to Diversity and Distributions.
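The jackknife steps above can be sketched as follows. The slides do not say which modeling algorithm was used, so this sketch swaps in a deliberately simple stand-in: an environmental envelope defined as mean plus or minus k standard deviations for each variable. The names fit_envelope, predicts, and jackknife_flags, the value k = 2, and the temperature values are all assumptions made for illustration.

```python
from statistics import mean, stdev

ONLY_WHEN_INCLUDED = "predicted only when included"
NOT_EVEN_WHEN_INCLUDED = "not predicted even when included"

def fit_envelope(points, k=2.0):
    """Stand-in model: per-variable limits of mean +/- k standard deviations."""
    limits = []
    for values in zip(*points):
        m, s = mean(values), stdev(values)
        limits.append((m - k * s, m + k * s))
    return limits

def predicts(limits, point):
    """True if every variable of the point lies inside the envelope."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, limits))

def jackknife_flags(points, k=2.0):
    """Return {index: reason} for suspect points. Needs at least 3 points."""
    full = fit_envelope(points, k)
    flags = {}
    for i, p in enumerate(points):
        if not predicts(full, p):
            # Excluded even though it helped build the model.
            flags[i] = NOT_EVEN_WHEN_INCLUDED
            continue
        # Rebuild the model without point i and test that point.
        reduced = fit_envelope(points[:i] + points[i + 1:], k)
        if not predicts(reduced, p):
            flags[i] = ONLY_WHEN_INCLUDED
    return flags

# Hypothetical mean temperatures (deg C) at eight occurrence points.
occurrences = [(10,), (11,), (9,), (10,), (11,), (9,), (10,), (30,)]
print(jackknife_flags(occurrences))
```

Each point is a tuple of environmental values at an occurrence locality, so the same functions work with several variables. Flagged points are suspects for further checking, not confirmed errors.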
Data Cleaning Test
Distributional data from the Atlas of Mexican Bird Distributions for various species
Select 18 points at random from those available
Add two random points
Simulates a 10% error rate
Use the data-cleaning procedure to see whether the random points can be identified as ‘erroneous’
In that paper, we constructed a test of the method's ability to detect points that we KNEW were erroneous. We took good, clean data from the Mexican Atlas (they could just as well be taken from the distributed Species Analyst facility, once that facility is richer in avian data) and added 10% random points (2 out of 20 points). These are the ‘error’ points that we wish to recover.
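The construction of such a test set can be sketched as below. The function names build_test_set and recovery_rate, the stand-in coordinates, and the rough bounding box are assumptions for illustration only; in particular, the slides do not say how the random points were drawn, so uniform sampling within a rectangle is a guess.

```python
import random

def build_test_set(clean_points, extent, n_clean=18, n_random=2, seed=None):
    """Return a shuffled list of ((lon, lat), is_planted_error) pairs."""
    rng = random.Random(seed)
    # Draw the clean subsample without replacement.
    sample = rng.sample(clean_points, n_clean)
    min_lon, min_lat, max_lon, max_lat = extent
    # Planted errors: uniform random points inside the bounding box.
    planted = [(rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat))
               for _ in range(n_random)]
    labelled = [(p, False) for p in sample] + [(p, True) for p in planted]
    rng.shuffle(labelled)
    return labelled

def recovery_rate(labelled, flagged_indices):
    """Fraction of planted errors that the cleaning procedure flagged."""
    planted = {i for i, (_, is_error) in enumerate(labelled) if is_error}
    return len(planted & set(flagged_indices)) / len(planted)

# Hypothetical stand-in for clean atlas records (lon, lat).
clean = [(-100.0 + 0.1 * i, 18.0 + 0.05 * i) for i in range(40)]
# Rough bounding box, assumed here only for the example.
test_set = build_test_set(clean, (-118.0, 14.0, -86.0, 33.0), seed=1)
print(len(test_set), sum(1 for _, is_error in test_set if is_error))
```

With the defaults, 2 of the 20 points are planted, which matches the 10% error rate described above. The true labels are kept alongside the points so that recovery can be scored after the cleaning procedure has run.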
Example – Crax rubra
Successfully identified the 2 random points included in the model
This map shows 18 known points for Crax rubra in Mexico, overlaid on the results of the predictive analysis; darker shades of red indicate greater model agreement on prediction of presence. Note that the points indicated by blue arrows are either NOT predicted or are predicted at low confidence levels; these are precisely the two random points that were introduced into the analysis as a test. Overall, the approach successfully identified 8 out of 10 such random points across the 5 species for which tests were developed.
Example – Rauvolfia paraensis
Identified one point as an outlier. Proved to be an undescribed species
Here is another test, based on a rainforest tree’s distribution in the Amazon region of South America (collaboration with Ingrid Koch, of UNICAMP, Campinas, Brazil). The point that was identified as an outlier (blue arrow) is now under study as likely representing a species new to science.
Error Flagging
Never possible to clean completely; what matters is the signal-to-noise ratio
No substitute for inspection and detailed study by specialists
HOWEVER, we can
  Detect records with internal inconsistencies that clearly represent error in some field
  Detect records with a high probability of including errors owing to unusual characteristics
  Flag those records for later checking and correction
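As a rough illustration of flagging internal inconsistencies, the sketch below checks a single record and returns a list of flags instead of deleting or altering anything. The field names (latitude, longitude, min_elevation, max_elevation) and the particular checks are assumptions chosen for the example; a real database would need checks suited to its own schema.

```python
def flag_record(record):
    """Return a list of reasons why a record deserves later checking."""
    flags = []
    lat = record.get("latitude")
    lon = record.get("longitude")
    if lat is None or lon is None:
        flags.append("missing coordinates")
    else:
        if not -90 <= lat <= 90:
            flags.append("latitude out of range")
        if not -180 <= lon <= 180:
            flags.append("longitude out of range")
        if lat == 0 and lon == 0:
            # 0, 0 is often a placeholder rather than a real locality.
            flags.append("coordinates are 0, 0")
    lo_elev = record.get("min_elevation")
    hi_elev = record.get("max_elevation")
    if lo_elev is not None and hi_elev is not None and lo_elev > hi_elev:
        flags.append("minimum elevation exceeds maximum")
    return flags

# Hypothetical records; the record is kept and only annotated.
records = [
    {"latitude": 19.4, "longitude": -99.1, "min_elevation": 2200, "max_elevation": 2300},
    {"latitude": 119.4, "longitude": -99.1},
    {"latitude": 0, "longitude": 0, "min_elevation": 500, "max_elevation": 100},
]
for rec in records:
    rec["flags"] = flag_record(rec)
    print(rec["flags"])
```

Keeping the flags on the record, rather than removing the record, leaves the final judgment to a specialist.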