Nothing Is Perfect: Error Detection and Data Cleaning


Nothing Is Perfect: Error Detection and Data Cleaning
A. Townsend Peterson
STOLEN SHAMELESSLY FROM Arthur Chapman …

www.gbif.org/prog/digit/data_quality/URL1124374342

Types of Errors in Biodiversity Data
Taxonomic data

Detection of Taxonomic Errors
Sine qua non – expert checks specimens and associated data
Check names against authority lists
Check names and authorities against authority lists
N.B.: Check out new capabilities for automated detection and extraction of scientific names … http://jbi.nhm.ku.edu
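The name check against an authority list can be sketched as follows. This is a minimal illustration, not the tooling the slide refers to: the function name `check_names`, the three status labels, and the use of approximate string matching (with a 0.85 similarity cutoff) to catch likely misspellings are all assumptions made for the example.

```python
from difflib import get_close_matches


def check_names(names, authority, cutoff=0.85):
    """Classify each name as 'ok', 'suspect' (near match), or 'unknown'.

    `authority` is a list of accepted scientific names. The fuzzy-match
    step and the cutoff value are illustrative assumptions.
    """
    authority_set = set(authority)
    report = {}
    for name in names:
        # Collapse repeated whitespace before comparing.
        cleaned = " ".join(name.split())
        if cleaned in authority_set:
            report[name] = ("ok", cleaned)
            continue
        # Look for a single close spelling in the authority list.
        close = get_close_matches(cleaned, authority, n=1, cutoff=cutoff)
        if close:
            report[name] = ("suspect", close[0])
        else:
            report[name] = ("unknown", None)
    return report
```

Names reported as 'suspect' or 'unknown' would still go to a specialist; the check only narrows down which records need attention.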

Spatial Error
Geographic references are invaluable in enabling analysis of biodiversity data, but are also extremely prone to problems.

Georeferencing Errors

Georeferencing Error

Collector Itineraries

Using Ecological Information

Data Cleaning Procedures
Assemble occurrence points for each species
Eliminate occurrence points one at a time (jackknife), and build models without each of the points available
Identify points that are:
  included in models only when included in the input data set
  not included in models even when included in the input data set
Flag these points as suspect for further checking
Here is the basic procedure … summed up in a manuscript presently submitted for publication to Diversity and Distributions.
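The jackknife steps above can be sketched in a few lines. The slides do not say which modelling algorithm was used, so this example swaps in a deliberately simple stand-in: an envelope of mean ± k standard deviations on each environmental variable. The function names, the envelope model, and the parameter `k` are all assumptions for illustration; a real analysis would substitute an actual niche-modelling method.

```python
from statistics import mean, pstdev


def fit_envelope(points, k=2.0):
    """Toy stand-in for a niche model (an assumption, not the original method):
    per-variable bounds of mean +/- k population standard deviations."""
    bounds = []
    for d in range(len(points[0])):
        values = [p[d] for p in points]
        m, s = mean(values), pstdev(values)
        bounds.append((m - k * s, m + k * s))
    return bounds


def predicts(bounds, point):
    """True if the point falls inside the envelope on every variable."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, bounds))


def jackknife_flags(points, k=2.0):
    """Flag points by comparing the full model with leave-one-out models.

    `points` is a list of tuples of environmental values, one tuple per
    occurrence record. Returns {index: reason} for suspect points only.
    """
    full = fit_envelope(points, k)
    flags = {}
    for i, p in enumerate(points):
        # Rebuild the model without point i.
        without = fit_envelope(points[:i] + points[i + 1:], k)
        if not predicts(full, p):
            flags[i] = "never predicted"
        elif not predicts(without, p):
            flags[i] = "predicted only when included"
    return flags
```

The two flag labels correspond to the two cases on the slide. Which case an outlier falls into depends on how permissive the model is: with a looser envelope (larger `k`) an outlier can pull the full model far enough to cover itself, yet still fall outside the model built without it.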

Data Cleaning Test
Distributional data from the Atlas of Mexican Bird Distributions for various species
Select 18 points at random from those available
Add two random points
Simulates 10% error rate
Use data-cleaning procedure to see if random points could be identified as ‘erroneous’
In that paper, we constructed a test of the ability of the method to detect points that we KNEW were erroneous. We took good, clean data from the Mexican Atlas (these could just as well be taken from the distributed Species Analyst facility, once that facility is richer in avian data) and added 10% random points (2 out of 20 points). These are the ‘error’ points that we wish to recover.
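The construction of such a test set can be sketched as below. This is only an illustration: the function name `build_test_set`, the bounding-box argument, and drawing the planted points uniformly within that box are assumptions, since the slides do not say how the random points were generated.

```python
import random


def build_test_set(known_points, bbox, n_known=18, n_random=2, seed=0):
    """Mix a random sample of known points with planted random points.

    known_points: list of (lon, lat) tuples taken from clean data.
    bbox: (min_lon, min_lat, max_lon, max_lat) of the study region
          (drawing uniformly inside it is an assumption of this sketch).
    Returns a shuffled list of (point, is_planted) pairs.
    """
    rng = random.Random(seed)
    sample = rng.sample(known_points, n_known)
    min_lon, min_lat, max_lon, max_lat = bbox
    planted = [
        (rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat))
        for _ in range(n_random)
    ]
    records = [(p, False) for p in sample] + [(p, True) for p in planted]
    rng.shuffle(records)
    return records
```

With the defaults, 2 of the 20 returned records are planted, which reproduces the 10% error rate of the test. Keeping the `is_planted` label alongside each point makes it straightforward to score afterwards how many planted points the cleaning procedure recovered.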

Example – Crax rubra
Successfully identified the 2 random points included in the model
This map shows 18 known points for Crax rubra in Mexico, overlaid on the results of the predictive analysis; darker shades of red indicate greater model agreement on prediction of presence. Note that the points indicated by blue arrows are either NOT predicted or are predicted at low confidence levels; these are precisely the two random points that were introduced into the analysis as a test. Across the 5 species for which tests were developed, the approach successfully identified 8 out of 10 such random points.

Example – Rauvolfia paraensis
Identified one point as outlier
Proved to be an undescribed species
Here is another test, based on a rainforest tree’s distribution in the Amazon region of South America (collaboration with Ingrid Koch, of UNICAMP, Campinas, Brazil). The point that was identified as an outlier (blue arrow) is now under study as likely representing a species new to science.

Error Flagging
Never possible to clean completely; what matters is signal-to-noise ratio
No substitute for inspection and detailed study by specialists
HOWEVER, we can:
  Detect records with internal inconsistencies that clearly represent error in some field
  Detect records with high probability of including errors owing to unusual characteristics
  Flag those records for later checking and correction
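A flagging pass for internal inconsistencies might look like the sketch below. The specific checks (coordinate ranges, coordinates of exactly 0,0, a collection year later than the current year) and the field names `lat`, `lon`, and `year` are assumptions chosen for illustration; the slides do not list particular rules. Records are flagged, not altered or dropped, so that a specialist can review them later.

```python
def flag_record(record, current_year):
    """Return a list of reasons why a record looks suspect (empty if none).

    `record` is a dict with optional keys 'lat', 'lon' (decimal degrees)
    and 'year'. Field names and checks are hypothetical examples.
    """
    flags = []
    lat = record.get("lat")
    lon = record.get("lon")
    if lat is None or lon is None:
        flags.append("missing coordinates")
    else:
        if not -90 <= lat <= 90:
            flags.append("latitude out of range")
        if not -180 <= lon <= 180:
            flags.append("longitude out of range")
        # 0,0 is a common placeholder for unknown coordinates.
        if lat == 0 and lon == 0:
            flags.append("coordinates exactly 0,0")
    year = record.get("year")
    if year is not None and year > current_year:
        flags.append("collection year in the future")
    return flags
```

Passing `current_year` explicitly keeps the function deterministic, which makes the same rules easy to rerun and audit.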