Enrique Martínez-Meyer


Ecological Niche Modelling: Inter-model Variation and Best-subset Model Selection
Enrique Martínez-Meyer

The problem: We want to represent the geographic distribution of species under the following circumstances:
- Most occurrence data available for the vast majority of species are asymmetric (i.e. presence-only)
- Sampling effort across most species' distributional ranges is uneven, so occurrence datasets are eco-geographically biased
- Environmental variables encompass relatively few niche dimensions, and we do not know which variables are relevant for each species

More problems:
- Many algorithms cannot handle asymmetric data (e.g. GLM, GAM)
- Some of the algorithms that do handle asymmetric data cannot handle nominal environmental variables such as soil classes (e.g. Bioclim, ENFA)
- Many stochastic algorithms produce different solutions to the same problem, even under identical parameterization and input data (e.g. GARP)
- We do not know the 'real' distribution of species, so we do not know when models are making mistakes (mainly over-representing distributions) and when they are filling knowledge gaps

We have to live with all these problems, so we need a way to make the best decision possible. Anderson, Lew and Peterson (2003) developed a procedure to select the best subset of models from a given set of varying models.

To evaluate model quality you need to: 1. Generate an 'independent' set of data. There are at least two strategies to do so:
- Collect new data
- Split your data into two sets
Regardless of the method, you end up with two data sets: one for training the model and one for testing the model
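The splitting strategy can be sketched in a few lines of Python. This is a minimal illustration, not part of the original procedure; the function name, the 70/30 ratio, and the sample coordinates are all hypothetical choices:

```python
import random

def split_occurrences(records, train_fraction=0.7, seed=42):
    """Split occurrence records into a training set and a testing set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(records)    # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example with 10 hypothetical (longitude, latitude) occurrence points
points = [(-99.0 + i * 0.5, 19.0 + i * 0.3) for i in range(10)]
train, test = split_occurrences(points)  # 7 training, 3 testing points
```

In practice a spatially stratified split is often preferred for biased occurrence data, but a random split is the simplest starting point.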

2. Generate a model with the training data

3. Quantify error components with a confusion matrix

                     Actually Present   Actually Absent
Predicted Present           a                  b
Predicted Absent            c                  d

a & d = correct predictions
b = commission error (false positives, overprediction)
c = omission error (false negatives, underprediction)
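The error components follow directly from the four cell counts. A small sketch, using the slide's cell names a–d (the dictionary keys and example counts are illustrative):

```python
def confusion_errors(a, b, c, d):
    """Error rates from confusion-matrix cell counts:
    a = true presences, b = false presences,
    c = false absences,  d = true absences."""
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,                    # a & d are correct
        "omission": c / (a + c) if (a + c) else 0.0,    # missed actual presences
        "commission": b / (b + d) if (b + d) else 0.0,  # overpredicted actual absences
    }

errors = confusion_errors(a=40, b=10, c=10, d=40)
# errors == {"accuracy": 0.8, "omission": 0.2, "commission": 0.2}
```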

What does omission mean? In general, omission error can be considered a 'hard' (true) error. However, under some circumstances, omitting a presence record may not be a real error:
1. Identification of the taxon is wrong
2. Georeferencing of the locality is wrong
3. The record represents individuals outside their ecological niche, e.g. 'sink' (rather than 'source') populations, or individuals in transit or vagrant

What does commission mean? Model overprediction may or may not be an error, so commission is not a 'hard' error. Prediction in areas where a species has no confirmed record can have different causes:
1. The area is suitable for the species, but no sampling effort has been made there. The species may be present.
2. The area is suitable for the species, but historical factors (barriers, dispersal capability) or biotic factors (competition, predation) have prevented the species from occupying it, or the species went extinct there.
3. The area is unsuitable: a true commission error.

Some stochastic algorithms (like GARP) produce somewhat different models from the same input data. If we produce several models, we can calculate their errors and plot them in an omission/commission space.

[Plot: omission error (% of occurrence points outside the predicted area) vs. commission index (% of area predicted present), both 0–100. For species with a fair number of occurrence data, the models trace a typical curve.]

[Plot: the omission/commission space annotated against the true distribution of a species in an area. Models with high omission and low commission miss part of the distribution; models with zero omission and high commission overpredict it; models with zero omission and no commission are overfitting.]

The question now is: which of these models are good and which ones are bad? In the omission/commission plot, models with high omission error are definitely bad.

[Plot: in the omission/commission space, the region of the best models lies at low omission and close to the median commission index, between overprediction on one side and overfitting on the other.]
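The two-step filter described above (discard high-omission models, then keep those nearest the median commission index) can be sketched as follows. This is a simplified illustration in the spirit of Anderson, Lew and Peterson (2003), not their exact implementation; the dictionary layout, thresholds, and run values are hypothetical:

```python
from statistics import median

def best_subset(models, max_omission=5.0, n_best=10):
    """models: list of dicts with 'omission' (%) and 'commission' (% of area
    predicted present).
    Step 1: keep models at or below the hard omission threshold.
    Step 2: of those, keep the n_best models closest to the median commission."""
    low_omission = [m for m in models if m["omission"] <= max_omission]
    if not low_omission:
        return []
    med = median(m["commission"] for m in low_omission)
    low_omission.sort(key=lambda m: abs(m["commission"] - med))
    return low_omission[:n_best]

# 20 hypothetical GARP runs with varying error values
runs = [{"id": i, "omission": i % 10, "commission": 30 + 2 * i} for i in range(20)]
selected = best_subset(runs, max_omission=5.0, n_best=5)
```

Filtering on omission first matters: the median commission is computed only over the low-omission models, so a few badly overfit or overpredicted runs cannot drag the reference point around.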

Implementation in Desktop GARP: If you have enough occurrence data, you can split it into training and testing datasets. When this is the case, it is convenient to select Extrinsic in the Omission Measure option. Otherwise, if you use 100% of the data for training, you have to select Intrinsic.

Implementation in Desktop GARP: In the Omission threshold section, selecting Hard means that you will use an absolute value on the omission axis of the plot; you set that value in the % omission box. Then you select the number of models that you want Desktop GARP to retain under that hard omission threshold.

Implementation in Desktop GARP: Selecting Soft means that you will keep a certain percentage of models, indicated in the % distribution box, with the least omission. This is useful when you are running more than one species at a time. In this case, the Total models under hard omission threshold box does not apply.
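The soft-threshold behaviour amounts to ranking models by omission and keeping a fixed percentage. A hypothetical sketch (the function and field names are mine, not Desktop GARP's):

```python
def soft_omission_filter(models, keep_pct):
    """Keep the keep_pct percent of models with the lowest omission error,
    mirroring the Soft option and its % distribution box."""
    n_keep = max(1, round(len(models) * keep_pct / 100))
    return sorted(models, key=lambda m: m["omission"])[:n_keep]

# 10 hypothetical runs; ids 9 and 8 have the lowest omission values
runs = [{"id": i, "omission": float(10 - i)} for i in range(10)]
kept = soft_omission_filter(runs, keep_pct=20)  # keeps the 2 best runs
```

Because the cut is relative rather than absolute, the same setting adapts to species whose models have very different omission ranges, which is why it suits multi-species batch runs.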

Implementation in Desktop GARP: Finally, in the Commission threshold box you indicate the percentage of models closest to the median of the Commission Index axis that you want selected from the models remaining after the omission filter. When the Omission threshold is Soft, the Commission threshold value is relative to the % distribution value.

Implementation in Desktop GARP: When the Omission threshold is Hard, the Commission threshold value is relative to the Total models under hard omission threshold value.