Robert Plant != Richard Plant

(Workflow diagram) Field data (response, coordinates) and covariates/predictors (direct or remotely sensed, which may be the same data) are qualified and prepped into sample data (response, covariates). The sample data are randomly split into training and test sets. The model is built on the training data, repeated over and over; it is validated against the test data (statistics) and used to predict values, which are summarized into a predictive map and uncertainty maps.

Cross-Validation
Split the data into a training set (build the model) and a test set (validate it)
Leave-p-out cross-validation: validate on p samples, train on the remainder; repeat for all combinations of p
Non-exhaustive cross-validation: leave-p-out, but only on a subset of the possible combinations
Randomly splitting into 30% test and 70% training is common
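A minimal sketch of the random 70/30 split in R; the data frame dat and its columns are hypothetical, not from the slides:
set.seed(42)                                    # make the random split repeatable
n <- nrow(dat)                                  # 'dat' is a hypothetical data frame
train_rows <- sample(n, size = round(0.7 * n))  # 70% of rows chosen at random
training <- dat[train_rows, ]                   # build the model on these rows
test <- dat[-train_rows, ]                      # validate on the remaining 30%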

K-fold Cross-Validation
Break the data into K sections (folds)
Test on fold K_i, train on the remainder
Repeat for all K_i
10-fold is common
Used in rpart()
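A minimal 10-fold sketch in R, assuming a hypothetical data frame dat with response y; lm() stands in for whatever model is being assessed:
set.seed(42)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))    # assign each row to one of k folds
cv_error <- numeric(k)
for (i in 1:k) {
  fit <- lm(y ~ ., data = dat[folds != i, ])         # train on the other folds
  pred <- predict(fit, newdata = dat[folds == i, ])  # predict the held-out fold
  cv_error[i] <- mean((dat$y[folds == i] - pred)^2)  # mean squared error on fold i
}
mean(cv_error)                                       # cross-validated error estimate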

Bootstrapping
Draw N samples from the sample data (with replacement)
Build the model
Repeat the process over and over
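A minimal bootstrap sketch in R under the same hypothetical dat/y/x names:
set.seed(42)
B <- 1000                                    # number of bootstrap resamples
boot_coefs <- replicate(B, {
  rows <- sample(nrow(dat), replace = TRUE)  # N samples drawn with replacement
  coef(lm(y ~ x, data = dat[rows, ]))        # rebuild the model on each resample
})
apply(boot_coefs, 1, sd)                     # variability of each coefficient across resamples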

Random Forest
N samples drawn from the data with replacement
Repeated to create many trees: a "random forest"
Each tree's splits are chosen from a random subset of the possible predictors
Predictions are made by combining the predictions of all the trees (voting or averaging)
This resampling and combining is bootstrap aggregation, or "bagging"
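A hedged sketch using the randomForest package; the package choice and the training/test objects (carried over from the split sketch above) are assumptions, not from the slides:
library(randomForest)                                     # assumed package choice
set.seed(42)
rf <- randomForest(y ~ ., data = training, ntree = 500)   # 500 trees, each on a bootstrap sample
pred <- predict(rf, newdata = test)                       # combined ("bagged") prediction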

Boosting
Can a set of weak learners create a single strong learner? (Wikipedia)
Lots of "simple" trees used to create a really complex tree
"Convex potential boosters cannot withstand random classification noise."  Phillip Long (at Google) and Rocco A. Servedio (Columbia University), 2008

Boosted Regression Trees
BRTs combine thousands of trees to reduce the deviance from the data
Currently popular
More on this later
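A hedged sketch with the gbm package, one common BRT implementation; the package choice and all settings are illustrative assumptions, not from the slides:
library(gbm)                                   # assumed package choice
set.seed(42)
brt <- gbm(y ~ ., data = training,
           distribution = "gaussian",          # continuous response; "bernoulli" for 0/1 data
           n.trees = 3000,                     # thousands of small trees
           interaction.depth = 2,              # keep each tree simple (a weak learner)
           shrinkage = 0.01)                   # slow learning rate
pred <- predict(brt, newdata = test, n.trees = 3000)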

Sensitivity Testing
Injecting small amounts of "noise" into our data to see the effect on the model parameters (Plant)
The same approach can be used to model the impact of uncertainty on our model outputs and to make uncertainty maps
Note: this is not the same as sensitivity testing of model parameters
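A minimal sketch of noise injection in R, assuming the hypothetical dat/y/x names used earlier:
set.seed(42)
base_fit <- lm(y ~ x, data = dat)                                    # baseline model
perturbed <- replicate(200, {
  noisy <- dat
  noisy$x <- noisy$x + rnorm(nrow(noisy), sd = 0.01 * sd(noisy$x))   # small noise added to x
  coef(lm(y ~ x, data = noisy))                                      # refit and keep the coefficients
})
apply(perturbed, 1, sd)   # how much the coefficients move under small input noise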

Jackknifing Trying all combinations of covariates
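One way to sketch "all combinations of covariates" in R with combn(); the covariate names are hypothetical:
covars <- c("elev", "slope", "precip")                  # hypothetical covariate names
combos <- unlist(lapply(seq_along(covars),
                        function(m) combn(covars, m, simplify = FALSE)),
                 recursive = FALSE)                     # every non-empty combination
fits <- lapply(combos, function(v) lm(reformulate(v, response = "y"), data = dat))
sapply(fits, AIC)                                       # compare the candidate models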

Extrapolation vs. Prediction
Modeling (prediction): creating a model that allows us to estimate values between our data
Extrapolation: using existing data to estimate values outside the range of our data

Building Models
Selecting the method
Selecting the predictors ("Model Selection")
Optimizing the coefficients/parameters of the model

(Workflow diagram repeated from earlier: sample data are split into training and test sets, the model is built and validated, and predictions are summarized into a predictive map and uncertainty maps.)

Model Selection
Need a method to select the "best" set of predictors
Really, to select the best method, predictors, and coefficients (parameters)
Should be a balance between fitting the data and simplicity
R2 only considers fit to the data (but linear regression is pretty simple)
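A hedged sketch of AIC-based predictor selection with base R's step(); the model and data are hypothetical:
full <- lm(y ~ ., data = dat)                           # all candidate predictors
best <- step(full, direction = "both", trace = FALSE)   # add/drop terms to lower AIC
summary(best)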

Simplicity
"Everything should be made as simple as possible, but not simpler."  Albert Einstein
(Image: "Albert Einstein Head", photograph by Oren Jack Turner, Princeton; licensed through Wikipedia)

Parsimony
"…too few parameters and the model will be so unrealistic as to make prediction unreliable, but too many parameters and the model will be so specific to the particular data set so to make prediction unreliable."
Edwards, A. W. F. (2001). Occam's bonus. p. 128–139 in: Zellner, A., Keuzenkamp, H. A., and McAleer, M. (eds.), Simplicity, inference and modelling. Cambridge University Press, Cambridge, UK.

Parsimony
Under-fitting: model structure ends up included in the residuals
Over-fitting: residual variation is included as if it were structural
Parsimony lies between the two (Anderson)

Akaike Information Criterion (AIC)
$AIC = 2k - 2\ln(L)$
k = number of estimated parameters in the model
L = maximized likelihood function for the estimated model
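A quick check in R that the built-in AIC() matches the formula (model and data hypothetical):
fit <- lm(y ~ x, data = dat)            # hypothetical model
k <- attr(logLik(fit), "df")            # parameters counted by R (coefficients + residual variance)
2 * k - 2 * as.numeric(logLik(fit))     # AIC from the formula
AIC(fit)                                # built-in AIC; should match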

AIC
Has only a relative meaning; smaller is "better"
A balance between complexity and bias:
Too many parameters: over-fitting, or modeling the errors
Too few parameters: under-fitting, or the model misses part of the phenomenon we are trying to model

Likelihood
The likelihood of a set of parameter values given some observed data = the probability of the observed data given those parameter values
Definitions:
x = all sample values
x_i = one sample value
θ = set of parameters
$p(x \mid \theta)$ = probability of x, given θ
See: ftp://statgen.ncsu.edu/pub/thorne/molevoclass/pruning2013cme.pdf

Likelihood
$\mathcal{L}(\theta \mid x) = p(x \mid \theta) = \prod_i p(x_i \mid \theta)$ (the product of the probabilities of each independent sample value, given the parameters)

-2 Times Log Likelihood

p(x) for a fair coin
$AIC = 2k - 2\ln(L)$
$L = p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdots$
p(Heads) = 0.5, p(Tails) = 0.5
What happens as we flip a "fair" coin?

p(x) for an unfair coin
$AIC = 2k - 2\ln(L)$
$L = p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdots$
p(Heads) = 0.8, p(Tails) = 0.2
What happens as we flip a "fair" coin?

p(x) for a coin with two heads
$AIC = 2k - 2\ln(L)$
$L = p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdots$
p(Heads) = 1.0, p(Tails) = 0.0
What happens as we flip a "fair" coin?
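A small numeric illustration in R of the three models above; the flip counts are made up:
heads <- 6; tails <- 4                                     # made-up result of 10 flips of a fair coin
loglik <- function(p) heads * log(p) + tails * log(1 - p)  # log likelihood of the data under p(Heads) = p
loglik(0.5)   # fair-coin model
loglik(0.8)   # unfair-coin model: lower log likelihood for these data
loglik(1.0)   # two-headed model: any observed tail gives log(0), so L = 0 (-Inf)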

Does likelihood from p(x) work?
If the likelihood is the probability of the data given the parameters, and a response function provides the probability of a piece of data (i.e., the probability that this is suitable habitat), then we can use the probability that a specific occurrence is suitable as p(x|parameters)
Thus the likelihood of a habitat model (disregarding bias) can be computed as L(parameter values|data) = p(Data_1|parameter values) * p(Data_2|parameter values) ...
This does not work as-is: the highest likelihood goes to a model that is 1.0 everywhere, so the model must be divided by its area so that the area under the model = 1.0
Remember: this only works when comparing models on the same dataset!

Akaike…
Akaike showed that:
$\log(\mathcal{L}(\hat{\theta} \mid data)) - K \approx E_y E_x[\log(g(x \mid \hat{\theta}(y)))]$
which is equivalent to:
$\log(\mathcal{L}(\hat{\theta} \mid data)) - K = \text{constant} - E_{\hat{\theta}}[I(f, \hat{g})]$
Akaike then defined:
$AIC = -2\log(\mathcal{L}(\hat{\theta} \mid data)) + 2K$

AICc
Adds an additional penalty for more parameters:
$AICc = AIC + \frac{2k(k+1)}{n-k-1}$
Recommended when n is small or k is large

BIC: Bayesian Information Criterion
Adds n (the number of samples) to the penalty:
$BIC = k\ln(n) - 2\ln(L)$
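Both criteria are available directly in base R (models and data hypothetical):
fit1 <- lm(y ~ x, data = dat)                 # hypothetical nested models on the same data
fit2 <- lm(y ~ x + I(x^2), data = dat)
AIC(fit1, fit2)                               # smaller is better; only comparable on the same dataset
BIC(fit1, fit2)                               # BIC's penalty grows with the sample size n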

Extra slides

Justification:
Discrete: $D_{KL}(P \| Q) = \sum_i P(i)\,\ln\!\left(\frac{P(i)}{Q(i)}\right)$
Continuous: $D_{KL}(P \| Q) = -\int p(x)\log(q(x))\,dx + \int p(x)\log(p(x))\,dx$

The distance can also be expressed as:
$I(f,g) = \int f(x)\log(f(x))\,dx - \int f(x)\log(g(x \mid \theta))\,dx$
Each integral is an expectation with respect to $f(x)$, so:
$I(f,g) = E_f[\log(f(x))] - E_f[\log(g(x \mid \theta))]$
Treating $E_f[\log(f(x))]$ as an unknown constant:
$I(f,g) - C = -E_f[\log(g(x \mid \theta))]$ = relative distance between g and f