Model Comparison
Assessing alternative models We don’t ask “Is the model right or wrong?” We ask “Do the data support a model more than a competing model?” Strength of evidence (support) for a model is relative: Relative to other models: As models improve, support may change. Relative to data at hand: As the data improve, support may change.
Assessing alternative models Likelihood ratio tests. Akaike’s Information Criterion (AIC).
Recall the Likelihood Axiom “Within the framework of a statistical model, a set of data supports one statistical hypothesis better than another if the likelihood of the first hypothesis, on the data, exceeds the likelihood of the second hypothesis.” (Edwards 1972)
Likelihood ratio tests Statistical evidence is only relative; that is, it only applies to one model (hypothesis) in comparison with another. The likelihood ratio $L_A(x) / L_B(x)$ measures the strength of evidence favoring hypothesis A over hypothesis B. Likelihood ratio tests tell us something about the strength of evidence for one model vs. another. If the ratio is very large, hypothesis A did a much better job than B of predicting which value X would take, and the observation X = x is very strong evidence for A versus B. Likelihood ratio tests apply to pairs of hypotheses tested using the same dataset.
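Likelihood ratio tests The likelihood ratio statistic R is twice the difference in maximized log-likelihoods between the two models: $R = -2 \ln\left[ L_A(x) / L_B(x) \right] = 2\left[ \ln L_B(x) - \ln L_A(x) \right]$, where A is the simpler model nested within B. R approximately follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between models A and B.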
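A minimal sketch of the test, assuming the two maximized log-likelihoods are already in hand. The function names and the log-likelihood values are hypothetical, and the chi-square tail probability uses closed forms that hold only for 1 or 2 degrees of freedom.

```python
import math

def lrt_statistic(loglik_simple, loglik_complex):
    """Twice the difference in maximized log-likelihoods (complex minus simple)."""
    return 2.0 * (loglik_complex - loglik_simple)

def chi2_upper_tail(x, df):
    """P(chi-square > x); closed forms for df = 1 and df = 2 only."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")

# Hypothetical maximized log-likelihoods, for illustration only.
R = lrt_statistic(loglik_simple=-112.4, loglik_complex=-109.1)
p_value = chi2_upper_tail(R, df=1)
print(R, p_value)
```

For other degrees of freedom a statistics library would normally supply the tail probability.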
An example
The Data: $x_i$ = measurements of DBH on 50 adult trees; $y_i$ = measurements of crown radius on those trees.
The Scientific Models:
$y_i = b x_i + e$: linear relationship, with 1 parameter (b) and an error term (e).
$y_i = a + b x_i + e$: linear relationship, with 2 parameters (a, b) and an error term (e).
$y_i = a + b x_i + \lambda x_i^2 + e$: non-linear relationship, with 3 parameters (a, b, λ) and an error term (e).
The Probability Model: e is normally distributed, with mean = 0 and variance estimated from the observed variance of the residuals.
Procedure Initialize parameter estimates. Using a parameter estimation routine, find the parameter values that maximize the likelihood given the model and a normal error structure. Calculate the difference in log-likelihood between models. Conduct likelihood ratio tests. Choose the best model of the three candidate models.
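The procedure can be sketched for the first two candidate models. This is an illustration under stated assumptions, not the course's estimation routine: with normal errors the maximum likelihood estimates coincide with least squares, so closed-form fits stand in for a numerical optimizer. The data values and function names are made up.

```python
import math

def normal_max_loglik(residuals):
    """Maximized normal log-likelihood, with variance set to its MLE (RSS / n)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)

def fit_model1(x, y):
    """y = b x: least-squares slope through the origin."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def fit_model2(x, y):
    """y = a + b x: ordinary least-squares intercept and slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up DBH and crown radius values, for illustration only.
dbh = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0]
radius = [1.9, 2.4, 3.1, 3.4, 4.2, 4.6]

b1 = fit_model1(dbh, radius)
a2, b2 = fit_model2(dbh, radius)
loglik1 = normal_max_loglik([yi - b1 * xi for xi, yi in zip(dbh, radius)])
loglik2 = normal_max_loglik([yi - (a2 + b2 * xi) for xi, yi in zip(dbh, radius)])

# Twice the log-likelihood difference; compare with 3.84 (chi-square, df = 1).
R = 2.0 * (loglik2 - loglik1)
print(loglik1, loglik2, R)
```

Because model 1 is model 2 with a fixed at 0, the larger model can never have the lower maximized likelihood; the test asks whether the gain is large enough to justify the extra parameter.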
Remember Parsimony The question is: Is the more complicated model BETTER than the simpler model?
Results For model 2 to be better than model 1, twice the difference in log-likelihoods between the models must be greater than the critical value of the chi-square distribution with 1 degree of freedom at p = 0.05. Critical values: $\chi^2(\mathrm{df} = 1) = 3.84$; $\chi^2(\mathrm{df} = 2) = 5.99$.
Results Model 2 > Model 3 > Model 1
This is from day 2, the exercise that you did
Model comparison Is the two population model better?
Model selection when probability distributions differ by model The model selection framework allows the error structure to vary over the models in the candidate set, but: No component part of the likelihood function can be dropped. The scientific model must remain constant over the models being compared. We must adjust for the different number of parameters in each probability model.
An example
The Data: $x_i$ = measurements of DBH on 50 trees; $y_i$ = measurements of crown radius on those trees.
The Scientific Model: $y_i = a + b x_i + e$ [2 parameters (a, b)]
The Probability Models:
Data are normally distributed, with E[x] predicted by the model and variance estimated from the observed variance of the residuals.
Data are lognormally distributed, with E[x] predicted by the model and variance estimated from the observed variance of the residuals.
Back to the example The normal and lognormal have an equal number of parameters so we can compare the likelihoods directly. In this case, the normal probability model is supported by the data.
A second example
The Data: $x_i$ = measurements of DBH on 50 trees; $y_i$ = counts of seedlings produced by trees.
The Scientific Model: $y_i = \mathrm{STR} \cdot (\mathrm{DBH}/30)^{b} + e$: exponential relationship, with 1 parameter (b) and an error term (e).
The Probability Models:
Data follow a Poisson distribution, with E[x] and variance = λ.
Data follow a negative binomial distribution, with E[x] = m and variance = $m + m^2/k$, where k is the clumping parameter.
Back to the example The negative binomial requires estimation of one extra parameter, k, generally known as the clumping parameter. Thus, twice the difference in log-likelihoods between the two models must be greater than $\chi^2(\mathrm{df} = 1) = 3.84$.
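A sketch of the two log-likelihood functions being compared, assuming the negative binomial is parameterized by its mean m and clumping parameter k (variance m + m^2/k). The function names are hypothetical; in practice each function would be handed to an optimizer and the two maximized values compared.

```python
import math

def poisson_loglik(counts, means):
    """Sum of Poisson log-probabilities; means are the model predictions."""
    return sum(y * math.log(m) - m - math.lgamma(y + 1)
               for y, m in zip(counts, means))

def negbin_loglik(counts, means, k):
    """Sum of negative binomial log-probabilities with mean m and clumping k."""
    total = 0.0
    for y, m in zip(counts, means):
        total += (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
                  + k * math.log(k / (k + m))
                  + y * math.log(m / (k + m)))
    return total

# As k grows, the negative binomial approaches the Poisson, so the Poisson
# behaves as the simpler model with one fewer parameter.
print(poisson_loglik([3], [2.0]), negbin_loglik([3], [2.0], 1e6))
```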
Information theory [Diagram: information theory and its links to probability, communication, statistics, economics, mathematics, computer science, and physics.]
Kullback-Leibler Information (a.k.a. distance between 2 models) If f(x) denotes reality, we can calculate the information lost when we use g(x) to approximate reality as: $I(f, g) = \int f(x) \ln\left( \frac{f(x)}{g(x \mid \theta)} \right) dx$. This number is the distance between reality and the model.
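A small discrete illustration of K-L information, where the integral becomes a sum over categories. The two distributions are made up; the point is only that the measure is zero when the model equals truth, positive otherwise, and not symmetric.

```python
import math

def kl_information(f, g):
    """I(f, g) = sum of f_i * ln(f_i / g_i) over categories with f_i > 0."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

truth = [0.5, 0.3, 0.2]    # hypothetical "reality" f
model = [0.4, 0.4, 0.2]    # hypothetical approximating model g

print(kl_information(truth, model))   # information lost using g for f
print(kl_information(model, truth))   # a different number: the distance is directed
```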
Interpretation of Kullback-Leibler Information (a.k.a. distance between 2 models) [Figure: three histograms of counts. Truth f(x): gamma. Approximations to truth g1(x) and g2(x): lognormal and Weibull panels.] Measures the (asymmetric) distance between two models. Minimizing the information lost when using g(x) to approximate f(x) is the same as maximizing the likelihood.
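Kullback-Leibler Information and Truth TRUTH IS A CONSTANT: $I(f, g) = \int f(x) \ln f(x) \, dx - \int f(x) \ln g(x \mid \theta) \, dx = C - E_f[\ln g(x \mid \theta)]$. Then, the relative directed distance between truth and model g is $I(f, g) - C = -E_f[\ln g(x \mid \theta)]$.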
Interpretation of Kullback-Leibler Information (a.k.a. distance between 2 models) Minimizing K-L information is the same as maximizing entropy. We want a model that does not respond to randomness but does respond to information. We maximize entropy subject to the constraints of the model used to capture information in the data. By maximizing entropy, subject to a constraint, we leave only the information supported by the data. The model does not respond to noise.
Akaike’s Information Criterion Akaike defined “an information criterion” that related K-L distance and the maximized log-likelihood as follows: $\mathrm{AIC} = -2 \ln L(\hat{\theta} \mid y) + 2K$. This is an estimate of the expected, relative distance between the fitted model and the unknown true mechanism that generated the observed data. K = number of estimable parameters.
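As a quick sketch, AIC (minus twice the maximized log-likelihood plus twice the number of estimable parameters K) is a one-line calculation once a model has been fitted. The function name and the numbers are hypothetical.

```python
def aic(max_loglik, k):
    """AIC = -2 * (maximized log-likelihood) + 2 * (number of estimable parameters)."""
    return -2.0 * max_loglik + 2.0 * k

# Hypothetical fit: maximized log-likelihood of -100 with 3 parameters.
print(aic(-100.0, 3))
```

Only differences in AIC between models fitted to the same data are meaningful; the absolute value carries no interpretation.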
Information and entropy (noise)
A refresher on Shannon’s diversity index You were already exposed to entropy theory when you looked at Shannon’s diversity index. Diversity is maximized when an individual has an equal probability of belonging to each species.
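A short check of that claim, using the usual form of the index, H = -sum(p_i ln p_i). The species proportions are made up.

```python
import math

def shannon_index(proportions):
    """Shannon diversity H = -sum(p * ln p) over species with p > 0."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

even = [0.25, 0.25, 0.25, 0.25]     # four equally likely species
uneven = [0.70, 0.10, 0.10, 0.10]   # one dominant species

print(shannon_index(even), shannon_index(uneven), math.log(4))
```

With S species, the even community reaches the maximum possible value, ln S.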
AIC and statistical entropy
Akaike’s Information Criterion $\mathrm{AIC} = -2 \ln L(\hat{\theta} \mid y) + 2K$ AIC has a built-in penalty for models with larger numbers of parameters. It provides an implicit tradeoff between bias and variance.
Akaike’s Information Criterion We select the model with the smallest value of AIC. This is the model “closest” to full reality from the set of models considered. Models not in the set are not considered. AIC will select the best model in the set, even if all the models are poor! It is the researcher’s (your) responsibility to ensure that the set of candidate models includes well-founded, realistic models.
Akaike’s Information Criterion $\mathrm{AIC} = -2 \ln L(\hat{\theta} \mid y) + 2K$ Estimates the expected, relative distance between the fitted model and the unknown true mechanism that generated the observed data. The best model is the one with the lowest AIC. K = number of estimable parameters. Built-in penalty for greater number of parameters.
AIC and small samples Unless the sample size (n) is large with respect to the number of estimated parameters (K), use of AICc is recommended. Generally, you should use AICc when the ratio n/K is small (less than 40). Use AIC or AICc consistently in an analysis rather than mixing the two criteria. Use the value of K for the global (most complicated) model.
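A sketch of the small-sample correction, assuming the commonly used form AICc = AIC + 2K(K + 1) / (n - K - 1). The function name and inputs are hypothetical.

```python
def aicc(max_loglik, k, n):
    """Small-sample corrected AIC; requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    aic_value = -2.0 * max_loglik + 2.0 * k
    return aic_value + (2.0 * k * (k + 1)) / (n - k - 1)

# Hypothetical fit: 50 trees, 3 parameters, so n/K is about 17 (less than 40).
print(aicc(-100.0, 3, 50))
```

The correction term shrinks toward zero as n grows, so AICc converges to AIC for large samples.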
Some Rough Rules of Thumb Differences in AIC ($\Delta_i$’s) can be used to interpret strength of evidence for one model vs. another. A model with a Δ value within 1-2 of the best model has substantial support and should be considered along with the best model. A model with a Δ value within 4-7 of the best model has considerably less support. A model with a Δ value > 10 has virtually no support and can be omitted from further consideration.
Akaike weights Akaike weights ($w_i$) are considered as the weight of evidence in favor of model i being the actual best model for the situation at hand, given that one of the R models must be the best model for that set of R models: $w_i = \frac{\exp(-\Delta_i / 2)}{\sum_{r=1}^{R} \exp(-\Delta_r / 2)}$, where $\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$. The Akaike weights for the full set of models considered add up to 1.
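A minimal sketch of the calculation: take each model's AIC difference from the best model, convert it to a relative likelihood exp(-delta / 2), and normalize. The AIC values are hypothetical.

```python
import math

def akaike_weights(aic_values):
    """Return (deltas, weights) for a set of candidate models."""
    best = min(aic_values)
    deltas = [a - best for a in aic_values]
    relative = [math.exp(-0.5 * d) for d in deltas]
    total = sum(relative)
    return deltas, [r / total for r in relative]

# Hypothetical AIC values for three candidate models.
deltas, weights = akaike_weights([210.0, 200.0, 202.0])
print(deltas)
print(weights)
```

Adding or removing a candidate model changes every weight, because the weights are normalized over the set.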
Uses of Akaike weights “Probability” that the candidate model is the best model. Relative strength of evidence (evidence ratios). Variable selection—which independent variable has the greatest influence? Model averaging.
An example
The Data: $x_i$ = measurements of DBH on 50 trees; $y_i$ = measurements of crown radius on those trees.
The Scientific Models:
$y_i = b x_i + e$ [1 parameter (b)]
$y_i = a + b x_i + e$ [2 parameters (a, b)]
$y_i = a + b x_i + \gamma x_i^2 + e$ [3 parameters (a, b, γ)]
The Probability Model: e is normally distributed, with mean = 0 and variance estimated from the observed variance of the residuals...
Back to the example Akaike weights can be interpreted as the estimated probability that model i is the best model for the data at hand, given the set of models considered. Weights > 0.90 indicate strong inferences can be made using just that model.
Evidence ratios Evidence ratios represent the evidence about fitted models as to which is better in an information sense. These ratios do not depend on the full set of models.
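A sketch of why the ratio does not depend on the full set, assuming the evidence ratio for model i versus model j is w_i / w_j. The normalizing sum in the weights cancels, leaving a function of the two AIC values alone. The AIC values are hypothetical.

```python
import math

def evidence_ratio(aic_i, aic_j):
    """Evidence for model i relative to model j; equals w_i / w_j."""
    return math.exp(-0.5 * (aic_i - aic_j))

# Hypothetical AIC values: the same pair gives the same ratio
# no matter which other models are in the candidate set.
print(evidence_ratio(200.0, 202.0))
print(evidence_ratio(200.0, 210.0))
```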
Strength of evidence: AIC ratios Very strong evidence that models 2 and 3 are better models than model 1, but the ratio of model 2 to model 3 is low, suggesting the data do not support strong inference.
Akaike weights and relative variable importance Estimates of the relative importance of predictor variables can be made by summing the Akaike weights ($w_i$) across all the models in which each variable occurs. Variables can be ranked using these sums. The larger this sum of weights, the more important the variable is.
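The summing step can be sketched as follows. The variable names, model set, and weights are all hypothetical.

```python
def variable_importance(model_variables, weights):
    """Sum Akaike weights over the models in which each variable occurs."""
    importance = {}
    for variables, w in zip(model_variables, weights):
        for name in variables:
            importance[name] = importance.get(name, 0.0) + w
    return importance

# Hypothetical candidate set: each entry lists the predictors in one model.
model_variables = [{"dbh"}, {"dbh", "light"}, {"light"}]
weights = [0.5, 0.3, 0.2]

importance = variable_importance(model_variables, weights)
print(sorted(importance.items(), key=lambda item: -item[1]))
```

The comparison is fairest when each variable appears in the same number of candidate models.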
MULTIMODEL INFERENCE AND MODEL AVERAGING
Ambivalence The inability to identify a single best model is not a defect of the AIC method. It is an indication that the data are not adequate to reach strong inference. What is to be done? Use multimodel inference and model averaging.
Strength of evidence: AIC ratios It is hard to choose between model 2 and model 3 because of the low value of the evidence ratio.
Multimodel Inference If one model is clearly the best ($w_i$ > 0.90), then inference can be made based on this best model. Weak strength of evidence in favor of one model suggests that a different dataset may support one of the alternate models. Designation of a single best model is often unsatisfactory because the “best” model is highly variable. We can compute a weighted estimate of the parameter using Akaike weights.
Akaike Weights and Multimodel Inference Estimate parameter values for the two most likely models. Estimate the weighted average of parameters across supported models. Only applicable to linear models (Jensen’s inequality). For non-linear models, we can average the predicted response value for given values of the predictor variables.
Akaike Weights and Multimodel Inference
Estimate of parameter A = (0.73*1.04) + (0.27*1.31) = 1.11
Estimate of parameter B = (0.73*2.1) + (0.27*1.2) = 1.86
Estimate of parameter C = (0.73*0) + (0.27*3) = 0.81
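The weighted averages on this slide can be reproduced with a few lines. The weights and estimates are the ones shown above; the function name is hypothetical, and a parameter absent from a model is entered as 0, as on the slide.

```python
def model_average(estimates, weights):
    """Akaike-weighted average of one parameter across models."""
    return sum(w * est for est, w in zip(estimates, weights))

weights = [0.73, 0.27]   # Akaike weights of the two supported models

a_hat = model_average([1.04, 1.31], weights)
b_hat = model_average([2.1, 1.2], weights)
c_hat = model_average([0.0, 3.0], weights)   # C does not appear in the first model

print(round(a_hat, 2), round(b_hat, 2), round(c_hat, 2))
```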
Model uncertainty Different datasets are likely to yield different parameter estimates. Variance around parameter estimates is calculated using the dataset at hand and is an underestimate of the true variance because it does not consider model uncertainty. Ideally, inferences should not be limited to one particular dataset. Can we make inferences that are applicable to a larger number of datasets?
Techniques to deal with model uncertainty Theoretical: Monte Carlo simulations. Empirical: bootstrapping. Use Akaike weights to calculate unbiased variance for the parameter estimates.
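One way the Akaike-weight approach is often written, offered here as an assumption rather than the method used in this course: each model's conditional variance is inflated by the squared distance between its estimate and the model-averaged estimate, and the results are combined with the weights. The result is usually called the unconditional standard error. The variances below are made up.

```python
import math

def unconditional_se(estimates, variances, weights):
    """Model-averaged estimate and a standard error that includes model uncertainty."""
    average = sum(w * est for est, w in zip(estimates, weights))
    se = sum(w * math.sqrt(var + (est - average) ** 2)
             for est, var, w in zip(estimates, variances, weights))
    return average, se

# Estimates and weights for parameter A from the example; variances are hypothetical.
average, se = unconditional_se([1.04, 1.31], [0.01, 0.04], [0.73, 0.27])
print(average, se)
```

When the models disagree about the parameter, this standard error is larger than the weighted average of the within-model standard errors, which is the sense in which the single-model variance is an underestimate.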
Summary: Steps in Model Selection Develop candidate models based on biological knowledge. Take observations (data) relevant to predictions of the model. Use data to obtain MLE of parameters. Evaluate evidence using AIC. Evaluate estimates of parameters relative to direct measurements. Are they reasonable and realistic?