Lecture 5 Model Evaluation


1 Lecture 5 Model Evaluation
C. D. Canham, Likelihood Methods in Ecology, April 2011, Granada, Spain

2 Elements of Model Evaluation
Goodness of fit
Prediction error
Bias
Outliers and patterns in residuals

3 Assessing Goodness of Fit for Continuous Data
Visual methods: don't underestimate the power of your eyes, but eyes can deceive, too...
Quantification: a variety of traditional measures, all with some limitations...
A good review: C. D. Schunn and D. Wallach. Evaluating Goodness-of-Fit in Comparison of Models to Data.

4 Traditional inferential tests masquerading as GOF measures
The χ² "goodness of fit" statistic. For categorical data only, it can be used as a test statistic: what is the probability of the observed results, given the model? The test can only be used to reject a model; if the model is not rejected, the statistic contains no information on how good the fit is. Thus, it is really a badness-of-fit statistic.
Other limitations as a measure of goodness of fit: it rewards sloppy research if you are actually trying to "test" a real model (as a null hypothesis), because small sample size and noisy data will limit the power to reject the null hypothesis.
Just to dispense with this at the outset: the traditional chi-squared test is frequently referred to as a "goodness of fit statistic". This is unfortunate terminology when it refers to the traditional use of the statistic to test patterns of association in categorical data (its most common use as a statistical test). On the other hand, there are "deviance" measures closely related to chi-squared that can be used with continuous variables as a measure of goodness of fit. Deviance will be explained in a bit...
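A minimal sketch of this point in Python, using scipy.stats.chisquare on made-up categorical counts (the counts and expected frequencies are illustrative assumptions, not data from the lecture):

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts of individuals in four habitat categories,
# and the counts a candidate model predicts for the same total sample.
observed = np.array([18, 22, 31, 29])
expected = np.array([25, 25, 25, 25])

stat, p = chisquare(f_obs=observed, f_exp=expected)

# A large p-value only means we fail to reject the model; it says nothing
# about how *good* the fit is (a small, noisy sample will "pass" almost any model).
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```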

5 Visual evaluation for continuous data
Graphing observed vs. predicted values...
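A minimal sketch of such a plot, with a 1:1 reference line (the data here are simulated placeholders, not the Date Creek data shown on the next slide):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulated stand-in for model output: predictions plus multiplicative noise.
predicted = rng.uniform(1, 10, size=100)
observed = predicted * rng.lognormal(mean=0.0, sigma=0.3, size=100)

fig, ax = plt.subplots()
ax.scatter(predicted, observed, s=15)

# 1:1 line: points above it are underpredicted, points below it are overpredicted.
lims = [0, max(observed.max(), predicted.max())]
ax.plot(lims, lims, "k--", label="1:1 line")
ax.set_xlabel("Predicted")
ax.set_ylabel("Observed")
ax.legend()
plt.show()
```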

6 Examples
Goodness of fit of neighborhood models of canopy tree growth for 2 species at Date Creek, BC. [Figure: observed vs. predicted growth.] Source: Canham, C. D., P. T. LePage, and K. D. Coates. A neighborhood analysis of canopy tree competition: effects of shading versus crowding. Canadian Journal of Forest Research.

7 Goodness of Fit vs. Bias
These hypothetical data illustrate the differences between goodness of fit and bias. [Figure: observed vs. predicted, with the 1:1 line for reference.]

8 R2 as a measure of goodness of fit
R2 = proportion of variance* explained by the model, relative to the variance around the simple mean of the data:
R2 = 1 - SSE/SST, where SSE = Σi (obsi - expi)², SST = Σi (obsi - mean(obs))², expi is the expected value of observation i given the model, and mean(obs) is the overall mean of the observations.
R2 has many desirable properties as a measure of goodness of fit, but note that it is NOT bounded between 0 and 1: it can be negative (when SSE > SST), which implies that the grand mean of the observed data is actually a better fit than your estimated model.
* This interpretation of R2 is technically only valid for data where SSE is an appropriate estimate of variance (e.g. normally distributed data).
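A minimal sketch of this calculation (the toy observation and prediction values are assumptions for illustration only):

```python
import numpy as np

def r_squared(obs, pred):
    """R2 = 1 - SSE/SST; negative when the model fits worse than the grand mean."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    sse = np.sum((obs - pred) ** 2)
    sst = np.sum((obs - obs.mean()) ** 2)
    return 1.0 - sse / sst

obs = np.array([2.0, 4.0, 6.0, 8.0])
print(r_squared(obs, pred=np.array([2.1, 3.8, 6.3, 7.9])))  # close to 1
print(r_squared(obs, pred=np.array([8.0, 2.0, 9.0, 1.0])))  # negative: worse than the mean
```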

9 R2 – when is the mean the mean?
Clark et al. (1998) Ecological Monographs 68:220. For i = 1..N observations in j = 1..S sites, this approach uses the SITE means, rather than the overall mean, to calculate R2. In effect, it calculates R2 as the percent of variance explained by your model, GIVEN the effects of site (whatever they may be).
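A plausible rendering of that site-level statistic in LaTeX; the notation is my reconstruction from the slide's verbal description, not the formula as printed in Clark et al.:

```latex
% R^2 against site means rather than the grand mean:
% obs_{ij} = observation i in site j, exp_{ij} = model prediction for it,
% \overline{obs}_j = mean of the observations in site j.
R^2_{\mathrm{site}} = 1 -
  \frac{\sum_{j=1}^{S}\sum_{i=1}^{N_j} (obs_{ij} - exp_{ij})^2}
       {\sum_{j=1}^{S}\sum_{i=1}^{N_j} (obs_{ij} - \overline{obs}_j)^2}
```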

10 r2 as a measure of goodness of fit
r2 = the squared correlation (r) between observed and predicted values. NOTE: r lies between -1 and 1, so r2 is bounded between 0 and 1 (unlike R2).

11 R2 vs r2
Is this a good fit (r2 = 0.81) or a really lousy fit (R2 = -0.39)? It is undoubtedly biased... In this example, the model fits the general trend of the data, and the scatter is not that bad, but the model consistently underestimates the observations: the observations are 1.5 times larger than predicted, plus there is a systematic (additive) bias of ~5 units. Nonetheless, if you take the bias into account (i.e. if you define fit independently of bias), you could consider the fit reasonably good...
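A small simulation makes the contrast concrete; the numbers are invented to mimic the slide's description (observations roughly 1.5 times the predictions plus ~5 units), not the data behind the quoted 0.81 and -0.39:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data mimicking the slide: observations are ~1.5x the
# predictions plus an additive offset of about 5 units, with some scatter.
predicted = rng.uniform(2, 12, size=50)
observed = 1.5 * predicted + 5 + rng.normal(0, 1.5, size=50)

sse = np.sum((observed - predicted) ** 2)
sst = np.sum((observed - observed.mean()) ** 2)
R2 = 1 - sse / sst                                  # heavily penalized by the bias
r2 = np.corrcoef(observed, predicted)[0, 1] ** 2    # blind to the bias

print(f"R2 = {R2:.2f}, r2 = {r2:.2f}")  # R2 is negative, r2 is high
```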

12 A note about notation...
Check the documentation when a package reports "R2" or "r2"; don't assume the terms are used as I have used them... Sample Excel output using the "trendline" option for a chart: the "R2" value of 0.89 reported by Excel is actually r2, while the true R2 is 0.21. Excel appears to calculate r2 when an intercept is included (this makes sense in general, just not for evaluating observed vs. predicted plots); if you specify no intercept, it appears to calculate true R2.

13 R2 vs. r2 for goodness of fit
When there is no bias, the two measures will be almost identical (but I prefer R2, in principle). When there is bias, R2 will be low to negative, but r2 will indicate how good the fit could be after taking the bias into account...

14 Sensitivity of R2 and r2 to data range
I generated 40 data points with a normal distribution of errors, and then selected subsets of the data points to display in the lower two graphs...

15 The Tyranny of R2 (and r2)
Limitations of R2 (and r2) as measures of goodness of fit:
They are not absolute measures (as frequently assumed), particularly when the variance of the appropriate PDF is NOT independent of the mean (expected) value, e.g. the lognormal, gamma, or Poisson.
For PDFs where variance is an explicit function of the mean, you can get the paradoxical result that goodness of fit is higher for datasets with a lower range of values of the dependent variable (i.e. if growth is lognormally distributed, then designing the study to sample under conditions that generate lower growth rates could conceivably produce higher R2 values...).

16 Gamma Distributed Data...
The variance of the gamma distribution increases as the square of the mean...
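For reference (an addition, not from the slide): under the usual shape-scale parameterization,

```latex
% Gamma(k, \theta) with shape k and scale \theta:
\mu = k\theta, \qquad \sigma^{2} = k\theta^{2} = \mu^{2}/k
% so for a fixed shape parameter, the variance grows as the square of the mean.
```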

17 So, how good is good?
Our assessment is ALWAYS subjective, because of:
the complexity of the process being studied, and
the sources of noise in the data.
From a likelihood perspective, should you ever expect R2 = 1?

18 Other Goodness of Fit Issues...
In complex models, a good fit may be due to the overwhelming effect of one variable...
The best-fitting model may not be the most "general": the fit can be improved by adding terms that account for unique variability in a specific dataset but limit applicability to other datasets (the curse of ad hoc multiple regression models...).

19 How good is good: deviance
Deviance compares your model to a "full" model, given the probability model. For i = 1..n observations, a vector X of observed data (xi), and a vector θ of j = 1..m parameters (θj), define a "full" model with n parameters θi = xi (θfull). Then:
Deviance = 2 [ ln L(θfull | X) - ln L(θ | X) ]    (Nelder and Wedderburn 1972)
The derivation comes from the assertion that the ratio of the likelihoods provides a measure of the goodness of fit of the model of interest relative to a full model with n parameters. Given that twice the difference in log-likelihood is chi-square distributed, the convention is to multiply the difference in log-likelihood by 2. In effect, this statistic is very similar to the AIC and likelihood ratio tests used to compare candidate models. The distinction in this case is the use of the full model, plus the probability model (presumably using the estimate of variance calculated from the sample data evaluated with the candidate model). In effect, this presumes that the variance estimated from the residuals of the candidate model reflects the underlying uncertainty in the process...

20 Deviance for normally-distributed data
The log-likelihood of the full model is a function of both sample size (n) and variance (σ²); for normally distributed data it reduces to ln L(full) = -(n/2) ln(2πσ²), since the residual term vanishes when the predictions equal the observations. Even if your predictions (expected values) exactly match your observations, a likelihood method still assumes that some underlying PDF generated the observations, and that there was a finite probability (< 1) of observing a value equal to the expected value of the function... Therefore, deviance is NOT an absolute measure of goodness of fit... But it does establish a standard of comparison (the full model), given your sample size and your estimate of the underlying variance...
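A minimal numerical sketch of that comparison for normally distributed data (the observations and candidate predictions are invented; σ² is estimated from the candidate model's residuals, as the notes describe):

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented example: observations and an imperfect candidate model's predictions.
obs = rng.normal(loc=10.0, scale=2.0, size=30)
pred = obs + rng.normal(0.0, 1.0, size=30)

sigma2 = np.mean((obs - pred) ** 2)   # variance estimated from the candidate's residuals

def normal_loglik(obs, pred, sigma2):
    n = len(obs)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((obs - pred) ** 2) / (2 * sigma2)

loglik_model = normal_loglik(obs, pred, sigma2)
loglik_full = normal_loglik(obs, obs, sigma2)   # "full" model: predictions equal observations
deviance = 2 * (loglik_full - loglik_model)

print(f"lnL(full) = {loglik_full:.2f}, lnL(model) = {loglik_model:.2f}, deviance = {deviance:.2f}")
```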

21 Forms of Bias
Proportional bias (slope ≠ 1)
Systematic bias (intercept ≠ 0)

22 “Learn from your mistakes” (Examine your residuals...)
Residual = observed - predicted
Basic questions to ask of your residuals:
Do they fit the assumed PDF?
Are they correlated with factors that aren't in the model (but maybe should be)?
Do some subsets of your data fit better than others?
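A minimal sketch of those three checks (the data, the "depth" covariate, and the normality assumption are all placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)

# Placeholder data: predictions, observations, and a covariate ("depth")
# that is not in the model but might explain some of the misfit.
pred = rng.uniform(1, 10, size=200)
depth = rng.uniform(0, 30, size=200)
obs = pred + 0.05 * depth + rng.normal(0, 1, size=200)
resid = obs - pred

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# 1. Do the residuals fit the assumed PDF (here, a normal Q-Q plot)?
stats.probplot(resid, dist="norm", plot=axes[0])

# 2. Are they correlated with a factor that isn't in the model?
axes[1].scatter(depth, resid, s=10)
axes[1].axhline(0, color="k", ls="--")
axes[1].set_xlabel("depth (not in model)")
axes[1].set_ylabel("residual")

# 3. Do some subsets of the data fit better than others?
axes[2].boxplot([resid[depth < 15], resid[depth >= 15]])
axes[2].set_xticklabels(["shallow", "deep"])
axes[2].set_ylabel("residual")

plt.tight_layout()
plt.show()
```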

23 Using Residuals to Calculate Prediction Error
RMSE (root mean squared error): RMSE = sqrt[ Σi (obsi - expi)² / n ], i.e. the standard deviation of the residuals.
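In code (a trivial sketch; the observed and predicted values are placeholders):

```python
import numpy as np

def rmse(obs, pred):
    """Root mean squared error: the typical size of a residual."""
    resid = np.asarray(obs, dtype=float) - np.asarray(pred, dtype=float)
    return np.sqrt(np.mean(resid ** 2))

print(rmse([2.0, 4.0, 6.0], [2.5, 3.5, 6.5]))  # 0.5
```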

24 Predicting lake chemistry from spatially-explicit watershed data
At steady state (from the mass-balance equation), concentration, lake volume, and flushing rate are observed, while inputs and in-lake decay are estimated.

25 Predicting iron concentrations in Adirondack lakes
Results from a spatially explicit, mass-balance model of the effects of watershed composition on lake chemistry. In this example, I calculated residuals as (observed - expected). The overall fit is not bad, but there is a set of points with strong residuals... Source: Maranger et al. (2006)

26 Should we incorporate lake depth?
Shallow lakes are more unpredictable than deeper lakes.
The model consistently underestimates Fe concentrations in deeper lakes.

27 Adding lake depth improves the model...
R2 went from 56% to 65%. It is just as important that it made sense to add depth...

28 But shallow lakes are still a problem...

29 Summary – Model Evaluation
There are no silver bullets... (and the issues are even muddier for categorical data).
An increase in goodness of fit does not necessarily result in an increase in knowledge...
Increasing goodness of fit reduces uncertainty in the predictions of the models, but this costs money (more and better data): how much are you willing to spend?
The "signal to noise" issue: if you can see the signal through the noise, how far are you willing to go to reduce the noise?

