Data Handling & Analysis
Polynomials and model fit
Andrew Jackson
Linear type data
How are two measures related?
What do we do about curvature?
Data are the number of species (Y) recorded for a given time spent looking for them (X)
Specifically, these come from fisheries data
Good proxy for species diversity in the marine habitat
Clearly a straight line won’t do
… the residuals are horrible
Polynomials
Polynomials are linear models (linear in their coefficients) that can show curvature
– Quadratics: $Y = b_0 + b_1 X + b_2 X^2$
– Cubics: $Y = b_0 + b_1 X + b_2 X^2 + b_3 X^3$
– 5th, 6th order polynomials etc…
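A minimal sketch of fitting such polynomials by ordinary least squares, using only the Python standard library. The function names `fit_polynomial` and `predict` and the small data set are made up for illustration; they are not the fisheries data or the software used for these slides.

```python
def fit_polynomial(x, y, order):
    """Least-squares fit of Y = b0 + b1*X + ... + b_order*X^order.

    Returns the coefficients [b0, b1, ..., b_order].
    """
    n = order + 1
    # Normal equations (A^T A) b = A^T y, where column j of A holds X**j.
    ata = [[sum(xi ** (i + j) for xi in x) for j in range(n)] for i in range(n)]
    aty = [sum((xi ** i) * yi for xi, yi in zip(x, y)) for i in range(n)]

    # Solve the augmented system by Gaussian elimination with partial pivoting.
    m = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]

    # Back-substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef


def predict(coef, x):
    """Evaluate the fitted polynomial at each value in x."""
    return [sum(b * xi ** j for j, b in enumerate(coef)) for xi in x]


# Hypothetical, noise-free quadratic data (not the slide data).
x = [1, 2, 3, 4, 5, 6]
y = [2 + 3 * xi - 0.5 * xi ** 2 for xi in x]

coef = fit_polynomial(x, y, order=2)
residuals = [yi - fi for yi, fi in zip(y, predict(coef, x))]
print([round(b, 6) for b in coef])
```

Setting `order=1` gives the straight line and `order=3` the cubic; the residuals are what gets plotted to judge each fit. Solving the normal equations directly is fine for low orders and small X values, but becomes numerically fragile for high-order polynomials.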
Quadratic model
Quadratic residuals
Better… but not so good at lower values of X
Try a more complicated model, like a cubic
Cubic model
Note the double curvature
Model appears to explain the lower values better
But how sure are we of the increase at higher values?
Cubic residuals
Better than the quadratic
But still over-estimating Y at the lowest values of X
Log transform the X variable
Model is
– Y~log(X)
Appears to explain the data very well across the full range
Check the residuals…
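A minimal sketch of the Y~log(X) model: transform X with the natural log, then fit an ordinary straight line to the transformed values. The helper names `fit_line` and `fit_log_model` and the data are hypothetical, chosen only to illustrate the idea.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_log_model(x, y):
    """Fit Y = b0 + b1 * log(X); X must be strictly positive."""
    return fit_line([math.log(xi) for xi in x], y)


# Hypothetical data that follow Y = 1 + 2*log(X) exactly (not the slide data).
x = [1, 2, 4, 8, 16]
y = [1 + 2 * math.log(xi) for xi in x]

b0, b1 = fit_log_model(x, y)
residuals = [yi - (b0 + b1 * math.log(xi)) for xi, yi in zip(x, y)]
print(round(b0, 6), round(b1, 6))
```

The model is still a first order polynomial with just two coefficients; only the X axis has been rescaled.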
Y~log(X) residuals
Now these look pretty near perfect
The null model
Consists of a mean and a variance only
It gives us a benchmark against which we can test our models that include more information
If we can’t do better than the null model then we don’t understand our data or system!
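A minimal sketch of the null model, assuming it is summarised by the sample mean and the sample variance (n - 1 denominator). The function name `null_model` and the numbers are made up for illustration.

```python
def null_model(y):
    """Return (mean, sample variance, residuals) for an intercept-only model."""
    n = len(y)
    mean = sum(y) / n
    residuals = [yi - mean for yi in y]
    variance = sum(r * r for r in residuals) / (n - 1)
    return mean, variance, residuals


# Hypothetical response values (not the slide data).
y = [2, 4, 6, 8]
mean, variance, residuals = null_model(y)
print(mean, variance, residuals)
```

Every prediction from this model is the same mean value, so its residuals simply show how far each observation sits from the average.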
Residuals of the null model
Choosing between alternative models
We now have a choice between 5 models
– Null model (zero order polynomial with an intercept only, i.e. just a mean and variance model)
– Straight line (first order polynomial)
– Quadratic (second order polynomial)
– Cubic (third order polynomial)
– First order polynomial with log(X)
How do we select which one to use?
– Higher order polynomials require more parameters
Parsimony as a central tenet
Parsimony is the application of the simplest explanation for a phenomenon, and it underpins all of science
So we need to pick the model that
– Fits the data the best, and …
– Uses the fewest parameters
Likelihood of data
AIC for model selection
We will use Akaike’s Information Criterion (AIC) to select the most suitable model
$\mathrm{AIC} = -2\log(\mathrm{likelihood}) + 2k$
– Log-likelihood gets bigger the better the fit
– k is the number of parameters in the model
Lower AIC = more suitable model
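A minimal sketch of the AIC calculation from a model's residuals. It assumes normally distributed errors with the variance estimated by maximum likelihood (RSS / n), and it counts that variance as one of the k parameters; both choices are assumptions of this example, not something the slides specify. The residual values are invented.

```python
import math


def gaussian_log_likelihood(residuals):
    """Maximised log-likelihood assuming Normal errors with variance RSS / n."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_likelihood, k):
    """AIC = -2 * log(likelihood) + 2k."""
    return -2 * log_likelihood + 2 * k


# Hypothetical residuals from two competing models fitted to the same Y.
loose_fit = [1.0, -1.0, 1.0, -1.0]   # e.g. a null model: k = 2 (mean, variance)
tight_fit = [0.1, -0.1, 0.1, -0.1]   # e.g. a straight line: k = 3

aic_loose = aic(gaussian_log_likelihood(loose_fit), k=2)
aic_tight = aic(gaussian_log_likelihood(tight_fit), k=3)
print(aic_loose, aic_tight)
```

The tighter fit raises the log-likelihood by more than the extra parameter costs in the 2k penalty, so its AIC is lower. If the improvement in fit were small, the penalty would win and the simpler model would be preferred.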
AIC of our models
[Table of AIC values for the null model, straight line, quadratic, cubic, three higher-order polynomials and log(X); the only value that survives is -77.7]
So the log(X) model is the best in this case
Note that adding more orders to the polynomials ceases to confer any benefit after 5th order
Also… these get increasingly difficult to explain and relate to biological phenomena
Conclusions
AIC provides an objective way to compare alternative models
Lower AIC indicates a more parsimonious model
Must only compare AIC on models of the exact same response variable
Only provides a relative, not absolute, indication of model fit
– Still need to check that the model is any good
– Residuals etc…