Correct statistics in ecological research

Han Y. H. Chen Lakehead University Canada

Selecting the appropriate inferential statistics
Generalized Linear Model (GLM) as a dominant method in statistics

lm includes all except path analysis
# Install (once) and load the packages used in the examples
install.packages("glm2")
library(glm2)
install.packages("lme4")
library(lme4)
install.packages("ggplot2")
library(ggplot2)

Read file into R
mydata <- read.csv(file.choose())
attach(mydata)
summary(mydata)
edit(mydata)

What are the assumptions for the general linear model (lm)?
Both for regression and ANOVA:
Independence of observations (cannot be met in almost all situations!)
Normality: the distribution of the residuals is normal
Equality (homogeneity) of variances: the variance of the data in groups (or along the x gradient) is the same
Unit-treatment additivity: cannot be directly falsified
Additional for regression:
Linearity
Why is this assumption not applicable in ANOVA?
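The normality and equal-variance assumptions above can be checked on a fitted model roughly as follows. This is a minimal sketch on simulated data: the data frame `d` and its columns `ForestType` and `AGB` are hypothetical and are not the dataset used in the slides.

```r
# Simulated data (hypothetical): biomass for three forest types
set.seed(1)
d <- data.frame(
  ForestType = factor(rep(c("Conifer", "Mixed", "Broadleaf"), each = 30)),
  AGB        = c(rnorm(30, 50, 5), rnorm(30, 60, 5), rnorm(30, 70, 5))
)

fit <- lm(AGB ~ ForestType, data = d)

# Normality is assessed on the residuals, not on the raw response
sw <- shapiro.test(residuals(fit))
sw

# Homogeneity of variances across the groups
bt <- bartlett.test(AGB ~ ForestType, data = d)
bt

# Graphical checks: residuals vs fitted values, and normal Q-Q plot
par(mfrow = c(1, 2))
plot(fit, which = 1:2)
```

Large p-values here only mean that the tests found no evidence against the assumptions; the plots are usually more informative than the tests.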

What if some assumptions cannot be met?
If normality and homogeneous variance cannot be met: use Generalized Linear Models (GLM)
If linearity cannot be met: use Generalized Non-linear Models (GNM)

6 Steps to verify assumptions
1. Correct specification of distribution

Family               Link
Normal (Gaussian)    identity
Binomial             logit, probit or cloglog
Poisson              log, identity or sqrt
Gamma                inverse, identity or log
inverse.gaussian     1/mu^2
Quasi                user-defined

What is the distribution for species richness?
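As an illustration of specifying family and link, the sketch below fits a count response with a Poisson family and log link, and compares it with a Gaussian fit. It assumes that species richness is treated as a count, which is a common choice rather than something the slides state; the variables `age` and `richness` are simulated and hypothetical.

```r
set.seed(42)
n   <- 100
age <- runif(n, 5, 200)                                  # hypothetical stand age (years)
richness <- rpois(n, lambda = exp(1 + 0.3 * log(age)))   # simulated species counts

# Poisson family with log link: the variance is assumed equal to the mean
m_pois <- glm(richness ~ log(age), family = poisson(link = "log"))

# Gaussian family with identity link: equivalent to lm()
m_norm <- glm(richness ~ log(age), family = gaussian(link = "identity"))

summary(m_pois)
AIC(m_norm, m_pois)
```

The AIC comparison is meaningful here because both models are fitted to the same untransformed response.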

Types of distributions
Raw data is almost never as well behaved as we would like it to be
Generalized extreme value distribution

A good starting place to learn R graphics

Normality

Practical example to check distributions
# Verify normal distribution
hist(AGBw, 50)
qqnorm(AGBw)
shapiro.test(AGBw)
# Does log-transformation make it better?
logAGB <- log(AGBw + 15)
hist(logAGB, 50)
qqnorm(logAGB)
shapiro.test(logAGB)
# Square-root transformation?
rAGB <- (AGBw + 15)^0.5
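The same comparison of transformations can be run side by side. The sketch below uses simulated right-skewed values in place of `AGBw`; the vector `x` and the list `candidates` are hypothetical names introduced for illustration.

```r
set.seed(7)
x <- rlnorm(200, meanlog = 3, sdlog = 0.6)   # simulated right-skewed variable

candidates <- list(raw = x, log = log(x), sqrt = sqrt(x))

# Shapiro-Wilk W statistic and p-value for each candidate transformation
res <- t(sapply(candidates, function(v) {
  s <- shapiro.test(v)
  c(W = unname(s$statistic), p = s$p.value)
}))
res

# Q-Q plots, one panel per transformation
par(mfrow = c(1, 3))
for (nm in names(candidates)) {
  qqnorm(candidates[[nm]], main = nm)
  qqline(candidates[[nm]])
}
```

A W statistic closer to 1 indicates a distribution closer to normal; with real data none of the candidates may be satisfactory, which is one motivation for moving to a GLM.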

2. Correct form for the explanatory variable x?
Categorical or numerical?
Linear or nonlinear?

Non-linearity

Check non-linearity and homogenous variances
# Untransformed forest age: linear fit (red) against the default smoother
a <- ggplot(mydata, aes(y = AGBw, x = SA)) +
  ylab(expression(paste(Aboveground~biomass~(Mg~ha^-1~year^-1)))) +
  xlab(expression(paste(Forest~age~(years))))
a + stat_smooth(method = "lm", size = 1.5, alpha = 0.9, colour = 2) +
  geom_point(shape = 1, size = 1) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +
  stat_smooth(size = 1, alpha = 0.5)

# Log-transformed forest age: same two fits
c <- ggplot(mydata, aes(y = AGBw, x = log(SA))) +
  ylab(expression(paste(Aboveground~biomass~(Mg~ha^-1~year^-1)))) +
  xlab(expression(paste(ln~(Forest~age~(years)))))
c + stat_smooth(method = "lm", size = 1.5, alpha = 0.9, colour = 2) +
  geom_point(shape = 1, size = 1) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), panel.grid.major = element_blank()) +
  stat_smooth(size = 1, alpha = 0.5)

Transform x to make the relationship linear; start with polynomials.
If that still does not work, go with gnm.
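One way to judge whether a transformation or polynomial term is needed is to compare the candidate models directly. This is a sketch on simulated data in which the true relationship is logarithmic; `SA` and `AGB` here are simulated stand-ins, not the course dataset.

```r
set.seed(3)
SA  <- runif(150, 5, 200)                  # hypothetical stand age (years)
AGB <- 20 * log(SA) + rnorm(150, sd = 8)   # simulated response, curved in SA
d   <- data.frame(SA, AGB)

m_lin  <- lm(AGB ~ SA, data = d)            # straight line
m_quad <- lm(AGB ~ poly(SA, 2), data = d)   # second-order polynomial
m_log  <- lm(AGB ~ log(SA), data = d)       # log-transformed x

# Lower AIC indicates a better trade-off between fit and complexity
AIC(m_lin, m_quad, m_log)

# F-test for the added quadratic term (the two models are nested)
anova(m_lin, m_quad)
```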

3. Statistical independence of the n observations
All default models assume a “completely randomized experimental or sampling design”, but in reality this assumption can hardly be met in ecological data. This is why we learn Fundamentals of Experimental Design.
Is there a hierarchical structure in your data?
Example 1: Influences of stand type (S) and soil layer (L) on nitrogen concentration? A typical example of a nested design:
Y_{ijk} = S_i + L_{j(i)} + e_{k(ij)}
What if you have repeated measures over the growing season?
Example: random plot effect in repeated measures
# Random intercept for each plot
m1 <- lmer(AGBw ~ log(SA) + (1 | Plot), data = mydata)
# Random intercept and random slope of log(SA) for each plot
m11 <- lmer(AGBw ~ log(SA) + (log(SA) | Plot), data = mydata)
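To see what the random plot effect does, the sketch below simulates repeated measurements within plots and fits the model with and without the grouping term. It assumes the lme4 package is installed; the data frame `sim` and the effect sizes are invented for illustration.

```r
library(lme4)

set.seed(11)
n_plot <- 20   # number of plots
n_rep  <- 5    # repeated measurements per plot

Plot <- factor(rep(seq_len(n_plot), each = n_rep))
# Each plot has a starting age and is re-measured every 5 years
SA <- rep(runif(n_plot, 10, 100), each = n_rep) +
  rep(0:(n_rep - 1) * 5, times = n_plot)
plot_eff <- rep(rnorm(n_plot, sd = 6), each = n_rep)   # plot-level deviation
AGBw <- 10 + 15 * log(SA) + plot_eff + rnorm(n_plot * n_rep, sd = 3)
sim  <- data.frame(Plot, SA, AGBw)

# Ignores the grouping: treats all 100 rows as independent
m_lm <- lm(AGBw ~ log(SA), data = sim)

# Random intercept per plot accounts for the repeated measures
m_ri <- lmer(AGBw ~ log(SA) + (1 | Plot), data = sim)

VarCorr(m_ri)                  # between-plot and residual variation
summary(m_lm)$coefficients
summary(m_ri)$coefficients
```

Comparing the two coefficient tables shows how the standard error of the slope changes once the non-independence within plots is modelled.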

3. Statistical independence of the n observations –clustering
Is your sampling/experiment pseudo-replicated?
Is there spatial auto-correlation? Suitable tests are the Mantel test and Moran's I.
Note: Our Spruce Forest plots are spatially auto-correlated and therefore open to the criticism of pseudo-replication! But this is a dilemma for studies of natural experiments.
Nevertheless, researchers should be ready to defend this.
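Moran's I can be computed directly from its definition, which makes clear what the test measures. The sketch below is a hand-written base R version with inverse-distance weights and a permutation test, run on simulated coordinates; `morans_i` is a hypothetical helper, not a function from any package, and dedicated packages offer more complete implementations.

```r
# Moran's I for values x and a spatial weight matrix w (zero diagonal)
morans_i <- function(x, w) {
  z <- x - mean(x)
  n <- length(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

set.seed(5)
n      <- 60
coords <- cbind(runif(n), runif(n))   # hypothetical plot coordinates
w      <- 1 / as.matrix(dist(coords)) # inverse-distance weights
diag(w) <- 0                          # a plot is not its own neighbour

# Simulated response with a spatial trend, so nearby plots are similar
y <- 3 * coords[, 1] + rnorm(n, sd = 0.5)

obs <- morans_i(y, w)

# Permutation test: shuffle the values over the locations
perm  <- replicate(999, morans_i(sample(y), w))
p_val <- (sum(perm >= obs) + 1) / (length(perm) + 1)

c(I = obs, expected = -1 / (n - 1), p = p_val)
```

An observed I well above its expectation of -1/(n - 1) indicates positive spatial autocorrelation, meaning the plots do not provide n independent observations.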

4. Correct specification of the variance function v
lm: ordinary least squares
GLM: maximum likelihood
If you have properly specified the distribution, this will take care of the homogeneous variance assumption!

5. Overdispersion and underdispersion
Overdispersion: the presence of greater variability (statistical dispersion) in a data set than would be expected based on a given simple statistical model.
Cause: the model is too simple.
Underdispersion: there is less variation in the data than the model predicts.
Cause: the model is too complex; the data are overfitted.
Practical solution: use AIC or BIC to select the “best” model.
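A quick numerical check for overdispersion in a Poisson GLM is the Pearson dispersion statistic, which should be close to 1 when the assumed variance holds. The sketch below uses simulated counts with deliberately inflated variance; all variable names are hypothetical.

```r
set.seed(9)
n  <- 200
x  <- runif(n)
mu <- exp(1 + 1.2 * x)
y  <- rnbinom(n, mu = mu, size = 1.5)   # counts with extra-Poisson variation

m_pois <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: about 1 if the Poisson variance is adequate,
# clearly above 1 under overdispersion, below 1 under underdispersion
disp <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
disp

# The quasi-Poisson family estimates the same quantity and rescales the SEs
m_quasi <- glm(y ~ x, family = quasipoisson)
summary(m_quasi)$dispersion

# AIC / BIC to compare candidate model structures
m_null <- glm(y ~ 1, family = poisson)
AIC(m_null, m_pois)
BIC(m_null, m_pois)
```

Note that quasi-families have no likelihood, so AIC is not defined for `m_quasi`; the information criteria are compared among the likelihood-based fits only.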

6. Lack of undue influence of individual observations on the fit
Deletion diagnostics: the influence of an individual observation on a GLM fit is measured by the change in the estimated coefficients/parameters (mean, intercept, or slope) when that observation is deleted.
You can examine how individual observations affect your model:
m2 <- lm(AGBw ~ SA * ForestType, data = mydata)
# Standardized residuals, using the influence measures of the fitted model
rstandard(m2, infl = lm.influence(m2, do.coef = FALSE),
          sd = sqrt(deviance(m2) / df.residual(m2)))
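Base R also provides deletion diagnostics that report the change in coefficients directly. The sketch below plants one aberrant, high-leverage point in simulated data and recovers it with `cooks.distance()` and `dfbeta()`; the data and the 4/n cut-off are illustrative assumptions, the latter being a common rule of thumb rather than a formal test.

```r
set.seed(2)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)
x[n] <- 25; y[n] <- 0        # one deliberately aberrant, high-leverage point

fit <- lm(y ~ x)

cd <- cooks.distance(fit)    # overall influence of each observation
db <- dfbeta(fit)            # change in each coefficient when obs i is deleted

which.max(cd)                # most influential observation
head(db[order(-cd), ], 3)    # coefficient changes for the top three

# Rule of thumb (a convention, not a test): flag Cook's distance > 4/n
which(cd > 4 / n)
```

Refitting the model without the flagged observations and comparing the coefficients shows whether the conclusions depend on them.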

Take home message
Simple statistical analysis is best if it meets all assumptions, but ecological reality is more complex.
Simple questions have been studied for so long that they leave no niche for novelty or discovery.
Statistics without verified assumptions are not reliable and do not serve their purpose.
