A protocol for data exploration to avoid common statistical problems Zuur et al. 2010. Methods in Ecology and Evolution 2010, 1, 3–14 doi: 10.1111/j.2041-210X.2009.00001.x Presented by Han Y. H. Chen
Selecting the appropriate inferential statistics: Generalized Linear Model (GLM) as a dominant method in statistics
Simple statistics: one-way ANOVA, followed by post-hoc comparison (example: http://flash.lakeheadu.ca/~hchen/R/RootLDD.R); simple regression (by example); ANCOVA; MANOVA
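A minimal sketch of a one-way ANOVA followed by a post-hoc comparison, using base R only. The data are simulated and the names (d, trt, y) are hypothetical; this is not the content of the linked RootLDD.R script.

```r
set.seed(9)
# Hypothetical data: three treatments, 20 replicates each; only C differs in mean
d <- data.frame(
  trt = factor(rep(c("A", "B", "C"), each = 20)),
  y   = c(rnorm(20, mean = 10), rnorm(20, mean = 10), rnorm(20, mean = 13))
)

fit <- aov(y ~ trt, data = d)   # one-way ANOVA
summary(fit)                    # overall F-test for a treatment effect

tuk <- TukeyHSD(fit)            # post-hoc: all pairwise differences
tuk$trt                         # columns: diff, lwr, upr, p adj
```

The ANOVA only says that at least one mean differs; the Tukey table says which pairs.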
Step 1. Are there outliers in Y and X? An outlier is an observation that has a relatively large or small value compared to the majority of observations. Boxplot as the tool of detection. Simple commands in R: boxplot(y) and dotchart(y) (Cleveland dotplot)
Multi‐panel Cleveland dotplot for all of the morphometric variables measured
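A small sketch of the two outlier plots on simulated data. The variable (wing length) and the inserted data-entry error are assumptions made for illustration; dotchart() is base R's Cleveland dotplot function.

```r
set.seed(1)
# Hypothetical data: 50 wing lengths (mm) plus one suspicious value appended
y <- c(rnorm(50, mean = 60, sd = 2), 85)

# Boxplot: points beyond 1.5 x IQR from the box are drawn individually
bp <- boxplot(y, ylab = "Wing length (mm)")
bp$out   # values flagged by the boxplot rule

# Cleveland dotplot: value on the x-axis, order of the data on the y-axis
dotchart(y, xlab = "Wing length (mm)", ylab = "Order of the data")
```

A flagged point is a prompt to check the record, not an automatic reason to delete it.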
Step 2: Do we have homogeneity of variance? Homogeneity of variance is an important assumption in analysis of variance (ANOVA), other regression‐related models and in multivariate techniques like discriminant analysis. The solution: transformation of the response variable to stabilize the variance, or applying statistical techniques that do not require homogeneity (generalized least squares)
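One possible way to look at homogeneity of variance, sketched on simulated data with hypothetical names (d, group, y). The graphical checks are the main tool; the Bartlett test is added here only as a numerical companion and is not mentioned in the slides.

```r
set.seed(2)
# Hypothetical data: three groups, the third with a much larger spread
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  y     = c(rnorm(30, 10, 1), rnorm(30, 12, 1), rnorm(30, 14, 5))
)

# Conditional boxplots: compare the spread of the boxes across groups
boxplot(y ~ group, data = d)

# Residuals versus fitted values: the vertical spread should be similar everywhere
fit <- lm(y ~ group, data = d)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Optional formal test of equal variances (sensitive to non-normality)
bt <- bartlett.test(y ~ group, data = d)
bt$p.value
```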
Step 3: Are the data normally distributed? ANOVA and regression assume normality, but PCA does not
In linear regression, we actually assume normality of all the replicate observations at a particular covariate value. This cannot be verified unless one has many replicates at each sampled covariate value
Normality assumption applies to model residuals, not raw data. Checks: hist(resid(model)); qqnorm(resid(model)); qqline(resid(model)); shapiro.test(resid(model)). Remedies: bootstrapping if a parametric model is desired; non-parametric methods (Rfit package or similar)
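The residual checks above can be run as follows, together with a sketch of the bootstrapping remedy. The data are simulated with skewed errors, and the case-resampling bootstrap with a percentile interval is only one of several bootstrap schemes, chosen here as an assumption.

```r
set.seed(10)
n <- 60
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + (rexp(n) - 1)   # hypothetical data with skewed, mean-zero errors
model <- lm(y ~ x)

# Normality checks on the residuals, not on the raw y
hist(resid(model))
qqnorm(resid(model)); qqline(resid(model))
shapiro.test(resid(model))

# Case-resampling bootstrap of the slope: refit on rows sampled with replacement
B <- 1000
boot_slope <- replicate(B, {
  i <- sample(n, replace = TRUE)
  coef(lm(y[i] ~ x[i]))[2]
})
ci <- quantile(boot_slope, c(0.025, 0.975))   # percentile 95% interval
ci
```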
Step 4: Are there lots of zeros in the data? Example: the effects of straw management on waterbird abundance in flooded rice fields. One possible statistical analysis is to model the number of birds as a function of time, water depth, farm, field management method, temperature. Because this analysis involves modelling a count, GLM (Poisson or negative binomial) is the appropriate analysis, but there are many zeros
The frequency of double zeros is very high. All the blue circles correspond to species that have more than 80% of their observations jointly zero. Remedies: zero-inflated GLMs; multivariate techniques
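For a single count response such as the waterbird numbers, a quick check is to compare the observed proportion of zeros with what a Poisson model would predict. The sketch below uses simulated counts (the 60% of structural zeros and the mean of 3 are assumptions), not the rice-field data.

```r
set.seed(3)
n <- 500
# Hypothetical counts: about 60% structural zeros mixed with Poisson(3) counts
birds <- ifelse(runif(n) < 0.6, 0L, rpois(n, lambda = 3))

# Frequency plot of the counts: look for a spike at zero
plot(table(birds), type = "h", xlab = "Count", ylab = "Frequency")

obs_zero <- mean(birds == 0)                       # observed proportion of zeros
fit      <- glm(birds ~ 1, family = poisson)       # intercept-only Poisson GLM
exp_zero <- mean(dpois(0, lambda = fitted(fit)))   # zeros expected under that model
c(observed = obs_zero, expected = exp_zero)
```

When the observed proportion is far above the expected one, a zero-inflated model is worth considering.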
Step 5: Is there collinearity among the covariates? Which covariates are driving the response variable(s)? The biggest problem to overcome is often collinearity. Collinearity = confusing statistical analysis: nothing is significant, yet dropping one covariate can make the others significant or even change the sign of estimated parameters.
Strategy for addressing collinearity: sequentially drop the covariate with the highest variance inflation factor (VIF), recalculate the VIFs, and repeat this process until all VIFs are smaller than a pre‐selected threshold (here, 3)
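The sequential procedure can be sketched in base R using the definition VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing covariate j on all the other covariates. The helper names vif_all() and drop_by_vif() are hypothetical, and the covariates are simulated.

```r
# VIF for every column of a data frame of numeric covariates
vif_all <- function(X) {
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)  # v ~ all other covariates
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

# Drop the covariate with the highest VIF until all VIFs are below the threshold
drop_by_vif <- function(X, threshold = 3) {
  while (ncol(X) > 1) {
    v <- vif_all(X)
    if (max(v) < threshold) break
    X <- X[, names(X) != names(v)[which.max(v)], drop = FALSE]
  }
  X
}

set.seed(4)
n  <- 100
x1 <- rnorm(n)
X  <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(n, sd = 0.1),  # nearly a copy of x1
                 x3 = rnorm(n),
                 x4 = rnorm(n))

round(vif_all(X), 1)     # x1 and x2 have very large VIFs
kept <- drop_by_vif(X)   # threshold of 3, as on the slide
names(kept)
```

Which of two collinear covariates to keep is better decided on ecological grounds than by the tie-break in the code.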
Step 6: What are the relationships between Y and X variables?
What are the assumptions of the general linear model (lm)? For both regression and ANOVA: independence of observations (cannot be met in almost all situations!); normality: the distribution of the residuals is normal; equality (homogeneity) of variances: the variance of the data in groups (or along the x gradients) is the same. Additional for regression: linearity. Verifications are done on model residuals, not raw data
Scatterplots are also useful to detect observations that do not comply with the general pattern between two variables (e.g. measurement errors, typing mistakes)
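A sketch of a multi-panel scatterplot (pairs plot) for inspecting Y against each X, with smoothers in the lower panels and correlations in the upper panels. The data, the inserted typing mistake and the helper panel_cor() are hypothetical.

```r
set.seed(5)
n <- 80
d <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
d$y <- 2 + 0.5 * d$x1 + rnorm(n, sd = 0.5)
d$y[10] <- 40   # simulated typing mistake in observation 10

# Upper panels: print the Pearson correlation instead of a scatterplot
panel_cor <- function(x, y, ...) {
  op <- par(usr = c(0, 1, 0, 1))   # temporary 0-1 coordinates for the text
  on.exit(par(op))
  text(0.5, 0.5, sprintf("%.2f", cor(x, y)))
}

pairs(d, lower.panel = panel.smooth, upper.panel = panel_cor)

# The odd point also shows up as the largest residual of a simple fit
odd <- unname(which.max(abs(resid(lm(y ~ x1, data = d)))))
odd
```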
Step 7: Should we consider interactions? see R demonstration for interaction effects
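As a stand-in for the R demonstration, which is not reproduced here, a minimal sketch of checking for an interaction between a continuous covariate and a factor. The data and the names (d, x, trt, y) are simulated and hypothetical.

```r
set.seed(6)
n <- 120
d <- data.frame(
  x   = runif(n, 0, 10),
  trt = factor(rep(c("control", "treated"), each = n / 2))
)
# Hypothetical truth: the slope of y on x is steeper in the treated group
slope <- ifelse(d$trt == "treated", 1.2, 0.2)
d$y   <- 1 + slope * d$x + rnorm(n, sd = 1)

# Conditioning plot: y against x, one panel per level of trt
coplot(y ~ x | trt, data = d)

fit_add <- lm(y ~ x + trt, data = d)   # parallel slopes
fit_int <- lm(y ~ x * trt, data = d)   # slopes allowed to differ
cmp <- anova(fit_add, fit_int)         # F-test for the interaction term
cmp
```

The coplot also shows whether each panel has enough data across the range of x to estimate a separate slope.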
Step 8: Are observations of the response variable independent? A crucial assumption of most statistical techniques is that observations are independent of one another. Violations: pseudoreplication; spatial autocorrelation (observations at locations close to each other have more similar characteristics than those far away); temporal autocorrelation; repeated observations on the same objects, which are more similar to each other
Plot auto‐correlation functions (ACF) for regularly spaced time series (figure panels: auto-correlated; not auto-correlated)
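The contrast between the two cases can be reproduced with simulated series. The AR(1) coefficient of 0.8 and the series length are assumptions for illustration.

```r
set.seed(7)
n   <- 200
ar1 <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # auto-correlated series
wn  <- rnorm(n)                                              # independent noise

op <- par(mfrow = c(1, 2))
a1 <- acf(ar1, main = "Auto-correlated")       # bars decay slowly from lag 0
a2 <- acf(wn,  main = "Not auto-correlated")   # bars stay inside the dashed band
par(op)

# Lag-1 autocorrelation (element 1 of $acf is lag 0)
c(ar1 = a1$acf[2], white_noise = a2$acf[2])
```

For a fitted model, the same plot is applied to the residuals in time order.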
Remedies for independence. Know your experimental or sampling design: the linear model assumes a completely randomized design by default; other designs include the randomized complete block design, split-plot design (nested) and repeated measures. Apply the correct statistical model: linear mixed-effects models (packages “lme4” or “nlme”)
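A minimal sketch of a linear mixed-effects model for a blocked design, using nlme (a recommended package normally installed with R). The design (10 blocks, 6 observations each), the effect sizes and the names are hypothetical; a random intercept per block is assumed to be the right structure here.

```r
library(nlme)

set.seed(8)
n_block <- 10
n_per   <- 6
d <- data.frame(
  block = factor(rep(seq_len(n_block), each = n_per)),
  x     = runif(n_block * n_per, 0, 10)
)
# Hypothetical truth: observations in the same block share a random shift
block_eff <- rnorm(n_block, sd = 2)
d$y <- 5 + 0.5 * d$x + block_eff[as.integer(d$block)] + rnorm(nrow(d), sd = 1)

fit_lm  <- lm(y ~ x, data = d)                         # ignores the blocks
fit_lme <- lme(y ~ x, random = ~ 1 | block, data = d)  # random intercept per block

summary(fit_lme)$tTable   # fixed effects with standard errors
VarCorr(fit_lme)          # between-block and residual variance components
```

In lme4 the corresponding call would be lmer(y ~ x + (1 | block), data = d).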
Take home message: Simple statistical analysis is best if it meets all assumptions, but ecological reality is more complex. Simple questions have been studied for so long that little niche remains for novelty/discovery. Statistics without verifying assumptions are not reliable and do not serve the purpose
A good starting place to learn R graphics https://stats.idre.ucla.edu/r/modules/