A protocol for data exploration to avoid common statistical problems

Zuur et al., Methods in Ecology and Evolution 2010, 1, 3–14, doi: /j X x
Presented by Han Y. H. Chen

Selecting the appropriate inferential statistics
Generalized Linear Model (GLM) as a dominant method in statistics

Simple statistics:
One-way ANOVA, followed by post-hoc comparison (example)
Simple regression (by example)
ANCOVA
MANOVA

Step 1. Are there outliers in Y and X?
An outlier is an observation that has a relatively large or small value compared to the majority of observations.
The boxplot is the usual tool of detection.
Simple commands in R: boxplot(y) for a boxplot and dotchart(y) for a Cleveland dotplot
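The two plotting commands above can be tried on a small simulated example. This is only a sketch: the vector y and the injected value of 45 are hypothetical, and boxplot.stats() is used here simply to list the points that the boxplot draws individually.

```r
# Hypothetical data: 50 simulated measurements plus one implausibly large value.
set.seed(1)
y <- c(rnorm(50, mean = 20, sd = 2), 45)

# Boxplot: points beyond 1.5 * IQR from the box are drawn individually.
boxplot(y, main = "Boxplot of y")

# Cleveland dotplot: value on the x-axis, order of the data on the y-axis.
dotchart(y, xlab = "Value of y", ylab = "Order of the data")

# The values flagged by the boxplot rule can also be listed directly.
flagged <- boxplot.stats(y)$out
print(flagged)
```

A flagged point is a candidate for checking, not automatically a value to delete.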

Multi‐panel Cleveland dotplot for all of the morphometric variables measured

Step 2: Do we have homogeneity of variance?
Homogeneity of variance is an important assumption in analysis of variance (ANOVA), other regression-related models and in multivariate techniques like discriminant analysis.
The solution: transformation of the response variable to stabilize the variance, or applying statistical techniques that do not require homogeneity (generalized least squares).
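A minimal sketch of how the variance check and the transformation remedy might look in base R. The data are simulated (log-normal, so the spread grows with the group mean), the names group and y are hypothetical, and bartlett.test() is one possible formal check rather than the one prescribed by the paper.

```r
set.seed(2)
group <- factor(rep(c("A", "B", "C"), each = 30))
# Hypothetical response: log-normal, so the spread increases with the mean.
y <- exp(rnorm(90, mean = rep(c(1, 2, 3), each = 30), sd = 0.5))

# Conditional boxplots: compare the spread between groups by eye.
boxplot(y ~ group, xlab = "Group", ylab = "y")

# Residuals against fitted values: a funnel shape suggests heterogeneity.
fit_raw <- lm(y ~ group)
plot(fitted(fit_raw), resid(fit_raw),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Formal check on the raw scale.
bt_raw <- bartlett.test(y ~ group)
print(bt_raw)

# Remedy: transform the response, then compare the group variances again.
var_raw <- tapply(y, group, var)
var_log <- tapply(log(y), group, var)
print(rbind(raw = var_raw, log = var_log))
```

On the log scale the group variances should be far more similar than on the raw scale for data of this kind.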

Step 3: Are the data normally distributed?
ANOVA and regression assume normality, but PCA does not

In linear regression, we actually assume normality of all the replicate observations at a particular covariate value. This cannot be verified unless one has many replicates at each sampled covariate value.

Normality assumption applies to model residuals, not raw data
hist(resid(model))
qqnorm(resid(model)); qqline(resid(model))
shapiro.test(resid(model))
Remedies:
Bootstrapping, if a parametric analysis is desired
Non-parametric methods (Rfit package or similar)
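The residual checks above need a fitted model object. The sketch below builds one from simulated data (x, y and the coefficients are hypothetical) and adds a simple case-resampling bootstrap of the slope as one way the bootstrapping remedy could be carried out.

```r
set.seed(3)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)   # hypothetical data
model <- lm(y ~ x)
res <- resid(model)

# Graphical checks on the residuals, not on the raw response.
hist(res, main = "Histogram of residuals", xlab = "Residual")
qqnorm(res); qqline(res)

# Formal test of normality of the residuals.
sw <- shapiro.test(res)
print(sw)

# Remedy sketch: case-resampling bootstrap of the slope (1000 resamples).
boot_slope <- replicate(1000, {
  i <- sample(seq_along(y), replace = TRUE)
  coef(lm(y[i] ~ x[i]))[2]
})
ci <- quantile(boot_slope, c(0.025, 0.975))
print(ci)
```

The percentile interval in ci does not rely on normally distributed residuals.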

Step 4: Are there lots of zeros in the data?
Example: the effects of straw management on waterbird abundance in flooded rice fields.
One possible statistical analysis is to model the number of birds as a function of time, water depth, farm, field management method and temperature.
Because this analysis involves modelling a count, a GLM (Poisson or negative binomial) is the appropriate analysis, but there are many zeros.
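One way to see whether there are more zeros than a Poisson GLM can accommodate is to compare the observed proportion of zeros with the proportion the fitted model expects. The sketch below uses simulated counts, not the waterbird data; birds, depth and the 60% share of empty fields are all hypothetical.

```r
set.seed(4)
n <- 200
depth <- runif(n, 0, 30)                      # hypothetical water depth
# Hypothetical counts with excess zeros: about 60% of fields hold no birds at all.
occupied <- rbinom(n, 1, 0.4)
birds <- occupied * rpois(n, lambda = exp(0.5 + 0.05 * depth))

# Frequency plot of the counts: a tall spike at zero is the warning sign.
plot(table(birds), xlab = "Number of birds", ylab = "Frequency")

# Observed proportion of zeros.
prop_zero_obs <- mean(birds == 0)

# Proportion of zeros expected under an ordinary Poisson GLM.
pois_fit <- glm(birds ~ depth, family = poisson)
prop_zero_exp <- mean(dpois(0, lambda = fitted(pois_fit)))

print(c(observed = prop_zero_obs, expected = prop_zero_exp))
```

When the observed proportion is well above the expected one, a zero-inflated model is worth considering.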

The frequency of double zeros is very high
All the blue circles correspond to species that have more than 80% of their observations jointly zero.
Remedies:
Zero-inflated GLMs
Multivariate techniques
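The frequency of double zeros for each pair of species can be computed directly from a site-by-species matrix. This is a sketch on simulated abundances; the matrix abund and the species names sp1 to sp5 are hypothetical.

```r
set.seed(5)
# Hypothetical site-by-species abundance matrix (40 sites, 5 species), mostly zeros.
abund <- matrix(rpois(40 * 5, lambda = 0.3), nrow = 40,
                dimnames = list(NULL, paste0("sp", 1:5)))

# 1 where a species is absent from a site, 0 otherwise.
is_zero <- (abund == 0) * 1

# Entry [i, j] is the proportion of sites where species i and j are both absent.
double_zero <- crossprod(is_zero) / nrow(abund)
print(round(double_zero, 2))

# Species pairs that are jointly absent from more than 80% of the sites.
high <- which(double_zero > 0.8 & upper.tri(double_zero), arr.ind = TRUE)
print(high)
```

The diagonal of double_zero gives the proportion of zeros for each species on its own.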

Step 5: Is there collinearity among the covariates?
Which covariates are driving the response variable(s)?
The biggest problem to overcome is often collinearity.
Collinearity leads to a confusing statistical analysis: nothing is significant, yet dropping one covariate can make the others significant or even change the sign of estimated parameters.

Strategy for addressing collinearity
Sequentially drop the covariate with the highest variance inflation factor (VIF), recalculate the VIFs and repeat this process until all VIFs are smaller than a pre-selected threshold (here, 3).
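The sequential procedure can be sketched in base R using the definition VIF = 1 / (1 - R²), where R² comes from regressing one covariate on all the others. The helper vif_manual() and the covariates x1 to x4 are hypothetical names for this illustration; packaged VIF functions exist, but none is assumed here.

```r
set.seed(6)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)
x4 <- rnorm(n)
covars <- data.frame(x1, x2, x3, x4)

# Hypothetical helper: VIF of each column, from regressing it on the others.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    f <- reformulate(setdiff(names(df), v), response = v)
    1 / (1 - summary(lm(f, data = df))$r.squared)
  })
}

threshold <- 3
while (ncol(covars) > 1) {
  v <- vif_manual(covars)
  print(round(v, 2))
  if (max(v) < threshold) break
  # Drop the covariate with the highest VIF and recalculate.
  covars <- covars[, names(v) != names(which.max(v)), drop = FALSE]
}

print(names(covars))
```

In this simulation one of x1 and x2 is dropped in the first pass, after which all remaining VIFs fall below the threshold.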

Step 6: What are the relationships between Y and X variables?

What are the assumptions for the general linear model (lm)?
Both for regression and ANOVA:
Independence of observations (cannot be met in almost all situations!)
Normality: the distributions of the residuals are normal
Equality (homogeneity) of variances: the variance of data in groups (or along the x gradients) is the same
Additional for regression:
Linearity
Verifications are done on model residuals, not raw data

Scatterplots are also useful to detect observations that do not comply with the general pattern between two variables, for example because of measurement errors or typing mistakes.
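A pairwise scatterplot matrix is a quick way to look at all Y and X relationships at once. The sketch below uses simulated data in which observation 10 has been overwritten with a hypothetical typing mistake; dat, x1 and x2 are illustrative names.

```r
set.seed(7)
n <- 60
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y <- 1 + 0.8 * x1 + rnorm(n)
y[10] <- 100                       # hypothetical typing mistake
dat <- data.frame(y, x1, x2)

# Scatterplot matrix with a smoother in each panel.
pairs(dat, panel = panel.smooth)

# The observation furthest from the general y ~ x1 pattern.
fit <- lm(y ~ x1, data = dat)
suspect <- unname(which.max(abs(resid(fit))))
print(dat[suspect, ])
```

The suspect row should be checked against the field notes or the original data sheet before anything is changed.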

Step 7: Should we consider interactions?
See the R demonstration for interaction effects.
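The R demonstration itself is not part of this transcript. As a stand-in, here is a minimal sketch of how an interaction might be explored and tested in base R, using simulated data in which the slope of y on x differs between two hypothetical groups.

```r
set.seed(8)
n <- 120
g <- factor(rep(c("control", "treated"), each = n / 2))
x <- runif(n, 0, 10)
# Hypothetical effect: slope 0.2 in the control group, 1.0 in the treated group.
y <- 1 + ifelse(g == "treated", 1.0, 0.2) * x + rnorm(n)

# Conditioning plot: y against x, one panel per level of g.
coplot(y ~ x | g)

# Compare a model without the interaction against one with it.
additive <- lm(y ~ x + g)
interact <- lm(y ~ x * g)
cmp <- anova(additive, interact)
print(cmp)
print(coef(interact))
```

Clearly different slopes in the coplot panels, together with a small p-value in the comparison, point to keeping the interaction term.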

Step 8: Are observations of the response variable independent?
A crucial assumption of most statistical techniques is that observations are independent of one another.
Common violations:
Pseudoreplication
Spatial autocorrelation: observations at locations close to each other have more similar characteristics than those far away
Temporal autocorrelation
Repeated observations on the same objects are more similar

Plot auto‐correlation functions (ACF) for regularly spaced time series
[Figure: ACF plots for two time series. Panels: auto-correlated; not auto-correlated]
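The contrast between the two panels can be reproduced with simulated series. This is a sketch: the AR(1) coefficient of 0.8 and the series length are arbitrary choices, not values from the paper.

```r
set.seed(9)
n <- 200
ar_series <- arima.sim(model = list(ar = 0.8), n = n)  # auto-correlated series
white <- rnorm(n)                                      # independent observations

op <- par(mfrow = c(1, 2))
acf(ar_series, main = "Auto-correlated")
acf(white, main = "Not auto-correlated")
par(op)

# Lag-1 autocorrelation of each series (element 1 of $acf is lag 0).
lag1_ar <- acf(ar_series, plot = FALSE)$acf[2]
lag1_wn <- acf(white, plot = FALSE)$acf[2]
print(c(ar = lag1_ar, white = lag1_wn))
```

Bars that stay outside the dashed confidence bands over several lags indicate temporal dependence.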

Remedies for independence
Know your experimental or sampling design. The linear model assumes a completely randomized design by default. Other designs:
Completely randomized block design
Split-plot design (nested)
Repeated measures
Apply the correct statistical model: linear mixed effect models (packages “lme4” or “nlme”)
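As an illustration of the last point, the sketch below fits a random-intercept model to simulated blocked data using lme() from the nlme package. The data, the block structure and the variable names are hypothetical, and the fit is skipped if nlme is not installed.

```r
set.seed(10)
n_block <- 8
n_per <- 10
block <- factor(rep(seq_len(n_block), each = n_per))
x <- runif(n_block * n_per, 0, 10)
# Hypothetical block effect shared by all observations in the same block.
block_effect <- rnorm(n_block, sd = 2)[as.integer(block)]
y <- 3 + 0.5 * x + block_effect + rnorm(n_block * n_per)
dat <- data.frame(y, x, block)

# Naive model: treats all 80 observations as independent.
naive <- lm(y ~ x, data = dat)
print(summary(naive)$coefficients)

# Mixed model: a random intercept per block accounts for the grouping.
if (requireNamespace("nlme", quietly = TRUE)) {
  mixed <- nlme::lme(y ~ x, random = ~ 1 | block, data = dat)
  print(summary(mixed)$tTable)
} else {
  mixed <- NULL   # nlme not available; only the naive fit is shown
}
```

The random term (~ 1 | block here) should mirror the actual design, for example nested terms for a split-plot layout.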

Take home message
Simple statistical analysis is the best if it meets all assumptions, but ecological reality is more complex.
Simple questions have been studied for so long that they leave little niche for novelty or discovery.
Statistics without verified assumptions are not reliable and do not serve the purpose.

A good starting place to learn R graphics
