A protocol for data exploration to avoid common statistical problems

A protocol for data exploration to avoid common statistical problems
Zuur et al. 2010. Methods in Ecology and Evolution 2010, 1, 3–14. doi: 10.1111/j.2041-210X.2009.00001.x
Presented by Han Y. H. Chen

Selecting the appropriate inferential statistics: the Generalized Linear Model (GLM) as a dominant method in statistics

Simple statistics: one-way ANOVA, followed by post-hoc comparison (example: http://flash.lakeheadu.ca/~hchen/R/RootLDD.R ); simple regression (by example); ANCOVA; MANOVA
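The simple analyses listed on this slide can be sketched in a few lines of R. This is only an illustration: the built-in datasets PlantGrowth, cars and ToothGrowth are stand-ins chosen here, not the data used in the presenter's RootLDD.R script.

```r
# One-way ANOVA followed by a Tukey post-hoc comparison
# (PlantGrowth: plant weight for a control and two treatment groups)
fit_aov <- aov(weight ~ group, data = PlantGrowth)
summary(fit_aov)
tukey <- TukeyHSD(fit_aov)   # all pairwise differences between group means
print(tukey)

# Simple regression (cars: stopping distance as a function of speed)
fit_lm <- lm(dist ~ speed, data = cars)
summary(fit_lm)

# ANCOVA: one continuous covariate plus one factor
# (ToothGrowth: tooth length, vitamin C dose, supplement type)
fit_ancova <- lm(len ~ dose + supp, data = ToothGrowth)
anova(fit_ancova)
```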

Step 1: Are there outliers in Y and X? An outlier is an observation that has a relatively large or small value compared to the majority of observations. The boxplot and the Cleveland dotplot are the tools of detection. Simple commands in R: boxplot(y) and dotchart(y)

Multi‐panel Cleveland dotplot for all of the morphometric variables measured
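A minimal sketch of both outlier tools follows. The vector y is simulated with one planted extreme value, and the iris measurements are used only as a stand-in for the morphometric variables in the figure; both are assumptions made for this example.

```r
set.seed(1)
# Simulated data (assumption): 50 ordinary values plus one extreme value
y <- c(rnorm(50, mean = 10, sd = 1), 25)

op <- par(mfrow = c(1, 2))
boxplot(y, main = "Boxplot")
# Cleveland dotplot: value on the x-axis, order of the data on the y-axis
dotchart(y, main = "Cleveland dotplot", xlab = "Value", ylab = "Order of the data")
par(op)

# Values that the boxplot draws beyond its whiskers
flagged <- boxplot.stats(y)$out
flagged

# Multi-panel version: one Cleveland dotplot per numeric variable
vars <- iris[, 1:4]
op <- par(mfrow = c(2, 2))
for (v in names(vars)) dotchart(vars[[v]], main = v, xlab = "Value")
par(op)
```

A point flagged this way is a candidate for closer inspection, not automatically an error to delete.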

Step 2: Do we have homogeneity of variance? Homogeneity of variance is an important assumption in analysis of variance (ANOVA), in other regression-related models, and in multivariate techniques such as discriminant analysis. The solution: transform the response variable to stabilize the variance, or apply statistical techniques that do not require homogeneity (for example, generalized least squares)
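The check and the generalized least squares remedy can be sketched as follows. The data are simulated with deliberately unequal group variances (an assumption for illustration). gls() and varIdent() come from the nlme package, which is distributed with R.

```r
library(nlme)  # provides gls() and the varIdent() variance structure

set.seed(42)
# Simulated data (assumption): three groups whose spread increases from A to C
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  y     = c(rnorm(30, 10, 1), rnorm(30, 12, 3), rnorm(30, 14, 6))
)

boxplot(y ~ group, data = d)               # visual check of the spread per group
bt <- bartlett.test(y ~ group, data = d)   # formal test (sensitive to non-normality)
bt

# Same mean structure, with and without a separate variance per group
fit_equal <- gls(y ~ group, data = d)
fit_var   <- gls(y ~ group, data = d, weights = varIdent(form = ~ 1 | group))
anova(fit_equal, fit_var)                  # compares the two variance structures
```

The two gls() fits share the same fixed effects, so the comparison isolates the variance structure.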

Step 3: Are the data normally distributed? ANOVA and regression assume normality, but PCA does not

In linear regression, we actually assume normality of all the replicate observations at a particular covariate value. This cannot be verified unless one has many replicates at each sampled covariate value

The normality assumption applies to model residuals, not raw data. Checks in R:
hist(resid(model))
qqnorm(resid(model)); qqline(resid(model))
shapiro.test(resid(model))
Remedies: bootstrapping if a parametric model is desired; otherwise non-parametric methods (the Rfit package or similar)
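The residual checks and the bootstrap remedy can be tried on a small simulated example. The data, the skewed error distribution and the choice of a case-resampling percentile bootstrap are all assumptions made for this sketch; other bootstrap schemes exist.

```r
set.seed(123)
# Simulated data (assumption): linear trend with skewed, non-normal errors
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + (rexp(n, rate = 1) - 1)
model <- lm(y ~ x)

# Checks applied to the residuals, not to the raw data
hist(resid(model))
qqnorm(resid(model)); qqline(resid(model))
shapiro.test(resid(model))

# Case-resampling bootstrap: refit the model on resampled rows
B <- 2000
boot_slope <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))[2]
})

# Percentile confidence interval for the slope
ci <- quantile(boot_slope, c(0.025, 0.975))
ci
```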

Step 4: Are there lots of zeros in the data? Example: the effects of straw management on waterbird abundance in flooded rice fields. One possible statistical analysis is to model the number of birds as a function of time, water depth, farm, field management method, and temperature. Because this analysis involves modelling a count, a GLM (Poisson or negative binomial) is the appropriate analysis, but there are many zeros

The frequency of double zeros is very high. In the figure, all the blue circles correspond to species that have more than 80% of their observations jointly zero. Remedies: zero-inflated GLMs; multivariate techniques
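One quick way to see whether zeros are a problem is to compare the observed proportion of zeros with the proportion that a fitted Poisson GLM expects. The sketch below uses simulated counts; the variable names birds and depth and all parameter values are hypothetical, not the waterbird data. Zero-inflated models themselves are available in contributed packages (for example zeroinfl() in pscl), which are not used here so that the example runs with base R only.

```r
set.seed(7)
# Simulated counts (assumption): Poisson counts with extra zeros mixed in
n <- 500
depth <- runif(n, 0, 30)
mu <- exp(0.5 + 0.05 * depth)
structural_zero <- rbinom(n, 1, 0.4) == 1        # 40% of sites are always zero
birds <- ifelse(structural_zero, 0L, rpois(n, mu))

# Frequency plot of the counts: a tall bar at zero is the warning sign
barplot(table(birds), xlab = "Observed count", ylab = "Frequency")

# Proportion of zeros observed vs. expected under an ordinary Poisson GLM
fit_pois <- glm(birds ~ depth, family = poisson)
obs_zero <- mean(birds == 0)
exp_zero <- mean(dpois(0, fitted(fit_pois)))
c(observed = obs_zero, expected_poisson = exp_zero)
```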

Step 5: Is there collinearity among the covariates? Which covariates are driving the response variable(s)? The biggest problem to overcome is often collinearity. Collinearity confuses the statistical analysis: nothing is significant, yet dropping one covariate can make the others significant or even change the sign of estimated parameters.

Strategy for addressing collinearity: sequentially drop the covariate with the highest variance inflation factor (VIF), recalculate the VIFs, and repeat this process until all VIFs are smaller than a pre-selected threshold (here, 3)
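The sequential dropping strategy can be sketched with base R only. The helper names vif_all() and drop_by_vif() are hypothetical, written for this example. The VIF of a covariate is computed as 1 / (1 - R^2), where R^2 comes from regressing that covariate on all the others. The covariates are simulated.

```r
# VIF for every column of a numeric data frame of covariates
vif_all <- function(X) {
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

# Drop the covariate with the highest VIF until all VIFs are below the threshold
drop_by_vif <- function(X, threshold = 3) {
  while (ncol(X) > 1) {
    v <- vif_all(X)
    if (max(v) < threshold) break
    X <- X[, names(X) != names(v)[which.max(v)], drop = FALSE]
  }
  X
}

set.seed(3)
# Simulated covariates (assumption): x2 is nearly a copy of x1
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)
x3 <- rnorm(200)
X <- data.frame(x1, x2, x3)

vif_all(X)               # x1 and x2 have very large VIFs
kept <- drop_by_vif(X)   # one of the collinear pair is removed
names(kept)
```

Which member of a collinear pair is dropped is decided purely by the VIF values here; in practice, subject knowledge should guide that choice.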

Step 6: What are the relationships between Y and X variables?

What are the assumptions of the general linear model (lm)? For both regression and ANOVA: (1) independence of observations (cannot be met in almost all situations!); (2) normality: the distribution of the residuals is normal; (3) equality (homogeneity) of variances: the variance of the data in groups (or along the x gradients) is the same. Additional for regression: linearity. Verifications are done on model residuals, not raw data

Scatterplots are also useful to detect observations that do not comply with the general pattern between two variables, for example because of measurement errors or typing mistakes
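A scatterplot matrix gives a first look at the relationships between Y and each X, and standardized residuals help flag points that do not follow the general pattern. The built-in airquality data and the cut-off of 3 for standardized residuals are assumptions made for this sketch.

```r
# Built-in airquality data as a stand-in: Ozone as Y, the others as X
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# Scatterplot matrix with a smoother added to each panel
pairs(aq, panel = panel.smooth)

# Observations far from the fitted pattern between two variables
fit <- lm(Ozone ~ Temp, data = aq)
std_res <- rstandard(fit)
unusual <- aq[abs(std_res) > 3, ]   # candidates to check for measurement or typing errors
unusual
```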

Step 7: Should we consider interactions? See the R demonstration for interaction effects
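The R demonstration referred to on this slide is not part of the transcript. As a stand-in, the following sketch uses the built-in ToothGrowth data (an assumption) to plot a possible interaction and to compare a model with and without the interaction term.

```r
# Mean tooth length per dose, one line per supplement type;
# non-parallel lines suggest an interaction
with(ToothGrowth, interaction.plot(dose, supp, len))

# Additive model vs. model with a dose-by-supplement interaction
fit_add <- lm(len ~ dose + supp, data = ToothGrowth)
fit_int <- lm(len ~ dose * supp, data = ToothGrowth)

# F-test for the extra interaction term
cmp <- anova(fit_add, fit_int)
cmp
```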

Step 8: Are observations of the response variable independent? A crucial assumption of most statistical techniques is that observations are independent of one another. Common violations: pseudoreplication; spatial autocorrelation (observations at locations close to each other have more similar characteristics than those far away); temporal autocorrelation; repeated observations on the same objects, which are more similar

Plot auto-correlation functions (ACF) for regularly spaced time series. [Figure: two ACF panels, one for an auto-correlated series and one for a series that is not auto-correlated]
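The contrast between the two panels can be reproduced with simulated series. The AR(1) coefficient of 0.8 and the series length are assumptions chosen for illustration.

```r
set.seed(11)
n <- 200
# Simulated regularly spaced series (assumption)
ar1   <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # auto-correlated
noise <- rnorm(n)                                              # independent values

op <- par(mfrow = c(1, 2))
acf_ar1   <- acf(ar1,   main = "Auto-correlated")
acf_noise <- acf(noise, main = "Not auto-correlated")
par(op)

# Lag-1 autocorrelation (element 1 of the acf array is lag 0)
c(ar1 = acf_ar1$acf[2], noise = acf_noise$acf[2])
```

Bars that stay outside the dashed confidence bands over several lags indicate temporal dependence.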

Remedies for independence: (1) Know your experimental or sampling design. The linear model assumes a completely randomized design by default; other designs include the randomized complete block design, the split-plot design (nested), and repeated measures. (2) Apply the correct statistical model: linear mixed-effects models (packages “lme4” or “nlme”)
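A repeated-measures case can be sketched with the nlme package and its bundled Orthodont data, in which each child is measured at four ages. The choice of dataset and of a random intercept per subject are assumptions for illustration; lme4 offers a comparable interface.

```r
library(nlme)  # distributed with R; provides gls(), lme() and the Orthodont data

# Model that ignores the grouping: treats all 108 measurements as independent
fit_indep <- gls(distance ~ age + Sex, data = Orthodont)

# Mixed model: a random intercept per child accounts for repeated measures
fit_mixed <- lme(distance ~ age + Sex, random = ~ 1 | Subject, data = Orthodont)

# Same fixed effects in both fits, so the comparison concerns the random effect
anova(fit_indep, fit_mixed)
VarCorr(fit_mixed)   # between-child and residual variance components
```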

Take-home message: Simple statistical analysis is the best if it meets all assumptions, but ecological reality is more complex. Simple questions have been studied for so long that they leave little niche for novelty or discovery. Statistics without verified assumptions are not reliable and do not serve the purpose

A good starting place to learn R graphics: https://stats.idre.ucla.edu/r/modules/