WiFi password: 525-244-426-914

Assumptions

Assumptions

About the predictors:
- Absence of Collinearity
- No influential data points

About the residuals:
- Normality of Errors
- Homoskedasticity of Errors
- Independence

Assumptions: Absence of Collinearity

Collinearity generally occurs when predictors are correlated with each other; however, it may also arise in more complex ways, through multicollinearity among several predictors. (Demo)

[Figures: collinearity illustrated; from Baayen (2008: 182)]

“If collinearity is ignored, one is likely to end up with a confusing statistical analysis in which nothing is significant, but where dropping one covariate can make the others significant, or even change the sign of estimated parameters.” (Zuur, Ieno & Elphick, 2010: 9) Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.

You can check collinearity through variance inflation factors (VIFs):

library(car)
vif(xmdl)

Values > 10 are commonly regarded as dangerous, but values substantially larger than 1 can already be problematic; I would definitely start worrying at around > 4. Informally: check pairwise correlations between predictors, and definitely start worrying at around 0.8.
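As a concrete illustration of the VIF check above, here is a hedged sketch on simulated data (all variable names are made up; requires the car package):

```r
# Sketch: two deliberately collinear predictors
library(car)

set.seed(42)
A <- rnorm(100)
B <- A + rnorm(100, sd = 0.1)   # B is nearly a copy of A
y <- A + rnorm(100)
xmdl <- lm(y ~ A + B)

vif(xmdl)   # with predictors this strongly correlated, both VIFs come out very high
```

With near-duplicate predictors like these, the VIFs land well above the danger threshold of 10.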

Model comparison with separate models of collinear predictors:

xmdl1 <- lm(y ~ A)
xmdl2 <- lm(y ~ B)
AIC(xmdl1)
AIC(xmdl2)

Akaike's Information Criterion (AIC) trades off goodness of fit against the number of parameters.
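A runnable version of this comparison, on simulated data (a sketch; names and numbers are illustrative):

```r
# Sketch: fit two separate models of collinear predictors, compare by AIC
set.seed(1)
A <- rnorm(100)
B <- A + rnorm(100, sd = 0.1)   # collinear with A
y <- 2 * A + rnorm(100)

xmdl1 <- lm(y ~ A)
xmdl2 <- lm(y ~ B)
AIC(xmdl1)   # the model with the lower AIC is preferred
AIC(xmdl2)
```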

If the relative importance of (potentially) collinear predictors is of prime interest, consider random forests (cforest() and varimp() from the party package):

library(party)
myforest = cforest(..., controls = data.controls)
my_varimp = varimp(myforest, conditional = T)

Check Stephanie Shih's tutorials.

Assumptions: No influential data points

Demo: 500 x points and 500 y points, completely uncorrelated, generated by simply drawing them from a normal distribution.

Simply changing one value to (8, 8) creates an influential data point.
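The demo described above can be sketched in a few lines of R (a hedged reconstruction, not the original demo code):

```r
# 500 uncorrelated points, then one extreme value
set.seed(42)
x <- rnorm(500)
y <- rnorm(500)
cor.test(x, y)   # no real correlation in the data

x[1] <- 8
y[1] <- 8        # a single influential data point
cor.test(x, y)   # the correlation may now come out "significant"
```

One extreme point out of 500 can be enough to change the apparent result, which is exactly why influence diagnostics matter.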

Influence diagnostics:
- DFFit
- DFBeta
- Leverage
- Cook's distance
- Standardized residuals
- Studentized residuals
- … and more!

Code for doing DFBetas yourself: perform leave-one-out influence diagnostics. General structure of the code:

all.betas = c()
for (i in 1:nrow(xdata)) {
  xmdl = lm( … , xdata[-i, ])
  all.betas = c(all.betas, coef(xmdl)["slope_of_interest"])
}
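A filled-in, runnable version of this skeleton on simulated data (the data, formula, and coefficient name are illustrative):

```r
# Leave-one-out DFBetas sketch: refit the model once per dropped row
set.seed(1)
xdata <- data.frame(x = rnorm(50))
xdata$y <- 2 * xdata$x + rnorm(50)

all.betas <- c()
for (i in 1:nrow(xdata)) {
  xmdl <- lm(y ~ x, xdata[-i, ])
  all.betas <- c(all.betas, coef(xmdl)["x"])
}
range(all.betas)   # how much does the slope move when each point is left out?
```

R also has built-in shortcuts for this, e.g. dfbeta() and influence.measures() on a fitted lm object.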

Influence diagnostics abuse: influence diagnostics are no justification for removing data points!

Assumptions: Normality of Errors

Q-Q plots:

qqnorm(residuals(xmdl))
qqline(residuals(xmdl))

Assumptions: Homoskedasticity of Errors

Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.

Residual plot (plot of residuals against fitted values):

plot(fitted(xmdl), residuals(xmdl))

This one is really bad!

Learning to interpret residual plots by simulating random data. You can type these lines into R again and again to train your eye:

## Good
par(mfrow = c(3, 3))
for (i in 1:9) plot(1:50, rnorm(50))

## Weak non-constant variance
for (i in 1:9) plot(1:50, sqrt(1:50) * rnorm(50))

Faraway, J. (2005). Linear models with R. Boca Raton: Chapman & Hall/CRC Press.

Learning to interpret residual plots by simulating random data:

## Strong non-constant variance
par(mfrow = c(3, 3))
for (i in 1:9) plot(1:50, (1:50) * rnorm(50))

## Non-linearity
for (i in 1:9) plot(1:50, cos((1:50) * pi / 25) + rnorm(50))

Faraway, J. (2005). Linear models with R. Boca Raton: Chapman & Hall/CRC Press.

Emphasis on graphical tools: for now, forget about formal tests of deviations from normality and homogeneity; graphical methods are generally considered superior (Montgomery & Peck, 1992; Draper & Smith, 1998; Quinn & Keough, 2002; Läärä, 2009; Zuur et al., 2009). Problems with formal tests:
- Type II errors
- hard cut-offs if used in a frequentist fashion
- less information about the data and the model
- they have assumptions themselves

Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.

If you have continuous data that exhibit heteroskedasticity…
… you can perform nonlinear transformations (e.g., log transform)
… there are several variants of regression that can help you out: generalized least squares with gls(); White-Huber (heteroskedasticity-consistent) standard errors via coeftest() with a robust covariance matrix, with bptest() to test for heteroskedasticity; bootstrapping using the "boot" package; etc.
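One of these remedies can be sketched as follows (a hedged example on simulated data; it assumes the lmtest and sandwich packages are installed):

```r
# Sketch: test for heteroskedasticity, then use White-Huber robust SEs
library(lmtest)
library(sandwich)

set.seed(1)
x <- rnorm(100)
y <- x + x * rnorm(100)   # residual variance grows with x
xmdl <- lm(y ~ x)

bptest(xmdl)                          # Breusch-Pagan test for heteroskedasticity
coeftest(xmdl, vcov = vcovHC(xmdl))   # coefficient table with robust standard errors
```

The point estimates stay the same; only the standard errors (and hence the p-values) are corrected.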

“A priori violations”: in the following cases, your data violate the normality and homoskedasticity assumptions on a priori grounds:
(1) count data → Poisson regression
(2) binary data → logistic regression

General Linear Model → Generalized Linear Model

Generalized Linear Models: Ingredients
- An error distribution (e.g., normal, Poisson, binomial)
- A linear predictor (LP)
- A link function (e.g., identity, log, logit)

Generalized Linear Models: two important types
- Poisson regression: a generalized linear model with Poisson error structure and log link function
- Logistic regression: a generalized linear model with binomial error structure and logit link function

The Poisson Distribution Mean = Variance

Hissing Koreans: Winter, B., & Grawunder, S. (2012). The phonetic profile of Korean formality. Journal of Phonetics, 40, 808-815.

Rates can be misleading: Rate = N / Time. A contrast like 16/s vs. 0/s could be based on 1 millisecond or on 10 years of observation.

The basic LM formula:
y = β0 + β1·x + ε

The basic GLM formula for Poisson regression:
log(λ) = β0 + β1·x

Here β0 + β1·x is the linear predictor, and log is the link function connecting the mean rate λ to the linear predictor.

Poisson model output: the estimates are log values; exponentiate them to obtain the predicted mean rate.

Poisson model in R:

xmdl = glm(…, xdata, family = "poisson")

When using predict(), you have to additionally specify whether you want predictions in LP space (type = "link") or in response space:

preds = predict.glm(xmdl, newdata = mydata, type = "response", se.fit = T)
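A filled-in, runnable sketch of the above on simulated count data (the data and coefficient values are made up for illustration):

```r
# Sketch: Poisson regression on simulated counts
set.seed(1)
xdata <- data.frame(x = rnorm(50))
xdata$count <- rpois(50, lambda = exp(0.5 + 0.8 * xdata$x))

xmdl <- glm(count ~ x, data = xdata, family = "poisson")
coef(xmdl)        # estimates on the log (linear predictor) scale
exp(coef(xmdl))   # exponentiated: multiplicative effects on the mean rate

preds <- predict.glm(xmdl, newdata = data.frame(x = c(-1, 0, 1)),
                     type = "response", se.fit = TRUE)
preds$fit         # predicted mean counts on the response scale
```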

The Poisson distribution assumes Mean = Variance. If variance > mean, you are dealing with overdispersion; use negative binomial regression:

library(MASS)
xmdl.nb = glm.nb(…)

Overdispersion test:

library(pscl)
odTest(xmdl.nb)
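Putting the negative binomial fit and the overdispersion test together, a hedged end-to-end sketch on simulated overdispersed counts (assumes the MASS and pscl packages):

```r
# Sketch: overdispersed counts, negative binomial fit, overdispersion test
library(MASS)
library(pscl)

set.seed(1)
xdata <- data.frame(x = rnorm(100))
# size = 1 makes the variance much larger than the mean (overdispersion)
xdata$count <- rnbinom(100, mu = exp(0.5 + 0.8 * xdata$x), size = 1)

xmdl.nb <- glm.nb(count ~ x, data = xdata)
odTest(xmdl.nb)   # likelihood ratio test: Poisson vs. negative binomial
```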

Logistic regression: a generalized linear model with binomial error structure and logit link function.

The basic GLM formula for Poisson regression:
log(λ) = β0 + β1·x

The basic GLM formula for logistic regression:
logit(p) = β0 + β1·x

where logit(p) = log(p / (1 − p)) is the logit link function.

The inverse logit (logistic) function maps log odds back to probabilities; in R: plogis().

Odds and log odds examples:

Probability   Odds    Log odds (= "logits")
0.1           0.111   -2.197
0.2           0.25    -1.386
0.3           0.428   -0.847
0.4           0.667   -0.405
0.5           1        0
0.6           1.5      0.405
0.7           2.33     0.847
0.8           4        1.386
0.9           9        2.197

So a probability of 80% of an event occurring means that the odds are "4 to 1" for it occurring. What happens if the odds are 50 to 50? The ratio is 1 (log odds 0). If the probability of non-occurrence is higher than occurrence, the odds are fractions (negative log odds); if the probability of occurrence is higher, the log odds are positive numbers.
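Rows of this table can be reproduced directly in R with base functions:

```r
# Probabilities -> odds -> log odds, matching the table
p <- c(0.2, 0.5, 0.8)
odds <- p / (1 - p)    # 0.25, 1, 4
log(odds)              # -1.386, 0, 1.386

qlogis(0.8)            # logit: probability -> log odds (1.386)
plogis(qlogis(0.8))    # inverse logit recovers the probability (0.8)
```

qlogis() and plogis() are the logit and inverse logit functions, so they invert each other.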

[Figure from Snijders & Bosker (1999: 212)]

Logistic regression output:

              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -3.643       1.123   -3.244  0.001179 **
alc             16.118       4.856    3.319  0.000903 ***

For probabilities: transform the entire LP with the logistic function, plogis().
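For example, plugging the coefficients from the output above into the linear predictor (the predictor value alc = 0.3 is purely illustrative):

```r
# Sketch: predicted probability from the fitted logistic model
b0 <- -3.643       # intercept (log odds)
b_alc <- 16.118    # slope for alc (log odds per unit)

plogis(b0 + b_alc * 0.3)   # probability of the outcome at alc = 0.3
```

plogis() turns the log odds computed by the linear predictor into a probability between 0 and 1.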