Unit 26: Model Assumptions & Case Analysis II: More complex designs


What is New since Case Analysis I in First Semester
This semester we have focused on how to handle analyses when observations are not completely independent. Four types of designs:
- Pre-test/post-test designs, analyzed with ANCOVA
- Traditional repeated measures designs with categorical within-subject factors, analyzed either with LM on transformed differences and averages or with LMER
- More advanced repeated measures designs with quantitative within-subject factors, analyzed with LMER
- Other designs with dependent data where observations are nested within groups (e.g., students within classrooms, siblings within families), analyzed with LMER

Pre-test/Post-test Designs
How do we evaluate model assumptions and conduct case analysis in the pre-test/post-test design?
These designs involve standard general linear models analyzed with LM, so model assumptions and case analysis are handled as we learned last semester.
Case analysis is conducted first (generally):
- Regression outliers: visual inspection and test of studentized residuals
- Influence: visual inspection and thresholds for Cook's D and DFBetas
Model assumptions focus on the residuals:
- Normality: Q-Q (quantile comparison) plot
- Constant variance (across Y-hat): statistical test and spread-level plot
- Linearity (mean residual = 0 for all Y-hat): component-plus-residual plot
A sketch of this workflow in code follows below.
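Putting these steps together (a sketch for a fitted model m, using the lmSupport calls shown on the following slides):

# Case analysis (generally conducted first)
modelCaseAnalysis(m, 'Residuals')  # regression outliers: studentized residuals
modelCaseAnalysis(m, 'Cooksd')     # overall influence: Cook's D
modelCaseAnalysis(m, 'DFBETAS')    # specific influence: DFBetas

# Model assumptions (focus on residuals)
modelAssumptions(m, 'Normal')      # normality: Q-Q plot
modelAssumptions(m, 'Constant')    # constant variance: test and spread-level plot
modelAssumptions(m, 'Linear')      # linearity: component-plus-residual plot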

Pre-test/Post-test Design
Evaluate a course designed to improve math achievement:
- Measure math achievement at pre-test (before the course)
- Randomly assign to course vs. control group
- Measure math achievement at post-test
How do you analyze these data, why, and what result provides the critical support for the new course?

> m = lm(Post ~ Pre + X, data=d)
> modelSummary(m)
lm(formula = Post ~ Pre + X, data = d)
Observations: 102

Linear model fit by least squares

Coefficients:
            Estimate      SE     t Pr(>|t|)
(Intercept)  49.4651 17.0653 2.899  0.00462 **
Pre           0.5122  0.1679 3.050  0.00294 **
X            10.6933  3.1979 3.344  0.00117 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sum of squared errors (SSE): 25168.0, Error df: 99
R-squared: 0.1973

FOCUS ON TEST OF PARAMETER ESTIMATE FOR X

Make Data for Pre-Post Design
# set.seed() provides the same results from rnorm() each time
set.seed(4321)
X = rep(c(-.5, .5), 50)
Pre = rnorm(length(X), 100, 10)
Post = 50 + 10*X + .5*Pre + rnorm(length(X), 0, 15)
d = data.frame(X=X, Pre=Pre, Post=Post)

# add two more points for demonstration purposes
d[101,] = c(.5, max(Pre), max(Post)+10)
d[102,] = c(-.5, min(Pre), max(Post)+10)

Regression Outliers: Studentized Residuals
> modelCaseAnalysis(m, 'Residuals')

Overall Model Influence: Cook's D
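The slide shows only the resulting plot; following the pattern used later for the mDiff model, the call was presumably:

modelCaseAnalysis(m, 'Cooksd')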

Specific Influence: DFBetas
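Again, by the pattern shown later in the deck, presumably:

modelCaseAnalysis(m, 'DFBETAS')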

Specific Influence: Added Variable Plots
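The deck does not show the call that produced the added-variable plots; one way to generate them (an assumption, using the car package rather than a confirmed lmSupport call) is:

car::avPlots(m)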

Updated Results without Outlier (Case 102)
> d1 = dfRemoveCases(d, c(102))
> m1 = lm(Post ~ Pre + X, data=d1)
> modelSummary(m1)
lm(formula = Post ~ Pre + X, data = d1)
Observations: 101

Linear model fit by least squares

Coefficients:
            Estimate      SE     t  Pr(>|t|)
(Intercept)  29.1225 16.0047 1.820  0.071870 .
Pre           0.7062  0.1572 4.494 0.0000192 ***
X            11.5260  2.8973 3.978  0.000133 ***
---
Sum of squared errors (SSE): 20376.9, Error df: 98
R-squared: 0.2986

Normality of Residuals
> modelAssumptions(m, 'Normal')

Constant Variance
> modelAssumptions(m, 'Constant')
Suggested power transformation: 0.6214171

Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.2695912, Df = 1, p = 0.6036062

Constant Variance

Linearity
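Presumably, by the pattern shown later for mDiff1:

modelAssumptions(m, 'Linear')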

Next Example: Traditional Repeated Measures
- Beverage Group (BG): Alcohol vs. No Alcohol (between subjects)
- Threat: Shock vs. Safe (within subjects)
- DV is startle response
- Prediction is a BG X Threat interaction, with the threat effect smaller in the Alcohol than the No Alcohol group (i.e., a stress-response-dampening effect of alcohol)

Make Repeated Measures Data
set.seed(11112)
N = 50
Obs = 2
SubID = rep(1:N, each=Obs)
BG = rep(c(-.5, .5), each=N)
Threat = rep(c(-.5, .5), N)
STL = (100 + rep(rnorm(N,0,20), each=Obs)) +
      ((30 + rep(rnorm(N,0,3), each=Obs)) * Threat) +
      (0*BG) + (-10*BG*Threat) + rnorm(N*Obs, 0, 5)
dL = data.frame(SubID=SubID, BG=BG, Threat=Threat, STL=STL)

# SubID 51: big startler overall
dL[101,] = c(51, .5, .5, 500)
dL[102,] = c(51, .5, -.5, 475)
# SubID 52: atypical Threat effect (Shock lower than Safe)
dL[103,] = c(52, .5, .5, 50)
dL[104,] = c(52, .5, -.5, 150)

The Data in Long Format
> head(dL,30)
   SubID   BG Threat       STL
1      1 -0.5   -0.5 124.99454
2      1 -0.5    0.5 156.34272
3      2 -0.5   -0.5  73.46904
4      2 -0.5    0.5 102.12539
5      3 -0.5   -0.5  96.47544
6      3 -0.5    0.5 146.70324
7      4 -0.5   -0.5  41.54630
8      4 -0.5    0.5  59.05271
9      5 -0.5   -0.5  62.43265
10     5 -0.5    0.5 105.45589
11     6 -0.5   -0.5  98.29484
12     6 -0.5    0.5 122.97132
13     7 -0.5   -0.5  70.28099
14     7 -0.5    0.5 126.56834
15     8 -0.5   -0.5  80.86480
16     8 -0.5    0.5 111.21301
17     9 -0.5   -0.5  92.98044
. . .

The Data in Long Format
> tail(dL,30)
   SubID  BG Threat       STL
75    38 0.5   -0.5  62.79062
76    38 0.5    0.5  92.02443
77    39 0.5   -0.5  95.44343
78    39 0.5    0.5 121.66470
79    40 0.5   -0.5  77.00937
80    40 0.5    0.5  87.00877
81    41 0.5   -0.5  71.92629
82    41 0.5    0.5  89.20128
83    42 0.5   -0.5  99.17696
84    42 0.5    0.5 132.61511
85    43 0.5   -0.5  96.74501
86    43 0.5    0.5 124.47742
87    44 0.5   -0.5  98.62467
88    44 0.5    0.5 120.02696
89    45 0.5   -0.5  93.26754
90    45 0.5    0.5 116.84759
91    46 0.5   -0.5 118.22392
. . .

Cast to Wide Format
dW = dcast(data=dL, formula=SubID+BG~Threat, value.var='STL')
dW = varRename(dW, c(-.5, .5), c('Safe', 'Shock'))

> head(dW,20)
   SubID   BG      Safe     Shock
1      1 -0.5 124.99454 156.34272
2      2 -0.5  73.46904 102.12539
3      3 -0.5  96.47544 146.70324
4      4 -0.5  41.54630  59.05271
5      5 -0.5  62.43265 105.45589
6      6 -0.5  98.29484 122.97132
7      7 -0.5  70.28099 126.56834
8      8 -0.5  80.86480 111.21301
9      9 -0.5  92.98044 127.07305
10    10 -0.5 103.90452 143.72773
. . .
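For the cast to run, dcast() must be on the search path; it comes from the reshape2 package, and varRename() (like the modelSummary(), modelCaseAnalysis(), and modelAssumptions() functions used throughout) comes from lmSupport. A minimal setup, assuming these are the packages the deck relies on:

library(reshape2)   # dcast(): long-to-wide reshaping
library(lmSupport)  # varRename(), modelSummary(), modelCaseAnalysis(), modelAssumptions()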

How can you test for the BG X Threat interaction using LM?
- Calculate the Shock - Safe difference
- Regress this difference on BG
- The test of the BG parameter estimate is the BG X Threat interaction

> dW$STLDiff = dW$Shock - dW$Safe
> mDiff = lm(STLDiff ~ BG, data=dW)
> modelSummary(mDiff)
lm(formula = STLDiff ~ BG, data = dW)
Observations: 52

Linear model fit by least squares

Coefficients:
            Estimate    SE      t           Pr(>|t|)
(Intercept)   27.912 2.647 10.545 0.0000000000000261 ***
BG           -16.403 5.294 -3.098            0.00319 **
---
Sum of squared errors (SSE): 18188.9, Error df: 50
R-squared: 0.1611

What does the intercept test?

> dW$STLAvg = (dW$Shock + dW$Safe)/2
> mAvg = lm(STLAvg ~ BG, data=dW)
> modelSummary(mAvg)
lm(formula = STLAvg ~ BG, data = dW)
Observations: 52

Linear model fit by least squares

Coefficients:
            Estimate     SE      t Pr(>|t|)
(Intercept)  108.247  7.903 13.698   <2e-16 ***
BG            11.553 15.805  0.731    0.468
---
Sum of squared errors (SSE): 162130.6, Error df: 50
R-squared: 0.0106

Your experiment focused on the BG X Threat interaction. You want to report the test of that parameter estimate with confidence. What do you do?
Evaluate model assumptions and do case analysis on the mDiff model. This is just a simple linear regression in LM; you know exactly what to do.

Regression Outliers: Studentized Residuals
modelCaseAnalysis(mDiff, 'Residuals')

Model Influence: Cook's D
modelCaseAnalysis(mDiff, 'Cooksd')

Specific Influence: DFBetas
modelCaseAnalysis(mDiff, 'DFBETAS')

Added Variable Plots
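As above, one way to produce added-variable plots for this model (an assumption, via the car package) is:

car::avPlots(mDiff)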

> dW1 = dfRemoveCases(dW, c(52))
> dim(dW1)
[1] 51  6
> mDiff1 = lm(STLDiff ~ BG, data=dW1)
> modelSummary(mDiff1)
lm(formula = STLDiff ~ BG, data = dW1)
Observations: 51

Linear model fit by least squares

Coefficients:
            Estimate    SE      t Pr(>|t|)
(Intercept)   30.214 1.151 26.260  < 2e-16 ***
BG           -11.798 2.301 -5.127 0.000005 ***
---
Sum of squared errors (SSE): 3307.0, Error df: 49
R-squared: 0.3492

Normality
modelAssumptions(mDiff1, 'Normal')

modelAssumptions(mDiff1, 'Constant')

Non-constant Variance Score Test
Chisquare = 6.264605, Df = 1, p = 0.01231736

Component + Residual Plot
modelAssumptions(mDiff1, 'Linear')

What about the mAvg model? Should you evaluate model assumptions and conduct case analysis on that model?
- That model has no effect at all on the parameter estimate that tests the BG X Threat interaction.
- If you are interested in reporting and interpreting the effects from that model (the main effect of BG on startle; the mean startle response), then you should of course evaluate it. If not, it is not necessary.
- You will find SubID 51 to be unusual in that model.
- Classic ANOVA bound these models together in its results. As such, if you evaluate both, readers will probably want to see the same sample used in both models.

Classic Repeated Measures in LMER
m = lmer(STL ~ BG*Threat + (1 + Threat | SubID), data=dL,
         control=lmerControl(check.nobs.vs.nRE="ignore"))
(With only two observations per subject and two random effects per subject, lmer would normally refuse to fit this model; the control argument suppresses that check.)

Classic Repeated Measures in LMER
modelSummary(m)
Observations: 104; Groups: SubID, 52

Linear mixed model fit by REML

Fixed Effects:
            Estimate     SE        F error df             Pr(>F)
(Intercept)  108.247  7.903 187.6282       50            < 2e-16
BG            11.553 15.805   0.5343       50            0.46823
Threat        27.912  2.647 111.2028       50 0.0000000000000261
BG:Threat    -16.403  5.294   9.6004       50            0.00319
---
NOTE: F, error df, and p-values from Kenward-Roger approximation

Random Effects:
 Groups   Name        Std.Dev. Corr
 SubID    (Intercept) 56.1461
          Threat       1.6893  0.989
 Residual             13.4336

AIC: 1011.5; BIC: 1032.6; logLik: -497.7; Deviance: 995.5

Case Analysis and Model Assumptions in LMER
Standards are not as well established for LMEM as for linear models. We will follow the recommendations from Loy & Hofmann (2014).
- Can examine residuals at both levels of the model (eij at level 1, and the random effects at level 2)
- Can use these residuals to check for normality, constant variance, and linearity
- Level 2 residuals can also identify model outliers with respect to the fixed effects
- Can calculate influence statistics (Cook's D)
...BUT it is a bit complicated at this point. Stay tuned for integration into lmSupport. The setup sketch below lists the packages used on the following slides.
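The diagnostics that follow use HLMresid(), cooks.distance() for merMod objects, dotplot_diag(), and ggplot_qqnorm(); these come from the HLMdiag package, which implements the Loy & Hofmann recommendations (varPlot() is from lmSupport). A minimal setup, assuming those packages:

library(lme4)     # lmer()
library(HLMdiag)  # HLMresid(), cooks.distance() for merMod, dotplot_diag(), ggplot_qqnorm()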

Examine level 1 residuals using LS residuals (basically, residuals from fitting an OLS regression within each subject):
> resid1 = HLMresid(m, level = 1, type = "LS", standardize = TRUE)
> head(resid1)
        STL   BG Threat SubID LS.resid    fitted std.resid
1 124.99454 -0.5   -0.5     1        0 124.99454       NaN
2 156.34272 -0.5    0.5     1        0 156.34272       NaN
3  73.46904 -0.5   -0.5     2        0  73.46904       NaN
4 102.12539 -0.5    0.5     2        0 102.12539       NaN
5  96.47544 -0.5   -0.5     3        0  96.47544       NaN
6 146.70324 -0.5    0.5     3        0 146.70324       NaN
(With only two observations per subject, the within-subject OLS fit is saturated, so the LS residuals are all 0 and the standardized residuals are NaN; these level 1 diagnostics become informative in designs with more observations per subject, as in the final example below.)

Examine level 2 residuals using the Empirical Bayes approach (the default). These are simply the random effects (ranef) for each subject from the model.
> resid2 = HLMresid(object = m, level = "SubID")
> head(resid2)
  (Intercept)     Threat
1   37.092485  1.1026289
2  -14.316096 -0.4271915
3   18.684571  0.5583121
4  -50.843132 -1.5157798
5  -17.957169 -0.5329992
6    7.850774  0.2315537

Level 2 Outlier for Intercept
varPlot(resid2[,1], VarName = 'Intercept', IDs = rownames(resid2))

Level 2 Outlier for Threat
varPlot(resid2$Threat, VarName = 'Threat', IDs = rownames(resid2))

Influence: Cook's D
cooksd <- cooks.distance(m, group = "SubID")
varPlot(cooksd, IDs = rownames(resid2))

dotplot_diag(x = cooksd, cutoff = "internal", name = "cooks.distance") +
  ylab("Cook's distance") + xlab("SubID")

Quantifying Parameter Estimate Change
beta_51 = as.numeric(attr(cooksd, "beta_cdd")[[51]])
names(beta_51) <- names(fixef(m))
beta_52 = as.numeric(attr(cooksd, "beta_cdd")[[52]])
names(beta_52) <- names(fixef(m))

> fixef(m)
(Intercept)          BG      Threat   BG:Threat
  108.24711    11.55264    27.91231   -16.40255

> beta_51  # this subject increases the intercept and BG estimates
  7.1822417  14.3644833   0.1017110   0.2034219

> beta_52  # this subject decreases Threat and makes the interaction more negative
 -0.2696814  -0.5393629  -2.3021352  -4.6042704

Normality for Random Intercept
ggplot_qqnorm(x = resid2[,1], line = "rlm")

Normality for Random Threat
ggplot_qqnorm(x = resid2$Threat, line = "rlm")

TRUE LMEM
- Threat: Safe vs. Shock
- Trials 1-20
- DV is startle
- Focus on the Trial X Threat interaction

set.seed(11111)
N = 50
Obs = 20
SubID = rep(1:N, each=Obs)
Trial = rep(1:Obs, N)
cTrial = Trial - mean(Trial)
Threat = rep(c(-.5, .5), Obs/2*N)
STL = (100 + rep(rnorm(N,0,20), each=Obs)) +
      ((30 + rep(rnorm(N,0,3), each=Obs)) * Threat) +
      ((-2 + rep(rnorm(N,0,.3), each=Obs)) * cTrial) +
      ((0 + rep(rnorm(N,0,.3), each=Obs)) * cTrial * Threat) +
      rnorm(N*Obs, 0, 3)
dL = data.frame(SubID=SubID, Trial=Trial, Threat=Threat, STL=STL)

> head(dL,30)
   SubID Trial Threat       STL
1      1     1   -0.5 105.48386
2      1     2    0.5 130.64910
3      1     3   -0.5 100.81342
4      1     4    0.5 124.40595
5      1     5   -0.5  94.58567
6      1     6    0.5 119.85951
7      1     7   -0.5  93.41479
8      1     8    0.5 118.14759
9      1     9   -0.5  84.00524
10     1    10    0.5 118.34994
11     1    11   -0.5  84.92568
12     1    12    0.5 117.56699
13     1    13   -0.5  76.13048
14     1    14    0.5 104.35991
15     1    15   -0.5  81.41247
16     1    16    0.5 105.11951
17     1    17   -0.5  77.43389
18     1    18    0.5 103.71358
19     1    19   -0.5  70.36786
20     1    20    0.5  89.07487
21     2     1   -0.5 119.23531
22     2     2    0.5 151.81395
23     2     3   -0.5 119.06253
24     2     4    0.5 140.03145
25     2     5   -0.5 112.82707
. . .
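Note that the deck does not show the model refit for this design before the diagnostics below. Given that the level 2 residuals that follow have columns for the intercept, Threat, cTrial, and Threat:cTrial, the model was presumably something like the following (the exact specification is an assumption, not shown on the slides):

m = lmer(STL ~ Threat*cTrial + (1 + Threat*cTrial | SubID), data=dL)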

Level 1 Residuals: Normality
resid1 = HLMresid(m, level = 1, type = "LS", standardize = TRUE)
ggplot_qqnorm(x = resid1$std.resid, line = "rlm")

Constant Variance
plot(resid1$fitted, resid1$std.resid)

Linearity
plot(dL$Trial, resid1$std.resid)

Level 1 Outliers
varPlot(resid1$std.resid)

Level 2: Normal Random Intercept
resid2 <- HLMresid(object = m, level = "SubID")
ggplot_qqnorm(x = resid2[,1], line = "rlm")

Level 2: Normal Random Threat
ggplot_qqnorm(x = resid2$Threat, line = "rlm")

Level 2: Normal Random cTrial
ggplot_qqnorm(x = resid2$cTrial, line = "rlm")

Level 2: Normal Random Threat:cTrial
ggplot_qqnorm(x = resid2[,4], line = "rlm")

Level 2 Outliers: cTrial:Threat
varPlot(resid2[,4], VarName = 'cTrial:Threat', IDs = rownames(resid2))

Influence: Cook's D
cooksd <- cooks.distance(m, group = "SubID")
varPlot(cooksd, IDs = rownames(cooksd))

Cook's D
dotplot_diag(x = cooksd, cutoff = "internal", name = "cooks.distance") +
  ylab("Cook's distance") + xlab("SubID")