
Unbalanced design, relative contribution of IVs, and Type I and Type III SS
Xuhua Xia
Department of Biology, University of Ottawa
http://dambe.bio.uottawa.ca

A Test

A researcher wishes to know how weight gain (WtGain) depends on Gender (male, female) and Food (LoFat, HiFat). He ran the experiment in a two-way ANOVA design and reported the effect sizes and significance tests:

Effect size: mean WtGain is 3.3333 for males and 4.6667 for females; mean WtGain is 2 for LoFat and 6 for HiFat.
Significance tests: the Gender and Food effects are both significant (p < 0.0001 for both; see the ANOVA table below).

Analysis of Variance Table
Response: WtGain
            Df  Sum Sq Mean Sq F value    Pr(>F)
Gender       1  13.333  13.333  28.889 1.253e-05 ***
Food         1 106.667 106.667 231.111 1.883e-14 ***
Gender:Food  1   0.000   0.000   0.000         1
Residuals   26  12.000   0.462

Are you convinced that both the Gender and the Food effects on WtGain are highly significant?

Treatment of kidney stones

Treatment A: all open procedures
Treatment B: percutaneous nephrolithotomy
Question: which treatment is better?

            Treat A  Treat B
Success       273      289
Failure        77       61
Subtotal      350      350
% Success      78    82.57

C. R. Charig et al. 1986. Br Med J (Clin Res Ed) 292 (6524): 879-882. (I modified some numbers on the following slides to facilitate teaching.)

Treatment B better than A?

Observed data:

            TreatA  TreatB  Row Sum
Success        276     299      575
Failure         74      51      125
Col Sum        350     350      700
% Success    78.86   85.43

Statistic                      Value  DF   Prob.
------------------------------------------------
Chi-square                      4.71   1  0.0299
Likelihood ratio chi-square     4.73   1  0.0296
Phi                           0.0821
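
This test can be reproduced in R; a minimal sketch, with the pooled counts entered from the table above (chisq.test applies Yates' continuity correction to a 2x2 table by default, which matches the reported 4.71):

pooled <- matrix(c(276, 74, 299, 51), nrow = 2,
                 dimnames = list(c("Success", "Failure"), c("TreatA", "TreatB")))
chisq.test(pooled)  # X-squared = 4.71, df = 1, p = 0.0299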

Simpson's paradox (equivalent to an unbalanced design in ANOVA)

Stone size              Treat A  Treat B
Small    Success           81      244
         Failure            6       31
         Subtotal          87      275
         % Success      93.10    88.73
Large    Success          195       55
         Failure           68       20
         Subtotal         263       75
         % Success      74.14    73.33
Pooled   Success          276      299
         Failure           74       51
         Subtotal         350      350
         % Success      78.86    85.43

This example is from a study of the efficacy of two treatments for kidney stones. In each stone-size category, the subtotal is the number of patients and the percentage is the success rate (e.g., 81 successes out of 87 small-stone patients on treatment A, or 93.10%). Treatment A is more efficacious than treatment B in both the small-stone group and the large-stone group; yet if we pool the two groups, treatment B has the greater success rate than treatment A. We would thus draw a wrong conclusion if we failed to consider the confounding effect of stone size.

But can we now conclude that treatment A is better than treatment B? Such a conclusion would be highly significant, because it could guide our choice of treatment if we happened to have a kidney stone. Unfortunately, we cannot draw it, because the success rates of both treatments change over time. We can only say that treatment A was better than treatment B at the time of data collection, which provides no guidance today. Such a conclusion, albeit scientifically correct, seems trivial. A correct conclusion is often trivial, while a potentially wrong generalization (that treatment A is better than treatment B) appears much more significant. So if you want your conclusions to be highly significant, don't be too correct, because they will then be trivial.
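
A sketch of the same reversal in R, entering the stratified counts from the table above as a 2x2x2 array:

stones <- array(c(81, 6, 244, 31,      # small stones: Treat A, then Treat B
                  195, 68, 55, 20),    # large stones: Treat A, then Treat B
                dim = c(2, 2, 2),
                dimnames = list(Outcome   = c("Success", "Failure"),
                                Treatment = c("A", "B"),
                                Size      = c("Small", "Large")))
# Success rate within each Treatment-by-Size stratum: A wins in both strata
prop.table(stones, margin = c(2, 3))["Success", , ]
# Success rate after pooling over stone size: B appears to win
prop.table(apply(stones, c(1, 2), sum), margin = 2)["Success", ]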

Why a systems biology perspective?

In his correct but somewhat awkward English, Fisher wrote: "No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or ideally, one question at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will respond to a logical and carefully thought-out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed."
--Ronald A. Fisher (1926). Journal of the Ministry of Agriculture of Great Britain 33: 503-513

In short, you should consider everything that is relevant. Fisher's statement did not come out of a vacuum. By his time, much data involving unbalanced experimental designs and multi-factor interactions had accumulated, and one is prone to draw wrong conclusions if one does not use balanced factorial designs and does not think broadly and critically. The kidney-stone example above (Simpson's paradox) is one real data set that illustrates the point.

Two-way ANOVA

Balanced design (long format):

WtGain  Gender  Food
1       Male    LoFat
2       Male    LoFat
3       Male    LoFat
1       Female  LoFat
2       Female  LoFat
3       Female  LoFat
5       Male    HiFat
6       Male    HiFat
7       Male    HiFat
5       Female  HiFat
6       Female  HiFat
7       Female  HiFat

Balanced design (cell layout):

        LoFat   HiFat
Male    1,2,3   5,6,7
Female  1,2,3   5,6,7

Unbalanced design: the same layout, but some animals died, leaving unequal numbers of observations per cell. (With three model df and 26 residual df in the ANOVA table that follows, the unbalanced data set contains 30 animals.)

Analysis in R

nd <- read.table("WtGain.txt", header=T)
attach(nd)
fitANOVA <- aov(WtGain~Gender*Food)
anova(fitANOVA)

Analysis of Variance Table
Response: WtGain
            Df  Sum Sq Mean Sq F value    Pr(>F)
Gender       1  13.333  13.333  28.889 1.253e-05 ***
Food         1 106.667 106.667 231.111 1.883e-14 ***
Gender:Food  1   0.000   0.000   0.000         1
Residuals   26  12.000   0.462

fitLM <- lm(WtGain~Gender*Food)
anova(fitLM)  # produces the same table

Models

y = a + b1x1 + b2x2 + b3x3 + ...

The total SS in y (SST) splits into the model SS (SSM) and the error SS (SSE). How do we properly evaluate the relative contributions of the independent variables (IVs) to SSM?

Different Types of SS

SS stands for the sum of squared deviations; the variance is the mean SS (i.e., MS). Most statistical analyses are about explaining the SS in the dependent variable (DV) by independent variables (IVs). The SS in the DV is designated SST (the total variation in the DV). The part of SST that can be explained by the IVs is SSM (the model SS). SSM/SST is the proportion of variation in the DV explained by the IVs and equals R2.

When there is more than one IV, we need to know the relative contribution of each IV to SSM for a given model. Two frequently encountered types of SS:
Type I SS (sequential SS)
Type III SS (partial or unique SS)
Numerical illustrations follow, first for ANOVA and then for regression.

SST, SSM & SSE in 1-way ANOVA

Food      y  Predicted  SST  SSM  SSE
LoFat     1      2       25   16    1
LoFat     3      2        9   16    1
MedFat    5      6        1    0    1
MedFat    7      6        1    0    1
HiFat     9     10        9   16    1
HiFat    11     10       25   16    1

Grand mean = 6; column sums: SST = 70, SSM = 64, SSE = 6. Each SST entry is (y - grand mean)^2, each SSM entry is (predicted - grand mean)^2, and each SSE entry is (y - predicted)^2, where the predicted value is the group mean.
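
A minimal R sketch verifying this decomposition, using the six observations in the table above:

y    <- c(1, 3, 5, 7, 9, 11)
food <- factor(rep(c("LoFat", "MedFat", "HiFat"), each = 2))
fit  <- aov(y ~ food)
sum((y - mean(y))^2)            # SST = 70
sum((fitted(fit) - mean(y))^2)  # SSM = 64
deviance(fit)                   # SSE = 6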

SST, SSM & SSE in Regression

X  Y
1  2
1  3
1  4
1  5
2  3
2  4
2  5
2  6
3  4
3  5
3  6
3  7

SSTotal = Σ(Y - Ȳ)², SSModel = Σ(Ŷ - Ȳ)², SSError = Σ(Y - Ŷ)², where Ŷ is the value predicted by the regression line.
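
A minimal sketch computing the three SS for these twelve points in R:

x   <- rep(1:3, each = 4)
y   <- c(2, 3, 4, 5,  3, 4, 5, 6,  4, 5, 6, 7)
fit <- lm(y ~ x)
c(SST = sum((y - mean(y))^2),           # 23
  SSM = sum((fitted(fit) - mean(y))^2), #  8
  SSE = deviance(fit))                  # 15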

Partition of Variance in Regression

Σ(Yi - Ȳ)² = Σ(Ŷi - Ȳ)² + Σ(Yi - Ŷi)² + 2Σ(Ŷi - Ȳ)(Yi - Ŷi)

The summation of the last (cross-product) term is zero, so SST = SSM + SSE. With more than one IV, we need to evaluate the relative contribution of each IV to SSM.

Relative contributions of IVs

If the IVs are not correlated, their respective contributions to SSM are unique and not shared.
If the IVs are correlated, a fraction of SST that can be explained by one IV may also be explained by another IV.
Type I SS and Type III SS are two different measures of the relative contribution of IVs to SSM.

Type I SS

Given a model y = a + b1x1 + b2x2 + b3x3 + ..., imagine that you fit a series of models:

Model 1: y = a + b1x1                    SSM1
Model 2: y = a + b1x1 + b2x2             SSM2
Model 3: y = a + b1x1 + b2x2 + b3x3      SSM3
...

Type I SS:
SSM(x1) = SSM1
SSM(x2) = SSM2 - SSM1
SSM(x3) = SSM3 - SSM2
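
A minimal sketch of this definition in R; df is a hypothetical data frame with columns y, x1 and x2, and SSM() extracts the model SS of a fit:

SSM <- function(fit) sum((fitted(fit) - mean(fit$model[[1]]))^2)
m1  <- lm(y ~ x1, data = df)
m2  <- lm(y ~ x1 + x2, data = df)
SSM(m1)            # Type I SS for x1
SSM(m2) - SSM(m1)  # Type I SS for x2, i.e. SSM(x2 | x1)
# These match the first two Sum Sq entries of anova(m2)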

Type III SS

Given a model y = a + b1x1 + b2x2 + b3x3 + ..., imagine that you fit a series of models:

Model 1: y = a + b2x2 + b3x3 + ...  (without x1)   SSM1
Model 2: y = a + b1x1 + b3x3 + ...  (without x2)   SSM2
Model 3: y = a + b1x1 + b2x2 + ...  (without x3)   SSM3
Full model: y = a + b1x1 + b2x2 + b3x3 + ...       SSM

Type III SS:
SSM(x1) = SSM - SSM1
SSM(x2) = SSM - SSM2
SSM(x3) = SSM - SSM3
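
The corresponding sketch for Type III SS, with the same hypothetical df and the SSM() helper repeated so the block stands alone:

SSM   <- function(fit) sum((fitted(fit) - mean(fit$model[[1]]))^2)
full  <- lm(y ~ x1 + x2, data = df)
no_x1 <- lm(y ~ x2, data = df)
no_x2 <- lm(y ~ x1, data = df)
SSM(full) - SSM(no_x1)  # Type III SS for x1, i.e. SSM(x1 | x2)
SSM(full) - SSM(no_x2)  # Type III SS for x2, i.e. SSM(x2 | x1)
# drop1(full, test = "F") gives the same SS; so does car::Anova(full, type = 3)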

Type I and Type III SS

y = a + b1x1 + b2x2 + b3x3

(Venn diagram: SST is the whole area; SSM is the area covered by three overlapping circles for x1, x2 and x3; the rest is SSE. u1, u2 and u3 are the portions unique to x1, x2 and x3; s12, s13 and s23 are the portions shared by two IVs; s123 is shared by all three.)

Type I SS (x1 entered first, then x2, then x3):
SSM(x1) = u1 + s12 + s13 + s123
SSM(x2) = u2 + s23
SSM(x3) = u3

Type III SS:
SSM(x1) = u1
SSM(x2) = u2
SSM(x3) = u3

If x1, x2 and x3 are not correlated, then Type I SS = Type III SS.
If x1, x2 and x3 are perfectly correlated, then Type III SS = 0.

IVs uncorrelated

x1  x2       y
 1   1  1.9978
 1   2  2.9596
 1   3  4.0385
 1   4  4.9709
 2   1  3.0045
 2   2  4.0100
 2   3  5.0259
 2   4  6.0105
 3   1  4.0628
 3   2  4.9079
 3   3  5.9617
 3   4  6.9273

The data were generated with y = x1 + x2 + ε.
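
To reproduce the analysis on the next slide without an external file, the twelve rows can be typed in directly:

x1 <- rep(1:3, each = 4)
x2 <- rep(1:4, times = 3)
y  <- c(1.9978, 2.9596, 4.0385, 4.9709, 3.0045, 4.0100, 5.0259, 6.0105,
        4.0628, 4.9079, 5.9617, 6.9273)
cor(x1, x2)  # exactly 0: the IVs are uncorrelated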

Significance Test in Regression

fit12 <- lm(y~x1+x2)
fit21 <- lm(y~x2+x1)
anova(fit12)

            Df  Sum Sq Mean Sq F value    Pr(>F)
x1           1  7.7872  7.7872  3567.3 5.210e-13 ***
x2           1 14.6811 14.6811  6725.4 3.018e-14 ***
Residuals    9  0.0196  0.0022

anova(fit21)  # same SS for x1 and x2, just in the other order
summary(fit12)
summary(fit21)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.04327    0.04672   0.926    0.378
x2           0.98931    0.01206  82.009 3.02e-14 ***
x1           0.98661    0.01652  59.727 5.21e-13 ***

The SS for x1 and the SS for x2 do not change with the order of the IVs in the model: Type I SS = Type III SS.

Regress y on x1

Note that b = 0.9866 and SSM = 7.7872. These are the same as when the model also includes x2: the slope for x1 and the variation in y that x1 can explain are not affected by the presence of x2 when x1 and x2 are uncorrelated.

Regress y on x2

Note that b = 0.9893 and SSM = 14.6811. These are the same as when the model also includes x1: the slope for x2 and the variation in y that x2 can explain are not affected by the presence of x1 when x1 and x2 are uncorrelated.

When IVs are not correlated

the variation in y attributable to x1 is independent of the variation in y attributable to x2;
the coefficients of determination, r2, for models each incorporating a single x add up to the r2 value for the model incorporating all x variables;
Type I and Type III SS are equal;
the slope estimate for each x remains the same no matter how many IVs the model includes.

When IVs are correlated

x1  x2       y
 1   1  1.9869
 1   1  2.0611
 1   1  1.9961
 1   1  1.9063
 1   1  1.9434
 1   2  3.0220
 1   3  3.9657
 1   4  4.9659
 2   1  2.9080
 2   2  3.9124
 2   3  5.0960
 2   4  5.9487
 3   1  4.0217
 3   2  4.9465
 3   3  6.0913
 3   4  6.9446
 3   4  7.0364
 3   4  7.0185
 3   4  7.0429
 3   4  6.9731

The data were generated with y = x1 + x2 + ε. Correlated IVs are the regression equivalent of an unbalanced design in ANOVA.
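
Again, the twenty rows can be typed in directly so the analysis on the next slide is reproducible:

x1 <- c(1,1,1,1,1,1,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3)
x2 <- c(1,1,1,1,1,2,3,4, 1,2,3,4, 1,2,3,4,4,4,4,4)
y  <- c(1.9869, 2.0611, 1.9961, 1.9063, 1.9434, 3.0220, 3.9657, 4.9659,
        2.9080, 3.9124, 5.0960, 5.9487, 4.0217, 4.9465, 6.0913, 6.9446,
        7.0364, 7.0185, 7.0429, 6.9731)
cor(x1, x2)  # about 0.52: the IVs are correlated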

When IVs are correlated

sum((y-mean(y))^2)  # total SS = 74.11052

fit12 <- lm(y~x1+x2)
fit21 <- lm(y~x2+x1)
anova(fit12)

            Df Sum Sq Mean Sq F value    Pr(>F)
x1           1 49.800  49.800 14220.8 < 2.2e-16 ***
x2           1 24.251  24.251  6925.2 < 2.2e-16 ***
Residuals   17  0.060   0.004

anova(fit21)

x2           1 62.173  62.173 17754.2 < 2.2e-16 ***
x1           1 11.878  11.878  3391.8 < 2.2e-16 ***

summary(fit12)

            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04429    0.03459  -1.281    0.218
x1           1.01031    0.01735  58.239   <2e-16 ***
x2           1.00522    0.01208  83.218   <2e-16 ***

The SS for x1 and the SS for x2 now change with the order of the IVs in the model.

Regress y on x1

Note that b = 1.7642, quite different from 1.01031 when x2 is also in the model.

Regress y on x2

Note that b = 1.3726, quite different from 1.00522 when x1 is also in the model.

Relative Contributions

Total SS       74.11052
SS(x1)         49.800   (Type I SS for x1, entered first)
SS(x2|x1)      24.251   (Type III SS for x2)
SS(x2)         62.173   (Type I SS for x2, entered first)
SS(x1|x2)      11.878   (Type III SS for x1)
Shared         49.800 - 11.878 = 37.922

In a Venn diagram, the circle for y = a + b*x1 covers 11.878 + 37.922 and the circle for y = a + b*x2 covers 37.922 + 24.251; the shared 37.922 can be explained by either IV. In stepwise regression, the Type III SS often determines whether a variable should be included in the regression equation.

Two-way ANOVA (unbalanced design, revisited)

The same weight-gain data as before: some animals died, so the Gender-by-Food cells have unequal numbers of observations.

Analysis in R: the unbalanced design

nd <- read.table("WtGain.txt", header=T)
attach(nd)
fitANOVA <- aov(WtGain~Food*Gender)
anova(fitANOVA)

Analysis of Variance Table
Response: WtGain
            Df Sum Sq Mean Sq F value    Pr(>F)
Food         1    120 120.000     260 4.692e-15 ***
Gender       1      0   0.000       0         1
Food:Gender  1      0   0.000       0         1
Residuals   26     12   0.462

With Food entered first, Food receives a Type I SS of 120 and Gender receives 0; with Gender entered first (as in the earlier analysis), Gender received 13.333 and Food 106.667. In an unbalanced design, the sequential (Type I) SS depend on the order of the IVs in the model.
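
WtGain.txt itself is not shown in the slides, but a data set consistent with every number in the two ANOVA tables can be sketched. The cell sizes 10/5/5/10 follow from the residual df of 26 and the dummy correlation of 0.333 on the next slide; the individual values within cells are one possibility among several:

nd <- data.frame(
  WtGain = c(rep(1:3, 3), 2,      # 10 males on LoFat   (cell mean 2)
             rep(2, 5),           #  5 females on LoFat (cell mean 2)
             rep(6, 5),           #  5 males on HiFat   (cell mean 6)
             rep(5:7, 3), 6),     # 10 females on HiFat (cell mean 6)
  Gender = rep(c("Male", "Female", "Male", "Female"), c(10, 5, 5, 10)),
  Food   = rep(c("LoFat", "HiFat"), c(15, 15))
)
anova(aov(WtGain ~ Gender * Food, data = nd))  # Gender 13.333, Food 106.667
anova(aov(WtGain ~ Food * Gender, data = nd))  # Food 120, Gender 0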

Two-way ANOVA: dummy coding

Gender and Food can be recoded as dummy variables (D_Male = 1 for Male, 0 for Female; D_LoFat = 1 for LoFat, 0 for HiFat):

WtGain  Gender  Food   D_Male  D_LoFat
1       Male    LoFat    1       1
2       Male    LoFat    1       1
3       Male    LoFat    1       1
1       Female  LoFat    0       1
2       Female  LoFat    0       1
3       Female  LoFat    0       1
5       Male    HiFat    1       0
6       Male    HiFat    1       0
7       Male    HiFat    1       0
5       Female  HiFat    0       0
6       Female  HiFat    0       0
7       Female  HiFat    0       0

In the balanced design the two dummy variables are uncorrelated; in the unbalanced design they are correlated: r(D_Male, D_LoFat) = 0.333. An unbalanced ANOVA design is thus equivalent to a regression with correlated IVs.
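
A quick check that the cell sizes inferred on the previous slide give the dummy correlation reported here:

D_Male  <- rep(c(1, 0, 1, 0), c(10, 5, 5, 10))  # 10 M-LoFat, 5 F-LoFat, 5 M-HiFat, 10 F-HiFat
D_LoFat <- rep(c(1, 1, 0, 0), c(10, 5, 5, 10))
cor(D_Male, D_LoFat)  # 0.333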

Why a systems biology perspective? (revisited)

"...if we ask her a single question, she will often refuse to answer until some other topic has been discussed." --Ronald A. Fisher (1926)

Here is one more data set that illustrates Fisher's point.

Does y depend on x1?

x1  x2  y
 1   4  5
 1   5  6
 1   6  7
 2   6  8
 2   7  9
 3   4  7
 3   3  6
 3   2  5

"she will often refuse to answer until some other topic has been discussed"

The correlation between x1 and y is 0. The relationship between x1 and y is revealed only when x2 is included (here y = x1 + x2):

            Df  Sum Sq Mean Sq    F value    Pr(>F)
x2           1 22.4513 22.4513 5.6655e+30 < 2.2e-16 ***
x1           1  8.6598  8.6598 2.1853e+30 < 2.2e-16 ***
Residuals   15  0.0000  0.0000

We need to look at both Type I and Type III SS to make proper statistical inferences involving more than one IV.
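
A sketch with the eight rows shown above (the printed table's residual df of 15 suggests the original data set had more rows, so the SS will differ, but the suppression effect is the same):

x1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
x2 <- c(4, 5, 6, 6, 7, 4, 3, 2)
y  <- x1 + x2            # identical to the y column above
cor(x1, y)               # exactly 0: x1 alone appears unrelated to y
anova(lm(y ~ x2 + x1))   # x1 explains the remaining SS once x2 is in the model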

Summary

When independent variables are not correlated: be happy. (You can also create uncorrelated "latent" variables with methods such as PCA; e.g., positively correlated variables such as food, hugs, schooling, etc., may constitute a latent "nurture" variable. A sketch follows below.)

When independent variables are correlated:
part of the variation in y cannot be unequivocally attributed to the variation in any particular x;
the coefficients of determination, r2, for models each incorporating a single x will add up to more than the r2 value for the model incorporating all x variables;
Type I and Type III SS are unequal, and both are needed to understand the contribution of the IVs to SSM.

Parameter estimation may be biased if you miss some relevant IVs in your experiment.
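
A minimal sketch of the PCA idea in the first point, using the correlated x1 and x2 from the twenty-observation example:

x1 <- c(1,1,1,1,1,1,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3)
x2 <- c(1,1,1,1,1,2,3,4, 1,2,3,4, 1,2,3,4,4,4,4,4)
pc <- prcomp(cbind(x1, x2), scale. = TRUE)
cor(pc$x[, 1], pc$x[, 2])  # 0 (up to rounding): the components are uncorrelated
# Regressing y on pc$x gives order-independent SS: Type I = Type III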