Published by Farida Widjaja. Modified over 6 years ago.
1
Unbalanced design, relative contribution of IVs, and type I and type III SS
Xuhua Xia, Department of Biology, University of Ottawa
2
A Test
A researcher wishes to know how weight gain (WtGain) depends on GENDER (male and female) and FOOD (LoFat and HiFat). He ran the experiment as a two-way ANOVA design and reported the effect sizes and significance tests:
Effect size: mean WtGain is … for males and … for females; mean WtGain is 2 for LoFat and 6 for HiFat.
Significance tests: the GENDER and FOOD effects are both significant (p < … for both GENDER and FOOD; see the ANOVA table below).

Analysis of Variance Table
Response: WtGain
             Df  Sum Sq  Mean Sq  F value  Pr(>F)
GENDER        1                            e-05 ***
FOOD                                       e-14 ***
GENDER:FOOD   1   0.000    0.000    0.000  1
Residuals    26            0.462

Are you convinced that both the GENDER and FOOD effects on WtGain are highly significant?
3
Treatment of kidney stones
Treatment A: all open procedures
Treatment B: percutaneous nephrolithotomy
Question: which treatment is better?

            Treat A   Treat B
Success        273       289
Failure         77        61
Subtotal       350       350
% Success    78.00     82.57

Source: C. R. Charig et al. (1986) Br Med J (Clin Res Ed) 292 (6524): 879–882. I modified some numbers to facilitate teaching.
4
Treatment B better than A?
Observed data: a 2 x 2 table of Success and Failure counts for TreatA and TreatB, with row sums, column sums and %Success.
Test statistics reported (Value, DF, Prob.): chi-square, likelihood ratio chi-square, and phi.
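The test on this slide can be sketched in R. This is a minimal sketch that assumes the test is run on the pooled counts from the previous slide (273/77 for Treatment A, 289/61 for Treatment B); the slide's own table values are not preserved here, and the object names are illustrative.

```r
# Pooled counts from the "Treatment of kidney stones" slide (assumed input).
counts <- matrix(c(273, 77, 289, 61), nrow = 2,
                 dimnames = list(Outcome = c("Success", "Failure"),
                                 Treatment = c("TreatA", "TreatB")))

# Pearson chi-square test without continuity correction.
chi <- chisq.test(counts, correct = FALSE)

# Likelihood ratio chi-square: G = 2 * sum(O * log(O / E)).
G <- 2 * sum(counts * log(counts / chi$expected))

# Phi coefficient for a 2 x 2 table: sqrt(chi-square / N).
phi <- sqrt(unname(chi$statistic) / sum(counts))

print(chi)
print(c(G = G, phi = phi))
```

A test on the pooled table ignores stone size, so whatever it reports is subject to the confounding discussed on the next slide.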
5
Simpson’s paradox
Equivalent to unbalanced design in ANOVA

Stone size             Treat A   Treat B
Small   Success            81       244
        Failure             6        31
        Subtotal           87       275
        % Success       93.10     88.73
Large   Success           195        55
        Failure            68        20
        Subtotal          263        75
        % Success       74.14     73.33
Pooled  Success           276       299
        Failure            74        51
        Subtotal          350       350
        % Success       78.86     85.43

This example is from a study of the efficacy of two treatments for kidney stones. In the first cell (small stones, treatment A), 87 is the total number of patients in this category, 81 is the number of successes, and 93% is the success rate in that group. Treatment A is clearly more efficacious than treatment B in both the "small stones" group and the "large stones" group. However, if we pool the two groups, treatment B has a greater success rate than treatment A. We would thus draw a wrong conclusion if we failed to consider the confounding effect of stone size.
But can we now conclude that treatment A is better than treatment B? Such a conclusion would be highly significant because it could guide our choice of treatment if we happen to have a kidney stone. Unfortunately, we cannot draw this conclusion because the success rate of both treatments changes over time. We can only say that treatment A was better than treatment B at the time of data collection, which gives us no guidance today. Such a conclusion, albeit scientifically correct, seems quite useless and trivial. A correct conclusion is often trivial, and a potentially wrong generalization that treatment A is better than treatment B appears much more significant. So if you want your conclusions to be highly significant, don’t be too correct, because they will then be trivial.
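The reversal can be reproduced directly from the counts in the table. A minimal sketch in R; the array layout and object names are illustrative choices, not part of the original slides.

```r
# Counts from the slide: Outcome x Treatment x Stone size.
counts <- array(c(81, 6, 244, 31,     # small stones: A succ, A fail, B succ, B fail
                  195, 68, 55, 20),   # large stones: same order
                dim = c(2, 2, 2),
                dimnames = list(Outcome = c("Success", "Failure"),
                                Treatment = c("A", "B"),
                                Stone = c("Small", "Large")))

# Success rate within each stone-size stratum (Treatment x Stone).
stratum_rate <- counts["Success", , ] / apply(counts, c(2, 3), sum)

# Success rate after pooling over stone size.
pooled <- apply(counts, c(1, 2), sum)
pooled_rate <- pooled["Success", ] / colSums(pooled)

print(round(100 * stratum_rate, 2))
print(round(100 * pooled_rate, 2))

# Patients per Treatment x Stone cell: the allocation is far from balanced.
print(apply(counts, c(2, 3), sum))
```

The last table shows why pooling misleads: treatment A was given mostly to large-stone patients and treatment B mostly to small-stone patients, which is the contingency-table analogue of an unbalanced design.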
6
Why a systems biology perspective?
"No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or ideally, one question at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will respond to a logical and carefully thought-out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed."
Ronald A. Fisher (1926). Journal of the Ministry of Agriculture of Great Britain 33: 503–513

That is Fisher's point, made in his correct but somewhat awkward English. In short, you should consider everything that is relevant. Of course, his statement did not come out of a vacuum. By that time, a lot of data involving unbalanced experimental designs and multi-factor interactions had accumulated, and one is prone to draw wrong conclusions if one does not use balanced factorial designs and does not think broadly and critically. Here is one real data set to illustrate this point: Simpson’s paradox.
7
Two-way ANOVA Balanced design Unbalanced design Some animals died
WtGain  Gender  Food
1       Male    LoFat
2       Male    LoFat
3       Male    LoFat
1       Female  LoFat
2       Female  LoFat
3       Female  LoFat
5       Male    HiFat
6       Male    HiFat
7       Male    HiFat
5       Female  HiFat
6       Female  HiFat
7       Female  HiFat

Balanced design
         LoFat    HiFat
Male     1, 2, 3  5, 6, 7
Female   1, 2, 3  5, 6, 7

Unbalanced design: the same layout, but some animals died, so the cells no longer hold equal numbers of animals.
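The balanced data listed above can be typed in directly instead of being read from a file. A minimal sketch in R, with the data frame name `wt` chosen for illustration; it checks the cell counts and fits the two-way model with the factors entered in both orders.

```r
# The 12 observations of the balanced design, in the order listed on the slide.
wt <- data.frame(
  WtGain = c(1, 2, 3, 1, 2, 3, 5, 6, 7, 5, 6, 7),
  Gender = factor(rep(c("Male", "Female", "Male", "Female"), each = 3)),
  Food   = factor(rep(c("LoFat", "HiFat"), each = 6))
)

# Every Gender x Food cell holds three animals, so the design is balanced.
print(table(wt$Gender, wt$Food))

# Sequential (Type I) SS with the two factors entered in either order.
gf <- anova(lm(WtGain ~ Gender * Food, data = wt))
fg <- anova(lm(WtGain ~ Food * Gender, data = wt))
print(gf)
print(fg)
```

With equal cell sizes the two tables assign the same sum of squares to each factor, so the order of entry does not matter.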
8
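Analysis in R
# Read the data; columns are referenced through data = nd rather than attach()
nd <- read.table("WtGain.txt", header = TRUE)
# Two-way ANOVA with interaction
fitANOVA <- aov(WtGain ~ Gender * Food, data = nd)
anova(fitANOVA)

Analysis of Variance Table
Response: WtGain
             Df  Sum Sq  Mean Sq  F value  Pr(>F)
GENDER        1                            e-05 ***
FOOD                                       e-14 ***
GENDER:FOOD   1   0.000    0.000    0.000  1
Residuals    26            0.462

# The same model fitted as a linear model
fitLM <- lm(WtGain ~ Gender * Food, data = nd)
anova(fitLM)
(the same result is produced)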
9
Models
y = a + b1x1 + b2x2 + b3x3 + …
SST, SSM, SSE
How do we properly evaluate the relative contributions of the independent variables (IVs) to SSM?
10
Different Types of SS
SS stands for the sum of squared deviations. The variance is the mean SS (i.e., MS).
Most statistical analyses are about how to explain the SS in the dependent variable (DV) by the independent variables (IVs).
The SS in the DV is often designated SST (for total variation in the DV).
The amount of SST that can be explained by the IVs is SSM (for model SS).
SSM/SST is the proportion of variation in the DV that can be explained by the IVs and is equal to R^2.
When the number of IVs > 1, we need to know the relative contribution of each IV to SSM for a given model.
Two frequently encountered SS:
  Type I SS (sequential SS)
  Type III SS (partial or unique SS)
Numerical illustration:
  ANOVA
  Regression
11
SST, SSM & SSE in 1-way ANOVA
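Table (only partly recoverable): columns Food, WtGain, Predicted, SST, SSM, SSE, with rows for the LoFat, MedFat and HiFat groups and a GrandMean row.
Surviving entries: LoFat 1 25 16 2 9 4; MedFat 5 6; MedFat 8; HiFat 10; GrandMean 70 64.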
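The partition SST = SSM + SSE in a one-way ANOVA can be worked through by hand in R. A minimal sketch: the weight gains below are hypothetical values invented for illustration (the slide's own table is not fully preserved), and all object names are illustrative.

```r
# Hypothetical weight gains, three animals per diet (NOT the slide's values).
food   <- factor(rep(c("LoFat", "MedFat", "HiFat"), each = 3),
                 levels = c("LoFat", "MedFat", "HiFat"))
wtgain <- c(1, 2, 3, 5, 6, 7, 9, 10, 11)

grand_mean <- mean(wtgain)
predicted  <- ave(wtgain, food)          # group mean for each observation

SST <- sum((wtgain - grand_mean)^2)      # total variation
SSM <- sum((predicted - grand_mean)^2)   # variation explained by Food
SSE <- sum((wtgain - predicted)^2)       # variation left within groups

print(c(SST = SST, SSM = SSM, SSE = SSE))

# The ANOVA table reports SSM in the food row and SSE in the Residuals row.
print(anova(lm(wtgain ~ food)))
```

In a one-way ANOVA the predicted value of every observation is simply its group mean, which is why `ave()` is enough to compute SSM and SSE.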
12
SST, SSM & SSE in Regression
X  Y
1  2
1  3
1  4
1  5
2  3
2  4
2  5
2  6
3  4
3  5
3  6
3  7
Columns computed on the slide: SSTotal, SSModel, SSError
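The three sums of squares for the X and Y values on this slide can be computed as follows. A minimal sketch in R; the object names are illustrative.

```r
# X and Y as listed on the slide.
x <- rep(1:3, each = 4)
y <- c(2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7)

fit  <- lm(y ~ x)
yhat <- fitted(fit)

SSTotal <- sum((y - mean(y))^2)      # deviations of y from its mean
SSModel <- sum((yhat - mean(y))^2)   # deviations of predictions from the mean
SSError <- sum((y - yhat)^2)         # deviations of y from the predictions

print(c(SSTotal = SSTotal, SSModel = SSModel, SSError = SSError))

# R-squared is the explained share of the total variation.
print(c(SSModel / SSTotal, summary(fit)$r.squared))
```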
13
Partition of Variance in Regression
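\[ \sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2 + 2 \sum_i (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) \]
The summation of the last (cross-product) term is zero, so SST = SSM + SSE, with SSM = \sum_i (\hat{y}_i - \bar{y})^2 and SSE = \sum_i (y_i - \hat{y}_i)^2.
When the number of IVs > 1, we need to evaluate the relative contribution of the IVs to SSM.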
14
Relative contributions of IVs
If the IVs are not correlated, their respective contributions to SSM are unique and not shared.
If the IVs are correlated, a fraction of SST that can be explained by one IV may also be explained by another IV.
Type I SS and Type III SS are two different measures of the relative contribution of IVs to SSM.
15
Type I SS
Given a model of y = a + b1x1 + b2x2 + b3x3 + …
Imagine that you fit a series of models:
  Model 1: y = a + b1x1                   (SSM1)
  Model 2: y = a + b1x1 + b2x2            (SSM2)
  Model 3: y = a + b1x1 + b2x2 + b3x3     (SSM3)
  …
Type I SS:
  SSM(x1) = SSM1
  SSM(x2) = SSM2 - SSM1
  SSM(x3) = SSM3 - SSM2
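The series of models above can be fitted literally. A minimal sketch in R on simulated data: the data-generating settings and the helper `ssm()` are assumptions made for illustration and are not part of the slides.

```r
set.seed(1)                      # simulated data, purely illustrative
n  <- 50
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)        # correlated with x1
x3 <- 0.5 * x2 + rnorm(n)        # correlated with x2
y  <- 1 + x1 + x2 + x3 + rnorm(n)

# Model SS: variation of the fitted values around the mean of y.
ssm <- function(fit) sum((fitted(fit) - mean(y))^2)

SSM1 <- ssm(lm(y ~ x1))
SSM2 <- ssm(lm(y ~ x1 + x2))
SSM3 <- ssm(lm(y ~ x1 + x2 + x3))

# Type I (sequential) SS: what each IV adds to the IVs entered before it.
type1 <- c(x1 = SSM1, x2 = SSM2 - SSM1, x3 = SSM3 - SSM2)
print(type1)

# anova() on an lm fit reports the same sequential SS.
print(anova(lm(y ~ x1 + x2 + x3)))
```

Because each term is credited only with what it adds to the terms before it, entering the IVs in a different order generally changes these values when the IVs are correlated.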
16
Type III SS
Given a model of y = a + b1x1 + b2x2 + b3x3 + …
Imagine that you fit a series of models:
  Model 1: y = a + b2x2 + b3x3 + … (without x1)   (SSM1)
  Model 2: y = a + b1x1 + b3x3 + … (without x2)   (SSM2)
  Model 3: y = a + b1x1 + b2x2 + … (without x3)   (SSM3)
  Full model: y = a + b1x1 + b2x2 + b3x3 + …      (SSM)
Type III SS:
  SSM(x1) = SSM - SSM1
  SSM(x2) = SSM - SSM2
  SSM(x3) = SSM - SSM3
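The leave-one-out models above can be fitted in the same way. A minimal sketch in R on simulated data: the data-generating settings and the helper `ssm()` are illustrative assumptions. For a model with main effects only, base R's `drop1()` reports the same quantities.

```r
set.seed(1)                      # simulated data, purely illustrative
n  <- 50
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)        # correlated with x1
x3 <- 0.5 * x2 + rnorm(n)        # correlated with x2
y  <- 1 + x1 + x2 + x3 + rnorm(n)

# Model SS: variation of the fitted values around the mean of y.
ssm <- function(fit) sum((fitted(fit) - mean(y))^2)

full <- lm(y ~ x1 + x2 + x3)
SSM  <- ssm(full)

# Type III (partial) SS: what the full model loses when one IV is left out.
type3 <- c(x1 = SSM - ssm(lm(y ~ x2 + x3)),
           x2 = SSM - ssm(lm(y ~ x1 + x3)),
           x3 = SSM - ssm(lm(y ~ x1 + x2)))
print(type3)

# drop1() removes one term at a time from the full model.
print(drop1(full, test = "F"))
```

Unlike Type I SS, these values do not depend on the order of the IVs, and they generally do not add up to SSM when the IVs are correlated.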
17
Type I and Type III SS
y = a + b1x1 + b2x2 + b3x3
[Figure: Venn diagram. A large circle is SST. Three overlapping small circles, one per IV, cover SSM; the rest of SST is SSE. Areas u1, u2, u3 belong to one IV only; s12, s13, s23 are shared by two IVs; s123 is shared by all three.]
Type I SS:
  SSM(x1) = u1 + s12 + s13 + s123
  SSM(x2) = u2 + s23
  SSM(x3) = u3
Type III SS:
  SSM(x1) = u1
  SSM(x2) = u2
  SSM(x3) = u3
If x1, x2 and x3 are not correlated, then Type I SS = Type III SS.
If x1, x2 and x3 are perfectly correlated, then Type III SS = 0.
18
IVs uncorrelated
Data columns: x1, x2, y
The data are generated with y = x1 + x2 + …
19
Significance Test in Regression
Data columns: x1, x2, y
fit12 <- lm(y ~ x1 + x2)
fit21 <- lm(y ~ x2 + x1)
anova(fit12)
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
x                                        e-13 ***
x                                        e-14 ***
Residuals
anova(fit21)
summary(fit12)
summary(fit21)
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)
x                                           e-14 ***
x                                           e-13 ***
SS for x1 and SS for x2 do not change with the order of IVs in the model.
Type I SS = Type III SS
20
Regress y on x1
Note that b = … and SSM = …. These are the same as when the model includes x2, i.e., the slope for x1 and the variation in y that can be explained by x1 are not affected by the presence of x2 when x1 and x2 are not correlated.
21
Regress y on x2
Note that b = … and SSM = …. These are the same as when the model includes x1, i.e., the slope for x2 and the variation in y that can be explained by x2 are not affected by the presence of x1 when x1 and x2 are not correlated.
22
When IVs are not correlated
  the variation in y attributable to variation in x1 is independent of the variation in y attributable to variation in x2;
  the coefficients of determination, r^2, for models incorporating a single x will add up to the r^2 value for the model incorporating all x variables;
  Type I and Type III SS are equal;
  the slope estimate remains the same no matter how many IVs the model includes.
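These four properties can be checked numerically. A minimal sketch in R, assuming a small crossed layout for x1 and x2 (which makes their correlation exactly zero) and a normal error term; the slide's own data are not preserved, so all values here are illustrative.

```r
set.seed(2)                          # illustrative data, not the slide's
x1 <- rep(1:3, each = 3)
x2 <- rep(1:3, times = 3)            # fully crossed with x1, so cor(x1, x2) = 0
y  <- x1 + x2 + rnorm(9, sd = 0.5)   # the error term is an assumption

print(cor(x1, x2))

# Sequential SS with the IVs entered in either order.
a12 <- anova(lm(y ~ x1 + x2))
a21 <- anova(lm(y ~ x2 + x1))
print(a12)
print(a21)

# Slope for x1 with and without x2 in the model.
b_alone <- coef(lm(y ~ x1))[["x1"]]
b_both  <- coef(lm(y ~ x1 + x2))[["x1"]]
print(c(alone = b_alone, with_x2 = b_both))

# r^2 of the single-IV models versus r^2 of the two-IV model.
r2 <- function(f) summary(lm(f))$r.squared
print(c(sum_of_single = r2(y ~ x1) + r2(y ~ x2), both = r2(y ~ x1 + x2)))
```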
23
When IVs are correlated
Data columns: x1, x2, y
Equivalent to unbalanced design in ANOVA
y = x1 + x2 + …
24
When IVs are correlated
Data columns: x1, x2, y
y = x1 + x2 + …
sum((y - mean(y))^2)
fit12 <- lm(y ~ x1 + x2)
fit21 <- lm(y ~ x2 + x1)
anova(fit12)
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
x                                        < 2.2e-16 ***
x                                        < 2.2e-16 ***
Residuals
anova(fit21)
x                                        < 2.2e-16 ***
x                                        < 2.2e-16 ***
summary(fit12)
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)
x                                           <2e-16 ***
x                                           <2e-16 ***
SS for x1 and SS for x2 change with the order of the IVs in the model.
25
Regress y on x1
Note that b = … (quite different from when x2 is in the model).
26
Regress y on x2
Note that b = … (quite different from when x1 is in the model).
27
Relative Contributions
Total SS
SS(x1) (Type I SS), from Y = a + b x1
SS(x2|x1) (Type III SS)
SS(x2) (Type I SS), from Y = a + b x2
SS(x1|x2) (Type III SS)
Shared = …
Values shown in the diagram: 11.878, 37.922, 24.251
In stepwise regression, Type III SS often determines whether a variable should be included in the regression equation or not.
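The quantities on this slide can be obtained from two orderings of the same two-IV model. A minimal sketch in R on simulated correlated IVs; the data and the helper `ss()` are illustrative assumptions, and the numbers it prints are not the slide's. "Shared" is taken here as the part of SS(x1) that is no longer credited to x1 once x2 is in the model.

```r
set.seed(3)                            # simulated data, purely illustrative
n  <- 40
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)    # correlated with x1
y  <- x1 + x2 + rnorm(n)

# Sum of squares of one term from a sequential ANOVA table.
ss <- function(fit, term) anova(fit)[term, "Sum Sq"]

fit12 <- lm(y ~ x1 + x2)
fit21 <- lm(y ~ x2 + x1)

ss_x1       <- ss(fit12, "x1")   # SS(x1): x1 entered first
ss_x2_given <- ss(fit12, "x2")   # SS(x2|x1): x2 after x1
ss_x2       <- ss(fit21, "x2")   # SS(x2): x2 entered first
ss_x1_given <- ss(fit21, "x1")   # SS(x1|x2): x1 after x2

shared <- ss_x1 - ss_x1_given    # equals ss_x2 - ss_x2_given

print(c(Total = sum((y - mean(y))^2),
        SS_x1 = ss_x1, SS_x2_given_x1 = ss_x2_given,
        SS_x2 = ss_x2, SS_x1_given_x2 = ss_x1_given,
        Shared = shared))
```

Both orderings account for the same SSM, so SS(x1) + SS(x2|x1) = SS(x2) + SS(x1|x2); that identity is why the shared part can be computed from either IV.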
29
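Analysis in R
# Read the data; columns are referenced through data = nd rather than attach()
nd <- read.table("WtGain.txt", header = TRUE)
# Same two-way ANOVA, now with Food entered before Gender
fitANOVA <- aov(WtGain ~ Food * Gender, data = nd)
anova(fitANOVA)

Analysis of Variance Table
Response: WtGain
             Df  Sum Sq  Mean Sq  F value  Pr(>F)
Food                                       e-15 ***
Gender
Food:Gender
Residuals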
30
Two-way ANOVA
Dummy coding of the two factors:

WtGain  Gender  Food   D_Male  D_LoFat
1       Male    LoFat  1       1
2       Male    LoFat  1       1
3       Male    LoFat  1       1
1       Female  LoFat  0       1
2       Female  LoFat  0       1
3       Female  LoFat  0       1
5       Male    HiFat  1       0
6       Male    HiFat  1       0
7       Male    HiFat  1       0
5       Female  HiFat  0       0
6       Female  HiFat  0       0
7       Female  HiFat  0       0

Unbalanced design
         LoFat    HiFat
Male     1, 2, 3  5, 6, 7
Female

r(D_Male, D_LoFat) = 0.333
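The link between unbalanced cells and correlated dummy variables can be illustrated with a small example. A minimal sketch in R: this transcript does not show which animals died, so the six "survivors" below are a hypothetical subset, chosen only so that the two dummy variables correlate at 1/3 as reported on the slide.

```r
# HYPOTHETICAL survivors (an assumption, not the author's data):
# 2 Male/LoFat, 1 Male/HiFat, 1 Female/LoFat, 2 Female/HiFat.
wt <- data.frame(
  WtGain = c(1, 2, 6, 2, 6, 7),
  Gender = c("Male", "Male", "Male", "Female", "Female", "Female"),
  Food   = c("LoFat", "LoFat", "HiFat", "LoFat", "HiFat", "HiFat")
)

# Dummy coding as on the slide.
wt$D_Male  <- as.numeric(wt$Gender == "Male")
wt$D_LoFat <- as.numeric(wt$Food == "LoFat")

# Unequal cell sizes make the two dummy variables correlated.
r_dummy <- cor(wt$D_Male, wt$D_LoFat)
print(r_dummy)

# Sequential SS now depends on which factor is entered first.
gender_first <- anova(lm(WtGain ~ D_Male + D_LoFat, data = wt))
food_first   <- anova(lm(WtGain ~ D_LoFat + D_Male, data = wt))
print(gender_first)
print(food_first)
```

In this made-up subset, males were more often on the low-fat diet, so Gender entered first is credited with variation that Food can also explain; entered after Food, its SS is much smaller.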
32
Does y depend on x1?
x1  x2  y
1   4   5
1   5   6
1   6   7
2   6   8
2   7   9
3   4   7
3   3   6
3   2   5
"she will often refuse to answer until some other topic has been discussed"
Correlation between x1 and y is 0. The relationship between x1 and y is revealed only when x2 is included.
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
x                               e+30     < 2.2e-16 ***
x                               e+30     < 2.2e-16 ***
Residuals
y = x1 + x2
We need to look at both Type I and Type III SS to make proper statistical inferences involving more than one IV.
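The eight observations on this slide are enough to reproduce the point. A minimal sketch in R; the object names are illustrative. Because y equals x1 + x2 exactly, the two-IV fit is perfect, so the sketch inspects correlations and coefficients rather than F tests.

```r
# Data as listed on the slide.
x1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
x2 <- c(4, 5, 6, 6, 7, 4, 3, 2)
y  <- c(5, 6, 7, 8, 9, 7, 6, 5)

# On its own, x1 appears unrelated to y.
print(cor(x1, y))
print(coef(lm(y ~ x1)))

# With x2 in the model, the dependence of y on x1 shows up.
fit <- lm(y ~ x1 + x2)
print(round(coef(fit), 10))

# x1 and x2 are negatively correlated, which masks the effect of x1.
print(cor(x1, x2))
```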
33
Summary
When independent variables are not correlated: be happy.
  (You can create uncorrelated "latent" variables by using methods such as PCA; e.g., positively correlated variables such as food, hug, schooling, etc., constitute a latent "nurture" variable.)
When independent variables are correlated:
  part of the variation in y cannot be unequivocally attributed to the variation in any particular x;
  the coefficients of determination, r^2, for models each incorporating a single x will add up to exceed the r^2 value for the model incorporating all x variables;
  Type I and Type III SS are unequal, and both are needed to understand the contribution of IVs to SSM.
Parameter estimation may be biased if you miss some IVs in your experiment.
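The PCA remark in the summary can be illustrated with base R's `prcomp()`. A minimal sketch on simulated data: the "nurture" variable and the three measured variables are invented for illustration, following the slide's example names.

```r
set.seed(4)                              # simulated data, purely illustrative
n <- 100
nurture   <- rnorm(n)                    # hypothetical latent variable
food      <- nurture + rnorm(n, sd = 0.5)
hug       <- nurture + rnorm(n, sd = 0.5)
schooling <- nurture + rnorm(n, sd = 0.5)
X <- cbind(food, hug, schooling)

# The measured variables are positively correlated with one another.
print(round(cor(X), 2))

# Principal components of the standardized variables.
pca    <- prcomp(X, scale. = TRUE)
scores <- pca$x

# The component scores are mutually uncorrelated.
print(round(cor(scores), 10))
print(summary(pca))
```

The scores in `pca$x` can be used as IVs in place of the original variables, at the cost of interpreting components rather than the measured variables themselves.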