Published by Agatha Anderson. Modified over 9 years ago.
1 Analysis of Variance & One Factor Designs
Y = DEPENDENT VARIABLE (“yield”, “response variable”, “quality indicator”)
X = INDEPENDENT VARIABLE (a possibly influential FACTOR)
2 OBJECTIVE: To determine the impact of X on Y
Mathematical Model: $Y = f(X, \varepsilon)$, where $\varepsilon$ = (impact of) all factors other than X
Ex: Y = Battery Life (hours)
X = Brand of Battery
$\varepsilon$ = Many other factors (possibly, some we’re unaware of)
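3 Statistical Model
“LEVEL” OF BRAND (Brand is, of course, represented as “categorical”)

Replication   Level 1    Level 2    ...        Level C
1             $Y_{11}$   $Y_{12}$   ...        $Y_{1C}$
2             $Y_{21}$   $Y_{22}$   ...        $Y_{2C}$
...           ...        ...        $Y_{ij}$   ...
R             $Y_{R1}$   $Y_{R2}$   ...        $Y_{RC}$

$$Y_{ij} = \mu + \tau_j + \varepsilon_{ij}, \qquad i = 1, \ldots, R; \quad j = 1, \ldots, C$$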
4 Where
$\mu$ = OVERALL AVERAGE
j = index for FACTOR (Brand) LEVEL
i = index for “replication”
$\tau_j$ = differential effect (response) associated with the j-th level of X
$\varepsilon_{ij}$ = “noise” or “error” associated with the (particular) (i,j)-th data value.
Let $\mu_j$ = AVERAGE associated with the j-th level of X. Then $\tau_j = \mu_j - \mu$, and $\mu$ = AVERAGE of the $\mu_j$.
5 $$Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$$
By definition, $\sum_{j=1}^{C} \tau_j = 0$.
The experiment produces R x C $Y_{ij}$ data values. The analysis produces estimates of $\mu, \tau_1, \tau_2, \ldots, \tau_C$. (We can then get estimates of the $\varepsilon_{ij}$ by subtraction.)
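6 Level          1                     2                     ...   C
                 $Y_{11}$              $Y_{12}$              ...   $Y_{1C}$
                 $Y_{21}$              $Y_{22}$              ...   $Y_{2C}$
                 ...                   ...                   ...   ...
                 $Y_{R1}$              $Y_{R2}$              ...   $Y_{RC}$
  Column mean    $\bar{Y}_{\cdot 1}$   $\bar{Y}_{\cdot 2}$   ...   $\bar{Y}_{\cdot C}$
$\bar{Y}_{\cdot 1}$, $\bar{Y}_{\cdot 2}$, etc., are Column Means ($\bar{Y}_{\cdot j}$ for column j)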
7 “GRAND MEAN”:
$$\bar{Y}_{\cdot\cdot} = \frac{1}{C} \sum_{j=1}^{C} \bar{Y}_{\cdot j}$$
(assuming same # data points in each column; otherwise, $\bar{Y}_{\cdot\cdot}$ = mean of all the data)
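A small sketch of the caveat about unequal column sizes, using made-up columns (the numbers and the helper names `mean_of_column_means` and `mean_of_all_data` are illustrative assumptions, not from the slides). With equal column sizes, the mean of the column means coincides with the mean of all the data; with unequal sizes the two can differ.

```python
from statistics import mean

# Hypothetical columns, for illustration only (not data from the slides).
equal_sizes = [[2.0, 4.0], [6.0, 8.0]]
unequal_sizes = [[2.0, 4.0], [6.0, 8.0, 10.0, 12.0]]


def mean_of_column_means(columns):
    """Average the column means, giving every column the same weight."""
    return mean(mean(col) for col in columns)


def mean_of_all_data(columns):
    """Average every data value, giving every observation the same weight."""
    return mean(y for col in columns for y in col)


for label, columns in [("equal", equal_sizes), ("unequal", unequal_sizes)]:
    print(label, mean_of_column_means(columns), mean_of_all_data(columns))
```

In the unequal case the larger column pulls the mean of all the data toward its own mean, which is why the slide falls back on the mean of all the data.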
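8 MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
$\bar{Y}_{\cdot\cdot}$ estimates $\mu$
$\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}$ estimates $\tau_j$ (= $\mu_j - \mu$) (for all j)
These estimates are based on Gauss’ (1796) PRINCIPLE OF LEAST SQUARES and (I would argue) on COMMON SENSE.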
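9 MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
If you insert the estimates into the MODEL,
(1) $Y_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + \hat{\varepsilon}_{ij}$,
it follows that our estimate of $\varepsilon_{ij}$ is
(2) $\hat{\varepsilon}_{ij} = Y_{ij} - \bar{Y}_{\cdot j}$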
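10 Then, $Y_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j})$
or,
(3) $(Y_{ij} - \bar{Y}_{\cdot\cdot}) = (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j})$
TOTAL VARIABILITY in Y = Variability in Y associated with X + Variability in Y associated with all other factors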
11 If you square both sides of (3), and double sum both sides (over i and j), you get [after some unpleasant algebra, but lots of terms which “cancel”]
$$\sum_{j=1}^{C} \sum_{i=1}^{R} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = R \sum_{j=1}^{C} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 + \sum_{j=1}^{C} \sum_{i=1}^{R} (Y_{ij} - \bar{Y}_{\cdot j})^2$$
TSS (TOTAL SUM OF SQUARES) = $SSB_C$ (SUM OF SQUARES BETWEEN COLUMNS) + SSW, also written SSE (SUM OF SQUARES WITHIN COLUMNS)
12 ANOVA TABLE

SOURCE OF VARIABILITY             SSQ       DF          Mean square (M.S.)
Between Columns (due to brand)    $SSB_C$   C - 1       $MSB_C = SSB_C / (C - 1)$
Within Columns (due to error)     SSW       (R - 1)C    MSW = SSW / ((R - 1)C)
TOTAL                             TSS       RC - 1
13 Example: Y = LIFETIME (HOURS), 3 replications per level

BRAND          1     2     3     4     5     6     7     8
               1.8   4.2   8.6   7.0   4.2   4.2   7.8   9.0
               5.0   5.4   4.6   5.0   7.8   4.2   7.0   7.4
               1.0   4.2   4.2   9.0   6.6   5.4   9.8   5.8
Column mean    2.6   4.6   5.8   7.0   6.2   4.6   8.2   7.4
Grand mean = 5.8

$$SSB_C = 3 \left( [2.6 - 5.8]^2 + [4.6 - 5.8]^2 + \cdots + [7.4 - 5.8]^2 \right) = 3\,(23.04) = 69.12$$
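The column means, grand mean, and between-columns sum of squares for this example can be re-derived with a few lines of Python. This is only a sketch of the arithmetic on the slide; the variable names (`lifetimes`, `column_means`, `ssb`) are chosen here for illustration.

```python
# Battery lifetimes (hours) from the slide: one list per brand, 3 replications each.
lifetimes = [
    [1.8, 5.0, 1.0],  # brand 1
    [4.2, 5.4, 4.2],  # brand 2
    [8.6, 4.6, 4.2],  # brand 3
    [7.0, 5.0, 9.0],  # brand 4
    [4.2, 7.8, 6.6],  # brand 5
    [4.2, 4.2, 5.4],  # brand 6
    [7.8, 7.0, 9.8],  # brand 7
    [9.0, 7.4, 5.8],  # brand 8
]

R = len(lifetimes[0])  # replications per brand
C = len(lifetimes)     # number of brands (columns)

column_means = [sum(col) / R for col in lifetimes]
grand_mean = sum(column_means) / C  # valid because every column has R values

# SSB_C = R * sum over columns of (column mean - grand mean)^2
ssb = R * sum((m - grand_mean) ** 2 for m in column_means)

print([round(m, 1) for m in column_means])
print(round(grand_mean, 1), round(ssb, 2))
```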
14 Brand 1                  Brand 2                  ...   Brand 8
   (1.8 - 2.6)^2 = 0.64     (4.2 - 4.6)^2 = 0.16     ...   (9.0 - 7.4)^2 = 2.56
   (5.0 - 2.6)^2 = 5.76     (5.4 - 4.6)^2 = 0.64     ...   (7.4 - 7.4)^2 = 0
   (1.0 - 2.6)^2 = 2.56     (4.2 - 4.6)^2 = 0.16     ...   (5.8 - 7.4)^2 = 2.56
   Column total: 8.96       0.96                     ...   5.12

SSW = total of (8.96 + 0.96 + ... + 5.12) = 46.72
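As a check on the within-columns arithmetic, the sketch below recomputes SSW from the battery data and confirms numerically that the total sum of squares splits into the between and within parts. The helper `column_mean` and the variable names are illustrative choices, not part of the original slides.

```python
# Battery lifetimes (hours) from the example: one list per brand.
lifetimes = [
    [1.8, 5.0, 1.0], [4.2, 5.4, 4.2], [8.6, 4.6, 4.2], [7.0, 5.0, 9.0],
    [4.2, 7.8, 6.6], [4.2, 4.2, 5.4], [7.8, 7.0, 9.8], [9.0, 7.4, 5.8],
]


def column_mean(col):
    return sum(col) / len(col)


all_values = [y for col in lifetimes for y in col]
grand_mean = sum(all_values) / len(all_values)

# Within columns: squared deviations of each value from its own column mean.
ssw = sum((y - column_mean(col)) ** 2 for col in lifetimes for y in col)

# Between columns: squared deviations of column means from the grand mean,
# weighted by the number of values in the column.
ssb = sum(len(col) * (column_mean(col) - grand_mean) ** 2 for col in lifetimes)

# Total: squared deviations of each value from the grand mean.
tss = sum((y - grand_mean) ** 2 for y in all_values)

print(round(ssw, 2), round(ssb, 2), round(tss, 2))
```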
15 ANOVA TABLE

Source of Variability   SSQ      df                   M.S.
BRAND                   69.12    7 = 8 - 1            9.87
ERROR                   46.72    16 = 2 (8)           2.92
TOTAL                   115.84   23 = (3 x 8) - 1
16 We can show:
$$E(MSB_C) = \sigma^2 + V_{COL}, \qquad V_{COL} = \frac{R}{C-1} \sum_{j=1}^{C} (\mu_j - \mu)^2$$
(“$V_{COL}$” is a MEASURE OF DIFFERENCES AMONG COLUMN MEANS)
$$E(MSW) = \sigma^2$$
(Assuming each $Y_{ij}$ has (constant) standard deviation $\sigma$.) (More about assumptions, later.)
17 $E(MSB_C) = \sigma^2 + V_{COL}$
$E(MSW) = \sigma^2$
This suggests that
if $MSB_C / MSW > 1$, there’s some evidence of non-zero $V_{COL}$, or “level of X affects Y”;
if $MSB_C / MSW < 1$, there’s no evidence that $V_{COL} > 0$, or that “level of X affects Y”.
18 With
$H_0$: Level of X has no impact on Y
$H_1$: Level of X does have impact on Y,
we need $MSB_C / MSW \gg 1$ to reject $H_0$.
19 More formally,
$H_0$: $\tau_1 = \tau_2 = \cdots = \tau_C = 0$
$H_1$: not all $\tau_j = 0$
OR
$H_0$: $\mu_1 = \mu_2 = \cdots = \mu_C$ (all column means are equal)
$H_1$: not all $\mu_j$ are EQUAL
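20 The probability law of $MSB_C / MSW$ = “$F_{calc}$” is the F-distribution with (C - 1, (R - 1)C) degrees of freedom, assuming $H_0$ true.
[Figure: F density; C = Table Value (here C marks the critical value read from the F table, not the number of columns)]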
21 In our problem:
ANOVA TABLE

Source of Variability   SSQ     df   M.S.
BRAND                   69.12   7    9.87
ERROR                   46.72   16   2.92

$F_{calc} = 9.87 / 2.92 = 3.38$
22 [Figure: F density with $\alpha = .05$, table value C = 2.66, and $F_{calc} = 3.38$ to its right]
F table coming up (7, 16 DF)
23 F-Table
24 Hence, at $\alpha = .05$, reject $H_0$. (i.e., conclude that level of BRAND does have an impact on battery lifetime.)
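The test itself reduces to a few divisions and one comparison. The sketch below uses only numbers quoted on the slides (the sums of squares, the design sizes, and the table value 2.66 for $\alpha$ = .05 with (7, 16) df); the variable names are illustrative.

```python
# Values quoted on the slides.
ssb, ssw = 69.12, 46.72  # sums of squares between / within columns
C, R = 8, 3              # 8 brands, 3 replications per brand
f_table = 2.66           # F table value for alpha = .05 with (7, 16) df

df_between = C - 1       # 7
df_within = (R - 1) * C  # 16

msb = ssb / df_between   # mean square between columns
msw = ssw / df_within    # mean square within columns
f_calc = msb / msw

# Reject H0 when the calculated ratio exceeds the table value.
reject_h0 = f_calc > f_table

print(round(msb, 2), round(msw, 2), round(f_calc, 2), reject_h0)
```

Dividing the unrounded mean squares or the rounded ones (9.87 / 2.92) gives the same F ratio to two decimals here.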
26 SPSS/MINITAB INPUT

VAR001   VAR002
1.8      1
5.0      1
1.0      1
4.2      2
5.4      2
4.2      2
...      ...
9.0      8
7.4      8
5.8      8
28 ONE FACTOR ANOVA (MINITAB)
Analysis of Variance for life

Source   DF   SS       MS     F      P
brand    7    69.12    9.87   3.38   0.021
Error    16   46.72    2.92
Total    23   115.84

MINITAB: STAT>>ANOVA>>ONE-WAY
31 EXAMPLE: MORTAR The tension bond strength of cement mortar is an important characteristic of the product. An engineer is interested in comparing the strength of a modified formulation in which polymer latex emulsions have been added during mixing to the strength of the unmodified mortar. The experimenter has collected 10 observations on strength for the modified formulation and another 10 observations for the unmodified formulation.
32 Modified   Unmodified
   16.85      17.50
   16.40      17.63
   17.21      18.25
   16.35      18.00
   16.52      17.86
   17.04      17.75
   16.96      18.22
   17.15      17.90
   16.59      17.96
   16.57      18.15
33 One-way ANOVA: strength versus type (Minitab)
Analysis of Variance for strength

Source   DF   SS       MS       F       P
type     1    6.7048   6.7048   82.98   0.000
Error    18   1.4544   0.0808
Total    19   8.1592
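The sums of squares and the F ratio in this output can be reproduced from the 20 mortar observations. The sketch below is plain Python written for illustration (it is not Minitab's procedure, and names such as `ss_type` and `ss_error` are chosen here); it does not compute the P value, which needs the F distribution.

```python
# Tension bond strength observations from the mortar example.
modified = [16.85, 16.40, 17.21, 16.35, 16.52, 17.04, 16.96, 17.15, 16.59, 16.57]
unmodified = [17.50, 17.63, 18.25, 18.00, 17.86, 17.75, 18.22, 17.90, 17.96, 18.15]
groups = [modified, unmodified]


def mean(values):
    return sum(values) / len(values)


n_total = sum(len(g) for g in groups)
grand_mean = mean([y for g in groups for y in g])

# Between groups ("type") and within groups ("Error") sums of squares.
ss_type = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)

df_type = len(groups) - 1         # 1
df_error = n_total - len(groups)  # 18

f_ratio = (ss_type / df_type) / (ss_error / df_error)

print(round(ss_type, 4), round(ss_error, 4), round(f_ratio, 2))
```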
37 Assumptions
Basically, the same as in Regression analysis:
MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
1.) the $\varepsilon_{ij}$ are indep. random variables
2.) Each $\varepsilon_{ij}$ is Normally Distributed, with $E(\varepsilon_{ij}) = 0$ for all i, j
3.) $\sigma^2(\varepsilon_{ij})$ = constant for all i, j
Diagnostic plots: Normality plot, Residual plot, Run order plot
38 Diagnosis: Normality
Normality plot: normal scores vs. residuals
The points on the normality plot must more or less follow a straight line to claim “normally distributed”. There are statistical tests to verify it formally. The ANOVA method we learn here is not very sensitive to the normality assumption. That is, a mild departure from the normal distribution will not change our conclusions much.
39 From Mortar data: [Figure: normality plot of the residuals]
40 Diagnosis: Constant Variances
Residual plot: fitted values vs. residuals
The points on the residual plot must be more or less within a horizontal band to claim “constant variances”. There are statistical tests to verify it formally. The ANOVA method we learn here is not very sensitive to the constant variances assumption. That is, slightly different variances within groups will not change our conclusions much.
41 From Mortar data: [Figure: residual plot, fitted values vs. residuals]
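The quantities behind a residual plot are easy to compute by hand. The sketch below, written for illustration with names chosen here (`fitted`, `residuals`, `spread`), takes the mortar observations, uses each group mean as the fitted value, forms the residuals, and reports the sample variance in each group as a rough numerical companion to the plot. It is not a formal test of equal variances.

```python
from statistics import mean, variance

# Tension bond strength observations from the mortar example.
data = {
    "modified": [16.85, 16.40, 17.21, 16.35, 16.52, 17.04, 16.96, 17.15, 16.59, 16.57],
    "unmodified": [17.50, 17.63, 18.25, 18.00, 17.86, 17.75, 18.22, 17.90, 17.96, 18.15],
}

# In a one-factor model the fitted value for an observation is its group mean.
fitted = {name: mean(values) for name, values in data.items()}

# Residual = observation minus fitted value; these go on the vertical axis.
residuals = {name: [y - fitted[name] for y in values] for name, values in data.items()}

# Sample variance (divisor n - 1) of each group, to compare the spreads.
spread = {name: variance(values) for name, values in data.items()}

for name in data:
    print(name, round(fitted[name], 3), round(spread[name], 4))
```

The two sample variances are of the same order of magnitude, which is the kind of “horizontal band” pattern the residual plot is meant to reveal.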
42 Diagnosis: Randomness/Independence
Run order plot: order vs. residuals
The run order plot must show no “systematic” patterns to claim “randomness”. There are statistical tests to verify it formally. The ANOVA method is sensitive to the independence assumption. That is, even a small degree of dependence between data points can change our conclusions a lot.
43 From Mortar data: [Figure: run order plot, order vs. residuals]
44 This assumes a “fixed model”:
- Inherent interest in the specific levels of the factors under study
- There’s no direct interest in extrapolating to other levels
- Inference will be limited to levels that appear in the experiment
- Experimenter selects the levels
If a “random model”: Levels in experiment randomly selected from a population of such levels, and inference is to be made about the entire population of levels. Then, besides assumptions 1 to 3, there is another assumption:
4) a) the $\tau_j$ are independent random variables which are normally distributed with constant variance
b) the $\tau_j$ and $\varepsilon_{ij}$ are independent
45 With these assumptions, the estimates ($\bar{Y}_{\cdot\cdot}$ and the $\bar{Y}_{\cdot j}$) are “Maximum likelihood estimates” (a statistical notion which could be thought of as “efficiency” [“most likely value”]), and, more directly relevant: the “conventional” F- and t-tests are applicable (VALID) for a variety of hypothesis testing and confidence interval computations.
46 KRUSKAL-WALLIS TEST (Non-Parametric Alternative)
$H_0$: The probability distributions are identical for each level of the factor
$H_1$: Not all the distributions are the same
47 BATTERY LIFETIME (hours) (each column rank ordered, for simplicity)

Brand   A      B      C
        32     32     28
        30     32     21
        30     26     15
        29     26     15
        26     22     14
        23     20     14
        20     19     14
        19     16     11
        18     14     9
        12     14     8
Mean:   23.9   22.1   14.9   (here, irrelevant!!)
48 $H_0$: no difference in distribution among the three brands with respect to battery lifetime
$H_1$: At least one of the 3 brands differs in distribution from the others with respect to lifetime
49 Ranks (in parentheses, after each data value)

Brand   A             B             C
        32 (29)       32 (29)       28 (24)
        30 (26.5)     32 (29)       21 (18)
        30 (26.5)     26 (22)       15 (10.5)
        29 (25)       26 (22)       15 (10.5)
        26 (22)       22 (19)       14 (7)
        23 (20)       20 (16.5)     14 (7)
        20 (16.5)     19 (14.5)     14 (7)
        19 (14.5)     16 (12)       11 (3)
        18 (13)       14 (7)        9 (2)
        12 (4)        14 (7)        8 (1)
        $T_1$ = 197   $T_2$ = 178   $T_3$ = 90
        $n_1$ = 10    $n_2$ = 10    $n_3$ = 10
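The ranks in this table are midranks: all 30 values are pooled and sorted, and tied values share the average of the positions they occupy. A minimal sketch of that procedure follows; the helper name `midranks` and the dictionary layout are illustrative choices, not from the slides.

```python
# Battery lifetimes (hours) for the three brands.
brands = {
    "A": [32, 30, 30, 29, 26, 23, 20, 19, 18, 12],
    "B": [32, 32, 26, 26, 22, 20, 19, 16, 14, 14],
    "C": [28, 21, 15, 15, 14, 14, 14, 11, 9, 8],
}


def midranks(values):
    """Map each distinct value to the average of the 1-based sorted positions it occupies."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return {value: sum(p) / len(p) for value, p in positions.items()}


# Rank within the pooled data, then add up the ranks brand by brand.
pooled = [y for column in brands.values() for y in column]
rank_of = midranks(pooled)
rank_sums = {name: sum(rank_of[y] for y in column) for name, column in brands.items()}

print(rank_of[14], rank_of[32])  # the five 14s share rank 7, the three 32s share rank 29
print(rank_sums)
```

A quick sanity check is that the rank sums add up to N(N + 1)/2 = 465 for N = 30.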
50 TEST STATISTIC:
$$H = \frac{12}{N(N+1)} \sum_{j=1}^{K} \frac{T_j^2}{n_j} - 3(N+1)$$
$n_j$ = # data values in column j
$N = \sum_{j=1}^{K} n_j$
K = # Columns (levels)
$T_j$ = SUM OF RANKS OF DATA IN COL j, when all DATA COMBINED
(There is a slight adjustment in the formula as a function of the number of ties in rank.)
51 $$H = \frac{12}{30\,(31)} \left[ \frac{197^2}{10} + \frac{178^2}{10} + \frac{90^2}{10} \right] - 3\,(31) = 8.41$$
(with adjustment for ties, we get 8.46)
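The sketch below recomputes H from the rank sums and then applies a tie adjustment. The slides do not show their adjustment formula, so this assumes the standard correction, which divides H by $1 - \sum (t^3 - t) / (N^3 - N)$, with t the size of each group of tied values. Under that assumption the adjusted value comes out near 8.47, slightly different from the 8.46 quoted above, possibly because of rounding.

```python
from collections import Counter

# Battery lifetimes (hours) and the rank sums T_j quoted on the slides.
brands = {
    "A": [32, 30, 30, 29, 26, 23, 20, 19, 18, 12],
    "B": [32, 32, 26, 26, 22, 20, 19, 16, 14, 14],
    "C": [28, 21, 15, 15, 14, 14, 14, 11, 9, 8],
}
rank_sums = {"A": 197.0, "B": 178.0, "C": 90.0}

N = sum(len(column) for column in brands.values())

# H = 12 / (N (N + 1)) * sum(T_j^2 / n_j) - 3 (N + 1)
h = 12 / (N * (N + 1)) * sum(
    rank_sums[name] ** 2 / len(column) for name, column in brands.items()
) - 3 * (N + 1)

# Assumed standard tie correction; untied values have t = 1 and contribute 0.
tie_sizes = Counter(y for column in brands.values() for y in column).values()
tie_term = sum(t ** 3 - t for t in tie_sizes)
h_adjusted = h / (1 - tie_term / (N ** 3 - N))

print(round(h, 2), round(h_adjusted, 2))
```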
52 What do we do with H?
We can show that, under $H_0$, H is well approximated by a $\chi^2$ distribution with df = K - 1.
Here, df = 2, and at $\alpha = .05$, the critical value = 5.99.
[Figure: $\chi^2$ density with df = 2; upper-tail area $\alpha = .05$ to the right of 5.99; H = 8.41 lies beyond the critical value]
Reject $H_0$; conclude that the lifetime distributions are NOT the same for all 3 BRANDS.
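For the special case df = 2, the $\chi^2$ upper-tail probability has a closed form, $P(\chi^2_2 > x) = e^{-x/2}$, so the critical value and an approximate P value can be checked with the standard library alone. This shortcut is an addition for illustration and holds only for df = 2; other degrees of freedom need a table or a statistics library.

```python
import math

h = 8.41      # Kruskal-Wallis statistic from the slides (without tie adjustment)
alpha = 0.05  # significance level used on the slides

# Valid for df = 2 only: P(chi-square > x) = exp(-x / 2).
critical_value = -2 * math.log(alpha)  # solves exp(-x / 2) = alpha
p_value = math.exp(-h / 2)             # approximate P value for H

reject_h0 = h > critical_value

print(round(critical_value, 2), round(p_value, 3), reject_h0)
```

Comparing H with the critical value and comparing the P value with $\alpha$ are two views of the same decision.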