The Analysis of Variance. One-Way ANOVA  We use ANOVA when we want to look at statistical relationships (difference in means for example) between more.


1 The Analysis of Variance

2 One-Way ANOVA: We use ANOVA when we want to look at statistical relationships (differences in means, for example) among more than two populations or samples. ANOVA is a natural extension of the ideas used in two-population t-tests and the other methods we have explored.

3 Trouble on the School Board! Despite the school board's best efforts, sensitive test score data for a large urban school district was leaked to the press! The issue is a long-standing argument that children in the inner city do not receive the same quality of education as children in the suburban parts of the city. This could be very embarrassing for both the board and the mayor! Here's the data:

4 A school board official states: “The data is roughly normally distributed and is what you would expect for a random sample of 90 students, 30 from each of the East, Central and West districts.” NOT SO FAST! Take a closer look at the data and check for “structure”.

5 Our investigative reporter took Stats 300 in college! Here is what she did: she sorted the data into East, Central and West “bins”. The box plot suggests a cover-up!

6 Digging further… the full set becomes

7 Further tests…Thanks StatsMan!

8

9 Summary of the 3 data sets:  Is there a statistical hypothesis lurking about?

10 The Hypotheses: Let μ1, μ2 and μ3 be the mean scores for the three populations (Pop 1 = East, Pop 2 = Central, Pop 3 = West). Ho: μ1 = μ2 = μ3. Ha: ? The null hypothesis is pretty straightforward. Why is this a problem?

11 Could we do this with pairwise two-sample t-tests (East vs. Central, East vs. West, Central vs. West)? YES! What does this imply?

12 We have good evidence to reject the null hypothesis: the Central district scores are significantly lower than those of the other two districts.

13 Could we just use pairwise t-tests? If we had 12 school districts that we were testing in the same way as in the previous case, how would the analysis change? How many pairs would there be? How many false positives would we expect at the 5% significance level (95% confidence)?
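
The two questions on this slide can be answered with a few lines of arithmetic. The sketch below is only an illustration: the function name `pairwise_test_burden` is made up for this example, and the family-wise error figure assumes the tests are independent, which pairwise tests that share samples are not, so treat it as a rough guide.

```python
from math import comb


def pairwise_test_burden(groups: int, alpha: float = 0.05):
    """Count pairwise comparisons and the false positives to expect when Ho is true.

    Returns (number of pairs, expected false positives, chance of at least one
    false positive). The last value ASSUMES independent tests.
    """
    pairs = comb(groups, 2)                  # number of distinct pairs of groups
    expected_false_positives = pairs * alpha
    at_least_one = 1 - (1 - alpha) ** pairs  # independence assumption
    return pairs, expected_false_positives, at_least_one


pairs, expected, at_least_one = pairwise_test_burden(12)
print(pairs, round(expected, 1), at_least_one > 0.9)
```

With 12 districts there are 66 pairs, so about 3.3 false positives are expected at the 5% level even when every district has the same mean.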

14 Why we can't use multiple pairwise t-tests, or why we should consider the entire set:
1. As the number of pairs increases, the chance of a false positive (an erroneous rejection of the null hypothesis) increases. Treating the data as one set decreases the chance of false positives.
2. By pooling all of the information (not just pairs) we get a much more precise estimate of the standard deviation in the population.
3. By treating all of the data together we can potentially detect interesting relationships between subgroups, which could easily be overlooked if we approached the data in a pair-wise fashion.

15 Setting up for ANOVA: You guessed it, yet more terminology! In 12.1 and 12.2 we will introduce: a method to get an estimate of the standard deviation σ for the entire population (the pooled estimator); a new spin on degrees of freedom (df); and a new test for significance, the F-test.

16 Pooled Estimator for σ: This is a generalization of the method we used in pooled two-sample t-tests. The pooled expression begins to measure the total variation in a population. Each s_i^2 term measures variation within a given sample. “I” represents the total number of independent SRSs.
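
The expression on this slide was an image and did not survive in the transcript. The standard textbook form of the pooled estimator is given below, on the assumption that it matches what the slide showed; here n_i and s_i are the size and standard deviation of sample i, and N is the total number of observations.

```latex
s_p^2 \;=\; \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + \dots + (n_I-1)s_I^2}
                 {(n_1-1) + (n_2-1) + \dots + (n_I-1)}
      \;=\; \frac{\sum_{i=1}^{I} (n_i-1)\,s_i^2}{N-I},
\qquad N = \sum_{i=1}^{I} n_i
```

When all samples have the same size, this reduces to the plain average of the sample variances.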

17 Sigma Rule… If the largest standard deviation in a set of I SRSs is less than twice the smallest, then we can estimate the common standard deviation σ with the pooled estimator.

18 Example: What is the pooled estimate for sigma for the 3 school districts? I = 3 (East, Central and West are the SRSs) and n1 = n2 = n3 = 30.
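
The district standard deviations were on an image that is not in this transcript, so the sketch below uses made-up placeholder values purely to show the mechanics: check the sigma rule first, then pool. The function names `sigma_rule_ok` and `pooled_sd` are hypothetical, and the printed number is not the school board result.

```python
from math import sqrt


def sigma_rule_ok(sds):
    """Sigma rule: the largest SD must be less than twice the smallest."""
    return max(sds) < 2 * min(sds)


def pooled_sd(sizes, sds):
    """Pooled standard deviation from per-sample sizes and standard deviations."""
    numerator = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    denominator = sum(n - 1 for n in sizes)  # equals N - I
    return sqrt(numerator / denominator)


sizes = [30, 30, 30]            # East, Central, West (from the slide)
sds = [30.0, 32.0, 33.0]        # PLACEHOLDER values, not the real district SDs

if sigma_rule_ok(sds):
    print(round(pooled_sd(sizes, sds), 2))
else:
    print("Sigma rule fails: pooling is not appropriate")
```

Because the three samples are the same size, the pooled variance here is simply the average of the three sample variances.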

19 Part II – Developing the F-Test. Conceptual model: a collection of SRSs drawn from a larger population illustrates two different kinds of variation: internal variation around the sample mean within a given SRS, and variation of the SRS means around the overall (grand) mean.

20 ANOVA compares the two kinds of variability. The null hypothesis is often equivalent to saying that the populations overlap (have the same mean, for example). Another way of saying this is that the SRSs share the “grand mean” of the entire population. This could happen if the individual SRSs have large variation internally but not externally. We need a way to quantify this.

21 The F-Value: We can compare the variation between samples with the variation within samples by calculating a mean square for each. This is expressed as the ratio F = MS_b / MS_w (between over within). We will get to F-distributions in a few moments.

22 Mean Square Error – MSE(within): This is what the pooled estimator determines: the within-sample mean square is the square of the pooled standard deviation. This means that our school board data has an internal MSE of (31.8)^2 ≈ 1011.

23 Mean Square Error – MSE(between)  To determine this we need the “grand mean” for all of the data:

24 Mean Square Error – MSE(between)  Define as: A new application of the idea of degrees of freedom
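
The formulas on slides 23 and 24 were images and are missing from the transcript. The standard definitions are sketched below, on the assumption that they match the slides; \bar{x}_i is the mean of sample i, n_i its size, and N the total number of observations.

```latex
\bar{x} \;=\; \frac{1}{N}\sum_{i=1}^{I} n_i\,\bar{x}_i
\qquad\text{(grand mean)}

MS_b \;=\; \frac{\sum_{i=1}^{I} n_i\,(\bar{x}_i - \bar{x})^2}{I-1}
\qquad\text{(between-samples mean square, with } I-1 \text{ degrees of freedom)}
```

The divisor I − 1 is the new use of degrees of freedom: once the grand mean is fixed, only I − 1 of the I sample means are free to vary.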

25 Example – school board data: We can now determine the “F-Value” for this data:

26 I Don't Get It! Confused? We are almost there. We now know how to quantify the variation within SRSs (MS_w) and the variation between the means of the SRSs (MS_b). The “F-ratio” can be compared against tables, just as we did for z-tests and t-tests.
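
To see the two mean squares and their ratio come together, here is a minimal from-scratch sketch. The function name `one_way_anova` and the tiny data set are invented for illustration (they are not the school board scores), and the within-samples degrees of freedom are taken as N − I, the standard choice for one-way ANOVA.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (ms_between, ms_within, f_ratio, df_between, df_within)."""
    num_groups = len(groups)
    total_n = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / total_n

    # Variation of the sample means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of each observation around its own sample mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = num_groups - 1
    df_within = total_n - num_groups
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within, df_between, df_within


# Toy data, made up for this example
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ms_b, ms_w, f_ratio, df_b, df_w = one_way_anova(toy_groups)
print(ms_b, ms_w, f_ratio, df_b, df_w)
```

In the toy data the third sample sits well away from the other two while each sample is tight internally, so MS_b is much larger than MS_w and the F-ratio is well above 1.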

27 How to Use an “F-ratio”: You need to know some important numbers. The number of SRSs (I): from this we form the degrees of freedom for the MS_b term (the numerator), df_b = I − 1. The total number of data points (the pooled data), N: from this we form the degrees of freedom for the MS_w term (the denominator), df_w = N − I. The F-ratio tests the null hypothesis (i.e., that the means are equal). If Ho is true, then F ≈ 1.

28 Testing the School Board's Claim: The school board's claim was that there was no difference between the three districts' mean test scores. Since there were 90 students (N = 90) and 3 groups (I = 3), we should use the F(I − 1, N − I) = F(2, 87) distribution. So use Table E with F = 88.8 on (2, 87) degrees of freedom. Since 87 denominator degrees of freedom is not listed, we need to approximate from the nearest rows. You should be able to bracket the p-value between an upper and a lower bound. With an F-ratio as big as 88.8 you really don't need to look it up: the evidence against Ho is overwhelming!
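
If no table is at hand, the p-value can be computed directly in this particular case. With three groups the numerator has 2 degrees of freedom, and for an F(2, d) distribution the upper-tail probability has the closed form (d / (d + 2F))^(d/2). The sketch below relies on that special case only; the function name is made up, the within-samples degrees of freedom are taken as N − I = 87, and for any other numerator degrees of freedom a statistics package would be needed.

```python
def f_upper_tail_df1_2(f_value: float, df_within: int) -> float:
    """P(F > f_value) for an F(2, df_within) distribution.

    Closed form that holds ONLY when the numerator degrees of freedom are 2.
    """
    return (df_within / (df_within + 2.0 * f_value)) ** (df_within / 2.0)


N, I = 90, 3                       # students and districts, from the slide
p_value = f_upper_tail_df1_2(88.8, N - I)
print(p_value < 0.001)             # is the p-value below the smallest table entry?
```

As a sanity check, the same function gives a tail probability of about 0.05 at F = 3.15 with 60 denominator degrees of freedom, which matches the usual tabled 5% critical value for F(2, 60).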

29 Use Minitab or EXCEL  Life is short! ANOVA is a complex (number intensive) process. Let’s look at two approaches: Minitab

30 Next lecture …  We will spend next lecture working through several examples of ANOVA  When doing this keep in mind what it is that you are calculating  Don’t get overwhelmed by the detail!

