Comparing k Populations

Means – One-way Analysis of Variance (ANOVA)

The F test – for comparing k means

Situation: We have k normal populations. Let μi and σi denote the mean and standard deviation of population i, i = 1, 2, 3, …, k. Note: we assume that the standard deviation for each population is the same:
σ1 = σ2 = … = σk = σ
We want to test
H0: μ1 = μ2 = … = μk
against
HA: at least one pair of means differ.

To test H0 against HA we use the F test statistic, built from the following quantities.

The statistic
SSBetween = Σi ni (x̄i − x̄)²
is called the Between Sum of Squares. It measures the variability between samples. k − 1 is known as the Between degrees of freedom, and
MSBetween = SSBetween / (k − 1)
is called the Between Mean Square.

The statistic
SSError = Σi Σj (xij − x̄i)²
is called the Error Sum of Squares. N − k is known as the Error degrees of freedom, and
MSError = SSError / (N − k)
is called the Error Mean Square.

Then
F = MSBetween / MSError
The computing formula for F:

Compute
1) the grand total G = Σi Σj xij
2) the treatment totals Ti = Σj xij
3) Σi Σj xij²
4) Σi Ti²/ni
5) G²/N

Then
1) SSTotal = Σi Σj xij² − G²/N
2) SSBetween = Σi Ti²/ni − G²/N
3) SSError = SSTotal − SSBetween
The critical region for the F test

We reject H0 if F ≥ Fα, where Fα is the critical point under the F distribution with ν1 = k − 1 degrees of freedom in the numerator and ν2 = N − k degrees of freedom in the denominator.
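The F statistic and its critical region can be sketched directly from these definitions. The data below are hypothetical, purely for illustration; `scipy` supplies the F critical point:

```python
import numpy as np
from scipy import stats

# Hypothetical samples from k = 3 normal populations (illustrative only)
groups = [np.array([5.0, 7.0, 6.0]),
          np.array([8.0, 9.0, 10.0]),
          np.array([4.0, 3.0, 5.0])]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between Sum of Squares: variability of the sample means about the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Error Sum of Squares: pooled variability within each sample
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)   # Between Mean Square, k - 1 df
ms_error = ss_error / (N - k)       # Error Mean Square, N - k df
F = ms_between / ms_error

# Critical point F_alpha with nu1 = k - 1, nu2 = N - k degrees of freedom
f_crit = stats.f.ppf(0.95, k - 1, N - k)
reject = F > f_crit
```

As a cross-check, `scipy.stats.f_oneway(*groups)` returns the same F statistic directly.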
Example

In the following example we compare the weight gains resulting from six diets:
Diet 1 - High Protein, Beef
Diet 2 - High Protein, Cereal
Diet 3 - High Protein, Pork
Diet 4 - Low Protein, Beef
Diet 5 - Low Protein, Cereal
Diet 6 - Low Protein, Pork

Hence F = 4.3.

Thus, since F exceeds the critical value Fα with ν1 = 5 and ν2 = 54 degrees of freedom, we reject H0.
The ANOVA Table

A convenient method for displaying the calculations for the F-test:

Source    d.f.    Sum of Squares   Mean Square   F-ratio
Between   k − 1   SSBetween        MSBetween     MSB/MSE
Within    N − k   SSError          MSError
Total     N − 1   SSTotal

The Diet Example:

Source    d.f.    Sum of Squares   Mean Square   F-ratio
Between   5       4612.933         922.587       4.3  (p = )
Within    54
Total     59
Using SPSS

Note: the use of another statistical package such as Minitab is similar to using SPSS.

Assume the data is contained in an Excel file, with each variable in a column:
Weight gain (wtgn)
diet
Source of protein (Source)
Level of Protein (Level)

After starting the SPSS program a dialogue box appears. If you select "Open an existing file" and press OK, a file-selection dialogue box appears.

If the variable names are in the file, ask it to read the names. If you do not specify the Range, the program will identify the Range. Once you click OK, two windows will appear: one that will contain the output, and the other containing the data.

To perform the ANOVA select Analyze → General Linear Model → Univariate. In the dialog box that appears, select the dependent variable and the fixed factors, then press OK to perform the analysis.

The Output
Comments: The F-test compares H0: μ1 = μ2 = … = μk against HA: at least one pair of means differ. If H0 is accepted we conclude that all means are equal (not significantly different). If H0 is rejected we conclude that at least one pair of means is significantly different. The F-test gives no information as to which pairs of means differ; one can then use two-sample t tests to determine which pairs of means are significantly different.
Fisher's LSD (least significant difference) procedure:

1) Test H0: μ1 = μ2 = … = μk against HA: at least one pair of means differ, using the ANOVA F-test.
2) If H0 is accepted we conclude that all means are equal (not significantly different), and stop.
3) If H0 is rejected we conclude that at least one pair of means is significantly different, and follow up with two-sample t tests to determine which pairs of means differ.
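A minimal sketch of the follow-up step, on hypothetical data: after a significant F-test, each pair of means is compared against a least significant difference built from the pooled within-group variance (MSError):

```python
import numpy as np
from itertools import combinations
from scipy import stats

# Hypothetical data for k = 3 groups (illustrative only)
groups = {"A": np.array([5.0, 7.0, 6.0]),
          "B": np.array([8.0, 9.0, 10.0]),
          "C": np.array([4.0, 3.0, 5.0])}

k = len(groups)
N = sum(len(g) for g in groups.values())
# Error Mean Square: pooled within-group variance on N - k df
ms_error = sum(((g - g.mean()) ** 2).sum() for g in groups.values()) / (N - k)

t_crit = stats.t.ppf(0.975, N - k)  # two-sided comparison at alpha = 0.05
significant = {}
for (name1, g1), (name2, g2) in combinations(groups.items(), 2):
    se = np.sqrt(ms_error * (1 / len(g1) + 1 / len(g2)))
    lsd = t_crit * se                       # least significant difference
    diff = abs(g1.mean() - g2.mean())
    significant[(name1, name2)] = diff > lsd
```

A pair of means is declared significantly different when the absolute difference of sample means exceeds the LSD.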
Linear Regression: Hypothesis testing and Estimation

Assume that we have collected data on two variables X and Y. Let
(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
denote the n pairs of measurements on the two variables X and Y for the cases in a sample (or population).
The Statistical Model

Each yi is assumed to be randomly generated from a normal distribution with mean
μi = α + β xi
and standard deviation σ (α, β and σ are unknown). The line Y = α + βX has slope β and intercept α.

The Linear Regression Model: the data falls roughly about a straight line, Y = α + βX, which is unseen.
The Least Squares Line

Fitting the best straight line to "linear" data.

Let Y = a + bX denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y. For example, if X = xi (as for the ith case) then the predicted value of Y is ŷi = a + b xi.
The residual
ri = yi − ŷi = yi − (a + b xi)
can be computed for each case in the sample. The residual sum of squares
RSS = Σi (yi − a − b xi)²
is a measure of the "goodness of fit" of the line Y = a + bX to the data.

The optimal choice of a and b will result in the residual sum of squares attaining a minimum. If this is the case then the line Y = a + bX is called the Least Squares Line.
The equation for the least squares line

Let
Sxx = Σ (xi − x̄)²
Sxy = Σ (xi − x̄)(yi − ȳ)
Syy = Σ (yi − ȳ)²

Computing formulae:
Sxx = Σ xi² − (Σ xi)²/n
Sxy = Σ xi yi − (Σ xi)(Σ yi)/n
Syy = Σ yi² − (Σ yi)²/n

Then the slope of the least squares line can be shown to be
b = Sxy / Sxx

and the intercept of the least squares line can be shown to be
a = ȳ − b x̄

The residual sum of squares: computing formula
RSS = Syy − Sxy² / Sxx

Estimating σ, the standard deviation in the regression model:
s = √(RSS / (n − 2)) = √((Syy − Sxy²/Sxx) / (n − 2))
This estimate of σ is said to be based on n − 2 degrees of freedom.
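These formulas translate directly into code. The (x, y) values here are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical (x, y) pairs (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# The three quantities S_xx, S_xy, S_yy
Sxx = ((x - x.mean()) ** 2).sum()
Sxy = ((x - x.mean()) * (y - y.mean())).sum()
Syy = ((y - y.mean()) ** 2).sum()

b = Sxy / Sxx                 # slope of the least squares line
a = y.mean() - b * x.mean()   # intercept of the least squares line

# Residual sum of squares and the estimate of sigma (n - 2 degrees of freedom)
rss = Syy - Sxy ** 2 / Sxx
s = np.sqrt(rss / (n - 2))
```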
Sampling distributions of the estimators

The sampling distribution of the slope of the least squares line: it can be shown that b has a normal distribution with mean β and standard deviation
σb = σ / √Sxx

Thus
z = (b − β) / (σ / √Sxx)
has a standard normal distribution, and
t = (b − β) / (s / √Sxx)
has a t distribution with df = n − 2.
(1 − α)100% Confidence Limits for the slope β:
b ± tα/2 s / √Sxx
where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

Testing the slope: to test H0: β = 0 against HA: β ≠ 0 the test statistic is
t = b / (s / √Sxx)
which has a t distribution with df = n − 2 if H0 is true.

The Critical Region: reject H0 if |t| > tα/2, df = n − 2. This is a two-tailed test; one-tailed tests are also possible.
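A sketch of the confidence interval and the two-tailed test for the slope, on hypothetical illustrative data:

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
Sxx = ((x - x.mean()) ** 2).sum()
Sxy = ((x - x.mean()) * (y - y.mean())).sum()
Syy = ((y - y.mean()) ** 2).sum()
b = Sxy / Sxx                                   # slope estimate
s = np.sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))   # estimate of sigma

se_b = s / np.sqrt(Sxx)                 # standard error of the slope
t_crit = stats.t.ppf(0.975, n - 2)      # t_{alpha/2} with n - 2 df
ci = (b - t_crit * se_b, b + t_crit * se_b)  # 95% confidence limits

# Two-tailed test of H0: beta = 0
t_stat = b / se_b
reject = abs(t_stat) > t_crit
```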
The sampling distribution of the intercept of the least squares line: it can be shown that a has a normal distribution with mean α and standard deviation
σa = σ √(1/n + x̄²/Sxx)

Thus
z = (a − α) / (σ √(1/n + x̄²/Sxx))
has a standard normal distribution, and
t = (a − α) / (s √(1/n + x̄²/Sxx))
has a t distribution with df = n − 2.

(1 − α)100% Confidence Limits for the intercept α:
a ± tα/2 s √(1/n + x̄²/Sxx)
where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

Testing the intercept: to test H0: α = 0 against HA: α ≠ 0 the test statistic is
t = a / (s √(1/n + x̄²/Sxx))
which has a t distribution with df = n − 2 if H0 is true.

The Critical Region: reject H0 if |t| > tα/2, df = n − 2.
Example

The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men.

TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men.

Country (i): Australia, Canada, Denmark, Finland, Great Britain, Holland, Iceland, Norway, Sweden, Switzerland, USA
Fitting the Least Squares Line

First compute the three quantities Sxx, Sxy and Syy, then the estimates of the slope (b), the intercept (a) and the standard deviation (s).

95% Confidence Limits for the slope β:
computed as b ± t.025 s/√Sxx, where t.025 is the critical value for the t-distribution with 9 degrees of freedom.

95% Confidence Limits for the intercept α:
−4.34 to 17.85, where t.025 is the critical value for the t-distribution with 9 degrees of freedom.

The fitted line is Y = a + (0.228)X, with the 95% confidence limits for the slope and intercept as given above.
Testing for a positive slope

We test H0: β = 0 against HA: β > 0. The test statistic is
t = b / (s / √Sxx)

The Critical Region: reject H0 if t > t.05, df = 11 − 2 = 9 (a one-tailed test).

Since the test statistic falls in the critical region, we reject H0 and conclude that the slope is positive.
Confidence Limits for Points on the Regression Line

The intercept α is a specific point on the regression line: it is the y-coordinate of the point on the regression line when x = 0, i.e. the predicted value of y when x = 0. We may also be interested in other points on the regression line, e.g. when x = x0. In this case the y-coordinate of the point on the regression line at x = x0 is α + β x0.

(1 − α)100% Confidence Limits for α + β x0:
a + b x0 ± tα/2 s √(1/n + (x0 − x̄)²/Sxx)
where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
Prediction Limits for new values of the dependent variable y

An important application of the regression line is prediction: knowing the value of x (x0), what is the value of y? The predicted value of y when x = x0 is α + β x0, which in turn is estimated by ŷ = a + b x0.

The predictor ŷ gives only a single value for y. A more appropriate piece of information would be a range of values: a range that has a fixed probability of capturing the value of y, i.e. a (1 − α)100% prediction interval for y.

(1 − α)100% Prediction Limits for y when x = x0:
a + b x0 ± tα/2 s √(1 + 1/n + (x0 − x̄)²/Sxx)
where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
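The confidence limits for a point on the line and the prediction limits for a new observation differ only by the extra 1 under the square root; a sketch with hypothetical data:

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
Sxx = ((x - x.mean()) ** 2).sum()
Sxy = ((x - x.mean()) * (y - y.mean())).sum()
Syy = ((y - y.mean()) ** 2).sum()
b = Sxy / Sxx
a = y.mean() - b * x.mean()
s = np.sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))
t_crit = stats.t.ppf(0.975, n - 2)

x0 = 3.5
y_hat = a + b * x0                      # point on the fitted line at x0
# Confidence limits for the point a + b*x0 on the line
half_conf = t_crit * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
conf = (y_hat - half_conf, y_hat + half_conf)
# Prediction limits for a new observation y at x0 (the extra 1 makes it wider)
half_pred = t_crit * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
pred = (y_hat - half_pred, y_hat + half_pred)
```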
Example

In this example we are studying building fires in a city, and are interested in the relationship between X = the distance between the building that raised the alarm and the closest fire hall, and Y = the cost of the damage ($1000s). The data were collected on n = 15 fires.

The Data

Scatter Plot

Computations

95% Confidence Limits for the slope β:
4.07 to 5.77, where t.025 is the critical value for the t-distribution with 13 degrees of freedom.

95% Confidence Limits for the intercept α:
7.21 to 13.35, where t.025 is the critical value for the t-distribution with 13 degrees of freedom.

Least Squares Line: y = 4.92x + 10.28
(1 − α)100% Confidence Limits for α + β x0:
a + b x0 ± tα/2 s √(1/n + (x0 − x̄)²/Sxx)
where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

95% Confidence Limits for α + β x0

(1 − α)100% Prediction Limits for y when x = x0:
a + b x0 ± tα/2 s √(1 + 1/n + (x0 − x̄)²/Sxx)
where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

95% Prediction Limits for y when x = x0
Linear Regression Summary: Hypothesis testing and Estimation

(1 − α)100% Confidence Limits for the slope β:
b ± tα/2 s / √Sxx

Testing the slope (H0: β = 0): the test statistic t = b / (s / √Sxx) has a t distribution with df = n − 2 if H0 is true.

(1 − α)100% Confidence Limits for the intercept α:
a ± tα/2 s √(1/n + x̄²/Sxx)

Testing the intercept (H0: α = 0): the test statistic t = a / (s √(1/n + x̄²/Sxx)) has a t distribution with df = n − 2 if H0 is true.

(1 − α)100% Confidence Limits for α + β x0:
a + b x0 ± tα/2 s √(1/n + (x0 − x̄)²/Sxx)

(1 − α)100% Prediction Limits for y when x = x0:
a + b x0 ± tα/2 s √(1 + 1/n + (x0 − x̄)²/Sxx)

In each case tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.
Comparing k Populations: Proportions

The χ² test for independence

Situation: We have two categorical variables R and C. The number of categories of R is r; the number of categories of C is c. We observe n subjects from the population and count
xij = the number of subjects for which R = i and C = j
(R = rows, C = columns).
Example: Both systolic blood pressure (C) and serum cholesterol (R) were measured for a sample of n = 1237 subjects. The categories for blood pressure and for cholesterol are given as intervals in the table.

Table: two-way frequency table
The χ² test for independence

Define Eij = the expected frequency in the (i,j)th cell in the case of independence:
Eij = (row total i)(column total j) / n

Then to test H0: R and C are independent against HA: R and C are not independent, use the test statistic
χ² = Σi Σj (xij − Eij)² / Eij
where Eij is the expected frequency in the (i,j)th cell under independence and xij is the observed frequency in the (i,j)th cell.

Sampling distribution of the test statistic when H0 is true: the χ² distribution with degrees of freedom ν = (r − 1)(c − 1).

Critical and Acceptance Region: reject H0 if χ² ≥ χ²α; accept H0 if χ² < χ²α.
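The whole test can be sketched on a small hypothetical 2×3 table of counts; `scipy.stats.chi2_contingency` reproduces the same statistic:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 contingency table of observed counts (illustrative only)
obs = np.array([[20, 30, 50],
                [30, 20, 50]])

row = obs.sum(axis=1, keepdims=True)   # row totals
col = obs.sum(axis=0, keepdims=True)   # column totals
n = obs.sum()
E = row * col / n                      # expected counts under independence

chi2 = ((obs - E) ** 2 / E).sum()      # test statistic
df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
crit = stats.chi2.ppf(0.95, df)        # chi-square critical point
reject = chi2 > crit
```

For tables larger than 2×2, `scipy.stats.chi2_contingency(obs)` returns the same χ² and degrees of freedom.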
Standardized residuals

For the blood pressure/cholesterol example the test statistic has degrees of freedom ν = (r − 1)(c − 1) = 9, and we reject H0 using α = 0.05.
Another Example

This data comes from a Globe and Mail study examining the attitudes of the baby boomers. Data was collected on various age groups.

One question, with responses: are there differences in weekly consumption of alcohol related to age?

Table: Expected frequencies

Table: Residuals. Conclusion: there is a significant relationship between age group and weekly alcohol use.

Examining the residuals allows one to identify the cells that indicate a departure from independence. Large positive residuals indicate cells where the observed frequencies were larger than expected under independence; large negative residuals indicate cells where the observed frequencies were smaller than expected under independence.
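Under independence these residuals behave roughly like standard normal values, so cells whose residuals are large in absolute value stand out. A sketch, with hypothetical counts:

```python
import numpy as np

# Hypothetical contingency table of observed counts (illustrative only)
obs = np.array([[20, 30, 50],
                [30, 20, 50]])
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
E = row * col / obs.sum()              # expected counts under independence

# Standardized (Pearson) residuals: (observed - expected) / sqrt(expected)
resid = (obs - E) / np.sqrt(E)

# Positive entries: observed more often than independence predicts
# Negative entries: observed less often than independence predicts
```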
Another question, with responses: "In an average week, how many times would you surf the internet?" Are there differences in weekly internet use related to age?

Table: Expected frequencies

Table: Residuals. Conclusion: there is a significant relationship between age group and weekly internet use.

The age groups: Echo (Age 20–29), Gen X (Age 30–39), Younger Boomers (Age 40–49), Older Boomers (Age 50–59), Pre Boomers (Age 60+).