Comparing k Populations


Comparing k Populations Means – One way Analysis of Variance (ANOVA)

The F test

The F test – for comparing k means Situation: We have k normal populations. Let μi and σi denote the mean and standard deviation of population i, i = 1, 2, 3, …, k. Note: we assume that the standard deviation is the same for each population: σ1 = σ2 = … = σk = σ.

We want to test H0: μ1 = μ2 = μ3 = … = μk (all population means are equal) against HA: μi ≠ μj for at least one pair (i, j).

To test H0: μ1 = μ2 = … = μk against HA: at least one pair of means differ, use the test statistic

F = [Σ ni(x̄i − x̄)² / (k − 1)] / [Σi Σj (xij − x̄i)² / (N − k)]

where x̄i is the mean of sample i, ni is its size, x̄ is the grand mean, and N = Σ ni.

The statistic SSBetween = Σ ni(x̄i − x̄)² is called the Between Sum of Squares. It measures the variability between samples. k − 1 is known as the Between degrees of freedom, and MSBetween = SSBetween / (k − 1) is called the Between Mean Square.

The statistic SSError = Σi Σj (xij − x̄i)² is called the Error Sum of Squares. N − k is known as the Error degrees of freedom, and MSError = SSError / (N − k) is called the Error Mean Square.

Then F = MSBetween / MSError.

The Computing formula for F: Compute
1) Ti = Σj xij, the total for each sample
2) G = Σi Ti, the grand total
3) Σi Σj xij²
4) G² / N
5) Σi Ti² / ni

Then
1) SSTotal = Σi Σj xij² − G²/N
2) SSBetween = Σi Ti²/ni − G²/N
3) SSError = SSTotal − SSBetween

The critical region for the F test: we reject H0 if F > Fα, where Fα is the critical point of the F distribution with ν1 = k − 1 degrees of freedom in the numerator and ν2 = N − k degrees of freedom in the denominator.
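The calculation of SSBetween, SSError and the F-ratio can be sketched in a few lines of Python. The three samples below are made-up illustration data, not from the slides.

```python
# One-way ANOVA F computation for k independent samples (illustration data).
groups = [[5, 6, 7], [8, 9, 10], [11, 12, 13]]

k = len(groups)                        # number of populations
N = sum(len(g) for g in groups)        # total number of observations
grand_mean = sum(sum(g) for g in groups) / N

# Between sum of squares: variability of the sample means around the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Error sum of squares: variability of observations around their own sample mean
ss_error = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

ms_between = ss_between / (k - 1)      # Between mean square, k - 1 df
ms_error = ss_error / (N - k)          # Error mean square, N - k df
F = ms_between / ms_error

print(ss_between, ss_error, F)         # F is then compared with F(k-1, N-k)
```

For these samples F = 27, far beyond the 5% critical point of the F distribution with 2 and 6 degrees of freedom, so H0 would be rejected.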

Example: In the following example we are comparing weight gains resulting from the following six diets:
Diet 1 - High Protein, Beef
Diet 2 - High Protein, Cereal
Diet 3 - High Protein, Pork
Diet 4 - Low Protein, Beef
Diet 5 - Low Protein, Cereal
Diet 6 - Low Protein, Pork

Hence, with k = 6 and N = 60: SSBetween = 4612.933, SSError = 11586.000 and SSTotal = 16198.933.

Thus F = MSBetween / MSError = 922.587 / 214.556 = 4.3. Since F > F0.05 = 2.386, we reject H0.

The ANOVA Table: a convenient method for displaying the calculations for the F-test.

ANOVA Table

Source    d.f.    Sum of Squares    Mean Square    F-ratio
Between   k - 1   SSBetween         MSBetween      MSB/MSE
Within    N - k   SSError           MSError
Total     N - 1   SSTotal

The Diet Example

Source    d.f.    Sum of Squares    Mean Square    F-ratio
Between   5       4612.933          922.587        4.3   (p = 0.0023)
Within    54      11586.000         214.556
Total     59      16198.933

Using SPSS Note: The use of another statistical package such as Minitab is similar to using SPSS

Assume the data is contained in an Excel file

Each variable is in a column: weight gain (wtgn), diet, source of protein (Source), level of protein (Level).

After starting the SPSS program the following dialogue box appears:

If you select Opening an existing file and press OK, the following dialogue box appears:

The following dialogue box appears:

If the variable names are in the file, ask the program to read the names. If you do not specify the Range, the program will identify the Range. Once you click OK, two windows will appear.

One that will contain the output:

The other containing the data:

To perform the ANOVA select Analyze -> General Linear Model -> Univariate

The following dialog box appears

Select the dependent variable and the fixed factors, then press OK to perform the analysis.

The Output

Comments on the F-test. We test H0: μ1 = μ2 = μ3 = … = μk against HA: at least one pair of means is different. If H0 is accepted we conclude that all means are equal (not significantly different). If H0 is rejected we conclude that at least one pair of means is significantly different. The F-test gives no information as to which pairs of means are different. One can then use two-sample t tests to determine which pairs of means are significantly different.

Fisher's LSD (least significant difference) procedure: Test H0: μ1 = μ2 = μ3 = … = μk against HA: at least one pair of means is different, using the ANOVA F-test. If H0 is accepted we conclude that all means are equal (not significantly different) and stop. If H0 is rejected we conclude that at least one pair of means is significantly different, and follow up with two-sample t tests to determine which pairs of means are significantly different.

Linear Regression: Hypothesis testing and Estimation

Assume that we have collected data on two variables X and Y. Let (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

The Statistical Model

Each yi is assumed to be randomly generated from a normal distribution with mean μi = α + βxi and standard deviation σ (α, β and σ are unknown). [Figure: the line Y = α + βX, with intercept α and slope β.]

The Data and the Linear Regression Model: the data falls roughly about an unseen straight line Y = α + βX.

The Least Squares Line: fitting the best straight line to "linear" data.

Let Y = a + bX denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y. For example, if X = xi (as for the ith case) then the predicted value of Y is ŷi = a + bxi.

The residual ei = yi − ŷi = yi − (a + bxi) can be computed for each case in the sample. The residual sum of squares RSS = Σ ei² = Σ (yi − a − bxi)² is a measure of the "goodness of fit" of the line Y = a + bX to the data.

The optimal choice of a and b will result in the residual sum of squares attaining a minimum. If this is the case then the line Y = a + bX is called the Least Squares Line.

The equation for the least squares line. Let
Sxx = Σ (xi − x̄)²
Syy = Σ (yi − ȳ)²
Sxy = Σ (xi − x̄)(yi − ȳ)

Computing Formulae:
Sxx = Σ xi² − (Σ xi)²/n
Syy = Σ yi² − (Σ yi)²/n
Sxy = Σ xiyi − (Σ xi)(Σ yi)/n

Then the slope of the least squares line can be shown to be b = Sxy / Sxx

and the intercept of the least squares line can be shown to be a = ȳ − b·x̄.

The residual sum of squares, computing formula: RSS = Syy − Sxy²/Sxx.

Estimating σ, the standard deviation in the regression model. Computing formula: s = √(RSS / (n − 2)) = √((Syy − Sxy²/Sxx) / (n − 2)). This estimate of σ is said to be based on n − 2 degrees of freedom.

Sampling distributions of the estimators

The sampling distribution of the slope of the least squares line: it can be shown that b has a normal distribution with mean β and standard deviation σ/√Sxx.

Thus z = (b − β) / (σ/√Sxx) has a standard normal distribution, and t = (b − β) / (s/√Sxx) has a t distribution with df = n − 2.

(1 − α)100% Confidence Limits for the slope β: b ± tα/2 · s/√Sxx, where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

Testing the slope. To test H0: β = 0 against HA: β ≠ 0, the test statistic is t = b / (s/√Sxx), which has a t distribution with df = n − 2 if H0 is true.

The Critical Region: reject H0 if |t| > tα/2, df = n − 2. This is a two tailed test; one tailed tests are also possible.

The sampling distribution of the intercept of the least squares line: it can be shown that a has a normal distribution with mean α and standard deviation σ·√(1/n + x̄²/Sxx).

Thus z = (a − α) / (σ√(1/n + x̄²/Sxx)) has a standard normal distribution, and t = (a − α) / (s√(1/n + x̄²/Sxx)) has a t distribution with df = n − 2.

(1 − α)100% Confidence Limits for the intercept α: a ± tα/2 · s·√(1/n + x̄²/Sxx), where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

Testing the intercept. To test H0: α = 0 against HA: α ≠ 0, the test statistic is t = a / (s√(1/n + x̄²/Sxx)), which has a t distribution with df = n − 2 if H0 is true.

The Critical Region: reject H0 if |t| > tα/2, df = n − 2.

Example

The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

Country (i)      Xi    Yi
Australia        48    18
Canada           50    15
Denmark          38    17
Finland         110    35
Great Britain   110    46
Holland          49    24
Iceland          23     6
Norway           25     9
Sweden           30    11
Switzerland      51    25
USA             130    20

Fitting the Least Squares Line

Fitting the Least Squares Line: first compute the following three quantities (here Σxi = 664, Σyi = 226, Σxi² = 54404, Σyi² = 6018, Σxiyi = 16914):
Sxx = 54404 − 664²/11 = 14322.55
Syy = 6018 − 226²/11 = 1374.73
Sxy = 16914 − (664)(226)/11 = 3271.82

Computing the estimates of the slope (b), intercept (a) and standard deviation (s):
b = Sxy / Sxx = 3271.82 / 14322.55 = 0.2284
a = ȳ − b·x̄ = 20.55 − (0.2284)(60.36) = 6.756
s = √((Syy − Sxy²/Sxx) / (n − 2)) = √(627.32 / 9) = 8.35

95% Confidence Limits for the slope β: 0.0706 to 0.3862. t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.

95% Confidence Limits for the intercept α: -4.34 to 17.85. t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.

The least squares line: Y = 6.756 + (0.228)X. 95% confidence limits for the slope: 0.0706 to 0.3862. 95% confidence limits for the intercept: -4.34 to 17.85.
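The whole fit above can be reproduced from the cigarette data with the computing formulae; a minimal sketch in Python (the t critical value 2.262 is taken from tables, as on the slides):

```python
import math

# Cigarette consumption (x) and lung-cancer death rate (y), from the table
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

# Computing formulae for Sxx, Syy and Sxy
s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = s_xy / s_xx                         # slope estimate
a = sum(y) / n - b * sum(x) / n         # intercept estimate
rss = s_yy - s_xy ** 2 / s_xx           # residual sum of squares
s = math.sqrt(rss / (n - 2))            # estimate of sigma, n - 2 df

t_crit = 2.262                          # t(.025) with 9 df, from tables
half_width = t_crit * s / math.sqrt(s_xx)
ci_slope = (b - half_width, b + half_width)   # 95% limits for the slope

t_stat = b / (s / math.sqrt(s_xx))      # t statistic for H0: beta = 0

print(round(b, 3), round(a, 3), round(s, 2))
print(round(ci_slope[0], 4), round(ci_slope[1], 4), round(t_stat, 2))
```

This recovers the slide's values: b = 0.228, a = 6.756, s = 8.35 and the 95% slope limits 0.0706 to 0.3862.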

Testing for a positive slope: we test H0: β = 0 against HA: β > 0. The test statistic is t = b / (s/√Sxx) = 0.2284 / 0.0698 = 3.27.

The Critical Region: reject H0 if t > t0.05 = 1.833, df = 11 − 2 = 9 (a one tailed test).

Since t = 3.27 > 1.833, we reject H0: β = 0 and conclude HA: β > 0 (higher cigarette consumption is associated with a higher lung cancer death rate).

Confidence Limits for Points on the Regression Line. The intercept α is a specific point on the regression line: it is the y-coordinate of the point on the regression line when x = 0, i.e. the predicted value of y when x = 0. We may also be interested in other points on the regression line, e.g. when x = x0. In this case the y-coordinate of the point on the regression line is α + βx0.

[Figure: the regression line y = α + βx, showing the point (x0, α + βx0).]

(1 − α)100% Confidence Limits for α + βx0: (a + bx0) ± tα/2 · s·√(1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

Prediction Limits for new values of the dependent variable y. An important application of the regression line is prediction: knowing the value of x (x0), what is the value of y? The predicted value of y when x = x0 is α + βx0. This in turn can be estimated by ŷ = a + bx0.

The predictor ŷ = a + bx0 gives only a single value for y. A more appropriate piece of information would be a range of values: a range that has a fixed probability of capturing the value of y, i.e. a (1 − α)100% prediction interval for y.

(1 − α)100% Prediction Limits for y when x = x0: (a + bx0) ± tα/2 · s·√(1 + 1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
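The difference between confidence limits for the line and prediction limits for a new y (the extra 1 under the square root) can be seen by computing both at the same point. The sketch below continues with the cigarette data; the choice x0 = 60 is a hypothetical illustration, not from the slides.

```python
import math

# Cigarette data again: fit the line, then compute both kinds of limits at x0
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)
x_bar = sum(x) / n

s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = s_xy / s_xx
a = sum(y) / n - b * x_bar
s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))

x0 = 60                                 # hypothetical consumption level
y_hat = a + b * x0                      # point on the regression line
t_crit = 2.262                          # t(.025), n - 2 = 9 df, from tables

# Confidence half-width for a + b*x0 (the mean of y at x0)
ci_half = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
# Prediction half-width for a single new y at x0 (note the extra 1)
pi_half = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)

print(round(y_hat, 1), round(ci_half, 1), round(pi_half, 1))
```

The prediction interval is always wider than the confidence interval for the line, since it must allow for the variability of a single new observation as well as the uncertainty in the fitted line.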

Example: In this example we are studying building fires in a city, and are interested in the relationship between X = the distance between the building that raises the alarm and the closest fire hall, and Y = the cost of the damage ($1000s). The data were collected on n = 15 fires.

The Data

Scatter Plot

Computations

Computations Continued

Computations Continued

Computations Continued

95% Confidence Limits for the slope β: 4.07 to 5.77. t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.

95% Confidence Limits for the intercept α: 7.21 to 13.35. t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.

Least Squares Line: y = 4.92x + 10.28

(1 − α)100% Confidence Limits for α + βx0: (a + bx0) ± tα/2 · s·√(1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

95% Confidence Limits for a + b x0 :

95% Confidence Limits for a + b x0

(1 − α)100% Prediction Limits for y when x = x0: (a + bx0) ± tα/2 · s·√(1 + 1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

95% Prediction Limits for y when x = x0

95% Prediction Limits for y when x = x0

Linear Regression Summary: Hypothesis testing and Estimation

(1 − α)100% Confidence Limits for the slope β: b ± tα/2 · s/√Sxx, where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

Testing the slope. To test H0: β = 0 against HA: β ≠ 0, the test statistic is t = b / (s/√Sxx), which has a t distribution with df = n − 2 if H0 is true.

(1 − α)100% Confidence Limits for the intercept α: a ± tα/2 · s·√(1/n + x̄²/Sxx), where tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

Testing the intercept. To test H0: α = 0 against HA: α ≠ 0, the test statistic is t = a / (s√(1/n + x̄²/Sxx)), which has a t distribution with df = n − 2 if H0 is true.

(1 − α)100% Confidence Limits for α + βx0: (a + bx0) ± tα/2 · s·√(1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

(1 − α)100% Prediction Limits for y when x = x0: (a + bx0) ± tα/2 · s·√(1 + 1/n + (x0 − x̄)²/Sxx), where tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

Comparing k Population Proportions: The χ² test for independence

The χ² test for independence

Situation: We have two categorical variables R and C. The number of categories of R is r; the number of categories of C is c. We observe n subjects from the population and count xij = the number of subjects for which R = i and C = j (R = rows, C = columns).

Example: Both systolic blood pressure (C) and serum cholesterol (R) were measured for a sample of n = 1237 subjects. The categories for blood pressure are: <126, 127-146, 147-166, 167+. The categories for cholesterol are: <200, 200-219, 220-259, 260+.

Table: two-way frequency

The χ² test for independence. Define Eij = (Ri · Cj) / n = the expected frequency in the (i,j)th cell in the case of independence, where Ri is the total for row i and Cj is the total for column j.

Then to test H0: R and C are independent against HA: R and C are not independent, use the test statistic χ² = Σi Σj (xij − Eij)² / Eij, where Eij = expected frequency in the (i,j)th cell in the case of independence and xij = observed frequency in the (i,j)th cell.

Sampling distribution of the test statistic when H0 is true: the χ² distribution with degrees of freedom ν = (r − 1)(c − 1). Critical and acceptance region: reject H0 if χ² > χ²α; accept H0 if χ² ≤ χ²α.
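The expected frequencies, the χ² statistic and the standardized residuals of the next slides can all be computed from the observed table alone. The 2x2 table below is made-up illustration data (the slide's 1237-subject table is not reproduced in the transcript), and the critical value for 1 df is hardcoded from tables.

```python
import math

# Made-up 2x2 frequency table for illustration (rows = R, columns = C)
observed = [[10, 20], [20, 10]]
r, c = len(observed), len(observed[0])
n = sum(sum(row) for row in observed)

row_totals = [sum(row) for row in observed]
col_totals = [sum(observed[i][j] for i in range(r)) for j in range(c)]

# Expected frequencies under independence: E_ij = (row total * column total) / n
expected = [[row_totals[i] * col_totals[j] / n for j in range(c)]
            for i in range(r)]

# Chi-square test statistic and standardized residuals
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(r) for j in range(c))
residuals = [[(observed[i][j] - expected[i][j]) / math.sqrt(expected[i][j])
              for j in range(c)] for i in range(r)]

df = (r - 1) * (c - 1)
critical = 3.841                       # chi-square .05 critical value, 1 df
reject_h0 = chi2 > critical

print(round(chi2, 3), df, reject_h0)
```

For this table every expected frequency is 15, χ² = 6.667 with 1 degree of freedom, and H0 (independence) is rejected at α = 0.05.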

Standardized residuals: rij = (xij − Eij) / √Eij. For the blood pressure and cholesterol data the test statistic has degrees of freedom ν = (r − 1)(c − 1) = 9, and we reject H0 using α = 0.05.

Another Example This data comes from a Globe and Mail study examining the attitudes of the baby boomers. Data was collected on various age groups

One question, with responses: Are there differences in weekly consumption of alcohol related to age?

Table: Expected frequencies

Table: Residuals Conclusion: There is a significant relationship between age group and weekly alcohol use

Examining the residuals allows one to identify the cells that indicate a departure from independence. Large positive residuals indicate cells where the observed frequencies were larger than expected under independence; large negative residuals indicate cells where the observed frequencies were smaller than expected under independence.

Another question with responses In an average week, how many times would you surf the internet? Are there differences in weekly internet use related to age?

Table: Expected frequencies

Table: Residuals Conclusion: There is a significant relationship between age group and weekly internet use

Echo (Age 20 – 29)

Gen X (Age 30 – 39)

Younger Boomers (Age 40 – 49)

Older Boomers (Age 50 – 59)

Pre Boomers (Age 60+)