
1 Copyright ©2003 Brooks/Cole, a division of Thomson Learning, Inc. Example. Let y be the monthly sales revenue for a company. This might be a function of several variables:
– x1 = advertising expenditure
– x2 = time of year
– x3 = state of the economy
– x4 = size of inventory
We want to predict y using knowledge of x1, x2, x3, and x4.

2 A Simple Linear Model. In Chapter 3, we used the equation of a line to describe the relationship between y and x for a sample of n pairs, (x, y). If we want to describe the relationship between y and x for the whole population, there are two models we can choose:
– Deterministic model: y = α + βx
– Probabilistic model: y = deterministic model + random error, that is, y = α + βx + ε

3 A Simple Linear Model. Since the bivariate measurements that we observe do not generally fall exactly on a straight line, we choose the probabilistic model:
– y = α + βx + ε
– E(y) = α + βx
Points deviate from the line of means by an amount ε, where ε has a normal distribution with mean 0 and variance σ².

4 The Method of Least Squares. The equation of the best-fitting line is calculated using a set of n pairs (xi, yi). We choose our estimates a and b to estimate α and β so that the sum of the squared vertical distances of the points from the line, SSE = Σ(yi − ŷi)², is minimized.

5 Least Squares Estimators. The least squares line is ŷ = a + bx, with
– b = Sxy/Sxx and a = ȳ − b·x̄
where Sxy = Σxy − (Σx)(Σy)/n and Sxx = Σx² − (Σx)²/n.

6 Example. The table shows the math achievement test scores for a random sample of n = 10 college freshmen, along with their final calculus grades. Use your calculator to find the sums and sums of squares.

Student             1    2    3    4    5    6    7    8    9   10
Math test, x       39   43   21   64   57   47   28   75   34   52
Calculus grade, y  65   78   52   82   92   89   73   98   56   75
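The sums, sums of squares, and least squares estimates for these data can be checked with a short Python sketch (the code is not part of the original slides; the data come from the table above):

```python
# Math test scores x and calculus grades y for the n = 10 freshmen
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)

sum_x, sum_y = sum(x), sum(y)                  # 460, 760
sum_xx = sum(v * v for v in x)                 # 23634
sum_xy = sum(u * v for u, v in zip(x, y))      # 36854

Sxx = sum_xx - sum_x**2 / n                    # 2474.0
Sxy = sum_xy - sum_x * sum_y / n               # 1894.0

b = Sxy / Sxx                                  # slope, about 0.7656
a = sum_y / n - b * sum_x / n                  # intercept, about 40.78
```

These values match the Minitab output shown later (y = 40.8 + 0.766 x).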

7 Example. For these data, Σx = 460, Σy = 760, Σx² = 23634, Σxy = 36854, and Σy² = 59816, so that b = Sxy/Sxx = 1894/2474 = 0.7656, a = ȳ − b·x̄ = 76 − 0.7656(46) = 40.78, and the best-fitting line is ŷ = 40.78 + 0.766x.

8 The Analysis of Variance. The total variation in the experiment is measured by the total sum of squares, Total SS = Syy = Σ(y − ȳ)². The Total SS is divided into two parts:
– SSR (sum of squares for regression): measures the variation explained by using x in the model.
– SSE (sum of squares for error): measures the leftover variation not explained by x.

9 The Analysis of Variance. We calculate
– SSR = (Sxy)²/Sxx
– SSE = Total SS − SSR

10 The ANOVA Table. Total df = n − 1, partitioned as Regression df = 1 and Error df = n − 1 − 1 = n − 2. The mean squares are MSR = SSR/1 and MSE = SSE/(n − 2).

Source       df       SS        MS            F
Regression   1        SSR       SSR/1         MSR/MSE
Error        n − 2    SSE       SSE/(n − 2)
Total        n − 1    Total SS

11 The Calculus Problem.

Source       df   SS          MS          F
Regression   1    1449.9741   1449.9741   19.14
Error        8    606.0259    75.7532
Total        9    2056.0000
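The entries in this ANOVA table follow from the sums of squares for the calculus data; a quick sketch (not from the original slides) that reproduces them:

```python
# Sxy, Sxx, and Syy (= Total SS) computed from the calculus data
Sxy, Sxx, Syy = 1894.0, 2474.0, 2056.0
n = 10

SSR = Sxy**2 / Sxx        # about 1449.97, variation explained by x
SSE = Syy - SSR           # about 606.03, leftover variation
MSR = SSR / 1
MSE = SSE / (n - 2)       # about 75.75
F = MSR / MSE             # about 19.14
```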

12 Testing the Usefulness of the Model. The first question to ask is whether the independent variable x is of any use in predicting y. If it is not, then the value of y does not change, regardless of the value of x. This implies that the slope of the line, β, is zero.

13 Testing the Usefulness of the Model. The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic, t = b/√(MSE/Sxx), which has a t distribution with n − 2 df under H0: β = 0.

14 The Calculus Problem. Is there a significant relationship between the calculus grades and the test scores at the 5% level of significance? Test H0: β = 0 versus Ha: β ≠ 0, rejecting H0 when |t| > 2.306. Since t = 4.38 falls in the rejection region, H0 is rejected. There is a significant linear relationship between the calculus grades and the test scores for the population of college freshmen.
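The t statistic quoted above can be reproduced from the slope and MSE already computed; a short sketch (not part of the original slides):

```python
import math

# Estimates from the calculus fit
b, Sxx, MSE = 0.7656, 2474.0, 75.7532

t = b / math.sqrt(MSE / Sxx)   # about 4.38
# |t| > 2.306 (t_.025 with 8 df), so H0: beta = 0 is rejected
```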

15 The F Test. You can test the overall usefulness of the model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE. For simple linear regression, this test is exactly equivalent to the t test, with t² = F.

16 Minitab Output.

Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x   (the least squares regression line)

Predictor   Coef     SE Coef   T      P
Constant    40.784   8.507     4.79   0.001
x           0.7656   0.1750    4.38   0.002

S = 8.704   R-Sq = 70.5%   R-Sq(adj) = 66.8%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression       1    1450.0   1450.0   19.14   0.002
Residual Error   8    606.0    75.8
Total            9    2056.0

The Coef column gives the regression coefficients, a and b.

17 Measuring the Strength of the Relationship. If the independent variable x is useful in predicting y, you will want to know how well the model fits. The strength of the relationship between x and y can be measured using the coefficient of determination, r² = SSR/Total SS.

18 Measuring the Strength of the Relationship. Since Total SS = SSR + SSE, r² measures
– the proportion of the total variation in the responses that can be explained by using the independent variable x in the model;
– the percent reduction in the total variation achieved by using the regression equation rather than just the sample mean ȳ to estimate y.
For the calculus problem, r² = .705, or 70.5%. The model is working well!
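The r² value for the calculus problem follows directly from the ANOVA table; a one-line check (not from the original slides):

```python
# SSR and Total SS from the calculus ANOVA table
SSR, total_SS = 1449.9741, 2056.0

r2 = SSR / total_SS   # about 0.705, i.e. 70.5% of variation explained
```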

19 Checking the Regression Assumptions.
1. The relationship between x and y is linear, given by y = α + βx + ε.
2. The random error terms ε are independent and, for any value of x, have a normal distribution with mean 0 and variance σ².
Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied.

20 Residuals. The residual error is the "leftover" variation in each data point after the variation explained by the regression model has been removed. If all assumptions have been met, these residuals should be normal, with mean 0 and variance σ².

21 Normal Probability Plot. If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right. If not, you will often see the pattern fail in the tails of the graph.

22 Estimation and Prediction. Once you have determined that the regression line is useful and have used the diagnostic plots to check for violations of the regression assumptions, you are ready to use the regression line to
– estimate the average value of y for a given value of x;
– predict a particular value of y for a given value of x.

23 Estimation and Prediction.
– Estimating the average value of y when x = x0: ŷ ± t(α/2) √[MSE(1/n + (x0 − x̄)²/Sxx)]
– Predicting a particular value of y when x = x0: ŷ ± t(α/2) √[MSE(1 + 1/n + (x0 − x̄)²/Sxx)]

24 Estimation and Prediction. The best estimate of either E(y) or y for a given value x = x0 is ŷ = a + bx0. Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.

25 Copyright ©2003 Brooks/Cole A division of Thomson Learning, Inc. Estimation and Prediction

26 The Calculus Problem. Estimate the average calculus grade for students whose achievement score is 50, with a 95% confidence interval: ŷ = 40.78 + 0.766(50) = 79.06, giving 79.06 ± 6.55, or (72.51, 85.61).

27 The Calculus Problem. Predict the calculus grade for a particular student whose achievement score is 50, with a 95% prediction interval: 79.06 ± 21.11, or (57.95, 100.17). Notice how much wider this interval is!

28 Minitab Output. Confidence and prediction intervals when x = 50. The prediction bands are always wider than the confidence bands, and both intervals are narrowest when x = x̄.

Predicted Values for New Observations
New Obs   Fit     SE Fit   95.0% CI         95.0% PI
1         79.06   2.84     (72.51, 85.61)   (57.95, 100.17)

Values of Predictors for New Observations
New Obs   x
1         50.0
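The fit, confidence interval, and prediction interval in this output can be reproduced from the interval formulas and the quantities already computed for the calculus data; a sketch (not from the original slides):

```python
import math

# Fitted line and summary quantities for the calculus data
a, b = 40.784, 0.7656
n, xbar, Sxx, MSE = 10, 46.0, 2474.0, 75.7532
t_crit = 2.306                     # t_.025 with n - 2 = 8 df

x0 = 50
fit = a + b * x0                   # about 79.06

# Confidence interval for E(y) at x0
se_fit = math.sqrt(MSE * (1/n + (x0 - xbar)**2 / Sxx))       # about 2.84
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)          # about (72.51, 85.61)

# Prediction interval for a particular y at x0 (note the extra "1 +")
se_pred = math.sqrt(MSE * (1 + 1/n + (x0 - xbar)**2 / Sxx))
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)        # about (57.95, 100.17)
```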

29 Correlation Analysis. The strength of the relationship between x and y is measured using the coefficient of correlation, r = Sxy/√(Sxx·Syy). Recall from Chapter 3 that
(1) −1 ≤ r ≤ 1
(2) r and b have the same sign
(3) r ≈ 0 means no linear relationship
(4) r near 1 or −1 means a strong positive or negative relationship

30 Example. The table shows the heights and weights of n = 10 randomly selected college football players. Use your calculator to find the sums and sums of squares.

Player       1    2    3    4    5    6    7    8    9   10
Height, x   73   71   75   72   72   75   67   69   71   69
Weight, y  185  175  200  210  190  195  150  170  180  175

31 Some Correlation Patterns. Use the Exploring Correlation applet to explore some correlation patterns:
– r = 0: no correlation
– r = .931: strong positive correlation
– r = 1: exact linear relationship
– r = −.67: weaker negative correlation

32 Inference Using r. The population coefficient of correlation is called ρ ("rho"). We can test for a significant correlation between x and y using a t test: t = r√[(n − 2)/(1 − r²)], with n − 2 df. This test is exactly equivalent to the t test for the slope β.

33 Example. Is there a significant positive correlation between weight and height in the population of all college football players? Test H0: ρ = 0 versus Ha: ρ > 0. Using the t table with n − 2 = 8 df, the p-value is bounded as p-value < .005. There is a significant positive correlation.
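A sketch of the r and t computations for the football data (not from the original slides; the transcript's table collapsed a repeated cell, so the assumption here is that players 4 and 5 both have height 72):

```python
import math

# Heights and weights of the n = 10 players (height 72 for player 5 is
# an assumption: the transcript's table merged the repeated value)
x = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
y = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(x)

Sxx = sum(v * v for v in x) - sum(x)**2 / n
Syy = sum(v * v for v in y) - sum(y)**2 / n
Sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n

r = Sxy / math.sqrt(Sxx * Syy)             # about 0.83
t = r * math.sqrt((n - 2) / (1 - r**2))    # about 4.15
# t exceeds t_.005 = 3.355 with 8 df, so p-value < .005
```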

34 Example. Let y be the monthly sales revenue for a company. This might be a function of several variables:
– x1 = advertising expenditure
– x2 = time of year
– x3 = state of the economy
– x4 = size of inventory
We want to predict y using knowledge of x1, x2, x3, and x4.

35 The General Linear Model.
y = β0 + β1x1 + β2x2 + … + βkxk + ε, where
– y is the response variable you want to predict;
– β0, β1, β2, …, βk are unknown constants;
– x1, x2, …, xk are independent predictor variables, measured without error.

36 Example. Consider the model E(y) = β0 + β1x1 + β2x2. This is a first order model (the independent variables appear only to the first power).
– β0 is the y-intercept: the value of E(y) when x1 = x2 = 0.
– β1 and β2 are the partial regression coefficients: the change in y for a one-unit change in xi when the other independent variables are held constant.
The model traces a plane in three-dimensional space.

37 The Method of Least Squares. The best-fitting prediction equation is calculated using a set of n measurements (y, x1, x2, …, xk) as ŷ = b0 + b1x1 + … + bkxk. We choose our estimates b0, b1, …, bk to estimate β0, β1, …, βk so as to minimize SSE = Σ(y − ŷ)².
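Minimizing SSE leads to the normal equations (X'X)b = X'y, where X is the design matrix with a column of ones for the intercept. A minimal pure-Python sketch (not from the original slides; the toy data lying exactly on the plane y = 1 + 2x1 + 3x2 are hypothetical, chosen so the fit recovers the coefficients):

```python
# Hypothetical (x1, x2) rows; responses lie exactly on y = 1 + 2*x1 + 3*x2
rows = [(1, 1), (2, 1), (1, 3), (4, 2), (3, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

# Design matrix with an intercept column of ones
X = [[1.0, x1, x2] for x1, x2 in rows]
p = 3

# Build the normal equations (X'X) b = X'y
XtX = [[sum(X[i][r] * X[i][c] for i in range(len(X))) for c in range(p)]
       for r in range(p)]
Xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(p)]

# Solve by Gaussian elimination with partial pivoting
for col in range(p):
    piv = max(range(col, p), key=lambda r: abs(XtX[r][col]))
    XtX[col], XtX[piv] = XtX[piv], XtX[col]
    Xty[col], Xty[piv] = Xty[piv], Xty[col]
    for r in range(col + 1, p):
        f = XtX[r][col] / XtX[col][col]
        for c in range(col, p):
            XtX[r][c] -= f * XtX[col][c]
        Xty[r] -= f * Xty[col]

b = [0.0] * p
for r in range(p - 1, -1, -1):
    b[r] = (Xty[r] - sum(XtX[r][c] * b[c] for c in range(r + 1, p))) / XtX[r][r]
# b recovers [b0, b1, b2] close to [1, 2, 3]
```

In practice this is what a package like Minitab does internally (with more numerically careful linear algebra).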

38 Example. A computer database in a small community contains the listed selling price y (in thousands of dollars), the amount of living area x1 (in hundreds of square feet), and the number of floors x2, bedrooms x3, and bathrooms x4, for n = 15 randomly selected residences currently on the market. Fit a first order model to the data using the method of least squares.

Property    y       x1   x2   x3   x4
1           69.0     6    1    2    1
2          118.5    10    1    2    2
3          116.5    10    1    3    2
…           …        …    …    …    …
15         209.9    21    2    4    3

39 Example. The first order model E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 is fit using Minitab, with the values of y and the four independent variables entered into five columns of the Minitab worksheet.

Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor   Coef      SE Coef   T       P
Constant    18.763    9.207     2.04    0.069
SqFeet      6.2698    0.7252    8.65    0.000
NumFlrs     -16.203   6.212     -2.61   0.026
Bdrms       -2.673    4.494     -0.59   0.565
Baths       30.271    6.849     4.42    0.001

The Coef column gives the partial regression coefficients.

40 The Analysis of Variance. The total variation in the experiment is measured by the total sum of squares, Total SS = Σ(y − ȳ)². The Total SS is divided into two parts:
– SSR (sum of squares for regression): measures the variation explained by using the regression equation.
– SSE (sum of squares for error): measures the leftover variation not explained by the independent variables.

41 The ANOVA Table. Total df = n − 1, partitioned as Regression df = k and Error df = n − 1 − k = n − k − 1. The mean squares are MSR = SSR/k and MSE = SSE/(n − k − 1).

Source       df          SS        MS                F
Regression   k           SSR       SSR/k             MSR/MSE
Error        n − k − 1   SSE       SSE/(n − k − 1)
Total        n − 1       Total SS

42 The Real Estate Problem. Another portion of the Minitab printout shows the ANOVA table, with n = 15 and k = 4.

S = 6.849   R-Sq = 97.1%   R-Sq(adj) = 96.0%

Analysis of Variance
Source           DF   SS        MS       F       P
Regression       4    15913.0   3978.3   84.80   0.000
Residual Error   10   469.1     46.9
Total            14   16382.2

Source    DF   Seq SS
SqFeet    1    14829.3
NumFlrs   1    0.9
Bdrms     1    166.4
Baths     1    916.5

The sequential sums of squares give the conditional contribution of each independent variable to SSR, given the variables already entered into the model.

43 Testing the Usefulness of the Model. The first question to ask is whether the regression model is of any use in predicting y. If it is not, then the value of y does not change, regardless of the values of the independent variables x1, x2, …, xk. This implies that the partial regression coefficients β1, β2, …, βk are all zero.

44 The F Test. You can test the overall usefulness of the model using an F test, F = MSR/MSE with df1 = k and df2 = n − k − 1. If the model is useful, MSR will be large compared to the unexplained variation, MSE.

45 Measuring the Strength of the Relationship. If the independent variables are useful in predicting y, you will want to know how well the model fits. The strength of the relationship between y and the independent variables can be measured using the coefficient of determination, R² = SSR/Total SS.

46 Measuring the Strength of the Relationship. Since Total SS = SSR + SSE, R² measures
– the proportion of the total variation in the responses that can be explained by using the independent variables in the model;
– the percent reduction in the total variation achieved by using the regression equation rather than just the sample mean ȳ to estimate y.

47 Testing the Partial Regression Coefficients. Is a particular independent variable useful in the model, in the presence of all the other independent variables? The test statistic is a function of bi, our best estimate of βi: t = bi/SE(bi), which has a t distribution with error df = n − k − 1.

48 The Real Estate Problem. Is the overall model useful in predicting list price? How much of the overall variation in the response is explained by the regression model?

S = 6.849   R-Sq = 97.1%   R-Sq(adj) = 96.0%

Analysis of Variance
Source           DF   SS        MS       F       P
Regression       4    15913.0   3978.3   84.80   0.000
Residual Error   10   469.1     46.9
Total            14   16382.2

F = MSR/MSE = 84.80 with p-value = .000 is highly significant: the model is very useful in predicting the list price of homes. R² = .971 indicates that 97.1% of the overall variation is explained by the regression model.

49 The Real Estate Problem. In the presence of the other three independent variables, is the number of bedrooms significant in predicting the list price of homes? Test using α = .05.

Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor   Coef      SE Coef   T       P
Constant    18.763    9.207     2.04    0.069
SqFeet      6.2698    0.7252    8.65    0.000
NumFlrs     -16.203   6.212     -2.61   0.026
Bdrms       -2.673    4.494     -0.59   0.565
Baths       30.271    6.849     4.42    0.001

To test H0: β3 = 0, the test statistic is t = -0.59 with p-value = .565. The p-value is larger than .05 and H0 is not rejected. We cannot conclude that the number of bedrooms is a valuable predictor in the presence of the other variables. Perhaps the model could be refit without x3.

50 Comparing Regression Models. The strength of a regression model is measured using R² = SSR/Total SS. This value will only increase as variables are added to the model. To fairly compare two models, it is better to use a measure that has been adjusted for degrees of freedom: R²(adj) = 1 − [SSE/(n − k − 1)] / [Total SS/(n − 1)].
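Both R² and R²(adj) for the real estate fit can be reproduced from the ANOVA table for that problem; a quick sketch (not from the original slides):

```python
# Real estate fit: n = 15 observations, k = 4 predictors
n, k = 15, 4
SSE, total_SS = 469.1, 16382.2
SSR = total_SS - SSE

r2 = SSR / total_SS                                        # about 0.971
r2_adj = 1 - (SSE / (n - k - 1)) / (total_SS / (n - 1))    # about 0.960
```

Because the adjustment penalizes extra predictors, R²(adj) can decrease when a useless variable is added, which is why it is the fairer yardstick for comparing models.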

51 Checking the Regression Assumptions. The random error terms ε
– are independent;
– have mean 0 and common variance σ² for any set x1, x2, …, xk;
– have a normal distribution.
Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied.

52 Normal Probability Plot. If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right. If not, you will often see the pattern fail in the tails of the graph.

53 Estimation and Prediction. Enter the appropriate values of x1, x2, …, xk in Minitab. Minitab calculates ŷ and both the confidence interval and the prediction interval. Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.

54 The Real Estate Problem. Estimate the average list price for a home with 1000 square feet of living space, one floor, 3 bedrooms, and two baths, with a 95% confidence interval.

Predicted Values for New Observations
New Obs   Fit      SE Fit   95.0% CI           95.0% PI
1         117.78   3.11     (110.86, 124.70)   (101.02, 134.54)

Values of Predictors for New Observations
New Obs   SqFeet   NumFlrs   Bdrms   Baths
1         10.0     1.00      3.00    2.00

We estimate that the average list price will be between $110,860 and $124,700 for a home like this.

55 Using Regression Models. When you perform multiple regression analysis, use a step-by-step approach:
1. Obtain the fitted prediction model.
2. Use the analysis of variance F test and R² to determine how well the model fits the data.
3. Check the t tests for the partial regression coefficients to see which ones are contributing significant information in the presence of the others.
4. If you choose to compare several different models, use R²(adj) to compare their effectiveness.
5. Use diagnostic plots to check for violation of the regression assumptions.

56 A Polynomial Model. A response y may be related to a single independent variable x, but not in a linear manner. The polynomial model is y = β0 + β1x + β2x² + … + βkx^k + ε.
– When k = 2, the model is quadratic: y = β0 + β1x + β2x² + ε.
– When k = 3, the model is cubic: y = β0 + β1x + β2x² + β3x³ + ε.

57 Example. A market research firm has observed the sales (y) as a function of mass media advertising expenses (x) for 10 different companies selling a similar product. Since there is only one independent variable, you could fit a linear, quadratic, or cubic polynomial model. Which would you pick?

Company           1     2     3     4     5     6     7     8     9    10
Expenditure, x   1.0   1.6   2.5   3.0   4.0   4.6   5.0   5.7   6.0   7.0
Sales, y         2.5   2.6   2.7   5.0   5.3   9.1  14.8  17.5  23.0  28.0

58 Two Possible Choices.
– A straight line model: y = β0 + β1x + ε
– A quadratic model: y = β0 + β1x + β2x² + ε
Here is the Minitab printout for the straight line:

Regression Analysis: y versus x
The regression equation is y = - 6.47 + 4.34 x

Predictor   Coef     SE Coef   T       P
Constant    -6.465   2.795     -2.31   0.049
x           4.3355   0.6274    6.91    0.000

S = 3.725   R-Sq = 85.6%   R-Sq(adj) = 83.9%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression       1    662.46   662.46   47.74   0.000
Residual Error   8    111.00   13.88
Total            9    773.46

The overall F test is highly significant, as is the t test of the slope. R² = .856 suggests a good fit. Let's check the residual plots…

59 Example. There is a strong pattern of a "curve" left over in the residual plot. This indicates that there is a curvilinear relationship unaccounted for by the straight line model. You should have used the quadratic model! Use Minitab to fit the quadratic model: y = β0 + β1x + β2x² + ε.

60 The Quadratic Model.

Regression Analysis: y versus x, x-sq
The regression equation is y = 4.66 - 3.03 x + 0.939 x-sq

Predictor   Coef     SE Coef   T       P
Constant    4.657    2.443     1.91    0.098
x           -3.030   1.395     -2.17   0.067
x-sq        0.9389   0.1739    5.40    0.001

S = 1.752   R-Sq = 97.2%   R-Sq(adj) = 96.4%

Analysis of Variance
Source           DF   SS       MS       F        P
Regression       2    751.98   375.99   122.49   0.000
Residual Error   7    21.49    3.07
Total            9    773.47

The overall F test is highly significant, as is the t test of the quadratic term β2. R² = .972 suggests a very good fit. Let's compare the two models and check the residual plots.

61 Which Model to Use? Use R²(adj) to compare the models:
– The straight line model, y = β0 + β1x + ε: R²(adj) = 83.9%
– The quadratic model, y = β0 + β1x + β2x² + ε: R²(adj) = 96.4%
The quadratic model is better. There are no patterns in its residual plot, indicating that this is the correct model for the data.
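The two R²(adj) values come straight out of the adjustment formula applied to each Minitab printout; a quick sketch (not from the original slides):

```python
# Both fits use the same n = 10 companies and Total SS
n, total_SS = 10, 773.46

sse_line, k_line = 111.00, 1   # straight line model
sse_quad, k_quad = 21.49, 2    # quadratic model

r2_adj_line = 1 - (sse_line / (n - k_line - 1)) / (total_SS / (n - 1))  # about 0.839
r2_adj_quad = 1 - (sse_quad / (n - k_quad - 1)) / (total_SS / (n - 1))  # about 0.964
```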

62 Using Qualitative Variables. Multiple regression requires that the response y be a quantitative variable. Independent variables can be either quantitative or qualitative. Qualitative variables involving k categories are entered into the model by using k − 1 dummy variables. Example: to enter gender as a variable, use xi = 1 if male, 0 if female.

63 Example. Data were collected on 6 male and 6 female assistant professors. The researchers recorded their salaries (y) along with years of experience (x1). The professor's gender enters into the model as a dummy variable: x2 = 1 if male, 0 if not.

Professor   Salary, y   Experience, x1   Gender, x2   Interaction, x1x2
1           $50,710     1                1            1
2           49,510      1                0            0
…           …           …                …            …
11          55,590      5                1            5
12          53,200      5                0            0

64 Example. We want to predict a professor's salary based on years of experience and gender. We think that there may be a difference in salary depending on whether the professor is male or female. The model we choose includes experience (x1), gender (x2), and an interaction term (x1x2) to allow the salaries for males and females to behave differently.

65 Minitab Output. We use Minitab to fit the model.

Regression Analysis: y versus x1, x2, x1x2
The regression equation is y = 48593 + 969 x1 + 867 x2 + 260 x1x2

Predictor   Coef      SE Coef   T        P
Constant    48593.0   207.9     233.68   0.000
x1          969.00    63.67     15.22    0.000
x2          866.7     305.3     2.84     0.022
x1x2        260.13    87.06     2.99     0.017

S = 201.3   R-Sq = 99.2%   R-Sq(adj) = 98.9%

Analysis of Variance
Source           DF   SS         MS         F        P
Regression       3    42108777   14036259   346.24   0.000
Residual Error   8    324315     40539
Total            11   42433092

Is the overall model useful in predicting y? The overall F test is F = 346.24 with p-value = .000, and R² = .992 indicates that the model fits very well.

What is the regression equation for males? For females?
– For males (x2 = 1): y = 49459.7 + 1229.13x1
– For females (x2 = 0): y = 48593.0 + 969.0x1
Two different straight line models.

Is there a difference in the relationship between salary and years of experience, depending on the gender of the professor? Yes. The individual t test for the interaction term is t = 2.99 with p-value = .017, which indicates a significant interaction between gender and years of experience.
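Collapsing the dummy variable shows where the two straight lines come from; a sketch using the fitted coefficients from the Minitab output (not part of the original slides):

```python
# Fitted coefficients: y-hat = b0 + b1*x1 + b2*x2 + b3*x1*x2
b0, b1, b2, b3 = 48593.0, 969.0, 866.7, 260.13

def predicted_salary(years, male):
    """Predicted salary from the interaction model; x2 = 1 for males."""
    x2 = 1 if male else 0
    return b0 + b1 * years + b2 * x2 + b3 * years * x2

# Setting x2 = 1 or 0 collapses the model into two straight lines
male_intercept, male_slope = b0 + b2, b1 + b3      # 49459.7 and 1229.13
female_intercept, female_slope = b0, b1            # 48593.0 and 969.0
```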

66 Example. Have any of the regression assumptions been violated, or have we fit the wrong model? It does not appear from the diagnostic plots that there are any violations of assumptions. The model is ready to be used for prediction or estimation.

67 Testing Sets of Parameters. Suppose the demand y may be related to five independent variables, but the cost of measuring three of them is very high. If it could be shown that these three contribute little or no information, they can be eliminated. You want to test the null hypothesis H0: β3 = β4 = β5 = 0 (that is, the independent variables x3, x4, and x5 contribute no information for the prediction of y) versus the alternative hypothesis Ha: at least one of the parameters β3, β4, or β5 differs from 0 (that is, at least one of the variables x3, x4, or x5 contributes information for the prediction of y).

68 Testing Sets of Parameters. To explain how to test a hypothesis concerning a set of model parameters, we define two models:
– Model One (reduced model): the terms in model 1
– Model Two (complete model): the terms in model 1 plus the additional terms in model 2

69 Testing Sets of Parameters. The test of the hypothesis
H0: β3 = β4 = β5 = 0
Ha: at least one of the βi differs from 0
uses the test statistic F = [(SSE1 − SSE2)/(k − r)] / MSE2, where SSE1 and SSE2 are the sums of squares for error of the reduced and complete models, and MSE2 = SSE2/[n − (k + 1)]. F is based on df1 = (k − r) and df2 = n − (k + 1). The rejection region for the test is identical to other analysis of variance F tests, namely F > Fα.
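A sketch of this nested-model F statistic (not from the original slides; the SSE values, n, k, and r below are hypothetical numbers chosen only to illustrate the arithmetic):

```python
# Hypothetical error sums of squares for the two fitted models
SSE_reduced = 850.0    # model 1, r = 2 predictor terms
SSE_complete = 500.0   # model 2, k = 5 predictor terms
n, k, r = 30, 5, 2

MSE_complete = SSE_complete / (n - (k + 1))
F = ((SSE_reduced - SSE_complete) / (k - r)) / MSE_complete
# Compare F to the F critical value with df1 = k - r, df2 = n - (k + 1)
```

A large drop in SSE relative to MSE of the complete model (a large F) is evidence that the extra terms carry real information.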

70 Stepwise Regression. A stepwise regression analysis fits a variety of models to the data, adding and deleting variables as their significance in the presence of the other variables becomes significant or nonsignificant, respectively. Once the program has performed a sufficient number of iterations, and no more variables are significant when added to the model and none of the variables are nonsignificant when removed, the procedure stops. These programs always fit first-order models, and they are not helpful in detecting curvature or interaction in the data.

71 Pearson's Chi-Square Statistic. We have some preconceived idea about the values of the pi and want to use sample information to see if we are correct. The expected number of times that outcome i will occur is Ei = npi. The farther the observed cell counts, Oi, fall from what we hypothesize under H0, the more likely it is that H0 should be rejected.

72 Pearson's Chi-Square Statistic. We use the Pearson chi-square statistic X² = Σ (Oi − Ei)²/Ei. When H0 is true, the differences O − E will be small, but they will be large when H0 is false. Look for large values of X², based on the chi-square distribution with a particular number of degrees of freedom.

73 Degrees of Freedom. These will be different depending on the application.
1. Start with the number of categories or cells in the experiment, k.
2. Subtract 1 df for each linear restriction on the cell probabilities. (You always lose 1 df, since p1 + p2 + … + pk = 1.)
3. Subtract 1 df for every population parameter you have to estimate in order to calculate or estimate Ei.

74 The Goodness of Fit Test. The simplest of the applications: a single categorical variable is measured, and exact numerical values are specified for each of the pi.
– Expected cell counts: Ei = npi
– Degrees of freedom: df = k − 1

75 Example. Toss a die 300 times, with the following results. Is the die fair or biased? This is a multinomial experiment with k = 6 and O1 to O6 given in the table. We test:
H0: p1 = 1/6, p2 = 1/6, …, p6 = 1/6 (die is fair)
Ha: at least one pi is different from 1/6 (die is biased)

Upper face        1    2    3    4    5    6
Number of times  50   39   45   62   61   43

76 Example Calculate the expected cell counts: E_i = np_i = 300(1/6) = 50 for each of the six cells.

Upper face:   1   2   3   4   5   6
O_i:         50  39  45  62  61  43
E_i:         50  50  50  50  50  50

Test statistic and rejection region: X^2 = Σ (O_i − E_i)^2 / E_i = 9.2, with df = k − 1 = 5; reject H_0 at the 5% level if X^2 > 11.07. Since 9.2 < 11.07, do not reject H_0. There is insufficient evidence to indicate that the die is biased.
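Not part of the original slides: a short script checking the die example. The 5% critical value 11.07 for df = 5 is taken from a standard chi-square table.

```python
observed = [50, 39, 45, 62, 61, 43]   # counts for upper faces 1-6, from the slide
n = sum(observed)                     # 300 tosses
expected = [n / 6] * 6                # E_i = np_i = 300(1/6) = 50 under H0

# Pearson's statistic, summed over the six faces
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(stat, 4))                 # 9.2
# df = k - 1 = 5; the chi-square critical value at alpha = .05 is 11.07,
# so X^2 = 9.2 < 11.07 and we do not reject H0: the die appears fair.
```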

77 Some Notes The test statistic X^2 has only an approximate chi-square distribution. For the approximation to be accurate, statisticians recommend E_i ≥ 5 for all cells. Goodness-of-fit tests differ from previous tests, since the experimenter uses H_0 for the model he or she thinks is true: H_0: the model is correct (as specified) versus H_a: the model is not correct. Be careful not to accept H_0 (conclude the model is correct) without reporting β.

78 Contingency Tables: A Two-Way Classification The experimenter measures two qualitative variables to generate bivariate data: for example, gender and colorblindness, age and opinion, or professorial rank and type of university. Summarize the data by counting the observed number of outcomes in each of the intersections of category levels in a contingency table.

79 r x c Contingency Table The contingency table has r rows and c columns, for rc total cells.

        1      2     …     c
1     O_11   O_12    …   O_1c
2     O_21   O_22    …   O_2c
…      …      …      …    …
r     O_r1   O_r2    …   O_rc

We study the relationship between the two variables: is one method of classification contingent on, or dependent on, the other? Does the distribution of measurements in the various categories for variable 1 depend on which category of variable 2 is being observed? If not, the variables are independent.

80 Chi-Square Test of Independence H_0: the classifications are independent versus H_a: the classifications are dependent. Observed cell counts are O_ij for row i and column j; expected cell counts are E_ij = np_ij. If H_0 is true and the classifications are independent, p_ij = p_i p_j = P(falling in row i) × P(falling in column j).

81 Chi-Square Test of Independence Since the row and column probabilities are unknown, they are estimated from the marginal totals, giving estimated expected counts Ê_ij = (row i total)(column j total)/n. The test statistic X^2 = Σ (O_ij − Ê_ij)^2 / Ê_ij has an approximate chi-square distribution with df = (r − 1)(c − 1).

82 Example Furniture defects are classified according to type of defect and the shift on which the piece was made.

Type \ Shift    1    2    3   Total
A              15   26   33     74
B              21   31   17     69
C              45   34   49    128
D              13    5   20     38
Total          94   96  119    309

Do the data present sufficient evidence to indicate that the type of furniture defect varies with the shift during which the piece of furniture is produced? Test at the 1% level of significance. H_0: type of defect is independent of shift versus H_a: type of defect depends on the shift.

83 The Furniture Problem Calculate the estimated expected cell counts. For example, Ê_11 = (74)(94)/309 = 22.51.

Chi-Square Test: 1, 2, 3
Expected counts are printed below observed counts
         1      2      3   Total
1       15     26     33      74
     22.51  22.99  28.50
2       21     31     17      69
     20.99  21.44  26.57
3       45     34     49     128
     38.94  39.77  49.29
4       13      5     20      38
     11.56  11.81  14.63
Total   94     96    119     309
Chi-Sq = 2.506 + 0.394 + 0.711 + 0.000 + 4.266 + 3.449 +
         0.944 + 0.836 + 0.002 + 0.179 + 3.923 + 1.967 = 19.178
DF = 6, P-Value = 0.004

Reject H_0. There is sufficient evidence to indicate that the proportions of defect types vary from shift to shift.
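Not part of the original slides: a Python sketch of the independence test that reproduces the Minitab numbers above from the raw furniture table.

```python
table = [
    [15, 26, 33],   # defect type A across shifts 1-3
    [21, 31, 17],   # type B
    [45, 34, 49],   # type C
    [13,  5, 20],   # type D
]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)                                # 309

# Estimated expected counts: E_ij = (row i total)(column j total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

stat = sum((o - e) ** 2 / e
           for row_o, row_e in zip(table, expected)
           for o, e in zip(row_o, row_e))
print(round(expected[0][0], 2))   # 22.51
print(round(stat, 3))             # 19.178
# df = (4 - 1)(3 - 1) = 6, matching the Minitab output.
```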

84 Comparing Multinomial Populations Sometimes researchers design an experiment so that the number of experimental units falling in one set of categories is fixed in advance. Example: an experimenter selects 900 patients who have been treated for flu prevention, 300 from each of three groups: no vaccine, one shot, and two shots.

          No Vaccine   One Shot   Two Shots   Total
Flu                                             r_1
No Flu                                          r_2
Total        300          300        300     n = 900

The column totals have been fixed in advance!

85 Comparing Multinomial Populations Each of the c columns (or r rows) whose totals have been fixed in advance is actually a single multinomial experiment. The chi-square test of independence with (r − 1)(c − 1) df is equivalent to a test of the equality of the c (or r) multinomial populations. In the flu table above there are three binomial populations: no vaccine, one shot, and two shots. Is the probability of getting the flu independent of the type of flu prevention used?

86 Example Random samples of 200 voters in each of four wards were surveyed and asked if they favor candidate A in a local election.

Ward              1     2     3     4   Total
Favor A          76    53    59    48    236
Do not favor A  124   147   141   152    564
Total           200   200   200   200    800

Do the data present sufficient evidence to indicate that the fraction of voters favoring candidate A differs in the four wards? H_0: the fraction favoring A is independent of ward (equivalently, p_1 = p_2 = p_3 = p_4, where p_i is the fraction favoring A in ward i) versus H_a: the fraction favoring A depends on the ward.

87 The Voter Problem Calculate the estimated expected cell counts. For example, Ê_11 = (236)(200)/800 = 59.

Chi-Square Test: 1, 2, 3, 4
Expected counts are printed below observed counts
          1       2       3       4   Total
1        76      53      59      48     236
      59.00   59.00   59.00   59.00
2       124     147     141     152     564
     141.00  141.00  141.00  141.00
Total   200     200     200     200     800
Chi-Sq = 4.898 + 0.610 + 0.000 + 2.051 +
         2.050 + 0.255 + 0.000 + 0.858 = 10.722
DF = 3, P-Value = 0.013

Reject H_0. There is sufficient evidence to indicate that the fraction of voters favoring A varies from ward to ward.
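Not part of the original slides: the voter table checked the same way. Because all four column totals were fixed at 200, every expected count within a row is identical.

```python
table = [
    [76, 53, 59, 48],        # favor A, wards 1-4
    [124, 147, 141, 152],    # do not favor A
]
row_totals = [sum(row) for row in table]       # 236 and 564
n = sum(row_totals)                            # 800
col_total = 200                                # fixed in advance for each ward

# Equal column totals make each row's expected counts constant: 59s and 141s.
expected = [[r * col_total / n] * 4 for r in row_totals]

stat = sum((o - e) ** 2 / e
           for row_o, row_e in zip(table, expected)
           for o, e in zip(row_o, row_e))
print(round(stat, 3))   # 10.722
# df = (2 - 1)(4 - 1) = 3; the slide's Minitab output reports p-value = 0.013.
```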

88 The Voter Problem Since we know that there are differences among the four wards, what is the nature of those differences? Look at the proportions in favor of candidate A in the four wards.

Ward          1             2             3             4
Favor A   76/200 = .38  53/200 = .27  59/200 = .30  48/200 = .24

Candidate A is doing best in the first ward and worst in the fourth ward. More importantly, he does not have a majority of the vote in any of the wards!

