Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 9 Statistical Inferences Based on Two Samples Business Statistics in Practice.

Similar presentations


Presentation on theme: "Chapter 9 Statistical Inferences Based on Two Samples Business Statistics in Practice."— Presentation transcript:

1 Chapter 9 Statistical Inferences Based on Two Samples Business Statistics in Practice

2 Learning Objectives In this chapter, you learn:  How to use hypothesis testing for comparing the difference between  The means of two independent populations  The means of two related populations  The proportions of two independent populations  The variances of two independent populations

3 Two-Sample Tests ( 两总体检验 ) Two-Sample Tests Population Means, Independent Samples Means, Related Samples Population Variances Population 1 vs. independent Population 2 Same population before vs. after treatment Variance 1 vs. Variance 2 Examples: Population Proportions Proportion 1 vs. Proportion 2

4 Difference Between Two Means Population means, independent samples σ 1 and σ 2 known Goal: Test hypothesis or form a confidence interval for the difference between two population means, μ 1 – μ 2 The point estimate for the difference is X 1 – X 2 * σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal

5 Independent Samples ( 独立样本 ) Population means, independent samples  Different data sources  Unrelated  Independent  Sample selected from one population has no effect on the sample selected from the other population  Use the difference between 2 sample means  Use Z test, a pooled-variance t test, or a separate-variance t test * σ 1 and σ 2 known σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal

6 Difference Between Two Means Population means, independent samples σ 1 and σ 2 known * Use a Z test statistic Use S p to estimate unknown σ, use a t test statistic and pooled standard deviation σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal Use S 1 and S 2 to estimate unknown σ 1 and σ 2, use a separate-variance t test

7 Population means, independent samples σ 1 and σ 2 known σ 1 and σ 2 Known Assumptions:  Samples are randomly and independently drawn  Population distributions are normal or both sample sizes are  30  Population standard deviations are known * σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal

8 Population means, independent samples σ 1 and σ 2 known …and the standard error of X 1 – X 2 is When σ 1 and σ 2 are known and both populations are normal or both sample sizes are at least 30, the test statistic is a Z-value… (continued) σ 1 and σ 2 Known * σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal

9 Population means, independent samples σ 1 and σ 2 known The test statistic for μ 1 – μ 2 is: σ 1 and σ 2 Known * (continued) σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal

10 Hypothesis Tests for Two Population Means Lower-tail test: H 0 : μ 1 ≥ μ 2 H a : μ 1 < μ 2 i.e., H 0 : μ 1 – μ 2 ≥ 0 H a : μ 1 – μ 2 < 0 Upper-tail test: H 0 : μ 1 ≤ μ 2 H a : μ 1 > μ 2 i.e., H 0 : μ 1 – μ 2 ≤ 0 H a : μ 1 – μ 2 > 0 Two-tail test: H 0 : μ 1 = μ 2 H a : μ 1 ≠ μ 2 i.e., H 0 : μ 1 – μ 2 = 0 H a : μ 1 – μ 2 ≠ 0 Two Population Means, Independent Samples

11 Lower-tail test: H 0 : μ 1 – μ 2 ≥ 0 H a : μ 1 – μ 2 < 0 Upper-tail test: H 0 : μ 1 – μ 2 ≤ 0 H a : μ 1 – μ 2 > 0 Two-tail test: H 0 : μ 1 – μ 2 = 0 H a : μ 1 – μ 2 ≠ 0 aa /2 a -z a -z a/2 zaza z a/2 Reject H 0 if Z < -Z a Reject H 0 if Z > Z a Reject H 0 if Z < -Z a/2 or Z > Z a/2 Hypothesis tests for μ 1 – μ 2

12 Population means, independent samples σ 1 and σ 2 known The confidence interval for μ 1 – μ 2 is: Confidence Interval, σ 1 and σ 2 Known * σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal

13 Customer Waiting Time Case  A random sample of size 100 waiting times observed under the current system of serving customers has a sample waiting time mean of 8.79  Call this population 1  Assume population 1 is normal or sample size is large  The variance is 4.7  A random sample of size 100 waiting times observed under the new system of serving customers has a sample mean waiting time of 5.14  Call this population 2  Assume population 2 is normal or sample size is large  The variance is 1.9  Then if the samples are independent … Example 9.1 Example 9.1

14 Customer Waiting Time Case  At 95% confidence, z  /2 = z 0.025 = 1.96, and  According to the calculated interval, the bank manager can be 95% confident that the new system reduces the mean waiting time by between 3.15 and 4.15 minutes

15 Customer Waiting Time Case  Test the claim that the new system reduces the mean waiting time  Test at the  = 0.05 significance level the null H 0 :  1 –  2 ≤ 0 against the alternative H a :  1 –  2 > 0  Use the rejection rule H 0 if z > z   At the 5% significance level, z  = z 0.05 = 1.645  So reject H 0 if z > 1.645  Use the sample and population data in Example 9.1 to calculate the test statistic

16 Customer Waiting Time Case  Because z = 14.21 > z 0.05 = 1.645, reject H 0  Conclude that  1 –  2 is greater than 0 and therefore the new system does reduce the waiting time by 3.65 minutes  On average, reduces the mean time by 3.65 minutes  The p-value for this test is the area under the standard normal curve to the right of z = 14.21  This z value is off the table, so the p-value has to be much less than 0.001  So, we have extremely strong evidence that H 0 is false and that H a is true  Therefore, we have extremely strong evidence that the new system reduces the mean waiting time

17 Population means, independent samples σ 1 and σ 2 known σ 1 and σ 2 Unknown, Assumed Equal Assumptions:  Samples are randomly and independently drawn  Populations are normally distributed or both sample sizes are at least 30  Population variances are unknown but assumed equal * σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal

18 Population means, independent samples σ 1 and σ 2 known (continued) * Forming interval estimates:  The population variances are assumed equal, so use the two sample variances and pool them to estimate the common σ 2  the test statistic is a t value with (n 1 + n 2 – 2) degrees of freedom σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal σ 1 and σ 2 Unknown, Assumed Equal

19 Population means, independent samples σ 1 and σ 2 known The pooled variance ( 合并 方差 ) is (continued) * σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal σ 1 and σ 2 Unknown, Assumed Equal

20 Population means, independent samples σ 1 and σ 2 known Where t has (n 1 + n 2 – 2) d.f., and The test statistic for μ 1 – μ 2 is: * (continued) σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal σ 1 and σ 2 Unknown, Assumed Equal

21 Population means, independent samples σ 1 and σ 2 known The confidence interval for μ 1 – μ 2 is: Where * Confidence Interval, σ 1 and σ 2 Unknown σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal

22 Catalyst Comparison Case  The difference in mean hourly yields of a chemical process  Given:  Assume that populations of all possible hourly yields for the two catalysts are both normal with the same variance  The pooled estimate of  2 is  Let  1 be the mean hourly yield of catalyst 1 and let  2 be the mean hourly yield of catalyst 2 Example 9.2 Example 9.2

23 continued  Want the 95% confidence interval for  1 –  2  df = (n 1 + n 2 – 2) = (5 + 5 – 2) = 8  At 95% confidence, t  /2 = t 0.025. For 8 degrees of freedom, t 0.025 = 2.306  The 95% confidence interval is  So we can be confident that the mean hourly yield from catalyst 1 is between 30.38 and 91.22 pounds higher than that of catalyst 2  On average, the mean yields will differ by 60.8 lbs

24 Pooled-Variance t Test You are a financial analyst for a brokerage firm. Is there a difference in dividend yield between stocks listed on the NYSE & NASDAQ? You collect the following data: NYSE NASDAQ Number 21 25 Sample mean 3.27 2.53 Sample std dev 1.30 1.16 Assuming both populations are approximately normal with equal variances, is there a difference in average yield (  = 0.05)? Example 9.3 Example 9.3

25 Calculating the Test Statistic The test statistic is:

26 Solution H 0 : μ 1 - μ 2 = 0 i.e. (μ 1 = μ 2 ) H a : μ 1 - μ 2 ≠ 0 i.e. (μ 1 ≠ μ 2 )  = 0.05 df = 21 + 25 - 2 = 44 Critical Values: t = ± 2.0154 Test Statistic: Decision: Conclusion: Reject H 0 at a = 0.05 There is evidence of a difference in means. t 0 2.0154-2.0154.025 Reject H 0.025 2.040

27 Population means, independent samples σ 1 and σ 2 known σ 1 and σ 2 Unknown, Not Assumed Equal Assumptions:  Samples are randomly and independently drawn  Populations are normally distributed or both sample sizes are at least 30  Population variances are unknown but cannot be assumed to be equal * σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal

28 Population means, independent samples σ 1 and σ 2 known (continued) * Forming the test statistic:  The population variances are not assumed equal, so include the two sample variances in the computation of the t-test statistic  the test statistic is a t value with v degrees of freedom (see next slide) σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal σ 1 and σ 2 Unknown, Not Assumed Equal

29 Population means, independent samples σ 1 and σ 2 known The number of degrees of freedom is the integer portion of: (continued) * σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal σ 1 and σ 2 Unknown, Not Assumed Equal

30 Population means, independent samples σ 1 and σ 2 known The test statistic for μ 1 – μ 2 is: * (continued) σ 1 and σ 2 unknown, assumed equal σ 1 and σ 2 unknown, not assumed equal σ 1 and σ 2 Unknown, Not Assumed Equal

31 A recent EPA study compared the highway fuel economy of domestic and imported passenger cars. A sample of 15 domestic cars revealed a mean of 33.7 mpg with a sample standard deviation of 2.4 mpg. A sample of 12 imported cars revealed a mean of 35.7 mpg with a sample standard deviation of 3.9. Assuming both populations are approximately normal with equal variances, At the.05 significance level can the EPA conclude that the mpg for the domestic cars is lower than the imported cars? Exercise

32 Step 1 State the null and alternate hypotheses. H0: µD > µI H1: µD < µI Step 2 The.05 significance level is stated in the problem Step 3 Find the appropriate test statistic. we use the t distribution.

33 Step 4 The decision rule is to reject H0 if t<-1.708 or if p-value <.05. There are n-1 or 25 degrees of freedom. Step 5 We compute the pooled variance.

34 Since a computed t of –1.64 > critical t of –1.71, the p-value of.0567 >  of.05, H 0 is not rejected. There is insufficient sample evidence to claim a higher mpg on the imported cars.

35 Dependent samples are samples that are paired or related in some fashion. If you wished to buy a car you would look at the same car at two (or more) different dealerships and compare the prices. If you wished to measure the effectiveness of a new diet you would weigh the dieters at the start and at the finish of the program. Town and Country Cadillac Downtown Cadillac Related Populations ( 相关总体 )

36 Example (Related Population):  An analyst for Educational Testing Service wants to Compare the mean GMAT scores of students before & after taking a GMAT review course.  Nike wants to see if there is a difference in durability of 2 sole materials. One type is placed on one shoe, the other type on the other shoe of the same pair..

37 Test of Related Populations Tests Means of 2 Related Populations  Paired or matched samples  Repeated measures (before/after)  Use difference between paired values:  Eliminates Variation Among Subjects  Assumptions:  Both Populations Are Normally Distributed  Or, if not Normal, use large samples Related samples d i = X 1i - X 2i

38 Mean Difference The i th paired difference is d i, where Related samples d i = X 1i - X 2i The point estimate for the population mean paired difference is d : n is the number of pairs in the paired sample

39 We the unknown population standard deviation with a sample standard deviation: Related samples The sample standard deviation is Mean Difference, Estimate of σ d

40  Use a paired t test, the test statistic for d is a t statistic, with n-1 d.f.: Paired samples Where t has n - 1 d.f. and S d is: Mean Difference (continued)

41 The confidence interval for μ d is Paired samples where Confidence Interval

42 Lower-tail test: H 0 : μ d  0 H a : μ d < 0 Upper-tail test: H 0 : μ d ≤ 0 H a : μ d > 0 Two-tail test: H 0 : μ d = 0 H a : μ d ≠ 0 Paired Samples Hypothesis Testing for Mean Difference, σ d Unknown  /2  -t  -t  /2 tt t  /2 Reject H 0 if t < -t  Reject H 0 if t > t  Reject H 0 if t < -t   or t > t  Where t has n - 1 d.f.

43 Repair Cost Comparison Table: A sample of n=7 paired differences of the repair cost estimates at garages 1 and 2 (Cost estimate in hundreds of dollars) Damaged cars Repair cost estimates at garage 1 Repair cost estimate at garage 2 Paired differences Car 1$7.1$7.9d 1 =-0.8 Car 29.010.1d 2 =-1.1 Car 311.012.2d 3 =-1.2 Car 48.98.8d 4 =0.1 Car 59.910.4d 5 =-0.5 Car 69.19.8d 6 =-0.7 Car 710.311.7d 7 =-1.4 Example 9.4 Example 9.4

44 Repair Cost Comparison  Sample of n = 7 damaged cars  Each damaged car is taken to Garage 1 for its estimated repair cost, and then is taken to Garage 2 for its estimated repair cost  Estimated repair costs at Garage 1: 1 = 9.329  Estimated repair costs at Garage 2: 2 = 10.129  Sample of n = 7 paired differences and

45 continued  At 95% confidence, want t  /2 with n – 1 = 6 degrees of freedom t  /2 = 2.447  The 95% confidence interval is  Can be 95% confident that the mean of all possible paired differences of repair cost estimates at the two garages is between - $126.54 and -$33.46  Can be 95% confident that the mean of all possible repair cost estimates at Garage 1 is between $126.54 and $33.46 less than the mean of all possible repair cost estimates at Garage 2

46 Repair Cost Comparison  Now, test if repair cost at Garage 1 is less expensive than at Garage 2, that is, test if  d =  1 –  2 is less than zero  H 0 :  d ≥ 0 H a :  d < 0  Test at the  = 0.01 significance level.  Reject if t < –t , that is, if t < –t 0.01  With n – 1 = 6 degrees of freedom, t 0.01 = 3.143  So reject H 0 if t < –3.143

47 continued  Calculate the t statistic  Because t = –4.2053 is less than –t 0.01 = – 3.143, reject H 0  Conclude at the a = 0.01 significance level that the mean repair cost at Garage 1 is less than the mean repair cost of Garage 2  From a computer, for t = -4.2053, the p-value is 0.003  Because this p-value is very small, there is very strong evidence that H 0 should be rejected and that  1 is actually less than  2

48  Assume you send your salespeople to a “customer service” training workshop. Has the training made a difference in the number of complaints? You collect the following data: Paired t Test ( 成对 t 检验 ) Number of Complaints: (2) - (1) Salesperson Before (1) After (2) Difference, d i C.B. 6 4 - 2 T.F. 20 6 -14 M.H. 3 2 - 1 R.K. 0 0 0 M.O. 4 0 - 4 -21 d =  didi n = -4.2 Example 9.5 Example 9.5

49  Has the training made a difference in the number of complaints (at the 0.01 level)? - 4.2d = H 0 : μ d = 0 H a :  μ d  0 Test Statistic: Critical Value = ± 4.604 d.f. = n - 1 = 4 Reject  /2 - 4.604 4.604 Decision: Do not reject H 0 (t stat is not in the reject region) Conclusion: There is not a significant change in the number of complaints. Paired t Test: Solution Reject  /2 - 1.66  =.01

50 Two Population Proportions Goal: test a hypothesis or form a confidence interval for the difference between two population proportions, p 1 – p 2 The point estimate for the difference is Population proportions Assumptions: n 1 p 1  5, n 1 (1- p 1 )  5 n 2 p 2  5, n 2 (1- p 2 )  5

51 Two Population Proportions  Then the population of all possible values of  Has approximately a normal distribution if each of the sample sizes n 1 and n 2 is large  Here, n 1 and n 2 are large enough is n 1 p 1 ≥ 5, n 1 (1 - p 1 ) ≥ 5, n 2 p 2 ≥ 5, and n 1 (1 – p 2 ) ≥ 5  Has mean  Has standard deviation

52 Confidence Interval for Two Population Proportions Population proportions If the random samples are independent of each other, then the following a 100(1 – α) percent confidence interval for p 1 – p 2 is:

53 Population proportions The test statistic for p 1 – p 2 is a Z statistic: (continued) Testing the Difference of Two Population Proportions

54  If p 1 -p 2 = 0, estimate by  If p 1 -p 2 ≠ 0, estimate by (continued)

55 Hypothesis Tests for Two Population Proportions Population proportions Lower-tail test: H 0 : p 1  p 2 H a : p 1 < p 2 i.e., H 0 : p 1 – p 2  0 H a : p 1 – p 2 < 0 Upper-tail test: H 0 : p 1 ≤ p 2 H a : p 1 > p 2 i.e., H 0 : p 1 – p 2 ≤ 0 H a : p 1 – p 2 > 0 Two-tail test: H 0 : p 1 = p 2 H a : p 1 ≠ p 2 i.e., H 0 : p 1 – p 2 = 0 H a : p 1 – p 2 ≠ 0

56 Hypothesis Tests for Two Population Proportions Population proportions Lower-tail test: H 0 : p 1 – p 2  0 H a : p 1 – p 2 < 0 Upper-tail test: H 0 : p 1 – p 2 ≤ 0 H a : p 1 – p 2 > 0 Two-tail test: H 0 : p 1 – p 2 = 0 H a : p 1 – p 2 ≠ 0  /2  -z  -z  /2 zz z  /2 Reject H 0 if Z < -Z  Reject H 0 if Z > Z  Reject H 0 if Z < -Z   or Z > Z  (continued)

57 Two population Proportions Is there a significant difference between the proportion of men and the proportion of women who will vote Yes on Proposition A?  In a random sample, 36 of 72 men and 31 of 50 women indicated they would vote Yes  Test at the.05 level of significance Example 9.6 Example 9.6

58  The hypothesis test is: H 0 : p 1 – p 2 = 0 (the two proportions are equal) H a : p 1 – p 2 ≠ 0 (there is a significant difference between proportions)  The sample proportions are:  Men: = 36/72 =.50  Women: = 31/50 =.62  The pooled estimate for the overall proportion is: Two population Proportions (continued)

59 The test statistic for p 1 – p 2 is: Example: Two population Proportions (continued).025 -1.961.96.025 -1.31 Decision: Do not reject H 0 Conclusion: There is not significant evidence of a difference in proportions who will vote yes between men and women. Reject H 0 Critical Values = ±1.96 For  =.05

60 Chapter Summary  Compared two independent samples  Performed Z test for the difference in two means  Performed pooled variance t test for the difference in two means  Performed separate-variance t test for difference in two means  Formed confidence intervals for the difference between two means  Compared two related samples (paired samples)  Performed paired sample Z and t tests for the mean difference  Formed confidence intervals for the mean difference

61 Chapter Summary  Compared two population proportions  Formed confidence intervals for the difference between two population proportions  Performed Z-test for two population proportions (continued)

62 Chapter 11 Simple Linear Regression Analysis ( 线性回归分析 ) Business Statistics in Practice

63 Table 11.1 lists the percentage of the labour force that was unemployed during the decade 1991-2000. Plot a graph with the time (years after 1991) on the x axis and percentage of unemployment on the y axis. Do the points follow a clear pattern? Based on these data, what would you expect the percentage of unemployment to be in the year 2005? Number of Years Percentage of Year from 1991 Unemployed 1991 0 6.8 1992 1 7.5 1993 2 6.9 1994 3 6.1 1995 4 5.6 1996 5 5.4 1997 6 4.9 1998 7 4.5 1999 8 4.2 2000 9 4.0 Table 11.1 Percentage of Civilian Unemployment

64 The pattern does suggest that we may be able to get useful information by finding a line that “best fits” the data in some meaningful way. It produces the “best-fitting line”. Based on this formula, we can attempt a prediction of the unemployment rate in the year 2005: Note: Care must be taken when making predictions by extrapolating from known data, especially when the data set is as small as the one in this example.

65 Learning Objectives In this chapter, you learn:  How to use regression analysis to predict the value of a dependent variable based on an independent variable  The meaning of the regression coefficients b 0 and b 1  To make inferences about the slope and correlation coefficient  To estimate mean values and predict individual values

66 Correlation( 相关 ) vs. Regression( 回归 )  A scatter diagram ( 散点图 ) can be used to show the relationship between two variables  Correlation ( 相关 ) analysis is used to measure strength of the association (linear relationship) between two variables  Correlation is only concerned with strength of the relationship  No causal effect ( 因果效应 ) is implied with correlation

67  Scatter Diagrams are used to examine possible relationships between two numerical variables  The Scatter Diagram:  one variable is measured on the vertical axis and the other variable is measured on the horizontal axis Scatter Diagrams

68 Scatter Plots( 散点图 ) Restaurant Ratings: Mean Preference vs. Mean Taste Visualize the data to see patterns, especially “trends”

69 Introduction to Regression Analysis  Regression analysis is used to:  Predict the value of a dependent variable based on the value of at least one independent variable  Explain the impact of changes in an independent variable on the dependent variable Dependent variable: the variable we wish to predict or explain Independent variable: the variable used to explain the dependent variable

70 Simple Linear Regression Model  Only one independent variable, X  Relationship between X and Y is described by a linear function  Changes in Y are assumed to be caused by changes in X

71 Types of Relationships Y X Y X Y Y X X Linear relationshipsCurvilinear relationships

72 Types of Relationships Y X Y X Y Y X X Strong relationshipsWeak relationships (continued)

73 Types of Relationships Y X Y X No relationship (continued)

74 Linear component Simple Linear Regression Model Population Y intercept Population Slope Coefficient Random Error term Dependent Variable Independent Variable Random Error component

75 (continued) Random Error for this X i value Y X Observed Value of Y for X i Predicted Value of Y for X i XiXi Slope = β 1 Intercept = β 0 εiεi Simple Linear Regression Model

76 The simple linear regression equation provides an estimate of the population regression line Simple Linear Regression Equation (Prediction Line) Estimate of the regression intercept Estimate of the regression slope Estimated (or predicted) Y value for observation i Value of X for observation i The individual random error terms e i have a mean of zero

77 Least Squares Method ( 最小二乘方法 )  b 0 and b 1 are obtained by finding the values of b 0 and b 1 that minimize the sum of the squared differences between Y and :

78  b 0 is the estimated average value of Y when the value of X is zero  b 1 is the estimated change in the average value of Y as a result of a one-unit change in X Interpretation of the Slope( 斜率 ) and the Intercept( 截距 )

79 Example 11.1 Example 11.1 The House Price Case  A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)  A random sample of 10 houses is selected  Dependent variable (Y) = house price in $1000s  Independent variable (X) = square feet

80 Graphical Presentation  House price model: scatter plot

81 Graphical Presentation  House price model: scatter plot and regression line Slope = 0.10977 Intercept = 98.248

82 Interpretation of the Intercept, b 0  b 0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)  Here, no houses had 0 square feet, so b 0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet

83 Interpretation of the Slope Coefficient, b 1  b 1 measures the estimated change in the average value of Y as a result of a one- unit change in X  Here, b 1 =.10977 tells us that the average value of a house increases by.10977($1000) = $109.77, on average, for each additional one square foot of size

84 Predict the price for a house with 2000 square feet: The predicted price for a house with 2000 square feet is 317.85($1,000s) = $317,850 Predictions using Regression Analysis

85 Interpolation vs. Extrapolation  When using a regression model for prediction, only predict within the relevant range of data Relevant range for interpolation Do not try to extrapolate beyond the range of observed X’s

86 Estimation/prediction equation Least squares point estimate of the slope  1 The Least Square Point Estimates Least squares point estimate of the y-intercept  0

87 Model Assumptions 1.Mean of Zero At any given value of x, the population of potential error term values has a mean equal to zero 2.Constant Variance Assumption At any given value of x, the population of potential error term values has a variance that does not depend on the value of x 3.Normality Assumption At any given value of x, the population of potential error term values has a normal distribution 4.Independence Assumption Any one value of the error term  is statistically independent of any other value of 

88 Measures of Variation  Total variation is made up of two parts: Total Sum of Squares Regression Sum of Squares Error Sum of Squares where: = Average value of the dependent variable Y i = Observed values of the dependent variable i = Predicted value of Y for the given X i value

89  SST = total sum of squares  Measures the variation of the Y i values around their mean Y  SSR = regression sum of squares  Explained variation attributable to the relationship between X and Y  SSE = error sum of squares  Variation attributable to factors other than the relationship between X and Y (continued) Measures of Variation

90 (continued) XiXi Y X YiYi SST =  (Y i - Y) 2 SSE =  (Y i - Y i ) 2  SSR =  (Y i - Y) 2  _ _ _ Y  Y Y _ Y  Measures of Variation

91  The coefficient of determination ( 决定系数 ) is the portion of the total variation in the dependent variable that is explained by variation in the independent variable  The coefficient of determination is also called r-squared and is denoted as r 2 Coefficient of Determination, r 2 note:

92 r 2 = 1 Examples of Approximate r 2 Values Y X Y X r 2 = 1 Perfect linear relationship between X and Y: 100% of the variation in Y is explained by variation in X

93 Examples of Approximate r 2 Values Y X Y X 0 < r 2 < 1 Weaker linear relationships between X and Y: Some but not all of the variation in Y is explained by variation in X

94 Examples of Approximate r 2 Values r 2 = 0 No linear relationship between X and Y: The value of Y does not depend on X. (None of the variation in Y is explained by variation in X) Y X r 2 = 0

95 The Simple Correlation Coefficient ( 简单相关系数 ) The simple correlation coefficient measures the strength of the linear relationship between y and x and is denoted by r Where, b 1 is the slope of the least squares line r can also be calculated using the formula

96 Different Values of the Correlation Coefficient

97  t test for a population slope  Is there a linear relationship between X and Y?  Null and alternative hypotheses H 0 : β 1 = 0(no linear relationship) H a : β 1  0(linear relationship does exist)  Test statistic Inference about the Slope: t Test where: b 1 = regression slope coefficient β 1 = hypothesized slope s b = standard error of the slope 1

98 Example 11.2 Example 11.2 The House Price Case House Price in $1000s (y) Square Feet (x) 2451400 3121600 2791700 3081875 1991100 2191550 4052350 3242450 3191425 2551700 Simple Linear Regression Equation: The slope of this model is 0.1098 Does square footage of the house affect its sales price?

99 H 0 : β 1 = 0 H a : β 1  0 The House Price Case # 2 Reject H 0  /2=.025 -t α/2 Do not reject H 0 0 t α/2  /2=.025 -2.30602.3060 3.329 n=10 d.f. = 10-2 = 8 There is sufficient evidence that square footage affects house price

100 Confidence Interval Estimate for the Slope Confidence Interval Estimate of the Slope: At 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858) d.f. = n - 2 Since the units of the house price variable is $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size This 95% confidence interval does not include 0. Conclusion: There is a significant relationship between house price and square feet at the.05 level of significance

101 Chapter Summary  Introduced types of regression models  Reviewed assumptions of regression and correlation  Discussed determining the simple linear regression equation  Described measures of variation  Described inference about the slope  Discussed correlation -- measuring the strength of the association  Addressed estimation of mean values and prediction of individual values


Download ppt "Chapter 9 Statistical Inferences Based on Two Samples Business Statistics in Practice."

Similar presentations


Ads by Google