Chapter 14 – Correlation and Simple Regression
Introduction to Business Statistics, 6e (Kvanli, Pavur, Keeling)
Slides prepared by Jeff Heyl, Lincoln University
©2003 South-Western/Thomson Learning™
Bivariate Data
Figure 14.1: Scatter diagram of square footage (hundreds, Y) versus income (thousands, X), with points A and B highlighted.
Coefficient of Correlation
The sample coefficient of correlation, r, measures the strength of the linear relationship that exists within a sample of n bivariate data:

r = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2 \sum(y - \bar{y})^2}} = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{\left[\sum x^2 - (\sum x)^2/n\right]\left[\sum y^2 - (\sum y)^2/n\right]}}
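As a quick numeric illustration, a minimal Python sketch of the computational formula above; the x and y lists are made-up sample data, not the textbook's real estate values.

```python
import math

# Hypothetical bivariate sample (not the textbook data)
x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# r = [Sxy - (Sx)(Sy)/n] / sqrt([Sx2 - (Sx)^2/n][Sy2 - (Sy)^2/n])
scp_xy = sum_xy - sum_x * sum_y / n
ssx = sum_x2 - sum_x**2 / n
ssy = sum_y2 - sum_y**2 / n
r = scp_xy / math.sqrt(ssx * ssy)
print(f"r = {r:.4f}")
```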
Coefficient of Correlation Properties
- r ranges from -1.0 to 1.0
- The larger |r| is, the stronger the linear relationship
- r near zero indicates that there is no linear relationship; X and Y are uncorrelated
- r = 1 or -1 implies that a perfect linear pattern exists between the two variables
- Values of r = 0, 1, or -1 are rare in practice
Coefficient of Correlation Properties
- The sign of r tells you whether the relationship between X and Y is positive (direct) or negative (inverse)
- The value of r tells you very little about the steepness of the slope of the line, apart from its sign: if r is positive the slope is positive, and if r is negative the slope is negative
Various Values of r
Figure 14.2: Scatter plots illustrating various values of r: (A) r = 0, (B) r = 1, (C) r = -1, (D) r = .9, (E) r = -.8, (F) r = .5.
Scatter Diagrams - Same r
Figure 14.3: Scatter diagrams having the same value of r.
Scatter Diagram and Correlation Coefficient
Figure 14.4
Covariance
The sample covariance between two variables, cov(X, Y), is a measure of the joint variation of the two variables X and Y and is defined to be

cov(X, Y) = \frac{1}{n-1}\sum(x - \bar{x})(y - \bar{y}) = \frac{SCP_{XY}}{n - 1}

The sample correlation between X and Y is then

r = \frac{cov(X, Y)}{s_X s_Y}
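A minimal Python check of this identity, again on made-up data; the statistics module's stdev uses the n - 1 divisor, matching the definitions above.

```python
import statistics

x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)

# cov(X, Y) = SCPxy / (n - 1)
cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

# r = cov(X, Y) / (sX * sY); statistics.stdev uses the n - 1 divisor
r = cov_xy / (statistics.stdev(x) * statistics.stdev(y))
print(f"cov(X, Y) = {cov_xy:.4f}, r = {r:.4f}")
```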
Least Squares Line
The least squares line is the line through the data that minimizes the sum of the squared vertical distances between the observations and the line:

\sum d^2 = d_1^2 + d_2^2 + d_3^2 + \dots + d_n^2

b_1 = \frac{SCP_{XY}}{SSX} \qquad b_0 = \bar{y} - b_1\bar{x}
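A minimal sketch of these two formulas in Python, using the same illustrative data as before.

```python
x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)

b1 = scp_xy / ssx        # slope
b0 = y_bar - b1 * x_bar  # intercept
print(f"least squares line: y-hat = {b0:.3f} + {b1:.3f}x")
```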
Vertical Distances
Figure 14.5: The vertical distances d1 through d10 from each observation (income X, square footage Y) to a candidate line L.
Least Squares Line
Figure 14.6: The least squares line Ŷ = b0 + b1X through the income (X) and square footage (Y) data, with the fitted value Ŷ shown at X = 50.
Sum of Squares of Error

SSE = \sum d^2 = \sum(y - \hat{y})^2 = SSY - \frac{(SCP_{XY})^2}{SSX}
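A short sketch verifying the shortcut formula against the direct residual sum, on the same made-up data.

```python
x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)
ssy = sum((b - y_bar) ** 2 for b in y)
b1 = scp_xy / ssx
b0 = y_bar - b1 * x_bar

# Direct definition: sum of squared vertical distances to the fitted line
sse_direct = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
# Shortcut: SSE = SSY - (SCPxy)^2 / SSX
sse_shortcut = ssy - scp_xy**2 / ssx
print(f"{sse_direct:.6f} == {sse_shortcut:.6f}")
```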
Least Squares Line for Real Estate Data
Figure 14.7: The least squares line Ŷ = b0 + b1X fit to the real estate data, with the values 20 and 22.67 marked at X = 50.
Assumptions for the Simple Regression Model
- The mean of each error component is zero
- Each error component (random variable) follows an approximate normal distribution
- The variance of the error component is the same for each value of X
- The errors are independent of each other
Assumption 1 for the Simple Regression Model
Figure 14.8: The model Y = β0 + β1X + e, with the line µY = β0 + β1X passing through the conditional means µY|35 and µY|50 at incomes X = 35 and X = 50.
Violation of Assumption 3
Figure 14.9: The spread of the error e about the line Y = β0 + β1X increases with X (shown at X = 35, 50, and 60), violating the equal-variance assumption.
Assumptions 1, 2, 3 for the Simple Regression Model
Figure 14.10: Normal error distributions centered on the regression line with equal variances at X = 35, 50, and 60.
Estimating the Error Variance, σe²

s^2 = \hat{\sigma}_e^2 = \text{estimate of } \sigma_e^2 = \frac{SSE}{n - 2}

where

SSE = \sum(y - \hat{y})^2 = SSY - \frac{(SCP_{XY})^2}{SSX}
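A self-contained sketch of this estimate on the running illustrative data; two degrees of freedom are lost because b0 and b1 were estimated from the sample.

```python
import math

x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)
ssy = sum((b - y_bar) ** 2 for b in y)

sse = ssy - scp_xy**2 / ssx
s2 = sse / (n - 2)   # estimate of the error variance (n - 2 df)
s = math.sqrt(s2)    # standard error of estimate
print(f"s^2 = {s2:.4f}, s = {s:.4f}")
```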
Three Possible Populations
Figure 14.11: Three possible populations: (A) β1 < 0, (B) β1 = 0, (C) β1 > 0.
Hypothesis Test on the Slope of the Regression Line
Two-Tailed Test
H0: β1 = 0 (X provides no information)
Ha: β1 ≠ 0 (X does provide information)

Test statistic: t = \frac{b_1}{s_{b_1}}

Reject H0 if |t| > t_{\alpha/2, n-2}
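A sketch of the two-tailed test in Python. The slides leave s_{b1} implicit; the code uses the standard simple-regression result s_{b1} = s/√SSX, and scipy is assumed to be available for the critical value.

```python
import math
from scipy.stats import t  # assumed available for the critical value

x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n, alpha = len(x), 0.05
x_bar, y_bar = sum(x) / n, sum(y) / n
scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)
ssy = sum((b - y_bar) ** 2 for b in y)

b1 = scp_xy / ssx
s = math.sqrt((ssy - scp_xy**2 / ssx) / (n - 2))
s_b1 = s / math.sqrt(ssx)             # standard formula for simple regression
t_stat = b1 / s_b1
t_crit = t.ppf(1 - alpha / 2, n - 2)  # t_{alpha/2, n-2}
print(f"t = {t_stat:.3f}, reject H0: {abs(t_stat) > t_crit}")
```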
Hypothesis Test on the Slope of the Regression Line
One-Tailed Tests
H0: β1 ≤ 0 versus Ha: β1 > 0: reject H0 if t > t_{\alpha, n-2}
H0: β1 ≥ 0 versus Ha: β1 < 0: reject H0 if t < -t_{\alpha, n-2}

Test statistic: t = \frac{b_1}{s_{b_1}}
t Curve with 8 df
Figure 14.12: t curve with 8 df; the rejection region lies to the right of t = 1.860.
Real Estate Example
Figures 14.13, 14.14, and 14.15.
Scatter Diagram
Figure 14.16: Liquid assets (% of annual income, Y) versus age (X), with the fitted least squares line Ŷ.
Confidence Interval for β1
The (1 - α) · 100% confidence interval for β1 is

b_1 - t_{\alpha/2, n-2}\, s_{b_1} \quad \text{to} \quad b_1 + t_{\alpha/2, n-2}\, s_{b_1}
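A short sketch of this interval, again assuming the standard s_{b1} = s/√SSX and scipy for the t value.

```python
import math
from scipy.stats import t  # assumed available

x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n, alpha = len(x), 0.05
x_bar, y_bar = sum(x) / n, sum(y) / n
scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)
ssy = sum((b - y_bar) ** 2 for b in y)

b1 = scp_xy / ssx
s = math.sqrt((ssy - scp_xy**2 / ssx) / (n - 2))
s_b1 = s / math.sqrt(ssx)
margin = t.ppf(1 - alpha / 2, n - 2) * s_b1
print(f"95% CI for beta1: ({b1 - margin:.4f}, {b1 + margin:.4f})")
```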
Curvilinear Relationship
Figure 14.17: A curvilinear relationship between X and Y.
Measuring the Strength of the Model

r = \frac{SCP_{XY}}{\sqrt{SSX \cdot SSY}}

H0: ρ = 0 (no linear relationship exists between X and Y)
Ha: ρ ≠ 0 (a linear relationship does exist)

t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}
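This t statistic needs only r and n; a minimal sketch, where r and n are arbitrary illustrative values rather than textbook data.

```python
import math
from scipy.stats import t  # assumed available for the critical value

r, n, alpha = 0.87, 10, 0.05  # illustrative values, not textbook data
t_stat = r / math.sqrt((1 - r**2) / (n - 2))
t_crit = t.ppf(1 - alpha / 2, n - 2)
print(f"t = {t_stat:.3f}, reject H0 (rho = 0): {abs(t_stat) > t_crit}")
```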
Danger of Assuming Causality
A high statistical correlation does not imply causality. There are many situations in which variables are highly correlated because a factor not being studied affects the variables being studied.
Coefficient of Determination

r^2 = \frac{(SCP_{XY})^2}{SSX \cdot SSY} = 1 - \frac{SSE}{SSY}

where SSE = SSY - (SCP_{XY})^2 / SSX. The coefficient of determination, r², is the percentage of explained variation in the dependent variable using the simple linear regression model.
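A small check that the two expressions for r² agree, on the running made-up data.

```python
x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)
ssy = sum((b - y_bar) ** 2 for b in y)

sse = ssy - scp_xy**2 / ssx
r2_from_sums = scp_xy**2 / (ssx * ssy)  # (SCPxy)^2 / (SSX * SSY)
r2_from_sse = 1 - sse / ssy             # 1 - SSE / SSY
print(f"r^2 = {r2_from_sums:.4f} = {r2_from_sse:.4f}")
```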
Total Variation, SSY

SSY = SSR + SSE, \qquad SSR = \frac{(SCP_{XY})^2}{SSX}

Figure 14.18: The total variation y - ȳ for a sample point (x, y) about the least squares line Ŷ = b0 + b1X.
Estimation and Prediction Using the Simple Linear Model
The least squares line can be used to estimate average values or predict individual values
Confidence Interval for µY|x
The (1 - α) · 100% confidence interval for µY|x is

\hat{Y} - t_{\alpha/2, n-2}\, s_{\hat{Y}} \quad \text{to} \quad \hat{Y} + t_{\alpha/2, n-2}\, s_{\hat{Y}}

where

s_{\hat{Y}} = s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SSX}}
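A sketch of this interval at a chosen x0; the data and x0 = 50 are illustrative, and scipy supplies the t value.

```python
import math
from scipy.stats import t  # assumed available

x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n, alpha, x0 = len(x), 0.05, 50
x_bar, y_bar = sum(x) / n, sum(y) / n
scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)
ssy = sum((b - y_bar) ** 2 for b in y)

b1 = scp_xy / ssx
b0 = y_bar - b1 * x_bar
y_hat = b0 + b1 * x0
s = math.sqrt((ssy - scp_xy**2 / ssx) / (n - 2))
s_yhat = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ssx)
margin = t.ppf(1 - alpha / 2, n - 2) * s_yhat
print(f"95% CI for mean Y at x0={x0}: ({y_hat - margin:.2f}, {y_hat + margin:.2f})")
```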
Confidence and Prediction Intervals
Figure 14.19
95% Confidence Intervals
Figure 14.20: Upper and lower 95% confidence limits for the mean of Y about the least squares line Ŷ; at x̄ = 49.8 the limits are 12.33 and 20.27.
Prediction Interval for Y|x
The (1 - α) · 100% prediction interval for an individual Y at x = x0 is

\hat{Y} - t_{\alpha/2, n-2}\, s_Y \quad \text{to} \quad \hat{Y} + t_{\alpha/2, n-2}\, s_Y

where

s_Y = s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SSX}}
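The same sketch as for the confidence interval, with the extra 1 under the square root widening the interval for an individual prediction.

```python
import math
from scipy.stats import t  # assumed available

x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 30]
n, alpha, x0 = len(x), 0.05, 50
x_bar, y_bar = sum(x) / n, sum(y) / n
scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)
ssy = sum((b - y_bar) ** 2 for b in y)

b1 = scp_xy / ssx
y_hat = (y_bar - b1 * x_bar) + b1 * x0
s = math.sqrt((ssy - scp_xy**2 / ssx) / (n - 2))
# The leading 1 accounts for the variation of an individual Y about its mean
s_y = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ssx)
margin = t.ppf(1 - alpha / 2, n - 2) * s_y
print(f"95% PI for Y at x0={x0}: ({y_hat - margin:.2f}, {y_hat + margin:.2f})")
```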
95% Confidence and Prediction Intervals
Figure 14.21: At x̄ = 49.8 the prediction interval limits (8.17 and 24.43) lie outside the confidence interval limits (12.33 and 20.27).
Checking Model Assumptions
- The errors are normally distributed with a mean of zero
- The variance of the errors remains constant (for example, you should not observe larger errors associated with larger values of X)
- The errors are independent
Examination of Residuals
Figure 14.22: Plot of the residuals Y - Ŷ (panel B).
Examination of Residuals
Figure 14.23: Residuals Y - Ŷ plotted against time (1992 through 2001).
Checking for Outliers
Figure 14.24
Identifying Outlying Values
Outlying sample values can be found by calculating the sample leverage:

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SSX}, \qquad SSX = \sum x^2 - \frac{(\sum x)^2}{n}

A sample is considered an outlier if its leverage is greater than 4/n or 6/n.
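A sketch that flags high-leverage points with the 4/n rule; the data are illustrative, and the last x value is deliberately far from the rest.

```python
x = [24, 30, 36, 40, 52, 59, 64, 120]  # 120 is a deliberately extreme x
n = len(x)
x_bar = sum(x) / n
ssx = sum((a - x_bar) ** 2 for a in x)

for i, xi in enumerate(x):
    h_i = 1 / n + (xi - x_bar) ** 2 / ssx  # leverage of observation i
    if h_i > 4 / n:                        # slide's rule of thumb
        print(f"observation {i}: leverage {h_i:.3f} > {4/n:.3f} (outlying x)")
```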
Real Estate Example
Figures 14.25(a) and 14.25(b).
Identifying Outlying Values
Unusually large or small values of the dependent variable (Y) can generally be detected using the sample standardized residuals. The estimated standard deviation of the ith residual is s\sqrt{1 - h_i}, so

\text{standardized residual} = \frac{Y_i - \hat{Y}_i}{s\sqrt{1 - h_i}}

An observation is thought to have an outlying value of Y if its standardized residual is greater than 2 or less than -2.
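A sketch computing standardized residuals on made-up data; the s√(1 - h_i) denominator is the standard formula assumed above, and the last y value is deliberately unusual.

```python
import math

x = [24, 30, 36, 40, 52, 59, 64, 71]
y = [9, 13, 14, 17, 20, 24, 26, 45]  # last y made deliberately unusual
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)
ssy = sum((b - y_bar) ** 2 for b in y)
b1 = scp_xy / ssx
b0 = y_bar - b1 * x_bar
s = math.sqrt((ssy - scp_xy**2 / ssx) / (n - 2))

for i, (xi, yi) in enumerate(zip(x, y)):
    h_i = 1 / n + (xi - x_bar) ** 2 / ssx
    std_res = (yi - (b0 + b1 * xi)) / (s * math.sqrt(1 - h_i))
    if abs(std_res) > 2:  # slide's rule of thumb
        print(f"observation {i}: standardized residual {std_res:.2f} (outlying y)")
```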
Identifying Influential Observations
Cook's distance measure:

D_i = \frac{(\text{standardized residual})^2}{2} \cdot \frac{h_i}{1 - h_i}

You may conclude the ith observation is influential if the corresponding Di measure is greater than .8.
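A sketch combining the pieces above; the Cook's distance line implements the formula as reconstructed here (standardized residual including the √(1 - h_i) term, with p = 2 estimated parameters), and the .8 cutoff comes from the slide. The extreme last observation is illustrative.

```python
import math

x = [24, 30, 36, 40, 52, 59, 64, 120]  # illustrative data with one extreme x
y = [9, 13, 14, 17, 20, 24, 26, 70]    # and an unusual y for that point
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
scp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ssx = sum((a - x_bar) ** 2 for a in x)
ssy = sum((b - y_bar) ** 2 for b in y)
b1 = scp_xy / ssx
b0 = y_bar - b1 * x_bar
s = math.sqrt((ssy - scp_xy**2 / ssx) / (n - 2))

for i, (xi, yi) in enumerate(zip(x, y)):
    h_i = 1 / n + (xi - x_bar) ** 2 / ssx
    std_res = (yi - (b0 + b1 * xi)) / (s * math.sqrt(1 - h_i))
    d_i = std_res**2 / 2 * h_i / (1 - h_i)  # Cook's distance, 2 parameters
    if d_i > 0.8:                           # slide's influence cutoff
        print(f"observation {i}: D = {d_i:.2f} (influential)")
```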
Leverages, Standardized Residuals, and Cook’s Distance Measures
Figure 14.26
Engine Capacity and MPG
Figures 14.27, 14.28, 14.29, and 14.30.
Engine Capacity and MPG
Figure 14.31: Frequency histogram of the residuals, with class limits running from -8 to 5.5 in intervals of width 1.5.