Linear Regression and Correlation, Chapter 13
©The McGraw-Hill Companies, Inc. 2008, McGraw-Hill/Irwin
GOALS
– Understand and interpret the terms dependent and independent variable.
– Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate.
– Conduct a test of hypothesis to determine whether the coefficient of correlation in the population is zero.
– Calculate the least squares regression line.
– Construct and interpret confidence and prediction intervals for the dependent variable.
Regression Analysis - Introduction
Recall from Chapter 4 the idea of showing the relationship between two variables with a scatter diagram. There we showed that, as the age of the buyer increased, the amount spent on the vehicle also increased. In this chapter we carry the idea further: we develop numerical measures to express the strength of the relationship between two variables, and we use an equation to express the relationship between the variables, allowing us to estimate one variable on the basis of another.
Regression Analysis - Uses
Some examples:
– Is there a relationship between a firm's advertising spending in a month and its sales in that month?
– Can we estimate the January heating cost of a home based on the size of the home?
– Is there a relationship between the miles per gallon achieved by cars and the size of the engine?
– Is there a relationship between the number of hours that students studied for an exam and the score earned?
Correlation Analysis
Correlation analysis is the study of the relationship between variables.
– It is a group of techniques that measure the association between two variables.
– A scatter diagram is a chart that portrays the relationship between the two variables. It is the usual first step in correlation analysis.
In contrast, regression analysis distinguishes the dependent variable from the independent variable(s).
– The dependent variable is the variable being predicted or estimated.
– The independent variable(s) provide the basis for estimation (also called explanatory or predictor variables).
Correlation & Regression Analysis: Example
The sales manager of Copier Sales wants to know whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month. The manager selects a random sample of 10 sales representatives and records the number of sales calls each representative made last month and the number of copiers sold by that person.

Scatter Diagram
The Coefficient of Correlation, r
The coefficient of correlation is related to the concept of covariance. The coefficient of correlation (r) is a measure of the direction and strength of the linear relationship between two interval- or ratio-scaled variables.
– It can range from -1.00 to 1.00. Values of -1.00 or 1.00 indicate perfect correlation; values close to 0.0 indicate weak correlation.
– Negative values indicate an inverse relationship and positive values indicate a direct relationship.
– It does not, by itself, establish causality between the variables: there can be spurious correlations.
Perfect Correlation

Correlation Coefficient - Interpretation

Correlation Coefficient - Formula
Correlation Coefficient - Example
Using the Copier Sales data for which a scatter plot was developed earlier, compute the correlation coefficient and the coefficient of determination.
Correlation Coefficient - Example
How do we interpret the correlation coefficient of 0.759? First, it is positive, so there is a direct relationship between the number of sales calls and the number of copiers sold. Second, 0.759 is fairly close to 1.00, so we conclude that the association is strong. However, does this mean that more sales calls cause more sales? No. We have not demonstrated cause and effect here, only that the two variables, sales calls and copiers sold, are related.
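To make the computation concrete, here is a minimal Python sketch of the correlation coefficient. The ten (calls, copiers sold) pairs are an assumption: the slides never print the data table, so these are the sample values commonly given with this textbook example; substitute your own data if your edition differs.

```python
import math

# Assumed sample: 10 (sales calls, copiers sold) pairs from the
# textbook's Copier Sales example (not printed on the slides).
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

def correlation(x, y):
    """Pearson coefficient of correlation r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

r = correlation(calls, sold)
print(round(r, 3))   # for these assumed data, matches the slides' 0.759
```

For these values the result agrees with the 0.759 quoted on the slides, which is some evidence the assumed data are the textbook's.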
Correlation Coefficient - Excel Example
Coefficient of Determination
The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained, or accounted for, by the variation in the independent variable (X).
– It is the square of the coefficient of correlation, so its value ranges from 0 to 1.
– This concept is discussed further later in the chapter.
Example (previous case): the coefficient of determination, r², is 0.576, found by (0.759)². We can say that 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.
Testing the Significance of the Correlation Coefficient
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Reject H0 if t > t(α/2, n-2) or t < -t(α/2, n-2).

Testing the Significance of the Correlation Coefficient - Example
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Reject H0 if t > t(0.025, 8) or t < -t(0.025, 8), that is, if t > 2.306 or t < -2.306.
Testing the Significance of the Correlation Coefficient - Example
The computed t (3.297) falls in the rejection region, so we reject H0. This means the correlation in the population is not zero: there is correlation between the number of sales calls and the number of copiers sold in the population of salespeople.
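The test statistic can be reproduced in a few lines. The values of r, n, and the critical value 2.306 (t for 8 degrees of freedom at the 5 percent two-tailed level) all come from the worked example on the slides.

```python
import math

# Values from the worked example on the slides
r, n = 0.759, 10
t_crit = 2.306                        # t(0.025, 8) from the t table

# t statistic for H0: rho = 0, with n-2 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
reject_h0 = (t > t_crit) or (t < -t_crit)
print(round(t, 3), reject_h0)         # 3.297 True
```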
Regression Analysis
In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
– The relationship between the variables is assumed to be linear.
– Both variables must be at least interval scale.
– The least squares criterion is used to determine the equation.

Illustration of the Least Squares Regression Principle

Linear Regression Model
Regression Analysis - Least Squares Principle
The least squares principle is used to obtain a and b.
– The regression line is the best-fitting line because the sum of the squares of the vertical deviations from the line is minimized.
Computing the Slope of the Line

Computing the Y-Intercept
Regression Equation - Example
Recall the Copier Sales example. The sales manager gathered information on the number of sales calls made and the number of copiers sold for a random sample of 10 sales representatives. Use the least squares method to determine a linear equation to express the relationship between the two variables. What is the expected number of copiers sold by a representative who made 20 calls?
Finding the Regression Equation - Example

Computing the Estimates of Y
Using the regression equation, substitute each value of X to get the corresponding estimated value of Y.
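The slope, intercept, and the prediction for 20 calls can be sketched as below. As before, the data pairs are an assumption (the sample commonly given with this textbook example); the computed b and a match the 1.1842 and roughly 18.95 quoted on the following slide.

```python
# Least squares fit for the Copier Sales example. The data pairs are
# assumed (not printed on the slides); replace with your own if needed.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X: sales calls
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]   # Y: copiers sold

n = len(calls)
x_bar, y_bar = sum(calls) / n, sum(sold) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold))
s_xx = sum((x - x_bar) ** 2 for x in calls)

b = s_xy / s_xx           # slope: extra copiers per extra call
a = y_bar - b * x_bar     # intercept: Y value where the line meets X = 0

y_hat_20 = a + b * 20     # expected copiers sold by a rep making 20 calls
print(round(b, 4), round(a, 2), round(y_hat_20, 2))
```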
Interpreting b and a (example)
The b value of 1.1842 means that, for each additional sales call made, the number of copiers sold is expected to increase by 1.1842.
The a value of 18.9476 is the point where the regression line crosses the Y-axis.
– If 0 sales calls were made, 18.9476 copiers would be sold (if the regression line were extended to an X value of 0).
Plotting the Estimated and the Actual Y's

Regression Line and Residuals
The regression line always passes through the point of sample means, (X̄, Ȳ).
– Ȳ = a + bX̄.
The difference between the actual value (Y) and the estimated value (Ŷ) is called the residual (the error).
– Residual (e) = Y - Ŷ.
– The least squares method minimizes the sum of the squared residuals: no other straight line has a smaller sum of squared residuals than the least squares regression line.
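Both properties above can be checked numerically: the fitted line passes through (X̄, Ȳ), and the least squares residuals sum to zero. A sketch, again using the assumed textbook sample:

```python
# Assumed Copier Sales sample (see earlier note)
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
x_bar, y_bar = sum(calls) / n, sum(sold) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold))
     / sum((x - x_bar) ** 2 for x in calls))
a = y_bar - b * x_bar

# The line passes through the point of means: Y bar = a + b * X bar
through_means = abs(y_bar - (a + b * x_bar)) < 1e-9

# The residuals e = Y - Y hat sum to zero (up to float rounding)
residuals = [y - (a + b * x) for x, y in zip(calls, sold)]
residuals_sum_zero = abs(sum(residuals)) < 1e-9

print(through_means, residuals_sum_zero)   # True True
```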
The Standard Error of Estimate
The standard error of estimate measures the dispersion, or scatter, of the observed values around the line of regression.
– It measures how precisely Y can be estimated from X.
– A small value of s(y.x) means that the estimated regression equation provides a precise estimate of Y.
The formulas used to compute the standard error:
Standard Error of the Estimate - Example
Recall the Copier Sales example. The sales manager determined the least squares regression equation given below. Determine the standard error of estimate as a measure of how well the values fit the regression line.
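The standard error of estimate is the square root of the residual sum of squares divided by n - 2. A sketch with the assumed sample data (the result matches the 9.901 used later in the deck's Excel output):

```python
import math

# Assumed Copier Sales sample (see earlier note)
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
x_bar, y_bar = sum(calls) / n, sum(sold) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold))
     / sum((x - x_bar) ** 2 for x in calls))
a = y_bar - b * x_bar

# s(y.x) = sqrt( sum (Y - Y hat)^2 / (n - 2) )
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sold))
s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 3))   # 9.901 copiers
```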
Graphical Illustration of the Differences between Actual Y and Estimated Y

Standard Error of the Estimate - Excel
Regression as a Descriptive Tool vs. as a Tool of Statistical Inference
Suppose our data are a sample taken from a population. We model the linear relationship in the population by the equation
– Y = α + βX
We then estimate the linear relationship in the population using the regression method.
– a and b are estimates of the population parameters α and β, respectively, derived from a sample taken from the population.
Assumptions Underlying Linear Regression
– For each value of X, there is a group of Y values, and these Y values are normally distributed.
– The means of these normal distributions of Y values all lie on the regression line.
– The standard deviations of these normal distributions are equal.
– The Y values are statistically independent: the Y values observed for a particular X value do not depend on the Y values for any other X value.
Assumptions Underlying Linear Regression
Recall the following characteristics of the normal distribution.
– Ŷ ± s(y.x) will include the middle 68% of the observations.
– Ŷ ± 2s(y.x) will include the middle 95% of the observations.
– Check this with the Copier Sales example: compare the size of each residual (error) with the standard error in Table 13-4.
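The 68/95 check suggested above can be done directly: count how many residuals fall within one and within two standard errors of zero. Using the same assumed sample data, 7 of 10 residuals (70%, close to the nominal 68%) fall within one standard error and all 10 fall within two.

```python
import math

# Assumed Copier Sales sample (see earlier note)
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
x_bar, y_bar = sum(calls) / n, sum(sold) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold))
     / sum((x - x_bar) ** 2 for x in calls))
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(calls, sold)]
s_yx = math.sqrt(sum(e * e for e in residuals) / (n - 2))

# Empirical-rule check on the residuals
within_1s = sum(abs(e) <= s_yx for e in residuals)
within_2s = sum(abs(e) <= 2 * s_yx for e in residuals)
print(within_1s, within_2s)   # 7 10
```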
Confidence Interval and Prediction Interval Estimates of Y
The standard error of estimate is used to establish interval estimates of Y corresponding to any X.
– A confidence interval reports the mean value of Y for a given X.
– A prediction interval reports the range of values of Y for a particular value of X.
The formulas for the confidence interval and the prediction interval are given as follows.

Confidence Interval and Prediction Interval Estimates of Y
– Ŷ is the estimated value for any selected value of X.
– X is any selected value of X.
– t is the value of the t distribution with n-2 degrees of freedom, corresponding to the chosen confidence level.
– s(y.x) is the standard error of estimate.
Confidence Interval Estimate - Example
We return to the Copier Sales of America example. Determine a 95 percent confidence interval for all sales representatives who make 25 calls.

Step 1 - Compute the point estimate of Y
In other words, determine the number of copiers we expect a sales representative to sell if he or she makes 25 calls.

Step 2 - Find the value of t
To find the t value, we first need the degrees of freedom. In this case the degrees of freedom are n - 2 = 10 - 2 = 8. We set the confidence level at 95 percent. Using the table for the t distribution in Appendix B.2, for 8 degrees of freedom and the 95 percent level of confidence, the value of t is 2.306.
Confidence Interval Estimate - Example
Step 4 - Use the formula above, substituting the numbers computed in the previous slides.
Thus, the 95 percent confidence interval for the average sales of all sales representatives who make 25 calls is from 40.9170 up to 56.1882 copiers.
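The interval above can be sketched end to end in Python. The data pairs are the assumed textbook sample (see the earlier note), and the critical value 2.306 comes from the slides; the resulting interval reproduces the slide's 40.9 to 56.2 copiers.

```python
import math

# Assumed Copier Sales sample (see earlier note); estimate at X = 25 calls
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
x_bar, y_bar = sum(calls) / n, sum(sold) / n
s_xx = sum((x - x_bar) ** 2 for x in calls)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold)) / s_xx
a = y_bar - b * x_bar

x_new = 25
y_hat = a + b * x_new                              # Step 1: point estimate
t = 2.306                                          # Step 2: t(0.025, 8)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sold))
s_yx = math.sqrt(sse / (n - 2))                    # Step 3: standard error

# Confidence interval for the MEAN of Y at X = 25
half = t * s_yx * math.sqrt(1 / n + (x_new - x_bar) ** 2 / s_xx)
lo_ci, hi_ci = y_hat - half, y_hat + half
print(round(lo_ci, 1), round(hi_ci, 1))            # about 40.9 to 56.2
```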
Prediction Interval Estimate - Example
We return to the Copier Sales of America example. Determine a 95 percent prediction interval for Sheila Baker, a particular sales representative who made 25 calls.

Following the same steps, and using the information computed earlier in the confidence interval case, apply the formula above. If Sheila Baker makes 25 sales calls, the number of copiers she will sell will be between about 24 and 73 copiers.
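The prediction interval differs from the confidence interval only by the extra "1 +" under the square root, which accounts for the scatter of an individual observation around the line. A sketch with the same assumed sample data, reproducing the slide's "about 24 and 73":

```python
import math

# Assumed Copier Sales sample (see earlier note); predict at X = 25 calls
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
x_bar, y_bar = sum(calls) / n, sum(sold) / n
s_xx = sum((x - x_bar) ** 2 for x in calls)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold)) / s_xx
a = y_bar - b * x_bar

x_new = 25
y_hat = a + b * x_new
t = 2.306                                          # t(0.025, 8)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sold))
s_yx = math.sqrt(sse / (n - 2))

# Prediction interval for an INDIVIDUAL Y at X = 25: note the leading 1
half = t * s_yx * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / s_xx)
lo_pi, hi_pi = y_hat - half, y_hat + half
print(round(lo_pi), round(hi_pi))                  # about 24 to 73
```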
More on the Coefficient of Determination
The sum of squared deviations Σ(Y - Ŷ)² is interpreted as the "unexplained" variation in Y.
– We use the previous Copier Sales example.
More on the Coefficient of Determination
Suppose only the Y values are known, and we want to predict the sales of every representative.
– To make a prediction, we could assign the sample mean (Ȳ = 45) to each sales representative.
– The sum of squared deviations from the sample mean, Σ(Y - Ȳ)², is 1850 in the example. This is smaller than the sum of squared deviations from any other single value, such as the median.
– This quantity is interpreted as the total variation in Y.
– Explained variation = Total variation - Unexplained variation.
– r² = (explained variation)/(total variation) = 0.576.
The Relationships among r, r², and s(y.x)
The standard error of estimate, s(y.x), measures how close the actual data lie to the regression line.
– It is calculated from Σ(Y - Ŷ)².
The correlation coefficient (r) and the coefficient of determination (r²) measure the strength of the linear association between X and Y.
– They are also closely related to Σ(Y - Ŷ)².
An ANOVA table can show the relationship among the three measures.
– The total variation Σ(Y - Ȳ)² is divided into two components: (1) the variation explained by the regression (i.e., by the independent variable) and (2) the error (unexplained) variation.
The Relationships among r, r², and s(y.x)
SSR (regression sum of squares) = Σ(Ŷ - Ȳ)²
– degrees of freedom: 1 (the number of explanatory variables)
SSE (error sum of squares) = Σ(Y - Ŷ)²
– degrees of freedom: n - 2
SST (total sum of squares) = Σ(Y - Ȳ)² = SSR + SSE
– degrees of freedom: n - 1
Coefficient of determination: r² = SSR/SST = 1 - SSE/SST.
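The decomposition SST = SSR + SSE, and r² = SSR/SST, can be verified directly. With the assumed sample data, SST comes out to 1850 (matching the earlier slide) and r² to 0.576.

```python
# ANOVA decomposition for the Copier Sales example; the data pairs
# are assumed (see earlier note).
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
x_bar, y_bar = sum(calls) / n, sum(sold) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold))
     / sum((x - x_bar) ** 2 for x in calls))
a = y_bar - b * x_bar
y_hats = [a + b * x for x in calls]

ssr = sum((yh - y_bar) ** 2 for yh in y_hats)      # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(sold, y_hats))  # unexplained
sst = sum((y - y_bar) ** 2 for y in sold)          # total variation

r_squared = ssr / sst
print(round(ssr, 1), round(sse, 1), round(sst, 1), round(r_squared, 3))
```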
Transforming Data
The coefficient of correlation describes the strength of the linear relationship between two variables.
– It may be that two variables are closely related, but their relationship is not linear.
– Be cautious when interpreting the coefficient of correlation.
– A value of r near zero may indicate there is no linear relationship, yet a relationship of some other form, such as a nonlinear or curvilinear one, may still exist.
Transforming Data - Example
On the right is a listing of 22 professional golfers: the number of events in which they participated, the amount of their winnings, and their mean score for the 2004 season. In golf, the objective is to play 18 holes in the least number of strokes, so we would expect golfers with lower mean scores to have larger winnings. To put it another way, score and winnings should be inversely related. In 2004 Tiger Woods played in 19 events, earned $5,365,472, and had a mean score per round of 69.04. Fred Couples played in 16 events, earned $1,396,109, and had a mean score per round of 70.92. The data for the 22 golfers follow.
Scatterplot of Golf Data
The correlation between the variables Winnings and Score is -0.782, a fairly strong inverse relationship. However, when we plot the data on a scatter diagram, the relationship does not appear to be linear; it does not seem to follow a straight line.
Transforming Data
What can we do to explore other (nonlinear) relationships? One possibility is to transform one (or both) of the variables.
– For example, instead of using Y as the dependent variable, we may use its log, reciprocal, square, or square root.
– Another possibility is to transform the independent variable (X) in the same way.
After such a transformation, a linear relationship may emerge, and we can then apply regression analysis to the transformed variables.

Transforming Data - Example
In the golf winnings example, changing the scale of the dependent variable is effective. We determine the log of each golfer's winnings and then find the correlation between the log of winnings and score. For example, the log to the base 10 of Tiger Woods' earnings of $5,365,472 is 6.72961.
Scatter Plot of Transformed Y

Linear Regression Using the Transformed Y
Using the Transformed Equation for Estimation
Based on the regression equation, a golfer with a mean score of 70 could expect to earn:
– The fitted value 6.4372 is the log to the base 10 of winnings.
– The antilog of 6.4372 is about 2,736,500.
So a golfer with a mean score of 70 could expect to earn about $2,736,528.
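The transform and its inverse are just log base 10 and its antilog, so the two numbers quoted above can be checked directly. The full 22-golfer dataset is not printed in this text, so this sketch uses only the two values the slides give: Tiger Woods' winnings and the fitted log value 6.4372.

```python
import math

# Forward transform: winnings -> log10(winnings)
winnings = 5_365_472                 # Tiger Woods' 2004 winnings, from the slide
log_w = math.log10(winnings)
print(round(log_w, 5))               # matches the slide's 6.72961

# Inverse transform: a fitted value on the log scale back to dollars
fitted_log = 6.4372                  # regression estimate at a mean score of 70
estimate = 10 ** fitted_log
print(round(estimate))               # about $2,736,500
```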
End of Chapter 13