CHAPTER 10: Correlation and Regression
© Copyright McGraw-Hill 2000
Objectives
Draw a scatter plot for a set of ordered pairs. Compute the correlation coefficient. Test the hypothesis H₀: ρ = 0. Compute the equation of the regression line.
Objectives (cont’d.)
Compute the coefficient of determination. Compute the standard error of estimate. Find a prediction interval. Be familiar with the concept of multiple regression.
Introduction
In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship exists between two or more numerical or quantitative variables.
Statistical Methods
Correlation is a statistical method used to determine whether a linear relationship between variables exists. Regression is a statistical method used to describe the nature of the relationship between variables: that is, positive or negative, linear or nonlinear.
Statistical Questions
1. Are two or more variables related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?
Vocabulary
A correlation coefficient is a measure of how strongly variables are related. In a simple relationship, there are only two variables under study. In multiple relationships, many variables are under study.
Scatter Plots
A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y. A scatter plot is a visual way to describe the nature of the relationship between the independent and dependent variables.
Scatter Plot Example
[Figure: example scatter plot; image not reproduced in this transcript.]
Correlation Coefficient
The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables. The symbol for the sample correlation coefficient is r; the symbol for the population correlation coefficient is ρ.
Correlation Coefficient (cont’d.)
The range of the correlation coefficient is from −1 to +1. If there is a strong positive linear relationship between the variables, the value of r will be close to +1. If there is a strong negative linear relationship between the variables, the value of r will be close to −1.
Correlation Coefficient (cont’d.)
When there is no linear relationship between the variables, or only a weak one, the value of r will be close to 0. Values of r can be pictured on a scale running from −1 (strong negative linear relationship) through 0 (no linear relationship) to +1 (strong positive linear relationship).
Formula for the Correlation Coefficient r
r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}
where n is the number of data pairs.
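As a minimal sketch of this formula, the computation can be carried out in Python; the five (x, y) pairs below are invented for illustration:

```python
# Hypothetical data set: five (x, y) pairs invented for this example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Sums required by the formula: Σx, Σy, Σxy, Σx², Σy².
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}
numerator = n * sum_xy - sum_x * sum_y
denominator = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
r = numerator / denominator
print(round(r, 3))  # 0.775
```

Here r ≈ 0.775 suggests a fairly strong positive linear relationship for these invented data.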
Population Correlation Coefficient
Formally defined, the population correlation coefficient, ρ, is the correlation computed by using all possible pairs of data values (x, y) taken from a population.
Hypothesis Testing
In hypothesis testing, one of the following is true:
H₀: ρ = 0. This null hypothesis means that there is no correlation between the x and y variables in the population.
H₁: ρ ≠ 0. This alternative hypothesis means that there is a significant correlation between the variables in the population.
t Test for the Correlation Coefficient
Formula for the t test for the correlation coefficient:
t = r · √[(n − 2) / (1 − r²)]
with degrees of freedom equal to n − 2.
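A minimal sketch of this t test, assuming a hypothetical sample of n = 5 pairs with correlation r ≈ 0.7746:

```python
# Hypothetical values: n = 5 data pairs with sample correlation r = √0.6.
n = 5
r = 0.6 ** 0.5  # ≈ 0.7746

# t = r · √[(n − 2) / (1 − r²)], with d.f. = n − 2.
t = r * ((n - 2) / (1 - r ** 2)) ** 0.5
print(round(t, 3))  # 2.121
```

The computed t ≈ 2.121 would then be compared with the critical t value from a table at d.f. = n − 2 = 3 to decide whether to reject H₀: ρ = 0.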
Possible Relationships Between Variables
There is a direct cause-and-effect relationship between the variables; that is, x causes y. There is a reverse cause-and-effect relationship between the variables; that is, y causes x. The relationship between the variables may be caused by a third variable; that is, y may appear to cause x, but in reality z causes x.
Possible Relationships Between Variables (cont’d.)
There may be a complexity of interrelationships among many variables; that is, x may cause y, but w, t, and z fit into the picture as well. The relationship may be coincidental: although a researcher may find a relationship between x and y, common sense may prove otherwise.
Interpretation of Relationships
When the null hypothesis is rejected, the researcher must consider all possibilities and select the appropriate relationship between the variables as determined by the study. Remember, correlation does not necessarily imply causation.
Regression Line
If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data’s line of best fit. Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.
Scatter Plot with Three Lines
[Figure: scatter plot with three candidate lines; image not reproduced in this transcript.]
A Linear Relation
[Figure: graph of a linear relation; image not reproduced in this transcript.]
Equation of a Line
In algebra, the equation of a line is usually given as y = mx + b, where m is the slope of the line and b is the y intercept. In statistics, the equation of the regression line is written as y′ = a + bx, where b is the slope of the line and a is the y′ intercept.
Regression Line
Formulas for the regression line y′ = a + bx:
a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
where a is the y′ intercept and b is the slope of the line.
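As a sketch, these two formulas can be evaluated directly from the sums; the values below are the hypothetical sums for the five invented pairs (1,2), (2,4), (3,5), (4,4), (5,5):

```python
# Hypothetical sums for the invented data set of five pairs.
n = 5
sum_x, sum_y = 15, 20
sum_xy, sum_x2 = 66, 55

denom = n * sum_x2 - sum_x ** 2                   # n(Σx²) − (Σx)², shared denominator
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom     # y′ intercept
b = (n * sum_xy - sum_x * sum_y) / denom          # slope
print(a, b)  # 2.2 0.6, i.e. y' = 2.2 + 0.6x
```

For these data the regression line is y′ = 2.2 + 0.6x.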
Rounding Rule
When calculating the values of a and b, round to three decimal places.
Assumptions for Valid Predictions
For any specific value of the independent variable x, the value of the dependent variable y must be normally distributed about the regression line. The standard deviation of each of the dependent variables must be the same for each value of the independent variable.
Limits of Predictions
Remember that these assumptions are based on present conditions or on the premise that present trends will continue; they may not prove true in the future.
Procedure
Finding the correlation coefficient and the regression line equation:
Step 1: Make a table with columns for subject, x, y, xy, x², and y².
Step 2: Find the values of xy, x², and y², and place them in the appropriate columns.
Step 3: Substitute in the formula to find the value of r.
Procedure (cont’d.)
Step 4: When r is significant, substitute in the formulas to find the values of a and b for the regression line equation.
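The four steps above can be sketched end to end in Python; the data set is hypothetical, invented for illustration:

```python
# Step 1: the data table, one (x, y) row per subject (invented values).
data = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
n = len(data)

# Step 2: fill in the xy, x², y² columns and form the column sums.
sx = sum(x for x, _ in data)
sy = sum(y for _, y in data)
sxy = sum(x * y for x, y in data)
sx2 = sum(x * x for x, _ in data)
sy2 = sum(y * y for _, y in data)

# Step 3: substitute into the formula for r.
r = (n * sxy - sx * sy) / (((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2)) ** 0.5)

# Step 4: if r is significant, find a and b for y' = a + bx.
b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
a = (sy * sx2 - sx * sxy) / (n * sx2 - sx ** 2)
print(round(r, 3), a, b)  # 0.775 2.2 0.6
```

Whether to carry out Step 4 depends on the t test of Step 3’s r at the chosen significance level.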
Total Variation
The total variation, Σ(y − ȳ)², is the sum of the squares of the vertical distance each point is from the mean. The total variation can be divided into two parts: that which is attributed to the relationship of x and y, and that which is due to chance.
Two Parts of Total Variation
The variation obtained from the relationship (i.e., from the predicted y′ values) is Σ(y′ − ȳ)² and is called the explained variation. Variation due to chance, found by Σ(y − y′)², is called the unexplained variation. This variation cannot be attributed to the relationship.
Total Variation (cont’d.)
Hence, the total variation is equal to the sum of the explained variation and the unexplained variation:
Σ(y − ȳ)² = Σ(y′ − ȳ)² + Σ(y − y′)²
For a single point, the differences are called deviations.
Coefficient of Determination
The coefficient of determination is a measure of the variation of the dependent variable that is explained by the regression line and the independent variable. The symbol for the coefficient of determination is r².
Coefficient of Nondetermination
The coefficient of nondetermination is a measure of the unexplained variation. The formula for the coefficient of nondetermination is:
1 − r²
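A quick sketch of both coefficients, assuming a hypothetical correlation r = 0.7746:

```python
# Hypothetical sample correlation.
r = 0.7746

r_sq = r ** 2         # coefficient of determination
nondet = 1 - r ** 2   # coefficient of nondetermination
print(round(r_sq, 2), round(nondet, 2))  # 0.6 0.4
```

Read as percentages, about 60% of the variation in y is explained by the regression line and about 40% is unexplained.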
Standard Error of Estimate
The standard error of estimate, denoted by s_est, is the standard deviation of the observed y values about the predicted y′ values. The formula for the standard error of estimate is:
s_est = √[Σ(y − y′)² / (n − 2)]
Prediction Interval
The standard error of estimate can be used for constructing a prediction interval about a y′ value. The formula for the prediction interval is:
y′ ± t_(α/2) · s_est · √[1 + 1/n + n(x − x̄)² / (n(Σx²) − (Σx)²)]
with d.f. = n − 2.
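A sketch of a 95% prediction interval at x = 3, assuming the hypothetical regression y′ = 2.2 + 0.6x with n = 5 and s_est ≈ 0.894; the critical value 3.182 is the table value of t at α = 0.05 (two-tailed) with d.f. = n − 2 = 3:

```python
# Hypothetical quantities from the earlier invented data set.
n = 5
sum_x, sum_x2 = 15, 55
x_bar = sum_x / n
s_est = 0.894
t_crit = 3.182  # t table value, alpha = 0.05 two-tailed, d.f. = 3

x0 = 3
y_prime = 2.2 + 0.6 * x0  # predicted value at x = 3

# margin = t · s_est · √[1 + 1/n + n(x − x̄)² / (nΣx² − (Σx)²)]
margin = t_crit * s_est * (1 + 1 / n
                           + n * (x0 - x_bar) ** 2
                           / (n * sum_x2 - sum_x ** 2)) ** 0.5
print(round(y_prime - margin, 2), round(y_prime + margin, 2))
```

The interval is wide here because the sample is tiny (n = 5), which is exactly what the n − 2 degrees of freedom and the 1/n term encode.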
Multiple Regression
In multiple regression, there are several independent variables and one dependent variable, and the equation is:
y′ = a + b₁x₁ + b₂x₂ + … + bₖxₖ
where x₁, x₂, …, xₖ are the independent variables.
Multiple Regression (cont’d.)
Multiple regression analysis is used when a statistician thinks there are several independent variables contributing to the variation of the dependent variable. This analysis then can be used to increase the accuracy of predictions for the dependent variable over one independent variable alone.
Assumptions for Multiple Regression
Normality assumption: for any specific value of the independent variable, the values of the y variable are normally distributed.
Equal variance assumption: the variances for the y variable are the same for each value of the independent variable.
Linearity assumption: there is a linear relationship between the dependent variable and the independent variables.
Assumptions (cont’d.)
Nonmulticollinearity assumption: the independent variables are not correlated.
Independence assumption: the values for the y variable are independent.
Multiple Correlation Coefficient
In multiple regression, as in simple regression, the strength of the relationship between the independent variables and the dependent variable is measured by a correlation coefficient. This multiple correlation coefficient is symbolized by R.
Multiple Correlation Coefficient Formula
The formula for R (two independent variables) is:
R = √[(r²_yx₁ + r²_yx₂ − 2·r_yx₁·r_yx₂·r_x₁x₂) / (1 − r²_x₁x₂)]
where r_yx₁ is the correlation coefficient for the variables y and x₁; r_yx₂ is the correlation coefficient for the variables y and x₂; and r_x₁x₂ is the correlation coefficient for the variables x₁ and x₂.
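A sketch of this formula; the three pairwise correlations below are invented values, not taken from any data set in the text:

```python
# Hypothetical pairwise correlations (invented for illustration).
r_yx1, r_yx2 = 0.8, 0.6   # correlations of y with x1 and with x2
r_x1x2 = 0.5              # correlation between x1 and x2

# R = √[(r²_yx1 + r²_yx2 − 2·r_yx1·r_yx2·r_x1x2) / (1 − r²_x1x2)]
R = ((r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2)
     / (1 - r_x1x2 ** 2)) ** 0.5
print(round(R, 3))  # 0.833
```

Note that R ≈ 0.833 exceeds either single correlation with y, reflecting the combined contribution of both predictors.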
Coefficient of Multiple Determination
As with simple regression, R² is the coefficient of multiple determination, and it is the amount of variation explained by the regression model. The expression 1 − R² represents the amount of unexplained variation, called the error or residual variation.
F Test for Significance of R
The formula for the F test is:
F = (R² / k) / [(1 − R²) / (n − k − 1)]
where n is the number of data groups (x₁, x₂, …, y) and k is the number of independent variables. The degrees of freedom are d.f.N. = n − k and d.f.D. = n − k − 1.
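A sketch of the F test for a hypothetical model with R² = 0.70, n = 10 data groups, and k = 2 independent variables (all values invented):

```python
# Hypothetical model summary: R² = 0.70, n = 10 groups, k = 2 predictors.
R_sq, n, k = 0.70, 10, 2

# F = (R²/k) / [(1 − R²)/(n − k − 1)]
F = (R_sq / k) / ((1 - R_sq) / (n - k - 1))
print(round(F, 2))  # 8.17
```

The computed F ≈ 8.17 would then be compared with the critical F value at the appropriate degrees of freedom to judge the significance of R.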
Adjusted R²
Since the value of R² is dependent on n (the number of data pairs) and k (the number of variables), statisticians also calculate what is called an adjusted R², denoted by R²_adj. This is based on the number of degrees of freedom.
Adjusted R² (cont’d.)
The formula for the adjusted R² is:
R²_adj = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
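A sketch of the adjustment for the same hypothetical model (R² = 0.70, n = 10, k = 2; values invented):

```python
# Hypothetical model summary, as in the F-test example.
R_sq, n, k = 0.70, 10, 2

# R²_adj = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
R_sq_adj = 1 - (1 - R_sq) * (n - 1) / (n - k - 1)
print(round(R_sq_adj, 3))  # 0.614
```

The adjusted value (≈ 0.614) is smaller than the raw R² (0.70), which is the intended penalty for fitting k predictors to a small n.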
Summary
The strength and direction of the linear relationship between variables are measured by the value of the correlation coefficient r, which can assume values between and including −1 and +1. The closer the value of the correlation coefficient is to +1 or −1, the stronger the linear relationship between the variables. A value of +1 or −1 indicates a perfect linear relationship.
Summary (cont’d.)
Relationships can be linear or curvilinear. To determine the shape, one draws a scatter plot of the variables. If the relationship is linear, the data can be approximated by a straight line, called the regression line or the line of best fit.
Summary (cont’d.)
In addition, relationships can be multiple. That is, there can be two or more independent variables and one dependent variable. A coefficient of correlation and a regression equation can be found for multiple relationships, just as they can be found for simple relationships.
Summary (cont’d.)
The coefficient of determination is a better indicator of the strength of a linear relationship than the correlation coefficient. It is better because it identifies the percentage of variation of the dependent variable that is directly attributable to the variation of the independent variable. The coefficient of determination is obtained by squaring the correlation coefficient and converting the result to a percentage.
Summary (cont’d.)
Another statistic used in correlation and regression is the standard error of estimate, which is an estimate of the standard deviation of the y values about the predicted y′ values. The standard error of estimate can be used to construct a prediction interval about a specific point estimate y′ of the mean of the y values for a given x.
Conclusion
Many relationships among variables exist in the real world. One way to determine whether a relationship exists is to use the statistical techniques known as correlation and regression.