Correlation and Regression
CHAPTER 10 Correlation and Regression © Copyright McGraw-Hill 2004
Objectives
- Draw a scatter plot for a set of ordered pairs.
- Compute the correlation coefficient.
- Test the hypothesis H0: ρ = 0.
- Compute the equation of the regression line.
Objectives (cont’d.)
- Compute the coefficient of determination.
- Compute the standard error of estimate.
- Find a prediction interval.
- Be familiar with the concept of multiple regression.
Introduction In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship between two or more numerical or quantitative variables exists.
Statistical Methods Correlation is a statistical method used to determine whether a linear relationship between variables exists. Regression is a statistical method used to describe the nature of the relationship between variables—that is, positive or negative, linear or nonlinear.
Statistical Questions
- Are two or more variables related?
- If so, what is the strength of the relationship?
- What type of relationship exists?
- What kind of predictions can be made from the relationship?
Vocabulary A correlation coefficient is a measure of how variables are related. In a simple relationship, there are only two types of variables under study. In multiple relationships, many variables are under study.
Scatter Plots A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable, x, and the dependent variable, y. A scatter plot is a visual way to describe the nature of the relationship between the independent and dependent variables.
Scatter Plot Example [figure omitted]
Correlation Coefficient
The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables. The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ.
Correlation Coefficient (cont’d.)
The range of the correlation coefficient is from −1 to +1. If there is a strong positive linear relationship between the variables, the value of r will be close to +1. If there is a strong negative linear relationship between the variables, the value of r will be close to −1.
Correlation Coefficient (cont’d.)
When there is no linear relationship between the variables or only a weak relationship, the value of r will be close to 0. On the scale from −1 to +1, a value near −1 indicates a strong negative linear relationship, a value near 0 indicates no linear relationship, and a value near +1 indicates a strong positive linear relationship.
Formula for the Correlation Coefficient r
r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}
where n is the number of data pairs.
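As a numerical check, the computational formula can be translated directly into code. A minimal Python sketch (the function name is illustrative, not from the text):

```python
import math

def correlation_coefficient(x, y):
    """Sample correlation coefficient r from the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Perfectly linear increasing data gives r = 1.
print(correlation_coefficient([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```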
Population Correlation Coefficient
Formally defined, the population correlation coefficient, ρ, is the correlation computed by using all possible pairs of data values (x, y) taken from a population.
Hypothesis Testing In hypothesis testing, one of the following is true:
H0: ρ = 0 This null hypothesis means that there is no correlation between the x and y variables in the population.
H1: ρ ≠ 0 This alternative hypothesis means that there is a significant correlation between the variables in the population.
t Test for the Correlation Coefficient
Formula for the t test for the correlation coefficient:
t = r √[(n − 2) / (1 − r²)]
with degrees of freedom equal to n − 2.
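A direct translation of this test statistic into Python (the helper name is illustrative; compare the result against a t table with d.f. = n − 2):

```python
import math

def t_statistic(r, n):
    """t test statistic for H0: rho = 0, with d.f. = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# r = 0.6 with n = 11 data pairs gives t = 0.6 * sqrt(9 / 0.64) = 2.25.
print(round(t_statistic(0.6, 11), 4))  # 2.25
```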
Possible Relationships Between Variables
There is a direct cause-and-effect relationship between the variables; that is, x causes y. There is a reverse cause-and-effect relationship between the variables; that is, y causes x. The relationship between the variables may be caused by a third variable; that is, y may appear to cause x, but in reality z causes x.
Possible Relationships Between Variables (cont’d.)
There may be a complexity of interrelationships among many variables; that is, x may cause y, but w, t, and z fit into the picture as well. The relationship may be coincidental: although a researcher may find a relationship between x and y, common sense may prove otherwise.
Interpretation of Relationships
When the null hypothesis is rejected, the researcher must consider all possibilities and select the appropriate relationship between the variables as determined by the study. Remember, correlation does not necessarily imply causation.
Regression Line If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data’s line of best fit. Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.
Scatter Plot with Three Lines [figure omitted]
A Linear Relation [figure omitted]
Equation of a Line In algebra, the equation of a line is usually given as y = mx + b, where m is the slope of the line and b is the y intercept. In statistics, the equation of the regression line is written as y' = a + bx, where b is the slope of the line and a is the y' intercept.
Regression Line Formulas for the regression line y' = a + bx:
a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
where a is the y' intercept and b is the slope of the line.
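These two formulas share a denominator, which a short Python sketch makes explicit (the function name is illustrative):

```python
def regression_line(x, y):
    """Return (a, b) for the regression line y' = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    denom = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b

# Points lying exactly on y = 1 + 2x recover a = 1, b = 2.
print(regression_line([1, 2, 3], [3, 5, 7]))  # (1.0, 2.0)
```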
Rounding Rule When calculating the values of a and b, round to three decimal places.
Assumptions for Valid Predictions
- For any specific value of the independent variable x, the value of the dependent variable y must be normally distributed about the regression line.
- The standard deviation of each of the dependent variables must be the same for each value of the independent variable.
Limits of Predictions Remember that when assumptions are made, they are based on present conditions or on the premise that present trends will continue. The assumptions may not prove true in the future.
Procedure Finding the correlation coefficient and the regression line equation
Step 1 Make a table with columns for subject, x, y, xy, x², and y².
Step 2 Find the values of xy, x², and y². Place them in the appropriate columns.
Step 3 Substitute in the formula to find the value of r.
Procedure (cont’d.) Step 4 When r is significant, substitute in the formulas to find the values of a and b for the regression line equation.
Total Variation The total variation, Σ(y − ȳ)², is the sum of the squares of the vertical distance each point is from the mean. The total variation can be divided into two parts: that which is attributed to the relationship of x and y, and that which is due to chance.
Two Parts of Total Variation
The variation obtained from the relationship (i.e., from the predicted y' values) is Σ(y' − ȳ)² and is called the explained variation. Variation due to chance, found by Σ(y − y')², is called the unexplained variation. This variation cannot be attributed to the relationship.
Total Variation (cont’d.)
Hence, the total variation is equal to the sum of the explained variation and the unexplained variation. For a single point, the differences are called deviations.
Coefficient of Determination
The coefficient of determination is a measure of the variation of the dependent variable that is explained by the regression line and the independent variable. The symbol for the coefficient of determination is r².
Coefficient of Nondetermination
The coefficient of nondetermination is a measure of the unexplained variation. The formula for the coefficient of nondetermination is 1 − r².
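The split of total variation into explained and unexplained parts can be verified numerically. This sketch fits the least-squares line and returns the three sums of squares; the ratios give r² and 1 − r² (function and variable names are illustrative):

```python
def variation_partition(x, y):
    """Return (total, explained, unexplained) variation about the fitted line y' = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(p * q for p, q in zip(x, y))
    sx2 = sum(p * p for p in x)
    d = n * sx2 - sx ** 2
    a = (sy * sx2 - sx * sxy) / d
    b = (n * sxy - sx * sy) / d
    y_bar = sy / n
    predicted = [a + b * xi for xi in x]
    total = sum((yi - y_bar) ** 2 for yi in y)
    explained = sum((yp - y_bar) ** 2 for yp in predicted)
    unexplained = sum((yi - yp) ** 2 for yi, yp in zip(y, predicted))
    return total, explained, unexplained

total, explained, unexplained = variation_partition([1, 2, 3, 4], [2, 3, 5, 6])
print(explained / total)    # coefficient of determination r^2
print(unexplained / total)  # coefficient of nondetermination 1 - r^2
```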
Standard Error of Estimate
The standard error of estimate, denoted by s_est, is the standard deviation of the observed y values about the predicted y' values. The formula for the standard error of estimate is:
s_est = √[Σ(y − y')² / (n − 2)]
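Given observed y values and the predicted y' values from the regression line, the formula is a one-liner; a minimal sketch (the function name is illustrative):

```python
import math

def std_error_estimate(y_obs, y_pred):
    """Standard error of estimate: sqrt(sum((y - y')^2) / (n - 2))."""
    n = len(y_obs)
    sse = sum((yo - yp) ** 2 for yo, yp in zip(y_obs, y_pred))
    return math.sqrt(sse / (n - 2))

# Small residuals give a small standard error.
print(std_error_estimate([2, 3, 5, 6], [1.9, 3.3, 4.7, 6.1]))
```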
Prediction Interval The standard error of estimate can be used for constructing a prediction interval about a y' value. The formula for the prediction interval is:
y' ± t_{α/2} · s_est · √[1 + 1/n + n(x − X̄)² / (nΣx² − (Σx)²)]
The d.f. = n − 2.
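A sketch of the computation, with the critical value t_{α/2} passed in by hand (Python's standard library has no inverse t distribution, so read it from a t table with d.f. = n − 2; all names here are illustrative):

```python
import math

def prediction_interval(x_new, y_prime, x, s_est, t_crit):
    """Prediction interval about the predicted value y' at x_new."""
    n = len(x)
    sx = sum(x)
    sx2 = sum(v * v for v in x)
    x_bar = sx / n
    margin = t_crit * s_est * math.sqrt(
        1 + 1 / n + n * (x_new - x_bar) ** 2 / (n * sx2 - sx ** 2)
    )
    return y_prime - margin, y_prime + margin

low, high = prediction_interval(2.5, 4.0, [1, 2, 3, 4], 0.3162, 4.303)
print(low, high)  # an interval centered on y' = 4.0
```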
Multiple Regression In multiple regression, there are several independent variables and one dependent variable, and the equation is:
y' = a + b1x1 + b2x2 + … + bkxk
where x1, x2, …, xk are the independent variables.
Multiple Regression (cont’d.)
Multiple regression analysis is used when a statistician thinks there are several independent variables contributing to the variation of the dependent variable. This analysis then can be used to increase the accuracy of predictions for the dependent variable over one independent variable alone.
Assumptions for Multiple Regression
- Normality assumption: for any specific value of the independent variable, the values of the y variable are normally distributed.
- Equal variance assumption: the variances for the y variable are the same for each value of the independent variable.
- Linearity assumption: there is a linear relationship between the dependent variable and the independent variables.
Assumptions (cont’d.)
- Nonmulticollinearity assumption: the independent variables are not correlated.
- Independence assumption: the values for the y variable are independent.
Multiple Correlation Coefficient
In multiple regression, as in simple regression, the strength of the relationship between the independent variables and the dependent variable is measured by a correlation coefficient. This multiple correlation coefficient is symbolized by R.
Multiple Correlation Coefficient Formula
The formula for R is:
R = √[(r²yx1 + r²yx2 − 2(ryx1)(ryx2)(rx1x2)) / (1 − r²x1x2)]
where ryx1 is the correlation coefficient for the variables y and x1; ryx2 is the correlation coefficient for the variables y and x2; and rx1x2 is the value of the correlation coefficient for the variables x1 and x2.
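With the two-predictor case, R can be computed directly from the three pairwise correlations. A minimal sketch (the correlation values below are made-up illustrative inputs, not from the text):

```python
import math

def multiple_R(r_yx1, r_yx2, r_x1x2):
    """Multiple correlation coefficient R from the three pairwise correlations."""
    numerator = r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2
    return math.sqrt(numerator / (1 - r_x1x2 ** 2))

# Illustrative inputs: r_yx1 = 0.8, r_yx2 = 0.6, r_x1x2 = 0.5.
print(multiple_R(0.8, 0.6, 0.5))
```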
Coefficient of Multiple Determination
As with simple regression, R² is the coefficient of multiple determination, and it is the amount of variation explained by the regression model. The expression 1 − R² represents the amount of unexplained variation, called the error or residual variation.
F Test for Significance of R
The formula for the F test is:
F = (R²/k) / [(1 − R²)/(n − k − 1)]
where n is the number of data groups (x1, x2, …, y) and k is the number of independent variables. The degrees of freedom are d.f.N = n − k and d.f.D = n − k − 1.
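The test statistic itself is simple arithmetic once R², n, and k are known; a sketch with illustrative inputs:

```python
def f_statistic(R2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (R2 / k) / ((1 - R2) / (n - k - 1))

# Illustrative inputs: R^2 = 0.7 with n = 10 data groups and k = 2 independent variables.
print(f_statistic(0.7, 10, 2))  # about 8.17
```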
Adjusted R² Since the value of R² is dependent on n (the number of data pairs) and k (the number of variables), statisticians also calculate what is called an adjusted R², denoted by R²adj. This is based on the number of degrees of freedom.
Adjusted R² (cont’d.) The formula for the adjusted R² is:
R²adj = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
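The adjustment penalizes R² as predictors are added relative to the sample size, as a quick sketch shows (inputs are illustrative):

```python
def adjusted_r2(R2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - R2) * (n - 1) / (n - k - 1)

# Adjusting always pulls R^2 down when k > 0.
print(adjusted_r2(0.8, 10, 2))  # about 0.743
```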
Summary The strength and direction of the linear relationship between variables is measured by the value of the correlation coefficient r. r can assume values between and including −1 and +1. The closer the value of the correlation coefficient is to +1 or −1, the stronger the linear relationship between the variables. A value of +1 or −1 indicates a perfect linear relationship.
Summary (cont’d.) Relationships can be linear or curvilinear. To determine the shape, one draws a scatter plot of the variables. If the relationship is linear, the data can be approximated by a straight line, called the regression line or the line of best fit.
Summary (cont’d.) In addition, relationships can be multiple. That is, there can be two or more independent variables and one dependent variable. A coefficient of correlation and a regression equation can be found for multiple relationships, just as they can be found for simple relationships.
Summary (cont’d.) The coefficient of determination is a better indicator of the strength of a linear relationship than the correlation coefficient. It is better because it identifies the percentage of variation of the dependent variable that is directly attributable to the variation of the independent variable. The coefficient of determination is obtained by squaring the correlation coefficient and converting the result to a percentage.
Summary (cont’d.) Another statistic used in correlation and regression is the standard error of estimate, which is an estimate of the standard deviation of the y values about the predicted y' values. The standard error of estimate can be used to construct a prediction interval about a predicted value y' for a given value of x.
Conclusion Many relationships among variables exist in the real world. One way to determine whether a relationship exists is to use the statistical techniques known as correlation and regression.