
1 Regression Analysis We have previously studied the Pearson’s r correlation coefficient and the r2 coefficient of determination as measures of association for evaluating the relationship between an interval level independent variable and an interval level dependent variable. These statistics are components of a broader set of statistical techniques for evaluating the relationship between two interval level variables, called regression analysis (sometimes referred to in combination as correlation and regression analysis).

2 Regression Analysis vs. Chi-Square Test of Independence
Our purpose now is to use a hypothesis test to conclude that there is a relationship between two interval level variables in the population represented by our sample data. We could use a chi-square test of independence to determine whether or not a relationship exists between two variables in the population represented by our data, provided we grouped the values of both variables to create a bivariate table. However, it is preferable to test for the presence of a relationship retaining the variables as interval level data because this strategy is more effective at detecting the existence of relationship. We might find a relationship using interval level statistics that we do not find using nominal level statistics because the nominal level statistics are less precise.

3 Elements of Regression Analysis
We will first review previous material on regression and correlation: The scatterplot or scattergram The regression equation Then, we will examine the statistical evidence to determine whether or not the relationships found in our sample data are applicable to the population represented by the sample, using a hypothesis test.

4 Purpose of Regression Analysis
The purpose of regression analysis is to answer the same three questions that have been identified as requirements for understanding the relationships between variables: Is there a relationship between the two variables? How strong is the relationship? What is the direction of the relationship?

5 Scatterplots - 1 The relationship between two interval variables can be graphed as a scatterplot or a scatter diagram, which shows the position of all of the cases in an x-y coordinate system. The independent variable is plotted on the x-axis, or the horizontal axis. The dependent variable is plotted on the y-axis, or the vertical axis. A dot in the body of the chart represents the intersection of a case's values on the x-axis and the y-axis.

6 Scatterplots - 2 The trendline or regression line is plotted on the chart in a contrasting color. The overall pattern of the dots, or data points, succinctly summarizes the nature of the relationship between the two variables. The clarity of the pattern formed by the dots can be enhanced by drawing a straight line through the cluster such that the line touches every dot or comes as close to doing so as possible. This summarizing line is called the “regression line.” We will see later how this line is obtained, but for now, we will look at how it helps us understand the scatterplot.

7 Scatterplots - 3 The pattern of the points on the scatterplot gives us information about the relationship between the variables. The regression line, drawn in red, makes it easier for us to understand the scatterplot.

8 The Uses of Scatterplots
Scatterplots give us information about our three questions about the relationship between two interval variables: Is there a relationship between the two variables? How strong is the relationship? What is the direction of the relationship? In addition, the regression line on the scatterplot can be used to estimate the value of the dependent variable for any value of the independent variable.

9 Scatterplots: Evidence of a Relationship
The angle between the regression line and the horizontal x-axis provides evidence of a relationship. When there is no relationship between two variables, the regression line is parallel to the horizontal axis. When there is a relationship between two variables, the regression line lies at an angle to the horizontal axis, sloping either upward or downward.

10 Scatterplots: Strength of a Relationship
The strength of a relationship is indicated by the narrowness of the band of points spread around the regression line: the tighter the band, the stronger the relationship. The spread of the points around the regression line is narrow, indicating a stronger relationship. We should check the scale of the vertical axis to make sure the narrow band is not the result of an excessively large scale. In this scatterplot, the points are very spread out around the regression line. The relationship is weak.

11 Scatterplots: Direction of Relationship
When the regression line slopes upward to the right, there is a positive, or direct, relationship between the variables. When the regression line slopes downward, the relationship is negative, or inverse. In this scatterplot, the regression line slopes downward to the right, indicating a negative or inverse relationship. The values of the variables move in opposite directions. In this scatterplot, the regression line slopes upward to the right, indicating a positive or direct relationship. The values of both variables increase and decrease at the same time.

12 Scatterplots: Predicting Scores
For any value of the independent variable on the horizontal x-axis, the predicted value for the dependent variable will be the corresponding value on the vertical y-axis. From the value of the independent variable on the horizontal axis (e.g. 52), we draw a perpendicular line upward to the regression line. The estimate for the dependent variable is obtained by drawing a line parallel to the x-axis from the regression line to the vertical y-axis and reading the value where this line crosses the y-axis (e.g. 50).

13 The Effect of Scaling on the Scatterplot
The scale used for the vertical y-axis can change the appearance of the scatterplot and alter our interpretation of the strength of the relationship. The three scatterplots on this slide all use the same data. In the original plot, the y-axis is scaled from 0 to 80. In this plot, I doubled the range of the y-axis scale to 0 to 160, drawing the points closer together, and making the relationship appear stronger. In this plot, I have narrowed the range of the y-axis scale to 25 to 75, spreading the points, and making the relationship appear weaker.

14 The Assumption of Linearity
An underlying assumption of regression analysis is that the relationship between the variables is linear, meaning that the points in the scatterplot must form a pattern that can be approximated with a straight line. While we could test the assumption of linearity with a test of statistical significance of the correlation coefficient, we will make a visual assessment of the scatterplots. If the scatterplot indicates that the points do not follow a linear pattern, the techniques of linear correlation and regression should not be applied.

15 Examples of Linear Relationships
These two scatterplots are for data on poverty of nations. The plots below show strong linear relationships. The points are evenly distributed on either side of the regression line.

16 Examples of Non-linear Relationships
These scatterplots show a non-linear relationship. The points are not evenly distributed on either side of the regression line. We will often see a concentration of points on one side of the regression line and an absence of points on the other side.

17 The Regression Equation
The regression equation is the algebraic formula for the regression line, which states the mathematical relationship between the independent and the dependent variable. We can use the regression line to estimate the value of the dependent variable for any value of the independent variable. The stronger the relationship between the independent and dependent variables, the closer these estimates will come to the actual score that each case had on the dependent variable.

18 Components of the Regression Equation
The regression equation has two components. The first component is a number called the y-intercept that defines where the line crosses the vertical y axis. The second component is called the slope of the line, and is a number that multiplies the value of the independent variable. These two elements are combined in the general form for the regression equation: the estimated score on the dependent variable = the y-intercept + the slope × the score on the independent variable

19 The Standard Form of the Regression Equation
The standard form for the regression equation or formula is: Y = a + bX, where Y is the estimated score for the dependent variable; X is the score for the independent variable; b is the slope of the regression line, or the multiplier of X; and a is the y-intercept, or the point where the regression line crosses the vertical y-axis.

20 Depicting the Regression Equation
The regression equation includes both the y-intercept and the slope of the line. The y-intercept is 1.0 and the slope is 0.5. The slope is the multiplier of x. It is the amount of change in y for a change of one unit in x. If x changes one unit from 2.0 to 3.0, depicted by the blue arrow, y will change by 0.5 units, from 2.0 to 2.5 as depicted by the red arrow. The y-intercept is the point on the vertical y-axis where the regression line crosses the axis, i.e. 1.0.
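The slide's numbers (intercept 1.0, slope 0.5) can be sketched in Python to confirm the one-unit change described above:

```python
def predict(x, a=1.0, b=0.5):
    """Regression equation Y = a + bX, with the slide's intercept and slope."""
    return a + b * x

# A one-unit change in x (2.0 -> 3.0) changes the prediction by the slope, 0.5.
print(predict(2.0))  # 2.0
print(predict(3.0))  # 2.5
print(predict(0.0))  # 1.0, the y-intercept
```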

21 Deriving the Regression Equation
In this plot, none of the points fall on the regression line. The difference between the actual value for the dependent variable and the predicted value for each point is shown by the red lines. This difference is called the residual, and represents the error between the actual and predicted values. The regression equation is computed to minimize the total amount of error in predicting values for the dependent variable. The method for deriving the equation is called the "method of least squares," meaning that the regression line minimizes the sum of the squared residuals, or errors between actual and predicted values.
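The method of least squares described above can be sketched in Python; the small dataset here is hypothetical, purely for illustration:

```python
def least_squares(xs, ys):
    """Fit Y = a + bX by minimizing the sum of squared residuals."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx      # slope
    a = my - b * mx    # intercept
    return a, b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 2, 4, 4, 6]
a, b = least_squares(xs, ys)

# Residuals: the vertical distances between actual and predicted values.
# A property of the least-squares line is that the residuals sum to zero.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(round(a, 3), round(b, 3))   # 0.6 1.0
```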

22 Interpreting the Regression Equation: the Intercept
The intercept is the point on the vertical axis where the regression line crosses the axis. It is the predicted value for the dependent variable when the independent variable has a value of zero. This may or may not be useful information depending on the context of the problem.

23 Interpreting the Regression Equation: the Slope
The slope is interpreted as the amount of change in the predicted value of the dependent variable associated with a one unit change in the value of the independent variable. If the slope has a negative sign, the direction of the relationship is negative or inverse, meaning that the scores on the two variables move in opposite directions. If the slope has a positive sign, the direction of the relationship is positive or direct, meaning that the scores on the two variables move in the same direction.

24 Interpreting the Regression Equation: when the Slope equals 0
If there is no relationship between two variables, the slope of the regression line is zero and the regression line is parallel to the horizontal axis. A slope of zero means that the predicted value of the dependent variable will not change, no matter what value of the independent variable is used. If there is no relationship, using the regression equation to predict values of the dependent variable is no improvement over using the mean of the dependent variable.
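A quick sketch, using hypothetical scores, of why a zero slope reduces the regression prediction to a constant, namely the mean of the dependent variable:

```python
ys = [4, 6, 7, 8, 10]            # hypothetical dependent-variable scores
mean_y = sum(ys) / len(ys)       # 7.0

def predict_no_relation(x, a=mean_y, b=0.0):
    # With slope 0, the prediction ignores x entirely
    return a + b * x

# The prediction is the same for any value of the independent variable
print(predict_no_relation(1), predict_no_relation(100))
```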

25 Assumptions Required for Utilizing a Regression Equation
The assumptions required for utilizing a regression equation are the same as the assumptions for the test of significance of a correlation coefficient. Both variables are interval level. Both variables are normally distributed. The relationship between the two variables is linear. The variance of the values of the dependent variable is uniform for all values of the independent variable (equality of variance).

26 Assumption of Normality
Strictly speaking, the test requires that the two variables be bivariate normal, meaning that the combined distribution of the two variables is normal. It is usually assumed that the variables are bivariate normal if each variable is normally distributed, so this assumption is tested by checking the normality of each variable. Each variable will be considered normal if its skewness and kurtosis statistics fall between –1.0 and +1.0 or if the sample size is sufficiently large to apply the Central Limit theorem.
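The screening rule above can be sketched in Python. This uses the simple moment-based definitions of skewness and excess kurtosis; statistical packages often apply small-sample corrections, so their values may differ slightly:

```python
def skewness(data):
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5          # 0 for a symmetric distribution

def excess_kurtosis(data):
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3.0      # 0 for a normal distribution

def roughly_normal(data):
    # The slide's rule of thumb: both statistics between -1.0 and +1.0
    return -1.0 <= skewness(data) <= 1.0 and -1.0 <= excess_kurtosis(data) <= 1.0

# A small symmetric, mound-shaped set of hypothetical scores passes the screen
print(roughly_normal([1, 2, 2, 3, 3, 3, 4, 4, 5]))
```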

27 Assumption of Linearity
Linearity means that the pattern of the points in a scatterplot forms a band, like the pattern in the chart on the right: When the pattern of the points follows a curve, like the scatterplot on the right, the correlation coefficient will not accurately measure the relationship.

28 Test of Linearity The test of linearity is a diagnostic statistical test of the null hypothesis that the linear model is an appropriate fit for the data points. The desired outcome for this test is to fail to reject the null hypothesis. If the probability for the test statistic is less than or equal to the level of significance for the problem, we reject the null hypothesis, concluding that the data is not linear and regression analysis is not appropriate for the relationship between the two variables. If the probability for the test statistic is greater than the level of significance for the problem, we fail to reject the null hypothesis and conclude that we satisfy the assumption of linearity.

29 Assumption of Homoscedasticity
Homoscedasticity (equality of variances) means that the points are evenly dispersed on either side of the regression line for the linear relationship. In this scatterplot, the points extend about the same distance above and below the regression line for most of the length of the regression line. This scatterplot meets the assumption of homoscedasticity. In this scatterplot, the spread of the points around the regression line is narrower at the left end of the regression line than at the right end of the regression line. This “funnel” shape is typical of a scatterplot showing violations of the assumption of homoscedasticity.

30 Test of Homoscedasticity
When we compared groups, we used the Levene test of population variances to test the assumption that the group variances were equal. In order to use this test for the assumption of homoscedasticity, we will convert the interval level independent variable into a dichotomous variable, with low scores in one group and high scores in the other group. We can then compare the variances of the dependent variable for the two groups derived from the independent variable.

31 Levene Test of Homogeneity of Variances
The Levene test of equality of population variances tests whether or not the variances for the two groups are equal. It is a test of the research hypothesis that the variance (dispersion) of the group with low scores is different from the variance of the group with high scores. The null hypothesis states that the variances (dispersion) of the two groups are equal. If the probability of the test statistic is greater than 0.05, we do not reject the null hypothesis and conclude that the variances are equal. This is the desired outcome. If the probability of the test statistic is less than or equal to 0.05, we conclude the variances are different and regression analysis is not an appropriate test for the relationship between the two variables.
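A sketch of this procedure in Python: a median split of the independent variable (one reasonable way to form the low/high groups, which the slides do not spell out) followed by the mean-centered version of Levene's statistic. SPSS also reports a p-value from the F distribution, which is omitted here:

```python
def levene_W(group1, group2):
    """Levene's test statistic for two groups (mean-centered version).
    Under the null hypothesis of equal variances, W follows F(k-1, N-k)."""
    groups = [group1, group2]
    k = len(groups)
    N = sum(len(g) for g in groups)
    # Absolute deviations of each score from its own group's mean
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    zbar_i = [sum(zi) / len(zi) for zi in z]        # group means of deviations
    zbar = sum(sum(zi) for zi in z) / N             # grand mean of deviations
    between = sum(len(zi) * (m_i - zbar) ** 2 for zi, m_i in zip(z, zbar_i))
    within = sum((x - m_i) ** 2 for zi, m_i in zip(z, zbar_i) for x in zi)
    return ((N - k) / (k - 1)) * between / within

def median_split(xs, ys):
    """Dichotomize the independent variable at its median, as described above,
    and return the dependent-variable scores for the low and high groups."""
    cut = sorted(xs)[len(xs) // 2]
    low = [y for x, y in zip(xs, ys) if x < cut]
    high = [y for x, y in zip(xs, ys) if x >= cut]
    return low, high
```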

32 The hypothesis test of r2
The purpose of the hypothesis test of r2 is to test the applicability of our findings to the population represented by the sample. When we studied association between two interval variables, we stated that the Pearson r correlation coefficient and its square, the coefficient of determination, measure the strength of the relationship between two interval variables. When the correlation coefficient and coefficient of determination are zero (0), there is no relationship. The hypothesis test of r2 is a test of whether or not r2 is larger than zero in the population.

33 The hypothesis test of r2
The research hypothesis states that r2 is larger than zero (a relationship exists). The null hypothesis states that r2 is equal to zero (no relationship). Recall that we interpreted the coefficient of determination r2 as the reduction in error attributable to the relationship between the variables. The test statistic is an ANOVA F-test which tests whether or not the reduction in error associated with using the regression equation is really greater than zero.

34 How the regression ANOVA test works
We will use the sample data we used for correlation and regression to examine how the hypothesis test for r2 works. We are interested in the relationship between family size and number of credit cards.

35 The scatter diagram or scatterplot
The dependent variable is plotted on the y, or vertical, axis. The independent variable is plotted on the x, or horizontal, axis.

36 The mean as the best guess
Without taking into account the independent variable, our best guess for the number of credit cards for any subject is the mean, 7.0.

37 Errors using the mean as estimate
Errors are measured by computing the difference between the mean and each Y value, squaring the differences, and then summing them. When we compute the answer in SPSS, it will tell us that the total amount of error is 22.0.
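This computation can be checked directly. The raw data are not shown in this transcript, so the eight credit-card counts below are assumed values, chosen only because they reproduce the slide's mean of 7.0 and total error of 22.0:

```python
# Assumed data: credit-card counts for eight families (not given in the
# slides; chosen to be consistent with the stated mean and total error)
cards = [4, 6, 6, 7, 8, 7, 8, 10]

mean_cards = sum(cards) / len(cards)
total_error = sum((y - mean_cards) ** 2 for y in cards)
print(mean_cards, total_error)   # 7.0 22.0
```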

38 The regression line
The regression line minimizes the error (the best fitting or least squares line).

39 The equation for the regression line
SPSS will give us the formula for the regression line in the form Y = a + bX, or for these variables: Number of Credit Cards = x Family Size

40 PRE reduction in error
SPSS also tells us the amount of error using only the mean and using the regression line.
Error using mean only (total): 22.000
Error using regression line: 5.486
Reduction in error associated with the regression: 16.514
PRE measure (r2) = 16.514 / 22.000 = .751
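The PRE arithmetic from the slide's figures can be checked directly:

```python
error_mean_only = 22.000     # total error using only the mean
error_regression = 5.486     # error remaining after the regression
reduction = error_mean_only - error_regression
r_squared = reduction / error_mean_only

print(round(reduction, 3), round(r_squared, 3))   # 16.514 0.751
```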

41 The ANOVA test for the regression
The F statistic is calculated as the ratio of the error reduced by the regression to the error remaining, each divided by its degrees of freedom. If these two quantities were the same, we would not have reduced any error, there would be no relationship, and the p-value would not let us reject the null hypothesis. In this problem, the amount of error reduced by the regression is large relative to the amount remaining, so the F statistic is large, and the p-value (0.005) is smaller than the alpha level of significance, so we reject the null hypothesis.
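A sketch of the F computation using the slide's error figures. The slides do not state the sample size, so the residual degrees of freedom below assume n = 8 cases (consistent with the example's totals):

```python
ss_regression = 16.514   # error reduced by the regression
ss_residual = 5.486      # error remaining after the regression
df_regression = 1        # one independent variable
df_residual = 8 - 2      # assumed n = 8 cases, minus 2 estimated coefficients

# F = mean square for the regression / mean square for the residual
F = (ss_regression / df_regression) / (ss_residual / df_residual)
print(round(F, 2))   # about 18.06
```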

42 Interpreting Pearson’s r correlation coefficient
The square root of r2 is Pearson’s r, the correlation coefficient. If we want to characterize the strength of the relationship, we compare the size of r to the interpretive guidelines for measures of association.
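A small check of this relationship. Since the square root alone is always positive, the sign of r must be taken from the sign of the slope coefficient:

```python
import math

r_squared = 0.751
slope_sign = 1                        # positive b coefficient in this example
r = slope_sign * math.sqrt(r_squared)
print(round(r, 3))   # 0.867
```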

43 Interpreting the direction of the relationship
To interpret the direction of the relationship between the variables, we look at the coefficient for the independent variable. In this example, the coefficient is positive, so we would interpret this relationship as: Families with more members had more credit cards.

44 Testing Assumptions in Homework Problems
The process of testing assumptions can easily overwhelm the task of testing the significance of the relationship. Since our emphasis here is testing the hypothesis that the relationship is generalizable to the population represented by the sample data, we will assume that our data satisfies the assumptions without explicitly testing assumptions.

45 Homework Problem Questions
The question in the homework problems requires us to look at three things: Does the hypothesis test support the existence of a relationship in the population? Is the strength of the relationship characterized correctly? Is the direction of the relationship between the variables correctly stated?

46 Practice Problem – 1 This question asks you to use linear regression to examine the relationship between [marital] and [age]. Linear regression requires that the dependent variable and the independent variables be interval. Ordinal variables may be included as interval variables if a caution is added to any true findings. The dependent variable [marital] is nominal level which does not satisfy the requirement for a dependent variable. The independent variable [age] is interval level, satisfying the requirement for an independent variable.

47 Practice Problem - 2 This question asks you to use linear regression to examine the relationship between [fund] and [attend]. The level of measurement requirements for linear regression are satisfied: [fund] is ordinal level, and [attend] is ordinal level. A caution is added because ordinal level variables are included in the analysis. Given the assumption that the distributional requirements for linear regression are satisfied, you can conduct a linear regression using SPSS without examining distributional assumptions for the variables.

48 Linear Regression Hypothesis Test in SPSS (1)
You can conduct a linear regression using: Analyze > Regression > Linear…

49 Linear Regression Hypothesis Test in SPSS (2)
Move the dependent variable to the “Dependent:” box and the independent variable to the “Independent(s):” box, and then click the “OK” button.

50 Linear Regression Hypothesis Test in SPSS (3)
Based on the ANOVA table for the linear regression (F(1, 604) = , p<0.001), there was a relationship between the dependent variable "degree of religious fundamentalism" and the independent variable "frequency of attendance at religious services". Since the probability of the F statistic (p<0.001) was less than or equal to the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was rejected. The research hypothesis that there was a relationship between the variables was supported.

51 Linear Regression Hypothesis Test in SPSS (4)
Given the significant F-test result, the correlation coefficient (R) can be interpreted. The correlation coefficient for the relationship between the independent variable and the dependent variable was 0.323, which would be characterized as a weak relationship using the rule of thumb that a correlation between 0.0 and 0.20 is very weak; 0.20 to 0.40 is weak; 0.40 to 0.60 is moderate; 0.60 to 0.80 is strong; and greater than 0.80 is very strong. The relationship between the independent variable and the dependent variable was incorrectly characterized as a moderate relationship. It should have been characterized as a weak relationship. The answer to the problem is false.
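The rule of thumb quoted above can be written as a small helper. The handling of values falling exactly on a boundary (0.20, 0.40, etc.) is an assumption, since the slide does not specify it:

```python
def strength_label(r):
    """Characterize a correlation coefficient using the rule of thumb above."""
    size = abs(r)                 # direction does not affect strength
    if size < 0.20:
        return "very weak"
    elif size < 0.40:
        return "weak"
    elif size < 0.60:
        return "moderate"
    elif size < 0.80:
        return "strong"
    return "very strong"

print(strength_label(0.323))   # weak
```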

52 Practice Problem – 3 This question asks you to use linear regression to examine the relationship between [educ] and [age]. [educ] and [age] are interval level, satisfying the level of measurement requirements for regression. Given the assumption that the distributional requirements for linear regression are satisfied, you can conduct a linear regression using SPSS without examining distributional characteristics of variables.

53 Linear Regression Hypothesis Test in SPSS (5)
You can conduct a linear regression using: Analyze > Regression > Linear…

54 Linear Regression Hypothesis Test in SPSS (6)
Move the dependent variable to the “Dependent:” box and the independent variable to the “Independent(s):” box, and then click the “OK” button.

55 Linear Regression Hypothesis Test in SPSS (7)
Based on the ANOVA table for the linear regression (F(1, 659) = 9.983, p=0.002), there was a relationship between the dependent variable "highest year of school completed" and the independent variable "age". Since the probability of the F statistic (p=0.002) was less than or equal to the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was rejected. The research hypothesis that there was a relationship between the variables was supported.

56 Linear Regression Hypothesis Test in SPSS (8)
Given the significant F-test result, the correlation coefficient (R) can be interpreted. The correlation coefficient for the relationship between the independent variable and the dependent variable was 0.122, which can be characterized as a very weak relationship.

57 Linear Regression Hypothesis Test in SPSS (9)
The b coefficient for the independent variable "age" was -.021, indicating an inverse relationship with the dependent variable. Higher numeric values for the independent variable "age" [age] are associated with lower numeric values for the dependent variable "highest year of school completed" [educ]. The statement in the problem that "survey respondents who were older had completed more years of school" is incorrect. The direction of the relationship is stated incorrectly.

58 Practice Problem – 4 This question asks you to use linear regression to examine the relationship between [sei] and [age]. [sei] and [age] are interval level, satisfying the level of measurement requirements for regression. Given the assumption that the distributional requirements for linear regression are satisfied, you can conduct a linear regression using SPSS without examining distributional characteristics of variables.

59 Linear Regression Hypothesis Test in SPSS (10)
You can conduct a linear regression using: Analyze > Regression > Linear…

60 Linear Regression Hypothesis Test in SPSS (11)
Move the dependent variable to the “Dependent:” box and the independent variable to the “Independent(s):” box, and then click the “OK” button.

61 Linear Regression Hypothesis Test in SPSS (12)
Based on the ANOVA table for the linear regression (F(1, 629) = .266, p=0.606), there was no relationship between the dependent variable "socioeconomic index" and the independent variable "age". Since the probability of the F statistic (p=0.606) was greater than the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was not rejected. The research hypothesis that there was a relationship between the variables was not supported.

62 Steps in solving Linear Regression Hypothesis Test Problems - 1
The following is a guide to the decision process for answering Linear Regression Hypothesis Test homework problems:
Are the dependent and independent variables ordinal or interval level?
No: incorrect application of a statistic.
Yes: Make sure the assumption that the distributional requirements for linear regression are satisfied has been made; otherwise, you have to check the assumptions first. Our regression problems will assume that the assumptions are met.

63 Steps in solving Linear Regression Hypothesis Test Problems - 2
Conduct the linear regression analysis. Is the p-value in the ANOVA table for the F ratio test <= alpha?
No: the answer is False.
Yes: Is the interpretation of the strength of the correlation coefficient correct?
No: the answer is False.
Yes: continue to the next step.

64 Steps in solving Linear Regression Hypothesis Test Problems - 3
Is the direction of the relationship correctly stated?
No: the answer is False.
Yes: Is either of the variables ordinal level?
No: the answer is True.
Yes: the answer is True with caution.

