11/4/2015Slide 1 SOLVING THE PROBLEM Simple linear regression is an appropriate model of the relationship between two quantitative variables provided the.


1 11/4/2015Slide 1 SOLVING THE PROBLEM Simple linear regression is an appropriate model of the relationship between two quantitative variables provided the data satisfy the assumption of linearity in a scatterplot of the raw data, the spread of the residuals is equal for all of the predicted values in the residual plot, and there are no outliers impacting the linear model. When the relationship we are analyzing does not meet these criteria, the use of regression analysis can still be justified if re-expressing one or both variables reduces the non-linear pattern in the scatterplot, equalizes the variance in the residual plot, and reduces the distance of outliers from the other cases in the distributions. Clues that re-expression might be effective in linearizing the relationship are severe skewing of one or both variables (skewness outside the range from -1.0 to +1.0) and a Spearman's rho that is greater than Pearson's r. There is no guarantee that re-expression will produce a scatterplot that satisfies the assumptions of linear regression. When it does not, we are left with the choice of determining that the violations are not of serious consequence or choosing an alternative strategy for modeling the relationship.

2 11/4/2015Slide 2 To solve these problems, we will first assess the conformity of the relationship to regression assumptions. Second, we will examine the criteria that suggest that re-expression might be effective. Third, we will examine the model using re-expressed variables to assess conformity to regression assumptions. Finally, if the model using raw data supports the regression assumptions, we will interpret the direction and strength of the relationship. If it was necessary to re-express one or both variables, we will interpret the strength of the re-expressed model, provided it satisfies the regression assumptions. If the model still violates the conditions for a linear model, we will not interpret the direction and strength of the relationship. This is the convention in our homework problems; in a real application, I would consider interpreting the relationship, attaching a caution which identifies the violation of the assumptions. In these problems, outliers are defined as cases that have a Cook’s distance greater than 0.5, and hence have a larger influence on the regression solution than other cases included in the analysis.

3 11/4/2015Slide 3 The introductory statement in the question indicates: the data set to use (world2007.sav); the task to accomplish (simple linear regression); and the variables to use in the analysis, namely the independent variable population median age in years [agemdn] and the dependent variable infant mortality rate [infmort].

4 11/4/2015Slide 4 The second paragraph tells us how to re-express the variables, should it be necessary.

5 11/4/2015Slide 5 The first statement asks about the size of the sample. To answer this question, we run the linear regression in SPSS.

6 11/4/2015Slide 6 To compute a simple linear regression, select Regression > Linear from the Analyze menu.
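
For readers who want to see what the Linear Regression procedure computes, the least-squares intercept and slope can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not SPSS's implementation; the function name fit_line and the six data values are made up for the example and do not come from world2007.sav.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squares of x and sum of cross-products around the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical values for illustration only (not taken from world2007.sav).
median_age = [15.0, 18.0, 22.0, 27.0, 33.0, 40.0]
infant_mortality = [95.0, 70.0, 45.0, 25.0, 12.0, 5.0]

intercept, slope = fit_line(median_age, infant_mortality)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The slope is negative for these made-up values, which is the same direction of relationship the homework problem goes on to find.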

7 11/4/2015Slide 7 First, move the dependent variable, infmort, to the Dependent text box. Second, move the independent variable, agemdn, to the Independent(s) list box. Third, click on the Statistics button to request basic descriptive statistics.

8 11/4/2015Slide 8 First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis. Second, click on the Continue button to close the dialog box.

9 11/4/2015Slide 9 Next, click on the Plots button to request the residual plot. Though we do not need it to answer the immediate question, producing the residual plot now will save us time later in the problem.

10 11/4/2015Slide 10 First, move *ZRESID (for standardized residuals) to the Y axis text box. Second, move *ZPRED (for standardized predictions) to the X axis text box. Third, click on the Continue button to close the dialog box.
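
The two quantities placed on the axes of the residual plot can be sketched as follows. The definitions used here are assumptions about what *ZPRED and *ZRESID represent: predicted values rescaled to mean 0 and standard deviation 1, and residuals divided by the standard error of the estimate. The function name standardized_plot_values and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def standardized_plot_values(x, y):
    """Return (zpred, zresid) for the least-squares line of y on x.

    Assumed definitions (not confirmed by the slides):
      zpred  = (predicted - mean of predicted) / sd of predicted
      zresid = residual / standard error of the estimate
    The data must not fall exactly on a line, or std_error is zero.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pred for yi, pred in zip(y, predicted)]
    # Standard error of the estimate: sqrt(SSE / (n - 2)).
    std_error = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    pred_mean, pred_sd = mean(predicted), stdev(predicted)
    zpred = [(pred - pred_mean) / pred_sd for pred in predicted]
    zresid = [e / std_error for e in residuals]
    return zpred, zresid

# Hypothetical values for illustration only.
zpred, zresid = standardized_plot_values(
    [15.0, 18.0, 22.0, 27.0, 33.0, 40.0],
    [95.0, 70.0, 45.0, 25.0, 12.0, 5.0],
)
for p, r in zip(zpred, zresid):
    print(f"zpred = {p:6.2f}  zresid = {r:6.2f}")
```

Plotting zresid (vertical) against zpred (horizontal) gives the kind of residual plot the slides examine for curvature and unequal spread.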

11 11/4/2015Slide 11 Next, click on the Save button to include Cook’s distance in the output.

12 11/4/2015Slide 12 First, mark the check box for Cook’s distance to include this value in the data view and the output. Second, click on the Continue button to close the dialog box.

13 11/4/2015Slide 13 Click on the OK button to request the output.

14 11/4/2015Slide 14 In the table of Descriptive Statistics, we see that the number of cases that have valid data for both variables is 190.

15 11/4/2015Slide 15 The number of cases with valid data to analyze the relationship between "population median age in years" and "infant mortality rate" was 190, out of the total of 192 cases in the data set. Mark the check box for a correct statement.
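
The count of 190 out of 192 reflects dropping any case that is missing a value on either variable. A minimal sketch of that idea, assuming missing values are represented by None; the function name valid_pairs and the data are hypothetical:

```python
def valid_pairs(x, y):
    """Keep only the cases that have a value for both variables."""
    return [(xi, yi) for xi, yi in zip(x, y)
            if xi is not None and yi is not None]

# Hypothetical values: two of the five cases are missing one variable.
median_age = [15.0, None, 22.0, 27.0, 33.0]
infant_mortality = [95.0, 70.0, None, 25.0, 12.0]

usable = valid_pairs(median_age, infant_mortality)
print(f"{len(usable)} of {len(median_age)} cases have valid data")
```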

16 11/4/2015Slide 16 The next statement asks us to examine a scatterplot to evaluate the assumption of linearity.

17 11/4/2015Slide 17 To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.

18 11/4/2015Slide 18 In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create. Click on the Define button to go to the next step.

19 11/4/2015Slide 19 First, move the dependent variable infmort to the Y axis text box. Second, move the independent variable agemdn to the X axis text box. Third, click on the OK button to produce the plot.

20 11/4/2015Slide 20 The scatterplot appears in the SPSS output window. To facilitate our determination about the linearity of the plot, we will add a linear fit line, a loess fit line, and a confidence interval to the plot. See slides 8 through 18 in the powerpoint titled: SimpleLinearRegression-Part2.ppt for directions on adding the fit lines and confidence interval to the plot.

21 11/4/2015Slide 21 The criterion we use for evaluating linearity is a comparison of the loess fit line to the linear fit line. If the loess fit line falls within a 99% confidence interval around the linear fit line, we characterize the relationship as linear. Minor fluctuations over the lines of the confidence interval are ignored. The pattern of points in the scatterplot shows an obvious curve indicating non-linearity. The assumption of linearity is not satisfied.

22 11/4/2015Slide 22 The pattern of points in the scatterplot showed an obvious curve indicating non-linearity. The assumption of linearity is not satisfied. The check box is not marked. We can try re-expressing one or both variables to see if we can improve the linearity of the relationship sufficiently to justify the use of linear regression to analyze the relationship.

23 11/4/2015Slide 23 The next statement asks us to examine the residual plot for evidence that the assumptions of linearity or homogeneity of variance are violated. We will not examine the residual plot when we have a clear violation of the assumption of linearity in the scatterplot. We would leave this statement unchecked when we violate linearity because it is difficult to evaluate homogeneity of variance when the plot is non-linear. However, we will examine the residual plot just for practice. If we violate linearity, homogeneity of variance, or have influential cases, we jump ahead to the question comparing r and rho.

24 11/4/2015Slide 24 If we add a loess fit line to the residual plot, we see that the non-linearity in the scatterplot is supported by the residual plot. Just for practice, we examine the residual plot for linearity and homogeneity of variance.

25 11/4/2015Slide 25 The violation of homogeneity of variance is also clearly visible in the residual plot. The spread of the residuals is narrow for low predicted values, but substantially larger for high predicted values.

26 11/4/2015Slide 26 The next statement asks us to examine the impact of influential cases on the analysis. Like outliers, we might consider elimination of influential cases to improve the fit for the majority of cases. Since re-expression will alter the distance used to identify influential cases, we will not evaluate Cook's distances until we satisfy the assumptions of linearity and homogeneity. While we would leave this statement unchecked when we violate linearity or homogeneity of variance, we will check for the presence of influential cases just for practice.

27 11/4/2015Slide 27 Summary information about Cook’s Distances is found in the table of Residual Statistics. The maximum Cook’s distance is .150, less than the cutoff of .50 specified for the problems. We do not have cases with very high values of Cook’s distance for the variables in this relationship.

28 11/4/2015Slide 28 The next statement asks us to compare Spearman’s rho to Pearson’s r to assess our expectations for the effectiveness of re-expressing the variables. To compare rho and r, we compute the correlation coefficients in SPSS.

29 11/4/2015Slide 29 To compute correlations, select Correlate > Bivariate from the Analyze menu.

30 11/4/2015Slide 30 First, move the variables agemdn and infmort to the Variables list box. Second, mark the check box for Spearman and leave the check box for Pearson marked. Third, click on the OK button to produce the output.

31 11/4/2015Slide 31 Spearman's rho (-0.886) was larger in absolute value than Pearson's r (-0.732). The feasibility of re-expressing the data to correct for the violation of regression assumptions is supported.

32 11/4/2015Slide 32 Spearman's rho (-0.886) was larger in absolute value than Pearson's r (-0.732). The feasibility of re-expressing the data to correct for the violation of regression assumptions is supported. The check box for a correct answer is marked.
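
The comparison of rho and r can be sketched in plain Python. Spearman's rho is computed here as Pearson's r on the ranks of the data (tied values share the average rank), and the comparison ignores the sign of both coefficients. All function names and the data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def reexpression_looks_promising(x, y):
    """True when |rho| exceeds |r|, the clue used in these problems."""
    return abs(spearman_rho(x, y)) > abs(pearson_r(x, y))

# Hypothetical curved, steadily decreasing relationship.
x = [1, 2, 3, 4, 5]
y = [100, 30, 10, 4, 1]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
print(reexpression_looks_promising(x, y))
```

With the values reported in the slides, the same comparison is abs(-0.886) > abs(-0.732), which is true.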

33 11/4/2015Slide 33 The next statement asks us which transformations we used to try to induce linearity in the scatterplot. We should re-express variables that have skewness less than -1.0 or greater than +1.0.

34 11/4/2015Slide 34 We will use the Descriptives procedure to obtain skewness for both variables. Select Descriptive Statistics > Descriptives from the Analyze menu.

35 11/4/2015Slide 35 First, move the variables agemdn and infmort to the Variable(s) list box. Second, click on the Options button to specify our choice for statistics.

36 11/4/2015Slide 36 Next, mark the check boxes for Kurtosis and Skewness in addition to the defaults marked by SPSS. Finally, click on the Continue button to close the dialog box.

37 11/4/2015Slide 37 Click on the OK button to produce the output.

38 11/4/2015Slide 38 The skewness for "infant mortality rate" [infmort] was 1.470. The skewness for "population median age in years" [agemdn] was 0.456. Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct the violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale. Since the skew for the independent variable "population median age in years" [agemdn] (0.456) was greater than -1.0 and less than +1.0, we do not attempt to correct the violation of assumptions by re-expressing it.

39 11/4/2015Slide 39 Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale. We mark the check box for a correct statement.
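
The skewness rule can be sketched as follows. The formula is the bias-adjusted sample skewness, n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations; it is an assumption that this matches the value SPSS reports. The function names and the small data lists are hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Bias-adjusted sample skewness (needs at least 3 non-identical values)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)

def needs_reexpression(skew):
    """Rule used in these problems: skew <= -1.0 or skew >= +1.0."""
    return skew <= -1.0 or skew >= 1.0

# Hypothetical values: one symmetric list and one with a long right tail.
print(round(sample_skewness([1, 2, 3, 4, 5]), 3))
print(round(sample_skewness([1, 1, 1, 2, 10]), 3))

# Applying the rule to the skewness values reported in the slides.
print(needs_reexpression(1.470))  # infmort
print(needs_reexpression(0.456))  # agemdn
```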

40 11/4/2015Slide 40 The next statement asks whether or not the relationship with the re-expressed variable, LG_infmort, supports the assumption of linearity.

41 11/4/2015Slide 41 We first create the transformed variable, the logarithm of infmort. Select the Compute Variable command from the Transform menu.

42 11/4/2015Slide 42 First, type the name for the re-expressed variable in the Target Variable text box. The directions for the problem give us the formula for the transformation: Use the formula LG10(infmort) to create the log transformation of infant mortality rate [LG_infmort]. Second, type the formula in the Numeric Expression text box. Third, click on the OK button to compute the transformation.
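
The effect of the LG10(infmort) formula can be sketched in Python with math.log10. Returning None for missing or non-positive values is an assumption made for this sketch, since a logarithm is undefined there; the function name lg10 and the data are hypothetical.

```python
from math import log10

def lg10(values):
    """Base-10 log of each value; None when the value is missing or <= 0."""
    return [log10(v) if v is not None and v > 0 else None for v in values]

# Hypothetical infant mortality rates, including one missing value.
infmort = [2.5, 10.0, 45.0, None, 120.0]
lg_infmort = lg10(infmort)
print(lg_infmort)
```

The transformation compresses the large values much more than the small ones, which is why it can pull in a long right tail.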

43 11/4/2015Slide 43 Next, we create the scatterplot for the relationship with the re-expressed variable. To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.

44 11/4/2015Slide 44 In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create. Click on the Define button to go to the next step.

45 11/4/2015Slide 45 First, move the dependent variable LG_infmort to the Y axis text box. Second, move the independent variable agemdn to the X axis text box. Third, click on the OK button to produce the plot.

46 11/4/2015Slide 46 The scatterplot looks linear, but to make sure we will add fit lines and a confidence interval.

47 11/4/2015Slide 47 The pattern of points in the scatterplot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied. See slides 8 through 18 in the powerpoint titled: SimpleLinearRegression-Part2.ppt for directions on adding the fit lines and confidence interval to the plot.

48 11/4/2015Slide 48 The pattern of points in the scatterplot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied. The check box for a correct answer is marked.

49 11/4/2015Slide 49 The next statement asks whether or not the residual plot supports the assumptions of linearity and equal variance.

50 11/4/2015Slide 50 We next do the regression analysis, creating the residual plot in the process. To compute a simple linear regression, select Regression > Linear from the Analyze menu.

51 11/4/2015Slide 51 First, move the dependent variable, LG_infmort, to the Dependent text box. Second, move the independent variable, agemdn, to the Independent(s) list box. Third, click on the Statistics button to request basic descriptive statistics.

52 11/4/2015Slide 52 First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis. Second, click on the Continue button to close the dialog box.

53 11/4/2015Slide 53 Next, click on the Plots button to request the residual plot.

54 11/4/2015Slide 54 First, move *ZRESID (for standardized residuals) to the Y axis text box. Second, move *ZPRED (for standardized predictions) to the X axis text box. Third, click on the Continue button to close the dialog box.

55 11/4/2015Slide 55 Next, click on the Save button to include Cook’s distance in the output.

56 11/4/2015Slide 56 First, mark the check box for Cook’s distance to include this value in the data view and the output. Second, click on the Continue button to close the dialog box.

57 11/4/2015Slide 57 Click on the OK button to request the output.

58 11/4/2015Slide 58 The pattern of points in the residual plot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied, confirming the finding of linearity in the scatterplot for the data. The pattern of points in the residual plot shows equal spread across the standardized predictions of "infant mortality rate" [infmort]. The assumption of equal variance is satisfied.

59 11/4/2015Slide 59 The pattern of points in the residual plot does not show an obvious curve indicating non-linearity and shows equal spread across the standardized predictions. The check box for a correct answer is marked.

60 11/4/2015Slide 60 The next statement asks about the presence of influential cases after the variable is re-expressed.

61 11/4/2015Slide 61 Summary information about Cook’s Distances is found in the table of Residual Statistics. No cases had a Cook's distance of 0.5 or greater, so none qualify as influential cases.

62 11/4/2015Slide 62 Since there were no Cook’s distances greater than 0.5, we mark the check box as correct.

63 11/4/2015Slide 63 The next statement asks about the direction of the relationship between the variables.

64 11/4/2015Slide 64 The slope for the regression equation between "population median age in years" [agemdn] and the log transformation of "infant mortality rate" [LG_infmort] was -.05. The negative value for the slope means that scores for the two variables change in the opposite direction. Higher scores on the variable "population median age in years" were associated with lower scores on the log transformation of "infant mortality rate".
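
Because the dependent variable is on a log10 scale, the slope can also be read as a percentage change in the original units: each one-unit increase in the independent variable multiplies the predicted value by 10 raised to the slope. The sketch below works through that arithmetic for the reported slope of -.05, which is a rounded figure, so the result is approximate. The function name is hypothetical.

```python
def percent_change_per_unit(log10_slope):
    """Percent change in the original-scale prediction for a one-unit
    increase in x, given a slope estimated on a log10-transformed y."""
    return (10 ** log10_slope - 1) * 100

change = percent_change_per_unit(-0.05)
print(f"{change:.1f}% per additional year of median age")
```

Under these assumptions, each additional year of median age is associated with a predicted infant mortality rate roughly 11% lower.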

65 11/4/2015Slide 65 The negative slope implies an inverse relationship in which increases in one variable are associated with lower scores on the other variable. The statement that "countries who had a higher median age in years had a lower infant mortality rate" is correct. We mark the statement as correct.

66 11/4/2015Slide 66 The next statement asks about the strength of the relationship based on Tukey’s criteria.

67 11/4/2015Slide 67 To answer the question about the strength of the relationship, we look to the Model Summary table. Using the rule of thumb attributed to Tukey (an R² between 0.0 and 0.04 is very weak; 0.04 to 0.16 is weak; 0.16 to 0.36 is moderate; 0.36 to 0.64 is strong; and greater than 0.64 is very strong), the relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was correctly characterized as a very strong relationship (R² = 78.2%).

68 11/4/2015Slide 68 The relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was correctly characterized as a very strong relationship (R² = 78.2%). The check box for a correct answer is marked.
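
The Tukey rule of thumb can be written as a small lookup. The slides do not say which label applies when R² falls exactly on a boundary, except that "very strong" requires a value greater than 0.64; assigning the other boundary values to the higher category is an assumption of this sketch, as is the function name.

```python
def tukey_strength(r_squared):
    """Label the strength of a relationship from R² (as a proportion)."""
    if r_squared > 0.64:
        return "very strong"
    if r_squared >= 0.36:  # boundary handling here is an assumption
        return "strong"
    if r_squared >= 0.16:
        return "moderate"
    if r_squared >= 0.04:
        return "weak"
    return "very weak"

# R² reported in the slides: 78.2%.
print(tukey_strength(0.782))
```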

69 11/4/2015Slide 69 The final statement asks about the strength of the relationship based on Cohen’s criteria.

70 11/4/2015Slide 70 Applying Cohen's criteria for effect size (less than 0.01 = trivial; 0.01 up to 0.09 = weak or small; 0.09 up to 0.25 = moderate; 0.25 or greater = strong or large), the relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was incorrectly characterized as a moderate relationship. The relationship should have been characterized as a strong relationship (R² = 78.2%).

71 11/4/2015Slide 71 The relationship between the log transformation of "infant mortality rate" [LG_infmort] and "population median age in years" [agemdn] was incorrectly characterized as a moderate relationship. The check box for a correct statement is not marked.
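
Cohen's criteria, as listed in the problem, can be sketched the same way. Reading "up to" as excluding the upper bound of each range is an assumption of this sketch, and the function name is hypothetical.

```python
def cohen_strength(r_squared):
    """Label the effect size from R² using the cutoffs given in the problem."""
    if r_squared < 0.01:
        return "trivial"
    if r_squared < 0.09:
        return "weak"
    if r_squared < 0.25:
        return "moderate"
    return "strong"

# R² reported in the slides: 78.2%.
label = cohen_strength(0.782)
print(label)
print("statement claiming 'moderate' is correct:", label == "moderate")
```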

72 11/4/2015Slide 72 Because of the large number of steps needed to solve this problem, we will outline the process. Dependent variable and independent variable both quantitative? No: Stop. Remaining statements are not marked. Yes: continue. There is no explicit question about level of measurement, but it should always be something we consider. Sample size stated correctly? Yes: Mark statement check box. No: Do not mark check box.

73 11/4/2015Slide 73 Scatterplot supports linearity assumption? Yes: Mark statement check box. No: Do not mark check box. If we don’t satisfy linearity, we go to the comparison of r and rho. Residual plot supports linearity/homogeneity? Yes: Mark statement check box. No: Do not mark check box. If we don’t satisfy homogeneity, we go to the comparison of r and rho.

74 11/4/2015Slide 74 Cook’s distance < 0.5 for all cases? Yes: Mark statement check box. No: Do not mark check box. When we satisfy the linear conditions, we bypass the questions on re-expression. Spearman’s rho > Pearson’s r? Yes: Mark statement check box. No: Do not mark check box. We use the comparison of r and rho to frame our expectations of re-expressing variables. Note: we compare absolute values, ignoring the sign of both r and rho.

75 11/4/2015Slide 75 Re-express variables with skew ≤ -1.0 or ≥ 1.0. We re-express one or both variables, and re-test for a linear relationship. Scatterplot supports linearity assumption? Yes: Mark statement check box. No: Do not mark check box. Residual plot supports linearity/homogeneity? Yes: Mark statement check box. No: Do not mark check box.

76 11/4/2015Slide 76 Cook’s distance < 0.5 for all cases? Yes: Mark statement check box. No: Do not mark check box. If the conditions for a linear model are still not met: Stop. Linear model is not appropriate. If we support a linear model with either the raw data or the re-expressed data, we interpret direction and strength. Direction of relationship (b) interpreted correctly? Yes: Mark statement check box. No: Do not mark check box. This is the strategy for our homework problems. In reality, we might choose to interpret the relationship even though assumptions were violated.

77 11/4/2015Slide 77 Strength of relationship (R²) interpreted correctly? Yes: Mark statement check box. No: Do not mark check box.
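
The overall decision in the outline, whether to interpret the raw model, the re-expressed model, or neither, can be sketched as a pair of small functions. This is a simplification of the homework convention rather than a full encoding of every check box. The function names are hypothetical, and the maximum Cook's distance of 0.10 used for the re-expressed model is a placeholder, since the slides only report that it was below 0.5.

```python
def meets_assumptions(linear, equal_spread, max_cooks_distance, cutoff=0.5):
    """True when linearity and equal variance hold and no case reaches
    the Cook's distance cutoff."""
    return linear and equal_spread and max_cooks_distance < cutoff

def next_step(raw_ok, reexpressed_ok):
    """Decide what to interpret under the homework convention."""
    if raw_ok:
        return "interpret direction and strength (raw data)"
    if reexpressed_ok:
        return "interpret direction and strength (re-expressed data)"
    return "stop: linear model is not appropriate"

# Raw data in the slides: non-linear, unequal spread, max Cook's distance .150.
raw_ok = meets_assumptions(False, False, 0.150)
# Re-expressed data: linear, equal spread; 0.10 is a placeholder value.
reexpressed_ok = meets_assumptions(True, True, 0.10)
print(next_step(raw_ok, reexpressed_ok))
```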

