8/7/2015Slide 1 Simple linear regression is an appropriate model of the relationship between two quantitative variables provided that: the data satisfy the assumption of linearity in the scatterplot of the raw data and in the residual plot, the spread of the residuals is equal across all of the predicted values in the residual plot, and there are no outliers distorting the linear model. When the relationship we are analyzing does not meet these criteria, regression analysis can still be justified if re-expressing one or both variables reduces the non-linear pattern in the scatterplot, equalizes the variance in the residual plot, and reduces the distance of outliers from the other cases in the distributions.
8/7/2015Slide 2 Clues that re-expression might be effective in linearizing the relationship are: the identification of influential cases, severe skewness in one or both variables (skewness outside the range from -1.0 to +1.0), and a Spearman's rho greater than Pearson's r. There is no guarantee that re-expression will produce a scatterplot that satisfies the assumptions of linear regression. When it does not, we are left with the choice of deciding that the violations are not of serious consequence, or choosing an alternative strategy for modeling the relationship. To solve these problems, we will first assess the conformity of the relationship to regression assumptions. Second, we will examine the criteria that suggest that re-expression might be effective. Third, we will examine the model using re-expressed variables to assess conformity to regression assumptions.
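The two numeric clues above (skewness and the comparison of Spearman's rho with Pearson's r) can be sketched in a few lines of Python. This is only an illustration, not the script used in these slides: the function names and the data are made up, skewness is the simple moment-based version (statistical packages often report a slightly adjusted sample value), and comparing the correlations by absolute value is an assumption so that negative relationships are handled the same way.

```python
import math

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (not the adjusted sample version)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def pearson_r(x, y):
    """Pearson's product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def reexpression_clues(x, y):
    """Hypothetical helper: collects the two numeric clues in one place."""
    skew_x, skew_y = skewness(x), skewness(y)
    return {
        "skew_x": skew_x,
        "skew_y": skew_y,
        "severe_skew": abs(skew_x) > 1.0 or abs(skew_y) > 1.0,
        # Assumption: compare magnitudes, so the sign of the relationship is ignored.
        "rho_exceeds_r": abs(spearman_rho(x, y)) > abs(pearson_r(x, y)),
    }

# Made-up data: y is an exact logarithm of a right-skewed x.
x = [1, 2, 4, 8, 16, 32, 64, 128]
y = [math.log(v) for v in x]
clues = reexpression_clues(x, y)
print(clues)
```

With these made-up values the relationship is perfectly monotonic but curved, so both clues point toward re-expressing x.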
8/7/2015Slide 3 We will use a new strategy for identifying outliers that we may consider omitting from the analysis: Cook’s distance. Cook’s distance combines information about standardized residuals and leverage for independent variables so we can use one measure instead of three. Cook’s distance is a measure of the influence which a case has on the regression solution, i.e., how different the solution would be if this case were omitted. Larger values of Cook’s distance indicate a greater effect on the regression analysis. There are different criteria for what constitutes an outlier on Cook’s distance. Cook’s original criterion was 1.0. Fox proposed 4 / (number of cases - number of IVs - 1). We will use 0.5, which is about halfway between the other two.
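For a simple linear regression, Cook’s distance can be computed directly from each case's residual and leverage. The sketch below is an illustration under stated assumptions, not the script used in these slides: the function names and the data are made up, and it uses the standard closed form D = (e² / (p × MSE)) × h / (1 - h)², with p = 2 estimated parameters (intercept and slope).

```python
def cooks_distances(x, y):
    """Cook's distance for every case in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # parameters estimated: intercept and slope
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each case: how far its x value sits from the mean of x.
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    return [(e ** 2 / (p * mse)) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverage)]

def fox_cutoff(n_cases, n_ivs=1):
    """Fox's criterion: 4 / (number of cases - number of IVs - 1)."""
    return 4 / (n_cases - n_ivs - 1)

# Made-up data: the last case has an extreme x value and falls off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0, 9.1, 10.0]

distances = cooks_distances(x, y)
flagged = [i for i, d in enumerate(distances) if d > 0.5]  # the 0.5 criterion
print(flagged, fox_cutoff(len(x)))
```

Only the last case exceeds the 0.5 criterion here. Note that Fox's criterion depends on the number of cases, so it becomes much stricter as the sample grows.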
8/7/2015Slide 4 We will use an updated version of the script for simple linear regression to analyze relationships and test re-expressions. The script will compute the transformations for both the dependent and independent variables. The defaults are marked. My preferences are for a scatterplot with boxplots for each variable, and the residual plot. We will use a combination of fit lines to evaluate linearity. We have options for the criteria for Cook’s distance and the opportunity to exclude influential cases.
8/7/2015Slide 5 This scatterplot shows that the blue loess fit line fluctuates slightly around the regression line, and stays within the confidence interval.
8/7/2015Slide 6 The residual plot shows that the vertical spread of the residuals is approximately the same height from left to right across the predicted values. There is no evidence of a pattern or shape that would suggest non-linearity.
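The visual check of the residual plot can be supplemented with a rough numeric comparison. The sketch below is not part of the script used in these slides and is not a formal test: it is a made-up helper that fits the line, orders the cases by predicted value, and compares the standard deviation of the residuals in the right half of the plot with the left half. A ratio near 1 is consistent with equal spread.

```python
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def spread_ratio(x, y):
    """SD of residuals in the right half of the residual plot over the left half."""
    intercept, slope = fit_line(x, y)
    # Pair each predicted value with its residual, ordered left to right.
    pairs = sorted((intercept + slope * a, b - (intercept + slope * a))
                   for a, b in zip(x, y))
    half = len(pairs) // 2
    left = [residual for _, residual in pairs[:half]]
    right = [residual for _, residual in pairs[half:]]
    return statistics.pstdev(right) / statistics.pstdev(left)

# Made-up data in which the scatter widens as x increases.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.0, 2.1, 2.9, 4.1, 4.9, 7.0, 6.0, 9.5, 7.5, 12.0]
ratio = spread_ratio(x, y)
print(round(ratio, 2))
```

For these made-up values the ratio is far above 1, which is the fan shape we would see as unequal spread in the residual plot.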
8/7/2015Slide 7 Influential cases are green instead of blue. There are no cases with undue influence in this plot. This relationship satisfies the criteria for a linear relationship. There is no reason to re-express the data.
8/7/2015Slide 8 The next problem examines the relationship between poverty and per capita GDP.
8/7/2015Slide 9 The loess fit line clearly curves outside the confidence interval. The boxplot for GDP suggests that we should re-express GDP on a logarithmic scale. The large positive skew value supports the use of logarithms.
8/7/2015Slide 10 The limited spread on the left side of the plot suggests a problem with homogeneity of variance as well as linearity. The pattern of the points is U-shaped, supporting the non-linearity.
8/7/2015Slide 11 The limited spread on the left side of the plot suggests a problem with homogeneity of variance as well as linearity.
8/7/2015Slide 12 To re-express GDP as logarithms, mark the option button for scale.
8/7/2015Slide 13 The log transformation improved the linearity of the scatterplot. The loess fit line moves slightly outside the confidence interval, but it is more a fluctuation than a well-defined curve.
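The effect of a log re-expression on the linear fit can be illustrated by comparing R² before and after the transformation. This is a minimal sketch with made-up numbers, not the data or the script used in these slides. Base-10 logarithms are an assumption, though the base does not affect R², and logarithms require strictly positive values.

```python
import math

def r_squared(x, y):
    """R-squared of the simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Made-up values: a right-skewed predictor and a roughly log-linear response.
gdp = [500, 1000, 2000, 4000, 8000, 16000, 32000]
poverty = [62, 55, 47, 41, 33, 27, 19]

# Re-express the predictor on a logarithmic scale (all values must be > 0).
log_gdp = [math.log10(v) for v in gdp]

raw_fit = r_squared(gdp, poverty)
log_fit = r_squared(log_gdp, poverty)
print(round(raw_fit, 3), round(log_fit, 3))
```

A higher R² after re-expression is only part of the picture; the scatterplot and residual plot for the re-expressed variables still need to be examined.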
8/7/2015Slide 14 The residual plot shows that the vertical spread is somewhat reduced at the left side of the plot. It is not so pronounced as to be treated as a non-linear relationship. I would interpret the relationship between poverty and the log of GDP as linear.
8/7/2015Slide 15 The skewness of poverty (0.563) was not as pronounced as the skewness of GDP, but we can still re-express the data to see its impact on the relationship.
8/7/2015Slide 16 Including the log of poverty increased the non-linearity shown at the middle of the loess line, though R² increased from to
8/7/2015Slide 17 I think I see evidence of a curve emerging in the residual plot rather than up and down fluctuations. I think a case could be made for including the log of poverty based on the higher R², as well as a case for using the raw data for poverty since it is more linear.
8/7/2015Slide 18
8/7/2015Slide 19 The loess fit line clearly curves outside the confidence interval.
8/7/2015Slide 20 Both non-linearity and unequal variance are evident in the residual plot.
8/7/2015Slide 21 We will first re-express deathrat as logarithms.
8/7/2015Slide 22 The curve looks more evident after the transformation.
8/7/2015Slide 23 The curve looks more evident after the transformation.
8/7/2015Slide 24 We will re-express poverty as logarithms as well, but I am not optimistic that it will help.
8/7/2015Slide 25 The second transformation did not help either. We can try the transformation of poverty with the raw data for deathrat.
8/7/2015Slide 26 Nor does it help to use the logarithm of poverty with the raw data for deathrat. We should be very cautious about reporting this relationship as linear.
8/7/2015Slide 27
8/7/2015Slide 28 In addition to being non-linear, this relationship shows one influential case.
8/7/2015Slide 29 In addition to being non-linear, this relationship shows one influential case.
8/7/2015Slide 30 Since the skewness for both variables is greater than 1.0, we will try a log transformation for both.
8/7/2015Slide 31 The loess line has a very shallow curve to it, though without the loess line, I would judge this to be linear. The influential case is not as distant from the other cases in the scatterplot and is no longer colored green as an influential case.
8/7/2015Slide 32 The lower left-hand corner looks suspicious for equality of variance, but this may be the result of lower bounds for the variables, i.e., values stop at zero and cannot be negative.