12/14/2015Slide 1 The dependent variable, poverty, is plotted on the vertical axis. The independent variable, enrolPop, is plotted on the horizontal axis.

12/14/2015Slide 1 The relationship between two quantitative variables is pictured with a scatterplot. This scatterplot depicts the relationship between Percent of the population living below the national poverty line [poverty] and Percent of population enrolled in primary, secondary, and tertiary schools [enrolPop]. The dependent variable, poverty, is plotted on the vertical axis. The independent variable, enrolPop, is plotted on the horizontal axis. Each dot represents the combination of scores on both variables for one or more cases. If two or more cases have the same scores, they will be shown by the same dot. The SPSS syntax file CorrelationAndRegression.sps was used to produce the following output.

12/14/2015Slide 2 To facilitate our understanding of the distribution of variables in the scatterplot, we add a histogram for each variable to the display of charts. The histogram at the top of the display shows the distribution of the dependent variable. It is color coded in green to link the histogram and the scatterplot. The histogram at the bottom of the display shows the distribution of the independent variable. It is color coded in red to link the histogram and the scatterplot. The histograms support the evaluation of skewness and the presence of outliers. Each histogram has a normal curve overlay, along with skewness and kurtosis statistics, to help us evaluate the shape.

12/14/2015Slide 3 To support the location of outliers and the evaluation of normality, lines have been added at the location of the mean and standard deviation units to both charts. For the dependent variable, the mean is green and the standard deviation units are tan. In the distribution of the dependent variable, we see that all of the cases fall within three tan standard deviations of the green mean, so there are no outliers more than 3 standard deviations from the mean.

12/14/2015Slide 4 For the independent variable, the mean is red and the standard deviation units are orange. The skewness and kurtosis statistics for each histogram tell us that we satisfy the criteria for a nearly normal distribution.

12/14/2015Slide 5 We add a blue trend line or linear fit line that summarizes the overall pattern of the cases in the scatterplot. The strength of the relationship is depicted by the narrowness of the band around the trend line, though this is somewhat distorted by the desire to spread the points out throughout the graph space. Strength is measured with greater precision by the r and rho statistics in the scatterplot title.

12/14/2015Slide 6 We add the purple loess smoother fit line to evaluate the linearity of the relationship. A loess smoother averages subsets of points and thus tracks more closely where the points are concentrated. Differences between the linear fit line and the loess smoother are a visual tool for determining whether the relationship is linear. I would judge the overall pattern in this plot to be linear because the differences between the linear fit line and the loess smoother are small and do not suggest a well-defined curve.

12/14/2015Slide 7 This chart shows a clear pattern of non-linearity. At the left side of the chart, increasing per capita health expenditures has a substantial impact on the infant mortality rate, but increasing per capita health expenditures past $1,000 does not appear to produce further reductions in the infant mortality rate.

12/14/2015Slide 8 To quantify the relationship between two quantitative variables, we use a correlation coefficient, Pearson’s r or Spearman’s rho. The correlation coefficient tells us:
1. whether there is a relationship between the variables
2. the strength of the relationship
3. the direction of the relationship
Correlation coefficients vary from -1.0 to +1.0. A correlation coefficient of 0.0 indicates that there is no linear relationship (for Pearson’s r) or no consistent rank-order relationship (for Spearman’s rho). A correlation coefficient of -1.0 or +1.0 indicates a perfect relationship, i.e. the scores on one variable can be accurately determined by the scores on the other variable.

12/14/2015Slide 9 If a correlation coefficient is negative, it implies an inverse relationship, i.e. the scores on the two variables move in opposite directions: higher scores on one variable are associated with lower scores on the other variable. If a correlation coefficient is positive, it implies a direct relationship, i.e. the scores on the two variables move in the same direction: higher scores on one variable are associated with higher scores on the other variable. When we talk about the size of a correlation, we refer to the value irrespective of the sign: a correlation of -.728 is just as large or strong as a correlation of +.728. Pearson’s r treats the data as interval. Spearman’s rho treats the data as ordinal, using the rank order of the scores for each variable rather than the values.

12/14/2015Slide 10 Suppose I had the data below showing the relationship between GPA and income. SPSS would calculate Pearson’s r for this data to be .911 and Spearman’s rho to be .900.

GPA    Income    GPA Rank    Income Rank
3.2    45000     1           2
3.3    42000     2           1
3.5    48000     3           3
3.7    50000     4           4
3.8    55000     5           5

The ranks for the values for each of the variables are shown in the two right-hand columns. Using the ranks as data, SPSS would calculate both Pearson’s r and Spearman’s rho to be .900.

12/14/2015Slide 11 Suppose the fifth subject had an income of 100,000 instead of 55,000. SPSS would calculate Pearson’s r for this data to be .733 and Spearman’s rho to be .900.

GPA    Income    GPA Rank    Income Rank
3.2    45000     1           2
3.3    42000     2           1
3.5    48000     3           3
3.7    50000     4           4
3.8    100000    5           5

The ranks for the values did not change. The fifth subject still had the highest income, so Spearman’s rho has the same value. Pearson’s r decreased from .911 to .733. Outliers, and the skewing of the distribution by outliers, have a greater effect on Pearson’s r than they do on Spearman’s rho.
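The figures on the two preceding slides can be checked outside SPSS. The following is a minimal sketch in plain Python, not the SPSS procedure itself: the function names are my own, and the ranking helper assumes there are no tied scores, which holds for this small data set. It computes Spearman's rho as Pearson's r applied to the ranks.

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank each value from 1 (smallest). Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho computed as Pearson's r on the ranks."""
    return pearson_r(ranks(x), ranks(y))

gpa = [3.2, 3.3, 3.5, 3.7, 3.8]
income = [45000, 42000, 48000, 50000, 55000]
income_outlier = [45000, 42000, 48000, 50000, 100000]  # fifth subject changed

print(round(pearson_r(gpa, income), 3))             # 0.911
print(round(spearman_rho(gpa, income), 3))          # 0.9
print(round(pearson_r(gpa, income_outlier), 3))     # 0.733
print(round(spearman_rho(gpa, income_outlier), 3))  # 0.9
```

Changing one income changes Pearson's r but leaves the ranks, and therefore rho, untouched.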

12/14/2015Slide 12 In the scatterplot, outliers (the case I changed from 55,000 to 100,000) will draw the loess line toward them and away from the linear fit line, making the pattern of points appear less linear, or more non-linear. The lines demonstrate the point, but the cyan line is really a quadratic fit rather than a loess line because I can’t do much smoothing with only 5 data points.

12/14/2015Slide 13 Outliers, and the skewing of the distribution by outliers, have a greater effect on Pearson’s r than they do on Spearman’s rho. As the outliers become more extreme, and the distribution becomes more skewed, Spearman’s rho becomes larger than Pearson’s r, and the overall trend in the data is non-linear. To accurately model the relationship, we have three choices:
1. Use a more complex non-linear model to analyze the relationship.
2. Re-express the data to reduce skewing and the impact of outliers, and analyze the relationship with a linear model.
3. Exclude outliers to reduce skewing, and analyze the relationship with a linear model.
The second alternative is preferred, though it may not always be possible.

12/14/2015Slide 14 If the following three conditions are present, re-expressing the data may reduce the skewness and increase the size of Pearson’s r enough to justify treating the relationship as linear:
1. The model appears non-linear because of the difference between the loess line and the linear fit line.
2. Spearman’s rho is larger than Pearson’s r (by .05 or more).
3. One or both of the variables violates the skewness criteria for a normal distribution.
We will employ the transformations we have used previously: if the distribution is negatively skewed, we re-express the data as squares; if the distribution is positively skewed, we re-express the data as logarithms.
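The choice of transformation described above can be sketched in Python. This is only an illustration under stated assumptions: the skewness function uses the adjusted sample formula that I understand SPSS to report, the ±1.0 threshold is the skewness criterion used elsewhere in these slides, the natural logarithm is one possible choice of base, and the data values are hypothetical. The logarithm also assumes all values are positive.

```python
import math

def skewness(values):
    """Adjusted sample skewness (assumed to match the statistic SPSS reports)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)

def re_express(values, threshold=1.0):
    """Pick a re-expression from the direction of the skew.

    Positive skew beyond the threshold -> logarithms (values must be > 0).
    Negative skew beyond the threshold -> squares.
    Otherwise the data are returned unchanged.
    """
    skew = skewness(values)
    if skew > threshold:
        return [math.log(v) for v in values], "log"
    if skew < -threshold:
        return [v ** 2 for v in values], "square"
    return list(values), "none"

# Hypothetical, positively skewed scores (not from the slides).
scores = [1, 2, 3, 4, 100]
transformed, label = re_express(scores)

print(label)                                 # log
print(skewness(transformed) < skewness(scores))  # True: the log reduced the skew
```

A reduced skew does not guarantee a more linear relationship, so the scatterplot and the two correlation coefficients still need to be checked after re-expression.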

12/14/2015Slide 15 There are two sets of guidelines used to translate the r correlation coefficient into a narrative phrase: guidelines attributed to Tukey and guidelines attributed to Cohen.
Tukey’s guidelines interpret a correlation:
from 0.0 up to ±0.20 as very weak;
from ±0.20 up to ±0.40 as weak;
from ±0.40 up to ±0.60 as moderate;
from ±0.60 up to ±0.80 as strong;
and ±0.80 or greater as very strong.
Cohen’s guidelines interpret a correlation:
less than ±0.10 as trivial;
from ±0.10 up to ±0.30 as weak or small;
from ±0.30 up to ±0.50 as moderate;
and ±0.50 or greater as strong or large.
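The two sets of guidelines translate directly into lookup functions. The sketch below is a hypothetical helper (the function names are my own) that applies the cut points listed above to the absolute value of the coefficient, so the sign is ignored. Each lower bound is treated as inclusive, following the "equal to or greater than" wording.

```python
def tukey_label(r):
    """Narrative phrase for a correlation using the guidelines attributed to Tukey."""
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"

def cohen_label(r):
    """Narrative phrase for a correlation using the guidelines attributed to Cohen."""
    size = abs(r)
    if size < 0.10:
        return "trivial"
    if size < 0.30:
        return "weak or small"
    if size < 0.50:
        return "moderate"
    return "strong or large"

print(tukey_label(-0.728))  # strong
print(cohen_label(-0.728))  # strong or large
print(cohen_label(-0.16))   # weak or small
```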

Examples Slide 16

In this chart, both variables are positively skewed, and are more peaked than expected for normally distributed variables. The scatterplot is clearly not linear, and Spearman’s rho suggests a much stronger relationship than Pearson’s r.

The log re-expression of both variables was effective in reducing the skewness of both variables, improving the linearity of the relationship, and producing a Pearson’s r that is the same strength as Spearman’s rho.

Re-expression is not always effective at improving the relationship. In this example, Percent of females enrolled in primary education [enrFPri] is negatively skewed, and the scatterplot suggests a non-linear relationship. The birth rate declines rapidly at low levels of female enrollment (35% to 60%). At high levels of female enrollment (80% to 100%), birth rates range from 10 to 45.

The re-expression reduced the skewness from -1.47 to -1.09, but it was not successful in improving the linearity of the relationship. There is little advantage to reporting the more complicated model that utilizes re-expression. The birth rate declines rapidly at low levels of female enrollment. At high levels of female enrollment, birth rates continue to be spread across a wide range.

12/14/2015Slide 21 We might legitimately choose not to use any transformation, or to ignore the non-linearity of the relationship, but we should look at the plots and statistics for the distribution of the variables in the analysis so we are making an informed choice. The consequence of ignoring the issue of linearity is usually that we fail to state the actual importance of a relationship, though there are occasions when we might be citing a relationship as important when its strength is the result of extreme outliers. I am not confident that we can always draw a correct conclusion about re-expression by visually inspecting histograms and scatterplots. While I can avoid those that move the distribution in the wrong direction, I usually test all that are potentially applicable before making a decision.

Homework Problems Slide 22

SOLVING THE HOMEWORK PROBLEMS Pearson's r correlation coefficient measures the strength of the linear relationship between the distributions of two quantitative variables. If the relationship is not linear, the application of statistics that assume linearity may give questionable results. Determining whether a relationship should be characterized as linear or non-linear is challenging. One indicator of non-linearity is the difference between the rank-order correlation coefficient (Spearman's rho) and Pearson's r. When Spearman's rho is larger than Pearson's r, the relationship is likely to be non-linear, and Pearson's r may understate the strength of the relationship. However, if one or both variables are badly skewed due to outliers and the skew can be corrected by re-expressing the data, we can improve the linearity of the relationship and justify the use of statistics that assume linearity.

12/14/2015Slide 24 This example is from: Applied Statistics: From Bivariate through Multivariate Techniques by Rebecca M. Warner, page 303. Problems for this assignment are based on the following summary of a correlation analysis.

12/14/2015Slide 25 Correlation of Quantitative Variables - 1 This is a sample of the problems in this assignment, with the correct answers displayed. In these problems, we will assess the normality conditions for both variables, but we will not re-express the variables or omit outliers. We will interpret the direction and strength of the relationship. We will interpret the difference between the correlation measures.

12/14/2015Slide 26 Correlation of Quantitative Variables - 2 The first paragraph asks about the number of cases included in the analysis. The notes provide information about: the data set and variables to use, the criteria for evaluating normality, and the criteria for assessing effect size.

12/14/2015Slide 27 Correlation of Quantitative Variables - 3 When we use z-scores for outlier detection, we could create z-scores for all cases in the distribution of a variable without regard to cases that are missing data for the other variable. However, these z-scores would use information from cases that are not included in the correlation. To make sure that the z-scores we use for outlier detection are based on the same cases as the rest of the analysis, we will explicitly exclude cases that are missing data for either variable. To include only the cases that have valid data for both variables, choose the Select Cases command from the Data menu.

12/14/2015Slide 28 Correlation of Quantitative Variables - 4 First, mark the option button: If condition is satisfied. Second, when the option button is marked, the If… button is activated. Click on the If… button to specify the condition.

12/14/2015Slide 29 Correlation of Quantitative Variables - 5 SPSS includes functions that perform specific calculations, which we can use for creating new variables or for selecting cases. The NMISS function counts the number of variables that have missing data for a case. If NMISS equals 0, the case has valid data for all of the variables and should be included in the analysis. Click on the Continue button to close the dialog box.
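The selection logic can be mirrored outside SPSS. The sketch below is a plain Python analogue, not SPSS syntax: the `nmiss` helper is my own stand-in for the NMISS function, missing values are represented by `None`, and the four cases are hypothetical.

```python
def nmiss(case, variables):
    """Count how many of the named variables are missing (None) for one case."""
    return sum(1 for name in variables if case.get(name) is None)

# Hypothetical cases; None marks a missing value.
cases = [
    {"country": "A", "poverty": 12.5, "prison": 110.0},
    {"country": "B", "poverty": None, "prison": 95.0},
    {"country": "C", "poverty": 30.1, "prison": None},
    {"country": "D", "poverty": 22.0, "prison": 60.0},
]

# Keep only the cases with valid data for both variables (nmiss == 0).
selected = [c for c in cases if nmiss(c, ["poverty", "prison"]) == 0]

print([c["country"] for c in selected])  # ['A', 'D']
```

Filtering first means every later statistic, including the z-scores used for outlier detection, is based on the same set of cases.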

12/14/2015Slide 30 Correlation of Quantitative Variables - 6 Having entered the condition, click on the OK button to complete the selection. The condition we entered is printed to the right of the If… button.

12/14/2015Slide 31 Correlation of Quantitative Variables - 7 SPSS marks the cases that will not be included with slashes through the case number.

12/14/2015Slide 32 Correlation of Quantitative Variables - 8 To compute the descriptive statistics in SPSS, select the Descriptive Statistics > Descriptives command from the Analyze menu. Now that we have specifically included only the cases that are not missing data for any variable, we create the statistics and standard scores we need to assess the normality of the distribution of the variables to be correlated.

12/14/2015Slide 33 Correlation of Quantitative Variables - 9 Move the variables for the analysis, poverty and prison, to the Variable(s) list box. Click on the Options button to select optional statistics.

12/14/2015Slide 34 Correlation of Quantitative Variables - 10 The check boxes for Mean and Std. Deviation are already marked by default. Mark the Kurtosis and Skewness check boxes. This will provide the statistics for assessing normality. Click on the Continue button to close the dialog box.

12/14/2015Slide 35 Correlation of Quantitative Variables - 11 Mark the check box Save standardized values as variables. Click on the OK button to produce the output.

12/14/2015Slide 36 Correlation of Quantitative Variables - 12 The table of Descriptive Statistics tells us the number of valid cases for the problem, 128.

12/14/2015Slide 37 Correlation of Quantitative Variables - 13 We enter the number of valid cases from the Descriptives output table, 128. The next paragraph asks us to evaluate the normality of the distribution based on three criteria: skewness, kurtosis, and outliers more than three standard deviations from the mean. The criteria to be applied are listed in note 2. We postpone the characterization of the distribution until we have examined the three individual criteria.

12/14/2015Slide 38 Correlation of Quantitative Variables - 14 We can use the Correlation and Regression script to produce the histograms and scatterplot. Highlight the dependent variable, poverty, and the independent variable, prison. Mark the checkboxes to overlay means and standard deviations on the chart, and click on the Run button.

12/14/2015Slide 39 Correlation of Quantitative Variables - 15 The histogram for poverty has a nearly normal shape, and the skew and kurtosis are both between -1.0 and +1.0. It is not clear from the histogram whether there are any outlier cases to the right, above three standard deviations. The three standard deviation line does not appear in the scatterplot, reinforcing the interpretation that there are no outliers. To make certain, we need to check the z-scores for the variable. The reference lines shown in the chart are labeled Mean - 1 S.D., Mean + 1 S.D., and Mean + 2 S.D. NOTE: the apparent number of outliers in the scatterplot may not be accurate because two or more cases with the same scores will appear as a single dot.

12/14/2015Slide 40 Correlation of Quantitative Variables - 16 The histogram for prison does not appear nearly normal. It is skewed to the right and shows 3 outliers to the right of the orange 3 standard deviation line. Both skewness and kurtosis are well above +1.0. The three outliers clearly show up in the scatterplot. Rho indicates a stronger relationship than r, suggesting that re-expression may result in a stronger relationship. The reference lines shown in the chart are labeled + 1 S.D., + 2 S.D., and + 3 S.D. NOTE: the apparent number of outliers in the scatterplot may not be accurate because two or more cases with the same scores will appear as a single dot.

12/14/2015Slide 41 Correlation of Quantitative Variables - 17 We can create a histogram in SPSS. Select Legacy Dialogs > Histogram from the Graphs menu.

12/14/2015Slide 42 Correlation of Quantitative Variables - 18 Move the variable prison to the Variable: text box. Mark the check box for Display normal curve. Click on the OK button to produce the output.

12/14/2015Slide 43 Correlation of Quantitative Variables - 19 The histogram with the normal curve overlay appears in the SPSS Viewer. Reference lines can be added to the histogram manually if we calculated the descriptive statistics, but we can answer our questions just by examining the statistical output.

12/14/2015Slide 44 Correlation of Quantitative Variables - 20 The first item in the second sentence asks us to enter and characterize the degree and direction of the skewness for the distribution.

12/14/2015Slide 45 Correlation of Quantitative Variables - 21 Since the skewness is positive, we characterize it as skewness to the right. Since skewness (.57) is smaller than +1.0, we characterize it as slightly skewed to the right. We enter the value of skewness from the table of descriptive statistics.

12/14/2015Slide 46 Correlation of Quantitative Variables - 22 The second part of the sentence asks us to enter and characterize the kurtosis of the distribution.

12/14/2015Slide 47 Correlation of Quantitative Variables - 23 Since the kurtosis is negative, we characterize it as flat. Since the kurtosis is greater than -1.0, we characterize it as slightly flatter. We enter the value of kurtosis from the table of descriptive statistics.

12/14/2015Slide 48 Correlation of Quantitative Variables - 24 The next sentence asks us to identify the number of extreme outliers, defined in note 2 as standard scores that were three or more standard deviations from the mean.

12/14/2015Slide 49 Correlation of Quantitative Variables - 25 In this example, we will count the number of outliers by sorting the column of data values. First, click on the column header for the variable (Zpoverty) containing the standard scores to select the column of data. Second, right click on the column header (Zpoverty) and select Sort Ascending from the popup menu. This will show any negative outliers at the top of the column.

12/14/2015Slide 50 Correlation of Quantitative Variables - 26 Scroll down in the data editor, past the cases with missing values. With the data for Zpoverty sorted in ascending order, we see that the smallest z-score was -1.55323. There are no outliers at the negative end of the distribution.

12/14/2015Slide 51 Correlation of Quantitative Variables - 27 Click the right mouse button again on the column header for Zpoverty, and select Sort Descending from the pop-up menu. This will show any positive outliers at the top of the column.

12/14/2015Slide 52 Correlation of Quantitative Variables - 28 With the data for Zpoverty sorted in descending order, we see that the largest z-score was 2.67952. There are no outliers at the positive end of the distribution. Since there were no outliers at either the positive or negative ends of the distribution, there are no outliers for this variable.
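Sorting the column works, but the same count can be obtained directly. The sketch below is a hypothetical Python equivalent: it standardizes the scores with the sample standard deviation (the n - 1 form, which I assume matches the saved SPSS z-scores) and counts the cases at or beyond three standard deviations in either direction. The data are made up for illustration.

```python
import math

def z_scores(values):
    """Standard scores using the sample standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def count_extreme_outliers(values, cutoff=3.0):
    """Number of cases whose z-score is at least `cutoff` away from the mean."""
    return sum(1 for z in z_scores(values) if abs(z) >= cutoff)

# Hypothetical scores: nineteen ordinary values and one extreme value.
scores = list(range(1, 20)) + [100]

print(count_extreme_outliers(scores))          # 1
print(round(max(z_scores(scores)), 2))         # 4.1
print(round(min(z_scores(scores)), 2))         # -0.65
```

Checking the smallest and largest z-scores corresponds to the ascending and descending sorts in the data editor.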

12/14/2015Slide 53 Correlation of Quantitative Variables - 29 We enter 0 for the number of extreme outliers.

12/14/2015Slide 54 Correlation of Quantitative Variables - 30 Since the distribution was slightly skewed to the right, slightly flatter than expected, and contained no extreme outliers, it is nearly normal.

12/14/2015Slide 55 Correlation of Quantitative Variables - 31 The next paragraph asks us to evaluate the normality of the distribution of the second variable based on the same three criteria: skewness, kurtosis, and outliers more than three standard deviations from the mean. The criteria to be applied are listed in note 2. We postpone the characterization of the distribution until we have examined the three individual criteria.

12/14/2015Slide 56 Correlation of Quantitative Variables - 32 The first item in the second sentence asks us to enter and characterize the degree and direction of the skewness for the distribution.

12/14/2015Slide 57 Correlation of Quantitative Variables - 33 Since the skewness is positive, we characterize it as skewness to the right. Since skewness (1.93) is larger than +1.0, we characterize it as badly skewed to the right. We enter the value of skewness from the table of descriptive statistics.

12/14/2015Slide 58 Correlation of Quantitative Variables - 34 The second part of the sentence asks us to enter and characterize the kurtosis of the distribution.

12/14/2015Slide 59 Correlation of Quantitative Variables - 35 Since the kurtosis is positive, we characterize it as peaked. Since the kurtosis is greater than +1.0, we characterize it as much more peaked. We enter the value of kurtosis from the table of descriptive statistics.
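The wording rules applied to both variables can be collected into two small functions. This is a sketch with hypothetical function names. The ±1.0 cut points come from the slides, while the handling of values exactly at ±1.0 or 0 is my own assumption, and the two kurtosis values in the example are illustrative rather than taken from the output.

```python
def describe_skewness(skew):
    """Phrase for skewness: sign gives direction, size relative to 1.0 gives degree."""
    if skew == 0:
        return "not skewed"
    degree = "slightly" if abs(skew) < 1.0 else "badly"
    direction = "right" if skew > 0 else "left"
    return f"{degree} skewed to the {direction}"

def describe_kurtosis(kurtosis):
    """Phrase for kurtosis: negative is flatter, positive is more peaked."""
    if kurtosis == 0:
        return "as peaked as expected"
    if kurtosis < 0:
        return "slightly flatter" if kurtosis > -1.0 else "much flatter"
    return "slightly more peaked" if kurtosis < 1.0 else "much more peaked"

print(describe_skewness(0.57))   # slightly skewed to the right
print(describe_skewness(1.93))   # badly skewed to the right
print(describe_kurtosis(-0.4))   # slightly flatter (illustrative value)
print(describe_kurtosis(4.2))    # much more peaked (illustrative value)
```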

12/14/2015Slide 60 Correlation of Quantitative Variables - 36 The next sentence asks us to identify the number of extreme outliers, defined in note 2 as standard scores that were three or more standard deviations from the mean.

12/14/2015Slide 61 Correlation of Quantitative Variables - 37 In this example, we will count the number of outliers by sorting the column of data values. First, click on the column header for the variable (Zprison) containing the standard scores to select the column of data. Second, right click on the column header (Zprison) and select Sort Ascending from the popup menu. This will show any negative outliers at the top of the column.

12/14/2015Slide 62 Correlation of Quantitative Variables - 38 Scroll down in the data editor, past the cases with missing values. With the data for Zprison sorted in ascending order, we see that the smallest z-score was -1.05312. There are no outliers at the negative end of the distribution.

12/14/2015Slide 63 Correlation of Quantitative Variables - 39 Click the right mouse button again on the column header for Zprison, and select Sort Descending from the pop-up menu. This will show any positive outliers at the top of the column.

12/14/2015Slide 64 Correlation of Quantitative Variables - 40 With the data for Zprison sorted in descending order, we see that there are three outliers at the positive end of the distribution that have z-scores greater than 3.0. This reinforces our conclusion that the distribution was badly skewed to the right.

12/14/2015Slide 65 Correlation of Quantitative Variables - 41 We enter 3 for the number of extreme outliers.

12/14/2015Slide 66 Correlation of Quantitative Variables - 42 Since the distribution was badly skewed to the right, more peaked than expected, and contained three extreme outliers, it is not nearly normal.

12/14/2015Slide 67 Correlation of Quantitative Variables - 43 Though one of the variables did not satisfy the normality assumption, we can still compute and present the findings for the correlation of the two variables. At worst, we can acknowledge the violation of the assumption as a limitation to the analysis. Or, further analysis may suggest re-expression to meet the expected assumption. The next paragraph presents the correlation between the two quantitative variables.

12/14/2015Slide 68 Correlation of Quantitative Variables - 44 The first sentence in this paragraph asks us to enter the direction of the relationship. To answer this question, we compute the correlation. NOTE: the correlation measures are included in the script output in the title for the scatterplot.

12/14/2015Slide 69 Correlation of Quantitative Variables - 45 To compute correlations, select Correlate > Bivariate from the Analyze menu.

12/14/2015Slide 70 Correlation of Quantitative Variables - 46 First, move the variables poverty and prison to the Variables list box. Second, mark the check box for Spearman and leave the check box for Pearson marked. Third, click on the OK button to produce the output.

12/14/2015Slide 71 Correlation of Quantitative Variables - 47 The Pearson Correlation r was -0.16. The negative sign of r means that the relationship is negative.

12/14/2015Slide 72 Correlation of Quantitative Variables - 48 The negative sign of r means that the relationship is negative. The next sentence further interprets the direction of the relationship, i.e. whether higher scores on the independent variable are associated with higher or lower scores on the dependent variable.

12/14/2015Slide 73 Correlation of Quantitative Variables - 49 A negative correlation means that the scores of the two variables move in opposite directions, i.e. as one increases the other decreases. Thus, larger prison populations are associated with lower rates of poverty. The next sentence asks us to enter the value of Pearson’s r.

12/14/2015Slide 74 Correlation of Quantitative Variables - 50 The Pearson Correlation is entered into the narrative.

12/14/2015Slide 75 Correlation of Quantitative Variables - 51 The next sentence interprets the Pearson Correlation as the Coefficient of Determination, r². The value of r² is not computed as part of the Correlations output, but we can compute it with a calculator or in Excel. It is computed as r multiplied by itself.

12/14/2015Slide 76 Correlation of Quantitative Variables - 52 By my calculations, r² is equal to: -0.16 x -0.16 = 0.0256, or 0.03. The r² of 0.03 converts to 3%, which is interpreted as the percent of variance in the dependent variable explained or predicted by the variance in the independent variable.
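The same arithmetic takes a few lines of Python, shown here only as a convenience check on the calculator result. The variable names are my own.

```python
r = -0.16

# The coefficient of determination is r multiplied by itself.
r_squared = r * r

print(round(r_squared, 4))   # 0.0256
print(round(r_squared, 2))   # 0.03
print(f"{r_squared:.0%}")    # 3%
```

Because the two negative signs cancel, r² is always non-negative regardless of the direction of the relationship.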

12/14/2015Slide 77 Correlation of Quantitative Variables - 53 The final sentence in the paragraph asks for the effect size interpretation of the correlation. We use Cohen’s criteria for characterizing the effect size, as shown in Note 3.

12/14/2015Slide 78 Correlation of Quantitative Variables - 54 A Pearson r of -0.16 falls in the interval from -.10 to -.30, characterized as weak. The last paragraph calls for us to enter Spearman’s rho and compare its size to Pearson’s r.

12/14/2015Slide 79 Correlation of Quantitative Variables - 55 Spearman’s rho is entered into the narrative. A coefficient of -0.25 implies a stronger relationship than a correlation of -0.16.

12/14/2015Slide 80 Correlation of Quantitative Variables - 56 The final part of the sentence focuses on whether or not there is a stronger relationship between the variables than that represented by Pearson’s r and we should consider re-expression.

12/14/2015Slide 81 Correlation of Quantitative Variables - 57 When Spearman’s rho indicates a stronger relationship than Pearson’s r, we characterize Pearson’s r as understating the strength of the relationship. If one or both of the variables was not nearly normal, we could try re-expression, but it is not required for these problems.
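The comparison of the two coefficients can be written as a simple check. The function below is a hypothetical helper. It compares absolute values so that the sign does not matter, and it borrows the .05 margin mentioned earlier in the slides as its default, which is an assumption about how large the gap should be before it is worth noting.

```python
def r_understates(r, rho, margin=0.05):
    """True when Spearman's rho is stronger than Pearson's r by at least `margin`."""
    return abs(rho) - abs(r) >= margin

# Values from the poverty and prison example.
print(r_understates(r=-0.16, rho=-0.25))   # True

# Values from the original GPA and income example.
print(r_understates(r=0.911, rho=0.900))   # False
```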

12/14/2015Slide 82 Correlation of Quantitative Variables - 58 Submitting the problem for grading confirms the correctness of our answers.
