Chapter 8 Relationships Among Variables
Research Methods in Physical Activity
Correlation: A statistical technique used to determine the relationship between two or more variables. A correlation is simple when it involves only two variables and multiple when it involves more than two. A multiple correlation has one dependent variable (criterion variable) and two or more independent variables (predictor variables). A canonical correlation establishes the relationships between two or more dependent variables and two or more independent variables.
Positive correlation: A relationship between two variables in which a small value for one variable is associated with a small value for the other variable, and a large value for one variable is associated with a large value for the other.
Negative correlation: A relationship between two variables in which a small value for the first variable is associated with a large value for the second variable, and a large value for the first variable is associated with a small value for the second variable.
Correlation and Causation
A correlation between two variables does not mean that one variable causes the other. While two variables must be correlated for a cause-and-effect relationship to exist, correlation alone does not guarantee such a relationship. Correlation is a necessary but not sufficient condition for causation. The only way that causation can be shown is with an experimental study in which an independent variable can be manipulated to bring about an effect.
Coefficient of correlation (r): A quantitative value of the relationship between two or more variables that can range from .00 to 1.00 in either a positive or negative direction (that is, from -1.00 to +1.00).
Pearson product moment coefficient of correlation: The most commonly used method of computing the correlation between two variables; also called interclass correlation, simple correlation, or Pearson r. The Pearson r has one criterion (or dependent) variable and one predictor (or independent) variable. An important assumption for the use of r is that the relationship between the variables is expected to be linear, that is, that a straight line is the best model of the relationship. When that is not true (e.g., figure 8.4d, p. 129), r is an inappropriate way to analyze the data.
Computation of the correlation coefficient
The computation of the correlation coefficient involves the relative distances of the scores from the two means of the distributions. The formula consists of only three operations:
1) Sum each set of scores.
2) Square and sum each set of scores.
3) Multiply each pair of scores and obtain the cumulative sum of these products.
See Example 8.1, p. 130, for an example of the computation.
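
The three operations above can be sketched in Python with the common raw-score form of the Pearson formula, r = (NΣXY - ΣXΣY) / sqrt[(NΣX² - (ΣX)²)(NΣY² - (ΣY)²)]. This is only a minimal sketch, not the worked computation from Example 8.1: the function name pearson_r and the scores are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums, sums of squares, and the sum of cross-products."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and hold at least two scores")
    n = len(x)                                  # N = number of pairs of scores
    sum_x, sum_y = sum(x), sum(y)               # 1) sum each set of scores
    sum_x2 = sum(v * v for v in x)              # 2) square and sum each set
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # 3) sum of the paired products
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical scores, invented for illustration (not the data in Example 8.1)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3))  # 0.775
```

Swapping x and y gives the same r, which is why it does not matter which variable is called X when only the relationship is of interest.
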
Computation of the correlation coefficient
In a correlation problem that simply determines the relationship between two variables, it does not matter which one is X and which is Y. If the investigator wants to predict one score from the other, then Y designates the criterion (dependent) variable, which is the one being predicted, and X designates the predictor (independent) variable. In the example of the positive correlation to the left, the criterion variable is "Years of education" and the predictor variable is annual income. Thus, we would "predict" years of education from annual income.
Interpreting the reliability of r
What does a coefficient of correlation mean in terms of being high or low, satisfactory or unsatisfactory? One criterion is its reliability, or significance. Does it represent a real relationship? That is, if the study were repeated, what is the probability of finding a similar relationship? For this statistical criterion of significance, simply consult a table. Table 3 in the appendix (p. 428) contains the correlation coefficients needed for significance at the .05 and .01 levels. To use Table 3, select the desired level of significance, such as the .05 level, and then find the appropriate degrees of freedom (df, which is based on the number of participants corrected for sample bias). For r, df = N - 2 (remember that N in correlation refers to the number of pairs of scores).
Some Observations about "significant r" (refer to Table 3)
1) The correlation needed for significance decreases as the number of participants (df) increases. Very low correlation coefficients can be significant if you have a large sample of participants. At the .05 level, r = .38 is significant with 25 df, r = .27 is significant with 50 df, and r = .195 is significant with 100 df.
2) A higher correlation is required for significance at the .01 level than at the .05 level. The .05 level means that if there were truly no relationship and 100 experiments were conducted, the null hypothesis (that there is no relationship) would be expected to be rejected incorrectly, just by chance, on about 5 of the 100 occasions. At the .01 level, we would expect a relationship of this magnitude because of chance less than once in 100 experiments. Therefore, the test of significance at the .01 level is more stringent than at the .05 level.
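
The table lookup can be sketched in a few lines of Python. The dictionary below holds only the three .05-level values quoted on this slide, so it is a partial stand-in for Table 3, and the function names are assumptions made for illustration.

```python
# Critical values of r at the .05 level, copied from the three cases quoted
# on the slide. The full Table 3 has many more rows; this is an illustration.
CRITICAL_R_05 = {25: 0.38, 50: 0.27, 100: 0.195}

def degrees_of_freedom(n_pairs):
    """df for r is the number of pairs of scores minus 2."""
    return n_pairs - 2

def is_significant_05(r, n_pairs):
    """True if the size of r reaches the tabled .05 critical value."""
    df = degrees_of_freedom(n_pairs)
    if df not in CRITICAL_R_05:
        raise KeyError(f"df = {df} is not in this partial table; consult Table 3")
    # The sign of r is ignored: only its magnitude is compared with the table.
    return abs(r) >= CRITICAL_R_05[df]

print(is_significant_05(0.30, 27))    # df = 25 -> False
print(is_significant_05(0.30, 52))    # df = 50 -> True
print(is_significant_05(-0.20, 102))  # df = 100 -> True
```

The same r = .30 fails with 27 pairs of scores but passes with 52, which is the first observation in action.
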
Interpreting the Meaningfulness of r
The interpretation of a correlation for statistical significance is important, but because of the vast influence of sample size, this criterion is not always meaningful. The most commonly used criterion for interpreting the meaningfulness of the correlation coefficient is the coefficient of determination (r²). With r², the portion of common association of the factors that influence the two variables is determined. Thus, the coefficient of determination indicates the portion of the total variance in one measure that can be explained, or accounted for, by the variance in the other measure. The Venn diagram visually depicts this idea. Circle A represents the variance in one variable, and Circle B represents the variance in a second variable. The overlap, C, is the shared variance: r = .60; thus r² = .36. Thus, 36% of the variance in A can be explained by the variance in B. (Unexplained variance is equal to 1 - r², here .64.)
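
The arithmetic behind the Venn diagram can be checked with a short Python sketch. The function name is an assumption; the values r = .60 and r = .195 are the ones quoted in these slides.

```python
def coefficient_of_determination(r):
    """Shared variance between two variables, r squared."""
    return r ** 2

r = 0.60
shared = coefficient_of_determination(r)
unexplained = 1 - shared
print(f"shared variance: {shared:.0%}")            # 36%
print(f"unexplained variance: {unexplained:.0%}")  # 64%

# r = .195 is significant at the .05 level with 100 df,
# yet it accounts for under 4% of the variance.
small = coefficient_of_determination(0.195)
print(f"shared variance for r = .195: {small:.1%}")  # 3.8%
```

The last line shows why significance alone is not enough: a correlation can be reliable and still explain very little.
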
Using Correlation for Prediction (Regression)
Prediction is based on correlation. The higher the relationship is between two variables, the more accurately you can predict one from the other. If the correlation were perfect, you could predict with complete accuracy. Thus, r = .00 means no predictive ability, and r = 1.00 (positive or negative) means absolute predictive ability.
Prediction equation: A formula to predict some criterion (e.g., some measure of performance) based on the relationship between the predictor variable(s) and the criterion; also called regression equation. We predict Y (criterion or dependent variable) on the ordinate/vertical axis from X (predictor or independent variable) on the abscissa/horizontal axis.
Using Correlation for Prediction (regression)
We predict Y (criterion or dependent variable) on the ordinate/vertical axis from X (predictor or independent variable) on the abscissa/horizontal axis, using the equation for a straight line:
Y = a + bX
where
Y = the predicted score (dependent score)
X = the predictor score (independent score)
a = the intercept point on Y
b = the slope of the line
Keep it simple.
1) "a" is the place on the Y axis where the line intersects it (the predicted Y when X = 0).
2) The slope of the line, "b", is how much Y changes for each one-unit change in X (degree or magnitude of slope), together with its direction (positive or negative).
So, if we want to predict Y from X, we need to calculate "a" and "b".
Using Correlation for Prediction (regression)
Calculating "a" and "b"
First you will need to calculate "b", which is determined by the correlation coefficient and the standard deviations of X and Y with the following formula:
b = r(sY / sX)
sY = the standard deviation of Y
sX = the standard deviation of X
Note that the slope of the line reflects not only the association of X and Y (direction: positive or negative) but also how the spread of Y compares with the spread of X (rise over run = degree or magnitude of slope).
Using Correlation for Prediction (regression)
Calculating "a" and "b"
Next you can calculate "a", which is the intercept on the Y axis:
a = MY - bMX
MY = the mean of the Y scores
MX = the mean of the X scores
b = the slope of the regression line (see last slide)
Note that this formula produces a single value that depends on the central tendency (the means of X and Y) and on the slope "b", which in turn depends on r and the standard deviations of X and Y. So, the intercept and the slope of the line both depend on the means and standard deviations of X and Y. (See your text for examples of using the regression formula.)
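
As a minimal sketch of the two formulas, b = r(sY / sX) and a = MY - bMX, the Python below fits a line to a small set of hypothetical scores and predicts Y for a new X. The function names and the data are made up for illustration; they are not taken from the text.

```python
from math import sqrt
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r from deviations about the two means."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def regression_line(x, y):
    """Return (a, b) for the prediction equation Y = a + bX."""
    r = pearson_r(x, y)
    b = r * (stdev(y) / stdev(x))   # slope: b = r(sY / sX)
    a = mean(y) - b * mean(x)       # intercept: a = MY - bMX
    return a, b

# Hypothetical paired scores, invented for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = regression_line(x, y)
predicted_y = a + b * 6             # predicted Y for a new score X = 6
print(round(a, 2), round(b, 2), round(predicted_y, 2))  # 2.2 0.6 5.8
```

Because the intercept is computed from the two means, the fitted line always passes through the point (MX, MY).
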
Line of Best Fit (regression line)
The line of best fit is the line that passes through the intersection of the X and Y means. The slope of the line depends not only on the means but also on the variability of X and Y (see previous formulas). The line of best fit is the line for which the squared vertical distances of all the X and Y coordinates from the line sum to the smallest possible value; in that sense it is the "best fit" for all the data coordinates. The line is a regression line because it is used to predict Y from X (see previous slides). X and Y coordinates that do not fall on the line produce residuals, or residual scores.
Residual scores: The difference between the predicted and actual scores that represents the error of prediction.
Note that if you have a perfect correlation, all scores in the scatter plot would be in a straight line (the line of best fit) and there would be no residual scores. Also, residual scores are really unexplained variance (error of prediction).
Line of Best Fit (regression line)
If we were to compute all the residual scores, their mean would be zero (the errors above and below the line of best fit balance out), and their standard deviation is called the standard error of prediction, or standard error of the estimate. The larger the standard error of the estimate, the less the predictive ability; the larger r² is, the smaller the error of prediction.
Note: Chapter 8 also contains information on partial, semipartial, and multiple regression principles, and on the Fisher Z transformation of r. This information is beyond the scope of our class in the introduction of statistical principles. I welcome you to read the information, but I will not review it or test you on the material.
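
The residual scores and the standard error of the estimate can be illustrated with a short Python sketch. The data are hypothetical, and the standard error is computed here as the standard deviation of the residuals with N - 1 in the denominator, which matches the form sY * sqrt(1 - r²); this choice is an assumption, since some texts divide by N - 2 instead.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired scores, invented for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)
b = sxy / sxx                 # algebraically the same as r * (sY / sX)
a = my - b * mx               # a = MY - bMX

predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # actual minus predicted

mean_residual = mean(residuals)            # zero, apart from rounding error
see = stdev(residuals)                     # standard error of the estimate
see_from_r = stdev(y) * sqrt(1 - r ** 2)   # same value from sY and r

print(abs(mean_residual) < 1e-9)           # True
print(round(see, 3), round(see_from_r, 3)) # 0.775 0.775
```

As r² approaches 1, the term sqrt(1 - r²) approaches 0, so the error of prediction shrinks, as the slide states.
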
End of Lecture