
1 Correlation They go together like salt and pepper… like oil and vinegar… like bread and butter… etc.

2 Major Uses Correlational techniques are used for three major purposes:
Degree of Association, Prediction, and Reliability

3 Bivariate Distribution
Bivariate distribution - a distribution in which two variables are presented simultaneously. Consider a small table of paired X and Y scores. Ordinarily, we might construct a separate graph for each set of data; however, we can place both on a single "scatter diagram."

4 Scatter Diagram [figure: scatter diagram plotting the paired X and Y scores, with X on the horizontal axis and Y on the vertical axis]
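As a rough illustration of how such a scatter diagram is built, here is a minimal Python sketch. The paired scores are invented for illustration (they are not the slide's data), and matplotlib is assumed to be available.

```python
# A minimal sketch of a scatter diagram for a bivariate distribution.
# The X-Y pairs below are made up for illustration only.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7]
y = [2, 3, 3, 5, 4, 6, 7]

plt.scatter(x, y)            # one point per (X, Y) pair
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter Diagram")
plt.show()
```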

5 What Scatter Diagrams Can Tell Us
A scatter diagram can tell us much about a bivariate distribution: - Presence of relationship [panels: No Relationship vs. Relationship]

6 - Direction of relationship
[panels: Positive Relationship vs. Negative Relationship] There is a positive relationship between high school SAT scores and college GPA. Other examples? There is a negative relationship between the number of missed classes and exam scores. Other examples?

7 - Linear or non-linear [panels: Linear vs. Non-linear]

8 - Homoscedasticity/Heteroscedasticity

9 - Exceptions to relationship
[panel: Perfect Relationship]

10 Conceptualizing r_xy
When the scores are expressed as deviations from their means (x = X - \bar{X}, y = Y - \bar{Y}), the cross-products xy are positive in quadrants I and III of the scatter diagram and negative in quadrants II and IV, and the correlation is
r_{xy} = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}}

11 Computational Formula
Expressed in raw-score terms, r_{xy} = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}} becomes
r_{xy} = \frac{\Sigma XY - \frac{(\Sigma X)(\Sigma Y)}{n}}{\sqrt{\left[\Sigma X^2 - \frac{(\Sigma X)^2}{n}\right]\left[\Sigma Y^2 - \frac{(\Sigma Y)^2}{n}\right]}}
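A small Python sketch of both formulas for r, using made-up X and Y scores (not data from the slides), shows that the deviation-score form and the computational form give the same value:

```python
# Pearson r computed two ways: the deviation-score (cross-products)
# form from slide 10 and the computational formula from slide 11.
# The data are made up for illustration.
import math

X = [2.0, 4.0, 5.0, 7.0, 9.0]
Y = [1.0, 3.0, 4.0, 6.0, 8.0]
n = len(X)

# Deviation-score form: r = sum(xy) / sqrt(sum(x^2) * sum(y^2))
mx, my = sum(X) / n, sum(Y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(X, Y))
sx2 = sum((xi - mx) ** 2 for xi in X)
sy2 = sum((yi - my) ** 2 for yi in Y)
r_deviation = sxy / math.sqrt(sx2 * sy2)

# Computational formula: raw sums only, no deviation scores needed
sum_x, sum_y = sum(X), sum(Y)
sum_x2, sum_y2 = sum(x ** 2 for x in X), sum(y ** 2 for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
num = sum_xy - (sum_x * sum_y) / n
den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r_computational = num / den

print(r_deviation, r_computational)  # the two forms agree
```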

12 Correlation and Causation
“Correlation does not imply causation.” Consider the following: there is a very high correlation (i.e., in the upper .90s) between the length of a person’s big toe and the ability to spell! Several possibilities exist: changes in X cause changes in Y; changes in Y cause changes in X; or a third (or other) variable affects both X and Y.

13 Correlation and Causation
How about this one? Children exposed to violent TV are more aggressive than children exposed to non-violent TV

14 Factors Influencing the Size of “r”
Linearity of regression - the more closely the scores follow a straight line, the higher the value of r; r underestimates the true degree of association in a non-linear relationship [panels: high-value r vs. low-value r]

15 Factors Influencing the Size of “r”
Restriction of range (truncated range) - if the correlation coefficient is calculated on only a portion of the range of scores, r will usually be smaller than if all the data had been used [panels: higher-value r vs. lower-value r] (see the simulation sketch below)
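As a rough demonstration of restriction of range, the following Python sketch simulates a linearly related X and Y and compares r over the full range with r over a truncated range. The data-generating model and the cutoff of X > 600 are assumptions made for illustration, not taken from the slides; numpy is assumed to be available.

```python
# Restriction-of-range demonstration: correlate simulated scores over
# the full range of X, then again using only the top portion of X.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(200, 800, size=2000)            # SAT-like scores
y = 0.004 * x + rng.normal(0, 0.5, size=2000)   # GPA-like scores with noise

r_full = np.corrcoef(x, y)[0, 1]

keep = x > 600                                  # truncate: keep only high X values
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))  # the restricted r is smaller
```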

16 Factors Influencing the Size of “r”
Discontinuous distribution - if the correlation coefficient is calculated on separated portions of the data, r will usually be higher than if all the data had been used [panels: lower-value r vs. higher-value r]

17 Factors Influencing the Size of “r”
The correlation coefficient will adequately reflect the degree of association across the entire range of scores for a homoscedastic distribution, but not for a heteroscedastic distribution, where it overestimates the degree of association in some regions of the range and underestimates it in others [panels: Homoscedastic vs. Heteroscedastic]

18 Factors Influencing the Size of “r”
Pooled data - small samples may be combined if their means and standard deviations are similar; otherwise, "spurious correlations" may occur [panels: lower-value r vs. higher-value r]

19 Factors Influencing the Size of “r”
Sampling variability - correlations based on large samples (i.e., n > 100) are not greatly affected by sampling variability, but correlations based on small samples will vary considerably, so one must take sample size into consideration when interpreting r. Each of the preceding factors indicates the need to consider the conditions under which the correlation coefficient was calculated when interpreting r.

20 Interpreting Strength of Association
The correlation coefficient itself is not the best way to interpret the strength of the association between X and Y: its scale is not linear, so r = .60 (for example) does not represent a relationship twice as strong as r = .30. The coefficient of determination, r², is a better index of strength: it is the proportion of variability in Y scores that can be explained by changes in X scores.
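A quick worked comparison using the slide's own values: r = .30 gives r² = .09, so 9% of the variability in Y is explained, while r = .60 gives r² = .36, or 36% explained. Doubling r therefore quadruples, rather than doubles, the proportion of explained variability.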

21 Regression

22 Prediction If two variables are correlated, you can predict Y from X with better than chance accuracy. Given r < 1, there will be predictive error - the difference between the actual Y score and the predicted score (Y') for a given value of X: Predictive error = Y - Y'. For example, if the predicted GPA = 3.40 and the actual GPA = 2.78, the error is 3.40 - 2.78 = .62 in magnitude.

23 Reducing Predictive Error
Obviously, we want our predictions to be as accurate as possible (i.e., to have little predictive error). When \Sigma(Y - Y')^2 is at a minimum, we have met the least-squares criterion for the "best-fitting straight line," called the regression line.

24 The regression line can be thought of as a “running mean”
The means are estimated (i.e., what would be expected given a large number of observations at a given X value). [Figure: regression line with predicted means such as Y' = 2.31 at X = 425 and Y' = 2.78 at X = 650, around an overall mean of Y = 2.57]

25 Which Line is Best? Given the scatter plot below, where would we place the regression line?

26 The Regression Equation
Fortunately, there is a simple way to determine precisely where the regression line should be placed so that the least-squares criterion is met:
Y' = r\left(\frac{S_Y}{S_X}\right)(X - \bar{X}) + \bar{Y}
where X is the score from which we are predicting.

27 The regression equation is really nothing more than the equation for a straight line:
Y = aX + b, where a = slope and b = y-intercept. In Y' = r\left(\frac{S_Y}{S_X}\right)(X - \bar{X}) + \bar{Y}, the slope is r(S_Y/S_X) and the y-intercept is \bar{Y} - r(S_Y/S_X)\bar{X}. As such, we can use the regression equation to predict Y from X.
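A minimal Python sketch of the regression equation in this slope/intercept form, using made-up paired scores (not the slide's data): it computes r, S_X, and S_Y, forms the slope r(S_Y/S_X) and the intercept \bar{Y} - slope·\bar{X}, and then predicts Y' for a given X.

```python
# Fitting the regression line Y' = r(Sy/Sx)(X - Xbar) + Ybar from paired
# scores, then rewriting it in slope/intercept form. Data are made up.
import math

X = [420, 500, 560, 610, 650, 700]
Y = [2.1, 2.4, 2.6, 2.9, 3.0, 3.3]
n = len(X)

mx, my = sum(X) / n, sum(Y) / n
sx = math.sqrt(sum((xi - mx) ** 2 for xi in X) / n)
sy = math.sqrt(sum((yi - my) ** 2 for yi in Y) / n)
r = (sum((xi - mx) * (yi - my) for xi, yi in zip(X, Y)) / n) / (sx * sy)

slope = r * sy / sx            # a in Y' = aX + b
intercept = my - slope * mx    # b in Y' = aX + b

def predict(x):
    """Predicted Y' for a given X score."""
    return slope * x + intercept

print(round(r, 3), round(slope, 4), round(intercept, 3), round(predict(650), 2))
```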

28 An Example Consider the following data for n = 4 players: batting average (X) and home runs (Y), with pairs such as .219 with 8 HR and .287 with 11 HR.
With n = 4, \Sigma X = 1.127, \Sigma Y = 46, \bar{Y} = 11.5, S_Y = 2.5, and \Sigma Y^2 = 554, the regression equation works out to a slope of 61.06 (Y' = 61.06X - 5.70), so a batting average of X = .271 yields a predicted Y'_{HR} = 10.84.

29 Regression to the Mean Any time r < 1.00, the Y' values will cluster more toward the overall mean \bar{Y}. This tendency for Y' values to move closer to \bar{Y} is called regression to the mean. In the extreme case where r = 0, all of our Y' values will equal \bar{Y}.

30 Measuring Predictive Error
Since a predicted value is only a "best estimate," we would like to know how large the predictive error is overall. One way to measure predictive error is to calculate the amount of variability of the Y scores around the regression line, the standard error of estimate (of prediction):
S_{YX} = \sqrt{\frac{\Sigma(Y - Y')^2}{n}}

31 Standard Error of Estimate
The standard error of estimate is like a standard deviation, but one where the deviations are measured from the regression line rather than from the mean: standard error of estimate S_{YX} = \sqrt{\Sigma(Y - Y')^2 / n} versus standard deviation S_X = \sqrt{\Sigma(X - \bar{X})^2 / n}.

32 Standard Error of Estimate
An easier formula is as follows: S_{YX} = S_Y\sqrt{1 - r^2}. As r decreases, S_{YX} increases [panels: high-value r vs. low-value r].
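A short Python sketch, again with made-up data, computes S_YX directly from the residuals \Sigma(Y - Y')^2 / n and with the shortcut S_Y\sqrt{1 - r^2} to confirm that the two agree:

```python
# The standard error of estimate computed two ways: directly from the
# residuals around the regression line, and with S_YX = S_Y * sqrt(1 - r^2).
# Data are made up for illustration.
import math

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [1.5, 2.0, 2.2, 3.1, 3.3, 4.2]
n = len(X)

mx, my = sum(X) / n, sum(Y) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in X) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in Y) / n)
r = (sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n) / (sx * sy)

slope = r * sy / sx
intercept = my - slope * mx
residuals = [y - (slope * x + intercept) for x, y in zip(X, Y)]

s_yx_direct = math.sqrt(sum(e ** 2 for e in residuals) / n)
s_yx_shortcut = sy * math.sqrt(1 - r ** 2)

print(round(s_yx_direct, 4), round(s_yx_shortcut, 4))  # the two agree
```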

33 Confidence in Predictions
We can also establish limits, with a specified probability, within which an individual's actual score is likely to fall. For example, given a predicted GPA of Y'_{GPA} = 2.78 for SAT = 650 and S_{YX} = .45, the 95% limits are Y' \pm 1.96(S_{YX}): upper limit = 2.78 + 1.96(.45) = 3.66, lower limit = 2.78 - 1.96(.45) = 1.90.
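A tiny Python sketch reproducing the slide's 95% limits from Y' = 2.78 and S_YX = .45:

```python
# 95% limits around a predicted GPA of 2.78 with S_YX = .45,
# using the normal-curve multiplier 1.96 (values from the slide).
y_prime = 2.78
s_yx = 0.45

upper = y_prime + 1.96 * s_yx
lower = y_prime - 1.96 * s_yx
print(round(lower, 2), round(upper, 2))  # 1.9 3.66
```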

34 Confidence in Predictions
Given an SAT = 650, we can be 95% confident that the individual's actual GPA will fall between 1.90 and 3.66. For such "confidence intervals" to make sense: the relationship between X and Y must be linear; the bivariate distribution must be homoscedastic; the Y values must be normally distributed about Y'; and n > 100.

35 Ordinal and Nominal Measures of Association

36 Spearman r When you have two ordinal variables (e.g., ranks of candidates from two admissions counselors), you can determine the degree of association between the variables with Spearman r: r_s = 1 - \frac{6\Sigma D^2}{n(n^2 - 1)}, where D = the difference between a pair of rankings and n = the number of pairs of ranks.
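A brief Python sketch of the Spearman formula, using two invented sets of candidate rankings (the counselors' actual ranks are not given in the slides):

```python
# Spearman r from two sets of ranks using r_s = 1 - 6*sum(D^2) / (n(n^2 - 1)).
# The candidate rankings below are made up for illustration.
ranks_counselor_1 = [1, 2, 3, 4, 5, 6, 7, 8]
ranks_counselor_2 = [2, 1, 4, 3, 6, 5, 8, 7]

n = len(ranks_counselor_1)
sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_counselor_1, ranks_counselor_2))
r_s = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(r_s)  # 1 - 6*8 / (8*63) = 0.904...
```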

37 Spearman r In case of ties, it is usual to assign each tied observation the mean of the ranks the tied observations would otherwise have occupied. For example, if you cannot decide whether applicant #8 or applicant #3 should be your 7th choice, assign each a rank of 7.5, since they would have been your 7th and 8th choices had you been able to decide. It is best to have judges avoid ties altogether; but if ties persist, it is better to calculate Pearson r on the ranks and interpret the value as a Spearman r corrected for ties.

38 Phi (φ) When you have two true dichotomous variables (e.g., gender and employment status), you can use φ = \frac{AD - BC}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}, where A, B, C, and D are the cell frequencies of the 2 × 2 table (columns M and F; rows Employed, with cells A and B, and Unemployed, with cells C and D). In the slide's example, with n = 200, φ = .35.
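A short Python sketch of φ for a 2 × 2 table. The cell counts below are invented, since the slide's own frequencies are not reproduced in the transcript, so the resulting φ will not equal the slide's .35:

```python
# Phi for a 2x2 table of two dichotomous variables, following
# phi = (AD - BC) / sqrt((A+B)(C+D)(A+C)(B+D)).
# The cell counts are made up for illustration.
import math

A, B = 70, 40   # e.g., employed males, employed females
C, D = 30, 60   # e.g., unemployed males, unemployed females

phi = (A * D - B * C) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D))
print(round(phi, 2))  # about 0.30 for these counts
```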

39 Reliability The third major use of correlation is determining reliability - how consistently a measuring instrument measures over time. The most common type is test-retest reliability, in which a test is given at one time and, following some period (e.g., a week, month, or year), the test is given a second time. Other types of reliability include split-half and alternate forms.

40 Multiple Correlation and Regression
Thus far we have examined the relationship between two variables, X and Y. Multiple correlation and multiple regression examine the relationship between several X variables and a single Y variable (more commonly called the "predictor" variables and the "criterion" variable). R = the multiple correlation coefficient; R^2 = the proportion of variability in Y scores that can be explained by the combined predictors X_i; and the regression equation becomes Y' = a + b_1X_1 + b_2X_2.
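A minimal Python sketch of multiple regression with two predictors, fitted by least squares with numpy; the predictor and criterion values are invented for illustration, and R² is computed as the proportion of Y variability explained by the combined predictors:

```python
# A sketch of multiple regression with two predictors,
# Y' = a + b1*X1 + b2*X2, fitted by least squares with numpy.
# The predictor and criterion values are made up for illustration.
import numpy as np

X1 = np.array([520, 600, 480, 650, 700, 560])   # e.g., SAT
X2 = np.array([3.1, 3.5, 2.8, 3.7, 3.9, 3.2])   # e.g., HS GPA
Y  = np.array([2.6, 3.0, 2.4, 3.3, 3.6, 2.9])   # e.g., college GPA

design = np.column_stack([np.ones_like(X1, dtype=float), X1, X2])
coeffs, *_ = np.linalg.lstsq(design, Y, rcond=None)
a, b1, b2 = coeffs

Y_pred = design @ coeffs
R2 = 1 - np.sum((Y - Y_pred) ** 2) / np.sum((Y - Y.mean()) ** 2)
print(a, b1, b2, R2)   # R^2 = proportion of Y variability explained
```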

