Lecture 11 PY 427 Statistics 1 Fall 2006 Kin Ching Kong, Ph.D. Chicago School of Professional Psychology
Agenda Correlation The Pearson Correlation Introduction The Pearson Correlation Definition Sum of Products (SP) Calculation The Pearson Correlation and z-scores Uses of the Pearson Correlation Interpreting the Pearson Correlation Hypothesis Tests with the Pearson Correlation The Point-Biserial Correlation The Spearman Correlation Introduction to Regression
Correlation, Introduction
- Correlation measures and describes the relationship (association) between two variables (X and Y).
- It requires two scores (X, Y) for each individual.
- Usually the two variables are simply observed.
- A scatterplot of the data displays the relationship between the two variables (Figure 15.1 of your book).
Correlation, Three Characteristics of a Relationship
A correlation measures three characteristics of a relationship:
- The direction of the relationship (Figure 15.2 of your book)
  - Positive: X and Y tend to move in the same direction.
  - Negative: X and Y tend to go in opposite directions.
  - Direction is identified by the + and - signs.
- The form of the relationship
  - Linear (i.e., a straight line)
  - Nonlinear
- The degree (strength) of the relationship (Figure 15.3 of your book)
  - How well the data fit the specific form being considered, i.e., how closely the two variables are associated.
  - Represented by the numerical value of the correlation.
The Pearson Correlation (r)
The Pearson correlation (or Pearson product-moment correlation) measures the degree and direction of the linear relationship between two variables.

$r = \frac{\text{degree to which X and Y vary together}}{\text{degree to which X and Y vary separately}} = \frac{\text{covariability of X and Y}}{\text{variability of X and Y separately}}$

$r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$

where SP is the sum of products of deviations and $SS_X$, $SS_Y$ are the sums of squared deviations for X and Y.

When r = +1, every change in X is accompanied by a perfectly predictable change in Y. X and Y always vary together, so the numerator and denominator are identical.
Sum of Products (SP)
SP, the sum of products of deviations, measures the covariability of two variables.
- Definitional formula: $SP = \sum (X - M_X)(Y - M_Y)$
- Computational formula: $SP = \sum XY - \frac{\sum X \sum Y}{n}$
- n = number of pairs of scores
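The two SP formulas can be checked against each other with a short sketch. This is an illustration only: the function names are made up for this example, and the scores are the five (X, Y) pairs from the lecture's worked Pearson example.

```python
def sp_definitional(xs, ys):
    """SP as the sum of products of deviations: sum of (X - M_X)(Y - M_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def sp_computational(xs, ys):
    """SP from raw sums: sum(XY) - (sum(X) * sum(Y)) / n."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return sum_xy - (sum(xs) * sum(ys)) / n


# Scores from the lecture's worked example (n = 5 pairs).
xs = [0, 10, 4, 8, 8]
ys = [1, 3, 1, 2, 3]

sp_def = sp_definitional(xs, ys)
sp_comp = sp_computational(xs, ys)
print(sp_def, sp_comp)  # both formulas give 14.0
```

The computational formula avoids computing the means first, which is convenient by hand when the means are not whole numbers.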
The Pearson Correlation (r), an Example

$r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$

  Scores     Deviations          Squared deviations       Products
  X    Y     X-M_X    Y-M_Y      (X-M_X)^2   (Y-M_Y)^2    (X-M_X)(Y-M_Y)
  0    1      -6       -1           36           1             +6
 10    3      +4       +1           16           1             +4
  4    1      -2       -1            4           1             +2
  8    2      +2        0            4           0              0
  8    3      +2       +1            4           1             +2

M_X = 6, M_Y = 2, SS_X = 64, SS_Y = 4, SP = +14

$r = \frac{SP}{\sqrt{SS_X \, SS_Y}} = \frac{+14}{\sqrt{(64)(4)}} = \frac{+14}{16} = +0.875$

Scatterplot of the data: Figure 15.4
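The calculation in this example can be reproduced with a minimal sketch. The function name `pearson_r` is a placeholder chosen for this illustration, not something from the textbook.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation as SP / sqrt(SS_X * SS_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)  # sum of squared deviations for X
    ss_y = sum((y - mean_y) ** 2 for y in ys)  # sum of squared deviations for Y
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sp / math.sqrt(ss_x * ss_y)


# Data from the example above.
r = pearson_r([0, 10, 4, 8, 8], [1, 3, 1, 2, 3])
print(r)  # 0.875
```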
Pearson Correlation & z-Scores
- Karl Pearson based his equation for r on the concept of z-scores.
- r is defined as the mean of the z-score products for X and Y: $r = \frac{\sum z_X z_Y}{n}$
- $z_X$ and $z_Y$ are calculated using the population standard deviation. If using the sample standard deviation, divide by n - 1 instead of n in the formula above.
- When $z_X$ and $z_Y$ are both positive or both negative, the product is positive.
- When $z_X$ and $z_Y$ are of opposite sign, the product is negative.
- When most of the products are positive, r is positive (i.e., as X increases, Y increases; as X decreases, Y decreases).
- When most of the products are negative, r is negative (i.e., an inverse relationship between X and Y).
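As a rough check on the z-score definition, the sketch below computes r as the mean of the z-score products, using the population standard deviation (dividing by n). The function name is hypothetical, and the data are the five pairs from the lecture's worked Pearson example, so the result should agree with the SP-based calculation.

```python
import math


def pearson_r_from_z(xs, ys):
    """r as the mean of z_X * z_Y, with z-scores based on the population SD."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population standard deviations: divide SS by n, not n - 1.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    z_products = [
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)
    ]
    return sum(z_products) / n


r_z = pearson_r_from_z([0, 10, 4, 8, 8], [1, 3, 1, 2, 3])
print(round(r_z, 3))  # 0.875, matching SP / sqrt(SS_X * SS_Y)
```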
Uses of the Pearson Correlation, r
- Prediction: When two variables are correlated, it is possible to use one to make predictions about the other. e.g., using SAT scores to predict college grade point average.
- Validity: r can be used to demonstrate the validity of a new instrument or measure. e.g., the validity of a new IQ test can be demonstrated by high correlations with standardized IQ tests, performance on learning tests, problem-solving ability, etc.
- Reliability: r can be used to determine the reliability of a measurement procedure. e.g., if an IQ test is reliable, then your IQ measured this week will correlate highly with your IQ measured 3 weeks from now.
- Theory verification: Many psychological theories make predictions about relationships between two variables, which can be tested by determining the correlation between the two variables. e.g., parents' IQ and child's IQ.
Interpreting Correlations
- Correlation does not equal causation. A correlation simply describes a relationship between two variables; it does not explain why the two are related.
- Correlation can be greatly affected by a restricted range (Figure 15.6 of your book). To be safe, do not generalize a correlation beyond the range of data represented in the sample.
- Outliers (extreme data points) can greatly influence a correlation (Figure 15.7 of your book). You should always look at a scatterplot of your data.
- Strength of the relationship ($r^2$): $r^2$, the coefficient of determination, measures the proportion of variability in one variable that can be determined from its relationship with the other variable. e.g., if r for IQ and GPA is +0.60, then $r^2 = 0.36$, so 36% of the variability in GPA can be explained by differences in IQ.
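To make the point about outliers concrete, here is a small sketch with made-up data (these numbers are hypothetical and are not the data behind Figure 15.7). Adding a single extreme point to five weakly related pairs raises r from about 0.32 to about 0.96.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation as SP / sqrt(SS_X * SS_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sp / math.sqrt(ss_x * ss_y)


# Hypothetical scores with only a weak relationship.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 3, 2, 4]
r_without = pearson_r(xs, ys)

# The same scores plus one extreme data point at (15, 15).
r_with = pearson_r(xs + [15], ys + [15])

print(round(r_without, 2), round(r_with, 2))  # 0.32 0.96
```

A scatterplot of the six points would show the outlier sitting far from the cluster, which is why inspecting the plot is recommended before trusting r.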
Hypothesis Testing with the Pearson Correlation
- Hypothesis testing with correlation: use sample correlations to draw inferences about population correlations.
- The goal of the hypothesis test is to decide between two alternatives:
  - The nonzero sample correlation is due to sampling error.
  - The nonzero sample correlation reflects a real, nonzero correlation in the population.
- Basic question: Does a correlation exist in the population?
  - $H_0: \rho = 0$ (there is no population correlation)
  - $H_1: \rho \neq 0$ (there is a real correlation)
- Degrees of freedom: $df = n - 2$
- Critical values: Table B.6. To be significant, a sample correlation has to be greater than the critical value (ignore the sign).
Hypothesis Testing with r, an Example
A researcher obtains a correlation of r = 0.321 for a sample of 30 individuals. Does this sample provide sufficient evidence to conclude that there is a significant positive correlation in the population? Test with $\alpha = .05$.
- Step 1: State the hypotheses:
  - $H_0: \rho \le 0$ (there is not a positive correlation)
  - $H_1: \rho > 0$ (there is a positive correlation)
- Step 2: Find the critical value: $df = n - 2 = 30 - 2 = 28$, critical r = 0.306 (one-tailed)
- Step 3: Calculate the sample statistic: r = 0.321
- Step 4: Make a decision: Since the sample r is greater than the critical r, we reject the null hypothesis and conclude that there is a positive correlation in the population.
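The slide makes the decision by looking up a critical r in Table B.6. An equivalent route, shown here only as a cross-check and not as the method used in the lecture, converts r to a t statistic with df = n - 2 using the standard relation t = r * sqrt(df / (1 - r^2)). The critical value 1.701 below is the usual one-tailed t-table entry for df = 28 at alpha = .05; the function and constant names are placeholders.

```python
import math


def r_to_t(r, n):
    """Convert a sample correlation to a t statistic with df = n - 2."""
    df = n - 2
    return r * math.sqrt(df / (1 - r ** 2))


# Values from the example: r = 0.321, n = 30.
t_stat = r_to_t(0.321, 30)

# One-tailed critical t for df = 28, alpha = .05 (standard t table value).
T_CRITICAL = 1.701

reject_null = t_stat > T_CRITICAL
print(round(t_stat, 2), reject_null)  # 1.79 True
```

Both routes lead to the same decision here: the sample correlation is large enough to reject the null hypothesis.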
Hypothesis Testing with r, Your Turn
A researcher obtained the following set of data. Is there a significant correlation between X and Y? Use $\alpha = .01$.

  X    Y
  1    6
  2    8
  4    2
  5    0
  3    4
The Point-Biserial Correlation
- A special version of the Pearson correlation.
- Used to measure the relationship between a quantitative variable and a dichotomous variable.
- The dichotomous variable is coded 0 and 1. The Pearson formula is then used to calculate the point-biserial correlation.
The Point-Biserial Correlation and $r^2$
- The $r^2$ used to measure effect size is directly related to the r used to measure correlation.
Compare Point-Biserial Correlation & t Test
Table 15.1: The same data, organized for an independent-measures t test and for a point-biserial correlation.
- The t-test results: t(18) = 4.00, p < .05, $r^2 = 0.47$, so 47% of the variance in memory scores is accounted for by the treatment, i.e., mental imagery.
- The point-biserial correlation results: $r = \frac{SP}{\sqrt{SS_X \, SS_Y}} = \frac{40}{58.31} = 0.686$, n = 20, p < .05. $r^2 = (0.686)^2 = 0.47$, so 47% of the variance in memory scores can be predicted from the variance in mental imagery.
- What do the two procedures evaluate? The relationship between mental imagery and memory scores.
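The link between the two sets of results can be sketched with the standard effect-size relation r^2 = t^2 / (t^2 + df). The function name is hypothetical. Plugging in the reported t(18) = 4.00 recovers the reported r^2 and the magnitude of the point-biserial r.

```python
import math


def r_squared_from_t(t, df):
    """Proportion of variance accounted for: r^2 = t^2 / (t^2 + df)."""
    return t ** 2 / (t ** 2 + df)


# Reported t-test result: t(18) = 4.00.
r_squared = r_squared_from_t(4.00, 18)
r_magnitude = math.sqrt(r_squared)

print(round(r_squared, 2), round(r_magnitude, 3))  # 0.47 0.686
```

Note that the square root gives only the magnitude of r. Its sign depends on which group is coded 0 and which is coded 1.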
The Spearman Correlation
- For use with data measured on an ordinal scale.
- For use with interval or ratio data when there is a nonlinear relationship.
- Measures the consistency of a relationship, independent of its form.
- e.g., consider the relationship between practice (X) and performance. One would expect increased practice to lead to improved performance, but the relationship is not expected to be linear (Figure 15.10 of your book).
The Spearman Correlation, Calculation
- The data:
  - Convert X and Y to ranks separately (if the raw data are interval or ratio).
  - When two or more scores are identical, find the mean of their ranked positions and assign this mean as the final rank for each score.
- The calculation:
  - Use the Pearson formula with the rank data.
  - Use the simplified formula with the rank data when there are no ties among the ranks:

$r_S = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$

where D = Rank Y - Rank X and n = number of pairs of scores.
The Spearman Correlation, an Example
Converting raw scores to ranks:

  Raw scores     Ranks
   X     Y       X    Y    XY
   3    12       1    5     5
   4    10       2    3     6
   8    11       3    4    12
  10     9       4    2     8
  13     3       5    1     5

Figure 15.12

Using the Pearson formula (X and Y here are the ranks):

$SS_X = \sum X^2 - \frac{(\sum X)^2}{n} = 55 - \frac{(15)^2}{5} = 10$, and likewise $SS_Y = 10$

$SP = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 36 - \frac{(15)(15)}{5} = -9$

$r_S = \frac{SP}{\sqrt{SS_X \, SS_Y}} = \frac{-9}{10} = -0.9$
The Spearman Correlation, Examples
Using the simplified formula:

  Raw scores     Ranks       Rank difference
   X     Y       X    Y       D     D^2
   3    12       1    5       4     16
   4    10       2    3       1      1
   8    11       3    4       1      1
  10     9       4    2      -2      4
  13     3       5    1      -4     16

$r_S = 1 - \frac{6 \sum D^2}{n(n^2 - 1)} = 1 - \frac{6(38)}{5(24)} = 1 - 1.9 = -0.9$
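The ranking step and the simplified formula can be sketched as follows. The helper names `mean_ranks` and `spearman_simplified` are invented for this illustration. The ranking helper assigns tied scores the mean of their ranked positions, as described in the calculation slide, although the simplified formula itself is meant for data without ties.

```python
def mean_ranks(scores):
    """Rank scores from 1 (smallest) to n; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of sorted positions i..j that hold the same score.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_simplified(xs, ys):
    """r_S = 1 - 6 * sum(D^2) / (n * (n^2 - 1)), with D = rank Y - rank X."""
    rank_x = mean_ranks(xs)
    rank_y = mean_ranks(ys)
    n = len(xs)
    sum_d_squared = sum((ry - rx) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Raw scores from the example.
x_scores = [3, 4, 8, 10, 13]
y_scores = [12, 10, 11, 9, 3]

print(mean_ranks(y_scores))  # [5.0, 3.0, 4.0, 2.0, 1.0]
r_s = spearman_simplified(x_scores, y_scores)
print(round(r_s, 3))  # -0.9
```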
Introduction to Regression
Figure 15.13: Hypothetical data showing the relationship between SAT scores and college GPA.
A line drawn through the middle of the data serves several purposes:
- The line makes the relationship between SAT and GPA easier to see.
- The line identifies the center, or central tendency, of the relationship.
- The line can be used for prediction.
- The line establishes a precise, one-to-one relationship between each X value and a corresponding Y value.
Introduction to Regression, Linear Equations
- The formula for a straight line: $Y = a + bX$
  - a and b are constants.
  - b is called the slope, and is the amount of change in Y per unit change in X.
  - a is called the Y-intercept, which is the value of Y when X is zero.
- e.g., your local tennis club charges a fee of $5 per hour plus an annual membership fee of $25. The total cost of playing tennis in this club can be described by the linear equation Y = 5X + 25, where X is the number of hours played and Y is the total cost (Figure 15.14).
Introduction to Regression, Least Squares
- Regression is the statistical technique for finding the best-fitting straight line for a set of data. The resulting straight line is called the regression line.
- The least-squares method of defining best fit (Figure 15.15):
  - distance = $Y - Y_{pred}$, where Y = actual score and $Y_{pred}$ = Y score predicted by the line for each X value
  - This distance measures the error of using the line to predict the actual score.
- The least-squares method defines the best-fitting line as the line that minimizes the total squared error, $\sum (Y - Y_{pred})^2$.
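A small sketch can illustrate what "minimizes the total squared error" means. It uses the five (X, Y) pairs and the regression line Ypred = -2 + 1.6X from the lecture's regression example, and compares that line with a few arbitrarily chosen alternative lines. The alternatives and the function name are made up for this illustration.

```python
def total_squared_error(a, b, xs, ys):
    """Sum of squared distances (Y - Ypred)^2 for the line Ypred = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Data from the lecture's regression example.
xs = [7, 4, 6, 3, 5]
ys = [11, 3, 5, 4, 7]

# Error for the regression line Ypred = -2 + 1.6X.
best_error = total_squared_error(-2, 1.6, xs, ys)

# Errors for some other candidate lines (arbitrary, for comparison only).
candidates = [(-2, 1.5), (-1, 1.6), (0, 1.2), (-3, 1.8)]
other_errors = [total_squared_error(a, b, xs, ys) for a, b in candidates]

print(round(best_error, 2))                  # 14.4
print([round(e, 2) for e in other_errors])   # every value is larger than 14.4
```

Trying a handful of lines does not prove the minimum, but it shows the criterion being applied: any change to the slope or intercept increases the total squared error.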
Introduction to Regression, the Equation

$Y_{pred} = a + bX$

$b = \frac{SP}{SS_X}$ or $b = r \frac{s_Y}{s_X}$

$a = M_Y - bM_X$

where $s_Y$ and $s_X$ are the standard deviations of Y and X.
The linear equation above is called the regression equation for Y. This equation results in the least squared error between the data points and the line.
Regression, an Example

  X    Y    X-M_X    Y-M_Y    (X-M_X)^2    (X-M_X)(Y-M_Y)
  7   11      2        5          4             10
  4    3     -1       -3          1              3
  6    5      1       -1          1             -1
  3    4     -2       -2          4              4
  5    7      0        1          0              0

M_X = 5, M_Y = 6, SS_X = 10, SP = 16

$b = \frac{SP}{SS_X} = \frac{16}{10} = 1.6$

$a = M_Y - bM_X = 6 - 1.6(5) = 6 - 8 = -2$

The regression equation: $Y_{pred} = -2 + 1.6X$ (Figure 15.16)
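The slope and intercept in this example can be reproduced with a short sketch, which also checks that the alternative slope formula b = r(sY/sX) gives the same value. The function name `fit_line` and the variable names are placeholders for this illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares line: b = SP / SS_X, a = M_Y - b * M_X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp / ss_x
    a = mean_y - b * mean_x
    return a, b


# Data from the example above.
xs = [7, 4, 6, 3, 5]
ys = [11, 3, 5, 4, 7]
a, b = fit_line(xs, ys)
print(a, b)  # -2.0 1.6

# Cross-check with b = r * (s_Y / s_X), using sample standard deviations.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
r = sp / math.sqrt(ss_x * ss_y)
b_alt = r * math.sqrt(ss_y / (n - 1)) / math.sqrt(ss_x / (n - 1))
print(round(b_alt, 3))  # 1.6
```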
Introduction to Regression, Prediction
- Using the regression equation for prediction: For a person with X = 5, what would be the predicted Y?
  $Y_{pred} = -2 + 1.6X = -2 + 1.6(5) = 6$
- Cautions:
  - The predicted value is not perfect (unless r = +1 or -1). The amount of error depends on the magnitude of r.
  - The regression line should not be used to make predictions for X values that fall outside the range of values covered by the original data.
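One way to build the second caution into a prediction routine is sketched below. The function name, its parameters, and the choice to raise an error are all assumptions made for this illustration. The range limits 3 and 7 are the smallest and largest X values in the lecture's regression example.

```python
def predict_y(x, a=-2.0, b=1.6, x_min=3, x_max=7):
    """Predict Y from the regression equation Ypred = a + b*X.

    Refuses to extrapolate: x must lie within the range of X values
    covered by the original data (3 to 7 in the lecture's example).
    """
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"X = {x} is outside the observed range [{x_min}, {x_max}]"
        )
    return a + b * x


print(predict_y(5))  # 6.0

try:
    predict_y(10)  # outside the observed range of X
except ValueError as err:
    print("No prediction:", err)
```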