3 basic analytical tasks in bivariate (or multivariate) analyses:
- Comparisons look for differences
- Cross-tabs look for contingencies
- Correlations look for covariances
These seem different (in when, where, and how we use them) but are fundamentally comparable (in their analytical logic)
4 basic questions in bivariate (or multivariate) analyses:
- Is there a relationship? (statistical significance)
- What is the pattern?
- How strong is it? (measures of association)
- [plus one additional non-statistical question] What does it mean? (substantive importance or theoretical interpretation)
“Correlation” (revisited)
Correlation = strength of the linear association between 2 numeric variables
- It reflects the degree to which the association is described by a “straight-line” relationship
- It reflects the degree to which the two variables covary or share common variance [“covariance” = a key term]
- It reflects the “predictability” or “commonality” between the two variables
Note: r² (r-squared) = the proportion of variance that is shared or common to both variables
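The shared-variance idea is easy to check numerically; a minimal Python sketch (the data here are made up for illustration):

```python
import numpy as np

# Hypothetical paired observations on two numeric variables
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 5.0])

# Pearson r: covariance scaled by both standard deviations
r = np.corrcoef(x, y)[0, 1]

# r-squared: proportion of variance common to both variables
r_squared = r ** 2
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```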
“Regression” = closely related topic
What is the relationship/difference between correlation and regression?
- Correlation = compute the degree to which values of both variables cluster around a straight line
  - It is a symmetric description (rxy = ryx)
- Regression = compute the equation for the “best-fitting” straight line (Y = a + bX)
  - It is an asymmetric description (byx ≠ bxy)
Why is regression used?
- To describe the functional pattern that links 2 variables together – what are the values of a and b for X & Y?
- To predict values of one variable from the other
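The symmetric-vs-asymmetric distinction can be verified directly; a small Python sketch with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Correlation is symmetric: r_xy equals r_yx
assert np.isclose(np.corrcoef(x, y)[0, 1], np.corrcoef(y, x)[0, 1])

def slope(predictor, outcome):
    # Least-squares slope for regressing `outcome` on `predictor`
    d = predictor - predictor.mean()
    return np.sum(d * (outcome - outcome.mean())) / np.sum(d ** 2)

# Regression is asymmetric: the slope from regressing Y on X
# differs from the slope from regressing X on Y
print(slope(x, y), slope(y, x))
```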
“Regression”
Linear regression is the computational procedure of fitting a straight line to a set of bivariate points
What does regression tell us about the bivariate relationship between Y & X?
Y = a + bX (basic formula for a linear relation)
- a = the intercept
- b = the “slope” of the line
“Regression”
Why is it called “regression”?
- It is admittedly a confusing, unhelpful name
- The name reflects peculiarities of its historical development (in Galton's studies of genetics and the inheritability of genius, where offspring “regressed” toward the mean)
“Regression”
How do we obtain the straight line that “best fits” the data?
- Rely on a method called “least squares,” which minimizes the sum of the squared errors (deviations) between the line and the data points
- Yields the best-fitting line to the points
- Yields formulas for a and b (provided in the book)
How to compute the regression coefficients?
- By hand: the definitional formula (the familiar one) or the computational formula (no deviation scores)
- By SPSS: Analyze → Regression → Linear
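Outside SPSS, both hand formulas are a few lines of code; a Python sketch (with made-up data) showing that the definitional and computational formulas give the same slope:

```python
import numpy as np

# Hypothetical bivariate data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
n = len(x)

# Definitional formula: products of deviation scores
b_def = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Computational formula: raw sums, no deviation scores
b_comp = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)

# Intercept: a = Y-bar minus b times X-bar
a = y.mean() - b_comp * x.mean()

assert np.isclose(b_def, b_comp)   # the two formulas agree
print(f"b = {b_comp}, a = {a}")
```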
Regression Coefficient: Definitional Formula
b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
Regression Coefficient: Computational Formula
b = [NΣXY − (ΣX)(ΣY)] / [NΣX² − (ΣX)²]
Intercept (Constant): Computational Formula
a = Ȳ − bX̄
“Regression”
Use Example from Fox/Levin/Forde text (p. 277) (handout)

Prior Charges (X)   Sentence in mos (Y)      X²      Y²      XY
        0                  12                 0      144       0
        3                  13                 9      169      39
        1                  15                 1      225      15
        0                  19                 0      361       0
        6                  26                36      676     156
        5                  27                25      729     135
        3                  29                 9      841      87
        4                  31                16      961     124
       10                  40               100     1600     400
        8                  48                64     2304     384
  ΣX = 40            ΣY = 260          ΣX² = 260   ΣY² = 8010   ΣXY = 1340
Regression Example (cont.)
b = [NΣXY − (ΣX)(ΣY)] / [NΣX² − (ΣX)²] = [10(1340) − (40)(260)] / [10(260) − (40)²] = 3000 / 1000 = 3.0
a = Ȳ − bX̄ = 26 − (3.0)(4) = 14.0
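The hand calculation can be replicated in a few lines of Python using the prior-charges/sentence data from the handout (the X values of 0 are inferred from the column sums):

```python
import numpy as np

# Prior charges (X) and sentence in months (Y), Fox/Levin/Forde p. 277 example
x = np.array([0, 3, 1, 0, 6, 5, 3, 4, 10, 8], dtype=float)
y = np.array([12, 13, 15, 19, 26, 27, 29, 31, 40, 48], dtype=float)
n = len(x)

# Computational formulas for slope and intercept
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
a = y.mean() - b * x.mean()
print(f"b = {b}, a = {a}")   # matches the hand result: b = 3.0, a = 14.0

# Predicted sentence for a defendant with 7 prior charges
print(a + b * 7)             # Y-hat = 14 + 3(7) = 35 months
```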
Regression example (continued)
[scatterplot of the data with the fitted regression line Y = 14 + 3X]
Regression (continued) - How to interpret the results?
- Slope (b) = predicted change in Y for a 1-unit change in X
  - Unstandardized slope (b) = in the original units/metric
  - Standardized slope (β) = in standard (Z) units
- Intercept (a) = predicted value of Y when X = 0
  - Interpretable only when zero is a meaningful value of X
  - Also called the “constant” term since it is the same for all values of X
- R (multiple R) = correlation between Y and the predictor(s) (predictability of Y from the Xs)
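The unstandardized and standardized slopes are linked by the two standard deviations; a sketch using the same handout data (with a single predictor, β equals Pearson's r):

```python
import numpy as np

x = np.array([0, 3, 1, 0, 6, 5, 3, 4, 10, 8], dtype=float)          # prior charges
y = np.array([12, 13, 15, 19, 26, 27, 29, 31, 40, 48], dtype=float)  # sentence (mos)

dx = x - x.mean()
b = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)   # unstandardized slope

# Standardized slope: convert b into standard-deviation (Z-score) units
beta = b * x.std(ddof=1) / y.std(ddof=1)

# With one predictor, beta equals the Pearson correlation
r = np.corrcoef(x, y)[0, 1]
print(beta, r)
```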
Regression (continued)
What are the assumptions/requirements of correlation (and regression)?
- Numeric variables (interval or ratio level)
- Linear relationship between variables
- Random sampling
- Normal distribution of data
- Homoscedasticity (equal conditional variances)
What if the assumptions do not hold?
- May be able to transform variables
- May use alternative procedures
Regression (continued)
How to test for significance of results?
- F-test for the overall regression
- t-test for individual b coefficients
What is the relation between b and r? What is R?
Can we use more than one independent variable?
- Yes – it's called “multiple regression”
- Regress a single dependent variable (Y) on multiple independent variables (a linear combination that best predicts Y)
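Both tests can be computed directly from sums of squares; a numpy sketch (same handout data) illustrating that with one predictor the overall F equals t², and that b = r(sy/sx):

```python
import numpy as np

x = np.array([0, 3, 1, 0, 6, 5, 3, 4, 10, 8], dtype=float)
y = np.array([12, 13, 15, 19, 26, 27, 29, 31, 40, 48], dtype=float)
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
syy = np.sum((y - y.mean()) ** 2)

b = sxy / sxx
sse = syy - b * sxy                 # residual (error) sum of squares
mse = sse / (n - 2)                 # mean squared error

t = b / np.sqrt(mse / sxx)          # t-test for the slope b
F = (syy - sse) / mse               # overall F-test of the regression

# Relation between b and r: b = r * (sy / sx)
r = sxy / np.sqrt(sxx * syy)
print(t, F, b, r)
```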
Multiple Regression - addenda
Simultaneous analysis of the regression of a dependent variable on 2 or more independent variables:
Yi = a + b1X1 + b2X2 + b3X3 + ei
- All coefficients are computed at once
- In this case, the b coefficients are partial regression coefficients
- They reflect the unique predictive ability of each variable (with the covariance of the other independent variables “partialled out”)
Multiple Regression
What is multiple regression good for? It allows us to estimate:
- The combined effects of multiple variables
- The unique effects of individual variables
In this case, R and R² measure how well the entire set of independent variables does in predicting or explaining Y.
- The overall F-test of the regression refers to the whole set of independent variables
- The t-tests are for the individual (partial) coefficients of each variable by itself
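A minimal multiple-regression sketch in Python (the data are simulated, not from the text): the design matrix gets a column of 1s for the constant a, all coefficients are estimated at once by least squares, and R² summarizes the whole set of predictors:

```python
import numpy as np

# Simulated data: Y depends on two predictors plus random noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix: constant term plus each independent variable
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of a, b1, b2 (computed simultaneously)
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coefs

# R-squared: how well the entire set of predictors explains Y
y_hat = X @ coefs
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(coefs.round(2), round(r2, 3))
```

Each estimated b here is a partial coefficient: it reflects that predictor's unique contribution with the other predictor held constant.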