Four basic analytical tasks in statistics:
1) Comparing scores across groups – look for differences in means
2) Cross-tabulating categorical variables – look for contingencies
3) Computing correlations among variables – look for covariances
4) Predicting scores on an outcome variable from numerical predictor variables – look for causal effects (or predicted outcomes)
The focus this week is on the 4th task.
“Correlation” (revisited)
Correlation = the strength of the linear association between 2 numeric variables
– It reflects the degree to which the association is described by a “straight-line” relationship
– The degree to which two variables covary or share common variance [“covariance” = a key term]
– It reflects the “commonality” (“predictability”) between the two variables
Note: r² (r-squared) = the proportion of variance that is “shared” or common to both variables (a quick computational sketch follows)
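As a quick illustration, here is a minimal Python sketch (numpy, with made-up paired scores; not part of the slides' SPSS workflow) computing r and the shared variance r²:

```python
import numpy as np

# Hypothetical paired scores on two numeric variables
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2, 1, 4, 3, 7, 5, 8, 9], dtype=float)

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(f"r = {r:.3f}, r-squared (shared variance) = {r**2:.3f}")
```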
“Regression” = a closely related topic
What is the relationship/difference between correlation and regression?
– Correlation = compute the degree to which values of the variables cluster around a straight line: a symmetric description (r_xy = r_yx) and a standardized measure
– Regression = compute the equation for the “best fitting” straight line (Y = a + bX): an asymmetric description (b_yx ≠ b_xy) and an unstandardized measure (usually) – the contrast is demonstrated in the sketch below
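To make the symmetry/asymmetry point concrete, a small Python sketch (made-up data; numpy assumed available) showing that r is the same either way while the two slopes differ:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

r_xy = np.corrcoef(x, y)[0, 1]     # symmetric: same value either direction
b_yx = np.polyfit(x, y, 1)[0]      # slope when regressing Y on X
b_xy = np.polyfit(y, x, 1)[0]      # slope when regressing X on Y
print(r_xy, b_yx, b_xy)            # r is symmetric; the two b's differ
print(np.isclose(b_yx * b_xy, r_xy**2))  # the slopes multiply to r-squared
```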
Linear Regression
So, what’s the deal with “Regression”? Why is “regression” called that?
a) The term was introduced by Francis Galton in the late 19th century to describe the prediction of genetic traits across generations, reflecting imperfect correlations between parents and children
b) It referred to the tendency of extreme values of traits to “regress toward the mean” across successive generations, reflecting Galton’s interest in the heritability of genius and other unusual traits
c) Correct word use: we “regress the dependent variable on the independent variable” (Y = a + b_yx X)
What’s the deal with “Regression”? (cont.) Why is regression used in data analysis?
To describe the functional pattern that links 2 variables together in a correlation – i.e., what are the optimal values of a and b for X and Y?
Two basic uses of regression:
a) Prediction: predict values of one variable (Y) from values of another variable (X), using the linear equation
b) Explanation: estimate the causal influence of one variable (X) on another (Y), based on the measurable correlation; that is, test a causal hypothesis about how Y and X are related
How is regression analysis done? By fitting a straight line to a set of bivariate points (values on 2 variables for the same data units)
– y = a + b_yx x (basic formula for a linear relation)
– y = the dependent variable
– x = the independent variable
– a = the “intercept”
– b_yx = the “slope” of the line
The concern is with fitting the straight line that minimizes the errors of prediction (of y from x): observed = predicted + error
Two ways of expressing the prediction equation:
Ŷ = a + bX   (the predicted value of Y)
or
Y = a + bX + e   (the observed value of Y, including the prediction error e)
“Regression” – How to obtain the straight line that “best fits” the data?
– Rely on a method called “least squares,” which minimizes the sum of the squared errors (the deviations between the line and the data points)
– Yields the best-fitting line to the points
– Yields formulas for a and b (provided in the book)
How to compute the regression coefficients?
By hand calculation:
– Definitional formula (the familiar one)
– Computational formula (no deviation scores)
By SPSS: Analyze → Regression → Linear
Regression Coefficient: Definitional Formula
b_yx = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
Regression Coefficient: Computational Formula
b_yx = [N ΣXY − (ΣX)(ΣY)] / [N ΣX² − (ΣX)²]
Intercept (Constant): Computational Formula
a = Ȳ − b_yx X̄
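These formulas translate directly into code. A minimal Python sketch (illustrative data, numpy assumed) implementing the definitional and computational formulas and confirming they agree:

```python
import numpy as np

def b_definitional(x, y):
    # b = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sum(dx ** 2)

def b_computational(x, y):
    # b = (N*sum(XY) - sum(X)*sum(Y)) / (N*sum(X^2) - (sum(X))^2)
    n = len(x)
    return (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x ** 2) - x.sum() ** 2)

x = np.array([1, 3, 4, 6, 8], dtype=float)
y = np.array([2, 5, 7, 8, 11], dtype=float)
b = b_definitional(x, y)
a = y.mean() - b * x.mean()                    # intercept: a = Ybar - b*Xbar
assert np.isclose(b, b_computational(x, y))    # the two formulas agree
print(f"Y-hat = {a:.2f} + {b:.2f}X")
```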
“Regression” – Use the example from the Fox/Levin/Forde text (p. 277) (handout): predicting sentence length in months (Y) from the number of prior charges (X)
Column sums over the N = 10 cases (the individual data rows are in the handout):
    # Priors (X)    Sentence (Y)    X²       Y²       XY
Σ:       40             260         260      8010     1340
Regression Example (cont.)
b = [N ΣXY − (ΣX)(ΣY)] / [N ΣX² − (ΣX)²]
  = [10(1340) − (40)(260)] / [10(260) − (40)²]
  = 3000 / 1000 = 3.0
a = Ȳ − bX̄ = 26 − (3.0)(4) = 14.0
Prediction equation: Ŷ = 14.0 + 3.0X
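As a check on the hand arithmetic, a short Python sketch computing b and a directly from the column sums above (N = 10):

```python
# Summary statistics from the sentencing example
N, Sx, Sy, Sxx, Sxy = 10, 40, 260, 260, 1340

b = (N * Sxy - Sx * Sy) / (N * Sxx - Sx ** 2)   # (13400-10400)/(2600-1600) = 3.0
a = Sy / N - b * (Sx / N)                        # 26 - 3*4 = 14.0
print(f"b = {b}, a = {a}")                       # Y-hat = 14 + 3X
```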
Regression example (continued) – [figure: scatterplot of the data points with the fitted line Ŷ = 14 + 3X]
Regression (continued) – How to interpret the results?
Slope (b) = the predicted change in Y for a 1-unit change in X
– Unstandardized b = in the original units/metric
– Standardized b (β, “beta”) = in standard (Z) units (see the sketch below for the conversion)
Intercept (a) = the predicted value of Y when X = 0
– Interpretable only when zero is a meaningful value of X
– Also called the “constant” term, since it is the same for all values of X
R (multiple r) = the correlation between Y and the predictor(s) (the predictability of Y from the Xs)
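One useful relation: with a single predictor, the standardized slope equals r, since β = b · (s_x / s_y). A small Python sketch (made-up data, numpy assumed) illustrating the conversion:

```python
import numpy as np

x = np.array([1, 3, 4, 6, 8], dtype=float)
y = np.array([2, 5, 7, 8, 11], dtype=float)

b = np.polyfit(x, y, 1)[0]                   # unstandardized slope
beta = b * x.std(ddof=1) / y.std(ddof=1)     # standardized slope (Z units)
r = np.corrcoef(x, y)[0, 1]
print(beta, r)                               # with one predictor, beta equals r
```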
Regression (continued) – What are the assumptions/requirements of regression?
1. Numeric variables (interval or ratio level)
2. A linear relationship between the variables
3. Random sampling
4. Normal distribution of the data
5. Homoscedasticity (equal conditional variances) – a quick visual check of assumptions 2 and 5 is sketched below
What if the assumptions do not hold?
1. Don’t worry about small deviations
2. You may be able to transform the variables
3. You may use alternative procedures
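A common informal check of linearity and homoscedasticity is a residual plot. A minimal Python sketch of the idea, using simulated data that meets the assumptions (numpy and matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 14 + 3 * x + rng.normal(0, 5, 100)   # linear relation, constant error variance

b, a = np.polyfit(x, y, 1)               # slope, intercept
resid = y - (a + b * x)

# Residuals vs. fitted values: a flat, even band suggests linearity and
# homoscedasticity; a curve or funnel shape suggests a violation.
plt.scatter(a + b * x, resid)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```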
Regression (continued) – How to test for the significance of the results?
– F-test for the overall regression
– t-test for the individual b coefficients (both illustrated in the sketch below)
What is R? (or R²?) R² = the proportion of the variance in Y explained by the predictor(s)
Can we use more than one independent variable?
– Yes – it’s called “multiple regression”
– Regress a single dependent variable (Y) on multiple independent variables (a linear combination that best predicts Y)
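SPSS reports these tests in its output; the same quantities can be obtained in Python with the statsmodels package (assumed installed), shown here on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 14 + 3 * x + rng.normal(0, 5, 50)

X = sm.add_constant(x)               # adds the intercept (constant) column
fit = sm.OLS(y, X).fit()             # ordinary least squares

print(fit.fvalue, fit.f_pvalue)      # F-test for the overall regression
print(fit.tvalues, fit.pvalues)      # t-tests for a and b individually
print(fit.rsquared)                  # R-squared
```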
Multiple Regression – addenda
The simultaneous analysis of the regression of a dependent variable on 2 or more independent variables:
Yᵢ = a + b₁X₁ + b₂X₂ + b₃X₃ + eᵢ
All coefficients are computed at once
– In this case, the b coefficients are partial regression coefficients
– They reflect the unique predictive ability of each variable (with the covariance of the other independent variables “partialled out”)
Multiple Regression – What is multiple regression good for?
It allows us to estimate, and to test causal theories about:
– The combined effects of multiple variables
– The unique effects of individual variables
In this case, R² measures how well the entire model does in predicting Y
The overall F-test refers to the whole set of variables
The t-tests apply to the coefficients of each variable
(The sketch below shows these quantities in a fitted model.)
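A minimal sketch of a multiple regression in Python (statsmodels assumed installed; simulated data with two correlated predictors), showing the partial coefficients, R², the overall F-test, and the per-coefficient t-tests:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # correlated predictors
y = 14 + 3 * x1 + 2 * x2 + rng.normal(0, 2, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.params)        # a, b1, b2 -- partial regression coefficients
print(fit.rsquared)      # how well the whole model predicts Y
print(fit.fvalue)        # overall F-test for the set of predictors
print(fit.pvalues)       # t-test p-values for each coefficient
```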