1
Statistics for the Social Sciences (Psychology 340, Fall 2013): Correlation and Regression
2
Homework #12 due 11/19: Chapter 16: 1, 2, 7, 8, 10, 22 (use SPSS), 24. (HW #13, the last homework, is due on 11/21.)
3
Last time:
- We reviewed Pearson's r, and how to calculate it from raw scores, from z-scores, and with SPSS
- We learned about Spearman's rho
- We learned how to get a scatter plot using SPSS
- We learned about bivariate regression: one X (predictor, independent) variable is used to predict one Y (outcome, dependent) variable
- We reviewed the formula for a line (y = mx + b) and applied it to regression (Y = a + bX + error, or Y = β₀ + β₁X + error)
4
This time:
- Clarification and review of some regression concepts
- Multiple regression
- Regression in SPSS
- Scatterplots and simple linear (bivariate) regression in Excel
5
Regression is all about prediction. If you want to predict (make an educated guess about) an individual person's score on a variable (Y), what's the best estimate, based on the information available to you?

Suppose you only know the mean of the variable (M_Y) and nothing else. Then the best estimate of any individual's score is the mean (more scores are close to the mean than far from it).

Suppose you also know that the variable is correlated with another variable (X), and you know a person's score on X, but not on Y. Regression uses the person's score on X, combined with what you know about the relationship between X and Y, to get a much more accurate prediction of the person's score on Y than you could get from the mean of Y (M_Y) alone.

If X and Y are uncorrelated, then the best predictor of any individual score on Y is the mean of Y (Y = M_Y + error), and the regression line is flat (has no slope). Note that in this scenario, the average error is based on deviations of Y values from the mean of Y (it's the standard deviation).
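To make this concrete, here is a minimal Python sketch comparing the squared error of the two strategies. It uses the five (X, Y) pairs from the worked example later in these slides, along with the regression line derived there:

```python
# A minimal sketch comparing the two strategies, using the five (X, Y)
# pairs from the worked example later in these slides.
x = [6, 1, 5, 3, 3]
y = [6, 2, 6, 4, 2]
m_y = sum(y) / len(y)                          # M_Y = 4.0

# Total squared error when every prediction is the mean:
ss_mean = sum((yi - m_y) ** 2 for yi in y)     # 16.0

# Total squared error when predicting from the regression line
# (Y-hat = 0.688 + 0.92 * X, derived later in these slides):
ss_line = sum((yi - (0.688 + 0.92 * xi)) ** 2 for xi, yi in zip(x, y))

print(ss_mean, ss_line)                        # 16.0 vs. roughly 3.1
```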
6
Ŷ vs. Y vs. µ_Y:
- Ŷ refers to the predicted value of Y.
- µ_Y refers to the expected (mean) value of Y; in the context of regression, it refers to the expected value of Y for a given value of X.
- Y refers to an actual (observed) value of Y.

If you know an individual's scores for Y and X, and you know the regression equation, you can calculate a residual score (Y − Ŷ) for that individual. By looking at the residuals for a group of individuals, you can determine how good a "fit" the regression model is (i.e., how much error there is).
7
Regression: the "best-fitting line" is the one that minimizes the error (differences) between the predicted scores (the line) and the actual scores (the points). [Scatterplot: data points around a fitted line.] Rather than comparing the errors from different candidate lines and picking the best, we will directly compute the equation for the best-fitting line.
8
Regression: the linear model is Y = intercept + slope(X) + error. The betas (β) are sometimes called parameters, and they come in two types: standardized and unstandardized. Now let's go through an example computing these things.
9
From when we computed Pearson's r:

X    Y    X-M_X   Y-M_Y   (X-M_X)(Y-M_Y)   (X-M_X)²   (Y-M_Y)²
6    6     2.4     2.0          4.8           5.76       4.0
1    2    -2.6    -2.0          5.2           6.76       4.0
5    6     1.4     2.0          2.8           1.96       4.0
3    4    -0.6     0.0          0.0           0.36       0.0
3    2    -0.6    -2.0          1.2           0.36       4.0

M_X = 3.6, M_Y = 4.0; SP = 14.0, SS_X = 15.20, SS_Y = 16.0
10
Computing the regression line (with raw scores): the slope is b = SP / SS_X = 14.0 / 15.20 ≈ 0.92, and the intercept is a = M_Y − b·M_X = 4.0 − (0.92)(3.6) = 0.688. So the best-fitting line is Ŷ = 0.688 + 0.92X.
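As a quick numeric check, here is a short Python sketch that recomputes the slope and intercept from the raw scores above (note the slides round b to 0.92 before computing the intercept, which gives 0.688):

```python
# Recomputing the slope and intercept from the raw scores above.
x = [6, 1, 5, 3, 3]
y = [6, 2, 6, 4, 2]
n = len(x)
m_x, m_y = sum(x) / n, sum(y) / n              # 3.6 and 4.0

sp   = sum((xi - m_x) * (yi - m_y) for xi, yi in zip(x, y))   # SP   = 14.0
ss_x = sum((xi - m_x) ** 2 for xi in x)                       # SS_X = 15.2

b = sp / ss_x       # slope ≈ 0.92
a = m_y - b * m_x   # intercept ≈ 0.684 (0.688 if b is rounded to 0.92 first)
print(b, a)
```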
11
Computing the regression line (with raw scores): [scatterplot of the five (X, Y) data points with the fitted line Ŷ = 0.688 + 0.92X]
12
Computing the regression line (with raw scores): [same scatterplot with the fitted line] The two means will be on the line: the point (M_X, M_Y) = (3.6, 4.0) falls on it.
13
Computing the regression line (with z-scores):

Z_X      Z_Y
 1.38     1.1
-1.49    -1.1
 0.80     1.1
-0.34     0.0
-0.34    -1.1

mean 0.0  0.0

With z-scores, the intercept is 0 and the slope is the correlation itself, so the prediction equation is simply: predicted Z_Y = r · Z_X (here r ≈ 0.90).
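A small Python sketch of the z-score version, assuming the slides' population (SS/N) convention for the standard deviations:

```python
# Z-score regression: standardize, then predict with slope r and intercept 0.
import statistics as st

x = [6, 1, 5, 3, 3]
y = [6, 2, 6, 4, 2]
sd_x, sd_y = st.pstdev(x), st.pstdev(y)        # population (SS/N) SDs
m_x, m_y = st.mean(x), st.mean(y)

z_x = [(xi - m_x) / sd_x for xi in x]          # 1.38, -1.49, 0.80, -0.34, -0.34
z_y = [(yi - m_y) / sd_y for yi in y]          # 1.1, -1.1, 1.1, 0.0, -1.1

r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / len(x)   # r ≈ 0.90
print([round(r * zx, 2) for zx in z_x])        # predicted Z_Y for each case
```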
14
Regression error: the actual score minus the predicted score (Y − Ŷ).

Measures of error:
- r² (r-squared): the proportionate reduction in error
  - Total squared error when predicting from the mean = SS_Total = SS_Y
  - Squared error using the prediction model = sum of the squared residuals = SS_residual = SS_error
15
Computing error around the line:
1. Compute the difference between the predicted values and the observed values (the "residuals").
2. Square the differences.
3. Add up the squared differences.

[Scatterplot showing residuals as vertical distances from the line.] Sum of the squared residuals = SS_residual = SS_error.
16
Computing error around the line: first find the predicted value of Y (the point on the line) for each X in the data (X = 6, 1, 5, 3, 3; M_X = 3.6, M_Y = 4.0). Sum of the squared residuals = SS_residual = SS_error.
17
Computing error around the line: for example, for X = 6 the predicted value is Ŷ = (0.92)(6) + 0.688 = 6.2. Sum of the squared residuals = SS_residual = SS_error.
18
Computing error around the line: the predicted values for every X in the data:
Ŷ = (0.92)(6) + 0.688 = 6.2
Ŷ = (0.92)(1) + 0.688 = 1.6
Ŷ = (0.92)(5) + 0.688 = 5.3
Ŷ = (0.92)(3) + 0.688 = 3.45
Ŷ = (0.92)(3) + 0.688 = 3.45
Sum of the squared residuals = SS_residual = SS_error.
19
Computing error around the line: [scatterplot showing the observed points (X, Y) and the predicted points on the line, Ŷ = 6.2, 1.6, 5.3, 3.45, 3.45]. Sum of the squared residuals = SS_residual = SS_error.
20
Computing error around the line: the residuals (quick check: Y − Ŷ):
6 − 6.2 = −0.20
2 − 1.6 = 0.40
6 − 5.3 = 0.70
4 − 3.45 = 0.55
2 − 3.45 = −1.45
The residuals sum to 0.00. Sum of the squared residuals = SS_residual = SS_error.
21
Computing error around the line: square each residual and add them up:
(−0.20)² = 0.04
(0.40)² = 0.16
(0.70)² = 0.49
(0.55)² = 0.30
(−1.45)² = 2.10
Sum of the squared residuals = SS_residual = SS_error = 3.09.
22
Computing error around the line: SS_error = 3.09, compared with SS_Y = 16.0 (the total squared error when predicting from the mean).
23
Computing error around the line: the proportionate reduction in error is (SS_Y − SS_error) / SS_Y = (16.0 − 3.09) / 16.0 ≈ 0.81. Like r², this represents the proportion of variance in Y accounted for by X; in fact, in bivariate regression it is mathematically identical to r² (here r ≈ 0.90, so r² ≈ 0.81).
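Here is a minimal Python check of this equivalence, using the rounded predicted values from the slides:

```python
# Checking that the proportionate reduction in error equals r squared,
# using the rounded predicted values from the slides.
y     = [6, 2, 6, 4, 2]
y_hat = [6.2, 1.6, 5.3, 3.45, 3.45]

residuals = [yi - yh for yi, yh in zip(y, y_hat)]
ss_error  = sum(e ** 2 for e in residuals)     # ≈ 3.09
ss_y      = sum((yi - 4.0) ** 2 for yi in y)   # 16.0 (M_Y = 4.0)

pre = (ss_y - ss_error) / ss_y                 # ≈ 0.81
print(pre)                                     # and r**2 = 0.898**2 ≈ 0.81
```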
24
Regression in SPSS: running the analysis is pretty easy.
- Analyze → Regression → Linear
- The X (predictor) variable(s) go into the 'Independent(s)' field
- The Y (predicted) variable goes into the 'Dependent' field
You get a lot of output.
25
Regression in SPSS: the output shows the variables in the model, r and r², the unstandardized coefficients (the slope, labeled with the independent variable's name, and the intercept, labeled 'Constant'), and the standardized coefficients. We'll get back to the remaining numbers in a few weeks.
26
In Excel: with the Data Analysis ToolPak add-in you can perform regression analysis. With the standard software package, you can get the bivariate correlation (which is the same as the standardized regression coefficient), you can create a scatterplot, and you can request a trend line, which is a regression line (what is Y and what is X in that case?).
27
Multiple regression prediction models: Y = β₀ + β₁X₁ + β₂X₂ + ... + error. The predicted portion (the β terms) is the "fit"; the error term is the "residual".
28
Prediction in research articles: bivariate prediction models are rarely reported; multiple regression results are commonly reported.
29
Multiple Regression Typically researchers are interested in predicting with more than one explanatory variable In multiple regression, an additional predictor variable (or set of variables) is used to predict the residuals left over from the first predictor.
30
Multiple regression: recall the bivariate regression prediction model, Y = intercept + slope(X) + error.
31
Multiple regression prediction models extend the bivariate model (Y = intercept + slope(X) + error) by adding more predictors: Y = β₀ + β₁X₁ + β₂X₂ + ... + error. The predicted portion is the "fit" and the error term is the "residual".
32
Multiple regression prediction models: Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + error, where X₁ is the first explanatory variable, X₂ the second, X₃ the third, X₄ the fourth, and the error term is whatever variability is left over.
33
Multiple regression example: predict test performance from four explanatory variables: study time, test time, what you eat for breakfast, and hours of sleep; whatever variability is left over is the error.
34
Multiple regression: predict test performance based on study time, test time, what you eat for breakfast, and hours of sleep. Typically your analysis consists of testing multiple regression models to see which "fits" best (comparing the R²s of the models), for example Model 1 versus Model 2 in the slides that follow.
35
Multiple regression, Model #1: the response variable is the total variability in test performance. Total study time correlates with test performance at r = .6, so there is some covariance between the two variables. R² for the model = .36: if we know total study time, we can predict 36% of the variance in test performance, leaving 64% of the variance unexplained.
36
Model #2: add test time to the model (total study time r = .6, test time r = .1). There is little covariance between test performance and test time, but we can explain slightly more of the variance in test performance: R² for the model = .37, leaving 63% unexplained.
37
Model #3: add breakfast food (r = .0). There is no covariance between test performance and breakfast food; since they are not related, we can NOT explain any more of the variance. R² for the model stays at .37, with 63% unexplained.
38
Model #4: add hours of sleep (r = .45). There is some covariance between test performance and hours of sleep, so we can explain more of the variance: R² for the model = .45, leaving 55% unexplained. But notice what happens with the overlap (covariation between the explanatory variables): because the predictors are intercorrelated, you can't just add up the r's or r²'s.
39
Multiple regression: the "least squares" regression equation when there are multiple intercorrelated predictor (X) variables is found by calculating "partial regression coefficients" for each X. A partial regression coefficient for X₁ shows the relationship between Y and X₁ while statistically controlling for the other X variables (i.e., holding the other X variables constant).
40
Multiple regression: with two predictors, the formula for the partial regression coefficient is

b₁ = [(r_Y1 − r_Y2 · r_12) / (1 − r_12²)] · (s_Y / s_1)

where r_Y1 is the correlation of X₁ and Y, r_Y2 is the correlation of X₂ and Y, r_12 is the correlation of X₁ and X₂, s_Y is the standard deviation of Y, and s_1 is the standard deviation of X₁.
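The two-predictor formula can be checked against an ordinary least-squares fit. The sketch below uses simulated data (all numbers and names are made up for illustration):

```python
# Verifying the partial-coefficient formula against a least-squares fit.
import numpy as np

rng = np.random.default_rng(0)                 # made-up illustrative data
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(size=200)           # x1 and x2 are intercorrelated
y  = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=200)

r_y1 = np.corrcoef(y, x1)[0, 1]
r_y2 = np.corrcoef(y, x2)[0, 1]
r_12 = np.corrcoef(x1, x2)[0, 1]

# b1 = [(r_Y1 - r_Y2 * r_12) / (1 - r_12^2)] * (s_Y / s_1)
b1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2) * (y.std() / x1.std())

# The same coefficient from an ordinary least-squares fit:
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b1, coefs[1])                            # the two values agree
```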
41
Multiple regression: the multiple correlation coefficient (R) is an estimate of the relationship between the dependent variable (Y) and the best linear combination of the predictor variables (it is the correlation between Y and the predicted Y). R² tells you the amount of variance in Y explained by the particular multiple regression model being tested.
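A quick sketch (made-up data) confirming that R is simply the correlation between Y and the model's predictions, so its square matches the R² of the fit:

```python
# R is the correlation between y and the model's predictions (made-up data).
import numpy as np

rng = np.random.default_rng(3)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = x1 + 0.5 * x2 + rng.normal(size=50)

X = np.column_stack([np.ones(50), x1, x2])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

R  = np.corrcoef(y, y_hat)[0, 1]                   # multiple correlation
R2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(R ** 2, R2)                                  # identical values
```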
42
Multiple Regression in SPSS Setup as before: Variables (explanatory and response) are entered into columns A couple of different ways to use SPSS to compare different models
43
Regression in SPSS: Analyze → Regression → Linear
44
Multiple regression in SPSS, Method 1: enter all the explanatory variables together.
- Enter all of the predictor variables into the 'Independent(s)' field.
- Enter the predicted (criterion) variable into the 'Dependent' field.
45
Multiple regression in SPSS: the output shows the variables in the model, r for the entire model, r² for the entire model, and the unstandardized coefficients (one coefficient per predictor, each labeled with its variable name).
46
Multiple regression in SPSS: the same table also reports the standardized coefficients (again one per predictor, labeled with its variable name).
47
Multiple regression: which coefficients should you use, standardized or unstandardized?
- Unstandardized b's are easier to use if you want to predict a raw score from raw scores (no z-scores needed).
- Standardized β's are nice for directly comparing which variable is most "important" in the equation.
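The conversion between the two is simple: a standardized coefficient is the unstandardized one rescaled by the predictor's and outcome's standard deviations, β = b · s_x / s_Y. A tiny sketch with made-up numbers:

```python
# Converting an unstandardized coefficient to a standardized one.
# All numbers here are made up for illustration.
b_study = 2.5     # unstandardized: test points gained per hour of study
s_study = 1.2     # SD of study time (hours)
s_test  = 10.0    # SD of test scores (points)

beta_study = b_study * s_study / s_test   # standardized coefficient = 0.30
print(beta_study)
```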
48
Multiple regression in SPSS, Method 2: enter the first model, then add another variable for the second model, etc.
- Enter the predicted (criterion) variable into the 'Dependent' field.
- Enter the first predictor variable into the 'Independent(s)' field.
- Click the 'Next' button.
49
Multiple regression in SPSS, Method 2 continued:
- Enter the second predictor variable into the 'Independent(s)' field.
- Click 'Statistics'.
50
Multiple regression in SPSS: check the 'R squared change' box.
51
Multiple regression in SPSS: the output shows the results of two models, with the variables in the first model (math SAT) and the variables in the second model (math and verbal SAT).
52
Model 1: the output shows the variables in the first model (math SAT), r² for the first model, and the coefficient for the first predictor (labeled with its variable name).
53
Model 2: the output shows the variables in the second model (math and verbal SAT), r² for the second model, and coefficients for both predictors (labeled with their variable names).
54
Multiple regression in SPSS: the change statistics answer the question, is the change in r² from Model 1 (math SAT) to Model 2 (math and verbal SAT) statistically significant?
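Outside SPSS, the same R²-change test can be computed by hand. The sketch below simulates stand-in data for the math/verbal SAT example (all variable names and numbers are made up) and applies the standard F test for the increment in R²:

```python
# Hand computation of the R-squared-change test (simulated stand-in data;
# all names and numbers are made up).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
math_sat   = rng.normal(500, 100, n)
verbal_sat = rng.normal(500, 100, n)
gpa        = 0.003 * math_sat + 0.002 * verbal_sat + rng.normal(0, 0.4, n)

def r_squared(X, y):
    """R^2 from a least-squares fit of y on the columns of X plus an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

r2_1 = r_squared(math_sat, gpa)                                  # Model 1
r2_2 = r_squared(np.column_stack([math_sat, verbal_sat]), gpa)   # Model 2

# F test for the increment: 1 predictor added, 2 predictors in Model 2.
f = (r2_2 - r2_1) / ((1 - r2_2) / (n - 2 - 1))
p = stats.f.sf(f, 1, n - 2 - 1)
print(r2_1, r2_2, f, p)
```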
55
Cautions in multiple regression: we can use as many predictors as we wish, but we should be careful not to use more predictors than is warranted.
- Simpler models are more likely to generalize to other samples.
- If you use as many predictors as you have participants in your study, you can predict 100% of the variance. Although this may seem like a good thing, it is unlikely that your results would generalize to any other sample, and thus they are not valid.
- You should probably have at least 10 participants per predictor variable (and probably should aim for about 30).
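To see why the "as many predictors as participants" case is hollow, here is a minimal sketch (pure noise data) where the fit is nonetheless perfect:

```python
# With as many parameters as participants, even pure noise fits perfectly.
import numpy as np

rng = np.random.default_rng(2)
n = 20
y = rng.normal(size=n)                    # outcome: pure noise
X = np.column_stack([np.ones(n),          # intercept ...
                     rng.normal(size=(n, n - 1))])  # ... plus n-1 noise predictors

y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(r2)                                 # 1.0 (up to floating-point error)
```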