Statistics for the Social Sciences (presentation transcript)

1 Statistics for the Social Sciences
Psychology 340, Spring 2010: Prediction

2 Outline
Simple bivariate regression and the least-squares fit line
The general linear model
Residual plots
Using SPSS
Multiple regression
Comparing models (delta r2)

3 Regression
Last time: with correlation, we examined whether variables X & Y are related.
This time: with regression, we try to predict the value of one variable given what we know about the other variable and the relationship between the two.

4 Regression
Last time: "it doesn't matter which variable goes on the X-axis or the Y-axis."
For regression this is NOT the case:
The variable that you are predicting goes on the Y-axis (the criterion, or predicted, variable), e.g., quiz performance.
The variable that you are making the prediction based on goes on the X-axis (the predictor variable), e.g., hours of study.
[Scatterplot: hours of study (X) vs. quiz performance (Y)]

5 Regression
Last time: "imagine a line through the points." But there are lots of possible lines; one of them is the "best fitting line."
Today: learn how to compute the equation corresponding to this "best fitting line."
[Scatterplot: hours of study (X) vs. quiz performance (Y), with candidate lines through the points]

6 The equation for a line
A brief review of geometry: Y = (X)(slope) + (intercept)
The intercept is the value of Y when X = 0 (2.0 in the plotted example).
[Plot: a line crossing the Y-axis at 2.0]

7 The equation for a line
A brief review of geometry: Y = (X)(slope) + (intercept)
slope = (change in Y) / (change in X); in the plotted example, a rise of 1 over a run of 2 gives a slope of 0.5.
[Plot: the same line, showing the rise over the run]

8 The equation for a line
A brief review of geometry, putting the pieces together:
Y = (X)(slope) + (intercept)
Y = (X)(0.5) + 2.0
[Plot: the line Y = 0.5X + 2.0]

9 Regression
A brief review of geometry: consider a perfect correlation, with the line Y = (X)(0.5) + (2.0). If X = 5, what is Y?
Y = (5)(0.5) + (2.0) = 2.5 + 2.0 = 4.5
We can make specific predictions about Y based on X.
[Plot: the line with the predicted point (5, 4.5) marked]
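In code, the slide's arithmetic is a one-line function. A minimal sketch (the function name predict is ours, not the slides'):

```python
# Predicting Y from X on a known line; slope 0.5 and intercept 2.0
# come from the slide's example.

def predict(x, slope=0.5, intercept=2.0):
    """Return the predicted Y for a given X on the line Y = slope*X + intercept."""
    return slope * x + intercept

print(predict(5))  # 4.5, matching the slide: Y = (5)(0.5) + 2.0
```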

10 Regression
Consider a less-than-perfect correlation. The line still represents the predicted values of Y given X:
Y = (5)(0.5) + (2.0) = 4.5
[Scatterplot: points scattered around the line, with the prediction at X = 5 marked]

11 Regression
The "best fitting line" is the one that minimizes the error (the differences) between the predicted scores (the line) and the actual scores (the points). Rather than comparing the errors from different candidate lines and picking the best, we will directly compute the equation for the best fitting line.
[Scatterplot with the best fitting line]

12 Regression
The linear model: Y = intercept + slope (X) + error, often written Y = a + bX + error, where a is the intercept and b is the slope.
The coefficients (betas, β) are sometimes called parameters. They come in two types: standardized and unstandardized.
Now let's go through an example computing these things.

13 Scatterplot
Using the dataset from our correlation lecture:
X: 6, 1, 5, 3, 3
Y: 6, 2, 6, 4, 2
[Scatterplot of the five (X, Y) points on axes from 1 to 6]

14 From the Computing Pearson's r lecture

X   Y   X - MX   Y - MY   (X - MX)^2   (X - MX)(Y - MY)
6   6     2.4      2.0       5.76            4.8
1   2    -2.6     -2.0       6.76            5.2
5   6     1.4      2.0       1.96            2.8
3   4    -0.6      0.0       0.36            0.0
3   2    -0.6     -2.0       0.36            1.2

means: MX = 3.6, MY = 4.0; sums: SSX = 15.20, SSY = 16.0, SP = 14.0

15 Computing regression line (with raw scores)
slope: b = SP / SSX = 14.0 / 15.20 = 0.92
intercept: a = MY - (b)(MX) = 4.0 - (0.92)(3.6) = 0.688
The regression line is therefore Y-hat = (0.92)X + 0.688.
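The same computation can be reproduced directly from the five data points. A minimal sketch in Python (our choice of tool; the lecture itself uses SPSS), using b = SP/SSX and a = MY - b*MX as above:

```python
# Least-squares slope and intercept from the raw scores on slides 13-15.

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)

mx = sum(X) / n                                        # 3.6
my = sum(Y) / n                                        # 4.0

ss_x = sum((x - mx) ** 2 for x in X)                   # 15.20
sp = sum((x - mx) * (y - my) for x, y in zip(X, Y))    # 14.0

b = sp / ss_x        # slope = SP / SSX, about 0.92
a = my - b * mx      # intercept, about 0.684 (0.688 if b is rounded to 0.92 first)

print(f"Y-hat = {b:.2f}X + {a:.3f}")
```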

16 Computing regression line (with raw scores)
Draw the line Y-hat = (0.92)X + 0.688 through the scatterplot.
[Scatterplot of the five points with the fitted line]

17 Computing regression line (with raw scores)
The two means will be on the line: the point (MX, MY) = (3.6, 4.0) always falls on the least-squares line.
[Scatterplot with the fitted line passing through (3.6, 4.0)]

18 Computing regression line (standardized, using z-scores)
Sometimes the regression equation is standardized: computed from z-scores rather than raw scores.

X - MX   (X - MX)^2   Y - MY   (Y - MY)^2     zX      zY
  2.4       5.76        2.0       4.0        1.38     1.1
 -2.6       6.76       -2.0       4.0       -1.49    -1.1
  1.4       1.96        2.0       4.0        0.80     1.1
 -0.6       0.36        0.0       0.0       -0.34     0.0
 -0.6       0.36       -2.0       4.0       -0.34    -1.1

means: MX = 3.6, MY = 4.0 (the deviations and z-scores each sum to 0.0); SSX = 15.20, SSY = 16.0; std dev: SX = 1.74, SY = 1.79

19 Computing regression line (standardized, using z-scores)
Sometimes the regression equation is standardized (computed from z-scores rather than raw scores).
Prediction model: the predicted z-score on the criterion variable equals the standardized regression coefficient multiplied by the z-score on the predictor variable:
z-hat_Y = (β)(zX)
The standardized regression coefficient (β): in bivariate prediction, β = r.
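A minimal sketch of the same standardized model in Python. It rebuilds the z-scores from slide 18 (dividing by n for the standard deviations, as the slide's 1.74 and 1.79 imply) and uses β = r:

```python
# Standardized (z-score) prediction: z-hat_Y = beta * z_X, with beta = r.

import math

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Standard deviations dividing by n, matching slide 18's 1.74 and 1.79.
sd_x = math.sqrt(sum((x - mx) ** 2 for x in X) / n)
sd_y = math.sqrt(sum((y - my) ** 2 for y in Y) / n)

zx = [(x - mx) / sd_x for x in X]
zy = [(y - my) / sd_y for y in Y]

# In bivariate regression the standardized coefficient beta equals r.
r = sum(a * b for a, b in zip(zx, zy)) / n   # about 0.90

for z in zx:
    print(f"z_X = {z:5.2f} -> predicted z_Y = {r * z:5.2f}")
```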

20 Computing regression line (with z-scores)
[Scatterplot of the z-score pairs (zX, zY), with the standardized regression line passing through the mean point (0.0, 0.0)]

21 Regression
The linear equation isn't the whole thing: Y = intercept + slope (X) + error. We also need a measure of error.
The same line, Y = X(.5) + (2.0) + error, can describe two different relationships: the strength differs in how tightly the points cluster around the line.
[Two scatterplots: the same fitted line, with different spread of points around it]

22 Regression
Error = actual score minus the predicted score.
Measures of error (there are many; here are three of them):
r2 (r-squared), the proportionate reduction in error
Squared error using the prediction model = sum of the squared residuals = SSresidual = SSerror
Standard error of estimate = sqrt(SSresidual / df)
Note: the total squared error when predicting from the mean = SSTotal = SSY

23 R-squared
r2 represents the percent variance in Y accounted for by X.
[Two scatterplots: one where 64% of the variance is explained, one where 25% is explained]

24 Computing Error around the line
Sum of the squared residuals = SSresidual = SSerror. The steps:
Compute the difference between the predicted values and the observed values (the "residuals").
Square the differences.
Add up the squared differences.
[Scatterplot showing the vertical distances from the points to the line]

25 Computing Error around the line
Sum of the squared residuals = SSresidual = SSerror
Start from the data (X: 6, 1, 5, 3, 3; Y: 6, 2, 6, 4, 2; MX = 3.6, MY = 4.0) and find the predicted values of Y (the points on the line).

26 Computing Error around the line
Sum of the squared residuals = SSresidual = SSerror
Predicted value for the first case: 6.2 = (0.92)(6) + 0.688

27 Computing Error around the line
Sum of the squared residuals = SSresidual = SSerror
Predicted values for all five cases:
6.2 = (0.92)(6) + 0.688
1.6 = (0.92)(1) + 0.688
5.3 = (0.92)(5) + 0.688
3.45 = (0.92)(3) + 0.688
3.45 = (0.92)(3) + 0.688

28 Computing Error around the line
Sum of the squared residuals = SSresidual = SSerror
[Scatterplot with the predicted values 6.2, 1.6, 5.3, 3.45, 3.45 marked on the line]

29 Computing Error around the line
Sum of the squared residuals = SSresidual = SSerror
Residuals (observed Y minus predicted Y):
6 - 6.2 = -0.20
2 - 1.6 = 0.40
6 - 5.3 = 0.70
4 - 3.45 = 0.55
2 - 3.45 = -1.45
Quick check: the residuals sum to 0.00.

30 Computing Error around the line
Sum of the squared residuals = SSresidual = SSerror
Squared residuals:
(-0.20)^2 = 0.04
(0.40)^2 = 0.16
(0.70)^2 = 0.49
(0.55)^2 = 0.30
(-1.45)^2 = 2.10
The residuals sum to 0.00; the squared residuals sum to SSERROR = 3.09.

31 Computing Error around the line
Sum of the squared residuals = SSresidual = SSerror = 3.09
Compare this with the total squared error when predicting from the mean (MY = 4.0): SSY = 16.0.

32 Computing Error around the line
Proportionate reduction in error = (SSY - SSERROR) / SSY = (16.0 - 3.09) / 16.0 = 0.81.
Like r2, this represents the percent variance in Y accounted for by X; in fact, it is mathematically identical to r2.
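A minimal sketch tying slides 24-32 together in Python: predicted values, residuals, SSerror, the standard error of estimate, and the proportionate reduction in error. The slides leave the df for the standard error unspecified; n - 2 is assumed here, the usual value for bivariate regression:

```python
# Error around the line, computed from the rounded slope/intercept of slide 15.

import math

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
b, a = 0.92, 0.688                             # rounded slope and intercept

y_hat = [b * x + a for x in X]                 # about 6.2, 1.6, 5.3, 3.45, 3.45
resid = [y - yh for y, yh in zip(Y, y_hat)]    # about -0.20, 0.40, 0.70, 0.55, -1.45

print(f"sum of residuals = {sum(resid):.2f}")  # 0.00, the quick check from slide 29

ss_error = sum(e ** 2 for e in resid)          # about 3.1 (the slides round to 3.09)

# Standard error of estimate = sqrt(SS_residual / df); df = n - 2 assumed here.
df = len(X) - 2
print(f"standard error of estimate = {math.sqrt(ss_error / df):.2f}")

# Proportionate reduction in error vs. predicting everyone at the mean of Y.
my = sum(Y) / len(Y)
ss_y = sum((y - my) ** 2 for y in Y)           # 16.0
print(f"PRE = {(ss_y - ss_error) / ss_y:.2f}") # about 0.81, identical to r**2
```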

33 Seeing patterns in the error
The sum of the residuals should always equal 0: the least-squares regression line splits the data in half.
Additionally, the residuals should be randomly distributed: there should be no pattern to them. If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables.
[Scatterplot with residuals scattered above and below the line]

34 Seeing patterns in the error
Residual plots are useful tools for examining the relationship even further. They are basically scatterplots of the residuals (often transformed into z-scores) against the explanatory (X) variable (or sometimes against the response variable).
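A minimal sketch of such a residual plot, using matplotlib (our choice of tool; the lecture itself uses SPSS):

```python
# Residuals plotted against the explanatory variable, as on slide 34.

import matplotlib.pyplot as plt

X = [6, 1, 5, 3, 3]
resid = [-0.20, 0.40, 0.70, 0.55, -1.45]  # residuals from the slides above

plt.scatter(X, resid)
plt.axhline(0)                 # residuals should scatter randomly around 0
plt.xlabel("X (explanatory variable)")
plt.ylabel("Residual (Y - Y-hat)")
plt.title("Residual plot")
plt.show()
```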

35 Seeing patterns in the error
[Scatter plot and residual plot, side by side]
The scatter plot shows a nice linear relationship. The residual plot shows the residuals falling randomly above and below the line; critically, there doesn't seem to be a discernible pattern to them.

36 Seeing patterns in the error
[Scatter plot and residual plot, side by side]
The scatter plot again shows a nice linear relationship, but the residual plot shows the residuals getting larger as X increases. This suggests that the variability around the line is not constant across values of X, which is referred to as a violation of homogeneity of variance.

37 Seeing patterns in the error
[Scatter plot and residual plot, side by side]
The scatter plot shows what may be a linear relationship. The residual plot suggests that a non-linear relationship may be more appropriate (note how a curved pattern appears in the residual plot).

38 Regression in SPSS
Using SPSS: variables (explanatory and response) are entered into columns; each row is a unit of analysis (e.g., a person).

39 Regression in SPSS Analyze: Regression, Linear

40 Regression in SPSS
Enter:
the predicted (criterion) variable into the Dependent Variable field
the predictor variable into the Independent Variable field

41 Regression in SPSS
The output shows the variables in the model, along with r and r2 (we'll get back to the other numbers in a few weeks). It also reports the unstandardized coefficients: the slope (labeled with the independent variable's name) and the intercept (labeled "Constant").

42 Regression in SPSS
The output also shows the standardized coefficient β (labeled with the independent variable's name). Recall that r = standardized β in bi-variate regression.

43 Hypothesis testing with Regression
A brief review of regression: Y = (X)(slope) + (intercept). We can do hypothesis testing on each of these coefficients.
[Scatterplot with the fitted line]

44 Hypothesis testing with Regression
The coefficients table includes t-tests for both the standardized and unstandardized coefficients. They test the hypotheses:
H0: Slope = 0
H0: Intercept (constant) = 0

45 Multiple Regression
Typically researchers are interested in predicting with more than one explanatory variable. In multiple regression, an additional predictor variable (or set of variables) is used to predict the residuals left over from the first predictor.

46 Multiple Regression
Bi-variate regression prediction model:
Y = intercept + slope (X) + error

47 Multiple Regression
Bi-variate regression prediction model: Y = intercept + slope (X) + error, where the intercept + slope (X) part is the "fit" and the error term is the "residual."
Multiple regression prediction model: Y = intercept + slope1 (X1) + slope2 (X2) + ... + error, with the same "fit" plus "residual" decomposition.
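A minimal sketch of fitting a two-predictor model with NumPy's least-squares solver; the data and variable names are made up for illustration:

```python
# Multiple regression: Y-hat = intercept + b1*X1 + b2*X2, fit by least squares.

import numpy as np

# Hypothetical scores: hours of study (X1), hours of sleep (X2), quiz score (Y).
X1 = np.array([6., 1., 5., 3., 3.])
X2 = np.array([8., 5., 7., 6., 4.])
Y = np.array([6., 2., 6., 4., 2.])

# Design matrix: a column of 1s for the intercept, then the predictors.
A = np.column_stack([np.ones_like(X1), X1, X2])
coefs, *_ = np.linalg.lstsq(A, Y, rcond=None)
intercept, b1, b2 = coefs

fit = A @ coefs        # the "fit" part of the model
residual = Y - fit     # the "residual" part; fit + residual rebuilds Y exactly
print(f"Y-hat = {intercept:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```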

48 Multiple Regression
Multiple regression prediction models: the first, second, third, and fourth explanatory variables each account for part of Y, and the error term is whatever variability is left over.

49 Multiple Regression
Predict test performance based on:
Study time (first explanatory variable)
Test time (second explanatory variable)
What you eat for breakfast (third explanatory variable)
Hours of sleep (fourth explanatory variable)
Whatever variability is left over is the error term.

50 Multiple Regression
Predict test performance based on: study time, test time, what you eat for breakfast, hours of sleep.
Typically your analysis consists of testing multiple regression models to see which "fits" best (comparing the r2s of the models): for example, a one-predictor model versus a two-predictor model versus a three-predictor model.
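A minimal sketch of that model-comparison workflow: fit nested models and compare their r2 values. Data and predictor names are hypothetical:

```python
# Comparing nested regression models by r2 (the "delta r2" idea from the outline).

import numpy as np

def r_squared(predictors, y):
    """Fit OLS with the given predictor columns (plus an intercept); return r2."""
    A = np.column_stack([np.ones(len(y))] + predictors)
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coefs
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

study = np.array([6., 1., 5., 3., 3.])   # hypothetical study-time scores
sleep = np.array([8., 5., 7., 6., 4.])   # hypothetical hours of sleep
score = np.array([6., 2., 6., 4., 2.])   # hypothetical test performance

r2_1 = r_squared([study], score)
r2_2 = r_squared([study, sleep], score)
print(f"Model 1 r2 = {r2_1:.2f}; Model 2 r2 = {r2_2:.2f}; "
      f"delta r2 = {r2_2 - r2_1:.2f}")
```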

51 Multiple Regression
Model #1: There is some covariance between total study time and test performance (r = .6). If we know the total study time, we can predict 36% of the variance in test performance: R2 for the model = .36, leaving 64% of the variance unexplained.
[Diagram: total variability in test performance (the response variable) overlapping with total study time]

52 Multiple Regression
Model #2: Add test time to the model. There is little covariance between test performance and test time (r = .1), but we can explain more of the variance in test performance: R2 for the model = .49, leaving 51% unexplained.
[Diagram: total variability in test performance overlapping with total study time and test time]

53 Multiple Regression
Model #3: There is no covariance between test performance and breakfast food (r = .0). They are not related, so we can NOT explain any more of the variance in test performance: R2 for the model stays at .49, with 51% unexplained.
[Diagram: the breakfast circle does not overlap total variability in test performance]

54 Multiple Regression
Model #4: There is some covariance between test performance and hours of sleep (r = .45), so we can explain more of the variance: R2 for the model = .60, leaving 40% unexplained. But notice what happens with the overlap (covariation between the explanatory variables): you can't just add the r's or r2's.
[Diagram: overlapping circles for study time, test time, hours of sleep, and breakfast against total variability in test performance]
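A minimal sketch of the overlap point: with correlated predictors, the joint R2 is less than the sum of the individual r2's. The data here are synthetic, built so that the two predictors covary:

```python
# Why r2 values don't simply add when predictors overlap (slide 54).

import numpy as np

rng = np.random.default_rng(0)
study = rng.normal(size=200)
sleep = 0.6 * study + 0.8 * rng.normal(size=200)    # correlated with study
score = 0.5 * study + 0.4 * sleep + rng.normal(size=200)

def r_squared(predictors, y):
    A = np.column_stack([np.ones(len(y))] + predictors)
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coefs
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_study = r_squared([study], score)
r2_sleep = r_squared([sleep], score)
r2_both = r_squared([study, sleep], score)
print(f"r2(study) + r2(sleep) = {r2_study + r2_sleep:.2f}, "
      f"but R2(both) = {r2_both:.2f}")   # joint R2 < the sum, because of overlap
```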

55 Multiple Regression in SPSS
Setup as before: variables (explanatory and response) are entered into columns. There are a couple of different ways to use SPSS to compare different models.

56 Regression in SPSS Analyze: Regression, Linear

57 Multiple Regression in SPSS
Method 1: enter all the explanatory variables together.
Enter:
the predicted (criterion) variable into the Dependent Variable field
all of the predictor variables into the Independent Variable field

58 Multiple Regression in SPSS
The output shows the variables in the model, the r for the entire model, and the r2 for the entire model, along with the unstandardized coefficients for var1 and var2 (each labeled with the variable name).

59 Multiple Regression in SPSS
The output also shows the standardized coefficients for var1 and var2 (each labeled with the variable name), alongside the r and r2 for the entire model.

60 Multiple Regression
Which β to use, standardized or unstandardized?
Unstandardized β's are easier to use if you want to predict a raw score based on raw scores (no z-scores needed).
Standardized β's are nice for directly comparing which variable is most "important" in the equation.

61 Multiple Regression in SPSS
Method 2: enter the first model, then add another variable for the second model, etc.
Enter:
the predicted (criterion) variable into the Dependent Variable field
the first predictor variable into the Independent Variable field
then click the Next button

62 Multiple Regression in SPSS
Method 2, continued:
enter the second predictor variable into the Independent Variable field
then click Statistics

63 Multiple Regression in SPSS
Click the ‘R squared change’ box

64 Multiple Regression in SPSS
The output shows the results of two models: the variables in the first model (math SAT) and the variables in the second model (math and verbal SAT).

65 Multiple Regression in SPSS
For Model 1 (math SAT only), the output reports the r2 for the first model and the coefficients for var1 (labeled with the variable name).

66 Multiple Regression in SPSS
For Model 2 (math and verbal SAT), the output reports the r2 for the second model and the coefficients for var1 and var2 (labeled with the variable names).

67 Multiple Regression in SPSS
The change statistics answer the question: is the change in r2 from Model 1 to Model 2 statistically significant?

68 Cautions in Multiple Regression
We can use as many predictors as we wish, but we should be careful not to use more predictors than is warranted: simpler models are more likely to generalize to other samples. If you use as many predictors as you have participants in your study, you can predict 100% of the variance. Although this may seem like a good thing, it is unlikely that your results would generalize to any other sample, and thus they are not valid. You should probably have at least 10 participants per predictor variable (and probably aim for about 30).
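A minimal sketch of this warning: with as many free parameters as participants, R2 reaches 1.0 even when the predictors are pure noise. The data are entirely synthetic:

```python
# Overfitting demo: n observations, n parameters -> a perfect but meaningless fit.

import numpy as np

rng = np.random.default_rng(1)
n = 10
y = rng.normal(size=n)              # a pure-noise "response"
X = rng.normal(size=(n, n - 1))     # n - 1 noise predictors (+ intercept = n parameters)

A = np.column_stack([np.ones(n), X])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coefs
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(f"R^2 = {r2:.3f}")            # ~1.000, yet nothing real is being explained
```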

69 Prediction in Research Articles
Bivariate prediction models are rarely reported; multiple regression results are commonly reported.

70 Hypothesis testing with Regression
Multiple regression, revisited: typically researchers are interested in predicting with more than one explanatory variable. In multiple regression, an additional predictor variable (or set of variables) is used to predict the residuals left over from the first predictor (the "fit" versus "residual" decomposition from slide 47).

71 Hypothesis testing with Regression
We can test hypotheses about each of the explanatory variables (first, second, third, fourth) within a multiple regression model. Each test tells us whether that variable explains a "significant" amount of the variance in the response variable.

72 Multiple Regression in SPSS
Null hypotheses:
H0: Coefficient for var1 = 0. Here p < 0.05, so reject H0: var1 is a significant predictor.
H0: Coefficient for var2 = 0. Here p > 0.05, so fail to reject H0: var2 is not a significant predictor.
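A minimal sketch of these coefficient tests outside SPSS, using the statsmodels package (an assumption; any OLS routine that reports t and p values would do). The data are hypothetical:

```python
# t-tests on each regression coefficient, mirroring the SPSS coefficients table.

import numpy as np
import statsmodels.api as sm

study = np.array([6., 1., 5., 3., 3., 4., 2.])   # hypothetical predictors
sleep = np.array([8., 5., 7., 6., 4., 7., 5.])
score = np.array([6., 2., 6., 4., 2., 5., 3.])   # hypothetical response

X = sm.add_constant(np.column_stack([study, sleep]))
fit = sm.OLS(score, X).fit()

for name, b, t, p in zip(["const", "study", "sleep"],
                         fit.params, fit.tvalues, fit.pvalues):
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{name}: b = {b:.2f}, t = {t:.2f}, p = {p:.3f} -> {verdict}")
```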

73 Hypothesis testing with Regression
We can test hypotheses about each explanatory variable within a multiple regression model, which tells us whether that variable explains a "significant" amount of the variance in the response variable. We can also use hypothesis testing to examine whether the change in r2 is statistically significant.

74 Hypothesis testing with Regression
Method 2, continued:
enter the second predictor variable into the Independent Variable field
then click Statistics

75 Hypothesis testing with Regression
Click the ‘R squared change’ box

76 Hypothesis testing with Regression
For Model 1 (math SAT only), the output reports the r2 for the first model and the coefficients for var1 (labeled with the variable name).

77 Hypothesis testing with Regression
For Model 2 (math and verbal SAT), the output reports the r2 for the second model and the coefficients for var1 and var2 (labeled with the variable names).

78 Hypothesis testing with Regression
The change statistics test whether the change in r2 from Model 1 to Model 2 is statistically significant. In this example it is not (p = 0.46).
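The R-squared-change test can also be computed by hand as an F test on the increment. A minimal sketch, reusing the R2 values from Models 1 and 2 of slides 51-52 and assuming a hypothetical sample size of n = 30:

```python
# F test for the change in R^2 between nested models (what SPSS reports
# under "Change Statistics").

from scipy import stats  # assumption: SciPy is available

def r2_change_test(r2_1, r2_2, k1, k2, n):
    """F test for delta R^2 between nested models with k1 < k2 predictors."""
    df1 = k2 - k1
    df2 = n - k2 - 1
    F = ((r2_2 - r2_1) / df1) / ((1 - r2_2) / df2)
    p = 1 - stats.f.cdf(F, df1, df2)
    return F, p

F, p = r2_change_test(r2_1=0.36, r2_2=0.49, k1=1, k2=2, n=30)
print(f"F = {F:.2f}, p = {p:.3f}")
```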

79 Relating Critical t's and r's
Inferential statistics: two choices (really the same test):
a t-test and the t table
the Pearson's r table (if available)
Showing that rcrit is equivalent to tcrit:
From the r table: α = 0.05, two-tailed, df = n - 2 = 3, rcrit = 0.878
From the t table, with df = n - 2 = 3: tcrit = 3.18
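A minimal sketch verifying the equivalence numerically, using the identity t = r * sqrt(df / (1 - r^2)):

```python
# Converting the critical r into the corresponding critical t.

import math

df = 3            # n - 2 = 3, as on the slide
r_crit = 0.878    # from the r table, alpha = .05, two-tailed

t = r_crit * math.sqrt(df / (1 - r_crit ** 2))
print(f"t = {t:.2f}")  # about 3.18, matching the t table
```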

