Simple Linear Regression
Review Correlation: statistical technique used to describe the relationship between two interval or ratio variables Defined by r Sign tells us direction of relationship (+ vs. -) Value tells us consistency of relationship (-1 to +1) But what if we want to do more?
Regression = Prediction Let’s say Amherst declares war on Northampton because Northampton tries to lure Judie's into moving out of Amherst. We are all pacifists here in the Happy Valley, so we decide to settle our differences with a rousing game of Trivial Pursuit! You are elected the Captain of Amherst’s team (as if you would be selected instead of me). How are you going to choose the team? Multiple criteria: Knowledge Performance under pressure Speed
Life Satisfaction Correlation: Do any of the following factors predict life satisfaction? Money Grades Relationship status Health Regression: If I know about your ________, (finances, GPA, relationships, health), I can predict your life satisfaction.
History - WWII We needed to place a ton of people into positions in which they would excel?
Relational Aggression (x) Regression: statistical technique used to define the best fit equation that predicts the DV from one of more IV’s Person Relational Aggression (x) Popularity (y) A 5 4 B C 3 D E F 2 G H I J K L 1 M N Best fit line: identifies the central tendency of the relationship between the IV and the DV
The Regression Line Basics of linear equations Ŷ = 0 + 1X …..(or perhaps you know it as y = mx + b) Ŷ = predicted dependent variable (e.g., popularity) X = independent (predictor) variable (e.g., relational aggression) 1 = slope 0 = y-intercept
Interpretation of y-intercept and slope Predicted value of Y when X is zero Intercept only makes sense if X value can go down to zero. e.g. heights (x) predicting weights (y) Slope (1) Change in y for a one unit change in x. + implies direct relationship – implies inverse relationship
College GPA' = (0.675)(High School GPA) + 1.097 Ŷ = 0.675 X + 1.097 What would be the predicted GPA of someone with a 3.0 high school GPA? College GPA' = (0.675)(3.0) + 1.097 = 3.122.
College GPA' = (0.675)(High School GPA) + 1.097 Ŷ = 0.675 X + 1.097 What would be the predicted GPA of someone with a 2.0 high school GPA? College GPA' = (0.675)(2.0) + 1.097 = 2.447.
College GPA' = (0.675)(High School GPA) + 1.097 X X College GPA' = (0.675)(High School GPA) + 1.097 Ŷ = 0.675 X + 1.097 College GPA' = (0.675)(3.0) + 1.097 = 3.122. College GPA' = (0.675)(2.0) + 1.097 = 2.447 3.122 – 2.447=0.675 (slope: change in Y for each 1 pt change in X) X X
How do we calculate a regression equation? Least-Squares method Sum of the vertical distance between each point and the line = 0. Square of the vertical distance is as small as possible. Unbiased Minimum variability
Calculating the slope and y-intercept Slope (1) = SP / SSx Intercept (0)= My – (1* Mx)
Leaving nothing to chance You’ve worked hard all semester long, but want to make sure you get the best possible grade. Would bribery help? Maybe. Let’s run a regression model to see if there is a relationship between monetary contributions to Jake and Abby’s College Fund and ‘Bonus Points for Special Contributions to Class’*. * Note: this is a humorous (?) example, not an actual invitation to bribery.
Defining the regression line Subject X Y x2 xy Peeta 4 1 16 Rue 8 9 64 72 Katniss 2 5 10 Thresh 6 36 30 x = 20 y = 20 (x2) = 120 (xy) = 116 SSx = (x2) – [(x)2 / n] = 120 – [(20)2 / 4] = 120 – (400 / 4) = 120 – 100 = 20 SP = (xy) – [(x)y) / n] = 116 – [(20)(20) / 4] = 116 – (400 /4) = 116 – 100 = 16
Defining the regression line Subject X Y x2 Xy Peeta 4 1 16 Rue 8 9 64 72 Katniss 2 5 10 Thresh 6 36 30 x = 20 y = 20 (x2) = 120 (xy) = 116 SSx= 20 SP= 16 1 = SP / SSx = 16 / 20 = 0.8 0 = My – (b* Mx) = 5 – (.8)(5) = 1.0
Rue Katniss Thresh Peeta Ŷ = 0.8x+1
Least Squares Method Sum of the vertical distance between each point and the line = 0. Square of the vertical distance is as small as possible.
Measuring Distance from the Regression Line (Error) Ŷ = 0.8x+1 Rue Katniss Thresh Peeta
Measuring Error Ŷ = 0.8x+1 x y Ŷ = 0.8x+1 Distance (Y-Ŷ) Rue Katniss Thresh Peeta x y Ŷ = 0.8x+1 Distance (Y-Ŷ) Squared-Distance (Y-Ŷ) 2 P 4 1 4.2 -3.2 10.24 R 8 9 7.4 1.6 2.56 K 2 5 2.6 2.4 5.76 T 6 5.8 -0.8 0.64 ∑ = 0 ∑ (Y-Ŷ) 2 =19.20
Standard error of the estimate STANDARD ERROR OF THE ESTIMATE: measure of standard distance between predicted Y values and actual Y values
x y Ŷ = 0.8x+1 Distance (Y-Ŷ) Squared-Distance (Y-Ŷ) 2 4 1 4.2 1 – 4.2 = -3.2 10.24 8 9 7.4 9 – 7.4 = 1.6 2.56 2 5 2.6 5 – 2.6 = 2.4 5.76 6 5.8 5 – 5.8 = -0.8 0.64 ∑ =0 ∑ =19.20 Standard Error of the Estimate “On Average, predicted scores differed from actual scores by 3.1 points.”
Standard Error and Correlation STE = 0 Small STE Large STE
Let it snow (3x) I’m not a giant fan of winter, but I do kind of like sledding. I’m trying to convince a group of y’all to go sledding with me, but no one is interested. So, I buy a bunch of beer to try to convince you that sledding down Memorial Hill is a good idea.* I let you guys consume what you consider to be an appropriate amount of beer for a Tuesday night and then ask you to rate your interest in sledding on a 10-point scale.
Find the regression equation x (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 3 6 9 36 18 7 49 42 4 16 12 2 8 x = y = (x2) = (y2) = (xy) = Find the regression equation What is the predicted interest in sledding for someone who has consumed 2 drinks? Calculate the standard error of the estimate
0 = My – (1* Mx) SSx = (x2) – [(x)2 / n] (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 3 6 9 36 18 7 49 42 4 16 12 2 8 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 SSx = (x2) – [(x)2 / n] SP = (xy) – [(x)y)] / n 1 = SP / SSx 0 = My – (1* Mx)
Using regression equations for prediction Ŷ = 1.0(X) + 2 What do we expect Y to be if someone has 4 drinks Ŷ = 1.0(X) + 2 = 1.0(4)+2 = 6 BUT remember… The predicted value is not perfect (unless the correlation between x and y = -1 or +1) The equation cannot be used to make predictions about x values outside of the range of observed x values (e.g. Y when x is 20) DO NOT confuse prediction with causation!
SSY SSresidual SSregression Preview: this will be important for hypothesis testing in regression Total Variability in Y SSregression SSresidual Variability in Y explained by X Variability in Y NOT explained by X
Partitioning Variability in Y Squared correlation (r2) tells us the proportion of variability in Y explained by X e.g. r2 = .65 means X explains 65% of the variability in Y SSregression = (r2)SSY = (.65)SSY
Partitioning Variability in Y Unpredicted variability will therefore = 1- r2 (variability not explained by x) e.g. if r2 = .65 then 35% of the variability in Y is not explained by x (1-.65=.35) SSresidual = (1-r2)SSY = (.35)SSY Alternative calculation of SSresidual = ∑(Y-Ŷ)2
r2 = .582 = .33 SSx = 10 SP = 10 SSy=30 x (#drinks) y (sledding) x2 y2 25 100 50 7 3 6 9 36 18 1 49 42 8 -1 4 16 12 -3 2 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 (Y- = 20 SSx = 10 SP = 10 SSy=30 r2 = .582 = .33
SSregression = (r2)SSY = .33 (30) = 9.9 x (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 7 3 6 9 36 18 1 49 42 8 -1 4 16 12 -3 2 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 (Y- = 20 SSx = 10 SP = 10 SSy=30 r2 = .582 = .33 SSregression = (r2)SSY = .33 (30) = 9.9 SSresidual = (1-r2)SSY = 1-.33 (30) = .67 (30) =20.1
Hypothesis Testing for Regression If there is no real association between x and y in the population we expect the slope to equal zero If the slope for the regression line from our sample does not equal zero we must decide between two alternatives: The nonzero slope is due to random chance error The nonzero slope suggests a systematic real relationship between x and y that exists in the population
Null and Alternative for Regression Ho: the equation does not account for a significant proportion of the variance in the Y scores Ho: b = 0 Ha: the equation does account for a significant proportion of the variance in Y scores. Ha: b 0 Process of testing the regression line Called analysis of regression Very similar to ANOVA F ratio used to determine if variance predicted is significantly greater than would be expected by chance
Calculating the F ratio F = MSregression = = 1 (for simple regression) MSresidual = = n-2 SSY SSresidual SSregression
Steps for Hypothesis Testing in Simple Regression Step 1: Specify the null and alternative hypotheses. Ho: 1 = 0 Ha: 1 0 Step 2: Designate the rejection region by selecting . Step 3: Obtain an F critical value df (1, n-2) Step 4: collect your data Step 5: Use your sample data to calculate the a) regression equation b) r2 c) SS Step 6: Calculate the F ratio Step 7: Compare Fobs with Fcrit Step 8: Interpret results
Full Sledding Example x (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 7 3 6 9 36 18 1 49 42 8 -1 4 16 12 -3 2 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 (Y- = 20 Step 1: Specify the null and alternative hypotheses. Ho: 1 = 0 Ha: 1 0 Step 2: Designate the rejection region by selecting = .05 Step 3: Obtain an F critical value df (1, n-2) = (1, 3) Fcrit = 10.13
Full Sledding Example Step 5: (a) calculate regression equation (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 7 3 6 9 36 18 1 49 42 8 -1 4 16 12 -3 2 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 (Y- = 20 Already Done! Step 5: (a) calculate regression equation Ŷ = 1.0 X + 2
Full Sledding Example r2 = .582 = .33 Step 5: (b) Calculate r2 x (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 7 3 6 9 36 18 1 49 42 8 -1 4 16 12 -3 2 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 (Y- = 20 Step 5: (b) Calculate r2 Already Done! r2 = .582 = .33
Full Sledding Example Step 5: (c) Calculate SS (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 7 3 6 9 36 18 1 49 42 8 -1 4 16 12 -3 2 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 (Y- = 20 Step 5: (c) Calculate SS SSregression = (r2)SSY = 9.9 SSresidual = (1-r2)SSY = 20.1 Already Done!
Full Sledding Example Step 6: Calculate F (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 7 3 6 9 36 18 1 49 42 8 -1 4 16 12 -3 2 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 (Y- = 20 Step 6: Calculate F SSregression =9.9 MSregression = SSregression dfregression MSregression = 9.9 1 = 9.9 SSresidual = 20.1 MSresidual = SSresidual dfresidual MSresidual = 20.1 5−2 =6.7
Full Sledding Example F = MSregression MSresidual = 9.9 6.7 =1.48 (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 7 3 6 9 36 18 1 49 42 8 -1 4 16 12 -3 2 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 (Y- = 20 Step 6: Calculate F MSregression =9.9 MSresidual =6.7 F = MSregression MSresidual = 9.9 6.7 =1.48
Full Sledding Example Step 7: Compare F obs to F crit Fcrit = 10.13 (#drinks) y (sledding) x2 y2 xy (Y- 5 10 25 100 50 7 3 6 9 36 18 1 49 42 8 -1 4 16 12 -3 2 x = 20 y = 30 (x2) = 90 (y2) = 210 (xy) = 130 (Y- = 20 Step 7: Compare F obs to F crit Fcrit = 10.13 Fobs = 1.48 Decision: fail to reject the null Step 8: Interpretation: number of drinks does not account for a significant portion of the variance in attractiveness ratings
Summary Table SS df MS F Regression 9.9 1 1.48 Residual 20.1 3 6.7 Total 30.0 4 Number of drinks did not explain a significant proportion of the variance in willingness to go sledding with me, F(1,3)=1.48, p > .05, R2=.33. Money down the toilet!!
Regression in RC Does weekend bedtime predict hours or TV viewing? Predictor (IV): Week end bed time Response (DV): hours of TV Step 1: Specify the null and alternative hypotheses. Ho: 1 = 0 Ha: 1 0 Step 2: Designate the rejection region by selecting = .05 Step 3: Will judge significance using p-value method!
Regression in RC: WE bed & TV Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -5.0234 2.1189 -2.371 0.02487 * WeekEnd 0.4291 0.1523 2.818 0.00877 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 Residual standard error: 0.7889 on 28 degrees of freedom Multiple R-squared: 0.2209, Adjusted R-squared: 0.1931 F-statistic: 7.94 on 1 and 28 DF, p-value: 0.00877
Regression in RC: WE bed & TV Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -5.0234 2.1189 -2.371 0.02487 * WeekEnd 0.4291 0.1523 2.818 0.00877 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 Residual standard error: 0.7889 on 28 degrees of freedom Multiple R-squared: 0.2209, Adjusted R-squared: 0.1931 F-statistic: 7.94 on 1 and 28 DF, p-value: 0.00877
Writing up results of regression Weekend bedtime was a significant predictor of television viewing, F (1, 28) = 7.94, p = .01, r2 = .2209. The r2 indicates that weekend bedtime explains 22.09% of the variance in TV viewing. The slope of the regression equation was not significant (β1 = 0.4291, p = .01) and indicated that people who went to sleep later tended to watch more TV (shame!). The regression equation relating weekend bedtime and TV viewing was: TV = -5.0234 + .4291(WE).
Standardized Slopes Standardized regression coefficient (β “beta”): standardized slope representing change of x and y in standard deviation units. Example: Ŷ = βx + a Ŷ = .5x + 10 Interpretation of β: for each one SD increase in X there will be a corresponding .5 SD increase in Y Why would we want a standardized coefficient? Gives us a measure of effect size If we have more than one predictor we can compare relative effects on Y
How do we calculate Beta? By Hand: Transform Y scores into z scores Transform X scores into z scores Compute slope using transformed scores (then change will be in standard deviation units) (I won’t make you do this by hand!)
Real data examples
-What is R2? -What is the standardized slope? -What does the standardized slope suggest about change in X vs. Y?
-Is self-esteem more strongly related to positivity or to negativity? -Calculate and interpret R2 for the association between self-esteem and positivity