Linear Regression 1 Sociology 5811 Lecture 19 Copyright © 2005 by Evan Schofer Do not copy or distribute without permission.

1 Linear Regression 1 Sociology 5811 Lecture 19 Copyright © 2005 by Evan Schofer Do not copy or distribute without permission

2 Announcements Final Project Proposals Due next week! Any questions? Today’s Class The linear regression model

3 Review: Linear Functions Linear functions can summarize the relationship between two variables: –Formula: Happy = 2 + 0.00005(Income) Linear functions can also be used to "predict" (estimate) a case's value on one variable (Y_i) from its value on another variable (X_i), if you know the constant and slope "Y-hat" indicates an estimated (predicted) value of Y; b_YX denotes the slope of Y with respect to X
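The prediction step described above can be sketched in Python. The constant and slope are the illustrative values from the Happy/Income formula on the slide; the helper name `predict` is hypothetical:

```python
def predict(a, b, x):
    """Return the estimated value Y-hat = a + b*x for one case."""
    return a + b * x

# Illustrative constant and slope from the slide: Happy = 2 + 0.00005(Income)
a, b = 2.0, 0.00005
print(predict(a, b, 40000))  # predicted happiness for an income of 40,000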

4 Review: The Linear Regression Model The value of any point (Y_i) can be modeled as: Y_i = a + b_YX(X_i) + e_i The value of Y for case (i) is made up of A constant (a) A sloping function of the case's value on variable X (b_YX) An error term (e_i), the deviation from the line By adding error (e), an abstract mathematical function can be applied to real data points

5 Review: The Linear Regression Model Visually: Y_i = a + bX_i + e_i [Figure: plot of the line Y = 2 + .5X] Constant (a) = 2 For Case 7 (X = 3, Y = 5): bX = .5(3) = 1.5, so Y-hat = 3.5 and e = 5 − 3.5 = 1.5

6 Review: Estimating Linear Equations Question: How do we choose the best line to describe our real data? Idea: The best regression line is the one with the smallest amount of error The line comes as close as possible to all points Error is simply deviation from the regression line Note: to make all deviations positive, we square them, producing the "sum of squares error": SS Error = Σ(Y_i − Ŷ_i)²
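The "sum of squares error" idea can be sketched directly in Python. The tiny data set and the two candidate lines below are hypothetical, chosen so one line is obviously better than the other:

```python
def sse(a, b, xs, ys):
    """Sum of squared errors: each error is a point's deviation y - (a + b*x)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data that happens to fall exactly on the line y = 1 + 2x
xs = [0, 1, 2]
ys = [1, 3, 5]

print(sse(1, 2, xs, ys))  # 0: the perfect line has zero error
print(sse(0, 1, xs, ys))  # 14: a worse line accumulates squared error
```

Comparing candidate lines by this number is exactly how "least squares" picks the winner: the best line is the one with the smallest sum.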

7 Review: Estimating Linear Equations A poor estimation (big error) [Figure: the line Y = 1.5 − 1X plotted against the data points]

8 Review: Estimating Linear Equations Better estimation (less error) [Figure: the line Y = 2 + .5X plotted against the same points]

9 Review: Estimating Linear Equations Look at the improvement (reduction) in error: High Error vs. Low Error

10 Review: Estimating Linear Equations Goal: Find values of the constant (a) and slope (b) that produce the lowest squared error –The "least squares" regression line The formula for the slope (b) that yields the "least squares error" is: b_YX = s_YX / s²_X Where s²_X is the variance of X And s_YX is the covariance of Y and X A concept we must now define and discuss

11 Covariance Variance (s²_Y): the sum of squared deviations about Y-bar, over N − 1: s²_Y = Σ(Y_i − Ȳ)² / (N − 1) Covariance (s_YX): the sum of each case's deviation about Y-bar multiplied by its deviation about X-bar, over N − 1: s_YX = Σ(Y_i − Ȳ)(X_i − X̄) / (N − 1)
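The covariance formula translates almost line-for-line into Python. This is a minimal sketch; the sample values in the usage line are hypothetical:

```python
def covariance(xs, ys):
    """Sample covariance: sum of (x - x_bar)(y - y_bar), divided by N - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Y moves in lockstep with X here, so the covariance is positive
print(covariance([1, 2, 3], [2, 4, 6]))  # 2.0
```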

12 Covariance Covariance: a measure of how much deviation in X is accompanied by deviation in Y It measures whether deviation (from the mean) in X tends to be accompanied by similar deviation in Y –Or whether cases with positive deviation in X have negative deviation in Y –This is summed up for all cases in the data The covariance is one numerical measure that characterizes the extent of linear association –As is the correlation coefficient (r)

13 Covariance Covariance: based on multiplying deviation in X and Y [Figure: scatterplot with X-bar = −1, Y-bar = .5] A point that deviates a lot from both means (X deviation = 3, Y deviation = 2.5) contributes (3)(2.5) = 7.5 A point that deviates very little from X-bar and Y-bar (X deviation = .4, Y deviation = −.25) contributes (.4)(−.25) = −.1

14 Covariance [Figure: scatterplot with X-bar = −1, Y-bar = .5] Some points fall above both means (or below both means) Such points contribute positively to the covariance: two positive (or two negative) deviations multiply to give a positive number

15 Covariance [Figure: scatterplot with X-bar = −1, Y-bar = .5] Points falling above one mean but below the other have one positive and one negative deviation, which multiply to give a negative number

16 Covariance Covariance is positive if cases cluster on diagonal from lower-left to upper-right –Cases that deviate positively on X also deviate positively on Y (and negative X with negative Y) Covariance is negative if cases cluster on opposite diagonal (upper-left to lower-right) –Cases with positive deviation on X are negative on Y (and negative on X with positive on Y) If points are scattered all around, positives and negatives cancel out – the covariance is near zero

17 Covariance and Slope Note that the covariance has properties similar to the slope In fact, the covariance can be used to calculate the regression slope that minimizes error across all points –The "Ordinary Least Squares" (OLS) slope

18 Covariance and Slope The slope formula can be written out as follows: b_YX = s_YX / s²_X = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² –The (N − 1) terms in the numerator and denominator cancel

19 Computing the Constant Once the slope has been calculated, it is simple to determine the constant (a): a = Ȳ − b(X̄) Simply plug in the values of Y-bar, X-bar, and b Notes: The calculated value of b is called a "coefficient" The value of a is called the constant
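Putting the slope and constant formulas together gives a complete least-squares fit. This is a sketch; the helper name `ols_fit` and the sample points are hypothetical:

```python
def ols_fit(xs, ys):
    """Return (a, b): slope b = s_YX / s2_X, then constant a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_yx = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s2_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b = s_yx / s2_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical points lying exactly on y = 1 + 2x
print(ols_fit([0, 1, 2], [1, 3, 5]))  # (1.0, 2.0)
```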

20 Regression Example Example: Study time and student achievement. –X variable: Average # hours spent studying per day –Y variable: Score on reading test

Case   X      Y
1      2.6    28
2      1.4    13
3      0.65   17
4      4.1    31
5      0.25   8
6      1.9    16

[Figure: scatterplot of the six cases] X-bar = 1.8, Y-bar = 18.8

21 Regression Example Slope = covariance (X and Y) / variance of X –X-bar = 1.8, Y-bar = 18.8

Case   X      Y     X Dev    Y Dev    XD*YD
1      2.6    28     0.8      9.2      7.36
2      1.4    13    −0.4     −5.8      2.32
3      0.65   17    −1.15    −1.8      2.07
4      4.1    31     2.3     12.2     28.06
5      0.25   8     −1.55   −10.8     16.74
6      1.9    16     0.1     −2.8     −0.28

Sum of X deviation * Y deviation = 56.27

22 Regression Example Calculating the covariance: s_YX = 56.27 / 5 = 11.25 Variance of X: standard deviation of X = 1.4, so s²_X = (1.4)² = 1.96 Finally: b = s_YX / s²_X = 11.25 / 1.96 ≈ 5.7
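The whole worked example can be checked in Python. Note that the slide rounds at each step (means of 1.8 and 18.8, variance 1.96); carrying full precision throughout gives b ≈ 5.73 and a ≈ 8.43:

```python
xs = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9]  # hours studied per day (slide 20)
ys = [28, 13, 17, 31, 8, 16]           # reading test scores

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)

b = cov / var_x        # slope = covariance / variance of X
a = y_bar - b * x_bar  # constant

print(round(b, 2), round(a, 2))  # 5.73 8.43
```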

23 Regression Example Results: Slope b ≈ 5.7, constant a = Ȳ − b(X̄) = 18.8 − 5.7(1.8) ≈ 8.5 Equation: TestScore = 8.5 + 5.7(HrsStudied) Question: What is the interpretation of b? Answer: For every additional hour studied per day, the predicted test score increases by about 5.7 points Question: What is the interpretation of the constant? Answer: Individuals who studied zero hours are predicted to score about 8.5 on the test

24 Computing Regressions Regression coefficients can be calculated in SPSS –You will rarely, if ever, do them by hand SPSS will estimate: –The value of the constant (a) –The value of the slope (b) –Plus, a large number of related statistics and results of hypothesis testing procedures

25 Example: Education & Job Prestige Example: Years of Education versus Job Prestige –Previously, we made an “eyeball” estimate of the line Our estimate: Y = 5 + 3X

26 Example: Education & Job Prestige The actual SPSS regression results for that data: Estimates of a and b: "Constant" = a = 9.427 Slope for "Year of School" = b = 2.487 Equation: Prestige = 9.4 + 2.5(Education) A year of education adds about 2.5 points of job prestige

27 Example: Education & Job Prestige Comparing our “eyeball” estimate to the actual OLS regression line Our estimate: Y = 5 + 3X Actual OLS regression line computed in SPSS

28 Example: Education & Job Prestige Much more information is provided: This information allows us to do hypothesis tests about constant & slope The R and R-Square indicate how well the line summarizes the data

29 R-Square Issue: Even the “best” regression line misses data points. We still have some error. Question: How good is our line at summarizing the relationship between two variables? –Do we have a lot of error? –Or only a little? (i.e., the line closely estimates cases) Specifically, does knowledge of X help us accurately understand values of Y? Solution: The R-Square statistic –Also called “coefficient of determination”

30 R-Square Variance around Y-bar can be split into two parts: [Figure: plot of the line Y = 2 + .5X; for each point, the gap between the line and Y-bar is the "explained variance" and the gap between the point and the line is the "error variance"]

31 R-Square The total variation of a case Y_i around Y-bar can be partitioned into two parts (like ANOVA): 1. Explained variance –Also called "regression variance" –The variation predicted based on the line 2. Error variance –The variation not accounted for by the line Summing the squared deviations over all cases gives us: Σ(Y_i − Ȳ)² = Σ(Ŷ_i − Ȳ)² + Σ(Y_i − Ŷ_i)² –That is, SS Total = SS Regression + SS Error

32 R-Square The R-Square statistic is computed as follows: R² = SS Regression / SS Total = 1 − (SS Error / SS Total) Question: What is R-square if the line is perfect? (i.e., it hits every point and there is no error) Answer: R-square = 1.00 Question: What is R-square if the line is NO HELP in estimating points (lots of error)? Answer: R-square is zero
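A minimal sketch of the R-square computation, using a hypothetical data set and two hypothetical fitted lines that illustrate the two extreme answers above:

```python
def r_square(a, b, xs, ys):
    """R-square = 1 - (SS Error / SS Total), the share of variance explained."""
    y_bar = sum(ys) / len(ys)
    ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_error / ss_total

xs = [0, 1, 2]
ys = [1, 3, 5]

print(r_square(1, 2, xs, ys))  # 1.0: the line hits every point
print(r_square(3, 0, xs, ys))  # 0.0: a flat line at y_bar explains nothing
```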

33 R-Square Properties of R-square: 1. Tells us the proportion of all variance in Y that is explained as a linear function of X –It measures "how good" our line is at predicting Y 2. Ranges from 0 to 1 –1 indicates perfect prediction of Y by X –0 indicates that the line explains no variance in Y The R-square indicates how well a variable (or group of variables) accounts for variation in Y

34 Interpreting R-Square R-square is often used as an overall indicator of the "success" of a regression model Higher R-square is considered "better" than lower How high an R-square is "good enough"? –It varies depending on the dependent variable –Orderly phenomena can yield R-square > .9 –"Messy", random phenomena can yield values like .05 –Look at the literature to know what you should expect

35 Interpreting R-Square But finding variables that produce a high R-square is not the only important goal –Not all variables that generate a high R-square are sensible to include in a regression analysis –Example: Suppose you want to predict annual income Hourly wage is a very good predictor… Because it is tautologically linked to the dependent variable More sociologically interesting predictors would be social class background, education, race, etc. –Example: Conservatism predicts approval of Bush

