Linear Regression 2 Sociology 5811 Lecture 21 Copyright © 2005 by Evan Schofer Do not copy or distribute without permission
Announcements Proposals Due Today
Review: Regression Regression coefficient formulas: Question: What is the interpretation of a regression slope? Answer: It indicates the typical change in Y for each 1-unit increase in X –Note: this information is less useful if the linear association between X and Y is weak
Example: Education & Job Prestige The actual SPSS regression results for that data: Estimates of a and b: “Constant” = a, Slope for “Year of School” = b Equation: Prestige = a + b(Education) A year of education adds about 2.5 points of job prestige
Review: Covariance Covariance ($s_{YX}$): Sum of deviation about Y-bar multiplied by deviation around X-bar: Measures whether deviation (from mean) in X tends to be accompanied by similar deviation in Y –Or if cases with positive deviation in X have negative deviation in Y –This is summed up for all cases in the data
Review: Covariance Covariance: based on multiplying deviation in X and Y [Figure: scatterplot with Y-bar = .5 and X-bar = -1] One point deviates a lot from both means (dev = 3 and dev = 2.5): (3)(2.5) = 7.5 Another point deviates very little from X-bar, Y-bar: (.4)(-.25) = -.1
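The multiply-and-sum logic above can be traced case by case. A minimal sketch, assuming the usual sample covariance with an N - 1 denominator; the five data points are hypothetical and are not the lecture's data:

```python
# Hypothetical data (not from the lecture): years of school (X), job prestige (Y).
x = [8, 10, 12, 14, 16]
y = [30, 32, 41, 45, 52]

n = len(x)
x_bar = sum(x) / n  # 12.0
y_bar = sum(y) / n  # 40.0

# For each case: (deviation from X-bar) times (deviation from Y-bar).
products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]

# Sample covariance: sum the products, then divide by N - 1 (assumed denominator).
s_yx = sum(products) / (n - 1)

for xi, yi, p in zip(x, y, products):
    print(f"X={xi:>2} Y={yi:>2} dev_x={xi - x_bar:+.1f} dev_y={yi - y_bar:+.1f} product={p:+.1f}")
print("covariance s_YX =", s_yx)  # 28.5
```

Every product here is zero or positive because cases above X-bar are also above Y-bar, so the covariance comes out positive.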
Review: Covariance and Slope The slope formula can be written out as follows:
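Written out, the slope is the covariance of X and Y divided by the variance of X. The block below is the standard textbook form, assuming covariance and variance share the same N - 1 denominator (which then cancels); the notation may differ from the original slide:

```latex
b = \frac{s_{YX}}{s_X^2}
  = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sum_{i=1}^{N} (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\,\bar{X}
```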
Review: R-Square The R-Square statistic indicates how well the regression line “explains” variation in Y It is based on partitioning variance into: 1. Explained (“regression”) variance –The portion of deviation from Y-bar accounted for by the regression line 2. Unexplained (“error”) variance –The portion of deviation from Y-bar that is “error” Formula:
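In standard notation, with $\hat{Y}_i$ the value predicted by the regression line, the partition and the resulting statistic are usually written as below. This is the common textbook form and may not match the slide's symbols exactly:

```latex
\sum_{i=1}^{N} (Y_i - \bar{Y})^2
  = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2
  + \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2,
\qquad
R^2 = \frac{\sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2}
           {\sum_{i=1}^{N} (Y_i - \bar{Y})^2}
```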
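Review: R-Square Visually: Deviation is partitioned into two parts [Figure: scatterplot with a horizontal line at Y-bar and the regression line Y = 2 + .5X; one point's deviation from Y-bar is split into “Explained Variance” and “Error Variance”]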
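The partition can be checked numerically: total squared deviation from Y-bar should equal the explained part plus the error part. A small sketch on hypothetical data (not the lecture's), fitting the line with the usual least-squares formulas:

```python
# Hypothetical data (not from the lecture).
x = [8, 10, 12, 14, 16]
y = [30, 32, 41, 45, 52]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and constant.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]  # points on the regression line

ss_total = sum((yi - y_bar) ** 2 for yi in y)                  # all deviation from Y-bar
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)         # "explained" part
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # "error" part

r_square = ss_regression / ss_total

print("total =", round(ss_total, 3))
print("explained + error =", round(ss_regression + ss_error, 3))
print("R-square =", round(r_square, 4))
```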
Correlation Coefficient (r) The R-square is very similar to another important statistic: the correlation coefficient (r) –R-square is literally the square of r Formula for correlation coefficient: r is a measure of linear association Ranges from –1 to 1 Zero indicates no linear association 1 = perfect positive linear association -1 = perfect negative linear association
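In standard notation, r is the covariance rescaled by the two standard deviations, which is what confines it to the range -1 to 1. The textbook form, which may differ from the slide's notation:

```latex
r = \frac{s_{YX}}{s_X \, s_Y}
  = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2 \; \sum_{i=1}^{N} (Y_i - \bar{Y})^2}},
\qquad
R^2 = r^2
```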
Correlation Coefficient (r) Example: Education and Job Prestige SPSS can calculate the correlation coefficient –Usually listed in a matrix to allow many comparisons Correlation of “Year of School” and Job Prestige: r = .521
Covariance, R-square, r, and b Covariance, R-square, r, and b are all similar –All provide information about the relationship between X and Y Differences: Covariance, b, and r can be positive or negative –r is scaled from –1 to +1, others range widely b tells you the actual slope –It relates change in X to change in Y in real units R-square is like r, but is never negative –And, it tells you “explained” variance of a regression
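One way to see these differences is to compute all four statistics on the same data, then change the units of X. The sketch below uses hypothetical data and a hypothetical helper named `summary`; it re-expresses education in months instead of years and also flips the sign of Y:

```python
import math

def summary(x, y):
    """Return covariance, slope b, correlation r, and R-square for two lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    r = sxy / math.sqrt(sxx * syy)
    return {"covariance": sxy / (n - 1), "b": sxy / sxx, "r": r, "r_square": r * r}

# Hypothetical data (not from the lecture).
years = [8, 10, 12, 14, 16]
prestige = [30, 32, 41, 45, 52]
months = [12 * v for v in years]  # same information, different units

in_years = summary(years, prestige)
in_months = summary(months, prestige)
flipped = summary(years, [-v for v in prestige])  # reverse the direction of Y

for label, s in [("years", in_years), ("months", in_months), ("flipped Y", flipped)]:
    print(label, {k: round(v, 4) for k, v in s.items()})
```

Covariance and b change with the units of X, r and R-square do not, and only R-square stays positive when the relationship is reversed.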
Correlation Hypothesis Tests Hypothesis tests can be done on r, R-square, b Example: Correlation (r): linear association Is observed positive or negative correlation significantly different from zero? –Might the population have no linear association? –Population correlation denoted by Greek “r”, rho (ρ) H0: There is no linear association (ρ = 0) H1: There is linear association (ρ ≠ 0) We’ll mainly focus on tests regarding slopes But the process is similar for correlation (r)
Correlation Coefficient (r) Education and Job Prestige hypothesis test: Here, asterisks signify that coefficients are significantly different from zero, α = .01 “Sig.” is a p-value: The probability of observing an r at least this far from zero if ρ = 0. Compare it to α!
Hypothesis Tests: Slopes Given: Observed slope relating Education to Job Prestige = 2.47 Question: Can we generalize this to the population of all Americans? –How likely is it that this observed slope was actually drawn from a population with slope = 0? Solution: Conduct a hypothesis test Notation: slope = b, population slope = β H0: Population slope β = 0 H1: Population slope β ≠ 0 (two-tailed test)
Example: Slope Hypothesis Test The actual SPSS regression results for that data: t-value and “sig” (p-value) are for hypothesis tests about the slope Reject H0 if: absolute value of the t-value > critical t (N-2 df) Or, “sig.” (p-value) less than α
Hypothesis Tests: Slopes What information lets us do a hypothesis test? Answer: Estimates of a slope (b) have a sampling distribution, like any other statistic –It is the distribution of every value of the slope, based on all possible samples (of size N) If certain assumptions are met, the sampling distribution approximates the t-distribution –Thus, we can assess the probability that a given value of b would be observed, if β = 0 –If probability is low – below alpha – we reject H0
Hypothesis Tests: Slopes Visually: If the population slope (β) is zero, then the sampling distribution would center at zero –Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero [Figure: sampling distribution of the slope, centered at 0, with an observed slope b marked] If β = 0, observed slopes should commonly fall near zero, too If observed slope falls very far from 0, it is improbable that β is really equal to zero. Thus, we can reject H0.
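The idea of a sampling distribution centered at zero can be illustrated by simulation. The sketch below is only an illustration under assumed settings: X and Y are generated independently (so the population slope is 0), each sample has N = 50, and 2000 samples are drawn; the seed and the distributions are arbitrary choices, not part of the lecture.

```python
import random
import statistics

random.seed(5811)  # arbitrary seed so the run is repeatable

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def draw_sample_slope(n=50):
    # X and Y are drawn independently, so the population slope is 0.
    x = [random.uniform(0, 20) for _ in range(n)]
    y = [random.gauss(40, 10) for _ in range(n)]
    return slope(x, y)

slopes = [draw_sample_slope() for _ in range(2000)]

mean_slope = statistics.mean(slopes)  # center of the simulated sampling distribution
sd_slope = statistics.stdev(slopes)   # its spread, i.e. a simulated standard error
share_far = sum(abs(b) > 2 * sd_slope for b in slopes) / len(slopes)

print("mean of sample slopes:", round(mean_slope, 4))
print("standard deviation of sample slopes:", round(sd_slope, 4))
print("share of slopes more than 2 SDs from zero:", round(share_far, 4))
```

Most simulated slopes fall near zero, and only a small share land more than two standard deviations away, which is why an observed slope far from zero counts as evidence against H0.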
Bivariate Regression Assumptions Assumptions for bivariate regression hypothesis tests: 1. Random sample –Ideally N > 20 –But different rules of thumb exist. (10, 30, etc.) 2. Variables are linearly related –i.e., the mean of Y increases linearly with X –Check scatter plot for general linear trend –Watch out for non-linear relationships (e.g., U-shaped)
Bivariate Regression Assumptions 3. Y is normally distributed for every outcome of X in the population –“Conditional normality” Ex: Years of Education (X), Job Prestige (Y) Suppose we look only at a sub-sample: X = 12 years of education –Is a histogram of Job Prestige approximately normal? –What about for people with X = 4? X = 16? If all are roughly normal, the assumption is met
Bivariate Regression Assumptions Normality: Examine sub-samples at different values of X. Make histograms and check for normality. [Figure: two example histograms, labeled “Good” and “Not very good”]
Bivariate Regression Assumptions 4. The variances of prediction errors are identical at every value of X –Recall: Error is the deviation from the regression line –Is dispersion of error consistent across values of X? –Definition: “homoskedasticity” = error dispersion is consistent across values of X –Opposite: “heteroskedasticity”, errors vary with X Test: Compare errors for X=12 years of education with errors for X=2, X=8, etc. –Are the errors around line similar? Or different?
Bivariate Regression Assumptions Homoskedasticity: Equal Error Variance Examine error at different values of X. Is it roughly equal? Here, things look pretty good.
Bivariate Regression Assumptions Heteroskedasticity: Unequal Error Variance At higher values of X, error variance increases a lot. This looks pretty bad.
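The informal test described earlier, comparing errors at different values of X, can be sketched in code. The data below are simulated with an error spread that deliberately grows with X; the coefficients, sample size, and seed are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

random.seed(21)  # arbitrary seed, only so the run is repeatable

# Simulated data: the standard deviation of the error term rises with X.
x = [random.uniform(0, 20) for _ in range(400)]
y = [10 + 2.5 * xi + random.gauss(0, 1 + 0.5 * xi) for xi in x]

# Fit the regression line by least squares.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# Error = deviation of each case from the regression line.
errors = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Compare error dispersion for low and high values of X.
low = [e for xi, e in zip(x, errors) if xi < 10]
high = [e for xi, e in zip(x, errors) if xi >= 10]
var_low = statistics.variance(low)
var_high = statistics.variance(high)

print("error variance, X < 10: ", round(var_low, 2))
print("error variance, X >= 10:", round(var_high, 2))
```

A large gap between the two variances is the numeric counterpart of the fan shape in a heteroskedastic scatterplot; roughly equal variances would suggest homoskedasticity.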
Bivariate Regression Assumptions Notes/Comments: 1. Overall, regression is robust to violations of assumptions –It often gives fairly reasonable results, even when assumptions aren’t perfectly met 2. Variations of OLS regression can handle situations where assumptions aren’t met 3. But, there are also further diagnostics to help ensure that results are meaningful… –We’ll discuss them next week.
Regression Hypothesis Tests If assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution Standard deviation of the sampling distribution is called the standard error of the slope ($\sigma_b$) Population formula of standard error: Where $\sigma_e^2$ is the variance of the regression error
Regression Hypothesis Tests Estimating $\sigma_e^2$ lets us estimate the standard error: Now we can estimate the S.E. of the slope:
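In standard textbook notation, the population standard error, the sample estimate of the error variance, and the estimated standard error of the slope are usually written as follows. These are the common bivariate OLS forms, with N - 2 degrees of freedom for the error variance, and may differ in notation from the original slides:

```latex
\sigma_b = \sqrt{\frac{\sigma_e^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2}},
\qquad
s_e^2 = \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{N - 2},
\qquad
s_b = \sqrt{\frac{s_e^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2}}
```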
Regression Hypothesis Tests Finally: A t-value can be calculated: –It is the slope divided by the standard error: $t_{N-2} = b / s_b$ Where $s_b$ is the sample point estimate of the standard error The t-value is based on N-2 degrees of freedom
Example: Education & Job Prestige T-values can be compared to critical t... SPSS estimates the standard error of the slope. This is used to calculate a t-value The t-value can be compared to the “critical value” to test hypotheses. Or, just compare “Sig.” to alpha. If t > crit or Sig < alpha, reject H0
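The same calculation SPSS performs can be traced by hand. The sketch below uses five hypothetical cases (far too few for real inference, and not the lecture's data) so that each step is easy to follow; the critical value 3.182 is the two-tailed t for α = .05 with 3 degrees of freedom, taken from a standard t table.

```python
import math

# Hypothetical data (not from the lecture).
x = [8, 10, 12, 14, 16]
y = [30, 32, 41, 45, 52]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Slope and constant.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

# Estimated error variance: squared residuals divided by N - 2.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s_e2 = sum(e ** 2 for e in residuals) / (n - 2)

# Estimated standard error of the slope, then the t-value.
s_b = math.sqrt(s_e2 / sxx)
t_value = b / s_b

t_critical = 3.182  # two-tailed, alpha = .05, N - 2 = 3 d.f. (standard t table)
reject_h0 = abs(t_value) > t_critical

print("b =", round(b, 3), " s_b =", round(s_b, 3), " t =", round(t_value, 2))
print("reject H0 (population slope = 0)?", reject_h0)
```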
Regression Confidence Intervals You can also use the standard error of the slope to estimate confidence intervals: C.I. = $b \pm t_{N-2}\, s_b$ Where $t_{N-2}$ is the t-value for a two-tailed test given a desired α-level Example: Observed slope = 2.5, S.E. = .10 95% t-value for 102 d.f. is approximately 2 95% C.I. = 2.5 +/- 2(.10) Confidence Interval: 2.3 to 2.7
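The arithmetic in this example is short enough to check directly. A small sketch using the numbers from the example, including its rounded t-value of 2:

```python
b = 2.5       # observed slope, from the example
se_b = 0.10   # standard error of the slope, from the example
t_crit = 2.0  # approximate 95% two-tailed t-value for 102 d.f., as in the example

margin = t_crit * se_b
lower = b - margin
upper = b + margin

# If the interval excludes zero, a two-tailed test at the same alpha rejects H0.
contains_zero = lower <= 0 <= upper

print(f"95% C.I.: {lower:.1f} to {upper:.1f}")  # 2.3 to 2.7
print("interval contains zero?", contains_zero)
```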
Regression Hypothesis Tests You can also use a t-test to determine if the constant (a) is significantly different from zero –But, this is typically less useful to do Hypotheses (α = population parameter of a, not the significance level): H0: α = 0, H1: α ≠ 0 But, most research focuses on slopes
Regression: Outliers Note: Even if regression assumptions are met, slope estimates can have problems Example: Outliers -- cases with extreme values that differ greatly from the rest of your sample Outliers can result from: –Errors in coding or data entry –Highly unusual cases –Or, sometimes they reflect important “real” variation Even a few outliers can dramatically change estimates of the slope (b)
Regression: Outliers Outlier Example: Extreme case that pulls regression line up Regression line with extreme case removed from sample
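The pull of a single extreme case is easy to reproduce. In the sketch below, the first five cases and the added extreme case are all hypothetical; the slope is estimated once with the extreme case and once without it.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data (not from the lecture).
x = [8, 10, 12, 14, 16]
y = [30, 32, 41, 45, 52]

# One extreme case: very high on both X and Y.
x_with = x + [20]
y_with = y + [95]

slope_without = slope(x, y)
slope_with = slope(x_with, y_with)

print("slope without the extreme case:", round(slope_without, 2))
print("slope with the extreme case:   ", round(slope_with, 2))
```

One added case out of six moves the slope substantially, which is why a few outliers can dominate an estimate in a small sample.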
Regression: Outliers Strategy for dealing with outliers: 1. Identify them Look at scatterplots for extreme values Or, ask SPSS to compute outlier diagnostic statistics –There are several statistics to identify cases that are affecting the regression slope a lot –Examples: “Leverage”, Cook’s D, DFBETA –SPSS can even identify “problematic” cases for you… but it is preferable to do it yourself.
Regression: Outliers 2. Depending on the circumstances, either: A) Drop cases from sample and re-do regression –Especially for coding errors, very extreme outliers –Or if there is a theoretical reason to drop cases –Example: In analysis of economic activity, communist countries differ a lot… B) Or, sometimes it is reasonable to leave outliers in the analysis –e.g., if there are several that represent an important minority group in your data When writing papers, state whether outliers were excluded (and what effect that had on the analysis).
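The identify-then-decide strategy can be sketched by computing the diagnostics by hand rather than relying on software flags. The block below uses the standard bivariate formulas for leverage and Cook's D (with 2 estimated parameters), plus the raw change in slope when each case is deleted, which is the idea behind DFBETA. The data and the helper name `fit` are hypothetical.

```python
def fit(xs, ys):
    """Return (constant a, slope b) from a least-squares fit."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((v - x_bar) ** 2 for v in xs)
    b = sum((u - x_bar) * (v - y_bar) for u, v in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b

# Hypothetical data; the last case is deliberately extreme.
x = [8, 10, 12, 14, 16, 20]
y = [30, 32, 41, 45, 52, 95]

n = len(x)
a, b = fit(x, y)
x_bar = sum(x) / n
sxx = sum((v - x_bar) ** 2 for v in x)
errors = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s_e2 = sum(e ** 2 for e in errors) / (n - 2)

# Leverage: how far each case sits from X-bar, relative to the spread of X.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Cook's D: combines the size of the error with the leverage (2 parameters: a and b).
cooks_d = [(e ** 2 / (2 * s_e2)) * h / (1 - h) ** 2 for e, h in zip(errors, leverage)]

# Change in slope when case i is deleted (raw, unstandardized).
slope_change = [b - fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])[1] for i in range(n)]

worst = max(range(n), key=lambda i: cooks_d[i])  # index of the most influential case

for i in range(n):
    print(f"case {i}: leverage={leverage[i]:.3f} cooks_d={cooks_d[i]:.3f} slope_change={slope_change[i]:+.3f}")
print("most influential case:", worst)
```

Whether to drop the flagged case is still a substantive judgment, as the strategy above describes; the diagnostics only show which cases deserve a closer look.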