Regression Chapter 5 January 24 – 25 2012 Part II.

1 Regression Chapter 5 January 24 – 25 2012 Part II

2 Mad Correlators and Regressors on the loose
– Regression lines and why they are important
– The least-squares line
– Don't even think about doing this without software
– Residuals and why they matter
– Influential observations
– Cautions about correlation and regression
– Association DOES NOT equal causation

3 Regression lines and why they are important
As social scientists, there are a few things we want to do, including:
– Describe phenomena and events
– Explain why phenomena and events are the way they are
– Predict what will happen
Regression lines and the equations that produce them help us do all of this.

4 What is a regression line?
A regression line is a straight line that describes how a dependent variable (y) changes in response to changes in an independent variable (x). It has the form: y = a + bx (+ e)
– "a" is the intercept, where the line crosses the y axis
– "b" is the slope of the line
– "e" is the error term (you always have some)
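The equation y = a + bx can be sketched in a few lines of code. This is a minimal illustration, and the intercept and slope values below are made up for the example, not taken from the book:

```python
# Hypothetical intercept and slope (not from the textbook's data).
a = 2.0   # intercept: the predicted y when x = 0
b = 0.5   # slope: the change in y for a one-unit change in x

def predict(x):
    """Return the predicted y for a given x (the error term e is omitted)."""
    return a + b * x

print(predict(10))  # 2.0 + 0.5 * 10 = 7.0
```

The error term e is what separates each real observation from this predicted value; it reappears below as the residual.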

5 As we learned in the last lesson, it is best to start your exploration of an association with a scatterplot. You can usually fit a regression line right onto your scatterplot.

6

7 The Least-Squares Regression Line
The least-squares regression line is the line that minimizes the sum of the squared vertical distances between itself and the points in the scatterplot. As the book notes, fitting it is a relatively simple procedure and one that almost any statistical software package will do.

8 Here it is in Excel using data from Ex. 5.5 in your book

9 And here are the results
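For readers without Excel, the same least-squares fit can be reproduced in a few lines of Python. The (x, y) values below are hypothetical stand-ins, not the actual Ex. 5.5 data from the book:

```python
import numpy as np

# Hypothetical data standing in for Ex. 5.5 (the book's values are not
# reproduced here).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# np.polyfit with degree 1 returns the least-squares slope and intercept.
b, a = np.polyfit(x, y, 1)
print(f"y = {a:.2f} + {b:.2f}x")  # slope ~ 1.99, intercept ~ 0.05
```

Any statistical package (Excel, SPSS, R, Stata) will produce the same slope and intercept, because the least-squares solution is unique for a given data set.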

10 Please Remember
– When fitting a regression line it matters which variable you identify as response and which as explanatory
– Slope (b) tells us what happens to (y) when we have a one-unit change in (x)
– Beta simply expresses (b) in standardized or "z" units
– Strength of association is given by "r"
– And something new is added: "goodness of fit"

11 R² or R-sq
R² is a measure of "goodness of fit" used in regression. This statistic answers the question: how well does the least-squares line of best fit actually fit the data? It is the fraction of the variation in the values of y that is explained by the regression of y on x. R² varies between 0 (no meaningful fit) and 1 (perfect fit). In truth, you should always be suspicious of results that approach 1 too closely.
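The "fraction of variation explained" definition can be computed directly. A minimal sketch, again with hypothetical data; it also checks that, in simple regression, R² equals the square of the correlation r mentioned on the previous slide:

```python
import numpy as np

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b, a = np.polyfit(x, y, 1)      # least-squares slope and intercept
y_hat = a + b * x               # predicted values

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
r_squared = 1 - ss_res / ss_tot

# In simple (one-x) regression, R-squared equals r squared.
r = np.corrcoef(x, y)[0, 1]
print(round(r_squared, 4), round(r ** 2, 4))
```

A value near 1 here means the line accounts for nearly all the variation in y; a value near 0 means it accounts for almost none.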

12 Regression requires certain assumptions be met to yield meaningful results
The y variable in an OLS (ordinary least squares) regression must be measured at the interval or ratio level; the x variable can be at any level of measurement.
– However, interval- and ratio-level (x) variables produce outputs that are easier to interpret
– Ordinal- and nominal-level (x) variables must be restated in binary terms. If they are not already measured as yes/no (1/0), they must be recoded into dummy variables
For example, imagine a variable showing which party respondents to a survey voted for. There are no "no answers" or "other parties" involved. It is a nominal variable coded:
– Party (a) = 1
– Party (b) = 2
– Party (c) = 3
To create dummy variables we recode these as:
– Variable a: voted for party a = 1, did not vote for party a = 0
– Variable b: voted for party b = 1, did not vote for party b = 0
We don't need a variable for party c, as it is already implied: if someone did not vote for party a or party b, then by default they must have voted for party c.
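The dummy-variable recoding above can be sketched directly. The vote codes below are invented for illustration (1 = party a, 2 = party b, 3 = party c, matching the slide's coding):

```python
# Hypothetical survey responses using the slide's coding:
# 1 = party a, 2 = party b, 3 = party c.
votes = [1, 3, 2, 1, 3]

# One dummy per party except the last; party c is the omitted baseline.
party_a = [1 if v == 1 else 0 for v in votes]
party_b = [1 if v == 2 else 0 for v in votes]

print(party_a)  # [1, 0, 0, 1, 0]
print(party_b)  # [0, 0, 1, 0, 0]
# A respondent with party_a == 0 and party_b == 0 must be a party-c voter.
```

Leaving one category out is deliberate: including dummies for all three parties would make them perfectly collinear, since the three always sum to 1.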

13 Regression Assumptions and Requirements, continued
– The relationship must be linear
– All of the observations in an OLS regression must be independent of one another. Including one case ought not to cause another to be automatically included.
These are things you will check before you even begin a regression.

14 Some things you need to check after doing a regression analysis (my favourite)
– For each value of the independent variable, the values of the dependent variable must be normally distributed
– The variance of the distribution of the dependent variable must be the same for all values of the independent variable

15 Residuals (errors)
We can check that our regression meets the requirements for independence of observations, for the normal distribution of the dependent variable at each value of the independent variable, and for constant variance of the dependent variable across all values of the independent variable. We do this by plotting and analyzing the residuals.

16 Residuals are the error terms for each case in our scatterplot. For each observation, the regression line predicts a value of y from the observed value of x. The distance from that predicted point to where the observation really is, is the residual for that observation. Residuals can be calculated in different ways to suit different tasks. Later in the term we will look at how you can use plots of residuals to test that your analysis meets the regression assumptions.

17 An important property of the residuals of an OLS regression is that they have a mean of zero. For now we just want to use them to check goodness of fit and to see whether any points deviate too far from the line.
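Both points can be checked in code. A minimal sketch with hypothetical data: the residuals are observed minus predicted values, and because an OLS fit includes an intercept, they average to zero up to floating-point rounding:

```python
import numpy as np

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 4.2, 5.9, 8.3, 9.6])

b, a = np.polyfit(x, y, 1)        # least-squares slope and intercept
residuals = y - (a + b * x)       # observed minus predicted, per case

print(residuals)                  # one residual per observation
print(np.round(residuals.mean(), 10))  # mean is 0 (up to rounding)
```

An unusually large residual flags a point that deviates far from the line, which is exactly what the next slide looks for.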

18

19 Influential Observations
Observation 16 is clearly an outlier. It is an "influential observation" that is potentially distorting the analysis. The question is what to do with it. If you can formulate a methodologically viable reason why the influential observation ought to be removed from the analysis, you can remove it. The author of the textbook has provided an applet that shows how removing this outlier can impact the resulting regression line.

20 More cautions about regression
– Did I mention before that this only works if your data exhibit a linear relationship between the variables?
– Correlation and OLS are not resistant to extreme values of the variables
– Beware of extrapolating too far; the book gives a nice example of growth rates for children. If you know the rate for 8- and 10-year-olds, don't assume the slope of this line will continue for 25-year-olds
– Beware of lurking (intervening) and possibly hidden variables that impact your analysis

21 And the biggest warning
Association does not equal proof of causation. Having said that, some associations are better than others:
– The association is strong and statistically significant
– The association is consistent and demonstrated repeatedly in different studies
– Higher doses of the explanatory variable are associated with stronger responses

22
– The alleged cause must precede the response in time
– The alleged cause is plausible (a theoretical argument can be made as to why the association ought to exist)

