
Regression continued…




1 Regression continued…
One dependent variable, one or more independent variables

2 Overview Last week we focused on bivariate linear regression
But with only one explanatory variable, we could not tell the full story. Today, we will introduce other variables to help us better predict our outcome variable.

3 Announcements Exam next week (!)

4 Announcements Exam next week
Where & When: Here – same time, same place. What: A mix of short answer, multiple choice, and problems to work out (math). You should bring a calculator. You can use your phone, though I’d prefer you use a calculator. The exam is closed book. But… you are allowed one page (letter size) of handwritten equations to assist with some of the math. If you use such a sheet, you must turn it in with your exam.

5 Announcements Exam topics:
Covers everything we covered up to and including bivariate regression.
Exam topics: Hypothesis testing; Measures of association; Data (variable) types; Regression; Descriptive statistics; US Census & American Community Survey; Survey methods and sampling; Survey analysis; Probability; Inferential statistics.
Concepts covered in the exam: Knowing which statistical test to use; Working with survey distributions; Presenting and interpreting data.

6 More regression Woo-hoo!

7 Conceptual basis We are still going to use the same basic conceptual model

8-14 Conceptual basis We can think of a whole host of questions that we might ask of the form: Does [X] predict [Y]?
Do [GRE scores] predict [success in the GSAPP Masters of Urban Planning Program]?
Does [the number of fancy coffee shops in a neighborhood] predict [gentrification]?
Does [educational attainment] predict [supporting sustainable planning efforts]?
Other factors certainly influence these relationships, but for now we will focus on relationships between two variables.

15-22 Ordinary Least Squares (OLS)
Basic simple regression model: Y = α + βX + ε
Where: Y is the dependent variable and X is the independent variable. Y can also be called the “outcome variable” or the “left-hand-side” variable; X can also be called the “predictor variable” or the “right-hand-side” variable.
α is the intercept and β is the slope of the line.
ε is an independent normal error term, distributed N(0, σ²).

23 Ordinary Least Squares (OLS)
Basic multiple regression model: Y = α + β1X1 + β2X2 + … + βiXi + ε
Y is the dependent variable and X is a vector of independent variables.
α is the intercept and βi is the coefficient on each Xi.
ε is an independent normal error term, distributed N(0, σ²).
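To make the multiple-regression equation concrete, here is a minimal sketch in Python with NumPy. All data and coefficients are made up for illustration; least squares recovers α and the βi:

```python
import numpy as np

# Made-up data: outcome y, two predictors x1 and x2 (all values hypothetical).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=n)  # alpha=2.0

# Design matrix: a leading column of ones carries the intercept alpha.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares fit: minimizes the sum of squared residuals ||y - Xb||^2.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta1, beta2 = coef
print(alpha, beta1, beta2)  # close to 2.0, 1.5, -0.7
```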


27 Least Squares Estimators
Last week, in our bivariate regression, we had n sample observations: X1, X2, … Xn and Y1, Y2, … Yn. We want to estimate a “best fitting” line Y = a + bX, so we need “good” estimates of the intercept (a) and the slope (b). The least squares estimator minimizes the sum of the squares of the residuals around the line.

28 Least Squares Estimators
This week, in our multiple regression, we have n sample observations on more than one independent variable: X1i, X2i, … for each observation i, along with Y1, Y2, … Yn. We still want to estimate a “best fitting” equation, Y = a + b1X1 + b2X2 + …, so we need “good” estimates of the intercept (a) and each slope (bi). The least squares estimator still minimizes the sum of the squares of the residuals around the fitted equation.
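The “minimizes the sum of squared residuals” claim can be checked directly. A sketch with simulated data: the normal-equations solution b = (X'X)⁻¹X'y has a smaller sum of squared residuals than any perturbation of it:

```python
import numpy as np

# Simulated data (hypothetical): one intercept and two slopes.
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Normal equations: b = (X'X)^{-1} X'y
b = np.linalg.solve(X.T @ X, X.T @ y)

def ssr(coefs):
    """Sum of squared residuals around the fitted equation."""
    resid = y - X @ coefs
    return float(resid @ resid)

# Any move away from b increases the sum of squared residuals.
print(ssr(b) < ssr(b + np.array([0.1, 0.0, 0.0])))  # True
```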

29 Regression assumptions
Linear relationship: the relationship between our outcome and predictors is linear.
Multivariate normality: the data are normally distributed.
No or little multicollinearity: the predictors are not too highly correlated with one another.
No auto-correlation: the errors are not correlated with one another.
Homoscedasticity: the error variance is constant across different values of the predictors.
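The multicollinearity assumption is often checked with variance inflation factors (VIFs), computed by regressing each predictor on the others. A NumPy sketch with simulated data, where x2 is deliberately built to be nearly collinear with x1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)

def vif(target, others):
    """Variance inflation factor: 1 / (1 - R^2) from regressing
    one predictor on the remaining predictors."""
    X = np.column_stack([np.ones(len(target))] + others)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    ss_tot = (target - target.mean()) @ (target - target.mean())
    r2 = 1 - (resid @ resid) / ss_tot
    return 1.0 / (1.0 - r2)

print(vif(x1, [x2, x3]))  # very large: multicollinearity problem
print(vif(x3, [x1, x2]))  # near 1: fine
```

A common rule of thumb treats VIFs above roughly 5 or 10 as a warning sign.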


32 Overall F Test Test for the significance of the overall multiple regression model.
The Null Hypothesis, Ho: no linear relationship exists between the dependent and independent variables (all slope coefficients are zero).
The Alternative Hypothesis, Ha: a linear relationship exists between the dependent variable and at least one independent variable.
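The overall F statistic can be computed from R². A sketch with simulated data; the 3.07 in the comment is the approximate 5% critical value for F(2, 117), so look up the exact cutoff for your own n and k:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 120, 2   # n observations, k predictors
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 3.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

# Overall F statistic: H0 is that all slope coefficients are zero.
F = (r2 / k) / ((1 - r2) / (n - k - 1))
print(F)  # compare against the F(k, n-k-1) critical value, roughly 3.07 here
```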

33 Dummy Variables Used to represent factors such as temporal and spatial effects, qualitative variables, or groupings of quantitative variables. Represented as binary variables, for example: X1 = 1 if the case is an inner-city census tract; X1 = 0 if the case is not an inner-city census tract. Be careful of interpretations when an intercept term is included in the equation.

34 Dummy Variables And when we use dummy variables, we always omit one category from the model as the base (reference case). We do this for interpretation, and to avoid multicollinearity.
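A sketch of the “omit one category” rule, coding a hypothetical three-level tract variable as two dummies with “rural” as the omitted base:

```python
import numpy as np

# Hypothetical categorical variable with three levels.
tract_type = np.array(["inner_city", "suburb", "rural", "suburb", "inner_city"])

# One dummy per level EXCEPT the base category ("rural" here),
# which is absorbed into the intercept.
levels = ["inner_city", "suburb"]           # "rural" omitted as reference
dummies = np.column_stack([(tract_type == lev).astype(int) for lev in levels])
print(dummies)
```

Each coefficient on a dummy is then read as a difference relative to the omitted “rural” category.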

35 Things that go wrong … Omitted explanatory variables (misspecification) Nonlinear relationships

36 Things that go wrong … Outliers

37 Things that go wrong … Residuals not normally distributed

38 Transformations for Skewed Variables
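A common transformation for a right-skewed variable such as income is the natural log. A sketch with simulated lognormal data (scale chosen arbitrarily), using a simple moment-based skewness measure:

```python
import numpy as np

# Simulated right-skewed "income" variable (lognormal, hypothetical scale).
rng = np.random.default_rng(6)
income = rng.lognormal(mean=10.5, sigma=0.8, size=5000)

def skewness(v):
    """Simple moment-based skewness: mean of standardized values cubed."""
    z = (v - v.mean()) / v.std()
    return float((z ** 3).mean())

print(skewness(income))          # strongly positive: long right tail
print(skewness(np.log(income)))  # near zero: the log pulls in the tail
```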

39 Things that go wrong … Residual variance is not constant against X - heteroskedasticity

40 Things that go wrong … Multicollinearity: using two or more explanatory variables that are so correlated that they essentially are the same thing in your data. Example: Modeling transit use in Detroit as a function of: Income Car ownership In Detroit, income and car ownership are nearly the same thing.

41 Interaction Terms

42 Interactions. Let’s say we are doing a model of wages and include years of education as our only predictor, but then we want to know if the effect of education differs for men and women. As in: do women get less (in terms of cash) out of schooling than men do? We would interact gender and education: we include (1) education; (2) a dummy variable for being female; and (3) the interaction (i.e., the multiplication) of those two things.

Table X. Model of wages for men and women, made-up data, 2010
Years of education: 1000 ***
Female: -6000
Female * years of education: -375
Constant: 22000
N: 10000
R-squared: 0.20

43 Interactions. The data table below shows what that interaction would look like. It’s just multiplication!

Name     Education   Female   Fem*educ
George   12          0        0
Martha               1
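The interaction column really is just element-wise multiplication. A sketch using George’s row from the slide; Martha’s education is not shown there, so the 18 used here is an assumed value for illustration only:

```python
import numpy as np

# George's row from the slide; Martha's education (18) is an ASSUMED value.
education = np.array([12, 18])   # George, Martha
female    = np.array([0, 1])     # George is male (0), Martha is female (1)

# The interaction column is literal element-wise multiplication.
fem_educ = female * education
print(fem_educ)  # [ 0 18]
```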

44 Interactions. We can think of the regression results as components of change. Let’s start with the constant, which is $22,000 that everyone gets, regardless of their level of education or their sex.

45 Interactions. The graph to the right shows this constant component.
Regardless of education, everyone has the same constant component.

46 Interactions. Now let’s look at years of education. For each additional year of education, everyone earns an extra $1,000 from this component.

47 Interactions. The graph to the right shows this education component.
As education rises, this component “gives you” more income.

48 Interactions. Now let’s look at the female component.
Every woman, regardless of education, earns $6,000 less in our model.

49 Interactions. The graph to the right shows this female component. (Notice we’re below the horizontal axis—in negative territory.)

50 Interactions. Now here’s the interaction.
For each year of education in this component, you lose $375, but this is only true for women (men have a “0” here).

51 Interactions. The graph to the right shows this interaction.

52 Interactions. There! We’ve gone through all the components in our simple model. Let’s look at how these components combine.

53 Interactions. Men feel the effect of just two components (the constant and years of education)

54 Interactions. Women feel the effect of all four components

55 Interactions. If we sum up those two components for men, we get the black line. If we sum up all four components for women, we get the red line. Women are offset (the line is pushed down) because of the effect of the variable “female.” And they have a shallower slope (lower returns for each year of education) compared to men – that’s the interaction term.
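Summing the components gives each group’s predicted wage. A sketch using the made-up coefficients from the table above:

```python
def predicted_wage(educ, female):
    """Predicted wage from the slide's made-up model:
    constant + education effect + female dummy + interaction."""
    return 22000 + 1000 * educ - 6000 * female - 375 * female * educ

# A man and a woman with 12 years of education:
print(predicted_wage(12, 0))  # 34000
print(predicted_wage(12, 1))  # 23500  (offset down, and a shallower slope)
```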

56 Interactions. While these data are completely made up, the general findings are at least in broad brushstrokes consistent with real-world research on this topic. The table to the right shows the results of an actual model using the General Social Survey, in constant dollars.

Table X. Model of inflation-adjusted wages for men and women, General Social Survey
Years of education: 4022 ***
Female: -4756
Female * years of education: -926
Constant: -13680
N: 33,3325
R-squared: 0.17

57 Interactions. We focus mostly on ordinary least squares (OLS) in methods classes. It’s the simplest, clearest model type, and the most common model you will see used.* Here are some other model types. * It’s not uncommon to see OLS used when another type of model should have been used instead.

58 Rate models Other model types
Probably the most common misuse of OLS is in instances where modeling a rate would be more correct.

59 Rate models Other model types
Examples: models trying to understand what causes the… number of muggings in a neighborhood number of bus users in a neighborhood number of pedestrian crashes at a crosswalk

60 Rate models Other model types
Examples: models trying to understand what causes the… number of muggings in a neighborhood number of bus users in a neighborhood number of pedestrian crashes at a crosswalk These make little sense on their own. We also need to know how many people were not mugged; how many do not use the bus, and how many people cross the street safely. We need a rate. If 10 people are mugged, is that 100% of people or 1% of people?

61 Rate models Other model types Why? A clear example:
A researcher found that residential and employment density are strong predictors of pedestrian crashes. But there are many, many more people crossing the street in Manhattan than in New Brunswick. If we don’t control for “exposure” we miss the real story.

62 Rate models Other model types Common rate models include:
Poisson models Negative binomial models And for the case where there are a lot of zeroes (use of public transportation in census tracts): Zero-inflated negative binomial model
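For illustration, here is a sketch of a Poisson rate model fit by Newton-Raphson in NumPy, using log(exposure) as an offset so that we model the crash *rate* rather than the raw count. All data are simulated; in practice you would use a package such as statsmodels rather than hand-rolling the fit:

```python
import numpy as np

# Simulated data: "exposure" plays the role of people crossing a street;
# y is the crash count. We model the rate y/exposure via a log(exposure) offset.
rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)                       # e.g. a density measure
exposure = rng.uniform(50, 5000, size=n)     # pedestrians crossing
mu = exposure * np.exp(-4.0 + 0.5 * x)       # true log-rate: -4 + 0.5x
y = rng.poisson(mu)                          # observed counts

X = np.column_stack([np.ones(n), x])
offset = np.log(exposure)

# Newton-Raphson for the Poisson log-likelihood with the offset included.
beta = np.array([np.log(y.sum() / exposure.sum()), 0.0])  # sensible start
for _ in range(25):
    mu_hat = np.exp(X @ beta + offset)
    grad = X.T @ (y - mu_hat)                # score vector
    hess = X.T @ (X * mu_hat[:, None])       # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # close to [-4.0, 0.5]
```

Dropping the offset here would turn this back into a count model and reintroduce exactly the exposure problem described above.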

63 Binary models Other model types
Frequently, we want to model the “decision to do” (or the happening-of) something or not to do it. For instance, the “choice to work” (or the instance of employment).

64 Binary models Other model types
The outcome variables in these instances are coded as zeroes and ones. “0” means the person doesn’t work; “1” means they do.

Name     Sex   Age   Yrs. Educ.   Work
George   M     44    16           1
Martha   F     42    18
John           22

65 Binary models Other model types
Models used for this kind of data include: Binary logistic regression (binary “logit”) Binary probit model
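A sketch of binary logistic regression fit by Newton-Raphson (IRLS) on simulated 0/1 “work” data. The predictors and coefficients are made up; in practice you would use a statistics package:

```python
import numpy as np

# Simulated 0/1 outcome: does the person work? (all values hypothetical)
rng = np.random.default_rng(5)
n = 2000
age = rng.uniform(20, 65, size=n)
educ = rng.integers(8, 21, size=n).astype(float)
p = 1 / (1 + np.exp(-(-4.0 + 0.03 * age + 0.2 * educ)))  # true P(work=1)
work = (rng.uniform(size=n) < p).astype(float)

X = np.column_stack([np.ones(n), age, educ])

# Newton-Raphson (IRLS) for the logistic log-likelihood.
beta = np.zeros(3)
for _ in range(25):
    p_hat = 1 / (1 + np.exp(-(X @ beta)))
    grad = X.T @ (work - p_hat)              # score vector
    w = p_hat * (1 - p_hat)
    hess = X.T @ (X * w[:, None])            # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # roughly [-4.0, 0.03, 0.2]
```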

66 Multiple-outcome “choice” models
Other model types Some outcomes take on two states: working or not working. Many other outcomes take on many states: commuting to work by (a) bus, (b) train, (c) car, (d) bike, (e) on foot, (f) Segway, and so forth. For these types of outcomes, we use an extension of the logit/probit models: Multinomial logistic regression Multinomial probit

67 And really so, so many more
Other model types There are dozens of other commonly used models, too. There are tradeoffs – some are more “right” for certain phenomena, but they may be computationally intensive or require special software or even programming. Good number-crunchers are always talking to each other and reading up on methods.

68 Excellent resource UCLA Institute for Digital Research and Education
Researchers around the world use UCLA’s statistics help page. It’s truly excellent. Perhaps the best thing on that site is the “annotated output” section.

69 UCLA Institute for Digital Research and Education

70 UCLA Institute for Digital Research and Education
ANOVA table, overall model fit, parameter estimates


