Regression continued… One dependent variable, one or more independent variables
Overview Last week we focused on bivariate linear regression. But with only one explanatory variable, we could not tell the full story. Today, we will introduce additional variables to help us better predict our outcome variable.
Announcements Exam next week (!)
Announcements Exam next week Where & when: here – same time, same place. What: a mix of short answer, multiple choice, and problems to work out (math). You should bring a calculator; you can use your phone, though I’d prefer you use a calculator. The exam is closed book. But… you are allowed one page (letter size) of handwritten equations to assist with some of the math. If you use such a sheet, you must turn it in with your exam.
Announcements Exam topics: covers everything we covered up to and including bivariate regression, including: data (variable) types; descriptive statistics; presenting and interpreting data; US Census & American Community Survey; survey methods and sampling; survey analysis; working with survey distributions; probability; inferential statistics; hypothesis testing; knowing which statistical test to use; measures of association; and regression.
More regression Woo-hoo!
Conceptual basis We are still going to use the same basic conceptual model
Conceptual basis We can think of a whole host of questions that we might ask of the form: Does [X] predict [Y]? Do [GRE scores] predict [success in the GSAPP Masters of Urban Planning Program]? Does [the number of fancy coffee shops in a neighborhood] predict [gentrification]? Does [educational attainment] predict [supporting sustainable planning efforts]? Other factors certainly influence these relationships, but for now we will focus on relationships between two variables.
Ordinary Least Squares (OLS) Basic simple regression model: Y = α + βX + 𝜀 Where: Y is the dependent variable and X is the independent variable. Y can also be called the “outcome variable” or the “left-hand-side” variable. X can also be called the “predictor variable” or the “right-hand-side” variable.
Ordinary Least Squares (OLS) Basic simple regression model: Y = α + βX + 𝜀 Y is the dependent variable and X is the independent variable. α is the intercept and β is the slope of the line. 𝜀 is an independent, normally distributed error term, N(0, σ²).
Ordinary Least Squares (OLS) Basic multiple regression model: Y = α + β1X1 + β2X2 + … + βkXk + 𝜀 Y is the dependent variable and X1 … Xk are the independent variables. α is the intercept and βi is the coefficient on each Xi. 𝜀 is an independent, normally distributed error term, N(0, σ²).
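To make the multiple regression model concrete, here is a minimal sketch in Python using only NumPy. The data are made up: we generate Y from two predictors plus noise, then recover the intercept and both slopes by least squares.

```python
import numpy as np

# Made-up data: Y = 2.0 + 1.5*X1 - 0.8*X2 + noise (all values hypothetical).
rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones estimates the intercept (alpha).
X = np.column_stack([np.ones(n), X1, X2])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
alpha_hat, b1_hat, b2_hat = coef
print(alpha_hat, b1_hat, b2_hat)  # close to the true 2.0, 1.5, -0.8
```

With 200 observations and modest noise, the estimates land near the true values we built into the data.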
Least Squares Estimators Last week, in our bivariate regression, we had n sample observations: X1, X2, … Xn and Y1, Y2, … Yn. We want to estimate a “best fitting” line Y = a + bX, so we need “good” estimates of the intercept (a) and the slope (b). The least squares estimator minimizes the sum of the squared residuals around the line.
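The bivariate estimators have a closed form: b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², and a = ȳ − b·x̄. A quick sketch with hypothetical numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # hypothetical data

# Slope: covariance of x and y divided by the variance of x.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: forces the line through the point (x-bar, y-bar).
a = y.mean() - b * x.mean()

# These (a, b) minimize the sum of squared residuals around the line.
residuals = y - (a + b * x)
```

Any other choice of (a, b) would give a larger sum of squared residuals for these data.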
Least Squares Estimators: multiple regression This week, we still have n sample observations, but each observation now has several predictors: X1i, X2i, … Xki and Yi. We want to estimate a “best fitting” equation Y = a + b1X1 + b2X2 + … + bkXk, so we need “good” estimates of the intercept (a) and each slope (bi). The least squares estimator still minimizes the sum of the squared residuals; now we simply have more than one independent variable.
Regression assumptions Linear relationship: the relationship between our outcome and predictors is linear. Multivariate normality: the data are normally distributed. No or little multicollinearity: the predictors are not too highly correlated with one another. No auto-correlation: the errors are not correlated with one another. Homoscedasticity: the variance of the errors is constant across values of the predictors.
Overall F Test Test for the significance of the overall multiple regression model. The null hypothesis, H0: no linear relationship exists between the dependent variable and the independent variables. The alternative hypothesis, Ha: a linear relationship exists between the dependent variable and at least one of the independent variables.
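The overall F statistic can be computed directly from R², the number of predictors k, and the sample size n. A sketch with hypothetical values:

```python
# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
# Hypothetical model fit: R^2 = 0.20, k = 3 predictors, n = 10000 cases.
r2, k, n = 0.20, 3, 10_000
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
# A large F (here far above any conventional critical value) rejects H0
# that no linear relationship exists.
print(f_stat)  # 833.0
```

With a big sample, even a modest R² produces a very large F statistic.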
Dummy Variables Used to represent categorical factors such as temporal and spatial effects, qualitative variables, and groupings of quantitative variables. Represented as binary variables, for example: X1 = 1 if the case is an inner-city census tract; X1 = 0 if the case is not an inner-city census tract. Be careful with interpretation when an intercept term is included in the equation.
Dummy Variables When we use dummy variables, we always omit one category from the model as the base (reference) case. We do this for interpretation, and to avoid perfect multicollinearity.
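A minimal sketch of dummy coding with an omitted reference category (the category labels here are hypothetical):

```python
import numpy as np

tracts = ["inner city", "suburb", "inner city", "rural", "suburb"]
categories = sorted(set(tracts))   # ['inner city', 'rural', 'suburb']
reference = "inner city"           # omitted base (reference) case

# One 0/1 column per non-reference category; the reference case is
# all zeros, so its effect is absorbed by the intercept.
dummies = {c: np.array([1 if t == c else 0 for t in tracts])
           for c in categories if c != reference}
```

Each remaining coefficient is then interpreted relative to the omitted "inner city" base case.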
Things that go wrong … Omitted explanatory variables (misspecification) Nonlinear relationships
Things that go wrong … Outliers
Things that go wrong … Residuals not normally distributed
Transformations for Skewed Variables
Things that go wrong … Residual variance is not constant against X - heteroskedasticity
Things that go wrong … Multicollinearity: using two or more explanatory variables that are so correlated that they are essentially the same thing in your data. Example: modeling transit use in Detroit as a function of income and car ownership. In Detroit, income and car ownership are nearly the same thing.
Interaction Terms
Interactions. Let’s say we are doing a model of wages and include years of education as our only predictor, but then we want to know if the effect of education differs for men and women. As in: do women get less (in terms of cash) out of schooling than men do? We would interact gender and education: we include (1) education; (2) a dummy variable for being female; and (3) the interaction (i.e., the multiplication) of those two things.

Table X. Model of wages for men and women, made-up data, 2010
                               Coefficient   Sig.
Years of education                    1000   ***
Female                               -6000
Female × years of education           -375
Constant                             22000
N                                    10000
R-squared                             0.20
Interactions. The data table below shows what that interaction would look like. It’s just multiplication!

Name     Education   Female   Fem*educ
George   12
Martha               1
Interactions. We can think of the regression results as components of change. Let’s start with the constant: $22,000 that everyone gets, regardless of their level of education or their sex.
Interactions. The graph to the right shows this constant component. Regardless of education, everyone has the same constant component.
Interactions. Now let’s look at years of education. For each additional year of education, everyone earns an extra $1,000 from this component.
Interactions. The graph to the right shows this education component. As education rises, this component “gives you” more income.
Interactions. Now let’s look at the female component. Every woman, regardless of education, earns $6,000 less in our model.
Interactions. The graph to the right shows this female component. (Notice we’re below the horizontal axis—in negative territory.)
Interactions. Now here’s the interaction. For each year of education in this component, you lose $375, but this is only true for women (men have a “0” here).
Interactions. The graph to the right shows this interaction.
Interactions. There! We’ve gone through all the components in our simple model. Let’s look at how these components combine.
Interactions. Men feel the effect of just two components: the constant and the education term.
Interactions. Women feel the effect of all four components: the constant, education, the female offset, and the interaction.
Interactions. If we sum up those two components for men, we get the black line. If we sum up all four components for women, we get the red line. Women are offset (the line is pushed down) because of the effect of the variable “female.” And they have a shallower slope (lower returns for each year of education) compared to men – that’s the interaction term.
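Using the coefficients from the made-up wage table, the two lines can be reproduced with a small function (the numbers come from that made-up model, not real data):

```python
def predicted_wage(educ, female):
    # Coefficients from the made-up wage model:
    # constant 22000, education 1000/yr, female -6000, interaction -375/yr.
    return 22000 + 1000 * educ - 6000 * female - 375 * female * educ

# A man with 12 years of education:
print(predicted_wage(12, 0))  # 34000
# A woman with 12 years of education: offset down, and a shallower slope
# (each extra year is worth 1000 - 375 = 625 for women).
print(predicted_wage(12, 1))  # 23500
```

Evaluating the function across education levels for female = 0 and female = 1 traces out the black and red lines, respectively.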
Interactions. While these data are completely made up, the general findings are, at least in broad brushstrokes, consistent with real-world research on this topic. The table to the right shows the results of an actual model using the General Social Survey, 1972–2012, in constant dollars.

Table X. Model of inflation-adjusted wages for men and women, 1972–2012 General Social Survey
                               Coefficient   Sig.
Years of education                    4022   ***
Female                               -4756
Female × years of education           -926
Constant                            -13680
N                                   33,3325
R-squared                             0.17
Interactions. We focus mostly on ordinary least squares (OLS) in methods classes. It’s the simplest, clearest model type, and the most common model you will see used.* Here are some other model types. * It’s not uncommon to see OLS used when another type of model should have been used instead.
Other model types: Rate models Probably the most common misuse of OLS is in instances where modeling a rate would be more correct.
Rate models Examples: models trying to understand what causes the… number of muggings in a neighborhood, number of bus users in a neighborhood, number of pedestrian crashes at a crosswalk. These counts make little sense on their own. We also need to know how many people were not mugged, how many do not use the bus, and how many people cross the street safely. We need a rate. If 10 people are mugged, is that 100% of people or 1% of people?
Rate models Why? A clear example: a researcher found that residential and employment density are strong predictors of pedestrian crashes. But there are many, many more people crossing the street in Manhattan than in New Brunswick. If we don’t control for “exposure,” we miss the real story.
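A tiny illustration of why exposure matters, with hypothetical counts and populations:

```python
# Hypothetical counts and populations for two places.
muggings   = {"dense downtown tract": 50, "small suburban tract": 10}
population = {"dense downtown tract": 10_000, "small suburban tract": 500}

# Raw counts say downtown is worse; rates say the opposite.
rates = {place: muggings[place] / population[place] for place in muggings}
# downtown: 0.005 muggings per resident; suburban: 0.020 per resident.
```

The tract with five times as many muggings actually has one quarter the mugging rate, which is exactly the distortion a rate model corrects for.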
Rate models Common rate models include: Poisson models and negative binomial models. And for the case where there are a lot of zeroes (e.g., use of public transportation in census tracts): the zero-inflated negative binomial model.
Binary models Frequently, we want to model the “decision to do” (or the happening of) something or not to do it. For instance, the “choice to work” (or the instance of employment).
Binary models The outcome variables in these instances are coded as zeroes and ones. “0” means the person doesn’t work; “1” means they do.

Name     Sex   Age   Yrs. Educ.   Work
George   M     44    16           1
Martha   F     42    18
John           22
Binary models Models used for this kind of data include: binary logistic regression (binary “logit”) and the binary probit model.
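In a binary logit, the coefficients act on the log-odds, and the logistic function turns the linear predictor into a probability between 0 and 1. A sketch with hypothetical coefficients:

```python
import math

def prob_working(educ, b0=-4.0, b1=0.3):
    # Hypothetical logit coefficients: log-odds of working = b0 + b1 * educ.
    log_odds = b0 + b1 * educ
    return 1.0 / (1.0 + math.exp(-log_odds))  # logistic function

# 16 years of education -> log-odds of 0.8 -> probability of about 0.69.
p = prob_working(16)
```

Note the S-shape: probabilities stay between 0 and 1 no matter how extreme the predictor gets, which is exactly what a 0/1 outcome requires.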
Multiple-outcome “choice” models Some outcomes take on two states: working or not working. Many other outcomes take on many states: commuting to work by (a) bus, (b) train, (c) car, (d) bike, (e) on foot, (f) by Segway, and so forth. For these types of outcomes, we use an extension of the logit/probit models: multinomial logistic regression and multinomial probit.
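A multinomial logit assigns each alternative a score (a “utility”) and converts the scores into probabilities that sum to one via the softmax function. A sketch with hypothetical scores:

```python
import math

def softmax(scores):
    # Convert arbitrary scores into probabilities that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical utilities for commuting by bus, train, and car:
probs = softmax([0.5, 1.2, 2.0])
# The highest-utility mode (car) gets the highest probability.
```

With only two alternatives, this collapses back to the binary logit from the previous slide.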
And really so, so many more There are dozens of other commonly used models, too. There are tradeoffs: some are more “right” for certain phenomena, but they may be computationally intensive or require special software or even programming. Good number-crunchers are always talking to each other and reading up on methods.
Excellent resource UCLA Institute for Digital Research and Education Researchers around the world use UCLA’s statistics help page. It’s truly excellent. Perhaps the best thing on that site is the “annotated output” section.
UCLA Institute for digital research and education http://www.ats.ucla.edu/stat/stata/output/reg_output.htm
UCLA Institute for Digital Research and Education The annotated regression output walks through the ANOVA table, overall model fit, and parameter estimates. http://www.ats.ucla.edu/stat/stata/output/reg_output.htm
UCLA Institute for digital research and education http://www.ats.ucla.edu/stat/ http://www.ats.ucla.edu/stat/ Insititute for digital research and education