1
Ordinary least squares regression. Nicholas Charron, Associate Prof., Dept. of Political Science
2
Section Outline
Day 1: overview of OLS regression (Wooldridge, chap. 1-4)
Day 2 (25 Jan): assumptions of OLS regression (Wooldridge, chap. 5-7)
Day 3 (27 Jan): alternative estimation, interaction models (Wooldridge; Brambor et al. article)
Next topic: Limited dependent variables
3
Section Goals
To understand the basic ideas and formulas behind linear regression
Calculate (by hand) a simple bivariate regression coefficient
Working with 'real' data, apply knowledge, perform regressions & interpret results, compare effects of variables in multiple regression
Understand the basic assumptions of OLS estimation
How to check for violations, and what to do (more in later lectures also)
What to do when the X and Y relationship is not directly linear – interaction effects, variable transformations (logged variables)
Apply knowledge in STATA!
4
Introductions!
- Name
- Department
- Year as PhD student
- Where you are from (country)
- How much statistics have you had?
- What is your research topic?
5
Linear Regression: brief history
A bit of history: Sir Francis Galton – interested in heredity of plants, 'regression toward mediocrity', which in his time meant the average (now better known as regression toward the mean). Emphasis on 'on average': what can we expect? He was not a mathematician, however. Karl Pearson (Galton's biographer) took Galton's work and developed several statistical measures. Together with the earlier 'least squares' method (Gauss 1812), regression analysis was born.
6
Simple statistical methods: cross tabulations & correlations
Used widely, especially in survey research; probably many of you are familiar with this…
At least one categorical variable – nominal or ordinal
If we want to know how two variables are related in terms of strength, direction & effect, we can use various tests:
Nominal level – only strength (Cramer's V)
Ordinal level – strength and direction (tau-b & tau-c, gamma)
Interval (ratio) level – strength, direction (and effect) (Pearson's r, regression)
7
WHY DO WE USE REGRESSION?
To test hypotheses about causal relationships for a continuous/ordinal outcome
To make predictions
The preferred method when your statistical model has more than two explanatory variables and you want to elaborate a causal relationship
However, always remember that correlation is not causation! We test hypotheses about causal relationships, but the regression itself does not express causal direction (which is why theory is important!)
9
Key advantages of linear regression
Simplicity: a linear relationship is the simplest non-trivial relationship. Plus, most people can even do the math by hand (as opposed to other estimation techniques).
Flexibility: even if the relationship between X and Y is not really linear, the variables can be transformed (more later).
Interpretability: we get strength, direction & effect in a simple, easy-to-understand package.
10
Some essential terminology
Regression: the mean of the outcome variable (Y) as a function of one or more independent variables (X): μ(Y|X)
Regression model: explaining Y in the 'real world' is very complicated. A model is our APPROXIMATION (simplification) of the relationship.
Simple (bivariate) regression model: Y = β0 + β1X
Y: the dependent variable
X: the independent variable
β0: the intercept or 'constant' (in other words??), also notated as α (alpha)
β1: the slope
The β's are called 'coefficients'
11
More terminology
Dependent variable (Y): aka explained variable, response variable
Independent variable (X): aka explanatory variable, control variable
Two types of models, broadly speaking:
1. Deterministic model: Y = a + bX (the equation of a straight line)
2. Probabilistic model (what we are most interested in): Y = a + bX + e
12
A deterministic model: visual
Scatter diagram of expenditure: five people, each with their beer consumption and the associated costs.
13
A deterministic model: simple ex.
Calculation of the slope: β = (80 − 50) / (4 − 2) = 15
Calculation of the intercept: α = 50 − (2 × 15) = 20, or α = 80 − (4 × 15) = 20
The equation for the relationship: Y = α + βX = 20 + 15X

Person      # beers (X)   Total expenses (Y)
Stefan      0             20
Martin      2             50
Thomas      4             80
Rasmus      5             95
Christian   6             110

Here we have an overview of the costs: what each person has drunk and paid. In a deterministic model, we can calculate the slope and intercept from any two points and simple insertion into the equation. The resulting equation shows that the fixed cost is 20 kr and the price of a beer is 15 kr.
14
The probabilistic model: with ‘error’
The example before was constructed. In reality, the scatter chart might look like this: there were errors in the measurement of the dependent variable. Errors could, for example, arise due to drunkenness in the bar: someone gets the wrong change back, or pays too much or too little. In a non-deterministic context we must use a probabilistic model that takes the possibility of error into account. It is normal in the social sciences that there are errors in measurement or that relationships are not perfect – they never are. The goal is still to determine the intercept and slope of the line, but now we cannot use the simple approach, because the model is not deterministic. What do we do?
15
Where res_i is normally written as e_i
Even more terminology
Most often, we're dealing with 'probabilistic' models:
Fitted values ('Y hat'): for any observation i our model gives an expected mean: Ŷ_i = fit_i = β0 + β1X_i
Residual: the error (how much our model is 'off') for observation i: res_i = Y_i − fit_i = Y_i − Ŷ_i
Where res_i is normally written as e_i
Least squares: our method to find the estimates that MINIMIZE the SUM of SQUARED RESIDUALS (SSE):
Σ_{i=1}^{n} (y_i − (β0 + β1x_i))² = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} e_i²
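A minimal Stata sketch of these quantities (generic variable names y and x, not tied to any particular course dataset):

regress y x
predict yhat             // fitted values: yhat_i = b0 + b1*x_i
predict e, resid         // residuals: e_i = y_i - yhat_i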
16
OLS regression: relationship between income and age. Each dot represents a respondent's income (Y) at age (X).

Person   X (age)   Y (income)
Pelle    20        21
Lisa     19        22.4
Kalle    54        47.3
Ester    42        17
Ernst    39        35
Stian    67        23.8
Lise     40        39.3
17
OLS regression: the relationship between income and age shows strength, direction & effect, but what about causality? Regression gives good summary statistics – strength, direction and effect. However, causality is not solved – that is a matter of theory.
18
Relationship between income and age
OLS regression: relationship between income and age. The difference between the observed values and the line is the error term = everything else that explains variation in Y but is not included in our model.
19
How to estimate the coefficients?
We use the 'least squares method', of course!
To calculate the slope coefficient (β1) of X:
β1 = Σ_{i=1}^{n} (x_i − X̄)(y_i − Ȳ) / Σ_{i=1}^{n} (x_i − X̄)²
The slope coefficient is the covariance between X and Y over the variance of X, or the rate of change of Y relative to change in X.
And to calculate the constant:
β0 = Ȳ − β1X̄
Simply speaking, the constant is just the mean of Y minus β1 times the mean of X.
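A sketch of how the two formulas could be checked 'by hand' in Stata (generic y and x; assumes the data are already loaded):

quietly correlate y x, covariance
matrix C = r(C)                     // covariance matrix, ordered y then x
scalar b1 = C[2,1] / C[2,2]         // cov(x,y) / var(x)
quietly summarize x
scalar xbar = r(mean)
quietly summarize y
scalar b0 = r(mean) - b1*xbar       // mean(y) minus b1 times mean(x)
display "slope = " b1 "   constant = " b0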
20
In-class exercise – calculation of the beta and alpha values in OLS by hand!
Person   X   Y
Pelle    2   4
Lisa     1   5
Kalle
Ester
Ernst    3
21
Calculation of b-value in OLS
Columns: X, Y, X−X̄, Y−Ȳ, (X−X̄)², (X−X̄)·(Y−Ȳ)
Pelle: 2, 4, 3, -1, 1
Lisa: 5, -2, -4
Kalle:
Ester:
Ernst:
Summa: 15, 10, -9
23
A cool property of least squares estimation is that the regression line will always pass through the point (mean of X, mean of Y). Now, for every value of X, we have an expected value of Y.
24
𝜷 compared to Pearson’s r
The effect in a linear regression = β
Correlation – Pearson's r
Same numerator, but the denominator also takes the variation in Y into account: r = Σ(x_i − X̄)(y_i − Ȳ) / √( Σ(x_i − X̄)² · Σ(y_i − Ȳ)² )
Q: When will these two be equal?
25
Interpretation of Pearson’s r
Source: wikipedia
26
Correlation and regression, a comparison
Pearson's r is standardized and varies between -1 (perfect negative relationship) and 1 (perfect positive relationship); 0 = no relationship; n is not taken into account.
The two are related: β1 = r_xy · (S_Y / S_X)
The regression coefficient (β) has no given minimum or maximum values, and the interpretation of the coefficient depends on the range of the scale.
Unlike the correlation r, the regression is used to predict values of one variable, given values of another variable.
27
Objectives and goals of linear regression
We want to know the probability distribution of Y as a function of X (or several X's)
Y is a straight-line (i.e. linear) function of X, plus some random noise (error term)
The goal is to find the 'best' line that explains the variation of Y with X
Important! The marginal effect of X on Y is assumed to be CONSTANT across all values of X. What does this mean??
28
Applied bivariate example
Data: QoG Basic, two variables from the World Values Survey
Dependent variable (Y): Life happiness (1-4, lower = better)
Independent variable (X): State of health (1-5, lower = better)
Units of analysis: countries (aggregated from survey data), 20 randomly selected
Our model: Y(Happiness) = α + β1(health) + e
H1: the healthier a country feels, the happier it is on average
Let's estimate α and β1 based on our data!
Open the file in STATA from GUL: health_happy ex.dta
***To do what I've done in the slides, see the do-file
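A minimal sketch of the do-file steps used on the following slides (the file name is as given above; the variable names happiness and health are taken from the scatterplot command shown later):

use "health_happy ex.dta", clear
summarize happiness health
pwcorr happiness health, sig
twoway (scatter happiness health) (lfit happiness health)
regress happiness health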
29
Some basic statistics
1. Summary stats
2. Pairwise correlations (Pearson's r)
30
3. Scatterplot w/line – in STATA: twoway (scatter happiness health) (lfit happiness health)
We start with a scatter plot with dep-var on Y axis and indep-var on X axis
31
Now for the regression: annotated output showing the number of observations, the F-test of model significance, R², the mean squared error, the coefficient (b), the constant (a), the standard error, the t-test (of significance), and the 95% confidence interval.
32
Now for the regression
33
Some interpretation
What is the predicted mean of happiness for a country with a mean health of 2.3? We include a regression line.
34
Interpretation
What is the predicted mean of happiness for a country with a mean health of 2.3?
Answer: Y = 0.243 + 0.778(2.3) ≈ 2.02
We include a regression line.
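The same prediction can be obtained with Stata's margins command, assuming the bivariate regression has just been run:

regress happiness health
margins, at(health=2.3)    // predicted mean happiness at health = 2.3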
35
Ok, now what?? Ok great, but the calculation of beta and alpha is just simple math… Now we want to see how much we can INFER from this relationship – since we do not have ALL the observations (i.e. a 'universe') with perfect data, we are only making an inference.
A key to doing this is to evaluate how 'off' our model predictions are relative to the actual observations of Y.
We can do this both for our model as a whole and for individual coefficients (both betas and alpha). We'll start with calculating the SUM of SQUARES.
Two questions: How 'sure' are we of our estimates, i.e. significance, or the probability that the relationship we see is not just 'white noise'? And is OLS actually the most valid estimation method?
36
Assumptions of OLS (more on this next week!)
OLS is fantastic if our data meet several assumptions, and before we make any inferences, we should always check. In order to make inference:
The linear model is suitable
The conditional standard deviation is the same for all levels of X (homoscedasticity)
Error terms are normally distributed for all levels of X
The sample is selected randomly
There are no severe outliers
There is no autocorrelation
No multicollinearity
Our sample is representative of the population (for all estimation)
37
Regression inference: in order to test several of our assumptions, we need to observe the residuals from our estimation. These allow us both to check the OLS assumptions AND to provide significance testing. Plotting the residuals against the explanatory (X) variable is helpful in checking these conditions because a residual plot magnifies patterns. This you should ALWAYS look at.
38
Least squares: Sum of the squared error term – getting our error terms
A measure of how far the line is from the observations is the sum of all errors: the smaller it is, the closer the line is to the observations (and thus, the better our model). To avoid positive and negative errors cancelling out in the calculation, we square them. The sum over all observations i of the squared error terms is the residual sum of squares (RSS): RSS = Σ_i (y_i − ŷ_i)² = Σ_i e_i²
39
Residual Sum of squares – a simple visual
The standard error tells us something about the average deviation between our observed values and our predicted values.
40
Back to our exercise: the typical deviation around the line (i.e. the conditional standard deviation) is the root mean square error. The output also shows some of the other quantities we have talked about: SSE – the sum of the squared error terms; TSS – the sum of squared deviations around the mean of Y (we return to this); sigma² – the variance of the deviations around the line; and the standard error, which says something about the spread of the observations around the line and is thus a measure of how well the line fits the observations. We will also return to the F-test.
41
Getting our standard errors for beta
The standard error (s.e.) tells us something about the average deviation between our observed values and our predicted values. The first ingredient is the conditional standard deviation, calculated as the square root of the RSS divided by the number of observations minus the number of parameters:
σ̂ = √( RSS / (n − k) )
where RSS = residual sum of squares, aka sum of squared errors; n = number of cases; k = number of parameters (in bivariate regression – intercept and b-coefficient – k = 2).
42
Getting our standard errors for b
The precision of b depends (among other things) on the variation around the line – i.e. how large the spread around the line is. We have assumed this spread to be constant for all levels of X, but how is it calculated? As we saw earlier, the sum of squared deviations from the line is given by the RSS, and the typical deviation around the line (i.e. the conditional standard deviation) is then given by σ̂ = √( RSS / (n − 2) ).
As usual, we start by determining the standard error, since we need it both for hypothesis testing and for confidence intervals. We saw earlier that the deviations from the line are given by the SSE. When we transform it back to the original scale and adjust for the number of observations, we get the typical deviation around the line – also called the conditional standard deviation of the line. Conditional, because we calculate the deviations from Ŷ; thus we have conditioned on X. The degrees of freedom are n − 2 because we have estimated two parameters, a and b, to obtain Ŷ. This is a kind of average of the deviations around the line – i.e. it says something about how much variation there is around the line. Let's see the graphic.
43
Standard errors for b
The standard error of b is then defined as the conditional standard deviation divided by (the square root of) the variation in X:
s.e.(b) = σ̂ / √( Σ(x_i − X̄)² )
Factors affecting the standard error of beta:
1. The spread around the line, σ – the smaller σ, the smaller the standard error
2. Sample size, n – the larger n, the smaller the standard error
3. The variation in X – the greater the variation, the smaller the standard error
This could also be derived formally, but we have had enough math for today. Points 1 and 2 are familiar: less variation in the population makes our estimates more precise, and so do larger samples. But point 3 is new: it is a good thing to have a large spread in your independent variable when you run a regression, because it gives greater precision in the estimate.
44
Standard errors for b: the ‘ideal’
First, look at the marginal distribution: the distribution of Y across all values of X is quite spread out. But when we condition on X – i.e. take the distribution of Y for each value of X separately – the spread is smaller if there is a correlation between the two variables. For each X value there will be some variation around the line, but when there is a relationship, it will be less than the overall variation in Y. We capture this by starting from the deviations around the line, as expressed in the formula for the SSE. Thus we have a measure of the variation around the line – and of the uncertainty in how well our model fits the data – and that is what we need to make inference. Note that the variation is assumed to be the same for all levels of X; we come back to how to investigate this and what it means in general.
45
Back to our example (see excel sheet for ‘by hand calculations..)
Standard error of a Standard error of b
46
The standard errors can then be used for hypothesis testing. By dividing our slope coefficients by their s.e.'s we obtain their t-values. The t-value can then be used as a measure of statistical significance and allows us to calculate a p-value (what is this??).
Old school: one can consult a t-table where the degrees of freedom (the number of observations minus the number of estimated parameters) and your chosen level of security (p < .10, .05 or .01, for example) decide whether your coefficient is significantly different from zero or not (t-tables can be found as appendices in statistics books, like Wooldridge).
New school: rely on statistical programs (like STATA, SPSS).
H0: β1 = 0
H1: β1 ≠ 0
48
Hypothesis testing & confidence interval for β
Hypothesis test of independence between the variables (H0: β = 0).
Calculation of the t-value: t = (b − 0) / s.e.(b)
95 pct. confidence interval: b ± t(0.025) · s.e.(b)
Hypothesis testing for independence between the variables corresponds, of course, to the slope being 0 and is done as usual. We use the t-test as standard, also when n is greater than 30 – it is the test everyone uses and the one built into the programs, so you do not have to worry about choosing between z and t. Remember that t approaches z when n grows beyond 30. The confidence interval is set up as usual and with the same interpretation.
49
confidence intervals for β
H0: b = 0; H1: b ≠ 0 – critical value ±1.96 at a 95% confidence level
90% confidence interval: t = 1.645
95% confidence interval: t = 1.96
99% confidence interval: t = 2.576
Forming a 95% confidence interval for a single slope coefficient: b ± t·(SE_b)
50
Back to our example: annotated output showing the number of observations, the F-test of model significance, R², the mean squared error, the standard error, the t-test (of significance), the p-value from the t-test, and the 95% upper/lower confidence limits.
51
Taking a closer look ’under the hood of the car’… open excel file: Regression - happines_vs_health
53
Basic OLS model diagnostics: 1. R² 2. F-Test 3. MSE
54
1. R2 : EXPLAINED VARIANCE A.k.a. “coefficient of determination”
R² ranges from 0 to 1 (0 ≤ R² ≤ 1)
R² describes how close the observed values (the dots) lie to the estimated regression line
R² is a direct measure of linearity but is interpreted as explained variance. When R² = 1, we've explained all variation in Y; when R² = 0, we've explained nothing… a good way to compare models!
In many (social science) research models built on survey data (individual level), R² is often a low value (rarely exceeds .40)
It is calculated using three sum-of-squares formulas
55
Calculating R2 - FIRST COMPONENT
Total sum of squares (TSS) – as we've used in other equations, this is the sum over all observations of the squared difference between each observation's value on the dependent variable and the mean of the dependent variable: TSS = Σ_i (y_i − Ȳ)². This is the total variation in the dependent variable.
56
Calculating R2 - SECOND COMPONENT
Explained sum of squares (ESS) – the sum over all observations of the squared difference between the predicted value of the dependent variable and the mean value of the dependent variable: ESS = Σ_i (ŷ_i − Ȳ)². If our regression does not explain any variation in the dependent variable, ESS = 0: our best prediction is then the mean value of Y. If our model has any explanatory power, ESS > 0 and the model adds something beyond the mean to our understanding of the outcome (Y). This is also called the 'regression sum of squares (RSS)' (confusing, right??)
57
Calculating R2 - THIRD COMPONENT
Residual sum of squares (RSS) – which we covered a few slides ago: the sum over all observations of the squared difference between each observation's value on the dependent variable and its predicted value: RSS = Σ_i (y_i − ŷ_i)². This is the variation our model cannot explain and is therefore labeled the error term (or residual). This is also called the error sum of squares (ESS) (huh, wtf??)
58
EXPLAINED VARIANCE
As noted, R² is defined as: R² = ESS / TSS = 1 − RSS / TSS
The total variation in Y – TSS – can be divided into two parts: TSS = ESS + RSS
The closer ESS is to TSS, or the lower RSS is relative to TSS, the higher the R² value
Therefore, R² is commonly interpreted as the part of the variation in Y explained by X
Note! R² will be lower if the relationship between our variables is non-linear!
59
ESS = explained (model) sum of squares; RSS = residual sum of squares; TSS = total sum of squares
60
Amount of explained variance in happiness explained by health
R2 = 1-RSS/TSS = 1- (0.805/1.476) = 0.45
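The same decomposition can be recovered from the results Stata stores after the regression (a sketch; assumes regress happiness health was just run):

display "ESS (model SS)    = " e(mss)
display "RSS (residual SS) = " e(rss)
display "R2 = 1 - RSS/TSS  = " 1 - e(rss)/(e(mss) + e(rss))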
61
A visual of how R² works: R² = 1 - (rss/tss)
62
B values vs. R2 values – important distinction!
Example from the figure: one relationship has R² = 0.10 and b = 4.33; another has R² ≈ 0.90. R² doesn't say ANYTHING about the effect size.
63
2. Testing model significance: F-test
If our null is of the form H0: β1 = β2 = … = βk = 0, then we can write the test statistic in the following way:
F0 = [ (RSS1 − RSS2) / (P2 − P1) ] / [ RSS2 / (n − P2) ]
This compares whether the betas we put in a model explain variation significantly better than an empty model with just a constant. It is basically the explained variance over the residual variance.
Degrees of freedom: n is the number of observations, P2 is the number of parameters in the full ('unrestricted') model, and P1 is the number of parameters in the 'restricted' model – in this case just a constant – where P2 > P1. RSS1 is the residual sum of squares of the restricted model and RSS2 that of the unrestricted model.
This can also be used to test 'nested models' (more later…)
This terminology may seem a bit strange at first. We are 'restricting' the general model by supposing that the null is true and removing variables from the model. Thus, the difference (RSS1 − RSS2) tells us how much bigger the residuals are in the model where the null hypothesis is true. If the residuals are a lot bigger in the restricted model, then F0 will also be big. When the residuals are bigger, we know that the fit of the regression is worse. Thus, F0 is big when the restriction makes the fit of the regression a lot worse, which is exactly when we would question the null hypothesis. If these variables really had no predictive power, then removing them should not affect the residuals. We will discuss how big F0 needs to be to reject the null hypothesis a bit later.
64
2. Testing model significance: F-test
H0: β1 = β2 = … = βk = 0
Ha: At least one β is different from 0
If p < 0.05, we reject the null hypothesis in favor of Ha
Note! A significant F value does not necessarily mean we have a good model. However, if we cannot reject H0, our model is indeed bad!
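In Stata, this kind of joint F-test for a subset of coefficients can be run with the test command (a sketch with generic variable names):

regress y x1 x2 x3
test x2 x3        // H0: the coefficients on x2 and x3 are jointly zero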
65
Mean Squared Error (MSE)
The MSE tells us how 'off' the model is on average, in the units of the DV. Some units are less 'intuitive' than others; when that is the case, compare the root MSE with the standard deviation of the DV.
Also useful for comparing different models with the same DV.
The root MSE here tells us that our predictions are on average 'off' by 0.21.
66
Adding additional variables: multiple regression
67
Last week: regression introduction; basics of OLS – calculations of beta, alpha, the error term, etc.; bivariate analysis; basic model diagnostics: R², F-tests, MSE
Today: multivariate regression; assumptions of OLS, detection of violations and what to do about them…
68
Back to Sir Galton….
69
Multiple regression: so far, we've kept it simple with bivariate regression models: Y = β0 + β1X + e
With multiple regression, we're of course adding more variables ('parameters') to our model. In the F-test terminology above, we're now estimating a less 'restricted' model:
Y = β0 + β1X + β2Z + e
We're thus able to account for a greater number of explanations as to why Y varies. Additional variables can be included for a number of reasons: controls, additional theoretical variables, interactions (later)
70
Now, how do we interpret our coefficients?
βn = the change in Y for a one-unit change in Xn, holding all other variables constant (or 'all things being equal', or 'ceteris paribus'). In other words, the average marginal effect across all values of the additional X's in the model.
α (intercept) = the estimated value of Y when all X's are held at '0'. This may or may not be realistic.
71
Circle Y: the total variation in the dependent variable (TSS)
Circle x1: the total variation in the first independent variable
Circle x2: the total variation in the second independent variable
A: the unique covariance between the independent variable x1 and y
B: the unique covariance between the independent variable x2 and y
C: shared covariance between all three variables
D: covariance between the two independent variables not including the dependent variable
E and F: variation in x1 (E) and x2 (F), respectively, that is not associated with the other variables
G: the variation in the dependent variable NOT explained by the independent variables – this is the variation that could be explained by additional independent variables (RSS)
72
B coefficients in multiple regression, cont.
Regression of y (dependent) on x2 (independent): areas C and B are predicted by this equation, while areas A and G end up in w (the error), the part of y not explained by x2.
Now we can calculate the unique effect of x1 on y under control for x2. To ensure that the b coefficient for x1 corresponds to area A, we must remove the variation in the dependent variable that has to do with x2, that is, areas B and C. We do this by a regression between the dependent variable and x2. Ŷ is the variation in y explained by x2, that is, areas B and C. What remains is w, the unexplained variance in y, and this is what enters the calculation of the b coefficient for x1.
73
Calculation of the b coefficients in multiple regression
y = B1 + B2·x2 + B3·x3 + e, where B1 = the intercept
Thankfully, programs like STATA do this for us…
74
Starting simple: dummy variables in regression
If an independent variable is nominal, we can still use it by creating dummy variables (if > 2 categories)
A dummy variable is a dichotomous variable coded 0 and 1 (based on an original nominal or ordinal variable)
The number of dummy variables needed depends on the number of categories of the original variable: number of categories minus 1 = number of dummy variables.
Ex. party affiliation: Alliansen, R-G, SD – we would include dummies for 2 groups and their betas are compared with the third (omitted) group (see the sketch below)
We can also do this for ordinal IVs, like low, middle and high, for example.
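A sketch of the two usual ways to handle this in Stata (party is a hypothetical 3-category variable):

tabulate party, generate(party_)    // creates dummies party_1, party_2, party_3
regress y party_2 party_3           // party_1 is the omitted reference category
regress y i.party                   // or let factor notation omit the reference automatically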
75
In any regression, the intercept equals the mean of the dependent variable when all X's = 0; for a dummy variable this is the mean of Y for the reference category (RC). The coefficients show each category's difference in mean relative to the RC.
If we add other independent variables to our model, the intercept is interpreted as the value when ALL independent variables are 0. The interpretation of the coefficients for the dummy variables is still relative to the reference category, but now under control for the additional variables we entered into the model.
76
Example: support for EU integration, EES data (on GUL)
Let's say we're interested in explaining why support for further EU integration varies at the individual level in Sweden.
DV: 'Some say European unification should be pushed further. Others say it already has gone too far. What is your opinion? Please indicate your views using a scale from 0 to 10, where 0 means unification "has already gone too far" and 10 means it "should be pushed further".'
3 IVs: gender (0 = male, 1 = female), education (1 = some post-secondary or more, 0 otherwise) and European identity (attachment, 0-3, where 0 = very unattached and 3 = very attached)
EU Support = β0 + β1(female) + β2(education) + β3(Euro identity) + e
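A sketch of this model in Stata (the variable names used here for the EES file are assumptions, not taken from the dataset documentation):

regress eu_support female education euro_identity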
77
Summary stats: the DV ranges from 0-10; 2 binary IVs; 1 ordinal IV
78
1. Intercept: ? The predicted level of the DV when all variables = 0 (men, without college, who are strongly detached from Europe)
2. Female: the effect of gender is significant. Holding education and European identity constant, females on average score 0.4 lower on support for further EU integration.
3. Education: the effect is also significant. Having some post-secondary education increases support for EU integration by 0.37, holding gender and European identity constant.
4. European attachment: is significant. Holding education and gender constant, a one-unit increase in attachment results in an increase in support for the DV of 1.05 on average.
79
A visual with gender and identity
Effects of both variables holding the other constant, as well as education.
80
Some predictions from our model
What is the predicted level of support for further EU integration for:
1. A male with some university and a strong European identity (3)?
= 2.21 + (−0.40)(0) + 0.37(1) + 1.05(3) = 5.73
2. A female with no university and a very weak European attachment (0)?
= 2.21 + (−0.40)(1) + 0.37(0) + 1.05(0) = 1.81
81
Comparing marginal effects
Significance values are not always interesting… almost everything tends to become significant with many observations, as in large survey data…
Another great feature of OLS is that we can compare both the marginal and the total effects of all B's. When you are about to publish your results, you often want to say which variables have the greatest impact in this model.
Here we can show both the marginal effects (shown in the regression output) – these effects/b-values only show the change in Y caused by a one-unit change in X – AND the total effects (the min-to-max effect, or the effect within a certain range), for which one has to consider the scale.
Question: what are the marginal and total effects of our 3 variables?
82
Answer.. For binary variables, marginal and total are the same
For ordinal/continuous variables, we can do a few things to check this:
'Normalize' (re-scale) the variable to 0/1 (see the do-file for this)
Compare standardized coefficients (just add the option 'beta')
Alternative – use the 'margins' command (more later…)
83
For our model….

Variable       Marginal effect   Total (max-min) effect
female         -0.4              -0.4
education      0.37              0.37
Euro attach    1.05              3.15
84
Direct comparison: Standardized coefficients
Standardized coefficients can be used to make direct comparisons of the effects of IVs
When standardized coefficients (beta values) are used, the scale unit of all variables is deviations from the mean – the number of standard deviations
Thus, we gain comparability but lose the intuitive feel in our interpretation of the results; we can always report both 'regular' betas and standardized ones.
To standardize x1: subtract the mean of the variable from each observation and divide by the standard deviation of the variable
85
STANDARDIZED COEFFICIENTS (BETAS)
The standardization of b: standardized scores are also known as z-scores, so they are often labeled with a 'z'
In STATA:
quietly summarize y
gen zy = (y - r(mean)) / r(sd)
quietly summarize x
gen zx = (x - r(mean)) / r(sd)
The standardized coefficient equals b multiplied by the ratio of the standard deviations: beta = b * (Sx / Sy)
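Equivalently, the standardized coefficients can be requested directly with the beta option mentioned earlier (generic variable names):

regress y x1 x2, beta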
86
Another way of reporting comparative effects… (Bauhr and Charron 2017)
87
Ordinary least squares regression, Day 2. Nicholas Charron, Associate Prof., Dept. of Political Science
88
OLS is ’BLUE’ What is this? It is the Best Linear Unbiased Estimator
Aka the 'Gauss–Markov theorem', which "states that in a linear regression model in which the errors have expectation zero, are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator. Here 'best' means giving the lowest variance of the estimate, as compared to other unbiased, linear estimators."
89
Assumptions of OLS
OLS is fantastic if our data meet several assumptions, and before we make any inferences, we should always check. In order to make inference:
Correct model specification – the linear model is suitable
No severe multicollinearity
The conditional standard deviation is the same for all levels of X (homoscedasticity)
Error terms are normally distributed for all levels of X
The sample is selected randomly
There are no severe outliers
There is no autocorrelation
90
1) Model specification
a) Causality in the relationships – not so much a problem for the statistical model but rather a theoretical problem. Better data and modelling – use panel data, experiments, or theory!
b) Is the relationship between the DV and IV LINEAR? If not, OLS regression will give biased results.
c) All theoretically relevant variables should be included. If they are not, this will lead to 'omitted variable bias': if an important variable is left out of a model, this will influence the coefficients of the other variables in the model. Remedy? Theory, previous literature, motivate all variables, and some statistical tests/checks.
91
Linear model is suitable
When 1 or more IVs has a non-linear effect on the DV, a relationship exists but cannot be properly detected in standard OLS.
This is probably one of the easiest violations to detect:
Bivariate scatterplot: if the scatterplot doesn't show an approximately linear pattern, the fitted line may be almost useless.
Ramsey RESET test (F-test)
Theory
If X and Y do not fit a linear pattern, there are several measures you can take.
92
Checking for this: health and happiness (in GUL)
Scatter looks ok, but let's check more formally with the Ramsey RESET test. 3 steps (see the sketch below):
1. Run the regression in STATA
2. Run the command linktest
3. Run the command ovtest
The linktest re-estimates your DV with the model's prediction and squared prediction as IVs. ovtest: H0: the model is specified correctly.
A significant squared prediction or F-stat implies that the model is incorrectly specified.
If significant, make an adjustment and re-run the regression & test.
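A sketch of the three steps for the health/happiness model (note that linktest itself fits a new regression, so the original model is re-run before ovtest):

regress happiness health
linktest                  // _hatsq should be insignificant if the model is well specified
regress happiness health  // re-estimate so that ovtest applies to the original model
estat ovtest              // Ramsey RESET; H0: the model is specified correctly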
93
Example with health and happiness data
The 3 steps What do you see?
94
Non-linearity can be detected
95
Issues with non-linearity
Problems with curvilinear relationships: we will under- or overestimate the effect on the dependent variable for different values of the independent variable. However, this is a 'sexy' problem to have at times…
OLS can be used for relationships that are not strictly linear in y and x by using non-linear functions of y and x. 3 standard approaches, depending on the data:
1. Natural log of x, y or both (i.e. logarithm)
2. Quadratic forms of x or y
3. Interactions of x variables
Or adding more data/observations… The natural logarithm will downplay extreme values and make the variable more normally distributed.
96
Variable transformation: the natural logarithm
Log models are invariant to the scale of the variables, since they now measure percent changes.
Sometimes done to constrain extreme outliers and downplay their effect in the model, making the distribution more 'compact'.
Standard variables in social science that researchers tend to log:
1. Positive variables representing wealth (personal income, country GDP, etc.)
2. Other variables that take large values – population, geographic area size, etc.
Important to note: the rank order does not change from the original scale!
97
Transforming your variables
Using the natural logarithm (i.e. the inverse of the exponential function). Only defined for x > 0.
Ex. corruption explained by country size (population): population and corruption vs. logged population and corruption.
In Stata:
reg DV IV
gen logIV = log(IV)
reg DV logIV
98
Interpretation of transformations with logs
1. Logged DV and non-logged IV: ln(y) = β0 + β1x + u. β1 is approximately the proportional change in y given an absolute change in x: a 1-step increase in the IV gives a (coefficient × 100) percent increase in the DV (%Δy ≈ 100·β1 for a one-unit change in x).
2. Logged IV and non-logged DV: y = β0 + β1·ln(x) + u. β1 is approximately the absolute change in y for a percentage change in x: a 1 percent increase in the IV gives a coefficient/100 increase in the DV in absolute terms (Δy ≈ (β1/100)·%Δx).
3. Logged DV and IV: ln(y) = β0 + β1·ln(x) + u. β1 is the elasticity of y with respect to x (%Δy ≈ β1·%Δx); β1 is thus the percentage change in y for a percentage change in x.
NOTE: this interpretation only applies to log base e (natural log) transformations.
99
Rules for interpretation of beta with log-transformed variables
100
Quadratic forms (e.g. squared)
Ex. democracy versus corruption, explained later by an interaction with economic development.
Charron, N., & Lapuente, V. (2010). Does democracy produce quality of government? European Journal of Political Research, 49(4).
101
Quadratic forms –capture diminishing or increasing returns
How to model this? Quite simple, add a squared term of the non-linear IV
102
Quadratic forms: interpretation
Analyses including quadratic terms can be viewed as a special case of interactions (more on this topic on Friday)
Include both the original variable and the squared term in your model: y = β0 + β1x + β2x² + u
For 'U'-shaped curves, β1 should be negative while β2 should be positive
Including the squared term means that β1 can't be interpreted alone as measuring the change in y for a unit change in x; we need to take β2 into account as well, since:
Slope = Δy/Δx ≈ β1 + 2·β2·x
103
In Stata, 2 approaches:
1. Generate a new squared variable:
gen democracy2 = democracy*democracy
2. Tell STATA in the regression with the '#' sign. For continuous or ordinal variables we need to add the 'c.' prefix to the variable:
reg corruption c.democracy c.democracy#c.democracy
Comparing the results, we see…
105
Quadratic forms – getting concrete model predictions using the margins command
Slope = Δy/Δx ≈ β1 + 2·β2·x
margins, at(fh_polity2=(0(1)10))
marginsplot
106
Other things to watch for under assumption 1
The sample is a simple random, representative sample (SRS) from the population
The model has the correct functional form
The data are valid and accurately measure the concepts
No omitted variables (exogeneity)
107
No omitted IV’s - exogeneity
The error term has zero population mean: E(εi) = 0
The error term is not correlated with the X's: E(εi | X1i, X2i, …, XNi) = 0. This assumption is also called 'exogeneity'. It basically means that the X's are not correlated with the error term in any systematic way; violations typically result from omitted variable bias.
Can be checked by looking at correlations and scatterplots of the residual against the IVs – if a correlation/pattern exists, this can lead to bias (more on this later)
108
2. No severe multicollinearity
What is multicollinearity? 'Perfect' multicollinearity is when two variables X1 and X2 are correlated at 1 (or -1), but it is also a problem when X1 and X2 are highly correlated, say above 0.6 or below -0.6.
Example: if we estimate one's shoe size with height, and include measures of height in both cm and inches.
109
Cont. Since an inch = 2.54 cm, we know that if someone is 63 inches tall then they are 160 cm, for example.
What happens?
shoe size_i = β0 + β1·height_inches + β2·height_cm + e_i
What is the effect of β1 on Y? The effect of inches on shoe size when holding cm (β2) constant – but inches don't vary when holding cm constant! So the β's cannot be estimated (they are undefined).
110
Multicollinearity: other examples
Nominal/categorical variables: employment (1. private sector, 2. public sector, 3. not working) – we must exclude one category as a 'reference'
But these examples are mainly errors by us… What happens if X1 and X2 are just highly correlated? OLS/BLUE is not violated and estimates are still unbiased, but they become less EFFICIENT (higher standard errors).
111
112
Detecting multicollinearity
You run a model where none of the X's is significant, but the overall F-test is significant
Look at a Pearson correlation table – if any variables are correlated above (rule of thumb) 0.6 or below -0.6, this could be an issue
A post-regression VIF (variance inflation factor) test. This tests whether any X in the model is in linear combination with the other X's:
VIF_j = 1 / (1 − R²_j)
If there is no correlation between Xj and any other X's, then R²_j = 0 and thus VIF_j = 1, which is the lowest value. You get a VIF for each X and for the model as a whole. Any value above 10 (rule of thumb) is considered a problem.
In STATA, post-regression: estat vif
113
What to do about multicollinearity??
1. If X1 and X2 are highly correlated, drop one (the least important) – this points to a possible trade-off between BIAS and EFFICIENCY
2. Increase N; multicollinearity has a larger impact in smaller samples
3. Combine the variables into an index. This can be done via principal component or factor analysis, for example
4. Do nothing and just be clear about the problem
114
Short exercise Open dataset on GUL: practicedata.dta
Explain the share of women in parliament (DV) as a function of corruption, population, and spending on primary education
Check scatterplots and correlations, and run a multivariate regression
Interpret all coefficients, check the model statistics
Test/examine whether the linear relationship is appropriate for all IVs
Make the proper transformation if necessary
Run the regression with the transformed variable. Compare the results in terms of betas, p-values and R² with the non-transformed regression output – what do you see?
Check for multicollinearity, based on correlation tables & a VIF test – what do you see?
A sketch of these steps in Stata follows below.
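A rough sketch of the exercise (the DV and IV names are placeholders; use the actual variable names in practicedata.dta):

use practicedata.dta, clear
pwcorr women_parl corruption population educ_spend, sig
regress women_parl corruption population educ_spend
estat ovtest                          // check the linearity/specification
gen log_population = log(population)  // a likely transformation for a skewed IV
regress women_parl corruption log_population educ_spend
estat vif                             // check multicollinearity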
115
Assumptions of OLS
OLS is fantastic if our data meet several assumptions, and before we make any inferences, we should always check. In order to make inference:
Correct model specification – the linear model is suitable
No severe multicollinearity
Error terms are normally distributed for all levels of X
The conditional standard deviation is the same for all levels of X (homoscedasticity)
There are no severe outliers
There is no autocorrelation
The sample is selected randomly / is representative
116
3. No extreme outliers
Outliers, if undetected, can have a severe impact on your beta estimates. You must check for these, especially where Y or the X's are continuous.
Three ways to think about outlying observations:
Leverage outlier – an observation far from the mean of Y or X (e.g. 2 or 3+ standard deviations from the mean)
Residual outlier – an observation that 'goes against our prediction' (i.e. has a lot of error)
Influence – if we take this observation out, do the results change significantly?
A leverage outlier is not necessarily a problem (if it is in line with our predictions). However, a leverage outlier is very misleading if it is also a big residual outlier, in which case it will be an influential observation.
117
use http://www.ats.ucla.edu/stat/stata/dae/crime, clear
Run a regression explaining crime in a state (# of violent crimes per 100,000 people) with 3 IVs: % metro area, poverty rate (%), and % of single-parent households. Interpretation?
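A sketch of this regression (the variable names crime, pctmetro, poverty and single are assumed to be those in the UCLA crime dataset):

use http://www.ats.ucla.edu/stat/stata/dae/crime, clear
regress crime pctmetro poverty single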
118
Detection of influence of obs: lvr2plot
A simple leverage-residual plot can give us a clear visual. We do this after a regression in STATA. Y-axis = leverage; X-axis = (normalized) residual squared. Any observation near the top right corner can especially bias the results!
119
outliers via ‘studentized’ residuals
We can check with normal residuals, but they depend on their scale, which makes it hard to compare different models. As our model is an estimate of the 'true' relationship, so are the errors.
The issue is that although the variance of the error term is assumed equal (homoskedastic), the estimated residual variances are often not equal for all levels of X; the variance might decrease as X increases, for example.
Studentized residuals are adjusted: they are re-calculated residuals whereby the regression line is re-estimated leaving out each observation one at a time. We then compare the original estimates (all obs) with the estimates obtained when removing each observation. For observations where the line moves a lot, the observation has a larger studentized residual.
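A sketch of how to obtain and inspect studentized residuals after the crime regression (the listing variable state is assumed from that dataset):

predict rstu, rstudent                        // studentized residuals
list state rstu if abs(rstu) > 2 & rstu < .   // flag observations beyond +/- 2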
120
Normal (raw), vs. studentized residuals
Normal vs. studentized: studentized residuals can be related to the z-score, where roughly 95% of the residuals fall within ±2 standard deviations.
121
Looking at obs on extremes of distribution
Command 'hilo' (user-written): specify with show(#) how many observations you want to see (default = 10). Any observation below -2 or above +2 (especially -3 or +3) should be looked at further.
122
Influence of each observation: Cook’s D
In STATA, after a regression: predict d, cooksd
If Cook's D = 0 for an observation, the observation has no influence; the higher the D value, the greater the influence. It is calculated via an F-test, testing whether the predictions change when observation i is left out.
The 'rule of thumb' for observations with possibly troublesome influence is D > 4/n
Here, with n = 51, you can list the flagged observations with: if d > 4/51
123
Compare outlier’s stats on variables with sample
124
Measuring influence for each IV: DFBETA
dfbeta is a statistic of the influence of each observation on each IV in the model
It tells us how many standard errors the coefficient WOULD CHANGE if we removed the observation
A new variable is generated for each IV
Ex. DC increases the beta of % single parent by 3.13 standard errors (or 3.13 × 15.5) compared to the regression without DC
Dependent on the scale of Y and X!
Caution for any dfbeta value above 2/√n = 2/√51 ≈ 0.28
125
What to do about outliers?
Again, it depends on what type of 'outlier' an observation is! There is no 'right' answer here; just be aware of whether they exist and how much effect they have on the estimates. BUT:
1. Check for data errors!
2. Create an obs. dummy for the outliers:
gen outlier = 1 if ccode == x
replace outlier = 0 if outlier == .
3. Take out the obs, re-run the model & see if there are any differences; run 'lfit' and compare R² stats… Report any differences…
4. New functional form (log, normalized variables)
5. Do nothing, leave them in, and footnote
6. Use weighted observations
126
127
Robust regression (rreg)
Robust regression can be used in any situation in which you would use OLS. It can also be helpful in dealing with outliers after we decide we have no compelling reason to exclude them from the analysis (they are not data entry errors, nor from a different population than the rest of our data).
In normal OLS, all observations are weighted equally. The idea of robust regression is to weigh the observations differently based on how 'well behaved' they are: a compromise between excluding outlying points entirely and including all data points and treating them equally. Basically, it is a form of weighted and reweighted least squares (WLS).
128
Robust regression (rreg)
Stata's rreg command implements a version of robust regression. It first runs the OLS regression and gets Cook's D for each observation; any observation with Cook's distance greater than 1 (severe influence) is dropped. Then an iterative process begins in which weights are calculated based on the absolute residuals: observations with small residuals get a weight of 1, and the larger the residual, the smaller the weight (Huber weighting, followed by biweighting). The iterations stop when the maximum change in the weights from one iteration to the next falls below the tolerance.
Using the Stata defaults, robust regression is about 95% as efficient as OLS (Hamilton, 1991). In short, the most influential points are dropped, and then cases with large absolute residuals are down-weighted.
Looking at our example data on women in parliament…
129
130
Short exercise: open the 'practicedata' dataset again, and we'll run the same regression as in example 1
Again, examine scatterplots between the DV and each IV
Run the regression
Search for outliers:
Visual residual-leverage plot: lvr2plot, mlabel(cname)
Cook's D
Dfbeta (you can look at all 3, or the dfbeta for each IV one at a time if easier): e.g. list cname _dfbeta_1 if _dfbeta_1 > 2/√n (**don't forget to calculate 2/√n)
What do you see? Do any observations break our 4/n Cook's D or 2/√n dfbeta rule? Which countries are they?
What would you do about this? Do your adjustments change your regression results?
A sketch of these checks is given below.
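A sketch consolidating these checks (dv and iv1-iv3 are placeholders for the actual variable names; cname is taken from the commands above):

regress dv iv1 iv2 iv3
lvr2plot, mlabel(cname)
predict d, cooksd
list cname d if d > 4/e(N) & d < .               // Cook's D rule of thumb
dfbeta                                           // creates _dfbeta_1, _dfbeta_2, ...
list cname _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(e(N)) & _dfbeta_1 < .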
131
Assumptions that concern the error term: normality, homoskedasticity, no autocorrelation, independence of observations
132
4. The mean of the error = 0, and errors are normally distributed for all levels of X
Key issues: There is a probability distribution of Y for each level of X. A ‘hard’ assumption is that this distribution is normal (bell shaped) Given that µy is the mean value of Y, the standard form of the model is where is a random variable with a normal distribution with mean 0 and standard deviation .
133
Normal distribution of the error terms
Violations of any of the three former assumptions (1. model specification/linearity, 2. no extreme observations, 3. no strong multicollinearity) could potentially result in bias in the estimated coefficients. Violations of the assumptions concerning the residuals (4. absence of autocorrelation, 5. normally distributed residuals, 6. homoskedasticity) do not necessarily affect the estimated coefficients, but they may reduce your ability to perform inference and hypothesis testing. But they can, so it's always good to check! This is because the distribution of the residuals is the foundation for significance tests of the coefficients – it is the distribution that underlies the calculation of t- and p-values. This is especially true for smaller samples: a prerequisite in small samples is that the residuals are normally distributed.
134
Analysis of residuals – always important to do, for several assumptions
To examine whether the regression model is appropriate for the data being analyzed, we can check the residual plots. Later we can do more 'advanced' tests to see whether we've violated some assumptions.
Residual plots:
1. Histogram of the residuals
2. Scatterplot of the residuals against the fitted values (y-hat)
3. Scatterplot of the residuals against the independent variables (x)
4. Scatterplot of the residuals over time if the data are chronological (more later in time series analysis)
135
Plotting the residuals
Use the academic performance data, and regress academic performance (api00) on the % of students receiving free meals (meals), the % of English language learners (ell), and the % of teachers with emergency credentials (emer):
regress api00 meals ell emer
Then predict the residuals:
predict r, resid
Plot the density of the residuals against a normal bell curve – how closely are they matched?
kdensity r, normal
A qnorm plot (plots the quantiles of a variable against the quantiles of a normal distribution):
qnorm r
136
Density plot; qnorm plot. The further away our residuals are from the expected line (in red), the bigger the potential problems…
137
More 'formal' tests: the Shapiro-Wilk W test for normality tests the proximity of our residual distribution to the normal bell curve. H0: the residuals are normally distributed.
swilk r
138
5. Homoskedasticity: the error has a constant variance around our regression line. The opposite of this is heteroskedasticity: the variance of the error depends on the values of the X's.
139
What does heteroskedasticity look like?
Plotting the residuals against X, we should not see the spread around the fitted line change systematically with X.
140
Consequences: if you find heteroscedasticity, then, like multicollinearity, this will affect the EFFICIENCY of the model. The calculation of standard errors, and thus p-values, will be uncertain, since the dispersion of the residuals depends on the level of the variables. The effect of X on Y might be very significant at some levels of X and less so at others, which makes a total significance calculation impossible. Heteroscedasticity does not necessarily result in biased parameter estimates, but OLS is no longer BLUE. The risk of Type I or Type II errors will increase (what are these?? e.g. 'false positive' & 'false negative').
141
How to check for heteroskedasticity
A visual plot of the residuals over the fitted values of Y: rvfplot, yline(0)
Here we do not want to see any pattern – just a random, insignificant scattering of dots…
Use the 'academic performance data', and regress academic performance (api00) on the % of ESL learners (ell), % of students with free meals (meals), and average education of parents (ave_ed)
142
rvfplot, yline(0) – what do we observe?
It looks kind of random, but the error term seems to narrow as the fitted values get higher…
143
More 'formal' tests:
2. Breusch-Pagan / Cook-Weisberg test – regresses the squared errors on the X's. Good at detecting linear heteroskedasticity, but not non-linear forms. H0: no heteroskedasticity.
3. Cameron & Trivedi's IM test – similar, but also includes squared X's in the regression.
**Both are sensitive and will often be significant even with only slight heteroskedasticity…
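In Stata these are run after the regression (a sketch):

estat hettest     // Breusch-Pagan / Cook-Weisberg; H0: constant variance
estat imtest      // Cameron & Trivedi's IM (information matrix) test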
144
If we find something, we might check individual IV’s and residual plots, and look at correlations of IV’s and error
145
What to do about this? You don't always have to do anything, but if severe:
Try transforming the X's (non-linear, logged/unlogged) to make the relationship more linear
Remove variables that are suspect or insignificant and re-run the regression, or add more variables
Use 'weighted least squares regression' (WLS), where certain observations (maybe those that deviate most?) are weighted less than others, thus affecting the standard errors
Use a stricter alpha as the significance cut-off for p-values – 0.01 instead of 0.05 – to reduce the risk of Type I error
146
6) Autocorrelation
The same unobservable forces might be influencing the dependent variable in successive time points.
It is defined as the correlation between two values of the same variable at times t and t−1.
For example, the factors likely to predict defense spending/voting/economic development, etc. in 1971 are likely to also predict 1972, and therefore whatever error remains from our estimation of Y in 1971 will persist in 1972.
Can lead to BIASED and/or INEFFICIENT estimates with OLS.
147
6) Autocorrelation, cont.
The problem also occurs when observations are ordered by geographical location: serial vs. spatial autocorrelation.
The consequences are quite serious. Positive autocorrelation will tend to increase the variation of the sampling distributions of the estimated coefficients – which can show up as great variation across different models. In a simple model the result would, on the contrary, be an underestimation of the standard errors of the estimates, so we risk that a given coefficient appears significant when it is not. The same goes for R², which may also be overestimated.
148
6) Autocorrelation: detection
The Durbin-Watson statistic goes from 0 to 4, where 0 indicates high positive autocorrelation and 4 high negative autocorrelation, while 2 indicates the absence of autocorrelation. In Stata: estat dwatson
Solutions: in order to correct for autocorrelation one has to use time-series regression, but in OLS one could consider including the lagged dependent variable (Y_{t−1}) as a regressor and thereby remove the non-independent information in the variable (this only works for time-series autocorrelation).
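A sketch of the Durbin-Watson test in Stata; estat dwatson requires the data to be declared as time series first (year, y and x are hypothetical variable names):

tsset year
regress y x
estat dwatson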
149
More on autocorrelation in Stefan’s time-series module!
150
7) Independence of Errors
This assumption states that the error from one observation is independent of the error from another observation. Actually, it is not the dependency by itself that matters; it is whether the errors are correlated that matters. Dependency in errors often happens in financial and economic time-series data and in cross-country multilevel data (e.g. survey data from multiple countries). Multilevel – affects coefficients & significance. TSCS – affects mainly significance. A Hausman test can be used to assess this.
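One common way to run a Hausman test in Stata is to compare fixed- and random-effects panel models; a sketch with hypothetical panel variables country, year, y and x:

xtset country year
xtreg y x, fe
estimates store fixed
xtreg y x, re
estimates store random
* H0: the random-effects estimates do not differ systematically from the fixed-effects ones
hausman fixed random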
151
What needs to be considered depends on your data!
Model specification – linearity: always important
No extreme observations: more important in small samples – in small samples, outliers are more severe
No strong multicollinearity
No autocorrelation: more important in time-series or cross-section data
Errors have zero mean with a normal distribution
Errors have constant variance: more important in large samples
Observations shall be independent of each other: more important in time-series or multi-level cross-section data
152
Interaction terms
Back to our discussion about the assumption of ’proper model specification’. Sometimes our X variable has a non-linear effect due to an interaction with another IV. When testing this, it is called a ‘conditional hypothesis’. A conditional hypothesis is simply one in which a relationship between two or more variables depends on the value of one or more other variables. Ex. An increase in X is associated with an increase in Y when condition Z is met, but not when condition Z is absent. Ex. The effect of education on income is stronger for men than for women. In technical terms, we compare the following two models:
Additive multiple regression model: Y = α + β1x + β2z + µ
Multiplicative multiple regression model: Y = α + β1x + β2z + β3xz + µ
153
Types of interaction terms
Types of interaction terms
Our X variables can take different shapes depending on their measurement. These are the combinations, in order of complexity to interpret:
Two dummy variables: ex. gender*unemployed
One dummy, one continuous/ordinal variable: ex. gender*age
Two continuous/ordinal variables: ex. age*income
The first interaction can also be modelled as 3 dummy variables in relation to one reference category, in this case unemployed males:
Y = α + β1(female_unemployed) + β2(male_employed) + β3(female_employed) + µ
155
One dummy, one continuous/ordinal: visual example
In this example Z = 0 or 1. Fig. 1 graphically illustrates a multiplicative interaction model that is consistent with hypothesis H1 (the figure assumes that β0 and β2 are both positive). While this simple example involves a dichotomous modifying variable – condition Z modifies the effect of X on Y – the basic interaction model can be extended to continuous modifying variables or arguments with even greater causal complexity; the points made do not depend on whether Z is dichotomous or continuous. Taken from Brambor et. al. (2005)
156
Interaction Interpretations
Multiplicative multiple regression model: Y = α + β1x + β2z + β3xz + µ
When condition Z (a dummy variable) is absent (i.e. = 0), the equation above simplifies to: Y = α + β1x + µ, where β1 is the effect of X for observations that take 0 on Z.
And when condition Z is present (i.e. = 1, or greater), the effect of X on Y becomes: Y = (α + β2) + (β1 + β3)x + µ.
Now we see that β1x cannot be interpreted independently of β3.
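A sketch of how this looks in Stata with hypothetical variables y, x and a dummy z; factor-variable notation adds the constitutive terms and the product term in one step, and lincom recovers the conditional effects:

* fit the multiplicative model: x, z and x*z
regress y c.x##i.z
* effect of x when z = 0 (this is just B1, the reported coefficient on x)
lincom x
* effect of x when z = 1 (B1 + B3), with its own standard error
lincom x + 1.z#c.x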
157
4 important points
1. Interaction models should be used whenever the hypothesis to be tested is conditional in nature
2. Include All ”Constitutive Terms”. These are just the two variables that make up the interaction (e.g. x and z)
3. Do Not Interpret Constitutive Terms as Unconditional Marginal Effects
4. Calculate Substantively Meaningful Marginal Effects and Standard Errors
Brambor, Thomas, William Roberts Clark, & Matt Golder. "Understanding Interaction Models: Improving Empirical Analyses." Political Analysis 14(1): 63–82.
158
Include All Constitutive Terms
No matter what form the interaction term takes, all constitutive terms should be included. Thus, X should be included when the interaction term is X², and X, Z, J, XZ, XJ, and ZJ should be included when the interaction term is XZJ.
b1 does not represent the average effect of X on Y; it only indicates the effect of X when Z is zero
b2 does not represent the average effect of Z on Y; it only indicates the effect of Z when X is zero
Excluding X or Z is equivalent to assuming that b1 or b2 is zero.
Taken from Brambor et. al. (2005)
159
Include All Constitutive Terms
The constitutive term for Z (β2z) captures the difference in the intercepts between the regression line for the case in which condition Z is present and the one for the case in which condition Z is absent – omitting Z amounts to constraining the two regression lines to meet on the Y axis. Taken from Brambor et. al. (2005)
160
Multicollinearity
Just as we discussed with the quadratic term, the coefficients in interaction models no longer indicate the average effect of a variable as they do in an additive model. As a result, they are almost certain to change with the inclusion of an interaction term, and this should not be interpreted as a sign of multicollinearity. Even if there really is high multicollinearity and this leads to large standard errors on the model parameters, it is important to remember that these standard errors are never in any sense ‘too’ large – they are always the ‘correct’ standard errors. High multicollinearity simply means that there is not enough information in the data to estimate the model parameters accurately, and the standard errors rightfully reflect this.
161
Multicollinearity ‘solutions’ have been posited: re-scaling the variables, ‘centering’. Centering the IVs around their mean does not solve the problem (Aiken and West 1991). Regardless of the complexity of the regression equation, centering has no effect at all on the coefficients of the highest-order terms, but may drastically change those of the lower-order terms in the equation. Centering unstandardized IVs usually does not affect anything of interest. Simple slopes will be the same in centered as in un-centered equations, their standard errors and t-tests will be the same, and interaction plots will look exactly the same, but with different values on the x-axis.
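A quick way to see this for yourself (hypothetical continuous variables y, x and z): center the IVs and re-run; the coefficient on the interaction term is unchanged, while the lower-order coefficients and the constant shift.

* center x and z around their means
sum x
gen x_c = x - r(mean)
sum z
gen z_c = z - r(mean)
* re-fit the interaction model with the centered variables
regress y c.x_c##c.z_c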
162
3. Do Not Interpret Constitutive Terms as Unconditional Marginal Effects
When we have an interaction, the effect of the independent variable X on the dependent variable Y depends on some third variable Z (and vice versa). The coefficient on X only captures the effect of X on Y when Z is zero. Similarly, the coefficient on Z only captures the effect of Z on Y when X is zero. It is, therefore, incorrect to say that a positive and significant coefficient on X (or Z) indicates that an increase in X (or Z) is expected to lead to an increase in Y. Also, whether X modifies Z or vice versa cannot be determined by the model, only by the researcher and the theory behind it!
163
4. Calculate Substantively Meaningful Marginal Effects and Standard Errors
Typical results tables will report only the marginal effect of X when the conditioning variable is zero, i.e., b1. Similarly, Stata tables report only the standard error for this particular effect. As a result, the only inference we can draw is whether X has a significant effect on Y when Z = 0. Basically, we want to know WHERE and HOW MUCH Z conditions X’s effect on Y, and at what significance level. Results tables are often quite uninformative in this respect. Even a ‘significant’ interaction coefficient might not be that interesting, while an insignificant one can actually be significant at certain levels of Z (or X). This is where the margins command in Stata is very helpful (help margins).
164
Example: back to explaining % women in parliament
Example: back to explaining % women in parliament
This time let’s try a few different variables: IV’s – level of democracy (0-10) and the % of Protestants in a country (0-100)
165
Example: back to explaining % women in parliament
Example: back to explaining % women in parliament
This time let’s try a few different variables: IV’s – level of democracy (0-10) and the % of Protestants in a country (0-100) – now w/ interaction
What does this tell us generally speaking?
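A hedged sketch of the two models in Stata; women_parl is a hypothetical name for the % women in parliament variable, while the IV names match the margins commands on the next slides:

* additive model
regress women_parl fh_polity2 lp_protmg80
* multiplicative (interaction) model, with factor-variable notation so margins works afterwards
regress women_parl c.fh_polity2##c.lp_protmg80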
166
Using margins for interpreation
Using margins for interpretation
We can show this interaction a number of ways:
1. The marginal effect (1-unit increase) of democracy over a range of % Protestant:
margins, dydx(fh_polity2) at(lp_protmg80=( ))
where ’dydx’ means we want to see a marginal effect (ΔY from a 1-unit increase in X). The numbers ( ) after the % Protestant variable are just the min, mean + 2 s.d. and max values; I got these from just doing the ‘sum’ command.
To see a visual plot, just type marginsplot
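The slide’s actual cut-off values are not preserved here, so this sketch uses hypothetical placeholders (0, 40 and 98) for the min, mean + 2 s.d. and max of lp_protmg80:

* check the distribution to pick your own values
sum lp_protmg80
* marginal effect of democracy at chosen levels of % Protestant
margins, dydx(fh_polity2) at(lp_protmg80 = (0 40 98))
marginsplot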
168
Using margins for interpreation
Using margins for interpretation
2. Compare predicted levels of % women in parliament for 2 ’meaningful’ values of democracy over a range of % Protestant:
margins, at(lp_protmg80=( ) fh_polity2=(0 10))
Note the ’modifying variable’ (e.g. on the x axis) goes 1st after ’at’. The numbers ( ) after the % Protestant variable are just the min, mean + 2 s.d. and max values; 0 and 10 for democracy are just the min and max values. I got these from just doing the ‘sum’ command.
This is also what you’d do if you had a binary variable in the interaction (e.g. instead of 0 and 10, just type 0 1)…
To see a visual plot, just type marginsplot
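Again with the hypothetical placeholder values for % Protestant (0, 40, 98), a sketch of the predicted-levels version:

* predicted % women in parliament at low and high democracy, across % Protestant
margins, at(lp_protmg80 = (0 40 98) fh_polity2 = (0 10))
marginsplot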
170
We will do more in the next section with this command! Next time: models for limited dependent variables: logit and probit