SC968 Panel data methods for sociologists Lecture 3


SC968 Panel data methods for sociologists, Lecture 3: Fixed and random effects models continued

Overview
- Review
- Between- and within-individual variation
- Types of variables: time-invariant, time-varying and trend
- Individual heterogeneity
- Within and between estimators
- The implementation of fixed and random effects models in STATA
- Statistical properties of fixed and random effects models
- Choosing between fixed and random effects: the Hausman test
- Estimating coefficients on time-invariant variables in FE
- Thinking about specification

Between- and within-individual variation
If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample:
- The fact that individuals are systematically different from one another (between-individual variation): Joe lives in Colchester, Jane lives in Wivenhoe
- The fact that individuals' behaviour varies between observations (within-individual variation): Joe moves from Colchester to Wivenhoe

How to think about two sources of variation in panel data...
- Between variation: how does an individual vary, on average, from the sample mean?
- Within variation: how does an individual vary at any particular time point from his individual mean?
(Table: income for Jane and Joe at waves W1–W5, with each person's mean. Jane: 20, …; Joe: 15, 5, 6, 4, 10. Average income for sample: £10 per year.)

xtsum in STATA Similar to ordinary “sum” command

xtsum in STATA
Similar to the ordinary "sum" command. Notes on the output (a balanced sample has been chosen here):
- For a fixed characteristic such as sex, all the variation is "between"
- For partnership status, most of the variation is "between", because it's fairly rare to switch between having and not having a partner
- For the wave variable, all the variation is "within", because this is a balanced sample

More on xtsum...
- N: observations with a non-missing value of the variable
- n: number of individuals
- T-bar: average number of time points per individual
- The "between" min & max refer to the individual means, x̄_i
- The "within" min & max refer to individual deviations from own averages, with the global average added back in
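A minimal sketch of the commands (the identifiers pid and wave and the variable income are illustrative names; partner and female are from the lecture example):

* Declare the panel structure: person identifier and time index
xtset pid wave
* Decompose each variable's variation into overall, between and within components
xtsum income partner female wave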

Types of variable
- Those which vary between individuals but hardly ever over time: sex; ethnicity; parents' social class when you were 14; the type of primary school you attended (once you've become an adult)
- Those which vary over time, but not between individuals: the retail price index; national unemployment rates; age, in a cohort study
- Those which vary both over time and between individuals: income; health; psychological wellbeing; number of children you have; marital status
- Trend variables, which vary between individuals and over time, but in highly predictable ways: age; year

Within and between estimators
The model contains an individual-specific component u_i, fixed over time, and an error ε_it that varies over time and satisfies the usual assumptions (mean zero, homoscedastic, uncorrelated with x or u or itself):
y_it = β'x_it + u_i + ε_it
Averaging within individuals gives the "between" estimator:
ȳ_i = β'x̄_i + u_i + ε̄_i
And subtracting the individual means gives the "within" estimator – "fixed effects":
(y_it – ȳ_i) = β'(x_it – x̄_i) + (ε_it – ε̄_i)
Random effects quasi-demeans the data, using y_it – θȳ_i; θ measures the weight given to between-group variation, and is derived from the variances of u_i and ε_it.

Individual heterogeneity: one reason to use fixed effects
A very simple concept: people are different! In social science, when we talk about heterogeneity, we are really talking about unobservable (or unobserved) heterogeneity.
- Observed heterogeneity: differences in education levels, or parental background, or anything else that we can measure and control for in regressions
- Unobserved heterogeneity: anything which is fundamentally unmeasurable, or which is rather poorly measured, or which does not happen to be measured in the particular data set we are using
- Time-invariant heterogeneity: height (among adults); innate intelligence; antenatal care of mother
- Time-varying heterogeneity: social network size; beauty; weight

Unobserved heterogeneity Extend the OLS equation we used in Week 1, breaking the error term down into two components: one representing the time invariant, unobservable characteristics of the person, and the other representing genuine “error”. In cross-sectional analysis, there is no way of distinguishing between the two. But in panel data analysis, we have repeated observations – and this allows us to distinguish between them.
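Written out (a minimal sketch in standard panel notation; u_i is the time-invariant individual component and ε_it the genuine error):
$$\text{cross-section: } y_i = \beta' x_i + \nu_i \qquad \text{vs.} \qquad \text{panel: } y_{it} = \beta' x_{it} + u_i + \varepsilon_{it}$$
In a single cross-section only the composite error $\nu_i = u_i + \varepsilon_i$ is observed, so the two components cannot be separated; with repeated observations they can.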

Fixed effects (within estimator)
- Allows us to "net out" time-invariant unobserved characteristics
- Ignores between-group variation – so it's an inefficient estimator
- However, few assumptions are required, so FE is generally consistent and unbiased
- Disadvantage: can't estimate the effects of any time-invariant variables
- Also called the least squares dummy variable (LSDV) model, or the analysis of covariance model

Between estimator
- Not much used, except to calculate the θ parameter for random effects – but STATA does this, not you!
- It's inefficient compared to random effects: it doesn't use as much information as is available in the data (only uses the means)
- Assumption required: that u_i is uncorrelated with x_i. Easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x's (via the betas), as opposed to the correlation?
- Can't estimate the effects of variables whose mean is invariant over individuals: age in a cohort study; macro-level variables

Random effects estimator
- A weighted average of the within and between models
- Assumption required: that u_i is uncorrelated with x_i. A rather heroic assumption – think of examples. We will see a test for this later
- Uses both within- and between-group variation, so it makes best use of the data and is efficient
- But unless the assumption that u_i is uncorrelated with x_i holds, it is inconsistent
- AKA the one-way error components model, variance components model, or GLS estimator (STATA also allows ML random effects)

Consistency versus efficiency: random effects clearly does worse here...
(Figure: sampling distributions around the "true" value of the betas, comparing an inconsistent but efficient estimator with a consistent but inefficient one.)

... But arguably, random effects does a better job of getting close to the "true" coefficient here.
(Figure: sampling distributions of the random effects and fixed effects estimators around the "true" value of the betas.)

Testing between FE and RE: the Hausman test
- H0: u_i is uncorrelated with x_i
- H1: u_i is correlated with x_i
- Fixed effects is consistent under both H0 and H1
- Random effects is efficient, and consistent under H0 (but inconsistent under H1)
- In the example from last week, sex does not appear in the FE output, and random effects is rejected (inconsistent) in favour of fixed effects (consistent but inefficient)
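A minimal sketch of the test in Stata (the dependent variable name ghq and the panel setup are illustrative; the covariates follow the lecture example):

* Fixed effects: consistent under H0 and H1 (female is dropped automatically as time-invariant)
xtreg ghq partner badh age age2 female ue_sick, fe
estimates store fe_model
* Random effects: efficient, but consistent only under H0
quietly xtreg ghq partner badh age age2 female ue_sick, re
* Compare the stored (consistent) estimates with the current (efficient) ones
hausman fe_model .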

HOWEVER: there is a big disciplinary divide. Economists swear by the Hausman test and rarely report random effects; other disciplines (e.g. psychology) consider other factors, such as explanatory power.

Estimating FE in STATA
Notes on the output:
- The model reports an "R-square-like" statistic
- The estimated age profile peaks at age 48
- "u" and "e" are the two parts of the error term
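The command producing this kind of output is, in sketch form (the dependent variable name ghq is illustrative; assumes the data have been declared with xtset pid wave):

* Within (fixed effects) estimator
xtreg ghq partner badh age age2 ue_sick, fe
* sigma_u and sigma_e in the footer are the two parts of the error term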

Between regression
Not much used, but useful for comparing coefficients with fixed effects. The coefficient on "partner" was negative and significant in the FE model; in FE, the "partner" coefficient really measures the events of gaining or losing a partner.
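The between estimator is the same command with the be option (a sketch, using the same illustrative names as above):

xtreg ghq partner badh age age2 female ue_sick, be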

Random effects regression Option “theta” gives a summary of weights
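A sketch of the corresponding command; the theta option reports the quasi-demeaning weights used to combine within- and between-group variation:

xtreg ghq partner badh age age2 female ue_sick, re theta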

And what about OLS? OLS simply treats within- and between-group variation as the same: it pools the data across waves.

Comparing models
- Compare coefficients between models: reasonably similar, with differences in the "partner" and "badhealth" coefficients
- The R-squareds are similar
- The within and between estimators maximise the within and between R² respectively

Test whether pooling the data is valid
- If the u_i do not vary between individuals, they can be treated as part of α and OLS is fine
- Breusch-Pagan Lagrange multiplier test: H0: variance of u_i = 0; H1: variance of u_i not equal to zero
- If H0 is not rejected, you can pool the data and use OLS
- It is a post-estimation test after random effects
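In STATA this is the xttest0 command, run straight after the random effects model (a sketch with the same illustrative names):

quietly xtreg ghq partner badh age age2 female ue_sick, re
xttest0   // Breusch-Pagan LM test of H0: Var(u_i) = 0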

Thinking about the within and between estimators...
- Both the between and FE models are written with the same coefficient vector β, but there is no reason why the two should be the same
- Between: β_j measures the difference in y associated with a one-unit difference in the average value of variable x_j between individuals – essentially a cross-sectional concept
- Within: β_j measures the difference associated with a one-unit increase in variable x_j at the individual level – essentially a longitudinal concept
- Random effects, as a weighted average of the two, constrains both βs to be the same
- Excellent article at http://www.stata.com/support/faqs/stat/xt.html and lots more at http://www.stata.com/support/faqs/stat/#models

Examples
Example 1: Consider estimating a wage equation which includes a set of regional dummies, with the S-E as the omitted group. Wages in (e.g.) the N-W are lower, so the estimated between coefficient on N-W will be negative. However, in the within regression we observe the effects of people moving to the N-W, and presumably they wouldn't move without a reasonable incentive. So the estimated within coefficient may even be positive – or at least, it's likely to be a lot less negative.
Example 2: Estimate the relationship between family income and children's educational outcomes. The between-group estimates measure how well the children of richer families do relative to the children of poorer families – we know this estimate is likely to be large and significant. The within-group estimates measure how children's outcomes change as their own family's income changes; this coefficient may well be much smaller.

FE and time-invariant variables
Reformulate the regression equation to distinguish between time-varying variables x (e.g. income, health) and time-invariant variables z (e.g. sex, race):
y_it = β'x_it + γ'z_i + u_i + ε_it
where u_i is the individual-specific fixed effect and ε_it the residual.
Inconveniently, fixed effects washes out the z's, so it does not produce estimates of γ. But there is a way! It requires the z variables to be uncorrelated with the u's.

Coefficients on time-invariant variables
1. Run FE in the normal way
2. Use the estimates to predict the residuals
3. Use the between estimator to regress the residuals on the time-invariant variables
4. Done!
Only use this if RE is rejected: otherwise, RE provides the best estimates of all coefficients. Going back to the previous example (the steps are sketched below):
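A minimal Stata sketch of these steps (resid_ue is a hypothetical name for the predicted combined residual u_i + e_it; other names as before):

xtreg ghq partner badh age age2 ue_sick, fe   // step 1: FE on the time-varying variables
predict resid_ue, ue                          // step 2: combined residual u_i + e_it
xtreg resid_ue female, be                     // step 3: between regression on the time-invariant variable(s)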

From previous slide… Our estimate of 1.60 for the coefficient on “female” is slightly higher than, but definitely in the same ball-park as, those produced by the other methods.

Improving specification
Recall our problem with the "partner" coefficient:
- OLS and between estimates show no significant relationship between partnership status and Likert scores
- FE and RE show a significant negative relationship
- FE estimates the coefficient on the deviation from the mean – likely to reflect moving in together (which makes you temporarily happy) and splitting up (which makes you temporarily sad)
- Investigate this by including variables to capture these events

Generate variables reflecting changes. Note: we will lose some observations.
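One way to create the indicators for gaining and losing a partner (a sketch; assumes partner is a 0/1 indicator and the data are xtset, so the lag operator L. works):

* First-wave observations have no lagged value, hence some observations are lost
gen byte get_pnr  = (partner == 1 & L.partner == 0) if !missing(L.partner)
gen byte lose_pnr = (partner == 0 & L.partner == 1) if !missing(L.partner)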

Fixed effects: the coefficient on having a partner is now slightly positive; getting a partner is insignificant; losing a partner is now large and positive.

Random effects: similar results. The reported rho is the proportion of the total residual variance attributable to the u's, i.e. rho = σ_u² / (σ_u² + σ_e²) – cf. random slopes models later.

Collating the coefficients:

Hausman test again
Have we cleaned up the specification sufficiently that the Hausman test will now fail to reject random effects? No – although the chi-squared statistic is smaller now (116.04) than previously (123.96).

Thinking about time
- Under FE, including "wave" or "year" as a continuous variable is not very useful, since it is treated as the deviation from the individual's mean
- We may not want to treat time as a linear trend (for example, if we are looking for a cut point related to a social policy)
- Also, wave is very much correlated with individuals' ages
- Can do FE or RE including the time periods as dummies – may be referred to as "two-way fixed effects"
- Generate each dummy variable separately, or:

local i = 1
while `i' <= 15 {
    gen byte W`i' = (wave == `i')
    local i = `i' + 1
}
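Two more compact alternatives do the same job (not from the slides; the regression shown is only illustrative):

* Create W1-W15 in a single command
tab wave, gen(W)
* Or use factor-variable notation directly in the estimation command (Stata 11 and later)
xtreg ghq partner badh ue_sick i.wave, fe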

Time variables insignificant here (as we would expect)

Extending panel data models to discrete dependent variables
Panel data extensions to logit and probit models. Recap from Week 1:
- These models cover discrete (categorical) outcomes, e.g. psychological morbidity, or whether one has a job (think of other examples). The outcome variable is always 0 or 1
- Estimate: Pr(y = 1) = F(X, β)
- OLS (the linear probability model) would set F(X, β) = X'β + ε. This is inappropriate because:
- Heteroscedasticity: the outcome variable is always 0 or 1, so ε only takes the value -X'β or 1 - X'β
- More seriously, one cannot constrain the estimated probabilities to lie between 0 and 1

Extension of logit and probit to panel data
We won't do the maths! But essentially, STATA maximises a likelihood function derived from the panel data specification; both random effects and fixed effects are available.
First, generate the categorical variable indicating psychological morbidity:
. gen byte PM = (hlghq2 > 2) if hlghq2 >= 0 & hlghq2 != .
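For reference, the random effects panel logit that xtlogit fits can be written (a standard formulation, not spelled out on the slide) as
$$\Pr(y_{it}=1 \mid x_{it}, u_i) = \Lambda(\beta' x_{it} + u_i) = \frac{\exp(\beta' x_{it} + u_i)}{1 + \exp(\beta' x_{it} + u_i)}, \qquad u_i \sim N(0, \sigma_u^2),$$
with the $u_i$ integrated out of the likelihood; the fixed effects version (clogit) instead conditions on $\sum_t y_{it}$, which removes the $u_i$ without having to estimate them.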

Fixed effects estimates – xtlogit (clogit)
- Losing a partner, being unemployed or sick, and being in bad health are associated with psychological morbidity
- But is losing a partner necessarily causing the psychological morbidity?
- The estimated age profile is negative throughout the (human) life span

Adding some more variables
We know that women sometimes suffer from post-natal depression, so try the total number of children, and children aged 0-2. The total number of children is insignificant, but children aged 0-2 is significant. Next step???

Yes, we should separate men and women:

sort female
by female: xtlogit PM partner get_pnr lose_pnr female ue_sick age age2 badh nch02, fe

The relationship between PM and young children is confined to women. Any other gender differences?

Back to random effects Estimates are VERY similar to FE

Testing between FE and RE

quietly xtlogit PM partner get_pnr lose_pnr female ue_sick age age2 badh nch02, fe
estimates store fixed
quietly xtlogit PM partner get_pnr lose_pnr female ue_sick age age2 badh nch02, re
hausman fixed .

Random effects is rejected again.

Random effects probit No fixed effects command available, as there does not exist a sufficient statistic allowing the fixed effects to be conditioned out of the likelihood.
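A sketch of the random effects probit command (using the same variable list as the earlier xtlogit examples):

xtprobit PM partner get_pnr lose_pnr female ue_sick age age2 badh nch02, re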

Why aren't the sets of coefficients more similar? Remember the conversion scale from Week 1 – logit coefficients are roughly 1.6–1.8 times the size of the corresponding probit coefficients.