1
Methods of Economic Investigation Lecture 2
Data Structures
2
Why are we doing this?
Thus far: most econometrics teaching has been theory based
The type of data can drive what you can do
The type of data affects the credibility of an analysis and the problems it faces
It can be hard to translate equations into applications, and even into reading papers
The rest of this course is based on applications: this lecture will help with both lectures and exercises
3
Choosing your data
Suppose you are interested in the causal effect of X on y: how would you test this?
If you could choose the way in which X is determined in your sample, what would you do?
  This may seem fanciful, but field experiments are becoming more common in economics
  Good thought experiment: if you could have any data in the world, is this question answerable? (If not, move on!)
  Good reason to choose to do a randomized controlled experiment
4
Where does data come from?
Surveys
  Response rate
  Stratification/clusters
  Reporting error/measurement error
Administrative records
  Come from lots of different places
  Often kept in real time (so this addresses “reporting” or “recollection” errors)
  May be missing, and that might not be random
Researchers (and you!)
  Often collected for a specific project, so be careful what it has
  More “unique”, with different types of data (e.g. content analysis)
5
Who Collects Data
Government
  Official statistics: unemployment, GDP, etc.
  Surveys: labor force, consumption, etc.
  Records: justice system, social programs
Service providers
  Often this may be administrative (e.g. hospital records)
  Sometimes internal surveys or evaluations, which can be useful if you can get them
Third parties
  Critical for places with limited capacity (e.g. the World Bank is a big source of this for developing countries)
  University or survey research programs
  Newspapers and media sources compile LOTS of things
6
Different Types of data
Cross-Sectional Data
Time Series Data
Panel Data
Repeated Cross-Section
7
Cross-Sectional Data
Cross-sectional data cover a cross section of the population; the information is collected from this cross section during a given period of time.
What does this look like?
  Rows are units of observation (e.g. individuals)
  Columns are usually variables
8
Cross-Section Data
Simple descriptive statistics across individuals: can get the sample mean and variance of the various X’s
Regressions: the standard formulas apply
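The formulas themselves did not survive in this transcript. As a reminder, and only as a sketch in generic textbook notation (the symbols below are assumptions, not necessarily the ones used on the original slide), the sample mean and variance of a single variable x, and the OLS coefficient vector written in terms of each individual's regressor vector, are:

```latex
% Sample mean and variance of one variable x across n individuals
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2

% OLS coefficients, with \mathbf{x}_i the k-vector of regressors for individual i
\hat{\beta} = \left(\sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i'\right)^{-1}
              \sum_{i=1}^{n} \mathbf{x}_i \, y_i
```

The summation form is used here because it does not depend on whether the data matrix is stored with individuals as rows or as columns.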
9
Algebra → Reality: Outcome Variables
Try to get a sense of the data, to translate the matrix algebra into reality.
What is the effect of education on income?
We have an outcome “y”, for example income
10
Algebra → Reality: RHS Variables
There may be several (labeled by k) different X’s.
So usually we think of this as meaning that:
  X is of dimensionality k x n
  We will estimate k coefficients
Our X variables look like:
11
Our Data Looks like:
ID  Income  Race  Sex  Education
1   y1      x11   x21  x31
2   y2      x12   x22  x32
3   y3      x13   x23  x33
4   y4      x14   x24  x34
5   y5      x15   x25  x35
Our Data Example: N=5, k=3
We can index our individuals by ID (useful later)
12
What does a regression tell us?
Remember, it’s minimizing the sum of squared errors, and will pick the 3 coefficients (one on race, one on sex, and one on education) to do that
We are interested in the coefficient on education to tell us the “effect of education on earnings”
We might still care about race and gender as “control” variables
13
Stata Output
14
Algebra → Reality: Stata Output
Using our “data”, we regress y on our X’s
To do this in Stata we would tell Stata:
  regress income race sex education
Output:
  Coefficients
  Standard errors
  R-squared
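To make the link between the algebra and the command concrete, here is a minimal Python sketch (standard library only) of the coefficients such a regression reports. It is not Stata's implementation: it simply solves the normal equations, and it adds a constant term, as Stata's `regress` does by default. The five data rows and the helper names `solve` and `ols` are hypothetical, made up for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(y, regressors):
    """Return OLS coefficients [constant, b1, ..., bk] from the normal equations."""
    # One row per individual: a leading 1 for the constant, then the X's.
    rows = [[1.0] + [float(col[i]) for col in regressors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    return solve(xtx, xty)


# Hypothetical data for N=5 individuals (illustration only, not from the lecture).
income = [21.0, 35.0, 28.0, 50.0, 30.0]
race = [0, 1, 0, 1, 1]
sex = [1, 0, 0, 1, 0]
education = [10, 16, 12, 18, 12]

coefs = ols(income, [race, sex, education])
for name, b in zip(["_cons", "race", "sex", "education"], coefs):
    print(f"{name:>10}: {b:8.3f}")
```

With only five observations and four estimated parameters there is a single degree of freedom left, so this toy example says nothing about standard errors or fit; it only shows where the coefficients come from.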
15
Limitations…
Lots of things vary over time
  Can’t control for these issues in cross-section data
Only source of variation is across individuals (or whatever the unit of observation is)
Identification: need observations with similar time characteristics (because we can’t control for these) but that differ on some variable of interest
16
Now to time series data
Pretty similar to cross-section data, except the data are indexed by time instead of by individual
Year  Income  Inflation  Growth  Unempl
2000  y1      x11        x21     x31
2001  y2      x12        x22     x32
2002  y3      x13        x23     x33
2003  y4      x14        x24     x34
2004  y5      x15        x25     x35
17
Why is time series different?
Correlation between different observations
  Violates the OLS assumption of uncorrelated errors (coefficient estimates can still be ok, but the usual standard errors are wrong, so the usual inference is not valid)
  More on this later…
Lots of things about individuals are time-invariant, so they don’t make sense in this context
Other things, often in time series data, are common across individuals (e.g. macroeconomic trends)
  Limits what we can do with these variables: we CAN’T “control” for time-invariant characteristics, so all variation comes from variation over time
18
Estimating with Time Series Data
Two critical issues:
Stationary
  Mean and variance not changing over time
  A stronger condition is sometimes required: the whole distribution (e.g. all moments) is the same over time/space
  May need to do something to make your data stationary (e.g. de-mean, detrend, difference, etc.)
Ergodic
  Given a sufficiently long set of realizations, can estimate the statistical properties
Worry about unit roots (more on this later)
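The three transformations named above (de-mean, detrend, difference) can be sketched in a few lines of Python. This is an illustration under simple assumptions: the trend is taken to be linear in time, the series is a made-up annual income series, and the function names are hypothetical. Which transformation is appropriate depends on why the series is non-stationary, which this sketch does not test.

```python
def demean(series):
    """Subtract the sample mean from every observation."""
    mean = sum(series) / len(series)
    return [v - mean for v in series]


def first_difference(series):
    """Return y_t - y_{t-1}; the result is one observation shorter."""
    return [b - a for a, b in zip(series, series[1:])]


def detrend(series):
    """Remove a fitted linear time trend (OLS of the series on t = 0, 1, ...)."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(series)]


# Hypothetical annual income series, 2000-2004 (illustration only).
income = [100.0, 104.0, 109.0, 113.0, 118.0]

print(demean(income))
print(first_difference(income))
print(detrend(income))
```

Differencing costs one observation, while de-meaning and detrending keep the full length of the series.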
19
Panel Data
Repeated observations on individuals
Common example: Labor Force Surveys
  Take information about individuals
  Usually contain variables that are time-invariant for any individual (race, sex, education level)
  Usually contain variables that are time-varying for any given individual (employed last week)
  Can contain or link to variables that are time-varying but the same across groups of individuals (local unemployment rate)
20
Example of Panel Data
Multi-dimensional, so indexed by time & individual
ID  Year  Income   Employment  Sex       Education
1   2000  Y1,2000  X11,2000    X21,2000  X31,2000
1   2001  Y1,2001  X11,2001    X21,2001  X31,2001
2   2000  Y2,2000  X12,2000    X22,2000  X32,2000
2   2001  Y2,2001  X12,2001    X22,2001  X32,2001
2   2002  Y2,2002  X12,2002    X22,2002  X32,2002
21
Panel Data Regressions
Regressions need to be indexed by all dimensions (our example is time and individual, but it could be time, state, and individual)
May allow an intercept shift (e.g. add a dummy for each year)
May allow a slope shift (e.g. allow different coefficients for men and women)
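As a sketch of what these two options look like when written out, with a single regressor x and notation chosen here for illustration (the indicator function, the base year s_0 and the variable name female are assumptions, not taken from the slides):

```latex
% Intercept shift: one dummy per year, with one base year s_0 omitted
y_{it} = \alpha + \beta x_{it}
       + \sum_{s \neq s_0} \delta_s \, \mathbf{1}[t = s] + \varepsilon_{it}

% Slope shift: the coefficient on x is \beta for men and \beta + \gamma for women
y_{it} = \alpha + \beta x_{it}
       + \gamma \, (x_{it} \times \mathit{female}_i) + \varepsilon_{it}
```

In practice the two are often combined, and a slope shift is usually accompanied by a separate intercept for the group as well.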
22
What’s so great about Panel Data?
We can control for individual-specific factors (e.g. error component models)
  ECM may solve some of our omitted variable bias issues (individual controls)
  Can use both “within” variation (for an individual over time) and “between” variation (across individuals in a given time)
Limitations
  Can be rare to have long panels
  Tend to span very short periods of time
  May make it difficult to study trends: can only see “breaks” at big changes
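One common way to use only the “within” variation is to subtract each individual's own time average from their observations before running the regression. The Python sketch below illustrates that idea on a made-up unbalanced panel (two observations for individual 1, three for individual 2). The data, the variable names and the single-regressor setup are hypothetical, and this is one estimator among several rather than the method the course prescribes.

```python
from collections import defaultdict


def within_transform(ids, values):
    """Subtract each individual's own time average from their observations."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i, v in zip(ids, values):
        totals[i] += v
        counts[i] += 1
    return [v - totals[i] / counts[i] for i, v in zip(ids, values)]


def slope_no_constant(x, y):
    """OLS slope of y on x without a constant (both are already demeaned)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Hypothetical panel: income = individual effect + 2 * x, with individual
# effects of 10 and 50 (illustration only).
ids = [1, 1, 2, 2, 2]
years = [2000, 2001, 2000, 2001, 2002]
x = [1.0, 2.0, 1.0, 3.0, 5.0]
income = [12.0, 14.0, 52.0, 56.0, 60.0]

x_within = within_transform(ids, x)
y_within = within_transform(ids, income)
beta_within = slope_no_constant(x_within, y_within)
print(beta_within)
```

Because the individual effect is constant over time, it drops out when the individual mean is subtracted, which is why the within slope is not contaminated by it in this example.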
23
Repeated Cross-Section Data
More common: annual or frequent surveys, not always of the same people
Get repeated cross-sections of different cohorts of individuals
Can do several things:
  Construct a panel at a more aggregate level
  Use time-series aspects to compare cohorts
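Constructing a panel at a more aggregate level usually means averaging within cells defined by a group and a survey year. Here is a minimal Python sketch of that step, assuming sex is the grouping variable; the survey records and the function name `cohort_panel` are hypothetical.

```python
from collections import defaultdict


def cohort_panel(records):
    """Average income within each (group, year) cell of a repeated cross-section."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["sex"], rec["year"])].append(rec["income"])
    return {cell: sum(v) / len(v) for cell, v in sorted(cells.items())}


# Hypothetical survey: different people are interviewed in each year.
survey = [
    {"id": 1, "year": 2000, "sex": "M", "income": 30.0},
    {"id": 2, "year": 2000, "sex": "F", "income": 28.0},
    {"id": 3, "year": 2000, "sex": "M", "income": 34.0},
    {"id": 4, "year": 2001, "sex": "F", "income": 31.0},
    {"id": 5, "year": 2001, "sex": "M", "income": 36.0},
    {"id": 6, "year": 2001, "sex": "F", "income": 29.0},
]

panel = cohort_panel(survey)
for (group, year), mean_income in panel.items():
    print(group, year, mean_income)
```

Each group now appears once per year, so the result can be treated like a panel of groups even though no individual is observed twice.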
24
Example of Repeated Cross-Section Data
Multi-dimensional, so indexed by time & individual
ID  Year  Income   Employment  Sex       Education
1   2000  Y1,2000  X11,2000    X21,2000  X31,2000
2   2001  Y2,2001  X12,2001    X22,2001  X32,2001
3   2000  Y3,2000  X13,2000    X23,2000  X33,2000
4   2001  Y4,2001  X14,2001    X24,2001  X34,2001
5   2002  Y5,2002  X15,2002    X25,2002  X35,2002
25
Repeated Cross-Section Regressions
Index by time and whatever “group” you want to use. For example, if group 1 is men and group 2 is women, you estimate the regression with observations indexed by group and time
Use similarities between groups, but can’t control for individual-specific issues
  Cohort-specific changes, e.g. selection issues
Can allow ‘fixed effects’ for time or group, but these are not as believable as a way to control for unobservables
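The equation from the original slide is not preserved in this transcript. One possible form of a group-by-time regression with fixed effects for both dimensions, offered only as a sketch in assumed notation, is:

```latex
% g indexes the group (1 = men, 2 = women), t indexes the survey year;
% y_{gt} and x_{gt} are group-year averages
y_{gt} = \alpha_g + \delta_t + \beta x_{gt} + \varepsilon_{gt},
\qquad g \in \{1, 2\}
```

Here \alpha_g and \delta_t are the group and time fixed effects; because the people inside each group change from year to year, \alpha_g absorbs only what is stable about the group, not about any individual.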
26
Next Steps: Using data, can we:
  Describe the data to understand what we’ve got
  Develop some “questions” to answer
  Test our hypotheses
Application-based class: will use Stata examples