INFO 7470/ECON 7400 Statistical Tools: Complex Data Models John M. Abowd and Lars Vilhuber April 22, 2013.

Slides:

Advertisements

Similar presentations

Copula Representation of Joint Risk Driver Distribution

Advertisements

Handling attrition and non- response in longitudinal data Harvey Goldstein University of Bristol.

Estimation of Means and Proportions

11-1 Empirical Models Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis.

General Linear Model With correlated error terms  =  2 V ≠  2 I.

1 Chapter 4 Experiments with Blocking Factors The Randomized Complete Block Design Nuisance factor: a design factor that probably has an effect.

Chapter 4 Randomized Blocks, Latin Squares, and Related Designs

CmpE 104 SOFTWARE STATISTICAL TOOLS & METHODS MEASURING & ESTIMATING SOFTWARE SIZE AND RESOURCE & SCHEDULE ESTIMATING.

6-1 Introduction To Empirical Models 6-1 Introduction To Empirical Models.

11 Simple Linear Regression and Correlation CHAPTER OUTLINE

INFO 7470/ILRLE 7400 Geographic Information Systems John M. Abowd and Lars Vilhuber March 29, 2011.

Ch11 Curve Fitting Dr. Deshi Ye

The General Linear Model. The Simple Linear Model Linear Regression.

The Multiple Regression Model Prepared by Vera Tabakova, East Carolina University.

The Simple Linear Regression Model: Specification and Estimation

Resampling techniques Why resampling? Jacknife Cross-validation Bootstrap Examples of application of bootstrap.

Chapter 4 Multiple Regression.

© John M. Abowd 2007, all rights reserved Statistical Tools for Data Integration John M. Abowd April 2007.

© John M. Abowd 2005, all rights reserved Statistical Tools for Data Integration John M. Abowd April 2005.

© John M. Abowd 2005, all rights reserved Recent Advances In Confidentiality Protection John M. Abowd April 2005.

End of Chapter 8 Neil Weisenfeld March 28, 2005.

© John M. Abowd 2005, all rights reserved Modeling Integrated Data John M. Abowd April 2005.

Chapter 11 Multiple Regression.

Linear and generalised linear models

Linear and generalised linear models

Basics of regression analysis

Linear and generalised linear models Purpose of linear models Least-squares solution for linear models Analysis of diagnostics Exponential family and generalised.

Probability theory 2008 Outline of lecture 5 The multivariate normal distribution  Characterizing properties of the univariate normal distribution  Different.

Lecture II-2: Probability Review

Separate multivariate observations

1 1.1 © 2012 Pearson Education, Inc. Linear Equations in Linear Algebra SYSTEMS OF LINEAR EQUATIONS.

ANCOVA Lecture 9 Andrew Ainsworth. What is ANCOVA?

INFO 7470/ILRLE 7400 Statistical Tools: Basic Integrated Data Models John M. Abowd and Lars Vilhuber April 12, 2011.

Concepts of Database Management, Fifth Edition

CREST-ENSAE Mini-course Microeconometrics of Modeling Labor Markets Using Linked Employer-Employee Data John M. Abowd portions of today’s lecture are.

EM and expected complete log-likelihood Mixture of Experts

CJT 765: Structural Equation Modeling Class 7: fitting a model, fit indices, comparingmodels, statistical power.

© John M. Abowd 2007, all rights reserved Analyzing Frames and Samples with Missing Data John M. Abowd March 2007.

Multinomial Distribution

Random Regressors and Moment Based Estimation Prepared by Vera Tabakova, East Carolina University.

Yaomin Jin Design of Experiments Morris Method.

SUPA Advanced Data Analysis Course, Jan 6th – 7th 2009 Advanced Data Analysis for the Physical Sciences Dr Martin Hendry Dept of Physics and Astronomy.

Danila Filipponi Simonetta Cozzi ISTAT, Italy Outlier Identification Procedures for Contingency Tables in Longitudinal Data Roma,8-11 July 2008.

Eurostat Statistical matching when samples are drawn according to complex survey designs Training Course «Statistical Matching» Rome, 6-8 November 2013.

G Lecture 7 Confirmatory Factor Analysis

Multivariate Statistics Confirmatory Factor Analysis I W. M. van der Veld University of Amsterdam.

CHAPTER 5 SIGNAL SPACE ANALYSIS

Generalised method of moments approach to testing the CAPM Nimesh Mistry Filipp Levin.

© John M. Abowd 2007, all rights reserved General Methods for Missing Data John M. Abowd March 2007.

CREST-ENSAE Mini-course Microeconometrics of Modeling Labor Markets Using Linked Employer-Employee Data John M. Abowd portions of today’s lecture are the.

Lecture 2: Statistical learning primer for biologists

Simulation Study for Longitudinal Data with Nonignorable Missing Data Rong Liu, PhD Candidate Dr. Ramakrishnan, Advisor Department of Biostatistics Virginia.

Tutorial I: Missing Value Analysis

Sampling Design and Analysis MTH 494 Lecture-21 Ossam Chohan Assistant Professor CIIT Abbottabad.

1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.

INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011.

Matrix Factorization & Singular Value Decomposition Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.

Computacion Inteligente Least-Square Methods for System Identification.

Estimation Econometría. ADE.. Estimation We assume we have a sample of size T of: – The dependent variable (y) – The explanatory variables (x 1,x 2, x.

INFO 7470 Statistical Tools: Edit and Imputation Examples of Multiple Imputation John M. Abowd and Lars Vilhuber April 18, 2016.

Part 3: Estimation of Parameters. Estimation of Parameters Most of the time, we have random samples but not the densities given. If the parametric form.

INFO 7470 Statistical Tools: Hierarchical Models and Network Analysis John M. Abowd and Lars Vilhuber May 2, 2016.

Chapter 7. Classification and Prediction

CJT 765: Structural Equation Modeling

6-1 Introduction To Empirical Models

Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.

OVERVIEW OF LINEAR MODELS

OVERVIEW OF LINEAR MODELS

Presentation transcript:

INFO 7470/ECON 7400 Statistical Tools: Complex Data Models John M. Abowd and Lars Vilhuber April 22, 2013

Outline What Are “Linked,” “Integrated” and Other Complex Data Structures? The Relational Database Model Statistical Underpinnings of the Relational Database Model Graphical Representations of Integrated Data Estimating Models from Linked Data 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 2

What Are Linked, Integrated or Other Complex Data Structures? Already exposed to some complex data structures in the record linkage and GIS lectures Observations used in the analysis are sampled from different universes of entities Observations from the different entities relate to each other according to a system of identifiers Integration of the observations requires specifying a universe for the result and a rule for associating data from entities belonging to other universes with the observations in the result 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 3

Examples of Complex Data Structures Hierarchies – Population census: block-household-resident – Economic census: enterprise-establishment Relations – Person-job-employer – Customer-item-supplier – Distance-direction (GIS) 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 4

The Relational Database Model Informal characterization (formal model, 2011 link, 2007 link )formal model 2011 link 2007 link All data are represented as a collection of linked tables Each table has a unique key (primary key) that is defined for every entity in the table Each table may have data items defined for each entity in the table Each table may have items that refer to data from another table (foreign key) Views are created by specifying a reference table and gathering the values of data items based on the keys in the reference table and operations applied to the items retrieved by the foreign keys 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 5

Example of the Relational Database Model Table_Employer – Primary_key: Employer_ID – Foreign_key: NAICS – Items: Sales, Employees Table_Individual – Primary_key: Individual_ID – Foreign_key: Census_block – Items: Age, Education Table_Job – Primary_key: Job_ID – Foreign_key: Employer_ID – Foreign_key: Individual_ID – Items: Earnings Table_Industry – Primary_key: NAICS – Items: average_earnings 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 6

Example: Job View Select records from Table_Job (universe or sample) Look-up Sales, Employees, NAICS in Table_Employer using Employer_ID; compute sales_per_employee Look-up NAICS in Table_Industry; compute log_industry_average_earnings Look-up Age and Education in Table_Individual using Individual_ID; compute potential_experience Compute log_earnings Create Table_Output – Primary_key: Job_ID – Items: log_earnings, education, potential_experience, sales_per_employee, log_industry_average_earnings 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 7

Graphical Representation of Linked Data Graphs: – Nodes: list of entities – Edges: ordered (directed) or unordered (non- directed) pairs indicating a link between two nodes Example – Nodes: {Employer_IDs, Individual_IDs} – Edges: Ordered pairs (Individual_ID ‘works for’ Employer_ID) 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 8

Statistical Underpinnings of the Relational Database Model Tables are frames If every table is complete relative to its universe, then samples can be constructed by sampling records from the relevant table and linking data from the other tables If some tables are incomplete, then imputation of missing data is equivalent to imputing a link and estimating its items 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 9

Example: Industry View Select records from Table_Industry Look-up all Employer_IDs in NAICS in Table_Employer; compute variance_earnings Output Table_Output – Primary_key: NAICS – Item average_earnings, variance_earnings 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 10

Estimating Models from Linked Files Linked files are usually analyzed as if the linkage were without error Most of this class focuses on such methods There are good reasons to believe that this assumption should be examined more closely 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 11

Lahiri and Larsen Consider regression analysis when the data are imperfectly linked See JASA article March 2005 for full discussionJASA article March /22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 12

Setup of Lahiri and Larsen 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 13

Estimators 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 14

Problem: Estimating B 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 15

Does It Matter? Yes The bias from the naïve estimator is very large as the average q ii goes away from 1 The SW estimator does better The U estimator does very well, at least in simulations 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 16

STATISTICAL TOOLS: GRAPH-BASED DATA MODELS 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 17

Outline Basic graph theory Integrated labor market data Statistical modeling Graph theoretic identification Estimation by fixed-effects methods Estimation by mixed-effects methods 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 18

BASIC GRAPH THEORY 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 19

What Is A Graph? 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 20

Graphs Fully connected graph network 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 21

The Bipartite Labor Market Graph 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 22

Labor Market Graph: Realized Mobility Network The realized mobility network connects employers to employees in a dynamic graph This graph can be constructed from the sequence of “star” clusters that represent employment at a point in time Only one employer/employee is modeled for any time period 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 23

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 24

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 25

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 26

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 27

Adjacency Matrices 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 28

INTEGRATED LABOR MARKET DATA 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 29

Longitudinal Integrated Data Matched longitudinal data on employers and employees allows us to identify the separate effects of unobserved personal and firm heterogeneity Large samples are required with simple random sampling because the identification is based on within sample mobility of workers between firms For multi-stage sample surveys, geographic clustering permits identification of employer effects because the within sample employment mobility is generally within the same geographic area 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 30

Year 2 Longitudinal Sampling Plan for Individuals Begin with an administrative file of individuals Select the same individuals in successive years Sample individuals randomly Year 3Year 1 4/22/2013© John M. Abowd and Lars Vilhuber 2013, all rights reserved31

Structure of Resulting Data Continuously employed at the same employer Entry and a change of employers Exit and a change of employer Continuously employed with multiple employers ID errors Other sequences Example shows 7 person IDs and 7 different employers 4/22/2013© John M. Abowd and Lars Vilhuber 2013, all rights reserved32

Other Sampling Plans Sample firms with probability proportional to employment; sample employees randomly from selected firms As above plus all employers ever employing selected individuals As in first bullet, plus subsequent employers of selected individuals 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 33

Building Integrated Labor Market Data Examples from the LEHD infrastructure files Analysis can be done using workers, jobs or employers as the basic observation unit Want to model heterogeneity due to the workers and employers for job level analyses Want to model heterogeneity due to the jobs and workers for employer level analyses Want to model heterogeneity due to the jobs and employers for individual analyses 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 34

STATISTICAL MODELING 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 35

The dependent variable is compensation The function J(i,t) indicates the employer of i at date t The first component is the person effect The second component is the firm effect The third component is the measured characteristics effect The fourth component is the statistical residual, orthogonal to all other effects in the model NOTE: This is NOT a “fixed-effects” model. It can be estimated by fixed, random, mixed, or Bayesian methods without changing any of the basic modeling assumtions. Basic Statistical Model 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 36

Matrix Notation: Basic Model All vectors/matrices have row dimensionality equal to the total number of observations. Data are sorted by person-ID and ordered chronologically for each person. D is the design matrix for the person effect: columns equal to the number of unique person IDs. F is the design matrix for the firm effect: columns equal to the number of unique firm IDs times the number of effects per firm. 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 37

True Industry Effect Model The function K(j) indicates the industry of firm j The first component is the person effect The second component is the firm effect net of the true industry effect The third component is the true industry effect, an aggregation of firm effects since industry is a property of the employer The fourth component is the effect of personal characteristics The fifth component is the statistical residual See Abowd et al. (2012) /22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 38

Matrix Notation: True Industry Effect Model The matrix A is the classification matrix taking firms into industries The matrix FA is the design matrix for the true industry effect The true industry effect  can be expressed as 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 39

Raw Industry Effect Model The first component is the raw industry effect The second component is the measured personal characteristics effect The third component is the statistical residual The raw industry effect is an aggregation of the appropriately weighted average person and average firm effects within the industry, since both have been excluded from the model The true industry effect is only an aggregation of the appropriately weighted average firm effect within the industry, as shown above 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 40

Industry Effects Adjusted for Person Effects Model The first component is the industry effect adjusted for person effects. The second component is individual effect (with firm effects omitted) The third component is the measured personal characteristics effect. The fourth component is the statistical residual. The industry effects adjusted for person effects are also biased. 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 41

Relation: True and Raw Industry Effects The vector  ** of industry effects can be expressed as the true industry effect  plus a bias that depends upon both the person and firm effects The matrix M is the residual matrix (column null space) after projection onto the column space of the matrix in the subscript. For example, 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 42

Relation: Industry, Person and Firm Effects The vector  ** of raw industry effects can be expressed as a matrix weighted average of the person effects  and the firm effects  The matrix weights are related to the personal characteristics X, and the design matrices for the person and firm effects (see Abowd, Kramarz and Margolis, 1999)1999 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 43

IDENTIFICATION 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 44

Estimation by Fixed-effect Methods The normal equations for least squares estimation of fixed person, firm and characteristic effects are very high dimension Estimation of the full model by either fixed- effect or mixed-effect methods requires special algorithms to deal with the high dimensionality of the problem 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 45

Least Squares Normal Equations The full least squares solution to the basic estimation problem solves these normal equations for all identified effects 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 46

Identification of Effects Use of the decomposition formula for the industry (or firm-size) effect requires a solution for the identified person, firm and characteristic effects. The usual technique of eliminating singular row/column combinations from the normal equations won’t work if the least squares problem is solved directly 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 47

Identification by Finding Connected Sub-graphs Firm 1 is in group g = 1. Repeat until no more persons or firms are added: – Add all persons employed by a firm in group 1 to group 1 – Add all firms that have employed a person in group 1 to group 1 For g= 2,..., repeat until no firms remain: – The first firm not assigned to a group is in group g. – Repeat until no more firms or persons are added to group g: Add all persons employed by a firm in group g to group g. Add all firms that have employed a person in group g to group g. Each group g is a connected sub-graph Identification of  : drop one firm from each group g Identification of  : impose one linear restriction 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 48

Connected Sub-graphs of the Labor Market 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 49

Normal Equations after Group Blocking The normal equations have a sub-matrix with block diagonal components This matrix is of full rank and the solution for (  ) is unique 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 50

Necessity of Identification Conditions For necessity, we want to show that exactly N+J-G person and firm effects are identified (estimable), including the grand mean  y Because X and y are expressed in deviations from the mean, all N effects are included in the equation but one is redundant because both sides of the equation have a zero mean by construction So the grand mean plus the person effects constitute N effects There are at most N + J-1 person and firm effects including the grand mean The grouping conditions imply that at most G group means are identified (or, the grand mean plus G-1 group deviations) Within each group g, at most N g and J g -1 person and firm effects are identified. Thus the maximum number of identifiable person and firm effects is: 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 51

Sufficiency of Identification Conditions For sufficiency, use an induction proof Consider an economy with J firms and N workers Denote by E[y it ] the projection of worker i's wage at date t on the column space generated by the person and firm identifiers. For simplicity, suppress the effects of observable variables X The firms are connected into G groups, then all effects  j, in group g are separately identified up to a constraint of the form: 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 52

Sufficiency of Identification Conditions II Suppose G=1 and J=2 Then, by the grouping condition, at least one person, say 1, is employed by both firms and we have So, exactly N+2-1 effects are identified 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 53

Sufficiency of Identification Conditions III Next, suppose there is a connected group g with J g firms and exactly J g -1 firm effects identified Consider the addition of one more connected firm to such a group Because the new firm is connected to the existing J g firms in the group there exists at least one individual, say worker 1 who works for a firm in the identified group, say firm J g, at date 1 and for the supplementary firm at date 2. Then, we have two relations Exactly J g effects are identified with the new information 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 54

ESTIMATION BY FIXED-EFFECTS METHODS 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 55

Estimation by Direct Solution of Least Squares Equations Once the grouping algorithm has identified all estimable effects, solve for the least squares estimates by direct minimization of the sum of squared residuals This method produces a unique solution for all estimable effects 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 56

Least Squares Conjugate Gradient Algorithm The matrix  is chosen to precondition the normal equations The data matrices and parameter vectors are redefined as shown Can also be solved directly with preconditioned CG on sparse matrices 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 57

LSCG (II) The goal is to find  to solve the least squares problem shown The gradient vector g figures prominently in the equations The initial conditions for the algorithm are shown – e is the vector of residuals – d is the direction of the search 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 58

LSCG (III) The loop shown has the following features: – The search direction d is the current gradient plus a fraction of the old direction – The parameter vector  is updated by moving a positive amount in the current direction – The gradient, g, and residuals, e, are updated – The original parameters are recovered from the preconditioning matrix. 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 59

LSCG (IV) Verify that the residuals are uncorrelated with the three components of the model – Yes: the LS estimates are calculated as shown – No: certain constants in the loop are updated and the next parameter vector is calculated 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 60

ESTIMATION BY MIXED-EFFECTS METHODS 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 61

Mixed Effects Assumptions The assumptions above specify the complete error structure with the firm and person effects random For maximum likelihood or restricted maximum likelihood estimation assume joint normality Software: ASREML, cgmixed. See Abowd and Stinson (2012)ASREMLcgmixed2012 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 62

Estimation by Mixed Effects Methods Solve the mixed effects equations Techniques: Bayesian EM, Restricted ML 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 63

Bayesian ECM The algorithm is illustrated for the special case of uncorrelated residuals and uncorrelated random effects The initial conditions are taken directly from the LS solution to the fixed effects problem 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 64

Bayesian ECM (II) At each loop of the algorithm the E step is used to compute the parameters The conditional M step is used to update the variances. 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 65

Relation Between Fixed and Mixed Effects Models Under the conditions shown above, the ME and estimators of all parameters approaches the FE estimator 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 66

Correlated Random Effects vs. Orthogonal Design Orthogonal design means that characteristics, person design, firm design are orthogonal Uncorrelated random effects means that  is diagonal 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 67

Software SAS: proc mixed, proc hpmixed (better in 9.2, best in 9.3) ASREML and Genstat: REML ASREML aML (no longer maintained, but open-source) aML STATA: xtreg, gllamm, xtmixed Matlab: Statistics toolbox R: the lme() function, also call to ASREML S+: linear mixed models SPSS: Linear Mixed Models Grouping (connected sub-graphs) Custom software on the VRDC 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 68

Estimation of Industry Effects From Abowd et al. (2012)2012 Fixed-effects estimation of person and firm effects Aggregation by the AKM (1999) formulae noted above Parallel procedure on similar data from the U.S. (LEHD) and France (DADS) 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 69

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 70

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 71

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 74

Estimation by Mixed-effects See Abowd and Stinson (2012) for setup2012 Notation is the same as in this lecture Random person and employer effects fit with mixed-effects design and correct fixed-effects estimator of regression coefficients Estimation using SIPP linked to SSA employer- level earnings histories (DER) 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 75

Correlated Person and Firm Effects In the Abowd-Stinson model there is a separate equation for SIPP and administrative (SSA) earnings outcomes, for each job Each equation has its own regression, person, and employer effects 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 76

Estimation by Bayesian Methods Person, employer and match effects specified as latent random classes Distribution of person, firm and match effects simultaneous with model of mobility Complete-data likelihood function assumes that the worker, employer and match classes are known Model fit to a (small) random sample of LEHD in IL, IN, and WI Markov Chain Monte Carlo used for estimation See Abowd and Schmutte (2013); preliminary estimates 4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 78

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 79

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 80

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 81

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 82

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 83

4/22/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved 84