INFO 7470 Statistical Tools: Hierarchical Models and Network Analysis John M. Abowd and Lars Vilhuber May 2, 2016.

Outline
What Are “Linked,” “Integrated” and Other Complex Data Structures?
The Relational Database Model
Statistical Underpinnings of the Relational Database Model
Graphical Representations of Integrated Data
Estimating Models from Linked Data
5/2/2016 © John M. Abowd and Lars Vilhuber 2016, all rights reserved

What Are Linked, Integrated or Other Complex Data Structures?
Already exposed to some complex data structures in the record linkage and GIS lectures
Observations used in the analysis are sampled from different universes of entities
Observations from the different entities relate to each other according to a system of identifiers
Integration of the observations requires specifying a universe for the result and a rule for associating data from entities belonging to other universes with the observations in the result

Examples of Complex Data Structures
Hierarchies
– Population census: block-household-resident
– Economic census: enterprise-establishment
Relations
– Person-job-employer
– Customer-item-supplier
– Distance-direction (GIS)

The Relational Database Model
All data are represented as a collection of linked tables
Each table has a unique key (primary key) that is defined for every entity in the table
Each table may have data items defined for each entity in the table
Each table may have items that refer to data from another table (foreign key)
Views are created by specifying a reference table and gathering the values of data items based on the keys in the reference table and operations applied to the items retrieved by the foreign keys

Example of the Relational Database Model
Table_Employer
– Primary_key: Employer_ID
– Foreign_key: NAICS
– Foreign_key: Census_block (workplace)
– Items: Sales, Employees
Table_Individual
– Primary_key: Individual_ID
– Foreign_key: Census_block (residence)
– Items: Age, Education
Table_Job
– Primary_key: Job_ID
– Foreign_key: Employer_ID
– Foreign_key: Individual_ID
– Items: Earnings, Start_date, End_date
Table_Industry
– Primary_key: NAICS
– Items: average_earnings
Table_Geography
– Primary_key: Census_block
– Items: Population, Workforce

Example: Job View
Select records from Table_Job (universe or sample)
Look up Sales, Employees, NAICS in Table_Employer using Employer_ID; compute sales_per_employee
Look up NAICS in Table_Industry; compute log_industry_average_earnings
Look up Age and Education in Table_Individual using Individual_ID; compute potential_experience
Compute log_earnings
Create Table_Output
– Primary_key: Job_ID
– Items: log_earnings, education, potential_experience, sales_per_employee, log_industry_average_earnings
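The job view can be sketched with Python's built-in sqlite3 module. This is only an illustration: the toy rows are invented, the column types are guesses, Education is assumed to be measured in years, and potential_experience is assumed to follow the common Age − Education − 6 convention, which the slides do not specify. Table_Geography is omitted because the job view does not use it.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite builds do not always ship a log function, so register Python's.
conn.create_function("py_log", 1, math.log)
conn.executescript("""
    CREATE TABLE Table_Industry (NAICS TEXT PRIMARY KEY, average_earnings REAL);
    CREATE TABLE Table_Employer (
        Employer_ID INTEGER PRIMARY KEY,
        NAICS TEXT REFERENCES Table_Industry(NAICS),
        Census_block TEXT, Sales REAL, Employees INTEGER);
    CREATE TABLE Table_Individual (
        Individual_ID INTEGER PRIMARY KEY,
        Census_block TEXT, Age INTEGER, Education INTEGER);
    CREATE TABLE Table_Job (
        Job_ID INTEGER PRIMARY KEY,
        Employer_ID INTEGER REFERENCES Table_Employer(Employer_ID),
        Individual_ID INTEGER REFERENCES Table_Individual(Individual_ID),
        Earnings REAL, Start_date TEXT, End_date TEXT);

    -- Toy rows, invented for illustration only.
    INSERT INTO Table_Industry VALUES ('44', 40000), ('62', 50000);
    INSERT INTO Table_Employer VALUES
        (1, '44', 'B1', 1000000, 10), (2, '62', 'B2', 600000, 4);
    INSERT INTO Table_Individual VALUES (100, 'B1', 30, 12), (101, 'B2', 40, 16);
    INSERT INTO Table_Job VALUES
        (1000, 1, 100, 30000, '2015-01-01', '2015-12-31'),
        (1001, 2, 101, 60000, '2015-01-01', '2015-12-31');
""")

# Table_Job is the reference table; every other item is reached by a foreign key.
JOB_VIEW = """
    SELECT j.Job_ID,
           py_log(j.Earnings)         AS log_earnings,
           i.Education                AS education,
           i.Age - i.Education - 6    AS potential_experience,  -- assumed convention
           e.Sales / e.Employees      AS sales_per_employee,
           py_log(n.average_earnings) AS log_industry_average_earnings
    FROM Table_Job AS j
    JOIN Table_Employer   AS e ON e.Employer_ID   = j.Employer_ID
    JOIN Table_Individual AS i ON i.Individual_ID = j.Individual_ID
    JOIN Table_Industry   AS n ON n.NAICS         = e.NAICS
    ORDER BY j.Job_ID
"""
rows = conn.execute(JOB_VIEW).fetchall()
for row in rows:
    print(row)
```

Each output row is keyed by Job_ID, matching the Table_Output layout described on the slide.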

Graphical Representation of Linked Data
Graphs:
– Nodes: list of entities
– Edges: ordered (directed) or unordered (non-directed) pairs indicating a link between two nodes
Example
– Nodes: {Employer_IDs, Individual_IDs}
– Edges: Ordered pairs (Individual_ID “works for” Employer_ID)
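A minimal sketch of this representation in plain Python, with invented identifiers: nodes are the union of the two ID lists, and each directed edge is an ordered pair (Individual_ID, Employer_ID) read as “works for.”

```python
from collections import defaultdict

# Hypothetical identifiers, for illustration only.
individual_ids = ["i1", "i2", "i3"]
employer_ids = ["e1", "e2"]
nodes = set(individual_ids) | set(employer_ids)

# Ordered pairs: (Individual_ID, Employer_ID) means "individual works for employer".
edges = [("i1", "e1"), ("i2", "e1"), ("i3", "e2")]


def adjacency_lists(edges):
    """Return (employers_of, workers_of), the out- and in-neighbor lists."""
    employers_of = defaultdict(list)  # follow the edge direction
    workers_of = defaultdict(list)    # follow the edge backwards
    for individual, employer in edges:
        employers_of[individual].append(employer)
        workers_of[employer].append(individual)
    return dict(employers_of), dict(workers_of)


employers_of, workers_of = adjacency_lists(edges)
print(workers_of)  # each employer with the individuals linked to it
```

Because every edge joins an individual to an employer and never two nodes of the same type, this graph is bipartite.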

Statistical Underpinnings of the Relational Database Model
Tables are frames
If every table is complete relative to its universe, then samples can be constructed by sampling records from the relevant table and linking data from the other tables
If some tables are incomplete, then imputation of missing data is equivalent to imputing a link and estimating its items
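The idea of treating a table as a sampling frame can be sketched as follows. The dictionaries stand in for Table_Job and Table_Employer with invented rows, and the function name is hypothetical. Records whose foreign key finds no match are only flagged here; imputing the link and its items is a separate modeling step that this sketch does not attempt.

```python
import random

# Toy frames keyed by primary key; values are invented for illustration.
employers = {1: {"NAICS": "44"}, 2: {"NAICS": "62"}}
jobs = {
    1000: {"Employer_ID": 1, "Earnings": 30000.0},
    1001: {"Employer_ID": 2, "Earnings": 60000.0},
    1002: {"Employer_ID": 3, "Earnings": 45000.0},  # employer missing from its frame
}


def sample_and_link(jobs, employers, n, seed=0):
    """Draw a simple random sample of n jobs, then follow the employer foreign key."""
    rng = random.Random(seed)
    sampled_ids = sorted(rng.sample(sorted(jobs), n))
    linked, needs_imputation = [], []
    for job_id in sampled_ids:
        job = jobs[job_id]
        employer = employers.get(job["Employer_ID"])
        if employer is None:
            # Incomplete frame: the link and its items would have to be imputed.
            needs_imputation.append(job_id)
        else:
            linked.append({"Job_ID": job_id, **job, **employer})
    return linked, needs_imputation


linked, needs_imputation = sample_and_link(jobs, employers, n=len(jobs))
print(len(linked), needs_imputation)
```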

Example: Industry View
Select records from Table_Industry
Look up all Employer_IDs in NAICS in Table_Employer; compute percentiles of earnings 01 to 99
Output Table_Output
– Primary_key: NAICS
– Items: percentile_earnings_01, …, percentile_earnings_99
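One way to sketch the industry view in Python uses `statistics.quantiles`, which returns the 99 cut points when `n=100`. The slide does not say where the earnings come from; this sketch assumes they are job-level earnings reached through Employer_ID, and all rows are invented.

```python
from collections import defaultdict
from statistics import quantiles

# Toy stand-ins: Employer_ID -> NAICS, and (Employer_ID, Earnings) pairs from jobs.
employer_naics = {1: "44", 2: "44", 3: "62"}
job_earnings = [(1, 10), (1, 20), (2, 30), (2, 40), (1, 50), (3, 100), (3, 200)]


def industry_view(employer_naics, job_earnings):
    """Return {NAICS: {percentile_earnings_01: ..., ..., percentile_earnings_99: ...}}."""
    by_industry = defaultdict(list)
    for employer_id, earnings in job_earnings:
        by_industry[employer_naics[employer_id]].append(earnings)
    view = {}
    for naics, values in by_industry.items():
        cuts = quantiles(values, n=100)  # 99 cut points; needs at least two values
        view[naics] = {
            f"percentile_earnings_{p:02d}": cut for p, cut in enumerate(cuts, start=1)
        }
    return view


view = industry_view(employer_naics, job_earnings)
print(view["44"]["percentile_earnings_50"])
```

The output is keyed by NAICS, so the reference table for this view is Table_Industry rather than Table_Job.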

Estimating Models from Linked Files
Linked files are usually analyzed as if the linkage were without error
Most of this class focuses on such methods
There are good reasons to believe that this assumption should be examined more closely

Statistical Analysis with Incomplete Links
Lahiri and Larsen (JASA 2005)
Sadinle and Fienberg (JASA 2013)
Steorts, Hall and Fienberg (JASA 2015)

STATISTICAL TOOLS: BAYESIAN HIERARCHICAL MODELS

Spatial-Temporal Modeling of the Quarterly Workforce Indicators
Bradley, Holan and Wikle (Annals of Applied Statistics 2015)
Average monthly earnings, measured quarterly, for men and women in detailed geography (county) and industry (NAICS sector)
Used all available data
Modeled available and missing data

Multivariate Spatial-Temporal Model

STATISTICAL TOOLS: GRAPH-BASED DATA MODELS

Outline
Basic graph theory
Integrated labor market data
Statistical modeling

What Is A Graph?

Graphs
Fully connected graph network

The Bipartite Labor Market Graph

Labor Market Graph: Realized Mobility Network
The realized mobility network connects employers to employees in a dynamic graph
This graph can be constructed from the sequence of “star” clusters that represent employment at a point in time
Only one employer/employee is modeled for any time period
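The construction from per-period “star” clusters can be sketched in plain Python. The records are invented (person, employer, period) triples with one employer per person per period, as the slide assumes; the function names are hypothetical.

```python
from collections import defaultdict

# One (person, employer, period) record per person per period; toy data.
records = [
    ("i1", "e1", 1), ("i2", "e1", 1), ("i3", "e2", 1),
    ("i1", "e1", 2), ("i2", "e2", 2), ("i3", "e2", 2),
]


def star_clusters(records):
    """For each period, map every employer (star center) to its set of workers."""
    stars = defaultdict(lambda: defaultdict(set))
    for person, employer, period in records:
        stars[period][employer].add(person)
    return stars


def realized_mobility_edges(records):
    """Union of the per-period stars: every person-employer pair ever observed."""
    return {(person, employer) for person, employer, _ in records}


def employer_transitions(records):
    """List (person, old employer, new employer, period of arrival) for each move."""
    history = defaultdict(dict)
    for person, employer, period in records:
        history[person][period] = employer
    moves = []
    for person, by_period in history.items():
        periods = sorted(by_period)
        for prev, nxt in zip(periods, periods[1:]):
            if by_period[prev] != by_period[nxt]:
                moves.append((person, by_period[prev], by_period[nxt], nxt))
    return sorted(moves)


stars = star_clusters(records)
edges = realized_mobility_edges(records)
moves = employer_transitions(records)
print(moves)
```

A worker who appears in stars centered on two different employers is what connects those employers in the realized mobility network.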

Adjacency Matrices

STATISTICAL MODELING OF INTEGRATED LABOR MARKET DATA

Building Integrated Labor Market Data
Examples from the LEHD infrastructure files
Analysis can be done using workers, jobs or employers as the basic observation unit
Want to model heterogeneity due to the workers and employers for job level analyses
Want to model heterogeneity due to the jobs and workers for employer level analyses
Want to model heterogeneity due to the jobs and employers for individual analyses

Basic Statistical Model
The dependent variable is compensation
The function J(i,t) indicates the employer of i at date t
The first component is the person effect
The second component is the firm effect
The third component is the measured characteristics effect
The fourth component is the statistical residual, orthogonal to all other effects in the model
NOTE: This is NOT a “fixed-effects” model. It can be estimated by fixed, random, mixed, or Bayesian methods without changing any of the basic modeling assumptions.
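The equation on this slide did not survive in the transcript. A sketch consistent with the four components listed, borrowing the notation of Abowd, Kramarz and Margolis (1999) as an assumption (none of the symbols below appear in the transcript), would be:

```latex
y_{it} = \theta_i + \psi_{J(i,t)} + x_{it}\beta + \varepsilon_{it}
```

Here $y_{it}$ is compensation of person $i$ at date $t$, $\theta_i$ the person effect, $\psi_{J(i,t)}$ the effect of the firm employing $i$ at $t$, $x_{it}\beta$ the measured characteristics effect, and $\varepsilon_{it}$ the residual.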

Matrix Notation: Basic Model
All vectors/matrices have row dimensionality equal to the total number of observations.
Data are sorted by person-ID and ordered chronologically for each person.
D is the design matrix for the person effect: columns equal to the number of unique person IDs.
F is the design matrix for the firm effect: columns equal to the number of unique firm IDs times the number of effects per firm.
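The two design matrices can be sketched as indicator matrices built from the ID columns. The example assumes a single effect per firm, so F has one column per unique firm ID, and uses invented IDs sorted by person and then by date; `design_matrix` is a hypothetical helper, not something from the course materials.

```python
def design_matrix(ids):
    """Return (levels, matrix): one row per observation, one indicator column per unique ID."""
    levels = sorted(set(ids))
    column = {level: k for k, level in enumerate(levels)}
    matrix = [[0] * len(levels) for _ in ids]
    for row, level in enumerate(ids):
        matrix[row][column[level]] = 1
    return levels, matrix


# Five person-period observations, sorted by person ID and then chronologically.
person_ids = ["i1", "i1", "i2", "i2", "i3"]
firm_ids = ["e1", "e2", "e1", "e1", "e2"]  # employer J(i,t) for each row

person_levels, D = design_matrix(person_ids)
firm_levels, F = design_matrix(firm_ids)

for d_row, f_row in zip(D, F):
    print(d_row, f_row)
```

Both matrices have as many rows as there are observations, and each row contains exactly one 1 because every observation belongs to one person and one firm. In practice these matrices are very sparse, so a real implementation would store them in a sparse format rather than as nested lists.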

True Industry Effect Model
The function K(j) indicates the industry of firm j
The first component is the person effect
The second component is the firm effect net of the true industry effect
The third component is the true industry effect, an aggregation of firm effects since industry is a property of the employer
The fourth component is the effect of personal characteristics
The fifth component is the statistical residual
See Abowd et al. (2012)
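The slide's equation is missing from the transcript. Under the assumption that it follows the notation of Abowd, Kramarz and Margolis (1999), with $\kappa$ for the industry effect (the symbols are not in the transcript), the five components listed would read:

```latex
y_{it} = \theta_i
       + \left(\psi_{J(i,t)} - \kappa_{K(J(i,t))}\right)
       + \kappa_{K(J(i,t))}
       + x_{it}\beta
       + \varepsilon_{it}
```

The second and third terms sum to the firm effect $\psi_{J(i,t)}$, so this is a decomposition of the firm effect into an industry part and a within-industry part, not a new restriction on the model.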

Matrix Notation: True Industry Effect Model
The matrix A is the classification matrix taking firms into industries
The matrix FA is the design matrix for the true industry effect
The true industry effect $\kappa$ can be expressed as
$$\kappa = (A'F'FA)^{-1} A'F'F\psi$$
where $\psi$ denotes the vector of firm effects
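If the true industry effect is taken to be the least-squares projection of the firm effects onto the industry design matrix FA, that is, (A'F'FA)⁻¹A'F'Fψ with ψ the vector of firm effects (an assumption following Abowd, Kramarz and Margolis, 1999), then each industry's effect reduces to an observation-weighted average of the firm effects within that industry. The sketch below computes that average directly; the firm effects and the firm-to-industry mapping are invented.

```python
from collections import Counter, defaultdict


def true_industry_effects(firm_ids, psi, industry_of):
    """Observation-weighted mean of firm effects within each industry.

    firm_ids    : employer of each observation (the rows of F)
    psi         : {firm: firm effect}
    industry_of : {firm: industry}, playing the role of the classification matrix A
    """
    n_obs = Counter(firm_ids)       # diagonal of F'F: observations per firm
    numerator = defaultdict(float)  # entries of A'F'F psi
    denominator = defaultdict(int)  # diagonal of A'F'FA: observations per industry
    for firm, n in n_obs.items():
        industry = industry_of[firm]
        numerator[industry] += n * psi[firm]
        denominator[industry] += n
    return {k: numerator[k] / denominator[k] for k in denominator}


firm_ids = ["e1", "e2", "e1", "e1", "e3"]
psi = {"e1": 0.2, "e2": -0.1, "e3": 0.4}
industry_of = {"e1": "44", "e2": "44", "e3": "62"}

kappa = true_industry_effects(firm_ids, psi, industry_of)
print(kappa)
```

The weighting by observation counts means large employers dominate their industry's effect.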

Raw Industry Effect Model
The first component is the raw industry effect
The second component is the measured personal characteristics effect
The third component is the statistical residual
The raw industry effect is an aggregation of the appropriately weighted average person and average firm effects within the industry, since both have been excluded from the model
The true industry effect is only an aggregation of the appropriately weighted average firm effect within the industry, as shown above
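The equation for this slide is also missing. A sketch matching the three components listed, with double asterisks marking the coefficients of the model that omits person and firm effects (assumed notation, after Abowd, Kramarz and Margolis, 1999):

```latex
y_{it} = \kappa^{**}_{K(J(i,t))} + x_{it}\beta^{**} + \varepsilon^{**}_{it}
```

Because the person and firm effects are left out, they are absorbed partly by $\kappa^{**}$ and partly by the residual, which is why the raw industry effect differs from the true one.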

Industry Effects Adjusted for Person Effects Model
The first component is the industry effect adjusted for person effects.
The second component is the individual effect (with firm effects omitted).
The third component is the measured personal characteristics effect.
The fourth component is the statistical residual.
The industry effects adjusted for person effects are also biased.

Relation: True and Raw Industry Effects
The vector $\kappa^{**}$ of raw industry effects can be expressed as the true industry effect $\kappa$ plus a bias that depends upon both the person and firm effects
The matrix M is the residual matrix (column null space) after projection onto the column space of the matrix in the subscript. For example,
$$M_X = I - X(X'X)^{-1}X'$$

Relation: Industry, Person and Firm Effects
The vector $\kappa^{**}$ of raw industry effects can be expressed as a matrix weighted average of the person effects $\theta$ and the firm effects $\psi$
The matrix weights are related to the personal characteristics X and the design matrices for the person and firm effects (see Abowd, Kramarz and Margolis, 1999)
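The formulas behind the last two slides are not in the transcript. The following is a reconstruction under stated assumptions: the stacked basic model is $y = D\theta + F\psi + X\beta + \varepsilon$, the raw model regresses $y$ on $FA$ and $X$, the residual $\varepsilon$ is orthogonal to all regressors (so its term drops out in the population), and $M_Z = I - Z(Z'Z)^{-1}Z'$ for any matrix $Z$. Partialling out $X$ from the raw model gives the weighted-average form, and splitting $F\psi$ into its projection on $FA$ and the remainder gives the bias form:

```latex
\begin{aligned}
\kappa^{**} &= (A'F'M_X FA)^{-1} A'F'M_X \,(D\theta + F\psi) \\
            &= \kappa + (A'F'M_X FA)^{-1} A'F'M_X \,(M_{FA}F\psi + D\theta),
\qquad \kappa = (A'F'FA)^{-1}A'F'F\psi .
\end{aligned}
```

The second line uses $F\psi = FA\kappa + M_{FA}F\psi$, so the term in $FA\kappa$ returns exactly $\kappa$ and everything else is the bias. The bias vanishes only if both the person effects and the within-industry firm effects are orthogonal to the industry indicators once $X$ has been partialled out.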

Estimation by Bayesian Methods
Person, employer and match effects specified as latent random classes
Distribution of person, firm and match effects simultaneous with model of mobility
Complete-data likelihood function assumes that the worker, employer and match classes are known
Model fit to a (small) random sample of LEHD in IL, IN, and WI
Markov Chain Monte Carlo used for estimation
See Abowd and Schmutte (2013); preliminary estimates
