Experiences with multiple propensity score matching
Jan Hagemejer & Joanna Tyrowicz
University of Warsaw & National Bank of Poland

Plan
1. Standard solutions to the automation challenge
2. Where they do not work: the example of propensity score matching
   - Using loops and global macros together
   - Generating resultssets for atypical estimations
   - Difficulties with using bootstrap (and obtaining resultssets)
3. Summary comments ... and some (hard-learned) advice

The standard route
- Problem: several estimations of a similar form + the need to compare the results.
- Three simple solutions:
  - Solution 1: brute force = sit & type (copy/paste from the output)
  - Solution 2: use parmest (Roger Newson) if the estimations run over simple categories in the data (within the limitations of the "by" command)
  - Solution 3: use loops + outreg/outreg2 (a minimal sketch follows below):
    - nicely formatted tables,
    - publication-ready,
    - in many formats, even directly to Word or LaTeX.
- Note: if you need nice summary statistics, you can use outsum, either with by or within loops.
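A minimal sketch of Solution 3, assuming a dataset with an outcome y, regressors x1 and x2, and a grouping variable industry (all names illustrative, not from the talk); outreg2 must be installed (ssc install outreg2):

    use mydata, clear                          // hypothetical dataset
    levelsof industry, local(groups)           // values of the grouping variable
    foreach g of local groups {
        quietly regress y x1 x2 if industry == `g'
        * append each regression to one formatted table (Word/Excel/LaTeX possible)
        outreg2 using results_by_group, excel append ctitle("Group `g'")
    }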

Where do the problems come from?
- The 2nd and 3rd solutions work only with regression-type estimations
- However, some procedures are incompatible with such pre-cooked solutions
- We need to report:
  - the output of the procedure
  - the sample properties after matching
  - the balancing properties of the matching
- Problem 1: actually, none of these is in the typical output
- Problem 2: we need it for many estimations, looped over many variables, and each one of them takes a looooong time

Detailed problem description
- Analyse the effects of privatisation
- Take two firms, A and A'. Firm A gets privatised; firm A' never does. We want to compare firms A and A' in each year before and after the privatisation of firm A (in fact we are comparing privatised SOEs to private firms, because few SOEs are left in the sample)
- Observe what happens before and after the "event" of privatisation
  - E.g. firm A may be one year before privatisation in 1999 and firm B in 2006, so the "event" is an anchor and time "runs" both ways.
- Effects may be observed in many spheres:
  - E.g. profits, investments, international competitiveness, employment, productivity
- Effects may be due to self-selection
  - E.g. only better firms are privatised, so the difference in performance is not due to privatisation (there may be other reasons why firms are privatised, related to, for instance, budget pressure).
- Use propensity score matching to compare privatised firms to non-privatised firms

What we want to get: [figure omitted]

Detailed problem description
- Thus, in our case:
  - Many time periods (a separate estimation for each "time-to-anchor")
  - Many variables (separate outcomes for each variable, but the same balancing properties within one "anchor")
  - Two ways of estimating: regular and bootstrapped (especially the latter made things complex)
  - Each estimation: roughly hours (big dataset)
  - Over a hundred estimations
  - To verify whether the matching is OK, we need to check the balancing properties
- Additional pitfalls:
  - We needed some statistics for all estimations and they were not in the return list
  - More precisely: the procedure computes them in order to produce its output, but the authors did not add them to the return list

Summary of the problems
Our problem was quite specific... BUT it consisted of many general problems:
1. Loops take a lot of time – we need to find efficient ways
2. Some things cannot be obtained quickly => even more reason to run them automatically
3. Obtaining datasets of the results we need (so-called resultssets)
   - Getting visible data when they are not part of the output
   - Using invisible data
4. Getting around with bootstrap

The structure of our estimations (diagram)
- Loop for time (12 periods)
  - Loop for variables (15 variables)
    - Run standard estimations
    - Run bootstrap estimation
  - Specific loops:
    - Balancing properties
    - Before- and after-matching statistics
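A bare skeleton of that structure (the period range 6(1)18 and the global $out are taken from the slides that follow; everything else is an illustrative placeholder):

    forvalues d = 6(1)18 {                     // loop over "time-to-anchor" periods
        * run the standard matching estimation for period `d' here
        foreach out of global out {            // loop over the outcome variables
            * run the bootstrap estimation for variable `out' here
        }
        * compute balancing properties and before/after-matching statistics here
    }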

How can global macros be useful?

Using global macros for estimations
- Our application: observe the same firms back and forth from the moment of privatisation (the "anchor")
- "Anchors" happen in different years
- But we can only match on one dimension: has or has not the "anchor"
- Conceptual solution: use lags and forwards to get the time dimension
- Technical problem: many outcome variables and, de facto, many loops
- Technical solution: define the matching variables and the output variables separately:
    global in  "capital roa export_status etc..."            // MATCHING VARS
    global out "productivity employment efficiency etc..."   // COMPARISON VARS
    global outf "forwards of $out"
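A minimal sketch of how the forwards in $outf could be built, assuming the data are a declared panel with illustrative identifiers firm and year; this is our illustration, not the authors' code:

    xtset firm year                            // hypothetical panel identifiers
    global in  "capital roa export_status"     // matching variables
    global out "productivity employment"       // comparison (outcome) variables
    global outf ""                             // will hold the forwards of $out
    foreach v of global out {
        gen f1_`v' = F1.`v'                    // one-period-ahead value of `v'
        global outf "$outf f1_`v'"
    }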

Getting from results to "resultssets"

Why (and what) do we need (in) the resultssets?
- Why?
  - Most importantly: without resultssets we cannot
    - analyse the changes over time
    - decompose the observed differentials
  - If we did not do it automatically, it would have to be copied manually from the logs – many estimations, many variables, etc.
- What? Step 1: find out the reality
  1. The size of each of the three groups: treated, total and control (= matched)
  2. Averages in all three groups (medians, etc.)
  3. Whether they are in fact different (= a test of statistical significance based on the difference and the standard error of this difference)
- What? Step 2: find out how good the findings are statistically
  1. Balancing properties!

Our solution to step 1
- Initialise the store for our resultssets using postfile. Index the results table with the variable names, years and the other things that the code loops over:
    tempname memhold
    postfile `memhold' indices variable_names_for_results using filename
- Start the big loop (over the event):
    forvalues d = 6(1)18 {
- Run pscore (needed for the bootstrap) and subsequently psmatch2:
    psmatch2 d`d' our_pscore_`d', out($out $outf $outl) some options

Our solution to step 1 (continued)
- Run pscore and psmatch2:
    psmatch2 d`d' our_pscore_`d', out($out $outf $outl) some options
- Start the loop over the outcome variables:
    foreach out in $out $outf1 $outf2 {
- Generate the means and standard errors for treated/matched/unmatched, using the results returned by psmatch2 (more about this later):
    local se_after = r(seatt_`out')
- Post the locals to the postfile using the post command in each loop iteration
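Putting the two slides together, a self-contained sketch of the pattern. The file name step1_results, the posted columns, the common option, and the use of psmatch2's documented pscore() and outcome() options are our assumptions (the slides pass the score positionally); psmatch2 must be installed (ssc install psmatch2), and r(att_varname)/r(seatt_varname) are the returns the slides rely on:

    tempname memhold
    * columns of the resultsset: event time, outcome name, ATT and its standard error
    postfile `memhold' d str32 outcome att se_att using step1_results, replace
    forvalues d = 6(1)18 {                                      // loop over event time
        psmatch2 d`d', pscore(our_pscore_`d') outcome($out $outf) common
        foreach out in $out $outf {
            * psmatch2 returns one ATT and s.e. per outcome variable
            post `memhold' (`d') ("`out'") (r(att_`out')) (r(seatt_`out'))
        }
    }
    postclose `memhold'
    use step1_results, clear                                    // the finished resultsset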

Our solution to step 2
- For the balancing properties we need to run pstest over all the matching variables:
    pstest $in
- In order to produce nice tables, we loop over all the matching variables in $in, create some locals, and later save them as separate variables:
    foreach in in $in {
        capture local bias_reduction = r(bired_`in')
        capture local pvalue_bef     = r(pbef_`in')
        capture local pvalue_after   = r(paft_`in')
        capture gen b_red_`in'    = `bias_reduction'
        capture gen pval_bef_`in' = `pvalue_bef'
        capture gen pval_aft_`in' = `pvalue_after'
    }
- Spit everything out to a spreadsheet (alternatively you can use postfile again):
    outsheet b_red* pval* using stats_priv_`d', replace
- Make some graphs and clean up:
    psgraph
    graph save priv_support_`d', replace
    drop b_red* pval*

"Missing" statistics

Solving the problem of "missing" statistics
- psmatch2 produces nice tables with all the required statistics. However, they are only shown on the screen and vanish right after that
- Look into the "ado" file of the procedure you are using
- Throughout the file, there are commands of the form:
    return scalar x = `somelocal'
- Sometimes – for clarity – scalars are dropped at the end of the procedure
- Your preferred statistic (if it is in the output, it has to exist at least as a local) would simply need a "return scalar" line like that too
- If it does not have one, you can always generate it from your preferences and the available locals => modify the original ado file
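One way to locate the ado-file before editing it (our suggestion, not from the slides): which shows where the installed copy lives, and discard makes Stata reload ado-files after you change them; saving the modified copy under a new name in your PERSONAL directory protects it from being overwritten by updates:

    which psmatch2          // prints the path of the installed psmatch2.ado
    adopath                 // lists the ado directories, including PERSONAL
    * ... edit a copy, e.g. save it as psmatch2_mod.ado in PERSONAL (name is ours) ...
    discard                 // drop the cached version so the edited file is reloaded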

Solving the problem of "missing" statistics – example 1

Original ado file – line 380:

    qui foreach v of varlist `varlist' {
        replace _`v' = . if _support==0
        tempname m1t m0t u0u u1u att dif0
        sum `v' if _treated==1, mean
        scalar `u1u' = r(mean)
        sum `v' if _treated==0, mean
        scalar `u0u' = r(mean)
        sum `v' if _treated==1 & _support==1, mean
        scalar `m1t' = r(mean)
        local n1 = r(N)
        sum _`v' if _treated==1 & _support==1, mean
        scalar `m0t' = r(mean)
        scalar `att' = `m1t' - `m0t'
        scalar `dif0' = `u1u' - `u0u'
        return scalar att = `att'
        return scalar att_`v' = `att'
        /* no "return" of the other needed scalars */

Modified ado file – line 380:

    qui foreach v of varlist `varlist' {
        replace _`v' = . if _support==0
        tempname m1t m0t u0u u1u att dif0
        ... /* all the same as above, plus: */
        return scalar diff = `dif0'
        return scalar diff_`v' = `dif0'
        return scalar mean0 = `u0u'
        return scalar mean0_`v' = `u0u'
        return scalar mean1 = `u1u'
        return scalar mean1_`v' = `u1u'

Solving the problem of "missing" statistics – example 2

Original ado file – line 440:

    return scalar seatt = `stderr'
    return scalar seatt_`v' = `stderr'
    qui regress `v' _treated
    scalar `ols' = _b[_treated]
    scalar `seols' = _se[_treated]

Modified ado file – line 440:

    return scalar seatt = `stderr'
    return scalar seatt_`v' = `stderr'
    qui regress `v' _treated
    scalar `ols' = _b[_treated]
    scalar `seols' = _se[_treated]
    return scalar seols = `seols'
    return scalar seols_`v' = `seols'

Problems with bootstrap

- When calculating standard errors, the psmatch2 procedure does not take into account that the propensity score is itself estimated. A possible solution to this is to use the bootstrap.
- What are the problems with the bootstrap?
  - It needs to be run separately for each variable (it bootstraps only one standard error at a time)
  - The output is given in a totally different form
  - It takes a looong time
  - A new piece of code just for the bootstrapped standard errors => new variable loops within each time loop

Problems with bootstrap (continued)
- Again, create the postfile
- Run the actual bootstrap in loops (post the results in every iteration):
    foreach out in $out $outf1 $outf2 {
        use data, clear
        bootstrap r(att): psmatch2 d`d' $in, out(`out') some options
        matrix mat = e(b), e(se)                   /* without this, no resultssets */
        svmat mat                                  /* convert the matrix to variables */
        rename mat1 a`d'_diff_after_bs_`out'       /* create meaningful names */
        rename mat2 a`d'_se_after_bs_`out'
        gen time_of_event = `d'
        post `postfile' indices (a`d'_diff_after_bs_`out') (a`d'_se_after_bs_`out')
    }
    postclose `postfile'
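A compact alternative sketch under the same assumptions (the handle and file names, reps(200), and the outcome() spelling are illustrative): naming the bootstrapped statistic in the expression list lets you read the ATT and its bootstrap standard error straight from _b and _se, without the svmat/rename step:

    tempname bs
    postfile `bs' d str32 outcome att_bs se_bs using bootstrap_results, replace
    forvalues d = 6(1)18 {
        foreach out in $out $outf {
            use data, clear
            bootstrap att=r(att), reps(200): psmatch2 d`d' $in, outcome(`out') common
            * after -bootstrap-, the point estimate and bootstrap s.e. sit in _b/_se
            post `bs' (`d') ("`out'") (_b[att]) (_se[att])
        }
    }
    postclose `bs'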

Final steps
1. Merge the files obtained from the bootstrap on the "anchor" (to have a complete resultsset within each "anchor" period)
2. Organise the data
3. Produce tables and graphs (again in loops)
4. Write the paper
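A minimal sketch of step 1, continuing the illustrative file names from the sketches above (step1_results and bootstrap_results); the merge keys are assumptions:

    use step1_results, clear
    merge 1:1 d outcome using bootstrap_results, nogenerate
    sort d outcome
    save combined_results, replace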

The resulting graphs (1)
- 6 figures showing the levels for the 3 groups (15 matches each)

The resulting graphs (2)
- 6 figures showing the decomposition of the treated-unmatched difference (15 matches each)

The resulting graphs (3)
- 6×n figures showing the "balanced panel" version of the treated-unmatched difference for all variables

Some advice we did not take at the right time
1. Use "sample 10" (keep a 10% random subsample) for testing procedures – it saves a lot of time (see the sketch below)
2. Leaving a mess is not useful if you ever want to come back
   - Your memory lasts shorter than that of saved files – describing do-files really helps
   - Loops are better than copy & paste – and less messy too
3. Beware of changes in Stata syntax (all the time...)
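A minimal illustration of tip 1; the `testing' switch is our own convention, not from the slides:

    local testing 1                 // set to 0 for the full run
    use data, clear
    if `testing' {
        set seed 12345              // make the test subsample reproducible
        sample 10                   // keep a 10% random subsample while debugging
    }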

Thank you for your attention! Jan Hagemejer & Joanna Tyrowicz University of Warsaw and National Bank of Poland