
Propensity Score Matching and Variations on the Balancing Test Wang-Sheng Lee Melbourne Institute of Applied Economic and Social Research The University of Melbourne October 27, 2006

Definition of the Problem “The most obvious limitation at present is that multiple versions of the balancing test exist in the literature, with little known about the statistical properties of each one, or how they compare to one another given particular types of data.” (Smith and Todd, 2005)

Preview of Main Findings There is a difference between a ‘before matching balancing test’ and an ‘after matching balancing test.’ Current balancing tests as implemented in the literature have poor size properties. Improved balancing tests using non-parametric tests are suggested.

Propensity Score Matching Methodology
Step 1: Estimate the probability of receiving treatment, Prob(D = 1 | X) = p(X), using a logit or probit model.
Step 2: Choose a matching algorithm (e.g., stratification, nearest neighbour, kernel matching, caliper matching) and match on p(X).
Step 3: Perform matching diagnostics (such as the balancing test).
Step 4: Compare mean outcomes to get the Average Treatment Effect on the Treated (ATT).
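The four steps above can be sketched in Python on synthetic data. This is a minimal illustration, not the paper's setup: the sample size, coefficients, and the true ATT of 2 are all assumptions chosen for the example, and Step 3 (the balance check) is omitted from this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated data (sizes and coefficients are illustrative assumptions) ---
n = 400
X = rng.normal(size=(n, 2))                      # two covariates
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
D = rng.binomial(1, p_true)                      # treatment indicator
Y = 2.0 * D + X[:, 0] + rng.normal(size=n)       # outcome; true ATT = 2

# Step 1: estimate p(X) with a logit fit by Newton-Raphson (IRLS).
Xd = np.column_stack([np.ones(n), X])            # add an intercept
beta = np.zeros(Xd.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-Xd @ beta))
    grad = Xd.T @ (D - p)                        # score vector
    H = Xd.T @ (Xd * (p * (1 - p))[:, None])     # observed information
    beta += np.linalg.solve(H, grad)
pscore = 1 / (1 + np.exp(-Xd @ beta))

# Step 2: nearest-neighbour matching (with replacement) on p(X).
treated = np.flatnonzero(D == 1)
control = np.flatnonzero(D == 0)
matches = control[np.abs(pscore[treated][:, None] -
                         pscore[control][None, :]).argmin(axis=1)]

# Step 4: ATT = mean outcome gap between treated units and their matches.
att = (Y[treated] - Y[matches]).mean()
print(f"estimated ATT: {att:.2f}")   # should be near the true value of 2
```

Matching with replacement means one control can serve as the match for several treated units, which is one reason analytic standard errors for such estimators are hard (a point the concluding slides return to).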

A Matching Diagnostic: Balance A balancing test checks whether the two groups ‘look the same’ in terms of the Xs after matching on p(X). The balancing property of propensity scores (Theorem 2, Rosenbaum and Rubin, 1983): X ⊥ D | p(X)
 Given information on p(X), information on X is unnecessary for information on D.
 Does not require any use of the outcome variable, so it introduces no bias.
 Balance does not mean we have the correct Xs in the model (i.e., it does not imply the conditional independence assumption, CIA).
 No convenient tests for conditional independence exist.

Varieties of Balancing Tests
Test 1: Test for equality of each covariate mean between groups, within strata of p(X) (t-test). Done after Step 1: estimating p(X) on the full sample.
Test 2: Standardised test of differences (of normalised covariates) between groups. Done after Step 2: matching on p(X).
Test 3: Test for equality of each covariate mean between groups (t-test). Done after Step 2: matching on p(X).
Test 4: Test for joint equality of all covariate means between groups (F-test or Hotelling test). Done after Step 2: matching on p(X).
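The statistics behind Tests 2 to 4 can be computed directly with NumPy. The matched samples below are simulated placeholders standing in for the output of a matching step; in the classical version of Test 4, the T² statistic is referred to an F distribution, which is omitted here.

```python
import numpy as np

def standardized_diff(xt, xc):
    """Test 2: standardised difference in means, in percent."""
    num = xt.mean() - xc.mean()
    den = np.sqrt((xt.var(ddof=1) + xc.var(ddof=1)) / 2)
    return 100 * num / den

def welch_t(xt, xc):
    """Test 3: two-sample t-statistic (unequal variances)."""
    se = np.sqrt(xt.var(ddof=1) / len(xt) + xc.var(ddof=1) / len(xc))
    return (xt.mean() - xc.mean()) / se

def hotelling_t2(Xt, Xc):
    """Test 4: Hotelling T^2 for joint equality of covariate mean vectors,
    using the pooled sample covariance matrix."""
    nt, nc = len(Xt), len(Xc)
    diff = Xt.mean(axis=0) - Xc.mean(axis=0)
    S = ((nt - 1) * np.cov(Xt, rowvar=False) +
         (nc - 1) * np.cov(Xc, rowvar=False)) / (nt + nc - 2)
    return (nt * nc / (nt + nc)) * diff @ np.linalg.solve(S, diff)

# Placeholder "matched" samples: here both groups genuinely share a
# distribution, so all three diagnostics should signal balance.
rng = np.random.default_rng(1)
Xt = rng.normal(size=(150, 3))   # matched treated covariates
Xc = rng.normal(size=(150, 3))   # matched control covariates
print([round(standardized_diff(Xt[:, j], Xc[:, j]), 1) for j in range(3)])
print(round(hotelling_t2(Xt, Xc), 2))
```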

Some Other Alternative Before Matching Balancing Tests
 QQ plots: Austin and Mamdani (2006); Imai, King and Stuart (2006).
 Box plots: Austin and Mamdani (2006).
 Binary response plots (Rubin-Cook scatter plots): Lee (2006a).
 Undirected graphical models: Lee (2006b).

Some Other Alternative After Matching Balancing Tests
 Regression test: Smith and Todd (2005).
 Pseudo R²: Sianesi (2004).

Motivating Example: NSW Data This experimental data set was used in several studies to perform a ‘recovery exercise.’
 See, for example, Dehejia and Wahba (1999, 2002) and Smith and Todd (2005).
Dehejia and Wahba (1999) conducted Test 1, performed stratification and nearest neighbour matching, and obtained estimates similar to the experimental estimates.
 Concluded that balancing Test 1 is useful.

But Dehejia and Wahba (1999) did not conduct Tests 2 to 4. What would have happened if they had?
 After estimating p(X), balance is obtained using Test 1.
 After performing kernel matching using the same specification of p(X), balance is obtained under Tests 2 to 4.
 However, after performing nearest neighbour matching using the same specification of p(X), imbalance is obtained under Tests 2 to 4.
In summary, Dehejia and Wahba’s (1999) nearest neighbour matching results that replicated the experimental benchmark came from a matched sample with imbalanced covariates.

Which balancing test should be used in practice? Is the within-strata t-test (Test 1) useful as a specification test for p(X)?
 The approach of Dehejia and Wahba (using Test 1 together with nearest neighbour matching) was still in use as recently as Diaz and Handa (2006).
 There is also the issue of multiple testing (e.g., Westfall and Young, 1993).

Monte Carlo Simulations Generating balanced data:
 If the error term in the treatment assignment equation is independent of X, then given X and β: D ⊥ X | Xβ
 It follows that D ⊥ X | logit(Xβ), or D ⊥ X | p(X)
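This balancing property can be checked numerically: draw treatment from a logit whose error is independent of X, and compare the raw covariate gap between groups with the gap inside narrow p(X) strata. The coefficients, sample size, and number of strata below are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta = 20_000, np.array([1.0, -1.0])
X = rng.normal(size=(n, 2))
index = X @ beta                              # Xb
p = 1 / (1 + np.exp(-index))                  # true p(X)
D = (rng.uniform(size=n) < p).astype(int)     # assignment error independent of X

# Unconditionally, X differs sharply between treated and controls ...
raw_gap = X[D == 1, 0].mean() - X[D == 0, 0].mean()

# ... but within narrow p(X) strata the gap vanishes: D is independent
# of X given p(X), which is what the balancing property asserts.
cuts = np.quantile(p, np.linspace(0, 1, 21)[1:-1])   # 20 quantile strata
strata = np.digitize(p, cuts)
gaps = []
for s in np.unique(strata):
    m = strata == s
    if D[m].sum() > 0 and (1 - D[m]).sum() > 0:      # need both arms present
        gaps.append(X[m & (D == 1), 0].mean() - X[m & (D == 0), 0].mean())
within_gap = float(np.mean(np.abs(gaps)))
print(round(raw_gap, 2), round(within_gap, 3))
```

The within-stratum gap is not exactly zero because each stratum still has some width and finite-sample noise, which is precisely why test size matters when this idea is turned into a formal balancing test.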

Monte Carlo Simulations using Generated Data The simulations:
 assume a T-C ratio of
 assume we know which Xs to use to estimate the true propensity score (i.e., the CIA holds).
 vary the number and distribution of covariates and the sample size.
Test 1 performs terribly in terms of test size.
 But it seems to work well with a Bonferroni correction.
Tests 2 to 4 appear to have poor test sizes when there are more than 2 covariates.
 In current practice, researchers often look at mean or median values (e.g., the mean standardised difference) instead of using a “one unbalanced covariate and you’re out” rule.
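The size problem behind the Bonferroni remark can be seen in a stylised simulation (this is not the paper's design; the number of covariates, sample size, and replication count are assumptions). With K covariates tested at the 5% level, the chance of flagging at least one imbalance under a true null is roughly 1 − 0.95^K; testing each covariate at α/K restores the familywise rate.

```python
import numpy as np

rng = np.random.default_rng(3)
reps, n, K = 2000, 200, 10
crit = 1.96        # two-sided normal critical value at alpha = 0.05
crit_bonf = 2.807  # two-sided normal critical value at alpha/K = 0.005
any_naive = any_bonf = 0
for _ in range(reps):
    Xt = rng.normal(size=(n, K))   # treated and control covariates drawn
    Xc = rng.normal(size=(n, K))   # from the same distribution: balance holds
    t = (Xt.mean(0) - Xc.mean(0)) / np.sqrt(
        Xt.var(0, ddof=1) / n + Xc.var(0, ddof=1) / n)
    any_naive += (np.abs(t) > crit).any()       # reject if ANY covariate fails
    any_bonf += (np.abs(t) > crit_bonf).any()   # same rule, Bonferroni cutoff
rate_naive = any_naive / reps
rate_bonf = any_bonf / reps
print(rate_naive, rate_bonf)   # roughly 0.40 vs 0.05
```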

Monte Carlo Simulations using NSW Data None of the balancing tests appear to work well.
 For example, Test 1 rejects balance about 20% of the time when α = 5%.
Considered the issue of outliers, but dropping these observations did not change the results. The only way to make things work appears to be dropping difficult-to-balance covariates.
 But this is not a satisfactory solution!

Permutation Tests Instead of using the t-distribution for Tests 1 and 3, or the Hotelling distribution for Test 4, we use permutation distributions.
A similar approach is used in Abadie (2002), in the context of the Kolmogorov-Smirnov statistic performing poorly in the presence of point masses.
 The basic idea is to rearrange the labels on the observations, compute the test statistic, and repeat many times to obtain the permutation distribution of the test statistic.
 Permutation resampling is done without replacement.
Monte Carlo simulations using the NSW data show that these balancing tests attain approximately the correct test sizes.
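The label-shuffling idea can be sketched for the mean-difference statistic of Test 3. This is a generic permutation test on placeholder data, not the paper's exact implementation; the function name and the choice of 2,000 permutations are assumptions.

```python
import numpy as np

def perm_pvalue(xt, xc, B=2000, rng=None):
    """Two-sided permutation test for equality of means: shuffle the
    treated/control labels (resampling without replacement), recompute
    the mean difference each time, and locate the observed statistic in
    the resulting permutation distribution."""
    if rng is None:
        rng = np.random.default_rng(4)
    pooled = np.concatenate([xt, xc])
    obs = abs(xt.mean() - xc.mean())
    nt = len(xt)
    count = 0
    for _ in range(B):
        perm = rng.permutation(pooled)   # relabel: first nt act as "treated"
        count += abs(perm[:nt].mean() - perm[nt:].mean()) >= obs
    return (count + 1) / (B + 1)         # add-one keeps the p-value positive

# Placeholder matched samples under the null of balance.
rng = np.random.default_rng(4)
p = perm_pvalue(rng.normal(size=80), rng.normal(size=80), rng=rng)
print(round(p, 3))
```

Because the reference distribution is built from the data at hand, the test's size does not depend on the t-distribution being a good approximation for the matched sample, which is the motivation given on this slide.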

Power of the Tests What happens when there is an omitted variable in estimating p(X)? Using the NSW data, consider three DGPs:
1. p(X) contains RE74 and Y contains RE74.
2. p(X) contains RE74 and Y does not contain RE74.
3. p(X) does not contain RE74 and Y contains RE74.
Estimate the propensity score using a set of variables that excludes RE74 (i.e., an omitted variable).
 Under all DGPs, the tests reject balance at approximately the chosen size.
 The balancing tests could not detect the misspecification in p(X).
 The bias on the ATT is largest for DGP1.
 The biases on the ATT are smaller for DGP2 and DGP3.
When the CIA is not fulfilled, balancing tests with low Type I error rates are of limited use (i.e., balance ≠ CIA).

Conclusions p(X) is a relative measure, not a permanent ID tag or permanent summary index score attached to each observation.
 Matching creates weights that effectively change the composition of the sample.
 When the sample changes, the nature of the balancing hypothesis X ⊥ D | p(X) changes with it.

It is important to distinguish between before matching and after matching balancing tests.
 Test 1 is a before matching test and is most appropriately used with matching by stratification (i.e., when the ATT is computed using the exact same strata as Test 1).
 Tests 2 to 4 are after matching balancing tests and are most appropriately used with matching algorithms that match on p(X).

The DW test, as described in Dehejia and Wahba (1999, 2002), has poor test size when used as a before matching test. Conventional t-tests and Hotelling tests do not appear to work well as tests of after matching balance.
 This is related to the problem of computing standard errors for matching estimators, which is still an open problem (i.e., there is no analytic solution).
Balancing tests based on permutation tests appear to provide good test sizes.
 But without the CIA being fulfilled, their role as a diagnostic is limited.

The End