Almost-Exact Matching for Causal Inference

Presentation transcript:

Almost-Exact Matching for Causal Inference
Cynthia Rudin, Associate Professor, Departments of Computer Science, Electrical and Computer Engineering, and Statistical Science, Duke University

I'm not talking about decision trees today. I talk about them too much, so I decided at the last minute to switch to causal inference. When you're working with observational data, you ideally want to find identical twins in the treatment and control groups. But you can't, because very few individuals are really alike in high dimensions. So we want to match them on as many relevant covariates as possible, and that's what my talk is about.

FLAME: Fast Large-scale Almost Matching Exactly (with Sudeepa Roy, Alexander Volfovsky, Tianyu Wang, Awa Dieng, Yameng Liu)

Matching in Potential Outcomes Framework
Data: X (n x p covariates), Y (n x 1 outcomes), T (n x 1 treatment indicators in {0,1}).
Assumptions: observational data, SUTVA, strong ignorability.
Causal inference is "half supervised."
Matching is interpretable: "identical" twins.
Most matching methods don't try to match exactly.

I always say that causal inference is half supervised. For each treatment unit, you know their outcome, and for each control unit, you know their outcome, but you never get the *treatment effect* for any individual. You never know what would happen if you gave the treatment to that control unit. Matching is a popular way to handle causal inference problems because it can be really interpretable.
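To make the "half supervised" point concrete, here is a minimal sketch of how a matched group can stand in for the missing counterfactual: within a group of near-twins, the treated and control outcomes are averaged and differenced. The function name `group_effect` and the `(T, Y)` tuple layout are assumptions made for illustration, not something taken from the talk or from the FLAME code.

```python
def group_effect(group):
    """Estimate the treatment effect inside one matched group.

    `group` is a list of (T, Y) pairs (hypothetical layout): T is 0 or 1,
    Y is the observed outcome. The estimate is the mean treated outcome
    minus the mean control outcome.
    """
    treated = [y for t, y in group if t == 1]
    control = [y for t, y in group if t == 0]
    if not treated or not control:
        raise ValueError("a matched group needs both treated and control units")
    return sum(treated) / len(treated) - sum(control) / len(control)


# One treated unit and two controls that share the same covariates.
group = [(1, 7.0), (0, 4.0), (0, 6.0)]
effect = group_effect(group)
print(effect)  # 2.0
```

No individual treatment effect is ever observed here. The difference is only as trustworthy as the match between the units in the group.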

Covariates: age, gender, heart conditions, blood pressure, toenail length, eyeball width, etc.
Treated patient, Marietta: [ 50 F 1 0 1 1 68 1.5cm 2cm 1 0 3 0 ..... ]
Control patient 1, Lee Ann: [ 50 F 1 0 1 1 68 14cm 1cm 4 1 5 6 ..... ]

FLAME: Fast Large-scale Almost Matching Exactly
Goal: Match treatment and control units using as many important covariates as possible.
Handle large data sets.
Work fast.
The BasicExactMatch subroutine uses an efficient database query.
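The idea behind an exact-match step can be sketched in a few lines. This is not the database query from the slides. It swaps in an in-memory group-by using a Python dictionary, and the function name `basic_exact_match` and the list-of-dicts layout with a `"T"` key are assumptions for illustration.

```python
from collections import defaultdict


def basic_exact_match(units, covariates):
    """Group units that agree exactly on every covariate in `covariates`.

    `units` is a list of dicts with a binary "T" key (hypothetical layout).
    Only groups containing at least one treated and at least one control
    unit are returned, keyed by the shared covariate values.
    """
    groups = defaultdict(list)
    for idx, unit in enumerate(units):
        key = tuple(unit[c] for c in covariates)
        groups[key].append(idx)

    matched = {}
    for key, members in groups.items():
        if {units[i]["T"] for i in members} == {0, 1}:
            matched[key] = members
    return matched


units = [
    {"age": 50, "sex": "F", "T": 1},
    {"age": 50, "sex": "F", "T": 0},
    {"age": 61, "sex": "M", "T": 1},
    {"age": 33, "sex": "F", "T": 0},
]
matched = basic_exact_match(units, ["age", "sex"])
print(matched)  # {(50, 'F'): [0, 1]}
```

A SQL `GROUP BY` over the covariate columns would express the same grouping inside a database, which is presumably where the scalability comes from.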

FLAME: Fast Large-scale Almost Matching Exactly
Use a holdout training set to determine:
How important a variable is for prediction error.
Whether a subset of variables predicts sufficiently well.
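One way to picture the role of the holdout set is the sketch below, which scores a covariate subset by how well it predicts the outcome and then asks which covariate can be dropped at the least cost. The predictor used here (the mean outcome within each cell of covariate values, computed separately for treated and control units) is a stand-in chosen to keep the example dependency-free. The talk does not say which model is used, and the names `prediction_error` and `least_important` are hypothetical.

```python
from collections import defaultdict


def prediction_error(holdout, covariates):
    """Mean squared error of cell-mean predictions on the holdout set.

    Rows are dicts with "T", "Y" and covariate keys (hypothetical layout).
    Cells are defined by treatment status plus the chosen covariates.
    """
    cells = defaultdict(list)
    for row in holdout:
        key = (row["T"],) + tuple(row[c] for c in covariates)
        cells[key].append(row["Y"])
    sse = 0.0
    for ys in cells.values():
        mean = sum(ys) / len(ys)
        sse += sum((y - mean) ** 2 for y in ys)
    return sse / len(holdout)


def least_important(holdout, covariates):
    """Covariate whose removal raises the prediction error the least."""
    return min(
        covariates,
        key=lambda c: prediction_error(holdout, [k for k in covariates if k != c]),
    )


# Toy holdout set: the outcome depends on x1 and T, and x2 is irrelevant.
holdout = [
    {"x1": x1, "x2": x2, "T": t, "Y": 2.0 * x1 + t}
    for x1 in (0, 1) for x2 in (0, 1) for t in (0, 1)
]
print(prediction_error(holdout, ["x1"]))   # 0.0
print(prediction_error(holdout, ["x2"]))   # 1.0
print(least_important(holdout, ["x1", "x2"]))  # x2
```

Dropping the irrelevant covariate costs nothing in prediction error, while dropping the relevant one does, which is the signal used to decide what to eliminate.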

FLAME: Fast Large-scale Almost Matching Exactly
Algorithm:
Start with exact matching on all covariates. Find as many matched groups as possible (using BasicExactMatch).
Eliminate the least important covariate.
Repeat, finding as many matched groups as possible each time we eliminate a covariate.
Match Quality on training set: MQ = -PredictionError + C*BalancingFactor
Prediction error constraint: Always keep enough covariates to be able to predict the outcome well.
Balance constraint: Never eliminate too many points from either the treatment or control group.
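The loop described above can be sketched as follows. Several pieces are assumptions made for illustration and are not confirmed by the slides: the balancing factor is taken to be the fraction of remaining treated units matched plus the fraction of remaining control units matched, the prediction error uses a simple cell-mean predictor, and the names `flame`, `max_error` and the list-of-dicts layout are hypothetical.

```python
from collections import defaultdict


def exact_match(units, active, covs):
    """Indices in `active` that share all `covs` with a unit of the other group."""
    groups = defaultdict(list)
    for i in active:
        groups[tuple(units[i][c] for c in covs)].append(i)
    matched = set()
    for members in groups.values():
        if {units[i]["T"] for i in members} == {0, 1}:
            matched.update(members)
    return matched


def prediction_error(holdout, covs):
    """MSE of cell-mean predictions on the holdout set (stand-in model)."""
    cells = defaultdict(list)
    for row in holdout:
        cells[(row["T"],) + tuple(row[c] for c in covs)].append(row["Y"])
    sse = 0.0
    for ys in cells.values():
        mean = sum(ys) / len(ys)
        sse += sum((y - mean) ** 2 for y in ys)
    return sse / len(holdout)


def balancing_factor(units, active, matched):
    """Assumed definition: share of treated matched + share of controls matched."""
    total = 0.0
    for t in (0, 1):
        available = [i for i in active if units[i]["T"] == t]
        if available:
            total += sum(1 for i in available if i in matched) / len(available)
    return total


def match_quality(units, holdout, active, rest, C):
    """MQ = -PredictionError + C * BalancingFactor for a candidate covariate set."""
    matched = exact_match(units, active, rest)
    return -prediction_error(holdout, rest) + C * balancing_factor(units, active, matched)


def flame(units, holdout, covariates, C=0.1, max_error=float("inf")):
    """Return {unit index: covariates it was matched on}."""
    active = set(range(len(units)))
    covs = list(covariates)
    matches = {}
    while covs and active:
        newly = exact_match(units, active, covs)
        for i in newly:
            matches[i] = tuple(covs)
        active -= newly
        if len(covs) == 1 or not active:
            break
        # Drop the covariate whose removal gives the best match quality.
        candidates = {c: [k for k in covs if k != c] for c in covs}
        drop = max(covs, key=lambda c: match_quality(units, holdout, active, candidates[c], C))
        if prediction_error(holdout, candidates[drop]) > max_error:
            break  # prediction error constraint: stop before predicting badly
        covs = candidates[drop]
    return matches


units = [
    {"x1": 0, "x2": 0, "T": 1},
    {"x1": 0, "x2": 0, "T": 0},
    {"x1": 1, "x2": 0, "T": 1},
    {"x1": 1, "x2": 1, "T": 0},
]
holdout = [
    {"x1": x1, "x2": x2, "T": t, "Y": 2.0 * x1 + t}
    for x1 in (0, 1) for x2 in (0, 1) for t in (0, 1)
]
result = flame(units, holdout, ["x1", "x2"])
print(result)
```

In this toy run, units 0 and 1 match on both covariates. The irrelevant `x2` is then dropped, and units 2 and 3 match on `x1` alone.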

Some (Insightful) Experiments
No ground truth, so we need simulated data.
U=5 (no noise)
20K units: 10K treatment, 10K control

Regression cannot handle model misspecification

Some (Insightful) Experiments (no noise)

On the dataset I will show next

Collapsing FLAME
Try to match each treatment unit to at least one control on as many variables as possible.
Match on all variables.
Temporarily remove variable 1, get matches, put it back.
Temporarily remove variable 2, get matches, put it back.
...
Temporarily remove variables 1 AND 2, get matches, put them back.
Temporarily remove variables 1 AND 3, get matches, put them back.
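The enumeration above (remove one variable at a time, then pairs, and so on) can be sketched as a brute-force search with `itertools.combinations`. This is only an illustration of the ordering. It is exponential in the number of covariates and it records just the first largest subset found for each treated unit, and the name `collapsing_matches` and the data layout are assumptions.

```python
from itertools import combinations


def collapsing_matches(units, covariates):
    """For each treated unit, find controls agreeing on the most covariates.

    Subsets are tried in order of how many covariates are removed, so the
    first hit for a treated unit uses as many covariates as possible.
    Returns {treated index: (covariates kept, [matching control indices])}.
    """
    treated = [i for i, u in enumerate(units) if u["T"] == 1]
    controls = [i for i, u in enumerate(units) if u["T"] == 0]
    result = {}
    for n_removed in range(len(covariates)):  # never remove every covariate
        for removed in combinations(covariates, n_removed):
            kept = [c for c in covariates if c not in removed]
            for t in treated:
                if t in result:
                    continue  # already matched on a set at least this large
                hits = [
                    c for c in controls
                    if all(units[c][k] == units[t][k] for k in kept)
                ]
                if hits:
                    result[t] = (tuple(kept), hits)
    return result


units = [
    {"a": 1, "b": 1, "c": 1, "T": 1},
    {"a": 1, "b": 1, "c": 0, "T": 0},
    {"a": 0, "b": 1, "c": 1, "T": 1},
    {"a": 0, "b": 0, "c": 1, "T": 0},
]
result = collapsing_matches(units, ["a", "b", "c"])
print(result)
```

Here no treated unit has an exact twin on all three covariates. Unit 0 matches control 1 once `c` is removed, and unit 2 matches control 3 once `b` is removed.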

Breaking the Cycle of Drugs and Crime in the United States
Alabama, Florida, and Washington.
Participants were chosen to receive screening shortly after arrest and to participate in a drug intervention under supervision; the control group was drawn from the same population.
[Figure: covariates in order of elimination, from the first covariate eliminated to the last covariate eliminated]

CORELS Rule List for treatment effect
[Figure: rule list, with a legend distinguishing positive and negative estimated treatment effects]
Siong Thye Goh

Takeaway
Most matching methods can't handle irrelevant variables.
FLAME leverages ideas from ML + databases. It is scalable, fast, and accurate, and it passes sanity checks.
FLAME's code is here: https://github.com/ty-w/FLAME

Thanks
FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference (with Sudeepa Roy, Alexander Volfovsky, Tianyu Wang)
Code: https://github.com/ty-w/FLAME