CAUSAL SEARCH IN THE REAL WORLD

A menu of topics

Some real-world challenges:
- Convergence & error bounds
- Sample selection bias
- Simpson’s paradox

Some real-world successes:
- Learning based on more than just independence
- Learning about latents & their structure

Short-run causal search
- Bayes net learning algorithms can give the wrong answer if the data fail to reflect the “true” associations and independencies
- Of course, this is a problem for all inference: we might just be really unlucky
- Note: this is not (really) the problem of unrepresentative samples (e.g., black swans)

Convergence in search
- In search, we would like to bound our possible error as we acquire data
- I.e., we want search procedures that have uniform convergence
- Without uniform convergence:
  - we cannot set confidence intervals for inference
  - Bayesians with different priors over hypotheses need not agree on probable bounds, no matter how loose

Pointwise convergence
- Assume hypothesis H is true. Then:
  - for any standard of “closeness” to H, and
  - for any standard of “successful refutation,”
  - for every hypothesis that is not “close” to H, there is a sample size at which that hypothesis is refuted

Uniform convergence
- Assume hypothesis H is true. Then:
  - for any standard of “closeness” to H, and
  - for any standard of “successful refutation,”
  - there is a single sample size at which every hypothesis H* that is not “close” to H is refuted
(Both definitions are written out in symbols below.)
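In symbols, the only difference between the two notions is the order of the quantifiers. A minimal sketch in LaTeX, where d is a discrepancy measure and n_0 the required sample size (the notation is ours, not from the slides):

```latex
% Pointwise: the required sample size may depend on the alternative H*
\forall \varepsilon > 0 \;\; \forall H^{*} \text{ with } d(H^{*}, H) > \varepsilon \;\;
  \exists n_{0} \;\; \forall n \ge n_{0} : \; H^{*} \text{ is refuted at sample size } n

% Uniform: one sample size works for every not-close alternative simultaneously
\forall \varepsilon > 0 \;\; \exists n_{0} \;\; \forall H^{*} \text{ with } d(H^{*}, H) > \varepsilon \;\;
  \forall n \ge n_{0} : \; H^{*} \text{ is refuted at sample size } n
```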

Two theorems about convergence  There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993)  There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)

Two theorems about convergence  What if we didn’t care about “small” causes?  ε -Faithfulness: If X & Y are d-connected given S, then ρ XY.S > ε  Every association predicted by d-connection is ≥ ε  For any ε, standard constraint-based algorithms are uniformly convergent given ε -Faithfulness  So we have error bounds, confidence intervals, etc.

Sample selection bias  Sometimes, a variable of interest is a cause of whether people get in the sample  E.g., measure various skills or knowledge in college students  Or measuring joblessness by a phone survey during the middle of the day  Simple problem: You might get a skewed picture of the population

Sample selection bias  If two variables matter, then we have:  Sample = 1 for everyone we measure  That is equivalent to conditioning on Sample  ⇒ Induces an association between A and B! Factor AFactor B Sample

Simpson’s Paradox  Consider the following data: MenWomen P(A | T) = 0.5P(A | T) = 0.39 P(A | U) = 0.45…P(A | U) = TreatedUntreated Alive320 Dead324 TreatedUntreated Alive163 Dead256 Treatment is superior in both groups!

Simpson’s Paradox  Consider the following data: Pooled P(A | T) = P(A | U) = TreatedUntreated Alive1923 Dead2830 In the “full” population, you’re better off not being Treated!

Simpson’s Paradox  Berkeley Graduate Admissions case

More than independence
- Independence & association can reveal only the Markov equivalence class
- But our data contain more statistical information!
- Algorithms that exploit this additional information can sometimes learn more (including unique graphs)
- Example: the LiNGAM algorithm for non-Gaussian data

Non-Gaussian data  Assume linearity & independent non-Gaussian noise  Linear causal DAG functions are: D = B D + ε  where B is permutable to lower triangular (because graph is acyclic)

Non-Gaussian data  Assume linearity & independent non-Gaussian noise  Linear causal DAG functions are: D = A ε  where A = (I – B) -1

Non-Gaussian data  Assume linearity & independent non-Gaussian noise  Linear causal DAG functions are: D = A ε  where A = (I – B) -1  ICA is an efficient estimator for A  ⇒ Efficient causal search that reveals direction!  C E iff non-zero entry in A

Non-Gaussian data  Why can we learn the directions in this case? AB AB Gaussian noiseUniform noise

Non-Gaussian data  Case study: European electricity cost

Learning about latents
- Sometimes, our real interest…
  - is in variables that are only indirectly observed
  - or observed by their effects
  - or unknown altogether but influencing things behind the scenes
[Diagram: an example with latent variables (reading level, math skills, general IQ, sociability, and other factors) underlying observed variables such as test score and size of social network]

Factor analysis  Assume linear equations  Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error

Factor analysis  If we have one factor, then we find coefficients to minimize error in: F i = a i + b i U where U is the unobserved variable (with fixed mean and variance)  Two factors ⇒ Minimize error in: F i = a i + b i,1 U 1 + b i,2 U 2

Factor analysis  Decision about exactly how many factors to use is typically based on some “simplicity vs. fit” tradeoff  Also, the interpretation of the unobserved factors must be provided by the scientist  The data do not dictate the meaning of the unobserved factors (though it can sometimes be “obvious”)

Factor analysis as graph search  One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph: F1F1 FnFn F2F2 U …

Factor analysis as graph search  Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph: F1F1 FnFn F2F2 U1U1 … U2U2

Better methods for latents
- Two different types of algorithms:
  1. Determine which observed variables are caused by shared latents (BPC, FOFC, FTFC, …)
  2. Determine the causal structure among the latents (MIMBuild)
- Note: these need additional parametric assumptions (usually linearity, but it can be done with weaker information)

Discovering latents  Key idea: For many parameterizations, association between X & Y can be decomposed  Linearity ⇒  ⇒ can use patterns in the precise associations to discover the number of latents  Using the ranks of different sub-matrices

Discovering latents
[Diagram: a single latent U with observed indicators A, B, C, D]

[Diagram: observed indicators A, B, C, D with two latent variables, U and L]

- Many instantiations of this type of search exist, for different parametric knowledge, different numbers of observed variables (⇒ different numbers of discoverable latents), etc.
- And once we have one of these “clean” measurement models, we can use “traditional” search algorithms (with modifications) to learn the structure among the latents

Other Algorithms
- CCD: learns directed cyclic graphs (with non-obvious semantics)
- ION: learns global features from overlapping local variable sets (including relations between variables that were never measured together)
- SAT-solver approaches: learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints
- LoSST: learns causal structure while that structure potentially changes over time
- And lots of other ongoing research!

Tetrad project 