1
CAUSAL SEARCH IN THE REAL WORLD
2
A menu of topics
Some real-world challenges:
- Convergence & error bounds
- Sample selection bias
- Simpson’s paradox
Some real-world successes:
- Learning based on more than just independence
- Learning about latents & their structure
3
Short-run causal search
- Bayes net learning algorithms can give the wrong answer if the data fail to reflect the “true” associations and independencies
- Of course, this is a problem for all inference: we might just be really unlucky
- Note: this is not (really) the problem of unrepresentative samples (e.g., black swans)
4
Convergence in search
- In search, we would like to bound our possible error as we acquire data
- I.e., we want search procedures that have uniform convergence
- Without uniform convergence:
  - we cannot set confidence intervals for inference
  - not every Bayesian, regardless of priors over hypotheses, agrees on probable bounds, no matter how loose
5
Pointwise convergence
Assume hypothesis H is true. Then:
- for any standard of “closeness” to H, and
- for any standard of “successful refutation,”
- for every hypothesis that is not “close” to H, there is a sample size for which that hypothesis is refuted.
6
Uniform convergence
Assume hypothesis H is true. Then:
- for any standard of “closeness” to H, and
- for any standard of “successful refutation,”
- there is a sample size such that for all hypotheses H* that are not “close” to H, H* is refuted at that sample size.
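Written with explicit quantifiers, the only difference between the two notions is where the sample size appears relative to the alternative hypotheses. The lines below are just a paraphrase of the two slides above, with N standing for the sample size:

```latex
% Pointwise: the sufficient sample size N may depend on the alternative H*
\forall H^{*}\ \text{not close to } H \;\; \exists N :\; H^{*}\ \text{is refuted at sample size } N

% Uniform: a single sample size N works for every alternative H* at once
\exists N \;\; \forall H^{*}\ \text{not close to } H :\; H^{*}\ \text{is refuted at sample size } N
```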
7
Two theorems about convergence
- There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993)
- There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)
8
Two theorems about convergence
- What if we didn’t care about “small” causes?
- ε-Faithfulness: if X & Y are d-connected given S, then |ρ_XY.S| > ε
  - Every association predicted by d-connection is ≥ ε
- For any ε, standard constraint-based algorithms are uniformly convergent given ε-Faithfulness
- So we have error bounds, confidence intervals, etc.
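A minimal sketch of the kind of decision a constraint-based algorithm makes under ε-Faithfulness: treat X and Y as independent given S only when the sample partial correlation falls at or below ε. The data, threshold, and helper function below are illustrative, not from the slides.

```python
import numpy as np

def partial_corr(x, y, z):
    """Sample partial correlation of x and y given z, computed from the
    residuals of least-squares regressions of x and y on z (plus intercept)."""
    Z = np.column_stack([np.ones(len(x)), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Illustrative structure X <- Z -> Y, so X and Y are d-separated given Z.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)

epsilon = 0.05  # our chosen "smallest cause we care about"
print(abs(np.corrcoef(x, y)[0, 1]) > epsilon)   # True: marginally associated
print(abs(partial_corr(x, y, z)) > epsilon)     # False: judged independent given Z
```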
9
Sample selection bias
- Sometimes, a variable of interest is a cause of whether people get in the sample
  - E.g., measuring various skills or knowledge in college students
  - Or measuring joblessness by a phone survey during the middle of the day
- Simple problem: you might get a skewed picture of the population
10
Sample selection bias
- If two variables matter, then we have: Sample = 1 for everyone we measure
- That is equivalent to conditioning on Sample
- ⇒ Induces an association between A and B!
[Diagram: Factor A → Sample ← Factor B]
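The induced association is easy to see in a simulation: two independent factors both raise the chance of being sampled, and within the sampled sub-population they become (negatively) correlated. The cutoff and coefficients below are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a = rng.normal(size=n)                          # Factor A
b = rng.normal(size=n)                          # Factor B, independent of A
in_sample = (a + b + rng.normal(size=n)) > 1.0  # both factors raise P(Sample = 1)

print(round(np.corrcoef(a, b)[0, 1], 3))                        # ~0.0 in the full population
print(round(np.corrcoef(a[in_sample], b[in_sample])[0, 1], 3))  # clearly negative once we condition on Sample
```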
11
Simpson’s Paradox
Consider the following data:

Men        Treated   Untreated
  Alive       3          20
  Dead        3          24
  P(A | T) = 0.5        P(A | U) = 0.45…

Women      Treated   Untreated
  Alive      16           3
  Dead       25           6
  P(A | T) = 0.39       P(A | U) = 0.333

Treatment is superior in both groups!
12
Simpson’s Paradox
Consider the following data:

Pooled     Treated   Untreated
  Alive      19          23
  Dead       28          30
  P(A | T) = 0.404      P(A | U) = 0.434

In the “full” population, you’re better off not being Treated!
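The reversal is just arithmetic on the counts above; the snippet below recomputes the survival rates within each group and in the pooled table.

```python
# (alive, dead) counts for Treated and Untreated, from the tables above
men   = {"T": (3, 3),   "U": (20, 24)}
women = {"T": (16, 25), "U": (3, 6)}

def p_alive(alive, dead):
    return alive / (alive + dead)

for name, grp in (("Men", men), ("Women", women)):
    print(name, {k: round(p_alive(*v), 3) for k, v in grp.items()})

# Pooling the two groups flips the comparison:
pooled = {k: (men[k][0] + women[k][0], men[k][1] + women[k][1]) for k in ("T", "U")}
print("Pooled", {k: round(p_alive(*v), 3) for k, v in pooled.items()})
# Men    {'T': 0.5,   'U': 0.455}
# Women  {'T': 0.39,  'U': 0.333}
# Pooled {'T': 0.404, 'U': 0.434}   <- Untreated looks better overall
```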
13
Simpson’s Paradox
- Berkeley Graduate Admissions case
14
More than independence
- Independence & association can reveal only the Markov equivalence class
- But our data contain more statistical information!
- Algorithms that exploit this additional info can sometimes learn more (including unique graphs)
- Example: the LiNGAM algorithm for non-Gaussian data
15
Non-Gaussian data
- Assume linearity & independent non-Gaussian noise
- Linear causal DAG functions are: D = B·D + ε
  - where B is permutable to lower triangular (because the graph is acyclic)
16
Non-Gaussian data
- Assume linearity & independent non-Gaussian noise
- Linear causal DAG functions are: D = A·ε, where A = (I − B)⁻¹
17
Non-Gaussian data
- Assume linearity & independent non-Gaussian noise
- Linear causal DAG functions are: D = A·ε, where A = (I − B)⁻¹
- ICA is an efficient estimator for A
- ⇒ Efficient causal search that reveals direction!
  - C → E iff the corresponding entry of A is non-zero
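A rough sketch of the idea using scikit-learn’s FastICA on a two-variable example; the coefficients, sample size, and post-processing are my choices, and real LiNGAM implementations handle the permutation/scaling step more carefully.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Illustrative data: X -> Y with uniform (non-Gaussian) noise, Y = 0.8*X + e_Y.
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-1, 1, n)
y = 0.8 * x + rng.uniform(-1, 1, n)
D = np.column_stack([x, y])

# ICA models the data as a mixture of independent non-Gaussian sources;
# the estimated mixing matrix plays the role of A = (I - B)^-1.
ica = FastICA(n_components=2, random_state=0)
ica.fit(D)
A_hat = ica.mixing_                 # shape: (observed variables, sources)

# Undo ICA's column permutation/scaling ambiguity (2x2 case only):
if abs(A_hat[0, 0] * A_hat[1, 1]) < abs(A_hat[0, 1] * A_hat[1, 0]):
    A_hat = A_hat[:, ::-1]          # put the dominant entries on the diagonal
A_hat = A_hat / np.diag(A_hat)      # rescale each column so its diagonal entry is 1

print(np.round(A_hat, 2))
# Should be close to [[1, 0], [0.8, 1]]: the zero says X gets no contribution
# from Y's noise term, which orients the edge as X -> Y.
```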
18
Non-Gaussian data
Why can we learn the directions in this case?
[Scatter plots of A vs. B: Gaussian noise vs. uniform noise]
19
Non-Gaussian data
- Case study: European electricity cost
20
Learning about latents
Sometimes, our real interest…
- is in variables that are only indirectly observed
- or observed by their effects
- or unknown altogether but influencing things behind the scenes
[Diagram: Test score, Reading level, Math skills, General IQ, Sociability, Size of social network, Other factors]
21
Factor analysis
- Assume linear equations
- Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error
22
Factor analysis
- If we have one factor, then we find coefficients to minimize error in:
  F_i = a_i + b_i U
  where U is the unobserved variable (with fixed mean and variance)
- Two factors ⇒ minimize error in:
  F_i = a_i + b_{i,1} U_1 + b_{i,2} U_2
23
Factor analysis
- The decision about exactly how many factors to use is typically based on some “simplicity vs. fit” tradeoff
- Also, the interpretation of the unobserved factors must be provided by the scientist
  - The data do not dictate the meaning of the unobserved factors (though it can sometimes be “obvious”)
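As a sketch of that tradeoff (synthetic data and scikit-learn’s FactorAnalysis, chosen here purely for illustration): the model’s average log-likelihood keeps improving as factors are added, but the gain flattens once the true number of latents is reached.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic example: 6 observed features generated from 2 latent factors plus noise.
rng = np.random.default_rng(0)
n = 2_000
latents = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, 6))
X = latents @ loadings + 0.5 * rng.normal(size=(n, 6))

# Fit models with an increasing number of factors and compare fit;
# the jump from 1 to 2 factors is large, after that the curve flattens.
for k in range(1, 5):
    avg_loglik = FactorAnalysis(n_components=k).fit(X).score(X)
    print(k, round(avg_loglik, 3))
```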
24
Factor analysis as graph search
One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:
[Graph: U → F_1, F_2, …, F_n]
25
Factor analysis as graph search
Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:
[Graph: U_1 and U_2 each → F_1, F_2, …, F_n]
26
Better methods for latents
Two different types of algorithms:
1. Determine which observed variables are caused by shared latents
   - BPC, FOFC, FTFC, …
2. Determine the causal structure among the latents
   - MIMBuild
Note: need additional parametric assumptions
- Usually linearity, but can do it with weaker info
27
Discovering latents
- Key idea: for many parameterizations, the association between X & Y can be decomposed
- Linearity ⇒ the observed covariance matrix must satisfy constraints imposed by the latent structure
- ⇒ can use patterns in the precise associations to discover the number of latents
  - using the ranks of different sub-matrices (of the covariance matrix)
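A small numerical sketch of the rank idea (my own example, not the slides’: one latent with four linear indicators). The cross-covariance block between {A, B} and {C, D} has rank 1, which is the classical tetrad constraint.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
U = rng.normal(size=n)                      # single latent cause
A = 1.0 * U + rng.normal(size=n)
B = 0.7 * U + rng.normal(size=n)
C = 1.3 * U + rng.normal(size=n)
D = 0.9 * U + rng.normal(size=n)

cov = np.cov(np.column_stack([A, B, C, D]), rowvar=False)

block = cov[:2, 2:]                          # cross-covariances of {A,B} with {C,D}
print(np.linalg.svd(block, compute_uv=False))          # second singular value ~ 0 -> rank 1
print(cov[0, 2] * cov[1, 3] - cov[0, 3] * cov[1, 2])   # tetrad difference ~ 0
```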
28
Discovering latents
[Graph: latent U with observed indicators A, B, C, D]
29
[Graph: latents U and L with observed variables A, B, C, D]
30
- Many instantiations of this type of search for different parametric knowledge, # of observed variables (⇒ # of discoverable latents), etc.
- And once we have one of these “clean” models, we can use “traditional” search algorithms (with modifications) to learn the structure between the latents
31
Other Algorithms
- CCD: learns directed cyclic graphs (with non-obvious semantics)
- ION: learns global features from overlapping local variable sets (including relations between variables never measured together)
- SAT-solver approaches: learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints
- LoSST: learns causal structure while that structure potentially changes over time
- And lots of other ongoing research!
32
Tetrad project
http://www.phil.cmu.edu/projects/tetrad/current.html