CAUSAL SEARCH IN THE REAL WORLD
A menu of topics
Some real-world challenges:
  Convergence & error bounds
  Sample selection bias
  Simpson’s paradox
Some real-world successes:
  Learning based on more than just independence
  Learning about latents & their structure
Short-run causal search
Bayes net learning algorithms can give the wrong answer if the data fail to reflect the “true” associations and independencies
  Of course, this is a problem for all inference: we might just be really unlucky
  Note: This is not (really) the problem of unrepresentative samples (e.g., black swans)
Convergence in search
In search, we would like to bound our possible error as we acquire data
  I.e., we want search procedures that have uniform convergence
Without uniform convergence,
  We cannot set confidence intervals for inference
  Not every Bayesian, regardless of priors over hypotheses, agrees on probable bounds, no matter how loose
Pointwise convergence
Assume hypothesis H is true
Then
  For any standard of “closeness” to H, and
  For any standard of “successful refutation,”
  For every hypothesis that is not “close” to H, there is a sample size for which that hypothesis is refuted
Uniform convergence
Assume hypothesis H is true
Then
  For any standard of “closeness” to H, and
  For any standard of “successful refutation,”
  There is a sample size such that for all hypotheses H* that are not “close” to H, H* is refuted at that sample size
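The only difference between the two definitions is the order of the quantifiers. A sketch in symbols, where the notation is an assumption introduced here and not part of the slides: $d(H, H^{*}) > \epsilon$ stands for "H* is not close to H", and $R_{\delta}(H^{*}, n)$ stands for "H* is refuted to standard δ at sample size n".

```latex
\begin{align*}
\text{Pointwise:}\quad & \forall \epsilon\; \forall \delta\; \forall H^{*}\,[d(H, H^{*}) > \epsilon]\;\; \exists n\;\; R_{\delta}(H^{*}, n) \\
\text{Uniform:}\quad   & \forall \epsilon\; \forall \delta\;\; \exists n\;\; \forall H^{*}\,[d(H, H^{*}) > \epsilon]\;\; R_{\delta}(H^{*}, n)
\end{align*}
```

In the pointwise case the sample size n may depend on H*, so no single n works for every alternative; in the uniform case one n works for all of them.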
Two theorems about convergence
There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993)
There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)
Two theorems about convergence
What if we didn’t care about “small” causes?
ε-Faithfulness: If X & Y are d-connected given S, then $|\rho_{XY.S}| > \varepsilon$
  I.e., every association predicted by d-connection has magnitude greater than ε
For any ε > 0, standard constraint-based algorithms are uniformly convergent given ε-Faithfulness
  So we have error bounds, confidence intervals, etc.
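A rough way to see why a floor on effect size buys error bounds: the sample size needed to tell a correlation apart from zero grows without limit as the correlation shrinks. The sketch below is a back-of-the-envelope calculation under stated assumptions, not part of the cited theorems. It uses Fisher's z-transform with an empty conditioning set; the function name `samples_needed` and the 1.96 cutoff are choices made here for illustration.

```python
import math


def samples_needed(rho, z_crit=1.96):
    """Rough n at which a correlation of size rho is distinguishable from 0.

    Fisher's z = atanh(r) has standard error of about 1 / sqrt(n - 3), so we
    solve atanh(|rho|) * sqrt(n - 3) = z_crit for n. This ignores power and
    assumes no conditioning set; it is an illustration, not a power analysis.
    """
    if not 0 < abs(rho) < 1:
        raise ValueError("rho must be non-zero and strictly between -1 and 1")
    return math.ceil((z_crit / math.atanh(abs(rho))) ** 2 + 3)


# The required sample size blows up as the association shrinks towards zero.
for rho in (0.5, 0.1, 0.01, 0.001):
    print(f"|rho| = {rho:<6} needs n of roughly {samples_needed(rho)}")
```

Without a lower bound on the correlation, no single sample size covers every model; with a bound of ε, `samples_needed(ε)` is a sample size that works for all of them in this simplified setting.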
Sample selection bias
Sometimes, a variable of interest is a cause of whether people get in the sample
  E.g., measuring various skills or knowledge in college students
  Or measuring joblessness by a phone survey during the middle of the day
Simple problem: You might get a skewed picture of the population
Sample selection bias
If two variables matter, then we have:
  [Figure: Factor A → Sample ← Factor B]
Sample = 1 for everyone we measure
  That is equivalent to conditioning on Sample
  ⇒ Induces an association between A and B!
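A minimal simulation of the induced association, under assumptions made here for illustration only: Factor A and Factor B are independent standard normal variables, and a unit enters the sample when A + B exceeds 1. The selection rule and all names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)

# Factor A and Factor B are generated independently of each other.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]

# Hypothetical selection rule: both factors raise the chance of being sampled.
sample = [(a, b) for a, b in population if a + b > 1.0]

r_population = pearson(*zip(*population))
r_sample = pearson(*zip(*sample))

print(f"correlation in the population:    {r_population:+.3f}")
print(f"correlation in the sample (S=1):  {r_sample:+.3f}")
```

In the full population the correlation is close to zero; among the selected units it is clearly negative, because a low value of one factor must be offset by a high value of the other to get into the sample.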
Simpson’s Paradox
Consider the following data:

Men              Treated   Untreated
  Alive                3          20
  Dead                 3          24
  P(A | T) = 0.5          P(A | U) = 0.45…

Women            Treated   Untreated
  Alive               16           3
  Dead                25           6
  P(A | T) = 0.39         P(A | U) = 3/9 = 0.33…

Treatment is superior in both groups!
Simpson’s Paradox
Consider the following data:

Pooled           Treated   Untreated
  Alive               19          23
  Dead                28          30
  P(A | T) = 19/47 = 0.40…   P(A | U) = 23/53 = 0.43…

In the “full” population, you’re better off not being Treated!
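The reversal can be checked directly from the counts on the two slides above. The sketch below only redoes that arithmetic; the dictionary layout and function names are choices made here.

```python
def survival_rate(alive, dead):
    """P(Alive) from a pair of counts."""
    return alive / (alive + dead)


# counts[group][arm] = (alive, dead), taken from the Men and Women tables.
counts = {
    "men": {"treated": (3, 3), "untreated": (20, 24)},
    "women": {"treated": (16, 25), "untreated": (3, 6)},
}


def pooled(arm):
    """Add the Men and Women counts for one arm of the study."""
    alive = sum(counts[group][arm][0] for group in counts)
    dead = sum(counts[group][arm][1] for group in counts)
    return alive, dead


for group, arms in counts.items():
    t = survival_rate(*arms["treated"])
    u = survival_rate(*arms["untreated"])
    print(f"{group:6s} P(A | T) = {t:.3f}   P(A | U) = {u:.3f}")

t_all = survival_rate(*pooled("treated"))
u_all = survival_rate(*pooled("untreated"))
print(f"pooled P(A | T) = {t_all:.3f}   P(A | U) = {u_all:.3f}")
```

Treatment looks better within each group and worse in the pooled table because most treated patients are women, whose survival rate is lower in both arms.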
Simpson’s Paradox Berkeley Graduate Admissions case
More than independence
Independence & association can reveal only the Markov equivalence class
But our data contain more statistical information!
Algorithms that exploit this additional info can sometimes learn more (including unique graphs)
  Example: LiNGAM algorithm for non-Gaussian data
Non-Gaussian data
Assume linearity & independent non-Gaussian noise
Linear causal DAG functions are:
  $D = BD + \varepsilon$
  where $B$ is permutable to lower triangular (because the graph is acyclic)
Non-Gaussian data
Assume linearity & independent non-Gaussian noise
Linear causal DAG functions are:
  $D = A\varepsilon$, where $A = (I - B)^{-1}$
ICA is an efficient estimator for $A$
  ⇒ Efficient causal search that reveals direction!
  C → E iff the corresponding entry in $A$ is non-zero
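The step from the structural form to the mixing form is a single rearrangement. A worked two-variable case is sketched below for a graph C → E; the coefficient name $b$ and the ordering $D = (C, E)^{\top}$ are assumptions made for the example.

```latex
\begin{align*}
D = BD + \varepsilon
  \;&\Longrightarrow\; (I - B)\,D = \varepsilon
  \;\Longrightarrow\; D = (I - B)^{-1}\varepsilon = A\varepsilon \\[4pt]
B = \begin{pmatrix} 0 & 0 \\ b & 0 \end{pmatrix}
  \;&\Longrightarrow\;
A = \begin{pmatrix} 1 & 0 \\ -b & 1 \end{pmatrix}^{-1}
  = \begin{pmatrix} 1 & 0 \\ b & 1 \end{pmatrix} \\[4pt]
C = \varepsilon_C, &\qquad E = b\,\varepsilon_C + \varepsilon_E
\end{align*}
```

The non-zero off-diagonal entry of $A$ sits in the row for E and the column for C, which is what marks C as a cause of E.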
Non-Gaussian data
Why can we learn the directions in this case?
[Figure: two panels plotting A and B, one with Gaussian noise and one with uniform noise]
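One way to see the asymmetry numerically: regress each variable on the other and check whether the residual is independent of the regressor. The sketch below is a simplified pairwise illustration, not the LiNGAM algorithm and not ICA. It assumes a true model A → B with coefficient 1, and it uses the correlation between squared values as a crude dependence measure; both are choices made here.

```python
import random


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(cause, effect):
    """Least-squares residuals of `effect` regressed on `cause`."""
    mc, me = mean(cause), mean(effect)
    slope = (sum((c - mc) * (e - me) for c, e in zip(cause, effect))
             / sum((c - mc) ** 2 for c in cause))
    return [(e - me) - slope * (c - mc) for c, e in zip(cause, effect)]


def dependence(cause, effect):
    """|corr(cause^2, residual^2)|: near 0 if residual and regressor are independent."""
    res = residuals(cause, effect)
    mc = mean(cause)
    return abs(pearson([(c - mc) ** 2 for c in cause], [r * r for r in res]))


def simulate(noise, n=5000, seed=0):
    """Draw n samples from the assumed true model A -> B, B = A + noise."""
    rng = random.Random(seed)
    a = [noise(rng) for _ in range(n)]
    b = [x + noise(rng) for x in a]
    return a, b


noises = {
    "uniform": lambda rng: rng.uniform(-1, 1),
    "gaussian": lambda rng: rng.gauss(0, 1),
}

results = {}
for name, noise in noises.items():
    a, b = simulate(noise)
    results[name] = (dependence(a, b), dependence(b, a))
    print(f"{name:9s} A->B: {results[name][0]:.3f}   B->A: {results[name][1]:.3f}")
```

With uniform noise, only the true direction leaves a residual that looks independent of the regressor, so the direction can be read off. With Gaussian noise both directions look equally good, which is why independence and association alone stop at the Markov equivalence class.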
Non-Gaussian data Case study: European electricity cost
Learning about latents
Sometimes, our real interest…
  is in variables that are only indirectly observed
  or observed by their effects
  or unknown altogether but influencing things behind the scenes
[Figure labels: Test score, Reading level, Math skills, General IQ, Sociability, Size of social network, Other factors]
Factor analysis
Assume linear equations
Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error
Factor analysis
If we have one factor, then we find coefficients to minimize error in:
  $F_i = a_i + b_i U$
  where $U$ is the unobserved variable (with fixed mean and variance)
Two factors ⇒ Minimize error in:
  $F_i = a_i + b_{i,1} U_1 + b_{i,2} U_2$
Factor analysis
The decision about exactly how many factors to use is typically based on some “simplicity vs. fit” tradeoff
Also, the interpretation of the unobserved factors must be provided by the scientist
  The data do not dictate the meaning of the unobserved factors (though it can sometimes be “obvious”)
Factor analysis as graph search
One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:
  [Figure: U → F1, F2, …, Fn]
Factor analysis as graph search
Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:
  [Figure: U1 and U2 each → F1, F2, …, Fn]
Better methods for latents
Two different types of algorithms:
  1. Determine which observed variables are caused by shared latents
     BPC, FOFC, FTFC, …
  2. Determine the causal structure among the latents
     MIMBuild
Note: need additional parametric assumptions
  Usually linearity, but can do it with weaker info
Discovering latents
Key idea: For many parameterizations, the association between X & Y can be decomposed
Linearity ⇒ can use patterns in the precise associations to discover the number of latents
  Using the ranks of different sub-matrices of the covariance matrix
Discovering latents
[Figure: observed variables A, B, C, D with one latent U; then the same variables with two latents, U and L]
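A small simulation of how rank patterns in the covariances separate the one-latent case from the two-latent case. It is a sketch under assumptions made here, not the BPC, FOFC, or FTFC procedures: the model is linear with unit loadings and Gaussian noise, and in the two-latent case A and B load on U while C and D load on L, with U and L correlated at 0.5. Each "tetrad difference" is the determinant of a 2×2 sub-matrix of the covariance matrix, so it vanishes exactly when that sub-matrix has rank 1.

```python
import random


def cov(xs, ys):
    """Sample covariance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def simulate(two_latents, n=20000, rho=0.5, seed=0):
    """Four indicators A, B, C, D of either one latent or two correlated latents."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)
        if two_latents:
            # L is a second latent, correlated with U at rho (assumed value).
            l = rho * u + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        else:
            l = u  # one latent: C and D also load on U
        rows.append((u + rng.gauss(0, 1), u + rng.gauss(0, 1),
                     l + rng.gauss(0, 1), l + rng.gauss(0, 1)))
    return tuple(zip(*rows))  # columns A, B, C, D


def tetrads(a, b, c, d):
    """The three tetrad differences among A, B, C, D."""
    ab, ac, ad = cov(a, b), cov(a, c), cov(a, d)
    bc, bd, cd = cov(b, c), cov(b, d), cov(c, d)
    return (ab * cd - ac * bd, ac * bd - ad * bc, ab * cd - ad * bc)


one_latent = tetrads(*simulate(two_latents=False))
two_latent = tetrads(*simulate(two_latents=True))

print("one latent :", [round(t, 2) for t in one_latent])
print("two latents:", [round(t, 2) for t in two_latent])
```

With one latent all three differences are near zero. With two latents only the difference built from the cross-covariances between {A, B} and {C, D} stays near zero, and that pattern is what indicates that A, B and C, D do not share a single latent.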
Many instantiations of this type of search for different parametric knowledge, # of observed variables (⇒ # of discoverable latents), etc.
And once we have one of these “clean” models, we can use “traditional” search algorithms (with modifications) to learn structure between the latents
Other Algorithms
CCD: Learn DCG (directed cyclic graph, with non-obvious semantics)
ION: Learn global features from overlapping local sets (including between not co-measured variables)
SAT-solver: Learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints
LoSST: Learn causal structure while that structure potentially changes over time
And lots of other ongoing research!
Tetrad project