Causal Inference and Graphical Models



Presentation on theme: "Causal Inference and Graphical Models"— Presentation transcript:

1 Causal Inference and Graphical Models
Peter Spirtes Carnegie Mellon University With some slides from Thomas Richardson

2 Outline Introduction Three Frameworks From DAGs to Causal Inference
From Data to Sets of DAGs From Sets of DAGs to Causal Inference Latents and Selection Bias 11/25/2018

3 Introduction Examples Questions & Inputs Obstacles

4 Finding Large Effects: Stekhoven et al.
Mouse-ear cress. Response Y: days to bolting (flowering) of the plant. Covariates X: gene-expression profile. Observational data with n = 47 and p = 21,326. Other applications: yeast, fMRI, climate, SNPs.

5 College Plans Sewell and Shah (1968) studied five variables from a sample of 10,318 Wisconsin high school seniors. SEX [male = 0, female = 1] IQ = Intelligence Quotient [lowest = 0, highest = 3] CP = college plans [yes = 0, no = 1] PE = parental encouragement [low = 0, high = 1] SES = socioeconomic status [lowest = 0, highest = 3]

6 Questions What are the causes of X?
What is the average causal effect of X on Y? What is the effect of the treatment on the treated?

7 Inputs
Data: could be from multiple data sets; could be experimental or observational, or combined. Background knowledge. Assumptions.

8 Possible Background Knowledge
Parametric Family Time order Known causes Known non-causes Context Experiment

9 Obstacles Number of DAGs grows super-exponentially with the number of variables.
Unmeasured Confounders Selection Bias Measurement error Missing data Multiple Data Sets from different contexts

10 Obstacles Aggregation of Data Non-linearity Feedback Non-stationarity
Causal interactions among units Sampling error Mixtures of distributions

11 Causal Models – Three Frameworks
All aim to relate two types of situation. An observational world: a 'natural' process assigns 'treatments'. Example: each patient chooses their own treatment. An experimental world: 'treatments' assigned via an 'external' process. Example: each patient is given the same treatment.

12 Basic Inferential Tasks
Given observational data, make predictions about what would be observed in an experimental setting. Given experimental data, predict what happens in an observational context; for example, where not everyone may wish to avail themselves of treatment. Combine experimental and observational data to predict the result of some experiment that was not performed.

13 High Level View of Frameworks: Ontology
All aim to relate observational and experimental, but with different objects. Potential Outcomes: experimental and observational distributions are margins of a single joint P(X, Y, Y(x=0), Y(x=1)), with margins P(X, Y), P(Y(x0)), P(Y(x1)). All events defined on a single sample-space. Graphical Causal Models: separate experimental and observational distributions P(X, Y), P(Y(x0)), P(Y(x1)). No single sample-space; alternative notation emphasizes this: P(X, Y), P(Y | do(x = 0)), P(Y | do(x = 1)).

14 Potential Outcomes – Identification of ACE under Randomization
If the process that assigned X (in the 'observational' world) assigned X randomly, then X ⊥ Y(x0) and X ⊥ Y(x1). (1) So: P(Y(xi) = 1) = P(Y(xi) = 1 | X = i) = P(Y = 1 | X = i). Thus: ACE(X → Y) = E[Y(x1) − Y(x0)] = E[Y | X = 1] − E[Y | X = 0]. Thus if (1) holds then ACE(X → Y) is identified from P(Y | X). Inference: 'Observational' world ⇒ Difference of two Exp. Worlds
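The identification argument on this slide can be checked numerically. A minimal simulation sketch; the potential-outcome model and its coefficients are hypothetical, chosen so the true ACE is 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Potential outcomes Y(x0), Y(x1) for every unit (hypothetical model).
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 2.0                      # true ACE = 2

# Randomized treatment: X independent of (Y(x0), Y(x1)).
x = rng.integers(0, 2, n)
y = np.where(x == 1, y1, y0)       # consistency: Y = Y(x) for the assigned x

# Under randomization, E[Y | X=1] - E[Y | X=0] identifies the ACE.
ace_hat = y[x == 1].mean() - y[x == 0].mean()
print(round(ace_hat, 2))           # close to 2
```

If assignment were not randomized (e.g. x depended on y0), the same difference of means would be biased.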

15 Potential Outcome Framework: Main points
Postulates a joint distribution over outcomes in experimental and observational settings; (typically) experimental outcomes are 'primary', of which observational outcomes are deterministic functions. + Rich language allowing many quantities of interest to be formulated, e.g. ETT, Natural Direct Effect; + 'Reduces' Causation to Missing Data; all outcomes 'observable' a priori; + Allows precise characterization of identification assumptions as conditional independence; − Does not provide much qualitative guidance as to when assumptions will hold; − Reasoning abstractly about multivariate conditional independence can be hard; − Joint distribution over potential outcomes is not identified even from randomized experiments.

16 Non-Parametric Structural Equation Models (NPSEM)
Originates in Econometrics: Haavelmo (1943), Strotz & Wold (1960). System of equations describing the observational world: one equation for each variable V, expressing V as a function fV(·, ·) of its direct causes and a 'disturbance' term eV. Simple scenario with covariate L, treatment X and outcome Y: L = fL(eL), X = fX(L, eX), Y = fY(L, X, eY). In general: the distribution over errors induces a distribution over the observed variables. A Non-Parametric Structural Equation Model with Independent Errors (NPSEM-IE) assumes that the error terms are mutually independent. Here eL ⊥ eX ⊥ eY.

17 Experimental World derived from Observational (I)
An experimental world is then derived by removing the equation for the variable that is being fixed. Example 1, fixing X to 0. Obs.: L = fL(eL), X = fX(L, eX), Y = fY(L, X, eY). Exp.: L = fL(eL), X = 0, Y = fY(L, X, eY). Example 2, fixing L to 0. Exp.: L = 0, X = fX(L, eX), Y = fY(L, X, eY). Note: this breaks the first rule of algebra!
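Equation replacement can be sketched directly in code. The structural equations fL, fX, fY below are hypothetical stand-ins for the slide's abstract system; fixing X means substituting a constant for its equation while leaving the others intact:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Hypothetical structural equations with independent errors (NPSEM-IE).
def f_L(eL):        return eL
def f_X(L, eX):     return (L + eX > 0).astype(float)    # binary treatment
def f_Y(L, X, eY):  return L + 2.0 * X + eY

def sample(fix_X=None):
    eL, eX, eY = rng.normal(size=(3, n))
    L = f_L(eL)
    # Intervening on X = removing its equation and substituting the fixed value.
    X = f_X(L, eX) if fix_X is None else np.full(n, float(fix_X))
    Y = f_Y(L, X, eY)
    return L, X, Y

_, _, y_do0 = sample(fix_X=0)   # experimental world: do(X = 0)
_, _, y_do1 = sample(fix_X=1)   # experimental world: do(X = 1)
print(round(y_do1.mean() - y_do0.mean(), 2))   # ≈ 2.0, the effect of X on Y
```

Note that naive conditioning in the observational world (comparing E[Y | X=1] to E[Y | X=0] from `sample()`) would be confounded by L; the equation-replacement worlds are not.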

18 Summary: Structural Equation Approach
Specifies a data-generating process – with autonomous 'mechanisms' – for the observational distribution; individual outcomes under intervention derived by removing equations. + Intuitive specification of a generating process, encodes qualitative understanding; + Guidance as to when assumptions will hold; − Observational setting is primary: problematic since there are many examples where measurement of X is well-defined, but intervention or assignment of X is not; − Typically assumed that interventions on all variables are well-defined; − Error terms are not observable (even a priori); − Assumption of independent errors is strong; − Implicitly specifies a joint distribution over actual and potential outcomes, but without notation to express the distinction.

19 From DAGs to Causal Inference
Probabilistic Interpretation Causal Interpretation Causal Markov Assumption Manipulations SWIGs

20 Probabilistic Interpretation of DAGs
A DAG represents a distribution P when each variable is independent of its non-descendants conditional on its parents in the DAG. I(CP,SEX|PE,SES,IQ) I(SEX,SES) I(SEX,IQ)

21 Probabilistic Interpretation of DAGs
A DAG represents a distribution P when the joint distribution factors as the product of the probability of each variable conditional on its parents. P(CP,SEX,PE,SES,IQ) = P(SEX)P(SES)P(IQ|SES)P(PE|SEX)P(CP|PE,SES,IQ)

22 Probabilistic Interpretation of DAGs
D-separation is a graphical relationship among three disjoint sets of variables such that if X and Y are d-separated conditional on Z in DAG G, then X and Y are entailed to be independent conditional on Z in every distribution represented by G. SEX and IQ are d-separated conditional on SES.
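D-separation can be checked mechanically with the standard reachability ("Bayes-ball") algorithm. A self-contained sketch; the edge set below is one common rendering of the college-plans DAG, assumed here for illustration:

```python
from collections import defaultdict, deque

def d_separated(edges, X, Y, Z):
    """True iff X and Y are d-separated given Z in the DAG (reachability algorithm)."""
    pa, ch = defaultdict(set), defaultdict(set)
    for a, b in edges:
        ch[a].add(b); pa[b].add(a)
    anc, stack = set(), list(Z)            # ancestors of Z (including Z)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v); stack.extend(pa[v])
    seen = set()
    frontier = deque(('up', x) for x in X)  # 'up': arrived from a child; 'down': from a parent
    while frontier:
        state = frontier.popleft()
        if state in seen:
            continue
        seen.add(state)
        d, v = state
        if v in Y:
            return False                    # found an active trail reaching Y
        if d == 'up' and v not in Z:
            frontier.extend(('up', p) for p in pa[v])
            frontier.extend(('down', c) for c in ch[v])
        elif d == 'down':
            if v not in Z:
                frontier.extend(('down', c) for c in ch[v])
            if v in anc:                    # collider with (a descendant) in Z: trail opens
                frontier.extend(('up', p) for p in pa[v])
    return True

# One common version of the college-plans DAG (assumed for illustration).
edges = [('SEX', 'PE'), ('SES', 'PE'), ('SES', 'IQ'),
         ('PE', 'CP'), ('SES', 'CP'), ('IQ', 'CP')]
print(d_separated(edges, {'SEX'}, {'IQ'}, {'SES'}))  # True
print(d_separated(edges, {'SEX'}, {'CP'}, set()))    # False
```

On this edge set the checker also confirms the slide 20 claim I(CP,SEX|PE,SES,IQ).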

23 Causal Interpretation of DAGs
There is a directed edge from A to B (relative to the variables in the DAG) when A is a direct cause of B. An acyclic graph is not a representation of reversible or feedback processes.

24 Graphical Representation of Manipulation
Pre-manipulation: P(CP,SEX,PE,SES,IQ) = P(SEX)P(SES)P(IQ|SES)P(PE|SEX)P(CP|PE,SES,IQ). Post-manipulation: P(CP,SEX,PE,SES,IQ) = P(SEX)P(SES)P(IQ|SES)P(CP|PE,SES,IQ).

25 From DAG to Effects of Manipulation
[Diagram: Background Knowledge, Causal Axioms and a Prior lead to Causal DAGs, from which Effects of Manipulation follow; the Population Distribution, via Sampling and Distributional Assumptions and a Prior, yields the Sample.]

26 Manipulation Theorem - No Hidden Variables
P(SES,SEX,IQ,CP | do(pe)) = P(SES)P(SEX)P(IQ|SES)P(CP|pe,SES,IQ)
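The truncated factorization can be computed directly for a toy discrete version of the model. All probability tables below are made up purely for illustration:

```python
from itertools import product

# Hypothetical binary conditional probability tables.
P_SES = {0: 0.6, 1: 0.4}
P_SEX = {0: 0.5, 1: 0.5}
P_IQ_SES = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.6}   # P(IQ | SES)
P_PE_SEX = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.8, (1, 1): 0.2}   # P(PE | SEX)
def P_CP(cp, pe, ses, iq):                                        # P(CP | PE, SES, IQ)
    p1 = 0.1 + 0.3 * pe + 0.2 * ses + 0.3 * iq
    return p1 if cp == 1 else 1.0 - p1

def P_CP_do_pe(cp, pe):
    """P(CP = cp | do(PE = pe)): drop the P(PE | SEX) factor and sum out the rest."""
    total = 0.0
    for ses, sex, iq in product((0, 1), repeat=3):
        total += P_SES[ses] * P_SEX[sex] * P_IQ_SES[(iq, ses)] * P_CP(cp, pe, ses, iq)
    return total

print(round(P_CP_do_pe(1, pe=1), 4))   # 0.606 for these made-up tables
```

The manipulated factor P(PE | SEX) never appears in the sum; that is exactly the truncation the theorem describes.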

27 Invariance We say that P(CP|PE,SES,IQ) is invariant under manipulation because it is the same in the pre- and post-intervention populations. An invariant quantity can be estimated from the pre-manipulation distribution.

28 Calculating Effects
[Diagram: the college-plans DAG over SEX, SES, PE, IQ, CP.]

29 Notation In a causal Bayesian network G, let X, Y, Z be arbitrary disjoint sets of nodes. G_X̄: the graph obtained by deleting from G all directed links pointing to nodes in X. G_X̲: the graph obtained by deleting from G all directed links emerging from nodes in X. G_X̄Z̲: the graph obtained by deleting both the links incoming to X and the links outgoing from Z.

30 Do-calculus For any disjoint subsets of variables X, Y, Z, and W:
Rule 1 (Insertion/Deletion of Observations): P(y | do(x), z, w) = P(y | do(x), w) if Y ⊥ Z | X, W in G_X̄. Rule 2 (Action/Observation Exchange): P(y | do(x), do(z), w) = P(y | do(x), z, w) if Y ⊥ Z | X, W in G_X̄Z̲. Rule 3 (Insertion/Deletion of Actions): P(y | do(x), do(z), w) = P(y | do(x), w) if Y ⊥ Z | X, W in G_X̄Z̄(W), where Z(W) is the set of Z-nodes that are not ancestors of any W-nodes in G_X̄.

31 Do-Calculus Do-calculus is complete even if the DAG has unmeasured variables: if a quantity is identifiable from the causal DAG and the marginal probability over the observed variables, it can be calculated by the do-calculus. There are algorithms for doing the computation when it exists. This has also been extended to counterfactuals (e.g. the effect of the treatment on the treated).

32 Summary: Causal Graphical Approach
Specifies a data-generating process – with autonomous 'mechanisms' – for the observational distribution; intervention distributions derived by removing edges and factors. + Intuitive specification of a generating process, encodes qualitative understanding; + Guidance as to when assumptions will hold; +? No joint over experimental and observational settings; doesn't require counterfactuals to 'exist'; − Observational setting is primary: problematic since there are many examples where measurement of X is well-defined, but intervention or assignment of X is not; − Typically assumed that interventions on all variables are well-defined; − Without consistency, why should we care if p(y|x) = p(y|do(x))?; − Without counterfactuals: no way to describe the Effect of Treatment on the Treated, etc.

33 Notational Difference: NPSEMs and Potential Outcomes
Counterfactual Approach: key distinction between Y, the outcome in the observational world, and Y(x), the outcome in the experimental world. Structural Equations and Graphical Approach: the same variable Y is used for both; the context of the graph or equation system is used to make the distinction. ⇒ Relying on context can lead to confusion. ⇒ Lose the ability to state when two variables are really the same! (e.g. the covariate L when we intervene only on X). Thomas Richardson, Causal Working Group, April 2018

34 [Figure-only slide.]

35 But...
Here we can read directly from the SWIG that X ⊥ Y(x). However, X is not independent of Y(x) conditional on L. Ignorability without conditional ignorability! (pace Rubin) (This example shows graphs are essential.)

36 Inferring Causality from Probability
The data is statistical; the desired conclusions are causal. We need principles to bridge the gap from Statistical Data to Causal Structure.

37 Assumptions for Causal Inference
We will use two assumptions to bridge from Statistical Data to Causal Structure: Markov and Faithfulness. They are widely assumed in statistics, but rarely mentioned. By making them explicit, we can see when they are true and when false, and prove theorems about search for causal structures (which has often been left informal in statistics). Limitations are discussed in the book.

38 Causal Sufficiency A set of variables is causally sufficient if every common cause of two variables in the set is also in the set. {PE,CP,SES} is causally sufficient; {IQ,CP,SES} is not causally sufficient.

39 Causal Markov Assumption
For a causally sufficient set of variables, the joint distribution is the product of each variable conditional on its parents in the causal DAG. P(SES,SEX,PE,CP,IQ) = P(SES)P(SEX)P(IQ|SES)P(PE|SES,SEX,IQ)P(CP|PE)

40 Equivalent Forms of Causal Markov Assumption
In the population distribution, each variable is independent of its non-descendants in the causal DAG (non-effects) conditional on its parents (immediate causes). If X is d-separated from Y conditional on Z (written as <X,Y|Z>) in the causal graph, then X is independent of Y conditional on Z in the population distribution (denoted I(X,Y|Z)).

41 Causal Markov Assumption
Causal Markov implies that if X is d-separated from Y conditional on Z in the causal DAG, then X is independent of Y conditional on Z. Causal Markov is equivalent to assuming that the causal DAG represents the population distribution. What would a failure of Causal Markov look like? If X and Y are dependent, but X does not cause Y, Y does not cause X, and no variable Z causes both X and Y.

42 Causal Markov Assumption
Assumes that no unit in the population affects other units in the population. If the "natural" units do affect each other, the units should be re-defined to be aggregations of units that don't affect each other. For example, individual people might be aggregated into families. Assumes variables are not logically related, e.g. x and x². Assumes no feedback.

43 From Data to Sets of DAGs
Simplicity Assumptions Constraint-based Algorithms PC Algorithm Score-based Algorithms Greedy Equivalence Search *Alternative Simplicity Assumptions

44 From Data to Sets of DAGs
[Diagram: Background Knowledge, Causal Axioms and a Prior lead to Causal DAGs, from which Effects of Manipulation follow; the Population Distribution, via Sampling and Distributional Assumptions and a Prior, yields the Sample.]

45 From Sample to Population to DAGs
Constraint-Based: uses tests of conditional independence. Goal: find the set of DAGs whose d-separation relations match most closely the results of conditional independence tests. Score-Based: uses scores such as the Bayesian Information Criterion or the Bayesian posterior. Goal: maximize the score.

46 Two Kinds of Search
Uses non-conditional-independence information: Constraint No, Score Yes. Quantitative comparison of models: Constraint No, Score Yes. A single test result can lead astray: Constraint Yes, Score No. Easy to apply to latent variable models: Constraint Yes, Score No.

47 Bayesian Information Criterion
BIC(G, D) = log P(D | θ̂_G, G) − (d/2) log N, where D is the sample data, G is a DAG, θ̂_G is the vector of maximum likelihood estimates of the parameters for DAG G, N is the sample size, and d is the dimensionality of the model, which in DAGs without latent variables is simply the number of free parameters in the model. Useful for comparing pairs of models, but not for deciding whether to accept a model.
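As a concrete (and entirely hypothetical) two-variable illustration: in the linear-Gaussian case the Markov-equivalent DAGs X → Y and Y → X receive the same BIC, while the no-edge model scores worse when X and Y are strongly dependent:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
x = rng.normal(size=N)
y = 1.5 * x + rng.normal(size=N)          # data generated from X -> Y

def ll_gauss(resid):
    """Maximized Gaussian log-likelihood of residuals (variance at its MLE)."""
    s2 = resid.var()
    return -0.5 * len(resid) * (np.log(2 * np.pi * s2) + 1.0)

def bic(loglik, d):
    return loglik - 0.5 * d * np.log(N)

b_yx = np.cov(x, y, bias=True)[0, 1] / x.var()   # regress Y on X
b_xy = np.cov(x, y, bias=True)[0, 1] / y.var()   # regress X on Y

bic_x_to_y = bic(ll_gauss(x - x.mean()) + ll_gauss(y - y.mean() - b_yx * (x - x.mean())), d=5)
bic_y_to_x = bic(ll_gauss(y - y.mean()) + ll_gauss(x - x.mean() - b_xy * (y - y.mean())), d=5)
bic_no_edge = bic(ll_gauss(x - x.mean()) + ll_gauss(y - y.mean()), d=4)

# Markov-equivalent DAGs X -> Y and Y -> X get the same score; no-edge does worse here.
print(abs(bic_x_to_y - bic_y_to_x) < 1e-6)   # True
print(bic_x_to_y > bic_no_edge)              # True
```

The tie between the two orientations previews the point made on the later Markov-equivalence slides.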

48 3 Kinds of Alternative Causal Models
[Diagram: the true model and Alternatives 1-3, DAGs over SES, SEX, PE, CP, IQ.]

49 Alternative Causal Models
Constraint-Based: Alternative 1 violates the Causal Markov Assumption by entailing that SES and IQ are independent. Score-Based: use a score that prefers a model that contains the true distribution over one that does not.

50 Simplicity - Causal Faithfulness Assumption
CP is dependent on PE because the Causal Markov Assumption does not entail that they are independent (they are d-connected given the empty set).

51 Causal Faithfulness Assumption
For a causally sufficient set of variables, if the causal graph does not entail a conditional independence (the variables are d-connected), then the conditional independence does not hold in the population.

52 Alternative Causal Models
Constraint-Based: assume that if SEX and CP are dependent conditional on every subset of the remaining variables (PE, SES, and IQ), then SEX and CP are adjacent - the Causal Adjacency Faithfulness Assumption. Score-Based: use a score such that if two models contain the true distribution, the one with fewer parameters is chosen. The True Model has fewer parameters.

53 When Not to Assume Faithfulness
Deterministic relationships between variables entail "extra" conditional independence relations, in addition to those entailed by the global directed Markov condition. If A → B → C, and B = A, and C = B, then not only I(A,C|B), which is entailed by the global directed Markov condition, but also I(B,C|A), which is not. The deterministic relations are theoretically detectable, and when present, faithfulness should not be assumed. Do not assume it in feedback systems in equilibrium.

54 Alternative Causal Models
Constraint-Based: Alternative 2 entails the same set of conditional independence relations - there is no principled way to choose. These two DAGs are Markov (or d-separation) equivalent. For Gaussian and multinomial distributions, there is no principled way to choose between them.

55 Markov Equivalence An unshielded collider is a triple <X,Y,Z> where the DAG contains X → Y ← Z, and X and Z are not adjacent. Two DAGs G1 and G2 are Markov equivalent iff they contain the same vertices, adjacencies, and unshielded colliders.
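The equivalence test on this slide is purely graphical and easy to sketch: compare skeletons and unshielded colliders (toy graphs below are illustrative):

```python
def skeleton_and_colliders(edges):
    """Adjacency set and unshielded colliders of a DAG given as (parent, child) pairs."""
    adj = {frozenset(e) for e in edges}
    pa = {}
    for a, b in edges:
        pa.setdefault(b, set()).add(a)
    colliders = set()
    for y, parents in pa.items():
        for x in parents:
            for z in parents:
                # X -> Y <- Z with X and Z not adjacent
                if x < z and frozenset((x, z)) not in adj:
                    colliders.add((x, y, z))
    return adj, colliders

def markov_equivalent(g1, g2):
    return skeleton_and_colliders(g1) == skeleton_and_colliders(g2)

chain     = [('X', 'Y'), ('Y', 'Z')]
reversed_ = [('Z', 'Y'), ('Y', 'X')]
collider  = [('X', 'Y'), ('Z', 'Y')]

print(markov_equivalent(chain, reversed_))  # True: same skeleton, no unshielded colliders
print(markov_equivalent(chain, collider))   # False: X -> Y <- Z is an unshielded collider
```

The chain and the collider have identical skeletons, so the collider set alone separates them.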

56 Markov Equivalence If two DAGs are Markov equivalent, without further background knowledge there is no principled way to decide between them using just conditional independence relations. In the case of Gaussian and multinomial distributions, the set of distributions that can be represented by equivalent DAGs is exactly the same. In the case of other distribution families, it may be that two Markov equivalent DAGs do not represent the same set of distributions in the family.

57 Alternative Causal Models
Score-Based: whether or not one can choose depends upon the parametric family. For multinomial or Gaussian families, there is no way to choose - the BIC scores will be the same.

58 Patterns A pattern (or p-dag) represents a set of DAGs that all have the same d-separation relations, i.e. a Markov equivalence class of DAGs. The adjacencies in a pattern are the same as the adjacencies in each DAG in the Markov equivalence class. An edge is oriented as A → B in the pattern if it is oriented as A → B in every DAG in the equivalence class. An edge is left unoriented as A — B in the pattern if the edge is oriented as A → B in some DAGs in the equivalence class, and as A ← B in other DAGs in the equivalence class.

59 Patterns to Graphs All of the DAGs in a d-separation equivalence class can be derived from the pattern that represents the d-separation equivalence class by orienting the unoriented edges in the pattern. Every orientation of the unoriented edges is acceptable as long as it creates no new unshielded colliders. That is, A — B — C can be oriented as A → B → C, A ← B ← C, or A ← B → C, but not as A → B ← C.

60 Patterns: Markov Equivalence Class and Pattern
1. A pattern represents a set of conditional-independence (and distribution) equivalent graphs. 2. Same adjacencies. 3. Undirected edges mean some DAGs contain the edge one way, some the other way. 4. A directed edge means they all go the same way. 5. Pearl and Verma; complete rules for generating patterns from Meek, Andersson, Perlman, and Madigan, and Chickering. 6. An instance of a chain graph. 7. Since the data can't distinguish them, in the absence of background knowledge the pattern is the right output for search. 8. What are they good for?

61 PC Algorithm - Adjacencies
I(SEX,SES) I(SEX,CP|PE,SES,IQ) I(SEX,IQ) In checking the SES-CP edge, we only have to consider subsets of size 2.

62 PC Algorithm - Adjacencies
For i := 1 to n−2: for each pair of variables X and Y, if there is an edge between X and Y, remove the edge if there exists a set of variables S of size i, included in the set of variables adjacent to X or in the set of variables adjacent to Y, such that X and Y are d-separated by S; record S in Sepset(X,Y).
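The edge-removal loop can be sketched with a d-separation oracle standing in for the conditional-independence tests. The three-variable graph and the oracle below are toy assumptions:

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """PC adjacency phase. `indep(x, y, S)` is a CI oracle (here: exact d-separations)."""
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    i = 0
    while any(len(adj[v]) - 1 >= i for v in nodes):
        for x, y in combinations(nodes, 2):
            if y not in adj[x]:
                continue
            # Try conditioning sets of size i drawn from the neighbors of x or of y.
            removed = False
            for side in (adj[x] - {y}, adj[y] - {x}):
                for S in combinations(sorted(side), i):
                    if indep(x, y, set(S)):
                        adj[x].discard(y); adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(S)
                        removed = True
                        break
                if removed:
                    break
        i += 1
    return adj, sepset

# Toy chain X -> Y -> Z: the only conditional independence is X _||_ Z | Y.
def oracle(x, y, S):
    return frozenset((x, y)) == frozenset(('X', 'Z')) and 'Y' in S

adj, sepset = pc_skeleton(['X', 'Y', 'Z'], oracle)
print(sorted(adj['Y']))                  # ['X', 'Z']
print(sepset[frozenset(('X', 'Z'))])     # {'Y'}
```

Because conditioning sets are drawn only from current neighbor sets, the search never needs sets larger than the maximum degree, which is the statistical-efficiency point made on the "Three Faces of Faithfulness" slide.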

63 PC Algorithm - Orientations
[Diagram: orientation steps on the college-plans pattern - unshielded collider, away from collider, away from cycle.]

64 PC - Orientation Repeat:
If X → Y → Z and X — Z, orient X — Z as X → Z. If X → Y — Z, and X and Z are not adjacent, then orient Y — Z as Y → Z. If X — Z, X — W, X — Y, Z → Y, and W → Y, with Z and W not adjacent, orient X — Y as X → Y. Until no more edges can be oriented.

65 When the Causal Faithfulness Assumption Is False
Some cases of determinism: A → B → C with A = B = C. Path cancellations: A → B → C together with A → C, where e(A → B) * e(B → C) + e(A → C) = 0. Averaging: A → B ← C.
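Path cancellation is easy to exhibit in a linear simulation; the coefficients below are hypothetical, chosen so the indirect and direct paths cancel exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# A -> B -> C and A -> C with coefficients chosen so the paths cancel.
a = rng.normal(size=n)
b = 1.0 * a + 0.1 * rng.normal(size=n)            # e(A->B) = 1
c = 1.0 * b - 1.0 * a + 0.1 * rng.normal(size=n)  # e(B->C) = 1, e(A->C) = -1; 1*1 + (-1) = 0

corr_ac = np.corrcoef(a, c)[0, 1]
print(abs(corr_ac) < 0.02)   # True: A and C look independent despite two causal paths
```

A constraint-based search applied to such data would wrongly drop the A-C adjacency, which is exactly the unfaithfulness failure the slide warns about.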

66 Surfaces of Unfaithfulness in Parameter Space
Surfaces of unfaithfulness in parameter space are Lebesgue measure 0.

67 Good News Given the Causal Markov and Causal Faithfulness Assumptions, and no feedback, there are algorithms that in the large sample limit with probability 1 will: output the Markov Equivalence class of the true causal DAG; output correct point estimates of some causal strengths; give non-trivial bounds on causal strengths - even if there are unmeasured confounders and selection bias (FCI).

68 Three Faces of Faithfulness
Identifiability: if A is independent of B, then prefer A → C ← B to A → C → B. Computational Efficiency: if A — C — B and A is independent of B, then we don't need to check whether A is independent of B conditional on C. Statistical Efficiency: the Markov equivalence class can be found without testing independence conditional on a set larger than the maximum degree of any variable in the true causal graph.

69 Weaker Simplicity Assumption – Triangle Faithfulness
We can weaken Faithfulness to a conjunction of two much weaker conditions, Minimality and Triangle-Faithfulness, and underdetermination is reduced to the same extent (when Faithfulness happens to be true). Causal Minimality: the population distribution is not Markov to any proper subDAG of the true causal DAG. Triangle Faithfulness: basically, if X, Y, and Z are all adjacent, then there are no unfaithful independencies between X and Y conditional on a set that contains Z.

70 Surfaces of Unfaithfulness in Parameter Space
Surfaces of unfaithfulness in parameter space are Lebesgue measure 0.

71 Bad News Given the Causal Markov and Causal Faithfulness Assumptions, and no feedback, no unmeasured confounders, and no selection bias, there is no algorithm that can non-trivially bound the probability of an error in the output of a Markov Equivalence class at any sample size, because of almost-unfaithfulness.

72 Alternative Simplicity Assumptions: Parametric
Suppose the underlying model is Y = f(X, EY), where EY is an unobserved error term such that EY is independent of X. Then, for a wide range of functional forms and distributions, there is no model in the reverse direction (i.e., X = g(Y, EX) such that EX is independent of Y) that is compatible with the joint distribution of X and Y. E.g., f is linear and X, EY are non-Gaussian; or f(X, EY) = g(X) + EY, where g is nonlinear and X, EY are Gaussian.
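The asymmetry can be seen in a simulation. Everything below is a made-up example in the linear non-Gaussian (uniform-noise) case, using a crude dependence check (correlation of squares) in place of a proper independence test:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Linear model with non-Gaussian (uniform) noise: X -> Y.
x = rng.uniform(-1, 1, n)
e = rng.uniform(-1, 1, n)
y = x + e

def residual(target, regressor):
    b = np.cov(regressor, target, bias=True)[0, 1] / regressor.var()
    return target - b * regressor

def dep(u, v):
    """Crude dependence measure beyond correlation: correlation of squares."""
    return abs(np.corrcoef(u ** 2, v ** 2)[0, 1])

fwd = dep(residual(y, x), x)   # causal direction: residual is independent of X
bwd = dep(residual(x, y), y)   # reverse direction: residual stays dependent on Y
print(fwd < 0.05 < bwd)        # True: only the causal direction yields independent residuals
```

With Gaussian X and EY both directions would pass this check, which is why non-Gaussianity (or nonlinearity) is what buys identifiability here.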

73 Illustration: Linear Model (Shimizu et al. 2006)
[Figure: illustration of the linear non-Gaussian model and its error term EY.]

74 Alternative Simplicity Assumptions: Sparsest Permutation
Look for the directed acyclic graph that has the fewest edges among those that satisfy the Markov condition. This is strictly weaker than the Causal Faithfulness Assumption. A strong version of this assumes that if a conditional independence is not entailed, then the corresponding measure of dependence is greater than λ, which still allows for uniform consistency.

75 Stronger Simplicity Assumption - λ-strong Faithfulness
A distribution is λ-almost unfaithful to a DAG if the DAG does not entail I(X,Y|Z), but D(X,Y|Z) < λ, where D(X,Y|Z) is a measure of the dependence between X and Y conditional on Z. In the Gaussian case, D(X,Y|Z) can be given by the correlation of X and Y conditional on Z. The λ-strong Faithfulness Assumption: the joint distribution is not λ-almost unfaithful to the causal DAG (i.e. there are no partial correlations strictly between 0 and λ). What to do about almost-unfaithfulness? Find better algorithms; strengthen assumptions; strengthen data; weaken the sense of success. All four have been tried. We use the Gaussian case in what follows.

76 λ-strong Faithfulness Assumption
Under the λ-strong Faithfulness Assumption, the probability of the output Markov equivalence class being correct can be bounded within a distance from 1, even if the number of variables and the density of the graph are allowed to grow (at a maximum rate) with the sample size (Kalisch and Bühlmann).

77 λ-strong Faithfulness Assumption
Uhler et al.: the assumption tends to be violated fairly often if the parameters are assigned randomly and λ is not very small. The problem gets worse as the number of variables increases. It entails that there are no very weak edges, because a very weak edge entails a partial correlation that is quite small. As sample size grows and weak edges become detectable at the larger sample sizes, these weak edges hide the orientation of edges that were correctly oriented at smaller sample sizes - we lose correct information as sample size grows.

78 Alternative Simplicity Assumption - k-Triangle Faithfulness
If X, Y, and Z are all adjacent, the k-Triangle Faithfulness Assumption makes the minimal partial correlation between X and Y conditional on a set containing Z a function of the strength of the edge between X and Y. Unlike the λ-Faithfulness Assumption, this allows for the possibility of very weak edges. Unlike the λ-Faithfulness Assumption, it does not lose information about orientations as the sample size grows.

79 k-Triangle Faithfulness
k-Triangle Faithfulness allows: if you leave out a weak edge, you are "close" to the true Markov equivalence class. The probability of the output being far from the true Markov equivalence class can be bounded at each sample size. The probability of an estimated direct causal effect being far from the true direct causal effect can be bounded at each sample size. The computational complexity is much higher than for the Causal Faithfulness Assumption. Is it still too strong - is it likely to be violated?

80 From Sets of DAGs to Causal Inference
Find all intervention effects from a DAG compatible with the pattern: Intervention Effects When the DAG is Absent (IDA - Maathuis et al.), Causal Stability Ranking (Stekhoven et al.). Find all intervention effects where all DAGs in the pattern agree: Generalized Adjustment Criterion (Perkovic et al.).

81 Intervention Effects When the DAG is Absent (IDA) - Maathuis et al.
Goal: identify a set of n edges whose total effects are in the top m%. [Diagram: the DAGs in the equivalence class over fnr, arcA, sodA; for arcA → sodA the possible total effects are {0,0,0,q1,q2,q3}, so the score is 0.]

82 Intervention Effects When the DAG is Absent (IDA) - Maathuis et al.
Run PC. For each pair of variables, calculate the total effect q of manipulating the first variable on the second in each graph. Assign the minimum q as the pair's score. Order the pairs by score. [Diagram: the DAGs in the equivalence class over fnr, arcA, sodA with their effects q1, q2, q3.]

83 Stability Selection Step
Sample 50% of the original data set (with replacement) one hundred times. For each subsample, run the IDA algorithm. Take the output of the IDA algorithm for that subsample and order the variables by the size of the estimated (by IDA) lower bound of the total effect of the variable on the target.

84 Stability Selection Step
Record the frequency with which a given variable appears in the top q of the total effect sizes, for a user-selected value q. Select the variables that appear with the highest frequency (that is, those judged most often to have large lower bounds on their total effects on the target).
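The two steps above can be sketched generically. For brevity, the scoring function below ranks covariates by absolute correlation with the target as a stand-in for IDA's lower-bounded total effects, and subsamples half the data without replacement; data, effect size, and thresholds are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 3] + rng.normal(size=n)    # only covariate 3 has a real effect

def top_q(Xs, ys, q=3):
    """Stand-in for IDA's ranking: top-q covariates by |correlation| with the target."""
    scores = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(p)])
    return set(np.argsort(scores)[-q:])

counts = np.zeros(p)
for _ in range(100):
    idx = rng.choice(n, size=n // 2, replace=False)   # subsample half the data
    for j in top_q(X[idx], y[idx]):
        counts[j] += 1
freq = counts / 100

pi_thr = 0.75
selected = set(np.flatnonzero(freq >= pi_thr))
print(3 in selected)   # True: the covariate with a real effect is stably selected
```

Spurious covariates make the top q only occasionally across subsamples, so their frequencies stay well below the threshold.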

85 Stability Selection
Order the selection frequencies Π1 ≥ Π2 ≥ ... ≥ Πp. Define the stably selected genes (covariates) as those with Πj ≥ πthr, for some threshold 0.5 < πthr ≤ 1. Denote the wrongly selected genes (false positives) as the selected members of Sfalse, where Sfalse is the set of covariates whose true lower bound βj is 0.

86 Bootstrapped IDA (CStar) application
Hughes lab data on S. cerevisiae. Goal: prediction of the true large intervention effects (in the top 10%) based on observational data with no knock-downs. The truth is "known in good approximation" (thanks to intervention experiments): 234 gene knock-downs; p = 5361 genes (expression of genes); 63 observational samples.

87 CStar Results [Figure: results of the CStar analysis.]

88 Generalized Adjustment Criterion
Z is an adjustment set for the effect of X on Y if P(Y | do(x)) = Σz P(Y | x, z) P(z).

89 Generalized Adjustment Criterion
A path from X to Y is possibly directed if there is no edge that points towards X. Z satisfies the generalized adjustment criterion for the effect of X on Y if: every possibly directed path from X to Y begins with a directed edge out of X; if W is on a possibly directed path from X to Y, then there is no possibly directed path from W to a member of Z; every path between X and Y with an edge pointing toward X, in which it is possible to tell whether every triple is a collider or a non-collider, is blocked by Z.

90 Finding Large Effects: Stekhoven et al.
Mouse-ear cress. Response Y: days to bolting (flowering) of the plant. Covariates X: gene-expression profile. Observational data with n = 47 and p = 21,326. Other applications: yeast, fMRI, climate, SNPs.

91 Experimental Confirmation: Stekhoven et al.
PC + IDA + stability selection (Gaussian, no feedback, no unmeasured confounders, no selection bias). Performed experiments on 14 of the top 20 (those not previously known, with easily available mutants). Also applied to yeast data, with good results verifying the results of existing experiments.

92 Results: Arabidopsis thaliana
9 among the 14 mutants survived. 4 among the 9 mutants (genes) showed a significant effect for Y relative to the wildtype (non-mutated plant). Also applied to yeast data, where it outperformed other algorithms.

93 Testing the Output For various parametric families there are standard tests of the fit of the model, taking into account the likelihood of the data and the complexity of the model. Linear models: chi-squared test. If the output fails the test, that is good reason to reject the model. If the output does not fail the test, the worry is that some other model also may not fail the test.

94 Testing the Output If the output does not fail the test, the worry is that some other model also may not fail the test. Example: Markov equivalent models in the Gaussian or multinomial case. Example: other models may perform almost as well.

95 Latents, Selection Bias
Mixed Ancestral Graphs Inducing Paths Partial Ancestral Graphs FCI Algorithm Latent Variable Interventions Without Directed Graph Generalized Adjustment Criterion

96 From Data to Sets of DAGs - Possible Hidden Variables
[Diagram: Background Knowledge, Causal Axioms and a Prior lead to Causal DAGs, from which Effects of Manipulation follow; the Population Distribution, via Sampling and Distributional Assumptions and a Prior, yields the Sample.]

97 Why Latent Variable Models?
For classification problems, introducing latent variables can help get closer to the right answer at smaller sample sizes, but they are not needed to get the right answer in the limit. For causal inference problems, introducing latent variables is needed to get the right answer in the limit.

98 Score-Based Search Over Latent Models
Structural EM interleaves estimation of parameters with structural search. One can also search over latent variable models by calculating posteriors. But there are substantial computational and statistical problems with latent variable models.

99 DAG Models without Latent Variables
Facilitates construction of causal models Provides a finite search space 'Nice' statistical properties: Always identified Correspond to a set of distributions characterized by independence relations Have a well-defined dimension Asymptotic existence of ML estimates

100 Solution Embed each latent variable model in a 'larger' model without latent variables that is easier to characterize. Disadvantage - uses only conditional independence information in the distribution. [Diagram: within the sets of distributions, the latent variable model is nested inside the model imposing only independence constraints on observed variables.]

101 Alternative Hypothesis and Some D-separations
SES SEX PE CP L1 L2 IQ <L2,{SES,L1,SEX,PE}|∅> <SEX,{L1,SES,L2,IQ}|∅> <L1,{SES,L2,SEX}|∅> <SEX,CP|{PE,SES}> These d-separations entail conditional independence relations in the population. <CP,{IQ,L1,SEX}|{L2,PE,SES}> <PE,{IQ,L2}|{L1,SEX,SES}> <IQ,{SEX,PE,CP}|{L1,L2,SES}> <SES,{SEX,IQ,L1,L2}|∅>

102 D-separations Among Observed
SES SEX PE CP L1 L2 IQ <L2,{SES,L1,SEX,PE}|∅> <SEX,{L1,SES,L2,IQ}|∅> <L1,{SES,L2,SEX}|∅> <SEX,CP|{PE,SES}> <CP,{IQ,L1,SEX}|{L2,PE,SES}> <PE,{IQ,L2}|{L1,SEX,SES}> <IQ,{SEX,PE,CP}|{L1,L2,SES}> <SES,{SEX,IQ,L1,L2}|∅>

103 D-separations Among Observed
SES SEX PE CP L1 L2 IQ It can be shown that no DAG with just the measured variables has exactly this set of d-separation relations among the observed variables. In this sense, DAGs are not closed under marginalization.
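The d-separation claims on these slides can be checked mechanically with the standard ancestral moral graph construction. A minimal Python sketch follows; note that the edge set of the latent-variable DAG is reconstructed here from the slide's d-separation list, so it is an assumption rather than the deck's stated model.

```python
from itertools import combinations

# Hypothesized latent-variable DAG for the college-plans example
# (edge set reconstructed from the slide's d-separations; an assumption)
dag = {
    "SEX": ["PE"], "SES": ["PE", "CP"],
    "L1": ["PE", "IQ"], "L2": ["IQ", "CP"],
    "PE": ["CP"], "IQ": [], "CP": [],
}

def parents(dag, v):
    return [p for p, kids in dag.items() if v in kids]

def ancestors(dag, nodes):
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents(dag, stack.pop()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, x, y, z):
    """Check x _||_ y | z via the ancestral moral graph."""
    keep = ancestors(dag, {x, y} | set(z))
    # build the undirected moral graph on the ancestral subgraph
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents(dag, v) if p in keep]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for a, b in combinations(ps, 2):  # marry co-parents
            adj[a].add(b); adj[b].add(a)
    # search for a path from x to y that avoids z
    stack, seen = [x], {x} | set(z)
    while stack:
        v = stack.pop()
        if v == y:
            return False
        for w in adj[v] - seen:
            seen.add(w); stack.append(w)
    return True

print(d_separated(dag, "IQ", "CP", {"L2", "PE", "SES"}))  # True: matches the slide
print(d_separated(dag, "SEX", "CP", {"PE", "SES"}))       # True: matches the slide
print(d_separated(dag, "PE", "CP", set()))                # False: PE -> CP is an edge
```

The same routine, run over every pair of observed variables and every observed conditioning set, is what produces the "d-separations among observed" list on the next slide.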

104 Mixed Ancestral Graphs
Under a natural extension of the concept of d-separation to graphs with bidirected edges (↔), MAG(G) is a graphical object that contains only the observed variables and has exactly the d-separations among the observed variables. [Diagrams: the latent variable DAG over SES, SEX, PE, CP, IQ, L1, L2, and the corresponding MAG over SES, SEX, PE, CP, IQ.]

105 Mixed Ancestral Graph Construction
There is an edge between A and B if and only if for every d-separation <{A},{B}|C>, there is a latent variable in C. If A and B are adjacent, then A → B if and only if A is an ancestor of B. If A and B are adjacent, then A ↔ B if and only if A is not an ancestor of B and B is not an ancestor of A.

106 Suppose SES Unmeasured
[Diagrams: a DAG containing L1 and L2, the corresponding MAG over SEX, PE, CP, IQ, and another DAG (also containing L1 and L2) with the same MAG.]

107 Mixed Ancestral Models
Can score and evaluate in the usual ways. Not every parameter is directly interpreted as a structural (causal) coefficient. Not every part of the marginal manipulated model can be predicted from the mixed ancestral graph: because multiple DAGs can have the same MAG, they might not all agree on the effect of a manipulation. It is possible to tell from the MAG when all of the DAGs with that MAG agree on the effect of a manipulation.

108 Mixed Ancestral Graph Mixed ancestral models are closed under marginalization. In the linear normal case, the parameterization of a MAG is just a special case of the parameterization of a linear structural equation model. There is a maximum likelihood estimator of the parameters (Drton). The BIC score is easy to calculate.
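The remark that the BIC score is easy to calculate can be illustrated for the directed (DAG) case, where the Gaussian likelihood factorizes into one regression per node; fitting a MAG's bidirected part requires an iterative method such as Drton's RICF, which is not sketched here. A minimal sketch on simulated data (the chain model and its coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# simulate a linear Gaussian chain X -> Y -> Z
X = rng.normal(size=n)
Y = 0.8 * X + rng.normal(size=n)
Z = 0.8 * Y + rng.normal(size=n)
data = {"X": X, "Y": Y, "Z": Z}

def node_loglik(child, parent_names):
    """Gaussian log-likelihood of one structural equation, fit by OLS."""
    y = data[child]
    if parent_names:
        A = np.column_stack([data[p] for p in parent_names] + [np.ones(n)])
        resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    else:
        resid = y - y.mean()
    s2 = resid @ resid / n  # ML variance estimate
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1)

def bic(dag):
    """dag: {node: [parents]}. Lower is better (-2 logL + k log n)."""
    loglik = sum(node_loglik(v, ps) for v, ps in dag.items())
    k = sum(len(ps) + 2 for ps in dag.values())  # coefs + intercept + variance
    return -2 * loglik + k * np.log(n)

chain = {"X": [], "Y": ["X"], "Z": ["Y"]}
wrong = {"X": [], "Y": ["X"], "Z": ["X"]}  # omits the Y -> Z mechanism
print(bic(chain) < bic(wrong))  # True with high probability at n = 1000
```

Note that Markov equivalent DAGs (e.g. the chain reversed) get the same Gaussian BIC, which is why score-based search can only recover an equivalence class.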

109 Some Markov Equivalent Mixed Ancestral Graphs
[Diagrams: four Markov equivalent MAGs over SEX, PE, CP, IQ.] These different MAGs all have the same d-separation relations.

110 Partial Ancestral Graphs
[Diagrams: four Markov equivalent MAGs over SEX, PE, CP, IQ, and the Partial Ancestral Graph summarizing them, with circle (o) endpoints where the MAGs disagree.]

111 Partial Ancestral Graph represents MAG M
A is adjacent to B iff A and B are adjacent in M. A → B iff A is an ancestor of B in every MAG d-separation equivalent to M. A ↔ B iff A and B are not ancestors of each other in every MAG d-separation equivalent to M. A o→ B iff B is not an ancestor of A in every MAG d-separation equivalent to M, and A is an ancestor of B in some MAGs d-separation equivalent to M, but not in others. A o-o B iff A is an ancestor of B in some MAGs d-separation equivalent to M, but not in others, and B is an ancestor of A in some MAGs d-separation equivalent to M, but not in others.

112 Partial Ancestral Graph
Represents the ancestor features common to MAGs that are d-separation equivalent. Represents the d-separation relations in the d-separation equivalence class of MAGs. Can be parameterized by turning it into a mixed ancestral graph. Can be scored and evaluated like a MAG.

113 FCI Algorithm In the large sample limit, with probability 1, the output is a PAG that represents the true graph over O. If the algorithm needs to test high-order conditional independence relations, it is: Time consuming - worst-case exponential number of conditional independence tests (complete PAG). Unreliable (low power of the tests). Modified versions can halt at any given order of conditional independence test, at the cost of more "Can't tell" answers. It provides no useful information when each pair of variables has a common hidden cause. There is a provably correct score-based search, but it outputs "can't tell" in most cases.
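FCI, like PC, consults a conditional independence oracle; in practice, for Gaussian data, this is usually a Fisher-z test of vanishing partial correlation. A minimal sketch (the simulated chain is made up for illustration; production implementations live in packages such as TETRAD or pcalg):

```python
import numpy as np
from math import sqrt, log, erf

def fisher_z_test(data, i, j, cond, n):
    """Test X_i _||_ X_j | X_cond under Gaussianity.
    data: (n, p) array; cond: list of column indices. Returns the p-value."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)
    r = -prec[0, 1] / sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    z = 0.5 * log((1 + r) / (1 - r))                 # Fisher z-transform
    stat = sqrt(n - len(cond) - 3) * abs(z)          # ~ N(0,1) under H0
    return 2 * (1 - 0.5 * (1 + erf(stat / sqrt(2)))) # two-sided p-value

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = x + rng.normal(size=n)
z = y + rng.normal(size=n)
D = np.column_stack([x, y, z])

print(fisher_z_test(D, 0, 2, [1], n))  # x _||_ z | y holds: large p expected
print(fisher_z_test(D, 0, 2, [], n))   # x, z marginally dependent: tiny p
```

The "low power" bullet above is visible here: as the conditioning set grows, the effective sample size n - |cond| - 3 shrinks and the test becomes unreliable.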

114 Output for College Plans
[Diagrams: the output of the FCI algorithm, and the PAG corresponding to the output of the PC algorithm, both over SES, SEX, PE, CP, IQ.] These are different because no DAG can represent the d-separations in the output of the FCI algorithm.

115 From Sets of DAGs to Effects of Manipulations - May Be Hidden Common Causes
[Framework diagram: Effect of Manipulation, Causal DAGs, Background Knowledge, Causal Axioms/Prior, Population Distribution, Sampling and Distributional Assumptions/Prior, Sample]

116 Manipulation Model for PAGs
Same strategies as for patterns. Latent-variable IDA (Malinsky). Generalized Adjustment Criterion for PAGs (Perković).

117 Comparison with non-latent case
FCI P(cp|pe||P'(PE)) = P(cp|pe). P(CP=0|PE=0||P'(PE)) = .063 P(CP=1|PE=0||P'(PE)) = .937 P(CP=0|PE=1||P'(PE)) = .572 P(CP=1|PE=1||P'(PE)) = .428 PC P(CP=0|PE=0||P'(PE)) = .095 P(CP=1|PE=0||P'(PE)) = .905 P(CP=0|PE=1||P'(PE)) = .484 P(CP=1|PE=1||P'(PE)) = .516
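The two tables can be compared directly. A small sketch that sanity-checks the slide's numbers and computes the estimated effect of high parental encouragement on college plans (recall the coding CP: yes = 0, PE: high = 1) under each algorithm's output:

```python
# Manipulated probabilities P(CP | PE || P'(PE)) copied from the slide
fci = {("CP0", "PE0"): 0.063, ("CP1", "PE0"): 0.937,
       ("CP0", "PE1"): 0.572, ("CP1", "PE1"): 0.428}
pc  = {("CP0", "PE0"): 0.095, ("CP1", "PE0"): 0.905,
       ("CP0", "PE1"): 0.484, ("CP1", "PE1"): 0.516}

for name, tbl in [("FCI", fci), ("PC", pc)]:
    # each conditional distribution should sum to 1
    assert abs(tbl[("CP0", "PE0")] + tbl[("CP1", "PE0")] - 1) < 1e-9
    assert abs(tbl[("CP0", "PE1")] + tbl[("CP1", "PE1")] - 1) < 1e-9
    # risk difference of manipulating PE on CP = 0 (planning to attend)
    rd = tbl[("CP0", "PE1")] - tbl[("CP0", "PE0")]
    print(name, round(rd, 3))
# FCI 0.509
# PC 0.389
```

The latent-variable (FCI) analysis thus attributes a larger effect to parental encouragement than the no-latents (PC) analysis does.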

118 Good News In the large sample limit, there is an algorithm (FCI) whose output is arbitrarily close to correct (or outputs "can't tell") with probability 1 (pointwise consistency). At every finite sample size, every method will be arbitrarily far from the truth with high probability for some values of the truth (no uniform consistency).

119 Other Constraints The disadvantage of using MAGs or FCI is that they only use conditional independence information. In the case of latent variable models, there are constraints implied on the observed margin that are not conditional independence relations, regardless of the family of distributions. These can be used to choose between two different latent variable models that have the same d-separation relations over the observed variables. In addition, there are constraints implied on the observed margin that are particular to a family of distributions.
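A classic example of a non-independence constraint implied by a latent variable model is a vanishing tetrad difference: if four observed variables share a single latent common cause, products of their covariances must match. A minimal simulation sketch (the loadings are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
L = rng.normal(size=n)  # single unmeasured common cause
# four observed indicators of L, with made-up loadings
xs = [c * L + rng.normal(size=n) for c in (1.0, 0.8, 0.6, 0.9)]
cov = np.cov(np.column_stack(xs), rowvar=False)

# With one latent factor, cov(Xi, Xj) = ci * cj, so these
# tetrad differences vanish in the population:
t1 = cov[0, 1] * cov[2, 3] - cov[0, 2] * cov[1, 3]
t2 = cov[0, 1] * cov[2, 3] - cov[0, 3] * cov[1, 2]
print(abs(t1) < 0.05, abs(t2) < 0.05)  # True True
# Yet no conditional independence holds among the four observed
# variables, so this constraint is invisible to CI-only methods.
```

Tests of such constraints are exactly what lets one distinguish latent-variable models that are d-separation equivalent over the observed variables.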

120 Active Areas of Research
Complete non-parametric manipulation calculations for partially known DAGs with latent variables. Define strong faithfulness for the latent case. Calculating constraints (non-parametric or parametric) from latent variable DAGs. Using constraints (non-parametric or parametric) to guide search for latent variable DAGs. Latent variable score-based search over PAGs. Parameterizations of MAGs for other families of distributions. Completeness of do-calculus for PAGs. Time series inference.

121 Problems with Score-based Search
In order for the highest scoring model to be the true causal model, similar assumptions about the relationship between the joint probability distribution and the true causal model have to be made. For Gaussian and multinomial models, the same kind of underdetermination of the true causal model by the probability distribution exists. Computationally difficult to extend to hidden variable models.

122 Where We Are
[Table comparing algorithms ((V)CPC, LiNGAM, RESIT, ANLTS, MSL, ION, SAT) on: Large Search Space, No Ordering, Hidden Variables, Selection Bias, Missing Data, Feedback, Relax Faithfulness, Time Series, Overlapping Datasets, Beyond CI, Full Structure, Measurement Error. Fewer checkmarks is better.]

123 Important Open Questions
Given the output on thousands of variables, what is the best way to make the output usable? Is there any way to locate the parts of the output that are most reliable? Calculating p-values for oriented edges, and controlling false discovery rate. There are often thousands or tens of thousands of journal articles related to a given area. How can this information automatically be incorporated into causal search algorithms? What is the best way to incorporate causal relations between relations, rather than between properties? Does being siblings (a relationship between people) cause rivalry (another relationship between people)?

124 Acknowledgements NSF DMS - CDS&E-MSS 1317428.
National Institutes of Health Award Number U54HG008540. DARPA W911NF.

125 Selected References Kalisch, M., and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8, 613-636. Maathuis, M. H., Kalisch, M., and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Annals of Statistics, 37(6A). Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann. Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge University Press. Perković, E., Textor, J., Kalisch, M., and Maathuis, M. H. Complete Graphical Characterization and Construction of Adjustment Sets in Markov Equivalence Classes of Ancestral Graphs. arXiv: Richardson, T. S., and Robins, J. M. (2013). Single World Intervention Graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Technical Report 128, Center for Statistics and the Social Sciences, University of Washington.

126 Selected References Shpitser, I., and Pearl, J. (2008). Complete Identification Methods for the Causal Hierarchy. Journal of Machine Learning Research, 9. Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. J. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7: 2003-2030. Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search, 2nd edition with added material. MIT Press, N.Y. Spirtes, P., and Zhang, J. (2014). A uniformly consistent estimator of causal effects under the k-triangle-faithfulness assumption. Statistical Science 29(4). Stekhoven, D. J., Moraes, I., Sveinbjornsson, G., Hennig, L., Maathuis, M. H., and Bühlmann, P. (2012). Causal stability ranking. Bioinformatics, 28(21), 2819-2823.

127 Introductory Books on Graphical Causal Inference
Causation, Prediction, and Search, by P. Spirtes, C. Glymour, and R. Scheines, MIT Press, 2000. Causality: Models, Reasoning, and Inference, by J. Pearl, Cambridge University Press, 2000. Elements of Causal Inference, by J. Peters, D. Janzing, and B. Schölkopf, MIT Press, 2017. Computation, Causation, and Discovery, ed. by C. Glymour and G. Cooper, AAAI Press, 1999.

