3 Fundamental problem  As we have all heard many times… “Correlation is not causation!”

4 Fundamental problem  Why is this slogan correct?  Causal hypotheses make implicit claims about the effects of intervening on (manipulating) one or more variables  Hypotheses about association or correlation make no such claims: correlation or probabilistic dependence can be produced in many ways

5 Fundamental problem  Some of the possible reasons why X and Y might be associated are:  Sheer chance  X causes Y  Y causes X  Some third variable Z influences X and Y  The value of X (or a cause of X) and the value of Y (or a cause of Y) can be causes/reasons for whether an individual is in the sample (sample selection bias)
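The third-variable case can be illustrated with a small simulation. This is a sketch under assumptions that are not in the slides: a linear model with Gaussian noise in which Z influences both X and Y and there is no edge between X and Y. The helper name `pearson` and the coefficients are made up for illustration.

```python
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(0)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]   # Z -> X
y = [zi + random.gauss(0, 1) for zi in z]   # Z -> Y, and no X-Y edge

marginal = pearson(x, y)
# Remove Z's contribution; the coefficient 1 is known only because we wrote the simulation.
adjusted = pearson([xi - zi for xi, zi in zip(x, z)],
                   [yi - zi for yi, zi in zip(y, z)])
print(f"corr(X, Y) = {marginal:.2f}; after removing Z: {adjusted:.2f}")
```

X and Y come out clearly correlated even though neither causes the other, and the association largely disappears once Z is accounted for.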

6 Fundamental problem  Fundamental problem of causal search:  For any particular set of data, there are often many different causal structures that could have produced that data  Causation → Association map is many → one

7 Fundamental problem  Okay, so what can we do about this?  Use the data to figure out as much as possible (though it usually won’t be everything); this requires developing search procedures  And then try to narrow the possibilities: use other knowledge (e.g., time order, interventions), or get better / different data (e.g., run an experiment)

8 Always remember… Even if we cannot discover the whole truth, we might be able to find some of the truth!

9 Markov equivalence  Formally, we say that:  Two causal graphs are members of the same Markov Equivalence Class iff they imply the exact same (un)conditional independence relations among the observed variables (by the Markov and Faithfulness assumptions)  Remember that d-separation gives a purely graphical criterion for determining all of the (un)conditional independencies implied by a graph
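A minimal sketch of a d-separation check, using the standard moralized-ancestral-graph criterion (a technique chosen here for brevity, not one named in the slides). A graph is assumed to be a set of (parent, child) pairs, and the function name `d_separated` is hypothetical.

```python
def d_separated(edges, x, y, given):
    """True iff x and y are d-separated by the set `given` in the DAG `edges`."""
    # 1. Keep only x, y, the conditioning set, and all of their ancestors.
    relevant = {x, y} | set(given)
    frontier = list(relevant)
    while frontier:
        node = frontier.pop()
        for parent, child in edges:
            if child == node and parent not in relevant:
                relevant.add(parent)
                frontier.append(parent)
    sub = [(p, c) for p, c in edges if p in relevant and c in relevant]
    # 2. Moralize: drop directions and connect parents that share a child.
    undirected = {frozenset(e) for e in sub}
    for p1, c1 in sub:
        for p2, c2 in sub:
            if c1 == c2 and p1 != p2:
                undirected.add(frozenset((p1, p2)))
    # 3. x and y are d-separated iff no path avoids the conditioning set.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        for edge in undirected:
            if node in edge:
                (other,) = edge - {node}
                if other not in seen and other not in given:
                    seen.add(other)
                    stack.append(other)
    return y not in seen

chain = {("X", "Y"), ("Y", "Z")}       # X -> Y -> Z
collider = {("X", "Y"), ("Z", "Y")}    # X -> Y <- Z
print(d_separated(chain, "X", "Z", {"Y"}), d_separated(collider, "X", "Z", {"Y"}))
```

Conditioning on the middle variable blocks the chain but opens the collider, which is the asymmetry the later orientation steps rely on.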

10 Markov equivalence  The “Fundamental problem of causal search” can be restated as:  For some sets of independence relations, the Markov equivalence class is not a singleton  Markov equivalence classes give a precise characterization of what can be inferred from independencies alone

11 Markov equivalence  Examples:  X ⫫ {Y, Z} ⇒ [figure: graphs over X, Y, Z in this class]  X ⫫ Y | Z ⇒ [figure: graphs over X, Y, Z in this class]  X ⫫ Y ⇒ [figure: graphs over X, Y, Z in this class]

12 Markov equivalence  Two more examples:  Are these graphs Markov equivalent? [figure: two graphs over X, Y, Z]  Are these two graphs? [figure: two graphs over X, Y, Z]

13 Shared structure  What is shared by all of the graphs in a Markov equivalence class?  Same “skeleton”, i.e., they all have the same adjacency relations  Same “unshielded colliders”, i.e., X → Y ← Z with no edge between X and Z  Sometimes, other edges have the same direction  In these last two cases (unshielded colliders and other shared directions), we can infer that the true graph contains the shared directed edges.
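The two shared features above also give a practical test. A sketch, assuming DAGs with no latent variables and relying on the known result (Verma and Pearl) that same skeleton plus same unshielded colliders is sufficient as well as necessary for Markov equivalence. Graphs are sets of (parent, child) pairs; all function names are hypothetical.

```python
from itertools import combinations

def skeleton(edges):
    """Adjacencies with directions dropped."""
    return {frozenset(e) for e in edges}

def unshielded_colliders(edges):
    """All triples (a, child, b) with a -> child <- b and a, b not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def markov_equivalent(g1, g2):
    return (skeleton(g1) == skeleton(g2)
            and unshielded_colliders(g1) == unshielded_colliders(g2))

chain = {("X", "Y"), ("Y", "Z")}       # X -> Y -> Z
fork = {("Y", "X"), ("Y", "Z")}        # X <- Y -> Z
collider = {("X", "Y"), ("Z", "Y")}    # X -> Y <- Z
print(markov_equivalent(chain, fork), markov_equivalent(chain, collider))
```

The chain and the fork fall in one class; the collider has the same skeleton but a different unshielded collider, so it sits in a class of its own.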

14 Shared structure as patterns  Since every Markov equivalent graph has the same adjacencies, we can represent the whole class using a pattern  A pattern is itself a graph, but the edges represent edges in other graphs

15 Shared structure as patterns  A pattern can have directed and undirected edges  It represents all graphs that can be created by adding arrowheads to the undirected edges without creating either: (i) a cycle; or (ii) an unshielded collider  Let’s try some examples…

16 Shared structure as patterns Nitrogen — PlantGrowth — Bees Nitrogen → PlantGrowth → Bees Nitrogen ← PlantGrowth → Bees Nitrogen ← PlantGrowth ← Bees

17 Shared structure as patterns Nitrogen → PlantGrowth ← Bees
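The rule for reading a pattern can be sketched as a brute-force enumeration: try every orientation of the undirected edges and keep those that create neither a cycle nor a new unshielded collider. The representation (a set of directed pairs plus a list of undirected pairs) and the function names are assumptions made for this example.

```python
from itertools import combinations, product

def has_cycle(edges):
    """Repeatedly peel off nodes with no incoming edges; a cycle leaves none."""
    nodes = {n for e in edges for n in e}
    remaining = set(edges)
    while nodes:
        sources = {n for n in nodes if all(c != n for _, c in remaining)}
        if not sources:
            return True
        nodes -= sources
        remaining = {(p, c) for p, c in remaining if p not in sources}
    return False

def unshielded_colliders(edges, skel):
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    return {(a, child, b)
            for child, ps in parents.items()
            for a, b in combinations(sorted(ps), 2)
            if frozenset((a, b)) not in skel}

def dags_in_pattern(directed, undirected):
    """All DAGs represented by a pattern."""
    skel = {frozenset(e) for e in directed} | {frozenset(e) for e in undirected}
    allowed = unshielded_colliders(directed, skel)
    result = []
    for flips in product((False, True), repeat=len(undirected)):
        candidate = set(directed)
        for (a, b), flip in zip(undirected, flips):
            candidate.add((b, a) if flip else (a, b))
        if not has_cycle(candidate) and unshielded_colliders(candidate, skel) == allowed:
            result.append(candidate)
    return result

undirected_pattern = dags_in_pattern(
    set(), [("Nitrogen", "PlantGrowth"), ("PlantGrowth", "Bees")])
collider_pattern = dags_in_pattern(
    {("Nitrogen", "PlantGrowth"), ("Bees", "PlantGrowth")}, [])
print(len(undirected_pattern), len(collider_pattern))
```

The fully undirected pattern yields the three graphs listed on slide 16, while the collider pattern on slide 17 represents only itself.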

18 Formal problem of search  Given some dataset D, find:  Markov equivalence class, represented as a pattern P, that predicts exactly the independence relations found in the data  More colloquially, find the causal graphs that could have produced data like this

19 Hard to find a pattern  “Gee, how hard could this be? Just test all of the associations, find the Markov equivalence class, then write down the pattern for it. Voila! We’re doing causal learning!”  Big problem: # of independencies to test grows exponentially in # of variables:  2 variables ⇒ 1 test  3 variables ⇒ 6 tests  4 variables ⇒ 24 tests  5 variables ⇒ 80 tests  6 variables ⇒ 240 tests  and so on…
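The counts on this slide are consistent with testing every pair of variables against every subset of the remaining variables. The closed form below is an inference from the listed numbers, not something stated on the slide, and the function name is hypothetical.

```python
from math import comb

def num_independence_tests(n):
    """Pairs of variables times subsets of the other n - 2 variables."""
    return comb(n, 2) * 2 ** (n - 2)

for n in range(2, 7):
    print(n, "variables =>", num_independence_tests(n), "tests")
```

Under this counting, each added variable more than doubles the number of tests.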

20 General features of causal search  Huge model and parameter spaces  Even when we (necessarily) use prior information about the family of probability distributions.  Relevant statistics must be rapidly computed  But substantive knowledge about the domain may restrict the space of alternative models  Time order of variables  Required cause/effect relationships  Existence or non-existence of latent variables

21 Three schemata for search  Bayesian / score-based  Find the graph(s) with highest P(graph | data)  Constraint-based  Find the graph(s) that predict exactly the observed associations and independencies  Combined  Get “close” with constraint-based, and then find the best graph using score-based

22 Bayesian / score-based  Informally:  Give each model an initial score using “prior beliefs”  Update each score based on the likelihood of the data if the model were true  Output the highest-scoring model  Formally:  Specify P(M, v) for all models M and possible parameter values v of M  For any data D, P(D | M, v) can easily be calculated  P(M | D) ∝ ∫_v P(D | M, v) P(M, v) dv
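A worked sketch of the scoring formula for the smallest possible case: two binary variables and two candidate models, “no edge” versus “X → Y”. Everything specific here is an assumption for illustration: uniform priors on each probability parameter (so the integral over v has a closed form), equal prior weight on the two models, and the function names.

```python
from math import factorial

def marginal_likelihood(ones, zeros):
    """Integral of t**ones * (1 - t)**zeros over t in [0, 1] (uniform prior)."""
    return factorial(ones) * factorial(zeros) / factorial(ones + zeros + 1)

def model_posteriors(counts, prior_edge=0.5):
    """counts[x][y] = number of samples with X = x and Y = y."""
    n_x = [sum(counts[0]), sum(counts[1])]
    n_y = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    like_x = marginal_likelihood(n_x[1], n_x[0])
    # No edge: Y has one parameter, shared across both values of X.
    like_no_edge = like_x * marginal_likelihood(n_y[1], n_y[0])
    # X -> Y: Y has a separate parameter for each value of X.
    like_edge = (like_x
                 * marginal_likelihood(counts[0][1], counts[0][0])
                 * marginal_likelihood(counts[1][1], counts[1][0]))
    weighted = {"no edge": (1 - prior_edge) * like_no_edge,
                "X -> Y": prior_edge * like_edge}
    total = sum(weighted.values())
    return {model: w / total for model, w in weighted.items()}

print(model_posteriors([[20, 0], [0, 20]]))    # X and Y always agree
print(model_posteriors([[10, 10], [10, 10]]))  # X tells us nothing about Y
```

With these priors, Y → X would receive the same score as X → Y, which is the Markov equivalence problem showing up again in score-based form.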

23 Bayesian / score-based  In practice, this strategy is completely computationally intractable  There are too many graphs to check them all  So, we use a greedy search strategy  Start with an initial graph  Iteratively compare the current graph’s score (∝ posterior probability) with that of each 1- or 2-step modification of that graph (by edge addition, deletion, or reversal)
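A sketch of the greedy strategy, simplified to single-step modifications only. The score function is passed in by the caller; the `toy_score` used here (closeness to a fixed target graph) is a stand-in invented for this example and is not a posterior probability.

```python
def is_acyclic(edges):
    nodes = {n for e in edges for n in e}
    remaining = set(edges)
    while nodes:
        sources = {n for n in nodes if all(c != n for _, c in remaining)}
        if not sources:
            return False
        nodes -= sources
        remaining = {(p, c) for p, c in remaining if p not in sources}
    return True

def neighbors(edges, nodes):
    """Graphs one edge addition, deletion, or reversal away from `edges`."""
    for a in nodes:
        for b in nodes:
            if a == b:
                continue
            if (a, b) in edges:
                yield edges - {(a, b)}                 # deletion
                yield (edges - {(a, b)}) | {(b, a)}    # reversal
            elif (b, a) not in edges:
                yield edges | {(a, b)}                 # addition

def greedy_search(nodes, score, start=frozenset()):
    current = frozenset(start)
    while True:
        candidates = [g for g in neighbors(current, nodes) if is_acyclic(g)]
        best = max(candidates, key=score, default=current)
        if score(best) <= score(current):
            return current      # no neighbor improves: a (possibly local) maximum
        current = best

# Hypothetical stand-in score: minus the number of edges that differ from a target.
target = frozenset({("X", "Z"), ("Y", "Z"), ("Z", "W")})
toy_score = lambda g: -len(g ^ target)
found = greedy_search(["X", "Y", "Z", "W"], toy_score)
print(sorted(found))
```

This toy score has a single peak, so the search cannot get stuck; real scores generally offer no such guarantee.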

24 Bayesian / score-based  Problem #1: Local maxima  Often, greedy searches get stuck  Solution:  Greedy search over Markov equivalence classes, rather than graphs (Meek)  Has a proof of correctness and convergence (Chickering)  But it gets to the right answer slowly

25 Bayesian / score-based  Problem #2: Unobserved variables  Huge number of graphs  Huge number of different parameterizations  No fast, general way to compute likelihoods from latent variable models  Partial solution:  Focus on a small, “plausible” set of models for which we can compute scores

26 Constraint-based  Implementation of the earlier idea  “Build” the Markov equivalence class that predicts the pattern of association actually found in the data  Compatible with a variety of statistical techniques  Note that we might have to introduce a latent variable to explain the pattern of statistics  Important constraints on search: minimize the number of statistical tests; minimize the size of the conditioning sets (Why?)

27 Constraint-based  Algorithm step #1: Discover the adjacencies  Create the complete graph with undirected edges  Test all pairs X, Y for unconditional independence; remove the X – Y edge if they are independent  Test all adjacent X, Y for independence given a single neighbor N; remove the X – Y edge if they are independent  Test adjacent pairs given two neighbors  …
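The adjacency step can be sketched as follows, with statistical tests replaced by an independence “oracle”: a hypothetical function `independent(x, y, cond)` that simply looks up known facts. The facts used here are the ones from the worked example on slide 29; the function names and data layout are assumptions.

```python
from itertools import combinations

def find_skeleton(variables, independent):
    """Start from the complete graph; remove edges using growing conditioning sets."""
    adjacent = {v: set(variables) - {v} for v in variables}
    sepset = {}   # remembers which set separated each removed pair
    size = 0
    while any(len(adjacent[x]) - 1 >= size for x in variables):
        for x in variables:
            for y in list(adjacent[x]):
                # Condition only on current neighbors of x (other than y).
                for cond in combinations(sorted(adjacent[x] - {y}), size):
                    if independent(x, y, set(cond)):
                        adjacent[x].discard(y)
                        adjacent[y].discard(x)
                        sepset[frozenset((x, y))] = set(cond)
                        break
        size += 1
    return adjacent, sepset

# Oracle built from the independencies listed in the slide-29 example.
FACTS = {("X", "Y", frozenset()),
         ("X", "W", frozenset({"Z"})),
         ("Y", "W", frozenset({"Z"}))}

def independent(x, y, cond):
    return (x, y, frozenset(cond)) in FACTS or (y, x, frozenset(cond)) in FACTS

adjacent, sepset = find_skeleton(["X", "Y", "Z", "W"], independent)
print({v: sorted(ns) for v, ns in adjacent.items()})
```

Only the edges X – Z, Y – Z, and Z – W survive, and the recorded separating sets are what the orientation step needs next.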

28 Constraint-based  Algorithm step #2: (Try to) Orient edges  “Unshielded triple”: X – C – Y, but X, Y not adjacent  If X & Y are independent given some S containing C, then C must be a non-collider, since we have to condition on it to achieve d-separation  If X & Y are independent given some S not containing C, then C must be a collider, since the path is not active when not conditioning on C  And then do further orientations to ensure acyclicity and that non-colliders stay non-colliders
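A sketch of the orientation step, taking as input the adjacencies and separating sets from step #1 (hard-coded below from the slide-29 example). It covers collider detection and only one of the “further orientation” rules, namely orienting away from a node that must be a non-collider; the function name and data layout are assumptions.

```python
from itertools import combinations

def orient(adjacent, sepset):
    """Return the set of directed (tail, head) edges that can be inferred."""
    directed = set()
    # Unshielded triple x - c - y: c is a collider iff c did not separate x and y.
    for c in adjacent:
        for x, y in combinations(sorted(adjacent[c]), 2):
            if y not in adjacent[x] and c not in sepset[frozenset((x, y))]:
                directed |= {(x, c), (y, c)}
    # If a -> b, b - c is undirected, and a, c are not adjacent, then b -> c
    # (otherwise b would be a new unshielded collider).
    changed = True
    while changed:
        changed = False
        for a, b in list(directed):
            for c in adjacent[b] - {a}:
                if (c not in adjacent[a]
                        and (b, c) not in directed and (c, b) not in directed):
                    directed.add((b, c))
                    changed = True
    return directed

adjacent = {"X": {"Z"}, "Y": {"Z"}, "Z": {"X", "Y", "W"}, "W": {"Z"}}
sepset = {frozenset(("X", "Y")): set(),
          frozenset(("X", "W")): {"Z"},
          frozenset(("Y", "W")): {"Z"}}
directed = orient(adjacent, sepset)
print(sorted(directed))
```

Any adjacency that ends up in neither direction stays undirected in the output pattern.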

29 Constraint-based example  Variables are {X, Y, Z, W}  Only independencies are:  X ⫫ Y  X ⫫ W | Z  Y ⫫ W | Z

30 Constraint-based example  Step 1: Form the complete graph using undirected edges  [figure: complete undirected graph over X, Y, Z, W]

31 Constraint-based example  Step 2: For each pair of variables, remove the edge between them if they’re unconditionally independent  X ⫫ Y ⇒ [figure: graph over X, Y, Z, W without the X – Y edge]

32 Constraint-based example  Step 3: For each adjacent pair, remove the edge if they’re independent conditional on some variable adjacent to one of them  {X, Y} ⫫ W | Z ⇒ [figure: graph over X, Y, Z, W without the X – W and Y – W edges]

33 Constraint-based example  Step 4: Continue removing edges, checking independence conditional on 2 (or 3, or 4, or…) variables  [figure: resulting graph over X, Y, Z, W]

34 Constraint-based example  Step 5: Orientation  For X – Z – Y, since X ⫫ Y without conditioning on Z, make Z a collider  Since Z is a non-collider between X and W, though, we must orient Z – W away from Z  [figure: X → Z ← Y, Z → W]

35 Constraint-based output  Searches that allow for latent variables can also have edges of the form X o→ Y  This indicates one of three possibilities:  X → Y  At least one unobserved common cause of X and Y  Both of these

36 Interventions to the rescue?  Interventions helped us solve an earlier equivalence class problem  Randomization meant that: Treatment-Effect association ⇒ T → E  Interventions alter equivalence classes, but don’t make them all into singletons  The fundamental problem of search remains

37 Before X-intervention  [figure: candidate graphs over X, Y, Z]

38 After X-intervention  [figure: candidate graphs over X, Y, Z]

39 Search with interventions  Search with interventions is the same as search with observations, except  We adjust the graphs in the search space to account for the intervention  For multiple experiments, we search for graphs in every output equivalence class  More complicated than this in the real world due to sampling variation
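“Adjusting the graphs” can be sketched as graph surgery: an intervention on a variable removes every edge pointing into it. The sketch below applies this to the three members of one observational equivalence class; the function name `intervene` and the choice of class are assumptions for illustration.

```python
def intervene(edges, target):
    """Graph surgery: drop every edge pointing into the intervened variable."""
    return {(parent, child) for parent, child in edges if child != target}

# Three Markov equivalent graphs (same skeleton Y - X - Z, X a non-collider).
equivalence_class = {
    "Y -> X -> Z": {("Y", "X"), ("X", "Z")},
    "Y <- X -> Z": {("X", "Y"), ("X", "Z")},
    "Y <- X <- Z": {("X", "Y"), ("Z", "X")},
}
after = {name: intervene(g, "X") for name, g in equivalence_class.items()}
for name, g in after.items():
    print(name, "=> after intervening on X:", sorted(g))
```

After the intervention on X the three graphs have three different skeletons, so they no longer predict the same independencies.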

40 Example  Observation:  Y ⫫ Z | X ⇒ [figure: graphs over X, Y, Z]  Intervention on X:  Y ⫫ {X, Z} ⇒ [figure: graphs over X, Y, Z]  Only possible graph: [figure: graph over X, Y, Z]
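This example can be checked by brute force. The sketch enumerates every DAG over X, Y, Z, keeps those whose observational independencies are exactly Y ⫫ Z | X, and then keeps those in which Y is independent of both X and Z after surgery on X. It assumes no latent variables and that the listed independencies are the only ones; all function names are hypothetical.

```python
from itertools import combinations, product

NODES = ("X", "Y", "Z")

def d_separated(edges, x, y, given):
    relevant = {x, y} | set(given)            # nodes plus their ancestors
    frontier = list(relevant)
    while frontier:
        node = frontier.pop()
        for p, c in edges:
            if c == node and p not in relevant:
                relevant.add(p)
                frontier.append(p)
    sub = [(p, c) for p, c in edges if p in relevant and c in relevant]
    und = {frozenset(e) for e in sub}         # moralized undirected graph
    und |= {frozenset((p1, p2)) for p1, c1 in sub for p2, c2 in sub
            if c1 == c2 and p1 != p2}
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        for e in und:
            if node in e:
                (other,) = e - {node}
                if other not in seen and other not in given:
                    seen.add(other)
                    stack.append(other)
    return y not in seen

def is_acyclic(edges):
    nodes, remaining = {n for e in edges for n in e}, set(edges)
    while nodes:
        sources = {n for n in nodes if all(c != n for _, c in remaining)}
        if not sources:
            return False
        nodes -= sources
        remaining = {(p, c) for p, c in remaining if p not in sources}
    return True

def all_dags():
    pairs = list(combinations(NODES, 2))
    for choice in product((None, "fwd", "rev"), repeat=len(pairs)):
        edges = {(a, b) if c == "fwd" else (b, a)
                 for (a, b), c in zip(pairs, choice) if c is not None}
        if is_acyclic(edges):
            yield frozenset(edges)

def independencies(edges):
    found = set()
    for a, b in combinations(NODES, 2):
        rest = tuple(n for n in NODES if n not in (a, b))
        for given in ((), rest):
            if d_separated(edges, a, b, given):
                found.add((a, b, given))
    return found

def intervene(edges, target):
    return {(p, c) for p, c in edges if c != target}

observed = {("Y", "Z", ("X",))}
survivors = [g for g in all_dags()
             if independencies(g) == observed
             and d_separated(intervene(g, "X"), "Y", "X", ())
             and d_separated(intervene(g, "X"), "Y", "Z", ())]
print([sorted(g) for g in survivors])
```

Under these assumptions exactly one graph survives both filters.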

41 Looking ahead…  Have:  Basic formal representation for causation  Fundamental causal asymmetry (of intervention)  Inference & reasoning methods  Search & causal discovery principles  Need:  Search & causal discovery methods that work in the real world

