CAUSAL INFERENCE IN STATISTICS THE MATHEMATICS OF CAUSAL INFERENCE IN STATISTICS Judea Pearl University of California Los Angeles (www.cs.ucla.edu/~judea) The subject of my lecture this evening is CAUSALITY. It is not an easy topic to speak about, but it is a fun topic to speak about. It is not easy because, like religion, sex and intelligence, causality was meant to be practiced, not analyzed. And it is fun, because, like religion, sex and intelligence, emotions run high, examples are plenty, there are plenty of interesting people to talk to, and above all, an exhilarating experience of watching our private thoughts magnified under the microscope of formal analysis.
OUTLINE Modeling: Statistical vs. Causal Causal Models and Identifiability Inference to three types of claims: Effects of potential interventions Claims about attribution (responsibility) Claims about direct and indirect effects
TRADITIONAL STATISTICAL INFERENCE PARADIGM Data Inference Q(P) (Aspects of P) P Joint Distribution e.g., Infer whether customers who bought product A would also buy product B. Q = P(B | A)
FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES Probability and statistics deal with static relations P Joint Distribution P Joint Distribution Q(P) (Aspects of P) Data change Inference What happens when P changes? e.g., Infer whether customers who bought product A would still buy A if we were to double the price.
FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES What remains invariant when P changes say, to satisfy P (price=2)=1 P Joint Distribution P Joint Distribution Q(P) (Aspects of P) Data change Inference Note: P (v) P (v | price = 2) P does not tell us how it ought to change e.g. Curing symptoms vs. curing diseases e.g. Analogy: mechanical deformation
FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES (CONT) Spurious correlation Randomization Confounding / Effect Instrument Holding constant Explanatory variables STATISTICAL Regression Association / Independence “Controlling for” / Conditioning Odd and risk ratios Collapsibility Propensity score Causal and statistical concepts do not mix.
} FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES (CONT) Spurious correlation Randomization Confounding / Effect Instrument Holding constant Explanatory variables STATISTICAL Regression Association / Independence “Controlling for” / Conditioning Odd and risk ratios Collapsibility Propensity score Causal and statistical concepts do not mix. No causes in – no causes out (Cartwright, 1989) statistical assumptions + data causal assumptions causal conclusions } Causal assumptions cannot be expressed in the mathematical language of standard statistics.
} FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES (CONT) Spurious correlation Randomization Confounding / Effect Instrument Holding constant Explanatory variables STATISTICAL Regression Association / Independence “Controlling for” / Conditioning Odd and risk ratios Collapsibility Propensity score Causal and statistical concepts do not mix. No causes in – no causes out (Cartwright, 1989) statistical assumptions + data causal assumptions causal conclusions } Causal assumptions cannot be expressed in the mathematical language of standard statistics. Non-standard mathematics: Structural equation models (Wright, 1920; Simon, 1960) Counterfactuals (Neyman-Rubin (Yx), Lewis (x Y))
WHY CAUSALITY NEEDS SPECIAL MATHEMATICS Y = 2X X = 1 X = 1 Y = 2 SEM Equations are Non-algebraic: Y = 2X X = 1 X = 1 Y = 2 Process information Static information Had X been 3, Y would be 6. If we raise X to 3, Y would be 6. Must “wipe out” X = 1.
FROM STATISTICAL TO CAUSAL ANALYSIS: 2. THE MENTAL BARRIERS Every exercise of causal analysis must rest on untested, judgmental causal assumptions. Every exercise of causal analysis must invoke non-standard mathematical notation.
TWO PARADIGMS FOR CAUSAL INFERENCE Observed: P(X, Y, Z,...) Conclusions needed: P(Yx=y), P(Xy=x | Z=z)... How do we connect observables, X,Y,Z,… to counterfactuals Yx, Xz, Zy,… ? N-R model Counterfactuals are primitives, new variables Super-distribution P*(X, Y,…, Yx, Xz,…) X, Y, Z constrain Yx, Zy,… Structural model Counterfactuals are derived quantities Subscripts modify a data-generating model
THE STRUCTURAL MODEL PARADIGM Data Joint Generating Q(M) Distribution (Aspects of M) Data Inference M – Oracle for computing answers to Q’s. e.g., Infer whether customers who bought product A would still buy A if we were to double the price.
ORACLE FOR MANIPILATION FAMILIAR CAUSAL MODEL ORACLE FOR MANIPILATION X Y Here is a causal model we all remember from high-school -- a circuit diagram. There are 4 interesting points to notice in this example: (1) It qualifies as a causal model -- because it contains the information to confirm or refute all action, counterfactual and explanatory sentences concerned with the operation of the circuit. For example, anyone can figure out what the output would be like if we set Y to zero, or if we change this OR gate to a NOR gate or if we perform any of the billions combinations of such actions. (2) Logical functions (Boolean input-output relation) is insufficient for answering such queries (3)These actions were not specified in advance, they do not have special names and they do not show up in the diagram. In fact, the great majority of the action queries that this circuit can answer have never been considered by the designer of this circuit. (4) So how does the circuit encode this extra information? Through two encoding tricks: 4.1 The symbolic units correspond to stable physical mechanisms (i.e., the logical gates) 4.2 Each variable has precisely one mechanism that determines its value. Z INPUT OUTPUT
STRUCTURAL CAUSAL MODELS Definition: A structural causal model is a 4-tuple V,U, F, P(u), where V = {V1,...,Vn} are observable variables U = {U1,...,Um} are background variables F = {f1,..., fn} are functions determining V, vi = fi(v, u) P(u) is a distribution over U P(u) and F induce a distribution P(v) over observable variables
CAUSAL MODELS AND COUNTERFACTUALS Definition: The sentence: “Y would be y (in situation u), had X been x,” denoted Yx(u) = y, means: The solution for Y in a mutilated model Mx, (i.e., the equations for X replaced by X = x) with input U=u, is equal to y. Joint probabilities of counterfactuals: The super-distribution P* is derived from M. Parsimonious, consistent, and transparent
APPLICATIONS . Predicting effects of actions and policies . Learning causal relationships from assumptions and data . Troubleshooting physical systems and plans . Finding explanations for reported events . Generating verbal explanations . Understanding causal talk . Formulating theories of causal thinking Let us talk about item 1 for a minute. We saw that if we have a causal model M, then predicting the ramifications of an action is trivial -- mutilate and solve. If instead of a complete model we only have a probabilistic model, it is again trivial: we mutilate and propagate probabilities in the resultant causal network. The important point is that we can specify knowledge using causal vocabulary, and can handle actions that are specified as modalities. But what if we do not have even a probabilistic model? This is where item 2 comes in. In certain applications we are lucky to have data that may supplement missing fragments of the model, and the question is whether the data available is sufficient for computing the effect of actions. Let us illustrate this possibility in a simple example taken from economics:
AXIOMS OF CAUSAL COUNTERFACTUALS Y would be y, had X been x (in state U = u) Definiteness Uniqueness Effectiveness Composition Reversibility
RULES OF CAUSAL CALCULUS Rule 1: Ignoring observations P(y | do{x}, z, w) = P(y | do{x}, w) Rule 2: Action/observation exchange P(y | do{x}, do{z}, w) = P(y | do{x},z,w) Rule 3: Ignoring actions P(y | do{x}, do{z}, w) = P(y | do{x}, w)
DERIVATION IN CAUSAL CALCULUS Genotype (Unobserved) Smoking Tar Cancer P (c | do{s}) = t P (c | do{s}, t) P (t | do{s}) Probability Axioms = t P (c | do{s}, do{t}) P (t | do{s}) Rule 2 = t P (c | do{s}, do{t}) P (t | s) Rule 2 Here we see how one can prove that the effect of smoking on cancer can be determined from data on three variables: Smoking, Tar and Cancer. The question boils down to computing P(cancer) under the hypothetical action do(smoking), from non-experimental data, namely, from expressions involving NO ACTIONS. Or: we need to eliminate the "do" symbol from the initial expression. The elimination proceeds like ordinary solution of algebraic equation -- in each stage, a new rule is applied, licensed by some subgraph of the diagram, until eventually leading to a formula involving only WHITE SYMBOLS, meaning an expression computable from non-experimental data. Now, if I were not a modest person, I would say that this is an amazing result. Watch what is going on here: we are not given any information whatsoever on the hidden genotype, it may be continuous or discrete, unidimensional or multidimensional. Yet, measuring an auxiliary variable TAR someplace else in the system, enables us to predict what the world would be like in the hypothetical situation where people were free of the influence of this hidden genotype. Data on the visible allows us to infer the effects of the invisible. Moreover, a person can also figure out the answer to the question: "I am about to smoke -- should I"? I think it is amazing, because I cannot do this calculation in my head. It demonstrates the immense power of having a formal language in an area that many respectable scientists prefer to see handled by unaided judgment. = t P (c | do{t}) P (t | s) Rule 3 = st P (c | do{t}, s) P (s | do{t}) P(t |s) Probability Axioms Rule 2 = st P (c | t, s) P (s | do{t}) P(t |s) = s t P (c | t, s) P (s) P(t |s) Rule 3
THE BACK-DOOR CRITERION Graphical test of identification P(y | do(x)) is identifiable in G if there is a set Z of variables such that Z d-separates X from Y in Gx. G Gx Z1 Z1 Z2 Z2 Z Z3 Z3 Z4 Z5 Z5 Z4 X Z6 Y X Z6 Y Moreover, P(y | do(x)) = å P(y | x,z) P(z) (“adjusting” for Z) z
RECENT RESULTS ON IDENTIFICATION do-calculus is complete Complete graphical criterion for identifying causal effects (Shpitser and Pearl, 2006). Complete graphical criterion for empirical testability of counterfactuals (Shpitser and Pearl, 2007).
DETERMINING THE CAUSES OF EFFECTS (The Attribution Problem) Your Honor! My client (Mr. A) died BECAUSE he used that drug.
DETERMINING THE CAUSES OF EFFECTS (The Attribution Problem) Your Honor! My client (Mr. A) died BECAUSE he used that drug. Court to decide if it is MORE PROBABLE THAN NOT that A would be alive BUT FOR the drug! P(? | A is dead, took the drug) > 0.50 PN =
THE PROBLEM Semantical Problem: What is the meaning of PN(x,y): “Probability that event y would not have occurred if it were not for event x, given that x and y did in fact occur.”
THE PROBLEM Semantical Problem: What is the meaning of PN(x,y): “Probability that event y would not have occurred if it were not for event x, given that x and y did in fact occur.” Answer: Computable from M
THE PROBLEM Semantical Problem: What is the meaning of PN(x,y): “Probability that event y would not have occurred if it were not for event x, given that x and y did in fact occur.” Under what condition can PN(x,y) be learned from statistical data, i.e., observational, experimental and combined. Analytical Problem:
TYPICAL THEOREMS (Tian and Pearl, 2000) Bounds given combined nonexperimental and experimental data Identifiability under monotonicity (Combined data) corrected Excess-Risk-Ratio
CAN FREQUENCY DATA DECIDE LEGAL RESPONSIBILITY? Experimental Nonexperimental do(x) do(x) x x Deaths (y) 16 14 2 28 Survivals (y) 984 986 998 972 1,000 1,000 1,000 1,000 Nonexperimental data: drug usage predicts longer life Experimental data: drug has negligible effect on survival Plaintiff: Mr. A is special. He actually died He used the drug by choice Court to decide (given both data): Is it more probable than not that A would be alive but for the drug?
SOLUTION TO THE ATTRIBUTION PROBLEM WITH PROBABILITY ONE 1 P(yx | x,y) 1 Combined data tell more that each study alone
EFFECT DECOMPOSITION What is the semantics of direct and indirect effects? What are their policy-making implications? Can we estimate them from data? Experimental data?
WHY DECOMPOSE EFFECTS? Direct (or indirect) effect may be more transportable. Indirect effects may be prevented or controlled. Direct (or indirect) effect may be forbidden Pill Pregnancy + + Thrombosis Gender Qualification Hiring
SEMANTICS BECOMES NONTRIVIAL IN NONLINEAR MODELS (even when the model is completely specified) X Z z = f (x, 1) y = g (x, z, 2) Y Dependent on z? Void of operational meaning?
THE OPERATIONAL MEANING OF DIRECT EFFECTS X Z z = f (x, 1) y = g (x, z, 2) Y “Natural” Direct Effect of X on Y: The expected change in Y per unit change of X, when we keep Z constant at whatever value it attains before the change. In linear models, NDE = Controlled Direct Effect
THE OPERATIONAL MEANING OF INDIRECT EFFECTS X Z z = f (x, 1) y = g (x, z, 2) Y “Natural” Indirect Effect of X on Y: The expected change in Y when we keep X constant, say at x0, and let Z change to whatever value it would have under a unit change in X. In linear models, NIE = TE - DE
POLICY IMPLICATIONS OF INDIRECT EFFECTS indirect What is the direct effect of X on Y? The effect of Gender on Hiring if sex discrimination is eliminated. GENDER QUALIFICATION HIRING X Z IGNORE f Y
SEMANTICS AND IDENTIFICATION OF NESTED COUNTERFACTUALS Consider the quantity Given M, P(u), Q is well defined Given u, Zx*(u) is the solution for Z in Mx*, call it z is the solution for Y in Mxz Can Q be estimated from data?
GENERAL PATH-SPECIFIC EFFECTS (Def.) X x* X W Z W Z z* = Zx* (u) Y Y Form a new model, , specific to active subgraph g Definition: g-specific effect Nonidentifiable even in Markovian models
EFFECT DECOMPOSITION SUMMARY Graphical conditions for estimability from experimental / nonexperimental data. Graphical conditions hold in Markovian models Useful in answering new type of policy questions involving mechanism blocking instead of variable fixing.
CONCLUSIONS Structural-model semantics, enriched with logic and graphs, provides: Complete formal basis for causal reasoning Powerful and friendly causal calculus Lays the foundations for asking more difficult questions: What is an action? What is free will? Should robots be programmed to have this illusion?