THE MATHEMATICS OF CAUSE AND EFFECT: Thinking Nature and Talking Counterfactuals Judea Pearl Departments of Computer Science and Statistics UCLA
OUTLINE
1. From Turing test to Bayes networks
2. From Bayes networks to do-calculus
3. From messy science to counterfactuals
4. From counterfactuals to practical victories
   a) policy evaluation
   b) attribution
   c) mediation
   d) generalizability – external validity
   e) new thrills – missing data
CAN MACHINES THINK? Alan M. Turing (1912 – 1954) The Turing Test “Computing Machinery and Intelligence” (1950) Turing: Yes, if it acts like it thinks. Acts = It answers non-trivial questions about a story, a topic or a situation?
HOW TURING ENVISIONED THE TEST CONVERSATION
Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34,957 and 70,764.
A: (Pause about 30 seconds and then give an answer) 105,721.
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
A: (After a pause of 15 seconds) R-R8: mate!
A “MINI” TURING TEST IN CAUSAL CONVERSATION
Input: Story. Question: What is? What if? Why? Answers: I believe that...
The Story (image adapted from Saygin, 2000)
Q1: If the season is dry, and the pavement is slippery, did it rain?
A1: Unlikely; it is more likely that the sprinkler was ON, with a very slight possibility that the pavement is not even wet.
Q2: But what if we SEE that the sprinkler is OFF?
A2: Then it is more likely that it rained.
Q3: Do you mean that if we actually turn the sprinkler ON, the rain will be less likely?
A3: No, the likelihood of rain would remain the same but the pavement will surely get wet.
Q4: Suppose we SEE that the sprinkler is ON and the pavement wet. What if the sprinkler were OFF?
A4: The pavement would be dry, because the season is likely dry.
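The difference between Q2 (seeing) and Q3 (doing) can be sketched numerically. The snippet below is only an illustration under stated assumptions: the probabilities, the names `P_SEASON`, `P_RAIN`, `P_SPRINKLER`, and the helpers `joint` and `p_rain` are hypothetical and do not come from the talk, and the pavement is left out because both questions concern rain only.

```python
from itertools import product

# Hypothetical parameters (not from the talk); season is 'dry' or 'wet'.
P_SEASON = {"dry": 0.6, "wet": 0.4}
P_RAIN = {"dry": 0.1, "wet": 0.7}        # P(rain=1 | season)
P_SPRINKLER = {"dry": 0.8, "wet": 0.1}   # P(sprinkler=1 | season)


def joint(do_sprinkler=None):
    """Return {(season, rain, sprinkler): probability}.

    With do_sprinkler set, the sprinkler's own mechanism is deleted and the
    variable is held fixed; every other mechanism is left untouched.
    """
    table = {}
    for season, rain, spr in product(P_SEASON, (0, 1), (0, 1)):
        p = P_SEASON[season]
        p *= P_RAIN[season] if rain else 1 - P_RAIN[season]
        if do_sprinkler is None:
            p *= P_SPRINKLER[season] if spr else 1 - P_SPRINKLER[season]
        else:
            p *= 1.0 if spr == do_sprinkler else 0.0
        table[(season, rain, spr)] = p
    return table


def p_rain(table, sprinkler=None):
    """P(rain=1), optionally conditioned on a sprinkler value."""
    rows = {k: v for k, v in table.items()
            if sprinkler is None or k[2] == sprinkler}
    total = sum(rows.values())
    return sum(v for k, v in rows.items() if k[1] == 1) / total


prior = p_rain(joint())                                # no evidence
seen_off = p_rain(joint(), sprinkler=0)                # Q2: we SEE it OFF
done_on = p_rain(joint(do_sprinkler=1), sprinkler=1)   # Q3: we TURN it ON
```

With these assumed numbers, seeing the sprinkler OFF raises the probability of rain (it is evidence about the season), while turning it ON leaves the probability of rain at its prior value, which is the pattern in A2 and A3.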
SEARLE’S CHINESE ROOM ARGUMENT
WHAT’S IN SEARLE’S RULE BOOK? Searle's oversight: there are not enough molecules in the universe to make the book. Even for the sprinkler example. Why causal conversation.
IS PARSIMONY NECESSARY (SUFFICIENT) FOR UNDERSTANDING? Understanding requires translating world constraints into a grammar (constraints over symbol strings) and harnessing it to answer queries swiftly and reliably. Parsimony can only be achieved by exploiting the constraints in the world to beat the combinatorial explosion.
THE PLURALITY OF MINI TURING TESTS Data-intensive Scientific applications Robotics Human Cognition and Ethics Thousands of Hungry and aimless customers Scientific thinking Turing Test Causal Reasoning Poetry Medicine Chess Stock market....
THE PLURALITY OF MINI TURING TESTS Human Cognition and Ethics Turing Test Causal Reasoning Poetry Medicine Chess Stock market....
Causal Explanation: “She handed me the fruit and I ate.” “The serpent deceived me, and I ate.”
COUNTERFACTUALS AND OUR SENSE OF JUSTICE Abraham: Are you about to smite the righteous with the wicked? What if there were fifty righteous men in the city? And the Lord said, “If I find in the city of Sodom fifty good men, I will pardon the whole place for their sake.” Genesis 18:26
THE PLURALITY OF MINI TURING TESTS Human Cognition and Ethics Scientific thinking Turing Test Causal Reasoning Poetry Medicine Chess Stock market....
WHY PHYSICS IS COUNTERFACTUAL
Scientific equations (e.g., Hooke’s Law) are non-algebraic.
e.g., Length (Y) equals a constant (2) times the weight (X): Y = 2X, X = 1. The solution: X = 1, Y = 2.
Alternative (same solution): X = ½ Y, Y = X + 1.
Process information: Had X been 3, Y would be 6. (Or:) If we raise X to 3, Y would be 6. Must “wipe out” X = 1 and put X = 3 in its place.
Correct notation: Y := 2X (or Y ← 2X), X = 1.
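A small sketch of why the assignment reading matters. The function names are hypothetical, and reading "raise X to 3" in the alternative system as "replace its equation for X by X = 3" is an interpretation made here for illustration, not something stated on the slide.

```python
def solve_structural(x):
    """Y := 2X read as an assignment: Y listens to X, never the reverse."""
    return {"X": x, "Y": 2 * x}


def solve_alternative(x):
    """The system X = Y/2, Y = X + 1 with the equation for X wiped out
    and replaced by X = x (assumed reading of the 'Alternative')."""
    return {"X": x, "Y": x + 1}


# Both systems agree on the observed solution X = 1, Y = 2 ...
same_solution = solve_structural(1) == solve_alternative(1)

# ... but they disagree about what happens when X is raised to 3.
y_structural = solve_structural(3)["Y"]
y_alternative = solve_alternative(3)["Y"]
```

The two systems are algebraically indistinguishable at the observed point, yet only the assignment form Y := 2X supports the answer "Y would be 6".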
THE PLURALITY OF MINI TURING TESTS Robotics Human Cognition and Ethics Scientific thinking Turing Test Causal Reasoning Poetry Medicine Chess Stock market....
CAUSATION AS A PROGRAMMER'S NIGHTMARE
Input:
1. “If the grass is wet, then it rained”
2. “If we break this bottle, the grass will get wet”
Output: “If we break this bottle, then it rained”
WHAT KIND OF QUESTIONS SHOULD THE ROBOT ANSWER? THE CAUSAL HIERARCHY (SYNTACTIC DISTINCTION)
Observational questions (What is?): “What if we see A?” P(y | A)
Action questions (What if?): “What if we do A?” P(y | do(A))
Counterfactual questions (Why?): “What if we did things differently?” P(y_A' | A)
Options: “With what probability?”
STRUCTURAL CAUSAL MODELS: THE WORLD AS A COLLECTION OF SPRINGS
Definition: A structural causal model is a 4-tuple <V, U, F, P(u)>, where
V = {V_1,...,V_n} are endogenous variables
U = {U_1,...,U_m} are background variables
F = {f_1,...,f_n} are functions determining V, v_i = f_i(v, u)
P(u) is a distribution over U
P(u) and F induce a distribution P(v) over observable variables.
e.g., [example not preserved]
TRADITIONAL STATISTICAL INFERENCE PARADIGM
P (joint distribution) → Data → Inference → Q(P) (aspects of P)
e.g., Infer whether customers who bought product A would also buy product B. Q = P(B | A)
THE STRUCTURAL MODEL PARADIGM
M (data generating model) → Joint distribution → Data → Inference → Q(M) (aspects of M)
M – Invariant strategy (mechanism, recipe, law, protocol) by which Nature assigns values to variables in the analysis.
“A painful de-crowning of a beloved oracle!”
COUNTERFACTUALS ARE EMBARRASSINGLY SIMPLE
Definition: The sentence “Y would be y (in situation u), had X been x,” denoted Y_x(u) = y, means: the solution for Y in a mutilated model M_x (i.e., the equations for X replaced by X = x) with input U = u is equal to y.
The Fundamental Equation of Counterfactuals: Y_x(u) = Y_{M_x}(u)
READING COUNTERFACTUALS FROM SEM
Data shows: A student named Joe, measured X = 0.5, Z = 1.0, Y = 1.9.
Q1: What would Joe’s score be had he doubled his study time?
Answer: 2.30
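The answer for Joe can be reproduced with the three-step recipe of abduction, action, and prediction. The structural equations did not survive in the text, so the model below is an assumption: a linear model Z = A·X + U_Z, Y = B·X + C·Z + U_Y with hypothetical coefficients A = 0.5, B = 0.7, C = 0.4, and Z read as study time. The coefficients were chosen to be consistent with the stated answer of 2.30; only C affects that answer.

```python
# Assumed linear structural model (coefficients are hypothetical):
#   X = U_X
#   Z = A * X + U_Z          (Z read here as study time)
#   Y = B * X + C * Z + U_Y  (Y is the score)
A, B, C = 0.5, 0.7, 0.4


def abduct(x, z, y):
    """Step 1 (abduction): recover Joe's background variables from the data."""
    return {"u_x": x, "u_z": z - A * x, "u_y": y - B * x - C * z}


def predict(u, do_z=None):
    """Steps 2 and 3 (action, prediction): optionally replace the equation
    for Z by Z = do_z, then solve the model with the same u."""
    x = u["u_x"]
    z = A * x + u["u_z"] if do_z is None else do_z
    y = B * x + C * z + u["u_y"]
    return {"X": x, "Z": z, "Y": y}


joe = abduct(x=0.5, z=1.0, y=1.9)
factual = predict(joe)                   # reproduces the observed data
counterfactual = predict(joe, do_z=2.0)  # study time doubled from 1.0 to 2.0
```

Under these assumptions `counterfactual["Y"]` evaluates to 2.30 up to floating-point rounding, matching the slide.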
THE TWO FUNDAMENTAL LAWS OF CAUSAL INFERENCE
1. The Law of Counterfactuals (M generates and evaluates all counterfactuals.)
2. The Law of Conditional Independence (d-separation) (Separation in the model ⇒ independence in the distribution.)
THE LAW OF CONDITIONAL INDEPENDENCE
Graph: C (Climate) → R (Rain), C → S (Sprinkler), R → W (Wetness), S → W, with background variables U_1, U_2, U_3, U_4.
Each function summarizes millions of micro processes. Still, if the U's are independent, the observed distribution P(C,R,S,W) must satisfy certain constraints that are: (1) independent of the f's and of P(U) and (2) can be read from the structure of the graph.
D-SEPARATION: NATURE’S LANGUAGE FOR COMMUNICATING ITS STRUCTURE
Graph nodes: C (Climate), R (Rain), S (Sprinkler), W (Wetness).
Every missing arrow advertises an independency, conditional on a separating set, e.g., S ⊥ R | C and C ⊥ W | (S, R).
Applications:
1. Structure learning
2. Model testing
3. Reducing "what if I do" questions to symbolic calculus
4. Reducing scientific questions to symbolic calculus
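The two independencies named above can be checked by brute force on a small discrete model. This is a sketch under assumptions: the graph is taken to be C → R, C → S, R → W, S → W, all variables are binary, and the conditional probability tables and the helper names `bern`, `prob`, and `independent` are hypothetical.

```python
from itertools import product


def bern(p, value):
    """Probability of a binary value when P(value = 1) = p."""
    return p if value else 1 - p


# Hypothetical conditional probability tables for C -> {R, S} -> W.
P_C = 0.3
P_R = {0: 0.2, 1: 0.8}                                        # P(R=1 | C)
P_S = {0: 0.6, 1: 0.1}                                        # P(S=1 | C)
P_W = {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 0.99}   # P(W=1 | R, S)

NAMES = "CRSW"
JOINT = {
    (c, r, s, w): bern(P_C, c) * bern(P_R[c], r)
    * bern(P_S[c], s) * bern(P_W[(r, s)], w)
    for c, r, s, w in product((0, 1), repeat=4)
}


def prob(**fixed):
    """Marginal probability of the given assignments, e.g. prob(C=1, W=0)."""
    idx = {NAMES.index(k): v for k, v in fixed.items()}
    return sum(p for key, p in JOINT.items()
               if all(key[i] == v for i, v in idx.items()))


def independent(a, b, given):
    """Test a ⊥ b | given via P(a,b,g) P(g) == P(a,g) P(b,g) for all values."""
    for vals in product((0, 1), repeat=2 + len(given)):
        va, vb, vg = vals[0], vals[1], dict(zip(given, vals[2:]))
        lhs = prob(**{a: va, b: vb}, **vg) * prob(**vg)
        rhs = prob(**{a: va}, **vg) * prob(**{b: vb}, **vg)
        if abs(lhs - rhs) > 1e-12:
            return False
    return True
```

For example, `independent("S", "R", ["C"])` and `independent("C", "W", ["S", "R"])` hold for any choice of tables, whereas `independent("S", "R", [])` fails for the tables above. The first two are the constraints that can be read off the graph without knowing the functions.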
SEEING VS. DOING Effect of turning the sprinkler ON
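The formulas for this slide are not preserved. As a hedged reconstruction, assuming the sprinkler graph C → R, C → S, R → W, S → W, the standard way to write the contrast is the truncated factorization: doing removes the factor for S, while seeing renormalizes the full joint.

```latex
\begin{align*}
\text{Pre-intervention:}\quad
  P(c, r, s, w) &= P(c)\, P(r \mid c)\, P(s \mid c)\, P(w \mid r, s) \\
\text{Doing:}\quad
  P(c, r, w \mid do(S = \mathrm{ON})) &= P(c)\, P(r \mid c)\, P(w \mid r, S = \mathrm{ON}) \\
\text{Seeing:}\quad
  P(c, r, w \mid S = \mathrm{ON}) &= P(c \mid S = \mathrm{ON})\, P(r \mid c)\, P(w \mid r, S = \mathrm{ON})
\end{align*}
```

The only difference is the first factor: an observation of S changes our belief about C, an intervention on S does not.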
THE FIVE NECESSARY STEPS FOR CAUSAL INFERENCE
Define: Express the target quantity Q as a property of the model M.
Assume: Express causal assumptions in structural or graphical form.
Identify: Determine if Q is identifiable.
Estimate: Estimate Q if it is identifiable; approximate it, if it is not.
Test: If M has testable implications, test them.
THE FIVE NECESSARY STEPS (Define, Assume, Identify, Estimate, Test) are then applied in turn to: effect estimation, average treatment effect, dynamic policy analysis, time-varying policy analysis, treatment on treated, indirect effects, and the passage from definition to assumptions.
THE LOGIC OF CAUSAL ANALYSIS
Inputs: A – causal assumptions, encoded in the CAUSAL MODEL (M_A); Q – queries of interest; Data (D).
Causal inference yields: A* – logical implications of A; Q(P) – identified estimands; T(M_A) – testable implications.
Statistical inference yields: Q^ – estimates of Q(P); goodness of fit.
Outputs: model testing; provisional claims.
THE MACHINERY OF CAUSAL CALCULUS
Rule 1 (Ignoring observations): P(y | do(x), z, w) = P(y | do(x), w)
Rule 2 (Action/observation exchange): P(y | do(x), do(z), w) = P(y | do(x), z, w)
Rule 3 (Ignoring actions): P(y | do(x), do(z), w) = P(y | do(x), w)
Completeness Theorem (Shpitser, 2006)
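Each rule is licensed by a separation condition in a modified graph. The conditions are not preserved in the text; the block below states them in the form in which the do-calculus is usually presented. Here G_{\overline{X}} denotes G with arrows into X removed, G_{\underline{Z}} denotes G with arrows out of Z removed, and Z(W) is the set of Z-nodes that are not ancestors of any W-node in G_{\overline{X}}.

```latex
\begin{align*}
\text{Rule 1:}\quad & P(y \mid do(x), z, w) = P(y \mid do(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
\text{Rule 2:}\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
\text{Rule 3:}\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{align*}
```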
DERIVATION IN CAUSAL CALCULUS Smoking Tar Cancer Probability Axioms Rule 2 Rule 3 Rule 2 Genotype (Unobserved)
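The derivation itself is lost with the figure. One standard route, assuming the graph Smoking (s) → Tar (t) → Cancer (c) with an unobserved Genotype affecting both Smoking and Cancer, is sketched below; the step labels are a reconstruction and may not match the slide line by line.

```latex
\begin{align*}
P(c \mid do(s))
 &= \sum_t P(c \mid do(s), t)\, P(t \mid do(s)) && \text{probability axioms} \\
 &= \sum_t P(c \mid do(s), do(t))\, P(t \mid do(s)) && \text{Rule 2} \\
 &= \sum_t P(c \mid do(s), do(t))\, P(t \mid s) && \text{Rule 2} \\
 &= \sum_t P(c \mid do(t))\, P(t \mid s) && \text{Rule 3} \\
 &= \sum_t \sum_{s'} P(c \mid do(t), s')\, P(s' \mid do(t))\, P(t \mid s) && \text{probability axioms} \\
 &= \sum_t \sum_{s'} P(c \mid t, s')\, P(s' \mid do(t))\, P(t \mid s) && \text{Rule 2} \\
 &= \sum_t \sum_{s'} P(c \mid t, s')\, P(s')\, P(t \mid s) && \text{Rule 3}
\end{align*}
```

The last line contains no do-operator, so the effect of smoking on cancer is estimable from observational data on s, t, and c alone.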
EFFECT OF WARM-UP ON INJURY (After Shrier & Platt, 2008) No, no!
DETERMINING CAUSES OF EFFECTS: A COUNTERFACTUAL VICTORY
Your Honor! My client (Mr. A) died BECAUSE he used that drug.
Court to decide if it is MORE PROBABLE THAN NOT that A would be alive BUT FOR the drug!
PN = P(? | A is dead, took the drug) > 0.50
THE ATTRIBUTION PROBLEM
Definition:
1. What is the meaning of PN(x,y): “Probability that event y would not have occurred if it were not for event x, given that x and y did in fact occur.”
Answer: Computable from M.
Identification:
2. Under what conditions can PN(x,y) be learned from statistical data, i.e., observational, experimental and combined?
ATTRIBUTION MATHEMATIZED (Tian and Pearl, 2000)
Bounds given combined nonexperimental and experimental data (P(y,x) and P(y_x), for all y and x).
Identifiability under monotonicity (combined data).
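The formulas are missing from the extracted text. The bounds usually attributed to Tian and Pearl (2000) take the following form, stated here as a reconstruction, with y_{x'} denoting "y under do(x')":

```latex
\begin{align*}
\max\left\{0,\ \frac{P(y) - P(y_{x'})}{P(x, y)}\right\}
 \ \le\ PN \ \le\
\min\left\{1,\ \frac{P(y'_{x'}) - P(x', y')}{P(x, y)}\right\}
\end{align*}
```

Under monotonicity the lower bound is attained:

```latex
\begin{align*}
PN \;=\; \frac{P(y) - P(y_{x'})}{P(x, y)}
   \;=\; \frac{P(y \mid x) - P(y \mid x')}{P(y \mid x)}
   \;+\; \frac{P(y \mid x') - P(y_{x'})}{P(x, y)}
\end{align*}
```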
CAN FREQUENCY DATA DECIDE LEGAL RESPONSIBILITY?
Nonexperimental data: drug usage predicts longer life.
Experimental data: drug has negligible effect on survival.
Table: columns are Experimental do(x), do(x’) and Nonexperimental x, x’; rows are Deaths (y) and Survivals (y’); each column totals 1,000 (the individual counts are not preserved).
Plaintiff: Mr. A is special.
1. He actually died
2. He used the drug by choice
Court to decide (given both data): Is it more probable than not that A would be alive but for the drug?
SOLUTION TO THE ATTRIBUTION PROBLEM WITH PROBABILITY ONE Combined data tell more than each study alone
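How combined data can pin PN down to one can be illustrated with the bounds max{0, [P(y) − P(y_x')] / P(x,y)} ≤ PN ≤ min{1, [P(y'_x') − P(x',y')] / P(x,y)}. The counts below are illustrative assumptions chosen to match the qualitative story (drug users live longer observationally, the drug has negligible experimental effect); they are not recovered from the slide's table. The function name `pn_bounds` is hypothetical, and each arm is assumed to contain 1,000 subjects.

```python
from fractions import Fraction as F

# Illustrative death counts per 1,000 subjects (assumed, not from the slide).
N = 1000
DEATHS_DO_X, DEATHS_DO_XP = 16, 14   # experimental arms do(x), do(x')
DEATHS_X, DEATHS_XP = 2, 28          # observational groups x, x'


def pn_bounds(deaths_do_xp, deaths_x, deaths_xp, n=N):
    """Lower and upper bounds on PN from combined data, as exact fractions."""
    n_obs = 2 * n                                # both observational groups
    p_xy = F(deaths_x, n_obs)                    # P(x, y)
    p_y = F(deaths_x + deaths_xp, n_obs)         # P(y)
    p_xp_yp = F(n - deaths_xp, n_obs)            # P(x', y')
    p_y_do_xp = F(deaths_do_xp, n)               # P(y_{x'})
    lower = max(F(0), (p_y - p_y_do_xp) / p_xy)
    upper = min(F(1), ((1 - p_y_do_xp) - p_xp_yp) / p_xy)
    return lower, upper


# Experimental risk difference: small, i.e. "negligible effect on survival".
risk_difference = F(DEATHS_DO_X, N) - F(DEATHS_DO_XP, N)
lower, upper = pn_bounds(DEATHS_DO_XP, DEATHS_X, DEATHS_XP)
```

With these assumed counts both bounds equal 1, so PN = 1 even though neither study alone suggests the drug is harmful to those who chose it.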
MEDIATION: ANOTHER COUNTERFACTUAL TRIUMPH
Why decompose effects?
1. To understand how Nature works
2. To comply with legal requirements
3. To predict the effects of new types of interventions: signal re-routing and mechanism deactivating, rather than variable fixing
COUNTERFACTUAL DEFINITION OF INDIRECT EFFECTS
Model: X → Z → Y and X → Y, with z = f(x, u), y = g(x, z, u).
Indirect Effect of X on Y: The expected change in Y when we keep X constant, say at x_0, and let Z change to whatever value it would have attained had X changed to x_1.
In linear models, IE = TE - DE.
No Controlled Indirect Effect.
POLICY IMPLICATIONS OF INDIRECT EFFECTS
Model: X (GENDER) → Z (QUALIFICATION) → Y (HIRING), with the direct link X → Y marked IGNORE.
What is the indirect effect of X on Y? The effect of Gender on Hiring if sex discrimination is eliminated.
Deactivating a link – a new type of intervention.
THE MEDIATION FORMULAS IN UNCONFOUNDED MODELS
Model: X → Z → Y and X → Y, with z = f(x, u_1), y = g(x, z, u_2), u_1 independent of u_2.
Fraction of responses explained by mediation (sufficient).
Fraction of responses owed to mediation (necessary).
Complete identification conditions for confounded models with multiple mediators.
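The formulas themselves are not preserved. For the unconfounded model above, the Mediation Formula is usually written as follows (a reconstruction, for a change from x_0 to x_1):

```latex
\begin{align*}
DE &= \sum_z \big[ E(Y \mid x_1, z) - E(Y \mid x_0, z) \big]\, P(z \mid x_0) \\
IE &= \sum_z E(Y \mid x_0, z)\, \big[ P(z \mid x_1) - P(z \mid x_0) \big] \\
TE &= E(Y \mid x_1) - E(Y \mid x_0)
\end{align*}
```

In this notation, IE / TE is the fraction of responses explained by mediation (sufficient) and (TE − DE) / TE is the fraction owed to mediation (necessary). The two coincide in linear models, where IE = TE − DE.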
TRANSPORTABILITY OF KNOWLEDGE ACROSS DOMAINS (with E. Bareinboim)
1. A theory of causal transportability: When can causal relations learned from experiments be transferred to a different environment in which no experiment can be conducted?
2. A theory of statistical transportability: When can statistical information learned in one domain be transferred to a different domain in which
   a. only a subset of variables can be observed? Or,
   b. only a few samples are available?
MOTIVATION: WHAT CAN EXPERIMENTS IN LA TELL ABOUT NYC?
Experimental study in LA, over X (Intervention), Y (Outcome), Z (Age). Measured: [not preserved]
Observational study in NYC, over X (Observation), Y (Outcome), Z (Age). Measured: [not preserved]
Needed: [not preserved]
Transport Formula (calibration): [not preserved]
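When age is the only factor that differs between the two cities, the calibration usually given for this story is P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z): age-specific effects from the LA experiment are re-weighted by the NYC age distribution. The sketch below assumes that formula; all numbers and the names `EFFECT_BY_AGE`, `LA_AGE`, `NYC_AGE`, and `transport` are hypothetical.

```python
# Hypothetical age-specific effects P(y | do(x), z) measured in the LA experiment.
EFFECT_BY_AGE = {"young": 0.2, "old": 0.6}

# Hypothetical age distributions: P(z) in LA and P*(z) in NYC.
LA_AGE = {"young": 0.7, "old": 0.3}
NYC_AGE = {"young": 0.4, "old": 0.6}


def transport(effect_by_age, target_age_dist):
    """Re-weight age-specific effects by the target population's age mix."""
    return sum(effect_by_age[z] * target_age_dist[z] for z in target_age_dist)


effect_la = transport(EFFECT_BY_AGE, LA_AGE)     # what the LA study reports overall
effect_nyc = transport(EFFECT_BY_AGE, NYC_AGE)   # calibrated estimate for NYC
```

With these assumed numbers the overall effect is 0.32 in LA but 0.44 in NYC, so carrying the LA figure over unchanged would be biased.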
TRANSPORT FORMULAS DEPEND ON THE STORY
a) Z represents age
b) Z represents language skill
c) Z represents a bio-marker
[Figure: selection diagrams (a), (b), (c) over X, Y, Z, each with a selection node S marking factors producing differences.]
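The formulas for the three stories are not preserved. The ones usually given for them in the transportability literature are listed below as a reconstruction; they assume that Z is a pre-treatment covariate in (a), a proxy of age whose own distribution differs in (b), and a mediator between X and Y in (c).

```latex
\begin{align*}
\text{(a) age:}\quad
  & P^*(y \mid do(x)) = \sum_z P(y \mid do(x), z)\, P^*(z) \\
\text{(b) language skill:}\quad
  & P^*(y \mid do(x)) = P(y \mid do(x)) \\
\text{(c) bio-marker:}\quad
  & P^*(y \mid do(x)) = \sum_z P(y \mid do(x), z)\, P^*(z \mid x)
\end{align*}
```

The same disparity in P(z) therefore calls for three different corrections, depending on where S enters the diagram.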
GOAL: ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
INPUT: Annotated causal graph over X, Y, Z, U, V, W, T, with selection nodes S marking factors creating differences.
OUTPUT:
1. Transportable or not?
2. Measurements to be taken in the experimental study
3. Measurements to be taken in the target population
4. A transport formula
TRANSPORTABILITY REDUCED TO CALCULUS
Theorem: A causal relation R is transportable from ∏ to ∏* if and only if it is reducible, using the rules of do-calculus, to an expression in which S is separated from do( ).
[Figure: selection diagram over X, Y, Z, W with selection node S.]
RESULT: ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
INPUT: Annotated causal graph over X, Y, Z, U, V, W, T, with selection nodes S marking factors creating differences.
OUTPUT:
1. Transportable or not?
2. Measurements to be taken in the experimental study
3. Measurements to be taken in the target population
4. A transport formula
5. Completeness (Bareinboim, 2012)
WHICH MODEL LICENSES THE TRANSPORT OF THE CAUSAL EFFECT X → Y?
[Figure: six selection diagrams (a)-(f) over X, Y, Z, W, with S nodes marking external factors creating disparities; each diagram is answered Yes or No.]
STATISTICAL TRANSPORTABILITY (Transfer Learning)
Why should we transport statistical information? i.e., Why not re-learn things from scratch?
1. Measurements are costly. Limit measurements to a subset V* of variables called “scope”.
2. Samples are scarce. Pooling samples from diverse populations will improve precision, if differences can be filtered out.
STATISTICAL TRANSPORTABILITY
Definition (Statistical Transportability): A statistical relation R(P) is said to be transportable from ∏ to ∏* over V* if R(P*) is identified from P, P*(V*), and D, where P*(V*) is the marginal distribution of P* over a subset of variables V*.
Example (diagrams over X, Y, Z with selection node S): R = P*(y | x) is transportable over V* = {X, Z}, i.e., R is estimable without re-measuring Y.
Transfer Learning: If few samples (N_2) are available from ∏* and many samples (N_1) from ∏, then estimating R = P*(y | x) by [formula not preserved] achieves a much higher precision.
META-ANALYSIS OR MULTI-SOURCE LEARNING
Target population: R = P*(y | do(x))
[Figure: nine diagrams (a)-(i) over X, Y, Z, W, some with selection nodes S.]
CAN WE GET A BIAS-FREE ESTIMATE OF THE TARGET QUANTITY?
Target population: R = P*(y | do(x))
Is R identifiable from (d) and (h)?
R(∏*) is identifiable from studies (d) and (h).
R(∏*) is not identifiable from studies (d) and (i).
[Figure: diagrams (a), (d), (h), (i) over X, Y, Z, W, with selection nodes S.]
FROM META-ANALYSIS TO META-SYNTHESIS The problem: how to combine results of several experimental and observational studies, each conducted on a different population and under a different set of conditions, so as to construct an aggregate measure of effect size that is "better" than any one study in isolation.
META-SYNTHESIS REDUCED TO CALCULUS Theorem {∏ 1, ∏ 2,…,∏ K } – a set of studies. {D 1, D 2,…, D K } – selection diagrams (relative to ∏*). A relation R( ∏* ) is "meta estimable" if it can be decomposed into terms of the form: such that each Q k is transportable from D k. Open-problem: Systematic decomposition
BIAS VS. PRECISION IN META-SYNTHESIS
Principle 1: Calibrate estimands before pooling (to minimize bias).
Principle 2: Decompose to sub-relations before calibrating (to improve precision).
[Figure: diagrams (a), (d), (g), (h), (i) over X, Y, Z, W, combined through calibration, composition, and pooling.]
MISSING DATA: A SEEMINGLY STATISTICAL PROBLEM (Mohan & Pearl, 2012) Pervasive in every experimental science. Huge literature, powerful software industry, deeply entrenched culture. Current practices are based on statistical characterization (Rubin, 1976) of a problem that is inherently causal. Consequence: Like Alchemy before Boyle and Dalton, the field is craving for (1) theoretical guidance and (2) performance guarantees.
ESTIMATE P(X,Y,Z)
[Table: one row per sample, with observations X*, Y*, Z* (m = missing) and missingness indicators R_x, R_y, R_z; cell values are not preserved.]
WHAT CAUSAL THEORY CAN DO FOR MISSING DATA
Q-1. What should the world be like, for a given statistical procedure to produce the expected result?
Q-2. Can we tell from the postulated world whether any method can produce a bias-free result? How?
Q-3. Can we tell from data if the world does not work as postulated?
None of these questions can be answered by statistical characterization of the problem. All can be answered using causal models (existence, guarantees, algorithms, testable implications).
MISSING DATA: TWO PERSPECTIVES
Causal inference is a missing data problem. (Rubin 2012)
Missing data is a causal inference problem. (Pearl 2012)
Why is missingness a causal problem?
Which mechanism causes missingness makes a difference in whether / how we can recover information from the data.
Mechanisms require causal language to be properly described – statistics are not sufficient.
Different causal assumptions lead to different routines for recovering information from data, even when the assumptions are indistinguishable by any statistical means.
ESTIMATE P(X,Y,Z): Complete Cases
Line deletion estimate is generally biased.
ESTIMATE P(X,Y,Z)
[Figure: missingness graph over X, Y, Z and the indicators R_x, R_y, R_z.]
ESTIMATE P(X,Y,Z)
Compute P(Z | X, Y, R_x = 0, R_y = 0, R_z = 0) from the rows in which X, Y, and Z are all observed.
Compute P(X | Y, R_x = 0, R_y = 0) from the rows in which X and Y are observed.
Compute P(Y | R_y = 0) from the rows in which Y is observed.
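The three computations above amount to a sequential estimate P(X,Y,Z) = P(Z | X, Y, complete rows) · P(X | Y, rows with X and Y observed) · P(Y | rows with Y observed), where each factor uses a different, nested subset of rows. The sketch below only illustrates the bookkeeping: the records are hypothetical (the slide's table is not recoverable), the names `cond_prob` and `estimate` are made up here, and whether this product is an unbiased estimate depends on the missingness graph, which the code does not check.

```python
M = None  # marker for a missing value

# Hypothetical records (X*, Y*, Z*); not the table from the slide.
DATA = [
    (0, 1, 0), (1, 1, M), (M, 1, M), (M, 0, 1), (M, M, 0),
    (0, 1, 1), (0, 0, M), (1, 0, 1), (1, 1, 1), (0, 0, 0),
]


def cond_prob(rows, target, given):
    """Empirical P(target | given) within rows.

    target and given map a column index to the required value.
    """
    matching = [r for r in rows if all(r[i] == v for i, v in given.items())]
    if not matching:
        return 0.0
    hits = [r for r in matching if all(r[i] == v for i, v in target.items())]
    return len(hits) / len(matching)


def estimate(x, y, z, data=DATA):
    """Sequential estimate of P(X=x, Y=y, Z=z) from nested row subsets."""
    y_obs = [r for r in data if r[1] is not M]          # R_y = 0
    xy_obs = [r for r in y_obs if r[0] is not M]        # R_x = 0, R_y = 0
    xyz_obs = [r for r in xy_obs if r[2] is not M]      # all observed
    p_y = cond_prob(y_obs, {1: y}, {})
    p_x_given_y = cond_prob(xy_obs, {0: x}, {1: y})
    p_z_given_xy = cond_prob(xyz_obs, {2: z}, {0: x, 1: y})
    return p_z_given_xy * p_x_given_y * p_y


total = sum(estimate(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1))
```

Unlike line deletion, which would keep only the five complete rows, this order of computation uses nine rows for the factor on Y and seven for the factor on X.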
ESTIMATE P(X,Y,Z)
Statistically indistinguishable graphs, yet (a) permits recoverability, and (b) does not. Consulting the wrong graph leads to the wrong deletion order and biases the estimates.
[Figure: two missingness graphs (a) and (b) over X, Y, Z, R_x, R_y, R_z.]
CONCLUSIONS Counterfactuals are the building blocks of scientific thought, free will and moral behavior. The algorithmization of counterfactuals has benefited several problem areas in the empirical sciences, including policy evaluation, mediation analysis, generalizability, and credit / blame determination. This brings us a step closer to achieving cooperative behavior among computers and humans.
Thank you