CAUSAL INFERENCE IN STATISTICS


THE MATHEMATICS OF CAUSAL INFERENCE IN STATISTICS
Judea Pearl, University of California, Los Angeles (www.cs.ucla.edu/~judea)

The subject of my lecture this evening is CAUSALITY. It is not an easy topic to speak about, but it is a fun topic to speak about. It is not easy because, like religion, sex and intelligence, causality was meant to be practiced, not analyzed. And it is fun because, like religion, sex and intelligence, emotions run high, examples are plentiful, there are plenty of interesting people to talk to, and above all, there is the exhilarating experience of watching our private thoughts magnified under the microscope of formal analysis.

OUTLINE
Modeling: Statistical vs. Causal
Causal Models and Identifiability
Inference to three types of claims:
1. Effects of potential interventions
2. Claims about attribution (responsibility)
3. Claims about direct and indirect effects

TRADITIONAL STATISTICAL INFERENCE PARADIGM
Data -> Inference -> Q(P) (aspects of P), where P is the joint distribution.
e.g., infer whether customers who bought product A would also buy product B: Q = P(B | A).

FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES
Probability and statistics deal with static relations: Data -> P (joint distribution) -> Inference -> Q(P) (aspects of P).
Causal analysis asks: what happens when P changes?
e.g., infer whether customers who bought product A would still buy A if we were to double the price.

FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES
What remains invariant when P changes, say, to satisfy P(price = 2) = 1?
Note: the post-change P(v) is not P(v | price = 2); P itself does not tell us how it ought to change.
e.g., curing symptoms vs. curing diseases
e.g., analogy: mechanical deformation

FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES (CONT)
CAUSAL: Spurious correlation; Randomization; Confounding / effect; Instrument; Holding constant; Explanatory variables.
STATISTICAL: Regression; Association / independence; "Controlling for" / conditioning; Odds and risk ratios; Collapsibility; Propensity score.
Causal and statistical concepts do not mix.

FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES (CONT)
No causes in – no causes out (Cartwright, 1989):
  statistical assumptions + data + causal assumptions => causal conclusions.
Causal assumptions cannot be expressed in the mathematical language of standard statistics.

FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES (CONT)
Expressing causal assumptions requires non-standard mathematics:
  Structural equation models (Wright, 1920; Simon, 1960)
  Counterfactuals (Neyman–Rubin: Y_x; Lewis: x □→ Y)

WHY CAUSALITY NEEDS SPECIAL MATHEMATICS
SEM equations are non-algebraic. The model
  Y = 2X
  X = 1
carries process information, while its solution, X = 1, Y = 2, carries only static information.
Had X been 3, Y would be 6; if we raise X to 3, Y would be 6. To compute this we must "wipe out" the equation X = 1.
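To make "wiping out" concrete, here is a minimal sketch (hypothetical code of my own, not from the lecture) of the two-equation model and its mutilation:

```python
# Why SEM equations are non-algebraic: intervening means replacing the
# equation for X, not conditioning on X's observed value.

def observe():
    X = 1          # equation: X = 1
    Y = 2 * X      # equation: Y = 2X (the mechanism for Y)
    return X, Y

def do(x):
    X = x          # equation X = 1 is "wiped out" and replaced by X = x
    Y = 2 * X      # the mechanism for Y is left untouched
    return X, Y

print(observe())   # (1, 2): static information
print(do(3))       # (3, 6): "if we raise X to 3, Y would be 6"
```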

FROM STATISTICAL TO CAUSAL ANALYSIS: 2. THE MENTAL BARRIERS
Every exercise of causal analysis must rest on untested, judgmental causal assumptions.
Every exercise of causal analysis must invoke non-standard mathematical notation.

TWO PARADIGMS FOR CAUSAL INFERENCE
Observed: P(X, Y, Z, ...). Conclusions needed: P(Y_x = y), P(X_y = x | Z = z), ...
How do we connect observables X, Y, Z, ... to counterfactuals Y_x, X_z, Z_y, ...?
N-R model: counterfactuals are primitives, new variables; a super-distribution P*(X, Y, ..., Y_x, X_z, ...); X, Y, Z constrain Y_x, Z_y, ...
Structural model: counterfactuals are derived quantities; subscripts modify a data-generating model.

THE STRUCTURAL MODEL PARADIGM
Data + data-generating model M -> Inference -> Q(M) (aspects of M).
M is an oracle for computing answers to queries Q.
e.g., infer whether customers who bought product A would still buy A if we were to double the price.

FAMILIAR CAUSAL MODEL: AN ORACLE FOR MANIPULATION
[Figure: a circuit diagram with inputs X, Y, Z and an output.]
Here is a causal model we all remember from high school: a circuit diagram. There are four interesting points to notice in this example:
(1) It qualifies as a causal model because it contains the information needed to confirm or refute all action, counterfactual and explanatory sentences concerning the operation of the circuit. For example, anyone can figure out what the output would be if we set Y to zero, or if we changed this OR gate to a NOR gate, or if we performed any of the billions of combinations of such actions.
(2) The logical function (the Boolean input-output relation) alone is insufficient for answering such queries.
(3) These actions were not specified in advance; they do not have special names and they do not show up in the diagram. In fact, the great majority of the action queries that this circuit can answer were never considered by its designer.
(4) So how does the circuit encode this extra information? Through two encoding tricks:
  4.1 The symbolic units correspond to stable physical mechanisms (i.e., the logical gates).
  4.2 Each variable has precisely one mechanism that determines its value.

STRUCTURAL CAUSAL MODELS
Definition: A structural causal model is a 4-tuple <V, U, F, P(u)>, where
  V = {V1, ..., Vn} are observable variables,
  U = {U1, ..., Um} are background variables,
  F = {f1, ..., fn} are functions determining V: vi = fi(v, u),
  P(u) is a distribution over U.
P(u) and F induce a distribution P(v) over the observable variables.
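A minimal sketch of this 4-tuple in code, illustrating the definition. It assumes an acyclic model whose functions are listed in a valid evaluation order; the class and names are mine, not part of the lecture:

```python
import random

class SCM:
    """A structural causal model <V, U, F, P(u)> (minimal sketch)."""

    def __init__(self, F, P_u):
        self.F = F        # F: maps each observable V_i to f_i(v, u)
        self.P_u = P_u    # P(u): a sampler for the background variables

    def solve(self, u, do=None):
        """Solve for all observables given u; `do` overrides equations."""
        do = do or {}
        v = {}
        for name, f in self.F.items():   # assumes topological order
            v[name] = do[name] if name in do else f(v, u)
        return v

# Example: X := U1, Y := 2X + U2
model = SCM(
    F={"X": lambda v, u: u["U1"],
       "Y": lambda v, u: 2 * v["X"] + u["U2"]},
    P_u=lambda: {"U1": random.gauss(0, 1), "U2": random.gauss(0, 1)},
)
print(model.solve(model.P_u()))   # one draw from the induced P(v)
```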

CAUSAL MODELS AND COUNTERFACTUALS
Definition: The sentence "Y would be y (in situation u), had X been x," denoted Y_x(u) = y, means: the solution for Y in the mutilated model M_x (i.e., the equation for X replaced by X = x), with input U = u, is equal to y.
Joint probabilities of counterfactuals: the super-distribution P* is derived from M.
Parsimonious, consistent, and transparent.
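Continuing the SCM sketch above, a counterfactual Y_x(u) is literally the solution in the mutilated model, holding the background state u fixed:

```python
u = model.P_u()                          # the "situation" u, fixed across worlds
factual = model.solve(u)                 # what actually happened
y_x = model.solve(u, do={"X": 3})["Y"]   # Y_x(u): Y had X been 3, same u
print(factual["Y"], y_x)
```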

APPLICATIONS
1. Predicting effects of actions and policies
2. Learning causal relationships from assumptions and data
3. Troubleshooting physical systems and plans
4. Finding explanations for reported events
5. Generating verbal explanations
6. Understanding causal talk
7. Formulating theories of causal thinking

Let us talk about item 1 for a minute. We saw that if we have a causal model M, then predicting the ramifications of an action is trivial: mutilate and solve. If instead of a complete model we only have a probabilistic model, it is again trivial: we mutilate and propagate probabilities in the resulting causal network. The important point is that we can specify knowledge using causal vocabulary and can handle actions that are specified as modalities. But what if we do not have even a probabilistic model? This is where item 2 comes in. In certain applications we are lucky to have data that may supplement missing fragments of the model, and the question is whether the available data are sufficient for computing the effect of actions. Let us illustrate this possibility in a simple example taken from economics:

AXIOMS OF CAUSAL COUNTERFACTUALS
Y_x(u) = y: Y would be y, had X been x (in state U = u).
The axioms: definiteness, uniqueness, effectiveness, composition, reversibility.
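For reference, the three substantive axioms can be written as follows (my reconstruction after Galles and Pearl, 1998; consult the original for the exact side conditions):

```latex
\text{Effectiveness:}\quad X_{xw}(u) = x \\
\text{Composition:}\quad W_{x}(u) = w \;\Rightarrow\; Y_{xw}(u) = Y_{x}(u) \\
\text{Reversibility:}\quad \big(Y_{xw}(u) = y \,\wedge\, W_{xy}(u) = w\big) \;\Rightarrow\; Y_{x}(u) = y
```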

RULES OF CAUSAL CALCULUS
Rule 1 (ignoring observations): P(y | do{x}, z, w) = P(y | do{x}, w)
Rule 2 (action/observation exchange): P(y | do{x}, do{z}, w) = P(y | do{x}, z, w)
Rule 3 (ignoring actions): P(y | do{x}, do{z}, w) = P(y | do{x}, w)
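The slide omits the licensing conditions; each rule applies only under a graphical test (Pearl, 1995). Writing G_{X̄} for G with arrows into X deleted and G_{Z̲} for G with arrows out of Z deleted:

```latex
\text{Rule 1: } P(y \mid \hat{x}, z, w) = P(y \mid \hat{x}, w)
  \;\text{ if }\; (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
\text{Rule 2: } P(y \mid \hat{x}, \hat{z}, w) = P(y \mid \hat{x}, z, w)
  \;\text{ if }\; (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
\text{Rule 3: } P(y \mid \hat{x}, \hat{z}, w) = P(y \mid \hat{x}, w)
  \;\text{ if }\; (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
```

Here Z(W) is the set of Z-nodes that are not ancestors of any W-node in G_{X̄}.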

DERIVATION IN CAUSAL CALCULUS
Model: Smoking -> Tar -> Cancer, with an unobserved Genotype confounding Smoking and Cancer.

P(c | do{s})
  = Σ_t P(c | do{s}, t) P(t | do{s})                      [probability axioms]
  = Σ_t P(c | do{s}, do{t}) P(t | do{s})                  [Rule 2]
  = Σ_t P(c | do{s}, do{t}) P(t | s)                      [Rule 2]
  = Σ_t P(c | do{t}) P(t | s)                             [Rule 3]
  = Σ_t Σ_s' P(c | do{t}, s') P(s' | do{t}) P(t | s)      [probability axioms]
  = Σ_t Σ_s' P(c | t, s') P(s' | do{t}) P(t | s)          [Rule 2]
  = Σ_t Σ_s' P(c | t, s') P(s') P(t | s)                  [Rule 3]

Here we see how one can prove that the effect of smoking on cancer can be determined from data on three variables: Smoking, Tar and Cancer. The question boils down to computing P(cancer) under the hypothetical action do(smoking), from non-experimental data, namely, from expressions involving NO ACTIONS. Or: we need to eliminate the "do" symbol from the initial expression. The elimination proceeds like the ordinary solution of an algebraic equation: at each stage, a new rule is applied, licensed by some subgraph of the diagram, until we eventually reach a formula involving only "white" symbols, meaning an expression computable from non-experimental data. Now, if I were not a modest person, I would say that this is an amazing result. Watch what is going on here: we are not given any information whatsoever on the hidden genotype; it may be continuous or discrete, unidimensional or multidimensional. Yet measuring an auxiliary variable, TAR, someplace else in the system enables us to predict what the world would be like in the hypothetical situation where people were free of the influence of this hidden genotype. Data on the visible allow us to infer the effects of the invisible. Moreover, a person can also figure out the answer to the question "I am about to smoke; should I?" I think it is amazing, because I cannot do this calculation in my head. It demonstrates the immense power of having a formal language in an area that many respectable scientists prefer to see handled by unaided judgment.
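The final do-free formula can be checked numerically. Below is a hypothetical sketch (toy parameters of my own, not from the lecture): simulate the Smoking -> Tar -> Cancer model with a hidden genotype, then compare the front-door estimate against a simulated intervention:

```python
import numpy as np

# Hypothetical toy model: hidden genotype u confounds smoking s and
# cancer c; tar t mediates s -> c (all binary).
rng = np.random.default_rng(0)
n = 1_000_000
u = rng.random(n) < 0.5                        # hidden genotype
s = rng.random(n) < np.where(u, 0.8, 0.2)      # genotype -> smoking
t = rng.random(n) < np.where(s, 0.9, 0.1)      # smoking -> tar
c = rng.random(n) < 0.1 + 0.6 * t + 0.2 * u    # tar, genotype -> cancer

# Front-door formula: P(c | do(s=1)) = sum_t P(t|s=1) sum_s' P(c|t,s') P(s')
est = 0.0
for tv in (0, 1):
    p_t = (t[s] == tv).mean()                  # P(t = tv | s = 1)
    inner = sum(c[(t == tv) & (s == sv)].mean() * (s == sv).mean()
                for sv in (0, 1))
    est += p_t * inner

# Ground truth by actually intervening: force s = 1, regenerate t and c
t_do = rng.random(n) < 0.9
c_do = rng.random(n) < 0.1 + 0.6 * t_do + 0.2 * u
print(f"front-door estimate: {est:.3f}   interventional truth: {c_do.mean():.3f}")
```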

THE BACK-DOOR CRITERION
Graphical test of identification: P(y | do(x)) is identifiable in G if there is a set Z of variables such that Z d-separates X from Y in G_X̲ (the graph G with all arrows emerging from X removed).
[Figure: example graphs G and G_X̲ over Z1, ..., Z6, X, Y, with an admissible set Z.]
Moreover, P(y | do(x)) = Σ_z P(y | x, z) P(z)   ("adjusting" for Z).
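A quick numerical illustration of the adjustment formula, using a hypothetical toy model of my own with one observed confounder Z:

```python
import numpy as np

# Hypothetical toy model: one observed confounder z of treatment x
# and outcome y (all binary).
rng = np.random.default_rng(1)
n = 1_000_000
z = rng.random(n) < 0.4                        # observed confounder
x = rng.random(n) < np.where(z, 0.7, 0.3)      # z -> treatment
y = rng.random(n) < 0.2 + 0.3 * x + 0.4 * z    # x, z -> outcome

# Back-door adjustment: P(y | do(x=1)) = sum_z P(y | x=1, z) P(z)
adjusted = sum(y[(x == 1) & (z == zv)].mean() * (z == zv).mean()
               for zv in (0, 1))
naive = y[x == 1].mean()                       # confounded estimate
truth = 0.2 + 0.3 + 0.4 * 0.4                  # E_z[P(y | do(x=1), z)]
print(f"adjusted: {adjusted:.3f}   naive: {naive:.3f}   truth: {truth:.3f}")
```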

RECENT RESULTS ON IDENTIFICATION
do-calculus is complete.
Complete graphical criterion for identifying causal effects (Shpitser and Pearl, 2006).
Complete graphical criterion for empirical testability of counterfactuals (Shpitser and Pearl, 2007).

DETERMINING THE CAUSES OF EFFECTS (The Attribution Problem)
Your Honor! My client (Mr. A) died BECAUSE he used that drug.
Court to decide if it is MORE PROBABLE THAN NOT that A would be alive BUT FOR the drug:
PN = P(A would be alive had he not used the drug | A is dead, took the drug) > 0.50

THE PROBLEM
Semantical problem: What is the meaning of PN(x, y), the probability that event y would not have occurred if it were not for event x, given that x and y did in fact occur?
Answer: computable from M.
Analytical problem: Under what conditions can PN(x, y) be learned from statistical data, i.e., observational, experimental, or combined?

TYPICAL THEOREMS (Tian and Pearl, 2000)
Bounds on PN given combined nonexperimental and experimental data.
Identifiability under monotonicity (combined data): PN equals a corrected excess risk ratio.
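The formulas themselves did not survive the transcript; my reconstruction of the Tian–Pearl results (see also Pearl, Causality, Ch. 9) is:

```latex
\max\left\{0,\; \frac{P(y) - P(y_{x'})}{P(x, y)}\right\}
\;\le\; \mathrm{PN} \;\le\;
\min\left\{1,\; \frac{P(y'_{x'}) - P(x', y')}{P(x, y)}\right\}
```

and, when Y is monotonic in X, PN is identified as the excess risk ratio plus a confounding correction:

```latex
\mathrm{PN} \;=\;
\underbrace{\frac{P(y \mid x) - P(y \mid x')}{P(y \mid x)}}_{\text{excess risk ratio}}
\;+\;
\underbrace{\frac{P(y \mid x') - P(y_{x'})}{P(x, y)}}_{\text{confounding correction}}
```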

CAN FREQUENCY DATA DECIDE LEGAL RESPONSIBILITY?

                  Experimental           Nonexperimental
                  do(x)     do(x')       x         x'
Deaths (y)           16        14        2         28
Survivals (y')      984       986      998        972
Total             1,000     1,000    1,000      1,000

Nonexperimental data: drug usage predicts longer life.
Experimental data: drug has negligible effect on survival.
Plaintiff: Mr. A is special.
  He actually died.
  He used the drug by choice.
Court to decide (given both data): Is it more probable than not that A would be alive but for the drug?

SOLUTION TO THE ATTRIBUTION PROBLEM
WITH PROBABILITY ONE: 1 ≤ P(y'_{x'} | x, y) ≤ 1, i.e., PN = 1.
Combined data tell more than each study alone.
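To see where "probability one" comes from, plug the table into the reconstructed lower bound above (treating the two nonexperimental arms as equally sized, so P(x) = 1/2):

```latex
P(x, y) = \tfrac{2}{2000} = 0.001,\qquad
P(y) = \tfrac{2 + 28}{2000} = 0.015,\qquad
P(y_{x'}) = \tfrac{14}{1000} = 0.014
\;\Rightarrow\;
\mathrm{PN} \;\ge\; \frac{0.015 - 0.014}{0.001} = 1
```

Since the upper bound is capped at 1, the bounds pin PN to exactly 1.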

EFFECT DECOMPOSITION
What is the semantics of direct and indirect effects?
What are their policy-making implications?
Can we estimate them from data? Experimental data?

WHY DECOMPOSE EFFECTS?
Direct (or indirect) effects may be more transportable.
Indirect effects may be prevented or controlled.
Direct (or indirect) effects may be forbidden.
[Diagrams: Pill -> Thrombosis directly, and Pill -> Pregnancy -> Thrombosis indirectly; Gender -> Hiring directly, and Gender -> Qualification -> Hiring indirectly.]

SEMANTICS BECOMES NONTRIVIAL IN NONLINEAR MODELS
(even when the model is completely specified)
Model with mediator Z between X and Y:
  z = f(x, ε1)
  y = g(x, z, ε2)
Is the direct effect dependent on z? Is it void of operational meaning?

THE OPERATIONAL MEANING OF DIRECT EFFECTS
Model: z = f(x, ε1), y = g(x, z, ε2).
"Natural" direct effect of X on Y: the expected change in Y per unit change of X, when we keep Z constant at whatever value it attained before the change.
In linear models, NDE = controlled direct effect.

THE OPERATIONAL MEANING OF INDIRECT EFFECTS
Model: z = f(x, ε1), y = g(x, z, ε2).
"Natural" indirect effect of X on Y: the expected change in Y when we keep X constant, say at x0, and let Z change to whatever value it would have attained under a unit change in X.
In linear models, NIE = TE - DE (total effect minus direct effect).
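In counterfactual notation, these definitions read (Pearl, 2001; for a transition from x0 to x1):

```latex
\mathrm{NDE} = E\big[Y_{x_1,\, Z_{x_0}}\big] - E\big[Y_{x_0}\big]
\qquad\qquad
\mathrm{NIE} = E\big[Y_{x_0,\, Z_{x_1}}\big] - E\big[Y_{x_0}\big]
```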

POLICY IMPLICATIONS OF INDIRECT EFFECTS
What is the indirect effect of X on Y? e.g., the effect of Gender on Hiring if sex discrimination is eliminated.
[Diagram: X (GENDER) -> Z (QUALIFICATION) -> Y (HIRING), with the direct path X -> Y ignored (its mechanism f deactivated).]

SEMANTICS AND IDENTIFICATION OF NESTED COUNTERFACTUALS
Consider the quantity Q = E[Y_{x, Z_{x*}}].
Given <M, P(u)>, Q is well defined: given u, Z_{x*}(u) is the solution for Z in M_{x*}; call it z; then Y_{xz}(u) is the solution for Y in M_{xz}.
Can Q be estimated from data?

GENERAL PATH-SPECIFIC EFFECTS (Def.)
[Diagram: a model over X, W, Z, Y with an active subgraph g, and z* = Z_{x*}(u).]
Form a new model specific to the active subgraph g.
Definition: the g-specific effect is the effect of X on Y computed in that model.
Nonidentifiable even in Markovian models.

EFFECT DECOMPOSITION SUMMARY
Graphical conditions for estimability from experimental / nonexperimental data.
The graphical conditions hold in Markovian models.
Useful in answering a new type of policy question, involving mechanism blocking instead of variable fixing.

CONCLUSIONS
Structural-model semantics, enriched with logic and graphs, provides:
  a complete formal basis for causal reasoning;
  a powerful and friendly causal calculus.
It lays the foundations for asking more difficult questions: What is an action? What is free will? Should robots be programmed to have this illusion?