THE MATHEMATICS OF CAUSE AND EFFECT: With Reflections on Machine Learning Judea Pearl Departments of Computer Science and Statistics UCLA.


OUTLINE
1. The causal revolution: from statistics to counterfactuals
2. The fundamental laws of causal inference
3. From counterfactuals to practical victories
   a) policy evaluation
   b) attribution
   c) mediation
   d) generalizability (external validity)
   e) missing data

TRADITIONAL STATISTICAL INFERENCE PARADIGM
Data drawn from a joint distribution P, inference about Q(P), aspects of P.
e.g., Infer whether customers who bought product A would also buy product B: Q = P(B | A).

FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES
How does P change to P′? A new oracle is needed: data drawn from the joint distribution P, inference about Q(P′), aspects of the changed distribution P′.
e.g., Estimate P′(cancer) if we ban smoking.

FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES
Data drawn from the joint distribution P, inference about Q(P′), aspects of the changed distribution P′.
e.g., Estimate the probability that a customer who bought A would buy B if we were to double the price.

THE STRUCTURAL MODEL PARADIGM
Data and a data-generating model M enter the inference engine, which returns Q(M), aspects of M.
M: an invariant strategy (mechanism, recipe, law, protocol) by which Nature assigns values to the variables in the analysis; the joint distribution is merely a by-product.
“A painful de-crowning of a beloved oracle!”

WHAT KIND OF QUESTIONS SHOULD THE ORACLE ANSWER? THE CAUSAL HIERARCHY
Observational questions, “What if we see A?” (What is?): P(y | A)
Action questions, “What if we do A?” (What if?): P(y | do(A))
Counterfactual questions, “What if we had done things differently?” (Why?): P(y_A′ | A)
Options: “With what probability?”
The three levels are syntactically distinct.

STRUCTURAL CAUSAL MODELS: THE WORLD AS A COLLECTION OF SPRINGS
Definition: A structural causal model is a 4-tuple M = ⟨V, U, F, P(u)⟩, where
V = {V1, ..., Vn} are endogenous variables,
U = {U1, ..., Um} are background (exogenous) variables,
F = {f1, ..., fn} are functions determining V: vi = fi(v, u),
P(u) is a distribution over U.
P(u) and F induce a distribution P(v) over the observable variables.
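The 4-tuple definition can be made concrete with a tiny sketch. The two variables, the linear functions, and the Gaussian background distribution below are all invented for illustration; only the shape of the construction (P(u) plus F inducing P(v)) comes from the definition above.

```python
import random

random.seed(0)

# A minimal structural causal model, in the sense of the 4-tuple above.
# The particular variables and functions are invented for illustration.
def sample_scm():
    """Draw one observation v = (x, y) from M = <V, U, F, P(u)>."""
    # P(u): the background variables are independent standard normals
    u1 = random.gauss(0, 1)
    u2 = random.gauss(0, 1)
    # F: each endogenous variable is determined by v_i = f_i(v, u)
    x = u1               # f_X(u1)
    y = 2 * x + u2       # f_Y(x, u2)
    return x, y

# P(u) and F together induce the distribution P(v) over the observables:
data = [sample_scm() for _ in range(10_000)]
```

Sampling U and then solving the equations is all it takes to generate P(v); nothing about the distribution is specified directly.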

COUNTERFACTUALS ARE EMBARRASSINGLY SIMPLE
Definition: The sentence “Y would be y (in situation u), had X been x,” denoted Y_x(u) = y, means: the solution for Y in the mutilated model M_x (i.e., the equation for X replaced by X = x), with input U = u, is equal to y.
The Fundamental Equation of Counterfactuals: Y_x(u) = Y_{M_x}(u).
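In code, the mutilated model M_x is just the same program with X's equation overridden by a constant. The two-variable model and its numbers below are invented for illustration; the mechanism (replace the equation for X, then solve) is the definition above.

```python
# Evaluating a counterfactual Y_x(u) by solving the mutilated model M_x:
# the equation for X is replaced by the constant X = x. The two-variable
# linear model and its numbers are invented for illustration.
def solve(u, do_x=None):
    """Solve for (X, Y) given background u; do_x replaces X's equation."""
    u1, u2 = u
    x = u1 if do_x is None else do_x   # X = f_X(u1), or X = x in M_x
    y = 2 * x + u2                     # Y = f_Y(X, u2)
    return x, y

u = (1.0, 0.5)
_, y_factual = solve(u)                 # Y(u): X takes its natural value
_, y_counterfactual = solve(u, do_x=0)  # Y_{X=0}(u): solved in M_{X=0}
```

Note that both solutions use the same situation u; only the equation for X differs, which is exactly what makes counterfactuals "embarrassingly simple" in a structural model.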

THE TWO FUNDAMENTAL LAWS OF CAUSAL INFERENCE
1. The Law of Counterfactuals: M generates and evaluates all counterfactuals, Y_x(u) = Y_{M_x}(u).
2. The Law of Conditional Independence (d-separation): separation in the model ⇒ independence in the distribution.

THE LAW OF CONDITIONAL INDEPENDENCE
Graph: C (Climate) → S (Sprinkler), C → R (Rain), S → W (Wetness) ← R, with background variables U1, U2, U3, U4.
Each function summarizes millions of micro-processes.

THE LAW OF CONDITIONAL INDEPENDENCE
Each function summarizes millions of micro-processes. Still, if the U's are independent, the observed distribution P(C,R,S,W) must satisfy certain constraints that are (1) independent of the f's and of P(U), and (2) readable from the structure of the graph.

D-SEPARATION: NATURE’S LANGUAGE FOR COMMUNICATING ITS STRUCTURE
In the graph C (Climate) → S (Sprinkler), C → R (Rain), S → W (Wetness) ← R, every missing arrow advertises an independency, conditional on a separating set; e.g., the missing arrow between S and R advertises S ⊥ R | C.
Applications:
1. Model testing
2. Structure learning
3. Reducing “what if I do” questions to symbolic calculus
4. Reducing scientific questions to symbolic calculus
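The claim that these independencies hold regardless of the f's and of P(U) can be checked numerically. The sketch below fills the sprinkler graph with arbitrary, made-up conditional probability tables and verifies S ⊥ R | C by direct enumeration of the joint; changing the numbers does not break the assertion.

```python
from itertools import product

# Arbitrary (made-up) conditional probability tables for the graph
# C -> S, C -> R, S -> W <- R. The independence S _||_ R | C follows
# from the structure alone, whatever numbers we put here.
pC = [0.4, 0.6]                       # P(C = c)
pS = [[0.5, 0.5], [0.9, 0.1]]         # P(S = s | C = c), indexed [c][s]
pR = [[0.8, 0.2], [0.2, 0.8]]         # P(R = r | C = c), indexed [c][r]
pW1 = [[0.05, 0.9], [0.9, 0.99]]      # P(W = 1 | S = s, R = r)

# The joint factorizes according to the graph:
joint = {}
for c, s, r, w in product((0, 1), repeat=4):
    pw = pW1[s][r] if w == 1 else 1 - pW1[s][r]
    joint[c, s, r, w] = pC[c] * pS[c][s] * pR[c][r] * pw

def marginal(**fixed):
    """Sum the joint over all cells matching the fixed coordinates."""
    idx = {"c": 0, "s": 1, "r": 2, "w": 3}
    return sum(p for k, p in joint.items()
               if all(k[idx[name]] == val for name, val in fixed.items()))

# Check P(s, r | c) = P(s | c) P(r | c) for every value combination:
for c, s, r in product((0, 1), repeat=3):
    lhs = marginal(c=c, s=s, r=r) / marginal(c=c)
    rhs = (marginal(c=c, s=s) / marginal(c=c)) * \
          (marginal(c=c, r=r) / marginal(c=c))
    assert abs(lhs - rhs) < 1e-12
```

This is the sense in which a missing arrow "advertises" an independency: the constraint is a property of the factorization, not of the particular tables.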

SEEING VS. DOING
Effect of turning the sprinkler ON: setting the sprinkler (doing) differs from observing it on (seeing), because the intervention severs the influence of Climate on Sprinkler.

THE LOGIC OF CAUSAL ANALYSIS
Causal assumptions A and queries of interest Q enter a causal model M_A. Causal inference derives identified estimands Q(P), the testable implications T(M_A), and A*, the logical implications of A. Statistical inference then confronts these with data D, yielding estimates of Q(P), goodness-of-fit measures, and model tests, and ultimately provisional claims.

THE MACHINERY OF CAUSAL CALCULUS
Rule 1 (ignoring observations): P(y | do(x), z, w) = P(y | do(x), w), valid when (Y ⊥ Z | X, W) holds in the graph with all arrows into X deleted.
Rule 2 (action/observation exchange): P(y | do(x), do(z), w) = P(y | do(x), z, w), valid when (Y ⊥ Z | X, W) holds in the graph with arrows into X and arrows out of Z deleted.
Rule 3 (ignoring actions): P(y | do(x), do(z), w) = P(y | do(x), w), valid when (Y ⊥ Z | X, W) holds in the graph with arrows into X deleted and arrows into Z deleted, for those Z-nodes that are not ancestors of W in that graph.
Completeness Theorem (Shpitser & Pearl, 2006)

DERIVATION IN CAUSAL CALCULUS
Smoking → Tar → Cancer, with an unobserved Genotype confounding Smoking and Cancer. Applying the probability axioms together with Rules 2 and 3 reduces the interventional query to observables:
P(c | do(s)) = Σ_t P(t | s) Σ_s′ P(c | t, s′) P(s′).
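The end product of that derivation (the front-door formula) can be verified numerically: build the full model including the unobserved genotype, marginalize the genotype out to obtain the observational joint, and compare the front-door estimand against the ground-truth intervention. All parameters below are invented for illustration.

```python
from itertools import product

# Smoking (S) -> Tar (T) -> Cancer (C), with an unobserved Genotype (U)
# confounding S and C. All probabilities are invented for illustration.
pU1 = 0.3                               # P(U = 1)
pS1 = [0.2, 0.8]                        # P(S = 1 | U = u)
pT1 = [0.05, 0.8]                       # P(T = 1 | S = s)
pC1 = [[0.1, 0.4], [0.5, 0.9]]          # P(C = 1 | T = t, U = u)

def b(p, v):
    """Bernoulli lookup: P(V = v) given P(V = 1) = p."""
    return p if v == 1 else 1 - p

# Observational joint P(s, t, c), with the unobserved U summed out:
P = {(s, t, c): sum(b(pU1, u) * b(pS1[u], s) * b(pT1[s], t) * b(pC1[t][u], c)
                    for u in (0, 1))
     for s, t, c in product((0, 1), repeat=3)}

def front_door(c, s):
    """P(c | do(s)) = sum_t P(t|s) sum_{s'} P(c|t,s') P(s'), observables only."""
    pS_marg = {v: sum(P[v, t, cc] for t in (0, 1) for cc in (0, 1))
               for v in (0, 1)}
    result = 0.0
    for t in (0, 1):
        p_t_given_s = (P[s, t, 0] + P[s, t, 1]) / pS_marg[s]
        inner = sum(P[s2, t, c] / (P[s2, t, 0] + P[s2, t, 1]) * pS_marg[s2]
                    for s2 in (0, 1))
        result += p_t_given_s * inner
    return result

def ground_truth(c, s):
    """P(c | do(s)) computed from the full model, using the hidden U."""
    return sum(b(pU1, u) * b(pT1[s], t) * b(pC1[t][u], c)
               for u in (0, 1) for t in (0, 1))
```

Because the genotype never appears inside `front_door`, the causal effect is obtained from purely observational quantities, which is exactly what the calculus licenses here.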

EFFECT OF WARM-UP ON INJURY (After Shrier & Platt, 2008) No, no!

TRANSPORTABILITY OF KNOWLEDGE ACROSS DOMAINS (with E. Bareinboim)
1. A theory of causal transportability: when can causal relations learned from experiments be transferred to a different environment in which no experiment can be conducted?
2. A theory of statistical transportability: when can statistical information learned in one domain be transferred to a different domain in which
   a) only a subset of variables can be observed? or
   b) only a few samples are available?

EXTERNAL VALIDITY (how transportability is seen in other sciences)
Extrapolation across studies requires “some understanding of the reasons for the differences.” (Cox, 1958)
“‘External validity’ asks the question of generalizability: To what population, settings, treatment variables, and measurement variables can this effect be generalized?” (Shadish, Cook and Campbell, 2002)
“An experiment is said to have ‘external validity’ if the distribution of outcomes realized by a treatment group is the same as the distribution of outcomes that would be realized in an actual program.” (Manski, 2007)
“A threat to external validity is an explanation of how you might be wrong in making a generalization.” (Trochim, 2006)

MOVING FROM THE LAB TO THE REAL WORLD...
The lab model and two hypotheses about the real world, all over X, Y, Z, W. H1: everything is assumed to be the same, trivially transportable! H2: everything is assumed to be different, not transportable!

MOTIVATION: WHAT CAN EXPERIMENTS IN LA TELL US ABOUT NYC?
Experimental study in LA. Measured: X (intervention), Y (outcome), Z (age).
Observational study in NYC. Measured: X (observation), Y (outcome), Z (age).
Needed: the effect of X on Y in NYC.
Transport formula (calibration): P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z).
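The calibration step is a one-line re-weighting: take the age-specific effects measured in the LA experiment and average them under NYC's age distribution. All numbers below are invented for illustration.

```python
# The transport formula P*(y | do(x)) = sum_z P(y | do(x), z) P*(z):
# z-specific effects from the LA experiment, re-weighted by NYC's
# age distribution. All numbers are invented for illustration.
ages = ["young", "middle", "old"]

p_y_do_x_given_z = {"young": 0.30, "middle": 0.50, "old": 0.70}  # from LA
p_star_z = {"young": 0.2, "middle": 0.3, "old": 0.5}             # from NYC

# Calibrated effect in the target population:
p_star_y_do_x = sum(p_y_do_x_given_z[z] * p_star_z[z] for z in ages)  # ~0.56
```

Note that nothing experimental is needed from NYC; only the marginal P*(z), which an observational survey supplies.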

TRANSPORT FORMULAS DEPEND ON THE STORY
a) Z represents age; b) Z represents language skill. The selection node S marks the factors producing differences between the two populations; where S attaches in the diagram over X, Y, Z determines which transport formula (if any) is valid.

TRANSPORT FORMULAS DEPEND ON THE STORY
a) Z represents age; b) Z represents language skill; c) Z represents a bio-marker. Each story places the selection node S differently in the diagram over X, Y, Z, and each placement dictates a different transport formula.

GOAL: AN ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
INPUT: an annotated causal graph (over X, Y, Z, V, W, U, T) with selection nodes S marking the factors creating differences.
OUTPUT:
1. Transportable or not?
2. Measurements to be taken in the experimental study
3. Measurements to be taken in the target population
4. A transport formula

TRANSPORTABILITY REDUCED TO CALCULUS
Theorem: A causal relation R is transportable from ∏ to ∏* if and only if it is reducible, using the rules of do-calculus, to an expression in which S is separated from do(·).

RESULT: AN ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
INPUT: an annotated causal graph with selection nodes S marking the factors creating differences.
OUTPUT:
1. Transportable or not?
2. Measurements to be taken in the experimental study
3. Measurements to be taken in the target population
4. A transport formula
5. Completeness (Bareinboim, 2012)

WHICH MODEL LICENSES THE TRANSPORT OF THE CAUSAL EFFECT X → Y?
[Six diagrams (a) through (f) over X, Y and covariates Z, W, differing in where the selection node S (external factors creating disparities) enters; some license transport (Yes), others do not (No).]

STATISTICAL TRANSPORTABILITY (Transfer Learning)
Why should we transport statistical information? i.e., why not re-learn things from scratch?
1. Measurements are costly. Limit measurements to a subset V* of variables, called the “scope.”
2. Samples are scarce. Pooling samples from diverse populations will improve precision, if the differences can be filtered out.

STATISTICAL TRANSPORTABILITY
Definition (statistical transportability): A statistical relation R(P) is said to be transportable from ∏ to ∏* over V* if R(P*) is identified from P, P*(V*), and D, where P*(V*) is the marginal distribution of P* over the subset of variables V*.
Example: R = P*(y | x) is transportable over V* = {X, Z}, i.e., R is estimable without re-measuring Y.
Transfer learning: if few samples (N2) are available from ∏* and many samples (N1) from ∏, then estimating R = P*(y | x) by a decomposition that reuses the ∏ samples achieves much higher precision.
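A minimal sketch of that estimator, under the assumption (hypothetical here, since the slide's diagram is not fully recoverable) that the population difference S enters only through Z, so that R decomposes as P*(y | x) = Σ_z P(y | x, z) P*(z | x). The conditional P(y | x, z) comes from the large source sample and P*(z | x) from the small target sample; all numbers are invented.

```python
# Transfer-learning sketch, assuming the population difference enters
# only through Z, so that  R = P*(y | x) = sum_z P(y | x, z) P*(z | x).
# P(y | x, z) is estimated from the large source sample (N1 draws from Pi),
# P*(z | x) from the small target sample (N2 draws from Pi*). Invented numbers.
p_y1_given_xz = {0: 0.25, 1: 0.65}     # P(y = 1 | x = 1, z), from Pi
p_star_z_given_x = {0: 0.4, 1: 0.6}    # P*(z | x = 1), from Pi*

# Only the low-dimensional factor P*(z | x) needs the scarce target data:
R = sum(p_y1_given_xz[z] * p_star_z_given_x[z] for z in (0, 1))  # ~0.49
```

The precision gain comes from estimating the hard factor P(y | x, z) on N1 source samples rather than on the N2 target samples.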

META-ANALYSIS OR MULTI-SOURCE LEARNING
[Nine candidate source studies (a) through (i), each a diagram over X, Y, Z, W, some carrying selection nodes S, feeding a target population with query R = P*(y | do(x)).]

CAN WE GET A BIAS-FREE ESTIMATE OF THE TARGET QUANTITY?
Target quantity: R = P*(y | do(x)). Is R identifiable from studies (d) and (h)? R(∏*) is identifiable from studies (d) and (h), but not from studies (d) and (i).

FROM META-ANALYSIS TO META-SYNTHESIS
The problem: how to combine the results of several experimental and observational studies, each conducted on a different population and under a different set of conditions, so as to construct an aggregate measure of effect size that is “better” than any one study in isolation.

META-SYNTHESIS REDUCED TO CALCULUS
Theorem: Let {∏1, ∏2, …, ∏K} be a set of studies and {D1, D2, …, DK} their selection diagrams (relative to ∏*). A relation R(∏*) is “meta-estimable” if it can be decomposed into terms Qk such that each Qk is transportable from Dk.
Open problem: systematic decomposition.

BIAS VS. PRECISION IN META-SYNTHESIS
Principle 1: Calibrate estimands before pooling (to minimize bias).
Principle 2: Decompose into sub-relations before calibrating (to improve precision).
[Illustration: pooling and calibration across studies (a), (d), (g), (h), (i), each a diagram over X, Y, Z, W.]

BIAS VS. PRECISION IN META-SYNTHESIS
[Illustration continued: pooling, composition, then pooling again across the same studies.]

MISSING DATA: A SEEMINGLY STATISTICAL PROBLEM (Mohan & Pearl, 2012)
Pervasive in every experimental science.
Huge literature, powerful software industry, deeply entrenched culture.
Current practices are based on a statistical characterization (Rubin, 1976) of a problem that is inherently causal.
Consequence: like alchemy before Boyle and Dalton, the field craves (1) theoretical guidance and (2) performance guarantees.

ESTIMATE P(X,Y,Z)
[Table: ten samples listing the observed proxies X*, Y*, Z* (entries marked “m” are missing) together with the missingness indicators Rx, Ry, Rz.]

WHAT CAN CAUSAL THEORY DO FOR MISSING DATA?
Q-1. What should the world be like for a given statistical procedure to produce the expected result?
Q-2. Can we tell from the postulated world whether any method can produce a bias-free result? How?
Q-3. Can we tell from the data whether the world does not work as postulated?
None of these questions can be answered by a statistical characterization of the problem. All can be answered using causal models.

MISSING DATA: TWO PERSPECTIVES
“Causal inference is a missing data problem.” (Rubin, 2012)
“Missing data is a causal inference problem.” (Pearl, 2012)
Why is missingness a causal problem?
Which mechanism causes missingness makes a difference in whether and how we can recover information from the data.
Mechanisms require causal language to be described properly; statistics is not sufficient.
Different causal assumptions lead to different routines for recovering information from data, even when the assumptions are indistinguishable by any statistical means.

ESTIMATE P(X,Y,Z)
[Same table, now paired with a missingness graph: the observed proxy X* is determined by X and its missingness indicator Rx.]

NAIVE ESTIMATE OF P(X,Y,Z)
[Table restated with the complete cases highlighted.] The line-deletion (complete-case) estimate is generally biased; it is consistent under MCAR, where the indicators Rx, Ry, Rz are disconnected from X, Y, Z, as in the graph shown.
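The bias claim is easy to demonstrate by simulation: under MCAR, deletion recovers the truth; once the missingness depends on the masked value itself, deletion is biased. A sketch with invented numbers:

```python
import random

random.seed(1)

N = 100_000
p_x = 0.3                 # true P(X = 1), invented for illustration

# MCAR: R_x is disconnected from X, so complete cases are a random subsample.
mcar = [x for x in (int(random.random() < p_x) for _ in range(N))
        if random.random() >= 0.4]       # 40% masked, independently of x
est_mcar = sum(mcar) / len(mcar)         # consistent for 0.3

# Not MCAR: R_x depends on X itself (ones are masked far more often).
mnar = []
for _ in range(N):
    x = int(random.random() < p_x)
    if random.random() >= (0.7 if x else 0.1):   # masking probability depends on x
        mnar.append(x)
est_mnar = sum(mnar) / len(mnar)         # badly biased downward
```

The two datasets look alike to a procedure that only sees the complete cases; only the missingness mechanism, a causal notion, distinguishes them.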

SMART ESTIMATE OF P(X,Y,Z)
[Missingness graph over X, Y, Z and Rx, Ry, Rz that licenses the sequential estimate derived on the following slides, shown alongside the data table.]

SMART ESTIMATE OF P(X,Y,Z)
[Data table restated: samples with X*, Y*, Z*, entries “m” missing.]

SMART ESTIMATE OF P(X,Y,Z)
Step 1: Compute P(Y | Ry = 0) from every row in which Y is observed.

SMART ESTIMATE OF P(X,Y,Z)
Step 1: Compute P(Y | Ry = 0).
Step 2: Compute P(X | Y, Rx = 0, Ry = 0) from every row in which X and Y are observed.

SMART ESTIMATE OF P(X,Y,Z)
Step 1: Compute P(Y | Ry = 0).
Step 2: Compute P(X | Y, Rx = 0, Ry = 0).
Step 3: Compute P(Z | X, Y, Rx = 0, Ry = 0, Rz = 0) from every row in which all three are observed.
The product of the three factors estimates P(X,Y,Z).
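The three steps can be written down generically. The sketch below uses invented toy rows ("m" marks a missing cell, standing in for the table's data), estimates each factor from exactly the rows in which the variables it mentions are observed, and multiplies the factors. The validity of this decomposition is licensed by the missingness graph, not by the code.

```python
# Sequential ("smart") estimate:
#   P(x,y,z) = P(z | x,y, Rx=0,Ry=0,Rz=0) * P(x | y, Rx=0,Ry=0) * P(y | Ry=0).
# Each factor uses every row in which just its own variables are observed,
# not only the fully complete rows. Toy data; "m" = missing.
rows = [
    (0, 1, 0), (1, 0, 0), (1, 1, "m"), (0, 1, "m"), ("m", 1, "m"),
    ("m", 0, 1), ("m", "m", 0), (0, 1, "m"), (0, 0, "m"), (1, 0, "m"),
    (1, 1, 1), (0, 0, 1),
]

def cond_prob(target_idx, target_val, given):
    """Estimate P(target = target_val | given) from the rows in which the
    target and every conditioning column are observed (their R's are 0)."""
    pool = [r for r in rows
            if r[target_idx] != "m"
            and all(r[i] != "m" and r[i] == v for i, v in given.items())]
    return sum(r[target_idx] == target_val for r in pool) / len(pool)

def p_xyz(x, y, z):
    return (cond_prob(2, z, {0: x, 1: y})    # step 3: P(z | x, y), all R's = 0
            * cond_prob(0, x, {1: y})        # step 2: P(x | y), Rx = Ry = 0
            * cond_prob(1, y, {}))           # step 1: P(y), Ry = 0
```

Compared with line deletion, the first two factors use many more rows (e.g., rows where only Z is missing still inform P(x | y)), which is where the extra efficiency comes from.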

INESTIMABLE P(X,Y,Z)
[A different missingness graph over X, Y, Z and Rx, Ry, Rz, under which P(X,Y,Z) cannot be recovered.]

RECOVERABILITY FROM MISSING DATA
Definition: Given a missingness model M, a probabilistic quantity Q is said to be recoverable if there exists an algorithm that produces a consistent estimate of Q for every dataset generated by M. (That is, in the limit of large samples, Q is estimable as if no data were missing.)
Theorem: Q is recoverable iff it is decomposable into terms of the form Qj = P(Sj | Tj), such that:
1. For each variable V in Tj, RV is also in Tj.
2. For each variable V in Sj, RV is either in Tj or in Sj.

AN IMPOSSIBILITY THEOREM FOR MISSING DATA
Two statistically indistinguishable models (illustrated by Accident/Injury, Treatment, and Education stories for why X is missing), yet P(X,Y) is recoverable in (a) and not in (b).
No universal algorithm exists that decides recoverability (or guarantees unbiased results) without looking at the model.

NO ESTIMATION WITHOUT CAUSATION
Two statistically indistinguishable models; P(X) is recoverable in both, but through two different methods.
No universal algorithm exists that produces an unbiased estimate whenever such an estimate exists.

CONCLUSIONS
Counterfactuals, the building blocks of scientific thought, are encoded meaningfully and conveniently in structural models.
Identifiability and testability are computable tasks.
The counterfactual-graphical symbiosis has led to major advances in the empirical sciences, including policy evaluation, mediation analysis, generalizability, credit/blame determination, missing data, and heterogeneity.
THINK NATURE, NOT DATA

Thank you